[slurm-users] Re: Convergence of Kube and Slurm?
There is a kubeflow offering that might be of interest: https://www.dkube.io/post/mlops-on-hpc-slurm-with-kubeflow I have not tried it myself, no idea how well it works. Regards, --Dani_L. On 05/05/2024 0:05, Dan Healy via slurm-users wrote: Bright Cluster Manager has some verbiage on their marketing site that they can manage a cluster running both Kubernetes and Slurm. Maybe I misunderstood it. But nevertheless, I am encountering groups more frequently that want to run a stack of containers that need private container networking. What’s the current state of using the same HPC cluster for both Slurm and Kube? Note: I’m aware that I can run Kube on a single node, but we need more resources. So ultimately we need a way to have Slurm and Kube exist in the same cluster, both sharing the full amount of resources and both being fully aware of resource usage. Thanks, Daniel Healy -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
Re: [slurm-users] Usage of particular GPU out of 4 GPUs while submitting jobs to DGX Server
Hi Ravi, On 20/11/2023 6:36, Ravi Konila wrote: Hello Everyone My question is related to submission of jobs to those GPUs. How does a student submit a job to a particular GPU out of the 4 GPUs? For example, studentA should submit the job to GPU ID 1 instead of GPU ID 0. In classical HPC this is counterproductive - you don't want to assign specific resources to jobs, as this would lead to jobs waiting needlessly while resources are available, so I think some background for this request might help understand the need and possible solutions. That said, it might be possible by assigning different artificial types to each gpu, e.g. in gres.conf Name=gpu type=gpu0 file=/dev/nvidia0 etc... Then submission would be of the form sbatch --gpus=gpu0 The issue would be with submitting in the general case, where you want any gpu. For that you might have to fall back to using gres as in sbatch --gres=gpu:3 This is obviously cumbersome and less convenient, and I'm not sure this is not an XY problem. Also we are planning for MIG in the server and we would like a few students to submit their jobs to the 20G partition and non-critical jobs to the 5G partition. How should the slurm.conf and gres.conf look in this case? Can you elaborate on the use case? It's unclear to me if the students are expected to decide on their own when to submit to 20G and when to 5G, if students with access to 20G should also use the 5G together with the rest of the students, or if all students should have access to both partitions and some other criteria should be used to determine placement. Currently our configuration is as below: gres.conf Name=gpu type=A100 file=/dev/nvidia[0-2,4] slurm.conf . . . GresTypes=gpu NodeName=rl-dgxs-r21-l2 Gres=gpu:A100:4 CPUs=128 RealMemory=50 State=UNKNOWN PartitionName=LocalGPUQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP - Any suggestions or help in this regard are highly appreciated. With Warm Regards Ravi Konila Best regards, --Dani_L.
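A minimal, untested sketch of the per-GPU "artificial type" idea above, using the device paths from Ravi's current gres.conf; the type names gpu0..gpu3 and the job script name are made up for illustration:

# gres.conf
Name=gpu Type=gpu0 File=/dev/nvidia0
Name=gpu Type=gpu1 File=/dev/nvidia1
Name=gpu Type=gpu2 File=/dev/nvidia2
Name=gpu Type=gpu3 File=/dev/nvidia4

# slurm.conf node line (keep the other attributes from the existing definition)
NodeName=rl-dgxs-r21-l2 Gres=gpu:gpu0:1,gpu:gpu1:1,gpu:gpu2:1,gpu:gpu3:1

# pin a job to a specific device, or take any one device
sbatch --gpus=gpu1:1 job.sh
sbatch --gres=gpu:1 job.sh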
Re: [slurm-users] stopping job array after N failed jobs in row
Not sure about automatically canceling a job array, except perhaps by submitting 2 consecutive arrays - the first of size 20, and the other with the rest of the elements and a dependency of afterok. That said, a single job in a job array is referred to as a task in the Slurm documentation. I personally prefer element, as in array element. Consider creating a batch job with: arrayid=$(sbatch --parsable --array=0-19 array-job.sh) sbatch --dependency=afterok:$arrayid --array=20-5 array-job.sh I'm not near a cluster right now, so can't test for correctness. The main drawback is of course that if the first 20 jobs take a long time to complete, and there are enough resources to run more than 20 jobs in parallel, all those resources will be wasted for the duration. Not a big issue in busy clusters, as some other job will run in the meantime, but this will impact completion time of the array if the first 20 jobs use significantly less than the resources available. It might be possible to depend on afternotok of the first 20 tasks, to run --wrap="scancel $arrayid" Maybe something like: sbatch --array=1-5 array-job.sh with cat array-job.sh #!/bin/bash srun myjob.sh $SLURM_ARRAY_TASK_ID & [[ $SLURM_ARRAY_TASK_ID -gt 20 ]] && srun -d afternotok:${SLURM_ARRAY_JOB_ID}_1,afternotok:${SLURM_ARRAY_JOB_ID}_2,...afternotok:${SLURM_ARRAY_JOB_ID}_20 scancel $SLURM_ARRAY_JOB_ID will also work. Untested, use at your own risk. The other OTHER approach might be to use some epilog (or possibly epilogslurmctld) to log exit codes for the first 20 tasks in each array, and cancel the array if any is non-zero. This is a global approach which will affect all job arrays, so might not be appropriate for your use case. On 01/08/2023 16:48:47, Josef Dvoracek wrote: my users found the beauty of job arrays, and they tend to use it every now and then. Sometimes the human factor steps in, something is wrong in the job array specification, and the cluster "works" on one failed array job after another. Isn't there any way to automatically stop/scancel/? a job array after, let's say, 20 failed array jobs in a row? So far my experience is, if the first ~20 array jobs go right, there is no catastrophic failure in the sbatch file. If they fail, it's usually bad and there is no sense in crunching the remaining thousands of job array jobs. OT: what is the correct terminology for one item in a job array... sub-job? job-array-job? :) cheers josef -- Regards, --Dani_L.
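An untested sketch of the two-array idea above, assuming an array of 500 elements and the same array-job.sh (the exact range and dependency behaviour on whole arrays should be verified on your Slurm version); the afternotok clean-up job only removes the pending remainder if the first chunk fails:

first=$(sbatch --parsable --array=0-19 array-job.sh)
rest=$(sbatch --parsable --dependency=afterok:$first --array=20-499 array-job.sh)
# if any of the first 20 elements fails, the remainder never becomes eligible; cancel it explicitly
sbatch --dependency=afternotok:$first --wrap="scancel $rest"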
Re: [slurm-users] Slurmdbd High Availability
My go-to solution is setting up a Galera cluster using 2 slurmdbd servers (each pointing to its local db) and a 3rd quorum server. It's fairly easy to set up and doesn't rely on block-level duplication, HA semantics or shared storage. Just my 2 cents On 14/04/2023 14:18, Tina Friedrich wrote: Or run your database server on something like VMWare ESXi (which is what we do). Instant HA and I don't even need multiple servers for it :) I don't mean to be flippant, and I realise it's not addressing the mysql HA question (but that got answered). However, a lot of us will have some sort of failure-and-load-balancing VM estate anyway, or not? Using that does - at least in my mind - solve the same problem (just via a slightly different route). Other than that I'd agree that HA solutions - of the pacemaker & mirrored block devices sort - tend to make things less reliable instead of more. Tina On 13/04/2023 16:03, Brian Andrus wrote: I think you mean both slurmctld servers are pointing at the one slurmdbd server. Ole is right about the usefulness of HA, especially on slurmdbd, as slurm will cache the writes to the database if it is down. To do what you want, you need to look at configuring your database to be HA. That is a different topic and would be dictated by what database setup you are using. Understand that the backend database is a tool used by slurm and not part of slurm. So any HA in that area needs to be done by the database. Once that is done, merely have 2 separate slurmdbd servers, each pointing at the HA database. One would be primary and the other a failover (AccountingStorageBackupHost). Although, technically, they would both be able to be active at the same time. Brian Andrus On 4/13/2023 2:49 AM, Shaghuf Rahman wrote: Hi, I am setting up Slurmdb in my system and I need some inputs. My current setup is like server1 : 192.168.123.12(slurmctld) server2: 192.168.123.13(Slurmctld) server3: 192.168.123.14(Slurmdbd) which is pointing to both Server1 and Server2. database: MySQL I have 1 more server, server 4: 192.168.123.15, which I need to make a secondary database server. I want to configure this server4 so it will sync the database and make it either Active-Active slurmdbd or Active-Passive. Could anyone please help me with the *steps* of how to configure this, and also how am I going to *sync* my *database* on both the servers simultaneously. Thanks & Regards, Shaghuf Rahman
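A rough sketch of the two-slurmdbd layout Brian describes, with hostnames (dbd1/dbd2) assumed; the database HA itself (Galera or otherwise) is set up outside Slurm:

# slurm.conf on the controllers
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd1
AccountingStorageBackupHost=dbd2

# slurmdbd.conf on each slurmdbd host (DbdHost=dbd2 on the second one)
DbdHost=dbd1
StorageType=accounting_storage/mysql
StorageHost=localhost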
Re: [slurm-users] Jobs can grow in RAM usage surpassing MaxMemPerNode
MaxMemPerNode=532000 MaxTime=3-12:00:00 State=UP Nodes=nodeGPU01 Default=YES PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384 MaxMemPerNode=42 MaxTime=3-12:00:00 State=UP Nodes=nodeGPU01 -- Cristóbal A. Navarro -- Regards, Daniel Letai +972 (0)505 870 456
Re: [slurm-users] unused job data fields?
If you're looking for a free-text field, I would posit that the "comment" field, supplied by the '--comment' flag of srun/sbatch and viewed via the comment field of sacct, is what you're looking for. On 03/10/2022 12:25:37, z1...@arcor.de wrote: Hello, are there additional job data fields in slurm besides the job name which can be used for additional information? The information should not be used by slurm, only included in the database for external evaluation. Thanks Mike -- Regards, Daniel Letai +972 (0)505 870 456
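For example (job script, comment text and the <jobid> placeholder are assumptions, and the job comment must be stored in accounting for sacct to show it):

sbatch --comment="project=foo run=42" job.sh
sacct -j <jobid> --format=JobID,JobName,Comment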
Re: [slurm-users] srun using infiniband
Hello Anne, On 01/09/2022 02:01:53, Anne Hammond wrote: We have a CentOS 8.5 cluster slurm 20.11 Mellanox ConnectX 6 HDR IB and Mellanox 32 port switch Our application is not scaling. I discovered the process communications are going over ethernet, not ib. I used the ifconfig count for the eno2 (ethernet) and ib0 (infiniband) interfaces at end of a job, and subtracted the count at the beginning. We are using sbatch and srun {application} If I interactively login to a node and use the command mpiexec -iface ib0 -n 32 -machinefile machinefile {application} Is your application using IPoIB or RDMA? where machinefile contains 32 lines with the ib hostname: ne08-ib ne08-ib ... ne09-ib ne09-ib the application runs over ib and scales. /etc/slurm/slurm.conf uses the ethernet interface for administrative communications and allocation: NodeName=ne[01-09] CPUs=32 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN PartitionName=neon-noSMT Nodes=ne[01-09] Default=NO MaxTime=3-00:00:00 DefaultTime=4:00:00 State=UP OverSubscribe=YES I've read this is the recommended configuration. I looked for srun parameters that would instruct srun to run over the ib interface when the job is run through the slurm queue. I found the --network parameter: srun --network=DEVNAME=mlx5_ib,DEVTYPE=IB What is the output of srun --mpi=list ? but there is not much documentation on this and I haven't been able to run a job yet. Is this the way we should be directing srun to run the executable over infiniband? Thanks in advance, Anne Hammond -- Regards, --Dani_L.
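A quick way to repeat the byte-count comparison Anne describes, assuming the interface names eno2 and ib0; run it on a compute node before and after the job and compare the deltas:

for dev in eno2 ib0; do
    echo "$dev: rx=$(cat /sys/class/net/$dev/statistics/rx_bytes) tx=$(cat /sys/class/net/$dev/statistics/tx_bytes)"
done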
Re: [slurm-users] do oversubscription with algorithm other than least-loaded?
I could be missing something here, but if you refer to the SelectTypeParameters=cr_lln you could just try cr_pack_nodes. https://slurm.schedmd.com/slurm.conf.html#OPT_CR_Pack_Nodes If you want it on a per-partition configuration, I'm not sure that's possible, you might need to set a distribution (-m) in your job submit script/wrapper (E.g., -m block:*:*,pack) https://slurm.schedmd.com/sbatch.html#OPT_distribution If you're referring to something else entirely, could you elaborate on the least-loaded configuration in your setup? On 24/02/2022 23:35:30, Herc Silverstein wrote: Hi, We would like to do over-subscription on a cluster that's running in the cloud. The cluster dynamically spins up and down cpu nodes as needed. What we see is that the least-loaded algorithm causes the maximum number of nodes specified in the partition to be spun up and each loaded with N jobs for the N cpu's in a node before it "doubles back" and starts over-subscribing. What we actually want is for the minimum number of nodes to be used and for it to fully load (to the limit of the oversubscription setting) one node before starting up another. That is, we really want a "most-loaded" algorithm. This would allow us to reduce the number of nodes we need to run and reduce costs. Is there a way to get this behavior somehow? Herc -- Regards, Daniel Letai +972 (0)505 870 456
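For the whole-cluster case the change is just an extra flag on the existing line in slurm.conf; the select plugin and consumable-resource flag shown here are assumptions, keep whatever you already have:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory,CR_Pack_Nodes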
Re: [slurm-users] How to determine (on the ControlMachine) which cores/gpus are assigned to a job?
I don't have access to a cluster right now so can't test this, but possibly tres_alloc squeue -O JobID,Partition,Name,tres_alloc,NodeList -j might give some more info. On 04/02/2021 17:01, Thomas Zeiser wrote: Dear All, we are running Slurm-20.02.6 and using "SelectType=select/cons_tres" with "SelectTypeParameters=CR_Core_Memory", "TaskPlugin=task/cgroup", and "ProctrackType=proctrack/cgroup". Nodes can be shared between multiple jobs with the partition defaults "ExclusiveUser=no OverSubscribe=No" For monitoring purpose, we'd like to know on the ControlMachine which cores of a batch node are assigned to a specific job. Is there any way (except looking on each batch node itself into /sys/fs/cgroup/cpuset/slurm_*) to get the assigned core ranges or GPU IDs? E.g. from Torque we are used that qstat tells the assigned cores. However, with Slurm, even "scontrol show job JOBID" does not seem to have any information in that direction. Knowing which GPU is allocated (in case of gres/gpu) of course also would be interested to know on the ControlMachine. Here's the output we get from scontrol show job; it has the node name and the number of cores assigned but not the "core IDs" (e.g. 32-63) JobId=886 JobName=br-14 UserId=hpc114(1356) GroupId=hpc1(1355) MCS_label=N/A Priority=1010 Nice=0 Account=hpc1 QOS=normal WCKey=* JobState=RUNNING Reason=None Dependency=(null) Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 RunTime=00:40:09 TimeLimit=1-00:00:00 TimeMin=N/A SubmitTime=2021-02-04T07:26:51 EligibleTime=2021-02-04T07:26:51 AccrueTime=2021-02-04T07:26:51 StartTime=2021-02-04T07:26:54 EndTime=2021-02-05T07:26:54 Deadline=N/A PreemptEligibleTime=2021-02-04T07:26:54 PreemptTime=None SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-02-04T07:26:54 Partition=a100 AllocNode:Sid=gpu001:1743663 ReqNodeList=(null) ExcNodeList=(null) NodeList=gpu001 BatchHost=gpu001 NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=32,mem=12M,node=1,billing=32,gres/gpu=1,gres/gpu:a100=1 Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* MinCPUsNode=1 MinMemoryCPU=3750M MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=/var/tmp/slurmd_spool/job00877/slurm_script WorkDir=/home/hpc114/run2 StdErr=/home/hpc114//run2/br-14.o886 StdIn=/dev/null StdOut=/home/hpc114/run2/br-14.o886 Power= TresPerNode=gpu:a100:1 MailUser=(null) MailType=NONE Also "scontrol show node" is not helpful NodeName=gpu001 Arch=x86_64 CoresPerSocket=64 CPUAlloc=128 CPUTot=128 CPULoad=4.09 AvailableFeatures=hwperf ActiveFeatures=hwperf Gres=gpu:a100:4(S:0-1) NodeAddr=gpu001 NodeHostName=gpu001 Port=6816 Version=20.02.6 OS=Linux 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021 RealMemory=51 AllocMem=48 FreeMem=495922 Sockets=2 Boards=1 State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=80 Owner=N/A MCS_label=N/A Partitions=a100 BootTime=2021-01-27T16:03:48 SlurmdStartTime=2021-02-03T13:43:05 CfgTRES=cpu=128,mem=51M,billing=128,gres/gpu=4,gres/gpu:a100=4 AllocTRES=cpu=128,mem=48M,gres/gpu=4,gres/gpu:a100=4 CapWatts=n/a CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s There is no information on the currently running four jobs included; neither which share of the allocated node is assigned to the individual jobs. 
I'd like to see somehow that job 886 got cores 32-63,160-191 assigned as seen on the node from /sys/fs/cgroup %cat /sys/fs/cgroup/cpuset/slurm_gpu001/uid_1356/job_886/cpuset.cpus 32-63,160-191 Thanks for any ideas! Thomas Zeiser
Re: [slurm-users] Use gres to handle permissions of /dev/dri/card* and /dev/dri/renderD*?
Just a quick addendum - rsmi_dev_drm_render_minor_get used in the plugin references the ROCM-SMI lib from https://github.com/RadeonOpenCompute/rocm_smi_lib/blob/2e8dc4f2a91bfa7661f4ea289736b12153ce23c2/src/rocm_smi.cc#L1689 So the library (as an .so file) should be installed for this to work. On 20/10/2020 23:58, Mgr. Martin Pecka wrote: Pinging this topic again. Nobody has an idea how to define multiple files to be treated as a single gres? Thank you for help, Martin Pecka Dne 4.9.2020 v 21:29 Martin Pecka napsal(a): Hello, we want to use EGL backend for accessing OpenGL without the need for Xorg. This approach requires access to devices /dev/dri/card* and /dev/dri/renderD* . Is there a way to give access to these devices along with /dev/nvidia* which we use for CUDA? Ideally as a single generic resource that would give permissions to all three files at once. Thank you for any tips.
Re: [slurm-users] Use gres to handle permissions of /dev/dri/card* and /dev/dri/renderD*?
Take a look at https://github.com/SchedMD/slurm/search?q=dri%2F If the ROCM-SMI API is present, using AutoDetect=rsmi in gres.conf might be enough, if I'm reading this right. Of course, this assumes the cards in question are AMD and not NVIDIA. On 20/10/2020 23:58, Mgr. Martin Pecka wrote: Pinging this topic again. Nobody has an idea how to define multiple files to be treated as a single gres? Thank you for help, Martin Pecka Dne 4.9.2020 v 21:29 Martin Pecka napsal(a): Hello, we want to use EGL backend for accessing OpenGL without the need for Xorg. This approach requires access to devices /dev/dri/card* and /dev/dri/renderD* . Is there a way to give access to these devices along with /dev/nvidia* which we use for CUDA? Ideally as a single generic resource that would give permissions to all three files at once. Thank you for any tips.
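If that reading is right, the gres.conf side is as small as this (a sketch assuming AMD GPUs with the ROCm SMI shared library installed; the node name and GPU count below are placeholders), and the /dev/dri files should then be picked up automatically:

# gres.conf
AutoDetect=rsmi

# slurm.conf still needs the gres declared
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:2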
Re: [slurm-users] how to restrict jobs
On 06/05/2020 20:44, Mark Hahn wrote: Is there no way to set or define a custom variable like at node level and you could use a per-node Feature for this, but a partition would also work. A bit of an ugly hack, but you could use QoS (requires accounting) to enforce this: 1. Create a qos (using sacctmgr) with GrpTRES=Node=4 2. Create a new partiton identical to the current one, but with the new qos 3. instruct users to submit to the new partition any job requiring the license. This will not solve the issue of fragmentation due to non-licensed jobs - for that you should enable a packing scheduler like SelectTypeParameters=CR_Pack_Nodes (https://slurm.schedmd.com/slurm.conf.html#OPT_CR_Pack_Nodes).
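A hedged sketch of steps 1-2 above; the qos and partition names, node list and time limit are placeholders:

sacctmgr add qos licensed
sacctmgr modify qos licensed set GrpTRES=node=4

# slurm.conf: clone the existing partition and attach the new qos
PartitionName=licensed Nodes=node[01-16] QOS=licensed MaxTime=3-00:00:00 State=UP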
Re: [slurm-users] not allocating jobs even resources are free
On 29/04/2020 12:00:13, navin srivastava wrote: Thanks Daniel. All jobs went into run state so unable to provide the details but definitely will reach out later if we see a similar issue. I am more interested to understand FIFO with Fair Tree. It would be good if anybody could provide some insight on this combination, and also how the behaviour will change if we enable backfilling here. What is the role of the Fair tree here? Fair tree is the algorithm used to calculate the interim priority, before applying weight, but I think after the halflife decay. To make it simple - fifo without fairshare would assign priority based only on submission time. With fairshare, that naive priority is adjusted based on prior usage by the applicable entities (users/departments - accounts). Backfill will let you utilize your resources better, since it will allow "inserting" low priority jobs before higher priority jobs, provided all jobs have defined wall times, and any inserted job doesn't affect in any way the start time of a higher priority job, thus allowing utilization of "holes" when the scheduler waits for resources to free up, in order to insert some large job. Suppose the system is at 60% utilization of cores, and the next fifo job requires 42% - it will wait until another 2% are free so it can begin, meanwhile not allowing any job to start, even if it would take only 30% of the resources (which are currently free) and would finish before the 2% are free anyway. Backfill would allow such a job to start, as long as its wall time ensures it would finish before the 42% job would've started. Fairtree in either case (fifo or backfill) calculates the priority for each job the same - if the account had used more resources recently (the halflife decay factor) it would get a lower priority even though it was submitted earlier than a job from an account that didn't use any resources recently. As can be expected, backfill has to loop over all jobs in the queue, in order to see if any job can fit out of order. In very busy/active systems, that can lead to poor response times, unless tuned correctly in slurm.conf - look at SchedulerParameters, all params starting with bf_ and in particular bf_max_job_test= ,bf_max_time= and bf_continue (but bf_window= can also have some impact if set too high). see the man page at https://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters PriorityType=priority/multifactor PriorityDecayHalfLife=2 PriorityUsageResetPeriod=DAILY PriorityWeightFairshare=50 PriorityFlags=FAIR_TREE Regards Navin. On Mon, Apr 27, 2020 at 9:37 PM Daniel Letai <d...@letai.org.il> wrote: Are you sure there are enough resources available? The node is in mixed state, so it's configured for both partitions - it's possible that earlier lower priority jobs are already running, thus blocking the later jobs, especially since it's fifo. It would really help if you pasted the results of: squeue sinfo As well as the exact sbatch line, so we can see how many resources per node are requested. On 26/04/2020 12:00:06, navin srivastava wrote: Thanks Brian, As suggested I went through the document, and what I understood is that the fair tree leads to the Fairshare mechanism and the jobs should be scheduled based on that. So it means job scheduling will be based on FIFO but priority will be decided by the Fairshare. I am not sure if the two conflict here. As I see it, the normal jobs' priority is lower than the GPUsmall priority, so if resources are available in the gpusmall partition then the job should go. There is no job pending due to gpu resources.
The gpu resources themselves are not requested with the job. Is there any article where I can see how the fairshare works an
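For reference, a sketch of the backfill-related knobs mentioned above as they would appear in slurm.conf; the values are only examples and need tuning per site:

SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_max_job_test=500,bf_max_time=300,bf_window=2880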
Re: [slurm-users] not allocating jobs even resources are free
tiveFeatures=K2200 Gres=gpu:2 NodeAddr=node18 NodeHostName=node18 Version=17.11 OS=Linux 4.4.140-94.42-default #1 SMP Tue Jul 17 07:44:50 UTC 2018 (0b375e4) RealMemory=1 AllocMem=0 FreeMem=79532 Sockets=2 Boards=1 State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A Partitions=GPUsmall,pm_shared BootTime=2019-12-10T14:16:37 SlurmdStartTime=2019-12-10T14:24:08 CfgTRES=cpu=36,mem=1M,billing=36 AllocTRES=cpu=6 CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s node19:- NodeName=node19 Arch=x86_64 CoresPerSocket=18 CPUAlloc=16 CPUErr=0 CPUTot=36 CPULoad=15.43 AvailableFeatures=K2200 ActiveFeatures=K2200 Gres=gpu:2 NodeAddr=node19 NodeHostName=node19 Version=17.11 OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04 UTC 2018 (3090901) RealMemory=1 AllocMem=0 FreeMem=63998 Sockets=2 Boards=1 State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A Partitions=GPUsmall,pm_shared BootTime=2020-03-12T06:51:54 SlurmdStartTime=2020-03-12T06:53:14 CfgTRES=cpu=36,mem=1M,billing=36 AllocTRES=cpu=16 CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s could you please help me to understand what could be the reason? -- Regards, Daniel Letai +972 (0)505 870 456
[slurm-users] Assigning gpu freq values manually
Is it possible to assign gpu freq values without the use of a specialized plugin? Currently gpu freqs can be assigned by use of AutoDetect=nvml or AutoDetect=rsmi in gres.conf, but I can't find any reference to assigning freq values manually via direct input in gres.conf. Is it possible to populate gpu freqs in gres.conf, or must I use autodetect if I want such functionality? Thanks in advance, --Dani_L.
Re: [slurm-users] Alternative to munge for use with slurm?
in v20.02 you can use jwt, as per https://slurm.schedmd.com/jwt.html Only issue is getting libjwt for most rpm based distros. The current libjwt configure;make dist-all doesn't work. I had to cd into dist, and 'make rpm' to create the spec file, then rpmbuild -ba after placing the tar gz file in the SOURCES dir of rpmbuild tree. Possibly just installing libjwt manually is easier for image based clusters. HTH. On 17/04/2020 22:42, Dean Schulze wrote: Is there an alternative to munge when running slurm? Munge issues are a common problem in slurm, and munge doesn't give any useful information when a problem occurs. An alternative that at least gave some useful information when a problem occurs would be a big improvement. Thanks.
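Roughly, the work-around described above looks like this (paths and <version> are placeholders, untested as written):

tar xf libjwt-<version>.tar.gz && cd libjwt-<version>/dist
make rpm                                  # generates the libjwt spec file
cp ../../libjwt-<version>.tar.gz ~/rpmbuild/SOURCES/
rpmbuild -ba libjwt.spec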
Re: [slurm-users] Need to execute a binary with arguments on a node
Use sbatch's wrapper command: sbatch --wrap='ls -l /tmp' Note that the output will be in the directory on the execution node, by default with the name slurm-<jobid>.out On 12/18/19 8:40 PM, William Brown wrote: Sometimes the way is to make the shell the binary, e.g. bash -c 'ls -lsh' On Wed, 18 Dec 2019, 18:25 Dean Schulze, wrote: This is a rookie question. I can use the srun command to execute a simple command like "ls" or "hostname" on a node. But I haven't found a way to add arguments like "ls -lart". What I need to do is execute a binary that takes arguments (like "a.out arg1 arg2 arg3") that exists on the node. Is srun the right way to do this or do I need a script or something else? Thanks.
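Applied to the original example, the wrap approach looks like this (the binary path is assumed):

sbatch --wrap='/path/on/node/a.out arg1 arg2 arg3'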
Re: [slurm-users] Limiting the number of CPU
3 possible issue, inline below On 14/11/2019 14:58:29, Sukman wrote: Hi Brian, thank you for the suggestion. It appears that my node is in drain state. I rebooted the node and everything became fine. However, the QOS still cannot be applied properly. Do you have any opinion regarding this issue? $ sacctmgr show qos where Name=normal_compute format=Name,Priority,MaxWal,MaxTRESPU Name Priority MaxWall MaxTRESPU -- -- --- - normal_co+ 1000:01:00 cpu=2,mem=1G when I run the following script: #!/bin/bash #SBATCH --job-name=hostname #sbatch --time=00:50 #sbatch --mem=1M I believe those should be uppercase #SBATCH #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --ntasks-per-node=1 #SBATCH --cpus-per-task=1 #SBATCH --nodelist=cn110 srun hostname It turns out that the QOSMaxMemoryPerUser has been met $ squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 88 defq hostname sukman PD 0:00 1 (QOSMaxMemoryPerUser) $ scontrol show job 88 JobId=88 JobName=hostname UserId=sukman(1000) GroupId=nobody(1000) MCS_label=N/A Priority=4294901753 Nice=0 Account=user QOS=normal_compute JobState=PENDING Reason=QOSMaxMemoryPerUser Dependency=(null) Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A SubmitTime=2019-11-14T19:49:37 EligibleTime=2019-11-14T19:49:37 StartTime=Unknown EndTime=Unknown Deadline=N/A PreemptTime=None SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-11-14T19:55:50 Partition=defq AllocNode:Sid=itbhn02:51072 ReqNodeList=cn110 ExcNodeList=(null) NodeList=(null) NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=1,node=1 Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=* MinCPUsNode=1 MinMemoryNode=257758M MinTmpDiskNode=0 MinMemoryNode seems to require more than FreeMem in Node below Features=(null) DelayBoot=00:00:00 Gres=(null) Reservation=(null) OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=/home/sukman/script/test_hostname.sh WorkDir=/home/sukman/script StdErr=/home/sukman/script/slurm-88.out StdIn=/dev/null StdOut=/home/sukman/script/slurm-88.out Power= $ scontrol show node cn110 NodeName=cn110 Arch=x86_64 CoresPerSocket=1 CPUAlloc=0 CPUErr=0 CPUTot=56 CPULoad=0.01 AvailableFeatures=(null) ActiveFeatures=(null) Gres=(null) NodeAddr=cn110 NodeHostName=cn110 Version=17.11 OS=Linux 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 RealMemory=257758 AllocMem=0 FreeMem=255742 Sockets=56 Boards=1 This would appear to be wrong - 56 sockets? How did you configure the node in slurm.conf? FreeMem lower than MinMemoryNode - not sure if that is relevant. State=IDLE ThreadsPerCore=1 TmpDisk=268629 Weight=1 Owner=N/A MCS_label=N/A Partitions=defq BootTime=2019-11-14T18:50:56 SlurmdStartTime=2019-11-14T18:53:23 CfgTRES=cpu=56,mem=257758M,billing=56 AllocTRES= CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s --- Sukman ITB Indonesia - Original Message - From: "Brian Andrus" To: slurm-users@lists.schedmd.com Sent: Tuesday, November 12, 2019 10:41:42 AM Subject: Re: [slurm-users] Limiting the number of CPU You are trying to specifically run on node cn110, so you may want to check that out with sinfo A quick "sinfo -R" can list any down machines and the reasons. Brian Andrus -- Regards, Daniel Letai +972 (0)505 870 456
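For clarity, the same script with the two lowercase directives capitalized - lines starting with lowercase #sbatch are treated as plain comments and silently ignored:

#!/bin/bash
#SBATCH --job-name=hostname
#SBATCH --time=00:50
#SBATCH --mem=1M
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --nodelist=cn110
srun hostname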
Re: [slurm-users] RPM build error - accounting_storage_mysql.so
On 11/12/19 9:34 AM, Ole Holm Nielsen wrote: On 11/11/19 10:14 PM, Daniel Letai wrote: Why would you need galera-4 as a build require? This is the MariaDB recommendation in https://mariadb.com/kb/en/library/yum/, see the section "Installing MariaDB Packages with YUM". I have no clue why this would be needed. Yes, it's required for mariadb multimaster cluster. This has nothing to do with the mariadb api required for linking against mariadb libs. You don't even need the mariadb-server pkg for build purposes - it's only required for deployment of slurmdbd. On a build machine, you should only require the client section: https://mariadb.com/kb/en/library/yum/#installing-mariadb-clients-and-client-libraries-with-yum, as well as the devel pkg. /Ole If it's required by any of the mariadb packages, it'll get pulled automatically. If not, you don't need it on the build system. On 11/11/19 10:56 PM, Ole Holm Nielsen wrote: Hi William, Interesting experiences with MariaDB 10.4! I tried to collect the instructions from the MariaDB page, but I'm unsure about how to get the galera-4 RPM. Could you kindly review and correct my updated instructions? https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms That said, what are the main reasons for installing MariaDB 10 in stead of the 5.5 delivered by RedHat? I'm not sure how well SchedMD has tested MariaDB 10 with Slurm? /Ole On 11-11-2019 21:23, William Brown wrote: I have in fact found the answer by looking harder. The config.log clearly showed that the build of the test MySQL program failed, which is why it was set to be excluded. It failed to link against '-lmariadb'. It turns out that library is no longer in MariaDB or MariaDB-devel, it is separately packaged in MariaDB-shared. That may of course be because I have built MariaDB 10.4 from the mariadb.org site, because CentOS 7 only ships with the extremely old version 5.5. Once I installed the missing package it built the RPMs just fine. However it would be easier to use it linked to static MariaDB libraries, as I now have to installed MariaDB-shared on every server that will run slurmd, i.e. all compute nodes. I expect that if I looked harder at the build options there may be a way to do this, perhaps with linker flags. For now, I can progress. Thanks William -Original Message- From: slurm-users On Behalf Of Ole Holm Nielsen Sent: 11 November 2019 20:02 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] RPM build error - accounting_storage_mysql.so Hi, Maybe my Slurm Wiki can help you build SLurm on CentOS/RHEL 7? See https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms Note in particular: Important: Install the MariaDB (a replacement for MySQL) packages before you build Slurm RPMs (otherwise some libraries will be missing): yum install mariadb-server mariadb-devel /Ole On 11-11-2019 15:22, William Brown wrote: Fabio Did you ever resolve the problem building accounting_storage_mysql.so? I have the exact same problem with CentOS 7.6, building Slurm 19.05.03. My command: rpmbuild -ta slurm-19.05.3-2.tar.bz2 | tee /var/tmp/slurm-build.log The directory with the plugin source is all there: /home/users/slurm/rpmbuild/BUILD/slurm-19.05.3-2/src/plugins
Re: [slurm-users] RPM build error - accounting_storage_mysql.so
Why would you need galera-4 as a build require? If it's required by any of the mariadb packages, it'll get pulled automatically. If not, you don't need it on the build system. On 11/11/19 10:56 PM, Ole Holm Nielsen wrote: Hi William, Interesting experiences with MariaDB 10.4! I tried to collect the instructions from the MariaDB page, but I'm unsure about how to get the galera-4 RPM. Could you kindly review and correct my updated instructions? https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms That said, what are the main reasons for installing MariaDB 10 in stead of the 5.5 delivered by RedHat? I'm not sure how well SchedMD has tested MariaDB 10 with Slurm? /Ole On 11-11-2019 21:23, William Brown wrote: I have in fact found the answer by looking harder. The config.log clearly showed that the build of the test MySQL program failed, which is why it was set to be excluded. It failed to link against '-lmariadb'. It turns out that library is no longer in MariaDB or MariaDB-devel, it is separately packaged in MariaDB-shared. That may of course be because I have built MariaDB 10.4 from the mariadb.org site, because CentOS 7 only ships with the extremely old version 5.5. Once I installed the missing package it built the RPMs just fine. However it would be easier to use it linked to static MariaDB libraries, as I now have to installed MariaDB-shared on every server that will run slurmd, i.e. all compute nodes. I expect that if I looked harder at the build options there may be a way to do this, perhaps with linker flags. For now, I can progress. Thanks William -Original Message- From: slurm-users On Behalf Of Ole Holm Nielsen Sent: 11 November 2019 20:02 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] RPM build error - accounting_storage_mysql.so Hi, Maybe my Slurm Wiki can help you build SLurm on CentOS/RHEL 7? See https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms Note in particular: Important: Install the MariaDB (a replacement for MySQL) packages before you build Slurm RPMs (otherwise some libraries will be missing): yum install mariadb-server mariadb-devel /Ole On 11-11-2019 15:22, William Brown wrote: Fabio Did you ever resolve the problem building accounting_storage_mysql.so? I have the exact same problem with CentOS 7.6, building Slurm 19.05.03. My command: rpmbuild -ta slurm-19.05.3-2.tar.bz2 | tee /var/tmp/slurm-build.log The directory with the plugin source is all there: /home/users/slurm/rpmbuild/BUILD/slurm-19.05.3-2/src/plugins/accountin g_storage/mysql, with a Makefile that is the same date/time as the other accounting_storage alternatives. In the log I can see: checking for mysql_config... /usr/bin/mysql_config Looking at the process of building the RPMs it looks as if it has skipped trying to create the missing library file, but then expects to find it in the RPM. This is what I see when it is building, it builds the accounting_storage .so files for _fileext, _none and _slurmdbd, but not for _mysql. I do have MariaDB-devel 10.4.10 installed . . Making all in mysql make[5]: Entering directory `/home/users/slurm/rpmbuild/BUILD/slurm-19.05.3-2/src/plugins/accounting_storage/mysql' make[5]: Nothing to be done for `all'. make[5]: Leaving directory `/home/users/slurm/rpmbuild/BUILD/slurm-19.05.3-2/src/plugins/accounting_storage/mysql' . . Making
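In short, the package combination that made the plugin build succeed in this thread (package names as shipped by the MariaDB 10.x repositories; on stock CentOS 7 the mariadb-devel/mariadb-server packages Ole mentions are enough):

yum install MariaDB-devel MariaDB-shared
rpmbuild -ta slurm-19.05.3-2.tar.bz2 | tee /var/tmp/slurm-build.log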
Re: [slurm-users] How to find core count per job per node
I can't test this right now, but possibly squeue -j -O 'name,nodes,tres-per-node,sct' From squeue man page https://slurm.schedmd.com/squeue.html: sct Number of requested sockets, cores, and threads (S:C:T) per node for the job. When (S:C:T) has not been set, "*" is displayed. (Valid for jobs only) tres-per-node Print the trackable resources per node requested by the job or job step. Again, can't test just now, so no idea if applicable to your use case. On 10/18/19 9:51 PM, Mark Hahn wrote: $ scontrol --details show job 1653838 JobId=1653838 JobName=v1.20 ... Nodes=r00g01 CPU_IDs=31-35 Mem=5120 GRES_IDX= Nodes=r00n16 CPU_IDs=34-35 Mem=2048 GRES_IDX= Nodes=r00n20 CPU_IDs=12-17,30-35 Mem=12288 GRES_IDX= Nodes=r01n16 CPU_IDs=15 Mem=1024 GRES_IDX= thanks for sharing this! we've had a lot of discussion on how to collect this information as well, even whether it would be worth doing in a prolog script... regards, -- Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca | McMaster RHPCS | h...@mcmaster.ca | 905 525 9140 x24687 | Compute/Calcul Canada | http://www.computecanada.ca
[slurm-users] Using swap for gang mode suspended jobs only
Hi, I'd like to allow job suspension in my cluster, without the "penalty" of RAM utilization. The jobs are sometimes very big and can require ~100GB mem on each node. Suspending such a job would usually mean almost nothing else can run on the same node, except for very small memory jobs. Currently the solution is requeue preemption with or without checkpointing. I don't want to use swap for running jobs, ever - I'd rather get OOM killed than use swap while the job is running. Is there a way to tell Slurm to allocate swap and use it only for suspending, to allow preemption without terminating the jobs? The nodes have ~TB of disk space each, and most jobs never utilize any of that (relying on shared storage instead), so local disk space is usually not a concern. Using swap to store suspended jobs, while slow to freeze and thaw, seems to me to be a better localized solution than checkpointing and requeuing, allowing the job to resume "immediately" (sans disk io times) after the high priority job finishes, but if I'm mistaken, please enlighten me. I was wondering if simply setting a large swap in linux, while setting AllowedSwapSpace=0 in cgroup.conf, would work, but I suspect the following: 1. Even suspended, the job still remains within its cgroup limits, and 2. Which process gets swapped is non-deterministic from my point of view - I'm not sure the kernel will swap out the suspended job rather than the new job, at least in its early stages. Thanks in advance, --Dani_L.
Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?
Make tmpfs a TRES, and have NHC update that as in: scontrol update nodename=... gres=tmpfree:$(stat -f /tmp -c "%f*%S" | bc)" Replace /tmp with your tmpfs mount. You'll have to define that TRES in slurm.conf and gres.conf as usual (start with count=1 and have nhc update it) Do note that this is a simplistic example - updating like that will overwrite any other gres defined for the node, so you might wish to create an 'updategres' function that first reads in the node's current gres, only modifies the count of the fields you wish to modify, and returns a complete gres string. In sbatch do: sbatch --gres=tmpfree:20G And based on last update from NHC should only consider nodes with enough tmpfree for the job. HTH --Dani_L. On 9/10/19 10:15 PM, Ole Holm Nielsen wrote: Hi Michael, Thanks for the suggestion! We have user requests for certain types of jobs (quantum chemistry) that require fairly large local scratch space. Our jobs normally do not have this requirement. So unfortunately the per-node NHC check doesn't seem to do the trick. (We already have an NHC check "check_fs_used /scratch 90%"). Best regards, Ole On 10-09-2019 20:41, Michael Jennings wrote: On Monday, 02 September 2019, at 20:02:57 (+0200), Ole Holm Nielsen wrote: We have some users requesting that a certain minimum size of the *Available* (i.e., free) TmpFS disk space should be present on nodes before a job should be considered by the scheduler for a set of nodes. I believe that the "sbatch --tmp=size" option merely refers to the TmpFS file system *Size* as configured in slurm.conf, and this is *not* what users need. For example, a job might require 50 GB of *Available disk space* on the TmpFS file system, which may however have only 20 GB out of 100 GB *Available* as shown by the df command, the rest having been consumed by other jobs (present or past). However, when we do "scontrol show node ", only the TmpFS file system *Size* is displayed as a "TmpDisk" number, but not the *Available* number. Question: How can we get slurmd to report back to the scheduler the amount of *Available* disk space? And how can users specify the minimum *Available* disk space required by their jobs submitted by "sbatch"? If this is not feasible, are there other techniques that achieve the same goal? We're currently still at Slurm 18.08. Hi, Ole! I'm assuming you are wanting a per-job resolution on this rather than per-node? If per-node is good enough, you can of course use NHC to check this, e.g.: * || check_fs_free /tmp 50GB That doesn't work per-job, though, obviously. Something that might work, however, as a temporary work-around for this might be to have the user run a single NHC command, like this: srun --prolog='nhc -e "check_fs_free /tmp 50GB"' There might be some tweaks/caveats to this since NHC normally runs as root, but just an idea :-) An even crazier idea would be to set NHC_LOAD_ONLY=1 in the environment, source /usr/sbin/nhc, and then execute the shell function `check_fs_free` directly! :-D
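A minimal sketch of the node-side update, assuming a gres named tmpfree is already declared in slurm.conf/gres.conf and that the node has no other gres (otherwise merge into the existing Gres string as noted above):

#!/bin/bash
# run periodically (e.g. from NHC) on each compute node
FREE_BYTES=$(stat -f /tmp -c '%f*%S' | bc)
scontrol update NodeName="$(hostname -s)" Gres="tmpfree:${FREE_BYTES}"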
Re: [slurm-users] Different Memory Nodes
Just a quick FYI - using gang mode preemption would mean the available memory would be lower, so if the preempting job requires the entire node memory, this will be an issue. On 9/4/19 8:51 PM, Tina Fora wrote: Thanks Brian! I'll take a look at weights. I want others to be able to use them and take advantage of the large memory when free. We have a preemptable partiton below that works great. PartitionName=scavenge AllowGroups=ALL AllowAccounts=ALL AllowQos=scavenge,abc AllocNodes=ALL Default=NO QoS=scavenge DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=4-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=...a[01-05,11-15],b[01-10],c[01-20] PriorityJobFactor=10 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=GANG,SUSPEND ... (Added a subject) Tina, If you want group xxx to be the only ones to access them, you need to either put them in their own partition or add info to the node definitions to only allow certain users/groups. If you want them to be used last, so they are available until all the other nodes are in use, you can add weights to the node definitions. This would mean users could request >192GB memory, so it has to go to one of the updated nodes, which will only be taken if the other nodes are used up, or a job needing > 192GB is running on them. Brian Andrus On 9/4/2019 9:53 AM, Tina Fora wrote: Hi, I'm adding a bunch of memory on two of our nodes that are part of a blade chassis. So two computes will be upgraded to 1TB RAM and the rest have 192GB. All of the nodes belog to several partitons and can be used by our paid members given the partition below. I'm looking for ways to figure out how only group xxx (the ones paying for the memory upgrade) can get to them. PartitionName=member AllowGroups=ALL AllowAccounts=ALL AllowQos=xxx,yyy,zzz AllocNodes=ALL Default=NO QoS=N/A DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=5-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=a[01-05,11-15],b[01-20] PriorityJobFactor=500 PriorityTier=500 RootOnly=NO ReqResv=NO OverSubscribe=NO OverTimeLimit=NONE PreemptMode=OFF State=UP ... Say compute a01 and a02 will have 1TB memory and I want group xxx to be able to get to them quickly using the partition above. Thanks, Tina
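A sketch of the weight idea in slurm.conf - node names and memory sizes below are placeholders; the higher Weight makes the large-memory nodes the last choice, and the larger RealMemory means requests above 192GB can only land on them:

NodeName=a[01-02] RealMemory=1024000 Weight=100
NodeName=a[03-05,11-15],b[01-20] RealMemory=192000 Weight=10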
Re: [slurm-users] Usage splitting
Wouldn't fairshare with a 90/10 split achieve this? This will require accounting to be set up in your cluster, with the following parameters: In slurm.conf set AccountingStorageEnforce=associations # And possibly '...,limits,qos,safe' as required - so perhaps just use '=all' PriorityType=priority/multifactor # Required by other parameters PriorityDecayHalfLife=14-0 # Decay usage with a 14-day (two-week) half-life PriorityWeightFairshare=1 # With all other weights defaulting to 0, ensures only fairshare influences priority. TRESBillingWeights="Node=1" # According to docs, "Node" should be a TRES. I've never tested this. And from the cmdline add the fair share split via: sacctmgr create account name=A fairshare=10 sacctmgr create account name=B fairshare=90 Then simply associate users to each account, and use something like 'sbatch --account=A ... ' to charge jobs to accounts. This won't do exactly what you want - it might allow 'A' to utilize more than 10% if the cluster is underutilized. I'm not aware of a scheme where 'A' might be preempted only if it has been awarded more than its fair share due to underutilization. If the 10% hard limit is a concern, it might be worth investigating reservations, and allocating to 'A' only from a 10% reservation, while somehow allowing 'B' to utilize that reservation too if required. On 30/08/2019 14:14:16, Stefan Staeglich wrote: Hi, we have some compute nodes paid by different project owners. 10% are owned by project A and 90% are owned by project B. We want to implement the following policy such that every certain time period (e.g. two weeks): - Project A doesn't use more than 10% of the cluster in this time period - But project B is allowed to use more than 90% What's the best way to enforce this? Best, Stefan -- HTH, --Dani_L.
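The association step is just (user names assumed):

sacctmgr add user alice account=A
sacctmgr add user bob account=B
# then submit against the paying account
sbatch --account=A job.sh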
Re: [slurm-users] sacctmgr dump question - how can I dump entities other than cluster?
The cluster config doesn't contain qos rule definitions. It only contains mappings of qos to users/accounts. Currently it is impossible to dump and edit qos rules, although it is possible to add and remove defined qos from/to users. Regards, --Dani_L. On 8/12/19 8:44 AM, Barbara Krašovec wrote: Yes, afaik you can only dump the whole cluster config, not a specific entity. If you dump the cluster config, qos rules are also included, so you can modify the rules in the cluster config and load it. If you don't want to do that, then just use the sacctmgr modify option. Cheers, Barbara On 8/5/19 12:02 PM, Daniel Letai wrote: The documentation clearly states dump <ENTITY> <File=FILENAME> Dump cluster data to the specified file. If the filename is not specified it uses the clustername.cfg filename by default. However, the only entity sacctmgr dump seems to accept is a cluster. Glancing over the code at https://github.com/SchedMD/slurm/blob/master/src/sacctmgr/cluster_functions.c#L1006 it doesn't seem like sacctmgr will accept anything other than a cluster name either. How can I easily dump qos rules to a file, in a way that would allow me to modify and upload new qos as required? BTW, just noticed "archive" is not in the 'commands' section of the sacctmgr man page, but is treated as a command in later sections of the man page.
[slurm-users] sacctmgr dump question - how can I dump entities other than cluster?
The documentation clearly states: dump <ENTITY> <File=FILENAME> Dump cluster data to the specified file. If the filename is not specified it uses the clustername.cfg filename by default. However, the only entity sacctmgr dump seems to accept is a cluster. Glancing over the code at https://github.com/SchedMD/slurm/blob/master/src/sacctmgr/cluster_functions.c#L1006 it doesn't seem like sacctmgr will accept anything other than a cluster name either. How can I easily dump qos rules to a file, in a way that would allow me to modify and upload new qos as required? BTW, just noticed "archive" is not in the 'commands' section of the sacctmgr man page, but is treated as a command in later sections of the man page.
Re: [slurm-users] Slurm configuration
Hi. On 8/3/19 12:37 AM, Sistemas NLHPC wrote: Hi all, Currently we have two types of nodes, one with 192GB and another with 768GB of RAM, it is required that in nodes of 768 GB it is not allowed to execute tasks with less than 192GB, to avoid underutilization of resources. This, because we have nodes that can fulfill the condition of executing tasks with 192GB or less. Is it possible to use some slurm configuration to solve this problem? Easiest would be to use features/constraints. In slurm.conf add NodeName=DEFAULT RealMemory=196608 Features=192GB Weight=1 NodeName=... (list all nodes with 192GB) NodeName=DEFAULT RealMemory=786432 Features=768GB Weight=2 NodeName=... (list all nodes with 768GB) And to run jobs only on node with 192GB in sbatch do sbatch -C 192GB ... To run jobs on all nodes, simply don't add the constraint to the sbatch line, and due to lower weight jobs should prefer to start on the 192GB nodes. PD: All users can submit jobs on all nodes Thanks in advance Regards.
Re: [slurm-users] Unexpected MPI process distribution with the --exclusive flag
On 7/30/19 6:03 PM, Brian Andrus wrote: I think this may be more on how you are calling mpirun and the mapping of processes. With the "--exclusive" option, the processes are given access to all the cores on each box, so mpirun has a choice. IIRC, the default is to pack them by slot, so fill one node, then move to the next. Whereas you want to map by node (one process per node cycling by node) From the man for mpirun (openmpi): --map-by Map to the specified object, defaults to socket. Supported options include slot, hwthread, core, L1cache, L2cache, L3cache, socket, numa, board, node, sequential, distance, and ppr. Any object can include modifiers by adding a : and any combination of PE=n (bind n processing elements to each proc), SPAN (load balance the processes across the allocation), OVERSUBSCRIBE (allow more processes on a node than processing elements), and NOOVERSUBSCRIBE. This includes PPR, where the pattern would be terminated by another colon to separate it from the modifiers. so adding "--map-by node" would give you what you are looking for. Of course, this syntax is for Openmpi's mpirun command, so YMMV If using srun (as recommended) instead of invoking mpirun directly, you can still achieve the same functionality using exported environment variables as per the mpirun man page, like this: OMPI_MCA_rmaps_base_mapping_policy=node srun --export OMPI_MCA_rmaps_base_mapping_policy ... in you sbatch script. Brian Andrus On 7/30/2019 5:14 AM, CB wrote: Hi Everyone, I've recently discovered that when an MPI job is submitted with the --exclusive flag, Slurm fills up each node even if the --ntasks-per-node flag is used to set how many MPI processes is scheduled on each node. Without the --exclusive flag, Slurm works fine as expected. Our system is running with Slurm 17.11.7. The following options works that each node has 16 MPI processes until all 980 MPI processes are scheduled.with total of 62 compute nodes. Each of the 61 nodes has 16 MPI processes and the last one has 4 MPI processes, which is 980 MPI processes in total. #SBATCH -n 980 #SBATCH --ntasks-per-node=16 However, if the --exclusive option is added, Slurm fills up each node with 28 MPI processes (the compute node has 28 cores). Interestingly, Slurm still allocates 62 compute nodes although only 35 nodes of them are actually used to distribute 980 MPI processes. #SBATCH -n 980 #SBATCH --ntasks-per-node=16 #SBATCH --exclusive Has anyone seen this behavior? Thanks, - Chansup
Re: [slurm-users] Can I use the manager as compute node
Yes, just add it to the Nodes= list of the partition. You will have to install slurm-slurmd on it as well, and enable and start as on any compute node, or it will be DOWN. HTH, --Dani_L. On 7/30/19 3:45 PM, wodel youchi wrote: Hi, I am newbie in Slurm, All examples I saw when they declare the Partition, only compute nodes are used. My question is : can I use the manager or the slurmctldhost (the master host) as a compute node in and extended partition for example? if yes how? Regards.
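A minimal sketch in slurm.conf, with hostnames, core counts and memory sizes assumed - the controller host simply gets its own NodeName line and is listed in the partition like any other node:

NodeName=master CPUs=16 RealMemory=64000 State=UNKNOWN
NodeName=node[01-10] CPUs=32 RealMemory=128000 State=UNKNOWN
PartitionName=all Nodes=master,node[01-10] Default=YES MaxTime=INFINITE State=UP

# on the controller itself, after installing slurm-slurmd
systemctl enable --now slurmd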
Re: [slurm-users] Weekend Partition
I would use a partition with very low priority and preemption. General cluster conf: PreemptType=preempt/partition_prio PreemptMode=Cancel # Anything except 'Off' Partition definition: PartitionName=weekend PreemptMode=Cancel MaxTime=Unlimited PriorityTier=1 State=Down Use cron to 'scontrol update PartitionName=weekend state=up' when desired and 'scontrol update PartitionName=weekend state=down' on Sunday. This will not cancel the jobs on its own, but will prevent new ones from starting. The preemption will kill jobs as required to allow regular jobs to run - the added value is that as long as they don't prevent other jobs from starting, those jobs can continue, and won't be killed needlessly. Just my 2 cents. The other option is to use a recurring reservation with a start and stop time frame, and force jobs to use that reservation (possibly via qos). This solution might look something like: scontrol create reservation StartTime=00:00:01 Duration= Flags= For Flags you have a couple of options: WEEKEND Repeat the reservation at the same time on every weekend day (Saturday and Sunday). WEEKLY Repeat the reservation at the same time every week. So I would guess Duration=1-0 Flags=WEEKEND or Duration=2-0 Flags=WEEKLY You will have to test to see what works best for you. HTH --Dani_L. On 7/23/19 7:36 PM, Matthew BETTINGER wrote: Hello, We run lsf and slurm here. For LSF we have a weekend queue with no limit and jobs get killed after Sunday. What is the best way to do something similar for slurm? Reservation? We would like to have any running jobs killed after Sunday if possible too. Thanks.
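A sketch of the cron side (times assumed: open Friday 18:00, close Monday 00:00), e.g. in /etc/cron.d/weekend-partition:

0 18 * * 5   root   scontrol update PartitionName=weekend State=UP
0 0  * * 1   root   scontrol update PartitionName=weekend State=DOWN
# optional, if leftover jobs really must die on Monday morning
5 0  * * 1   root   scancel --partition=weekend --state=RUNNING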
Re: [slurm-users] [pmix] [Cross post - Slurm, PMIx, UCX] Using srun with SLURM_PMIX_DIRECT_CONN_UCX=true fails with input/output error
Thank you Artem, I've made a mistake while typing the mail, in all cases it was 'OMPI_MCA_pml=ucx' and not as written. When I went over the mail before sending, I must have erroneously 'fixed' it for some reason. Best regards, --Dani_L. On 7/9/19 9:06 PM, Artem Polyakov wrote: Hello, Daniel Let me try to reproduce locally and get back to you. Best regards, Artem Y. Polyakov, PhD Senior Architect, SW Mellanox Technologies From: p...@googlegroups.com on behalf of Daniel Letai Sent: Tuesday, July 9, 2019 3:25:22 AM To: Slurm User Community List; p...@googlegroups.com; ucx-gr...@elist.ornl.gov Subject: [pmix] [Cross post - Slurm, PMIx, UCX] Using srun with SLURM_PMIX_DIRECT_CONN_UCX=true fails with input/output error Cross posting to Slurm, PMIx and UCX lists. Trying to execute a simple openmpi (4.0.1) mpi-hello-world via Slurm (19.05.0) compiled with both PMIx (3.1.2) and UCX (1.5.0) results in: [root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=true SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm srun --export SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl, UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello slurmstepd: error: n1 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error slurmstepd: error: n1 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to n2 (1) slurmstepd: error: n1 [0] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 1 srun: Job step aborted: Waiting up to 32 seconds for job step to finish. slurmstepd: error: n2 [1] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error slurmstepd: error: n2 [1] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to n1 (0) slurmstepd: error: *** STEP 7202.0 ON n1 CANCELLED AT 2019-07-01T13:20:36 *** slurmstepd: error: n2 [1] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 0 srun: error: n2: task 1: Killed srun: error: n1: task 0: Killed However, the following works: [root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm srun --export SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl, UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello n2: Process 1 out of 2 n1: Process 0 out of 2 [root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' SLURM_PMIX_DIRECT_CONN_EARLY=true UCX_TLS=rc,shm srun --export SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl, UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello n2: Process 1 out of 2 n1: Process 0 out of 2 Executing mpirun directly (same env vars, without the slurm vars) works, so UCX appears to function correctly. If both SLURM_PMIX_DIRECT_CONN_EARLY=true and SLURM_PMIX_DIRECT_CONN_UCX=true then I get collective timeout errors from mellanox/hcoll and glibc detected /data/mpihello/mpihello: malloc(): memory corruption (fast) Can anyone help using PMIx direct connection with UCX in Slurm?
Some info about my setup: UCX version [root@n1 ~]# ucx_info -v # UCT version=1.5.0 revision 02078b9
[slurm-users] [Cross post - Slurm, PMIx, UCX] Using srun with SLURM_PMIX_DIRECT_CONN_UCX=true fails with input/output error
Cross posting to Slurm, PMIx and UCX lists.

Trying to execute a simple openmpi (4.0.1) mpi-hello-world via Slurm (19.05.0) compiled with both PMIx (3.1.2) and UCX (1.5.0) results in:

[root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=true SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm srun --export SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello
slurmstepd: error: n1 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
slurmstepd: error: n1 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to n2 (1)
slurmstepd: error: n1 [0] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: n2 [1] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
slurmstepd: error: n2 [1] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to n1 (0)
slurmstepd: error: *** STEP 7202.0 ON n1 CANCELLED AT 2019-07-01T13:20:36 ***
slurmstepd: error: n2 [1] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 0
srun: error: n2: task 1: Killed
srun: error: n1: task 0: Killed

However, the following works:

[root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm srun --export SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello
n2: Process 1 out of 2
n1: Process 0 out of 2

[root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' SLURM_PMIX_DIRECT_CONN_EARLY=true UCX_TLS=rc,shm srun --export SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello
n2: Process 1 out of 2
n1: Process 0 out of 2

Executing mpirun directly (same env vars, without the slurm vars) works, so UCX appears to function correctly.

If both SLURM_PMIX_DIRECT_CONN_EARLY=true and SLURM_PMIX_DIRECT_CONN_UCX=true then I get collective timeout errors from mellanox/hcoll and glibc detected:
/data/mpihello/mpihello: malloc(): memory corruption (fast)

Can anyone help using PMIx direct connection with UCX in Slurm?
Some info about my setup:

UCX version:
[root@n1 ~]# ucx_info -v
# UCT version=1.5.0 revision 02078b9
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --target=x86_64-redhat-linux-gnu --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check

Mellanox OFED version:
[root@n1 ~]# ofed_info -s
OFED-internal-4.5-1.0.1:

Slurm:
slurm was built with: rpmbuild -ta slurm-19.05.0.tar.bz2 --without debug --with ucx --define '_with_pmix --with-pmix=/usr'

PMIx:
[root@n1 ~]# pmix_info -c --parsable
config:user:root
config:timestamp:"Mon Mar 25 09:51:04 IST 2019"
config:host:slurm-test
config:cli: '--host=x86_64-redhat-linux-gnu' '--build=x86_64-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--localstatedir=/var' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info'

Thanks,
--Dani_L.
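Based on the correction in the reply above (OMPI_MCA_pml should have been 'ucx', not 'true'), the failing invocation as intended would look roughly like the following. This is an untested reconstruction; the host names, device name and binary path are simply taken from the original report:

# Same command as the failing one above, with the corrected OMPI_MCA_pml value.
SLURM_PMIX_DIRECT_CONN_UCX=true SLURM_PMIX_DIRECT_CONN=true \
OMPI_MCA_pml=ucx OMPI_MCA_btl='^vader,tcp,openib' \
UCX_NET_DEVICES='mlx4_0:1' SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm \
srun --export=SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS \
    --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello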
Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
I had similar problems in the past. The 2 most common issues were:
1. Controller load - if slurmctld was under heavy load, it sometimes didn't respond in a timely manner, exceeding the timeout limit.
2. Topology, message forwarding and aggregation.
For 2 - it would seem the nodes designated for forwarding are statically assigned based on topology. I could be wrong, but that's my observation, as I would get the socket timeout error when those nodes had issues, even though other nodes in the same topology 'zone' were fine and could have been used instead. It took debug3 to observe this in the logs, I think.
HTH,
--Dani_L.

On 6/11/19 5:27 PM, Steffen Grunewald wrote:
On Tue, 2019-06-11 at 13:56:34 +, Marcelo Garcia wrote:
Hi
Since mid-March 2019 we are having a strange problem with slurm. Sometimes, the command "sbatch" fails:
+ sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

I've seen such an error message from the underlying file system. Is there anything special (e.g. non-NFS) in your setup that may have changed in the past few months? Just a shot in the dark, of course...

Ecflow runs preprocessing on the script, which generates a second script that is submitted to slurm. In our case, the submission script is called "42.job1". The problem we have is that sometimes the "sbatch" command fails with the message above. We couldn't find any hint in the logs. Hardware and software logs are clean. I increased the debug level of slurm, to
# scontrol show config
(...)
SlurmctldDebug = info
But still no clue about what is happening. Maybe the next thing to try is to use "sdiag" to inspect the server. Another complication is that the problem is random, so should we put "sdiag" in a cronjob? Is there a better way to run "sdiag" periodically?
Thanks for your attention.
Best Regards
mg.
- S
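Regarding running "sdiag" periodically: a plain cron entry is usually enough to capture the scheduler state around a random failure. A minimal, untested sketch (log path and 5-minute interval are arbitrary choices):

# Crontab entry: append scheduler diagnostics to a daily log file.
# Note the escaped % signs, which are special characters in crontab.
*/5 * * * * /usr/bin/sdiag >> /var/log/slurm/sdiag-$(date +\%Y\%m\%d).log 2>&1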
Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz
Hi Loris,

On 3/21/19 6:21 PM, Loris Bennett wrote:
Chris, maybe you should look at EasyBuild (https://easybuild.readthedocs.io/en/latest/). That way you can install all the dependencies (such as zlib) as modules and be pretty much independent of the ancient packages your distro may provide (other ...)
Cheers,
Loris

Do you have experience with spack or flatpak too? They all seem to solve the same problem, and I'd be interested in any comparison based on experience.
https://spack.readthedocs.io/en/latest/
http://docs.flatpak.org/en/latest/
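For the sake of comparison, a minimal Spack workflow looks roughly like the following. This is an untested sketch; zlib is just the example dependency mentioned above, and the module-refresh step assumes a Tcl environment-modules setup:

# Clone Spack, source its shell setup, and build zlib independently of the distro packages.
git clone https://github.com/spack/spack.git
. spack/share/spack/setup-env.sh
spack install zlib
# Either load it into the current shell directly, or regenerate modulefiles for it.
spack load zlib
spack module tcl refresh -y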
Re: [slurm-users] Sharing a node with non-gres and gres jobs
Hi Peter,

On 3/20/19 11:19 AM, Peter Steinbach wrote:
[root@ernie /]# scontrol show node -dd g1
NodeName=g1 CoresPerSocket=4
CPUAlloc=3 CPUTot=4 CPULoad=N/A
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:titanxp:2
GresDrain=N/A
GresUsed=gpu:titanxp:0(IDX:N/A)
NodeAddr=127.0.0.1 NodeHostName=localhost Port=0

If the following is true:
RealMemory=4000 AllocMem=4000 FreeMem=N/A Sockets=1 Boards=1
that is, all memory is allocated to the running jobs, then I don't think any new job will enter, regardless of gres.

State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=gpu
BootTime=2019-03-18T10:14:18 SlurmdStartTime=2019-03-20T09:07:45
CfgTRES=cpu=4,mem=4000M,billing=4
AllocTRES=cpu=3,mem=4000M
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

I now filled the 'cluster' with non-gres jobs and I submitted a GPU job:
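If exhausted memory is indeed the cause (AllocMem equals RealMemory in the output above), a hedged example of how to leave headroom on a node like g1 (4 CPUs, 4000M) is to request memory explicitly on every job rather than letting a default consume the whole node:

# Untested example: explicit --mem keeps AllocMem below RealMemory, so a later
# GPU job can still be scheduled on the same node; values are placeholders.
sbatch --partition=gpu --mem=1000 --wrap="sleep 300"
sbatch --partition=gpu --gres=gpu:titanxp:1 --mem=1000 --wrap="sleep 300"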
Re: [slurm-users] problems with slurm and openmpi
Hi.

On 12/03/2019 22:53:36, Riccardo Veraldi wrote:
Hello,
after trying hard for over 10 days I am forced to write to the list. I am not able to have SLURM work with openmpi. Openmpi compiled binaries won't run on slurm, while all non-openmpi progs run just fine under "srun".
I am using SLURM 18.08.5, building the rpm from the tarball:
rpmbuild -ta slurm-18.08.5-2.tar.bz2
Prior to building SLURM I installed openmpi 4.0.0, which has built-in pmix support. The pmix libraries are in /usr/lib64/pmix/ which is the default installation path.
The problem is that hellompi is not working if I launch it from srun. Of course it runs outside slurm.
[psanagpu105:10995] OPAL ERROR: Not initialized in file pmix3x_client.c at line 113
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun", but OMPI was not built with SLURM's PMI support and therefore cannot execute. There are several options for building PMI support under

I would guess (but having the config.log files would verify it) that you should rebuild Slurm --with-pmix and then rebuild OpenMPI --with-slurm.
Currently there might be a bug in Slurm's configure file building PMIx support without a path, so you might either modify the spec before building (add --with-pmix=/usr to the configure section) or, for testing purposes, run ./configure --with-pmix=/usr; make; make install.
It seems your current configuration has a built-in mismatch - Slurm only supports pmi2, while OpenMPI only supports PMIx. You should build with at least one common PMI: either external PMIx when building Slurm, or Slurm's PMI2 when building OpenMPI.
However, I would have expected the non-PMI option (srun --mpi=openmpi) to work even in your env, and Slurm should have built PMIx support automatically since it's in the default search path.

SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or PMI-2 support. SLURM builds PMI-1 by default, or you can manually install PMI-2. You must then build Open MPI using --with-pmi pointing to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[psanagpu105:10995] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: psanagpu105: task 0: Exited with exit code 1
I really have no clue. I even reinstalled openmpi on a specific different path /opt/openmpi/4.0.0. Anyway it seems like slurm does not know how to find the MPI libraries even though they are there, and right now in the default path /usr/lib64.
Even using --mpi=pmi2 or --mpi=openmpi does not fix the problem and the same error message is given to me.
srun --mpi=list
srun: MPI types
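To make the rebuild suggestion above concrete, here is an untested sketch. It assumes PMIx really is installed under /usr (as the /usr/lib64/pmix/ path in the report suggests); adjust versions and paths as needed:

# Option 1: add --with-pmix=/usr to the configure line in slurm.spec inside the
# tarball, then rebuild the RPMs as before:
rpmbuild -ta slurm-18.08.5-2.tar.bz2
# Option 2: for a quick test, configure and build from the unpacked source tree:
tar xjf slurm-18.08.5-2.tar.bz2 && cd slurm-18.08.5-2
./configure --with-pmix=/usr
make && make install
# Afterwards, OpenMPI can be rebuilt against the same stack, e.g.:
# ./configure --with-slurm --with-pmix=/usr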
Re: [slurm-users] Visualisation -- Slurm and (Turbo)VNC
I haven't done this in a long time, but this blog entry might be of some use (I believe I did something similar when it was required in the past):
https://summerofhpc.prace-ri.eu/remote-accelerated-graphics-with-virtualgl-and-turbovnc/

On 03/01/2019 12:14:52, Baker D.J. wrote:
Hello,
We have set up our NICE/DCV cluster and that is proving to be very popular. There are, however, users who would benefit from using the resources offered by our nodes with multiple GPU cards. This potentially means setting up TurboVNC, for example.
I would, if possible, like to make the process of starting a VNC server as painless as possible. I wondered if anyone had written a slurm script that users could modify/submit to reserve resources and start the VNC server. If you have such a template script and/or any advice on using VNC via slurm then I would be interested to hear from you please.
Many of our visualization users are not "expert users" and so, as I note above, it would be useful to try to make the process as painless as possible. If you would be happy to share your script with us then that would be appreciated.
Best regards,
David
--
Regards,
Daniel Letai
+972 (0)505 870 456
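I don't have a battle-tested template to share, but an untested sketch of such a submit script might look like this. The TurboVNC install path, partition name, GPU GRES and geometry are all assumptions that would need adapting locally:

#!/bin/bash
#SBATCH --job-name=turbovnc
#SBATCH --partition=gpu          # assumption: whichever partition holds the GPU nodes
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
# Start a TurboVNC server on the allocated node and tell the user where it runs.
VNCSERVER=/opt/TurboVNC/bin/vncserver   # assumed install location
"$VNCSERVER" -geometry 1920x1080
echo "TurboVNC started on $(hostname); see ~/.vnc/ logs for the display number"
# Keep the allocation (and hence the VNC server) alive until the job is cancelled.
sleep infinity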
Re: [slurm-users] Can frequent hold-release adversely affect slurm?
On 18/10/2018 20:34, Eli V wrote:
On Thu, Oct 18, 2018 at 1:03 PM Daniel Letai wrote:
Hello all,
To solve a requirement where a large number of job arrays (~10k arrays, each with at most 8M elements) with the same priority should be executed with minimal starvation of any array - we don't want to wait for each array to complete before starting the next one - we wish to implement "interleaving" between arrays. We came up with the following scheme:
Start all arrays in this partition in a "Hold" state.
Release a predefined number of elements (e.g., 200).
From this point a slurmctld prolog takes over:
On the 200th job, run squeue and note the next job array (array id following the currently executing array id).
Release a predefined number of elements (e.g., 200) from it, and repeat.
This might produce a very large number of release requests to the scheduler in a short time frame, and one concern is the scheduler loop getting too many requests.
Can you think of other issues that might come up with this approach? Do you have any recommendations, or might you suggest a better approach to solve this problem?

I can't comment on the scalability issues, but if possible using %200 on the array submission seems like the simplest solution. From the sbatch man page: For example "--array=0-15%4" will limit the number of simultaneously running tasks from this job array to 4.

That won't achieve the same goal - we want a round-robin solution; your proposal would hard-limit each array to 200 running jobs. This will leave most of the cluster underutilized.
We have considered fairshare, but all arrays are from the same account and user. We have considered creating accounts on the fly (one for each array) but get an error ("This should never happen") after creating a few thousand accounts. To my understanding fairshare is only viable between accounts.
--
Regards,
Daniel Letai
+972 (0)505 870 456
[slurm-users] Can frequent hold-release adversely affect slurm?
Hello all,
To solve a requirement where a large number of job arrays (~10k arrays, each with at most 8M elements) with the same priority should be executed with minimal starvation of any array - we don't want to wait for each array to complete before starting the next one - we wish to implement "interleaving" between arrays. We came up with the following scheme:
Start all arrays in this partition in a "Hold" state.
Release a predefined number of elements (e.g., 200).
From this point a slurmctld prolog takes over:
On the 200th job, run squeue and note the next job array (array id following the currently executing array id).
Release a predefined number of elements (e.g., 200) from it, and repeat.
This might produce a very large number of release requests to the scheduler in a short time frame, and one concern is the scheduler loop getting too many requests.
Can you think of other issues that might come up with this approach? Do you have any recommendations, or might you suggest a better approach to solve this problem?
We have considered fairshare, but all arrays are from the same account and user. We have considered creating accounts on the fly (one for each array) but get an error ("This should never happen") after creating a few thousand accounts. To my understanding fairshare is only viable between accounts.
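A rough, untested sketch of the submission side of this scheme; the script names, array size and batch size of 200 are placeholders, and the slurmctld prolog would be responsible for releasing the subsequent batches:

# Submit every array held, then release only its first 200 elements up front;
# the prolog later releases the next 200 from the following array, and so on.
for script in array-*.sh; do                       # placeholder file names
    jobid=$(sbatch --parsable --hold --array=0-9999 "$script")
    scontrol release "${jobid}_[0-199]"
done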
Re: [slurm-users] Is it possible to select the BatchHost for a job through some sort of prolog script?
On 06/07/2018 10:22, Steffen Grunewald wrote:
On Fri, 2018-07-06 at 07:47:16 +0200, Loris Bennett wrote:
Hi Tim,
Tim Lin writes:
As the title suggests, I'm searching for a way to have tighter control over which node the batch script gets executed on. In my case it's very hard to know which node is best for this until after all the nodes are allocated, right before the batch job starts. I've looked through all the documentation I can get my hands on, but I haven't found any mention of control over the batch host for admins. Am I missing something?

As the documentation of 'sbatch' says: "When the job allocation is finally granted for the batch script, Slurm runs a single copy of the batch script on the first node in the set of allocated nodes."
I am not aware of any way of changing this. Perhaps you can explain why you feel it is necessary for you to do this.

For me, the above reads like the user has an idea of a metric for how to select the node for rank-0 (and perhaps the code is sufficiently asymmetric to justify such a selection), but no way to tell Slurm about it.
What about making the batch script a wrapper around the real payload: on the "outer first node", take the list of assigned nodes and possibly reorder it, then run the payload (via passphrase-less ssh?) on the selected, "new first" node?

Why not just use salloc instead? Allocate all the nodes for the job, then use the script to select (ssh?) the master and start the actual job there.
I'm still not sure why that would be necessary, though. Could you give a clear example of the master selection process? What metric/constraint is involved, and why can it only be obtained after node selection?

This may require changing some more environment variables, and may harm signalling.
Okay, my suggestion reads like a terrible kludge (which it certainly is), but AFAIK there's no way to tell Slurm about "preferred first nodes".
- S
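To illustrate the salloc idea above, an untested sketch follows. The node-selection step (simply taking the last node) is only a placeholder for whatever site-specific metric applies, and real_payload.sh stands in for the actual job:

# Allocate the nodes, pick the preferred "first" node from the allocation,
# then launch the payload there with srun instead of ssh.
salloc -N 4 bash -c '
    nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
    first=$(echo "$nodes" | tail -n 1)    # placeholder selection metric
    srun --nodes=1 --ntasks=1 -w "$first" ./real_payload.sh
'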