I can add more information to this. We’ve found that if we add the -N option, jobs behave much more the way we want. We are looking at the submit plugins (maybe lua) to make sure that it is specified on all jobs. “--exclusive” still won’t just find a free node. It takes the first node that has enough cores and pre-empts whatever is running there.
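As a rough illustration of the "make sure -N is specified on all jobs" idea above, here is a minimal Python sketch of a pre-submission check. It is not Slurm's job_submit plugin API (that would be written in Lua or C); the helper names has_node_count and ensure_node_count, and the default of one node, are hypothetical and chosen only for this example. It assumes the node count is given as an #SBATCH directive in the batch script and does not look at command-line arguments.

```python
import re

# Matches -N4, -N 4, --nodes=4 and --nodes 4 in the option part of a directive.
# Hypothetical helper pattern for this sketch, not part of Slurm.
_NODES_OPT = re.compile(r"(?:^|\s)(?:-N\s*\d|--nodes(?:=|\s+)\d)")


def has_node_count(script_text: str) -> bool:
    """Return True if any #SBATCH directive line sets -N/--nodes."""
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#SBATCH"):
            options = stripped[len("#SBATCH"):]
            if _NODES_OPT.search(options):
                return True
    return False


def ensure_node_count(script_text: str, default_nodes: int = 1) -> str:
    """Add '#SBATCH -N <default_nodes>' after the shebang if no node count is set.

    default_nodes=1 is an assumption for illustration only.
    """
    if has_node_count(script_text):
        return script_text
    lines = script_text.splitlines()
    # Keep the shebang as the first line; directives must precede commands.
    insert_at = 1 if lines and lines[0].startswith("#!") else 0
    lines.insert(insert_at, f"#SBATCH -N {default_nodes}")
    return "\n".join(lines) + "\n"
```

A wrapper around sbatch could pass each script through ensure_node_count before submitting it. A real job_submit plugin would inspect the job description inside slurmctld instead, which also covers options given on the command line.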
Deborah Crocker, PhD
Systems Engineer III
Office of Information Technology
The University of Alabama
Box 870346
Tuscaloosa, AL 36587
Office 205-348-3758 | Fax 205-348-9393
[email protected]

From: Crocker, Deborah [mailto:[email protected]]
Sent: Friday, April 08, 2016 7:43 AM
To: slurm-dev
Subject: [slurm-dev] Problems with preemption

First off, I’d like to apologize to the list. I sent this by grabbing an older message and forgot to change the subject line. Here is my question with a better subject. I should add that I’ve experimented some with using LLN to get jobs to go to other nodes, many of which are free, instead of landing on the one that is already running a job, but still no luck.

----------------- original question ----------------------

We have our basic slurm setup running with a couple of queues set up through QOS (a main and a long “queue”). What we would like to do now is set up pre-emption for our stakeholders. When we have preemption set on the partition (suspend), though, it looks like any job with --exclusive set will suspend the job running on the next available node that has enough free cores (since jobs are block scheduled to fill free cores before going on to a new node). The only way to stop it seems to be to put preemptmode=off on the partition (which defeats the purpose). How do we do this, short of telling everyone never to use “--exclusive”?

There seem to be a number of descriptions online of setting up preemption, some using different partitions, but when we use different partitions we always get oversubscribed nodes (since it seems that the scheduling doesn’t coordinate between partitions). Thanks in advance - any help is appreciated.
This is in our slurm.conf:

PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
…
SelectType=select/cons_res
SelectTypeParameters=CR_CPU

Our partition:

PartitionName=main Default=yes Priority=1 State=UP TRESBillingWeights="CPU=1.0,Mem=0.25G" nodes=compute-0-[0-8],compute-5-[0-1],compute-20-0 DefMemPerCPU=4096 QOS=main PreemptMode=suspend

And this is our qos:

Name Priority GraceTime Preempt PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall MaxTRES MaxTRESPerNode MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU MinTRES
long 0 00:00:00 master cluster OverPartQOS 1.000000 cpu=132 7-00:00:00
main 0 00:00:00 cluster 1.000000

Deborah Crocker, PhD

From: Husen R [mailto:[email protected]]
Sent: Wednesday, April 06, 2016 9:05 PM
To: slurm-dev
Subject: [slurm-dev] Re: Failed to access munge.socket.2

Hello Lachlan, Chris

Thank you for your reply. I don't know why "/usr/local" is prepended to the path. I tried to find munge.socket.2 manually using the locate command, and the file indeed does not exist. The directory /usr/local/var/run/munge is empty. There is no munge directory in /var/run. I don't know why the munge directory is located in /usr/local/var/run instead of in /var/run. I had previously installed slurm-llnl from the repository before installing it from source. Is this probably the cause of the problem?
Regards,
Husen

On Thu, Apr 7, 2016 at 8:26 AM, Christopher Samuel <[email protected]> wrote:

On 06/04/16 19:50, Husen R wrote:

> however, when I tried to run sbatch I get the following error message:
>
> Failed to access "/usr/local/var/run/munge/munge.socket.2": No such file
> or directory

Is that path really correct? On our systems it's:

/var/run/munge/munge.socket.2

Best of luck,
Chris
--
Christopher Samuel
Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected]
Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/
http://twitter.com/vlsci
