SLURM's reservation system is working as designed and satisfies the needs at LLNL (which paid for the development). We understand that it could be more versatile and have discussed the matter, but there are currently no plans to change the logic.

________________________________________
From: [email protected] [[email protected]] On Behalf Of Lennart Karlsson [[email protected]]
Sent: Thursday, March 31, 2011 2:16 AM
To: [email protected]
Subject: [slurm-dev] Difficulties with SLURM reservations
(We are running version 2.2.0 of SLURM.)

Yesterday, one of my reservations ended at six p.m. It was named "raalvmar", like the user it was reserved for:

ReservationName=raalvmar StartTime=2011-03-28T08:00:00 EndTime=2011-03-30T18:00:00
   Duration=2-10:00:00 Nodes=h1 NodeCnt=1 Features=(null) PartitionName=halvan
   Flags=IGNORE_JOBS Users=lka,raalvmar Accounts=(null) Licenses=(null)

24 minutes before the reservation ended, the user was allowed to start a job. Here is the "scontrol show job" output:

JobId=155 Name=job_20110330_173627_eBoJT3
   UserId=raalvmar(40037) GroupId=uppmax(40001)
   Priority=70058 Account=b2010051 QOS=normal WCKey=*
   JobState=TIMEOUT Reason=TimeLimit Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:1
   RunTime=00:23:24 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2011-03-30T17:36:28 EligibleTime=2011-03-30T17:36:28
   StartTime=2011-03-30T17:36:48 EndTime=2011-03-30T18:00:12 SuspendTime=None SecsPreSuspend=0
   Partition=halvan AllocNode:Sid=kalkyl4:22785
   ReqNodeList=(null) ExcNodeList=(null) NodeList=h1
   NumNodes=1 NumCPUs=8 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bubo/home/h13/raalvmar/glob/jobs/job_20110330_173627_eBoJT3
   WorkDir=/bubo/glob/g13/raalvmar/jobs

As you can read, the time limit was 10 hours, but the job was terminated with status TIMEOUT long before that. We, the user and I, were both surprised. In my view, it would have been better if the job had not started, because SLURM should have known that it was unlikely to be able to run for ten hours. Even better would be if the job were rejected at submit time.

We would like to reserve nodes for users or accounts, but the reservation system seems to have strange limits:

1/ At submit time you must specify exactly one reservation name, otherwise your job cannot use a node that is reserved.
2/ If you specify a reservation name, your job will not start on the node until the reservation starts, even if the node is free before the reservation starts too. As I now understand it, the job is also not allowed to continue running on the node when the reservation ends.

As I see the situation, reservations would be more useful if it were allowed

a/ ... for a job to specify that it would not mind running freely within any reservation that allows its user and/or account to run there.
b/ ... for a job to run within the reservation and also before and after the reservation.

Have I misunderstood the reservation rules, so that reservations actually are more useful than I understand? Or are there other mechanisms within SLURM that do what I am asking for? Otherwise, are there any plans to make reservations more versatile?

Best regards,
-- Lennart Karlsson
   UPPMAX, Uppsala, Sweden
   http://www.uppmax.uu.se
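[Editor's note: the submit-time check Lennart asks for can be sketched independently of SLURM. This is only an illustration of the desired policy, not SLURM source code; the function name and arguments are hypothetical, while the timestamps come from the job record above.]

```python
from datetime import datetime, timedelta

def fits_before_reservation_end(start_time: datetime,
                                time_limit: timedelta,
                                reservation_end: datetime) -> bool:
    """Hypothetical admission check: only let the job start if its
    full time limit fits before the reservation's EndTime."""
    return start_time + time_limit <= reservation_end

# Values from the job above: it started at 17:36:48 with a 10-hour
# limit, but the reservation ended at 18:00:00 the same day, so the
# check would have refused to start (or accept) the job.
start = datetime(2011, 3, 30, 17, 36, 48)
end = datetime(2011, 3, 30, 18, 0, 0)
print(fits_before_reservation_end(start, timedelta(hours=10), end))  # False
```

With such a check, job 155 would have been held or rejected instead of running for 23 minutes and dying with TIMEOUT.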
