I see the same problem that Andy describes. We are testing 2.6.5 on our
test cluster.
As root:
# scontrol create reservation=bhms users=bhm nodes=c0-0 start=now duration=unlimited
Reservation created: bhms
# scontrol create reservation=staffs accounts=staff nodes=c2-0 start=now duration=unlimited
Reservation created: staffs
# scontrol show reservation
ReservationName=bhms StartTime=2014-01-17T10:49:23 EndTime=2015-01-17T10:49:23 Duration=365-00:00:00
   Nodes=c0-0 NodeCnt=1 CoreCnt=32 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   Users=bhm Accounts=(null) Licenses=(null) State=ACTIVE

ReservationName=staffs StartTime=2014-01-17T10:50:53 EndTime=2015-01-17T10:50:53 Duration=365-00:00:00
   Nodes=c2-0 NodeCnt=1 CoreCnt=4 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   Users=(null) Accounts=staff Licenses=(null) State=ACTIVE
As bhm:
$ /opt/slurm/bin/sbatch --reservation=bhms mal.sm
Submitted batch job 20
$ /opt/slurm/bin/sbatch --reservation=staffs mal.sm
Submitted batch job 21
$ bjob
JOBID  NAME    USER  ACCOUNT  PARTITI  QOS    ST  PRIORI  TIME  TIME_LEFT  CPUS  NOD  MIN_MEM  MIN_TMP  NODELIST(REASON)
21     mal.sm  bhm   staff    normal   staff  R   21000   2:55      57:05     1    1      500        0  c2-0
20     mal.sm  bhm   staff    normal   staff  PD  21001   0:00    1:00:00     1    1      500        0  (Reservation)
So the job requesting the staff account reservation starts, while the
job requesting the bhm user reservation does not.
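(For reference, the reason and the reservation a pending job asked for can also
be inspected directly; this is a generic check, nothing site-specific:

$ /opt/slurm/bin/scontrol show job 20 | grep -Ei 'JobState|Reservation'

which should report the same Reason=Reservation as the listing above.)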
slurmctld.log says:
[2014-01-17T10:56:54.021] backfill test for job 20
[2014-01-17T10:56:54.021] debug2: backfill: found new user 10231. Total #users now 1
[2014-01-17T10:56:54.021] backfill: completed testing 1 jobs, usec=21741
[2014-01-17T10:56:54.490] debug2: Testing job time limits and checkpoints
[2014-01-17T10:56:54.490] debug: sched: Running job scheduler
[2014-01-17T10:56:54.490] debug3: acct_policy_job_runnable: job 20: MPC: job_memory set to 500
[2014-01-17T10:56:54.490] No nodes satisfy job 20 requirements in partition normal
[2014-01-17T10:56:54.490] debug3: sched: JobId=20. State=PENDING. Reason=Reservation. Priority=21001. Partition=normal.
[2014-01-17T10:57:03.241] debug3: Processing RPC: REQUEST_JOB_INFO from uid=0
[2014-01-17T10:57:21.163] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=10231
[2014-01-17T10:57:21.163] debug2: _slurm_rpc_dump_partitions, size=482 usec=217
[2014-01-17T10:57:21.165] debug3: Processing RPC: REQUEST_NODE_INFO from uid=10231
[2014-01-17T10:57:21.171] debug2: Processing RPC: REQUEST_RESERVATION_INFO from uid=10231
[2014-01-17T10:57:24.493] debug2: Testing job time limits and checkpoints
[2014-01-17T10:57:24.494] debug2: Performing purge of old job records
[2014-01-17T10:57:40.643] debug3: Processing RPC: REQUEST_JOB_INFO from uid=10231
[2014-01-17T10:57:54.497] debug2: Testing job time limits and checkpoints
[2014-01-17T10:57:54.497] debug: sched: Running job scheduler
[2014-01-17T10:57:54.497] debug3: acct_policy_job_runnable: job 20: MPC: job_memory set to 500
[2014-01-17T10:57:54.497] No nodes satisfy job 20 requirements in partition normal
[2014-01-17T10:57:54.498] debug3: sched: JobId=20. State=PENDING. Reason=Reservation. Priority=21001. Partition=normal.
Increasing the priority of the job with
# scontrol update jobid=20 nice=-1000
does not help; the job still does not start.
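As a further sanity check (not a fix), one could verify that the reservation
still lists the user and that the uid matches the 10231 seen in the backfill
log, e.g.:

# scontrol show reservation bhms
$ id -u bhm

but given the output above I would expect both to look correct.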
--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo