We are running SLURM 2.6.1, and so far it has been working great. However,
we recently ran into what looks like a bug. We wanted to stop users from
using --exclusive, because many of them were requesting it when they didn't
actually need it. So we set the SHARED=FORCE option on the partition
(queue). We also have this configured:
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
So that should prevent collisions, and everyone should get the resources
they asked for. However, it does not appear to work that way. I submitted
the following job:
#!/bin/sh
#SBATCH -n 64
#SBATCH --ntasks-per-node=64
#SBATCH -t 20
#SBATCH --exclusive
#SBATCH --mem=10000
#SBATCH -p general
echo "Hello, World"
echo start
hostname
sleep 10m
echo end
Once it was running, the job looked like this:
[pedmon@itc011 slurm-testing]$ scontrol -dd show job 986116
JobId=986116 Name=sleep-test
UserId=pedmon(56483) GroupId=rc_admin(40273)
Priority=199305409 Account=cluster_users QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:04:54 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2013-09-24T12:57:50 EligibleTime=2013-09-24T12:57:50
StartTime=2013-09-24T12:58:01 EndTime=2013-09-24T13:18:11
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=general AllocNode:Sid=itc011:33593
ReqNodeList=(null) ExcNodeList=(null)
NodeList=holy2a02205
BatchHost=holy2a02205
NumNodes=1 NumCPUs=64 CPUs/Task=1 ReqS:C:T=*:*:*
Nodes=holy2a02205 CPU_IDs=0-31 Mem=10000
MinCPUsNode=64 MinMemoryNode=10000M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/n/home_rc/pedmon/slurm-testing/sleep-test
WorkDir=/n/home_rc/pedmon/slurm-testing
BatchScript=
#!/bin/sh
#SBATCH -n 64
#SBATCH --ntasks-per-node=64
#SBATCH -t 20
#SBATCH --exclusive
#SBATCH --mem=10000
#SBATCH -p general
echo "Hello, World"
echo start
hostname
sleep 10m
echo end
Each one of our nodes has 64 cores and 256 GB of RAM. With SHARED=FORCE
on, it should disable --exclusive but still honor the other options. Thus
I should get the whole node, since I requested all 64 cores and asked for
them all to land on the same node. However, when I look at the node I
landed on, I get:
Tasks: 1704 total, 14 running, 1689 sleeping, 0 stopped, 1 zombie
Cpu(s): 28.6%us, 0.9%sy, 0.0%ni, 70.0%id, 0.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 264498760k total, 90909832k used, 173588928k free, 81228k buffers
Swap: 8388600k total, 120688k used, 8267912k free, 30757332k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17952 jgarcia 20 0 1344m 757m 3000 R 96.0 0.3 3273:04 xillver-HR.x
18024 jgarcia 20 0 1344m 759m 3000 R 96.0 0.3 3269:57 xillver-HR.x
25336 chaoye 20 0 516m 495m 1656 R 96.0 0.2 174:31.83 dynslicing
31573 chaoye 20 0 226m 205m 1656 R 96.0 0.1 50:09.53 dynslicing
31574 chaoye 20 0 231m 210m 1656 R 96.0 0.1 50:09.47 dynslicing
53832 sglee 20 0 1314m 1.1g 4152 R 96.0 0.4 857:59.02 R
53894 sglee 20 0 1502m 1.3g 4084 R 96.0 0.5 857:42.19 R
53933 sglee 20 0 1486m 1.3g 4084 R 96.0 0.5 857:30.79 R
17890 jgarcia 20 0 1344m 759m 3008 R 94.3 0.3 3274:48 xillver-HR.x
18218 jgarcia 20 0 1344m 759m 3008 R 94.3 0.3 3268:29 xillver-HR.x
25337 chaoye 20 0 509m 488m 1656 R 94.3 0.2 174:31.79 dynslicing
29758 chaoye 20 0 358m 337m 1632 R 94.3 0.1 103:26.06 dynslicing
29759 chaoye 20 0 364m 343m 1632 R 94.3 0.1 103:26.09 dynslicing
21775 sstokes 20 0 28.9g 20g 41m S 82.0 8.3 153:33.34 MATLAB
37767 root 20 0 27124 2564 980 R 7.0 0.0 0:00.10 top
30910 root 20 0 0 0 0 S 1.7 0.0 1:14.49 ldlm_poold
If I look at which jobs are on that node, I get:
[root@holy-slurm01 log]# scontrol -dd show job | grep holy2a02205
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=0 Mem=8000
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=1 Mem=8000
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=4 Mem=8000
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=5 Mem=8000
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=12 Mem=4096
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=13 Mem=4096
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=14 Mem=4096
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=7 Mem=30000
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=6 Mem=30000
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=8 Mem=30000
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=10 Mem=16000
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=11 Mem=16000
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=9 Mem=16000
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=15 Mem=16000
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=2 Mem=16000
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=16 Mem=16000
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=29 Mem=200
NodeList=holy2a02205
BatchHost=holy2a02205
Nodes=holy2a02205 CPU_IDs=0-31 Mem=10000
As you can see, the node is oversubscribed. It does not look oversubscribed
in memory, but it is in cores: the CPU_IDs=0-31 assigned to my job overlap
cores that are already assigned to the other jobs. This is not good. We do
not want oversubscription of any resource. This seems to be a bug in the
code, unless of course there is something about the behavior of
SHARED=FORCE that we aren't understanding.
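
For reference, here is a rough sketch of how the overlap can be tallied from
the listing above rather than read off by eye. It assumes the per-node detail
lines look exactly like the ones shown (Nodes=<name> CPU_IDs=<list> Mem=<MB>)
and that each job sits on a single node, so a multi-node entry such as
Nodes=name[01-02] would be skipped. The function names shared_cores and
total_mem are made up for this example; only standard awk and sort are used.

```shell
#!/bin/sh
# Sketch only: both functions read `scontrol -dd show job` output on stdin.
# shared_cores and total_mem are hypothetical helper names, not SLURM tools.

# Print "core_id job_count" for every core on NODE claimed by more than one job.
shared_cores() {
    node="$1"
    awk -v node="$node" '
        # Per-node detail line, e.g.: Nodes=holy2a02205 CPU_IDs=0-31 Mem=10000
        $1 == "Nodes=" node {
            for (i = 2; i <= NF; i++) {
                if ($i !~ /^CPU_IDs=/) continue
                ids = substr($i, 9)            # drop the "CPU_IDs=" prefix
                n = split(ids, parts, ",")     # "0-3,8" -> "0-3" and "8"
                for (j = 1; j <= n; j++) {
                    m = split(parts[j], r, "-")
                    lo = r[1] + 0
                    hi = (m == 2) ? r[2] + 0 : lo
                    for (c = lo; c <= hi; c++) count[c]++
                }
            }
        }
        END {
            for (c in count) if (count[c] > 1) print c, count[c]
        }
    ' | sort -n
}

# Print the sum of the Mem= values (in MB) of all jobs on NODE.
total_mem() {
    node="$1"
    awk -v node="$node" '
        $1 == "Nodes=" node {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^Mem=/) sum += substr($i, 5)   # drop "Mem="
        }
        END { print sum + 0 }
    '
}
```

Usage would be along the lines of
    scontrol -dd show job | shared_cores holy2a02205
    scontrol -dd show job | total_mem holy2a02205
Run against the listing above, this should report 17 cores claimed by two
jobs each and 240488 MB requested in total, which is below the roughly
258300 MB the node reports in top.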
Is this fixed in a newer version of SLURM? Does anyone have any ideas?
-Paul Edmon-