Folks,
My goal is to run a parallel job on a cluster of KNL nodes that all
share the same cluster mode *and* the same memory mode.
As a first step, I made a simple prototype with 8 nodes and the
following four features: north, east, west and south.
Each node is part of one quadrant (a pair of features), and there are
two nodes per quadrant.

From my slurm.conf (*):
# COMPUTE NODES
NodeName=n[0-1] Procs=4 State=UNKNOWN Feature=north,east
NodeName=n[2-3] Procs=4 State=UNKNOWN Feature=south,east
NodeName=n[4-5] Procs=4 State=UNKNOWN Feature=south,west
NodeName=n[6-7] Procs=4 State=UNKNOWN Feature=north,west
PartitionName=debug Nodes=n[0-7] Default=YES MaxTime=INFINITE State=UP
$ sinfo -o "%30N %20b %f"
NODELIST                       ACTIVE_FEATURES      AVAIL_FEATURES
n[0-1]                         north,east           north,east
n[2-3]                         south,east           south,east
n[4-5]                         south,west           south,west
n[6-7]                         north,west           north,west
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      8   idle n[0-7]
My submission command is:
salloc -N 2 -C '[north|south]&[east|west]' ./hello.sh
The hello.sh script simply displays the node list
(e.g. echo $SLURM_NODELIST).
At first, n[0-1] are allocated (i.e. the north-east quadrant) => OK
Then I make n0 unavailable, and n[6-7] are allocated (i.e. the north-west
quadrant) => OK
Then I make n6 unavailable, and n[1,7] are allocated (i.e. one node is
north-east and the other node is north-west) => KO
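
To spell out the expected behaviour, the placement rule can be sketched
as a small shell check. This is only an illustration and not part of
Slurm: the helper names features_of and same_quadrant are hypothetical,
and the node-to-feature mapping is copied by hand from the slurm.conf
excerpt above.

```shell
#!/bin/sh
# Sketch only: features_of and same_quadrant are made-up helpers.
# The mapping below mirrors the Feature= entries of the slurm.conf excerpt.
features_of() {
    case "$1" in
        n0|n1) echo "north,east" ;;
        n2|n3) echo "south,east" ;;
        n4|n5) echo "south,west" ;;
        n6|n7) echo "north,west" ;;
        *) return 1 ;;   # unknown node name
    esac
}

# same_quadrant NODE...
# Returns 0 if all nodes have identical features, 1 if they differ,
# 2 if a node name is unknown.
same_quadrant() {
    sq_first=$(features_of "$1") || return 2
    shift
    for sq_node in "$@"; do
        sq_feat=$(features_of "$sq_node") || return 2
        [ "$sq_feat" = "$sq_first" ] || return 1
    done
    return 0
}

# The three allocations observed above.
for nodes in "n0 n1" "n6 n7" "n1 n7"; do
    # Unquoted on purpose: each entry is a space-separated node list.
    if same_quadrant $nodes; then
        echo "$nodes: same quadrant"
    else
        echo "$nodes: different quadrants"
    fi
done
```

Run as is, it reports "n0 n1" and "n6 n7" as the same quadrant and
"n1 n7" as different quadrants, which matches the OK/KO results above.
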
Is there something wrong with my command line?
Or is this a bug?

FWIW, I was unsuccessful using parentheses:
$ salloc -N 2 -C '([north|south])&([east|west])' ./hello.sh
salloc: error: Job submit/allocate failed: Invalid feature specification

(*)
I noted that the man page suggests AvailableFeatures and ActiveFeatures
can be set with scontrol.
My initial plan was to run
scontrol update NodeName=n[0-7] AvailableFeatures=north,east,west,south
and then
scontrol update NodeName=n[0-1] ActiveFeatures=north,east
...
Both commands seem to work, but all available features end up active:
$ sinfo -o "%30N %30b %f"
NODELIST                       ACTIVE_FEATURES                AVAIL_FEATURES
n[0-1]                         north,east,west,south          north,east,west,south
Did I interpret the man pages correctly?
If yes, is this a bug?

Thanks in advance for your help,

Gilles