Hi all,
On Friday afternoon I accidentally upgraded Slurm from 14.03.9 to 14.11.1
(I only wanted to compile the new version, but a symlink got changed and
the new version was started). My users were still using the older version
of the tools. Since Monday (but probably ever since the update) users have
not been able to submit jobs or allocate resources.
I was able to narrow the problem down to this: with "-p <default
partition>", or without "-p" at all, resources are allocated. But when a
non-default partition is selected, jobs get stuck with "Required node
not available (down or drained)".
See the output below for an example. My default partition is called "Murks".
Has anyone run into this situation before? Any advice? I can provide
configuration and log files (debug=9) if needed.
Best regards,
Uwe
### sinfo ###
PARTITION      AVAIL TIMELIMIT NODES STATE NODELIST
Simtech        up    infinite     20 idle  n[510301,510401,510501,510601,510701,510901,511001,511101,511201,511301,511601,511701,511801,511901,512001,512201,512301,512401,512501,512601]
Theochem_small up    infinite      8 idle  n[512901,513001,513101,513201,513301,513501,513601,520301]
Theochem_large up    infinite     26 idle  n[520401,520501,520601,520701,520901,521001,521101,521201,521301,521601,521701,521801,521901,522001,522201,522301,522401,522501,522601,522901,523001,523101,523201,523301,523501,523601]
Murks*         up    1:00:00       2 idle  n[523501,523601]
### several srun tries ###
$srun -N1 hostname
n523501
$srun -N2 hostname
n523501
n523601
$srun -w n523501 hostname
n523501
$srun -w n523601 hostname
n523601
$srun -p Murks -N2 hostname
n523501
n523601
$srun -p Theochem_large -N1 hostname
srun: Required node not available (down or drained)
srun: job 9239 queued and waiting for resources
^Csrun: Job allocation 9239 has been revoked
srun: Force Terminated job 9239
$srun -p Theochem_small -N1 hostname
srun: Required node not available (down or drained)
srun: job 9241 queued and waiting for resources
^Csrun: Job allocation 9241 has been revoked
$srun -p Simtech -N1 hostname
srun: Required node not available (down or drained)
srun: job 9240 queued and waiting for resources
^Csrun: Job allocation 9240 has been revoked
srun: Force Terminated job 9240
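### commands to collect more detail ###
If it helps, this is a rough sketch of how I would collect the partition and node state for the failing partitions. It assumes the standard scontrol, sinfo and squeue tools are in the PATH; the helper name collect_partition_info is made up for this example, and I have not included any output here.

```shell
# Non-default partitions from the sinfo output above (Murks, the default, works).
partitions="Simtech Theochem_small Theochem_large"

# Hypothetical helper: show the partition definition and the per-node state
# as the controller currently sees them.
collect_partition_info() {
    scontrol show partition "$1"
    sinfo -p "$1" -N -l
}

# Only query the controller where the Slurm client tools are installed.
if command -v scontrol >/dev/null 2>&1; then
    for p in $partitions; do
        collect_partition_info "$p"
    done
    # Reasons recorded for any down or drained nodes.
    sinfo -R
    # Pending jobs together with the reason the scheduler gives for each.
    squeue -t PENDING -o "%i %P %T %R"
fi
```

Comparing the "scontrol show partition" output of a failing partition with that of Murks might show whether the partition definitions are read differently after the upgrade.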