We've been having some trouble with curious job failures; the jobs aren't
even assigned nodes:

       JobID        NodeList      State ExitCode
------------ --------------- ---------- --------
     7229124   None assigned     FAILED      0:1
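
For reference, output in that shape can be pulled with sacct along these
lines (the format string here is my reconstruction of the fields shown,
not necessarily the exact one used):

    sacct -j 7229124 --format=JobID,NodeList,State,ExitCode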

We finally got some better log data (I'd had the log level set far too
low), which suggests that restarting and/or reconfiguring the controller
is at the root of it.  After some preliminaries (purging job records,
recovering active jobs) there will be messages like these:
[2014-06-09T23:10:15.920] No nodes satisfy job 7228909 requirements in
partition full
[2014-06-09T23:10:15.920] sched: schedule: JobId=7228909 non-runnable:
Requested node configuration is not available

The indicated job specified --mem and --tmp, but the values are within
the capacities of every node in that "full" partition.  Normally, if a
user requests resources exceeding what the nodes in a partition offer,
the submission is rejected outright.  It appears this failure only
occurs for jobs with memory and/or disk constraints.  Worse yet, it's
not consistent; it only seems to happen sometimes, and I cannot
reproduce it in our test environment.
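
For context, the affected jobs are submitted with something along these
lines (the values and script name are illustrative, not the actual job,
and sit well within what the nodes advertise):

    # illustrative request: 24 GB memory, 10 GB scratch, "full" partition
    sbatch -p full --mem=24000 --tmp=10000 job.sh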

A typical node configuration line looks like this:

NodeName=gizmod[51-60] Sockets=2 CoresPerSocket=6 RealMemory=48000
Weight=10 Feature=full,restart,rx200,ssd

though I've got FastSchedule=0.  Honestly, it *feels* like there's a
window after the restart/reconfigure where the node data hasn't been
fully reported by slurmd yet, so the scheduler doesn't see any nodes
that satisfy the requirements.
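
Since FastSchedule=0 means the scheduler trusts the values slurmd
reports rather than the slurm.conf figures, it should be telling to dump
what the controller currently believes about a node when this happens
(gizmod51 below is just one example node; the grep is only to trim the
output):

    scontrol show node gizmod51 | grep -E 'RealMemory|TmpDisk|State'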

Thanks all...

Michael
