Occasionally, a node will be correctly set to drain on failure of the prolog and the job will be requeued in HOLD state.
However occasionally, if the node state is set back to state=resume without doing anything to the node itself, subsequent jobs submitted to the node which also result in failure of the prolog will run. Details can be found in Slurm Bugzilla Bug 2254[1] If anybody else has any bright ideas I'd sure be grateful for your input while waiting for SchedMD to respond. Thanks, -- Trevor [1] - http://bugs.schedmd.com/show_bug.cgi?id=2254
