On 10/21/2013 06:08 PM, Marcin Stolarek wrote:
2013/10/21 Lennart Karlsson <[email protected]>
Hi,
I have set the configuration parameter JobRequeue to zero, so failed
jobs should not automatically requeue and rerun:
# scontrol show config|grep -i requeue
JobRequeue = 0
#
But still jobs are rerun:
[2013-10-18T13:53:25.556] sched: Allocate JobId=4451116 NodeList=q4 #CPUs=8
[2013-10-18T16:39:08.952] Batch JobId=4451116 missing from node 0
[2013-10-18T16:39:08.952] completing job 4451116
[2013-10-18T16:39:08.952] Job 4451116 cancelled from interactive user
[2013-10-18T16:39:08.957] Requeue JobId=4451116 due to node failure
[2013-10-18T16:39:08.957] sched: job_complete for JobId=4451116
successful, exit code=4294967294
[2013-10-18T16:39:08.958] Node q4 unexpectedly rebooted
[2013-10-20T07:02:30.080] sched: Allocate JobId=4451116 NodeList=q3 #CPUs=8
How can I stop this from happening? (Most of the time the "node failure" is
really the job exceeding its memory limit, and it will do so again on the next try.)
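[Editor's note, not part of the original thread: besides the cluster-wide
JobRequeue parameter in slurm.conf, requeue behaviour can also be set per
job. A hedged sketch, assuming a batch script named job.sh and the job id
from the log above; check the sbatch and scontrol man pages for your
version:]

```shell
# Submit a batch job that must never be requeued, regardless of the
# site-wide JobRequeue default:
sbatch --no-requeue job.sh

# Clear the requeue flag on a job that is already submitted or running:
scontrol update JobId=4451116 Requeue=0
```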
So why are you using a configuration that allows allocation of more RAM
than is really available?
cheers,
marcin
Hi Marcin,
On normal nodes, we do not allow that. But the user has asked for the fattest
nodes, and there he is allowed to use them up to the full limit of internal
memory plus swap space.
But the problem is that the job is rerun even though I have specified that
this must not happen. In most cases it is a waste.
Our SLURM version is 2.6.2.
Thanks,
-- Lennart Karlsson, UPPMAX, Uppsala University, Sweden
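[Editor's note, not part of the original thread: if the underlying cause is
jobs overrunning their memory request, memory can be confined with the
task/cgroup plugin so the job is killed on the first run instead of
triggering a "node failure" and a requeue. A sketch only; whether every
parameter below is available in SLURM 2.6 should be verified against the
cgroup.conf man page for that release:]

```
# slurm.conf
TaskPlugin=task/cgroup

# cgroup.conf
# Limit each job step's RAM to its allocation:
ConstrainRAMSpace=yes
# Also limit swap usage (omit if fat-node users should keep swap access):
ConstrainSwapSpace=yes
```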