2013/10/21 Lennart Karlsson <[email protected]>

>
> Hi,
>
> I have set the configuration parameter JobRequeue to zero, so failed
> jobs should not automatically requeue and rerun:
> # scontrol show config|grep -i requeue
> JobRequeue              = 0
> #
>
> But still jobs are rerun:
> [2013-10-18T13:53:25.556] sched: Allocate JobId=4451116 NodeList=q4 #CPUs=8
> [2013-10-18T16:39:08.952] Batch JobId=4451116 missing from node 0
> [2013-10-18T16:39:08.952] completing job 4451116
> [2013-10-18T16:39:08.952] Job 4451116 cancelled from interactive user
> [2013-10-18T16:39:08.957] Requeue JobId=4451116 due to node failure
> [2013-10-18T16:39:08.957] sched: job_complete for JobId=4451116
> successful, exit code=4294967294
> [2013-10-18T16:39:08.958] Node q4 unexpectedly rebooted
> [2013-10-20T07:02:30.080] sched: Allocate JobId=4451116 NodeList=q3 #CPUs=8
>
>
> How can I stop this from happening? (Most times the "node failure" is
> because the job exceeded memory limits, and will do so also on next try.)
>
So why are you using a configuration that allows allocating more RAM
than is really available?
cheers,
marcin
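
To see how often jobs are requeued this way, the controller log can be scanned for the "Requeue JobId=... due to ..." line quoted above. This is only a minimal sketch: it assumes the log lines look exactly like the excerpt in this thread, and `find_requeues` is a hypothetical helper name, not part of Slurm.

```python
import re

# Pattern modelled on the log line quoted in this thread, e.g.
# "[2013-10-18T16:39:08.957] Requeue JobId=4451116 due to node failure"
# (assumption: other Slurm versions may word this line differently).
REQUEUE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+Requeue JobId=(?P<job>\d+) due to (?P<reason>.+)$"
)


def find_requeues(lines):
    """Return (timestamp, job_id, reason) tuples for every requeue line."""
    hits = []
    for line in lines:
        # Drop e-mail quote markers ("> ") and trailing whitespace, if any.
        cleaned = line.lstrip("> ").rstrip()
        match = REQUEUE_RE.match(cleaned)
        if match:
            hits.append(
                (match.group("ts"), int(match.group("job")), match.group("reason"))
            )
    return hits
```

Feeding it the lines of the controller log (for example `find_requeues(open(path))`, where `path` is wherever your site keeps that log) gives one tuple per requeue event, which can then be grouped by job ID or by reason.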
