2013/10/21 Lennart Karlsson <[email protected]>
> Hi,
>
> I have set the configuration parameter JobRequeue to zero, so failed
> jobs should not automatically requeue and rerun:
> # scontrol show config|grep -i requeue
> JobRequeue = 0
> #
>
> But still jobs are rerun:
> [2013-10-18T13:53:25.556] sched: Allocate JobId=4451116 NodeList=q4 #CPUs=8
> [2013-10-18T16:39:08.952] Batch JobId=4451116 missing from node 0
> [2013-10-18T16:39:08.952] completing job 4451116
> [2013-10-18T16:39:08.952] Job 4451116 cancelled from interactive user
> [2013-10-18T16:39:08.957] Requeue JobId=4451116 due to node failure
> [2013-10-18T16:39:08.957] sched: job_complete for JobId=4451116
> successful, exit code=4294967294
> [2013-10-18T16:39:08.958] Node q4 unexpectedly rebooted
> [2013-10-20T07:02:30.080] sched: Allocate JobId=4451116 NodeList=q3 #CPUs=8
>
>
> How can I stop this from happening? (Most times the "node failure" is
> because the job exceeded memory limits, and will do so also on next try.)
>

So why are you using a configuration that allows allocating more RAM than
is really available?

cheers,
marcin
