Do you have either of these configured in slurm.conf?
$ scontrol show config | grep Requeue
RequeueExit = (null)
RequeueExitHold = (null)
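If either were set, lines along these lines in slurm.conf (the exit-code
ranges here are only illustrative) would automatically requeue batch jobs
exiting with the listed codes, with RequeueExitHold additionally leaving
them in a held state:

RequeueExit=1-9
RequeueExitHold=100-199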
Quoting Sean Blanton <s...@blanton.com>:
Yes, this is a big problem for us after upgrading from 2.6.1 to 14.11.7. We
have had to schedule a cron job every 5 minutes to release held jobs. We
notice this may happen when there is a WIFEXITED status of zero, but
haven't nailed down the cause(s). When we simply release the jobs, they
requeue and most often succeed normally. We are wondering if there is a
configuration setting that would cause them to requeue without being held.
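For reference, the 5-minute release can be done with a cron entry roughly
like this (a sketch only: matching on the "held state" reason string is an
assumption about how squeue reports it, and jobs are released one at a time):

*/5 * * * * squeue -h -t PD -o "%i %r" | grep -i "held state" | awk '{print $1}' | xargs -r -n1 scontrol release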
Regards,
Sean
Sean Blanton
s...@blanton.com
On Tue, Jul 7, 2015 at 4:44 AM, Qianqian Sha <qqsha0...@gmail.com> wrote:
Hi, Anatoliy
Maybe you can try "scontrol release <job_id>".
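For the job quoted below (ID 8) that would simply be:

$ scontrol release 8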
A node failure may cause batch jobs to requeue.
2015-07-07 4:55 GMT+08:00 Anatoliy Kovalenko <tolik.kovale...@gmail.com>:
Hello. We have a job that shows the reason "job requeued in held state":
JOBID  PARTITION  NAME  USER  ST  TIME  NODES  NODELIST(REASON)
    8  part1      test  bob   PD  0:00  1      (job requeued in held state)
What does it mean? Other jobs run fine, but this one is hung, and scontrol
resume/requeue doesn't help. In Slurm's log we see:
[2015-07-06T20:31:06.126] _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
[2015-07-06T20:31:06.126] _slurm_rpc_requeue: 8: Job is pending execution
[2015-07-06T20:31:18.469] Processing RPC: REQUEST_SUSPEND(resume) from uid=0
[2015-07-06T20:31:18.469] _slurm_rpc_suspend(resume) for 8 Job is pending execution
What can we do to continue execution without breaking or cancelling the job?
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support