Do you have either of these configured in slurm.conf?

$ scontrol show config | grep Requeue
RequeueExit             = (null)
RequeueExitHold         = (null)

Quoting Sean Blanton <s...@blanton.com>:
Yes, this has been a big problem for us since upgrading from 2.6.1 to
14.11.7. We have had to schedule a cron job every 5 minutes to release held
jobs. We have noticed that this may happen when there is a WIFEXITED status
of zero, but we haven't nailed down the cause(s). Once released, the jobs
requeue and most often succeed normally. We are wondering if there is a
configuration setting that would cause them to requeue without being held.
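
A minimal sketch of the kind of cron workaround described above, not the
actual script used at any site. It assumes that squeue's %r field prints the
reason text exactly as it appears in the squeue output quoted in this thread
("job requeued in held state"); the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: release pending jobs whose squeue reason is
# "job requeued in held state". Function names are hypothetical, and the
# exact reason string is an assumption taken from the output in this thread.

# Read "jobid|reason" lines on stdin and print the IDs of matching jobs.
held_requeued_ids() {
    awk -F'|' '$2 == "job requeued in held state" { print $1 }'
}

# List pending jobs (no header, job ID and reason only), then release
# each match with "scontrol release", as suggested elsewhere in the thread.
release_requeued_held() {
    squeue -h -t PD -o '%i|%r' | held_requeued_ids |
    while read -r job_id; do
        scontrol release "$job_id"
    done
}

# Only act when explicitly asked to, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    release_requeued_held
fi
```

Run from cron with the --run argument. Releasing blindly hides whatever caused
the hold, so this is a stopgap until the underlying cause is found.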

Regards,
Sean

Sean Blanton
s...@blanton.com

On Tue, Jul 7, 2015 at 4:44 AM, Qianqian Sha <qqsha0...@gmail.com> wrote:

 Hi, Anatoliy

Maybe you can try "scontrol release <job_id>".
A node failure may cause batch jobs to be requeued.


2015-07-07 4:55 GMT+08:00 Anatoliy Kovalenko <tolik.kovale...@gmail.com>:

 Hello. We have a job that shows the reason "job requeued in held state":
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      8     part1     test      bob PD       0:00      1 (job requeued in held state)
What does it mean? Other jobs run fine, but this one hangs. Neither
scontrol resume nor scontrol requeue helps. In Slurm's log we see:
[2015-07-06T20:31:06.126] _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
[2015-07-06T20:31:06.126] _slurm_rpc_requeue: 8: Job is pending execution
[2015-07-06T20:31:18.469] Processing RPC: REQUEST_SUSPEND(resume) from uid=0
[2015-07-06T20:31:18.469] _slurm_rpc_suspend(resume) for 8 Job is pending execution
What can we do to let the job continue running without breaking or cancelling it?





--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
