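Hi Anatoliy,

Maybe you can try "scontrol release <job_id>". The reason shown by squeue says the job is held, so it will stay pending until the hold is released. A node failure may cause batch jobs to be requeued.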
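A rough sketch of what I mean, assuming job ID 8 from your squeue output. The "run" wrapper and the DRY_RUN variable are hypothetical helpers I added only for illustration, so the commands are printed first instead of being executed; I have not tried this on your cluster.

```shell
#!/bin/sh
# Sketch: release a job that squeue reports as "job requeued in held state".
# JOB_ID defaults to 8, the job from the squeue output quoted in this thread.
JOB_ID=${JOB_ID:-8}

# Hypothetical helper (not part of Slurm): with DRY_RUN=1, the default here,
# the command is only printed; with DRY_RUN=0 it is actually executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Show job ID, state and pending reason before the release.
run squeue -j "$JOB_ID" -o "%i %T %r"

# Release the hold so the scheduler can consider the job again.
run scontrol release "$JOB_ID"

# Check the state and reason again afterwards.
run squeue -j "$JOB_ID" -o "%i %T %r"
```

Run it once as is to see the commands, then with DRY_RUN=0 as root or as the job owner to execute them. If the job was held by an administrator, a regular user may not be allowed to release it.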
2015-07-07 4:55 GMT+08:00 Anatoliy Kovalenko <tolik.kovale...@gmail.com>:
> Hello. We have a job that has a "job requeued in held state".
>
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
>     8 part1     test bob  PD 0:00     1 (job requeued in held state)
>
> What does it mean? Other tasks work well, but this task hangs. scontrol
> resume/requeue doesn't help. In Slurm's log we see:
>
> [2015-07-06T20:31:06.126] _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
> [2015-07-06T20:31:06.126] _slurm_rpc_requeue: 8: Job is pending execution
> [2015-07-06T20:31:18.469] Processing RPC: REQUEST_SUSPEND(resume) from uid=0
> [2015-07-06T20:31:18.469] _slurm_rpc_suspend(resume) for 8 Job is pending execution
>
> What can we do to continue execution without breaking or cancelling the job?