Hi, Anatoliy

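You could try "scontrol release <job_id>". The job was requeued in a held
state, so it stays pending until the hold is released; "scontrol resume" and
"scontrol requeue" do not clear a hold.
A node failure can cause batch jobs to be requeued.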
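
If more than one job ended up like this, a small filter over the squeue
listing can pick them out. This is only a rough sketch: it assumes the default
squeue layout from your mail (job ID in the first column, reason in
parentheses at the end), the function names held_job_ids and
print_release_commands are made up for illustration, and it only prints the
release commands rather than running them.

```shell
# Print the IDs of jobs whose squeue reason is "job requeued in held state".
# Reads squeue's default output on stdin (assumed layout: job ID is column 1).
held_job_ids() {
    awk '/\(job requeued in held state\)/ { print $1 }'
}

# Dry run: turn each job ID on stdin into a release command and print it.
# Nothing is executed; review the output before running it by hand.
print_release_commands() {
    while read -r id; do
        echo "scontrol release $id"
    done
}

# On the cluster (as root or the Slurm admin user), one might run:
#   squeue | held_job_ids | print_release_commands
```

After releasing a job, "scontrol show job <job_id>" should show whether it is
still held.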


2015-07-07 4:55 GMT+08:00 Anatoliy Kovalenko <tolik.kovale...@gmail.com>:

>  Hello. We have a job whose state is "job requeued in held state".
>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>                  8      part1    test   bob PD       0:00      1 (job requeued in held state)
> What does it mean? Other tasks work well, but this task hangs. scontrol
> resume/requeue doesn't help. In Slurm's log we see:
> [2015-07-06T20:31:06.126] _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
> [2015-07-06T20:31:06.126] _slurm_rpc_requeue: 8: Job is pending execution
> [2015-07-06T20:31:18.469] Processing RPC: REQUEST_SUSPEND(resume) from uid=0
> [2015-07-06T20:31:18.469] _slurm_rpc_suspend(resume) for 8 Job is pending execution
> What can we do to continue execution without breaking or cancelling the job?
>
