[slurm-dev] Re: special job error state

Christopher Samuel Sat, 21 Sep 2013 12:19:03 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 20/09/13 17:17, Stu Midgley wrote:


> Chris, I think you have focussed on the wrong issue.  The issue
> isn't system problems etc.

Ahh, OK, sorry, I saw the bit about the issues usually being system
problems and ran with that.

So it sounds like what you'd like is some way for the job to
requeue itself on demand, rather than just on a node failure, and for
the requeued job's initial state to be held rather than pending?

There is an "scontrol requeue $JOBID" command, which will give you the
first part. It works as long as the job is requeuable (submitted with
--requeue, or you can update that on the fly with
"scontrol update job=$JOBID requeue=1").
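
As a rough illustration of the two commands above, here is a minimal
sketch of a wrapper. The function name requeue_job and the SCONTROL
override variable are hypothetical, added only so the sketch can be
dry-run without a cluster; the scontrol invocations themselves are the
ones quoted above.

```shell
#!/bin/sh
# Sketch only: requeue_job and SCONTROL are illustrative names, not part of Slurm.
# SCONTROL defaults to the real scontrol binary; point it at "echo" for a dry run.
SCONTROL="${SCONTROL:-scontrol}"

requeue_job() {
    jobid="$1"
    # Make the job requeuable in case it was not submitted with --requeue.
    "$SCONTROL" update job="$jobid" requeue=1 || return 1
    # Put the job back in the queue. It comes back pending, not held.
    "$SCONTROL" requeue "$jobid"
}

# Example (inside an allocation or as the job owner):
#   requeue_job "$SLURM_JOB_ID"
```

Note that this only covers the "requeue on demand" half; the job is
still eligible to run again straight away.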

Problem is I can't see a way at present for that requeued job to be
held. Slurm will just start it again as soon as it is eligible, and the
new run will immediately overwrite the Slurm output file, so you'll
have lost any diagnostics your script output. :-(

The restarted job will have SLURM_RESTART_COUNT set to the appropriate
number, but I can't see that helping. I tested requeuing a job that
was submitted as held initially (sbatch --hold), but that doesn't make
any difference - after doing the scontrol release and then scontrol
requeue it will just start again.
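
SLURM_RESTART_COUNT will not hold the job, but a batch script could
perhaps use it to avoid losing diagnostics when the output file is
overwritten. The following is a minimal sketch under that assumption;
the helper name attempt_log and the log file naming scheme are
hypothetical, and it relies on SLURM_RESTART_COUNT being unset on the
first run.

```shell
#!/bin/sh
# Sketch only: attempt_log is an illustrative helper, not part of Slurm.
# Build a log file name that differs for every (re)start of the job.
# SLURM_JOB_ID and SLURM_RESTART_COUNT come from Slurm's job environment;
# the restart count is assumed unset on the first run, so default it to 0.
attempt_log() {
    echo "job-${SLURM_JOB_ID:-unknown}.attempt-${SLURM_RESTART_COUNT:-0}.log"
}

# In a batch script, all further output could then be redirected with:
#   exec >"$(attempt_log)" 2>&1
```

This keeps one file per attempt, so a requeued run would write to a new
file instead of clobbering the previous run's diagnostics. It does
nothing about the job restarting immediately.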

Moe, et al., how easy would it be to have some form of:

scontrol requeue --hold $JOBID

?

cheers,
Chris
- -- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlI98MIACgkQO2KABBYQAh+YdgCgiCl3UTP3acTxX3oFFfSLDq9F
mtEAn3ev/Iyo5PTJBDRqQdt3UJkNzRGX
=fd8R
-----END PGP SIGNATURE-----
