On 20/09/13 17:17, Stu Midgley wrote:
> Chris, I think you have focussed on the wrong issue. The issue
> isn't system problems etc.

Ahh, OK, sorry, I saw the bit about the issues usually being system
problems and ran with that.

So it sounds like what you'd like is some way for the job to requeue
itself on demand, rather than just on a node failure, and for the
requeued job's initial state to be held rather than pending?

There is an "scontrol requeue $JOBID" which will work as long as the
job is requeueable (submitted with --requeue, or you can update that on
the fly with "scontrol update job=$JOBID requeue=1"), which will give
you the first part.

The problem is that I can't see a way at present for that requeued job
to be held. It'll just be started again by Slurm when it is eligible and
will immediately overwrite the Slurm output file, so you'll have lost
any diagnostics your script output. :-(

The restarted job will have SLURM_RESTART_COUNT set to the appropriate
number, but I can't see that helping.

I tested requeuing a job that was submitted as held initially
(sbatch --hold), but that doesn't make any difference: after doing the
scontrol release and then scontrol requeue it will just start again.

Moe, et al., how easy would it be to have some form of:

  scontrol requeue --hold $JOBID

?

cheers,
Chris
--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
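
PS: in case it's useful, here is a rough sketch of how the two scontrol
commands mentioned in this mail could be wrapped together. It only covers
the "requeue on demand" half; the requeued job still goes back to pending
rather than held. The function name requeue_job and the SCONTROL variable
are made up for illustration (they are not part of Slurm), and I'm
assuming scontrol is on the PATH of whoever runs it.

```sh
#!/bin/sh
# Sketch only: requeue a job on demand using the two scontrol commands
# discussed in this mail. "requeue_job" and "SCONTROL" are hypothetical
# names chosen for this example, not Slurm features.

# Allow the scontrol binary to be overridden (e.g. for a dry run).
SCONTROL="${SCONTROL:-scontrol}"

requeue_job() {
    jobid="$1"
    if [ -z "$jobid" ]; then
        echo "usage: requeue_job JOBID" >&2
        return 2
    fi

    # Mark the job requeueable, in case it was not submitted with --requeue.
    "$SCONTROL" update job="$jobid" requeue=1 || return 1

    # Ask Slurm to requeue the job. It returns to pending, not held, so
    # Slurm will start it again as soon as it is eligible.
    "$SCONTROL" requeue "$jobid"
}
```

You'd call it as "requeue_job $JOBID" from a login node, or from inside
the job itself with "requeue_job $SLURM_JOB_ID", assuming the node the
job runs on is allowed to talk to the controller.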