-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Stu,

On 19/09/13 17:19, Stu Midgley wrote:

> SGE has a special job error state of 100 (ie. exit 100) which puts
> the job in E state in the queue.

The first talk of the day today at the Slurm User Group was on fault
tolerance coming in future versions of Slurm and it seems to me that
using that framework to allow a job/user to report a node as bad
should be possible.

The slides are here:

http://slurm.schedmd.com/SUG13/nonstop.pdf

I suspect it'd be something that would need to be explicitly enabled
by a config option though, I reckon many sites would have conniptions
if users were able to take nodes out at random. ;-)

cheers,
Chris
- -- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlI7xNAACgkQO2KABBYQAh9fAACdEgLQXJILOxU2o+e0mhsgVIvu
CgEAn1f1qmJfOSKB+b3IHa5ulUUr4s+s
=2SHx
-----END PGP SIGNATURE-----