-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Stu,

On 19/09/13 17:19, Stu Midgley wrote:

> SGE has a special job error state of 100 (ie. exit 100) which puts
> the job in E state in the queue.
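
For anyone who hasn't used that SGE convention, here is a minimal
sketch of a job wrapper that relies on it. The function name
run_with_node_check and the idea of a site-supplied health check
command are hypothetical, for illustration only. The one thing taken
from Grid Engine is that exit status 100 puts the job into the E state.

```python
import subprocess

# Grid Engine convention: a job exiting with status 100 is put in the E state.
SGE_JOB_ERROR_EXIT = 100


def run_with_node_check(health_check, payload):
    """Run `payload` only if `health_check` succeeds.

    Both arguments are argv-style lists. The health check command is a
    hypothetical, site-specific script; nothing here is part of SGE or Slurm.
    Returns the exit status the job script should finish with.
    """
    if subprocess.call(health_check) != 0:
        # The node looks bad, so ask SGE to flag the job instead of
        # letting it fail like an ordinary application error.
        return SGE_JOB_ERROR_EXIT
    # The node looks fine: run the real work and pass its status through.
    return subprocess.call(payload)
```

A job script would finish with something like
sys.exit(run_with_node_check(check_cmd, job_cmd)). Note that a payload
which itself exits 100 would also land the job in the E state.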

The first talk of the day today at the Slurm User Group was on the
fault tolerance support coming in future versions of Slurm, and it
seems to me that it should be possible to use that framework to let a
job/user report a node as bad.

The slides are here:

http://slurm.schedmd.com/SUG13/nonstop.pdf

I suspect it'd be something that would need to be explicitly enabled
by a config option though; I reckon many sites would have conniptions
if users were able to take nodes out at random. ;-)

cheers,
Chris
- -- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlI7xNAACgkQO2KABBYQAh9fAACdEgLQXJILOxU2o+e0mhsgVIvu
CgEAn1f1qmJfOSKB+b3IHa5ulUUr4s+s
=2SHx
-----END PGP SIGNATURE-----
