Quoting Christopher Samuel <[email protected]>:


Hi Stu,

On 19/09/13 17:19, Stu Midgley wrote:

SGE has a special job exit status of 100 (i.e. exit 100), which puts
the job in the E (error) state in the queue.
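
As a rough illustration of the SGE behaviour described above, here is a minimal sketch of a job wrapper that exits with status 100 when a node health check fails. The helper names (`node_looks_healthy`, `job_exit_status`) and the idea of passing the health check as a command are assumptions made for this example, not part of SGE or Slurm.

```python
import subprocess

# Exit status that SGE interprets as "put this job in the E (error) state".
SGE_JOB_ERROR_EXIT = 100


def node_looks_healthy(check_cmd):
    """Hypothetical health check: the node is OK if check_cmd exits 0."""
    try:
        return subprocess.run(check_cmd, check=False).returncode == 0
    except OSError:
        # The check command could not be started at all; treat the node as bad.
        return False


def job_exit_status(check_cmd, payload_cmd):
    """Return the status the job script should exit with.

    If the health check fails, return 100 so the job lands in E state;
    otherwise run the real payload and pass its exit status through.
    """
    if not node_looks_healthy(check_cmd):
        return SGE_JOB_ERROR_EXIT
    return subprocess.run(payload_cmd, check=False).returncode
```

A job script would end by calling sys.exit() with the value returned from job_exit_status(), so that the scheduler sees 100 only when the node itself looks bad.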

The first talk of the day today at the Slurm User Group was on fault
tolerance coming in future versions of Slurm and it seems to me that
using that framework to allow a job/user to report a node as bad
should be possible.

The slides are here:

http://slurm.schedmd.com/SUG13/nonstop.pdf

I suspect it'd be something that would need to be explicitly enabled
by a config option though,

Correct. There is also an ACL to control who has permissions to do this.


I reckon many sites would have conniptions
if users were able to take nodes out at random. ;-)

cheers,
Chris
--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: [email protected] Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci



