Chris, I think you have focussed on the wrong issue. The issue isn't system problems etc.
It's more that when a job does "exit 100" (for whatever reason), it stays in the queue in E state. The user then knows it has a problem, can fix the problem, then clear the job error and get it running again, without resubmitting it and without stuffing up their job dependencies.

On Fri, Sep 20, 2013 at 11:46 AM, Christopher Samuel <[email protected]> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi Stu,
>
> On 19/09/13 17:19, Stu Midgley wrote:
>
> > SGE has a special job error state of 100 (ie. exit 100) which puts
> > the job in E state in the queue.
>
> The first talk of the day today at the Slurm User Group was on fault
> tolerance coming in future versions of Slurm and it seems to me that
> using that framework to allow a job/user to report a node as bad
> should be possible.
>
> The slides are here:
>
> http://slurm.schedmd.com/SUG13/nonstop.pdf
>
> I suspect it'd be something that would need to be explicitly enabled
> by a config option though, I reckon many sites would have conniptions
> if users were able to take nodes out at random. ;-)
>
> cheers,
> Chris
> - --
> Christopher Samuel Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: [email protected] Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/ http://twitter.com/vlsci
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.12 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iEYEARECAAYFAlI7xNAACgkQO2KABBYQAh9fAACdEgLQXJILOxU2o+e0mhsgVIvu
> CgEAn1f1qmJfOSKB+b3IHa5ulUUr4s+s
> =2SHx
> -----END PGP SIGNATURE-----

--
Dr Stuart Midgley
[email protected]
