Re: [gridengine users] Message in stderr after exceeding resources

Dave Love Thu, 03 Mar 2011 09:17:22 -0800

Chris Jewell <[email protected]> writes:

> I quite like the &3 idea.  One reason I slightly shy away from emails
> is that they're not a default option (though easy enough for an
> administrator to configure that way), and it may often be more useful
> to have job context output kept with the stdout and stderr files


Agreed.  I keep meaning to put together a script to extract the
available post-mortem info for a job from qacct, the GE log files, and
possibly syslog (assuming shared classic spooling).  Does anyone else
fancy having a go?
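
As a possible starting point for the qacct part only, here is a minimal
sketch in Python.  It assumes `qacct -j <jobid>` prints records
separated by lines of "=" characters, with one "name value" pair per
line, and that an exit_status above 128 follows the usual shell
convention of 128 plus the signal number (so 137 would be SIGKILL).
The function names are made up for illustration, and the log file and
syslog parts are not attempted.

```python
import signal
import subprocess


def parse_qacct(text):
    """Split `qacct -j` style output into one dict per accounting record.

    Assumes records are delimited by lines starting with '=====' and that
    each remaining line is a field name followed by whitespace and a value.
    """
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("====="):
            if current:
                records.append(current)
            current = {}
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            current[parts[0]] = parts[1].strip()
        elif len(parts) == 1:
            current[parts[0]] = ""
    if current:
        records.append(current)
    return records


def describe_exit(status):
    """Interpret an exit status, treating values above 128 as 128 + signal."""
    status = int(status)
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "signal %d" % signum
        return "killed by %s" % name
    return "exited with status %d" % status


def job_records(job_id):
    """Run `qacct -j <job_id>` and return the parsed records (needs SGE installed)."""
    result = subprocess.run(
        ["qacct", "-j", str(job_id)],
        capture_output=True, text=True, check=True,
    )
    return parse_qacct(result.stdout)
```

Something like `for r in job_records(12345): print(r.get("failed"),
describe_exit(r.get("exit_status", "0")))` would then give a one-line
summary per task, though which fields are actually useful for spotting
resource-limit kills would need checking against real accounting data.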

> Agreed.  The standard 137 exit code is not really quite enough.  For
> the administrator, it could be *very* useful to know how many jobs are
> being killed as a result of exceeding resource requests, 

(In case people don't know) You can arrange for the admin to get more
useful mail about at least some failed jobs, but I can't remember
off-hand what configures it.  That's at least partially broken in 6.2u5,
and I typically just get mailed the hostfile, though I have a fix which
hasn't been properly tested
<https://arc.liv.ac.uk/trac/SGE/ticket/1307>.  It can also swamp you if
an array job fails because its working directory disappeared, for
instance.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
