Re: [gridengine users] Message in stderr after exceeding resources

Reuti Wed, 02 Mar 2011 10:59:52 -0800

Hi,

On 02.03.2011 at 19:37, Chris Jewell wrote:


> I was wondering if it was possible to get GE to output an error message to 
> the stderr file in response to a job being killed due to it exceeding a 
> resource request?  
> 
> Currently, we have an open-door policy on runtime (i.e. default
> h_rt=INFINITY), which is playing havoc by a) letting long jobs fill up the
> cluster and preclude short jobs from running (alleviated inefficiently by
> introducing a 'short' queue), and b) preventing efficient resource
> reservation for parallel SMP jobs.  I'd therefore like to change the default
> time to 30 minutes and have users explicitly request more time if they need
> it.  However, I'm worried that the default behaviour of killing jobs with
> SIGKILL will confuse users.  PBS Pro prints a message to stderr to tell you
> why your job was killed (memory, time, I/O, etc. exceeded the request): is
> there anything like this in GE I can use?

Yep, it's sometimes not easy to find out why a job was killed, as you have to
check the messages file of the relevant nodes. Since you have only SMP jobs in
the parallel case, there is only one machine to check, and the relevant entries
can be attached to the email that is sent to the user. Please find attached a
mail wrapper which uses a local messages file; the path can be adjusted to
match your setup. If you hit a race condition where the email is sent before
the entry appears in the messages file, a `sleep 5` or similar should help.

-- Reuti

mailer.sh
Description: Binary data
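
The attachment itself is not reproduced in this archive. As a rough illustration of the idea only (this is not Reuti's script), a mail wrapper along these lines could be set as `mailer` in the cluster configuration (`qconf -mconf`). Grid Engine calls the mailer as `mailer -s <subject> <recipient>` with the body on stdin. Everything else here is an assumption: the `MESSAGES_FILE` and `REAL_MAILER` paths, the `MAIL_DELAY` variable, the function names, the subject format "Job <id> (<name>) Aborted", and the "job <id>" wording of the messages-file entries. Check all of them against your own installation.

```sh
#!/bin/sh
# Hypothetical mail wrapper sketch: appends messages-file entries for the
# job to the notification mail. Paths and formats below are assumptions.

# Assumed location of the local execd messages file; adjust to your spool path.
MESSAGES_FILE="${MESSAGES_FILE:-/var/spool/sge/$(uname -n)/messages}"
# Assumed real mail program that accepts: -s <subject> <recipient>
REAL_MAILER="${REAL_MAILER:-/usr/bin/mail}"

# Extract the job id from a subject such as "Job 12345 (name) Aborted"
# (assumed subject format). Prints nothing if no id is found.
job_id_from_subject() {
    printf '%s\n' "$1" | sed -n 's/^Job[^0-9]*\([0-9][0-9]*\).*/\1/p'
}

# Print the lines of a messages file that mention the given job id.
# $1 = job id, $2 = messages file
job_messages() {
    [ -n "$1" ] && [ -r "$2" ] || return 0
    # Match "job <id>" followed by a non-digit or end of line, so that
    # job 12345 does not also match job 123456.
    grep -E "job $1([^0-9]|\$)" "$2" || true
}

# Forward the original mail with the matching entries appended.
# "$@" = the original mailer arguments: -s <subject> <recipient>
send_with_messages() {
    id=$(job_id_from_subject "$2")
    # Give the execd time to write its entry before reading the file.
    sleep "${MAIL_DELAY:-5}"
    {
        cat    # original mail body from stdin
        echo
        echo "Entries in $MESSAGES_FILE for job $id:"
        job_messages "$id" "$MESSAGES_FILE"
    } | "$REAL_MAILER" "$@"
}

# Only act when invoked the way Grid Engine invokes a mailer.
if [ "$#" -ge 3 ] && [ "$1" = "-s" ]; then
    send_with_messages "$@"
fi
```

Because the wrapper reads a local file, it only finds the entry if the mail is sent from the node that ran the job, or if the messages files are reachable on a shared path.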

