Re: [gridengine users] s_rt / h_rt Limits with Informative Messages?

Reuti Tue, 11 Sep 2012 16:09:35 -0700

On 12.09.2012 at 00:48, Joseph Farran wrote:

> Thanks Reuti.
> 
> I think this sends an additional email, correct? Is there an easy way to 
> append or check for "-m bea" in case a user does not want the email?


Yes and no: it adds some lines to the usual email you get with the mentioned 
"-m ea" option. The mail wrapper itself is set in `qconf -mconf` under "mailer".

You can use a JSV or the global $SGE_ROOT/default/common/sge_request to request 
it for all jobs.
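
For the sge_request route, a minimal sketch: the helper below appends a default 
"-m ea" line to a request file only if that exact line is not there yet. The 
function name `ensure_request_line` is made up for illustration; the path in the 
example call is the one mentioned above and assumes the cell is named "default".

```sh
#!/bin/sh
# Hypothetical helper: add a default option line to a request file once.
ensure_request_line() {
    # $1: request file, $2: option line to add, e.g. "-m ea"
    # -x: whole-line match, -F: fixed string, -e: pattern may start with "-"
    if ! grep -qxF -e "$2" "$1" 2>/dev/null; then
        printf '%s\n' "$2" >> "$1"
    fi
}

# Example call (assumes write access to the file and a cell named "default"):
# ensure_request_line "$SGE_ROOT/default/common/sge_request" "-m ea"
```

As far as I know, a "-m ..." given on the qsub command line still takes 
precedence over such a default, so this alone does not force the mail.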

If the users shouldn't get this email, it's trickier. The "messages" file is 
only written after the job has already left the system, and the mail is also 
triggered after this event. At that point you no longer have access to other 
information from `qstat -j $JOB_ID`, such as the job context. So:

- the JSV will copy the original "-m ..." option into a job context variable 
("-ac MAIL=...")
- the JSV will add an unconditional "-m bea" every time

- in a job prolog (i.e. set in the queue definition) you copy the context 
information to a directory in the local spool directory, i.e. 
/var/spool/sge/context/$JOB_ID

- the mail wrapper will read this information and send eMails to the user 
and/or the admin
- the mail wrapper will remove the context file
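
The mail wrapper's part of these steps could be sketched roughly as follows. 
This is only an illustration under assumptions: the prolog is assumed to have 
written one KEY=VALUE pair per line into the context file, with MAIL= holding 
the user's original "-m" letters, and the condition word in the subject is 
assumed to be one of Started, Complete or Aborted. The function names and the 
mapping of condition words to letters are hypothetical, not part of the 
snippet quoted below.

```sh
#!/bin/sh
# Hypothetical helpers for the mail wrapper; names and file format are assumptions.

# Print the value of the MAIL=... line from a context file, or nothing if absent.
saved_mail_opts() {
    # $1: context file, e.g. /var/spool/sge/context/$JOB_ID
    [ -r "$1" ] || return 0
    sed -n 's/^MAIL=//p' "$1" | head -n 1
}

# Succeed if the user's original "-m" letters cover the given condition.
# Assumed mapping: Started -> b, Complete -> e, Aborted -> a.
user_wants_mail() {
    # $1: original -m letters (may be empty), $2: condition word from the subject
    case "$2" in
        Started)  letter=b ;;
        Complete) letter=e ;;
        Aborted)  letter=a ;;
        *)        return 1 ;;
    esac
    case "$1" in
        *n*)         return 1 ;;   # "-m n" means no mail at all
        *"$letter"*) return 0 ;;
        *)           return 1 ;;
    esac
}
```

In the wrapper one would call `saved_mail_opts` on the context file before it is 
removed, and send the mail to "$3" only if `user_wants_mail` succeeds for the 
parsed $CONDITION; the admin copy can be sent unconditionally.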

As I just noticed, I forgot to remove the `rm` command from the snippet: we 
already use the wrapper with the mentioned job context to add some information 
given at submission time to the email (in particular, the users wanted to have 
the original command line added to the email they usually get).

-- Reuti


> Joseph
> 
> On 09/11/2012 11:21 AM, Reuti wrote:
>> Hi,
>> 
>> On 11.09.2012 at 19:10, Joseph Farran wrote:
>> 
>>> Is there a way (hopefully an easy way) to have Grid Engine give an 
>>> informative message when a job has gone past a limit and been killed, like 
>>> when a job goes over the wall time limit?
>>> 
>>> When I get an email from Grid Engine where a job has gone past its wall 
>>> time limit, it is not very informative:
>>> 
>>> Job 3568 (TEST) Aborted
>>> Exit Status      = 0
>>> Signal           = USR1
>>> User             = me
>>> Queue            = [email protected]
>>> Host             = compute-1-1.local
>>> Start Time       = 09/11/2012 09:54:01
>>> End Time         = 09/11/2012 09:56:02
>>> CPU              = 00:00:00
>>> Max vmem         = 124.145M
>>> failed assumedly after job because:
>>> job 3568.1 died through signal USR1 (10)
>> You can scan the messages file on the node and put the relevant lines in the 
>> email in the mail-wrapper:
>> 
>> #!/bin/sh
>> 
>> #
>> # Distinguish between normal jobs and an array job.
>> #
>> 
>> case `echo "$2" | cut -d " " -f 1` in
>> 
>>       Job) JOB_ID=`echo "$2" | cut -d " " -f 2`
>>            CONDITION=`echo "$2" | cut -d " " -f 4` ;;
>> 
>> Job-array) JOB_ID=`echo "$2" | cut -d " " -f 3`
>>            CONDITION=`echo "$2" | cut -d " " -f 5` ;;
>> 
>>         *) ;;
>> 
>> esac
>> 
>> if [ "$CONDITION" = "Aborted" ]; then
>>     if [ -f /var/spool/sge/$HOSTNAME/messages -a -r /var/spool/sge/$HOSTNAME/messages ]; then
>>         APPENDIX=`egrep "[|]job $JOB_ID([.][[:digit:]]+)? exceed" /var/spool/sge/$HOSTNAME/messages | head -n 1`
>>     fi
>>     if [ -z "$APPENDIX" ]; then
>>         APPENDIX="Unknown, no entry found in messages file on the master node of the job."
>>     fi
>> fi
>> 
>> #
>> # Now construct and send the email.
>> #
>> 
>> if [ -n "$APPENDIX" ]; then
>>     (cat; echo; echo "Reason for job abort:"; echo "$APPENDIX") | mail -s "$2" "$3"
>> else
>>     mail -s "$2" "$3"
>> fi
>> 
>> if [ -f /var/spool/sge/context/$JOB_ID -a -w /var/spool/sge/context/$JOB_ID ]; then
>>     rm -f /var/spool/sge/context/$JOB_ID
>> fi
>> 
>> 
>> -- Reuti
>> 
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
