On 12.09.2012 at 00:48, Joseph Farran wrote:

> Thanks Reuti.
>
> I think this sends an additional email, correct? Any easy way to append or
> check for "-m bea" in case users does not want the email?

Yes/No, it will add some lines to the usual email you get through the mentioned "-m ea" option. You can define the mail wrapper in `qconf -mconf` under "mailer". You can use a JSV or the global $SGE_ROOT/default/common/sge_request to request it for all jobs.

If the users shouldn't get this email, it's more tricky. The "messages" file is only written after the job has already left the system, and the mail is also triggered after this event. At that point you no longer have access to other information from `qstat -j $JOB_ID`, like the job context. So:

- the JSV will copy the original "-m ..." option into a job context variable "-ac MAIL=..."
- the JSV will add an unconditional "-m bea" all the time
- in a job prolog (e.g. set in the queue definition) you copy the context information to a directory in the local spool directory, e.g. /var/spool/sge/context/$JOB_ID
- the mail wrapper will read this information and send emails to the user and/or the admin
- the mail wrapper will remove the context file

As I just noticed, I forgot to remove the `rm` command from the snippet, as we are already using this with the mentioned job context to add some information given at submission time to the email (in particular, the users wanted to have the original command line added to the email they usually get).

-- Reuti

> Joseph
>
> On 09/11/2012 11:21 AM, Reuti wrote:
>> Hi,
>>
>> On 11.09.2012 at 19:10, Joseph Farran wrote:
>>
>>> Is there a way ( hopefully easy way ) to have Grid Engine to give an
>>> informative message when a job has gone past a limit and killed, like when
>>> a job goes over the wall time limit.
>>>
>>> When I get an email from Grid Engine where a job has gone past it's wall
>>> time limit, it is not very informative:
>>>
>>> Job 3568 (TEST) Aborted
>>> Exit Status = 0
>>> Signal = USR1
>>> User = me
>>> Queue = [email protected]
>>> Host = compute-1-1.local
>>> Start Time = 09/11/2012 09:54:01
>>> End Time = 09/11/2012 09:56:02
>>> CPU = 00:00:00
>>> Max vmem = 124.145M
>>> failed assumedly after job because:
>>> job 3568.1 died through signal USR1 (10)
>>
>> You can scan the messages file on the node and put the relevant lines in the
>> email in the mail-wrapper:
>>
>> #!/bin/sh
>>
>> #
>> # Distinguish between normal jobs and an array job.
>> #
>>
>> case `echo "$2" | cut -d " " -f 1` in
>>
>>     Job) JOB_ID=`echo "$2" | cut -d " " -f 2`
>>          CONDITION=`echo "$2" | cut -d " " -f 4` ;;
>>
>>     Job-array) JOB_ID=`echo "$2" | cut -d " " -f 3`
>>                CONDITION=`echo "$2" | cut -d " " -f 5` ;;
>>
>>     *) ;;
>>
>> esac
>>
>> if [ "$CONDITION" = "Aborted" ]; then
>>     if [ -f /var/spool/sge/$HOSTNAME/messages -a -r /var/spool/sge/$HOSTNAME/messages ]; then
>>         APPENDIX=`egrep "[|]job $JOB_ID([.][[:digit:]]+)? exceed" /var/spool/sge/$HOSTNAME/messages | head -n 1`
>>     fi
>>     if [ -z "$APPENDIX" ]; then
>>         APPENDIX="Unknown, no entry found in messages file on the master node of the job."
>>     fi
>> fi
>>
>> #
>> # Now construct and send the email.
>> #
>>
>> if [ -n "$APPENDIX" ]; then
>>     (cat; echo; echo "Reason for job abort:"; echo "$APPENDIX") | mail -s "$2" "$3"
>> else
>>     mail -s "$2" "$3"
>> fi
>>
>> if [ -f /var/spool/sge/context/$JOB_ID -a -w /var/spool/sge/context/$JOB_ID ]; then
>>     rm -f /var/spool/sge/context/$JOB_ID
>> fi
>>
>> -- Reuti
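
For the case where some users do not want the mail, the decision step inside the mail wrapper could look roughly like the sketch below. It is not part of the original wrapper. It assumes the prolog saved the job context as a comma-separated string such as "CMD=...,MAIL=ea" in the context file, and that the subject words for begin and end mails are "Started" and "Complete" (only "Aborted" appears in this thread). The function names `context_mail_opts` and `user_wants_mail` are made up for illustration.

```sh
#!/bin/sh
# Sketch only: helper functions for a mail wrapper that honours the
# user's original "-m" request, saved by the JSV as MAIL=... in the
# job context.

# Print the value of MAIL from a context string like "CMD=qsub,MAIL=ea".
# Prints nothing when there is no MAIL entry.
# (Assumption: entries are separated by commas.)
context_mail_opts() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^MAIL=//p' | head -n 1
}

# Return 0 if the original "-m" letters ($1) cover the condition ($2).
# Assumed mapping: Started -> b, Complete -> e, Aborted -> a.
user_wants_mail() {
    case "$2" in
        Started)  letter=b ;;
        Complete) letter=e ;;
        Aborted)  letter=a ;;
        *)        return 1 ;;
    esac
    case "$1" in
        *n*)         return 1 ;;   # "-m n" requests no mail at all
        *"$letter"*) return 0 ;;
        *)           return 1 ;;
    esac
}
```

In the wrapper, one would read the context file for $JOB_ID, pass its content to `context_mail_opts`, and only call `mail -s "$2" "$3"` when `user_wants_mail` succeeds for $CONDITION. Otherwise the wrapper should still consume stdin (for example with `cat > /dev/null`) or forward the mail to the admin only.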
