Hi,

On 11.09.2012 at 19:10, Joseph Farran wrote:

> Is there a way (hopefully an easy one) to have Grid Engine give an
> informative message when a job has gone past a limit and been killed, for
> example when a job goes over the wall time limit?
>
> When I get an email from Grid Engine for a job that has gone past its wall
> time limit, it is not very informative:
>
> Job 3568 (TEST) Aborted
> Exit Status = 0
> Signal = USR1
> User = me
> Queue = [email protected]
> Host = compute-1-1.local
> Start Time = 09/11/2012 09:54:01
> End Time = 09/11/2012 09:56:02
> CPU = 00:00:00
> Max vmem = 124.145M
> failed assumedly after job because:
> job 3568.1 died through signal USR1 (10)

You can scan the messages file on the node in the mail-wrapper and put the relevant line into the email:

#!/bin/sh
#
# Distinguish between normal jobs and an array job.
# $2 is the mail subject, $3 the recipient.
#
case `echo "$2" | cut -d " " -f 1` in
    Job)
        JOB_ID=`echo "$2" | cut -d " " -f 2`
        CONDITION=`echo "$2" | cut -d " " -f 4`
        ;;
    Job-array)
        JOB_ID=`echo "$2" | cut -d " " -f 3`
        CONDITION=`echo "$2" | cut -d " " -f 5`
        ;;
    *)
        ;;
esac

if [ "$CONDITION" = "Aborted" ]; then
    if [ -f /var/spool/sge/$HOSTNAME/messages -a -r /var/spool/sge/$HOSTNAME/messages ]; then
        APPENDIX=`egrep "[|]job $JOB_ID([.][[:digit:]]+)? exceed" /var/spool/sge/$HOSTNAME/messages | head -n 1`
    fi
    if [ -z "$APPENDIX" ]; then
        APPENDIX="Unknown, no entry found in messages file on the master node of the job."
    fi
fi

#
# Now construct and send the email.
#
if [ -n "$APPENDIX" ]; then
    (cat; echo; echo "Reason for job abort:"; echo "$APPENDIX") | mail -s "$2" "$3"
else
    mail -s "$2" "$3"
fi

if [ -f /var/spool/sge/context/$JOB_ID -a -w /var/spool/sge/context/$JOB_ID ]; then
    rm -f /var/spool/sge/context/$JOB_ID
fi

-- Reuti
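
To try the lookup without touching a live cluster, the pattern from the wrapper can be run against a made-up messages file. This is only a sketch: the helper name find_abort_reason, the temporary file, and the sample log lines are assumptions for illustration, and the exact wording in a real messages file may differ between Grid Engine versions. It uses grep -E, which behaves like egrep.

```shell
#!/bin/sh
# Sketch only: find_abort_reason is a hypothetical helper, not part of Grid Engine.
# It prints the first "job <id> exceed..." line found in a messages file.
find_abort_reason() {
    # $1 = job id (optionally followed by a task id in the file), $2 = messages file
    grep -E "[|]job $1([.][[:digit:]]+)? exceed" "$2" | head -n 1
}

# Made-up sample data; the layout of real messages lines is assumed here.
SAMPLE=`mktemp`
cat > "$SAMPLE" <<'EOF'
09/11/2012 09:54:01|  main|compute-1-1|I|starting up
09/11/2012 09:56:01|  main|compute-1-1|W|job 3568.1 exceeded hard wallclock time - initiate terminate method
09/11/2012 09:57:10|  main|compute-1-1|W|job 35680.1 exceeded hard wallclock time - initiate terminate method
EOF

# Only the entry for job 3568 should be picked up, not the one for job 35680.
REASON=`find_abort_reason 3568 "$SAMPLE"`
echo "Reason for job abort:"
echo "$REASON"
rm -f "$SAMPLE"
```

Because the pattern requires either a task suffix such as ".1" or a space directly after the job id, a job whose number merely starts with the same digits is not matched.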
