I recently spent the better part of a week getting this to work semi-reliably
in user-run epilogs. It seems that the information retrieved by sacct and
scontrol is updated lazily (often, not until after the epilog would complete),
especially when run under sbatch, though there is still some delay even under
srun. At the end I'll share a bash function I wrote to do this, in the hope of
saving someone some time.
Obtaining the correct exit status is non-trivial. When SLURM kills a job for
exceeding its memory limit, it doesn't kill the script; rather, it just kills
the application that is exceeding the memory allotment. As a result, there is
no obvious way to figure out whether SLURM killed the application for exceeding
memory requirements if you have a script in which some of the internal commands
could return a non-zero exit status, because the exit status reported is that
of the script. (If all commands should return a 0 exit status, then all we need
to do is set -e at the beginning of the script, or use srun -K1.) Illustration:
> cat consume_250mb_for_two_minutes.sh
#!/bin/sh
timeout 120 dd bs=250M obs=250M if=/dev/zero of=/dev/null
echo "Done"
> srun -vv --partition=debug --mem-per-cpu=100 -t 1:00 consume_250mb_for_two_minutes.sh
(using -vv to get the job number)
> sacct -j 10663091 -o JobID,MaxRSS,ExitCode,DerivedExitCode
JobID MaxRSS ExitCode DerivedExitCode
------------ ---------- -------- ---------------
10663091 1848K 0:0 0:0
If I remove echo "Done" and rerun:
> sacct -j 10662944 -o JobID,MaxRSS,ExitCode,DerivedExitCode
JobID MaxRSS ExitCode DerivedExitCode
------------ ---------- -------- ---------------
10662944 98020K 9:0 137:0
Likewise, it is necessary to check the exit status of each command run in a
slurm script, test for 9 or 137 as the exit status, and exit with that status
immediately if it is detected. A bash function for this is pretty trivial,
though a hassle to use for every command in every script.
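For illustration, a minimal version of such a function might look like this
(the name check_killed is arbitrary; 137 is just 128+9, i.e. the shell
reporting that a child was killed by SIGKILL):

# Abort the script if the previous command was killed with SIGKILL
# (exit status 9 or 137), which is how an over-memory kill shows up.
# Usage:  some_command; check_killed $?
check_killed()
{
    if [ "$1" -eq 9 ] || [ "$1" -eq 137 ]; then
        echo "command killed (exit status $1), aborting" >&2
        exit "$1"
    fi
}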
My SLURM Wishlist:
1: SLURM should wait to run the epilog until scontrol/sacct have up-to-date
accounting information on the job.
2: Have an option to kill the whole script by default, and set the
SLURM_EXIT_CODE status to some sentinel value, if any step is killed by SLURM
due to memory over-consumption, even when running under srun -K0 (which we need
because internal commands may return non-zero exit statuses).
3: When SLURM kills an application due to memory over-consumption, it should
set MaxRSS to the value at which it killed the application, and record a
non-zero SLURM-based exit code, since some part of the application failed. This
would better enable systems that estimate future memory and time consumption
for well-defined workflows and protocols to use feedback from SLURM to improve
their estimates of an upper bound on the resources required.
4: Make the resources requested/used directly available to the epilog as epilog
script arguments, rather than relying on the user's luck that the lazily
updated job databases have the correct information before SLURM kills the
epilog script. This isn't strictly necessary if wishlist item 1 (waiting until
the info is available) were implemented, but it would perhaps put less strain
on SLURM's database resources to pass that information directly to the script
at the time it is collected than to have the user query the database
afterwards. It would also improve compatibility with existing MOAB/Torque
workflows that use this information.
5: Define an environment variable SLURM_OUTPUT_FILE in epilog scripts that
indicates where the output is going/has gone. Our workflow currently has to
write a different epilog script for each srun call just to tell the epilog
where the main script was sending its output, so that this output can be
grepped for slurm messages indicating memory overrun or exceeded walltime.
6: Add an option to make srun act more like qsub: immediately return the job
number and then go to the background. As it stands, getting the job number
immediately on calling srun (independently of whether resources were available
right away) requires something like
srun -vv ... my_script.sh 2>&1 | egrep -m1 '^srun: (job|jobid) [0-9]' | cut -f3 -d\ &
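For what it's worth, that pipeline can be wrapped into a rough qsub-like
helper. The sketch below is only illustrative: the function name, the
temporary file, and the 1-second polling are my own choices, and it assumes
--output is passed so job output doesn't land on the terminal.

# Rough sketch of a qsub-like launcher: start srun in the background and
# print the job number as soon as "srun -vv" announces it on stderr.
srun_bg()
{
    local log jobid srun_pid
    log=`mktemp` || return 1
    srun -vv "$@" 2> "$log" &
    srun_pid=$!
    jobid=""
    while [ -z "$jobid" ]; do
        jobid=`egrep -m1 '^srun: (job|jobid) [0-9]' "$log" | cut -f3 -d' '`
        if [ -z "$jobid" ]; then
            # give up if srun exited without ever announcing a job number
            kill -0 "$srun_pid" 2> /dev/null || break
            sleep 1
        fi
    done
    echo "$jobid"
    # temp file cleanup is left out for brevity
}

Usage would then be something like
jobid=`srun_bg --output=job.out --mem-per-cpu=100 -t 1:00 my_script.sh`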
Here's the bash function that we call in all epilogue scripts.
# This function determines job resource usage, reason for completion or failure, etc.
# It takes a single argument: the file given to --output in the srun command
SLURMEpilogue()
{
    # Wait around for SLURM to notice that the job is done and update resource info.
    # 5 seconds isn't always long enough; fortunately this info isn't always critical
    # and can be retrieved by the user later if they need to. Making it longer than
    # 5 seconds runs a real risk that SLURM decides to kill the epilogue before it's
    # done (which happens after 10 seconds in the default configuration).
    sleep 5

    INFO=`scontrol -o show job $SLURM_JOB_ID`
    JOBSTATE=`echo "$INFO" | tr ' ' '\n' | grep '^JobState' | sed 's/.*=//'`
    EXITCODE_JOB_SIGNAL=`echo "$INFO" | tr ' ' '\n' | grep '^ExitCode' | sed 's/.*=//' | tr ':' ' '`
    EXITCODE_JOB=`echo $EXITCODE_JOB_SIGNAL | awk '{print $1}'`
    EXITCODE_SLURM=`echo $EXITCODE_JOB_SIGNAL | awk '{print $2}'`
    EXITCODE_MAX=`echo $EXITCODE_JOB_SIGNAL | awk '{print ($1 != 0 ? $1 : $2)}'`
    USEFULINFO=`echo "$INFO" | tr ' ' '\n' | egrep '^(JobId|JobName|JobState|RunTime|TimeLimit|StartTime|EndTime|NodeList|BatchHost|NumNodes|NumCPUs|NumTasks|CPUs/Task|MinMemoryCPU|MinMemoryNode|Command)='`

    # Call sacct to retrieve completed job information. Reformat it to a row-based
    # format, which is much easier to read for single jobs when more than 2-3 fields
    # are queried.
    awk_command='{if(NR==1){for( x=1; x <= NF; ++x){FIELD_NAMES[x-1]=$x;}}else if(NR == 2){for( x=1; x <= NF; ++x){FIELD_SIZES[x-1]=length($x);}}else{ y=1; for( x=1; x <= length(FIELD_SIZES); ++x){print FIELD_NAMES[x-1],substr($0,y,FIELD_SIZES[x-1]); y += FIELD_SIZES[x-1]+1;}}}'
    full_out=`sacct -j $SLURM_JOB_ID -o JobID,User,JobName,ExitCode,DerivedExitCode,AveCPU,MaxRSS,Timelimit,UserCPU,SystemCPU,MaxPages | awk "$awk_command"`

    if [[ ${EXITCODE_MAX} -ne 0 || "$JOBSTATE" != "COMPLETED" ]]; then
        echo "Job Failed! $INFO" >> $1.err
        echo -ne "$full_out" >> $1.err
        # Determine whether SLURM killed the job or the application itself failed
        if [[ $EXITCODE_SLURM -ne 0 || $EXITCODE_JOB -eq 9 || $EXITCODE_JOB -eq 137 ]]; then
            echo -ne "$SLURM_JOB_NAME $SLURM_JOB_ID killed: $JOBSTATE $USEFULINFO\n" >> $1.err
        else
            echo -ne "Job died due to internal application failure $USEFULINFO" >> $1.err
        fi
        # Repeat any slurmstepd errors at the bottom of the file for the convenience of
        # the user, since these are usually hidden among other output and will indicate
        # the job being killed due to excess memory consumption
        grep "slurmstepd: error: " $1 >> $1.err
    else
        # Write job resource utilization info out to the output file
        echo "Job ID: $SLURM_JOB_ID $JOBSTATE" >> $1
        echo "$USEFULINFO" >> $1
        echo "$full_out" >> $1
    fi
}
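To give an idea of how it gets used: the per-job epilog scripts mentioned in
wishlist item 5 can be as small as the following (paths are illustrative; the
script is what srun's --epilog option points at, and the argument is the same
file that was given to --output):

#!/bin/sh
# generated per-job epilog (illustrative paths)
. /path/to/slurm_epilogue_functions.sh       # defines SLURMEpilogue above
SLURMEpilogue /scratch/myuser/job_12345.out  # same file as srun --output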
I have let my SLURM admins know about these issues, but thought I'd share some
workarounds I've found useful as a user.
++Jeff