I recently spent the better part of a week getting this to work semi-reliably 
in user-run epilogs. It seems that the information retrieved by sacct and 
scontrol is updated lazily (often not until after the epilog would complete), 
especially when run under sbatch, though there is still some delay even under 
srun. At the end I'll share a bash function I wrote to do this, in the hope of 
saving someone some time. 

Obtaining the correct exit status is non-trivial. When SLURM kills a job for 
exceeding its memory limit, it doesn't kill the script; rather, it just kills 
the application that is exceeding the memory allotment. As a result, there is 
no obvious way to tell whether SLURM killed the application for exceeding 
memory requirements if you have a script in which some of the internal 
commands may legitimately return a non-zero exit status, because the reported 
exit status is that of the script. (If all commands should return a 0 exit 
status, then all we need to do is set -e at the beginning of the script, or 
use srun -K1.) Illustration: 
> cat consume_250mb_for_two_minutes.sh
timeout 120 dd bs=250M obs=250M if=/dev/zero of=/dev/null
echo "Done"
> srun -vv --partition=debug --mem-per-cpu=100 -t 1:00 consume_250mb_for_two_minutes.sh
(using -vv to get the job number)
> sacct -j 10663091 -o JobID,MaxRSS,ExitCode,DerivedExitCode
       JobID     MaxRSS ExitCode DerivedExitCode
------------ ---------- -------- ---------------
10663091          1848K      0:0             0:0
If I remove echo "Done" and rerun:
> sacct -j 10662944 -o JobID,MaxRSS,ExitCode,DerivedExitCode
       JobID     MaxRSS ExitCode DerivedExitCode
------------ ---------- -------- ---------------
10662944         98020K      9:0           137:0
Consequently, in a SLURM script it is necessary to check the exit status of 
each command, test for 9 or 137, and exit with that status immediately if it 
is detected. A bash function for this is pretty trivial, though a hassle to 
use for every command in every script. 
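For what it's worth, a minimal sketch of that trivial per-command check (the 
function name check_oom_kill is mine, not anything SLURM provides):

```shell
# check_oom_kill: abort the whole script if the previous command appears to
# have been killed with SIGKILL -- exit status 9 as SLURM records it, or
# 137 (= 128 + 9) as the shell reports it -- which is what SLURM's memory
# enforcement sends. The name and message are illustrative only.
check_oom_kill() {
  status=$1
  if [ "$status" -eq 9 ] || [ "$status" -eq 137 ]; then
    echo "step killed with SIGKILL (possible SLURM memory overrun)" >&2
    exit "$status"
  fi
  return 0
}
```

Usage is the tedious part: append it after every command in the batch script, 
e.g. `my_step_1; check_oom_kill $?` on each line.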

My SLURM Wishlist:
1: SLURM should wait to run the epilogue until scontrol/sacct have up-to-date 
accounting information on the job.
2: Have an option to kill the whole script by default and set the 
SLURM_EXIT_CODE status to some sentinel value if any step is killed by SLURM 
due to memory over-consumption, even when run under srun -K0 with potentially 
non-zero exit statuses from internal commands. 
3: When SLURM kills an application due to memory overconsumption, it should 
set MaxRSS to the value it killed the job for, and record a non-zero 
SLURM-based exit code, since some part of the application failed. This would 
better enable systems that estimate future memory and time consumption for 
well-defined workflows and protocols to use feedback from SLURM to improve 
their estimates of an upper bound on the resources required.
4: Make resources requested/used directly available to the epilog as epilog 
script arguments, rather than relying on the user's luck that the lazily 
updated job databases have the correct information before SLURM kills the 
epilog script. This isn't strictly necessary if wishlist item 1 (waiting until 
the info is available) were implemented, but it would perhaps put less strain 
on SLURM's database resources to pass that information directly to the script 
at the time it is collected than to have the user query the database 
afterwards. It would also improve compatibility with existing MOAB/Torque 
workflows that utilize this information.
5: Define an environment variable SLURM_OUTPUT_FILE in epilogue scripts that 
indicates where the output is going to/has gone. We currently have our 
workflow write a different epilog script for each srun call just to inform the 
epilogue script where the main script was sending its output, so that this 
output can be grepped for SLURM messages indicating memory or walltime 
overruns. 
6: Add an option to srun to act more like qsub: immediately return the job 
number and then detach into the background. As it is, getting the job number 
immediately on calling srun (independently of whether resources were available 
immediately or not) requires srun -vv ... my_script.sh 2>&1 | egrep -m1 
'^srun: (job|jobid) [0-9]' | cut -f3 -d\  &
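That one-liner is easier to reuse if the parsing half lives in a small helper 
(parse_srun_jobid is my name for it; the regex and cut invocation are exactly 
the ones from the pipeline above):

```shell
# parse_srun_jobid: read srun -vv chatter on stdin and print only the job
# number from the first line matching "srun: job(id) N...", as in the
# one-liner above. Assumes the line format implied by that regex.
parse_srun_jobid() {
  egrep -m1 '^srun: (job|jobid) [0-9]' | cut -f3 -d' '
}

# Usage (backgrounded, as before):
#   srun -vv ... my_script.sh 2>&1 | parse_srun_jobid &
```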

Here's the bash function that we call in all epilogue scripts (pick whichever 
fields you care about in the egrep pattern and the sacct -o list): 

# This function determines job resource usage, reason for completion or
# failure, etc.
# It takes a single argument: the file given to --output in the srun command.
job_epilogue_info () {
  # Wait around for SLURM to notice that the job is done and update resource
  # usage information. 5 seconds isn't always long enough; fortunately this
  # info isn't always needed and can be retrieved by the user later if they
  # need to. Making it longer than 5 seconds runs a real risk that SLURM
  # decides to kill the epilogue before it's done (which happens after 10
  # seconds in the default configuration).
  sleep 5
  INFO=`scontrol -o show job $SLURM_JOB_ID`
  JOBSTATE=`echo "$INFO" | tr ' ' '\n' | grep '^JobState' | sed 's/.*=//'`
  EXITCODE_JOB_SIGNAL=`echo "$INFO" | tr ' ' '\n' | grep '^ExitCode' | sed 's/.*=//' | tr ':' ' '`
  EXITCODE_JOB=`echo $EXITCODE_JOB_SIGNAL | awk '{print $1}'`
  EXITCODE_SLURM=`echo $EXITCODE_JOB_SIGNAL | awk '{print $2}'`
  EXITCODE_MAX=`echo $EXITCODE_JOB_SIGNAL | awk '{print ($1 != 0 ? $1 : $2)}'`
  USEFULINFO=`echo "$INFO" | tr ' ' '\n' | egrep '^(RunTime|TimeLimit|NumNodes|NumCPUs|MinMemoryCPU)='`
  # Call sacct to retrieve completed job information. Reformat it to row-based
  # format, which is much easier to read for single jobs when more than 2-3
  # fields are queried.
  awk_command='{if(NR==1){for( x=1; x <= NF; ++x){FIELD_NAMES[x-1]=$x;}}else if(NR == 2){for( x=1; x <= NF; ++x){FIELD_SIZES[x-1]=length($x);}}else{ y=1; for( x=1; x <= length(FIELD_SIZES); ++x){print FIELD_NAMES[x-1],substr($0,y,FIELD_SIZES[x-1]); y += FIELD_SIZES[x-1]+1;}}}'
  full_out=`sacct -j $SLURM_JOB_ID -o JobID,JobName,State,MaxRSS,Elapsed,ExitCode,DerivedExitCode | awk "$awk_command"`
  if [[ ${EXITCODE_MAX} -ne 0 || "$JOBSTATE" != "COMPLETED" ]]; then
    echo "Job Failed! $INFO" >> $1.err
    echo -ne "$full_out" >> $1.err
    # Determine whether SLURM killed the job or the application failed on its own
    if [[ $EXITCODE_SLURM -ne 0 || $EXITCODE_JOB -eq 9 || $EXITCODE_JOB -eq 137 ]]; then
      echo -ne "$SLURM_JOB_NAME $SLURM_JOB_ID killed: $JOBSTATE $USEFULINFO\n" >> $1.err
    else
      echo -ne "Job died due to internal application failure $USEFULINFO\n" >> $1.err
    fi
    # Repeat any slurmstepd errors at the bottom of the file for the user's
    # convenience, since these are usually hidden among other outputs and will
    # indicate the job being killed due to excess memory consumption.
    grep "slurmstepd: error: " $1 >> $1.err
  fi
  # Write job resource utilization info out to the output file
  echo "Job ID: $SLURM_JOB_ID $JOBSTATE" >> $1
  echo "$USEFULINFO" >> $1
  echo "$full_out" >> $1
}

I have let my SLURM admins know about these issues, but thought I'd share some 
workarounds I've found useful as a user.

