The “slurmstepd: error: Exceeded step memory limit at some point. Step may have been partially swapped out to disk.” message was removed in Slurm 14.11.7 because users found it misleading when the job was not subsequently killed. I personally prefer to see this message, because (1) it makes troubleshooting much easier if the job is eventually killed (the MaxRSS reported in accounting is not always accurate), and (2) even if the job is not killed, you are made aware of swap usage and the potential for performance degradation in your job.
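On the MaxRSS accuracy point: MaxRSS is a high-water mark, so a brief allocation spike raises it permanently, and sampling-based accounting can still miss short spikes entirely. A minimal Python sketch (illustrative only, not part of Slurm; the 64 MB figure is arbitrary) showing the kernel's own peak-RSS counter via getrusage():

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MB.

    Note: ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak / 1024 if sys.platform != "darwin" else peak / (1024 * 1024)

before = peak_rss_mb()
buf = bytearray(64 * 1024 * 1024)  # zero-filled, so all ~64 MB of pages are touched
after = peak_rss_mb()
del buf  # freeing the memory does NOT lower the high-water mark

print(f"peak before: {before:.0f} MB, peak after: {after:.0f} MB")
```

A job whose allocation spike fits between two accounting samples shows a modest sampled MaxRSS even though the kernel-side peak (and any enforcement against it) saw the full amount, which is exactly when the removed slurmstepd message was the only visible clue.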
http://bugs.schedmd.com/show_bug.cgi?id=1682

Will

On Jun 3, 2015, at 3:48 PM, Michael Kit Gilbert <[email protected]> wrote:

> Hello everyone,
>
> We just upgraded to Slurm 14.11.7 recently and have noticed an issue with
> jobs that exceed their specified memory limit. Instead of the familiar
> "Exceeded step memory limit..." error that we normally see, we are now
> seeing that the job simply was killed.
>
> Here is a job script I ran to reproduce the issue, which previously was
> found to need around 500MB of memory:
> --------------------------------------------------------------------------------
> #!/bin/bash
> #SBATCH -o /home/mkg52/mpi_test.%J.out
> #SBATCH --workdir=/home/mkg52/namd_testing
> #SBATCH --ntasks=8
> #SBATCH --mem=200
>
> module load namd/2.10
>
> srun namd2 apoa1/apoa1.namd
> --------------------------------------------------------------------------------
>
> Here is the end of the output file produced by the job:
> --------------------------------------------------------------------------------
> srun: error: cn4: task 6: Killed
> srun: error: cn4: task 7: Killed
> srun: error: cn4: task 1: Killed
> srun: Force Terminated job step 1300000.0
> --------------------------------------------------------------------------------
>
> Does anyone have any insight into why we aren't seeing memory-related errors
> in the output like we used to?
>
> Thanks a bunch for any help,
>
> Mike
