Will,

Thank you for the explanation and that link. I'll be sure to check for bug reports before posting questions like this in the future. I also agree with your thoughts about logging the message: we've had several users ask what the "killed" errors mean, whereas in the past they were able to resolve the issue themselves thanks to Slurm's "memory exceeded" message.
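For anyone else hitting this: until the message comes back, one workaround is to check a finished job's per-step MaxRSS with sacct (e.g. `sacct -j <jobid> --format=JobID,MaxRSS,ReqMem,State --noheader`) and compare it against the requested memory. A rough sketch of that kind of check is below; note the sacct output lines are illustrative samples, not from a real job, and the job ID is hypothetical.

```shell
#!/bin/bash
# Stand-in for live sacct output so the filter below can be tried anywhere;
# in practice you would pipe in:
#   sacct -j 1300000 --format=JobID,MaxRSS,ReqMem,State --noheader
# (columns: JobID, MaxRSS, ReqMem, State; batch line may have MaxRSS blank).
sample_output() {
cat <<'EOF'
1300000              200Mn  COMPLETED
1300000.0   612340K  200Mn  FAILED
EOF
}

# Flag steps that did not complete, printing their peak resident set size
# so the user can see whether it blew past the requested memory.
sample_output | awk '$NF=="FAILED" {print "check step", $1, "peak:", $2}'
```

With the sample above this prints `check step 1300000.0 peak: 612340K`, i.e. roughly 600MB peak against a 200MB request.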
On Thu, Jun 4, 2015 at 9:56 AM, Will French <[email protected]> wrote:
>
> The “slurmstepd: error: Exceeded step memory limit at some point. Step may
> have been partially swapped out to disk.” message was removed in 14.11.7
> because users found this message to be misleading when jobs were not
> subsequently killed. I personally prefer to see this message because (1) it
> makes troubleshooting much easier if the job is eventually killed (MaxRSS
> is not always super accurate) and (2) even if the job is not killed you’re
> made aware of swap usage and the potential for performance degradation in
> your job.
>
> http://bugs.schedmd.com/show_bug.cgi?id=1682
>
> Will
>
>
> On Jun 3, 2015, at 3:48 PM, Michael Kit Gilbert <[email protected]> wrote:
>
> > Hello everyone,
> >
> > We just upgraded to Slurm 14.11.7 recently and have noticed an issue
> > with jobs that exceed their specified memory limit. Instead of the familiar
> > "Exceeded step memory limit..." error that we normally see, we are now
> > seeing that the job simply was killed.
> >
> > Here is a job script I ran to reproduce the issue, which previously was
> > found to need around 500MB of memory:
> > --------------------------------------------------------------------------------
> > #!/bin/bash
> > #SBATCH -o /home/mkg52/mpi_test.%J.out
> > #SBATCH --workdir=/home/mkg52/namd_testing
> > #SBATCH --ntasks=8
> > #SBATCH --mem=200
> >
> > module load namd/2.10
> >
> > srun namd2 apoa1/apoa1.namd
> > --------------------------------------------------------------------------------
> >
> > Here is the end of the output file produced by the job:
> > --------------------------------------------------------------------------------
> > srun: error: cn4: task 6: Killed
> > srun: error: cn4: task 7: Killed
> > srun: error: cn4: task 1: Killed
> > srun: Force Terminated job step 1300000.0
> > --------------------------------------------------------------------------------
> >
> > Does anyone have any insight into why we aren't seeing memory-related
> > errors in the output like we used to?
> >
> > Thanks a bunch for any help,
> >
> > Mike
>
