Keep in mind that the memory usage reported by sacct is not accurate
for jobs killed for being Out Of Memory (OOM). Memory usage is sampled
periodically, and an OOM job is typically terminated before the next
sample is taken, and also before it reaches its full memory allocation.
So I wouldn't trust any of the memory figures for a job that was
terminated by OOM. The accounting just can't pick up a sudden memory
spike like that, and even if a sample did land during the spike, it
would not record the true peak, because the job was terminated before
reaching that point.
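
To illustrate the sampling problem, here is a minimal sketch in Python. It
is not how Slurm gathers accounting data; the memory trace and the
30-second interval are made-up values chosen only to show how a periodic
sampler can miss a short spike entirely.

```python
def sampled_peak(trace, interval):
    """Return the largest value seen when reading `trace` every `interval` steps."""
    return max(trace[::interval])


# Hypothetical per-second RSS trace in GiB: steady usage for 100 seconds,
# then a rapid spike over 3 seconds, after which the job is killed.
trace = [1.5] * 100 + [40.0, 90.0, 128.0]

# Assumed number of seconds between accounting samples (illustrative only).
poll_interval = 30

true_peak = max(trace)                               # what the job really reached
recorded_peak = sampled_peak(trace, poll_interval)   # what the sampler saw

print(f"true peak: {true_peak} GiB, recorded peak: {recorded_peak} GiB")
```

In this toy trace the samples fall at seconds 0, 30, 60 and 90, all before
the spike, so the recorded peak stays at the steady-state value.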
-Paul Edmon-
On 3/15/2021 1:52 PM, Chin,David wrote:
Hi, all:
I'm trying to understand why a job exited with an error condition. I
think it was actually terminated by Slurm: the job was a MATLAB script,
and its output was incomplete.
Here's sacct output:
JobID         JobName     User  Partition  NodeList  Elapsed   State       ExitCode  ReqMem  MaxRSS    MaxVMSize  AllocTRES                 AllocGRE
------------  ----------  ----  ---------  --------  --------  ----------  --------  ------  --------  ---------  ------------------------  --------
83387         ProdEmisI+  foob  def        node001   03:34:26  OUT_OF_ME+  0:125     128Gn                        billing=16,cpu=16,node=1
83387.batch   batch                        node001   03:34:26  OUT_OF_ME+  0:125     128Gn   1617705K  7880672K   cpu=16,mem=0,node=1
83387.extern  extern                       node001   03:34:26  COMPLETED   0:0       128Gn   460K      153196K    billing=16,cpu=16,node=1
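
For scale, the figures above can be put into the same unit with a small
Python sketch. The helper name to_gib is hypothetical, and the sketch
assumes the K/M/G/T suffixes are powers of 1024 and that the trailing "n"
on ReqMem marks a per-node request.

```python
# Multipliers from each unit suffix to KiB (assumed binary units).
UNIT_TO_KIB = {"K": 1, "M": 1024, "G": 1024**2, "T": 1024**3}


def to_gib(value):
    """Convert a memory string such as '1617705K' or '128Gn' to GiB."""
    value = value.strip()
    # Drop a trailing per-node ('n') or per-CPU ('c') marker, as seen on ReqMem.
    if value[-1] in ("n", "c"):
        value = value[:-1]
    number, unit = float(value[:-1]), value[-1].upper()
    return number * UNIT_TO_KIB[unit] / 1024**2


req_mem_gib = to_gib("128Gn")      # ReqMem from the output above
max_rss_gib = to_gib("1617705K")   # MaxRSS of the batch step
fraction = max_rss_gib / req_mem_gib

print(f"MaxRSS {max_rss_gib:.2f} GiB of {req_mem_gib:.0f} GiB requested ({fraction:.1%})")
```

The recorded MaxRSS of the batch step comes out to a small fraction of
the 128G request, even though the job state is OUT_OF_MEMORY.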
Thanks in advance,
Dave
--
David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu 215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode