Re: [slurm-users] [EXT] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
Hi, Sean: Slurm version 20.02.6 (via Bright Cluster Manager) ProctrackType=proctrack/cgroup JobAcctGatherType=jobacct_gather/linux JobAcctGatherParams=UsePss,NoShared I just skimmed https://bugs.schedmd.com/show_bug.cgi?id=5549 because this job appeared to have left two slurmstepd

Re: [slurm-users] [EXT] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Sean Crosby
What are your Slurm settings - what are the values of ProctrackType, JobAcctGatherType, and JobAcctGatherParams, and what are the contents of cgroup.conf? Also, what version of Slurm are you using? Sean -- Sean Crosby | Senior DevOps HPC Engineer and HPC Team Lead Research Computing Services | Business
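For reference, a minimal cgroup.conf that enforces per-job memory limits might look like the following; these values are illustrative defaults, not the original poster's actual configuration:

```
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRAMSpace=100
AllowedSwapSpace=0
```

With ConstrainRAMSpace=yes, the kernel OOM killer (not Slurm's accounting) terminates a step that exceeds its cgroup limit, which is why a job can die OUT_OF_MEMORY even when sampled MaxRSS looks under ReqMem.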

Re: [slurm-users] [EXTERNAL] Re: Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chad DeWitt
Hi Dave, Hope you're doing well. (...very possible you have already done these things...) Maybe the logs on the compute node (system and slurmd.log) would yield more info? Rolling the dice, it may also be worth looking for runaway processes or jobs on that compute node, as well as confirming the node

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
One possible datapoint: on the node where the job ran, there were two slurmstepd processes running, both at 100% CPU even after the job had ended. -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support:

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
Hi Michael: I looked at the Matlab script: it's loading an xlsx file which is 2.9 kB. There are some "static" arrays allocated with ones() or zeros(), but those use small subsets (< 10 columns) of the loaded data, and outputs are arrays of 6x10. Certainly there are not 16e9 rows in the

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
Here's seff output, if it makes any difference. In any case, the exact same job was run by the user on their laptop with 16 GB RAM with no problem. Job ID: 83387 Cluster: picotte User/Group: foob/foob State: OUT_OF_MEMORY (exit code 0) Nodes: 1 Cores per node: 16 CPU Utilized: 06:50:30 CPU
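seff's CPU-efficiency figure is just CPU time divided by (walltime x allocated cores). A small sketch of that arithmetic, using the 06:50:30 CPU time and 16 cores from the quoted output but an assumed 1-hour walltime (the real elapsed time is not shown above):

```python
# Reproduce seff's CPU-efficiency arithmetic. Only the 06:50:30 CPU time
# and 16 cores come from the seff output quoted above; the 1-hour
# walltime is an assumption for illustration.
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to seconds."""
    h, m, s = map(int, hms.split(":"))
    return h * 3600 + m * 60 + s

cpu_seconds = hms_to_seconds("06:50:30")   # 24630 s of CPU time
core_walltime = 1 * 3600 * 16              # hypothetical 1 h on 16 cores
print(f"CPU efficiency: {100 * cpu_seconds / core_walltime:.1f}%")  # 42.8%
```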

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Renfro, Michael
Just a starting guess, but are you certain the MATLAB script didn’t try to allocate enormous amounts of memory for variables? That’d be about 16e9 floating point values, if I did the units correctly. On Mar 15, 2021, at 12:53 PM, Chin,David wrote:  External Email Warning This email
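Michael's unit check is easy to reproduce. Assuming a hypothetical memory request of 128 GB and 8-byte double-precision values (the 128 GB figure is an assumption for illustration, not stated in the thread):

```python
# Rough check of how many 8-byte doubles fit in a given memory request.
# The 128 GB request is a hypothetical value, not taken from the thread.
req_mem_bytes = 128e9          # hypothetical ReqMem of 128 GB (decimal)
bytes_per_double = 8           # IEEE 754 double precision
n_values = req_mem_bytes / bytes_per_double
print(f"{n_values:.1e} doubles")  # 1.6e+10, i.e. about 16e9 values
```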

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Paul Edmon
One should keep in mind that sacct's memory-usage figures are not accurate for Out Of Memory (OOM) jobs, because the job is typically terminated before the next sacct polling period, and before it reaches its full memory allocation. Thus I wouldn't trust
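Paul's point can be illustrated with a toy simulation: if memory spikes and the job is killed between two polls, the sampled maximum badly understates the true peak. This is a sketch of the sampling problem, not Slurm's actual accounting code:

```python
# Toy model: memory usage is sampled every 30 "seconds"; the OOM spike
# happens (and the job dies) between two polls, so accounting never sees it.
usage = {0: 1.0, 30: 1.2, 45: 16.0, 60: None}  # GB; job killed at t=45
polls = [t for t in sorted(usage) if t % 30 == 0 and usage[t] is not None]
sampled_peak = max(usage[t] for t in polls)     # what sacct would report
true_peak = max(v for v in usage.values() if v is not None)
print(sampled_peak, true_peak)  # 1.2 16.0
```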

[slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
Hi, all: I'm trying to understand why a job exited with an error condition. I think it was actually terminated by Slurm: the job was a Matlab script, and its output was incomplete. Here's the sacct output: JobID  JobName  User  Partition  NodeList  Elapsed  State

Re: [slurm-users] slurm bank and sreport tres minute usage problem

2021-03-15 Thread Miguel Oliveira
Hi Paul, Thank you for your reply. Good to know that in your case you get consistent replies. I had done a similar analysis. Starting with a user, I got from the accounting records: sacct -X -u rsantos --starttime=2020-01-01 --endtime=now -o jobid,part,account,start,end,elapsed,alloctres%80 |
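The per-job usage being reconciled against sreport is essentially elapsed time x allocated CPUs. A sketch of that arithmetic from sacct-style fields, with fabricated job records for illustration:

```python
# Compute CPU-minutes per job from sacct-style fields (Elapsed, AllocCPUS).
# The job records below are fabricated for illustration.
jobs = [
    {"elapsed": "1-02:00:00", "alloc_cpus": 4},   # 1 day 2 h on 4 CPUs
    {"elapsed": "00:30:00", "alloc_cpus": 16},    # 30 min on 16 CPUs
]

def elapsed_to_minutes(elapsed: str) -> float:
    """Parse sacct's [D-]HH:MM:SS elapsed format into minutes."""
    days, _, hms = elapsed.rpartition("-")
    h, m, s = map(int, hms.split(":"))
    return int(days or 0) * 1440 + h * 60 + m + s / 60

total = sum(elapsed_to_minutes(j["elapsed"]) * j["alloc_cpus"] for j in jobs)
print(f"{total:.0f} CPU-minutes")  # 6720 CPU-minutes
```

Summing this per job over the reporting window is what the sreport TRES-minute figure should match, modulo truncation of jobs that straddle the window boundaries.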