On 20/09/14 02:02, Riccardo Murri wrote:

> The problem is that the job above has actually negligible heap use,
> *but* it allocates a SysV shared memory segment of about 100GB. It
> seems that the size of this shared memory segment is counted towards
> *all* 4 processes in the job, instead of being counted just once.
My gut instinct is that this may be the result of how the kernel reports memory usage for processes. I can see that you are using cgroups for task control; have you considered using cgroups for accounting as well?

We patched that job-killing aspect out of Slurm 2.6.x because it would kill any Open-MPI jobs we ran with mpirun (srun hurt performance too much) by miscounting their memory usage.

As an experiment I'm running two identical NAMD jobs on 8 cores on the same node, one with shared memory and the other using TCP, to see whether that shows a difference in recorded memory usage between the two. Might take a while to have something to report though. :-)

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au      Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
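Concretely, the cgroup-based accounting plugin charges each page to the job's memory cgroup once, rather than summing per-process RSS. A config sketch (option names from the Slurm cgroup support of that era; please verify against the docs for your version):

```ini
# slurm.conf -- switch job accounting from /proc RSS polling to cgroups:
JobAcctGatherType=jobacct_gather/cgroup
# task control via cgroups, which you appear to have already:
TaskPlugin=task/cgroup
```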