We've recently implemented a memory management system on our cluster which requires users to set h_vmem on their jobs, and which tracks RAM consumption on each compute node by defining h_vmem as a consumable resource, so that no node is overcommitted.
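
For reference, the relevant parts of our setup look roughly like this (the values and host name are illustrative rather than our exact configuration):

    # qconf -mc : h_vmem defined as a consumable in the complex list
    #name    shortcut  type    relop  requestable  consumable  default  urgency
    h_vmem   h_vmem    MEMORY  <=     YES          YES         0        0

    # qconf -me <node> : per-node capacity that the consumable is decremented from
    complex_values        h_vmem=64G

    # jobs are then submitted with an explicit request, e.g.
    qsub -l h_vmem=4G myjob.sh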
Despite this, we're getting jobs which die because they can't allocate memory. The nodes on which these failures happen still have plenty of free memory, and the jobs are dying from internal malloc errors rather than being killed by the limit imposed by Grid Engine.

I suspect we're seeing memory fragmentation, so that even though there is plenty of memory available, the programs can't allocate a large enough contiguous block and therefore die (a small contrived example of what I mean is below, after my signature). Does this seem like a likely explanation? If so, is there anything in the configuration of either the queues or the nodes which could minimise the chances of these kinds of errors occurring?

Thanks,

Simon

The Babraham Institute, Babraham Research Campus, Cambridge CB22 3AT
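P.S. To make the fragmentation idea concrete, here is a contrived sketch (not our actual code, and the exact behaviour will depend on the allocator) of how a large malloc can fail under an address-space limit of the kind h_vmem enforcement imposes, even when plenty of memory has just been freed:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* Cap the address space, similar in spirit to h_vmem enforcement
         * (the 256 MB figure is purely illustrative). */
        struct rlimit rl = { 256UL << 20, 256UL << 20 };
        setrlimit(RLIMIT_AS, &rl);

        /* Fill most of the heap with 64 KB blocks, then free every other
         * one: ~94 MB is "free" again, but only as isolated 64 KB holes
         * which the allocator can neither coalesce nor return to the kernel. */
        enum { N = 3000, SZ = 64 * 1024 };
        static char *blk[N];
        for (int i = 0; i < N; i++)
            blk[i] = malloc(SZ);
        for (int i = 0; i < N; i += 2)
            free(blk[i]);

        /* An 80 MB contiguous request cannot be met from the holes, and
         * growing the address space would exceed the limit, so malloc
         * fails despite all the memory freed above. */
        char *big = malloc(80UL << 20);
        printf("80 MB request: %s\n", big ? "ok" : "failed (fragmentation)");
        return 0;
    }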
