We've recently implemented a memory management system on our cluster which requires users to set h_vmem on their jobs, and which tracks RAM consumption on each compute node by defining h_vmem as a consumable resource, so that no node is overcommitted.
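
For reference, the relevant parts of our setup look roughly like this (the values and host name are illustrative rather than our exact configuration):

    # qconf -mc : h_vmem defined as a consumable in the complex list
    #name    shortcut  type    relop  requestable  consumable  default  urgency
    h_vmem   h_vmem    MEMORY  <=     YES          YES         0        0

    # qconf -me <node> : per-node capacity that the consumable is decremented from
    complex_values        h_vmem=64G

    # jobs are then submitted with an explicit request, e.g.
    qsub -l h_vmem=4G myjob.sh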
Despite this, we're getting jobs which die because they can't allocate memory. The nodes on which these failures happen still have plenty of free memory, and the jobs are dying from internal malloc errors rather than being killed by the limit imposed by Grid Engine.

I suspect we're seeing memory fragmentation, so that even though there is plenty of memory available, the programs can't allocate a large enough contiguous block and therefore die (a small contrived example of what I mean is below, after my signature). Does this seem like a likely explanation? If so, is there anything in the configuration of either the queues or the nodes which could minimise the chances of these kinds of errors occurring?

Thanks,

Simon

The Babraham Institute, Babraham Research Campus, Cambridge CB22 3AT
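P.S. To make the fragmentation idea concrete, here is a contrived sketch (not our actual code, and the exact behaviour will depend on the allocator) of how a large malloc can fail under an address-space limit of the kind h_vmem enforcement imposes, even when plenty of memory has just been freed:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* Cap the address space, similar in spirit to h_vmem enforcement
         * (the 256 MB figure is purely illustrative). */
        struct rlimit rl = { 256UL << 20, 256UL << 20 };
        setrlimit(RLIMIT_AS, &rl);

        /* Fill most of the heap with 64 KB blocks, then free every other
         * one: ~94 MB is "free" again, but only as isolated 64 KB holes
         * which the allocator can neither coalesce nor return to the kernel. */
        enum { N = 3000, SZ = 64 * 1024 };
        static char *blk[N];
        for (int i = 0; i < N; i++)
            blk[i] = malloc(SZ);
        for (int i = 0; i < N; i += 2)
            free(blk[i]);

        /* An 80 MB contiguous request cannot be met from the holes, and
         * growing the address space would exceed the limit, so malloc
         * fails despite all the memory freed above. */
        char *big = malloc(80UL << 20);
        printf("80 MB request: %s\n", big ? "ok" : "failed (fragmentation)");
        return 0;
    }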
