Dear all,
Since a week ago we have had the h_vmem consumable enabled in the cluster. Now suddenly some massively parallel jobs are being killed because of memory allocation failures. Users are increasing the value of h_vmem until their jobs run stably. The effect is too many "wasted" slots, because our machines have a limited amount of RAM.
I found the reason for this in
http://gridengine.org/pipermail/users/2011-September/001636.html
The virtual memory overhead on the first node is: overhead_vmem = bash_vmem + mpirun_vmem + (nodes - 1) * qrsh_vmem
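To make the effect concrete, here is a small worked example of that formula. All of the per-process numbers (bash_vmem, mpirun_vmem, qrsh_vmem) are made-up values for illustration, not measurements from any particular cluster:

```shell
# Hypothetical per-process vmem footprints on the master node of a 4-node job.
# The actual values must be measured in your own environment.
bash_vmem=100    # MB, the job's login shell
mpirun_vmem=150  # MB, the mpirun process itself
qrsh_vmem=50     # MB, one "qrsh -inherit" per remote node
nodes=4

# Extra vmem the master node needs on top of the workers' own usage:
overhead_vmem=$(( bash_vmem + mpirun_vmem + (nodes - 1) * qrsh_vmem ))
echo "${overhead_vmem} MB"
```

With these numbers the master node needs 400 MB more than any worker, which is exactly the headroom users end up adding to h_vmem on every slot.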
I checked this for our environment, and I am tending toward enforcing
#$ -l exclusive=true together with requesting a multiple of the available slots per machine. Has anybody had the same experience and found a flexible solution? I would wish for, or can imagine, a kind of complex variable "h_vmem_master" for the master node.
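A sketch of what such a submit script could look like. The PE name "mpi", the 16 slots per machine, and the h_vmem value are assumptions for illustration only:

```shell
#!/bin/sh
# Hypothetical submit script: request whole machines so the master node's
# extra vmem does not collide with other jobs on the same host.
#$ -pe mpi 64          # 64 = 4 * 16, a multiple of the (assumed) 16 slots/node
#$ -l exclusive=true   # no other jobs share the allocated nodes
#$ -l h_vmem=3.5G      # per-slot limit sized for the workers, not the master
mpirun -np $NSLOTS ./my_app
```

This trades flexibility for safety: whole nodes are reserved even when the job could share, which is exactly why a separate per-master limit like "h_vmem_master" would be nicer.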

Udo
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
