Schmidt U. <[email protected]> writes: > Dear all, > since a week we have h_vmem consumable in the cluster. Now suddenly > some massive parallel jobs are killed because of memory allocation > failure. The user are in creasing the value for h_vmem until their job > runs stable. > The effects are to much "wasted" slots, because our machines have > limited amount of RAM. > The reason for that I found in > http://gridengine.org/pipermail/users/2011-September/001636.html > The virtual memory overload of the first node: overhead_vmem = > bash_vmem + mpirun_vmem + (nodes -1)*qrsh_vmem
The first thing I'd suggest is using an MPI implementation (e.g. Open MPI) which supports tree spawning of the slave tasks, so that you just don't start so many of them on the master. Also, if bash is a problem, maybe use a lighter shell like (d)ash, which is sometimes /bin/sh anyhow.

Of course, there probably should be special provision for resources on the master host, and possibly for excluding qrsh from the accounting.

-- 
Community Grid Engine: http://arc.liv.ac.uk/SGE/
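As a sketch of those two suggestions in a job script, something like the following could work; the PE name, memory value, and shell path are site-specific assumptions, not a tested recipe:

```
#!/bin/sh
# Hypothetical SGE job script -- all values are illustrative assumptions.
#$ -S /bin/dash        # lighter shell than bash on the master host
#$ -pe openmpi 64      # PE name is site-specific
#$ -l h_vmem=2G        # per-slot limit; size it to also cover the
                       #   bash + mpirun + qrsh overhead on node 1
mpirun ./my_mpi_program
```

With an MPI that spawns its slave tasks through a tree rather than launching every remote task directly from the master, the number of qrsh clients sitting on the first node stays small, so the per-slot h_vmem no longer has to absorb an overhead that grows with the node count.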
