Schmidt U. <[email protected]> writes:

> Dear all,
> since a week ago we have had the h_vmem consumable in the cluster.
> Now suddenly some massively parallel jobs are killed because of
> memory allocation failures. The users are increasing the value of
> h_vmem until their jobs run stably.
> The effect is too many "wasted" slots, because our machines have a
> limited amount of RAM.
> I found the reason for this in
> http://gridengine.org/pipermail/users/2011-September/001636.html:
> the virtual-memory overhead on the first node is
> overhead_vmem = bash_vmem + mpirun_vmem + (nodes - 1) * qrsh_vmem

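To see how that overhead grows with job size, here is a quick sketch of the quoted formula. All per-process figures are hypothetical examples, not measurements from any real cluster:

```python
def overhead_vmem(bash_vmem, mpirun_vmem, qrsh_vmem, nodes):
    """Virtual memory consumed on the master host beyond the job's own
    tasks: one bash, one mpirun, and one qrsh per remote node."""
    return bash_vmem + mpirun_vmem + (nodes - 1) * qrsh_vmem

# e.g. 100 MB for bash, 150 MB for mpirun, 50 MB per qrsh, 16 nodes:
print(overhead_vmem(100, 150, 50, 16))  # -> 1000 (MB)
```

The (nodes - 1) term is the problem: the per-slot h_vmem request a user must make to keep the master task alive scales with the size of the job, which is exactly the slot waste described above.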
The first thing I'd suggest is using an MPI implementation
(e.g. Open MPI) which supports tree spawning of the slave tasks, so
that you don't start so many on the master.  Also, if bash is a
problem, maybe use a lighter shell like (d)ash, which is sometimes
/bin/sh anyhow.

Of course there probably should be special provision for resources on
the master host, and possibly excluding qrsh from the accounting.

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
