[gridengine users] h_vmem and parallel jobs, or "why exclusive=true is important"

Mark Dixon Thu, 22 Sep 2011 03:41:42 -0700

I've recently been dealing with a trouble ticket for one of our users. It
led me down an interesting rabbit hole: what was happening wasn't
surprising, but the scale of it was.


So I thought I would bore you with it :)


Here goes. Background first...

* We have had reports where OpenMPI jobs above a certain size are
occasionally killed by our GE (ge6.2u5 plus the odd patch).

* Our compute cluster supports both serial and parallel computing: so each
queue slot corresponds to a CPU core (instead of say, 1 slot per node).

* We make our users specify the virtual memory their jobs require (via
h_vmem), to stop nodes from running out of memory. h_vmem isn't a perfect
match for this (we should have a discussion on the technical merits of the
alternative options sometime - I'm looking at you, William).

* We use tight integration to keep control of the parallel jobs; however,
the principles below are mostly applicable to non-tightly integrated jobs.

It turned out that GE was killing the jobs because they had run out of
vmem. We suggested they use "-l exclusive=true", with a few hand-wavy
arguments to back it up, and the jobs started working again.

So this week, I finally got round to looking at exactly *why*
exclusive=true fixed things...

How OpenMPI (and similar) interacts with Grid Engine:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Grid Engine constrains parallel jobs such that the virtual memory used
on a compute node cannot exceed h_vmem * slots assigned to it on that
compute node.

* The first compute node in a job runs the user's batch script, including
the mpirun command.

* The mpirun command starts the copies of the MPI process for the first
compute node, but also one "qrsh" command for each of the other nodes in
the job. Each "qrsh" command runs for the lifetime of the job.



What this means:
~~~~~~~~~~~~~~~

The virtual memory overhead on the first node for a job is:

  overhead_vmem = bash_vmem + mpirun_vmem + (nodes -1)*qrsh_vmem

  (nodes is the number of nodes assigned to the job)

And so the extra h_vmem the job needs to ask for is:

  h_vmem = overhead_vmem / node_slots

  (node_slots is the number of copies of the MPI program assigned to the
  first node)
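
The two formulas above can be written as a short Python sketch. The
function and parameter names are purely illustrative (they are not part of
Grid Engine), and every size is assumed to be in the same unit, e.g. M.

```python
def overhead_vmem(bash_vmem, mpirun_vmem, qrsh_vmem, nodes):
    """Virtual memory overhead on the first node of the job."""
    # One batch shell, one mpirun, and one qrsh per *other* node in the job.
    return bash_vmem + mpirun_vmem + (nodes - 1) * qrsh_vmem


def extra_h_vmem(overhead, node_slots):
    """Extra per-slot h_vmem needed to cover the first node's overhead."""
    if node_slots < 1:
        raise ValueError("the first node must hold at least one slot")
    # The node-wide limit is h_vmem * slots, so the overhead is shared
    # between the slots assigned on the first node.
    return overhead / node_slots
```

The fewer slots the job has on its first node, the larger the per-slot
request has to be to absorb the same overhead.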


An example job:
~~~~~~~~~~~~~~

Looking at a real 256 core job that failed in MPI_Init, it happened to be
allocated bits of 96 hosts and only one MPI process was assigned to the
first node. The virtual memory overhead for the first node was therefore
(in M):


  overhead_vmem = 66 + 59 + (96 -1)*18 = 1835M

So, as there was only one slot assigned to the first node, the extra
per-slot h_vmem the job needed to ask for, for the topology it was
assigned, is:


  h_vmem = 1835 / 1 = 1835M

Yikes. We cannot afford to request the best part of 2GB (or more) per slot
for a job _plus_ what the actual MPI program needs, just in case we get an
unfavourable distribution of slots.



How does exclusive=true help?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the 256 core job was submitted with exclusive=true, it would have been
allocated on our machine to 32 hosts, 8 processes per node. Running the
numbers again for the first node:


  overhead_vmem = 66 + 59 + (32 -1)*18 = 683M

And the extra per-slot h_vmem required by the job to accommodate this
overhead is:


  h_vmem = 683 / 8 = 85M

That's more like it!

Completing the story, the overheads on the non-first compute nodes are
around the 65M per slot mark, so we have an even <100M/slot vmem overhead
across the job.
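
For anyone who wants to replay the arithmetic, here is a small Python
sketch comparing the two layouts. The vmem figures (66M, 59M, 18M) and the
host counts are the ones quoted in this post; the function names are just
illustrative.

```python
# Per-process vmem figures (in M) taken from the example job in this post.
BASH_VMEM = 66
MPIRUN_VMEM = 59
QRSH_VMEM = 18


def first_node_overhead(nodes):
    """vmem overhead (M) on the first node: shell + mpirun + one qrsh per other node."""
    return BASH_VMEM + MPIRUN_VMEM + (nodes - 1) * QRSH_VMEM


def extra_per_slot(nodes, first_node_slots):
    """Extra per-slot h_vmem (M) needed to absorb the first node's overhead."""
    return first_node_overhead(nodes) / first_node_slots


# (number of hosts, slots on the first node) for each layout of the 256 core job
layouts = {
    "scattered (96 hosts, 1 slot on first node)": (96, 1),
    "exclusive=true (32 hosts, 8 slots on first node)": (32, 8),
}

for name, (nodes, slots) in layouts.items():
    print(f"{name}: overhead {first_node_overhead(nodes)}M, "
          f"extra h_vmem {extra_per_slot(nodes, slots):.0f}M per slot")
```

This reproduces the 1835M and 85M per-slot figures above.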



So why don't you use exclusive=true by default?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There's collateral damage to other jobs:

https://arc.liv.ac.uk/trac/SGE/ticket/767

I don't know if this has been fixed in any of the forks.


What else can be done?
~~~~~~~~~~~~~~~~~~~~~

We could reconfigure to use a JSV that re-writes the requested PE to
select one that enforces the number of cores per node, adjusting for the
amount of RAM. We didn't originally do this, because the exclusive=true
feature seemed simpler. Also, it's not that desirable for us, because
we're already doing something very similar to encode interconnect
topology.


Other avenues of attack to aid scalability, with varying levels of kludge:

* Replace qrsh with a 32-bit version, which gives a factor of 2 improvement
in overhead (vmem comes down from 18M to 9M).

* Enhance GE to sort the hostlist such that the host with the greatest
number of slots assigned to the job is first in the list, reducing the
frequency that the problem is hit.


* Enhance GE by making qrsh more light-weight.
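
To get a feel for how much the first two ideas would buy, here is a rough
Python sketch. It is not how GE represents or orders its hostlist
internally: the (hostname, slots) pairs, the host names and the function
names are all made up for illustration, and the default vmem figures are
the ones measured earlier in this post.

```python
def sort_hostlist(hostlist):
    """Return the hostlist with the host holding the most slots first.

    hostlist is a list of (hostname, slots) pairs; sorted() is stable, so
    hosts with equal slot counts keep their original relative order.
    """
    return sorted(hostlist, key=lambda entry: entry[1], reverse=True)


def extra_per_slot(hostlist, bash_vmem=66, mpirun_vmem=59, qrsh_vmem=18):
    """Extra per-slot h_vmem (in M) needed on the first host in hostlist."""
    overhead = bash_vmem + mpirun_vmem + (len(hostlist) - 1) * qrsh_vmem
    return overhead / hostlist[0][1]


# Hypothetical allocation: 4 hosts, with only one slot on the first host.
hostlist = [("node01", 1), ("node02", 8), ("node03", 4), ("node04", 3)]

print(extra_per_slot(hostlist))                 # unsorted, 64-bit qrsh
print(extra_per_slot(sort_hostlist(hostlist)))  # most-slots host first
print(extra_per_slot(hostlist, qrsh_vmem=9))    # unsorted, 32-bit qrsh
```

Sorting doesn't shrink the overhead itself; it just spreads it over as
many slots as the job has on any one host. The 32-bit qrsh only halves the
per-node term, so it matters most for jobs spread over many hosts.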


If you made it down to the bottom of this post, my thanks :)

Mark
--
-----------------------------------------------------------------
Mark Dixon                       Email    : [email protected]
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
