I've recently been dealing with a trouble ticket for one of our users. It led me down an interesting rabbit hole: what was happening wasn't surprising, but the scale of it was.

So I thought I would bore you with it :)

Here goes. Background first...

* We have had reports where OpenMPI jobs above a certain size are occasionally killed by our GE (ge6.2u5 plus the odd patch).

* Our compute cluster supports both serial and parallel computing: so each queue slot corresponds to a CPU core (instead of say, 1 slot per node).

* We make our users specify the virtual memory their jobs require (via h_vmem), to stop nodes from running out of memory. h_vmem isn't a perfect match for this (we should have a discussion on the technical merits of the alternative options sometime - I'm looking at you, William).

* We use tight integration to keep control of the parallel jobs; however, the principles below are mostly applicable to non-tightly integrated jobs.


It turned out that GE was killing the jobs because they had run out of vmem. We suggested they used "-l exclusive=true" with a few hand-wavy arguments to back it up and it started working again.


So this week, I finally got round to looking at exactly *why* exclusive=true fixed things...


How OpenMPI (and similar) interacts with Grid Engine: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Grid Engine constrains parallel jobs such that the virtual memory used on a compute node cannot exceed h_vmem * slots assigned to it on that compute node.

* The first compute node in a job runs the user's batch script, including the mpirun command.

* The mpirun command starts the copies of the MPI process for the first compute node, but also one "qrsh" command for each of the other nodes in the job. Each "qrsh" command runs for the lifetime of the job.


What this means:
~~~~~~~~~~~~~~~

The virtual memory overhead on the first node for a job is:

  overhead_vmem = bash_vmem + mpirun_vmem + (nodes -1)*qrsh_vmem

  (nodes is the number of nodes assigned to the job)

And so the extra h_vmem the job needs to ask for is:

  h_vmem = overhead_vmem / node_slots

  (node_slots is the number of copies of the MPI program assigned to the
  first node)


An example job:
~~~~~~~~~~~~~~

Looking at a real 256 core job that failed in MPI_Init, it happened to be allocated bits of 96 hosts and only one MPI process was assigned to the first node. The virtual memory overhead for the first node was therefore (in M):

  overhead_vmem = 66 + 59 + (96 -1)*18 = 1835M

So, as there was only one slot assigned to the first node, the extra per-slot h_vmem the job needed to ask for, for the topology it was assigned, is:

  h_vmem = 1835 / 1 = 1835M

Yikes. We cannot afford to request the best part of 2Gb (or more) per slot to a job _plus_ what the actual MPI program needs, in case we get an unfavourable distribution of slots.


How does exclusive=true help?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the 256 core job was submitted with exclusive=true, it would have been allocated on our machine to 32 hosts, 8 processes per node. Running the numbers again for the first node:

  overhead_vmem = 66 + 59 + (32 -1)*18 = 683M

And the extra per-slot h_vmem required by the job to accommodate this overhead is:

  h_vmem = 683 / 8 = 85M

That's more like it!

Completing the story, the overheads on the non-first compute nodes are around the 65M per slot mark, so we have an even <100M/slot vmem overhead across the job.


So why don't you use exclusive=true by default?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There's collateral damage to other jobs:

https://arc.liv.ac.uk/trac/SGE/ticket/767

I don't know if this has been fixed in any of the forks.


What else can be done?
~~~~~~~~~~~~~~~~~~~~~

We could reconfigure to use a JSV that re-writes the requested PE to select one that enforced the number of cores per node, adjusting for the amount of RAM. We didn't originally do this, because the exclusive=true feature seemed more simpler. Also, it's not that desirable for us, because we're already doing something very similar to encode interconnect topology.

Other avenues of attack to aid scalability, with varying levels of kludge:

* Replace qrsh with a 32-bit version (vmem gives a factor of 2 improvement in overhead (vmem comes down from 18M to 9M).

* Enhance GE to sort the hostlist such that the host with the greatest number of slots assigned to the job is first in the list, reducing the frequency that the problem is hit.

* Enhance GE by making qrsh more light-weight.


If you made it down to the bottom of this post, my thanks :)

Mark
--
-----------------------------------------------------------------
Mark Dixon                       Email    : [email protected]
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

Reply via email to