On 08/02/2011 06:27 PM, Michel Bourget wrote:
Hello,

we are in the process of integrating SGI MPI (also known as MPT) into
SLURM. This is in the context of SGI offering SLURM as a product; hence,
it involves support and requires that a simple "srun --mpi=sgimpi a.out"
operation work for our customers. SGI has also committed to fully
supporting process tracking and job accounting within the context of
SGI MPI.

The problem: SGI MPI has its own launching mechanism, scaled for very
large clusters.

* Specifically, mpirun sends a request to a launcher helper daemon
and the rest follows.
 * Teaching SLURM to launch on just the first node/task isn't a
   problem. It requires a new, non-intrusive MPI hook call which
   would be a no-op for all other MPI plugins.
 * Because the SGI MPI launcher daemon is actually the pgid/container_id
   of the real MPI processes running on all the nodes, we thought we
   could use the slurm_container_add() semantics, etc. ... only to
   realize this is actually a no-op in many proctrack plugins.
 * More to the point, the container_id is determined before the
   execve() (in exec_task()). Almost everything assumes "slurmstepd" is
   the ancestor of all the tasks to track and monitor,
   accounting-wise. This is not true with SGI MPI.
 * It seems there is no easy way to add a list of ancestor
   pids/descendants to the proctrack plugins and/or a list of
   container ids to the jobacct_gather plugins.
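To illustrate the no-op point above: in group-based proctrack plugins the "container" is nothing more than a process group, so adding a foreign pid to it only works for processes slurmstepd itself forked. A minimal stand-in (hypothetical sketch, not SLURM source; container_add is our illustrative name, not the plugin API symbol):

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a pgid-based proctrack "add" operation.
 * The container id is just a process group id, so adding a pid means
 * moving it into that group with setpgid(2). */
static int container_add(pid_t cont_id, pid_t pid)
{
    /* setpgid() fails (EPERM/ESRCH) for any process the caller did
     * not fork, or that has already exec'ed -- so for the externally
     * launched MPT ranks this silently degenerates into a no-op. */
    if (setpgid(pid, cont_id) < 0)
        return -errno;
    return 0;
}
```

Under this model the MPT launcher's descendants can never be pulled into the step's container, which is exactly why the slurmstepd-ancestry assumption breaks.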


A proposed approach: We believe we could tackle the above problem by
designing an "strack()" interface, similar to sattach(). Something like:
strack(job id, step id, argc, argv)

 * either replace the cont_id and the mother pid,
 * spin up another task within the same slurmstepd instance to
   track/monitor the "other pgid",
 * launch the final "a.out",
 * or, if not, just waitpid(pgid of the SGI MPI launcher), ...

More notes:

 * We use version 2.2.7.
 * We are planning to use single_task() turned on in sgimpi.
 * Our initial plan is to embed the SGI MPI plugin (and related
   patches, changes, etc. ...) into the general SLURM offering.
 * We don't plan to change SGI MPI per se to integrate SLURM, but to
   adapt it with minimal impact for our customers. Technically
   speaking, SGI MPI won't link with libslurm the way mvapich2 does,
   for example.


From the above, our questions are:

 * any suggestions for a better approach?
 * is it feasible given 2.2.7?
 * could 2.3 contain more facilities to help implement the above?
 * are we missing something?


Many many thanks in advance.





-----------------------------------------------------------
Michel Bourget - SGI - Linux Software Engineering
"Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------


Hello,
I implemented SLURM on an SGI UV1000 (256 cores, SSI) machine just a few months ago. I used the CPUSET plugin ...

In this case SLURM allocates the desired number of cores by creating a new cpuset, and the job/tasks are bound by the kernel to these cores;
hence inside this cpuset we can run any kind of process (mpirun, for instance).
We use the same approach for OMP tasks.

Below is the output of a tool I wrote to get a synoptic view of the SLURM job allocation:

*********************************************************************
ID    slurmID        Owner
 a    2         UserId=root(0) GroupId=root(0)
 b    76        UserId=nbianchi(21542) GroupId=csstaff(1000)

Logical CPU mapping:
Blade         ID                            Cores
-------------------------------------------------
    0 r001i01b00  a a b . . . . . b . . . . . . .
    1 r001i01b01  b . . . . . . . b . . . . . . .
    2 r001i01b02  b . . . . . . . b . . . . . . .
    3 r001i01b03  b . . . . . . . b . . . . . . .
    4 r001i01b08  b . . . . . . . b . . . . . . .
    5 r001i01b09  b . . . . . . . b . . . . . . .
    6 r001i01b10  b . . . . . . . b . . . . . . .
    7 r001i01b11  b . . . . . . . b . . . . . . .
    8 r001i01b04  b . . . . . . . b . . . . . . .
    9 r001i01b05  b . . . . . . . b . . . . . . .
   10 r001i01b06  b . . . . . . . b . . . . . . .
   11 r001i01b07  b . . . . . . . b . . . . . . .
   12 r001i01b12  b . . . . . . . b . . . . . . .
   13 r001i01b13  b . . . . . . . b . . . . . . .
   14 r001i01b14  b . . . . . . . b . . . . . . .
   15 r001i01b15  b . . . . . . . b . . . . . . .
-------------------------------------------------
*********************************************************************

My 2 cents.
  Nicola
