As I am currently deploying SLURM on a UV1000 (looking at SLURM's cgroups functionality as implemented by Bull), I am very interested in this work.

Please keep the list informed of the progress on this.

--Jerry

Michel Bourget wrote:
Hello,

we are in the process of integrating SGI MPI (also known as MPT) into SLURM. This is in the context of SGI offering SLURM as a product; hence, it involves support and requires that a simple "srun --mpi=sgimpi a.out" be all our customers have to run. SGI has also committed to fully supporting process tracking and job accounting within the context of SGI MPI.

The problem: SGI MPI has its own launching mechanism, scaled for very large clusters.

   * Specifically, mpirun sends a request to a launcher helper daemon
     and the rest follows.
   * Teaching SLURM to launch on just the first node/task isn't a
     problem. It requires a new, non-intrusive MPI hook call which
     would be a no-op for all other MPI plugins.
   * Because the SGI MPI launcher daemon is actually the
     pgid/container_id of the real MPI processes running on all the
     nodes, we thought we could use the slurm_container_add()
     semantics, etc., only to realize this is actually a no-op in
     many proctrack plugins.
   * More to the point, the container_id is determined before the
     execve() (in exec_task()). Almost everything assumes "slurmstepd"
     is the ancestor of all the tasks to track and monitor,
     accounting-wise. This is not true with SGI MPI.
   * It seems there is no easy way to add a list of ancestor
     pids/descendants (to proctrack) and/or a list of container ids
     (to jobacct_gather plugins).


A proposed approach: We believe we could tackle the above problem by designing an "strack()" interface, similar to sattach(). Something like strack(job_id, step_id, argc, argv), which would:

   * either replace cont_id and the mother pid
   * spin another task within the same slurmstepd instance to
     track/monitor the "other pgid"
   * launch the final "a.out"
   * or, if not, just waitpid() on the pgid of the SGI MPI
     launcher, ...

More notes:

   * we use version 2.2.7
   * We are planning to run with single_task() turned on in sgimpi.
   * Our initial plan is to embed the SGI MPI plugin (and related
     patches, changes, etc.) into the general SLURM offering.
   * We don't plan to change SGI MPI per se to integrate SLURM, but
     rather to adapt it with minimal impact for our customers.
     Technically speaking, SGI MPI won't link with libslurm the way
     mvapich2 does, for example.


From the above, questions are:

   * any suggestions for a better approach?
   * is it feasible given 2.2.7?
   * could 2.3 contain more facilities to help implement the above?
   * are we missing something?


Many many thanks in advance.





-----------------------------------------------------------
     Michel Bourget - SGI - Linux Software Engineering
    "Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------


