As I am currently deploying slurm on a UV1000 (looking at slurm's
cgroups functionality as implemented by Bull), I am very interested in
this work.
Please keep the list informed of the progress on this.
--Jerry
Michel Bourget wrote:
Hello,
we are in the process of integrating SGI MPI (also known as MPT)
into SLURM. This is in the context of SGI offering SLURM as a product;
hence, it involves support and requires that our customer only needs
to run a simple "srun --mpi=sgimpi a.out". SGI has also committed
to fully supporting process tracking and job accounting within the
context of SGI MPI.
The problem: SGI MPI has its own launching mechanism, scaled for
very large clusters.
* Specifically, mpirun sends a request to a launcher helper daemon
and the rest follows.
* Teaching slurm to launch on just the 1st node/task isn't a
  problem. It requires a new, non-intrusive MPI hook call which
  would be a no-op for all other MPI plugins.
* Because the SGI MPI launcher daemon is actually the
  pgid/container_id of the real MPI processes running on all the
  nodes, we thought we could use the slurm_container_add()
  semantics, etc., to realize this; that call is actually a no-op
  in many proctrack plugins.
* More to the point, the container_id is determined before the
  execve() (in exec_task()). Almost everything assumes "slurmstepd"
  is the ancestor of all the tasks to track and monitor,
  accounting-wise. This is not true with SGI MPI.
* It seems there is no easy way to add a list of
  ancestor/descendant pids (to proctrack) and/or a list of
  container ids (to jobacct_gather) plugins.
A proposed approach: We believe we could tackle the above problem by
designing an "strack()" interface, similar to sattach(). Something
like: strack(job id, step id, argc, argv)
* either replace cont_id and the mother pid
* spin another task within the same slurmstepd instance to
track/monitor the "other pgid"
* launch the final "a.out"
* or, if not, just waitpid(pgid of the SGI MPI launcher), ...
More notes:
* we use version 2.2.7
* We are planning to use single_task() turned on in sgimpi
* The initial plan is to embed the SGI MPI plugin (and related
  patches, changes, etc.) into the general SLURM offering.
* We don't plan to change SGI MPI per se to integrate slurm, but
  to adapt it with minimal impact for our customers. Technically
  speaking, SGI MPI won't link with libslurm as mvapich2 does, for
  example.
From the above, our questions are:
* any suggestions for a better approach?
* is it feasible given 2.2.7?
* could 2.3 contain more facilities to help implement the above?
* are we missing something?
Many many thanks in advance.
-----------------------------------------------------------
Michel Bourget - SGI - Linux Software Engineering
"Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------