On Tue, 2 Aug 2011 09:27:35 -0700, Michel Bourget <[email protected]> wrote:
> Hello,
> 
> we are in the process of integrating SGI MPI (also known as MPT) into 
> SLURM. This is in the context of SGI offering SLURM as a product; hence, 
> it involves support and requires that a simple "srun 
> --mpi=sgimpi a.out" is all our customers need to run. SGI has also 
> committed to fully supporting process tracking and job accounting 
> within the context of SGI MPI.
> 
> The problem: SGI MPI has its own launching mechanism, scaled for very 
> large clusters.
> 
>     * Specifically, mpirun sends a request to a launcher helper daemon
>       and the rest follows.
>     * Teaching SLURM to launch on just the 1st node/task isn't a
>       problem. It requires a new, non-intrusive MPI hook call which
>       would be a no-op for all other MPI plugins.
>     * Because the SGI MPI launcher daemon is actually the
>       pgid/container_id of the real MPI processes running on all the
>       nodes, we thought we could use the slurm_container_add()
>       semantics, etc., only to realize this is actually a no-op in
>       many proctrack plugins.
>     * More to the point, the container_id is determined before the
>       execve() (in exec_task()). Almost everything assumes "slurmstepd"
>       is the ancestor of all the tasks to track and monitor,
>       accounting-wise. This is not true with SGI MPI.
>     * It seems there is no easy way to add a list of ancestor/descendant
>       pids (to the proctrack plugins) and/or a list of container ids
>       (to the jobacct_gather plugins).
> 
> A proposed approach: we believe we could tackle the above problem by 
> designing an "strack()" interface, similar to sattach().  Something 
> like: strack(job_id, step_id, argc, argv), which would:
> 
>     * either replace cont_id and the mother pid
>     * spin another task within the same slurmstepd instance to
>       track/monitor the "other pgid"
>     * launch the final "a.out"
>     * or, if not, just waitpid() on the pgid of the SGI MPI
>       launcher, ...


It isn't exactly clear what the strack() interface you are proposing
would do. Is this a replacement for srun? Why not just have SGI MPI
users use your mpirun under a SLURM allocation? If srun isn't really
launching all tasks, then there really isn't that much benefit to
using srun that I can see. 

If you want the MPI launcher to be launched with srun, your mpirun
could run something like "srun -N1 -n1 mpi_launcher..."

Or maybe that is what you are already proposing?

mark

> More notes:
> 
>     * we use version 2.2.7
>     * We are planning to run with single_task() turned on in sgimpi
>     * The initial plan is to embed the SGI MPI plugin (and the related
>       patches, changes, etc.) into the general SLURM offering.
>     * We don't plan to change SGI MPI per se to integrate SLURM, but
>       rather to adapt it with minimal impact on our customers.
>       Technically speaking, SGI MPI won't link with libslurm the way
>       mvapich2 does, for example.
> 
> 
>  From the above, our questions are:
> 
>     * any suggestions for a better approach?
>     * is it feasible given 2.2.7?
>     * could 2.3 contain more facilities to help implement the above?
>     * are we missing something?
> 
> 
> Many many thanks in advance.
> 
> 
> 
> 
> 
> -----------------------------------------------------------
>       Michel Bourget - SGI - Linux Software Engineering
>      "Past BIOS POST, everything else is extra" (travis)
> -----------------------------------------------------------
> 
