On Tue, 2 Aug 2011 09:27:35 -0700, Michel Bourget <[email protected]> wrote:
> Hello,
>
> we are in the process of integrating SGI MPI (also known as MPT) into
> SLURM. This is in the context of SGI offering SLURM as a product; hence,
> it involves support and requires the simple operation "srun
> --mpi=sgimpi a.out" to be performed by our customers. SGI has also
> committed to fully support process tracking and job accounting within
> the context of SGI MPI.
>
> The problem: SGI MPI has its own launching mechanism, scaled for very
> large clusters.
>
> * Specifically, mpirun sends a request to a launcher helper daemon
>   and the rest follows.
> * Teaching SLURM to launch on just the first node/task isn't a
>   problem. It requires a new, non-intrusive MPI hook call which
>   would be a no-op for all other MPI plugins.
> * Because the SGI MPI launcher daemon is actually the pgid/container_id
>   of the real MPI processes running on all the nodes, we thought we
>   could use the slurm_container_add() semantics, etc., only to realize
>   this is actually a no-op in many proctrack plugins.
> * More to the point, the container_id is determined before the
>   execve() (in exec_task()). Almost everything assumes that "slurmstepd"
>   is the ancestor of all the tasks to track and monitor,
>   accounting-wise. This is not true with SGI MPI.
> * It seems there is no easy way to add a list of ancestor-pid
>   descendants (to proctrack) and/or a list of container ids (to
>   jobacct_gather) plugins.
>
> A proposed approach: We believe we could tackle the above problem by
> designing an "strack()" interface, similar to sattach(). Something
> like: strack(job id, step id, argc, argv)
>
> * either replace cont_id and the mother pid,
> * spin up another task within the same slurmstepd instance to
>   track/monitor the "other pgid",
> * launch the final "a.out",
> * or, if not, just waitpid(pgid of the SGI MPI launcher), ...
It isn't exactly clear what the strack() interface you are proposing
would do. Is this a replacement for srun? Why not just have SGI MPI
users run your mpirun under a SLURM allocation? If srun isn't really
launching all the tasks, then there isn't much benefit to using srun
that I can see. If you want the MPI launcher to be launched with srun,
your mpirun could run something like "srun -N1 -n1 mpi_launcher..."
Or maybe that is what you are already proposing?

mark

> More notes:
>
> * we use version 2.2.7
> * We are planning to use single_task() turned on in sgimpi
> * Initial plans are to embed the SGI MPI plugin (and related patches,
>   changes, etc.) into the general SLURM offering.
> * We don't plan to change SGI MPI per se to integrate SLURM, but to
>   adapt it with minimal impact for our customers. Technically
>   speaking, SGI MPI won't link with libslurm as, for example,
>   mvapich2 does.
>
> From the above, the questions are:
>
> * any suggestions for a better approach?
> * is it feasible given 2.2.7?
> * could 2.3 contain more facilities to help implement the above?
> * are we missing something?
>
> Many thanks in advance.
>
> -----------------------------------------------------------
> Michel Bourget - SGI - Linux Software Engineering
> "Past BIOS POST, everything else is extra" (travis)
> -----------------------------------------------------------
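For concreteness, the "srun -N1 -n1 mpi_launcher..." suggestion could look something like the wrapper below. The names "sgi_mpirun_wrapper" and "mpi_launcher" are illustrative placeholders, not real SGI MPI commands, and this assumes an existing allocation obtained via salloc or sbatch:

```shell
#!/bin/sh
# sgi_mpirun_wrapper (hypothetical): run inside a SLURM allocation.
# srun places a single task -- the SGI MPI launcher daemon -- on one
# node of the allocation; the launcher then fans out the real MPI
# processes through SGI MPI's own scalable launch mechanism.
srun -N1 -n1 mpi_launcher "$@"
```

The trade-off is the one raised above: SLURM then tracks and accounts only the launcher task, not the MPI processes it spawns, unless proctrack can be told about the launcher's process group.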
