Just curious, does SGI MPI use Process Aggregates?
http://oss.sgi.com/projects/csa
http://oss.sgi.com/projects/pagg
On Tue, 2 Aug 2011, Michel Bourget wrote:
Date: Tue, 02 Aug 2011 12:27:35 -0400
From: Michel Bourget <[email protected]>
Reply-To: [email protected]
To: [email protected]
Subject: [slurm-dev] SGI MPI (MPT) integration question
Hello,
we are in the process of integrating SGI MPI (also known as MPT) into
SLURM. This is in the context of SGI offering SLURM as a product; hence, it
involves support and requires that a simple "srun --mpi=sgimpi a.out"
be all our customers have to run. SGI has also committed to fully supporting
process tracking and job accounting within the context of SGI MPI.
The problem: SGI MPI has its own launching mechanism, scaled for very
large clusters.
* Specifically, mpirun sends a request to a launcher helper daemon
and the rest follows.
* Teaching SLURM to launch on just the first node/task isn't a
problem. It requires a new, non-intrusive MPI hook call which
would be a no-op for all other MPI plugins.
* Because the SGI MPI launcher daemon is actually the pgid/container_id
of the real MPI processes running on all the nodes, we thought we
could use the slurm_container_add() semantics, etc., only to realize
this is actually a no-op in many proctrack plugins.
* More to the point, the container_id is determined before the
execve() (in exec_task()). Almost everything assumes "slurmstepd" is
the ancestor of all the tasks to track and monitor,
accounting-wise. This is not true with SGI MPI.
* It seems there is no easy way to add a list of ancestor pids/
descendants (to the proctrack plugin) and/or a list of container ids
(to the jobacct_gather plugin).
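To make the ancestry problem above concrete, here is a minimal, self-contained sketch (plain POSIX via Python's os module and Linux /proc, NOT actual proctrack plugin code) of how a pgid-based tracker can find processes without relying on slurmstepd being their ancestor; pids_in_pgrp() is a hypothetical helper name, invented for illustration:

```python
import os
import signal
import time

def pids_in_pgrp(pgid):
    """Enumerate live PIDs whose process group matches pgid.

    This mimics what an ancestry-independent tracker must do: it scans
    /proc instead of walking the slurmstepd process tree, so it finds
    processes even when their parent is an unrelated launcher daemon.
    """
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % entry) as f:
                stat = f.read()
        except OSError:
            continue  # process vanished while we were scanning
        # comm (field 2) may contain spaces, so split after the last ')'.
        # The remaining fields are: state, ppid, pgrp, ... -> pgrp is index 2.
        fields = stat.rsplit(")", 1)[1].split()
        if int(fields[2]) == pgid:
            found.append(int(entry))
    return found

# Demonstration: the child puts itself in a fresh process group, like a
# launcher daemon that is not tracked as a descendant of the tracker.
child = os.fork()
if child == 0:
    os.setpgid(0, 0)
    time.sleep(30)
    os._exit(0)

os.setpgid(child, child)       # set it from the parent too, avoiding a race
members = pids_in_pgrp(child)  # the child's pgid equals its own pid
os.kill(child, signal.SIGKILL)
os.waitpid(child, 0)
print(child in members)        # -> True
```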
A proposed approach: we believe we could tackle the above problem by designing
an "strack()" interface, similar to sattach(). Something like strack(job_id,
step_id, argc, argv), which would:
* either replace the cont_id and the mother pid,
* or spin up another task within the same slurmstepd instance to
track/monitor the "other" pgid,
* then launch the final "a.out",
* or, if not, just waitpid(pgid of the SGI MPI launcher), ...
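The waitpid idea in the last bullet can be sketched as follows (again plain POSIX via Python, with the variable names invented for illustration). Note that waitpid(-pgid) only reaps direct children, which is one reason an extra tracking task inside slurmstepd may still be needed when the launcher is not slurmstepd's descendant:

```python
import os
import time

# The child stands in for the SGI MPI launcher helper: it becomes the
# leader of its own process group, then exits with a known status.
pgid_leader = os.fork()
if pgid_leader == 0:
    os.setpgid(0, 0)
    time.sleep(0.2)
    os._exit(7)

os.setpgid(pgid_leader, pgid_leader)  # set from the parent too (race-free)

# waitpid with a negative pid waits on any child in that process group,
# which is the mechanism "waitpid(pgid of the launcher)" relies on.
pid, status = os.waitpid(-pgid_leader, 0)
print(pid == pgid_leader, os.WEXITSTATUS(status))  # -> True 7
```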
More notes:
* we use version 2.2.7
* We are planning to use single_task() turned on in sgimpi
* The initial plan is to embed the SGI MPI plugin (and related patches
and changes, etc.) into the general SLURM offering.
* We don't plan to change SGI MPI per se to integrate SLURM, but to
adapt it with minimal impact on our customers. Technically
speaking, SGI MPI won't link against libslurm the way mvapich2 does, for example.
From the above, our questions are:
* any suggestions for a better approach?
* is it feasible given 2.2.7?
* could 2.3 contain more facilities to help implement the above?
* are we missing something?
Many many thanks in advance.
-----------------------------------------------------------
Michel Bourget - SGI - Linux Software Engineering
"Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------