On 08/02/2011 01:15 PM, Mark A. Grondona wrote:
On Tue, 2 Aug 2011 09:27:35 -0700, Michel Bourget <[email protected]> wrote:
Hello,
we are in the process of integrating SGI MPI (also known as MPT) into
SLURM. This is in the context of SGI offering SLURM as a product; hence,
it involves support and requires the simple operation "srun
--mpi=sgimpi a.out" to be performed by our customers. SGI has also
committed to fully supporting process tracking and job accounting
within the context of SGI MPI.
The problem: SGI MPI has its own launching mechanism, scaled for very
large clusters.
* Specifically, mpirun sends a request to a launcher helper daemon
and the rest follows.
* Teaching SLURM to launch on just the first node/task isn't a
problem. It requires a new, non-intrusive MPI hook call which
would be a no-op for all other MPI plugins.
* Because the SGI MPI launcher daemon is actually the pgid/container_id
of the real MPI processes running on all the nodes, we thought we
could use the slurm_container_add() semantics, etc., only to realize
this is actually a no-op in many proctrack plugins.
* More to the point, the container_id is determined before the
execve() (in exec_task()). Almost everything assumes "slurmstepd" is
the ancestor of all the tasks to track and monitor,
accounting-wise. This is not true with SGI MPI.
* It seems there is no easy way to add a list of ancestor pids and
their descendants (to proctrack) and/or a list of container ids (to
jobacct_gather) plugins.
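The ancestry problem described above can be illustrated with a small sketch (Python here purely for illustration; the real proctrack plugins are C). Once a launcher daemon calls setsid(), it leaves the process group that pgid-based tracking watches, so anything accounted as "everything in slurmstepd's pgid" misses it and its descendants:

```python
import os
import time

def spawn_detached():
    """Fork a child that detaches into its own session/process group,
    mimicking a launcher daemon that adopts the real MPI tasks."""
    pid = os.fork()
    if pid == 0:
        os.setsid()        # child becomes leader of a new session + pgid
        time.sleep(0.5)
        os._exit(0)
    return pid

parent_pgid = os.getpgid(0)
child = spawn_detached()
time.sleep(0.2)            # give the child time to call setsid()
child_pgid = os.getpgid(child)
# Any tracker that accounts for "everything in the stepd's pgid"
# now misses the child and all of its descendants.
print(parent_pgid != child_pgid)
os.waitpid(child, 0)
```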
A proposed approach: We believe we could tackle the above problem by
designing an "strack()" interface, similar to sattach(). Something
like: strack(job_id, step_id, argc, argv)
* either replace the cont_id and the mother pid
* or spin up another task within the same slurmstepd instance to
track/monitor the "other pgid"
* launch the final "a.out"
* or, if not, just waitpid() on the pgid of the SGI MPI launcher, ...
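A very rough sketch of what such an interface could look like (the name strack, the in-memory registry, and the argument shape are all hypothetical; in SLURM the state would live in slurmstepd and the proctrack plugin, in C):

```python
import os

# Hypothetical in-memory stand-in for a proctrack container registry.
containers = {}

def strack(job_id, step_id, argv=None):
    """Sketch of the proposed strack(): register the caller's pgid with
    the step's container, then optionally launch the final program."""
    key = (job_id, step_id)
    containers.setdefault(key, set()).add(os.getpgid(0))
    if argv:
        pid = os.fork()
        if pid == 0:
            os.execvp(argv[0], argv)
        os.waitpid(pid, 0)   # or: just waitpid() on the launcher's pgid

strack(1234, 0)
print(containers[(1234, 0)])
```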
It isn't exactly clear what the strack() interface you are proposing
would do. Is this a replacement for srun?
Hello Mark,
No. Just like sattach() isn't a replacement for srun.
Why not just have SGI MPI
users use your mpirun under a SLURM allocation?
The contractual requirement we have is 'srun a.out', i.e. NO mpirun.
If srun isn't really
launching all tasks, then there really isn't that much benefit to
using srun that I can see.
The SGI MPI plugin takes care of that by translating SLURM_STEP_NODELIST
and SLURM_STEP_TASKS_PER_NODE into the appropriate sgimpi specifications.
Then, it is launched only when SLURM_PROCID=0. I can see similarities with
the "HAVE_FRONT_END" concept.
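In outline, the plugin's translation step could look something like the sketch below. Note this is illustrative only: the real SLURM variables use compressed forms (nodelists like "n[0-1]", task counts like "2(x2)"), which this sketch ignores, and the output format is a made-up stand-in for the actual sgimpi specification:

```python
import os

def sgimpi_launch_spec(env):
    """Illustrative translation of SLURM step variables into an
    mpirun-style spec, acting only on the first task.  Every task
    with SLURM_PROCID != 0 is a no-op."""
    if env.get("SLURM_PROCID") != "0":
        return None
    # Real values are compressed hostlists/counts; plain lists used here.
    nodes = env["SLURM_STEP_NODELIST"].split(",")
    tasks = env["SLURM_STEP_TASKS_PER_NODE"].split(",")
    # Pair each node with its task count, e.g. "n0 2, n1 2".
    return ", ".join(f"{n} {t}" for n, t in zip(nodes, tasks))

env = {"SLURM_PROCID": "0",
       "SLURM_STEP_NODELIST": "n0,n1",
       "SLURM_STEP_TASKS_PER_NODE": "2,2"}
print(sgimpi_launch_spec(env))
```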
If you want the MPI launcher to be launched with srun, your mpirun
could run something like "srun -N1 -n1 mpi_launcher..."
Or maybe that is what you are already proposing?
I am proposing something like sattach(): strack(job_id, step_id, ...).
Given a valid job_id.step_id pair, it would teach the plugins currently
loaded by slurmd or slurmstepd to track (add? replace?) the current
pid/pgid/whatnot (or supplied values).
I mention "replace" because we could imagine the initial slurmstepd
going away while another instance of slurmstepd (triggered by an
strack) kicks in.
I mention "add" because we could imagine the initial slurmstepd staying
there while another, parallel slurmstepd is kicked off to monitor other
pid/pgid pairs. The issue with that is that linuxproc/jobacct_gather are
single-container_id oriented. Or am I completely wrong?
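The "add" semantics boil down to one step being tracked via several pgids instead of a single container id. A minimal sketch of that generalization (all names hypothetical; today's linuxproc/jobacct_gather assume one container id per step, which is exactly the limitation being discussed):

```python
import os

class StepContainer:
    """Sketch of 'add' semantics: one step tracked through a set of
    process groups rather than a single container id."""
    def __init__(self):
        self.pgids = set()

    def add(self, pgid):
        self.pgids.add(pgid)

    def has_pid(self, pid):
        """Accounting test: does this pid belong to any tracked pgid?"""
        try:
            return os.getpgid(pid) in self.pgids
        except ProcessLookupError:
            return False

c = StepContainer()
c.add(os.getpgid(0))       # the original slurmstepd pgid
c.add(os.getpgid(0) + 1)   # placeholder for the SGI MPI launcher's pgid
print(c.has_pid(os.getpid()))
```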
mark
More notes:
* we use version 2.2.7
* We are planning to use single_task() turned on in sgimpi
* The initial plan is to embed the SGI MPI plugin (and related patches,
changes, etc.) into the general SLURM offering.
* We don't plan to change SGI MPI per se to integrate SLURM, but to
adapt it with minimal impact for our customers. Technically
speaking, SGI MPI won't link with libslurm, as mvapich2 does for example.
From the above, the questions are:
* any suggestions for a better approach?
* is it feasible given 2.2.7?
* could 2.3 contain more facilities to help implement the above?
* are we missing something?
Many many thanks in advance.
-----------------------------------------------------------
Michel Bourget - SGI - Linux Software Engineering
"Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------