On 08/02/2011 12:53 PM, Kenneth Yoshimoto wrote:
Just curious, does SGI MPI use Process Aggregates?
http://oss.sgi.com/projects/csa
http://oss.sgi.com/projects/pagg
Hi Kenneth,
No. pagg was essentially rejected by the kernel community. csa would not
be useful, to be honest.
However, we developed an alternative user-land solution called
'procset' that requires no kernel hooks. It actually uses the standard
kernel process connector socket to monitor the pid (and its children) it
is told to watch. The SGI MPT launcher (arrayd) uses procset to assign
what we call an ASH (Array Session Handle), which (1) is inherited by
all the children and (2) gets propagated (by arrayd) to all MPI tasks.
This is very similar to 'sgi-job'.
But even with sgi-job (job_id) or procset (ash), the fundamental
issue is that the
job_id of slurmstepd != job_id of the MPI process ancestors.
The same holds if you substitute pgid, etc. for job_id.
Hence, this idea of an strack().
All the proctrack/jobacct_gather plugins can only handle ONE pid/pgid/job_id
descendant tree.
This is another aspect of the fundamental issue.
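
To make the single-tree limitation concrete, here is a minimal sketch.
It is not SLURM or MPT code: the process table, the pids, and the
descendants() helper are all made up for illustration. It only shows
that a tracker rooted at slurmstepd never reaches tasks whose ancestor
is the launcher daemon.

```python
# Hypothetical process table: pid -> parent pid. All pids and roles are
# invented for illustration; none of this is SLURM or MPT code.
PARENT = {
    100: 1,    # slurmstepd
    101: 100,  # mpirun, started by slurmstepd
    200: 1,    # arrayd, the launcher helper daemon
    201: 200,  # MPI rank 0, started by arrayd
    202: 200,  # MPI rank 1, started by arrayd
}


def descendants(root, parent=PARENT):
    """Return root plus every pid whose ancestor chain reaches root."""
    found = {root}
    changed = True
    while changed:
        changed = False
        for pid, ppid in parent.items():
            if ppid in found and pid not in found:
                found.add(pid)
                changed = True
    return found


tracked = descendants(100)          # what a single-root tracker sees
mpi_ranks = {201, 202}
print(sorted(tracked))              # the slurmstepd tree only
print(sorted(mpi_ranks - tracked))  # ranks the tracker never sees
```

In this toy table both MPI ranks fall outside the slurmstepd tree,
which is the situation described above.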
Besides, as for sgi-job, SGI doesn't support it anymore. Hence, it's not a
tool we can use for the slurm/sgimpi proposal to SGI customers, etc.
On Tue, 2 Aug 2011, Michel Bourget wrote:
Date: Tue, 02 Aug 2011 12:27:35 -0400
From: Michel Bourget <[email protected]>
Reply-To: [email protected]
To: [email protected]
Subject: [slurm-dev] SGI MPI (MPT) integration question
Hello,
we are in the process of integrating SGI MPI (also known as MPT)
into SLURM. This is in the context of SGI offering SLURM as a product;
hence, it involves support and requires that a simple "srun
--mpi=sgimpi a.out" work for our customers. SGI has also
committed to fully support process tracking and job accounting within
the context of SGI MPI.
The problem: SGI MPI has its own launching mechanism, scaled for very
large clusters.
* Specifically, mpirun sends a request to a launcher helper daemon
and the rest follows.
* Teaching slurm to launch on just the 1st node/task isn't a
problem. It requires a new, non-intrusive mpi hook call, which
would be a no-op for all the other MPI plugins.
* Because the SGI MPI launcher daemon is actually the pgid/container_id
of the real MPI processes running on all the nodes, we thought we
could use the slurm_container_add() semantics, etc., only to realize it
is actually a no-op in many proctrack plugins.
* More to the point, the container_id is determined before the
execve() (in exec_task()). Almost everything assumes "slurmstepd" is
the ancestor of all the tasks to track and monitor,
accounting-wise. This is not true with SGI MPI.
* There seems to be no easy way to add a list of ancestor pids and
their descendants (to the proctrack plugins) and/or a list of
container ids (to the jobacct_gather plugins).
A proposed approach: We believe we could tackle the above problem by
designing an "strack()" interface, similar to sattach(). Something
like: strack(job id, step id, argc, argv)
* either replace cont_id and the mother pid
* spin another task within the same slurmstepd instance to
track/monitor the "other pgid"
* launch the final "a.out"
* or, if not, just waitpid(pgid of the SGI MPI launcher), ...
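
As a rough illustration of the idea only, the sketch below models a
step tracker that can be told about a second root. Everything in it is
an assumption: StepTracker, its strack() method, and the process table
are hypothetical and do not correspond to any existing SLURM interface
or to how the real strack() would be implemented.

```python
# Hypothetical process table: pid -> parent pid (invented for illustration).
PARENT = {
    100: 1,    # slurmstepd
    101: 100,  # mpirun, started by slurmstepd
    200: 1,    # SGI MPI launcher daemon
    201: 200,  # MPI rank 0
    202: 200,  # MPI rank 1
}


def under_any_root(pid, roots, parent):
    """Walk up the ancestor chain; True if it reaches one of the roots."""
    seen = set()
    while pid not in roots:
        if pid in seen or pid not in parent:
            return False  # cycle, or ran off the top of the table
        seen.add(pid)
        pid = parent[pid]
    return True


class StepTracker:
    """Toy model of a job step that can follow more than one process tree."""

    def __init__(self, root_pid):
        self.roots = {root_pid}

    def strack(self, extra_root_pid):
        # Hypothetical counterpart of the proposed strack(): add another root.
        self.roots.add(extra_root_pid)

    def tracked(self, parent):
        """Return every pid in the table that sits under a registered root."""
        return {p for p in parent if under_any_root(p, self.roots, parent)}


step = StepTracker(100)
before = step.tracked(PARENT)   # slurmstepd tree only
step.strack(200)                # register the launcher as a second root
after = step.tracked(PARENT)    # now includes the MPI ranks
print(sorted(before), sorted(after))
```

The point of the toy model is only that the set of tracked pids grows
once a second root is registered, without changing who the first root is.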
More notes:
* we use version 2.2.7
* We are planning to use single_task() turned on in sgimpi
* The initial plan is to embed the SGI MPI plugin (and related patches
and changes, etc.) into the general SLURM offering.
* We don't plan to change SGI MPI per se to integrate slurm, but to
adapt it with minimal impact for our customers. Technically
speaking, SGI MPI won't link with libslurm as mvapich2 does, for
example.
From the above, questions are:
* any suggestions for a better approach ?
* is it feasible given 2.2.7 ?
* could 2.3 contain more facilities to help solve the above
problems ?
* are we missing something ?
Many many thanks in advance.
-----------------------------------------------------------
Michel Bourget - SGI - Linux Software Engineering
"Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------