On 08/02/2011 12:53 PM, Kenneth Yoshimoto wrote:

Just curious, does SGI MPI use Process Aggregates?

http://oss.sgi.com/projects/csa
http://oss.sgi.com/projects/pagg


Hi Kenneth,

No. pagg was essentially rejected by the kernel community, and csa would not be useful, to be honest.

However, we developed an alternative user-land solution called 'procset' that requires no kernel hooks. It actually uses the standard kernel process connector socket to monitor the pid (and its children) it is told to watch. The SGI MPT launcher (arrayd) uses procset to assign what we call an ASH (Array Session Handle), which (1) is inherited by all children and (2) gets propagated (by arrayd) to all MPI tasks. This is very similar to 'sgi-job'.

But, even with sgi-job (job_id) or procset (ash), the fundamental issue is that

    job_id of slurmstepd != job_id of the MPI process ancestors.

Substitute job_id with pgid, etc ...
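
To make the pgid variant of this concrete, here is a minimal sketch in Python. It is not SLURM or MPT code, and the helper name pgid_of_child is made up for illustration. It assumes a POSIX system and only shows that a process started in its own session ends up with a pgid different from its launcher's, so anything keyed on the launcher's pgid would not see it:

```python
import os
import subprocess
import sys


def pgid_of_child(new_group: bool) -> int:
    """Start a short-lived child and return its process group id.

    Hypothetical helper for illustration only.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(5)"],
        # start_new_session=True makes the child call setsid(), which
        # puts it in a new session and a new process group.
        start_new_session=new_group,
    )
    try:
        return os.getpgid(child.pid)
    finally:
        child.kill()
        child.wait()


my_pgid = os.getpgid(0)                      # pgid of this (launching) process
inherited = pgid_of_child(new_group=False)   # child stays in our group
detached = pgid_of_child(new_group=True)     # child leaves our group

print("launcher pgid:", my_pgid)
print("inherited child pgid:", inherited)
print("detached child pgid:", detached)
```

The "detached" case is the analogue of the MPI processes here: they belong to a group that the launching side never created.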

Hence, this idea of an strack().

All the proctrack/jobacct_gather plugins can only handle ONE pid/pgid/job_id descendant tree.
This is another aspect of the fundamental issue.
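
As a rough illustration of what tracking more than one tree would mean, the sketch below collects the descendants of several root pids from a pid -> ppid table. The function name, the table and all pid values are invented for the example; none of this is an existing proctrack or jobacct_gather interface.

```python
from collections import defaultdict


def descendants(ppid_of: dict, roots: set) -> set:
    """Return every pid in the trees rooted at any pid in `roots`.

    `ppid_of` maps pid -> parent pid. Hypothetical helper, not a SLURM API.
    """
    children = defaultdict(list)
    for pid, ppid in ppid_of.items():
        children[ppid].append(pid)

    seen = set()
    stack = list(roots)
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue
        seen.add(pid)
        stack.extend(children[pid])
    return seen


# Made-up process table: 100 stands for slurmstepd, 200 for the
# per-job launcher helper that actually parents the MPI tasks.
ppid_of = {
    100: 1,    # slurmstepd
    101: 100,  # mpirun, started by slurmstepd
    200: 1,    # launcher helper, NOT a descendant of slurmstepd
    201: 200,  # MPI task
    202: 200,  # MPI task
}

print(sorted(descendants(ppid_of, {100})))       # one tree only
print(sorted(descendants(ppid_of, {100, 200})))  # both trees
```

With a single root the MPI tasks are simply absent from the result, which is the situation described above.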

Besides, as for sgi-job, SGI doesn't support it anymore. Hence, it's not a tool we can use for the slurm/sgimpi proposal to SGI customers, etc ...


On Tue, 2 Aug 2011, Michel Bourget wrote:

Date: Tue, 02 Aug 2011 12:27:35 -0400
From: Michel Bourget <[email protected]>
Reply-To: [email protected]
To: [email protected]
Subject: [slurm-dev] SGI MPI (MPT) integration question

Hello,

We are in the process of integrating SGI MPI (also known as MPT) into SLURM. This is in the context of SGI offering SLURM as a product; hence, it involves support and requires that a simple "srun --mpi=sgimpi a.out" can be run by our customers. SGI is also committed to fully supporting process tracking and job accounting within the context of SGI MPI.

The problem: SGI MPI has its own launching mechanism, designed to scale to very, very large clusters.

   * Specifically, mpirun sends a request to a launcher helper daemon
     and the rest follows.
   * Teaching slurm to launch on just the 1st node/task isn't a
     problem. It requires a new, non-intrusive MPI hook call, which
     would be a no-op for every other MPI plugin.
   * Because the SGI MPI launcher daemon is actually the pgid/container_id
     of the real MPI processes running on all the nodes, we thought we
     could use slurm_container_add() semantics, etc ..., only to realize
     this is actually a no-op in many, many proctrack plugins.
   * More to the point, the container_id is determined before the
     execve() (in exec_task()). Almost everything assumes "slurmstepd" is
     the ancestor of all the tasks to track and monitor,
     accounting-wise. This is not true with SGI MPI.
   * There seems to be no easy way to add a list of ancestor pids whose
     descendants should be tracked (to proctrack) and/or a list of
     container ids (to jobacct_gather plugins).


A proposed approach: We believe we could tackle the above problem by designing an "strack()" interface, similar to sattach(). Something like: strack(job id, step id, argc, argv)

   * either replace cont_id and the mother pid
   * spin another task within the same slurmstepd instance to
     track/monitor the "other pgid"
   * launch the final "a.out"
   * or, if not, just waitpid(pgid of the SGI MPI launcher), ...
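
For the last option, a minimal sketch of waiting on a whole process group, written in Python rather than as slurmstepd code. The helper names spawn_in_group and wait_for_group are hypothetical. One assumption matters a lot: waitpid() can only reap children of the calling process, so this only covers the case where the waiter itself started the group members; processes started by a separate launcher daemon would need another mechanism.

```python
import os
import time


def spawn_in_group(pgid: int, exit_code: int) -> int:
    """Fork a child, place it in process group `pgid` (0 = new group)."""
    pid = os.fork()
    if pid == 0:
        # Child: join the group, pretend to work, then exit.
        try:
            os.setpgid(0, pgid)
            time.sleep(0.1)
        finally:
            os._exit(exit_code)
    # Parent: set the group too, so it is in place before we wait.
    try:
        os.setpgid(pid, pgid)
    except OSError:
        pass  # the child already did it (or has already exited)
    return pid


def wait_for_group(pgid: int) -> dict:
    """Reap every child in process group `pgid`; return pid -> exit code."""
    statuses = {}
    while True:
        try:
            # A negative pid means "any child whose pgid equals abs(pid)".
            pid, status = os.waitpid(-pgid, 0)
        except ChildProcessError:
            break  # no children left in that group
        statuses[pid] = os.waitstatus_to_exitcode(status)
    return statuses


leader = spawn_in_group(0, 0)        # becomes leader of a new group
member = spawn_in_group(leader, 3)   # joins the leader's group
statuses = wait_for_group(leader)
print(statuses)
```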

More notes:

   * We use version 2.2.7
   * We are planning to use single_task() turned on in sgimpi
   * The initial plan is to embed the SGI MPI plugin (and related patches
     and changes, etc ...) into the general SLURM offering.
   * We don't plan to change SGI MPI per se to integrate slurm, but to
     adapt it with minimal impact for our customers. Technically
     speaking, SGI MPI won't link with libslurm as mvapich2 does, for
     example.


From the above, questions are:

   * any suggestions for a better approach ?
   * is it feasible given 2.2.7 ?
   * could 2.3 contain more facilities to help solve the above
     problems ?
   * are we missing something ?


Many many thanks in advance.





-----------------------------------------------------------
    Michel Bourget - SGI - Linux Software Engineering
   "Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------




