On 08/02/2011 01:15 PM, Mark A. Grondona wrote:
On Tue, 2 Aug 2011 09:27:35 -0700, Michel Bourget <[email protected]> wrote:
Hello,

we are in the process of integrating SGI MPI (also known as MPT) into
SLURM. This is in the context of SGI offering SLURM as a product; hence,
it involves support and requires that our customers can simply run "srun
--mpi=sgimpi a.out". SGI has also committed to fully supporting process
tracking and job accounting within the context of SGI MPI.

The problem: SGI MPI has its own launching mechanism, scaled for very
large clusters.

     * Specifically, mpirun sends a request to a launcher helper daemon
       and the rest follows.
     * Teaching SLURM to launch on just the first node/task isn't a
       problem. It requires a new, non-intrusive MPI hook that would
       be a no-op for all other MPI plugins.
     * Because the SGI MPI launcher daemon is actually the
       pgid/container_id of the real MPI processes running on all the
       nodes, we thought we could use the slurm_container_add()
       semantics, etc., only to realize this is actually a no-op in
       many proctrack plugins.
     * More to the point, the container_id is determined before the
       execve() (in exec_task()). Almost everything assumes "slurmstepd"
       is the ancestor of all the tasks to track and monitor,
       accounting-wise. This is not true with SGI MPI.
     * It seems there is no easy way to add a list of ancestor/descendant
       pids (to the proctrack plugins) and/or a list of container ids
       (to the jobacct_gather plugins).

A proposed approach: We believe we could tackle the above problem by
designing an "strack()" interface, similar to sattach(). Something
like: strack(job_id, step_id, argc, argv)

     * either replace the cont_id and the mother pid
     * spin up another task within the same slurmstepd instance to
       track/monitor the "other pgid"
     * launch the final "a.out"
     * or, if not, just waitpid(pgid of the SGI MPI launcher), ...

It isn't exactly clear what the strack() interface you are proposing
would do. Is this a replacement for srun?


Hello Mark,

No. Just like sattach() isn't a replacement for srun.

Why not just have SGI MPI
users use your mpirun under a SLURM allocation?

The contractual requirement we have is 'srun a.out', i.e. NO mpirun.

If srun isn't really
launching all tasks, then there really isn't that much benefit to
using srun that I can see.


The SGI MPI plugin takes care of that by translating SLURM_STEP_NODELIST
and SLURM_STEP_TASKS_PER_NODE into appropriate sgimpi specifications.
Then, it is launched only when SLURM_PROCID=0. I can see similarities with
the "HAVE_FRONT_END" concept.

If you want the MPI launcher to be launched with srun, your mpirun
could run something like "srun -N1 -n1 mpi_launcher..."

Or maybe that is what you are already proposing?

I am proposing something like sattach(). strack(job_id, step_id, ...)

Given a valid job_id.step_id pair, it would teach the plugins currently loaded in slurmd or slurmstepd to track (add, or replace?) the current pid/pgid/whatnot (or supplied values).

I mention "replace" because the initial slurmstepd could go away while another instance of slurmstepd (triggered by an strack) kicks in.

I mention "add" because the initial slurmstepd could stay there while another, parallel slurmstepd is kicked in to monitor other pid/pgid pairs. The issue with that is that linuxproc/jobacct_gather are single-container_id oriented. Or am I completely wrong?

mark



More notes:

     * we use version 2.2.7
     * We are planning to use single_task() turned on in sgimpi
     * The initial plan is to embed the SGI MPI plugin (and related
       patches, changes, etc.) into the general SLURM offering.
     * We don't plan to change SGI MPI per se to integrate SLURM, but
       to adapt it with minimal impact for our customers. Technically
       speaking, SGI MPI won't link with libslurm as MVAPICH2 does, for
       example.


From the above, the questions are:

     * any suggestions for a better approach?
     * is it feasible given 2.2.7?
     * could 2.3 contain more facilities to help implement the above?
     * are we missing something?


Many many thanks in advance.





-----------------------------------------------------------
       Michel Bourget - SGI - Linux Software Engineering
      "Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------



