There is already a Slurm MPI hook that tells Slurm to launch one task per node:
int p_mpi_hook_client_single_task_per_node(void)
{
        return true;
}

Slurm version 2.3 will be released soon. Anything that you make for version 2.2 should work with version 2.3 with little to no change.

There are several Slurm container (proctrack) plugins. As you note, many of them would not permit adding new processes (e.g. those based upon process group id, parent process id tree, etc.). Your new code should probably test that a valid proctrack plugin is configured so no processes go unaccounted for.


Quoting Michel Bourget <[email protected]>:

Hello,

we are in the process of integrating SGI MPI (also known as MPT) into
SLURM. This is in the context of SGI offering SLURM as a product;
hence, it involves support and requires a simple "srun --mpi=sgimpi
a.out" operation to be performed by our customers. SGI has also
committed to fully supporting process tracking and job accounting
within the context of SGI MPI.

The problem: SGI MPI has its own launching mechanism, scaled for very
large clusters.

   * Specifically, mpirun sends a request to a launcher helper daemon
     and the rest follows.
   * Teaching slurm to launch on just the 1st node/task isn't a
     problem. It requires a new, non-intrusive MPI hook call which
     would be a no-op for all other MPI plugins.
   * Because the SGI MPI launcher daemon is actually the
     pgid/container_id of the real MPI processes running on all the
     nodes, we thought we could use the slurm_container_add()
     semantics, etc., only to realize this is actually a no-op in
     many proctrack plugins.
   * More to the point, the container_id is determined before the
     execve() (in exec_task()).  Almost everything assumes "slurmstepd"
     is the ancestor of all the tasks to track and monitor,
     accounting-wise. This is not true with SGI MPI.
   * It seems there is no easy way to add a list of ancestor pid
     descendants (to proctrack) and/or a list of container ids (to
     jobacct_gather) plugins.


A proposed approach: We believe we could tackle the above problem by
designing an "strack()" interface, similar to sattach().  Something
like: strack(job id, step id, argc, argv)

   * either replace cont_id and the mother pid
   * spin up another task within the same slurmstepd instance to
     track/monitor the "other pgid"
   * launch the final "a.out"
   * or, if not, just waitpid(pgid of the SGI MPI launcher), ...

More notes:

   * we use version 2.2.7
   * We are planning to use single_task() turned on in sgimpi
   * The initial plan is to embed the SGI MPI plugin (and related
     patches, changes, etc.) into the general SLURM offering.
   * We don't plan to change SGI MPI per se to integrate slurm, but
     to adapt it with minimal impact for our customers. Technically
     speaking, SGI MPI won't link with libslurm as mvapich2 does,
     for example.


From the above, questions are:

   * any suggestions for a better approach?
   * is it feasible given 2.2.7?
   * could 2.3 contain more facilities to help implement the above?
   * are we missing something?


Many many thanks in advance.





-----------------------------------------------------------
     Michel Bourget - SGI - Linux Software Engineering
    "Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------



