There is already a Slurm MPI hook that tells Slurm to launch one task
per node:
int p_mpi_hook_client_single_task_per_node(void)
{
	return true;
}
Slurm version 2.3 will be released soon. Anything that you make for
version 2.2 should work with version 2.3 with little to no change.
There are several Slurm container (proctrack) plugins. As you note,
many of them would not permit adding new processes (e.g. those based
upon process group id, parent process id tree, etc.). Your new code
should probably test that a valid proctrack plugin is configured so no
processes go unaccounted for.
Quoting Michel Bourget <[email protected]>:
Hello,
we are in the process of integrating SGI MPI (also known as MPT) into
SLURM. This is in the context of SGI offering SLURM as a product; hence,
it involves support and requires a simple operation of "srun
--mpi=sgimpi a.out" to be performed by our customers. SGI has also
committed to fully support process tracking and job accounting within
the context of SGI MPI.
The problem: SGI MPI has its own launching mechanism, scaled for very
large clusters.
* Specifically, mpirun sends a request to a launcher helper daemon
and the rest follows.
* Teaching slurm to launch on just the 1st node/task isn't a
  problem. It requires a new, non-intrusive MPI hook call which
  would be a no-op for all other MPI plugins.
* Because the SGI MPI launcher daemon is actually the pgid/container_id
  of the real MPI processes running on all the nodes, we thought we
  could use the slurm_container_add() semantics, etc., only to realize
  this is actually a no-op in many proctrack plugins.
* More to the point, the container_id is determined before the
  execve() (in exec_task()). Almost everything assumes "slurmstepd" is
  the ancestor of all the tasks to track and monitor,
  accounting-wise. This is not true with SGI MPI.
* It seems there is no way to easily add a list of ancestor pids and
  descendants (to the proctrack plugins) and/or a list of container
  ids (to the jobacct_gather plugins).
A proposed approach: We believe we could tackle the above problem by
designing an "strack()" interface, similar to sattach(). Something
like: strack(job id, step id, argc, argv)
* either replace cont_id and the mother pid
* spin another task within the same slurmstepd instance to
track/monitor the "other pgid"
* launch the final "a.out"
* or, if not, just waitpid() on the pgid of the SGI MPI launcher, ...
More notes:
* we use version 2.2.7
* We are planning to use single_task() turned on in sgimpi
* The initial plan is to embed the SGI MPI plugin (and related
  patches, changes, etc.) into the general SLURM offering.
* We don't plan to change SGI MPI per se to integrate slurm, but
  rather to adapt it with minimal impact for our customers. Technically
  speaking, SGI MPI won't link with libslurm as mvapich2 does, for example.
From the above, questions are:
* any suggestions for a better approach?
* is it feasible given 2.2.7?
* could 2.3 contain more facilities to help implement the above?
* are we missing something?
Many many thanks in advance.
-----------------------------------------------------------
Michel Bourget - SGI - Linux Software Engineering
"Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------