On 08/02/2011 04:47 PM, Mark A. Grondona wrote:
I am proposing something like sattach(): strack( job_id, step_id, ... ).
Given a valid job_id.step_id pair, it would teach the plugins currently
loaded by slurmd or slurmstepd to track ( add? replace? ) the current
pid/pgid/whatnot ( or supplied values ).
I mention "replace" because the initial slurmstepd could go away while
another instance of slurmstepd ( triggered by an strack ) kicks in.
I mention "add" because the initial slurmstepd could stay there while
another, parallel slurmstepd is started to monitor another pid/pgid
pair. The issue with that is that linuxproc/job_acct_gather are oriented
around a single container_id. Or am I completely wrong?
Ok, I think I understand what you are proposing with strack(), but it
might help if you walk through an MPI launch using your proposed
strack() interface.
I am still not 100% sure why the SGI mpi launcher would need something
like strack, but that probably isn't your fault. I'm sorry if I'm being
unnecessarily dense.
Hi Mark,
no, it's a very good question.
Summary: slurmstepd triggers the launch, but the launch itself is
performed by MPT mpirun, which makes a request to another daemon called
arrayd ( the SGI MPT launcher, so to speak ). That arrayd sends the job
to the arrayd on each node, which in turn starts a.out, which in turn
triggers the SGI libmpi.so "init()" function. I believe the "init()"
part is pretty standard. Note that the "init()" part is run once per
node and launches the required task(s) as set up by the user.
The 'pstree' below illustrates the previous summary better. Here is an
example of an 'srun -l -N2 -n 4 mpi_hello'. Note that we use the
single_task mpi plugin hook:
0: r1i3n0  PPID   PID  PGID COMMAND
0: r1i3n0     1 11109 11108 slurmstepd
0: r1i3n0 11109 11113 11113  \_ mpiexec_mpt
0: r1i3n0 11113 11204 11113      \_ mpirun
The '-l' task id is always 0 because this is the only slurmstepd
actually sending the job. We key off $SLURM_PROCID == 0
to make that decision. Yes, it's a new mpi hook to modify
argc and argv.
mpirun sends the entire request to the arrayds, which in turn launch
the actual 'mpi_hello'.
0: r1i3n0     1  3750  3749 slurmd
0: r1i3n0     1  2407  2407 arrayd
0: r1i3n0  2407 11218 11218  \_ mpi_hello
0: r1i3n0 11218 11224 11218      \_ mpi_hello
0: r1i3n0 11218 11225 11218      \_ mpi_hello
Note that we tell slurm to launch a single task because the SGI mpi
daemon ( pid 11218 ) will manage that part itself. Note also that
arrayd's pgid != the pgid of mpi_hello. We do that to protect arrayd
from its children's signals, of course, just like anyone would.
And, of course, the root problem: we need to teach slurm to monitor
not only pgid 11113 but also pgid 11218, which is disconnected from,
and external to, slurm.
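
To illustrate why a pgid-based tracker loses sight of such tasks, here
is a minimal Python sketch ( not SGI or slurm code ). It assumes a
POSIX system; spawn_detached is a hypothetical helper name. The child
moves into its own process group, much as the mpi_hello daemon ends up
in pgid 11218 rather than in the pgid of its launcher:

```python
import os


def spawn_detached():
    """Fork a child that moves itself into a new process group.

    Returns (child_pid, child_pgid, parent_pgid). Hypothetical helper,
    for illustration only.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: become the leader of a new process group.
        os.close(read_fd)
        os.setpgid(0, 0)
        os.write(write_fd, str(os.getpgrp()).encode())
        os.close(write_fd)
        os._exit(0)
    # Parent: read the pgid reported by the child, then reap it.
    os.close(write_fd)
    child_pgid = int(os.read(read_fd, 64).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    return pid, child_pgid, os.getpgrp()
```

After setpgid(0, 0) the child's pgid equals its own pid, so anything
that only watches the parent's pgid no longer sees the child or its
descendants.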
Now, on the 2nd node, we have ...
0: r1i3n1  PPID   PID  PGID COMMAND
0: r1i3n1     1  3607  3606 slurmd
0: r1i3n1     1  2388  2388 arrayd
0: r1i3n1  2388 10192 10192  \_ mpi_hello
0: r1i3n1 10192 10195 10192      \_ mpi_hello
0: r1i3n1 10192 10196 10192      \_ mpi_hello
Now, on every other node, where SLURM_PROCID != 0,
slurmstepd is not involved for the moment.
But wait ... this is temporary. That needs to
change.
So far, launching works, but that's about it. It's not right, of course:
slurm has no clue about the resources being consumed, etc. on "each
other node", i.e. r1i3n1 in this case.
So, this is why we are talking about an strack() API ( and likely an
strack wrapper ), where slurmstepd facilities would now be ( indirectly )
launched on every other node, as opposed to the above model. There we
would not launch mpiexec_mpt and/or mpi_hello, but rather an
"strack <params>", in a similar fashion to sattach.
Anyway, these are implementation details, but something like:
strack pgid : add the pgid "pidtree" for proctrack/job_acct_gather to
monitor.
strack would be responsible for providing the pgid.
strack would use the current jobId.stepId. Maybe a command-line
argument could change that.
strack would waitpid( -pgid ).
The above is just a rough draft definition :)
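
One caveat with the waitpid( -pgid ) step: POSIX waitpid() only reports
on children of the calling process, so it would fail with ECHILD for a
process group whose members were started by arrayd. A minimal Python
sketch of one possible alternative follows: polling the group with
signal 0. It assumes a POSIX system; pgid_alive and wait_for_pgid are
hypothetical names, not slurm APIs.

```python
import os
import time


def pgid_alive(pgid):
    """Return True if at least one process still belongs to pgid."""
    try:
        os.killpg(pgid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False        # ESRCH: no process left in that group
    except PermissionError:
        return True         # EPERM: the group exists but is not ours to signal
    return True


def wait_for_pgid(pgid, poll_interval=0.05, timeout=None):
    """Poll until the process group is empty.

    Returns True once the group is gone, False if the timeout expires
    first. Hypothetical stand-in for the waitpid( -pgid ) step.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while pgid_alive(pgid):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

Polling is cruder than a blocking wait and yields no exit status, so
this only shows that an unrelated process can notice when an external
pgid has drained.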
IIUC, none of the proctrack/jobacct_gather "common" infrastructure
can deal with multiple pidTrees. One container per slurmstepd is fine,
but multiple pidTrees per container is not.
In the above proposal, we would end up with:
strack                 1: pgid of slurmstepd
mpi_hello ( daemon )   2: pgid strack command-line arg
So, I am thinking that maybe we could add a 2nd pgid_t field to the
relevant proctrack/jobacct_gather structs and adapt the locations
that would have to deal with 2 potential pgids.
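
As a rough illustration of the idea ( not the actual slurm structs,
which are in C ), here is a minimal Python sketch of a container record
that holds more than one pgid. TrackedContainer and pids_in_container
are hypothetical names, the job_id/step_id values in the usage note are
placeholders, and a set is used instead of exactly two fields only to
keep the sketch short:

```python
from dataclasses import dataclass, field


@dataclass
class TrackedContainer:
    """Hypothetical per-step record tracking several process groups."""
    job_id: int
    step_id: int
    pgids: set = field(default_factory=set)

    def add_pgid(self, pgid):
        """Register an additional ( possibly external ) process group."""
        self.pgids.add(pgid)

    def owns(self, pgid):
        """Return True if pgid is one of the tracked process groups."""
        return pgid in self.pgids


def pids_in_container(container, proc_table):
    """Select pids whose pgid is tracked; proc_table holds (pid, pgid) pairs."""
    return [pid for pid, pgid in proc_table if container.owns(pgid)]
```

With the r1i3n0 listing as proc_table, a container created with
placeholder ids and pgids={11113} selects pids 11113 and 11204; after
add_pgid(11218) it also selects 11218, 11224 and 11225.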
Another approach Karl and I were thinking about is to have an interface
where we could ask slurmd to launch an additional, specialized
slurmstepd task designed to monitor the desired external pgid. Whether
that would use the same strack: I don't know.
But, given my reading of the slurm code, I think this could be risky,
or at least hairy, given that we would inject an additional task not
accounted for in advance.
I am thinking of all those loop spots, etc. where ntasks is known and
would actually change.
At any rate, the problem can be simply stated:
Can we get slurm to monitor an external pgid ( or an id of some other
type ) without disrupting the slurm architecture too much?
Cheers
--
-----------------------------------------------------------
Michel Bourget - SGI - Linux Software Engineering
"Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------