I would think that your idea of
"2nd pgid_t field in relevant proctrack/jobacct_gather structs"
would be the easiest to develop and support. I would suggest that rather than modifying the existing plugins, you create new plugins specifically for the SGI systems, based upon the existing ones. That should avoid breaking any existing logic and isolate your work for easier support.

Moe Jette

Quoting Michel Bourget <[email protected]>:

On 08/02/2011 04:47 PM, Mark A. Grondona wrote:

I am proposing something like sattach(): strack( job_id, step_id, ... )

Given a valid job_id.step_id pair, it would teach the plugins currently
loaded by slurmd or slurmstepd
to track (add? replace?) the current pid/pgid/whatnot (or
supplied values).

I mention "replace" because the initial slurmstepd could go away
while another instance of slurmstepd (triggered by an strack) kicks in.

I mention "add" because the initial slurmstepd could stay
there while another, parallel slurmstepd is kicked in to monitor other
pid/pgid pairs. The issue with that is linuxproc/jobacct_gather are
single-container_id oriented. Or am I completely wrong?


Ok, I think I understand what you are proposing with strack(), but it
might help if you walk through an MPI launch using your proposed
strack() interface.

I am still not 100% sure why the SGI mpi launcher would need something
like strack, but that probably isn't your fault. I'm sorry if I'm being
unnecessarily dense.

Hi Mark,

no, it's a very good question.

Summary: slurmstepd triggers the launch, but the launch is performed
   by MPT mpirun making a request to another daemon, called
   arrayd (the SGI MPT launcher, so to speak), which sends the jobs
   to each node's arrayd, which, in turn, starts a.out, of course,
   which triggers the SGI libmpi.so "init()" function. I believe the "init()"
   part is pretty standard. Note the "init()" part is run once per node
   and launches the required task(s) as set up by the user.

The 'pstree' output below better illustrates the previous summary. Here is an
example of 'srun -l -N2 -n 4 mpi_hello'. Note we use the single_task mpi plugin hook:

   0: r1i3n0     PPID   PID  PGID COMMAND
   0: r1i3n0        1 11109 11108 slurmstepd
   0: r1i3n0    11109 11113 11113  \_ mpiexec_mpt
   0: r1i3n0    11113 11204 11113      \_ mpirun

   The '-l' task id is always 0 because this is the only slurmstepd
   actually sending the job. We key off $SLURM_PROCID == 0
   to make that decision. Yes, it's a new mpi hook that modifies
   argc and argv.

   mpirun sends the entire request to the arrayd's, which, in turn,
   launch the actual 'mpi_hello'.

   0: r1i3n0        1  3750  3749 slurmd
   0: r1i3n0        1  2407  2407 arrayd
   0: r1i3n0     2407 11218 11218  \_ mpi_hello
   0: r1i3n0    11218 11224 11218      \_ mpi_hello
   0: r1i3n0    11218 11225 11218      \_ mpi_hello

   Note we tell slurm to launch a single task because the SGI mpi
   daemon (pid 11218) will manage that part itself. Note also that
   arrayd's pgid != mpi_hello's pgid. We do that, of course, to
   protect arrayd from children's signals, just like anyone else.

   And, of course, the root problem: we need to teach slurm to
   monitor not only pgid 11113 but also 11218, which is disconnected,
   external to slurm.

   Now, on the 2nd node, we have ...

   0: r1i3n1     PPID   PID  PGID COMMAND
   0: r1i3n1        1  3607  3606 slurmd
   0: r1i3n1        1  2388  2388 arrayd
   0: r1i3n1     2388 10192 10192  \_ mpi_hello
   0: r1i3n1    10192 10195 10192      \_ mpi_hello
   0: r1i3n1    10192 10196 10192      \_ mpi_hello

   Now, on each of the other nodes, where SLURM_PROCID != 0,
   slurmstepd is not involved for the moment.
   But wait ... this is temporary. That needs to change.

So far, launching works, but that's about it. It's not right, of course:
slurm has no clue about resources being consumed, etc. on "each other node",
i.e. r1i3n1 in this case.

So, this is why we are talking about an strack() API (and likely an
strack wrapper), where slurmstepd facilities would now be (indirectly)
launched on each of the other nodes, as opposed to the above model;
we would not launch mpiexec_mpt and/or mpi_hello
but rather an "strack <params>", in a similar fashion to sattach.
Anyway, these are implementation details, but something like:

  strack pgid : add the pgid's "pid tree" for proctrack/jobacct_gather to monitor.

strack would be responsible for providing the pgid.
strack would use the current jobId.stepId. Maybe a command-line
         argument could change that.
strack would waitpid( -pgid ).
The above is just a rough draft definition :)

IIUC, none of the proctrack/jobacct_gather "common" infrastructure
can deal with multiple pid trees. It's fine for one container per slurmstepd,
but not for multiple pid trees per container.
In the above proposal, we would end up with:

    strack                 1: pgid of slurmstepd
    mpi_hello ( daemon )   2: pgid strack command-line arg


So, I am thinking of maybe adding a 2nd pgid_t field to the
relevant proctrack/jobacct_gather structs and adapting the locations
that would have to deal with 2 potential pgid's.

Another approach Karl and I were thinking about is an interface
where we could ask slurmd to launch an additional, specialized slurmstepd
task designed to monitor the desired external pgid. Whether it would use the
same strack: I don't know.
But, given the lecture slurm has already given me, I would rather think this
could be risky, or else hairy, since we would inject an additional task not
accounted for in advance.
I am thinking of all those loop spots, etc., where ntasks is known and
would actually change.

At any rate, the problem can be stated simply:

Can we get slurm to monitor an external pgid (or one of some other type)
without disrupting the slurm architecture too much?

Cheers



--

-----------------------------------------------------------
     Michel Bourget - SGI - Linux Software Engineering
    "Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------



