Hi all,

Has the issue of tracking pids that come and go been addressed (or
debated) in the past, or is it planned for the future, with regard to
proctrack and jobacct_gather?

I mean, since pids can fork() children and go away later, proctrack does
not seem able to track this dynamically, since it is "on-demand". The same
goes for jobacct_gather, since its pid set is "set in stone" when a step
is launched. And, because proctrack is on-demand and the jobacct_gather
pids are fixed at the beginning, pids newly discovered on demand never
intersect with those jobacct_gather pids.

Maybe an approach like using the kernel process events connector,
seeded with an initial set of pids (monitoring fork() and exit()), with
proctrack/jobacct_gather then using that list instead, would be useful
and feasible? In that case, I would think the additional information
kept alongside the obtained pid list would be something along the lines of:

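  typedef struct pid_info {
         pthread_mutex_t lock;       // Record lock
         int is_active;              // 0 means the pid was once live but is now gone
         struct jobacctinfo *acct;   // Accounting for that pid so far
         /* more ? */
  } pid_info_t;

  typedef struct pid_list {
         pthread_mutex_t lock;       // Global list lock
         int n;                      // Number of records
         pid_info_t *info;           // Array of n records
         /* more ? */
  } pid_list_t;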
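
To make the connector part a bit more concrete, here is a minimal sketch
of subscribing to fork()/exit() events on Linux. The kernel pieces
(NETLINK_CONNECTOR, CN_IDX_PROC, PROC_CN_MCAST_LISTEN, struct proc_event)
are the real interface, assuming a kernel built with CONFIG_PROC_EVENTS.
The function names and enum pid_change are hypothetical, nothing in SLURM
is called that. Most error handling is trimmed, and opening the socket
needs CAP_NET_ADMIN.

```c
#include <linux/cn_proc.h>
#include <linux/connector.h>
#include <linux/netlink.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result type: what one event means for the pid list. */
enum pid_change { PID_NONE = 0, PID_FORKED, PID_EXITED };

#define REQ_PAYLOAD (sizeof(struct cn_msg) + sizeof(enum proc_cn_mcast_op))

/* Open a netlink socket subscribed to process events; -1 on error. */
int proc_conn_open(void)
{
    union { struct nlmsghdr hdr; char bytes[NLMSG_SPACE(REQ_PAYLOAD)]; } req;
    enum proc_cn_mcast_op op = PROC_CN_MCAST_LISTEN;
    struct sockaddr_nl sa;
    struct cn_msg *cn;
    int fd = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);

    if (fd < 0)
        return -1;
    memset(&sa, 0, sizeof(sa));     /* nl_pid = 0: kernel picks the address */
    sa.nl_family = AF_NETLINK;
    sa.nl_groups = CN_IDX_PROC;
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
        goto fail;

    /* Ask the kernel to start multicasting process events to us. */
    memset(&req, 0, sizeof(req));
    req.hdr.nlmsg_len = NLMSG_LENGTH(REQ_PAYLOAD);
    req.hdr.nlmsg_type = NLMSG_DONE;
    req.hdr.nlmsg_pid = getpid();
    cn = (struct cn_msg *)NLMSG_DATA(&req.hdr);
    cn->id.idx = CN_IDX_PROC;
    cn->id.val = CN_VAL_PROC;
    cn->len = sizeof(op);
    memcpy(cn->data, &op, sizeof(op));
    if (send(fd, &req, req.hdr.nlmsg_len, 0) < 0)
        goto fail;
    return fd;
fail:
    close(fd);
    return -1;
}

/* Translate one kernel event into a pid-list change. Thread creation and
 * thread exit (pid != tgid) are ignored so only processes are tracked. */
enum pid_change decode_event(const struct proc_event *ev,
                             pid_t *parent, pid_t *pid)
{
    switch (ev->what) {
    case PROC_EVENT_FORK:
        if (ev->event_data.fork.child_pid != ev->event_data.fork.child_tgid)
            return PID_NONE;
        *parent = ev->event_data.fork.parent_tgid;
        *pid = ev->event_data.fork.child_tgid;
        return PID_FORKED;
    case PROC_EVENT_EXIT:
        if (ev->event_data.exit.process_pid != ev->event_data.exit.process_tgid)
            return PID_NONE;
        *parent = 0;
        *pid = ev->event_data.exit.process_tgid;
        return PID_EXITED;
    default:
        return PID_NONE;
    }
}

/* Block for the next event on fd and decode it. */
enum pid_change proc_conn_next(int fd, pid_t *parent, pid_t *pid)
{
    union { struct nlmsghdr hdr; char bytes[1024]; } buf;
    struct proc_event ev;
    ssize_t len = recv(fd, &buf, sizeof(buf), 0);

    if (len < (ssize_t)NLMSG_LENGTH(sizeof(struct cn_msg) + sizeof(ev)))
        return PID_NONE;    /* error, or too short to hold an event */
    /* Copy out: the payload is not guaranteed to be aligned for proc_event. */
    memcpy(&ev, (char *)NLMSG_DATA(&buf.hdr) + sizeof(struct cn_msg), sizeof(ev));
    return decode_event(&ev, parent, pid);
}
```

Events are system-wide, so the caller of proc_conn_next() would still have
to check that the reported parent is a pid it already tracks before adding
the child. Treating "leader thread exited" as "process gone" is a
simplification.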

Given the above, proctrack services would key on pids where is_active=1.
And jobacct_gather services would key on the jobacctinfo gathered so far,
regardless of is_active. I would even venture that proctrack and
jobacct_gather could then be independent of each other, which is not the
case today, I believe.
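
As a rough illustration of those two views over one shared list, here is
a small sketch. It is not SLURM code: struct acct_stub stands in for
struct jobacctinfo, the pid and cap fields are my additions, every
function name is made up, and only the list lock is used (per-record
locks are left out to keep it short).

```c
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>

struct acct_stub { unsigned long cpu_ticks; };  /* stand-in for jobacctinfo */

typedef struct {
    pid_t pid;              /* added here: the key itself */
    int is_active;          /* 0 once the pid has exited */
    struct acct_stub acct;  /* accounting gathered so far */
} pid_info_t;

typedef struct {
    pthread_mutex_t lock;   /* list lock; per-record locks left out */
    size_t n, cap;          /* records in use / allocated */
    pid_info_t *info;
} pid_list_t;

/* Caller holds the lock. Only live records match (pid numbers get reused). */
static pid_info_t *find_active(pid_list_t *l, pid_t pid)
{
    for (size_t i = 0; i < l->n; i++)
        if (l->info[i].is_active && l->info[i].pid == pid)
            return &l->info[i];
    return NULL;
}

/* Caller holds the lock. Appends a live record; -1 if out of memory. */
static int add_active(pid_list_t *l, pid_t pid)
{
    if (l->n == l->cap) {
        size_t cap = l->cap ? 2 * l->cap : 8;
        pid_info_t *p = realloc(l->info, cap * sizeof(*p));
        if (!p)
            return -1;
        l->info = p;
        l->cap = cap;
    }
    l->info[l->n++] = (pid_info_t){ .pid = pid, .is_active = 1 };
    return 0;
}

/* Start the list from the pid known when the step is launched. */
int pid_list_init(pid_list_t *l, pid_t first)
{
    l->n = l->cap = 0;
    l->info = NULL;
    if (pthread_mutex_init(&l->lock, NULL) != 0)
        return -1;
    return add_active(l, first);
}

/* fork(): 1 if the child was added, 0 if the parent is not ours, -1 on error. */
int pid_list_on_fork(pid_list_t *l, pid_t parent, pid_t child)
{
    int rc = 0;
    pthread_mutex_lock(&l->lock);
    if (find_active(l, parent))
        rc = add_active(l, child) == 0 ? 1 : -1;
    pthread_mutex_unlock(&l->lock);
    return rc;
}

/* exit(): keep the record and its accounting, only clear is_active. */
void pid_list_on_exit(pid_list_t *l, pid_t pid)
{
    pthread_mutex_lock(&l->lock);
    pid_info_t *r = find_active(l, pid);
    if (r)
        r->is_active = 0;
    pthread_mutex_unlock(&l->lock);
}

/* proctrack view: copy up to max live pids into out, return how many. */
size_t pid_list_live(pid_list_t *l, pid_t *out, size_t max)
{
    size_t k = 0;
    pthread_mutex_lock(&l->lock);
    for (size_t i = 0; i < l->n && k < max; i++)
        if (l->info[i].is_active)
            out[k++] = l->info[i].pid;
    pthread_mutex_unlock(&l->lock);
    return k;
}

/* jobacct_gather view: total over every record, live or gone. */
unsigned long pid_list_cpu_ticks(pid_list_t *l)
{
    unsigned long sum = 0;
    pthread_mutex_lock(&l->lock);
    for (size_t i = 0; i < l->n; i++)
        sum += l->info[i].acct.cpu_ticks;
    pthread_mutex_unlock(&l->lock);
    return sum;
}
```

The point is that pid_list_live() and pid_list_cpu_ticks() read the same
records but never call each other: a pid that has exited drops out of the
proctrack view while its accounting still counts toward the step total.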

I have to admit the above would make it a lot easier to inject
out-of-band pids into SLURM. I am thinking of those created by mpirun
inside a salloc, or similar. "Similar" refers to the sgimpi
implementation I maintain here at SGI. I understand this
sounds SGI-specific, but I believe there is generic value
in the above-mentioned approach that would benefit SLURM in
general.

I hope I am not off track ;-)

Too evil? Not worth it? Comments?

-- 

-----------------------------------------------------------
      Michel Bourget - SGI - Linux Software Engineering
     "Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------
