I have not looked at the patch, but changing 72 files is very surprising to me. Adding support for Cray and IBM BlueGene systems each involved changes to about 25 files with the vast majority of changes in new plugins. I'd expect an SGI port to follow a similar pattern with a small number of new plugins and minor changes elsewhere.

Moe Jette
SchedMD LLC


Quoting Andy Riebs <[email protected]>:

Hi Michel,

Some of the things that you should consider as you approach submitting your changes to SLURM:

* SLURM already has the PMI interface (and I see that someone is working on PMI2); do you require more support than the PMI interface, or SPANK plugins, could provide? It might be helpful to identify specific hooks that you need -- others on the list may be able to identify existing mechanisms.
* Are you introducing new functionality that might be of more general use? This may relate to the previous question.
* You mentioned a concern with high cpu counts. The BlueGene code offers an excellent example of "the SLURM way" to handle those problems.
* Are your changes implemented so that they will have little or no impact on those who choose not to use them? (This should also be viewed from the point of view of maintaining the code.)

Changing 72 files is a huge change. Clearly I speak only on behalf of myself, but the SLURM community can be of best help if we understand the pieces of the puzzle, and have a chance to ensure that the changes that you require will also meet the needs of the rest of the community.

Best regards,
Andy

On 11/24/2011 01:31 PM, Michel Bourget wrote:
Hi all,


It's about time I report to this mailing list what "SGI did to SLURM".

Short story:
FYI, we are releasing (and supporting) an "SGI SLURM" product on SGI platforms this November. It is based on version 2.2.7. For the user, it simply introduces the "sgimpi" MPI plugin.

Long story:

SGI MPI integration was not trivial, since we are using the native SGI MPI launcher (array services) underneath slurmstepd. We have introduced the notion of "strack", which allows jobs launched outside SLURM's scope to be tracked process-wise (proctrack) and accounting-wise (job_acct_gather). This introduces a "sentinel" thread in slurmstepd, responsible for adding pgids that were not launched under the slurmstepd umbrella. Those additional pgids are communicated by strack using a simple mailbox file mechanism (slurm.sentinel.<job>.<step>). Essentially, in addition to the native slurmstepd child monitoring, we add hooks to monitor out-of-band pgids via the newly introduced strack/sentinel mechanism.

The resulting source patches to accomplish this integration are not, in our opinion, ready for a proposal
on this mailing list yet for the following reasons:

- we would need to re-base on 2.3 and/or 2.4. Can someone confirm?
- the source patches are quite large.

  initd.sysconfig.patch  : 3 files changed, 37 insertions(+), 16 deletions(-)
  sentinel.patch         : 50 files changed, 3334 insertions(+), 28 deletions(-)
  sgimpi.patch           : 18 files changed, 1089 insertions(+), 5 deletions(-)
  slurm.modulefile.patch : 1 file changed, 28 insertions(+)

We need some guidance on a process acceptable to the SLURM community for submitting the above patches. I presume a documented (details, dos, don'ts, why, ...) approach is probably required.

  Note the source RPM is, of course, shipped on the SGI SLURM iso.
  Please let me know if you'd like to look at it.

We hope to integrate the above into the stock SLURM release in the following year.

- we believe a safe soak time (customer-reported bugs to us, etc.) is necessary.
- the initial SGI release supports the Altix ICE cluster. We don't support large SSI machines yet (a UV with 1024 cores, for example) because they would require additional optimizations. In particular, proctrack/job_acct_gather need to relieve the pressure of reading /proc/<pid>/stat for every pid. Why? Because on an idle 512-CPU machine we see:

  #nproc=8867  #kthreads=8813  kthreads/nproc=99.39%

In other words, kernel threads are useless to scan over and over for every user job and step. I am working on a separate (GPL) solution to scan those kthreads once and for all and share the result, hence relieving the pressure. That separate solution would then be integrated into SLURM as an optional feature:
  - dlopen the optional library
  - if present: use it
  - else: continue as before.

In addition, the SGI MPI plugin would require some adjustments for SSI machines.

Cheers
