Hi Michel,
Some of the things that you should consider as you approach submitting
your changes to SLURM:
* SLURM already has the PMI interface (and I see that someone is working
on PMI2); do you require more support than the PMI interface or SPANK
plugins could provide? It might be helpful to identify specific hooks
that you need -- others on the list may be able to point to existing
mechanisms.
* Are you introducing new functionality that might be of more general
use? This may relate to the previous question.
* You mentioned a concern with high cpu counts. The BlueGene code offers
an excellent example of "the SLURM way" to handle those problems.
* Are your changes implemented so that they will have little or no
impact on those who choose not to use them? (This should also be viewed
from the point of view of maintaining the code.)
Changing 72 files is a huge change. I speak only for myself, of course,
but the SLURM community can help best if we understand the pieces of
the puzzle and have a chance to ensure that the changes you require
will also meet the needs of the rest of the community.
Best regards,
Andy
On 11/24/2011 01:31 PM, Michel Bourget wrote:
Hi all,
It's about time I report to this mailing list what "SGI did to SLURM".
Short story:
FYI, we are releasing (and supporting) the "SGI SLURM" product on SGI
platforms this November.
It's based on version 2.2.7. For the user, it simply introduces the
"sgimpi" mpi plugin.
Long story:
SGI MPI integration was not trivial, since we are utilizing the native
SGI MPI launcher (array services) underneath slurmstepd. We have
introduced the notion of "strack", allowing jobs launched outside
slurm's scope to be tracked process-wise (proctrack) and
accounting-wise (job_acct_gather).
This introduces the notion of a "sentinel" thread in slurmstepd,
responsible for adding pgids that were not launched under the
slurmstepd umbrella. Those additional pgids are communicated by
strack using a simple mailbox file mechanism
(slurm.sentinel.<job>.<step>). Essentially, in addition to the native
slurmstepd child monitoring, we are adding hooks to monitor
out-of-band pgids via the newly introduced strack/sentinel mechanism.
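A minimal sketch of how a sentinel thread might drain such a mailbox file. The slurm.sentinel.<job>.<step> name comes from the description above, but the one-pgid-per-line file format and the helper itself are assumptions for illustration, not the actual SGI code:

```python
import os

def read_sentinel_mailbox(path, known_pgids):
    """Read out-of-band pgids reported by strack via a mailbox file.

    Assumes one pgid per line (the real on-disk format is not described
    in the mail). Returns the set of pgids that are new since the last
    scan, so the sentinel thread can start monitoring them alongside
    slurmstepd's own children; `known_pgids` is updated in place.
    """
    new_pgids = set()
    if not os.path.exists(path):
        return new_pgids  # strack has not reported anything yet
    with open(path) as mailbox:
        for line in mailbox:
            line = line.strip()
            if not line:
                continue
            pgid = int(line)
            if pgid not in known_pgids:
                new_pgids.add(pgid)
                known_pgids.add(pgid)
    return new_pgids

# Hypothetical location for job 1234, step 0 -- the <job>.<step>
# suffix follows the naming scheme quoted above:
#   /var/spool/slurm/slurm.sentinel.1234.0
```

The idempotent "new since last scan" return value is one plausible way for the sentinel loop to poll the mailbox periodically without re-adding pgids it already tracks.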
The resulting source patches to accomplish this integration are not,
in our opinion, ready to propose on this mailing list yet, for the
following reasons:
- we would need to re-base on 2.3 and/or 2.4. Can someone confirm?
- the source patches are quite large.
  initd.sysconfig.patch  :  3 files changed,   37 insertions(+), 16 deletions(-)
  sentinel.patch         : 50 files changed, 3334 insertions(+), 28 deletions(-)
  sgimpi.patch           : 18 files changed, 1089 insertions(+),  5 deletions(-)
  slurm.modulefile.patch :  1 file changed,   28 insertions(+)
We need some guidance on a process acceptable to the SLURM community
for submitting the above patches. I presume a documented approach
(details, dos, don'ts, rationale, ...) is probably required.
Note that the source RPM is, of course, shipped on the SGI SLURM iso;
please let me know if you'd like to look at it.
We hope to integrate the above into the stock SLURM release in the
coming year.
- we believe a safe soak period (customer-reported bugs, etc.) is
necessary first.
- the initial SGI release supports ALTIX ICE clusters. We don't
support large SSI systems yet (a UV with 1024 cores, for example)
because they would require additional optimizations for such big
machines. In particular, proctrack/job_acct_gather needs to relieve
the pressure of reading /proc/<pid>/stat for every pid. Why? Because
on an idle 512-CPU machine we have:
  #nproc=8867  #kthreads=8813  kthreads/nproc=99.39%
In other words, it is wasteful to rescan kernel threads for every
user job and every step.
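One inexpensive way to recognize a kernel thread from its stat line is the PF_KTHREAD bit in the flags field of /proc/<pid>/stat. This is a generic Linux sketch, not the SGI code, and assumes a kernel recent enough to set that flag:

```python
PF_KTHREAD = 0x00200000  # set in the "flags" field for kernel threads

def stat_flags(stat_line):
    """Extract the flags field from one /proc/<pid>/stat line.

    The comm field (field 2) may itself contain spaces and parentheses,
    so split on the *last* ')' instead of naively on whitespace.
    """
    rest = stat_line.rsplit(")", 1)[1].split()
    # after comm: state ppid pgrp session tty_nr tpgid flags ...
    return int(rest[6])

def is_kthread(stat_line):
    """True if this stat line belongs to a kernel thread."""
    return bool(stat_flags(stat_line) & PF_KTHREAD)
```

A scanner that skips pids for which is_kthread() was True on a previous pass would avoid re-reading the ~99% of stat files that can never belong to a user job.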
I am working on a separate (GPL) solution to scan those kthreads
once-and-for-all and share the result, hence relieving that pressure.
That separate solution would then be integrated into slurm as an
optional dependency:
- dlopen the optional library
- if present: use it
- else: continue as before.
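The dlopen fallback above can be sketched as follows. The library name libkthread_cache.so is purely hypothetical (no library is named in the mail), and ctypes stands in here for the C-level dlopen(3) call an actual slurm patch would use:

```python
import ctypes

def load_optional_kthread_cache(libname="libkthread_cache.so"):
    """Try to dlopen an optional shared library; fall back gracefully.

    Mirrors the plan from the mail: if the library is present, use it;
    otherwise continue scanning /proc as before. The library name is a
    placeholder, not a real SGI artifact.
    """
    try:
        return ctypes.CDLL(libname)  # dlopen(3) under the hood
    except OSError:
        return None                  # not installed: keep old behavior

lib = load_optional_kthread_cache()
if lib is not None:
    pass  # call into the shared kthread cache (API is hypothetical)
else:
    pass  # fall back to the existing full /proc scan
```

The attraction of this pattern is exactly what the mail describes: sites without the extra library see no behavior change at all, since the probe fails silently and the old code path runs.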
In addition, the SGI MPI plugin would require some adjustments for
SSI machines.
Cheers