Hi Ralph,

  Thanks again for this wealth of information - we've successfully run the
same container image across multiple systems without issues, even
surpassing 'native' performance in some edge cases, presumably because the
native host MPI is either older or simply tuned differently (e.g., different
'eager limit' settings or medium/large-message optimizations) than the one
in the container.  That's already really useful, and coupled with what
you've pointed out about the launcher vs. the MPI when it comes to ABI
issues, it makes me pretty happy.

  But I would love to dive a little deeper into two of the more complex
points you've brought up:

  1) With regard to not including the fabric drivers in the container and
mounting in the device drivers instead: how many different sets of drivers
*are* there?  I know you mentioned you're not really plugged into the
container community, so maybe this is more a question for them, but I'd
think that if a relatively small set accounts for most systems, you might
be able to include them all and have the dlopen facility find the correct
one at launch.  (E.g., the 'host' launcher could transmit some information
about which drivers - like 'mlx5' - are desired, and OMPI would look for
those inside the container?)
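
  As a rough illustration of what I'm imagining (purely a sketch on my
part - the library names below are made-up examples, not necessarily what
OMPI actually probes for), something like:

    /* probe a list of candidate fabric driver libraries bundled in the
     * container and report which ones load (names are illustrative) */
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        const char *candidates[] = {
            "libmlx5.so.1",   /* e.g., a Mellanox/NVIDIA verbs provider */
            "libefa.so.1",    /* e.g., an AWS EFA provider */
            "libpsm2.so.2",   /* e.g., an Omni-Path provider */
            NULL
        };

        for (int i = 0; candidates[i] != NULL; i++) {
            void *handle = dlopen(candidates[i], RTLD_NOW | RTLD_LOCAL);
            if (handle != NULL) {
                printf("found usable driver: %s\n", candidates[i]);
                dlclose(handle);
            } else {
                printf("skipping %s: %s\n", candidates[i], dlerror());
            }
        }
        return 0;
    }

i.e., ship a handful of candidate libraries and let whichever one loads
win, rather than requiring the user to mount anything in.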

  I definitely agree that mounting in specific drivers is beyond what most
users are comfortable with, so it would be great to understand any plans to
address this.  Mind you, for most of our work, even just including the
'inbox' OFED drivers works well enough right now.

  2) With regard to launching multiple processes within one container for
shared-memory access, how is this done?  Or is it automatic now with modern
launchers?  E.g., if the launch command knows it's running 96 copies of the
same container (via 'host:96' or '-ppn 96' or something similar), is it
'smart' enough to do this?  This also hasn't been a problem for us, since
we're typically rate-limited by inter-node comms rather than intra-node
ones, but it'd be good to make sure we're doing it 'right'.
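
  (For what it's worth, the sanity check I had in mind is the sketch
below - nothing OMPI-specific, just the standard MPI-3 call that splits
ranks by shared-memory domain, to see how many ended up together:

    /* count how many ranks share this process's memory domain */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int world_rank, node_size;
        MPI_Comm node_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* ranks that can share memory land in the same communicator */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_size(node_comm, &node_size);

        if (world_rank == 0)
            printf("rank 0 shares memory with %d ranks\n", node_size);

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }

If that reports 96 on a 96-core node, I'm assuming the layout is at least
what we intended?)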

  Thanks again,
  - Brian


On Thu, Jan 27, 2022 at 10:22 AM Ralph Castain via users <
users@lists.open-mpi.org> wrote:

> Just to complete this - there is always a lingering question regarding
> shared memory support. There are two ways to resolve that one:
>
> * run one container per physical node, launching multiple procs in each
> container. The procs can then utilize shared memory _inside_ the container.
> This is the cleanest solution (i.e., minimizes container boundary
> violations), but some users need/want per-process isolation.
>
> * run one container per MPI process, having each container then mount an
> _external_ common directory to an internal mount point. This allows each
> process to access the common shared memory location. As with the device
> drivers, you typically specify that external mount location when launching
> the container.
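>
> As a rough sketch of that second approach (the '/shared/segment' path
> below is just an example - it would be whatever external directory you
> bind-mounted into each container), every per-process container can mmap
> the same file from the common mount point:
>
>     /* map a file from the externally mounted common directory */
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <string.h>
>     #include <sys/mman.h>
>     #include <unistd.h>
>
>     int main(void)
>     {
>         const char *path = "/shared/segment"; /* bind-mounted location */
>         const size_t len = 4096;
>
>         int fd = open(path, O_RDWR | O_CREAT, 0600);
>         if (fd < 0) { perror("open"); return 1; }
>         if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }
>
>         char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
>                          MAP_SHARED, fd, 0);
>         if (buf == MAP_FAILED) { perror("mmap"); return 1; }
>
>         /* visible to every other container sharing the mount */
>         strcpy(buf, "hello from one container");
>
>         munmap(buf, len);
>         close(fd);
>         return 0;
>     }
>
> That only illustrates the plumbing, of course - in practice it is the
> MPI library's own shared-memory backing files that need to land in that
> commonly mounted location.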
>
> Using those combined methods, you can certainly have a "generic" container
> that suffers no performance impact relative to bare metal. The problem has been
> that it takes a certain degree of "container savvy" to set this up and make
> it work - which is beyond what most users really want to learn. I'm sure
> the container community is working on ways to reduce that burden (I'm not
> really plugged into those efforts, but others on this list might be).
>
> Ralph
>
>
> > On Jan 27, 2022, at 7:39 AM, Ralph H Castain <r...@open-mpi.org> wrote:
> >
> >> Fair enough Ralph! I was implicitly assuming a "build once / run
> everywhere" use case, my bad for not making my assumption clear.
> >> If the container is built to run on a specific host, there are indeed
> other options to achieve near-native performance.
> >>
> >
> > Err...that isn't actually what I meant, nor what we did. You can, in
> fact, build a container that can "run everywhere" while still employing
> high-speed fabric support. What you do is:
> >
> > * configure OMPI with all the fabrics enabled (or at least all the ones
> you care about)
> >
> > * don't include the fabric drivers in your container. These can/will
> vary across deployments, especially those (like NVIDIA's) that involve
> kernel modules
> >
> > * set up your container to mount specified external device driver
> locations onto the locations where you configured OMPI to find them. Sadly,
> this does violate the container boundary - but nobody has come up with
> another solution, and at least the violation is confined to just the device
> drivers. Typically, you specify the external locations that are to be
> mounted using an envar or some other mechanism appropriate to your
> container, and then include the relevant information when launching the
> containers.
> >
> > When OMPI initializes, it will do its normal procedure of attempting to
> load each fabric's drivers, selecting the transports whose drivers it can
> load. NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to
> build without statically linking in the fabric plugins or else this
> probably will fail.
> >
> > At least one vendor now distributes OMPI containers preconfigured with
> their fabric support based on this method. So using a "generic" container
> doesn't mean you lose performance - in fact, our tests showed zero impact
> on performance using this method.
> >
> > HTH
> > Ralph
> >
>
>
>
