Thanks Ralph,

Now I get what you had in mind.

Strictly speaking, you are assuming that Open MPI's performance matches that
of the system MPI.

This is generally true for common interconnects and/or those with libfabric
or UCX providers, but not for "exotic" interconnects (which might not be
supported natively by Open MPI or by those abstraction layers) and/or for
uncommon topologies (for which Open MPI's collective communications are not
fully optimized). In those cases, using the system/vendor MPI is the best
option performance-wise.
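
For example, you can check which transport components a given Open MPI build
supports and explicitly request one at run time. A minimal sketch - the
./my_app binary is a placeholder:

    # list the transport components this Open MPI build knows about
    ompi_info | grep -E 'pml|mtl|btl'

    # explicitly request the UCX point-to-point layer, assuming a UCX-enabled build
    mpirun --mca pml ucx -np 4 ./my_app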

Cheers,

Gilles

On Fri, Jan 28, 2022 at 2:23 AM Ralph Castain via users <
users@lists.open-mpi.org> wrote:

> Just to complete this - there is always a lingering question regarding
> shared memory support. There are two ways to resolve that one:
>
> * run one container per physical node, launching multiple procs in each
> container. The procs can then utilize shared memory _inside_ the container.
> This is the cleanest solution (i.e., minimizes container boundary
> violations), but some users need/want per-process isolation.
>
> * run one container per MPI process, having each container mount an
> _external_ common directory to an internal mount point. This allows each
> process to access the common shared memory location. As with the device
> drivers, you typically specify that external mount location when launching
> the container (see the sketch below).
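>
> For illustration only - assuming Singularity/Apptainer and a hypothetical
> image app.sif - the second approach could look like:
>
>     # one container instance per rank; the host's /dev/shm is bind-mounted
>     # into each container so co-located ranks can share memory segments
>     mpirun -np 16 singularity exec --bind /dev/shm app.sif ./mpi_app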
>
> Using those combined methods, you can certainly have a "generic" container
> that suffers no performance penalty relative to bare metal. The problem has
> been that it takes a certain degree of "container savvy" to set this up and
> make it work - which is beyond what most users really want to learn. I'm sure
> the container community is working on ways to reduce that burden (I'm not
> really plugged into those efforts, but others on this list might be).
>
> Ralph
>
>
> > On Jan 27, 2022, at 7:39 AM, Ralph H Castain <r...@open-mpi.org> wrote:
> >
> >> Fair enough Ralph! I was implicitly assuming a "build once / run
> everywhere" use case, my bad for not making my assumption clear.
> >> If the container is built to run on a specific host, there are indeed
> other options to achieve near-native performance.
> >>
> >
> > Err...that isn't actually what I meant, nor what we did. You can, in
> fact, build a container that can "run everywhere" while still employing
> high-speed fabric support. What you do is:
> >
> > * configure OMPI with all the fabrics enabled (or at least all the ones
> you care about)
> >
> > * don't include the fabric drivers in your container. These can/will
> vary across deployments, especially those (like NVIDIA's) that involve
> kernel modules
> >
> > * set up your container to mount specified external device driver
> locations onto the locations where you configured OMPI to find them. Sadly,
> this does violate the container boundary - but nobody has come up with
> another solution, and at least the violation is confined to just the device
> drivers. Typically, you specify the external locations that are to be
> mounted using an envar or some other mechanism appropriate to your
> container, and then include the relevant information when launching the
> containers (see the sketch below).
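> >
> > A rough sketch of the build-time and run-time pieces (the paths, image
> > name, and mount points here are hypothetical):
> >
> >     # build time: enable the fabrics you care about
> >     ./configure --with-ucx=/opt/ucx --with-ofi=/opt/libfabric
> >
> >     # run time: bind the host's fabric drivers/libraries onto the
> >     # location where OMPI was configured to find them
> >     singularity exec --bind /host/fabric:/opt/fabric app.sif ./mpi_app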
> >
> > When OMPI initializes, it follows its normal procedure of attempting to
> load each fabric's drivers, selecting the transports whose drivers it can
> load. NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to
> build the fabric plugins as dynamically loadable components rather than
> statically linking them, or else this probably will fail.
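> >
> > For example, at configure time (the component list here is illustrative,
> > not exhaustive):
> >
> >     # build the fabric components as run-time loadable plugins (DSOs)
> >     # instead of linking them statically into the core library
> >     ./configure --enable-mca-dso=pml-ucx,mtl-ofi,btl-uct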
> >
> > At least one vendor now distributes OMPI containers preconfigured with
> their fabric support based on this method. So using a "generic" container
> doesn't mean you lose performance - in fact, our tests showed zero impact
> on performance using this method.
> >
> > HTH
> > Ralph
> >
