See inline

Ralph

On Jan 27, 2022, at 10:05 AM, Brian Dobbins <bdobb...@gmail.com> wrote:
> Hi Ralph,
>
> Thanks again for this wealth of information - we've successfully run the same container instance across multiple systems without issues, even surpassing 'native' performance in edge cases, presumably because the native host MPI is either older or simply tuned differently (e.g., 'eager limit' differences or medium/large message optimizations) than the one in the container. That's actually really useful already, and, coupled with what you've already pointed out about the launcher vs. the MPI when it comes to ABI issues, it makes me pretty happy. But I would love to dive a little deeper on two of the more complex things you've brought up:
>
> 1) With regards to not including the fabric drivers in the container and mounting in the device drivers, how many different sets of drivers are there? I know you mentioned you're not really plugged into the container community, so maybe this is more a question for them, but I'd think that if there's a relatively small set that accounts for most systems, you might be able to include them all and have the dlopen facility find the correct one at launch? (E.g., the 'host' launcher could transmit some information as to which drivers - like 'mlnx5' - are desired, and it looks for those inside the container?) I definitely agree that mounting in specific drivers is beyond what most users are comfortable with, so understanding the plans to address this would be great. Mind you, for most of our work, even just including the 'inbox' OFED drivers works well enough right now.

So there are two problems you can encounter when using internal drivers. The first is a compatibility issue between the drivers in the container and the firmware in the local hardware. Vendors assume that the drivers and firmware get updated as a package, since that is how they deliver it. In the case of a container, however, you have a driver that is "frozen" in time while the firmware keeps moving. You could wind up on a system whose firmware is more recent, but also on a system with "stone age" firmware someone hasn't wanted/needed to update for some time. Thus, your drivers could run into problems. One way to work around that is to create your own compatibility table - i.e., embed some logic in your container that queries the firmware level of the local hardware and checks it for compatibility with your embedded driver. If it isn't compatible, you can either report the problem and cleanly abort, or you can mount the external drivers. It's a little more work, and I'm not sure what it really buys you since you'd have to be prepared to mount the external drivers anyway - but it's still something one could do.

The second problem is that some drivers actually include kernel modules (e.g., if you are using CUDA). In those cases, you have no choice but to mount the local drivers, as your container isn't going to have the system's addresses in it. Frankly, the latter issue is the one that usually gets you - unless you have an overriding reason to bring your own drivers, it's probably better to mount them.
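As a concrete illustration of the compatibility-table idea above, here is a minimal wrapper-script sketch. It assumes ibv_devinfo (from rdma-core) is available inside the container; the firmware versions in the table are made-up placeholders, and a real check would likely be per-device and vendor-specific:

    #!/bin/bash
    # Entry-point wrapper: verify that the local HCA firmware is one the
    # container's embedded driver was tested against before running the app.
    # Usage: ./check_fw.sh ./my_app <args...>

    # Firmware versions known to work with the embedded driver
    # (placeholder values for illustration only).
    COMPAT_FW="16.35.1012 16.34.1002 20.36.1010"

    # Query the firmware level of the first local device.
    fw=$(ibv_devinfo 2>/dev/null | awk '/fw_ver/ {print $2; exit}')

    if [ -z "$fw" ]; then
        echo "No RDMA device visible inside the container" >&2
        exit 1
    fi

    for good in $COMPAT_FW; do
        if [ "$fw" = "$good" ]; then
            exec "$@"   # compatible: run the application
        fi
    done

    # Not in the table: report cleanly (or fall back to mounting host drivers).
    echo "Firmware $fw is not in this container's compatibility table" >&2
    exit 1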
> 2) With regards to launching multiple processes within one container for shared memory access, how is this done? Or is it automatic now with modern launchers? E.g., if the launch command knows it's running 96 copies of the same container (via either 'host:96' or '-ppn 96' or something), is it 'smart' enough to do this? This also hasn't been a problem for us, since we're typically rate-limited by inter-node comms, not intra-node ones, but it'd be good to ensure we're doing it 'right'.

It depends on how you are launching the containers. If the system is launching the containers and you use mpirun (from inside one of the containers) to spawn the processes, then mpirun "sees" each container as being a "node". Thus, you just launch like normal. If you are using something outside to not only start the containers but also to start the application processes within them, then it's a little less "normal". You'd need a container-aware launcher to do it - i.e., something that understands it is doing a two-stage launch, how to separate the mapping of the processes from the placement of the containers, and how to "inject" processes into the container. This is feasible, and I actually had it working at one time (with Singularity, at least), but it has bit-rotted. If there is interest, I can put it on my "to-do" list to revive in PRRTE, though I can't promise a release date for that feature. You can see some info on the general idea here:

[PDF] https://openpmix.github.io/uploads/2019/04/PMIxSUG2019.pdf
[PPX] https://www.slideshare.net/rcastain/pmix-bridging-the-container-boundary
[video] https://www.sylabs.io/2019/04/sug-talk-intels-ralph-castain-on-bridging-the-container-boundary-with-pmix/

> Thanks again,
> - Brian

On Thu, Jan 27, 2022 at 10:22 AM Ralph Castain via users <users@lists.open-mpi.org> wrote:

Just to complete this - there is always a lingering question regarding shared memory support. There are two ways to resolve that one:

* run one container per physical node, launching multiple procs in each container. The procs can then utilize shared memory _inside_ the container. This is the cleanest solution (i.e., it minimizes container boundary violations), but some users need/want per-process isolation.

* run one container per MPI process, having each container mount an _external_ common directory to an internal mount point. This allows each process to access the common shared memory location. As with the device drivers, you typically specify that external mount location when launching the container.

Using those combined methods, you can certainly have a "generic" container that suffers no performance impact relative to bare metal. The problem has been that it takes a certain degree of "container savvy" to set this up and make it work - which is beyond what most users really want to learn. I'm sure the container community is working on ways to reduce that burden (I'm not really plugged into those efforts, but others on this list might be).

Ralph
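For reference, a rough sketch of what those two patterns can look like with Open MPI plus Singularity/Apptainer. The hostfile, image name, application name, and process counts are placeholders, and binding the host's /dev/shm is just one way to provide the "external common directory" described above:

    # Option 1: one container per physical node (already started by the
    # system). Launching from inside one of them, mpirun treats each
    # container as a "node", so the 96 ranks per node communicate through
    # shared memory inside their container.
    mpirun --hostfile container_hosts -np 192 --map-by ppr:96:node ./my_app

    # Option 2: one container per MPI process. Bind a common external
    # location into every container so co-located ranks can still reach
    # a shared-memory segment.
    mpirun -np 192 --map-by ppr:96:node \
        singularity exec --bind /dev/shm:/dev/shm myapp.sif ./my_app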
> On Jan 27, 2022, at 7:39 AM, Ralph H Castain <r...@open-mpi.org> wrote:
>
>> Fair enough Ralph! I was implicitly assuming a "build once / run everywhere" use case, my bad for not making my assumption clear.
>> If the container is built to run on a specific host, there are indeed other options to achieve near-native performance.
>
> Err...that isn't actually what I meant, nor what we did. You can, in fact, build a container that can "run everywhere" while still employing high-speed fabric support. What you do is:
>
> * configure OMPI with all the fabrics enabled (or at least all the ones you care about)
>
> * don't include the fabric drivers in your container. These can/will vary across deployments, especially those (like NVIDIA's) that involve kernel modules
>
> * set up your container to mount specified external device driver locations onto the locations where you configured OMPI to find them. Sadly, this does violate the container boundary - but nobody has come up with another solution, and at least the violation is confined to just the device drivers. Typically, you specify the external locations that are to be mounted using an envar or some other mechanism appropriate to your container, and then include the relevant information when launching the containers.
>
> When OMPI initializes, it will do its normal procedure of attempting to load each fabric's drivers, selecting the transports whose drivers it can load. NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to build without statically linking in the fabric plugins, or else this probably will fail.
>
> At least one vendor now distributes OMPI containers preconfigured with their fabric support based on this method. So using a "generic" container doesn't mean you lose performance - in fact, our tests showed zero impact on performance using this method.
>
> HTH
> Ralph
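To make that recipe concrete, here is a sketch of the build and launch steps with Open MPI v5 and Singularity. The fabric list, component names, install prefix, and bind paths are examples only and need to match the actual site and container layout (check ./configure --help and your container runtime's documentation for the exact options):

    # Container build: enable the fabrics of interest and build their
    # components as run-time loadable plugins (DSOs) instead of statically
    # linking them into the library, so the dlopen-based selection works.
    ./configure --prefix=/opt/ompi \
        --with-ucx --with-ofi \
        --enable-mca-dso=pml-ucx,btl-uct,mtl-ofi
    make && make install

    # Launch: bind the host's fabric driver/provider libraries onto the
    # locations the container's OMPI expects, using the runtime's envar
    # (example paths for a libibverbs-based fabric).
    export SINGULARITY_BIND="/etc/libibverbs.d,/usr/lib64/libibverbs"
    mpirun -np 192 singularity exec myapp.sif ./my_app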