See inline

Ralph

On Jan 27, 2022, at 10:05 AM, Brian Dobbins <bdobb...@gmail.com> wrote:
> Hi Ralph,
>
> Thanks again for this wealth of information - we've successfully run the same container instance across multiple systems without issues, even surpassing 'native' performance in edge cases, presumably because the native host MPI is either older or simply tuned differently (e.g., 'eager limit' differences or medium/large message optimizations) than the one in the container. That's actually really useful already, and, coupled with what you've already pointed out about the launcher vs. the MPI when it comes to ABI issues, it makes me pretty happy. But I would love to dive a little deeper on two of the more complex things you've brought up:
>
> 1) With regards to not including the fabric drivers in the container and mounting in the device drivers, how many different sets of drivers are there? I know you mentioned you're not really plugged into the container community, so maybe this is more a question for them, but I'd think that if there's a relatively small set that accounts for most systems, you might be able to include them all and have the dlopen facility find the correct one at launch? (E.g., the 'host' launcher could transmit some information as to which drivers - like 'mlnx5' - are desired, and it looks for those inside the container?) I definitely agree that mounting in specific drivers is beyond what most users are comfortable with, so understanding the plans to address this would be great. Mind you, for most of our work, even just including the 'inbox' OFED drivers works well enough right now.

So there are two problems you can encounter when using internal drivers. The first is a compatibility issue between the drivers in the container and the firmware in the local hardware. Vendors assume that the drivers and firmware get updated as a package, since that is how they deliver it. In the case of a container, however, you have a driver that is "frozen" in time while the firmware keeps moving. You could wind up on a system whose firmware is more recent, but also on a system with "stone age" firmware someone hasn't wanted/needed to update for some time. Thus, your drivers could run into problems. One way to work around that is to create your own compatibility table - i.e., embed some logic in your container that queries the firmware level of the local hardware and checks it for compatibility with your embedded driver. If it isn't compatible, you can either report the problem and cleanly abort, or you can mount the external drivers. It's a little more work, and I'm not sure what it really buys you since you'd have to be prepared to mount the external drivers anyway - but it's still something one could do.

The second problem is that some drivers actually include kernel modules (e.g., if you are using CUDA). In those cases, you have no choice but to mount the local drivers, as your container isn't going to have the system's addresses in it. Frankly, the latter issue is the one that usually gets you - unless you have an overriding reason to bring your own drivers, it's probably better to mount them.
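As a concrete illustration of the compatibility-table idea above, here is a minimal wrapper-script sketch. It assumes ibv_devinfo (from rdma-core) is available inside the container; the firmware versions in the table are made-up placeholders, and a real check would likely be per-device and vendor-specific:

    #!/bin/bash
    # Entry-point wrapper: verify that the local HCA firmware is one the
    # container's embedded driver was tested against before running the app.
    # Usage: ./check_fw.sh ./my_app <args...>

    # Firmware versions known to work with the embedded driver
    # (placeholder values for illustration only).
    COMPAT_FW="16.35.1012 16.34.1002 20.36.1010"

    # Query the firmware level of the first local device.
    fw=$(ibv_devinfo 2>/dev/null | awk '/fw_ver/ {print $2; exit}')

    if [ -z "$fw" ]; then
        echo "No RDMA device visible inside the container" >&2
        exit 1
    fi

    for good in $COMPAT_FW; do
        if [ "$fw" = "$good" ]; then
            exec "$@"   # compatible: run the application
        fi
    done

    # Not in the table: report cleanly (or fall back to mounting host drivers).
    echo "Firmware $fw is not in this container's compatibility table" >&2
    exit 1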
> 2) With regards to launching multiple processes within one container for shared memory access, how is this done? Or is it automatic now with modern launchers? E.g., if the launch command knows it's running 96 copies of the same container (via either 'host:96' or '-ppn 96' or something), is it 'smart' enough to do this? This also hasn't been a problem for us, since we're typically rate-limited by inter-node comms, not intra-node ones, but it'd be good to ensure we're doing it 'right'.

It depends on how you are launching the containers. If the system is launching the containers and you use mpirun (from inside one of the containers) to spawn the processes, then mpirun "sees" each container as being a "node". Thus, you just launch like normal. If you are using something outside to not only start the containers but also to start the application processes within them, then it's a little less "normal". You'd need a container-aware launcher to do it - i.e., something that understands it is doing a two-stage launch, how to separate the mapping of the processes from the placement of the containers, and how to "inject" processes into the container. This is feasible, and I actually had it working at one time (with Singularity, at least), but it has bit-rotted. If there is interest, I can put it on my "to-do" list to revive in PRRTE, though I can't promise a release date for that feature. You can see some info on the general idea here:

[PDF] https://openpmix.github.io/uploads/2019/04/PMIxSUG2019.pdf
[PPX] https://www.slideshare.net/rcastain/pmix-bridging-the-container-boundary
[video] https://www.sylabs.io/2019/04/sug-talk-intels-ralph-castain-on-bridging-the-container-boundary-with-pmix/

> Thanks again,
> - Brian

On Thu, Jan 27, 2022 at 10:22 AM Ralph Castain via users <users@lists.open-mpi.org> wrote:

Just to complete this - there is always a lingering question regarding shared memory support. There are two ways to resolve that one:

* run one container per physical node, launching multiple procs in each container. The procs can then utilize shared memory _inside_ the container. This is the cleanest solution (i.e., it minimizes container boundary violations), but some users need/want per-process isolation.

* run one container per MPI process, having each container mount an _external_ common directory to an internal mount point. This allows each process to access the common shared memory location. As with the device drivers, you typically specify that external mount location when launching the container.

Using those combined methods, you can certainly have a "generic" container that suffers no performance impact relative to bare metal. The problem has been that it takes a certain degree of "container savvy" to set this up and make it work - which is beyond what most users really want to learn. I'm sure the container community is working on ways to reduce that burden (I'm not really plugged into those efforts, but others on this list might be).

Ralph
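For reference, a rough sketch of what those two patterns can look like with Open MPI plus Singularity/Apptainer. The hostfile, image name, application name, and process counts are placeholders, and binding the host's /dev/shm is just one way to provide the "external common directory" described above:

    # Option 1: one container per physical node (already started by the
    # system). Launching from inside one of them, mpirun treats each
    # container as a "node", so the 96 ranks per node communicate through
    # shared memory inside their container.
    mpirun --hostfile container_hosts -np 192 --map-by ppr:96:node ./my_app

    # Option 2: one container per MPI process. Bind a common external
    # location into every container so co-located ranks can still reach
    # a shared-memory segment.
    mpirun -np 192 --map-by ppr:96:node \
        singularity exec --bind /dev/shm:/dev/shm myapp.sif ./my_app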
> On Jan 27, 2022, at 7:39 AM, Ralph H Castain <r...@open-mpi.org> wrote:
>
>> Fair enough Ralph! I was implicitly assuming a "build once / run everywhere" use case, my bad for not making my assumption clear.
>> If the container is built to run on a specific host, there are indeed other options to achieve near-native performance.
>
> Err...that isn't actually what I meant, nor what we did. You can, in fact, build a container that can "run everywhere" while still employing high-speed fabric support. What you do is:
>
> * configure OMPI with all the fabrics enabled (or at least all the ones you care about)
>
> * don't include the fabric drivers in your container. These can/will vary across deployments, especially those (like NVIDIA's) that involve kernel modules
>
> * set up your container to mount specified external device driver locations onto the locations where you configured OMPI to find them. Sadly, this does violate the container boundary - but nobody has come up with another solution, and at least the violation is confined to just the device drivers. Typically, you specify the external locations that are to be mounted using an envar or some other mechanism appropriate to your container, and then include the relevant information when launching the containers.
>
> When OMPI initializes, it will do its normal procedure of attempting to load each fabric's drivers, selecting the transports whose drivers it can load. NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to build without statically linking in the fabric plugins, or else this probably will fail.
>
> At least one vendor now distributes OMPI containers preconfigured with their fabric support based on this method. So using a "generic" container doesn't mean you lose performance - in fact, our tests showed zero impact on performance using this method.
>
> HTH
> Ralph
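To make that recipe concrete, here is a sketch of the build and launch steps with Open MPI v5 and Singularity. The fabric list, component names, install prefix, and bind paths are examples only and need to match the actual site and container layout (check ./configure --help and your container runtime's documentation for the exact options):

    # Container build: enable the fabrics of interest and build their
    # components as run-time loadable plugins (DSOs) instead of statically
    # linking them into the library, so the dlopen-based selection works.
    ./configure --prefix=/opt/ompi \
        --with-ucx --with-ofi \
        --enable-mca-dso=pml-ucx,btl-uct,mtl-ofi
    make && make install

    # Launch: bind the host's fabric driver/provider libraries onto the
    # locations the container's OMPI expects, using the runtime's envar
    # (example paths for a libibverbs-based fabric).
    export SINGULARITY_BIND="/etc/libibverbs.d,/usr/lib64/libibverbs"
    mpirun -np 192 singularity exec myapp.sif ./my_app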