Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Ralph Castain via users
Sure - but then we aren't talking about containers any more, just vendor vs 
OMPI. I'm not getting in the middle of that one!


On Jan 27, 2022, at 6:28 PM, Gilles Gouaillardet via users 
<users@lists.open-mpi.org> wrote:

Thanks Ralph,

Now I get what you had in mind.

Strictly speaking, you are assuming that Open MPI performance matches the 
system MPI performance.

This is generally true for common interconnects and/or those that feature 
providers for libfabric or UCX, but not so for "exotic" interconnects (that 
might not be supported natively by Open MPI or abstraction layers) and/or with 
an uncommon topology (for which collective communications are not fully 
optimized by Open MPI). In the latter case, using the system/vendor MPI is the 
best option performance-wise.

Cheers,

Gilles

On Fri, Jan 28, 2022 at 2:23 AM Ralph Castain via users 
<users@lists.open-mpi.org> wrote:
Just to complete this - there is always a lingering question regarding shared 
memory support. There are two ways to resolve that one:

* run one container per physical node, launching multiple procs in each 
container. The procs can then utilize shared memory _inside_ the container. 
This is the cleanest solution (i.e., minimizes container boundary violations), 
but some users need/want per-process isolation.

* run one container per MPI process, having each container then mount an 
_external_ common directory to an internal mount point. This allows each 
process to access the common shared memory location. As with the device 
drivers, you typically specify that external mount location when launching the 
container.

Using those combined methods, you can certainly have a "generic" container that 
suffers no performance impact from bare metal. The problem has been that it 
takes a certain degree of "container savvy" to set this up and make it work - 
which is beyond what most users really want to learn. I'm sure the container 
community is working on ways to reduce that burden (I'm not really plugged into 
those efforts, but others on this list might be).

Ralph


> On Jan 27, 2022, at 7:39 AM, Ralph H Castain wrote:
> 
>> Fair enough Ralph! I was implicitly assuming a "build once / run everywhere" 
>> use case, my bad for not making my assumption clear.
>> If the container is built to run on a specific host, there are indeed other 
>> options to achieve near native performances.
>> 
> 
> Err...that isn't actually what I meant, nor what we did. You can, in fact, 
> build a container that can "run everywhere" while still employing high-speed 
> fabric support. What you do is:
> 
> * configure OMPI with all the fabrics enabled (or at least all the ones you 
> care about)
> 
> * don't include the fabric drivers in your container. These can/will vary 
> across deployments, especially those (like NVIDIA's) that involve kernel 
> modules
> 
> * set up your container to mount specified external device driver locations 
> onto the locations where you configured OMPI to find them. Sadly, this does 
> violate the container boundary - but nobody has come up with another 
> solution, and at least the violation is confined to just the device drivers. 
> Typically, you specify the external locations that are to be mounted using an 
> envar or some other mechanism appropriate to your container, and then include 
> the relevant information when launching the containers.
> 
> When OMPI initializes, it will do its normal procedure of attempting to load 
> each fabric's drivers, selecting the transports whose drivers it can load. 
> NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to build 
> without statically linking in the fabric plugins or else this probably will 
> fail.
> 
> At least one vendor now distributes OMPI containers preconfigured with their 
> fabric support based on this method. So using a "generic" container doesn't 
> mean you lose performance - in fact, our tests showed zero impact on 
> performance using this method.
> 
> HTH
> Ralph
> 





Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Gilles Gouaillardet via users
Thanks Ralph,

Now I get what you had in mind.

Strictly speaking, you are assuming that Open MPI performance matches
the system MPI performance.

This is generally true for common interconnects and/or those that feature
providers for libfabric or UCX, but not so for "exotic" interconnects (that
might not be supported natively by Open MPI or abstraction layers) and/or
with an uncommon topology (for which collective communications are not
fully optimized by Open MPI). In the latter case, using the system/vendor
MPI is the best option performance-wise.

Cheers,

Gilles

On Fri, Jan 28, 2022 at 2:23 AM Ralph Castain via users <
users@lists.open-mpi.org> wrote:

> Just to complete this - there is always a lingering question regarding
> shared memory support. There are two ways to resolve that one:
>
> * run one container per physical node, launching multiple procs in each
> container. The procs can then utilize shared memory _inside_ the container.
> This is the cleanest solution (i.e., minimizes container boundary
> violations), but some users need/want per-process isolation.
>
> * run one container per MPI process, having each container then mount an
> _external_ common directory to an internal mount point. This allows each
> process to access the common shared memory location. As with the device
> drivers, you typically specify that external mount location when launching
> the container.
>
> Using those combined methods, you can certainly have a "generic" container
> that suffers no performance impact from bare metal. The problem has been
> that it takes a certain degree of "container savvy" to set this up and make
> it work - which is beyond what most users really want to learn. I'm sure
> the container community is working on ways to reduce that burden (I'm not
> really plugged into those efforts, but others on this list might be).
>
> Ralph
>
>
> > On Jan 27, 2022, at 7:39 AM, Ralph H Castain wrote:
> >
> >> Fair enough Ralph! I was implicitly assuming a "build once / run
> everywhere" use case, my bad for not making my assumption clear.
> >> If the container is built to run on a specific host, there are indeed
> other options to achieve near native performances.
> >>
> >
> > Err...that isn't actually what I meant, nor what we did. You can, in
> fact, build a container that can "run everywhere" while still employing
> high-speed fabric support. What you do is:
> >
> > * configure OMPI with all the fabrics enabled (or at least all the ones
> you care about)
> >
> > * don't include the fabric drivers in your container. These can/will
> vary across deployments, especially those (like NVIDIA's) that involve
> kernel modules
> >
> > * set up your container to mount specified external device driver
> locations onto the locations where you configured OMPI to find them. Sadly,
> this does violate the container boundary - but nobody has come up with
> another solution, and at least the violation is confined to just the device
> drivers. Typically, you specify the external locations that are to be
> mounted using an envar or some other mechanism appropriate to your
> container, and then include the relevant information when launching the
> containers.
> >
> > When OMPI initializes, it will do its normal procedure of attempting to
> load each fabric's drivers, selecting the transports whose drivers it can
> load. NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to
> build without statically linking in the fabric plugins or else this
> probably will fail.
> >
> > At least one vendor now distributes OMPI containers preconfigured with
> their fabric support based on this method. So using a "generic" container
> doesn't mean you lose performance - in fact, our tests showed zero impact
> on performance using this method.
> >
> > HTH
> > Ralph
> >
>
>
>


Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Ralph Castain via users
See inline
Ralph

On Jan 27, 2022, at 10:05 AM, Brian Dobbins <bdobb...@gmail.com> wrote:


> Hi Ralph,
>
> Thanks again for this wealth of information - we've successfully run the same 
> container instance across multiple systems without issues, even surpassing 
> 'native' performance in edge cases, presumably because the native host MPI is 
> either older or simply tuned differently (eg, 'eager limit' differences or 
> medium/large message optimizations) than the one in the container.  That's 
> actually really useful already, and coupled with what you've already pointed 
> out about the launcher vs the MPI when it comes to ABI issues, makes me pretty 
> happy.
>
> But I would love to dive a little deeper on two of the more complex things 
> you've brought up:
>
> 1) With regards to not including the fabric drivers in the container and 
> mounting in the device drivers, how many different sets of drivers are there?  
> I know you mentioned you're not really plugged into the container community, so 
> maybe this is more a question for them, but I'd think if there's a relatively 
> small set that accounts for most systems, you might be able to include them 
> all, and have the dlopen facility find the correct one at launch?  (Eg, the 
> 'host' launcher could transmit some information as to which drivers  -like 
> 'mlnx5'- are desired, and it looks for those inside the container?)
>
> I definitely agree that mounting in specific drivers is beyond what most 
> users are comfortable with, so understanding the plans to address this would be 
> great.  Mind you, for most of our work, even just including the 'inbox' OFED 
> drivers works well enough right now.

So there are two problems you can encounter when using internal drivers. The 
first is a compatibility issue between the drivers in the container and the 
firmware in the local hardware. Vendors assume that the drivers and firmware 
get updated as a package since that is how they deliver it. In the case of a 
container, however, you have a driver that is "frozen" in time while the 
firmware is moving around. You could wind up on a system whose firmware is more 
recent, but also on a system with "stone age" firmware someone hasn't 
wanted/needed to update for some time. Thus, your drivers could run into 
problems.

One way you can work around that is to create your own compatibility table - 
i.e., embed some logic in your container that queries the firmware level of the 
local hardware and checks for compatibility with your embedded driver. If it 
isn't compatible, then you can either report that problem and cleanly abort, or 
you can mount the external drivers. Little more work, and I'm not sure what it 
really buys you since you'd have to be prepared to mount the external drivers 
anyway - but still something one could do.
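
The compatibility-table idea could look roughly like the following. This is only a minimal sketch: the table contents, the version numbers, and the names `COMPAT_TABLE`, `choose_drivers` and `Action` are all hypothetical, and how the firmware level is actually queried is hardware-specific and deliberately left out.

```python
from enum import Enum


class Action(Enum):
    USE_EMBEDDED = "use embedded drivers"
    MOUNT_EXTERNAL = "mount external drivers"
    ABORT = "report incompatibility and abort"


# Hypothetical table: embedded driver version -> (oldest, newest) firmware
# it is known to work with. All version numbers here are made up.
COMPAT_TABLE = {
    "5.4": ("16.28.0", "16.32.9999"),
}


def parse_version(text):
    """Turn '16.30.1004' into (16, 30, 1004) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def choose_drivers(embedded_driver, firmware, external_available):
    """Decide what the container should do for the firmware found on the host."""
    bounds = COMPAT_TABLE.get(embedded_driver)
    if bounds is not None:
        oldest, newest = (parse_version(b) for b in bounds)
        if oldest <= parse_version(firmware) <= newest:
            return Action.USE_EMBEDDED
    # Unknown driver or firmware outside the tested range: fall back.
    return Action.MOUNT_EXTERNAL if external_available else Action.ABORT
```

Note that the fallback branch still needs the external drivers to be mountable, which is the point made above about what the extra work buys you.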

Second problem is that some drivers actually include kernel modules (e.g., if 
you are using CUDA). In those cases, you have no choice but to mount the local 
drivers as your container isn't going to have the system's addresses in it.

Frankly, the latter issue is the one that usually gets you - unless you have an 
overriding reason to bring your own drivers, probably better to mount them.


> 2) With regards to launching multiple processes within one container for 
> shared memory access, how is this done?  Or is it automatic now with modern 
> launchers?  Eg, if the launch commands knows it's running 96 copies of the same 
> container (via either 'host:96' or '-ppn 96' or something), is it 'smart' 
> enough to do this?   This also hasn't been a problem for us, since we're 
> typically rate-limited by inter-node comms, not intra-node ones, but it'd be 
> good to ensure we're doing it 'right'.

It depends on how you are launching the containers. If the system is launching 
the containers and you use mpirun (from inside one of the containers) to spawn 
the processes, then mpirun "sees" each container as being a "node". Thus, you 
just launch like normal.

If you are using something outside to not only start the containers, but also 
to start the application processes within the container, then it's a little 
less "normal". You'd need a container-aware launcher to do it - i.e., something 
that understands it is doing a two-stage launch, how to separate mapping of the 
processes from placement of the containers, and how to "inject" processes into 
the container. This is feasible and I actually had this working at one time 
(with Singularity, at least), but it has bit-rotted. If there is interest, I 
can put it on my "to-do" list to revive in PRRTE, though I can't promise a 
release date for that feature. You can see some info on the general idea here:

[PDF] https://openpmix.github.io/uploads/2019/04/PMIxSUG2019.pdf 
 
[PPX] https://www.slideshare.net/rcastain/pmix-bridging-the-container-boundary 
 
[video] 

Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Brian Dobbins via users
Hi Ralph,

  Thanks again for this wealth of information - we've successfully run the
same container instance across multiple systems without issues, even
surpassing 'native' performance in edge cases, presumably because the
native host MPI is either older or simply tuned differently (eg, 'eager
limit' differences or medium/large message optimizations) than the one in
the container.  That's actually really useful already, and coupled with
what you've already pointed out about the launcher vs the MPI when it comes
to ABI issues, makes me pretty happy.

  But I would love to dive a little deeper on two of the more complex
things you've brought up:

  1) With regards to not including the fabric drivers in the container and
mounting in the device drivers, how many different sets of drivers
*are* there?
I know you mentioned you're not really plugged into the container
community, so maybe this is more a question for them, but I'd think if
there's a relatively small set that accounts for most systems, you might be
able to include them all, and have the dlopen facility find the correct one
at launch?  (Eg, the 'host' launcher could transmit some information as to
which drivers  -like 'mlnx5'- are desired, and it looks for those inside
the container?)

  I definitely agree that mounting in specific drivers is beyond what most
users are comfortable with, so understanding the plans to address this
would be great.  Mind you, for most of our work, even just including the
'inbox' OFED drivers works well enough right now.

  2) With regards to launching multiple processes within one container for
shared memory access, how is this done?  Or is it automatic now with modern
launchers?  Eg, if the launch commands knows it's running 96 copies of the
same container (via either 'host:96' or '-ppn 96' or something), is it
'smart' enough to do this?   This also hasn't been a problem for us, since
we're typically rate-limited by inter-node comms, not intra-node ones, but
it'd be good to ensure we're doing it 'right'.

  Thanks again,
  - Brian


On Thu, Jan 27, 2022 at 10:22 AM Ralph Castain via users <
users@lists.open-mpi.org> wrote:

> Just to complete this - there is always a lingering question regarding
> shared memory support. There are two ways to resolve that one:
>
> * run one container per physical node, launching multiple procs in each
> container. The procs can then utilize shared memory _inside_ the container.
> This is the cleanest solution (i.e., minimizes container boundary
> violations), but some users need/want per-process isolation.
>
> * run one container per MPI process, having each container then mount an
> _external_ common directory to an internal mount point. This allows each
> process to access the common shared memory location. As with the device
> drivers, you typically specify that external mount location when launching
> the container.
>
> Using those combined methods, you can certainly have a "generic" container
> that suffers no performance impact from bare metal. The problem has been
> that it takes a certain degree of "container savvy" to set this up and make
> it work - which is beyond what most users really want to learn. I'm sure
> the container community is working on ways to reduce that burden (I'm not
> really plugged into those efforts, but others on this list might be).
>
> Ralph
>
>
> > On Jan 27, 2022, at 7:39 AM, Ralph H Castain wrote:
> >
> >> Fair enough Ralph! I was implicitly assuming a "build once / run
> everywhere" use case, my bad for not making my assumption clear.
> >> If the container is built to run on a specific host, there are indeed
> other options to achieve near native performances.
> >>
> >
> > Err...that isn't actually what I meant, nor what we did. You can, in
> fact, build a container that can "run everywhere" while still employing
> high-speed fabric support. What you do is:
> >
> > * configure OMPI with all the fabrics enabled (or at least all the ones
> you care about)
> >
> > * don't include the fabric drivers in your container. These can/will
> vary across deployments, especially those (like NVIDIA's) that involve
> kernel modules
> >
> > * set up your container to mount specified external device driver
> locations onto the locations where you configured OMPI to find them. Sadly,
> this does violate the container boundary - but nobody has come up with
> another solution, and at least the violation is confined to just the device
> drivers. Typically, you specify the external locations that are to be
> mounted using an envar or some other mechanism appropriate to your
> container, and then include the relevant information when launching the
> containers.
> >
> > When OMPI initializes, it will do its normal procedure of attempting to
> load each fabric's drivers, selecting the transports whose drivers it can
> load. NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to
> build without statically linking in the fabric plugins or else this
> probably 

Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Ralph Castain via users
Just to complete this - there is always a lingering question regarding shared 
memory support. There are two ways to resolve that one:

* run one container per physical node, launching multiple procs in each 
container. The procs can then utilize shared memory _inside_ the container. 
This is the cleanest solution (i.e., minimizes container boundary violations), 
but some users need/want per-process isolation.

* run one container per MPI process, having each container then mount an 
_external_ common directory to an internal mount point. This allows each 
process to access the common shared memory location. As with the device 
drivers, you typically specify that external mount location when launching the 
container.
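
To illustrate why the common external directory matters in the second option, here is a minimal sketch in which two independent mappings of the same backing file see each other's writes. It stands in for two containers that mount the same external directory; it is not how Open MPI's shared memory support is implemented, and the file name `segment` and the helper names are hypothetical.

```python
import mmap
import os
import tempfile


def open_shared_segment(shared_dir, name, size):
    """Map a file that lives in the common directory, creating it if needed."""
    path = os.path.join(shared_dir, name)
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        os.ftruncate(fd, size)
        # On Unix, mmap.mmap creates a shared (MAP_SHARED) mapping by default.
        return mmap.mmap(fd, size)
    finally:
        os.close(fd)  # the mapping keeps its own duplicate of the descriptor


def demo():
    """Write through one mapping and read the bytes back through another."""
    with tempfile.TemporaryDirectory() as shared_dir:
        writer = open_shared_segment(shared_dir, "segment", 4096)
        reader = open_shared_segment(shared_dir, "segment", 4096)
        writer[:5] = b"hello"
        data = bytes(reader[:5])
        writer.close()
        reader.close()
        return data
```

If each "process" used its own private directory instead, the two mappings would be backed by different files and nothing would be shared, which is what happens when per-process containers do not mount a common location.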

Using those combined methods, you can certainly have a "generic" container that 
suffers no performance impact from bare metal. The problem has been that it 
takes a certain degree of "container savvy" to set this up and make it work - 
which is beyond what most users really want to learn. I'm sure the container 
community is working on ways to reduce that burden (I'm not really plugged into 
those efforts, but others on this list might be).

Ralph


> On Jan 27, 2022, at 7:39 AM, Ralph H Castain wrote:
> 
>> Fair enough Ralph! I was implicitly assuming a "build once / run everywhere" 
>> use case, my bad for not making my assumption clear.
>> If the container is built to run on a specific host, there are indeed other 
>> options to achieve near native performances.
>> 
> 
> Err...that isn't actually what I meant, nor what we did. You can, in fact, 
> build a container that can "run everywhere" while still employing high-speed 
> fabric support. What you do is:
> 
> * configure OMPI with all the fabrics enabled (or at least all the ones you 
> care about)
> 
> * don't include the fabric drivers in your container. These can/will vary 
> across deployments, especially those (like NVIDIA's) that involve kernel 
> modules
> 
> * set up your container to mount specified external device driver locations 
> onto the locations where you configured OMPI to find them. Sadly, this does 
> violate the container boundary - but nobody has come up with another 
> solution, and at least the violation is confined to just the device drivers. 
> Typically, you specify the external locations that are to be mounted using an 
> envar or some other mechanism appropriate to your container, and then include 
> the relevant information when launching the containers.
> 
> When OMPI initializes, it will do its normal procedure of attempting to load 
> each fabric's drivers, selecting the transports whose drivers it can load. 
> NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to build 
> without statically linking in the fabric plugins or else this probably will 
> fail.
> 
> At least one vendor now distributes OMPI containers preconfigured with their 
> fabric support based on this method. So using a "generic" container doesn't 
> mean you lose performance - in fact, our tests showed zero impact on 
> performance using this method.
> 
> HTH
> Ralph
> 




Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Ralph Castain via users
> Fair enough Ralph! I was implicitly assuming a "build once / run everywhere" 
> use case, my bad for not making my assumption clear.
> If the container is built to run on a specific host, there are indeed other 
> options to achieve near native performances.
> 

Err...that isn't actually what I meant, nor what we did. You can, in fact, 
build a container that can "run everywhere" while still employing high-speed 
fabric support. What you do is:

* configure OMPI with all the fabrics enabled (or at least all the ones you 
care about)

* don't include the fabric drivers in your container. These can/will vary 
across deployments, especially those (like NVIDIA's) that involve kernel modules

* set up your container to mount specified external device driver locations onto 
the locations where you configured OMPI to find them. Sadly, this does violate 
the container boundary - but nobody has come up with another solution, and at 
least the violation is confined to just the device drivers. Typically, you 
specify the external locations that are to be mounted using an envar or some 
other mechanism appropriate to your container, and then include the relevant 
information when launching the containers.

When OMPI initializes, it will do its normal procedure of attempting to load 
each fabric's drivers, selecting the transports whose drivers it can load. 
NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to build 
without statically linking in the fabric plugins or else this probably will 
fail.
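
The steps above can be sketched as two command lines, built here as argument lists. This is a sketch under assumptions, not the setup actually used: the install prefix, image name and driver paths are hypothetical. `--with-ucx`, `--with-ofi` and `--enable-mca-dso` are Open MPI configure options and `--bind` is the Singularity bind-mount option, but check them against the versions you actually build and run.

```python
import shlex


def ompi_configure_args(prefix, fabrics=("ucx", "ofi")):
    """Configure line enabling several fabrics, with plugins built as DSOs."""
    args = ["./configure", f"--prefix={prefix}"]
    args += [f"--with-{fabric}" for fabric in fabrics]
    # Build MCA components as loadable plugins rather than linking them in,
    # so a fabric whose driver is missing at run time can simply be skipped.
    args.append("--enable-mca-dso")
    return args


def container_launch_args(image, host_driver_dir, container_driver_dir, command):
    """Launch line that bind-mounts the host's driver directory into the image."""
    bind = f"{host_driver_dir}:{container_driver_dir}"
    return ["singularity", "exec", "--bind", bind, image, *command]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(shlex.join(ompi_configure_args("/opt/ompi")))
    print(shlex.join(container_launch_args(
        "ompi.sif", "/usr/lib64/fabric", "/opt/fabric", ["./app"])))
```

Keeping the two halves separate mirrors the split described above: the configure line is fixed when the image is built, while the bind mount is supplied per deployment at launch time.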

At least one vendor now distributes OMPI containers preconfigured with their 
fabric support based on this method. So using a "generic" container doesn't 
mean you lose performance - in fact, our tests showed zero impact on 
performance using this method.

HTH
Ralph




Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Jeff Squyres (jsquyres) via users
This is part of the challenge of HPC: there are general solutions, but no 
specific silver bullet that works in all scenarios.  In short: everyone's setup 
is different.  So we can offer advice, but not necessarily a 100%-guaranteed 
solution that will work in your environment.

In general, we advise users to:

* Configure/build Open MPI with their favorite compilers (either a proprietary 
compiler or modern/recent GCC or clang).  More recent compilers tend to give 
better performance than older compilers.
* Configure/build Open MPI against the communication library for your HPC 
interconnect.  These days, it's mostly Libfabric or UCX.  If you have 
InfiniBand, it's UCX.

That's probably table stakes right there; you can tweak more beyond that, but 
with those 2 things, you'll go pretty far.
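
One way to check whether an existing installation (a distribution package, for instance) was built against UCX or Libfabric is to look at the components that `ompi_info` reports. The sketch below parses that output; the sample text is only illustrative of the "MCA framework: component" line format, with made-up version fields, and the helper names are hypothetical.

```python
import re

# Illustrative sample only; real output is much longer and versions differ.
SAMPLE = """\
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.0)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.0)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.0)
"""

_COMPONENT_LINE = re.compile(r"^\s*MCA\s+(\S+):\s+(\S+)")


def list_components(ompi_info_text):
    """Return {framework: {component, ...}} from ompi_info-style text."""
    components = {}
    for line in ompi_info_text.splitlines():
        match = _COMPONENT_LINE.match(line)
        if match:
            components.setdefault(match.group(1), set()).add(match.group(2))
    return components


def has_fabric_support(components):
    """Report whether the UCX PML and the OFI (Libfabric) MTL are present."""
    return {
        "ucx": "ucx" in components.get("pml", set()),
        "ofi": "ofi" in components.get("mtl", set()),
    }
```

On a real system the text would come from running the tool itself, for example `subprocess.run(["ompi_info"], capture_output=True, text=True).stdout`.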

FWIW, we did a series of talks about Open MPI in conjunction with the EasyBuild 
community recently.  The videos and slides are available here:

* https://www.open-mpi.org/video/?category=general#abcs-of-open-mpi-part-1
* https://www.open-mpi.org/video/?category=general#abcs-of-open-mpi-part-2
* https://www.open-mpi.org/video/?category=general#abcs-of-open-mpi-part-3

For a beginner, parts 1 and 2 are probably the most relevant, and you can 
probably skip the parts about PMIx (circle back to that later for more advanced 
knowledge).

--
Jeff Squyres
jsquy...@cisco.com


From: users on behalf of Diego Zuccato via users
Sent: Thursday, January 27, 2022 2:59 AM
To: users@lists.open-mpi.org
Cc: Diego Zuccato
Subject: Re: [OMPI users] RES: OpenMPI - Intel MPI

Sorry for the noob question, but: what should I configure for OpenMPI
"to perform on the host cluster"? Any link to a guide would be welcome!

Slightly extended rationale for the question: I'm currently using
"unconfigured" Debian packages and getting some strange behaviour...
Maybe it's just something that a little tuning can fix easily.

Il 27/01/2022 07:58, Ralph Castain via users ha scritto:
> I'll disagree a bit there. You do want to use an MPI library in your
> container that is configured to perform on the host cluster. However,
> that doesn't mean you are constrained as Gilles describes. It takes a
> little more setup knowledge, true, but there are lots of instructions
> and knowledgeable people out there to help. Experiments have shown that
> using non-system MPIs provides at least equivalent performance to the
> native MPIs when configured. Matching the internal/external MPI
> implementations may simplify the mechanics of setting it up, but it is
> definitely not required.
>
>
>> On Jan 26, 2022, at 8:55 PM, Gilles Gouaillardet via users
>> <users@lists.open-mpi.org> wrote:
>>
>> Brian,
>>
>> FWIW
>>
>> Keep in mind that when running a container on a supercomputer, it is
>> generally recommended to use the supercomputer MPI implementation
>> (fine tuned and with support for the high speed interconnect) instead
>> of the one in the container (generally a vanilla MPI with basic
>> support for TCP and shared memory).
>> That scenario implies several additional constraints, and one of them
>> is that the MPI libraries of the host and the container are
>> (oversimplified) ABI compatible.
>>
>> In your case, you would have to rebuild your container with MPICH
>> (instead of Open MPI) so it can be "substituted" at run time with
>> Intel MPI (MPICH based and ABI compatible).
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thu, Jan 27, 2022 at 1:07 PM Brian Dobbins via users
>> <users@lists.open-mpi.org> wrote:
>>
>>
>> Hi Ralph,
>>
>>   Thanks for the explanation - in hindsight, that makes perfect
>> sense, since each process is operating inside the container and
>> will of course load up identical libraries, so data types/sizes
>> can't be inconsistent.  I don't know why I didn't realize that
>> before.  I imagine the past issues I'd experienced were just due
>> to the PMI differences in the different MPI implementations at the
>> time.  I owe you a beer or something at the next in-person SC
>> conference!
>>
>>   Cheers,
>>   - Brian
>>
>>
>> On Wed, Jan 26, 2022 at 4:54 PM Ralph Castain via users
>> <users@lists.open-mpi.org> wrote:
>>
>> There is indeed an ABI difference. However, the _launcher_
>> doesn't have anything to do with the MPI library. All that is
>> needed is a launcher that can provide the key exchange
>> required to wireup the MPI processes. At this point, both
>> MPICH and OMPI have PMIx support, so you can use the same
>> launcher for both. IMPI does not, and so the IMPI launcher
>> will only support PMI-1 or PMI-2 (I forget which one).
>>
>> You can, however, work around that problem. For example, if
>> the host system is using Slurm, then you could "srun" the
>> containers and let Slurm perform the wireup. Again, you'd have
>> to ensure that OMPI was built to 

Re: [OMPI users] Gadget2 error 818 when using more than 1 process?

2022-01-27 Thread Jeff Squyres (jsquyres) via users
I'm afraid that without any further details, it's hard to help. I don't know 
why Gadget2 would complain about its parameters file.  From what you've stated, 
it could be a problem with the application itself.

Have you talked to the Gadget2 authors?

--
Jeff Squyres
jsquy...@cisco.com


From: users on behalf of Diego Zuccato via users
Sent: Wednesday, January 26, 2022 2:06 AM
To: users@lists.open-mpi.org
Cc: Diego Zuccato
Subject: Re: [OMPI users] Gadget2 error 818 when using more than 1 process?

Il 26/01/2022 02:10, Jeff Squyres (jsquyres) via users ha scritto:

> I'm afraid I don't know anything about Gadget, so I can't comment there.  How 
> exactly does the application fail?
Neither did I :(
It fails saying a 'timestep' is 0, and that's usually caused by an error
in the parameters file. But the parameters file is OK, and it actually
works if the user runs it in a single process. Or even with
multithreaded runs, sometimes and on some nodes. That's quite random :(
But the runs are usually single-node (simple examples for students).

> Can you try upgrading to Open MPI v4.1.2?
That would be a real mess. I'm stuck with packages provided by Debian
stable. I lack both the manpower and the knowledge to compile everything
from scratch, given the intricate relations between slurm, openmpi,
infiniband, etc. :(

> What networking are you using?
Infiniband (Mellanox cards, w/ Debian-supplied drivers and support
programs) and ethernet. Infiniband is also used by IPoIB to reach the
storage servers (gluster). Some nodes lack IB, so access to the storage
is achieved by a couple of iptables rules.
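
Since the warning quoted below comes from the OFI path while the hardware is InfiniBand, one experiment that needs no rebuild is to ask Open MPI to use UCX through an MCA environment variable. This is a sketch of that experiment only, not a known fix: whether the Debian packages include UCX support is not established in this thread, and the helper name `srun_with_ucx` is hypothetical. `OMPI_MCA_pml` is the environment form of the `pml` MCA parameter, and the `srun` line is the one from the job file.

```python
import os
import shlex


def srun_with_ucx(executable, paramfile, base_env=None):
    """Return (command, environment) for a run that requests the UCX PML."""
    env = dict(os.environ if base_env is None else base_env)
    # Equivalent to passing "--mca pml ucx" to mpirun; the job aborts at
    # startup if the UCX component is not available in this build.
    env["OMPI_MCA_pml"] = "ucx"
    command = ["srun", "--mpi=pmix_v4", executable, paramfile]
    return command, env


if __name__ == "__main__":
    cmd, _ = srun_with_ucx("./Gadget2", "paramfile")
    print("OMPI_MCA_pml=ucx", shlex.join(cmd))
```

If the job then fails at startup complaining that the component cannot be found, that in itself answers the question of whether the packaged build has UCX support.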

> 
> From: users on behalf of Diego Zuccato via users
> Sent: Tuesday, January 25, 2022 5:43 AM
> To: Open MPI Users
> Cc: Diego Zuccato
> Subject: [OMPI users] Gadget2 error 818 when using more than 1 process?
>
> Hello all.
>
> A user of our cluster is experiencing a weird problem that I can't pinpoint.
>
> He does have a job script that worked well on every node. It's based on
> Gadget2.
>
> Lately, *sometimes*, the same executable with the same parameters file
> works, sometimes it fails. On the same node and submitting with the same
> command. On some nodes it always fails. But if it gets reduced to
> sequential (asking for just one process), it completes correctly (so the
> parameters file, common source of Gadget2 error 818, seems innocent).
>
> The cluster uses SLURM and limits resources using cgroups, if that matters.
>
> Seems most of the issues started after upgrading from openmpi 3.1.3 to
> 4.1.0 in September.
>
> Maybe related, the nodes started spitting out these warnings (that IIUC
> should be harmless... but I'd like to debug & resolve anyway):
> -8<--
> Open MPI's OFI driver detected multiple equidistant NICs from the
> current process, but had insufficient information to ensure MPI
> processes fairly pick a NIC for use.
> This may negatively impact performance. A more modern PMIx server is
> necessary to resolve this issue.
> -8<--
>
> Code is run (from the jobfile) with:
> srun --mpi=pmix_v4 ./Gadget2 paramfile
> (we also tried with a simple mpirun w/ no extra parameters leveraging
> SLURM's integration/autodetection -- same result)
>
> Any hints?
>
> TIA
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Diego Zuccato via users
Sorry for the noob question, but: what should I configure for OpenMPI 
"to perform on the host cluster"? Any link to a guide would be welcome!


Slightly extended rationale for the question: I'm currently using 
"unconfigured" Debian packages and getting some strange behaviour... 
Maybe it's just something that a little tuning can fix easily.
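
A first step that may help here is to see which transports the packaged build actually contains. The sketch below is only an illustration: the helper name list_transports is made up, and it assumes the plain `ompi_info` output format, with lines of the form "MCA <framework>: <component> (...)".

```shell
#!/bin/sh
# list_transports (hypothetical helper): read plain `ompi_info` output on
# stdin and print the pml/mtl/btl components as "framework/component".
list_transports() {
    awk '$1 == "MCA" && ($2 == "pml:" || $2 == "mtl:" || $2 == "btl:") {
        sub(":", "", $2)      # "btl:" -> "btl"
        print $2 "/" $3
    }'
}

# Query the local installation only if ompi_info is actually on PATH.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | list_transports
fi
```

If the list shows only tcp, self and a shared-memory component, the build probably has no native support for the cluster's high-speed interconnect, which would be the first thing to look at.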


On 27/01/2022 07:58, Ralph Castain via users wrote:
I'll disagree a bit there. You do want to use an MPI library in your 
container that is configured to perform on the host cluster. However, 
that doesn't mean you are constrained as Gilles describes. It takes a 
little more setup knowledge, true, but there are lots of instructions 
and knowledgeable people out there to help. Experiments have shown that 
using non-system MPIs provides at least equivalent performance to the 
native MPIs when configured. Matching the internal/external MPI 
implementations may simplify the mechanics of setting it up, but it is 
definitely not required.



On Jan 26, 2022, at 8:55 PM, Gilles Gouaillardet via users 
<users@lists.open-mpi.org> wrote:


Brian,

FWIW

Keep in mind that when running a container on a supercomputer, it is 
generally recommended to use the supercomputer MPI implementation
(fine tuned and with support for the high speed interconnect) instead 
of the one in the container (generally a vanilla MPI with basic
support for TCP and shared memory).
That scenario implies several additional constraints, and one of them 
is that the MPI libraries of the host and the container must be 
(oversimplifying) ABI compatible.


In your case, you would have to rebuild your container with MPICH 
(instead of Open MPI) so it can be "substituted" at run time with 
Intel MPI (MPICH based and ABI compatible).
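
One rough way to see which ABI family a binary in the container was linked against is to look at the libmpi soname reported by `ldd`. The sketch below is an assumption-laden illustration: the helper name mpi_abi_family is made up, and it assumes the sonames typically seen in practice (libmpi.so.12 for MPICH-ABI libraries such as MPICH and Intel MPI, libmpi.so.40 for recent Open MPI releases). Other builds may differ.

```shell
#!/bin/sh
# mpi_abi_family (hypothetical helper): read `ldd` output on stdin and
# guess the MPI ABI family from the first libmpi.so.N soname found.
# The soname-to-family mapping is an assumption about typical builds.
mpi_abi_family() {
    soname=$(awk '$1 ~ /^libmpi\.so\./ { print $1; exit }')
    case "$soname" in
        libmpi.so.12) echo "mpich-abi" ;;
        libmpi.so.40) echo "openmpi" ;;
        "")           echo "no-libmpi" ;;
        *)            echo "unknown:$soname" ;;
    esac
}

# Example (not executed here; ./app is a placeholder binary):
#   ldd ./app | mpi_abi_family
```

A result of "openmpi" would mean the binary cannot simply be pointed at an Intel MPI library at run time, which is why the rebuild against MPICH is needed.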


Cheers,

Gilles

On Thu, Jan 27, 2022 at 1:07 PM Brian Dobbins via users 
<users@lists.open-mpi.org> wrote:



Hi Ralph,

  Thanks for the explanation - in hindsight, that makes perfect
sense, since each process is operating inside the container and
will of course load up identical libraries, so data types/sizes
can't be inconsistent.  I don't know why I didn't realize that
before.  I imagine the past issues I'd experienced were just due
to the PMI differences in the different MPI implementations at the
time.  I owe you a beer or something at the next in-person SC
conference!

  Cheers,
  - Brian


On Wed, Jan 26, 2022 at 4:54 PM Ralph Castain via users
<users@lists.open-mpi.org> wrote:

There is indeed an ABI difference. However, the _launcher_
doesn't have anything to do with the MPI library. All that is
needed is a launcher that can provide the key exchange
required to wireup the MPI processes. At this point, both
MPICH and OMPI have PMIx support, so you can use the same
launcher for both. IMPI does not, and so the IMPI launcher
will only support PMI-1 or PMI-2 (I forget which one).

You can, however, work around that problem. For example, if
the host system is using Slurm, then you could "srun" the
containers and let Slurm perform the wireup. Again, you'd have
to ensure that OMPI was built to support whatever wireup
protocol the Slurm installation supported (which might well be
PMIx today). Also works on Cray/ALPS. Completely bypasses the
IMPI issue.
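
As a rough sketch of the Slurm route: `srun --mpi=list` reports which wireup plugins the installation offers, and the choice can be made from that. The helper name pick_mpi_plugin is made up, and the container command in the comment (singularity, image.sif, ./app) is purely an assumed example; the runtime and image on a given site will differ.

```shell
#!/bin/sh
# pick_mpi_plugin (hypothetical helper): read the output of
# `srun --mpi=list` on stdin and print the plugin to request,
# preferring PMIx over PMI-2.
pick_mpi_plugin() {
    list=$(cat)
    case "$list" in
        *pmix*) echo pmix ;;
        *pmi2*) echo pmi2 ;;
        *)      echo none ;;
    esac
}

# Example (not executed here). The list may be printed on stderr,
# hence 2>&1. singularity, image.sif and ./app are assumptions.
#   plugin=$(srun --mpi=list 2>&1 | pick_mpi_plugin)
#   srun --mpi="$plugin" -n 4 singularity exec image.sif ./app
```

The OMPI inside the container still has to be built with support for whichever plugin gets picked, as noted above.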

Another option I've seen used is to have the host system start
the containers (using ssh or whatever), providing the
containers with access to a "hostfile" identifying the TCP
address of each container. It is then easy for OMPI's mpirun
to launch the job across the containers. I use this every day
on my machine (using Docker Desktop with Docker containers,
but the container tech is irrelevant here) to test OMPI.
Pretty easy to set that up, and I should think the sys admins
could do so for their users.
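
A minimal sketch of the hostfile idea, assuming Open MPI's "address slots=N" hostfile syntax. The helper name make_hostfile, the addresses, and ./app are placeholders for illustration only.

```shell
#!/bin/sh
# make_hostfile (hypothetical helper): print one Open MPI hostfile line
# per container address. First argument is the slot count per container,
# the remaining arguments are the container addresses.
make_hostfile() {
    slots=$1
    shift
    for addr in "$@"; do
        printf '%s slots=%s\n' "$addr" "$slots"
    done
}

# Example (not executed here; addresses and ./app are placeholders):
#   make_hostfile 2 10.0.0.2 10.0.0.3 > hosts
#   mpirun --hostfile hosts -np 4 ./app
```

The containers need to be reachable from wherever mpirun runs (typically over ssh) for this to work.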

Finally, you could always install the PMIx Reference RTE
(PRRTE) on the cluster as that executes at user level, and
then use PRRTE to launch your OMPI containers. OMPI runs very
well under PRRTE - in fact, PRRTE is the RTE embedded in OMPI
starting with the v5.0 release.

Regardless of your choice of method, the presence of IMPI
doesn't preclude using OMPI containers so long as the OMPI
library is fully contained in that container. Choice of launch
method just depends on how your system is set up.

Ralph



On Jan 26, 2022, at 3:17 PM, Brian Dobbins
<bdobb...@gmail.com> wrote:


Hi Ralph,

Afraid I don't understand. If your image has the OMPI
libraries installed in it, what difference does it make
what is on your host? You'll never see the IMPI installation.


We have been supporting people running that way since