[OMPI users] Is the mpi.3 manpage out of date?

2020-08-25 Thread Riebs, Andy via users
In searching to confirm my belief that recent versions of Open MPI support the 
MPI-3.1 standard, I was a bit surprised to find this in the mpi.3 man page from 
the 4.0.2 release:

"The  outcome,  known  as  the MPI Standard, was first published in 1993; its 
most recent version (MPI-2) was published in July 1997. Open MPI 1.2 includes 
all MPI 1.2-compliant and MPI 2-compliant routines."

(For those who are manpage-averse, see
<https://www.open-mpi.org/doc/v4.0/man3/MPI.3.php>.)

I'm willing to bet that y'all haven't been sitting on your hands since Open MPI 
1.2 was released!

Andy

--
Andy Riebs
andy.ri...@hpe.com
Hewlett Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024



Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-25 Thread Jeff Squyres (jsquyres) via users
On Aug 24, 2020, at 9:44 PM, Tony Ladd  wrote:
> 
> I appreciate your help (and John's as well). At this point I don't think this
> is an OMPI problem - my mistake. I think the communication with RDMA is somehow
> disabled (perhaps it's the verbs layer - I am not very knowledgeable about
> this). It used to work like a dream, but Mellanox has apparently disabled some
> of the Connect X2 components, because neither ompi nor ucx (with/without ompi)
> could connect with the RDMA layer. Some of the InfiniBand functions are also
> not working on the X2 (mstflint, mstconfig).

If the IB stack itself is not functioning, then you're right: Open MPI won't 
work, either (with openib or UCX).

You can try to keep poking with the low-layer diagnostic tools like ibv_devinfo 
and ibv_rc_pingpong.  If those don't work, Open MPI won't work over IB, either.
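
If it helps, here is a minimal sketch of the same kind of check done directly
against libibverbs (just an illustration of the API, not what ibv_devinfo
actually does internally).  If even this reports no devices, or a port that is
not ACTIVE, the verbs layer itself is broken and no MPI setting will help:

    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num_devices = 0;
        struct ibv_device **devs = ibv_get_device_list(&num_devices);
        if (devs == NULL || num_devices == 0) {
            fprintf(stderr, "no RDMA devices found (driver loaded?)\n");
            return 1;
        }

        for (int i = 0; i < num_devices; ++i) {
            struct ibv_context *ctx = ibv_open_device(devs[i]);
            if (ctx == NULL) {
                fprintf(stderr, "%s: ibv_open_device failed\n",
                        ibv_get_device_name(devs[i]));
                continue;
            }

            struct ibv_port_attr port;
            /* port numbers start at 1; a single-port HCA only has port 1 */
            if (ibv_query_port(ctx, 1, &port) == 0)
                printf("%s: port 1 is %s\n", ibv_get_device_name(devs[i]),
                       port.state == IBV_PORT_ACTIVE ? "ACTIVE" : "not active");

            ibv_close_device(ctx);
        }

        ibv_free_device_list(devs);
        return 0;
    }

(Build with something like "cc verbs_check.c -libverbs".)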

> In fact ompi always tries to access the openib module. I have to explicitly 
> disable it even to run on 1 node.

Yes, that makes sense: Open MPI will aggressively try to use every possible 
mechanism.

> So I think it is in initialization, not communication, that the problem lies.

I'm not sure that's correct.

From your initial emails, it looks like openib thinks it initialized properly.

> This is why (I think) ibv_obj returns NULL.

I'm not sure if that's a problem or not.  That section of output is where Open 
MPI is measuring the distance from the current process to the PCI bus where the 
device lives.  I don't remember offhand if returning NULL in that area is 
actually a problem or just an indication of some kind of non-error condition.

Specifically: if returning NULL there was a problem, we *probably* would have 
aborted at that point.  I have not looked at the code to verify that, though.
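
For the curious: the lookup in question is essentially "find the verbs device
in the hwloc topology and see what it is attached to".  A rough standalone
sketch with hwloc (illustrative only; this is not the actual Open MPI code
path, and it assumes hwloc 2.x):

    #include <stdio.h>
    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        /* keep I/O objects (PCI devices, OS devices) in the topology */
        hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_ALL);
        hwloc_topology_load(topo);

        hwloc_obj_t osdev = NULL;
        while ((osdev = hwloc_get_next_osdev(topo, osdev)) != NULL) {
            if (osdev->attr->osdev.type != HWLOC_OBJ_OSDEV_OPENFABRICS)
                continue;   /* only verbs devices, e.g. mlx4_0 */

            /* first non-I/O ancestor: roughly "where in the machine
               (package / NUMA node) the HCA is attached" */
            hwloc_obj_t anc = hwloc_get_non_io_ancestor_obj(topo, osdev);
            printf("%s is attached under a %s object\n", osdev->name,
                   anc ? hwloc_obj_type_string(anc->type) : "(unknown)");
        }

        hwloc_topology_destroy(topo);
        return 0;
    }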

> The better news is that with the tcp stack everything works fine (ompi, ucx,
> 1 node, many nodes) - the bandwidth is similar to rdma, so for large messages
> it's semi OK. It's a partial solution - not all I wanted, of course. The direct
> rdma functions (ib_read_lat etc.) also work fine, with expected results. I am
> suspicious this disabling of the driver is a commercial more than a technical
> decision.
> I am going to try going back to Ubuntu 16.04 - there is a version of OFED
> that still supports the X2. But I think it may still get messed up by kernel
> upgrades (it does for 18.04, I found). So it's not an easy path.


I can't speak for Nvidia here, sorry.

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Issue with shared memory arrays in Fortran

2020-08-25 Thread Patrick McNally via users
Thank you very much for the response.  I have to admit that I'm much more
in the developer camp than the admin camp and am not terribly familiar with
installing and configuring OpenMPI myself.  At least one of the systems
does not appear to use ucx but both are using mxm.  I'm attaching the
output of 'ompi_info --all' for the system on which the code always fails
(not just with Python) in case it is helpful.  I neglected to mention that
the shmemTest.out file included in the original archive was a run from this
system.

I tested your suggested workaround on that same always-failing system and
it did indeed work, so thank you very much for that!  I realize you
probably don't know enough about the root cause yet to know the scope of
the issue, but if you get to that point I'd appreciate any additional
knowledge about whether other calls (gather, reduce, etc.) might also be
affected.  This particular call was relatively easy for me to find because
the bad data caused obvious failures in our code.  It is possible other
areas are also affected but just in more subtle ways.
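
In case it is useful, the pattern in the reproducer generalizes easily to other
collectives: encode the iteration number in the buffer on the root, poison it
on the other ranks, and validate every element after the call.  A stripped-down
sketch in plain C (no System V shared memory here, and MPI_Bcast is just a
stand-in for whatever collective you want to exercise):

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define N     1024   /* buffer length, arbitrary for this sketch */
    #define ITERS 10

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int *buf = malloc(N * sizeof(int));

        for (int iter = 1; iter <= ITERS; ++iter) {
            if (rank == 0)
                for (int i = 0; i < N; ++i)
                    buf[i] = iter;      /* root sends all 1s, then all 2s, ... */
            else
                for (int i = 0; i < N; ++i)
                    buf[i] = -1;        /* poison, so stale data is visible */

            MPI_Bcast(buf, N, MPI_INT, 0, MPI_COMM_WORLD);

            /* every element should now equal the iteration number */
            for (int i = 0; i < N; ++i)
                if (buf[i] != iter)
                    printf("rank %d, iter %d: buf[%d] = %d (stale?)\n",
                           rank, iter, i, buf[i]);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

Swapping the MPI_Bcast for a gather or reduce (and adjusting the check
accordingly) would exercise the other calls the same way.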

Please let me know if there is any additional testing/exploration I can do
on this end to help.

-Patrick

On Mon, Aug 24, 2020 at 11:19 PM Gilles Gouaillardet via users <
users@lists.open-mpi.org> wrote:

> Patrick,
>
> Thanks for the report and the reproducer.
>
> I was able to confirm the issue with python and Fortran, but
>  - I can only reproduce it with pml/ucx (that is, --mca pml ob1 --mca btl
> tcp,self works fine)
>  - I can only reproduce it with bcast algorithms 8 and 9
>
> As a workaround, you can keep using ucx but manually change the bcast algo
>
> mpirun --mca coll_tuned_use_dynamic_rules 1 --mca
> coll_tuned_bcast_algorithm 1 ...
>
> /* you can replace the bcast algorithm with any value between 1 and 7
> inclusive */
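>
> If it is more convenient than adding mpirun flags, the same parameters can
> also be set via the OMPI_MCA_* environment variables, for example from the
> application itself.  A minimal sketch (assuming, as is normally the case,
> that these coll/tuned parameters are read from the environment during
> MPI_Init):
>
>     #include <stdlib.h>
>     #include <mpi.h>
>
>     int main(int argc, char **argv)
>     {
>         /* same effect as the mpirun flags above; must be set before
>            MPI_Init so coll/tuned sees the values when it initializes */
>         setenv("OMPI_MCA_coll_tuned_use_dynamic_rules", "1", 1);
>         setenv("OMPI_MCA_coll_tuned_bcast_algorithm", "1", 1);
>
>         MPI_Init(&argc, &argv);
>         /* ... application ... */
>         MPI_Finalize();
>         return 0;
>     }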
>
> Cheers,
>
> Gilles
>
> On Mon, Aug 24, 2020 at 10:58 PM Patrick McNally via users
>  wrote:
> >
> > I apologize in advance for the size of the example source and probably
> the length of the email, but this has been a pain to track down.
> >
> > Our application uses System V style shared memory pretty extensively, and
> we have recently found that in certain circumstances, OpenMPI appears to
> provide ranks with stale data.  The attached archive contains sample code
> that demonstrates the issue.  There is a subroutine that uses a shared
> memory array to broadcast from a single rank on one compute node to a
> single rank on all other compute nodes.  The first call sends all 1s, then
> all 2s, and so on.  The receiving rank(s) get all 1s on the first
> execution, but on subsequent executions they receive some 2s and some 1s;
> then some 3s, some 2s, and some 1s.  The code contains a version of this
> routine in both C and Fortran but only the Fortran version appears to
> exhibit the problem.
> >
> > I've tried this with OpenMPI 3.1.5, 4.0.2, and 4.0.4 and on two
> different systems with very different configurations and both show the
> problem.  On one of the machines, it only appears to happen when MPI is
> initialized with mpi4py, so I've included that in the test as well.  Other
> than that, the behavior is very consistent across machines.  When run with
> the same number of ranks and same size array, the two machines even show
> the invalid values at the same indices.
> >
> > Please let me know if you need any additional information.
> >
> > Thanks,
> > Patrick
>


ompiInfoAll.tar.bz2
Description: application/bzip2


Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-25 Thread John Hearns via users
I apologise. That was an Omnipath issue
https://www.beowulf.org/pipermail/beowulf/2017-March/034214.html

On Tue, 25 Aug 2020 at 08:17, John Hearns  wrote:

> Aha. I dimly remember a problem with the ibverbs /dev device - maybe the
> permissions,
> or more likely the owner account for that device.
>
>
>
> On Tue, 25 Aug 2020 at 02:44, Tony Ladd  wrote:
>
>> Hi Jeff
>>
>> I appreciate your help (and John's as well). At this point I don't think
>> this is an OMPI problem - my mistake. I think the communication with RDMA is
>> somehow disabled (perhaps it's the verbs layer - I am not very
>> knowledgeable about this). It used to work like a dream, but Mellanox has
>> apparently disabled some of the Connect X2 components, because neither
>> ompi nor ucx (with/without ompi) could connect with the RDMA layer. Some
>> of the InfiniBand functions are also not working on the X2 (mstflint,
>> mstconfig).
>>
>> In fact ompi always tries to access the openib module. I have to
>> explicitly disable it even to run on 1 node. So I think it is in
>> initialization, not communication, that the problem lies. This is why (I
>> think) ibv_obj returns NULL. The better news is that with the tcp stack
>> everything works fine (ompi, ucx, 1 node, many nodes) - the bandwidth is
>> similar to rdma, so for large messages it's semi OK. It's a partial
>> solution - not all I wanted, of course. The direct rdma functions
>> (ib_read_lat etc.) also work fine, with expected results. I am suspicious
>> this disabling of the driver is a commercial more than a technical
>> decision.
>>
>> I am going to try going back to Ubuntu 16.04 - there is a version of
>> OFED that still supports the X2. But I think it may still get messed up
>> by kernel upgrades (it does for 18.04, I found). So it's not an easy path.
>>
>> Thanks again.
>>
>> Tony
>>
>> On 8/24/20 11:35 AM, Jeff Squyres (jsquyres) wrote:
>> >
>> > I'm afraid I don't have many better answers for you.
>> >
>> > I can't quite tell from your machines, but are you running IMB-MPI1
>> Sendrecv *on a single node* with `--mca btl openib,self`?
>> >
>> > I don't remember offhand, but I didn't think that openib was supposed
>> to do loopback communication.  E.g., if both MPI processes are on the same
>> node, `--mca btl openib,vader,self` should do the trick (where "vader" =
>> shared memory support).
>> >
>> > More specifically: are you running into a problem running openib
>> (and/or UCX) across multiple nodes?
>> >
>> > I can't speak to Nvidia support on various models of [older] hardware
>> (including UCX support on that hardware).  But be aware that openib is
>> definitely going away; it is wholly being replaced by UCX.  It may be that
>> your only option is to stick with older software stacks in these hardware
>> environments.
>> >
>> >
>> >> On Aug 23, 2020, at 9:46 PM, Tony Ladd via users <
>> users@lists.open-mpi.org> wrote:
>> >>
>> >> Hi John
>> >>
>> >> Thanks for the response. I have run all those diagnostics, and as best
>> I can tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients +
>> server) and the fabric passes all the tests. There is 1 warning:
>> >>
>> >> I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps
>> SL:0x00
>> >> -W- Suboptimal rate for group. Lowest member rate:40Gbps >
>> group-rate:10Gbps
>> >>
>> >> but according to a number of sources this is harmless.
>> >>
>> >> I have run Mellanox's P2P performance tests (ib_write_bw) between
>> different pairs of nodes and it reports 3.22 GB/sec, which is reasonable
>> (it's a PCIe 2 x8 interface, i.e. 4 GB/s). I have also configured 2 nodes
>> back to back to check that the switch is not the problem - it makes no
>> difference.
>> >>
>> >> I have been playing with the btl params with openMPI (v. 2.1.1, which
>> is what is released in Ubuntu 18.04). So with tcp as the transport layer
>> everything works fine - 1 node or 2 node communication - I have tested up
>> to 16 processes (8+8) and it seems fine. Of course the latency is much
>> higher on the tcp interface, so I would still like to access the RDMA
>> layer. But unless I exclude the openib module, it always hangs. Same with
>> OpenMPI v4 compiled from source.
>> >>
>> >> I think an important component is that Mellanox has not supported
>> Connect X2 for some time. This is really infuriating; a $500 network card
>> with no supported drivers, but that is business for you, I suppose. I have
>> 50 NICs and I can't afford to replace them all. The other component is that
>> MLNX-OFED is tied to specific software versions, so I can't just run an
>> older set of drivers. I have not seen source files for the Mellanox drivers
>> - I would take a crack at compiling them if I did. In the past I have used
>> the OFED drivers (on CentOS 5) with no problem, but I don't think this is
>> an option now.
>> >>
>> >> Ubuntu claims to support Connect X2 with their drivers (Mellanox
>> confirms this), but of course this is community support and the number of
>> cases 

Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-25 Thread John Hearns via users
Aha. I dimly remember a problem with the ibverbs /dev device - maybe the
permissions,
or more likely the owner account for that device.



On Tue, 25 Aug 2020 at 02:44, Tony Ladd  wrote:

> Hi Jeff
>
> I appreciate your help (and John's as well). At this point I don't think
> this is an OMPI problem - my mistake. I think the communication with RDMA is
> somehow disabled (perhaps it's the verbs layer - I am not very
> knowledgeable about this). It used to work like a dream, but Mellanox has
> apparently disabled some of the Connect X2 components, because neither
> ompi nor ucx (with/without ompi) could connect with the RDMA layer. Some
> of the InfiniBand functions are also not working on the X2 (mstflint,
> mstconfig).
>
> In fact ompi always tries to access the openib module. I have to
> explicitly disable it even to run on 1 node. So I think it is in
> initialization, not communication, that the problem lies. This is why (I
> think) ibv_obj returns NULL. The better news is that with the tcp stack
> everything works fine (ompi, ucx, 1 node, many nodes) - the bandwidth is
> similar to rdma, so for large messages it's semi OK. It's a partial
> solution - not all I wanted, of course. The direct rdma functions
> (ib_read_lat etc.) also work fine, with expected results. I am suspicious
> this disabling of the driver is a commercial more than a technical
> decision.
>
> I am going to try going back to Ubuntu 16.04 - there is a version of
> OFED that still supports the X2. But I think it may still get messed up
> by kernel upgrades (it does for 18.04, I found). So it's not an easy path.
>
> Thanks again.
>
> Tony
>
> On 8/24/20 11:35 AM, Jeff Squyres (jsquyres) wrote:
> >
> > I'm afraid I don't have many better answers for you.
> >
> > I can't quite tell from your machines, but are you running IMB-MPI1
> Sendrecv *on a single node* with `--mca btl openib,self`?
> >
> > I don't remember offhand, but I didn't think that openib was supposed to
> do loopback communication.  E.g., if both MPI processes are on the same
> node, `--mca btl openib,vader,self` should do the trick (where "vader" =
> shared memory support).
> >
> > More specifically: are you running into a problem running openib (and/or
> UCX) across multiple nodes?
> >
> > I can't speak to Nvidia support on various models of [older] hardware
> (including UCX support on that hardware).  But be aware that openib is
> definitely going away; it is wholly being replaced by UCX.  It may be that
> your only option is to stick with older software stacks in these hardware
> environments.
> >
> >
> >> On Aug 23, 2020, at 9:46 PM, Tony Ladd via users <
> users@lists.open-mpi.org> wrote:
> >>
> >> Hi John
> >>
> >> Thanks for the response. I have run all those diagnostics, and as best
> I can tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients +
> server) and the fabric passes all the tests. There is 1 warning:
> >>
> >> I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps
> SL:0x00
> >> -W- Suboptimal rate for group. Lowest member rate:40Gbps >
> group-rate:10Gbps
> >>
> >> but according to a number of sources this is harmless.
> >>
> >> I have run Mellanox's P2P performance tests (ib_write_bw) between
> different pairs of nodes and it reports 3.22 GB/sec, which is reasonable
> (it's a PCIe 2 x8 interface, i.e. 4 GB/s). I have also configured 2 nodes back
> to back to check that the switch is not the problem - it makes no difference.
> >>
> >> I have been playing with the btl params with openMPI (v. 2.1.1, which is
> what is released in Ubuntu 18.04). So with tcp as the transport layer
> everything works fine - 1 node or 2 node communication - I have tested up
> to 16 processes (8+8) and it seems fine. Of course the latency is much
> higher on the tcp interface, so I would still like to access the RDMA
> layer. But unless I exclude the openib module, it always hangs. Same with
> OpenMPI v4 compiled from source.
> >>
> >> I think an important component is that Mellanox has not supported
> Connect X2 for some time. This is really infuriating; a $500 network card
> with no supported drivers, but that is business for you, I suppose. I have
> 50 NICs and I can't afford to replace them all. The other component is that
> MLNX-OFED is tied to specific software versions, so I can't just run an
> older set of drivers. I have not seen source files for the Mellanox drivers
> - I would take a crack at compiling them if I did. In the past I have used
> the OFED drivers (on CentOS 5) with no problem, but I don't think this is
> an option now.
> >>
> >> Ubuntu claims to support Connect X2 with their drivers (Mellanox
> confirms this), but of course this is community support and the number of
> cases is obviously small. I use the Ubuntu drivers right now because the
> OFED install seems broken and there is no help with it. It's not supported!
> Neat, huh?
> >>
> >> The only handle I have is with openmpi v. 2 when there is a message
> (see my original post) that