Re: [OMPI users] Issue with shared memory arrays in Fortran

2020-08-24 Thread Gilles Gouaillardet via users
Patrick,

Thanks for the report and the reproducer.

I was able to confirm the issue with python and Fortran, but
 - I can only reproduce it with pml/ucx (i.e., --mca pml ob1 --mca btl
tcp,self works fine)
 - I can only reproduce it with bcast algorithm 8 and 9

As a workaround, you can keep using ucx but manually change the bcast algorithm:

mpirun --mca coll_tuned_use_dynamic_rules 1 --mca
coll_tuned_bcast_algorithm 1 ...

/* you can replace the bcast algorithm with any value between 1 and 7
inclusive */
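
For example, the same settings can be exported through the environment, since
Open MPI treats any variable with the OMPI_MCA_ prefix as an MCA parameter
(a sketch - ./my_app is a placeholder):

export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_bcast_algorithm=1
mpirun -np 4 ./my_app

ompi_info --param coll tuned --level 9 should list the coll_tuned parameters
and their accepted values if you want to double-check.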

Cheers,

Gilles

On Mon, Aug 24, 2020 at 10:58 PM Patrick McNally via users wrote:
>
> I apologize in advance for the size of the example source and probably the 
> length of the email, but this has been a pain to track down.
>
> Our application uses System V style shared memory pretty extensively, and we 
> have recently found that in certain circumstances, OpenMPI appears to provide 
> ranks with stale data.  The attached archive contains sample code that 
> demonstrates the issue.  There is a subroutine that uses a shared memory 
> array to broadcast from a single rank on one compute node to a single rank on 
> all other compute nodes.  The first call sends all 1s, then all 2s, and so 
> on.  The receiving rank(s) get all 1s on the first execution, but on 
> subsequent executions they receive some 2s and some 1s; then some 3s, some 
> 2s, and some 1s.  The code contains a version of this routine in both C and 
> Fortran but only the Fortran version appears to exhibit the problem.
>
> I've tried this with OpenMPI 3.1.5, 4.0.2, and 4.0.4 and on two different 
> systems with very different configurations and both show the problem.  On one 
> of the machines, it only appears to happen when MPI is initialized with 
> mpi4py, so I've included that in the test as well.  Other than that, the 
> behavior is very consistent across machines.  When run with the same number 
> of ranks and same size array, the two machines even show the invalid values 
> at the same indices.
>
> Please let me know if you need any additional information.
>
> Thanks,
> Patrick


Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-24 Thread Tony Ladd via users

Hi Jeff

I appreciate your help (and John's as well). At this point I don't think 
this is an OMPI problem - my mistake. I think the communication with RDMA is 
somehow disabled (perhaps it's the verbs layer - I am not very 
knowledgeable about this). It used to work like a dream, but Mellanox has 
apparently disabled some of the Connect X2 components, because neither 
ompi nor ucx (with/without ompi) could connect with the RDMA layer. Some 
of the infiniband functions are also not working on the X2 (mstflint, 
mstconfig).


In fact ompi always tries to access the openib module. I have to 
explicitly disable it even to run on 1 node. So I think the problem lies 
in initialization, not communication. This is why (I 
think) ibv_obj returns NULL. The better news is that with the tcp stack 
everything works fine (ompi, ucx, 1 node, many nodes) - the bandwidth is 
similar to RDMA, so for large messages it's semi-OK. It's a partial 
solution - not all I wanted of course. The direct RDMA tests 
(ib_read_lat etc.) also work fine, with expected results. I suspect this 
disabling of the driver is a commercial decision more than a technical one.
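
For reference, the kind of commands involved (a sketch - the hostfile and
app names are placeholders):

# run over tcp by excluding the openib btl
mpirun --mca btl ^openib -np 16 -hostfile hosts ./app

# raw RDMA latency with perftest: start ib_read_lat on one node,
# then run ib_read_lat <server> on the other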


I am going to try going back to Ubuntu 16.04 - there is a version of 
OFED that still supports the X2. But I think it may still get messed up 
by kernel upgrades (it does for 18.04, I found). So it's not an easy path.


Thanks again.

Tony

On 8/24/20 11:35 AM, Jeff Squyres (jsquyres) wrote:

[External Email]

I'm afraid I don't have many better answers for you.

I can't quite tell from your description of the machines, but are you running 
IMB-MPI1 Sendrecv *on a single node* with `--mca btl openib,self`?

I don't remember offhand, but I didn't think that openib was supposed to do loopback 
communication.  E.g., if both MPI processes are on the same node, `--mca btl 
openib,vader,self` should do the trick (where "vader" = shared memory support).

More specifically: are you running into a problem running openib (and/or UCX) 
across multiple nodes?

I can't speak to Nvidia support on various models of [older] hardware 
(including UCX support on that hardware).  But be aware that openib is 
definitely going away; it is wholly being replaced by UCX.  It may be that your 
only option is to stick with older software stacks in these hardware 
environments.



On Aug 23, 2020, at 9:46 PM, Tony Ladd via users wrote:

Hi John

Thanks for the response. I have run all those diagnostics, and as best I can 
tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients + server) 
and the fabric passes all the tests. There is 1 warning:

I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps

but according to a number of sources this is harmless.

I have run Mellanox's P2P performance tests (ib_write_bw) between different 
pairs of nodes and it reports 3.22 GB/sec, which is reasonable (it's a PCIe 2 x8 
interface, i.e., 4 GB/s). I have also configured 2 nodes back to back to check that 
the switch is not the problem - it makes no difference.

I have been playing with the btl params with openMPI (v. 2.1.1, which is what is 
released in Ubuntu 18.04). So with tcp as the transport layer everything works 
fine - 1 node or 2 node communication - I have tested up to 16 processes (8+8) 
and it seems fine. Of course the latency is much higher on the tcp interface, 
so I would still like to access the RDMA layer. But unless I exclude the openib 
module, it always hangs. Same with OpenMPI v4 compiled from source.

I think an important component is that Mellanox has not supported Connect X2 
for some time. This is really infuriating; a $500 network card with no 
supported drivers, but that is business for you I suppose. I have 50 NICs and I 
can't afford to replace them all. The other issue is that MLNX-OFED is tied 
to specific software versions, so I can't just run an older set of drivers. I 
have not seen source files for the Mellanox drivers - I would take a crack at 
compiling them if I did. In the past I have used the OFED drivers (on Centos 5) 
with no problem, but I don't think this is an option now.

Ubuntu claims to support Connect X2 with their drivers (Mellanox confirms 
this), but of course this is community support and the number of cases is 
obviously small. I use the Ubuntu drivers right now because the OFED install 
seems broken and there is no help with it. It's not supported! Neat, huh?

The only handle I have is with openmpi v. 2 when there is a message (see my 
original post) that ibv_obj returns a NULL result. But I don't understand the 
significance of the message (if any).

I am not enthused about UCX - the documentation has several obvious typos in 
it, which is not encouraging when you are floundering. I know it's a newish 
project, but I have used openib for 10+ years and it's never had a problem until 
now. I think this is not so much openib as the software below. One other thing 
I should say is 

Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-24 Thread Jeff Squyres (jsquyres) via users
I'm afraid I don't have many better answers for you.

I can't quite tell from your description of the machines, but are you running 
IMB-MPI1 Sendrecv *on a single node* with `--mca btl openib,self`?

I don't remember offhand, but I didn't think that openib was supposed to do 
loopback communication.  E.g., if both MPI processes are on the same node, 
`--mca btl openib,vader,self` should do the trick (where "vader" = shared 
memory support).
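
For example, something like this for a 2-process single-node run (a sketch):

mpirun -np 2 --mca btl openib,vader,self ./IMB-MPI1 Sendrecv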

More specifically: are you running into a problem running openib (and/or UCX) 
across multiple nodes?

I can't speak to Nvidia support on various models of [older] hardware 
(including UCX support on that hardware).  But be aware that openib is 
definitely going away; it is wholly being replaced by UCX.  It may be that your 
only option is to stick with older software stacks in these hardware 
environments.
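
If you do end up on UCX, it can be selected explicitly, e.g. (a sketch):

mpirun -np 2 --mca pml ucx ./IMB-MPI1 Sendrecv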


> On Aug 23, 2020, at 9:46 PM, Tony Ladd via users wrote:
> 
> Hi John
> 
> Thanks for the response. I have run all those diagnostics, and as best I can 
> tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients + server) 
> and the fabric passes all the tests. There is 1 warning:
> 
> I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
> -W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps
> 
> but according to a number of sources this is harmless.
> 
> I have run Mellanox's P2P performance tests (ib_write_bw) between different 
> pairs of nodes and it reports 3.22 GB/sec, which is reasonable (it's a PCIe 2 x8 
> interface, i.e., 4 GB/s). I have also configured 2 nodes back to back to check 
> that the switch is not the problem - it makes no difference.
> 
> I have been playing with the btl params with openMPI (v. 2.1.1, which is what 
> is released in Ubuntu 18.04). So with tcp as the transport layer everything 
> works fine - 1 node or 2 node communication - I have tested up to 16 
> processes (8+8) and it seems fine. Of course the latency is much higher on 
> the tcp interface, so I would still like to access the RDMA layer. But unless 
> I exclude the openib module, it always hangs. Same with OpenMPI v4 compiled 
> from source.
> 
> I think an important component is that Mellanox has not supported Connect X2 
> for some time. This is really infuriating; a $500 network card with no 
> supported drivers, but that is business for you I suppose. I have 50 NICs and 
> I can't afford to replace them all. The other issue is that MLNX-OFED is 
> tied to specific software versions, so I can't just run an older set of 
> drivers. I have not seen source files for the Mellanox drivers - I would take 
> a crack at compiling them if I did. In the past I have used the OFED drivers 
> (on Centos 5) with no problem, but I don't think this is an option now.
> 
> Ubuntu claims to support Connect X2 with their drivers (Mellanox confirms 
> this), but of course this is community support and the number of cases is 
> obviously small. I use the Ubuntu drivers right now because the OFED install 
> seems broken and there is no help with it. It's not supported! Neat, huh?
> 
> The only handle I have is with openmpi v. 2 when there is a message (see my 
> original post) that ibv_obj returns a NULL result. But I don't understand the 
> significance of the message (if any).
> 
> I am not enthused about UCX - the documentation has several obvious typos in 
> it, which is not encouraging when you are floundering. I know it's a newish 
> project, but I have used openib for 10+ years and it's never had a problem 
> until now. I think this is not so much openib as the software below. One 
> other thing I should say is that if I run any recent version of mstflint, it 
> always complains:
> 
> Failed to identify the device - Can not create SignatureManager!
> 
> Going back to my original OFED 1.5 this did not happen, but they are at v5 
> now.
> 
> Everything else works as far as I can see. But I could not burn new firmware 
> except by going back to the 1.5 OS. Perhaps this is connected with the 
> ibv_obj = NULL result.
> 
> Thanks for helping out. As you can see I am rather stuck.
> 
> Best
> 
> Tony
> 
> On 8/23/20 3:01 AM, John Hearns via users wrote:
>> *[External Email]*
>> 
>> Tony, start at a low level. Is the Infiniband fabric healthy?
>> Run
>> ibstatus   on every node
>> sminfo on one node
>> ibdiagnet on one node
>> 
>> On Sun, 23 Aug 2020 at 05:02, Tony Ladd via users wrote:
>> 
>>Hi Jeff
>> 
>>I installed ucx as you suggested. But I can't get even the simplest code 
>>(ucp_client_server) to work across the network. I can compile openMPI 
>>with UCX but it has the same problem - mpi codes will not execute and 
>>there are no messages. Really, UCX is not helping. It is adding another 
>>(not so well documented) software layer, which does not offer better 
>>diagnostics as far as I can see. It's also unclear to me how to control 
>>what drivers are being loaded - UCX 

[OMPI users] Issue with shared memory arrays in Fortran

2020-08-24 Thread Patrick McNally via users
I apologize in advance for the size of the example source and probably the
length of the email, but this has been a pain to track down.

Our application uses System V style shared memory pretty extensively, and we
have recently found that in certain circumstances, OpenMPI appears to
provide ranks with stale data.  The attached archive contains sample code
that demonstrates the issue.  There is a subroutine that uses a shared
memory array to broadcast from a single rank on one compute node to a
single rank on all other compute nodes.  The first call sends all 1s, then
all 2s, and so on.  The receiving rank(s) get all 1s on the first
execution, but on subsequent executions they receive some 2s and some 1s;
then some 3s, some 2s, and some 1s.  The code contains a version of this
routine in both C and Fortran but only the Fortran version appears to
exhibit the problem.
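
For anyone who wants the shape of the routine without opening the archive,
here is a minimal C sketch of the pattern (not the attached reproducer;
names and sizes are placeholders, error handling is omitted, and only the
node-leader ranks attach the segment):

/* Hypothetical sketch: one rank per node owns a System V shared-memory
 * segment, and the node leaders broadcast into it. */
#include <mpi.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank, node_rank, pass, i;
    MPI_Comm node_comm, leader_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One sub-communicator per node; its rank 0 is the node leader. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* A communicator containing only the node leaders. */
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leader_comm);

    if (node_rank == 0) {
        /* The leader creates and attaches the SysV segment; in the real
         * code the other node-local ranks would attach via a shared key. */
        int shmid = shmget(IPC_PRIVATE, N * sizeof(int), IPC_CREAT | 0600);
        int *buf = (int *) shmat(shmid, NULL, 0);

        for (pass = 1; pass <= 3; pass++) {
            if (rank == 0)
                for (i = 0; i < N; i++)
                    buf[i] = pass;          /* all 1s, then all 2s, ... */

            /* Broadcast directly into the shared-memory segment. */
            MPI_Bcast(buf, N, MPI_INT, 0, leader_comm);

            for (i = 0; i < N; i++)         /* look for stale values */
                if (buf[i] != pass)
                    printf("rank %d: buf[%d] = %d on pass %d\n",
                           rank, i, buf[i], pass);
        }
        shmdt(buf);
        shmctl(shmid, IPC_RMID, NULL);
        MPI_Comm_free(&leader_comm);
    }
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

On an affected setup, the non-root leaders would print stale values from the
previous pass on every execution after the first.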

I've tried this with OpenMPI 3.1.5, 4.0.2, and 4.0.4 and on two different
systems with very different configurations and both show the problem.  On
one of the machines, it only appears to happen when MPI is initialized with
mpi4py, so I've included that in the test as well.  Other than that, the
behavior is very consistent across machines.  When run with the same number
of ranks and same size array, the two machines even show the invalid values
at the same indices.

Please let me know if you need any additional information.

Thanks,
Patrick


Attachment: shmemTest.tgz (application/compressed-tar)