Re: [OMPI users] Problem in starting openmpi job - no output just hangs - SOLVED

2020-09-01 Thread Tony Ladd via users

Jeff

I found the solution - RDMA needs to pin a significant amount of memory, so the 
locked-memory (memlock) limits on the shell have to be increased. I needed to 
add the lines


* soft memlock unlimited
* hard memlock unlimited

to the end of the file /etc/security/limits.conf. After that the openib 
driver loads and everything is fine - proper IB latency again.


I see that #16 of the tuning FAQ discusses the same issue, but in my 
case there was no error or warning message. I am posting this in case 
anyone else runs into this issue.


The Mellanox OFED install adds those lines automatically, so I had not 
run into this before.


Tony


On 8/25/20 10:42 AM, Jeff Squyres (jsquyres) wrote:

[External Email]

On Aug 24, 2020, at 9:44 PM, Tony Ladd  wrote:

I appreciate your help (and John's as well). At this point I don't think it is 
an OMPI problem - my mistake. I think the communication with RDMA is somehow 
disabled (perhaps it's the verbs layer - I am not very knowledgeable about this). 
It used to work like a dream, but Mellanox has apparently disabled some of the 
Connect X2 components, because neither ompi nor ucx (with/without ompi) could 
connect with the RDMA layer. Some of the InfiniBand functions are also not 
working on the X2 (mstflint, mstconfig).

If the IB stack itself is not functioning, then you're right: Open MPI won't 
work, either (with openib or UCX).

You can try to keep poking with the low-layer diagnostic tools like ibv_devinfo 
and ibv_rc_pingpong.  If those don't work, Open MPI won't work over IB, either.


In fact ompi always tries to access the openib module. I have to explicitly 
disable it even to run on 1 node.

Yes, that makes sense: Open MPI will aggressively try to use every possible 
mechanism.


So I think it is in initialization, not communication, that the problem lies.

I'm not sure that's correct.

From your initial emails, it looks like openib thinks it initialized properly.


This is why (I think) ibv_obj returns NULL.

I'm not sure if that's a problem or not.  That section of output is where Open 
MPI is measuring the distance from the current process to the PCI bus where the 
device lives.  I don't remember offhand if returning NULL in that area is 
actually a problem or just an indication of some kind of non-error condition.

Specifically: if returning NULL there was a problem, we *probably* would have 
aborted at that point.  I have not looked at the code to verify that, though.


The better news is that with the tcp stack everything works fine (ompi, ucx, 1 
node, many nodes) - the bandwidth is similar to rdma, so for large messages it's 
semi OK. It's a partial solution - not all I wanted of course. The direct rdma 
functions ib_read_lat etc. also work fine, with expected results. I suspect 
this disabling of the driver is a commercial more than a technical decision.
I am going to try going back to Ubuntu 16.04 - there is a version of OFED that 
still supports the X2. But I think it may still get messed up by kernel 
upgrades (it does for 18.04, I found). So it's not an easy path.


I can't speak for Nvidia here, sorry.

--
Jeff Squyres
jsquy...@cisco.com


--
Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514



Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-25 Thread Jeff Squyres (jsquyres) via users
On Aug 24, 2020, at 9:44 PM, Tony Ladd  wrote:
> 
> I appreciate your help (and John's as well). At this point I don't think it 
> is an OMPI problem - my mistake. I think the communication with RDMA is 
> somehow disabled (perhaps it's the verbs layer - I am not very knowledgeable 
> about this). It used to work like a dream, but Mellanox has apparently 
> disabled some of the Connect X2 components, because neither ompi nor ucx 
> (with/without ompi) could connect with the RDMA layer. Some of the InfiniBand 
> functions are also not working on the X2 (mstflint, mstconfig).

If the IB stack itself is not functioning, then you're right: Open MPI won't 
work, either (with openib or UCX).

You can try to keep poking with the low-layer diagnostic tools like ibv_devinfo 
and ibv_rc_pingpong.  If those don't work, Open MPI won't work over IB, either.

> In fact ompi always tries to access the openib module. I have to explicitly 
> disable it even to run on 1 node.

Yes, that makes sense: Open MPI will aggressively try to use every possible 
mechanism.

> So I think it is in initialization, not communication, that the problem lies.

I'm not sure that's correct.

From your initial emails, it looks like openib thinks it initialized properly.

> This is why (I think) ibv_obj returns NULL.

I'm not sure if that's a problem or not.  That section of output is where Open 
MPI is measuring the distance from the current process to the PCI bus where the 
device lives.  I don't remember offhand if returning NULL in that area is 
actually a problem or just an indication of some kind of non-error condition.

Specifically: if returning NULL there was a problem, we *probably* would have 
aborted at that point.  I have not looked at the code to verify that, though.

> The better news is that with the tcp stack everything works fine (ompi, ucx, 
> 1 node, many nodes) - the bandwidth is similar to rdma, so for large messages 
> it's semi OK. It's a partial solution - not all I wanted of course. The direct 
> rdma functions ib_read_lat etc. also work fine, with expected results. I 
> suspect this disabling of the driver is a commercial more than a technical 
> decision.
> I am going to try going back to Ubuntu 16.04 - there is a version of OFED 
> that still supports the X2. But I think it may still get messed up by kernel 
> upgrades (it does for 18.04, I found). So it's not an easy path.


I can't speak for Nvidia here, sorry.

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-25 Thread John Hearns via users
I apologise. That was an Omni-Path issue:
https://www.beowulf.org/pipermail/beowulf/2017-March/034214.html

On Tue, 25 Aug 2020 at 08:17, John Hearns  wrote:

> Aha. I dimly remember a problem with the ibverbs /dev device - maybe the
> permissions,
> or more likely the owner account for that device.
>
>
>
> On Tue, 25 Aug 2020 at 02:44, Tony Ladd  wrote:
>
>> Hi Jeff
>>
>> I appreciate your help (and John's as well). At this point I don't think
>> it is an OMPI problem - my mistake. I think the communication with RDMA
>> is somehow disabled (perhaps it's the verbs layer - I am not very
>> knowledgeable about this). It used to work like a dream, but Mellanox has
>> apparently disabled some of the Connect X2 components, because neither
>> ompi nor ucx (with/without ompi) could connect with the RDMA layer. Some
>> of the InfiniBand functions are also not working on the X2 (mstflint,
>> mstconfig).
>>
>> In fact ompi always tries to access the openib module. I have to
>> explicitly disable it even to run on 1 node. So I think it is in
>> initialization, not communication, that the problem lies. This is why (I
>> think) ibv_obj returns NULL. The better news is that with the tcp stack
>> everything works fine (ompi, ucx, 1 node, many nodes) - the bandwidth is
>> similar to rdma, so for large messages it's semi OK. It's a partial
>> solution - not all I wanted of course. The direct rdma functions
>> ib_read_lat etc. also work fine, with expected results. I suspect
>> this disabling of the driver is a commercial more than a technical
>> decision.
>>
>> I am going to try going back to Ubuntu 16.04 - there is a version of
>> OFED that still supports the X2. But I think it may still get messed up
>> by kernel upgrades (it does for 18.04, I found). So it's not an easy path.
>>
>> Thanks again.
>>
>> Tony
>>
>> On 8/24/20 11:35 AM, Jeff Squyres (jsquyres) wrote:
>> > [External Email]
>> >
>> > I'm afraid I don't have many better answers for you.
>> >
>> > I can't quite tell from your machines, but are you running IMB-MPI1
>> Sendrecv *on a single node* with `--mca btl openib,self`?
>> >
>> > I don't remember offhand, but I didn't think that openib was supposed
>> to do loopback communication.  E.g., if both MPI processes are on the same
>> node, `--mca btl openib,vader,self` should do the trick (where "vader" =
>> shared memory support).
>> >
>> > More specifically: are you running into a problem running openib
>> (and/or UCX) across multiple nodes?
>> >
>> > I can't speak to Nvidia support on various models of [older] hardware
>> (including UCX support on that hardware).  But be aware that openib is
>> definitely going away; it is wholly being replaced by UCX.  It may be that
>> your only option is to stick with older software stacks in these hardware
>> environments.
>> >
>> >
>> >> On Aug 23, 2020, at 9:46 PM, Tony Ladd via users <
>> users@lists.open-mpi.org> wrote:
>> >>
>> >> Hi John
>> >>
>> >> Thanks for the response. I have run all those diagnostics, and as best
>> I can tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients +
>> server) and the fabric passes all the tests. There is 1 warning:
>> >>
>> >> I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps
>> SL:0x00
>> >> -W- Suboptimal rate for group. Lowest member rate:40Gbps >
>> group-rate:10Gbps
>> >>
>> >> but according to a number of sources this is harmless.
>> >>
>> >> I have run Mellanox's P2P performance tests (ib_write_bw) between
>> different pairs of nodes and it reports 3.22 GB/sec, which is reasonable
>> (it's a PCIe 2 x8 interface, i.e. 4 GB/s). I have also configured 2 nodes back
>> to back to check that the switch is not the problem - it makes no difference.
>> >>
>> >> I have been playing with the btl params with openMPI (v. 2.1.1, which
>> is what is released in Ubuntu 18.04). So with tcp as the transport layer
>> everything works fine - 1 node or 2 node communication - I have tested up
>> to 16 processes (8+8) and it seems fine. Of course the latency is much
>> higher on the tcp interface, so I would still like to access the RDMA
>> layer. But unless I exclude the openib module, it always hangs. Same with
>> OpenMPI v4 compiled from source.
>> >>
>> >> I think an important component is that Mellanox has not supported the
>> Connect X2 for some time. This is really infuriating; a $500 network card
>> with no supported drivers, but that is business for you I suppose. I have
>> 50 NICs and I can't afford to replace them all. The other component is that
>> MLNX-OFED is tied to specific software versions, so I can't just run an
>> older set of drivers. I have not seen source files for the Mellanox drivers
>> - I would take a crack at compiling them if I did. In the past I have used
>> the OFED drivers (on CentOS 5) with no problem, but I don't think this is
>> an option now.
>> >>
>> >> Ubuntu claims to support Connect X2 with their drivers (Mellanox
>> confirms this), but of course this is community support and the number of
>> cases 

Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-25 Thread John Hearns via users
Aha. I dimly remember a problem with the ibverbs /dev device - maybe the
permissions,
or more likely the owner account for that device.



On Tue, 25 Aug 2020 at 02:44, Tony Ladd  wrote:

> Hi Jeff
>
> I appreciate your help (and John's as well). At this point I don't think
> it is an OMPI problem - my mistake. I think the communication with RDMA
> is somehow disabled (perhaps it's the verbs layer - I am not very
> knowledgeable about this). It used to work like a dream, but Mellanox has
> apparently disabled some of the Connect X2 components, because neither
> ompi nor ucx (with/without ompi) could connect with the RDMA layer. Some
> of the InfiniBand functions are also not working on the X2 (mstflint,
> mstconfig).
>
> In fact ompi always tries to access the openib module. I have to
> explicitly disable it even to run on 1 node. So I think it is in
> initialization, not communication, that the problem lies. This is why (I
> think) ibv_obj returns NULL. The better news is that with the tcp stack
> everything works fine (ompi, ucx, 1 node, many nodes) - the bandwidth is
> similar to rdma, so for large messages it's semi OK. It's a partial
> solution - not all I wanted of course. The direct rdma functions
> ib_read_lat etc. also work fine, with expected results. I suspect
> this disabling of the driver is a commercial more than a technical
> decision.
>
> I am going to try going back to Ubuntu 16.04 - there is a version of
> OFED that still supports the X2. But I think it may still get messed up
> by kernel upgrades (it does for 18.04, I found). So it's not an easy path.
>
> Thanks again.
>
> Tony
>
> On 8/24/20 11:35 AM, Jeff Squyres (jsquyres) wrote:
> > [External Email]
> >
> > I'm afraid I don't have many better answers for you.
> >
> > I can't quite tell from your machines, but are you running IMB-MPI1
> Sendrecv *on a single node* with `--mca btl openib,self`?
> >
> > I don't remember offhand, but I didn't think that openib was supposed to
> do loopback communication.  E.g., if both MPI processes are on the same
> node, `--mca btl openib,vader,self` should do the trick (where "vader" =
> shared memory support).
> >
> > More specifically: are you running into a problem running openib (and/or
> UCX) across multiple nodes?
> >
> > I can't speak to Nvidia support on various models of [older] hardware
> (including UCX support on that hardware).  But be aware that openib is
> definitely going away; it is wholly being replaced by UCX.  It may be that
> your only option is to stick with older software stacks in these hardware
> environments.
> >
> >
> >> On Aug 23, 2020, at 9:46 PM, Tony Ladd via users <
> users@lists.open-mpi.org> wrote:
> >>
> >> Hi John
> >>
> >> Thanks for the response. I have run all those diagnostics, and as best
> I can tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients +
> server) and the fabric passes all the tests. There is 1 warning:
> >>
> >> I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps
> SL:0x00
> >> -W- Suboptimal rate for group. Lowest member rate:40Gbps >
> group-rate:10Gbps
> >>
> >> but according to a number of sources this is harmless.
> >>
> >> I have run Mellanox's P2P performance tests (ib_write_bw) between
> different pairs of nodes and it reports 3.22 GB/sec, which is reasonable
> (it's a PCIe 2 x8 interface, i.e. 4 GB/s). I have also configured 2 nodes back to
> back to check that the switch is not the problem - it makes no difference.
> >>
> >> I have been playing with the btl params with openMPI (v. 2.1.1, which is
> what is released in Ubuntu 18.04). So with tcp as the transport layer
> everything works fine - 1 node or 2 node communication - I have tested up
> to 16 processes (8+8) and it seems fine. Of course the latency is much
> higher on the tcp interface, so I would still like to access the RDMA
> layer. But unless I exclude the openib module, it always hangs. Same with
> OpenMPI v4 compiled from source.
> >>
> >> I think an important component is that Mellanox has not supported the
> Connect X2 for some time. This is really infuriating; a $500 network card
> with no supported drivers, but that is business for you I suppose. I have
> 50 NICs and I can't afford to replace them all. The other component is that
> MLNX-OFED is tied to specific software versions, so I can't just run an
> older set of drivers. I have not seen source files for the Mellanox drivers
> - I would take a crack at compiling them if I did. In the past I have used
> the OFED drivers (on CentOS 5) with no problem, but I don't think this is
> an option now.
> >>
> >> Ubuntu claims to support Connect X2 with their drivers (Mellanox
> confirms this), but of course this is community support and the number of
> cases is obviously small. I use the Ubuntu drivers right now because the
> OFED install seems broken and there is no help with it. It's not supported!
> Neat huh?
> >>
> >> The only handle I have is with openmpi v. 2 when there is a message
> (see my original post) that 

Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-24 Thread Tony Ladd via users

Hi Jeff

I appreciate your help (and John's as well). At this point I don't think 
it is an OMPI problem - my mistake. I think the communication with RDMA is 
somehow disabled (perhaps it's the verbs layer - I am not very 
knowledgeable about this). It used to work like a dream, but Mellanox has 
apparently disabled some of the Connect X2 components, because neither 
ompi nor ucx (with/without ompi) could connect with the RDMA layer. Some 
of the InfiniBand functions are also not working on the X2 (mstflint, 
mstconfig).


In fact ompi always tries to access the openib module. I have to 
explicitly disable it even to run on 1 node. So I think it is in 
initialization, not communication, that the problem lies. This is why (I 
think) ibv_obj returns NULL. The better news is that with the tcp stack 
everything works fine (ompi, ucx, 1 node, many nodes) - the bandwidth is 
similar to rdma, so for large messages it's semi OK. It's a partial 
solution - not all I wanted of course. The direct rdma functions 
ib_read_lat etc. also work fine, with expected results. I suspect this 
disabling of the driver is a commercial more than a technical decision.


I am going to try going back to Ubuntu 16.04 - there is a version of 
OFED that still supports the X2. But I think it may still get messed up 
by kernel upgrades (it does for 18.04, I found). So it's not an easy path.


Thanks again.

Tony

On 8/24/20 11:35 AM, Jeff Squyres (jsquyres) wrote:

[External Email]

I'm afraid I don't have many better answers for you.

I can't quite tell from your machines, but are you running IMB-MPI1 Sendrecv 
*on a single node* with `--mca btl openib,self`?

I don't remember offhand, but I didn't think that openib was supposed to do loopback 
communication.  E.g., if both MPI processes are on the same node, `--mca btl 
openib,vader,self` should do the trick (where "vader" = shared memory support).

More specifically: are you running into a problem running openib (and/or UCX) 
across multiple nodes?

I can't speak to Nvidia support on various models of [older] hardware 
(including UCX support on that hardware).  But be aware that openib is 
definitely going away; it is wholly being replaced by UCX.  It may be that your 
only option is to stick with older software stacks in these hardware 
environments.



On Aug 23, 2020, at 9:46 PM, Tony Ladd via users  
wrote:

Hi John

Thanks for the response. I have run all those diagnostics, and as best I can 
tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients + server) 
and the fabric passes all the tests. There is 1 warning:

I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps

but according to a number of sources this is harmless.

I have run Mellanox's P2P performance tests (ib_write_bw) between different 
pairs of nodes and it reports 3.22 GB/sec, which is reasonable (it's a PCIe 2 x8 
interface, i.e. 4 GB/s). I have also configured 2 nodes back to back to check that 
the switch is not the problem - it makes no difference.

I have been playing with the btl params with openMPI (v. 2.1.1, which is what is 
released in Ubuntu 18.04). So with tcp as the transport layer everything works 
fine - 1 node or 2 node communication - I have tested up to 16 processes (8+8) 
and it seems fine. Of course the latency is much higher on the tcp interface, 
so I would still like to access the RDMA layer. But unless I exclude the openib 
module, it always hangs. Same with OpenMPI v4 compiled from source.

I think an important component is that Mellanox has not supported the Connect X2 
for some time. This is really infuriating; a $500 network card with no 
supported drivers, but that is business for you I suppose. I have 50 NICs and I 
can't afford to replace them all. The other component is that MLNX-OFED is tied 
to specific software versions, so I can't just run an older set of drivers. I 
have not seen source files for the Mellanox drivers - I would take a crack at 
compiling them if I did. In the past I have used the OFED drivers (on CentOS 5) 
with no problem, but I don't think this is an option now.

Ubuntu claims to support Connect X2 with their drivers (Mellanox confirms 
this), but of course this is community support and the number of cases is 
obviously small. I use the Ubuntu drivers right now because the OFED install 
seems broken and there is no help with it. It's not supported! Neat huh?

The only handle I have is with openmpi v. 2 when there is a message (see my 
original post) that ibv_obj returns a NULL result. But I don't understand the 
significance of the message (if any).

I am not enthused about UCX - the documentation has several obvious typos in 
it, which is not encouraging when you are floundering. I know it's a newish 
project, but I have used openib for 10+ years and it's never had a problem until 
now. I think this is not so much openib as the software below. One other thing 
I should say is 

Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-24 Thread Jeff Squyres (jsquyres) via users
I'm afraid I don't have many better answers for you.

I can't quite tell from your machines, but are you running IMB-MPI1 Sendrecv 
*on a single node* with `--mca btl openib,self`?

I don't remember offhand, but I didn't think that openib was supposed to do 
loopback communication.  E.g., if both MPI processes are on the same node, 
`--mca btl openib,vader,self` should do the trick (where "vader" = shared 
memory support).

More specifically: are you running into a problem running openib (and/or UCX) 
across multiple nodes?

I can't speak to Nvidia support on various models of [older] hardware 
(including UCX support on that hardware).  But be aware that openib is 
definitely going away; it is wholly being replaced by UCX.  It may be that your 
only option is to stick with older software stacks in these hardware 
environments.


> On Aug 23, 2020, at 9:46 PM, Tony Ladd via users  
> wrote:
> 
> Hi John
> 
> Thanks for the response. I have run all those diagnostics, and as best I can 
> tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients + server) 
> and the fabric passes all the tests. There is 1 warning:
> 
> I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
> -W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps
> 
> but according to a number of sources this is harmless.
> 
> I have run Mellanox's P2P performance tests (ib_write_bw) between different 
> pairs of nodes and it reports 3.22 GB/sec, which is reasonable (it's a PCIe 2 x8 
> interface, i.e. 4 GB/s). I have also configured 2 nodes back to back to check 
> that the switch is not the problem - it makes no difference.
> 
> I have been playing with the btl params with openMPI (v. 2.1.1, which is what 
> is released in Ubuntu 18.04). So with tcp as the transport layer everything 
> works fine - 1 node or 2 node communication - I have tested up to 16 
> processes (8+8) and it seems fine. Of course the latency is much higher on 
> the tcp interface, so I would still like to access the RDMA layer. But unless 
> I exclude the openib module, it always hangs. Same with OpenMPI v4 compiled 
> from source.
> 
> I think an important component is that Mellanox has not supported the Connect 
> X2 for some time. This is really infuriating; a $500 network card with no 
> supported drivers, but that is business for you I suppose. I have 50 NICs and 
> I can't afford to replace them all. The other component is that MLNX-OFED is 
> tied to specific software versions, so I can't just run an older set of 
> drivers. I have not seen source files for the Mellanox drivers - I would take 
> a crack at compiling them if I did. In the past I have used the OFED drivers 
> (on CentOS 5) with no problem, but I don't think this is an option now.
> 
> Ubuntu claims to support Connect X2 with their drivers (Mellanox confirms 
> this), but of course this is community support and the number of cases is 
> obviously small. I use the Ubuntu drivers right now because the OFED install 
> seems broken and there is no help with it. It's not supported! Neat huh?
> 
> The only handle I have is with openmpi v. 2 when there is a message (see my 
> original post) that ibv_obj returns a NULL result. But I don't understand the 
> significance of the message (if any).
> 
> I am not enthused about UCX - the documentation has several obvious typos in 
> it, which is not encouraging when you are floundering. I know it's a newish 
> project, but I have used openib for 10+ years and it's never had a problem 
> until now. I think this is not so much openib as the software below. One 
> other thing I should say is that if I run any recent version of mstflint it 
> always complains:
> 
> Failed to identify the device - Can not create SignatureManager!
> 
> Going back to my original OFED 1.5 this did not happen, but they are at v5 
> now.
> 
> Everything else works as far as I can see. But I could not burn new firmware 
> except by going back to the 1.5 OS. Perhaps this is connected with the 
> ibv_obj = NULL result.
> 
> Thanks for helping out. As you can see I am rather stuck.
> 
> Best
> 
> Tony
> 
> On 8/23/20 3:01 AM, John Hearns via users wrote:
>> *[External Email]*
>> 
>> Tony, start at a low level. Is the Infiniband fabric healthy?
>> Run
>> ibstatus   on every node
>> sminfo on one node
>> ibdiagnet on one node
>> 
>> On Sun, 23 Aug 2020 at 05:02, Tony Ladd via users wrote:
>> 
>>Hi Jeff
>> 
>>I installed ucx as you suggested. But I can't get even the
>>simplest code
>>(ucp_client_server) to work across the network. I can compile openMPI
>>with UCX but it has the same problem - mpi codes will not execute and
>>there are no messages. Really, UCX is not helping. It is adding
>>another
>>(not so well documented) software layer, which does not offer better
>>diagnostics as far as I can see. It's also unclear to me how to
>>control
>>what drivers are being loaded - UCX 

Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-23 Thread Tony Ladd via users

Hi John

Thanks for the response. I have run all those diagnostics, and as best I 
can tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients + 
server) and the fabric passes all the tests. There is 1 warning:


I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps

but according to a number of sources this is harmless.

I have run Mellanox's P2P performance tests (ib_write_bw) between 
different pairs of nodes and it reports 3.22 GB/sec, which is reasonable 
(it's a PCIe 2 x8 interface, i.e. 4 GB/s). I have also configured 2 nodes back 
to back to check that the switch is not the problem - it makes no 
difference.


I have been playing with the btl params with openMPI (v. 2.1.1, which is 
what is released in Ubuntu 18.04). So with tcp as the transport layer 
everything works fine - 1 node or 2 node communication - I have tested 
up to 16 processes (8+8) and it seems fine. Of course the latency is 
much higher on the tcp interface, so I would still like to access the 
RDMA layer. But unless I exclude the openib module, it always hangs. 
Same with OpenMPI v4 compiled from source.


I think an important component is that Mellanox has not supported the 
Connect X2 for some time. This is really infuriating; a $500 network 
card with no supported drivers, but that is business for you I suppose. 
I have 50 NICs and I can't afford to replace them all. The other 
component is that MLNX-OFED is tied to specific software versions, so I 
can't just run an older set of drivers. I have not seen source files for 
the Mellanox drivers - I would take a crack at compiling them if I did. 
In the past I have used the OFED drivers (on CentOS 5) with no problem, 
but I don't think this is an option now.


Ubuntu claims to support Connect X2 with their drivers (Mellanox 
confirms this), but of course this is community support and the number 
of cases is obviously small. I use the Ubuntu drivers right now because 
the OFED install seems broken and there is no help with it. It's not 
supported! Neat huh?


The only handle I have is with openmpi v. 2 when there is a message (see 
my original post) that ibv_obj returns a NULL result. But I don't 
understand the significance of the message (if any).


I am not enthused about UCX - the documentation has several obvious 
typos in it, which is not encouraging when you are floundering. I know it's 
a newish project, but I have used openib for 10+ years and it's never had 
a problem until now. I think this is not so much openib as the software 
below. One other thing I should say is that if I run any recent version 
of mstflint it always complains:


Failed to identify the device - Can not create SignatureManager!

Going back to my original OFED 1.5 this did not happen, but they are at 
v5 now.


Everything else works as far as I can see. But I could not burn new 
firmware except by going back to the 1.5 OS. Perhaps this is connected 
with the ibv_obj = NULL result.


Thanks for helping out. As you can see I am rather stuck.

Best

Tony

On 8/23/20 3:01 AM, John Hearns via users wrote:

*[External Email]*

Tony, start at a low level. Is the Infiniband fabric healthy?
Run
ibstatus   on every node
sminfo on one node
ibdiagnet on one node

On Sun, 23 Aug 2020 at 05:02, Tony Ladd via users wrote:


Hi Jeff

I installed ucx as you suggested. But I can't get even the
simplest code
(ucp_client_server) to work across the network. I can compile openMPI
with UCX but it has the same problem - mpi codes will not execute and
there are no messages. Really, UCX is not helping. It is adding
another
(not so well documented) software layer, which does not offer better
diagnostics as far as I can see. It's also unclear to me how to
control
what drivers are being loaded - UCX wants to make that decision
for you.
With openMPI I can see that (for instance) the tcp module works both
locally and over the network - it must be using the Mellanox NIC
for the
bandwidth it is reporting on IMB-MPI1 even with tcp protocols. But
if I
try to use openib (or allow ucx or openmpi to choose the transport
layer) it just hangs. Annoyingly I have this server where everything
works just fine - I can run locally over openib and it's fine. All the
other nodes cannot seem to load openib so even local jobs fail.

The only good (as best I can tell) diagnostic is from openMPI.
ibv_obj
(from v2.x) complains that openib returns a NULL object, whereas
on my
server it returns logical_index=1. Can we not try to diagnose the
problem with openib not loading (see my original post for
details). I am
pretty sure if we can that would fix the problem.

Thanks

Tony

PS I tried configuring two nodes back to back to see if it was a
switch
issue, but the result was the same.


On 

Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-23 Thread John Hearns via users
Tony, start at a low level. Is the Infiniband fabric healthy?
Run
ibstatus   on every node
sminfo on one node
ibdiagnet on one node

On Sun, 23 Aug 2020 at 05:02, Tony Ladd via users 
wrote:

> Hi Jeff
>
> I installed ucx as you suggested. But I can't get even the simplest code
> (ucp_client_server) to work across the network. I can compile openMPI
> with UCX but it has the same problem - mpi codes will not execute and
> there are no messages. Really, UCX is not helping. It is adding another
> (not so well documented) software layer, which does not offer better
> diagnostics as far as I can see. It's also unclear to me how to control
> what drivers are being loaded - UCX wants to make that decision for you.
> With openMPI I can see that (for instance) the tcp module works both
> locally and over the network - it must be using the Mellanox NIC for the
> bandwidth it is reporting on IMB-MPI1 even with tcp protocols. But if I
> try to use openib (or allow ucx or openmpi to choose the transport
> layer) it just hangs. Annoyingly I have this server where everything
> works just fine - I can run locally over openib and it's fine. All the
> other nodes cannot seem to load openib so even local jobs fail.
>
> The only good (as best I can tell) diagnostic is from openMPI. ibv_obj
> (from v2.x) complains that openib returns a NULL object, whereas on my
> server it returns logical_index=1. Can we not try to diagnose the
> problem with openib not loading (see my original post for details). I am
> pretty sure if we can that would fix the problem.
>
> Thanks
>
> Tony
>
> PS I tried configuring two nodes back to back to see if it was a switch
> issue, but the result was the same.
>
>
> On 8/19/20 1:27 PM, Jeff Squyres (jsquyres) wrote:
> > [External Email]
> >
> > Tony --
> >
> > Have you tried compiling Open MPI with UCX support?  This is Mellanox
> (NVIDIA's) preferred mechanism for InfiniBand support these days -- the
> openib BTL is legacy.
> >
> > You can run: mpirun --mca pml ucx ...
> >
> >
> >> On Aug 19, 2020, at 12:46 PM, Tony Ladd via users <
> users@lists.open-mpi.org> wrote:
> >>
> >> One other update. I compiled OpenMPI-4.0.4. The outcome was the same, but
> there is no mention of ibv_obj this time.
> >>
> >> Tony
> >>
> >> --
> >>
> >> Tony Ladd
> >>
> >> Chemical Engineering Department
> >> University of Florida
> >> Gainesville, Florida 32611-6005
> >> USA
> >>
> >> Email: tladd-"(AT)"-che.ufl.edu
> >> Web:   http://ladd.che.ufl.edu
> >>
> >> Tel:   (352)-392-6509
> >> FAX:   (352)-392-9514
> >>
> >> 
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
> --
> Tony Ladd
>
> Chemical Engineering Department
> University of Florida
> Gainesville, Florida 32611-6005
> USA
>
> Email: tladd-"(AT)"-che.ufl.edu
> Web:   http://ladd.che.ufl.edu
>
> Tel:   (352)-392-6509
> FAX:   (352)-392-9514
>
>


Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-22 Thread Tony Ladd via users

Hi Jeff

I installed ucx as you suggested. But I can't get even the simplest code 
(ucp_client_server) to work across the network. I can compile openMPI 
with UCX but it has the same problem - mpi codes will not execute and 
there are no messages. Really, UCX is not helping. It is adding another 
(not so well documented) software layer, which does not offer better 
diagnostics as far as I can see. It's also unclear to me how to control 
what drivers are being loaded - UCX wants to make that decision for you. 
With openMPI I can see that (for instance) the tcp module works both 
locally and over the network - it must be using the Mellanox NIC for the 
bandwidth it is reporting on IMB-MPI1 even with tcp protocols. But if I 
try to use openib (or allow ucx or openmpi to choose the transport 
layer) it just hangs. Annoyingly I have this server where everything 
works just fine - I can run locally over openib and it's fine. All the 
other nodes cannot seem to load openib so even local jobs fail.


The only good (as best I can tell) diagnostic is from openMPI. ibv_obj 
(from v2.x) complains that openib returns a NULL object, whereas on my 
server it returns logical_index=1. Can we not try to diagnose the 
problem with openib not loading (see my original post for details). I am 
pretty sure if we can that would fix the problem.


Thanks

Tony

PS I tried configuring two nodes back to back to see if it was a switch 
issue, but the result was the same.



On 8/19/20 1:27 PM, Jeff Squyres (jsquyres) wrote:

[External Email]

Tony --

Have you tried compiling Open MPI with UCX support?  This is Mellanox 
(NVIDIA's) preferred mechanism for InfiniBand support these days -- the openib 
BTL is legacy.

You can run: mpirun --mca pml ucx ...



On Aug 19, 2020, at 12:46 PM, Tony Ladd via users  
wrote:

One other update. I compiled OpenMPI-4.0.4. The outcome was the same, but there 
is no mention of ibv_obj this time.

Tony

--

Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514




--
Jeff Squyres
jsquy...@cisco.com


--
Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514



Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-19 Thread Jeff Squyres (jsquyres) via users
Tony --

Have you tried compiling Open MPI with UCX support?  This is Mellanox 
(NVIDIA's) preferred mechanism for InfiniBand support these days -- the openib 
BTL is legacy.

You can run: mpirun --mca pml ucx ...


> On Aug 19, 2020, at 12:46 PM, Tony Ladd via users  
> wrote:
> 
> One other update. I compiled OpenMPI-4.0.4. The outcome was the same, but there 
> is no mention of ibv_obj this time.
> 
> Tony
> 
> -- 
> 
> Tony Ladd
> 
> Chemical Engineering Department
> University of Florida
> Gainesville, Florida 32611-6005
> USA
> 
> Email: tladd-"(AT)"-che.ufl.edu
> Web:   http://ladd.che.ufl.edu
> 
> Tel:   (352)-392-6509
> FAX:   (352)-392-9514
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-19 Thread Tony Ladd via users
One other update. I compiled OpenMPI-4.0.4. The outcome was the same, but 
there is no mention of ibv_obj this time.


Tony

--

Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514

f34:tladd(~)> mpirun -d --report-bindings --mca btl_openib_allow_ib 1 --mca btl 
openib,self --mca btl_base_verbose 30 -np 2 
mpi-benchmarks-IMB-v2019.3/src_c/IMB-MPI1 SendRecv
[f34:24079] procdir: /tmp/ompi.f34.501/pid.24079/0/0
[f34:24079] jobdir: /tmp/ompi.f34.501/pid.24079/0
[f34:24079] top: /tmp/ompi.f34.501/pid.24079
[f34:24079] top: /tmp/ompi.f34.501
[f34:24079] tmp: /tmp
[f34:24079] sess_dir_cleanup: job session dir does not exist
[f34:24079] sess_dir_cleanup: top session dir not empty - leaving
[f34:24079] procdir: /tmp/ompi.f34.501/pid.24079/0/0
[f34:24079] jobdir: /tmp/ompi.f34.501/pid.24079/0
[f34:24079] top: /tmp/ompi.f34.501/pid.24079
[f34:24079] top: /tmp/ompi.f34.501
[f34:24079] tmp: /tmp
[f34:24079] [[62672,0],0] Releasing job data for [INVALID]
[f34:24079] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
[f34:24079] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
  MPIR_being_debugged = 0
  MPIR_debug_state = 1
  MPIR_partial_attach_ok = 1
  MPIR_i_am_starter = 0
  MPIR_forward_output = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
(i, host, exe, pid) = (0, f34, 
/home/tladd/mpi-benchmarks-IMB-v2019.3/src_c/IMB-MPI1, 24083)
(i, host, exe, pid) = (1, f34, 
/home/tladd/mpi-benchmarks-IMB-v2019.3/src_c/IMB-MPI1, 24084)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[f34:24084] procdir: /tmp/ompi.f34.501/pid.24079/1/1
[f34:24084] jobdir: /tmp/ompi.f34.501/pid.24079/1
[f34:24084] top: /tmp/ompi.f34.501/pid.24079
[f34:24084] top: /tmp/ompi.f34.501
[f34:24084] tmp: /tmp
[f34:24083] procdir: /tmp/ompi.f34.501/pid.24079/1/0
[f34:24083] jobdir: /tmp/ompi.f34.501/pid.24079/1
[f34:24083] top: /tmp/ompi.f34.501/pid.24079
[f34:24083] top: /tmp/ompi.f34.501
[f34:24083] tmp: /tmp
[f34:24084] mca: base: components_register: registering framework btl components
[f34:24084] mca: base: components_register: found loaded component self
[f34:24084] mca: base: components_register: component self register function 
successful
[f34:24084] mca: base: components_register: found loaded component openib
[f34:24084] mca: base: components_register: component openib register function 
successful
[f34:24084] mca: base: components_open: opening btl components
[f34:24084] mca: base: components_open: found loaded component self
[f34:24084] mca: base: components_open: component self open function successful
[f34:24084] mca: base: components_open: found loaded component openib
[f34:24084] mca: base: components_open: component openib open function 
successful
[f34:24084] select: initializing btl component self
[f34:24084] select: init of component self returned success
[f34:24084] select: initializing btl component openib
[f34:24083] mca: base: components_register: registering framework btl components
[f34:24083] mca: base: components_register: found loaded component self
[f34:24083] mca: base: components_register: component self register function 
successful
[f34:24083] mca: base: components_register: found loaded component openib
[f34:24083] mca: base: components_register: component openib register function 
successful
[f34:24083] mca: base: components_open: opening btl components
[f34:24083] mca: base: components_open: found loaded component self
[f34:24083] mca: base: components_open: component self open function successful
[f34:24083] mca: base: components_open: found loaded component openib
[f34:24083] mca: base: components_open: component openib open function 
successful
[f34:24083] select: initializing btl component self
[f34:24083] select: init of component self returned success
[f34:24083] select: initializing btl component openib
[f34:24084] Checking distance from this process to device=mlx4_0
[f34:24084] Process is not bound: distance to device is 0.00
[f34:24083] Checking distance from this process to device=mlx4_0
[f34:24083] Process is not bound: distance to device is 0.00
[f34:24083] [rank=0] openib: using port mlx4_0:1
[f34:24083] select: init of component openib returned success
[f34:24084] [rank=1] openib: using port mlx4_0:1
[f34:24084] select: init of component openib returned success
[f34:24083] mca: bml: Using self btl for send to [[62672,1],0] on node f34
[f34:24084] mca: bml: Using self btl for send to [[62672,1],1] on node f34
^C
[f34:24079] sess_dir_finalize: proc session dir does not exist
[f34:24079] sess_dir_finalize: job session dir does not exist
[f34:24079] sess_dir_finalize: jobfam session dir not empty - leaving
[f34:24079] sess_dir_finalize: jobfam session dir not empty - leaving
[f34:24079] sess_dir_finalize: top session dir not empty - leaving
[f34:24079] sess_dir_finalize: proc session dir does not 

Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-17 Thread Tony Ladd via users
My apologies - I did not read the FAQs carefully enough - with regard 
to #14:


1. openib

2. Ubuntu supplied drivers etc.

3. Ubuntu 18.04  4.15.0-112-generic

4. opensm-3.3.5_mlnx-0.1.g6b18e73

5. Attached

6. Attached

7. unlimited on foam and 16384 on f34

I changed the ulimit to unlimited on f34 but it did not help.

Tony

--

Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web:   http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514

foam:root(ib)> ibv_devinfo
hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.9.1000
node_guid:  0002:c903:000f:666e
sys_image_guid: 0002:c903:000f:6671
vendor_id:  0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id:   MT_0D90110009
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)
max_mtu:4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid:   26
port_lmc:   0x00
link_layer: InfiniBand

root@f34:/home/tladd# ibv_devinfo 
hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.9.1000
node_guid:  0002:c903:000a:af92
sys_image_guid: 0002:c903:000a:af95
vendor_id:  0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id:   MT_0D90110009
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)
max_mtu:4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid:   32
port_lmc:   0x00
link_layer: InfiniBand

root@f34:/home/tladd# ifconfig
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
inet 10.1.2.34  netmask 255.255.255.0  broadcast 10.1.2.255
inet6 fe80::862b:2bff:fe18:3729  prefixlen 64  scopeid 0x20<link>
ether 84:2b:2b:18:37:29  txqueuelen 1000  (Ethernet)
RX packets 1015244  bytes 146716710 (146.7 MB)
RX errors 0  dropped 234903  overruns 0  frame 0
TX packets 176298  bytes 17106041 (17.1 MB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
inet 10.2.2.34  netmask 255.255.255.0  broadcast 10.2.2.255
inet6 fe80::202:c903:a:af93  prefixlen 64  scopeid 0x20<link>
unspec 80-00-02-08-FE-80-00-00-00-00-00-00-00-00-00-00  txqueuelen 256  
(UNSPEC)
RX packets 289257  bytes 333876570 (333.8 MB)
RX errors 0  dropped 0  overruns 0  frame 0
TX packets 140385  bytes 324882131 (324.8 MB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
inet 127.0.0.1  netmask 255.0.0.0
inet6 ::1  prefixlen 128  scopeid 0x10<host>
loop  txqueuelen 1000  (Local Loopback)
RX packets 317853  bytes 21490738 (21.4 MB)
RX errors 0  dropped 0  overruns 0  frame 0
TX packets 317853  bytes 21490738 (21.4 MB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

foam:root(ib)> ifconfig
enp4s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
inet 10.1.2.251  netmask 255.255.255.0  broadcast 10.1.2.255
inet6 fe80::ae1f:6bff:feb1:7f02  prefixlen 64  scopeid 0x20<link>
ether ac:1f:6b:b1:7f:02  txqueuelen 1000  (Ethernet)
RX packets 1092343  bytes 98282221 (98.2 MB)
RX errors 0  dropped 176607  overruns 0  frame 0
TX packets 248746  bytes 206951391 (206.9 MB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
device memory 0xf040-f047  

enp5s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
inet 192.168.1.2  netmask 255.255.255.0  broadcast 192.168.1.255
inet6 fe80::ae1f:6bff:feb1:7f03  prefixlen 64  scopeid 0x20<link>
ether ac:1f:6b:b1:7f:03  txqueuelen 1000  (Ethernet)
RX packets 1039387  bytes 87199457 (87.1 MB)
RX errors 0  dropped 187625  overruns 0  frame 0
TX packets 5884980  bytes 8649612519 (8.6 GB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
device memory 0xf030-f037  

enp6s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
inet 10.227.121.95  netmask 255.255.255.0  broadcast 10.227.121.255
inet6 fe80::6a05:caff:febd:397c  prefixlen 64  scopeid 0x20<link>
ether 68:05:ca:bd:39:7c  txqueuelen 1000