Re: [OMPI users] Many different errors with ompi version 2.1.1

2017-05-19 Thread Allan Overstreet
Below are the results from the ibnetdiscover command. This command was
run from node smd.


#
# Topology file: generated on Fri May 19 15:59:47 2017
#
# Initiated from node 0002c903000a0a32 port 0002c903000a0a34

vendid=0x8f1
devid=0x5a5a
sysimgguid=0x8f105001094d3
switchguid=0x8f105001094d2(8f105001094d2)
Switch36 "S-0008f105001094d2"# "Voltaire 4036 # SWITCH-IB-1" 
enhanced port 0 lid 1 lmc 0
[1]"H-0002c903000a09c2"[1](2c903000a09c3) # "dl580 mlx4_0" 
lid 2 4xQDR
[2]"H-0011757986e4"[1](11757986e4) # "sm4 qib0" lid 
6 4xQDR
[3]"H-0011757990f6"[1](11757990f6) # "sm3 qib0" lid 
5 4xQDR
[4]"H-001175797a12"[1](1175797a12) # "sm2 qib0" lid 
4 4xDDR
[5]"H-001175797a68"[1](1175797a68) # "sm1 qib0" lid 
3 4xDDR
[36]"H-0002c903000a0a32"[2](2c903000a0a34) # "MT25408 
ConnectX Mellanox Technologies" lid 7 4xQDR


vendid=0x1175
devid=0x7322
sysimgguid=0x1175797a68
caguid=0x1175797a68
Ca1 "H-001175797a68"# "sm1 qib0"
[1](1175797a68) "S-0008f105001094d2"[5]# lid 3 lmc 0 
"Voltaire 4036 # SWITCH-IB-1" lid 1 4xDDR


vendid=0x1175
devid=0x7322
sysimgguid=0x1175797a12
caguid=0x1175797a12
Ca1 "H-001175797a12"# "sm2 qib0"
[1](1175797a12) "S-0008f105001094d2"[4]# lid 4 lmc 0 
"Voltaire 4036 # SWITCH-IB-1" lid 1 4xDDR


vendid=0x1175
devid=0x7322
sysimgguid=0x11757990f6
caguid=0x11757990f6
Ca1 "H-0011757990f6"# "sm3 qib0"
[1](11757990f6) "S-0008f105001094d2"[3]# lid 5 lmc 0 
"Voltaire 4036 # SWITCH-IB-1" lid 1 4xQDR


vendid=0x1175
devid=0x7322
sysimgguid=0x11757986e4
caguid=0x11757986e4
Ca1 "H-0011757986e4"# "sm4 qib0"
[1](11757986e4) "S-0008f105001094d2"[2]# lid 6 lmc 0 
"Voltaire 4036 # SWITCH-IB-1" lid 1 4xQDR


vendid=0x2c9
devid=0x673c
sysimgguid=0x2c903000a09c5
caguid=0x2c903000a09c2
Ca2 "H-0002c903000a09c2"# "dl580 mlx4_0"
[1](2c903000a09c3) "S-0008f105001094d2"[1]# lid 2 lmc 0 
"Voltaire 4036 # SWITCH-IB-1" lid 1 4xQDR


vendid=0x2c9
devid=0x673c
sysimgguid=0x2c903000a0a35
caguid=0x2c903000a0a32
Ca2 "H-0002c903000a0a32"# "MT25408 ConnectX Mellanox 
Technologies"
[2](2c903000a0a34) "S-0008f105001094d2"[36]# lid 7 lmc 0 
"Voltaire 4036 # SWITCH-IB-1" lid 1 4xQDR



On 05/19/2017 03:26 AM, John Hearns via users wrote:

Allan,
remember that Infiniband is not Ethernet.  You don't NEED to set up
IPoIB interfaces.


Two diagnostics please for you to run:

ibnetdiscover

ibdiagnet


Let us please have the results of ibnetdiscover.




On 19 May 2017 at 09:25, John Hearns wrote:


Gilles, Allan,

if the host 'smd' is acting as a cluster head node, it does not
itself need an Infiniband card.
So you should be able to run jobs across the other nodes, which
have Qlogic cards.
I may have something mixed up here, if so I am sorry.

If you also want to run jobs on the smd host, you should take note
of what Gilles says.
You may be out of luck in that case.

On 19 May 2017 at 09:15, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Allan,


i just noted smd has a Mellanox card, while other nodes have
QLogic cards.

mtl/psm works best for QLogic while btl/openib (or mtl/mxm)
work best for Mellanox,

but these are not interoperable. also, i do not think
btl/openib can be used with QLogic cards

(please someone correct me if i am wrong)


from the logs, i can see that smd (Mellanox) is not even able
to use the infiniband port.

if you run with 2 MPI tasks, both run on smd and hence
btl/vader is used; that is why it works

if you run with more than 2 MPI tasks, then smd and other
nodes are used, and every MPI task falls back to btl/tcp

for inter-node communication.


[smd][[41971,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.1.196 failed: No route to host (113)

this usually indicates a firewall, but since both ssh and
oob/tcp are fine, this puzzles me.


what if you

mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include
192.168.1.0/24  --mca
btl_tcp_if_include 192.168.1.0/24 
--mca pml ob1 --mca btl tcp,sm,vader,self  ring

that should work with no error messages, and then you can try
with 12 MPI tasks

(note internode MPI communications will use tcp only)


if you want optimal performance, i am afraid you cannot run
any MPI task on smd (so mtl/psm can be used )

(btw, make sure PSM support was built in Open MPI)

a suboptimal option is to force MPI communications on IPoIB with

/* make sure all no

Re: [OMPI users] Many different errors with ompi version 2.1.1

2017-05-19 Thread Elken, Tom
" i do not think btl/openib can be used with QLogic cards
(please someone correct me if i am wrong)"

You are wrong :).  The openib BTL is the best one to use for interoperability
between QLogic and Mellanox IB cards.
The Intel True Scale (the continuation of the QLogic IB product line) Host SW
User Guide
http://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/OFED_Host_Software_UserGuide_G91902_06.pdf
says (I paraphrase):

To run over IB verbs ... for example:

$ mpirun -np 4 -hostfile mpihosts --mca btl self,sm,openib --mca mtl ^psm 
./mpi_app_name


But, as some have suggested, you may make your life simpler and get ~ the same 
or better performance (depending on the workload) if you use the Mlx node as a 
head node and run the job on the 5 QLogic HCA nodes using mtl psm.
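
As a rough sketch (the hostfile name and process count below are placeholders,
not taken from Allan's setup), that could look like:

$ mpirun -np 10 -hostfile qlogic_hosts --mca pml cm --mca mtl psm ./ring

where qlogic_hosts lists only sm1-sm4 and dl580, so every rank sits on a True
Scale/QLogic HCA and PSM is used for all communication.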

-Tom

-Original Message-
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles 
Gouaillardet
Sent: Friday, May 19, 2017 12:16 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Many different errors with ompi version 2.1.1

Allan,


i just noted smd has a Mellanox card, while other nodes have QLogic cards.

mtl/psm works best for QLogic while btl/openib (or mtl/mxm) work best for 
Mellanox,

but these are not interoperable. also, i do not think btl/openib can be used 
with QLogic cards

(please someone correct me if i am wrong)


from the logs, i can see that smd (Mellanox) is not even able to use the 
infiniband port.

if you run with 2 MPI tasks, both run on smd and hence btl/vader is used;
that is why it works

if you run with more than 2 MPI tasks, then smd and other nodes are used, and
every MPI task falls back to btl/tcp

for inter-node communication.

[smd][[41971,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.1.196 failed: No route to host (113)

this usually indicates a firewall, but since both ssh and oob/tcp are fine, 
this puzzles me.


what if you

mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
--mca btl_tcp_if_include 192.168.1.0/24 --mca pml ob1 --mca btl 
tcp,sm,vader,self  ring

that should work with no error messages, and then you can try with 12 
MPI tasks

(note internode MPI communications will use tcp only)


if you want optimal performance, i am afraid you cannot run any MPI task 
on smd (so mtl/psm can be used )

(btw, make sure PSM support was built in Open MPI)

a suboptimal option is to force MPI communications on IPoIB with

/* make sure all nodes can ping each other via IPoIB first */

mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include 
10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self



Cheers,


Gilles


On 5/19/2017 3:50 PM, Allan Overstreet wrote:
> Gilles,
>
> On which node is mpirun invoked ?
>
> The mpirun command was invoked on node smd.
>
> Are you running from a batch manager?
>
> No.
>
> Is there any firewall running on your nodes ?
>
> No. CentOS Minimal does not have a firewall installed, and Ubuntu
> Mate's firewall is disabled.
>
> All three of your commands have appeared to run successfully. The 
> outputs of the three commands are attached.
>
> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
> --mca oob_base_verbose 100 true &> cmd1
>
> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
> --mca oob_base_verbose 100 true &> cmd2
>
> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
> --mca oob_base_verbose 100 ring &> cmd3
>
> If I increase the number of processors in the ring program, mpirun 
> will not succeed.
>
> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
> --mca oob_base_verbose 100 ring &> cmd4
>
>
> On 05/19/2017 02:18 AM, Gilles Gouaillardet wrote:
>> Allan,
>>
>>
>> - on which node is mpirun invoked ?
>>
>> - are you running from a batch manager ?
>>
>> - is there any firewall running on your nodes ?
>>
>>
>> the error is likely occurring when wiring up mpirun/orted
>>
>> what if you
>>
>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
>> --mca oob_base_verbose 100 true
>>
>> then (if the previous command worked)
>>
>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 
>> 192.168.1.0/24 --mca oob_base_verbose 100 true
>>
>> and finally (if both previous commands worked)
>>
>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
>> --mca oob_base_verbose 100 ring
>>
>>
>> C

Re: [OMPI users] Many different errors with ompi version 2.1.1

2017-05-19 Thread John Hearns via users
Allan,
remember that Infiniband is not Ethernet.  You don't NEED to set up IPoIB
interfaces.

Two diagnostics please for you to run:

ibnetdiscover

ibdiagnet


Let us please have the results of ibnetdiscover.
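
For example (assuming the standard InfiniBand diagnostic utilities are in your
PATH on smd), you could capture both outputs and attach the files:

ibnetdiscover > ibnetdiscover.out 2>&1
ibdiagnet > ibdiagnet.out 2>&1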




On 19 May 2017 at 09:25, John Hearns  wrote:

> Gilles, Allan,
>
> if the host 'smd' is acting as a cluster head node, it does not itself need
> an Infiniband card.
> So you should be able to run jobs across the other nodes, which have
> Qlogic cards.
> I may have something mixed up here, if so I am sorry.
>
> If you also want to run jobs on the smd host, you should take note of what
> Gilles says.
> You may be out of luck in that case.
>
> On 19 May 2017 at 09:15, Gilles Gouaillardet  wrote:
>
>> Allan,
>>
>>
>> i just noted smd has a Mellanox card, while other nodes have QLogic cards.
>>
>> mtl/psm works best for QLogic while btl/openib (or mtl/mxm) work best for
>> Mellanox,
>>
>> but these are not interoperable. also, i do not think btl/openib can be
>> used with QLogic cards
>>
>> (please someone correct me if i am wrong)
>>
>>
>> from the logs, i can see that smd (Mellanox) is not even able to use the
>> infiniband port.
>>
>> if you run with 2 MPI tasks, both run on smd and hence btl/vader is used;
>> that is why it works
>>
>> if you run with more than 2 MPI tasks, then smd and other nodes are used,
>> and every MPI task falls back to btl/tcp
>>
>> for inter-node communication.
>>
>> [smd][[41971,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.1.196 failed: No route to host (113)
>>
>> this usually indicates a firewall, but since both ssh and oob/tcp are
>> fine, this puzzles me.
>>
>>
>> what if you
>>
>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>> --mca btl_tcp_if_include 192.168.1.0/24 --mca pml ob1 --mca btl
>> tcp,sm,vader,self  ring
>>
>> that should work with no error messages, and then you can try with 12 MPI
>> tasks
>>
>> (note internode MPI communications will use tcp only)
>>
>>
>> if you want optimal performance, i am afraid you cannot run any MPI task
>> on smd (so mtl/psm can be used )
>>
>> (btw, make sure PSM support was built in Open MPI)
>>
>> a suboptimal option is to force MPI communications on IPoIB with
>>
>> /* make sure all nodes can ping each other via IPoIB first */
>>
>> mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include
>> 10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self
>>
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>> On 5/19/2017 3:50 PM, Allan Overstreet wrote:
>>
>>> Gilles,
>>>
>>> On which node is mpirun invoked ?
>>>
>>> The mpirun command was invoked on node smd.
>>>
>>> Are you running from a batch manager?
>>>
>>> No.
>>>
>>> Is there any firewall running on your nodes ?
>>>
>>> No. CentOS Minimal does not have a firewall installed, and Ubuntu
>>> Mate's firewall is disabled.
>>>
>>> All three of your commands have appeared to run successfully. The
>>> outputs of the three commands are attached.
>>>
>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 true &> cmd1
>>>
>>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 true &> cmd2
>>>
>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 ring &> cmd3
>>>
>>> If I increase the number of processors in the ring program, mpirun will
>>> not succeed.
>>>
>>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 ring &> cmd4
>>>
>>>
>>> On 05/19/2017 02:18 AM, Gilles Gouaillardet wrote:
>>>
 Allan,


 - on which node is mpirun invoked ?

 - are you running from a batch manager ?

 - is there any firewall running on your nodes ?


 the error is likely occurring when wiring up mpirun/orted

 what if you

 mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
 --mca oob_base_verbose 100 true

 then (if the previous command worked)

 mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
 --mca oob_base_verbose 100 true

 and finally (if both previous commands worked)

 mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
 --mca oob_base_verbose 100 ring


 Cheers,

 Gilles

 On 5/19/2017 3:07 PM, Allan Overstreet wrote:

> I am experiencing many different errors with openmpi version 2.1.1. I
> have had a suspicion that this might be related to the way the servers
> were connected and configured. Regardless, below is a diagram of how the
> servers are configured.
>
> __  _
>[__]|=|
>/::/|_|
>HOST: smd
>Dua

Re: [OMPI users] Many different errors with ompi version 2.1.1

2017-05-19 Thread John Hearns via users
Gilles, Allan,

if the host 'smd' is acting as a cluster head node, it does not itself need
an Infiniband card.
So you should be able to run jobs across the other nodes, which have Qlogic
cards.
I may have something mixed up here, if so I am sorry.

If you also want to run jobs on the smd host, you should take note of what
Gilles says.
You may be out of luck in that case.

On 19 May 2017 at 09:15, Gilles Gouaillardet  wrote:

> Allan,
>
>
> i just noted smd has a Mellanox card, while other nodes have QLogic cards.
>
> mtl/psm works best for QLogic while btl/openib (or mtl/mxm) work best for
> Mellanox,
>
> but these are not interoperable. also, i do not think btl/openib can be
> used with QLogic cards
>
> (please someone correct me if i am wrong)
>
>
> from the logs, i can see that smd (Mellanox) is not even able to use the
> infiniband port.
>
> if you run with 2 MPI tasks, both run on smd and hence btl/vader is used;
> that is why it works
>
> if you run with more than 2 MPI tasks, then smd and other nodes are used,
> and every MPI task falls back to btl/tcp
>
> for inter-node communication.
>
> [smd][[41971,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect]
> connect() to 192.168.1.196 failed: No route to host (113)
>
> this usually indicates a firewall, but since both ssh and oob/tcp are
> fine, this puzzles me.
>
>
> what if you
>
> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
> --mca btl_tcp_if_include 192.168.1.0/24 --mca pml ob1 --mca btl
> tcp,sm,vader,self  ring
>
> that should work with no error messages, and then you can try with 12 MPI
> tasks
>
> (note internode MPI communications will use tcp only)
>
>
> if you want optimal performance, i am afraid you cannot run any MPI task
> on smd (so mtl/psm can be used )
>
> (btw, make sure PSM support was built in Open MPI)
>
> a suboptimal option is to force MPI communications on IPoIB with
>
> /* make sure all nodes can ping each other via IPoIB first */
>
> mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include
> 10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self
>
>
>
> Cheers,
>
>
> Gilles
>
>
> On 5/19/2017 3:50 PM, Allan Overstreet wrote:
>
>> Gilles,
>>
>> On which node is mpirun invoked ?
>>
>> The mpirun command was invoked on node smd.
>>
>> Are you running from a batch manager?
>>
>> No.
>>
>> Is there any firewall running on your nodes ?
>>
>> No. CentOS Minimal does not have a firewall installed, and Ubuntu
>> Mate's firewall is disabled.
>>
>> All three of your commands have appeared to run successfully. The outputs
>> of the three commands are attached.
>>
>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>> --mca oob_base_verbose 100 true &> cmd1
>>
>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>> --mca oob_base_verbose 100 true &> cmd2
>>
>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>> --mca oob_base_verbose 100 ring &> cmd3
>>
>> If I increase the number of processors in the ring program, mpirun will
>> not succeed.
>>
>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>> --mca oob_base_verbose 100 ring &> cmd4
>>
>>
>> On 05/19/2017 02:18 AM, Gilles Gouaillardet wrote:
>>
>>> Allan,
>>>
>>>
>>> - on which node is mpirun invoked ?
>>>
>>> - are you running from a batch manager ?
>>>
>>> - is there any firewall running on your nodes ?
>>>
>>>
>>> the error is likely occurring when wiring up mpirun/orted
>>>
>>> what if you
>>>
>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 true
>>>
>>> then (if the previous command worked)
>>>
>>> mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 true
>>>
>>> and finally (if both previous commands worked)
>>>
>>> mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24
>>> --mca oob_base_verbose 100 ring
>>>
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 5/19/2017 3:07 PM, Allan Overstreet wrote:
>>>
 I am experiencing many different errors with openmpi version 2.1.1. I have
 had a suspicion that this might be related to the way the servers were
 connected and configured. Regardless, below is a diagram of how the
 servers are configured.

 __  _
[__]|=|
/::/|_|
HOST: smd
Dual 1Gb Ethernet Bonded
.-> Bond0 IP: 192.168.1.200
|   Infiniband Card: MHQH29B-XTR <.
|   Ib0 IP: 10.1.0.1  |
|   OS: Ubuntu Mate   |
|   __ _ |
| [__]|=||
  

Re: [OMPI users] Many different errors with ompi version 2.1.1

2017-05-19 Thread Gilles Gouaillardet

Allan,


i just noted smd has a Mellanox card, while other nodes have QLogic cards.

mtl/psm works best for QLogic while btl/openib (or mtl/mxm) work best 
for Mellanox,


but these are not interoperable. also, i do not think btl/openib can be 
used with QLogic cards


(please someone correct me if i am wrong)


from the logs, i can see that smd (Mellanox) is not even able to use the 
infiniband port.
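
a quick way to double check that (not something from your logs; it assumes the
usual infiniband-diags / libibverbs utilities are installed on smd) is:

ibstat
ibv_devinfo | grep -E 'state|link_layer'

if the Mellanox port is not Active / LinkUp there, btl/openib has nothing to
work with.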


if you run with 2 MPI tasks, both run on smd and hence btl/vader is
used; that is why it works


if you run with more than 2 MPI tasks, then smd and other nodes are
used, and every MPI task falls back to btl/tcp


for inter-node communication.

[smd][[41971,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] 
connect() to 192.168.1.196 failed: No route to host (113)


this usually indicates a firewall, but since both ssh and oob/tcp are 
fine, this puzzles me.
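
if you want to rule a firewall out completely (the commands below are only a
suggestion, based on the OS mix you described: CentOS 7 on the compute nodes
and Ubuntu on smd):

# CentOS 7 nodes
systemctl status firewalld
sudo iptables -L -n

# smd (Ubuntu)
sudo ufw status
sudo iptables -L -n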



what if you

mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
--mca btl_tcp_if_include 192.168.1.0/24 --mca pml ob1 --mca btl 
tcp,sm,vader,self  ring


that should work with no error messages, and then you can try with 12 
MPI tasks


(note internode MPI communications will use tcp only)


if you want optimal performance, i am afraid you cannot run any MPI task 
on smd (so mtl/psm can be used )


(btw, make sure PSM support was built in Open MPI)
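
(one way to check, using the ompi_info tool that ships with Open MPI:

ompi_info | grep -i psm

if no mtl psm component is listed, PSM support was not built in)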

a suboptimal option is to force MPI communications on IPoIB with

/* make sure all nodes can ping each other via IPoIB first */

mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include 
10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self
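
for the "ping each other via IPoIB" part, a simple loop over the -ib names
(hostnames taken from the /etc/hosts file you posted), run from each node in
turn, would do it:

for h in smd-ib sm1-ib sm2-ib sm3-ib sm4-ib dl580-ib; do
    ping -c 1 -W 2 $h > /dev/null && echo "$h ok" || echo "$h FAILED"
done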




Cheers,


Gilles


On 5/19/2017 3:50 PM, Allan Overstreet wrote:

Gilles,

On which node is mpirun invoked ?

The mpirun command was invoked on node smd.

Are you running from a batch manager?

No.

Is there any firewall running on your nodes ?

No. CentOS Minimal does not have a firewall installed, and Ubuntu
Mate's firewall is disabled.


All three of your commands have appeared to run successfully. The 
outputs of the three commands are attached.


mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
--mca oob_base_verbose 100 true &> cmd1


mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
--mca oob_base_verbose 100 true &> cmd2


mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
--mca oob_base_verbose 100 ring &> cmd3


If I increase the number of processors in the ring program, mpirun 
will not succeed.


mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
--mca oob_base_verbose 100 ring &> cmd4



On 05/19/2017 02:18 AM, Gilles Gouaillardet wrote:

Allan,


- on which node is mpirun invoked ?

- are you running from a batch manager ?

- is there any firewall running on your nodes ?


the error is likely occurring when wiring up mpirun/orted

what if you

mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
--mca oob_base_verbose 100 true


then (if the previous command worked)

mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 
192.168.1.0/24 --mca oob_base_verbose 100 true


and finally (if both previous commands worked)

mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
--mca oob_base_verbose 100 ring



Cheers,

Gilles

On 5/19/2017 3:07 PM, Allan Overstreet wrote:
I am experiencing many different errors with openmpi version 2.1.1. I
have had a suspicion that this might be related to the way the
servers were connected and configured. Regardless, below is a diagram
of how the servers are configured.


__  _
   [__]|=|
   /::/|_|
   HOST: smd
   Dual 1Gb Ethernet Bonded
   .-> Bond0 IP: 192.168.1.200
   |   Infiniband Card: MHQH29B-XTR <.
   |   Ib0 IP: 10.1.0.1  |
   |   OS: Ubuntu Mate   |
   |   __ _ |
   | [__]|=||
   | /::/|_||
   |   HOST: sm1 |
   |   Dual 1Gb Ethernet Bonded  |
   |-> Bond0 IP: 192.168.1.196   |
   |   Infiniband Card: QLOGIC QLE7340 <-|
   |   Ib0 IP: 10.1.0.2  |
   |   OS: Centos 7 Minimal  |
   |   __ _ |
   | [__]|=||
   |-. /::/|_||
   | | HOST: sm2 |
   | | Dual 1Gb Ethernet Bonded  |

Re: [OMPI users] Many different errors with ompi version 2.1.1

2017-05-18 Thread Gilles Gouaillardet

Allan,


- on which node is mpirun invoked ?

- are you running from a batch manager ?

- is there any firewall running on your nodes ?

- how many interfaces are part of bond0 ?
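
(for the bond0 question, the standard Linux bonding driver reports this
directly; for example

grep 'Slave Interface' /proc/net/bonding/bond0

or cat the whole file to also see the bonding mode and per-slave link state)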


the error is likely occurring when wiring up mpirun/orted

what if you

mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
--mca oob_base_verbose 100 true


then (if the previous command worked)

mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
--mca oob_base_verbose 100 true


and finally (if both previous commands worked)

mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 
--mca oob_base_verbose 100 ring



Cheers,

Gilles


On 5/19/2017 3:07 PM, Allan Overstreet wrote:
I am experiencing many different errors with openmpi version 2.1.1. I
have had a suspicion that this might be related to the way the servers
were connected and configured. Regardless, below is a diagram of how
the servers are configured.


__  _
   [__]|=|
   /::/|_|
   HOST: smd
   Dual 1Gb Ethernet Bonded
   .-> Bond0 IP: 192.168.1.200
   |   Infiniband Card: MHQH29B-XTR <.
   |   Ib0 IP: 10.1.0.1  |
   |   OS: Ubuntu Mate   |
   |   __ _ |
   | [__]|=||
   | /::/|_||
   |   HOST: sm1 |
   |   Dual 1Gb Ethernet Bonded  |
   |-> Bond0 IP: 192.168.1.196   |
   |   Infiniband Card: QLOGIC QLE7340 <-|
   |   Ib0 IP: 10.1.0.2  |
   |   OS: Centos 7 Minimal  |
   |   __ _ |
   | [__]|=||
   |-. /::/|_||
   | | HOST: sm2 |
   | | Dual 1Gb Ethernet Bonded  |
   | '---> Bond0 IP: 192.168.1.199   |
   __  Infiniband Card: QLOGIC QLE7340 __
  [_|||_°] Ib0 IP: 10.1.0.3 [_|||_°]
  [_|||_°] OS: Centos 7 Minimal [_|||_°]
  [_|||_°] __ _ [_|||_°]
   Gb Ethernet Switch [__]|=| Voltaire 4036 QDR Switch
   | /::/|_| |
   |   HOST: sm3  |
   |   Dual 1Gb Ethernet Bonded   |
   |-> Bond0 IP: 192.168.1.203|
   |   Infiniband Card: QLOGIC QLE7340 <--|
   |   Ib0 IP: 10.1.0.4   |
   |   OS: Centos 7 Minimal   |
   |  __ _   |
   | [__]|=|  |
   | /::/|_|  |
   |   HOST: sm4  |
   |   Dual 1Gb Ethernet Bonded   |
   |-> Bond0 IP: 192.168.1.204|
   |   Infiniband Card: QLOGIC QLE7340 <--|
   |   Ib0 IP: 10.1.0.5   |
   |   OS: Centos 7 Minimal   |
   | __ _|
   | [__]|=|   |
   | /::/|_|   |
   |   HOST: dl580|
   |   Dual 1Gb Ethernet Bonded   |
   '-> Bond0 IP: 192.168.1.201|
   Infiniband Card: QLOGIC QLE7340 <--'
   Ib0 IP: 10.1.0.6
   OS: Centos 7 Minimal

I have ensured that the Infiniband adapters can ping each other and
every node can ssh into every other node without a password. Every node has
the same /etc/hosts file,


cat /etc/hosts

127.0.0.1        localhost
192.168.1.200    smd
192.168.1.196    sm1
192.168.1.199    sm2
192.168.1.203    sm3
192.168.1.204    sm4
192.168.1.201    dl580

10.1.0.1    smd-ib
10.1.0.2    sm1-ib
10.1.0.3    sm2-ib
10.1.0.4    sm3-ib
10.1.0.5    sm4-ib
10.1.0.6    dl580-ib

I have been using a simple ring test program to test openmpi. The code 
for this program is attached.


The hostfile used in all the commands is,

cat ./nodes

smd slots=2
sm1 s

[OMPI users] Many different errors with ompi version 2.1.1

2017-05-18 Thread Allan Overstreet
I am experiencing many different errors with openmpi version 2.1.1. I have
had a suspicion that this might be related to the way the servers were
connected and configured. Regardless, below is a diagram of how the
servers are configured.


__  _
   [__]|=|
   /::/|_|
   HOST: smd
   Dual 1Gb Ethernet Bonded
   .-> Bond0 IP: 192.168.1.200
   |   Infiniband Card: MHQH29B-XTR <.
   |   Ib0 IP: 10.1.0.1  |
   |   OS: Ubuntu Mate   |
   |   __ _ |
   | [__]|=||
   | /::/|_||
   |   HOST: sm1 |
   |   Dual 1Gb Ethernet Bonded  |
   |-> Bond0 IP: 192.168.1.196   |
   |   Infiniband Card: QLOGIC QLE7340 <-|
   |   Ib0 IP: 10.1.0.2  |
   |   OS: Centos 7 Minimal  |
   |   __ _ |
   | [__]|=||
   |-. /::/|_||
   | | HOST: sm2 |
   | | Dual 1Gb Ethernet Bonded  |
   | '---> Bond0 IP: 192.168.1.199   |
   __  Infiniband Card: QLOGIC QLE7340  __
  [_|||_°] Ib0 IP: 10.1.0.3[_|||_°]
  [_|||_°] OS: Centos 7 Minimal[_|||_°]
  [_|||_°] __ _   [_|||_°]
   Gb Ethernet Switch [__]|=| Voltaire 4036 QDR Switch
   | /::/|_| |
   |   HOST: sm3  |
   |   Dual 1Gb Ethernet Bonded   |
   |-> Bond0 IP: 192.168.1.203|
   |   Infiniband Card: QLOGIC QLE7340 <--|
   |   Ib0 IP: 10.1.0.4   |
   |   OS: Centos 7 Minimal   |
   |  __ _   |
   | [__]|=|  |
   | /::/|_|  |
   |   HOST: sm4  |
   |   Dual 1Gb Ethernet Bonded   |
   |-> Bond0 IP: 192.168.1.204|
   |   Infiniband Card: QLOGIC QLE7340 <--|
   |   Ib0 IP: 10.1.0.5   |
   |   OS: Centos 7 Minimal   |
   | __ _|
   | [__]|=|   |
   | /::/|_|   |
   |   HOST: dl580|
   |   Dual 1Gb Ethernet Bonded   |
   '-> Bond0 IP: 192.168.1.201|
   Infiniband Card: QLOGIC QLE7340 <--'
   Ib0 IP: 10.1.0.6
   OS: Centos 7 Minimal

I have ensured that the Infiniband adapters can ping each other and
every node can ssh into every other node without a password. Every node has
the same /etc/hosts file,


cat /etc/hosts

127.0.0.1        localhost
192.168.1.200    smd
192.168.1.196    sm1
192.168.1.199    sm2
192.168.1.203    sm3
192.168.1.204    sm4
192.168.1.201    dl580

10.1.0.1    smd-ib
10.1.0.2    sm1-ib
10.1.0.3    sm2-ib
10.1.0.4    sm3-ib
10.1.0.5    sm4-ib
10.1.0.6    dl580-ib

I have been using a simple ring test program to test openmpi. The code 
for this program is attached.


The hostfile used in all the commands is,

cat ./nodes

smd slots=2
sm1 slots=2
sm2 slots=2
sm3 slots=2
sm4 slots=2
dl580 slots=2

When running the following command on smd,

mpirun -mca btl openib,self -np 2 --hostfile nodes ./ring

I obtain the following error,


A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    sm1
  Remote host:   192.168.1.200
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
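
One quick check (the addresses come from the diagram above; the commands
themselves are only a suggestion, not part of the original report) is to
confirm from sm1 that smd's bond0 address is reachable at all:

# run on sm1
ping -c 3 192.168.1.200
ip route get 192.168.1.200

If either of those fails, the problem is in the bonded Ethernet setup rather
than in Open MPI itself.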

--
No Op