Brilliant! Thank you, Rolf. This works: all ranks have reported using the expected port number, and performance is twice what I was observing before :)

I can certainly live with this workaround, but I would be happy to do some debugging to find the problem. If you tell me what is needed / where I should look, I can help track down the issue.

Thanks a lot!

Marcin


On 08/28/2015 05:28 PM, Rolf vandeVaart wrote:
I am not sure why the distances are being computed as you are seeing. I do not
have a dual-rail card system to reproduce with. However, in the short term, I
think you can get what you want by running like the following. The first
argument tells the selection logic to ignore locality, so both cards will be
available to all ranks. Then, using the application-specific notation, you can
pick the exact port for each rank.

Something like:
  mpirun -gmca btl_openib_ignore_locality 1 \
      -np 1 --mca btl_openib_if_include mlx4_0:1 a.out : \
      -np 1 --mca btl_openib_if_include mlx4_0:2 a.out : \
      -np 1 --mca btl_openib_if_include mlx4_1:1 a.out : \
      -np 1 --mca btl_openib_if_include mlx4_1:2 a.out

Kind of messy, but that is the general idea.
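
If this needs to scale past a handful of ranks, the same idea can be scripted.
This is only an untested sketch (not a built-in openib feature): Open MPI's
launcher exports OMPI_COMM_WORLD_LOCAL_RANK to every process it starts, and any
MCA parameter can also be set through an OMPI_MCA_<name> environment variable,
so a small wrapper can pick the port per local rank. The rank-to-port mapping
below is just an illustration for this 4-socket / 2-card node:

  #!/bin/sh
  # pick_hca.sh -- choose an HCA port from the local rank, then exec the app.
  # Relies on Open MPI's launcher setting OMPI_COMM_WORLD_LOCAL_RANK.
  case "$OMPI_COMM_WORLD_LOCAL_RANK" in
      0) dev=mlx4_0:1 ;;
      1) dev=mlx4_0:2 ;;
      2) dev=mlx4_1:1 ;;
      3) dev=mlx4_1:2 ;;
      *) dev=mlx4_0,mlx4_1 ;;   # fallback: let the BTL pick among all ports
  esac
  # Any MCA parameter can be set in the environment as OMPI_MCA_<name>.
  export OMPI_MCA_btl_openib_if_include=$dev
  exec "$@"

launched with something like:

  mpirun -gmca btl_openib_ignore_locality 1 --map-by socket --bind-to core \
      -np 4 --mca btl openib,self ./pick_hca.sh ./a.out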

Rolf
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of marcin.krotkiewski
Sent: Friday, August 28, 2015 10:49 AM
To: us...@open-mpi.org
Subject: [OMPI users] Wrong distance calculations in multi-rail setup?

I have a 4-socket machine with two dual-port InfiniBand cards (devices mlx4_0
and mlx4_1). The cards are connected to PCI slots of different CPUs (I hope..),
both ports are active on both cards, and everything is connected to the same
physical network.

I use openmpi-1.10.0 and run the IMB-MPI1 benchmark with 4 MPI ranks bound to
the 4 sockets, hoping to use both IB cards (and both ports):

     mpirun --map-by socket --bind-to core -np 4 --mca btl openib,self \
         --mca btl_openib_if_include mlx4_0,mlx4_1 ./IMB-MPI1 SendRecv

but Open MPI refuses to use the mlx4_1 device:

     [node1.local:28265] [rank=0] openib: skipping device mlx4_1; it is too far
away
     [ the same for other ranks ]
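
A quick way to confirm which ports actually carry traffic during a run is to
watch the InfiniBand port counters; for example, assuming the standard mlx4
sysfs layout (port_xmit_data counts in 4-byte units):

     for c in /sys/class/infiniband/mlx4_*/ports/*/counters/port_xmit_data; do
         echo "$c: $(cat "$c")"
     done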

This is confusing, since I have read that Open MPI automatically uses the
closer HCA, so at least one rank should choose mlx4_1. I bind the ranks by
socket; here is the reported map:

     [node1.local:28263] MCW rank 2 bound to socket 2[core 24[hwt 0]]: [./././././././././././.][./././././././././././.][B/././././././././././.][./././././././././././.]
     [node1.local:28263] MCW rank 3 bound to socket 3[core 36[hwt 0]]: [./././././././././././.][./././././././././././.][./././././././././././.][B/././././././././././.]
     [node1.local:28263] MCW rank 0 bound to socket 0[core  0[hwt 0]]: [B/././././././././././.][./././././././././././.][./././././././././././.][./././././././././././.]
     [node1.local:28263] MCW rank 1 bound to socket 1[core 12[hwt 0]]: [./././././././././././.][B/././././././././././.][./././././././././././.][./././././././././././.]

To check what's going on, I have modified btl_openib_component.c to print the
computed distances:

     opal_output_verbose(1, ompi_btl_base_framework.framework_output,
                         "[rank=%d] openib: device %d/%d distance %lf",
                         ORTE_PROC_MY_NAME->vpid,
                         (int)i, (int)num_devs,
                         (double)dev_sorted[i].distance);

Here is what I get:

     [node1.local:28265] [rank=0] openib: device 0/2 distance 0.000000
     [node1.local:28266] [rank=1] openib: device 0/2 distance 0.000000
     [node1.local:28267] [rank=2] openib: device 0/2 distance 0.000000
     [node1.local:28268] [rank=3] openib: device 0/2 distance 0.000000
     [node1.local:28265] [rank=0] openib: device 1/2 distance 2.100000
     [node1.local:28266] [rank=1] openib: device 1/2 distance 1.000000
     [node1.local:28267] [rank=2] openib: device 1/2 distance 2.100000
     [node1.local:28268] [rank=3] openib: device 1/2 distance 2.100000

So the computed distance to mlx4_0 is 0 on all ranks. I believe this is wrong:
as with mlx4_1, the distance should be smaller for one rank and larger for the
other three. Looks like a bug?
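
As a sanity check of the expected locality, the NUMA node each HCA is attached
to can be read directly from sysfs (again assuming the standard mlx4 layout),
or shown graphically with hwloc's lstopo:

     for d in /sys/class/infiniband/mlx4_*; do
         echo "$d -> NUMA node $(cat $d/device/numa_node)"
     done
     lstopo --whole-io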

Another question: in my configuration, two ranks will have a 'closer' IB card,
but the other two will not. Since their (correctly computed) distances to both
devices will likely be equal, which device will they choose if the selection is
automatic? I'd rather they did not both choose mlx4_0. I guess it would be nice
if I could specify by hand the device/port to be used by a given MPI rank. Is
this (going to be) possible with Open MPI?

Thanks a lot,

Marcin
