Re: [OMPI users] Wrong distance calculations in multi-rail setup?
Let me send you a patch off list that will print out some extra information to see if we can figure out where things are going wrong. We basically depend on the information reported by hwloc, so the patch will show whether we are getting good data from hwloc.

Thanks,
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Marcin
>Krotkiewski
>Sent: Friday, August 28, 2015 12:13 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] Wrong distance calculations in multi-rail setup?
>
>Brilliant! Thank you, Rolf. This works: all ranks have reported using the
>expected port number, and performance is twice what I was observing
>before :)
>
>I can certainly live with this workaround, but I will be happy to do some
>debugging to find the problem. If you tell me what is needed / where I can
>look, I could help to find the issue.
>
>Thanks a lot!
>
>Marcin
>
>On 08/28/2015 05:28 PM, Rolf vandeVaart wrote:
>> I am not sure why the distances are being computed as you are seeing. I do
>> not have a dual-rail card system to reproduce with. However, short term, I
>> think you could get what you want by running like the following. The first
>> argument tells the selection logic to ignore locality, so both cards will be
>> available to all ranks. Then, using the application-specific notation you can
>> pick the exact port for each rank.
>>
>> Something like:
>>   mpirun -gmca btl_openib_ignore_locality -np 1 --mca
>>   btl_openib_if_include mlx4_0:1 a.out : -np 1 --mca
>>   btl_openib_if_include mlx4_0:2 a.out : -np 1 --mca
>>   btl_openib_if_include mlx4_1:1 a.out : --mca btl_openib_if_include
>>   mlx4_1:2 a.out
>>
>> Kind of messy, but that is the general idea.
>>
>> Rolf
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of
>>> marcin.krotkiewski
>>> Sent: Friday, August 28, 2015 10:49 AM
>>> To: us...@open-mpi.org
>>> Subject: [OMPI users] Wrong distance calculations in multi-rail setup?
>>>
>>> I have a 4-socket machine with two dual-port InfiniBand cards
>>> (devices mlx4_0 and mlx4_1). The cards are connected to PCI slots of
>>> different CPUs (I hope..), both ports are active on both cards, and
>>> everything is connected to the same physical network.
>>>
>>> I use openmpi-1.10.0 and run the IMB-MPI1 benchmark with 4 MPI ranks
>>> bound to the 4 sockets, hoping to use both IB cards (and both ports):
>>>
>>>   mpirun --map-by socket --bind-to core -np 4 --mca btl openib,self
>>>   --mca btl_openib_if_include mlx4_0,mlx4_1 ./IMB-MPI1 SendRecv
>>>
>>> but Open MPI refuses to use the mlx4_1 device:
>>>
>>>   [node1.local:28265] [rank=0] openib: skipping device mlx4_1; it is too far away
>>>   [ the same for other ranks ]
>>>
>>> This is confusing, since I have read that Open MPI automatically uses
>>> a closer HCA, so at least some (>=one) rank should choose mlx4_1. I
>>> use binding by socket; here is the reported map:
>>>
>>>   [node1.local:28263] MCW rank 2 bound to socket 2[core 24[hwt 0]]:
>>>   [./././././././././././.][./././././././././././.][B/././././././././././.][./././././././././././.]
>>>   [node1.local:28263] MCW rank 3 bound to socket 3[core 36[hwt 0]]:
>>>   [./././././././././././.][./././././././././././.][./././././././././././.][B/././././././././././.]
>>>   [node1.local:28263] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
>>>   [B/././././././././././.][./././././././././././.][./././././././././././.][./././././././././././.]
>>>   [node1.local:28263] MCW rank 1 bound to socket 1[core 12[hwt 0]]:
>>>   [./././././././././././.][B/././././././././././.][./././././././././././.][./././././././././././.]
>>>
>>> To check what's going on I have modified btl_openib_component.c to
>>> print the computed distances.
>>>
>>>   opal_output_verbose(1, ompi_btl_base_framework.framework_output,
>>>                       "[rank=%d] openib: device %d/%d distance %lf",
>>>                       ORTE_PROC_MY_NAME->vpid,
>>>                       (int)i, (int)num_devs,
>>>                       (double)dev_sorted[i].distance);
>>>
>>> Here is what I get:
>>>
>>>   [node1.local:28265] [rank=0] openib: device 0/2 distance 0.00
>>>   [node1.local:28266] [rank=1] openib: device 0/2 distance 0.00
>>>   [node1.local:28267] [rank=2] openib: device 0/2 distance 0.00
>>>   [node1.local:28268] [rank=3] openib: device 0/2 distance 0.00
>>>   [node1.local:28265] [rank=0] openib: device 1/2 distance 2.10
>>>   [node1.local:28266] [rank=1] openib: device 1/2 distance 1.00
>>>   [node1.local:28267] [rank=2] openib: device 1/2 distance 2.10
>>>   [node1.local:28268] [rank=3] openib: device 1/2 distance 2.10
>>>
>>> So the computed distance for mlx4_0 is 0 on all ranks. I believe this
>>> should not be so. The distance should be smaller on 1 rank and larger
>>> for 3 others, as is the case for mlx4_1. Looks like a bug?
Re: [OMPI users] MPI_LB in a recursive type
From: George Bosilca

> First and foremost, the two datatype markers (MPI_LB and MPI_UB) have
> been deprecated since MPI 3.0 for exactly the reason you encountered.
> Once a datatype is annotated with these markers, they are propagated to
> all derived types, leading to an unnatural datatype definition. This
> behavior is enforced by the definition of the typemap specified by the
> equation in Section 4.1, page 105, line 18. Unfortunately, the only way
> to circumvent this issue is to manually set the UB on all newly created
> datatypes.

I see I should have checked the specification directly to see what the expected behavior was, instead of relying on (apparently over-)simplified summaries from web searches and books. Thanks for the pointer! I'd wondered if this was a fixable bug, but it looks like that equation dates back to at *least* 1994 and the MPI-1.0 spec; clearly the only thing to do was deprecate and replace the API rather than break old user code to enforce the "right" behavior instead.

> Thus, to fix your datatype composition you just have to add an explicit
> MPI_LB (set to 0) when calling MPI_Type_struct on your second struct
> datatype.

I'd managed to hit on this solution by guesswork, but it's quite a relief to know that its correctness is actually mandated by the MPI standard and not just my dumb luck.

Thanks again,
---
Roy Stogner
Re: [OMPI users] Wrong distance calculations in multi-rail setup?
Brilliant! Thank you, Rolf. This works: all ranks have reported using the expected port number, and performance is twice what I was observing before :)

I can certainly live with this workaround, but I will be happy to do some debugging to find the problem. If you tell me what is needed / where I can look, I could help to find the issue.

Thanks a lot!

Marcin

On 08/28/2015 05:28 PM, Rolf vandeVaart wrote:

I am not sure why the distances are being computed as you are seeing. I do not have a dual-rail card system to reproduce with. However, short term, I think you could get what you want by running like the following. The first argument tells the selection logic to ignore locality, so both cards will be available to all ranks. Then, using the application-specific notation you can pick the exact port for each rank.

Something like:

  mpirun -gmca btl_openib_ignore_locality -np 1 --mca btl_openib_if_include mlx4_0:1 a.out : -np 1 --mca btl_openib_if_include mlx4_0:2 a.out : -np 1 --mca btl_openib_if_include mlx4_1:1 a.out : --mca btl_openib_if_include mlx4_1:2 a.out

Kind of messy, but that is the general idea.

Rolf

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of marcin.krotkiewski
Sent: Friday, August 28, 2015 10:49 AM
To: us...@open-mpi.org
Subject: [OMPI users] Wrong distance calculations in multi-rail setup?

I have a 4-socket machine with two dual-port InfiniBand cards (devices mlx4_0 and mlx4_1). The cards are connected to PCI slots of different CPUs (I hope..), both ports are active on both cards, and everything is connected to the same physical network.
I use openmpi-1.10.0 and run the IMB-MPI1 benchmark with 4 MPI ranks bound to the 4 sockets, hoping to use both IB cards (and both ports):

  mpirun --map-by socket --bind-to core -np 4 --mca btl openib,self --mca btl_openib_if_include mlx4_0,mlx4_1 ./IMB-MPI1 SendRecv

but Open MPI refuses to use the mlx4_1 device:

  [node1.local:28265] [rank=0] openib: skipping device mlx4_1; it is too far away
  [ the same for other ranks ]

This is confusing, since I have read that Open MPI automatically uses a closer HCA, so at least some (>=one) rank should choose mlx4_1. I use binding by socket; here is the reported map:

  [node1.local:28263] MCW rank 2 bound to socket 2[core 24[hwt 0]]: [./././././././././././.][./././././././././././.][B/././././././././././.][./././././././././././.]
  [node1.local:28263] MCW rank 3 bound to socket 3[core 36[hwt 0]]: [./././././././././././.][./././././././././././.][./././././././././././.][B/././././././././././.]
  [node1.local:28263] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././.][./././././././././././.][./././././././././././.][./././././././././././.]
  [node1.local:28263] MCW rank 1 bound to socket 1[core 12[hwt 0]]: [./././././././././././.][B/././././././././././.][./././././././././././.][./././././././././././.]

To check what's going on I have modified btl_openib_component.c to print the computed distances.
  opal_output_verbose(1, ompi_btl_base_framework.framework_output,
                      "[rank=%d] openib: device %d/%d distance %lf",
                      ORTE_PROC_MY_NAME->vpid,
                      (int)i, (int)num_devs,
                      (double)dev_sorted[i].distance);

Here is what I get:

  [node1.local:28265] [rank=0] openib: device 0/2 distance 0.00
  [node1.local:28266] [rank=1] openib: device 0/2 distance 0.00
  [node1.local:28267] [rank=2] openib: device 0/2 distance 0.00
  [node1.local:28268] [rank=3] openib: device 0/2 distance 0.00
  [node1.local:28265] [rank=0] openib: device 1/2 distance 2.10
  [node1.local:28266] [rank=1] openib: device 1/2 distance 1.00
  [node1.local:28267] [rank=2] openib: device 1/2 distance 2.10
  [node1.local:28268] [rank=3] openib: device 1/2 distance 2.10

So the computed distance for mlx4_0 is 0 on all ranks. I believe this should not be so. The distance should be smaller on 1 rank and larger for 3 others, as is the case for mlx4_1. Looks like a bug?

Another question: in my configuration two ranks will have a 'closer' IB card, but two others will not. Since the correct distance to both devices will likely be equal, which device will they choose, if they do that automatically? I'd rather they didn't both choose mlx4_0.. I guess it would be nice if I could specify by hand the device/port which should be used by a given MPI rank. Is this (going to be) possible with Open MPI?

Thanks a lot,

Marcin

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27503.php
Re: [OMPI users] Wrong distance calculations in multi-rail setup?
I am not sure why the distances are being computed as you are seeing. I do not have a dual-rail card system to reproduce with. However, short term, I think you could get what you want by running like the following. The first argument tells the selection logic to ignore locality, so both cards will be available to all ranks. Then, using the application-specific notation you can pick the exact port for each rank.

Something like:

  mpirun -gmca btl_openib_ignore_locality -np 1 --mca btl_openib_if_include mlx4_0:1 a.out : -np 1 --mca btl_openib_if_include mlx4_0:2 a.out : -np 1 --mca btl_openib_if_include mlx4_1:1 a.out : --mca btl_openib_if_include mlx4_1:2 a.out

Kind of messy, but that is the general idea.

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of
>marcin.krotkiewski
>Sent: Friday, August 28, 2015 10:49 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] Wrong distance calculations in multi-rail setup?
>
>I have a 4-socket machine with two dual-port InfiniBand cards (devices
>mlx4_0 and mlx4_1). The cards are connected to PCI slots of different
>CPUs (I hope..), both ports are active on both cards, everything
>connected to the same physical network.
>
>I use openmpi-1.10.0 and run the IMB-MPI1 benchmark with 4 MPI ranks
>bound to the 4 sockets, hoping to use both IB cards (and both ports):
>
> mpirun --map-by socket --bind-to core -np 4 --mca btl openib,self --mca
>btl_openib_if_include mlx4_0,mlx4_1 ./IMB-MPI1 SendRecv
>
>but OpenMPI refuses to use the mlx4_1 device
>
> [node1.local:28265] [rank=0] openib: skipping device mlx4_1; it is too far
>away
> [ the same for other ranks ]
>
>This is confusing, since I have read that OpenMPI automatically uses a closer
>HCA, so at least some (>=one) rank should choose mlx4_1.
>I use binding by socket; here is the reported map:
>
> [node1.local:28263] MCW rank 2 bound to socket 2[core 24[hwt 0]]:
>[./././././././././././.][./././././././././././.][B/././././././././././.][./././././././././././.]
> [node1.local:28263] MCW rank 3 bound to socket 3[core 36[hwt 0]]:
>[./././././././././././.][./././././././././././.][./././././././././././.][B/././././././././././.]
> [node1.local:28263] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
>[B/././././././././././.][./././././././././././.][./././././././././././.][./././././././././././.]
> [node1.local:28263] MCW rank 1 bound to socket 1[core 12[hwt 0]]:
>[./././././././././././.][B/././././././././././.][./././././././././././.][./././././././././././.]
>
>To check what's going on I have modified btl_openib_component.c to print
>the computed distances.
>
> opal_output_verbose(1, ompi_btl_base_framework.framework_output,
>                     "[rank=%d] openib: device %d/%d distance %lf",
>                     ORTE_PROC_MY_NAME->vpid,
>                     (int)i, (int)num_devs,
>                     (double)dev_sorted[i].distance);
>
>Here is what I get:
>
> [node1.local:28265] [rank=0] openib: device 0/2 distance 0.00
> [node1.local:28266] [rank=1] openib: device 0/2 distance 0.00
> [node1.local:28267] [rank=2] openib: device 0/2 distance 0.00
> [node1.local:28268] [rank=3] openib: device 0/2 distance 0.00
> [node1.local:28265] [rank=0] openib: device 1/2 distance 2.10
> [node1.local:28266] [rank=1] openib: device 1/2 distance 1.00
> [node1.local:28267] [rank=2] openib: device 1/2 distance 2.10
> [node1.local:28268] [rank=3] openib: device 1/2 distance 2.10
>
>So the computed distance for mlx4_0 is 0 on all ranks. I believe this should
>not be so. The distance should be smaller on 1 rank and larger for 3 others,
>as is the case for mlx4_1. Looks like a bug?
>
>Another question: in my configuration two ranks will have a 'closer'
>IB card, but two others will not.
>Since the correct distance to both devices will likely be equal, which device
>will they choose, if they do that automatically? I'd rather they didn't both
>choose mlx4_0.. I guess it would be nice if I could specify by hand the
>device/port which should be used by a given MPI rank. Is this (going to be)
>possible with OpenMPI?
>
>Thanks a lot,
>
>Marcin
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27503.php
[OMPI users] Wrong distance calculations in multi-rail setup?
I have a 4-socket machine with two dual-port InfiniBand cards (devices mlx4_0 and mlx4_1). The cards are connected to PCI slots of different CPUs (I hope..), both ports are active on both cards, and everything is connected to the same physical network.

I use openmpi-1.10.0 and run the IMB-MPI1 benchmark with 4 MPI ranks bound to the 4 sockets, hoping to use both IB cards (and both ports):

  mpirun --map-by socket --bind-to core -np 4 --mca btl openib,self --mca btl_openib_if_include mlx4_0,mlx4_1 ./IMB-MPI1 SendRecv

but Open MPI refuses to use the mlx4_1 device:

  [node1.local:28265] [rank=0] openib: skipping device mlx4_1; it is too far away
  [ the same for other ranks ]

This is confusing, since I have read that Open MPI automatically uses a closer HCA, so at least some (>=one) rank should choose mlx4_1. I use binding by socket; here is the reported map:

  [node1.local:28263] MCW rank 2 bound to socket 2[core 24[hwt 0]]: [./././././././././././.][./././././././././././.][B/././././././././././.][./././././././././././.]
  [node1.local:28263] MCW rank 3 bound to socket 3[core 36[hwt 0]]: [./././././././././././.][./././././././././././.][./././././././././././.][B/././././././././././.]
  [node1.local:28263] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././.][./././././././././././.][./././././././././././.][./././././././././././.]
  [node1.local:28263] MCW rank 1 bound to socket 1[core 12[hwt 0]]: [./././././././././././.][B/././././././././././.][./././././././././././.][./././././././././././.]

To check what's going on I have modified btl_openib_component.c to print the computed distances.
  opal_output_verbose(1, ompi_btl_base_framework.framework_output,
                      "[rank=%d] openib: device %d/%d distance %lf",
                      ORTE_PROC_MY_NAME->vpid,
                      (int)i, (int)num_devs,
                      (double)dev_sorted[i].distance);

Here is what I get:

  [node1.local:28265] [rank=0] openib: device 0/2 distance 0.00
  [node1.local:28266] [rank=1] openib: device 0/2 distance 0.00
  [node1.local:28267] [rank=2] openib: device 0/2 distance 0.00
  [node1.local:28268] [rank=3] openib: device 0/2 distance 0.00
  [node1.local:28265] [rank=0] openib: device 1/2 distance 2.10
  [node1.local:28266] [rank=1] openib: device 1/2 distance 1.00
  [node1.local:28267] [rank=2] openib: device 1/2 distance 2.10
  [node1.local:28268] [rank=3] openib: device 1/2 distance 2.10

So the computed distance for mlx4_0 is 0 on all ranks. I believe this should not be so. The distance should be smaller on 1 rank and larger for 3 others, as is the case for mlx4_1. Looks like a bug?

Another question: in my configuration two ranks will have a 'closer' IB card, but two others will not. Since the correct distance to both devices will likely be equal, which device will they choose, if they do that automatically? I'd rather they didn't both choose mlx4_0.. I guess it would be nice if I could specify by hand the device/port which should be used by a given MPI rank. Is this (going to be) possible with Open MPI?

Thanks a lot,

Marcin