Hi Sébastien,

If I understand you correctly, you are running your application with two different MPIs on two different clusters with two different IB vendors.
Could you make the comparison more "apples to apples"-ish? For instance:
 - run the same version of Open MPI on both clusters
 - run the same version of MVAPICH on both clusters

-- YK

On 18-Sep-11 1:59 AM, Sébastien Boisvert wrote:
> Hello,
>
> Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250
> microseconds with 256 MPI ranks on super-computer A (name is colosse).
>
> The same software gives a latency of 10 microseconds with MVAPICH2 and QLogic
> Infiniband hardware with 512 MPI ranks on super-computer B (name is
> guillimin).
>
>
> Here is the relevant information listed at
> http://www.open-mpi.org/community/help/
>
>
> 1. Check the FAQ first.
>
> done!
>
>
> 2. The version of Open MPI that you're using.
>
> Open-MPI 1.4.3
>
>
> 3. The config.log file from the top-level Open MPI directory, if available
> (please compress!).
>
> See below.
>
> Command file: http://pastebin.com/mW32ntSJ
>
>
> 4. The output of the "ompi_info --all" command from the node where you're
> invoking mpirun.
>
> ompi_info -a on colosse: http://pastebin.com/RPyY9s24
>
>
> 5. If running on more than one node -- especially if you're having problems
> launching Open MPI processes -- also include the output of the "ompi_info -v
> ompi full --parsable" command from each node on which you're trying to run.
>
> I am not having problems launching Open-MPI processes.
>
>
> 6. A detailed description of what is failing.
>
> Open-MPI 1.4.3 on Mellanox Infiniband hardware gives a latency of 250
> microseconds with 256 MPI ranks on super-computer A (name is colosse).
>
> The same software gives a latency of 10 microseconds with MVAPICH2 and QLogic
> Infiniband hardware on 512 MPI ranks on super-computer B (name is guillimin).
>
> Details follow.
>
>
> I am developing a distributed genome assembler that runs with the
> message-passing interface (I am a PhD student).
> It is called Ray. Link: http://github.com/sebhtml/ray
>
> I recently added the option -test-network-only so that Ray can be used to
> test the latency. Each MPI rank has to send 100000 messages (4000 bytes
> each), one by one.
> The destination of each message is picked at random.
>
>
> On colosse, a super-computer located at Laval University, I get an average
> latency of 250 microseconds with the test done in Ray.
>
> See http://pastebin.com/9nyjSy5z
>
> On colosse, the hardware is Mellanox Infiniband QDR ConnectX and the MPI
> middleware is Open-MPI 1.4.3 compiled with gcc 4.4.2.
>
> colosse has 8 compute cores per node (Intel Nehalem).
>
>
> Testing the latency with ibv_rc_pingpong on colosse gives 11 microseconds.
>
>   local address:  LID 0x048e, QPN 0x1c005c, PSN 0xf7c66b
>   remote address: LID 0x018c, QPN 0x2c005c, PSN 0x5428e6
>   8192000 bytes in 0.01 seconds = 5776.64 Mbit/sec
>   1000 iters in 0.01 seconds = 11.35 usec/iter
>
> So I know from the output of ibv_rc_pingpong that the Infiniband fabric has a
> correct latency between two HCAs.
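A quick way to separate the MPI layer from the fabric here would be an MPI-level ping-pong with the same 4000-byte payload, run with one rank per node, and compared directly against the ~11 usec from ibv_rc_pingpong. The sketch below is only illustrative and is not part of Ray; the file name and the iteration count are made up.

// mpi_pingpong.cpp -- a minimal MPI-level ping-pong, meant to be compared
// against the ~11 usec reported by ibv_rc_pingpong on the same two nodes.
// This is an illustrative sketch, not part of Ray.
// Build: mpicxx -O2 mpi_pingpong.cpp -o mpi_pingpong
// Run  : mpirun -np 2 -npernode 1 ./mpi_pingpong   (or use a hostfile to get
//                                                   one rank per node)
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int iterations = 1000;
    const int bytes = 4000;               // same payload size as the Ray test
    std::vector<char> buffer(bytes);

    if (size < 2) {
        if (rank == 0) printf("run with at least 2 ranks\n");
        MPI_Finalize();
        return 0;
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < iterations; i++) {
        if (rank == 0) {
            MPI_Send(&buffer[0], bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buffer[0], bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&buffer[0], bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&buffer[0], bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 0)
        printf("%d round trips, %.2f usec per one-way message\n",
               iterations, 1e6 * elapsed / iterations / 2.0);

    MPI_Finalize();
    return 0;
}

If that number is close to the 11 usec from ibv_rc_pingpong, the extra latency in the full test is more likely coming from the message volume and the per-node sharing of the HCA than from the fabric itself.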
> Adding the parameter --mca btl_openib_verbose 1 to mpirun shows that Open-MPI
> detects the hardware correctly:
>
> [r107-n57][[59764,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query]
> Querying INI files for vendor 0x02c9, part ID 26428
> [r107-n57][[59764,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found
> corresponding INI values: Mellanox Hermon
>
> see http://pastebin.com/pz03f0B3
>
> So I don't think this is the problem described in the FAQ
> (http://www.open-mpi.org/faq/?category=openfabrics#mellanox-connectx-poor-latency)
> and on the mailing list
> (http://www.open-mpi.org/community/lists/users/2007/10/4238.php), because the
> INI values are found.
>
>
> Running the network test implemented in Ray on 32 MPI ranks, I get an average
> latency of 65 microseconds.
>
> See http://pastebin.com/nWDmGhvM
>
> Thus, with 256 MPI ranks I get an average latency of 250 microseconds, and
> with 32 MPI ranks I get 65 microseconds.
>
>
> Running the network test on 32 MPI ranks again, but only allowing MPI rank 0
> to send messages, gives a latency of 10 microseconds for this rank.
> See http://pastebin.com/dWMXsHpa
>
>
> Because I get 10 microseconds in the network test in Ray when only MPI rank 0
> sends messages, I would say that there may be some I/O contention.
>
> To test this hypothesis, I re-ran the test, but allowed only 1 MPI rank per
> node to send messages (there are 8 MPI ranks per node and a total of 32 MPI
> ranks).
> Ranks 0, 8, 16 and 24 all reported 13 microseconds.
> See http://pastebin.com/h84Fif3g
>
> The next test was to allow 2 MPI ranks on each node to send messages. Ranks
> 0, 1, 8, 9, 16, 17, 24 and 25 reported 15 microseconds.
> See http://pastebin.com/REdhJXkS
>
> With 3 MPI ranks per node that can send messages, ranks 0, 1, 2, 8, 9, 10,
> 16, 17, 18, 24, 25 and 26 reported 20 microseconds.
> See http://pastebin.com/TCd6xpuC
>
> Finally, with 4 MPI ranks per node that can send messages, I got 23
> microseconds.
> See http://pastebin.com/V8zjae7s
>
>
> So the MPI ranks on a given node seem to fight for access to the HCA port.
>
> Each colosse node has 1 port (ibv_devinfo) and the max_mtu is 2048 bytes.
> See http://pastebin.com/VXMAZdeZ
>
>
> At this point, some may think that there is a bug in the network test itself.
> So I tested the same code on another super-computer.
>
> On guillimin, a super-computer located at McGill University, I get an average
> latency (with Ray -test-network-only) of 10 microseconds when running Ray on
> 512 MPI ranks.
>
> See http://pastebin.com/nCKF8Xg6
>
> On guillimin, the hardware is QLogic Infiniband QDR and the MPI middleware is
> MVAPICH2 1.6.
>
> Thus, I know that the network test in Ray works as expected, because the
> results on guillimin show a latency of 10 microseconds for 512 MPI ranks.
>
> guillimin also has 8 compute cores per node (Intel Nehalem).
>
> On guillimin, each node has one port (ibv_devinfo) and the max_mtu of the
> HCAs is 4096 bytes.
> See http://pastebin.com/35T8N5t8
>
>
> In Ray, only the following MPI functions are utilised:
>
> - MPI_Init
> - MPI_Comm_rank
> - MPI_Comm_size
> - MPI_Finalize
>
> - MPI_Isend
>
> - MPI_Request_free
> - MPI_Test
> - MPI_Get_count
> - MPI_Start
> - MPI_Recv_init
> - MPI_Cancel
>
> - MPI_Get_processor_name
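As an aside, that function list suggests a progress loop built around one persistent receive that is re-armed with MPI_Start and non-blocking sends polled with MPI_Test. The sketch below shows that general pattern only; it is not Ray's code, and the ring traffic pattern, tag, message count and file name are invented for illustration.

// progress_loop.cpp -- a sketch of the pattern implied by the function list
// above: one persistent receive re-armed with MPI_Start, non-blocking sends,
// completion polled with MPI_Test. This is NOT Ray's code; the ring traffic
// pattern, tag, message count and buffer size are invented for illustration.
// Build: mpicxx -O2 progress_loop.cpp -o progress_loop
// Run  : mpirun -np 8 ./progress_loop
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int maxMessageSize = 4000;     // bytes per message, as in the test
    const int messagesPerRank = 1000;    // each rank sends this many messages
    std::vector<char> receiveBuffer(maxMessageSize);
    std::vector<char> sendBuffer(maxMessageSize);

    // The persistent receive is created once and re-armed after every message.
    MPI_Request receiveRequest;
    MPI_Recv_init(&receiveBuffer[0], maxMessageSize, MPI_BYTE,
                  MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &receiveRequest);
    MPI_Start(&receiveRequest);

    // Each rank sends to its successor in a ring so the loop below terminates
    // deterministically (unlike the random destinations of the real test).
    int destination = (rank + 1) % size;
    int sent = 0, received = 0;
    long long bytesReceived = 0;
    MPI_Request sendRequest = MPI_REQUEST_NULL;

    while (sent < messagesPerRank || received < messagesPerRank) {
        // Poll the persistent receive and re-arm it as soon as a message is in.
        int flag = 0;
        MPI_Status status;
        MPI_Test(&receiveRequest, &flag, &status);
        if (flag) {
            int count = 0;
            MPI_Get_count(&status, MPI_BYTE, &count);  // actual payload length
            bytesReceived += count;
            received++;
            MPI_Start(&receiveRequest);
        }

        // Issue the next non-blocking send once the previous one has completed.
        if (sent < messagesPerRank) {
            int previousDone = 1;
            MPI_Test(&sendRequest, &previousDone, MPI_STATUS_IGNORE);
            if (previousDone) {
                MPI_Isend(&sendBuffer[0], maxMessageSize, MPI_BYTE, destination,
                          0, MPI_COMM_WORLD, &sendRequest);
                sent++;
            }
        }
    }

    // Finish the last send, then cancel and release the idle persistent receive.
    MPI_Wait(&sendRequest, MPI_STATUS_IGNORE);
    MPI_Cancel(&receiveRequest);
    MPI_Wait(&receiveRequest, MPI_STATUS_IGNORE);
    MPI_Request_free(&receiveRequest);

    if (rank == 0)
        printf("rank %d sent %d and received %d messages (%lld bytes)\n",
               rank, sent, received, bytesReceived);

    MPI_Finalize();
    return 0;
}

The persistent receive is set up once with MPI_Recv_init and only re-armed with MPI_Start, which avoids re-posting a fresh receive for every message; apart from the deterministic ring, this is the general shape such a loop usually takes.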
> 7. Please include information about your network:
> http://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot
>
> Type: Infiniband
>
> 7.1. Which OpenFabrics version are you running?
>
> ofed-scripts-1.4.2-0_sunhpc1
> libibverbs-1.1.3-2.el5
> libibverbs-utils-1.1.3-2.el5
> libibverbs-devel-1.1.3-2.el5
>
>
> 7.2. What distro and version of Linux are you running? What is your kernel
> version?
>
> CentOS release 5.6 (Final)
>
> Linux colosse1 2.6.18-238.19.1.el5 #1 SMP Fri Jul 15 07:31:24 EDT 2011 x86_64
> x86_64 x86_64 GNU/Linux
>
>
> 7.3. Which subnet manager are you running? (e.g., OpenSM, a vendor-specific
> subnet manager, etc.)
>
> opensm-libs-3.3.3-1.el5_6.1
>
>
> 7.4. What is the output of the ibv_devinfo command?
>
> hca_id: mlx4_0
>         fw_ver:          2.7.000
>         node_guid:       5080:0200:008d:8f88
>         sys_image_guid:  5080:0200:008d:8f8b
>         vendor_id:       0x02c9
>         vendor_part_id:  26428
>         hw_ver:          0xA0
>         board_id:        X6275_QDR_IB_2.5
>         phys_port_cnt:   1
>         port:   1
>                 state:       active (4)
>                 max_mtu:     2048 (4)
>                 active_mtu:  2048 (4)
>                 sm_lid:      1222
>                 port_lid:    659
>                 port_lmc:    0x00
>
>
> 7.5. What is the output of the ifconfig command?
>
> Not using IPoIB.
>
>
> 7.6. If running under Bourne shells, what is the output of the "ulimit -l"
> command?
>
> [sboisver12@colosse1 ~]$ ulimit -l
> 6000000
>
>
> The two differences I see between guillimin and colosse are:
>
> - Open-MPI 1.4.3 (colosse) vs. MVAPICH2 1.6 (guillimin)
> - Mellanox (colosse) vs. QLogic (guillimin)
>
>
> Has anyone experienced such a high latency with Open-MPI 1.4.3 on Mellanox
> HCAs?
>
>
> Thank you for your time.
>
>
> Sébastien Boisvert