Interesting. Try the native OFED benchmarks (e.g., ibv_rc_pingpong) -- i.e., get MPI out of the way and see whether the raw/native performance of the network between the devices shows the same dichotomy.
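A minimal sketch, assuming the stock OFED verbs utilities are installed (the device names, message size, and iteration count below are just examples -- use whatever ibv_devinfo reports on your nodes):

    # on the first node (server side; waits for the client, then runs the test)
    ibv_rc_pingpong -d mlx4_0 -s 65536 -n 1000

    # on the second node (client side), pointing at the first node's hostname
    ibv_rc_pingpong -d qib0 -s 65536 -n 1000 <hostname of first node>

Run it qib-to-qib, mlx4-to-mlx4, and qib-to-mlx4 and compare against the OSU numbers below. If the mixed pair is also slow (or hangs) at the raw verbs level, the problem is below Open MPI.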
On Jul 15, 2011, at 7:58 PM, David Warren wrote:

> All OFED 1.4 and 2.6.32 (that's what I can get to today)
>
> qib to qib:
>
> # OSU MPI Latency Test v3.3
> # Size          Latency (us)
> 0               0.29
> 1               0.32
> 2               0.31
> 4               0.32
> 8               0.32
> 16              0.35
> 32              0.35
> 64              0.47
> 128             0.47
> 256             0.50
> 512             0.53
> 1024            0.66
> 2048            0.88
> 4096            1.24
> 8192            1.89
> 16384           3.94
> 32768           5.94
> 65536           9.79
> 131072          18.93
> 262144          37.36
> 524288          71.90
> 1048576         189.62
> 2097152         478.55
> 4194304         1148.80
>
> # OSU MPI Bandwidth Test v3.3
> # Size          Bandwidth (MB/s)
> 1               2.48
> 2               5.00
> 4               10.04
> 8               20.02
> 16              33.22
> 32              67.32
> 64              134.65
> 128             260.30
> 256             486.44
> 512             860.77
> 1024            1385.54
> 2048            1940.68
> 4096            2231.20
> 8192            2343.30
> 16384           2944.99
> 32768           3213.77
> 65536           3174.85
> 131072          3220.07
> 262144          3259.48
> 524288          3277.05
> 1048576         3283.97
> 2097152         3288.91
> 4194304         3291.84
>
> # OSU MPI Bi-Directional Bandwidth Test v3.3
> # Size          Bi-Bandwidth (MB/s)
> 1               3.10
> 2               6.21
> 4               13.08
> 8               26.91
> 16              41.00
> 32              78.17
> 64              161.13
> 128             312.08
> 256             588.18
> 512             968.32
> 1024            1683.42
> 2048            2513.86
> 4096            2948.11
> 8192            2918.39
> 16384           3370.28
> 32768           3543.99
> 65536           4159.99
> 131072          4709.73
> 262144          4733.31
> 524288          4795.44
> 1048576         4753.69
> 2097152         4786.11
> 4194304         4779.40
>
> mlx4 to mlx4:
>
> # OSU MPI Latency Test v3.3
> # Size          Latency (us)
> 0               1.62
> 1               1.66
> 2               1.67
> 4               1.66
> 8               1.70
> 16              1.71
> 32              1.75
> 64              1.91
> 128             3.11
> 256             3.32
> 512             3.66
> 1024            4.46
> 2048            5.57
> 4096            6.62
> 8192            8.95
> 16384           11.07
> 32768           15.94
> 65536           25.57
> 131072          44.93
> 262144          83.58
> 524288          160.85
> 1048576         315.47
> 2097152         624.68
> 4194304         1247.17
>
> # OSU MPI Bandwidth Test v3.3
> # Size          Bandwidth (MB/s)
> 1               1.80
> 2               4.21
> 4               8.79
> 8               18.14
> 16              35.79
> 32              68.58
> 64              132.72
> 128             221.89
> 256             399.62
> 512             724.13
> 1024            1267.36
> 2048            1959.22
> 4096            2354.26
> 8192            2519.50
> 16384           3225.44
> 32768           3227.86
> 65536           3350.76
> 131072          3369.86
> 262144          3378.76
> 524288          3384.02
> 1048576         3386.60
> 2097152         3387.97
> 4194304         3388.66
>
> # OSU MPI Bi-Directional Bandwidth Test v3.3
> # Size          Bi-Bandwidth (MB/s)
> 1               1.70
> 2               3.86
> 4               10.42
> 8               20.99
> 16              41.22
> 32              79.17
> 64              151.25
> 128             277.64
> 256             495.44
> 512             843.44
> 1024            162.53
> 2048            2427.23
> 4096            2989.63
> 8192            3587.58
> 16384           5391.08
> 32768           6051.56
> 65536           6314.33
> 131072          6439.04
> 262144          6506.51
> 524288          6539.51
> 1048576         6558.34
> 2097152         6567.24
> 4194304         6555.76
>
> mixed:
>
> # OSU MPI Latency Test v3.3
> # Size          Latency (us)
> 0               3.81
> 1               3.88
> 2               3.86
> 4               3.85
> 8               3.92
> 16              3.93
> 32              3.93
> 64              4.02
> 128             4.60
> 256             4.80
> 512             5.14
> 1024            5.94
> 2048            7.26
> 4096            8.50
> 8192            10.98
> 16384           19.92
> 32768           26.35
> 65536           39.93
> 131072          64.45
> 262144          106.93
> 524288          191.89
> 1048576         358.31
> 2097152         694.25
> 4194304         1429.56
>
> # OSU MPI Bandwidth Test v3.3
> # Size          Bandwidth (MB/s)
> 1               0.64
> 2               1.39
> 4               2.76
> 8               5.58
> 16              11.03
> 32              22.17
> 64              43.70
> 128             100.49
> 256             179.83
> 512             305.87
> 1024            544.68
> 2048            838.22
> 4096            1187.74
> 8192            1542.07
> 16384           1260.93
> 32768           1708.54
> 65536           2180.45
> 131072          2482.28
> 262144          2624.89
> 524288          2680.55
> 1048576         2728.58
> never gets past here
>
> # OSU MPI Bi-Directional Bandwidth Test v3.3
> # Size          Bi-Bandwidth (MB/s)
> 1               0.41
> 2               0.83
> 4               1.68
> 8               3.37
> 16              6.71
> 32              13.37
> 64              26.64
> 128             63.47
> 256             113.23
> 512             202.92
> 1024            362.48
> 2048            578.53
> 4096            830.31
> 8192            1143.16
> 16384           1303.02
> 32768           1913.07
> 65536           2463.83
> 131072          2793.83
> 262144          2918.32
> 524288          2987.92
> 1048576         3033.31
> never gets past here
>
> On 07/15/11 09:03, Jeff Squyres wrote:
>> I don't think too many people have done combined QLogic + Mellanox runs, so
>> this probably isn't a well-explored space.
>>
>> Can you run some microbenchmarks to see what kind of latency / bandwidth
>> you're getting between nodes of the same type and nodes of different types?
>>
>> On Jul 14, 2011, at 8:21 PM, David Warren wrote:
>>
>>> On my test runs (a WRF run just long enough to go beyond the spinup influence):
>>> On just 6 of the old mlx4 machines I get about 00:05:30 runtime.
>>> On 3 mlx4 and 3 qib nodes I get an average of 00:06:20.
>>> So the slowdown is about 11+%. When this is a full run, 11% becomes a very
>>> long time. This has held for some longer tests as well, before I went to
>>> OFED 1.6.
>>>
>>> On 07/14/11 05:55, Jeff Squyres wrote:
>>>
>>>> On Jul 13, 2011, at 7:46 PM, David Warren wrote:
>>>>
>>>>> I finally got access to the systems again (the original ones are part of
>>>>> our real time system). I thought I would try one other test I had set up
>>>>> first. I went to OFED 1.6 and it started running with no errors. It must
>>>>> have been an OFED bug. Now I just have the speed problem. Anyone have a
>>>>> way to make the mixture of mlx4 and QLogic work together without slowing
>>>>> down?
>>>>
>>>> What do you mean by "slowing down"?
>>>
>>> <warren.vcf>
>>
> <warren.vcf>

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/