Thanks Brian, thanks Michael.
I wanted to benchmark the communication throughput and latency using multiple gigabit Ethernet controllers.
Here are the results, which I want to share with you all.
I used:
OpenMPI version 1.0.2a10r9275
Hpcbench
Two Dell Precision 650 workstations.
The Dell Precision 650 workstation has three separate PCI bus segments.
Segment 1 -> PCI Slots 1, 2 -> 32-bit, 33 MHz, shared with the integrated 1394 controller
Segment 2 -> PCI Slots 3, 4 -> 64-bit, 100 MHz, shared with the Gb Ethernet connection
Segment 3 -> PCI Slot 5 -> shared with the integrated Ultra 320 controller
The workstation has an integrated PCI-X 64-bit Intel 10/100/1000 Gigabit Ethernet controller.
I added three D-Link DGE-530T 1000 Mbps Ethernet cards, in Slot 2, Slot 4 and Slot 5 respectively.
As I expected, the card in Slot 5 performed better than the cards in the other slots. Here
are the results.
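For reference, the roundtrip (RTT) figures below come from Hpcbench's blocking MPI_Send/MPI_Recv latency test. A minimal ping-pong sketch of what such a measurement looks like is given here; this is not Hpcbench's actual code, and the message size, iteration count and output are just placeholders:

/* Minimal MPI ping-pong RTT sketch (illustrative only, not Hpcbench).
 * Rank 0 sends a buffer to rank 1 and waits for it to come back;
 * the averaged roundtrip time is printed in microseconds. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE  40960   /* bytes, same message size as in the tests below */
#define NITER     100     /* assumed iteration count */

int main(int argc, char **argv)
{
    int rank, i;
    double t0, t1;
    char *buf = malloc(MSG_SIZE);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < NITER; i++) {
        if (rank == 0) {            /* send, then wait for the echo */
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {     /* echo the message back */
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg RTT = %.3f usec\n", (t1 - t0) * 1e6 / NITER);

    free(buf);
    MPI_Finalize();
    return 0;
}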
(Using Slot2)
-------------
# MPI communication latency (roundtrip time) test -- Wed Mar 15 09:19:10 2006
# Hosts: DELL <----> DELL2
# Blocking Communication (MPI_Send/MPI_Recv)
# Message size (Bytes) : 40960
# Iteration: 7
# Test time (Seconds): 0.20
# RTT-time
# Microseconds
1 25953.565
2 25569.439
3 22392.000
4 20876.578
5 21327.121
6 19597.156
7 21264.008
8 24109.568
9 23877.859
10 24064.575
# MPI RTT min/avg/max = 19597.156/22903.187/25953.565 usec
----------------------------------------------------------
# MPI communication test -- Wed Mar 15 10:16:22 2006
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: DELL <----> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 524288000
# Message size (Bytes): 104857600
# Iteration : 5
# Test time: 5.000000
# Test repetition: 10
#
#    Overall      Master-node   M-process  M-process  Slave-node    S-process  S-process
#    Throughput   Elapsed-time  User-mode  Sys-mode   Elapsed-time  User-mode  Sys-mode
#    Mbps         Seconds       Seconds    Seconds    Seconds       Seconds    Seconds
 1   521.9423     8.04          1.42       6.62       8.04          0.93       7.10
 2   551.5377     7.60          1.20       6.41       7.60          0.77       6.87
 3   552.5600     7.59          1.27       6.32       7.59          0.82       6.81
 4   552.6328     7.59          1.28       6.31       7.59          0.80       6.83
 5   552.6334     7.59          1.24       6.35       7.59          0.86       6.77
 6   552.7048     7.59          1.26       6.33       7.59          0.77       6.86
 7   563.6736     7.44          1.22       6.22       7.44          0.78       6.70
 8   552.2710     7.59          1.22       6.37       7.59          0.83       6.80
 9   520.9938     8.05          1.37       6.68       8.05          0.93       7.16
10   535.0131     7.84          1.36       6.48       7.84          0.84       7.04
======================================================================================
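The throughput figures come from Hpcbench's fixed-size stream (unidirectional) mode, i.e. one rank sends fixed-size messages back to back while the other only receives. Again, just a rough sketch of that pattern and not Hpcbench's internals; message size and repetition count are placeholders:

/* Unidirectional stream throughput sketch (illustrative only).
 * Rank 0 sends NMSG fixed-size messages back to back; rank 1 receives
 * them all and acknowledges the batch once, so the measured time is
 * dominated by the one-way stream. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE  (1 << 20)   /* 1 MB messages, placeholder */
#define NMSG      500         /* placeholder repetition count */

int main(int argc, char **argv)
{
    int rank, i;
    double t0, t1, bits;
    char *buf = malloc(MSG_SIZE);
    char ack = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    if (rank == 0) {
        for (i = 0; i < NMSG; i++)      /* stream the data one way */
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        for (i = 0; i < NMSG; i++)
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        bits = (double)MSG_SIZE * NMSG * 8.0;
        printf("throughput = %.2f Mbps\n", bits / (t1 - t0) / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}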
(Using Slot3)
-------------
# MPI communication latency (roundtrip time) test -- Thu Mar 16 10:15:58 2006
# Hosts: DELL <----> DELL2
# Blocking Communication (MPI_Send/MPI_Recv)
# Message size (Bytes) : 40960
# Iteration: 10
# Test time (Seconds): 0.20
# RTT-time
# Microseconds
1 20094.204
2 14773.512
3 14846.015
4 17756.820
5 18419.290
6 23394.799
7 21840.596
8 17727.494
9 21822.095
10 17659.688
# MPI RTT min/avg/max = 14773.512/18833.451/23394.799 usec
----------------------------------------------------------
# MPI communication test -- Wed Mar 15 09:17:54 2006
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: DELL <----> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 524288000
# Message size (Bytes): 104857600
# Iteration : 5
# Test time: 5.000000
# Test repetition: 10
#
#    Overall      Master-node   M-process  M-process  Slave-node    S-process  S-process
#    Throughput   Elapsed-time  User-mode  Sys-mode   Elapsed-time  User-mode  Sys-mode
#    Mbps         Seconds       Seconds    Seconds    Seconds       Seconds    Seconds
 1   794.9650     5.28          1.04       4.24       5.28          0.47       4.81
 2   838.1621     5.00          0.91       4.09       5.00          0.39       4.65
 3   898.3811     4.67          0.84       3.82       4.67          0.34       4.37
 4   798.9575     5.25          1.03       4.22       5.25          0.40       4.89
 5   829.7181     5.06          0.94       4.11       5.05          0.40       4.69
 6   881.5526     4.76          0.86       3.90       4.76          0.28       4.52
 7   827.9215     5.07          0.96       4.11       5.07          0.41       4.70
 8   845.6428     4.96          0.87       4.09       4.96          0.38       4.62
 9   845.6903     4.96          0.90       4.06       4.96          0.37       4.63
10   827.9424     5.07          0.92       4.15       5.07          0.42       4.69
======================================================================================
(Using Slot5)
-------------
# MPI communication latency (roundtrip time) test -- Wed Mar 15 09:38:55 2006
# Hosts: DELL <----> DELL2
# Blocking Communication (MPI_Send/MPI_Recv)
# Message size (Bytes) : 40960
# Iteration: 5
# Test time (Seconds): 0.20
# RTT-time
# Microseconds
1 201938.009
2 176876.974
3 266473.198
4 277261.162
5 235448.408
6 386055.040
7 263659.239
8 191064.596
9 255028.391
10 342683.983
# MPI RTT min/avg/max = 176876.974/259648.900/386055.040 usec
-------------------------------------------------------------
# MPI communication test -- Thu Mar 16 09:40:46 2006
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: DELL <----> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 5242880
# Message size (Bytes): 1048576
# Iteration : 5
# Test time: 0.001000
# Test repetition: 10
#
#    Overall      Master-node   M-process  M-process  Slave-node    S-process  S-process
#    Throughput   Elapsed-time  User-mode  Sys-mode   Elapsed-time  User-mode  Sys-mode
#    Mbps         Seconds       Seconds    Seconds    Seconds       Seconds    Seconds
 1   955.8585     0.04          0.01       0.03       0.04          0.01       0.03
 2   964.4314     0.04          0.01       0.03       0.04          0.03       0.06
 3   963.8343     0.04          0.01       0.03       0.04          0.02       0.07
 4   963.5862     0.04          0.02       0.06       0.04          0.01       0.07
 5   965.3840     0.04          0.01       0.04       0.04          0.01       0.04
 6   964.5371     0.04          0.01       0.04       0.04          0.01       0.04
 7   963.1009     0.04          0.01       0.03       0.04          0.01       0.03
 8   963.6126     0.04          0.01       0.04       0.04          0.01       0.03
 9   963.8554     0.04          0.01       0.03       0.04          0.00       0.04
10   963.7445     0.04          0.01       0.03       0.04          0.01       0.04
======================================================================================
(Using Onboard)
---------------
# MPI communication latency (roundtrip time) test -- Wed Mar 15 09:38:25 2006
# Hosts: DELL <----> DELL2
# Blocking Communication (MPI_Send/MPI_Recv)
# Message size (Bytes) : 40960
# Iteration: 200
# Test time (Seconds): 0.20
# RTT-time
# Microseconds
1 999.186
2 1000.586
3 997.865
4 1000.780
5 1001.199
6 1004.665
7 1003.225
8 1004.366
9 1004.120
10 1003.854
# MPI RTT min/avg/max = 997.865/1001.985/1004.665 usec
------------------------------------------------------
# MPI communication test -- Wed Mar 15 09:11:18 2006
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: DELL <----> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 524288000
# Message size (Bytes): 104857600
# Iteration : 5
# Test time: 5.000000
# Test repetition: 10
#
#    Overall      Master-node   M-process  M-process  Slave-node    S-process  S-process
#    Throughput   Elapsed-time  User-mode  Sys-mode   Elapsed-time  User-mode  Sys-mode
#    Mbps         Seconds       Seconds    Seconds    Seconds       Seconds    Seconds
 1   941.0156     4.46          0.93       3.53       4.46          0.48       3.98
 2   941.1148     4.46          0.99       3.47       4.46          0.46       4.03
 3   941.1063     4.46          1.05       3.41       4.46          0.45       4.05
 4   941.0544     4.46          1.00       3.45       4.46          0.50       4.00
 5   941.1083     4.46          1.01       3.44       4.46          0.47       4.03
 6   941.1070     4.46          0.93       3.52       4.46          0.45       4.05
 7   941.1078     4.46          0.99       3.46       4.46          0.50       3.99
 8   941.0721     4.46          0.98       3.48       4.46          0.43       4.06
 9   941.1091     4.46          1.01       3.44       4.46          0.49       4.01
10   941.1093     4.46          0.97       3.49       4.46          0.45       4.04
======================================================================================
The D-Link cards were giving poor latency, so I downloaded the updated driver version 8.31 (dated 18 Jan 2006)
from www.skd.de and tuned some parameters as follows:
The sk98lin driver (used by the chipset in the D-Link DGE-530T) supports large frames (also called jumbo frames). Using
jumbo frames can improve throughput tremendously when transferring large amounts of data. To enable large frames, the
MTU (maximum transmission unit) of the interface has to be set to a higher value. The default MTU is 1500 bytes and can
be raised up to 9000 bytes. The MTU can be set when assigning the IP address to the interface, or later with the
ifconfig(8) command and its mtu parameter. For instance, to give eth0 a large-frame MTU, the following command can be used:
ifconfig eth0 mtu 9000
I also added the following line to /etc/modprobe.conf:
options sk98lin LowLatency=On
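Just for reference, the same MTU change can also be made from a program through the SIOCSIFMTU ioctl; a minimal sketch (the interface name "eth0" is a placeholder, and it needs root):

/* Set a large-frame MTU on an interface via the SIOCSIFMTU ioctl,
 * equivalent to "ifconfig eth0 mtu 9000". Must be run as root. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

int main(void)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works for the ioctl */
    if (fd < 0) { perror("socket"); return 1; }

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);  /* placeholder interface name */
    ifr.ifr_mtu = 9000;                           /* jumbo-frame MTU */

    if (ioctl(fd, SIOCSIFMTU, &ifr) < 0) {
        perror("SIOCSIFMTU");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}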
With the above changes, I observed a huge improvement in the performance of the card in Slot 5 (but not of the cards in
Slots 2 and 3; I can't figure out why).
Here are the improved latency figures for Slot 5:
# MPI communication latency (roundtrip time) test -- Thu Mar 16 10:12:05 2006
# Hosts: DELL <----> DELL2
# Blocking Communication (MPI_Send/MPI_Recv)
# Message size (Bytes) : 40960
# Iteration: 227
# Test time (Seconds): 0.20
# RTT-time
# Microseconds
1 882.409
2 880.656
3 881.314
4 880.067
5 879.532
6 878.070
7 879.520
8 878.035
9 881.300
10 878.349
# MPI RTT min/avg/max = 878.035/879.925/882.409 usec
====================================================
I next tested OB1 with different configurations and obtained bandwidth as high as 1.8 Gbps.
Here are the figures:
(Using Slot5 & Onboard)
-----------------------
# MPI communication test -- Thu Mar 16 06:14:27 2006
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: DELL <----> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 5242880
# Message size (Bytes): 1048576
# Iteration : 5
# Test time: 0.000100
# Test repetition: 10
#
#    Overall      Master-node   M-process  M-process  Slave-node    S-process  S-process
#    Throughput   Elapsed-time  User-mode  Sys-mode   Elapsed-time  User-mode  Sys-mode
#    Mbps         Seconds       Seconds    Seconds    Seconds       Seconds    Seconds
 1   1625.8958    0.03          0.00       0.02       0.03          0.00       0.02
 2   1646.6386    0.03          0.01       0.02       0.03          0.00       0.02
 3   1648.6447    0.03          0.00       0.02       0.03          0.00       0.02
 4   1647.7336    0.03          0.00       0.02       0.03          0.00       0.02
 5   1640.5118    0.03          0.01       0.02       0.03          0.00       0.03
 6   1625.1298    0.03          0.00       0.02       0.03          0.00       0.02
 7   1648.1195    0.03          0.01       0.02       0.03          0.00       0.02
 8   1647.6102    0.03          0.00       0.02       0.03          0.00       0.02
 9   1647.9960    0.03          0.00       0.02       0.03          0.00       0.02
10   1648.1813    0.03          0.01       0.02       0.03          0.00       0.02
# MPI communication test -- Thu Mar 16 09:45:13 2006
# Test mode: Exponential stream (unidirectional) test
# Hosts: DELL <----> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
#
# Message  Overall                 Master-node   M-process  M-process  Slave-node    S-process  S-process
# Size     Throughput   Iteration  Elapsed-time  User-mode  Sys-mode   Elapsed-time  User-mode  Sys-mode
# Bytes    Mbps                    Seconds       Seconds    Seconds    Seconds       Seconds    Seconds
1          0.0608       324        0.04          0.02       0.06       0.04          0.01       0.03
2          0.3078       5          0.00          0.00       0.00       0.00          0.00       0.00
4          1.0505       8          0.00          0.00       0.00       0.00          0.00       0.00
8          0.0244       15         0.04          0.01       0.03       0.04          0.01       0.03
16         3.6251       5          0.00          0.00       0.00       0.00          0.00       0.00
32         0.0831       13         0.04          0.01       0.03       0.04          0.01       0.03
64         14.1421      5          0.00          0.00       0.00       0.00          0.00       0.00
128        0.3313       13         0.04          0.01       0.03       0.04          0.01       0.03
256        57.8961      5          0.00          0.00       0.00       0.00          0.00       0.00
512        114.2923     14         0.00          0.00       0.00       0.00          0.00       0.00
1024       281.8572     13         0.00          0.00       0.00       0.00          0.00       0.00
2048       442.8727     17         0.00          0.00       0.00       0.00          0.00       0.00
4096       666.7065     13         0.00          0.00       0.00       0.00          0.00       0.00
8192       857.6743     10         0.00          0.00       0.00       0.00          0.00       0.00
16384      1050.1757    6          0.00          0.00       0.00       0.00          0.00       0.00
32768      1016.0091    5          0.00          0.00       0.00       0.00          0.00       0.00
65536      747.5140     5          0.00          0.00       0.00       0.00          0.00       0.00
131072     1131.1883    5          0.00          0.00       0.00       0.00          0.00       0.00
262144     50.1699      5          0.21          0.05       0.16       0.21          0.05       0.16
524288     1445.7282    5          0.01          0.00       0.01       0.01          0.00       0.01
1048576    1620.4892    5          0.03          0.00       0.02       0.03          0.00       0.02
2097152    1739.4759    5          0.05          0.01       0.04       0.05          0.00       0.05
4194304    1809.7050    5          0.09          0.01       0.08       0.09          0.00       0.09
8388608    1843.5496    5          0.18          0.02       0.16       0.18          0.01       0.17
16777216   1867.9856    5          0.36          0.04       0.32       0.36          0.01       0.35
33554432   1872.8597    5          0.72          0.10       0.65       0.72          0.03       0.72
I did the same tests with TEG and found that it underperforms considerably compared to OB1.
Here are the results of the test using TEG:
(Using Slot5 & Onboard)
-----------------------
# MPI communication test -- Thu Mar 16 10:30:54 2006
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: DELL <----> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 5242880
# Message size (Bytes): 1048576
# Iteration : 5
# Test time: 0.001000
# Test repetition: 10
#
#    Overall      Master-node   M-process  M-process  Slave-node    S-process  S-process
#    Throughput   Elapsed-time  User-mode  Sys-mode   Elapsed-time  User-mode  Sys-mode
#    Mbps         Seconds       Seconds    Seconds    Seconds       Seconds    Seconds
 1   641.7744     0.07          0.01       0.05       0.06          0.01       0.06
 2   139.9301     0.30          0.07       0.23       0.30          0.07       0.23
 3   701.6473     0.06          0.01       0.05       0.06          0.01       0.05
 4   697.3198     0.06          0.01       0.05       0.06          0.01       0.05
 5   703.8848     0.06          0.01       0.05       0.06          0.00       0.05
 6   699.9834     0.06          0.02       0.04       0.06          0.00       0.06
 7   1046.7493    0.04          0.00       0.04       0.04          0.00       0.04
 8   699.8330     0.06          0.01       0.05       0.06          0.01       0.05
 9   699.7746     0.06          0.01       0.05       0.06          0.01       0.05
10   678.8552     0.06          0.01       0.05       0.06          0.01       0.05
Thanks
Jayabrata