Thanks Brian, Thanks Michael

I wanted to benchmark the communication throughput and latency of MPI over
multiple gigabit Ethernet controllers.

So here are the results, which I want to share with you all.

I used .....

OpenMPI version 1.0.2a10r9275

Hpcbench

Two Dell Precision 650 workstations.

The Dell Precision 650 workstation has three separate PCI bus segments.

Segment 1 -> PCI Slots 1, 2 -> 32-bit, 33 MHz, shared with the integrated 1394
Segment 2 -> PCI Slots 3, 4 -> 64-bit, 100 MHz, shared with the Gb Ethernet connection
Segment 3 -> PCI Slot 5     -> shared with the integrated Ultra 320 controller

The workstations have an integrated PCI-X 64-bit Intel 10/100/1000 Gigabit Ethernet controller.

I added three D-Link DGE-530T 1000 Mbps Ethernet cards in Slot 2, Slot 4 and Slot 5 respectively.

As I expected, the card in Slot 5 performed better than the cards in the other
slots. Here are the results (first a short note on how the numbers are measured,
then the figures).
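The latency numbers are round-trip times for a blocking MPI_Send/MPI_Recv
ping-pong between the two hosts, and the stream throughput numbers are simply
the total bytes sent one way divided by the elapsed time. A minimal sketch of
the ping-pong pattern (illustrative only, not the actual Hpcbench code; the
message size and iteration count are just examples) looks roughly like this:

  /* pingpong.c -- minimal sketch of a blocking round-trip latency test
   * (illustrative only, not the Hpcbench source).
   * Build:  mpicc -o pingpong pingpong.c
   * Run:    mpirun -np 2 ./pingpong   (one process per host, e.g. via a hostfile)
   */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define MSG_SIZE   40960    /* bytes, same message size as the tests below */
  #define ITERATIONS 10

  int main(int argc, char **argv)
  {
      int rank, i;
      char *buf = malloc(MSG_SIZE);

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      for (i = 0; i < ITERATIONS; i++) {
          if (rank == 0) {
              /* master: time one send plus the matching receive (one round trip) */
              double t0 = MPI_Wtime();
              MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              printf("%2d  %12.3f usec\n", i + 1, (MPI_Wtime() - t0) * 1.0e6);
          } else if (rank == 1) {
              /* slave: echo the message straight back */
              MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }

      MPI_Finalize();
      free(buf);
      return 0;
  }
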

(Using Slot2)
-------------
# MPI communication latency (roundtrip time) test -- Wed Mar 15 09:19:10 2006
# Hosts: DELL <----> DELL2
# Blocking Communication (MPI_Send/MPI_Recv)
# Message size (Bytes) : 40960
# Iteration: 7
# Test time (Seconds): 0.20

#             RTT-time
#           Microseconds
1            25953.565
2            25569.439
3            22392.000
4            20876.578
5            21327.121
6            19597.156
7            21264.008
8            24109.568
9            23877.859
10           24064.575

# MPI RTT min/avg/max = 19597.156/22903.187/25953.565 usec

----------------------------------------------------------

# MPI communication test -- Wed Mar 15 10:16:22 2006
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: DELL <----> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 524288000
# Message size (Bytes): 104857600
# Iteration : 5
# Test time: 5.000000
# Test repetition: 10
#
#      Overall    Master-node  M-process  M-process   Slave-node   S-process  S-process
#    Throughput  Elapsed-time  User-mode   Sys-mode  Elapsed-time  User-mode   Sys-mode
#        Mbps        Seconds     Seconds    Seconds     Seconds     Seconds     Seconds
1     521.9423         8.04        1.42       6.62        8.04        0.93        7.10
2     551.5377         7.60        1.20       6.41        7.60        0.77        6.87
3     552.5600         7.59        1.27       6.32        7.59        0.82        6.81
4     552.6328         7.59        1.28       6.31        7.59        0.80        6.83
5     552.6334         7.59        1.24       6.35        7.59        0.86        6.77
6     552.7048         7.59        1.26       6.33        7.59        0.77        6.86
7     563.6736         7.44        1.22       6.22        7.44        0.78        6.70
8     552.2710         7.59        1.22       6.37        7.59        0.83        6.80
9     520.9938         8.05        1.37       6.68        8.05        0.93        7.16
10    535.0131         7.84        1.36       6.48        7.84        0.84        7.04

======================================================================================

(Using Slot3)
-------------
# MPI communication latency (roundtrip time) test -- Thu Mar 16 10:15:58 2006
# Hosts: DELL <----> DELL2
# Blocking Communication (MPI_Send/MPI_Recv)
# Message size (Bytes) : 40960
# Iteration: 10
# Test time (Seconds): 0.20

#             RTT-time
#           Microseconds
1            20094.204
2            14773.512
3            14846.015
4            17756.820
5            18419.290
6            23394.799
7            21840.596
8            17727.494
9            21822.095
10           17659.688

# MPI RTT min/avg/max = 14773.512/18833.451/23394.799 usec

----------------------------------------------------------

# MPI communication test -- Wed Mar 15 09:17:54 2006
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: DELL <----> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 524288000
# Message size (Bytes): 104857600
# Iteration : 5
# Test time: 5.000000
# Test repetition: 10
#
#      Overall    Master-node  M-process  M-process   Slave-node   S-process  S-process
#    Throughput  Elapsed-time  User-mode   Sys-mode  Elapsed-time  User-mode   Sys-mode
#        Mbps        Seconds     Seconds    Seconds     Seconds     Seconds     Seconds
1     794.9650         5.28        1.04       4.24        5.28        0.47        4.81
2     838.1621         5.00        0.91       4.09        5.00        0.39        4.65
3     898.3811         4.67        0.84       3.82        4.67        0.34        4.37
4     798.9575         5.25        1.03       4.22        5.25        0.40        4.89
5     829.7181         5.06        0.94       4.11        5.05        0.40        4.69
6     881.5526         4.76        0.86       3.90        4.76        0.28        4.52
7     827.9215         5.07        0.96       4.11        5.07        0.41        4.70
8     845.6428         4.96        0.87       4.09        4.96        0.38        4.62
9     845.6903         4.96        0.90       4.06        4.96        0.37        4.63
10    827.9424         5.07        0.92       4.15        5.07        0.42        4.69

======================================================================================

(Using Slot5)
-------------
# MPI communication latency (roundtrip time) test -- Wed Mar 15 09:38:55 2006
# Hosts: DELL <----> DELL2
# Blocking Communication (MPI_Send/MPI_Recv)
# Message size (Bytes) : 40960
# Iteration: 5
# Test time (Seconds): 0.20

#             RTT-time
#           Microseconds
1           201938.009
2           176876.974
3           266473.198
4           277261.162
5           235448.408
6           386055.040
7           263659.239
8           191064.596
9           255028.391
10          342683.983

# MPI RTT min/avg/max = 176876.974/259648.900/386055.040 usec

-------------------------------------------------------------

# MPI communication test -- Thu Mar 16 09:40:46 2006
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: DELL <----> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 5242880
# Message size (Bytes): 1048576
# Iteration : 5
# Test time: 0.001000
# Test repetition: 10
#
#      Overall    Master-node  M-process  M-process   Slave-node   S-process  S-process
#    Throughput  Elapsed-time  User-mode   Sys-mode  Elapsed-time  User-mode   Sys-mode
#        Mbps        Seconds     Seconds    Seconds     Seconds     Seconds     Seconds
1     955.8585         0.04        0.01       0.03        0.04        0.01        0.03
2     964.4314         0.04        0.01       0.03        0.04        0.03        0.06
3     963.8343         0.04        0.01       0.03        0.04        0.02        0.07
4     963.5862         0.04        0.02       0.06        0.04        0.01        0.07
5     965.3840         0.04        0.01       0.04        0.04        0.01        0.04
6     964.5371         0.04        0.01       0.04        0.04        0.01        0.04
7     963.1009         0.04        0.01       0.03        0.04        0.01        0.03
8     963.6126         0.04        0.01       0.04        0.04        0.01        0.03
9     963.8554         0.04        0.01       0.03        0.04        0.00        0.04
10    963.7445         0.04        0.01       0.03        0.04        0.01        0.04

======================================================================================

(Using Onboard)
---------------
# MPI communication latency (roundtrip time) test -- Wed Mar 15 09:38:25 2006
# Hosts: DELL <----> DELL2
# Blocking Communication (MPI_Send/MPI_Recv)
# Message size (Bytes) : 40960
# Iteration: 200
# Test time (Seconds): 0.20

#             RTT-time
#           Microseconds
1              999.186
2             1000.586
3              997.865
4             1000.780
5             1001.199
6             1004.665
7             1003.225
8             1004.366
9             1004.120
10            1003.854

# MPI RTT min/avg/max = 997.865/1001.985/1004.665 usec

------------------------------------------------------

# MPI communication test -- Wed Mar 15 09:11:18 2006
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: DELL <----> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 524288000
# Message size (Bytes): 104857600
# Iteration : 5
# Test time: 5.000000
# Test repetition: 10
#
#      Overall    Master-node  M-process  M-process   Slave-node   S-process  S-process
#    Throughput  Elapsed-time  User-mode   Sys-mode  Elapsed-time  User-mode   Sys-mode
#        Mbps        Seconds     Seconds    Seconds     Seconds     Seconds     Seconds
1     941.0156         4.46        0.93       3.53        4.46        0.48        3.98
2     941.1148         4.46        0.99       3.47        4.46        0.46        4.03
3     941.1063         4.46        1.05       3.41        4.46        0.45        4.05
4     941.0544         4.46        1.00       3.45        4.46        0.50        4.00
5     941.1083         4.46        1.01       3.44        4.46        0.47        4.03
6     941.1070         4.46        0.93       3.52        4.46        0.45        4.05
7     941.1078         4.46        0.99       3.46        4.46        0.50        3.99
8     941.0721         4.46        0.98       3.48        4.46        0.43        4.06
9     941.1091         4.46        1.01       3.44        4.46        0.49        4.01
10    941.1093         4.46        0.97       3.49        4.46        0.45        4.04

======================================================================================

The D-Link cards were giving poor latency, so I downloaded updated driver version 8.31 (dated 18 Jan 2006)
from www.skd.de and tuned some parameters as follows:

The sk98lin driver (for the chipset used in the D-Link DGE-530T) supports large
frames (also called jumbo frames). Using jumbo frames can improve throughput
tremendously when transferring large amounts of data. To enable large frames,
the MTU (maximum transmission unit) size of the interface has to be set to a
high value. The default MTU size is 1500 and can be changed up to 9000 (bytes).
Setting the MTU size can be done when assigning the IP address to the interface,
or later by using the ifconfig(8) command with the mtu parameter. If, for
instance, eth0 needs an IP address and a large-frame MTU size, the following
command might be used:

  ifconfig eth0 mtu 9000
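
As a sanity check, the MTU can be read back on each host, either with ifconfig
or with a small program along these lines (just an illustrative sketch, assuming
Linux and an interface name such as eth0):

  /* mtucheck.c -- small sketch (not part of Hpcbench) that reads an
   * interface's MTU back via the SIOCGIFMTU ioctl, to confirm that
   * jumbo frames are active. Interface name defaults to eth0. */
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <sys/socket.h>
  #include <net/if.h>

  int main(int argc, char **argv)
  {
      const char *ifname = (argc > 1) ? argv[1] : "eth0";
      struct ifreq ifr;
      int fd = socket(AF_INET, SOCK_DGRAM, 0);

      memset(&ifr, 0, sizeof(ifr));
      strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

      if (fd < 0 || ioctl(fd, SIOCGIFMTU, &ifr) < 0) {
          perror("SIOCGIFMTU");
          return 1;
      }
      printf("%s MTU = %d\n", ifname, ifr.ifr_mtu);  /* expect 9000 for jumbo frames */
      close(fd);
      return 0;
  }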

I also added the following line to /etc/modprobe.conf:
options sk98lin LowLatency=On

With the above changes, I observed a huge improvement in the performance of the
card in Slot 5 (but not of the cards in Slots 2 and 3; I can't figure out why).

Here are the improved latency figures for Slot 5:

# MPI communication latency (roundtrip time) test -- Thu Mar 16 10:12:05 2006
# Hosts: DELL <----> DELL2
# Blocking Communication (MPI_Send/MPI_Recv)
# Message size (Bytes) : 40960
# Iteration: 227
# Test time (Seconds): 0.20

#             RTT-time
#           Microseconds
1              882.409
2              880.656
3              881.314
4              880.067
5              879.532
6              878.070
7              879.520
8              878.035
9              881.300
10             878.349

# MPI RTT min/avg/max = 878.035/879.925/882.409 usec

====================================================

I next tested OB1 with different configurations and obtained bandwidth as high as 1.8 Gbps.

Here are the figures:

(Using Slot5 & Onboard)
-----------------------
# MPI communication test -- Thu Mar 16 06:14:27 2006
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: DELL <----> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 5242880
# Message size (Bytes): 1048576
# Iteration : 5
# Test time: 0.000100
# Test repetition: 10
#
#      Overall    Master-node  M-process  M-process   Slave-node   S-process  S-process
#    Throughput  Elapsed-time  User-mode   Sys-mode  Elapsed-time  User-mode   Sys-mode
#        Mbps        Seconds     Seconds    Seconds     Seconds     Seconds     Seconds
1    1625.8958         0.03        0.00       0.02        0.03        0.00        0.02
2    1646.6386         0.03        0.01       0.02        0.03        0.00        0.02
3    1648.6447         0.03        0.00       0.02        0.03        0.00        0.02
4    1647.7336         0.03        0.00       0.02        0.03        0.00        0.02
5    1640.5118         0.03        0.01       0.02        0.03        0.00        0.03
6    1625.1298         0.03        0.00       0.02        0.03        0.00        0.02
7    1648.1195         0.03        0.01       0.02        0.03        0.00        0.02
8    1647.6102         0.03        0.00       0.02        0.03        0.00        0.02
9    1647.9960         0.03        0.00       0.02        0.03        0.00        0.02
10   1648.1813         0.03        0.01       0.02        0.03        0.00        0.02

# MPI communication test -- Thu Mar 16 09:45:13 2006
# Test mode: Exponential stream (unidirectional) test
# Hosts: DELL <----> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
#
#   Message    Overall             Master-node  M-process  M-process   Slave-node   S-process  S-process
#     Size   Throughput Iteration Elapsed-time  User-mode   Sys-mode  Elapsed-time  User-mode   Sys-mode
#    Bytes       Mbps                 Seconds     Seconds    Seconds     Seconds     Seconds     Seconds
         1      0.0608       324        0.04        0.02       0.06        0.04        0.01        0.03
         2      0.3078         5        0.00        0.00       0.00        0.00        0.00        0.00
         4      1.0505         8        0.00        0.00       0.00        0.00        0.00        0.00
         8      0.0244        15        0.04        0.01       0.03        0.04        0.01        0.03
        16      3.6251         5        0.00        0.00       0.00        0.00        0.00        0.00
        32      0.0831        13        0.04        0.01       0.03        0.04        0.01        0.03
        64     14.1421         5        0.00        0.00       0.00        0.00        0.00        0.00
       128      0.3313        13        0.04        0.01       0.03        0.04        0.01        0.03
       256     57.8961         5        0.00        0.00       0.00        0.00        0.00        0.00
       512    114.2923        14        0.00        0.00       0.00        0.00        0.00        0.00
      1024    281.8572        13        0.00        0.00       0.00        0.00        0.00        0.00
      2048    442.8727        17        0.00        0.00       0.00        0.00        0.00        0.00
      4096    666.7065        13        0.00        0.00       0.00        0.00        0.00        0.00
      8192    857.6743        10        0.00        0.00       0.00        0.00        0.00        0.00
     16384   1050.1757         6        0.00        0.00       0.00        0.00        0.00        0.00
     32768   1016.0091         5        0.00        0.00       0.00        0.00        0.00        0.00
     65536    747.5140         5        0.00        0.00       0.00        0.00        0.00        0.00
    131072   1131.1883         5        0.00        0.00       0.00        0.00        0.00        0.00
    262144     50.1699         5        0.21        0.05       0.16        0.21        0.05        0.16
    524288   1445.7282         5        0.01        0.00       0.01        0.01        0.00        0.01
   1048576   1620.4892         5        0.03        0.00       0.02        0.03        0.00        0.02
   2097152   1739.4759         5        0.05        0.01       0.04        0.05        0.00        0.05
   4194304   1809.7050         5        0.09        0.01       0.08        0.09        0.00        0.09
   8388608   1843.5496         5        0.18        0.02       0.16        0.18        0.01        0.17
  16777216   1867.9856         5        0.36        0.04       0.32        0.36        0.01        0.35
  33554432   1872.8597         5        0.72        0.10       0.65        0.72        0.03        0.72


I ran the same tests with TEG and found that it underperforms considerably compared to OB1.

Here are the results of the tests using TEG:

(Using Slot5 & Onboard)
-----------------------
# MPI communication test -- Thu Mar 16 10:30:54 2006
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: DELL <----> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 5242880
# Message size (Bytes): 1048576
# Iteration : 5
# Test time: 0.001000
# Test repetition: 10
#
#      Overall    Master-node  M-process  M-process   Slave-node   S-process  S-process
#    Throughput  Elapsed-time  User-mode   Sys-mode  Elapsed-time  User-mode   Sys-mode
#        Mbps        Seconds     Seconds    Seconds     Seconds     Seconds     Seconds
1     641.7744         0.07        0.01       0.05        0.06        0.01        0.06
2     139.9301         0.30        0.07       0.23        0.30        0.07        0.23
3     701.6473         0.06        0.01       0.05        0.06        0.01        0.05
4     697.3198         0.06        0.01       0.05        0.06        0.01        0.05
5     703.8848         0.06        0.01       0.05        0.06        0.00        0.05
6     699.9834         0.06        0.02       0.04        0.06        0.00        0.06
7    1046.7493         0.04        0.00       0.04        0.04        0.00        0.04
8     699.8330         0.06        0.01       0.05        0.06        0.01        0.05
9     699.7746         0.06        0.01       0.05        0.06        0.01        0.05
10    678.8552         0.06        0.01       0.05        0.06        0.01        0.05


Thanks
Jayabrata