[Lustre-discuss] Fwd: Lustre performance issue (obdfilter_survey)

2011-07-06 Thread lior amar
Hi,

I am installing a Lustre system and wanted to measure OSS
performance.
I used obdfilter_survey and got very low performance at low
thread counts when using the case=network option.

System Configuration:
* Lustre 1.8.6-wc (compiled from the whamcloud git)
* CentOS 5.6
* InfiniBand (Mellanox cards), OpenIB stack from CentOS 5.6
* OSS - two quad-core E5620 CPUs
* OSS - 48GB memory
* LSI 2965 RAID card with 18 disks in RAID 6 (16 data + 2 parity). Raw
performance is good both when testing the block device directly and over a
file system with Bonnie++

* OSS uses ext4; mkfs parameters were set to reflect the RAID stripe
size (-E stride=...)
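As a sketch of how those mkfs extended options are typically derived for a
RAID-6 array with 16 data disks (the 64 KiB chunk size below is an assumption;
the thread never states the actual RAID chunk size):

```shell
# Hypothetical stride/stripe-width calculation for ext4 on RAID 6.
# CHUNK_KB is assumed, not taken from the thread.
CHUNK_KB=64            # RAID chunk size per disk (assumed)
DATA_DISKS=16          # 18 disks in RAID 6 = 16 data + 2 parity
BLOCK_KB=4             # ext4 block size
STRIDE=$(( CHUNK_KB / BLOCK_KB ))
STRIPE_WIDTH=$(( STRIDE * DATA_DISKS ))
echo "-E stride=$STRIDE,stripe-width=$STRIPE_WIDTH"
# prints: -E stride=16,stripe-width=256
```

With a different chunk size the numbers change proportionally; the point is
only that stride is chunk/block and stripe-width is stride times the data-disk
count.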

The performance test I did:

1) obdfilter_survey case=disk -
   OSS performance is OK (similar to raw disk performance) -
   with 1 thread and one object I get 966MB/sec

2) obdfilter_survey case=network -
OSS performance is bad at low thread counts and improves as the
number of threads increases.
With 1 thread and one object I get 88MB/sec

3) obdfilter_survey case=netdisk -- same results as the network case

4) ost_survey also shows low performance:
   Read = 156 MB/sec, Write = ~350MB/sec

5) Running lnet_selftest I get much higher numbers
 Numbers obtained with concurrency = 1

 [LNet Rates of servers]
 [R] Avg: 3556 RPC/s Min: 3556 RPC/s Max: 3556 RPC/s
 [W] Avg: 4742 RPC/s Min: 4742 RPC/s Max: 4742 RPC/s
 [LNet Bandwidth of servers]
 [R] Avg: 1185.72  MB/s  Min: 1185.72  MB/s  Max: 1185.72  MB/s
 [W] Avg: 1185.72  MB/s  Min: 1185.72  MB/s  Max: 1185.72  MB/s
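A quick arithmetic check on the write figures above (bandwidth divided by RPC
rate) yields the implied payload per RPC:

```shell
# Sanity arithmetic on the lnet_selftest write figures quoted above.
awk 'BEGIN {
  mb_s = 1185.72; rpc_s = 4742            # [W] bandwidth and RPC rate
  printf "implied payload per write RPC: %.0f KiB\n", mb_s * 1024 / rpc_s
}'
# prints: implied payload per write RPC: 256 KiB
```

That works out to roughly 256 KiB per RPC; whether this matches the configured
lnet_selftest transfer size is worth checking.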




Any ideas why a single thread obtains 88MB/sec over the network while the
same test run locally obtains 966MB/sec??
What else should I test/read/try ??

Thanks

Below are the actual numbers:

= obdfilter_survey case = disk ==
Wed Jul  6 13:24:57 IDT 2011 Obdfilter-survey for case=disk from oss1
ost  1 sz 16777216K rsz 1024K obj  1 thr  1 write  966.90 [ 644.40,1030.02] rewrite 1286.23 [1300.78,1315.77] read  8474.33 SHORT
ost  1 sz 16777216K rsz 1024K obj  1 thr  2 write 1577.95 [1533.57,1681.43] rewrite 1548.29 [1244.83,1718.42] read 11003.26 SHORT
ost  1 sz 16777216K rsz 1024K obj  1 thr  4 write 1465.68 [1354.73,1600.50] rewrite 1484.98 [1271.54,1584.52] read 16464.13 SHORT
ost  1 sz 16777216K rsz 1024K obj  1 thr  8 write 1267.39 [ 797.25,1476.48] rewrite 1350.28 [1283.80,1387.70] read 15353.69 SHORT
ost  1 sz 16777216K rsz 1024K obj  1 thr 16 write 1295.35 [1266.82,1408.70] rewrite 1332.59 [1315.61,1429.66] read 15001.67 SHORT
ost  1 sz 16777216K rsz 1024K obj  2 thr  2 write 1467.80 [1472.62,1691.42] rewrite 1218.88 [ 821.23,1338.74] read 13538.41 SHORT
ost  1 sz 16777216K rsz 1024K obj  2 thr  4 write 1561.09 [1521.57,1682.75] rewrite 1183.31 [ 959.10,1372.52] read 15955.31 SHORT
ost  1 sz 16777216K rsz 1024K obj  2 thr  8 write 1498.74 [1543.58,1704.41] rewrite 1116.19 [1001.06,1163.91] read 15523.22 SHORT
ost  1 sz 16777216K rsz 1024K obj  2 thr 16 write 1462.54 [ 985.08,1615.48] rewrite 1244.29 [1100.97,1444.80] read 15174.56 SHORT
ost  1 sz 16777216K rsz 1024K obj  4 thr  4 write 1483.42 [1497.88,1648.45] rewrite 1042.92 [ 801.25,1192.69] read 15997.30 SHORT
ost  1 sz 16777216K rsz 1024K obj  4 thr  8 write 1494.63 [1458.85,1624.13] rewrite 1041.81 [ 806.25,1183.89] read 15450.18 SHORT
ost  1 sz 16777216K rsz 1024K obj  4 thr 16 write 1469.96 [1450.65,1647.45] rewrite 1027.06 [ 645.50,1215.86] read 15543.46 SHORT
ost  1 sz 16777216K rsz 1024K obj  8 thr  8 write 1417.93 [1250.85,1520.58] rewrite 1007.45 [ 905.15,1130.82] read 15789.66 SHORT
ost  1 sz 16777216K rsz 1024K obj  8 thr 16 write 1324.28 [ 951.87,1518.26] rewrite  986.48 [ 855.21,1079.99] read 15510.70 SHORT
ost  1 sz 16777216K rsz 1024K obj 16 thr 16 write 1237.22 [ 989.07,1345.17] rewrite  915.56 [ 749.08,1033.03] read 15415.75 SHORT

==

== obdfilter_survey case = network 
Wed Jul  6 16:29:38 IDT 2011 Obdfilter-survey for case=network from oss6
ost  1 sz 16777216K rsz 1024K obj  1 thr  1 write   87.99 [  86.92,  88.92] rewrite   87.98 [  86.83,  88.92] read   88.09 [  86.92,  88.92]
ost  1 sz 16777216K rsz 1024K obj  1 thr  2 write  175.76 [ 173.84, 176.83] rewrite  175.75 [ 174.84, 176.83] read  172.76 [ 171.67, 174.84]
ost  1 sz 16777216K rsz 1024K obj  1 thr  4 write  343.13 [ 327.69, 347.67] rewrite  344.64 [ 342.34, 347.67] read  331.20 [ 327.69, 337.77]
ost  1 sz 16777216K rsz 1024K obj  1 thr  8 write  638.44 [ 638.10, 653.39] rewrite  639.07 [ 627.75, 654.74] read  605.36 [ 598.84, 626.71]
ost  1 sz 16777216K rsz 1024K obj  1 thr 16 write 1257.67 [1216.88,1424.42] rewrite 1231.61 [1200.67,1316.77] read 1122.70 [1095.04,1187.64]
ost  1 sz 16777216K rsz 1024K 
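The case=network numbers above scale almost linearly with thread count, the
classic signature of a latency-bound pipeline in which each thread keeps a
single 1 MiB RPC in flight. A rough model using the survey's own figures:

```shell
# Latency-bound model: the single-thread rate implies a per-RPC round-trip
# time; N threads should then deliver ~N times the single-thread bandwidth
# until the link saturates. Figures are taken from the survey output above.
awk 'BEGIN {
  n = split("1 2 4 8 16", thr, " ")
  split("87.99 175.76 343.13 638.44 1257.67", mbs, " ")
  rtt = 1.0 / mbs[1]   # seconds per 1 MiB round trip (~11.4 ms)
  for (i = 1; i <= n; i++)
    printf "thr=%2d measured=%7.2f MB/s predicted=%7.2f MB/s\n", thr[i], mbs[i], thr[i] / rtt
}'
```

The model tracks the measurements closely up to 8 threads and overshoots at 16
(predicted ~1408 MB/s vs measured 1258 MB/s), where the link begins to
saturate.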

Re: [Lustre-discuss] Fwd: Lustre performance issue (obdfilter_survey)

2011-07-06 Thread Cliff White
The case=network part of obdfilter_survey has really been replaced by
lnet_selftest; I don't think it's been maintained in a while.

It would be best to repeat the network-only test with lnet_selftest. This is
likely an issue with the script.
cliffw


Re: [Lustre-discuss] Fwd: Lustre performance issue (obdfilter_survey)

2011-07-06 Thread Chris Horn
FYI, there is some work being done to clean up obdfilter-survey. See
https://bugzilla.lustre.org/show_bug.cgi?id=24490
If it was a script issue, you might try the patch from that bug and see if
you can still reproduce the problem.
Chris Horn


Re: [Lustre-discuss] Fwd: Lustre performance issue (obdfilter_survey)

2011-07-06 Thread lior amar
Hi,

First, thanks for your quick reply.


I used lnet_selftest and got reasonable results:
--
Numbers obtained with concurrency = 1

 [LNet Rates of servers]
 [R] Avg: 3556 RPC/s Min: 3556 RPC/s Max: 3556 RPC/s
 [W] Avg: 4742 RPC/s Min: 4742 RPC/s Max: 4742 RPC/s
 [LNet Bandwidth of servers]
 [R] Avg: 1185.72  MB/s  Min: 1185.72  MB/s  Max: 1185.72  MB/s
 [W] Avg: 1185.72  MB/s  Min: 1185.72  MB/s  Max: 1185.72  MB/s
---

The question is: what is the meaning of the concurrency=1 flag? Does it mean
a single thread at the client, or a single thread per core?

My problem is with case=netdisk, which gives me low performance for a single
thread (the dd case), as does ost_survey.

Is case=netdisk a valid test?

I am trying to isolate the problem, and case=netdisk allows me to avoid
accessing the MDS (right?)

Any ideas?


--oo--o(:-:)o--oo
Lior Amar, Ph.D.
Cluster Logic Ltd -- The Art of HPC
www.clusterlogic.net
--
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss