Re: [Lustre-discuss] Lustre Issue

2011-01-27 Thread Nauman Yousuf
Hey, on the Lustre client I got this error:


LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 17 previous
similar messages
LustreError: 2208:0:(ldlm_request.c:746:ldlm_cli_cancel()) client/server
(nid 10.65.200.30@tcp) out of sync -- not fatal, flags 332c90
LustreError: 2208:0:(ldlm_request.c:746:ldlm_cli_cancel()) Skipped 1
previous similar message
LustreError: 2208:0:(file.c:754:ll_extent_lock_callback()) ldlm_cli_cancel
failed: 116
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) @@@ type ==
PTL_RPC_MSG_ERR, err == -2  req@c229bc00 x1219552/t0
o4-ost2_UUID@cyclops_UUID:28 lens 328/288 ref 2 fl Rpc:R/0/0 rc 0/-2
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 40 previous
similar messages
LustreError: 2188:0:(ldlm_request.c:746:ldlm_cli_cancel()) client/server
(nid 10.65.200.30@tcp) out of sync -- not fatal, flags 332c90
LustreError: 2188:0:(file.c:754:ll_extent_lock_callback()) ldlm_cli_cancel
failed: 116
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) @@@ type ==
PTL_RPC_MSG_ERR, err == -2  req@c22a3a00 x1219666/t0
o4-ost2_UUID@cyclops_UUID:28 lens 328/288 ref 2 fl Rpc:R/0/0 rc 0/-2
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 88 previous
similar messages
LustreError: 2231:0:(ldlm_request.c:746:ldlm_cli_cancel()) client/server
(nid 10.65.200.30@tcp) out of sync -- not fatal, flags 332c90
LustreError: 2231:0:(ldlm_request.c:746:ldlm_cli_cancel()) Skipped 2
previous similar messages
LustreError: 2231:0:(file.c:754:ll_extent_lock_callback()) ldlm_cli_cancel
failed: 116
LustreError: 2231:0:(file.c:754:ll_extent_lock_callback()) Skipped 2
previous similar messages


On Wed, Jan 26, 2011 at 11:53 PM, Brian J. Murrell <br...@whamcloud.com> wrote:

 On Wed, 2011-01-26 at 22:24 +0500, Nauman Yousuf wrote:
 

 Your logs don't have timestamps so it's difficult to correlate events,
 but did you notice that right before you started getting these messages:


  Lustre: 1588:0:(lustre_fsfilt.h:283:fsfilt_setattr()) mds01: slow setattr
 31s
  Lustre: 1595:0:(lustre_fsfilt.h:182:fsfilt_start_log()) mds01: slow
 journal start 33s
  Lustre: 1720:0:(lustre_fsfilt.h:182:fsfilt_start_log()) mds01: slow
 journal start 32s
  Lustre: 1602:0:(lustre_fsfilt.h:182:fsfilt_start_log()) mds01: slow
 journal start 38s

 You got this:

  drbd0: Resync started as SyncSource (need to sync 634747844 KB [158686961
 bits set]).
  drbd0: Resync done (total 97313 sec; paused 0 sec; 6520 K/sec)
  drbd0: drbd0_worker [1126]: cstate SyncSource -- Connected

 I'm no DRBD expert by a long shot, but that looks to me like you had a
 disk in the MDS re-syncing to its DRBD partner.  If that disk is the
 MDT, a resync is, of course, going to slow down the MDT.
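
 If it helps, a quick way to confirm whether a resync was in progress at
 the time (a sketch, assuming DRBD 8.x; the resource name is only an
 example):

   # connection/sync state; during a resync this shows SyncSource or
   # SyncTarget plus progress
   cat /proc/drbd
   # or per resource:
   drbdadm cstate r0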

 The problem here is that you are probably tuned (i.e. the number of
 threads) to expect full performance out of the hardware, and when it's
 under a resync load, it won't deliver it.

 Unfortunately, at this point Lustre will push its thread count higher if
 it can determine it can get more performance out of a target, but it
 won't back off when things slow down (e.g. because the disk is being
 commandeered for housekeeping tasks such as a resync or RAID rebuild),
 so you need to cap your thread count at what performs well while your
 disks are under a resync load.
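
 A minimal sketch of what that might look like on the servers (these are
 the Lustre 1.8 module options; the values are placeholders, not
 recommendations; size them to what the storage sustains during a resync):

   # in /etc/modprobe.conf; takes effect the next time the modules load
   options mds mds_num_threads=32
   options ost oss_num_threads=64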

 Please see the operations manual for details on tuning thread counts for
 performance.

 Cheers,
 b.


 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss


___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] MDT raid parameters, multiple MGSes

2011-01-27 Thread Andreas Dilger
On 2011-01-25, at 17:05, Jeremy Filizetti wrote:
 On Fri, Jan 21, 2011 at 1:02 PM, Andreas Dilger adil...@whamcloud.com wrote:
 While this runs, it is definitely not correct.  The problem is that the 
 client will only connect to a single MGS for configuration updates (in 
 particular, the MGS for the last filesystem that was mounted).  If there is 
 a configuration change (e.g. lctl conf_param, or adding a new OST) on one of 
 the other filesystems, then the client will not be notified of this change 
 because it is no longer connected to the MGS for that filesystem.
 
  
 We use Lustre in a WAN environment, and each geographic location has its own
 Lustre file system with its own MGS.  While I don't add storage frequently,
 I've never seen an issue with this.
  
 Just to be sure, I mounted a test file system, followed by another file
 system, added an OST to the test file system, and the client was notified
 by the MGS.  Looking at lctl dl, the client shows a device for each MGC and
 I see connections in the peers list.  I didn't test any conf_param, but at
 least the connections look fine, including the output from lctl dk.
  
 Is there something I'm missing here?  I know each OSS shares a single MGC 
 between all the OBDs so that you can really only mount one file system at a 
 time in Lustre.  Is that what you are referring to?

Depending on how you ran the test, it is entirely possible that the client
hadn't been evicted from the first MGS yet, and it accepted the message from
that MGS even though it had been evicted.  However, if you check the
connection state on the client (e.g. lctl get_param mgc.*.import), it is only
possible for the client to have a single MGC today, and that MGC can only
have a connection to a single MGS at a time.
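
For example, the check described above (a sketch; the exact import output
varies by version):

  # list configured devices on the client; there should only be one MGC
  lctl dl | grep -i mgc
  # show which MGS the single MGC import is currently connected to
  lctl get_param mgc.*.import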

Granted, it is possible that someone fixed this when I wasn't paying attention.


Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] llverfs outcome

2011-01-27 Thread Thomas Roth
Hi all,

I have run llverfs (lustre-utils 1.8.4) on an OST partition as
"llverfs -w -v /srv/OST0002".
That went smoothly until all 9759209724 kB were written, terminating with:

write File name: /srv/OST0002/dir00072/file022
write complete

llverfs: writing /srv/OST0002/llverfs.filecount failed :No space left on 
device

My question: what should the result of llverfs be? I haven't found any
documentation on this tool, so I can only suspect that this was a
successful run?

(llverdev terminates with 'write complete' also, no errors indicated - 
good?)

Regards,
Thomas

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre Issue

2011-01-27 Thread Nauman Yousuf
Guys, still having issues; somehow my client and OSS start getting high CPU
load when this happens.

The OSS says:

LustreError: 1538:0:(ldlm_lockd.c:1425:ldlm_cancel_handler()) operation 103
from 12345-10.65.200.37@tcp with bad export cookie 14320354116280279937
LustreError: 1560:0:(ldlm_lockd.c:1425:ldlm_cancel_handler()) operation 103
from 12345-10.65.200.37@tcp with bad export cookie 14320354116280279937
LustreError: 1714:0:(filter_io.c:532:filter_preprw_write()) ost2: trying to
BRW to non-existent file 28017031
LustreError: 1708:0:(filter_io.c:532:filter_preprw_write()) ost2: trying to
BRW to non-existent file 28017031
LustreError: 1717:0:(filter_io.c:532:filter_preprw_write()) ost2: trying to
BRW to non-existent file 28017040
LustreError: 1717:0:(filter_io.c:532:filter_preprw_write()) Skipped 10
previous similar messages
LustreError: 1700:0:(filter_io.c:532:filter_preprw_write()) ost2: trying to
BRW to non-existent file 28017174
LustreError: 1700:0:(filter_io.c:532:filter_preprw_write()) Skipped 5
previous similar messages
LustreError: 1688:0:(filter_io.c:532:filter_preprw_write()) ost2: trying to
BRW to non-existent file 28016970
LustreError: 1688:0:(filter_io.c:532:filter_preprw_write()) Skipped 12
previous similar messages
LustreError: 1697:0:(filter_io.c:532:filter_preprw_write()) ost2: trying to
BRW to non-existent file 28017244
LustreError: 1697:0:(filter_io.c:532:filter_preprw_write()) Skipped 17
previous similar messages
LustreError: 1709:0:(filter_io.c:532:filter_preprw_write()) ost2: trying to
BRW to non-existent file 28017244
LustreError: 1709:0:(filter_io.c:532:filter_preprw_write()) Skipped 48
previous similar messages
drbd1: [ll_ost_io_23/1690] sock_sendmsg time expired, ko = 4294967295
Lustre: 1689:0:(filter_io_26.c:714:filter_commitrw_write()) ost2: slow
direct_io 30s
Lustre: 1689:0:(filter_io_26.c:727:filter_commitrw_write()) ost2: slow
commitrw commit 30s

10.65.200.37 is my lustre client

LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 4 previous
similar messages
LustreError: 2199:0:(file.c:754:ll_extent_lock_callback()) ldlm_cli_cancel
failed: 116
LustreError: 2199:0:(file.c:754:ll_extent_lock_callback()) Skipped 3
previous similar messages
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) @@@ type ==
PTL_RPC_MSG_ERR, err == -2  req@c229d200 x1219484/t0
o4-ost2_UUID@cyclops_UUID:28 lens 328/288 ref 2 fl Rpc:R/0/0 rc 0/-2
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 17 previous
similar messages
LustreError: 2208:0:(ldlm_request.c:746:ldlm_cli_cancel()) client/server
(nid 10.65.200.30@tcp) out of sync -- not fatal, flags 332c90
LustreError: 2208:0:(ldlm_request.c:746:ldlm_cli_cancel()) Skipped 1
previous similar message
LustreError: 2208:0:(file.c:754:ll_extent_lock_callback()) ldlm_cli_cancel
failed: 116
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) @@@ type ==
PTL_RPC_MSG_ERR, err == -2  req@c229bc00 x1219552/t0
o4-ost2_UUID@cyclops_UUID:28 lens 328/288 ref 2 fl Rpc:R/0/0 rc 0/-2
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 40 previous
similar messages
LustreError: 2188:0:(ldlm_request.c:746:ldlm_cli_cancel()) client/server
(nid 10.65.200.30@tcp) out of sync -- not fatal, flags 332c90
LustreError: 2188:0:(file.c:754:ll_extent_lock_callback()) ldlm_cli_cancel
failed: 116
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) @@@ type ==
PTL_RPC_MSG_ERR, err == -2  req@c22a3a00 x1219666/t0
o4-ost2_UUID@cyclops_UUID:28 lens 328/288 ref 2 fl Rpc:R/0/0 rc 0/-2
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 88 previous
similar messages
LustreError: 2231:0:(ldlm_request.c:746:ldlm_cli_cancel()) client/server
(nid 10.65.200.30@tcp) out of sync -- not fatal, flags 332c90
LustreError: 2231:0:(ldlm_request.c:746:ldlm_cli_cancel()) Skipped 2
previous similar messages
LustreError: 2231:0:(file.c:754:ll_extent_lock_callback()) ldlm_cli_cancel
failed: 116
LustreError: 2231:0:(file.c:754:ll_extent_lock_callback()) Skipped 2
previous similar messages

10.65.200.30 is my OSS; both are generating load.




On Thu, Jan 27, 2011 at 3:17 PM, Nauman Yousuf <nauman.you...@gmail.com> wrote:

 hey on lustre client i got this error .


 LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 17
 previous similar messages
 LustreError: 2208:0:(ldlm_request.c:746:ldlm_cli_cancel()) client/server
 (nid 10.65.200.30@tcp) out of sync -- not fatal, flags 332c90
 LustreError: 2208:0:(ldlm_request.c:746:ldlm_cli_cancel()) Skipped 1
 previous similar message
 LustreError: 2208:0:(file.c:754:ll_extent_lock_callback()) ldlm_cli_cancel
 failed: 116
 LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) @@@ type ==
 PTL_RPC_MSG_ERR, err == -2  req@c229bc00 x1219552/t0
 o4-ost2_UUID@cyclops_UUID:28 lens 328/288 ref 2 fl Rpc:R/0/0 rc 0/-2
 LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 40
 previous similar messages
 LustreError: 2188:0:(ldlm_request.c:746:ldlm_cli_cancel()) client/server
 (nid 10.65.200.30@tcp) out of 

[Lustre-discuss] 1GB throughput limit on OST (1.8.5)?

2011-01-27 Thread David Merhar
Our OSS's with 2x1GB NICs (bonded) appear limited to 1GB worth of  
write throughput each.

Our setup:
2 OSS serving 1 OST each
Lustre 1.8.5
RHEL 5.4
New Dell M610 blade servers with plenty of CPU and RAM
All SAN fibre connections are at least 4GB

Some notes:
- A direct write (dd) from a single OSS to the OST gets 4GB, the OSS's  
fibre wire speed.
- A single client will get 2GB of lustre write speed, the client's  
ethernet wire speed.
- We've tried bond mode 6 and 0 on all systems.  With mode 6 we will  
see both NICs on both OSSs receiving data.
- We've tried multiple OSTs per OSS.

But 2 clients writing a file will get 2GB of total bandwidth to the  
filesystems.  We have been unable to isolate any particular resource  
bottleneck.  None of the systems (MDS, OSS, or client) seem to be  
working very hard.

The 1GB-per-OSS threshold is so consistent that it almost appears to be by
design - and hopefully we're missing something obvious.

Any advice?

Thanks.

djm



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] 1GB throughput limit on OST (1.8.5)?

2011-01-27 Thread Balagopal Pillai
I guess you have two gigabit NICs bonded in mode 6 and not two 1GB NICs?
(B = bytes, b = bits.) The max aggregate throughput could be about 200MBps
out of the 2 bonded NICs. I think mode 0 bonding works only with Cisco
EtherChannel or something similar on the switch side. Same with the FC
connection: it's 4Gbps (not 4GBps), or about 400-500MBps max throughput.
Maybe you could also look at the max read and write capabilities of the
RAID controller, rather than just the network.

When testing with dd, some of the data remains as dirty data until it's
flushed to disk. I think the default background ratio is 10% for RHEL5,
which would be sizable if your OSSes have lots of RAM. There is a chance of
the OSS locking up once it hits the dirty_ratio limit, which is 40% by
default. So a more aggressive flush to disk by lowering the
background_ratio, and a bit more headroom before it hits the dirty_ratio,
is generally desirable if your RAID controller can keep up with it.

So with your current setup, I guess you could get a max of 400MBps out of
both OSSs if they both have two 1Gb NICs in them. Maybe if you have one of
the Dell switches with 4 10Gb ports (their PowerConnect 6248), 10Gb NICs
for your OSSes might be a cheaper way to increase the aggregate
performance. I think over 1GBps from a client is possible in cases where
you use InfiniBand and RDMA to deliver data.
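
A sketch of that tuning (values are illustrative only; the right numbers
depend on how much RAM the OSS has and what the RAID controller can absorb):

  # start background writeback earlier and leave more headroom before the
  # blocking dirty_ratio limit is reached
  sysctl -w vm.dirty_background_ratio=5
  sysctl -w vm.dirty_ratio=20
  # add the same settings to /etc/sysctl.conf to persist them across reboots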


David Merhar wrote:
 Our OSS's with 2x1GB NICs (bonded) appear limited to 1GB worth of  
 write throughput each.
 
 Our setup:
 2 OSS serving 1 OST each
 Lustre 1.8.5
 RHEL 5.4
 New Dell M610's blade servers with plenty of CPU and RAM
 All SAN fibre connections are at least 4GB
 
 Some notes:
 - A direct write (dd) from a single OSS to the OST gets 4GB, the OSS's  
 fibre wire speed.
 - A single client will get 2GB of lustre write speed, the client's  
 ethernet wire speed.
 - We've tried bond mode 6 and 0 on all systems.  With mode 6 we will  
 see both NICs on both OSSs receiving data.
 - We've tried multiple OSTs per OSS.
 
 But 2 clients writing a file will get 2GB of total bandwidth to the  
 filesystems.  We have been unable to isolate any particular resource  
 bottleneck.  None of the systems (MDS, OSS, or client) seem to be  
 working very hard.
 
 The 1GB per OSS threshold is so consistent, that it almost appears by  
 design - and hopefully we're missing something obvious.
 
 Any advice?
 
 Thanks.
 
 djm
 
 
 
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] 1GB throughput limit on OST (1.8.5)?

2011-01-27 Thread David Merhar
Sorry - little b all the way around.

We're limited to 1Gb per OST.

djm



On Jan 27, 2011, at 7:48 AM, Balagopal Pillai wrote:

 I guess you have two gigabit nics bonded in mode 6 and not two 1GB  
 nics?
 (B-Bytes, b-bits) The max aggregate throughput could be about 200MBps
 out of the 2 bonded nics. I think the mode 0 bonding works only with
 cisco etherchannel or something similar on the switch side. Same with
 the FC connection, its 4Gbps (not 4GBps) or about 400-500 MBps max
 throughout. Maybe you could also see the max read and write  
 capabilities
 of the raid controller other than just the network. When testing with
 dd, some of the data remains as dirty data till its flushed into the
 disk. I think the default background ratio is 10% for rhel5 which  
 would
 be sizable if your oss have lots of ram. There is chance of lockup of
 the oss once it hits the dirty_ratio limit,which is 40% by default.  
 So a
 bit more aggressive flush to disk by lowering the background_ratio  
 and a
 bit more headroom before it hits the dirty_ratio is generally  
 desirable
 if your raid controller could keep up with it. So with your current
 setup, i guess you could get a max of 400MBps out of both OSS's if  
 they
 both have two 1Gb nics in them. Maybe if you have one of the switches
 from Dell that has 4 10Gb ports in them (their powerconnect 6248),  
 10Gb
 nics for your OSS's might be a cheaper way to increase the aggregate
 performance. I think over 1GBps from a client is possible in cases  
 where
 you use infiniband and rdma to deliver data.


 David Merhar wrote:
 Our OSS's with 2x1GB NICs (bonded) appear limited to 1GB worth of
 write throughput each.

 Our setup:
 2 OSS serving 1 OST each
 Lustre 1.8.5
 RHEL 5.4
 New Dell M610's blade servers with plenty of CPU and RAM
 All SAN fibre connections are at least 4GB

 Some notes:
 - A direct write (dd) from a single OSS to the OST gets 4GB, the  
 OSS's
 fibre wire speed.
 - A single client will get 2GB of lustre write speed, the client's
 ethernet wire speed.
 - We've tried bond mode 6 and 0 on all systems.  With mode 6 we will
 see both NICs on both OSSs receiving data.
 - We've tried multiple OSTs per OSS.

 But 2 clients writing a file will get 2GB of total bandwidth to the
 filesystems.  We have been unable to isolate any particular resource
 bottleneck.  None of the systems (MDS, OSS, or client) seem to be
 working very hard.

 The 1GB per OSS threshold is so consistent, that it almost appears by
 design - and hopefully we're missing something obvious.

 Any advice?

 Thanks.

 djm



 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] 1GB throughput limit on OST (1.8.5)?

2011-01-27 Thread Kevin Van Maren
Normally if you are having a problem with write BW, you need to futz 
with the switch.  If you were having
problems with read BW, you need to futz with the server's config (xmit 
hash policy is the usual culprit).

Are you testing multiple clients to the same server?

Are you using mode 6 because you don't have bonding support in your 
switch?  I normally use 802.3ad mode,
assuming your switch supports link aggregation.


I was bonding 2x1Gb links for Lustre back in 2004.  That was before 
BOND_XMIT_POLICY_LAYER34
was in the kernel, so I had to hack the bond xmit hash (with multiple 
NICs standard, layer2 hashing does not
produce a uniform distribution, and can't work if going through a router).
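
For reference, one common way to set that up on RHEL 5 (a sketch only; the
exact file and option placement vary by distro version, and 802.3ad needs
LACP configured on the switch ports):

  # /etc/modprobe.conf
  alias bond0 bonding
  options bonding mode=802.3ad xmit_hash_policy=layer3+4 miimon=100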

Any one connection (socket or node/node connection) will use only one 
gigabit link.  While it is possible
to use two links using round-robin, that normally only helps for client 
reads (server can't choose which link to
receive data, the switch picks that), and has the serious downside of 
out-of-order packets on the TCP stream.

[If you want clients to have better client bandwidth for a single file, 
change your default stripe count to 2, so it
will hit two different servers.]
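
A quick sketch of that last suggestion (the mount point is illustrative):

  # default stripe count of 2 on the filesystem root (or a directory), so
  # new files spread across two OSTs and therefore two OSSs
  lfs setstripe -c 2 /mnt/lustre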

Kevin


David Merhar wrote:
 Sorry - little b all the way around.

 We're limited to 1Gb per OST.

 djm



 On Jan 27, 2011, at 7:48 AM, Balagopal Pillai wrote:

   
 I guess you have two gigabit nics bonded in mode 6 and not two 1GB  
 nics?
 (B-Bytes, b-bits) The max aggregate throughput could be about 200MBps
 out of the 2 bonded nics. I think the mode 0 bonding works only with
 cisco etherchannel or something similar on the switch side. Same with
 the FC connection, its 4Gbps (not 4GBps) or about 400-500 MBps max
 throughout. Maybe you could also see the max read and write  
 capabilities
 of the raid controller other than just the network. When testing with
 dd, some of the data remains as dirty data till its flushed into the
 disk. I think the default background ratio is 10% for rhel5 which  
 would
 be sizable if your oss have lots of ram. There is chance of lockup of
 the oss once it hits the dirty_ratio limit,which is 40% by default.  
 So a
 bit more aggressive flush to disk by lowering the background_ratio  
 and a
 bit more headroom before it hits the dirty_ratio is generally  
 desirable
 if your raid controller could keep up with it. So with your current
 setup, i guess you could get a max of 400MBps out of both OSS's if  
 they
 both have two 1Gb nics in them. Maybe if you have one of the switches
 from Dell that has 4 10Gb ports in them (their powerconnect 6248),  
 10Gb
 nics for your OSS's might be a cheaper way to increase the aggregate
 performance. I think over 1GBps from a client is possible in cases  
 where
 you use infiniband and rdma to deliver data.


 David Merhar wrote:
 
 Our OSS's with 2x1GB NICs (bonded) appear limited to 1GB worth of
 write throughput each.

 Our setup:
 2 OSS serving 1 OST each
 Lustre 1.8.5
 RHEL 5.4
 New Dell M610's blade servers with plenty of CPU and RAM
 All SAN fibre connections are at least 4GB

 Some notes:
 - A direct write (dd) from a single OSS to the OST gets 4GB, the  
 OSS's
 fibre wire speed.
 - A single client will get 2GB of lustre write speed, the client's
 ethernet wire speed.
 - We've tried bond mode 6 and 0 on all systems.  With mode 6 we will
 see both NICs on both OSSs receiving data.
 - We've tried multiple OSTs per OSS.

 But 2 clients writing a file will get 2GB of total bandwidth to the
 filesystems.  We have been unable to isolate any particular resource
 bottleneck.  None of the systems (MDS, OSS, or client) seem to be
 working very hard.

 The 1GB per OSS threshold is so consistent, that it almost appears by
 design - and hopefully we're missing something obvious.

 Any advice?

 Thanks.

 djm



 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
 

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] 1GB throughput limit on OST (1.8.5)?

2011-01-27 Thread David Merhar

Appreciate the input.

We've been using mode 6 as I expect it provides the fewest configuration
pratfalls.  If the single stream becomes our bottleneck, we'll mess with
aggregation.


What I can't find is the bottleneck in our current setup.  With 4 machines
- 2 clients, 2 OSSs - I'd expect 4Gb of aggregate throughput, where each
client has a single connection to each OST.  Instead we're limited to 2Gb,
where each OSS appears limited to 1Gb of I/O.  The strange thing is that
iptraf on the OSSs shows traffic through the expected connections (2 x 2)
but at only 35%-65% of bandwidth.
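
Besides iptraf, a couple of stock tools for watching per-NIC and bond rates
on the OSSes while the clients write (sar is from the sysstat package; the
1-second interval is arbitrary):

  sar -n DEV 1
  # or sample the raw interface counters directly
  cat /proc/net/dev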


And a third client writing to the filesystem will briefly increase  
aggregate throughput, but it quickly settles back to ~2Gb.


djm





___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Re: [Lustre-discuss] MDT raid parameters, multiple MGSes

2011-01-27 Thread Andreas Dilger
On 2011-01-27, at 08:26, Jason Rappleye wrote:
 On Jan 27, 2011, at 3:15 AM, Andreas Dilger wrote:
 
 The problem is that the client will only connect to a single MGS for 
 configuration updates (in particular, the MGS for the last filesystem that 
 was mounted).  If there is a configuration change (e.g. lctl conf_param, or 
 adding a new OST) on one of the other filesystems, then the client will not 
 be notified of this change because it is no longer connected to the MGS for 
 that filesystem.
 
 Granted, it is possible that someone fixed this when I wasn't paying 
 attention.
 
 I thought this sounded familiar - have a look at bz 20299. Multiple MGCs on a 
 client are ok; multiple MGSes on a single server are not.

Sigh, it was even me who filed the bug...  Seems that bit of information was 
evicted from my memory.  Thanks for setting me straight.

Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] llverfs outcome

2011-01-27 Thread Andreas Dilger
On 2011-01-27, at 04:56, Thomas Roth wrote:
 I have run llverfs (lustre-utils 1.8.4) on an OST partition as llverfs 
 -w -v /srv/OST0002.
 That went smoothly until all 9759209724 kB were written, terminating with:
 
 write File name: /srv/OST0002/dir00072/file022
 write complete
 
 llverfs: writing /srv/OST0002/llverfs.filecount failed :No space left on 
 device
 
 My question: What should be the result of llverfs? I haven't found any 
 documentation on this tool, so I can just suspect that this was a 
 successful run?

It shouldn't be terminating at this point, but I suspect a bug in llverfs and 
not in the filesystem.  I _thought_ there was an llverfs(8) man page, but it 
turns out there is only an old llverfs.txt file.

 (llverdev terminates with 'write complete' also, no errors indicated - 
 good?)

You can restart llverfs with the -r option so that it does the read tests to 
verify the data, and the -t option is needed to specify the timestamp used 
for the writes (so that it can distinguish stale data written from two 
different tests).  In hindsight, it probably makes sense from a usability POV 
to allow automatically detecting the timestamp value from the first file read, 
if unspecified, and then use that for the rest of the test.
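
For example, a sketch of the two passes (substitute whatever timestamp value
the write run used, as described above):

  # write pass, as in the original report
  llverfs -w -v /srv/OST0002
  # read-back pass to verify the data; -t must match the write pass
  llverfs -r -v -t <timestamp> /srv/OST0002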

Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] MDT raid parameters, multiple MGSes

2011-01-27 Thread Jeremy Filizetti
Thanks Jason.  I haven't had any luck reproducing it, although I have been
trying.  Next time I'll have to check Bugzilla for closed bugs too.

Jeremy

On Thu, Jan 27, 2011 at 2:10 PM, Andreas Dilger <adil...@whamcloud.com> wrote:

 On 2011-01-27, at 08:26, Jason Rappleye wrote:
  On Jan 27, 2011, at 3:15 AM, Andreas Dilger wrote:
 
  The problem is that the client will only connect to a single MGS for
 configuration updates (in particular, the MGS for the last filesystem that
 was mounted).  If there is a configuration change (e.g. lctl conf_param, or
 adding a new OST) on one of the other filesystems, then the client will not
 be notified of this change because it is no longer connected to the MGS for
 that filesystem.
 
  Granted, it is possible that someone fixed this when I wasn't paying
 attention.
 
  I thought this sounded familiar - have a look at bz 20299. Multiple MGCs
 on a client are ok; multiple MGSes on a single server are not.

 Sigh, it was even me who filed the bug...  Seems that bit of information
 was evicted from my memory.  Thanks for setting me straight.

 Cheers, Andreas
 --
 Andreas Dilger
 Principal Engineer
 Whamcloud, Inc.




___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] question about size on MDS (MDT) for lustre-1.8

2011-01-27 Thread Robin Humble
On Thu, Jan 13, 2011 at 05:28:23PM -0500, Kit Westneat wrote:
 It would probably be better to set:

 lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M

 or similar, to limit the read cache to files 32MB in size or less (or
 whatever you consider small files at your site).  That allows the read
 cache to be used for config files and such, while not thrashing the cache
 when accessing large files.

 We should probably change this to be the default, but at the time the read
 cache was introduced, we didn't know what should be considered a small vs.
 large file, and the amount of RAM, the number of OSTs on an OSS, and the
 uses vary so much that it is difficult to pick a single correct value for
 this.

limiting the total amount of OSS cache used in order to leave room for
inodes/dentries might be more useful. the data cache will always fill
up and push out inodes otherwise.
Nathan's approach of turning off the caches entirely is extreme, but if
it gives us back some metadata performance then it might be worth it.

or is there a Lustre or VM setting to limit overall OSS cache size?

I presume that Lustre's OSS caches are subject to normal Linux VM
pagecache tweakables, but I don't think such a knob exists in Linux at
the moment...

I was looking through the Linux VM settings and saw vfs_cache_pressure -
has anyone tested performance with this parameter? Do you know if this
would have any effect on file caching vs. ext4 metadata caching?

For us, Linux/Lustre would ideally push out data before the metadata, as 
the performance penalty for doing 4k reads on the s2a far outweighs any 
benefits of data caching.

good idea. if all inodes are always cached on OSS's then the fs should
be far more responsive to stat loads... 4k/inode shouldn't use up too
much of the OSS's ram (probably more like 1 or 2k/inode really).

anyway, following your idea, we tried vfs_cache_pressure=50 on our
OSS's a week or so ago, but hit this within a couple of hours
  https://bugzilla.lustre.org/show_bug.cgi?id=24401
could have been a coincidence I guess.

did anyone else give it a try?


BTW, we recently had the opposite problem on a client that scans the
filesystem - too many inodes were cached leading to low memory problems
on the client. we've had vfs_cache_pressure=150 set on that machine for
the last month or so and it seems to help. although a more effective
setting in this case was limiting ldlm locks. eg. from the Lustre manual
  lctl set_param ldlm.namespaces.*osc*.lru_size=1
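
a sketch of applying both client-side mitigations, for anyone trying the
same (the values are just the ones mentioned above, not recommendations):

  # reclaim inode/dentry caches more aggressively on this client
  sysctl -w vm.vfs_cache_pressure=150
  # cap the client's ldlm lock LRU, per the manual example above
  lctl set_param ldlm.namespaces.*osc*.lru_size=1
  # note: lctl set_param is not persistent across a remount or reboot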

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss