Re: [Lustre-discuss] Lustre Issue
hey on lustre client i got this error .

LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 17 previous similar messages
LustreError: 2208:0:(ldlm_request.c:746:ldlm_cli_cancel()) client/server (nid 10.65.200.30@tcp) out of sync -- not fatal, flags 332c90
LustreError: 2208:0:(ldlm_request.c:746:ldlm_cli_cancel()) Skipped 1 previous similar message
LustreError: 2208:0:(file.c:754:ll_extent_lock_callback()) ldlm_cli_cancel failed: 116
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -2 req@c229bc00 x1219552/t0 o4-ost2_UUID@cyclops_UUID:28 lens 328/288 ref 2 fl Rpc:R/0/0 rc 0/-2
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 40 previous similar messages
LustreError: 2188:0:(ldlm_request.c:746:ldlm_cli_cancel()) client/server (nid 10.65.200.30@tcp) out of sync -- not fatal, flags 332c90
LustreError: 2188:0:(file.c:754:ll_extent_lock_callback()) ldlm_cli_cancel failed: 116
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -2 req@c22a3a00 x1219666/t0 o4-ost2_UUID@cyclops_UUID:28 lens 328/288 ref 2 fl Rpc:R/0/0 rc 0/-2
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 88 previous similar messages
LustreError: 2231:0:(ldlm_request.c:746:ldlm_cli_cancel()) client/server (nid 10.65.200.30@tcp) out of sync -- not fatal, flags 332c90
LustreError: 2231:0:(ldlm_request.c:746:ldlm_cli_cancel()) Skipped 2 previous similar messages
LustreError: 2231:0:(file.c:754:ll_extent_lock_callback()) ldlm_cli_cancel failed: 116
LustreError: 2231:0:(file.c:754:ll_extent_lock_callback()) Skipped 2 previous similar messages

On Wed, Jan 26, 2011 at 11:53 PM, Brian J. Murrell br...@whamcloud.com wrote:

On Wed, 2011-01-26 at 22:24 +0500, Nauman Yousuf wrote:

Your logs don't have timestamps so it's difficult to correlate events, but did you notice that right before you started getting these messages:

Lustre: 1588:0:(lustre_fsfilt.h:283:fsfilt_setattr()) mds01: slow setattr 31s
Lustre: 1595:0:(lustre_fsfilt.h:182:fsfilt_start_log()) mds01: slow journal start 33s
Lustre: 1720:0:(lustre_fsfilt.h:182:fsfilt_start_log()) mds01: slow journal start 32s
Lustre: 1602:0:(lustre_fsfilt.h:182:fsfilt_start_log()) mds01: slow journal start 38s

you got this:

drbd0: Resync started as SyncSource (need to sync 634747844 KB [158686961 bits set]).
drbd0: Resync done (total 97313 sec; paused 0 sec; 6520 K/sec)
drbd0: drbd0_worker [1126]: cstate SyncSource -- Connected

I'm no DRBD expert by a long shot, but that looks to me like you had a disk in the MDS re-syncing to its DRBD partner. If that disk is the MDT, a resync is of course going to slow down the MDT. The problem here is that you are probably tuned (i.e. the number of threads) to expect full performance out of the hardware, and when it's under a resync load, it won't deliver it. Unfortunately, at this point Lustre will push its thread count higher if it determines it can get more performance out of a target, but it won't back off when things slow down (i.e. because the disk is being commandeered for housekeeping tasks such as a resync or a RAID rebuild), so you need to cap your thread count at what performs well while your disks are under a resync load. Please see the operations manual for details on tuning thread counts for performance.

Cheers,
b.
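For reference, a minimal sketch of the thread-count tuning Brian refers to, assuming the parameter and module-option names from the 1.8-era operations manual (older releases may only support the modprobe option; verify the names against your version):

  # See how many OST I/O service threads have actually been started:
  lctl get_param ost.OSS.ost_io.threads_started

  # Cap the thread count via a module option so it survives a reboot;
  # N is a placeholder -- pick a value that still performs acceptably
  # while DRBD is resyncing:
  echo "options ost oss_num_threads=N" >> /etc/modprobe.conf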
Re: [Lustre-discuss] MDT raid parameters, multiple MGSes
On 2011-01-25, at 17:05, Jeremy Filizetti wrote:

On Fri, Jan 21, 2011 at 1:02 PM, Andreas Dilger adil...@whamcloud.com wrote: While this runs, it is definitely not correct. The problem is that the client will only connect to a single MGS for configuration updates (in particular, the MGS for the last filesystem that was mounted). If there is a configuration change (e.g. lctl conf_param, or adding a new OST) on one of the other filesystems, then the client will not be notified of this change because it is no longer connected to the MGS for that filesystem.

We use Lustre in a WAN environment and each geographic location has its own Lustre file system with its own MGS. While I don't add storage frequently, I've never seen an issue with this. Just to be sure, I mounted a test file system, followed by another file system, then added an OST to the test file system, and the client was notified by the MGS. Looking at lctl dl, the client shows a device for the MGC, and I see connections in the peers list. I didn't test any conf_param, but at least the connections look fine, including the output from lctl dk. Is there something I'm missing here? I know each OSS shares a single MGC between all the OBDs, so you can really only mount one file system at a time in Lustre. Is that what you are referring to?

Depending on how you ran the test, it is entirely possible that the client hadn't been evicted from the first MGS yet, and it accepted the message from that MGS even though it had been evicted. However, if you check the connection state on the client (e.g. lctl get_param mgc.*.import), it is only possible for the client to have a single MGC today, and that MGC can only have a connection to a single MGS at a time. Granted, it is possible that someone fixed this when I wasn't paying attention.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
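As a concrete check of what Andreas describes, the client's MGC device and its import state can be inspected with the commands already named above (exact output varies by release):

  # List configured devices; look for the MGC entry (or entries):
  lctl dl | grep -i mgc

  # Show the MGC import -- which MGS the client is currently connected to:
  lctl get_param mgc.*.import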
[Lustre-discuss] llverfs outcome
Hi all,

I have run llverfs (lustre-utils 1.8.4) on an OST partition as "llverfs -w -v /srv/OST0002". That went smoothly until all 9759209724 kB were written, terminating with:

write File name: /srv/OST0002/dir00072/file022
write complete
llverfs: writing /srv/OST0002/llverfs.filecount failed: No space left on device

My question: what should be the result of llverfs? I haven't found any documentation on this tool, so I can only suspect that this was a successful run? (llverdev terminates with 'write complete' also, no errors indicated - good?)

Regards, Thomas
Re: [Lustre-discuss] Lustre Issue
guys, still issues. Somehow my client and OSS start getting CPU load when this happens.

The OSS says:

LustreError: 1538:0:(ldlm_lockd.c:1425:ldlm_cancel_handler()) operation 103 from 12345-10.65.200.37@tcp with bad export cookie 14320354116280279937
LustreError: 1560:0:(ldlm_lockd.c:1425:ldlm_cancel_handler()) operation 103 from 12345-10.65.200.37@tcp with bad export cookie 14320354116280279937
LustreError: 1714:0:(filter_io.c:532:filter_preprw_write()) ost2: trying to BRW to non-existent file 28017031
LustreError: 1708:0:(filter_io.c:532:filter_preprw_write()) ost2: trying to BRW to non-existent file 28017031
LustreError: 1717:0:(filter_io.c:532:filter_preprw_write()) ost2: trying to BRW to non-existent file 28017040
LustreError: 1717:0:(filter_io.c:532:filter_preprw_write()) Skipped 10 previous similar messages
LustreError: 1700:0:(filter_io.c:532:filter_preprw_write()) ost2: trying to BRW to non-existent file 28017174
LustreError: 1700:0:(filter_io.c:532:filter_preprw_write()) Skipped 5 previous similar messages
LustreError: 1688:0:(filter_io.c:532:filter_preprw_write()) ost2: trying to BRW to non-existent file 28016970
LustreError: 1688:0:(filter_io.c:532:filter_preprw_write()) Skipped 12 previous similar messages
LustreError: 1697:0:(filter_io.c:532:filter_preprw_write()) ost2: trying to BRW to non-existent file 28017244
LustreError: 1697:0:(filter_io.c:532:filter_preprw_write()) Skipped 17 previous similar messages
LustreError: 1709:0:(filter_io.c:532:filter_preprw_write()) ost2: trying to BRW to non-existent file 28017244
LustreError: 1709:0:(filter_io.c:532:filter_preprw_write()) Skipped 48 previous similar messages
drbd1: [ll_ost_io_23/1690] sock_sendmsg time expired, ko = 4294967295
Lustre: 1689:0:(filter_io_26.c:714:filter_commitrw_write()) ost2: slow direct_io 30s
Lustre: 1689:0:(filter_io_26.c:727:filter_commitrw_write()) ost2: slow commitrw commit 30s

10.65.200.37 is my lustre client. The client says:

LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 4 previous similar messages
LustreError: 2199:0:(file.c:754:ll_extent_lock_callback()) ldlm_cli_cancel failed: 116
LustreError: 2199:0:(file.c:754:ll_extent_lock_callback()) Skipped 3 previous similar messages
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -2 req@c229d200 x1219484/t0 o4-ost2_UUID@cyclops_UUID:28 lens 328/288 ref 2 fl Rpc:R/0/0 rc 0/-2
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 17 previous similar messages
LustreError: 2208:0:(ldlm_request.c:746:ldlm_cli_cancel()) client/server (nid 10.65.200.30@tcp) out of sync -- not fatal, flags 332c90
LustreError: 2208:0:(ldlm_request.c:746:ldlm_cli_cancel()) Skipped 1 previous similar message
LustreError: 2208:0:(file.c:754:ll_extent_lock_callback()) ldlm_cli_cancel failed: 116
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -2 req@c229bc00 x1219552/t0 o4-ost2_UUID@cyclops_UUID:28 lens 328/288 ref 2 fl Rpc:R/0/0 rc 0/-2
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 40 previous similar messages
LustreError: 2188:0:(ldlm_request.c:746:ldlm_cli_cancel()) client/server (nid 10.65.200.30@tcp) out of sync -- not fatal, flags 332c90
LustreError: 2188:0:(file.c:754:ll_extent_lock_callback()) ldlm_cli_cancel failed: 116
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -2 req@c22a3a00 x1219666/t0 o4-ost2_UUID@cyclops_UUID:28 lens 328/288 ref 2 fl Rpc:R/0/0 rc 0/-2
LustreError: 2169:0:(client.c:576:ptlrpc_check_status()) Skipped 88 previous similar messages
LustreError: 2231:0:(ldlm_request.c:746:ldlm_cli_cancel()) client/server (nid 10.65.200.30@tcp) out of sync -- not fatal, flags 332c90
LustreError: 2231:0:(ldlm_request.c:746:ldlm_cli_cancel()) Skipped 2 previous similar messages
LustreError: 2231:0:(file.c:754:ll_extent_lock_callback()) ldlm_cli_cancel failed: 116
LustreError: 2231:0:(file.c:754:ll_extent_lock_callback()) Skipped 2 previous similar messages

10.65.200.30 is my OSS. Both are generating load.

On Thu, Jan 27, 2011 at 3:17 PM, Nauman Yousuf nauman.you...@gmail.com wrote:

hey on lustre client i got this error . [...]
[Lustre-discuss] 1GB throughput limit on OST (1.8.5)?
Our OSS's with 2x1GB NICs (bonded) appear limited to 1GB worth of write throughput each.

Our setup:
- 2 OSS serving 1 OST each
- Lustre 1.8.5
- RHEL 5.4
- New Dell M610 blade servers with plenty of CPU and RAM
- All SAN fibre connections are at least 4GB

Some notes:
- A direct write (dd) from a single OSS to the OST gets 4GB, the OSS's fibre wire speed.
- A single client will get 2GB of lustre write speed, the client's ethernet wire speed.
- We've tried bond mode 6 and 0 on all systems. With mode 6 we will see both NICs on both OSSs receiving data.
- We've tried multiple OSTs per OSS. But 2 clients writing a file will get 2GB of total bandwidth to the filesystems.

We have been unable to isolate any particular resource bottleneck. None of the systems (MDS, OSS, or client) seem to be working very hard. The 1GB per OSS threshold is so consistent that it almost appears by design - and hopefully we're missing something obvious.

Any advice? Thanks.

djm
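For anyone reproducing these numbers, a hedged sketch of a client-side throughput test; the mount point, file name, and size are assumptions, not from the original post:

  # Write 16 GiB to the Lustre mount with O_DIRECT, bypassing the
  # client page cache so the network/OST path is what gets measured:
  dd if=/dev/zero of=/mnt/lustre/ddtest bs=1M count=16384 oflag=direct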
Re: [Lustre-discuss] 1GB throughput limit on OST (1.8.5)?
I guess you have two gigabit NICs bonded in mode 6 and not two 1GB NICs? (B = bytes, b = bits.) The max aggregate throughput would be about 200MBps out of the 2 bonded NICs. I think mode 0 bonding works only with Cisco EtherChannel or something similar on the switch side. Same with the FC connection: it's 4Gbps (not 4GBps), or about 400-500MBps max throughput. Maybe you could also check the max read and write capabilities of the RAID controller, not just the network.

When testing with dd, some of the data remains as dirty data until it's flushed to disk. I think the default background ratio is 10% for RHEL 5, which would be sizable if your OSSes have lots of RAM. There is a chance of lockup of the OSS once it hits the dirty_ratio limit, which is 40% by default. So a somewhat more aggressive flush to disk (by lowering the background ratio) and a bit more headroom before it hits dirty_ratio is generally desirable, if your RAID controller can keep up with it.

So with your current setup, I guess you could get a max of 400MBps out of both OSSs if they both have two 1Gb NICs in them. Maybe if you have one of the switches from Dell that has four 10Gb ports (their PowerConnect 6248), 10Gb NICs for your OSSs might be a cheaper way to increase the aggregate performance. I think over 1GBps from a client is possible in cases where you use InfiniBand and RDMA to deliver data.

David Merhar wrote: [...]
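A sketch of the flush tuning Balagopal suggests, on the OSS (the exact percentages are assumptions; tune to what your RAID controller can absorb):

  # Start background writeback earlier, and leave more headroom before
  # writers are throttled at vm.dirty_ratio:
  sysctl -w vm.dirty_background_ratio=5
  sysctl -w vm.dirty_ratio=20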
Re: [Lustre-discuss] 1GB throughput limit on OST (1.8.5)?
Sorry - little b all the way around. We're limited to 1Gb per OST.

djm

On Jan 27, 2011, at 7:48 AM, Balagopal Pillai wrote: [...]
Re: [Lustre-discuss] 1GB throughput limit on OST (1.8.5)?
Normally if you are having a problem with write BW, you need to futz with the switch. If you are having problems with read BW, you need to futz with the server's config (xmit hash policy is the usual culprit). Are you testing multiple clients to the same server?

Are you using mode 6 because you don't have bonding support in your switch? I normally use 802.3ad mode, assuming your switch supports link aggregation. I was bonding 2x1Gb links for Lustre back in 2004. That was before BOND_XMIT_POLICY_LAYER34 was in the kernel, so I had to hack the bond xmit hash (with multiple NICs standard, layer2 hashing does not produce a uniform distribution, and can't work if going through a router).

Any one connection (socket or node/node connection) will use only one gigabit link. While it is possible to use two links using round-robin, that normally only helps for client reads (the server can't choose which link it receives data on; the switch picks that), and it has the serious downside of out-of-order packets on the TCP stream.

[If you want clients to have better bandwidth for a single file, change your default stripe count to 2, so it will hit two different servers.]

Kevin

David Merhar wrote: [...]
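For reference, a minimal sketch of the 802.3ad setup and the stripe-count change Kevin mentions, using RHEL 5 syntax (the interface name, mount point, and miimon value are assumptions):

  # /etc/modprobe.conf -- LACP bonding with layer3+4 transmit hashing,
  # so different client/server connections can hash onto different slaves:
  alias bond0 bonding
  options bond0 mode=802.3ad xmit_hash_policy=layer3+4 miimon=100

  # Stripe new files over two OSTs so a single file spans two servers:
  lfs setstripe -c 2 /mnt/lustre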
Re: [Lustre-discuss] 1GB throughput limit on OST (1.8.5)?
Appreciate the input.

We've been using mode 6 as I expect it provides the fewest configuration pratfalls. If the single stream becomes our bottleneck we'll mess with aggregation.

What I can't find is the bottleneck in our current setup. With 4 servers - 2 clients and 2 OSSs - I'd expect 4Gb of aggregate throughput, where each client has a single connection to each OST. Instead we're limited to 2Gb, where each OSS appears limited to 1Gb of I/O.

The strangeness is that iptraf on the OSSs shows traffic through the expected connections (2 x 2), but at only 35% - 65% of bandwidth. And a third client writing to the filesystem will briefly increase aggregate throughput, but it quickly settles back to ~2Gb.

djm

On Jan 27, 2011, at 11:16 AM, Kevin Van Maren wrote: [...]
Re: [Lustre-discuss] MDT raid parameters, multiple MGSes
On 2011-01-27, at 08:26, Jason Rappleye wrote:

On Jan 27, 2011, at 3:15 AM, Andreas Dilger wrote: The problem is that the client will only connect to a single MGS for configuration updates (in particular, the MGS for the last filesystem that was mounted). If there is a configuration change (e.g. lctl conf_param, or adding a new OST) on one of the other filesystems, then the client will not be notified of this change because it is no longer connected to the MGS for that filesystem. Granted, it is possible that someone fixed this when I wasn't paying attention.

I thought this sounded familiar - have a look at bz 20299. Multiple MGCs on a client are OK; multiple MGSes on a single server are not.

Sigh, it was even me who filed the bug... Seems that bit of information was evicted from my memory. Thanks for setting me straight.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
Re: [Lustre-discuss] llverfs outcome
On 2011-01-27, at 04:56, Thomas Roth wrote:

I have run llverfs (lustre-utils 1.8.4) on an OST partition as "llverfs -w -v /srv/OST0002". That went smoothly until all 9759209724 kB were written, terminating with: [...] My question: What should be the result of llverfs? I haven't found any documentation on this tool, so I can only suspect that this was a successful run?

It shouldn't be terminating at this point, but I suspect a bug in llverfs rather than in the filesystem. I _thought_ there was an llverfs(8) man page, but it turns out there is only an old llverfs.txt file.

(llverdev terminates with 'write complete' also, no errors indicated - good?)

You can restart llverfs with the -r option so that it does the read tests to verify the data; the -t option is needed to specify the timestamp used for the writes (so that it can distinguish stale data written by two different tests). In hindsight, it probably makes sense from a usability POV to automatically detect the timestamp value from the first file read, if unspecified, and then use that for the rest of the test.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
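Concretely, the read-back pass Andreas describes would look something like this (the timestamp value below is a placeholder; use the one printed during your -w run):

  # Re-read and verify previously written data; -t must match the write pass:
  llverfs -r -v -t 1296000000 /srv/OST0002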
Re: [Lustre-discuss] MDT raid parameters, multiple MGSes
Thanks Jason. I haven't had any luck in reproducing it, although I have been trying. Next time I'll have to check bugzilla for closed bugs too.

Jeremy

On Thu, Jan 27, 2011 at 2:10 PM, Andreas Dilger adil...@whamcloud.com wrote: [...]
Re: [Lustre-discuss] question about size on MDS (MDT) for lustre-1.8
On Thu, Jan 13, 2011 at 05:28:23PM -0500, Kit Westneat wrote:

It would probably be better to set:
lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M
or similar, to limit the read cache to files 32MB in size or less (or whatever you consider small files at your site). That allows the read cache for config files and such, while not thrashing the cache when accessing large files. We should probably change this to be the default, but at the time the read cache was introduced we didn't know what should be considered a small vs. large file, and the amount of RAM and number of OSTs on an OSS, and the uses, vary so much that it is difficult to pick a single correct value for this.

limiting the total amount of OSS cache used, in order to leave room for inodes/dentries, might be more useful. the data cache will always fill up and push out inodes otherwise. Nathan's approach of turning off the caches entirely is extreme, but if it gives us back some metadata performance then it might be worth it.

or is there a Lustre or VM setting to limit overall OSS cache size? I presume that Lustre's OSS caches are subject to the normal Linux VM pagecache tweakables, but I don't think such a knob exists in Linux at the moment...

I was looking through the Linux vm settings and saw vfs_cache_pressure - has anyone tested performance with this parameter? Do you know if this would have any effect on file caching vs. ext4 metadata caching? For us, Linux/Lustre would ideally push out data before the metadata, as the performance penalty for doing 4k reads on the s2a far outweighs any benefits of data caching.

good idea. if all inodes are always cached on the OSSes then the fs should be far more responsive to stat loads... 4k/inode shouldn't use up too much of the OSS's ram (probably more like 1 or 2k/inode really). anyway, following your idea, we tried vfs_cache_pressure=50 on our OSSes a week or so ago, but hit this within a couple of hours: https://bugzilla.lustre.org/show_bug.cgi?id=24401 - could have been a coincidence I guess. did anyone else give it a try?

BTW, we recently had the opposite problem on a client that scans the filesystem - too many inodes were cached, leading to low memory problems on the client. we've had vfs_cache_pressure=150 set on that machine for the last month or so and it seems to help, although a more effective setting in this case was limiting ldlm locks, eg. from the Lustre manual:

lctl set_param ldlm.namespaces.*osc*.lru_size=1

cheers, robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
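Pulling the knobs from this thread together, a hedged sketch (the filesystem name, OST index, and lru_size value are placeholders; see the bugzilla link above before deploying the OSS change):

  # On the MGS: limit the OSS read cache to files of 32MB or less:
  lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M

  # On the OSS: keep inodes/dentries cached longer (values below 100
  # favour metadata; the thread tried 50):
  sysctl -w vm.vfs_cache_pressure=50

  # On a client that scans many files: reclaim inodes more aggressively:
  sysctl -w vm.vfs_cache_pressure=150

  # On a scanning client: keep the LDLM lock LRU small (value is a
  # placeholder; the post used a very small fixed limit):
  lctl set_param ldlm.namespaces.*osc*.lru_size=100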