Re: [Lustre-discuss] Not sure how we should configure our RAID arrays (HW limitation)
The 512KB stripe size should be fine for Lustre, and 128KB per disk is enough to get good performance from the underlying hard drives. I don't know anything about the E18s beyond what you've posted, so I can't guess which configuration is more optimal. I would suggest you create the RAID arrays, format the LUNs for Lustre, and run the Lustre iokit to see how the various configurations perform (3 * 4+2, 2 * 8+1, 2 * 7+2). Then please post the results (with mkfs, etc. command lines) here so others can benefit from your experiments and/or suggest additional tunings.

Kevin

On May 4, 2012, at 3:14 PM, Frank Riley wrote:

How about doing 3 4+2 RAIDs? 12 usable disks, instead of 14 or 16, but still better than 8 with RAID1. Doing 4*128KB, resulting in 2 full-stripe writes for each 1MB IO, is not that bad.

Yes, of course. I had thought of this option earlier but forgot to include it. Thanks for reminding me. So using a stripe width of 512KB will not harm performance that much?

Note also that the E18s have two active/active controllers in them, so one controller will be handling I/O requests for two arrays, which will reduce performance somewhat. Would this affect your decision between 3 4+2 (512KB) or 2 7+2 (896KB)?

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Confidentiality Notice: This e-mail message, its contents and any attachments to it are confidential to the intended recipient, and may contain information that is privileged and/or exempt from disclosure under applicable law. If you are not the intended recipient, please immediately notify the sender and destroy the original e-mail message and any attachments (and any copies that may have been made) from your system or otherwise. Any unauthorized use, copying, disclosure or distribution of this information is strictly prohibited. Email addresses that end with a "-c" identify the sender as a Fusion-io contractor.
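As a quick sanity check before running the iokit, the stripe-width arithmetic for the three candidate layouts can be sketched in shell (the 128KB chunk size is the E18 maximum from this thread; the layout list is just the three options under discussion):

```shell
# Full-stripe width (KB) for each candidate layout with 128KB chunks,
# and whether a 1MB Lustre RPC is a whole multiple of that width (1 = yes).
chunk_kb=128

width_4p2=$((4 * chunk_kb))   # 3 x (4+2) RAID6
width_8p1=$((8 * chunk_kb))   # 2 x (8+1) RAID5
width_7p2=$((7 * chunk_kb))   # 2 x (7+2) RAID6

for w in $width_4p2 $width_8p1 $width_7p2; do
    echo "stripe width ${w}KB, 1MB-aligned: $((1024 % w == 0))"
done
```

Only the 7+2 layout leaves a 1MB RPC misaligned (896KB), which is why the max_pages_per_rpc and mkfs geometry tuning comes up elsewhere in the thread.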
Re: [Lustre-discuss] recovery from multiple disks failure on the same md
On May 6, 2012, at 10:13 PM, Tae Young Hong wrote:

Hi, I found a terrible situation on our Lustre system. An OST (RAID6: 8+2, spare 1) had 2 disk failures at almost the same time. While recovering it, another disk failed, so the recovery procedure seems to have halted, and the spare disk which was in resync fell back into spare status. (I guess the resync procedure was more than 95% finished.) Right now we have just 7 disks for this md. Is there any possibility to recover from this situation?

It might be possible, but not something I've done. If the array has not been written to since a drive failed, you might be able to power-cycle the failed drives (to reset the firmware) and force re-add them (without a rebuild)? If the array _has_ been modified (most likely) you could write a sector of 0's to the bad sector, which will corrupt just that stripe, and force-re-add the last failed drive and attempt the rebuild again. Certainly if you have a support contract I'd recommend you get professional assistance.

Unfortunately, the failure mode you encountered is all too common. Because the Linux SW RAID code does not read the parity blocks unless there is a problem, hard drive failures are NOT independent: drives appear to fail more often during a rebuild than at any other time. The only way to work around this problem is to periodically do a verify of the MD array. A verify allows a drive that is failing in the 20% of the space that contains parity to fail _before_ the data becomes unreadable, rather than fail _after_ the data becomes unreadable. Don't do it on a degraded array, but it is a good way to ensure healthy arrays are really healthy.

Run "echo check > /sys/block/mdX/md/sync_action" to force a verify. Parity mismatches will be reported (not corrected), but drive failures can be dealt with sooner, rather than letting them stack up. Run "man md" and see the sync_action section.
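A minimal sketch of that periodic verify, assuming md12 is the array name from the log below and that it is healthy (not degraded) when you start:

```shell
# Kick off a background verify of md12; this reads every stripe,
# including the parity blocks that normal IO never touches.
echo check > /sys/block/md12/md/sync_action

# Watch progress; the verify shows up as a resync in /proc/mdstat.
cat /proc/mdstat

# After it completes: parity mismatches found (reported, not fixed).
cat /sys/block/md12/md/mismatch_cnt
```

Running this from cron (weekly, say) is the usual pattern; some distributions ship a similar job as a raid-check script.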
Also note that Lustre 1.8.7 has a fix to the SW RAID code (corruption when rebuilding under load). Oracle's release called the patch md-avoid-corrupted-ldiskfs-after-rebuild.patch, while Whamcloud called it raid5-rebuild-corrupt-bug.patch.

Kevin

The following is the detailed log.

#1 the original configuration before any failure

   Number   Major   Minor   RaidDevice   State
      0       8      176        0        active sync   /dev/sdl
      1       8      192        1        active sync   /dev/sdm
      2       8      208        2        active sync   /dev/sdn
      3       8      224        3        active sync   /dev/sdo
      4       8      240        4        active sync   /dev/sdp
      5      65        0        5        active sync   /dev/sdq
      6      65       16        6        active sync   /dev/sdr
      7      65       32        7        active sync   /dev/sds
      8      65       48        8        active sync   /dev/sdt
      9      65       96        9        active sync   /dev/sdw
     10      65       64        -        spare         /dev/sdu

#2 a disk (sdl) failed, and resync started after adding the spare disk (sdu)

May 7 04:53:33 oss07 kernel: sd 1:0:10:0: SCSI error: return code = 0x0802
May 7 04:53:33 oss07 kernel: sdl: Current: sense key: Medium Error
May 7 04:53:33 oss07 kernel: Add. Sense: Unrecovered read error
May 7 04:53:33 oss07 kernel: Info fld=0x74241ace
May 7 04:53:33 oss07 kernel: end_request: I/O error, dev sdl, sector 1948523214
...
May 7 04:54:15 oss07 kernel: RAID5 conf printout:
May 7 04:54:16 oss07 kernel: --- rd:10 wd:9 fd:1
May 7 04:54:16 oss07 kernel: disk 1, o:1, dev:sdm
May 7 04:54:16 oss07 kernel: disk 2, o:1, dev:sdn
May 7 04:54:16 oss07 kernel: disk 3, o:1, dev:sdo
May 7 04:54:16 oss07 kernel: disk 4, o:1, dev:sdp
May 7 04:54:16 oss07 kernel: disk 5, o:1, dev:sdq
May 7 04:54:16 oss07 kernel: disk 6, o:1, dev:sdr
May 7 04:54:16 oss07 kernel: disk 7, o:1, dev:sds
May 7 04:54:16 oss07 kernel: disk 8, o:1, dev:sdt
May 7 04:54:16 oss07 kernel: disk 9, o:1, dev:sdw
May 7 04:54:16 oss07 kernel: RAID5 conf printout:
May 7 04:54:16 oss07 kernel: --- rd:10 wd:9 fd:1
May 7 04:54:16 oss07 kernel: disk 0, o:1, dev:sdu
May 7 04:54:16 oss07 kernel: disk 1, o:1, dev:sdm
May 7 04:54:16 oss07 kernel: disk 2, o:1, dev:sdn
May 7 04:54:16 oss07 kernel: disk 3, o:1, dev:sdo
May 7 04:54:16 oss07 kernel: disk 4, o:1, dev:sdp
May 7 04:54:16 oss07 kernel: disk 5, o:1, dev:sdq
May 7 04:54:16 oss07 kernel: disk 6, o:1, dev:sdr
May 7 04:54:16 oss07 kernel: disk 7, o:1, dev:sds
May 7 04:54:16 oss07 kernel: disk 8, o:1, dev:sdt
May 7 04:54:16 oss07 kernel: disk 9, o:1, dev:sdw
May 7 04:54:16 oss07 kernel: md: syncing RAID array md12

#3 another disk (sdp) failed

May 7 04:54:42 oss07 kernel: end_request: I/O error, dev sdp, sector 1949298688
May 7 04:54:42 oss07 kernel: mptbase: ioc1: LogInfo(0x3108): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x)
May 7 04:54:42 oss07
Re: [Lustre-discuss] Not sure how we should configure our RAID arrays (HW limitation)
On May 4, 2012, at 2:53 PM, Frank Riley wrote:

Hello, we are using Nexsan E18s for our storage systems, and we are in the process of setting them up for Lustre. Each E18 has 18 disks total (maxed out). According to the Lustre docs, I want to have a stripe width of 1MB. Unfortunately, these E18s have a max stripe size of 128KB. As I see it, for RAID6 this leaves us two options:

1) One 16+2 array with a stripe size of 64KB for a stripe width of 1MB. I'm hesitant about this option because of the increased chance that we could have more than 2 disks fail.

2) Two 7+2 arrays with a stripe size of 128KB for a stripe width of 896KB. I'd then modify the max_pages_per_rpc tunable to match the 896KB. I'm not sure what to do with the flex_bg filesystem option since it has to be a power of 2.

Note that you need to set the Lustre stripe size to match 896KB, as otherwise you will send 896KB and 128KB pieces to each OST. Additional tuning of the mkfs options is also necessary so that the file system understands the layout (see -E in the Lustre manual), as otherwise all the block allocations will start mid-stripe. This is not ideal for applications that expect power-of-2 sizes to be optimal.

What is the better option here? Or is there an option I'm missing? I've pretty much ruled out RAID5 arrays at 8+1 due to data loss risk, and RAID1+0 wastes too much disk for our use.

8+1 is the best option from a Lustre performance standpoint. You should get better performance from two 7+2 arrays than with a 16+2, simply because you can have twice the number of independent IOs.

How about doing 3 4+2 RAIDs? 12 usable disks, instead of 14 or 16, but still better than 8 with RAID1. Doing 4*128KB, resulting in 2 full-stripe writes for each 1MB IO, is not that bad.

Kevin
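For the 7+2 option, the -E geometry Kevin mentions works out as follows. This is a hedged sketch: the device path and --mgsnode value are placeholders, and the mkfs.lustre line is shown commented out rather than as a tested command.

```shell
# ldiskfs geometry for a 7+2 RAID6 with 128KB chunks and 4KB filesystem blocks:
#   stride       = chunk size / block size = 128KB / 4KB = 32 blocks
#   stripe-width = data disks * stride     = 7 * 32      = 224 blocks
block_kb=4
chunk_kb=128
data_disks=7

stride=$((chunk_kb / block_kb))
stripe_width=$((data_disks * stride))
echo "stride=$stride stripe-width=$stripe_width"

# The geometry would then be passed through to mkfs.lustre, e.g.
# (placeholder NID and LUN, not run here):
# mkfs.lustre --ost --fsname=lustre --mgsnode=<mgs-nid> \
#     --mkfsoptions="-E stride=$stride,stripe-width=$stripe_width" /dev/<ost-lun>
```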
Re: [Lustre-discuss] erroneous ENOSPC -28
There is information in that bug; have you looked at the tot_granted and compared it to the sum of the cur_grant_bytes? The workaround, until you upgrade, was to restart the OSTs to reset the grant.

Kevin

On Apr 23, 2012, at 9:18 AM, Gretchen Zwart wrote:

Hi,

Debian 5.0, 2.6.26-2-amd64 SMP (SLES 11), Lustre 1.8.1.1-1

Lustre clients are getting ENOSPC (-28) error messages, but 'lfs df' results indicate that the OSTs are no more than 50% full. This looks like it could be related to Bug 22755. What is the best way to nail down whether this is the cause? I'm in the upgrade process, but I'd also like to know the fastest/best method to restore Lustre functionality should I encounter this again.

Regards,
Gretchen Zwart
UMass Astronomy Dept.
619E Lederle, 710 North Pleasant St
Amherst, MA 01003
(413) 577-2108
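Comparing the two sides of the grant accounting can be sketched like this; a hedged example, assuming the 1.8-era parameter names (obdfilter.*.tot_granted on the OSS, osc.*.cur_grant_bytes on the clients):

```shell
# On each OSS: space the OST believes it has granted out to clients.
lctl get_param obdfilter.*.tot_granted

# On every client: what each client believes it holds for that OST.
# Sum these across all clients and compare with tot_granted above; a large
# discrepancy points at the grant-leak bug (22755) rather than a truly
# full OST.
lctl get_param osc.*.cur_grant_bytes
```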
Re: [Lustre-discuss] Lustre on Debian
Save yourself pain: use a supported RedHat kernel on the server.

On Apr 1, 2012, at 8:08 AM, Mario Benitez wrote:

Hi guys, I'm trying to set up Lustre on Debian (server and clients). Any hints out there? Thanks in advance.

Marinho
Re: [Lustre-discuss] OSS1 Node issue
This is not the correct list for help with SGE. That being said, the real issue (as has been mentioned by several people) is that an OST has gone read-only due to some issue. The file system will not function properly until this is resolved, irrespective of where you put SGE. You will need to check the logs on oss1 to find the initial issue, stop the bad OST, and take corrective action (the details of which depend on the issue).

Kevin

Sent from my iPhone

On Feb 21, 2012, at 3:23 AM, VIJESH EK ekvij...@gmail.com wrote:

We are waiting for your feedback.

Thanks & Regards,
VIJESH E K

On Tue, Feb 21, 2012 at 12:22 PM, VIJESH EK ekvij...@gmail.com wrote:

Dear All,

We have made the following changes on the exec nodes, but we are still getting the same errors in /var/log/messages.

1. We changed the exec nodes' spool directory to a local directory by editing the file /home/appl/sge-root/default/common/configuration and changing the parameter execd_spool_dir. After this change, the error below is still appearing, and only on the OSS1 node:

Feb 6 18:32:10 oss1 kernel: LustreError: 9362:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 6 18:32:05 oss1 kernel: LustreError: 9422:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 6 18:32:06 oss1 kernel: LustreError: 9432:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 6 18:32:07 oss1 kernel: LustreError: 9369:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 6 18:32:10 oss1 kernel: LustreError: 9362:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30

Can you tell me how to change the master spool directory? Is it possible to change the directory in live mode? Kindly explain briefly, so that we can proceed to the next step.

Thanks and Regards,
VIJESH

On Fri, Feb 10, 2012 at 1:19 PM, Carlos Thomaz ctho...@ddn.com wrote:

Hi Vijesh,

Are you running the SGE master spooling on Lustre? What about the exec nodes' spooling? I strongly recommend you do not run the master spooling on Lustre, and if possible use local spooling on local disk for the exec nodes. SGE (at least until version 6.2u7) is known to get unstable when running the spooling on Lustre.

Carlos

On Feb 10, 2012, at 1:18 AM, VIJESH EK ekvij...@gmail.com wrote:

Dear All,

Kindly get a solution for the below issue...

Thanks & Regards,
VIJESH E K

On Thu, Feb 9, 2012 at 3:26 PM, VIJESH EK ekvij...@gmail.com wrote:

Dear Sir,

I am getting the below error messages continuously on the OSS1 node; it causes the sge service to not run intermittently...

Feb 5 04:03:37 oss1 kernel: LustreError: 9193:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:03:47 oss1 kernel: LustreError: 9164:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:03:47 oss1 kernel: LustreError: 28420:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:03:48 oss1 kernel: LustreError: 9266:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:03:50 oss1 kernel: LustreError: 9200:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:03:53 oss1 kernel: LustreError: 9230:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:03:57 oss1 kernel: LustreError: 9212:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:04:03 oss1 kernel: LustreError: 9262:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:04:08 oss1 kernel: LustreError: 9162:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:04:15 oss1 kernel: LustreError: 9271:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:04:23 oss1 kernel: LustreError: 9191:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:04:32 oss1 kernel: LustreError: 9242:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30

The detailed log information is attached. The attached file contains the continuous /var/log/messages logs, separated by "*". So kindly give me a solution for this issue...

Thanks & Regards,
VIJESH E K
Re: [Lustre-discuss] OSS1 Node issue
The logs you attached start sometime after the issue: to tell what happened, you need to find the error in the logs from before you started getting these errors:

Feb 5 04:03:13 oss1 kernel: LustreError: 9222:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30

It looks like you rebooted the server, and OST0 and OST1 were mounted, and you are NOT getting those errors any more, but both OSTs reported errors on mount. So unmount the OSTs, and run:

e2fsck /dev/dm-0
e2fsck /dev/dm-1

I don't know how mangled your OSTs are, so I don't know what e2fsck will report. See also http://wiki.lustre.org/index.php/Handling_File_System_Errors

Kevin

On Feb 21, 2012, at 10:43 PM, VIJESH EK wrote:

Dear Kevin,

I have attached /var/log/messages; kindly go through the logs and give me a solution for this immediately. Can you tell me how to run e2fsck for an OST? Please tell me the exact command, with switches, for how to run e2fsck without affecting the data. We are waiting for your reply.

Thanks & Regards,
VIJESH E K

On Tue, Feb 21, 2012 at 8:38 PM, Kevin Van Maren kvanma...@fusionio.com wrote:

This is not the correct list for help with SGE. That being said, the real issue (as has been mentioned by several people) is that an OST has gone read-only due to some issue. The file system will not function properly until this is resolved, irrespective of where you put SGE. You will need to check the logs on oss1 to find the initial issue, stop the bad OST, and take corrective action (the details of which depend on the issue).

Kevin
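On the question of running e2fsck without affecting the data, a hedged sketch (the device names follow Kevin's example; the -n run is read-only, while the repair run must only happen with the OST unmounted):

```shell
# Read-only check first: -f forces a full check even if the filesystem
# looks clean, and -n answers "no" to every prompt, so nothing on disk
# is modified.
e2fsck -fn /dev/dm-0
e2fsck -fn /dev/dm-1

# Only after reviewing that output, and with the OST still unmounted,
# run the actual repair (this one DOES modify the filesystem):
# e2fsck -fy /dev/dm-0
```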
Re: [Lustre-discuss] LNET Performance Issue
Perhaps someone else here has a thought, but it does not make sense to me that loading SDP (which accelerates TCP traffic by bypassing the TCP stack) makes LNET faster if you are using ip@o2ib, and _not_ ip@tcp0, for your NIDs. Any chance you've configured both TCP and o2ib NIDs on the machine, and it is somehow picking the TCP NIDs to use? Can you confirm the "lctl list_nids" output, and the lustre/lnet sections of your modprobe.conf?

Kevin

On Feb 15, 2012, at 12:30 PM, Barberi, Carl E wrote:

We are having issues with LNET performance over Infiniband. We have a configuration with a single MDT and six (6) OSTs. The Lustre client I am using to test is configured to use 6 stripes (lfs setstripe -c 6 /mnt/lustre). When I perform a test using the following command:

dd if=/dev/zero of=/mnt/lustre/test.dat bs=1M count=2000

I typically get a write rate of about 815 MB/s, and we never exceed 848 MB/s. When I run obdfilter-survey, we easily get about 3-4 GB/s write speed, but when I run a series of lnet-selftests, the read and write rates range from 850 MB/s to 875 MB/s max. I have performed the following optimizations to increase the data rate:

On the client:
lctl set_param osc.*.checksums=0
lctl set_param osc.*.max_dirty_mb=256

On the OSTs:
lctl set_param obdfilter.*.writethrough_cache_enable=0
lctl set_param obdfilter.*.read_cache_enable=0
echo 4096 > /sys/block/<device>/queue/nr_requests

I have also loaded the ib_sdp module, which also brought an increase in speed. However, we need to be able to record at no less than 1 GB/s, which we cannot achieve right now. Any thoughts on how I can optimize LNET, which clearly seems to be the bottleneck?

Thank you for any help you can provide,
Carl Barberi
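Checking which NIDs are configured, as Kevin asks, can be sketched along these lines; a hedged example (the networks line shown is illustrative, not taken from Carl's configuration, and the modprobe file locations vary by distribution):

```shell
# Show every NID this node is advertising; if both @tcp and @o2ib NIDs
# appear here, LNET may be selecting the TCP path for bulk traffic.
lctl list_nids

# The modprobe configuration decides which interfaces LNET uses.
# A line like the following would restrict LNET to the IB interface only
# (example value, adjust for the real fabric):
#   options lnet networks=o2ib0(ib0)
grep -r lnet /etc/modprobe.conf /etc/modprobe.d/ 2>/dev/null
```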
Re: [Lustre-discuss] obdidx ordering in lfs getstripe
On Feb 14, 2012, at 12:13 AM, Jack David wrote:

On Thu, Feb 9, 2012 at 8:18 PM, Andreas Dilger adil...@whamcloud.com wrote:

On 2012-02-09, at 6:20 AM, Jack David wrote:

In the output of "lfs getstripe filename|dirname", the obdidx denotes the OST index (I assume). Consider the following output:

lmm_stripe_count:   2
lmm_stripe_size:    1048576
lmm_stripe_offset:  1
    obdidx    objid    objid    group
         1        2      0x2        0
         0        3      0x3        0

where I have a setup consisting of two OSTs. If I have more than two OSTs, is it possible that I get the obdidx values out of order, or will the obdidx values always be linear? For example, in the above output the values are linear (1, 0, and I assume this pattern is repeated while storing the data). If I have 4 OSTs, can the values be non-linear? Something like 2,0,1,3 or 2,1,3,0 (or any pattern for that matter)?

Typically the ordering will be linear, but this depends on a number of different factors:

- what order the OSTs were created in: without --index=N the OST order depends on the order in which they were first mounted, so using --index is always recommended, and will be mandatory in the future
- the distribution of OSTs among OSS nodes: the MDS object allocator will normally select one OST from each OSS before allocating another object from a different OST on the same OSS

Thanks for this information.

- the space available on each OST: when OST free space is imbalanced the OSTs will be selected in part based on how full they are

I have a doubt here. Let's say I have 4 OSTs, but the Lustre client is issuing a write request which can be accommodated by any single OST (e.g. the write request is of size 512 bytes and stripe_size is 1MB). In this case, how will the data be stored? Will the MDS maintain the index of the next OST which should serve the request?

I think you are still confused about how it works. The OSTs are selected _when the file is created_. The striping is a static map of offset to OST. For example, if the stripe count = 2 and the stripe size = 1MB, then 0-1MB goes to the first OST, 1-2MB goes to the second, 2-3MB goes to the first, etc.

The free space impacts _which_ OSTs are selected when a file is created; it does NOT impact where data is written once a file is created. So if an OST fills up, every file that resides on that OST will be unable to grow if the growth is to an offset that maps to that OST.

Kevin
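The static offset-to-OST map Kevin describes can be sketched in shell. This is a toy illustration: the 2-stripe, 1MB layout is the example from the email, and the function below is hypothetical, not a Lustre tool.

```shell
# Map a file offset to the stripe index within the file's layout.
# With stripe_count=2 and stripe_size=1MB, offsets 0-1MB hit stripe 0,
# 1-2MB hit stripe 1, 2-3MB hit stripe 0 again, and so on.
stripe_size=$((1024 * 1024))
stripe_count=2

stripe_for_offset() {
    echo $(( ($1 / stripe_size) % stripe_count ))
}

stripe_for_offset 0                      # stripe 0
stripe_for_offset $((1 * 1024 * 1024))   # stripe 1
stripe_for_offset $((2 * 1024 * 1024))   # stripe 0
stripe_for_offset $((4 * 1024 * 1024))   # stripe 0: 0MB, 2MB, 4MB all land on one OST
```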
Re: [Lustre-discuss] obdidx ordering in lfs getstripe
On Feb 14, 2012, at 6:51 AM, Jack David wrote:

On Tue, Feb 14, 2012 at 6:57 PM, Kevin Van Maren kvanma...@fusionio.com wrote:

The OSTs are selected _when the file is created_. The striping is a static map of offset to OST. For example, if the stripe count = 2 and the stripe size = 1MB, then 0-1MB goes to the first OST, 1-2MB goes to the second, 2-3MB goes to the first, etc.

I understand that, but I just got curious: does the Lustre client keep track of which is the _next_ OST the IO request should go to?

No, it does not track the "next", as that depends on the file offset. For example, with the 2-OST stripe example in my previous email, if the client writes 0-1MB, 2-3MB, and 4-5MB, all the data will be written to a single OST.

I am unaware of who decides the stripe_size at the time of file creation (by default it is 1MB, per the lfs setstripe man page), so I assume the client is not bothered about that. But if the client generates a write request which is not a multiple of stripe_size, multiple write requests can be stored into one OST (e.g. if the stripe size is 1MB, then 20 requests of 512 bytes can be stored in OST1, the next 20 requests on OST2, and so on).

1MB is the default default, but the actual default can vary from system to system. The file stripe is determined when the file is created. "lfs setstripe" can be used to create a file with a specified striping. "lfs setstripe" can also be used to change the striping for a directory, which is quite useful, as that determines the default stripe for any files created in that directory (including directories!).

When the client opens a file, the MDT returns the stripe information to the client so that the client knows how to map file offsets to OST objects (and the offset within that object). It is the client's job (inside Lustre, so it is automatic) to figure out how to map a read/write to the server/OST/object/offset.

Kevin

Actually I am trying to understand how I can leverage the pNFS file layout semantics (which communicate with the Data Servers directly once the layout is supplied by the Meta Data Server) with the Lustre filesystem, and that is the source of such questions.

The free space impacts _which_ OSTs are selected when a file is created; it does NOT impact where data is written once a file is created. So if an OST fills up, every file that resides on that OST will be unable to grow if the growth is to an offset that maps to that OST.

Good to know that.
Re: [Lustre-discuss] Need cost for lustre
Lustre is free to use. Support is optional, and that cost will vary depending on where you get it.

Kevin

On Feb 14, 2012, at 4:51 AM, Anantharamanan R wrote: Hello, I need to know the licensing cost for Lustre. Please provide me with the same. Regards, Ananth, C-CAMP, NCBS, INDIA

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] OSS1 Node issue
Errno 30 is EROFS, read-only file system. Perhaps there is some issue further up in the logs indicating the OST went read-only?

Kevin

On Feb 10, 2012, at 12:17 AM, VIJESH EK wrote: Dear All, kindly suggest a solution for the issue below... Thanks, Regards, VIJESH E K

On Thu, Feb 9, 2012 at 3:26 PM, VIJESH EK ekvij...@gmail.com wrote: Dear Sir, I am getting the error messages below continuously on the OSS1 node; it causes the sge service to stop running intermittently:

Feb 5 04:03:37 oss1 kernel: LustreError: 9193:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:03:47 oss1 kernel: LustreError: 9164:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:03:47 oss1 kernel: LustreError: 28420:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:03:48 oss1 kernel: LustreError: 9266:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:03:50 oss1 kernel: LustreError: 9200:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:03:53 oss1 kernel: LustreError: 9230:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:03:57 oss1 kernel: LustreError: 9212:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:04:03 oss1 kernel: LustreError: 9262:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:04:08 oss1 kernel: LustreError: 9162:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:04:15 oss1 kernel: LustreError: 9271:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:04:23 oss1 kernel: LustreError: 9191:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30
Feb 5 04:04:32 oss1 kernel: LustreError: 9242:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: rc = -30

The detailed log information I have attached herewith. The attached file contains the continuous /var/log/messages logs separated by *. So kindly give me a solution for this issue. Thanks, Regards, VIJESH E K
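The errno decoding above (rc = -30 meaning EROFS) can be double-checked with a quick sketch; kernel return codes like those in the logs are negative errno values:

```python
import errno
import os

# The OSS logs show "rc = -30"; the positive errno decodes as follows.
print(errno.EROFS)               # 30
print(os.strerror(errno.EROFS))  # "Read-only file system" on Linux
```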
Re: [Lustre-discuss] Lustre 1.8.7 - Setup prototype in Research field - STUCK !
You can get the Oracle downloads from: http://downloads.lustre.org/public/lustre/v1.8/lustre_1.8.7/

Basically, build Lustre for your kernel on the clients, but use the Lustre server kernel on the servers.

Kevin

On Feb 7, 2012, at 9:09 AM, Charles Cummings ccummi...@harthosp.org wrote: Hello everyone, being the local crafty busy admin for a neuroscience research branch, Lustre seems the only way to go; however, I'm a bit stuck and need some thoughtful guidance. My goal is to set up a virtual OS environment which is a replica of our direct-attached storage head node running SLES 11.0 x86 64 (kernel 2.6.27.19-5 default #1 SMP) and our (2) Dell blade clusters running CentOS 5.3 x86 64 (kernel 2.6.18-128.el5 #1 SMP), which I now have running as a) a SLES 11 same-kernel MDS, b) a SLES 11 same-kernel OSS, and c) a CentOS 5.3 x86 64 same-kernel client, and then get Lustre running across it. The trouble began when I was informed that the Lustre rpm kernel numbers MUST match the OS kernel number EXACTLY, due to modprobe errors and mount errors on the client, and some known messages on the servers after the rpm installs. My only direct access to Oracle Lustre downloads is through another person with an Oracle ID who's not very willing to help, i.e. this route is painful. So to explain why I'm stuck:
a) access to Oracle downloads is not easy
b) there is so much risk in altering kernels, given all the applications and the stability of the environment; you could literally trash the server and spend days recovering, in addition to it being the main storage/resource for research
c) I can't seem to find, after looking, Lustre RPMs that match my kernel environment specifically, i.e. the SLES 11 AND CentOS 5.3 kernels
d) I've never created rpms for a specific kernel version, and that would be a deep dive into new territory and frankly another gamble
What's the least painful and least risky way to get Lustre working in this prototype, which will then lend itself to production (equally least painful), given these statements - Help!
Cliff, I could use some details on how specifically whamcloud can fit this scenario - and thanks for all the enlightenment. Thanks for your help, Charles
Re: [Lustre-discuss] OSS Nodes Fencing issue in HPC
As I replied earlier, those slow messages are often a result of memory allocations taking a long time. Since zone_reclaim shows up in many of the stack traces, that still appears to be a good candidate. Did you check /proc/sys/vm/zone_reclaim_mode, and was it 0? Did you change it to 0 and still have problems? The same situation that causes the Lustre threads to be slow can also stall the heartbeat processes. Did you increase the heartbeat deadtime timeout value?

Kevin

On Jan 27, 2012, at 1:42 AM, VIJESH EK wrote: Dear Sir, I have attached the /var/log/messages from the OSS node. Please go through the logs and kindly give me a solution for this issue. Thanks, Regards, VIJESH E K, HCL Infosystems Ltd., Chennai-6, Mob: +91 99400 96543

On Mon, Jan 23, 2012 at 12:03 PM, VIJESH EK ekvij...@gmail.com wrote: Hi, I hope all of you are in good spirits. We have four OSS servers; OSS1 to OSS4 are clustered with each other. The nodes are clustered as OSS1 with OSS2, and OSS3 with OSS4. It was configured six months back, and from the beginning it has been creating an issue: one of the nodes fences the other node, and it goes into the shutdown state. This problem may happen every two to three weeks. /var/log/messages continuously shows errors like "slow start_page_write 57s due to heavy IO load". Can anybody help me with this issue? Thanks, Regards, VIJESH E K
Re: [Lustre-discuss] How to write more databytes on a file if ost is full
Yes, this is the expected behavior. Lustre is (still) unable to change the static stripe information after a file is created, so once a file is allocated on an OST, if that OST becomes full, Lustre will not be able to grow the file regardless of the space available on other OSTs. The workaround for this issue is to cp the file to a temporary name on the same file system, where it is likely to be allocated on the new OST with free space, and then rename the new file over the old one. Now repeat until you have achieved the desired balance of free space. lfs_migrate is a tool that automates this process somewhat. See http://wiki.lustre.org/manual/LustreManual20_HTML/UserUtilities_HTML.html#50438206_42260

Kevin

On Jan 26, 2012, at 7:09 AM, Eudes wrote: Hello, I use Lustre 1.8.5 on Debian. Say I have one OST with 1 TB and 100 MB free, and I add a new OST with 1 TB. On the first OST, I want to append new data bytes to a file (fseek to the end); if I want to add 500 MB, it fails because Lustre can't write the remaining 400 MB to the new OST. So my questions are: Is there a solution in Lustre (2.0?)? Do other clusters have this problem? Thanks
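The copy-then-rename workaround can be sketched as follows. The file path here is hypothetical, and plain file operations are used so the sketch runs on any filesystem; on Lustre you would typically create the copy with lfs setstripe first so the new object is allocated on OSTs with free space:

```python
import os
import shutil
import tempfile

# Hypothetical file standing in for one stuck on a full OST.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "stuck_file")
with open(path, "w") as fh:
    fh.write("payload")

# A fresh copy gets a fresh object layout (on Lustre: new OST allocation,
# optionally steered with "lfs setstripe" before the copy).
shutil.copy2(path, path + ".migrating")

# Atomic rename replaces the original with no window where the file is missing.
os.rename(path + ".migrating", path)

with open(path) as fh:
    print(fh.read())
```

Note that, as with lfs_migrate itself, this is only safe for files that are not in active use while being copied.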
Re: [Lustre-discuss] OSS Nodes Fencing issue in HPC
Well, it sounds like an issue with your HA package configuration. Likely one node is not being responsive enough to heartbeat/are-you-alive messages, so the other node assumes it has died. This is likely fixed by increasing the deadtime parameter in your HA configuration (try 180 seconds if it is smaller than that). Hard to say, as you omitted any logs, and you didn't even say what HA package you are using. You also didn't indicate which Lustre version you are using.

One of the likely candidates for those messages is the kernel having difficulty allocating memory. On many kernels, if /proc/sys/vm/zone_reclaim_mode is not 0, memory allocations can take a long time, as the kernel keeps looking for the best pages to free until pages in the local NUMA node are available. With the Lustre 1.8.x write cache, the memory pressure is substantial (in 1.6.x and earlier, the service threads had statically-allocated buffers, but starting with 1.8.x each incoming request allocates new pages and frees them back to the page cache).

Kevin

On Jan 22, 2012, at 11:33 PM, VIJESH EK wrote: Hi, I hope all of you are in good spirits. We have four OSS servers; OSS1 to OSS4 are clustered with each other. The nodes are clustered as OSS1 with OSS2, and OSS3 with OSS4. It was configured six months back, and from the beginning it has been creating an issue: one of the nodes fences the other node, and it goes into the shutdown state. This problem may happen every two to three weeks. /var/log/messages continuously shows errors like "slow start_page_write 57s due to heavy IO load". Can anybody help me with this issue? Thanks, Regards, VIJESH E K
Re: [Lustre-discuss] Lustre 1.8.7 kernel patches for SLES11
I don't know why it would have been removed. I find the sd_iostats very useful. It provides stats for any sd disk, so it applies if you are using SCSI or SAS, and SATA in SCSI-emulation mode (i.e. not if your drives show up as IDE /dev/hd*, but yes if they show up as /dev/sd*).

Kevin

On Dec 21, 2011, at 9:46 AM, Charland, Denis wrote: Is there any good reason why sd_iostats-2.6.32-vanilla.patch has been removed from lustre/kernel_patches/series/2.6-sles11.series in Lustre 1.8.7? I found that it was removed as part of “b=23988 Remove sd iostats patch from sles11 patch series”. I’m using this patch series to patch kernel 2.6.32.19-163 in Fedora 12. Should I avoid applying this patch when building the patched kernel? Does this patch apply to SCSI disks only, or does it apply to other types of disks (SAS/SATA) too? Denis Charland, UNIX Systems Administrator, National Research Council Canada
Re: [Lustre-discuss] Are there recommended CPUs for Lustre servers?
Not sure how much this has improved with the CPU-scaling work in 2.x, but in general faster processors are much better than more processors. Peak performance has been in the range of 4-8 cores, with performance dropping after that due to lock contention. 12 cores/node should still be fine, but certainly a faster-per-core quad-core is likely preferable to a hex-core CPU. OSS nodes need a good IO/memory subsystem most. Bull used some large NUMA machines, but there are additional complications using, e.g., multiple IB HCAs for performance, so generally the 2-socket range is optimal.

Kevin

On Dec 6, 2011, at 11:57 PM, Oleg Drokin wrote: Hello! On Dec 6, 2011, at 3:44 PM, Sebastian Gutierrez wrote: Are there any recommendations on whether or not to use 6-core Intel CPUs for the Lustre OSS or MDS nodes? While on the MDS you do want as powerful a machine as you can get, since there is only one, I think the CPU is not a bottleneck on OSSes. Of course you can still install faster CPUs on OSSes, but I think your money would be better spent on memory instead. Bye, Oleg -- Oleg Drokin, Senior Software Engineer, Whamcloud, Inc.
Re: [Lustre-discuss] SNS Status
Non-existent.

On Dec 7, 2011, at 5:55 AM, Yuri wrote: Hi guys, could someone please tell me the current status of SNS (in particular RAID-1)? Thanks in advance.
Re: [Lustre-discuss] OST size limitation
On Nov 3, 2011, at 12:40 PM, Andreas Dilger wrote: Not only is the seeking evil (talk to Kevin if you want to run 24TB OSTs on flash :-), but the 512-byte sector offset added by the partition table will cause all IO to be misaligned to the underlying device.

It is possible to align partition boundaries, but it is not the default. Partitions (if used) should normally be aligned to a multiple of the RAID stripe size, although note that some RAID controllers internally compensate for the expected misalignment. See http://wikis.sun.com/display/Performance/Aligning+Flash+Modules+for+Optimal+Performance

Even with flash storage it is much better to align the IO on power-of-two boundaries, since the erase blocks cause extra latency if there are read-modify-write operations.

That also depends on the flash. The Fusion-io products have no alignment issues.

Kevin
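The alignment arithmetic here is easy to check. A minimal sketch, assuming 512-byte sectors and a 512KB RAID full-stripe width (both assumptions, not values from the thread):

```python
SECTOR = 512            # bytes per sector (assumed)
STRIPE = 512 * 1024     # assumed RAID full-stripe width in bytes

def partition_aligned(start_sector, stripe=STRIPE):
    """True if a partition starting at this sector begins on a stripe boundary."""
    return (start_sector * SECTOR) % stripe == 0

# Legacy DOS partition tables started the first partition at sector 63,
# so every full-stripe IO straddles two stripes (forcing read-modify-write).
print(partition_aligned(63))    # False
print(partition_aligned(1024))  # True: 1024 * 512 bytes = 512KB boundary
```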
Re: [Lustre-discuss] OST size limitation
On Nov 2, 2011, at 1:48 PM, Charland, Denis wrote: I read in the Lustre Operations Manual that there is an OST size limitation of 16 TB on RHEL and 8 TB on other distributions because of the ext3 file system limitation. I have a few questions about that. Why is the limitation 16 TB on RHEL?

16TB is the maximum size Red Hat supports. See http://www.redhat.com/rhel/compare/ Larger than that requires bigger changes. Note that Whamcloud's 1.8.6-wc1 claimed support for 24TB LUNs (but see http://jira.whamcloud.com/browse/LU-419 ). Whamcloud's Lustre 2.1 (not sure you'd want to use it) claims support for 128TB LUNs.

I plan to use Lustre 1.8.5 on Fedora 12 for a new Lustre file system. What will be the OST size limitation? What is the OST size limitation when using ext4?

16TB with the Lustre-patched RHEL kernel.

Is it preferable to use ext4 instead of ext3? If the block device has more than 8 TB or 16 TB, it must be partitioned. Is there a performance degradation when a device has multiple partitions compared to a single partition? In other words, is it better to have three 8 TB devices with one partition per device than to have one 24 TB device with three partitions?

Better to have 3 separate 8TB LUNs. Different OSTs forcing the same drive heads to move to opposite parts of the disk does degrade performance (with a single OST moving the drive heads, the block allocator tries to minimize movement).

Denis Charland, UNIX Systems Administrator, National Research Council Canada
Re: [Lustre-discuss] Anybody have a client running on a 2.6.37 or later kernel?
Why not use the RHEL6 kernel on RHEL5? That's probably much easier.

Kevin

On Oct 21, 2011, at 9:50 PM, Carlson, Timothy S timothy.carl...@pnnl.gov wrote: Folks, I've got a need to run a 2.6.37 or later kernel on client machines in order to properly support AMD Interlagos CPUs. My other option is to switch from RHEL 5.x to RHEL 6.x and use the Whamcloud 1.8.6-wc1 patchless client (the latest RHEL 6 kernel also supports Interlagos). But I would first like to investigate using a 2.6.37 or later kernel on RHEL 5. I have a running kernel, and started down the path of building Lustre against 2.6.37.6, but ran into the changes that have been made with respect to ioctl(), proc structures, etc. I am *not* a kernel programmer and would rather not mess around too much in the source. So I am asking if anyone has successfully patched up Lustre to get a client working with 2.6.37.6 or later. Thanks! Tim
Re: [Lustre-discuss] MDS network traffic question
I would replace the 1GigE with the 10GigE: have all Ethernet traffic go over the 10GigE links, rather than add another tcp1 network for Lustre. This will keep your configuration much simpler and make the migration as painless as possible (just move the IP address to the 10GigE port on the servers). The MDS traffic _volume_ is much lower than it is for the OSS nodes. The big win from 10GigE would be the lower latency: if you approach 100MB/s of MDS traffic, you have much bigger problems than a 10GigE NIC can solve.

Kevin

On Oct 11, 2011, at 3:54 PM, James Robnett wrote: We have a small Lustre install consisting of an MDS and 5 OSS servers. Historically the MDS and OSS servers had both a 1Gbit Ethernet interface (tcp0) to workstations and a QDR IB interface (ib0) to our cluster. We're planning on adding an MTU-9000 10Gbit Ethernet (tcp1) interface to the MDS and OSS nodes and workstations for faster access. Our software has a pretty high IO-to-CPU component. I just discovered that our MDS can't in fact take another PCIe 8x card, but it does have a spare GigE port. The 10Gbit Ethernet switch can support 1Gbit and 10Gbit interfaces. We'd then have 3 networks:
tcp0 at 1Gbit to slow clients
tcp1 at 10Gbit to faster clients
ib0 to the cluster
My question is: is there a risk of congestion or of overrunning that 2nd GigE MDS interface if our workstations and OSS servers communicate over tcp1 at 10Gbit but the MDS tcp1 is connected at 1Gbit? The bulk of our traffic will continue to be between the cluster and Lustre over IB, but the workstations can trivially overrun Ethernet, hence the desire for 10Gbit between them and the OSSes. My gut feeling is it should be fine, particularly with the larger MTU; there's not that much traffic to the MDS, but I'd easily believe it if somebody said it's a risky thing to do. The alternative is to buy a new MDS and swap disks into it.
James Robnett, National Radio Astronomy Observatory, Array Operations Center
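For reference, keeping a single Ethernet LNET network as suggested above means the LNET module options stay unchanged when the IP moves to the 10GigE port. A typical /etc/modprobe.d/lustre.conf sketch, where the interface names are illustrative assumptions, not values from this thread:

```
# interface names are assumptions; adjust to your hardware
options lnet networks="tcp0(eth0),o2ib0(ib0)"
```

Only the IP address assignment moves from the 1GigE NIC to the 10GigE NIC; no tcp1 network or client remount is needed.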
Re: [Lustre-discuss] Question about setting max service threads
Andreas answered the question asked, and did an excellent job. But to answer the unasked question of whether reducing the thread count will really fix the problem: this is often NOT caused by mere disk overload from too many service threads. For example, one recent issue was tracked down to free space allocation times being quite large, due to free space bitmaps needing to be read from disk. It has also been common for memory allocations to be the major time sink, as with Lustre 1.8 the service threads no longer reuse the buffer and have to allocate new memory on every request (NUMA zone allocations were especially problematic; apparently the best pages to free have a tendency of being found on the wrong NUMA node, so it took a lot of time/work to free up space on the local NUMA node to allow the allocation to succeed). Bug 23826 had patches to track service times better, which will help you see how much of an issue this really is. See also Bug 22516, which strives to normalize server threads per OST, rather than per server. Bug 22886 discusses issues with the elevator taking 1MB IOs and converting them into odd sizes, which depending on the array could also have an impact on IO. Bug 23805 has some additional rambling along this line as well.

Kevin

On Aug 15, 2011, at 6:36 PM, Andreas Dilger adil...@whamcloud.com wrote: On 2011-08-15, at 3:58 PM, Mike Hanby wrote: Our OSS servers are logging quite a few "heavy IO load" messages, combined with system load (via 'uptime') being reported in the 100s to several 100s range:
Lustre: lustre-OST0004: slow commitrw commit 191s due to heavy IO load
Aug 15 13:00:38 lustre-oss-0-2 kernel: Lustre: Service thread pid 17651 completed after 236.04s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0004: slow commitrw commit 191s due to heavy IO load
Lustre: Service thread pid 16436 completed after 210.17s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
I'd like to test setting ost_io.threads_max to values lower than 512. Question 1: will this command survive a reboot: lctl set_param ost.OSS.ost_io.threads_max=256

This is only a temporary setting.

or do I need to also run lctl conf_param ost.OSS.ost_io.threads_max=256?

The conf_param syntax is (unfortunately) slightly different from the set_param syntax. You can also set this in /etc/modprobe.d/lustre.conf:
options ost oss_num_threads=256
options mds mds_num_threads=256

Question 2: since Lustre does not reduce the number of service threads in use, is there any way I can force the extra running service threads to exit, or is a reboot of the OSS servers the only clean way?

I had written a patch to do this, but it hasn't landed yet. Currently the only way to limit the thread count is to set it before the number of running threads has exceeded the maximum thread count.

Cheers, Andreas -- Andreas Dilger, Principal Engineer, Whamcloud, Inc.
Re: [Lustre-discuss] inconsistent client behavior when creating an empty directory
This appears to be the same issue as https://bugzilla.lustre.org/show_bug.cgi?id=23459

Kevin

Andrej Filipcic wrote: Hi, the following code does not work as expected:

    #define _GNU_SOURCE
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char** argv)
    {
        int rc;
        rc = mkdir(argv[1], S_IRWXU);
        if (rc) perror("failed create dir");
        chown(argv[1], 4103, 4100);
        struct stat buf;
        /* stat(argv[1], &buf); */
        setresuid(0, 4103, 4100);
        rc = mkdir(argv[1], S_IRWXU);
        if (rc) perror("failed create dir as user");
    }

Initial status:

    # ls -ld /lustre/test
    drwxr-xr-x 2 root root 4096 Aug 9 14:59 /lustre/test
    # ls -l /lustre/test
    total 0

1) Running the test program:

    # /tmp/test /lustre/test/testdir
    failed create dir as user: Permission denied
    # ls -l /lustre/test
    total 4
    drwx------ 2 griduser03 grid 4096 Aug 9 15:02 testdir

(griduser03, grid correspond to uid=4103, gid=4100)

2) Running the test program, but with the stat call uncommented:

    # /tmp/test /lustre/test/testdir
    failed create dir as user: File exists
    # ls -l /lustre/test
    total 4
    drwx------ 2 griduser03 grid 4096 Aug 9 15:04 testdir

The code first makes the testdir as root and changes the ownership to uid 4103. Then it tries to (re)create the same dir with the user's privileges. If stat is called, the code behaves as expected (case 2), but if not (case 1), the second mkdir returns EACCES when it should return EEXIST. Is this behavior expected, or is it a client bug? The client runs Lustre 1.8.6. The code just illustrates what is actually used in a complex software package. Andrej
Re: [Lustre-discuss] [bug?] mdc_enter_request() problems
chas williams - CONTRACTOR wrote: On Mon, 08 Aug 2011 12:03:25 -0400, chas williams - CONTRACTOR c...@cmf.nrl.navy.mil wrote: later mdc_exit_request() finds this mcw by iterating the list. Seeing as mcw was allocated on the stack, I don't think you can do this. mcw might have been reused by the time mdc_exit_request() gets around to removing it.

Never mind. I see this has apparently been fixed in later releases (I was looking at 1.8.5). If l_wait_event() returns early (like from being interrupted), mdc_enter_request() does the cleanup itself now.

That code is unchanged in 1.8.6.

Kevin
Re: [Lustre-discuss] Moving storage from one OSS to another
Rafa Griman wrote:
Hi all :)

Got a customer (It is quite bad form to ask for help for a commercial deal without using your work email address. I assume you still work for Bull?) with:
- 1 x S2A9900 (one couplet)
- 500 x 2 TB drives
- 4 OSS

This customer wants to add:
- 1 x S2A9900 (one couplet)
- 300 x 2 TB drives
- 4 OSS

I know we could just add the new storage and restripe the existing files. But the customer wants to physically move 100 drives from the existing S2A9900 to the new one, in order to have 400 drives on one S2A9900 and 400 drives on the other. So my questions are:

1.- Can this be done (move drives between OSTs) without losing data?

First, make sure DDN supports moving drives _data intact_. The drives in each tier will at least have to end up as an ordered tier for this to work, but I would guess it isn't quite so simple.

2.- Can an OST be moved from one OSS to another without losing data?

Yes, IF it is done properly.
Step 1: back up all essential data.
Step 2: add the _new server_ as a failover node for the OSTs being moved. Note that this requires a tunefs.lustre --writeconf (with new parameters on the OSTs being moved) and re-mounting all the servers.
Step 3: move the drives.
Step 4: update the server/failover NIDs, removing the old ones with another writeconf pass.

I _strongly_ recommend that you verify their Lustre support contract is current, and that you do a dry run in a test environment before doing it live (if nothing else, you do have 4 servers and a new couplet to play with).

3.- Has anyone done this? How? I imagine it can be done restriping files/migrating files within the Lustre filesystem, removing empty OSTs, ...

Well, yes, there is that option as well; use lfs_migrate (at least 2 passes).

TIA
Rafa

Kevin
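The writeconf steps above can be sketched as shell commands. Everything concrete here is an illustrative assumption, not taken from the thread: the device name, NIDs, and parameters are placeholders, and the exact flags should be checked against your Lustre release's tunefs.lustre man page.

```shell
# Step 2 sketch: on the current OSS, register the new server as a
# failover node for the OST (hypothetical device /dev/sdb and
# hypothetical NID 192.168.1.12@o2ib), then re-mount the servers:
tunefs.lustre --writeconf --failnode=192.168.1.12@o2ib /dev/sdb

# Step 4 sketch: after the drives are physically moved, rewrite the
# NIDs so the new server is primary, dropping the old entries
# (hypothetical MGS and failover NIDs):
tunefs.lustre --writeconf --erase-params \
    --mgsnode=192.168.1.1@o2ib --failnode=192.168.1.11@o2ib /dev/sdb
```

Note that --erase-params drops all stored parameters, so every parameter worth keeping has to be re-specified on that command line.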
Re: [Lustre-discuss] Random OST Numbers chosen in a stripe
Johann Lombardi wrote:
On Fri, Jul 29, 2011 at 04:49:28PM -0400, Roger Spellman wrote:
For a different file:
    obdidx   objid    objid   group
        13    6884   0x1ae4       0
        28    6880   0x1ae0       0
        44    6880   0x1ae0       0
        27    6880   0x1ae0       0
Why is this? How can I control it to always be sequential?

It depends on the OST usage imbalance; you can tune the stripe allocation policy with qos_threshold_rr. For more information, please refer to the Lustre manual: http://wiki.lustre.org/manual/LustreManual20_HTML/LustreProc.html#50438271_pgfId-1296529

Cheers, Johann

Also note that newer versions of Lustre sort the OST list even in RR mode, so that it will not allocate successive objects from the same OSS node.

Kevin
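A hedged sketch of the tuning Johann points at. The parameter paths below are assumptions to verify against your Lustre version's manual (in 1.8.x the value lives under the MDS's lov directory in /proc):

```shell
# Inspect the current QOS/round-robin threshold on the MDS
# (path is version-dependent; shown here as found on 1.8.x systems):
cat /proc/fs/lustre/lov/*/qos_threshold_rr

# Raise the threshold so round-robin allocation stays in effect until
# OST usage imbalance exceeds 50% (value syntax may be "50" or "50%"
# depending on release -- check the manual section Johann linked):
lctl set_param lov.*.qos_threshold_rr=50
```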
Re: [Lustre-discuss] Failover / reliability using SAS direct-attached storage
Mark Hahn wrote:
It seems an external fibre or SAS RAID is needed; to be precise, a redundant-path SAN is needed.

you could do it with commodity disks and GbE, or you can spend almost unlimited amounts on gold-plated disks, FC switches, etc.

Many deployments are done without redundant paths, which offer additional insurance but are not required.

the range of costs is really quite remarkable, I guess O(100x). compare this to cars where even VERY nice production cars are only a few times more expensive than the most cost-effective ones.

You're comparing two mass-market cars: there is a nearly 1000x difference in price between a cheap dune buggy and a Bugatti, but both provide transportation for 1-2 people.

as the idea of losing the file system if one node goes down doesn't seem good, even if temporary.

The clients should just hang on the file system until the server is available again. This is not so different from using NFS with hard mounts. Note that even with failover, the Lustre file system will be down for several minutes, as the HA package has to first detect a problem, then safely start Lustre on the backup server, and then Lustre recovery has to occur.

how often do you expect nodes to fail, and why?

regards, mark hahn.
Re: [Lustre-discuss] Failover / reliability using SAS direct-attached storage
Tyler Hawes wrote:
Apologies if this is a bit newbie, but I'm just getting started, really. I'm still in the design / testing stage and looking to wrap my head around a few things. I'm most familiar with Fibre Channel storage. As I understand it, you configure a pair of OSSes per OST, one actively serving it, the other passively waiting in case the primary OSS fails. Please correct me if I'm wrong...

No, that's basically it. Lustre works well with FC storage, although a full SAN configuration (redundant switch fabrics) is not often used: with only 2 servers needing access to each LUN, and bandwidth to storage being key, servers are most often directly attached to the FC storage, with multiple paths to handle controller/path failure and improve bandwidth.

But to clarify one point, Lustre is not waiting passively on the backup server. Lustre can only be active on one server for a given OST at a time. Some high-availability package, external to Lustre, is responsible for ensuring Lustre is active on exactly one server (i.e., the OST is mounted on one server). Heartbeat was quite popular, but more people have been moving to more modern packages like Pacemaker. It is left to the HA package to perform failover as necessary, even though most HA packages do not perform failover by default if the network or back-end storage link goes down (which is where bonded networks and storage multipath come in).

With SAS/SATA direct-attached storage (DAS), though, it's a little less clear to me. With SATA, I imagine that if an OSS goes down, all its OSTs go down with it (whether they be internal or externally mounted drives), since there is no multipathing. Also, I suppose I'd want a hardware RAID controller PCIe card, which would also preclude failover, since it's not going to have its cache and configuration mirrored in another OSS's RAID card.

Normally, yes.
Sun shipped quite a bit of Lustre storage with failover using SATA in external enclosures (J4400), but that was special in that there were 2 SAS expanders per enclosure, and each drive was connected to a SATA MUX to allow both servers access to the SATA drives.

I am glad you understand the hazards of connecting two servers to external storage using internal RAID controllers. Until a RAID card is developed specifically with that in mind (and which strictly uses a write-through cache), it is a very bad idea. [For others: consider what would happen to the file system if the RAID card has a battery-backed cache with a bunch of pending writes that get replayed at some point _after_ the other server completes recovery.]

If you are using a SAS-attached external RAID enclosure, then it is not much different than using an FC-attached RAID. I.e., the direct-attached ST2530 (SAS) can be used in place of a direct-attached ST2540 (FC), with the only architecture change being the use of SAS cards/cables instead of FC cards/cables. The big difference between SAS and FC is that people are not (yet) building SAS-based SANs. Already many FC arrays have moved to SAS drives on the back end. http://www.oracle.com/us/products/servers-storage/storage/disk-storage/sun-storage-2500-m2-array-407918.html

With SAS, there seems to be a new way of doing this that I'm just starting to learn about, but it is still a bit fuzzy to me. I see that with things like Storage Bridge Bay storage servers from the likes of Supermicro, there is a method of putting two server motherboards in one enclosure, having an internal 10GigE link between them to keep cache coherency, some sort of software layer to manage that (?), and then you can use inexpensive SAS drives internally and through external JBOD chassis. Is anyone using something like this with Lustre?
Some people have used (or at least toyed with using) DRBD and Lustre, but I would not say it is fast, recommended, or a mainstream Lustre configuration. But that is one way to replicate internal storage across servers to allow Lustre failover. With SAS drives in an external enclosure, it is possible to configure shared storage for use with Lustre, although if you are using a JBOD rather than a RAID controller, there are the normal issues (the Linux SW RAID/LVM layers are not clustered, so you have to ensure they are only active on one node at a time).

Or perhaps I'm not seeing the forest for the trees, and Lustre has software features built in that negate the need for this (such as parity of objects at the server level, so you can lose an OSS with N+1 redundancy)? Bottom line, what I'm after is figuring out what architecture works with inexpensive internal and/or JBOD SAS storage that won't risk data loss with the failure of a single drive or server RAID array...

Lustre does not support redundancy in the file system. All data availability is through RAID protection, combined with server failover. With internal storage, you lose the failover part. Sun also delivered quite a bit of
Re: [Lustre-discuss] multipathd or sun rdac driver?
David Noriega wrote:
We already use multipathd in our install, but this was something I wondered about. We use Sun disk arrays, and they mention the use of their RDAC driver for multipathing on Linux. Since it's from the vendor, one would think it would be better. What does the collective think? Sun StorageTek RDAC Multipath Failover Driver for Linux: http://download.oracle.com/docs/cd/E19373-01/820-4738-13/chapsing.html
David

I assume you are using the ST25xx or ST6xxx storage with Lustre? Exactly which arrays? I've been happy with RDAC, but I don't think Oracle has released RHEL6 support yet (but Oracle also does not support Lustre servers on RHEL6 yet). If your multipath config is working (i.e., you've tested it by unplugging/replugging cables under load and were happy with the behavior), I'm not going to tell you to change.

Kevin
Re: [Lustre-discuss] multipathd or sun rdac driver?
Yes, the controllers are active/passive, so while both controllers export each LUN, only the LUN on the active controller can be used. In the event of a path or controller failure, RDAC will migrate the LUN so that it is active on the working controller/path. Seeing those errors indicates either that your multipath driver doesn't properly support active/passive (asymmetric) multipath, or that there is a configuration issue. I believe some firmware versions allow you to enable automatic failover, where the LUN is migrated on access; this was meant to work around multipath drivers that don't migrate the LUN, but it will perform very poorly if more than one path is used. Note that it is also possible to have multiple paths to each controller, which can also be load-balanced or zoned (more useful for, e.g., the ST6780). [If you want to experience pain, access a LUN from two hosts at the same time, with each host connected to a different controller. It will work, but be slow, kind of like reading two CDs at the same time in a CD changer.]

Kevin

David Noriega wrote:
They are 2540s and I'm running EL5 (CentOS). Well, the thought came around since I had to rebuild a node after a hardware problem, so I went ahead and gave it a shot. I think I posted about this problem before somewhere on the mailing list: getting stray I/O errors for /dev/sdX devices that were the other path to the same device (well, that's the conclusion we came to). After installing the Sun RDAC module and disabling multipathd, I can happily say those messages are gone, so I suppose Sun's module is able to talk to the disk array better than multipathd does. Though I haven't failed the Lustre OSTs back to this particular node just yet (will wait till the weekend). I'll post again if anything goes wrong, but I think going with the RDAC module might be better.
ps: One thing that has nagged me since Lustre was installed and set up by a vendor: the disk arrays were never set up with initiators or hosts in the configuration (using CAM). We have another similar disk array (6140) we set up for another filesystem, and I know initiators/hosts were set up on that array. I can't say that this has caused any problems, but it's something in the back of my mind.

Thanks, David
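For sites staying with dm-multipath rather than RDAC on these LSI/Engenio-based arrays, the usual approach is an explicit device stanza in multipath.conf using the RDAC hardware handler. The fragment below is a hedged sketch under stated assumptions: the vendor/product strings (LCSM100_F is what the FC ST2540 typically reports) and the EL5-era callout path should be verified against your array and your distribution's multipath documentation before use.

```shell
# Illustrative /etc/multipath.conf device stanza (verify strings and
# options against your array and multipath-tools version):
cat >> /etc/multipath.conf <<'EOF'
devices {
    device {
        vendor                 "SUN"
        product                "LCSM100_F"   # FC ST2540; SAS ST2530 reports LCSM100_S
        hardware_handler       "1 rdac"      # engage the RDAC path handler
        path_checker           rdac          # RDAC-aware path health checks
        path_grouping_policy   group_by_prio
        prio_callout           "/sbin/mpath_prio_rdac /dev/%n"
        failback               immediate
        no_path_retry          30
    }
}
EOF
service multipathd reload
```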
Re: [Lustre-discuss] how to baseline the performance of a Lustre cluster?
Tim Carlson wrote:
On Fri, 15 Jul 2011, Theodore Omtzigt wrote:
To me it looks very disappointing, as we can get 3GB/s from the RAID controller aggregating a collection of raw SAS drives on the OSTs, and we should be able to get a peak of ~5GB/s from QDR IB. First question: is this baseline reasonable?

For starters, the theoretical peak of QDR IB is 4GB/s in terms of moving real data. 40Gb/s is the signaling rate, and you need to factor in the 8b/10b encoding. So your 40Gb/s becomes 32Gb/s right off the bat.

Yes, the (unidirectional) bandwidth of QDR 4x IB is 4GB/s, including headers, due to the InfiniBand 8b/10b encoding. This is the same (raw) data rate as PCIe gen2 x8 (which also uses 8b/10b encoding, transmitting 10 bits for every 8-bit byte). Interestingly, the upcoming InfiniBand FDR moves to 64b/66b encoding, which eliminates most of the link overhead. [8b/10b encoding exists 1) to ensure an equal number of 1 and 0 bits on the wire, and 2) to set a small upper bound on the number of sequential 1 or 0 bits. With 64b/66b there can now be something like 65 bits in a row with the same value, which makes it more susceptible to clock-skew issues, although the claim is that in practice the run length is much smaller, as a scrambler is used to randomize the actual bits, and the sequences that correspond to 64 1's or 64 0's will never be used. So the wrong data pattern could cause more problems.]

To clarify, this 4GB/s is reduced to around 3.2GB/s of data primarily due to the smaller packet size of PCIe (256 bytes), where the headers consume quite a bit of the bandwidth, or somewhat less when using 128-byte PCIe packets. While MPI can achieve 3.2GB/s data rates, I have never seen an o2ib lnet get that high. As I recall, something around 2.5GB/s is more typical.

Now try to move some data with something like mpi_send and you will see that the real amount of data you can send is really more like 24Gb/s or 3GB/s. The test size for ost_survey is pretty small: 30MB.
You can increase that with the -s flag; try at least 100MB. You should also turn off checksums to test raw performance. There is an lctl conf_param to do this, but the quick and dirty route on the client is the following bash:

    for OST in /proc/fs/lustre/osc/*/checksums; do
        echo 0 > $OST
    done

For comparison's sake, on my latest QDR-connected Lustre file system, with LSI 9285-8e controllers connected to JBODs of slow disks in 11-disk RAID6 stripes, I get around 500MB/s write and 350MB/s read using ost-survey with 100MB data chunks. Your numbers seem reasonable.

Tim

Theodore,

You have jumped straight to testing Lustre over the network, without first providing performance numbers for the disks when locally attached. (You also didn't test the network, but in the absence of bad links, GigE and IB are less variable and well understood.) As for the disk performance, were you able to measure 3GB/s from the RAID controller, or what is that number based on? What was the performance of an individual LUN (or whatever backs your OST)? Are all the OSTs on a single server, and are you testing them one at a time? You should be able to get 100+MB/s over GigE, although you may need 2 OSTs to do that, and larger IO sizes. Similarly, if you access multiple OSTs simultaneously, you should see 2GB/s over o2ib. At least, I am assuming you are using o2ib and not just tcp over InfiniBand, which would be slower.

Kevin
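The encoding arithmetic in the thread is easy to check with plain shell integer math; the input numbers are the ones quoted above (40Gb/s QDR signaling, 8b/10b vs. 64b/66b encoding):

```shell
signal_gbps=40                            # QDR 4x signaling rate
data_gbps=$(( signal_gbps * 8 / 10 ))     # 8b/10b: 8 data bits per 10 line bits
data_GBps=$(( data_gbps / 8 ))            # bits -> bytes
echo "QDR 4x data rate: ${data_gbps} Gb/s = ${data_GBps} GB/s"

# Encoding overhead in tenths of a percent (integer math):
overhead_8b10b=$(( 1000 - 1000 * 8 / 10 ))     # 200 -> 20.0%
overhead_64b66b=$(( 1000 - 1000 * 64 / 66 ))   # 31 -> ~3.1%
echo "overhead: 8b/10b ${overhead_8b10b}/1000, 64b/66b ${overhead_64b66b}/1000"
```

This reproduces Tim's "40Gb/s becomes 32Gb/s" and shows why FDR's 64b/66b encoding eliminates most of the link overhead.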
Re: [Lustre-discuss] how to add force_over_8tb to MDS
With one other note: you should have used --mkfsoptions='-t ext4' when doing mkfs.lustre, and NOT the force option. Given that it is already formatted and you don't want to lose data, at least use the ext4 Lustre RPMs. Pretty sure you don't need a --writeconf -- you would either run as-is with ext4-based ldiskfs, or reformat. The MDT device should be limited to 8TB; I don't think anyone has tested a larger MDT.

Kevin

Cliff White wrote:
The error message you are seeing is what Andreas was talking about: you must use the ext4-based version, and then you will not need any force option with your size LUNs. The 'must use force_over_8tb' error is the key here; you most certainly want/need the *.ext4.rpm versions of everything.
cliffw

On Thu, Jul 14, 2011 at 11:10 AM, Theodore Omtzigt t...@stillwater-sc.com wrote:
Michael: The reason I had to do it on the OSTs is that, when issuing the mkfs.lustre command to build the OST, it would error out with the message that I should use the force_over_8tb mount option. I was not able to create an OST on that device without the force_over_8tb option. Your insights on the writeconf are excellent: good to know that writeconf is solid. Thank you. Theo

On 7/14/2011 1:29 PM, Michael Barnes wrote:
On Jul 14, 2011, at 1:15 PM, Theodore Omtzigt wrote:
Two-part question: 1- do I need to set that parameter on the MGS/MDS server as well?

No, they are different filesystems. You shouldn't need to do this on the OSTs either. You must be using an older Lustre release.

2- if yes, how do I properly add this parameter on this running Lustre file system (100TB on 9 storage servers)? I can't resolve the ambiguity in the documentation, as I can't find a good explanation of the configuration-log mechanism referenced in the man pages. The fact that the doc for --writeconf states "This is very dangerous" makes me hesitant to pull the trigger, as there is 60TB of data on this file system that I'd rather not lose.
I've had no issues with writeconf. It's nice because it shows you the old and new parameters. Make sure that the changes you made are what you want, and that the old parameters you want to keep are still intact. I don't remember the exact circumstances, but I've found that settings were lost when doing a writeconf, and I had to explicitly pass those settings to the tunefs.lustre command to preserve them.

-mb

--
Michael Barnes
Thomas Jefferson National Accelerator Facility
Scientific Computing Group
12000 Jefferson Ave.
Newport News, VA 23606
(757) 269-7634

--
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com
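Kevin's recommendation can be sketched as a formatting command. The device name, fsname, and MGS NID below are placeholders for illustration, and the command assumes the ext4-based ldiskfs RPMs are installed:

```shell
# Format a >8TB OST with ext4-based ldiskfs instead of forcing
# ext3 past its limit (hypothetical device, fsname, and MGS NID):
mkfs.lustre --fsname=testfs --ost \
    --mgsnode=10.0.0.1@tcp \
    --mkfsoptions='-t ext4' \
    /dev/sdb
```

With the ext4 RPMs and '-t ext4', no force_over_8tb option is needed for LUNs of this size.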
Re: [Lustre-discuss] inode tuning on shared mdt/mgs
Andreas Dilger wrote:
On 2011-07-01, at 12:03 PM, Aaron Everett aever...@forteds.com wrote:
I'm trying to increase the number of inodes available on our shared MDT/MGS. I've tried reformatting using the following:
mkfs.lustre --fsname fdfs --mdt --mgs --mkfsoptions=-i 2048 --reformat /dev/sdb
The number of inodes actually decreased when I specified -i 2048 vs. leaving the number at default.

This is a bit of an anomaly in how 1.8 reports the inode count. You actually do have more inodes on the MDS, but because the MDS might need to use an external block to store the striping layout, it limits the returned inode count to the worst-case usage. As the filesystem fills and these external blocks [trying to complete his sentence:] are not used, the free inode count keeps reporting the same number of free inodes as the number of used inodes goes up. It is pretty weird, but it was doing the same thing in v1.6.

We have a large number of smaller files, and we're nearing our inode limit on the MDT/MGS. I'm trying to find a solution before simply expanding the RAID on the server. Since there is plenty of disk space, changing the bytes-per-inode ratio seemed like a simple solution. From the docs: "Alternately, if you are specifying an absolute number of inodes, use the -N <number of inodes> option. You should not specify the -i option with an inode ratio below one inode per 1024 bytes in order to avoid unintentional mistakes. Instead, use the -N option." What is the format of the -N flag, and how should I calculate the number to use?

Thanks for your help!
Aaron
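The -N value (passed through --mkfsoptions to mke2fs) is just the absolute inode count, which can be derived from the device size and the bytes-per-inode ratio you would otherwise give with -i. A sketch of the arithmetic; the 500GB MDT size is a made-up example, not a figure from the thread:

```shell
mdt_bytes=$(( 500 * 1024 * 1024 * 1024 ))   # hypothetical 500GB MDT
ratio=2048                                   # one inode per 2048 bytes
inodes=$(( mdt_bytes / ratio ))
echo "inodes = ${inodes}"
# Would be used as (illustrative):
#   mkfs.lustre --fsname fdfs --mdt --mgs --mkfsoptions="-N ${inodes}" /dev/sdb
```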
Re: [Lustre-discuss] HW RAID - fragmented I/O
It's possible there is another issue, but are you sure you (or RedHat) are not setting CONFIG_SCSI_MPT2SAS_MAX_SGE in your .config, which is preventing it from being set to 256? I don't have a machine using this driver. You could put a #warning in the code to see if you hit the non-256 code path when building, or printk the max_sgl_entries in _base_allocate_memory_pools.

Kevin

Wojciech Turek wrote:
Hi Kevin,
Thanks for a very helpful answer. I tried your suggestion and recompiled the mpt2sas driver with the following changes:

    --- mpt2sas_base.h      2010-01-16 20:57:30.000000000 +0000
    +++ new_mpt2sas_base.h  2011-06-10 12:53:35.000000000 +0100
    @@ -83,13 +83,13 @@
     #ifdef CONFIG_SCSI_MPT2SAS_MAX_SGE
     #if CONFIG_SCSI_MPT2SAS_MAX_SGE < 16
     #define MPT2SAS_SG_DEPTH 16
    -#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 128
    -#define MPT2SAS_SG_DEPTH 128
    +#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 256
    +#define MPT2SAS_SG_DEPTH 256
     #else
     #define MPT2SAS_SG_DEPTH CONFIG_SCSI_MPT2SAS_MAX_SGE
     #endif
     #else
    -#define MPT2SAS_SG_DEPTH 128 /* MAX_HW_SEGMENTS */
    +#define MPT2SAS_SG_DEPTH 256 /* MAX_HW_SEGMENTS */
     #endif
     #if defined(TARGET_MODE)

However, I can still see that almost 50% of writes and slightly over 50% of reads fall under 512K I/Os. I am using device-mapper-multipath to manage active/passive paths; do you think that could have something to do with the I/O fragmentation?

Best regards,
Wojciech

On 8 June 2011 17:30, Kevin Van Maren kevin.van.ma...@oracle.com wrote:
Yep, with 1.8.5 the problem is most likely in the (mpt2sas) driver, not in the rest of the kernel. Driver limits are not normally noticed by (non-Lustre) people, because the default kernel limits IO to 512KB. You may want to see Bug 22850 for the changes required, e.g., for the Emulex/lpfc driver. Glancing at the stock RHEL5 kernel, it looks like the issue is MPT2SAS_SG_DEPTH, which is limited to 128. This appears to be set to match the default kernel limit, but it is possible there is also a driver/HW limit.
You should be able to increase that to 256 and see if it works... Also note that the size buckets are powers of 2, so a "1MB" entry is any IO > 512KB and <= 1MB. If you can't get the driver to reliably do full 1MB IOs, change to a 64KB chunk and set max_sectors_kb to 512. This will help ensure you get aligned, full-stripe writes.

Kevin

Wojciech Turek wrote:
I am setting up a new Lustre filesystem using LSI Engenio based disk enclosures with integrated dual RAID controllers. I configured the disks into 8+2 RAID6 groups using a 128kb segment size (chunk size). This hardware uses the mpt2sas kernel module on the Linux host side. I use the whole block device for an OST (to avoid any alignment issues). When running sgpdd-survey I see high throughput numbers (~3GB/s write, ~5GB/s read), and the controller stats show that the number of IOPS equals the number of MB/s. However, as soon as I put ldiskfs on the OSTs, obdfilter shows slower results (~2GB/s write, ~2GB/s read), and the controller stats show more than double the IOPS vs. MB/s. Looking at the output from iostat -m -x 1 and brw_stats, I can see that a large number of I/O operations are smaller than 1MB, mostly 512kb. I know that there was some work done on optimising the kernel block-device layer to process 1MB I/O requests, and that those changes were committed to Lustre 1.8.5. Thus I guess this I/O chopping happens below the Lustre stack, maybe in the mpt2sas driver? I am hoping that someone in the Lustre community can shed some light on my problem.

In my setup I use:
Lustre 1.8.5
CentOS-5.5

Some parameters I tuned from the defaults in CentOS:
deadline I/O scheduler
max_hw_sectors_kb=4096
max_sectors_kb=1024
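The tunables mentioned in this thread live in the block layer's sysfs queue directory; a small sketch of inspecting them and applying Kevin's fallback. The device name is a placeholder, and the 512KB figure assumes the 64KB-chunk, 8+2 RAID6 layout he describes (8 data disks x 64KB = 512KB full stripe):

```shell
DEV=sdb   # hypothetical OST block device

# What the HW/driver allows vs. what the block layer will actually issue:
cat /sys/block/$DEV/queue/max_hw_sectors_kb
cat /sys/block/$DEV/queue/max_sectors_kb

# Request 1MB IOs (only effective if the driver's SG limit permits it):
echo 1024 > /sys/block/$DEV/queue/max_sectors_kb

# Kevin's fallback: with a 64KB RAID chunk, cap IOs at 512KB so each
# write is one aligned full stripe:
# echo 512 > /sys/block/$DEV/queue/max_sectors_kb
```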
Re: [Lustre-discuss] lustre ofed compatibility
to recompile the Linux kernel with an increased stack size, because Lustre and OFED together may use up the stack (both are stack-greedy) and thus lead to system hangs.

YiLei

On Thu, Jun 2, 2011 at 1:36 AM, Kevin Van Maren kevin.van.ma...@oracle.com wrote:
OFED 1.5.1 should work fine with Lustre 1.8.4, although I believe more people are using the in-kernel OFED now: Lustre (finally) defaulted to the in-kernel OFED for RedHat, so it is no longer _necessary_ to build either OFED or Lustre.

Kevin

Edward Walter wrote:
Hi List,
We're getting ready to upgrade the OS/software stack on one of our clusters, and I'm looking at which Lustre and OFED versions will work best. It looks like the changelog for 1.8.4 and the compatibility matrix have conflicting information. The Lustre compatibility matrix indicates that on Lustre 1.8.4 the highest OFED revision with o2iblnd support is 1.4.2: http://wiki.lustre.org/index.php/Lustre_Release_Information
The changelog for 1.8.4 indicates that o2iblnd is supported with OFED 1.5.1: http://wiki.lustre.org/index.php/Change_Log_1.8#Changes_from_v1.8.3_to_v1.8.4
Can someone clarify whether 1.8.4 supports o2iblnd with OFED 1.5.1? Are there any pitfalls to this configuration? Has anyone found any instabilities with it?
Thanks much.
-Ed Walter
Carnegie Mellon University
Re: [Lustre-discuss] Pardon my stupidity: IOH?
The I/O Hub, which provides the PCI Express lanes to the processor. See: http://en.wikipedia.org/wiki/Intel_X58

Ms. Megan Larko wrote:
Greetings,
Please pardon my ignorance, but what is this "IOH" to which the recent thread "OSSes on dual IOH motherboards" has been referring?
Thanks,
megan
Re: [Lustre-discuss] OSSes on dual IOH motherboards
Mark,
In addition to thread pinning, see also Bug 22078, which allows a different network interface to be used for different OSTs on the same server: a single IB interface is not enough to saturate one IOH, let alone multiple. Normally all the threads are in a shared pool, where any thread can service any incoming request for any OST. The most common server configuration is probably still dual-socket, single-IOH.

Kevin

Andreas Dilger wrote:
Look for the Bull NUMIOA presentation from the recent LUG. The short story is that OST thread pinning is critical to getting good performance. The numbers are something like 3.6GB/s without, and 6.0GB/s with, thread affinity.
Cheers, Andreas

On 2011-06-02, at 7:23 PM, Mark Nelson m...@msi.umn.edu wrote:
Hi List,
I was wondering if anyone here has looked at the performance characteristics of Lustre OSSes on dual-Tylersburg motherboards with RAID controllers split across separate I/O hubs. I imagine that without proper pinning of service threads to the right CPUs/IOH and memory pools, this could cause some nasty QPI contention. Is this actually a problem in practice? Is it possible to pin service threads in a reasonable way based on which OST is involved? Is anyone doing this on purpose to try to gain more overall PCIe bandwidth? I imagine that in general it's probably best to stick with a single-socket, single-IOH OSS: no pinning to worry about, a very direct QPI setup, consistent performance characteristics, etc.
Thanks, Mark
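The NUMA topology such pinning has to respect can be inspected with standard Linux tools; nothing below is Lustre-specific, and the PCI address is a made-up example:

```shell
# CPU/memory layout of the machine:
numactl --hardware

# NUMA node owning a PCI device, e.g. an IB HCA or RAID controller
# (0000:05:00.0 is a hypothetical address; find yours with lspci):
cat /sys/bus/pci/devices/0000:05:00.0/numa_node

# The affinity concept, illustrated on an ordinary process: bind its
# CPUs and memory to node 0 (Lustre's own OST thread pinning is done
# differently; this only demonstrates the mechanism):
numactl --cpunodebind=0 --membind=0 ./some_io_benchmark
```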
Re: [Lustre-discuss] lustre ofed compatibility
OFED 1.5.1 should work fine with Lustre 1.8.4, although I believe more people are using the in-kernel OFED now: Lustre (finally) defaulted to the in-kernel OFED for RedHat, so it is no longer _necessary_ to build either OFED or Lustre.

Kevin

Edward Walter wrote:
Hi List,
We're getting ready to upgrade the OS/software stack on one of our clusters, and I'm looking at which Lustre and OFED versions will work best. It looks like the changelog for 1.8.4 and the compatibility matrix have conflicting information. The Lustre compatibility matrix indicates that on Lustre 1.8.4 the highest OFED revision with o2iblnd support is 1.4.2: http://wiki.lustre.org/index.php/Lustre_Release_Information
The changelog for 1.8.4 indicates that o2iblnd is supported with OFED 1.5.1: http://wiki.lustre.org/index.php/Change_Log_1.8#Changes_from_v1.8.3_to_v1.8.4
Can someone clarify whether 1.8.4 supports o2iblnd with OFED 1.5.1? Are there any pitfalls to this configuration? Has anyone found any instabilities with it?
Thanks much.
-Ed Walter
Carnegie Mellon University
Re: [Lustre-discuss] mv_sata module for rhel5 and write through patch
Brock Palen wrote:
We are (finally) updating our X4500s to RHEL5 and Lustre 1.8.5, from RHEL4 and 1.6.7. On RHEL4 we had used the patch from https://bugzilla.lustre.org/show_bug.cgi?id=14040 for the mv_sata module. Is this still recommended on RHEL5, i.e., to use the mv_sata module over the stock RedHat sata_mv, as well as applying this patch? That patch is quite old; is there a newer one?

I don't know: the last I heard was that the upcoming RHEL 5.3 was to have an in-tree Marvell driver that worked. If your system is still under support, I'd contact Oracle support for information about running RHEL5 on the X4500. You do want to ensure the write-back cache is disabled on the drives, but you may be able to do that with udev scripts. See Bug 17462 for an example for the J4400. What are other X4500/Thumper users running?

Also, I will do some digging on the list, but why is Lustre 2.0 not the 'production' version? We are planning on 1.8.x for now, but if 2.0 is stable we would install that one.

Lustre 2.0 is not being widely used, and would not be covered by an Oracle support contract. It is strongly recommended to run production systems on 1.8.x rather than 2.0. If you really want to try Lustre 2.x, you will want to use something newer than 2.0: maybe check with lustre...@googlegroups.com for the current status of the whamcloud git repository?

Can we upgrade directly from 1.6 to 2.0 if we did this?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734) 936-1985
Re: [Lustre-discuss] Checksums of files on disk
Christopher J. Walker wrote: The application I use, StoRM [1], can store checksums on disk in an extended user attribute - and use that to ensure the integrity of files on disk. The algorithm currently used is adler32. The intention is to perform end-to-end checksumming from file creation through storage, transfer over the WAN and storage at a site. Looking at http://wiki.lustre.org/manual/LustreManual20_HTML/ManagingFileSystemIO.html#50438211_pgfId-1291975 I see that Lustre has some checksum support (though not for checksumming the file on the OST - so we'd still need to use the user attribute for that). http://wiki.lustre.org/manual/LustreManual18_HTML/LustreTuning.html#50651264_pgfId-1291287 Is the value of the checksum user accessible? Or to be more specific, I'd potentially get a big speedup if I were able to ask the diskserver to tell me the checksum of a file without actually transferring it over the network. Is it easy to do this? No, the checksum is not currently available, and is not being stored on disk. That being said, feel free to send patches! There were some plans to merge the client-side checksum with the ZFS checksum when the backing store is ZFS, but I have not been following the ZFS status closely enough to know the status of that enhancement. Do note that the Lustre checksums only cover the RPC, so at best each 1MB file chunk would have a separate checksum, generated on the client before doing the RPC (so not quite as end-to-end as an application checksum). Also note that the checksums are not used when using mmap(). See Bug 11742 for the details (it is sent, but failures are ignored). Kevin Chris [1] http://storm.forge.cnaf.infn.it/home This is an SRM implementation we use to give a grid authentication to our storage (we store data for the LHC). 
___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
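The StoRM-style scheme discussed above - an adler32 checksum kept in a user extended attribute - can be sketched in a few lines of Python. The `user.storm.checksum` attribute name is an assumption for illustration (check the StoRM documentation for the real key), and the filesystem must support user xattrs:

```python
import os
import zlib

def file_adler32(path, chunk_size=1 << 20):
    """Compute the adler32 checksum of a file, reading 1MB at a time."""
    cksum = 1  # adler32 initial value
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            cksum = zlib.adler32(chunk, cksum)
    return cksum & 0xffffffff

def store_checksum(path):
    """Record the checksum in a user xattr (attribute name is hypothetical)."""
    value = "%08x" % file_adler32(path)
    os.setxattr(path, "user.storm.checksum", value.encode())
    return value
```

Reading the attribute back (`os.getxattr`) is then a metadata-only operation, which is exactly the "tell me the checksum without transferring the file" shortcut Chris is after - at the cost of trusting that the data under the xattr has not changed.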
Re: [Lustre-discuss] [Lustre-community] Poor multithreaded I/O performance
[Moved to Lustre-discuss] However, if I spawn 8 threads such that all of them write to the same file (non-overlapping locations), without explicitly synchronizing the writes (i.e. I don't lock the file handle) How exactly does your multi-threaded application write the data? Are you using pwrite to ensure non-overlapping regions or are they all just doing unlocked write() operations on the same fd to each write (each just transferring size/8)? If it divides the file into N pieces, and each thread does pwrite on its piece, then what each OST sees are multiple streams at wide offsets to the same object, which could impact performance. If on the other hand the file is written sequentially, where each thread grabs the next piece to be written (locking normally used for the current_offset value, so you know where each chunk is actually going), then you get a more sequential pattern at the OST. If the number of threads maps to the number of OSTs (or some modulo, like in your case 6 OSTs per thread), and each thread owns the piece of the file that belongs to an OST (i.e., for (offset = thread_num * 6MB; offset < size; offset += 48MB) pwrite(fd, buf, 6MB, offset); ), then you've eliminated the need for application locks (assuming the use of pwrite) and ensured each OST object is being written sequentially. It's quite possible there is some bottleneck on the shared fd. So perhaps the question is not why you aren't scaling with more threads, but why the single file is not able to saturate the client, or why the file BW is not scaling with more OSTs. It is somewhat common for multiple processes (on different nodes) to write non-overlapping regions of the same file; does performance improve if each thread opens its own file descriptor? Kevin Wojciech Turek wrote: Ok so it looks like you have in total 64 OSTs and your output file is striped across 48 of them. 
May I suggest that you limit number of stripes, lets say a good number to start with would be 8 stripes and also for best results use OST pools feature to arrange that each stripe goes to OST owned by different OSS. regards, Wojciech On 23 May 2011 23:09, kme...@cs.uh.edu wrote: Actually, 'lfs check servers' returns 64 entries as well, so I presume the system documentation is out of date. Again, I am sorry the basic information had been incorrect. - Kshitij Run 'lfs getstripe your_output_file' and paste the output of that command to the mailing list. Stripe count of 48 is not possible if you have max 11 OSTs (the max stripe count will be 11). If your striping is correct, the bottleneck can be your client network. regards, Wojciech On 23 May 2011 22:35, kme...@cs.uh.edu wrote: The stripe count is 48. Just fyi, this is what my application does: A simple I/O test where threads continually write blocks of size 64Kbytes or 1Mbyte (decided at compile time) till a large file of say, 16Gbytes is created. Thanks, Kshitij What is your stripe count on the file, if your default is 1, you are only writing to one of the OSTs. You can check with the lfs getstripe command; you can set the stripe bigger, and hopefully your wide-striped file with threaded writes will be faster. Evan -Original Message- From: lustre-community-boun...@lists.lustre.org [mailto:lustre-community-boun...@lists.lustre.org] On Behalf Of kme...@cs.uh.edu Sent: Monday, May 23, 2011 2:28 PM To: lustre-commun...@lists.lustre.org Subject: [Lustre-community] Poor multithreaded I/O performance Hello, I am running a multithreaded application that writes to a common shared file on lustre fs, and this is what I see: If I have a single thread in my application, I get a bandwidth of approx. 250 MBytes/sec. 
(11 OSTs, 1MByte stripe size) However, if I spawn 8 threads such that all of them write to the same file (non-overlapping locations), without explicitly synchronizing the writes (i.e. I dont lock the file handle), I still get the same bandwidth. Now, instead of writing to a shared file, if these threads write to separate files, the bandwidth obtained is approx. 700 Mbytes/sec. I would ideally like my multithreaded application to see similar scaling. Any ideas why the performance is limited and any workarounds? Thank you, Kshitij
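The OST-aligned pattern Kevin describes - each thread owning exactly the file regions that map to its own OSTs - can be sketched in Python. The sizes come from the thread (1MB stripes, 48 stripes, 8 threads, so each thread owns a contiguous 6MB chunk every 48MB stride); the use of os.pwrite is illustrative:

```python
import os

MB = 1 << 20
STRIPE_SIZE = 1 * MB     # from the thread: 1MByte stripe size
STRIPE_COUNT = 48        # file striped across 48 OSTs
NUM_THREADS = 8
CHUNK = (STRIPE_COUNT // NUM_THREADS) * STRIPE_SIZE  # 6MB owned per stride
STRIDE = STRIPE_COUNT * STRIPE_SIZE                  # 48MB full-stripe stride

def thread_offsets(thread_num, file_size):
    """Offsets of the 6MB chunks this thread writes: each chunk always
    lands on the same 6 OSTs, so each OST object is written sequentially."""
    return list(range(thread_num * CHUNK, file_size, STRIDE))

def thread_write(fd, thread_num, file_size, buf):
    # pwrite carries its own offset, so no lock on the shared fd is needed
    for off in thread_offsets(thread_num, file_size):
        os.pwrite(fd, buf, off)
```

Because the offset sets of the threads never overlap and tile the file exactly, no application-level locking is required, which is the property Kevin points out.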
Re: [Lustre-discuss] Two questions about the tuning of Lustre file system.
What exactly were you testing? I have no idea how to interpret your numbers. A single client reading from a single file? One file per OST, or file striped across all OSTs? Is the Lustre file system idle except for your test? In general, start with the pieces: 1) make sure the network is sane. Try measuring BW to/from each node (client and server) to ensure all the cables are good. For your configuration, you should be able to measure ~3.2GB/s (unidirectional) using large MPI messages. While I prefer to use MPI, some people use lnet_selftest. 2) make sure each OST is sane. For each OST, create a file that is only striped on that OST. Make sure a client can read/write each of these files as expected. Be sure you transfer much more data than the client+server RAM sizes. Many issues are sorted out just getting both 1 and 2 in good shape. Kevin Tanin wrote: Dear all, I have two questions regarding the performance of Lustre System. Currently, we have 5 OSS nodes, and each OSS carries 8 OSTs. All the nodes (including the MDT/MGS node and client node) are connected to a Mellanox MTS 3600 InfiniBand switch using RDMA for data transfer. The bandwidth of the network is 40Gbps. The kernel version is 'Linux 2.6.18-164.11.1.el5_lustre.1.8.3 #1 SMP Fri Apr 9 18:00:39 MDT 2010 x86_64 x86_64 x86_64 GNU/Linux'. OS is RHEL 5.5. Lustre version is 1.8.3. OFED Version is 1.5.2. IB HCA is Mellanox Technologies MT26428 ConnectX VPI PCIe IB QDR. And I did a simple test on the client side to see the peak data reading performance. 
Here is the data: #time Data transferred Bandwidth 2 sec 2.18 GBytes 8.71 Gbits/sec 2 sec 2.06 GBytes 8.24 Gbits/sec 2 sec 2.10 GBytes 8.40 Gbits/sec 2 sec 1.93 GBytes 7.73 Gbits/sec 2 sec 1.50 GBytes 6.02 Gbits/sec 2 sec 420.00 MBytes 1.64 Gbits/sec 2 sec 2.19 GBytes 8.75 Gbits/sec 2 sec 2.08 GBytes 8.32 Gbits/sec 2 sec 2.08 GBytes 8.32 Gbits/sec 2 sec 1.99 GBytes 7.97 Gbits/sec 2 sec 1.80 GBytes 7.19 Gbits/sec *2 sec 160.00 MBytes 640.00 Mbits/sec* 2 sec 2.15 GBytes 8.59 Gbits/sec 2 sec 2.13 GBytes 8.52 Gbits/sec 2 sec 2.15 GBytes 8.59 Gbits/sec 2 sec 2.09 GBytes 8.36 Gbits/sec 2 sec 2.09 GBytes 8.36 Gbits/sec 2 sec 2.07 GBytes 8.28 Gbits/sec 2 sec 2.15 GBytes 8.59 Gbits/sec 2 sec 2.11 GBytes 8.44 Gbits/sec 2 sec 2.05 GBytes 8.20 Gbits/sec *2 sec 0.00 Bytes 0.00 bits/sec* *2 sec 0.00 Bytes 0.00 bits/sec* 2 sec 1.95 GBytes 7.81 Gbits/sec 2 sec 2.14 GBytes 8.55 Gbits/sec 2 sec 1.99 GBytes 7.97 Gbits/sec 2 sec 2.00 GBytes 8.01 Gbits/sec 2 sec 370.00 MBytes 1.45 Gbits/sec 2 sec 1.96 GBytes 7.85 Gbits/sec 2 sec 2.03 GBytes 8.12 Gbits/sec 2 sec 1.89 GBytes 7.58 Gbits/sec 2 sec 1.94 GBytes 7.77 Gbits/sec 2 sec 640.00 MBytes 2.50 Gbits/sec 2 sec 1.47 GBytes 5.90 Gbits/sec 2 sec 1.94 GBytes 7.77 Gbits/sec 2 sec 1.90 GBytes 7.62 Gbits/sec 2 sec 1.94 GBytes 7.77 Gbits/sec 2 sec 1.18 GBytes 4.73 Gbits/sec 2 sec 940.00 MBytes 3.67 Gbits/sec 2 sec 1.97 GBytes 7.89 Gbits/sec 2 sec 1.93 GBytes 7.73 Gbits/sec 2 sec 1.87 GBytes 7.46 Gbits/sec 2 sec 1.77 GBytes 7.07 Gbits/sec 2 sec 320.00 MBytes 1.25 Gbits/sec 2 sec 1.97 GBytes 7.89 Gbits/sec 2 sec 2.00 GBytes 8.01 Gbits/sec 2 sec 1.89 GBytes 7.58 Gbits/sec 2 sec 1.93 GBytes 7.73 Gbits/sec 2 sec 350.00 MBytes 1.37 Gbits/sec 2 sec 1.77 GBytes 7.07 Gbits/sec 2 sec 1.92 GBytes 7.70 Gbits/sec 2 sec 2.05 GBytes 8.20 Gbits/sec 2 sec 2.01 GBytes 8.05 Gbits/sec 2 sec 710.00 MBytes 2.77 Gbits/sec 2 sec 1.59 GBytes 6.37 Gbits/sec 2 sec 2.00 GBytes 8.01 Gbits/sec 2 sec 710.00 MBytes 2.77 Gbits/sec 2 sec 1.59 GBytes 6.37 Gbits/sec 2 sec 2.00 
GBytes 8.01 Gbits/sec 2 sec 1.88 GBytes 7.54 Gbits/sec 2 sec 1.62 GBytes 6.48 Gbits/sec As you can see, although the peak bandwidth can reach 8.71Gbps, the performance is quite unstable (sometimes the bandwidth just gets choked). All the OSS nodes seem to stop reading data simultaneously. I tried grouping different OSTs and turning the checksum on/off, but this still happens. Does anybody have a hint as to the reason? 2. As we know, when reading data from lustre client, the data is moved from
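The per-OST sanity check Kevin suggests (step 2 above) is easy to script: for each OST, pin a single-stripe file to that OST and push more data through it than RAM can cache. A sketch that just emits the commands - the mount point, directory, and size are placeholders; `lfs setstripe -c 1 -i <index>` is the standard way to pin a file to one OST:

```python
def per_ost_check_cmds(num_osts, mnt="/mnt/lustre", size_gb=64):
    """Emit shell commands to write one single-stripe test file per OST.

    size_gb should exceed combined client+server RAM so caching cannot
    mask a slow disk or path.
    """
    cmds = []
    for i in range(num_osts):
        path = "%s/ostcheck/file_ost%04d" % (mnt, i)
        cmds.append("lfs setstripe -c 1 -i %d %s" % (i, path))
        cmds.append("dd if=/dev/zero of=%s bs=1M count=%d oflag=direct"
                    % (path, size_gb * 1024))
    return cmds
```

Timing each dd (and a matching read pass) then identifies any OST that is consistently slow, which is exactly the stall pattern in the table above.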
Re: [Lustre-discuss] Anybody actually using Flash (Fusion IO specifically) for meta data?
Dardo D Kleiner - CONTRACTOR wrote: Short answer: of course it works - they're just block devices after all - but you'll find that you won't realize the performance gains you might expect (at least not for an MDT). Yes. See the email thread 'improving metadata performance' and Robin Humble's talk at LUG. The MDT disk is rarely the bottleneck (although that could change with full size-on-mds support), which others had discovered using a ram-based (tmpfs) MDT. As for putting the entire filesystem on flash, sure that would be pretty nifty, but expensive. Not being able to do failover, with storage on internal PCIe cards, is a downside. Aside from simply being fast OSTs, there are several areas that would allow Lustre to take advantage of these kinds of devices: 1) SMP scaling for the MDS - the problem right now is that the low latency of these devices really shines best when you have many threads scattering small I/O. The current (1.8.x) Lustre MDS doesn't do this. SMP scaling is a big issue. In Lustre 1.8.x the maximum performance is reached at no more than 8 CPUs (maybe fewer) for the MDT -- additional CPU cores result in _lower_ performance. There are patches for Lustre 2.x to improve SMP scaling, but I haven't tested that workload. 2) Flashcache/bcache over traditional disk storage (OST or MDT) - this can be done today, of course. There are some interop issues in my testing, but when it works it does what it says it does. It still won't really help an MDT though. 3) Targeted device mapping of the metadata portions of an OST on traditional disk (e.g. extent lists) onto flash. #1 is substantial work (ongoing I believe). #2 is pretty nifty, basically grow your local page cache beyond RAM - helps when hot working set is large. #3 is trickier and though I haven't tried it I understand there's real effort ongoing in this regard. flex_bg is in ext4, which allows the inodes to be packed together. 
Filesystem size in this discussion is mostly irrelevant for an MDT, it's just whether or not the device is big enough for the number of objects (a few million is *not* many). A huge number of clients thrashing about creating/modifying/deleting is where these things have the most potential. - Dardo On 5/16/11 2:58 PM, Carlson, Timothy S wrote: Folks, I know that flash based technology gets talked about from time to time on the list, but I was wondering if anybody has actually implemented FusionIO devices for metadata. The last thread I can find on the mailing list that relates to this topic dates from 3 years ago. The software driving the Fusion cards has come quite a ways since then and I've got good experience using the device as a raw disk. I'm just fishing around to see if anybody has implemented one of these devices in a reasonably sized Lustre config where reasonably is left open to interpretation. I'm thinking >500T and a few million files. Thanks! Tim ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] software RAID1 in RHEL5
Adesanya, Adeyemi wrote: I'm discussing the proposed architecture for two new Lustre 1.8.x filesystems. We plan to use a failover pair of MDS nodes (active-active), with each MDS serving an MDT. The MDTs will be housed in external storage but we would like to implement redundancy across more than one storage array by using software RAID1. The Lustre documentation mentions using Linux md to set up software RAID1 or RAID10 for MDTs. Does the RAID1 implementation in the Lustre 1.8.x RHEL5 kernel do an adequate job of ensuring consistency across mirrored devices (compared to a hardware RAID1 implementation)? Adequate, probably. As correct as hardware raid, doubtful. Without special hardware, or doing things that kill performance, there will always remain some corner cases. The issue is what happens for writes that are in process when you have a crash/reboot/power loss: it is possible for them to make it to one disk, but not the other. So it is possible to believe they are on disk, and proceed accordingly, when they are only on one copy, and are lost if that disk fails. Even worse, Linux alternates reads, so in theory it could be there one time and gone the next. The good news is that writes should(!) not be marked as on disk until both disks have said it is written. So you could do an md check, and if needed do a repair, before, e.g., replaying the journal (mounting the file system, doing fsck, etc.). Even if the MD resync takes the older copy and undoes a write, it should not have been a write that was expected to have made it to stable storage, so the normal Lustre recovery mechanisms should be able to replay it. Assuming, that is, that this is done _before_ you mount the device. Kevin ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
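The check-then-repair sequence Kevin describes maps onto the md sysfs interface roughly as follows. This is a minimal sketch assuming the mirror is /dev/md0 (a placeholder); it must run before the MDT is mounted:

```shell
# Scrub the mirror: compare both copies of every block (device name assumed)
echo check > /sys/block/md0/md/sync_action
# ... wait until /sys/block/md0/md/sync_action reads "idle" again ...
cat /sys/block/md0/md/mismatch_cnt
# If the mismatch count is non-zero, rewrite one copy over the other
echo repair > /sys/block/md0/md/sync_action
# Only then mount the MDT and let Lustre recovery replay any lost writes
mount -t lustre /dev/md0 /mnt/mdt
```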
Re: [Lustre-discuss] Lustre client question
See bug 24264 -- certainly possible that the raid controller corrupted your filesystem. If you remove the new drive and reboot, does the file system look cleaner? Kevin On May 13, 2011, at 11:39 AM, Zachary Beebleson zbee...@math.uchicago.edu wrote: We recently had two raid rebuilds on a couple storage targets that did not go according to plan. The cards reported a successful rebuild in each case, but ldiskfs errors started showing up on the associated OSSs and the affected OSTs were remounted read-only. We are planning to migrate off the data, but we've noticed that some clients are getting i/o errors, while others are not. As an example, a file that has a stripe on at least one affected OST could not be read on one client, i.e. I received a read error trying to access it, while it was perfectly readable and apparently uncorrupted on another (I am able to migrate the file to healthy OSTs by copying to a new file name). The clients with the i/o problem see inactive devices corresponding to the read-only OSTs when I issue a 'lfs df', while the others without the i/o problems report the targets as normal. Is it just that many clients are not aware of an OST problem yet? I need clients with minimal I/O disruptions in order to migrate as much data off as possible. A client reboot appears to awaken them to the fact that there are problems with the OSTs. However, I need them to be able to read the data in order to migrate it off. Is there a way to reconnect the clients to the problematic OSTs? We have dd-ed copies of the OSTs to try e2fsck against them, but the results were not promising. The check aborted with: -- Resize inode (re)creation failed: A block group is missing an inode table. Continue? yes ext2fs_read_inode: A block group is missing an inode table while reading inode 7 in recreate inode e2fsck: aborted -- Any advice would be greatly appreciated. 
Zach ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] two iSCSI lun for OST conf in RAID 1
You could use software RAID below Lustre to present an md device. See the mdadm command. Kevin Roberto Scudeller wrote: Hi all, I need help. Is possible config 2 lun (of the 2 different storages) for OST in RAID1? I need the same data replicated in 2 storages for data recovery (security and et.). Cheers, -- Roberto Scudeller ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
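A minimal sketch of Kevin's suggestion, assuming the two iSCSI LUNs appear as /dev/sdb and /dev/sdc (device names, fsname, and the MGS NID are all placeholders to substitute for your setup):

```shell
# Mirror the two iSCSI LUNs with md, then format the md device as an OST
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.lustre --ost --fsname=lustre --mgsnode=10.0.0.1@tcp /dev/md0
mount -t lustre /dev/md0 /mnt/ost0
```

Every OST write then goes to both storage arrays; if one array fails, md keeps serving from the surviving copy.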
Re: [Lustre-discuss] Lustre client question
It sounds like it is working better. Did the clients recover? I would have re-run fsck before mounting it again, and moving the data off may still be the best plan. Since dropping the rebuilt drive reduced the corruption, certainly contact your raid vendor over this issue. Kevin Zachary Beebleson wrote: Kevin, I just failed the drive and remounted. A basic 'df' hangs when it gets to the mount point, but /proc/fs/lustre/health_check reports everything is healthy. 'lfs df' on a client reports the OST is active, where it was inactive before. However, now I'm working with a degraded volume, but it is raid 6. Should I try another rebuild or just proceed with the migration off of this OST asap? Thanks, Zach PS. Sorry for the repeat message On Fri, 13 May 2011, Kevin Van Maren wrote: See bug 24264 -- certainly possible that the raid controller corrupted your filesystem. If you remove the new drive and reboot, does the file system look cleaner? Kevin On May 13, 2011, at 11:39 AM, Zachary Beebleson zbee...@math.uchicago.edu wrote: We recently had two raid rebuilds on a couple storage targets that did not go according to plan. The cards reported a successful rebuild in each case, but ldiskfs errors started showing up on the associated OSSs and the affected OSTs were remounted read-only. We are planning to migrate off the data, but we've noticed that some clients are getting i/o errors, while others are not. As an example, a file that has a stripe on at least one affected OST could not be read on one client, i.e. I received a read error trying to access it, while it was perfectly readable and apparently uncorrupted on another (I am able to migrate the file to healthy OSTs by copying to a new file name). The clients with the i/o problem see inactive devices corresponding to the read-only OSTs when I issue a 'lfs df', while the others without the i/o problems report the targets as normal. Is it just that many clients are not aware of an OST problem yet? 
I need clients with minimal I/O disruptions in order to migrate as much data off as possible. A client reboot appears to awaken them to the fact that there are problems with the OSTs. However, I need them to be able to read the data in order to migrate it off. Is there a way to reconnect the clients to the problematic OSTs? We have dd-ed copies of the OSTs to try e2fsck against them, but the results were not promising. The check aborted with: -- Resize inode (re)creation failed: A block group is missing an inode table. Continue? yes ext2fs_read_inode: A block group is missing an inode table while reading inode 7 in recreate inode e2fsck: aborted -- Any advice would be greatly appreciated. Zach ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Fragmented I/O
Kevin Hildebrand wrote: The PERC 6 and H800 use megaraid_sas, I'm currently running 00.00.04.17-RH1. The max_sectors numbers (320) are what is being set by default; I am able to set it to something smaller than 320, but not larger. Right. You cannot set max_sectors_kb larger than max_hw_sectors_kb (Linux normally defaults most drivers to 512, but Lustre sets them to be the same): you may want to instrument your HBA driver to see what is going on (i.e., why the max_hw_sectors_kb is < 1024). I don't know if it is due to a driver limitation or a true hardware limit. Most drivers have a limit of 512KB by default; see Bug 22850 for the patches that fixed the QLogic and Emulex fibre channel drivers. Kevin Kevin On Wed, 11 May 2011, Kevin Van Maren wrote: You didn't say, but I think they are LSI-based: are you using the mptsas driver with the PERC cards? Which driver version? First, max_sectors_kb should normally be set to a power of 2 number, like 256, over an odd size like 320. This number should also match the native raid size of the device, to avoid read-modify-write cycles. (See Bug 22886 on why not to make it 1024 in general). See Bug 17086 for patches to increase the max_sectors_kb limitation for the mptsas driver to 1MB, or the true hardware maximum, rather than a driver limit; however, the hardware may still be limited to sizes < 1MB. Also, to clarify the sizes: the smallest bucket >= transfer_size is the one incremented, so a 320KB IO increments the 512KB bucket. Since your HW says it can only do a 320KB IO, there will never be a 1MB IO. You may want to instrument your HBA driver to see what is going on (i.e., why the max_hw_sectors_kb is < 1024). Kevin Kevin Hildebrand wrote: Hi, I'm having some performance issues on my Lustre filesystem and it looks to me like it's related to I/Os getting fragmented before being written to disk, but I can't figure out why. This system is RHEL5, running Lustre 1.8.4. 
All of my OSTs look pretty much the same:

pages per bulk r/w        read: rpcs  %  cum % |  write: rpcs  %  cum %
1:                             88811  38    38 |        46375  17    17
2:                              1497   0    38 |         7733   2    20
4:                              1161   0    39 |         1840   0    21
8:                              1168   0    39 |         7148   2    24
16:                              922   0    40 |         3297   1    25
32:                              979   0    40 |         7602   2    28
64:                             1576   0    41 |         9046   3    31
128:                            7063   3    44 |        16284   6    37
256:                          129282  55   100 |       162090  62   100

disk fragmented I/Os       read: ios  %  cum % |   write: ios  %  cum %
0:                             51181  22    22 |            0   0     0
1:                             45280  19    42 |        82206  31    31
2:                             16615   7    49 |        29108  11    42
3:                              3425   1    50 |        17392   6    49
4:                            110445  48    98 |       129481  49    98
5:                              1661   0    99 |         2702   1    99

disk I/O size              read: ios  %  cum % |   write: ios  %  cum %
4K:                            45889   8     8 |        56240   7     7
8K:                             3658   0     8 |         6416   0     8
16K:                            7956   1    10 |         4703   0     9
32K:                            4527   0    11 |        11951   1    10
64K:                          114369  20    31 |       134128  18    29
128K:                           5095   0    32 |        17229   2    31
256K:                           7164   1    33 |        30826   4    35
512K:                         369512  66   100 |       465719  64   100

Oddly, there's no 1024K row in the I/O size table... and these seem small to me as well, but I can't seem to change them. Writing new values to either doesn't change anything.

# cat /sys/block/sdb/queue/max_hw_sectors_kb
320
# cat /sys/block/sdb/queue/max_sectors_kb
320

Hardware in question is DELL PERC 6/E and DELL PERC H800 RAID controllers, with MD1000 and MD1200 arrays, respectively. Any clues on where I should look next? Thanks, Kevin Kevin Hildebrand University of Maryland, College Park Office of Information Technology ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Lustre 1.8.4 - Local mount of ost for backup purposes, fs type ldiskfs or ext4?
Well, that's the opposite problem of Bug 24398. Are you sure you are using the ext4-based ldiskfs? Kevin On May 11, 2011, at 4:23 PM, Jeff Johnson jeff.john...@aeoncomputing.com wrote: Greetings, I am doing a local mount of a 8TB ost device in a Lustre 1.8.4 installation. The ost was built with a backfstype of ldiskfs. When attempting the local mount: mount -t ldiskfs /dev/sdc /mnt/save/ost I get: mount: wrong fs type, bad option, bad superblock on /dev/sdt, missing codepage or other error I am able to mount the same block device as ext4, just not as ldiskfs. I need to be able to mount as ldiskfs to get access to the extended attributes and back them up. Is this still the case with the ext4 extensions for Lustre 1.8.4? I am able to mount read-only as ext4, but any attempt at reading the extended attributes with getfattr fails. Thanks, --Jeff -- -- Jeff Johnson Manager Aeon Computing jeff.john...@aeoncomputing.com www.aeoncomputing.com t: 858-412-3810 x101 f: 858-412-3845 m: 619-204-9061 4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117 ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
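Once the ldiskfs mount works, the extended-attribute backup is typically done with getfattr in hex mode, along the lines of the device-level backup procedure in the Lustre manual (mount point and output path here are placeholders):

```shell
mount -t ldiskfs /dev/sdc /mnt/save/ost
cd /mnt/save/ost
# Dump all EAs (including trusted.* Lustre attributes) in a restorable form:
# -R recursive, -d dump values, -m '.*' match every attribute,
# -e hex encode values, -P do not follow symlinks
getfattr -R -d -m '.*' -e hex -P . > /root/ost_ea_backup.txt
```

The resulting file can later be replayed onto a restored tree with `setfattr --restore`.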
Re: [Lustre-discuss] Fragmented I/O
You didn't say, but I think they are LSI-based: are you using the mptsas driver with the PERC cards? Which driver version? First, max_sectors_kb should normally be set to a power of 2 number, like 256, over an odd size like 320. This number should also match the native raid size of the device, to avoid read-modify-write cycles. (See Bug 22886 on why not to make it 1024 in general). See Bug 17086 for patches to increase the max_sectors_kb limitation for the mptsas driver to 1MB, or the true hardware maximum, rather than a driver limit; however, the hardware may still be limited to sizes < 1MB. Also, to clarify the sizes: the smallest bucket >= transfer_size is the one incremented, so a 320KB IO increments the 512KB bucket. Since your HW says it can only do a 320KB IO, there will never be a 1MB IO. You may want to instrument your HBA driver to see what is going on (i.e., why the max_hw_sectors_kb is < 1024). Kevin Kevin Hildebrand wrote: Hi, I'm having some performance issues on my Lustre filesystem and it looks to me like it's related to I/Os getting fragmented before being written to disk, but I can't figure out why. This system is RHEL5, running Lustre 1.8.4. 
All of my OSTs look pretty much the same:

pages per bulk r/w        read: rpcs  %  cum % |  write: rpcs  %  cum %
1:                             88811  38    38 |        46375  17    17
2:                              1497   0    38 |         7733   2    20
4:                              1161   0    39 |         1840   0    21
8:                              1168   0    39 |         7148   2    24
16:                              922   0    40 |         3297   1    25
32:                              979   0    40 |         7602   2    28
64:                             1576   0    41 |         9046   3    31
128:                            7063   3    44 |        16284   6    37
256:                          129282  55   100 |       162090  62   100

disk fragmented I/Os       read: ios  %  cum % |   write: ios  %  cum %
0:                             51181  22    22 |            0   0     0
1:                             45280  19    42 |        82206  31    31
2:                             16615   7    49 |        29108  11    42
3:                              3425   1    50 |        17392   6    49
4:                            110445  48    98 |       129481  49    98
5:                              1661   0    99 |         2702   1    99

disk I/O size              read: ios  %  cum % |   write: ios  %  cum %
4K:                            45889   8     8 |        56240   7     7
8K:                             3658   0     8 |         6416   0     8
16K:                            7956   1    10 |         4703   0     9
32K:                            4527   0    11 |        11951   1    10
64K:                          114369  20    31 |       134128  18    29
128K:                           5095   0    32 |        17229   2    31
256K:                           7164   1    33 |        30826   4    35
512K:                         369512  66   100 |       465719  64   100

Oddly, there's no 1024K row in the I/O size table... and these seem small to me as well, but I can't seem to change them. Writing new values to either doesn't change anything.

# cat /sys/block/sdb/queue/max_hw_sectors_kb
320
# cat /sys/block/sdb/queue/max_sectors_kb
320

Hardware in question is DELL PERC 6/E and DELL PERC H800 RAID controllers, with MD1000 and MD1200 arrays, respectively. Any clues on where I should look next? Thanks, Kevin Kevin Hildebrand University of Maryland, College Park Office of Information Technology ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Lustre filesystem hangs when reading large files
Chris Exton wrote: Hello, We are currently using lustre 1.8.1.1 and using kernel version 2.6.18_128.7.1.el5_lustre. We are experiencing problems when performing reads of large files from my lustre filesystem, small reads are not affected. The read process hangs and the following message is reported in /var/log/messages: Feb 22 15:59:38 leopard kernel: LustreError: 11-0: an error occurred while communicating with 192.168.13.200@o2ib. The obd_ping operation failed with -107 Feb 22 15:59:38 leopard kernel: Lustre: lustre-OST-osc-81067e0eac00: Connection to service lustre-OST via nid 192.168.13.200@o2ib was lost; in progress operations using this service will wait for recovery to complete. Feb 22 15:59:38 leopard kernel: LustreError: 6811:0:(import.c:939:ptlrpc_connect_interpret()) lustre-OST_UUID went back in time (transno 476754140074 was previously committed, server now claims 0)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646 Feb 22 15:59:38 leopard kernel: LustreError: 167-0: This client was evicted by lustre-OST; in progress operations using this service will fail. Feb 22 15:59:38 leopard kernel: Lustre: lustre-OST-osc-81067e0eac00: Connection restored to service lustre-OST using nid 192.168.13.200@o2ib. Feb 22 15:59:38 leopard kernel: LustreError: 17592:0:(lov_request.c:196:lov_update_enqueue_set()) enqueue objid 0x18f87222 subobj 0x4d0c9f on OST idx 0: rc -5 I have checked the bugzilla report but we have not had a disk crash and the system was not restarted. Could this be an underlying hardware problem that’s not getting logged? Could be a hardware issue with your network, but not your disk: it looks like a network failure resulted in client eviction (server unable to contact client, so it was evicted), which resulted in the back in time message when it reconnected (and could not complete outstanding IOs -- pending writes, ie from client cache, get dropped on the floor when evicted). 
See https://bugzilla.lustre.org/show_bug.cgi?id=21681 Any additional help on this matter would be much appreciated. Kind Regards Chris ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] poor ost write performance.
First guess is the increased memory pressure caused by the Lustre 1.8 read cache. Many times slow messages are caused by memory allocations taking a long time. You could try disabling the read cache and see if that clears up the slow messages. Kevin On Apr 20, 2011, at 4:29 AM, James Rose james.r...@framestore.com wrote: Hi We have been experiencing degraded performance for a few days on a fresh install of lustre 1.8.5 (on RHEL5 using sun ext4 rpms). The initial bulk load of the data will be fine, but once in use for a while writes become very slow to individual OSTs. This will block io for a few minutes and then carry on as normal. The slow writes will then move to another OST. This can be seen in iostat, and many slow IO messages will be seen in the logs (example included). The OSTs are between 87 and 90% full. Not ideal, but has not caused any issues running 1.6.7.2 on the same hardware. The OSTs are RAID6 on external raid chassis (Infortrend). Each OST is 5.4T (small). The server is Dual AMD (4 cores). 16G Ram. Qlogic FC HBA. I mounted the OSTs as ldiskfs and tried a few write tests. These also show the same behaviour. While the write operation is blocked there will be hundreds of read tps and a very small kb/s read from the raid, but no writes. As soon as this completes, writes will go through at a more expected speed. Any idea what is going on? Many thanks James. 
Example error messages:

Apr 20 04:53:04 oss5r-mgmt kernel: LustreError: dumping log to /tmp/lustre-log.1303271584.3935
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow quota init 286s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal start 39s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 39 previous similar messages
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow brw_start 39s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 38 previous similar messages
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal start 133s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 44 previous similar messages
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow brw_start 133s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 44 previous similar messages
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal start 236s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow i_mutex 40s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 2 previous similar messages
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 6 previous similar messages
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow i_mutex 277s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow direct_io 286s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 3 previous similar messages
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal start 285s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 1 previous similar message
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow commitrw commit 285s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 1 previous similar message
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow parent lock 236s due to heavy IO load
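Kevin's suggestion above - disabling the 1.8 OSS-side read cache - can be sketched as commands. The parameter names below are the standard 1.8 obdfilter tunables, but verify them on your build with lctl list_param:

```shell
# On each OSS (Lustre 1.8): turn off the server-side read cache, and
# optionally the writethrough cache that populates it.
lctl set_param obdfilter.*.read_cache_enable=0
lctl set_param obdfilter.*.writethrough_cache_enable=0
```

These settings do not persist across an OSS reboot, so they would need to be reapplied at mount time (e.g. from an init script) if they help.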
Re: [Lustre-discuss] poor ost write performance.
Yes, difficulty finding free disk space can also be a problem, but I do not recall big changes in how that worked since 1.6, other than memory pressure from the read cache pushing out the bitmaps. See http://jira.whamcloud.com/browse/LU-15 Kevin

James Rose wrote:

Hi Kevin, Thanks for the suggestion. I will try this out. For the moment it seems that it may be disk-space related. I have removed some data from the file system. Performance returned to where I would expect it to be as space freed up (currently at 83% full). Since freeing space I have seen two messages on an OSS where the number of threads is tuned to the amount of RAM in the host, and six on an OSS that has the number of threads set higher than it should be. This is a much better situation than the steady stream I was experiencing last night. Maybe disabling the read cache will remove the last few. I am still very curious what the rapid small reads seen when writing are, as this showed up while mounted as ldiskfs, so not doing regular Lustre operations at all. Thanks again for your help, James.

On Wed, 2011-04-20 at 08:48 -0300, Kevin Van Maren wrote: First guess is the increased memory pressure caused by the Lustre 1.8 read cache. Many times "slow" messages are caused by memory allocations taking a long time. You could try disabling the read cache and see if that clears up the slow messages. Kevin

On Apr 20, 2011, at 4:29 AM, James Rose james.r...@framestore.com wrote: Hi We have been experiencing degraded performance for a few days on a fresh install of Lustre 1.8.5 (on RHEL5 using the Sun ext4 rpms). The initial bulk load of the data will be fine but once in use for a while writes become very slow to individual OSTs. This will block IO for a few minutes and then carry on as normal. The slow writes will then move to another OST. This can be seen in iostat and many slow IO messages will be seen in the logs (example included). The OSTs are between 87-90% full.
Not ideal, but this has not caused any issues running 1.6.7.2 on the same hardware. The OSTs are RAID6 on external RAID chassis (Infortrend). Each OST is 5.4T (small). The server is dual AMD (4 cores), 16G RAM, QLogic FC HBA. I mounted the OSTs as ldiskfs and tried a few write tests. These also show the same behaviour. While the write operation is blocked there will be hundreds of read tps and a very small kb/s read from the RAID, but no writes. As soon as this completes writes will go through at a more expected speed. Any idea what is going on? Many thanks James. Example error messages: [same "slow ... due to heavy IO load" log excerpt as quoted in the original message above]
Re: [Lustre-discuss] Optimal strategy for OST distribution
It used to be that multi-stripe files were created with sequential OST indexes. It also used to be that OST indexes were sequentially assigned to newly-created files. As Lustre now adds greater randomization, the strategy for assigning OSTs to OSS nodes (and storage hardware, which often limits the aggregate performance of multiple OSTs) is less important. While I have normally gone with (a), (b) can make it easier to remember where OSTs are located, and also to keep a uniform convention if the storage system is later grown. Kevin

Heckes, Frank wrote:

Hi all, sorry if this question has been answered before. What is the optimal 'strategy' for assigning OSTs to OSS nodes:

-a- Assign OSTs via round-robin to the OSSes
-b- Assign in consecutive order (as long as the backend storage provides enough capacity for IOPS and bandwidth)
-c- Something 'in-between' the 'extremes' of -a- and -b-

E.g.:

-a-  OSS_1: OST_1, OST_4, OST_7
     OSS_2: OST_2, OST_5, OST_8
     OSS_3: OST_3, OST_6, OST_9

-b-  OSS_1: OST_1, OST_2, OST_3
     OSS_2: OST_4, OST_5, OST_6
     OSS_3: OST_7, OST_8, OST_9

I thought -a- would be best for task-local (each task writes to its own file) and single-file (all tasks write to a single file) I/O, since it is like a RAID-0 approach to disk I/O (and Sun created our first FS this way). Has anyone made a systematic investigation of which approach is best, or does someone have an educated opinion? Many thanks in advance. BR -Frank Heckes

Forschungszentrum Juelich GmbH, 52425 Juelich. Registered office: Juelich. Registered in the commercial register of the district court of Dueren, no. HR B 3498. Chairman of the supervisory board: MinDirig Dr. Karl Eugen Huthmacher. Management board: Prof. Dr. Achim Bachem (chairman), Dr. Ulrich Krafft (deputy chairman), Prof. Dr.-Ing. Harald Bolt, Prof. Dr. Sebastian M. Schmidt. Visit us at our new website: www.fz-juelich.de
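Whichever layout is chosen, it can be pinned down explicitly at format time so OST numbering does not depend on the order in which OSTs first connect to the MGS. A sketch of layout -b- (consecutive indexes per OSS); the fsname, MGS NID, and device names are placeholders:

```shell
# On OSS_1: OSTs 0-2 (layout -b-).
mkfs.lustre --ost --fsname=testfs --mgsnode=mgs@tcp0 --index=0 /dev/sdb
mkfs.lustre --ost --fsname=testfs --mgsnode=mgs@tcp0 --index=1 /dev/sdc
mkfs.lustre --ost --fsname=testfs --mgsnode=mgs@tcp0 --index=2 /dev/sdd
# On OSS_2 the same commands with --index=3, 4, 5; and so on.
```

For layout -a-, the same --index option works; only the mapping of indexes to hosts changes.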
Re: [Lustre-discuss] Problem with lustre 2.0.0.1, ext3/4 and big OSTs (8Tb)
Joan J. Piles wrote:

Hi, We are trying to set up a Lustre 2.0.0.1 installation (the most recent one downloadable from the official site). We plan to have some big OSTs (~12TB), using ScientificLinux 5.5 (which should be a RHEL clone for all purposes). However, when we try to format the OSTs, we get the following error:

[root@oss01 ~]# mkfs.lustre --ost --fsname=extra --mgsnode=172.16.4.4@tcp0 --mkfsoptions '-i 262144 -E stride=32,stripe_width=192 ' /dev/sde

Permanent disk data:
Target: extra-OST
Index: unassigned
Lustre FS: extra
Mount type: ldiskfs
Flags: 0x72 (OST needs_index first_time update )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.16.4.4@tcp

checking for existing Lustre data: not found
device size = 11427830MB
formatting backing filesystem ldiskfs on /dev/sde
target name extra-OST
4k blocks 2925524480
options -i 262144 -E stride=32,stripe_width=192 -J size=400 -I 256 -q -O dir_index,extents,uninit_bg -F
mkfs_cmd = mke2fs -j -b 4096 -L extra-OST -i 262144 -E stride=32,stripe_width=192 -J size=400 -I 256 -q -O dir_index,extents,uninit_bg -F /dev/sde 2925524480
mkfs.lustre: Unable to mount /dev/sde: Invalid argument
mkfs.lustre FATAL: failed to write local files
mkfs.lustre: exiting with 22 (Invalid argument)

In the dmesg log, we find the following line:

LDISKFS-fs does not support filesystems greater than 8TB and can cause data corruption. Use force_over_8tb mount option to override.

After some investigation, we found it is related to the use of ext3 instead of ext4,

Correct.

even though we should be using ext4, as proven by the fact that the file systems created are actually ext4:

[root@oss01 ~]# file -s /dev/sde
/dev/sde: Linux rev 1.0 ext4 filesystem data (extents) (large files)

No, these are ldiskfs filesystems. ext3+ldiskfs looks a bit like ext4 (ext4 is largely based on the enhancements done for Lustre's ldiskfs), but is not the same as ext4+ldiskfs. In particular, the file system size is limited to 8TB, not 16TB.
Further, we made a test with an ext3 filesystem on the same machine, and the difference is plain:

[root@oss01 ~]# file -s /dev/sda1
/dev/sda1: Linux rev 1.0 ext3 filesystem data (large files)

Everything we found on the net about this problem seems to refer to Lustre 1.8.5. However, we would not expect such a regression in Lustre 2. Is this actually a problem with Lustre 2? Does ext4 have to be enabled either at compile time or with a parameter somewhere (we found no documentation about it)?

Lustre 2.0 did not enable ext4 by default, due to known issues. You can rebuild the Lustre server, with --enable-ext4 on the configure line, to enable it. But if you are going to use 12TB LUNs, you should either stick with v1.8.5 (stable), or pull a newer version from git (experimental). Kevin
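A sketch of the rebuild Kevin describes; the source directory and kernel-source path are placeholders, and the exact configure invocation depends on how the server tree was obtained:

```shell
# Rebuild the Lustre 2.0 server bits with the ext4-based ldiskfs enabled:
cd lustre-2.0.0.1
./configure --enable-ext4 \
    --with-linux=/usr/src/kernels/$(uname -r)   # patched kernel source tree
make rpms
# Then install the resulting ldiskfs/lustre-modules RPMs on the OSSes
# and retry the >8TB mkfs.lustre.
```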
Re: [Lustre-discuss] need help
Ashok nulguda wrote:

Dear All, How do we forcefully shut down the Lustre services on the clients, OSTs, and MDS while I/O is still in flight?

For the servers, you can just umount them. There will not be any file system corruption, but files will not have the latest data -- the cache on the clients will not be written to disk (unless recovery happens -- restart the servers without having rebooted the clients). In an emergency, this is normally all you have time to do before shutting down the system. To unmount clients, not only can there not be any IO, you also need to first kill every process that has an open file on Lustre. lsof can be useful here if you don't want to do a full shutdown, but in many environments killing non-system processes is enough. Normally you'd want to shut down all the clients first, and then the servers. Kevin
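The order Kevin describes - clients first, then servers - sketched as commands. The mount point is an assumption for illustration, and fuser -k is drastic (it kills the matched processes outright):

```shell
# 1. On each client: find what holds Lustre files open, stop it, unmount.
lsof /mnt/lustre          # list processes with open Lustre files
fuser -km /mnt/lustre     # forcibly kill them (emergency only)
umount /mnt/lustre

# 2. On each OSS, and finally on the MDS/MGS:
umount -a -t lustre
```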
Re: [Lustre-discuss] clients gets EINTR from time to time
No, in case of an eviction or IO errors, EIO is returned to the application, not EINTR. Kevin

DEGREMONT Aurelien wrote: Hello, From my understanding, Lustre can return EINTR for some I/O error cases. I think that when a client gets evicted in the middle of one of its RPCs, it can return EINTR to the caller. Can this explain your issue? Can you verify that your clients were not evicted at the same time? Aurélien

Francois Chassaing wrote: OK, thanks, that makes it clearer. I had indeed mixed up signals and error return codes. I did understand that the write()/pwrite() system call was returning the EINTR error code because it received a signal, but I supposed that the signal was sent because of an error condition somewhere in the FS. This is where I now think I was wrong. As for your questions: - I should mention that I have always had this issue, which is why I upgraded from 1.8.4 to 1.8.5, hoping that would solve it. - I will try to have that SA_RESTART flag set in the app... if I can find where the signal handler is set. - How can I see whether Lustre is returning EINTR for any other reason? As I said, the logs show nothing on either the MDS or the OSSs, but I haven't gone through lctl debug_kernel yet... which I'm going to do right away... My last question is: how can I tell which signal I am receiving? My app doesn't say; it just dumps out the write/pwrite error code. And if there were no signal handler, the signal would follow the standard actions (as per man 7 signal). On the other hand, my app does not stop or dump core, and the signal is not ignored, so it has to be handled in the code. Correct me if I'm wrong... At this point, you realize that I didn't write the app, nor am I a good Linux guru ;-) Thanks a lot.
François Chassaing, Directeur Technique - CTO, weborama

- Original Message - From: Ken Hornstein k...@cmf.nrl.navy.mil To: Francois Chassaing f...@weborama.com Cc: lustre-discuss@lists.lustre.org Sent: Thursday, 24 February 2011 15:54:24 GMT+01:00 Amsterdam / Berlin / Berne / Rome / Stockholm / Vienna Subject: Re: [Lustre-discuss] clients gets EINTR from time to time

"OK, the app is used to dealing with standard disks, that is why it is not handling the EINTR signal properly." I think you're misunderstanding what a signal is in the Unix sense. EINTR isn't a signal; it's a return code from the write() system call that says, "Hey, you got a signal in the middle of this write() call and it didn't complete." It doesn't mean that there was an error writing the file; if that were happening, you'd get a (presumably different) error code. Signals can be sent by the operating system, but those signals are things like SIGSEGV, which basically means "your program screwed up". Programs can also send signals to each other, with kill(2) and the like. Now, NORMALLY system calls like write() are interrupted by signals only when you're writing to slow devices, like network sockets. According to the signal(7) man page, disks are not normally considered slow devices, so I can understand the application not being used to handling this. And you know, now that I think about it, I'm not even sure that network filesystems SHOULD allow I/O system calls to be interrupted by signals ... I'd have to think more about it. I suspect that something changed between 1.8.5 and the previous version of Lustre you were using that allowed some operations to be interrupted by signals. Some things to try: - Check whether you are, in fact, receiving a signal in your application, and that Lustre isn't returning EINTR for some other reason.
- If you are receiving a signal, when you set the signal handler for it you could use the SA_RESTART flag to restart the interrupted I/O; I think that would make everything work like it did before. --Ken
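Francois's remaining question - "how can I tell which signal I am receiving?" - can be answered without touching the application code, for example by attaching strace to the running process (the PID is a placeholder):

```shell
# Trace write-family syscalls and all signal deliveries in the app;
# an interrupted write shows up as "= -1 EINTR" right after a
# "--- SIGxxx ---" line identifying the signal.
strace -f -tt -e trace=write,pwrite64 -e signal=all -p <PID>
```

This also distinguishes Ken's two cases: a signal arriving mid-write versus Lustre returning EINTR with no signal delivered at all.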
Re: [Lustre-discuss] OST threads
However, I don't think you can decrease the number of running threads. See https://bugzilla.lustre.org/show_bug.cgi?id=22417 (and also https://bugzilla.lustre.org/show_bug.cgi?id=22516 ) Kevin

Mervini, Joseph A wrote: Cool! Thank you Johann. Joe Mervini, Sandia National Laboratories, High Performance Computing, 505.844.6770, jame...@sandia.gov

On Feb 24, 2011, at 11:05 AM, Johann Lombardi wrote: On Thu, Feb 24, 2011 at 10:48:32AM -0700, Mervini, Joseph A wrote: Quick question: Has runtime modification of the number of OST threads been implemented in Lustre 1.8.3? Yes, see bugzilla ticket 18688. It landed in 1.8.1. Cheers, Johann -- Johann Lombardi, Whamcloud, Inc. www.whamcloud.com
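For reference, the runtime control discussed above looks roughly like this on 1.8; the parameter names are assumed from the 1.8 /proc layout (check with lctl list_param ost.*), and note Kevin's caveat about lowering the value:

```shell
# Inspect the current OST I/O service thread counts, then raise the cap:
lctl get_param ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max
lctl set_param ost.OSS.ost_io.threads_max=128
# Lowering threads_max does not stop threads that are already running
# (bug 22417); they stay until the OSS is restarted.
```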
Re: [Lustre-discuss] Compiling Lustre 2 on SLES10
Yes, that is what Oracle had announced in the roadmap. SLES servers are still supported on Lustre 1.8.x, but Oracle announced plans not to support them with Lustre 2.x. Given the similarities between the RHEL6 and SLES11 kernels, I am sure someone could bring SLES server support back when RHEL6 is supported, if enough people were willing to pay for it. Kevin

Alvaro Aguilera wrote: Does this mean that Lustre is completely dropping server support for SLES?

On Mon, Feb 21, 2011 at 4:58 PM, Johann Lombardi joh...@whamcloud.com wrote: On Mon, Feb 21, 2011 at 04:42:45PM +0100, Alvaro Aguilera wrote: inside that directory there are only files for RedHat5 and SLES11. Is SLES10 still supported? Yes, but only on the client side: http://wiki.lustre.org/index.php/Lustre_2.0#Lustre_2.0_Matrix Cheers, Johann -- Johann Lombardi, Whamcloud, Inc. www.whamcloud.com
Re: [Lustre-discuss] Kernel Panic error after lustre 2.0 installation
Yep. All you have to do is rebuild the driver for the Lustre kernel. First, bring the system back up with the non-Lustre kernel. See the bottom of the readme:

# cd /usr/src/linux/drivers/scsi/arcmsr (suppose /usr/src/linux is the soft-link for /usr/src/kernel/2.6.23.1-42.fc8-i386)
# make -C /lib/modules/`uname -r`/build CONFIG_SCSI_ARCMSR=m SUBDIRS=$PWD modules
# insmod arcmsr.ko

Except instead of `uname -r` substitute the Lustre kernel's 'uname -r', as you want to build for the Lustre kernel. Be sure you have the Lustre kernel-devel RPM installed. Note that the insmod step will not work (the driver is already loaded for the running kernel, and the one you built for the Lustre kernel will not load into the running kernel). You will need to rebuild the initrd for the Lustre kernel (see the other instructions in the readme, using the Lustre kernel). Kevin

Arya Mazaheri wrote: The driver name is arcmsr.ko and I extracted it from the driver.img included on the RAID controller's CD. The following text file may clarify better: ftp://areca.starline.de/RaidCards/AP_Drivers/Linux/DRIVER/RedHat/FedoraCore/Redhat-Fedora-core8/1.20.0X.15/Intel/readme.txt Please tell me if you need more information about this issue...

On Thu, Feb 17, 2011 at 11:33 PM, Brian J. Murrell br...@whamcloud.com wrote: On Thu, 2011-02-17 at 23:26 +0330, Arya Mazaheri wrote: Hi there, Hi,

Unable to access resume device (LABEL=SWAP-sda3)
mount: could not find filesystem 'dev/root'
setuproot: moving /dev failed: No such file or directory
setuproot: error mounting /proc: No such file or directory
setuproot: error mounting /sys: No such file or directory
switchroot: mount failed: No such file or directory
Kernel panic - not syncing: Attempted to kill init!

I have no problem with the original kernel installed by CentOS. I guessed this may be related to the RAID controller card driver, which may not be loaded by the patched Lustre kernel. That seems like a reasonable conclusion given the information available. So I have added the driver into the initrd.img file. Where did you get the driver from? What is the name of the driver? But it didn't solve the problem. Depending on where it came from, yes, it might not. Should I install Lustre by building the source? That may be required, but not necessarily. We need more information. b.
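Putting Kevin's steps together as one sequence; the kernel version string is a placeholder (substitute the actual 'uname -r' of the installed Lustre kernel), and mkinitrd is the RHEL/CentOS 5-era tool:

```shell
LKVER=2.6.18-194.3.1.el5_lustre   # placeholder: your Lustre kernel version
cd /usr/src/linux/drivers/scsi/arcmsr
# Build against the Lustre kernel's build tree, not the running kernel:
make -C /lib/modules/$LKVER/build CONFIG_SCSI_ARCMSR=m SUBDIRS=$PWD modules
cp arcmsr.ko /lib/modules/$LKVER/kernel/drivers/scsi/
depmod -a $LKVER
# Rebuild the Lustre kernel's initrd so the driver is available at boot:
mkinitrd -f --with=arcmsr /boot/initrd-$LKVER.img $LKVER
```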
Re: [Lustre-discuss] Lustre client error
To figure out which OST is which, use e2label /dev/sdX (or e2label /dev/mapper/mpath7), which will print the OST index in hex. If clients run out of space, but there is space left, see Bug 22755 (mostly fixed in Lustre 1.8.4). Lustre assigns the OST index at file creation time. Lustre will avoid full OSTs, but once a file is created any growth must be accommodated by the initial OST assignment(s). Deactivating the OST on the MDS will prevent new allocations, but they shouldn't be happening anyway. You can copy/rename some large files to put them on another OST, which will free up space on the full OST (a move will not allocate new space, just change the directory entry). Kevin

Jagga Soorma wrote: This OST is 100% full now with only 12GB remaining, and something is actively writing to this volume. What would be the appropriate thing to do in this scenario? If I set this to read-only on the MDS then some of my clients start hanging. Should I be running lfs find -O OST_UID /lustre and then moving the files out of this filesystem and re-adding them? But then there is no guarantee that they will not be written to this specific OST. Any help would be greatly appreciated. Thanks, -J

On Tue, Feb 15, 2011 at 3:05 PM, Jagga Soorma jagg...@gmail.com wrote: I might be looking at the wrong OST. What is the best way to map the actual /dev/mapper/mpath[X] to what OST ID is used for that volume?
Thanks, -J

On Tue, Feb 15, 2011 at 3:01 PM, Jagga Soorma jagg...@gmail.com wrote: Also, it looks like the client is reporting a different %used compared to the OSS server itself:

client:
reshpc101:~ # lfs df -h | grep -i 0007
reshpcfs-OST0007_UUID   2.0T   1.7T   202.7G   84%   /reshpcfs[OST:7]

oss:
/dev/mapper/mpath7   2.0T   1.9T   40G   98%   /gnet/lustre/oss02/mpath7

Here is how the data seems to be distributed on one of the OSS's:

/dev/mapper/mpath5   2.0T   1.2T   688G   65%   /gnet/lustre/oss02/mpath5
/dev/mapper/mpath6   2.0T   1.7T   224G   89%   /gnet/lustre/oss02/mpath6
/dev/mapper/mpath7   2.0T   1.9T   41G    98%   /gnet/lustre/oss02/mpath7
/dev/mapper/mpath8   2.0T   1.3T   671G   65%   /gnet/lustre/oss02/mpath8
/dev/mapper/mpath9   2.0T   1.3T   634G   67%   /gnet/lustre/oss02/mpath9

-J

On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma jagg...@gmail.com wrote: I did deactivate this OST on the MDS server. So how would I deal with an OST filling up? The OSTs don't seem to be filling up evenly either. How does Lustre handle an OST that is at 100%? Would it not use this specific OST for writes if there are other OSTs available with capacity? Thanks, -J

On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger adil...@whamcloud.com wrote: On 2011-02-15, at 12:20, Cliff White wrote: Client situation depends on where you deactivated the OST - if you deactivate on the MDS only, clients should be able to read. What is best to do when an OST fills up really depends on what else you are doing at the time, how much control you have over what the clients are doing, and other things. If you can solve the space issue with a quick rm -rf, best to leave it online; likewise, if all your clients are trying to bang on it and failing, best to turn things off. YMMV. In theory, with 1.8 the full OST should be skipped for new object allocations, but this is not robust in the face of e.g.
a single very large file being written to the OST that takes it from average usage to being full.

On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma jagg...@gmail.com wrote: Hi Guys, One of my clients got a hung Lustre mount this morning and I saw the following errors in my logs:

..snip..
Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_write operation failed with -28
Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836 previous similar messages
Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an
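Kevin's e2label mapping and the copy/rename workaround, sketched as commands. The device names, fsname, and mount point are modeled on the thread and should be treated as examples:

```shell
# On the OSS: map each multipath device to its OST label; the hex
# suffix of the label is the OST index.
for d in /dev/mapper/mpath5 /dev/mapper/mpath6 /dev/mapper/mpath7; do
    echo -n "$d -> "; e2label "$d"
done
printf '%d\n' 0x0007          # label suffix 0007 = OST index 7 in decimal

# On a client: list files with objects on the full OST, then copy and
# rename a few large ones so their data is reallocated on other OSTs.
lfs find --obd reshpcfs-OST0007_UUID /reshpcfs
```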
Re: [Lustre-discuss] 1GB throughput limit on OST (1.8.5)?
Normally if you are having a problem with write BW, you need to futz with the switch. If you were having problems with read BW, you need to futz with the server's config (xmit hash policy is the usual culprit). Are you testing multiple clients to the same server? Are you using mode 6 because you don't have bonding support in your switch? I normally use 802.3ad mode, assuming your switch supports link aggregation. I was bonding 2x1Gb links for Lustre back in 2004. That was before BOND_XMIT_POLICY_LAYER34 was in the kernel, so I had to hack the bond xmit hash (with multiple NICs standard, layer2 hashing does not produce a uniform distribution, and can't work if going through a router). Any one connection (socket or node/node connection) will use only one gigabit link. While it is possible to use two links using round-robin, that normally only helps for client reads (server can't choose which link to receive data, the switch picks that), and has the serious downside of out-of-order packets on the TCP stream. [If you want clients to have better client bandwidth for a single file, change your default stripe count to 2, so it will hit two different servers.] Kevin David Merhar wrote: Sorry - little b all the way around. We're limited to 1Gb per OST. djm On Jan 27, 2011, at 7:48 AM, Balagopal Pillai wrote: I guess you have two gigabit nics bonded in mode 6 and not two 1GB nics? (B-Bytes, b-bits) The max aggregate throughput could be about 200MBps out of the 2 bonded nics. I think the mode 0 bonding works only with cisco etherchannel or something similar on the switch side. Same with the FC connection, its 4Gbps (not 4GBps) or about 400-500 MBps max throughout. Maybe you could also see the max read and write capabilities of the raid controller other than just the network. When testing with dd, some of the data remains as dirty data till its flushed into the disk. I think the default background ratio is 10% for rhel5 which would be sizable if your oss have lots of ram. 
There is a chance of the OSS locking up once it hits the dirty_ratio limit, which is 40% by default. So a somewhat more aggressive flush to disk (by lowering the background ratio), leaving more headroom before it hits dirty_ratio, is generally desirable if your RAID controller can keep up with it. So with your current setup, I guess you could get a max of 400MBps out of both OSSs if they both have two 1Gb NICs in them. Maybe if you have one of the switches from Dell that has 4 10Gb ports (their PowerConnect 6248), 10Gb NICs for your OSSs might be a cheaper way to increase the aggregate performance. I think over 1GBps from a client is possible in cases where you use InfiniBand and RDMA to deliver data.

David Merhar wrote: Our OSS's with 2x1GB NICs (bonded) appear limited to 1GB worth of write throughput each. Our setup:
- 2 OSS serving 1 OST each
- Lustre 1.8.5
- RHEL 5.4
- New Dell M610 blade servers with plenty of CPU and RAM
- All SAN fibre connections are at least 4GB

Some notes:
- A direct write (dd) from a single OSS to the OST gets 4GB, the OSS's fibre wire speed.
- A single client will get 2GB of lustre write speed, the client's ethernet wire speed.
- We've tried bond mode 6 and 0 on all systems. With mode 6 we will see both NICs on both OSSs receiving data.
- We've tried multiple OSTs per OSS. But 2 clients writing a file will get 2GB of total bandwidth to the filesystems.

We have been unable to isolate any particular resource bottleneck. None of the systems (MDS, OSS, or client) seem to be working very hard. The 1GB per OSS threshold is so consistent that it almost appears by design - and hopefully we're missing something obvious. Any advice? Thanks.
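The dirty-page tuning Balagopal describes might look like the following sketch. The values are illustrative assumptions, not recommendations; pick them based on RAM size and RAID throughput:

```shell
# Flush dirty pages earlier and keep headroom below the blocking limit
# (RHEL5 defaults are 10 and 40 respectively; values here are illustrative):
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=40
# To persist across reboots, add the same keys to /etc/sysctl.conf.
```

The idea is that background writeback starts well before writers hit the hard dirty_ratio wall, so a dd-style burst drains steadily instead of stalling the OSS.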
djm

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Support for 2.6.32 kernel.org Kernel in Lustre 1.8.5
Client support through 2.6.32 (vanilla) is in v1.8.5. Looks like one page missed getting updated. http://wiki.lustre.org/index.php/Lustre_Release_Information#Lustre_Support_Matrix

Kevin

Nirmal Seenu wrote: I have a quick question: are patchless clients for the kernel.org kernel 2.6.32 officially supported under Lustre 1.8.5, or do I need to include any patches? In the Lustre source tree, lustre/ChangeLog says that 2.6.32 is supported, while the wiki page (http://wiki.lustre.org/index.php/Change_Log_1.8) says only kernels up to 2.6.30 are officially supported for patchless clients. Note: I am able to build the patchless clients cleanly with the 2.6.32.20 kernel and OFED 1.5.2. Thanks Nirmal

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] manual OST failover for maintenance work?
Cliff White wrote: On 12/06/2010 09:57 AM, Adeyemi Adesanya wrote: Hi. We have pairs of OSS nodes hooked up to shared storage arrays containing OSTs, but we have not enabled any failover settings yet. Now we need to perform maintenance work on an OSS and we would like to minimize Lustre downtime. Can I use tunefs.lustre to specify the OSS failover NID for an existing OST? I assume I'll have to take the OST offline to make this change. Will clients that have Lustre mounted pick up this change, or will all clients have to remount? I should mention that we are running Lustre 1.8.2.

Yes, see the Lustre Manual for details. cliffw

Should be something like this for an OST: # tunefs.lustre --writeconf --erase-params --mgsnode=10.0@o2ib --mgsnode=10.0@o2ib --param=failover.node=10.0@o2ib /dev/ost0 Do the MGS first (if not already done and it will have failover). A dedicated MGS should not have to specify the MGS nodes, just the failover parameter. For an MDT, you would probably also need --param=mdt.group_upcall=/usr/sbin/l_getgroups Note that you must add the failover NID (i.e., do the tunefs and the first mount) on the _primary_ (non-failover) node. Lustre machines get the NID information for MDT/OST devices from the MGS at mount time. There is no callback mechanism to notify of changes to the NIDs, so yes, clients would have to re-mount the file system to be able to use the failover NIDs.

Kevin

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
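A minimal sketch of the procedure Kevin outlines, with hypothetical NIDs standing in for the truncated ones in the thread (192.168.0.1 = MGS, 192.168.0.11 = failover OSS; adjust to your fabric):

```shell
# Run on the *primary* OSS with the OST unmounted.  All NIDs below are
# hypothetical placeholders.
tunefs.lustre --writeconf --erase-params \
    --mgsnode=192.168.0.1@o2ib \
    --param=failover.node=192.168.0.11@o2ib \
    /dev/ost0
# Then mount the OST on the primary node first.  Clients must remount
# the filesystem to learn the new failover NID (no callback mechanism).
```

The ordering matters: the failover NID must be registered from the primary node's first mount, and since --erase-params wipes all parameters, any others (e.g. the MDT's group_upcall) must be re-specified at the same time.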
Re: [Lustre-discuss] Delete ost
Wang Yibin wrote: Hello, On 2010-11-19, at 3:21 AM, Thomas Johansson wrote: I am not sure I understand - do you have multiple filesystems sharing the same MGS? Yes, 5 filesystems on 4 OSSs and 2 MDSs in active/passive failover. Some 100 TB of space in total. Probably you misunderstood me. You seem to be using 1 filesystem (2 MDS / 4 OSS) with 5 clients. Making 5 Lustre filesystems out of 4 OSS / 2 MDS is mission impossible.

No, while it is not often done, there is nothing to prevent 5 Lustre file systems from running on 4 OSS nodes and 2 MDS nodes. In addition to the MGS, each file system needs one MDT and 1 or more OSTs. An OSS can serve up OSTs for multiple file systems, and an MDS node can serve up MDTs for multiple file systems (and a node could even be both an MDS and OSS at the same time). Now, if there were a separate MGS for each file system, then it would be a different story... each node can really only serve up OSTs or MDTs for a single MGS.

Kevin

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
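Kevin's point, one MGS serving several filesystems whose MDTs/OSTs share nodes, can be sketched as follows. The device names, filesystem names, and the MGS NID are all hypothetical:

```shell
# Hypothetical layout: a standalone MGS at 192.168.0.1@tcp, plus two
# filesystems ("alpha" and "beta") sharing the same MDS and OSS nodes.
mkfs.lustre --mgs /dev/sda                                            # MGS node
mkfs.lustre --fsname=alpha --mdt --mgsnode=192.168.0.1@tcp /dev/sdb   # MDS node
mkfs.lustre --fsname=beta  --mdt --mgsnode=192.168.0.1@tcp /dev/sdc   # same MDS node
mkfs.lustre --fsname=alpha --ost --mgsnode=192.168.0.1@tcp /dev/sdd   # OSS node
mkfs.lustre --fsname=beta  --ost --mgsnode=192.168.0.1@tcp /dev/sde   # same OSS node
```

Each filesystem is distinguished by its --fsname; the shared nodes simply mount multiple targets. The constraint Kevin notes still holds: all targets on a given node must point at the same MGS.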
Re: [Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62...@o2ib, specified as failover
Adrian Ulrich wrote: Hi Kevin, But you specified that as a failover node: # tunefs.lustre --erase-params --param=failover.node=10.201.62...@o2ib,10.201.30...@tcp failover.node=10.201.62...@o2ib,10.201.30...@tcp mdt.group_upcall=/usr/sbin/l_getgroups /dev/md10 Well: first I was just running # tunefs.lustre --param mdt.quota_type=ug /dev/md10 and this alone was enough to break it.

Not sure; did you specify both sets on your mkfs command line?

The initial installation was done/dictated by the Swiss branch of a (no longer existing) three-letter company. This command was used to create the filesystem on the MDS: # FS_NAME=lustre1 # MGS_1=10.201.62...@o2ib0,10.201.30...@tcp0 # MGS_2=10.201.62...@o2ib0,10.201.30...@tcp0 # mkfs.lustre --reformat --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_1} --failnode=${MGS_2} /dev/md10

I haven't done a combined MDT/MGS for a while, so I can't recall if you have to specify the MGS NIDs for the MDT when it is colocated with the MGS, but I think the command should have been more like: # mkfs.lustre --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_2} --mgsnode=${MGS_1} --mgsnode=${MGS_2} /dev/md10 with the mkfs/first mount on MGS_1. As I mentioned, you would not normally specify the mkfs/first-mount NIDs as failover parameters, as they are added automatically by Lustre.

Kevin

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] LBUG on lustre 1.8.0
Sure, but I think for engineering to make progress on this bug, they are going to want a crash dump. If you can enable crash dumps and panic-on-LBUG (and if HA, increase the dead timeout so it can complete the dump before being shot in the head), it would provide more info for the bug report. That being said, there are quite a few other bugs that have been fixed since 1.8.0, so you really should upgrade ASAP to 1.8.4.

Kevin

On Nov 21, 2010, at 6:59 PM, Larry tsr...@gmail.com wrote: We had an LBUG several days ago on our Lustre 1.8.0. One OSS reported kernel: LustreError: 24669:0:(service.c:1311:ptlrpc_server_handle_request()) ASSERTION(atomic_read(&(export)->exp_refcount) < 0x5a5a5a) failed kernel: LustreError: 24669:0:(service.c:1311:ptlrpc_server_handle_request()) LBUG kernel: Lustre: 24669:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 24669 .. I googled for this, and found little information about it. It seems to be a race condition on the OSS, right? Should I open a bugzilla for this LBUG? Thanks.

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] LBUG on lustre 1.8.0
Larry wrote: We added options libcfs libcfs_panic_on_lbug=1 in modprobe.conf to make the server kernel panic as soon as the LBUG happens. Is there some way to have the server die a few seconds after the LBUG? We are also puzzled by the messages lost when the LBUG happened.

The messages should have gone to the console just fine (hopefully you are logging a serial console). If you are talking about /var/log/messages, then yes, it will be missing the final output, as the messages don't have time to get written to disk on a kernel panic.

Kevin

On Mon, Nov 22, 2010 at 10:42 AM, Kevin Van Maren kevin.van.ma...@oracle.com wrote: Sure, but I think for engineering to make progress on this bug, they are going to want a crash dump. If you can enable crash dumps and panic on lbug (and if HA, increase the dead timeout so it can complete the dump before being shot in the head), it would provide more info for the bug report. That being said, there are quite a few other bugs that have been fixed since 1.8.0, so you really should upgrade ASAP to 1.8.4. Kevin On Nov 21, 2010, at 6:59 PM, Larry tsr...@gmail.com wrote: We had an LBUG several days ago on our Lustre 1.8.0. One OSS reported kernel: LustreError: 24669:0:(service.c:1311:ptlrpc_server_handle_request()) ASSERTION(atomic_read(&(export)->exp_refcount) < 0x5a5a5a) failed kernel: LustreError: 24669:0:(service.c:1311:ptlrpc_server_handle_request()) LBUG kernel: Lustre: 24669:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 24669 .. I googled for this, and found little information about it. It seems to be a race condition on the OSS, right? Should I open a bugzilla for this LBUG? Thanks.
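The combination Kevin recommends, panic on LBUG plus a crash dump, might look like this. The kdump details vary by distro and the crashkernel size is an assumption, so treat this as a sketch:

```shell
# /etc/modprobe.conf fragment: turn an LBUG into an immediate kernel panic
options libcfs libcfs_panic_on_lbug=1

# Then let kdump capture the panic (RHEL5-style setup; sizes illustrative):
#   1. add crashkernel=128M@16M to the kernel line in grub.conf
#   2. chkconfig kdump on && service kdump start
# If the node is under HA control, raise the stonith/dead timeout so the
# dump can finish before the node is power-cycled by its partner.
```

The HA timeout is the easy part to forget: a fenced node cannot finish writing its vmcore, which defeats the purpose of panicking in the first place.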
___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] [Fwd: Re: Broken client]
Not sure. Could be some clients had data in their cache, and others hit the error when they tried to get it from the OST. Sorry I misunderstood you -- I thought you had already run fsck on the OSTs.

Kevin

On Nov 19, 2010, at 9:41 AM, Herbert Fruchtl herbert.fruc...@st-andrews.ac.uk wrote: Thanks guys, Looks like unmounting the unhealthy OST filesystem and running an fsck on it (which found several errors) solved the problem! I still don't understand why it looked different from different clients... Cheers, Herbert

Oleg Drokin wrote: Hello! So are there any other complaints on the OSS node when you mount that OST? Did you try to run e2fsck on the OST disk itself (while unmounted)? I assume one of the possible problems is just on-disk fs corruption (and it might show unhealthy due to that right after mount too). Bye, Oleg

On Nov 18, 2010, at 1:47 PM, Herbert Fruchtl wrote: Sorry, I had meant to cc this to the list. Herbert From: Herbert Fruchtl herbert.fruc...@st-andrews.ac.uk Date: November 18, 2010 12:56:53 PM EST To: Kevin Van Maren kevin.van.ma...@oracle.com Subject: Re: [Lustre-discuss] Broken client Hi Kevin, That didn't change anything. Unmounting one of the OSTs hung (yes, with an LBUG), and I did a hard reboot. It came up again, and the status is as before: on the MDT server, I can see all files (well, I assume it's all); on the client in question some files appear broken. The OST is still not healthy. I am running another lfsck, without much hope. Here's the LBUG: Nov 18 17:05:16 oss1-fs kernel: LustreError: 8125:0:(lprocfs_status.c:865:lprocfs_free_client_stats()) LBUG Herbert

Kevin Van Maren wrote: Reboot the server with the unhealthy OST. If you look at the logs, there is likely an LBUG that is causing the problems. Kevin

On Nov 18, 2010, at 9:51 AM, Herbert Fruchtl herbert.fruc...@st-andrews.ac.uk wrote: It looks like you may have corruption on the mdt or an ost, where the objects on an OST can't be found for the directory entry.
Have you had a crash recently or run Lustre fsck? You might need to do fsck and delete (unlink) the broken files.

The files do exist (I can see them on the MDT server) and I don't want to delete them. There was a crash lately, and I have run an lfsck afterwards (repeatedly, actually).

I suppose it's also possible you're seeing fallout from an earlier LBUG or something. Try 'cat /proc/fs/lustre/health_check' on all the servers.

There seems to be a problem: [r...@master ~]# cat /proc/fs/lustre/health_check healthy [r...@master ~]# ssh oss1 'cat /proc/fs/lustre/health_check' device home-OST0005 reported unhealthy NOT HEALTHY [r...@master ~]# ssh oss2 'cat /proc/fs/lustre/health_check' healthy [r...@master ~]# ssh oss3 'cat /proc/fs/lustre/health_check' healthy What do I do about the unhealthy OST? Herbert

-- Herbert Fruchtl Senior Scientific Computing Officer School of Chemistry, School of Mathematics and Statistics University of St Andrews -- The University of St Andrews is a charity registered in Scotland: No SC013532

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Broken client
Wang Yibin wrote: Hello, On 2010-11-18, at 10:03 PM, Herbert Fruchtl wrote: I was wrong about only one client having problems. It seems to be all of them, except the MDS server (see below), so it is a problem of the filesystem (not the client) after all.

It looks like you may have corruption on the MDT or an OST, where the objects on an OST can't be found for the directory entry. Have you had a crash recently or run Lustre fsck? You might need to do fsck and delete (unlink) the broken files. I suppose it's also possible you're seeing fallout from an earlier LBUG or something. Try 'cat /proc/fs/lustre/health_check' on all the servers.

Kevin

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
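Kevin's health check can be scripted across servers. A hedged sketch, with hypothetical hostnames matching the ones used elsewhere in the thread:

```shell
#!/bin/sh
# Report any server whose /proc/fs/lustre/health_check is not "healthy".
# Hostnames are hypothetical; adjust to your site.
health_ok() {
    [ "$1" = "healthy" ]
}
for host in master oss1 oss2 oss3; do
    status=$(ssh -o BatchMode=yes -o ConnectTimeout=2 "$host" \
             cat /proc/fs/lustre/health_check 2>/dev/null || true)
    health_ok "$status" || echo "$host: ${status:-unreachable}"
done
```

An unhealthy device (e.g. "device home-OST0005 reported unhealthy NOT HEALTHY") will be flagged along with the host that reported it, which is a quick first triage step before digging through syslog for the LBUG.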
Re: [Lustre-discuss] Network problems
Arne Brutschy wrote: Hi all, we're using Lustre 1.8.3 on a gigabit network. We have 4 osts on 2 oss and a single mgs. We are serving the users' homedirs (mostly small files) for 64 clients on this network. It now happened for the third time that the cluster went down: either the oss or the mgs block, and nobody can access the lustre share anymore. Looking at the logs, I see lots of connectivity errors: LustreError: 17792:0:(mgs_handler.c:641:mgs_handle()) MGS handle cmd=250 rc=-16 LustreError: 17792:0:(mgs_handler.c:641:mgs_handle()) Skipped 3 previous similar messages LustreError: 17792:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16) r...@f5ae642c x1344336331331741/t0 o250-6e1c6cb5-564f-49b0-a01e-e7e460542...@net_0x20ac1_uuid:0/0 lens 368/264 e 0 to 0 dl 1288258643 ref 1 fl Interpret:/0/0 rc -16/0 LustreError: 17792:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 3 previous similar messages Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1348047450260821 sent from lustre-MDT-mdc-f3aa2e00 to NID 0...@lo 30s ago has timed out (30s prior to deadline). r...@e66d1e00 x1348047450260821/t0 o38-lustre-mdt_u...@10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 1288258632 ref 1 fl Rpc:N/0/0 rc 0/0 Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 previous similar messages Lustre: MGS: haven't heard from client 03b9cdae-66f1-552b-8c7c-94a9499c8dcf (at 10.255.255@tcp) in 228 seconds. I think it's dead, and I am evicting it. 
LustreError: 2893:0:(acceptor.c:455:lnet_acceptor()) Error -11 reading connection request from 10.255.255.199 Lustre: 2936:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS: fa602b20-b24c-bbcd-7003-b3b9bf702db4 reconnecting Lustre: 2936:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 49 previous similar messages LustreError: 2888:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Error -104 reading HELLO from 10.255.255.199 LustreError: 2888:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Skipped 2 previous similar messages LustreError: 11b-b: Connection to 10.255.255@tcp at host 10.255.255.199 on port 988 was reset: is it running a compatible version of Lustre and is 10.255.255@tcp one of its NIDs? Lustre: 20268:0:(ldlm_lib.c:804:target_handle_connect()) MGS: exp eb3ff200 already connecting Lustre: 17792:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse reconnection from fa602b20-b24c-bbcd-7003-b3b9bf702...@10.255.255.199@tcp to 0xeb3ff200; still busy with 2 active RPCs

It looks like the server threads are spending a long time processing requests. If you look at the client logs for 10.255.255.199, you will likely see that it thinks the server died and tried to fail over. The server, when it finally got around to processing the request, noticed that the client was no longer there, as it had given up on the server. The server, for sanity, won't allow the client to reconnect until the outstanding transactions have completed (so the question is why they are taking so long). Are you seeing any slow messages on the servers? There are lots of reasons server threads could be slow. If /proc/sys/vm/zone_reclaim_mode is 1, try setting it to 0. You might want to try the patch in Bug 23826, which I found useful for tracking how long the server thread was processing the request, rather than just the IO phase.
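The zone_reclaim_mode suggestion, as a concrete sequence (standard Linux VM sysctls; run as root on the servers):

```shell
# Check whether NUMA zone reclaim is on:
cat /proc/sys/vm/zone_reclaim_mode
# If it prints 1, disable it:
echo 0 > /proc/sys/vm/zone_reclaim_mode
# To persist across reboots, add to /etc/sysctl.conf:
#   vm.zone_reclaim_mode = 0
```

Zone reclaim makes a NUMA node reclaim its own pages before borrowing from another node, which can stall server threads on allocation; for a file server that is usually a net loss.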
Lustre: 17792:0:(ldlm_lib.c:875:target_handle_connect()) Skipped 1 previous similar message Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1348047450261066 sent from lustre-MDT-mdc-f3aa2e00 to NID 0...@lo 30s ago has timed out (30s prior to deadline). r...@ce186400 x1348047450261066/t0 o38-lustre-mdt_u...@10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 1288259252 ref 1 fl Rpc:N/0/0 rc 0/0 Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 previous similar messages Lustre: There was an unexpected network error while writing to 10.255.255.221: -110. Lustre: 20268:0:(ldlm_lib.c:804:target_handle_connect()) MGS: exp d8637600 already connecting Lustre: 19656:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse reconnection from 7ee2fe58-3fab-c39a-8adb-c356d1bdc...@10.255.255.209@tcp to 0xd8637600; still busy with 1 active RPCs LustreError: 19656:0:(mgs_handler.c:641:mgs_handle()) MGS handle cmd=250 rc=-16 LustreError: 19656:0:(mgs_handler.c:641:mgs_handle()) Skipped 3 previous similar messages LustreError: 19656:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16) r...@f5ae6c2c x1344329600678410/t0 o250-7ee2fe58-3fab-c39a-8adb-c356d1bdc...@net_0x20ad1_uuid:0/0 lens 368/264 e 0 to 0 dl 1288259427
Re: [Lustre-discuss] 1.8 quotas
David Dillow wrote: On Fri, 2010-10-22 at 22:56 +0800, Fan Yong wrote: On 10/22/10 9:37 PM, Jason Hill wrote: Folks, Not having had to deal with quotas on our scratch filesystems in the past, I'm puzzled about why we're seeing messages like the following: Oct 22 09:29:00 widow-oss3c2 kernel: kernel: Lustre: widow3-OST00b1: slow quota init 35s due to heavy IO load We're (I think) not doing quotas. [ ... ] So, the question is - if we see messages like slow quota init, are quotas being calculated in the background? And as a followup - how do we turn them off?

No. I think you are misled by the message slow quota init 35s due to heavy IO load, which does not mean quota is being recalculated (initially calculated) in the background. In fact, such a message is printed out before an obdfilter write; at that point, the OST tries to acquire enough quota for the write operation. It checks locally whether the remaining quota for the uid/gid (for this OST object) is enough or not; if not, the quota slave on this OST acquires more quota from the quota master on the MDS. This process may take a long time on a heavily loaded system, especially when the remaining quota on the quota master (MDS) is also very limited. The message you saw just shows that. There is no good way to disable these messages as long as quota is set on this uid/gid.

This is the heart of Jason's question -- he has done nothing to his knowledge to enable quotas at all, so why is he getting a message about quotas? Are they actually enabled on the FS, and how would he be able to verify that? Or does it always process quotas, even if they are not enabled?

That message, from lustre/obdfilter/filter_io_26.c, is the result of the thread taking 35 seconds from when it entered filter_commitrw_write() until after it called lquota_chkquota() to check the quota. However, it is certainly plausible that the thread was delayed because of something other than quotas, such as an allocation (e.g., it could have been stuck in filter_iobuf_get).
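To answer the "are quotas actually enabled?" part of the question, something like the following should work on 1.8. This is hedged: the exact parameter names vary between Lustre versions, so verify against your own tree:

```shell
# On the servers: show the configured quota type (empty/off means no
# quota enforcement; "ug" means user+group).  Parameter names are from
# 1.8.x and may differ in other versions.
lctl get_param obdfilter.*.quota_type 2>/dev/null   # on each OSS
lctl get_param mds.*.quota_type 2>/dev/null         # on the MDS
# From a client: per-user usage and limits (all-zero limits = no quota set).
lfs quota -u some_user /mnt/lustre
```

Note that even with enforcement off, the "slow quota init" message can still appear, since (as Kevin explains) it is really timing the whole commit path, not just quota work.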
Kevin ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] bad csum errors
https://bugzilla.lustre.org/show_bug.cgi?id=11742

Kevin

John White wrote: Hello Folks, Recently we've had a fair number of messages akin to the following coming from our OSS syslog: n0004: LustreError: 168-f: lrc-OST0002: BAD WRITE CHECKSUM: changed in transit before arrival at OST from 12345-10.4.8@o2ib inum 1409775/2324736913 object 1771080/0 extent [401408-2809855] n0004: LustreError: Skipped 13 previous similar messages n0004: LustreError: 10839:0:(ost_handler.c:1169:ost_brw_write()) client csum ae09a542, original server csum cfb6ab4b, server csum now cfb6ab4b There appear to be no specific clients, OSSs or OSTs in common. We'll commonly get a block of messages concerning one OST w/ different clients involved and then move on to another OST. As such, I'm doubting this is a memory issue. Previous mails on this list mention MMAP, but there doesn't seem to be any mention in these messages. Ideas? John White High Performance Computing Services (HPCS) (510) 486-7307 One Cyclotron Rd, MS: 50B-3209C Lawrence Berkeley National Lab Berkeley, CA 94720

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Profiling data
Maybe something like RobinHood?

On Sep 28, 2010, at 2:41 PM, David Noriega tsk...@my.utsa.edu wrote: This question isn't really about Lustre, but about file system administration in general. I was wondering what tools exist, particularly anything free/open source, that can scan for old files and either report to the admin or tell the user that said files are, say, 1 year old and should be archived or deleted. Also any tools that can profile file types, such as to check if someone is keeping their mp3 library on our server. Thanks David -- Personally, I liked the university. They gave us money and facilities, we didn't have to produce anything! You've never been out of college! You don't know what it's like out there! I've worked in the private sector. They expect results. -Ray Ghostbusters

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
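For small trees, plain find already answers the "files older than a year" part. A hedged sketch (on a large Lustre filesystem a policy engine like RobinHood scales much better, since find stats every file); the demo below works in a throwaway directory so it is safe to run anywhere:

```shell
#!/bin/sh
# Demo: flag files not modified in over a year.  Uses a scratch directory
# with one file back-dated to Jan 2020 and one fresh file.
dir=$(mktemp -d)
touch -t 202001010000 "$dir/old.dat"   # mtime forced to 2020-01-01
touch "$dir/new.dat"                   # mtime = now
# -mtime +365: last modified more than 365 days ago.  In production the
# admin would mail this list to the owners instead of echoing it.
old_files=$(find "$dir" -type f -mtime +365)
echo "$old_files"
rm -rf "$dir"
```

For the "profile file types" half of the question, the same find skeleton can feed `file --brief` or match on extensions (`-name '*.mp3'`), though content-based detection is the only reliable way to catch renamed media files.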
Re: [Lustre-discuss] Lustre 1.8.4 with new kernel 2.6.18-194.11.4
https://bugzilla.lustre.org/show_bug.cgi?id=22514 Have you tried the 1.8.4 client on the stock kernel?

Kevin

Mike Hanby wrote: Are there any plans to build new Lustre 1.8.4 patched kernel packages for EL5 kernel 2.6.18-194.11.4? This kernel has the patch that prevents the much talked about privilege escalation CVE-2010-3081: https://rhn.redhat.com/errata/RHSA-2010-0704.html Regards, Mike

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts
Bernd Schubert wrote: Hello Cory, On 09/17/2010 11:31 PM, Cory Spitz wrote: Hi, Bernd. On 09/17/2010 02:48 PM, Bernd Schubert wrote: On Friday, September 17, 2010, Andreas Dilger wrote: On 2010-09-17, at 12:42, Jonathan B. Horen wrote: We're trying to architect a Lustre setup for our group, and want to leverage our available resources. In doing so, we've come to consider multi-purposing several hosts, so that they'll function simultaneously as MDS OSS. You can't do this and expect recovery to work in a robust manner. The reason is that the MDS is a client of the OSS, and if they are both on the same node that crashes, the OSS will wait for the MDS client to reconnect and will time out recovery of the real clients. Well, that is some kind of design problem. Even on separate nodes it can easily happen, that both MDS and OSS fail, for example power outage of the storage rack. In my experience situations like that happen frequently... I think that just argues that the MDS should be on a separate UPS. Or dual-redundant UPS devices driving all critical infrastructure. Redundant power supplies are the norm for server-class hardware, and they should be cabled to different circuits (which each need to be sized to sustain the maximum power). well, there is not only a single reason. Next hardware issue is that maybe an IB switch fails. Sure, but that's also easy to address (in theory): put OSS nodes on different leaf switches than MDS nodes, and put the failover pairs on different switches as well. In practice, IB switches probably do not fail often enough to worry about recovery glitches, especially if they have redundant power, but I certainly recommend failover partners are on different switch chips so that in case of a failure it is still possible to get the system up. 
I would also recommend using bonded network interfaces to avoid cable-failure issues (i.e., connect both OSS nodes to both of the leaf switches, rather than one to each), but there are some outstanding issues with Lustre on IB bonding (patches in bugzilla), and of course multipath to disk (loss of connectivity to disk was mentioned at LUG as one of the biggest causes of Lustre issues). In general it is easier to have redundant cables than to ensure your HA package properly monitors cable status and does a failover when required.

And then we have also seen cascading Lustre failures. It starts with an LBUG on the OSS, which triggers another problem on the MDS...

Yes, that's why bugs are fixed. panic_on_lbug may help stop the problem before it spreads, depending on the issue.

Also, for us this actually will become a real problem, which cannot be easily solved. So this issue will become a DDN priority. Cheers, Bernd -- Bernd Schubert DataDirect Networks

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Question about adaptive timeouts, not sending early reply
I believe this message says that the request timeout on this transaction is 42s, but when Lustre went to ask for more time based on the current AT service estimate, it came up with 30s. Since 30s is < 42s, it could not ask for more time.

Kevin

Thomas Roth wrote: Hi all, I'm trying to understand MDT logs and adaptive timeouts. After the upgrade to 1.8.4, and while users believed Lustre to be still in maintenance (= no activity), the MDT log just shows Lustre: 19823:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (42/30), not sending early reply Now, for historical reasons of running on a very shaky network, we load the lustre module with options ptlrpc at_max=6000 options ptlrpc at_history=6000 options ptlrpc at_early_margin=50 Right now however, the MDT reports: lxmds:~# lctl get_param -n mdt.MDS.mds.timeouts service : cur 30 worst 76 (at 1284734311, 0d19h33m39s ago) 30 30 30 30 Reading the manual on adaptive timeouts again, I conclude that if the current estimate for the timeout is 30 sec, the MDT is indeed hard pressed to send an early reply 50 sec before that timeout occurs. The log message states something of the like, (42/30). So, is my assessment correct? Are these log messages just due to the stupid at_early_margin setting? Regards, Thomas

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
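The arithmetic behind "(42/30)" can be sketched as follows. The values come from the log line and the lctl output above; the logic is a simplification of the early-reply decision, not the actual ptlrpc_at_send_early_reply code:

```shell
#!/bin/sh
# Simplified model: an early reply can only extend a request's service
# deadline up to the current adaptive-timeout (AT) service estimate.
current_timeout=42   # seconds already granted to this request
at_estimate=30       # current AT service estimate ("cur 30" in lctl output)
if [ "$at_estimate" -le "$current_timeout" ]; then
    echo "Couldn't add any time ($current_timeout/$at_estimate), not sending early reply"
else
    echo "extending deadline to ${at_estimate}s"
fi
```

With the service estimate (30s) already below the granted timeout (42s), there is no extra time to offer, so the message is expected noise rather than a sign of trouble by itself.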
Re: [Lustre-discuss] 1.8.4 and write-through cache
Stu Midgley wrote: Afternoon. I upgraded our OSSs from 1.8.3 to 1.8.4 on Saturday (due to https://bugzilla.lustre.org/show_bug.cgi?id=22755) and suffered a great deal of pain. We have 30 OSSs of multiple vintages. The basic difference between them is * md on first 20 nodes * 3ware 9650SE ML12 on last 10 nodes After the upgrade to 1.8.4 we were seeing terrible throughput on the nodes with 3ware cards (and only the nodes with 3ware cards). This was typified by seeing the block device 100% utilised (iostat), doing about 100 r/s and 400kb/s, and all the ost_io threads in D state (no writes). They would be in this state for 10 mins and then suddenly awake and start pushing data again. 1-2 mins later, they would lock up again. The OSSs were dumping stacks all over the place, crawling along and generally making our Lustre fs unusable.

Would you post a few of the stack traces? Presumably these were driven by watchdog timeouts, but it would help to know where they were getting stuck.

After trying different kernels, raid card drivers, changing write back policy on the raid cards etc., the solution was to lctl set_param obdfilter.*.writethrough_cache_enable=0 lctl set_param obdfilter.*.read_cache_enable=0 on all the nodes with the 3ware cards. Has anyone else seen this? I am completely baffled as to why it only affects our nodes with 3ware cards. These nodes were working very well under 1.8.3...

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Oss Error and 0 byte files
I believe grant leak is still possible with 1.8.4, but many of the holes are plugged.

Kevin

Gabriele Paciucci wrote: the bug 22755 is fixed in 1.8.4 http://wiki.lustre.org/index.php/Use:Change_Log_1.8

On 09/09/2010 11:55 AM, Gianluca Tresoldi wrote: Yes, the client gets ENOSPC, I see now. Anyway: thank you very much for your reply ;)

On 09/08/10 17:29, Kevin Van Maren wrote: It might be related to bug 22755, but there the client gets ENOSPC

On Sep 8, 2010, at 8:02 AM, Gianluca Tresoldi gianluca.treso...@tuttogratis.com wrote: Hello everyone, I have an installation with Lustre 1.8.2, CentOS 5, x86_64, and I encountered this problem: after several months of smooth operation, clients began to write empty files without any log error; from their point of view the writes were successful. The OSSs wrote, in their logs, several lines like: Sep 8 12:40:31 tgoss-0200 kernel: LustreError: 5816:0:(filter_io.c:183:filter_grant_space_left()) lfs01-OST: cli 20d94382-3300-f12e-65d1-c0f1743e1e20/8106a4e30a00 grant 39956230144 available 39956226048 and pending 0 I checked the availability of space and inodes, but this is not the problem. The problem goes away by rebooting the OST. This is the second time I have seen it: first in July 2010, second in September 2010. Any ideas? Is it a bug? Thanks -- Gianluca Tresoldi

___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Virtual machines
I seem to recall Mellanox presenting a paper on IB support virtual machines at SC two years ago. I think it was just a proof of concept, and I'm unaware of the current status. Kevin On Sep 8, 2010, at 6:09 AM, Brian J. Murrell brian.murr...@oracle.com wrote: On Wed, 2010-09-08 at 05:50 -0500, Brian O'Connor wrote: Does lustre work in a VM? Yes, of course, given that a VM provides an entire virtual computer. what about in a VM over Infiniband? I don't know of any VMs which expose the hosts Infiniband hardware for the VM to use directly. Xen might. libvirt/kvm might. But those are just WAGs. b. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Oss Error and 0 byte files
It might be related to bug 22755, but there the client gets ENOSPC On Sep 8, 2010, at 8:02 AM, Gianluca Tresoldi gianluca.treso...@tuttogratis.com wrote: Hello everyone I've an installation with Lustre 1.8.2, CentOS 5, x86_64 and I encountered this problem: After several months of smooth operation, clients began to write empty files with no logged error; from their point of view the writes were successful. The OSSs wrote several lines like this in their logs: Sep 8 12:40:31 tgoss-0200 kernel: LustreError: 5816:0:(filter_io.c:183:filter_grant_space_left()) lfs01-OST: cli 20d94382-3300-f12e-65d1-c0f1743e1e20/8106a4e30a00 grant 39956230144 available 39956226048 and pending 0 I checked the availability of space and inodes, but this is not the problem. The problem goes away by rebooting the OST. This is the second time I have seen it, first in July 2010, second in September 2010. Any ideas? Is it a bug? Thanks -- Gianluca Tresoldi Tuttogratis Italia Spa E-mail: gianluca.treso...@tuttogratis.com http://www.tuttogratis.it ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Lustre requirements and tuning tricks
On Sep 8, 2010, at 8:25 AM, Joe Landman land...@scalableinformatics.com wrote: Joan J. Piles wrote: And then 2 MDS like these: - 2 x Intel 5520 (quad core) processor (or equivalent). - 36Gb RAM. - 2 x 64Gb SSD disks. - 2 x10Gb Ethernet ports. Hmmm In general there is not much gain from using SSD for MDT, and depending on the SSD, it could do much _worse_ than spinning rust. Many ssd controllers degrade horribly under the small random write workload. (SSD are best for sequential write, random read). Journals may receive some benefit, as the sequential write pattern works much better for SSDs, although SSDs are not normally needed there. After having read the documentation, it seems to be a sensible configuration, specially regarding the OSS. However we are not so sure about the MDS. We have seen recommendations to reserve 5% of the total file system space in the MDS. Is this true and then we should go for 2x2Tb SAS disks for the MDS? Is SSD really worth there? There is a nice formula for approximating your MDS needs on the wiki. Basically it is something to the effect of Number-of-inodes-planned * 1kB = storage space required So, for 10 million inodes, you need ~10 GB of space. I am not sure if this helps, but you might be able to estimate your likely usage scenario. Updating MDSes isn't easy (e.g. you have to pre-plan) It is 4KB/inode on the MDT. (It can be set to 2KB if you need 4 billion files on an 8TB MDT). My sizing rule of thumb has been ~ one MDT drive in RAID10 for each OST, to ensure you scale IOPS. And we have also read about having a separate storage for the OSTs' journals. Is it really useful to get a pair of extra small (16Gb) SSD disks for each OST to keep the journals and bitmaps? It doesn't have to be SSD, and bitmaps are only applicable for software RAID. But unless you use asynchronous journals, there is normally a big win from external journals -- even with HW RAID having non-volatile storage. 
The big win is putting journals on RAID 1, rather than RAID 5/6. Finally, we have also read that it's important to have different OSTs on different physical drives to avoid bottlenecks. Is that so if we make a big RAID volume and then several logical volumes (done with the hardware RAID card, the operating system would just see different block devices)? Yes, though this will be suboptimal in performance. You want traffic to different LUNs not sharing the same physical disks. Build smaller RAID containers, and single LUNs atop those. You get the best performance with one HW RAID per OST. And that RAID should be optimized for 1MB IO (i.e., not 6+p) for best performance without having to muck with a bunch of parameters. If the OSTs are on the same drives, then there will be excessive head contention as different OST filesystems seek the same disks, greatly reducing throughput. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: land...@scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
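The MDT sizing rule given above (roughly 4 KB of MDT space per inode on 1.8.x) turns into a quick back-of-the-envelope estimate; the numbers below are illustrative, not a recommendation:

```shell
#!/bin/sh
# Estimate minimum MDT capacity from the planned inode (file) count.
# Assumption from the thread: ~4 KB of MDT space per inode (1.8.x default).
inodes=10000000          # planned number of files (illustrative)
bytes_per_inode=4096
mdt_bytes=$((inodes * bytes_per_inode))
echo "$((mdt_bytes / 1024 / 1024 / 1024)) GiB minimum MDT size"
```

For 10 million files this gives about 38 GiB; in practice you would add generous headroom, since growing an MDT later is painful.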
Re: [Lustre-discuss] Announce: Lustre 2.0.0 is available!
Yes On Aug 27, 2010, at 7:04 AM, Mike Hanby mha...@uab.edu wrote: Are the release notes accurate in that OFED 1.5.1 is not supported in Lustre 2.0.0 but it is supported in 1.8.4? -Original Message- From: lustre-discuss-boun...@lists.lustre.org [mailto:lustre-discuss- boun...@lists.lustre.org] On Behalf Of Terry Rutledge Sent: Thursday, August 26, 2010 12:55 PM Subject: [Lustre-discuss] Announce: Lustre 2.0.0 is available! Hi all, The entire Lustre team is pleased to announce the GA Release of Lustre 2.0.0. This represents the first release of the main Lustre trunk in a number of years. The team has spent extraordinary efforts over the last year preparing this release for GA. This release has had the most extensive pre-release testing of any previous Lustre release. We are excited for the community to try this release and offer feedback. Our next 2.x release is planned for later this year and details will follow at a later date. Quick Reference: Lustre 2.0.0 is available on the Oracle Download Center Site. http://www.oracle.com/technetwork/indexes/downloads/sun-az-index-095901.html#L The Lustre 2.0 Operations Manual: http://dlc.sun.com/pdf/821-2076-10/821-2076-10.pdf The Release Notes: http://dlc.sun.com/pdf/821-2077-10/821-2077-10.pdf The change log: http://wiki.lustre.org/index.php/Change_Log_2.0 As always, you can report issues via Bugzilla: https://bugzilla.lustre.org/ To access earlier releases of Lustre, please check the box See previous products(P), then click L or scroll down to Lustre, the current and all previous releases (1.8.0 - 1.8.4) will be displayed. Happy downloading! -- The Lustre Team -- ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Enabling async journals while the filesystem is active
Yes, but depending on the Lustre version there are several bugs in the async journal code. Kevin Erik Froese wrote: Is it safe to enable async journals on the OSS's while the filesystem is active? I'd like to see how it works for us. Thanks Erik ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
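For reference, the async-journal switch being discussed is an obdfilter tunable on the OSS; a sketch, assuming Lustre 1.8.x where it can be toggled at runtime:

```shell
# Enable asynchronous journal commits on all OSTs served by this OSS.
# sync_journal=1 means flush the journal before replying (the safe default);
# sync_journal=0 enables the async journal behaviour discussed here.
lctl set_param obdfilter.*.sync_journal=0
# Check the current setting:
lctl get_param obdfilter.*.sync_journal
```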
Re: [Lustre-discuss] More detail regarding soft lockup error
Andreas _always_ recommends a backup first. Kevin Brian J. Murrell wrote: On Thu, 2010-08-19 at 10:09 -0600, Andreas Dilger wrote: If you increase the size of the MDT (via resize2fs) it will increase the number of inodes as well. Andreas: what is [y]our confidence level with resize2fs and our MDT? Given that I don't think we regularly (if at all) test this in our QA cycles (although I wish we would) I personally would be a lot more comfortable with a backup first. What are your thoughts? Unnecessary? b. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Splitting lustre space
David Noriega wrote: OK hooray! Lustre setup with failover of all nodes, but now we have this huge lustre mount point. How can I say create /lustre/home and /lustre/groups and mount on the client? David Two choices: 1) create two Lustre file systems (separate MDT and OSTs for each) 2) use mount --bind on the client to make one filesystem's directories show up in different places ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
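Option 2 above can be sketched as follows on a client; the paths and the MGS specifier are placeholders, and the Lustre file system is assumed to be mounted once at /lustre:

```shell
# Mount the Lustre file system once, then re-expose subdirectories
# at the desired locations with bind mounts.
mount -t lustre mdshost@tcp:/lustre /lustre   # placeholder MGS spec
mkdir -p /home /groups
mount --bind /lustre/home /home
mount --bind /lustre/groups /groups
```

Bind mounts are a client-side view only: quotas, striping, and space accounting still belong to the single underlying file system.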
Re: [Lustre-discuss] Splitting lustre space
David Noriega wrote: Ok, so I could do mount --bind /lustre/home /home mount --bind /lustre/groups /groups Is this a generally accepted practice with Lustre? This just seems so much like a nifty trick, but if its what the community uses, then ok. It is a pretty nifty trick. Same file system, so the same quotas (if any) would apply to both directories. But ultimately if I wanted two separate filesystems, I would need more hardware? An OST can't be put into a general 'pool' for use between the two? You probably don't need more hardware, but you would have to decide which file system each OST would serve -- it can only provide space to one file system. So some of your OSTs would be for home and some for groups. You would need to have 2 MDTs (if necessary, you could split/partition the MDT you have). Kevin David On Wed, Aug 18, 2010 at 12:33 PM, Kevin Van Maren kevin.van.ma...@oracle.com wrote: David Noriega wrote: OK hooray! Lustre setup with failover of all nodes, but now we have this huge lustre mount point. How can I say create /lustre/home and /lustre/groups and mount on the client? David Two choices: 1) create two Lustre file systems (separate MDT and OSTs for each) 2) use mount --bind on the client to make one filesystem's directories show up in different places ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Question on setting up fail-over
David Noriega wrote: Ok I've gotten heartbeat setup with the two OSSs, but I do have a question that isn't stated in the documentation. Shouldn't the lustre mounts be removed from fstab once they are given to heartbeat since when it comes online, it will mount the resources, correct? David Yes: on the servers, they must be not there or noauto. Once you start running heartbeat, you have given control of the resource away, and must not mount/umount it yourself (unless you stop heartbeat on both nodes in the HA pair to get control back). Kevin ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
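The noauto arrangement described above looks like this in /etc/fstab on an HA server (device label and mount point are placeholders); heartbeat then performs the actual mounts:

```shell
# /etc/fstab entry on an OSS in an HA pair: present for reference,
# but never mounted automatically at boot -- heartbeat owns the resource.
#   LABEL=lustre-OST0000  /mnt/ost0  lustre  noauto,_netdev  0 0
# Sanity check that no Lustre target would auto-mount at boot:
grep lustre /etc/fstab
```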
Re: [Lustre-discuss] ost pools
That's more likely if you file a bug report at bugzilla.lustre.org. Even better if you modify check_and_complete_ostname in lctl to handle your OST names and submit a patch with the bug. Kevin Stu Midgley wrote: Right, so I assume this means it will be fixed in some future version of lustre and until then I can't have those nodes in the pool until then? On Tue, Aug 10, 2010 at 3:41 PM, Andreas Dilger andreas.dil...@oracle.com wrote: On 2010-08-10, at 01:20, Stu Midgley wrote: # lctl pool_add l1.default l1-OST[10] OST l1-OST0010_UUID is not part of the 'l1' fs. pool_add: No such file or directory All the nodes that have the new-style names went into the pool just fine. all the nodes with old-style names will not go into the pool. eg. ost_011_UUID I had a quick look at lctl::jt_pool_cmd(), and it looks like this checking is done in userspace in check_and_complete_ostname(), to avoid bad interactions with invalid OST names, and to allow short forms of the OST to be used (e.g. OST0001 instead of l1-OST0001_UUID). That said, it should also be possible to have lctl scan the existing OST UUID array via setup_obd_indexes(param-obd_uuid = ost_name) to see if the OST name is actually valid before adding it to the pool. That will iterate over the list of OSTs, and use llapi_uuid_match() to see if the OST name is valid. We have a lustre file system which started life at V1.4 and is now at V1.8. I'm keen to use ost pools, but I can't actually add nodes to the pool. The node names are not in a format that lctl pool_add likes ost_011_UUID3.3T3.0T 331.5G 90% /l1[OST:10] lctl pool_add l1.default OST[10] OST l1-OST0010_UUID is not part of the 'l1' fs. pool_add: No such file or directory How do I get nodes with these names added to a pool? Thanks. 
-- Dr Stuart Midgley sdm...@gmail.com ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss Cheers, Andreas -- Andreas Dilger Lustre Technical Lead Oracle Corporation Canada Inc. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
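For anyone following the pool discussion above, the commands in question (run on the MGS) look like this; the pool name and OST index range are illustrative, and as noted, only OSTs with standard fsname-OSTnnnn names are accepted:

```shell
# Create a pool and add OSTs to it (run on the MGS node).
lctl pool_new l1.default
lctl pool_add l1.default l1-OST[0-9]
# Show pool membership:
lctl pool_list l1.default
```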
Re: [Lustre-discuss] Question on setting up fail-over
On Aug 9, 2010, at 11:45 AM, David Noriega tsk...@my.utsa.edu wrote: My understanding of setting up fail-over is you need some control over the power so with a script it can turn off a machine by cutting its power? Is this correct? It is the recommended configuration because it is simple to understand and implement. But the only _hard_ requirement is that both nodes can access the storage. Is there a way to do fail-over without having access to the PDUs (power strips)? If you have IPMI support, that can be used for power control instead of a switched PDU. Depending on the storage, you may be able to do resource fencing of the disks instead of STONITH. Or you can run fast-and-loose, without any way to ensure the dead node is really dead and not accessing storage (at your risk). While Lustre has MMP, it is really more to protect against a mount typo than to guarantee resource fencing. Thanks David -- Personally, I liked the university. They gave us money and facilities, we didn't have to produce anything! You've never been out of college! You don't know what it's like out there! I've worked in the private sector. They expect results. -Ray Ghostbusters ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
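Where IPMI is available, the STONITH action mentioned above reduces to a power command against the peer's BMC; a sketch, with the BMC hostname and credentials as placeholders:

```shell
# Power off the failed peer via its BMC before taking over its targets.
ipmitool -I lanplus -H peer-bmc.example.com -U admin -P secret chassis power off
# Confirm it is actually down before mounting its OSTs:
ipmitool -I lanplus -H peer-bmc.example.com -U admin -P secret chassis power status
```

HA stacks such as heartbeat typically wrap this in a fencing agent rather than calling ipmitool directly, but the underlying operation is the same.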
Re: [Lustre-discuss] Multiple FS in one MDS
If the question is whether you can have multiple file systems on the same servers, the answer is yes. 1) Need a new LUN/partition for the MDT 2) Need new LUN/partitions for the OST to provide space for that file system, as an OST belongs to exactly one Lustre file system It looks like you added a second MDT, but did not add any OSTs for the NGS file system? If the question is whether you can take one file system and mount parts of it at different places on the client, then that answer is also yes: look at mount --bind which can make a file system (or subdir) appear at a different location. Kevin Fabio Cassarotti Parronchi Navarro wrote: Hi, We recently started using Lustre in production environment for a small storage and we are currently testing speed and reliability. So far, everyone is exited with it. But here comes the problem. Is it possible to create another mount point on the same server that is already running a MDS ? For example: [mdsho...@tcp:/Projects ( current ) [mdsho...@tcp:/NGS ( new ) Actually, I've been able to create another partition on the MDS using ( --fsname=NGS ) and mount it, the OSSs seems to be running nicely too ( at least no problems are reported on the log files ). 
But when I try to mount NGS file system on the clients, the mount command freezes with no output on the logs mount -t lustre [mdsho...@tcp:/NGS /home/NGS/ The MDS logs: Aug 4 08:32:02 pnq1 kernel: LustreError: 19304:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-11) r...@810112097800 x1338817395530034/t0 o38-?@?:0/0 lens 368/0 e 0 to 0 dl 1280921622 ref 1 fl Interpret:/0/0 rc -11/0 Aug 4 08:32:24 pnq1 kernel: LustreError: 19304:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-11) r...@810226542000 x1338817395530041/t0 o38-?@?:0/0 lens 368/0 e 0 to 0 dl 1280921644 ref 1 fl Interpret:/0/0 rc -11/0 Aug 4 08:32:24 pnq1 kernel: LustreError: 19304:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 1 previous similar message Aug 4 08:33:01 pnq1 kernel: LustreError: 19297:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-11) r...@8101fc298c00 x1336109251604116/t0 o38-?@?:0/0 lens 368/0 e 0 to 0 dl 1280921681 ref 1 fl Interpret:/0/0 rc -11/0 Aug 4 08:33:01 pnq1 kernel: LustreError: 19297:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 4 previous similar messages Aug 4 08:33:07 pnq1 kernel: Lustre: NGS-MDT: temporarily refusing client connection from 192.168.10...@tcp Do I have to change any config on the MDS to fix this issue? Or this architecture is not supported by Lustre ? Thanks in advice, Fábio Navarro -- Ludwig Insitute for Cancer Research LTDA Laboratory of Computational Biology 245 João Julião St - 1th floor CEP 01323-903 - Sao Paulo - Brazil Phone: 55 11 33883232 ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
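Adding the missing OSTs for the second file system is the key step Kevin identifies above; a sketch of the formatting commands, with device paths and the MGS NID as placeholders (flags assume Lustre 1.8.x):

```shell
# New MDT for the NGS file system (the MGS is assumed to run on mdshost):
mkfs.lustre --fsname=NGS --mdt --mgsnode=mdshost@tcp /dev/sdX
# At least one OST that belongs to NGS -- an OST serves exactly one file system:
mkfs.lustre --fsname=NGS --ost --mgsnode=mdshost@tcp /dev/sdY
# Mount the new targets on their servers, then on a client:
#   mount -t lustre mdshost@tcp:/NGS /home/NGS
```

Until at least one NGS OST is up, clients attempting to mount NGS will hang much as described, since the file system has metadata but nowhere to place objects.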
Re: [Lustre-discuss] Client directory entry caching
Since Bug 22492 hit a lot of people, it sounds like opencache isn't generally useful unless enabled on every node. Is there an easy way to force files out of the cache (i.e., echo 3 > /proc/sys/vm/drop_caches)? Kevin On Aug 3, 2010, at 11:50 AM, Oleg Drokin oleg.dro...@oracle.com wrote: Hello! On Aug 3, 2010, at 12:49 PM, Daire Byrne wrote: So even with the metadata going over NFS the opencache in the client seems to make quite a difference (I'm not sure how much the NFS client caches though). As expected I see no mdt activity for the NFS export once cached. I think it would be really nice to be able to enable the opencache on any lustre client. A couple of potential workloads that I A simple workaround for you to enable opencache on a specific client would be to add cr_flags |= MDS_OPEN_LOCK; in mdc/mdc_lib.c:mds_pack_open_flags() Yea that works - cheers. FYI some comparisons with a simple find on a remote client (~33,000 files): find /mnt/lustre (not cached) = 41 secs find /mnt/lustre (cached) = 19 secs find /mnt/lustre (opencache) = 3 secs Hm, initially I was going to say that find is not open-intensive so it should not benefit from opencache at all. But then I realized if you have a lot of dirs, then indeed there would be a positive impact on subsequent reruns. I assume that the opencache result is a second run and first run produces same 41 seconds? BTW, another unintended side-effect you might experience if you have mixed opencache enabled/disabled network is if you run something (or open for write) on an opencache-enabled client, you might have problems writing (or executing) that file from non-opencache enabled nodes as long as the file handle would remain cached on the client. This is because if open lock was not requested, we don't try to invalidate current ones (expensive) and MDS would think the file is genuinely open for write/execution and disallow conflicting accesses with EBUSY. performance when compared to something simpler like NFS. 
Slightly off topic (and I've kinda asked this before) but is there a good reason why link() speeds in Lustre are so slow compare to something like NFS? A quick comparison of doing a cp -al from a remote Lustre client and an NFS client (to a fast NFS server): cp -fa /mnt/lustre/blah /mnt/lustre/blah2 = ~362 files/sec cp -fa /mnt/nfs/blah /mnt/nfs/blah2 = ~1863 files/sec Is it just the extra depth of the lustre stack/code path? Is there anything we could do to speed this up if we know that no other client will touch these dirs while we hardlink them? Hm, this is a first complaint about this that I hear. I just looked into strace of cp -fal (which I guess you mant instead of just -fa that would just copy everything). so we traverse the tree down creating a dir structure in parallel first (or just doing it in readdir order) open(/mnt/lustre/a/b/c/d/e/f, O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 +1 RPC fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 +1 RPC (if no opencache) fcntl(3, F_SETFD, FD_CLOEXEC) = 0 getdents(3, /* 4 entries */, 4096) = 96 getdents(3, /* 0 entries */, 4096) = 0 +1 RPC close(3)= 0 +1 RPC (if no opencache) lstat(/mnt/lustre/a/b/c/d/e/f/g, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 (should be cached, so no RPC) mkdir(/mnt/lustre/blah2/b/c/d/e/f/g, 040755) = 0 +1 RPC lstat(/mnt/lustre/blah2/b/c/d/e/f/g, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 +1 RPC stat(/mnt/lustre/blah2/b/c/d/e/f/g, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 (should be cached, so no RPC) Then we get to files: link(/mnt/lustre/a/b/c/d/e/f/g/k/8, /mnt/lustre/blah2/b/c/d/e/f/g/ k/8) = 0 +1 RPC futimesat(AT_FDCWD, /mnt/lustre/blah2/b/c/d/e/f/g/k, {{1280856246, 0}, {128085 6291, 0}}) = 0 +1 RPC then we start traversing the just created tree up and chowning it: chown(/mnt/lustre/blah2/b/c/d/e/f/g/k, 0, 0) = 0 +1 RPC getxattr(/mnt/lustre/a/b/c/d/e/f/g/k, system.posix_acl_access, 0x7fff519f0950, 132) = -1 ENODATA (No data available) +1 RPC 
stat(/mnt/lustre/a/b/c/d/e/f/g/k, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 (not sure why another stat here, we already did it on the way up. Should be cached) setxattr(/mnt/lustre/blah2/b/c/d/e/f/g/k, system.posix_acl_access, \x02\x00 \x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff \xff \x00\x0 5\x00\xff\xff\xff\xff, 28, 0) = 0 +1 RPC getxattr(/mnt/lustre/a/b/c/d/e/f/g/k, system.posix_acl_default, 0x7fff519f09 50, 132) = -1 ENODATA (No data available) +1 RPC stat(/mnt/lustre/a/b/c/d/e/f/g/k, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 Hm, stat again? did not we do it a few syscalls back? stat(/mnt/lustre/blah2/b/c/d/e/f/g/k, {st_mode=S_IFDIR|0755, st_size=4096, ... }) = 0 stat of the target. +1 RPC (the cache got invalidated by link above). setxattr(/mnt/lustre/blah2/b/c/d/e/f/g/k, system.posix_acl_default, \x02\x0 0\x00\x00, 4,
Re: [Lustre-discuss] Per directory quota
On Jul 16, 2010, at 7:17 AM, Christopher J. Walker c.j.wal...@qmul.ac.uk wrote: I know Lustre can do quotas per user, but can Lustre do quotas on a per directory basis? No, Lustre does not support directory (fileset) based quotas. I can't work out how to do this from the manual. To be more specific, the software we use[1] is written by people using GPFS and we'd like an equivalent to the GPFS command: mmlsquota -j which AIUI finds out how much space is used under a directory. We could use du --summarize mydirectory but for a directory containing a large number of files, this takes a long time - and is presumably not very efficient. If there were an lfs du it would presumably be more efficient, but even so, probably still resource intensive. Without size-on-mds, either way would have to query both the MDS and each OST to get the size info. Not being that familiar with size-on-mds, it does seem likely that du would still have to query the OSTs for size info, even when ls -l does not. Am I missing something? [1] http://storm.forge.cnaf.infn.it/home Chris ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] NFS Export Issues
Without more information about the server error messages and exact nfs configuration, not sure anyone can help more than this. A common problem with Lustre NFS exports, one that isn't due to normal NFS/configuration issues, is getting error -43 when the mds did not have the client's IDs in its /etc/passwd and /etc/group files. Dumb question, but have you checked the permissions on the NFS server's Lustre mount point (before/after Lustre is mounted), and exported a non-Lustre directory successfully? Kevin Andreas Dilger wrote: My only other suggestion is to dump the Lustre kernel debug log on the NFS server after a mount failure to see where/why it is getting the permission error. # lctl clear # (mount NFS client) # lctl dk /tmp/debug Then search through the logs for -2 errors (-EPERM). Cheers, Andreas On 2010-07-16, at 10:06, William Olson lustre_ad...@reachone.com wrote: On 7/15/2010 5:48 PM, Andreas Dilger wrote: On 2010-07-15, at 08:33, William Olson wrote: Somebody, anybody? I'm sure it's something fairly simple, but it escapes me, assistance would be greatly appreciated! ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
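Andreas' debug-log procedure referenced above, spelled out as a script run on the NFS server (which is the Lustre client); the NFS client mount command is a placeholder:

```shell
# Capture a Lustre kernel debug log around a failing NFS access.
lctl clear                                   # empty the kernel debug buffer
# ...now reproduce the failure from the NFS client, e.g.:
#   mount -t nfs nfsserver:/export /mnt/nfs
lctl dk /tmp/debug                           # dump the buffer to a file
grep -- '-2' /tmp/debug                      # error value suggested in the thread
```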
Re: [Lustre-discuss] Luster 1.8.3 Qlogic OFED 1.4.2
If you replace OFED, you do not need to rebuild the kernel (unless you want to patch/change it), so you can install the binary Lustre kernel. You do need to rebuild Lustre (or at least the kernel modules) (step #3), as o2ib must be built against the OFED you are running. Kevin Marco Aurelio L Gomes wrote: Hi, I saw the post above on the lustre-discuss list and would like to know if in case I install OFED 1.5 on Lustre 1.8.3 I'll need build the lustre kernel, or only install the available kernel from lustre download page. When I see that the build in (1) is optional I thought that is possible to use that available kernel. Thanks in advance. Best regards, Marco Gomes Systems/HPC-Cluster Numerical Offshore Tank Naval and Ocean Engineering Department's Laboratory Escola Politécnica University of São Paulo +55 11 3777 4142 ext. 250 On Wed, 2010-06-09 at 12:48 -0700, Kevin Van Maren wrote: Looks like a mis-match on the OFED modules Lustre is expecting. If not using the included OFED, you need these steps: 1) Build (optional) and install the Lustre kernel 2) Build OFED against the Lustre kernel and install it 3) Build Lustre against the kernel and the OFED you are using Lustre defaults to using the in-kernel OFED unless you point configure at a different set of headers. Kevin - Original Message - From: srirangam.addepa...@gmail.com To: lustre-discuss@lists.lustre.org Sent: Wednesday, June 9, 2010 1:33:04 PM GMT -07:00 US/Canada Mountain Subject: [Lustre-discuss] Luster 1.8.3 Qlogic OFED 1.4.2 Hello All, I am trying to use luster with qlogic ofed 1.4.2 . After building and installing the kernel when i try modprobe lustre i get the following ! 
errors: # modprobe lustre WARNING: Error inserting osc (/lib/modules/2.6.18-164.11.1.el5_lustre.1.8.3/kernel/fs/lustre/osc.ko): Input/output error WARNING: Error inserting mdc (/lib/modules/2.6.18-164.11.1.el5_lustre.1.8.3/kernel/fs/lustre/mdc.ko): Input/output error WARNING: Error inserting lov (/lib/modules/2.6.18-164.11.1.el5_lustre.1.8.3/kernel/fs/lustre/lov.ko): Input/output error FATAL: Error inserting lustre (/lib/modules/2.6.18-164.11.1.el5_lustre.1.8.3/kernel/fs/lustre/lustre.ko): Input/output error dmesg shows errors of the type Lustre: OBD class driver, http://www.lustre.org/ Lustre: Lustre Version: 1.8.3 Lustre: Build Version: 1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3 ko2iblnd: disagrees about version of symbol ib_fmr_pool_unmap ko2iblnd: Unknown symbol ib_fmr_pool_unmap ko2iblnd: disagrees about version of symbol ib_create_cq ko2iblnd: Unknown symbol ib_create_cq ko2iblnd: disagrees about version of symbol rdma_resolve_addr ko2iblnd: Unknown symbol rdma_resolve_addr Following are the rpm's installed # rpm -qa | grep kernel kernel-doc-2.6.18-164.6.1.el5 kernel-2.6.18-164.6.1.el5 kernel-headers-2.6.18-164.6.1.el5 kernel-ib-devel-1.4.2.1-2.6.18_164.11.1.el5_lustre.1.8.3 kernel-2.6.18-164.11.1.el5_lustre.1.8.3 kernel-ib-1.4.2.1-2.6.18_164.11.1.el5_lustre.1.8.3 kernel-devel-2.6.18-164.6.1.el5 kernel-devel-2.6.18-164.11.1.el5_lustre.1.8.3 What am i missing. Ady ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
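Step 3 of Kevin's list above (rebuilding Lustre against the external OFED) usually means pointing configure at the OFED kernel headers; a sketch, where both paths are typical for such an install and may differ on your system:

```shell
# Build Lustre kernel modules against the installed Lustre kernel
# and the external (here: QLogic) OFED headers, not the in-kernel OFED.
cd lustre-1.8.3
./configure --with-linux=/usr/src/kernels/2.6.18-164.11.1.el5_lustre.1.8.3 \
            --with-o2ib=/usr/src/ofa_kernel
make rpms   # then install the resulting lustre-modules packages
```

The "disagrees about version of symbol" messages above are the classic sign that ko2iblnd was built against a different OFED than the one loaded, which this rebuild addresses.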
Re: [Lustre-discuss] NFS Export Issues
Looks like a problem with your mount point. What are the permissions on the client directory? On Jul 16, 2010, at 6:23 PM, William Olson lustre_ad...@reachone.com wrote: On 7/16/2010 5:12 PM, Andreas Dilger wrote: Well that improved the debug level, but didn't reveal any -2 errors.. In fact I can't seem to find a line with an error in it... Is there a specific verbiage used on error lines that I can grep for? 90% is Process entered or Process leaving... You could try strace -f on the mount process, to see which syscall is failing. It may be failing with something before it gets to Lustre. Results of strace below: [r...@lustreclient mnt]# strace -f -p 15964 Process 15964 attached - interrupt to quit lstat(/mnt, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat(/mnt/lustre_mail_fs, 0x7fff4bd4b2b0) = -1 EACCES (Permission denied) stat(/etc/localtime, {st_mode=S_IFREG|0644, st_size=2875, ...}) = 0 ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] NFS Export Issues
But the client is doing a lstat on /mnt/lustre_mail_fs, not /mnt/ lustre -- what is the mount command again? On Jul 16, 2010, at 6:50 PM, William Olson lustre_ad...@reachone.com wrote: On 7/16/2010 5:41 PM, Kevin Van Maren wrote: Looks like a problem with your mount point. What are the permissions on the client directory? NFSServer/Lustre Client Lustre mounted: drwxrwxrwx 29 root root 4.0K Jul 12 17:03 lustre_mail_fs Lustre not mounted: drwxrwxrwx 2 root root 4.0K Jun 10 13:26 lustre_mail_fs NFSClient mount dir: drwxrwxrwx 2 root root 4.0K Jul 12 15:09 lustre On Jul 16, 2010, at 6:23 PM, William Olson lustre_ad...@reachone.com wrote: On 7/16/2010 5:12 PM, Andreas Dilger wrote: Well that improved the debug level, but didn't reveal any -2 errors.. In fact I can't seem to find a line with an error in it... Is there a specific verbiage used on error lines that I can grep for? 90% is Process entered or Process leaving... You could try strace -f on the mount process, to see which syscall is failing. It may be failing with something before it gets to Lustre. Results of strace below: [r...@lustreclient mnt]# strace -f -p 15964 Process 15964 attached - interrupt to quit lstat(/mnt, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat(/mnt/lustre_mail_fs, 0x7fff4bd4b2b0) = -1 EACCES (Permission denied) stat(/etc/localtime, {st_mode=S_IFREG|0644, st_size=2875, ...}) = 0 ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss