Re: [Lustre-discuss] raid5 patches for rhel5
On Fri, Aug 01, 2008 at 01:51:36PM -0600, Andreas Dilger wrote:

On Aug 01, 2008 09:38 -0400, Robin Humble wrote:
done, and yes, performance is largely the same as RHEL4.

cool!

Version 1.03        --Sequential Output--    --Sequential Input-   --Random-
                    -Per Chr- --Block-- -Rewrite-  -Per Chr- --Block--  --Seeks--
Machine   Size:chnk K/sec %CP K/sec %CP K/sec %CP  K/sec %CP K/sec %CP  /sec %CP
rhel4 oss 16G:256k  84624  99 842138 92 310044 91  77675  99 491239 96  285.8 10
rhel5 oss 16G:256k  86085  99 827731 95 327007 97  79639 100 495487 98  456.2 18

streaming writes are down marginally on rhel5, but seeks/s are up 50%.

Good to know, thanks.

BTW - the above is with 1.6.4.3 clients.

Is this with 1.6.5 servers or 1.6.4.3 servers?

that's with 1.6.5.1 RHEL5 servers. 1.6.5.1 clients still perform badly for us, e.g.:

Have you tried disabling the checksums?  lctl set_param osc.*.checksums=0

yes, checksums were disabled.

Note that 1.6.5 clients with 1.6.5 servers and checksums enabled will perform better than a mixed client/server combination, because 1.6.5 has a more efficient checksum algorithm.

Version 1.03        --Sequential Output--    --Sequential Input-   --Random-
                    -Per Chr- --Block-- -Rewrite-  -Per Chr- --Block--  --Seeks--
Machine   Size:chnk K/sec %CP K/sec %CP K/sec %CP  K/sec %CP K/sec %CP  /sec %CP
          16G:256k  77216  99 462659 100 296050 96 68100  81 648350 93  422.2 13

which shows better streaming writes, but ~1/2 the streaming read speed :-(

You are getting that backward... 55% of the previous write speed, 90% of the previous overwrite speed, and 130% of the previous read speed.

doh! yes, backwards... that was with patchless 2.6.23 clients, BTW.

Note that there are also similar performance improvements for RAID-6.

I can't see the RAID6 patches in the tree for RHEL5... am I missing something?

Sigh, RAID6 patches were ported to RHEL4, but not RHEL5... I've filed bug 16587 about that, but have no idea when it will be completed.

cool - thanks!

cheers,
robin
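For reference, a minimal sketch of checking and disabling the wire checksums discussed above; the set_param line is the one from Andreas's reply, and the get_param form is assumed to be its read-side counterpart on 1.6.5 clients:

    # on a client: show whether OSC wire checksums are currently enabled (1) or disabled (0)
    lctl get_param osc.*.checksums

    # disable them for the current mount; this is per client and not persistent across a remount
    lctl set_param osc.*.checksums=0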
Re: [Lustre-discuss] slow recovery when MDS failed over
On Thu, 2008-08-07 at 12:06 -0400, Brock Palen wrote:

In doing some testing with our new hardware I did the following: I rebooted the active MDS server, and it failed over to the second one as expected. While this was happening a client was reset. When the MDS came up on the new server via heartbeat it went into recovery, as expected. The MDS has now been in recovery for 1.5 hours. I don't think this is normal. What would cause this? I know that having a client go down (the reset above) while the MDS is down, but before recovery starts, will cause recovery to time out, but 1.5 hours is an unacceptable time to wait for the file system to come back. This is a stock 1.6.5.1 install.

Hrm. Can you provide the syslog from the backup MDS from the time it was mounted until present?

b.
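While a recovery like this drags on, the server-side recovery state can be watched directly. A hedged sketch for a 1.6.x MDS follows; the proc path and device number are assumptions, and abort_recovery should only be used once you are sure the missing client is never coming back:

    # on the acting MDS: recovery status, clients recovered vs. expected, time remaining
    cat /proc/fs/lustre/mds/*/recovery_status

    # find the MDT device number, then stop waiting for the dead client
    lctl dl
    lctl --device <MDT device number> abort_recovery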
Re: [Lustre-discuss] Lustre 1.6.5.1 client with kernel 2.6.22.14
On Thu, 2008-08-07 at 00:11 +0300, Dr. David L.H. wrote:

So now we have a diskless client with kernel 2.6.22.14, OFED-1.3 and Lustre 1.6.5.1. We mount Lustre via IB and it looks OK.

Great. I'm glad you found your answer.

b.
Re: [Lustre-discuss] MDS
Cliff White wrote:

Mag Gam wrote:
Also, what is the best way to test the backup, other than really removing my MGS and restoring it? Is there a better way to test this?

If you really care about the backups, you need to be brave. If you can't remove the MDS and restore it, then something is wrong with your backup process. Many people seem to focus on the backup part and ignore the 'restore' bit, so I definitely recommend a live test.

That said, if you can bring up your backup MDT image on a separate node, you could configure that node as a failover MDS - this would require you to tunefs.lustre all the servers, and remount all the clients. Then you can test restore using a 'manual failover' - and once you made the mount changes, you could repeat this test at will, without even halting the filesystem. Also, you would not have to 'remove' your primary MDS, just stop that node. If your MDS _does_ die, the failover config will cause a slightly longer timeout (everybody will retry the alternate) but otherwise won't impact you.

Just to be clear, there is a potential data loss issue due to the time delta between the backup and the live system. Any transactions in flight that miss the snapshot could result in lost data, as the MDS will replay transaction logs and delete orphans on startup. So testing on your live system definitely is for the brave.

cliffw

TIA

On Tue, Aug 5, 2008 at 6:37 PM, Mag Gam [EMAIL PROTECTED] wrote:

Brian: Thanks for the response. I have actually seen this response before and was wondering if my technique would simply work. I guess not. Another question: if I take a snapshot every 10 mins and back it up, and I have a failure at the 15th minute, can I simply restore my MDS from the previous snapshot and be done with it? Of course I will lose my 5 minutes of data, correct?

TIA

On Tue, Aug 5, 2008 at 12:08 PM, Brian J. Murrell [EMAIL PROTECTED] wrote:

On Tue, 2008-08-05 at 01:12 -0400, Mag Gam wrote:
What is a good MGS/MDT backup strategy, if there is one? I was thinking of mounting the MGS/MDT partition on the MDS as ext3 and rsyncing it every 10 mins to another server. Would this work? What would happen if I lose my MDS in the 9th minute - would I still have a good copy? Any thoughts or ideas?

Peter Braam answered a similar question and of course, the answer is in the archives. It was the second Google hit on a search for "lustre mds backup". The answer is at: http://lists.lustre.org/pipermail/lustre-discuss/2006-June/001655.html

Backup of the MDT is also covered in the manual in section 15 at http://manual.lustre.org/manual/LustreManual16_HTML/BackupAndRestore.html#50544703_pgfId-5529

Now, as for mounting the MDT as ext3 (you should actually use ldiskfs, not ext3) every 10 minutes: that means you are going to make your filesystem unavailable every 10 minutes, as you CANNOT mount the MDT partition on more than one machine, and we have not tested multiple mounting on a single machine with any degree of confidence. Of course, Peter's LVM snapshotting technique will allow you to mount snapshots which you can back up as you describe. But if you are going to have a whole separate machine with enough storage to mirror your MDT, why not use something more active like DRBD and have a fully functional active/passive MDT failover strategy? While nobody in the Lustre Group has done any extensive testing of Lustre on DRBD, there have been a number of reports of success with it here on this list.

b.
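The LVM-snapshot technique referenced above looks roughly like the outline below. This is a sketch under assumptions only: the volume group, LV names, snapshot size and mount point are placeholders, and the ldiskfs mount requires the Lustre-patched server kernel:

    # 1. snapshot the MDT logical volume while the filesystem keeps running
    lvcreate -L 1G -s -n mdt-snap /dev/vgmds/mdt

    # 2. mount the snapshot read-only as ldiskfs (not ext3)
    mkdir -p /mnt/mdt-snap
    mount -t ldiskfs -o ro /dev/vgmds/mdt-snap /mnt/mdt-snap

    # 3. save the extended attributes (they carry the striping info) and the file data
    cd /mnt/mdt-snap
    getfattr -R -d -m '.*' -P . > /root/mdt-ea.bak
    tar czf /root/mdt-backup.tgz --sparse .

    # 4. release the snapshot
    cd / && umount /mnt/mdt-snap
    lvremove -f /dev/vgmds/mdt-snap

A restore is roughly the reverse: untar onto a freshly mounted ldiskfs MDT, then setfattr --restore the saved attribute dump before mounting the target as type lustre again.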
Re: [Lustre-discuss] simulations
Mag Gam wrote:

We do a lot of fluid simulations at my university, but on a similar note I would like to know what the Lustre experts would do in particular simulated scenarios... The environment is this: 30 servers (all Linux), 1000+ clients (all Linux). The 30 servers: 1 MDS, 30 OSTs each with 2TB of storage. No failover capabilities.

Scenario 1: Your client is trying to mount the Lustre filesystem using the lustre module, and it hung. Do what?

Answer 0 to all questions: Read the Lustre Manual. File doc bugs in Lustre Bugzilla if there's a part you don't understand, or a part that is missing.

Answer 1 for all your questions: Check syslogs/consoles on the impacted clients. Check syslogs/consoles on _all_ Lustre servers. Pay careful attention to timestamps. Work backwards to the first error. Is the problem restricted to one client or seen by multiple clients? If multiple clients, start with the network; use lctl ping to check Lustre connectivity. If a single client, it's generally a client config/network config issue.

Scenario 2: Your MDS won't mount. It's saying "The server is already running." You try to mount it a couple of times and it still won't.

Be certain the server is not already running. Be certain no hung mount processes exist. Unload all Lustre modules (the lustre_rmmod script will do this). Retry, and see answer 1.

Scenario 3: An OST/OSS reboots due to a power outage. Some files are striped on it, and some aren't. What happens? What to do for minimal outage?

Clients can be mounted with a dead OST using the exclude option to the mount command. lfs getstripe can be run from clients to find files on the bad OST. See answer 0 for the detailed process.

Scenario 4: lctl dl shows some devices in ST state. What does that mean, and how do I clear it?

ST = stopped. Clear this by cleaning up all devices (answer 0) or restarting the stopped devices. This usually indicates an error/issue with the stopped device, so see answer 1.

I know some of these scenarios may be ambiguous, but please let me know which so I can further elaborate. I am eventually planning to wiki this for future reference and other Lustre newbies.

Please contribute to wiki.lustre.org - there is considerable information there already, and a decent existing structure.

If anyone else has any other scenarios, please don't be shy and ask away. We can create a good troubleshooting doc similar to the operations manual.

Again, please file doc bugs at bugzilla.lustre.org and contribute to wiki.lustre.org. Hope this helps!

cliffw

TIA
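For scenario 3, the client-side commands Cliff mentions look roughly like this sketch; the filesystem name 'lustre', the OST index, the MGS NID and the mount point are all placeholder assumptions:

    # mount a client with the dead OST excluded so new I/O doesn't hang on it
    mount -t lustre -o exclude=lustre-OST0002 mgsnode@tcp0:/lustre /mnt/lustre

    # list the files that have objects on the dead OST (run from a client)
    lfs find --obd lustre-OST0002_UUID /mnt/lustre

    # show the stripe layout of a particular file, to see which OSTs it touches
    lfs getstripe /mnt/lustre/path/to/file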
Re: [Lustre-discuss] MDS
On Thu, 2008-08-07 at 10:51 -0700, Cliff White wrote:
Just to be clear, there is a potential data loss issue due to the time delta between the backup and the live system. Any transactions in flight that miss the snapshot could result in lost data, as the MDS will replay transaction logs and delete orphans on startup. So testing on your live system definitely is for the brave.

Indeed. There are a couple of alternatives to consider. I know your production MO will be to take an LVM snapshot of the running MDT and back that up, but if the MDT (i.e. filesystem) were shut down prior to the backup, what you restore should be an identical MDT which you could then start the filesystem against, without the risks of in-flight transactions and orphan deletion. But indeed it is not a 100% reproduction of what would happen restoring from an in-production backup.

Alternatively, rather than trying to start the OSTs against the restored MDT, you could simply do a filesystem-level (i.e. ldiskfs) comparison of the restored MDT against the production MDT. Indeed, there are other variations that you could use to satisfy yourself that the restore worked.

I would highly suggest you do any of this testing either on a testbed (which you could build with a VirtualBox virtual cluster) or on your production system before you put production data on it. It is good system deployment policy to have fully tested backup and restore policies before going live anyway.

b.
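One way to do that ldiskfs-level comparison is sketched below. It assumes the filesystem is stopped (or that the production side is itself an LVM snapshot, per the earlier discussion), that both images can be mounted read-only on one node, and that the device paths are placeholders:

    # mount both copies read-only as ldiskfs
    mount -t ldiskfs -o ro /dev/sdb1 /mnt/mdt-prod
    mount -t ldiskfs -o ro /dev/sdc1 /mnt/mdt-restored

    # dry-run rsync itemizes any file that differs, without copying anything
    rsync -nai --delete /mnt/mdt-prod/ /mnt/mdt-restored/

    # compare extended attributes (striping info); run from inside each mount
    # so the dumped paths are relative and directly diffable
    (cd /mnt/mdt-prod     && getfattr -R -d -m '.*' -P .) > /tmp/ea-prod.txt
    (cd /mnt/mdt-restored && getfattr -R -d -m '.*' -P .) > /tmp/ea-restored.txt
    diff /tmp/ea-prod.txt /tmp/ea-restored.txt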
Re: [Lustre-discuss] sata_mv vs. mv_sata: which is better?
Thanks, I might look into it. Right now the performance of the stock driver that comes with the kernel is more than the 4 x 1 GigE connections we will be using can carry. I am having other issues now with the new filesystem that I did not have with our old one; that will be a new question though.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985

On Aug 7, 2008, at 2:11 PM, Mike Berg wrote:

Brock,

It is recommended that mv_sata is used on the x4500. It has been a while since I have built this up myself, and a few Lustre releases back, but I do understand the pain. I hope that with Lustre 1.6.5.1 on RHEL 4.5 you can just build mv_sata against the provided Lustre kernel, alias it accordingly in modprobe.conf, create a new initrd, then update grub. I don't have gear handy to give it a try, unfortunately. Please let me know your experiences with this if you pursue it.

Enclosed is a somewhat dated document on what we have found to be the best configuration of the x4500 for use with Lustre. Ignore the N1SM parts. We optimized for performance and RAS with some sacrifices on capacity. Hopefully this is a useful reference.

Regards,
Mike Berg
Sr. Lustre Solutions Engineer
Sun Microsystems, Inc.
Office/Fax: (303) 547-3491
E-mail: [EMAIL PROTECTED]

X4500-preparation.pdf

On Aug 6, 2008, at 1:48 PM, Brock Palen wrote:

Is it still worth the effort to try and build mv_sata when working with an x4500? sata_mv from RHEL4 does not appear to show some of the stability problems discussed online before. I am curious because the build system Sun provides with the driver does not play nicely with the Lustre kernel source packaging. If it is worth all the pain, have others already figured it out? I would be grateful for any help.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985
Re: [Lustre-discuss] sata_mv vs. mv_sata: which is better?
Yes Brock -- as Mike mentioned, we also took this doc and provided it to our TACC customer (http://www.tacc.utexas.edu/resources/hpcsystems/), where we put 72 x4500s in place with this configuration. In addition, Sun's recent Linux HPC Software (http://www.sun.com/software/products/hpcsoftware/index.xml) includes the mv_sata driver and the SW configurations needed to put the x4500 together as an OSS that one can build upon and further configure, with the SW RAID patches also included.

HTH

Mike Berg wrote on 08/07/08 11:11:

Brock,

It is recommended that mv_sata is used on the x4500. It has been a while since I have built this up myself, and a few Lustre releases back, but I do understand the pain. I hope that with Lustre 1.6.5.1 on RHEL 4.5 you can just build mv_sata against the provided Lustre kernel, alias it accordingly in modprobe.conf, create a new initrd, then update grub. I don't have gear handy to give it a try, unfortunately. Please let me know your experiences with this if you pursue it.

Enclosed is a somewhat dated document on what we have found to be the best configuration of the x4500 for use with Lustre. Ignore the N1SM parts. We optimized for performance and RAS with some sacrifices on capacity. Hopefully this is a useful reference.

Regards,
Mike Berg
Sr. Lustre Solutions Engineer
Sun Microsystems, Inc.
Office/Fax: (303) 547-3491
E-mail: [EMAIL PROTECTED]

On Aug 6, 2008, at 1:48 PM, Brock Palen wrote:

Is it still worth the effort to try and build mv_sata when working with an x4500? sata_mv from RHEL4 does not appear to show some of the stability problems discussed online before. I am curious because the build system Sun provides with the driver does not play nicely with the Lustre kernel source packaging. If it is worth all the pain, have others already figured it out? I would be grateful for any help.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985
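Spelled out, the sequence Mike describes would look something like the sketch below on a RHEL 4 OSS running the Lustre 1.6.5.1 kernel. Treat it as an outline only: the kernel version placeholder, the driver source directory and its Makefile variable are assumptions that vary with the mv_sata release you download.

    # build mv_sata against the installed Lustre kernel source and install it
    cd /usr/src/mv_sata-<version>
    make KERNEL_SRC=/usr/src/kernels/<lustre-kernel-version>   # variable name varies by driver release
    cp mv_sata.ko /lib/modules/<lustre-kernel-version>/kernel/drivers/scsi/
    depmod -a <lustre-kernel-version>

    # point the initrd at mv_sata instead of the stock sata_mv driver,
    # in /etc/modprobe.conf:
    #   alias scsi_hostadapter mv_sata

    # rebuild the initrd for that kernel, then update its initrd line in /boot/grub/grub.conf
    mkinitrd -f /boot/initrd-<lustre-kernel-version>.img <lustre-kernel-version>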
[Lustre-discuss] operation 400 on unconnected MGS
The problem I was referring to: with the new filesystem we just created I am seeing the following problem. Clients lose their connection to the MGS and the MGS says it evicted them; the machines are on the same network and there are no errors on the interfaces. The MGS says:

Lustre: MGS: haven't heard from client e8eb1779-5cea-9cc7-b5ae-4c5ccf54f5ca (at [EMAIL PROTECTED]) in 240 seconds. I think it's dead, and I am evicting it.
LustreError: 9103:0:(mgs_handler.c:538:mgs_handle()) lustre_mgs: operation 400 on unconnected MGS
LustreError: 9103:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x24929/t0 o400-?@?: 0/0 lens 128/0 e 0 to 0 dl 1218142953 ref 1 fl Interpret:/0/0 rc -107/0

The "operation 400 on unconnected MGS" is the only new message I am not familiar with. Once the client loses its connection with the MGS I see the OSTs start booting the client as well. Servers are 1.6.5.1; clients are patchless 1.6.4.1 on RHEL4. Any insight would be great.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985
Re: [Lustre-discuss] operation 400 on unconnected MGS
On Thu, 2008-08-07 at 17:10 -0400, Brock Palen wrote:

The problem I was referring to: with the new filesystem we just created I am seeing the following problem. Clients lose their connection to the MGS and the MGS says it evicted them; the machines are on the same network and there are no errors on the interfaces. The MGS says:

Lustre: MGS: haven't heard from client e8eb1779-5cea-9cc7-b5ae-4c5ccf54f5ca (at [EMAIL PROTECTED]) in 240 seconds. I think it's dead, and I am evicting it.
LustreError: 9103:0:(mgs_handler.c:538:mgs_handle()) lustre_mgs: operation 400 on unconnected MGS
LustreError: 9103:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x24929/t0 o400-?@?: 0/0 lens 128/0 e 0 to 0 dl 1218142953 ref 1 fl Interpret:/0/0 rc -107/0

Do you have any messages on the client that correlate? Please use timestamps in syslogs from machines that are timesync'd to show the correlating MGS eviction and client messages around the same timeframe.

b.
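A quick way to pull together the correlated evidence Brian is asking for (a sketch only; the log locations and the timestamp window are assumptions, and both machines should be running NTP so the timestamps actually line up):

    # on the MGS: the eviction and any operation-400 errors, with their timestamps
    grep -E "haven't heard from client|operation 400" /var/log/messages

    # on the evicted client: all Lustre messages from the same time window
    grep -E "Lustre|LustreError" /var/log/messages | grep "Aug  7 17:"

    # basic LNET connectivity check from the client to the MGS NID
    lctl ping <MGS NID, e.g. 10.0.0.1@tcp0>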
Re: [Lustre-discuss] MDS
Very nice insight. Thanks Brian, Cliff, and Phil!

On Thu, Aug 7, 2008 at 11:14 AM, Brian J. Murrell [EMAIL PROTECTED] wrote:

On Thu, 2008-08-07 at 10:51 -0700, Cliff White wrote:
Just to be clear, there is a potential data loss issue due to the time delta between the backup and the live system. Any transactions in flight that miss the snapshot could result in lost data, as the MDS will replay transaction logs and delete orphans on startup. So testing on your live system definitely is for the brave.

Indeed. There are a couple of alternatives to consider. I know your production MO will be to take an LVM snapshot of the running MDT and back that up, but if the MDT (i.e. filesystem) were shut down prior to the backup, what you restore should be an identical MDT which you could then start the filesystem against, without the risks of in-flight transactions and orphan deletion. But indeed it is not a 100% reproduction of what would happen restoring from an in-production backup.

Alternatively, rather than trying to start the OSTs against the restored MDT, you could simply do a filesystem-level (i.e. ldiskfs) comparison of the restored MDT against the production MDT. Indeed, there are other variations that you could use to satisfy yourself that the restore worked.

I would highly suggest you do any of this testing either on a testbed (which you could build with a VirtualBox virtual cluster) or on your production system before you put production data on it. It is good system deployment policy to have fully tested backup and restore policies before going live anyway.

b.