Re: [Lustre-discuss] raid5 patches for rhel5

2008-08-07 Thread Robin Humble
On Fri, Aug 01, 2008 at 01:51:36PM -0600, Andreas Dilger wrote:
On Aug 01, 2008  09:38 -0400, Robin Humble wrote:
 done, and yes, performance is largely the same as RHEL4. cool!
 
 Version  1.03        --Sequential Output-- --Sequential Input- --Random-
                      -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
 Machine   Size:chnk  K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
 rhel4 oss  16G:256k  84624  99 842138 92 310044 91 77675  99 491239 96 285.8  10
 rhel5 oss  16G:256k  86085  99 827731 95 327007 97 79639 100 495487 98 456.2  18
 
 streaming writes are down marginally on rhel5, but seeks/s are up 50%.
Good to know, thanks.

 BTW - the above is with 1.6.4.3 clients.
Is this with 1.6.5 servers or 1.6.4.3 servers?

that's with 1.6.5.1 RHEL5 servers.

 1.6.5.1 clients still perform badly for us, e.g.
Have you tried disabling the checksums?
   lctl set_param osc.*.checksums=0

yes, checksums were disabled.

Note that 1.6.5 clients talking to 1.6.5 servers with checksums enabled will
perform better than a mixed client/server combination, because 1.6.5 has a
more efficient checksum algorithm.
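
(For reference, a quick way to check and flip this on a 1.6.5 client; the
get_param call reads the same per-OSC tunable that set_param writes.)

   lctl get_param osc.*.checksums        # 1 = checksums enabled, 0 = disabled
   lctl set_param osc.*.checksums=0      # turn checksums off on this client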

 Version  1.03        --Sequential Output-- --Sequential Input- --Random-
                      -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
 Machine   Size:chnk  K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
            16G:256k  77216  99 462659 100 296050  96 68100  81 648350  93 422.2  13
 
 which shows better streaming writes, but ~1/2 the streaming read speed :-(
You are getting that backward... 55% of the previous write speed,
90% of the previous overwrite speed, and 130% of the previous read speed.

doh! yes, backwards...
those were patchless 2.6.23 clients, BTW.
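
(For reference, bonnie++ output in the format above generally comes from an
invocation along these lines; the target directory is a placeholder and the
16g:256k argument matches the Size:chnk column.)

   bonnie++ -d /mnt/lustre/bench -s 16g:256k -u root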

  Note that there are also similar
 performance improvements for RAID-6.
 I can't see the RAID6 patches in the tree for RHEL5... am I missing
 something?
Sigh, RAID6 patches were ported to RHEL4, but not RHEL5...  I've filed
bug 16587 about that, but have no idea when it will be completed.

cool - thanks!

cheers,
robin


Re: [Lustre-discuss] slow recovery when MDS failed over

2008-08-07 Thread Brian J. Murrell
On Thu, 2008-08-07 at 12:06 -0400, Brock Palen wrote:
 In doing some testing with our new hardware I did the following:
 
 I rebooted the active MDS server, and it failed over to the second one as
 expected.  While this was happening a client was reset.
 
 When the MDS came up on the new server via Heartbeat it went into
 recovery as expected.  The MDS has now been in recovery for 1.5
 hours.  I don't think this is normal.
 
 What would cause this?  I know that having a client go down (the reset
 above) while the MDS is down, but before recovery starts, will cause
 recovery to time out, but 1.5 hours is an unacceptable time to wait for
 the file system to come back.
 
 This is a stock 1.6.5.1 install.

Hrm.  Can you provide the syslog from the backup MDS from the time it
was mounted until present?
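
(In addition to the syslog, recovery progress is visible in the per-target
recovery_status proc file on the MDS; the target name below is a placeholder.)

   cat /proc/fs/lustre/mds/<fsname>-MDT0000/recovery_status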

b.





Re: [Lustre-discuss] Lustre 1.6.5.1 client with kernel 2.6.22.14

2008-08-07 Thread Brian J. Murrell
On Thu, 2008-08-07 at 00:11 +0300, Dr. David L.H. wrote:

 So now we have a diskless client with kernel 2.6.22.14, OFED-1.3 and Lustre 1.6.5.1.
 We mount Lustre via IB and it looks OK.

Great.  I'm glad you found your answer.

b.





Re: [Lustre-discuss] MDS

2008-08-07 Thread Cliff White
Cliff White wrote:
 Mag Gam wrote:
 Also, what is the best way to test the backup? Other than really
 removing my MGS and restoring it, is there a better way to test this?
 
 If you really care about the backups, you need to be brave. If you can't
 remove the MDS and restore it, then something is wrong with your backup
 process. Many people seem to focus on the backup part and ignore the 
 'restore' bit, so I definitely recommend a live test.
 
 That said, if you can bring up your backup MDT image on a separate node, 
 you could configure that node as a failover MDS - this would require you 
 to tunefs.lustre all the servers, and remount all the clients. Then you 
 can test restore using a 'manual failover' - and once you made the mount 
 changes, you could repeat this test at will, without even halting the 
 filesystem. Also, you would not have to 'remove' your primary MDS, just 
 stop that node.
 
 If your MDS _does_ die, the failover config will cause a slightly longer 
 timeout (everybody will retry the alternate) but otherwise won't impact 
 you.

Just to be clear, there is a potential data loss issue due to the time 
delta between the backup and the live system. Any transactions in play
that miss the snapshot could result in lost data, as the MDS will replay 
transaction logs and delete orphans on startup. So testing on your live 
system definitely is for the brave.
cliffw
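
(A rough sketch of the failover reconfiguration Cliff describes above, with
placeholder device names, NIDs and file system name; the OSTs would also need
the backup MGS NID added via tunefs.lustre --mgsnode before clients remount.)

   # on the MDT device, register the backup node as a failover partner:
   tunefs.lustre --failnode=mds2@tcp0 /dev/mdtdev

   # clients remount listing both MGS NIDs, so they retry the alternate:
   mount -t lustre mds1@tcp0:mds2@tcp0:/testfs /mnt/testfs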

 
 cliffw

 TIA


 On Tue, Aug 5, 2008 at 6:37 PM, Mag Gam [EMAIL PROTECTED] wrote:
 Brian:

 Thanks for the response. I had actually seen this response before and was
 wondering if my technique would simply work. I guess not.

 I guess another question would be: if I take a snapshot every 10 minutes
 and back it up, and I have a failure at the 15th minute, can I simply
 restore my MDS from the previous snapshot and be done with it? Of course I
 will lose 5 minutes of data, correct?

 TIA


 On Tue, Aug 5, 2008 at 12:08 PM, Brian J. Murrell 
 [EMAIL PROTECTED] wrote:
 On Tue, 2008-08-05 at 01:12 -0400, Mag Gam wrote:
 What is a good MGS/MDT backup strategy if there is one?

 I was thinking of mounting the MGS/MDT partition on the MDS as ext3
 and rsyncing it every 10 minutes to another server. Would this work? What
 would happen if I lose my MDS in the 9th minute? Would I still be able to
 have a good copy? Any thoughts or ideas?
 Peter Braam answered a similar question and of course, the answer is in
 the archives.  It was the second google hit on a search for lustre mds
 backup.  The answer is at:

 http://lists.lustre.org/pipermail/lustre-discuss/2006-June/001655.html

 Backup of the MDT is also covered in the manual in section 15 at

 http://manual.lustre.org/manual/LustreManual16_HTML/BackupAndRestore.html#50544703_pgfId-5529
  


 Now, as for mounting the MDT as ext3 (you should actually use ldiskfs,
 not ext3) every 10 minutes: that means you are going to make your
 filesystem unavailable every 10 minutes, as you CANNOT mount the MDT
 partition on more than one machine, and we have not tested multiple
 mounts on a single machine with any degree of confidence.

 Of course Peter's LVM snapshotting technique will allow you to mount
 snapshots which you can backup as you describe.
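
 (A minimal sketch of that snapshot-then-backup sequence, assuming the MDT
 sits on an LVM volume; volume names, snapshot size and backup paths are
 placeholders, and the getfattr step saves the extended attributes that hold
 the striping information.)

    lvcreate -L 1G -s -n mdt_snap /dev/vg00/mdt     # snapshot the live MDT
    mount -t ldiskfs /dev/vg00/mdt_snap /mnt/mdt_snap
    cd /mnt/mdt_snap
    getfattr -R -d -m '.*' -P . > /backup/ea.bak    # save EAs (stripe metadata)
    tar --sparse -czf /backup/mdt-backup.tgz .
    cd /
    umount /mnt/mdt_snap
    lvremove -f /dev/vg00/mdt_snap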

 But if you are going to have a whole separate machine with enough
 storage to mirror your MDT why not use something more active like DRBD
 and have a fully functional active/passive MDT failover strategy?  
 While
 nobody in the Lustre Group has done any extensive testing of Lustre on
 DRBD, there have been a number of reports of success with it here on
 this list.

 b.




Re: [Lustre-discuss] simulations

2008-08-07 Thread Cliff White
Mag Gam wrote:
 We do a lot of fluid simulations at my university, but on a similar
 note I would like to know what the Lustre experts would do in
 particular simulated scenarios...
 
 The environment is this:
 30 Servers (All Linux)
 1000+ Clients (All Linux)
 
 30 Servers
 1 MDS
 30 OSTs each with 2TB of storage
 
 No fail over capabilities.
 
 
 Scenario 1:
 Your client is trying to mount the Lustre filesystem using the lustre
 module, and it hung. What do you do?
Answer 0 to all questions:
Read the Lustre Manual. File doc bugs in Lustre Bugzilla if there's a 
part you don't understand or a part is missing.

Answer 1 for all your questions.
Check syslogs/consoles on the impacted clients.
Check syslogs/consoles on _all_ Lustre servers.
Pay careful attention to timestamps.
Work backwards to the first error.

Is the problem restricted to one client or seen by multiple clients?
If multiple clients, start with the network, use lctl ping to check 
lustre connectivity.
If a single client, it's generally a client config/network config issue.
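
(For example, with placeholder addresses:)

   lctl list_nids                # on the client: which NIDs does this node use?
   lctl ping 10.0.0.1@tcp0       # NID of the MGS/MDS or an OSS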
 
 Scenario 2:
 Your MDS won't mount. It says "The server is already running."
 You try to mount it a couple of times and it still won't.

Be certain the server is not already running.
Be certain no hung mount processes exist.
Unload all Lustre modules (the lustre_rmmod script will do this).
Retry, and see answer 1.
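
(A sketch of that cleanup sequence; device and mount point are placeholders.)

   ps ax | grep mount.lustre       # any hung mount processes?
   lustre_rmmod                    # unload all Lustre/LNET modules
   lsmod | grep -E 'lustre|lnet'   # confirm nothing is left loaded
   mount -t lustre /dev/mdtdev /mnt/mdt    # then retry the mount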

 
 Scenario 3:
 An OST/OSS reboots due to a power outage. Some files are striped on it,
 and some aren't. What happens? What do you do for minimal outage?

- Clients can be mounted with a dead OST using the exclude option to 
the mount command. lfs getstripe can be run from clients to find files
on the bad OST. See answer 0 for the detailed process.
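
(For illustration, with a placeholder file system name "testfs" and a dead
target OST0003; lfs find --obd, where your release supports it, is often more
convenient than walking the getstripe output by hand.)

   mount -t lustre -o exclude=testfs-OST0003 mgs@tcp0:/testfs /mnt/testfs
   lfs find --obd testfs-OST0003_UUID /mnt/testfs   # files with objects on the dead OST
   lfs getstripe /mnt/testfs/somefile               # inspect one file's stripe map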
 
 Scenario 4:
 lctl dl shows some devices in ST state. What does that mean, and how
 do I clear it?

ST = stopped.
Clear this by cleaning up all devices (answer 0)
or restarting the stopped devices.
Usually indicates an error/issue with the stopped device, so see
answer 1.
 
 
 I know some of these scenarios may be ambiguous, but please let me
 know which so I can further elaborate. I am eventually planning to
 wiki this for future reference and other lustre newbies.

Please contribute to wiki.lustre.org - there is considerable information 
there already, and a decent existing structure.
 
 If anyone else has any other scenarios, please don't be shy and ask
 away. We can create a good troubleshooting doc similar to the
 operations manual.

Again, please file doc bugs at bugzilla.lustre.org and contribute to 
wiki.lustre.org. Hope this helps!
cliffw

 
 
 TIA


Re: [Lustre-discuss] MDS

2008-08-07 Thread Brian J. Murrell
On Thu, 2008-08-07 at 10:51 -0700, Cliff White wrote:
 
 Just to be clear, there is a potential data loss issue due to the time 
 delta between the backup and the live system. Any transactions in play
 that miss the snapshot could result in lost data, as the MDS will replay 
 transaction logs and delete orphans on startup. So testing on your live 
 system definitely is for the brave.

Indeed.  There are a couple of alternatives to consider.  I know your
production MO will be to take an LVM snapshot of the running MDT and
back that up, but if the MDT (i.e. filesystem) were shut down prior to
the backup, what you restore should be an identical MDT which you could
then start the filesystem against without the risks of in-play
transactions and orphan deletion.  But indeed it is not a 100%
reproduction of what would happen restoring from an in-production
backup.

Alternatively, rather than trying to start the OSTs against the restored
MDT you could simply do a filesystem level (i.e. ldiskfs) comparison of
the restored MDT against the production MDT.
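
(One way to do that comparison, as a sketch: mount both copies read-only as
ldiskfs and diff the file lists and extended attributes. Device names and
mount points are placeholders, and the "live" side should really be a snapshot
rather than the in-use MDT.)

   mount -t ldiskfs -o ro /dev/restored_mdt /mnt/mdt_restored
   mount -t ldiskfs -o ro /dev/vg00/mdt_snap /mnt/mdt_live
   diff <(cd /mnt/mdt_restored && find . | sort) \
        <(cd /mnt/mdt_live && find . | sort)
   diff <(cd /mnt/mdt_restored && getfattr -R -d -m '.*' -P .) \
        <(cd /mnt/mdt_live && getfattr -R -d -m '.*' -P .)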

Indeed, there are other variations that you could use to satisfy
yourself that the restore worked.

I would highly suggest you do any of this testing either on a testbed
(which you could build with a VirtualBox virtual cluster) or on your
production system before you put production data on it.  It is good
system deployment policy to have fully tested backup and restore
policies before going live anyway.

b.





Re: [Lustre-discuss] sata_mv or mv_sata: which is better?

2008-08-07 Thread Brock Palen
Thanks, I might look into it.  Right now the performance of the stock
driver that comes with the kernel is more than enough for the 4 x 1 GigE
connections we will be using.

I am having other issues now with the new filesystem that I did not
have with our old one; that will be a new question, though.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985



On Aug 7, 2008, at 2:11 PM, Mike Berg wrote:
 Brock,

 It is recommended that mv_sata is used on the x4500.

 It has been a while since I have built this up myself, and a few
 Lustre releases back, but I do understand the pain. I hope that with
 Lustre 1.6.5.1 on RHEL 4.5 you can just build mv_sata against the
 provided Lustre kernel, alias it accordingly in modprobe.conf,
 create a new initrd, then update grub. I don't have gear handy
 to give it a try, unfortunately. Please let me know your experiences
 with this if you pursue it.
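
 (A rough sketch of the modprobe.conf/initrd step on RHEL 4, assuming
 mv_sata.ko has already been built and installed for the running Lustre
 kernel; the image name is a placeholder and the grub entry still has to be
 pointed at the new initrd by hand.)

    echo "alias scsi_hostadapter mv_sata" >> /etc/modprobe.conf
    depmod -a
    mkinitrd -f /boot/initrd-$(uname -r)-mvsata.img $(uname -r)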

 Enclosed is a somewhat dated document on what we have found to be  
 the best configuration of the x4500 for use with Lustre. Ignore the  
 N1SM parts. We optimized for performance and RAS with some  
 sacrifices on capacity. Hopefully this is a useful reference.


 Regards,
 Mike Berg
 Sr. Lustre Solutions Engineer
 Sun Microsystems, Inc.
 Office/Fax: (303) 547-3491
 E-mail:  [EMAIL PROTECTED]


 X4500-preparation.pdf

 On Aug 6, 2008, at 1:48 PM, Brock Palen wrote:

 Is it still worth the effort to try and build mv_sata when working
 with an x4500?
 sata_mv from RHEL4 does not appear to show some of the stability
 problems discussed online before.

 I am curious because the build system Sun provides with the driver
 does not play nicely with the Lustre kernel source packaging.

 If it is worth all the pain, have others already figured it out?
 Any help would be appreciated.


 Brock Palen
 www.umich.edu/~brockp
 Center for Advanced Computing
 [EMAIL PROTECTED]
 (734)936-1985





Re: [Lustre-discuss] sata_mv or mv_sata: which is better?

2008-08-07 Thread Larry McIntosh

Yes Brock -- as Mike has mentioned, we also took this doc and provided
it for our TACC customer:

http://www.tacc.utexas.edu/resources/hpcsystems/

where we put 72 x4500s in place with this configuration.

In addition, Sun's recent Linux HPC Software

http://www.sun.com/software/products/hpcsoftware/index.xml

has the mv_sata driver and the SW configurations needed to put the
x4500 together as an OSS, which one can build upon and further configure
with the SW RAID patches that are also included.

HTH



[Lustre-discuss] operation 400 on unconnected MGS

2008-08-07 Thread Brock Palen
The problem I was referring to:

With the new filesystem we just created I am getting the following
problem: clients lose connection to the MGS and the MGS says it evicted
them.  The machines are on the same network and there are no errors on
the interfaces.  The MGS says:

Lustre: MGS: haven't heard from client e8eb1779-5cea-9cc7-b5ae-4c5ccf54f5ca (at [EMAIL PROTECTED]) in 240 seconds. I think it's dead, and I am evicting it.
LustreError: 9103:0:(mgs_handler.c:538:mgs_handle()) lustre_mgs: operation 400 on unconnected MGS
LustreError: 9103:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-107)  [EMAIL PROTECTED] x24929/t0 o400-?@?: 0/0 lens 128/0 e 0 to 0 dl 1218142953 ref 1 fl Interpret:/0/0 rc -107/0


The "operation 400 on unconnected MGS" is the only new message I am
not familiar with.  Once the client loses connection with the MGS I
will see the OSTs start booting the client also.


Servers are 1.6.5.1; clients are patchless 1.6.4.1 on RHEL4.

Any insight would be great.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985





Re: [Lustre-discuss] operation 400 on unconnected MGS

2008-08-07 Thread Brian J. Murrell
On Thu, 2008-08-07 at 17:10 -0400, Brock Palen wrote:
 The problem I was referring to:
 
 With the new filesystem we just created I am getting the following
 problem: clients lose connection to the MGS and the MGS says it evicted
 them.  The machines are on the same network and there are no errors on
 the interfaces.  The MGS says:
 
 Lustre: MGS: haven't heard from client e8eb1779-5cea-9cc7-b5ae-4c5ccf54f5ca (at [EMAIL PROTECTED]) in 240 seconds. I think it's dead, and I am evicting it.
 LustreError: 9103:0:(mgs_handler.c:538:mgs_handle()) lustre_mgs: operation 400 on unconnected MGS
 LustreError: 9103:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-107)  [EMAIL PROTECTED] x24929/t0 o400-?@?: 0/0 lens 128/0 e 0 to 0 dl 1218142953 ref 1 fl Interpret:/0/0 rc -107/0

Do you have any messages on the client that correlate?  Please use
timestamps in syslogs from machines that are timesync'd to show the
correlating MGS eviction and client messages around the same timeframe.
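
(For example, assuming both nodes log to /var/log/messages and the eviction
happened in the 17:00 hour on Aug 7; the timestamp pattern is a placeholder,
and the ntpq check simply confirms the clocks really do agree.)

   ntpq -p                                              # on both nodes
   grep 'Aug  7 17:' /var/log/messages | grep -iE 'lustre|lnet'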

b.





Re: [Lustre-discuss] MDS

2008-08-07 Thread Mag Gam
Very nice insight.

Thanks Brian, Cliff, and Phil!


