Re: [Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues

2008-10-13 Thread Brock Palen
I know you say the only addition was the RDAC for the MDS's I assume  
(we use it also just fine).

When I ran faultmond from Sun's DCMU rpm (RHEL 4 here), the x4500s
would crash like clockwork after ~48 hours. For a very simple bit of code,
I was surprised that this would happen; I noticed it once when I forgot to
turn it on while working on the load. Just FYI, it was unrelated to Lustre
(using the provided rpms, no kernel build). Turning faultmond off solved my
problem on the x4500.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985



On Oct 13, 2008, at 4:41 AM, Malcolm Cowe wrote:

 The X4200m2 MDS systems and the X4500 OSS were rebuilt using the  
 stock Lustre packages (Kernel + modules + userspace). With the  
 exception of the RDAC kernel module, no additional software was  
 applied to the systems. We recreated our volumes and ran the  
 servers over the weekend. However, the OSS crashed about 8 hours  
 in. The syslog output is attached to this message.

 Looks like it could be similar to bug #16404, which means patching  
 and rebuilding the kernel. Given my lack of success at trying to  
 build from source, I am again asking for some guidance on how to do  
 this. I sent out the steps I used to try and build from source on  
 the 7th because I was encountering problems and was unable to get a  
 working set of packages. Included in that message was output from
 quilt that implies that the kernel patching process was not working  
 properly.


 Regards,

 Malcolm.

 -- 
 Malcolm Cowe
 Solutions Integration Engineer

 Sun Microsystems, Inc.
 Blackness Road
 Linlithgow, West Lothian EH49 7LR UK
 Phone: x73602 / +44 1506 673 602
 Email: [EMAIL PROTECTED]
 Oct 10 06:49:39 oss-1 kernel: LDISKFS FS on md15, internal journal
 Oct 10 06:49:39 oss-1 kernel: LDISKFS-fs: mounted filesystem with  
 ordered data mode.
 Oct 10 06:53:42 oss-1 kernel: kjournald starting.  Commit interval  
 5 seconds
 Oct 10 06:53:42 oss-1 kernel: LDISKFS FS on md16, internal journal
 Oct 10 06:53:42 oss-1 kernel: LDISKFS-fs: mounted filesystem with  
 ordered data mode.
 Oct 10 06:57:49 oss-1 kernel: kjournald starting.  Commit interval  
 5 seconds
 Oct 10 06:57:49 oss-1 kernel: LDISKFS FS on md17, internal journal
 Oct 10 06:57:49 oss-1 kernel: LDISKFS-fs: mounted filesystem with  
 ordered data mode.
 Oct 10 07:44:55 oss-1 faultmond: 16:Polling all 48 slots for drive  
 fault
 Oct 10 07:45:00 oss-1 faultmond: Polling cycle 16 is complete
 Oct 10 07:56:23 oss-1 kernel: Lustre: OBD class driver,  
 [EMAIL PROTECTED]
 Oct 10 07:56:23 oss-LDISKFS-fs: file extents enabled1 kernel:
   Lustre VersionLDISKFS-fs: mballoc enabled
 : 1.6.5.1
 Oct 10 07:56:23 oss-1 kernel: Build Version:  
 1.6.5.1-1969123119-PRISTINE-.cache.OLDRPMS.20080618230526.linux- 
 smp-2.6.9-67.0.7.EL_lustre.1.6.5.1.x86_64-2.6.9-67.0.7.EL_lustre. 
 1.6.5.1smp
 Oct 10 07:56:24 oss-1 kernel: Lustre: Added LNI [EMAIL PROTECTED]  
 [8/64]
 Oct 10 07:56:24 oss-1 kernel: Lustre: Lustre Client File System;  
 [EMAIL PROTECTED]
 Oct 10 07:56:24 oss-1 kernel: kjournald starting.  Commit interval  
 5 seconds
 Oct 10 07:56:24 oss-1 kernel: LDISKFS FS on md11, external journal  
 on md21
 Oct 10 07:56:24 oss-1 kernel: LDISKFS-fs: mounted filesystem with  
 journal data mode.
 Oct 10 07:56:24 oss-1 kernel: kjournald starting.  Commit interval  
 5 seconds
 Oct 10 07:56:24 oss-1 kernel: LDISKFS FS on md11, external journal  
 on md21
 Oct 10 07:56:24 oss-1 kernel: LDISKFS-fs: mounted filesystem with  
 journal data mode.
 Oct 10 07:56:24 oss-1 kernel: LDISKFS-fs: file extents enabled
 Oct 10 07:56:24 oss-1 kernel: LDISKFS-fs: mballoc enabled
 Lustre: Request x1 sent from [EMAIL PROTECTED] to NID  
 [EMAIL PROTECTED] 5s ago has timed out (limit 5s).
 Oct 10 07:56:30 oss-1 kernel: Lustre: Request x1 sent from  
 [EMAIL PROTECTED] to NID [EMAIL PROTECTED] 5s ago has timed  
 out (limit 5s).
 LustreError: 4685:0:(events.c:55:request_out_callback()) @@@ type  
 4, status -113  [EMAIL PROTECTED] x3/t0 o250- 
 [EMAIL PROTECTED]@o2ib_1:26/25 lens 240/400 e 0 to 5 dl  
 1223621815 ref 2 fl Rpc:/0/0 rc 0/0
 Lustre: Request x3 sent from [EMAIL PROTECTED] to NID  
 [EMAIL PROTECTED] 0s ago has timed out (limit 5s).
 LustreError: 18125:0:(obd_mount.c:1062:server_start_targets())  
 Required registration failed for lfs01-OST: -5
 LustreError: 15f-b: Communication error with the MGS.  Is the MGS  
 running?
 LustreError: 18125:0:(obd_mount.c:1597:server_fill_super()) Unable  
 to start targets: -5
 LustreError: 18125:0:(obd_mount.c:1382:server_put_super()) no obd  
 lfs01-OST
 LustreError: 18125:0:(obd_mount.c:119:server_deregister_mount())  
 lfs01-OST not registered
 LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)
 LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0  
 breaks, 0 lost
 LDISKFS-fs: mballoc: 0 generated and it took 0
 LDISKFS-fs: mballoc: 0 preallocated, 0 discarded
 Oct 10 07:56:50 oss-1 

Re: [Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues

2008-10-13 Thread Brock Palen
I never uninstalled it (I still use some of the tools in it).
Faultmond is a service; just chkconfig it off.
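
As a minimal sketch of that advice, assuming the init script is registered under the name faultmond (not confirmed in this thread), the following prints the two commands by default and only runs them when DRY_RUN=0 is set. The run and disable_faultmond helpers are made up for illustration.

```sh
# Dry-run wrapper: prints the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Service name "faultmond" is an assumption based on the daemon name.
disable_faultmond() {
    run service faultmond stop     # stop the running daemon
    run chkconfig faultmond off    # keep it from starting at boot
}

disable_faultmond
```

The DCMU rpm itself stays installed, so the other tools in it remain available.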

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985



On Oct 13, 2008, at 11:03 AM, Malcolm Cowe wrote:

 Brock Palen wrote:

  I know you say the only addition was the RDAC for the MDS's I
  assume (we use it also just fine).

 Yes, the MDS's share a STK 6140.

  When I ran faultmond from Sun's DCMU rpm (RHEL 4 here) the x4500's
  would crash like clockwork ~48 hours. For a very simple bit of
  code I was surprised that once when I forgot to turn it on when
  working on the load this would happen. Just FYI it was unrelated
  to lustre (using provided rpm's no kernel build) this solved my
  problem on the x4500

 The DCMU RPM is installed. I didn't explicitly install this, so it
 must have been bundled in with the SIA CD... I'll try removing the
 rpm to see what happens. Thanks for the heads up.

 Regards,

 Malcolm.


Re: [Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues

2008-10-07 Thread Andreas Dilger
On Oct 06, 2008  10:59 -0400, Brian J. Murrell wrote:
 On Mon, 2008-10-06 at 15:47 +0100, Malcolm Cowe wrote:
  With respect to the OFED stack used, we are using the latest official
  software stack supplied by Voltaire. The reason for this is that there
  is more to OFED than just the kernel modules, including many libraries
  and tools,
 
 None of these should be necessary for Lustre to use I/B.

Also very important to note is that if you are changing the IB stack,
then Lustre also needs to be recompiled to work with the new IB stack.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues

2008-10-07 Thread Malcolm Cowe





Andreas Dilger wrote:

 On Oct 06, 2008  10:59 -0400, Brian J. Murrell wrote:
  On Mon, 2008-10-06 at 15:47 +0100, Malcolm Cowe wrote:
   With respect to the OFED stack used, we are using the latest official
   software stack supplied by Voltaire. The reason for this is that there
   is more to OFED than just the kernel modules, including many libraries
   and tools,

  None of these should be necessary for Lustre to use I/B.

 Also very important to note is that if you are changing the IB stack,
 then Lustre also needs to be recompiled to work with the new IB stack.

Yes. As a matter of fact, you have anticipated a question I have: how
does one re-build Lustre in a safe and consistent manner? I'm working
through the docs, but I have come across a problem when I try to run
"make rpms" in the Lustre source:

make[4]: *** No rule to make target
`/usr/src/redhat/BUILD/lustre-1.6.5.1/ldiskfs/Module.symvers', needed
by `Module.symvers'. Stop.

How do I ensure that the build environment that Lustre requires is
properly prepared? I could just hoick a soft link to the Module.symvers
file in the kernel tree, but that's a little messy.

I've attached a draft copy of the build process to this message. Again,
just looking to sanity check the method, since I'm obviously missing
something.

I'm going to rebuild the servers today so that I can provide the debug
information that Brian requested.


Regards,

Malcolm.

-- 
Malcolm Cowe
Solutions Integration Engineer

Sun Microsystems, Inc.
Blackness Road
Linlithgow, West Lothian EH49 7LR UK
Phone: x73602 / +44 1506 673 602
Email: [EMAIL PROTECTED]



Re: [Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues

2008-10-07 Thread Andreas Dilger
On Oct 06, 2008  10:24 -0400, Ms. Megan Larko wrote:
 The order I used which generated no unknown symbol errors for
 installation of lustre 1.6.5.1 was this:
 1)   kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1.x86_64.rpm
 
 If using infiniband (IB) this is next:
 2)   kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm
 3)   lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm
 4)   lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm
 5)   lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm

That is good to know for the documentation.  However, I suspect if all
of these packages are installed at the same time there would also not
be any symbol warnings.
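
A sketch of that single-transaction idea, using the package names from the list above. The helper name single_transaction is made up for illustration, and it only prints the command rather than running rpm, so the claim that the warnings disappear is still untested here.

```sh
# Print one rpm command that installs every named package in a single
# transaction. Nothing is executed; the output can be reviewed first.
single_transaction() {
    printf 'rpm -ivh'
    for pkg in "$@"; do
        printf ' %s' "$pkg"
    done
    printf '\n'
}

single_transaction \
    kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1.x86_64.rpm \
    kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm \
    lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm \
    lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm \
    lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm
```

With all packages in one transaction, rpm works out the install order itself, which is the behaviour the paragraph above is relying on.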

 If a module installation does have many unknown symbol references,
 then find the rpm which will satisfy those references and install that
 module.   To actually have them satisfied one must return to the
 package that had complained about the unknown symbol and having
 already installed the package to satisfy those symbols then rpm
 --force -ivh  to force a retry of the package with the issues.

That isn't quite correct.  The missing module symbols are the output
of depmod -ae that is run in the RPM post-install after kernel modules
are installed.  Even if there are such warnings, if the modules are
later installed and depmod -ae is run again it should report no
warnings, regardless of what order the modules were installed.  That
means - no need to reinstall the RPMs or to install them in a particular
order, though of course avoid the warnings is always nicer.

You can always run depmod -ae by hand to re-verify the modules of
the currently installed kernels.
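
A minimal sketch of that re-check, assuming the kernel's System.map lives at /boot/System.map-<version> (a common layout, but an assumption here). The function names are made up for illustration; count_unknown_symbols simply counts lines mentioning "unknown symbol" in whatever depmod prints.

```sh
# Count "unknown symbol" warnings in depmod output read from stdin.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_unknown_symbols() {
    grep -ci 'unknown symbol' || true
}

# Re-verify the modules of one installed kernel version.
# The System.map path is an assumed layout, adjust as needed.
verify_modules() {
    kver=$1
    depmod -ae -F "/boot/System.map-$kver" "$kver" 2>&1 | count_unknown_symbols
}

# Example (not run here): verify_modules "$(uname -r)"
```

verify_modules prints the number of warnings; a result of 0 after all packages are in place would match the point above that install order does not matter in the end.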

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: [Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues

2008-10-07 Thread Malcolm Cowe




Now with attachment. Sorry.


Creating Lustre Packages From Source for RHEL 4.5 AS, 64-bit



Pre-requisites
--------------


RHEL 4.5 AS Full Installation

SUN-supplied Linux RDAC kernel modules installed.

System running on Stock RHEL 4.5 AS kernel with RDAC support.

Quilt Source
Lustre Source
RHEL 4.5 AS Kernel SRPM


1. Download and install Quilt:
http://download.savannah.gnu.org/releases/quilt/

2. Install the RHEL 4.5 AS Kernel SRPM, found on RHEL 4.5 source CD 4.

3. Change to the Red Hat RPM specs directory and extract the full kernel
source tree. This will also apply Red Hat's patches to the source:

cd /usr/src/redhat/SPECS
rpmbuild -bp kernel-2.6.spec 

4. Install the Lustre sources.

rpm -ivh lustre-source-1.6.5.1-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm

5. Prepare the kernel source for the Lustre patches:

cd /usr/src/redhat/BUILD/kernel-2.6.9/linux-2.6.9
rm -f patches series 
ln -s /usr/src/lustre-1.6.5.1/lustre/kernel_patches/series/2.6-rhel4.series series
ln -s /usr/src/lustre-1.6.5.1/lustre/kernel_patches/patches .

6. Apply the Lustre patches to the kernel sources:

cd /usr/src/redhat/BUILD/kernel-2.6.9/linux-2.6.9
quilt push -av


Points to note in output:

Hunk #1 FAILED at 770.
1 out of 1 hunk FAILED -- rejects in file fs/nfs/nfs4proc.c

Patch patches/vfs_intent-2.6-rhel4.patch does not apply (enforce with -f)
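
The points above can be pulled out of a saved quilt log with a small filter. This is only a sketch: quilt_failures is a made-up helper name, and it matches the message wording seen in this output rather than any documented quilt format.

```sh
# Read `quilt push -av` output on stdin and print, for each rejected hunk,
# the affected file and the patch being applied, plus the patch at which
# quilt stopped.
quilt_failures() {
    awk '
        /^Applying patch /            { current = $3 }
        /FAILED -- rejects in file /  { print "reject: " $NF " (patch " current ")" }
        /does not apply/              { print "stopped at: " $2 }
    '
}

# Example (not run here):
#   quilt push -av 2>&1 | tee quilt.log | quilt_failures
```

For the run shown in this document it would report fs/nfs/nfs4proc.c as the rejected file and vfs_intent-2.6-rhel4.patch as the stopping point.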

Full output:

[EMAIL PROTECTED] linux-2.6.9]# quilt push -av
Applying patch patches/vfs_intent-2.6-rhel4.patch
patching file fs/cifs/dir.c
patching file fs/exec.c
patching file fs/inode.c
patching file fs/namei.c
patching file fs/namespace.c
patching file fs/nfs/dir.c
patching file fs/nfs/nfs4proc.c
Hunk #1 FAILED at 770.
1 out of 1 hunk FAILED -- rejects in file fs/nfs/nfs4proc.c
patching file fs/open.c
patching file fs/stat.c
patching file include/linux/dcache.h
patching file include/linux/fs.h
patching file include/linux/mount.h
patching file include/linux/namei.h
Restoring include/linux/dcache.h
Restoring include/linux/namei.h
Restoring include/linux/mount.h
Restoring include/linux/fs.h
Restoring fs/stat.c
Restoring fs/open.c
Restoring fs/namespace.c
Restoring fs/exec.c
Restoring fs/nfs/nfs4proc.c
Restoring fs/nfs/dir.c
Restoring fs/namei.c
Restoring fs/cifs/dir.c
Restoring fs/inode.c
Patch patches/vfs_intent-2.6-rhel4.patch does not apply (enforce with -f)
Restoring include/linux/dcache.h
Restoring include/linux/namei.h
Restoring include/linux/mount.h
Restoring include/linux/fs.h
Restoring 

[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues

2008-10-06 Thread Malcolm Cowe




Hi Folks,

We are trying to create a small lustre environment on behalf of a
customer. There are 2 X4200m2 MDS servers, both dual-attached to an STK
6140 array over FC. This is an active-passive arrangement with a single
shared volume. Heartbeat is used to co-ordinate file system failover.
There is a single X4500 OSS server, the storage for which is split into
6 OSTs. Finally, we have 2 X4600m2 clients, just for kicks.

All systems are connected together over ethernet and infiniband, with
the IB network being used for Lustre and every system is running RHEL
4.5 AS. The X4500 OST volumes are created using software RAID, while
the X4200m2 MDT is accessed using DM Multipath. We downloaded the
Lustre binary packages from SUN's web site and installed them onto each
of the servers.

Unfortunately, the resulting system is very unstable and is prone to
lock-ups on the servers (uptimes are measured in hours). These lock-ups
happen without warning, and with very little, if any, debug information
in the system logs. We have also observed the servers locking up on
shutdown (kernel panics). Based on the documentation in the Lustre
operations manual, we installed the RPMs as follows:

rpm -Uvh --force e2fsprogs-1.40.7.sun3-0redhat.x86_64.rpm
rpm -ivh kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.1.x86_64.rpm
rpm -ivh kernel-lustre-source-2.6.9-67.0.7.EL_lustre.1.6.5.1.x86_64.rpm
rpm -ivh lustre-modules-1.6.5.1-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm  # (many "unknown symbol" warnings)
rpm -ivh lustre-1.6.5.1-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm
rpm -ivh lustre-source-1.6.5.1-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm
rpm -ivh lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm  # (many "unknown symbol" warnings)
mv /etc/init.d/openibd /etc/init.d/openibd.rhel4default
rpm -ivh --force kernel-ib-1.3-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm
cp /etc/init.d/openibd /etc/init.d/openibd.lustre.1.6.5.1

We then reboot the system and load RHEL using the Lustre kernel. Now we
install the Voltaire OFED software:

  Copy the kernel config used to build the Lustre patched kernel
into the Lustre kernel source tree:

cp /boot/config-2.6.9-67.0.7.EL_lustre.1.6.5.1smp \
/usr/src/linux-2.6.9-67.0.7.EL_lustre.1.6.5.1/.config


  Change into the Lustre kernel source and edit the Makefile. Change the
"custom" suffix to "smp" in the variable "EXTRAVERSION".
  Change into the Lustre kernel source and run these setup commands:

make oldconfig || make menuconfig
make include/asm
make include/linux/version.h
make SUBDIRS=scripts

  
  Change into the "-obj" directory and run these setup
commands:

cd /usr/src/linux-2.6.9-67.0.7.EL_lustre.1.6.5.1-obj/x86_64/smp
ln -s /usr/src/linux-2.6.9-67.0.7.EL_lustre.1.6.5.1/include .


  Unpack the Voltaire OFED tar-ball:

tar zxf VoltaireOFED-5.1.3.1_5.tgz

  
  Change to the unpacked software directory and run the
installation script. To build the OFED packages with the Voltaire
certified configuration, run the following commands:

cd VoltaireOFED-5.1.3.1_5
./install.pl -c ofed.conf.Volt

  
  Once complete, reboot.
  Configure any IPoIB interfaces as required.
  Add the following into /etc/modprobe.conf:

options lnet networks="o2ib0(ib0)"

  
   Load the Lustre LNET kernel module.

modprobe lnet

  
  Start the Lustre core networking service.

lctl network up

  
  Check the system log (/var/log/messages) for
confirmation.


Create the MGS/MDT Lustre Volume:

  Format the MGS/MDT device.

mkfs.lustre [ --reformat ] --fsname lfs01 --mdt --mgs
[EMAIL PROTECTED] /dev/dm-0

  
  Create the MGS/MDT file system mount point.

mkdir -p /lustre/mdt/lfs01

  
  Mount the file system. This will initiate MGS and MDT services
for Lustre.

mount -t lustre /dev/dm-0 /lustre/mdt/lfs01
  

With the exception of the OST volume creation, we use an equivalent
process to bring the OSS online.

The cabling has been checked and verified. So we re-built the system
from scratch and applied only SUN's RDAC modules and Voltaire OFED to
the stock RHEL 4.5 kernel (2.6.9-55.ELsmp). We removed the second MDS
from the h/w configuration and did not install Heartbeat. The shared
storage was re-formatted as a regular EXT3 file system using the DM
multipathing device, /dev/dm-0, and mounted onto the host. Running I/O
tests onto the mounted file system over an extended period did not
elicit a single error or warning message in the log related to the
multipathing or the SCSI device.

Once we were confident that the system was running in a consistent and
stable manner, we re-installed the Lustre packages, omitting the
kernel-ib packages. We had to re-build and re-install the RDAC support
as well. This means that the system has support for the Lustre file
system but no infiniband support at all. /etc/modprobe.conf
is updated such that the lnet networks option is
set to "tcp". 
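
For reference, the two LNET settings mentioned in this message differ only in the networks value. The helper below (lnet_options is a made-up name) prints the /etc/modprobe.conf line for each; it is a sketch and does not edit any file.

```sh
# Print the modprobe.conf line that selects the given LNET network.
lnet_options() {
    printf 'options lnet networks="%s"\n' "$1"
}

lnet_options 'o2ib0(ib0)'   # InfiniBand setting used in the original install
lnet_options 'tcp'          # Ethernet-only setting used for this test
```

Module options are read when lnet is loaded, so a changed line applies the next time the module is loaded.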

[Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues

2008-10-06 Thread Ms. Megan Larko
Hello,

Reading through the message from Malcolm Cowe about a new Lustre
environment, I noticed he mentioned unknown symbol warnings
during his installation procedure. I also saw such warnings when I
was doing a Lustre 1.6.5.1 install. I know from general Linux experience
that an rpm is not properly installed when it produces as many warnings
of the type I was seeing (some were ldiskfs issues, for example).
What I discovered is that the order in which the Lustre 1.6.5.1 rpms
are installed does matter, and that it is not the same order as
indicated in the Lustre Manual version 1.12 for 1.6.4.

The order I used which generated no unknown symbol errors for
installation of lustre 1.6.5.1 was this:
1)   kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1.x86_64.rpm

If using infiniband (IB) this is next:
2)   kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm

3)   lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm

4)   lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm

5)   lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp.x86_64.rpm

The above were done using rpm -i (install) which works well for
kernels so that you have multiple versions (hopefully including a good
one to which one may return if necessary).

The last cannot be done using -i, but -U:
rpm -Uvh e2fsprogs-1.40.7.sun3-0redhat.x86_64.rpm

This has not been a problem for us as the newer version seems to get
along fine with a 1.6.4.3 version of Lustre (I appreciate backwards
compatibility).

If a module installation does have many unknown symbol references,
then find the rpm which will satisfy those references and install that
module.   To actually have them satisfied one must return to the
package that had complained about the unknown symbol and having
already installed the package to satisfy those symbols then rpm
--force -ivh  to force a retry of the package with the issues.
This procedure can be iterative as sometimes more than one package may
be needed to satisfy all the references of the package desired.  I do
know from personal experience that if a package has unknown symbols
especially if those symbols are used/accessed, then it can panic the
box.

My experience with this is on 64-bit hardware using CentOS 5 as the base
operating system.

Best of luck.

megan


Re: [Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues

2008-10-06 Thread Malcolm Cowe




Hey Brian,

I'll have to re-install the system from scratch in order to be able to
answer some of your questions, which I'll get started on this evening.
What I was hoping for in the first instance was a sanity check of our
installation methods. With respect to the OFED stack used, we are using
the latest official software stack supplied by Voltaire. The reason for
this is that there is more to OFED than just the kernel modules,
including many libraries and tools, plus the latest firmware for the
cards. It's what the customer has asked for, and it is what the card
vendor expects us to do.

We may be able to get away with OFED 1.3, but I would still like some
guidance on how to install the rest of the OFED stack -- do we use the
OFED source to rebuild everything, or can we pick the Lustre supplied
kernel modules and just layer on the other stuff separately? Like I
said, sanity-checking the install procedure is important.

Finally, when I said that one file system fails versus another passes,
I mean that the server locks solid, crashes, usually with no debug to
speak of (nothing in the system logs). Even while the system is up and
running the lustre kernel, if we attempt a clean shutdown, the kernel
panics.

Since I need to rebuild the systems anyway, I will also try to install
the packages in the order mentioned by Megan Larko, to see how that
affects the installation. We have been following the instructions in
the Lustre Operations Manual (v. 1.14).

Regards,

Malcolm.


Brian J. Murrell wrote:

 On Mon, 2008-10-06 at 10:58 +0100, Malcolm Cowe wrote:
  rpm -Uvh --force e2fsprogs-1.40.7.sun3-0redhat.x86_64.rpm

 You should not (have to) use --force.  If you do, there is either an
 operational error or a bug in our packages.  In the latter case, please
 file a bug in our bugzilla.

  rpm -ivh lustre-modules-1.6.5.1-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm  # (many "unknown symbol" warnings)

 Can you paste them here?

  rpm -ivh lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm  # (many "unknown symbol" warnings)

 Ditto.

  rpm -ivh --force kernel-ib-1.3-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.x86_64.rpm

 Again, you should not need to use --force.

  We then reboot the system and load RHEL using the Lustre kernel. Now
  we install the Voltaire OFED software:

 Why?  The kernel-ib package you installed above should provide a working
 OFED stack.

  1. Unpack the Voltaire OFED tar-ball:

  tar zxf VoltaireOFED-5.1.3.1_5.tgz

 Do you really need 1.3.1?  If so, then you should not install the 1.3
 kernel-ib package we provide above.  I really wonder why you need 1.3.1
 though.

  * Lustre supplied kernel, Lustre software. No IB. MDS/MGS file
    system. FAILED.

 Failed in what way?

  * Lustre supplied kernel, Lustre software, RDAC. No IB. MDS/MGS
    file system (Full Lustre FS over Ethernet). FAILED.

 Again, in what way?

  * Lustre supplied kernel, Lustre software, RDAC, Voltaire OFED.
    EXT-3 file system. FAILED.

 Ditto.

  * Lustre supplied kernel, Lustre software. RDAC, Voltaire OFED.
    MDS/MGS file system (Full Lustre FS over IB). FAILED.

 And Ditto again.

 You have to provide more details than just "FAILED" if we are to try to
 help diagnose a problem.

  Our findings indicate that there is a problem within the binary
  distribution of Lustre.

 I think that many of our users use it as is, so it cannot be all that
 bad.

  This may be due to the fact that we are applying the 2.6.9-67 RHEL
  kernel to a platform based upon 2.6.9-55,

 That shouldn't be a problem in and of itself.

 b.

  
  

  


-- 
Malcolm Cowe
Solutions Integration Engineer

Sun Microsystems, Inc.
Blackness Road
Linlithgow, West Lothian EH49 7LR UK
Phone: x73602 / +44 1506 673 602
Email: [EMAIL PROTECTED]



Re: [Lustre-discuss] Lustre 1.6.5.1 on X4200 and STK 6140 Issues

2008-10-06 Thread Brian J. Murrell
On Mon, 2008-10-06 at 15:47 +0100, Malcolm Cowe wrote:
 Hey Brian,

Hey Malcolm,

 I'll have to re-install the system from scratch in order to be able to
 answer some of your questions, which I'll get started on this evening.

OK.

 What I was hoping for in the first instance was a sanity check of our
 installation methods.

I think I commented on those.  If you are going to build your OFED stack
you don't need to install the one we provide.

 With respect to the OFED stack used, we are using the latest official
 software stack supplied by Voltaire. The reason for this is that there
 is more to OFED than just the kernel modules, including many libraries
 and tools,

None of these should be necessary for Lustre to use I/B.

 plus the latest firmware for the cards.

Hrm.  Can you not upgrade firmware independent of upgrading the whole
OFED stack?  That seems very limiting.

 It's what the customer has asked for, and it is what the card vendor
 expects us to do.

Fair enough.  I was just pointing out that you don't need our OFED stack
if you are going to install your own.

 We may be able to get away with OFED 1.3, but I would still like some
 guidance on how to install the rest of the OFED stack

We don't supply the userspace tools because they are not really
necessary for Lustre.

 do we use the OFED source to rebuild everything, or can we pick the
 Lustre supplied kernel modules and just layer on the other stuff
 separately?

Yes, you should be able to do that.  I say that quite generally as I'm
not entirely clear on your operating environment.

 Finally, when I said that one file system fails versus another passes,
 I mean that the server locks solid, crashes, usually with no debug to
 speak of (nothing in the system logs).

Nothing on the console either?

 Even while the system is up and running the lustre kernel, if we
 attempt a clean shutdown, the kernel panics.

Hrm.  A panic is quite different than locking solid with no messages at
all.  A solid lock with no messages is indicative of hardware problems.

 Since I need to rebuild the systems anyway, I will also try to install
 the packages in the order mentioned by Megan Larko, to see how that
 affects the installation.

I'm not entirely convinced of her process.  You should not need to use
--force and reinstall packages already installed.  I'd be more
interested in knowing exactly your installation steps and the errors you
get from it.  Please try to avoid the use of --force so we can see why
it's necessary.  You will have to use rpm -U with e2fsprogs though as
she mentions.  Do all of your work with the script(1) tool so you can
easily log it.

b.


