Re: [Lustre-discuss] More newbie issues
On Jul 28, 2008 08:28 -0400, Robert Healey wrote:
> On further investigation, the Ethernet interface was accidentally
> removed while I was swapping drives. Time for more testing, since the
> cluster is subject to frequent unexpected power cuts. How does Lustre
> compare to ext3 in terms of having clients unexpectedly power cycled?

Lustre uses ext3 back-end storage, so it behaves the same. On the OSTs
the data is actually written synchronously, so there is no real
distinction between the ext3 data={ordered,writeback} modes.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
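For context, the ext3 journaling modes Andreas mentions are chosen at
mount time; a minimal sketch (device and mount point hypothetical):

    mount -t ext3 -o data=ordered /dev/sdb1 /mnt/test     # default: file data flushed before metadata commits
    mount -t ext3 -o data=writeback /dev/sdb1 /mnt/test   # journals metadata only

Since the OSTs write data synchronously, the choice between the two
makes no practical difference there, as noted above.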
[Lustre-discuss] Operation not supported when mounted Lustre
Hi,

If someone could help me out on this: why can't I mount Lustre?

[EMAIL PROTECTED] ~]# mkfs.lustre --reformat --device-size=30 --mdt --mgs \
      --param lov.stripcount=3 --param lov.stripesize=4M /dev/sdb

   Permanent disk data:
Target:     lustre-MDT
Index:      unassigned
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x75 (MDT MGS needs_index first_time update )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: lov.stripcount=3 lov.stripesize=4M mdt.group_upcall=/usr/sbin/l_getgroups

2 6 18
formatting backing filesystem ldiskfs on /dev/sdb
        target name   lustre-MDT
        4k blocks     75000
        options       -i 4096 -I 512 -q -O dir_index,uninit_groups -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L lustre-MDT -i 4096 -I 512 -q -O dir_index,uninit_groups -F /dev/sdb 75000
Writing CONFIGS/mountdata

[EMAIL PROTECTED] ~]# mount -t lustre /dev/sdb /mnt/mdt
[EMAIL PROTECTED] ~]# mkdir /mnt/mdt
[EMAIL PROTECTED] ~]# mount -t lustre /dev/sdb /mnt/mdt
mount.lustre: mount /dev/sdb at /mnt/mdt failed: Operation not supported

/var/log/messages shows:

Jul 29 11:14:47 lustre_server kernel: Lustre: Skipped 1 previous similar message
Jul 29 11:14:47 lustre_server kernel: Lustre: *** setting obd lustre-MDT device 'unknown-block(8,16)' read-only ***
Jul 29 11:14:47 lustre_server kernel: Turning device sdb (0x800010) read-only
Jul 29 11:14:47 lustre_server kernel: Lustre: lustre-MDT: shutting down for failover; client state will be preserved.
Jul 29 11:14:47 lustre_server kernel: Lustre: MDT lustre-MDT has stopped.
Jul 29 11:14:47 lustre_server kernel: LustreError: 14992:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Jul 29 11:14:47 lustre_server kernel: LustreError: 14992:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Jul 29 11:14:47 lustre_server kernel: Lustre: MGS has stopped.
Jul 29 11:14:47 lustre_server kernel: Removing read-only on sdb (0x800010)
Jul 29 11:14:47 lustre_server kernel: Lustre: server umount lustre-MDT complete

TIA

PS: I have no problem mounting Lustre on RedHat 5.2 with Lustre 1.6.5.1.
Re: [Lustre-discuss] Mirroring lustre distfiles
Hi,

On Fri, Jul 25, 2008 at 08:59:17AM -0400, Brian J. Murrell wrote:
> On Fri, 2008-07-25 at 02:07 -0400, Andreas Dilger wrote:
> > On Jul 23, 2008 10:28 +0000, Piotr Jaroszyński wrote:
> > > is it allowed to mirror lustre distfiles? I'm preparing Gentoo
> > > ebuilds and am not sure whether I should force users to download
> > > them manually. Also, is the license just GPL-2 or did I miss
> > > something?
> >
> > Lustre is GPL v2, you can do whatever the license allows you to do,
> > which includes redistribution. There is already a Debian packaging
> > of Lustre.
>
> I may have been interpreting him incorrectly, but I thought his
> question was: can he redistribute the packages we produce? i.e.
> download from SDLC and then host and redistribute those packages.
> I'm not sure whether that changes Andreas' answer or not, but just
> wanted to clarify. I don't know if the resulting RPMs themselves can
> have a licence any different than that which they package. An
> interesting thing to ponder.

I'd also like to know the legal status of redistributing the binary
packages/rpms etc. I can see our site using the rpms in our local rpm
repos for automatically updating machines and installing new machines.
Since we also roll our own Scientific Linux site for redistribution for
our users, it would be nice if we could throw Lustre in as well to give
people a choice.

Thanks,
Jimmy
--
Jimmy Tang
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | http://www.tchpc.tcd.ie/~jtang
Re: [Lustre-discuss] MGS failover
Hi,

take a look at section 32.4.3:

# mount -t lustre [EMAIL PROTECTED]:[EMAIL PROTECTED]:/mds-p/client /mnt/lustre

means: if [EMAIL PROTECTED] is down, use [EMAIL PROTECTED] for the MGS.

Regards,
Stephane Thiell
CEA

Brock Palen wrote:
> The manual does not make much sense when it comes to MGS failover.
> Manual:
>
> Note - The MGS does not use the --failnode option. You need to set the
> command on all other nodes of the filesystem (servers and clients),
> about the failover options for the MGS. Use the --mgsnode parameter on
> servers and the mount address for clients. The servers need to contact
> the MGS for configuration information; they cannot query the MGS about
> the failover partner.
>
> This does not make any sense at all, other than that you can't use
> --failnode and that clients can't check with two different hosts for
> MGS data. Our MGS will be on its own LUN, set up with Heartbeat
> between two nodes that are also working as an MDS pair. While
> Heartbeat takes care of mounting the MGS filesystem, how can we tell
> clients: if mds1 is down, use mds2 for MGS data?
>
> Thanks. I hope that makes sense.
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> [EMAIL PROTECTED]
> (734)936-1985
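For reference, a minimal sketch of the two halves of this (the NIDs and
names below are hypothetical): servers are told about both MGS nodes at
format time with a repeated --mgsnode option, and clients list both MGS
NIDs, colon-separated, in the mount address:

    # on each OSS, when formatting an OST:
    mkfs.lustre --fsname=testfs --ost --mgsnode=mds1@tcp0 --mgsnode=mds2@tcp0 /dev/sdc

    # on each client: try mds1 first, fall back to mds2 for the MGS
    mount -t lustre mds1@tcp0:mds2@tcp0:/testfs /mnt/lustre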
[Lustre-discuss] Lustre 1.6.5.1 + kernel-ib doesn't work
Hi,

I installed all the new Lustre 1.6.5.1 packages on a CentOS 5.1 system,
and if I start OpenIB the server crashes. It also can't be rebooted any
more until the kernel-ib RPM is deinstalled. The list of installed
packages:

lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
lustre-source-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
kernel-lustre-source-2.6.18-53.1.14.el5_lustre.1.6.5.1
lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1
kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp

The previous OFED installation is completely removed. What's wrong?
During boot the server hangs at udev startup. Then it stops booting. I
also tried to build OFED 1.3.1 and 1.3, but it fails due to missing
modules, as also mentioned here:
http://lists.lustre.org/pipermail/lustre-discuss/2008-June/007767.html

Did anybody get it running?

best regards,
Danny
--
Danny Sternkopf  http://www.nec.de/hpc  [EMAIL PROTECTED]
HPCE Division Germany  phone: +49-711-68770-35  fax: +49-711-6877145
NEC Deutschland GmbH, Hansaallee 101, 40549 Düsseldorf
Geschäftsführer Yuya Momose
Handelsregister Düsseldorf HRB 57941; VAT ID DE129424743
Re: [Lustre-discuss] Operation not supported when mounted Lustre
On Thu, 2008-07-31 at 05:36 -0700, SoVo Le wrote:
> Hi, If someone could help me out on this: Why can't I mount lustre?
> ...
> mount.lustre: mount /dev/sdb at /mnt/mdt failed: Operation not supported

Do you have selinux or some other security framework enabled? If so,
disable it and try again.

b.
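For reference, a quick way to check and temporarily disable SELinux on
a RHEL/CentOS-style system (standard tools, not Lustre-specific):

    getenforce      # prints Enforcing, Permissive, or Disabled
    setenforce 0    # switch to Permissive until the next reboot

To disable it persistently, set SELINUX=disabled in /etc/selinux/config
and reboot before retrying the mount.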
[Lustre-discuss] Luster recovery when clients go away
One of our OSS's died with a panic last night. Between when it was
found (no failover) and restarted, two clients had died also (nodes
crashed by user OOM). Because of this the OSTs are now looking for 626
clients to recover when only 624 are up. The 624 recover in about 15
minutes, but the OSTs on that OSS hang waiting for the last two, which
are dead and not coming back. Note the MDS reports only 624 clients.

Is there a way to tell the OSTs to go ahead and evict those two clients
and finish recovering? Also, time remaining has been 0 since it was
booted. How long will the OSTs wait before letting operations continue?

Is there any rule for speeding up recovery? The OSS that crashed sees
very little CPU/disk/network traffic while recovery is going on, so any
way to speed it up, even if it results in a higher load, would be great
to know.

status: RECOVERING
recovery_start: 1217509142
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 175342162

status: RECOVERING
recovery_start: 1217509144
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 193097794

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985
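For reference, status blocks like these are read from the per-target
recovery files under /proc; a sketch, assuming the Lustre 1.6 layout:

    cat /proc/fs/lustre/obdfilter/*/recovery_status   # on an OSS
    cat /proc/fs/lustre/mds/*/recovery_status         # on the MDS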
Re: [Lustre-discuss] Lustre 1.6.5.1 + kernel-ib doesn't work
On Thu, 2008-07-31 at 16:08 +0200, Danny Sternkopf wrote:
> Hi, installed all the new Lustre 1.6.5.1 packages on a CentOS5.1
> system and if I start OpenIB the server crashes. It also can't be
> rebooted anymore until the kernel-ib RPM is deinstalled.

That sounds very suspect.

> Did anybody get it running?

Most certainly our QA department had it all running before we released
it. I suspect that you have some other problem masquerading as a
problem with the OFED stack. I'm afraid there is not much we can do to
help you without seeing some logs or error messages or the like. You
might have to instrument your boot with some debugging to see where
it's really getting stuck.

b.
[Lustre-discuss] lustre client /proc cached reads
Simply put, I would like to be able to access Lustre client data
statistics for each filesystem that exclude statistics for cached
reads.

It is my understanding that on the Lustre client (at least on version
1.6.4.2) the /proc stats for each filesystem
(/proc/fs/lustre/llite/FSNAME/stats) report the total number of bytes
read (in the read_bytes line), and that this number also includes bytes
read from the client's cache. First of all, is my understanding correct
for this version of Lustre? And does this also apply to newer versions?

It has been suggested to subtract the Lustre I/O from the network I/O
to get this data, but that is only applicable if the network is
dedicated to Lustre I/O, which is not the case here. For the moment it
seems only the number of cached reads is reported in /proc, not the
actual sizes, so this seems difficult or impossible.

Is there another way (perhaps in a newer version of Lustre) to find the
true read rate for the Lustre client, excluding cached reads?

--
John Parhizgari
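One hedged sketch (an assumption, not a confirmed answer from the
thread): the llite stats count bytes returned to applications, cache
hits included, while the per-OSC stats count bytes actually moved by
RPCs to the OSTs, so comparing the two read_bytes counters should
approximate the cached portion:

    grep read_bytes /proc/fs/lustre/llite/*/stats
    grep read_bytes /proc/fs/lustre/osc/*/stats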
Re: [Lustre-discuss] Luster recovery when clients go away
On Jul 31, 2008 10:30 -0400, Brock Palen wrote:
> One of our OSS's died with a panic last night. Between when it was
> found (no failover) and restarted, two clients had died also (nodes
> crashed by user OOM). Because of this the OSTs are now looking for
> 626 clients to recover when only 624 are up. The 624 recover in about
> 15 minutes, but the OSTs on that OSS hang waiting for the last two,
> which are dead and not coming back. Note the MDS reports only 624
> clients.
>
> Is there a way to tell the OSTs to go ahead and evict those two
> clients and finish recovering? Also, time remaining has been 0 since
> it was booted. How long will the OSTs wait before letting operations
> continue?

The recovery should time out after about 5 minutes (with default 100s
timeouts). The recovery goes as fast as clients connect and submit RPCs
for replay. In the case where all clients connect, recovery is finished
as soon as all clients report completion. Are you saying the system is
still stuck in recovery after more than 5 or 10 minutes?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
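If recovery really does stay stuck, one way to stop waiting for the
dead clients (a sketch, not part of Andreas' reply; the device number
is hypothetical) is to abort recovery on each affected OST with lctl:

    lctl dl                          # list devices, note each OST's number
    lctl --device 7 abort_recovery   # evict unrecovered clients, let the OST proceed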
[Lustre-discuss] More: OSS crashes
Hi all,

I'm still successful in bringing my OSSs to a standstill if not
crashing them. Having reduced the number of stress jobs writing to
Lustre (stress -d 2 --hdd-noclean --hdd-bytes 5M) to four, and having
reduced the number of OSS threads (options ost oss_num_threads=256 in
/etc/modprobe.d/lustre), the OSSs do not freeze entirely any more.
Instead, after ~15 hours:

- all stress jobs have terminated with Input/output error
- the MDT has marked the affected OSTs as Inactive
- the already open connections to the OSS remain active
- interactive collectl, watch df, top sessions are still working
- the number of ll_ost threads is 256 (number of ll_ost_io is 257?)
- log file writing has obviously stopped after only 10 hours
- already open shells allow commands like ps, and I can kill some processes
- new ssh logins don't work
- access to disk, as in ls, brings the system to a total freeze

The process table shows six ll_ost_io threads, all using 38.9% CPU, all
running for 419:21m. All the rest are sleeping. The cause can't be
system overloading or simple faulty hardware. To give an impression of
what is going on, here is the last collectl record:

### RECORD 139 (1217475195.342) (Thu Jul 31 05:33:15 2008) ###

# CPU SUMMARY (INTR, CTXSW & PROC /sec)
# USER NICE SYS WAIT IRQ SOFT STEAL IDLE  INTR CTXSW PROC RUNQ RUN  AVG1  AVG5 AVG15
     0    0  14   20   0    5     0   58   425  553K    1  736   6 22.06 31.28 31.13

# DISK SUMMARY (/sec)
# KBRead RMerged Reads SizeKB KBWrite WMerged Writes SizeKB
       0       0     0      0   83740     314    861     97

# LUSTRE FILESYSTEM SINGLE OST STATISTICS
# Ost      KBRead Reads KBWrite Writes
OST0004         0     0   40674     63
OST0005         0     0   40858     66

That's not too much for the machine, I'd reckon. And as mentioned in an
earlier post, I have run the very same 'stress' test, also with CPU
load or I/O load only, locally on machines that had crashed earlier.
The test runs that wrote to disk finished only when the disks were 100%
full (then formatted plain ext3); the tests with I/O load = 500 and CPU
load = 1k have been running for three days now. Of course I don't know
how reliable these tests are.

Looks to me as if a few Lustre threads for some reason can't process
their I/O any more, kind of building up pressure and finally blocking
all (disk) I/O. Knowing the reason and how to avoid it would not only
relieve these servers of some pressure... ;-)

Hm, hardware: the cluster is running Debian Etch, kernel 2.6.22, Lustre
1.6.5. The OSSs are Supermicro X7DB8 fileservers, Xeon E5320, 8GB RAM,
with 16 internal disks on two 3ware 9650 RAID controllers, forming two
OSTs each.

Many thanks for any further hints,
Thomas
[Lustre-discuss] Another download other than Sun?
I'm having difficulties downloading 1.6.5.1 through Sun. Every time I
get a General Protection error. I really need to get this version so I
can go home at a decent time tonight. Can somebody point me to an
alternative location to download 1.6.5.1 for RHEL4? Thanks!

--
Jeremy Mann
[EMAIL PROTECTED]
University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672
Re: [Lustre-discuss] Another download other than Sun?
> Hi Jeremy,
>
> On Thursday 31 July 2008 01:04:24 pm Jeremy Mann wrote:
> > I'm having difficulties downloading 1.6.5.1 through Sun. Every time
> > I get a General Protection error. I really need to get this version
> > so I can go home at a decent time tonight. Can somebody point me to
> > an alternative location to download 1.6.5.1 for RHEL4?
>
> I get the same error with Konqueror. However, the download page works
> from Firefox, so you may want to try that. Although I agree that the
> plain Apache DirectoryIndex version from pre-Sun times was much
> easier and more convenient to use (wget love). But well... :)

Yes, I miss the 'normal' download page ;) I used Safari yesterday to
get the version for RHEL5, which worked fine. I'll try with Firefox on
my Mac.

--
Jeremy Mann
[EMAIL PROTECTED]
University of Texas Health Science Center
Bioinformatics Core Facility
(210) 567-2672
[Lustre-discuss] lustre 1.6.5.1 panic on failover
I have two machines I am setting up as my first MDS failover pair. The
two Sun X4100s are connected to a FC disk array. I have set up
Heartbeat with IPMI for STONITH.

The problem is, when I run a test on the host that currently has the
mds/mgs mounted ('killall -9 heartbeat'), I see the IPMI shutdown, and
when the second 4100 tries to mount the filesystem it gets a kernel
panic. Has anyone else seen this behavior? Is there something I am
running into? If I do a 'hb_takeover' or shut down heartbeat cleanly,
all is well. Only if I simulate heartbeat failing does this happen.

Note I have not tried yanking power yet, but I want to simulate an MDS
in a semi-dead state and ran into this.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985
Re: [Lustre-discuss] lustre 1.6.5.1 panic on failover
On Thu, 2008-07-31 at 16:57 -0400, Brock Palen wrote:
> Problem is when I run a test on the host that currently has the
> mds/mgs mounted 'killall -9 heartbeat' I see the IPMI shutdown and
> when the second 4100 tries to mount the filesystem it does a kernel
> panic.

We'd need to see the *full* panic info to do any amount of diagnostics.

b.
Re: [Lustre-discuss] lustre 1.6.5.1 panic on failover
Hi Brock,

I've been using Sun X2200s with Lustre in a similar configuration
(IPMI, STONITH, Linux-HA, FC storage) and haven't had any issues like
this (although I would typically panic the primary node during testing
using Sysrq)... is the behaviour consistent?

Klaus

On 7/31/08 1:57 PM, Brock Palen [EMAIL PROTECTED] did etch on stone
tablets:
> I have two machines I am setting up as my first MDS failover pair.
> The two Sun X4100s are connected to a FC disk array. I have set up
> Heartbeat with IPMI for STONITH. Problem is when I run a test on the
> host that currently has the mds/mgs mounted 'killall -9 heartbeat' I
> see the IPMI shutdown and when the second 4100 tries to mount the
> filesystem it does a kernel panic. Has anyone else seen this
> behavior? Is there something I am running into? If I do a
> 'hb_takeover' or shut down heartbeat cleanly all is well. Only if I
> simulate heartbeat failing does this happen. Note I have not tried
> yanking power yet, but I want to simulate an MDS in a semi-dead state
> and ran into this.
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> [EMAIL PROTECTED]
> (734)936-1985
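For reference, the Sysrq-based crash test Klaus describes can be
triggered from a shell on the primary node (this crashes the machine
immediately; magic-sysrq must be enabled):

    echo 1 > /proc/sys/kernel/sysrq
    echo c > /proc/sysrq-trigger    # force an instant crash to exercise failover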
Re: [Lustre-discuss] More: OSS crashes
On Jul 31, 2008 20:45 +0200, Thomas Roth wrote:
> I'm still successful in bringing my OSSs to a standstill if not
> crashing them. [...] The process table shows six ll_ost_io threads,
> all using 38.9% CPU, all running for 419:21m. All the rest are
> sleeping. The cause can't be system overloading or simple faulty
> hardware.

You need to look at the process table (sysrq-t) and get the stacks of
the running and blocked Lustre processes. Also useful would be the
memory information (sysrq-m) to see if the node is out of free memory,
and if so where it has gone. If you can still run some commands, then
cat /proc/slabinfo may also be useful.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
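For reference, those dumps can be requested without a console if
magic-sysrq is enabled; a minimal sketch:

    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger                 # task list + stacks, lands in dmesg/syslog
    echo m > /proc/sysrq-trigger                 # memory usage report
    cat /proc/slabinfo > /tmp/slabinfo.snapshot  # hypothetical output path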
Re: [Lustre-discuss] lustre 1.6.5.1 panic on failover
What's a good tool to grab this? It's more than one page long, and the
machine does not have serial ports. Links are ok.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985

On Jul 31, 2008, at 5:14 PM, Brian J. Murrell wrote:
> On Thu, 2008-07-31 at 16:57 -0400, Brock Palen wrote:
> > Problem is when I run a test on the host that currently has the
> > mds/mgs mounted 'killall -9 heartbeat' I see the IPMI shutdown and
> > when the second 4100 tries to mount the filesystem it does a kernel
> > panic.
>
> We'd need to see the *full* panic info to do any amount of
> diagnostics.
>
> b.
[Lustre-discuss] Remove me from the list
Dear Lustre Community:

I have been subscribed as a member for more than four years. Thanks for
the information provided; during this period I have sold 4 SFS (the HP
version of Lustre) systems to customers, solving their heavy file
bandwidth requirements for clusters with more than a hundred nodes. The
architecture of Lustre is quite an elegant design for large bandwidth
in a cluster environment.

I am now becoming more and more focused on HPC system architecture
design, and it is no longer feasible for me to follow the technical
mail regarding Lustre-related issues. Thanks for the long-time help. I
will continue to promote SFS as usual, and would like to ask your help
to remove me from the list. Thanks again!

Best Rgrds,
Robert Sheen 沈仲杰
(M) +886-955-766-078  (O) +886-2-8722-9576
HP TSG Pre-Sales Solution Manager
http://www.hp.com/go/hptc, http://www.hp.com/go/linux
[Lustre-discuss] Recall: Remove me from the list
Sheen, Robert would like to recall the message, "Remove me from the
list".
Re: [Lustre-discuss] lustre 1.6.5.1 panic on failover
netdump is indeed good for this, but you may have to take two or three
cracks at it... it doesn't always dump the complete core image, and you
can't really do a whole lot with the incomplete version.

Klaus

On 7/31/08 5:50 PM, Kilian CAVALOTTI [EMAIL PROTECTED] did etch on
stone tablets:
> On Thursday 31 July 2008 17:22:28 Brock Palen wrote:
> > What's a good tool to grab this? It's more than one page long, and
> > the machine does not have serial ports.
>
> If your servers do IPMI, you probably can configure Serial-over-LAN
> to get a console and capture the logs. But a way more convenient
> solution is netdump. As long as the network connection is working on
> the panicking machine, you should be able to transmit the kernel
> panic info, as well as a stack trace, to a netdump server, which will
> store it in a file. See
> http://www.redhat.com/support/wpapers/redhat/netdump/
>
> Cheers,
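For reference, a sketch of a minimal netdump client setup on a
RHEL4-style system (the server address is hypothetical; the variable
and service names follow Red Hat's netdump documentation):

    # run the netdump-server service on a separate machine, then on the
    # panicking client:
    echo 'NETDUMPADDR=10.0.0.5' >> /etc/sysconfig/netdump
    service netdump propagate   # push the client's key to the server
    service netdump start
    # oops text and (possibly partial) vmcores then land under
    # /var/crash/ on the server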
[Lustre-discuss] __d_rehash and __d_move in ldiskfs in 1.6.5.1 w/ RHEL4 latest security-patch kernel rev 2.6.9-67.0.22
The 1.6.5.1 rhel5 patches applied without rejects, but when the ldiskfs
module loads there are two undefined externals: __d_rehash and
__d_move. These are in fs/dcache.c in the kernel.

For __d_rehash there look to be two easy fixes: 1) expose the static
__d_rehash function, or 2) have ldiskfs call d_rehash_cond instead
(ldiskfs always calls __d_rehash with 0 as the second arg). Which is
correct? The function definitions are:

static void __d_rehash(struct dentry *entry, struct hlist_head *list)
{
        entry->d_flags &= ~DCACHE_UNHASHED;
        entry->d_bucket = list;
        hlist_add_head_rcu(&entry->d_hash, list);
}

...and...

void d_rehash_cond(struct dentry *entry, int lock)
{
        struct hlist_head *list = d_hash(entry->d_parent, entry->d_name.hash);

        if (lock)
                spin_lock(&dcache_lock);
        spin_lock(&entry->d_lock);
        entry->d_flags &= ~DCACHE_UNHASHED;
        spin_unlock(&entry->d_lock);
        entry->d_bucket = list;
        hlist_add_head_rcu(&entry->d_hash, list);
        if (lock)
                spin_unlock(&dcache_lock);
}

The second issue is __d_move. The likely fix here is to have ldiskfs
call the already exposed d_move instead:

void d_move(struct dentry *dentry, struct dentry *target)
{
        spin_lock(&dcache_lock);
        d_move_locked(dentry, target);
        spin_unlock(&dcache_lock);
}

Would that be correct?

Thanks,
Chris
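To make the substitution concrete, a sketch of what the caller-side
change would look like (untested, assuming the signatures quoted above;
'dentry' and 'target' stand for whatever ldiskfs actually passes):

    /* hypothetical replacement calls in the ldiskfs code, not a tested fix */
    d_rehash_cond(dentry, 0);   /* was: __d_rehash(dentry, 0) */
    d_move(dentry, target);     /* was: __d_move(dentry, target) */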