[Lustre-discuss] Lustre crashes periodically
Hi everyone,

I have a problem lately with our Lustre 1.8 deployment. It crashes periodically in such a way that the nodes can mount the storage and I can't access the Lustre server machine either. So I have to restart the machine manually every time to get everything back to normal. I checked the logs, memory usage, and lock counts to see whether any of these might be the cause, but I don't think they account for this issue.

An interesting symptom I see every time this happens is that the network-usage lights on the Infiniband switch blink very fast. I suspect that heavy traffic on the Infiniband network to the Lustre server may be crashing it. Does that connection seem logical?

Anyway, I hope some of you have run into this problem before and can help me understand what is happening and how to avoid crashing the server again!

Thanks,

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Lustre crashes periodically
Sorry, I have to correct this: the nodes CANNOT mount the storage, and I can't access the Lustre server machine either.
Re: [Lustre-discuss] ldiskfs for MDT and zfs for OSTs?
Hello Anjana,

I can confirm that this setup works (ZFS MGS/MDT or LDISKFS MGS/MDT, and ZFS OSS/OST). I used a CentOS 6.4 build, 2.4.0-RC2-gd3f91c4-PRISTINE-2.6.32-358.6.2.el6_lustre.g230b174.x86_64, and the Lustre packages from http://downloads.whamcloud.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64/. ZFS was downloaded from ZoL and compiled/installed:

SPL: Loaded module v0.6.2-1
SPL: using hostid 0x
ZFS: Loaded module v0.6.2-1, ZFS pool version 5000, ZFS filesystem version 5

I first ran into the same problem:

mkfs.lustre --fsname=lustrefs --reformat --ost --backfstype=zfs .
mkfs.lustre FATAL: unable to prepare backend (22)
mkfs.lustre: exiting with 22 (Invalid argument)

and saw that the ZFS libraries in /usr/local/lib were not known to CentOS 6.4. A quick:

echo /usr/local/lib >> /etc/ld.so.conf.d/zfs.conf
echo /usr/local/lib64 >> /etc/ld.so.conf.d/zfs.conf
ldconfig

solved the problem.

(LDISKFS)
mkfs.lustre --reformat --mgs /dev/sda16
mkfs.lustre --reformat --fsname=zlust --mgsnode=10.16.0.104@o2ib0 --mdt --index=0 /dev/sda5

(ZFS)
mkfs.lustre --reformat --mgs --backfstype=zfs mgs/mgs /dev/sda16
mkfs.lustre --reformat --fsname=zlust --mgsnode=10.16.0.104@o2ib0 --mdt --index=0 --backfstype=zfs mdt0/mdt0 /dev/sda5

is working fine. The OSS/OST is a Debian wheezy box with a 70-disk JBOD, kernel 3.6.11-lustre-tstibor-build with patch series 3.x-fc18.series, and SPL/ZFS v0.6.2-1.

Best,
Thomas

On 10/08/2013 05:40 PM, Anjana Kar wrote:
The git checkout was on Sep. 20. Was the patch before or after?
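Before retrying mkfs.lustre after an ld.so.conf change like the one above, it is worth confirming the dynamic linker actually sees the ZFS libraries. A minimal sketch, assuming the default ZFS-on-Linux install prefix of /usr/local:

```shell
# Print the linker cache and look for the ZFS userland libraries; an empty
# result means ldconfig still does not know about the ZFS install prefix,
# and mkfs.lustre --backfstype=zfs will likely keep failing with error 22.
ldconfig -p | grep -i -E 'libzfs|libzpool'

# As a fallback, check the install prefix directly:
ls /usr/local/lib/libzfs* /usr/local/lib64/libzfs* 2>/dev/null
```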
The zpool create command successfully creates a raidz2 pool, and mkfs.lustre does not complain, but:

[root@cajal kar]# zpool list
NAME         SIZE   ALLOC  FREE   CAP  DEDUP  HEALTH  ALTROOT
lustre-ost0  36.2T  2.24M  36.2T  0%   1.00x  ONLINE  -
[root@cajal kar]# /usr/sbin/mkfs.lustre --fsname=cajalfs --ost --backfstype=zfs --index=0 --mgsnode=10.10.101.171@o2ib lustre-ost0
[root@cajal kar]# /sbin/service lustre start lustre-ost0
lustre-ost0 is not a valid lustre label on this node

I think we'll be splitting the MDS and OSTs across 2 nodes, as some of you said there could be other issues down the road, but thanks for all the good suggestions.

-Anjana

On 10/07/2013 07:24 PM, Ned Bass wrote:
I'm guessing your git checkout doesn't include this commit:

* 010a78e Revert LU-3682 tunefs: prevent tunefs running on a mounted device

It looks like the LU-3682 patch introduced a bug that could cause your issue, so it's reverted in the latest master.

Ned

On Mon, Oct 07, 2013 at 04:54:13PM -0400, Anjana Kar wrote:
On 10/07/2013 04:27 PM, Ned Bass wrote:
On Mon, Oct 07, 2013 at 02:23:32PM -0400, Anjana Kar wrote:
Here is the exact command used to create a raidz2 pool with 8+2 drives, followed by the error messages:

mkfs.lustre --fsname=cajalfs --reformat --ost --backfstype=zfs --index=0 --mgsnode=10.10.101.171@o2ib lustre-ost0/ost0 raidz2 /dev/sda /dev/sdc /dev/sde /dev/sdg /dev/sdi /dev/sdk /dev/sdm /dev/sdo /dev/sdq /dev/sds
mkfs.lustre FATAL: Invalid filesystem name /dev/sds

It seems that either the version of mkfs.lustre you are using has a parsing bug, or there was some sort of syntax error in the actual command entered. If you are certain your command line is free from errors, please post the version of Lustre you are using, or report the bug in the Lustre issue tracker.

Thanks,
Ned

For building this server, I followed the steps from the walk-thru build* for CentOS 6.4, and added --with-spl and --with-zfs when configuring Lustre.
*https://wiki.hpdd.intel.com/pages/viewpage.action?pageId=8126821

The spl and zfs modules were installed from source for the Lustre 2.4 kernel 2.6.32.358.18.1.el6_lustre2.4. Device sds appears to be valid, but I will try issuing the command using by-path names.

-Anjana
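If the one-shot mkfs.lustre invocation keeps tripping the vdev parser, one workaround consistent with this thread is to create the pool with zpool directly and then hand mkfs.lustre the already-existing pool/dataset. A hedged sketch, reusing the device names and MGS NID from Anjana's command (untested here; adjust for your hardware):

```shell
# Step 1: build the 8+2 raidz2 pool with zpool itself, sidestepping
# mkfs.lustre's vdev argument parsing entirely.
zpool create -f lustre-ost0 raidz2 \
    /dev/sda /dev/sdc /dev/sde /dev/sdg /dev/sdi \
    /dev/sdk /dev/sdm /dev/sdo /dev/sdq /dev/sds

# Step 2: format the OST on a dataset inside the existing pool.  Note the
# target is pool/dataset (lustre-ost0/ost0), not the bare pool name --
# the bare pool name is what produced "not a valid lustre label" above.
mkfs.lustre --fsname=cajalfs --reformat --ost --backfstype=zfs \
    --index=0 --mgsnode=10.10.101.171@o2ib lustre-ost0/ost0
```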
Re: [Lustre-discuss] Technical Working Group Meeting Oct 10th
A quick note to remind interested parties that the TWG will meet tomorrow, and to provide links to the slides for the presentations:

http://www.eofs.eu/fileadmin/lad2013/slides/09_Walgenbach_UID_LAD_2013.pdf
http://www.eofs.eu/fileadmin/lad2013/slides/09_Andrew_Korty_UID_LAD.pdf

On Mon, 2013-10-07 at 13:17 -0400, David Dillow wrote:
The OpenSFS Technical Working Group would like to invite you to our next meeting to discuss current developments and our future course.

When: Thursday, October 10th, 2013, 12:30pm ET, 9:30am PT
Dial in: US Toll-free: 866-692-4538
Local/Other US: 517-466-2084
Access code: 1973781

Other numbers may be available for those outside the US; please drop me a line and I'll see what I can do.

Agenda:
* Plans for the TWG going forward
* Status of the Intel contract (DNE/LFSCK)
* Status of the new development contract
* Deep dive into the IU uid-mapping and shared-key work
* Q&A

If anyone has additional items for the agenda, please let me know. The TWG is looking for presenters! If you are working on Lustre and would like to talk about your efforts, please speak up and let the community know about your project!

--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
Re: [Lustre-discuss] Lustre crashes periodically
Did you run lfsck against it? No kernel crash dumps? Maybe it's not a Lustre-related problem? If you have no active/passive MDS setup, the Lustre file system will be unusable if the MDS server crashes for whatever reason.

Abraham Alawi
Linux/UNIX Systems and Storage Specialist | STACC Project | Information Management Technology (IMT) | CSIRO
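Abraham's questions (crash dumps, logs, whether it is really Lustre) can be worked through after the next reboot with a short checklist. A hypothetical sketch for a Lustre 1.8 server, assuming standard syslog paths and a working lctl; commands must run as root:

```shell
# Look for Lustre/LNET errors, OOM kills, and soft lockups around the hang:
dmesg | grep -i -E 'lustre|lnet|oom|lockup' | tail -n 50
grep -i -E 'lustre|lnet' /var/log/messages | tail -n 50

# Dump the in-kernel Lustre debug buffer to a file for offline inspection:
lctl dk /tmp/lustre-debug.log

# Sample DLM lock counts while load builds, to correlate with the bursts
# of Infiniband traffic seen on the switch:
lctl get_param ldlm.namespaces.*.lock_count
```

If the server is wedged hard enough that even the console is dead, configuring kdump beforehand is likely the only way to capture a kernel crash dump for the next occurrence.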