[Lustre-discuss] Lustre crashes periodically

2013-10-09 Thread Arya Mazaheri
Hi everyone, 
I have a problem lately with our Lustre 1.8 deployment. It crashes periodically 
in a way that the nodes can mount the storage and I can't access the Lustre 
server machine either. So I have to manually restart the machine every time to 
make everything normal again. I looked at the logs, memory usage, and lock 
counts to see whether they might be the cause of the problem, but I don't 
think they account for this issue.
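The lock counts mentioned above can be read straight out of /proc. A minimal sketch, assuming a Lustre 1.8 node where the DLM namespaces are exposed under /proc/fs/lustre (paths can differ between Lustre versions):

```shell
# Minimal sketch: dump per-namespace DLM lock counts, one common thing
# to watch during hangs. Assumes a Lustre 1.8 node; on other hosts the
# glob matches nothing and the loop prints nothing.
lock_counts() {
  for f in /proc/fs/lustre/ldlm/namespaces/*/lock_count; do
    [ -e "$f" ] || continue                    # silently skip on non-Lustre hosts
    printf '%s: %s\n' "${f%/lock_count}" "$(cat "$f")"
  done
}
lock_counts
```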
An interesting symptom I see every time this problem happens is that the 
network-usage lights on the Infiniband switch blink very fast. I suspect that 
heavy traffic on the Infiniband network to the Lustre server may be causing 
the crash. Does that seem plausible?

Anyway, I hope some of you may have experienced this problem before and can 
help me understand what is happening and how to avoid crashing the server again!

Thanks,
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre crashes periodically

2013-10-09 Thread Arya Mazaheri
Sorry, I have to correct this: the nodes CANNOT mount the storage, and I 
can't access the Lustre server machine either.




Re: [Lustre-discuss] ldiskfs for MDT and zfs for OSTs?

2013-10-09 Thread Thomas Stibor
Hello Anjana,

I can confirm that this setup works (ZFS-MGS/MDT or LDISKFS-MGS/MDT and
ZFS-OSS/OST).

I used a CentOS 6.4
build: 
2.4.0-RC2-gd3f91c4-PRISTINE-2.6.32-358.6.2.el6_lustre.g230b174.x86_64
and the Lustre Packages from
http://downloads.whamcloud.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64/

ZFS was downloaded from ZoL (ZFS on Linux) and compiled/installed.

SPL: Loaded module v0.6.2-1
SPL: using hostid 0x
ZFS: Loaded module v0.6.2-1, ZFS pool version 5000, ZFS filesystem version 5

I first ran into the same problem:

mkfs.lustre --fsname=lustrefs --reformat --ost --backfstype=zfs .
mkfs.lustre FATAL: unable to prepare backend (22)
mkfs.lustre: exiting with 22 (Invalid argument)

and saw that the ZFS libraries in /usr/local/lib were not known to CentOS 6.4's
dynamic linker.

A quick:

echo /usr/local/lib > /etc/ld.so.conf.d/zfs.conf
echo /usr/local/lib64 >> /etc/ld.so.conf.d/zfs.conf
ldconfig

solved the problem.
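The two echo lines can be tried safely against a scratch file first; this is only a sketch of what the real fix does, using /tmp instead of /etc so it runs without root:

```shell
# Sketch of the ld.so.conf.d fix, written to a scratch file instead of
# /etc/ld.so.conf.d/zfs.conf so it can be tried without root privileges.
conf=/tmp/zfs.conf.demo
echo /usr/local/lib   >  "$conf"   # '>' creates/truncates the file
echo /usr/local/lib64 >> "$conf"   # '>>' appends the second search path
cat "$conf"
# On the real system: write the same two paths to /etc/ld.so.conf.d/zfs.conf
# and run ldconfig as root to refresh the linker cache.
```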

(LDISKFS)
mkfs.lustre --reformat --mgs /dev/sda16
mkfs.lustre --reformat --fsname=zlust --mgsnode=10.16.0.104@o2ib0 --mdt
--index=0 /dev/sda5

(ZFS)
mkfs.lustre --reformat --mgs --backfstype=zfs mgs/mgs /dev/sda16
mkfs.lustre --reformat --fsname=zlust --mgsnode=10.16.0.104@o2ib0 --mdt
--index=0 --backfstype=zfs mdt0/mdt0 /dev/sda5

is working fine.
The OSS/OST is a Debian wheezy box with a 70-disk JBOD, running kernel
3.6.11-lustre-tstibor-build with patch series 3.x-fc18.series
and SPL/ZFS v0.6.2-1.

Best,
 Thomas

On 10/08/2013 05:40 PM, Anjana Kar wrote:
 The git checkout was on Sep. 20. Was the patch before or after?

 The zpool create command successfully creates a raidz2 pool, and mkfs.lustre
 does not complain, but

 [root@cajal kar]# zpool list
 NAME  SIZE  ALLOC   FREECAP  DEDUP  HEALTH  ALTROOT
 lustre-ost0  36.2T  2.24M  36.2T 0%  1.00x  ONLINE  -

 [root@cajal kar]# /usr/sbin/mkfs.lustre --fsname=cajalfs --ost 
 --backfstype=zfs --index=0 --mgsnode=10.10.101.171@o2ib lustre-ost0

 [root@cajal kar]# /sbin/service lustre start lustre-ost0
 lustre-ost0 is not a valid lustre label on this node

 I think we'll be splitting up the MDS and OSTs on 2 nodes as some of you 
 said
 there could be other issues down the road, but thanks for all the good 
 suggestions.

 -Anjana

 On 10/07/2013 07:24 PM, Ned Bass wrote:
 I'm guessing your git checkout doesn't include this commit:

 * 010a78e Revert LU-3682 tunefs: prevent tunefs running on a mounted device

 It looks like the LU-3682 patch introduced a bug that could cause your issue,
 so it's reverted in the latest master.

 Ned

 On Mon, Oct 07, 2013 at 04:54:13PM -0400, Anjana Kar wrote:
 On 10/07/2013 04:27 PM, Ned Bass wrote:
 On Mon, Oct 07, 2013 at 02:23:32PM -0400, Anjana Kar wrote:
 Here is the exact command used to create a raidz2 pool with 8+2 drives,
 followed by the error messages:

 mkfs.lustre --fsname=cajalfs --reformat --ost --backfstype=zfs
 --index=0 --mgsnode=10.10.101.171@o2ib lustre-ost0/ost0 raidz2
 /dev/sda /dev/sdc /dev/sde /dev/sdg /dev/sdi /dev/sdk /dev/sdm
 /dev/sdo /dev/sdq /dev/sds

 mkfs.lustre FATAL: Invalid filesystem name /dev/sds
 It seems that either the version of mkfs.lustre you are using has a
 parsing bug, or there was some sort of syntax error in the actual
 command entered.  If you are certain your command line is free from
 errors, please post the version of lustre you are using, or report the
 bug in the Lustre issue tracker.

 Thanks,
 Ned
 For building this server, I followed the steps from the walk-thru-build*
 for CentOS 6.4,
 and added --with-spl and --with-zfs when configuring Lustre.
 *https://wiki.hpdd.intel.com/pages/viewpage.action?pageId=8126821

 The spl and zfs modules were installed from source for the Lustre 2.4 kernel
 2.6.32.358.18.1.el6_lustre2.4.

 Device sds appears to be valid, but I will try issuing the command using
 by-path names.

 -Anjana
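The by-path approach can be sketched like this; by_path_for is a hypothetical helper written for illustration, and /dev/sds is just the example device from the thread:

```shell
# Sketch: resolve the persistent /dev/disk/by-path alias for a kernel
# device name, so zpool/mkfs.lustre commands survive device reordering
# across reboots. by_path_for is a made-up helper name.
by_path_for() {
  for link in /dev/disk/by-path/*; do
    [ -e "$link" ] || continue                 # no-op on hosts without udev links
    [ "$(readlink -f "$link")" = "$1" ] && echo "$link"
  done
}
by_path_for /dev/sds    # prints the matching by-path name, if any
```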






Re: [Lustre-discuss] Technical Working Group Meeting Oct 10th

2013-10-09 Thread David Dillow
A quick note to remind interested parties that the TWG will meet tomorrow,
and to provide links to the slides for the presentations:
http://www.eofs.eu/fileadmin/lad2013/slides/09_Walgenbach_UID_LAD_2013.pdf
http://www.eofs.eu/fileadmin/lad2013/slides/09_Andrew_Korty_UID_LAD.pdf


On Mon, 2013-10-07 at 13:17 -0400, David Dillow wrote:
 The OpenSFS Technical Working Group would like to invite you to our next
 meeting to discuss current developments and our future course.
 
 When:
   Thursday, October 10th, 2013
   12:30pm ET, 9:30am PT
 
 Dial in:
   US Toll-free:   866-692-4538
   Local/Other US: 517-466-2084
   Access code:1973781
 
   Other numbers may be available for those outside the US;
   please drop me a line and I'll see what I can do.
 
 Agenda:
   * Plans for TWG going forward
   * Status of the Intel contract (DNE/LFSCK)
   * Status of the new development contract
   * Deep dive into the IU uid-mapping and shared key work
   * Q&A
 
 If anyone has additional items for the agenda, please let me know.
 
 The TWG is looking for presenters! If you are working in Lustre and
 would like to talk about your efforts, please speak up and let the
 community know about your project!
 

-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office



Re: [Lustre-discuss] Lustre crashes periodically

2013-10-09 Thread Abraham.Alawi
Did you run lfsck against it?
No kernel crash dumps?

Maybe it's not a Lustre-related problem? If you have no active/passive MDS 
setup, the Lustre file system will be unusable if the MDS server crashes for 
whatever reason.

Abraham Alawi
Linux/UNIX Systems and Storage Specialist | STACC Project | Information 
Management & Technology (IMT) | CSIRO

From: lustre-discuss-boun...@lists.lustre.org 
[mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Arya Mazaheri
Sent: Wednesday, 9 October 2013 6:52 PM
To: lustre-discuss@lists.lustre.org
Subject: [Lustre-discuss] Lustre crashes periodically
