[Lustre-discuss] Speeding up configuration log regeneration?

2013-10-17 Thread Olli Lounela
Hi,

We run a four-node Lustre 2.3 system, and I needed to both change the hardware
under the MGS/MDS and reassign an OSS IP. At the same time, I added a brand-new
10GE network to the system, which was the reason for the MDS hardware change.

I ran tunefs.lustre --writeconf as per chapter 14.4 of the Lustre Manual,
and everything mounts fine. Log regeneration apparently works, since
it seems to do something, but exceedingly slowly. The disks show next to
no activity, CPU utilization is zero across the board, and memory
should be no issue. I believe it works, but currently it seems the
1.5*10^9 files (some 55 TiB of data) won't be indexed in a week. My
boss isn't happy when I can't even predict how long this will take, or
even say for sure that it really works.

Two questions: is there a way to know how fast it is progressing  
and/or where it is at, or even that it really works, and is there a  
way to speed up whatever is slowing it down? Seems all diagnostic  
/proc entries have been removed from 2.3.  I have tried mounting the  
Lustre partitions with -o nobarrier (yes, I know it's dangerous, but  
I'd really need to speed things up) but I don't know if that does  
anything at all.

We run CentOS 6.x on the Lustre servers, where Lustre has been installed
from RPMs from the Whamcloud/Intel build bot, and Ubuntu 10.04 on the clients
with a hand-compiled kernel and Lustre. There is one MGC/MGS with twelve 15k-RPM
SAS disks in RAID-10 as the MDT, which is all but empty, and six variously
built RAID-6 arrays in SAS-attached shelves on three OSSes.

Thanks in advance for any help,

-- 
 Olli Lounela
 IT specialist and administrator
 DNA sequencing and genomics
 Institute of Biotechnology
 University of Helsinki

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Speeding up configuration log regeneration?

2013-10-17 Thread Dilger, Andreas
On 2013/10/17 5:34 AM, Olli Lounela olli.loun...@helsinki.fi wrote:

> Hi,
>
> We run four-node Lustre 2.3, and I needed to both change hardware
> under MGS/MDS and reassign an OSS ip. Just the same, I added a brand
> new 10GE network to the system, which was the reason for MDS hardware
> change.

Note that in Lustre 2.4 there is a lctl replace_nids command that
allows you to change the NIDs without running --writeconf.  That doesn't
help you now, but possibly in the future.

> I ran tunefs.lustre --writeconf as per chapter 14.4 in Lustre Manual,
> and everything mounts fine. Log regeneration apparently works, since
> it seems to do something, but exceedingly slowly. Disks show all but
> no activity, CPU utilization is zero across the board, and memory
> should be no issue. I believe it works, but currently it seems the
> 1,5*10^9 files (some 55 TiB of data) won't be indexed in a week. My
> boss isn't happy when I can't even predict how long this will take, or
> even say for sure that it really works.

The --writeconf information is at most a few kB and should only take
seconds to complete.  What reindexing operation are you referencing?
It should be possible to mount the filesystem immediately (MGS first,
then MDS and OSSes) after running --writeconf.
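For reference, the order described above can be sketched as a dry run that only prints the commands it would issue. The device paths and mount points (/dev/mdt0, /dev/ost0, /mnt/mdt and so on) are hypothetical placeholders rather than values from this thread, the sketch assumes a combined MGS/MDT, and it assumes every target and client has already been unmounted.

```shell
#!/bin/sh
# Dry-run sketch: prints each command instead of executing it.
# All device paths and mount points below are hypothetical placeholders.
run() { echo "+ $*"; }

MDT_DEV=/dev/mdt0                 # assumed combined MGS/MDT device
OST_DEVS="/dev/ost0 /dev/ost1"    # assumed OST devices

writeconf_sequence() {
    # 1. Regenerate the configuration logs: MDT first, then every OST.
    run tunefs.lustre --writeconf "$MDT_DEV"
    for dev in $OST_DEVS; do
        run tunefs.lustre --writeconf "$dev"
    done
    # 2. Remount in order: MGS/MDT first, then the OSTs.
    run mount -t lustre "$MDT_DEV" /mnt/mdt
    i=0
    for dev in $OST_DEVS; do
        run mount -t lustre "$dev" "/mnt/ost$i"
        i=$((i + 1))
    done
}

writeconf_sequence
```

Replacing the body of `run` with `"$@"` would execute the commands for real; that should only be done after the printed sequence has been checked against the actual devices.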

You didn't really explain what is preventing you from using the filesystem,
since you said it mounted properly?

> Two questions: is there a way to know how fast it is progressing
> and/or where it is at, or even that it really works, and is there a
> way to speed up whatever is slowing it down? Seems all diagnostic
> /proc entries have been removed from 2.3.  I have tried mounting the
> Lustre partitions with -o nobarrier (yes, I know it's dangerous, but
> I'd really need to speed things up) but I don't know if that does
> anything at all.

I doubt that the -o nobarrier is helping you much.


Cheers, Andreas
-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division




Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

2013-10-17 Thread Weilin Chang
I got the following message for lctl list_nids:
Opening /dev/lnet failed: No such device
Hint: the kernel modules may not be loaded
IOC_LIBCFS_GET_NI error 19: No such device


It looks like the Lustre package has been installed, but it does not hook into the
kernel correctly. Which part went wrong?
-Weilin



From: Chan Ching Yu Patrick [mailto:cyc...@clustertech.com]
Sent: Wednesday, October 16, 2013 7:49 PM
To: Weilin Chang
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

What's the output?

lctl list_nids



On 10/17/2013 08:36 AM, Weilin Chang wrote:
Do I need to upgrade e2fsprogs on the client system?

-Weilin

From: Weilin Chang
Sent: Wednesday, October 16, 2013 3:45 PM
To: lustre-discuss@lists.lustre.org
Cc: Weilin Chang
Subject: lustre 1.8.5 client failed to mount lustre

Hi,

I am using Lustre 1.8.5. The servers are up and mounted without any problem, but
the client failed to mount the Lustre file system. I also did not see lustre in
/proc/filesystems.  Is there another RPM I need to install on the client system?

Below is the log from the client system:

[root@localhost client]# rpm -ivh lustre-client-modules-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5.i686.rpm
Preparing...### [100%]
   1:lustre-client-modules  ### [100%]
Congratulations on finishing your Lustre installation!  To register
your copy of Lustre and find out more about Lustre Support, Service,
and Training offerings please visit

http://www.sun.com/software/products/lustre/lustre_reg.jsp
[root@localhost client]# rpm -qa | grep lustre
lustre-client-modules-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5
[root@localhost client]# rpm -ivh lustre-client-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5.i686.rpm
Preparing...### [100%]
   1:lustre-client  ### [100%]
[root@localhost client]# mount -t lustre 10.193.35.54@tcp0:/lustrefs /lustrefs
mount.lustre: mount 10.193.35.54@tcp0:/lustrefs at /lustrefs failed: No such device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems
Note 'alias lustre llite' should be removed from modprobe.conf


[root@localhost client]# cat /etc/modprobe.conf
alias eth0 e1000e
alias eth1 e1000e
alias scsi_hostadapter ahci
options lnet networks=tcp


[root@localhost client]# cat /proc/filesystems | grep -i lustre

[root@localhost client]# rpm -qa | grep -i lustre
lustre-client-modules-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5
lustre-client-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5







--
Chan Ching Yu, Patrick
Senior System Engineer
Cluster Technology Limited

Modernize Your Business with Advanced Computing Technologies
Cloud Computing | Cluster | Financial Engineering | Business Intelligence
Email: cychan @clustertech.com
Direct Line: +852 2655 6113 Tel: +852 2655 6100 Fax: +852 2994 2101
Website: www.clustertech.com
Address: Units 211 - 213, Lakeside 1, No. 8 Science Park West Avenue, Hong Kong 
Science Park, N.T. Hong Kong
Hong Kong Beijing Shanghai Guangzhou Shenzhen Wuhan 
Sydney

**
The information contained in this e-mail and its attachments is confidential 
and intended solely for the specified addressees. If you have received this 
email in error, please do not read, copy, distribute, disclose or use any 
information of this email in any way and please immediately notify the sender 
and delete this email. Thank you for your cooperation.
**


Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

2013-10-17 Thread Weilin Chang
[root@localhost client]# uname -a
Linux localhost.localdomain 2.6.18-194.el5PAE #1 SMP Fri Apr 2 15:37:44 EDT 2010 i686 i686 i386 GNU/Linux

Is this version OK for Lustre 1.8.5?

-Weilin
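
One way to answer this kind of question is to compare the running kernel with the kernel directory the module package installed into. The sketch below is only an illustration: the helper names module_kernels and kernel_matches are made up for this example, and it assumes the package places its modules under /lib/modules/<kernel-release>/, which is the usual Linux layout but has not been confirmed for this particular RPM.

```shell
#!/bin/sh
# Sketch with hypothetical helper names; assumes kernel modules are installed
# under /lib/modules/<kernel-release>/ (the usual layout, not verified here).

# Reads a package file list (as printed by `rpm -ql <package>`) on stdin and
# prints the distinct directory names found directly under /lib/modules/.
module_kernels() {
    sed -n 's|^/lib/modules/\([^/]*\)/.*|\1|p' | sort -u
}

# $1: running kernel release (normally the output of `uname -r`).
# Reads the same file list on stdin; returns 0 only on an exact match.
kernel_matches() {
    running=$1
    module_kernels | grep -qxF -- "$running"
}

# Example use on the client (package name taken from the thread):
#   rpm -ql lustre-client-modules | kernel_matches "$(uname -r)" \
#       && echo "modules match the running kernel" \
#       || echo "modules were installed for a different kernel"
```

The comparison is deliberately exact, so a variant suffix such as PAE counts as a different kernel.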

-Original Message-
From: Diep, Minh [mailto:minh.d...@intel.com] 
Sent: Thursday, October 17, 2013 10:59 AM
To: Weilin Chang; Chan Ching Yu Patrick
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

Please check if your client's kernel version matches Lustre's. What is the
output of uname -a?

Thanks
-Minh


Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

2013-10-17 Thread Dilger, Andreas
On 2013/10/16 4:45 PM, Weilin Chang weilin.ch...@huawei.com wrote:

> Hi,
>
> I am using Lustre 1.8.5. The servers are up and mounted without any problem,
> but the client failed to mount the Lustre file system. I also did not see
> lustre in /proc/filesystems.  Is there another RPM I need to install on
> the client system?

The most important question is why are you starting with this very old
1.8.5 release from 3 years ago instead of a newer release like 2.1.6 or
2.4.1? You will definitely find better support for newer kernels with newer
releases, as well as a better chance that someone will be able to fix your
problem.

My first guess is that your client network is not configured correctly,
since it has a hostname of localhost.  Second guess is that the Lustre modules
are not properly matched with your kernel and they are not loading at all.
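
A quick check of the second guess is to see whether the lustre filesystem type is registered at all. The following is a minimal sketch; fs_registered is a hypothetical helper name, and it relies only on the filesystem type being the last field on each line of /proc/filesystems.

```shell
#!/bin/sh
# Sketch (hypothetical helper): succeed if a filesystem type is listed in a
# /proc/filesystems-style file. The type is the last field on each line; an
# optional "nodev" column may precede it.
fs_registered() {
    # $1: filesystem type, $2: file to read (defaults to /proc/filesystems)
    awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }' \
        "${2:-/proc/filesystems}"
}

# Example use on the client:
#   fs_registered lustre || echo "lustre is not registered; modules not loaded"
```

Unlike a plain grep, the exact field match does not succeed on a type whose name merely contains the string "lustre".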

Cheers, Andreas

 
-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division




Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

2013-10-17 Thread Weilin Chang
lustre_rmmod failed with the following message:
Open /proc/sys/lnet/dump_kernel failed: No such file or directory
Open(dump_kernel) failed: No such file or directory

The package does not seem to have installed correctly, because the directory
/proc/sys/lnet does not exist.

-Weilin

-Original Message-
From: White, Cliff [mailto:cliff.wh...@intel.com] 
Sent: Thursday, October 17, 2013 10:59 AM
To: Weilin Chang; Chan Ching Yu Patrick
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre


You might try 'lustre_rmmod' followed by 'modprobe -v lustre' - usually you
will get useful error messages.
The 'networks' option in /etc/modprobe.conf is unnecessary - lnet will default
to TCP.  I would delete that line, as the syntax does not look correct.
Cliffw




Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

2013-10-17 Thread Weilin Chang
I agree with your observation. My purpose is to run Lustre on a 32-bit Linux
system. The latest Lustre release does not support kernel patches for 32-bit
Linux systems, and I don't know how to build a 32-bit Linux kernel on a 64-bit
Linux system without running into other problems. I think the older Lustre
release has been tested and should be OK to use.

Do I need to update my kernel from version 2.6.18-194.el5 to 2.6.18-194.17.1.el5
in order to work with Lustre 1.8.5?

-Weilin



Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

2013-10-17 Thread Jeff Johnson
Weilin,

Earlier in the week you were discussing trying to get Lustre running on 
32-bit ARM. Are the systems you have installed 1.8.5 upon x86 systems or 
are you doing this on ARM processor based platforms?

--Jeff


 

Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

2013-10-17 Thread White, Cliff
On 10/17/13 11:10 AM, Weilin Chang weilin.ch...@huawei.com wrote:

> lustre_rmmod failed with the following message:
> Open /proc/sys/lnet/dump_kernel failed: No such file or directory
> Open(dump_kernel) failed: No such file or directory
>
> The package does not seem to have installed correctly, because the directory
> /proc/sys/lnet does not exist.

That's not the problem. What error messages does 'modprobe -v lustre' give?
 

-Weilin

-Original Message-
From: White, Cliff [mailto:cliff.wh...@intel.com]
Sent: Thursday, October 17, 2013 10:59 AM
To: Weilin Chang; Chan Ching Yu Patrick
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

From: Weilin Chang
weilin.ch...@huawei.commailto:weilin.ch...@huawei.com
Date: Thursday, October 17, 2013 10:25 AM
To: Chan Ching Yu Patrick
cyc...@clustertech.commailto:cyc...@clustertech.com
Cc: Weilin Chang 
weilin.ch...@huawei.commailto:weilin.ch...@huawei.com,
lustre-discuss@lists.lustre.orgmailto:lustre-discuss@lists.lustre.org
lustre-discuss@lists.lustre.orgmailto:lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

I got the following message for lctl list_nids:
Opening /dev/lnet failed: No such device
Hint: the kerenel modules may not be loaded
IOC_LIBCFS_GET_NI error 19: No such device


It looks like the lunstre packet has been installed, but it does not hook
with kernel correctly. Which part went wrong?
-Weilin


You might try 'lustre_rmmod' following by 'modprobe -v lustre' - usually
you will have useful error messages.
The 'networks' option In /etc/modprobe.conf Is unnecessary - lnet will
default to TCP.  I would delete that line, as the syntax does not look
correct.
Cliffw



From: Chan Ching Yu Patrick [mailto:cyc...@clustertech.com]
Sent: Wednesday, October 16, 2013 7:49 PM
To: Weilin Chang
Cc: 
lustre-discuss@lists.lustre.orgmailto:lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

What's the output?

lctl list_nids



On 10/17/2013 08:36 AM, Weilin Chang wrote:
Do I need to upgrade e2fsprogs on the client system?

-Weilin

From: Weilin Chang
Sent: Wednesday, October 16, 2013 3:45 PM
To: 
lustre-discuss@lists.lustre.orgmailto:lustre-discuss@lists.lustre.org
Cc: Weilin Chang
Subject: lustre 1.8.5 client failed to mount lustre

HI,

I am using luster 1.8.5. Server s are up and mounted without any problem.
But client failed to mount the luster file system. I also did not see
luster in /proc/filesystems.  Is there other rpm I needed to install on
the client system?

The below is log from the client system:

[root@localhost client]# rpm -ivh lustre-client-modules-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5.i686.rpm
Preparing...                ### [100%]
   1:lustre-client-modules  ### [100%]
Congratulations on finishing your Lustre installation!  To register
your copy of Lustre and find out more about Lustre Support, Service,
and Training offerings please visit

http://www.sun.com/software/products/lustre/lustre_reg.jsp
[root@localhost client]# rpm -qa | grep lustre
lustre-client-modules-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5
[root@localhost client]# rpm -ivh lustre-client-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5.i686.rpm
Preparing...                ### [100%]
   1:lustre-client          ### [100%]
[root@localhost client]# mount -t lustre 10.193.35.54@tcp0:/lustrefs /lustrefs
mount.lustre: mount 10.193.35.54@tcp0:/lustrefs at /lustrefs failed: No such device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems
Note 'alias lustre llite' should be removed from modprobe.conf


[root@localhost client]# cat /etc/modprobe.conf
alias eth0 e1000e
alias eth1 e1000e
alias scsi_hostadapter ahci
options lnet networks=tcp


[root@localhost client]# cat /proc/filesystems | grep -i lustre

[root@localhost client]# rpm -qa | grep -i lustre
lustre-client-modules-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5
lustre-client-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5

--
Chan Ching Yu, Patrick
Senior System Engineer
Cluster Technology Limited

Modernize Your Business with Advanced Computing Technologies
Cloud Computing | Cluster | Financial Engineering | Business Intelligence
Email: cyc...@clustertech.com
Direct Line: +852 2655 6113  Tel: +852 2655 6100  Fax: +852 2994 2101
Website: www.clustertech.com
Address: Units 211 - 213, Lakeside 1, No. 8 Science Park West Avenue,
Hong Kong Science Park, N.T. Hong Kong
Hong Kong  Beijing  Shanghai  Guangzhou  Shenzhen  Wuhan  Sydney


[Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

2013-10-17 Thread Eduardo Murrieta
Hello,

this is my first post on this list; I hope someone can give me some advice
on how to resolve the following issue.

I'm using Lustre release 2.4.0 RC2 compiled from the Whamcloud sources;
this is an upgrade from Lustre 2.2.22 built from the same sources.

The situation is:

Several clients are reading files that belong mostly to the same
OST. After a period of time the clients start losing contact with this
OST and processes stop due to this fault. Here is the state of that OST
on one client:

client# lfs check servers
...
...
lustre-OST000a-osc-8801bc548000: check error: Resource temporarily
unavailable
...
...

Checking dmesg on the client and on the OSS server, we have:

client# dmesg
LustreError: 11-0: lustre-OST000a-osc-8801bc548000: Communicating with
10.2.2.3@o2ib, operation ost_connect failed with -16.
LustreError: Skipped 24 previous similar messages

OSS-server# dmesg

Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
10.2.64.4@o2ib) reconnecting
Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
10.2.64.4@o2ib) refused reconnection, still busy with 9 active RPCs


At this moment I can ping from client to server and vice versa, but
sometimes this call also hangs on both the server and the client.

client# lctl ping OSS-server@o2ib
12345-0@lo
12345-OSS-server@o2ib

OSS-server# lctl ping 10.2.64.4@o2ib
12345-0@lo
12345-10.2.64.4@o2ib

This situation happens very frequently, especially with jobs that process
a lot of files with an average size of 100 MB.

The only way I have found to reestablish communication between the
server and the client is to restart both machines.

I hope someone has an idea of what is causing the problem and how I can
reset the communication with the clients without restarting the machines.

thank you,

Eduardo
UNAM@Mexico

-- 
Eduardo Murrieta
Unidad de Cómputo
Instituto de Ciencias Nucleares, UNAM
Ph. +52-55-5622-4739 ext. 5103
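
The numeric codes in the messages above are negated Linux errno values, so
"-16" in the ost_connect failure is EBUSY, and "Resource temporarily
unavailable" from 'lfs check servers' is the text for EAGAIN. A small lookup
table makes that explicit. This is only a sketch: the table is limited to a
few codes, and describe_rc is a hypothetical helper name, not a Lustre
utility.

```python
# A few Linux errno values (numbers and texts as defined on Linux).
LINUX_ERRNO = {
    11: ("EAGAIN", "Resource temporarily unavailable"),
    16: ("EBUSY", "Device or resource busy"),
    19: ("ENODEV", "No such device"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}


def describe_rc(rc):
    """Translate a Lustre return code such as -16 into an errno name and text."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return f"rc {rc}: {name} ({text})"
```

Read this way, the client is being told that the OST is busy, which agrees
with the server-side message about refusing the reconnection while RPCs are
still active.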


Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

2013-10-17 Thread Jeff Johnson
Hola Eduardo,

How are the OSTs connected to the OSS (SAS, FC, InfiniBand SRP)?
Are there any non-Lustre errors in the dmesg output of the OSS?
Block device errors on the OSS (/dev/sd?)?

If you are losing [scsi,sas,fc,srp] connectivity you may see this sort 
of thing. If the OSTs are connected to the OSS node via IB SRP and your 
IB fabric gets busy or you have subnet manager issues you might see a 
condition like this.

Is this the AliceFS at DGTIC?

--Jeff




-- 
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-performance Computing / Lustre Filesystems / Scale-out Storage



Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

2013-10-17 Thread Eduardo Murrieta
Hello Jeff,

No, this is a Lustre filesystem for the Instituto de Ciencias Nucleares at
UNAM. We are working on the installation for Alice at DGTIC too, but this
problem is with our local filesystem.

The OST is connected using an LSI SAS controller, and we have 8 OSTs on the
same server. Some nodes lose connection with all the OSTs that belong
to this server, but the problem is not related to OST-OSS
communication, since I can access this OST and read files stored there
from other Lustre clients.

The problem is a deadlock condition in which the OSS and some clients
refuse connections from each other, as I can see from dmesg:

On the client:
LustreError: 11-0: lustre-OST000a-osc-8801bc548000: Communicating with
10.2.2.3@o2ib, operation ost_connect failed with -16.

On the server:
Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
10.2.64.4@o2ib) reconnecting
Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
10.2.64.4@o2ib) refused reconnection, still busy with 9 active RPCs

This only happens with clients that are reading a lot of small files (~100 MB
each) on the same OST.

thank you,

Eduardo
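
To see which clients and targets are affected, and how many RPCs are still
outstanding, the "refused reconnection" lines can be pulled out of a saved
dmesg dump. This is a rough sketch that assumes the message wording is exactly
as quoted above; the regular expression and the name refused_reconnections are
illustrative only.

```python
import re

# Matches the server-side message quoted in this thread.
REFUSED = re.compile(
    r"Lustre: (?P<target>\S+): Client (?P<uuid>\S+) \(at (?P<nid>\S+)\) "
    r"refused reconnection, still busy with (?P<rpcs>\d+) active RPCs"
)


def refused_reconnections(dmesg_text):
    """Return (target, client NID, active RPC count) for each refusal."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(dmesg_text.split())
    return [
        (m.group("target"), m.group("nid"), int(m.group("rpcs")))
        for m in REFUSED.finditer(flat)
    ]
```

If the RPC count for a client never drops between successive refusals, that
points at requests stuck on the server side rather than at a network problem.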





Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

2013-10-17 Thread Joseph Landman
Are there device- or filesystem-level error messages on the server?  This
almost looks like a corrupted file system.

Please pardon brevity and typos ... Sent from my iPhone


Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

2013-10-17 Thread Jeff Johnson
Ah, I understand. I performed the onsite Lustre installation of Alice and
worked with JLG and his staff. Nice group of people!

This seems like a backend issue: ldiskfs or the LSI RAID devices. Do you
see any read/write failures reported on the OSS for the sd block devices
where the OSTs reside? Something is timing out: either disk I/O, or the OSS
is running too high an iowait under load.

How many OSS nodes in the filesystem? Are these operations striped across
all OSTs? Across multiple OSSs?

I still have an account on DGTIC's gateway, I could login and look. :-)

--Jeff


Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

2013-10-17 Thread Eduardo Murrieta
I have this on the debug_file from my OSS:

0010:02000400:0.0:1382055634.785734:0:3099:0:(ost_handler.c:940:ost_brw_read())
lustre-OST: Bulk IO read error with 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
(at 10.2.64.4@o2ib), client will retry: rc -107

0400:02000400:0.0:1382055634.786061:0:3099:0:(watchdog.c:411:lcw_update_time())
Service thread pid 3099 completed after 227.00s. This indicates the system
was overloaded (too many service threads, or there were not enough hardware
resources).

But I can read files stored on this OST from other clients without
problems. For example:

$ lfs find --obd lustre-OST .
./src/BLAS/srot.f
...

$ more ./src/BLAS/srot.f
  SUBROUTINE SROT(N,SX,INCX,SY,INCY,C,S)
* .. Scalar Arguments ..
  REAL C,S
  INTEGER INCX,INCY,N
* ..
* .. Array Arguments ..
  REAL SX(*),SY(*)
...
...

This OSS has 8 OSTs of 14 TB each, with 12 GB of RAM and a quad-core Xeon
E5506. Tomorrow I'll increase the memory, in case that is the missing resource.
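
The watchdog message quoted above reports how long a service thread took, and
the fourth colon-separated field of each debug line appears to be a Unix
timestamp. The sketch below extracts both from a saved debug file so that slow
threads can be lined up against job start times. It assumes the wording and
field layout are exactly as shown in this mail; slow_service_threads and
first_timestamp_utc are hypothetical helper names.

```python
import re
from datetime import datetime, timezone

WATCHDOG = re.compile(
    r"Service thread pid (?P<pid>\d+) completed after (?P<secs>\d+(?:\.\d+)?)s"
)
# Assumption: a 10-digit seconds value with a fractional part between colons.
TIMESTAMP = re.compile(r":(?P<epoch>\d{10}\.\d+):")


def slow_service_threads(debug_text, threshold_s=0.0):
    """Return (pid, seconds) for watchdog reports at or above threshold_s."""
    flat = " ".join(debug_text.split())  # undo line wrapping
    found = []
    for m in WATCHDOG.finditer(flat):
        seconds = float(m.group("secs"))
        if seconds >= threshold_s:
            found.append((int(m.group("pid")), seconds))
    return found


def first_timestamp_utc(debug_text):
    """Return the first log timestamp as an aware UTC datetime, or None."""
    m = TIMESTAMP.search(debug_text)
    if m is None:
        return None
    return datetime.fromtimestamp(float(m.group("epoch")), tz=timezone.utc)
```

A thread that needs several minutes to finish one request fits the picture of
clients being refused because earlier RPCs from them are still active.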










Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

2013-10-17 Thread Jeff Johnson
Eduardo,

One or two E5506 CPUs in the OSS? What is the specific LSI controller, and
how many of them are in the OSS?

I think the OSS is underprovisioned for 8 OSTs. I'm betting you run a high
iowait on those sd devices during your problematic run. The iowait probably
grows until deadlock. Can you run the job while running a shell with top on
the OSS? You're likely hitting 99% iowait.

--Jeff
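
If watching top interactively is inconvenient, roughly the same iowait figure
can be derived from two samples of the aggregate "cpu" line in /proc/stat
taken a few seconds apart while the job runs. This is a minimal sketch under
that assumption; cpu_times and iowait_percent are illustrative names, and the
result is an average over all CPUs for the sampling interval.

```python
def cpu_times(proc_stat_text):
    """Return the counters of the aggregate 'cpu' line in /proc/stat as ints."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before_text, after_text):
    """Percentage of CPU time spent in iowait between two /proc/stat samples."""
    before = cpu_times(before_text)
    after = cpu_times(after_text)
    deltas = [new - old for old, new in zip(before, after)]
    # Order: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already counted in user/nice, so sum the first eight only.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total
```

Reading /proc/stat once, sleeping for a few seconds, reading it again and
passing both texts to iowait_percent gives a number that can be logged
alongside the Lustre messages during a problematic run.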
