[Lustre-discuss] Speeding up configuration log regeneration?
Hi,

We run a four-node Lustre 2.3 system, and I needed to both change the hardware under the MGS/MDS and reassign an OSS IP. At the same time, I added a brand-new 10GE network to the system, which was the reason for the MDS hardware change.

I ran tunefs.lustre --writeconf as per chapter 14.4 of the Lustre Manual, and everything mounts fine. Log regeneration apparently works, since it seems to do something, but exceedingly slowly. Disks show almost no activity, CPU utilization is zero across the board, and memory should be no issue. I believe it works, but at the current rate the 1.5*10^9 files (some 55 TiB of data) won't be indexed in a week. My boss isn't happy when I can't even predict how long this will take, or even say for sure that it really works.

Two questions: is there a way to know how fast it is progressing and/or where it is at, or even that it really works, and is there a way to speed up whatever is slowing it down? It seems all diagnostic /proc entries have been removed from 2.3. I have tried mounting the Lustre partitions with -o nobarrier (yes, I know it's dangerous, but I really need to speed things up), but I don't know if that does anything at all.

We run CentOS 6.x on the Lustre servers, where Lustre has been installed from RPMs from the Whamcloud/Intel build bot, and Ubuntu 10.04 on the clients with a hand-compiled kernel and Lustre. There is one MGC/MGS with twelve 15k-RPM SAS disks in RAID-10 as the MDT, which is all but empty, and six variously built RAID-6 arrays in SAS-attached shelves on three OSSes.

Thanks in advance for any help,
--
Olli Lounela
IT specialist and administrator
DNA sequencing and genomics
Institute of Biotechnology
University of Helsinki
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Speeding up configuration log regeneration?
On 2013/10/17 5:34 AM, Olli Lounela olli.loun...@helsinki.fi wrote:

> Hi, We run a four-node Lustre 2.3 system, and I needed to both change the
> hardware under the MGS/MDS and reassign an OSS IP. At the same time, I added
> a brand-new 10GE network to the system, which was the reason for the MDS
> hardware change.

Note that in Lustre 2.4 there is an lctl replace_nids command that allows you to change the NIDs without running --writeconf. That doesn't help you now, but possibly in the future.

> I ran tunefs.lustre --writeconf as per chapter 14.4 of the Lustre Manual,
> and everything mounts fine. Log regeneration apparently works, since it
> seems to do something, but exceedingly slowly. Disks show almost no
> activity, CPU utilization is zero across the board, and memory should be no
> issue. I believe it works, but at the current rate the 1.5*10^9 files (some
> 55 TiB of data) won't be indexed in a week. My boss isn't happy when I can't
> even predict how long this will take, or even say for sure that it really
> works.

The --writeconf information is at most a few kB and should only take seconds to complete. What reindexing operation are you referring to? It should be possible to mount the filesystem immediately (MGS first, then MDS and OSSes) after running --writeconf.

You didn't really explain what is preventing you from using the filesystem, since you said it mounted properly?

> Two questions: is there a way to know how fast it is progressing and/or
> where it is at, or even that it really works, and is there a way to speed up
> whatever is slowing it down? It seems all diagnostic /proc entries have been
> removed from 2.3. I have tried mounting the Lustre partitions with
> -o nobarrier (yes, I know it's dangerous, but I really need to speed things
> up), but I don't know if that does anything at all.

I doubt that the -o nobarrier is helping you much.

Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division
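The sequence discussed above (run --writeconf on each target, then mount the MGS first, followed by the MDS and OSSes) can be sketched as a dry run. This is only an illustration under stated assumptions: the device paths and mount points are hypothetical placeholders, a combined MGS/MDT device is assumed, and the function merely prints the commands rather than executing them.

```sh
#!/bin/sh
# Dry-run sketch: print the writeconf steps in order instead of running them.
# All device paths and mount points are hypothetical placeholders.

writeconf_plan() {
    mdt_dev=$1; shift    # combined MGS/MDT device (assumption)
    # Remaining arguments are OST devices.
    echo "# 1. with all clients and servers unmounted, regenerate the logs"
    echo "tunefs.lustre --writeconf $mdt_dev"
    for ost in "$@"; do
        echo "tunefs.lustre --writeconf $ost"
    done
    echo "# 2. remount in order: MGS/MDT first, then the OSTs, then clients"
    echo "mount -t lustre $mdt_dev /mnt/mdt"
    n=0
    for ost in "$@"; do
        echo "mount -t lustre $ost /mnt/ost$n"
        n=$((n + 1))
    done
}

writeconf_plan /dev/mdt_dev /dev/ost0_dev /dev/ost1_dev
```

Printing the plan first makes it easy to review the ordering before touching any target.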
Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre
I got the following message for lctl list_nids:

    Opening /dev/lnet failed: No such device
    Hint: the kernel modules may not be loaded
    IOC_LIBCFS_GET_NI error 19: No such device

It looks like the Lustre package has been installed, but it is not hooked into the kernel correctly. Which part went wrong?

-Weilin

From: Chan Ching Yu Patrick [mailto:cyc...@clustertech.com]
Sent: Wednesday, October 16, 2013 7:49 PM
To: Weilin Chang
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

What's the output of lctl list_nids?

On 10/17/2013 08:36 AM, Weilin Chang wrote:

Do I need to upgrade e2fsprogs on the client system?

-Weilin

From: Weilin Chang
Sent: Wednesday, October 16, 2013 3:45 PM
To: lustre-discuss@lists.lustre.org
Cc: Weilin Chang
Subject: lustre 1.8.5 client failed to mount lustre

Hi,

I am using Lustre 1.8.5. The servers are up and mounted without any problem, but the client failed to mount the Lustre file system. I also did not see lustre in /proc/filesystems. Is there another RPM I need to install on the client system?

Below is the log from the client system:

    [root@localhost client]# rpm -ivh lustre-client-modules-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5.i686.rpm
    Preparing...### [100%]
    1:lustre-client-modules ### [100%]
    Congratulations on finishing your Lustre installation! To register your
    copy of Lustre and find out more about Lustre Support, Service, and
    Training offerings please visit
    http://www.sun.com/software/products/lustre/lustre_reg.jsp
    [root@localhost client]# rpm -qa | grep lustre
    lustre-client-modules-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5
    [root@localhost client]# rpm -ivh lustre-client-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5.i686.rpm
    Preparing...### [100%]
    1:lustre-client ### [100%]
    [root@localhost client]# mount -t lustre 10.193.35.54@tcp0:/lustrefs /lustrefs
    mount.lustre: mount 10.193.35.54@tcp0:/lustrefs at /lustrefs failed: No such device
    Are the lustre modules loaded?
    Check /etc/modprobe.conf and /proc/filesystems
    Note 'alias lustre llite' should be removed from modprobe.conf
    [root@localhost client]# cat /etc/modprobe.conf
    alias eth0 e1000e
    alias eth1 e1000e
    alias scsi_hostadapter ahci
    options lnet networks=tcp
    [root@localhost client]# cat /proc/filesystems | grep -i lustre
    [root@localhost client]# rpm -qa | grep -i lustre
    lustre-client-modules-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5
    lustre-client-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5

--
Chan Ching Yu, Patrick
Senior System Engineer
Cluster Technology Limited
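The mount error above asks whether the Lustre modules are loaded and points at /proc/filesystems. A minimal sketch of that check follows. The function names are hypothetical, and the optional file argument exists only so the helpers can be exercised against sample files; on a real client the defaults /proc/filesystems and /proc/modules are read.

```sh
#!/bin/sh
# Succeeds if a filesystem type is registered with the kernel.
# $1 = filesystem name, $2 = file to read (default /proc/filesystems).
fs_registered() {
    # Lines look like "nodev<TAB>lustre" or "<TAB>ext3": compare the last field.
    awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }' \
        "${2:-/proc/filesystems}"
}

# Succeeds if a kernel module is currently loaded.
# $1 = module name, $2 = file to read (default /proc/modules).
module_loaded() {
    # The first field of each /proc/modules line is the module name.
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}"
}

for name in lnet lustre; do
    if module_loaded "$name" 2>/dev/null; then
        echo "module $name: loaded"
    else
        echo "module $name: not loaded"
    fi
done
if fs_registered lustre 2>/dev/null; then
    echo "filesystem lustre: registered"
else
    echo "filesystem lustre: not registered"
fi
```

If both modules are reported as not loaded, the mount cannot succeed, which is consistent with the "No such device" error in the log.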
Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre
    [root@localhost client]# uname -a
    Linux localhost.localdomain 2.6.18-194.el5PAE #1 SMP Fri Apr 2 15:37:44 EDT 2010 i686 i686 i386 GNU/Linux

Is this version OK for Lustre 1.8.5?

-Weilin

-----Original Message-----
From: Diep, Minh [mailto:minh.d...@intel.com]
Sent: Thursday, October 17, 2013 10:59 AM
To: Weilin Chang; Chan Ching Yu Patrick
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

Please check whether your client's kernel version matches the kernel version of the Lustre packages. What is the output of uname -a?

Thanks
-Minh
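The uname output above reports 2.6.18-194.el5PAE, while the installed package name carries 2.6.18_194.17.1.el5_lustre.1.8.5. One way to test whether modules exist for the running kernel is to look under the module directory for that release. The sketch below assumes the modules are installed somewhere beneath /lib/modules/<release> and that the client module file is named lustre.ko; both are assumptions, and the function name is hypothetical.

```sh
#!/bin/sh
# Succeeds if a Lustre module file exists for the given kernel release.
# $1 = kernel release (e.g. from `uname -r`)
# $2 = module root (default /lib/modules)
# Assumption: the module file is named lustre.ko (possibly compressed).
lustre_modules_for() {
    dir="${2:-/lib/modules}/$1"
    [ -d "$dir" ] && [ -n "$(find "$dir" -name 'lustre.ko*' 2>/dev/null)" ]
}

running=$(uname -r)
if lustre_modules_for "$running"; then
    echo "Lustre modules found for running kernel $running"
else
    echo "no Lustre modules found for running kernel $running"
fi
```

If nothing is found for the running release, modprobe has nothing to load for that kernel, regardless of what rpm reports as installed.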
Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre
On 2013/10/16 4:45 PM, Weilin Chang weilin.ch...@huawei.com wrote:

> Hi, I am using Lustre 1.8.5. The servers are up and mounted without any
> problem, but the client failed to mount the Lustre file system. I also did
> not see lustre in /proc/filesystems. Is there another RPM I need to install
> on the client system?

The most important question is why you are starting with this very old 1.8.5 release from 3 years ago instead of a newer release like 2.1.6 or 2.4.1. You will definitely find better support for newer kernels with newer releases, as well as a better chance that someone will be able to fix your problem.

My first guess is that your client network is not configured correctly, since it has a hostname of localhost. My second guess is that the Lustre modules are not properly matched with your kernel and are not loading at all.

Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division
Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre
lustre_rmmod failed with the following message:

    Open /proc/sys/lnet/dump_kernel failed: No such file or directory
    Open(dump_kernel) failed: No such file or directory

The package did not install correctly, because the directory /proc/sys/lnet does not exist.

-Weilin

-----Original Message-----
From: White, Cliff [mailto:cliff.wh...@intel.com]
Sent: Thursday, October 17, 2013 10:59 AM
To: Weilin Chang; Chan Ching Yu Patrick
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

On Thursday, October 17, 2013 10:25 AM, Weilin Chang weilin.ch...@huawei.com wrote:

> I got the following message for lctl list_nids:
>
>     Opening /dev/lnet failed: No such device
>     Hint: the kernel modules may not be loaded
>     IOC_LIBCFS_GET_NI error 19: No such device
>
> It looks like the Lustre package has been installed, but it is not hooked
> into the kernel correctly. Which part went wrong?

You might try 'lustre_rmmod' followed by 'modprobe -v lustre'; usually you will get useful error messages. The 'networks' option in /etc/modprobe.conf is unnecessary, since lnet will default to TCP. I would delete that line, as the syntax does not look correct.

cliffw
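Before deleting or changing the 'networks' line mentioned above, it can help to see exactly what value the configuration file passes to lnet. The following is a small sketch that only reports the configured value and does not judge whether it is valid; the function name is hypothetical, and the file argument defaults to the /etc/modprobe.conf shown in the thread.

```sh
#!/bin/sh
# Print the value of the lnet "networks" option from a modprobe config file.
# $1 = config file (default /etc/modprobe.conf). Prints nothing if unset.
lnet_networks() {
    sed -n 's/^[[:space:]]*options[[:space:]]\{1,\}lnet[[:space:]].*networks=\([^[:space:]]*\).*/\1/p' \
        "${1:-/etc/modprobe.conf}"
}

if [ -r /etc/modprobe.conf ]; then
    value=$(lnet_networks)
    echo "lnet networks option: ${value:-<not set>}"
fi
```

For the file in the client log this would report tcp. When no such line exists, lnet falls back to its default, which the reply above says is TCP.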
Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre
I agree with your observation. My purpose is to run Lustre on a 32-bit Linux system. The latest Lustre release does not support the kernel patch for 32-bit Linux systems. I don't know how well I could generate a 32-bit Linux kernel on a 64-bit Linux system without running into other problems. I think the older Lustre release has been tested and should be OK to use.

Do I need to patch my Linux kernel from version 2.6.18-194.el5 to 2.6.18-194.17.1.el5 in order to work with Lustre 1.8.5?

-Weilin

-----Original Message-----
From: Dilger, Andreas [mailto:andreas.dil...@intel.com]
Sent: Thursday, October 17, 2013 11:06 AM
To: Weilin Chang
Cc: lustre-discuss@lists.lustre.org; hpdd-disc...@lists.01.org
Subject: Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre

> The most important question is why you are starting with this very old
> 1.8.5 release from 3 years ago instead of a newer release like 2.1.6 or
> 2.4.1.
Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre
Weilin,

Earlier in the week you were discussing trying to get Lustre running on 32-bit ARM. Are the systems you have installed 1.8.5 on x86 systems, or are you doing this on ARM-based platforms?

--Jeff

On 10/17/13 11:10 AM, Weilin Chang wrote:

> lustre_rmmod failed with the following message:
>
>     Open /proc/sys/lnet/dump_kernel failed: No such file or directory
>     Open(dump_kernel) failed: No such file or directory
Re: [Lustre-discuss] lustre 1.8.5 client failed to mount lustre
On 10/17/13 11:10 AM, Weilin Chang weilin.ch...@huawei.com wrote:

> lustre_rmmod failed with the following message:
>
>     Open /proc/sys/lnet/dump_kernel failed: No such file or directory
>     Open(dump_kernel) failed: No such file or directory
>
> The package did not install correctly, because the directory /proc/sys/lnet
> does not exist.

That's not the problem. What error messages does 'modprobe -v lustre' give?
[Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4
Hello,

this is my first post on this list. I hope someone can give me some advice on how to resolve the following issue.

I'm using Lustre release 2.4.0 RC2 compiled from Whamcloud sources; this is an upgrade from Lustre 2.2.22 from the same sources.

The situation is: there are several clients reading files that belong mostly to the same OST. After a period of time the clients start losing contact with this OST and processes stop due to this fault. Here is the state of that OST on one client:

client# lfs check servers
...
...
lustre-OST000a-osc-8801bc548000: check error: Resource temporarily unavailable
...
...

Checking dmesg on the client and the OSS server, we have:

client# dmesg
LustreError: 11-0: lustre-OST000a-osc-8801bc548000: Communicating with 10.2.2.3@o2ib, operation ost_connect failed with -16.
LustreError: Skipped 24 previous similar messages

OSS-server# dmesg
Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at 10.2.64.4@o2ib) reconnecting
Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at 10.2.64.4@o2ib) refused reconnection, still busy with 9 active RPCs

At this moment I can ping from client to server and vice versa, but sometimes this call also hangs on the server and the client.

client# lctl ping OSS-server@o2ib
12345-0@lo
12345-OSS-server@o2ib

OSS-server# lctl ping 10.2.64.4@o2ib
12345-0@lo
12345-10.2.64.4@o2ib

This situation happens very frequently, especially with jobs that process a lot of files with an average size of 100MB.

The only solution I have found to reestablish communication between the server and the client is restarting both machines.

I hope someone has an idea what the reason for the problem is, and how I can reset the communication with the clients without restarting the machines.

thank you,
Eduardo
UNAM@Mexico

--
Eduardo Murrieta
Unidad de Cómputo
Instituto de Ciencias Nucleares, UNAM
Ph. +52-55-5622-4739 ext. 5103
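To see which clients an OSS keeps turning away, the "refused reconnection, still busy" messages can be tallied per client NID. The sketch below assumes the message wording shown in the dmesg excerpt in this post and a POSIX shell with awk and sort; the function name refused_per_client is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read OSS kernel log text on stdin and print
# "<client NID> <count>" for every client that was refused reconnection.
refused_per_client() {
    awk '/refused reconnection, still busy/ {
        # The NID is the token following "(at", minus the closing ")".
        for (i = 1; i < NF; i++) {
            if ($i == "(at") {
                nid = $(i + 1)
                sub(/\)$/, "", nid)
                count[nid]++
            }
        }
    }
    END { for (nid in count) print nid, count[nid] }' | sort
}
```

On the OSS this could be run as `dmesg | refused_per_client`. A client whose count keeps growing while others stay at zero would match the pattern described in this post.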
Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4
Hola Eduardo,

How are the OSTs connected to the OSS (SAS, FC, Infiniband SRP)? Are there any non-Lustre errors in the dmesg output of the OSS? Block devices error on the OSS (/dev/sd?)?

If you are losing [scsi,sas,fc,srp] connectivity you may see this sort of thing. If the OSTs are connected to the OSS node via IB SRP and your IB fabric gets busy or you have subnet manager issues you might see a condition like this.

Is this the AliceFS at DGTIC?

--Jeff

On 10/17/13 3:52 PM, Eduardo Murrieta wrote:
[...]

--
Jeff Johnson
Co-Founder
Aeon Computing
jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001 f: 858-412-3845 m: 619-204-9061
4170 Morena Boulevard, Suite D - San Diego, CA 92117
High-performance Computing / Lustre Filesystems / Scale-out Storage
Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4
Hello Jeff,

No, this is a Lustre filesystem for the Instituto de Ciencias Nucleares at UNAM. We are working on the installation for Alice at DGTIC too, but this problem is with our local filesystem.

The OST is connected using an LSI SAS controller, and we have 8 OSTs on the same server. There are nodes that lose connection with all the OSTs that belong to this server, but the problem is not related to the OST-OSS communication, since I can access this OST and read files stored there from other Lustre clients.

The problem is a deadlock condition in which the OSS and some clients refuse connections from each other, as I can see from dmesg:

in the client
LustreError: 11-0: lustre-OST000a-osc-8801bc548000: Communicating with 10.2.2.3@o2ib, operation ost_connect failed with -16.

in the server
Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at 10.2.64.4@o2ib) reconnecting
Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at 10.2.64.4@o2ib) refused reconnection, still busy with 9 active RPCs

This only happens with clients that are reading a lot of small files (~100MB each) in the same OST.

thank you,
Eduardo

2013/10/17 Jeff Johnson jeff.john...@aeoncomputing.com
[...]
Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4
Are there device or Filesystem level error messages on the server? This almost looks like a corrupted file system.

On Oct 17, 2013, at 6:11 PM, Eduardo Murrieta emurri...@nucleares.unam.mx wrote:
[...]
Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4
Ah, I understand. I performed the onsite Lustre installation of Alice and worked with JLG and his staff. Nice group of people!

This seems like a backend issue. Ldiskfs or the LSI RAID devices. Do you see any read/write failures reported on the OSS of the sd block devices where the OSTs reside? Something is timing out; disk I/O or the OSS is running too high of an iowait under load.

How many OSS nodes in the filesystem? Are these operations striped across all OSTs? Across multiple OSSs?

I still have an account on DGTIC's gateway, I could login and look. :-)

--Jeff

On Thursday, October 17, 2013, Eduardo Murrieta wrote:
[...]
Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4
I have this in the debug_file from my OSS:

0010:02000400:0.0:1382055634.785734:0:3099:0:(ost_handler.c:940:ost_brw_read()) lustre-OST: Bulk IO read error with 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at 10.2.64.4@o2ib), client will retry: rc -107
0400:02000400:0.0:1382055634.786061:0:3099:0:(watchdog.c:411:lcw_update_time()) Service thread pid 3099 completed after 227.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

But I can read files stored on this OST from other clients without problems. For example:

$ lfs find --obd lustre-OST .
./src/BLAS/srot.f
...
$ more ./src/BLAS/srot.f
SUBROUTINE SROT(N,SX,INCX,SY,INCY,C,S)
* .. Scalar Arguments ..
REAL C,S
INTEGER INCX,INCY,N
* ..
* .. Array Arguments ..
REAL SX(*),SY(*)
...
...

This OSS has 8 OSTs of 14 TB each, with 12 GB of RAM and a quad-core Xeon E5506. Tomorrow I'll increase the memory, in case this is the missing resource.

2013/10/17 Joseph Landman land...@scalableinformatics.com
[...]
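The numeric codes in these logs ("failed with -16", "rc -107") are negated Linux errno values, which is the usual kernel convention. The sketch below pulls the codes out of log text and names the ones seen in this thread. Both function names are hypothetical and the table is deliberately limited to those codes; it is not a complete errno list.

```shell
#!/bin/sh
# Hypothetical helper: name a return code seen in this thread.
# The values are standard Linux errno numbers, negated.
lustre_rc_name() {
    case "$1" in
        -11)  echo "EAGAIN (Resource temporarily unavailable)" ;;
        -16)  echo "EBUSY (Device or resource busy)" ;;
        -107) echo "ENOTCONN (Transport endpoint is not connected)" ;;
        *)    echo "unknown rc $1" ;;
    esac
}

# Hypothetical helper: print the first "rc -N" or "failed with -N"
# code found on each line of log text read from stdin.
extract_rcs() {
    awk 'match($0, /(rc|failed with) -[0-9]+/) {
        code = substr($0, RSTART, RLENGTH)
        sub(/.* /, "", code)   # keep only the trailing "-N"
        print code
    }'
}
```

For example, `dmesg | extract_rcs | sort | uniq -c` gives a quick count of each code. Read this way, -16 on ost_connect says the server considered the export still busy, and -107 on the bulk read says the connection was already gone when the transfer was attempted, which fits the messages quoted in the thread.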
Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4
Eduardo,

One or two E5506 CPUs in the OSS? What is the specific LSI controller, and how many of them are in the OSS?

I think the OSS is underprovisioned for 8 OSTs. I'm betting you run a high iowait on those sd devices during your problematic run. The iowait probably grows until deadlock. Can you run the job while running a shell with top on the OSS? You're likely hitting 99% iowait.

--Jeff

On Thursday, October 17, 2013, Eduardo Murrieta wrote:
[...]
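Besides watching top, the iowait share can be computed from two samples of the aggregate "cpu" line in /proc/stat, which makes it easy to log while the problematic job runs. This is only a sketch under stated assumptions: the function name iowait_percent is hypothetical, and it assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
#!/bin/sh
# Hypothetical helper: given two aggregate "cpu" lines sampled from
# /proc/stat, print the percentage of elapsed CPU time spent in iowait.
iowait_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Sum user..steal (fields 2-9). Any guest counters after that
            # are already included in user/nice, so they are skipped.
            last = (NF < 9) ? NF : 9
            total[NR] = 0
            for (i = 2; i <= last; i++) total[NR] += $i
            iowait[NR] = $6   # iowait is the fifth counter, i.e. field 6
        }
        END {
            dt = total[2] - total[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (iowait[2] - iowait[1]) / dt
        }'
}
```

A possible use on the OSS: `a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat); iowait_percent "$a" "$b"`. This gives a whole-machine figure only; it does not show which sd device is responsible.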