[Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

2013-10-17 Thread Eduardo Murrieta
Hello, this is my first post on this list. I hope someone can give me some advice on how to resolve the following issue. I'm using Lustre release 2.4.0 RC2 compiled from the Whamcloud sources; this is an upgrade from Lustre 2.2.22 built from the same sources. The situation is: There are several clients

Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

2013-10-17 Thread Jeff Johnson
Hola Eduardo, How are the OSTs connected to the OSS (SAS, FC, InfiniBand SRP)? Are there any non-Lustre errors in the dmesg output of the OSS? Block device errors on the OSS (/dev/sd?)? If you are losing [scsi,sas,fc,srp] connectivity you may see this sort of thing. If the OSTs are connected to

Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

2013-10-17 Thread Eduardo Murrieta
Hello Jeff, No, this is a Lustre filesystem for the Instituto de Ciencias Nucleares at UNAM. We are working on the installation for Alice at DGTIC too, but this problem is with our local filesystem. The OST is connected using an LSI-SAS controller; we have 8 OSTs on the same server. There are nodes

Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

2013-10-17 Thread Joseph Landman
Are there device or filesystem level error messages on the server? This almost looks like a corrupted file system.

Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

2013-10-17 Thread Jeff Johnson
Ah, I understand. I performed the onsite Lustre installation of Alice and worked with JLG and his staff. Nice group of people! This seems like a backend issue: ldiskfs or the LSI RAID devices. Do you see any read/write failures reported on the OSS for the sd block devices where the OSTs reside?

Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

2013-10-17 Thread Eduardo Murrieta
I have this in the debug_file from my OSS:
0010:02000400:0.0:1382055634.785734:0:3099:0:(ost_handler.c:940:ost_brw_read()) lustre-OST: Bulk IO read error with 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at 10.2.64.4@o2ib), client will retry: rc -107
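For readers following along, here is a minimal sketch of how the debug line above could be picked apart. It is not a Lustre tool: the function name `parse_bulk_error`, the regular expressions, and the small errno table are assumptions made for illustration. The one external fact relied on is that on Linux errno 107 is ENOTCONN ("Transport endpoint is not connected"), which suggests the server lost its connection to the client at 10.2.64.4@o2ib during the bulk transfer.

```python
import re

# The debug line quoted in the message (client UUID written without the stray space).
LINE = ("0010:02000400:0.0:1382055634.785734:0:3099:0:"
        "(ost_handler.c:940:ost_brw_read()) lustre-OST: Bulk IO read error with "
        "0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at 10.2.64.4@o2ib), "
        "client will retry: rc -107")

# Hypothetical patterns for this one message shape, not a general Lustre log parser.
SOURCE_RE = re.compile(
    r"\((?P<file>[^:()]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+"
    r"(?P<message>.*?)(?::\s+rc\s+(?P<rc>-?\d+))?$"
)
NID_RE = re.compile(r"\(at (?P<nid>[^)]+)\)")

# Linux errno values only; other platforms number these differently.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def parse_bulk_error(line):
    """Return source location, client NID, and decoded return code for one line."""
    match = SOURCE_RE.search(line)
    if match is None:
        raise ValueError("line does not look like the expected debug message")
    info = match.groupdict()
    info["line"] = int(info["line"])
    info["rc"] = int(info["rc"]) if info["rc"] is not None else None
    nid = NID_RE.search(line)
    info["nid"] = nid.group("nid") if nid else None
    # Kernel return codes are negative errno values, so look up the absolute value.
    name, text = LINUX_ERRNO.get(abs(info["rc"] or 0), (None, None))
    info["errno_name"] = name
    info["errno_text"] = text
    return info


info = parse_bulk_error(LINE)
print(info["func"], info["nid"], info["rc"], info["errno_name"])
```

A "not connected" return code by itself does not say whether the network or the storage backend is at fault, so it is best read together with the other questions raised in this thread.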

Re: [Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

2013-10-17 Thread Jeff Johnson
Eduardo, One or two E5506 CPUs in the OSS? What is the specific LSI controller and how many of them are in the OSS? I think the OSS is underprovisioned for 8 OSTs. I'm betting you see high iowait on those sd devices during your problematic run. The iowait probably grows until deadlock. Can you
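
One way to check the iowait suspicion is to sample /proc/stat on the OSS twice and compare. The sketch below is only an illustration under stated assumptions: the helper names `cpu_times`, `iowait_percent`, and `sample_live` are made up here, and it reports system-wide iowait from the aggregate `cpu` line rather than per-device figures (a tool such as iostat would be needed for the individual sd devices).

```python
import time


def cpu_times(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two /proc/stat samples."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # Guest time is already counted inside user/nice, so sum only the first eight.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_live(interval=5.0):
    """Take two live samples on a Linux host, 'interval' seconds apart."""
    with open("/proc/stat") as handle:
        before = handle.read()
    time.sleep(interval)
    with open("/proc/stat") as handle:
        after = handle.read()
    return iowait_percent(before, after)
```

Calling `sample_live()` repeatedly during the problematic run would show whether iowait climbs steadily before the bulk I/O errors appear.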