Hello,
this is my first post on this list. I hope someone can give me some advice
on how to resolve the following issue.
I'm using Lustre release 2.4.0 RC2 compiled from the Whamcloud sources;
this is an upgrade from Lustre 2.2.22 built from the same sources.
The situation is:
There are several clients
Hola Eduardo,
How are the OSTs connected to the OSS (SAS, FC, Infiniband SRP)?
Are there any non-Lustre errors in the dmesg output of the OSS?
Block device errors on the OSS (/dev/sd?)?
If you are losing [scsi,sas,fc,srp] connectivity you may see this sort
of thing. If the OSTs are connected to
Hello Jeff,
No, this is a Lustre filesystem for the Instituto de Ciencias Nucleares at
UNAM. We are working on the installation for Alice at DGTIC too, but this
problem is with our local filesystem.
The OST is connected using an LSI SAS controller; we have 8 OSTs on the same
server, there are nodes
Are there device- or filesystem-level error messages on the server? This
almost looks like a corrupted file system.
On Oct 17, 2013, at 6:11 PM, Eduardo Murrieta emurri...@nucleares.unam.mx
wrote:
Hello Jeff,
No, this is a Lustre
Ah, I understand. I performed the onsite Lustre installation of Alice and
worked with JLG and his staff. Nice group of people!
This seems like a backend issue: ldiskfs or the LSI RAID devices. Do you
see any read/write failures reported on the OSS for the sd block devices
where the OSTs reside?
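
The check for block-device failures asked about above can be sketched as a small filter over kernel log lines. This is only an illustration: it assumes the kernel reports failed requests with the usual "I/O error, dev sdX" wording, and the sample lines are hypothetical, not taken from this thread.

```python
import re
from collections import Counter

# Matches the device name in kernel messages such as
# "end_request: I/O error, dev sdc, sector 123456".
IO_ERROR = re.compile(r"I/O error, dev (sd[a-z]+)")

def count_io_errors(lines):
    """Count I/O error messages per sd device in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample lines, for illustration only.
SAMPLE = [
    "end_request: I/O error, dev sdc, sector 123456",
    "LustreError: some unrelated message",
    "end_request: I/O error, dev sdc, sector 123464",
    "end_request: I/O error, dev sde, sector 42",
]
```

On a real OSS the same function could be fed the dmesg output line by line, for example by passing it sys.stdin. A device that shows up repeatedly would point at the backend rather than at Lustre itself.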
I have this on the debug_file from my OSS:
0010:02000400:0.0:1382055634.785734:0:3099:0:(ost_handler.c:940:ost_brw_read())
lustre-OST: Bulk IO read error with 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
(at 10.2.64.4@o2ib), client will retry: rc -107
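
The return code at the end of that message is a negated Linux errno value, so it can be decoded with the standard library. A minimal sketch, assuming it runs on Linux (errno numbering differs between platforms); the LINE constant is the message quoted above, joined onto one line for convenience.

```python
import errno
import os
import re

# The debug message quoted above, joined onto a single line (an assumption
# about how the wrapped text fits together).
LINE = (
    "(ost_handler.c:940:ost_brw_read()) lustre-OST: Bulk IO read error with "
    "0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at 10.2.64.4@o2ib), "
    "client will retry: rc -107"
)

def decode_rc(line):
    """Return (code, symbolic name, description) for the 'rc N' in a log line.

    Returns None when the line carries no return code.
    """
    match = re.search(r"rc (-?\d+)", line)
    if match is None:
        return None
    code = abs(int(match.group(1)))
    name = errno.errorcode.get(code, "UNKNOWN")
    return code, name, os.strerror(code)
```

On Linux, errno 107 is ENOTCONN ("Transport endpoint is not connected"). That says the OSS could not complete the bulk transfer to the client at 10.2.64.4@o2ib, but it does not by itself say why.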
Eduardo,
One or two E5506 CPUs in the OSS? What is the specific LSI controller and
how many of them in the OSS?
I think the OSS is underprovisioned for 8 OSTs. I'm betting you see high
iowait on those sd devices during your problematic run. The iowait probably
grows until deadlock. Can you
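
One way to test the iowait guess above is to compare two samples of the aggregate "cpu" line in /proc/stat on the OSS. This is a rough sketch under that assumption, not a replacement for tools such as iostat; the function names are made up for the example.

```python
import time

def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into integer tick counts.

    Fields kept, in order: user, nice, system, idle, iowait, irq, softirq, steal.
    """
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in parts[1:9]]

def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait field

def sample_iowait(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, 'interval' seconds apart, and return iowait %."""
    with open(path) as stat_file:
        first = parse_cpu_line(stat_file.readline())
    time.sleep(interval)
    with open(path) as stat_file:
        second = parse_cpu_line(stat_file.readline())
    return iowait_percent(first, second)
```

Calling sample_iowait() repeatedly during the problematic run would show whether iowait keeps climbing as the load goes on.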