[Lustre-discuss] OSS load in the roof

2008-06-27 Thread Brock Palen
Our OSS went crazy today.  It is attached to two OSTs.

The load is normally around 2-4.  Right now it is 123.

I noticed this to be the cause:

root      6748  0.0  0.0      0     0 ?   D    May27   8:57  [ll_ost_io_123]

All of them are stuck in uninterruptible sleep.
Has anyone seen this happen before?  Is this caused by a pending disk  
failure?
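(For anyone searching the archives: tasks in uninterruptible sleep show up as state D in ps output and count toward the load average even though they use no CPU, which is why the load climbs so high. A quick sketch to list them, assuming a standard procps ps:)

```shell
# List tasks stuck in uninterruptible sleep (state D).
# These inflate the load average even though they burn no CPU time.
ps -eo state,pid,comm | awk '$1 ~ /^D/ {print $2, $3}'
```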

I ask about disk failure because I also see this message:

mptscsi: ioc1: attempting task abort! (sc=010038904c40)
scsi1 : destination target 0, lun 0
 command = Read (10) 00 75 94 40 00 00 10 00 00
mptscsi: ioc1: task abort: SUCCESS (sc=010038904c40)

and:

Lustre: 6698:0:(lustre_fsfilt.h:306:fsfilt_setattr()) nobackup- 
OST0001: slow setattr 100s
Lustre: 6698:0:(watchdog.c:312:lcw_update_time()) Expired watchdog  
for pid 6698 disabled after 103.1261s

Thanks

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] OSS load in the roof

2008-06-27 Thread Bernd Schubert
On Fri, Jun 27, 2008 at 01:07:32PM -0400, Brian J. Murrell wrote:
 On Fri, 2008-06-27 at 12:44 -0400, Brock Palen wrote:
  
  All of them are stuck in uninterruptible sleep.
  Has anyone seen this happen before?  Is this caused by a pending disk
  failure?
 
 Well, they are certainly stuck because of some blocking I/O.  That could
 be disk failure, indeed.
 
  mptscsi: ioc1: attempting task abort! (sc=010038904c40)
  scsi1 : destination target 0, lun 0
   command = Read (10) 00 75 94 40 00 00 10 00 00
  mptscsi: ioc1: task abort: SUCCESS (sc=010038904c40)
 
 That does not look like a picture of happiness, indeed, no.  You have
 SCSI commands aborting.
 

Well, these messages are not nice, of course, since the mpt error handler
was activated, but in principle a SCSI device can recover from this.
Unfortunately, the verbosity level of the SCSI layer makes it impossible to
figure out what the actual problem was. Since we suffered from severe
SCSI problems, I wrote quite a number of patches to improve the situation.
We can now at least understand where a problem came from, and we also have
slightly improved error handling. These are presently for 2.6.22 only,
but my plan is to send them upstream for 2.6.28.


  Lustre: 6698:0:(lustre_fsfilt.h:306:fsfilt_setattr()) nobackup- 
  OST0001: slow setattr 100s
  Lustre: 6698:0:(watchdog.c:312:lcw_update_time()) Expired watchdog  
  for pid 6698 disabled after 103.1261s
 
 Those are just fallout from the above disk situation.

Probably the device was offlined, and this should also have been
printed in the logs. Brock, can you check the device status
(cat /sys/block/sdX/device/state)?

Cheers,
Bernd


Re: [Lustre-discuss] OSS load in the roof

2008-06-27 Thread Brock Palen
On Jun 27, 2008, at 1:39 PM, Bernd Schubert wrote:
 On Fri, Jun 27, 2008 at 01:07:32PM -0400, Brian J. Murrell wrote:
 On Fri, 2008-06-27 at 12:44 -0400, Brock Palen wrote:

 All of them are stuck in uninterruptible sleep.
 Has anyone seen this happen before?  Is this caused by a pending disk
 failure?

 Well, they are certainly stuck because of some blocking I/O.  That  
 could
 be disk failure, indeed.

 mptscsi: ioc1: attempting task abort! (sc=010038904c40)
 scsi1 : destination target 0, lun 0
  command = Read (10) 00 75 94 40 00 00 10 00 00
 mptscsi: ioc1: task abort: SUCCESS (sc=010038904c40)

 That does not look like a picture of happiness, indeed, no.  You have
 SCSI commands aborting.


 Well, these messages are not nice, of course, since the mpt error handler
 was activated, but in principle a SCSI device can recover from this.
 Unfortunately, the verbosity level of the SCSI layer makes it impossible to
 figure out what the actual problem was. Since we suffered from severe
 SCSI problems, I wrote quite a number of patches to improve the situation.
 We can now at least understand where a problem came from, and we also have
 slightly improved error handling. These are presently for 2.6.22 only,
 but my plan is to send them upstream for 2.6.28.


 Lustre: 6698:0:(lustre_fsfilt.h:306:fsfilt_setattr()) nobackup-
 OST0001: slow setattr 100s
 Lustre: 6698:0:(watchdog.c:312:lcw_update_time()) Expired watchdog
 for pid 6698 disabled after 103.1261s

 Those are just fallout from the above disk situation.

 Probably the device was offlined, and this should also have been
 printed in the logs. Brock, can you check the device status
 (cat /sys/block/sdX/device/state)?

IO is still flowing from both OSTs on that OSS:

[EMAIL PROTECTED] ~]# cat /sys/block/sd*/device/state
running
running

Sigh, it only needs to live until August, when we install our X4500s.
I think it's safe to send a notice to users that they may want to copy
their data.


 Cheers,
 Bernd





Re: [Lustre-discuss] OSS load in the roof

2008-06-27 Thread Bernd Schubert
On Fri, Jun 27, 2008 at 01:44:13PM -0400, Brock Palen wrote:
 On Jun 27, 2008, at 1:39 PM, Bernd Schubert wrote:
 On Fri, Jun 27, 2008 at 01:07:32PM -0400, Brian J. Murrell wrote:
 On Fri, 2008-06-27 at 12:44 -0400, Brock Palen wrote:

 All of them are stuck in uninterruptible sleep.
 Has anyone seen this happen before?  Is this caused by a pending disk
 failure?

 Well, they are certainly stuck because of some blocking I/O.  That  
 could
 be disk failure, indeed.

 mptscsi: ioc1: attempting task abort! (sc=010038904c40)
 scsi1 : destination target 0, lun 0
  command = Read (10) 00 75 94 40 00 00 10 00 00
 mptscsi: ioc1: task abort: SUCCESS (sc=010038904c40)

 That does not look like a picture of happiness, indeed, no.  You have
 SCSI commands aborting.


 Well, these messages are not nice, of course, since the mpt error handler
 was activated, but in principle a SCSI device can recover from this.
 Unfortunately, the verbosity level of the SCSI layer makes it impossible to
 figure out what the actual problem was. Since we suffered from severe
 SCSI problems, I wrote quite a number of patches to improve the situation.
 We can now at least understand where a problem came from, and we also have
 slightly improved error handling. These are presently for 2.6.22 only,
 but my plan is to send them upstream for 2.6.28.


 Lustre: 6698:0:(lustre_fsfilt.h:306:fsfilt_setattr()) nobackup-
 OST0001: slow setattr 100s
 Lustre: 6698:0:(watchdog.c:312:lcw_update_time()) Expired watchdog
 for pid 6698 disabled after 103.1261s

 Those are just fallout from the above disk situation.

 Probably the device was offlined, and this should also have been
 printed in the logs. Brock, can you check the device status
 (cat /sys/block/sdX/device/state)?

 IO is still flowing from both OSTs on that OSS:

 [EMAIL PROTECTED] ~]# cat /sys/block/sd*/device/state
 running
 running

So the device recovered. Is this parallel SCSI? If so, it might now be
running at a lower SCSI speed, but you should have gotten domain validation
messages about this (unless you are using a customized driver with DV disabled).
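(If you want to check the currently negotiated transfer parameters, the spi transport class exposes them in sysfs. The paths and attribute names in this sketch are an assumption from memory and may differ by kernel version and host/channel/target numbering:)

```shell
# Print negotiated parallel-SCSI transfer parameters per target.
# period is the sync period, width is 0 (narrow) or 1 (wide).
# Paths are an assumption; adjust for your host/channel/target IDs.
for t in /sys/class/spi_transport/target*; do
    [ -e "$t" ] || continue   # glob did not match: no parallel SCSI here
    echo "$t: period=$(cat "$t/period") width=$(cat "$t/width")"
done
```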


Cheers,
Bernd


Re: [Lustre-discuss] OSS load in the roof

2008-06-27 Thread Brock Palen

On Jun 27, 2008, at 1:07 PM, Brian J. Murrell wrote:
 On Fri, 2008-06-27 at 12:44 -0400, Brock Palen wrote:

 All of them are stuck in uninterruptible sleep.
 Has anyone seen this happen before?  Is this caused by a pending disk
 failure?

 Well, they are certainly stuck because of some blocking I/O.  That  
 could
 be disk failure, indeed.

 mptscsi: ioc1: attempting task abort! (sc=010038904c40)
 scsi1 : destination target 0, lun 0
  command = Read (10) 00 75 94 40 00 00 10 00 00
 mptscsi: ioc1: task abort: SUCCESS (sc=010038904c40)

 That does not look like a picture of happiness, indeed, no.  You have
 SCSI commands aborting.

While the array was reporting no problems, one of the disks was really
lagging behind the others. We have swapped it out.  Thanks for the feedback,
everyone.


 Lustre: 6698:0:(lustre_fsfilt.h:306:fsfilt_setattr()) nobackup-
 OST0001: slow setattr 100s
 Lustre: 6698:0:(watchdog.c:312:lcw_update_time()) Expired watchdog
 for pid 6698 disabled after 103.1261s

 Those are just fallout from the above disk situation.

 b.

