[Lustre-discuss] OSS load in the roof
Our OSS went crazy today. It is attached to two OSTs. The load is normally around 2-4; right now it is 123. I noticed this to be the cause:

  root      6748  0.0  0.0     0    0 ?  D  May27  8:57 [ll_ost_io_123]

All of them are stuck in un-interruptible sleep. Has anyone seen this happen before? Is this caused by a pending disk failure? I ask about disk failure because I also see this message:

  mptscsi: ioc1: attempting task abort! (sc=010038904c40)
  scsi1 : destination target 0, lun 0
          command = Read (10) 00 75 94 40 00 00 10 00 00
  mptscsi: ioc1: task abort: SUCCESS (sc=010038904c40)

and:

  Lustre: 6698:0:(lustre_fsfilt.h:306:fsfilt_setattr()) nobackup-OST0001: slow setattr 100s
  Lustre: 6698:0:(watchdog.c:312:lcw_update_time()) Expired watchdog for pid 6698 disabled after 103.1261s

Thanks

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985
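For anyone chasing the same symptom, a quick way to list the threads stuck in D state and see where in the kernel they are blocked is sketched below. This assumes a kernel that exposes /proc/<pid>/stack; on older kernels, `echo t > /proc/sysrq-trigger` dumps all task stacks to dmesg instead.

  # List threads in D (uninterruptible) state, with the kernel
  # function each one is sleeping in (WCHAN).
  ps -eo pid,stat,wchan:30,comm | awk '$2 ~ /^D/'

  # Dump the full kernel stack of one stuck thread; 6748 is the
  # ll_ost_io PID from the ps output above.
  cat /proc/6748/stack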
Re: [Lustre-discuss] OSS load in the roof
On Fri, Jun 27, 2008 at 01:07:32PM -0400, Brian J. Murrell wrote:
> On Fri, 2008-06-27 at 12:44 -0400, Brock Palen wrote:
>> All of them are stuck in un-interruptible sleep. Has anyone seen
>> this happen before? Is this caused by a pending disk failure?
>
> Well, they are certainly stuck because of some blocking I/O. That
> could be disk failure, indeed.
>
>> mptscsi: ioc1: attempting task abort! (sc=010038904c40)
>> scsi1 : destination target 0, lun 0
>>         command = Read (10) 00 75 94 40 00 00 10 00 00
>> mptscsi: ioc1: task abort: SUCCESS (sc=010038904c40)
>
> That does not look like a picture of happiness, indeed, no. You have
> SCSI commands aborting.

Well, these messages are not nice of course, since the mpt error handler got activated, but in principle a SCSI device can recover from that. Unfortunately, the verbosity level of SCSI makes it impossible to figure out what the problem actually was. Since we suffered from severe SCSI problems, I wrote quite a number of patches to improve the situation. We now can at least understand where a problem came from, and we also have slightly improved error handling. These are presently for 2.6.22 only, but my plan is to send them upstream for 2.6.28.

>> Lustre: 6698:0:(lustre_fsfilt.h:306:fsfilt_setattr()) nobackup-OST0001: slow setattr 100s
>> Lustre: 6698:0:(watchdog.c:312:lcw_update_time()) Expired watchdog for pid 6698 disabled after 103.1261s
>
> Those are just fallout from the above disk situation.

Probably the device was offlined, and that should also have been printed in the logs. Brock, can you check the device status (cat /sys/block/sdX/device/state)?

Cheers,
Bernd
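A minimal way to check every disk at once, assuming standard sd device naming; anything other than "running" (most notably "offline") means the SCSI midlayer gave up on the device:

  # Print the SCSI state of each sd device.
  for s in /sys/block/sd*/device/state; do
      printf '%s: %s\n' "$s" "$(cat "$s")"
  done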
Re: [Lustre-discuss] OSS load in the roof
On Jun 27, 2008, at 1:39 PM, Bernd Schubert wrote:
> [...]
> Probably the device was offlined, and that should also have been
> printed in the logs. Brock, can you check the device status
> (cat /sys/block/sdX/device/state)?

IO is still flowing from both OSTs on that OSS.

  [EMAIL PROTECTED] ~]# cat /sys/block/sd*/device/state
  running
  running

Sigh, it only needs to live till August, when we install our x4500's. I think it's safe to send a notice to users that they may want to copy their data.
Re: [Lustre-discuss] OSS load in the roof
On Fri, Jun 27, 2008 at 01:44:13PM -0400, Brock Palen wrote:
> [...]
> IO is still flowing from both OSTs on that OSS.
>
> [EMAIL PROTECTED] ~]# cat /sys/block/sd*/device/state
> running
> running

So the device recovered. Is this parallel SCSI? If so, it might now be running at a lower SCSI speed, but you should have gotten domain validation messages about that (unless you are using a customized driver that has DV disabled).

Cheers,
Bernd
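For reference, on kernels where the SPI transport class is in use (mptspi and friends), the currently negotiated transfer parameters can be read from sysfs. The paths below are an assumption and vary by kernel version, so treat this as a sketch:

  # Show negotiated parallel-SCSI transfer parameters per target.
  # After error recovery, a larger "period" (slower clock) or a
  # "width" of 0 (narrow transfers) would indicate the bus
  # renegotiated down from its original speed.
  for t in /sys/class/spi_transport/target*; do
      echo "== ${t##*/} =="
      for attr in period offset width; do
          [ -r "$t/$attr" ] && printf '  %s: %s\n' "$attr" "$(cat "$t/$attr")"
      done
  done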
Re: [Lustre-discuss] OSS load in the roof
On Jun 27, 2008, at 1:07 PM, Brian J. Murrell wrote:
> On Fri, 2008-06-27 at 12:44 -0400, Brock Palen wrote:
>> All of them are stuck in un-interruptible sleep. Has anyone seen
>> this happen before? Is this caused by a pending disk failure?
>
> Well, they are certainly stuck because of some blocking I/O. That
> could be disk failure, indeed.
>
>> mptscsi: ioc1: attempting task abort! (sc=010038904c40)
>> scsi1 : destination target 0, lun 0
>>         command = Read (10) 00 75 94 40 00 00 10 00 00
>> mptscsi: ioc1: task abort: SUCCESS (sc=010038904c40)
>
> That does not look like a picture of happiness, indeed, no. You have
> SCSI commands aborting.

While the array was reporting no problems, one of the disks was really lagging behind the others. We have swapped it out. Thanks for the feedback, everyone.

>> Lustre: 6698:0:(lustre_fsfilt.h:306:fsfilt_setattr()) nobackup-OST0001: slow setattr 100s
>> Lustre: 6698:0:(watchdog.c:312:lcw_update_time()) Expired watchdog for pid 6698 disabled after 103.1261s
>
> Those are just fallout from the above disk situation.
>
> b.
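A slow-but-not-failed member disk like this usually shows up in extended I/O statistics long before the array controller complains. A minimal check using sysstat's iostat (assuming it is installed):

  # Sample extended per-device statistics every 5 seconds, 3 times.
  # A member disk whose await/%util sits far above its peers' is the
  # classic signature of a drive dying slowly without logging errors.
  iostat -x 5 3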