Re: dealing with a failing drive
On 11/25/07 9:08 PM, Ted Mittelstaedt wrote:

> There are two physical disks in the server. bus 1 target 0 and
> bus 1 target 1. Those ARE the physical disks. If one of them
> has failed instead of:
>
> Sync, Ultra2, Wide - Configured in a logical volume.
>
> you will see something like:
>
> Sync, Ultra2, Wide - Unconfigured
>
> or nothing at all.

Cool, thanks. Your output and mine are virtually identical. Now I get
what you mean by running idacontrol periodically and grokking the output
to verify both disks are still in the array.

> It is normal for idacontrol to generate soft write errors. The
> developer knows about this. There's really no easy way to make
> it not happen. It doesn't hurt anything, however.

OK, good to know.

thanks much!

dn

___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
RE: dealing with a failing drive
Are we looking at the same output? Here's the output of idacontrol show
off one of my DL360 servers:

mail# idacontrol show
cmd_show_all()
[Compaq Integrated Array controller]
Controller uptime: 301 hours 54 minutes 22 seconds
Firmware Version: 1.50 (running) 1.50 (ROM)
Revision - Hardware: 2 Marketing: A
SCSI bus count: 2
Max drives per bus: 16
Maximum request: 65535 blocks

Logical drive 0: 17359MB (35553120 sectors), blocksize=512
Status: Logical drive ok
Mode: Mirroring (RAID1)
Drive ID:
Drive Label:

bus 1 target 0 lun 0: enclosure 0, bay 0, connector 2J
direct-access 17361MB (35556888 512 byte sectors, 1088 reserved)
Sync, Ultra2, Wide - Configured in a logical volume.
bus 1 target 1 lun 0: enclosure 0, bay 1, connector 2J
direct-access 17361MB (35556888 512 byte sectors, 1088 reserved)
Sync, Ultra2, Wide - Configured in a logical volume.
bus 1 target 7 lun 0: enclosure 0, bay 7, connector 2J
non-disk Async
mail#

There are two physical disks in the server. bus 1 target 0 and
bus 1 target 1. Those ARE the physical disks. If one of them
has failed, instead of:

Sync, Ultra2, Wide - Configured in a logical volume.

you will see something like:

Sync, Ultra2, Wide - Unconfigured

or nothing at all.

It is normal for idacontrol to generate soft write errors. The
developer knows about this. There's really no easy way to make
it not happen. It doesn't hurt anything, however.

If the RAID card itself is flaky you can't really tell it from
software. Even the Windows RAID utilities that HP/Compaq supplies
won't tell you this.

The "by the book" way of troubleshooting these servers is: if you get
a disk failure, you immediately swap the disk. Then if the failure
happens again and you're pretty sure it's not the disk, you down the
server, boot it into Compaq Diagnostics and let it run for a day
or so.

It is not uncommon to end up with several additional hard drives that
you don't need in the process of identifying a bad RAID card in a
server. We have all done it, it is part of the territory. If you cannot
afford it, stay away from these servers. Remember these servers are
designed for a medium to large corporation that has a lot of resources.

To give you a typical scenario, a couple of weeks ago one of our
mailservers running on a Proliant 1600R started freezing up. I had the
admin pull the entire disk array and put the disks into our backup
server, which went online in place of the original server, and the
original server was pulled and put on a test bench. About a week later,
after much hair-pulling, running of diagnostics, and so on, the admin
finally discovered the processor board had worked its way almost out of
the socket.

Ted
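A minimal sketch of the physical-disk check Ted describes, assuming each
drive's state is printed on its own line as in the output above. The
function names and the expected drive count are hypothetical; only the
"Configured in a logical volume" text comes from the thread.

```shell
#!/bin/sh
# Count physical drives that idacontrol reports as members of a
# logical volume. Reads `idacontrol show` output on stdin.
count_configured() {
    # grep -c prints the number of matching lines. "Unconfigured" does
    # not match: the pattern is case-sensitive and needs the full phrase.
    # "|| true" hides grep's non-zero exit status when the count is 0.
    grep -c 'Configured in a logical volume' || true
}

# Compare the count against the number of drives you expect
# (2 for the RAID1 pair shown above).
check_drives() {
    expected=$1
    found=$(count_configured)
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found of $expected drives configured"
    else
        echo "ALERT: only $found of $expected drives configured"
        return 1
    fi
}

# Live use (not run here):
#   idacontrol show | check_drives 2
```

A missing drive ("or nothing at all") and an "Unconfigured" drive both
lower the count, so either case trips the alert.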
Re: dealing with a failing drive
On Sun, 25 Nov 2007 08:45:46 Matthew Seaman <[EMAIL PROTECTED]> wrote:

> sysutils/aaccli aaccli-1.0 Adaptec SCSI RAID administration

As I said in my previous post, this is EXACTLY what was wanted.

Installation of aaccli was a snap. My only problem was the total lack
of documentation; no man page, no info file. Capturing the "help"
screens within the CLI was useful, but pretty incomplete.

I found an Adaptec doc describing their cli-sata-scsi-iug program:
http://download.adaptec.com/pdfs/installation_guides/cli-sata-scsi-iug.pdf

This seems to be exactly what aaccli is.

Since I usually do this sort of work outside of X, at the console, I
converted the Adaptec pdf file into a text file using pdftotext. The
ridiculous copyright restrictions on this file prevent me from
producing a man page or an info file for redistribution as part of the
port!

So, if anyone wants either the pdf file or the converted text file, I
would be glad to email same. Just send an email to [EMAIL PROTECTED]
and ask for either my /usr/local/share/cli/cli-sata-scsi-iug.pdf or my
/usr/local/share/cli/cli-sata-scsi-iug.txt.

Bob
Re: dealing with a failing drive
On 11/24/07 12:39 PM, Ted Mittelstaedt wrote:

> The output of idacontrol show will show if one of the
> hard disks in the SmartArray has failed. Your choice with
> a hardware array is to either run it with redundancy or not.
> (ie: raid5 or mirroring or striping) You have to choose
> which is more important for you.
>
> IMHO it is very foolish to stripe an array that you have
> critical data on and assume that you can predict a failure
> of a disk using smart or other monitoring, and replace it
> in advance of a failure. If your concern is redundancy, then
> add more disks to the array and create a raid 5 or a mirror.
> Then ignore all the predictive junk and let the array card
> concern itself with detecting if a drive has failed. Run
> idacontrol periodically out of a script that checks for a
> failure of a disk and e-mails you if there is one.

Thanks, this is good advice, but it doesn't answer the specific
questions I had:

1. How to diagnose the health of a *physical* disk that's part of a
   RAID array (RAID1, in this case) in an old Compaq Proliant server?

2. Is it normal for idacontrol to generate soft write errors?

Backstory here is that Proliant server #1 generated beaucoup hard and
soft read and write errors and eventually locked up. I thought it was
one of the disks but replacing one at a time didn't help. So I took
both disks and put them in identical Proliant server #2. Ergo, I would
conclude server #1's RAID controller flaked out.

idacontrol is useful for telling the health of the logical disk. What
it doesn't tell me (or maybe I just don't see it) is whether the
physical disks are ok, and those "soft write errors" concern me.

I had a failure situation, and need to figure out whether just the
controller was bad or whether I need to replace at least one disk too.

Thanks again!

dn
Re: dealing with a failing drive
On Sun, 25 Nov 2007 08:45:46 Matthew Seaman <[EMAIL PROTECTED]> wrote:

> ... it's a rebadged Adaptec RAID controller using
> the aac

Wonderful; I can now look into and play with the RAID system without
taking the OS off-line and going to the BIOS.

Thanks!

Bob
Re: dealing with a failing drive
Bob Richards wrote:

> I have a similar issue, only it is with a Dell server which has 6 SCSI
> drives in a hardware raid array. The controller is a Dell PERC 2/Si.
>
> Is there an equivalent monitor utility for this as well? I am currently
> running: FreeBSD 6.1-RELEASE-p20 #2.

If that's a rebadged LSI MegaRAID card and uses the amr driver under
FreeBSD, then there are two packages that may be of interest:

sysutils/amrstat  amrstat-20070216  Utility for LSI Logic's MegaRAID RAID controllers
sysutils/megarc   megarc-1.51       LSI Logic's MegaRAID controlling software

On the other hand, if it's a rebadged Adaptec RAID controller using
the aac driver under FreeBSD then you want:

sysutils/aaccli   aaccli-1.0        Adaptec SCSI RAID administration tool

Cheers,

Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.
PGP: http://www.infracaninophile.co.uk/pgpkey
Re: dealing with a failing drive
> Compaq uses several RAID cards most are under the so-called
> "SmartArray" using the ida driver. If this is yours, you can
> use a utility called "idacontrol" that can monitor the array,

Interesting discussion!

I have a similar issue, only it is with a Dell server which has 6 SCSI
drives in a hardware raid array. The controller is a Dell PERC 2/Si.

Is there an equivalent monitor utility for this as well? I am currently
running: FreeBSD 6.1-RELEASE-p20 #2.

TIA

Bob
RE: dealing with a failing drive
The output of idacontrol show will show if one of the
hard disks in the SmartArray has failed. Your choice with
a hardware array is to either run it with redundancy or not.
(ie: raid5 or mirroring or striping) You have to choose
which is more important for you.

IMHO it is very foolish to stripe an array that you have
critical data on and assume that you can predict a failure
of a disk using smart or other monitoring, and replace it
in advance of a failure. If your concern is redundancy, then
add more disks to the array and create a raid 5 or a mirror.
Then ignore all the predictive junk and let the array card
concern itself with detecting if a drive has failed. Run
idacontrol periodically out of a script that checks for a
failure of a disk and e-mails you if there is one.

Ted
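A rough sketch of the periodic check suggested above, keyed on the
"Status: Logical drive ok" line that idacontrol prints. The function
name, the ADMIN address and the script path in the crontab line are all
hypothetical; mail(1) and cron are assumed to be set up as on a stock
FreeBSD install.

```shell
#!/bin/sh
# Hypothetical recipient; override from the environment if needed.
ADMIN=${ADMIN:-root}

# Reads `idacontrol show` output on stdin. Prints nothing when every
# Status line says "Logical drive ok"; otherwise prints the offending
# lines, or a notice if no Status line was found at all.
status_report() {
    awk '
        /Status:/ {
            seen = 1
            if ($0 !~ /Logical drive ok/) bad = bad $0 "\n"
        }
        END {
            if (!seen) print "no Status line found in idacontrol output"
            else if (bad != "") printf "%s", bad
        }'
}

# Live use, e.g. from a script run by cron (not run here):
#   report=$(idacontrol show 2>&1 | status_report)
#   [ -n "$report" ] && echo "$report" | mail -s "RAID alert on $(hostname)" "$ADMIN"
#
# Example crontab entry, every 30 minutes (path is made up):
#   */30 * * * * /usr/local/sbin/ida-check.sh
```

Treating "no Status line" as a failure means a broken or missing
idacontrol also produces mail, rather than silently reporting nothing.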
Re: dealing with a failing drive
On 11/18/07 11:30 PM, Ted Mittelstaedt wrote:

> idacontrol show | grep "Status"
>
> IF status is fully up it will say:
>
> Status: Logical drive ok

And that's what it does say. So far so good...

...but then each time I run idacontrol I get this in /var/log/messages:

Nov 21 17:01:30 mail kernel: ida0: soft error
Nov 21 17:01:36 mail last message repeated 59 times

Does this mean the controller is OK and the disks are dying? Or is it
expected behavior with idacontrol? Or something else?

thanks

dn
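To see how many ida0 messages of each kind are really in the log, the
syslog "last message repeated N times" lines have to be folded back into
the message they follow. A small sketch, assuming the line format shown
above; the function name is made up.

```shell
#!/bin/sh
# Tally ida0 kernel messages from syslog text on stdin, crediting
# "last message repeated N times" to the ida0 line just before it.
tally_ida() {
    awk '
        /kernel: ida0: / {
            msg = $0
            sub(/^.*kernel: ida0: /, "", msg)
            count[msg]++
            last = msg
            next
        }
        /last message repeated [0-9]+ times/ && last != "" {
            n = $0
            sub(/^.*repeated /, "", n)
            sub(/ times.*$/, "", n)
            count[last] += n
            next
        }
        { last = "" }   # any other line breaks the repeat chain
        END { for (m in count) print count[m], m }
    ' | sort -rn
}

# Live use (not run here):
#   tally_ida < /var/log/messages
```

Run against the two lines quoted above, this reports 60 soft errors:
one logged line plus 59 repeats. Comparing the tally before and after a
single idacontrol run is one way to confirm the soft errors come from
the tool itself and not from normal disk traffic.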
Re: Failing Drive
On Nov 16, 2007 5:05 PM, Douglas Rodriguez <[EMAIL PROTECTED]> wrote:

> I've been getting the following message repeating continuously:
>
> ad1:FAILURE - READ_DMA status=51 error=1 LBA=216026367
> g_vfs_done():ad1s1[READ(offset = 110605467648, length = 16384)]error=5
> ad1:FAILURE - READ_DMA status=51 error=40 LBA=216026367
> g_vfs_done():ad1s1[READ(offset = 110605467648, length = 16384)]error=5
> ad1:FAILURE - READ_DMA status=51 error=1 LBA=216026367
> g_vfs_done():ad1s1[READ(offset = 110605467648, length = 16384)]error=5
>
> The same thing repeats every so often. What does this mean? I've read
> other threads (Drives Dieing) about possibly shutting down dma or
> reinstalling the system, but is that the best solution to this kind of
> problem?
>
> Thanks.
>
> ~Doug

One of the first things you can do is install sysutils/smartmontools.
This package gives you the ability to access the S.M.A.R.T.
functionality of your drives. Of course, your drives need to include
S.M.A.R.T. capability and have it enabled. After installing, you can
check whether your drives support it by using the smartctl command.
This is also the command that you will use to run tests and check the
results.

Check out their homepage for more info:
http://smartmontools.sourceforge.net/

Regards

-- 
Chad M. Gross
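As a starting point, a typical smartctl sequence for one drive might
look like the sketch below. The helper only prints the commands; the
device name /dev/ad1 is taken from the log above, and whether the drive
answers all of these is an assumption.

```shell
#!/bin/sh
# Print a basic smartctl check sequence for one device.
# Nothing is executed; review the output, then run it as root.
smart_plan() {
    dev=$1
    echo "smartctl -i $dev"           # identity; shows if SMART is supported and enabled
    echo "smartctl -s on $dev"        # turn SMART on if it is off
    echo "smartctl -H $dev"           # overall health self-assessment
    echo "smartctl -t short $dev"     # start a short self-test
    echo "smartctl -l selftest $dev"  # read the self-test log once the test is done
}

smart_plan /dev/ad1
```

The short test runs in the background on the drive, so the last command
only shows a result after the test has had a few minutes to finish.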
Re: dealing with a failing drive
On 11/18/07 11:30 PM, Ted Mittelstaedt wrote:

> Hi David, apologies to Jerry for jumping in.
>
> Compaq uses several RAID cards most are under the so-called
> "SmartArray" using the ida driver. If this is yours, you can
> use a utility called "idacontrol" that can monitor the array,

Hi Ted,

Thanks much for this info. I'm pleased to report that idacontrol thinks
the logical array is in good shape. (This is on an identical server; I
moved both disks from a RAID1 array there after the first server
started reporting write and read errors.)

> NOTE:
>
> The smart utility only works on SATA or ATA/IDE drives, not SCSI.

Yes. I've heard it said that "SMART isn't."

This Proliant DL320 server uses a SmartArray controller and SCSI disks.
SMART or not, is there a way of monitoring the health of the physical
disks from within FreeBSD?

thanks again!

dn
RE: dealing with a failing drive
> -Original Message-
> From: [EMAIL PROTECTED] On Behalf Of Jerry McAllister
> Sent: Monday, November 12, 2007 8:04 AM
> To: David Newman
> Cc: freebsd-questions@freebsd.org
> Subject: Re: dealing with a failing drive
>
> On Sat, Nov 10, 2007 at 05:22:06PM -0800, David Newman wrote:
> > I vaguely remember trying about a year ago to load a SMART utility from
> > the ports collection but it wouldn't work on drives in a RAID array.
> >
> > Is there some other way to:
> >
> > a) diagnose/fix the errant disk here?
> > b) monitor the health of disks on a Compaq controller so it doesn't get
> > to this point to begin with?

Hi David, apologies to Jerry for jumping in.

Compaq uses several RAID cards; most are under the so-called
"SmartArray" using the ida driver. If this is yours, you can
use a utility called "idacontrol" that can monitor the array.
Here are the instructions for using it. You will need the usr.sbin
sources installed:

Install idacontrol

cd /usr/ports
mkdir distfiles
cd /usr/ports/distfiles
mkdir manual-build
cd manual-build
fetch ftp://ftp.jurai.net/users/winter/idacontrol.tar
cd /usr/src
tar xf /usr/ports/distfiles/manual-build/idacontrol.tar
cd /usr/src/usr.sbin/idacontrol
vi makefile
(change variable NOMAN to NO_MAN)
make obj && make depend && make && make install
cd
idacontrol show | grep "Status"

IF status is fully up it will say:

Status: Logical drive ok

IF status is degraded it will say 1 of several other error messages.

More on PR i386/70482 and on thread:

http://lists.freebsd.org/pipermail/freebsd-scsi/2005-September/002009.html

NOTE:

The smart utility only works on SATA or ATA/IDE drives, not SCSI.

Ted
Re: Failing Drive
Douglas Rodriguez wrote:

> I've been getting the following message repeating continuously:
>
> ad1:FAILURE - READ_DMA status=51 error=1 LBA=216026367
> g_vfs_done():ad1s1[READ(offset = 110605467648, length = 16384)]error=5
> ad1:FAILURE - READ_DMA status=51 error=40 LBA=216026367
> g_vfs_done():ad1s1[READ(offset = 110605467648, length = 16384)]error=5
> ad1:FAILURE - READ_DMA status=51 error=1 LBA=216026367
> g_vfs_done():ad1s1[READ(offset = 110605467648, length = 16384)]error=5
>
> The same thing repeats every so often. What does this mean? I've read
> other threads (Drives Dieing) about possibly shutting down dma or
> reinstalling the system, but is that the best solution to this kind of
> problem?

Backup, backup, backup ;-)

You'll need a Real Expert(tm) to help on the ILLEGAL_LENGTH error, but
I've seen UNCORRECTABLE plenty. Keep in mind that it may cost some time
and energy to find out; apart from a bad disk, it could be a bad disk
*controller*. I bought two new HDDs recently because of similar
problems, but all of them are now working fine on a new
motherboard :-/

Sorry no help here :-/

Kevin Kinsey
-- 
Recursion: n. See Recursion.
-- Random Shack Data Processing Dictionary
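For reference, the status= and error= numbers in those log lines are hex
bit masks of the ATA status and error registers. The sketch below decodes
them; the bit positions follow the ATA register layout, while the helper
name and most label spellings are my own (ILLEGAL_LENGTH and
UNCORRECTABLE are the two named above).

```shell
#!/bin/sh
# Decode a hex register value against a list of "mask:NAME" pairs.
# Usage: decode_bits HEXVALUE mask:NAME ...
decode_bits() {
    value=$((0x$1)); shift
    out=""
    for pair in "$@"; do
        mask=${pair%%:*}
        name=${pair#*:}
        if [ $((value & $mask)) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    echo "$out"
}

# ATA status register bits, high to low.
STATUS_BITS="0x80:BUSY 0x40:READY 0x20:DEVICE_FAULT 0x10:DSC 0x08:DRQ 0x04:CORRECTED 0x02:INDEX 0x01:ERROR"
# ATA error register bits, high to low.
ERROR_BITS="0x80:ICRC 0x40:UNCORRECTABLE 0x20:MEDIA_CHANGED 0x10:ID_NOT_FOUND 0x08:MEDIA_CHANGE_REQUEST 0x04:ABORTED 0x02:NO_MEDIA 0x01:ILLEGAL_LENGTH"

# The lists are left unquoted on purpose so the shell splits them.
decode_bits 51 $STATUS_BITS   # prints READY,DSC,ERROR
decode_bits 40 $ERROR_BITS    # prints UNCORRECTABLE
decode_bits 1  $ERROR_BITS    # prints ILLEGAL_LENGTH
```

So status=51 only says "the drive is ready and flagged an error"; the
useful part is the error register, where 40 means the drive could not
read the sector even after its own retries.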
Failing Drive
I've been getting the following message repeating continuously:

ad1:FAILURE - READ_DMA status=51 error=1 LBA=216026367
g_vfs_done():ad1s1[READ(offset = 110605467648, length = 16384)]error=5
ad1:FAILURE - READ_DMA status=51 error=40 LBA=216026367
g_vfs_done():ad1s1[READ(offset = 110605467648, length = 16384)]error=5
ad1:FAILURE - READ_DMA status=51 error=1 LBA=216026367
g_vfs_done():ad1s1[READ(offset = 110605467648, length = 16384)]error=5

The same thing repeats every so often. What does this mean? I've read
other threads (Drives Dieing) about possibly shutting down dma or
reinstalling the system, but is that the best solution to this kind of
problem?

Thanks.

~Doug
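One thing the numbers themselves show: every failure is the same spot on
the disk. A quick arithmetic check, assuming 512-byte sectors (the
variable names are mine; the values are from the log above):

```shell
#!/bin/sh
offset=110605467648   # byte offset within ad1s1, from g_vfs_done()
length=16384          # bytes in the failed read request
lba=216026367         # absolute sector reported by the ata driver
sector=512            # assumed sector size in bytes

# First sector of the request, counted from the start of the slice.
rel=$((offset / sector))

echo "request starts at slice sector $rel and spans $((length / sector)) sectors"
echo "implied slice start: sector $((lba - rel))"
```

This prints a slice-relative start of 216026304, a span of 32 sectors,
and an implied slice start of sector 63. Sector 63 is the conventional
start of the first slice on an MBR-partitioned disk, so the reported LBA
looks like the very first sector of the failed request. That reading is
an inference from the arithmetic, not something the log states. The
values need a shell with 64-bit arithmetic.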
Re: dealing with a failing drive
From: "Jerry McAllister" <[EMAIL PROTECTED]>
Sent: Monday, November 12, 2007 12:53

> On Mon, Nov 12, 2007 at 09:26:38AM -0800, David Newman wrote:
> > On 11/12/07 8:14 AM, Jerry McAllister wrote:
> >
> > > An update: After doing what you suggest (leaving in the "good" disk,
> > > adding a new disk, RAID rebuilding) I still got soft write errors --
> > > with *either one* of the disks I tried.
> > >
> > > Then I tried putting both disks in an identical server and they came up
> > > fine, no read or write errors.
> > >
> > > Ergo, the bad RAID controller is bad and the disks may be OK.
> > >
> > >> Probably not.
> > >> Generally, if the RAID controller is bad, you will see errors
> > >> all over and not in just one place, tho I suppose it is possible.
> > >> Check and see what it reports as error locations and see if they
> > >> move around any.
> >
> > Jerry, thanks for your response.
> >
> > After 36 hours of running the same disks in a different, identical
> > machine there hasn't been a single read or write error. I'm hardly a
> > storage expert but from the evidence I have I'm inclined to believe the
> > root cause was a bad RAID controller and not failed disks.
>
> That is not much proof. The different machine would probably
> be accessing the disks in a different way, either slightly different
> positioning or using different space. Also, 36 hours is not really
> much time.

Dn, I have had a Promise controller that was bad. I kept getting errors
at one specific location on two disks out of three on a RAID 5. The
system continued to operate. When I finally spent the time to nail it
down to the controller I found the Promise people more than anxious to
get the beast for a postmortem.

It had been bad for me from day one. It would take about a week to a
month for the problem to appear. After the 6th disk showing the problem
at the same block number the coin dropped in my sometimes overly slow
mind.

{^_-} Joanne
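The pattern Joanne describes, the same block number failing on more than
one disk, can be looked for mechanically. A rough sketch, assuming log
lines in the ata style seen elsewhere on the list ("adN: FAILURE ...
LBA=n"); other controllers log differently (the ida0 messages in this
thread carry no block number at all), and the function name is made up.

```shell
#!/bin/sh
# Read kernel log text on stdin and print every LBA that failed on
# more than one device, followed by the number of devices affected.
shared_lbas() {
    awk '
        /FAILURE/ && /LBA=/ && match($0, /ad[0-9]+:/) {
            dev = substr($0, RSTART, RLENGTH - 1)
            lba = $0
            sub(/^.*LBA=/, "", lba)
            sub(/[^0-9].*$/, "", lba)
            if (!((dev, lba) in seen)) {
                seen[dev, lba] = 1
                ndev[lba]++
            }
        }
        END { for (l in ndev) if (ndev[l] > 1) print l, ndev[l] }
    '
}

# Live use (not run here):
#   shared_lbas < /var/log/messages
```

An LBA that shows up on two or more different disks points away from
the disks and toward whatever they share: controller, cable, or power.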
Re: dealing with a failing drive
From: "David Newman" <[EMAIL PROTECTED]>

> On 11/12/07 8:14 AM, Jerry McAllister wrote:
>
> > An update: After doing what you suggest (leaving in the "good" disk,
> > adding a new disk, RAID rebuilding) I still got soft write errors --
> > with *either one* of the disks I tried.
> >
> > Then I tried putting both disks in an identical server and they came up
> > fine, no read or write errors.
> >
> > Ergo, the bad RAID controller is bad and the disks may be OK.
> >
> >> Probably not.
> >> Generally, if the RAID controller is bad, you will see errors
> >> all over and not in just one place, tho I suppose it is possible.
> >> Check and see what it reports as error locations and see if they
> >> move around any.
>
> Jerry, thanks for your response.
>
> After 36 hours of running the same disks in a different, identical
> machine there hasn't been a single read or write error. I'm hardly a
> storage expert but from the evidence I have I'm inclined to believe the
> root cause was a bad RAID controller and not failed disks.
>
> I'm aware of CLI tools to monitor 3Ware SATA RAID controllers. Anyone
> know if there are similar tools for HP/Compaq SCSI RAID controllers?

Bad cable? Iffy power supply? Examine each step the data and power take
for possible hitches. You might even have an overheated and weakened
power connector on a drive. If it's not making solid contact it can
give you headaches.

{^_^}
Re: dealing with a failing drive
On Mon, Nov 12, 2007 at 09:26:38AM -0800, David Newman wrote:

> On 11/12/07 8:14 AM, Jerry McAllister wrote:
>
> > An update: After doing what you suggest (leaving in the "good" disk,
> > adding a new disk, RAID rebuilding) I still got soft write errors --
> > with *either one* of the disks I tried.
> >
> > Then I tried putting both disks in an identical server and they came up
> > fine, no read or write errors.
> >
> > Ergo, the bad RAID controller is bad and the disks may be OK.
> >
> >> Probably not.
> >> Generally, if the RAID controller is bad, you will see errors
> >> all over and not in just one place, tho I suppose it is possible.
> >> Check and see what it reports as error locations and see if they
> >> move around any.
>
> Jerry, thanks for your response.
>
> After 36 hours of running the same disks in a different, identical
> machine there hasn't been a single read or write error. I'm hardly a
> storage expert but from the evidence I have I'm inclined to believe the
> root cause was a bad RAID controller and not failed disks.

That is not much proof. The different machine would probably
be accessing the disks in a different way, either slightly different
positioning or using different space. Also, 36 hours is not really
much time.

It could be you are right, but disks have a way of starting small
in errors and then avalanching on you with an accelerating volume of
errors just when you begin to feel safe.

You could be right, but is the price of a disk worth it - or the price
of a new RAID controller, for that matter? Replace them both.

jerry
Re: dealing with a failing drive
On 11/12/07 8:14 AM, Jerry McAllister wrote:

> An update: After doing what you suggest (leaving in the "good" disk,
> adding a new disk, RAID rebuilding) I still got soft write errors --
> with *either one* of the disks I tried.
>
> Then I tried putting both disks in an identical server and they came up
> fine, no read or write errors.
>
> Ergo, the bad RAID controller is bad and the disks may be OK.
>
>> Probably not.
>> Generally, if the RAID controller is bad, you will see errors
>> all over and not in just one place, tho I suppose it is possible.
>> Check and see what it reports as error locations and see if they
>> move around any.

Jerry, thanks for your response.

After 36 hours of running the same disks in a different, identical
machine there hasn't been a single read or write error. I'm hardly a
storage expert but from the evidence I have I'm inclined to believe the
root cause was a bad RAID controller and not failed disks.

I'm aware of CLI tools to monitor 3Ware SATA RAID controllers. Anyone
know if there are similar tools for HP/Compaq SCSI RAID controllers?

thanks

dn
Re: dealing with a failing drive
On Sun, Nov 11, 2007 at 07:56:52AM -0800, David Newman wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > On 11/10/07 9:09 PM, Modulok wrote: > >>> I'd welcome suggestions on how (or whether) to try to revive a SCSI > > drive that's failing. > > > > It depends on how valuable the data on the array is, and more > > importantly, how much funding you have at your disposal to fix the > > problem. If it were me, I would set aside the bad disk, connect a new > > disk to the card and re-synchronize the array. (Assuming one of the > > members still retains a good copy of the data.) Afterwards I would > > destroy, or toss the existing disk in the trash can (depending on the > > sensitivity of the data stored on it.) > > Thanks for your reply. > > An update: After doing what you suggest (leaving in the "good" disk, > adding a new disk, RAID rebuilding) I still got soft write errors -- > with *either one* of the disks I tried. > > Then I tried putting both disks in an identical server and they came up > fine, no read or write errors. > > Ergo, the bad RAID controller is bad and the disks may be OK. Probably not. Generally, if the RAID controller is bad, you will see errors all over and not it just one place, tho I suppose it is possible. Check and see what it reports as error locations and see if they move around any. A soft error is usually one that can be corrected within the limits of rereads and any error correction that the system is using. It may be that the error was introduced when the problems with the old disk was occuring so that there was an error written on to the other supposedly good disk and then mirrored to the new disk - errors can be preserved by mirroring too. Having said that, I don't know where this error is from. Try reading up and rewriting the data that is in the spot getting the error and then reading it from the new location. 
It is pretty hard to figure out and specifically rewrite one certain block on modern systems because the physical locations are virtual. Although you would expect the same sector number to be in the same place from one write to the next, if it happens that that sector gets remapped due to an error, then it will actually be a different physical location the next time and you don't really prove anything. But, it is worth experimenting with if you want. You can dd from and to any sector on the partition by carefully using skip counts and block counts. But, you have to figure out the location (sector number) first. Good luck, jerry > > >>> Is there some other way to: > >>> b)monitor the health of disks on a Compaq controller so it doesn't > > get to this point to begin with? > > > > There are various tools out there that attempt to 'monitor' the > > condition of disk drives to try and predict when failure is eminent. > > For valuable data, it is safer to setup a mirror and simply toss out > > bad disks as they fail. For extremely valuable data use a 3 disk > > array. With a 3 disk setup you will still be covered in the event that > > an additional disk craps out during the re-sync. > > > > To quote google's article on disk failure, regarding SMART: > > Right, I've heard it said that "SMART isn't." > > Nonetheless, I'd appreciate any suggestions to monitor the health of > disks -- and RAID controllers too -- on HP Proliant servers running FreeBSD. > > thanks again. 
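For what it's worth, the dd experiment Jerry describes in the message above could be sketched roughly like this. It is only an illustration under assumptions: the function names are made up here, the 512-byte sector size should be checked against what the controller reports, and the device name and sector number in the usage comment are placeholders, not values from this thread.

```shell
#!/bin/sh
# Sketch only: copy a single sector out of a device (or image file) and
# write it back to the same offset. Function names are hypothetical.

SECTOR_SIZE=512   # assumption: 512-byte sectors

# read_sector SRC SECTOR OUT
# Copy the one sector at index SECTOR (0-based) from SRC into file OUT.
read_sector() {
    dd if="$1" of="$3" bs="$SECTOR_SIZE" skip="$2" count=1 2>/dev/null
}

# write_sector IN DST SECTOR
# Write the contents of file IN over sector SECTOR of DST.
# conv=notrunc stops dd from truncating DST when it is a regular file.
write_sector() {
    dd if="$1" of="$2" bs="$SECTOR_SIZE" seek="$3" count=1 conv=notrunc 2>/dev/null
}

# Example (placeholders; try it on a scratch image before a real disk):
#   read_sector  /dev/idad0s1a 123456 /tmp/sector.bin
#   write_sector /tmp/sector.bin /dev/idad0s1a 123456
```

As Jerry points out, a remapped sector may land somewhere else physically, so a clean read after the rewrite does not prove much on its own.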
Re: dealing with a failing drive
On Sat, Nov 10, 2007 at 05:22:06PM -0800, David Newman wrote:

> I'd welcome suggestions on how (or whether) to try to revive a SCSI
> drive that's failing.

To answer 'whether': don't. Get your stuff off of it as soon as
possible, and nuke it if it has anything sensitive on it at all. If it
is in a mirror or RAID5 then you should be able to just replace it, but
otherwise, back it up immediately and quit using it.

Generally, if you start seeing a regular hard error, the drive is on
its last legs. The errors only increase. You may be able to do things
to get past this one error, but more will be coming.

So, the answer to 'how' is also: don't.

jerry

> This is on FreeBSD 6.2-RELENG on a Compaq Proliant DL320, onboard RAID
> and two SCSI drives in a RAID1 array.
>
> Today this system rebooted and hung on Compaq's "what do you want the
> RAID controller to do?" message. I told it to fix any errors.
>
> When I brought the system back up (after running fsck in single-user
> mode), the log had lots of errors like this:
>
> Nov 10 09:00:40 mail kernel: ida0: hard write error
> Nov 10 09:00:40 mail kernel: ida0: invalid request
> Nov 10 09:01:48 mail last message repeated 35 times
> Nov 10 09:03:49 mail last message repeated 571 times
> Nov 10 09:12:27 mail last message repeated 796 times
>
> I vaguely remember trying about a year ago to load a SMART utility from
> the ports collection but it wouldn't work on drives in a RAID array.
>
> Is there some other way to:
>
> a) diagnose/fix the errant disk here?
> b) monitor the health of disks on a Compaq controller so it doesn't get
> to this point to begin with?
Re: dealing with a failing drive
On 11/10/07 9:09 PM, Modulok wrote:
>>> I'd welcome suggestions on how (or whether) to try to revive a SCSI
>>> drive that's failing.
>
> It depends on how valuable the data on the array is, and more
> importantly, how much funding you have at your disposal to fix the
> problem. If it were me, I would set aside the bad disk, connect a new
> disk to the card and re-synchronize the array. (Assuming one of the
> members still retains a good copy of the data.) Afterwards I would
> destroy, or toss the existing disk in the trash can (depending on the
> sensitivity of the data stored on it.)

Thanks for your reply.

An update: After doing what you suggest (leaving in the "good" disk,
adding a new disk, RAID rebuilding) I still got soft write errors --
with *either one* of the disks I tried.

Then I tried putting both disks in an identical server and they came up
fine, no read or write errors.

Ergo, the RAID controller is bad and the disks may be OK.

>>> Is there some other way to:
>>> b) monitor the health of disks on a Compaq controller so it doesn't
>>> get to this point to begin with?
>
> There are various tools out there that attempt to 'monitor' the
> condition of disk drives to try and predict when failure is imminent.
> For valuable data, it is safer to set up a mirror and simply toss out
> bad disks as they fail. For extremely valuable data use a 3 disk
> array. With a 3 disk setup you will still be covered in the event that
> an additional disk craps out during the re-sync.
>
> To quote google's article on disk failure, regarding SMART:

Right, I've heard it said that "SMART isn't."

Nonetheless, I'd appreciate any suggestions to monitor the health of
disks -- and RAID controllers too -- on HP Proliant servers running FreeBSD.

thanks again.
dn
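One way to act on the monitoring question, following Ted's advice elsewhere in this thread to run idacontrol periodically and check that both disks are still in the array, might look like the sketch below. It assumes the output of "idacontrol show" has the form posted in that message; the function names, the expected drive count of 2 and the cron/mail usage are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: count the physical drives that idacontrol reports as
# members of a logical volume. Reads "idacontrol show" output on stdin.

# count_configured: print how many lines carry the
# "Configured in a logical volume" status string.
count_configured() {
    # grep -c exits non-zero when the count is 0; "|| true" hides that
    grep -c 'Configured in a logical volume' || true
}

# check_array EXPECTED: succeed only if exactly EXPECTED drives are configured.
check_array() {
    n=$(count_configured)
    [ "$n" -eq "$1" ]
}

# Hypothetical use for a two-disk mirror, e.g. from cron:
#   idacontrol show | check_array 2 || echo "RAID member missing" | mail -s "ida alert" root
```

This only detects a drive that has dropped out of the array; it says nothing about the health of the controller itself.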
Re: dealing with a failing drive
>> I'd welcome suggestions on how (or whether) to try to revive a SCSI
>> drive that's failing.

It depends on how valuable the data on the array is, and more
importantly, how much funding you have at your disposal to fix the
problem. If it were me, I would set aside the bad disk, connect a new
disk to the card and re-synchronize the array. (Assuming one of the
members still retains a good copy of the data.) Afterwards I would
destroy, or toss the existing disk in the trash can (depending on the
sensitivity of the data stored on it.)

>> Is there some other way to:
>> b) monitor the health of disks on a Compaq controller so it doesn't
>> get to this point to begin with?

There are various tools out there that attempt to 'monitor' the
condition of disk drives to try and predict when failure is imminent.
For valuable data, it is safer to set up a mirror and simply toss out
bad disks as they fail. For extremely valuable data use a 3 disk
array. With a 3 disk setup you will still be covered in the event that
an additional disk craps out during the re-sync.

To quote Google's article on disk failure, regarding SMART:

"...we find that failure prediction models based on SMART parameters
alone are likely to be severely limited in the prediction accuracy,
given that a large fraction of our failed drives have shown no SMART
error signals whatsoever. This result suggests that SMART models are
more useful in predicting trends for large aggregate populations than
for individual components."

http://labs.google.com/papers/disk_failures.pdf

My 2 cents.

-Modulok-

On 11/10/07, David Newman <[EMAIL PROTECTED]> wrote:
> I'd welcome suggestions on how (or whether) to try to revive a SCSI
> drive that's failing.
>
> This is on FreeBSD 6.2-RELENG on a Compaq Proliant DL320, onboard RAID
> and two SCSI drives in a RAID1 array.
>
> Today this system rebooted and hung on Compaq's "what do you want the
> RAID controller to do?" message. I told it to fix any errors.
>
> When I brought the system back up (after running fsck in single-user
> mode), the log had lots of errors like this:
>
> Nov 10 09:00:40 mail kernel: ida0: hard write error
> Nov 10 09:00:40 mail kernel: ida0: invalid request
> Nov 10 09:01:48 mail last message repeated 35 times
> Nov 10 09:03:49 mail last message repeated 571 times
> Nov 10 09:12:27 mail last message repeated 796 times
>
> I vaguely remember trying about a year ago to load a SMART utility from
> the ports collection but it wouldn't work on drives in a RAID array.
>
> Is there some other way to:
>
> a) diagnose/fix the errant disk here?
> b) monitor the health of disks on a Compaq controller so it doesn't get
> to this point to begin with?
>
> thanks in advance
>
> dn
dealing with a failing drive
I'd welcome suggestions on how (or whether) to try to revive a SCSI
drive that's failing.

This is on FreeBSD 6.2-RELENG on a Compaq Proliant DL320, onboard RAID
and two SCSI drives in a RAID1 array.

Today this system rebooted and hung on Compaq's "what do you want the
RAID controller to do?" message. I told it to fix any errors.

When I brought the system back up (after running fsck in single-user
mode), the log had lots of errors like this:

Nov 10 09:00:40 mail kernel: ida0: hard write error
Nov 10 09:00:40 mail kernel: ida0: invalid request
Nov 10 09:01:48 mail last message repeated 35 times
Nov 10 09:03:49 mail last message repeated 571 times
Nov 10 09:12:27 mail last message repeated 796 times

I vaguely remember trying about a year ago to load a SMART utility from
the ports collection but it wouldn't work on drives in a RAID array.

Is there some other way to:

a) diagnose/fix the errant disk here?
b) monitor the health of disks on a Compaq controller so it doesn't get
to this point to begin with?

thanks in advance

dn
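To get a feel for how many ida0 errors a log like the one above really holds, the "last message repeated N times" lines have to be folded back into the message they follow. A rough sketch, assuming the syslog line format shown in the message above (the function name is made up here):

```shell
#!/bin/sh
# Sketch only: total up ida0 kernel messages in syslog output, expanding
# syslogd's "last message repeated N times" lines. Reads the log on stdin.

count_ida_errors() {
    awk '
        # A real ida0 message: keep its text and count it once.
        /kernel: ida0: / {
            msg = $0
            sub(/^.*kernel: ida0: /, "", msg)
            count[msg]++
            last = msg
            next
        }
        # A repeat marker adds N to the most recent ida0 message.
        /last message repeated [0-9]+ times/ && last != "" {
            for (i = 1; i < NF; i++)
                if ($i == "repeated") count[last] += $(i + 1)
            next
        }
        # Any other line ends the run of repeats.
        { last = "" }
        END { for (m in count) printf "%d %s\n", count[m], m }
    ' | sort -rn
}

# Usage (log path is the usual FreeBSD default, adjust as needed):
#   count_ida_errors < /var/log/messages
```

Fed the five log lines quoted above, this reports 1403 for "invalid request" and 1 for "hard write error".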
RE: replacing failing drive
Hi Dave,

You could prepare the replacement drive offline and test it first.
Provided you have a generic kernel, you can do this on any piece of
hardware you have lying around. By the way, there is no need to install
anything. Check out a previous answer I wrote; it's for changing RAID
levels, but the concept is pretty much the same:

http://lists.freebsd.org/pipermail/freebsd-questions/2005-July/092529.html

Good luck,

Ruben

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Dave
Sent: April 11, 2007 7:02 PM
To: freebsd-questions@freebsd.org
Subject: replacing failing drive

Hello,

I've got a drive that I'm not sure is failing. It is making an
occasional clicking noise, which is getting more frequent. I installed
smartmontools and tried to start them, output below:

#smartctl -a /dev/ad0
smartctl version 5.37 [i386-portbld-freebsd6.1] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

#/usr/local/etc/rc.d/smartd start
Starting smartd.
(pass0:vpo0:0:5:0): INQUIRY. CDB: 12 0 0 0 24 0
(pass0:vpo0:0:5:0): CAM Status: Command timeout
(pass0:vpo0:0:5:0): INQUIRY. CDB: 12 0 0 0 40 0
(pass0:vpo0:0:5:0): CAM Status: Command timeout
(pass0:vpo0:0:5:0): Vendor Specific Command. CDB: 85 8 e 0 0 0 1 0 0 0 0 0 0 0 ec 0
(pass0:vpo0:0:5:0): CAM Status: Command timeout
(pass0:vpo0:0:5:0): Vendor Specific Command. CDB: 85 8 e 0 0 0 1 0 0 0 0 0 0 0 a1 0
(pass0:vpo0:0:5:0): CAM Status: Command timeout

Does this mean this drive is failing? I've got another identical drive
that I run smartd on, and it doesn't have any issues picking up its
SMART id or running tests on it. This is on a 6.2 box.

If this drive is failing I'd like to drop in another one with minimum
downtime. Could someone check my procedure:

1. Install new drive as slave
2. Use sysinstall to partition the new drive (I only use a single partition)
3. Use sysinstall to create bsd labels and give them the same values as the master drive
4. Use sysinstall to install the boot manager on slave drive
5. Use dump/restore to copy all data on to the slave drive.
6. Power down the box, remove old master drive, set new drive to master, and reboot

Thanks.
Dave.
replacing failing drive
Hello,

I've got a drive that I'm not sure is failing. It is making an
occasional clicking noise, which is getting more frequent. I installed
smartmontools and tried to start them, output below:

#smartctl -a /dev/ad0
smartctl version 5.37 [i386-portbld-freebsd6.1] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

#/usr/local/etc/rc.d/smartd start
Starting smartd.
(pass0:vpo0:0:5:0): INQUIRY. CDB: 12 0 0 0 24 0
(pass0:vpo0:0:5:0): CAM Status: Command timeout
(pass0:vpo0:0:5:0): INQUIRY. CDB: 12 0 0 0 40 0
(pass0:vpo0:0:5:0): CAM Status: Command timeout
(pass0:vpo0:0:5:0): Vendor Specific Command. CDB: 85 8 e 0 0 0 1 0 0 0 0 0 0 0 ec 0
(pass0:vpo0:0:5:0): CAM Status: Command timeout
(pass0:vpo0:0:5:0): Vendor Specific Command. CDB: 85 8 e 0 0 0 1 0 0 0 0 0 0 0 a1 0
(pass0:vpo0:0:5:0): CAM Status: Command timeout

Does this mean this drive is failing? I've got another identical drive
that I run smartd on, and it doesn't have any issues picking up its
SMART id or running tests on it. This is on a 6.2 box.

If this drive is failing I'd like to drop in another one with minimum
downtime. Could someone check my procedure:

1. Install new drive as slave
2. Use sysinstall to partition the new drive (I only use a single partition)
3. Use sysinstall to create bsd labels and give them the same values as the master drive
4. Use sysinstall to install the boot manager on slave drive
5. Use dump/restore to copy all data on to the slave drive.
6. Power down the box, remove old master drive, set new drive to master, and reboot

Thanks.
Dave.
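Step 5 of the procedure above (dump/restore onto the new drive) is commonly done as a pipe from dump(8) into restore(8). The sketch below is an assumption about how that step could look, not something taken from this thread: the function name, the DRYRUN switch and the /mnt mount point are illustrative, and the new filesystem is assumed to be created and mounted already.

```shell
#!/bin/sh
# Sketch only: copy a live filesystem onto the new drive with dump(8)
# piped into restore(8). Mount points are placeholders.
# Set DRYRUN=1 to print the pipeline instead of running it.

# copy_fs SRC_MOUNT DST_MOUNT
copy_fs() {
    if [ "${DRYRUN:-0}" = 1 ]; then
        echo "dump -0aLf - $1 | (cd $2 && restore -rf -)"
        return 0
    fi
    # -0: full (level 0) dump, -a: skip tape-length calculations,
    # -L: take a snapshot of a live UFS filesystem, -f -: write to stdout.
    # restore -r rebuilds the dumped tree in the current directory.
    dump -0aLf - "$1" | (cd "$2" && restore -rf -)
}

# Example, with the new drive's filesystem mounted on /mnt:
#   DRYRUN=1 copy_fs / /mnt    # show what would run
#   copy_fs / /mnt             # actually copy
```

With a single partition, as described above, one call covers the whole system; with several filesystems it would be one call per mount point.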
Re: retrieving data from a failing drive
> Is there any way I can force the FS to be marked clean, or to mount a
> dirty filesystem (possibly in read-only mode)?

Read the mount(8) man page, specifically the `-f' and `-r' options.

I hope you have backups!

- Mike

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-questions" in the body of the message
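Mike's pointer to mount(8) could be turned into something like the sketch below: mount the damaged filesystem read-only, then copy off whatever is still readable. As I read mount(8), -r alone is enough to mount a dirty filesystem read-only and -f is what forces a read-write mount, but check the man page for your release. The function names, device and directories are placeholders.

```shell
#!/bin/sh
# Sketch only: mount a damaged filesystem read-only and copy its
# contents elsewhere. Device and directory names are placeholders.

# mount_readonly DEV DIR: mount DEV on DIR without write access.
mount_readonly() {
    mount -r "$1" "$2"
}

# copy_tree SRC DST: copy the tree under SRC into DST through a tar
# pipe, preserving permissions where possible.
copy_tree() {
    (cd "$1" && tar cf - .) | (cd "$2" && tar xpf -)
}

# Example (placeholders):
#   mount_readonly /dev/ad0s1f /mnt/rescue
#   copy_tree /mnt/rescue /backup/usr
```

Files that sit on the bad sector will still fail to read; the point is only to get everything else off before the drive gets worse.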
retrieving data from a failing drive
Hi,

I have a 20G /usr partition (IDE drive) that is reporting hard errors
at a certain sector. I've run fsck -y many times, and each time it hits
the bad sector, it falls back from DMA to PIO mode, and finally exits,
saying "The filesystem is still marked dirty, please run fsck again."

I'm looking around for a new drive that I can restore to, but I don't
know how to read data off a dirty filesystem.

Is there any way I can force the FS to be marked clean, or to mount a
dirty filesystem (possibly in read-only mode)?

OTOH, would something like NetBSD's g4u (ghost for unix) help me out
here?

TIA

Mark Miller