RAID 1 / disk error / Offline uncorrectable sectors
Hello,

I'd like to ask your advice. We have RAID 1 / SATA turned on in BIOS. A couple of days ago smartd let me know about a disk problem.

Jun 14 01:13:38 relay kernel: ad12: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=374468863
Jun 14 01:13:38 relay kernel: ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode
Jun 14 01:14:19 relay kernel: ad12: WARNING - WRITE_DMA taskqueue timeout - completing request directly
Jun 14 01:14:19 relay kernel: ad12: WARNING - WRITE_DMA48 freeing taskqueue zombie request
Jun 14 01:37:38 relay smartd[683]: Device: /dev/ad12, 1 Currently unreadable (pending) sectors
Jun 14 01:37:38 relay smartd[683]: Device: /dev/ad12, 1 Offline uncorrectable sectors

If I do smartctl -a /dev/ad12 I get

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1

My understanding is that RAID 1 no longer works because of this error. There is a bad sector on HD (Offline uncorrectable sectors) and the best we can do is replace the drive? Does it make sense to try to turn RAID 1 on ignoring this error (however, this is done in BIOS so the machine would have to be taken down in order to do that)? It seems serious enough for me not to ignore it but then I know close to nothing about HDs.

Many thanks for your suggestions!

Zbigniew Szalbot
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]
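[Editor's sketch] The two SMART counters quoted above can also be pulled out of the report programmatically. The following is a minimal sketch, assuming the attribute rows keep the usual ten-column smartctl layout (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value). The function name parse_smart_attributes and the SAMPLE text are illustrative only and are not part of smartmontools.

```python
def parse_smart_attributes(report):
    """Return {attribute_id: {"name": ..., "raw": ...}} for each attribute row."""
    attributes = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]
        attributes[int(fields[0])] = {
            "name": fields[1],
            "raw": int(raw) if raw.isdigit() else raw,
        }
    return attributes


# Sample rows copied from the message above (column spacing is an assumption).
SAMPLE = """\
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
"""

attrs = parse_smart_attributes(SAMPLE)
for attr_id in (197, 198):
    print(attr_id, attrs[attr_id]["name"], attrs[attr_id]["raw"])
```

A non-zero raw value for attribute 197 or 198 is what smartd was reporting in the log lines; the sketch only reads the numbers and makes no judgement about drive health.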
Re: RAID 1 / disk error / Offline uncorrectable sectors
In response to Zbigniew Szalbot [EMAIL PROTECTED]:

> A couple of days ago smartd let me know about a disk problem.
>
> Jun 14 01:13:38 relay kernel: ad12: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=374468863
> Jun 14 01:13:38 relay kernel: ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode
> Jun 14 01:14:19 relay kernel: ad12: WARNING - WRITE_DMA taskqueue timeout - completing request directly
> Jun 14 01:14:19 relay kernel: ad12: WARNING - WRITE_DMA48 freeing taskqueue zombie request
> Jun 14 01:37:38 relay smartd[683]: Device: /dev/ad12, 1 Currently unreadable (pending) sectors
> Jun 14 01:37:38 relay smartd[683]: Device: /dev/ad12, 1 Offline uncorrectable sectors
>
> If I do smartctl -a /dev/ad12 I get
>
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
>
> My understanding is that RAID 1 no longer works because of this error. There is a bad sector on HD (Offline uncorrectable sectors) and the best we can do is replace the drive? Does it make sense to try to turn RAID 1 on ignoring this error (however, this is done in BIOS so the machine would have to be taken down in order to do that)? It seems serious enough for me not to ignore it but then I know close to nothing about HDs.

Replace the hard drive.

Every modern hard drive keeps extra space available to remap bad sectors. This happens magically behind the scenes without you ever knowing about it. Once you've hit uncorrectable errors, it means your re-mappable sectors are used up, and that means the drive is on its last legs.

--
Bill Moran
http://www.potentialtech.com
Re: RAID 1 / disk error / Offline uncorrectable sectors
Dear all,

Bill Moran:

>> My understanding is that RAID 1 no longer works because of this error. There is a bad sector on HD (Offline uncorrectable sectors) and the best we can do is replace the drive? Does it make sense to try to turn RAID 1 on ignoring this error (however, this is done in BIOS so the machine would have to be taken down in order to do that)? It seems serious enough for me not to ignore it but then I know close to nothing about HDs.
>
> Replace the hard drive. Every modern hard drive keeps extra space available to remap bad sectors. This happens magically behind the scenes without you ever knowing about it. Once you've hit uncorrectable errors, it means your re-mappable sectors are used up, and that means the drive is on its last legs.

Thank you Bill. One last question. RAID 1 is off now (degraded) and the hosting company is asking if I can try to bring it up (to check if it will work). They have given me this link: http://www.freebsd.org/doc/en/books/handbook/raid.html. The problem is that, as far as I understand, we are not using gmirror but RAID 1 turned on in BIOS (although it is also software-based).

Thank you very much in advance!

Zbigniew Szalbot
www.lc-words.com
Re: RAID 1 / disk error / Offline uncorrectable sectors
Zbigniew Szalbot wrote:

> Dear all,
>
> Bill Moran:
>
>>> My understanding is that RAID 1 no longer works because of this error. There is a bad sector on HD (Offline uncorrectable sectors) and the best we can do is replace the drive? Does it make sense to try to turn RAID 1 on ignoring this error (however, this is done in BIOS so the machine would have to be taken down in order to do that)? It seems serious enough for me not to ignore it but then I know close to nothing about HDs.
>>
>> Replace the hard drive. Every modern hard drive keeps extra space available to remap bad sectors. This happens magically behind the scenes without you ever knowing about it. Once you've hit uncorrectable errors, it means your re-mappable sectors are used up, and that means the drive is on its last legs.
>
> Thank you Bill. One last question. RAID 1 is off now (degraded) and the hosting company is asking if I can try to bring it up (to check if it will work). They have given me this link: http://www.freebsd.org/doc/en/books/handbook/raid.html. The problem is that, as far as I understand, we are not using gmirror but RAID 1 turned on in BIOS (although it is also software-based).
>
> Thank you very much in advance!
>
> Zbigniew Szalbot
> www.lc-words.com

Hey Zbigniew ;)

I understand you are using the ataraid (ar) driver. I always use gmirror, but it seems they pointed you to the right place in the handbook. Look at section 18.4.3 - you would probably need to do something like:

# atacontrol list

From the list, get the ATA channel for /dev/ad12, which is the faulty one, e.g. ata2.

Detach and re-attach (maybe this will reset the state of the drive):

# atacontrol detach ata2
# atacontrol attach ata2
# atacontrol addspare ar0 ad12
# atacontrol rebuild ar0

I've done more or less the same with gmirror when I had similar messages a few months back. It may work for a few hours/days but it will fail again. Have it replaced ASAP.

Manolis
Re: RAID 1 / disk error / Offline uncorrectable sectors
> Replace the hard drive. Every modern hard drive keeps extra space available to remap bad sectors. This happens magically behind the scenes without you ever knowing about it. Once you've hit uncorrectable errors, it means

No. Usually it means that there was an error when writing that sector, and later there is an error on read. The media may be good (it quite often is).

If you were right, I wouldn't have my disk still running one year after it had a whole block of uncorrectable errors. I just rewrote those blocks and they are readable.

The drive HAS TO know about bad media to remap, and no HDDs today perform verification.
Re: RAID 1 / disk error / Offline uncorrectable sectors
Hello Manolis,

> I understand you are using the ataraid (ar) driver. I always use gmirror, but it seems they pointed you to the right place in the handbook. Look at section 18.4.3 - you would probably need to do something like:
>
> # atacontrol list

ATA channel 6:
    Master: ad12 <ST3250310NS/SN04> Serial ATA v1.0
    Slave:       no device present
ATA channel 0:
    Master:      no device present
    Slave:       no device present
ATA channel 1:
    Master:      no device present
    Slave:       no device present
ATA channel 2:
    Master:      no device present
    Slave:       no device present
ATA channel 3:
    Master:      no device present
    Slave:       no device present
ATA channel 4:
    Master:      no device present
    Slave:       no device present
ATA channel 5:
    Master: ad10 <ST3250310NS/SN04> Serial ATA v1.0
    Slave:       no device present
ATA channel 6:
    Master: ad12 <ST3250310NS/SN04> Serial ATA v1.0
    Slave:       no device present
ATA channel 7:
    Master:      no device present
    Slave:       no device present
ATA channel 8:
    Master:      no device present
    Slave:       no device present
ATA channel 9:
    Master:      no device present
    Slave:       no device present
ATA channel 10:
    Master:      no device present
    Slave:       no device present

So in this case it would be ata6? Sorry for asking for confirmation at every step, but it is just so new to me! And thanks for the list of steps to perform!

Zbigniew Szalbot
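[Editor's sketch] The lookup being done by eye here can be written down as a small script: scan the "atacontrol list" output for the device name and report the channel it sits on. This is only a rough sketch. It assumes the layout shown in this message (an "ATA channel N:" header followed by "Master:" and "Slave:" rows), and find_channel is a hypothetical helper, not an atacontrol feature.

```python
import re


def find_channel(listing, device):
    """Return e.g. 'ata6' for the channel that hosts `device`, or None."""
    channel = None
    for line in listing.splitlines():
        header = re.match(r"\s*ATA channel (\d+):", line)
        if header:
            channel = int(header.group(1))
            continue
        row = re.match(r"\s*(Master|Slave):\s+(\S+)", line)
        # Empty slots read "no device present", so their first token is "no"
        # and never matches a real device name such as "ad12".
        if row and channel is not None and row.group(2) == device:
            return "ata%d" % channel
    return None


# Shortened sample in the assumed layout (only the two populated channels).
SAMPLE = """\
ATA channel 5:
    Master: ad10 <ST3250310NS/SN04> Serial ATA v1.0
    Slave:       no device present
ATA channel 6:
    Master: ad12 <ST3250310NS/SN04> Serial ATA v1.0
    Slave:       no device present
"""

print(find_channel(SAMPLE, "ad12"))
```

The returned name is the argument that the detach and attach steps suggested earlier in the thread expect.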
Re: RAID 1 / disk error / Offline uncorrectable sectors
On Mon, Jun 16, 2008 at 04:41:15PM +0200, Wojciech Puchar wrote:

>> Replace the hard drive. Every modern hard drive keeps extra space available to remap bad sectors. This happens magically behind the scenes without you ever knowing about it. Once you've hit uncorrectable errors, it means
>
> No. Usually it means that there was an error when writing that sector, and later there is an error on read. The media may be good (it quite often is).
>
> If you were right, I wouldn't have my disk still running one year after it had a whole block of uncorrectable errors. I just rewrote those blocks and they are readable.
>
> The drive HAS TO know about bad media to remap, and no HDDs today perform verification.

Also, remapping can only happen if the error is encountered on a write operation. If there is an error on read, the drive cannot remap, since it does not know what data should be there.

(A good RAID implementation could, however, handle a read error by reading the corresponding sector from the other disk(s) in the array and writing it back to the failing disk, probably causing it to remap the block.)

(Write errors are, however, usually a strong indication that the drive should be replaced ASAP.)

--
Insert your favourite quote here.
Erik Trulsson
[EMAIL PROTECTED]
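[Editor's sketch] The read-repair idea in the parenthesis above can be illustrated with a toy in-memory model. Everything here (ToyDisk, ReadError, mirrored_read, the spare count) is invented for illustration; it is not how ataraid, gmirror or any real drive firmware is implemented. It only shows the logic: a read error on one mirror half is served from the other half, and writing the data back gives the failing disk the chance to remap the sector.

```python
class ReadError(Exception):
    """Raised by ToyDisk when a sector cannot be read."""


class ToyDisk:
    def __init__(self, sectors, spare_count=4):
        self.sectors = dict(enumerate(sectors))
        self.bad = set()          # LBAs whose media currently cannot be read
        self.remapped = set()     # LBAs that were moved to a spare sector
        self.spares_left = spare_count

    def read(self, lba):
        # A read of a bad sector can only fail: the data is unknown.
        if lba in self.bad:
            raise ReadError(lba)
        return self.sectors[lba]

    def write(self, lba, data):
        # A write supplies fresh data, so a bad sector can be remapped now.
        if lba in self.bad:
            if self.spares_left == 0:
                raise IOError("no spare sectors left")
            self.spares_left -= 1
            self.bad.discard(lba)
            self.remapped.add(lba)
        self.sectors[lba] = data


def mirrored_read(primary, secondary, lba):
    """Read from primary; on error, fetch from the mirror and write it back."""
    try:
        return primary.read(lba)
    except ReadError:
        data = secondary.read(lba)
        primary.write(lba, data)  # triggers the remap on the failing disk
        return data


disk_a = ToyDisk([b"a", b"b", b"c"])
disk_b = ToyDisk([b"a", b"b", b"c"])
disk_a.bad.add(1)                 # simulate one unreadable sector

print(mirrored_read(disk_a, disk_b, 1), disk_a.remapped)
```

In this model the degraded state reported in the original log would correspond to a driver that drops the disk on the first read error instead of attempting the write-back.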
Re: RAID 1 / disk error / Offline uncorrectable sectors
Zbigniew Szalbot wrote:

> Hello Manolis,
>
>> I understand you are using the ataraid (ar) driver. I always use gmirror, but it seems they pointed you to the right place in the handbook. Look at section 18.4.3 - you would probably need to do something like:
>>
>> # atacontrol list
>
> ATA channel 6:
>     Master: ad12 <ST3250310NS/SN04> Serial ATA v1.0
>     Slave:       no device present
> ATA channel 0:
>     Master:      no device present
>     Slave:       no device present
> ATA channel 1:
>     Master:      no device present
>     Slave:       no device present
> ATA channel 2:
>     Master:      no device present
>     Slave:       no device present
> ATA channel 3:
>     Master:      no device present
>     Slave:       no device present
> ATA channel 4:
>     Master:      no device present
>     Slave:       no device present
> ATA channel 5:
>     Master: ad10 <ST3250310NS/SN04> Serial ATA v1.0
>     Slave:       no device present
> ATA channel 6:
>     Master: ad12 <ST3250310NS/SN04> Serial ATA v1.0
>     Slave:       no device present
> ATA channel 7:
>     Master:      no device present
>     Slave:       no device present
> ATA channel 8:
>     Master:      no device present
>     Slave:       no device present
> ATA channel 9:
>     Master:      no device present
>     Slave:       no device present
> ATA channel 10:
>     Master:      no device present
>     Slave:       no device present
>
> So in this case it would be ata6? Sorry for asking for confirmation at every step, but it is just so new to me! And thanks for the list of steps to perform!
>
> Zbigniew Szalbot

Yes, it is ata6.

Give it a try. If the problem is serious enough, it will probably not even finish the rebuild :(
Re: RAID 1 / disk error / Offline uncorrectable sectors
> (Write errors are, however, usually a strong indication that the drive should be replaced ASAP.)

He got a read error... but your sentence on its own is true, of course.
Re: RAID 1 / disk error / Offline uncorrectable sectors
Hi Manolis,

> Yes, it is ata6.
>
> Give it a try. If the problem is serious enough, it will probably not even finish the rebuild :(

Detaching and attaching went well, but when I issued

atacontrol addspare ar0 ad12

it said

atacontrol: ioctl(IOCATARAIDADDSPARE): Device busy

I am not sure if that means I should wait or rather that it is mission impossible?

Thanks!

Zbigniew Szalbot
Re: RAID 1 / disk error / Offline uncorrectable sectors
Zbigniew Szalbot wrote:

> Hi Manolis,
>
>> Yes, it is ata6.
>>
>> Give it a try. If the problem is serious enough, it will probably not even finish the rebuild :(
>
> Detaching and attaching went well, but when I issued
>
> atacontrol addspare ar0 ad12
>
> it said
>
> atacontrol: ioctl(IOCATARAIDADDSPARE): Device busy
>
> I am not sure if that means I should wait or rather that it is mission impossible?
>
> Thanks!
>
> Zbigniew Szalbot

Try

atacontrol status ar0

Since you haven't actually removed/replaced ad12, you may simply have to continue with:

atacontrol rebuild ar0

but see what status says first.
Re: RAID 1 / disk error / Offline uncorrectable sectors
Hello,

Manolis Kiagias:

> Try
>
> atacontrol status ar0

ar0: ATA RAID1 status: DEGRADED
 subdisks:
   0 ad10 ONLINE
   1 MISSING

> Since you haven't actually removed/replaced ad12, you may simply have to continue with:
>
> atacontrol rebuild ar0

I'll try it now.

Thanks!

Zbigniew Szalbot
Re: RAID 1 / disk error / Offline uncorrectable sectors
Hello,

Manolis Kiagias:

> Try
>
> atacontrol status ar0
>
> Since you haven't actually removed/replaced ad12, you may simply have to continue with:
>
> atacontrol rebuild ar0

atacontrol rebuild ar0
atacontrol: ioctl(IOCATARAIDREBUILD): Input/output error

So it looks like it cannot be done?

Zbigniew Szalbot
Re: RAID 1 / disk error / Offline uncorrectable sectors
Zbigniew Szalbot wrote:

> Hello,
>
> Manolis Kiagias:
>
>> Try
>>
>> atacontrol status ar0
>
> ar0: ATA RAID1 status: DEGRADED
>  subdisks:
>    0 ad10 ONLINE
>    1 MISSING
>
>> Since you haven't actually removed/replaced ad12, you may simply have to continue with:
>>
>> atacontrol rebuild ar0
>
> I'll try it now.
>
> Thanks!
>
> Zbigniew Szalbot

OK, ad12 is missing, so it seems it was detached but not reattached. Try again:

atacontrol attach ata6

If this succeeds:

atacontrol addspare ar0 ad12
atacontrol rebuild ar0

If the attach fails, then someone at the remote site may have to physically detach and reattach the disk in question.
Re: RAID 1 / disk error / Offline uncorrectable sectors
Hello one last time,

Manolis Kiagias:

> OK, ad12 is missing, so it seems it was detached but not reattached. Try again:
>
> atacontrol attach ata6

$ sudo atacontrol attach ata6
atacontrol: ioctl(IOCATAATTACH): File exists

Thank you all for a lot of suggestions!

Zbigniew Szalbot
Re: RAID 1 / disk error / Offline uncorrectable sectors
Zbigniew Szalbot wrote:

> Hello one last time,
>
> Manolis Kiagias:
>
>> OK, ad12 is missing, so it seems it was detached but not reattached. Try again:
>>
>> atacontrol attach ata6
>
> $ sudo atacontrol attach ata6
> atacontrol: ioctl(IOCATAATTACH): File exists
>
> Thank you all for a lot of suggestions!
>
> Zbigniew Szalbot

As a last resort, you could also try:

atacontrol reinit ata6

and then try reattaching again.
Re: RAID 1 / disk error / Offline uncorrectable sectors
Hello,

> As a last resort, you could also try:
>
> atacontrol reinit ata6
>
> and then try reattaching again.

Thank you Manolis - you have been more than patient with me! Unfortunately, the result is still the same.

OK. I am going to ask our hosting company to replace the drive. Again, many thanks for your help!

Zbigniew Szalbot
Re: RAID 1 / disk error / Offline uncorrectable sectors
Bill Moran wrote:

> Zbigniew Szalbot wrote:
>> [...]
>> Jun 14 01:13:38 relay kernel: ad12: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=374468863
>> [...]
>
> Replace the hard drive.
>
> Every modern hard drive keeps extra space available to remap bad sectors. This happens magically behind the scenes without you ever knowing about it. Once you've hit uncorrectable errors, it means your re-mappable sectors are used up, and that means the drive is on its last legs.

That's not completely true.

When a disk drive encounters a bad sector during a read operation, it will remember the bad sector address, but it is unable to transparently remap the sector because it doesn't know the correct contents of the sector. So it has to report the unrecoverable error to the OS, even if there's still plenty of space for remapping sectors. Upon the next write operation to a sector marked as bad, the drive will finally remap it and write the data to a spare location.

Therefore, getting uncorrectable errors does *not* mean that the drive has used up its spare sectors. You only need to overwrite the bad sectors (e.g. with dd(1)) so the drive gets a chance to remap them.

Of course, it might still be a good idea to replace the drive anyway. It depends on the cause of the bad sectors (mechanical or electrical). If you had a head crash (caused by mechanical impact or a media manufacturing error or whatever), it is possible that it caused debris within the drive which will cause further bad blocks. This can lead to a snowball effect that can really exhaust all spare sectors quickly.

On the other hand, if the bad sectors were caused by a voltage spike, a power failure or similar, chances are that the drive is fine and you can continue to use it after making sure that the bad sectors are remapped (by overwriting them, see above).

Finally, there is also the possibility that the problem is caused by a bug in the drive's firmware. If that's the case, I would be inclined to replace the drive with a different brand. However, I guess all drives have bugs ... the question is whether they affect you.

Another question is whether it's possible at all to find out what caused the problem in the first place.

Best regards
   Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Commercial register: Registergericht Muenchen, HRA 74606, Management:
secnetix Verwaltungsgesellsch. mbH, Commercial register: Registergericht
München, HRB 125758, Managing directors: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD services, products and more: http://www.secnetix.de/bsd

What is this talk of 'release'? We do not make software 'releases'. Our software 'escapes', leaving a bloody trail of designers and quality assurance people in its wake.
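[Editor's sketch] To make the "overwrite the bad sector with dd(1)" suggestion concrete, the snippet below only builds and prints a candidate command line for the LBA from the kernel log; it does not run anything. It rests on several assumptions that are not stated in the thread: 512-byte logical sectors, an LBA that is relative to the raw device /dev/ad12, and zeros being acceptable contents for that sector (on a mirror member the correct data would normally come from the healthy disk instead). The helper name overwrite_command is made up for this example.

```python
import shlex

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def overwrite_command(device, lba, count=1):
    """Build (but do not run) a dd command that overwrites `count` sectors at `lba`."""
    if lba < 0 or count < 1:
        raise ValueError("lba must be >= 0 and count must be >= 1")
    args = [
        "dd",
        "if=/dev/zero",            # source of the replacement data (zeros)
        "of=%s" % device,          # target disk; writing here destroys data
        "bs=%d" % SECTOR_SIZE,     # one block per sector
        "seek=%d" % lba,           # skip this many output blocks before writing
        "count=%d" % count,
    ]
    return " ".join(shlex.quote(arg) for arg in args)


lba = 374468863  # LBA taken from the READ_DMA48 failure line in the log
print(overwrite_command("/dev/ad12", lba))
print("byte offset:", lba * SECTOR_SIZE)
```

Printing the command instead of executing it leaves room to double-check the device name and offset first, since a wrong seek value would overwrite a healthy sector.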