RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Zbigniew Szalbot

Hello,

I'd like to ask your advice. We have RAID 1 / SATA turned on in BIOS.

A couple of days ago smartd let me know about a disk problem.

Jun 14 01:13:38 relay kernel: ad12: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=374468863
Jun 14 01:13:38 relay kernel: ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode
Jun 14 01:14:19 relay kernel: ad12: WARNING - WRITE_DMA taskqueue timeout - completing request directly
Jun 14 01:14:19 relay kernel: ad12: WARNING - WRITE_DMA48 freeing taskqueue zombie request
Jun 14 01:37:38 relay smartd[683]: Device: /dev/ad12, 1 Currently unreadable (pending) sectors
Jun 14 01:37:38 relay smartd[683]: Device: /dev/ad12, 1 Offline uncorrectable sectors


If I run smartctl -a /dev/ad12 I get:

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
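
The two rows of interest are attributes 197 and 198. A minimal shell sketch for pulling their raw values out of attribute-table text, assuming the raw value is the last field of each row; the function name pending_sectors is made up for illustration, and the sample rows are based on the output above:

```sh
#!/bin/sh
# Print "<attribute name> <raw value>" for SMART attributes 197 and 198.
# Reads attribute-table text on stdin; the raw value is assumed to be the
# last whitespace-separated field of the row.
pending_sectors() {
    awk '$1 == 197 || $1 == 198 { print $2, $NF }'
}

# Sample rows based on the smartctl output quoted in this message.
sample='197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1'

printf '%s\n' "$sample" | pending_sectors
```

On a live system the same function could presumably be fed from `smartctl -A /dev/ad12`, which prints only the attribute table.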


My understanding is that RAID 1 no longer works because of this error. 
There is a bad sector on HD (Offline uncorrectable sectors) and the best 
we can do is replace the drive? Does it make sense to try to turn RAID 1 
on ignoring this error (however, this is done in BIOS so the machine 
would have to be taken down in order to do that)? It seems serious 
enough for me not to ignore it but then I know close to nothing about HDs.


Many thanks for your suggestions!


Zbigniew Szalbot
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Bill Moran
In response to Zbigniew Szalbot [EMAIL PROTECTED]:
 
> A couple of days ago smartd let me know about a disk problem.
>
> Jun 14 01:13:38 relay kernel: ad12: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=374468863
> [...]
>
> My understanding is that RAID 1 no longer works because of this error.
> There is a bad sector on HD (Offline uncorrectable sectors) and the best
> we can do is replace the drive? Does it make sense to try to turn RAID 1
> on ignoring this error (however, this is done in BIOS so the machine
> would have to be taken down in order to do that)? It seems serious
> enough for me not to ignore it but then I know close to nothing about HDs.

Replace the hard drive.  Every modern hard drive keeps extra space available
to remap bad sectors.  This happens magically behind the scenes without
you ever knowing about it.  Once you've hit uncorrectable errors, it means
your re-mappable sectors are used up, and that means the drive is on its
last legs.

-- 
Bill Moran
http://www.potentialtech.com


Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Zbigniew Szalbot

Dear all,

Bill Moran:


> Replace the hard drive.  Every modern hard drive keeps extra space
> available to remap bad sectors.  This happens magically behind the
> scenes without you ever knowing about it.  Once you've hit
> uncorrectable errors, it means your re-mappable sectors are used
> up, and that means the drive is on its last legs.



Thank you Bill. One last question. RAID 1 is off now (degraded) and the 
hosting company is asking if I can try to bring it up (to check if it 
will work). They have given me this link 
http://www.freebsd.org/doc/en/books/handbook/raid.html. The problem is 
that as far as I understand we are not using gmirror but RAID 1 turned 
on in BIOS (although it is also software-based).


Thank you very much in advance!

Zbigniew Szalbot
www.lc-words.com



Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Manolis Kiagias

Zbigniew Szalbot wrote:

> Thank you Bill. One last question. RAID 1 is off now (degraded) and
> the hosting company is asking if I can try to bring it up (to check if
> it will work). They have given me this link
> http://www.freebsd.org/doc/en/books/handbook/raid.html. The problem is
> that as far as I understand we are not using gmirror but RAID 1 turned
> on in BIOS (although it is also software-based).



Hey Zbigniew ;)

I understand you are using the ataraid (ar) driver. I always use 
gmirror, but it seems they pointed you to the right place in the handbook.

Look at section 18.4.3 - you would probably need to do something like:

# atacontrol list

From the list, get the ATA channel for /dev/ad12 which is the faulty 
one, e.g. ata2


Detach and re-attach (maybe this will reset the state of the drive)

atacontrol detach ata2
atacontrol attach ata2

atacontrol addspare ar0 ad12
atacontrol rebuild ar0

I've done more or less the same with gmirror when I had similar messages 
a few months back. It may work for a few hours/days but it will fail 
again. Have it replaced ASAP.


Manolis



Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Wojciech Puchar


> Replace the hard drive.  Every modern hard drive keeps extra space available
> to remap bad sectors.  This happens magically behind the scenes without
> you ever knowing about it.  Once you've hit uncorrectable errors, it means

No. Usually it means that there was an error when writing that sector, and
later there is an error on read. The media may be good (it quite often is).

If you were right, I wouldn't have my disk still running one year after
having a whole block of uncorrectable errors.

I just rewrote those blocks and they are readable.

The drive HAS TO know about bad media to remap, and no HDDs today perform
verification after a write.



Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Zbigniew Szalbot

Hello Manolis,

> I understand you are using the ataraid (ar) driver. I always use
> gmirror, but it seems they pointed you to the right place in the handbook.
>
> Look at section 18.4.3 - you would probably need to do something like:
>
> # atacontrol list


ATA channel 6:
Master: ad12 ST3250310NS/SN04 Serial ATA v1.0
Slave:   no device present

ATA channel 0:
Master:  no device present
Slave:   no device present
ATA channel 1:
Master:  no device present
Slave:   no device present
ATA channel 2:
Master:  no device present
Slave:   no device present
ATA channel 3:
Master:  no device present
Slave:   no device present
ATA channel 4:
Master:  no device present
Slave:   no device present
ATA channel 5:
Master: ad10 ST3250310NS/SN04 Serial ATA v1.0
Slave:   no device present
ATA channel 6:
Master: ad12 ST3250310NS/SN04 Serial ATA v1.0
Slave:   no device present
ATA channel 7:
Master:  no device present
Slave:   no device present
ATA channel 8:
Master:  no device present
Slave:   no device present
ATA channel 9:
Master:  no device present
Slave:   no device present
ATA channel 10:
Master:  no device present
Slave:   no device present

So in this case it would be ata6? Sorry for asking for confirmation at
every step, but this is all so new to me!


And thanks for the list of steps to perform!

Zbigniew Szalbot


Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Erik Trulsson
On Mon, Jun 16, 2008 at 04:41:15PM +0200, Wojciech Puchar wrote:
 
> > Replace the hard drive.  Every modern hard drive keeps extra space available
> > to remap bad sectors.  This happens magically behind the scenes without
> > you ever knowing about it.  Once you've hit uncorrectable errors, it means
>
> no. usually it means that there was an error when writing that sector, and
> later there is an error on read. media may be good (quite often is).
>
> if you would be right i wouldn't have my disk running one year after
> having whole block of uncorrectable errors
>
> i just rewrote those blocks and they are readable.
>
> drive HAS TO know about bad media to remap, and no HDDs today perform
> verification


Also, remapping can only happen if the error is encountered on a write
operation.  If there is an error on read the drive cannot remap, since
it does not know what data should be there.
(A good RAID implementation could however handle a read error by reading
the corresponding sector from the other disk(s) in the array and writing it
back to the failing disk, probably causing it to remap the block.)

(Write errors are however usually a strong indication that the drive should
be replaced ASAP.)



-- 
Insert your favourite quote here.
Erik Trulsson
[EMAIL PROTECTED]


Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Manolis Kiagias

Zbigniew Szalbot wrote:

> # atacontrol list
> [...]
> ATA channel 6:
> Master: ad12 ST3250310NS/SN04 Serial ATA v1.0
> Slave:   no device present
> [...]
>
> So in this case it would be ata6? Sorry for asking confirmation for
> every step but it is just so new to me!



Yes, it is ata6.
Give it a try; if the problem is serious enough, it will probably not
even finish the rebuild :(



Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Wojciech Puchar


> (Write errors are however usually a strong indication that the drive should
> be replaced ASAP.)

He got a read error... but your sentence on its own is true, of course.


Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Zbigniew Szalbot

Hi Manolis,



> Yes, it is ata6
> Give it a try, if the problem is serious enough, it will probably not
> even finish rebuild :(


Detaching and attaching went well but when I issued
atacontrol addspare ar0 ad12
it said
atacontrol: ioctl(IOCATARAIDADDSPARE): Device busy

I am not sure if that means I should wait or rather that it is mission 
impossible?


Thanks!

Zbigniew Szalbot




Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Manolis Kiagias

Zbigniew Szalbot wrote:

> Detaching and attaching went well but when I issued
> atacontrol addspare ar0 ad12
> it said
> atacontrol: ioctl(IOCATARAIDADDSPARE): Device busy
>
> I am not sure if that means I should wait or rather that it is mission
> impossible?


Try

atacontrol status ar0

Since you haven't actually removed/replaced ad12 you may simply have to 
continue with:


atacontrol rebuild ar0

but see what status says first.


Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Zbigniew Szalbot

Hello,

Manolis Kiagias:


> Try
>
> atacontrol status ar0

ar0: ATA RAID1 status: DEGRADED
 subdisks:
   0 ad10 ONLINE
   1  MISSING

> Since you haven't actually removed/replaced ad12 you may simply have to
> continue with:
>
> atacontrol rebuild ar0


I'll try it now. Thanks!

Zbigniew Szalbot




Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Zbigniew Szalbot

Hello,

Manolis Kiagias:


> Try
>
> atacontrol status ar0
>
> Since you haven't actually removed/replaced ad12 you may simply have to
> continue with:
>
> atacontrol rebuild ar0


atacontrol rebuild ar0
atacontrol: ioctl(IOCATARAIDREBUILD): Input/output error

So it looks like it cannot be done?

Zbigniew Szalbot




Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Manolis Kiagias

Zbigniew Szalbot wrote:

> ar0: ATA RAID1 status: DEGRADED
>  subdisks:
>    0 ad10 ONLINE
>    1  MISSING


Ok, ad12 is missing, so it seems it was detached but not reattached.

try again:

atacontrol attach ata6

If this succeeds,

atacontrol addspare ar0 ad12
atacontrol rebuild ar0

If attach fails, then someone at the remote site may have to  physically 
detach / reattach the disk in question.
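
The decision above hinges on whether the status output lists a MISSING subdisk. A rough shell sketch of that check, assuming the status text has the same layout as the output reported earlier in this thread; the function name has_missing_subdisk is hypothetical:

```sh
#!/bin/sh
# Succeeds (exit status 0) if the status text on stdin lists a MISSING subdisk.
has_missing_subdisk() {
    grep -q 'MISSING'
}

# Status text as reported earlier in this thread.
status='ar0: ATA RAID1 status: DEGRADED
 subdisks:
   0 ad10 ONLINE
   1  MISSING'

if printf '%s\n' "$status" | has_missing_subdisk; then
    echo "subdisk missing: attach the channel and addspare before rebuild"
else
    echo "all subdisks present: rebuild can be attempted"
fi
```

On the machine itself the input would come from `atacontrol status ar0` instead of the sample variable.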



Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Zbigniew Szalbot

Hello one last time,

Manolis Kiagias:


> Ok, ad12 is missing, so it seems it was detached but not reattached.
>
> try again:
>
> atacontrol attach ata6


$ sudo atacontrol attach ata6
atacontrol: ioctl(IOCATAATTACH): File exists

Thank you all for a lot of suggestions!


Zbigniew Szalbot




Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Manolis Kiagias

Zbigniew Szalbot wrote:

> $ sudo atacontrol attach ata6
> atacontrol: ioctl(IOCATAATTACH): File exists

As a last resort, you could also try:

atacontrol reinit ata6

and try reattaching again


Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Zbigniew Szalbot

Hello,



> As a last resort, you could also try:
>
> atacontrol reinit ata6
>
> and try reattaching again


Thank you Manolis - you have been more than patient with me! 
Unfortunately, the result is still the same. OK. I am going to ask our 
hosting company to replace the drive. Again, many thanks for your help!


Zbigniew Szalbot




Re: RAID 1 / disk error / Offline uncorrectable sectors

2008-06-16 Thread Oliver Fromme
Bill Moran wrote:
> Zbigniew Szalbot wrote:
> > [...]
> > Jun 14 01:13:38 relay kernel: ad12: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=374468863
> > [...]
>
> Replace the hard drive.  Every modern hard drive keeps extra space available
> to remap bad sectors.  This happens magically behind the scenes without
> you ever knowing about it.  Once you've hit uncorrectable errors, it means
> your re-mappable sectors are used up, and that means the drive is on its
> last legs.

That's not completely true.

When a disk drive encounters a bad sector during a read
operation, it will remember the bad sector address, but
it is unable to transparently remap the sector because it
doesn't know the correct contents of the sector.  So it
has to report the unrecoverable error to the OS, even if
there's still plenty of space for remapping sectors.

Upon the next write operation to a sector marked as bad,
the drive will finally remap it and write the data to a
spare location.

Therefore, getting uncorrectable errors does *not* mean
that the drive has used up its spare sectors.  You only
need to overwrite the bad sectors (e.g. with dd(1)) so the
drive gets a chance to remap them.
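
As an illustration only, here is a small shell sketch that builds (but does not run) such a dd(1) command. It assumes 512-byte sectors and that the LBA from the kernel log is an absolute offset on the raw disk; the function name zero_sector_cmd is made up. Running the printed command destroys whatever was stored in that sector, so the disk should not be an active member of the mirror at the time.

```sh
#!/bin/sh
# Print (do not execute) a dd command that overwrites one sector with zeros.
# $1 = device node, $2 = LBA of the unreadable sector.
# Assumption: 512-byte sectors, so bs=512 makes seek count in sectors.
zero_sector_cmd() {
    printf 'dd if=/dev/zero of=%s bs=512 seek=%s count=1\n' "$1" "$2"
}

# Device and LBA taken from the kernel log quoted at the top of the thread.
zero_sector_cmd /dev/ad12 374468863
```

Printing the command first leaves room to double-check the device name and sector number before anything is written.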

Of course, it might still be a good idea to replace the
drive anyway.  It depends on the cause of the bad sectors
(mechanical or electrical).

If you had a head crash (caused by mechanical impact or
a media manufacturing error or whatever), it is possible
that it caused debris within the drive which will cause
further bad blocks.  This can lead to a snowball effect
that can really exhaust all spare sectors quickly.

On the other hand, if the bad sectors were caused by
a voltage spike, a power failure or similar, chances are
that the drive is fine and you can continue to use it
after making sure that the bad sectors are remapped
(by overwriting them, see above).

Finally, there is also the possibility that the problem
is caused by a bug in the drive's firmware.  If that's
the case, I would be inclined to replace the drive with
a different brand.  However, I guess all drives have
bugs ...  the question is whether they affect you.
Another question is whether it's possible at all to
find out what caused the problem in the first place.

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606,  Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758,  Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr:  http://www.secnetix.de/bsd

What is this talk of 'release'?  We do not make software 'releases'.
Our software 'escapes', leaving a bloody trail of designers and quality
assurance people in its wake.