Re: SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM status: ATA Status Error

2017-12-13 Thread Rodney W. Grimes
> On Tue, 12 Dec 2017 14:55:49 -0800 (PST),
> "Rodney W. Grimes" wrote:
> > > On Tue, 12 Dec 2017 10:52:27 -0800 (PST),
> > > "Rodney W. Grimes" wrote:
> > > 
> > > Thank you for answering that fast!

Not so fast this time, had to sleep :)

> > > > > Hello,
> > > > > 
> > > > > running CURRENT (recent r326769), I realised that smartmond sends out 
> > > > > some console
> > > > > messages when booting the box:
> > > > > 
> > > > > [...]
> > > > > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 
> > > > > Currently
> > > > > unreadable (pending) sectors Dec 12 14:14:33 <3.2> box1 smartd[68426]:
> > > > > Device: /dev/ada6, 1 Offline uncorrectable sectors
> > > > > [...]
> > > > > 
> > > > > Checking the drive's SMART log with smartctl (it is one of four 3TB 
> > > > > disk drives),
> > > > > I gather these informations:
> > > > > 
> > > > > [... smartctl -x /dev/ada6 ...]
> > > > > Error 42 [17] occurred at disk power-on lifetime: 25335 hours (1055 
> > > > > days + 15
> > > > > hours) When the command that caused the error occurred, the device 
> > > > > was active or
> > > > > idle.
> > > > > 
> > > > >   After command completion occurred, registers were:
> > > > >   ER -- ST COUNT  LBA_48  LH LM LL DV DC
> > > > >   -- -- -- == -- == == == -- -- -- -- --
> > > > >   40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 
> > > > > 0xc27a7298 =
> > > > > 3262804632
> > > > > 
> > > > >   Commands leading to the command that caused the error were:
> > > > >   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  
> > > > > Command/Feature_Name
> > > > >   -- == -- == -- == == == -- -- -- -- --  ---  
> > > > > 
> > > > >   60 00 b0 00 88 00 00 c2 7a 73 20 40 08 23:38:12.195  READ FPDMA 
> > > > > QUEUED
> > > > >   60 00 b0 00 80 00 00 c2 7a 72 70 40 08 23:38:12.195  READ FPDMA 
> > > > > QUEUED
> > > > >   2f 00 00 00 01 00 00 00 00 00 10 40 08 23:38:12.195  READ LOG 
> > > > > EXT
> > > > >   60 00 b0 00 70 00 00 c2 7a 73 20 40 08 23:38:09.343  READ FPDMA 
> > > > > QUEUED
> > > > >   60 00 b0 00 68 00 00 c2 7a 72 70 40 08 23:38:09.343  READ FPDMA 
> > > > > QUEUED
> > > > > [...]
> > > > > 
> > > > > and
> > > > > 
> > > > > [...]
> > > > > SMART Attributes Data Structure revision number: 16
> > > > > Vendor Specific SMART Attributes with Thresholds:
> > > > > ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> > > > >   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    64
> > > > >   3 Spin_Up_Time            POS--K   178   170   021    -    6075
> > > > >   4 Start_Stop_Count        -O--CK   098   098   000    -    2406
> > > > >   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
> > > > >   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
> > > > >   9 Power_On_Hours          -O--CK   066   066   000    -    25339
> > > > >  10 Spin_Retry_Count        -O--CK   100   100   000    -    0
> > > > >  11 Calibration_Retry_Count -O--CK   100   100   000    -    0
> > > > >  12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
> > > > > 192 Power-Off_Retract_Count -O--CK   200   200   000    -    154
> > > > > 193 Load_Cycle_Count        -O--CK   001   001   000    -    2055746
> > > > > 194 Temperature_Celsius     -O---K   122   109   000    -    28
> > > > > 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> > > > > 197 Current_Pending_Sector  -O--CK   200   200   000    -    1
> > > > > 198 Offline_Uncorrectable   ----CK   200   200   000    -    1

Note that here we have one pending sector and one offline uncorrectable
sector.  An offline uncorrectable sector needs to end up remapped; iirc it
should never be cleared and returned to the good blocks.  Then again,
firmware gets changed, so maybe it is now possible to return such a
sector to service.  Either way, it looks as if at this point in time
we may in fact have 2 separate blocks that are bad.

I have some long-used, heavily worn drives that have tens of remapped
sectors, and they are still running fine.  I would not use them for
mission-critical or heavy-use situations, but they are good for cold
storage and other non-critical use.  A total of 2 reallocations I would
not worry much about, unless I am seeing a growth rate.
Note that when these drives are shipped brand new, for the first N
power-on hours they are in a special mode that is very quick to simply
remap a "weak" sector, i.e. any sector that needs more than some
threshold of M bits corrected.  The ECC has already corrected the data,
but the vendor has decided that these are weak sectors and that the
drive should just remap them.  Some firmware does not even count them as
reallocated sectors and instead adds them to the manufacturer's P-list.
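
To watch for the growth rate mentioned above, the raw values of attributes 5, 197 and 198 can be pulled out of saved smartctl output and compared between runs. The following is a minimal sketch, not a tested monitoring tool: `smart_raw` is a hypothetical helper name, and it assumes the brief attribute layout printed by `smartctl -x`, where RAW_VALUE is the last column. The sample rows are the ones posted in this thread.

```shell
#!/bin/sh
# Print the RAW_VALUE (last column) of one SMART attribute.
# Reads attribute rows on stdin in the brief "smartctl -x" layout:
#   ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
# $1 is the numeric attribute ID, e.g. 5, 197 or 198.
# smart_raw is a hypothetical helper, not part of smartmontools.
smart_raw() {
    awk -v id="$1" '$1 == id { print $NF }'
}

# Rows copied from the attribute table posted in this thread.
sample='  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    1
198 Offline_Uncorrectable   ----CK   200   200   000    -    1'

for id in 5 197 198; do
    printf 'attribute %s raw=%s\n' "$id" \
        "$(printf '%s\n' "$sample" | smart_raw "$id")"
done
```

Saving the three numbers together with Power_On_Hours after every scrub would be enough to see whether the counts are stable or climbing.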

> > > > > 199 UDMA_CRC_Error_Count-O--CK   200   200   000-0
> > > > > 200 Multi_Zone_Error_Rate   ---R--   200   200   000-5
> > > > > ||_ K auto-keep
> > > > > |_

Re: SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM status: ATA Status Error

2017-12-13 Thread O. Hartmann
On Tue, 12 Dec 2017 14:55:49 -0800 (PST),
"Rodney W. Grimes" wrote:

> > On Tue, 12 Dec 2017 10:52:27 -0800 (PST),
> > "Rodney W. Grimes" wrote:
> > 
> > 
> > Thank you for answering that fast!
> >   
> > > > Hello,
> > > > 
> > > > running CURRENT (recent r326769), I realised that smartmond sends out 
> > > > some console
> > > > messages when booting the box:
> > > > 
> > > > [...]
> > > > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Currently
> > > > unreadable (pending) sectors Dec 12 14:14:33 <3.2> box1 smartd[68426]:
> > > > Device: /dev/ada6, 1 Offline uncorrectable sectors
> > > > [...]
> > > > 
> > > > Checking the drive's SMART log with smartctl (it is one of four 3TB 
> > > > disk drives),
> > > > I gather these informations:
> > > > 
> > > > [... smartctl -x /dev/ada6 ...]
> > > > Error 42 [17] occurred at disk power-on lifetime: 25335 hours (1055 
> > > > days + 15
> > > > hours) When the command that caused the error occurred, the device was 
> > > > active or
> > > > idle.
> > > > 
> > > >   After command completion occurred, registers were:
> > > >   ER -- ST COUNT  LBA_48  LH LM LL DV DC
> > > >   -- -- -- == -- == == == -- -- -- -- --
> > > >   40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 
> > > > 0xc27a7298 =
> > > > 3262804632
> > > > 
> > > >   Commands leading to the command that caused the error were:
> > > >   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  
> > > > Command/Feature_Name
> > > >   -- == -- == -- == == == -- -- -- -- --  ---  
> > > > 
> > > >   60 00 b0 00 88 00 00 c2 7a 73 20 40 08 23:38:12.195  READ FPDMA 
> > > > QUEUED
> > > >   60 00 b0 00 80 00 00 c2 7a 72 70 40 08 23:38:12.195  READ FPDMA 
> > > > QUEUED
> > > >   2f 00 00 00 01 00 00 00 00 00 10 40 08 23:38:12.195  READ LOG EXT
> > > >   60 00 b0 00 70 00 00 c2 7a 73 20 40 08 23:38:09.343  READ FPDMA 
> > > > QUEUED
> > > >   60 00 b0 00 68 00 00 c2 7a 72 70 40 08 23:38:09.343  READ FPDMA 
> > > > QUEUED
> > > > [...]
> > > > 
> > > > and
> > > > 
> > > > [...]
> > > > SMART Attributes Data Structure revision number: 16
> > > > Vendor Specific SMART Attributes with Thresholds:
> > > > ID# ATTRIBUTE_NAME  FLAGSVALUE WORST THRESH FAIL RAW_VALUE
> > > >   1 Raw_Read_Error_Rate POSR-K   200   200   051-64
> > > >   3 Spin_Up_TimePOS--K   178   170   021-6075
> > > >   4 Start_Stop_Count-O--CK   098   098   000-2406
> > > >   5 Reallocated_Sector_Ct   PO--CK   200   200   140-0
> > > >   7 Seek_Error_Rate -OSR-K   200   200   000-0
> > > >   9 Power_On_Hours  -O--CK   066   066   000-25339
> > > >  10 Spin_Retry_Count-O--CK   100   100   000-0
> > > >  11 Calibration_Retry_Count -O--CK   100   100   000-0
> > > >  12 Power_Cycle_Count   -O--CK   098   098   000-2404
> > > > 192 Power-Off_Retract_Count -O--CK   200   200   000-154
> > > > 193 Load_Cycle_Count-O--CK   001   001   000-2055746
> > > > 194 Temperature_Celsius -O---K   122   109   000-28
> > > > 196 Reallocated_Event_Count -O--CK   200   200   000-0
> > > > 197 Current_Pending_Sector  -O--CK   200   200   000-1
> > > > 198 Offline_Uncorrectable   CK   200   200   000-1
> > > > 199 UDMA_CRC_Error_Count-O--CK   200   200   000-0
> > > > 200 Multi_Zone_Error_Rate   ---R--   200   200   000-5
> > > > ||_ K auto-keep
> > > > |__ C event count
> > > > ___ R error rate
> > > > ||| S speed/performance
> > > > ||_ O updated online
> > > > |__ P prefailure warning
> > > > 
> > > > [...]
> > > 
> > > The data up to this point informs us that you have 1 bad sector
> > > on a 3TB drive, that is actually an expected event given the data
> > > error rate on this stuff is such that your gona have these now
> > > and again.
> > > 
> > > Given you have 1 single event I would not suspect that this drive
> > > is dying, but it would be prudent to prepare for that possibility.  
> > 
> > Hello.
> > 
> > Well, I copied simply "one single event" that has been logged so far.
> > 
> > As you (and I) can see, it is error #42. After I posted here, a reboot has 
> > taken place
> > because the "repair" process on the Pool suddenly increased time and now 
> > I'm with
> > error #47, but interestingly, it is a new block that is damaged, but the 
> > SMART
> > attribute fields show this for now:  
> 
> Can you send the complete output of smartctl -a /dev/foo, I somehow missed
> that 40+ other errors had occured.


Yes, here it is, but please do not beat me due to its size ;-). It is "smartctl 
-x", that
shows me the errors. See file attached named "smart_ada.txt". It is everythin

Re: SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM status: ATA Status Error

2017-12-13 Thread Daniel Kalchev


> On 13 Dec 2017, at 1:26, Freddie Cash  wrote:
> 
> On Tue, Dec 12, 2017 at 2:55 PM, Rodney W. Grimes <
> freebsd-...@pdx.rh.cn85.dnsmgr.net> wrote:
> 
>> Hum, just noticed this.  25k hours power on, 2M load cycles, this is
>> very hard on a hard drive.  Your drive is going into power save mode
>> and unloading the heads.  Infact at a rate of 81 times per hour?
>> Oh, I can not believe that.  Either way we need to get this stopped,
>> it shall wear your drives out.
>> 
> 
> Believe it.  :)  The WD Green drives have a head parking timeout of 15
> seconds, and no way to disable that anymore.  You used to be able to boot
> into DOS and run the tler.exe program from WD to disable the auto-parking
> feature, but they removed that ability fairly quickly.
> 
> The Green drives are meant to be used in systems that spend most of their
> time idle.  Trying to use them in an always-on RAID array is just asking
> for trouble.  They are only warrantied for a couple hundred thousand head
> parkings or something ridiculous like that.  2 million puts it way out of
> the warranty coverage.  :(
> 
> We had 24 of them in a ZFS pool back when they were first released as they
> were very inexpensive.  They lead to more downtime and replacement costs
> than any other drive we've used since (or even before).  Just don't use
> them in any kind of RAID array or always-on system.
> 

To handle drives like this, and in general to get rid of load cycles, I
use smartd on all my ZFS pools with this piece of config:

DEVICESCAN -a -o off -e apm,off

It might not be the best solution, but since it is applied at boot, S.M.A.R.T.
attribute 193 (Load_Cycle_Count) no longer increases. I am not a fan of WD
drives, but I have a few tens of them, and all of them “behave” in one way or
another.

For the original question: if I do not have a spare disk to use as a
replacement, on a raidz1/raidz2 pool I would typically do:

zpool offline poolname disk
dd if=/dev/zero of=/dev/disk bs=1m
zpool replace poolname disk

This effectively fills the disk with zeros, forcing any suspected unreadable
blocks to be remapped. After this operation there are no more pending blocks
etc. On large drives/pools, however, the last step takes a few days to
complete. Over the years I have used this procedure on many drives, sometimes
more than once on the same drive, and it postponed having to replace the drive
and got rid of the annoying S.M.A.R.T. message. The message by itself might not
be a major problem, but it is better not to have the logs filled with warnings
all the time.

I feel more confident doing this on raidz2 vdevs anyway.
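
The three steps above can be written down as a dry-run helper that only prints what it would do, which makes it easy to double-check the device name before anything is overwritten. This is a sketch under assumptions: `rewrite_plan` is a hypothetical name, `tank` and `ada6` are example arguments, and the final `zpool status` line is added here only as a way to watch the resilver. If the pool was built on partitions or labels rather than whole disks, the device arguments would differ.

```shell
#!/bin/sh
# Dry run only: prints the commands, executes nothing.
# $1 = pool name, $2 = disk name as shown by "zpool status".
# rewrite_plan is a hypothetical helper for illustration.
rewrite_plan() {
    pool=$1
    disk=$2
    echo "zpool offline $pool $disk"
    echo "dd if=/dev/zero of=/dev/$disk bs=1m"   # destroys all data on the disk
    echo "zpool replace $pool $disk"
    echo "zpool status $pool"                     # shows resilver progress
}

rewrite_plan tank ada6
```

Reviewing the printed lines and then running them by hand keeps the destructive dd step under manual control.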

If I had a spare disk and a spare port, just

zpool replace poolname disk

Daniel
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM status: ATA Status Error

2017-12-12 Thread Freddie Cash
On Tue, Dec 12, 2017 at 2:55 PM, Rodney W. Grimes <
freebsd-...@pdx.rh.cn85.dnsmgr.net> wrote:

> Hum, just noticed this.  25k hours power on, 2M load cycles, this is
> very hard on a hard drive.  Your drive is going into power save mode
> and unloading the heads.  Infact at a rate of 81 times per hour?
> Oh, I can not believe that.  Either way we need to get this stopped,
> it shall wear your drives out.
>

Believe it.  :)  The WD Green drives have a head parking timeout of 15
seconds, and no way to disable that anymore.  You used to be able to boot
into DOS and run the tler.exe program from WD to disable the auto-parking
feature, but they removed that ability fairly quickly.
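
The rate quoted above can be checked from the two raw values in the posted attribute table (Load_Cycle_Count 2055746, Power_On_Hours 25339). A small sketch, with `cycles_per_hour` as a hypothetical helper name:

```shell
#!/bin/sh
# Average head load cycles per power-on hour.
# $1 = raw Load_Cycle_Count (attribute 193)
# $2 = raw Power_On_Hours   (attribute 9)
# cycles_per_hour is a hypothetical helper for illustration.
cycles_per_hour() {
    awk -v c="$1" -v h="$2" 'BEGIN { printf "%.1f\n", c / h }'
}

cycles_per_hour 2055746 25339   # prints 81.1
```

That is an average of one unload roughly every 44 seconds over the whole life of the drive.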

The Green drives are meant to be used in systems that spend most of their
time idle.  Trying to use them in an always-on RAID array is just asking
for trouble.  They are only warrantied for a couple hundred thousand head
parkings or something ridiculous like that.  2 million puts it way out of
the warranty coverage.  :(

We had 24 of them in a ZFS pool back when they were first released, as they
were very inexpensive.  They led to more downtime and replacement costs
than any other drive we've used since (or even before).  Just don't use
them in any kind of RAID array or always-on system.

-- 
Freddie Cash
fjwc...@gmail.com


Re: SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM status: ATA Status Error

2017-12-12 Thread Rodney W. Grimes
> On Tue, 12 Dec 2017 10:52:27 -0800 (PST),
> "Rodney W. Grimes" wrote:
> 
> 
> Thank you for answering that fast!
> 
> > > Hello,
> > > 
> > > running CURRENT (recent r326769), I realised that smartmond sends out 
> > > some console
> > > messages when booting the box:
> > > 
> > > [...]
> > > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Currently 
> > > unreadable
> > > (pending) sectors Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: 
> > > /dev/ada6, 1
> > > Offline uncorrectable sectors
> > > [...]
> > > 
> > > Checking the drive's SMART log with smartctl (it is one of four 3TB disk 
> > > drives), I
> > > gather these informations:
> > > 
> > > [... smartctl -x /dev/ada6 ...]
> > > Error 42 [17] occurred at disk power-on lifetime: 25335 hours (1055 days 
> > > + 15 hours)
> > >   When the command that caused the error occurred, the device was active 
> > > or idle.
> > > 
> > >   After command completion occurred, registers were:
> > >   ER -- ST COUNT  LBA_48  LH LM LL DV DC
> > >   -- -- -- == -- == == == -- -- -- -- --
> > >   40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 0xc27a7298 
> > > = 3262804632
> > > 
> > >   Commands leading to the command that caused the error were:
> > >   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  
> > > Command/Feature_Name
> > >   -- == -- == -- == == == -- -- -- -- --  ---  
> > > 
> > >   60 00 b0 00 88 00 00 c2 7a 73 20 40 08 23:38:12.195  READ FPDMA 
> > > QUEUED
> > >   60 00 b0 00 80 00 00 c2 7a 72 70 40 08 23:38:12.195  READ FPDMA 
> > > QUEUED
> > >   2f 00 00 00 01 00 00 00 00 00 10 40 08 23:38:12.195  READ LOG EXT
> > >   60 00 b0 00 70 00 00 c2 7a 73 20 40 08 23:38:09.343  READ FPDMA 
> > > QUEUED
> > >   60 00 b0 00 68 00 00 c2 7a 72 70 40 08 23:38:09.343  READ FPDMA 
> > > QUEUED
> > > [...]
> > > 
> > > and
> > > 
> > > [...]
> > > SMART Attributes Data Structure revision number: 16
> > > Vendor Specific SMART Attributes with Thresholds:
> > > ID# ATTRIBUTE_NAME  FLAGSVALUE WORST THRESH FAIL RAW_VALUE
> > >   1 Raw_Read_Error_Rate POSR-K   200   200   051-64
> > >   3 Spin_Up_TimePOS--K   178   170   021-6075
> > >   4 Start_Stop_Count-O--CK   098   098   000-2406
> > >   5 Reallocated_Sector_Ct   PO--CK   200   200   140-0
> > >   7 Seek_Error_Rate -OSR-K   200   200   000-0
> > >   9 Power_On_Hours  -O--CK   066   066   000-25339
> > >  10 Spin_Retry_Count-O--CK   100   100   000-0
> > >  11 Calibration_Retry_Count -O--CK   100   100   000-0
> > >  12 Power_Cycle_Count   -O--CK   098   098   000-2404
> > > 192 Power-Off_Retract_Count -O--CK   200   200   000-154
> > > 193 Load_Cycle_Count-O--CK   001   001   000-2055746
> > > 194 Temperature_Celsius -O---K   122   109   000-28
> > > 196 Reallocated_Event_Count -O--CK   200   200   000-0
> > > 197 Current_Pending_Sector  -O--CK   200   200   000-1
> > > 198 Offline_Uncorrectable   CK   200   200   000-1
> > > 199 UDMA_CRC_Error_Count-O--CK   200   200   000-0
> > > 200 Multi_Zone_Error_Rate   ---R--   200   200   000-5
> > > ||_ K auto-keep
> > > |__ C event count
> > > ___ R error rate
> > > ||| S speed/performance
> > > ||_ O updated online
> > > |__ P prefailure warning
> > > 
> > > [...]  
> > 
> > The data up to this point informs us that you have 1 bad sector
> > on a 3TB drive, that is actually an expected event given the data
> > error rate on this stuff is such that your gona have these now
> > and again.
> > 
> > Given you have 1 single event I would not suspect that this drive
> > is dying, but it would be prudent to prepare for that possibility.
> 
> Hello.
> 
> Well, I copied simply "one single event" that has been logged so far.
> 
> As you (and I) can see, it is error #42. After I posted here, a reboot has 
> taken place
> because the "repair" process on the Pool suddenly increased time and now I'm 
> with error
> #47, but interestingly, it is a new block that is damaged, but the SMART 
> attribute fields
> show this for now:

Can you send the complete output of smartctl -a /dev/foo?  I somehow missed
that 40+ other errors had occurred.

> [...]
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    69
>   3 Spin_Up_Time            POS--K   178   170   021    -    6075
>   4 Start_Stop_Count        -O--CK   098   098   000    -    2406
>   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0

I

Re: SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM status: ATA Status Error

2017-12-12 Thread Rodney W. Grimes
> Hello,
> 
> running CURRENT (recent r326769), I realised that smartmond sends out some 
> console
> messages when booting the box:
> 
> [...]
> Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Currently 
> unreadable
> (pending) sectors Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: 
> /dev/ada6, 1
> Offline uncorrectable sectors
> [...]
> 
> Checking the drive's SMART log with smartctl (it is one of four 3TB disk 
> drives), I
> gather these informations:
> 
> [... smartctl -x /dev/ada6 ...]
> Error 42 [17] occurred at disk power-on lifetime: 25335 hours (1055 days + 15 hours)
>   When the command that caused the error occurred, the device was active or idle.
>
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 0xc27a7298 = 3262804632
>
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   60 00 b0 00 88 00 00 c2 7a 73 20 40 08 23:38:12.195  READ FPDMA QUEUED
>   60 00 b0 00 80 00 00 c2 7a 72 70 40 08 23:38:12.195  READ FPDMA QUEUED
>   2f 00 00 00 01 00 00 00 00 00 10 40 08 23:38:12.195  READ LOG EXT
>   60 00 b0 00 70 00 00 c2 7a 73 20 40 08 23:38:09.343  READ FPDMA QUEUED
>   60 00 b0 00 68 00 00 c2 7a 72 70 40 08 23:38:09.343  READ FPDMA QUEUED
> [...]
> 
> and
> 
> [...]
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    64
>   3 Spin_Up_Time            POS--K   178   170   021    -    6075
>   4 Start_Stop_Count        -O--CK   098   098   000    -    2406
>   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
>   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
>   9 Power_On_Hours          -O--CK   066   066   000    -    25339
>  10 Spin_Retry_Count        -O--CK   100   100   000    -    0
>  11 Calibration_Retry_Count -O--CK   100   100   000    -    0
>  12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
> 192 Power-Off_Retract_Count -O--CK   200   200   000    -    154
> 193 Load_Cycle_Count        -O--CK   001   001   000    -    2055746
> 194 Temperature_Celsius     -O---K   122   109   000    -    28
> 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> 197 Current_Pending_Sector  -O--CK   200   200   000    -    1
> 198 Offline_Uncorrectable   ----CK   200   200   000    -    1
> 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> 200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    5
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
> 
> [...]

The data up to this point tells us that you have 1 bad sector
on a 3TB drive.  That is actually an expected event: given the data
error rate on this stuff, you are going to have these now
and again.

Given you have 1 single event I would not suspect that this drive
is dying, but it would be prudent to prepare for that possibility.
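
As a rough back-of-the-envelope check of "expected event": the sketch below estimates how many unrecoverable read errors one full read of a drive would produce on average. The rate of 1 error per 1e14 bits read is an assumption (a figure commonly printed on consumer drive datasheets), not something stated in this thread, and `expected_ure` is a hypothetical helper name.

```shell
#!/bin/sh
# Expected number of unrecoverable read errors for one full read of a drive.
# $1 = capacity in bytes
# $2 = bits read per unrecoverable error (datasheet figure, assumed here)
# expected_ure is a hypothetical helper for illustration.
expected_ure() {
    awk -v bytes="$1" -v bits_per_err="$2" \
        'BEGIN { printf "%.2f\n", bytes * 8 / bits_per_err }'
}

# 3 TB drive, assumed rate of 1 error per 1e14 bits read.
expected_ure 3000000000000 100000000000000   # prints 0.24
```

Under that assumption, about one in four complete passes over a 3TB drive would hit an unreadable sector, so a regular scrub finding one after a few years is unsurprising.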


> 
> The ZFS pool is RAIDZ1, comprised of 3 WD Green 3TB HDD and one WD RED 3 TB 
> HDD. The
> failure occured is on one of the WD Green 3 TB HDD.
Ok, so the data is redundantly protected.  This helps a lot.

> The pool is marked as "resilvered" - I do scrubbing on a regular basis and the
> "resilvering" message has now aapeared the second time in row. Searching the 
> net
> recommend on SMART attribute 197 errors, in my case it is one, and in 
> combination with
> the problems occured that I should replace the disk.

The RAIDZ is probably ending up in that state because the scrub is finding
a block it cannot read.
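
The block in question is identified in the quoted error log by its LBA. As a sketch, the hex value can be converted to decimal and turned into a one-sector read test of exactly that location. `lba_to_dec` is a hypothetical helper name, the dd line is only printed (not executed), and bs=512 assumes 512-byte logical sectors, which should be checked against the drive's identify data first.

```shell
#!/bin/sh
# Convert a hexadecimal LBA from the SMART error log to decimal.
# lba_to_dec is a hypothetical helper for illustration.
lba_to_dec() {
    echo "$((0x$1))"
}

lba=$(lba_to_dec c27a7298)
echo "$lba"   # prints 3262804632, matching the value in the error log

# Dry run: print (do not execute) a one-sector read test of that LBA.
# bs=512 assumes 512-byte logical sectors.
echo "dd if=/dev/ada6 of=/dev/null bs=512 skip=$lba count=1"
```

If the printed dd command is run and returns an I/O error, the sector is still unreadable; if it succeeds, the drive has since recovered or remapped it.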

> 
> Well, here comes the problem. The box is comprised from "electronical waste" 
> made by
> ASRock - it is a Socket 1150/IvyBridge board, which has its last 
> Firmware/BIOS update got
> in 2013 and since then UEFI booting FreeBSD from a HDD isn't possible (just 
> to indicate
> that I'm aware of having issues with crap, but that is some other issue right 
> now). The
> board's SATA connectors are all populated.
> 
> So: Due to the lack of adequate backup space I can only selectively backup 
> portions, most
> of the space is occupied by scientific modelling data, which I had worked on. 
> So backup
> exists! In one way or the other. My concern is how to replace the faulty HDD! 
> Most
> HowTo's indicate a replacement disk being prepared and then "replaced" via 
> ZFS's replace
> command. This is

Re: SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM status: ATA Status Error

2017-12-12 Thread O. Hartmann
On Tue, 12 Dec 2017 10:52:27 -0800 (PST),
"Rodney W. Grimes" wrote:


Thank you for answering that fast!

> > Hello,
> > 
> > running CURRENT (recent r326769), I realised that smartmond sends out some 
> > console
> > messages when booting the box:
> > 
> > [...]
> > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Currently 
> > unreadable
> > (pending) sectors Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: 
> > /dev/ada6, 1
> > Offline uncorrectable sectors
> > [...]
> > 
> > Checking the drive's SMART log with smartctl (it is one of four 3TB disk 
> > drives), I
> > gather these informations:
> > 
> > [... smartctl -x /dev/ada6 ...]
> > Error 42 [17] occurred at disk power-on lifetime: 25335 hours (1055 days + 
> > 15 hours)
> >   When the command that caused the error occurred, the device was active or 
> > idle.
> > 
> >   After command completion occurred, registers were:
> >   ER -- ST COUNT  LBA_48  LH LM LL DV DC
> >   -- -- -- == -- == == == -- -- -- -- --
> >   40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 0xc27a7298 = 
> > 3262804632
> > 
> >   Commands leading to the command that caused the error were:
> >   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  
> > Command/Feature_Name
> >   -- == -- == -- == == == -- -- -- -- --  ---  
> > 
> >   60 00 b0 00 88 00 00 c2 7a 73 20 40 08 23:38:12.195  READ FPDMA QUEUED
> >   60 00 b0 00 80 00 00 c2 7a 72 70 40 08 23:38:12.195  READ FPDMA QUEUED
> >   2f 00 00 00 01 00 00 00 00 00 10 40 08 23:38:12.195  READ LOG EXT
> >   60 00 b0 00 70 00 00 c2 7a 73 20 40 08 23:38:09.343  READ FPDMA QUEUED
> >   60 00 b0 00 68 00 00 c2 7a 72 70 40 08 23:38:09.343  READ FPDMA QUEUED
> > [...]
> > 
> > and
> > 
> > [...]
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME  FLAGSVALUE WORST THRESH FAIL RAW_VALUE
> >   1 Raw_Read_Error_Rate POSR-K   200   200   051-64
> >   3 Spin_Up_TimePOS--K   178   170   021-6075
> >   4 Start_Stop_Count-O--CK   098   098   000-2406
> >   5 Reallocated_Sector_Ct   PO--CK   200   200   140-0
> >   7 Seek_Error_Rate -OSR-K   200   200   000-0
> >   9 Power_On_Hours  -O--CK   066   066   000-25339
> >  10 Spin_Retry_Count-O--CK   100   100   000-0
> >  11 Calibration_Retry_Count -O--CK   100   100   000-0
> >  12 Power_Cycle_Count   -O--CK   098   098   000-2404
> > 192 Power-Off_Retract_Count -O--CK   200   200   000-154
> > 193 Load_Cycle_Count-O--CK   001   001   000-2055746
> > 194 Temperature_Celsius -O---K   122   109   000-28
> > 196 Reallocated_Event_Count -O--CK   200   200   000-0
> > 197 Current_Pending_Sector  -O--CK   200   200   000-1
> > 198 Offline_Uncorrectable   CK   200   200   000-1
> > 199 UDMA_CRC_Error_Count-O--CK   200   200   000-0
> > 200 Multi_Zone_Error_Rate   ---R--   200   200   000-5
> > ||_ K auto-keep
> > |__ C event count
> > ___ R error rate
> > ||| S speed/performance
> > ||_ O updated online
> > |__ P prefailure warning
> > 
> > [...]  
> 
> The data up to this point informs us that you have 1 bad sector
> on a 3TB drive, that is actually an expected event given the data
> error rate on this stuff is such that your gona have these now
> and again.
> 
> Given you have 1 single event I would not suspect that this drive
> is dying, but it would be prudent to prepare for that possibility.

Hello.

Well, I simply copied one single event out of those that have been logged so
far.

As you (and I) can see, it is error #42. After I posted here, a reboot took
place because the estimated time of the "repair" process on the pool suddenly
increased, and I am now at error #47. Interestingly, it is a new block that is
damaged, but the SMART attribute fields show this for now:

[...]
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    69
  3 Spin_Up_Time            POS--K   178   170   021    -    6075
  4 Start_Stop_Count        -O--CK   098   098   000    -    2406
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   066   066   000    -    25343
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
192 Power-Off_Retract_Count -O--C

Re: SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM status: ATA Status Error

2017-12-12 Thread Freddie Cash
On Tue, Dec 12, 2017 at 10:21 AM, O. Hartmann 
wrote:

>
> Question: is it possible to simply pull the faulty disk (implies I know
> exactly which one
> to pull!) and then prepare and add the replacement HDD and let the system
> do its job
> resilvering the pool?
>

zpool offline <pool> <disk>

Do that first.  That will mark the drive as offline, put the pool into a
degraded mode, and generally be less harmful to the system.

Then figure out which disk to pull and remove it (doing it from a powered
off state if needed).

Install the new drive, configure it however it's needed, then use:

zpool replace <pool> <old-disk> <new-disk>


> Next question is: I'm about to replace the 3 TB HDD with a more recent and
> modern 4 TB
> HDD (WD RED 4TB). I'm aware of the fact that I can only use 3 TB as the
> other disks are 3
> TB, but I'd like to know whether FreeBSD's ZFS is capable of handling it?
>

Yes, it can handle it just fine.  And it will keep the extra space as
"usable in the future", so if you replace all the drives with 4 TB ones,
the extra space will be added to the pool.
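
To put numbers on that, here is a rough sketch of the usable capacity of a single raidz1 vdev, taken as (number of disks - 1) times the smallest disk. It ignores metadata and padding overhead, so real figures will be somewhat lower, and `raidz1_usable_tb` is a hypothetical helper name.

```shell
#!/bin/sh
# Rough usable capacity of one raidz1 vdev: (disks - 1) * smallest disk.
# Arguments are disk sizes in whole TB; overhead is ignored.
# raidz1_usable_tb is a hypothetical helper for illustration.
raidz1_usable_tb() {
    min=$1
    n=0
    for size in "$@"; do
        [ "$size" -lt "$min" ] && min=$size
        n=$((n + 1))
    done
    echo $(( (n - 1) * min ))
}

raidz1_usable_tb 3 3 3 4   # one disk replaced with a 4 TB drive: prints 9
raidz1_usable_tb 4 4 4 4   # all four replaced: prints 12
```

Whether the pool grows on its own after the last replacement usually depends on the pool's autoexpand property (zpool get autoexpand); with it off, the devices can be expanded by hand with zpool online -e.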

-- 
Freddie Cash
fjwc...@gmail.com


Re: SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM status: ATA Status Error

2017-12-12 Thread Alan Somers
On Tue, Dec 12, 2017 at 11:21 AM, O. Hartmann 
wrote:

> Hello,
>
> running CURRENT (recent r326769), I realised that smartmond sends out some
> console
> messages when booting the box:
>
> [...]
> Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Currently
> unreadable
> (pending) sectors Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device:
> /dev/ada6, 1
> Offline uncorrectable sectors
> [...]
>
> Checking the drive's SMART log with smartctl (it is one of four 3TB disk
> drives), I
> gather these informations:
>
> [... smartctl -x /dev/ada6 ...]
> Error 42 [17] occurred at disk power-on lifetime: 25335 hours (1055 days +
> 15 hours)
>   When the command that caused the error occurred, the device was active
> or idle.
>
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 0xc27a7298 =
> 3262804632
>
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time
> Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---
> 
>   60 00 b0 00 88 00 00 c2 7a 73 20 40 08 23:38:12.195  READ FPDMA
> QUEUED
>   60 00 b0 00 80 00 00 c2 7a 72 70 40 08 23:38:12.195  READ FPDMA
> QUEUED
>   2f 00 00 00 01 00 00 00 00 00 10 40 08 23:38:12.195  READ LOG EXT
>   60 00 b0 00 70 00 00 c2 7a 73 20 40 08 23:38:09.343  READ FPDMA
> QUEUED
>   60 00 b0 00 68 00 00 c2 7a 72 70 40 08 23:38:09.343  READ FPDMA
> QUEUED
> [...]
>
> and
>
> [...]
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    64
>   3 Spin_Up_Time            POS--K   178   170   021    -    6075
>   4 Start_Stop_Count        -O--CK   098   098   000    -    2406
>   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
>   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
>   9 Power_On_Hours          -O--CK   066   066   000    -    25339
>  10 Spin_Retry_Count        -O--CK   100   100   000    -    0
>  11 Calibration_Retry_Count -O--CK   100   100   000    -    0
>  12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
> 192 Power-Off_Retract_Count -O--CK   200   200   000    -    154
> 193 Load_Cycle_Count        -O--CK   001   001   000    -    2055746
> 194 Temperature_Celsius     -O---K   122   109   000    -    28
> 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> 197 Current_Pending_Sector  -O--CK   200   200   000    -    1
> 198 Offline_Uncorrectable   ----CK   200   200   000    -    1
> 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> 200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    5
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
>
> [...]
>
> The ZFS pool is RAIDZ1, comprised of 3 WD Green 3TB HDD and one WD RED 3
> TB HDD. The
> failure occured is on one of the WD Green 3 TB HDD.
>
> The pool is marked as "resilvered" - I do scrubbing on a regular basis and
> the
> "resilvering" message has now aapeared the second time in row. Searching
> the net
> recommend on SMART attribute 197 errors, in my case it is one, and in
> combination with
> the problems occured that I should replace the disk.
>
> Well, here comes the problem. The box is comprised from "electronical
> waste" made by
> ASRock - it is a Socket 1150/IvyBridge board, which has its last
> Firmware/BIOS update got
> in 2013 and since then UEFI booting FreeBSD from a HDD isn't possible
> (just to indicate
> that I'm aware of having issues with crap, but that is some other issue
> right now). The
> board's SATA connectors are all populated.
>
> So: Due to the lack of adequate backup space I can only selectively backup
> portions, most
> of the space is occupied by scientific modelling data, which I had worked
> on. So backup
> exists! In one way or the other. My concern is how to replace the faulty
> HDD! Most
> HowTo's indicate a replacement disk being prepared and then "replaced" via
> ZFS's replace
> command. This isn't applicable here.
>
> Question: is it possible to simply pull the faulty disk (implies I know
> exactly which one
> to pull!) and then prepare and add the replacement HDD and let the system
> do its job
> resilvering the pool?
>

Absolutely.  If you don't know which disk to pull, then it's better to
power down and check serial numbers.  After you power back on, you can
replace the disk with a command like this:
zpool replace <pool> <missing-disk-guid> <new-disk>
The missing disk guid can be obta