Re[2]: Re[2]: AHCI timeout when using ZFS + AIO + NCQ

2013-01-30 Thread Vladislav Prodan



 I once ran into a very severe AHCI timeout problem. After months of trying to
 figure it out, and insane Hardware_ECC_Recovered error values, I found that
 the fault was in the power connector plug / SATA HDD interface. All errors
 disappeared after replacing that cable. Since you have errors on more than
 one HDD, I suggest:
 1. Check the smartctl output for each and every HDD.
 2. Check whether your power supply unit is still healthy or whether it is
 supplying inconsistent power.
 3. Check the main power supply line and whether it shows any voltage
 fluctuations, or whether there is a new heavy consumer of amps on the same
 power line that the server is plugged into.
 
 

I deliberately chose a different server, one with a different chipset and
whose HDDs had shown no problems.

Added kernel support:
device ahci # AHCI-compatible SATA controllers

And now, after 2.5 days, one HDD has dropped off.

[3:14]beastie:root-/root# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has been removed by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
  scan: none requested
config:

        NAME                     STATE     READ WRITE CKSUM
        tank                     DEGRADED     0     0     0
          mirror-0               ONLINE       0     0     0
            gpt/disk0            ONLINE       0     0     0
            gpt/disk2            ONLINE       0     0     0
          mirror-1               DEGRADED     0     0     0
            gpt/disk1            ONLINE       0     0     0
            4931885954389536913  REMOVED      0     0     0  was /dev/gpt/disk3

errors: No known data errors
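
When several pools or servers need checking, the same information can be pulled out of `zpool status` text programmatically. The following is only a minimal sketch: the function name `unhealthy_devices` and the regular expression are illustrative assumptions, the sample is the config section from the output above with its normal column spacing, and the pattern only handles plain integer error counters.

```python
import re

# Sample input: the config section of the `zpool status` output above.
SAMPLE = """\
config:

        NAME                     STATE     READ WRITE CKSUM
        tank                     DEGRADED     0     0     0
          mirror-0               ONLINE       0     0     0
            gpt/disk0            ONLINE       0     0     0
            gpt/disk2            ONLINE       0     0     0
          mirror-1               DEGRADED     0     0     0
            gpt/disk1            ONLINE       0     0     0
            4931885954389536913  REMOVED      0     0     0  was /dev/gpt/disk3

errors: No known data errors
"""

# Row layout: name, state, three integer counters, optional trailing note.
# The header row does not match because READ/WRITE/CKSUM are not digits.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)(?:\s+(.*))?$")


def unhealthy_devices(status_text):
    """Return (name, state, note) for every row whose state is not ONLINE."""
    found = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m and m.group(2) != "ONLINE":
            found.append((m.group(1), m.group(2), m.group(6) or ""))
    return found


if __name__ == "__main__":
    for name, state, note in unhealthy_devices(SAMPLE):
        print(name, state, note)
```

In practice the text would come from running `zpool status` on the host rather than from an embedded sample.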


Jan 30 09:49:28 beastie kernel: ahcich3: Timeout on slot 29 port 0
Jan 30 09:49:28 beastie kernel: ahcich3: is  cs 2000 ss  rs 2000 tfd c0 serr  cmd 0004dd17
Jan 30 09:49:28 beastie kernel: (ada3:ahcich3:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jan 30 09:49:28 beastie kernel: (ada3:ahcich3:0:0:0): CAM status: Command timeout
Jan 30 09:49:28 beastie kernel: (ada3:ahcich3:0:0:0): Retrying command
Jan 30 09:51:31 beastie kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 0080)
Jan 30 09:51:31 beastie kernel: ahcich3: Timeout on slot 29 port 0
Jan 30 09:51:31 beastie kernel: ahcich3: is  cs 2000 ss  rs 2000 tfd 80 serr  cmd 0004dd17
Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): Error 5, Retry was blocked
Jan 30 09:51:31 beastie kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 0080)
Jan 30 09:51:31 beastie kernel: ahcich3: Timeout on slot 29 port 0
Jan 30 09:51:31 beastie kernel: ahcich3: is  cs  ss  rs 2000 tfd 58 serr  cmd 0004dd17
Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Jan 30 09:51:31 beastie kernel: (aprobe0:ahcich3:0:0:0): Error 5, Retry was blocked
Jan 30 09:51:31 beastie kernel: (ada3:ahcich3:0:0:0): lost device
Jan 30 09:51:31 beastie kernel: (pass3:ahcich3:0:0:0): passdevgonecb: devfs entry is gone
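
To compare several servers, it can help to tally which AHCI channels time out and which devices are lost. The following is a rough sketch only: the function name `summarize` and both regular expressions are illustrative assumptions based on the message formats visible in the log above, and the sample is a short excerpt of that log.

```python
import re
from collections import Counter

# Excerpt of the kernel messages quoted above.
SAMPLE_LOG = """\
Jan 30 09:49:28 beastie kernel: ahcich3: Timeout on slot 29 port 0
Jan 30 09:49:28 beastie kernel: (ada3:ahcich3:0:0:0): CAM status: Command timeout
Jan 30 09:51:31 beastie kernel: ahcich3: Timeout on slot 29 port 0
Jan 30 09:51:31 beastie kernel: ahcich3: Timeout on slot 29 port 0
Jan 30 09:51:31 beastie kernel: (ada3:ahcich3:0:0:0): lost device
"""

# "ahcichN: Timeout on slot S port P"
TIMEOUT = re.compile(r"\b(ahcich\d+): Timeout on slot (\d+) port (\d+)")
# "(adaN:ahcichN:0:0:0): lost device"
LOST = re.compile(r"\((\w+):(ahcich\d+):[\d:]+\): lost device")


def summarize(log_text):
    """Count timeouts per AHCI channel and list (device, channel) losses."""
    timeouts = Counter()
    lost = []
    for line in log_text.splitlines():
        m = TIMEOUT.search(line)
        if m:
            timeouts[m.group(1)] += 1
            continue
        m = LOST.search(line)
        if m:
            lost.append((m.group(1), m.group(2)))
    return timeouts, lost


if __name__ == "__main__":
    counts, lost = summarize(SAMPLE_LOG)
    for channel, n in sorted(counts.items()):
        print(channel, "timeouts:", n)
    for device, channel in lost:
        print("lost:", device, "on", channel)
```

Feeding it the contents of /var/log/messages from each affected server would show whether the timeouts cluster on one channel or are spread across the controller.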


-- 
Vladislav V. Prodan
System & Network Administrator
http://support.od.ua   
+380 67 4584408, +380 99 4060508
VVP88-RIPE

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re[2]: Re[2]: AHCI timeout when using ZFS + AIO + NCQ

2013-01-27 Thread Vladislav Prodan


 - Original Message - 
 From: Vladislav Prodan univers...@ukr.net
 
  Is it always the same disk? If so, replace it. SMART helps identify issues
  but doesn't tell you 100% that there's no problem.
  
  
  Now a different HDD has dropped off: ada0.
  I'm 99% sure that MHDD will not find problems on either HDD, ada0 or ada2.
  I still have three servers with similar chipsets that have similar problems
  with AHCI timeouts.
 
 I notice your disks are connecting at SATA 3.x, which rings bells. We had
 a very similar issue on a new Supermicro machine here and after much
 testing we proved to our satisfaction that the problem was the HW.


My motherboard is an ASUS M5A97 PRO:
http://www.asus.com/Motherboard/M5A97_PRO/#specifications
The SATA data cables have already been replaced.
Installing a hardware RAID controller does not guarantee that the data can be
recovered if the controller dies.
 
 Essentially the combination of SATA 3 speeds and the midplane / backplane
 degraded the connection between the MB and HDD enough to cause
 the disks to randomly drop when under load.
 
 If we connected the disks directly to the MB with SATA cables the
 problem went away. In the end we had the midplanes changed from an
 AHCI pass-through to an active LSI controller.
 
 So if you have any sort of midplane / backplane connecting your disks
 try connecting them direct to the MB / controller via known SATA 3.x
 compliant cables and see if that stops the drops.
 
 Another test you can do is to force the disks to connect at SATA 2.x.
 This also fixed it in our case, but wasn't something we wanted to
 put into production, hence the controller swap.
 
 To force SATA 2 speeds you can use the following in /boot/loader.conf,
 where 'X' is the disk identifier, e.g. for ada0 X = 0:
 hint.ahcich.X.sata_rev=2
 
 Hope this helps.
 
 Regards
 Steve
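
For several disks, the hint above has to be repeated once per channel. The following is a small sketch that generates those lines. The helper name `sata_rev_hints` is hypothetical, and the channel list 0, 2, 3 assumes that the ahcich number matches the ada number, as it does for ada3 on ahcich3 in the log earlier in this thread. Strictly, X in the hint names the ahcich channel, so it is worth confirming the mapping in dmesg before relying on it.

```python
def sata_rev_hints(channels, rev=2):
    """Build /boot/loader.conf lines limiting the SATA revision per channel.

    channels: iterable of ahcich channel numbers (assumed, not detected).
    rev: 1, 2 or 3, i.e. 1.5, 3 or 6 Gbps.
    """
    if rev not in (1, 2, 3):
        raise ValueError("sata_rev must be 1, 2 or 3")
    # De-duplicate and sort so the output is stable.
    return [f"hint.ahcich.{ch}.sata_rev={rev}" for ch in sorted(set(channels))]


if __name__ == "__main__":
    # ada0, ada2 and ada3 are the disks mentioned in this thread.
    print("\n".join(sata_rev_hints([0, 2, 3])))
```

The printed lines can be appended to /boot/loader.conf; they take effect on the next boot.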
 

-- 
Vladislav V. Prodan
System & Network Administrator
http://support.od.ua   
+380 67 4584408, +380 99 4060508
VVP88-RIPE
