My configuration is as follows:

FreeBSD 8.2-RELEASE
Supermicro X8DTi-LN4F (Intel Tylersburg 5520 chipset) motherboard
24 GB system memory
32 x Hitachi Deskstar 5K3000 disks connected to 4 x Intel SASUC8I (LSI 3081E-R) 
in IT mode
2 x Crucial M4 64 GB SATA SSD for FreeBSD OS (zroot)
2 x Intel 320 MLC 80 GB SATA SSD for L2ARC and swap
The SSDs are connected to the on-board SATA ports on the motherboard

This system was commissioned in February of 2012 and ran without issue as a ZFS 
backup system on our network until about 3 weeks ago.

At that time I started getting kernel panics due to timeouts to the on-board 
SATA devices. The only change to the system since it was built was to add an 
SSD for swap (32 GB swap device), and this issue did not appear until several 
months after that was added.

My initial thought was that I might have a bad SSD drive so I swapped out one 
of the Crucial SSD drives and the problem happened again a few days later.

I then moved to systematically replacing items such as SATA cables, memory, 
motherboard, etc., and the problem continued. For example, I swapped out the 4 
SATA cables with brand new SATA cables and waited to see if the problem 
happened again. Once it did, I moved on to replacing the motherboard with an 
identical motherboard, waited, and so on.

I could not find an obvious hardware-related explanation for this behavior, so 
about a week and a half ago I did a fresh install of FreeBSD 9.0-RELEASE to 
move from the ATA driver to the AHCI driver, as I found some evidence that this 
was helpful.

The problem continued with something like this:

ahcich0: Timeout on slot 29 port 0
ahcich0: is 000000000 cs 00000000 ss e0000000 rs e0000000 tfd 40 serr 00000000 
cmd 0004df17

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 
cmd 0004df17
(ada0:ahcich0:0:0:0): lost device

ahcich0: AHCI reset: device not ready after 3100ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000003 ss 800000003 rs 80000003 tfd 80 serr 0000000 
cmd 0004df17
(ada0:ahcich0:0:0:0): removing device entry

ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 1 port 0
ahcich0: is 00000000 cs 00000002 ss 000000000 rs 0000002 tfd 80 serr 00000000 
cmd 004c117
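
For anyone reading the output above: in AHCI the low byte of the port task 
file data register (tfd) mirrors the ATA status register, where 0x80 is BSY 
and 0x40 is DRDY. The following is only a rough sketch for decoding those 
values. decode_tfd is a made-up helper name, not a FreeBSD tool, and it only 
knows the four status bits listed in the comments.

```shell
#!/bin/sh
# Hypothetical helper: decode the "tfd" value printed by the ahci driver.
# Assumes the value is hexadecimal without a 0x prefix, as in the log lines.
decode_tfd() {
    val=$((0x$1))
    status=$((val & 0xff))            # low byte: ATA status register
    error=$(( (val >> 8) & 0xff ))    # next byte: ATA error register
    flags=""
    if [ $((status & 0x80)) -ne 0 ]; then flags="$flags BSY"; fi   # device busy
    if [ $((status & 0x40)) -ne 0 ]; then flags="$flags DRDY"; fi  # device ready
    if [ $((status & 0x08)) -ne 0 ]; then flags="$flags DRQ"; fi   # data request
    if [ $((status & 0x01)) -ne 0 ]; then flags="$flags ERR"; fi   # error
    flags=${flags# }
    printf 'status=0x%02x error=0x%02x flags=%s\n' \
        "$status" "$error" "${flags:-none}"
}
```

With the values from the log, "decode_tfd 40" reports DRDY and "decode_tfd 80" 
reports BSY, which would be consistent with the device staying busy across the 
resets.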

When this happens, the only way to recover the system is usually a hard power 
cycle via IPMI (yanking the power rather than hitting reset). Not every 
occurrence requires it, but more often than not a plain reset is not enough, 
because the on-board AHCI portion of the BIOS does not always see the disks 
again until the system has been fully power cycled.

I have done a fair amount of searching on this and have seen the issue 
reported for both FreeNAS and FreeBSD, but with no clear-cut resolution in 
terms of how to address it or what causes it. Some people had a bad SSD, 
others had to disable NCQ or power management on their SSD, others pointed at 
particular brands of SSD (Samsung), etc.

Nothing conclusive so far.

At the present time the issue happens every 1-2 hours unless I have the 
following in my /boot/loader.conf after the ahci_load statement:

ahci_load="YES"

# See ahci(4)
hint.ahcich.0.sata_rev=1
hint.ahcich.1.sata_rev=1
hint.ahcich.2.sata_rev=1
hint.ahcich.3.sata_rev=1

hint.ahcich.0.pm_level=1
hint.ahcich.1.pm_level=1
hint.ahcich.2.pm_level=1
hint.ahcich.3.pm_level=1
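
As documented in ahci(4), sata_rev=1 caps the link at 1.5 Gbps, and pm_level=0 
is the value that disables interface power management, while pm_level=1 still 
lets the device initiate power state changes. This is worth double-checking 
against the man page for the release in use. The sketch below just generates 
the hint lines for a given number of channels so that different values are 
easy to try. ahci_hints is a hypothetical helper name, and its three arguments 
(channel count, sata_rev, pm_level) are assumptions of this example.

```shell
#!/bin/sh
# Hypothetical helper: print loader.conf hints for ahcich0..ahcich(N-1).
# usage: ahci_hints <channels> <sata_rev> <pm_level>
ahci_hints() {
    nchan=$1
    rev=$2
    pm=$3
    i=0
    while [ "$i" -lt "$nchan" ]; do
        printf 'hint.ahcich.%d.sata_rev=%d\n' "$i" "$rev"
        printf 'hint.ahcich.%d.pm_level=%d\n' "$i" "$pm"
        i=$((i + 1))
    done
}
```

For example, "ahci_hints 4 1 0" prints the same eight hints as above but with 
pm_level=0. The output would still have to be pasted into /boot/loader.conf 
by hand.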

I have a script in /usr/local/etc/rc.d which disables NCQ on these drives:

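#!/bin/sh
# Limit each on-board SATA device to a single outstanding command (tag),
# which effectively disables NCQ on it.

CAMCONTROL=/sbin/camcontrol

$CAMCONTROL tags ada0 -N 1 > /dev/null
$CAMCONTROL tags ada1 -N 1 > /dev/null
$CAMCONTROL tags ada2 -N 1 > /dev/null
$CAMCONTROL tags ada3 -N 1 > /dev/null

exit 0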
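
A possible variation on the script above, as a sketch only: the same 
"camcontrol tags <dev> -N 1" call wrapped in a loop that reports any device 
where the call fails, since the original discards all output. limit_tags is a 
hypothetical function name, and the CAMCONTROL override is there purely so the 
function can be exercised without real hardware.

```shell
#!/bin/sh
# Path to camcontrol; can be overridden from the environment for a dry run.
CAMCONTROL=${CAMCONTROL:-/sbin/camcontrol}

# Hypothetical helper: limit every device given as an argument to one tag.
# Returns non-zero if camcontrol failed for at least one device.
limit_tags() {
    rc=0
    for dev in "$@"; do
        if ! "$CAMCONTROL" tags "$dev" -N 1 > /dev/null; then
            echo "warning: could not limit tags on $dev" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example invocation, commented out so the sketch has no side effects:
# limit_tags ada0 ada1 ada2 ada3
```

Afterwards "camcontrol tags ada0 -v" should show the tag settings currently in 
effect for a device.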

I went ahead and pulled the Intel SSDs, as they were showing ASR and hardware 
reset counts that kept incrementing. Removing both of these disks from the 
system did not change the situation.

The combination of /boot/loader.conf and this script gets me 6 days or so of 
operation before the issue pops up again. If I remove these two items I get 
maybe 2 hours before the issue happens again.

Right now I'm down to one OS disk and one swap disk and that is it for SSD 
disks on the system.

At the last reboot (yesterday) I disabled APM on the disks (ada0 and ada1 at 
this point) to see if that makes a difference as I found a reference to this 
being a potential problem.

I'm looking for insight/help on this as I'm about out of options. If there is a 
way to gather more information when this happens, post it up, etc., I'm open 
to trying it.
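
One low-effort way to summarize whatever does get logged is to tally the 
timeout messages per AHCI channel. The following is a sketch under 
assumptions, not something run on the system above: count_ahci_timeouts is a 
made-up name, and the pattern only matches the "Timeout on slot" and "Poll 
timeout on slot" messages quoted earlier.

```shell
#!/bin/sh
# Hypothetical helper: count AHCI timeout messages per channel.
# Reads the files given as arguments, or standard input if there are none.
count_ahci_timeouts() {
    awk '
        match($0, /ahcich[0-9]+: (Poll timeout|Timeout) on slot [0-9]+/) {
            ch = substr($0, RSTART, RLENGTH)   # e.g. "ahcich0: Timeout on slot 29"
            sub(/:.*/, "", ch)                 # keep only the channel name
            count[ch]++
        }
        END { for (ch in count) print ch, count[ch] }
    ' "$@" | sort
}
```

It could be fed from "dmesg | count_ahci_timeouts" or pointed at a saved 
console log. If the lost device is the OS disk, the messages may never reach 
/var/log/messages, so a capture of the IPMI or serial console is probably the 
more reliable input.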

What is driving me crazy is that I can't seem to come up with a concrete 
explanation as to why now and not back when the system was built. The issue 
only seems to happen when the system is idle, and the SSDs do not see much 
action other than hosting the OS, scripts, etc., while the drives on the 
Intel/LSI controllers are where the actual I/O happens.

The system logs do not show anything prior to the event. Afterwards the OS 
still responds to ping requests, and an active SSH session stays connected 
until you attempt to run something like 'ls', 'ps', etc.

New SSH requests to the system get 'connection refused'.

As far as I can see I have three real options left:

* Hope that someone here knows something I don't
* Ditch SSD for straight SATA disks (I plan on doing this next week, before 
the next occurrence, which is likely sometime Wednesday morning), as perhaps 
there is some odd SATA/SSD interaction with FreeBSD or with the controller 
that I'm not aware of (I haven't had this happen with plain SATA disks and 
FreeBSD before)
* Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended purpose of 
this system

I'm open to suggestions, direction, etc to see if I can nail down what is going 
on and put this issue to bed for not only myself but for anyone else who might 
run into it before I lose what little hair and sanity I have left...heh

- Nate
_______________________________________________
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "freebsd-questions-unsubscr...@freebsd.org"
