Dear Mr. Long,
During the last week or so, I've been trying to get more information
about the problem in 4.8-RELEASE that this thread has been about.
Yesterday, just when I thought that maybe I had some interesting data
worth sending to you, I noticed that 4.9-RC1 was out. So I tested that for the
symptoms, and I have to admit that
IT WORKS IN 4.9-RC1! Wonderful!
I.e., the machine does boot from a rebuilding array and array degradation
at runtime doesn't make the machine hang because of storage failure.
All of that with SMP, APIC_IO and HT enabled.
No need to include the irrelevant ISP driver (see the attachments for
an explanation of this comment).
In the newsgroups, I've noticed other people complaining about various
aac-based hardware under FreeBSD.
I am aware that there have been changes to dev/aac/* between 4.8 and 4.9.
I'm not sure whether you have managed to squash the bug, or if the remedy
was incidental, and whether or not the bug was in the aac drivers or
elsewhere in the system (APIC handling? DMA mapping?).
Therefore, just in case you are interested, the information I gathered in
4.8 is attached to this message.
Hmm. Now that this is solved, I'd like to focus on the defunct aaccli.
I guess I'd better start another thread related to that.
Thanks for the great job that you're doing in the FreeBSD team.
And, thanks for your patience with me.
Frank Rysanek
The following problem description applies exclusively to
4.8-RELEASE.
I have managed to carry out further research related to the topic of this
e-mail thread, in the original 4.8-RELEASE. I have found some more
deterministic symptoms of the problem and one other dependency apart from
SMP+APIC_IO, but I'm stuck again.
A detailed explanation follows.
PROBLEM SUMMARY
---
With the GENERIC kernel, the ASR2120 (driver AAC) works fine under all
circumstances.
With SMP+APIC_IO enabled or with device isp disabled, the controller
and the driver work fine as long as the array volumes are fine.
When an array becomes degraded at runtime, or when booting off a degraded
(especially rebuilding) array, the system crashes miserably.
The bottom line is that my ASR2120 is not fault tolerant under SMP.
DISCOVERED CFG DEPENDENCIES
---
The ASR2120 controller and the aac driver WORK FINE only when both of these
conditions hold:
1) SMP+APIC_IO are disabled (a UP-only kernel)
2) 'device isp' is _enabled_ (though its PCI probe doesn't find anything)
And no, I don't have any Qlogic chips in my machine.
SYMPTOMS
---
1) the zero-padded FIB AKA unknown command from controller upon
runtime disk fault or when booting from a rebuilding array.
This is the original symptom.
2) During the kernel boot sequence, still with interrupts disabled, when
aac_startup() probes for containers, it finds none - as a result of
the controller being stuck as per symptom 3).
3) let's focus on booting: during aac_init(), when the controller
is notified of a ready mailbox for the first time, the controller
pukes - the drive LEDs start flashing red and symptom 2) follows.
Once interrupts are enabled, symptom 1) follows.
I have discovered the precise moment when the controller pukes
by inserting debug messages and DELAY(10 s) statements at various
points in the code of aac_attach(), aac_init(), aac_sync_command()
etc...
Obviously the red flashing doesn't occur in the workable setup
- the controller keeps rebuilding the array(s) merrily throughout
FreeBSD boot.
4) with the working setup (see the CFG DEPENDENCIES section above),
the MMIO assumes a different config than with any defunct setup.
The physical address of the FIBs (or whatever it is) is different,
I don't know why. See the two log snippets below. See also the
attached tarball for more complete boot logs.
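One way to compare the two log snippets is to check whether the address handed to the controller in the mailbox matches the other logged values. In the working-setup log below, arg0 (1d000) equals aac_common_busaddr (1c000) plus the ac_init offset (1000). The following is only a minimal userland sketch of that arithmetic, assuming this relationship is the intended one; the helper names `init_struct_busaddr()` and `log_is_consistent()` are hypothetical and do not come from aac.c.

```c
#include <stdint.h>
#include <stdio.h>

/* Values copied from the working-setup log (all hexadecimal). */
#define LOG_AAC_COMMON_BUSADDR 0x1c000u /* "aac_common_busaddr = 1c000" */
#define LOG_AC_INIT_OFFSET     0x1000u  /* "ac_init offset = 1000"      */
#define LOG_MAILBOX_ARG0       0x1d000u /* "arg0: 1d000"                */

/*
 * Hypothetical helper: bus address of the init structure, assuming it
 * lives at a fixed offset inside the shared "common" area.
 */
static uint32_t
init_struct_busaddr(uint32_t common_busaddr, uint32_t init_offset)
{
	return (common_busaddr + init_offset);
}

/*
 * Hypothetical helper: returns 1 when the mailbox argument matches
 * busaddr + offset, 0 otherwise.
 */
static int
log_is_consistent(uint32_t common_busaddr, uint32_t init_offset,
    uint32_t mailbox_arg0)
{
	return (init_struct_busaddr(common_busaddr, init_offset) ==
	    mailbox_arg0);
}

/* Print one comparison line for a set of logged values. */
static void
report(const char *label, uint32_t common_busaddr, uint32_t init_offset,
    uint32_t mailbox_arg0)
{
	printf("%s: busaddr+offset = %x, arg0 = %x -> %s\n", label,
	    (unsigned)init_struct_busaddr(common_busaddr, init_offset),
	    (unsigned)mailbox_arg0,
	    log_is_consistent(common_busaddr, init_offset, mailbox_arg0) ?
	    "consistent" : "MISMATCH");
}
```

Calling `report("UP+isp", LOG_AAC_COMMON_BUSADDR, LOG_AC_INIT_OFFSET, LOG_MAILBOX_ARG0)` from a small `main()` checks the working setup; the same call with the values from a failing boot log would show whether only the base address moved or whether the mailbox argument itself is off.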
The following are some variable dumps, logged by instrumentation that
I have inserted into aac.c.
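For readers who want to reproduce this kind of tracing, the instrumentation could look roughly like the sketch below. It is a userland approximation, not the code that was actually added to aac.c: `frr_trace()`, `FRR_PAUSE_US` and the counters are hypothetical names, and `DELAY()` is stubbed out so the example runs outside the kernel. In the kernel, DELAY() takes its argument in microseconds, so a 10 s pause is written as 10 * 1000000.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Userland stand-in for the kernel's DELAY(), which busy-waits for the
 * given number of microseconds. Here the request is only recorded.
 */
static unsigned long frr_delay_total_us = 0;
#define DELAY(us) (frr_delay_total_us += (unsigned long)(us))

/* 10 seconds, expressed in microseconds as DELAY() expects. */
#define FRR_PAUSE_US (10UL * 1000UL * 1000UL)

/* Number of checkpoints passed so far. */
static int frr_checkpoints = 0;

/*
 * Hypothetical helper: print a tagged message, then pause long enough
 * to watch the controller's LEDs before the next step runs.
 */
static void
frr_trace(const char *fmt, ...)
{
	va_list ap;

	printf("FRR: ");
	va_start(ap, fmt);
	vprintf(fmt, ap);
	va_end(ap);
	printf("\n");

	frr_checkpoints++;
	DELAY(FRR_PAUSE_US);
}

/* Stand-in for an instrumented init path; not the real driver code. */
static void
fake_init_path(void)
{
	frr_trace("aac_init(): initing controller");
	frr_trace("aac_sync_command() called.");
	frr_trace("command: %x", 5);
}
```

With a pause after every message, the last line printed before the LEDs change is the step that upset the controller, which is how a single mailbox notification can be singled out.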
This is the workable setup (UP, 'device isp' enabled):
FRR: generic attach - aac_attach() called
FRR: Disabling interrupts.
FRR:aac_init(): initing controller
FRR: -- Init structure contents: --
FRR:aac_common is at e198
FRR:ac_init is at e1981000
FRR:InitStructRevision = 3
FRR:MiniPortRevision = 1
FRR:AdapterFibsPhysicalAddress = 1c000
FRR:AdapterFibsVirtualAddress = e198
FRR:AdapterFibsSize = 1000
FRR:AdapterFibAlign = 200
FRR:PrintfBufferAddress = 1f184
FRR:PrintfBufferSize = 100
FRR:HostPhysMemPages = 3fccf
FRR:HostElapsedSeconds = 0
FRR: setting the outbound doorbell register to all one's.
FRR:aac_common_busaddr = 1c000
FRR:ac_init offset = 1000
FRR: aac_sync_command() called.
FRR: - populating the mailbox...
FRR:aac_rx_set_mailbox() called.
FRR:btag: 1
FRR:handle: dd96e000
FRR:command: 5
FRR:arg0: 1d000
FRR:arg1: 0
FRR:arg2: 0
FRR:arg3: 0
FRR: - clearing the