Re: FreeBSD 4.8, ASR2120, SMP, degraded RAID1/mirror = storage failure

2003-10-02 Thread rysanek
Dear Mr. Long,

during the last week or so, I've been trying to get more information
about the problem in 4.8-RELEASE that this thread has been about.

Yesterday, just when I thought that maybe I had some interesting data
worth sending to you, I noticed that 4.9-RC1 was out. So I tested for the
symptoms in that and I have to admit that

   IT WORKS IN 4.9-RC1 !   Wonderful!

I.e., the machine does boot from a rebuilding array and array degradation
at runtime doesn't make the machine hang because of storage failure.
All of that with SMP, APIC_IO and HT enabled.
No need to include the irrelevant ISP driver (see the attachments for
an explanation of this comment).


In the newsgroups, I've noticed other people complaining about various
aac-based hardware under FreeBSD.
I am aware that there have been changes to dev/aac/* between 4.8 and 4.9.
I'm not sure whether you have managed to squash the bug, or if the remedy
was incidental, and whether or not the bug was in the aac drivers or
elsewhere in the system (APIC handling? DMA mapping?).
Therefore, just in case you were interested, the information I gathered in
4.8 is attached to this message.

Hmm. Now that this is solved, I'd like to focus on the defunct aaccli.
I guess I'd better start another thread related to that.

Thanks for the great job that you're doing in the FreeBSD team.
And, thanks for your patience with me.

Frank Rysanek

The following problem description applies exclusively to
4.8-RELEASE.

I have managed to carry out further research related to the topic of this 
e-mail thread, in the original 4.8-RELEASE. I have found some more 
deterministic symptoms of the problem, one other dependency apart from 
SMP+APIC_IO, but I'm stuck again.
A detailed explanation follows.


PROBLEM SUMMARY
---
With the GENERIC kernel, the ASR2120 (driver AAC) works fine under all 
circumstances.
With SMP+APIC_IO enabled or with device isp disabled, the controller
and the driver work fine as long as the array volumes are fine.
When an array becomes degraded at runtime, or when booting off a degraded 
(especially rebuilding) array, the system crashes miserably.
The bottom line is that my ASR2120 is not fault tolerant under SMP.


DISCOVERED CFG DEPENDENCIES
---
The ASR2120 controller and the aac driver WORK FINE under these 
conditions:
1) SMP+APIC_IO are disabled (a UP-only kernel)
2) 'device isp' is _enabled_ (though its PCI probe doesn't find anything)

And no, I don't have any Qlogic chips in my machine.


SYMPTOMS
---

1) a zero-padded FIB, logged as "unknown command from controller",
   upon a runtime disk fault or when booting from a rebuilding array.
   This is the original symptom.

2) During the kernel boot sequence, still with interrupts disabled, when 
   aac_startup() probes for containers, it finds none - as a result of
   the controller being stuck as per symptom 3).

3) let's focus on booting: during aac_init(), when the controller
   is notified of a ready mailbox for the first time, the controller
   pukes - the drive LEDs start flashing red and symptom 2) follows.
   Once interrupts are enabled, symptom 1) follows.
   I have discovered the precise moment when the controller pukes
   by inserting debug messages and DELAY(10 s) statements at various
   points in the code of aac_attach(), aac_init(), aac_sync_command()
   etc...
   Obviously the red flashing doesn't occur in the workable setup
   - the controller keeps rebuilding the array(s) merrily throughout
   FreeBSD boot.

4) with the working setup (see the CFG DEPENDENCIES section above),
   the MMIO region ends up configured differently than with any
   failing setup.
   The physical address of the FIBs (or whatever it is) is different,
   I don't know why. See the two log snippets below. See also the
   attached tarball for more complete boot logs.


The following are some variable dumps, logged by instrumentation that
I have inserted into aac.c.


This is the workable setup (UP + 'device isp' enabled):

FRR:  generic attach - aac_attach() called
FRR:   Disabling interrupts.
FRR:aac_init(): initing controller
FRR:  -- Init structure contents: --
FRR:aac_common is at e198
FRR:ac_init is at e1981000
FRR:InitStructRevision = 3
FRR:MiniPortRevision = 1
FRR:AdapterFibsPhysicalAddress = 1c000
FRR:AdapterFibsVirtualAddress = e198
FRR:AdapterFibsSize = 1000
FRR:AdapterFibAlign = 200
FRR:PrintfBufferAddress = 1f184
FRR:PrintfBufferSize = 100
FRR:HostPhysMemPages = 3fccf
FRR:HostElapsedSeconds = 0
FRR:   setting the outbound doorbell register to all one's.
FRR:aac_common_busaddr = 1c000
FRR:ac_init offset = 1000
FRR:   aac_sync_command() called.
FRR:   - populating the mailbox...
FRR:aac_rx_set_mailbox() called.
FRR:btag: 1
FRR:handle:   dd96e000
FRR:command:  5
FRR:arg0: 1d000
FRR:arg1: 0
FRR:arg2: 0
FRR:arg3: 0
FRR:   - clearing the 
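
A cross-check of the dump above: the mailbox arg0 (1d000) equals
aac_common_busaddr (1c000) plus the ac_init offset (1000), which is
consistent with the mailbox carrying the bus address of the init
structure. The following is only a sketch of that arithmetic; the
constants are copied from the FRR: lines, and the helper name is
hypothetical, not something taken from aac.c.

```c
#include <stdint.h>

/* Values copied from the FRR: log lines above (all hexadecimal). */
#define LOG_AAC_COMMON_BUSADDR 0x1c000u  /* aac_common_busaddr */
#define LOG_AC_INIT_OFFSET     0x1000u   /* ac_init offset     */
#define LOG_MAILBOX_ARG0       0x1d000u  /* arg0               */

/*
 * Hypothetical helper: bus address of the init structure, computed as
 * the bus address of the common area plus the offset of the init
 * structure inside it.
 */
static uint32_t
init_struct_busaddr(uint32_t common_busaddr, uint32_t init_offset)
{
    return common_busaddr + init_offset;
}
```

Comparing init_struct_busaddr(LOG_AAC_COMMON_BUSADDR, LOG_AC_INIT_OFFSET)
with LOG_MAILBOX_ARG0 is one way to check the same relation in a log
taken from a failing setup.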

Re: FreeBSD 4.8, ASR2120, SMP, degraded RAID1/mirror = storage failure

2003-09-10 Thread rysanek
Dear Mr. Long,

thanks a lot for taking the time to respond
- especially given that you're on vacation and
that it's almost 2 a.m. your time.

I apologize for the vague wording in my previous mail.
I may also have misunderstood some of the terms that occur
in the driver's code.
Specifically:

 2. What is a zero-padded FIB?  I concede that the AIF handling in the driver
 is sub-par and needs to be revisited, so I'd like to know what you are seeing.

I was referring to this:

aac_dequeue_fib: called
aac0: aac_host_command: FIB @ 0xe1984000
aac0:   XferState     0
aac0:   Command       0
aac0:   StructType    0
aac0:   Flags         0x0
aac0:   Size          0
aac0:   SenderSize    0
aac0:   SenderAddress 0x0
aac0:   RcvrAddress   0x0
aac0:   SenderData    0x0
aac0:   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
aac0:   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
aac0: unknown command from controller

The size is zero and the data dump at the end contains only
zeroes. That's why I called it a zero-padded FIB.
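
To make the term concrete, here is a minimal sketch of a check for
such a FIB. The struct is a simplified, hypothetical stand-in: the
field names follow the dump above, but the field widths and both
function names are assumptions and are not taken from the driver.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified FIB header. Field names follow the debug
 * dump; the widths are assumptions, not copied from the aac driver.
 */
struct fib_header_sketch {
    uint32_t xfer_state;
    uint16_t command;
    uint8_t  struct_type;
    uint8_t  flags;
    uint16_t size;
    uint16_t sender_size;
    uint32_t sender_address;
    uint32_t receiver_address;
    uint32_t sender_data;
};

/* Return 1 if every byte of the buffer is zero, 0 otherwise. */
static int
buffer_is_all_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i;

    for (i = 0; i < len; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}

/* "Zero-padded" here means: header and payload contain only zeroes. */
static int
fib_is_zero_padded(const struct fib_header_sketch *hdr,
    const void *payload, size_t payload_len)
{
    return buffer_is_all_zero(hdr, sizeof(*hdr)) &&
        buffer_is_all_zero(payload, payload_len);
}
```

For the FIB dumped above, such a check would return 1, whereas any
FIB with a non-zero Command or Size would return 0.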

This only occurs when an unhandled array failure arrives,
i.e. when the machine is about to hang upon runtime array
degradation or at boot from a degraded array. That normally
only happens with SMP+APIC_IO enabled, not with a UP kernel.

All the other FIB listings that I've seen contain some
non-zero data and claim non-zero length...

I attached a tarball with some logs to my last message.
To see what I'm talking about, please take a look at this:

- runtime array degradation - compare the two logs:
  - SMP, unrecoverable failure:
    logs/DEBUG_CAM_AAC_L2/SMP-2_disk_failed (line 23)
  - UP, system keeps going just fine:
    logs/DEBUG_CAM_AAC_L2/NOSMP-2_disk_failed (line 76)

- boot from a degraded array - compare the two logs:
  - SMP, unrecoverable failure:
    logs/DEBUG_AAC_L4/SMP-3_boot_with_degraded_array_failed (line 296)
  - UP, system boots just fine:
    logs/DEBUG_AAC_L4/NOSMP-3_boot_with_degraded_array_OK (line 273)

 3. The split and corrupted messages on the console were likely due to
 kernel printfs happening from different contexts at the same time.  The
 printf facility has no serializing ability, unfortunately.

OK.

 4. I'm unclear on what you mean by there being a problem in the
 asynchronous handling of device printfs and host command fibs.  I'd be
 very interested in more information on this.

I didn't mean to say that the cause of my problem lies in that area.

I meant to say that I have a problem understanding what's going on.
I'm not a skilled coder, I have a hard time understanding how the
driver's code works. I am able to see where a function is called
with some arguments and returns with a result. However, when a SCSI
command is issued to the controller, at the driver level the
request/response doesn't happen within a single function.
One function queues the command to the controller via the MMIO (?)
region and the response from the controller eventually comes
back within an interrupt, invoking an interrupt handler.
The response may be a valid SCSI response to the SCSI command,
or a "something went wrong" Monitor event.
I am vaguely aware that the SCSI controller can reorder commands in the
queue or process them out of order.
Combine this with the unserialized logging and I'm lost :)
Sorry.
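
To illustrate the split I mean, here is a minimal user-space sketch
of the pattern, not the driver's actual code. All names (cmd_submit,
cmd_complete, record_status, MAX_CMDS) are hypothetical. One function
queues a command and returns a tag; a second function, standing in
for the interrupt handler, later completes it by tag, possibly out of
order.

```c
#include <stddef.h>

#define MAX_CMDS 4  /* hypothetical queue depth */

typedef void (*done_fn)(int tag, int status, void *ctx);

struct pending_cmd {
    int     in_use;
    done_fn done;
    void   *ctx;
};

static struct pending_cmd pending[MAX_CMDS];

/* "Submit" half: remember the callback, return a tag (-1 if full). */
static int
cmd_submit(done_fn done, void *ctx)
{
    int tag;

    for (tag = 0; tag < MAX_CMDS; tag++) {
        if (!pending[tag].in_use) {
            pending[tag].in_use = 1;
            pending[tag].done = done;
            pending[tag].ctx = ctx;
            return tag;
        }
    }
    return -1;
}

/*
 * "Interrupt" half: look the tag up and run its callback.
 * Returns 0 on success, -1 if the tag matches no pending command.
 */
static int
cmd_complete(int tag, int status)
{
    struct pending_cmd *c;

    if (tag < 0 || tag >= MAX_CMDS || !pending[tag].in_use)
        return -1;
    c = &pending[tag];
    c->in_use = 0;
    c->done(tag, status, c->ctx);
    return 0;
}

/* Example callback: store the status in the int that ctx points to. */
static void
record_status(int tag, int status, void *ctx)
{
    (void)tag;
    *(int *)ctx = status;
}
```

Submitting two commands and completing the second one first shows why
a log of submissions and a log of completions need not line up.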

If there's something specific I should check for, please let me know.
Thanks for being patient with my hasty descriptions :)

Frank Rysanek
