Re: disabling sata_nv ADMA for 2.6.24

2008-01-09 Thread Robert Hancock

Tejun Heo wrote:
>> How about putting a bunch of printks inside the interrupt handler? That
>> would tell us if it's even reaching the interrupt handler..
>
> If you give me a patch, I'll apply it and cause lock up and report the
> result.  Just shoot the patches my way.  But maybe reproducing the lock
> up on your machine would be the better solution.  It isn't difficult at
> all.  Plug in, fire up IO, disconnect, wait.  Connect different drive.
> Rinse and repeat.  It will lock up pretty soon.


Unfortunately my nForce4 machine is my main box with 2 drives, neither 
of which exactly have expendable contents, so random hotplug/unplug 
tests with IO in progress seem a bit risky..


However, how about putting in a printk in nv_adma_interrupt handler here:

	/* freeze if hotplugged or controller error */
	if (unlikely(status & (NV_ADMA_STAT_HOTPLUG |
			       NV_ADMA_STAT_HOTUNPLUG |
			       NV_ADMA_STAT_TIMEOUT |
			       NV_ADMA_STAT_SERROR))) {
		struct ata_eh_info *ehi = &ap->link.eh_info;
		ata_ehi_clear_desc(ehi);
--->		ata_port_printk("ADMA status 0x%08x: ", status);
		__ata_ehi_push_desc(ehi, "ADMA status 0x%08x: ", status);


That should tell us if it reaches the point of the hotplug/unplug 
interrupt but failed before or during the error handling.


If that doesn't give anything useful, you can try and move that printk 
before the if, but that will likely flood you with a lot of output from 
every interrupt that fires..
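
For readers following along, here is a small user-space C sketch of the placement trade-off described above: a printk inside the freeze branch fires only for hotplug, hot-unplug, timeout or SError status, while one placed before the `if` fires on every interrupt. The bit positions are assumptions recalled from sata_nv.c and should be checked against the tree in use; `adma_wants_freeze()`, `count_printks()` and `struct printk_counts` are hypothetical names made up for this illustration, not kernel APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed bit positions (recalled from sata_nv.c, verify before relying on them). */
#define NV_ADMA_STAT_TIMEOUT      (1u << 0)
#define NV_ADMA_STAT_HOTUNPLUG    (1u << 1)
#define NV_ADMA_STAT_HOTPLUG      (1u << 2)
#define NV_ADMA_STAT_SERROR       (1u << 5)
#define NV_ADMA_STAT_CMD_COMPLETE (1u << 6)

#define NV_ADMA_FREEZE_MASK (NV_ADMA_STAT_HOTPLUG | NV_ADMA_STAT_HOTUNPLUG | \
			     NV_ADMA_STAT_TIMEOUT | NV_ADMA_STAT_SERROR)

/* Hypothetical helper: nonzero when the status word would enter the freeze branch. */
static int adma_wants_freeze(uint32_t status)
{
	return (status & NV_ADMA_FREEZE_MASK) != 0;
}

/* Hypothetical tally of how many messages each printk placement would emit. */
struct printk_counts {
	size_t inside_if;	/* printk placed inside the freeze branch */
	size_t before_if;	/* printk placed before the if, i.e. every interrupt */
};

static struct printk_counts count_printks(const uint32_t *status, size_t n)
{
	struct printk_counts c = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		c.before_if++;
		if (adma_wants_freeze(status[i]))
			c.inside_if++;
	}
	return c;
}
```

Feeding it three command-completion statuses plus one status of 0x402 (the hot-unplug status logged elsewhere in this thread) gives one message for the in-branch placement and four for the before-branch placement, which is the flooding concern raised above.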

-
To unsubscribe from this list: send the line unsubscribe linux-ide in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: disabling sata_nv ADMA for 2.6.24

2008-01-09 Thread Tejun Heo
Robert Hancock wrote:
> However, how about putting in a printk in nv_adma_interrupt handler here:
>
>	/* freeze if hotplugged or controller error */
>	if (unlikely(status & (NV_ADMA_STAT_HOTPLUG |
>			       NV_ADMA_STAT_HOTUNPLUG |
>			       NV_ADMA_STAT_TIMEOUT |
>			       NV_ADMA_STAT_SERROR))) {
>		struct ata_eh_info *ehi = &ap->link.eh_info;
>		ata_ehi_clear_desc(ehi);
> --->		ata_port_printk("ADMA status 0x%08x: ", status);
>		__ata_ehi_push_desc(ehi, "ADMA status 0x%08x: ", status);

Alright, will do when I get some time.

-- 
tejun


Re: disabling sata_nv ADMA for 2.6.24

2008-01-08 Thread Tejun Heo
Tejun Heo wrote:
>> [   34.466899] testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears
>> to be stuck (0->0)!
>> [   34.555056] WARNING: CPU#1: NMI appears to be stuck (0->0)!
>
> Oops, missed that.  I'll see whether there's IRQ storm going on.

I made the nv irq handler to print message every 100th time and it says
nothing after lock up and no response to keyboard, sysrq or serial.  It
seems like a solid lock up to me.  Anything else you want me to try out?

-- 
tejun
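
A minimal user-space sketch of the rate-limited print described in this message, assuming a plain call counter. `struct irq_log_state` and `irq_log_every_100th()` are hypothetical names for illustration; the real handler would call printk rather than fprintf, and the actual debugging patch was not posted in this thread.

```c
#include <stdio.h>

/* Hypothetical per-handler bookkeeping; not part of sata_nv. */
struct irq_log_state {
	unsigned long calls;	/* total handler invocations seen */
	unsigned long printed;	/* messages actually emitted */
};

/* Count one invocation; print on the 100th, 200th, ... call. Returns 1 if it printed. */
static int irq_log_every_100th(struct irq_log_state *st)
{
	st->calls++;
	if (st->calls % 100 != 0)
		return 0;

	st->printed++;
	fprintf(stderr, "nv irq handler: %lu calls so far\n", st->calls);
	return 1;
}
```

If messages stop appearing altogether after the lockup, as reported here, the handler is no longer being entered at all, which argues against an interrupt storm.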


Re: disabling sata_nv ADMA for 2.6.24

2008-01-08 Thread Robert Hancock

Tejun Heo wrote:
> Tejun Heo wrote:
>>> [   34.466899] testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears
>>> to be stuck (0->0)!
>>> [   34.555056] WARNING: CPU#1: NMI appears to be stuck (0->0)!
>>
>> Oops, missed that.  I'll see whether there's IRQ storm going on.
>
> I made the nv irq handler to print message every 100th time and it says
> nothing after lock up and no response to keyboard, sysrq or serial.  It
> seems like a solid lock up to me.  Anything else you want me to try out?


I assume that replugging or unplugging cables after that point doesn't 
bring it back to life?



Re: disabling sata_nv ADMA for 2.6.24

2008-01-08 Thread Tejun Heo
Robert Hancock wrote:
> Tejun Heo wrote:
>> Tejun Heo wrote:
>>>> [   34.466899] testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears
>>>> to be stuck (0->0)!
>>>> [   34.555056] WARNING: CPU#1: NMI appears to be stuck (0->0)!
>>> Oops, missed that.  I'll see whether there's IRQ storm going on.
>>
>> I made the nv irq handler to print message every 100th time and it says
>> nothing after lock up and no response to keyboard, sysrq or serial.  It
>> seems like a solid lock up to me.  Anything else you want me to try out?
>
> I assume that replugging or unplugging cables after that point doesn't
> bring it back to life?

Nope.  The machine is a solid brick.

-- 
tejun


Re: disabling sata_nv ADMA for 2.6.24

2008-01-08 Thread Tejun Heo
Tejun Heo wrote:
> Robert Hancock wrote:
>> Tejun Heo wrote:
>>> Tejun Heo wrote:
>>>>> [   34.466899] testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears
>>>>> to be stuck (0->0)!
>>>>> [   34.555056] WARNING: CPU#1: NMI appears to be stuck (0->0)!
>>>> Oops, missed that.  I'll see whether there's IRQ storm going on.
>>> I made the nv irq handler to print message every 100th time and it says
>>> nothing after lock up and no response to keyboard, sysrq or serial.  It
>>> seems like a solid lock up to me.  Anything else you want me to try out?
>> I assume that replugging or unplugging cables after that point doesn't
>> bring it back to life?
>
> Nope.  The machine is a solid brick.

If you want, I can ship the board + processor + cooler to you for
debugging.  I guess it will be more useful in your hands than mine.

-- 
tejun


Re: disabling sata_nv ADMA for 2.6.24

2008-01-08 Thread Robert Hancock

Tejun Heo wrote:
> Tejun Heo wrote:
>> Robert Hancock wrote:
>>> Tejun Heo wrote:
>>>> Tejun Heo wrote:
>>>>>> [   34.466899] testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears
>>>>>> to be stuck (0->0)!
>>>>>> [   34.555056] WARNING: CPU#1: NMI appears to be stuck (0->0)!
>>>>> Oops, missed that.  I'll see whether there's IRQ storm going on.
>>>> I made the nv irq handler to print message every 100th time and it says
>>>> nothing after lock up and no response to keyboard, sysrq or serial.  It
>>>> seems like a solid lock up to me.  Anything else you want me to try out?
>>> I assume that replugging or unplugging cables after that point doesn't
>>> bring it back to life?
>> Nope.  The machine is a solid brick.
>
> If you want, I can ship the board + processor + cooler to you for
> debugging.  I guess it will be more useful in your hands than mine.


If it's an A8N-E I'd be surprised if it behaved differently from my 
A8N-SLI Deluxe, though maybe it's a different chipset revision or 
something.. The last time I tested hotplug on here it seemed to work 
fine, though I haven't done any rapid disconnect/reconnect tests.


How about putting a bunch of printks inside the interrupt handler? That 
would tell us if it's even reaching the interrupt handler..



Re: disabling sata_nv ADMA for 2.6.24

2008-01-08 Thread Tejun Heo
Robert Hancock wrote:
> Tejun Heo wrote:
>> Tejun Heo wrote:
>>> Robert Hancock wrote:
>>>> Tejun Heo wrote:
>>>>> Tejun Heo wrote:
>>>>>>> [   34.466899] testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears
>>>>>>> to be stuck (0->0)!
>>>>>>> [   34.555056] WARNING: CPU#1: NMI appears to be stuck (0->0)!
>>>>>> Oops, missed that.  I'll see whether there's IRQ storm going on.
>>>>> I made the nv irq handler to print message every 100th time and it says
>>>>> nothing after lock up and no response to keyboard, sysrq or serial.  It
>>>>> seems like a solid lock up to me.  Anything else you want me to try out?
>>>> I assume that replugging or unplugging cables after that point doesn't
>>>> bring it back to life?
>>> Nope.  The machine is a solid brick.
>>
>> If you want, I can ship the board + processor + cooler to you for
>> debugging.  I guess it will be more useful in your hands than mine.
>
> If it's an A8N-E I'd be surprised if it behaved differently from my
> A8N-SLI Deluxe, though maybe it's a different chipset revision or
> something.. The last time I tested hotplug on here it seemed to work
> fine, though I haven't done any rapid disconnect/reconnect tests.
>
> How about putting a bunch of printks inside the interrupt handler? That
> would tell us if it's even reaching the interrupt handler..

If you give me a patch, I'll apply it and cause lock up and report the
result.  Just shoot the patches my way.  But maybe reproducing the lock
up on your machine would be the better solution.  It isn't difficult at
all.  Plug in, fire up IO, disconnect, wait.  Connect different drive.
Rinse and repeat.  It will lock up pretty soon.

-- 
tejun


Re: disabling sata_nv ADMA for 2.6.24

2008-01-07 Thread Mark Lord

Tejun Heo wrote:
> Hello, guys.
>
> We still have three problems with ADMA.
>
> * hard lockup during resume
> * occasional hard lockup after hotplug or other errors (probably related
> to the above?)
> * occasional timeout of FLUSH after NCQ writes
>
> I think we should disable ADMA for 2.6.24 and -stable for now.  What do
> you guys think?

..

Heck, given the active vendor neglect here,
I'm surprised we even bother with it at all!

Cheers


Re: disabling sata_nv ADMA for 2.6.24

2008-01-07 Thread Robert Hancock

Mark Lord wrote:
> Tejun Heo wrote:
>> Hello, guys.
>>
>> We still have three problems with ADMA.
>>
>> * hard lockup during resume
>> * occasional hard lockup after hotplug or other errors (probably related
>> to the above?)


This has only been reported on one person's MSI board. Apparently 
another revision of the same board is reported to work, and I can't 
duplicate the problem on my Asus board, so it could just be some 
hardware problem on that motherboard.



>> * occasional timeout of FLUSH after NCQ writes
>>
>> I think we should disable ADMA for 2.6.24 and -stable for now.  What do
>> you guys think?


I still can't say I'm really in favor of it.. In particular to do so for 
2.6.24 right now seems excessive, as none of these problems are 
regressions from 2.6.23, and these controllers haven't been tested in 
non-ADMA mode very much since it was made the default, so that change 
might actually cause regressions.




> Heck, given the active vendor neglect here,
> I'm surprised we even bother with it at all!
>
> Cheers




Re: disabling sata_nv ADMA for 2.6.24

2008-01-07 Thread Tejun Heo
Robert Hancock wrote:
> Mark Lord wrote:
>> Tejun Heo wrote:
>>> Hello, guys.
>>>
>>> We still have three problems with ADMA.
>>>
>>> * hard lockup during resume
>>> * occasional hard lockup after hotplug or other errors (probably
>>> related to the above?)
>
> This has only been reported on one person's MSI board. Apparently
> another revision of the same board is reported to work, and I can't
> duplicate the problem on my Asus board, so it could just be some
> hardware problem on that motherboard.

IIRC, I have two from suse bug reports and both resolved with adma=0.
I'm not too sure whether post 2.6.23-rcX changes would have fixed those
problems tho.  FWIW, I've disabled ADMA mode on all suse products.

> I still can't say I'm really in favor of it.. In particular to do so
> for 2.6.24 right now seems excessive, as none of these problems are
> regressions from 2.6.23, and these controllers haven't been tested in
> non-ADMA mode very much since it was made the default, so that change
> might actually cause regressions.

Technically, they're regressions from pre-ADMA days - pretty grave ones
considering some of the failure modes include hard lock up.  Also, they
don't seem resolvable in foreseeable future at this point.  If this
isn't gonna improve, I think we should just drop ADMA support altogether
and concentrate on stabilizing non-ADMA operation.  Stability is far
more important than small performance improvements or feature supports.

But, yeah, you're right in that the change might cause more problems.
What's your estimation of such possibility?  I generally feel good about
non-ADMA mode operation as they seem to solve most reported sata_nv bugs
but I haven't really followed sata_nv code changes recently.

Maybe this can be resolved by going through one more -rc cycle after the
change if that's possible.

Thanks.

-- 
tejun


Re: disabling sata_nv ADMA for 2.6.24

2008-01-07 Thread Robert Hancock

Tejun Heo wrote:
> Robert Hancock wrote:
>> Mark Lord wrote:
>>> Tejun Heo wrote:
>>>> Hello, guys.
>>>>
>>>> We still have three problems with ADMA.
>>>>
>>>> * hard lockup during resume
>>>> * occasional hard lockup after hotplug or other errors (probably
>>>> related to the above?)
>>
>> This has only been reported on one person's MSI board. Apparently
>> another revision of the same board is reported to work, and I can't
>> duplicate the problem on my Asus board, so it could just be some
>> hardware problem on that motherboard.


> IIRC, I have two from suse bug reports and both resolved with adma=0.
> I'm not too sure whether post 2.6.23-rcX changes would have fixed those
> problems tho.  FWIW, I've disabled ADMA mode on all suse products.


A hotplug-related problem? Have a link to the reports?




>> I still can't say I'm really in favor of it.. In particular to do so
>> for 2.6.24 right now seems excessive, as none of these problems are
>> regressions from 2.6.23, and these controllers haven't been tested in
>> non-ADMA mode very much since it was made the default, so that change
>> might actually cause regressions.


> Technically, they're regressions from pre-ADMA days - pretty grave ones
> considering some of the failure modes include hard lock up.  Also, they
> don't seem resolvable in foreseeable future at this point.  If this
> isn't gonna improve, I think we should just drop ADMA support altogether
> and concentrate on stabilizing non-ADMA operation.  Stability is far
> more important than small performance improvements or feature supports.


The suspend/resume problem should be resolvable. It worked before and 
should be able to work again. Hopefully debug output with console 
enabled during resume may provide some hints..


The cache flush timeout problem is a bit onerous, but hopefully we can 
figure something out there with some more debugging by the reporter.




> But, yeah, you're right in that the change might cause more problems.
> What's your estimation of such possibility?  I generally feel good about
> non-ADMA mode operation as they seem to solve most reported sata_nv bugs
> but I haven't really followed sata_nv code changes recently.


It's hard to say what may come up if we do this. I seem to recall that
there were some reports of weird hotplug issues and high latencies on
register access that went away with ADMA mode.


I do think it's likely too late in the -rc series to make such a change 
though. Hopefully by 2.6.25 we'll either have the issues fixed or have 
more of an idea whether they can be.




> Maybe this can be resolved by going through one more -rc cycle after the
> change if that's possible.
>
> Thanks.




Re: disabling sata_nv ADMA for 2.6.24

2008-01-07 Thread Tejun Heo
Robert Hancock wrote:
>>> This has only been reported on one person's MSI board. Apparently
>>> another revision of the same board is reported to work, and I can't
>>> duplicate the problem on my Asus board, so it could just be some
>>> hardware problem on that motherboard.
>>
>> IIRC, I have two from suse bug reports and both resolved with adma=0.
>> I'm not too sure whether post 2.6.23-rcX changes would have fixed those
>> problems tho.  FWIW, I've disabled ADMA mode on all suse products.
>
> A hotplug-related problem? Have a link to the reports?

Hmmm... I mis-remembered.  The reporter said it was okay in SL102
(2.6.18, no ADMA) but SL103 (2.6.22, ADMA is on) fell apart.  I asked
for retest w/ adma=0 but no response yet.

  https://bugzilla.novell.com/show_bug.cgi?id=347184

I tried to reproduce the problem on my a8n-e but couldn't.

>> Technically, they're regressions from pre-ADMA days - pretty grave ones
>> considering some of the failure modes include hard lock up.  Also, they
>> don't seem resolvable in foreseeable future at this point.  If this
>> isn't gonna improve, I think we should just drop ADMA support altogether
>> and concentrate on stabilizing non-ADMA operation.  Stability is far
>> more important than small performance improvements or feature supports.
>
> The suspend/resume problem should be resolvable. It worked before and
> should be able to work again. Hopefully debug output with console
> enabled during resume may provide some hints..

Okay.

> The cache flush timeout problem is a bit onerous, but hopefully we can
> figure something out there with some more debugging by the reporter.

:-(

>> But, yeah, you're right in that the change might cause more problems.
>> What's your estimation of such possibility?  I generally feel good about
>> non-ADMA mode operation as they seem to solve most reported sata_nv bugs
>> but I haven't really followed sata_nv code changes recently.
>
> It's hard to say what may come up if we do this. I seem to recall that
> there were some reports of weird hotplug issues and high latencies on
> register access that went away with ADMA mode.
>
> I do think it's likely too late in the -rc series to make such a change
> though. Hopefully by 2.6.25 we'll either have the issues fixed or have
> more of an idea whether they can be.

I feel pretty uncomfortable with the current situation.  Two mostly
working operation modes w/o any doc and known unresolved issues on both.
 Eeeek.  :-(

-- 
tejun


Re: disabling sata_nv ADMA for 2.6.24

2008-01-07 Thread Tejun Heo
Tejun Heo wrote:
> Robert Hancock wrote:
>>>> This has only been reported on one person's MSI board. Apparently
>>>> another revision of the same board is reported to work, and I can't
>>>> duplicate the problem on my Asus board, so it could just be some
>>>> hardware problem on that motherboard.
>>> IIRC, I have two from suse bug reports and both resolved with adma=0.
>>> I'm not too sure whether post 2.6.23-rcX changes would have fixed those
>>> problems tho.  FWIW, I've disabled ADMA mode on all suse products.
>> A hotplug-related problem? Have a link to the reports?
>
> Hmmm... I mis-remembered.  The reporter said it was okay in SL102
> (2.6.18, no ADMA) but SL103 (2.6.22, ADMA is on) fell apart.  I asked
> for retest w/ adma=0 but no response yet.
>
>   https://bugzilla.novell.com/show_bug.cgi?id=347184
>
> I tried to reproduce the problem on my a8n-e but couldn't.

Okay, just succeeded on the current #upstream-fixes, attaching the log.
 The machine is a brick after the crash.

Thanks.

-- 
tejun
[0.00] Linux version 2.6.24-rc5-work ([EMAIL PROTECTED]) (gcc version 4.2.1 (SUSE Linux)) #15 SMP PREEMPT Tue Jan 8 00:52:24 KST 2008
[0.00] Command line: BOOT_IMAGE=vmlinuz-ck804 root=/dev/hde1 nmi_watchdog=1 printk.printk_time=1 console=ttyS0,115200 console=tty0 sysrq_always_enabled [EMAIL PROTECTED]/eth0,[EMAIL PROTECTED]/00:18:f3:ab:44:ab
[0.00] BIOS-provided physical RAM map:
[0.00]  BIOS-e820:  - 0009e800 (usable)
[0.00]  BIOS-e820: 0009e800 - 000a (reserved)
[0.00]  BIOS-e820: 000f - 0010 (reserved)
[0.00]  BIOS-e820: 0010 - 1fff (usable)
[0.00]  BIOS-e820: 1fff - 1fff3000 (ACPI NVS)
[0.00]  BIOS-e820: 1fff3000 - 2000 (ACPI data)
[0.00]  BIOS-e820: e000 - f000 (reserved)
[0.00]  BIOS-e820: fec0 - 0001 (reserved)
[0.00] end_pfn_map = 1048576
[0.00] DMI 2.3 present.
[0.00] ACPI: RSDP 000F7560, 0014 (r0 Nvidia)
[0.00] ACPI: RSDT 1FFF3040, 0030 (r1 Nvidia AWRDACPI 42302E31 AWRD0)
[0.00] ACPI: FACP 1FFF30C0, 0074 (r1 Nvidia AWRDACPI 42302E31 AWRD0)
[0.00] ACPI: DSDT 1FFF3180, 65F2 (r1 NVIDIA AWRDACPI 1000 MSFT  10E)
[0.00] ACPI: FACS 1FFF, 0040
[0.00] ACPI: MCFG 1FFF9880, 003C (r1 Nvidia AWRDACPI 42302E31 AWRD0)
[0.00] ACPI: APIC 1FFF97C0, 007C (r1 Nvidia AWRDACPI 42302E31 AWRD0)
[0.00] Zone PFN ranges:
[0.00]   DMA 0 - 4096
[0.00]   DMA324096 -  1048576
[0.00]   Normal1048576 -  1048576
[0.00] Movable zone start PFN for each node
[0.00] early_node_map[2] active PFN ranges
[0.00] 0:0 -  158
[0.00] 0:  256 -   131056
[0.00] Nvidia board detected. Ignoring ACPI timer override.
[0.00] If you got timer trouble try acpi_use_timer_override
[0.00] ACPI: PM-Timer IO Port: 0x4008
[0.00] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[0.00] Processor #0 (Bootup-CPU)
[0.00] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
[0.00] Processor #1
[0.00] ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
[0.00] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
[0.00] ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
[0.00] IOAPIC[0]: apic_id 2, address 0xfec0, GSI 0-23
[0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[0.00] ACPI: BIOS IRQ0 pin2 override ignored.
[0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 14 global_irq 14 high edge)
[0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 15 global_irq 15 high edge)
[0.00] Setting APIC routing to flat
[0.00] Using ACPI (MADT) for SMP configuration information
[0.00] swsusp: Registered nosave memory region: 0009e000 - 0009f000
[0.00] swsusp: Registered nosave memory region: 0009f000 - 000a
[0.00] swsusp: Registered nosave memory region: 000a - 000f
[0.00] swsusp: Registered nosave memory region: 000f - 0010
[0.00] Allocating PCI resources starting at 3000 (gap: 2000:c000)
[0.00] SMP: Allowing 2 CPUs, 0 hotplug CPUs
[0.00] PERCPU: Allocating 30608 bytes of per cpu data
[0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 124913
[0.00] Kernel command line: BOOT_IMAGE=vmlinuz-ck804 root=/dev/hde1 nmi_watchdog=1 printk.printk_time=1 console=ttyS0,115200 console=tty0 sysrq_always_enabled [EMAIL PROTECTED]/eth0,[EMAIL PROTECTED]/00:18:f3:ab:44:ab
[0.00] Unknown boot option `printk.printk_time=1': ignoring
[ 

Re: disabling sata_nv ADMA for 2.6.24

2008-01-07 Thread Robert Hancock

Tejun Heo wrote:
> Tejun Heo wrote:
>> Robert Hancock wrote:
>>>>> This has only been reported on one person's MSI board. Apparently
>>>>> another revision of the same board is reported to work, and I can't
>>>>> duplicate the problem on my Asus board, so it could just be some
>>>>> hardware problem on that motherboard.
>>>> IIRC, I have two from suse bug reports and both resolved with adma=0.
>>>> I'm not too sure whether post 2.6.23-rcX changes would have fixed those
>>>> problems tho.  FWIW, I've disabled ADMA mode on all suse products.
>>> A hotplug-related problem? Have a link to the reports?
>>
>> Hmmm... I mis-remembered.  The reporter said it was okay in SL102
>> (2.6.18, no ADMA) but SL103 (2.6.22, ADMA is on) fell apart.  I asked
>> for retest w/ adma=0 but no response yet.
>>
>>   https://bugzilla.novell.com/show_bug.cgi?id=347184
>>
>> I tried to reproduce the problem on my a8n-e but couldn't.
>
> Okay, just succeeded on the current #upstream-fixes, attaching the log.
>  The machine is a brick after the crash.


I assume the cable got reconnected at 325 seconds? It looks like that 
was during error handling for the previous unplug?


[  314.987885] ata3: timeout waiting for ADMA IDLE, stat=0x400
[  314.993556] ata3: timeout waiting for ADMA LEGACY, stat=0x400
[  315.009915] ata3.00: exception Emask 0x10 SAct 0x1 SErr 0x1910000 action 0xa frozen
[  315.017708] ata3.00: ADMA status 0x00000402: , hot unplug
[  315.017714] ata3: SError: { PHYRdyChg Dispar LinkSeq TrStaTrns }
[  315.029239] ata3.00: cmd 60/01:00:92:d7:12/00:00:05:00:00/40 tag 0 ncq 512 in
[  315.029240]          res 40/00:04:92:d7:12/00:04:92:d7:12/40 Emask 0x10 (ATA bus error)
[  315.029243] ata3.00: status: { DRDY }
[  315.048236] ata3: hard resetting link
[  315.774982] ata3: SATA link down (SStatus 0 SControl 300)
[  315.780498] ata3: failed to recover some devices, retrying in 5 secs
[  320.788427] ata3: hard resetting link
[  325.242220] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)

Not sure if the port would be frozen at this point or not?

It would be useful to add some printks to narrow down at what point the 
lockup happens. If it's a loop, interrupt storm or something then we can 
likely fix it, but if the controller's just locking up then we may be 
out of luck..
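
The SStatus values in the log above can be read with the field layout from the SATA specification: DET in bits 3:0, SPD in bits 7:4 and IPM in bits 11:8. The sketch below is a user-space illustration only; the struct and function names are hypothetical, and it assumes the log prints SStatus in hex, as libata does.

```c
#include <stdint.h>

/* Hypothetical container for the three SStatus fields. */
struct sstatus_fields {
	unsigned int det;	/* bits 3:0, device detection */
	unsigned int spd;	/* bits 7:4, negotiated speed */
	unsigned int ipm;	/* bits 11:8, interface power management state */
};

static struct sstatus_fields sstatus_decode(uint32_t sstatus)
{
	struct sstatus_fields f;

	f.det = sstatus & 0xf;
	f.spd = (sstatus >> 4) & 0xf;
	f.ipm = (sstatus >> 8) & 0xf;
	return f;
}

/* DET == 3 means a device is present and PHY communication is established. */
static int sstatus_link_online(uint32_t sstatus)
{
	return (sstatus & 0xf) == 0x3;
}

/* Human-readable speed for the generations relevant to this thread. */
static const char *sstatus_speed_str(uint32_t sstatus)
{
	switch ((sstatus >> 4) & 0xf) {
	case 1:
		return "1.5 Gbps";
	case 2:
		return "3.0 Gbps";
	default:
		return "unknown";	/* no speed negotiated, or not handled here */
	}
}
```

With the values from the log, 0x113 decodes to DET=3, SPD=1 (1.5 Gbps) and IPM=1 (active), while 0 means no device was detected. So the link came back up at 325 seconds, during the second hard reset that followed the unplug.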



Re: disabling sata_nv ADMA for 2.6.24

2008-01-07 Thread Tejun Heo
Robert Hancock wrote:
>> Okay, just succeeded on the current #upstream-fixes, attaching the log.
>>  The machine is a brick after the crash.
>
> I assume the cable got reconnected at 325 seconds? It looks like that
> was during error handling for the previous unplug?

I don't remember too well (the console was more than two meters away and
I just kept disconnecting and reconnecting).  I noticed the
machine was frozen after I came back to the console, so...

> [  314.987885] ata3: timeout waiting for ADMA IDLE, stat=0x400
> [  314.993556] ata3: timeout waiting for ADMA LEGACY, stat=0x400
> [  315.009915] ata3.00: exception Emask 0x10 SAct 0x1 SErr 0x1910000
> action 0xa frozen
> [  315.017708] ata3.00: ADMA status 0x00000402: , hot unplug
> [  315.017714] ata3: SError: { PHYRdyChg Dispar LinkSeq TrStaTrns }
> [  315.029239] ata3.00: cmd 60/01:00:92:d7:12/00:00:05:00:00/40 tag 0
> ncq 512 in
> [  315.029240]  res 40/00:04:92:d7:12/00:04:92:d7:12/40 Emask
> 0x10 (ATA bus error)
> [  315.029243] ata3.00: status: { DRDY }
> [  315.048236] ata3: hard resetting link
> [  315.774982] ata3: SATA link down (SStatus 0 SControl 300)
> [  315.780498] ata3: failed to recover some devices, retrying in 5 secs
> [  320.788427] ata3: hard resetting link
> [  325.242220] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>
> Not sure if the port would be frozen at this point or not?
>
> It would be useful to add some printks to narrow down at what point the
> lockup happens. If it's a loop, interrupt storm or something then we can
> likely fix it, but if the controller's just locking up then we may be
> out of luck..

I think it's machine hard lock up.  NMI watchdog doesn't get triggered.

-- 
tejun


Re: disabling sata_nv ADMA for 2.6.24

2008-01-07 Thread Tejun Heo
Tejun Heo wrote:
> Robert Hancock wrote:
>>> Okay, just succeeded on the current #upstream-fixes, attaching the log.
>>>  The machine is a brick after the crash.
>> I assume the cable got reconnected at 325 seconds? It looks like that
>> was during error handling for the previous unplug?
>
> I don't remember too well (the console was more than two meters away and
> I just kept disconnecting and reconnecting).  I noticed the
> machine was frozen after I came back to the console, so...
>
>> [  314.987885] ata3: timeout waiting for ADMA IDLE, stat=0x400
>> [  314.993556] ata3: timeout waiting for ADMA LEGACY, stat=0x400
>> [  315.009915] ata3.00: exception Emask 0x10 SAct 0x1 SErr 0x1910000
>> action 0xa frozen
>> [  315.017708] ata3.00: ADMA status 0x00000402: , hot unplug
>> [  315.017714] ata3: SError: { PHYRdyChg Dispar LinkSeq TrStaTrns }
>> [  315.029239] ata3.00: cmd 60/01:00:92:d7:12/00:00:05:00:00/40 tag 0
>> ncq 512 in
>> [  315.029240]  res 40/00:04:92:d7:12/00:04:92:d7:12/40 Emask
>> 0x10 (ATA bus error)
>> [  315.029243] ata3.00: status: { DRDY }
>> [  315.048236] ata3: hard resetting link
>> [  315.774982] ata3: SATA link down (SStatus 0 SControl 300)
>> [  315.780498] ata3: failed to recover some devices, retrying in 5 secs
>> [  320.788427] ata3: hard resetting link
>> [  325.242220] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>
>> Not sure if the port would be frozen at this point or not?
>>
>> It would be useful to add some printks to narrow down at what point the
>> lockup happens. If it's a loop, interrupt storm or something then we can
>> likely fix it, but if the controller's just locking up then we may be
>> out of luck..
>
> I think it's machine hard lock up.  NMI watchdog doesn't get triggered.

Ah.. another thing.  Sometimes when I swap two drives, sata_nv fails to
detect the new drive.  If I pull out the plug and replug it, it then
recognizes the new drive.

-- 
tejun


Re: disabling sata_nv ADMA for 2.6.24

2008-01-07 Thread Robert Hancock

Tejun Heo wrote:
> Tejun Heo wrote:
>> Robert Hancock wrote:
>>>> Okay, just succeeded on the current #upstream-fixes, attaching the log.
>>>>  The machine is a brick after the crash.
>>> I assume the cable got reconnected at 325 seconds? It looks like that
>>> was during error handling for the previous unplug?
>>
>> I don't remember too well (the console was more than two meters away and
>> I just kept disconnecting and reconnecting).  I noticed the
>> machine was frozen after I came back to the console, so...
>>
>>> [  314.987885] ata3: timeout waiting for ADMA IDLE, stat=0x400
>>> [  314.993556] ata3: timeout waiting for ADMA LEGACY, stat=0x400
>>> [  315.009915] ata3.00: exception Emask 0x10 SAct 0x1 SErr 0x1910000
>>> action 0xa frozen
>>> [  315.017708] ata3.00: ADMA status 0x00000402: , hot unplug
>>> [  315.017714] ata3: SError: { PHYRdyChg Dispar LinkSeq TrStaTrns }
>>> [  315.029239] ata3.00: cmd 60/01:00:92:d7:12/00:00:05:00:00/40 tag 0
>>> ncq 512 in
>>> [  315.029240]  res 40/00:04:92:d7:12/00:04:92:d7:12/40 Emask
>>> 0x10 (ATA bus error)
>>> [  315.029243] ata3.00: status: { DRDY }
>>> [  315.048236] ata3: hard resetting link
>>> [  315.774982] ata3: SATA link down (SStatus 0 SControl 300)
>>> [  315.780498] ata3: failed to recover some devices, retrying in 5 secs
>>> [  320.788427] ata3: hard resetting link
>>> [  325.242220] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>>
>>> Not sure if the port would be frozen at this point or not?
>>>
>>> It would be useful to add some printks to narrow down at what point the
>>> lockup happens. If it's a loop, interrupt storm or something then we can
>>> likely fix it, but if the controller's just locking up then we may be
>>> out of luck..
>>
>> I think it's machine hard lock up.  NMI watchdog doesn't get triggered.


Is NMI watchdog actually working on this machine?

[   34.466899] testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (0->0)!
[   34.555056] WARNING: CPU#1: NMI appears to be stuck (0->0)!





> Ah.. another thing.  Sometimes when I swap two drives, sata_nv fails to
> detect the new drive.  If I pull out the plug and replug it, it then
> recognizes the new drive.


No output in that case, I assume?


Re: disabling sata_nv ADMA for 2.6.24

2008-01-07 Thread Tejun Heo
Robert Hancock wrote:
> Tejun Heo wrote:
>> Tejun Heo wrote:
>>> Robert Hancock wrote:
>>>>> Okay, just succeeded on the current #upstream-fixes, attaching the log.
>>>>>  The machine is a brick after the crash.
>>>> I assume the cable got reconnected at 325 seconds? It looks like that
>>>> was during error handling for the previous unplug?
>>> I don't remember too well (the console was more than two meters away and
>>> I just kept disconnecting and reconnecting).  I noticed the
>>> machine was frozen after I came back to the console, so...
>>>
>>>> [  314.987885] ata3: timeout waiting for ADMA IDLE, stat=0x400
>>>> [  314.993556] ata3: timeout waiting for ADMA LEGACY, stat=0x400
>>>> [  315.009915] ata3.00: exception Emask 0x10 SAct 0x1 SErr 0x1910000
>>>> action 0xa frozen
>>>> [  315.017708] ata3.00: ADMA status 0x00000402: , hot unplug
>>>> [  315.017714] ata3: SError: { PHYRdyChg Dispar LinkSeq TrStaTrns }
>>>> [  315.029239] ata3.00: cmd 60/01:00:92:d7:12/00:00:05:00:00/40 tag 0
>>>> ncq 512 in
>>>> [  315.029240]  res 40/00:04:92:d7:12/00:04:92:d7:12/40 Emask
>>>> 0x10 (ATA bus error)
>>>> [  315.029243] ata3.00: status: { DRDY }
>>>> [  315.048236] ata3: hard resetting link
>>>> [  315.774982] ata3: SATA link down (SStatus 0 SControl 300)
>>>> [  315.780498] ata3: failed to recover some devices, retrying in 5 secs
>>>> [  320.788427] ata3: hard resetting link
>>>> [  325.242220] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>>>
>>>> Not sure if the port would be frozen at this point or not?
>>>>
>>>> It would be useful to add some printks to narrow down at what point the
>>>> lockup happens. If it's a loop, interrupt storm or something then we can
>>>> likely fix it, but if the controller's just locking up then we may be
>>>> out of luck..
>>> I think it's machine hard lock up.  NMI watchdog doesn't get triggered.
>
> Is NMI watchdog actually working on this machine?
>
> [   34.466899] testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears
> to be stuck (0->0)!
> [   34.555056] WARNING: CPU#1: NMI appears to be stuck (0->0)!

Oops, missed that.  I'll see whether there's IRQ storm going on.

>> Ah.. another thing.  Sometimes when I swap two drives, sata_nv fails to
>> detect the new drive.  If I pull out the plug and replug it, it then
>> recognizes the new drive.
>
> No output in that case, I assume?

It seems that sata_nv EH loses hotplug events while a hardreset is in
progress.  This is a bit tricky.  I'm not sure whether it's sata_nv's
fault or whether other drivers are working out of dumb luck.  I'll
reproduce the problem and post the log when I get some time.

Thanks.

-- 
tejun