Re: Strange freezes (seems like SATA related)

2007-11-06 Thread Max Krasnyansky
Robert Hancock wrote:
> Can you post the full dmesg output? What kind of drive is this?

Sorry for the delay. I'm on vacation and have sporadic email access.
Full dmesg is pretty long. Here is the SATA-related section.

sata_nv :00:07.0: version 3.4
ACPI: PCI Interrupt Link [LSA0] enabled at IRQ 23
ACPI: PCI Interrupt :00:07.0[A] -> Link [LSA0] -> GSI 23 (level, high) -> IRQ 23
sata_nv :00:07.0: Using ADMA mode
PCI: Setting latency timer of device :00:07.0 to 64
scsi0 : sata_nv
scsi1 : sata_nv
ata1: SATA max UDMA/133 cmd 0xc2a16480 ctl 0xc2a164a0 bmdma 0x000158b0 irq 23
ata2: SATA max UDMA/133 cmd 0xc2a16580 ctl 0xc2a165a0 bmdma 0x000158b8 irq 23
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-7: SAMSUNG HD080HJ, WT100-33, max UDMA/100
ata1.00: 156301488 sectors, multi 16: LBA48
ata1.00: configured for UDMA/100
ata2: SATA link down (SStatus 0 SControl 300)
scsi 0:0:0:0: Direct-Access ATA  SAMSUNG HD080HJ  WT10 PQ: 0 ANSI: 5
ata1: bounce limit 0x, segment boundary 0x, hw segs 61
sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors (80026 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors (80026 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
ACPI: PCI Interrupt Link [LSA1] enabled at IRQ 22
ACPI: PCI Interrupt :00:08.0[A] -> Link [LSA1] -> GSI 22 (level, high) -> IRQ 22
sata_nv :00:08.0: Using ADMA mode

Max
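
For readers following the log, the link-status and capacity figures in the
excerpt above can be cross-checked by hand. The sketch below is illustrative
only. It assumes the SStatus value is printed in hexadecimal and uses the
DET/SPD/IPM field layout of the SATA SStatus register. The helper names
(decode_sstatus, capacity_mb) are hypothetical and are not part of any
kernel interface.

```python
# Illustrative helpers for reading two values from the dmesg excerpt.
# Assumption: "SStatus 123" is hexadecimal (0x123), laid out as
# bits 3:0 = DET, bits 7:4 = SPD, bits 11:8 = IPM.

DET = {
    0x0: "no device detected",
    0x1: "device present, no PHY communication",
    0x3: "device present, PHY communication established",
    0x4: "PHY offline",
}
SPD = {0x0: "no negotiated speed", 0x1: "1.5 Gbps", 0x2: "3.0 Gbps"}
IPM = {0x0: "not present", 0x1: "active", 0x2: "partial", 0x6: "slumber"}


def decode_sstatus(value):
    """Split an SStatus register value into its three 4-bit fields."""
    det = value & 0xF
    spd = (value >> 4) & 0xF
    ipm = (value >> 8) & 0xF
    return {
        "detection": DET.get(det, "reserved"),
        "speed": SPD.get(spd, "reserved"),
        "power": IPM.get(ipm, "reserved"),
    }


def capacity_mb(sectors, sector_size=512):
    """Capacity in decimal megabytes (10**6 bytes), rounded down."""
    return sectors * sector_size // 10**6


if __name__ == "__main__":
    print(decode_sstatus(0x123))   # ata1, reported as "link up 3.0 Gbps"
    print(decode_sstatus(0x0))     # ata2, reported as "link down"
    print(capacity_mb(156301488))  # sector count reported for sda
```

Under these assumptions, SStatus 123 decodes to an established PHY link at
3.0 Gbps in the active power state, and 156301488 sectors of 512 bytes come
to 80026 decimal megabytes, both consistent with what the log reports.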



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Strange freezes (seems like SATA related)

2007-11-06 Thread Max Krasnyansky


Andrew Morton wrote:
> On Mon, 29 Oct 2007 09:54:27 -0700
> Max Krasnyansky <[EMAIL PROTECTED]> wrote:
> 
>> A couple of HP xw9300 machines (dual Opterons) started freezing up.
>> We're running 2.6.22.1 on them. The freezes are somewhat weird: the
>> VGA console is alive (I can switch vts, etc.) but everything else is
>> dead (network, etc.).
>> Unfortunately SYSRQ was not enabled and I could not get backtraces and stuff.
>>
>> Hooked up serial console and the only error that shows up is this.
>>
>> ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 status 0x1540 next cpb count 0x0 next cpb idx 0x0
>> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out
>>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> Descriptor sense data with sense descriptors (in hex):
>> end_request: I/O error, dev sda, sector 8388695
>> Buffer I/O error on device sda1, logical block 1048579
>> lost page write due to I/O error on sda1
>> sd 0:0:0:0: [sda] Write Protect is off
>>
>> I see a bunch of those and then the box just sits there spewing this
>> periodically
>>
>> ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 status 0x1540 next cpb count 0x0 next cpb idx 0x0
>> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> ata1.00: cmd ca/00:08:4f:00:f8/00:00:00:00:00/e1 tag 0 cdb 0x0 data 4096 out
>>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>
>> SMART selftest on the drive passed without errors.
>>
>> Here is what this machine looks like
>>
>> ...
> 
> So this happens on more than one machine?
Yep.

> The kernel shouldn't freeze, so even if both machines have magically
> identical hardware faults, there's a kernel bug there somewhere.
> 
> I guess it would be useful to test a 2.6.23 kernel if poss.  We've seen a
> very large number of reports like this one in recent months (many of which
> have not been responded to, btw) and perhaps someone has done something
> about them.
I may not be able to run an identical workload on 2.6.23, but I will try to
give it a shot sometime next week. Also, I upgraded to 2.6.22.10 last week.
There are a few fixes in there that may affect those boxes.

Max
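
The failed command in the quoted error can be decoded by hand. The sketch
below is illustrative only. It assumes the "cmd" line uses the libata layout
command/feature:nsect:lbal:lbam:lbah/(five HOB bytes)/device, and that the
command is an LBA28 one (0xca, WRITE DMA), so the top four LBA bits come
from the low nibble of the device byte. The function name decode_lba28 is
hypothetical.

```python
# Illustrative decoder for the taskfile printed on the "cmd" line of a
# libata error report. Assumption: LBA28 addressing, 512-byte sectors.

def decode_lba28(taskfile, sector_size=512):
    """Return command, start LBA, sector count and byte count."""
    command, low, _hob, device = taskfile.split("/")
    _feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    dev = int(device, 16)
    # LBA bits 27:24 live in the low nibble of the device register.
    lba = ((dev & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    # In LBA28 commands a sector count of 0 means 256 sectors.
    sectors = nsect if nsect else 256
    return {
        "command": int(command, 16),
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * sector_size,
    }


if __name__ == "__main__":
    for tf in ("ca/00:08:57:00:80/00:00:00:00:00/e0",
               "ca/00:08:4f:00:f8/00:00:00:00:00/e1"):
        print(tf, decode_lba28(tf))
```

Under these assumptions the first command decodes to an 8-sector (4096-byte)
write at LBA 8388695, which matches the "data 4096 out" and "sector 8388695"
lines in the log. If sda1 starts at sector 63 (an assumption; the partition
table is not shown in this thread), then (8388695 - 63) / 8 = 1048579, which
agrees with the reported logical block.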





Re: Strange freezes (seems like SATA related)

2007-11-01 Thread Jeff Garzik

Heikki Orsila wrote:
> On Mon, Oct 29, 2007 at 09:54:27AM -0700, Max Krasnyansky wrote:
>> A couple of HP xw9300 machines (dual Opterons) started freezing up.
>> We're running 2.6.22.1 on them. The freezes are somewhat weird: the
>> VGA console is alive (I can switch vts, etc.) but everything else is
>> dead (network, etc.).
>
> I'm thinking this is not a coincidence. I was running 2.6.22.5 and,
> looking at your problems, I had a similar experience on Tuesday.
> The network was still fine after the kernel errors, so I was able to
> log in with SSH. See:
>
> http://lkml.org/lkml/2007/10/30/193
>
>> ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 status 0x1540 next cpb count 0x0 next cpb idx 0x0
>> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out
>>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> Descriptor sense data with sense descriptors (in hex):
>> end_request: I/O error, dev sda, sector 8388695
>> Buffer I/O error on device sda1, logical block 1048579
>> lost page write due to I/O error on sda1
>> sd 0:0:0:0: [sda] Write Protect is off
>
> With ata_piix Intel SATA I got these errors:
>
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd ca/00:68:6f:3a:00/00:00:00:00:00/e0 tag 0 cdb 0x0 data 53248 out
>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1: port is slow to respond, please be patient (Status 0xd0)
> ata1: device not ready (errno=-16), forcing hardreset
> ata1: soft resetting port
> ata1.00: revalidation failed (errno=-2)
> ata1: failed to recover some devices, retrying in 5 secs
> ata1: soft resetting port
> ata1.00: configured for UDMA/133
> ata1: EH complete
> sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
> sd 0:0:0:0: [sda] Write Protect is off
> sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA


These are two 100% different issues. The only thing they have in
common is that they spit out an error.


Jeff





Re: Strange freezes (seems like SATA related)

2007-11-01 Thread Heikki Orsila
On Mon, Oct 29, 2007 at 09:54:27AM -0700, Max Krasnyansky wrote:
> A couple of HP xw9300 machines (dual Opterons) started freezing up.
> We're running 2.6.22.1 on them. The freezes are somewhat weird: the
> VGA console is alive (I can switch vts, etc.) but everything else is
> dead (network, etc.).

I'm thinking this is not a coincidence. I was running 2.6.22.5 and,
looking at your problems, I had a similar experience on Tuesday.
The network was still fine after the kernel errors, so I was able to
log in with SSH. See:

http://lkml.org/lkml/2007/10/30/193

> ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 status 0x1540 next cpb count 0x0 next cpb idx 0x0
> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out
>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> Descriptor sense data with sense descriptors (in hex):
> end_request: I/O error, dev sda, sector 8388695
> Buffer I/O error on device sda1, logical block 1048579
> lost page write due to I/O error on sda1
> sd 0:0:0:0: [sda] Write Protect is off

With ata_piix Intel SATA I got these errors:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:68:6f:3a:00/00:00:00:00:00/e0 tag 0 cdb 0x0 data 53248 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1: port is slow to respond, please be patient (Status 0xd0)
ata1: device not ready (errno=-16), forcing hardreset
ata1: soft resetting port
ata1.00: revalidation failed (errno=-2)
ata1: failed to recover some devices, retrying in 5 secs
ata1: soft resetting port
ata1.00: configured for UDMA/133
ata1: EH complete
sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

> Here is what this machine looks like
> 
> 00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
> 00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
> 00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
> 00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
> 00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
> 00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio Controller (rev a2)
> 00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
> 00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
> 00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
> 00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
> 00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
> 00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
> 00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
> 00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
> 00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
> 00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
> 00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
> 00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
> 00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
> 00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
> 05:04.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE]
> 05:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
> 0a:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
> 40:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
> 40:01.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
> 40:02.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
> 40:02.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
> 61:04.0 PCI bridge: Intel Corporation Unknown device 537c (rev 07)
> 61:06.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
> 61:06.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
> 61:09.0 PCI bridge: Intel Corporation Unknown device 537c (rev 07)
> 62:09.0 Multimedia controller: BittWare, Inc. Unknown device 0035 (rev 01)
> 63:09.0 Multimedia controller: BittWare, Inc. Unknown device 0035 (rev 01)
> 80:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
> 80:01.0 Memory 

Re: Strange freezes (seems like SATA related)

2007-11-01 Thread Andrew Morton
On Mon, 29 Oct 2007 09:54:27 -0700
Max Krasnyansky <[EMAIL PROTECTED]> wrote:

> A couple of HP xw9300 machines (dual Opterons) started freezing up.
> We're running 2.6.22.1 on them. The freezes are somewhat weird: the
> VGA console is alive (I can switch vts, etc.) but everything else is
> dead (network, etc.).
> Unfortunately SYSRQ was not enabled and I could not get backtraces and stuff.
> 
> Hooked up serial console and the only error that shows up is this.
> 
> ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 status 0x1540 next cpb count 0x0 next cpb idx 0x0
> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out
>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> Descriptor sense data with sense descriptors (in hex):
> end_request: I/O error, dev sda, sector 8388695
> Buffer I/O error on device sda1, logical block 1048579
> lost page write due to I/O error on sda1
> sd 0:0:0:0: [sda] Write Protect is off
> 
> I see a bunch of those and then the box just sits there spewing this
> periodically
> 
> ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 status 0x1540 next cpb count 0x0 next cpb idx 0x0
> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd ca/00:08:4f:00:f8/00:00:00:00:00/e1 tag 0 cdb 0x0 data 4096 out
>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> 
> SMART selftest on the drive passed without errors.
> 
> Here is what this machine looks like
> 
> ...

So this happens on more than one machine?

The kernel shouldn't freeze, so even if both machines have magically
identical hardware faults, there's a kernel bug there somewhere.

I guess it would be useful to test a 2.6.23 kernel if poss.  We've seen a
very large number of reports like this one in recent months (many of which
have not been responded to, btw) and perhaps someone has done something
about them.



Re: Strange freezes (seems like SATA related)

2007-10-29 Thread Robert Hancock

Max Krasnyansky wrote:

A couple of HP xw9300 machines (dual Opterons) started freezing up.
We're running on 2.6.22.1 on them. Freezes a somewhere weird. VGA console is 
alive
(I can switch vts, etc) but everything else is dead (network, etc).
Unfortunately SYSRQ was not enabled and I could not get backtraces and stuff.

Hooked up serial console and the only error that shows up is this.

ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 status 
0x1540 next cpb count 0x0 next cpb idx 0x0
ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Descriptor sense data with sense descriptors (in hex):
end_request: I/O error, dev sda, sector 8388695
Buffer I/O error on device sda1, logical block 1048579
lost page write due to I/O error on sda1
sd 0:0:0:0: [sda] Write Protect is off

I see a bunch of those and then the box just sits there spewing this 
periodically

ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 status 
0x1540 next cpb count 0x0 next cpb idx 0x0
ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:08:4f:00:f8/00:00:00:00:00/e1 tag 0 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

SMART selftest on the drive passed without errors.

Here is how this machine looks like

00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio 
Controller (rev a2)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
05:04.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE]
05:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
0a:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
40:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
40:01.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
40:02.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
40:02.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
61:04.0 PCI bridge: Intel Corporation Unknown device 537c (rev 07)
61:06.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
61:06.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
61:09.0 PCI bridge: Intel Corporation Unknown device 537c (rev 07)
62:09.0 Multimedia controller: BittWare, Inc. Unknown device 0035 (rev 01)
63:09.0 Multimedia controller: BittWare, Inc. Unknown device 0035 (rev 01)
80:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
80:01.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
80:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
81:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)

As I mentioned, dual Opteron, NUMA. Nothing fancy in the kernel config.

Any ideas what the problem might be?


Can you post the full dmesg output? What kind of drive is this?

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Strange freezes (seems like SATA related)

2007-10-29 Thread Max Krasnyansky
A couple of HP xw9300 machines (dual Opterons) started freezing up.
We're running 2.6.22.1 on them. The freezes are somewhat weird: the VGA console is
alive (I can switch VTs, etc.) but everything else is dead (network, etc.).
Unfortunately SYSRQ was not enabled, so I could not get backtraces and the like.

I hooked up a serial console, and the only error that shows up is this:

ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 status 0x1540 next cpb count 0x0 next cpb idx 0x0
ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Descriptor sense data with sense descriptors (in hex):
end_request: I/O error, dev sda, sector 8388695
Buffer I/O error on device sda1, logical block 1048579
lost page write due to I/O error on sda1
sd 0:0:0:0: [sda] Write Protect is off
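
(A side note on reading the two numbers above: the end_request sector is counted from the start of the whole disk, while the buffer-layer logical block is counted from the start of sda1. What follows is only a rough sketch relating the two. The 4096-byte block size is an assumption taken from the "data 4096 out" field, and the function name is made up for illustration.)

```python
SECTOR_SIZE = 512   # bytes per hardware sector, as reported for sda
BLOCK_SIZE = 4096   # ASSUMPTION: buffer block size, inferred from "data 4096 out"


def implied_partition_start(disk_sector, logical_block,
                            block_size=BLOCK_SIZE, sector_size=SECTOR_SIZE):
    """Return the disk sector at which the partition would have to start
    for `logical_block` (partition-relative) to land on `disk_sector`
    (disk-relative). Hypothetical helper, for illustration only."""
    sectors_per_block = block_size // sector_size
    return disk_sector - logical_block * sectors_per_block


# Values copied from the log above.
print(implied_partition_start(8388695, 1048579))  # prints 63
```

With the values from this log the result is 63, which would be consistent with sda1 starting at the traditional sector-63 offset.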

I see a bunch of those, and then the box just sits there spewing this periodically:

ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 status 0x1540 next cpb count 0x0 next cpb idx 0x0
ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:08:4f:00:f8/00:00:00:00:00/e1 tag 0 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
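
(For anyone decoding the "cmd" lines above: libata prints the taskfile as command/feature:nsect:lbal:lbam:lbah, then five HOB bytes, then the device register. Below is a minimal sketch of a decoder for the LBA28 case only. The regular expression, the helper name and the two-entry command table are illustrative assumptions, not part of any kernel tool.)

```python
import re

# Layout of the libata "cmd" line:
#   cmd CC/FF:NS:LL:LM:LH/hh:hh:hh:hh:hh/DD
TASKFILE_RE = re.compile(
    r"cmd (?P<cmd>[0-9a-f]{2})/"
    r"(?P<feature>[0-9a-f]{2}):(?P<nsect>[0-9a-f]{2}):"
    r"(?P<lbal>[0-9a-f]{2}):(?P<lbam>[0-9a-f]{2}):(?P<lbah>[0-9a-f]{2})/"
    r"(?:[0-9a-f]{2}:){4}[0-9a-f]{2}/"   # HOB bytes, unused for LBA28
    r"(?P<device>[0-9a-f]{2})"
)

# Only the two opcodes relevant here; deliberately not a complete table.
COMMANDS = {0xCA: "WRITE DMA", 0xC8: "READ DMA"}


def decode_taskfile(line):
    """Decode an LBA28 taskfile from a libata 'cmd' log line (sketch)."""
    match = TASKFILE_RE.search(line)
    if match is None:
        raise ValueError("no taskfile found in line")
    tf = {name: int(value, 16) for name, value in match.groupdict().items()}
    # LBA28: bits 27..24 live in the low nibble of the device register.
    lba = ((tf["device"] & 0x0F) << 24) | (tf["lbah"] << 16) \
        | (tf["lbam"] << 8) | tf["lbal"]
    return {
        "command": COMMANDS.get(tf["cmd"], hex(tf["cmd"])),
        "lba": lba,
        "sectors": tf["nsect"] or 256,   # a count of 0 means 256 in LBA28
    }


if __name__ == "__main__":
    for line in (
        "ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out",
        "ata1.00: cmd ca/00:08:4f:00:f8/00:00:00:00:00/e1 tag 0 cdb 0x0 data 4096 out",
    ):
        print(decode_taskfile(line))
```

Applied to the two lines in this report, it gives a WRITE DMA of 8 sectors at LBA 8388695 and another at LBA 33030223. The first matches the sector in the end_request message, and 8 sectors of 512 bytes match the "data 4096 out" field.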

SMART selftest on the drive passed without errors.

Here is what this machine looks like:

00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio Controller (rev a2)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
05:04.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE]
05:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
0a:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
40:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
40:01.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
40:02.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
40:02.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
61:04.0 PCI bridge: Intel Corporation Unknown device 537c (rev 07)
61:06.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
61:06.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
61:09.0 PCI bridge: Intel Corporation Unknown device 537c (rev 07)
62:09.0 Multimedia controller: BittWare, Inc. Unknown device 0035 (rev 01)
63:09.0 Multimedia controller: BittWare, Inc. Unknown device 0035 (rev 01)
80:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
80:01.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
80:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
81:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)

As I mentioned, dual Opteron, NUMA. Nothing fancy in the kernel config.

Any ideas what the problem might be?

Max



