Re: PATA Sil680 Command Timeout on ARM XScale

2007-03-14 Thread Fajun Chen

Since primary channel and secondary channel share the same IRQ,  the
ISR could be called to service one or both channels. So I would think
it's normal to see irq trap traces when both channels are in IO
operation, correct?

I have another question in regard to ata_host_intr() function in
libata-core.c. For PIO read/write, the status of interrupt pin was not
checked before moving the host state machine.  Sil680 spec. recommend
checking IDE channel interupt (bit 11 in the IDEx Task File Timing and
Config + Status register) though.  Could someone explain why interrupt
status does not need to be checked for PIO?

I'm still troubleshooting the command timeout issue on my test
hardware, I will repeat the same test on i386 hardware. In the mean
time, I would appreciate any suggestions or known information to
isolate the issue.

Thanks,
Fajun

On 3/13/07, Fajun Chen [EMAIL PROTECTED] wrote:

On 3/13/07, Alan Cox [EMAIL PROTECTED] wrote:
  above 1.5.   Two timers are used to track command timeout in our test
  software.  The one in the user space is set to 6 seconds using alarm()
  call while the one in the kernel (scsi timer) is set to 5 seconds.
  These timeout values are probably too low to be realistic,  but the
  issue here is not about the timeout itself but to understand why it is

 A lot of drive commands seem to be set up on a seven second worst case

  always user space timer expired before kernel timer.   Since kernel
  timer uses jiffies to track time, does this imply a kernel bug where
  the time interrupts were lost or delay somehow?  Do you know any know
  problems related to command timeout in PATA Sil680?

 Alarm() is also handled by the same jiffies logic, so I suspect a bug in
 your test environment ?


I enabled ata_irq_trap and did the same test again. The kernel timer
caught the timeout (10 seconds) this time along with the irq trap
traces below.  What's the cause of these idle irqs?

[42949560.15] SCSI device sdb: drive cache: write back
[   85.57] ata1: irq trap
[   85.82] ata2: irq trap
[   92.12] abnormal status 0xD0
[   92.12] ata1: irq trap
[   92.92] ata2: irq trap
[   98.75] ata1: irq trap
[  100.26] abnormal status 0xD0
[  100.26] ata2: irq trap
[  105.54] ata1: irq trap
[  108.05] ata1: irq trap
[  110.62] ata1: irq trap
[  113.13] ata1: irq trap
[  115.53] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[  115.53] ata1.00: (BMDMA stat 0x0)
[  115.53] ata1.00: tag 0 cmd 0xc8 Emask 0x4 stat 0x40 err 0x0 (timeout)
[  115.53] ata1: soft resetting port
[  115.57] ata1.00: configured for UDMA/100
[  115.57] sg_cmd_done: sg0, pack_id=2706, res=0x802, dur=10040 ms
[  115.57] ata1: EH complete
[  115.58] SCSI device sda: 156301488 512-byte hdwr sectors (80026 MB)
[  115.58] sda: Write Protect is off
[  115.58] sda: Mode Sense: 00 3a 00 00
[  115.58] SCSI device sda: drive cache: write back
...

Thanks,
Fajun


-
To unsubscribe from this list: send the line unsubscribe linux-ide in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATA Sil680 Command Timeout on ARM XScale

2007-03-14 Thread Jeff Garzik

Fajun Chen wrote:

Since primary channel and secondary channel share the same IRQ,  the
ISR could be called to service one or both channels. So I would think
it's normal to see irq trap traces when both channels are in IO
operation, correct?


The irq trap code only occurs after a certain number of unhandled 
interrupts.




I have another question in regard to ata_host_intr() function in
libata-core.c. For PIO read/write, the status of interrupt pin was not
checked before moving the host state machine.  Sil680 spec. recommend
checking IDE channel interupt (bit 11 in the IDEx Task File Timing and
Config + Status register) though.  Could someone explain why interrupt
status does not need to be checked for PIO?


Reading the Status register (as opposed to AltStatus) should clear the 
interrupt condition, on standard hardware.


Jeff


-
To unsubscribe from this list: send the line unsubscribe linux-ide in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


PATA Sil680 Command Timeout on ARM XScale

2007-03-13 Thread Fajun Chen

Hi Folks,

We have a command timeout with Sil680 controller on ARM XScale. The
kernel is 2.6.18-rc2 and libata 2.00 with preemptive enabled. Similar
problem observed as well with kernel preemptive disabled.  ATA pass
through and sg are used.  Heavy IO test was ran on both channels of
Sil680 and the system was pretty loaded where the load average was
above 1.5.   Two timers are used to track command timeout in our test
software.  The one in the user space is set to 6 seconds using alarm()
call while the one in the kernel (scsi timer) is set to 5 seconds.
These timeout values are probably too low to be realistic,  but the
issue here is not about the timeout itself but to understand why it is
always user space timer expired before kernel timer.   Since kernel
timer uses jiffies to track time, does this imply a kernel bug where
the time interrupts were lost or delay somehow?  Do you know any know
problems related to command timeout in PATA Sil680?

Thanks,
Fajun

User space trace:
Cmd 4276 timed out after 7.260137 secs: start time 1173775439.409099
secs, timed out at 1173775446.669236 secs
[Tue Mar 13 08:44:06 2007]:
Test:   Random Write Sectors Extended
LBA Low:0
LBA High:   10
...
Num Cmds:   4277
Num_Failed_Cmds:1
...
Status: Fail [Error 401: Command timeout]

Dmesg log

~ $ dmesg
.77] Calling initcall 0xc001ebb4: inet_diag_init+0x0/0x80()
[42949375.77] Calling initcall 0xc001ec34: tcp_diag_init+0x0/0x1c()
[42949375.77] Calling initcall 0xc001ec50: bictcp_register+0x0/0x1c()
[42949375.77] TCP bic registered
[42949375.77] Calling initcall 0xc001ee2c: af_unix_init+0x0/0x80()
[42949375.77] NET: Registered protocol family 1
[42949375.77] Calling initcall 0xc001eeac: packet_init+0x0/0x70()
[42949375.77] NET: Registered protocol family 17
[42949375.77] Calling initcall 0xc0012a88:
clocksource_done_booting+0x0/0x24()
[42949375.77] Calling initcall 0xc0019ed4: seqgen_init+0x0/0x1c()
[42949375.77] Calling initcall 0xc001ba44:
early_uart_console_switch+0x0/0x90()
[42949375.77] Calling initcall 0xc013a150: net_random_reseed+0x0/0x38()
[42949375.77] RAMDISK: Compressed image found at block 0
[42949378.95] VFS: Mounted root (ext2 filesystem).
[42949378.96] Freeing init memory: 104K
[42949549.17] ata1: soft resetting port
[42949549.25] ata1.00: ATA-6, max UDMA/100, 78140160 sectors: LBA48
[42949549.25] ata1.00: configured for UDMA/100
[42949549.25] ata1: EH complete
[42949549.25]   Vendor: ATA   Model: ST940813AMRev: 3.02
[42949549.25]   Type:   Direct-Access  ANSI
SCSI revision: 05
[42949549.26] SCSI device sda: 78140160 512-byte hdwr sectors (40008 MB)
[42949549.26] sda: Write Protect is off
[42949549.26] sda: Mode Sense: 00 3a 00 00
[42949549.26] SCSI device sda: drive cache: write back
[42949549.27] SCSI device sda: 78140160 512-byte hdwr sectors (40008 MB)
[42949549.27] sda: Write Protect is off
[42949549.27] sda: Mode Sense: 00 3a 00 00
[42949549.27] SCSI device sda: drive cache: write back
[42949549.27]  sda: unknown partition table
[42949549.29] sd 0:0:0:0: Attached scsi disk sda
[42949549.29] sd 0:0:0:0: Attached scsi generic sg0 type 0
[42949549.32] ata1: soft resetting port
[42949549.38] ata1.00: configured for UDMA/100
[42949549.38] ata1: EH complete
[42949549.38] SCSI device sda: 78140160 512-byte hdwr sectors (40008 MB)
[42949549.38] sda: Write Protect is off
[42949549.38] sda: Mode Sense: 00 3a 00 00
[42949549.39] SCSI device sda: drive cache: write back
[42949559.28] ata2: soft resetting port
[42949559.42] ata2.00: ATA-6, max UDMA/100, 78140160 sectors: LBA48
[42949559.42] ata2.00: configured for UDMA/100
[42949559.42] ata2: EH complete
[42949559.42]   Vendor: ATA   Model: ST94811A  Rev: 3.07
[42949559.42]   Type:   Direct-Access  ANSI
SCSI revision: 05
[42949559.43] SCSI device sdb: 78140160 512-byte hdwr sectors (40008 MB)
[42949559.43] sdb: Write Protect is off
[42949559.43] sdb: Mode Sense: 00 3a 00 00
[42949559.43] SCSI device sdb: drive cache: write back
[42949559.43] SCSI device sdb: 78140160 512-byte hdwr sectors (40008 MB)
[42949559.44] sdb: Write Protect is off
[42949559.44] sdb: Mode Sense: 00 3a 00 00
[42949559.44] SCSI device sdb: drive cache: write back
[42949559.44]  sdb: unknown partition table
[42949559.46] sd 1:0:0:0: Attached scsi disk sdb
[42949559.46] sd 1:0:0:0: Attached scsi generic sg1 type 0
[  643.23] NWFPE: ntpd[38] takes exception 0001 at c002d514
from 0001d308
[  711.23] NWFPE: ntpd[38] takes exception 0001 at c002d514
from 0001d308
[  777.22] NWFPE: ntpd[38] takes exception 0001 at c002d514
from 0001d308
[  841.30] NWFPE: ntpd[38] takes exception 0001 at c002d514
from 

Re: PATA Sil680 Command Timeout on ARM XScale

2007-03-13 Thread Alan Cox
 above 1.5.   Two timers are used to track command timeout in our test
 software.  The one in the user space is set to 6 seconds using alarm()
 call while the one in the kernel (scsi timer) is set to 5 seconds.
 These timeout values are probably too low to be realistic,  but the
 issue here is not about the timeout itself but to understand why it is

A lot of drive commands seem to be set up on a seven second worst case

 always user space timer expired before kernel timer.   Since kernel
 timer uses jiffies to track time, does this imply a kernel bug where
 the time interrupts were lost or delay somehow?  Do you know any know
 problems related to command timeout in PATA Sil680?

Alarm() is also handled by the same jiffies logic, so I suspect a bug in
your test environment ?

-
To unsubscribe from this list: send the line unsubscribe linux-ide in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATA Sil680 Command Timeout on ARM XScale

2007-03-13 Thread Fajun Chen

On 3/13/07, Alan Cox [EMAIL PROTECTED] wrote:

 above 1.5.   Two timers are used to track command timeout in our test
 software.  The one in the user space is set to 6 seconds using alarm()
 call while the one in the kernel (scsi timer) is set to 5 seconds.
 These timeout values are probably too low to be realistic,  but the
 issue here is not about the timeout itself but to understand why it is

A lot of drive commands seem to be set up on a seven second worst case

 always user space timer expired before kernel timer.   Since kernel
 timer uses jiffies to track time, does this imply a kernel bug where
 the time interrupts were lost or delay somehow?  Do you know any know
 problems related to command timeout in PATA Sil680?

Alarm() is also handled by the same jiffies logic, so I suspect a bug in
your test environment ?



I enabled ata_irq_trap and did the same test again. The kernel timer
caught the timeout (10 seconds) this time along with the irq trap
traces below.  What's the cause of these idle irqs?

[42949560.15] SCSI device sdb: drive cache: write back
[   85.57] ata1: irq trap
[   85.82] ata2: irq trap
[   92.12] abnormal status 0xD0
[   92.12] ata1: irq trap
[   92.92] ata2: irq trap
[   98.75] ata1: irq trap
[  100.26] abnormal status 0xD0
[  100.26] ata2: irq trap
[  105.54] ata1: irq trap
[  108.05] ata1: irq trap
[  110.62] ata1: irq trap
[  113.13] ata1: irq trap
[  115.53] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[  115.53] ata1.00: (BMDMA stat 0x0)
[  115.53] ata1.00: tag 0 cmd 0xc8 Emask 0x4 stat 0x40 err 0x0 (timeout)
[  115.53] ata1: soft resetting port
[  115.57] ata1.00: configured for UDMA/100
[  115.57] sg_cmd_done: sg0, pack_id=2706, res=0x802, dur=10040 ms
[  115.57] ata1: EH complete
[  115.58] SCSI device sda: 156301488 512-byte hdwr sectors (80026 MB)
[  115.58] sda: Write Protect is off
[  115.58] sda: Mode Sense: 00 3a 00 00
[  115.58] SCSI device sda: drive cache: write back
...

Thanks,
Fajun
-
To unsubscribe from this list: send the line unsubscribe linux-ide in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html