Re: ata command timeout

2007-02-20 Thread Tejun Heo
Jeff Garzik wrote:
> Mark Lord wrote:
>> I don't believe that.  Command timeouts never happen on healthy systems,
>> unless we have a driver bug.  Okay, so I can imagine a pathological case
>> of a full queue (NCQ) with all 32 commands taking longer than usual due
>> to ECC retries in the firmware..
> 
> It's not quite so black and white.  There have definitely been interrupt
> delivery problems that cause command timeouts.  Also, Intel PIIX BMDMA
> (all standard PCI IDE, I think?) is defined to /not/ send an interrupt,
> when a DMA error occurs.  The driver is instructed to time out the
> transaction, and start recovery by deducing the state of things from the
> DMA status bits.
> 
> Nonetheless, I mostly agree with your statement.  The two most common
> causes of timeouts that I see are interrupt delivery problems, and
> driver bugs.

Oh.. well.  My experience is that timeouts are much more common on SATA
than on PATA.  The SATA link seems to be one of the parts most vulnerable
to interference.  When the PSU has the slightest problem, SATA drives
time out or report transmission errors.  The system often survives a brief
fluctuation in power input (e.g. when a compressor starts up) but the SATA
link sometimes reports an error after such an event.

Or just buy a static generator and apply it to your computer case.
Generally the system is perfectly okay with that but the SATA devices
tend to complain or time out.

Those conditions might not be considered healthy in any server
environment but they do occur in cheap desktop environments.  I mean, a
lot of people are putting 10 USD PSUs into their desktop machines.

So, yeah, it might be a driver or other problem but if the problem is very
intermittent, I tend to lean toward a transient hardware problem, and
that's primarily why I wanna make EH kick in and recover faster.

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ata command timeout

2007-02-20 Thread Jeff Garzik

Mark Lord wrote:
> I don't believe that.  Command timeouts never happen on healthy systems,
> unless we have a driver bug.  Okay, so I can imagine a pathological case
> of a full queue (NCQ) with all 32 commands taking longer than usual due
> to ECC retries in the firmware..


It's not quite so black and white.  There have definitely been interrupt 
delivery problems that cause command timeouts.  Also, Intel PIIX BMDMA 
(all standard PCI IDE, I think?) is defined to /not/ send an interrupt, 
when a DMA error occurs.  The driver is instructed to time out the 
transaction, and start recovery by deducing the state of things from the 
DMA status bits.


Nonetheless, I mostly agree with your statement.  The two most common 
causes of timeouts that I see are interrupt delivery problems, and 
driver bugs.


Jeff
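
The recovery step described above (time out, then deduce what happened from the DMA status bits) can be sketched roughly as follows. This is a minimal illustration, assuming the conventional bus-master IDE status register layout (bit 0 active, bit 1 error, bit 2 interrupt); the function name and the returned labels are hypothetical and are not taken from any driver.

```python
# Assumed layout of the bus-master IDE status register.
BMDMA_ACTIVE = 0x01  # bit 0: a DMA transfer is still in progress
BMDMA_ERROR = 0x02   # bit 1: the controller hit an error during the transfer
BMDMA_INTR = 0x04    # bit 2: the device raised its interrupt line


def classify_after_timeout(dma_status: int) -> str:
    """Guess why a DMA command timed out (hypothetical helper and labels)."""
    if dma_status & BMDMA_ERROR:
        # The controller flagged an error; no interrupt is expected here.
        return "dma-error"
    if dma_status & BMDMA_INTR:
        # The device signalled completion but the driver never saw it.
        return "lost-interrupt"
    if dma_status & BMDMA_ACTIVE:
        # Nothing completed and nothing failed: the transfer is stuck.
        return "still-active"
    return "idle"
```

Under these assumptions a status with the error bit set points at a DMA error, while a status with only the interrupt bit set points at an interrupt delivery problem.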




Re: ata command timeout

2007-02-20 Thread Mark Lord

Tejun Heo wrote:
> [EMAIL PROTECTED] wrote:
>> ..
>> ata1: command timeout
>> Feb 19 20:39:31 linux kernel: ata1: no sense translation for status: 0x40
>> Feb 19 20:39:31 linux kernel: ata1: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
>> Feb 19 20:39:31 linux kernel: ata1: status=0x40 { DriveReady }
>> Feb 19 20:39:31 linux kernel: sd 0:0:0:0: SCSI error: return code = 0x0802
>> Feb 19 20:39:31 linux kernel: sda: Current: sense key: Aborted Command
>> Feb 19 20:39:31 linux kernel: Additional sense: No additional sense information
>> Feb 19 20:39:31 linux kernel: end_request: I/O error, dev sda, sector 89553479
>>
>> without any other ill-effects that I know of(I did smart tests on the drive;
>> all passed successfully).
>> I have read that hddtemp may be the cause of this (I am running version 0.3)
>> so is there any reason
>> to worry and prepare for a HDD replacement?
>
> Not really.  If the problem occurs very infrequently, you don't need to
> worry about it too much.  Command timeouts do occur on otherwise healthy
> systems from time to time.


I don't believe that.  Command timeouts never happen on healthy systems,
unless we have a driver bug.  Okay, so I can imagine a pathological case
of a full queue (NCQ) with all 32 commands taking longer than usual due
to ECC retries in the firmware..
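
To put rough numbers on that pathological case, here is a back-of-the-envelope sketch. It assumes the drive finishes queued commands one after another and that the command timeout is 30 seconds; both figures, and the function names, are assumptions for illustration and are not stated in this thread.

```python
def worst_case_wait(queue_depth: int, seconds_per_command: float) -> float:
    """Time until the last queued command completes, assuming serial service."""
    return queue_depth * seconds_per_command


def last_command_times_out(queue_depth: int,
                           seconds_per_command: float,
                           timeout_seconds: float = 30.0) -> bool:
    """True if the last command in the queue would exceed the assumed timeout."""
    return worst_case_wait(queue_depth, seconds_per_command) > timeout_seconds
```

With a full queue of 32 commands, each command would have to take about a second on average before the last one crossed the assumed 30 second limit.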

But in real life, on a desktop, timeouts never happen as a normal event.

I wonder what's *really* wrong here?

Cheers



Re: ata command timeout

2007-02-19 Thread Marc Marais
On Tue, 20 Feb 2007 13:07:50 +0900, Tejun Heo wrote
> [EMAIL PROTECTED] wrote:
> > Hello,
> > 
> > I have been running 2.6.18 for two months and the last couple of days these 
> > error messages have appeared in my logs 
> > (sata_promise kernel module, sda:SATA sdb:PATA disks):
> > 
> > ata1: command timeout
> > Feb 17 22:23:14 linux kernel: ata1: no sense translation for status: 0x40
> > Feb 17 22:23:14 linux kernel: ata1: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
> > Feb 17 22:23:14 linux kernel: ata1: status=0x40 { DriveReady }
> > Feb 17 22:23:14 linux kernel: sd 0:0:0:0: SCSI error: return code = 0x0802
> > Feb 17 22:23:14 linux kernel: sda: Current: sense key: Aborted Command
> > Feb 17 22:23:14 linux kernel: Additional sense: No additional sense information
> > Feb 17 22:23:14 linux kernel: end_request: I/O error, dev sda, sector 145179585
> > Feb 17 22:23:14 linux kernel: Buffer I/O error on device sda2, logical block 2787300
> > Feb 17 22:23:14 linux kernel: lost page write due to I/O error on sda2
> > 
> > and 
> > 
> > ata1: command timeout
> > Feb 19 20:39:31 linux kernel: ata1: no sense translation for status: 0x40
> > Feb 19 20:39:31 linux kernel: ata1: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
> > Feb 19 20:39:31 linux kernel: ata1: status=0x40 { DriveReady }
> > Feb 19 20:39:31 linux kernel: sd 0:0:0:0: SCSI error: return code = 0x0802
> > Feb 19 20:39:31 linux kernel: sda: Current: sense key: Aborted Command
> > Feb 19 20:39:31 linux kernel: Additional sense: No additional sense information
> > Feb 19 20:39:31 linux kernel: end_request: I/O error, dev sda, sector 89553479
> > 
> > without any other ill-effects that I know of(I did smart tests on the drive;
> > all passed successfully).
> > I have read that hddtemp may be the cause of this (I am running version 0.3)
> > so is there any reason
> > to worry and prepare for a HDD replacement?
> 
> Not really.  If the problem occurs very infrequently, you don't need 
> to worry about it too much.  Command timeouts do occur on otherwise healthy
> systems from time to time.
> 
> -- 
> tejun
> -

I'm experiencing the exact same problem with my setup, also with sata_promise.
I posted to the linux-ide list but it wasn't really acknowledged as a
problem in the driver. Are these command timeouts? The log entry doesn't seem
to say that - just an error with 'DriveReady' and command aborted. I would
think some kind of retry should be performed (and, if it is, logged too).
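
For reading the quoted log lines, a minimal sketch that decodes the ATA status byte and the SCSI sense key may help. The bit positions follow the standard ATA status register; the flag names imitate the style of the kernel log output, and they, the partial sense key table and the helper name decode_status are assumptions for illustration rather than driver code.

```python
# ATA status register bits, highest bit first (names are illustrative).
ATA_STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# A few SCSI sense keys; this table is deliberately incomplete.
SENSE_KEYS = {
    0x0: "No Sense",
    0x3: "Medium Error",
    0x4: "Hardware Error",
    0xB: "Aborted Command",
}


def decode_status(status: int) -> list[str]:
    """Return the names of all flags set in an ATA status byte."""
    return [name for bit, name in ATA_STATUS_BITS if status & bit]
```

Under these assumptions 0x40 decodes to DriveReady alone, with the Error bit clear, and sense key 0xb is the Aborted Command that the log reports.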

The errors may be benign, but with software RAID (md driver) such an error
may cause a degraded array or, worse, a damaged array should a read error
like this occur when the array is already degraded.

The question is what happens after the error is reported: is the operation
retried? In my situation the md layer receives the error and recovers by
taking the data from another drive in the array.
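
The recovery behaviour described above (read error on one drive, data taken from another) can be sketched generically as below. This only illustrates the idea and is not how the md driver is implemented; the function name and the representation of each mirror as a callable are hypothetical.

```python
from typing import Callable, Optional, Sequence

# Each mirror is modelled as a function that reads one sector or raises IOError.
ReadFn = Callable[[int], bytes]


def read_with_fallback(mirrors: Sequence[ReadFn], sector: int) -> bytes:
    """Try each mirror in turn and return the first successful read."""
    last_error: Optional[Exception] = None
    for read in mirrors:
        try:
            return read(sector)
        except IOError as exc:
            # Remember the failure and move on to the next copy.
            last_error = exc
    # No copy could be read; with a degraded array this is the dangerous case.
    raise IOError(f"sector {sector}: all mirrors failed") from last_error
```

With two healthy copies a single failed read is absorbed; once only one copy is left, the same error propagates to the caller, which is the degraded-array risk raised above.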

The fact that you are also experiencing this means it might be an issue that
needs further investigation, in my opinion.



Regards,
Marc

--


Re: ata command timeout

2007-02-19 Thread Tejun Heo
[EMAIL PROTECTED] wrote:
> Hello,
> 
> I have been running 2.6.18 for two months and the last couple of days these 
> error messages have appeared in my logs 
> (sata_promise kernel module, sda:SATA sdb:PATA disks):
> 
> ata1: command timeout
> Feb 17 22:23:14 linux kernel: ata1: no sense translation for status: 0x40
> Feb 17 22:23:14 linux kernel: ata1: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
> Feb 17 22:23:14 linux kernel: ata1: status=0x40 { DriveReady }
> Feb 17 22:23:14 linux kernel: sd 0:0:0:0: SCSI error: return code = 0x0802
> Feb 17 22:23:14 linux kernel: sda: Current: sense key: Aborted Command
> Feb 17 22:23:14 linux kernel: Additional sense: No additional sense information
> Feb 17 22:23:14 linux kernel: end_request: I/O error, dev sda, sector 145179585
> Feb 17 22:23:14 linux kernel: Buffer I/O error on device sda2, logical block 2787300
> Feb 17 22:23:14 linux kernel: lost page write due to I/O error on sda2
> 
> and 
> 
> ata1: command timeout
> Feb 19 20:39:31 linux kernel: ata1: no sense translation for status: 0x40
> Feb 19 20:39:31 linux kernel: ata1: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
> Feb 19 20:39:31 linux kernel: ata1: status=0x40 { DriveReady }
> Feb 19 20:39:31 linux kernel: sd 0:0:0:0: SCSI error: return code = 0x0802
> Feb 19 20:39:31 linux kernel: sda: Current: sense key: Aborted Command
> Feb 19 20:39:31 linux kernel: Additional sense: No additional sense information
> Feb 19 20:39:31 linux kernel: end_request: I/O error, dev sda, sector 89553479
> 
> without any other ill-effects that I know of(I did smart tests on the drive;
> all passed successfully).
> I have read that hddtemp may be the cause of this (I am running version 0.3)
> so is there any reason
> to worry and prepare for a HDD replacement?

Not really.  If the problem occurs very infrequently, you don't need to
worry about it too much.  Command timeouts do occur on otherwise healthy
systems from time to time.

-- 
tejun

