Re: SATA exceptions with 2.6.20-rc5

2007-02-09 Thread Björn Steinbrink
On 2007.02.04 02:13:51 +0100, Björn Steinbrink wrote:
> On 2007.02.02 23:48:14 -0600, Robert Hancock wrote:
> > There's a patch in -mm (sata_nv-use-adma-for-nodata-commands.patch) 
> > which should hopefully avoid this problem for the cache flush commands, 
> > at least - can you try that one out? You'll have to apply the other 
> > sata_nv patches in -mm first, i.e. this order:
> > 
> > http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2.patch
> > http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2-cleanup.patch
> > http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-use-adma-for-nodata-commands.patch
> 
> Got 2.6.20-rc7 with them applied now (the rejects seemed trivial enough
> for me to fix them). Let's see how that works out...

After about 1.5 days of uptime, an involuntary reboot and another 3
days of uptime, no sign of an exception. No stress testing was done,
but a few disk intensive actions did happen, at least more than with
that -rc6 that did throw an exception at me.

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-02-03 Thread Björn Steinbrink
On 2007.02.02 23:48:14 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >On 2007.01.24 01:39:23 +0100, Björn Steinbrink wrote:
> >>On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> >>>Larry Walton wrote:
> The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
> seems to have fix the problem.  Much appreciated, 
> thank you. I'd consider it a must have in 2.6.20.
> >>>Can any of the rest of you that have been seeing this problem also 
> >>>confirm that this fixes it?
> >>Seems to work for me, uptime is about an hour now and no exception yet.
> >>Had the stress test running for only about 10 minutes, but I usually got
> >>an exception within an hour even during plain irssi usage, so I'm quite
> >>confident that the patch fixes it.
> >
> >Or maybe not :( Just got an exception on 2.6.20-rc6. Took 4 days of
> >uptime to trigger, so it's just a lot harder to trigger now.
> 
> Same exception details as before?

Yes, exactly the same.

> There's a patch in -mm (sata_nv-use-adma-for-nodata-commands.patch) 
> which should hopefully avoid this problem for the cache flush commands, 
> at least - can you try that one out? You'll have to apply the other 
> sata_nv patches in -mm first, i.e. this order:
> 
> http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2.patch
> http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2-cleanup.patch
> http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-use-adma-for-nodata-commands.patch

Got 2.6.20-rc7 with them applied now (the rejects seemed trivial enough
for me to fix them). Let's see how that works out...

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-02-02 Thread Robert Hancock

Björn Steinbrink wrote:

On 2007.01.24 01:39:23 +0100, Björn Steinbrink wrote:

On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:

Larry Walton wrote:
The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
seems to have fix the problem.  Much appreciated, 
thank you. I'd consider it a must have in 2.6.20.
Can any of the rest of you that have been seeing this problem also 
confirm that this fixes it?

Seems to work for me, uptime is about an hour now and no exception yet.
Had the stress test running for only about 10 minutes, but I usually got
an exception within an hour even during plain irssi usage, so I'm quite
confident that the patch fixes it.


Or maybe not :( Just got an exception on 2.6.20-rc6. Took 4 days of
uptime to trigger, so it's just a lot harder to trigger now.


Same exception details as before?

There's a patch in -mm (sata_nv-use-adma-for-nodata-commands.patch) 
which should hopefully avoid this problem for the cache flush commands, 
at least - can you try that one out? You'll have to apply the other 
sata_nv patches in -mm first, i.e. this order:


http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2.patch
http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2-cleanup.patch
http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-use-adma-for-nodata-commands.patch

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-02-02 Thread Björn Steinbrink
On 2007.01.24 01:39:23 +0100, Björn Steinbrink wrote:
> On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> > Larry Walton wrote:
> > >The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
> > >seems to have fix the problem.  Much appreciated, 
> > >thank you. I'd consider it a must have in 2.6.20.
> > 
> > Can any of the rest of you that have been seeing this problem also 
> > confirm that this fixes it?
> 
> Seems to work for me, uptime is about an hour now and no exception yet.
> Had the stress test running for only about 10 minutes, but I usually got
> an exception within an hour even during plain irssi usage, so I'm quite
> confident that the patch fixes it.

Or maybe not :( Just got an exception on 2.6.20-rc6. Took 4 days of
uptime to trigger, so it's just a lot harder to trigger now.

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-24 Thread Björn Steinbrink
On 2007.01.24 09:24:00 +0100, Ian Kumlien wrote:
> On tis, 2007-01-23 at 17:18 -0600, Robert Hancock wrote:
> > Larry Walton wrote:
> > > The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
> > > seems to have fix the problem.  Much appreciated, 
> > > thank you. I'd consider it a must have in 2.6.20.
> > 
> > Can any of the rest of you that have been seeing this problem also 
> > confirm that this fixes it?
> 
> I applied it yesterday and today my dmesg contains three:
> BUG: at mm/truncate.c:60 cancel_dirty_page()

David Chinner sent two patches regarding that bug yesterday.
http://lkml.org/lkml/2007/1/23/190
http://lkml.org/lkml/2007/1/23/192

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-24 Thread Ian Kumlien
On tis, 2007-01-23 at 17:18 -0600, Robert Hancock wrote:
> Larry Walton wrote:
> > The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
> > seems to have fix the problem.  Much appreciated, 
> > thank you. I'd consider it a must have in 2.6.20.
> 
> Can any of the rest of you that have been seeing this problem also 
> confirm that this fixes it?

I applied it yesterday and today my dmesg contains three:
BUG: at mm/truncate.c:60 cancel_dirty_page()

Call Trace:
 [] cancel_dirty_page+0x43/0x71
 [] reiserfs_cut_from_item+0x5f8/0x61d
 [] find_get_page+0x21/0x47
 [] reiserfs_do_truncate+0x34d/0x495
 [] reiserfs_truncate_file+0x199/0x2aa
 [] reiserfs_file_release+0x261/0x281
 [] __fput+0xb1/0x17d
 [] filp_close+0x5d/0x65
 [] sys_close+0x8c/0xcf
 [] system_call+0x7e/0x83

Which never happened before... I dunno if they are related though, but
they weren't there before...

(It does fix the timeout problem)

-- 
Ian Kumlien  -- http://pomac.netswarm.net


signature.asc
Description: This is a digitally signed message part


Re: SATA exceptions with 2.6.20-rc5

2007-01-23 Thread Björn Steinbrink
On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> Larry Walton wrote:
> >The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
> >seems to have fix the problem.  Much appreciated, 
> >thank you. I'd consider it a must have in 2.6.20.
> 
> Can any of the rest of you that have been seeing this problem also 
> confirm that this fixes it?

Seems to work for me, uptime is about an hour now and no exception yet.
Had the stress test running for only about 10 minutes, but I usually got
an exception within an hour even during plain irssi usage, so I'm quite
confident that the patch fixes it.

Thanks,
Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-23 Thread Robert Hancock

Larry Walton wrote:
The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
seems to have fix the problem.  Much appreciated, 
thank you. I'd consider it a must have in 2.6.20.


Can any of the rest of you that have been seeing this problem also 
confirm that this fixes it?


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-23 Thread Larry Walton
The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
seems to have fix the problem.  Much appreciated, 
thank you. I'd consider it a must have in 2.6.20.


-- 
*--* Mail: [EMAIL PROTECTED]
*--* Voice: 206.892.6269
*--* Cell: 206.225.0154
*--* HTTP://real.com
--
- - - - - - - R e a l - - - - - - - -



signature.asc
Description: Digital signature


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock

Björn Steinbrink wrote:

Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
without a single problem and that should still look at
NV_INT_STATUS_CK804, right?
I just noticed that my last email might not have been clear enough. The
exceptions happened when I re-enabled the return statement in addition
to the debug message. Without the INT_DEV check, it is completely fine
AFAICT.


Indeed, it seems to be just the NV_INT_DEV check that is problematic. 
Here's a patch that's likely better to test, it forces the NV_INT_DEV 
flag on when a command is active, and also fixes that questionable code 
in nv_host_intr that I mentioned.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c  2007-01-19 19:18:53.0 
-0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-22 22:33:43.0 
-0600
@@ -700,7 +700,6 @@ static void nv_adma_check_cpb(struct ata
 static int nv_host_intr(struct ata_port *ap, u8 irq_stat)
 {
struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
-   int handled;
 
/* freeze if hotplugged */
if (unlikely(irq_stat & (NV_INT_ADDED | NV_INT_REMOVED))) {
@@ -719,13 +718,7 @@ static int nv_host_intr(struct ata_port 
}
 
/* handle interrupt */
-   handled = ata_host_intr(ap, qc);
-   if (unlikely(!handled)) {
-   /* spurious, clear it */
-   ata_check_status(ap);
-   }
-
-   return 1;
+   return ata_host_intr(ap, qc);
 }
 
 static irqreturn_t nv_adma_interrupt(int irq, void *dev_instance)
@@ -752,6 +745,11 @@ static irqreturn_t nv_adma_interrupt(int
if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
u8 irq_stat = readb(host->mmio_base + 
NV_INT_STATUS_CK804)
>> (NV_INT_PORT_SHIFT * i);
+   if(ata_tag_valid(ap->active_tag))
+   /** NV_INT_DEV indication seems 
unreliable at times
+   at least in ADMA mode. Force it on 
always when a
+   command is active, to prevent 
losing interrupts. */
+   irq_stat |= NV_INT_DEV;
handled += nv_host_intr(ap, irq_stat);
continue;
}


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink
On 2007.01.22 19:24:22 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >>>Running a kernel with the return statement replace by a line that prints
> >>>the irq_stat instead.
> >>>
> >>>Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
> >>40 minutes stress test now and no exception yet. What's interesting is
> >>that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
> >>might have get dropped are as above.
> >>I'll keep it running for some time and will then re-enable the return
> >>statement to see if there's a relation between the irq_stat 0x0 and the
> >>exception.
> >
> >No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
> >0x0 for ata1. Syslog/dmesg has nothing new either, still the same
> >pattern of dismissed irq_stats.
> 
> I've finally managed to reproduce this problem on my box, by doing:
> 
> watch --interval=0.1 /sbin/hdparm -I /dev/sda
> 
> on one drive and then running bonnie++ on /dev/sdb connected to the 
> other port on the same controller device. Usually within a few minutes 
> one of the IDENTIFY commands would time out in the same way you guys 
> have been seeing.
> 
> Through some various trials and tribulations, the only conclusion I can 
> come to is that this controller really doesn't like that 
> NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried 
> adding some debug code to the qc_issue function that would check to see 
> if the BUSY flag in altstatus went high or that register showed an 
> interrupt within a certain time afterwards, however that really seemed 
> to hose things, the system wouldn't even boot.

Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
without a single problem and that should still look at
NV_INT_STATUS_CK804, right?
I just noticed that my last email might not have been clear enough. The
exceptions happened when I re-enabled the return statement in addition
to the debug message. Without the INT_DEV check, it is completely fine
AFAICT.

> Try out this patch, it just calls the ata_host_intr function where 
> appropriate without using nv_host_intr which looks at the 
> NV_INT_STATUS_CK804 register. This is what the original ADMA patch from 
> Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for 
> that. With this patch I can get through a whole bonnie++ run with the 
> repeated IDENTIFY requests running without seeing the error.

I'll see if I can schedule a test run for tomorrow, I currently need
this box.

Thanks,
Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock

Alistair John Strachan wrote:

On Tuesday 23 January 2007 01:24, Robert Hancock wrote:

As a final aside, this is another case where the hardware docs for this
controller would really be useful, in order to know whether we are
actually supposed to be reading that register in ADMA mode or not. I
sent a query to Allen Martin at NVIDIA asking if there's a way I could
get access to the documents, but I haven't heard anything yet.


Obviously, NVIDIA's response is disappointing, but thank you for putting the 
time in to debug this problem. Definitely sounds like a hardware defect, I'm 
just glad there's a workaround.


Will we see this fix in 2.6.20?


Hopefully, assuming it actually does fix the problem for those that have 
been seeing it..


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Alistair John Strachan
On Tuesday 23 January 2007 01:24, Robert Hancock wrote:
> As a final aside, this is another case where the hardware docs for this
> controller would really be useful, in order to know whether we are
> actually supposed to be reading that register in ADMA mode or not. I
> sent a query to Allen Martin at NVIDIA asking if there's a way I could
> get access to the documents, but I haven't heard anything yet.

Obviously, NVIDIA's response is disappointing, but thank you for putting the 
time in to debug this problem. Definitely sounds like a hardware defect, I'm 
just glad there's a workaround.

Will we see this fix in 2.6.20?

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock

Björn Steinbrink wrote:

Running a kernel with the return statement replace by a line that prints
the irq_stat instead.

Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.

40 minutes stress test now and no exception yet. What's interesting is
that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
might have get dropped are as above.
I'll keep it running for some time and will then re-enable the return
statement to see if there's a relation between the irq_stat 0x0 and the
exception.


No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
0x0 for ata1. Syslog/dmesg has nothing new either, still the same
pattern of dismissed irq_stats.


I've finally managed to reproduce this problem on my box, by doing:

watch --interval=0.1 /sbin/hdparm -I /dev/sda

on one drive and then running bonnie++ on /dev/sdb connected to the 
other port on the same controller device. Usually within a few minutes 
one of the IDENTIFY commands would time out in the same way you guys 
have been seeing.


Through some various trials and tribulations, the only conclusion I can 
come to is that this controller really doesn't like that 
NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried 
adding some debug code to the qc_issue function that would check to see 
if the BUSY flag in altstatus went high or that register showed an 
interrupt within a certain time afterwards, however that really seemed 
to hose things, the system wouldn't even boot.


Try out this patch, it just calls the ata_host_intr function where 
appropriate without using nv_host_intr which looks at the 
NV_INT_STATUS_CK804 register. This is what the original ADMA patch from 
Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for 
that. With this patch I can get through a whole bonnie++ run with the 
repeated IDENTIFY requests running without seeing the error.


As an aside, there seems to be some dubious code in nv_host_intr, if 
ata_host_intr returns 0 for handled when a command is outstanding, it 
goes and calls ata_check_status anyway. This is rather dangerous since 
if an interrupt showed up right after ata_host_intr but before 
ata_check_status, the ata_check_status would clear it and we would 
forget about it. I tried fixing just that issue and still had this 
problem however. I suspect that code is truly broken and needs further 
thought, but this patch avoids calling it in the ADMA case, at any rate.


As a final aside, this is another case where the hardware docs for this 
controller would really be useful, in order to know whether we are 
actually supposed to be reading that register in ADMA mode or not. I 
sent a query to Allen Martin at NVIDIA asking if there's a way I could 
get access to the documents, but I haven't heard anything yet.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c  2007-01-19 19:18:53.0 
-0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-22 18:35:09.0 
-0600
@@ -750,9 +750,9 @@ static irqreturn_t nv_adma_interrupt(int
 
/* if in ATA register mode, use standard ata interrupt 
handler */
if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
-   u8 irq_stat = readb(host->mmio_base + 
NV_INT_STATUS_CK804)
-   >> (NV_INT_PORT_SHIFT * i);
-   handled += nv_host_intr(ap, irq_stat);
+   struct ata_queued_cmd *qc = ata_qc_from_tag(ap, 
ap->active_tag);
+   if(qc && !(qc->tf.flags & ATA_TFLAG_POLLING))
+   handled += ata_host_intr(ap, qc);
continue;
}
 


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Eric D. Mudama

On 1/15/07, Jeff Garzik <[EMAIL PROTECTED]> wrote:

Jens Axboe wrote:
> On Mon, Jan 15 2007, Jeff Garzik wrote:
>> Jens Axboe wrote:
>>> I'd be surprised if the device would not obey the 7 second timeout rule
>>> that seems to be set in stone and not allow more dirty in-drive cache
>>> than it could flush out in approximately that time.
>> AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other
>> commands...
>
> Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as
> it would pretty much guarentee lower latencies for random writes and
> write back caching. The concern is the barrier code, of course. I guess
> I should do some timings on potential worst case patterns some day. Alan
> may have done that sometime in the past, iirc.

FWIW:  According to the drive guys (Eric M, among others), FLUSH CACHE
will "probably" be under 30 seconds, but pathological cases might even
extend beyond that.

Definitely more than 7 seconds in less-than-pathological cases,
unfortunately...


The mentioned Maxtor model (6Yxxx) isn't susceptible to the
large-buffer long completion times, due to architectural differences
and availability of only small buffers.  Any "real" long-completion
flush on this device would, I believe, involve damage to the disk that
hinders the ability to seek, settle, or write.  (e.g. 30-second
flushes are easy to hit if you mount the disk on a shaker-table with
sufficient amplitude)

Later in the thread I think people have pretty much isolated it as not
the disk's problem, but just wanted to point this out.

I assume that large enough customers can buy enterprise-type command
completion ("all commands within X seconds") from most any disk
vendor.  However, these firmwares require much smarter or more active
drivers or block layers, to handle the higher error rate when the data
on the device is valid, but it will take longer than allowed by the
arbitrary enterprise rules.  Most customers who are buying this many
devices have software engineers customizing the drivers or disk
management applications to handle this differing behavior.

--eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink
On 2007.01.22 17:57:08 +0100, Björn Steinbrink wrote:
> On 2007.01.22 17:12:40 +0100, Björn Steinbrink wrote:
> > On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> > > Hmm, another miss, apparently.. Has anyone tried removing these lines
> > > >from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> > > 
> > > /* bail out if not our interrupt */
> > > if (!(irq_stat & NV_INT_DEV))
> > > return 0;
> > 
> > Running a kernel with the return statement replace by a line that prints
> > the irq_stat instead.
> > 
> > Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
> 
> 40 minutes stress test now and no exception yet. What's interesting is
> that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
> might have get dropped are as above.
> I'll keep it running for some time and will then re-enable the return
> statement to see if there's a relation between the irq_stat 0x0 and the
> exception.

No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
0x0 for ata1. Syslog/dmesg has nothing new either, still the same
pattern of dismissed irq_stats.

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink
On 2007.01.22 17:12:40 +0100, Björn Steinbrink wrote:
> On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > >>Björn Steinbrink wrote:
> > >>>All kernels were bad using that approach. So back to square 1. :/
> > >>>
> > >>>Björn
> > >>>
> > >>OK guys, here's a new patch to try against 2.6.20-rc5:
> > >>
> > >>Right now when switching between ADMA mode and legacy mode (i.e. when 
> > >>going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> > >>set the ADMA GO register bit appropriately and continue with no delay. 
> > >>It looks like in some cases the controller doesn't respond to this 
> > >>immediately, it takes some nanoseconds for the controller's status 
> > >>registers to reflect the change that was made. It's possible that if we 
> > >>were trying to issue commands during this time, the controller might not 
> > >>react properly. This patch adds some code to wait for the status 
> > >>register to change to the state we asked for before continuing.
> > >
> > >Just got two exceptions with your patch, none of the debug messages were
> > >issued.
> > >
> > >Björn
> > 
> > Hmm, another miss, apparently.. Has anyone tried removing these lines
> > >from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> > 
> > /* bail out if not our interrupt */
> > if (!(irq_stat & NV_INT_DEV))
> > return 0;
> 
> Running a kernel with the return statement replace by a line that prints
> the irq_stat instead.
> 
> Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.

40 minutes stress test now and no exception yet. What's interesting is
that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
might have get dropped are as above.
I'll keep it running for some time and will then re-enable the return
statement to see if there's a relation between the irq_stat 0x0 and the
exception.

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink
On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> >>Björn Steinbrink wrote:
> >>>All kernels were bad using that approach. So back to square 1. :/
> >>>
> >>>Björn
> >>>
> >>OK guys, here's a new patch to try against 2.6.20-rc5:
> >>
> >>Right now when switching between ADMA mode and legacy mode (i.e. when 
> >>going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> >>set the ADMA GO register bit appropriately and continue with no delay. 
> >>It looks like in some cases the controller doesn't respond to this 
> >>immediately, it takes some nanoseconds for the controller's status 
> >>registers to reflect the change that was made. It's possible that if we 
> >>were trying to issue commands during this time, the controller might not 
> >>react properly. This patch adds some code to wait for the status 
> >>register to change to the state we asked for before continuing.
> >
> >Just got two exceptions with your patch, none of the debug messages were
> >issued.
> >
> >Björn
> 
> Hmm, another miss, apparently.. Has anyone tried removing these lines
> >from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> 
> /* bail out if not our interrupt */
> if (!(irq_stat & NV_INT_DEV))
> return 0;

Running a kernel with the return statement replace by a line that prints
the irq_stat instead.

Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Chr
On Monday, 22. January 2007 03:39, Tejun Heo wrote:
> Hello,
> 
> Chr wrote:
> > Ok, you won't believe this... I opened my case and rewired my drives... 
> > And guess what, my second (aka the "good") HDD is now failing! 
> > I guess, my mainboard has a (but maybe two, or three :( ) "bad" 
> > sata-port(s)!  
> 
> Or, you have power related problem.  Try to rewire the power lines or 
> connect harddrives to a separate powersupply.  It's often useful to 
> change one component at a time and watch which change the problem 
> follows.  Anyways, you seem to be suffering transmission failures, not a 
> driver problem.
> 
> Thanks.
> 

Yes and no, it's probably not a power problem, I've tried another
PSU with the same result :( . Futhermore, the RAID0 setup makes
it impossible to try only one drive alone :(. 

Anyway,the WD2500KS is known to have some strange bugs in the FW.
e.g.: It reports 255°C right after a cold start. 
( http://www.bugtrack.almico.com/view.php?id=468 ).

Thanks,
Chr.
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Tejun Heo

Hello,

Chr wrote:
Ok, you won't believe this... I opened my case and rewired my drives... 
And guess what, my second (aka the "good") HDD is now failing! 
I guess, my mainboard has a (but maybe two, or three :( ) "bad" sata-port(s)!  


Or, you have power related problem.  Try to rewire the power lines or 
connect harddrives to a separate powersupply.  It's often useful to 
change one component at a time and watch which change the problem 
follows.  Anyways, you seem to be suffering transmission failures, not a 
driver problem.


Thanks.

--
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Robert Hancock

Björn Steinbrink wrote:

On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:

Björn Steinbrink wrote:

All kernels were bad using that approach. So back to square 1. :/

Björn


OK guys, here's a new patch to try against 2.6.20-rc5:

Right now when switching between ADMA mode and legacy mode (i.e. when 
going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
set the ADMA GO register bit appropriately and continue with no delay. 
It looks like in some cases the controller doesn't respond to this 
immediately, it takes some nanoseconds for the controller's status 
registers to reflect the change that was made. It's possible that if we 
were trying to issue commands during this time, the controller might not 
react properly. This patch adds some code to wait for the status 
register to change to the state we asked for before continuing.


Just got two exceptions with your patch, none of the debug messages were
issued.

Björn


Hmm, another miss, apparently.. Has anyone tried removing these lines
from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?

/* bail out if not our interrupt */
if (!(irq_stat & NV_INT_DEV))
return 0;

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Robert Hancock

Björn Steinbrink wrote:

On 2007.01.21 23:08:11 +0100, Björn Steinbrink wrote:

On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:

Björn Steinbrink wrote:

All kernels were bad using that approach. So back to square 1. :/

Björn


OK guys, here's a new patch to try against 2.6.20-rc5:

Right now when switching between ADMA mode and legacy mode (i.e. when 
going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
set the ADMA GO register bit appropriately and continue with no delay. 
It looks like in some cases the controller doesn't respond to this 
immediately, it takes some nanoseconds for the controller's status 
registers to reflect the change that was made. It's possible that if we 
were trying to issue commands during this time, the controller might not 
react properly. This patch adds some code to wait for the status 
register to change to the state we asked for before continuing.

I went for the "I feel lucky" route and did just add mmio reads after the
mmio writes, posting them. Rationale being that if it is a write posting
issue, the debug patch would/could actually hide it AFAICT.
It's the "I feel lucky" route, because my whole "knowledge" about mmio
and write posting originates from the few things I read up on when you
discovered the comment about write posting in the generic ata code.


Uhm, yeah, exception occured about the time that I hit "send".

Björn


Yeah, I don't think just adding reads to flush posted writes is enough 
here - it seems to need more delay than that, and it also wasn't always 
in the idle state even before we would write the register..


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >All kernels were bad using that approach. So back to square 1. :/
> >
> >Björn
> >
> 
> OK guys, here's a new patch to try against 2.6.20-rc5:
> 
> Right now when switching between ADMA mode and legacy mode (i.e. when 
> going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> set the ADMA GO register bit appropriately and continue with no delay. 
> It looks like in some cases the controller doesn't respond to this 
> immediately, it takes some nanoseconds for the controller's status 
> registers to reflect the change that was made. It's possible that if we 
> were trying to issue commands during this time, the controller might not 
> react properly. This patch adds some code to wait for the status 
> register to change to the state we asked for before continuing.

Just got two exceptions with your patch, none of the debug messages were
issued.

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 23:08:11 +0100, Björn Steinbrink wrote:
> On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >All kernels were bad using that approach. So back to square 1. :/
> > >
> > >Björn
> > >
> > 
> > OK guys, here's a new patch to try against 2.6.20-rc5:
> > 
> > Right now when switching between ADMA mode and legacy mode (i.e. when 
> > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> > set the ADMA GO register bit appropriately and continue with no delay. 
> > It looks like in some cases the controller doesn't respond to this 
> > immediately, it takes some nanoseconds for the controller's status 
> > registers to reflect the change that was made. It's possible that if we 
> > were trying to issue commands during this time, the controller might not 
> > react properly. This patch adds some code to wait for the status 
> > register to change to the state we asked for before continuing.
> 
> I went for the "I feel lucky" route and did just add mmio reads after the
> mmio writes, posting them. Rationale being that if it is a write posting
> issue, the debug patch would/could actually hide it AFAICT.
> It's the "I feel lucky" route, because my whole "knowledge" about mmio
> and write posting originates from the few things I read up on when you
> discovered the comment about write posting in the generic ata code.

Uhm, yeah, exception occured about the time that I hit "send".

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >All kernels were bad using that approach. So back to square 1. :/
> >
> >Björn
> >
> 
> OK guys, here's a new patch to try against 2.6.20-rc5:
> 
> Right now when switching between ADMA mode and legacy mode (i.e. when 
> going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> set the ADMA GO register bit appropriately and continue with no delay. 
> It looks like in some cases the controller doesn't respond to this 
> immediately, it takes some nanoseconds for the controller's status 
> registers to reflect the change that was made. It's possible that if we 
> were trying to issue commands during this time, the controller might not 
> react properly. This patch adds some code to wait for the status 
> register to change to the state we asked for before continuing.

I went for the "I feel lucky" route and did just add mmio reads after the
mmio writes, posting them. Rationale being that if it is a write posting
issue, the debug patch would/could actually hide it AFAICT.
It's the "I feel lucky" route, because my whole "knowledge" about mmio
and write posting originates from the few things I read up on when you
discovered the comment about write posting in the generic ata code.

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Chr
On Sunday, 21. January 2007 19:01, Björn Steinbrink wrote:
> On 2007.01.21 18:34:40 +0100, Chr wrote:
>
> I run those two in parallel:
> while /bin/true; do ls -lR / > /dev/null 2>&1; done
> while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done
>
> Not sure if running them in parallel is necessary, but I don't want to
> change the test setup ;) Takes between 1 and 40 minutes to trigger it.
> Most of the time it's around 15 minutes now, doing more random stuff in
> addition to that seems to trigger it even easier (like reading mail,
> rebuilding the kernel etc.).
>
> I'm down to 2 commits after 2.6.19 now, only bad kernels, so I tend to
> say that 2.6.19 with 2.6.20-rc5's sata_nv.c will also fail for me, but I
> thought I might finish bisection just to be sure.
>
> > But, this time it looks slightly different:
> > ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> > ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
> >
> > [Rest of the error message + SMART error snipped]
>
> I get the same exception every time, doesn't change for me. And neither
> do I get any SMART errors or something.
>
> Thanks,
> Björn

Ok, you won't believe this... I opened my case and rewired my drives... 
And guess what, my second (aka the "good") HDD is now failing! 
I guess, my mainboard has a (but maybe two, or three :( ) "bad" sata-port(s)!  

But, one small question remains: when I opened my case, I saw that my drivers
are pluged in SATA jack 1 and 2... The BIOS also says they're on 1 and 2.
Now, Linux says they're on port 3 & 4! 



it's always ata3.00!
"ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: tag 0 cmd 0xea Emask 0x4 stat 0x40 err 0x0 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133
ata3: EH complete
SCSI device sda: 490234752 512-byte hdwr sectors (251000 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back"


Thanks,
Chr.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Robert Hancock

Björn Steinbrink wrote:

All kernels were bad using that approach. So back to square 1. :/

Björn



OK guys, here's a new patch to try against 2.6.20-rc5:

Right now when switching between ADMA mode and legacy mode (i.e. when 
going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
set the ADMA GO register bit appropriately and continue with no delay. 
It looks like in some cases the controller doesn't respond to this 
immediately, it takes some nanoseconds for the controller's status 
registers to reflect the change that was made. It's possible that if we 
were trying to issue commands during this time, the controller might not 
react properly. This patch adds some code to wait for the status 
register to change to the state we asked for before continuing.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c  2007-01-19 19:18:53.0 
-0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-21 13:35:17.0 
-0600
@@ -509,14 +509,38 @@ static void nv_adma_register_mode(struct
 {
void __iomem *mmio = nv_adma_ctl_block(ap);
struct nv_adma_port_priv *pp = ap->private_data;
-   u16 tmp;
+   u16 tmp, status;
+   int count = 0;
 
if (pp->flags & NV_ADMA_PORT_REGISTER_MODE)
return;
 
+   status = readw(mmio + NV_ADMA_STAT);
+   while(!(status & NV_ADMA_STAT_IDLE) && count < 20) {
+   ndelay(50);
+   status = readw(mmio + NV_ADMA_STAT);
+   count++;
+   }
+   if(count == 20)
+   ata_port_printk(ap, KERN_WARNING,
+   "timeout waiting for ADMA IDLE, stat=0x%hx\n",
+   status);
+
tmp = readw(mmio + NV_ADMA_CTL);
writew(tmp & ~NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);
 
+   count = 0;
+   status = readw(mmio + NV_ADMA_STAT);
+   while(!(status & NV_ADMA_STAT_LEGACY) && count < 20) {
+   ndelay(50);
+   status = readw(mmio + NV_ADMA_STAT);
+   count++;
+   }
+   if(count == 20)
+   ata_port_printk(ap, KERN_WARNING,
+"timeout waiting for ADMA LEGACY, stat=0x%hx\n",
+status);
+
pp->flags |= NV_ADMA_PORT_REGISTER_MODE;
 }
 
@@ -524,7 +548,8 @@ static void nv_adma_mode(struct ata_port
 {
void __iomem *mmio = nv_adma_ctl_block(ap);
struct nv_adma_port_priv *pp = ap->private_data;
-   u16 tmp;
+   u16 tmp, status;
+   int count = 0;
 
if (!(pp->flags & NV_ADMA_PORT_REGISTER_MODE))
return;
@@ -534,6 +559,18 @@ static void nv_adma_mode(struct ata_port
tmp = readw(mmio + NV_ADMA_CTL);
writew(tmp | NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);
 
+   status = readw(mmio + NV_ADMA_STAT);
+   while(((status & NV_ADMA_STAT_LEGACY) ||
+ !(status & NV_ADMA_STAT_IDLE)) && count < 20) {
+   ndelay(50);
+   status = readw(mmio + NV_ADMA_STAT);
+   count++;
+   }
+   if(count == 20)
+   ata_port_printk(ap, KERN_WARNING,
+   "timeout waiting for ADMA LEGACY clear and IDLE, 
stat=0x%hx\n",
+   status);
+
pp->flags &= ~NV_ADMA_PORT_REGISTER_MODE;
 }
 


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 09:36:18 +0100, Björn Steinbrink wrote:
> On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> > >>Robert Hancock wrote:
> > >>>change in 2.6.20-rc is either causing or triggering this problem. It 
> > >>>would be useful if you could try git bisect between 2.6.19 and 
> > >>>2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that 
> > >>
> > >>Yes, 'git bisect' would be the next step in figuring out this puzzle.
> > >>
> > >>Anybody up for it?
> > >
> > >I'll go for it, but could I get an explanation how that could lead to a
> > >different result than my last bisection? I see the difference of keeping
> > >sata_nv.c but my brain can't wrap around it right now (woke up in the
> > >middle of the night and still not up to speed...).
> > 
> > Whatever the problem is, only seems to show up when ADMA is enabled, and 
> > so the patch that added ADMA support shows up as the culprit from your 
> > git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA 
> > support added in doesn't seem to have the problem, so presumably 
> > something else that changed in the 2.6.20-rc series is triggering it. 
> > Doing a bisect while keeping the driver code itself the same will 
> > hopefully identify what that change is..
> 
> Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
> 
> Up to now, I only got bad kernels, latest tested being:
> 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
> 
> Which, unless I missed a commit in the diff, only USB changes,
> continuing anyway.

All kernels were bad using that approach. So back to square 1. :/

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 18:34:40 +0100, Chr wrote:
> On Sunday, 21. January 2007 09:36, Björn Steinbrink wrote:
> > On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> >
> > Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
> >
> > Up to now, I only got bad kernels, latest tested being:
> > 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
> >
> > Which, unless I missed a commit in the diff, only USB changes,
> > continuing anyway.
> >
> > Just to make sure, here's my little helper for this bisect run, I hope
> > it does what you expected:
> >
> > #!/bin/bash
> > cp ../sata_nv.c.orig drivers/ata/sata_nv.c
> > git bisect good
> > cp drivers/ata/sata_nv.c ../sata_nv.c.orig
> > cp ../sata_nv.c drivers/ata/
> > make oldconfig
> > make -j4
> >
> > Where "../sata_nv.c" is the version from 2.6.20-rc5. The copying is done
> > to avoid conflicts and keep git happy. Of course there's also a version
> > for bad kernels ;) No idea, why I didn't make that an argument to the
> > script...
> >
> > Thanks,
> > Björn
> 
> Ar, 2.6.19 (with 2.6.20-rc5 adma stuff) is affected too (BTW, what do you 
> do to trigger the exceptions? Because, it takes hours to "reproduces" this
> silly *).

I run those two in parallel:
while /bin/true; do ls -lR / > /dev/null 2>&1; done
while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done

Not sure if running them in parallel is necessary, but I don't want to
change the test setup ;) Takes between 1 and 40 minutes to trigger it.
Most of the time it's around 15 minutes now, doing more random stuff in
addition to that seems to trigger it even easier (like reading mail,
rebuilding the kernel etc.).

I'm down to 2 commits after 2.6.19 now, only bad kernels, so I tend to
say that 2.6.19 with 2.6.20-rc5's sata_nv.c will also fail for me, but I
thought I might finish bisection just to be sure.

> But, this time it looks slightly different:
> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)

> [Rest of the error message + SMART error snipped]

I get the same exception every time, doesn't change for me. And neither
do I get any SMART errors or something.

Thanks,
Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Chr
On Sunday, 21. January 2007 09:36, Björn Steinbrink wrote:
> On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
>
> Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
>
> Up to now, I only got bad kernels, latest tested being:
> 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
>
> Which, unless I missed a commit in the diff, only USB changes,
> continuing anyway.
>
> Just to make sure, here's my little helper for this bisect run, I hope
> it does what you expected:
>
> #!/bin/bash
> cp ../sata_nv.c.orig drivers/ata/sata_nv.c
> git bisect good
> cp drivers/ata/sata_nv.c ../sata_nv.c.orig
> cp ../sata_nv.c drivers/ata/
> make oldconfig
> make -j4
>
> Where "../sata_nv.c" is the version from 2.6.20-rc5. The copying is done
> to avoid conflicts and keep git happy. Of course there's also a version
> for bad kernels ;) No idea, why I didn't make that an argument to the
> script...
>
> Thanks,
> Björn

Ar, 2.6.19 (with 2.6.20-rc5 adma stuff) is affected too (BTW, what do you 
do to trigger the exceptions? Because, it takes hours to "reproduces" this
silly *).

But, this time it looks slightly different:
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
!!!
ata3.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x1)
ata3.00: revalidation failed (errno=-5)
ata3: failed to recover some devices, retrying in 5 secs
!!!
ata3: hard resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133
ata3: EH complete
SCSI device sda: 488395055 512-byte hdwr sectors (250058 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back


Oh, and I got this nice SMART Error: 

ID# ATTRIBUTE_NAME  FLAGRAW VALUE
199 UDMA_CRC_Error_Count0x003e   ...  -   12

SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 5603 hours (233 days + 11 hours)
  When the command that caused the error occurred, the device was in an 
unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 3f 00 00 00 af

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --    
  91 00 3f 00 00 00 0f 00  05:30:59.655  INITIALIZE DEVICE PARAMETERS 
[OBS-6]
  ec 00 01 01 00 00 00 00  05:30:59.654  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00  05:30:56.191  IDENTIFY DEVICE
  ca 00 28 02 ee 9a 0c 00  05:30:56.190  WRITE DMA
  ca 00 10 e8 4c 10 0a 00  05:30:56.190  WRITE DMA


Maybe, it's really the HDD!

OT: "http://www.nvidia.com/object/680i_hotfix.html";  


Chr.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> >>Robert Hancock wrote:
> >>>change in 2.6.20-rc is either causing or triggering this problem. It 
> >>>would be useful if you could try git bisect between 2.6.19 and 
> >>>2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that 
> >>
> >>Yes, 'git bisect' would be the next step in figuring out this puzzle.
> >>
> >>Anybody up for it?
> >
> >I'll go for it, but could I get an explanation how that could lead to a
> >different result than my last bisection? I see the difference of keeping
> >sata_nv.c but my brain can't wrap around it right now (woke up in the
> >middle of the night and still not up to speed...).
> 
> Whatever the problem is, only seems to show up when ADMA is enabled, and 
> so the patch that added ADMA support shows up as the culprit from your 
> git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA 
> support added in doesn't seem to have the problem, so presumably 
> something else that changed in the 2.6.20-rc series is triggering it. 
> Doing a bisect while keeping the driver code itself the same will 
> hopefully identify what that change is..

Ah, right... sata_nv.c of course interacts with the outside world, d'oh!

Up to now, I only got bad kernels, latest tested being:
94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576

Which, unless I missed a commit in the diff, only USB changes,
continuing anyway.

Just to make sure, here's my little helper for this bisect run, I hope
it does what you expected:

#!/bin/bash
cp ../sata_nv.c.orig drivers/ata/sata_nv.c
git bisect good
cp drivers/ata/sata_nv.c ../sata_nv.c.orig
cp ../sata_nv.c drivers/ata/
make oldconfig
make -j4

Where "../sata_nv.c" is the version from 2.6.20-rc5. The copying is done
to avoid conflicts and keep git happy. Of course there's also a version
for bad kernels ;) No idea, why I didn't make that an argument to the
script...

Thanks,
Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Robert Hancock

Björn Steinbrink wrote:

On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:

Robert Hancock wrote:
change in 2.6.20-rc is either causing or triggering this problem. It 
would be useful if you could try git bisect between 2.6.19 and 
2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that 


Yes, 'git bisect' would be the next step in figuring out this puzzle.

Anybody up for it?


I'll go for it, but could I get an explanation how that could lead to a
different result than my last bisection? I see the difference of keeping
sata_nv.c but my brain can't wrap around it right now (woke up in the
middle of the night and still not up to speed...).


Whatever the problem is, only seems to show up when ADMA is enabled, and 
so the patch that added ADMA support shows up as the culprit from your 
git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA 
support added in doesn't seem to have the problem, so presumably 
something else that changed in the 2.6.20-rc series is triggering it. 
Doing a bisect while keeping the driver code itself the same will 
hopefully identify what that change is..

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Björn Steinbrink
On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> Robert Hancock wrote:
> >change in 2.6.20-rc is either causing or triggering this problem. It 
> >would be useful if you could try git bisect between 2.6.19 and 
> >2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that 
> 
> 
> Yes, 'git bisect' would be the next step in figuring out this puzzle.
> 
> Anybody up for it?

I'll go for it, but could I get an explanation how that could lead to a
different result than my last bisection? I see the difference of keeping
sata_nv.c but my brain can't wrap around it right now (woke up in the
middle of the night and still not up to speed...).

Thanks,
Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Jeff Garzik

Robert Hancock wrote:
change in 2.6.20-rc is either causing or triggering this problem. It 
would be useful if you could try git bisect between 2.6.19 and 
2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that 



Yes, 'git bisect' would be the next step in figuring out this puzzle.

Anybody up for it?

Jeff


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Robert Hancock

Chr wrote:

Could you (or anyone else) test what happens if you take the 2.6.20-rc5
version of sata_nv.c and try it on 2.6.19? That would tell us whether
it's this change or whether it's something else (i.e. in libata core).


Ok, did that! (got a fresh 2.6.19 tar ball, and used 2.6.20-rc5' sata_nv.c
with the oneliner in libata_sff.c)

And surprise after one hour uptime, there is not even one 
sata exceptions in dmesg! (I'll report back tomorrow...)


That is interesting, indeed.. If that holds up then I assume some other 
change in 2.6.20-rc is either causing or triggering this problem. It 
would be useful if you could try git bisect between 2.6.19 and 
2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that 
gives any indication. If not, just trying some of the different 
2.6.20-rcX versions may be useful.


Before that, though, can you try making this change I suggested below in 
2.6.20-rc5 and see if the problem still shows up?





Assuming that still doesn't work, can you then try removing these lines
from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?

/* bail out if not our interrupt */
if (!(irq_stat & NV_INT_DEV))
return 0;

as that's the difference I'm most suspicious of causing the problem.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Chr
On Saturday, 20. January 2007 20:59, you wrote:
> Ian Kumlien wrote:
> > Hi,
> >
> > I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and adama
> > enabled, to 2.6.20-rc5, which gave me problems almost instantly.
> >
> > I just thought that it might be interesting to know that it DID work
> > nicely.
> >
> > CC since i'm not on the ml
>
> (I'm ccing more of the people who reported this)
>
> Well that's interesting.. The only significant change that went into
> 2.6.20-rc5 in that driver that wasn't in that version you mentioned was
> this one:
>
> http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=com
>mit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861
>
> Could you (or anyone else) test what happens if you take the 2.6.20-rc5
> version of sata_nv.c and try it on 2.6.19? That would tell us whether
> it's this change or whether it's something else (i.e. in libata core).

Ok, did that! (got a fresh 2.6.19 tar ball, and used 2.6.20-rc5' sata_nv.c
with the oneliner in libata_sff.c)

And surprise after one hour uptime, there is not even one 
sata exceptions in dmesg! (I'll report back tomorrow...)

>
> Assuming that still doesn't work, can you then try removing these lines
> from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
>
>   /* bail out if not our interrupt */
>   if (!(irq_stat & NV_INT_DEV))
>   return 0;
>
> as that's the difference I'm most suspicious of causing the problem.


Linux version 2.6.19test ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 
(prerelease) (Debian 4.1.1-21)) #2 SMP PREEMPT Sat Jan 20 22:19:20 CET 2007
Command line: root=/dev/md1 ro
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009f800 (usable)
 BIOS-e820: 0009f800 - 000a (reserved)
 BIOS-e820: 000f - 0010 (reserved)
 BIOS-e820: 0010 - 7fff (usable)
 BIOS-e820: 7fff - 7fff3000 (ACPI NVS)
 BIOS-e820: 7fff3000 - 8000 (ACPI data)
 BIOS-e820: e000 - f000 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 524272) 1 entries of 256 used
end_pfn_map = 1048576
DMI 2.3 present.
ACPI: RSDP (v000 Nvidia) @ 0x000f7d30
ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 
0x7fff3040
ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 
0x7fff30c0
ACPI: SSDT (v001 PTLTD  POWERNOW 0x0001  LTP 0x0001) @ 
0x7fff9900
ACPI: SRAT (v001 AMDHAMMER   0x0001 AMD  0x0001) @ 
0x7fff9b40
ACPI: MCFG (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 
0x7fff9c40
ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 
0x7fff9840
ACPI: DSDT (v001 NVIDIA AWRDACPI 0x1000 MSFT 0x010e) @ 
0x
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 524272) 1 entries of 256 used
Zone PFN ranges:
  DMA 0 -> 4096
  DMA324096 ->  1048576
  Normal1048576 ->  1048576
early_node_map[2] active PFN ranges
0:0 ->  159
0:  256 ->   524272
On node 0 totalpages: 524175
  DMA zone: 56 pages used for memmap
  DMA zone: 10 pages reserved
  DMA zone: 3933 pages, LIFO batch:0
  DMA32 zone: 7111 pages used for memmap
  DMA32 zone: 513065 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
Nvidia board detected. Ignoring ACPI timer override.
If you got timer trouble try acpi_use_timer_override
ACPI: PM-Timer IO Port: 0x4008
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 2, address 0xfec0, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: BIOS IRQ0 pin2 override ignored.
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 14 global_irq 14 high edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 15 global_irq 15 high edge)
ACPI: IRQ9 used by override.
ACPI: IRQ14 used by override.
ACPI: IRQ15 used by override.
Setting APIC routing to physical flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 0009f000 - 000a
Nosave address range: 000a - 000f
Nosave address range: 000f - 0010
Allocating PCI resources starting at 8800 (gap: 8000:6000)
SMP: Allowing 2 CPUs, 0 hotplug CPUs
PERCPU: Allocating 32320 bytes of per cpu data
Built 1 zonelists.  Total pages: 516998
Kernel command line: root=/dev/md1 ro
Initializing CPU#0
PID hash table 

Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Ian Kumlien
On lör, 2007-01-20 at 21:43 +, Alistair John Strachan wrote:
> On Saturday 20 January 2007 19:59, Robert Hancock wrote:
> > Ian Kumlien wrote:
> > > Hi,
> > >
> > > I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and adama
> > > enabled, to 2.6.20-rc5, which gave me problems almost instantly.
> > >
> > > I just thought that it might be interesting to know that it DID work
> > > nicely.
> > >
> > > CC since i'm not on the ml
> >
> > (I'm ccing more of the people who reported this)
> >
> > Well that's interesting.. The only significant change that went into
> > 2.6.20-rc5 in that driver that wasn't in that version you mentioned was
> > this one:
> >
> > http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=com
> >mit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861
> >
> > Could you (or anyone else) test what happens if you take the 2.6.20-rc5
> > version of sata_nv.c and try it on 2.6.19? That would tell us whether
> > it's this change or whether it's something else (i.e. in libata core).
> 
> I'm still running an -rc5 kernel with ADMA switched off entirely and I can't 
> reproduce the problem. How is everybody else reproducing this?
> 
> I've been successful installing bonnie++, then going to a large XFS partition 
> and running "bonnie++ -u 1000:1000" and letting it run through, all defaults.
> 
> It doesn't cause the problem I was seeing in -rc5 with ADMA on, when I switch 
> ADMA off, so I think this is sufficient to fix it.

Eh? The whole point with that patch was to ADD ADMA support to sata_nv,
imho that is something we want to have and i have been running with ADMA
on on two computers since sata_nv-adma-ncq-v4 or 5 or so without
problems.

So, something has been introduced or been broken to cause this error,
wouldn't it be better to find the error introduced than to just totally
negate the patch in the first place?

I haven't had the energy to go trough the patch that was found as
causing the problem yet... I don't know if i even have all the info
needed to make any form of educated guess but i'll give it a try when i
have the energy.

I really home someone finds it before then =)

-- 
Ian Kumlien  -- http://pomac.netswarm.net


signature.asc
Description: This is a digitally signed message part


Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Alistair John Strachan
On Saturday 20 January 2007 19:59, Robert Hancock wrote:
> Ian Kumlien wrote:
> > Hi,
> >
> > I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and adama
> > enabled, to 2.6.20-rc5, which gave me problems almost instantly.
> >
> > I just thought that it might be interesting to know that it DID work
> > nicely.
> >
> > CC since i'm not on the ml
>
> (I'm ccing more of the people who reported this)
>
> Well that's interesting.. The only significant change that went into
> 2.6.20-rc5 in that driver that wasn't in that version you mentioned was
> this one:
>
> http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=com
>mit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861
>
> Could you (or anyone else) test what happens if you take the 2.6.20-rc5
> version of sata_nv.c and try it on 2.6.19? That would tell us whether
> it's this change or whether it's something else (i.e. in libata core).

I'm still running an -rc5 kernel with ADMA switched off entirely and I can't 
reproduce the problem. How is everybody else reproducing this?

I've been successful installing bonnie++, then going to a large XFS partition 
and running "bonnie++ -u 1000:1000" and letting it run through, all defaults.

It doesn't cause the problem I was seeing in -rc5 with ADMA on, when I switch 
ADMA off, so I think this is sufficient to fix it.

Others have reported differently. Did you guys do:

[EMAIL PROTECTED]:~$ cat /proc/cmdline
root=/dev/sda1 ro sata_nv.adma=0

Or something similar? This is how Jeff suggested disabling ADMA and indeed the 
messages about its use disappear from dmesg.

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Robert Hancock

Ian Kumlien wrote:

Hi,

I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and adama
enabled, to 2.6.20-rc5, which gave me problems almost instantly.

I just thought that it might be interesting to know that it DID work
nicely.

CC since i'm not on the ml



(I'm ccing more of the people who reported this)

Well that's interesting.. The only significant change that went into 
2.6.20-rc5 in that driver that wasn't in that version you mentioned was 
this one:


http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861

Could you (or anyone else) test what happens if you take the 2.6.20-rc5 
version of sata_nv.c and try it on 2.6.19? That would tell us whether 
it's this change or whether it's something else (i.e. in libata core).


Assuming that still doesn't work, can you then try removing these lines 
from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?


/* bail out if not our interrupt */
if (!(irq_stat & NV_INT_DEV))
return 0;

as that's the difference I'm most suspicious of causing the problem.

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Chr
On Saturday, 20. January 2007 03:41, Robert Hancock wrote:
> Alistair John Strachan wrote:
> > On Tuesday 16 January 2007 01:53, Jeff Garzik wrote:
> >> Robert Hancock wrote:
> >>> I'll try your stress test when I get a chance, but I doubt I'll run
> >>> into the same problem and I haven't seen any similar reports. Perhaps
> >>> it's some kind of wierd timing issue or incompatibility between the
> >>> controller and that drive when running in ADMA mode? I seem to remember
> >>> various reports of issues with certain Maxtor drives and some nForce
> >>> SATA controllers under Windows at least..
> >>
> >> Just to eliminate things, has disabling ADMA been attempted?
> >>
> >> It can be disabled using the sata_nv.adma module parameter.
> >
> > Setting this option fixes the problem for me. I suggest that ADMA
> > defaults off in 2.6.20, if there's still time to do that.
>
> Can you guys that are having this problem try the attached debug patch?
> It's possible it will fix the problem, as I'm trying a private
> exec_command implementation that flushes the write by reading a
> controller register instead of reading altstatus from the drive like the
> libata core code does.
>
> If the problem still happens, I also added some more debugging in to
> help figure out what is going on, so please post full dmesg.
>
> By the way, I assume that you guys are using reiserfs or xfs, as it
> appears no other file systems issue flush commands automatically. I had
> to test this by "echo 1 > delete" on the SCSI disk in sysfs, as I am
> using ext3.
Yes, I've some reiserfs partitions, but I don't think it's reiserfs fault ;). 
Here is the log. (I cut out some parts, because it was too big.) 


BTW: please CC, I'm not on the list!
18:17:29 sys kernel: Linux version 2.6.20-rc5 ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #2 SMP PREEMPT Sat 18:07:36 CET 2007
18:17:29 sys kernel: Command line: root=/dev/md1 ro 
18:17:29 sys kernel: BIOS-provided physical RAM map:
18:17:29 sys kernel: BIOS-e820:  - 0009f800 (usable)
18:17:29 sys kernel: BIOS-e820: 0009f800 - 000a (reserved)
18:17:29 sys kernel: BIOS-e820: 000f - 0010 (reserved)
18:17:29 sys kernel: BIOS-e820: 0010 - 7fff (usable)
18:17:29 sys kernel: BIOS-e820: 7fff - 7fff3000 (ACPI NVS)
18:17:29 sys kernel: BIOS-e820: 7fff3000 - 8000 (ACPI data)
18:17:29 sys kernel: BIOS-e820: e000 - f000 (reserved)
18:17:29 sys kernel: BIOS-e820: fec0 - 0001 (reserved)
18:17:29 sys kernel: Entering add_active_range(0, 0, 159) 0 entries of 256 used
18:17:29 sys kernel: Entering add_active_range(0, 256, 524272) 1 entries of 256 used
18:17:29 sys kernel: end_pfn_map = 1048576
18:17:29 sys kernel: DMI 2.3 present.
18:17:29 sys kernel: ACPI: RSDP (v000 Nvidia) @ 0x000f7d30
18:17:29 sys kernel: ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff3040
18:17:29 sys kernel: ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff30c0
18:17:29 sys kernel: ACPI: SSDT (v001 PTLTD  POWERNOW 0x0001  LTP 0x0001) @ 0x7fff9900
18:17:29 sys kernel: ACPI: SRAT (v001 AMDHAMMER   0x0001 AMD  0x0001) @ 0x7fff9b40
18:17:29 sys kernel: ACPI: MCFG (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff9c40
18:17:29 sys kernel: ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff9840
18:17:29 sys kernel: ACPI: DSDT (v001 NVIDIA AWRDACPI 0x1000 MSFT 0x010e) @ 0x
18:17:29 sys kernel: Entering add_active_range(0, 0, 159) 0 entries of 256 used
18:17:29 sys kernel: Entering add_active_range(0, 256, 524272) 1 entries of 256 used
18:17:29 sys kernel: Zone PFN ranges:
18:17:29 sys kernel: DMA 0 -> 4096
18:17:29 sys kernel: DMA324096 ->  1048576
18:17:29 sys kernel: Normal1048576 ->  1048576
18:17:29 sys kernel: early_node_map[2] active PFN ranges
18:17:29 sys kernel: 0:0 ->  159
18:17:29 sys kernel: 0:  256 ->   524272
18:17:29 sys kernel: On node 0 totalpages: 524175
18:17:29 sys kernel: DMA zone: 56 pages used for memmap
18:17:29 sys kernel: DMA zone: 10 pages reserved
18:17:29 sys kernel: DMA zone: 3933 pages, LIFO batch:0
18:17:29 sys kernel: DMA32 zone: 7111 pages used for memmap
18:17:29 sys kernel: DMA32 zone: 513065 pages, LIFO batch:31
18:17:29 sys kernel: Normal zone: 0 pages used for memmap
18:17:29 sys kernel: Nvidia board detected. Ignoring ACPI timer override.
18:17:29 sys kernel: If you got timer trouble try acpi_use_timer_override
18:17:29 sys kernel: ACPI: PM-Timer IO Port: 0x4008
18:17:29 sys kernel: ACPI: Local APIC address 0xfee0
18:17:29 sys kernel: ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
18:17:29 sys kernel: Processor #0 (Bootup-CPU)
18:17:29 sys kernel: ACPI: LAPI

Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Ian Kumlien
Hi,

I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and adama
enabled, to 2.6.20-rc5, which gave me problems almost instantly.

I just thought that it might be interesting to know that it DID work
nicely.

CC since i'm not on the ml

-- 
Ian Kumlien  -- http://pomac.netswarm.net


signature.asc
Description: This is a digitally signed message part


Re: SATA exceptions with 2.6.20-rc5

2007-01-19 Thread Björn Steinbrink
On 2007.01.19 20:41:36 -0600, Robert Hancock wrote:
> Alistair John Strachan wrote:
> >On Tuesday 16 January 2007 01:53, Jeff Garzik wrote:
> >>Robert Hancock wrote:
> >>>I'll try your stress test when I get a chance, but I doubt I'll run into
> >>>the same problem and I haven't seen any similar reports. Perhaps it's
> >>>some kind of wierd timing issue or incompatibility between the
> >>>controller and that drive when running in ADMA mode? I seem to remember
> >>>various reports of issues with certain Maxtor drives and some nForce
> >>>SATA controllers under Windows at least..
> >>Just to eliminate things, has disabling ADMA been attempted?
> >>
> >>It can be disabled using the sata_nv.adma module parameter.
> >
> >Setting this option fixes the problem for me. I suggest that ADMA defaults 
> >off in 2.6.20, if there's still time to do that.
> >
> 
> Can you guys that are having this problem try the attached debug patch? 
> It's possible it will fix the problem, as I'm trying a private 
> exec_command implementation that flushes the write by reading a 
> controller register instead of reading altstatus from the drive like the 
> libata core code does.

Will give it a spin in about an hour.

> If the problem still happens, I also added some more debugging in to 
> help figure out what is going on, so please post full dmesg.
> 
> By the way, I assume that you guys are using reiserfs or xfs, as it 
> appears no other file systems issue flush commands automatically. I had 
> to test this by "echo 1 > delete" on the SCSI disk in sysfs, as I am 
> using ext3.

No, ext3 here, on top of md RAID1 and LVM. Oh, and one ext2, I wonder
where that comes from...

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-19 Thread Alistair John Strachan
On Saturday 20 January 2007 02:41, Robert Hancock wrote:
> By the way, I assume that you guys are using reiserfs or xfs, as it
> appears no other file systems issue flush commands automatically. I had
> to test this by "echo 1 > delete" on the SCSI disk in sysfs, as I am
> using ext3.

I'll give it a spin now, and yes I'm using several large XFS partitions on 
this machine, layered on top of md RAID5. That's why this particular defect 
is so catastrophic (literally _everything_ is stalled).

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-19 Thread Robert Hancock

Alistair John Strachan wrote:

On Tuesday 16 January 2007 01:53, Jeff Garzik wrote:

Robert Hancock wrote:

I'll try your stress test when I get a chance, but I doubt I'll run into
the same problem and I haven't seen any similar reports. Perhaps it's
some kind of wierd timing issue or incompatibility between the
controller and that drive when running in ADMA mode? I seem to remember
various reports of issues with certain Maxtor drives and some nForce
SATA controllers under Windows at least..

Just to eliminate things, has disabling ADMA been attempted?

It can be disabled using the sata_nv.adma module parameter.


Setting this option fixes the problem for me. I suggest that ADMA defaults off 
in 2.6.20, if there's still time to do that.




Can you guys that are having this problem try the attached debug patch? 
It's possible it will fix the problem, as I'm trying a private 
exec_command implementation that flushes the write by reading a 
controller register instead of reading altstatus from the drive like the 
libata core code does.


If the problem still happens, I also added some more debugging in to 
help figure out what is going on, so please post full dmesg.


By the way, I assume that you guys are using reiserfs or xfs, as it 
appears no other file systems issue flush commands automatically. I had 
to test this by "echo 1 > delete" on the SCSI disk in sysfs, as I am 
using ext3.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c  2007-01-19 19:18:53.0 
-0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-19 20:25:31.0 
-0600
@@ -245,6 +245,7 @@ static void nv_adma_bmdma_setup(struct a
 static void nv_adma_bmdma_start(struct ata_queued_cmd *qc);
 static void nv_adma_bmdma_stop(struct ata_queued_cmd *qc);
 static u8 nv_adma_bmdma_status(struct ata_port *ap);
+static void nv_adma_exec_command(struct ata_port *ap, const struct 
ata_taskfile *tf);
 
 enum nv_host_type
 {
@@ -409,7 +410,7 @@ static const struct ata_port_operations 
.tf_load= ata_tf_load,
.tf_read= ata_tf_read,
.check_atapi_dma= nv_adma_check_atapi_dma,
-   .exec_command   = ata_exec_command,
+   .exec_command   = nv_adma_exec_command,
.check_status   = ata_check_status,
.dev_select = ata_std_dev_select,
.bmdma_setup= nv_adma_bmdma_setup,
@@ -617,6 +618,14 @@ static int nv_adma_check_atapi_dma(struc
return !(pp->flags & NV_ADMA_ATAPI_SETUP_COMPLETE);
 }
 
+static void nv_adma_exec_command(struct ata_port *ap, const struct 
ata_taskfile *tf)
+{
+   void __iomem* mmio = nv_adma_ctl_block(ap);
+   writeb(tf->command, (void __iomem *) ap->ioaddr.command_addr);
+   readw(mmio + NV_ADMA_CTL); /* flush */
+   ndelay(400);
+}
+
 static unsigned int nv_adma_tf_to_cpb(struct ata_taskfile *tf, __le16 *cpb)
 {
unsigned int idx = 0;
@@ -701,6 +710,9 @@ static int nv_host_intr(struct ata_port 
 {
struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
int handled;
+   u8 cmd = 0;
+   if(qc)
+   cmd = qc->tf.command;
 
/* freeze if hotplugged */
if (unlikely(irq_stat & (NV_INT_ADDED | NV_INT_REMOVED))) {
@@ -709,8 +721,11 @@ static int nv_host_intr(struct ata_port 
}
 
/* bail out if not our interrupt */
-   if (!(irq_stat & NV_INT_DEV))
+   if (!(irq_stat & NV_INT_DEV)) {
+   if( cmd == ATA_CMD_FLUSH || cmd == ATA_CMD_FLUSH_EXT )
+   ata_port_printk(ap, KERN_NOTICE, "cmd 0x%x active but 
stat 0x%x\n", cmd, irq_stat);
return 0;
+   }
 
/* DEV interrupt w/ no active qc? */
if (unlikely(!qc || (qc->tf.flags & ATA_TFLAG_POLLING))) {
@@ -720,6 +735,8 @@ static int nv_host_intr(struct ata_port 
 
/* handle interrupt */
handled = ata_host_intr(ap, qc);
+   if( cmd == ATA_CMD_FLUSH || cmd == ATA_CMD_FLUSH_EXT )
+   ata_port_printk(ap, KERN_NOTICE, "cmd 0x%x active, stat = 0x%x, 
handled = 0x%x\n", cmd, irq_stat, handled);
if (unlikely(!handled)) {
/* spurious, clear it */
ata_check_status(ap);
@@ -870,7 +887,7 @@ static void nv_adma_bmdma_setup(struct a
outb(dmactl, ap->ioaddr.bmdma_addr + ATA_DMA_CMD);
 
/* issue r/w command */
-   ata_exec_command(ap, &qc->tf);
+   nv_adma_exec_command(ap, &qc->tf);
 }
 
 static void nv_adma_bmdma_start(struct ata_queued_cmd *qc)
@@ -1161,6 +1178,9 @@ static unsigned int nv_adma_qc_issue(str
/* use ATA register mode */
VPRINTK("no dmamap or ATAPI, using ATA register mode: 0x%lx\n", 
qc->flags);
nv_adma_register_mode(qc->ap);
+   if(qc->tf.command == ATA_CMD_FLUSH ||
+  qc->tf.c

Re: SATA exceptions with 2.6.20-rc5

2007-01-19 Thread chunkeey
On Friday, 19. January 2007 16:05, Alistair John Strachan wrote:
> On Tuesday 16 January 2007 01:53, Jeff Garzik wrote:
> > Robert Hancock wrote:
> > > I'll try your stress test when I get a chance, but I doubt I'll run
> > > into the same problem and I haven't seen any similar reports. Perhaps
> > > it's some kind of wierd timing issue or incompatibility between the
> > > controller and that drive when running in ADMA mode? I seem to remember
> > > various reports of issues with certain Maxtor drives and some nForce
> > > SATA controllers under Windows at least..
> >
> > Just to eliminate things, has disabling ADMA been attempted?
> >
> > It can be disabled using the sata_nv.adma module parameter.
>
> Setting this option fixes the problem for me. I suggest that ADMA defaults
> off in 2.6.20, if there's still time to do that.

Not for me.
I'm still have the same trouble, but less (maybe about every hour, instead of 
every 5 minutes). futhermore, I found a patch
cocktail-2.6.20-rc3.patch: http://tinyurl.com/2gza8q, which improves the 
situation too! 

Now, the funny thing is that I've two SATA HDDs, but only 1 causes all the
headaches.

The affected drive is a:
sda - @ata3.0 - WDC WD2500KS-00M 02.0
ATA-7, max UDMA/133, 488395055 sectors: LBA48

"ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 out
 res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133:PIO0
ata3: EH complete
SCSI device sda: 488395055 512-byte hdwr sectors (250058 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00"

the "good" HDD is a:
sdb - @ata4.0 - WDC WD2500YD-01N 10.0
ATA-7, max UDMA/133, 490234752 sectors: LBA48 NCQ (depth 0/1)

System:
AMD64 4200+ 
nForce 4 SLI
2 GB
SMP PREEMPT kernel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-19 Thread Alistair John Strachan
On Tuesday 16 January 2007 01:53, Jeff Garzik wrote:
> Robert Hancock wrote:
> > I'll try your stress test when I get a chance, but I doubt I'll run into
> > the same problem and I haven't seen any similar reports. Perhaps it's
> > some kind of wierd timing issue or incompatibility between the
> > controller and that drive when running in ADMA mode? I seem to remember
> > various reports of issues with certain Maxtor drives and some nForce
> > SATA controllers under Windows at least..
>
> Just to eliminate things, has disabling ADMA been attempted?
>
> It can be disabled using the sata_nv.adma module parameter.

Setting this option fixes the problem for me. I suggest that ADMA defaults off 
in 2.6.20, if there's still time to do that.

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-19 Thread Alistair John Strachan
On Tuesday 16 January 2007 00:34, Robert Hancock wrote:
> I'll try your stress test when I get a chance, but I doubt I'll run into
> the same problem and I haven't seen any similar reports. Perhaps it's
> some kind of wierd timing issue or incompatibility between the
> controller and that drive when running in ADMA mode? I seem to remember
> various reports of issues with certain Maxtor drives and some nForce
> SATA controllers under Windows at least..

I have exactly the same problem on -rc5 and it causes all I/O to stall 
periodically if I do _anything_ I/O intensive.

On my box, I have 4 sata_nv handled SATA ports, with two pairs of different 
drives (two Maxtor, two WD) and it happens randomly on both. So it's 
absolutely nothing to do with the drive make/model.

I'll try Jeff's suggestion of disabling ADMA now, but I think something more 
radical than this workaround should make it into 2.6.20 final, otherwise a 
lot of people are going to have broken boxes.

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-18 Thread Björn Steinbrink
On 2007.01.18 18:09:50 -0600, Robert Hancock wrote:
> I heard from Larry Walton who was apparently seeing this problem as 
> well. He tried my recent "sata_nv: cleanup ADMA error handling v2" patch 
> and originally thought it fixed the problem, but it turned out to only 
> make it happen less often.
> 
> I wouldn't expect that patch to have an effect on this problem. If it 
> seems to reduce the frequency that would tend to be further evidence of 
>  some kind of timing-related issue where the code change just happens 
> to make a difference.
> 
> I'll see if I can come up with a debug patch for people having this 
> problem to try, which prints out when a flush command is issued and what 
> interrupts happen when a flush is pending.
> 
> There is one important difference between ADMA and non-ADMA mode for 
> non-DMA commands like flushes, which didn't come to mind before: ADMA 
> mode uses MMIO registers on the controller whereas non-ADMA mode uses 
> legacy IO registers. Posted write flushing is a concern with MMIO 
> registers but not with PIO, the libata core is supposed to handle this 
> but maybe it doesn't in some case(s). In fact, just looking at 
> libata-sff.c there's this comment on the ata_exec_command_mmio function:
> 
>  *  FIXME: missing write posting for 400nS delay enforcement
> 
> That seems a bit suspicious..

That would imply that disabling adma via a module parameter should make
the issue go away, right? I'll try to have a test run with adma disabled
over night then.

Thanks,
Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-18 Thread Robert Hancock
I heard from Larry Walton who was apparently seeing this problem as 
well. He tried my recent "sata_nv: cleanup ADMA error handling v2" patch 
and originally thought it fixed the problem, but it turned out to only 
make it happen less often.


I wouldn't expect that patch to have an effect on this problem. If it 
seems to reduce the frequency that would tend to be further evidence of 
 some kind of timing-related issue where the code change just happens 
to make a difference.


I'll see if I can come up with a debug patch for people having this 
problem to try, which prints out when a flush command is issued and what 
interrupts happen when a flush is pending.


There is one important difference between ADMA and non-ADMA mode for 
non-DMA commands like flushes, which didn't come to mind before: ADMA 
mode uses MMIO registers on the controller whereas non-ADMA mode uses 
legacy IO registers. Posted write flushing is a concern with MMIO 
registers but not with PIO, the libata core is supposed to handle this 
but maybe it doesn't in some case(s). In fact, just looking at 
libata-sff.c there's this comment on the ata_exec_command_mmio function:


 *  FIXME: missing write posting for 400nS delay enforcement

That seems a bit suspicious..

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Robert Hancock

Björn Steinbrink wrote:
It should be correct the way it is - that check is trying to prevent 
ATAPI commands from using DMA until the slave_config function has been 
called to set up the DMA parameters properly. When the 
NV_ADMA_ATAPI_SETUP_COMPLETE flag is not set, this returns 1 which 
disallows DMA transfers. Unless you were using an ATAPI (i.e. CD/DVD) 
device on the channel this wouldn't affect you anyway.


I wondered about it, because the flag is cleared when adma_enabled is 1,
which seems to be consistent with everything but nv_adma_check_atapi_dma.


When ADMA is enabled we can't use ATAPI at all (or so says NVidia 
anyway), so it has to be disabled when an ATAPI device is detected in 
slave_config. Since doing that implies using the legacy BMDMA engine 
with its greater restrictions, this is why we need to prevent DMA 
transfers from being attempted until those restrictions have been set 
properly. (Otherwise, the libata core will try to use PACKET commands on 
an ATAPI device with DMA enabled before slave_config is even called.)



Thus I thought that nv_adma_check_atapi_dma might be wrong, but maybe
setting/clearing the flag is wrong instead? *feels lost*


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Jeff Garzik

Robert Hancock wrote:
I'll try your stress test when I get a chance, but I doubt I'll run into 
the same problem and I haven't seen any similar reports. Perhaps it's 
some kind of wierd timing issue or incompatibility between the 
controller and that drive when running in ADMA mode? I seem to remember 
various reports of issues with certain Maxtor drives and some nForce 
SATA controllers under Windows at least..



Just to eliminate things, has disabling ADMA been attempted?

It can be disabled using the sata_nv.adma module parameter.

Jeff


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Jeff Garzik

Robert Hancock wrote:
Note that the ATA-7 spec for FLUSH CACHE says that "This command may 
take longer than 30 s to complete."


Yep...

Jeff


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Jeff Garzik

Jens Axboe wrote:

On Mon, Jan 15 2007, Jeff Garzik wrote:

Jens Axboe wrote:

I'd be surprised if the device would not obey the 7 second timeout rule
that seems to be set in stone and not allow more dirty in-drive cache
than it could flush out in approximately that time.
AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other 
commands...


Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as
it would pretty much guarentee lower latencies for random writes and
write back caching. The concern is the barrier code, of course. I guess
I should do some timings on potential worst case patterns some day. Alan
may have done that sometime in the past, iirc.


FWIW:  According to the drive guys (Eric M, among others), FLUSH CACHE 
will "probably" be under 30 seconds, but pathological cases might even 
extend beyond that.


Definitely more than 7 seconds in less-than-pathological cases, 
unfortunately...


The SCSI layer /should/ already take this (30 second timeout) into 
account, for SYNCHRONIZE CACHE (and thus FLUSH CACHE for libata) but I'm 
too slack to check at the moment.


Jeff



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Björn Steinbrink
On 2007.01.15 18:34:43 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >>My latest bisection attempt actually led to your sata_nv ADMA commit. [1]
> >>I've now backed out that patch from 2.6.20-rc5 and have my stress test
> >>running for 20 minutes now ("record" for a bad kernel surviving that
> >>test is about 40 minutes IIRC). I'll keep it running for at least 2 more
> >>hours.
> >
> >Yep, that one seems to be guilty. 2.6.20-rc5 with that commit backed out
> >survived about 3 hours of testing, while the average was around 5
> >minutes for a failure, sometimes even before I could log in.
> >I took a look at the patch, but I can't really tell anything.
> >nv_adma_check_atapi_dma somehow looks like it should not negate its
> >return value, so that it returns 0 (atapi dma available) when
> >adma_enable was 1. But I'm not exactly confident about that either ;)
> >Will it hurt if I try to remove the negation?
> 
> It should be correct the way it is - that check is trying to prevent 
> ATAPI commands from using DMA until the slave_config function has been 
> called to set up the DMA parameters properly. When the 
> NV_ADMA_ATAPI_SETUP_COMPLETE flag is not set, this returns 1 which 
> disallows DMA transfers. Unless you were using an ATAPI (i.e. CD/DVD) 
> device on the channel this wouldn't affect you anyway.

I wondered about it, because the flag is cleared when adma_enabled is 1,
which seems to be consistent with everything but nv_adma_check_atapi_dma.
Thus I thought that nv_adma_check_atapi_dma might be wrong, but maybe
setting/clearing the flag is wrong instead? *feels lost*

> I'll try your stress test when I get a chance, but I doubt I'll run into 
> the same problem and I haven't seen any similar reports. Perhaps it's 
> some kind of wierd timing issue or incompatibility between the 
> controller and that drive when running in ADMA mode? I seem to remember 
> various reports of issues with certain Maxtor drives and some nForce 
> SATA controllers under Windows at least..

I just checked Maxtor's knowledge base, that incompatibility does not
affect my drive.

Thanks,
Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Robert Hancock

Jens Axboe wrote:

On Mon, Jan 15 2007, Jeff Garzik wrote:

Jens Axboe wrote:

I'd be surprised if the device would not obey the 7 second timeout rule
that seems to be set in stone and not allow more dirty in-drive cache
than it could flush out in approximately that time.
AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other 
commands...


Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as
it would pretty much guarentee lower latencies for random writes and
write back caching. The concern is the barrier code, of course. I guess
I should do some timings on potential worst case patterns some day. Alan
may have done that sometime in the past, iirc.



Note that the ATA-7 spec for FLUSH CACHE says that "This command may 
take longer than 30 s to complete."


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Robert Hancock

Björn Steinbrink wrote:

My latest bisection attempt actually led to your sata_nv ADMA commit. [1]
I've now backed out that patch from 2.6.20-rc5 and have my stress test
running for 20 minutes now ("record" for a bad kernel surviving that
test is about 40 minutes IIRC). I'll keep it running for at least 2 more
hours.


Yep, that one seems to be guilty. 2.6.20-rc5 with that commit backed out
survived about 3 hours of testing, while the average was around 5
minutes for a failure, sometimes even before I could log in.
I took a look at the patch, but I can't really tell anything.
nv_adma_check_atapi_dma somehow looks like it should not negate its
return value, so that it returns 0 (atapi dma available) when
adma_enable was 1. But I'm not exactly confident about that either ;)
Will it hurt if I try to remove the negation?


It should be correct the way it is - that check is trying to prevent 
ATAPI commands from using DMA until the slave_config function has been 
called to set up the DMA parameters properly. When the 
NV_ADMA_ATAPI_SETUP_COMPLETE flag is not set, this returns 1 which 
disallows DMA transfers. Unless you were using an ATAPI (i.e. CD/DVD) 
device on the channel this wouldn't affect you anyway.


I'll try your stress test when I get a chance, but I doubt I'll run into 
the same problem and I haven't seen any similar reports. Perhaps it's 
some kind of wierd timing issue or incompatibility between the 
controller and that drive when running in ADMA mode? I seem to remember 
various reports of issues with certain Maxtor drives and some nForce 
SATA controllers under Windows at least..


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Jens Axboe
On Mon, Jan 15 2007, Jeff Garzik wrote:
> Jens Axboe wrote:
> >I'd be surprised if the device would not obey the 7 second timeout rule
> >that seems to be set in stone and not allow more dirty in-drive cache
> >than it could flush out in approximately that time.
> 
> AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other 
> commands...

Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as
it would pretty much guarentee lower latencies for random writes and
write back caching. The concern is the barrier code, of course. I guess
I should do some timings on potential worst case patterns some day. Alan
may have done that sometime in the past, iirc.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Björn Steinbrink
On 2007.01.15 22:17:24 +0100, Björn Steinbrink wrote:
> On 2007.01.14 17:43:53 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >Hi,
> > >
> > >with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite
> > >often, with 2.6.19 there are no such exceptions. dmesg and lspci -v
> > >output follows. In the meantime, I'll start bisecting.
> > 
> > ...
> > 
> > >ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> > >ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in
> > > res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> > >ata1: soft resetting port
> > >ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> > >ata1.00: configured for UDMA/133
> > >ata1: EH complete
> > >SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
> > >sda: Write Protect is off
> > >sda: Mode Sense: 00 3a 00 00
> > >SCSI device sda: write cache: enabled, read cache: enabled, doesn't 
> > >support DPO or FUA
> > 
> > Looks like all of these errors are from a FLUSH CACHE command and the 
> > drive is indicating that it is no longer busy, so presumably done. 
> > That's not a DMA-mapped command, so it wouldn't go through the ADMA 
> > machinery and I wouldn't have expected this to be handled any 
> > differently from before. Curious..
> 
> My latest bisection attempt actually led to your sata_nv ADMA commit. [1]
> I've now backed out that patch from 2.6.20-rc5 and have my stress test
> running for 20 minutes now ("record" for a bad kernel surviving that
> test is about 40 minutes IIRC). I'll keep it running for at least 2 more
> hours.

Yep, that one seems to be guilty. 2.6.20-rc5 with that commit backed out
survived about 3 hours of testing, while the average was around 5
minutes for a failure, sometimes even before I could log in.
I took a look at the patch, but I can't really tell anything.
nv_adma_check_atapi_dma somehow looks like it should not negate its
return value, so that it returns 0 (atapi dma available) when
adma_enable was 1. But I'm not exactly confident about that either ;)
Will it hurt if I try to remove the negation?

Thanks,
Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Björn Steinbrink
On 2007.01.14 17:43:53 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >Hi,
> >
> >with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite
> >often, with 2.6.19 there are no such exceptions. dmesg and lspci -v
> >output follows. In the meantime, I'll start bisecting.
> 
> ...
> 
> >ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> >ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in
> > res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> >ata1: soft resetting port
> >ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> >ata1.00: configured for UDMA/133
> >ata1: EH complete
> >SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
> >sda: Write Protect is off
> >sda: Mode Sense: 00 3a 00 00
> >SCSI device sda: write cache: enabled, read cache: enabled, doesn't 
> >support DPO or FUA
> 
> Looks like all of these errors are from a FLUSH CACHE command and the 
> drive is indicating that it is no longer busy, so presumably done. 
> That's not a DMA-mapped command, so it wouldn't go through the ADMA 
> machinery and I wouldn't have expected this to be handled any 
> differently from before. Curious..

My latest bisection attempt actually led to your sata_nv ADMA commit. [1]
I've now backed out that patch from 2.6.20-rc5 and have my stress test
running for 20 minutes now ("record" for a bad kernel surviving that
test is about 40 minutes IIRC). I'll keep it running for at least 2 more
hours.

The test is pretty simple:
while /bin/true; do ls -lR > /dev/null; done
while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done

running in parallel.

Björn

[1] 2dec7555e6bf2772749113ea0ad454fcdb8cf861
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Björn Steinbrink
On 2007.01.15 07:48:23 +0100, Mikael Pettersson wrote:
> Notice how the problems started exactly at the point the
> "NVRM" NVIDIA module (whatever it is) was loaded ...

That's not the reason. Yeah, I should not have sent a log of a run with
the nvidia module loaded, but the same thing happens without it. For the
bisection kernels I did not even build the nvidia module and did the
testing at the console.

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Jeff Garzik

Jens Axboe wrote:

I'd be surprised if the device would not obey the 7 second timeout rule
that seems to be set in stone and not allow more dirty in-drive cache
than it could flush out in approximately that time.


AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other 
commands...




And BUSY should also be set for that case, as Robert indicates.


Agreed.

Jeff



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Jeff Garzik

Mikael Pettersson wrote:

Notice how the problems started exactly at the point the
"NVRM" NVIDIA module (whatever it is) was loaded ...



Yes, that's a bit suspicious...

Jeff


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-14 Thread Mikael Pettersson
Björn Steinbrink writes:
 > Hi,
 > 
 > with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite
 > often, with 2.6.19 there are no such exceptions. dmesg and lspci -v
 > output follows. In the meantime, I'll start bisecting.
 > 
 > Thanks
 > Björn
 > 
 > 
 > Linux version 2.6.20-rc2 ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 
 > (prerelease) (Debian 4.1.1-21)) #4 SMP Sun Dec 31 12:54:22 CET 2006

[uneventful kernel log omitted]

 > sata_nv :00:07.0: Using ADMA mode
 > PCI: Setting latency timer of device :00:07.0 to 64
 > ata1: SATA max UDMA/133 cmd 0xC2004480 ctl 0xC20044A0 bmdma 
 > 0xD400 irq 23
 > ata2: SATA max UDMA/133 cmd 0xC2004580 ctl 0xC20045A0 bmdma 
 > 0xD408 irq 23
 > scsi0 : sata_nv
 > ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
 > ata1.00: ATA-7, max UDMA/133, 160086528 sectors: LBA 
 > ata1.00: ata1: dev 0 multi count 16
 > ata1.00: configured for UDMA/133
 > scsi1 : sata_nv
 > ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
 > ata2.00: ATA-7, max UDMA/133, 160086528 sectors: LBA 
 > ata2.00: ata2: dev 0 multi count 16
 > ata2.00: configured for UDMA/133
 > scsi 0:0:0:0: Direct-Access ATA  Maxtor 6Y080M0   YAR5 PQ: 0 ANSI: 5
 > ata1: bounce limit 0x, segment boundary 0x, hw segs 
 > 61
 > SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
 > sda: Write Protect is off
 > sda: Mode Sense: 00 3a 00 00
 > SCSI device sda: write cache: enabled, read cache: enabled, doesn't support 
 > DPO or FUA
 > SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
 > sda: Write Protect is off
 > sda: Mode Sense: 00 3a 00 00
 > SCSI device sda: write cache: enabled, read cache: enabled, doesn't support 
 > DPO or FUA
 >  sda: sda1 sda2 sda3
 > sd 0:0:0:0: Attached scsi disk sda
 > scsi 1:0:0:0: Direct-Access ATA  Maxtor 6Y080M0   YAR5 PQ: 0 ANSI: 5
 > ata2: bounce limit 0x, segment boundary 0x, hw segs 
 > 61
 > SCSI device sdb: 160086528 512-byte hdwr sectors (81964 MB)
 > sdb: Write Protect is off
 > sdb: Mode Sense: 00 3a 00 00
 > SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support 
 > DPO or FUA
 > SCSI device sdb: 160086528 512-byte hdwr sectors (81964 MB)
 > sdb: Write Protect is off
 > sdb: Mode Sense: 00 3a 00 00
 > SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support 
 > DPO or FUA
 >  sdb: sdb1 sdb2 sdb3
 > sd 1:0:0:0: Attached scsi disk sdb

Things are fine so far.

[more uneventful kernel log omitted]

 > NVRM: loading NVIDIA Linux x86_64 Kernel Module  1.0-9631  Thu Nov  9 
 > 17:35:27 PST 2006
 > ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
 > ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in
 >  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
 > ata1: soft resetting port
 > ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
 > ata1.00: configured for UDMA/133
 > ata1: EH complete
 > SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
 > sda: Write Protect is off
 > sda: Mode Sense: 00 3a 00 00
 > SCSI device sda: write cache: enabled, read cache: enabled, doesn't support 
 > DPO or FUA
 > ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
 > ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 out
 >  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

and then things start to break.

Notice how the problems started exactly at the point the
"NVRM" NVIDIA module (whatever it is) was loaded ...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-14 Thread Jens Axboe
On Sun, Jan 14 2007, Robert Hancock wrote:
> Jeff Garzik wrote:
> >>Looks like all of these errors are from a FLUSH CACHE command and the 
> >>drive is indicating that it is no longer busy, so presumably done. 
> >>That's not a DMA-mapped command, so it wouldn't go through the ADMA 
> >>machinery and I wouldn't have expected this to be handled any 
> >>differently from before. Curious..
> >
> >It's possible the flush-cache command takes longer than 30 seconds, if 
> >the cache is large, contents are discontiguous, etc.  It's a 
> >pathological case, but possible.
> >
> >Or maybe flush-cache doesn't get a 30 second timeout, and it should...? 
> > (thinking out loud)
> >
> >Jeff
> 
> If the flush was still in progress I would expect Busy to still be set, 
> however..

I'd be surprised if the device would not obey the 7 second timeout rule
that seems to be set in stone and not allow more dirty in-drive cache
than it could flush out in approximately that time.

And BUSY should also be set for that case, as Robert indicates.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-14 Thread Björn Steinbrink
On 2007.01.15 01:34:48 +0100, Björn Steinbrink wrote:
> On 2007.01.14 19:22:51 -0500, Jeff Garzik wrote:
> > Robert Hancock wrote:
> > >Björn Steinbrink wrote:
> > >>Hi,
> > >>
> > >>with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite
> > >>often, with 2.6.19 there are no such exceptions. dmesg and lspci -v
> > >>output follows. In the meantime, I'll start bisecting.
> > >
> > >...
> > >
> > >>ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> > >>ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in
> > >> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> > >>ata1: soft resetting port
> > >>ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> > >>ata1.00: configured for UDMA/133
> > >>ata1: EH complete
> > >>SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
> > >>sda: Write Protect is off
> > >>sda: Mode Sense: 00 3a 00 00
> > >>SCSI device sda: write cache: enabled, read cache: enabled, doesn't 
> > >>support DPO or FUA
> > >
> > >Looks like all of these errors are from a FLUSH CACHE command and the 
> > >drive is indicating that it is no longer busy, so presumably done. 
> > >That's not a DMA-mapped command, so it wouldn't go through the ADMA 
> > >machinery and I wouldn't have expected this to be handled any 
> > >differently from before. Curious..
> > 
> > It's possible the flush-cache command takes longer than 30 seconds, if 
> > the cache is large, contents are discontiguous, etc.  It's a 
> > pathological case, but possible.
> > 
> > Or maybe flush-cache doesn't get a 30 second timeout, and it should...? 
> >  (thinking out loud)
> 
> Bi-section led to commit 249e83fe839 which makes absolutely no sense to
> me, just in case that anyone sees any problem with that commit.
> I'll go and re-check a few of those commits that I marked as good.

Next round of bisecting led to another useless result, a) it was an
unrelated driver, b) the kernel I just marked as good after 20 minutes
of testing decided to fail when I hit reply... Guess that it was pure
luck that the kernel I marked as bad failed within 1-2 minutes.
I send the git bisect log with this mail, maybe at least the early good
kernel are really good and someone can make some sense out of it. At
least I ended up somewhere in a series of libata changes.

Thanks,
Björn

git-bisect start
# bad: [8a2d17a56a71c5c796b0a5378ee76a105f21fdd9] Linux 2.6.20-rc2
git-bisect bad 8a2d17a56a71c5c796b0a5378ee76a105f21fdd9
# good: [c3fe6924620fd733ffe8bc8a9da1e9cde08402b3] Linux 2.6.19
git-bisect good c3fe6924620fd733ffe8bc8a9da1e9cde08402b3
# bad: [2685b267bce34c9b66626cb11664509c32a761a5] Merge 
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
git-bisect bad 2685b267bce34c9b66626cb11664509c32a761a5
# good: [a985239bdf017e00e985c3a31149d6ae128fdc5f] [POWERPC] cell: spu 
management xmon routines
git-bisect good a985239bdf017e00e985c3a31149d6ae128fdc5f
# bad: [33f2ef89f8e181486b63fdbdc97c6afa6ca9f34b] mm: make compound page 
destructor handling explicit
git-bisect bad 33f2ef89f8e181486b63fdbdc97c6afa6ca9f34b
# bad: [651857a1ecaf97a8ad9d324dd2a61675c53e541e] Merge branch 'upstream-linus' 
of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2
git-bisect bad 651857a1ecaf97a8ad9d324dd2a61675c53e541e
# bad: [ff51a98799931256b555446b2f5675db08de6229] Merge branch 'upstream-linus' 
of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev
git-bisect bad ff51a98799931256b555446b2f5675db08de6229
# bad: [3ac551a6a63dcbc707348772a27bd7090b081524] [libata] pata_cs5535: fix 
build
git-bisect bad 3ac551a6a63dcbc707348772a27bd7090b081524
# good: [750426aa1ad1ddd1fa8bb4ed531a7956f3b9a27c] libata: cosmetic changes to 
sense generation functions
git-bisect good 750426aa1ad1ddd1fa8bb4ed531a7956f3b9a27c
# good: [62d64ae0ec76360736c9dc4ca2067ae8de0ba9f2] pata : more drivers that 
need only standard suspend and resume
git-bisect good 62d64ae0ec76360736c9dc4ca2067ae8de0ba9f2
# good: [6a36261e63770ab61422550b774fe949ccca5fa9] libata: fix READ CAPACITY 
simulation
git-bisect good 6a36261e63770ab61422550b774fe949ccca5fa9
# good: [2432697ba0ce312d60be5009ffe1fa054a761bb9] libata: implement 
ata_exec_internal_sg()
git-bisect good 2432697ba0ce312d60be5009ffe1fa054a761bb9
# good: [70e6ad0c6d1e6cb9ee3c036a85ca2561eb1fd766] libata: prepare 
ata_sg_clean() for invocation from EH
git-bisect good 70e6ad0c6d1e6cb9ee3c036a85ca2561eb1fd766
# good: [8e16f941226f15622fbbc416a1f3d8705001a191] ahci: do not powerdown 
during initialization
git-bisect good 8e16f941226f15622fbbc416a1f3d8705001a191
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-14 Thread Robert Hancock

Jeff Garzik wrote:
Looks like all of these errors are from a FLUSH CACHE command and the 
drive is indicating that it is no longer busy, so presumably done. 
That's not a DMA-mapped command, so it wouldn't go through the ADMA 
machinery and I wouldn't have expected this to be handled any 
differently from before. Curious..


It's possible the flush-cache command takes longer than 30 seconds, if 
the cache is large, contents are discontiguous, etc.  It's a 
pathological case, but possible.


Or maybe flush-cache doesn't get a 30 second timeout, and it should...? 
 (thinking out loud)


Jeff


If the flush was still in progress I would expect Busy to still be set, 
however..


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-14 Thread Björn Steinbrink
On 2007.01.14 19:22:51 -0500, Jeff Garzik wrote:
> Robert Hancock wrote:
> >Björn Steinbrink wrote:
> >>Hi,
> >>
> >>with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite
> >>often, with 2.6.19 there are no such exceptions. dmesg and lspci -v
> >>output follows. In the meantime, I'll start bisecting.
> >
> >...
> >
> >>ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> >>ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in
> >> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> >>ata1: soft resetting port
> >>ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> >>ata1.00: configured for UDMA/133
> >>ata1: EH complete
> >>SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
> >>sda: Write Protect is off
> >>sda: Mode Sense: 00 3a 00 00
> >>SCSI device sda: write cache: enabled, read cache: enabled, doesn't 
> >>support DPO or FUA
> >
> >Looks like all of these errors are from a FLUSH CACHE command and the 
> >drive is indicating that it is no longer busy, so presumably done. 
> >That's not a DMA-mapped command, so it wouldn't go through the ADMA 
> >machinery and I wouldn't have expected this to be handled any 
> >differently from before. Curious..
> 
> It's possible the flush-cache command takes longer than 30 seconds, if 
> the cache is large, contents are discontiguous, etc.  It's a 
> pathological case, but possible.
> 
> Or maybe flush-cache doesn't get a 30 second timeout, and it should...? 
>  (thinking out loud)

Bi-section led to commit 249e83fe839 which makes absolutely no sense to
me, just in case that anyone sees any problem with that commit.
I'll go and re-check a few of those commits that I marked as good.

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-14 Thread Jeff Garzik

Robert Hancock wrote:

Björn Steinbrink wrote:

Hi,

with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite
often, with 2.6.19 there are no such exceptions. dmesg and lspci -v
output follows. In the meantime, I'll start bisecting.


...


ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1: soft resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't 
support DPO or FUA


Looks like all of these errors are from a FLUSH CACHE command and the 
drive is indicating that it is no longer busy, so presumably done. 
That's not a DMA-mapped command, so it wouldn't go through the ADMA 
machinery and I wouldn't have expected this to be handled any 
differently from before. Curious..


It's possible the flush-cache command takes longer than 30 seconds, if 
the cache is large, contents are discontiguous, etc.  It's a 
pathological case, but possible.


Or maybe flush-cache doesn't get a 30 second timeout, and it should...? 
 (thinking out loud)


Jeff



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-14 Thread Robert Hancock

Björn Steinbrink wrote:

Hi,

with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite
often, with 2.6.19 there are no such exceptions. dmesg and lspci -v
output follows. In the meantime, I'll start bisecting.


...


ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1: soft resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO 
or FUA


Looks like all of these errors are from a FLUSH CACHE command and the 
drive is indicating that it is no longer busy, so presumably done. 
That's not a DMA-mapped command, so it wouldn't go through the ADMA 
machinery and I wouldn't have expected this to be handled any 
differently from before. Curious..


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


SATA exceptions with 2.6.20-rc5

2007-01-14 Thread Björn Steinbrink
Hi,

with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite
often, with 2.6.19 there are no such exceptions. dmesg and lspci -v
output follows. In the meantime, I'll start bisecting.

Thanks
Björn


Linux version 2.6.20-rc2 ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 
(prerelease) (Debian 4.1.1-21)) #4 SMP Sun Dec 31 12:54:22 CET 2006
Command line: root=/dev/md0 ro quiet 
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009f000 (usable)
 BIOS-e820: 0009f000 - 000a (reserved)
 BIOS-e820: 000f - 0010 (reserved)
 BIOS-e820: 0010 - 7fee (usable)
 BIOS-e820: 7fee - 7fee3000 (ACPI NVS)
 BIOS-e820: 7fee3000 - 7fef (ACPI data)
 BIOS-e820: 7fef - 7ff0 (reserved)
 BIOS-e820: e000 - f000 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 524000) 1 entries of 256 used
end_pfn_map = 1048576
DMI 2.2 present.
ACPI: RSDP (v000 Nvidia) @ 0x000f7a70
ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 
0x7fee3040
ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 
0x7fee30c0
ACPI: SSDT (v001 PTLTD  POWERNOW 0x0001  LTP 0x0001) @ 
0x7fee9540
ACPI: SRAT (v001 AMDHAMMER   0x0001 AMD  0x0001) @ 
0x7fee9780
ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 
0x7fee9480
ACPI: DSDT (v001 NVIDIA AWRDACPI 0x1000 MSFT 0x010e) @ 
0x
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 524000) 1 entries of 256 used
Zone PFN ranges:
  DMA 0 -> 4096
  DMA324096 ->  1048576
  Normal1048576 ->  1048576
early_node_map[2] active PFN ranges
0:0 ->  159
0:  256 ->   524000
On node 0 totalpages: 523903
  DMA zone: 56 pages used for memmap
  DMA zone: 1122 pages reserved
  DMA zone: 2821 pages, LIFO batch:0
  DMA32 zone: 7108 pages used for memmap
  DMA32 zone: 512796 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
Nvidia board detected. Ignoring ACPI timer override.
If you got timer trouble try acpi_use_timer_override
ACPI: PM-Timer IO Port: 0x1008
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 2, address 0xfec0, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 14 global_irq 14 high edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 15 global_irq 15 high edge)
ACPI: IRQ9 used by override.
ACPI: IRQ14 used by override.
ACPI: IRQ15 used by override.
Setting APIC routing to flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 0009f000 - 000a
Nosave address range: 000a - 000f
Nosave address range: 000f - 0010
Allocating PCI resources starting at 8000 (gap: 7ff0:6010)
PERCPU: Allocating 32000 bytes of per cpu data
Built 1 zonelists.  Total pages: 515617
Kernel command line: root=/dev/md0 ro quiet 
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Checking aperture...
CPU 0: aperture @ 5a7400 size 32 MB
Aperture too small (32 MB)
No AGP bridge found
Memory: 2059000k/2096000k available (2795k kernel code, 36392k reserved, 1089k 
data, 224k init)
Calibrating delay using timer specific routine.. 4422.42 BogoMIPS (lpj=22112114)
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
Freeing SMP alternatives: 28k freed
ACPI: Core revision 20060707
Using local APIC timer interrupts.
result 12558084
Detected 12.558 MHz APIC timer.
Booting processor 1/2 APIC 0x1
Initializing CPU#1
Calibrating delay using timer specific routine.. 4420.44 BogoMIPS (lpj=22102213)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ stepping 02
CPU 1: Syncing TSC to CPU 0.
CPU 1: synchronized TSC with CPU 0 (last diff 89 cycles, maxerr 393 cycles)
Brought up 2 CPUs
testing NMI watchdog ... OK.
Disabling vsyscall due to use of PM timer
time.c: Using 3.579545 MHz WALL PM GTOD PM timer.
time.c: Detec