Re: SATA exceptions with 2.6.20-rc5
On 2007.02.04 02:13:51 +0100, Björn Steinbrink wrote:
> On 2007.02.02 23:48:14 -0600, Robert Hancock wrote:
> > There's a patch in -mm (sata_nv-use-adma-for-nodata-commands.patch)
> > which should hopefully avoid this problem for the cache flush commands,
> > at least - can you try that one out? You'll have to apply the other
> > sata_nv patches in -mm first, i.e. this order:
> >
> > http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2.patch
> > http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2-cleanup.patch
> > http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-use-adma-for-nodata-commands.patch
>
> Got 2.6.20-rc7 with them applied now (the rejects seemed trivial enough
> for me to fix them). Let's see how that works out...

After about 1.5 days of uptime, an involuntary reboot and another 3 days
of uptime, no sign of an exception. No stress testing was done, but a
few disk intensive actions did happen, at least more than with that -rc6
that did throw an exception at me.

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: SATA exceptions with 2.6.20-rc5
On 2007.02.02 23:48:14 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >On 2007.01.24 01:39:23 +0100, Björn Steinbrink wrote:
> >>On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> >>>Larry Walton wrote:
> >>>>The last patch (sata_nv-force-int-dev-in-interrupt.patch)
> >>>>seems to have fix the problem. Much appreciated,
> >>>>thank you. I'd consider it a must have in 2.6.20.
> >>>Can any of the rest of you that have been seeing this problem also
> >>>confirm that this fixes it?
> >>Seems to work for me, uptime is about an hour now and no exception yet.
> >>Had the stress test running for only about 10 minutes, but I usually got
> >>an exception within an hour even during plain irssi usage, so I'm quite
> >>confident that the patch fixes it.
> >
> >Or maybe not :( Just got an exception on 2.6.20-rc6. Took 4 days of
> >uptime to trigger, so it's just a lot harder to trigger now.
>
> Same exception details as before?

Yes, exactly the same.

> There's a patch in -mm (sata_nv-use-adma-for-nodata-commands.patch)
> which should hopefully avoid this problem for the cache flush commands,
> at least - can you try that one out? You'll have to apply the other
> sata_nv patches in -mm first, i.e. this order:
>
> http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2.patch
> http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2-cleanup.patch
> http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-use-adma-for-nodata-commands.patch

Got 2.6.20-rc7 with them applied now (the rejects seemed trivial enough
for me to fix them). Let's see how that works out...

Björn
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
> On 2007.01.24 01:39:23 +0100, Björn Steinbrink wrote:
>> On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
>>> Larry Walton wrote:
>>>> The last patch (sata_nv-force-int-dev-in-interrupt.patch)
>>>> seems to have fix the problem. Much appreciated,
>>>> thank you. I'd consider it a must have in 2.6.20.
>>> Can any of the rest of you that have been seeing this problem also
>>> confirm that this fixes it?
>> Seems to work for me, uptime is about an hour now and no exception yet.
>> Had the stress test running for only about 10 minutes, but I usually got
>> an exception within an hour even during plain irssi usage, so I'm quite
>> confident that the patch fixes it.
>
> Or maybe not :( Just got an exception on 2.6.20-rc6. Took 4 days of
> uptime to trigger, so it's just a lot harder to trigger now.

Same exception details as before?

There's a patch in -mm (sata_nv-use-adma-for-nodata-commands.patch)
which should hopefully avoid this problem for the cache flush commands,
at least - can you try that one out? You'll have to apply the other
sata_nv patches in -mm first, i.e. this order:

http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2.patch
http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2-cleanup.patch
http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-use-adma-for-nodata-commands.patch

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.24 01:39:23 +0100, Björn Steinbrink wrote:
> On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> > Larry Walton wrote:
> > >The last patch (sata_nv-force-int-dev-in-interrupt.patch)
> > >seems to have fix the problem. Much appreciated,
> > >thank you. I'd consider it a must have in 2.6.20.
> >
> > Can any of the rest of you that have been seeing this problem also
> > confirm that this fixes it?
>
> Seems to work for me, uptime is about an hour now and no exception yet.
> Had the stress test running for only about 10 minutes, but I usually got
> an exception within an hour even during plain irssi usage, so I'm quite
> confident that the patch fixes it.

Or maybe not :( Just got an exception on 2.6.20-rc6. Took 4 days of
uptime to trigger, so it's just a lot harder to trigger now.

Björn
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.24 09:24:00 +0100, Ian Kumlien wrote:
> On Tue, 2007-01-23 at 17:18 -0600, Robert Hancock wrote:
> > Larry Walton wrote:
> > > The last patch (sata_nv-force-int-dev-in-interrupt.patch)
> > > seems to have fix the problem. Much appreciated,
> > > thank you. I'd consider it a must have in 2.6.20.
> >
> > Can any of the rest of you that have been seeing this problem also
> > confirm that this fixes it?
>
> I applied it yesterday and today my dmesg contains three:
> BUG: at mm/truncate.c:60 cancel_dirty_page()

David Chinner sent two patches regarding that bug yesterday.

http://lkml.org/lkml/2007/1/23/190
http://lkml.org/lkml/2007/1/23/192

Björn
Re: SATA exceptions with 2.6.20-rc5
On Tue, 2007-01-23 at 17:18 -0600, Robert Hancock wrote:
> Larry Walton wrote:
> > The last patch (sata_nv-force-int-dev-in-interrupt.patch)
> > seems to have fix the problem. Much appreciated,
> > thank you. I'd consider it a must have in 2.6.20.
>
> Can any of the rest of you that have been seeing this problem also
> confirm that this fixes it?

I applied it yesterday and today my dmesg contains three:
BUG: at mm/truncate.c:60 cancel_dirty_page()

Call Trace:
 [] cancel_dirty_page+0x43/0x71
 [] reiserfs_cut_from_item+0x5f8/0x61d
 [] find_get_page+0x21/0x47
 [] reiserfs_do_truncate+0x34d/0x495
 [] reiserfs_truncate_file+0x199/0x2aa
 [] reiserfs_file_release+0x261/0x281
 [] __fput+0xb1/0x17d
 [] filp_close+0x5d/0x65
 [] sys_close+0x8c/0xcf
 [] system_call+0x7e/0x83

Which never happened before...

I dunno if they are related though, but they weren't there before...
(It does fix the timeout problem)

-- 
Ian Kumlien -- http://pomac.netswarm.net
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> Larry Walton wrote:
> >The last patch (sata_nv-force-int-dev-in-interrupt.patch)
> >seems to have fix the problem. Much appreciated,
> >thank you. I'd consider it a must have in 2.6.20.
>
> Can any of the rest of you that have been seeing this problem also
> confirm that this fixes it?

Seems to work for me, uptime is about an hour now and no exception yet.
Had the stress test running for only about 10 minutes, but I usually got
an exception within an hour even during plain irssi usage, so I'm quite
confident that the patch fixes it.

Thanks,
Björn
Re: SATA exceptions with 2.6.20-rc5
Larry Walton wrote:
> The last patch (sata_nv-force-int-dev-in-interrupt.patch)
> seems to have fix the problem. Much appreciated,
> thank you. I'd consider it a must have in 2.6.20.

Can any of the rest of you that have been seeing this problem also
confirm that this fixes it?

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: SATA exceptions with 2.6.20-rc5
The last patch (sata_nv-force-int-dev-in-interrupt.patch)
seems to have fix the problem. Much appreciated,
thank you. I'd consider it a must have in 2.6.20.

-- 
*--* Mail: [EMAIL PROTECTED]
*--* Voice: 206.892.6269
*--* Cell: 206.225.0154
*--* HTTP://real.com
-- - - - - - - - R e a l - - - - - - - -
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
> Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
> I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
> without a single problem and that should still look at
> NV_INT_STATUS_CK804, right?
>
> I just noticed that my last email might not have been clear enough. The
> exceptions happened when I re-enabled the return statement in addition
> to the debug message. Without the INT_DEV check, it is completely fine
> AFAICT.

Indeed, it seems to be just the NV_INT_DEV check that is problematic.
Here's a patch that's likely better to test, it forces the NV_INT_DEV
flag on when a command is active, and also fixes that questionable code
in nv_host_intr that I mentioned.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c	2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c	2007-01-22 22:33:43.0 -0600
@@ -700,7 +700,6 @@ static void nv_adma_check_cpb(struct ata
 static int nv_host_intr(struct ata_port *ap, u8 irq_stat)
 {
 	struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
-	int handled;
 
 	/* freeze if hotplugged */
 	if (unlikely(irq_stat & (NV_INT_ADDED | NV_INT_REMOVED))) {
@@ -719,13 +718,7 @@ static int nv_host_intr(struct ata_port
 	}
 
 	/* handle interrupt */
-	handled = ata_host_intr(ap, qc);
-	if (unlikely(!handled)) {
-		/* spurious, clear it */
-		ata_check_status(ap);
-	}
-
-	return 1;
+	return ata_host_intr(ap, qc);
 }
 
 static irqreturn_t nv_adma_interrupt(int irq, void *dev_instance)
@@ -752,6 +745,11 @@ static irqreturn_t nv_adma_interrupt(int
 			if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
 				u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
 					>> (NV_INT_PORT_SHIFT * i);
+				if(ata_tag_valid(ap->active_tag))
+					/** NV_INT_DEV indication seems unreliable at times
+					    at least in ADMA mode. Force it on always when a
+					    command is active, to prevent losing interrupts. */
+					irq_stat |= NV_INT_DEV;
 				handled += nv_host_intr(ap, irq_stat);
 				continue;
 			}
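The effect of forcing NV_INT_DEV can be sketched outside the kernel. The following is a minimal user-space model, not driver code: `struct sketch_port`, `SKETCH_INT_DEV` and both helper functions are hypothetical names introduced here, and the bit value 0x01 is an assumption that should be checked against sata_nv.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit value, for illustration only; see sata_nv.c for NV_INT_DEV. */
#define SKETCH_INT_DEV 0x01u

/* Hypothetical stand-in for a port: only tracks an outstanding command. */
struct sketch_port {
	bool cmd_active;
};

/* Behaviour before the patch: trust the device-interrupt bit as reported. */
static bool should_handle_strict(const struct sketch_port *p, uint8_t irq_stat)
{
	(void)p; /* the strict check ignores port state */
	return (irq_stat & SKETCH_INT_DEV) != 0;
}

/* Behaviour modelled on the patch: force the bit while a command is active. */
static bool should_handle_forced(const struct sketch_port *p, uint8_t irq_stat)
{
	if (p->cmd_active)
		irq_stat |= SKETCH_INT_DEV;
	return (irq_stat & SKETCH_INT_DEV) != 0;
}
```

With a command outstanding and a status byte of 0x00, the strict check drops the interrupt while the forced check passes it on to the handler; with no command outstanding the two agree.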
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.22 19:24:22 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >>>Running a kernel with the return statement replace by a line that prints
> >>>the irq_stat instead.
> >>>
> >>>Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
> >>40 minutes stress test now and no exception yet. What's interesting is
> >>that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
> >>might have get dropped are as above.
> >>I'll keep it running for some time and will then re-enable the return
> >>statement to see if there's a relation between the irq_stat 0x0 and the
> >>exception.
> >
> >No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
> >0x0 for ata1. Syslog/dmesg has nothing new either, still the same
> >pattern of dismissed irq_stats.
>
> I've finally managed to reproduce this problem on my box, by doing:
>
> watch --interval=0.1 /sbin/hdparm -I /dev/sda
>
> on one drive and then running bonnie++ on /dev/sdb connected to the
> other port on the same controller device. Usually within a few minutes
> one of the IDENTIFY commands would time out in the same way you guys
> have been seeing.
>
> Through some various trials and tribulations, the only conclusion I can
> come to is that this controller really doesn't like that
> NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried
> adding some debug code to the qc_issue function that would check to see
> if the BUSY flag in altstatus went high or that register showed an
> interrupt within a certain time afterwards, however that really seemed
> to hose things, the system wouldn't even boot.

Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
without a single problem and that should still look at
NV_INT_STATUS_CK804, right?

I just noticed that my last email might not have been clear enough. The
exceptions happened when I re-enabled the return statement in addition
to the debug message. Without the INT_DEV check, it is completely fine
AFAICT.

> Try out this patch, it just calls the ata_host_intr function where
> appropriate without using nv_host_intr which looks at the
> NV_INT_STATUS_CK804 register. This is what the original ADMA patch from
> Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for
> that. With this patch I can get through a whole bonnie++ run with the
> repeated IDENTIFY requests running without seeing the error.

I'll see if I can schedule a test run for tomorrow, I currently need
this box.

Thanks,
Björn
Re: SATA exceptions with 2.6.20-rc5
Alistair John Strachan wrote:
> On Tuesday 23 January 2007 01:24, Robert Hancock wrote:
>> As a final aside, this is another case where the hardware docs for this
>> controller would really be useful, in order to know whether we are
>> actually supposed to be reading that register in ADMA mode or not. I
>> sent a query to Allen Martin at NVIDIA asking if there's a way I could
>> get access to the documents, but I haven't heard anything yet.
>
> Obviously, NVIDIA's response is disappointing, but thank you for putting the
> time in to debug this problem. Definitely sounds like a hardware defect, I'm
> just glad there's a workaround.
>
> Will we see this fix in 2.6.20?

Hopefully, assuming it actually does fix the problem for those that have
been seeing it..

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: SATA exceptions with 2.6.20-rc5
On Tuesday 23 January 2007 01:24, Robert Hancock wrote:
> As a final aside, this is another case where the hardware docs for this
> controller would really be useful, in order to know whether we are
> actually supposed to be reading that register in ADMA mode or not. I
> sent a query to Allen Martin at NVIDIA asking if there's a way I could
> get access to the documents, but I haven't heard anything yet.

Obviously, NVIDIA's response is disappointing, but thank you for putting the
time in to debug this problem. Definitely sounds like a hardware defect, I'm
just glad there's a workaround.

Will we see this fix in 2.6.20?

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
>>> Running a kernel with the return statement replace by a line that prints
>>> the irq_stat instead.
>>>
>>> Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
>> 40 minutes stress test now and no exception yet. What's interesting is
>> that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
>> might have get dropped are as above.
>> I'll keep it running for some time and will then re-enable the return
>> statement to see if there's a relation between the irq_stat 0x0 and the
>> exception.
>
> No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
> 0x0 for ata1. Syslog/dmesg has nothing new either, still the same
> pattern of dismissed irq_stats.

I've finally managed to reproduce this problem on my box, by doing:

watch --interval=0.1 /sbin/hdparm -I /dev/sda

on one drive and then running bonnie++ on /dev/sdb connected to the
other port on the same controller device. Usually within a few minutes
one of the IDENTIFY commands would time out in the same way you guys
have been seeing.

Through some various trials and tribulations, the only conclusion I can
come to is that this controller really doesn't like that
NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried
adding some debug code to the qc_issue function that would check to see
if the BUSY flag in altstatus went high or that register showed an
interrupt within a certain time afterwards, however that really seemed
to hose things, the system wouldn't even boot.

Try out this patch, it just calls the ata_host_intr function where
appropriate without using nv_host_intr which looks at the
NV_INT_STATUS_CK804 register. This is what the original ADMA patch from
Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for
that. With this patch I can get through a whole bonnie++ run with the
repeated IDENTIFY requests running without seeing the error.

As an aside, there seems to be some dubious code in nv_host_intr, if
ata_host_intr returns 0 for handled when a command is outstanding, it
goes and calls ata_check_status anyway. This is rather dangerous since
if an interrupt showed up right after ata_host_intr but before
ata_check_status, the ata_check_status would clear it and we would
forget about it. I tried fixing just that issue and still had this
problem however. I suspect that code is truly broken and needs further
thought, but this patch avoids calling it in the ADMA case, at any rate.

As a final aside, this is another case where the hardware docs for this
controller would really be useful, in order to know whether we are
actually supposed to be reading that register in ADMA mode or not. I
sent a query to Allen Martin at NVIDIA asking if there's a way I could
get access to the documents, but I haven't heard anything yet.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c	2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c	2007-01-22 18:35:09.0 -0600
@@ -750,9 +750,9 @@ static irqreturn_t nv_adma_interrupt(int
 
 			/* if in ATA register mode, use standard ata interrupt handler */
 			if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
-				u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
-					>> (NV_INT_PORT_SHIFT * i);
-				handled += nv_host_intr(ap, irq_stat);
+				struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
+				if(qc && !(qc->tf.flags & ATA_TFLAG_POLLING))
+					handled += ata_host_intr(ap, qc);
 				continue;
 			}
 
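The race described in the aside about nv_host_intr can be illustrated with a toy model. This is only a sketch under stated assumptions: `struct toy_dev` and every function below are hypothetical stand-ins, the "device" latches a single interrupt, and a status read is assumed to clear that latch, which is the side effect the message attributes to ata_check_status.

```c
#include <stdbool.h>

/* Toy device: one latched interrupt that a status read clears. */
struct toy_dev {
	bool irq_pending;
	int completions_seen;
};

/* Hypothetical stand-in for a status read with clear-on-read behaviour. */
static void toy_check_status(struct toy_dev *d)
{
	d->irq_pending = false;
}

/* Hypothetical stand-in for the generic handler: 1 if it consumed an irq. */
static int toy_host_intr(struct toy_dev *d)
{
	if (!d->irq_pending)
		return 0;
	d->irq_pending = false;
	d->completions_seen++;
	return 1;
}

/*
 * Pattern from the aside: when the handler reports "not handled", read the
 * status register anyway. irq_arrives_in_gap models an interrupt that fires
 * between the two calls.
 */
static int with_status_read(struct toy_dev *d, bool irq_arrives_in_gap)
{
	int handled = toy_host_intr(d);

	if (irq_arrives_in_gap)
		d->irq_pending = true;
	if (!handled)
		toy_check_status(d); /* discards the interrupt that just arrived */
	return 1;
}

/* Variant that skips the extra status read and reports the real result. */
static int without_status_read(struct toy_dev *d, bool irq_arrives_in_gap)
{
	int handled = toy_host_intr(d);

	if (irq_arrives_in_gap)
		d->irq_pending = true;
	return handled;
}
```

In the first variant the late interrupt is gone by the time the handler runs again, so the completion is never counted; in the second it is still latched and the next pass picks it up. Robert notes that fixing this alone did not cure the timeouts, so the model shows a separate hazard rather than the root cause.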
Re: SATA exceptions with 2.6.20-rc5
On 1/15/07, Jeff Garzik <[EMAIL PROTECTED]> wrote:
> Jens Axboe wrote:
> > On Mon, Jan 15 2007, Jeff Garzik wrote:
> >> Jens Axboe wrote:
> >>> I'd be surprised if the device would not obey the 7 second timeout rule
> >>> that seems to be set in stone and not allow more dirty in-drive cache
> >>> than it could flush out in approximately that time.
> >> AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other
> >> commands...
> >
> > Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as
> > it would pretty much guarantee lower latencies for random writes and
> > write back caching. The concern is the barrier code, of course. I guess
> > I should do some timings on potential worst case patterns some day. Alan
> > may have done that sometime in the past, iirc.
>
> FWIW: According to the drive guys (Eric M, among others), FLUSH CACHE
> will "probably" be under 30 seconds, but pathological cases might even
> extend beyond that. Definitely more than 7 seconds in
> less-than-pathological cases, unfortunately...

The mentioned Maxtor model (6Yxxx) isn't susceptible to the
large-buffer long completion times, due to architectural differences
and availability of only small buffers. Any "real" long-completion
flush on this device would, I believe, involve damage to the disk that
hinders the ability to seek, settle, or write. (e.g. 30-second flushes
are easy to hit if you mount the disk on a shaker-table with
sufficient amplitude)

Later in the thread I think people have pretty much isolated it as not
the disk's problem, but just wanted to point this out.

I assume that large enough customers can buy enterprise-type command
completion ("all commands within X seconds") from most any disk
vendor. However, these firmwares require much smarter or more active
drivers or block layers, to handle the higher error rate when the data
on the device is valid, but it will take longer than allowed by the
arbitrary enterprise rules. Most customers who are buying this many
devices have software engineers customizing the drivers or disk
management applications to handle this differing behavior.

--eric
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.22 17:57:08 +0100, Björn Steinbrink wrote:
> On 2007.01.22 17:12:40 +0100, Björn Steinbrink wrote:
> > On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> > > Hmm, another miss, apparently.. Has anyone tried removing these lines
> > > from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> > >
> > > /* bail out if not our interrupt */
> > > if (!(irq_stat & NV_INT_DEV))
> > > return 0;
> >
> > Running a kernel with the return statement replace by a line that prints
> > the irq_stat instead.
> >
> > Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
>
> 40 minutes stress test now and no exception yet. What's interesting is
> that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
> might have get dropped are as above.
> I'll keep it running for some time and will then re-enable the return
> statement to see if there's a relation between the irq_stat 0x0 and the
> exception.

No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
0x0 for ata1. Syslog/dmesg has nothing new either, still the same
pattern of dismissed irq_stats.

Björn
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.22 17:12:40 +0100, Björn Steinbrink wrote:
> On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > >>Björn Steinbrink wrote:
> > >>>All kernels were bad using that approach. So back to square 1. :/
> > >>>
> > >>>Björn
> > >>>
> > >>OK guys, here's a new patch to try against 2.6.20-rc5:
> > >>
> > >>Right now when switching between ADMA mode and legacy mode (i.e. when
> > >>going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> > >>set the ADMA GO register bit appropriately and continue with no delay.
> > >>It looks like in some cases the controller doesn't respond to this
> > >>immediately, it takes some nanoseconds for the controller's status
> > >>registers to reflect the change that was made. It's possible that if we
> > >>were trying to issue commands during this time, the controller might not
> > >>react properly. This patch adds some code to wait for the status
> > >>register to change to the state we asked for before continuing.
> > >
> > >Just got two exceptions with your patch, none of the debug messages were
> > >issued.
> > >
> > >Björn
> >
> > Hmm, another miss, apparently.. Has anyone tried removing these lines
> > from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> >
> > /* bail out if not our interrupt */
> > if (!(irq_stat & NV_INT_DEV))
> > return 0;
>
> Running a kernel with the return statement replace by a line that prints
> the irq_stat instead.
>
> Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.

40 minutes stress test now and no exception yet. What's interesting is
that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
might have get dropped are as above.
I'll keep it running for some time and will then re-enable the return
statement to see if there's a relation between the irq_stat 0x0 and the
exception.

Björn
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> >>Björn Steinbrink wrote:
> >>>All kernels were bad using that approach. So back to square 1. :/
> >>>
> >>>Björn
> >>>
> >>OK guys, here's a new patch to try against 2.6.20-rc5:
> >>
> >>Right now when switching between ADMA mode and legacy mode (i.e. when
> >>going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> >>set the ADMA GO register bit appropriately and continue with no delay.
> >>It looks like in some cases the controller doesn't respond to this
> >>immediately, it takes some nanoseconds for the controller's status
> >>registers to reflect the change that was made. It's possible that if we
> >>were trying to issue commands during this time, the controller might not
> >>react properly. This patch adds some code to wait for the status
> >>register to change to the state we asked for before continuing.
> >
> >Just got two exceptions with your patch, none of the debug messages were
> >issued.
> >
> >Björn
>
> Hmm, another miss, apparently.. Has anyone tried removing these lines
> from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
>
> /* bail out if not our interrupt */
> if (!(irq_stat & NV_INT_DEV))
> return 0;

Running a kernel with the return statement replace by a line that prints
the irq_stat instead.

Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.

Björn
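For readers decoding the irq_stat values reported above, here is a small user-space sketch of how a shared per-port status byte might be split up. The layout is an assumption made for illustration: four status bits per port with the device-interrupt bit lowest (`SK_INT_DEV` 0x01, `SK_INT_PORT_SHIFT` 4). Both macros and both helpers are hypothetical names; the real definitions live in sata_nv.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout, for illustration only: 4 bits per port, DEV bit lowest. */
#define SK_INT_DEV        0x01u
#define SK_INT_PORT_SHIFT 4

/* Shift the shared status byte so the given port's bits sit at the bottom. */
static uint8_t port_irq_stat(uint8_t raw, unsigned port)
{
	return (uint8_t)(raw >> (SK_INT_PORT_SHIFT * port));
}

/* The check quoted in the message: is the device-interrupt bit set? */
static bool dev_irq_set(uint8_t irq_stat)
{
	return (irq_stat & SK_INT_DEV) != 0;
}
```

Under these assumed values, a byte of 0x10 seen from port 0 has the device bit clear, while the same byte shifted for port 1 has it set. Whether that is the right reading of the 0x10 values Björn saw on ata1 depends on the real bit layout, which this thread does not spell out.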
Re: SATA exceptions with 2.6.20-rc5
On Monday, 22. January 2007 03:39, Tejun Heo wrote:
> Hello,
>
> Chr wrote:
> > Ok, you won't believe this... I opened my case and rewired my drives...
> > And guess what, my second (aka the "good") HDD is now failing!
> > I guess, my mainboard has a (but maybe two, or three :( ) "bad"
> > sata-port(s)!
>
> Or, you have power related problem. Try to rewire the power lines or
> connect harddrives to a separate powersupply. It's often useful to
> change one component at a time and watch which change the problem
> follows. Anyways, you seem to be suffering transmission failures, not a
> driver problem.
>
> Thanks.
>

Yes and no, it's probably not a power problem, I've tried another PSU with the
same result :( . Furthermore, the RAID0 setup makes it impossible to try only
one drive alone :(.

Anyway, the WD2500KS is known to have some strange bugs in the FW.
e.g.: It reports 255°C right after a cold start.
( http://www.bugtrack.almico.com/view.php?id=468 ).

Thanks,
Chr.
Re: SATA exceptions with 2.6.20-rc5
Hello,

Chr wrote:
> Ok, you won't believe this... I opened my case and rewired my drives...
> And guess what, my second (aka the "good") HDD is now failing!
> I guess, my mainboard has a (but maybe two, or three :( ) "bad"
> sata-port(s)!

Or, you have power related problem. Try to rewire the power lines or
connect harddrives to a separate powersupply. It's often useful to
change one component at a time and watch which change the problem
follows. Anyways, you seem to be suffering transmission failures, not a
driver problem.

Thanks.

-- 
tejun
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
> On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
>> Björn Steinbrink wrote:
>>> All kernels were bad using that approach. So back to square 1. :/
>>>
>>> Björn
>>>
>> OK guys, here's a new patch to try against 2.6.20-rc5:
>>
>> Right now when switching between ADMA mode and legacy mode (i.e. when
>> going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
>> set the ADMA GO register bit appropriately and continue with no delay.
>> It looks like in some cases the controller doesn't respond to this
>> immediately, it takes some nanoseconds for the controller's status
>> registers to reflect the change that was made. It's possible that if we
>> were trying to issue commands during this time, the controller might not
>> react properly. This patch adds some code to wait for the status
>> register to change to the state we asked for before continuing.
>
> Just got two exceptions with your patch, none of the debug messages were
> issued.
>
> Björn

Hmm, another miss, apparently.. Has anyone tried removing these lines
from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?

	/* bail out if not our interrupt */
	if (!(irq_stat & NV_INT_DEV))
		return 0;

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
> On 2007.01.21 23:08:11 +0100, Björn Steinbrink wrote:
> > On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > > Björn Steinbrink wrote:
> > > > All kernels were bad using that approach. So back to square 1. :/
> > > >
> > > > Björn
> > >
> > > OK guys, here's a new patch to try against 2.6.20-rc5:
> > >
> > > Right now when switching between ADMA mode and legacy mode (i.e. when
> > > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> > > set the ADMA GO register bit appropriately and continue with no delay.
> > > It looks like in some cases the controller doesn't respond to this
> > > immediately, it takes some nanoseconds for the controller's status
> > > registers to reflect the change that was made. It's possible that if we
> > > were trying to issue commands during this time, the controller might not
> > > react properly. This patch adds some code to wait for the status
> > > register to change to the state we asked for before continuing.
> >
> > I went for the "I feel lucky" route and did just add mmio reads after the
> > mmio writes, posting them. Rationale being that if it is a write posting
> > issue, the debug patch would/could actually hide it AFAICT.
> > It's the "I feel lucky" route, because my whole "knowledge" about mmio
> > and write posting originates from the few things I read up on when you
> > discovered the comment about write posting in the generic ata code.
>
> Uhm, yeah, exception occurred about the time that I hit "send".
>
> Björn

Yeah, I don't think just adding reads to flush posted writes is enough here - it seems to need more delay than that, and it also wasn't always in the idle state even before we would write the register..
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >All kernels were bad using that approach. So back to square 1. :/
> >
> >Björn
> >
>
> OK guys, here's a new patch to try against 2.6.20-rc5:
>
> Right now when switching between ADMA mode and legacy mode (i.e. when
> going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> set the ADMA GO register bit appropriately and continue with no delay.
> It looks like in some cases the controller doesn't respond to this
> immediately, it takes some nanoseconds for the controller's status
> registers to reflect the change that was made. It's possible that if we
> were trying to issue commands during this time, the controller might not
> react properly. This patch adds some code to wait for the status
> register to change to the state we asked for before continuing.

Just got two exceptions with your patch, none of the debug messages were issued.

Björn
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 23:08:11 +0100, Björn Steinbrink wrote:
> On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >All kernels were bad using that approach. So back to square 1. :/
> > >
> > >Björn
> > >
> >
> > OK guys, here's a new patch to try against 2.6.20-rc5:
> >
> > Right now when switching between ADMA mode and legacy mode (i.e. when
> > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> > set the ADMA GO register bit appropriately and continue with no delay.
> > It looks like in some cases the controller doesn't respond to this
> > immediately, it takes some nanoseconds for the controller's status
> > registers to reflect the change that was made. It's possible that if we
> > were trying to issue commands during this time, the controller might not
> > react properly. This patch adds some code to wait for the status
> > register to change to the state we asked for before continuing.
>
> I went for the "I feel lucky" route and did just add mmio reads after the
> mmio writes, posting them. Rationale being that if it is a write posting
> issue, the debug patch would/could actually hide it AFAICT.
> It's the "I feel lucky" route, because my whole "knowledge" about mmio
> and write posting originates from the few things I read up on when you
> discovered the comment about write posting in the generic ata code.

Uhm, yeah, exception occurred about the time that I hit "send".

Björn
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >All kernels were bad using that approach. So back to square 1. :/
> >
> >Björn
> >
>
> OK guys, here's a new patch to try against 2.6.20-rc5:
>
> Right now when switching between ADMA mode and legacy mode (i.e. when
> going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> set the ADMA GO register bit appropriately and continue with no delay.
> It looks like in some cases the controller doesn't respond to this
> immediately, it takes some nanoseconds for the controller's status
> registers to reflect the change that was made. It's possible that if we
> were trying to issue commands during this time, the controller might not
> react properly. This patch adds some code to wait for the status
> register to change to the state we asked for before continuing.

I went for the "I feel lucky" route and did just add mmio reads after the mmio writes, posting them. Rationale being that if it is a write posting issue, the debug patch would/could actually hide it AFAICT.

It's the "I feel lucky" route, because my whole "knowledge" about mmio and write posting originates from the few things I read up on when you discovered the comment about write posting in the generic ata code.

Björn
Re: SATA exceptions with 2.6.20-rc5
On Sunday, 21. January 2007 19:01, Björn Steinbrink wrote:
> On 2007.01.21 18:34:40 +0100, Chr wrote:
>
> I run those two in parallel:
> while /bin/true; do ls -lR / > /dev/null 2>&1; done
> while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done
>
> Not sure if running them in parallel is necessary, but I don't want to
> change the test setup ;) Takes between 1 and 40 minutes to trigger it.
> Most of the time it's around 15 minutes now, doing more random stuff in
> addition to that seems to trigger it even easier (like reading mail,
> rebuilding the kernel etc.).
>
> I'm down to 2 commits after 2.6.19 now, only bad kernels, so I tend to
> say that 2.6.19 with 2.6.20-rc5's sata_nv.c will also fail for me, but I
> thought I might finish bisection just to be sure.
>
> > But, this time it looks slightly different:
> > ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> > ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
> >
> > [Rest of the error message + SMART error snipped]
>
> I get the same exception every time, doesn't change for me. And neither
> do I get any SMART errors or something.
>
> Thanks,
> Björn

Ok, you won't believe this... I opened my case and rewired my drives... And guess what, my second (aka the "good") HDD is now failing! I guess, my mainboard has a (but maybe two, or three :( ) "bad" sata-port(s)!

But, one small question remains: when I opened my case, I saw that my drives are plugged into SATA jacks 1 and 2... The BIOS also says they're on 1 and 2. Now, Linux says they're on ports 3 & 4! It's always ata3.00!

"ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: tag 0 cmd 0xea Emask 0x4 stat 0x40 err 0x0 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133
ata3: EH complete
SCSI device sda: 490234752 512-byte hdwr sectors (251000 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back"

Thanks,
Chr.
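The exception lines posted in this thread differ mainly in the "cmd" opcode (0xec in one report, 0xea in another). As a reading aid, here is a minimal sketch that pulls the timed-out opcode out of such log lines and names it. The function names `ata_cmd_name` and `decode_ata_timeouts` are made up for this illustration; the names for 0xec, 0xca and 0x91 match the SMART log quoted elsewhere in the thread, while "FLUSH CACHE EXT" for 0xea is taken from the ATA command set rather than from any message here.

```shell
#!/bin/sh
# Hypothetical helper: map an ATA command opcode, as printed after "cmd"
# in a libata exception line, to its name. Only opcodes that show up in
# this thread are listed; anything else is reported as "unknown".
ata_cmd_name() {
    case "$1" in
        0x91) echo "INITIALIZE DEVICE PARAMETERS" ;;
        0xca) echo "WRITE DMA" ;;
        0xea) echo "FLUSH CACHE EXT" ;;
        0xec) echo "IDENTIFY DEVICE" ;;
        *)    echo "unknown" ;;
    esac
}

# Hypothetical helper: read kernel log text on stdin and print
# "<device> <opcode> <name>" for every "... cmd 0x.. ... (timeout)" line.
decode_ata_timeouts() {
    sed -n 's/^.*\(ata[0-9][0-9.]*\): tag [0-9][0-9]* cmd \(0x[0-9a-f][0-9a-f]*\) .*(timeout).*$/\1 \2/p' |
    while read -r dev op; do
        echo "$dev $op $(ata_cmd_name "$op")"
    done
}
```

Typical use would be `dmesg | decode_ata_timeouts`. The pattern is only written against the line format seen in this thread, so other kernel versions may need a different expression.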
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
> All kernels were bad using that approach. So back to square 1. :/
>
> Björn

OK guys, here's a new patch to try against 2.6.20-rc5:

Right now when switching between ADMA mode and legacy mode (i.e. when going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just set the ADMA GO register bit appropriately and continue with no delay. It looks like in some cases the controller doesn't respond to this immediately, it takes some nanoseconds for the controller's status registers to reflect the change that was made. It's possible that if we were trying to issue commands during this time, the controller might not react properly. This patch adds some code to wait for the status register to change to the state we asked for before continuing.

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c	2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c	2007-01-21 13:35:17.0 -0600
@@ -509,14 +509,38 @@ static void nv_adma_register_mode(struct
 {
 	void __iomem *mmio = nv_adma_ctl_block(ap);
 	struct nv_adma_port_priv *pp = ap->private_data;
-	u16 tmp;
+	u16 tmp, status;
+	int count = 0;

 	if (pp->flags & NV_ADMA_PORT_REGISTER_MODE)
 		return;

+	status = readw(mmio + NV_ADMA_STAT);
+	while(!(status & NV_ADMA_STAT_IDLE) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA IDLE, stat=0x%hx\n",
+			status);
+
 	tmp = readw(mmio + NV_ADMA_CTL);
 	writew(tmp & ~NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);

+	count = 0;
+	status = readw(mmio + NV_ADMA_STAT);
+	while(!(status & NV_ADMA_STAT_LEGACY) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA LEGACY, stat=0x%hx\n",
+			status);
+
 	pp->flags |= NV_ADMA_PORT_REGISTER_MODE;
 }

@@ -524,7 +548,8 @@ static void nv_adma_mode(struct ata_port
 {
 	void __iomem *mmio = nv_adma_ctl_block(ap);
 	struct nv_adma_port_priv *pp = ap->private_data;
-	u16 tmp;
+	u16 tmp, status;
+	int count = 0;

 	if (!(pp->flags & NV_ADMA_PORT_REGISTER_MODE))
 		return;
@@ -534,6 +559,18 @@ static void nv_adma_mode(struct ata_port
 	tmp = readw(mmio + NV_ADMA_CTL);
 	writew(tmp | NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);

+	status = readw(mmio + NV_ADMA_STAT);
+	while(((status & NV_ADMA_STAT_LEGACY) ||
+	      !(status & NV_ADMA_STAT_IDLE)) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA LEGACY clear and IDLE, stat=0x%hx\n",
+			status);
+
 	pp->flags &= ~NV_ADMA_PORT_REGISTER_MODE;
 }
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 09:36:18 +0100, Björn Steinbrink wrote:
> On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> > >>Robert Hancock wrote:
> > >>>change in 2.6.20-rc is either causing or triggering this problem. It
> > >>>would be useful if you could try git bisect between 2.6.19 and
> > >>>2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that
> > >>
> > >>Yes, 'git bisect' would be the next step in figuring out this puzzle.
> > >>
> > >>Anybody up for it?
> > >
> > >I'll go for it, but could I get an explanation how that could lead to a
> > >different result than my last bisection? I see the difference of keeping
> > >sata_nv.c but my brain can't wrap around it right now (woke up in the
> > >middle of the night and still not up to speed...).
> >
> > Whatever the problem is, only seems to show up when ADMA is enabled, and
> > so the patch that added ADMA support shows up as the culprit from your
> > git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA
> > support added in doesn't seem to have the problem, so presumably
> > something else that changed in the 2.6.20-rc series is triggering it.
> > Doing a bisect while keeping the driver code itself the same will
> > hopefully identify what that change is..
>
> Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
>
> Up to now, I only got bad kernels, latest tested being:
> 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
>
> Which, unless I missed a commit in the diff, only USB changes,
> continuing anyway.

All kernels were bad using that approach. So back to square 1. :/

Björn
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 18:34:40 +0100, Chr wrote:
> On Sunday, 21. January 2007 09:36, Björn Steinbrink wrote:
> > On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> >
> > Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
> >
> > Up to now, I only got bad kernels, latest tested being:
> > 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
> >
> > Which, unless I missed a commit in the diff, only USB changes,
> > continuing anyway.
> >
> > Just to make sure, here's my little helper for this bisect run, I hope
> > it does what you expected:
> >
> > #!/bin/bash
> > cp ../sata_nv.c.orig drivers/ata/sata_nv.c
> > git bisect good
> > cp drivers/ata/sata_nv.c ../sata_nv.c.orig
> > cp ../sata_nv.c drivers/ata/
> > make oldconfig
> > make -j4
> >
> > Where "../sata_nv.c" is the version from 2.6.20-rc5. The copying is done
> > to avoid conflicts and keep git happy. Of course there's also a version
> > for bad kernels ;) No idea, why I didn't make that an argument to the
> > script...
> >
> > Thanks,
> > Björn
>
> Ar, 2.6.19 (with 2.6.20-rc5 adma stuff) is affected too (BTW, what do you
> do to trigger the exceptions? Because, it takes hours to "reproduces" this
> silly *).

I run those two in parallel:
while /bin/true; do ls -lR / > /dev/null 2>&1; done
while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done

Not sure if running them in parallel is necessary, but I don't want to change the test setup ;) Takes between 1 and 40 minutes to trigger it. Most of the time it's around 15 minutes now, doing more random stuff in addition to that seems to trigger it even easier (like reading mail, rebuilding the kernel etc.).

I'm down to 2 commits after 2.6.19 now, only bad kernels, so I tend to say that 2.6.19 with 2.6.20-rc5's sata_nv.c will also fail for me, but I thought I might finish bisection just to be sure.

> But, this time it looks slightly different:
> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
> [Rest of the error message + SMART error snipped]

I get the same exception every time, doesn't change for me. And neither do I get any SMART errors or something.

Thanks,
Björn
Re: SATA exceptions with 2.6.20-rc5
On Sunday, 21. January 2007 09:36, Björn Steinbrink wrote:
> On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
>
> Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
>
> Up to now, I only got bad kernels, latest tested being:
> 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
>
> Which, unless I missed a commit in the diff, only USB changes,
> continuing anyway.
>
> Just to make sure, here's my little helper for this bisect run, I hope
> it does what you expected:
>
> #!/bin/bash
> cp ../sata_nv.c.orig drivers/ata/sata_nv.c
> git bisect good
> cp drivers/ata/sata_nv.c ../sata_nv.c.orig
> cp ../sata_nv.c drivers/ata/
> make oldconfig
> make -j4
>
> Where "../sata_nv.c" is the version from 2.6.20-rc5. The copying is done
> to avoid conflicts and keep git happy. Of course there's also a version
> for bad kernels ;) No idea, why I didn't make that an argument to the
> script...
>
> Thanks,
> Björn

Ar, 2.6.19 (with 2.6.20-rc5 adma stuff) is affected too (BTW, what do you do to trigger the exceptions? Because, it takes hours to "reproduces" this silly *).

But, this time it looks slightly different:

ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
!!!
ata3.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x1)
ata3.00: revalidation failed (errno=-5)
ata3: failed to recover some devices, retrying in 5 secs
!!!
ata3: hard resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133
ata3: EH complete
SCSI device sda: 488395055 512-byte hdwr sectors (250058 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back

Oh, and I got this nice SMART Error:

ID# ATTRIBUTE_NAME          FLAG    RAW VALUE
199 UDMA_CRC_Error_Count    0x003e  ... - 12

SMART Error Log Version: 1
ATA Error Count: 1
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 5603 hours (233 days + 11 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 3f 00 00 00 af

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --
  91 00 3f 00 00 00 0f 00   05:30:59.655     INITIALIZE DEVICE PARAMETERS [OBS-6]
  ec 00 01 01 00 00 00 00   05:30:59.654     IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00   05:30:56.191     IDENTIFY DEVICE
  ca 00 28 02 ee 9a 0c 00   05:30:56.190     WRITE DMA
  ca 00 10 e8 4c 10 0a 00   05:30:56.190     WRITE DMA

Maybe, it's really the HDD!

OT: "http://www.nvidia.com/object/680i_hotfix.html"

Chr.
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> >>Robert Hancock wrote:
> >>>change in 2.6.20-rc is either causing or triggering this problem. It
> >>>would be useful if you could try git bisect between 2.6.19 and
> >>>2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that
> >>
> >>Yes, 'git bisect' would be the next step in figuring out this puzzle.
> >>
> >>Anybody up for it?
> >
> >I'll go for it, but could I get an explanation how that could lead to a
> >different result than my last bisection? I see the difference of keeping
> >sata_nv.c but my brain can't wrap around it right now (woke up in the
> >middle of the night and still not up to speed...).
>
> Whatever the problem is, only seems to show up when ADMA is enabled, and
> so the patch that added ADMA support shows up as the culprit from your
> git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA
> support added in doesn't seem to have the problem, so presumably
> something else that changed in the 2.6.20-rc series is triggering it.
> Doing a bisect while keeping the driver code itself the same will
> hopefully identify what that change is..

Ah, right... sata_nv.c of course interacts with the outside world, d'oh!

Up to now, I only got bad kernels, latest tested being:
94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576

Which, unless I missed a commit in the diff, only USB changes, continuing anyway.

Just to make sure, here's my little helper for this bisect run, I hope it does what you expected:

#!/bin/bash
cp ../sata_nv.c.orig drivers/ata/sata_nv.c
git bisect good
cp drivers/ata/sata_nv.c ../sata_nv.c.orig
cp ../sata_nv.c drivers/ata/
make oldconfig
make -j4

Where "../sata_nv.c" is the version from 2.6.20-rc5. The copying is done to avoid conflicts and keep git happy. Of course there's also a version for bad kernels ;) No idea, why I didn't make that an argument to the script...

Thanks,
Björn
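The helper in this message exists in two copies, one for good and one for bad kernels. A possible single-script variant, with the verdict passed as an argument, could look like the sketch below. It keeps the same command order as the posted helper. The names `run` and `bisect_step` and the `DRY_RUN` switch are inventions for this illustration, and the file layout (`../sata_nv.c` holding the 2.6.20-rc5 driver, `../sata_nv.c.orig` holding the tree's own copy) is assumed to be the one described above.

```shell
#!/bin/sh
# Sketch only: bisect helper with the verdict ("good" or "bad") as argument.
# Assumed layout, as in the original helper: run from the top of the kernel
# tree, ../sata_nv.c is the 2.6.20-rc5 driver, ../sata_nv.c.orig is the
# driver that belongs to the currently checked-out commit.

# Print the command instead of running it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

bisect_step() {
    case "$1" in
        good|bad) ;;
        *) echo "usage: bisect_step good|bad" >&2; return 2 ;;
    esac
    # Restore the tree's own driver so git sees no local modification.
    run cp ../sata_nv.c.orig drivers/ata/sata_nv.c &&
    # Record the verdict; git checks out the next commit to test.
    run git bisect "$1" &&
    # Save the new commit's driver, then overlay the 2.6.20-rc5 one.
    run cp drivers/ata/sata_nv.c ../sata_nv.c.orig &&
    run cp ../sata_nv.c drivers/ata/ &&
    run make oldconfig &&
    run make -j4
}
```

After booting and testing a kernel, one would call `bisect_step good` or `bisect_step bad`; `DRY_RUN=1 bisect_step bad` only lists the six commands. Chaining with `&&` stops the sequence at the first failing step, which the original script did not do.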
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
> On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> > Robert Hancock wrote:
> > > change in 2.6.20-rc is either causing or triggering this problem. It
> > > would be useful if you could try git bisect between 2.6.19 and
> > > 2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that
> >
> > Yes, 'git bisect' would be the next step in figuring out this puzzle.
> >
> > Anybody up for it?
>
> I'll go for it, but could I get an explanation how that could lead to a
> different result than my last bisection? I see the difference of keeping
> sata_nv.c but my brain can't wrap around it right now (woke up in the
> middle of the night and still not up to speed...).

Whatever the problem is, only seems to show up when ADMA is enabled, and so the patch that added ADMA support shows up as the culprit from your git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA support added in doesn't seem to have the problem, so presumably something else that changed in the 2.6.20-rc series is triggering it. Doing a bisect while keeping the driver code itself the same will hopefully identify what that change is..
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> Robert Hancock wrote:
> >change in 2.6.20-rc is either causing or triggering this problem. It
> >would be useful if you could try git bisect between 2.6.19 and
> >2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that
>
>
> Yes, 'git bisect' would be the next step in figuring out this puzzle.
>
> Anybody up for it?

I'll go for it, but could I get an explanation how that could lead to a different result than my last bisection? I see the difference of keeping sata_nv.c but my brain can't wrap around it right now (woke up in the middle of the night and still not up to speed...).

Thanks,
Björn
Re: SATA exceptions with 2.6.20-rc5
Robert Hancock wrote:
> change in 2.6.20-rc is either causing or triggering this problem. It
> would be useful if you could try git bisect between 2.6.19 and
> 2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that

Yes, 'git bisect' would be the next step in figuring out this puzzle.

Anybody up for it?

	Jeff
Re: SATA exceptions with 2.6.20-rc5
Chr wrote:
> > Could you (or anyone else) test what happens if you take the 2.6.20-rc5
> > version of sata_nv.c and try it on 2.6.19? That would tell us whether
> > it's this change or whether it's something else (i.e. in libata core).
>
> Ok, did that! (got a fresh 2.6.19 tar ball, and used 2.6.20-rc5's sata_nv.c
> with the oneliner in libata_sff.c)
>
> And surprise after one hour uptime, there is not even one SATA exception
> in dmesg! (I'll report back tomorrow...)

That is interesting, indeed.. If that holds up then I assume some other change in 2.6.20-rc is either causing or triggering this problem. It would be useful if you could try git bisect between 2.6.19 and 2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that gives any indication. If not, just trying some of the different 2.6.20-rcX versions may be useful.

Before that, though, can you try making this change I suggested below in 2.6.20-rc5 and see if the problem still shows up?

> > Assuming that still doesn't work, can you then try removing these lines
> > from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> >
> > 	/* bail out if not our interrupt */
> > 	if (!(irq_stat & NV_INT_DEV))
> > 		return 0;
> >
> > as that's the difference I'm most suspicious of causing the problem.
Re: SATA exceptions with 2.6.20-rc5
On Saturday, 20. January 2007 20:59, you wrote:
> Ian Kumlien wrote:
> > Hi,
> >
> > I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and ADMA
> > enabled, to 2.6.20-rc5, which gave me problems almost instantly.
> >
> > I just thought that it might be interesting to know that it DID work
> > nicely.
> >
> > CC since i'm not on the ml
>
> (I'm ccing more of the people who reported this)
>
> Well that's interesting.. The only significant change that went into
> 2.6.20-rc5 in that driver that wasn't in that version you mentioned was
> this one:
>
> http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861
>
> Could you (or anyone else) test what happens if you take the 2.6.20-rc5
> version of sata_nv.c and try it on 2.6.19? That would tell us whether
> it's this change or whether it's something else (i.e. in libata core).

Ok, did that! (got a fresh 2.6.19 tar ball, and used 2.6.20-rc5's sata_nv.c with the oneliner in libata_sff.c)

And surprise after one hour uptime, there is not even one SATA exception in dmesg! (I'll report back tomorrow...)

>
> Assuming that still doesn't work, can you then try removing these lines
> from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
>
> 	/* bail out if not our interrupt */
> 	if (!(irq_stat & NV_INT_DEV))
> 		return 0;
>
> as that's the difference I'm most suspicious of causing the problem.

Linux version 2.6.19test ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #2 SMP PREEMPT Sat Jan 20 22:19:20 CET 2007 Command line: root=/dev/md1 ro BIOS-provided physical RAM map: BIOS-e820: - 0009f800 (usable) BIOS-e820: 0009f800 - 000a (reserved) BIOS-e820: 000f - 0010 (reserved) BIOS-e820: 0010 - 7fff (usable) BIOS-e820: 7fff - 7fff3000 (ACPI NVS) BIOS-e820: 7fff3000 - 8000 (ACPI data) BIOS-e820: e000 - f000 (reserved) BIOS-e820: fec0 - 0001 (reserved) Entering add_active_range(0, 0, 159) 0 entries of 256 used Entering add_active_range(0, 256, 524272) 1 entries of 256 used end_pfn_map = 1048576 DMI 2.3 present. ACPI: RSDP (v000 Nvidia) @ 0x000f7d30 ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff3040 ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff30c0 ACPI: SSDT (v001 PTLTD POWERNOW 0x0001 LTP 0x0001) @ 0x7fff9900 ACPI: SRAT (v001 AMDHAMMER 0x0001 AMD 0x0001) @ 0x7fff9b40 ACPI: MCFG (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff9c40 ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff9840 ACPI: DSDT (v001 NVIDIA AWRDACPI 0x1000 MSFT 0x010e) @ 0x Entering add_active_range(0, 0, 159) 0 entries of 256 used Entering add_active_range(0, 256, 524272) 1 entries of 256 used Zone PFN ranges: DMA 0 -> 4096 DMA324096 -> 1048576 Normal1048576 -> 1048576 early_node_map[2] active PFN ranges 0:0 -> 159 0: 256 -> 524272 On node 0 totalpages: 524175 DMA zone: 56 pages used for memmap DMA zone: 10 pages reserved DMA zone: 3933 pages, LIFO batch:0 DMA32 zone: 7111 pages used for memmap DMA32 zone: 513065 pages, LIFO batch:31 Normal zone: 0 pages used for memmap Nvidia board detected. Ignoring ACPI timer override. 
If you got timer trouble try acpi_use_timer_override ACPI: PM-Timer IO Port: 0x4008 ACPI: Local APIC address 0xfee0 ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled) Processor #0 (Bootup-CPU) ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled) Processor #1 ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1]) ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0]) IOAPIC[0]: apic_id 2, address 0xfec0, GSI 0-23 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) ACPI: BIOS IRQ0 pin2 override ignored. ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) ACPI: INT_SRC_OVR (bus 0 bus_irq 14 global_irq 14 high edge) ACPI: INT_SRC_OVR (bus 0 bus_irq 15 global_irq 15 high edge) ACPI: IRQ9 used by override. ACPI: IRQ14 used by override. ACPI: IRQ15 used by override. Setting APIC routing to physical flat Using ACPI (MADT) for SMP configuration information Nosave address range: 0009f000 - 000a Nosave address range: 000a - 000f Nosave address range: 000f - 0010 Allocating PCI resources starting at 8800 (gap: 8000:6000) SMP: Allowing 2 CPUs, 0 hotplug CPUs PERCPU: Allocating 32320 bytes of per cpu data Built 1 zonelists. Total pages: 516998 Kernel command line: root=/dev/md1 ro Initializing CPU#0 PID hash table
Re: SATA exceptions with 2.6.20-rc5
On lör, 2007-01-20 at 21:43 +, Alistair John Strachan wrote: > On Saturday 20 January 2007 19:59, Robert Hancock wrote: > > Ian Kumlien wrote: > > > Hi, > > > > > > I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and adama > > > enabled, to 2.6.20-rc5, which gave me problems almost instantly. > > > > > > I just thought that it might be interesting to know that it DID work > > > nicely. > > > > > > CC since i'm not on the ml > > > > (I'm ccing more of the people who reported this) > > > > Well that's interesting.. The only significant change that went into > > 2.6.20-rc5 in that driver that wasn't in that version you mentioned was > > this one: > > > > http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=com > >mit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861 > > > > Could you (or anyone else) test what happens if you take the 2.6.20-rc5 > > version of sata_nv.c and try it on 2.6.19? That would tell us whether > > it's this change or whether it's something else (i.e. in libata core). > > I'm still running an -rc5 kernel with ADMA switched off entirely and I can't > reproduce the problem. How is everybody else reproducing this? > > I've been successful installing bonnie++, then going to a large XFS partition > and running "bonnie++ -u 1000:1000" and letting it run through, all defaults. > > It doesn't cause the problem I was seeing in -rc5 with ADMA on, when I switch > ADMA off, so I think this is sufficient to fix it. Eh? The whole point with that patch was to ADD ADMA support to sata_nv, imho that is something we want to have and i have been running with ADMA on on two computers since sata_nv-adma-ncq-v4 or 5 or so without problems. So, something has been introduced or been broken to cause this error, wouldn't it be better to find the error introduced than to just totally negate the patch in the first place? I haven't had the energy to go trough the patch that was found as causing the problem yet... 
I don't know if I even have all the info needed to make any form of educated guess, but I'll give it a try when I have the energy. I really hope someone finds it before then =) -- Ian Kumlien -- http://pomac.netswarm.net
Re: SATA exceptions with 2.6.20-rc5
On Saturday 20 January 2007 19:59, Robert Hancock wrote: > Ian Kumlien wrote: > > Hi, > > > > I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and adama > > enabled, to 2.6.20-rc5, which gave me problems almost instantly. > > > > I just thought that it might be interesting to know that it DID work > > nicely. > > > > CC since i'm not on the ml > > (I'm ccing more of the people who reported this) > > Well that's interesting.. The only significant change that went into > 2.6.20-rc5 in that driver that wasn't in that version you mentioned was > this one: > > http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=com >mit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861 > > Could you (or anyone else) test what happens if you take the 2.6.20-rc5 > version of sata_nv.c and try it on 2.6.19? That would tell us whether > it's this change or whether it's something else (i.e. in libata core). I'm still running an -rc5 kernel with ADMA switched off entirely and I can't reproduce the problem. How is everybody else reproducing this? I've been successful installing bonnie++, then going to a large XFS partition and running "bonnie++ -u 1000:1000" and letting it run through, all defaults. It doesn't cause the problem I was seeing in -rc5 with ADMA on, when I switch ADMA off, so I think this is sufficient to fix it. Others have reported differently. Did you guys do: [EMAIL PROTECTED]:~$ cat /proc/cmdline root=/dev/sda1 ro sata_nv.adma=0 Or something similar? This is how Jeff suggested disabling ADMA and indeed the messages about its use disappear from dmesg. -- Cheers, Alistair. Final year Computer Science undergraduate. 1F2 55 South Clerk Street, Edinburgh, UK. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
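A small sketch of the check Alistair describes, for anyone who wants to confirm what their kernel was actually booted with. It assumes the option appears on the kernel command line as sata_nv.adma=<n>, as in his /proc/cmdline output above. The helper name adma_param_from_cmdline is made up for this example, and the parsing is deliberately simplistic (it is not the kernel's own parameter parser).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the value given for "sata_nv.adma=" in a kernel command line
 * string, or -1 if the parameter is absent.
 * Hypothetical helper for illustration only.
 */
static int adma_param_from_cmdline(const char *cmdline)
{
	static const char key[] = "sata_nv.adma=";
	const char *p = cmdline;

	assert(cmdline != NULL);

	while ((p = strstr(p, key)) != NULL) {
		/* Only accept the key at the start of a word. */
		if (p == cmdline || p[-1] == ' ')
			return atoi(p + sizeof(key) - 1);
		p += sizeof(key) - 1;
	}
	return -1;
}
```

Feeding it the contents of /proc/cmdline should give 0 on a box set up like Alistair's and -1 where the option was never passed, which may help explain why different people report different results.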
Re: SATA exceptions with 2.6.20-rc5
Ian Kumlien wrote:
> Hi,
>
> I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and ADMA
> enabled, to 2.6.20-rc5, which gave me problems almost instantly.
>
> I just thought that it might be interesting to know that it DID work
> nicely.
>
> CC since i'm not on the ml

(I'm ccing more of the people who reported this)

Well that's interesting.. The only significant change that went into 2.6.20-rc5 in that driver that wasn't in that version you mentioned was this one:

http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861

Could you (or anyone else) test what happens if you take the 2.6.20-rc5 version of sata_nv.c and try it on 2.6.19? That would tell us whether it's this change or whether it's something else (i.e. in libata core).

Assuming that still doesn't work, can you then try removing these lines from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?

	/* bail out if not our interrupt */
	if (!(irq_stat & NV_INT_DEV))
		return 0;

as that's the difference I'm most suspicious of causing the problem.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
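To make the suspicion about that check concrete, here is a toy userspace model of the early return. It is only a sketch under assumptions: the NV_INT_DEV value is taken as 0x01 for illustration (check sata_nv.c for the real definition), intr_completes_command is a made-up name, and the real handler does much more than this. The point is just that if the controller ever signals completion of a command without NV_INT_DEV set, the check would drop that interrupt and the command would sit there until it times out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit value for illustration; see sata_nv.c for the real one. */
#define NV_INT_DEV 0x01

/*
 * Simplified model of the early return in nv_host_intr: when the check
 * is enabled and the status lacks NV_INT_DEV, the handler returns
 * without servicing the active command.
 */
static bool intr_completes_command(uint8_t irq_stat, bool cmd_active,
				   bool bail_out_check)
{
	if (bail_out_check && !(irq_stat & NV_INT_DEV))
		return false;	/* treated as "not our interrupt" */
	return cmd_active;	/* otherwise the active command gets serviced */
}
```

With the check in place, a status of 0x00 never completes anything; with the check removed, the same status falls through to normal handling. Whether the hardware really behaves that way is exactly what the requested test would show.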
Re: SATA exceptions with 2.6.20-rc5
On Saturday, 20. January 2007 03:41, Robert Hancock wrote: > Alistair John Strachan wrote: > > On Tuesday 16 January 2007 01:53, Jeff Garzik wrote: > >> Robert Hancock wrote: > >>> I'll try your stress test when I get a chance, but I doubt I'll run > >>> into the same problem and I haven't seen any similar reports. Perhaps > >>> it's some kind of wierd timing issue or incompatibility between the > >>> controller and that drive when running in ADMA mode? I seem to remember > >>> various reports of issues with certain Maxtor drives and some nForce > >>> SATA controllers under Windows at least.. > >> > >> Just to eliminate things, has disabling ADMA been attempted? > >> > >> It can be disabled using the sata_nv.adma module parameter. > > > > Setting this option fixes the problem for me. I suggest that ADMA > > defaults off in 2.6.20, if there's still time to do that. > > Can you guys that are having this problem try the attached debug patch? > It's possible it will fix the problem, as I'm trying a private > exec_command implementation that flushes the write by reading a > controller register instead of reading altstatus from the drive like the > libata core code does. > > If the problem still happens, I also added some more debugging in to > help figure out what is going on, so please post full dmesg. > > By the way, I assume that you guys are using reiserfs or xfs, as it > appears no other file systems issue flush commands automatically. I had > to test this by "echo 1 > delete" on the SCSI disk in sysfs, as I am > using ext3. Yes, I've some reiserfs partitions, but I don't think it's reiserfs fault ;). Here is the log. (I cut out some parts, because it was too big.) BTW: please CC, I'm not on the list! 
18:17:29 sys kernel: Linux version 2.6.20-rc5 ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #2 SMP PREEMPT Sat 18:07:36 CET 2007 18:17:29 sys kernel: Command line: root=/dev/md1 ro 18:17:29 sys kernel: BIOS-provided physical RAM map: 18:17:29 sys kernel: BIOS-e820: - 0009f800 (usable) 18:17:29 sys kernel: BIOS-e820: 0009f800 - 000a (reserved) 18:17:29 sys kernel: BIOS-e820: 000f - 0010 (reserved) 18:17:29 sys kernel: BIOS-e820: 0010 - 7fff (usable) 18:17:29 sys kernel: BIOS-e820: 7fff - 7fff3000 (ACPI NVS) 18:17:29 sys kernel: BIOS-e820: 7fff3000 - 8000 (ACPI data) 18:17:29 sys kernel: BIOS-e820: e000 - f000 (reserved) 18:17:29 sys kernel: BIOS-e820: fec0 - 0001 (reserved) 18:17:29 sys kernel: Entering add_active_range(0, 0, 159) 0 entries of 256 used 18:17:29 sys kernel: Entering add_active_range(0, 256, 524272) 1 entries of 256 used 18:17:29 sys kernel: end_pfn_map = 1048576 18:17:29 sys kernel: DMI 2.3 present. 18:17:29 sys kernel: ACPI: RSDP (v000 Nvidia) @ 0x000f7d30 18:17:29 sys kernel: ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff3040 18:17:29 sys kernel: ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff30c0 18:17:29 sys kernel: ACPI: SSDT (v001 PTLTD POWERNOW 0x0001 LTP 0x0001) @ 0x7fff9900 18:17:29 sys kernel: ACPI: SRAT (v001 AMDHAMMER 0x0001 AMD 0x0001) @ 0x7fff9b40 18:17:29 sys kernel: ACPI: MCFG (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff9c40 18:17:29 sys kernel: ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff9840 18:17:29 sys kernel: ACPI: DSDT (v001 NVIDIA AWRDACPI 0x1000 MSFT 0x010e) @ 0x 18:17:29 sys kernel: Entering add_active_range(0, 0, 159) 0 entries of 256 used 18:17:29 sys kernel: Entering add_active_range(0, 256, 524272) 1 entries of 256 used 18:17:29 sys kernel: Zone PFN ranges: 18:17:29 sys kernel: DMA 0 -> 4096 18:17:29 sys kernel: DMA324096 -> 1048576 18:17:29 sys kernel: Normal1048576 -> 1048576 18:17:29 sys kernel: early_node_map[2] active PFN ranges 
18:17:29 sys kernel: 0:0 -> 159 18:17:29 sys kernel: 0: 256 -> 524272 18:17:29 sys kernel: On node 0 totalpages: 524175 18:17:29 sys kernel: DMA zone: 56 pages used for memmap 18:17:29 sys kernel: DMA zone: 10 pages reserved 18:17:29 sys kernel: DMA zone: 3933 pages, LIFO batch:0 18:17:29 sys kernel: DMA32 zone: 7111 pages used for memmap 18:17:29 sys kernel: DMA32 zone: 513065 pages, LIFO batch:31 18:17:29 sys kernel: Normal zone: 0 pages used for memmap 18:17:29 sys kernel: Nvidia board detected. Ignoring ACPI timer override. 18:17:29 sys kernel: If you got timer trouble try acpi_use_timer_override 18:17:29 sys kernel: ACPI: PM-Timer IO Port: 0x4008 18:17:29 sys kernel: ACPI: Local APIC address 0xfee0 18:17:29 sys kernel: ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled) 18:17:29 sys kernel: Processor #0 (Bootup-CPU) 18:17:29 sys kernel: ACPI: LAPI
Re: SATA exceptions with 2.6.20-rc5
Hi, I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and ADMA enabled, to 2.6.20-rc5, which gave me problems almost instantly. I just thought that it might be interesting to know that it DID work nicely. CC since I'm not on the ml -- Ian Kumlien -- http://pomac.netswarm.net
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.19 20:41:36 -0600, Robert Hancock wrote: > Alistair John Strachan wrote: > >On Tuesday 16 January 2007 01:53, Jeff Garzik wrote: > >>Robert Hancock wrote: > >>>I'll try your stress test when I get a chance, but I doubt I'll run into > >>>the same problem and I haven't seen any similar reports. Perhaps it's > >>>some kind of wierd timing issue or incompatibility between the > >>>controller and that drive when running in ADMA mode? I seem to remember > >>>various reports of issues with certain Maxtor drives and some nForce > >>>SATA controllers under Windows at least.. > >>Just to eliminate things, has disabling ADMA been attempted? > >> > >>It can be disabled using the sata_nv.adma module parameter. > > > >Setting this option fixes the problem for me. I suggest that ADMA defaults > >off in 2.6.20, if there's still time to do that. > > > > Can you guys that are having this problem try the attached debug patch? > It's possible it will fix the problem, as I'm trying a private > exec_command implementation that flushes the write by reading a > controller register instead of reading altstatus from the drive like the > libata core code does. Will give it a spin in about an hour. > If the problem still happens, I also added some more debugging in to > help figure out what is going on, so please post full dmesg. > > By the way, I assume that you guys are using reiserfs or xfs, as it > appears no other file systems issue flush commands automatically. I had > to test this by "echo 1 > delete" on the SCSI disk in sysfs, as I am > using ext3. No, ext3 here, on top of md RAID1 and LVM. Oh, and one ext2, I wonder where that comes from... Björn - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SATA exceptions with 2.6.20-rc5
On Saturday 20 January 2007 02:41, Robert Hancock wrote: > By the way, I assume that you guys are using reiserfs or xfs, as it > appears no other file systems issue flush commands automatically. I had > to test this by "echo 1 > delete" on the SCSI disk in sysfs, as I am > using ext3. I'll give it a spin now, and yes I'm using several large XFS partitions on this machine, layered on top of md RAID5. That's why this particular defect is so catastrophic (literally _everything_ is stalled). -- Cheers, Alistair. Final year Computer Science undergraduate. 1F2 55 South Clerk Street, Edinburgh, UK.
Re: SATA exceptions with 2.6.20-rc5
Alistair John Strachan wrote: On Tuesday 16 January 2007 01:53, Jeff Garzik wrote: Robert Hancock wrote: I'll try your stress test when I get a chance, but I doubt I'll run into the same problem and I haven't seen any similar reports. Perhaps it's some kind of wierd timing issue or incompatibility between the controller and that drive when running in ADMA mode? I seem to remember various reports of issues with certain Maxtor drives and some nForce SATA controllers under Windows at least.. Just to eliminate things, has disabling ADMA been attempted? It can be disabled using the sata_nv.adma module parameter. Setting this option fixes the problem for me. I suggest that ADMA defaults off in 2.6.20, if there's still time to do that. Can you guys that are having this problem try the attached debug patch? It's possible it will fix the problem, as I'm trying a private exec_command implementation that flushes the write by reading a controller register instead of reading altstatus from the drive like the libata core code does. If the problem still happens, I also added some more debugging in to help figure out what is going on, so please post full dmesg. By the way, I assume that you guys are using reiserfs or xfs, as it appears no other file systems issue flush commands automatically. I had to test this by "echo 1 > delete" on the SCSI disk in sysfs, as I am using ext3. 
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c 2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-19 20:25:31.0 -0600
@@ -245,6 +245,7 @@ static void nv_adma_bmdma_setup(struct a
 static void nv_adma_bmdma_start(struct ata_queued_cmd *qc);
 static void nv_adma_bmdma_stop(struct ata_queued_cmd *qc);
 static u8 nv_adma_bmdma_status(struct ata_port *ap);
+static void nv_adma_exec_command(struct ata_port *ap, const struct ata_taskfile *tf);
 
 enum nv_host_type
 {
@@ -409,7 +410,7 @@ static const struct ata_port_operations
 	.tf_load		= ata_tf_load,
 	.tf_read		= ata_tf_read,
 	.check_atapi_dma	= nv_adma_check_atapi_dma,
-	.exec_command		= ata_exec_command,
+	.exec_command		= nv_adma_exec_command,
 	.check_status		= ata_check_status,
 	.dev_select		= ata_std_dev_select,
 	.bmdma_setup		= nv_adma_bmdma_setup,
@@ -617,6 +618,14 @@ static int nv_adma_check_atapi_dma(struc
 	return !(pp->flags & NV_ADMA_ATAPI_SETUP_COMPLETE);
 }
 
+static void nv_adma_exec_command(struct ata_port *ap, const struct ata_taskfile *tf)
+{
+	void __iomem* mmio = nv_adma_ctl_block(ap);
+	writeb(tf->command, (void __iomem *) ap->ioaddr.command_addr);
+	readw(mmio + NV_ADMA_CTL); /* flush */
+	ndelay(400);
+}
+
 static unsigned int nv_adma_tf_to_cpb(struct ata_taskfile *tf, __le16 *cpb)
 {
 	unsigned int idx = 0;
@@ -701,6 +710,9 @@ static int nv_host_intr(struct ata_port
 {
 	struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
 	int handled;
+	u8 cmd = 0;
+	if(qc)
+		cmd = qc->tf.command;
 
 	/* freeze if hotplugged */
 	if (unlikely(irq_stat & (NV_INT_ADDED | NV_INT_REMOVED))) {
@@ -709,8 +721,11 @@ static int nv_host_intr(struct ata_port
 	}
 
 	/* bail out if not our interrupt */
-	if (!(irq_stat & NV_INT_DEV))
+	if (!(irq_stat & NV_INT_DEV)) {
+		if( cmd == ATA_CMD_FLUSH || cmd == ATA_CMD_FLUSH_EXT )
+			ata_port_printk(ap, KERN_NOTICE, "cmd 0x%x active but stat 0x%x\n", cmd, irq_stat);
 		return 0;
+	}
 
 	/* DEV interrupt w/ no active qc? */
 	if (unlikely(!qc || (qc->tf.flags & ATA_TFLAG_POLLING))) {
@@ -720,6 +735,8 @@ static int nv_host_intr(struct ata_port
 
 	/* handle interrupt */
 	handled = ata_host_intr(ap, qc);
+	if( cmd == ATA_CMD_FLUSH || cmd == ATA_CMD_FLUSH_EXT )
+		ata_port_printk(ap, KERN_NOTICE, "cmd 0x%x active, stat = 0x%x, handled = 0x%x\n", cmd, irq_stat, handled);
 	if (unlikely(!handled)) {
 		/* spurious, clear it */
 		ata_check_status(ap);
@@ -870,7 +887,7 @@ static void nv_adma_bmdma_setup(struct a
 	outb(dmactl, ap->ioaddr.bmdma_addr + ATA_DMA_CMD);
 
 	/* issue r/w command */
-	ata_exec_command(ap, &qc->tf);
+	nv_adma_exec_command(ap, &qc->tf);
 }
 
 static void nv_adma_bmdma_start(struct ata_queued_cmd *qc)
@@ -1161,6 +1178,9 @@ static unsigned int nv_adma_qc_issue(str
 		/* use ATA register mode */
 		VPRINTK("no dmamap or ATAPI, using ATA register mode: 0x%lx\n", qc->flags);
 		nv_adma_register_mode(qc->ap);
+		if(qc->tf.command == ATA_CMD_FLUSH ||
+		   qc->tf.c
Re: SATA exceptions with 2.6.20-rc5
On Friday, 19. January 2007 16:05, Alistair John Strachan wrote: > On Tuesday 16 January 2007 01:53, Jeff Garzik wrote: > > Robert Hancock wrote: > > > I'll try your stress test when I get a chance, but I doubt I'll run > > > into the same problem and I haven't seen any similar reports. Perhaps > > > it's some kind of wierd timing issue or incompatibility between the > > > controller and that drive when running in ADMA mode? I seem to remember > > > various reports of issues with certain Maxtor drives and some nForce > > > SATA controllers under Windows at least.. > > > > Just to eliminate things, has disabling ADMA been attempted? > > > > It can be disabled using the sata_nv.adma module parameter. > > Setting this option fixes the problem for me. I suggest that ADMA defaults > off in 2.6.20, if there's still time to do that. Not for me. I'm still have the same trouble, but less (maybe about every hour, instead of every 5 minutes). futhermore, I found a patch cocktail-2.6.20-rc3.patch: http://tinyurl.com/2gza8q, which improves the situation too! Now, the funny thing is that I've two SATA HDDs, but only 1 causes all the headaches. 
The affected drive is a:
sda - @ata3.0 - WDC WD2500KS-00M 02.0 ATA-7, max UDMA/133, 488395055 sectors: LBA48

"ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 out
         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133:PIO0
ata3: EH complete
SCSI device sda: 488395055 512-byte hdwr sectors (250058 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00"

the "good" HDD is a:
sdb - @ata4.0 - WDC WD2500YD-01N 10.0 ATA-7, max UDMA/133, 490234752 sectors: LBA48 NCQ (depth 0/1)

System: AMD64 4200+ nForce 4 SLI 2 GB SMP PREEMPT kernel
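For anyone comparing their own logs against the one above, a rough sketch of how the exception lines can be read. The opcodes 0xE7 (FLUSH CACHE) and 0xEA (FLUSH CACHE EXT) and the BSY status bit 0x80 are the standard ATA values; the function names parse_cmd_opcode, opcode_name and status_is_busy are invented for this example, and the parsing only handles the "cmd xx/..." layout seen in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ATA_CMD_FLUSH     0xE7	/* FLUSH CACHE */
#define ATA_CMD_FLUSH_EXT 0xEA	/* FLUSH CACHE EXT */
#define ATA_BUSY          0x80	/* BSY bit in the status register */

/*
 * Pull the command opcode out of a libata "... cmd xx/..." log line.
 * Returns the opcode, or -1 if the line has no such field.
 */
static int parse_cmd_opcode(const char *line)
{
	unsigned int opcode;
	const char *p;

	assert(line != NULL);
	p = strstr(line, "cmd ");
	if (p == NULL || sscanf(p, "cmd %2x", &opcode) != 1)
		return -1;
	return (int)opcode;
}

/* Name the two flush opcodes; anything else is reported as "other". */
static const char *opcode_name(int opcode)
{
	switch (opcode) {
	case ATA_CMD_FLUSH:     return "FLUSH CACHE";
	case ATA_CMD_FLUSH_EXT: return "FLUSH CACHE EXT";
	default:                return "other";
	}
}

/* Nonzero if the status byte from the "res xx/..." field has BSY set. */
static int status_is_busy(unsigned int status)
{
	return (status & ATA_BUSY) != 0;
}
```

Applied to the log above, the opcode ea decodes as FLUSH CACHE EXT, and the result status 40 has BSY clear, i.e. the drive no longer reports busy even though the command was flagged as timed out.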
Re: SATA exceptions with 2.6.20-rc5
On Tuesday 16 January 2007 01:53, Jeff Garzik wrote: > Robert Hancock wrote: > > I'll try your stress test when I get a chance, but I doubt I'll run into > > the same problem and I haven't seen any similar reports. Perhaps it's > > some kind of weird timing issue or incompatibility between the > > controller and that drive when running in ADMA mode? I seem to remember > > various reports of issues with certain Maxtor drives and some nForce > > SATA controllers under Windows at least.. > > Just to eliminate things, has disabling ADMA been attempted? > > It can be disabled using the sata_nv.adma module parameter. Setting this option fixes the problem for me. I suggest that ADMA defaults off in 2.6.20, if there's still time to do that. -- Cheers, Alistair. Final year Computer Science undergraduate. 1F2 55 South Clerk Street, Edinburgh, UK.
Re: SATA exceptions with 2.6.20-rc5
On Tuesday 16 January 2007 00:34, Robert Hancock wrote: > I'll try your stress test when I get a chance, but I doubt I'll run into > the same problem and I haven't seen any similar reports. Perhaps it's > some kind of wierd timing issue or incompatibility between the > controller and that drive when running in ADMA mode? I seem to remember > various reports of issues with certain Maxtor drives and some nForce > SATA controllers under Windows at least.. I have exactly the same problem on -rc5 and it causes all I/O to stall periodically if I do _anything_ I/O intensive. On my box, I have 4 sata_nv handled SATA ports, with two pairs of different drives (two Maxtor, two WD) and it happens randomly on both. So it's absolutely nothing to do with the drive make/model. I'll try Jeff's suggestion of disabling ADMA now, but I think something more radical than this workaround should make it into 2.6.20 final, otherwise a lot of people are going to have broken boxes. -- Cheers, Alistair. Final year Computer Science undergraduate. 1F2 55 South Clerk Street, Edinburgh, UK. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.18 18:09:50 -0600, Robert Hancock wrote: > I heard from Larry Walton who was apparently seeing this problem as > well. He tried my recent "sata_nv: cleanup ADMA error handling v2" patch > and originally thought it fixed the problem, but it turned out to only > make it happen less often. > > I wouldn't expect that patch to have an effect on this problem. If it > seems to reduce the frequency that would tend to be further evidence of > some kind of timing-related issue where the code change just happens > to make a difference. > > I'll see if I can come up with a debug patch for people having this > problem to try, which prints out when a flush command is issued and what > interrupts happen when a flush is pending. > > There is one important difference between ADMA and non-ADMA mode for > non-DMA commands like flushes, which didn't come to mind before: ADMA > mode uses MMIO registers on the controller whereas non-ADMA mode uses > legacy IO registers. Posted write flushing is a concern with MMIO > registers but not with PIO, the libata core is supposed to handle this > but maybe it doesn't in some case(s). In fact, just looking at > libata-sff.c there's this comment on the ata_exec_command_mmio function: > > * FIXME: missing write posting for 400nS delay enforcement > > That seems a bit suspicious.. That would imply that disabling adma via a module parameter should make the issue go away, right? I'll try to have a test run with adma disabled over night then. Thanks, Björn - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SATA exceptions with 2.6.20-rc5
I heard from Larry Walton who was apparently seeing this problem as well. He tried my recent "sata_nv: cleanup ADMA error handling v2" patch and originally thought it fixed the problem, but it turned out to only make it happen less often.

I wouldn't expect that patch to have an effect on this problem. If it seems to reduce the frequency that would tend to be further evidence of some kind of timing-related issue where the code change just happens to make a difference.

I'll see if I can come up with a debug patch for people having this problem to try, which prints out when a flush command is issued and what interrupts happen when a flush is pending.

There is one important difference between ADMA and non-ADMA mode for non-DMA commands like flushes, which didn't come to mind before: ADMA mode uses MMIO registers on the controller whereas non-ADMA mode uses legacy IO registers. Posted write flushing is a concern with MMIO registers but not with PIO, the libata core is supposed to handle this but maybe it doesn't in some case(s). In fact, just looking at libata-sff.c there's this comment on the ata_exec_command_mmio function:

 * FIXME: missing write posting for 400nS delay enforcement

That seems a bit suspicious..

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
>> It should be correct the way it is - that check is trying to prevent
>> ATAPI commands from using DMA until the slave_config function has been
>> called to set up the DMA parameters properly. When the
>> NV_ADMA_ATAPI_SETUP_COMPLETE flag is not set, this returns 1 which
>> disallows DMA transfers. Unless you were using an ATAPI (i.e. CD/DVD)
>> device on the channel this wouldn't affect you anyway.
>
> I wondered about it, because the flag is cleared when adma_enabled is 1,
> which seems to be consistent with everything but nv_adma_check_atapi_dma.

When ADMA is enabled we can't use ATAPI at all (or so says NVidia anyway), so it has to be disabled when an ATAPI device is detected in slave_config. Since doing that implies using the legacy BMDMA engine with its greater restrictions, this is why we need to prevent DMA transfers from being attempted until those restrictions have been set properly. (Otherwise, the libata core will try to use PACKET commands on an ATAPI device with DMA enabled before slave_config is even called.)

> Thus I thought that nv_adma_check_atapi_dma might be wrong, but maybe
> setting/clearing the flag is wrong instead? *feels lost*

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
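Since the negation is what caused the confusion, here is the return expression from the driver reduced to a standalone sketch. The flag value (1u << 1) is only illustrative and check_atapi_dma is a shortened, made-up name; the expression itself is the one visible in the patch context earlier in the thread. A nonzero result means "do not use DMA for this ATAPI command".

```c
#include <assert.h>

/* Illustrative flag value; the real definition lives in sata_nv.c. */
#define NV_ADMA_ATAPI_SETUP_COMPLETE (1u << 1)

/*
 * Same shape as the driver's return expression: DMA is refused (1)
 * until the setup-complete flag has been set, and allowed (0) after.
 */
static int check_atapi_dma(unsigned int port_flags)
{
	return !(port_flags & NV_ADMA_ATAPI_SETUP_COMPLETE);
}
```

So with the flag clear the function returns 1 (DMA disallowed), and once slave_config has set the flag it returns 0 (DMA allowed), which matches the explanation above rather than the reading that the negation is a bug.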
Re: SATA exceptions with 2.6.20-rc5
Robert Hancock wrote:
> I'll try your stress test when I get a chance, but I doubt I'll run into
> the same problem and I haven't seen any similar reports. Perhaps it's
> some kind of weird timing issue or incompatibility between the
> controller and that drive when running in ADMA mode? I seem to remember
> various reports of issues with certain Maxtor drives and some nForce
> SATA controllers under Windows at least..

Just to eliminate things, has disabling ADMA been attempted?

It can be disabled using the sata_nv.adma module parameter.

Jeff
Re: SATA exceptions with 2.6.20-rc5
Robert Hancock wrote:
> Note that the ATA-7 spec for FLUSH CACHE says that "This command may
> take longer than 30 s to complete."

Yep...

Jeff
Re: SATA exceptions with 2.6.20-rc5
Jens Axboe wrote:
> On Mon, Jan 15 2007, Jeff Garzik wrote:
>> Jens Axboe wrote:
>>> I'd be surprised if the device would not obey the 7 second timeout rule
>>> that seems to be set in stone and not allow more dirty in-drive cache
>>> than it could flush out in approximately that time.
>>
>> AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other
>> commands...
>
> Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as
> it would pretty much guarantee lower latencies for random writes and
> write back caching. The concern is the barrier code, of course. I guess
> I should do some timings on potential worst case patterns some day. Alan
> may have done that sometime in the past, iirc.

FWIW: According to the drive guys (Eric M, among others), FLUSH CACHE will "probably" be under 30 seconds, but pathological cases might even extend beyond that. Definitely more than 7 seconds in less-than-pathological cases, unfortunately...

The SCSI layer /should/ already take this (30 second timeout) into account, for SYNCHRONIZE CACHE (and thus FLUSH CACHE for libata) but I'm too slack to check at the moment.

Jeff
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.15 18:34:43 -0600, Robert Hancock wrote: > Björn Steinbrink wrote: > >>My latest bisection attempt actually led to your sata_nv ADMA commit. [1] > >>I've now backed out that patch from 2.6.20-rc5 and have my stress test > >>running for 20 minutes now ("record" for a bad kernel surviving that > >>test is about 40 minutes IIRC). I'll keep it running for at least 2 more > >>hours. > > > >Yep, that one seems to be guilty. 2.6.20-rc5 with that commit backed out > >survived about 3 hours of testing, while the average was around 5 > >minutes for a failure, sometimes even before I could log in. > >I took a look at the patch, but I can't really tell anything. > >nv_adma_check_atapi_dma somehow looks like it should not negate its > >return value, so that it returns 0 (atapi dma available) when > >adma_enable was 1. But I'm not exactly confident about that either ;) > >Will it hurt if I try to remove the negation? > > It should be correct the way it is - that check is trying to prevent > ATAPI commands from using DMA until the slave_config function has been > called to set up the DMA parameters properly. When the > NV_ADMA_ATAPI_SETUP_COMPLETE flag is not set, this returns 1 which > disallows DMA transfers. Unless you were using an ATAPI (i.e. CD/DVD) > device on the channel this wouldn't affect you anyway. I wondered about it, because the flag is cleared when adma_enabled is 1, which seems to be consistent with everything but nv_adma_check_atapi_dma. Thus I thought that nv_adma_check_atapi_dma might be wrong, but maybe setting/clearing the flag is wrong instead? *feels lost* > I'll try your stress test when I get a chance, but I doubt I'll run into > the same problem and I haven't seen any similar reports. Perhaps it's > some kind of wierd timing issue or incompatibility between the > controller and that drive when running in ADMA mode? 
I seem to remember > various reports of issues with certain Maxtor drives and some nForce > SATA controllers under Windows at least.. I just checked Maxtor's knowledge base, that incompatibility does not affect my drive. Thanks, Björn
Re: SATA exceptions with 2.6.20-rc5
Jens Axboe wrote:
> On Mon, Jan 15 2007, Jeff Garzik wrote:
>> Jens Axboe wrote:
>>> I'd be surprised if the device would not obey the 7 second timeout rule
>>> that seems to be set in stone and not allow more dirty in-drive cache
>>> than it could flush out in approximately that time.
>>
>> AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other
>> commands...
>
> Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as
> it would pretty much guarantee lower latencies for random writes and
> write back caching. The concern is the barrier code, of course. I guess
> I should do some timings on potential worst case patterns some day. Alan
> may have done that sometime in the past, iirc.

Note that the ATA-7 spec for FLUSH CACHE says that "This command may take longer than 30 s to complete."

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote: My latest bisection attempt actually led to your sata_nv ADMA commit. [1] I've now backed out that patch from 2.6.20-rc5 and have my stress test running for 20 minutes now ("record" for a bad kernel surviving that test is about 40 minutes IIRC). I'll keep it running for at least 2 more hours. Yep, that one seems to be guilty. 2.6.20-rc5 with that commit backed out survived about 3 hours of testing, while the average was around 5 minutes for a failure, sometimes even before I could log in. I took a look at the patch, but I can't really tell anything. nv_adma_check_atapi_dma somehow looks like it should not negate its return value, so that it returns 0 (atapi dma available) when adma_enable was 1. But I'm not exactly confident about that either ;) Will it hurt if I try to remove the negation? It should be correct the way it is - that check is trying to prevent ATAPI commands from using DMA until the slave_config function has been called to set up the DMA parameters properly. When the NV_ADMA_ATAPI_SETUP_COMPLETE flag is not set, this returns 1 which disallows DMA transfers. Unless you were using an ATAPI (i.e. CD/DVD) device on the channel this wouldn't affect you anyway. I'll try your stress test when I get a chance, but I doubt I'll run into the same problem and I haven't seen any similar reports. Perhaps it's some kind of wierd timing issue or incompatibility between the controller and that drive when running in ADMA mode? I seem to remember various reports of issues with certain Maxtor drives and some nForce SATA controllers under Windows at least.. -- Robert Hancock Saskatoon, SK, Canada To email, remove "nospam" from [EMAIL PROTECTED] Home Page: http://www.roberthancock.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SATA exceptions with 2.6.20-rc5
On Mon, Jan 15 2007, Jeff Garzik wrote:
> Jens Axboe wrote:
> >I'd be surprised if the device would not obey the 7 second timeout rule
> >that seems to be set in stone and not allow more dirty in-drive cache
> >than it could flush out in approximately that time.
>
> AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other
> commands...

Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as it would pretty much guarantee lower latencies for random writes and write back caching. The concern is the barrier code, of course. I guess I should do some timings on potential worst case patterns some day. Alan may have done that sometime in the past, iirc.

--
Jens Axboe
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.15 22:17:24 +0100, Björn Steinbrink wrote: > On 2007.01.14 17:43:53 -0600, Robert Hancock wrote: > > Björn Steinbrink wrote: > > >Hi, > > > > > >with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite > > >often, with 2.6.19 there are no such exceptions. dmesg and lspci -v > > >output follows. In the meantime, I'll start bisecting. > > > > ... > > > > >ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen > > >ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in > > > res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > > >ata1: soft resetting port > > >ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) > > >ata1.00: configured for UDMA/133 > > >ata1: EH complete > > >SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB) > > >sda: Write Protect is off > > >sda: Mode Sense: 00 3a 00 00 > > >SCSI device sda: write cache: enabled, read cache: enabled, doesn't > > >support DPO or FUA > > > > Looks like all of these errors are from a FLUSH CACHE command and the > > drive is indicating that it is no longer busy, so presumably done. > > That's not a DMA-mapped command, so it wouldn't go through the ADMA > > machinery and I wouldn't have expected this to be handled any > > differently from before. Curious.. > > My latest bisection attempt actually led to your sata_nv ADMA commit. [1] > I've now backed out that patch from 2.6.20-rc5 and have my stress test > running for 20 minutes now ("record" for a bad kernel surviving that > test is about 40 minutes IIRC). I'll keep it running for at least 2 more > hours. Yep, that one seems to be guilty. 2.6.20-rc5 with that commit backed out survived about 3 hours of testing, while the average was around 5 minutes for a failure, sometimes even before I could log in. I took a look at the patch, but I can't really tell anything. 
nv_adma_check_atapi_dma somehow looks like it should not negate its return value, so that it returns 0 (atapi dma available) when adma_enable was 1. But I'm not exactly confident about that either ;) Will it hurt if I try to remove the negation? Thanks, Björn
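One rough way to read the timings above (about 5 minutes to failure on average, 3 hours survived with the commit backed out) is to ask how likely a bad kernel would be to survive a test of a given length. The sketch below assumes failures arrive at a constant rate, i.e. an exponential time-to-failure. That model is an assumption made here for illustration and is not something the thread states or measures.

```python
import math


def survival_probability(minutes, mean_minutes_to_failure):
    """P(no failure within `minutes`) under an assumed exponential model."""
    return math.exp(-minutes / mean_minutes_to_failure)


MEAN_MINUTES_TO_FAILURE = 5.0  # "the average was around 5 minutes"

# Test lengths mentioned in the thread: 20 minutes, 40 minutes, 3 hours.
for minutes in (20, 40, 180):
    p = survival_probability(minutes, MEAN_MINUTES_TO_FAILURE)
    print(f"{minutes:4d} min: {p:.3g}")
```

Under this crude model a bad kernel passes a 20-minute test roughly 2% of the time, so a handful of false "good" marks during a long bisection is plausible, while a clean 3-hour run is very unlikely to be luck.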
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.14 17:43:53 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >Hi,
> >
> >with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite
> >often, with 2.6.19 there are no such exceptions. dmesg and lspci -v
> >output follows. In the meantime, I'll start bisecting.
>
> ...
>
> >ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> >ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in
> >         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> >ata1: soft resetting port
> >ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> >ata1.00: configured for UDMA/133
> >ata1: EH complete
> >SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
> >sda: Write Protect is off
> >sda: Mode Sense: 00 3a 00 00
> >SCSI device sda: write cache: enabled, read cache: enabled, doesn't
> >support DPO or FUA
>
> Looks like all of these errors are from a FLUSH CACHE command and the
> drive is indicating that it is no longer busy, so presumably done.
> That's not a DMA-mapped command, so it wouldn't go through the ADMA
> machinery and I wouldn't have expected this to be handled any
> differently from before. Curious..

My latest bisection attempt actually led to your sata_nv ADMA commit. [1] I've now backed out that patch from 2.6.20-rc5 and have my stress test running for 20 minutes now ("record" for a bad kernel surviving that test is about 40 minutes IIRC). I'll keep it running for at least 2 more hours.

The test is pretty simple:

while /bin/true; do ls -lR > /dev/null; done
while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done

running in parallel.

Björn

[1] 2dec7555e6bf2772749113ea0ad454fcdb8cf861
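For anyone wanting a time-bounded version of the two loops above, here is a minimal Python sketch. It is not Björn's script: the function names, the `seconds` limit and the threading layout are assumptions added here so the run stops on its own. The value 255 and the one-second sleep are taken from the mail.

```python
import os
import threading
import time


def list_tree_repeatedly(root, deadline):
    """Walk `root` and lstat every entry until `deadline` (like ls -lR > /dev/null)."""
    passes = 0
    while time.monotonic() < deadline:
        for dirpath, dirnames, filenames in os.walk(root):
            for name in dirnames + filenames:
                try:
                    os.lstat(os.path.join(dirpath, name))
                except OSError:
                    pass  # entries may vanish or be unreadable; keep going
            if time.monotonic() >= deadline:
                break
        passes += 1
    return passes


def drop_caches_repeatedly(control_file, value, interval, deadline):
    """Write `value` to `control_file` every `interval` seconds until `deadline`."""
    writes = 0
    while time.monotonic() < deadline:
        with open(control_file, "w") as f:
            f.write(f"{value}\n")
        writes += 1
        time.sleep(interval)
    return writes


def run_stress(root, control_file, seconds, value=255, interval=1.0):
    """Run both loops in parallel for roughly `seconds` and return their counters."""
    deadline = time.monotonic() + seconds
    results = {}

    def lister():
        results["passes"] = list_tree_repeatedly(root, deadline)

    def dropper():
        results["writes"] = drop_caches_repeatedly(
            control_file, value, interval, deadline)

    threads = [threading.Thread(target=lister), threading.Thread(target=dropper)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

To approximate the original test, one would call `run_stress(".", "/proc/sys/vm/drop_caches", seconds=3 * 3600)` as root from the directory being listed. Writing to that file needs root privileges, and the counters returned only show that both loops ran, not that the kernel is healthy, so the kernel log still has to be checked for exceptions.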
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.15 07:48:23 +0100, Mikael Pettersson wrote:
> Notice how the problems started exactly at the point the
> "NVRM" NVIDIA module (whatever it is) was loaded ...

That's not the reason. Yeah, I should not have sent a log of a run with the nvidia module loaded, but the same thing happens without it. For the bisection kernels I did not even build the nvidia module and did the testing at the console.

Björn
Re: SATA exceptions with 2.6.20-rc5
Jens Axboe wrote:
> I'd be surprised if the device would not obey the 7 second timeout rule
> that seems to be set in stone and not allow more dirty in-drive cache
> than it could flush out in approximately that time.

AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other commands...

> And BUSY should also be set for that case, as Robert indicates.

Agreed.

Jeff
Re: SATA exceptions with 2.6.20-rc5
Mikael Pettersson wrote:
> Notice how the problems started exactly at the point the
> "NVRM" NVIDIA module (whatever it is) was loaded ...

Yes, that's a bit suspicious...

Jeff
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink writes: > Hi, > > with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite > often, with 2.6.19 there are no such exceptions. dmesg and lspci -v > output follows. In the meantime, I'll start bisecting. > > Thanks > Björn > > > Linux version 2.6.20-rc2 ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 > (prerelease) (Debian 4.1.1-21)) #4 SMP Sun Dec 31 12:54:22 CET 2006 [uneventful kernel log omitted] > sata_nv :00:07.0: Using ADMA mode > PCI: Setting latency timer of device :00:07.0 to 64 > ata1: SATA max UDMA/133 cmd 0xC2004480 ctl 0xC20044A0 bmdma > 0xD400 irq 23 > ata2: SATA max UDMA/133 cmd 0xC2004580 ctl 0xC20045A0 bmdma > 0xD408 irq 23 > scsi0 : sata_nv > ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) > ata1.00: ATA-7, max UDMA/133, 160086528 sectors: LBA > ata1.00: ata1: dev 0 multi count 16 > ata1.00: configured for UDMA/133 > scsi1 : sata_nv > ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300) > ata2.00: ATA-7, max UDMA/133, 160086528 sectors: LBA > ata2.00: ata2: dev 0 multi count 16 > ata2.00: configured for UDMA/133 > scsi 0:0:0:0: Direct-Access ATA Maxtor 6Y080M0 YAR5 PQ: 0 ANSI: 5 > ata1: bounce limit 0x, segment boundary 0x, hw segs > 61 > SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB) > sda: Write Protect is off > sda: Mode Sense: 00 3a 00 00 > SCSI device sda: write cache: enabled, read cache: enabled, doesn't support > DPO or FUA > SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB) > sda: Write Protect is off > sda: Mode Sense: 00 3a 00 00 > SCSI device sda: write cache: enabled, read cache: enabled, doesn't support > DPO or FUA > sda: sda1 sda2 sda3 > sd 0:0:0:0: Attached scsi disk sda > scsi 1:0:0:0: Direct-Access ATA Maxtor 6Y080M0 YAR5 PQ: 0 ANSI: 5 > ata2: bounce limit 0x, segment boundary 0x, hw segs > 61 > SCSI device sdb: 160086528 512-byte hdwr sectors (81964 MB) > sdb: Write Protect is off > sdb: Mode Sense: 00 3a 00 00 > SCSI device sdb: write cache: enabled, read cache: 
enabled, doesn't support > DPO or FUA > SCSI device sdb: 160086528 512-byte hdwr sectors (81964 MB) > sdb: Write Protect is off > sdb: Mode Sense: 00 3a 00 00 > SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support > DPO or FUA > sdb: sdb1 sdb2 sdb3 > sd 1:0:0:0: Attached scsi disk sdb Things are fine so far. [more uneventful kernel log omitted] > NVRM: loading NVIDIA Linux x86_64 Kernel Module 1.0-9631 Thu Nov 9 > 17:35:27 PST 2006 > ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen > ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in > res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > ata1: soft resetting port > ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) > ata1.00: configured for UDMA/133 > ata1: EH complete > SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB) > sda: Write Protect is off > sda: Mode Sense: 00 3a 00 00 > SCSI device sda: write cache: enabled, read cache: enabled, doesn't support > DPO or FUA > ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen > ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 out > res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) and then things start to break. Notice how the problems started exactly at the point the "NVRM" NVIDIA module (whatever it is) was loaded ... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SATA exceptions with 2.6.20-rc5
On Sun, Jan 14 2007, Robert Hancock wrote:
> Jeff Garzik wrote:
> >>Looks like all of these errors are from a FLUSH CACHE command and the
> >>drive is indicating that it is no longer busy, so presumably done.
> >>That's not a DMA-mapped command, so it wouldn't go through the ADMA
> >>machinery and I wouldn't have expected this to be handled any
> >>differently from before. Curious..
> >
> >It's possible the flush-cache command takes longer than 30 seconds, if
> >the cache is large, contents are discontiguous, etc. It's a
> >pathological case, but possible.
> >
> >Or maybe flush-cache doesn't get a 30 second timeout, and it should...?
> > (thinking out loud)
> >
> >Jeff
>
> If the flush was still in progress I would expect Busy to still be set,
> however..

I'd be surprised if the device would not obey the 7 second timeout rule that seems to be set in stone and not allow more dirty in-drive cache than it could flush out in approximately that time. And BUSY should also be set for that case, as Robert indicates.

--
Jens Axboe
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.15 01:34:48 +0100, Björn Steinbrink wrote: > On 2007.01.14 19:22:51 -0500, Jeff Garzik wrote: > > Robert Hancock wrote: > > >Björn Steinbrink wrote: > > >>Hi, > > >> > > >>with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite > > >>often, with 2.6.19 there are no such exceptions. dmesg and lspci -v > > >>output follows. In the meantime, I'll start bisecting. > > > > > >... > > > > > >>ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen > > >>ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in > > >> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > > >>ata1: soft resetting port > > >>ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) > > >>ata1.00: configured for UDMA/133 > > >>ata1: EH complete > > >>SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB) > > >>sda: Write Protect is off > > >>sda: Mode Sense: 00 3a 00 00 > > >>SCSI device sda: write cache: enabled, read cache: enabled, doesn't > > >>support DPO or FUA > > > > > >Looks like all of these errors are from a FLUSH CACHE command and the > > >drive is indicating that it is no longer busy, so presumably done. > > >That's not a DMA-mapped command, so it wouldn't go through the ADMA > > >machinery and I wouldn't have expected this to be handled any > > >differently from before. Curious.. > > > > It's possible the flush-cache command takes longer than 30 seconds, if > > the cache is large, contents are discontiguous, etc. It's a > > pathological case, but possible. > > > > Or maybe flush-cache doesn't get a 30 second timeout, and it should...? > > (thinking out loud) > > Bi-section led to commit 249e83fe839 which makes absolutely no sense to > me, just in case that anyone sees any problem with that commit. > I'll go and re-check a few of those commits that I marked as good. 
Next round of bisecting led to another useless result, a) it was an unrelated driver, b) the kernel I just marked as good after 20 minutes of testing decided to fail when I hit reply... Guess that it was pure luck that the kernel I marked as bad failed within 1-2 minutes. I send the git bisect log with this mail, maybe at least the early good kernel are really good and someone can make some sense out of it. At least I ended up somewhere in a series of libata changes. Thanks, Björn git-bisect start # bad: [8a2d17a56a71c5c796b0a5378ee76a105f21fdd9] Linux 2.6.20-rc2 git-bisect bad 8a2d17a56a71c5c796b0a5378ee76a105f21fdd9 # good: [c3fe6924620fd733ffe8bc8a9da1e9cde08402b3] Linux 2.6.19 git-bisect good c3fe6924620fd733ffe8bc8a9da1e9cde08402b3 # bad: [2685b267bce34c9b66626cb11664509c32a761a5] Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 git-bisect bad 2685b267bce34c9b66626cb11664509c32a761a5 # good: [a985239bdf017e00e985c3a31149d6ae128fdc5f] [POWERPC] cell: spu management xmon routines git-bisect good a985239bdf017e00e985c3a31149d6ae128fdc5f # bad: [33f2ef89f8e181486b63fdbdc97c6afa6ca9f34b] mm: make compound page destructor handling explicit git-bisect bad 33f2ef89f8e181486b63fdbdc97c6afa6ca9f34b # bad: [651857a1ecaf97a8ad9d324dd2a61675c53e541e] Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 git-bisect bad 651857a1ecaf97a8ad9d324dd2a61675c53e541e # bad: [ff51a98799931256b555446b2f5675db08de6229] Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev git-bisect bad ff51a98799931256b555446b2f5675db08de6229 # bad: [3ac551a6a63dcbc707348772a27bd7090b081524] [libata] pata_cs5535: fix build git-bisect bad 3ac551a6a63dcbc707348772a27bd7090b081524 # good: [750426aa1ad1ddd1fa8bb4ed531a7956f3b9a27c] libata: cosmetic changes to sense generation functions git-bisect good 750426aa1ad1ddd1fa8bb4ed531a7956f3b9a27c # good: [62d64ae0ec76360736c9dc4ca2067ae8de0ba9f2] pata : more 
drivers that need only standard suspend and resume git-bisect good 62d64ae0ec76360736c9dc4ca2067ae8de0ba9f2 # good: [6a36261e63770ab61422550b774fe949ccca5fa9] libata: fix READ CAPACITY simulation git-bisect good 6a36261e63770ab61422550b774fe949ccca5fa9 # good: [2432697ba0ce312d60be5009ffe1fa054a761bb9] libata: implement ata_exec_internal_sg() git-bisect good 2432697ba0ce312d60be5009ffe1fa054a761bb9 # good: [70e6ad0c6d1e6cb9ee3c036a85ca2561eb1fd766] libata: prepare ata_sg_clean() for invocation from EH git-bisect good 70e6ad0c6d1e6cb9ee3c036a85ca2561eb1fd766 # good: [8e16f941226f15622fbbc416a1f3d8705001a191] ahci: do not powerdown during initialization git-bisect good 8e16f941226f15622fbbc416a1f3d8705001a191 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SATA exceptions with 2.6.20-rc5
Jeff Garzik wrote:
> > Looks like all of these errors are from a FLUSH CACHE command and the
> > drive is indicating that it is no longer busy, so presumably done.
> > That's not a DMA-mapped command, so it wouldn't go through the ADMA
> > machinery and I wouldn't have expected this to be handled any
> > differently from before. Curious..
>
> It's possible the flush-cache command takes longer than 30 seconds, if
> the cache is large, contents are discontiguous, etc. It's a
> pathological case, but possible.
>
> Or maybe flush-cache doesn't get a 30 second timeout, and it should...?
> (thinking out loud)
>
> Jeff

If the flush was still in progress I would expect Busy to still be set, however..

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.14 19:22:51 -0500, Jeff Garzik wrote: > Robert Hancock wrote: > >Björn Steinbrink wrote: > >>Hi, > >> > >>with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite > >>often, with 2.6.19 there are no such exceptions. dmesg and lspci -v > >>output follows. In the meantime, I'll start bisecting. > > > >... > > > >>ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen > >>ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in > >> res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > >>ata1: soft resetting port > >>ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) > >>ata1.00: configured for UDMA/133 > >>ata1: EH complete > >>SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB) > >>sda: Write Protect is off > >>sda: Mode Sense: 00 3a 00 00 > >>SCSI device sda: write cache: enabled, read cache: enabled, doesn't > >>support DPO or FUA > > > >Looks like all of these errors are from a FLUSH CACHE command and the > >drive is indicating that it is no longer busy, so presumably done. > >That's not a DMA-mapped command, so it wouldn't go through the ADMA > >machinery and I wouldn't have expected this to be handled any > >differently from before. Curious.. > > It's possible the flush-cache command takes longer than 30 seconds, if > the cache is large, contents are discontiguous, etc. It's a > pathological case, but possible. > > Or maybe flush-cache doesn't get a 30 second timeout, and it should...? > (thinking out loud) Bi-section led to commit 249e83fe839 which makes absolutely no sense to me, just in case that anyone sees any problem with that commit. I'll go and re-check a few of those commits that I marked as good. Björn - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SATA exceptions with 2.6.20-rc5
Robert Hancock wrote:
> Björn Steinbrink wrote:
> > Hi,
> >
> > with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite
> > often, with 2.6.19 there are no such exceptions. dmesg and lspci -v
> > output follows. In the meantime, I'll start bisecting.
>
> ...
>
> > ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> > ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in
> >          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> > ata1: soft resetting port
> > ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> > ata1.00: configured for UDMA/133
> > ata1: EH complete
> > SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
> > sda: Write Protect is off
> > sda: Mode Sense: 00 3a 00 00
> > SCSI device sda: write cache: enabled, read cache: enabled, doesn't
> > support DPO or FUA
>
> Looks like all of these errors are from a FLUSH CACHE command and the
> drive is indicating that it is no longer busy, so presumably done.
> That's not a DMA-mapped command, so it wouldn't go through the ADMA
> machinery and I wouldn't have expected this to be handled any
> differently from before. Curious..

It's possible the flush-cache command takes longer than 30 seconds, if the cache is large, contents are discontiguous, etc. It's a pathological case, but possible.

Or maybe flush-cache doesn't get a 30 second timeout, and it should...? (thinking out loud)

Jeff
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
> Hi,
>
> with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite
> often, with 2.6.19 there are no such exceptions. dmesg and lspci -v
> output follows. In the meantime, I'll start bisecting.

...

> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1: soft resetting port
> ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> ata1.00: configured for UDMA/133
> ata1: EH complete
> SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
> sda: Write Protect is off
> sda: Mode Sense: 00 3a 00 00
> SCSI device sda: write cache: enabled, read cache: enabled, doesn't
> support DPO or FUA

Looks like all of these errors are from a FLUSH CACHE command and the drive is indicating that it is no longer busy, so presumably done. That's not a DMA-mapped command, so it wouldn't go through the ADMA machinery and I wouldn't have expected this to be handled any differently from before. Curious..

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
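Robert's reading of the exception can be reproduced by decoding the first byte of the "cmd" and "res" lines. The sketch below assumes that byte is the command opcode and the status register respectively, which is how these libata lines are usually read. The bit names and opcodes are the standard ATA ones; the helper names are made up for this example.

```python
# ATA status register bits (standard ATA definitions).
ATA_STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x08: "DRQ",   # data request
    0x01: "ERR",   # error
}

# A few ATA opcodes; only 0xE7 appears in the quoted log.
ATA_COMMANDS = {
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
}


def first_byte(line, keyword):
    """Return the first hex byte after `keyword` ("cmd" or "res") in an EH log line."""
    field = line.split(keyword, 1)[1].split()[0]  # e.g. "e7/00:00:00:..."
    return int(field.split("/", 1)[0], 16)


def decode_status(status):
    """List the names of the status bits that are set."""
    return [name for bit, name in ATA_STATUS_BITS.items() if status & bit]


cmd_line = "ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in"
res_line = "res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)"

opcode = first_byte(cmd_line, "cmd")
status = first_byte(res_line, "res")
print(ATA_COMMANDS.get(opcode, hex(opcode)), decode_status(status))
```

Opcode 0xe7 decodes to FLUSH CACHE and status 0x40 has only DRDY set, with BSY clear, which matches the observation that the drive no longer reports itself busy even though the command timed out.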
SATA exceptions with 2.6.20-rc5
Hi, with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite often, with 2.6.19 there are no such exceptions. dmesg and lspci -v output follows. In the meantime, I'll start bisecting. Thanks Björn Linux version 2.6.20-rc2 ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #4 SMP Sun Dec 31 12:54:22 CET 2006 Command line: root=/dev/md0 ro quiet BIOS-provided physical RAM map: BIOS-e820: - 0009f000 (usable) BIOS-e820: 0009f000 - 000a (reserved) BIOS-e820: 000f - 0010 (reserved) BIOS-e820: 0010 - 7fee (usable) BIOS-e820: 7fee - 7fee3000 (ACPI NVS) BIOS-e820: 7fee3000 - 7fef (ACPI data) BIOS-e820: 7fef - 7ff0 (reserved) BIOS-e820: e000 - f000 (reserved) BIOS-e820: fec0 - 0001 (reserved) Entering add_active_range(0, 0, 159) 0 entries of 256 used Entering add_active_range(0, 256, 524000) 1 entries of 256 used end_pfn_map = 1048576 DMI 2.2 present. ACPI: RSDP (v000 Nvidia) @ 0x000f7a70 ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fee3040 ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fee30c0 ACPI: SSDT (v001 PTLTD POWERNOW 0x0001 LTP 0x0001) @ 0x7fee9540 ACPI: SRAT (v001 AMDHAMMER 0x0001 AMD 0x0001) @ 0x7fee9780 ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fee9480 ACPI: DSDT (v001 NVIDIA AWRDACPI 0x1000 MSFT 0x010e) @ 0x Entering add_active_range(0, 0, 159) 0 entries of 256 used Entering add_active_range(0, 256, 524000) 1 entries of 256 used Zone PFN ranges: DMA 0 -> 4096 DMA324096 -> 1048576 Normal1048576 -> 1048576 early_node_map[2] active PFN ranges 0:0 -> 159 0: 256 -> 524000 On node 0 totalpages: 523903 DMA zone: 56 pages used for memmap DMA zone: 1122 pages reserved DMA zone: 2821 pages, LIFO batch:0 DMA32 zone: 7108 pages used for memmap DMA32 zone: 512796 pages, LIFO batch:31 Normal zone: 0 pages used for memmap Nvidia board detected. Ignoring ACPI timer override. 
If you got timer trouble try acpi_use_timer_override ACPI: PM-Timer IO Port: 0x1008 ACPI: Local APIC address 0xfee0 ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled) Processor #0 (Bootup-CPU) ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled) Processor #1 ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1]) ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0]) IOAPIC[0]: apic_id 2, address 0xfec0, GSI 0-23 ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) ACPI: INT_SRC_OVR (bus 0 bus_irq 14 global_irq 14 high edge) ACPI: INT_SRC_OVR (bus 0 bus_irq 15 global_irq 15 high edge) ACPI: IRQ9 used by override. ACPI: IRQ14 used by override. ACPI: IRQ15 used by override. Setting APIC routing to flat Using ACPI (MADT) for SMP configuration information Nosave address range: 0009f000 - 000a Nosave address range: 000a - 000f Nosave address range: 000f - 0010 Allocating PCI resources starting at 8000 (gap: 7ff0:6010) PERCPU: Allocating 32000 bytes of per cpu data Built 1 zonelists. Total pages: 515617 Kernel command line: root=/dev/md0 ro quiet Initializing CPU#0 PID hash table entries: 4096 (order: 12, 32768 bytes) Console: colour VGA+ 80x25 Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes) Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes) Checking aperture... CPU 0: aperture @ 5a7400 size 32 MB Aperture too small (32 MB) No AGP bridge found Memory: 2059000k/2096000k available (2795k kernel code, 36392k reserved, 1089k data, 224k init) Calibrating delay using timer specific routine.. 4422.42 BogoMIPS (lpj=22112114) Mount-cache hash table entries: 256 CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 1024K (64 bytes/line) CPU: Physical Processor ID: 0 CPU: Processor Core ID: 0 Freeing SMP alternatives: 28k freed ACPI: Core revision 20060707 Using local APIC timer interrupts. result 12558084 Detected 12.558 MHz APIC timer. 
Booting processor 1/2 APIC 0x1 Initializing CPU#1 Calibrating delay using timer specific routine.. 4420.44 BogoMIPS (lpj=22102213) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 1024K (64 bytes/line) CPU: Physical Processor ID: 0 CPU: Processor Core ID: 1 AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ stepping 02 CPU 1: Syncing TSC to CPU 0. CPU 1: synchronized TSC with CPU 0 (last diff 89 cycles, maxerr 393 cycles) Brought up 2 CPUs testing NMI watchdog ... OK. Disabling vsyscall due to use of PM timer time.c: Using 3.579545 MHz WALL PM GTOD PM timer. time.c: Detec