Re: libata error handling
On 1/7/07, Kasper Sandberg <[EMAIL PROTECTED]> wrote:
> On Sat, 2007-01-06 at 20:28 +0100, Bartlomiej Zolnierkiewicz wrote:
> > On 1/6/07, Kasper Sandberg <[EMAIL PROTECTED]> wrote:
> > > On Sat, 2007-01-06 at 13:01 -0600, Robert Hancock wrote:
> > > > Kasper Sandberg wrote:
> > > > > On Sat, 2007-01-06 at 12:21 -0600, Robert Hancock wrote:
> > > > >> Kasper Sandberg wrote:
> > > > >>> i have heard that libata has much better error handling (this is what
> > > > >>> made me try it), and from initial observations, that appears to be very
> > > > >>> true, however, im wondering, is there something i can do to get
> > > > >>> extremely verbose information from libata? for example if it corrects
> > > > >>> errors? cause i'd really like to know if it still happens, and if i
> > > > >>> perhaps get corruption as before, even though not severe.
> > > > >> Any errors, timeouts or retries would be showing up in dmesg..
> > > > > how sure can i be of this? is it 100% sure that i have not encountered
> > > > > this error then?
> > > >
> > > > Pretty sure, I'm quite certain libata never does any silent error recovery..
> >
> > AFAIR this is true
> > (at least it was last time that I've looked at libata eh code)
> >
> > > okay, i suppose i face two possibilities then:
> > > 1: libata drivers are simply better, and the error does not occur
> > > because of driver bugs in the old ide drivers
> >
> > very likely however pdc202xx_new bugs should be fixed in 2.6.20-rc3
> > (as it contains a lot of bugfixes for this driver from Sergei Shtylyov)
>
> these fixes are also in the libata driver?

some were backported directly from libata driver and few were
pdc202xx_new specific so probably pata_pdc2027x is also fine

> > > 2: it hasnt happened to me on libata yet (though this is also abit
> > > weird, as it has now ran far longer than were previously required to hit
> > > the errors)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: libata error handling
On Sat, 2007-01-06 at 20:28 +0100, Bartlomiej Zolnierkiewicz wrote:
> On 1/6/07, Kasper Sandberg <[EMAIL PROTECTED]> wrote:
> > On Sat, 2007-01-06 at 13:01 -0600, Robert Hancock wrote:
> > > Kasper Sandberg wrote:
> > > > On Sat, 2007-01-06 at 12:21 -0600, Robert Hancock wrote:
> > > >> Kasper Sandberg wrote:
> > > >>> i have heard that libata has much better error handling (this is what
> > > >>> made me try it), and from initial observations, that appears to be very
> > > >>> true, however, im wondering, is there something i can do to get
> > > >>> extremely verbose information from libata? for example if it corrects
> > > >>> errors? cause i'd really like to know if it still happens, and if i
> > > >>> perhaps get corruption as before, even though not severe.
> > > >> Any errors, timeouts or retries would be showing up in dmesg..
> > > > how sure can i be of this? is it 100% sure that i have not encountered
> > > > this error then?
> > >
> > > Pretty sure, I'm quite certain libata never does any silent error recovery..
>
> AFAIR this is true
> (at least it was last time that I've looked at libata eh code)
>
> > okay, i suppose i face two possibilities then:
> > 1: libata drivers are simply better, and the error does not occur
> > because of driver bugs in the old ide drivers
>
> very likely however pdc202xx_new bugs should be fixed in 2.6.20-rc3
> (as it contains a lot of bugfixes for this driver from Sergei Shtylyov)

these fixes are also in the libata driver?

> > 2: it hasnt happened to me on libata yet (though this is also abit
> > weird, as it has now ran far longer than were previously required to hit
> > the errors)
Re: libata error handling
On 1/6/07, Kasper Sandberg <[EMAIL PROTECTED]> wrote:
> On Sat, 2007-01-06 at 13:01 -0600, Robert Hancock wrote:
> > Kasper Sandberg wrote:
> > > On Sat, 2007-01-06 at 12:21 -0600, Robert Hancock wrote:
> > >> Kasper Sandberg wrote:
> > >>> i have heard that libata has much better error handling (this is what
> > >>> made me try it), and from initial observations, that appears to be very
> > >>> true, however, im wondering, is there something i can do to get
> > >>> extremely verbose information from libata? for example if it corrects
> > >>> errors? cause i'd really like to know if it still happens, and if i
> > >>> perhaps get corruption as before, even though not severe.
> > >> Any errors, timeouts or retries would be showing up in dmesg..
> > > how sure can i be of this? is it 100% sure that i have not encountered
> > > this error then?
> >
> > Pretty sure, I'm quite certain libata never does any silent error recovery..

AFAIR this is true
(at least it was last time that I've looked at libata eh code)

> okay, i suppose i face two possibilities then:
> 1: libata drivers are simply better, and the error does not occur
> because of driver bugs in the old ide drivers

very likely however pdc202xx_new bugs should be fixed in 2.6.20-rc3
(as it contains a lot of bugfixes for this driver from Sergei Shtylyov)

> 2: it hasnt happened to me on libata yet (though this is also abit
> weird, as it has now ran far longer than were previously required to hit
> the errors)
Re: libata error handling
On Sat, 2007-01-06 at 13:01 -0600, Robert Hancock wrote:
> Kasper Sandberg wrote:
> > On Sat, 2007-01-06 at 12:21 -0600, Robert Hancock wrote:
> >> Kasper Sandberg wrote:
> >>> i have heard that libata has much better error handling (this is what
> >>> made me try it), and from initial observations, that appears to be very
> >>> true, however, im wondering, is there something i can do to get
> >>> extremely verbose information from libata? for example if it corrects
> >>> errors? cause i'd really like to know if it still happens, and if i
> >>> perhaps get corruption as before, even though not severe.
> >> Any errors, timeouts or retries would be showing up in dmesg..
> > how sure can i be of this? is it 100% sure that i have not encountered
> > this error then?
>
> Pretty sure, I'm quite certain libata never does any silent error recovery..

okay, i suppose i face two possibilities then:
1: libata drivers are simply better, and the error does not occur
because of driver bugs in the old ide drivers
2: it hasnt happened to me on libata yet (though this is also abit
weird, as it has now ran far longer than were previously required to hit
the errors)
Re: libata error handling
Kasper Sandberg wrote:
> On Sat, 2007-01-06 at 12:21 -0600, Robert Hancock wrote:
>> Kasper Sandberg wrote:
>>> i have heard that libata has much better error handling (this is what
>>> made me try it), and from initial observations, that appears to be very
>>> true, however, im wondering, is there something i can do to get
>>> extremely verbose information from libata? for example if it corrects
>>> errors? cause i'd really like to know if it still happens, and if i
>>> perhaps get corruption as before, even though not severe.
>> Any errors, timeouts or retries would be showing up in dmesg..
> how sure can i be of this? is it 100% sure that i have not encountered
> this error then?

Pretty sure, I'm quite certain libata never does any silent error recovery..

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: libata error handling
On Sat, 2007-01-06 at 12:21 -0600, Robert Hancock wrote:
> Kasper Sandberg wrote:
> > i have heard that libata has much better error handling (this is what
> > made me try it), and from initial observations, that appears to be very
> > true, however, im wondering, is there something i can do to get
> > extremely verbose information from libata? for example if it corrects
> > errors? cause i'd really like to know if it still happens, and if i
> > perhaps get corruption as before, even though not severe.
>
> Any errors, timeouts or retries would be showing up in dmesg..

how sure can i be of this? is it 100% sure that i have not encountered
this error then?
Re: libata error handling
Kasper Sandberg wrote:
> i have heard that libata has much better error handling (this is what
> made me try it), and from initial observations, that appears to be very
> true, however, im wondering, is there something i can do to get
> extremely verbose information from libata? for example if it corrects
> errors? cause i'd really like to know if it still happens, and if i
> perhaps get corruption as before, even though not severe.

Any errors, timeouts or retries would be showing up in dmesg..

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
libata error handling
Hello.

i have a question in regards to libata's error handling, specifically
with pata drivers.

ill start by explaining something that happens to me using the normal
ide drivers (via ide and pdc202 new)

this is what i get when it has been used for a while:
hde: dma_intr: bad DMA status (dma_stat=75)
hde: dma_intr: status=0x50 { DriveReady SeekComplete }
ide: failed opcode was: unknown
hde: dma_timer_expiry: dma status == 0x60
hde: DMA timeout retry
PDC202XX: Primary channel reset.
hde: timeout waiting for DMA

its ALWAYS hde, and its on the promise controller, i attempted to
replace the promise controller by other controllers, but i got the same
error. i have tried replacing cables too, and swapping around
harddrives, its ALWAYS the last harddrive that gets me this.

after this, my raid (6x300gb drives in raid5) would go nuts, as if the
data was there, but skewed, so i got it all from an offset.

this has been going on since always on this box, from .15 to .17, but
now i updated to .20-rc3-git4, and went over to the pata-on-libata
drivers, where i think this has stopped, or atleast, its not causing
WEIRD errors anymore, i have observed some stalls, but im not sure this
is due to it doing this, or simply syncing. i get no messages like this
from the kernel anymore.

i have heard that libata has much better error handling (this is what
made me try it), and from initial observations, that appears to be very
true, however, im wondering, is there something i can do to get
extremely verbose information from libata? for example if it corrects
errors? cause i'd really like to know if it still happens, and if i
perhaps get corruption as before, even though not severe.

Regards,
Kasper Sandberg
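One way to check whether these errors still show up is to scan the kernel log for the messages quoted in the report above. The following is a minimal sketch, not a libata tool: the function name `find_dma_errors` is made up for illustration, and the patterns cover only the old IDE driver messages from this report. It assumes each log line starts with the device name; a log with timestamp prefixes would need the pattern adjusted.

```python
import re

# Patterns taken from the log excerpt quoted in the report; other kernels
# or drivers may word these messages differently (assumption).
IDE_DMA_ERROR = re.compile(
    r"^(?P<dev>hd[a-z]+): "
    r"(?P<msg>dma_intr: bad DMA status|dma_timer_expiry|"
    r"DMA timeout retry|timeout waiting for DMA)"
)


def find_dma_errors(log_text):
    """Return (device, message) pairs for each matching line in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = IDE_DMA_ERROR.match(line.strip())
        if match:
            hits.append((match.group("dev"), match.group("msg")))
    return hits
```

Feeding it the saved output of dmesg gives a quick list of which drive logged DMA trouble; an empty list only means none of these specific messages were found.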
Re: libata error handling
On 08/19/05 17:10, Patrick Mansfield wrote:
> Luben -
>
> On Fri, Aug 19, 2005 at 04:43:41PM -0400, Luben Tuikov wrote:
>
>>On 08/19/05 16:11, Patrick Mansfield wrote:
>
>>>I was changing it to wakeup the eh even while other IO is outstanding, so
>>>the eh can wakeup and cancel individual commands while other IO is still
>>>using the HBA.
>>
>>Hmm, if you want to do this, then SCSI Core needs to know about:
>> - Domain,
>> - Domain device and
>> - LU.
>
> Not really, scsi core is just asking the LLDD to cancel or release the scmd.

Hi Patrick,

If you want to call any kind of eh, *without* stalling IO to the host,
you have to know the context: LU or Domain Device or Domain, in order
to stall IO to the context object only.

> That is really all we do in the eh today, and then if the LLDD can't
> cancel the scmd, we take other sometimes less than useful steps.

But remember, eh_timed_out and eh_strategy_handler point to functions
in the driver, so they can do more or less, depending.

BTW, you may be able to implement an _instantaneous_ canceling of a
command in the LLDD from any context, but this would bloat the LLDD so
much, that it is not a viable solution, plus it _cannot_ always be done
(physically).

This is why eh_timed_out + eh_strategy_handler is a best current bet
for transport LLDD. (Given the current SCSI Core infrastructure.)

As far as the upper layers are concerned:
 - the command is either finished with status,
 - or the LU or the Device or the Domain had problems.

The above list is exhaustive, because if the task was never able to be
sent via the service delivery subsystem you'd get a response, but the
task will not "sit" in the LLDD.

Most importantly to remember is that a timed out command _cannot_ be
returned instantaneously to SCSI Core, unless it is "complete" in the
SAM sense of the meaning. If the task is out on the domain, the
respective eh needs to be invoked.

> The LLDD could start any error handling scheme it wants, independent of
> scsi core action.

True, but this will bloat LLDD very much, and all you want of a
transport LLDD is access to the delivery subsystem as described in SAM.

> We don't initiate error handling in scsi core for other error cases, why
> should a timeout be any different?

Think of eh_timed_out as a replacement of the transport timeout plus
some delta. Even if you had implemented it internally, which you don't
need to, you'd still get a timer to fire off and you'd be in interrupt
context.

As I mentioned earlier, if you've properly set the timeout value at
slave_configure, you shouldn't be getting SCSI Core calling your
eh_timed_out _unless_ the transport time out has kicked in, which *you*
set at slave configure time.

If, OTOH, SCSI Core _did_ call your eh_timed_out routine, this means
that the timeout which *you* set has kicked in and action needs to be
taken. At that time, as per protocol, your task is either complete or
you need to go and find it and take the appropriate action.

>>The reason, is that you do not know why a task timed out.
>>Is it the LU, is it the device, is it the domain?
>
> Right, so in scsi core allow a simple method that can cancel commands
> while the HBA is still in use.

Yes, you can do this. But in order to do this you'll need to know
_why_ the command failed. Is it the LU or is it the Device or is it
the Domain. The only one who can tell you this is the LLDD who talks
to the transport.

*Unless*, the LLDD exports SAM TMFs and SCSI Core knows what to call
and how, which is currently not the case.

>>>So, for EH_NOT_HANDLED, do you add the scmd to a LLDD list in your
>>>eh_timed_out, then wait for the eh to run?
>>
>>No, no Patrick, I don't. The SCSI Core does this for me, and then
>>calls my eh_strategy routine and all the commands are on the list.
>
> Oh right ... I was not thinking straight.
>
> But I don't see how that gains much, if you sometimes still wait for scsi
> core to quiesce IO and wakeup the eh.

Yes, you're completely right, the SCSI Core eh infrastructure is
incomplete. (I never mentioned I was gaining anything, I was merely
working around the current infrastructure. ;-) )

With the current infrastructure you cannot fine grain stalling IO to
the different storage objects, simply because SCSI Core has _no
representation for them_.

	Luben
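The point about needing the context (LU, Domain Device or Domain) to stall only part of the IO can be illustrated with a toy model. Everything here is hypothetical: the `Target` tuple and the `is_stalled` function are invented for illustration and correspond to nothing in SCSI Core, which, as the message above says, has no representation for these objects. The "host" scope stands for the current behaviour of stalling all IO to the host.

```python
from collections import namedtuple

# Hypothetical address of a command's target, invented for this sketch.
Target = namedtuple("Target", ["domain", "device", "lu"])


def is_stalled(target, failed, scope):
    """Return True if IO to `target` must wait while `failed` is recovered.

    scope is one of "lu", "device", "domain" or "host".
    """
    if scope == "host":
        return True  # stall everything on the host, whatever the target
    if target.domain != failed.domain:
        return False
    if scope == "domain":
        return True
    if target.device != failed.device:
        return False
    if scope == "device":
        return True
    if scope == "lu":
        return target.lu == failed.lu
    raise ValueError(f"unknown scope: {scope!r}")
```

With a failure scoped to one LU, a command for a different LU on the same device is not stalled; with the "host" scope every command waits, which is the cost being argued about in this thread.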
Re: libata error handling
Luben -

On Fri, Aug 19, 2005 at 04:43:41PM -0400, Luben Tuikov wrote:
> On 08/19/05 16:11, Patrick Mansfield wrote:

> > I was changing it to wakeup the eh even while other IO is outstanding, so
> > the eh can wakeup and cancel individual commands while other IO is still
> > using the HBA.
>
> Hmm, if you want to do this, then SCSI Core needs to know about:
> - Domain,
> - Domain device and
> - LU.

Not really, scsi core is just asking the LLDD to cancel or release the scmd.
That is really all we do in the eh today, and then if the LLDD can't
cancel the scmd, we take other sometimes less than useful steps.

The LLDD could start any error handling scheme it wants, independent of
scsi core action.

We don't initiate error handling in scsi core for other error cases, why
should a timeout be any different?

> The reason, is that you do not know why a task timed out.
> Is it the LU, is it the device, is it the domain?

Right, so in scsi core allow a simple method that can cancel commands
while the HBA is still in use.

> (Those are concepts talked about in SAM.)
>
> Since currently, SCSI Core has no clue about those concepts,
> the current infrastructure, stalling IO to the host on eh,
> satisfies.
>
> > So, for EH_NOT_HANDLED, do you add the scmd to a LLDD list in your
> > eh_timed_out, then wait for the eh to run?
>
> No, no Patrick, I don't. The SCSI Core does this for me, and then
> calls my eh_strategy routine and all the commands are on the list.

Oh right ... I was not thinking straight.

But I don't see how that gains much, if you sometimes still wait for scsi
core to quiesce IO and wakeup the eh.

-- Patrick Mansfield
Re: libata error handling
On 08/19/05 16:29, Mike Anderson wrote:
> Luben Tuikov <[EMAIL PROTECTED]> wrote:
>>Consider this: When SCSI Core told you that the command timed out,
>> A) it has already finished,
>> B) it hasn't already finished.
>>
>>In case A, you can return EH_HANDLED. In case B, you return
>>EH_NOT_HANDLED, and deal with it in the eh_strategy_handler.
>>(Hint: you can still "finish" it from there.)
>>
>
> But dealing with it in the eh_strategy_handler means that you may be
> stopping all IO on the host instance as the first lun returns
> EH_NOT_HANDLED for LUN based canceling.

Hi Mike, how are you?

Yes, this is true. See my email to Patrick.

> I still think we can do better here for an LLDD that cannot execute a
> cancel in interrupt context.

This is the key! Think about this:
You do not need to cancel a command to cancel a command. ;-)

> Having a error handler that works is a plus, I would hope that
> some factoring would happen over time from the eh_strategy_handler to
> some transport (or other factor point) error handler. I would think from a
> testing, support, and block level multipath predictability sharing code
> would be a good goal.

Yes, definitely. Hopefully I'll be posting code soon.

	Luben
Re: libata error handling
On 08/19/05 16:11, Patrick Mansfield wrote:
> On Fri, Aug 19, 2005 at 04:03:15PM -0400, Luben Tuikov wrote:
>>The eh_timed_out + eh_strategy_handler is actually pretty perfect,
>>and _complete_, for any application and purpose in recovering a
>
> One other point: Another problems is that we quiesce all shost IO before
> waking up the eh.

Yes, this is true.

> I was changing it to wakeup the eh even while other IO is outstanding, so
> the eh can wakeup and cancel individual commands while other IO is still
> using the HBA.

Hmm, if you want to do this, then SCSI Core needs to know about:
 - Domain,
 - Domain device and
 - LU.

The reason, is that you do not know why a task timed out.
Is it the LU, is it the device, is it the domain?
(Those are concepts talked about in SAM.)

Since currently, SCSI Core has no clue about those concepts,
the current infrastructure, stalling IO to the host on eh,
satisfies.

> So, for EH_NOT_HANDLED, do you add the scmd to a LLDD list in your
> eh_timed_out, then wait for the eh to run?

No, no Patrick, I don't. The SCSI Core does this for me, and then
calls my eh_strategy routine and all the commands are on the list.

> Or maybe your host can_queue is 1 :)

No, it is actually pretty huge for a controller, and have to more than
halve it and give that to SCSI Core.

> I don't see it ... hence my question above.

Hmm, let me know if I'm missing something out.

	Luben
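The quiescing behaviour discussed above (all host IO is drained before the eh is woken) can be pictured with a small counter model. This is a sketch under stated assumptions, not kernel code: the `ToyHost` class and its fields are invented for illustration, and the rule that no new IO is issued once a command has failed is an assumption of the model rather than something spelled out in the thread.

```python
class ToyHost:
    """Hypothetical host model: counts outstanding and failed commands."""

    def __init__(self):
        self.busy = 0        # commands issued and not yet returned
        self.failed = 0      # commands parked for error handling
        self.eh_woken = False

    def issue(self):
        """Try to start a command; refuse while recovery is pending."""
        if self.failed:
            return False     # model assumption: no new IO once one failed
        self.busy += 1
        return True

    def complete(self):
        """A command finished normally."""
        self.busy -= 1
        self._maybe_wake_eh()

    def fail(self):
        """A command timed out and was handed to error handling."""
        self.failed += 1     # a failed command still counts as busy
        self._maybe_wake_eh()

    def _maybe_wake_eh(self):
        # The eh wakes only once every outstanding command is a failed one,
        # i.e. all other IO on the host has drained.
        if self.failed and self.busy == self.failed:
            self.eh_woken = True
```

In this model one timed-out command among three outstanding ones leaves the eh asleep until the other two complete, which is the delay Patrick wants to avoid by waking the eh while other IO is still in flight.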
Re: libata error handling
Luben Tuikov <[EMAIL PROTECTED]> wrote:
> On 08/19/05 15:38, Patrick Mansfield wrote:
> The eh_timed_out + eh_strategy_handler is actually pretty perfect,
> and _complete_, for any application and purpose in recovering a
> LU/device/host (in that order ;-) ).
>
> > The two problems I see with the hook are:
> >
> > It calls the driver in interrupt context, so the called function can't
> > sleep.
>
> Consider this: When SCSI Core told you that the command timed out,
> A) it has already finished,
> B) it hasn't already finished.
>
> In case A, you can return EH_HANDLED. In case B, you return
> EH_NOT_HANDLED, and deal with it in the eh_strategy_handler.
> (Hint: you can still "finish" it from there.)

But dealing with it in the eh_strategy_handler means that you may be
stopping all IO on the host instance as the first lun returns
EH_NOT_HANDLED for LUN based canceling.

I still think we can do better here for an LLDD that cannot execute a
cancel in interrupt context.

Having an error handler that works is a plus, I would hope that
some factoring would happen over time from the eh_strategy_handler to
some transport (or other factor point) error handler. I would think from a
testing, support, and block level multipath predictability sharing code
would be a good goal.

-andmike
--
Michael Anderson
[EMAIL PROTECTED]
Re: libata error handling
On Fri, Aug 19, 2005 at 04:03:15PM -0400, Luben Tuikov wrote:
> On 08/19/05 15:38, Patrick Mansfield wrote:
> > On Fri, Aug 19, 2005 at 02:46:35PM -0400, Luben Tuikov wrote:
> >
> >>Using the command time out hook and the strategy routine, gives _complete_
> >>control over host recovery, and I really do mean _complete_.
> >
> > I assume you mean hostt->eh_timed_out.
>
> Hi Patrick, how are you?

Good thanks :)

> > I was looking at using it in an LLDD, but hit two problems, and have
> > started to work on an alternate approach of cancelling (aborting or wtf you
> > want to call it) a list of commands in the eh thread.
>
> The eh_timed_out + eh_strategy_handler is actually pretty perfect,
> and _complete_, for any application and purpose in recovering a

One other point: Another problem is that we quiesce all shost IO before
waking up the eh.

I was changing it to wakeup the eh even while other IO is outstanding, so
the eh can wakeup and cancel individual commands while other IO is still
using the HBA.

> > The two problems I see with the hook are:
> >
> > It calls the driver in interrupt context, so the called function can't
> > sleep.
>
> Consider this: When SCSI Core told you that the command timed out,
> A) it has already finished,
> B) it hasn't already finished.
>
> In case A, you can return EH_HANDLED. In case B, you return
> EH_NOT_HANDLED, and deal with it in the eh_strategy_handler.

So, for EH_NOT_HANDLED, do you add the scmd to a LLDD list in your
eh_timed_out, then wait for the eh to run?

Or maybe your host can_queue is 1 :)

> (Hint: you can still "finish" it from there.)
>
> EH_RESET_TIMER is not really needed provided that
> - your interface infrastructure is in place,
> - you set the timeout value properly in slave_configure.
>
> > There is no queueing or list mechanism, so LLDD's that can only cancel one
> > command at a time will have problems.
>
> See above.

I don't see it ... hence my question above.

-- Patrick Mansfield
Re: libata error handling
On 08/19/05 15:38, Patrick Mansfield wrote:
> On Fri, Aug 19, 2005 at 02:46:35PM -0400, Luben Tuikov wrote:
>
>>Using the command time out hook and the strategy routine, gives _complete_
>>control over host recovery, and I really do mean _complete_.
>
> I assume you mean hostt->eh_timed_out.

Hi Patrick, how are you?

Yes, this is what I meant, sorry for not being clear.

> Is anyone implementing (or has implemented) a ->eh_timed_out function? I see
> none in mainline kernel.

Yes, I have.

> I was looking at using it in an LLDD, but hit two problems, and have
> started to work on an alternate approach of cancelling (aborting or wtf you
> want to call it) a list of commands in the eh thread.

The eh_timed_out + eh_strategy_handler is actually pretty perfect,
and _complete_, for any application and purpose in recovering a
LU/device/host (in that order ;-) ).

> The two problems I see with the hook are:
>
> It calls the driver in interrupt context, so the called function can't
> sleep.

Consider this: When SCSI Core told you that the command timed out,
A) it has already finished,
B) it hasn't already finished.

In case A, you can return EH_HANDLED. In case B, you return
EH_NOT_HANDLED, and deal with it in the eh_strategy_handler.
(Hint: you can still "finish" it from there.)

EH_RESET_TIMER is not really needed provided that
- your interface infrastructure is in place,
- you set the timeout value properly in slave_configure.

> There is no queueing or list mechanism, so LLDD's that can only cancel one
> command at a time will have problems.

See above.

Luben
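The case A / case B split described in this message can be modelled in plain user-space C. What follows is only a sketch under stated assumptions: the enum reuses the return-code names from the thread but is declared locally, and `struct demo_cmd`, `demo_eh_timed_out`, `demo_core_timeout` and `demo_eh_strategy_handler` are hypothetical stand-ins, not the kernel's types or functions. It shows the timeout hook making a quick, non-sleeping decision, a stand-in for the mid-layer queueing whatever was not handled, and the strategy handler finishing those commands later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins named after the return codes discussed in the thread. */
enum eh_timer_return { EH_NOT_HANDLED, EH_HANDLED, EH_RESET_TIMER };

/* Hypothetical per-command state kept by a driver. */
struct demo_cmd {
    bool completed;         /* case A: the hardware already finished it */
    bool recovered;         /* set once the strategy handler dealt with it */
    struct demo_cmd *next;  /* link on the pending-recovery list */
};

/* Timeout hook: models interrupt context, so it only inspects state. */
enum eh_timer_return demo_eh_timed_out(const struct demo_cmd *cmd)
{
    assert(cmd != NULL);
    return cmd->completed ? EH_HANDLED : EH_NOT_HANDLED;
}

/* Stand-in for the mid-layer: it, not the driver, queues unhandled commands. */
void demo_core_timeout(struct demo_cmd *cmd, struct demo_cmd **eh_list)
{
    if (demo_eh_timed_out(cmd) == EH_NOT_HANDLED) {
        cmd->next = *eh_list;
        *eh_list = cmd;
    }
}

/* Strategy handler: models thread context, where sleeping would be allowed.
 * Drains the list and returns the number of commands it finished. */
int demo_eh_strategy_handler(struct demo_cmd **eh_list)
{
    int finished = 0;

    while (*eh_list != NULL) {
        struct demo_cmd *cmd = *eh_list;

        *eh_list = cmd->next;
        cmd->next = NULL;
        cmd->recovered = true;  /* case B: "finish" the command from here */
        finished++;
    }
    return finished;
}
```

In this model the driver never builds its own list in the hook, which matches the point made later in the thread that the commands are already on the list when the strategy routine is called.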
Re: libata error handling
On Fri, Aug 19, 2005 at 02:46:35PM -0400, Luben Tuikov wrote:
>
> Using the command time out hook and the strategy routine, gives _complete_
> control over host recovery, and I really do mean _complete_.

I assume you mean hostt->eh_timed_out.

Is anyone implementing (or has implemented) a ->eh_timed_out function? I see
none in mainline kernel.

I was looking at using it in an LLDD, but hit two problems, and have
started to work on an alternate approach of cancelling (aborting or wtf you
want to call it) a list of commands in the eh thread.

The two problems I see with the hook are:

It calls the driver in interrupt context, so the called function can't
sleep.

There is no queueing or list mechanism, so LLDD's that can only cancel one
command at a time will have problems.

-- Patrick Mansfield
Re: libata error handling
On 08/19/05 01:40, Tejun Heo wrote:
> I generally agree that the events are somewhat standard for block
> devices but IMHO SCSI EH also has a fair amount of SCSI-specific assumptions
> and ATA is a bit too different from SCSI to fit cleanly into it. For
> example, when handling NCQ errors, the whole task set is aborted and the
> status is retrieved with read log page. This can be worked around in
> one of the hooks and emulate SCSI behavior, but it just doesn't really
> fit well. And I think that recovering via translation layer is a bit
> too much translation.
>
> So, my thought is that SCSI EH assumptions are a bit too specific to
> be used as standard for block devices.

Ok, so everyone seems to agree on this.

> It's true that we must do SCSI specific tasks inside libata if we use
> eh_strategy_handler but I don't think switching to fine-grained EH will
> reduce the amount of SCSI-specific things inside libata. I think as
> long as we can insulate LLDD's from SCSI layer, either way should be
> okay later.

True, this is the goal. Separation between device management and
how that device got to you is the future and should be a goal.

> I agree that being the only user does incur difficulties, but my very
> subjective feeling is that the original libata EH implementation was
> just a bit too fragile to start with. eg. not grabbing host lock on EH
> entrance causing command completion vs. EH handling race and handling
> errors in several different ways.
>
> Heh... Maybe I'm just reluctant to let go of my patches. Anyways,
> I'll now stand down and see how things go and try to help.

Please don't do that. One thing everyone in the Linux community knows
is that Linux-SCSI needs fresh minds and fresh ideas. Especially from
knowledgeable folks in the storage protocols and standards.

Luben
Re: libata error handling
On 08/18/05 23:49, Jeff Garzik wrote:
> 1) The fine-grained hooks of the SCSI layer are somewhat standard for
> block devices. The events they signify -- timeout, abort cmd, dev
> reset, bus reset, and host reset -- map precisely to the events that we
> must deal with at the ATA level.

"dev reset, bus reset" -- nonexistent, as I'm sure you're aware,
depending on what _transport_ you use. ;-)

> 2) When libata SAT translation layer becomes optional, and libata drives
> a "true" block device,

Yes, this will be very cool! (When (S)ATA(PI) devices become true
block devices.)

> use of ->eh_strategy_handler() will actually be
> an obstacle due to false sharing of code paths. ->eh_strategy_handler()

I fully agree.

> is indeed a single "do it all" EH entrypoint, but within that entrypoint
> you must perform several SCSI-specific tasks.
>
> 3) ->eh_strategy_handler() has continually proven to be a method of
> error handling poorly supported by the SCSI layer. There are many
> assumptions coded into the SCSI layer that this is -not- the path taken
> by LLD EH code, and libata must constantly work around these assumptions.

I agree.

> 4) libata is the -only- user of ->eh_strategy_handler(), and oddballs

Not any more ;-)

Using the command time out hook and the strategy routine, gives _complete_
control over host recovery, and I really do mean _complete_.

Luben

> must be stomped out. It creates a maintenance burden on the SCSI layer
> that should be eliminated.
Re: libata error handling
Luben -

On Fri, Aug 19, 2005 at 04:43:41PM -0400, Luben Tuikov wrote:
> On 08/19/05 16:11, Patrick Mansfield wrote:
> > I was changing it to wakeup the eh even while other IO is outstanding, so
> > the eh can wakeup and cancel individual commands while other IO is still
> > using the HBA.
>
> Hmm, if you want to do this, then SCSI Core needs to know about:
> - Domain,
> - Domain device and
> - LU.

Not really, scsi core is just asking the LLDD to cancel or release the
scmd.

That is really all we do in the eh today, and then if the LLDD can't
cancel the scmd, we take other sometimes less than useful steps.

The LLDD could start any error handling scheme it wants, independent of
scsi core action.

We don't initiate error handling in scsi core for other error cases, why
should a timeout be any different?

> The reason, is that you do not know why a task timed out. Is it the LU,
> is it the device, is it the domain?

Right, so in scsi core allow a simple method that can cancel commands
while the HBA is still in use.

> (Those are concepts talked about in SAM.)
>
> Since currently, SCSI Core has no clue about those concepts, the current
> infrastructure, stalling IO to the host on eh, satisfies.
>
> > So, for EH_NOT_HANDLED, do you add the scmd to a LLDD list in your
> > eh_timed_out, then wait for the eh to run?
>
> No, no Patrick, I don't. The SCSI Core does this for me, and then calls my
> eh_strategy routine and all the commands are on the list.

Oh right ... I was not thinking straight. But I don't see how that gains
much, if you sometimes still wait for scsi core to quiesce IO and wakeup
the eh.

-- Patrick Mansfield
Re: libata error handling
On 08/19/05 17:10, Patrick Mansfield wrote:
> Luben -
>
> On Fri, Aug 19, 2005 at 04:43:41PM -0400, Luben Tuikov wrote:
> > On 08/19/05 16:11, Patrick Mansfield wrote:
> > > I was changing it to wakeup the eh even while other IO is outstanding, so
> > > the eh can wakeup and cancel individual commands while other IO is still
> > > using the HBA.
> >
> > Hmm, if you want to do this, then SCSI Core needs to know about:
> > - Domain,
> > - Domain device and
> > - LU.
>
> Not really, scsi core is just asking the LLDD to cancel or release the
> scmd.

Hi Patrick,

If you want to call any kind of eh, *without* stalling IO to the host,
you have to know the context: LU or Domain Device or Domain, in order
to stall IO to the context object only.

> That is really all we do in the eh today, and then if the LLDD can't
> cancel the scmd, we take other sometimes less than useful steps.

But remember, eh_timed_out and eh_strategy_handler point to functions
in the driver, so they can do more or less, depending.

BTW, you may be able to implement an _instantaneous_ canceling of a
command in the LLDD from any context, but this would bloat the LLDD so
much that it is not a viable solution, plus it _cannot_ always be done
(physically).

This is why eh_timed_out + eh_strategy_handler is the best current bet
for a transport LLDD. (Given the current SCSI Core infrastructure.)

As far as the upper layers are concerned:
- the command is either finished with status,
- or the LU or the Device or the Domain had problems.

The above list is exhaustive, because if the task was never able to be
sent via the service delivery subsystem you'd get a response, but the
task will not sit in the LLDD.

The most important thing to remember is that a timed out command
_cannot_ be returned instantaneously to SCSI Core, unless it is complete
in the SAM sense of the meaning. If the task is out on the domain, the
respective eh needs to be invoked.

> The LLDD could start any error handling scheme it wants, independent of
> scsi core action.

True, but this will bloat the LLDD very much, and all you want of a
transport LLDD is access to the delivery subsystem as described in SAM.

> We don't initiate error handling in scsi core for other error cases, why
> should a timeout be any different?

Think of eh_timed_out as a replacement of the transport timeout plus
some delta. Even if you had implemented it internally, which you don't
need to, you'd still get a timer to fire off and you'd be in interrupt
context.

As I mentioned earlier, if you've properly set the timeout value at
slave_configure, you shouldn't be getting SCSI Core calling your
eh_timed_out _unless_ the transport time out has kicked in, which *you*
set at slave configure time.

If, OTOH, SCSI Core _did_ call your eh_timed_out routine, this means
that the timeout which *you* set has kicked in and action needs to be
taken. At that time, as per protocol, your task is either complete or
you need to go and find it and take the appropriate action.

> > The reason, is that you do not know why a task timed out. Is it the LU,
> > is it the device, is it the domain?
>
> Right, so in scsi core allow a simple method that can cancel commands
> while the HBA is still in use.

Yes, you can do this. But in order to do this you'll need to know _why_
the command failed. Is it the LU or is it the Device or is it the
Domain? The only one who can tell you this is the LLDD who talks to the
transport.

*Unless*, the LLDD exports SAM TMFs and SCSI Core knows what to call
and how, which is currently not the case.

> > > So, for EH_NOT_HANDLED, do you add the scmd to a LLDD list in your
> > > eh_timed_out, then wait for the eh to run?
> >
> > No, no Patrick, I don't. The SCSI Core does this for me, and then calls my
> > eh_strategy routine and all the commands are on the list.
>
> Oh right ... I was not thinking straight. But I don't see how that gains
> much, if you sometimes still wait for scsi core to quiesce IO and wakeup
> the eh.

Yes, you're completely right, the SCSI Core eh infrastructure is
incomplete. (I never mentioned I was gaining anything, I was merely
working around the current infrastructure. ;-) )

With the current infrastructure you cannot fine grain stalling IO to
the different storage objects, simply because SCSI Core has _no
representation for them_.

Luben
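The point about needing to know the context (LU, Domain Device or Domain) before IO can be stalled selectively can be illustrated with a small user-space C model. This is a sketch under assumptions, not anything SCSI Core provides: `enum demo_scope`, `struct demo_addr` and `demo_io_stalled` are hypothetical names, and the three-integer address is a simplification. `SCOPE_HOST` models the behaviour described in the thread, where every command to the host is stalled because no finer-grained object is known.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fault scopes, from narrowest to widest. */
enum demo_scope { SCOPE_LU, SCOPE_DEVICE, SCOPE_DOMAIN, SCOPE_HOST };

/* Simplified address of a logical unit: domain, device in it, LU in that. */
struct demo_addr {
    int domain;
    int device;
    int lun;
};

/* Returns true if IO addressed to 'io' must wait while a fault of the given
 * scope, located at 'fault', is being recovered. */
bool demo_io_stalled(enum demo_scope scope,
                     const struct demo_addr *fault,
                     const struct demo_addr *io)
{
    assert(fault != NULL && io != NULL);

    switch (scope) {
    case SCOPE_HOST:
        return true;  /* no finer object known: stall everything */
    case SCOPE_DOMAIN:
        return io->domain == fault->domain;
    case SCOPE_DEVICE:
        return io->domain == fault->domain && io->device == fault->device;
    case SCOPE_LU:
        return io->domain == fault->domain && io->device == fault->device &&
               io->lun == fault->lun;
    }
    return true;      /* unknown scope: be conservative */
}
```

With only `SCOPE_HOST` available the function degenerates to "always stall", which is the limitation both sides of the exchange agree on.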
Re: libata error handling
Tejun Heo wrote:
> Heh... Maybe I'm just reluctant to let go of my patches. Anyways,
> I'll now stand down and see how things go and try to help.

Note that my email simply describes a long term target.

For the short term, and perhaps medium term, libata will continue to use
->eh_strategy_handler(). Given Mark's messages, my own knowledge, and
other reports, there continues to be room for improvement in the current
EH code.

In general, we need to distinguish between PCI bus errors, SATA bus
errors, and ATA device errors, and handle each error class
appropriately. In the SCSI layer, ->eh_strategy_handler() or no, this
will likely consist of taking the SCSI device offline and dealing with
the error(s).

Jeff
Re: libata error handling
Hi, Jeff.

Jeff Garzik wrote:
> Tejun,
>
> In an email I cannot find anymore, you asked why I was interested in
> converting libata to use the fine-grained EH hooks in the SCSI layer,
> rather than continuing with the current ->eh_strategy_handler() method.
>
> Several reasons:
>
> 1) The fine-grained hooks of the SCSI layer are somewhat standard for
> block devices. The events they signify -- timeout, abort cmd, dev
> reset, bus reset, and host reset -- map precisely to the events that we
> must deal with at the ATA level.

I generally agree that the events are somewhat standard for block
devices but IMHO SCSI EH also has a fair amount of SCSI-specific
assumptions and ATA is a bit too different from SCSI to fit cleanly into
it. For example, when handling NCQ errors, the whole task set is aborted
and the status is retrieved with read log page. This can be worked
around in one of the hooks and emulate SCSI behavior, but it just
doesn't really fit well. And I think that recovering via translation
layer is a bit too much translation.

So, my thought is that SCSI EH assumptions are a bit too specific to
be used as standard for block devices.

> But be warned of false sharing, as I talk about in #2...
>
> 2) When libata SAT translation layer becomes optional, and libata drives
> a "true" block device, use of ->eh_strategy_handler() will actually be
> an obstacle due to false sharing of code paths. ->eh_strategy_handler()
> is indeed a single "do it all" EH entrypoint, but within that entrypoint
> you must perform several SCSI-specific tasks.

It's true that we must do SCSI specific tasks inside libata if we use
eh_strategy_handler but I don't think switching to fine-grained EH will
reduce the amount of SCSI-specific things inside libata. I think as
long as we can insulate LLDD's from SCSI layer, either way should be
okay later.

> 3) ->eh_strategy_handler() has continually proven to be a method of
> error handling poorly supported by the SCSI layer. There are many
> assumptions coded into the SCSI layer that this is -not- the path taken
> by LLD EH code, and libata must constantly work around these assumptions.
>
> 4) libata is the -only- user of ->eh_strategy_handler(), and oddballs
> must be stomped out. It creates a maintenance burden on the SCSI layer
> that should be eliminated.

I agree that being the only user does incur difficulties, but my very
subjective feeling is that the original libata EH implementation was
just a bit too fragile to start with. eg. not grabbing host lock on EH
entrance causing command completion vs. EH handling race and handling
errors in several different ways.

Heh... Maybe I'm just reluctant to let go of my patches. Anyways,
I'll now stand down and see how things go and try to help.

Thanks, always.

-- tejun
libata error handling
Tejun,

In an email I cannot find anymore, you asked why I was interested in converting libata to use the fine-grained EH hooks in the SCSI layer, rather than continuing with the current ->eh_strategy_handler() method.

Several reasons:

1) The fine-grained hooks of the SCSI layer are somewhat standard for block devices. The events they signify -- timeout, abort cmd, dev reset, bus reset, and host reset -- map precisely to the events that we must deal with at the ATA level.

But be warned of false sharing, as I talk about in #2...

2) When libata SAT translation layer becomes optional, and libata drives a "true" block device, use of ->eh_strategy_handler() will actually be an obstacle due to false sharing of code paths. ->eh_strategy_handler() is indeed a single "do it all" EH entrypoint, but within that entrypoint you must perform several SCSI-specific tasks.

3) ->eh_strategy_handler() has continually proven to be a method of error handling poorly supported by the SCSI layer. There are many assumptions coded into the SCSI layer that this is -not- the path taken by LLD EH code, and libata must constantly work around these assumptions.

4) libata is the -only- user of ->eh_strategy_handler(), and oddballs must be stomped out. It creates a maintenance burden on the SCSI layer that should be eliminated.
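For readers unfamiliar with the two styles being compared, the following userspace sketch contrasts a single "do it all" entry point with fine-grained hooks that a mid-layer escalates through. It is a simplified mock, not the SCSI layer's real structures or escalation logic: struct mock_eh_ops, struct mock_cmd, the severity field, and all function names are hypothetical, and the timeout event is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical failed command: severity says how drastic a step is
 * needed (0 = abort suffices ... 3 = only a host reset helps). */
struct mock_cmd { int severity; };

/* Hypothetical driver callbacks. Either 'strategy' is set (one entry
 * point does everything) or the four fine-grained hooks are. */
struct mock_eh_ops {
    void (*strategy)(struct mock_cmd *);
    int (*abort_cmd)(struct mock_cmd *);
    int (*dev_reset)(struct mock_cmd *);
    int (*bus_reset)(struct mock_cmd *);
    int (*host_reset)(struct mock_cmd *);
};

/* Mock hooks: return 0 on success, -1 if a stronger step is needed. */
int mock_abort(struct mock_cmd *c)      { return c->severity <= 0 ? 0 : -1; }
int mock_dev_reset(struct mock_cmd *c)  { return c->severity <= 1 ? 0 : -1; }
int mock_bus_reset(struct mock_cmd *c)  { return c->severity <= 2 ? 0 : -1; }
int mock_host_reset(struct mock_cmd *c) { return c->severity <= 3 ? 0 : -1; }

const struct mock_eh_ops fine_grained_ops = {
    NULL, mock_abort, mock_dev_reset, mock_bus_reset, mock_host_reset
};

/* Returns 0 if the driver's strategy routine took over, the 1-based
 * index of the hook that succeeded, or -1 if every hook failed. */
int mock_recover(const struct mock_eh_ops *ops, struct mock_cmd *cmd)
{
    int (*steps[4])(struct mock_cmd *) = {
        ops->abort_cmd, ops->dev_reset, ops->bus_reset, ops->host_reset
    };
    int i;

    if (ops->strategy) {
        /* Single entry point: the driver owns the whole recovery. */
        ops->strategy(cmd);
        return 0;
    }
    /* Fine-grained: the mid-layer escalates one hook at a time. */
    for (i = 0; i < 4; i++) {
        if (steps[i] && steps[i](cmd) == 0)
            return i + 1;
    }
    return -1;
}
```

With fine_grained_ops, a command of severity 2 fails the abort and device reset hooks and recovers at the bus reset, so mock_recover() returns 3. The design question in this thread is which side owns that escalation loop: the mid-layer, as in the loop above, or the driver, as in the strategy branch.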