RE: [PATCH] s2io: add PCI error recovery support
> > Ideally s2io_reset should > > return a failure in this case (return is void now) > > Would you care to provide a patch that did this? I could > experiment a bit, and try to do this myself; but I really > don't know this hardware, or this driver, that well. [Ram] Yes, I can submit a patch to address this. Ram > > --linas > - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] s2io: add PCI error recovery support
On Mon, Mar 05, 2007 at 05:33:39PM -0500, Ramkrishna Vepa wrote: > Comments on this patch - > > 1. device_close_flag is unused and is not required. I'll submit a patch to strip this out sometime next week. > 2. s2io_reset can fail to reset the device. I thought I'd seen this occasionally, and its on my to-do list to look into this further. > Ideally s2io_reset should > return a failure in this case (return is void now) Would you care to provide a patch that did this? I could experiment a bit, and try to do this myself; but I really don't know this hardware, or this driver, that well. --linas - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] s2io: add PCI error recovery support
Jeff, Please apply and forward this patch upstream. Ram > -Original Message- > From: Ramkrishna Vepa > Sent: Monday, March 05, 2007 2:34 PM > To: 'Linas Vepstas' > Cc: Wen Xiong; linux-kernel@vger.kernel.org; linux- > [EMAIL PROTECTED]; netdev@vger.kernel.org; Jeff Garzik; Andrew > Morton > Subject: RE: [PATCH] s2io: add PCI error recovery support > > Comments on this patch - > > 1. device_close_flag is unused and is not required. > > +static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev, > > + pci_channel_state_t > state) > > +{ > ... > > + do_s2io_card_down(sp, 0); > > + sp->device_close_flag = TRUE; /* Device is shut down. */ > > 2. s2io_reset can fail to reset the device. Ideally s2io_reset should > return a failure in this case (return is void now) and in this case could > s2io_io_slot_reset() be called again, maybe try thrice, in total, before > failing to reset the slot? > > Ram > > -Original Message- > > From: Linas Vepstas [mailto:[EMAIL PROTECTED] > > Sent: Thursday, February 15, 2007 3:09 PM > > To: Ramkrishna Vepa; Raghavendra Koushik; Ananda Raju > > Cc: Wen Xiong; linux-kernel@vger.kernel.org; linux- > > [EMAIL PROTECTED]; netdev@vger.kernel.org; Jeff Garzik; > Andrew > > Morton > > Subject: [PATCH] s2io: add PCI error recovery support > > > > > > Koushik, Raju, > > > > Please review, comment, and if you find this acceptable, > > please forward upstream. This patch incorporates all of > > fixes resulting from the last set of discussions, circa > > November 2006. > > > > --linas > > > > This patch adds PCI error recovery support to the > > s2io 10-Gigabit ethernet device driver. Fourth revision, > > blocks interrupts and the watchdog. Adds a flag to > > s2io_down(), to avoid doing I/O when PCI bus is offline. > > > > Tested, seems to work well. > > > > Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]> > > Acked-by: Ramkrishna Vepa <[EMAIL PROTECTED]> > > Cc: Raghavendra Koushik <[EMAIL PROTECTED]> > > Cc: Ananda Raju <[EMAIL PROTECTED]> > > Cc: Wen Xiong <[EMAIL PROTECTED]> > > > > > > drivers/net/s2io.c | 116 > > ++--- > > drivers/net/s2io.h |5 ++ > > 2 files changed, 116 insertions(+), 5 deletions(-) > > > > Index: linux-2.6.20-git4/drivers/net/s2io.c > > === > > --- linux-2.6.20-git4.orig/drivers/net/s2io.c 2007-02-15 > > 15:39:35.0 -0600 > > +++ linux-2.6.20-git4/drivers/net/s2io.c2007-02-15 > 16:15:10.0 - > > 0600 > > @@ -435,11 +435,18 @@ static struct pci_device_id s2io_tbl[] _ > > > > MODULE_DEVICE_TABLE(pci, s2io_tbl); > > > > +static struct pci_error_handlers s2io_err_handler = { > > + .error_detected = s2io_io_error_detected, > > + .slot_reset = s2io_io_slot_reset, > > + .resume = s2io_io_resume, > > +}; > > + > > static struct pci_driver s2io_driver = { > >.name = "S2IO", > >.id_table = s2io_tbl, > >.probe = s2io_init_nic, > >.remove = __devexit_p(s2io_rem_nic), > > + .err_handler = &s2io_err_handler, > > }; > > > > /* A simplifier macro used both by init and free shared_mem Fns(). */ > > @@ -2577,6 +2584,9 @@ static void s2io_netpoll(struct net_devi > > u64 val64 = 0xULL; > > int i; > > > > + if (pci_channel_offline(nic->pdev)) > > + return; > > + > > disable_irq(dev->irq); > > > > atomic_inc(&nic->isr_cnt); > > @@ -3079,6 +3089,8 @@ static void alarm_intr_handler(struct s2 > > int i; > > if (atomic_read(&nic->card_state) == CARD_DOWN) > > return; > > + if (pci_channel_offline(nic->pdev)) > > + return; > > nic->mac_control.stats_info->sw_stat.ring_full_cnt = 0; > > /* Handling the XPAK counters update */ > > if(nic->mac_control.stats_info->xpak_stat.xpak_timer_count < 72000) > > { > > @@ -4117,6 +4129,10 @@ static irqreturn_t s2io_isr(int irq, voi > > struct mac_info *mac_control; > > struct config_param *config; > > > > + /* Pretend we handled any irq's from a disconnected card */ > > + if (pci_channel_offline(sp->pdev)) > > + ret
RE: [PATCH] s2io: add PCI error recovery support
Comments on this patch - 1. device_close_flag is unused and is not required. > +static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev, > + pci_channel_state_t state) > +{ ... > + do_s2io_card_down(sp, 0); > + sp->device_close_flag = TRUE; /* Device is shut down. */ 2. s2io_reset can fail to reset the device. Ideally s2io_reset should return a failure in this case (return is void now) and in this case could s2io_io_slot_reset() be called again, maybe try thrice, in total, before failing to reset the slot? Ram > -Original Message- > From: Linas Vepstas [mailto:[EMAIL PROTECTED] > Sent: Thursday, February 15, 2007 3:09 PM > To: Ramkrishna Vepa; Raghavendra Koushik; Ananda Raju > Cc: Wen Xiong; linux-kernel@vger.kernel.org; linux- > [EMAIL PROTECTED]; netdev@vger.kernel.org; Jeff Garzik; Andrew > Morton > Subject: [PATCH] s2io: add PCI error recovery support > > > Koushik, Raju, > > Please review, comment, and if you find this acceptable, > please forward upstream. This patch incorporates all of > fixes resulting from the last set of discussions, circa > November 2006. > > --linas > > This patch adds PCI error recovery support to the > s2io 10-Gigabit ethernet device driver. Fourth revision, > blocks interrupts and the watchdog. Adds a flag to > s2io_down(), to avoid doing I/O when PCI bus is offline. > > Tested, seems to work well. > > Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]> > Acked-by: Ramkrishna Vepa <[EMAIL PROTECTED]> > Cc: Raghavendra Koushik <[EMAIL PROTECTED]> > Cc: Ananda Raju <[EMAIL PROTECTED]> > Cc: Wen Xiong <[EMAIL PROTECTED]> > > > drivers/net/s2io.c | 116 > ++--- > drivers/net/s2io.h |5 ++ > 2 files changed, 116 insertions(+), 5 deletions(-) > > Index: linux-2.6.20-git4/drivers/net/s2io.c > === > --- linux-2.6.20-git4.orig/drivers/net/s2io.c 2007-02-15 > 15:39:35.0 -0600 > +++ linux-2.6.20-git4/drivers/net/s2io.c 2007-02-15 16:15:10.0 - > 0600 > @@ -435,11 +435,18 @@ static struct pci_device_id s2io_tbl[] _ > > MODULE_DEVICE_TABLE(pci, s2io_tbl); > > +static struct pci_error_handlers s2io_err_handler = { > + .error_detected = s2io_io_error_detected, > + .slot_reset = s2io_io_slot_reset, > + .resume = s2io_io_resume, > +}; > + > static struct pci_driver s2io_driver = { >.name = "S2IO", >.id_table = s2io_tbl, >.probe = s2io_init_nic, >.remove = __devexit_p(s2io_rem_nic), > + .err_handler = &s2io_err_handler, > }; > > /* A simplifier macro used both by init and free shared_mem Fns(). */ > @@ -2577,6 +2584,9 @@ static void s2io_netpoll(struct net_devi > u64 val64 = 0xULL; > int i; > > + if (pci_channel_offline(nic->pdev)) > + return; > + > disable_irq(dev->irq); > > atomic_inc(&nic->isr_cnt); > @@ -3079,6 +3089,8 @@ static void alarm_intr_handler(struct s2 > int i; > if (atomic_read(&nic->card_state) == CARD_DOWN) > return; > + if (pci_channel_offline(nic->pdev)) > + return; > nic->mac_control.stats_info->sw_stat.ring_full_cnt = 0; > /* Handling the XPAK counters update */ > if(nic->mac_control.stats_info->xpak_stat.xpak_timer_count < 72000) > { > @@ -4117,6 +4129,10 @@ static irqreturn_t s2io_isr(int irq, voi > struct mac_info *mac_control; > struct config_param *config; > > + /* Pretend we handled any irq's from a disconnected card */ > + if (pci_channel_offline(sp->pdev)) > + return IRQ_NONE; > + > atomic_inc(&sp->isr_cnt); > mac_control = &sp->mac_control; > config = &sp->config; > @@ -6188,7 +6204,7 @@ static void s2io_rem_isr(struct s2io_nic > } while(cnt < 5); > } > > -static void s2io_card_down(struct s2io_nic * sp) > +static void do_s2io_card_down(struct s2io_nic * sp, int do_io) > { > int cnt = 0; > struct XENA_dev_config __iomem *bar0 = sp->bar0; > @@ -6203,7 +6219,8 @@ static void s2io_card_down(struct s2io_n > atomic_set(&sp->card_state, CARD_DOWN); > > /* disable Tx and Rx traffic on the NIC */ > - stop_nic(sp); > + if (do_io) > + stop_nic(sp); > > s2io_rem_isr(sp); > > @@ -6211,7 +6228,7 @@ static void s2io_card_down(struct s2io_n >
[PATCH] s2io: add PCI error recovery support
Koushik, Raju, Please review, comment, and if you find this acceptable, please forward upstream. This patch incorporates all of fixes resulting from the last set of discussions, circa November 2006. --linas This patch adds PCI error recovery support to the s2io 10-Gigabit ethernet device driver. Fourth revision, blocks interrupts and the watchdog. Adds a flag to s2io_down(), to avoid doing I/O when PCI bus is offline. Tested, seems to work well. Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]> Acked-by: Ramkrishna Vepa <[EMAIL PROTECTED]> Cc: Raghavendra Koushik <[EMAIL PROTECTED]> Cc: Ananda Raju <[EMAIL PROTECTED]> Cc: Wen Xiong <[EMAIL PROTECTED]> drivers/net/s2io.c | 116 ++--- drivers/net/s2io.h |5 ++ 2 files changed, 116 insertions(+), 5 deletions(-) Index: linux-2.6.20-git4/drivers/net/s2io.c === --- linux-2.6.20-git4.orig/drivers/net/s2io.c 2007-02-15 15:39:35.0 -0600 +++ linux-2.6.20-git4/drivers/net/s2io.c2007-02-15 16:15:10.0 -0600 @@ -435,11 +435,18 @@ static struct pci_device_id s2io_tbl[] _ MODULE_DEVICE_TABLE(pci, s2io_tbl); +static struct pci_error_handlers s2io_err_handler = { + .error_detected = s2io_io_error_detected, + .slot_reset = s2io_io_slot_reset, + .resume = s2io_io_resume, +}; + static struct pci_driver s2io_driver = { .name = "S2IO", .id_table = s2io_tbl, .probe = s2io_init_nic, .remove = __devexit_p(s2io_rem_nic), + .err_handler = &s2io_err_handler, }; /* A simplifier macro used both by init and free shared_mem Fns(). */ @@ -2577,6 +2584,9 @@ static void s2io_netpoll(struct net_devi u64 val64 = 0xULL; int i; + if (pci_channel_offline(nic->pdev)) + return; + disable_irq(dev->irq); atomic_inc(&nic->isr_cnt); @@ -3079,6 +3089,8 @@ static void alarm_intr_handler(struct s2 int i; if (atomic_read(&nic->card_state) == CARD_DOWN) return; + if (pci_channel_offline(nic->pdev)) + return; nic->mac_control.stats_info->sw_stat.ring_full_cnt = 0; /* Handling the XPAK counters update */ if(nic->mac_control.stats_info->xpak_stat.xpak_timer_count < 72000) { @@ -4117,6 +4129,10 @@ static irqreturn_t s2io_isr(int irq, voi struct mac_info *mac_control; struct config_param *config; + /* Pretend we handled any irq's from a disconnected card */ + if (pci_channel_offline(sp->pdev)) + return IRQ_NONE; + atomic_inc(&sp->isr_cnt); mac_control = &sp->mac_control; config = &sp->config; @@ -6188,7 +6204,7 @@ static void s2io_rem_isr(struct s2io_nic } while(cnt < 5); } -static void s2io_card_down(struct s2io_nic * sp) +static void do_s2io_card_down(struct s2io_nic * sp, int do_io) { int cnt = 0; struct XENA_dev_config __iomem *bar0 = sp->bar0; @@ -6203,7 +6219,8 @@ static void s2io_card_down(struct s2io_n atomic_set(&sp->card_state, CARD_DOWN); /* disable Tx and Rx traffic on the NIC */ - stop_nic(sp); + if (do_io) + stop_nic(sp); s2io_rem_isr(sp); @@ -6211,7 +6228,7 @@ static void s2io_card_down(struct s2io_n tasklet_kill(&sp->task); /* Check if the device is Quiescent and then Reset the NIC */ - do { + while(do_io) { /* As per the HW requirement we need to replenish the * receive buffer to avoid the ring bump. Since there is * no intention of processing the Rx frame at this pointwe are @@ -6236,8 +6253,9 @@ static void s2io_card_down(struct s2io_n (unsigned long long) val64); break; } - } while (1); - s2io_reset(sp); + } + if (do_io) + s2io_reset(sp); spin_lock_irqsave(&sp->tx_lock, flags); /* Free all Tx buffers */ @@ -6252,6 +6270,11 @@ static void s2io_card_down(struct s2io_n clear_bit(0, &(sp->link_state)); } +static void s2io_card_down(struct s2io_nic * sp) +{ + do_s2io_card_down(sp, 1); +} + static int s2io_card_up(struct s2io_nic * sp) { int i, ret = 0; @@ -7536,3 +7559,86 @@ static void lro_append_pkt(struct s2io_n sp->mac_control.stats_info->sw_stat.clubbed_frms_cnt++; return; } + +/** + * s2io_io_error_detected - called when PCI error is detected + * @pdev: Pointer to PCI device + * @state: The current pci conneection state + * + * This function is called after a PCI bus error affecting + * this device has been detected. + */ +static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev, + pci_channel_state_t state) +{ + struct net_device *netdev = pci_get_drvdata(pdev); +
Re: [PATCH] s2io: add PCI error recovery support
On Fri, Oct 27, 2006 at 07:35:18AM -0400, Ananda Raju wrote: > Looking at all scenarios I feel the first patch is OK. Can you add the > watchdog timer fix to first initial patch and resubmit. Appended below. > So -- just for grins, I thought to myself, "Maybe I can make > s2io be the first adapter ever to fully recover without > a hard reset of the card." ... I couldn't quite make this work. Since the patch below already works, I didn't see much point exterting myself further. --linas This patch adds PCI error recovery support to the s2io 10-Gigabit ethernet device driver. Third revision, blocks interrupts and the watchdog. Tested, seems to work well. Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]> Cc: Raghavendra Koushik <[EMAIL PROTECTED]> Cc: Ananda Raju <[EMAIL PROTECTED]> Cc: Wen Xiong <[EMAIL PROTECTED]> drivers/net/s2io.c | 121 + drivers/net/s2io.h |5 ++ 2 files changed, 126 insertions(+) Index: linux-2.6.19-rc1-git11/drivers/net/s2io.c === --- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.c 2006-10-27 10:49:07.0 -0500 +++ linux-2.6.19-rc1-git11/drivers/net/s2io.c 2006-10-27 13:55:01.0 -0500 @@ -434,11 +434,18 @@ static struct pci_device_id s2io_tbl[] _ MODULE_DEVICE_TABLE(pci, s2io_tbl); +static struct pci_error_handlers s2io_err_handler = { + .error_detected = s2io_io_error_detected, + .slot_reset = s2io_io_slot_reset, + .resume = s2io_io_resume, +}; + static struct pci_driver s2io_driver = { .name = "S2IO", .id_table = s2io_tbl, .probe = s2io_init_nic, .remove = __devexit_p(s2io_rem_nic), + .err_handler = &s2io_err_handler, }; /* A simplifier macro used both by init and free shared_mem Fns(). */ @@ -3159,6 +3166,11 @@ static void alarm_intr_handler(struct s2 register u64 val64 = 0, err_reg = 0; u64 cnt; int i; + + if ((nic->pdev->error_state != pci_channel_io_normal) && +(nic->pdev->error_state != 0)) + return; + nic->mac_control.stats_info->sw_stat.ring_full_cnt = 0; /* Handling the XPAK counters update */ if(nic->mac_control.stats_info->xpak_stat.xpak_timer_count < 72000) { @@ -4171,6 +4183,11 @@ static irqreturn_t s2io_isr(int irq, voi mac_info_t *mac_control; struct config_param *config; + /* Pretend we handled any irq's from a disconnected card */ + if ((sp->pdev->error_state != pci_channel_io_normal) && +(sp->pdev->error_state != 0)) + return IRQ_HANDLED; + atomic_inc(&sp->isr_cnt); mac_control = &sp->mac_control; config = &sp->config; @@ -7564,3 +7581,107 @@ static void lro_append_pkt(nic_t *sp, lr sp->mac_control.stats_info->sw_stat.clubbed_frms_cnt++; return; } + +/** + * s2io_io_error_detected - called when PCI error is detected + * @pdev: Pointer to PCI device + * @state: The current pci conneection state + * + * This function is called after a PCI bus error affecting + * this device has been detected. + */ +static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev, + pci_channel_state_t state) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + nic_t *sp = netdev->priv; + + netif_device_detach(netdev); + + if (netif_running(netdev)) { + unsigned long flags; + + /* The folowing is an abreviated subset of the +* steps taken by s2io_card_down(), avoiding +* steps that touch the card itself. +*/ + del_timer_sync(&sp->alarm_timer); + atomic_set(&sp->card_state, CARD_DOWN); + + /* Kill tasklet. */ + tasklet_kill(&sp->task); + + /* Free all Tx buffers */ + spin_lock_irqsave(&sp->tx_lock, flags); + free_tx_buffers(sp); + spin_unlock_irqrestore(&sp->tx_lock, flags); + + /* Free all Rx buffers */ + spin_lock_irqsave(&sp->rx_lock, flags); + free_rx_buffers(sp); + spin_unlock_irqrestore(&sp->rx_lock, flags); + + clear_bit(0, &(sp->link_state)); + sp->device_close_flag = TRUE; /* Device is shut down. */ + } + pci_disable_device(pdev); + + return PCI_ERS_RESULT_NEED_RESET; +} + +/** + * s2io_io_slot_reset - called after the pci bus has been reset. + * @pdev: Pointer to PCI device + * + * Restart the card from scratch, as if from a cold-boot. + * At this point, the card has exprienced a hard reset, + * followed by fixups by BIOS, and has its config space + * set up identically to what it was at cold boot. + */ +static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev) +{ + struct net_device *netdev = pci_get_dr
RE: [PATCH] s2io: add PCI error recovery support
Looking at all scenarios I feel the first patch is OK. Can you add the watchdog timer fix to first initial patch and resubmit. -Original Message- From: Linas Vepstas [mailto:[EMAIL PROTECTED] Sent: Thursday, October 26, 2006 3:52 PM To: Ananda Raju Cc: Wen Xiong; linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; netdev@vger.kernel.org; Jeff Garzik; Andrew Morton Subject: Re: [PATCH] s2io: add PCI error recovery support Hi. On Thu, Oct 26, 2006 at 05:56:34AM -0400, Ananda Raju wrote: > Hi, > Can you try attached patch. The attached patch is simple. We set card > state as down in error_detecct() so that all entry points return error > and don't proceed further. > > In slot_reset() we do s2io_card_down() will reset adapter. > In io_resume() we bringup the driver. Simplicity is always better. However, some questions/comments: > @@ -4175,6 +4186,10 @@ static irqreturn_t s2io_isr(int irq, voi > mac_info_t *mac_control; > struct config_param *config; > > + if (atomic_read(&sp->card_state) == CARD_DOWN) { > + return IRQ_NONE; > + } I used if ((sp->pdev->error_state != pci_channel_io_normal) here for a reason: the pdev->error_state is set even in an interrupt context, that is, it gets set even if interrups are disabled, and so it represents the actual state immediately. By contrast, the error callbacks do not get called until possibly much later, and so sp->card_state = CARD_DOWN might not get set for a while. If, for any reason, e.g. some obscure corner case, the s2io generates zillions of interupts, this could result in a soft-lockup. I actually saw this in the symbios device driver, which will regenerate an interrupt until its acknowledged -- and so it sat there, spinning. :-( I was returning IRQ_HANDLED instead of IRQ_NONE, so as to avoid falling into handle_bad_irq() or report_bad_irq(). I haven't seen this happen on s2io, but thought it would still be wise. If this can't happen, then there's no problem here. > +/** > + * s2io_io_slot_reset - called after the pci bus has been reset. > + * @pdev: Pointer to PCI device > + * > + * Restart the card from scratch, as if from a cold-boot. > + */ > +static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev) > +{ At this point, the card has just experienced a hardware reset, (the #RST wire was held low for 250 millisecs, followed by a settle time of 2 seconds, followed by whatever BIOS thinks it needed to do, followed by a restore of the pci config space to what it was after a cold boot. So the card is in a "fresh" state; in theory its identitcal to a cold boot. So ... are you sure you want to "down" at this point? > + s2io_card_down(sp); > + sp->device_close_flag = TRUE; /* Device is shut down. */ One problem I'm having is that the watchdog timer sometimes pops and tries to reset the card before s2io_card_down() has a chance to run. I fixed this ... == So -- just for grins, I thought to myself, "Maybe I can make s2io be the first adapter ever to fully recover without a hard reset of the card." The idea is simple: 1) enable MMIO, 2) call s2io_card_down() 3) enable DMA 4) cal s2io_card_up() I have a patch that does this, but then hit a few more snags. I haven't yet nailed down all the trouble spots, maybe tommorrow. --linas - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] s2io: add PCI error recovery support
Hi. On Thu, Oct 26, 2006 at 05:56:34AM -0400, Ananda Raju wrote: > Hi, > Can you try attached patch. The attached patch is simple. We set card > state as down in error_detecct() so that all entry points return error > and don't proceed further. > > In slot_reset() we do s2io_card_down() will reset adapter. > In io_resume() we bringup the driver. Simplicity is always better. However, some questions/comments: > @@ -4175,6 +4186,10 @@ static irqreturn_t s2io_isr(int irq, voi > mac_info_t *mac_control; > struct config_param *config; > > + if (atomic_read(&sp->card_state) == CARD_DOWN) { > + return IRQ_NONE; > + } I used if ((sp->pdev->error_state != pci_channel_io_normal) here for a reason: the pdev->error_state is set even in an interrupt context, that is, it gets set even if interrups are disabled, and so it represents the actual state immediately. By contrast, the error callbacks do not get called until possibly much later, and so sp->card_state = CARD_DOWN might not get set for a while. If, for any reason, e.g. some obscure corner case, the s2io generates zillions of interupts, this could result in a soft-lockup. I actually saw this in the symbios device driver, which will regenerate an interrupt until its acknowledged -- and so it sat there, spinning. :-( I was returning IRQ_HANDLED instead of IRQ_NONE, so as to avoid falling into handle_bad_irq() or report_bad_irq(). I haven't seen this happen on s2io, but thought it would still be wise. If this can't happen, then there's no problem here. > +/** > + * s2io_io_slot_reset - called after the pci bus has been reset. > + * @pdev: Pointer to PCI device > + * > + * Restart the card from scratch, as if from a cold-boot. > + */ > +static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev) > +{ At this point, the card has just experienced a hardware reset, (the #RST wire was held low for 250 millisecs, followed by a settle time of 2 seconds, followed by whatever BIOS thinks it needed to do, followed by a restore of the pci config space to what it was after a cold boot. So the card is in a "fresh" state; in theory its identitcal to a cold boot. So ... are you sure you want to "down" at this point? > + s2io_card_down(sp); > + sp->device_close_flag = TRUE; /* Device is shut down. */ One problem I'm having is that the watchdog timer sometimes pops and tries to reset the card before s2io_card_down() has a chance to run. I fixed this ... == So -- just for grins, I thought to myself, "Maybe I can make s2io be the first adapter ever to fully recover without a hard reset of the card." The idea is simple: 1) enable MMIO, 2) call s2io_card_down() 3) enable DMA 4) cal s2io_card_up() I have a patch that does this, but then hit a few more snags. I haven't yet nailed down all the trouble spots, maybe tommorrow. --linas - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] s2io: add PCI error recovery support
Hi, Can you try attached patch. The attached patch is simple. We set card state as down in error_detecct() so that all entry points return error and don't proceed further. In slot_reset() we do s2io_card_down() will reset adapter. In io_resume() we bringup the driver. Ananda -Original Message- From: Linas Vepstas [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 25, 2006 1:55 PM To: Ananda Raju Cc: Wen Xiong; linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; netdev@vger.kernel.org; Jeff Garzik; Andrew Morton Subject: Re: [PATCH] s2io: add PCI error recovery support On Wed, Oct 25, 2006 at 10:11:24AM -0500, Linas Vepstas wrote: > > > Also we have to add following if statement in beginning of s2io_isr(). Done, below, > > If it is ok to do BAR0 read/write in error_detected() then patch is OK. I re-wrote that section to avoid doing I/O. It seems to work well, and generates a few less messages in the process. New, improved patch below, please ack and send upstream if you like it. --linas This patch adds PCI error recovery support to the s2io 10-Gigabit ethernet device driver. Tested, seems to work well. Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]> Cc: Raghavendra Koushik <[EMAIL PROTECTED]> Cc: Ananda Raju <[EMAIL PROTECTED]> Cc: Wen Xiong <[EMAIL PROTECTED]> drivers/net/s2io.c | 103 + drivers/net/s2io.h |5 ++ 2 files changed, 108 insertions(+) Index: linux-2.6.19-rc1-git11/drivers/net/s2io.c === --- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.c 2006-10-25 14:09:47.0 -0500 +++ linux-2.6.19-rc1-git11/drivers/net/s2io.c 2006-10-25 15:18:25.0 -0500 @@ -434,11 +434,18 @@ static struct pci_device_id s2io_tbl[] _ MODULE_DEVICE_TABLE(pci, s2io_tbl); +static struct pci_error_handlers s2io_err_handler = { + .error_detected = s2io_io_error_detected, + .slot_reset = s2io_io_slot_reset, + .resume = s2io_io_resume, +}; + static struct pci_driver s2io_driver = { .name = "S2IO", .id_table = s2io_tbl, .probe = s2io_init_nic, .remove = __devexit_p(s2io_rem_nic), + .err_handler = &s2io_err_handler, }; /* A simplifier macro used both by init and free shared_mem Fns(). */ @@ -4171,6 +4178,11 @@ static irqreturn_t s2io_isr(int irq, voi mac_info_t *mac_control; struct config_param *config; + if ((sp->pdev->error_state != pci_channel_io_normal) && +(sp->pdev->error_state != 0)) { + return IRQ_HANDLED; + } + atomic_inc(&sp->isr_cnt); mac_control = &sp->mac_control; config = &sp->config; @@ -7564,3 +7576,94 @@ static void lro_append_pkt(nic_t *sp, lr sp->mac_control.stats_info->sw_stat.clubbed_frms_cnt++; return; } + +/** + * s2io_io_error_detected - called when PCI error is detected + * @pdev: Pointer to PCI device + * @state: The current pci conneection state + * + * This function is called after a PCI bus error affecting + * this device has been detected. + */ +static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev, pci_channel_state_t state) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + nic_t *sp = netdev->priv; + + netif_device_detach(netdev); + + if (netif_running(netdev)) { + unsigned long flags; + + /* The folowing is an abreviated subset of the +* steps taken by s2io_card_down(), avoiding +* steps that touch the card itself. +*/ + del_timer_sync(&sp->alarm_timer); + atomic_set(&sp->card_state, CARD_DOWN); + + /* Kill tasklet. */ + tasklet_kill(&sp->task); + + /* Free all Tx buffers */ + spin_lock_irqsave(&sp->tx_lock, flags); + free_tx_buffers(sp); + spin_unlock_irqrestore(&sp->tx_lock, flags); + + /* Free all Rx buffers */ + spin_lock_irqsave(&sp->rx_lock, flags); + free_rx_buffers(sp); + spin_unlock_irqrestore(&sp->rx_lock, flags); + + clear_bit(0, &(sp->link_state)); + sp->device_close_flag = TRUE; /* Device is shut down. */ + } + pci_disable_device(pdev); + + return PCI_ERS_RESULT_NEED_RESET; +} + +/** + * s2io_io_slot_reset - called after the pci bus has been reset. + * @pdev: Pointer to PCI device + * + * Restart the card from scratch, as if from a cold-boot. + */ +static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + nic_t *sp = netdev->priv; + + if (pci_enable_device(pdev)) { +
Re: [PATCH] s2io: add PCI error recovery support
On Wed, Oct 25, 2006 at 10:11:24AM -0500, Linas Vepstas wrote: > > > Also we have to add following if statement in beginning of s2io_isr(). Done, below, > > If it is ok to do BAR0 read/write in error_detected() then patch is OK. I re-wrote that section to avoid doing I/O. It seems to work well, and generates a few less messages in the process. New, improved patch below, please ack and send upstream if you like it. --linas This patch adds PCI error recovery support to the s2io 10-Gigabit ethernet device driver. Tested, seems to work well. Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]> Cc: Raghavendra Koushik <[EMAIL PROTECTED]> Cc: Ananda Raju <[EMAIL PROTECTED]> Cc: Wen Xiong <[EMAIL PROTECTED]> drivers/net/s2io.c | 103 + drivers/net/s2io.h |5 ++ 2 files changed, 108 insertions(+) Index: linux-2.6.19-rc1-git11/drivers/net/s2io.c === --- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.c 2006-10-25 14:09:47.0 -0500 +++ linux-2.6.19-rc1-git11/drivers/net/s2io.c 2006-10-25 15:18:25.0 -0500 @@ -434,11 +434,18 @@ static struct pci_device_id s2io_tbl[] _ MODULE_DEVICE_TABLE(pci, s2io_tbl); +static struct pci_error_handlers s2io_err_handler = { + .error_detected = s2io_io_error_detected, + .slot_reset = s2io_io_slot_reset, + .resume = s2io_io_resume, +}; + static struct pci_driver s2io_driver = { .name = "S2IO", .id_table = s2io_tbl, .probe = s2io_init_nic, .remove = __devexit_p(s2io_rem_nic), + .err_handler = &s2io_err_handler, }; /* A simplifier macro used both by init and free shared_mem Fns(). */ @@ -4171,6 +4178,11 @@ static irqreturn_t s2io_isr(int irq, voi mac_info_t *mac_control; struct config_param *config; + if ((sp->pdev->error_state != pci_channel_io_normal) && +(sp->pdev->error_state != 0)) { + return IRQ_HANDLED; + } + atomic_inc(&sp->isr_cnt); mac_control = &sp->mac_control; config = &sp->config; @@ -7564,3 +7576,94 @@ static void lro_append_pkt(nic_t *sp, lr sp->mac_control.stats_info->sw_stat.clubbed_frms_cnt++; return; } + +/** + * s2io_io_error_detected - called when PCI error is detected + * @pdev: Pointer to PCI device + * @state: The current pci conneection state + * + * This function is called after a PCI bus error affecting + * this device has been detected. + */ +static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev, pci_channel_state_t state) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + nic_t *sp = netdev->priv; + + netif_device_detach(netdev); + + if (netif_running(netdev)) { + unsigned long flags; + + /* The folowing is an abreviated subset of the +* steps taken by s2io_card_down(), avoiding +* steps that touch the card itself. +*/ + del_timer_sync(&sp->alarm_timer); + atomic_set(&sp->card_state, CARD_DOWN); + + /* Kill tasklet. */ + tasklet_kill(&sp->task); + + /* Free all Tx buffers */ + spin_lock_irqsave(&sp->tx_lock, flags); + free_tx_buffers(sp); + spin_unlock_irqrestore(&sp->tx_lock, flags); + + /* Free all Rx buffers */ + spin_lock_irqsave(&sp->rx_lock, flags); + free_rx_buffers(sp); + spin_unlock_irqrestore(&sp->rx_lock, flags); + + clear_bit(0, &(sp->link_state)); + sp->device_close_flag = TRUE; /* Device is shut down. */ + } + pci_disable_device(pdev); + + return PCI_ERS_RESULT_NEED_RESET; +} + +/** + * s2io_io_slot_reset - called after the pci bus has been reset. + * @pdev: Pointer to PCI device + * + * Restart the card from scratch, as if from a cold-boot. + */ +static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + nic_t *sp = netdev->priv; + + if (pci_enable_device(pdev)) { + printk(KERN_ERR "s2io: Cannot re-enable PCI device after reset.\n"); + return PCI_ERS_RESULT_DISCONNECT; + } + + pci_set_master(pdev); + s2io_reset(sp); + + return PCI_ERS_RESULT_RECOVERED; +} + +/** + * s2io_io_resume - called when traffic can start flowing again. + * @pdev: Pointer to PCI device + * + * This callback is called when the error recovery driver tells us that + * its OK to resume normal operation. + */ +static void s2io_io_resume(struct pci_dev *pdev) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + nic_t *sp = netdev->priv; + + if (netif_running(netdev)) { + if (s2io_card_up(sp)) { + printk(KERN_ERR "s2io: can't bring devi
Re: [PATCH] s2io: add PCI error recovery support
On Wed, Oct 25, 2006 at 02:29:33AM -0400, Ananda Raju wrote: > Hi, > > s2io_card_down() will do few BAR0 read/write. As per > pci-error-recovery.txt Documentation we are not supposed to do any new > IO in error_detected(). Hmm, actually, its harmless to do further i/o. The s2io driver barks (as it should) because the result of reads is always 0x. > Can you try using > > atomic_set(&sp->card_state, CARD_DOWN); > > instead of s2io_card_down(). I will try that. I was mostly concerned that s2io_card_down() also may do some other "important" things with regards to the driver state, things which might be needed to keep the down-up sequence symmetrical. I wasn't sure, so I took the conservative route. > Also we have to add following if statement in beginning of s2io_isr(). > > if (atomic_read(&nic->card_state) == CARD_DOWN) > return IRQ_NOTHANDLED. Right! Will revise this patch shortly. > If it is ok to do BAR0 read/write in error_detected() then patch is OK. Its OK on pSeries, and I beleive it will be OK on PCIE, but I do not quite know enough about actual PCIE chipsets. --linas - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] s2io: add PCI error recovery support
Hi, s2io_card_down() will do few BAR0 read/write. As per pci-error-recovery.txt Documentation we are not supposed to do any new IO in error_detected(). Can you try using atomic_set(&sp->card_state, CARD_DOWN); instead of s2io_card_down(). Also we have to add following if statement in beginning of s2io_isr(). if (atomic_read(&nic->card_state) == CARD_DOWN) return IRQ_NOTHANDLED. If it is ok to do BAR0 read/write in error_detected() then patch is OK. Ananda -Original Message- From: Linas Vepstas [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 24, 2006 2:55 PM To: Raghavendra Koushik; Ananda Raju; Wen Xiong Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; netdev@vger.kernel.org; Jeff Garzik; Andrew Morton Subject: [PATCH] s2io: add PCI error recovery support Koushik, Raju, Please review, comment, and if you find this acceptable, please forward upstream. --linas This patch adds PCI error recovery support to the s2io 10-Gigabit ethernet device driver. Tested, seems to work well. Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]> Cc: Raghavendra Koushik <[EMAIL PROTECTED]> Cc: Ananda Raju <[EMAIL PROTECTED]> Cc: Wen Xiong <[EMAIL PROTECTED]> drivers/net/s2io.c | 77 + drivers/net/s2io.h |5 +++ 2 files changed, 82 insertions(+) Index: linux-2.6.19-rc1-git11/drivers/net/s2io.c === --- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.c 2006-10-20 12:24:17.0 -0500 +++ linux-2.6.19-rc1-git11/drivers/net/s2io.c 2006-10-24 16:19:49.0 -0500 @@ -434,11 +434,18 @@ static struct pci_device_id s2io_tbl[] _ MODULE_DEVICE_TABLE(pci, s2io_tbl); +static struct pci_error_handlers s2io_err_handler = { + .error_detected = s2io_io_error_detected, + .slot_reset = s2io_io_slot_reset, + .resume = s2io_io_resume, +}; + static struct pci_driver s2io_driver = { .name = "S2IO", .id_table = s2io_tbl, .probe = s2io_init_nic, .remove = __devexit_p(s2io_rem_nic), + .err_handler = &s2io_err_handler, }; /* A simplifier macro used both by init and free shared_mem Fns(). */ @@ -7564,3 +7571,73 @@ static void lro_append_pkt(nic_t *sp, lr sp->mac_control.stats_info->sw_stat.clubbed_frms_cnt++; return; } + +/** + * s2io_io_error_detected - called when PCI error is detected + * @pdev: Pointer to PCI device + * @state: The current pci conneection state + * + * This function is called after a PCI bus error affecting + * this device has been detected. + */ +static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev, pci_channel_state_t state) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + nic_t *sp = netdev->priv; + + netif_device_detach(netdev); + + if (netif_running(netdev)) { + /* Reset card */ + s2io_card_down(sp); + sp->device_close_flag = TRUE; /* Device is shut down. */ + } + pci_disable_device(pdev); + + return PCI_ERS_RESULT_NEED_RESET; +} + +/** + * s2io_io_slot_reset - called after the pci bus has been reset. + * @pdev: Pointer to PCI device + * + * Restart the card from scratch, as if from a cold-boot. + */ +static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + nic_t *sp = netdev->priv; + + if (pci_enable_device(pdev)) { + printk(KERN_ERR "s2io: Cannot re-enable PCI device after reset.\n"); + return PCI_ERS_RESULT_DISCONNECT; + } + + pci_set_master(pdev); + s2io_reset(sp); + + return PCI_ERS_RESULT_RECOVERED; +} + +/** + * s2io_io_resume - called when traffic can start flowing again. + * @pdev: Pointer to PCI device + * + * This callback is called when the error recovery driver tells us that + * its OK to resume normal operation. + */ +static void s2io_io_resume(struct pci_dev *pdev) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + nic_t *sp = netdev->priv; + + if (netif_running(netdev)) { + if (s2io_card_up(sp)) { + printk(KERN_ERR "s2io: can't bring device back up after reset\n"); + return; + } + } + + netif_device_attach(netdev); + netif_wake_queue(netdev); +} Index: linux-2.6.19-rc1-git11/drivers/net/s2io.h === --- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.h 2006-10-20 12:24:17.0 -0500 +++ linux-2.6.19-rc1-git11/drivers/net/s2io.h 2006-10-20 12:41:39.0 -0500 @@ -1013,6 +1013,11 @@ static void queue_rx_frame(struct sk_buf static void update_L3L4_header(nic_t *sp, lro_t *lro); static void lro_append_pkt(nic_t *sp, lro_t *lro, stru
[PATCH] s2io: add PCI error recovery support
Koushik, Raju, Please review, comment, and if you find this acceptable, please forward upstream. --linas This patch adds PCI error recovery support to the s2io 10-Gigabit ethernet device driver. Tested, seems to work well. Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]> Cc: Raghavendra Koushik <[EMAIL PROTECTED]> Cc: Ananda Raju <[EMAIL PROTECTED]> Cc: Wen Xiong <[EMAIL PROTECTED]> drivers/net/s2io.c | 77 + drivers/net/s2io.h |5 +++ 2 files changed, 82 insertions(+) Index: linux-2.6.19-rc1-git11/drivers/net/s2io.c === --- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.c 2006-10-20 12:24:17.0 -0500 +++ linux-2.6.19-rc1-git11/drivers/net/s2io.c 2006-10-24 16:19:49.0 -0500 @@ -434,11 +434,18 @@ static struct pci_device_id s2io_tbl[] _ MODULE_DEVICE_TABLE(pci, s2io_tbl); +static struct pci_error_handlers s2io_err_handler = { + .error_detected = s2io_io_error_detected, + .slot_reset = s2io_io_slot_reset, + .resume = s2io_io_resume, +}; + static struct pci_driver s2io_driver = { .name = "S2IO", .id_table = s2io_tbl, .probe = s2io_init_nic, .remove = __devexit_p(s2io_rem_nic), + .err_handler = &s2io_err_handler, }; /* A simplifier macro used both by init and free shared_mem Fns(). */ @@ -7564,3 +7571,73 @@ static void lro_append_pkt(nic_t *sp, lr sp->mac_control.stats_info->sw_stat.clubbed_frms_cnt++; return; } + +/** + * s2io_io_error_detected - called when PCI error is detected + * @pdev: Pointer to PCI device + * @state: The current pci conneection state + * + * This function is called after a PCI bus error affecting + * this device has been detected. + */ +static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev, pci_channel_state_t state) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + nic_t *sp = netdev->priv; + + netif_device_detach(netdev); + + if (netif_running(netdev)) { + /* Reset card */ + s2io_card_down(sp); + sp->device_close_flag = TRUE; /* Device is shut down. */ + } + pci_disable_device(pdev); + + return PCI_ERS_RESULT_NEED_RESET; +} + +/** + * s2io_io_slot_reset - called after the pci bus has been reset. + * @pdev: Pointer to PCI device + * + * Restart the card from scratch, as if from a cold-boot. + */ +static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + nic_t *sp = netdev->priv; + + if (pci_enable_device(pdev)) { + printk(KERN_ERR "s2io: Cannot re-enable PCI device after reset.\n"); + return PCI_ERS_RESULT_DISCONNECT; + } + + pci_set_master(pdev); + s2io_reset(sp); + + return PCI_ERS_RESULT_RECOVERED; +} + +/** + * s2io_io_resume - called when traffic can start flowing again. + * @pdev: Pointer to PCI device + * + * This callback is called when the error recovery driver tells us that + * its OK to resume normal operation. + */ +static void s2io_io_resume(struct pci_dev *pdev) +{ + struct net_device *netdev = pci_get_drvdata(pdev); + nic_t *sp = netdev->priv; + + if (netif_running(netdev)) { + if (s2io_card_up(sp)) { + printk(KERN_ERR "s2io: can't bring device back up after reset\n"); + return; + } + } + + netif_device_attach(netdev); + netif_wake_queue(netdev); +} Index: linux-2.6.19-rc1-git11/drivers/net/s2io.h === --- linux-2.6.19-rc1-git11.orig/drivers/net/s2io.h 2006-10-20 12:24:17.0 -0500 +++ linux-2.6.19-rc1-git11/drivers/net/s2io.h 2006-10-20 12:41:39.0 -0500 @@ -1013,6 +1013,11 @@ static void queue_rx_frame(struct sk_buf static void update_L3L4_header(nic_t *sp, lro_t *lro); static void lro_append_pkt(nic_t *sp, lro_t *lro, struct sk_buff *skb, u32 tcp_len); +static pci_ers_result_t s2io_io_error_detected(struct pci_dev *pdev, + pci_channel_state_t state); +static pci_ers_result_t s2io_io_slot_reset(struct pci_dev *pdev); +static void s2io_io_resume(struct pci_dev *pdev); + #define s2io_tcp_mss(skb) skb_shinfo(skb)->gso_size #define s2io_udp_mss(skb) skb_shinfo(skb)->gso_size #define s2io_offload_type(skb) skb_shinfo(skb)->gso_type - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html