Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
On Tue, 2017-09-19 at 15:45 +0300, Ville Syrjälä wrote:
> On Mon, Sep 18, 2017 at 12:52:15PM -0700, Eric Dumazet wrote:
> > On Mon, Sep 18, 2017 at 12:46 PM, Ville Syrjälä wrote:
> > >
> > > And five months later I'm still waiting for this patch to land...
> >
> > Sure, while I was on vacation, I received thousands of emails that I
> > simply have not read.
>
> Ah that explains where my previous ping went. Pardon the snarky comment,
> and thanks for taking care of this.

No problem, this was my fault not following this bug.

Thanks a lot !
Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
On Mon, Sep 18, 2017 at 12:52:15PM -0700, Eric Dumazet wrote:
> On Mon, Sep 18, 2017 at 12:46 PM, Ville Syrjälä wrote:
> >
> > And five months later I'm still waiting for this patch to land...
>
> Sure, while I was on vacation, I received thousands of emails that I
> simply have not read.

Ah that explains where my previous ping went. Pardon the snarky comment,
and thanks for taking care of this.

--
Ville Syrjälä
Intel OTC
Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
On Mon, Sep 18, 2017 at 12:46 PM, Ville Syrjälä wrote:
> And five months later I'm still waiting for this patch to land...

Sure, while I was on vacation, I received thousands of emails that I
simply have not read.
Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
On Mon, Apr 10, 2017 at 03:11:02PM +0300, Ville Syrjälä wrote:
> On Fri, Apr 07, 2017 at 11:38:49AM -0700, Eric Dumazet wrote:
> > On Fri, 2017-04-07 at 21:17 +0300, Ville Syrjälä wrote:
> > > Hi,
> > >
> > > My old P3 laptop started to die on me in the middle of larger compile
> > > jobs (using distcc) after v4.11-rc. I bisected the problem
> > > to 617f01211baf ("8139too: use napi_complete_done()").
> > >
> > > Unfortunately I wasn't able to capture a full oops as the machine doesn't
> > > have serial and ramoops failed me. I did get one partial oops on vgacon
> > > which showed rtl8139_poll() being involved (EIP was around
> > > _raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
> > > bisect result.
> > >
> > > So maybe some kind of nasty thing going between the hard irq and
> > > softirq? Perhaps UP related? I tried to stare at the locking around
> > > rtl8139_poll() for a while but it looked mostly sane to me.
> >
> > Thanks a lot for the detective work, I am so sorry for this !
> >
> > Could you try the following patch ?
> >
> > I do not really see what could be wrong, the code should run just fine
> > on UP.
> >
> > Thanks.
> >
> > diff --git a/drivers/net/ethernet/realtek/8139too.c b/drivers/net/ethernet/realtek/8139too.c
> > index 89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673 100644
> > --- a/drivers/net/ethernet/realtek/8139too.c
> > +++ b/drivers/net/ethernet/realtek/8139too.c
> > @@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, int budget)
> >  	if (likely(RTL_R16(IntrStatus) & RxAckBits))
> >  		work_done += rtl8139_rx(dev, tp, budget);
> >  
> > -	if (work_done < budget && napi_complete_done(napi, work_done)) {
> > +	if (work_done < budget) {
> >  		unsigned long flags;
> >  
> >  		spin_lock_irqsave(&tp->lock, flags);
> > -		RTL_W16_F(IntrMask, rtl8139_intr_mask);
> > +		if (napi_complete_done(napi, work_done))
> > +			RTL_W16_F(IntrMask, rtl8139_intr_mask);
> >  		spin_unlock_irqrestore(&tp->lock, flags);
> >  	}
> >  	spin_unlock(&tp->rx_lock);
>
> Yep, that patch does appear to make it stable again.
>
> Tested-by: Ville Syrjälä

And five months later I'm still waiting for this patch to land...

--
Ville Syrjälä
Intel OTC
Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
On Mon, Jun 19, 2017 at 04:44:45PM +0200, Michal Kubecek wrote:
> On Fri, Apr 07, 2017 at 11:38:49AM -0700, Eric Dumazet wrote:
> > On Fri, 2017-04-07 at 21:17 +0300, Ville Syrjälä wrote:
> > > Hi,
> > >
> > > My old P3 laptop started to die on me in the middle of larger compile
> > > jobs (using distcc) after v4.11-rc. I bisected the problem
> > > to 617f01211baf ("8139too: use napi_complete_done()").
> > >
> > > Unfortunately I wasn't able to capture a full oops as the machine doesn't
> > > have serial and ramoops failed me. I did get one partial oops on vgacon
> > > which showed rtl8139_poll() being involved (EIP was around
> > > _raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
> > > bisect result.
> > >
> > > So maybe some kind of nasty thing going between the hard irq and
> > > softirq? Perhaps UP related? I tried to stare at the locking around
> > > rtl8139_poll() for a while but it looked mostly sane to me.
> >
> > Thanks a lot for the detective work, I am so sorry for this !
> >
> > Could you try the following patch ?
> >
> > I do not really see what could be wrong, the code should run just fine
> > on UP.
> >
> > Thanks.
> >
> > diff --git a/drivers/net/ethernet/realtek/8139too.c b/drivers/net/ethernet/realtek/8139too.c
> > index 89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673 100644
> > --- a/drivers/net/ethernet/realtek/8139too.c
> > +++ b/drivers/net/ethernet/realtek/8139too.c
> > @@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, int budget)
> >  	if (likely(RTL_R16(IntrStatus) & RxAckBits))
> >  		work_done += rtl8139_rx(dev, tp, budget);
> >  
> > -	if (work_done < budget && napi_complete_done(napi, work_done)) {
> > +	if (work_done < budget) {
> >  		unsigned long flags;
> >  
> >  		spin_lock_irqsave(&tp->lock, flags);
> > -		RTL_W16_F(IntrMask, rtl8139_intr_mask);
> > +		if (napi_complete_done(napi, work_done))
> > +			RTL_W16_F(IntrMask, rtl8139_intr_mask);
> >  		spin_unlock_irqrestore(&tp->lock, flags);
> >  	}
> >  	spin_unlock(&tp->rx_lock);
>
> Eric,
>
> we have a bugreport of what seems to be the same problem:
>
> https://bugzilla.suse.com/show_bug.cgi?id=1042208
>
> Do you plan to submit the patch above or is the conclusion that this is
> rather a hardware problem?

Could someone please push this patch forward? My machine just died again
with a fresh kernel because I forgot that I still need this patch.

--
Ville Syrjälä
Intel OTC
Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
On Fri, Apr 07, 2017 at 11:38:49AM -0700, Eric Dumazet wrote:
> On Fri, 2017-04-07 at 21:17 +0300, Ville Syrjälä wrote:
> > Hi,
> >
> > My old P3 laptop started to die on me in the middle of larger compile
> > jobs (using distcc) after v4.11-rc. I bisected the problem
> > to 617f01211baf ("8139too: use napi_complete_done()").
> >
> > Unfortunately I wasn't able to capture a full oops as the machine doesn't
> > have serial and ramoops failed me. I did get one partial oops on vgacon
> > which showed rtl8139_poll() being involved (EIP was around
> > _raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
> > bisect result.
> >
> > So maybe some kind of nasty thing going between the hard irq and
> > softirq? Perhaps UP related? I tried to stare at the locking around
> > rtl8139_poll() for a while but it looked mostly sane to me.
>
> Thanks a lot for the detective work, I am so sorry for this !
>
> Could you try the following patch ?
>
> I do not really see what could be wrong, the code should run just fine
> on UP.
>
> Thanks.
>
> diff --git a/drivers/net/ethernet/realtek/8139too.c b/drivers/net/ethernet/realtek/8139too.c
> index 89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673 100644
> --- a/drivers/net/ethernet/realtek/8139too.c
> +++ b/drivers/net/ethernet/realtek/8139too.c
> @@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, int budget)
>  	if (likely(RTL_R16(IntrStatus) & RxAckBits))
>  		work_done += rtl8139_rx(dev, tp, budget);
>  
> -	if (work_done < budget && napi_complete_done(napi, work_done)) {
> +	if (work_done < budget) {
>  		unsigned long flags;
>  
>  		spin_lock_irqsave(&tp->lock, flags);
> -		RTL_W16_F(IntrMask, rtl8139_intr_mask);
> +		if (napi_complete_done(napi, work_done))
> +			RTL_W16_F(IntrMask, rtl8139_intr_mask);
>  		spin_unlock_irqrestore(&tp->lock, flags);
>  	}
>  	spin_unlock(&tp->rx_lock);

Eric,

we have a bugreport of what seems to be the same problem:

https://bugzilla.suse.com/show_bug.cgi?id=1042208

Do you plan to submit the patch above or is the conclusion that this is
rather a hardware problem?

Michal Kubecek
Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
On Fri, 2017-04-21 at 06:29 -0700, Eric Dumazet wrote:
> Thanks for this report.
>
> Interesting to see how many drivers got the netpoll stuff wrong :/
>
> Can you try :
>
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 81f18a8335276495a59fa93219c4607c2b8a47aa..74e4c72c331d5a6cc5b653970ef4133c8ddf 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -7668,7 +7668,7 @@ static void rtl8169_netpoll(struct net_device *dev)
>  {
>  	struct rtl8169_private *tp = netdev_priv(dev);
>  
> -	rtl8169_interrupt(tp->pci_dev->irq, dev);
> +	napi_schedule(&tp->napi);

The problem is more likely that netconsole handling can call rtl_tx()
from hard irq context, while standard NAPI poll calls it from BH.

Meaning that the following sequence triggers a lockdep warning.

	u64_stats_update_begin(&tp->tx_stats.syncp);
	tp->tx_stats.packets++;
	tp->tx_stats.bytes += tx_skb->skb->len;
	u64_stats_update_end(&tp->tx_stats.syncp);

Lockdep does not know that poll_napi() (called from netpoll_poll_dev())
uses a cmpxchg() to make sure that there is no race.

I am not sure how we can teach lockdep to not splat in this case.
Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
On Fri, 2017-04-21 at 08:09 -0700, Eric Dumazet wrote:
> On Fri, 2017-04-21 at 06:29 -0700, Eric Dumazet wrote:
> >
> > Thanks for this report.
> >
> > Interesting to see how many drivers got the netpoll stuff wrong :/
> >
> > Can you try :
> >
> > diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> > index 81f18a8335276495a59fa93219c4607c2b8a47aa..74e4c72c331d5a6cc5b653970ef4133c8ddf 100644
> > --- a/drivers/net/ethernet/realtek/r8169.c
> > +++ b/drivers/net/ethernet/realtek/r8169.c
> > @@ -7668,7 +7668,7 @@ static void rtl8169_netpoll(struct net_device *dev)
> >  {
> >  	struct rtl8169_private *tp = netdev_priv(dev);
> >  
> > -	rtl8169_interrupt(tp->pci_dev->irq, dev);
> > +	napi_schedule(&tp->napi);
>
> The problem is more likely that netconsole handling can call rtl_tx()
> from hard irq context, while standard NAPI poll calls it from BH.
>
> Meaning that the following sequence triggers a lockdep warning.
>
> 	u64_stats_update_begin(&tp->tx_stats.syncp);
> 	tp->tx_stats.packets++;
> 	tp->tx_stats.bytes += tx_skb->skb->len;
> 	u64_stats_update_end(&tp->tx_stats.syncp);
>
> Lockdep does not know that poll_napi() (called from netpoll_poll_dev())
> uses a cmpxchg() to make sure that there is no race.
>
> I am not sure how we can teach lockdep to not splat in this case.

Well, could you try the following patch ?

Thanks !

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 0a8f2817ea60f2172eb28177473a4879f85bd18a..f64f812b86029b772bb245e51cdc2263adc4e6ea 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7313,10 +7313,10 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 		rtl8169_unmap_tx_skb(&tp->pci_dev->dev, tx_skb,
 				     tp->TxDescArray + entry);
 		if (status & LastFrag) {
-			u64_stats_update_begin(&tp->tx_stats.syncp);
+			u64_stats_update_begin_raw(&tp->tx_stats.syncp);
 			tp->tx_stats.packets++;
 			tp->tx_stats.bytes += tx_skb->skb->len;
-			u64_stats_update_end(&tp->tx_stats.syncp);
+			u64_stats_update_end_raw(&tp->tx_stats.syncp);
 			dev_kfree_skb_any(tx_skb->skb);
 			tx_skb->skb = NULL;
 		}
Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
On Fri, 2017-04-21 at 14:40 +0300, Ville Syrjälä wrote:
> BTW I've also been getting some lockdep grief from r8169 netpoll stuff
> recently. Not sure if it might be related to these changes or not, but
> I don't remember seeing this sort of stuff until quite recently.
>
> [ 251.911044] ==
> [ 251.911044] [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> [ 251.911045] 4.11.0-rc6-elk+ #101 Not tainted
> [ 251.911045] --
> [ 251.911046] kms_plane_blink/2132 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> [ 251.911046]  (&syncp->seq#2){+.-...}, at: [] netpoll_poll_dev+0xb3/0x2a0
> [ 251.911048]
> [ 251.911048] and this task is already holding:
> [ 251.911049]  (target_list_lock){-.}, at: [] write_msg+0x3f/0xd0 [netconsole]
> [ 251.911050] which would create a new lock dependency:
> [ 251.911050]  (target_list_lock){-.} -> (&syncp->seq#2){+.-...}
> [ 251.911053]
> [ 251.911053] but this new dependency connects a HARDIRQ-irq-safe lock:
> [ 251.911054]  (target_list_lock){-.}
> [ 251.911055] ... which became HARDIRQ-irq-safe at:
> [ 251.911055]   __lock_acquire+0x6ec/0x1440
> [ 251.911056]   lock_acquire+0x7e/0x1a0
> [ 251.911056]   _raw_spin_lock_irqsave+0x2b/0x40
> [ 251.911056]   write_msg+0x3f/0xd0 [netconsole]
> [ 251.911057]   console_unlock+0x41c/0x570
> [ 251.911057]   vprintk_emit+0x26d/0x350
> [ 251.911058]   vprintk_default+0x12/0x20
> [ 251.911058]   vprintk_func+0x1c/0x50
> [ 251.911058]   printk+0xe/0x10
> [ 251.911059]   drm_printk+0x5a/0x60 [drm]
> [ 251.911059]   intel_get_hpd_pins+0x87/0xa0 [i915]
> [ 251.911060]   i9xx_hpd_irq_handler+0x96/0xe0 [i915]
> [ 251.911060]   i965_irq_handler+0x2c8/0x2f0 [i915]
> [ 251.911060]   __handle_irq_event_percpu+0x38/0x360
> [ 251.911061]   handle_irq_event_percpu+0x19/0x50
> [ 251.911061]   handle_irq_event+0x29/0x50
> [ 251.911062]   handle_edge_irq+0x65/0x110
> [ 251.911062]   handle_irq+0x9f/0xc0
> [ 251.911062]   do_IRQ+0x4f/0x110
> [ 251.911063]   common_interrupt+0x36/0x3c
> [ 251.911063]   cpuidle_enter_state+0xcc/0x390
> [ 251.911064]   cpuidle_enter+0xf/0x20
> [ 251.911064]   call_cpuidle+0x1c/0x40
> [ 251.911064]   do_idle+0x164/0x1d0
> [ 251.911065]   cpu_startup_entry+0x1d/0x20
> [ 251.911065]   start_secondary+0x104/0x170
> [ 251.911066]   startup_32_smp+0x16b/0x16d
> [ 251.911066]
> [ 251.911066] to a HARDIRQ-irq-unsafe lock:
> [ 251.911067]  (&syncp->seq#2){+.-...}
> [ 251.911068] ... which became HARDIRQ-irq-unsafe at:
> [ 251.911068] ...  __lock_acquire+0x5d7/0x1440
> [ 251.911069]   lock_acquire+0x7e/0x1a0
> [ 251.911069]   rtl8169_poll+0x474/0x620 [r8169]
> [ 251.911070]   net_rx_action+0x1d0/0x3f0
> [ 251.911070]   __do_softirq+0x17d/0x460
> [ 251.911070]   do_softirq_own_stack+0x1d/0x30
> [ 251.911071]   irq_exit+0xa5/0xb0
> [ 251.911071]   do_IRQ+0x58/0x110
> [ 251.911072]   common_interrupt+0x36/0x3c
> [ 251.911072]   cpuidle_enter_state+0xcc/0x390
> [ 251.911072]   cpuidle_enter+0xf/0x20
> [ 251.911073]   call_cpuidle+0x1c/0x40
> [ 251.911073]   do_idle+0x164/0x1d0
> [ 251.911073]   cpu_startup_entry+0x1d/0x20
> [ 251.911074]   rest_init+0x10c/0x120
> [ 251.911074]   start_kernel+0x30d/0x312
> [ 251.911075]   i386_start_kernel+0x85/0x89
> [ 251.911075]   startup_32_smp+0x16b/0x16d
> [ 251.911075]
> [ 251.911076] other info that might help us debug this:
> [ 251.911076]
> [ 251.911077]  Possible interrupt unsafe locking scenario:
> [ 251.911077]
> [ 251.911077]        CPU0                    CPU1
> [ 251.911078]        ----                    ----
> [ 251.911078]   lock(&syncp->seq#2);
> [ 251.911079]                                local_irq_disable();
> [ 251.911080]                                lock(target_list_lock);
> [ 251.911081]                                lock(&syncp->seq#2);
> [ 251.911082]   <Interrupt>
> [ 251.911082]     lock(target_list_lock);
> [ 251.911083]
> [ 251.911084]  *** DEADLOCK ***
> [ 251.911084]
> [ 251.911084] 2 locks held by kms_plane_blink/2132:
> [ 251.911085]  #0:  (console_lock){+.+.+.}, at: [] vprintk_emit+0x264/0x350
> [ 251.911086]  #1:  (target_list_lock){-.}, at: [] write_msg+0x3f/0xd0 [netconsole]
> [ 251.911088]
> [ 251.911088] the dependencies between HARDIRQ-irq-safe lock and the holding lock:
> [ 251.911089] -> (target_list_lock){-.} ops: 482 {
> [ 251.911090]    IN-HARDIRQ-W at:
> [ 251.911091]      __lock_acquire+0x6ec/0x1440
> [ 251.911092]      lock_acquire+0x7e/0x1a0
> [ 251.911092]      _raw_spin_lock_irqsave+0x2b/0x40
> [ 251.911092]      write_msg+0x3f/0xd0 [netconsole]
> [ 251.911093]      console_unlock+0x41c/0x570
> [ 251.911093]      vprintk_emit+0x26d/0x350
> [ 251.911094]      vprintk_default+0x12/0x20
> [ 251.911094]
Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
BTW I've also been getting some lockdep grief from r8169 netpoll stuff
recently. Not sure if it might be related to these changes or not, but
I don't remember seeing this sort of stuff until quite recently.

[ 251.911044] ==
[ 251.911044] [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
[ 251.911045] 4.11.0-rc6-elk+ #101 Not tainted
[ 251.911045] --
[ 251.911046] kms_plane_blink/2132 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[ 251.911046]  (&syncp->seq#2){+.-...}, at: [] netpoll_poll_dev+0xb3/0x2a0
[ 251.911048]
[ 251.911048] and this task is already holding:
[ 251.911049]  (target_list_lock){-.}, at: [] write_msg+0x3f/0xd0 [netconsole]
[ 251.911050] which would create a new lock dependency:
[ 251.911050]  (target_list_lock){-.} -> (&syncp->seq#2){+.-...}
[ 251.911053]
[ 251.911053] but this new dependency connects a HARDIRQ-irq-safe lock:
[ 251.911054]  (target_list_lock){-.}
[ 251.911055] ... which became HARDIRQ-irq-safe at:
[ 251.911055]   __lock_acquire+0x6ec/0x1440
[ 251.911056]   lock_acquire+0x7e/0x1a0
[ 251.911056]   _raw_spin_lock_irqsave+0x2b/0x40
[ 251.911056]   write_msg+0x3f/0xd0 [netconsole]
[ 251.911057]   console_unlock+0x41c/0x570
[ 251.911057]   vprintk_emit+0x26d/0x350
[ 251.911058]   vprintk_default+0x12/0x20
[ 251.911058]   vprintk_func+0x1c/0x50
[ 251.911058]   printk+0xe/0x10
[ 251.911059]   drm_printk+0x5a/0x60 [drm]
[ 251.911059]   intel_get_hpd_pins+0x87/0xa0 [i915]
[ 251.911060]   i9xx_hpd_irq_handler+0x96/0xe0 [i915]
[ 251.911060]   i965_irq_handler+0x2c8/0x2f0 [i915]
[ 251.911060]   __handle_irq_event_percpu+0x38/0x360
[ 251.911061]   handle_irq_event_percpu+0x19/0x50
[ 251.911061]   handle_irq_event+0x29/0x50
[ 251.911062]   handle_edge_irq+0x65/0x110
[ 251.911062]   handle_irq+0x9f/0xc0
[ 251.911062]   do_IRQ+0x4f/0x110
[ 251.911063]   common_interrupt+0x36/0x3c
[ 251.911063]   cpuidle_enter_state+0xcc/0x390
[ 251.911064]   cpuidle_enter+0xf/0x20
[ 251.911064]   call_cpuidle+0x1c/0x40
[ 251.911064]   do_idle+0x164/0x1d0
[ 251.911065]   cpu_startup_entry+0x1d/0x20
[ 251.911065]   start_secondary+0x104/0x170
[ 251.911066]   startup_32_smp+0x16b/0x16d
[ 251.911066]
[ 251.911066] to a HARDIRQ-irq-unsafe lock:
[ 251.911067]  (&syncp->seq#2){+.-...}
[ 251.911068] ... which became HARDIRQ-irq-unsafe at:
[ 251.911068] ...  __lock_acquire+0x5d7/0x1440
[ 251.911069]   lock_acquire+0x7e/0x1a0
[ 251.911069]   rtl8169_poll+0x474/0x620 [r8169]
[ 251.911070]   net_rx_action+0x1d0/0x3f0
[ 251.911070]   __do_softirq+0x17d/0x460
[ 251.911070]   do_softirq_own_stack+0x1d/0x30
[ 251.911071]   irq_exit+0xa5/0xb0
[ 251.911071]   do_IRQ+0x58/0x110
[ 251.911072]   common_interrupt+0x36/0x3c
[ 251.911072]   cpuidle_enter_state+0xcc/0x390
[ 251.911072]   cpuidle_enter+0xf/0x20
[ 251.911073]   call_cpuidle+0x1c/0x40
[ 251.911073]   do_idle+0x164/0x1d0
[ 251.911073]   cpu_startup_entry+0x1d/0x20
[ 251.911074]   rest_init+0x10c/0x120
[ 251.911074]   start_kernel+0x30d/0x312
[ 251.911075]   i386_start_kernel+0x85/0x89
[ 251.911075]   startup_32_smp+0x16b/0x16d
[ 251.911075]
[ 251.911076] other info that might help us debug this:
[ 251.911076]
[ 251.911077]  Possible interrupt unsafe locking scenario:
[ 251.911077]
[ 251.911077]        CPU0                    CPU1
[ 251.911078]        ----                    ----
[ 251.911078]   lock(&syncp->seq#2);
[ 251.911079]                                local_irq_disable();
[ 251.911080]                                lock(target_list_lock);
[ 251.911081]                                lock(&syncp->seq#2);
[ 251.911082]   <Interrupt>
[ 251.911082]     lock(target_list_lock);
[ 251.911083]
[ 251.911084]  *** DEADLOCK ***
[ 251.911084]
[ 251.911084] 2 locks held by kms_plane_blink/2132:
[ 251.911085]  #0:  (console_lock){+.+.+.}, at: [] vprintk_emit+0x264/0x350
[ 251.911086]  #1:  (target_list_lock){-.}, at: [] write_msg+0x3f/0xd0 [netconsole]
[ 251.911088]
[ 251.911088] the dependencies between HARDIRQ-irq-safe lock and the holding lock:
[ 251.911089] -> (target_list_lock){-.} ops: 482 {
[ 251.911090]    IN-HARDIRQ-W at:
[ 251.911091]      __lock_acquire+0x6ec/0x1440
[ 251.911092]      lock_acquire+0x7e/0x1a0
[ 251.911092]      _raw_spin_lock_irqsave+0x2b/0x40
[ 251.911092]      write_msg+0x3f/0xd0 [netconsole]
[ 251.911093]      console_unlock+0x41c/0x570
[ 251.911093]      vprintk_emit+0x26d/0x350
[ 251.911094]      vprintk_default+0x12/0x20
[ 251.911094]      vprintk_func+0x1c/0x50
[ 251.911095]      printk+0xe/0x10
[ 251.911095]      drm_printk+0x5a/0x60 [drm]
[ 251.911096]      intel_get_hpd_pins+0x87/0xa0 [i915]
[ 251.911096]
Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
On Mon, Apr 10, 2017 at 7:22 AM, tedheadster wrote:
> On Sat, Apr 8, 2017 at 6:23 AM, Francois Romieu wrote:
> > David Miller :
> > [...]
> > > One theory is that the interrupt masking isn't working properly
> > > and interrupts are still arriving and hitting the NAPI state even
> > > when we are actively polling NAPI.
> > >
> > > And this problem was masked by the locking done here.
> >
> > Yes.
> >
> > Ville, can you rule out irq sharing between the 8139 and some other
> > device ? It's a candidate for unexpected interrupt handler invocation
> > with older pc, even with properly working hardware.
>
> Eric,
>   If napi_complete_done() calls could affect drivers on older
> hardware, I can test the following:
>
> drivers/net/ethernet/3com/typhoon.c
> drivers/net/ethernet/amd/pcnet32.c
> drivers/net/ethernet/broadcom/tg3.c
> drivers/net/ethernet/dec/tulip/interrupt.c
> drivers/net/ethernet/intel/e100.c
> drivers/net/ethernet/intel/e1000/e1000_main.c
> drivers/net/ethernet/smsc/epic100.c
> drivers/net/ethernet/via/via-rhine.c

That would be great, thanks Matthew.
Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
On Sat, Apr 8, 2017 at 6:23 AM, Francois Romieu wrote:
> David Miller :
> [...]
> > One theory is that the interrupt masking isn't working properly
> > and interrupts are still arriving and hitting the NAPI state even
> > when we are actively polling NAPI.
> >
> > And this problem was masked by the locking done here.
>
> Yes.
>
> Ville, can you rule out irq sharing between the 8139 and some other
> device ? It's a candidate for unexpected interrupt handler invocation
> with older pc, even with properly working hardware.

Eric,
  If napi_complete_done() calls could affect drivers on older
hardware, I can test the following:

drivers/net/ethernet/3com/typhoon.c
drivers/net/ethernet/amd/pcnet32.c
drivers/net/ethernet/broadcom/tg3.c
drivers/net/ethernet/dec/tulip/interrupt.c
drivers/net/ethernet/intel/e100.c
drivers/net/ethernet/intel/e1000/e1000_main.c
drivers/net/ethernet/smsc/epic100.c
drivers/net/ethernet/via/via-rhine.c

- Matthew
Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
On Fri, Apr 07, 2017 at 11:38:49AM -0700, Eric Dumazet wrote:
> On Fri, 2017-04-07 at 21:17 +0300, Ville Syrjälä wrote:
> > Hi,
> >
> > My old P3 laptop started to die on me in the middle of larger compile
> > jobs (using distcc) after v4.11-rc. I bisected the problem
> > to 617f01211baf ("8139too: use napi_complete_done()").
> >
> > Unfortunately I wasn't able to capture a full oops as the machine doesn't
> > have serial and ramoops failed me. I did get one partial oops on vgacon
> > which showed rtl8139_poll() being involved (EIP was around
> > _raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
> > bisect result.
> >
> > So maybe some kind of nasty thing going between the hard irq and
> > softirq? Perhaps UP related? I tried to stare at the locking around
> > rtl8139_poll() for a while but it looked mostly sane to me.
>
> Thanks a lot for the detective work, I am so sorry for this !
>
> Could you try the following patch ?
>
> I do not really see what could be wrong, the code should run just fine
> on UP.
>
> Thanks.
>
> diff --git a/drivers/net/ethernet/realtek/8139too.c b/drivers/net/ethernet/realtek/8139too.c
> index 89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673 100644
> --- a/drivers/net/ethernet/realtek/8139too.c
> +++ b/drivers/net/ethernet/realtek/8139too.c
> @@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, int budget)
>  	if (likely(RTL_R16(IntrStatus) & RxAckBits))
>  		work_done += rtl8139_rx(dev, tp, budget);
>  
> -	if (work_done < budget && napi_complete_done(napi, work_done)) {
> +	if (work_done < budget) {
>  		unsigned long flags;
>  
>  		spin_lock_irqsave(&tp->lock, flags);
> -		RTL_W16_F(IntrMask, rtl8139_intr_mask);
> +		if (napi_complete_done(napi, work_done))
> +			RTL_W16_F(IntrMask, rtl8139_intr_mask);
>  		spin_unlock_irqrestore(&tp->lock, flags);
>  	}
>  	spin_unlock(&tp->rx_lock);

Yep, that patch does appear to make it stable again.

Tested-by: Ville Syrjälä

--
Ville Syrjälä
Intel OTC
Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
David Miller :
[...]
> One theory is that the interrupt masking isn't working properly
> and interrupts are still arriving and hitting the NAPI state even
> when we are actively polling NAPI.
>
> And this problem was masked by the locking done here.

Yes.

Ville, can you rule out irq sharing between the 8139 and some other
device ? It's a candidate for unexpected interrupt handler invocation
with older pc, even with properly working hardware.

--
Ueimor
Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
From: Eric Dumazet
Date: Fri, 07 Apr 2017 11:38:49 -0700

> I do not really see what could be wrong, the code should run just fine
> on UP.

One theory is that the interrupt masking isn't working properly
and interrupts are still arriving and hitting the NAPI state even
when we are actively polling NAPI.

And this problem was masked by the locking done here.
Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
On Fri, 2017-04-07 at 21:17 +0300, Ville Syrjälä wrote:
> Hi,
>
> My old P3 laptop started to die on me in the middle of larger compile
> jobs (using distcc) after v4.11-rc. I bisected the problem
> to 617f01211baf ("8139too: use napi_complete_done()").
>
> Unfortunately I wasn't able to capture a full oops as the machine doesn't
> have serial and ramoops failed me. I did get one partial oops on vgacon
> which showed rtl8139_poll() being involved (EIP was around
> _raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
> bisect result.
>
> So maybe some kind of nasty thing going between the hard irq and
> softirq? Perhaps UP related? I tried to stare at the locking around
> rtl8139_poll() for a while but it looked mostly sane to me.

Thanks a lot for the detective work, I am so sorry for this !

Could you try the following patch ?

I do not really see what could be wrong, the code should run just fine
on UP.

Thanks.

diff --git a/drivers/net/ethernet/realtek/8139too.c b/drivers/net/ethernet/realtek/8139too.c
index 89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673 100644
--- a/drivers/net/ethernet/realtek/8139too.c
+++ b/drivers/net/ethernet/realtek/8139too.c
@@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, int budget)
 	if (likely(RTL_R16(IntrStatus) & RxAckBits))
 		work_done += rtl8139_rx(dev, tp, budget);
 
-	if (work_done < budget && napi_complete_done(napi, work_done)) {
+	if (work_done < budget) {
 		unsigned long flags;
 
 		spin_lock_irqsave(&tp->lock, flags);
-		RTL_W16_F(IntrMask, rtl8139_intr_mask);
+		if (napi_complete_done(napi, work_done))
+			RTL_W16_F(IntrMask, rtl8139_intr_mask);
		spin_unlock_irqrestore(&tp->lock, flags);
 	}
 	spin_unlock(&tp->rx_lock);
[regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")
Hi,

My old P3 laptop started to die on me in the middle of larger compile
jobs (using distcc) after v4.11-rc. I bisected the problem
to 617f01211baf ("8139too: use napi_complete_done()").

Unfortunately I wasn't able to capture a full oops as the machine doesn't
have serial and ramoops failed me. I did get one partial oops on vgacon
which showed rtl8139_poll() being involved (EIP was around
_raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
bisect result.

So maybe some kind of nasty thing going between the hard irq and
softirq? Perhaps UP related? I tried to stare at the locking around
rtl8139_poll() for a while but it looked mostly sane to me.

--
Ville Syrjälä
Intel OTC