Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-09-19 Thread Eric Dumazet
On Tue, 2017-09-19 at 15:45 +0300, Ville Syrjälä wrote:
> On Mon, Sep 18, 2017 at 12:52:15PM -0700, Eric Dumazet wrote:
> > On Mon, Sep 18, 2017 at 12:46 PM, Ville Syrjälä
> >  wrote:
> > 
> > > And five months later I'm still waiting for this patch to land...
> > 
> > Sure, while I was on vacation, I received thousands of emails that I
> > simply have not read.
> 
> Ah that explains where my previous ping went. Pardon the snarky comment,
> and thanks for taking care of this.
> 

No problem, this was my fault not following this bug.

Thanks a lot !




Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-09-19 Thread Ville Syrjälä
On Mon, Sep 18, 2017 at 12:52:15PM -0700, Eric Dumazet wrote:
> On Mon, Sep 18, 2017 at 12:46 PM, Ville Syrjälä
>  wrote:
> 
> > And five months later I'm still waiting for this patch to land...
> 
> Sure, while I was on vacation, I received thousands of emails that I
> simply have not read.

Ah that explains where my previous ping went. Pardon the snarky comment,
and thanks for taking care of this.

-- 
Ville Syrjälä
Intel OTC


Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-09-18 Thread Eric Dumazet
On Mon, Sep 18, 2017 at 12:46 PM, Ville Syrjälä
 wrote:

> And five months later I'm still waiting for this patch to land...

Sure, while I was on vacation, I received thousands of emails that I
simply have not read.


Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-09-18 Thread Ville Syrjälä
On Mon, Apr 10, 2017 at 03:11:02PM +0300, Ville Syrjälä wrote:
> On Fri, Apr 07, 2017 at 11:38:49AM -0700, Eric Dumazet wrote:
> > On Fri, 2017-04-07 at 21:17 +0300, Ville Syrjälä wrote:
> > > Hi,
> > > 
> > > My old P3 laptop started to die on me in the middle of larger compile
> > > jobs (using distcc) after v4.11-rc. I bisected the problem
> > > to 617f01211baf ("8139too: use napi_complete_done()").
> > > 
> > > Unfortunately I wasn't able to capture a full oops as the machine doesn't
> > > have serial and ramoops failed me. I did get one partial oops on vgacon
> > > which showed rtl8139_poll() being involved (EIP was around
> > > _raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
> > > bisect result.
> > > 
> > > So maybe some kind of nasty thing going between the hard irq and
> > > softirq? Perhaps UP related? I tried to stare at the locking around
> > > rtl8139_poll() for a while but it looked mostly sane to me.
> > > 
> > 
> > Thanks a lot for the detective work, I am so sorry for this !
> > 
> > Could you try the following patch ?
> > 
> > I do not really see what could be wrong, the code should run just fine
> > on UP.
> > 
> > Thanks.
> > 
> > diff --git a/drivers/net/ethernet/realtek/8139too.c 
> > b/drivers/net/ethernet/realtek/8139too.c
> > index 
> > 89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673
> >  100644
> > --- a/drivers/net/ethernet/realtek/8139too.c
> > +++ b/drivers/net/ethernet/realtek/8139too.c
> > @@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, 
> > int budget)
> > if (likely(RTL_R16(IntrStatus) & RxAckBits))
> > work_done += rtl8139_rx(dev, tp, budget);
> >  
> > -   if (work_done < budget && napi_complete_done(napi, work_done)) {
> > +   if (work_done < budget) {
> > unsigned long flags;
> >  
> > spin_lock_irqsave(>lock, flags);
> > -   RTL_W16_F(IntrMask, rtl8139_intr_mask);
> > +   if (napi_complete_done(napi, work_done))
> > +   RTL_W16_F(IntrMask, rtl8139_intr_mask);
> > spin_unlock_irqrestore(>lock, flags);
> > }
> > spin_unlock(>rx_lock);
> > 
> > 
> 
> Yep, that patch does appear to make it stable again.
> 
> Tested-by: Ville Syrjälä 

And five months later I'm still waiting for this patch to land...

-- 
Ville Syrjälä
Intel OTC


Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-07-05 Thread Ville Syrjälä
On Mon, Jun 19, 2017 at 04:44:45PM +0200, Michal Kubecek wrote:
> On Fri, Apr 07, 2017 at 11:38:49AM -0700, Eric Dumazet wrote:
> > On Fri, 2017-04-07 at 21:17 +0300, Ville Syrjälä wrote:
> > > Hi,
> > > 
> > > My old P3 laptop started to die on me in the middle of larger compile
> > > jobs (using distcc) after v4.11-rc. I bisected the problem
> > > to 617f01211baf ("8139too: use napi_complete_done()").
> > > 
> > > Unfortunately I wasn't able to capture a full oops as the machine doesn't
> > > have serial and ramoops failed me. I did get one partial oops on vgacon
> > > which showed rtl8139_poll() being involved (EIP was around
> > > _raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
> > > bisect result.
> > > 
> > > So maybe some kind of nasty thing going between the hard irq and
> > > softirq? Perhaps UP related? I tried to stare at the locking around
> > > rtl8139_poll() for a while but it looked mostly sane to me.
> > > 
> > 
> > Thanks a lot for the detective work, I am so sorry for this !
> > 
> > Could you try the following patch ?
> > 
> > I do not really see what could be wrong, the code should run just fine
> > on UP.
> > 
> > Thanks.
> > 
> > diff --git a/drivers/net/ethernet/realtek/8139too.c 
> > b/drivers/net/ethernet/realtek/8139too.c
> > index 
> > 89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673
> >  100644
> > --- a/drivers/net/ethernet/realtek/8139too.c
> > +++ b/drivers/net/ethernet/realtek/8139too.c
> > @@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, 
> > int budget)
> > if (likely(RTL_R16(IntrStatus) & RxAckBits))
> > work_done += rtl8139_rx(dev, tp, budget);
> >  
> > -   if (work_done < budget && napi_complete_done(napi, work_done)) {
> > +   if (work_done < budget) {
> > unsigned long flags;
> >  
> > spin_lock_irqsave(>lock, flags);
> > -   RTL_W16_F(IntrMask, rtl8139_intr_mask);
> > +   if (napi_complete_done(napi, work_done))
> > +   RTL_W16_F(IntrMask, rtl8139_intr_mask);
> > spin_unlock_irqrestore(>lock, flags);
> > }
> > spin_unlock(>rx_lock);
> 
> Eric,
> 
> we have a bugreport of what seems to be the same problem:
> 
>   https://bugzilla.suse.com/show_bug.cgi?id=1042208
> 
> Do you plan to submit the patch above or is the conclusion that this is
> rather a hardware problem?

Could someone please push this patch forward? My machine just died again
with a fresh kernel because I forgot that I still need this patch.

-- 
Ville Syrjälä
Intel OTC


Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-06-19 Thread Michal Kubecek
On Fri, Apr 07, 2017 at 11:38:49AM -0700, Eric Dumazet wrote:
> On Fri, 2017-04-07 at 21:17 +0300, Ville Syrjälä wrote:
> > Hi,
> > 
> > My old P3 laptop started to die on me in the middle of larger compile
> > jobs (using distcc) after v4.11-rc. I bisected the problem
> > to 617f01211baf ("8139too: use napi_complete_done()").
> > 
> > Unfortunately I wasn't able to capture a full oops as the machine doesn't
> > have serial and ramoops failed me. I did get one partial oops on vgacon
> > which showed rtl8139_poll() being involved (EIP was around
> > _raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
> > bisect result.
> > 
> > So maybe some kind of nasty thing going between the hard irq and
> > softirq? Perhaps UP related? I tried to stare at the locking around
> > rtl8139_poll() for a while but it looked mostly sane to me.
> > 
> 
> Thanks a lot for the detective work, I am so sorry for this !
> 
> Could you try the following patch ?
> 
> I do not really see what could be wrong, the code should run just fine
> on UP.
> 
> Thanks.
> 
> diff --git a/drivers/net/ethernet/realtek/8139too.c 
> b/drivers/net/ethernet/realtek/8139too.c
> index 
> 89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673
>  100644
> --- a/drivers/net/ethernet/realtek/8139too.c
> +++ b/drivers/net/ethernet/realtek/8139too.c
> @@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, int 
> budget)
>   if (likely(RTL_R16(IntrStatus) & RxAckBits))
>   work_done += rtl8139_rx(dev, tp, budget);
>  
> - if (work_done < budget && napi_complete_done(napi, work_done)) {
> + if (work_done < budget) {
>   unsigned long flags;
>  
>   spin_lock_irqsave(>lock, flags);
> - RTL_W16_F(IntrMask, rtl8139_intr_mask);
> + if (napi_complete_done(napi, work_done))
> + RTL_W16_F(IntrMask, rtl8139_intr_mask);
>   spin_unlock_irqrestore(>lock, flags);
>   }
>   spin_unlock(>rx_lock);

Eric,

we have a bugreport of what seems to be the same problem:

  https://bugzilla.suse.com/show_bug.cgi?id=1042208

Do you plan to submit the patch above or is the conclusion that this is
rather a hardware problem?

Michal Kubecek



Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-04-21 Thread Eric Dumazet
On Fri, 2017-04-21 at 06:29 -0700, Eric Dumazet wrote:

> Thanks for this report.
> 
> Interesting to see how many drivers got the netpoll stuff wrong :/
> 
> Can you try :
> 
> diff --git a/drivers/net/ethernet/realtek/r8169.c 
> b/drivers/net/ethernet/realtek/r8169.c
> index 
> 81f18a8335276495a59fa93219c4607c2b8a47aa..74e4c72c331d5a6cc5b653970ef4133c8ddf
>  100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -7668,7 +7668,7 @@ static void rtl8169_netpoll(struct net_device *dev)
>  {
>   struct rtl8169_private *tp = netdev_priv(dev);
>  
> - rtl8169_interrupt(tp->pci_dev->irq, dev);
> + napi_schedule(>napi);

The problem is more likely that netconsole handling can call rtl_tx()
from hard irq context, while standard NAPI poll calls it from BH

Meaning that the following sequence triggers a lockdep warning.

u64_stats_update_begin(>tx_stats.syncp);
tp->tx_stats.packets++;
tp->tx_stats.bytes += tx_skb->skb->len;
u64_stats_update_end(>tx_stats.syncp);

Lockdep does not know that poll_napi() ( called from netpoll_poll_dev())
uses an cmpxchg() to make sure that there is no race.

I am not sure how we can teach lockdep to not splat in this case.






Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-04-21 Thread Eric Dumazet
On Fri, 2017-04-21 at 08:09 -0700, Eric Dumazet wrote:
> On Fri, 2017-04-21 at 06:29 -0700, Eric Dumazet wrote:
> 
> > Thanks for this report.
> > 
> > Interesting to see how many drivers got the netpoll stuff wrong :/
> > 
> > Can you try :
> > 
> > diff --git a/drivers/net/ethernet/realtek/r8169.c 
> > b/drivers/net/ethernet/realtek/r8169.c
> > index 
> > 81f18a8335276495a59fa93219c4607c2b8a47aa..74e4c72c331d5a6cc5b653970ef4133c8ddf
> >  100644
> > --- a/drivers/net/ethernet/realtek/r8169.c
> > +++ b/drivers/net/ethernet/realtek/r8169.c
> > @@ -7668,7 +7668,7 @@ static void rtl8169_netpoll(struct net_device *dev)
> >  {
> > struct rtl8169_private *tp = netdev_priv(dev);
> >  
> > -   rtl8169_interrupt(tp->pci_dev->irq, dev);
> > +   napi_schedule(>napi);
> 
> The problem is more likely that netconsole handling can call rtl_tx()
> from hard irq context, while standard NAPI poll calls it from BH
> 
> Meaning that the following sequence triggers a lockdep warning.
> 
>   u64_stats_update_begin(>tx_stats.syncp);
>   tp->tx_stats.packets++;
>   tp->tx_stats.bytes += tx_skb->skb->len;
>   u64_stats_update_end(>tx_stats.syncp);
> 
> Lockdep does not know that poll_napi() ( called from netpoll_poll_dev())
> uses an cmpxchg() to make sure that there is no race.
> 
> I am not sure how we can teach lockdep to not splat in this case.

Well, could you try the following patch ?
Thanks !

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 
0a8f2817ea60f2172eb28177473a4879f85bd18a..f64f812b86029b772bb245e51cdc2263adc4e6ea
 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7313,10 +7313,10 @@ static void rtl_tx(struct net_device *dev, struct 
rtl8169_private *tp)
rtl8169_unmap_tx_skb(>pci_dev->dev, tx_skb,
 tp->TxDescArray + entry);
if (status & LastFrag) {
-   u64_stats_update_begin(>tx_stats.syncp);
+   u64_stats_update_begin_raw(>tx_stats.syncp);
tp->tx_stats.packets++;
tp->tx_stats.bytes += tx_skb->skb->len;
-   u64_stats_update_end(>tx_stats.syncp);
+   u64_stats_update_end_raw(>tx_stats.syncp);
dev_kfree_skb_any(tx_skb->skb);
tx_skb->skb = NULL;
}





Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-04-21 Thread Eric Dumazet
On Fri, 2017-04-21 at 14:40 +0300, Ville Syrjälä wrote:
> BTW I've also been getting some lockdep grief from r8169 netpoll stuff
> recently. Not sure if it might be related to these changes or not, but
> I don't remember seeing this sort of stuff until quite recently.
> 
> [  251.911044] ==
> [  251.911044] [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
> [  251.911045] 4.11.0-rc6-elk+ #101 Not tainted
> [  251.911045] --
> [  251.911046] kms_plane_blink/2132 [HC0[0]:SC0[0]:HE0:SE1] is trying to 
> acquire:
> [  251.911046]  (>seq#2){+.-...}, at: [] 
> netpoll_poll_dev+0xb3/0x2a0
> [  251.911048] 
> [  251.911048] and this task is already holding:
> [  251.911049]  (target_list_lock){-.}, at: [] 
> write_msg+0x3f/0xd0 [netconsole]
> [  251.911050] which would create a new lock dependency:
> [  251.911050]  (target_list_lock){-.} -> (>seq#2){+.-...}
> [  251.911053] 
> [  251.911053] but this new dependency connects a HARDIRQ-irq-safe lock:
> [  251.911054]  (target_list_lock){-.}
> [  251.911055] ... which became HARDIRQ-irq-safe at:
> [  251.911055]   __lock_acquire+0x6ec/0x1440
> [  251.911056]   lock_acquire+0x7e/0x1a0
> [  251.911056]   _raw_spin_lock_irqsave+0x2b/0x40
> [  251.911056]   write_msg+0x3f/0xd0 [netconsole]
> [  251.911057]   console_unlock+0x41c/0x570
> [  251.911057]   vprintk_emit+0x26d/0x350
> [  251.911058]   vprintk_default+0x12/0x20
> [  251.911058]   vprintk_func+0x1c/0x50
> [  251.911058]   printk+0xe/0x10
> [  251.911059]   drm_printk+0x5a/0x60 [drm]
> [  251.911059]   intel_get_hpd_pins+0x87/0xa0 [i915]
> [  251.911060]   i9xx_hpd_irq_handler+0x96/0xe0 [i915]
> [  251.911060]   i965_irq_handler+0x2c8/0x2f0 [i915]
> [  251.911060]   __handle_irq_event_percpu+0x38/0x360
> [  251.911061]   handle_irq_event_percpu+0x19/0x50
> [  251.911061]   handle_irq_event+0x29/0x50
> [  251.911062]   handle_edge_irq+0x65/0x110
> [  251.911062]   handle_irq+0x9f/0xc0
> [  251.911062]   do_IRQ+0x4f/0x110
> [  251.911063]   common_interrupt+0x36/0x3c
> [  251.911063]   cpuidle_enter_state+0xcc/0x390
> [  251.911064]   cpuidle_enter+0xf/0x20
> [  251.911064]   call_cpuidle+0x1c/0x40
> [  251.911064]   do_idle+0x164/0x1d0
> [  251.911065]   cpu_startup_entry+0x1d/0x20
> [  251.911065]   start_secondary+0x104/0x170
> [  251.911066]   startup_32_smp+0x16b/0x16d
> [  251.911066] 
> [  251.911066] to a HARDIRQ-irq-unsafe lock:
> [  251.911067]  (>seq#2){+.-...}
> [  251.911068] ... which became HARDIRQ-irq-unsafe at:
> [  251.911068] ...  __lock_acquire+0x5d7/0x1440
> [  251.911069]   lock_acquire+0x7e/0x1a0
> [  251.911069]   rtl8169_poll+0x474/0x620 [r8169]
> [  251.911070]   net_rx_action+0x1d0/0x3f0
> [  251.911070]   __do_softirq+0x17d/0x460
> [  251.911070]   do_softirq_own_stack+0x1d/0x30
> [  251.911071]   irq_exit+0xa5/0xb0
> [  251.911071]   do_IRQ+0x58/0x110
> [  251.911072]   common_interrupt+0x36/0x3c
> [  251.911072]   cpuidle_enter_state+0xcc/0x390
> [  251.911072]   cpuidle_enter+0xf/0x20
> [  251.911073]   call_cpuidle+0x1c/0x40
> [  251.911073]   do_idle+0x164/0x1d0
> [  251.911073]   cpu_startup_entry+0x1d/0x20
> [  251.911074]   rest_init+0x10c/0x120
> [  251.911074]   start_kernel+0x30d/0x312
> [  251.911075]   i386_start_kernel+0x85/0x89
> [  251.911075]   startup_32_smp+0x16b/0x16d
> [  251.911075] 
> [  251.911076] other info that might help us debug this:
> [  251.911076] 
> [  251.911077]  Possible interrupt unsafe locking scenario:
> [  251.911077] 
> [  251.911077]CPU0CPU1
> [  251.911078]
> [  251.911078]   lock(>seq#2);
> [  251.911079]local_irq_disable();
> [  251.911080]lock(target_list_lock);
> [  251.911081]lock(>seq#2);
> [  251.911082]   
> [  251.911082] lock(target_list_lock);
> [  251.911083] 
> [  251.911084]  *** DEADLOCK ***
> [  251.911084] 
> [  251.911084] 2 locks held by kms_plane_blink/2132:
> [  251.911085]  #0:  (console_lock){+.+.+.}, at: [] 
> vprintk_emit+0x264/0x350
> [  251.911086]  #1:  (target_list_lock){-.}, at: [] 
> write_msg+0x3f/0xd0 [netconsole]
> [  251.911088] 
> [  251.911088] the dependencies between HARDIRQ-irq-safe lock and the holding 
> lock:
> [  251.911089] -> (target_list_lock){-.} ops: 482 {
> [  251.911090]IN-HARDIRQ-W at:
> [  251.911091]__lock_acquire+0x6ec/0x1440
> [  251.911092]lock_acquire+0x7e/0x1a0
> [  251.911092]_raw_spin_lock_irqsave+0x2b/0x40
> [  251.911092]write_msg+0x3f/0xd0 [netconsole]
> [  251.911093]console_unlock+0x41c/0x570
> [  251.911093]vprintk_emit+0x26d/0x350
> [  251.911094]vprintk_default+0x12/0x20
> [  251.911094]

Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-04-21 Thread Ville Syrjälä
BTW I've also been getting some lockdep grief from r8169 netpoll stuff
recently. Not sure if it might be related to these changes or not, but
I don't remember seeing this sort of stuff until quite recently.

[  251.911044] ==
[  251.911044] [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
[  251.911045] 4.11.0-rc6-elk+ #101 Not tainted
[  251.911045] --
[  251.911046] kms_plane_blink/2132 [HC0[0]:SC0[0]:HE0:SE1] is trying to 
acquire:
[  251.911046]  (>seq#2){+.-...}, at: [] 
netpoll_poll_dev+0xb3/0x2a0
[  251.911048] 
[  251.911048] and this task is already holding:
[  251.911049]  (target_list_lock){-.}, at: [] 
write_msg+0x3f/0xd0 [netconsole]
[  251.911050] which would create a new lock dependency:
[  251.911050]  (target_list_lock){-.} -> (>seq#2){+.-...}
[  251.911053] 
[  251.911053] but this new dependency connects a HARDIRQ-irq-safe lock:
[  251.911054]  (target_list_lock){-.}
[  251.911055] ... which became HARDIRQ-irq-safe at:
[  251.911055]   __lock_acquire+0x6ec/0x1440
[  251.911056]   lock_acquire+0x7e/0x1a0
[  251.911056]   _raw_spin_lock_irqsave+0x2b/0x40
[  251.911056]   write_msg+0x3f/0xd0 [netconsole]
[  251.911057]   console_unlock+0x41c/0x570
[  251.911057]   vprintk_emit+0x26d/0x350
[  251.911058]   vprintk_default+0x12/0x20
[  251.911058]   vprintk_func+0x1c/0x50
[  251.911058]   printk+0xe/0x10
[  251.911059]   drm_printk+0x5a/0x60 [drm]
[  251.911059]   intel_get_hpd_pins+0x87/0xa0 [i915]
[  251.911060]   i9xx_hpd_irq_handler+0x96/0xe0 [i915]
[  251.911060]   i965_irq_handler+0x2c8/0x2f0 [i915]
[  251.911060]   __handle_irq_event_percpu+0x38/0x360
[  251.911061]   handle_irq_event_percpu+0x19/0x50
[  251.911061]   handle_irq_event+0x29/0x50
[  251.911062]   handle_edge_irq+0x65/0x110
[  251.911062]   handle_irq+0x9f/0xc0
[  251.911062]   do_IRQ+0x4f/0x110
[  251.911063]   common_interrupt+0x36/0x3c
[  251.911063]   cpuidle_enter_state+0xcc/0x390
[  251.911064]   cpuidle_enter+0xf/0x20
[  251.911064]   call_cpuidle+0x1c/0x40
[  251.911064]   do_idle+0x164/0x1d0
[  251.911065]   cpu_startup_entry+0x1d/0x20
[  251.911065]   start_secondary+0x104/0x170
[  251.911066]   startup_32_smp+0x16b/0x16d
[  251.911066] 
[  251.911066] to a HARDIRQ-irq-unsafe lock:
[  251.911067]  (>seq#2){+.-...}
[  251.911068] ... which became HARDIRQ-irq-unsafe at:
[  251.911068] ...  __lock_acquire+0x5d7/0x1440
[  251.911069]   lock_acquire+0x7e/0x1a0
[  251.911069]   rtl8169_poll+0x474/0x620 [r8169]
[  251.911070]   net_rx_action+0x1d0/0x3f0
[  251.911070]   __do_softirq+0x17d/0x460
[  251.911070]   do_softirq_own_stack+0x1d/0x30
[  251.911071]   irq_exit+0xa5/0xb0
[  251.911071]   do_IRQ+0x58/0x110
[  251.911072]   common_interrupt+0x36/0x3c
[  251.911072]   cpuidle_enter_state+0xcc/0x390
[  251.911072]   cpuidle_enter+0xf/0x20
[  251.911073]   call_cpuidle+0x1c/0x40
[  251.911073]   do_idle+0x164/0x1d0
[  251.911073]   cpu_startup_entry+0x1d/0x20
[  251.911074]   rest_init+0x10c/0x120
[  251.911074]   start_kernel+0x30d/0x312
[  251.911075]   i386_start_kernel+0x85/0x89
[  251.911075]   startup_32_smp+0x16b/0x16d
[  251.911075] 
[  251.911076] other info that might help us debug this:
[  251.911076] 
[  251.911077]  Possible interrupt unsafe locking scenario:
[  251.911077] 
[  251.911077]CPU0CPU1
[  251.911078]
[  251.911078]   lock(>seq#2);
[  251.911079]local_irq_disable();
[  251.911080]lock(target_list_lock);
[  251.911081]lock(>seq#2);
[  251.911082]   
[  251.911082] lock(target_list_lock);
[  251.911083] 
[  251.911084]  *** DEADLOCK ***
[  251.911084] 
[  251.911084] 2 locks held by kms_plane_blink/2132:
[  251.911085]  #0:  (console_lock){+.+.+.}, at: [] 
vprintk_emit+0x264/0x350
[  251.911086]  #1:  (target_list_lock){-.}, at: [] 
write_msg+0x3f/0xd0 [netconsole]
[  251.911088] 
[  251.911088] the dependencies between HARDIRQ-irq-safe lock and the holding 
lock:
[  251.911089] -> (target_list_lock){-.} ops: 482 {
[  251.911090]IN-HARDIRQ-W at:
[  251.911091]__lock_acquire+0x6ec/0x1440
[  251.911092]lock_acquire+0x7e/0x1a0
[  251.911092]_raw_spin_lock_irqsave+0x2b/0x40
[  251.911092]write_msg+0x3f/0xd0 [netconsole]
[  251.911093]console_unlock+0x41c/0x570
[  251.911093]vprintk_emit+0x26d/0x350
[  251.911094]vprintk_default+0x12/0x20
[  251.911094]vprintk_func+0x1c/0x50
[  251.911095]printk+0xe/0x10
[  251.911095]drm_printk+0x5a/0x60 [drm]
[  251.911096]intel_get_hpd_pins+0x87/0xa0 [i915]
[  251.911096]

Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-04-10 Thread Eric Dumazet
On Mon, Apr 10, 2017 at 7:22 AM, tedheadster  wrote:
> On Sat, Apr 8, 2017 at 6:23 AM, Francois Romieu  wrote:
>> David Miller  :
>> [...]
>>> One theory is that the interrupt masking isn't working properly
>>> and interrupts are still arriving and hitting the NAPI state even
>>> when we are actively polling NAPI.
>>>
>>> And this problem was masked by the locking done here.
>>
>> Yes.
>>
>> Ville, can you rule out irq sharing between the 8139 and some other
>> device ? It's a candidate for unexpected interrupt handler invocation
>> with older pc, even with properly working hardware.
>>
>
> Eric,
>   If napi_complete_done() calls could affect drivers on older
> hardware, I can test the following:
>
> drivers/net/ethernet/3com/typhoon.c
> drivers/net/ethernet/amd/pcnet32.c
> drivers/net/ethernet/broadcom/tg3.c
> drivers/net/ethernet/dec/tulip/interrupt.c
> drivers/net/ethernet/intel/e100.c
> drivers/net/ethernet/intel/e1000/e1000_main.c
> drivers/net/ethernet/smsc/epic100.c
> drivers/net/ethernet/via/via-rhine.c

That would be great, thanks Matthew.


Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-04-10 Thread tedheadster
On Sat, Apr 8, 2017 at 6:23 AM, Francois Romieu  wrote:
> David Miller  :
> [...]
>> One theory is that the interrupt masking isn't working properly
>> and interrupts are still arriving and hitting the NAPI state even
>> when we are actively polling NAPI.
>>
>> And this problem was masked by the locking done here.
>
> Yes.
>
> Ville, can you rule out irq sharing between the 8139 and some other
> device ? It's a candidate for unexpected interrupt handler invocation
> with older pc, even with properly working hardware.
>

Eric,
  If napi_complete_done() calls could affect drivers on older
hardware, I can test the following:

drivers/net/ethernet/3com/typhoon.c
drivers/net/ethernet/amd/pcnet32.c
drivers/net/ethernet/broadcom/tg3.c
drivers/net/ethernet/dec/tulip/interrupt.c
drivers/net/ethernet/intel/e100.c
drivers/net/ethernet/intel/e1000/e1000_main.c
drivers/net/ethernet/smsc/epic100.c
drivers/net/ethernet/via/via-rhine.c

- Matthew


Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-04-10 Thread Ville Syrjälä
On Fri, Apr 07, 2017 at 11:38:49AM -0700, Eric Dumazet wrote:
> On Fri, 2017-04-07 at 21:17 +0300, Ville Syrjälä wrote:
> > Hi,
> > 
> > My old P3 laptop started to die on me in the middle of larger compile
> > jobs (using distcc) after v4.11-rc. I bisected the problem
> > to 617f01211baf ("8139too: use napi_complete_done()").
> > 
> > Unfortunately I wasn't able to capture a full oops as the machine doesn't
> > have serial and ramoops failed me. I did get one partial oops on vgacon
> > which showed rtl8139_poll() being involved (EIP was around
> > _raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
> > bisect result.
> > 
> > So maybe some kind of nasty thing going between the hard irq and
> > softirq? Perhaps UP related? I tried to stare at the locking around
> > rtl8139_poll() for a while but it looked mostly sane to me.
> > 
> 
> Thanks a lot for the detective work, I am so sorry for this !
> 
> Could you try the following patch ?
> 
> I do not really see what could be wrong, the code should run just fine
> on UP.
> 
> Thanks.
> 
> diff --git a/drivers/net/ethernet/realtek/8139too.c 
> b/drivers/net/ethernet/realtek/8139too.c
> index 
> 89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673
>  100644
> --- a/drivers/net/ethernet/realtek/8139too.c
> +++ b/drivers/net/ethernet/realtek/8139too.c
> @@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, int 
> budget)
>   if (likely(RTL_R16(IntrStatus) & RxAckBits))
>   work_done += rtl8139_rx(dev, tp, budget);
>  
> - if (work_done < budget && napi_complete_done(napi, work_done)) {
> + if (work_done < budget) {
>   unsigned long flags;
>  
>   spin_lock_irqsave(>lock, flags);
> - RTL_W16_F(IntrMask, rtl8139_intr_mask);
> + if (napi_complete_done(napi, work_done))
> + RTL_W16_F(IntrMask, rtl8139_intr_mask);
>   spin_unlock_irqrestore(>lock, flags);
>   }
>   spin_unlock(>rx_lock);
> 
> 

Yep, that patch does appear to make it stable again.

Tested-by: Ville Syrjälä 

-- 
Ville Syrjälä
Intel OTC


Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-04-08 Thread Francois Romieu
David Miller  :
[...]
> One theory is that the interrupt masking isn't working properly
> and interrupts are still arriving and hitting the NAPI state even
> when we are actively polling NAPI.
> 
> And this problem was masked by the locking done here.

Yes.

Ville, can you rule out irq sharing between the 8139 and some other
device ? It's a candidate for unexpected interrupt handler invocation
with older pc, even with properly working hardware.

-- 
Ueimor


Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-04-07 Thread David Miller
From: Eric Dumazet 
Date: Fri, 07 Apr 2017 11:38:49 -0700

> I do not really see what could be wrong, the code should run just fine
> on UP.

One theory is that the interrupt masking isn't working properly
and interrupts are still arriving and hitting the NAPI state even
when we are actively polling NAPI.

And this problem was masked by the locking done here.


Re: [regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-04-07 Thread Eric Dumazet
On Fri, 2017-04-07 at 21:17 +0300, Ville Syrjälä wrote:
> Hi,
> 
> My old P3 laptop started to die on me in the middle of larger compile
> jobs (using distcc) after v4.11-rc. I bisected the problem
> to 617f01211baf ("8139too: use napi_complete_done()").
> 
> Unfortunately I wasn't able to capture a full oops as the machine doesn't
> have serial and ramoops failed me. I did get one partial oops on vgacon
> which showed rtl8139_poll() being involved (EIP was around
> _raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
> bisect result.
> 
> So maybe some kind of nasty thing going between the hard irq and
> softirq? Perhaps UP related? I tried to stare at the locking around
> rtl8139_poll() for a while but it looked mostly sane to me.
> 

Thanks a lot for the detective work, I am so sorry for this !

Could you try the following patch ?

I do not really see what could be wrong, the code should run just fine
on UP.

Thanks.

diff --git a/drivers/net/ethernet/realtek/8139too.c 
b/drivers/net/ethernet/realtek/8139too.c
index 
89631753e79962d91456d93b71929af768917da1..cd2dbec331dd796f5296cd378561b3443f231673
 100644
--- a/drivers/net/ethernet/realtek/8139too.c
+++ b/drivers/net/ethernet/realtek/8139too.c
@@ -2135,11 +2135,12 @@ static int rtl8139_poll(struct napi_struct *napi, int 
budget)
if (likely(RTL_R16(IntrStatus) & RxAckBits))
work_done += rtl8139_rx(dev, tp, budget);
 
-   if (work_done < budget && napi_complete_done(napi, work_done)) {
+   if (work_done < budget) {
unsigned long flags;
 
spin_lock_irqsave(>lock, flags);
-   RTL_W16_F(IntrMask, rtl8139_intr_mask);
+   if (napi_complete_done(napi, work_done))
+   RTL_W16_F(IntrMask, rtl8139_intr_mask);
spin_unlock_irqrestore(>lock, flags);
}
spin_unlock(>rx_lock);





[regression v4.11] 617f01211baf ("8139too: use napi_complete_done()")

2017-04-07 Thread Ville Syrjälä
Hi,

My old P3 laptop started to die on me in the middle of larger compile
jobs (using distcc) after v4.11-rc. I bisected the problem
to 617f01211baf ("8139too: use napi_complete_done()").

Unfortunately I wasn't able to capture a full oops as the machine doesn't
have serial and ramoops failed me. I did get one partial oops on vgacon
which showed rtl8139_poll() being involved (EIP was around
_raw_spin_unlock_irqrestore() supposedly), so seems to agree with my
bisect result.

So maybe some kind of nasty thing going between the hard irq and
softirq? Perhaps UP related? I tried to stare at the locking around
rtl8139_poll() for a while but it looked mostly sane to me.

-- 
Ville Syrjälä
Intel OTC