Re: [Intel-wired-lan] [PATCH] igb: use igb_adapter->io_addr instead of e1000_hw->hw_addr

2016-11-10 Thread Hisashi T Fujinaka

On Thu, 10 Nov 2016, Corinna Vinschen wrote:


On Nov  8 11:33, Alexander Duyck wrote:

...

The question I would have is what is reading the device when it is in
this state.  The watchdog and any other functions that would read the
device should be disabled.

One possibility could be a race between a call to igb_close and the
igb_suspend function.  We have seen some of those pop up recently on
ixgbe and it looks like igb has the same bug.  We should probably be
using the rtnl_lock to guarantee that netif_device_detach and the call
to __igb_close are completed before igb_close could possibly be called
by the network stack.


Do you have a pointer to the related ixgbe patch, by any chance?

...

The thing is that a suspended device should not be accessed at all.
If we are accessing it while it is suspended then that is a bug.  If
you could throw a WARN_ON call in igb_rd32 to capture where this is
being triggered that might be useful.


- Otherwise assume it's actually a surprise removal.  In theory that
  should somehow trigger a device removal sequence, kind of like
  calling igb_remove, no?


Well a read of the MMIO region while suspended is more of a surprise
read since there shouldn't be anything going on.  We need to isolate
where that read is coming from and fix it.


That would be ideal, but so far the problem has only been reproduced at
a customer's customer site.  It's not clear yet whether we can access
the machine for further testing.


Here's the initial patch for igb I have, but it's on hold awaiting more
changes in ixgbe regarding AER.

--
Hisashi T Fujinaka - ht...@twofifty.com
BSEE + BSChem + BAEnglish + MSCS + $2.50 = coffee

Date: Thu, 10 Nov 2016 05:36:13 -0800
From: Todd Fujinaka <todd.fujin...@intel.com>
To: ht...@twofifty.com
Subject: [PATCH] igb: handle close/suspend race netif_device_detach

Similar to ixgbe, when an interface is part of a namespace, it is
possible that igb_close() may be called while __igb_shutdown() is
running, ending up in a double free WARN and/or a BUG in
free_msi_irqs().

Extend the rtnl_lock() to protect the call to netif_device_detach() and
igb_clear_interrupt_scheme() in __igb_shutdown(), and check
netif_device_present() to avoid clearing the interrupts a second time in
igb_close().

Signed-off-by: Todd Fujinaka <todd.fujin...@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 4feca69..ce5add6 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3275,7 +3275,10 @@ static int __igb_close(struct net_device *netdev, bool suspending)
 
 int igb_close(struct net_device *netdev)
 {
-	return __igb_close(netdev, false);
+	if (netif_device_present(netdev))
+		return __igb_close(netdev, false);
+
+	return 0;
 }
 
 /**
@@ -7537,6 +7540,7 @@ static int __igb_shutdown(struct pci_dev *pdev, bool *enable_wake,
 	int retval = 0;
 #endif
 
+	rtnl_lock();
 	netif_device_detach(netdev);
 
 	if (netif_running(netdev))
@@ -7545,6 +7549,7 @@ static int __igb_shutdown(struct pci_dev *pdev, bool *enable_wake,
 	igb_ptp_suspend(adapter);
 
 	igb_clear_interrupt_scheme(adapter);
+	rtnl_unlock();
 
 #ifdef CONFIG_PM
 	retval = pci_save_state(pdev);
@@ -7663,16 +7668,17 @@ static int igb_resume(struct device *dev)
 
 	wr32(E1000_WUS, ~0);
 
-	if (netdev->flags & IFF_UP) {
-		rtnl_lock();
+	rtnl_lock();
+
+	if (!err && netif_running(netdev))
 		err = __igb_open(netdev, true);
-		rtnl_unlock();
-		if (err)
-			return err;
-	}
 
-	netif_device_attach(netdev);
-	return 0;
+	if (!err)
+		netif_device_attach(netdev);
+
+	rtnl_unlock();
+
+	return err;
 }
 
 static int igb_runtime_idle(struct device *dev)
-- 
2.7.4
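The locking scheme in the patch above can be modeled in userspace as a rough sanity check. This is a minimal sketch, not driver code. A pthread mutex stands in for rtnl_lock(), and plain flags stand in for netif_device_present() and netif_running(). Every name (struct dev_model, shutdown_path(), stack_close_path(), run_race_once()) is hypothetical. Whichever path takes the lock first, the IRQs and the MSI-X vectors are each released exactly once.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the patch's locking; not kernel code. */
struct dev_model {
	pthread_mutex_t rtnl; /* stands in for rtnl_lock()/rtnl_unlock() */
	bool present;         /* netif_device_present() */
	bool running;         /* netif_running() */
	bool irqs_held;       /* released by the __igb_close() stand-in */
	bool vectors_held;    /* released by the igb_clear_interrupt_scheme() stand-in */
	int double_frees;     /* releases of something already released */
};

static void release(struct dev_model *d, bool *held)
{
	if (!*held)
		d->double_frees++; /* where the kernel would WARN or BUG */
	*held = false;
}

/* Models __igb_shutdown() after the patch. */
static void *shutdown_path(void *arg)
{
	struct dev_model *d = arg;

	pthread_mutex_lock(&d->rtnl);
	d->present = false;              /* netif_device_detach() */
	if (d->running)
		release(d, &d->irqs_held);   /* __igb_close(netdev, true) */
	release(d, &d->vectors_held);    /* igb_clear_interrupt_scheme() */
	pthread_mutex_unlock(&d->rtnl);
	return NULL;
}

/* Models the stack closing the interface (e.g. namespace teardown).
 * The stack calls the close hook under rtnl; the mutex reproduces that. */
static void *stack_close_path(void *arg)
{
	struct dev_model *d = arg;

	pthread_mutex_lock(&d->rtnl);
	d->running = false;              /* the stack marks the link stopped */
	if (d->present)                  /* the check the patch adds to igb_close() */
		release(d, &d->irqs_held);
	pthread_mutex_unlock(&d->rtnl);
	return NULL;
}

/* Runs both paths concurrently once and returns the number of double frees. */
static int run_race_once(void)
{
	struct dev_model d = {
		.present = true, .running = true,
		.irqs_held = true, .vectors_held = true,
	};
	pthread_t a, b;

	pthread_mutex_init(&d.rtnl, NULL);
	pthread_create(&a, NULL, shutdown_path, &d);
	pthread_create(&b, NULL, stack_close_path, &d);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	pthread_mutex_destroy(&d.rtnl);

	assert(!d.irqs_held && !d.vectors_held); /* everything was released */
	return d.double_frees;
}
```

Removing the present check from stack_close_path() makes the shutdown-first ordering release the IRQs twice. That mirrors the double free described in the commit message.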


Re: [Intel-wired-lan] [PATCH] igb: use igb_adapter->io_addr instead of e1000_hw->hw_addr

2016-11-08 Thread Hisashi T Fujinaka

On Tue, 8 Nov 2016, Hisashi T Fujinaka wrote:


Incidentally, we're just looking for a solution to that problem too.
Do three patches fixing the same problem at roughly the same time already
qualify as a freak accident?

FTR, I attached my current patch, which I was planning to submit after
some external testing.

However, all three patches have one thing in common:  They work around
a somewhat dubious resetting of the hardware address to NULL when
reading from a register fails.

That makes me wonder if setting the hardware address to NULL in
rd32/igb_rd32 is really such a good idea.  It's done in a function
whose return value is *never* tested for validity by the calling
functions, and it leads to subsequent crashes since no tests for
hw_addr == NULL are performed.

Maybe commit 22a8b2915 should be reconsidered?  Isn't there some more
graceful way to handle the "surprise removal"?


Answering this from my home account because, well, work is Outlook.

"Reconsidering" would be great. In fact, revert it if you'd like. I'm
uncertain that the surprise removal code actually works the way I
previously thought, and I think I took a lot of it out of my local code.

Unfortunately I don't have any equipment that I can use to reproduce
surprise removal any longer, so I wouldn't be able to test anything.
I have to defer to you or Cao Jin.


Whoops. Never mind. I was just told that I had a bug that Alex Duyck and
Cao Jin just fixed. I'd stick to listening to Alex.

--
Hisashi T Fujinaka - ht...@twofifty.com


Re: [Intel-wired-lan] [PATCH] igb: use igb_adapter->io_addr instead of e1000_hw->hw_addr

2016-11-08 Thread Hisashi T Fujinaka

On Tue, 8 Nov 2016, Corinna Vinschen wrote:


On Nov  8 15:06, Cao jin wrote:

When running as a guest, under certain conditions, it will oops as follows.
writel() in igb_configure_tx_ring() results in an oops because hw->hw_addr
is NULL. Other register accesses don't oops the kernel because they use
wr32/rd32, which have a defense against a NULL pointer.

[  141.225449] pcieport 0000:00:1c.0: AER: Multiple Uncorrected (Fatal) error received: id=0101
[  141.225523] igb 0000:01:00.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Unaccessible, id=0101(Unregistered Agent ID)
[  141.299442] igb 0000:01:00.1: broadcast error_detected message
[  141.300539] igb 0000:01:00.0 enp1s0f0: PCIe link lost, device now detached
[  141.351019] igb 0000:01:00.1 enp1s0f1: PCIe link lost, device now detached
[  143.465904] pcieport 0000:00:1c.0: Root Port link has been reset
[  143.465994] igb 0000:01:00.1: broadcast slot_reset message
[  143.466039] igb 0000:01:00.0: enabling device (0000 -> 0002)
[  144.389078] igb 0000:01:00.1: enabling device (0000 -> 0002)
[  145.312078] igb 0000:01:00.1: broadcast resume message
[  145.322211] BUG: unable to handle kernel paging request at 0000000000003818
[  145.361275] IP: [] igb_configure_tx_ring+0x14d/0x280 [igb]
[  145.400048] PGD 0
[  145.438007] Oops: 0002 [#1] SMP

A similar issue and solution can be found at:
http://patchwork.ozlabs.org/patch/689592/

Signed-off-by: Cao jin <caoj.f...@cn.fujitsu.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index edc9a6a..3f240ac 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3390,7 +3390,7 @@ void igb_configure_tx_ring(struct igb_adapter *adapter,
 	     tdba & 0x00000000ffffffffULL);
 	wr32(E1000_TDBAH(reg_idx), tdba >> 32);
 
-	ring->tail = hw->hw_addr + E1000_TDT(reg_idx);
+	ring->tail = adapter->io_addr + E1000_TDT(reg_idx);
 	wr32(E1000_TDH(reg_idx), 0);
 	writel(0, ring->tail);
 
@@ -3729,7 +3729,7 @@ void igb_configure_rx_ring(struct igb_adapter *adapter,
 	     ring->count * sizeof(union e1000_adv_rx_desc));
 
 	/* initialize head and tail */
-	ring->tail = hw->hw_addr + E1000_RDT(reg_idx);
+	ring->tail = adapter->io_addr + E1000_RDT(reg_idx);
 	wr32(E1000_RDH(reg_idx), 0);
 	writel(0, ring->tail);
 
--
2.1.0
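The following example makes the oops address concrete. A tail pointer computed as hw_addr plus a register offset lands at the bare offset once hw_addr has been reset to NULL. The sketch below is a userspace model, not driver code. It assumes the queue 0 TDT offset is 0x3818, which is consistent with the faulting address in the log above, and the struct and function names are hypothetical. It computes addresses as integers so that it never forms or dereferences a pointer derived from NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: both fields start out pointing at the same mapping. */
struct model_hw {
	uint8_t *hw_addr;    /* reset to NULL by the read path on failure */
};

struct model_adapter {
	struct model_hw hw;
	uint8_t *io_addr;    /* the original mapping, never cleared */
};

/* Assumed TDT offset for queue 0; matches the faulting address 0x3818. */
#define MODEL_TDT0 0x3818u

/* Returns the would-be tail address as an integer.  Using uintptr_t avoids
 * pointer arithmetic on NULL, which is undefined behaviour in C. */
static uintptr_t tail_from(const uint8_t *base)
{
	return (uintptr_t)base + MODEL_TDT0;
}
```

With the patch, the tail is derived from adapter->io_addr, which is never cleared, so it still points into the mapped register space even after hw->hw_addr has been reset.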


Incidentally, we're just looking for a solution to that problem too.
Do three patches fixing the same problem at roughly the same time already
qualify as a freak accident?

FTR, I attached my current patch, which I was planning to submit after
some external testing.

However, all three patches have one thing in common:  They work around
a somewhat dubious resetting of the hardware address to NULL when
reading from a register fails.

That makes me wonder if setting the hardware address to NULL in
rd32/igb_rd32 is really such a good idea.  It's done in a function
whose return value is *never* tested for validity by the calling
functions, and it leads to subsequent crashes since no tests for
hw_addr == NULL are performed.

Maybe commit 22a8b2915 should be reconsidered?  Isn't there some more
graceful way to handle the "surprise removal"?
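A minimal userspace model of the behavior Corinna describes may make the failure mode easier to see. The model assumes that an all-ones read is treated as a removal only after a second read of register 0 also returns all ones. struct model_hw, raw_readl(), model_rd32() and model_wr32() are hypothetical stand-ins, not the igb functions. After detection, reads return all ones and writes are silently dropped, so nothing tells the caller that the device is gone. Any pointer that was derived from hw_addr is now based on NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALL_ONES 0xFFFFFFFFu

/* Hypothetical model; not the igb structures. */
struct model_hw {
	uint32_t *hw_addr;  /* used by rd32/wr32; set to NULL on detected removal */
	uint32_t *io_addr;  /* the original mapping, never cleared */
	int link_up;        /* 0 simulates a surprise-removed device */
	int detached;       /* stands in for netif_device_detach() */
};

/* A read from a device that has dropped off the bus completes as all ones. */
static uint32_t raw_readl(const struct model_hw *hw, uint32_t reg)
{
	return hw->link_up ? hw->io_addr[reg / 4] : ALL_ONES;
}

static uint32_t model_rd32(struct model_hw *hw, uint32_t reg)
{
	if (hw->hw_addr == NULL)
		return ALL_ONES;        /* already detached: callers just see ~0 */

	uint32_t value = raw_readl(hw, reg);

	/* All ones can be a legitimate value, so confirm with a read of
	 * register 0 before declaring the device removed. */
	if (value == ALL_ONES && (reg == 0 || raw_readl(hw, 0) == ALL_ONES)) {
		hw->hw_addr = NULL;     /* the step being questioned in this thread */
		hw->detached = 1;
	}
	return value;
}

static void model_wr32(struct model_hw *hw, uint32_t reg, uint32_t val)
{
	if (hw->hw_addr == NULL)
		return;                 /* dropped silently; the caller is not told */
	hw->hw_addr[reg / 4] = val;
}
```

The NULL checks keep these two helpers safe, but code that bypasses them, such as a raw writel() through a cached tail pointer, gets no such protection.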


Answering this from my home account because, well, work is Outlook.

"Reconsidering" would be great. In fact, revert it if you'd like. I'm
uncertain that the surprise removal code actually works the way I
previously thought, and I think I took a lot of it out of my local code.

Unfortunately I don't have any equipment that I can use to reproduce
surprise removal any longer, so I wouldn't be able to test anything.
I have to defer to you or Cao Jin.

--
Hisashi T Fujinaka - ht...@twofifty.com (todd.fujin...@intel.com)


Re: [Intel-wired-lan] [PATCH] ixgbe: Limit lowest interrupt rate for adaptive interrupt moderation to 12K

2015-09-01 Thread Hisashi T Fujinaka

On Tue, 1 Sep 2015, Alexander Duyck wrote:


On 07/30/2015 03:19 PM, Alexander Duyck wrote:

This patch updates the lowest limit for adaptive interrupt
moderation to roughly 12K interrupts per second.
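The patch body is elided here, so the following is not the ixgbe change. It is a minimal sketch of the arithmetic behind the description. A floor of 12,000 interrupts per second caps the interval between interrupts at 1,000,000 / 12,000 (about 83 microseconds), and an adaptive scheme can enforce that floor by clamping the interval it proposes. MIN_INTS_PER_SEC, MAX_ITR_USECS, clamp_itr_usecs() and ints_per_sec() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Lowest interrupt rate the adaptive logic may settle on ("roughly 12K"). */
#define MIN_INTS_PER_SEC 12000u

/* Longest allowed gap between interrupts in microseconds:
 * 1,000,000 / 12,000 = 83.3, truncated to 83 by integer division. */
#define MAX_ITR_USECS (1000000u / MIN_INTS_PER_SEC)

/* Clamp a proposed interval so the rate never falls below the floor.
 * Shorter intervals (higher rates) pass through unchanged. */
static uint32_t clamp_itr_usecs(uint32_t proposed_usecs)
{
	return proposed_usecs > MAX_ITR_USECS ? MAX_ITR_USECS : proposed_usecs;
}

/* Interrupt rate implied by an interval, for sanity checks. */
static uint32_t ints_per_sec(uint32_t usecs)
{
	return usecs ? 1000000u / usecs : 0;
}
```

Because of the truncation to 83 microseconds, the effective floor comes out slightly above 12K (about 12,048 interrupts per second). That is consistent with the "roughly" in the description.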

...
Has there been any update on this patch?  I submitted it just over a month 
ago now and it hasn't received any feedback.  I was hoping this could be 
submitted before the merge window closes for net-next.


Thanks.


I'm nobody, but it makes sense to me.

--
Hisashi T Fujinaka - ht...@twofifty.com (also todd.fujin...@intel.com)