from:"Benjamin Poirier"

Re: Re: [Bug] net/ipv6: skb_over_panic in mld_newpack

2018-12-05 Thread Benjamin Poirier

On 2018/12/05 16:57, Nicolas Belouin wrote:
[...]
> 
> Thanks for your help, using your debug patch I got the value of 
> needed_headroom:
> USHRT_MAX - 64
> And tracked it down to a legacy out-of-tree patch of ours, which I then fixed.
> The patch was increasing/decreasing the needed_headroom without checking
> for bounds...
> 
> Sorry for the noise then.

Thanks for reporting back. Let's consider it fixed.

> 
> Regards,
> 
> Nicolas
>
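
For the archive, a minimal sketch of the kind of bounds check that was
missing. needed_headroom is an unsigned short, so an unchecked decrement
below zero wraps around to a value close to USHRT_MAX. The helper name
headroom_adjust() is hypothetical and this is not the out-of-tree patch
itself (which is not shown in this thread), only an illustration of the
idea in plain userspace C:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not a kernel API: adjust a 16-bit headroom value
 * by a signed delta and refuse any result outside [0, USHRT_MAX].
 */
static bool headroom_adjust(unsigned short *headroom, int delta)
{
	long long result;

	assert(headroom != NULL);

	/* do the arithmetic in a wider type so it cannot wrap */
	result = (long long)*headroom + delta;
	if (result < 0 || result > USHRT_MAX)
		return false;	/* would wrap; leave the value untouched */

	*headroom = (unsigned short)result;
	return true;
}
```

A caller would treat a false return as an error instead of storing the
wrapped value.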

Re: Re: [Bug] net/ipv6: skb_over_panic in mld_newpack

2018-12-04 Thread Benjamin Poirier

On 2018/12/04 11:52, Nicolas Belouin wrote:
> On 03/12 07:59, Eric Dumazet wrote:
> > 
> > 
> > On 12/03/2018 07:20 AM, Nicolas Belouin wrote:
> > > Hi,
> > > I ran into a panic while adding an interface to a bridge with a vxlan
> > > interface already attached to it; it seems related to mtu=9000.
> > > 
> > > I get the following panic info :
> > > 
> > > [ 2482.419893] br100: port 2(vif1.1) entered blocking state
> > > [ 2482.425427] br100: port 2(vif1.1) entered forwarding state
> > > [ 2482.431797] skbuff: skb_over_panic: text:816e4f78 len:40 
> > > put:40 head:880146449000 data:880146458fd0 tail:0xfff8 end:0xec0 
> > > dev:vif1.1
> > > [ 2482.442891] ------------[ cut here ]------------
> > > [ 2482.448254] kernel BUG at 
> > > /srv/jenkins/workspace/workspace/hosting-xen-dom0-kernel/build/src/linux-4.9/net/core/skbuff.c:105!
> > > [ 2482.459009] invalid opcode:  [#1] PREEMPT SMP
> > > [ 2482.464371] Modules linked in:
> > > [ 2482.469682] CPU: 19 PID: 1317 Comm: kworker/19:1 Not tainted 
> > > 4.9.135-dom0-e9d15b2-x86_64-iaas #2
> > > [ 2482.480362] Hardware name: Dell Inc. PowerEdge C8220/09N44V, BIOS 
> > > 2.7.1 03/04/2015
> > > [ 2482.491008] Workqueue: ipv6_addrconf addrconf_dad_work
> > > [ 2482.496380] task: 88017eef1a00 task.stack: c90001fcc000
> > > [ 2482.501785] RIP: e030:[]  [] 
> > > skb_panic+0x5f/0x70
> > > [ 2482.512450] RSP: e02b:c90001fcfba8  EFLAGS: 00010296
> > > [ 2482.517817] RAX: 0088 RBX: 880117fb0800 RCX: 
> > > 
> > > [ 2482.528447] RDX: 0088 RSI: 880184cd03c8 RDI: 
> > > 880184cd03c8
> > > [ 2482.539085] RBP: c90001fcfc00 R08: 06a8 R09: 
> > > 81ea7359
> > > [ 2482.549717] R10: 880180406f80 R11: 06a8 R12: 
> > > 880147258cc0
> > > [ 2482.560350] R13: c90001fcfc20 R14: 81d13440 R15: 
> > > 
> > > [ 2482.570993] FS:  () GS:880184cc() 
> > > knlGS:
> > > [ 2482.581646] CS:  e033 DS:  ES:  CR0: 80050033
> > > [ 2482.587039] CR2: 7f5b17f032b0 CR3: 01c08000 CR4: 
> > > 00042660
> > > [ 2482.597675] Stack:
> > > [ 2482.602958]  880146458fd0 fff8 0ec0 
> > > 88017f3f
> > > [ 2482.613619]  815efa62 816e4f78 880117fb0800 
> > > c90001fcfc20
> > > [ 2482.624288]  880147258cc0 88017f3f 880146502000 
> > > c90001fcfc68
> > > [ 2482.634955] Call Trace:
> > > [ 2482.640254]  [] ? skb_put+0x42/0x50
> > > [ 2482.645633]  [] ? ip6_mc_hdr.constprop.36+0x58/0xd0
> > > [ 2482.651045]  [] ? mld_newpack+0x12a/0x1e0
> > > [ 2482.656421]  [] ? add_grhead.isra.28+0x87/0xa0
> > > [ 2482.661825]  [] ? add_grec+0x446/0x4c0
> > > [ 2482.667198]  [] ? __local_bh_enable_ip+0x1b/0xb0
> > > [ 2482.672609]  [] ? 
> > > mld_send_initial_cr.part.29+0x58/0xa0
> > > [ 2482.678022]  [] ? ipv6_mc_dad_complete+0x26/0x60
> > > [ 2482.683441]  [] ? addrconf_dad_completed+0x29f/0x2c0
> > > [ 2482.688850]  [] ? ipv6_dev_mc_inc+0x194/0x2c0
> > > [ 2482.694249]  [] ? addrconf_dad_work+0xfe/0x3d0
> > > [ 2482.699650]  [] ? _raw_spin_unlock_irq+0xd/0x20
> > > [ 2482.705052]  [] ? process_one_work+0x142/0x3e0
> > > [ 2482.710453]  [] ? worker_thread+0x62/0x480
> > > [ 2482.715848]  [] ? process_one_work+0x3e0/0x3e0
> > > [ 2482.721256]  [] ? kthread+0xe2/0x100
> > > [ 2482.726621]  [] ? __switch_to+0x261/0x6b0
> > > [ 2482.732006]  [] ? kthread_park+0x60/0x60
> > > [ 2482.737379]  [] ? ret_from_fork+0x57/0x70
> > > [ 2482.742761] Code: 00 00 48 89 44 24 10 8b 87 b0 00 00 00 48 89 44 24 
> > > 08 48 8b 87 c0 00 00 00 48 c7 c7 50 8e a2 81 48 89 04 24 31 c0 e8 b5 07 
> > > b6 ff <0f> 0b 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 
> > > [ 2482.759199] RIP  [] skb_panic+0x5f/0x70
> > > [ 2482.764672]  RSP 
> > > [ 2482.771186] ---[ end trace 6d0fe52ed049d841 ]---
> > > [ 2482.776641] Kernel panic - not syncing: Fatal exception in interrupt
> > > [ 2482.861621] Kernel Offset: disabled
> > > 
> > > I circumvented the bug by applying this patch:
> > > diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
> > > index 21f6deb2aec9..2762c3dcc883 100644
> > > --- a/net/ipv6/mcast.c
> > > +++ b/net/ipv6/mcast.c
> > > @@ -1605,8 +1605,6 @@ static struct sk_buff *mld_newpack(struct inet6_dev 
> > > *idev, unsigned int mtu)
> > >  IPV6_TLV_PADN, 0 };
> > > 
> > > /* we assume size > sizeof(ra) here */
> > > -   /* limit our allocations to order-0 page */
> > > -   size = min_t(int, size, SKB_MAX_ORDER(0, 0));
> > > skb = sock_alloc_send_skb(sk, size, 1, &err);
> > > 
> > > if (!skb)
> > > 
> > > The lines are introduced by commit 
> > > 72e09ad107e78d69ff4d3b97a69f0aad2b77280f
> > > stating that "order-2 GFP_ATOMIC allocations are very unreliable"
> > > I then wonder if this statement is still relevant, or if such a patch
> > > would be acceptable to you.
> > 
> >

[PATCH] xfrm: Fix bucket count reported to userspace

2018-11-05 Thread Benjamin Poirier

sadhcnt is reported by `ip -s xfrm state count` as "buckets count", not the
hash mask.

Fixes: 28d8909bc790 ("[XFRM]: Export SAD info.")
Signed-off-by: Benjamin Poirier 
---
 net/xfrm/xfrm_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index b669262682c9..12cdb350c456 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -788,7 +788,7 @@ void xfrm_sad_getinfo(struct net *net, struct xfrmk_sadinfo 
*si)
 {
spin_lock_bh(&net->xfrm.xfrm_state_lock);
si->sadcnt = net->xfrm.state_num;
-   si->sadhcnt = net->xfrm.state_hmask;
+   si->sadhcnt = net->xfrm.state_hmask + 1;
si->sadhmcnt = xfrm_state_hashmax;
spin_unlock_bh(&net->xfrm.xfrm_state_lock);
 }
-- 
2.19.0
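
To illustrate the off-by-one for readers of the archive: with a
power-of-two hash table, the mask is the number of buckets minus one, so
the mask itself is never the bucket count. The sketch below is plain
userspace C with hypothetical helper names, not code from xfrm_state.c:

```c
#include <assert.h>

/* For a power-of-two hash table, hmask == (number of buckets - 1). */
static unsigned int buckets_from_hmask(unsigned int hmask)
{
	/* a valid mask has all low bits set, so hmask + 1 is a power of two */
	assert((hmask & (hmask + 1)) == 0);
	return hmask + 1;
}

/* The mask is what selects a bucket from a hash value. */
static unsigned int bucket_index(unsigned int hash, unsigned int hmask)
{
	return hash & hmask;
}
```

For example, a table with hmask 7 has 8 buckets, indexed 0 to 7.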

Re: [RFC PATCH net-next v1 00/14] rename and shrink i40evf

2018-09-13 Thread Benjamin Poirier

On 2018/09/13 15:31, Jesse Brandeburg wrote:
[...]
> 
> ---
> v1: initial RFC
> 
> Jesse Brandeburg (14):
>   intel-ethernet: rename i40evf to iavf

Seems like patch 1 didn't make it to netdev
https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20180910/014025.html

>   iavf: diet and reformat
>   iavf: rename functions and structs to new name
>   iavf: rename i40e_status to iavf_status
>   iavf: move i40evf files to new name
>   iavf: remove references to old names
>   iavf: rename device ID defines
>   iavf: rename I40E_ADMINQ_DESC
>   iavf: rename i40e_hw to iavf_hw
>   iavf: replace i40e_debug with iavf version
>   iavf: tracing infrastructure rename
>   iavf: rename most of i40e strings
>   iavf: finish renaming files to iavf
>   intel-ethernet: use correct module license

[PATCH] e1000e: Ignore TSYNCRXCTL when getting I219 clock attributes

2018-05-10 Thread Benjamin Poirier

There have been multiple reports of crashes that look like
kernel: RIP: 0010:[] timecounter_read+0xf/0x50
[...]
kernel: Call Trace:
kernel:  [] e1000e_phc_gettime+0x2f/0x60 [e1000e]
kernel:  [] e1000e_systim_overflow_work+0x1d/0x80 [e1000e]
kernel:  [] process_one_work+0x155/0x440
kernel:  [] worker_thread+0x116/0x4b0
kernel:  [] kthread+0xd2/0xf0
kernel:  [] ret_from_fork+0x3f/0x70

These can be traced back to the fact that e1000e_systim_reset() skips the
timecounter_init() call if e1000e_get_base_timinca() returns -EINVAL, which
leads to a null deref in timecounter_read().

Commit 83129b37ef35 ("e1000e: fix systim issues", v4.2-rc1) reworked
e1000e_get_base_timinca() in such a way that it can return -EINVAL for
e1000_pch_spt if the SYSCFI bit is not set in TSYNCRXCTL.

Some experimentation has shown that on I219 (e1000_pch_spt, "MAC: 12")
adapters, the E1000_TSYNCRXCTL_SYSCFI flag is unstable; TSYNCRXCTL reads
sometimes don't have the SYSCFI bit set. Retrying the read shortly after
finds the bit to be set. This was observed at boot (probe) but also link up
and link down.

Moreover, the phc (PTP Hardware Clock) seems to operate normally even after
reads where SYSCFI=0. Therefore, remove this register read and
unconditionally set the clock parameters.

Reported-by: Achim Mildenberger <ad...@fph.physik.uni-karlsruhe.de>
Message-Id: <20180425065243.g5mqewg5irkwgwgv@f2>
Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1075876
Fixes: 83129b37ef35 ("e1000e: fix systim issues")
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index ec4a9759a6f2..3afb1f3b6f91 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -3546,15 +3546,12 @@ s32 e1000e_get_base_timinca(struct e1000_adapter 
*adapter, u32 *timinca)
}
break;
case e1000_pch_spt:
-   if (er32(TSYNCRXCTL) & E1000_TSYNCRXCTL_SYSCFI) {
-   /* Stable 24MHz frequency */
-   incperiod = INCPERIOD_24MHZ;
-   incvalue = INCVALUE_24MHZ;
-   shift = INCVALUE_SHIFT_24MHZ;
-   adapter->cc.shift = shift;
-   break;
-   }
-   return -EINVAL;
+   /* Stable 24MHz frequency */
+   incperiod = INCPERIOD_24MHZ;
+   incvalue = INCVALUE_24MHZ;
+   shift = INCVALUE_SHIFT_24MHZ;
+   adapter->cc.shift = shift;
+   break;
case e1000_pch_cnp:
if (er32(TSYNCRXCTL) & E1000_TSYNCRXCTL_SYSCFI) {
/* Stable 24MHz frequency */
-- 
2.16.3
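
For readers of the archive, the failure mode described in the log can be
modelled in a few lines of userspace C. This is a simplified sketch with
hypothetical names (time_state, state_reset, state_read); it is not the
driver code and it is not what the patch does, since the patch removes
the failing path instead of guarding the read:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct clock_src {
	unsigned long long (*read)(void);
};

struct time_state {
	const struct clock_src *src;	/* NULL until initialised */
};

static unsigned long long fake_cycles(void)
{
	return 42;
}

static const struct clock_src fake_src = { .read = fake_cycles };

/* Same shape as the reset path in the log: if the parameter lookup
 * fails, return early, before the state is initialised.
 */
static int state_reset(struct time_state *st, int param_lookup_ret)
{
	assert(st != NULL);

	if (param_lookup_ret)
		return param_lookup_ret;	/* st->src stays as it was */

	st->src = &fake_src;
	return 0;
}

/* A defensive read reports an error; an unguarded read would
 * dereference the NULL pointer left behind by the early return.
 */
static int state_read(const struct time_state *st, unsigned long long *out)
{
	assert(st != NULL && out != NULL);

	if (!st->src)
		return -EINVAL;

	*out = st->src->read();
	return 0;
}
```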

e1000e I219 timestamping oops related to TSYNCRXCTL read

2018-04-25 Thread Benjamin Poirier

In the following openSUSE bug report
https://bugzilla.suse.com/show_bug.cgi?id=1075876
Achim reported an oops related to e1000e timestamping:
kernel: RIP: 0010:[] timecounter_read+0xf/0x50
[...]
kernel: Call Trace:
kernel:  [] e1000e_phc_gettime+0x2f/0x60 [e1000e]
kernel:  [] e1000e_systim_overflow_work+0x1d/0x80 [e1000e]
kernel:  [] process_one_work+0x155/0x440
kernel:  [] worker_thread+0x116/0x4b0
kernel:  [] kthread+0xd2/0xf0
kernel:  [] ret_from_fork+0x3f/0x70

It always occurs 4 hours after boot but not on every boot. It is most
likely the same problem reported here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668356
http://lkml.iu.edu/hypermail/linux/kernel/1506.2/index.html#02530
https://bugzilla.redhat.com/show_bug.cgi?id=1463882
https://bugzilla.redhat.com/show_bug.cgi?id=1431863

This occurs with MAC: 12, e1000_pch_spt/I219. The reporter has
reproduced it on a v4.16 derivative.

We've traced it to the fact that e1000e_systim_reset() skips the
timecounter_init() call if e1000e_get_base_timinca() returns -EINVAL,
which leads to a null deref in timecounter_read() (see comment 8 of the
suse bugzilla for more details.)

In commit 83129b37ef35 ("e1000e: fix systim issues", v4.2-rc1) Yanir
reworked e1000e_get_base_timinca() in such a way that it can return
-EINVAL for e1000_pch_spt if the SYSCFI bit is not set in TSYNCRXCTL.
This is also the commit that was identified by bisection in the second
link above (lkml).

What we've observed (in comment 14) is that TSYNCRXCTL reads sometimes
don't have the SYSCFI bit set. Retrying the read shortly after finds the
bit to be set. This was observed at boot (probe) but also link up and
link down.

I have a few questions:

What's the purpose of the SYSCFI bit in TSYNCRXCTL ("Reserved" in the
datasheet)?

Why does it look like subsequent reads of TSYNCRXCTL sometimes have the
SYSCFI bit set/not set on I219?

Is it right to check the SYSCFI bit in e1000e_get_base_timinca() for
_spt and return -EINVAL if it's not set? Could we just remove that
check?

The patch in comment 13 of the suse bugzilla works around the problem by
retrying TSYNCRXCTL reads, maybe we could instead remove that read
altogether or move the timecounter_init() call to at least avoid the
oops. The best approach to take seems to depend on the behavior expected
of TSYNCRXCTL reads on I219 so I'm hoping that you could provide more
info on that.

Thanks,
-Benjamin

[PATCH 2/2] e1000e: Fix link check race condition

2018-03-05 Thread Benjamin Poirier

Alex reported the following race condition:

/* link goes up... interrupt... schedule watchdog */
\ e1000_watchdog_task
\ e1000e_has_link
\ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
\ e1000e_phy_has_link_generic(..., &link)
link = true

 /* link goes down... interrupt */
 \ e1000_msix_other
 hw->mac.get_link_status = true

/* link is up */
mac->get_link_status = false

link_active = true
/* link_active is true, wrongly, and stays so because
 * get_link_status is false */

Avoid this problem by making sure that we don't set get_link_status = false
after having checked the link.

It seems this problem has been present since the introduction of e1000e.
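
The effect of the ordering can be shown with a small sequential model in
userspace C. The names below (mac_state, link_change_irq and the two
check functions) are hypothetical and the interrupt is simulated by a
plain function call, so this only illustrates the ordering argument, not
the driver's locking or interrupt handling:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mac_state {
	bool get_link_status;	/* true: link must be re-checked */
};

/* Simulated link-change interrupt: request a new link check. */
static void link_change_irq(struct mac_state *mac)
{
	mac->get_link_status = true;
}

/* Old order: check the link, then clear the flag. An interrupt that
 * fires in between has its request overwritten.
 */
static void check_then_clear(struct mac_state *mac, bool irq_during_check)
{
	assert(mac != NULL);

	/* ... the PHY would be read here ... */
	if (irq_during_check)
		link_change_irq(mac);
	mac->get_link_status = false;	/* the request is lost */
}

/* New order: clear the flag first, then check the link. A request made
 * by an interrupt during the check survives.
 */
static void clear_then_check(struct mac_state *mac, bool irq_during_check)
{
	assert(mac != NULL);

	mac->get_link_status = false;
	/* ... the PHY would be read here ... */
	if (irq_during_check)
		link_change_irq(mac);	/* the request is kept */
}
```

With the old order the flag ends up false after the simulated interrupt,
so no re-check is scheduled; with the new order it ends up true.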

Link: https://lkml.org/lkml/2018/1/29/338
Reported-by: Alexander Duyck <alexander.du...@gmail.com>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/ich8lan.c | 31 -
 drivers/net/ethernet/intel/e1000e/mac.c | 14 ++---
 2 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c 
b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index d6d4ed7acf03..1dddfb7b2de6 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1383,6 +1383,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
 */
if (!mac->get_link_status)
return 0;
+   mac->get_link_status = false;
 
/* First we want to see if the MII Status Register reports
 * link.  If so, then we want to get the current speed/duplex
@@ -1390,12 +1391,12 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
 */
ret_val = e1000e_phy_has_link_generic(hw, 1, 0, &link);
if (ret_val)
-   return ret_val;
+   goto out;
 
if (hw->mac.type == e1000_pchlan) {
ret_val = e1000_k1_gig_workaround_hv(hw, link);
if (ret_val)
-   return ret_val;
+   goto out;
}
 
/* When connected at 10Mbps half-duplex, some parts are excessively
@@ -1428,7 +1429,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
 
ret_val = hw->phy.ops.acquire(hw);
if (ret_val)
-   return ret_val;
+   goto out;
 
if (hw->mac.type == e1000_pch2lan)
emi_addr = I82579_RX_CONFIG;
@@ -1450,7 +1451,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
hw->phy.ops.release(hw);
 
if (ret_val)
-   return ret_val;
+   goto out;
 
if (hw->mac.type >= e1000_pch_spt) {
u16 data;
@@ -1459,14 +1460,14 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
if (speed == SPEED_1000) {
ret_val = hw->phy.ops.acquire(hw);
if (ret_val)
-   return ret_val;
+   goto out;
 
ret_val = e1e_rphy_locked(hw,
  PHY_REG(776, 20),
  &data);
if (ret_val) {
hw->phy.ops.release(hw);
-   return ret_val;
+   goto out;
}
 
ptr_gap = (data & (0x3FF << 2)) >> 2;
@@ -1480,18 +1481,18 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
}
hw->phy.ops.release(hw);
if (ret_val)
-   return ret_val;
+   goto out;
} else {
ret_val = hw->phy.ops.acquire(hw);
if (ret_val)
-   return ret_val;
+   goto out;
 
ret_val = e1e_wphy_locked(hw,
  PHY_REG(776, 20),
  0xC023);
hw->phy.ops.release(hw);

[PATCH 1/2] Revert "e1000e: Separate signaling for link check/link up"

2018-03-05 Thread Benjamin Poirier

This reverts commit 19110cfbb34d4af0cdfe14cd243f3b09dc95b013.
This reverts commit 4110e02eb45ea447ec6f5459c9934de0a273fb91.
This reverts commit d3604515c9eda464a92e8e67aae82dfe07fe3c98.

Commit 19110cfbb34d ("e1000e: Separate signaling for link check/link up")
changed what happens to the link status when there is an error which
happens after "get_link_status = false" in the copper check_for_link
callbacks. Previously, such an error would be ignored and the link
considered up. After that commit, any error implies that the link is down.

Revert commit 19110cfbb34d ("e1000e: Separate signaling for link check/link
up") and its followups. After reverting, the race condition described in
the log of commit 19110cfbb34d is reintroduced. It may still be triggered
by LSC events but this should keep the link down in case the link is
electrically unstable, as discussed. The race may no longer be
triggered by RXO events because commit 4aea7a5c5e94 ("e1000e: Avoid
receiver overrun interrupt bursts") restored reading icr in the Other
handler.

Link: https://lkml.org/lkml/2018/3/1/789
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/ich8lan.c | 13 -
 drivers/net/ethernet/intel/e1000e/mac.c | 13 -
 drivers/net/ethernet/intel/e1000e/netdev.c  |  2 +-
 3 files changed, 9 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c 
b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index ff308b05d68c..d6d4ed7acf03 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1367,9 +1367,6 @@ static s32 e1000_disable_ulp_lpt_lp(struct e1000_hw *hw, 
bool force)
  *  Checks to see of the link status of the hardware has changed.  If a
  *  change in link status has been detected, then we read the PHY registers
  *  to get the current speed/duplex if link exists.
- *
- *  Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
- *  up).
  **/
 static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
 {
@@ -1385,7 +1382,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
 * Change or Rx Sequence Error interrupt.
 */
if (!mac->get_link_status)
-   return 1;
+   return 0;
 
/* First we want to see if the MII Status Register reports
 * link.  If so, then we want to get the current speed/duplex
@@ -1602,7 +1599,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
 * we have already determined whether we have link or not.
 */
if (!mac->autoneg)
-   return 1;
+   return -E1000_ERR_CONFIG;
 
/* Auto-Neg is enabled.  Auto Speed Detection takes care
 * of MAC speed/duplex configuration.  So we only need to
@@ -1616,12 +1613,10 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
 * different link partner.
 */
ret_val = e1000e_config_fc_after_link_up(hw);
-   if (ret_val) {
+   if (ret_val)
e_dbg("Error configuring flow control\n");
-   return ret_val;
-   }
 
-   return 1;
+   return ret_val;
 }
 
 static s32 e1000_get_variants_ich8lan(struct e1000_adapter *adapter)
diff --git a/drivers/net/ethernet/intel/e1000e/mac.c 
b/drivers/net/ethernet/intel/e1000e/mac.c
index db735644b312..b322011ec282 100644
--- a/drivers/net/ethernet/intel/e1000e/mac.c
+++ b/drivers/net/ethernet/intel/e1000e/mac.c
@@ -410,9 +410,6 @@ void e1000e_clear_hw_cntrs_base(struct e1000_hw *hw)
  *  Checks to see of the link status of the hardware has changed.  If a
  *  change in link status has been detected, then we read the PHY registers
  *  to get the current speed/duplex if link exists.
- *
- *  Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
- *  up).
  **/
 s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 {
@@ -426,7 +423,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 * Change or Rx Sequence Error interrupt.
 */
if (!mac->get_link_status)
-   return 1;
+   return 0;
 
/* First we want to see if the MII Status Register reports
 * link.  If so, then we want to get the current speed/duplex
@@ -450,7 +447,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 * we have already determined whether we have link or not.
 */
if (!mac->autoneg)
-   return 1;
+   return -E1000_ERR_CONFIG;
 
/* Auto-Neg is enabled.  Auto Speed Detection takes care
 * of MAC speed/duplex configuration.  So we only need to
@@ -464,12 +461,10 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 * different link partner.
 */
ret_val = e1000e_config_fc_after_link_up(hw);
-   if (ret_val) {
+

Re: [PATCH] e1000e: Fix link status in case of error.

2018-02-28 Thread Benjamin Poirier

On 2018/02/28 08:48, Alexander Duyck wrote:
> On Tue, Feb 27, 2018 at 9:20 PM, Benjamin Poirier <bpoir...@suse.com> wrote:
> > Before commit 19110cfbb34d ("e1000e: Separate signaling for link check/link
> > up"), errors which happen after "get_link_status = false" in the copper
> > check_for_link callbacks would be ignored and the link considered up. After
> > that commit, any error implies that the link is down. Since all
> > combinations of link up/down and error/no error are possible, do the same
> > thing as e1000e_phy_has_link_generic() and return the link status in a
> > separate variable.
> >
> > Fixes: 19110cfbb34d ("e1000e: Separate signaling for link check/link up")
> > Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
> 
> This seems more like a refactor than a fix. There are valid cases
> where errors can be ignored after we set the link to up. For example
> if we cannot negotiate flow control we may not care as long as the
> link is established. In such a case we may see errors establishing
> flow control and they should be ignored.

Indeed, before commit 19110cfbb34d ("e1000e: Separate signaling for link
check/link up") a failure of e1000e_config_fc_after_link_up() in
e1000e_check_for_copper_link() would be ignored. After that commit,
there is an extra 2s delay before link up, like what was just fixed for
autoneg in commit d8ca384786c2 ("e1000e: Fix check_for_link return value
with autoneg off"). That was an unintended change in behavior. This is
what this patch purports to fix. The same is true for other failure
paths in e1000_check_for_copper_link_ich8lan() that happen after
get_link_status = false.

> 
> If there are cases where we are setting the link as up too early we
> should address that instead of changing all the functions to make them
> look like other ones.
> 
> > ---
[...]

[PATCH] e1000e: Fix link status in case of error.

2018-02-27 Thread Benjamin Poirier

Before commit 19110cfbb34d ("e1000e: Separate signaling for link check/link
up"), errors which happen after "get_link_status = false" in the copper
check_for_link callbacks would be ignored and the link considered up. After
that commit, any error implies that the link is down. Since all
combinations of link up/down and error/no error are possible, do the same
thing as e1000e_phy_has_link_generic() and return the link status in a
separate variable.

Fixes: 19110cfbb34d ("e1000e: Separate signaling for link check/link up")
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/82571.c   |  6 --
 drivers/net/ethernet/intel/e1000e/ethtool.c |  5 +++--
 drivers/net/ethernet/intel/e1000e/hw.h  |  2 +-
 drivers/net/ethernet/intel/e1000e/ich8lan.c | 19 +++
 drivers/net/ethernet/intel/e1000e/mac.c | 26 +++---
 drivers/net/ethernet/intel/e1000e/mac.h |  6 +++---
 drivers/net/ethernet/intel/e1000e/netdev.c  | 11 +++
 7 files changed, 40 insertions(+), 35 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/82571.c 
b/drivers/net/ethernet/intel/e1000e/82571.c
index 6b03c8553e59..980ed89e61ea 100644
--- a/drivers/net/ethernet/intel/e1000e/82571.c
+++ b/drivers/net/ethernet/intel/e1000e/82571.c
@@ -40,7 +40,8 @@
 static s32 e1000_get_phy_id_82571(struct e1000_hw *hw);
 static s32 e1000_setup_copper_link_82571(struct e1000_hw *hw);
 static s32 e1000_setup_fiber_serdes_link_82571(struct e1000_hw *hw);
-static s32 e1000_check_for_serdes_link_82571(struct e1000_hw *hw);
+static s32 e1000_check_for_serdes_link_82571(struct e1000_hw *hw,
+bool *unused);
 static s32 e1000_write_nvm_eewr_82571(struct e1000_hw *hw, u16 offset,
  u16 words, u16 *data);
 static s32 e1000_fix_nvm_checksum_82571(struct e1000_hw *hw);
@@ -1493,6 +1494,7 @@ static s32 e1000_setup_fiber_serdes_link_82571(struct 
e1000_hw *hw)
 /**
  *  e1000_check_for_serdes_link_82571 - Check for link (Serdes)
  *  @hw: pointer to the HW structure
+ *  @unused: unused for serdes links
  *
  *  Reports the link state as up or down.
  *
@@ -1509,7 +1511,7 @@ static s32 e1000_setup_fiber_serdes_link_82571(struct 
e1000_hw *hw)
  *  4) forced_up (the link has been forced up, it did not autonegotiate)
  *
  **/
-static s32 e1000_check_for_serdes_link_82571(struct e1000_hw *hw)
+static s32 e1000_check_for_serdes_link_82571(struct e1000_hw *hw, bool *unused)
 {
struct e1000_mac_info *mac = &hw->mac;
u32 rxcw;
diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c 
b/drivers/net/ethernet/intel/e1000e/ethtool.c
index 003cbd605799..1946ddae06c0 100644
--- a/drivers/net/ethernet/intel/e1000e/ethtool.c
+++ b/drivers/net/ethernet/intel/e1000e/ethtool.c
@@ -1753,6 +1753,7 @@ static int e1000_loopback_test(struct e1000_adapter 
*adapter, u64 *data)
 static int e1000_link_test(struct e1000_adapter *adapter, u64 *data)
 {
struct e1000_hw *hw = &adapter->hw;
+   bool link_status;
 
*data = 0;
if (hw->phy.media_type == e1000_media_type_internal_serdes) {
@@ -1764,7 +1765,7 @@ static int e1000_link_test(struct e1000_adapter *adapter, 
u64 *data)
 * could take as long as 2-3 minutes
 */
do {
-   hw->mac.ops.check_for_link(hw);
+   hw->mac.ops.check_for_link(hw, NULL);
if (hw->mac.serdes_has_link)
return *data;
msleep(20);
@@ -1772,7 +1773,7 @@ static int e1000_link_test(struct e1000_adapter *adapter, 
u64 *data)
 
*data = 1;
} else {
-   hw->mac.ops.check_for_link(hw);
+   hw->mac.ops.check_for_link(hw, &link_status);
if (hw->mac.autoneg)
/* On some Phy/switch combinations, link establishment
 * can take a few seconds more than expected.
diff --git a/drivers/net/ethernet/intel/e1000e/hw.h 
b/drivers/net/ethernet/intel/e1000e/hw.h
index d803b1a12349..4dff6df469bb 100644
--- a/drivers/net/ethernet/intel/e1000e/hw.h
+++ b/drivers/net/ethernet/intel/e1000e/hw.h
@@ -472,7 +472,7 @@ struct e1000_mac_operations {
s32  (*id_led_init)(struct e1000_hw *);
s32  (*blink_led)(struct e1000_hw *);
bool (*check_mng_mode)(struct e1000_hw *);
-   s32  (*check_for_link)(struct e1000_hw *);
+   s32  (*check_for_link)(struct e1000_hw *, bool *);
s32  (*cleanup_led)(struct e1000_hw *);
void (*clear_hw_cntrs)(struct e1000_hw *);
void (*clear_vfta)(struct e1000_hw *);
diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c 
b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index ff308b05d68c..3d25255289ff 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@

Re: [RFC PATCH] e1000e: Fix link check race condition.

2018-02-27 Thread Benjamin Poirier

On 2018/02/26 08:14, Alexander Duyck wrote:
[...]
> 
> >
> > switch (hw->mac.type) {
> > case e1000_pch2lan:
> > ret_val = e1000_k1_workaround_lv(hw);
> > if (ret_val)
> > -   return ret_val;
> > +   goto out;
> > /* fall-thru */
> > case e1000_pchlan:
> > if (hw->phy.type == e1000_phy_82578) {
> > ret_val = e1000_link_stall_workaround_hv(hw);
> > if (ret_val)
> > -   return ret_val;
> > +   goto out;
> > }
> >
> > /* Workaround for PCHx parts in half-duplex:
> > @@ -1595,7 +1596,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
> > e1000_hw *hw)
> > if (hw->phy.type > e1000_phy_82579) {
> > ret_val = e1000_set_eee_pchlan(hw);
> > if (ret_val)
> > -   return ret_val;
> > +   goto out;
> > }
> >
> > /* If we are forcing speed/duplex, then we simply return since
> > @@ -1618,10 +1619,14 @@ static s32 
> > e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> > ret_val = e1000e_config_fc_after_link_up(hw);
> > if (ret_val) {
> > e_dbg("Error configuring flow control\n");
> > -   return ret_val;
> > +   goto out;
> > }
> 
> Technically these changes would be a change in behavior. For these we
> may just want to leave them as-is since I am not certain they would
> have any actual impact on the link state other than delaying the
> link-up. For example do we really care if we fail to negotiate flow
> control? We may not so we might report link up and just a debug
> message indicating we didn't negotiate that part of the link.

You're right and actually that raises yet another problem with commit
19110cfbb34d ("e1000e: Separate signaling for link check/link up").

Previously these errors which come after the "get_link_status = false"
would be ignored and the link considered up. After that commit, any
error implies that the link is down:

-   link_active = !hw->mac.get_link_status;
+   link_active = ret_val > 0;

I'll send a patch to fix that problem first and then get back to this
race.
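
To spell out the difference between the two lines quoted above, here is
a small userspace sketch. The function names are hypothetical; only the
two expressions are taken from the diff:

```c
#include <stdbool.h>

/* Before the commit: link state comes from the flag alone, so an error
 * returned after the flag was cleared does not affect it.
 */
static bool link_active_before(bool get_link_status)
{
	return !get_link_status;
}

/* After the commit: link state comes from the return value, so any
 * negative error code reads as "link down".
 */
static bool link_active_after(int ret_val)
{
	return ret_val > 0;
}
```

Take a late failure, for example in flow control configuration: the flag
has already been cleared and ret_val is negative. The first function
reports the link as up and the second reports it as down.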

[RFC PATCH] e1000e: Fix link check race condition.

2018-02-25 Thread Benjamin Poirier

Alex reported the following race condition:

/* link goes up... interrupt... schedule watchdog */
\ e1000_watchdog_task
\ e1000e_has_link
\ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
\ e1000e_phy_has_link_generic(..., &link)
link = true

 /* link goes down... interrupt */
 \ e1000_msix_other
 hw->mac.get_link_status = true

/* link is up */
mac->get_link_status = false

link_active = true
/* link_active is true, wrongly, and stays so because
 * get_link_status is false */

Avoid this problem by making sure that we don't set get_link_status = false
after having checked the link.

It seems this problem has been present since the introduction of e1000e.

Link: https://lkml.org/lkml/2018/1/29/338
Reported-by: Alexander Duyck <alexander.du...@gmail.com>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/ich8lan.c | 41 -
 drivers/net/ethernet/intel/e1000e/mac.c | 14 +++---
 2 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c 
b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index ff308b05d68c..3c2c4f87e075 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1386,6 +1386,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
 */
if (!mac->get_link_status)
return 1;
+   mac->get_link_status = false;
 
/* First we want to see if the MII Status Register reports
 * link.  If so, then we want to get the current speed/duplex
@@ -1393,12 +1394,12 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
 */
ret_val = e1000e_phy_has_link_generic(hw, 1, 0, &link);
if (ret_val)
-   return ret_val;
+   goto out;
 
if (hw->mac.type == e1000_pchlan) {
ret_val = e1000_k1_gig_workaround_hv(hw, link);
if (ret_val)
-   return ret_val;
+   goto out;
}
 
/* When connected at 10Mbps half-duplex, some parts are excessively
@@ -1431,7 +1432,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
 
ret_val = hw->phy.ops.acquire(hw);
if (ret_val)
-   return ret_val;
+   goto out;
 
if (hw->mac.type == e1000_pch2lan)
emi_addr = I82579_RX_CONFIG;
@@ -1453,7 +1454,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
hw->phy.ops.release(hw);
 
if (ret_val)
-   return ret_val;
+   goto out;
 
if (hw->mac.type >= e1000_pch_spt) {
u16 data;
@@ -1462,14 +1463,14 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
if (speed == SPEED_1000) {
ret_val = hw->phy.ops.acquire(hw);
if (ret_val)
-   return ret_val;
+   goto out;
 
ret_val = e1e_rphy_locked(hw,
  PHY_REG(776, 20),
  &data);
if (ret_val) {
hw->phy.ops.release(hw);
-   return ret_val;
+   goto out;
}
 
ptr_gap = (data & (0x3FF << 2)) >> 2;
@@ -1483,18 +1484,18 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
}
hw->phy.ops.release(hw);
if (ret_val)
-   return ret_val;
+   goto out;
} else {
ret_val = hw->phy.ops.acquire(hw);
if (ret_val)
-   return ret_val;
+   goto out;
 
ret_val = e1e_wphy_locked(hw,
  PHY_REG(776, 20),
  0xC023);
hw->phy.ops.release(hw);

[PATCH net-queue] e1000e: Fix check_for_link return value with autoneg off.

2018-02-19 Thread Benjamin Poirier

When autoneg is off, the .check_for_link callback functions clear the
get_link_status flag and systematically return a "pseudo-error". This means
that the link is not detected as up until the next execution of the
e1000_watchdog_task() 2 seconds later.

Fixes: 19110cfbb34d ("e1000e: Separate signaling for link check/link up")
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/ich8lan.c | 2 +-
 drivers/net/ethernet/intel/e1000e/mac.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c 
b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index 31277d3bb7dc..ff308b05d68c 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1602,7 +1602,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct 
e1000_hw *hw)
 * we have already determined whether we have link or not.
 */
if (!mac->autoneg)
-   return -E1000_ERR_CONFIG;
+   return 1;
 
/* Auto-Neg is enabled.  Auto Speed Detection takes care
 * of MAC speed/duplex configuration.  So we only need to
diff --git a/drivers/net/ethernet/intel/e1000e/mac.c 
b/drivers/net/ethernet/intel/e1000e/mac.c
index f457c5703d0c..db735644b312 100644
--- a/drivers/net/ethernet/intel/e1000e/mac.c
+++ b/drivers/net/ethernet/intel/e1000e/mac.c
@@ -450,7 +450,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 * we have already determined whether we have link or not.
 */
if (!mac->autoneg)
-   return -E1000_ERR_CONFIG;
+   return 1;
 
/* Auto-Neg is enabled.  Auto Speed Detection takes care
 * of MAC speed/duplex configuration.  So we only need to
-- 
2.16.1
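
The effect of the return value change in the patch above can be illustrated
with a small stand-alone model. This is only a sketch, not driver code:
model_check_for_link() and model_has_link() are hypothetical names, the error
constant is an illustrative value, and the rule that the caller treats only a
positive return as "link up" is an assumption of the sketch.

```c
#include <stdbool.h>

/* Illustrative value only; the driver defines its own error codes. */
#define MODEL_ERR_CONFIG 3

/* Hypothetical stand-in for a .check_for_link callback with autoneg off.
 * 'fixed' selects the behaviour after the patch (return 1) or before it
 * (return a negative pseudo-error).
 */
int model_check_for_link(bool link_is_up, bool fixed)
{
	if (!link_is_up)
		return 0;	/* link status obtained: down */
	return fixed ? 1 : -MODEL_ERR_CONFIG;
}

/* Assumed caller rule: only a positive return counts as "link up". */
bool model_has_link(int ret_val)
{
	return ret_val > 0;
}
```

With the pseudo-error, model_has_link() reports false even though the link is
up, so a caller following this rule would have to wait for its next poll; with
a return of 1 the link is reported up on the first call.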

Re: [PATCH net-queue 1/3] Partial revert "e1000e: Avoid receiver overrun interrupt bursts"

2018-02-08 Thread Benjamin Poirier

I forgot to mark it as such but this is v2 of the series originally
submitted in this thread:
https://lkml.org/lkml/2018/1/26/93

Changes since v1:
* series rebased to apply over "e1000e: Remove Other from EIAC."
  http://patchwork.ozlabs.org/patch/867833/
  This essentially removes patch 3/3 from the original series.
* dropped [PATCH 2/3] Revert "e1000e: Separate signaling for link
  check/link up". After Alex's feedback, I think that problem needs to
  be addressed differently and I will submit a separate patch for it.
* patch 1 was split into three parts. Instead of manually clearing OTHER
  via a write to icr as in v1, in v2 we make sure that INT_ASSERTED is
  always set via bits for all events related to the Other interrupt in
  IMS.

Benjamin Poirier (3):
  Partial revert "e1000e: Avoid receiver overrun interrupt bursts"
  e1000e: Fix queue interrupt re-raising in Other interrupt.
  e1000e: Avoid missed interrupts following ICR read.

 drivers/net/ethernet/intel/e1000e/defines.h | 21 ++-
 drivers/net/ethernet/intel/e1000e/netdev.c  | 32 +
 2 files changed, 30 insertions(+), 23 deletions(-)

On 2018/02/08 15:47, Benjamin Poirier wrote:
> This partially reverts commit 4aea7a5c5e940c1723add439f4088844cd26196d.
> 
> We keep the fix for the first part of the problem (1) described in the log
> of that commit, that is to read ICR in the other interrupt handler. We
> remove the fix for the second part of the problem (2), Other interrupt
> throttling.
> 
> Bursts of "Other" interrupts may once again occur during rxo (receive
> overflow) traffic conditions. This is deemed acceptable in the interest of
> avoiding unforeseen fallout from changes that are not strictly necessary.
> As discussed, the e1000e driver should be in "maintenance mode".
> 
> Link: https://www.spinics.net/lists/netdev/msg480675.html
> Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 16 ++--
>  1 file changed, 2 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
> b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 153ad406c65e..3b36efa6228d 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -1915,21 +1915,10 @@ static irqreturn_t e1000_msix_other(int 
> __always_unused irq, void *data)
>   struct e1000_adapter *adapter = netdev_priv(netdev);
>   struct e1000_hw *hw = &adapter->hw;
>   u32 icr;
> - bool enable = true;
>  
>   icr = er32(ICR);
>   ew32(ICR, E1000_ICR_OTHER);
>  
> - if (icr & E1000_ICR_RXO) {
> - ew32(ICR, E1000_ICR_RXO);
> - enable = false;
> - /* napi poll will re-enable Other, make sure it runs */
> - if (napi_schedule_prep(&adapter->napi)) {
> - adapter->total_rx_bytes = 0;
> - adapter->total_rx_packets = 0;
> - __napi_schedule(&adapter->napi);
> - }
> - }
>   if (icr & E1000_ICR_LSC) {
>   ew32(ICR, E1000_ICR_LSC);
>   hw->mac.get_link_status = true;
> @@ -1938,7 +1927,7 @@ static irqreturn_t e1000_msix_other(int __always_unused 
> irq, void *data)
>   mod_timer(&adapter->watchdog_timer, jiffies + 1);
>   }
>  
> - if (enable && !test_bit(__E1000_DOWN, &adapter->state))
> + if (!test_bit(__E1000_DOWN, &adapter->state))
>   ew32(IMS, E1000_IMS_OTHER);
>  
>   return IRQ_HANDLED;
> @@ -2708,8 +2697,7 @@ static int e1000e_poll(struct napi_struct *napi, int 
> weight)
>   napi_complete_done(napi, work_done);
>   if (!test_bit(__E1000_DOWN, &adapter->state)) {
>   if (adapter->msix_entries)
> - ew32(IMS, adapter->rx_ring->ims_val |
> -  E1000_IMS_OTHER);
> + ew32(IMS, adapter->rx_ring->ims_val);
>   else
>   e1000_irq_enable(adapter);
>   }
> -- 
> 2.16.1
>

[PATCH net-queue 1/3] Partial revert "e1000e: Avoid receiver overrun interrupt bursts"

2018-02-07 Thread Benjamin Poirier

This partially reverts commit 4aea7a5c5e940c1723add439f4088844cd26196d.

We keep the fix for the first part of the problem (1) described in the log
of that commit, that is to read ICR in the other interrupt handler. We
remove the fix for the second part of the problem (2), Other interrupt
throttling.

Bursts of "Other" interrupts may once again occur during rxo (receive
overflow) traffic conditions. This is deemed acceptable in the interest of
avoiding unforeseen fallout from changes that are not strictly necessary.
As discussed, the e1000e driver should be in "maintenance mode".

Link: https://www.spinics.net/lists/netdev/msg480675.html
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 16 ++--
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 153ad406c65e..3b36efa6228d 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1915,21 +1915,10 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
u32 icr;
-   bool enable = true;
 
icr = er32(ICR);
ew32(ICR, E1000_ICR_OTHER);
 
-   if (icr & E1000_ICR_RXO) {
-   ew32(ICR, E1000_ICR_RXO);
-   enable = false;
-   /* napi poll will re-enable Other, make sure it runs */
-   if (napi_schedule_prep(&adapter->napi)) {
-   adapter->total_rx_bytes = 0;
-   adapter->total_rx_packets = 0;
-   __napi_schedule(&adapter->napi);
-   }
-   }
if (icr & E1000_ICR_LSC) {
ew32(ICR, E1000_ICR_LSC);
hw->mac.get_link_status = true;
@@ -1938,7 +1927,7 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
mod_timer(&adapter->watchdog_timer, jiffies + 1);
}
 
-   if (enable && !test_bit(__E1000_DOWN, &adapter->state))
+   if (!test_bit(__E1000_DOWN, &adapter->state))
ew32(IMS, E1000_IMS_OTHER);
 
return IRQ_HANDLED;
@@ -2708,8 +2697,7 @@ static int e1000e_poll(struct napi_struct *napi, int 
weight)
napi_complete_done(napi, work_done);
if (!test_bit(__E1000_DOWN, &adapter->state)) {
if (adapter->msix_entries)
-   ew32(IMS, adapter->rx_ring->ims_val |
-E1000_IMS_OTHER);
+   ew32(IMS, adapter->rx_ring->ims_val);
else
e1000_irq_enable(adapter);
}
-- 
2.16.1

[PATCH net-queue 2/3] e1000e: Fix queue interrupt re-raising in Other interrupt.

2018-02-07 Thread Benjamin Poirier

Restore the ICS write for rx/tx queue interrupts which was present before
commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt",
v4.5-rc1) but was not restored in commit 4aea7a5c5e94 ("e1000e: Avoid
receiver overrun interrupt bursts", v4.15-rc1).

This re-raises the queue interrupts in case the txq or rxq bits were set in
ICR and the Other interrupt handler read and cleared ICR before the queue
interrupt was raised.

Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 3b36efa6228d..2c9609bee2ae 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1919,6 +1919,9 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
icr = er32(ICR);
ew32(ICR, E1000_ICR_OTHER);
 
+   if (icr & adapter->eiac_mask)
+   ew32(ICS, (icr & adapter->eiac_mask));
+
if (icr & E1000_ICR_LSC) {
ew32(ICR, E1000_ICR_LSC);
hw->mac.get_link_status = true;
-- 
2.16.1

[PATCH net-queue 3/3] e1000e: Avoid missed interrupts following ICR read.

2018-02-07 Thread Benjamin Poirier

The 82574 specification update errata 12 states that interrupts may be
missed if ICR is read while INT_ASSERTED is not set. Avoid that problem by
setting all bits related to events that can trigger the Other interrupt in
IMS.

The Other interrupt is raised for such events regardless of whether or not
they are set in IMS. However, only when they are set is the INT_ASSERTED
bit also set in ICR.

By doing this, we ensure that INT_ASSERTED is always set when we read ICR
in e1000_msix_other() and steer clear of the errata. This also ensures that
ICR will automatically be cleared on read, therefore we no longer need to
clear bits explicitly.

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/defines.h | 21 -
 drivers/net/ethernet/intel/e1000e/netdev.c  | 11 ---
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h 
b/drivers/net/ethernet/intel/e1000e/defines.h
index afb7ebe20b24..824fd44e25f0 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -400,6 +400,10 @@
 #define E1000_ICR_RXDMT0 0x00000010 /* Rx desc min. threshold (0) */
 #define E1000_ICR_RXO   0x00000040 /* Receiver Overrun */
 #define E1000_ICR_RXT0  0x00000080 /* Rx timer intr (ring 0) */
+#define E1000_ICR_MDAC  0x00000200 /* MDIO Access Complete */
+#define E1000_ICR_SRPD  0x00010000 /* Small Receive Packet Detected */
+#define E1000_ICR_ACK   0x00020000 /* Receive ACK Frame Detected */
+#define E1000_ICR_MNG   0x00040000 /* Manageability Event Detected */
 #define E1000_ICR_ECCER 0x00400000 /* Uncorrectable ECC Error */
 /* If this bit asserted, the driver should claim the interrupt */
 #define E1000_ICR_INT_ASSERTED 0x80000000
@@ -407,7 +411,7 @@
 #define E1000_ICR_RXQ1  0x00200000 /* Rx Queue 1 Interrupt */
 #define E1000_ICR_TXQ0  0x00400000 /* Tx Queue 0 Interrupt */
 #define E1000_ICR_TXQ1  0x00800000 /* Tx Queue 1 Interrupt */
-#define E1000_ICR_OTHER 0x01000000 /* Other Interrupts */
+#define E1000_ICR_OTHER 0x01000000 /* Other Interrupt */
 
 /* PBA ECC Register */
 #define E1000_PBA_ECC_COUNTER_MASK  0xFFF00000 /* ECC counter mask */
@@ -431,12 +435,27 @@
E1000_IMS_RXSEQ  |\
E1000_IMS_LSC)
 
+/* These are all of the events related to the OTHER interrupt.
+ */
+#define IMS_OTHER_MASK ( \
+   E1000_IMS_LSC  | \
+   E1000_IMS_RXO  | \
+   E1000_IMS_MDAC | \
+   E1000_IMS_SRPD | \
+   E1000_IMS_ACK  | \
+   E1000_IMS_MNG)
+
 /* Interrupt Mask Set */
 #define E1000_IMS_TXDW  E1000_ICR_TXDW  /* Transmit desc written back 
*/
 #define E1000_IMS_LSC   E1000_ICR_LSC   /* Link Status Change */
 #define E1000_IMS_RXSEQ E1000_ICR_RXSEQ /* Rx sequence error */
 #define E1000_IMS_RXDMT0 E1000_ICR_RXDMT0 /* Rx desc min. threshold */
+#define E1000_IMS_RXO   E1000_ICR_RXO   /* Receiver Overrun */
 #define E1000_IMS_RXT0  E1000_ICR_RXT0  /* Rx timer intr */
+#define E1000_IMS_MDAC  E1000_ICR_MDAC  /* MDIO Access Complete */
+#define E1000_IMS_SRPD  E1000_ICR_SRPD  /* Small Receive Packet */
+#define E1000_IMS_ACK   E1000_ICR_ACK   /* Receive ACK Frame Detected 
*/
+#define E1000_IMS_MNG   E1000_ICR_MNG   /* Manageability Event */
 #define E1000_IMS_ECCER E1000_ICR_ECCER /* Uncorrectable ECC Error */
 #define E1000_IMS_RXQ0  E1000_ICR_RXQ0  /* Rx Queue 0 Interrupt */
 #define E1000_IMS_RXQ1  E1000_ICR_RXQ1  /* Rx Queue 1 Interrupt */
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 2c9609bee2ae..9fd4050a91ca 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1914,16 +1914,12 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
-   u32 icr;
-
-   icr = er32(ICR);
-   ew32(ICR, E1000_ICR_OTHER);
+   u32 icr = er32(ICR);
 
if (icr & adapter->eiac_mask)
ew32(ICS, (icr & adapter->eiac_mask));
 
if (icr & E1000_ICR_LSC) {
-   ew32(ICR, E1000_ICR_LSC);
hw->mac.get_link_status = true;
/* guard against interrupt when we're going down */
if (!test_bit(__E1000_DOWN, &adapter->state))
@@ -1931,7 +1927,7 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
}
 
if (!test_bit(__E1000_DOWN, &adapter->state))
-   ew32(IMS, E1000_IMS_OTHER);
+   ew32(IMS, E1000_IMS_OTHER | IMS_OTHER_MASK);
 
return IRQ_HANDLED;
 }
@@ -2258,7 +2254,8 @@ static void e1000_irq_enable(struct e1000_adap

Re: [PATCH 3/3] Revert "e1000e: Do not read ICR in Other interrupt"

2018-02-07 Thread Benjamin Poirier

On 2018/01/29 09:22, Alexander Duyck wrote:
[...]
> 
> > Consequently, we must clear OTHER manually from ICR, otherwise the
> > interrupt is immediately re-raised after exiting the handler.
> >
> > These observations are the same whether the interrupt is triggered via a
> > write to ICS or in hardware.
> >
> > Furthermore, I tested that this behavior is the same for other Other
> > events (MDAC, SRPD, ACK, MNG). Those were tested via a write to ICS
> > only, not in hardware.
> >
> > This is a version of the test patch that I used to trigger lsc and rxo in
> > software and hardware. It applies over this patch series.
> 
> I plan to look into this some more over the next few days. Ideally if
> we could mask these "OTHER" interrupts besides the LSC we could comply
> with all the needed bits for MSI-X. My concern is that we are still
> stuck reading the ICR at this point because of this and it is going to
> make dealing with MSI-X challenging on 82574 since it seems like the
> intention was that you weren't supposed to be reading the ICR when
> MSI-X is enabled based on the list of current issues and HW errata.

I totally agree with you that it looks like the msi-x interface was
designed so you don't need to read icr. That's also why I was happy to
go that direction with the (now infamous) commit 16ecba59bc33 ("e1000e:
Do not read ICR in Other interrupt", v4.5-rc1).

However, we looked at it before and there seems to be no way to mask
individual Other interrupt causes (masking rxo but getting lsc). Because
of that, I think we have to keep reading icr in the Other interrupt
handler.

> 
> At this point it seems like the interrupt is firing and the
> INT_ASSERTED is all we really need to be checking for if I understand
> this all correctly. Basically if LSC is set it will trigger OTHER and
> INT_ASSERTED, if any of the other causes are set they are only setting
> OTHER.

I think that's right and it's related to the fact that currently LSC is
set in IMS but not the other causes. Since we have to read icr (as I
wrote above) but we want to avoid reading it without INT_ASSERTED set
(as per errata 12) the solution will be to set all of the causes related
to Other in IMS. Patches incoming...
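
To make the reasoning above concrete, here is a small register model. It is
only a sketch under stated assumptions, not a description of the hardware:
model_raise() and model_read_icr() are hypothetical names, the bit values are
the ones I understand defines.h to use, and the model ignores the OTHER bit
itself when deciding INT_ASSERTED, which is an assumption chosen to match the
reads reported in this thread.

```c
#include <stdint.h>

/* Bit values as understood from the e1000e defines.h; assumptions here. */
#define ICR_LSC          0x00000004u
#define ICR_RXO          0x00000040u
#define ICR_OTHER        0x01000000u
#define ICR_INT_ASSERTED 0x80000000u

struct model_regs {
	uint32_t icr;	/* pending causes */
	uint32_t ims;	/* causes enabled in the interrupt mask */
};

/* Hypothetical model of an Other-type event: the cause and OTHER are
 * always latched, INT_ASSERTED only if a pending cause is set in IMS.
 */
void model_raise(struct model_regs *r, uint32_t cause)
{
	r->icr |= cause | ICR_OTHER;
	if (r->icr & r->ims & ~ICR_OTHER)
		r->icr |= ICR_INT_ASSERTED;
}

/* Hypothetical model of an ICR read: read-to-clear only takes place
 * when INT_ASSERTED is set.
 */
uint32_t model_read_icr(struct model_regs *r)
{
	uint32_t val = r->icr;

	if (val & ICR_INT_ASSERTED)
		r->icr = 0;
	return val;
}
```

In this model, an rxo event with only LSC and OTHER in IMS leaves ICR
unchanged across reads, while adding RXO to IMS makes the first read return
INT_ASSERTED and clear the register.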

Re: [Intel-wired-lan] [RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-30 Thread Benjamin Poirier

On 2018/01/30 11:46, Alexander Duyck wrote:
> On Wed, Jan 17, 2018 at 10:50 PM, Benjamin Poirier <bpoir...@suse.com> wrote:
> > It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> > 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> > overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
> > e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> > on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> > icr=0x80000004 (_INT_ASSERTED | _OTHER) in the same situation.
> >
> > Some experimentation showed that this flaw in vmware e1000e emulation can
> > be worked around by not setting Other in EIAC. This is how it was before
> > 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).
> >
> > Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
> > Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
> > ---
> 
> Hi Benjamin,
> 
> How would you feel about resubmitting this patch for net?
> 
> We have some issues that have come up and it would be useful to have
> this fixed in the kernel sooner rather than later. I would be okay
> with us applying it for now while we work on coming up with a more
> complete solution.

Ok, I've resent it in its original form. Once it's in mainline I'll
rebase the cleanups.

[PATCH net] e1000e: Remove Other from EIAC.

2018-01-30 Thread Benjamin Poirier

It was reported that emulated e1000e devices in vmware esxi 6.5 Build
7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
icr=0x80000004 (_INT_ASSERTED | _LSC) in the same situation.

Some experimentation showed that this flaw in vmware e1000e emulation can
be worked around by not setting Other in EIAC. This is how it was before
16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).

Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 9f18d39bdc8f..625a4c9a86a4 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1918,6 +1918,8 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
bool enable = true;
 
icr = er32(ICR);
+   ew32(ICR, E1000_ICR_OTHER);
+
if (icr & E1000_ICR_RXO) {
ew32(ICR, E1000_ICR_RXO);
enable = false;
@@ -2040,7 +2042,6 @@ static void e1000_configure_msix(struct e1000_adapter 
*adapter)
   hw->hw_addr + E1000_EITR_82574(vector));
else
writel(1, hw->hw_addr + E1000_EITR_82574(vector));
-   adapter->eiac_mask |= E1000_IMS_OTHER;
 
/* Cause Tx interrupts on every write back */
ivar |= BIT(31);
@@ -2265,7 +2266,7 @@ static void e1000_irq_enable(struct e1000_adapter 
*adapter)
 
if (adapter->msix_entries) {
ew32(EIAC_82574, adapter->eiac_mask & E1000_EIAC_MASK_82574);
-   ew32(IMS, adapter->eiac_mask | E1000_IMS_LSC);
+   ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER | E1000_IMS_LSC);
} else if (hw->mac.type >= e1000_pch_lpt) {
ew32(IMS, IMS_ENABLE_MASK | E1000_IMS_ECCER);
} else {
-- 
2.15.1

Re: [PATCH 2/3] Revert "e1000e: Separate signaling for link check/link up"

2018-01-29 Thread Benjamin Poirier

On 2018/01/26 09:03, Alexander Duyck wrote:
> On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier <bpoir...@suse.com> wrote:
> > This reverts commit 19110cfbb34d4af0cdfe14cd243f3b09dc95b013.
> > This reverts commit 4110e02eb45ea447ec6f5459c9934de0a273fb91.
> >
> > ... because they cause an extra 2s delay for the link to come up when
> > autoneg is off.
> >
> > After reverting, the race condition described in the log of commit
> > 19110cfbb34d ("e1000e: Separate signaling for link check/link up") is
> > reintroduced. It may still be triggered by LSC events but this should not
> > result in link flap. It may no longer be triggered by RXO events because
> > commit 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
> > restored reading icr in the Other handler.
> 
> With the RXO events removed the only cause for us to transition the
> bit should be LSC. I'm not sure if the race condition in that state is
> a valid concern or not as the LSC should only get triggered if the
> link state toggled, even briefly.
> 
> The bigger concern I would have would be the opposite of the original
> race that was pointed out:
> \ e1000_watchdog_task
> \ e1000e_has_link
> \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
> /* link is up */
> mac->get_link_status = false;
> 
> /* interrupt */
> \ e1000_msix_other
> hw->mac.get_link_status = true;
> 
> link_active = !hw->mac.get_link_status
> /* link_active is false, wrongly */
> 
> So the question I would have is what if we see the LSC for a link down
> just after the check_for_copper_link call completes? It may not be

Can you write out exactly what that race would be, in a format similar to the
above?

> anything seen in the real world since I don't know if we have any link
> flapping issues on e1000e or not without this patch. It is something
> to keep in mind for the future though.
> 
> 
> > As discussed, the driver should be in "maintenance mode". In the interest
> > of stability, revert to the original code as much as possible instead of a
> > half-baked solution.
> 
> If nothing else we may want to do a follow-up on this patch as we
> probably shouldn't be returning the error values to trigger link up.
> There are definitely issues to be found here. If nothing else we may
> want to explore just returning 1 if auto-neg is disabled instead of
> returning an error code.
> 
> > Link: https://www.spinics.net/lists/netdev/msg479923.html
> > Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
[...]

Re: [PATCH 3/3] Revert "e1000e: Do not read ICR in Other interrupt"

2018-01-28 Thread Benjamin Poirier

On 2018/01/26 13:01, Alexander Duyck wrote:
> On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier <bpoir...@suse.com> wrote:
> > This reverts commit 16ecba59bc333d6282ee057fb02339f77a880beb.
> >
> > It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> > 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> > overrun interrupt bursts"). Some tracing shows that after
> > e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> > on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> > icr=0x80000004 (_INT_ASSERTED | _LSC) in the same situation.
> >
> > Some experimentation showed that this flaw in vmware e1000e emulation can
> > be worked around by not setting Other in EIAC. This is how it was before
> > commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
> >
> > Since the ICR read in the Other interrupt handler has already been
> > restored, this patch effectively reverts the remainder of commit
> > 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
> >
> > Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
> > Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
> > ---
> >  drivers/net/ethernet/intel/e1000e/netdev.c | 10 --
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
> > b/drivers/net/ethernet/intel/e1000e/netdev.c
> > index ed103b9a8d3a..fffc1f0e3895 100644
> > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > @@ -1916,6 +1916,13 @@ static irqreturn_t e1000_msix_other(int 
> > __always_unused irq, void *data)
> > struct e1000_hw *hw = &adapter->hw;
> > u32 icr = er32(ICR);
> >
> > +   /* Certain events (such as RXO) which trigger Other do not set
> > +* INT_ASSERTED. In that case, read to clear of icr does not take
> > +* place.
> > +*/
> > +   if (!(icr & E1000_ICR_INT_ASSERTED))
> > +   ew32(ICR, E1000_ICR_OTHER);
> > +
> 
> This piece doesn't make sense to me. Why are we clearing OTHER if
> ICR_INT_ASSERTED is not set?

Datasheet §10.2.4.1 ("Interrupt Cause Read Register") says that ICR read
to clear only occurs if INT_ASSERTED is set. This corresponds to what I
observed.

However, while working on these issues, I noticed that when there is an rxo
event, INT_ASSERTED is not always set even though the interrupt is raised. I
think this is a hardware flaw.

For example, if doing
ew32(ICS, E1000_ICS_LSC | E1000_ICS_OTHER);
we enter e1000_msix_other() and two consecutive reads of ICR result in
0x81000004
0x00000000

If doing
ew32(ICS, E1000_ICS_RXO | E1000_ICS_OTHER);
we enter e1000_msix_other() and two consecutive reads of ICR result in
0x01000041
0x01000041

Consequently, we must clear OTHER manually from ICR, otherwise the
interrupt is immediately re-raised after exiting the handler.

These observations are the same whether the interrupt is triggered via a
write to ICS or in hardware.

Furthermore, I tested that this behavior is the same for other Other
events (MDAC, SRPD, ACK, MNG). Those were tested via a write to ICS
only, not in hardware.
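
For readers who want to decode the ICR reads quoted above, a small helper
along these lines can be used. It is a sketch only: icr_decode() is a
hypothetical name, the table covers just the bits relevant here, and it
assumes the reads are the full 32-bit values 0x81000004 and 0x01000041 with
the bit positions I understand defines.h to use.

```c
#include <stdint.h>
#include <string.h>

struct icr_bit {
	uint32_t mask;
	const char *name;
};

/* Subset of ICR bits; values are assumptions based on defines.h. */
static const struct icr_bit icr_bits[] = {
	{ 0x00000001u, "TXDW" },
	{ 0x00000004u, "LSC" },
	{ 0x00000040u, "RXO" },
	{ 0x01000000u, "OTHER" },
	{ 0x80000000u, "INT_ASSERTED" },
};

/* Write the names of the known bits set in icr into buf, joined by '|'. */
void icr_decode(uint32_t icr, char *buf, size_t len)
{
	size_t i;

	if (len == 0)
		return;
	buf[0] = '\0';
	for (i = 0; i < sizeof(icr_bits) / sizeof(icr_bits[0]); i++) {
		if (!(icr & icr_bits[i].mask))
			continue;
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, icr_bits[i].name, len - strlen(buf) - 1);
	}
}
```

Decoded this way, the first pair of reads carries INT_ASSERTED and is cleared
by the read, while the second pair does not carry it and stays set.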

This is a version of the test patch that I used to trigger lsc and rxo in
software and hardware. It applies over this patch series.

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h 
b/drivers/net/ethernet/intel/e1000e/defines.h
index 0641c0098738..f54e7ac9c934 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -398,6 +398,7 @@
 #define E1000_ICR_LSC   0x00000004 /* Link Status Change */
 #define E1000_ICR_RXSEQ 0x00000008 /* Rx sequence error */
 #define E1000_ICR_RXDMT0 0x00000010 /* Rx desc min. threshold (0) */
+#define E1000_ICR_RXO   0x00000040 /* rx overrun */
 #define E1000_ICR_RXT0  0x00000080 /* Rx timer intr (ring 0) */
 #define E1000_ICR_ECCER 0x00400000 /* Uncorrectable ECC Error */
 /* If this bit asserted, the driver should claim the interrupt */
diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c 
b/drivers/net/ethernet/intel/e1000e/ethtool.c
index 003cbd605799..4933c1beac74 100644
--- a/drivers/net/ethernet/intel/e1000e/ethtool.c
+++ b/drivers/net/ethernet/intel/e1000e/ethtool.c
@@ -1802,98 +1802,20 @@ static void e1000_diag_test(struct net_device *netdev,
struct ethtool_test *eth_test, u64 *data)
 {
struct e1000_adapter *adapter = netdev_priv(netdev);
-   u16 autoneg_advertised;
-   u8 forced_speed_duplex;
-   u8 a

Re: [PATCH 1/3] Partial revert "e1000e: Avoid receiver overrun interrupt bursts"

2018-01-28 Thread Benjamin Poirier

On 2018/01/26 08:50, Alexander Duyck wrote:
> On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier <bpoir...@suse.com> wrote:
> > This reverts commit 4aea7a5c5e940c1723add439f4088844cd26196d.
> >
> > We keep the fix for the first part of the problem (1) described in the log
> > of that commit however we use the implementation of e1000_msix_other() from
> > before commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
> > We remove the fix for the second part of the problem (2).
> >
> > Bursts of "Other" interrupts may once again occur during rxo (receive
> > overflow) traffic conditions. This is deemed acceptable in the interest of
> > reverting driver code back to its previous state. The e1000e driver should
> > be in "maintenance" mode and we want to avoid unforeseen fallout from
> > changes that are not strictly necessary.
> >
> > Link: https://www.spinics.net/lists/netdev/msg480675.html
> > Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
> 
> Thanks for doing this.
> 
> Only a few minor tweaks I would recommend. Otherwise it looks good.
> 
> > ---
> >  drivers/net/ethernet/intel/e1000e/defines.h |  1 -
> >  drivers/net/ethernet/intel/e1000e/netdev.c  | 32 
> > +++--
> >  2 files changed, 12 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/e1000e/defines.h 
> > b/drivers/net/ethernet/intel/e1000e/defines.h
> > index afb7ebe20b24..0641c0098738 100644
> > --- a/drivers/net/ethernet/intel/e1000e/defines.h
> > +++ b/drivers/net/ethernet/intel/e1000e/defines.h
> > @@ -398,7 +398,6 @@
> >  #define E1000_ICR_LSC   0x00000004 /* Link Status Change */
> >  #define E1000_ICR_RXSEQ 0x00000008 /* Rx sequence error */
> >  #define E1000_ICR_RXDMT0 0x00000010 /* Rx desc min. threshold (0) */
> > -#define E1000_ICR_RXO   0x00000040 /* Receiver Overrun */
> >  #define E1000_ICR_RXT0  0x00000080 /* Rx timer intr (ring 0) */
> >  #define E1000_ICR_ECCER 0x00400000 /* Uncorrectable ECC Error */
> >  /* If this bit asserted, the driver should claim the interrupt */
> > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
> > b/drivers/net/ethernet/intel/e1000e/netdev.c
> > index 9f18d39bdc8f..398e940436f8 100644
> > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > @@ -1914,30 +1914,23 @@ static irqreturn_t e1000_msix_other(int 
> > __always_unused irq, void *data)
> > struct net_device *netdev = data;
> > struct e1000_adapter *adapter = netdev_priv(netdev);
> > struct e1000_hw *hw = &adapter->hw;
> > -   u32 icr;
> > -   bool enable = true;
> > -
> > -   icr = er32(ICR);
> > -   if (icr & E1000_ICR_RXO) {
> > -   ew32(ICR, E1000_ICR_RXO);
> > -   enable = false;
> > -   /* napi poll will re-enable Other, make sure it runs */
> > -   if (napi_schedule_prep(&adapter->napi)) {
> > -   adapter->total_rx_bytes = 0;
> > -   adapter->total_rx_packets = 0;
> > -   __napi_schedule(&adapter->napi);
> > -   }
> > -   }
> > -   if (icr & E1000_ICR_LSC) {
> > -   ew32(ICR, E1000_ICR_LSC);
> > +   u32 icr = er32(ICR);
> > +
> > +   if (icr & adapter->eiac_mask)
> > +   ew32(ICS, (icr & adapter->eiac_mask));
> 
> I'm not sure about this bit as it includes queue interrupts if I am
> not mistaken. What we should be focusing on are the bits that are not
> auto-cleared so this should probably be icr & ~adapter->eiac_mask.
> Actually what you might do is something like:
> icr &= ~adapter->eiac_mask;
> if (icr)
> ew32(ICS, icr);
> 
> Also a comment explaining that we have to clear the bits that are not
> automatically cleared by other interrupt causes might be useful.

I've re-read your comment many times and I think you misunderstood what
that code is doing.

This:
> > +   if (icr & adapter->eiac_mask)
> > +   ew32(ICS, (icr & adapter->eiac_mask));

re-raises the queue interrupts in case the txq or rxq bits were set in
icr and the Other interrupt handler read and cleared icr before the
queue interrupt was raised. It's not about "clear the bits that are not
automatically cleared". It's a write to ICS, not ICR.
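
As a stand-alone illustration of the point above (a sketch only, with
hypothetical names and assumed bit positions, not the driver code), the value
written to ICS is just the queue causes that were pending in the ICR value
the handler read:

```c
#include <stdint.h>

/* Assumed bit positions for the sketch. */
#define ICR_LSC   0x00000004u
#define ICR_RXQ0  0x00100000u
#define ICR_TXQ0  0x00400000u
#define ICR_OTHER 0x01000000u

/* Return the value the handler would write to ICS: the queue causes that
 * were set in the ICR value it just read. A result of 0 means no write
 * is needed.
 */
uint32_t causes_to_reraise(uint32_t icr, uint32_t eiac_mask)
{
	return icr & eiac_mask;
}
```

With eiac_mask covering only the queue bits, an ICR read of OTHER | LSC | RXQ0
yields RXQ0 to re-raise, and LSC and OTHER are never written back to ICS.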

> 
> > +   if (icr & E1000_ICR_OTHER) {
> > +   if (!(icr & E100

[PATCH 0/3] e1000e: Revert interrupt handling changes

2018-01-26 Thread Benjamin Poirier

As discussed in the thread "[RFC PATCH] e1000e: Remove Other from EIAC.",
https://www.spinics.net/lists/netdev/msg479311.html

The following list of commits was considered:
4d432f67ff00 e1000e: Remove unreachable code (v4.5-rc1)
16ecba59bc33 e1000e: Do not read ICR in Other interrupt (v4.5-rc1)
a61cfe4ffad7 e1000e: Do not write lsc to ics in msi-x mode (v4.5-rc1)
0a8047ac68e5 e1000e: Fix msi-x interrupt automask (v4.5-rc1)
19110cfbb34d e1000e: Separate signaling for link check/link up (v4.15-rc1)
4aea7a5c5e94 e1000e: Avoid receiver overrun interrupt bursts (v4.15-rc1)
4110e02eb45e e1000e: Fix e1000_check_for_copper_link_ich8lan return value. 
(v4.15-rc8)

There has been a slew of regressions due to unforeseen consequences
(receive overflow triggers Other, vmware's emulated e1000e) and programming
mistakes (4110e02eb45e). Since the e1000e driver is supposed to be in
maintenance mode, this patch series revisits the above changes to prune
them down.

After this series, the remaining differences related to how interrupts were
handled at commit 4d432f67ff00 ("e1000e: Remove unreachable code",
v4.5-rc1) are:
* the changes in commit 0a8047ac68e5 ("e1000e: Fix msi-x interrupt
  automask", v4.5-rc1) are preserved.
* we manually clear Other from icr in e1000_msix_other().

We try to go back to a long lost time when things were simple and drivers
ran smoothly.

--------
Benjamin Poirier (3):
  Partial revert "e1000e: Avoid receiver overrun interrupt bursts"
  Revert "e1000e: Separate signaling for link check/link up"
  Revert "e1000e: Do not read ICR in Other interrupt"

 drivers/net/ethernet/intel/e1000e/defines.h |  1 -
 drivers/net/ethernet/intel/e1000e/ich8lan.c | 11 ++--
 drivers/net/ethernet/intel/e1000e/mac.c | 11 ++--
 drivers/net/ethernet/intel/e1000e/netdev.c  | 44 ++---
 4 files changed, 27 insertions(+), 40 deletions(-)

-- 
2.15.1

[PATCH 2/3] Revert "e1000e: Separate signaling for link check/link up"

2018-01-26 Thread Benjamin Poirier

This reverts commit 19110cfbb34d4af0cdfe14cd243f3b09dc95b013.
This reverts commit 4110e02eb45ea447ec6f5459c9934de0a273fb91.

... because they cause an extra 2s delay for the link to come up when
autoneg is off.

After reverting, the race condition described in the log of commit
19110cfbb34d ("e1000e: Separate signaling for link check/link up") is
reintroduced. It may still be triggered by LSC events but this should not
result in link flap. It may no longer be triggered by RXO events because
commit 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
restored reading icr in the Other handler.

As discussed, the driver should be in "maintenance mode". In the interest
of stability, revert to the original code as much as possible instead of a
half-baked solution.

Link: https://www.spinics.net/lists/netdev/msg479923.html
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/ich8lan.c | 11 +++
 drivers/net/ethernet/intel/e1000e/mac.c | 11 +++
 drivers/net/ethernet/intel/e1000e/netdev.c  |  2 +-
 3 files changed, 7 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index 31277d3bb7dc..d6d4ed7acf03 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1367,9 +1367,6 @@ static s32 e1000_disable_ulp_lpt_lp(struct e1000_hw *hw, bool force)
  *  Checks to see of the link status of the hardware has changed.  If a
  *  change in link status has been detected, then we read the PHY registers
  *  to get the current speed/duplex if link exists.
- *
- *  Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
- *  up).
  **/
 static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
 {
@@ -1385,7 +1382,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
 	 * Change or Rx Sequence Error interrupt.
 	 */
 	if (!mac->get_link_status)
-		return 1;
+		return 0;
 
 	/* First we want to see if the MII Status Register reports
 	 * link.  If so, then we want to get the current speed/duplex
@@ -1616,12 +1613,10 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
 	 * different link partner.
 	 */
 	ret_val = e1000e_config_fc_after_link_up(hw);
-	if (ret_val) {
+	if (ret_val)
 		e_dbg("Error configuring flow control\n");
-		return ret_val;
-	}
 
-	return 1;
+	return ret_val;
 }
 
 static s32 e1000_get_variants_ich8lan(struct e1000_adapter *adapter)
diff --git a/drivers/net/ethernet/intel/e1000e/mac.c b/drivers/net/ethernet/intel/e1000e/mac.c
index f457c5703d0c..b322011ec282 100644
--- a/drivers/net/ethernet/intel/e1000e/mac.c
+++ b/drivers/net/ethernet/intel/e1000e/mac.c
@@ -410,9 +410,6 @@ void e1000e_clear_hw_cntrs_base(struct e1000_hw *hw)
  *  Checks to see of the link status of the hardware has changed.  If a
  *  change in link status has been detected, then we read the PHY registers
  *  to get the current speed/duplex if link exists.
- *
- *  Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
- *  up).
  **/
 s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 {
@@ -426,7 +423,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 	 * Change or Rx Sequence Error interrupt.
 	 */
 	if (!mac->get_link_status)
-		return 1;
+		return 0;
 
 	/* First we want to see if the MII Status Register reports
 	 * link.  If so, then we want to get the current speed/duplex
@@ -464,12 +461,10 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 	 * different link partner.
 	 */
 	ret_val = e1000e_config_fc_after_link_up(hw);
-	if (ret_val) {
+	if (ret_val)
 		e_dbg("Error configuring flow control\n");
-		return ret_val;
-	}
 
-	return 1;
+	return ret_val;
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 398e940436f8..ed103b9a8d3a 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -5091,7 +5091,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)
 	case e1000_media_type_copper:
 		if (hw->mac.get_link_status) {
 			ret_val = hw->mac.ops.check_for_link(hw);
-			link_active = ret_val > 0;
+			link_active = !hw->mac.get_link_status;
 		} else {
 			link_active = true;
 		}
-- 
2.15.1

[PATCH 1/3] Partial revert "e1000e: Avoid receiver overrun interrupt bursts"

2018-01-26 Thread Benjamin Poirier

This reverts commit 4aea7a5c5e940c1723add439f4088844cd26196d.

We keep the fix for the first part of the problem (1) described in the log
of that commit; however, we use the implementation of e1000_msix_other() from
before commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
We remove the fix for the second part of the problem (2).

Bursts of "Other" interrupts may once again occur during rxo (receive
overflow) traffic conditions. This is deemed acceptable in the interest of
reverting driver code back to its previous state. The e1000e driver should
be in "maintenance" mode and we want to avoid unforeseen fallout from
changes that are not strictly necessary.

Link: https://www.spinics.net/lists/netdev/msg480675.html
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/defines.h |  1 -
 drivers/net/ethernet/intel/e1000e/netdev.c  | 32 +++--
 2 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
index afb7ebe20b24..0641c0098738 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -398,7 +398,6 @@
 #define E1000_ICR_LSC           0x00000004 /* Link Status Change */
 #define E1000_ICR_RXSEQ         0x00000008 /* Rx sequence error */
 #define E1000_ICR_RXDMT0        0x00000010 /* Rx desc min. threshold (0) */
-#define E1000_ICR_RXO           0x00000040 /* Receiver Overrun */
 #define E1000_ICR_RXT0          0x00000080 /* Rx timer intr (ring 0) */
 #define E1000_ICR_ECCER         0x00400000 /* Uncorrectable ECC Error */
 /* If this bit asserted, the driver should claim the interrupt */
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 9f18d39bdc8f..398e940436f8 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1914,30 +1914,23 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
 	struct net_device *netdev = data;
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct e1000_hw *hw = &adapter->hw;
-	u32 icr;
-	bool enable = true;
-
-	icr = er32(ICR);
-	if (icr & E1000_ICR_RXO) {
-		ew32(ICR, E1000_ICR_RXO);
-		enable = false;
-		/* napi poll will re-enable Other, make sure it runs */
-		if (napi_schedule_prep(&adapter->napi)) {
-			adapter->total_rx_bytes = 0;
-			adapter->total_rx_packets = 0;
-			__napi_schedule(&adapter->napi);
-		}
-	}
-	if (icr & E1000_ICR_LSC) {
-		ew32(ICR, E1000_ICR_LSC);
+	u32 icr = er32(ICR);
+
+	if (icr & adapter->eiac_mask)
+		ew32(ICS, (icr & adapter->eiac_mask));
+
+	if (icr & E1000_ICR_OTHER) {
+		if (!(icr & E1000_ICR_LSC))
+			goto no_link_interrupt;
 		hw->mac.get_link_status = true;
 		/* guard against interrupt when we're going down */
 		if (!test_bit(__E1000_DOWN, &adapter->state))
 			mod_timer(&adapter->watchdog_timer, jiffies + 1);
 	}
 
-	if (enable && !test_bit(__E1000_DOWN, &adapter->state))
-		ew32(IMS, E1000_IMS_OTHER);
+no_link_interrupt:
+	if (!test_bit(__E1000_DOWN, &adapter->state))
+		ew32(IMS, E1000_IMS_LSC | E1000_IMS_OTHER);
 
 	return IRQ_HANDLED;
 }
@@ -2707,8 +2700,7 @@ static int e1000e_poll(struct napi_struct *napi, int weight)
 		napi_complete_done(napi, work_done);
 		if (!test_bit(__E1000_DOWN, &adapter->state)) {
 			if (adapter->msix_entries)
-				ew32(IMS, adapter->rx_ring->ims_val |
-				     E1000_IMS_OTHER);
+				ew32(IMS, adapter->rx_ring->ims_val);
 			else
 				e1000_irq_enable(adapter);
 		}
-- 
2.15.1

[PATCH 3/3] Revert "e1000e: Do not read ICR in Other interrupt"

2018-01-26 Thread Benjamin Poirier

This reverts commit 16ecba59bc333d6282ee057fb02339f77a880beb.

It was reported that emulated e1000e devices in vmware esxi 6.5 Build
7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
overrun interrupt bursts"). Some tracing shows that after
e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
icr=0x80000004 (_INT_ASSERTED | _LSC) in the same situation.

Some experimentation showed that this flaw in vmware e1000e emulation can
be worked around by not setting Other in EIAC. This is how it was before
commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").

Since the ICR read in the Other interrupt handler has already been
restored, this patch effectively reverts the remainder of commit
16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").

Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index ed103b9a8d3a..fffc1f0e3895 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1916,6 +1916,13 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
 	struct e1000_hw *hw = &adapter->hw;
 	u32 icr = er32(ICR);
 
+	/* Certain events (such as RXO) which trigger Other do not set
+	 * INT_ASSERTED. In that case, read to clear of icr does not take
+	 * place.
+	 */
+	if (!(icr & E1000_ICR_INT_ASSERTED))
+		ew32(ICR, E1000_ICR_OTHER);
+
 	if (icr & adapter->eiac_mask)
 		ew32(ICS, (icr & adapter->eiac_mask));
 
@@ -2033,7 +2040,6 @@ static void e1000_configure_msix(struct e1000_adapter *adapter)
 		       hw->hw_addr + E1000_EITR_82574(vector));
 	else
 		writel(1, hw->hw_addr + E1000_EITR_82574(vector));
-	adapter->eiac_mask |= E1000_IMS_OTHER;
 
 	/* Cause Tx interrupts on every write back */
 	ivar |= BIT(31);
@@ -2258,7 +2264,7 @@ static void e1000_irq_enable(struct e1000_adapter *adapter)
 
 	if (adapter->msix_entries) {
 		ew32(EIAC_82574, adapter->eiac_mask & E1000_EIAC_MASK_82574);
-		ew32(IMS, adapter->eiac_mask | E1000_IMS_LSC);
+		ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER | E1000_IMS_LSC);
 	} else if (hw->mac.type >= e1000_pch_lpt) {
 		ew32(IMS, IMS_ENABLE_MASK | E1000_IMS_ECCER);
 	} else {
-- 
2.15.1

Re: [Intel-wired-lan] [RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-24 Thread Benjamin Poirier

On 2018/01/22 10:01, Alexander Duyck wrote:
[...]
> >
> > If the patch that I submitted for the current vmware issue is merged,
> > the significant commits that are left are:
> >
> > 0a8047ac68e5 e1000e: Fix msi-x interrupt automask (v4.5-rc1)
> > Fixes a problem in the irq disabling of the napi implementation.
> 
> This one I fully agree is still needed.
> 
> > 19110cfbb34d e1000e: Separate signaling for link check/link up
> >  (v4.15-rc1)
> > Fixes link flapping caused by a race condition in link
> > detection. It was found because the Other interrupt was being
> > triggered sort of spuriously by rxo.
> 
> This one is somewhat iffy. I am not sure if the patch description
> really matches what it is doing. It doesn't appear to do what it says
> it is trying to do since clearing get_link_status will still trigger a
> link up, it just takes an extra 2 seconds. I think there may be issues
> if you aren't using autoneg, as I don't see how you are getting the
> link to report up other than the fact that mac->get_link_status has
> been cleared but we are reporting a pseudo-error. In addition it is
> only really needed after the RXO problem was introduced which really
> didn't exist until after we stopped checking for LSC. One interesting
> test we may want to look at is to see if there is an additional delay
> in a link coming up for a non-autoneg setup. If we find an additional
> 2 second delay then I would be even more confident that this patch has
> a bug.

It seems like you're right but I didn't look into this part of the problem in
detail yet. I'll get back to it.
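
For reference, the suspected delay can be reproduced in a small user-space
model. This is only a sketch under stated assumptions: struct fake_mac, the
variant enum and all function names are hypothetical, ERR_CONFIG stands in
for the driver's error code, and the model assumes that the copper link check
clears get_link_status and then returns an error when autoneg is off (the
"pseudo-error" mentioned above). One watchdog pass is taken to be 2 seconds,
as in the discussion.

```c
#include <stdbool.h>

#define ERR_CONFIG 3	/* hypothetical stand-in for the driver's error code */

enum variant {
	SEPARATE_SIGNALING,	/* check returns 1 for link up, caller tests > 0 */
	REVERTED,		/* check returns 0, caller tests get_link_status */
};

struct fake_mac {
	bool get_link_status;	/* true: link state must be re-evaluated */
	bool autoneg;
	bool phy_link;		/* what the PHY would report */
};

/* Simplified model of the copper link check. */
static int check_for_link(struct fake_mac *mac, enum variant v)
{
	int up = (v == SEPARATE_SIGNALING) ? 1 : 0;

	if (!mac->get_link_status)
		return up;
	if (!mac->phy_link)
		return 0;

	mac->get_link_status = false;

	/* Assumption: forced speed/duplex bails out with an error here. */
	if (!mac->autoneg)
		return -ERR_CONFIG;

	return up;
}

/* Simplified model of how the watchdog derives link_active. */
static bool has_link(struct fake_mac *mac, enum variant v)
{
	if (mac->get_link_status) {
		int ret = check_for_link(mac, v);

		return v == SEPARATE_SIGNALING ? ret > 0
					       : !mac->get_link_status;
	}
	return true;
}

/* Number of watchdog passes needed before the link is reported up. */
static int watchdog_passes_until_up(enum variant v, bool autoneg)
{
	struct fake_mac mac = {
		.get_link_status = true,
		.autoneg = autoneg,
		.phy_link = true,
	};
	int passes = 0;

	do {
		passes++;
	} while (!has_link(&mac, v) && passes < 10);

	return passes;
}
```

In this model only the SEPARATE_SIGNALING variant with autoneg off needs a
second watchdog pass, which would match an extra 2 second delay.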

> 
> > 4aea7a5c5e94 e1000e: Avoid receiver overrun interrupt bursts (v4.15-rc1)
> > Fixes Other interrupt bursts during sustained rxo conditions.
> 
> So the RXO problem probably didn't exist until we stopped checking for
> the OTHER and LSC bits in the "other" interrupt handler. Yes there
> would be more "other" cause interrupts, but they shouldn't have been
> causing much in the way of issues since the get_link_status value
> never changed. Personally I would lean more toward the option of

I agree. I tested rxo behavior on commit 4d432f67ff00 ("e1000e: Remove
unreachable code", v4.5-rc1) which is before any significant change in that
area. (I force rxo by adding mdelay(10) to e1000_clean_rx_irq and sending a
netperf UDP_STREAM from another host). In case of sustained rxo condition, we
get repeated Other interrupts. Handling these irqs is useless work that could
be avoided when the system is already overloaded but it doesn't lead to
misbehavior like the race condition described in the log of commit
19110cfbb34d ("e1000e: Separate signaling for link check/link up", v4.15-rc1).

However, I noticed something unexpected. It seems like reading ICR doesn't
clear every bit that's set in IAM, most notably not rxo. In a different test,
I was doing a single write of RXO | OTHER to ICS, then two subsequent reads of
icr gave 0x01000041. OTOH, writing a bit to ICR reliably clears it. So if you
want to remove RXO interrupt mitigation, you should at least add a write of
RXO to ICR, to clear it. On my system it reduced Other interrupts from
~17000/s to ~1700/s when using the mdelay testing approach.
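
The observation above can be written down as a toy register model. This is a
sketch, not a description of the hardware: struct fake_nic and the three
accessors are hypothetical, and the rule that a read only clears ICR when
INT_ASSERTED is set is an assumption chosen to match what I saw.

```c
#include <stdint.h>

#define ICR_RXO          0x00000040u
#define ICR_OTHER        0x01000000u
#define ICR_INT_ASSERTED 0x80000000u

/* Hypothetical stand-in for the device: only the cause register is modeled. */
struct fake_nic {
	uint32_t icr;
};

/* Writing ICS raises the given causes. */
static void write_ics(struct fake_nic *nic, uint32_t bits)
{
	nic->icr |= bits;
}

/* Assumption: read-to-clear only happens when INT_ASSERTED is set. */
static uint32_t read_icr(struct fake_nic *nic)
{
	uint32_t val = nic->icr;

	if (val & ICR_INT_ASSERTED)
		nic->icr = 0;
	return val;
}

/* Writing a 1 to a bit in ICR clears that cause. */
static void write_icr(struct fake_nic *nic, uint32_t bits)
{
	nic->icr &= ~bits;
}
```

In this model, two reads after write_ics(RXO | OTHER) return the same value,
and RXO only goes away after an explicit write_icr(RXO).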

> reverting this patch and instead just focus on testing OTHER and LSC
> as we originally were so that we don't risk messing up NAPI by messing
> with ring state from a non-ring interrupt.
> 
> I will try to get to these later this week if you would like.
> Unfortunately I don't have any of these devices in any of my
> development systems so I have to go chase one down. Otherwise you are
> free to take these on and tell me if I have made another flawed
> assumption somewhere, but I am thinking the RXO issue goes away if we
> get the original "other" interrupt routine back to where it was.
> 
> So the last bit in all this ends up being that because of 0a8047ac68e5
> e1000e: Fix msi-x interrupt automask (v4.5-rc1) we don't seem to
> auto-clear interrupt causes anymore on ICR read. I am not certain what
> the impact of this is. I would be interested in finding out if a cause
> left set will trigger an interrupt storm or if it just goes quiet when
> we just leave the value high. If it goes quiet then that in itself
> might solve the RXO interrupt burst problem if we don't clear it.
> Otherwise we need to make certain to clear all of the causes that can
> trigger the "other" interrupt to fire regardless of if we service the
> events or not.

In MSI-X mode, as long as Other is not set in ICR, nothing will happen even if
the bits related to Other (LSC, RXO, MDAC, SRPD, ACK, MNG) are set. However,
that doesn't solve the rxo interrupt burst because an rxo condition in
hardware sets both RXO and Other, so it triggers an interrupt.
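
Put as code, the behavior described above is roughly the following. It is a
sketch only: both helpers are hypothetical, and it assumes that Other is
enabled in IMS so that the Other cause bit alone decides whether the vector
fires.

```c
#include <stdbool.h>
#include <stdint.h>

#define ICR_LSC   0x00000004u
#define ICR_RXO   0x00000040u
#define ICR_OTHER 0x01000000u

/* In MSI-X mode the Other vector only fires if the Other cause is set. */
static bool other_vector_pending(uint32_t icr)
{
	return (icr & ICR_OTHER) != 0;
}

/* An rxo condition in hardware sets both RXO and Other. */
static uint32_t hw_rxo_event(uint32_t icr)
{
	return icr | ICR_RXO | ICR_OTHER;
}
```

So leaving RXO or LSC set without Other stays quiet, but every new rxo
condition raises the vector again.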

Re: [Intel-wired-lan] [RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-21 Thread Benjamin Poirier

On 2018/01/20 09:21, Alexander Duyck wrote:
> On Fri, Jan 19, 2018 at 2:55 PM, Benjamin Poirier
> <benjamin.poir...@gmail.com> wrote:
> > On 2018/01/20 07:45, Benjamin Poirier wrote:
> > [...]
> >> >
> >> > I'm of the mind that we need to cut down on the code thrash.  This
> >> > driver is supposed to have been in a "maintenance" mode for the last
> >> > year or so as there aren't being any new parts added is my
> >> > understanding. As-is I don't see any reason why 16ecba59bc33 ("e1000e:
> >> > Do not read ICR in Other interrupt", v4.5-rc1) was submitted or
> >> > accepted in the first place. I don't see any notes about it fixing any
> >> > bug or addressing any issue and it seems like that is the start of all
> >> > the issues we have been having recently with RXO triggering more
> >> > interrupts, various link issues, and this most recent VMware issue.
> >>
> >> I'm sorry to say but you're the one who suggested that change:
> >>
> >> http://lkml.iu.edu/hypermail/linux/kernel/1510.3/03528.html
> >>
> >> On 2015/10/28 23:08, Alexander Duyck wrote:
> >> > On 10/22/2015 05:32 PM, Benjamin Poirier wrote:
> >> [...]
> >> >
> >> > I would argue your first patch probably didn't go far enough to remove 
> >> > dead
> >> > code.  Specifically you should only ever get into this function if LSC is
> >> > set.  There are no other causes that should trigger this.  As such you 
> >> > could
> >> > probably remove the ICR read, and instead replace it with an ICR write of
> >> > the LSC bit since OTHER is already cleared via EIAC.
> >> >
> >
> > ... The assumption that "There are no other causes that should trigger
> > this." turned out to be wrong and that caused the RXO problems. Clearing
> > OTHER via EIAC is causing the problems with vmware now. I don't think
> > you foresaw those problems back in 2015 and neither did I.
> 
> Well that explains why I felt like I was explaining things to a
> younger version of myself. I was a bit more relaxed in terms of being
> willing to make arbitrary changes a few years ago. I tend to be a bit
> more conservative now, at least as far as having justifications as to
> why I want to do things. With any change you end up taking on risk,
> and so usually a patch has a justification as to why you are making
> the change.
> 
> Unfortunately at the time I didn't have all the information and based
> my suggestion on a bad assumption. I would guess at the time I was
> thinking of doing general code cleanup. Other drivers such as igb work
> this way, but it led us down the path we are on now where we are
> having to make one fix after another. It is leading in the opposite
> direction of maintainability as this is all becoming more complex.
> Suggesting this was a bad decision on my part at the time. I'm only
> human, I make mistakes.. :-)

Thanks for the introspection.

> 
> With further review of the code I am seeing various other issues that
> could still pop up as I am not certain we should even have the "other"
> interrupt messing with the NAPI polling or packet accounting logic at
> all. The question I would have at this point is if we revert
> 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1)
> and all the related fixes for it, what do we end up with?

The patch I submitted for the current vmware issue actually finishes
reverting commit 16ecba59bc33.

I believe the relevant commits to consider are:

16ecba59bc33 e1000e: Do not read ICR in Other interrupt (v4.5-rc1)
a61cfe4ffad7 e1000e: Do not write lsc to ics in msi-x mode (v4.5-rc1)
0a8047ac68e5 e1000e: Fix msi-x interrupt automask (v4.5-rc1)

19110cfbb34d e1000e: Separate signaling for link check/link up
 (v4.15-rc1)
4aea7a5c5e94 e1000e: Avoid receiver overrun interrupt bursts (v4.15-rc1)

4110e02eb45e e1000e: Fix e1000_check_for_copper_link_ich8lan return
 value. (v4.15-rc8)

(submitted)  e1000e: Remove Other from EIAC.

commit 4aea7a5c5e94 essentially reverts a61cfe4ffad7 and part of
16ecba59bc33 (ICR read). The submitted patch reverts the rest of
16ecba59bc33 (EIAC clearing of Other).

> It seems
> like the code is slowly heading back in the direction of where it was
> originally anyway since there have been a number of partial reverts.
> I'm wondering what would happen if we were to just short-cut that and
> audit the patches involved to see what we really need and don't.
> 
> Your patch as proposed is essentially another step in that direction.
> I'm thinking we may want to drop my curre

Re: [Intel-wired-lan] [RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-19 Thread Benjamin Poirier

On 2018/01/20 07:45, Benjamin Poirier wrote:
[...]
> > 
> > I'm of the mind that we need to cut down on the code thrash.  This
> > driver is supposed to have been in a "maintenance" mode for the last
> > year or so as there aren't being any new parts added is my
> > understanding. As-is I don't see any reason why 16ecba59bc33 ("e1000e:
> > Do not read ICR in Other interrupt", v4.5-rc1) was submitted or
> > accepted in the first place. I don't see any notes about it fixing any
> > bug or addressing any issue and it seems like that is the start of all
> > the issues we have been having recently with RXO triggering more
> > interrupts, various link issues, and this most recent VMware issue.
> 
> I'm sorry to say but you're the one who suggested that change:
> 
> http://lkml.iu.edu/hypermail/linux/kernel/1510.3/03528.html
> 
> On 2015/10/28 23:08, Alexander Duyck wrote:
> > On 10/22/2015 05:32 PM, Benjamin Poirier wrote:
> [...]
> > 
> > I would argue your first patch probably didn't go far enough to remove dead
> > code.  Specifically you should only ever get into this function if LSC is
> > set.  There are no other causes that should trigger this.  As such you could
> > probably remove the ICR read, and instead replace it with an ICR write of
> > the LSC bit since OTHER is already cleared via EIAC.
> > 

... The assumption that "There are no other causes that should trigger
this." turned out to be wrong and that caused the RXO problems. Clearing
OTHER via EIAC is causing the problems with vmware now. I don't think
you foresaw those problems back in 2015 and neither did I.

Re: [Intel-wired-lan] [RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-19 Thread Benjamin Poirier

On 2018/01/19 08:22, Alexander Duyck wrote:
> On Fri, Jan 19, 2018 at 5:36 AM, Benjamin Poirier
> <benjamin.poir...@gmail.com> wrote:
> > On 2018/01/19 17:59, Benjamin Poirier wrote:
> >> On 2018/01/18 07:51, Alexander Duyck wrote:
> >> > On Wed, Jan 17, 2018 at 10:50 PM, Benjamin Poirier <bpoir...@suse.com> 
> >> > wrote:
> >> > > It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> >> > > 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid 
> >> > > receiver
> >> > > overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
> >> > > e1000e_trigger_lsc() is called, ICR reads out as 0x0 in 
> >> > > e1000_msix_other()
> >> > > on emulated e1000e devices. In comparison, on real e1000e 82574 
> >> > > hardware,
> >> > > icr=0x80000004 (_INT_ASSERTED | _OTHER) in the same situation.
> >> >
> >> > Isn't 0x80000004 (_INT_ASSERTED | _LSC)? The assumption I based my
> >>
> >> Yes. The numeric value is correct. I made a mistake when writing down
> >> the flag names.
> >>
> >> > patch on was that the VMware code was sending _OTHER instead of _LSC
> >> > to trigger LSC events. As such in my version of the workaround I just
> >>
> >> It's not so deterministic, sadly. In my tests, upon entry into
> >> e1000_msix_other() after e1000e_trigger_lsc(), with no workaround patch
> >> I've seen:
> >> icr=0x0
> >> icr=0x3d
> >>   Reserved RXDMT0 Reserved LSC TXDW
> >> icr=0x46
> >>   RXO LSC TXQE
> >> and someone else reported:
> >> icr=0x3c
> >>   Reserved RXDMT0 Reserved LSC
> >>
> >> > went through and did the testing if the _RXO bit was set, otherwise I
> >> > assumed that whatever event was received must have been meant to
> >> > trigger an _LSC type event since that worked in the past.
> >^...
> >
> > Re-reading that part, my thoughts are that it worked in the past, before
> > 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1),
> > because (presumably) Other was not set in EIAC. It worked after
> > 16ecba59bc33 but before 4aea7a5c5e94 ("e1000e: Avoid receiver overrun
> > interrupt bursts", v4.15-rc1) because e1000_msix_other() didn't read the
> > value of icr. If it had, it would've found a bogus value, which is
> > what's happening after 4aea7a5c5e94. I wonder if we're not just getting
> > some uninitialized value from the emulation code... which makes me kind
> > of uneasy about your approach of trying to make sense of the value.
> > Maybe Shrikrishna can clarify where the icr value is coming from when
> > Other is set in EIAC.
> 
> For now I still say we let my current patch go as-is in order to
> address the immediate issue. We can follow-up and do a more refined
> version of things once we get the final word from VMware on how this
> actually works. If nothing else the current patch appears to resolve
> the currently reported issue and is already submitted.
> 
> I'm of the mind that we need to cut down on the code thrash.  This
> driver is supposed to have been in a "maintenance" mode for the last
> year or so as there aren't being any new parts added is my
> understanding. As-is I don't see any reason why 16ecba59bc33 ("e1000e:
> Do not read ICR in Other interrupt", v4.5-rc1) was submitted or
> accepted in the first place. I don't see any notes about it fixing any
> bug or addressing any issue and it seems like that is the start of all
> the issues we have been having recently with RXO triggering more
> interrupts, various link issues, and this most recent VMware issue.

I'm sorry to say but you're the one who suggested that change:

http://lkml.iu.edu/hypermail/linux/kernel/1510.3/03528.html

On 2015/10/28 23:08, Alexander Duyck wrote:
> On 10/22/2015 05:32 PM, Benjamin Poirier wrote:
[...]
> 
> I would argue your first patch probably didn't go far enough to remove dead
> code.  Specifically you should only ever get into this function if LSC is
> set.  There are no other causes that should trigger this.  As such you could
> probably remove the ICR read, and instead replace it with an ICR write of
> the LSC bit since OTHER is already cleared via EIAC.
>

Re: [Intel-wired-lan] [RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-19 Thread Benjamin Poirier

On 2018/01/19 17:59, Benjamin Poirier wrote:
> On 2018/01/18 07:51, Alexander Duyck wrote:
> > On Wed, Jan 17, 2018 at 10:50 PM, Benjamin Poirier <bpoir...@suse.com> 
> > wrote:
> > > It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> > > 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> > > overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
> > > e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> > > on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> > > > icr=0x80000004 (_INT_ASSERTED | _OTHER) in the same situation.
> > > 
> > > Isn't 0x80000004 (_INT_ASSERTED | _LSC)? The assumption I based my
> 
> Yes. The numeric value is correct. I made a mistake when writing down
> the flag names.
> 
> > patch on was that the VMware code was sending _OTHER instead of _LSC
> > to trigger LSC events. As such in my version of the workaround I just
> 
> It's not so deterministic, sadly. In my tests, upon entry into
> e1000_msix_other() after e1000e_trigger_lsc(), with no workaround patch
> I've seen:
> icr=0x0
> icr=0x3d
>   Reserved RXDMT0 Reserved LSC TXDW
> icr=0x46
>   RXO LSC TXQE
> and someone else reported:
> icr=0x3c
>   Reserved RXDMT0 Reserved LSC
> 
> > went through and did the testing if the _RXO bit was set, otherwise I
> > assumed that whatever event was received must have been meant to
> > trigger an _LSC type event since that worked in the past.
   ^...

Re-reading that part, my thoughts are that it worked in the past, before
16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1),
because (presumably) Other was not set in EIAC. It worked after
16ecba59bc33 but before 4aea7a5c5e94 ("e1000e: Avoid receiver overrun
interrupt bursts", v4.15-rc1) because e1000_msix_other() didn't read the
value of icr. If it had, it would've found a bogus value, which is
what's happening after 4aea7a5c5e94. I wonder if we're not just getting
some uninitialized value from the emulation code... which makes me kind
of uneasy about your approach of trying to make sense of the value.
Maybe Shrikrishna can clarify where the icr value is coming from when
Other is set in EIAC.

> > 
> > > Some experimentation showed that this flaw in vmware e1000e emulation can
> > > be worked around by not setting Other in EIAC. This is how it was before
> > > 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).
> > 
> > Did this actually set the _LSC bit or was it just giving you the
> > > _OTHER bit? The ICR for that combination would come out 0x81000000.
> > 
> > With my patch, after e1000e_trigger_lsc(), it results in icr=0x81000004
> on real and emulated hardware.
> 
> IMO, the resulting icr read is cleaner than with your patch but it
> depends on an undocumented quirk of the emulated vmware e1000e, so I
> don't know which of the two workarounds is more desirable.
> 
> If you'd like to stick with your patch though, I think that you should
> definitely rewrite it as:
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 9f18d39bdc8f..68c0bcb8287f 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -1928,7 +1928,12 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
>  			__napi_schedule(&adapter->napi);
>  		}
>  	}
> -	if (icr & E1000_ICR_LSC) {
> +	if (icr & E1000_ICR_LSC || !(icr & E1000_ICR_RXO)) {
> +		/* We assume if the RXO bit is not set that this is a
> +		 * link status change event. This is needed due to emulated
> +		 * versions of the device that may not correctly populate
> +		 * the LSC bit.
> +		 */
>  		ew32(ICR, E1000_ICR_LSC);
>  		hw->mac.get_link_status = true;
>  		/* guard against interrupt when we're going down */
> 
> > 
> > > Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
> > > Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
> > > ---
> > >
> > > Jeff, I'm sending as RFC since it looks like a problem that should be 
> > > fixed
> > > in vmware. If you'd like to have the workaround in e1000e, I'll submit.
> > 
> > I would appreciate it if you could review/test the patch I submitted
> > for the same issue. Specifically I would want to make certain that it
> > > still addresses the original RXO interrupt burst issue you reported.
> 
> I've tested both my patch and yours; they both allow link up on vmware;
> link up on real 82574 and rxo mitigation on real 82574. I couldn't
> conveniently test rxo on vmware.
> 
> > 
> > Thanks.
> > 
> > - Alex
> >

Re: [Intel-wired-lan] [RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-19 Thread Benjamin Poirier

On 2018/01/18 07:51, Alexander Duyck wrote:
> On Wed, Jan 17, 2018 at 10:50 PM, Benjamin Poirier <bpoir...@suse.com> wrote:
> > It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> > 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> > overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
> > e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> > on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> > icr=0x80000004 (_INT_ASSERTED | _OTHER) in the same situation.
> 
> Isn't 0x80000004 (_INT_ASSERTED | _LSC)? The assumption I based my

Yes. The numeric value is correct. I made a mistake when writing down
the flag names.

> patch on was that the VMware code was sending _OTHER instead of _LSC
> to trigger LSC events. As such in my version of the workaround I just

It's not so deterministic, sadly. In my tests, upon entry into
e1000_msix_other() after e1000e_trigger_lsc(), with no workaround patch
I've seen:
icr=0x0
icr=0x3d
Reserved RXDMT0 Reserved LSC TXDW
icr=0x46
RXO LSC TXQE
and someone else reported:
icr=0x3c
Reserved RXDMT0 Reserved LSC
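
For readers who want to check the flag names against the numeric values, here
is a minimal user-space sketch of the decoding. It is not driver code. The bit
values are the E1000_ICR_* values from e1000e's defines.h; struct icr_name,
icr_bit_name() and icr_decode() are hypothetical helpers made up for this
illustration. Bits that are not in the table are printed as "bitN", so bits 3
and 5, which are listed as Reserved above, show up as bit3 and bit5.

```c
#include <stdint.h>
#include <stdio.h>

/* ICR bit values as defined in e1000e's defines.h. */
#define ICR_TXDW         0x00000001u /* Transmit desc written back */
#define ICR_TXQE         0x00000002u /* Transmit queue empty */
#define ICR_LSC          0x00000004u /* Link Status Change */
#define ICR_RXDMT0       0x00000010u /* Rx desc min. threshold (0) */
#define ICR_RXO          0x00000040u /* Receiver Overrun */
#define ICR_OTHER        0x01000000u /* Other interrupt cause */
#define ICR_INT_ASSERTED 0x80000000u /* Driver should claim the interrupt */

struct icr_name {
	uint32_t bit;
	const char *name;
};

static const struct icr_name icr_names[] = {
	{ ICR_INT_ASSERTED, "INT_ASSERTED" },
	{ ICR_OTHER, "OTHER" },
	{ ICR_RXO, "RXO" },
	{ ICR_RXDMT0, "RXDMT0" },
	{ ICR_LSC, "LSC" },
	{ ICR_TXQE, "TXQE" },
	{ ICR_TXDW, "TXDW" },
};

/* Hypothetical helper: name of a single bit, or NULL if not in the table. */
static const char *icr_bit_name(uint32_t bit)
{
	size_t i;

	for (i = 0; i < sizeof(icr_names) / sizeof(icr_names[0]); i++)
		if (icr_names[i].bit == bit)
			return icr_names[i].name;
	return NULL;
}

/* Hypothetical helper: writes the names of the bits set in icr into buf,
 * highest bit first, separated by spaces. Returns buf.
 */
static char *icr_decode(uint32_t icr, char *buf, size_t len)
{
	size_t used = 0;
	int n;

	if (len == 0)
		return buf;
	buf[0] = '\0';
	for (n = 31; n >= 0; n--) {
		uint32_t bit = UINT32_C(1) << n;
		const char *name = icr_bit_name(bit);
		const char *sep = used ? " " : "";
		int w;

		if (!(icr & bit))
			continue;
		if (name)
			w = snprintf(buf + used, len - used, "%s%s", sep, name);
		else
			w = snprintf(buf + used, len - used, "%sbit%d", sep, n);
		if (w < 0 || (size_t)w >= len - used)
			break;	/* output truncated, stop here */
		used += (size_t)w;
	}
	return buf;
}
```

Calling icr_decode() with a 128-byte buffer on the values seen in this thread
gives the same flag sets as listed above, modulo the bitN naming.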

> went through and did the testing if the _RXO bit was set, otherwise I
> assumed that whatever event was received must have been meant to
> trigger an _LSC type event since that worked in the past.
> 
> > Some experimentation showed that this flaw in vmware e1000e emulation can
> > be worked around by not setting Other in EIAC. This is how it was before
> > 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).
> 
> Did this actually set the _LSC bit or was it just giving you the
> _OTHER bit? The ICR for that combination would come out 0x81000000.

With my patch, after e1000e_trigger_lsc(), it results in icr=0x81000004
on real and emulated hardware.

IMO, the resulting icr read is cleaner than with your patch but it
depends on an undocumented quirk of the emulated vmware e1000e, so I
don't know which of the two workarounds is more desirable.

If you'd like to stick with your patch though, I think that you should
definitely rewrite it as:

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 9f18d39bdc8f..68c0bcb8287f 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1928,7 +1928,12 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
 			__napi_schedule(&adapter->napi);
 		}
 	}
-	if (icr & E1000_ICR_LSC) {
+	if (icr & E1000_ICR_LSC || !(icr & E1000_ICR_RXO)) {
+		/* We assume if the RXO bit is not set that this is a
+		 * link status change event. This is needed due to emulated
+		 * versions of the device that may not correctly populate
+		 * the LSC bit.
+		 */
 		ew32(ICR, E1000_ICR_LSC);
 		hw->mac.get_link_status = true;
 		/* guard against interrupt when we're going down */

> 
> > Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
> > Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
> > ---
> >
> > Jeff, I'm sending as RFC since it looks like a problem that should be fixed
> > in vmware. If you'd like to have the workaround in e1000e, I'll submit.
> 
> I would appreciate it if you could review/test the patch I submitted
> for the same issue. Specifically I would want to make certain that it
> still addresses the original RXO interrupt burst issue you reported.

I've tested both my patch and yours; they both allow link up on vmware;
link up on real 82574 and rxo mitigation on real 82574. I couldn't
conveniently test rxo on vmware.

> 
> Thanks.
> 
> - Alex
>

Re: [RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-18 Thread Benjamin Poirier

On 2018/01/18 18:42, Shrikrishna Khare wrote:
> 
> 
> On Thu, 18 Jan 2018, Benjamin Poirier wrote:
> 
> > On 2018/01/18 15:50, Benjamin Poirier wrote:
> > > It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> > > 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> > > overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
> > > e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> > > on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> > > icr=0x80000004 (_INT_ASSERTED | _OTHER) in the same situation.
> > > 
> > > Some experimentation showed that this flaw in vmware e1000e emulation can
> > > be worked around by not setting Other in EIAC. This is how it was before
> > > 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).
> > 
> > vmware folks, please comment.
> 
> Thank you for bringing this to our attention.
> 
> Using the reported build (ESX 6.5, 7526125) and 4.15.0-rc8+ kernel (which 
> has the said patch), I could bring up e1000e interface (version: 3.2.6-k),
> get dhcp address and even do large file downloads without difficulty.
> 
> Could you give us more pointers on how we may be able to reproduce this 
> locally? Was there anything different with the configuration when the 
> issue was observed? Is the issue consistently reproducible?

It's consistently reproducible, however I noticed that once in a while
there is a genuine "Other" interrupt that comes in and triggers the link
status change. The problem is with interrupts that are triggered via a
write to ICS (such as in e1000e_trigger_lsc()). Can you reproduce a
problem if you do:
ip link set ethX down
ip link set ethX up

If you're building your own kernel, you can add the following patch and
cat /sys/kernel/debug/tracing/trace_pipe

For me it shows on v4.15-rc8:
   <...>-2578  [000]  83527.938321: e1000e_trigger_lsc: trigger_lsc
   <...>-2578  [000] d.h. 83527.938398: e1000_msix_other: icr 0x0

With the patch that I submitted, it shows:
 wickedd-1329  [002] .N..    20.123545: e1000e_trigger_lsc: trigger_lsc
  <idle>-0     [000] d.h.    20.123630: e1000_msix_other: icr 0x81000004
  <idle>-0     [000] d.h.    20.123654: e1000_msix_other: lsc
  <idle>-0     [000] d.h.    20.123676: e1000_msix_other: mod_timer


diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 9f18d39bdc8f..16620ce840fc 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1918,22 +1918,29 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
 	bool enable = true;
 
 	icr = er32(ICR);
+	trace_printk("icr 0x%x\n", icr);
+
 	if (icr & E1000_ICR_RXO) {
+		trace_printk("rxo\n");
 		ew32(ICR, E1000_ICR_RXO);
 		enable = false;
 		/* napi poll will re-enable Other, make sure it runs */
 		if (napi_schedule_prep(&adapter->napi)) {
+			trace_printk("napi schedule\n");
 			adapter->total_rx_bytes = 0;
 			adapter->total_rx_packets = 0;
 			__napi_schedule(&adapter->napi);
 		}
 	}
 	if (icr & E1000_ICR_LSC) {
+		trace_printk("lsc\n");
 		ew32(ICR, E1000_ICR_LSC);
 		hw->mac.get_link_status = true;
 		/* guard against interrupt when we're going down */
-		if (!test_bit(__E1000_DOWN, &adapter->state))
+		if (!test_bit(__E1000_DOWN, &adapter->state)) {
+			trace_printk("mod_timer\n");
 			mod_timer(&adapter->watchdog_timer, jiffies + 1);
+		}
 	}
 
 	if (enable && !test_bit(__E1000_DOWN, &adapter->state))
@@ -4221,6 +4228,8 @@ static void e1000e_trigger_lsc(struct e1000_adapter *adapter)
 {
 	struct e1000_hw *hw = &adapter->hw;
 
+	trace_printk("trigger_lsc\n");
+
 	if (adapter->msix_entries)
 		ew32(ICS, E1000_ICS_LSC | E1000_ICS_OTHER);
 	else

Re: [RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-17 Thread Benjamin Poirier

On 2018/01/18 15:50, Benjamin Poirier wrote:
> It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
> e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> icr=0x80000004 (_INT_ASSERTED | _OTHER) in the same situation.
 (_INT_ASSERTED | _LSC)

> 
> Some experimentation showed that this flaw in vmware e1000e emulation can
> be worked around by not setting Other in EIAC. This is how it was before
> 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).
> 
> Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
> Signed-off-by: Benjamin Poirier <bpoir...@suse.com>

Re: [RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-17 Thread Benjamin Poirier

On 2018/01/18 15:50, Benjamin Poirier wrote:
> It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
> e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> icr=0x80000004 (_INT_ASSERTED | _OTHER) in the same situation.
> 
> Some experimentation showed that this flaw in vmware e1000e emulation can
> be worked around by not setting Other in EIAC. This is how it was before
> 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).

vmware folks, please comment.

[RFC PATCH] e1000e: Remove Other from EIAC.

2018-01-17 Thread Benjamin Poirier

It was reported that emulated e1000e devices in vmware esxi 6.5 Build
7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
icr=0x80000004 (_INT_ASSERTED | _OTHER) in the same situation.

Some experimentation showed that this flaw in vmware e1000e emulation can
be worked around by not setting Other in EIAC. This is how it was before
16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1).

Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---

Jeff, I'm sending as RFC since it looks like a problem that should be fixed
in vmware. If you'd like to have the workaround in e1000e, I'll submit.

---
 drivers/net/ethernet/intel/e1000e/netdev.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 9f18d39bdc8f..625a4c9a86a4 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1918,6 +1918,8 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
 	bool enable = true;
 
 	icr = er32(ICR);
+	ew32(ICR, E1000_ICR_OTHER);
+
 	if (icr & E1000_ICR_RXO) {
 		ew32(ICR, E1000_ICR_RXO);
 		enable = false;
@@ -2040,7 +2042,6 @@ static void e1000_configure_msix(struct e1000_adapter *adapter)
 		       hw->hw_addr + E1000_EITR_82574(vector));
 	else
 		writel(1, hw->hw_addr + E1000_EITR_82574(vector));
-	adapter->eiac_mask |= E1000_IMS_OTHER;
 
 	/* Cause Tx interrupts on every write back */
 	ivar |= BIT(31);
@@ -2265,7 +2266,7 @@ static void e1000_irq_enable(struct e1000_adapter *adapter)
 
 	if (adapter->msix_entries) {
 		ew32(EIAC_82574, adapter->eiac_mask & E1000_EIAC_MASK_82574);
-		ew32(IMS, adapter->eiac_mask | E1000_IMS_LSC);
+		ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER | E1000_IMS_LSC);
 	} else if (hw->mac.type >= e1000_pch_lpt) {
 		ew32(IMS, IMS_ENABLE_MASK | E1000_IMS_ECCER);
 	} else {
-- 
2.15.1

Re: [PATCH] e1000e: Fix e1000_check_for_copper_link_ich8lan return value.

2018-01-09 Thread Benjamin Poirier

On 2018/01/10 01:29, rwar...@gmx.de wrote:
> hallo
> 
> any chance to get this patch into stable and 4.15 ?
> 
> https://marc.info/?l=linux-kernel&m=151297726823919&w=2
> 

It was part of the last network pull request and should be included in
the next mainline release as
4110e02eb45e e1000e: Fix e1000_check_for_copper_link_ich8lan return value.

It's needed in stable branches that include commit 19110cfbb34d
("e1000e: Separate signaling for link check/link up"):
linux-4.14.y
linux-4.9.y
linux-4.4.y
linux-4.1.y
linux-3.18.y

Re: [PATCH for-4.14.y 0/5] e1000e: Upstream fixes for linux-4.14.y

2017-12-31 Thread Benjamin Poirier

On 2017/12/19 15:27, Jeff Kirsher wrote:
> On Tue, 2017-12-19 at 10:10 +0900, Benjamin Poirier wrote:
> > On 2017/12/18 12:59, Greg KH wrote:
> > > On Mon, Dec 11, 2017 at 10:58:10PM +0100, Christian Hesse wrote:
> > > > Greg KH <gre...@linuxfoundation.org> on Tue, 2017/12/05 08:35:
> > > > > On Tue, Dec 05, 2017 at 08:23:27AM +0100, Christian Hesse wrote:
> > > > > > Greg KH <gre...@linuxfoundation.org> on Mon, 2017/12/04 19:37:  
> > > > > > > On Mon, Dec 04, 2017 at 04:47:00PM +0100, Christian Hesse
> > > > > > > wrote:  
> > > > > > > > Amit Pundir <amit.pun...@linaro.org> on Mon, 2017/11/27
> > > > > > > > 18:23:
> > > > > > > > > Hi Greg,
> > > > > > > > > 
> > > > > > > > > > Found a few e1000e upstream fixes from Benjamin Poirier in
> > > > > > > > > lede
> > > > > > > > > source tree, https://git.lede-project.org/?p=source.git,
> > > > > > > > > and
> > > > > > > > > these fixes seem reasonable enough for 4.14.y too.
> > > > > > > > > 
> > > > > > > > > Also submitting an e1000e buffer overrun fix by Sasha
> > > > > > > > > Neftin.
> > > > > > > > > 
> > > > > > > > > Cherry-picked and build tested for linux v4.14.2 for
> > > > > > > > > ARCH=arm/arm64.
> > > > > > > > > 
> > > > > > > > > Regards,
> > > > > > > > > Amit Pundir
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Benjamin Poirier (4):
> > > > > > > > >   e1000e: Fix error path in link detection
> > > > > > > > >   e1000e: Fix return value test
> > > > > > > > >   e1000e: Separate signaling for link check/link up
> > > > > > > > >   e1000e: Avoid receiver overrun interrupt bursts
> > > > > > > > > 
> > > > > > > > > Sasha Neftin (1):
> > > > > > > > >   e1000e: fix buffer overrun while the I219 is processing
> > > > > > > > > DMA
> > > > > > > > > transactions
> > > > > > > > 
> > > > > > > > Hello everybody,
> > > > > > > > 
> > > > > > > > looks like one of these breaks connectivity on my Thinkpad
> > > > > > > > X250.
> > > > > > > > Just downgraded to linux 4.14.2 to verify.
> > > > > > > 
> > > > > > > Can you try the -rc release I just did?  It has a fix for this
> > > > > > > series in
> > > > > > > it.  
> > > > > > 
> > > > > > It connects with the notebook's built in ethernet port (did not
> > > > > > check with
> > > > > > 4.14.3) but still fails to see a link when placed in docking
> > > > > > station.  
> > > > > 
> > > > > Do you have the same issues with 4.15-rc2?
> > > > 
> > > > Just a short heads-up and final result for this thread...
> > > > The issue is fixed with Benjamin's patch:
> > > > 
> > > > https://patchwork.kernel.org/patch/10104349/
> > > 
> > > Any word on getting this patch into Linus's tree anytime soon?
> > 
> > Apparently the intel maintainer was unavailable recently:
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg206192.html
> > 
> > The patch is now in there:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/log/?h=dev-queue
> > but that only makes it slated for 4.16 AFAIK.
> > 
> > Jeff, given that this issue breaks networking for many people, wouldn't
> > it be more appropriate to submit the patch for 4.15?
> 
> Yeah, I will send to Dave's net tree (4.15) for inclusion.

Jeff, I see that the patch is still in your "next-queue". Meanwhile
another 4.15 rc has been released and I've gotten more user reports that
e1000e is broken.


[PATCH] e1000e: Fix e1000_check_for_copper_link_ich8lan return value.

2017-12-10 Thread Benjamin Poirier

e1000e_check_for_copper_link() and e1000_check_for_copper_link_ich8lan()
are the two functions that may be assigned to mac.ops.check_for_link when
phy.media_type == e1000_media_type_copper. Commit 19110cfbb34d ("e1000e:
Separate signaling for link check/link up") changed the meaning of the
return value of check_for_link for copper media but only adjusted the first
function. This patch adjusts the second function likewise.
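
The return value convention that both functions are expected to follow after
commit 19110cfbb34d can be summarized in a small user-space model. This is a
sketch, not the driver code: check_for_link_model() and has_link_model() are
hypothetical names, and the PHY read and flow control results are passed in
as plain parameters instead of being read from hardware.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int32_t s32;

#define E1000_ERR_PHY 2 /* same value as in the driver headers */

/* Model of the contract: negative error code, 0 (link down) or 1 (link up).
 * get_link_status: flag set by the interrupt handler
 * phy_err:         result of reading link state from the PHY (0 or negative)
 * phy_link:        whether the PHY reports link
 * fc_err:          result of configuring flow control (0 or negative)
 */
static s32 check_for_link_model(bool get_link_status, s32 phy_err,
				bool phy_link, s32 fc_err)
{
	if (!get_link_status)
		return 1;	/* nothing changed since link was seen up */
	if (phy_err)
		return phy_err;
	if (!phy_link)
		return 0;	/* no link detected */
	if (fc_err)
		return fc_err;
	return 1;
}

/* Model of how the caller interprets the result for copper media. */
static bool has_link_model(s32 ret_val)
{
	return ret_val > 0;
}
```

Before this patch the ich8lan variant still returned 0 on both success paths,
which a caller written like has_link_model() reads as link down.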

Reported-by: Christian Hesse <l...@eworm.de>
Reported-by: Gabriel C <nix.or@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=198047
Fixes: 19110cfbb34d ("e1000e: Separate signaling for link check/link up")
Tested-by: Christian Hesse <l...@eworm.de>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/ich8lan.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index d6d4ed7acf03..31277d3bb7dc 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1367,6 +1367,9 @@ static s32 e1000_disable_ulp_lpt_lp(struct e1000_hw *hw, bool force)
  *  Checks to see of the link status of the hardware has changed.  If a
  *  change in link status has been detected, then we read the PHY registers
  *  to get the current speed/duplex if link exists.
+ *
+ *  Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
+ *  up).
  **/
 static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
 {
@@ -1382,7 +1385,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
 	 * Change or Rx Sequence Error interrupt.
 	 */
 	if (!mac->get_link_status)
-		return 0;
+		return 1;
 
 	/* First we want to see if the MII Status Register reports
 	 * link.  If so, then we want to get the current speed/duplex
@@ -1613,10 +1616,12 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
 	 * different link partner.
 	 */
 	ret_val = e1000e_config_fc_after_link_up(hw);
-	if (ret_val)
+	if (ret_val) {
 		e_dbg("Error configuring flow control\n");
+		return ret_val;
+	}
 
-	return ret_val;
+	return 1;
 }
 
 static s32 e1000_get_variants_ich8lan(struct e1000_adapter *adapter)
-- 
2.15.1

Re: [5/5] e1000e: Avoid receiver overrun interrupt bursts

2017-09-19 Thread Benjamin Poirier

On 2017/09/19 12:38, Philip Prindeville wrote:
> Hi.
> 
> We’ve been running this patchset (all 5) for about as long as they’ve been 
> under review… about 2 months.  And in a burn-in lab with heavy traffic.
> 
> We’ve not seen a single link-flap in hundreds of hours of saturated traffic.
> 
> Would love to see some resolution soon on this as we don’t want to ship a 
> release with unsanctioned patches.
> 
> Is there an estimate on when that might be?

The patches have been added to Jeff Kirsher's next-queue tree. I guess
they will be submitted for v4.15 which might be released in early
2018...
http://phb-crystal-ball.org/

[PATCH] packet: Don't write vnet header beyond end of buffer

2017-08-28 Thread Benjamin Poirier

... which may happen with certain values of tp_reserve and maclen.
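
As a rough illustration of the condition being guarded against, here is a
simplified user-space model of the TPACKET_V1/V2 clamping step. It is a sketch
under stated assumptions, not the code from af_packet.c: struct frame_plan and
plan_frame() are hypothetical, macoff is taken as an input rather than being
derived from tp_hdrlen, tp_reserve and maclen, and the kernel's
"(int)snaplen < 0" test is written as an explicit comparison.

```c
#include <stdbool.h>

struct frame_plan {
	unsigned int snaplen;	/* packet bytes that will be copied */
	bool do_vnet;		/* whether the vnet header may be written */
};

/* frame_size:   size of one ring frame in bytes
 * macoff:       offset in the frame where packet data starts; the vnet
 *               header, when present, is written just before this offset
 * pkt_len:      length of the received packet
 * has_vnet_hdr: whether the socket requested a vnet header
 */
static struct frame_plan plan_frame(unsigned int frame_size,
				    unsigned int macoff,
				    unsigned int pkt_len, bool has_vnet_hdr)
{
	struct frame_plan p = { pkt_len, has_vnet_hdr };

	if (macoff + p.snaplen > frame_size) {
		if (macoff > frame_size) {
			/* Data would start past the end of the frame, so a
			 * header placed just before macoff could also land
			 * outside the frame: copy nothing and skip it.
			 */
			p.snaplen = 0;
			p.do_vnet = false;
		} else {
			p.snaplen = frame_size - macoff;
		}
	}
	return p;
}
```

For example, with a 128-byte frame and macoff=200 the model yields snaplen=0
and do_vnet=false, whereas a packet that is merely too long for the frame is
truncated and keeps its vnet header.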

Fixes: 58d19b19cd99 ("packet: vnet_hdr support for tpacket_rcv")
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
Cc: Willem de Bruijn <will...@google.com>
---
 net/packet/af_packet.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 008a45ca3112..1c61af9af67d 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2191,6 +2191,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	struct timespec ts;
 	__u32 ts_status;
 	bool is_drop_n_account = false;
+	bool do_vnet = false;
 
 	/* struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT.
 	 * We may add members to them until current aligned size without forcing
@@ -2241,8 +2242,10 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		netoff = TPACKET_ALIGN(po->tp_hdrlen +
 				       (maclen < 16 ? 16 : maclen)) +
 				       po->tp_reserve;
-		if (po->has_vnet_hdr)
+		if (po->has_vnet_hdr) {
 			netoff += sizeof(struct virtio_net_hdr);
+			do_vnet = true;
+		}
 		macoff = netoff - maclen;
 	}
 	if (po->tp_version <= TPACKET_V2) {
@@ -2259,8 +2262,10 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 					skb_set_owner_r(copy_skb, sk);
 			}
 			snaplen = po->rx_ring.frame_size - macoff;
-			if ((int)snaplen < 0)
+			if ((int)snaplen < 0) {
 				snaplen = 0;
+				do_vnet = false;
+			}
 		}
 	} else if (unlikely(macoff + snaplen >
 			    GET_PBDQC_FROM_RB(&po->rx_ring)->max_frame_len)) {
@@ -2273,6 +2278,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		if (unlikely((int)snaplen < 0)) {
 			snaplen = 0;
 			macoff = GET_PBDQC_FROM_RB(&po->rx_ring)->max_frame_len;
+			do_vnet = false;
 		}
 	}
 	spin_lock(&sk->sk_receive_queue.lock);
@@ -2298,7 +2304,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	}
 	spin_unlock(&sk->sk_receive_queue.lock);
 
-	if (po->has_vnet_hdr) {
+	if (do_vnet) {
 		if (virtio_net_hdr_from_skb(skb, h.raw + macoff -
 					    sizeof(struct virtio_net_hdr),
 					    vio_le(), true)) {
-- 
2.14.1

Re: [PATCH 5/5] e1000e: Avoid receiver overrun interrupt bursts

2017-08-21 Thread Benjamin Poirier

On 2017/07/21 11:36, Benjamin Poirier wrote:
> When e1000e_poll() is not fast enough to keep up with incoming traffic, the
> adapter (when operating in msix mode) raises the Other interrupt to signal
> Receiver Overrun.
> 
> This is a double problem because 1) at the moment e1000_msix_other()
> assumes that it is only called in case of Link Status Change and 2) if the
> condition persists, the interrupt is repeatedly raised again in quick
> succession.
> 
> Ideally we would configure the Other interrupt to not be raised in case of
> receiver overrun but this doesn't seem possible on this adapter. Instead,
> we handle the first part of the problem by reverting to the practice of
> reading ICR in the other interrupt handler, like before commit 16ecba59bc33
> ("e1000e: Do not read ICR in Other interrupt"). Thanks to commit
> 0a8047ac68e5 ("e1000e: Fix msi-x interrupt automask") which cleared IAME
> from CTRL_EXT, reading ICR doesn't interfere with RxQ0, TxQ0 interrupts
> anymore. We handle the second part of the problem by not re-enabling the
> Other interrupt right away when there is overrun. Instead, we wait until
> traffic subsides, napi polling mode is exited and interrupts are
> re-enabled.
> 
> Reported-by: Lennart Sorensen <lsore...@csclub.uwaterloo.ca>
> Fixes: 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt")
> Signed-off-by: Benjamin Poirier <bpoir...@suse.com>

What's the status on these patches please? One month later they still
show up as "new" in patchwork.

Re: [Intel-wired-lan] [PATCH 4/5] e1000e: Separate signaling for link check/link up

2017-08-02 Thread Benjamin Poirier

On 2017/08/02 10:34, Lennart Sorensen wrote:
> On Wed, Aug 02, 2017 at 02:28:07PM +0300, Neftin, Sasha wrote:
> > On 7/21/2017 21:36, Benjamin Poirier wrote:
> > > Lennart reported the following race condition:
> > > 
> > > \ e1000_watchdog_task
> > >  \ e1000e_has_link
> > >  \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
> > >  /* link is up */
> > >  mac->get_link_status = false;
> > > 
> > >  /* interrupt */
> > >  \ e1000_msix_other
> > >  hw->mac.get_link_status = true;
> > > 
> > >  link_active = !hw->mac.get_link_status
> > >  /* link_active is false, wrongly */
> > > 
> > > This problem arises because the single flag get_link_status is used to
> > > signal two different states: link status needs checking and link status is
> > > down.
> > > 
> > > Avoid the problem by using the return value of .check_for_link to signal
> > > the link status to e1000e_has_link().
> > > 
> > > Reported-by: Lennart Sorensen <lsore...@csclub.uwaterloo.ca>
> > > Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
> > > ---
> > >   drivers/net/ethernet/intel/e1000e/mac.c| 11 ---
> > >   drivers/net/ethernet/intel/e1000e/netdev.c |  2 +-
> > >   2 files changed, 9 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/drivers/net/ethernet/intel/e1000e/mac.c 
> > > b/drivers/net/ethernet/intel/e1000e/mac.c
> > > index b322011ec282..f457c5703d0c 100644
> > > --- a/drivers/net/ethernet/intel/e1000e/mac.c
> > > +++ b/drivers/net/ethernet/intel/e1000e/mac.c
> > > @@ -410,6 +410,9 @@ void e1000e_clear_hw_cntrs_base(struct e1000_hw *hw)
> > >*  Checks to see of the link status of the hardware has changed.  If a
> > >*  change in link status has been detected, then we read the PHY 
> > > registers
> > >*  to get the current speed/duplex if link exists.
> > > + *
> > > + *  Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 
> > > (link
> > > + *  up).
> > >**/
> > >   s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> > >   {
> > > @@ -423,7 +426,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> > >* Change or Rx Sequence Error interrupt.
> > >*/
> > >   if (!mac->get_link_status)
> > > - return 0;
> > > + return 1;
> > >   /* First we want to see if the MII Status Register reports
> > >* link.  If so, then we want to get the current speed/duplex
> > > @@ -461,10 +464,12 @@ s32 e1000e_check_for_copper_link(struct e1000_hw 
> > > *hw)
> > >* different link partner.
> > >*/
> > >   ret_val = e1000e_config_fc_after_link_up(hw);
> > > - if (ret_val)
> > > + if (ret_val) {
> > >   e_dbg("Error configuring flow control\n");
> > > + return ret_val;
> > > + }
> > > - return ret_val;
> > > + return 1;
> > >   }
> > >   /**
> > > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
> > > b/drivers/net/ethernet/intel/e1000e/netdev.c
> > > index fc6a1db2..5a8ab1136566 100644
> > > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > > @@ -5081,7 +5081,7 @@ static bool e1000e_has_link(struct e1000_adapter 
> > > *adapter)
> > >   case e1000_media_type_copper:
> > >   if (hw->mac.get_link_status) {
> > >   ret_val = hw->mac.ops.check_for_link(hw);
> > > - link_active = !hw->mac.get_link_status;
> > > + link_active = ret_val > 0;
> > >   } else {
> > >   link_active = true;
> > >   }
> > 
> > Hello Benjamin,
> > 
> > Will this patch fix any serious problem with link indication? Is it
> > necessary? Can we consider your patch series without 4/5 part?
> 
> Without this patch, you have the race condition that can make the
> watchdog_task mistakenly think the link is down when it isn't, and then
> it resets the adapter, which does make the link go down.
> 
> So it is rather catastrophic for the interface.
> 
> The other patch to the interrupt handling should make it never get hit,
> but the issue does still exist if not fixed and I wouldn't rule out that
> it could possibly still happen even with the other fix in place.

Exactly. I wouldn't have explained it better, thanks.
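
The interleaving from the patch description can be replayed deterministically
in a small user-space model. This is a sketch only: struct mac_model and the
function names are made up for illustration, the link is assumed to be
physically up, and the interrupt is simulated by a plain function call at the
point where the race window opens.

```c
#include <stdbool.h>

struct mac_model {
	bool get_link_status;	/* true: link status needs checking */
};

/* Models check_for_link() for a link that is up: clears the flag and
 * returns 1 (link up).
 */
static int check_for_link_up(struct mac_model *mac)
{
	mac->get_link_status = false;
	return 1;
}

/* Models the Other interrupt handler requesting a new link check. */
static void msix_other_irq(struct mac_model *mac)
{
	mac->get_link_status = true;
}

/* Before patch 4/5: link state is derived from the flag after the call,
 * so an interrupt arriving in between is mistaken for "link down".
 */
static bool watchdog_has_link_old(struct mac_model *mac, bool irq_in_window)
{
	check_for_link_up(mac);
	if (irq_in_window)
		msix_other_irq(mac);
	return !mac->get_link_status;
}

/* After patch 4/5: link state comes from the return value; the flag only
 * means that another check is due.
 */
static bool watchdog_has_link_new(struct mac_model *mac, bool irq_in_window)
{
	int ret_val = check_for_link_up(mac);

	if (irq_in_window)
		msix_other_irq(mac);
	return ret_val > 0;
}
```

With irq_in_window set, the old variant reports no link although the link is
up, while the new variant reports link and leaves get_link_status set so that
the next watchdog run checks again.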

[PATCH 1/5] e1000e: Fix error path in link detection

2017-07-21 Thread Benjamin Poirier

In case of error from e1e_rphy(), the loop will exit early and "success"
will be set to true erroneously.
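
The effect can be reproduced outside the driver with a stubbed PHY read. This
is a minimal sketch, not the driver function: struct phy_script,
model_read_status() and MODEL_ERR_PHY are hypothetical stand-ins, and the
double read and the delays of the real loop are left out.

```c
#include <stdbool.h>

#define MODEL_ERR_PHY 2	/* stands in for E1000_ERR_PHY */

struct phy_script {
	int err_at;	/* iteration at which the read fails, -1 for never */
	int up_at;	/* first iteration that reports link, -1 for never */
};

/* Stub for the PHY status read: fails or reports link as scripted. */
static int model_read_status(const struct phy_script *s, int i, bool *link)
{
	if (i == s->err_at)
		return -MODEL_ERR_PHY;
	*link = s->up_at >= 0 && i >= s->up_at;
	return 0;
}

/* Old logic: success is inferred from the loop counter, so leaving the
 * loop early on a read error also counts as success.
 */
static int phy_has_link_old(const struct phy_script *s, int iterations,
			    bool *success)
{
	int ret_val = 0;
	bool link = false;
	int i;

	for (i = 0; i < iterations; i++) {
		ret_val = model_read_status(s, i, &link);
		if (ret_val)
			break;
		if (link)
			break;
	}
	*success = (i < iterations);
	return ret_val;
}

/* New logic: success is set only where link is actually observed. */
static int phy_has_link_new(const struct phy_script *s, int iterations,
			    bool *success)
{
	int ret_val = 0;
	bool link = false;
	int i;

	*success = false;
	for (i = 0; i < iterations; i++) {
		ret_val = model_read_status(s, i, &link);
		if (ret_val)
			break;
		if (link) {
			*success = true;
			break;
		}
	}
	return ret_val;
}
```

With a script that fails on the first read, the old variant returns the error
with *success set to true, and the new variant returns the same error with
*success set to false.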

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/phy.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/phy.c b/drivers/net/ethernet/intel/e1000e/phy.c
index d78d47b41a71..86ff0969efb6 100644
--- a/drivers/net/ethernet/intel/e1000e/phy.c
+++ b/drivers/net/ethernet/intel/e1000e/phy.c
@@ -1744,6 +1744,7 @@ s32 e1000e_phy_has_link_generic(struct e1000_hw *hw, u32 iterations,
 	s32 ret_val = 0;
 	u16 i, phy_status;
 
+	*success = false;
 	for (i = 0; i < iterations; i++) {
 		/* Some PHYs require the MII_BMSR register to be read
 		 * twice due to the link bit being sticky.  No harm doing
@@ -1763,16 +1764,16 @@ s32 e1000e_phy_has_link_generic(struct e1000_hw *hw, u32 iterations,
 		ret_val = e1e_rphy(hw, MII_BMSR, &phy_status);
 		if (ret_val)
 			break;
-		if (phy_status & BMSR_LSTATUS)
+		if (phy_status & BMSR_LSTATUS) {
+			*success = true;
 			break;
+		}
 		if (usec_interval >= 1000)
 			msleep(usec_interval / 1000);
 		else
 			udelay(usec_interval);
 	}
 
-	*success = (i < iterations);
-
 	return ret_val;
 }
 
-- 
2.13.2

[PATCH 3/5] e1000e: Fix return value test

2017-07-21 Thread Benjamin Poirier

All the helpers return -E1000_ERR_PHY.
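
Since E1000_ERR_PHY itself is a positive constant, a comparison against the
unnegated value cannot match what the helpers return. A two-line sketch of
the difference, with phy_error_old() and phy_error_new() as hypothetical
names for the two forms of the test:

```c
#include <stdbool.h>
#include <stdint.h>

#define E1000_ERR_PHY 2 /* same value as in the driver headers */

/* Test as written before this patch: never true for a helper's error. */
static bool phy_error_old(int32_t ret_val)
{
	return ret_val == E1000_ERR_PHY;
}

/* Test after this patch: matches the negative code the helpers return. */
static bool phy_error_new(int32_t ret_val)
{
	return ret_val == -E1000_ERR_PHY;
}
```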

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 58a87134d2e5..fc6a1db2 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -5099,7 +5099,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)
 		break;
 	}
 
-	if ((ret_val == E1000_ERR_PHY) && (hw->phy.type == e1000_phy_igp_3) &&
+	if ((ret_val == -E1000_ERR_PHY) && (hw->phy.type == e1000_phy_igp_3) &&
 	    (er32(CTRL) & E1000_PHY_CTRL_GBE_DISABLE)) {
 		/* See e1000_kmrn_lock_loss_workaround_ich8lan() */
 		e_info("Gigabit has been disabled, downgrading speed\n");
-- 
2.13.2

[PATCH 5/5] e1000e: Avoid receiver overrun interrupt bursts

2017-07-21 Thread Benjamin Poirier

When e1000e_poll() is not fast enough to keep up with incoming traffic, the
adapter (when operating in msix mode) raises the Other interrupt to signal
Receiver Overrun.

This is a double problem because 1) at the moment e1000_msix_other()
assumes that it is only called in case of Link Status Change and 2) if the
condition persists, the interrupt is repeatedly raised again in quick
succession.

Ideally we would configure the Other interrupt to not be raised in case of
receiver overrun but this doesn't seem possible on this adapter. Instead,
we handle the first part of the problem by reverting to the practice of
reading ICR in the other interrupt handler, like before commit 16ecba59bc33
("e1000e: Do not read ICR in Other interrupt"). Thanks to commit
0a8047ac68e5 ("e1000e: Fix msi-x interrupt automask") which cleared IAME
from CTRL_EXT, reading ICR doesn't interfere with RxQ0, TxQ0 interrupts
anymore. We handle the second part of the problem by not re-enabling the
Other interrupt right away when there is overrun. Instead, we wait until
traffic subsides, napi polling mode is exited and interrupts are
re-enabled.
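
The decision logic described above can be modelled without hardware access.
This is a user-space sketch under simplifying assumptions, not the handler
itself: struct other_irq_result and handle_other() are hypothetical, register
writes are left out, and napi scheduling is assumed to always succeed.

```c
#include <stdbool.h>
#include <stdint.h>

#define ICR_LSC 0x00000004u /* Link Status Change */
#define ICR_RXO 0x00000040u /* Receiver Overrun */

struct other_irq_result {
	bool schedule_napi;	/* make sure napi poll runs */
	bool check_link;	/* set get_link_status, kick the watchdog */
	bool reenable_other;	/* write E1000_IMS_OTHER right away */
};

/* Model of the Other interrupt handler after this patch.
 * icr:        value read from ICR
 * going_down: models the __E1000_DOWN state bit
 */
static struct other_irq_result handle_other(uint32_t icr, bool going_down)
{
	struct other_irq_result r = { false, false, false };
	bool enable = true;

	if (icr & ICR_RXO) {
		/* Overrun: leave Other masked; napi poll re-enables it
		 * once traffic subsides.
		 */
		enable = false;
		r.schedule_napi = true;
	}
	if (icr & ICR_LSC)
		r.check_link = true;

	r.reenable_other = enable && !going_down;
	return r;
}
```

For a plain link status change the interrupt is re-enabled immediately, as
before. When RXO is set, re-enabling is deferred to the poll routine, which
is what stops the interrupt from being raised again in quick succession.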

Reported-by: Lennart Sorensen <lsore...@csclub.uwaterloo.ca>
Fixes: 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt")
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/defines.h |  1 +
 drivers/net/ethernet/intel/e1000e/netdev.c  | 33 ++++++++++++++++++++++++++-------
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
index 0641c0098738..afb7ebe20b24 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -398,6 +398,7 @@
 #define E1000_ICR_LSC           0x00000004 /* Link Status Change */
 #define E1000_ICR_RXSEQ         0x00000008 /* Rx sequence error */
 #define E1000_ICR_RXDMT0        0x00000010 /* Rx desc min. threshold (0) */
+#define E1000_ICR_RXO           0x00000040 /* Receiver Overrun */
 #define E1000_ICR_RXT0          0x00000080 /* Rx timer intr (ring 0) */
 #define E1000_ICR_ECCER         0x00400000 /* Uncorrectable ECC Error */
 /* If this bit asserted, the driver should claim the interrupt */
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 5a8ab1136566..803edd1a6401 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1910,12 +1910,30 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
 	struct net_device *netdev = data;
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct e1000_hw *hw = &adapter->hw;
+	u32 icr;
+	bool enable = true;
+
+	icr = er32(ICR);
+	if (icr & E1000_ICR_RXO) {
+		ew32(ICR, E1000_ICR_RXO);
+		enable = false;
+		/* napi poll will re-enable Other, make sure it runs */
+		if (napi_schedule_prep(&adapter->napi)) {
+			adapter->total_rx_bytes = 0;
+			adapter->total_rx_packets = 0;
+			__napi_schedule(&adapter->napi);
+		}
+	}
+	if (icr & E1000_ICR_LSC) {
+		ew32(ICR, E1000_ICR_LSC);
+		hw->mac.get_link_status = true;
+		/* guard against interrupt when we're going down */
+		if (!test_bit(__E1000_DOWN, &adapter->state)) {
+			mod_timer(&adapter->watchdog_timer, jiffies + 1);
+		}
+	}
 
-	hw->mac.get_link_status = true;
-
-	/* guard against interrupt when we're going down */
-	if (!test_bit(__E1000_DOWN, &adapter->state)) {
-		mod_timer(&adapter->watchdog_timer, jiffies + 1);
+	if (enable && !test_bit(__E1000_DOWN, &adapter->state)) {
 		ew32(IMS, E1000_IMS_OTHER);
 	}
 
@@ -2687,7 +2705,8 @@ static int e1000e_poll(struct napi_struct *napi, int weight)
 		napi_complete_done(napi, work_done);
 		if (!test_bit(__E1000_DOWN, &adapter->state)) {
 			if (adapter->msix_entries)
-				ew32(IMS, adapter->rx_ring->ims_val);
+				ew32(IMS, adapter->rx_ring->ims_val |
+				     E1000_IMS_OTHER);
 			else
 				e1000_irq_enable(adapter);
 		}
@@ -4204,7 +4223,7 @@ static void e1000e_trigger_lsc(struct e1000_adapter *adapter)
 	struct e1000_hw *hw = &adapter->hw;
 
 	if (adapter->msix_entries)
-		ew32(ICS, E1000_ICS_OTHER);
+		ew32(ICS, E1000_ICS_LSC | E1000_ICS_OTHER);
 	else
 		ew32(ICS, E1000_ICS_LSC);
 }
-- 
2.13.2
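
For readers following the thread, the decision logic of the patched e1000_msix_other() can be modelled in user space. This is only a sketch under stated assumptions: the model_* names and the actions struct are hypothetical, the NAPI scheduling and watchdog timer are reduced to booleans, and the napi_schedule_prep() check is left out. The bit values are the ones from the driver's defines.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined in the e1000e driver's defines.h. */
#define MODEL_ICR_LSC   0x00000004u /* Link Status Change */
#define MODEL_ICR_RXO   0x00000040u /* Receiver Overrun */
#define MODEL_ICR_OTHER 0x01000000u /* Other Causes (bit 24) */

/* Hypothetical summary of what the handler decides for one interrupt. */
struct model_other_actions {
	bool schedule_napi;  /* RXO seen: let the NAPI poll drain the ring */
	bool check_link;     /* LSC seen: set get_link_status for the watchdog */
	bool reenable_other; /* write E1000_IMS_OTHER before returning */
};

/* Model of the patched handler; not the driver code itself. */
static struct model_other_actions model_msix_other(uint32_t icr, bool going_down)
{
	struct model_other_actions a = { false, false, true };

	if (icr & MODEL_ICR_RXO) {
		a.schedule_napi = true;
		/* the poll routine re-enables Other once it completes */
		a.reenable_other = false;
	}
	if (icr & MODEL_ICR_LSC)
		a.check_link = true;
	if (going_down)
		a.reenable_other = false;
	return a;
}
```

With this model, an interrupt that carries only the Other bit no longer touches the link state, which is the behaviour the patch is after.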

[PATCH 2/5] e1000e: Fix wrong comment related to link detection

2017-07-21 Thread Benjamin Poirier

Reading e1000e_check_for_copper_link() shows that get_link_status is set to
false after link has been detected. Therefore, it stays TRUE until then.

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 2dcb5463d9b8..58a87134d2e5 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -5074,7 +5074,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)
 
/* get_link_status is set on LSC (link status) interrupt or
 * Rx sequence error interrupt.  get_link_status will stay
-* false until the check_for_link establishes link
+* true until the check_for_link establishes link
 * for copper adapters ONLY
 */
switch (hw->phy.media_type) {
@@ -5092,7 +5092,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)
break;
case e1000_media_type_internal_serdes:
ret_val = hw->mac.ops.check_for_link(hw);
-   link_active = adapter->hw.mac.serdes_has_link;
+   link_active = hw->mac.serdes_has_link;
break;
default:
case e1000_media_type_unknown:
-- 
2.13.2

[PATCH 4/5] e1000e: Separate signaling for link check/link up

2017-07-21 Thread Benjamin Poirier

Lennart reported the following race condition:

\ e1000_watchdog_task
    \ e1000e_has_link
        \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
            /* link is up */
            mac->get_link_status = false;

                            /* interrupt */
                            \ e1000_msix_other
                                hw->mac.get_link_status = true;

        link_active = !hw->mac.get_link_status
        /* link_active is false, wrongly */

This problem arises because the single flag get_link_status is used to
signal two different states: link status needs checking and link status is
down.

Avoid the problem by using the return value of .check_for_link to signal
the link status to e1000e_has_link().

Reported-by: Lennart Sorensen <lsore...@csclub.uwaterloo.ca>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/mac.c    | 11 ++++++++---
 drivers/net/ethernet/intel/e1000e/netdev.c |  2 +-
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/mac.c b/drivers/net/ethernet/intel/e1000e/mac.c
index b322011ec282..f457c5703d0c 100644
--- a/drivers/net/ethernet/intel/e1000e/mac.c
+++ b/drivers/net/ethernet/intel/e1000e/mac.c
@@ -410,6 +410,9 @@ void e1000e_clear_hw_cntrs_base(struct e1000_hw *hw)
  *  Checks to see of the link status of the hardware has changed.  If a
  *  change in link status has been detected, then we read the PHY registers
  *  to get the current speed/duplex if link exists.
+ *
+ *  Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
+ *  up).
  **/
 s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 {
@@ -423,7 +426,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 * Change or Rx Sequence Error interrupt.
 */
if (!mac->get_link_status)
-   return 0;
+   return 1;
 
/* First we want to see if the MII Status Register reports
 * link.  If so, then we want to get the current speed/duplex
@@ -461,10 +464,12 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
 * different link partner.
 */
ret_val = e1000e_config_fc_after_link_up(hw);
-   if (ret_val)
+   if (ret_val) {
e_dbg("Error configuring flow control\n");
+   return ret_val;
+   }
 
-   return ret_val;
+   return 1;
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index fc6a1db2..5a8ab1136566 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -5081,7 +5081,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)
case e1000_media_type_copper:
if (hw->mac.get_link_status) {
ret_val = hw->mac.ops.check_for_link(hw);
-   link_active = !hw->mac.get_link_status;
+   link_active = ret_val > 0;
} else {
link_active = true;
}
-- 
2.13.2
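
The race from the commit message above can be replayed in a small user-space model. This is a sketch under assumptions, not driver code: the model_* names are hypothetical, error returns are omitted, and the interrupt is simulated by a flag (irq_races) that sets get_link_status again right after the check clears it.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the get_link_status flag in hw->mac. */
struct model_mac {
	bool get_link_status; /* true: link state needs (re)checking */
};

/*
 * Models e1000e_check_for_copper_link() after the patch:
 * returns 0 for link down and 1 for link up (error paths omitted).
 * irq_races simulates e1000_msix_other() firing right after the
 * flag has been cleared.
 */
static int model_check_for_link(struct model_mac *mac, bool phy_link_up,
				bool irq_races)
{
	if (!mac->get_link_status)
		return 1;
	if (!phy_link_up)
		return 0;
	mac->get_link_status = false;
	if (irq_races)
		mac->get_link_status = true;
	return 1;
}

/* Before the patch: the answer is read back from the shared flag. */
static bool model_has_link_old(struct model_mac *mac, bool phy_link_up,
			       bool irq_races)
{
	if (!mac->get_link_status)
		return true;
	model_check_for_link(mac, phy_link_up, irq_races);
	return !mac->get_link_status;
}

/* After the patch: the answer comes from the return value. */
static bool model_has_link_new(struct model_mac *mac, bool phy_link_up,
			       bool irq_races)
{
	if (!mac->get_link_status)
		return true;
	return model_check_for_link(mac, phy_link_up, irq_races) > 0;
}
```

With the PHY reporting link up and the simulated interrupt racing in, the old variant reports link down while the new one reports link up. The flag still ends up set in both cases, so the next watchdog run rechecks the link.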

Re: commit 16ecba59 breaks 82574L under heavy load.

2017-07-20 Thread Benjamin Poirier

On 2017/07/20 10:00, Lennart Sorensen wrote:
> On Wed, Jul 19, 2017 at 05:07:47PM -0700, Benjamin Poirier wrote:
> > Are you sure about this? In my testing, while triggering the overrun
> > with the msleep, I read ICR when entering e1000_msix_other() and RXO is
> > consistently set.
> 
> I had thousands of calls to e1000_msix_other where the only bit set
> was OTHER.
> 
> I don't know if the cause is overruns, it just seems plausible.
> 
> > I'm working on a patch that uses that fact to handle the situation and
> > limit the interrupt.
> 
> Excellent.
> 

Could you please test the following patch and let me know if it:
1) reduces the interrupt rate of the Other msi-x vector
2) avoids the link flaps
or
3) logs some dmesg warnings of the form "Other interrupt with unhandled [...]"
In this case, please paste icr values printed.

Thanks

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
index 0641c0098738..afb7ebe20b24 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -398,6 +398,7 @@
 #define E1000_ICR_LSC   0x00000004 /* Link Status Change */
 #define E1000_ICR_RXSEQ 0x00000008 /* Rx sequence error */
 #define E1000_ICR_RXDMT0 0x00000010 /* Rx desc min. threshold (0) */
+#define E1000_ICR_RXO   0x00000040 /* Receiver Overrun */
 #define E1000_ICR_RXT0  0x00000080 /* Rx timer intr (ring 0) */
 #define E1000_ICR_ECCER 0x00400000 /* Uncorrectable ECC Error */
 /* If this bit asserted, the driver should claim the interrupt */
diff --git a/drivers/net/ethernet/intel/e1000e/e1000.h b/drivers/net/ethernet/intel/e1000e/e1000.h
index c7c994eb410e..f7b46eba3efb 100644
--- a/drivers/net/ethernet/intel/e1000e/e1000.h
+++ b/drivers/net/ethernet/intel/e1000e/e1000.h
@@ -351,6 +351,10 @@ struct e1000_adapter {
s32 ptp_delta;
 
u16 eee_advert;
+
+   unsigned int uh_count;
+   u32 uh_values[16];
+   unsigned int uh_values_nb;
 };
 
 struct e1000_info {
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index b3679728caac..46697338c0e1 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -46,6 +46,8 @@
 
 #include "e1000.h"
 
+DEFINE_RATELIMIT_STATE(other_uh_ratelimit_state, HZ, 1);
+
 #define DRV_EXTRAVERSION "-k"
 
 #define DRV_VERSION "3.2.6" DRV_EXTRAVERSION
@@ -1904,12 +1906,60 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
+   u32 icr;
+   bool enable = true;
+   bool handled = false;
+   unsigned int i;
 
-   hw->mac.get_link_status = true;
+   icr = er32(ICR);
+   if (icr & E1000_ICR_RXO) {
+   ew32(ICR, E1000_ICR_RXO);
+   enable = false;
+   /* napi poll will re-enable Other, make sure it runs */
+   if (napi_schedule_prep(&adapter->napi)) {
+   adapter->total_rx_bytes = 0;
+   adapter->total_rx_packets = 0;
+   __napi_schedule(&adapter->napi);
+   }
+   handled = true;
+   }
+   if (icr & E1000_ICR_LSC) {
+   ew32(ICR, E1000_ICR_LSC);
+   hw->mac.get_link_status = true;
+   /* guard against interrupt when we're going down */
+   if (!test_bit(__E1000_DOWN, &adapter->state)) {
+   mod_timer(&adapter->watchdog_timer, jiffies + 1);
+   }
+   handled = true;
+   }
 
-   /* guard against interrupt when we're going down */
-   if (!test_bit(__E1000_DOWN, &adapter->state)) {
-   mod_timer(&adapter->watchdog_timer, jiffies + 1);
+   if (!handled) {
+   adapter->uh_count++;
+   /* only print unseen icr values */
+   if (adapter->uh_values_nb < ARRAY_SIZE(adapter->uh_values)) {
+   for (i = 0; i < ARRAY_SIZE(adapter->uh_values); i++) {
+   if (adapter->uh_values[i] == icr) {
+   break;
+   }
+   }
+   if (i == ARRAY_SIZE(adapter->uh_values)) {
+   adapter->uh_values[adapter->uh_values_nb] =
+   icr;
+   adapter->uh_values_nb++;
+   netdev_warn(netdev,
+   "Other interrupt with unhandled icr 0x%08x\n",
+   icr);
+   }
+   }
+
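
The "only print unseen icr values" bookkeeping in the (truncated) debugging patch above can be sketched as a stand-alone helper. The model_* names are hypothetical. Note one deliberate difference from the patch: this sketch scans only the slots that have been filled so far, so a value of 0 is not mistaken for an already recorded entry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_UH_SLOTS 16 /* same table size as uh_values[] in the patch */

/* Hypothetical user-space counterpart of the uh_* fields. */
struct model_uh_log {
	uint32_t values[MODEL_UH_SLOTS]; /* distinct icr values seen so far */
	size_t nb;                       /* number of used slots */
	unsigned int count;              /* all unhandled interrupts */
};

/*
 * Account for one unhandled interrupt. Returns true when the caller
 * should print icr, i.e. on the first sighting of a value while the
 * table still has room.
 */
static bool model_uh_record(struct model_uh_log *log, uint32_t icr)
{
	size_t i;

	log->count++;
	for (i = 0; i < log->nb; i++) {
		if (log->values[i] == icr)
			return false;
	}
	if (log->nb == MODEL_UH_SLOTS)
		return false;
	log->values[log->nb++] = icr;
	return true;
}
```

A zero-initialized struct model_uh_log is a valid empty table.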

Re: commit 16ecba59 breaks 82574L under heavy load.

2017-07-19 Thread Benjamin Poirier

On 2017/07/19 10:19, Lennart Sorensen wrote:
> On Tue, Jul 18, 2017 at 04:14:35PM -0700, Benjamin Poirier wrote:
> > Thanks for the detailed analysis.
> > 
> > Referring to the original discussion around this patch series, it seemed like
> > the IMS bit for a condition had to be set for the Other interrupt to be raised
> > for that condition.
> > 
> > https://lkml.org/lkml/2015/11/4/683
> > 
> > In this case however, E1000_ICR_RXT0 is not set in IMS so Other shouldn't be
> > raised for Receiver Overrun. Apparently something is going on...
> > 
> > I can reproduce the spurious Other interrupts with a simple mdelay()
> > With the debugging patch at the end of the mail I see stuff like this
> > while blasting with udp frames:
> >   <idle>-0     [086] d.h1 15338.742675: e1000_msix_other: got Other interrupt, count 15127
> >    <...>-54504 [086] d.h. 15338.742724: e1000_msix_other: got Other interrupt, count 1
> >    <...>-54504 [086] d.h. 15338.742774: e1000_msix_other: got Other interrupt, count 1
> >    <...>-54504 [086] d.h. 15338.742824: e1000_msix_other: got Other interrupt, count 1
> >   <idle>-0     [086] d.h1 15340.745123: e1000_msix_other: got Other interrupt, count 27584
> >    <...>-54504 [086] d.h. 15340.745172: e1000_msix_other: got Other interrupt, count 1
> >    <...>-54504 [086] d.h. 15340.745222: e1000_msix_other: got Other interrupt, count 1
> >    <...>-54504 [086] d.h. 15340.745272: e1000_msix_other: got Other interrupt, count 1
> > 
> > > hence sets the flag that (unfortunately) means both link is down and link
> > > state should be checked.  Since this now happens 3000 times per second,
> > > the chances of it happening while the watchdog_task is checking the link
> > > state becomes pretty high, and if it does happen to coincide, then the
> > > watchdog_task will reset the adapter, which causes a real loss of link.
> > 
> > Through which path does watchdog_task reset the adapter? I didn't
> > reproduce that.
> 
> The other interrupt happens and sets get_link_status to true.  At some
> point the watchdog_task runs on some core and calls e1000e_has_link,
> which then calls check_for_link to find out the current link status.
> While e1000e_check_for_copper_link is checking the link state and
> after updating get_link_status to false to indicate link is up, another
> interrupt occurs and another core handles it and changes get_link_status
> to true again.  So by the time e1000e_has_link goes to determine the
> return value, get_link_status has changed back again so now it returns
> link down, and as a result the watchdog_task calls reset, because we
> have packets in the transmit queue (we were busy forwarding over 10
> packets per second when it happened).

Ah I see. Thanks again.

In your previous mail,
On 2017/07/18 10:21, Lennart Sorensen wrote:
[...]
> I tried checking what the bits in the ICR actually were under these
> conditions, and it would appear that the only bit set is 24 (the Other
> Causes interrupt bit).  So I don't know what the real cause is although

Are you sure about this? In my testing, while triggering the overrun
with the mdelay(), I read ICR when entering e1000_msix_other() and RXO is
consistently set.

I'm working on a patch that uses that fact to handle the situation and
limit the interrupt.

Re: commit 16ecba59 breaks 82574L under heavy load.

2017-07-18 Thread Benjamin Poirier

On 2017/07/18 10:21, Lennart Sorensen wrote:
> Commit 16ecba59bc333d6282ee057fb02339f77a880beb has apparently broken
> at least the 82574L under heavy load (as in load heavy enough to cause
> packet drops).  In this case, when running in MSI-X mode, the Other
> Causes interrupt fires about 3000 times per second, but not due to link
> state changes.  Unfortunately this commit changed the driver to assume
> that the Other Causes interrupt can only mean link state change and

Thanks for the detailed analysis.

Referring to the original discussion around this patch series, it seemed like
the IMS bit for a condition had to be set for the Other interrupt to be raised
for that condition.

https://lkml.org/lkml/2015/11/4/683

In this case however, E1000_ICR_RXT0 is not set in IMS so Other shouldn't be
raised for Receiver Overrun. Apparently something is going on...

I can reproduce the spurious Other interrupts with a simple mdelay().
With the debugging patch at the end of this mail, I see output like this
while blasting the interface with UDP frames:
  <idle>-0     [086] d.h1 15338.742675: e1000_msix_other: got Other interrupt, count 15127
   <...>-54504 [086] d.h. 15338.742724: e1000_msix_other: got Other interrupt, count 1
   <...>-54504 [086] d.h. 15338.742774: e1000_msix_other: got Other interrupt, count 1
   <...>-54504 [086] d.h. 15338.742824: e1000_msix_other: got Other interrupt, count 1
  <idle>-0     [086] d.h1 15340.745123: e1000_msix_other: got Other interrupt, count 27584
   <...>-54504 [086] d.h. 15340.745172: e1000_msix_other: got Other interrupt, count 1
   <...>-54504 [086] d.h. 15340.745222: e1000_msix_other: got Other interrupt, count 1
   <...>-54504 [086] d.h. 15340.745272: e1000_msix_other: got Other interrupt, count 1

> hence sets the flag that (unfortunately) means both link is down and link
> state should be checked.  Since this now happens 3000 times per second,
> the chances of it happening while the watchdog_task is checking the link
> state becomes pretty high, and if it does happen to coincide, then the
> watchdog_task will reset the adapter, which causes a real loss of link.

Through which path does watchdog_task reset the adapter? I didn't
reproduce that.

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index b3679728caac..689ad76d0d12 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -46,6 +46,8 @@
 
 #include "e1000.h"
 
+DEFINE_RATELIMIT_STATE(e1000e_ratelimit_state, 2 * HZ, 4);
+
 #define DRV_EXTRAVERSION "-k"
 
 #define DRV_VERSION "3.2.6" DRV_EXTRAVERSION
@@ -937,6 +939,8 @@ static bool e1000_clean_rx_irq(struct e1000_ring *rx_ring, int *work_done,
bool cleaned = false;
unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 
+   mdelay(10);
+
i = rx_ring->next_to_clean;
rx_desc = E1000_RX_DESC_EXT(*rx_ring, i);
staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
@@ -1067,6 +1071,13 @@ static bool e1000_clean_rx_irq(struct e1000_ring *rx_ring, int *work_done,
 
adapter->total_rx_bytes += total_rx_bytes;
adapter->total_rx_packets += total_rx_packets;
+
+   if (__ratelimit(&e1000e_ratelimit_state)) {
+   static unsigned int max;
+   max = max(max, total_rx_packets);
+   trace_printk("received %u max %u\n", total_rx_packets, max);
+   }
+
return cleaned;
 }
 
@@ -1904,9 +1915,16 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
+   static unsigned int count;
 
hw->mac.get_link_status = true;
 
+   count++;
+   if (__ratelimit(&e1000e_ratelimit_state)) {
+   trace_printk("got Other interrupt, count %u\n", count);
+   count = 0;
+   }
+
/* guard against interrupt when we're going down */
if (!test_bit(__E1000_DOWN, &adapter->state)) {
mod_timer(&adapter->watchdog_timer, jiffies + 1);
@@ -7121,7 +7139,7 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
netdev->netdev_ops = &e1000e_netdev_ops;
e1000e_set_ethtool_ops(netdev);
netdev->watchdog_timeo = 5 * HZ;
-   netif_napi_add(netdev, &adapter->napi, e1000e_poll, 64);
+   netif_napi_add(netdev, &adapter->napi, e1000e_poll, 500);
strlcpy(netdev->name, pci_name(pdev), sizeof(netdev->name));
 
netdev->mem_start = mmio_start;
@@ -7327,6 +7345,8 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
if (err)
goto err_register;
 
+   ratelimit_set_flags(&e1000e_ratelimit_state, RATELIMIT_MSG_ON_RELEASE);
+
/* carrier off reporting is important to ethtool even BEFORE open */
netif_carrier_off(netdev);
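
The debugging patch above relies on DEFINE_RATELIMIT_STATE(e1000e_ratelimit_state, 2 * HZ, 4), i.e. an interval of two seconds and a burst of four, to keep the trace output readable. The interval/burst idea can be sketched in user space as follows. This is a simplified model with hypothetical model_* names and an explicit tick argument instead of jiffies. It is not the kernel's ___ratelimit() implementation and ignores locking, missed-message accounting and the RATELIMIT_MSG_ON_RELEASE flag.

```c
#include <stdbool.h>

/* Hypothetical, simplified counterpart of struct ratelimit_state. */
struct model_ratelimit {
	unsigned long interval; /* window length in ticks */
	unsigned int burst;     /* events allowed per window */
	unsigned long begin;    /* start of the current window */
	unsigned int printed;   /* events let through in this window */
	bool started;           /* false until the first call */
};

/* Returns true when the event at time 'now' may be reported. */
static bool model_ratelimit(struct model_ratelimit *rs, unsigned long now)
{
	if (!rs->started || now - rs->begin >= rs->interval) {
		/* open a new window */
		rs->begin = now;
		rs->printed = 0;
		rs->started = true;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	return false;
}
```

In the patch, the counter in e1000_msix_other() is reset each time the limiter lets a message through, which is why the trace shows one large count followed by a few lines with count 1.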

[PATCH v2] e1000e: Don't return uninitialized stats

2017-05-17 Thread Benjamin Poirier

Some statistics passed to ethtool are garbage because e1000e_get_stats64()
doesn't write them, for example: tx_heartbeat_errors. This leaks kernel
memory to userspace and confuses users.

Do like ixgbe and use dev_get_stats() which first zeroes out
rtnl_link_stats64.

Fixes: 5944701df90d ("net: remove useless memset's in drivers get_stats64")
Reported-by: Stefan Priebe <s.pri...@profihost.ag>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---

Notes:
Changes v1->v2:
* add the Fixes tag

 drivers/net/ethernet/intel/e1000e/ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c b/drivers/net/ethernet/intel/e1000e/ethtool.c
index e23dbd9190d6..5b4d97570896 100644
--- a/drivers/net/ethernet/intel/e1000e/ethtool.c
+++ b/drivers/net/ethernet/intel/e1000e/ethtool.c
@@ -2072,7 +2072,7 @@ static void e1000_get_ethtool_stats(struct net_device *netdev,
 
pm_runtime_get_sync(netdev->dev.parent);
 
-   e1000e_get_stats64(netdev, &net_stats);
+   dev_get_stats(netdev, &net_stats);
 
pm_runtime_put_sync(netdev->dev.parent);
 
-- 
2.12.2

Re: SSE instructions for fast packet copy?

2017-05-08 Thread Benjamin Poirier

On 2017/05/04 22:50, Tom Herbert wrote:
> Hi,
> 
> I am thinking about the possibility of using SSE in kernel for
> speeding up the kernel memcpy particularly for copy to userspace
> memory, and maybe even using the string instructions (like if we
> supported regex in something like eBPF). AFAIK we don't use SSE in
> kernel because of xmm register state needing to be saved across
> context switch. However, if we start busy-polling a CPU in kernel on
> network queues then there might not be any context switches to worry
> about. In this model we'd want to enable SSE per CPU.
> 
> Has this ever been tried before? Is this at all feasible? :-) Is it
> possible to enable SSE for kernel for just one CPU? (I found CPUID
> will return SSE supported, but don't see how to enable other than
> -msse for compiling).

This reminds me of what you tried in
c6e1a0d12ca7 net: Allow no-cache copy from user on transmit
(v3.0-rc1)
and that I reverted in
cdb3f4a31b64 net: Do not enable tx-nocache-copy by default
(v3.14-rc1)

Sure, it's not exactly the same thing...

Re: [Intel-wired-lan] [PATCH 1/2] e1000e: Don't return uninitialized stats

2017-04-25 Thread Benjamin Poirier

On 2017/04/25 10:54, Stephen Hemminger wrote:
[...]
> > > The call to memset was removed from the upstream kernel with:
> > > ---
> > > -
> > > commit 5944701df90d9577658e2354cc27c4ceaeca30fe
> > > Author: stephen hemminger 
> > > Date:   Fri Jan 6 19:12:53 2017 -0800
> > > 
> > >     net: remove useless memset's in drivers get_stats64
> > > 
> > >     In dev_get_stats() the statistic structure storage has already
> > > been
> > >     zeroed. Therefore network drivers do not need to call memset()
> > > again.
> > > ...
> > > < changes to other drivers snipped out >
> > > ...
> > > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
> > > b/drivers/net/ethernet/int
> > > index 723025b..79651eb 100644
> > > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > > @@ -5925,7 +5925,6 @@ void e1000e_get_stats64(struct net_device
> > > *netdev,
> > >  {
> > >     struct e1000_adapter *adapter = netdev_priv(netdev);
> > > 
> > > -   memset(stats, 0, sizeof(struct rtnl_link_stats64));
> > >     spin_lock(&adapter->stats64_lock);
> > >     e1000e_update_stats(adapter);
> > >     /* Fill out the OS statistics structure */
> > > ---
> > > -
> > > 
> > > This also is where the bad counters start to show up for e1000e for
> > > my test systems.  From this driver on I see (very) large values for
> > > tx_dropped, rx_over_errors and tx_fifo_errors on driver load (even
> > > before bringing the interface up.  It seems the memset is not so
> > > useless for this driver after all.  Would simply reverting the e1000e
> > > portion of this patch resolve the issue?  
> > 
> > Looks like Aaron beat me to the punch on pointing out that we had this
> > very code in there before.  It appears that Stephen's
> > assertion/assumption was incorrect about the stats structure being
> > zero'd out, which is why we are seeing the issue.
> > 
> > I have no issue reverting Stephen's earlier patch, or do we want to
> > pursue why the stats structure is not zero'd out and resolve that
> > instead.  Either way, just want to make sure we are all on the same
> > page as to the right solution so that we do not end up repeating this
> > in the future.
> 
> Let's fix this in the base code.
> 
> From: Stephen Hemminger 
> Date: Tue, 25 Apr 2017 10:50:19 -0700
> Subject: [PATCH net] net: always zero statistics
> 
> Drivers with 32 bit statistics API also should get zeroed statistics.
> 
> Fixes: 5944701df90d ("net: remove useless memset's in drivers get_stats64")

This is probably a good change to do but it doesn't fix anything in
5944701df90d, especially not the problem with e1000e.



Re: [Intel-wired-lan] [PATCH 1/2] e1000e: Don't return uninitialized stats

2017-04-25 Thread Benjamin Poirier

On 2017/04/25 02:07, Jeff Kirsher wrote:
[...]
> > > 
> > > I don't see any memset in e1000e_get_stats64(). What kernel version
> > > are
> > > you looking at?
> > 
> > The call to memset was removed from the upstream kernel with:
> > ---
> > -
> > commit 5944701df90d9577658e2354cc27c4ceaeca30fe
> > Author: stephen hemminger 
> > Date:   Fri Jan 6 19:12:53 2017 -0800
> > 
> >     net: remove useless memset's in drivers get_stats64
> > 
> >     In dev_get_stats() the statistic structure storage has already
> > been
> >     zeroed. Therefore network drivers do not need to call memset()
> > again.
> > ...
> > < changes to other drivers snipped out >
> > ...
> > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
> > b/drivers/net/ethernet/int
> > index 723025b..79651eb 100644
> > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > @@ -5925,7 +5925,6 @@ void e1000e_get_stats64(struct net_device
> > *netdev,
> >  {
> >     struct e1000_adapter *adapter = netdev_priv(netdev);
> > 
> > -   memset(stats, 0, sizeof(struct rtnl_link_stats64));
> >     spin_lock(&adapter->stats64_lock);
> >     e1000e_update_stats(adapter);
> >     /* Fill out the OS statistics structure */
> > ---
> > -
> > 
> > This also is where the bad counters start to show up for e1000e for
> > my test systems.  From this driver on I see (very) large values for
> > tx_dropped, rx_over_errors and tx_fifo_errors on driver load (even
> > before bringing the interface up.  It seems the memset is not so
> > useless for this driver after all.  Would simply reverting the e1000e
> > portion of this patch resolve the issue?
> 
> Looks like Aaron beat me to the punch on pointing out that we had this
> very code in there before.  It appears that Stephen's
> assertion/assumption was incorrect about the stats structure being
> zero'd out, which is why we are seeing the issue.
> 
> I have no issue reverting Stephen's earlier patch, or do we want to
> pursue why the stats structure is not zero'd out and resolve that
> instead.  Either way, just want to make sure we are all on the same
> page as to the right solution so that we do not end up repeating this
> in the future.

If you revert the e1000e part of 5944701df90d ("net: remove useless
memset's in drivers get_stats64", v4.11-rc1) it will fix the issue with
ethtool but memset will be done twice for code paths that call
dev_get_stats() (sysfs, rtnl, ...). Not a big deal but this is not a
problem in the approach I initially suggested. Alternatively, we could
put a memset in e1000_get_ethtool_stats().



Re: [Intel-wired-lan] [PATCH 1/2] e1000e: Don't return uninitialized stats

2017-04-24 Thread Benjamin Poirier

Sasha, please use reply-all to keep everyone in cc (including me...).

On 2017/04/24 11:17, Neftin, Sasha wrote:
> On 4/23/2017 15:53, Neftin, Sasha wrote:
> > -Original Message-
> > From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On 
> > Behalf Of Benjamin Poirier
> > Sent: Saturday, April 22, 2017 00:20
> > To: Kirsher, Jeffrey T <jeffrey.t.kirs...@intel.com>
> > Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org; Stefan Priebe 
> > <s.pri...@profihost.ag>
> > Subject: [Intel-wired-lan] [PATCH 1/2] e1000e: Don't return uninitialized 
> > stats
> > 
> > Some statistics passed to ethtool are garbage because e1000e_get_stats64() 
> > doesn't write them, for example: tx_heartbeat_errors. This leaks kernel 
> > memory to userspace and confuses users.
> > 
> > Do like ixgbe and use dev_get_stats() which first zeroes out 
> > rtnl_link_stats64.
> > 
> > Reported-by: Stefan Priebe <s.pri...@profihost.ag>
> > Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
> > ---
> >   drivers/net/ethernet/intel/e1000e/ethtool.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c b/drivers/net/ethernet/intel/e1000e/ethtool.c
> > index 7aff68a4a4df..f117b90cdc2f 100644
> > --- a/drivers/net/ethernet/intel/e1000e/ethtool.c
> > +++ b/drivers/net/ethernet/intel/e1000e/ethtool.c
> > @@ -2063,7 +2063,7 @@ static void e1000_get_ethtool_stats(struct net_device *netdev,
> > pm_runtime_get_sync(netdev->dev.parent);
> > -   e1000e_get_stats64(netdev, &net_stats);
> > +   dev_get_stats(netdev, &net_stats);
> > pm_runtime_put_sync(netdev->dev.parent);
> > --
> > 2.12.2
> > 
> > ___
> > Intel-wired-lan mailing list
> > intel-wired-...@lists.osuosl.org
> > http://lists.osuosl.org/mailman/listinfo/intel-wired-lan
> 
> Hello,
> 
> We would like to not accept this patch. Suggested generic method
> '*dev_get_stats' (net/core/dev.c) calls 'ops->ndo_get_stats64' method which
> eventually calls e1000e_get_stats64 (netdev.c) - so there is same
> functionality. Also, see that 'e1000e_get_stats64' method in netdev.c (line

No, it's not the same functionality because dev_get_stats() does a
memset on the rtnl_link_stats64 struct.

> 5928) calls 'memset' with 0's before update statistics.  Local sanity check

I don't see any memset in e1000e_get_stats64(). What kernel version are
you looking at?

> in our lab shows 'tx_heartbeat_errors' counter reported as 0.
> 

Please see the mail I just sent to Paul Menzel <pmen...@molgen.mpg.de>
for more information about the issue and how to reproduce it.

Re: [Intel-wired-lan] [PATCH 1/2] e1000e: Don't return uninitialized stats

2017-04-24 Thread Benjamin Poirier

On 2017/04/24 10:23, Paul Menzel wrote:
> Dear Benjamin,
> 
> 
> Thank you for your fix.
> 
> On 04/21/17 23:20, Benjamin Poirier wrote:
> > Some statistics passed to ethtool are garbage because e1000e_get_stats64()
> > doesn't write them, for example: tx_heartbeat_errors. This leaks kernel
> > memory to userspace and confuses users.
> 
> Could you please give specific examples to reproduce the issue? That way
> your fix can also be tested.
> 

Some fields in e1000_get_ethtool_stats()'s net_stats are not initialized
by e1000e_get_stats64(). The structure is allocated on the stack,
therefore, the value of those fields depends on previous stack content;
that in turns depends on kernel version, compiler and previous execution
path. I've tried on 8 machines with different kernel versions and it
reproduced on 3.

root@linux-zxe0:/usr/local/src/linux# git log -n1 --oneline
fc1f8f4f310a net: ipv6: send unsolicited NA if enabled for all interfaces
root@linux-zxe0:/usr/local/src/linux# ethtool -i eth0
driver: e1000e
[...]
root@linux-zxe0:/usr/local/src/linux# ethtool -S eth0
NIC statistics:
 rx_packets: 217
 tx_packets: 153
 rx_bytes: 23091
 tx_bytes: 20533
 rx_broadcast: 0
 tx_broadcast: 6
 rx_multicast: 0
 tx_multicast: 10
 rx_errors: 0
 tx_errors: 0
 tx_dropped: 18446683600612146192
 multicast: 0
 collisions: 0
 rx_length_errors: 0
 rx_over_errors: 70364470214850
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_no_buffer_count: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 18446744072101618112
 tx_heartbeat_errors: 18446612150964469760
[...]

(gdb) p /x 18446683600612146192
$1 = 0xffffc9000282bc10
(gdb) p /x 18446744072101618112
$2 = 0xffffffffa028e1c0
(gdb) p /x 18446612150964469760
$3 = 0xffff880457a44000
... a bunch of kernel addresses

Inserting a dummy memset is a reliable way to show the issue:

--- a/drivers/net/ethernet/intel/e1000e/ethtool.c
+++ b/drivers/net/ethernet/intel/e1000e/ethtool.c
@@ -2061,6 +2061,8 @@ static void e1000_get_ethtool_stats(struct net_device *netdev,
int i;
char *p = NULL;

+   memset(&net_stats, 0xff, sizeof(net_stats));
+
pm_runtime_get_sync(netdev->dev.parent);

e1000e_get_stats64(netdev, &net_stats);

root@linux-zxe0:/usr/local/src/linux# ethtool -S eth0
NIC statistics:
 rx_packets: 30
 tx_packets: 29
 rx_bytes: 2924
 tx_bytes: 3012
 rx_broadcast: 0
 tx_broadcast: 6
 rx_multicast: 0
 tx_multicast: 7
 rx_errors: 0
 tx_errors: 0
 tx_dropped: 18446744073709551615
 multicast: 0
 collisions: 0
 rx_length_errors: 0
 rx_over_errors: 18446744073709551615
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_no_buffer_count: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 18446744073709551615
 tx_heartbeat_errors: 18446744073709551615
[...]

(gdb) p /x 18446744073709551615
$1 = 0xffffffffffffffff
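
The same poisoning trick can be tried outside the kernel. The sketch below is a user-space model with hypothetical names (model_stats, model_get_stats64, model_dev_get_stats) and a made-up subset of fields. It fills a struct with 0xff, runs a fill routine that writes only some fields, and counts the fields that still hold the poison value. The zeroing wrapper mirrors what dev_get_stats() is described as doing in this thread: clear the storage first, then call the driver.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical subset of the fields in rtnl_link_stats64. */
struct model_stats {
	uint64_t rx_packets;
	uint64_t tx_packets;
	uint64_t tx_dropped;
	uint64_t tx_heartbeat_errors;
};

/* Like e1000e_get_stats64(): writes some fields, skips others. */
static void model_get_stats64(struct model_stats *s)
{
	s->rx_packets = 30;
	s->tx_packets = 29;
}

/* Like dev_get_stats(): zero the storage, then call the driver. */
static void model_dev_get_stats(struct model_stats *s)
{
	memset(s, 0, sizeof(*s));
	model_get_stats64(s);
}

/* Poison a struct with 0xff, run fill(), count fields still poisoned. */
static unsigned int model_count_unwritten(void (*fill)(struct model_stats *))
{
	struct model_stats s;
	uint64_t raw[sizeof(s) / sizeof(uint64_t)];
	unsigned int i, n = 0;

	memset(&s, 0xff, sizeof(s));
	fill(&s);
	memcpy(raw, &s, sizeof(s));
	for (i = 0; i < sizeof(raw) / sizeof(raw[0]); i++) {
		if (raw[i] == UINT64_MAX)
			n++;
	}
	return n;
}
```

Passing model_get_stats64 leaves two fields poisoned, while passing model_dev_get_stats leaves none, which matches the difference between the two ethtool outputs above.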

[PATCH 2/2] igb: Remove useless argument

2017-04-21 Thread Benjamin Poirier

Given that all callers of igb_update_stats() pass the same two arguments:
(adapter, &adapter->stats64), the second argument can be removed.

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/igb/igb.h |  2 +-
 drivers/net/ethernet/intel/igb/igb_ethtool.c |  2 +-
 drivers/net/ethernet/intel/igb/igb_main.c    | 10 +++++-----
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index acbc3abe2ddd..3f0c06847fc2 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -593,7 +593,7 @@ void igb_setup_rctl(struct igb_adapter *);
 netdev_tx_t igb_xmit_frame_ring(struct sk_buff *, struct igb_ring *);
 void igb_unmap_and_free_tx_resource(struct igb_ring *, struct igb_tx_buffer *);
 void igb_alloc_rx_buffers(struct igb_ring *, u16);
-void igb_update_stats(struct igb_adapter *, struct rtnl_link_stats64 *);
+void igb_update_stats(struct igb_adapter *);
 bool igb_has_link(struct igb_adapter *adapter);
 void igb_set_ethtool_ops(struct net_device *);
 void igb_power_up_link(struct igb_adapter *);
diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 737b664d004c..8c913958c2eb 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2287,7 +2287,7 @@ static void igb_get_ethtool_stats(struct net_device *netdev,
char *p;
 
spin_lock(&adapter->stats64_lock);
-   igb_update_stats(adapter, net_stats);
+   igb_update_stats(adapter);
 
for (i = 0; i < IGB_GLOBAL_STATS_LEN; i++) {
p = (char *)adapter + igb_gstrings_stats[i].stat_offset;
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index be456bae8169..20da5e9d9d3c 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1815,7 +1815,7 @@ void igb_down(struct igb_adapter *adapter)
 
/* record the stats before reset*/
spin_lock(&adapter->stats64_lock);
-   igb_update_stats(adapter, &adapter->stats64);
+   igb_update_stats(adapter);
spin_unlock(&adapter->stats64_lock);
 
adapter->link_speed = 0;
@@ -4628,7 +4628,7 @@ static void igb_watchdog_task(struct work_struct *work)
}
 
spin_lock(&adapter->stats64_lock);
-   igb_update_stats(adapter, &adapter->stats64);
+   igb_update_stats(adapter);
spin_unlock(&adapter->stats64_lock);
 
for (i = 0; i < adapter->num_tx_queues; i++) {
@@ -5410,7 +5410,7 @@ static void igb_get_stats64(struct net_device *netdev,
struct igb_adapter *adapter = netdev_priv(netdev);
 
spin_lock(&adapter->stats64_lock);
-   igb_update_stats(adapter, &adapter->stats64);
+   igb_update_stats(adapter);
memcpy(stats, &adapter->stats64, sizeof(*stats));
spin_unlock(&adapter->stats64_lock);
 }
@@ -5459,9 +5459,9 @@ static int igb_change_mtu(struct net_device *netdev, int new_mtu)
  *  igb_update_stats - Update the board statistics counters
  *  @adapter: board private structure
  **/
-void igb_update_stats(struct igb_adapter *adapter,
- struct rtnl_link_stats64 *net_stats)
+void igb_update_stats(struct igb_adapter *adapter)
 {
+   struct rtnl_link_stats64 *net_stats = &adapter->stats64;
struct e1000_hw *hw = &adapter->hw;
struct pci_dev *pdev = adapter->pdev;
u32 reg, mpc;
-- 
2.12.2

[PATCH 1/2] e1000e: Don't return uninitialized stats

2017-04-21 Thread Benjamin Poirier

Some statistics passed to ethtool are garbage because e1000e_get_stats64()
doesn't write them, for example: tx_heartbeat_errors. This leaks kernel
memory to userspace and confuses users.

Do like ixgbe and use dev_get_stats() which first zeroes out
rtnl_link_stats64.

Reported-by: Stefan Priebe <s.pri...@profihost.ag>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c b/drivers/net/ethernet/intel/e1000e/ethtool.c
index 7aff68a4a4df..f117b90cdc2f 100644
--- a/drivers/net/ethernet/intel/e1000e/ethtool.c
+++ b/drivers/net/ethernet/intel/e1000e/ethtool.c
@@ -2063,7 +2063,7 @@ static void e1000_get_ethtool_stats(struct net_device *netdev,
 
pm_runtime_get_sync(netdev->dev.parent);
 
-   e1000e_get_stats64(netdev, &net_stats);
+   dev_get_stats(netdev, &net_stats);
 
pm_runtime_put_sync(netdev->dev.parent);
 
-- 
2.12.2
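
For illustration only, a minimal userspace sketch of why clearing the
structure before calling the driver matters. All names below (demo_stats,
demo_driver_get_stats, demo_get_stats) are hypothetical stand-ins and not
kernel API; the real code operates on struct rtnl_link_stats64.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for struct rtnl_link_stats64. */
struct demo_stats {
	uint64_t rx_packets;
	uint64_t tx_packets;
	uint64_t tx_heartbeat_errors;	/* never written by the driver below */
};

/* Models a driver callback that only fills the counters it tracks. */
void demo_driver_get_stats(struct demo_stats *s)
{
	s->rx_packets = 10;
	s->tx_packets = 20;
}

/* Models the wrapper: clear the whole structure, then ask the driver. */
void demo_get_stats(struct demo_stats *s)
{
	memset(s, 0, sizeof(*s));
	demo_driver_get_stats(s);
}
```

Calling demo_driver_get_stats() directly on an uninitialized structure
leaves tx_heartbeat_errors holding whatever was in memory; going through
demo_get_stats() makes every field the driver skips read as zero.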

[PATCH net v2] mlx4: Invoke softirqs after napi_reschedule

2017-02-06 Thread Benjamin Poirier

mlx4 may schedule napi from a workqueue. Afterwards, softirqs are not run
in a deterministic time frame and the following message may be logged:
NOHZ: local_softirq_pending 08

The problem is the same as what was described in commit ec13ee80145c
("virtio_net: invoke softirqs after __napi_schedule") and this patch
applies the same fix to mlx4.

Fixes: 07841f9d94c1 ("net/mlx4_en: Schedule napi when RX buffers allocation fails")
Cc: Eric Dumazet <eric.duma...@gmail.com>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index eac527e25ec9..cc003fdf0ed9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -514,8 +514,11 @@ void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv)
return;
 
for (ring = 0; ring < priv->rx_ring_num; ring++) {
-   if (mlx4_en_is_ring_empty(priv->rx_ring[ring]))
+   if (mlx4_en_is_ring_empty(priv->rx_ring[ring])) {
+   local_bh_disable();
napi_reschedule(&priv->rx_cq[ring]->napi);
+   local_bh_enable();
+   }
}
 }
 
-- 
2.11.0
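
A toy userspace model of the behaviour described above, for illustration
only. It assumes that scheduling napi merely marks a softirq as pending and
that the pending work is run when the outermost bh-disabled section ends.
Every name here (demo_pending, demo_napi_schedule, demo_local_bh_disable,
demo_local_bh_enable) is hypothetical; this is not how the kernel
implements these primitives.

```c
#define DEMO_NET_RX_PENDING 0x08u	/* the value seen in the log message */

unsigned int demo_pending;	/* models the per-CPU softirq pending mask */
int demo_bh_depth;		/* models the bh-disable nesting count */
int demo_polls_run;		/* how many times the modelled softirq ran */

/* Scheduling only raises the pending bit; nothing runs yet. */
void demo_napi_schedule(void)
{
	demo_pending |= DEMO_NET_RX_PENDING;
}

void demo_local_bh_disable(void)
{
	demo_bh_depth++;
}

/* Leaving the outermost section runs whatever is pending. */
void demo_local_bh_enable(void)
{
	if (--demo_bh_depth == 0 && demo_pending) {
		demo_pending = 0;
		demo_polls_run++;
	}
}
```

In this model, calling demo_napi_schedule() on its own leaves 0x08 pending
with nothing to run it, which corresponds to the NOHZ message. Wrapping the
call in demo_local_bh_disable()/demo_local_bh_enable() runs it on exit.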

[PATCH net] mlx4: Invoke softirqs after napi_reschedule

2017-02-06 Thread Benjamin Poirier

mlx4 may schedule napi from a workqueue. Afterwards, softirqs are not run
in a deterministic time frame and the following message may be logged:
NOHZ: local_softirq_pending 08

The problem is the same as what was described in commit ec13ee80145c
("virtio_net: invoke softirqs after __napi_schedule") and this patch
applies the same fix to mlx4.

Cc: Eric Dumazet <eric.duma...@gmail.com>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index eac527e25ec9..14ce1549b638 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -513,10 +513,12 @@ void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv)
if (!priv->port_up)
return;
 
+   local_bh_disable();
for (ring = 0; ring < priv->rx_ring_num; ring++) {
if (mlx4_en_is_ring_empty(priv->rx_ring[ring]))
napi_reschedule(&priv->rx_cq[ring]->napi);
}
+   local_bh_enable();
 }
 
 /* When the rx ring is running in page-per-packet mode, a released frame can go
-- 
2.11.0

Re: [PATCH] i40e: Invoke softirqs after napi_reschedule

2017-01-12 Thread Benjamin Poirier

On 2017/01/12 17:15, Eric Dumazet wrote:
> On Thu, 2017-01-12 at 17:04 -0800, Benjamin Poirier wrote:
> > The following message is logged from time to time when using i40e:
> > NOHZ: local_softirq_pending 08
> > 
> > i40e may schedule napi from a workqueue. Afterwards, softirqs are not run
> > in a deterministic time frame. The problem is the same as what was
> > described in commit ec13ee80145c ("virtio_net: invoke softirqs after
> > __napi_schedule") and this patch applies the same fix to i40e.
> 
> Yes, I believe mlx4 has a similar problem in mlx4_en_recover_from_oom()

Indeed, I was going to send a patch for mlx4 after this one is accepted.

[PATCH] i40e: Invoke softirqs after napi_reschedule

2017-01-12 Thread Benjamin Poirier

The following message is logged from time to time when using i40e:
NOHZ: local_softirq_pending 08

i40e may schedule napi from a workqueue. Afterwards, softirqs are not run
in a deterministic time frame. The problem is the same as what was
described in commit ec13ee80145c ("virtio_net: invoke softirqs after
__napi_schedule") and this patch applies the same fix to i40e.

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index ad4cf63..d65488c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -4621,8 +4621,10 @@ static void i40e_detect_recover_hung_queue(int q_idx, struct i40e_vsi *vsi)
 */
if ((!tx_pending_hw) && i40e_get_tx_pending(tx_ring, true) &&
(!(val & I40E_PFINT_DYN_CTLN_INTENA_MASK))) {
+   local_bh_disable();
if (napi_reschedule(&tx_ring->q_vector->napi))
tx_ring->tx_stats.tx_lost_interrupt++;
+   local_bh_enable();
}
 }
 
-- 
2.10.2

[PATCH v2] scsi: bfa: Increase requested firmware version to 3.2.5.1

2016-12-23 Thread Benjamin Poirier

bna & bfa firmware version 3.2.5.1 was submitted to linux-firmware on
Feb 17 19:10:20 2015 -0500 in 0ab54ff1dc ("linux-firmware: Add QLogic BR
Series Adapter Firmware").

bna was updated to use the newer firmware on Feb 19 16:02:32 2015 -0500 in
3f307c3d70 ("bna: Update the Driver and Firmware Version")

bfa was not updated. I presume this was an oversight but it broke support
for bfa+bna cards such as the following
04:00.0 Fibre Channel [0c04]: Brocade Communications Systems, Inc.
1010/1020/1007/1741 10Gbps CNA [1657:0014] (rev 01)
04:00.1 Fibre Channel [0c04]: Brocade Communications Systems, Inc.
1010/1020/1007/1741 10Gbps CNA [1657:0014] (rev 01)
04:00.2 Ethernet controller [0200]: Brocade Communications Systems,
Inc. 1010/1020/1007/1741 10Gbps CNA [1657:0014] (rev 01)
04:00.3 Ethernet controller [0200]: Brocade Communications Systems,
Inc. 1010/1020/1007/1741 10Gbps CNA [1657:0014] (rev 01)

Currently, if the bfa module is loaded first, bna fails to probe the
respective devices with
[  215.026787] bna: QLogic BR-series 10G Ethernet driver - version: 3.2.25.1
[  215.043707] bna :04:00.2: bar0 mapped to c90001fc, len 262144
[  215.060656] bna :04:00.2: initialization failed err=1
[  215.073893] bna :04:00.3: bar0 mapped to c9000204, len 262144
[  215.090644] bna :04:00.3: initialization failed err=1

Whereas if bna is loaded first, bfa fails with
[  249.592109] QLogic BR-series BFA FC/FCOE SCSI driver - version: 3.2.25.0
[  249.610738] bfa :04:00.0: Running firmware version is incompatible with the driver version
[  249.833513] bfa :04:00.0: bfa init failed
[  249.833919] scsi host6: QLogic BR-series FC/FCOE Adapter, hwpath: :04:00.0 driver: 3.2.25.0
[  249.841446] bfa :04:00.1: Running firmware version is incompatible with the driver version
[  250.045449] bfa :04:00.1: bfa init failed
[  250.045962] scsi host7: QLogic BR-series FC/FCOE Adapter, hwpath: :04:00.1 driver: 3.2.25.0

Increase bfa's requested firmware version. Also increase the driver
version.
I only tested that all of the devices probe without error.

Reported-by: Tim Ehlers <tehl...@gwdg.de>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
Acked-by: Rasesh Mody <rasesh.m...@cavium.com>
---
 drivers/scsi/bfa/bfad.c | 6 +++---
 drivers/scsi/bfa/bfad_drv.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

Changes v1-v2:
Also increase the driver version

diff --git a/drivers/scsi/bfa/bfad.c b/drivers/scsi/bfa/bfad.c
index 9d253cb..e70410b 100644
--- a/drivers/scsi/bfa/bfad.c
+++ b/drivers/scsi/bfa/bfad.c
@@ -64,9 +64,9 @@ int   max_rport_logins = BFA_FCS_MAX_RPORT_LOGINS;
 u32    bfi_image_cb_size, bfi_image_ct_size, bfi_image_ct2_size;
 u32    *bfi_image_cb, *bfi_image_ct, *bfi_image_ct2;
 
-#define BFAD_FW_FILE_CB    "cbfw-3.2.3.0.bin"
-#define BFAD_FW_FILE_CT    "ctfw-3.2.3.0.bin"
-#define BFAD_FW_FILE_CT2   "ct2fw-3.2.3.0.bin"
+#define BFAD_FW_FILE_CB    "cbfw-3.2.5.1.bin"
+#define BFAD_FW_FILE_CT    "ctfw-3.2.5.1.bin"
+#define BFAD_FW_FILE_CT2   "ct2fw-3.2.5.1.bin"
 
 static u32 *bfad_load_fwimg(struct pci_dev *pdev);
 static void bfad_free_fwimg(void);
diff --git a/drivers/scsi/bfa/bfad_drv.h b/drivers/scsi/bfa/bfad_drv.h
index f9e8620..cfcfff4 100644
--- a/drivers/scsi/bfa/bfad_drv.h
+++ b/drivers/scsi/bfa/bfad_drv.h
@@ -58,7 +58,7 @@
 #ifdef BFA_DRIVER_VERSION
 #define BFAD_DRIVER_VERSION    BFA_DRIVER_VERSION
 #else
-#define BFAD_DRIVER_VERSION    "3.2.25.0"
+#define BFAD_DRIVER_VERSION    "3.2.25.1"
 #endif
 
 #define BFAD_PROTO_NAME FCPI_NAME
-- 
2.10.2

[PATCH] scsi: bfa: Increase requested firmware version to 3.2.5.1

2016-12-23 Thread Benjamin Poirier

bna & bfa firmware version 3.2.5.1 was submitted to linux-firmware on Tue
Feb 17 19:10:20 2015 -0500 in 0ab54ff1dc ("linux-firmware: Add QLogic BR
Series Adapter Firmware").

bna was updated to use the newer firmware on Thu, 19 Feb 2015 16:02:32
-0500 in 3f307c3d70 ("bna: Update the Driver and Firmware Version")

bfa was not updated. I presume this was an oversight but it broke support
for bfa+bna cards such as the following
04:00.0 Fibre Channel [0c04]: Brocade Communications Systems, Inc.
1010/1020/1007/1741 10Gbps CNA [1657:0014] (rev 01)
04:00.1 Fibre Channel [0c04]: Brocade Communications Systems, Inc.
1010/1020/1007/1741 10Gbps CNA [1657:0014] (rev 01)
04:00.2 Ethernet controller [0200]: Brocade Communications Systems,
Inc. 1010/1020/1007/1741 10Gbps CNA [1657:0014] (rev 01)
04:00.3 Ethernet controller [0200]: Brocade Communications Systems,
Inc. 1010/1020/1007/1741 10Gbps CNA [1657:0014] (rev 01)

Currently, if the bfa module is loaded first, bna fails to probe the
respective devices with
[  215.026787] bna: QLogic BR-series 10G Ethernet driver - version: 3.2.25.1
[  215.043707] bna :04:00.2: bar0 mapped to c90001fc, len 262144
[  215.060656] bna :04:00.2: initialization failed err=1
[  215.073893] bna :04:00.3: bar0 mapped to c9000204, len 262144
[  215.090644] bna :04:00.3: initialization failed err=1

Whereas if bna is loaded first, bfa fails with
[  249.592109] QLogic BR-series BFA FC/FCOE SCSI driver - version: 3.2.25.0
[  249.610738] bfa :04:00.0: Running firmware version is incompatible with the driver version
[  249.833513] bfa :04:00.0: bfa init failed
[  249.833919] scsi host6: QLogic BR-series FC/FCOE Adapter, hwpath: :04:00.0 driver: 3.2.25.0
[  249.841446] bfa :04:00.1: Running firmware version is incompatible with the driver version
[  250.045449] bfa :04:00.1: bfa init failed
[  250.045962] scsi host7: QLogic BR-series FC/FCOE Adapter, hwpath: :04:00.1 driver: 3.2.25.0

Increase bfa's requested firmware version.
I only tested that all of the devices probe without error.

Cc: Rasesh Mody <rasesh.m...@qlogic.com>
Reported-by: Tim Ehlers <tehl...@gwdg.de>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/scsi/bfa/bfad.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/bfa/bfad.c b/drivers/scsi/bfa/bfad.c
index 9d253cb..e70410b 100644
--- a/drivers/scsi/bfa/bfad.c
+++ b/drivers/scsi/bfa/bfad.c
@@ -64,9 +64,9 @@ int   max_rport_logins = BFA_FCS_MAX_RPORT_LOGINS;
 u32    bfi_image_cb_size, bfi_image_ct_size, bfi_image_ct2_size;
 u32    *bfi_image_cb, *bfi_image_ct, *bfi_image_ct2;
 
-#define BFAD_FW_FILE_CB    "cbfw-3.2.3.0.bin"
-#define BFAD_FW_FILE_CT    "ctfw-3.2.3.0.bin"
-#define BFAD_FW_FILE_CT2   "ct2fw-3.2.3.0.bin"
+#define BFAD_FW_FILE_CB    "cbfw-3.2.5.1.bin"
+#define BFAD_FW_FILE_CT    "ctfw-3.2.5.1.bin"
+#define BFAD_FW_FILE_CT2   "ct2fw-3.2.5.1.bin"
 
 static u32 *bfad_load_fwimg(struct pci_dev *pdev);
 static void bfad_free_fwimg(void);
-- 
2.10.2

[PATCH] bna: Add synchronization for tx ring.

2016-11-07 Thread Benjamin Poirier

We received two reports of BUG_ON in bnad_txcmpl_process() where
hw_consumer_index appeared to be ahead of producer_index. Out of order
write/read of these variables could explain these reports.

bnad_start_xmit(), as a producer of tx descriptors, has a few memory
barriers sprinkled around writes to producer_index and the device's
doorbell but they're not paired with anything in bnad_txcmpl_process(), a
consumer.

Since we are synchronizing with a device, we must use mandatory barriers,
not smp_*. Also, I didn't see the purpose of the last smp_mb() in
bnad_start_xmit().

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/brocade/bna/bnad.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/brocade/bna/bnad.c b/drivers/net/ethernet/brocade/bna/bnad.c
index f9df4b5a..f42f672 100644
--- a/drivers/net/ethernet/brocade/bna/bnad.c
+++ b/drivers/net/ethernet/brocade/bna/bnad.c
@@ -177,6 +177,7 @@ bnad_txcmpl_process(struct bnad *bnad, struct bna_tcb *tcb)
return 0;
 
hw_cons = *(tcb->hw_consumer_index);
+   rmb();
cons = tcb->consumer_index;
q_depth = tcb->q_depth;
 
@@ -3094,7 +3095,7 @@ bnad_start_xmit(struct sk_buff *skb, struct net_device *netdev)
BNA_QE_INDX_INC(prod, q_depth);
tcb->producer_index = prod;
 
-   smp_mb();
+   wmb();
 
if (unlikely(!test_bit(BNAD_TXQ_TX_STARTED, &tcb->flags)))
return NETDEV_TX_OK;
@@ -3102,7 +3103,6 @@ bnad_start_xmit(struct sk_buff *skb, struct net_device *netdev)
skb_tx_timestamp(skb);
 
bna_txq_prod_indx_doorbell(tcb);
-   smp_mb();
 
return NETDEV_TX_OK;
 }
-- 
2.9.3
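
To illustrate what "paired" ordering means, here is a userspace analogue
using C11 atomics between two threads. It is only a sketch: the patch above
uses rmb()/wmb() because the peer is a device, which C11 release/acquire
does not model. The names demo_ring, demo_produce and demo_consume and the
queue depth are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_Q_DEPTH 8u	/* hypothetical depth, a power of two */

struct demo_ring {
	uint32_t slots[DEMO_Q_DEPTH];
	atomic_uint prod;	/* written by the producer only */
	atomic_uint cons;	/* written by the consumer only */
};

/* Returns 0 on success, -1 if the ring is full. */
int demo_produce(struct demo_ring *r, uint32_t val)
{
	unsigned int prod = atomic_load_explicit(&r->prod, memory_order_relaxed);
	unsigned int cons = atomic_load_explicit(&r->cons, memory_order_acquire);

	if (prod - cons == DEMO_Q_DEPTH)
		return -1;
	r->slots[prod % DEMO_Q_DEPTH] = val;
	/* Release: the slot write above becomes visible before the index. */
	atomic_store_explicit(&r->prod, prod + 1, memory_order_release);
	return 0;
}

/* Returns 0 on success, -1 if the ring is empty. */
int demo_consume(struct demo_ring *r, uint32_t *val)
{
	unsigned int cons = atomic_load_explicit(&r->cons, memory_order_relaxed);
	/* Acquire: pairs with the release store in demo_produce(). */
	unsigned int prod = atomic_load_explicit(&r->prod, memory_order_acquire);

	if (prod == cons)
		return -1;
	*val = r->slots[cons % DEMO_Q_DEPTH];
	/* Release: the slot read above completes before the slot is handed back. */
	atomic_store_explicit(&r->cons, cons + 1, memory_order_release);
	return 0;
}
```

The point of the sketch is that an ordering primitive on the writer side
only helps if the reader side has a matching one; a barrier on one side
alone leaves the index and the data free to be observed out of order.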

[PATCH] vmxnet3: Wake queue from reset work

2016-10-02 Thread Benjamin Poirier

vmxnet3_reset_work() expects tx queues to be stopped (via
vmxnet3_quiesce_dev -> netif_tx_disable). However, this races with the
netif_wake_queue() call in vmxnet3_tx_timeout() such that the driver's
start_xmit routine may be called unexpectedly, triggering one of the BUG_ON
in vmxnet3_map_pkt with a stack trace like this:

RIP: 0010:[] vmxnet3_map_pkt+0x3ac/0x4c0 [vmxnet3]
 [] vmxnet3_tq_xmit+0x210/0x4e0 [vmxnet3]
 [] dev_hard_start_xmit+0x2e4/0x4c0
 [] sch_direct_xmit+0x17e/0x1e0
 [] __qdisc_run+0xd7/0x130
 [] net_tx_action+0x10a/0x200
 [] __do_softirq+0x11f/0x260
 [] call_softirq+0x1c/0x30
 [] do_softirq+0x65/0xa0
 [] local_bh_enable_ip+0x99/0xa0
 [] destroy_conntrack+0x96/0x110 [nf_conntrack]
 [] nf_conntrack_destroy+0x12/0x20
 [] skb_release_head_state+0xb5/0xf0
 [] skb_release_all+0x9/0x20
 [] __kfree_skb+0x9/0x90
 [] vmxnet3_quiesce_dev+0x209/0x340 [vmxnet3]
 [] vmxnet3_reset_work+0x6a/0xa0 [vmxnet3]
 [] process_one_work+0x16c/0x350
 [] worker_thread+0x17a/0x410
 [] kthread+0x96/0xa0
 [] kernel_thread_helper+0x4/0x10

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/vmxnet3/vmxnet3_drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index 4244b9d..515f7aa 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -3186,7 +3186,6 @@ vmxnet3_tx_timeout(struct net_device *netdev)
 
netdev_err(adapter->netdev, "tx hang\n");
schedule_work(&adapter->work);
-   netif_wake_queue(adapter->netdev);
 }
 
 
@@ -3213,6 +3212,7 @@ vmxnet3_reset_work(struct work_struct *data)
}
rtnl_unlock();
 
+   netif_wake_queue(adapter->netdev);
clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
 }
 
-- 
2.9.3

Re: [PATCH RESEND] net: can: Introduce MEN 16Z192-00 CAN controller driver

2016-08-08 Thread Benjamin Poirier

On 2016/08/08 09:26, Andreas Werner wrote:
[...]
> > > +
> > > + if (cf->can_dlc > 0)
> > > + data[0] = be32_to_cpup((__be32 *)(cf->data));
> > > + if (cf->can_dlc > 3)
> > > + data[1] = be32_to_cpup((__be32 *)(cf->data + 4));
> > > +
> > > + writel(id, &cf_buf->can_id);
> > > + writel(cf->can_dlc, &cf_buf->length);
> > > +
> > > + if (!(cf->can_id & CAN_RTR_FLAG)) {
> > > + writel(data[0], &cf_buf->data[0]);
> > > + writel(data[1], &cf_buf->data[1]);
> > > +
> > > + stats->tx_bytes += cf->can_dlc;
> > > + }
> > > +
> > > + /* be sure everything is written to the
> > > +  * device before acknowledge the data.
> > > +  */
> > > + mmiowb();
> > > +
> > > + /* trigger the transmission */
> > > + men_z192_ack_tx_pkg(priv, 1);
> > > +
> > > + stats->tx_packets++;
> > > +
> > > + kfree_skb(skb);
> > 
> > What prevents the skb data from being freed/reused before the device
> > has accessed it?

I'm sorry, I hadn't realized that all of the data (all 8 bytes of it!)
is written directly to the device. I was thinking about Ethernet devices
that DMA packet data.
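
For reference, a small userspace sketch of the packing done in the quoted
code: the payload is read as two big-endian 32-bit words, which are what end
up being written to the device. The helper names (demo_load_be32,
demo_pack_frame) are hypothetical; the dlc thresholds simply mirror the
quoted patch.

```c
#include <stdint.h>

/* Portable equivalent of loading a big-endian 32-bit value from bytes. */
uint32_t demo_load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Pack up to 8 payload bytes into the two words handed to the device. */
void demo_pack_frame(const uint8_t payload[8], uint8_t dlc, uint32_t words[2])
{
	words[0] = 0;
	words[1] = 0;
	if (dlc > 0)
		words[0] = demo_load_be32(payload);
	if (dlc > 3)	/* same threshold as the quoted code */
		words[1] = demo_load_be32(payload + 4);
}
```

Once both words have been written out, nothing refers to the original
payload buffer any more, which is why freeing the skb right away is safe
in this driver.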

Re: [PATCH RESEND] net: can: Introduce MEN 16Z192-00 CAN controller driver

2016-08-07 Thread Benjamin Poirier

On 2016/07/26 11:16, Andreas Werner wrote:
[...]
> +
> + /* Lock for CTL_BTR register access.
> +  * This register combines bittiming bits
> +  * and the operation mode bits.
> +  * It is also used for bit r/m/w access
> +  * to all registers.
> +  */
> + spinlock_t lock;

Why not use 80 cols for comments?

[...]
> +
> +static int men_z192_xmit(struct sk_buff *skb, struct net_device *ndev)
> +{
> + struct can_frame *cf = (struct can_frame *)skb->data;
> + struct men_z192 *priv = netdev_priv(ndev);
> + struct men_z192_regs __iomem *regs = priv->regs;
> + struct net_device_stats *stats = &ndev->stats;
> + struct men_z192_cf_buf __iomem *cf_buf;
> + u32 data[2] = {0, 0};
> + int status;
> + u32 id;
> +
> + if (can_dropped_invalid_skb(ndev, skb))
> + return NETDEV_TX_OK;
> +
> + status = readl(&regs->rx_tx_sts);
> +
> + if (MEN_Z192_TX_BUF_CNT(status) >= 255) {
> + netif_stop_queue(ndev);
> + netdev_err(ndev, "not enough space in TX buffer\n");
> +
> + return NETDEV_TX_BUSY;
> + }
> +
> + cf_buf = priv->dev_base + MEN_Z192_TX_BUF_START;
> +
> + if (cf->can_id & CAN_EFF_FLAG) {
> + /* Extended frame */
> + id = ((cf->can_id & CAN_EFF_MASK) <<
> + MEN_Z192_CFBUF_ID2_SHIFT) & MEN_Z192_CFBUF_ID2;
> +
> + id |= (((cf->can_id & CAN_EFF_MASK) >>
> + (CAN_EFF_ID_BITS - CAN_SFF_ID_BITS)) <<
> +  MEN_Z192_CFBUF_ID1_SHIFT) & MEN_Z192_CFBUF_ID1;
> +
> + id |= MEN_Z192_CFBUF_IDE;
> + id |= MEN_Z192_CFBUF_SRR;
> +
> + if (cf->can_id & CAN_RTR_FLAG)
> + id |= MEN_Z192_CFBUF_E_RTR;
> + } else {
> + /* Standard frame */
> + id = ((cf->can_id & CAN_SFF_MASK) <<
> +MEN_Z192_CFBUF_ID1_SHIFT) & MEN_Z192_CFBUF_ID1;
> +
> + if (cf->can_id & CAN_RTR_FLAG)
> + id |= MEN_Z192_CFBUF_S_RTR;
> + }
> +
> + if (cf->can_dlc > 0)
> + data[0] = be32_to_cpup((__be32 *)(cf->data));
> + if (cf->can_dlc > 3)
> + data[1] = be32_to_cpup((__be32 *)(cf->data + 4));
> +
> + writel(id, &cf_buf->can_id);
> + writel(cf->can_dlc, &cf_buf->length);
> +
> + if (!(cf->can_id & CAN_RTR_FLAG)) {
> + writel(data[0], &cf_buf->data[0]);
> + writel(data[1], &cf_buf->data[1]);
> +
> + stats->tx_bytes += cf->can_dlc;
> + }
> +
> + /* be sure everything is written to the
> +  * device before acknowledge the data.
> +  */
> + mmiowb();
> +
> + /* trigger the transmission */
> + men_z192_ack_tx_pkg(priv, 1);
> +
> + stats->tx_packets++;
> +
> + kfree_skb(skb);

What prevents the skb data from being freed/reused before the device has
accessed it?

[...]
> +
> +static int men_z192_probe(struct mcb_device *mdev,
> +   const struct mcb_device_id *id)
> +{
> + struct device *dev = &mdev->dev;
> + struct men_z192 *priv;
> + struct net_device *ndev;
> + void __iomem *dev_base;
> + struct resource *mem;
> + u32 timebase;
> + int ret = 0;
> + int irq;
> +
> + mem = mcb_request_mem(mdev, dev_name(dev));
> + if (IS_ERR(mem)) {
> + dev_err(dev, "failed to request device memory");
> + return PTR_ERR(mem);
> + }
> +
> + dev_base = ioremap(mem->start, resource_size(mem));
> + if (!dev_base) {
> + dev_err(dev, "failed to ioremap device memory");
> + ret = -ENXIO;
> + goto out_release;
> + }
> +
> + irq = mcb_get_irq(mdev);
> + if (irq <= 0) {
> + ret = -ENODEV;
> + goto out_unmap;
> + }
> +
> + ndev = alloc_candev(sizeof(struct men_z192), 1);
> + if (!ndev) {
> + dev_err(dev, "failed to allocate the can device");
> + ret = -ENOMEM;
> + goto out_unmap;
> + }
> +
> + ndev->netdev_ops = &men_z192_netdev_ops;
> + ndev->irq = irq;
> +
> + priv = netdev_priv(ndev);
> + priv->ndev = ndev;
> + priv->dev = dev;
> +
> + priv->mem = mem;
> + priv->dev_base = dev_base;
> + priv->regs = priv->dev_base + MEN_Z192_REGS_OFFS;
> +
> + timebase = readl(&priv->regs->timebase);
> + if (!timebase) {
> + dev_err(dev, "invalid timebase configured (timebase=%d)\n",
> + timebase);
> + ret = -EINVAL;
> + goto out_unmap;

free_candev is missing in this error path
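
One way to structure the unwind, sketched in userspace so it can be run.
This is only an illustration of the goto-label pattern, not a proposed
patch: demo_get(), demo_put(), demo_probe() and the out_free_ndev label are
hypothetical, and malloc()/free() stand in for the real resources.

```c
#include <errno.h>
#include <stdlib.h>

int demo_live;	/* number of resources currently held */

void *demo_get(void)
{
	void *p = malloc(16);

	if (p)
		demo_live++;
	return p;
}

void demo_put(void *p)
{
	free(p);
	demo_live--;
}

int demo_probe(unsigned int timebase)
{
	void *mapping, *ndev;
	int ret = 0;

	mapping = demo_get();		/* stands in for ioremap() */
	if (!mapping)
		return -ENXIO;

	ndev = demo_get();		/* stands in for alloc_candev() */
	if (!ndev) {
		ret = -ENOMEM;
		goto out_unmap;
	}

	if (!timebase) {
		ret = -EINVAL;
		goto out_free_ndev;	/* not out_unmap: ndev exists now */
	}

	/* A real probe would keep both resources on success; the sketch
	 * releases them so that demo_live can be checked afterwards.
	 */
out_free_ndev:
	demo_put(ndev);
out_unmap:
	demo_put(mapping);
	return ret;
}
```

Each label releases exactly what was acquired before the jump, in reverse
order, so any failure after the second acquisition has to target the
later label.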

> + }
> +
> + priv->can.clock.freq = timebase;
> + priv->can.bittiming_const = &men_z192_bittiming_const;
> + priv->can.do_set_mode = men_z192_set_mode;
> + priv->can.do_get_berr_counter = men_z192_get_berr_counter;
> + priv->can.ctrlmode_supported = CAN_CTRLMODE_LISTENONLY |
> +CAN_CTRLMODE_3_SAMPLES |
> +

Re: [PATCH net-next V3] net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)

2016-07-15 Thread Benjamin Poirier

On 2016/07/15 08:00, Leon Romanovsky wrote:
> On Thu, Jul 14, 2016 at 09:46:14AM +0300, Netanel Belgazal wrote:
> > This is a driver for the ENA family of networking devices.
> > 
> > Signed-off-by: Netanel Belgazal 
> > ---
> > 
> > Notes:
> 
> ...
> 
> > - Increase driver version to 1.0.2
> 
> ...
> 
> > +static void ena_get_drvinfo(struct net_device *dev,
> > +   struct ethtool_drvinfo *info)
> > +{
> > +   struct ena_adapter *adapter = netdev_priv(dev);
> > +
> > +   strlcpy(info->driver, DRV_MODULE_NAME, sizeof(info->driver));
> > +   strlcpy(info->version, DRV_MODULE_VERSION, sizeof(info->version));
> 
> Does module version give anything valuable in real life usage?
> Do you plan to bump version after every patch?
> 
> Hint, NO.
> 
[...]
> > +
> > +#define DRV_MODULE_VER_MAJOR   1
> > +#define DRV_MODULE_VER_MINOR   0
> > +#define DRV_MODULE_VER_SUBMINOR 1
> > +
> > +#define DRV_MODULE_NAME        "ena"
> > +#ifndef DRV_MODULE_VERSION
> > +#define DRV_MODULE_VERSION \
> > +   __stringify(DRV_MODULE_VER_MAJOR) "."   \
> > +   __stringify(DRV_MODULE_VER_MINOR) "."   \
> > +   __stringify(DRV_MODULE_VER_SUBMINOR)
> > +#endif
> > +#define DRV_MODULE_RELDATE  "22-JUNE-2016"
> 
> Please remove it, driver version is useless in real life kernel usage.
> 

The release date might be a bit overkill but the driver version is
useful in the context of distribution kernels where users sometimes mix
and match newer drivers (e.g. the Intel sf.net drivers) with older
kernels. When a bug is reported, a quick look at the module version can
help indicate the provenance of the driver.
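
As an aside, the version string in the quoted patch is assembled at compile
time by stringifying the three numeric macros and relying on adjacent string
literal concatenation. A standalone sketch of the same idea follows; the
DEMO_* names are hypothetical, and DEMO_STRINGIFY is a simplified two-level
macro in the spirit of the kernel's __stringify(), not a copy of it.

```c
/* Two levels so that the argument is macro-expanded before '#' applies. */
#define DEMO_STRINGIFY_1(x)	#x
#define DEMO_STRINGIFY(x)	DEMO_STRINGIFY_1(x)

#define DEMO_VER_MAJOR		1
#define DEMO_VER_MINOR		0
#define DEMO_VER_SUBMINOR	1

/* Adjacent string literals are concatenated by the compiler. */
#define DEMO_VERSION \
	DEMO_STRINGIFY(DEMO_VER_MAJOR) "." \
	DEMO_STRINGIFY(DEMO_VER_MINOR) "." \
	DEMO_STRINGIFY(DEMO_VER_SUBMINOR)

const char *demo_version(void)
{
	return DEMO_VERSION;
}
```

With a single-level macro the result would be the literal text
"DEMO_VER_MAJOR" rather than "1", which is why the indirection is needed.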



Re: [PATCH net-next V3] net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)

2016-07-14 Thread Benjamin Poirier

On 2016/07/14 08:22, Matt Wilson wrote:
[...]
> 
> Dave and Benjamin,
> 
> Do you want to see the interrupt moderation extensions to ethtool and
> the sysfs nodes removed before this lands in net-next? Or should
> Netanel remove the sysfs bits until we can extend the ethtool
> interfaces to cover the parameters that ena uses?

I couldn't say what's acceptable or not. A few other drivers (qlcnic,
sfc, ...) already have sysfs tunables. Maybe John, as the new ethtool
maintainer, can weigh in too about the changes required to ethtool.

Re: Fwd: Fwd: Re: [PATCH net-next 00/15] net/smc: Shared Memory Communications - RDMA

2016-07-12 Thread Benjamin Poirier

On 2016/07/12 11:08, Benjamin Poirier wrote:
> On 2016/07/06 17:29, Ursula Braun wrote:
> > Dave,
> > 
> > we still like to see SMC-R included into a future Linux-kernel. After
> > answering your first 2 questions, there is no longer a response. What should
> > we do next?
> > - Still wait for an answer from you?
> > - Resend the same whole SMC-R patch series, this time with the cover letter
> > adapted to your requested changes?
> 
> ^^^ I would suggest sending v2 of the patch series with the changes
> that were requested.
> 
> > - Put the SMC-R development on hold, and concentrate on another
> > s390-specific SMC-solution first (not RDMA-based), that makes use of the
> > SMC-socket family as well.
> > - Anything else?
> > 

... and I would suggest trimming my emails next time. Sorry.

Re: Fwd: Fwd: Re: [PATCH net-next 00/15] net/smc: Shared Memory Communications - RDMA

2016-07-12 Thread Benjamin Poirier

On 2016/07/06 17:29, Ursula Braun wrote:
> Dave,
> 
> we still like to see SMC-R included into a future Linux-kernel. After
> answering your first 2 questions, there is no longer a response. What should
> we do next?
> - Still wait for an answer from you?
> - Resend the same whole SMC-R patch series, this time with the cover letter
> adapted to your requested changes?

^^^ I would suggest sending v2 of the patch series with the changes
that were requested.

> - Put the SMC-R development on hold, and concentrate on another
> s390-specific SMC-solution first (not RDMA-based), that makes use of the
> SMC-socket family as well.
> - Anything else?
> 
> Kind regards, Ursula
> 
> -------- Forwarded Message --------
> Subject: Fwd: Re: [PATCH net-next 00/15] net/smc: Shared Memory
> Communications - RDMA
> Date: Tue, 21 Jun 2016 16:02:59 +0200
> From: Ursula Braun 
> To: da...@davemloft.net
> CC: netdev@vger.kernel.org, linux-s...@vger.kernel.org,
> schwidef...@de.ibm.com, heiko.carst...@de.ibm.com, utz.bac...@de.ibm.com
> 
> Dave,
> 
> the SMC-R patches submitted 2016-06-03 show up in state "Changes
> Requested" on patchwork:
> https://patchwork.ozlabs.org/project/netdev/list/?submitter=2266=*=1
> 
> You had requested a change of the SMC-R description in the cover letter.
> We came up with the response below. Do you need anything else from us?
> 
> Kind regards,
> Ursula Braun
> 
> -------- Forwarded Message --------
> Subject: Re: [PATCH net-next 00/15] net/smc: Shared Memory
> Communications - RDMA
> Date: Thu,  9 Jun 2016 17:36:28 +0200
> From: Ursula Braun 
> To: da...@davemloft.net
> CC: netdev@vger.kernel.org, linux-s...@vger.kernel.org,
> schwidef...@de.ibm.com, heiko.carst...@de.ibm.com
> 
> On Tue, 2016-06-07 at 15:07 -0700, David Miller wrote:
> > In case my previous reply wasn't clear enough, I require that you provide
> > a more accurate description of what the implications of this feature are.
> > 
> > Namely, that important _CORE_ networking features are completely bypassed
> > and unusable when SMC applies to a connection.
> > 
> > Specifically, all packet shaping, filtering, traffic inspection, and
> > flow management facilities in the kernel will not be able to see nor
> > act upon the data flow of these TCP connections once established.
> > 
> > It is always important, and in my opinion required, to list the
> > negative aspects of your change and not just the "wow, amazing"
> > positive aspects.
> > 
> > Thanks.
> > 
> > 
> Correct, the SMC-R data stream bypasses TCP and thus cannot enjoy its
> features. This is the price for leveraging the TCP application ecosystem
> and reducing CPU load.
> 
> When a load balancer allows the TCP handshake to take place between a
> worker node and the TCP client, RDMA will be used between these two
> nodes. So anything based on TCP connection establishment (including a
> firewall) can apply to SMC-R, too. To be clear -- yes, the data flow
> later on is not subject to these features anymore.  At least VLAN
> isolation from the TCP part can be leveraged for RDMA traffic. From our
> experience, discussions, etc., that tradeoff seems acceptable in a
> classical data center environment.
> 
> Improving our cover letter would result in the following new
> introductory motivation part at the beginning and a slightly modified
> list of planned enhancements at the end:
> 
> On Fri, 2016-06-03 at 17:26 +0200, Ursula Braun wrote:
> 
> > These patches are the initial part of the implementation of the
> > "Shared Memory Communications-RDMA" (SMC-R) protocol. The protocol is
> > defined in RFC7609 [1]. It allows transformation of TCP connections
> > using the "Remote Direct Memory Access over Converged Ethernet" (RoCE)
> > feature of specific communication hardware for data center environments.
> > 
> > SMC-R inherits TCP qualities such as reliable connections, host-based
> > firewall packet filtering (on connection establishment) and unmodified
> > application of communication encryption such as TLS (transport layer
> > security) or SSL (secure sockets layer). It is transparent to most existing
> > TCP connection load balancers that are commonly used in the enterprise data
> > center environment for multi-tier application workloads.
> > 
> > Being designed for the data center network switched fabric environment, it
> > does not need congestion control and thus reaches line speed right away
> > without having to go through slow start as with TCP. This can be beneficial
> > for short living flows including request response patterns requiring
> > reliability. A full SMC-R implementation also provides seamless high
> > availability and load-balancing demanded by enterprise installations.
> > 
> > SMC-R does not require an RDMA communication manager (RDMA CM). Its use of
> > RDMA provides CPU savings transparently for unmodified applications.
> > For instance, when running 10 parallel connections with uperf, we measured
>

Re: [PATCH net-next] net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)

2016-06-16 Thread Benjamin Poirier

On 2016/06/13 11:46, Netanel Belgazal wrote:
[...]
> +
> +static int ena_set_coalesce(struct net_device *net_dev,
> + struct ethtool_coalesce *coalesce)
> +{
> + struct ena_adapter *adapter = netdev_priv(net_dev);
> + struct ena_com_dev *ena_dev = adapter->ena_dev;
> + int rc;
> +
> + if (!ena_com_interrupt_moderation_supported(ena_dev)) {
> + /* the device doesn't support interrupt moderation */
> + return -EOPNOTSUPP;
> + }
> +
> + /* Note, adaptive coalescing settings are updated through sysfs */

I believe the usual approach is to use ethtool for these kinds of
settings, extending the interface if necessary.

> + if (coalesce->rx_coalesce_usecs_irq ||
> + coalesce->rx_max_coalesced_frames ||
> + coalesce->rx_max_coalesced_frames_irq ||
> + coalesce->tx_coalesce_usecs_irq ||
> + coalesce->tx_max_coalesced_frames ||
> + coalesce->tx_max_coalesced_frames_irq ||
> + coalesce->stats_block_coalesce_usecs ||
> + coalesce->use_adaptive_tx_coalesce ||
> + coalesce->pkt_rate_low ||
> + coalesce->rx_coalesce_usecs_low ||
> + coalesce->rx_max_coalesced_frames_low ||
> + coalesce->tx_coalesce_usecs_low ||
> + coalesce->tx_max_coalesced_frames_low ||
> + coalesce->pkt_rate_high ||
> + coalesce->rx_coalesce_usecs_high ||
> + coalesce->rx_max_coalesced_frames_high ||
> + coalesce->tx_coalesce_usecs_high ||
> + coalesce->tx_max_coalesced_frames_high ||
> + coalesce->rate_sample_interval)
> + return -EINVAL;
> +

[...]

> +
> +static ssize_t ena_store_small_copy_len(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + struct ena_adapter *adapter = dev_get_drvdata(dev);
> + unsigned long small_copy_len;
> + struct ena_ring *rx_ring;
> + int err, i;
> +
> + err = kstrtoul(buf, 10, &small_copy_len);
> + if (err < 0)
> + return err;
> +
> + err = ena_validate_small_copy_len(adapter, small_copy_len);
> + if (err)
> + return err;
> +
> + rtnl_lock();
> + adapter->small_copy_len = small_copy_len;
> +
> + for (i = 0; i < adapter->num_queues; i++) {
> + rx_ring = &adapter->rx_ring[i];
> + rx_ring->rx_small_copy_len = small_copy_len;
> + }
> + rtnl_unlock();
> +
> + return len;
> +}
> +
> +static ssize_t ena_show_small_copy_len(struct device *dev,
> +struct device_attribute *attr, char *buf)
> +{
> + struct ena_adapter *adapter = dev_get_drvdata(dev);
> +
> + return sprintf(buf, "%d\n", adapter->small_copy_len);
> +}
> +
> +static DEVICE_ATTR(small_copy_len, S_IRUGO | S_IWUSR, 
> ena_show_small_copy_len,
> +ena_store_small_copy_len);

This is what many other drivers call (rx_)copybreak. Perhaps it's time
to add it to ethtool as well?

[PATCH] net: Add missing kernel-doc for netdev ptype lists

2016-03-21 Thread Benjamin Poirier

.//include/linux/netdevice.h:1826: warning: No description found for parameter 
'ptype_all'
.//include/linux/netdevice.h:1826: warning: No description found for parameter 
'ptype_specific'

Introduced by commit 7866a621043f ("dev: add per net_device packet type
chains")

Cc: Salam Noureddine <nouredd...@arista.com>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 include/linux/netdevice.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9a3d55c..1937bdd 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1418,6 +1418,8 @@ enum netdev_priv_flags {
  * @unreg_list:List entry, that is used, when we are unregistering the
  * device, see the function unregister_netdev
  * @close_list:List entry, that is used, when we are closing the device
+ * @ptype_all: Device-specific packet handlers for all protocols
+ * @ptype_specific: Device-specific, protocol-specific packet handlers
  *
  * @adj_list:  Directly linked devices, like slaves for bonding
  * @all_adj_list:  All linked devices, *including* neighbours
-- 
2.7.2

Re: [PATCH] net: add missing descriptions in net_device_priv_flags

2016-03-21 Thread Benjamin Poirier

On 2016/03/21 20:20, Luis de Bethencourt wrote:
> The flags IFF_XMIT_DST_RELEASE_PERM, IFF_IPVLAN_MASTER and
> IFF_IPVLAN_SLAVE are missing descriptions for the Documentation. Adding
> them.
> 
> Signed-off-by: Luis de Bethencourt 
> ---
> Hi,
> 
> I also noticed this issue when running make htmldocs. Having better
> documentation is great :)
> 
> Thanks,
> Luis
> 
>  include/linux/netdevice.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index be693b3..db9ffe4 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1320,6 +1320,9 @@ struct net_device_ops {
>   * @IFF_LIVE_ADDR_CHANGE: device supports hardware address
>   *   change when it's running
>   * @IFF_MACVLAN: Macvlan device
> + * @IFF_XMIT_DST_RELEASE_PERM: permanent IFF_XMIT_DST_RELEASE

I also noticed these kernel-doc warnings and had some patches queued for
them. May I suggest the following in this case:

 * @IFF_XMIT_DST_RELEASE_PERM: IFF_XMIT_DST_RELEASE not taking into account
 *  underlying stacked devices

I presume that a patch for netdev ptype lists is forthcoming. Here's
what I had:

 *  @ptype_all: Device-specific packet handlers for all protocols
 *  @ptype_specific: Device-specific, protocol-specific packet handlers

> + * @IFF_IPVLAN_MASTER: IPvlan master device
> + * @IFF_IPVLAN_SLAVE: IPvlan slave device
>   * @IFF_L3MDEV_MASTER: device is an L3 master device
>   * @IFF_NO_QUEUE: device can run without qdisc attached
>   * @IFF_OPENVSWITCH: device is a Open vSwitch master
> -- 
> 2.5.1
>

[PATCH] ip.7: Fix incorrect sockopt name

2016-03-21 Thread Benjamin Poirier

"IP_LEAVE_GROUP" does not exist. It was perhaps confused with
MCAST_LEAVE_GROUP. Change the text to IP_DROP_MEMBERSHIP, which has the same
function as MCAST_LEAVE_GROUP and is documented in the ip.7 man page.

Reference:
Linux kernel net/ipv4/ip_sockglue.c do_ip_setsockopt()

Cc: Radek Pazdera <rpazd...@redhat.com>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 man7/ip.7 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man7/ip.7 b/man7/ip.7
index 3905573..37e2c86 100644
--- a/man7/ip.7
+++ b/man7/ip.7
@@ -376,7 +376,7 @@ a given multicast group that come from a given source.
 If the application has subscribed to multiple sources within
 the same group, data from the remaining sources will still be delivered.
 To stop receiving data from all sources at once, use
-.BR IP_LEAVE_GROUP .
+.BR IP_DROP_MEMBERSHIP .
 .IP
 Argument is an
 .I ip_mreq_source
-- 
2.7.2
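
As an illustration of the option named in the corrected text, here is a
minimal sketch (in Python for brevity) of joining a group and then leaving it
with IP_DROP_MEMBERSHIP. The group address, the helper names and the use of
INADDR_ANY for the interface are assumptions made for the example; they are
not part of the patch.

```python
import socket
import struct


def make_mreq(group, iface="0.0.0.0"):
    """Pack a struct ip_mreq: group address followed by local interface address."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface))


def join_then_leave(group="239.1.2.3"):
    """Join a multicast group, then stop receiving from all of its sources."""
    mreq = make_mreq(group)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        # IP_DROP_MEMBERSHIP is the option the man page should refer to.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP, mreq)
```

join_then_leave() needs a multicast-capable interface, so it is only defined
here and not called.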

[PATCH 2/2] igmp: Document sysctl_igmp_max_msf

2016-03-21 Thread Benjamin Poirier

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 Documentation/networking/ip-sysctl.txt | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index d3768e8..b183e2b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -946,10 +946,15 @@ igmp_max_memberships - INTEGER
The value 5459 assumes no IP header options, so in practice
this number may be lower.
 
+igmp_max_msf - INTEGER
+   Maximum number of addresses allowed in the source filter list for a
+   multicast group.
+   Default: 10
+
 igmp_qrv - INTEGER
-Controls the IGMP query robustness variable (see RFC2236 8.1).
-Default: 2 (as specified by RFC2236 8.1)
-Minimum: 1 (as specified by RFC6636 4.5)
+   Controls the IGMP query robustness variable (see RFC2236 8.1).
+   Default: 2 (as specified by RFC2236 8.1)
+   Minimum: 1 (as specified by RFC6636 4.5)
 
 conf/interface/*  changes special settings per interface (where
 "interface" is the name of your network interface)
-- 
2.7.2

[PATCH 1/2] net: Fix indentation of the conf/ documentation block

2016-03-21 Thread Benjamin Poirier

Commit d67ef35fff67 ("clarify documentation for
net.ipv4.igmp_max_memberships") mistakenly indented a block of
documentation such that it now looks like it belongs to a specific sysctl.
Restore that block's original position.

Cc: Jeremy Eder <je...@redhat.com>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 Documentation/networking/ip-sysctl.txt | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index d5df40c..d3768e8 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -946,16 +946,16 @@ igmp_max_memberships - INTEGER
The value 5459 assumes no IP header options, so in practice
this number may be lower.
 
-   conf/interface/*  changes special settings per interface (where
-   "interface" is the name of your network interface)
-
-   conf/all/*is special, changes the settings for all interfaces
-
 igmp_qrv - INTEGER
 Controls the IGMP query robustness variable (see RFC2236 8.1).
 Default: 2 (as specified by RFC2236 8.1)
 Minimum: 1 (as specified by RFC6636 4.5)
 
+conf/interface/*  changes special settings per interface (where
+"interface" is the name of your network interface)
+
+conf/all/*   is special, changes the settings for all interfaces
+
 log_martians - BOOLEAN
Log packets with impossible addresses to kernel log.
log_martians for the interface will be enabled if at least one of
-- 
2.7.2

[PATCH net v2] mld, igmp: Fix reserved tailroom calculation

2016-02-29 Thread Benjamin Poirier

The current reserved_tailroom calculation fails to take hlen and tlen into
account.

skb:
[__hlen__|__data|__tlen___|__extra__]
^                                   ^
head                                skb_end_offset

In this representation, hlen + data + tlen is the size passed to alloc_skb.
"extra" is the extra space made available in __alloc_skb because of
rounding up by kmalloc. We can reorder the representation like so:

[__hlen__|__data|__extra__|__tlen___]
^                                   ^
head                                skb_end_offset

The maximum space available for ip headers and payload without
fragmentation is min(mtu, data + extra). Therefore,
reserved_tailroom
= data + extra + tlen - min(mtu, data + extra)
= skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
= skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)

Compare the second line to the current expression:
reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
and we can see that hlen and tlen are not taken into account.

The min() in the third line can be expanded into:
	if mtu < skb_tailroom - tlen:
		reserved_tailroom = skb_tailroom - mtu
	else:
		reserved_tailroom = tlen

Depending on hlen, tlen, mtu and the number of multicast address records,
the current code may output skbs that have less tailroom than
dev->needed_tailroom or it may output more skbs than needed because not all
space available is used.

Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---

Notes:
Changes v1->v2
As suggested by Hannes, move the code to an inline helper and express it
using "if" rather than "min".

 include/linux/skbuff.h | 24 ++++++++++++++++++++++++
 net/ipv4/igmp.c        |  3 +--
 net/ipv6/mcast.c       |  3 +--
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4ce9ff7..d3fcd45 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1985,6 +1985,30 @@ static inline void skb_reserve(struct sk_buff *skb, int 
len)
skb->tail += len;
 }
 
+/**
+ * skb_tailroom_reserve - adjust reserved_tailroom
+ * @skb: buffer to alter
+ * @mtu: maximum amount of headlen permitted
+ * @needed_tailroom: minimum amount of reserved_tailroom
+ *
+ * Set reserved_tailroom so that headlen can be as large as possible but
+ * not larger than mtu and tailroom cannot be smaller than
+ * needed_tailroom.
+ * The required headroom should already have been reserved before using
+ * this function.
+ */
+static inline void skb_tailroom_reserve(struct sk_buff *skb, unsigned int mtu,
+   unsigned int needed_tailroom)
+{
+   SKB_LINEAR_ASSERT(skb);
+   if (mtu < skb_tailroom(skb) - needed_tailroom)
+   /* use at most mtu */
+   skb->reserved_tailroom = skb_tailroom(skb) - mtu;
+   else
+   /* use up to all available space */
+   skb->reserved_tailroom = needed_tailroom;
+}
+
 #define ENCAP_TYPE_ETHER   0
 #define ENCAP_TYPE_IPPROTO 1
 
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 05e4cba..b3086cf 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -356,9 +356,8 @@ static struct sk_buff *igmpv3_newpack(struct net_device 
*dev, unsigned int mtu)
skb_dst_set(skb, &rt->dst);
skb->dev = dev;
 
-   skb->reserved_tailroom = skb_end_offset(skb) -
-min(mtu, skb_end_offset(skb));
skb_reserve(skb, hlen);
+   skb_tailroom_reserve(skb, mtu, tlen);
 
skb_reset_network_header(skb);
pip = ip_hdr(skb);
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 5ee56d0..d64ee7e 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1574,9 +1574,8 @@ static struct sk_buff *mld_newpack(struct inet6_dev 
*idev, unsigned int mtu)
return NULL;
 
skb->priority = TC_PRIO_CONTROL;
-   skb->reserved_tailroom = skb_end_offset(skb) -
-min(mtu, skb_end_offset(skb));
skb_reserve(skb, hlen);
+   skb_tailroom_reserve(skb, mtu, tlen);
 
if (__ipv6_get_lladdr(idev, &addr_buf, IFA_F_TENTATIVE)) {
/* :
-- 
2.7.0
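
The commit message states that the min() form and the if/else form used in
skb_tailroom_reserve() are the same thing. A quick way to convince oneself is
the brute-force check sketched below. It uses plain Python integers and assumes
tailroom >= tlen, so it says nothing about C unsigned wraparound; the function
names are made up for the example.

```python
def reserved_min(tailroom, mtu, tlen):
    """min() form: skb_tailroom - min(mtu, skb_tailroom - tlen)."""
    return tailroom - min(mtu, tailroom - tlen)


def reserved_if(tailroom, mtu, tlen):
    """if/else form, as in the inline helper."""
    if mtu < tailroom - tlen:
        return tailroom - mtu  # use at most mtu
    return tlen                # use up to all available space


def forms_agree(max_tailroom=300, max_mtu=300, tlens=(0, 8, 16)):
    """Compare both forms over a small grid of values with tailroom >= tlen."""
    for tlen in tlens:
        for tailroom in range(tlen, max_tailroom):
            for mtu in range(max_mtu):
                if reserved_min(tailroom, mtu, tlen) != reserved_if(tailroom, mtu, tlen):
                    return False
    return True
```

In both branches the result is never smaller than tlen, which is the property
the helper is meant to guarantee.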

Re: [PATCH] mld, igmp: Fix reserved tailroom calculation

2016-02-29 Thread Benjamin Poirier

On 2016/02/29 16:43, Hannes Frederic Sowa wrote:
> On 29.02.2016 16:19, Benjamin Poirier wrote:
> >On 2016/02/29 15:57, Daniel Borkmann wrote:
> >[...]
> >>
> >>[ cutting the IPv4 part off as diff is the same ]
> >>
> >>>diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
> >>>index 5ee56d0..c157edc 100644
> >>>--- a/net/ipv6/mcast.c
> >>>+++ b/net/ipv6/mcast.c
> >>>@@ -1574,9 +1574,9 @@ static struct sk_buff *mld_newpack(struct inet6_dev 
> >>>*idev, unsigned int mtu)
> >>>   return NULL;
> >>>
> >>>   skb->priority = TC_PRIO_CONTROL;
> >>>-  skb->reserved_tailroom = skb_end_offset(skb) -
> >>>-   min(mtu, skb_end_offset(skb));
> >>>   skb_reserve(skb, hlen);
> >>>+  skb->reserved_tailroom = skb_tailroom(skb) -
> >>>+  min_t(int, mtu, skb_tailroom(skb) - tlen);
> >>
> >>Are you sure this is correct? Wouldn't that mean (assuming we allocated
> >>enough space), that I could now fill a larger than MTU frame?
> >
> >Quoting back a part of the log:
> >
> >>>The maximum space available for ip headers and payload without
> >>>fragmentation is min(mtu, data + extra). Therefore,
> >>>reserved_tailroom
> >>>= data + extra + tlen - min(mtu, data + extra)
> >>>= skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
> >>>= skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)
> >
> >The min() takes care of the situation you describe, i.e. if the allocated
> >space is large, reserved_tailroom will be large enough that we do not
> >use more space than the mtu.
> >
> >I tested the mld and igmp code with different driver parameters, mtu
> >values, number of multicast address records and even allocation
> >failures. If you think the formula is wrong, please provide a
> >counter-example with hlen, tlen, mtu and size values.
> 
> I think the code is fine albeit I think we should remove the min macro and
> just do something:
> 
> if (skb_tailroom(skb) > mtu)
>   skb->reserved_tailroom = skb_tailroom(skb) - mtu;
> 
> Does that make sense? I think it is much more readable.

That is not equivalent. It fails to take tlen into account.

For igmp, consider this case:
with hlen = 16, mtu = 9000, tlen = 8,
additionally, suppose that the first iteration of the allocation loop
(alloc_skb(9000 + 16 + 8, ...) which requires 4 pages) fails and the
second iteration (alloc_skb((9000 >> 1) + 16 + 8, ...) which requires 2
pages) succeeds:
size = (9000 >> 1) + 16 + 8 = 4524
skb_end_offset = 8192 - 320 = 7872
tailroom = 7872 - 16 = 7856

data = 9000 >> 1 = 4500
extra = 7872 - 4524 = 3348

reserved tailroom (patch version)
= 4500 + 3348 + 8 - min(9000, 4500 + 3348)
= 8
reserved tailroom (your version)
= 0

Headers are ipv4 + igmpv3 = 24 + 8 = 32, records are 8 bytes
With 978 igmpv3 records, with your version, we would output an
skb that has less tailroom (0) than dev->needed_tailroom (8).

For mld, consider this case:
with hlen = 16, mtu = 9000, tlen = 8:
size = 3776 (SKB_MAX_ORDER case)
skb_end_offset = 3776
tailroom = 3776 - 16 = 3760

data = 3776 - 16 - 8 = 3752
extra = 0

reserved tailroom (patch version)
= 3752 + 0 + 8 - min(9000, 3752 + 0)
= 8
reserved tailroom (your version)
= 0

Headers are ipv6 + icmpv6 = 48 + 8 = 56, records are 20 bytes
With 185 mld records, with your formula, we would output an skb that
has less tailroom (4) than dev->needed_tailroom (8).

If you think we should write the expression with "if" instead of "min",
instead of the current

+   skb->reserved_tailroom = skb_tailroom(skb) -
+   min_t(int, mtu, skb_tailroom(skb) - tlen);

it should be:

+   if (mtu < skb_tailroom(skb) - tlen)
+   skb->reserved_tailroom = skb_tailroom(skb) - mtu;
+   else
+   skb->reserved_tailroom = tlen;

The second alternative does not look more readable to me but I have been
looking at that expression for a while. If you think that it is more
readable, I will resend the patch expressed that way. Please let me
know.
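
For anyone who wants to replay the two cases above, the arithmetic can be
sketched as follows. This is only a model of the three expressions under
discussion, written in Python with made-up function names; the skb_end_offset,
header and record sizes are the ones quoted in this message.

```python
def current_expr(end_offset, mtu):
    """Expression before the patch: ignores hlen and tlen."""
    return end_offset - min(mtu, end_offset)


def patch_expr(end_offset, hlen, tlen, mtu):
    """Patch version, evaluated after skb_reserve(hlen)."""
    tailroom = end_offset - hlen
    return tailroom - min(mtu, tailroom - tlen)


def suggested_expr(end_offset, hlen, mtu):
    """'if (skb_tailroom > mtu)' version; reserved_tailroom stays 0 otherwise."""
    tailroom = end_offset - hlen
    return tailroom - mtu if tailroom > mtu else 0


def tailroom_left(end_offset, hlen, headers, record_size, records):
    """Tailroom remaining once headers and records have been written."""
    return end_offset - hlen - headers - record_size * records


# igmp: second allocation attempt succeeds, skb_end_offset = 8192 - 320
igmp_reserved = patch_expr(8192 - 320, hlen=16, tlen=8, mtu=9000)
igmp_left = tailroom_left(8192 - 320, 16, headers=32, record_size=8, records=978)

# mld: SKB_MAX_ORDER case, skb_end_offset = 3776
mld_reserved = patch_expr(3776, hlen=16, tlen=8, mtu=9000)
mld_left = tailroom_left(3776, 16, headers=56, record_size=20, records=185)

print(igmp_reserved, igmp_left, mld_reserved, mld_left)
```

With a reserved tailroom of 0, nothing stops the record loop from filling the
skb down to igmp_left and mld_left bytes of tailroom, both of which are below
the tlen of 8.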

Re: [PATCH] mld, igmp: Fix reserved tailroom calculation

2016-02-29 Thread Benjamin Poirier

On 2016/02/29 15:57, Daniel Borkmann wrote:
[...]
> 
> [ cutting the IPv4 part off as diff is the same ]
> 
> >diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
> >index 5ee56d0..c157edc 100644
> >--- a/net/ipv6/mcast.c
> >+++ b/net/ipv6/mcast.c
> >@@ -1574,9 +1574,9 @@ static struct sk_buff *mld_newpack(struct inet6_dev 
> >*idev, unsigned int mtu)
> > return NULL;
> >
> > skb->priority = TC_PRIO_CONTROL;
> >-skb->reserved_tailroom = skb_end_offset(skb) -
> >- min(mtu, skb_end_offset(skb));
> > skb_reserve(skb, hlen);
> >+skb->reserved_tailroom = skb_tailroom(skb) -
> >+min_t(int, mtu, skb_tailroom(skb) - tlen);
> 
> Are you sure this is correct? Wouldn't that mean (assuming we allocated
> enough space), that I could now fill a larger than MTU frame?

Quoting back a part of the log:

> >The maximum space available for ip headers and payload without
> >fragmentation is min(mtu, data + extra). Therefore,
> >reserved_tailroom
> >= data + extra + tlen - min(mtu, data + extra)
> >= skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
> >= skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)

The min() takes care of the situation you describe, i.e. if the allocated
space is large, reserved_tailroom will be large enough that we do not
use more space than the mtu.

I tested the mld and igmp code with different driver parameters, mtu
values, number of multicast address records and even allocation
failures. If you think the formula is wrong, please provide a
counter-example with hlen, tlen, mtu and size values.

Regards,
-Benjamin

[PATCH] mld, igmp: Fix reserved tailroom calculation

2016-02-27 Thread Benjamin Poirier

The current reserved_tailroom calculation fails to take hlen and tlen into
account.

skb:
[__hlen__|__data|__tlen___|__extra__]
^                                   ^
head                                skb_end_offset

In this representation, hlen + data + tlen is the size passed to alloc_skb.
"extra" is the extra space made available in __alloc_skb because of
rounding up by kmalloc. We can reorder the representation like so:

[__hlen__|__data|__extra__|__tlen___]
^                                   ^
head                                skb_end_offset

The maximum space available for ip headers and payload without
fragmentation is min(mtu, data + extra). Therefore,
reserved_tailroom
= data + extra + tlen - min(mtu, data + extra)
= skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
= skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)

Compare the second line to the current expression:
reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
and we can see that hlen and tlen are not taken into account.

Depending on hlen, tlen, mtu and the number of multicast address records,
the current code may output skbs that have less tailroom than
dev->needed_tailroom or it may output more skbs than needed because not all
space available is used.

Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 net/ipv4/igmp.c  | 4 ++--
 net/ipv6/mcast.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 05e4cba..b5d28a4 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -356,9 +356,9 @@ static struct sk_buff *igmpv3_newpack(struct net_device 
*dev, unsigned int mtu)
skb_dst_set(skb, &rt->dst);
skb->dev = dev;
 
-   skb->reserved_tailroom = skb_end_offset(skb) -
-min(mtu, skb_end_offset(skb));
skb_reserve(skb, hlen);
+   skb->reserved_tailroom = skb_tailroom(skb) -
+   min_t(int, mtu, skb_tailroom(skb) - tlen);
 
skb_reset_network_header(skb);
pip = ip_hdr(skb);
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 5ee56d0..c157edc 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1574,9 +1574,9 @@ static struct sk_buff *mld_newpack(struct inet6_dev 
*idev, unsigned int mtu)
return NULL;
 
skb->priority = TC_PRIO_CONTROL;
-   skb->reserved_tailroom = skb_end_offset(skb) -
-min(mtu, skb_end_offset(skb));
skb_reserve(skb, hlen);
+   skb->reserved_tailroom = skb_tailroom(skb) -
+   min_t(int, mtu, skb_tailroom(skb) - tlen);
 
if (__ipv6_get_lladdr(idev, &addr_buf, IFA_F_TENTATIVE)) {
/* :
-- 
2.7.0

[PATCH] ipv6: Annotate change of locking mechanism for np->opt

2016-02-17 Thread Benjamin Poirier

This follows up commit 45f6fad84cc3 ("ipv6: add complete rcu protection around
np->opt"), which added mixed rcu/refcount protection to np->opt.

Given the current implementation of rcu_pointer_handoff(), this has no
effect at runtime.

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 include/net/ipv6.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 6570f37..f3c9857 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -259,8 +259,12 @@ static inline struct ipv6_txoptions *txopt_get(const 
struct ipv6_pinfo *np)
 
rcu_read_lock();
opt = rcu_dereference(np->opt);
-   if (opt && !atomic_inc_not_zero(&opt->refcnt))
-   opt = NULL;
+   if (opt) {
+   if (!atomic_inc_not_zero(&opt->refcnt))
+   opt = NULL;
+   else
+   opt = rcu_pointer_handoff(opt);
+   }
rcu_read_unlock();
return opt;
 }
-- 
2.7.0

[PATCH v3 1/4] e1000e: Remove unreachable code

2015-11-09 Thread Benjamin Poirier

msi-x interrupts are not shared so there's no need to check if the
interrupt was really from this adapter.

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 0a854a4..a228167 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1907,12 +1907,6 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
struct e1000_hw *hw = &adapter->hw;
u32 icr = er32(ICR);
 
-   if (!(icr & E1000_ICR_INT_ASSERTED)) {
-   if (!test_bit(__E1000_DOWN, &adapter->state))
-   ew32(IMS, E1000_IMS_OTHER);
-   return IRQ_NONE;
-   }
-
if (icr & adapter->eiac_mask)
ew32(ICS, (icr & adapter->eiac_mask));
 
-- 
2.6.2

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 0/4] e1000e msi-x fixes

2015-11-09 Thread Benjamin Poirier

Hi,

For this series:


Benjamin Poirier (4):
  e1000e: Remove unreachable code
  e1000e: Do not read icr in Other interrupt
  e1000e: Do not write lsc to ics in msi-x mode
  e1000e: Fix msi-x interrupt automask

 drivers/net/ethernet/intel/e1000e/defines.h |  3 +-
 drivers/net/ethernet/intel/e1000e/netdev.c  | 66 -
 2 files changed, 30 insertions(+), 39 deletions(-)

Changes in v3:
Preserve LSC in IMS, LSC events are not delivered otherwise.
Disable CTRL_EXT.IAME to prevent IMC write on ICR read from external
program.

Changes in v2:
Address review comments from Alexander Duyck: extend cleanup of Other
interrupt handler and use tx_ring->ims_val.


The first three patches clean up handling of Other interrupts and the
last patch fixes tx and rx interrupts. Please consider reading the
description for that patch before proceeding. I believe that the
following simple tracing statements are helpful in detecting the problem
fixed by the last patch.

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index a09d1e4..29b8c6e 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1942,6 +1942,9 @@ static irqreturn_t e1000_intr_msix_rx(int __always_unused 
irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_ring *rx_ring = adapter->rx_ring;
+   struct e1000_hw *hw = &adapter->hw;
+
+   trace_printk("%s: rxq0 irq ims 0x%08x\n", netdev->name, er32(IMS));
 
/* Write the ITR value calculated at the end of the
 * previous interrupt.
@@ -1956,6 +1959,7 @@ static irqreturn_t e1000_intr_msix_rx(int __always_unused 
irq, void *data)
adapter->total_rx_bytes = 0;
adapter->total_rx_packets = 0;
__napi_schedule(&adapter->napi);
+   trace_printk("%s: scheduling napi\n", netdev->name);
}
return IRQ_HANDLED;
 }
@@ -2663,6 +2667,8 @@ static int e1000e_poll(struct napi_struct *napi, int 
weight)
struct net_device *poll_dev = adapter->netdev;
int tx_cleaned = 1, work_done = 0;
 
+   trace_printk("%s: poll starting ims 0x%08x\n", poll_dev->name,
+er32(IMS));
adapter = netdev_priv(poll_dev);
 
if (!adapter->msix_entries ||
@@ -2680,6 +2686,8 @@ static int e1000e_poll(struct napi_struct *napi, int 
weight)
e1000_set_itr(adapter);
napi_complete_done(napi, work_done);
if (!test_bit(__E1000_DOWN, &adapter->state)) {
+   trace_printk("%s: will enable rxq0 irq\n",
+poll_dev->name);
if (adapter->msix_entries)
ew32(IMS, adapter->rx_ring->ims_val);
else

 8< 

With that patch but without the patches in this series we can see that rx irqs
occur at unexpected times:

  <idle>-0 [000] .Ns.  1986.887517: e1000e_poll: eth1: will enable rxq0 irq
  <idle>-0 [000] d.h.  1986.896654: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x01500004
  <idle>-0 [000] d.h.  1986.896657: e1000_intr_msix_rx: eth1: scheduling napi
  <idle>-0 [000] d.H.  1986.896662: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x01500004
  <idle>-0 [000] ..s.  1986.896667: e1000e_poll: eth1: poll starting ims 0x01500004
Warning: many interrupts (2) before napi
  <idle>-0 [000] ..s.  1986.896685: e1000e_poll: eth1: will enable rxq0 irq

  <idle>-0 [000] d.h.  1990.688870: e1000_intr_msix_rx: eth1: scheduling napi
  <idle>-0 [000] ..s.  1990.688875: e1000e_poll: eth1: poll starting ims 0x01500004
  <idle>-0 [000] dNH.  1990.688913: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x01500004
Warning: interrupt inside napi
  <idle>-0 [000] .Ns.  1990.688916: e1000e_poll: eth1: will enable rxq0 irq
  <idle>-0 [000] d.h.  1990.729688: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x01500004

Here's a typical sequence after applying the patches in this series. Notice
that ims is changed. Another printk at the end of e1000e_poll would show it to
be 0x01500004.

  <idle>-0 [000] d.h. 23547.977917: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x01400004
  <idle>-0 [000] d.h. 23547.977922: e1000_intr_msix_rx: eth1: scheduling napi
  <idle>-0 [000] ..s. 23547.977928: e1000e_poll: eth1: poll starting ims 0x01400004
  <idle>-0 [000] ..s. 23547.977961: e1000e_poll: eth1: will enable rxq0 irq
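
To see which interrupt cause differs between the two ims values, the masks can
be decoded bit by bit. The sketch below is in Python. It assumes that the
traced values are the 32-bit words 0x01500004 and 0x01400004, and that the bit
positions match the E1000_ICR_* definitions in the driver's defines.h; the
table and function names are made up for the example.

```python
# Bit values assumed to match E1000_ICR_* in e1000e's defines.h.
IMS_BITS = {
    "LSC": 0x00000004,
    "RXQ0": 0x00100000,
    "RXQ1": 0x00200000,
    "TXQ0": 0x00400000,
    "TXQ1": 0x00800000,
    "OTHER": 0x01000000,
}


def decode_ims(value):
    """Return the names of the known bits set in value, lowest bit first."""
    return [name for name, bit in IMS_BITS.items() if value & bit]


def changed_bits(before, after):
    """Return the names of the known bits that differ between two values."""
    return decode_ims(before ^ after)


print(decode_ims(0x01500004))
print(decode_ims(0x01400004))
print(changed_bits(0x01500004, 0x01400004))
```

Under those assumptions the only difference is RXQ0, i.e. the rx queue
interrupt stays masked from the irq until e1000e_poll re-enables it.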

Finally, here's the script I used to generate the warnings above:

#!/usr/bin/python3

import sys
import re
import pprint


class NaE(Exception):
    "Not an Event"
    pass

class Event:
    def __init__(self, line):
        # sample events:
        #  -0 [000] d.h.  2025.256536: e1000_intr_
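
The listing above is incomplete. As a rough stand-in, and explicitly not the
original script, the following sketch shows one way the two warnings could be
derived from the trace_printk messages. Classifying events by substring, the
function name and the exact warning conditions are assumptions based on the
traces shown earlier.

```python
def check_trace(lines):
    """Return warnings for rx irqs that fire at unexpected times."""
    warnings = []
    irqs_pending = 0  # rx irqs seen since the last poll started
    in_poll = False   # between "poll starting" and "will enable rxq0 irq"
    for line in lines:
        if "rxq0 irq ims" in line:
            if in_poll:
                warnings.append("interrupt inside napi")
            else:
                irqs_pending += 1
        elif "poll starting" in line:
            if irqs_pending > 1:
                warnings.append("many interrupts (%d) before napi" % irqs_pending)
            irqs_pending = 0
            in_poll = True
        elif "will enable rxq0 irq" in line:
            in_poll = False
    return warnings
```

Each input line is expected to be one complete trace record, so records that
were wrapped by the mail client have to be joined first.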

[PATCH v3 2/4] e1000e: Do not read icr in Other interrupt

2015-11-09 Thread Benjamin Poirier

Remove the icr read in the Other interrupt handler and use eiac to
autoclear the Other bit from icr and ims. This allows us to avoid
interference with rx and tx interrupts in the Other interrupt handler.

The information read from icr is not needed. IMS is configured such that
the only interrupt cause that can trigger the Other interrupt is Link
Status Change.

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>

---
I noticed an 8-16% improvement in netperf rr tests after applying this
patch. This is a little surprising since this patch touches the handling
of Other interrupts, which do not occur during such a test. Some
profiling was not very insightful but the improvement seems related to
writing Other to EIAC.
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index a228167..a73e323 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1905,24 +1905,15 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
-   u32 icr = er32(ICR);
 
-   if (icr & adapter->eiac_mask)
-   ew32(ICS, (icr & adapter->eiac_mask));
+   hw->mac.get_link_status = true;
 
-   if (icr & E1000_ICR_OTHER) {
-   if (!(icr & E1000_ICR_LSC))
-   goto no_link_interrupt;
-   hw->mac.get_link_status = true;
-   /* guard against interrupt when we're going down */
-   if (!test_bit(__E1000_DOWN, &adapter->state))
-   mod_timer(&adapter->watchdog_timer, jiffies + 1);
+   /* guard against interrupt when we're going down */
+   if (!test_bit(__E1000_DOWN, &adapter->state)) {
+   mod_timer(&adapter->watchdog_timer, jiffies + 1);
+   ew32(IMS, E1000_IMS_OTHER);
}
 
-no_link_interrupt:
-   if (!test_bit(__E1000_DOWN, &adapter->state))
-   ew32(IMS, E1000_IMS_LSC | E1000_IMS_OTHER);
-
return IRQ_HANDLED;
 }
 
@@ -2019,6 +2010,7 @@ static void e1000_configure_msix(struct e1000_adapter 
*adapter)
   hw->hw_addr + E1000_EITR_82574(vector));
else
writel(1, hw->hw_addr + E1000_EITR_82574(vector));
+   adapter->eiac_mask |= E1000_IMS_OTHER;
 
/* Cause Tx interrupts on every write back */
ivar |= (1 << 31);
@@ -2247,7 +2239,7 @@ static void e1000_irq_enable(struct e1000_adapter 
*adapter)
 
if (adapter->msix_entries) {
ew32(EIAC_82574, adapter->eiac_mask & E1000_EIAC_MASK_82574);
-   ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER | E1000_IMS_LSC);
+   ew32(IMS, adapter->eiac_mask | E1000_IMS_LSC);
} else if ((hw->mac.type == e1000_pch_lpt) ||
   (hw->mac.type == e1000_pch_spt)) {
ew32(IMS, IMS_ENABLE_MASK | E1000_IMS_ECCER);
-- 
2.6.2

[PATCH v3 4/4] e1000e: Fix msi-x interrupt automask

2015-11-09 Thread Benjamin Poirier

Since the introduction of 82574 support in e1000e, the driver has worked
on the assumption that msi-x interrupt generation is automatically
disabled after each irq. As it turns out, this is not the case.
Currently, rx interrupts can fire multiple times before and during napi
processing. This can be a problem for users because frames that arrive
in a certain window (after adapter->clean_rx() but before
napi_complete_done() has cleared NAPI_STATE_SCHED) generate an interrupt
which does not lead to napi_schedule(). These frames sit in the rx queue
until another frame arrives (a tcp retransmit for example).

While the EIAC and CTRL_EXT registers are properly configured for irq
automask, the modification of IAM in e1000_configure_msix() is what
prevents automask from working as intended.

This patch removes that erroneous write and fixes interrupt rearming for
tx interrupts. It also clears IAME from CTRL_EXT. This is not strictly
necessary for operation of the driver, but it avoids disruption from
programs that access the registers directly, like `ethregs -c`.

Reported-by: Frank Steiner <steiner-...@bio.ifi.lmu.de>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index ed7cc8e..2a22ed7 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1931,6 +1931,9 @@ static irqreturn_t e1000_intr_msix_tx(int __always_unused 
irq, void *data)
/* Ring was not completely cleaned, so fire another interrupt */
ew32(ICS, tx_ring->ims_val);
 
+   if (!test_bit(__E1000_DOWN, &adapter->state))
+   ew32(IMS, adapter->tx_ring->ims_val);
+
return IRQ_HANDLED;
 }
 
@@ -2018,12 +2021,8 @@ static void e1000_configure_msix(struct e1000_adapter 
*adapter)
ew32(IVAR, ivar);
 
/* enable MSI-X PBA support */
-   ctrl_ext = er32(CTRL_EXT);
-   ctrl_ext |= E1000_CTRL_EXT_PBA_CLR;
-
-   /* Auto-Mask Other interrupts upon ICR read */
-   ew32(IAM, ~E1000_EIAC_MASK_82574 | E1000_IMS_OTHER);
-   ctrl_ext |= E1000_CTRL_EXT_EIAME;
+   ctrl_ext = er32(CTRL_EXT) & ~E1000_CTRL_EXT_IAME;
+   ctrl_ext |= E1000_CTRL_EXT_PBA_CLR | E1000_CTRL_EXT_EIAME;
ew32(CTRL_EXT, ctrl_ext);
e1e_flush();
 }
-- 
2.6.2

[PATCH v3 3/4] e1000e: Do not write lsc to ics in msi-x mode

2015-11-09 Thread Benjamin Poirier

In msi-x mode, there is no handler for the lsc interrupt so there is no
point in writing that to ics now that we always assume Other interrupts
are caused by lsc.

Reviewed-by: Jasna Hodzic <jhod...@ucdavis.edu>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/defines.h |  3 ++-
 drivers/net/ethernet/intel/e1000e/netdev.c  | 27 ++++++++++++++++-----------
 2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
index 133d407..f7c7804 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -441,12 +441,13 @@
 #define E1000_IMS_RXQ1  E1000_ICR_RXQ1  /* Rx Queue 1 Interrupt */
 #define E1000_IMS_TXQ0  E1000_ICR_TXQ0  /* Tx Queue 0 Interrupt */
 #define E1000_IMS_TXQ1  E1000_ICR_TXQ1  /* Tx Queue 1 Interrupt */
-#define E1000_IMS_OTHER E1000_ICR_OTHER /* Other Interrupts */
+#define E1000_IMS_OTHER E1000_ICR_OTHER /* Other Interrupt */
 
 /* Interrupt Cause Set */
 #define E1000_ICS_LSC   E1000_ICR_LSC   /* Link Status Change */
 #define E1000_ICS_RXSEQ E1000_ICR_RXSEQ /* Rx sequence error */
 #define E1000_ICS_RXDMT0 E1000_ICR_RXDMT0 /* Rx desc min. threshold */
+#define E1000_ICS_OTHER E1000_ICR_OTHER /* Other Interrupt */
 
 /* Transmit Descriptor Control */
 #define E1000_TXDCTL_PTHRESH 0x003F /* TXDCTL Prefetch Threshold */
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index a73e323..ed7cc8e 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -4130,10 +4130,23 @@ void e1000e_reset(struct e1000_adapter *adapter)
 
 }
 
-int e1000e_up(struct e1000_adapter *adapter)
+/**
+ * e1000e_trigger_lsc - trigger an lsc interrupt
+ *
+ * Fire a link status change interrupt to start the watchdog.
+ **/
+static void e1000e_trigger_lsc(struct e1000_adapter *adapter)
 {
struct e1000_hw *hw = &adapter->hw;
 
+   if (adapter->msix_entries)
+   ew32(ICS, E1000_ICS_OTHER);
+   else
+   ew32(ICS, E1000_ICS_LSC);
+}
+
+int e1000e_up(struct e1000_adapter *adapter)
+{
/* hardware has been reset, we need to reload some things */
e1000_configure(adapter);
 
@@ -4145,11 +4158,7 @@ int e1000e_up(struct e1000_adapter *adapter)
 
netif_start_queue(adapter->netdev);
 
-   /* fire a link change interrupt to start the watchdog */
-   if (adapter->msix_entries)
-   ew32(ICS, E1000_ICS_LSC | E1000_ICR_OTHER);
-   else
-   ew32(ICS, E1000_ICS_LSC);
+   e1000e_trigger_lsc(adapter);
 
return 0;
 }
@@ -4576,11 +4585,7 @@ static int e1000_open(struct net_device *netdev)
hw->mac.get_link_status = true;
pm_runtime_put(&pdev->dev);
 
-   /* fire a link status change interrupt to start the watchdog */
-   if (adapter->msix_entries)
-   ew32(ICS, E1000_ICS_LSC | E1000_ICR_OTHER);
-   else
-   ew32(ICS, E1000_ICS_LSC);
+   e1000e_trigger_lsc(adapter);
 
return 0;
 
-- 
2.6.2

Re: [PATCH v2 2/4] e1000e: Do not read icr in Other interrupt

2015-11-04 Thread Benjamin Poirier

On 2015/10/30 12:19, Alexander Duyck wrote:
> On 10/30/2015 10:31 AM, Benjamin Poirier wrote:
> >Using eiac instead of reading icr allows us to avoid interference with
> >rx and tx interrupts in the Other interrupt handler.
> >
> >According to the 82574 datasheet section 10.2.4.1, interrupt causes that
> >trigger the Other interrupt are
> >1) Link Status Change.
> >2) Receiver Overrun.
> >3) MDIO Access Complete.
> >4) Small Receive Packet Detected.
> >5) Receive ACK Frame Detected.
> >6) Manageability Event Detected.
> >
> >Causes 3, 4, 5 are related to features which are not enabled by the
> >driver. Always assume that cause 1 is what triggered the Other interrupt
> >and set get_link_status. Cause 2 and 6 should be rare enough that the
> >extra cost of needlessly re-reading the link status is negligible.
> >
> >Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
> 
> You might want to instead use a write of LSC to the ICR instead of just
> using auto-clear and not enabling LSC.  My concern is that you might no
> longer be getting link status change events at all.  An easy test is to just
> unplug/plug the cable a few times, or run "ethtool -r" on the link partner
> if connected back to back.  You should see messages appear in the dmesg log
> indicating that the link state changed.
> 
> In addition you should probably clear the IAME bit in the CTRL_EXT register
> so that you don't risk masking the interrupts on the ICR read or write.

Thanks, your concern about not getting LSC events was right. After more
experimentation I noticed that in order for the Other interrupt to be
raised for each of these six conditions, the IMS bit for that condition
must also be set. I've restored setting LSC in IMS. OTOH, I don't see a
need to clear LSC from ICR. Even without an ICR read or write-to-clear
to clear the LSC bit, Other interrupts are raised to signal LSC events.
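
To make the observation above concrete, here is a toy model of it in
Python. It is only a sketch of the behaviour described in this mail, not
of the hardware logic. The function name other_irq_raised is made up for
illustration, the bit values are the ones I recall from the driver's
defines.h and should be treated as assumptions, and requiring the OTHER
bit itself in IMS is my reading rather than something stated above.

```python
# Bit values as recalled from e1000e defines.h (assumption, verify before reuse).
ICR_LSC = 0x00000004    # Link Status Change
ICR_RXO = 0x00000040    # Receiver Overrun
ICR_OTHER = 0x01000000  # Other Interrupt


def other_irq_raised(cause, ims):
    """Return True if `cause` is expected to raise the Other vector.

    Models the observation that the cause's own IMS bit must be set.
    Also assumes (not stated in the thread) that OTHER must be unmasked.
    """
    return bool(cause & ims) and bool(ims & ICR_OTHER)
```

Under these assumptions, other_irq_raised(ICR_LSC, ICR_OTHER) is False,
which would match the lost link events, while adding ICR_LSC to the mask
makes it True.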

I'll wait for net-next to reopen and send v3.

[PATCH v2 4/4] e1000e: Fix msi-x interrupt automask

2015-10-30 Thread Benjamin Poirier

Since the introduction of 82574 support in e1000e, the driver has worked
on the assumption that msi-x interrupt generation is automatically
disabled after each irq. As it turns out, this is not the case.
Currently, rx interrupts can fire multiple times before and during napi
processing. This can be a problem for users because frames that arrive
in a certain window (after adapter->clean_rx() but before
napi_complete_done() has cleared NAPI_STATE_SCHED) generate an interrupt
which does not lead to napi_schedule(). These frames sit in the rx queue
until another frame arrives (a tcp retransmit for example).

While the EIAC and CTRL_EXT registers are properly configured for irq
automask, the modification of IAM in e1000_configure_msix() is what
prevents automask from working as intended.

This patch removes that erroneous write and fixes interrupt rearming for
tx interrupts.
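
The window described above can be illustrated with a small Python model.
This is a simplified sketch, not driver code: the class RxModel and all of
its fields are hypothetical, and it assumes that a cause which arrives
while the interrupt is masked stays pending and fires once the interrupt
is rearmed.

```python
class RxModel:
    """Toy model of one rx queue, its interrupt and its NAPI state."""

    def __init__(self, automask):
        self.automask = automask    # True: the irq masks itself when it fires
        self.irq_enabled = True     # stands in for the rxq0 bit in IMS
        self.cause_pending = False  # stands in for an unserviced rx cause
        self.napi_sched = False     # stands in for NAPI_STATE_SCHED
        self.rx_queue = []
        self.stranded_irqs = 0      # interrupts that scheduled nothing

    def frame_arrives(self, frame):
        self.rx_queue.append(frame)
        self.cause_pending = True
        self._maybe_interrupt()

    def _maybe_interrupt(self):
        if not (self.irq_enabled and self.cause_pending):
            return
        self.cause_pending = False
        if self.automask:
            self.irq_enabled = False
        if self.napi_sched:
            self.stranded_irqs += 1  # napi already scheduled: irq has no effect
        else:
            self.napi_sched = True

    def poll(self, frame_in_window=None):
        delivered, self.rx_queue = self.rx_queue, []  # clean_rx()
        if frame_in_window is not None:
            self.frame_arrives(frame_in_window)       # the race window
        self.napi_sched = False                       # napi_complete_done()
        self.irq_enabled = True                       # rearm through IMS
        self._maybe_interrupt()
        return delivered
```

With automask=False, a frame passed as frame_in_window is left in rx_queue
with napi_sched cleared, so nothing will pick it up until the next frame
arrives. With automask=True, the same frame causes napi to be scheduled
again as soon as the interrupt is rearmed.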

Reported-by: Frank Steiner <steiner-...@bio.ifi.lmu.de>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 639fbe8..b5549d1 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1932,6 +1932,9 @@ static irqreturn_t e1000_intr_msix_tx(int __always_unused irq, void *data)
/* Ring was not completely cleaned, so fire another interrupt */
ew32(ICS, tx_ring->ims_val);
 
+   if (!test_bit(__E1000_DOWN, &adapter->state))
+   ew32(IMS, adapter->tx_ring->ims_val);
+
return IRQ_HANDLED;
 }
 
@@ -2020,11 +2023,7 @@ static void e1000_configure_msix(struct e1000_adapter *adapter)
 
/* enable MSI-X PBA support */
ctrl_ext = er32(CTRL_EXT);
-   ctrl_ext |= E1000_CTRL_EXT_PBA_CLR;
-
-   /* Auto-Mask Other interrupts upon ICR read */
-   ew32(IAM, ~E1000_EIAC_MASK_82574 | E1000_IMS_OTHER);
-   ctrl_ext |= E1000_CTRL_EXT_EIAME;
+   ctrl_ext |= E1000_CTRL_EXT_PBA_CLR | E1000_CTRL_EXT_EIAME;
ew32(CTRL_EXT, ctrl_ext);
e1e_flush();
 }
-- 
2.6.0

[PATCH v2 3/4] e1000e: Do not write lsc to ics in msi-x mode

2015-10-30 Thread Benjamin Poirier

In msi-x mode, there is no handler for the lsc interrupt so there is no
point in writing that to ics now that we always assume Other interrupts
are caused by lsc.

Reviewed-by: Jasna Hodzic <jhod...@ucdavis.edu>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/defines.h |  3 ++-
 drivers/net/ethernet/intel/e1000e/netdev.c  | 27 ++++++++++++++++-----------
 2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
index 133d407..f7c7804 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -441,12 +441,13 @@
 #define E1000_IMS_RXQ1  E1000_ICR_RXQ1  /* Rx Queue 1 Interrupt */
 #define E1000_IMS_TXQ0  E1000_ICR_TXQ0  /* Tx Queue 0 Interrupt */
 #define E1000_IMS_TXQ1  E1000_ICR_TXQ1  /* Tx Queue 1 Interrupt */
-#define E1000_IMS_OTHER E1000_ICR_OTHER /* Other Interrupts */
+#define E1000_IMS_OTHER E1000_ICR_OTHER /* Other Interrupt */
 
 /* Interrupt Cause Set */
 #define E1000_ICS_LSC   E1000_ICR_LSC   /* Link Status Change */
 #define E1000_ICS_RXSEQ E1000_ICR_RXSEQ /* Rx sequence error */
 #define E1000_ICS_RXDMT0 E1000_ICR_RXDMT0 /* Rx desc min. threshold */
+#define E1000_ICS_OTHER E1000_ICR_OTHER /* Other Interrupt */
 
 /* Transmit Descriptor Control */
 #define E1000_TXDCTL_PTHRESH 0x003F /* TXDCTL Prefetch Threshold */
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 602fcc9..639fbe8 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -4131,10 +4131,23 @@ void e1000e_reset(struct e1000_adapter *adapter)
 
 }
 
-int e1000e_up(struct e1000_adapter *adapter)
+/**
+ * e1000e_trigger_lsc - trigger a lsc interrupt
+ *
+ * Fire a link status change interrupt to start the watchdog.
+ **/
+static void e1000e_trigger_lsc(struct e1000_adapter *adapter)
 {
struct e1000_hw *hw = &adapter->hw;
 
+   if (adapter->msix_entries)
+   ew32(ICS, E1000_ICS_OTHER);
+   else
+   ew32(ICS, E1000_ICS_LSC);
+}
+
+int e1000e_up(struct e1000_adapter *adapter)
+{
/* hardware has been reset, we need to reload some things */
e1000_configure(adapter);
 
@@ -4146,11 +4159,7 @@ int e1000e_up(struct e1000_adapter *adapter)
 
netif_start_queue(adapter->netdev);
 
-   /* fire a link change interrupt to start the watchdog */
-   if (adapter->msix_entries)
-   ew32(ICS, E1000_ICS_LSC | E1000_ICR_OTHER);
-   else
-   ew32(ICS, E1000_ICS_LSC);
+   e1000e_trigger_lsc(adapter);
 
return 0;
 }
@@ -4577,11 +4586,7 @@ static int e1000_open(struct net_device *netdev)
hw->mac.get_link_status = true;
pm_runtime_put(&pdev->dev);
 
-   /* fire a link status change interrupt to start the watchdog */
-   if (adapter->msix_entries)
-   ew32(ICS, E1000_ICS_LSC | E1000_ICR_OTHER);
-   else
-   ew32(ICS, E1000_ICS_LSC);
+   e1000e_trigger_lsc(adapter);
 
return 0;
 
-- 
2.6.0

[PATCH v2 1/4] e1000e: Remove unreachable code

2015-10-30 Thread Benjamin Poirier

msi-x interrupts are not shared so there's no need to check if the
interrupt was really from this adapter.

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 0a854a4..a228167 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1907,12 +1907,6 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
struct e1000_hw *hw = &adapter->hw;
u32 icr = er32(ICR);
 
-   if (!(icr & E1000_ICR_INT_ASSERTED)) {
-   if (!test_bit(__E1000_DOWN, &adapter->state))
-   ew32(IMS, E1000_IMS_OTHER);
-   return IRQ_NONE;
-   }
-
if (icr & adapter->eiac_mask)
ew32(ICS, (icr & adapter->eiac_mask));
 
-- 
2.6.0

[PATCH v2 0/4] e1000e msi-x fixes

2015-10-30 Thread Benjamin Poirier

Hi,

For this series:


Benjamin Poirier (4):
  e1000e: Remove unreachable code
  e1000e: Do not read icr in Other interrupt
  e1000e: Do not write lsc to ics in msi-x mode
  e1000e: Fix msi-x interrupt automask

 drivers/net/ethernet/intel/e1000e/defines.h |  3 +-
 drivers/net/ethernet/intel/e1000e/netdev.c  | 65 +
 2 files changed, 30 insertions(+), 38 deletions(-)

Changes in v2:
Address review comments from Alexander Duyck: extend cleanup of Other
interrupt handler and use tx_ring->ims_val.


The first three patches cleanup handling of Other interrupts and the
last patch fixes tx and rx interrupts. Please consider reading the
description for that patch before proceeding. I believe that the
following simple tracing statements are helpful in detecting the problem
fixed by the last patch.

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 8881256..707a525 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1952,6 +1952,9 @@ static irqreturn_t e1000_intr_msix_rx(int __always_unused irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_ring *rx_ring = adapter->rx_ring;
+   struct e1000_hw *hw = &adapter->hw;
+
+   trace_printk("%s: rxq0 irq ims 0x%08x\n", netdev->name, er32(IMS));
 
/* Write the ITR value calculated at the end of the
 * previous interrupt.
@@ -1966,6 +1969,7 @@ static irqreturn_t e1000_intr_msix_rx(int __always_unused irq, void *data)
adapter->total_rx_bytes = 0;
adapter->total_rx_packets = 0;
__napi_schedule(&adapter->napi);
+   trace_printk("%s: scheduling napi\n", netdev->name);
}
return IRQ_HANDLED;
 }
@@ -2672,6 +2676,8 @@ static int e1000e_poll(struct napi_struct *napi, int weight)
struct net_device *poll_dev = adapter->netdev;
int tx_cleaned = 1, work_done = 0;
 
+   trace_printk("%s: poll starting ims 0x%08x\n", poll_dev->name,
+er32(IMS));
adapter = netdev_priv(poll_dev);
 
if (!adapter->msix_entries ||
@@ -2689,6 +2695,8 @@ static int e1000e_poll(struct napi_struct *napi, int weight)
e1000_set_itr(adapter);
napi_complete_done(napi, work_done);
if (!test_bit(__E1000_DOWN, &adapter->state)) {
+   trace_printk("%s: will enable rxq0 irq\n",
+poll_dev->name);
if (adapter->msix_entries)
ew32(IMS, adapter->rx_ring->ims_val);
else

 8< 

With that patch but without the patches in this series we can see that rx irqs
occur at unexpected times:

  <idle>-0 [000] .Ns.  1986.887517: e1000e_poll: eth1: will enable rxq0 irq
  <idle>-0 [000] d.h.  1986.896654: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0154
  <idle>-0 [000] d.h.  1986.896657: e1000_intr_msix_rx: eth1: scheduling napi
  <idle>-0 [000] d.H.  1986.896662: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0154
  <idle>-0 [000] ..s.  1986.896667: e1000e_poll: eth1: poll starting ims 0x0154
Warning: many interrupts (2) before napi
  <idle>-0 [000] ..s.  1986.896685: e1000e_poll: eth1: will enable rxq0 irq

  <idle>-0 [000] d.h.  1990.688870: e1000_intr_msix_rx: eth1: scheduling napi
  <idle>-0 [000] ..s.  1990.688875: e1000e_poll: eth1: poll starting ims 0x0154
  <idle>-0 [000] dNH.  1990.688913: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0154
Warning: interrupt inside napi
  <idle>-0 [000] .Ns.  1990.688916: e1000e_poll: eth1: will enable rxq0 irq
  <idle>-0 [000] d.h.  1990.729688: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0154

Here's a typical sequence after applying the patches in this series. Notice
that ims is changed. Another printk at the end of e1000e_poll would show it to
be 0x0150.

  <idle>-0 [000] d.h. 672874.016104: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0140
  <idle>-0 [000] d.h. 672874.016107: e1000_intr_msix_rx: eth1: scheduling napi
  <idle>-0 [000] ..s. 672874.016112: e1000e_poll: eth1: poll starting ims 0x0140
  <idle>-0 [000] ..s. 672874.016126: e1000e_poll: eth1: will enable rxq0 irq

Finally, here's the script I used to generate the warnings above:

#!/usr/bin/python3

import sys
import re
import pprint


class NaE(Exception):
    "Not an Event"
    pass

class Event:
    def __init__(self, line):
        # sample events:
        #  <idle>-0 [000] d.h.  2025.256536: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0154
        #  <idle>-0 [000] d.h.  2025.256539: e1000_intr_msix_rx: eth1: scheduling napi
        #  <idle>-0

[PATCH v2 2/4] e1000e: Do not read icr in Other interrupt

2015-10-30 Thread Benjamin Poirier

Using eiac instead of reading icr allows us to avoid interference with
rx and tx interrupts in the Other interrupt handler.

According to the 82574 datasheet section 10.2.4.1, interrupt causes that
trigger the Other interrupt are
1) Link Status Change.
2) Receiver Overrun.
3) MDIO Access Complete.
4) Small Receive Packet Detected.
5) Receive ACK Frame Detected.
6) Manageability Event Detected.

Causes 3, 4, 5 are related to features which are not enabled by the
driver. Always assume that cause 1 is what triggered the Other interrupt
and set get_link_status. Cause 2 and 6 should be rare enough that the
extra cost of needlessly re-reading the link status is negligible.

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index a228167..602fcc9 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1905,24 +1905,16 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
-   u32 icr = er32(ICR);
 
-   if (icr & adapter->eiac_mask)
-   ew32(ICS, (icr & adapter->eiac_mask));
+   /* Assume that the Other interrupt was triggered by LSC */
+   hw->mac.get_link_status = true;
 
-   if (icr & E1000_ICR_OTHER) {
-   if (!(icr & E1000_ICR_LSC))
-   goto no_link_interrupt;
-   hw->mac.get_link_status = true;
-   /* guard against interrupt when we're going down */
-   if (!test_bit(__E1000_DOWN, &adapter->state))
-   mod_timer(&adapter->watchdog_timer, jiffies + 1);
+   /* guard against interrupt when we're going down */
+   if (!test_bit(__E1000_DOWN, &adapter->state)) {
+   mod_timer(&adapter->watchdog_timer, jiffies + 1);
+   ew32(IMS, E1000_IMS_OTHER);
}
 
-no_link_interrupt:
-   if (!test_bit(__E1000_DOWN, &adapter->state))
-   ew32(IMS, E1000_IMS_LSC | E1000_IMS_OTHER);
-
return IRQ_HANDLED;
 }
 
@@ -2019,6 +2011,7 @@ static void e1000_configure_msix(struct e1000_adapter *adapter)
   hw->hw_addr + E1000_EITR_82574(vector));
else
writel(1, hw->hw_addr + E1000_EITR_82574(vector));
+   adapter->eiac_mask |= E1000_IMS_OTHER;
 
/* Cause Tx interrupts on every write back */
ivar |= (1 << 31);
@@ -2247,7 +2240,7 @@ static void e1000_irq_enable(struct e1000_adapter *adapter)
 
if (adapter->msix_entries) {
ew32(EIAC_82574, adapter->eiac_mask & E1000_EIAC_MASK_82574);
-   ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER | E1000_IMS_LSC);
+   ew32(IMS, adapter->eiac_mask);
} else if ((hw->mac.type == e1000_pch_lpt) ||
   (hw->mac.type == e1000_pch_spt)) {
ew32(IMS, IMS_ENABLE_MASK | E1000_IMS_ECCER);
-- 
2.6.0

[PATCH 2/2] e1000e: Fix msi-x interrupt automask

2015-10-22 Thread Benjamin Poirier

Since the introduction of 82574 support in e1000e, the driver has worked on
the assumption that msi-x interrupt generation is automatically disabled
after each irq. As it turns out, this is not the case. Currently, rx
interrupts can fire multiple times before and during napi processing. This
can be a problem for users because frames that arrive in a certain window
(after adapter->clean_rx() but before napi_complete_done() has cleared
NAPI_STATE_SCHED) generate an interrupt which does not lead to
napi_schedule(). These frames sit in the rx queue until another frame
arrives (a tcp retransmit for example).

While the EIAC and CTRL_EXT registers are properly configured for irq
automask, the modification of IAM in e1000_configure_msix() is what
prevents automask from working as intended.

This patch removes that erroneous write and fixes interrupt rearming for tx
and "other" interrupts. Since e1000_msix_other() reads ICR, all interrupts
must be rearmed in that function.

Reported-by: Frank Steiner <steiner-...@bio.ifi.lmu.de>
Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index a228167..8881256 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1921,7 +1921,8 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
 
 no_link_interrupt:
if (!test_bit(__E1000_DOWN, &adapter->state))
-   ew32(IMS, E1000_IMS_LSC | E1000_IMS_OTHER);
+   ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER |
+E1000_IMS_LSC);
 
return IRQ_HANDLED;
 }
@@ -1940,6 +1941,9 @@ static irqreturn_t e1000_intr_msix_tx(int __always_unused irq, void *data)
/* Ring was not completely cleaned, so fire another interrupt */
ew32(ICS, tx_ring->ims_val);
 
+   if (!test_bit(__E1000_DOWN, &adapter->state))
+   ew32(IMS, E1000_IMS_TXQ0);
+
return IRQ_HANDLED;
 }
 
@@ -2027,11 +2031,7 @@ static void e1000_configure_msix(struct e1000_adapter *adapter)
 
/* enable MSI-X PBA support */
ctrl_ext = er32(CTRL_EXT);
-   ctrl_ext |= E1000_CTRL_EXT_PBA_CLR;
-
-   /* Auto-Mask Other interrupts upon ICR read */
-   ew32(IAM, ~E1000_EIAC_MASK_82574 | E1000_IMS_OTHER);
-   ctrl_ext |= E1000_CTRL_EXT_EIAME;
+   ctrl_ext |= E1000_CTRL_EXT_PBA_CLR | E1000_CTRL_EXT_EIAME;
ew32(CTRL_EXT, ctrl_ext);
e1e_flush();
 }
-- 
2.5.0

[PATCH 1/2] e1000e: remove unreachable code

2015-10-22 Thread Benjamin Poirier

msi-x interrupts are not shared so there's no need to check if the
interrupt was really from this adapter.

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 0a854a4..a228167 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1907,12 +1907,6 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
struct e1000_hw *hw = &adapter->hw;
u32 icr = er32(ICR);
 
-   if (!(icr & E1000_ICR_INT_ASSERTED)) {
-   if (!test_bit(__E1000_DOWN, &adapter->state))
-   ew32(IMS, E1000_IMS_OTHER);
-   return IRQ_NONE;
-   }
-
if (icr & adapter->eiac_mask)
ew32(ICS, (icr & adapter->eiac_mask));
 
-- 
2.5.0

[PATCH 0/2] e1000e msi-x fixes

2015-10-22 Thread Benjamin Poirier

Hi,

For this series:


Benjamin Poirier (2):
  e1000e: remove unreachable code
  e1000e: Fix msi-x interrupt automask

 drivers/net/ethernet/intel/e1000e/netdev.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)


The first patch is a cleanup but the second one is the real deal. Please
consider reading the description for that patch before proceeding. I
believe that the following simple tracing statements are helpful in
detecting the problem fixed by the second patch.

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 8881256..707a525 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1952,6 +1952,9 @@ static irqreturn_t e1000_intr_msix_rx(int __always_unused irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_ring *rx_ring = adapter->rx_ring;
+   struct e1000_hw *hw = &adapter->hw;
+
+   trace_printk("%s: rxq0 irq ims 0x%08x\n", netdev->name, er32(IMS));
 
/* Write the ITR value calculated at the end of the
 * previous interrupt.
@@ -1966,6 +1969,7 @@ static irqreturn_t e1000_intr_msix_rx(int __always_unused irq, void *data)
adapter->total_rx_bytes = 0;
adapter->total_rx_packets = 0;
__napi_schedule(&adapter->napi);
+   trace_printk("%s: scheduling napi\n", netdev->name);
}
return IRQ_HANDLED;
 }
@@ -2672,6 +2676,8 @@ static int e1000e_poll(struct napi_struct *napi, int weight)
struct net_device *poll_dev = adapter->netdev;
int tx_cleaned = 1, work_done = 0;
 
+   trace_printk("%s: poll starting ims 0x%08x\n", poll_dev->name,
+er32(IMS));
adapter = netdev_priv(poll_dev);
 
if (!adapter->msix_entries ||
@@ -2689,6 +2695,8 @@ static int e1000e_poll(struct napi_struct *napi, int weight)
e1000_set_itr(adapter);
napi_complete_done(napi, work_done);
if (!test_bit(__E1000_DOWN, &adapter->state)) {
+   trace_printk("%s: will enable rxq0 irq\n",
+poll_dev->name);
if (adapter->msix_entries)
ew32(IMS, adapter->rx_ring->ims_val);
else

 8< 

With that patch but without the patches in this series we can see that rx irqs
occur at unexpected times:

  <idle>-0 [000] .Ns.  1986.887517: e1000e_poll: eth1: will enable rxq0 irq
  <idle>-0 [000] d.h.  1986.896654: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0154
  <idle>-0 [000] d.h.  1986.896657: e1000_intr_msix_rx: eth1: scheduling napi
  <idle>-0 [000] d.H.  1986.896662: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0154
  <idle>-0 [000] ..s.  1986.896667: e1000e_poll: eth1: poll starting ims 0x0154
Warning: many interrupts (2) before napi
  <idle>-0 [000] ..s.  1986.896685: e1000e_poll: eth1: will enable rxq0 irq

  <idle>-0 [000] d.h.  1990.688870: e1000_intr_msix_rx: eth1: scheduling napi
  <idle>-0 [000] ..s.  1990.688875: e1000e_poll: eth1: poll starting ims 0x0154
  <idle>-0 [000] dNH.  1990.688913: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0154
Warning: interrupt inside napi
  <idle>-0 [000] .Ns.  1990.688916: e1000e_poll: eth1: will enable rxq0 irq
  <idle>-0 [000] d.h.  1990.729688: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0154

Here's a typical sequence after applying the patches in this series. Notice
that ims is changed. Another printk at the end of e1000e_poll would show it to
be 0x0154.

  <idle>-0 [000] d.h.  3896.134376: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0144
  <idle>-0 [000] d.h.  3896.134379: e1000_intr_msix_rx: eth1: scheduling napi
  <idle>-0 [000] ..s.  3896.134384: e1000e_poll: eth1: poll starting ims 0x0144
  <idle>-0 [000] ..s.  3896.134398: e1000e_poll: eth1: will enable rxq0 irq

Finally, here's the script I used to generate the warnings above:

#!/usr/bin/python3

import sys
import re
import pprint


class NaE(Exception):
    "Not an Event"
    pass

class Event:
    def __init__(self, line):
        # sample events:
        #  <idle>-0 [000] d.h.  2025.256536: e1000_intr_msix_rx: eth1: rxq0 irq ims 0x0154
        #  <idle>-0 [000] d.h.  2025.256539: e1000_intr_msix_rx: eth1: scheduling napi
        #  <idle>-0 [000] ..s.  2025.256544: e1000e_poll: eth1: poll starting ims 0x0154
        #  <idle>-0 [000] ..s.  2025.256558: e1000e_poll: eth1: will enable rxq0 irq
        retval = re.match(" +.*)>?-(?P[0-9]+) +\[(?P.*)\] (?P[^ ]+) +(?P[0-9.]+): (?P[^:]+): (?P[^:]+): (?P.*)", line)
        if retval:
            self.
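
The script above is cut off in this archive copy and its regular
expression has lost its group names. As a stand-in, here is a minimal
sketch of the same kind of check. It is not the original script: the
names EVENT_RE and check_trace and all group names are made up, and it
assumes one unwrapped event per line and a single device and rx queue in
the trace.

```python
import re

# One event per line; the task name before "-<pid>" may be empty or "<idle>".
EVENT_RE = re.compile(
    r"^\s*(?P<comm>\S*)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+(?P<flags>\S+)\s+"
    r"(?P<time>[0-9.]+):\s+(?P<func>[^:]+):\s+(?P<dev>[^:]+):\s+(?P<msg>.*)$"
)


def check_trace(lines):
    """Return (timestamp, warning) pairs for rx irqs at unexpected times."""
    warnings = []
    irqs_before_poll = 0  # rx irqs seen since the last poll ended
    in_poll = False       # between "poll starting" and "will enable rxq0 irq"
    for line in lines:
        m = EVENT_RE.match(line)
        if not m:
            continue  # not an event line
        msg = m.group("msg")
        if msg.startswith("rxq0 irq"):
            if in_poll:
                warnings.append((m.group("time"), "interrupt inside napi"))
            else:
                irqs_before_poll += 1
        elif msg.startswith("poll starting"):
            if irqs_before_poll > 1:
                warnings.append(
                    (m.group("time"),
                     "many interrupts (%d) before napi" % irqs_before_poll))
            irqs_before_poll = 0
            in_poll = True
        elif msg.startswith("will enable rxq0 irq"):
            in_poll = False
    return warnings
```

Something like check_trace(open("trace.txt")) would then list the
suspicious events; the file name is only an example.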

[PATCH] net-timestamp: Update skb_complete_tx_timestamp comment

2015-08-07 Thread Benjamin Poirier

After 62bccb8 ("net-timestamp: Make the clone operation stand-alone from
phy timestamping"), the hwtstamps parameter of skb_complete_tx_timestamp()
may no longer be NULL.

Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
Cc: Alexander Duyck <alexander.h.du...@redhat.com>
---
 include/linux/skbuff.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d6cdd6e..22b6d9c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2884,11 +2884,11 @@ static inline bool skb_defer_rx_timestamp(struct sk_buff *skb)
  *
  * PHY drivers may accept clones of transmitted packets for
  * timestamping via their phy_driver.txtstamp method. These drivers
- * must call this function to return the skb back to the stack, with
- * or without a timestamp.
+ * must call this function to return the skb back to the stack with a
+ * timestamp.
  *
  * @skb: clone of the the original outgoing packet
- * @hwtstamps: hardware time stamps, may be NULL if not available
+ * @hwtstamps: hardware time stamps
  *
  */
 void skb_complete_tx_timestamp(struct sk_buff *skb,
-- 
2.4.3
