date:20170809

Re: [PATCH 0/4] net-next: dsa: fix flow dissection

2017-08-09 Thread David Miller

From: John Crispin 
Date: Wed,  9 Aug 2017 14:41:15 +0200

> RPS and probably other kernel features are currently broken on some if not
> all DSA devices. The root cause of this is that skb_hash will call the
> flow_dissector. At this point the skb still contains the magic switch
> header and the skb->protocol field is not set up to the correct 802.3
> value yet. By the time the tag specific code is called, removing the header
> and properly setting the protocol an invalid hash is already set. In the
> case of the mt7530 this will result in all flows always having the same
> hash.
> 
> Changes since RFC:
> * use a callback instead of static values
> * add cover letter

Series applied, thanks.

Re: [PATCH net-next] skbuff: Add BUG_ON in skb_init.

2017-08-09 Thread David Miller

From: Tonghao Zhang 
Date: Wed,  9 Aug 2017 05:04:38 -0700

> When initializing the skbuff SLAB cache, we should make
> sure it is successful. Adding BUG_ON to check it and
> init_inodecache() is in the same case.
> 
> Signed-off-by: Tonghao Zhang 
> ---
>  net/core/skbuff.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 42b62c716a33..9513de519870 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3904,6 +3904,8 @@ void __init skb_init(void)
>   0,
>   SLAB_HWCACHE_ALIGN|SLAB_PANIC,
>   NULL);
> + BUG_ON(skbuff_head_cache == NULL);
> + BUG_ON(skbuff_fclone_cache == NULL);
>  }

I know you guys want every allocation to be explicitly checked so that
everything is consistent for static code analysis checkers.

But this is just wasted code.

The first allocation will take a NULL dereference and the backtrace
will make it completely clear which SLAB cache was NULL and couldn't
be allocated.

So there is no real value to adding these checks.

So I'm not applying this, sorry.

The same logic goes for your other patch of this nature.

Re: [PATCHv2] igmp: Fix regression caused by igmp sysctl namespace code.

2017-08-09 Thread David Miller

From: Nikolay Borisov 
Date: Wed,  9 Aug 2017 14:38:04 +0300

> Commit dcd87999d415 ("igmp: net: Move igmp namespace init to correct file")
> moved the igmp sysctls initialization from tcp_sk_init to igmp_net_init. This
> function is only called as part of per-namespace initialization, only if
> CONFIG_IP_MULTICAST is defined, otherwise igmp_mc_init() call in ip_init is
> compiled out, casuing the igmp pernet ops to not be registerd and those sysctl
> being left initialized with 0. However, there are certain functions, such as
> ip_mc_join_group which are always compiled and make use of some of those
> sysctls. Let's do a partial revert of the aforementioned commit and move the
> sysctl initialization into inet_init_net, that way they will always have
> sane values.
> 
> Fixes: dcd87999d415 ("igmp: net: Move igmp namespace init to correct file")
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=196595
> Reported-by: Gerardo Exequiel Pozzi 
> Cc:  # 4.6
> Signed-off-by: Nikolay Borisov 

Please do not use "CC: stable.." for networking patches, I take care of
-stable submissions by hand for those.

Applied and queued up for -stable, thanks.

Re: [PATCH 0/2] net-next: mediatek: bring up QDMA RX ring 0

2017-08-09 Thread David Miller

From: John Crispin 
Date: Wed,  9 Aug 2017 12:09:30 +0200

> The MT7623 has several DMA rings. Inside the SW path, the core will use
> the PDMA when receiving traffic. While bringing up the HW path we noticed
> that the PPE requires the QDMA RX to also be brought up as it uses this
> ring internally for its flow scheduling.

Series applied.

Re: [PATCHv2 net] net: sched: set xt_tgchk_param par.nft_compat as 0 in ipt_init_target

2017-08-09 Thread David Miller

From: Xin Long 
Date: Wed,  9 Aug 2017 18:15:19 +0800

> Commit 55917a21d0cc ("netfilter: x_tables: add context to know if
> extension runs from nft_compat") introduced a member nft_compat to
> xt_tgchk_param structure.
> 
> But it didn't set it's value for ipt_init_target. With unexpected
> value in par.nft_compat, it may return unexpected result in some
> target's checkentry.
> 
> This patch is to set all it's fields as 0 and only initialize the
> non-zero fields in ipt_init_target.
> 
> v1->v2:
>   As Wang Cong's suggestion, fix it by setting all it's fields as
>   0 and only initializing the non-zero fields.
> 
> Fixes: 55917a21d0cc ("netfilter: x_tables: add context to know if extension 
> runs from nft_compat")
> Suggested-by: Cong Wang 
> Signed-off-by: Xin Long 

Applied and queued up for -stable.

Re: [PATCH] net: atm: make atmdev_ops const

2017-08-09 Thread David Miller

From: Bhumika Goyal 
Date: Wed,  9 Aug 2017 15:02:08 +0530

> Make these const as they are only stored in the ops field of a atm_dev
> structure, which is const.
> Done using Coccinelle.
> 
> Signed-off-by: Bhumika Goyal 

Applied.

Re: [PATCH] atm: make atmdev_ops const

2017-08-09 Thread David Miller

From: Bhumika Goyal 
Date: Wed,  9 Aug 2017 14:49:15 +0530

> Make these structures const as they are either passed to the function
> atm_dev_register having the corresponding argument as const or stored in
> the ops field of a atm_dev structure, which is also const.
> Done using Coccinelle.
> 
> Signed-off-by: Bhumika Goyal 

Applied.

Re: [PATCH] net: dsa: make dsa_switch_ops const

2017-08-09 Thread David Miller

From: Bhumika Goyal 
Date: Wed,  9 Aug 2017 10:34:15 +0530

> Make these structures const as they are only stored in the ops field of
> a dsa_switch structure, which is const.
> Done using Coccinelle.
> 
> Signed-off-by: Bhumika Goyal 

Applied, thank you.

Re: [PATCH net-next] liquidio: napi cleanup

2017-08-09 Thread David Miller

From: Felix Manlunas 
Date: Tue, 8 Aug 2017 19:34:28 -0700

> From: Intiyaz Basha 
> 
> Disable napi when interface is going down.
> Delete napi when destroying the interface.
> 
> Signed-off-by: Intiyaz Basha 
> Signed-off-by: Felix Manlunas 

Applied, thanks.

Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-09 Thread Wei Wang

Hi John,

Is it possible to try the attached patch?
I am not sure if it actually fixes the issue. But I think it is worth a try.
Also, could you get me all the ipv6 routes when you plug in the usb
using "ip -6 route show"? (If you have multiple routing tables
configured, could you dump them all?)

Thanks a lot.
Wei

On Wed, Aug 9, 2017 at 6:36 PM, Wei Wang  wrote:
> On Wed, Aug 9, 2017 at 6:26 PM, John Stultz  wrote:
>> On Wed, Aug 9, 2017 at 5:36 PM, Wei Wang  wrote:
>>> On Wed, Aug 9, 2017 at 4:44 PM, John Stultz  wrote:
 On Wed, Aug 9, 2017 at 4:34 PM, Cong Wang  wrote:
> (Cc'ing Wei whose commit was blamed)
>
> On Mon, Aug 7, 2017 at 2:15 PM, John Stultz  
> wrote:
>> On Mon, Aug 7, 2017 at 2:05 PM, John Stultz  
>> wrote:
>>> So, with recent testing with my HiKey board, I've been noticing some
>>> quirky behavior with my USB eth adapter.
>>>
>>> Basically, pluging the usb eth adapter in and then removing it, when
>>> plugging it back in I often find that its not detected, and the system
>>> slowly spits out the following message over and over:
>>>   unregister_netdevice: waiting for eth0 to become free. Usage count = 1
>>
>> The other bit is that after this starts printing, the board will no
>> longer reboot (it hangs continuing to occasionally print the above
>> message), and I have to manually reset the device.
>>
>
> So this warning is not temporarily shown but lasts until a reboot,
> right? If so it is a dst refcnt leak.

 Correct, once I get into the state it lasts until a reboot.

> How reproducible is it for you? From my reading, it seems always
> reproduced when you unplug and plug your usb eth interface?
> Is there anything else involved? For example, network namespace.

 So with 4.13-rc3/4 I seem to trigger it easily, often with the first
 unplug of the USB eth adapter.

 But as I get back closer to 4.12, it seemingly becomes harder to
 trigger, but sometimes still happens.

 So far, I've not been able to trigger it with 4.12.

 I don't think network namespaces are involved?  Though its out of my
 area, so AOSP may be using them these days.  Is there a simple way to
 check?

 I'll also do another bisection to see if the bad point moves back any 
 further.
>>
>> So I went through another bisection around and got  9514528d92d4 ipv6:
>> call dst_dev_put() properly as the first bad commit again.
>>
>>> If you see the problem starts to happen on commit
>>> 9514528d92d4cbe086499322370155ed69f5d06c, could you try reverting all
>>> the following commits:
>>> (from new to old)
>>> 1eb04e7c9e63 net: reorder all the dst flags
>>> a4c2fd7f7891 net: remove DST_NOCACHE flag
>>> b2a9c0ed75a3 net: remove DST_NOGC flag
>>> 5b7c9a8ff828 net: remove dst gc related code
>>> db916649b5dd ipv6: get rid of icmp6 dst garbage collector
>>> 587fea741134 ipv6: mark DST_NOGC and remove the operation of dst_free()
>>> ad65a2f05695 ipv6: call dst_hold_safe() properly
>>> 9514528d92d4 ipv6: call dst_dev_put() properly
>>
>>
>> And reverting this set off of 4.13-rc4 seems to make the issue go away.
>>
>> Is there anything I can test to help narrow down the specific problem
>> with that patchset?
>>
>
> Thanks John for confirming.
> Let me spend some time on the commits and I will let you know if I
> have some debug image for you to try.
>
> Wei
>
>
>> thanks
>> -john
From 93f2836679c81915b110ff56617f9f5dae2e6927 Mon Sep 17 00:00:00 2001
From: Wei Wang 
Date: Wed, 9 Aug 2017 22:27:36 -0700
Subject: [PATCH] ipv6: unregister netdev bug fix

Change-Id: I30fa739989ac50fbc7f4cbc6a04130005589cc25
---
 include/net/ip6_route.h |  1 +
 net/ipv6/addrconf.c | 10 +++---
 net/ipv6/anycast.c  |  3 ++-
 net/ipv6/route.c|  2 +-
 4 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 907d39a42f6b..dec1424ce619 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -94,6 +94,7 @@ int ipv6_route_ioctl(struct net *net, unsigned int cmd, void __user *arg);
 int ip6_route_add(struct fib6_config *cfg, struct netlink_ext_ack *extack);
 int ip6_ins_rt(struct rt6_info *);
 int ip6_del_rt(struct rt6_info *);
+void rt6_uncached_list_add(struct rt6_info *rt);
 
 static inline int ip6_route_get_saddr(struct net *net, struct rt6_info *rt,
   const struct in6_addr *daddr,
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 3c46e9513a31..06a27addb93c 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3079,7 +3079,8 @@ static void init_loopback(struct net_device *dev)
 			/* Failure cases are ignored */
 			if (!IS_ERR(sp_rt)) {
 sp_ifa->rt = sp_rt;
-ip6_ins_rt(sp_rt);
+if

Re: [PATCH net] geneve: maximum value of VNI cannot be used

2017-08-09 Thread David Miller

From: Girish Moodalbail 
Date: Tue,  8 Aug 2017 17:26:24 -0700

> Geneve's Virtual Network Identifier (VNI) is 24 bit long, so the range
> of values for it would be from 0 to 16777215 (2^24 -1).  However, one
> cannot create a geneve device with VNI set to 16777215. This patch fixes
> this issue.
> 
> Signed-off-by: Girish Moodalbail 

I always worry that someone, somewhere, might be using this in some
way and this will break things.

But I'll apply this for now.

Re: [PATCH net-next] net: ipv6: lower ndisc notifier priority below addrconf

2017-08-09 Thread David Miller

From: David Ahern 
Date: Tue,  8 Aug 2017 15:51:02 -0600

> ndisc_notify is used to send unsolicited neighbor advertisements
> (e.g., on a link up). Currently, the ndisc notifier is run before the
> addrconf notifer which means NA's are not sent for link-local addresses
> which are added by the addrconf notifier.
> 
> Fix by lowering the priority of the ndisc notifier. Setting the priority
> to ADDRCONF_NOTIFY_PRIORITY - 5 means it runs after addrconf and before
> the route notifier which is ADDRCONF_NOTIFY_PRIORITY - 10.
> 
> Signed-off-by: David Ahern 

Applied, thanks David.

Re: [PATCH net v2] net: systemport: Fix software statistics for SYSTEMPORT Lite

2017-08-09 Thread David Miller

From: Florian Fainelli 
Date: Tue,  8 Aug 2017 14:45:09 -0700

> With SYSTEMPORT Lite we have holes in our statistics layout that make us
> skip over the hardware MIB counters, bcm_sysport_get_stats() was not
> taking that into account resulting in reporting 0 for all SW-maintained
> statistics, fix this by skipping accordingly.
> 
> Fixes: 44a4524c54af ("net: systemport: Add support for SYSTEMPORT Lite")
> Signed-off-by: Florian Fainelli 

Applied, thanks.

Regression: Bug 196547 - Since 4.12 - bonding module not working with wireless drivers

2017-08-09 Thread Kalle Valo

Hi Mahesh and Andy,

James Feeney reported that there's a serious regression in bonding
module since v4.12, it doesn't work with wireless drivers anymore as
wireless drivers don't report the link speed via ethtool:

https://bugzilla.kernel.org/show_bug.cgi?id=196547

In the bug report it's said that this commit is the culprit:

3f3c278c94dd bonding: fix active-backup transition

Is there a fix for this or should that commit be reverted? This seems to
be a serious regression as there are multiple reports already and we
should get it fixed for v4.13, and the fix backported to v4.12 stable
release.

-- 
Kalle Valo

Re: [net 1/1] tipc: remove premature ESTABLISH FSM event at link synchronization

2017-08-09 Thread David Miller

From: Jon Maloy 
Date: Tue, 8 Aug 2017 22:23:56 +0200

> When a link between two nodes come up, both endpoints will initially
> send out a STATE message to the peer, to increase the probability that
> the peer endpoint also is up when the first traffic message arrives.
> Thereafter, if the establishing link is the second link between two
> nodes, this first "traffic" message is a TUNNEL_PROTOCOL/SYNCH message,
> helping the peer to perform initial synchronization between the two
> links.
> 
> However, the initial STATE message may be lost, in which case the SYNCH
> message will be the first one arriving at the peer. This should also
> work, as the SYNCH message itself will be used to take up the link
> endpoint before  initializing synchronization.
> 
> Unfortunately the code for this case is broken. Currently, the link is
> brought up through a tipc_link_fsm_evt(ESTABLISHED) when a SYNCH
> arrives, whereupon __tipc_node_link_up() is called to distribute the
> link slots and take the link into traffic. But, __tipc_node_link_up() is
> itself starting with a test for whether the link is up, and if true,
> returns without action. Clearly, the tipc_link_fsm_evt(ESTABLISHED) call
> is unnecessary, since tipc_node_link_up() is itself issuing such an
> event, but also harmful, since it inhibits tipc_node_link_up() to
> perform the test of its tasks, and the link endpoint in question hence
> is never taken into traffic.
> 
> This problem has been exposed when we set up dual links between pre-
> and post-4.4 kernels, because the former ones don't send out the
> initial STATE message described above.
> 
> We fix this by removing the unnecessary event call.
> 
> Signed-off-by: Jon Maloy 

Applied.

Re: [PATCH net-next] ibmvnic: Clean up resources on probe failure

2017-08-09 Thread David Miller

From: Nathan Fontenot 
Date: Tue, 08 Aug 2017 14:28:45 -0500

> Ensure that any resources allocated during probe are released if the
> probe of the driver fails.
> 
> Signed-off-by: Nathan Fontenot 

Applied, but:

> +
> +ibmvnic_register_fail:
> + device_remove_file(>dev, _attr_failover);
> + 
   ^^

That last line has trailing whitespace, which I had to fix up.

Please take care of this next time.

Re: [PATCH net-next] ibmvnic: Correct 'unused variable' warning in build.

2017-08-09 Thread David Miller

From: Nathan Fontenot 
Date: Tue, 08 Aug 2017 15:26:18 -0500

> Commit a248878d7a1d ("ibmvnic: Check for transport event on driver resume")
> removed the loop to kick irqs on driver resume but didn't remove the now
> unused loop variable 'i'.
> 
> Signed-off-by: Nathan Fontenot 

Applied.

Re: [PATCH v1 net] TCP_USER_TIMEOUT and tcp_keepalive should conform to RFC5482

2017-08-09 Thread Jerry Chu

On Wed, Aug 9, 2017 at 8:32 PM, Jerry Chu  wrote:
> On Wed, Aug 9, 2017 at 5:47 PM, Rao Shoaib  wrote:
>>
>>
>> On 08/09/2017 05:30 PM, David Miller wrote:
>>>
>>> From: Joe Smith 
>>> Date: Wed, 9 Aug 2017 17:20:32 -0700
>>>
 Making Linux conform to standards and behavior that is logical seems
 like a good enough reason.
>>>
>>> That's an awesome attitude to have when we're implementing something
>>> new and don't have the facility already.
>>>
>>> But when we have something already the only important consideration is
>>> not breaking existing apps which rely on that behavior.
>>>
>>> That is much, much, more important than standards compliance.
>>>
>>> If users are confused, just fix the documentation.
>>
>> David,
>>
>> If it was just confusion than sure fixing the documentation is fine. What if
>> the logic is incorrect, does not conform to the standard that is says it is
>
> Not sure what part of logic is "incorrect" when it was a homegrown Linux API
> with no need to conform with any "standard"? Note that the new API was 
> invented
> 7 years ago not out of need for RFC5482. In fact I initially call the option
> TCP_FAILFAST and did not even know the existence of RFC5482 until someone
> around the same time proposed a UTO option specifically for RFC5482 and I
> thought the two can be combined. (This is roughly the memory I can
> recollect so far.)
>
> So you see my focus back then was to devise a "failfast" option whereas 
> RFC5482
> was meant for a "failslow" case. I think that explains why I let the
> option override
> keepalive so a TCP connection can "fail fast" while RFC5482 4.2 tries to 
> prevent
> keepalive failure ahead of UTO, favoring "fail slow".
>
> If we start from a clean slate then perhaps one can argue the semantic
> either way
> but we do not have a clean slate. For that I still slightly favor not
> changing the code
> because the risk of breakage is definitely non-zero and the issue you're 
> having
> seem to be only related to documentation.

One more thing - the proposed patch compares TCP_KEEPIDLE against
TCP_USER_TIMEOUT. But I don't think TCP_KEEPIDLE is what the
"keep-alive
timer" referred to in RFC5482. Linux keepalive implementation seems to use #
of retries (TCP_KEEPCNT) rather than time duration (keep-alive time) to
determine when to quit. If that is the case then your proposed change is not
fully "compliant" either and the best is probably just don't change.

>
> Jerry
>
>> implementing and easy to fix with little or no risk of breakage.
>>
>> The proposed patch changes a feature that no one uses. It also imposes the
>> relation ship between keepalive and timeout values that is required by the
>> RFC and make sense.
>>
>> You are the final authority, if you say we should just fix the documentation
>> than that is fine.
>>
>> Shoaib
>>

[PATCH] bonding: require speed/duplex only for 802.3ad, alb and tlb

2017-08-09 Thread Andreas Born

The patch c4adfc822bf5 ("bonding: make speed, duplex setting consistent
with link state") puts the link state to down if
bond_update_speed_duplex() cannot retrieve speed and duplex settings.
Assumably the patch was written with 802.3ad mode in mind which relies
on link speed/duplex settings. For other modes like active-backup these
settings are not required. Thus, only for these other modes, this patch
reintroduces support for slaves that do not support reporting speed or
duplex such as wireless devices. This fixes the regression reported in
bug 196547 (https://bugzilla.kernel.org/show_bug.cgi?id=196547).

Fixes: c4adfc822bf5 ("bonding: make speed, duplex setting consistent
with link state")
Signed-off-by: Andreas Born 
---
 drivers/net/bonding/bond_main.c | 6 --
 include/net/bonding.h   | 5 +
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 9bee6c1c70cc..85bb272d2a34 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1569,7 +1569,8 @@ int bond_enslave(struct net_device *bond_dev, struct 
net_device *slave_dev)
new_slave->delay = 0;
new_slave->link_failure_count = 0;
 
-   if (bond_update_speed_duplex(new_slave))
+   if (bond_update_speed_duplex(new_slave) &&
+   bond_needs_speed_duplex(bond))
new_slave->link = BOND_LINK_DOWN;
 
new_slave->last_rx = jiffies -
@@ -2140,7 +2141,8 @@ static void bond_miimon_commit(struct bonding *bond)
continue;
 
case BOND_LINK_UP:
-   if (bond_update_speed_duplex(slave)) {
+   if (bond_update_speed_duplex(slave) &&
+   bond_needs_speed_duplex(bond)) {
slave->link = BOND_LINK_DOWN;
netdev_warn(bond->dev,
"failed to get link speed/duplex 
for %s\n",
diff --git a/include/net/bonding.h b/include/net/bonding.h
index b00508d22e0a..b2e68657a216 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -277,6 +277,11 @@ static inline bool bond_is_lb(const struct bonding *bond)
   BOND_MODE(bond) == BOND_MODE_ALB;
 }
 
+static inline bool bond_needs_speed_duplex(const struct bonding *bond)
+{
+   return BOND_MODE(bond) == BOND_MODE_8023AD || bond_is_lb(bond);
+}
+
 static inline bool bond_is_nondyn_tlb(const struct bonding *bond)
 {
return (BOND_MODE(bond) == BOND_MODE_TLB)  &&
-- 
2.14.0

Re: [PATCH v4 05/12] Documentation: net: phy: Add phy-is-internal binding

2017-08-09 Thread Chen-Yu Tsai

On Thu, Aug 10, 2017 at 8:20 AM, Andrew Lunn  wrote:
> On Wed, Aug 09, 2017 at 03:47:34PM -0700, Florian Fainelli wrote:
>> On August 9, 2017 5:10:30 AM PDT, David Wu  wrote:
>> >Add the documentation for internal phy. A boolean property
>> >indicates that a internal phy will be used.
>> >
>> >Signed-off-by: David Wu 
>> >---
>> > Documentation/devicetree/bindings/net/phy.txt | 3 +++
>> > 1 file changed, 3 insertions(+)
>> >
>> >diff --git a/Documentation/devicetree/bindings/net/phy.txt
>> >b/Documentation/devicetree/bindings/net/phy.txt
>> >index b558576..942c892 100644
>> >--- a/Documentation/devicetree/bindings/net/phy.txt
>> >+++ b/Documentation/devicetree/bindings/net/phy.txt
>> >@@ -52,6 +52,9 @@ Optional Properties:
>> >   Mark the corresponding energy efficient ethernet mode as broken and
>> >   request the ethernet to stop advertising it.
>> >
>> >+- phy-is-internal: If set, indicates that phy will connect to the MAC
>> >as a
>> >+  internal phy.
>>
>> Something along the lines of:
>>
>> If set, indicates that the PHY is integrated into the same physical package 
>> as the Ethernet MAC.
>
> Hi Florian, David.
>
> I'm happy with the property name. But i think the text needs more
> description. We deal with Ethernet switches with integrated PHYs. Yet
> for us, this property is unneeded.
>
> Seeing this property means some bit of software needs to ensure the
> internal PHY should be used, when given the choice between an internal
> and external PHY. So i would say something like:
>
> If set, indicates that the PHY is integrated into the same
> physical package as the Ethernet MAC. If needed, muxers should be
> configured to ensure the internal PHY is used. The absence of this
> property indicates the muxers should be configured so that the
> external PHY is used.
>
> This last part is important. If the bootloader has set the internal
> PHY to be used, you need to reset it. Otherwise we are going to get
> into a mess sometime later and need to add a phy-is-external property.

Ack.

One other thing. We need to fix our (sunxi) binding which is already
in 4.13-rc1. We'd like to see this new property in netdev, i.e. merged
for 4.13, so we can use it.

Thanks
ChenYu

Re:Re: [PATCH net-next 1/1] driver: pptp: Remove unnecessary statements in pptp_sock_destruct

2017-08-09 Thread Gao Feng

At 2017-08-10 02:08:30, "Cong Wang"  wrote:
>On Wed, Aug 9, 2017 at 8:57 AM,   wrote:
>> From: Gao Feng 
>>
>> In the commit ddab82821fa6 ("ppp: Fix a scheduling-while-atomic bug in
>> del_chan"), I moved the synchronize_rcu() from del_chan() to pptp_release
>> after del_chan() to avoid one scheduling-while-atomic bug.
>>
>> Actually the del_chan() and pppox_unbind_sock are unneccessary in the
>> pptp_sock_destruct. Because the pptp sock refcnt wouldn't reach zero until
>> sk_state is set as PPPOX_DEAD in pptp_release. By that time, the del_chan()
>> and pppox_unbind_sock() have been invoked already and the condition check
>> "!(sk->sk_state & PPPOX_DEAD)" of this sock must be false in 
>> pptp_sock_destruct.
>
>I am not sure. The check for sock->sk in the beginning of pptp_release()
>indicates there could be a case we could skip del_chan() in pptp_release(),
>although I can't figure out how.
>
>Also there is a suspicious sock_put() in pptp_release().

Hi Cong, 

Thank you.  I also failed to find the case which causes the sock->sk is null in 
release().

There is a suspicious case in __sock_create following.
err = pf->create(net, sock, protocol, kern);
if (err < 0)
goto out_module_put;
...
out_module_put:
sock->ops = NULL;
module_put(pf->owner);
out_sock_release:
sock_release(sock);
return err;
In the beginning, I thought when create is failed and the sock->sk is null, 
then the sock_release is invoked.
It could cause the sk is null in the release().
But I find  it has already reset the sock->ops as NULL before sock_release 
later, so the release() wouldn't be invoked actually.

Best Regards
Feng

Re: [PATCH v1 net] TCP_USER_TIMEOUT and tcp_keepalive should conform to RFC5482

2017-08-09 Thread Jerry Chu

On Wed, Aug 9, 2017 at 5:47 PM, Rao Shoaib  wrote:
>
>
> On 08/09/2017 05:30 PM, David Miller wrote:
>>
>> From: Joe Smith 
>> Date: Wed, 9 Aug 2017 17:20:32 -0700
>>
>>> Making Linux conform to standards and behavior that is logical seems
>>> like a good enough reason.
>>
>> That's an awesome attitude to have when we're implementing something
>> new and don't have the facility already.
>>
>> But when we have something already the only important consideration is
>> not breaking existing apps which rely on that behavior.
>>
>> That is much, much, more important than standards compliance.
>>
>> If users are confused, just fix the documentation.
>
> David,
>
> If it was just confusion than sure fixing the documentation is fine. What if
> the logic is incorrect, does not conform to the standard that is says it is

Not sure what part of logic is "incorrect" when it was a homegrown Linux API
with no need to conform with any "standard"? Note that the new API was invented
7 years ago not out of need for RFC5482. In fact I initially call the option
TCP_FAILFAST and did not even know the existence of RFC5482 until someone
around the same time proposed a UTO option specifically for RFC5482 and I
thought the two can be combined. (This is roughly the memory I can
recollect so far.)

So you see my focus back then was to devise a "failfast" option whereas RFC5482
was meant for a "failslow" case. I think that explains why I let the
option override
keepalive so a TCP connection can "fail fast" while RFC5482 4.2 tries to prevent
keepalive failure ahead of UTO, favoring "fail slow".

If we start from a clean slate then perhaps one can argue the semantic
either way
but we do not have a clean slate. For that I still slightly favor not
changing the code
because the risk of breakage is definitely non-zero and the issue you're having
seem to be only related to documentation.

Jerry

> implementing and easy to fix with little or no risk of breakage.
>
> The proposed patch changes a feature that no one uses. It also imposes the
> relation ship between keepalive and timeout values that is required by the
> RFC and make sense.
>
> You are the final authority, if you say we should just fix the documentation
> than that is fine.
>
> Shoaib
>

Re:Re: Re:Re: Re:Re: Re: [PATCH net] ppp: Fix a scheduling-while-atomic bug in del_chan

2017-08-09 Thread Gao Feng


At 2017-08-10 05:00:19, "Cong Wang"  wrote:
>On Wed, Aug 9, 2017 at 12:17 AM, Gao Feng  wrote:
>> Hi Cong,
>>
>> Actually I have one question about the SOCK_RCU_FREE.
>> I don't think it could resolve the issue you raised even though it exists 
>> really.
>>
>> I checked the SOCK_RCU_FREE, it just defer the __sk_destruct after one rcu 
>> period.
>> But when it performs, someone still could find this sock by callid during 
>> the del_chan period and it may still deference the sock which may freed soon.
>>
>> The right flow should be following.
>> del_chan()
>> wait a rcu period
>> sock_put()  It is safe that someone gets the sock because it 
>> already hold sock refcnt.
>>
>> When using SOCK_RCU_FREE, its flow would be following.
>> wait a rcu period
>> del_chan()
>> free the sock directly  no sock refcnt check again.
>> Because the del_chan happens after rcu wait, not before, so it isn't helpful 
>> with SOCK_RCU_FREE.
>
>
>Yes, good point! With SOCK_RCU_FREE the sock_hold() should
>not be needed. For RCU, unpublish should indeed happen before
>grace period.

Sorry, I couldn't understand why sock_hold() isn't necessary with SOCK_RCU_FREE.
When lookup_chan finds the sock, it would return and reference it later.
If no refcnt, how to protect the sock ?

Best Regards
Feng

>
>
>>
>> I don't know if I misunderstand the SOCK_RCU_FREE usage.
>>
>> But it is a good news that the del_chan is only invoked in pptp_release 
>> actually and it would wait a rcu period before sock_put.
>>
>
>Looking at the code again, the reader lookup_chan() is actually
>invoked in BH context, but neither add_chan() nor del_chan()
>actually disables BH...

Re: [PATCH v3 05/11] net: stmmac: dwmac-rk: Add internal phy support

2017-08-09 Thread Chen-Yu Tsai

Hi David,

On Wed, Aug 9, 2017 at 5:38 PM, David.Wu  wrote:
> Hello Corentin, Chen-Yu
>
>
> 在 2017/8/9 16:45, Corentin Labbe 写道:
>>
>> On Thu, Aug 03, 2017 at 07:06:33PM +0800, Chen-Yu Tsai wrote:
>>>
>>> On Thu, Aug 3, 2017 at 1:38 AM, Florian Fainelli 
>>> wrote:

 On 08/01/2017 11:21 PM, David Wu wrote:
>
> To make internal phy work, need to configure the phy_clock,
> phy cru_reset and related registers.
>
> Signed-off-by: David Wu 
> ---
>   .../devicetree/bindings/net/rockchip-dwmac.txt |  6 +-
>   drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c | 81
> ++
>   2 files changed, 86 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/devicetree/bindings/net/rockchip-dwmac.txt
> b/Documentation/devicetree/bindings/net/rockchip-dwmac.txt
> index 8f42755..ec39b31 100644
> --- a/Documentation/devicetree/bindings/net/rockchip-dwmac.txt
> +++ b/Documentation/devicetree/bindings/net/rockchip-dwmac.txt
> @@ -25,7 +25,8 @@ Required properties:
>- clock-names: One name for each entry in the clocks property.
>- phy-mode: See ethernet.txt file in the same directory.
>- pinctrl-names: Names corresponding to the numbered pinctrl states.
> - - pinctrl-0: pin-control mode. can be <_pins> or <_pins>.
> + - pinctrl-0: pin-control mode. can be <_pins>, <_pins> or
> led pins
> +   for internal phy mode.
>- clock_in_out: For RGMII, it must be "input", means main
> clock(125MHz)
>  is not sourced from SoC's PLL, but input from PHY; For RMII,
> "input" means
>  PHY provides the reference clock(50MHz), "output" means GMAC
> provides the
> @@ -40,6 +41,9 @@ Optional properties:
>- tx_delay: Delay value for TXD timing. Range value is 0~0x7F, 0x30
> as default.
>- rx_delay: Delay value for RXD timing. Range value is 0~0x7F, 0x10
> as default.
>- phy-supply: phandle to a regulator if the PHY needs one
> + - clocks: < MAC_PHY>: Clock selector for internal macphy
> + - phy-is-internal: A boolean property allows us to know that MAC will
> connect to
> +   internal phy.


 This is incorrect in two ways:

 - this is a property of the PHY, so it should be documented as such in
 Documentation/devicetree/bindings/net/phy.txt so other bindings can
 re-use it

 - if it was specific to your MAC you would expect a vendor prefix to
 this property, which is not there

 An alternative way to implement the external/internal logic selection
 would be mandate that your Ethernet PHY node have a compatible string
 like this:

 phy@0 {
  compatible = "ethernet-phy-id1234.d400",
 "ethernet-phy-802.3-c22";
 };

 Then you don't need this phy-is-internal property, because you can
 locate the PHY node by the phy-handle (see more about that in a reply to
 patch 10) and you can determine ahead of time whether this PHY is
 internal or not based on its compatible string.
>>>
>>>
>>> This doesn't really work for us (sunxi). The "internal" PHY of the H3
>>> is also available in the X-Powers AC200 combo chip, which would be
>>> external. Same ID. So if someone were to be stupid and put the two
>>> together, you wouldn't know which one it actually is. Generically
>>> it doesn't make sense to match against the ID either. The PHY is
>>> only integrated or inlined into the SoC, meaning it could be moved
>>> elsewhere or even be a discreet part.
>>>
>>> The way I see it is more like a reversed pinmux. The system should
>>> select either the internal set or external set of xMII pins to use.
>>>
>>> A phy-is-internal property in the PHY node would work for us. We
>>> already need a PHY node to describe its clocks and resets.
>>>
>>> A side note to this solution is that, since the internal PHY is defined
>>> at the .dtsi level, any external PHYs at the same address would need to
>>> make sure to delete the property, which is kind of counterintuitive, but
>>> it is how device tree works. One would want to put the internal PHY's
>>> address, assuming it is configurable, on something that is rarely used.
>>>
>>
>> Hello David, Florian, Andrew
>>
>> Could someone ack this ? or nack if you think that the chance for having
>> two PHY id both internal and external is too low.
>> Anyway, we need a choice.
>>
>
> I think we should be specific to the situation, for us we have the
> possibility that the Mac only picks up internal PHY, so this can be fixed at
> the. DTSi level, also possible INTERNL PHY's Mac can also be used to connect
> external PHY, while cutting off the connection with the internal PHY, so we
> should define the internal PHY at the DTS level, so I think Florian's
> approach is acceptable.

So it looks like you have the clock for the internal/integrated PHY

Re: [PATCH net-next 0/2] zerocopy fixes

2017-08-09 Thread David Ahern

On 8/9/17 5:09 PM, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> Fix two issues introduced in the msg_zerocopy patchset.
> 
> Willem de Bruijn (2):
>   sock: fix zerocopy panic in mem accounting
>   sock: fix zerocopy_success regression with msg_zerocopy
> 
>  include/linux/skbuff.h | 9 +++--
>  net/core/skbuff.c  | 4 ++--
>  2 files changed, 9 insertions(+), 4 deletions(-)
> 

Fixes the problem I reported. Thanks for the quick turn around.

Re: [PATCH 0/3] Fix y2038 issues for security/keys subsystem

2017-08-09 Thread Baolin Wang

Hi Arnd,

On 9 August 2017 at 16:44, Arnd Bergmann  wrote:
> On Wed, Aug 9, 2017 at 4:51 AM, Baolin Wang  wrote:
>> Since 'time_t', 'timeval' and 'timespec' types are not year 2038 safe on
>> 32 bits system, this patchset tries to fix this issues for security/keys
>> subsystem and net/rxrpc subsystem which is connected with security/keys
>> subsystem.
>>
>> Baolin Wang (3):
>>   security: keys: Replace time_t/timespec with time64_t
>>   security: keys: Replace time_t with time64_t for struct
>> key_preparsed_payload
>>   net: rxrpc: Replace time_t type with time64_t type
>
> Hi David,
>
> I did a private review before Baolin posted these patches, this version look
> correct to me, though I would like to see some clarification from you for the
> rxrpc portion, I'll reply there separately.
>
> All three patches
>
> Reviewed-by: Arnd Bergmann 

Thanks for your reviewed tag.

-- 
Baolin.wang
Best Regards

Re: [PATCH 0/3] Fix y2038 issues for security/keys subsystem

2017-08-09 Thread Baolin Wang

On 9 August 2017 at 16:28, David Howells  wrote:
> The rxrpc patch isn't part of the security/keys subsystem.  I'll push it
> to the network tree.  The other two I'll push to James.

Thanks David.

-- 
Baolin.wang
Best Regards

Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-09 Thread Wei Wang

On Wed, Aug 9, 2017 at 6:26 PM, John Stultz  wrote:
> On Wed, Aug 9, 2017 at 5:36 PM, Wei Wang  wrote:
>> On Wed, Aug 9, 2017 at 4:44 PM, John Stultz  wrote:
>>> On Wed, Aug 9, 2017 at 4:34 PM, Cong Wang  wrote:
 (Cc'ing Wei whose commit was blamed)

 On Mon, Aug 7, 2017 at 2:15 PM, John Stultz  wrote:
> On Mon, Aug 7, 2017 at 2:05 PM, John Stultz  
> wrote:
>> So, with recent testing with my HiKey board, I've been noticing some
>> quirky behavior with my USB eth adapter.
>>
>> Basically, pluging the usb eth adapter in and then removing it, when
>> plugging it back in I often find that its not detected, and the system
>> slowly spits out the following message over and over:
>>   unregister_netdevice: waiting for eth0 to become free. Usage count = 1
>
> The other bit is that after this starts printing, the board will no
> longer reboot (it hangs continuing to occasionally print the above
> message), and I have to manually reset the device.
>

 So this warning is not temporarily shown but lasts until a reboot,
 right? If so it is a dst refcnt leak.
>>>
>>> Correct, once I get into the state it lasts until a reboot.
>>>
 How reproducible is it for you? From my reading, it seems always
 reproduced when you unplug and plug your usb eth interface?
 Is there anything else involved? For example, network namespace.
>>>
>>> So with 4.13-rc3/4 I seem to trigger it easily, often with the first
>>> unplug of the USB eth adapter.
>>>
>>> But as I get back closer to 4.12, it seemingly becomes harder to
>>> trigger, but sometimes still happens.
>>>
>>> So far, I've not been able to trigger it with 4.12.
>>>
>>> I don't think network namespaces are involved?  Though its out of my
>>> area, so AOSP may be using them these days.  Is there a simple way to
>>> check?
>>>
>>> I'll also do another bisection to see if the bad point moves back any 
>>> further.
>
> So I went through another bisection around and got  9514528d92d4 ipv6:
> call dst_dev_put() properly as the first bad commit again.
>
>> If you see the problem starts to happen on commit
>> 9514528d92d4cbe086499322370155ed69f5d06c, could you try reverting all
>> the following commits:
>> (from new to old)
>> 1eb04e7c9e63 net: reorder all the dst flags
>> a4c2fd7f7891 net: remove DST_NOCACHE flag
>> b2a9c0ed75a3 net: remove DST_NOGC flag
>> 5b7c9a8ff828 net: remove dst gc related code
>> db916649b5dd ipv6: get rid of icmp6 dst garbage collector
>> 587fea741134 ipv6: mark DST_NOGC and remove the operation of dst_free()
>> ad65a2f05695 ipv6: call dst_hold_safe() properly
>> 9514528d92d4 ipv6: call dst_dev_put() properly
>
>
> And reverting this set off of 4.13-rc4 seems to make the issue go away.
>
> Is there anything I can test to help narrow down the specific problem
> with that patchset?
>

Thanks John for confirming.
Let me spend some time on the commits and I will let you know if I
have some debug image for you to try.

Wei


> thanks
> -john

Re: [PATCH v2 net-next 0/7] rtnetlink: allow selected handlers to run without rtnl

2017-08-09 Thread David Ahern

On 8/9/17 6:21 PM, David Miller wrote:
> 
> Ok series applied, let's see where this goes :-)
> 

1 hour in, 1 problem reported

Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-09 Thread John Stultz

On Wed, Aug 9, 2017 at 5:36 PM, Wei Wang  wrote:
> On Wed, Aug 9, 2017 at 4:44 PM, John Stultz  wrote:
>> On Wed, Aug 9, 2017 at 4:34 PM, Cong Wang  wrote:
>>> (Cc'ing Wei whose commit was blamed)
>>>
>>> On Mon, Aug 7, 2017 at 2:15 PM, John Stultz  wrote:
 On Mon, Aug 7, 2017 at 2:05 PM, John Stultz  wrote:
> So, with recent testing with my HiKey board, I've been noticing some
> quirky behavior with my USB eth adapter.
>
> Basically, pluging the usb eth adapter in and then removing it, when
> plugging it back in I often find that its not detected, and the system
> slowly spits out the following message over and over:
>   unregister_netdevice: waiting for eth0 to become free. Usage count = 1

 The other bit is that after this starts printing, the board will no
 longer reboot (it hangs continuing to occasionally print the above
 message), and I have to manually reset the device.

>>>
>>> So this warning is not temporarily shown but lasts until a reboot,
>>> right? If so it is a dst refcnt leak.
>>
>> Correct, once I get into the state it lasts until a reboot.
>>
>>> How reproducible is it for you? From my reading, it seems always
>>> reproduced when you unplug and plug your usb eth interface?
>>> Is there anything else involved? For example, network namespace.
>>
>> So with 4.13-rc3/4 I seem to trigger it easily, often with the first
>> unplug of the USB eth adapter.
>>
>> But as I get back closer to 4.12, it seemingly becomes harder to
>> trigger, but sometimes still happens.
>>
>> So far, I've not been able to trigger it with 4.12.
>>
>> I don't think network namespaces are involved?  Though its out of my
>> area, so AOSP may be using them these days.  Is there a simple way to
>> check?
>>
>> I'll also do another bisection to see if the bad point moves back any 
>> further.

So I went through another bisection around and got  9514528d92d4 ipv6:
call dst_dev_put() properly as the first bad commit again.

> If you see the problem starts to happen on commit
> 9514528d92d4cbe086499322370155ed69f5d06c, could you try reverting all
> the following commits:
> (from new to old)
> 1eb04e7c9e63 net: reorder all the dst flags
> a4c2fd7f7891 net: remove DST_NOCACHE flag
> b2a9c0ed75a3 net: remove DST_NOGC flag
> 5b7c9a8ff828 net: remove dst gc related code
> db916649b5dd ipv6: get rid of icmp6 dst garbage collector
> 587fea741134 ipv6: mark DST_NOGC and remove the operation of dst_free()
> ad65a2f05695 ipv6: call dst_hold_safe() properly
> 9514528d92d4 ipv6: call dst_dev_put() properly


And reverting this set off of 4.13-rc4 seems to make the issue go away.

Is there anything I can test to help narrow down the specific problem
with that patchset?

thanks
-john

Re:Re: Re: Re:Re: Re: [PATCH net] ppp: Fix a scheduling-while-atomic bug in del_chan

2017-08-09 Thread Gao Feng

At 2017-08-10 02:18:44, "Cong Wang"  wrote:
>On Tue, Aug 8, 2017 at 10:13 PM, Gao Feng  wrote:
>> Maybe I didn't show my explanation clearly.
>> I think it won't happen as I mentioned in the last email.
>> Because the pptp_release invokes the synchronize_rcu to make sure it, and 
>> actually there is no one which would invoke del_chan except pptp_release.
>> It is guaranteed by that the pptp_release doesn't put the sock refcnt until 
>> complete all cleanup include marking sk_state as PPPOX_DEAD.
>>
>> In other words, even though the pptp_release is not the last user of this 
>> sock, the other one wouldn't invoke del_chan in pptp_sock_destruct.
>> Because the condition "!(sk->sk_state & PPPOX_DEAD)" must be false.
>
>Only if sock->sk is always non-NULL for pptp_release(), which
>is what I am not sure. If you look at other ->release(), similar checks
>are there too, so not just for pptp.

Yes. It seems only if the release() is invoked twice, the sock->sk would be 
NULL.
But I don't find there is any case which could cause it.

>
>>
>> As summary, the del_chan and pppox_unbind_sock in pptp_sock_destruct are 
>> unnecessary.
>> And it even brings confusing.
>
>Sorry, I can't draw any conclusion for this.

Thank you all the same, and I have learn a lot from you :)
Wish someone which is familiar with these codes could give more details and 
explanations.

Best Regards
Feng

RTNL: assertion failed at ...

2017-08-09 Thread David Ahern

Florian:

Booting top of tree on my host with a VRF configured is spewing traces:

[   24.779911] RTNL: assertion failed at
/home/dsa/kernel.git/net/core/dev.c (5717)
[   24.779984] CPU: 3 PID: 989 Comm: ip Not tainted
4.13.0-rc4-01020-gcd9cb3890b20 #8
[   24.779986] Hardware name: Supermicro X9DAi/X9DAi, BIOS 1.0b 10/17/2012
[   24.779988] Call Trace:
[   24.780002]  dump_stack+0x85/0xc7
[   24.780009]  netdev_master_upper_dev_get+0x5f/0x70
[   24.780017]  if_nlmsg_size+0x158/0x240
[   24.780021]  rtnl_calcit.isra.26+0xa3/0xf0
[   24.780029]  ? dump_rules+0xd0/0xd0
[   24.780033]  rtnetlink_rcv_msg+0x1f0/0x230
[   24.780037]  ? rtnl_calcit.isra.26+0xf0/0xf0
[   24.780043]  netlink_rcv_skb+0xec/0x120
[   24.780047]  rtnetlink_rcv+0x15/0x20
[   24.780050]  netlink_unicast+0x16a/0x210
[   24.780054]  netlink_sendmsg+0x2b8/0x3a0
[   24.780060]  sock_sendmsg+0x38/0x50
[   24.780063]  SYSC_sendto+0x102/0x190
[   24.780069]  ? handle_mm_fault+0xe8/0x210
[   24.780075]  ? __do_page_fault+0x1f6/0x4a0
[   24.780080]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[   24.780083]  SyS_sendto+0xe/0x10
[   24.780089]  entry_SYSCALL_64_fastpath+0x23/0xbd
[   24.780117] RIP: 0033:0x7fc3972d95ed
[   24.780119] RSP: 002b:7fff9fbd4528 EFLAGS: 0246 ORIG_RAX:
002c
[   24.780121] RAX: ffda RBX: 0044bd00 RCX:
7fc3972d95ed
[   24.780122] RDX: 0028 RSI: 7fff9fbd4530 RDI:
0003
[   24.780123] RBP: 7fff9fbd5f77 R08:  R09:

[   24.780124] R10:  R11: 0246 R12:
0002
[   24.780125] R13: 7fff9fbd4750 R14:  R15:



I did not do a bisect but seems pretty obvious that it has to be your
patch set.

David

Re: [PATCH v2 3/4] can: m_can: Update documentation to mention new fixed transceiver binding

2017-08-09 Thread Franklin S Cooper Jr


Hi Rob,
On 08/03/2017 12:07 PM, Rob Herring wrote:
> On Mon, Jul 24, 2017 at 06:05:20PM -0500, Franklin S Cooper Jr wrote:
>> Add information regarding fixed transceiver binding. This is especially
>> important for MCAN since the IP allows CAN FD mode to run significantly
>> faster than what most transceivers are capable of.
>>
>> Signed-off-by: Franklin S Cooper Jr 
>> ---
>> Version 2 changes:
>> Drop unit address
>>
>>  Documentation/devicetree/bindings/net/can/m_can.txt | 10 ++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/net/can/m_can.txt 
>> b/Documentation/devicetree/bindings/net/can/m_can.txt
>> index 9e33177..e4abd2c 100644
>> --- a/Documentation/devicetree/bindings/net/can/m_can.txt
>> +++ b/Documentation/devicetree/bindings/net/can/m_can.txt
>> @@ -43,6 +43,11 @@ Required properties:
>>Please refer to 2.4.1 Message RAM Configuration in
>>Bosch M_CAN user manual for details.
>>  
>> +Optional properties:
>> +- fixed-transceiver : Fixed-transceiver subnode describing maximum speed
> 
> This is a node, not a property. Sub nodes should have their own section.

Fixed in my v4 that I just sent.
> 
>> +  that can be used for CAN and/or CAN-FD modes.  See
>> +  
>> Documentation/devicetree/bindings/net/can/fixed-transceiver.txt
>> +  for details.
>>  Example:
>>  SoC dtsi:
>>  m_can1: can@020e8000 {
>> @@ -64,4 +69,9 @@ Board dts:
>>  pinctrl-names = "default";
>>  pinctrl-0 = <_m_can1>;
>>  status = "enabled";
>> +
>> +fixed-transceiver {
>> +max-arbitration-speed = <100>;
>> +max-data-speed = <500>;
>> +};
>>  };
>> -- 
>> 2.10.0
>>

Re: [PATCH net-next] net: skb_needs_check() removes CHECKSUM_NONE check for tx.

2017-08-09 Thread Tonghao Zhang

Thanks for your work.

On Thu, Aug 10, 2017 at 2:30 AM, Willem de Bruijn
 wrote:
> On Wed, Aug 9, 2017 at 5:04 AM, Tonghao Zhang  
> wrote:
>> This patch reverts the commit 6e7bc478c9a0
>> ("net: skb_needs_check() accepts CHECKSUM_NONE for tx"),
>> because we removed the UFO support.
>>
>> Cc: Eric Dumazet 
>> Cc: Willem de Bruijn 
>> Signed-off-by: Tonghao Zhang 
>
>
> I would wait until net is merged into net-next. This will cause a conflict.
>
> Also, while logically equivalent, it is not a real revert (as in `git
> revert $SHA1`) of that patch.
>
> Aside from those concerns, I agree that the original patch is no
> longer needed now that UFO is reverted.

[PATCH v4 4/4] can: m_can: Add call to of_can_transceiver

2017-08-09 Thread Franklin S Cooper Jr

Add call to new generic functions that provides support via a binding
to limit the arbitration rate and/or data rate imposed by the physical
transceiver connected to the MCAN peripheral.

Signed-off-by: Franklin S Cooper Jr 
---
 drivers/net/can/m_can/m_can.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/can/m_can/m_can.c b/drivers/net/can/m_can/m_can.c
index f4947a7..f72116e 100644
--- a/drivers/net/can/m_can/m_can.c
+++ b/drivers/net/can/m_can/m_can.c
@@ -1649,6 +1649,8 @@ static int m_can_plat_probe(struct platform_device *pdev)
 
devm_can_led_init(dev);
 
+   of_can_transceiver(dev);
+
dev_info(>dev, "%s device registered (irq=%d, version=%d)\n",
 KBUILD_MODNAME, dev->irq, priv->version);
 
-- 
2.9.4.dirty

Re: [PATCH v3 2/4] dt-bindings: can: fixed-transceiver: Add new CAN fixed transceiver bindings

2017-08-09 Thread Franklin S Cooper Jr

Hi Sergei,

On 08/03/2017 10:38 AM, Franklin S Cooper Jr wrote:
> 
> 
> On 08/03/2017 07:22 AM, Sergei Shtylyov wrote:
>> On 08/03/2017 12:48 PM, Franklin S Cooper Jr wrote:
>>
> Add documentation to describe usage of the new fixed transceiver
> binding.
> This new binding is applicable for any CAN device therefore it
> exists as
> its own document.
>
> Signed-off-by: Franklin S Cooper Jr 
> ---
>.../bindings/net/can/fixed-transceiver.txt | 24
> ++
>1 file changed, 24 insertions(+)
>create mode 100644
> Documentation/devicetree/bindings/net/can/fixed-transceiver.txt
>
> diff --git
> a/Documentation/devicetree/bindings/net/can/fixed-transceiver.txt
> b/Documentation/devicetree/bindings/net/can/fixed-transceiver.txt
> new file mode 100644
> index 000..2f58838b
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/can/fixed-transceiver.txt
> @@ -0,0 +1,24 @@
> +Fixed transceiver Device Tree binding
> +--
> +
> +CAN transceiver typically limits the max speed in standard CAN and
> CAN FD
> +modes. Typically these limitations are static and the transceivers
> themselves
> +provide no way to detect this limitation at runtime. For this
> situation,
> +the "fixed-transceiver" node can be used.
> +
> +Required Properties:
> + max-bitrate:a positive non 0 value that determines the max
> +speed that CAN/CAN-FD can run. Any other value
> +will be ignored.
> +
> +Examples:
> +
> +Based on Texas Instrument's TCAN1042HGV CAN Transceiver
> +
> +m_can0 {
> +
> +fixed-transceiver@0 {

 The  (after @) must only be specified if there's "reg"
>>>
>>> Sorry. Fixed this in my v2 and some how it came back. Will fix.
>>>
 prop in the device node. Also, please name the node "can-transceiver@"
 to be more in line with the DT spec. which requires generic node names.
>>>
>>> Its possible for future can transceivers drivers to be created. So I
>>
>>So what? Ah, you are using the node name to match in the CAN drivers...
>>
>>> thought including fixed was important to indicate that this is a "dumb"
>>> transceiver similar to "fixed-link".
>>
>>I'm not sure the "fixed-link" MAC subnode assumed any transceiver at
>> all...
> 
> Your right. I wasn't trying to imply that it does. What I meant was that
> having a node named "can-transceiver" may be a bit confusing in the
> future if can transceiver drivers are created. Prefix of "fixed" atleast
> to me makes it clear that this is something unique or a generic
> transceiver with limitations. Similar to "fixed-link" which is for MACs
> not connected to MDIO managed phy. Calling this subnode
> "can-transceiver" to me would be like renaming "fixed-link" to "phy".
> 
>>
>>> So would "fixed-can-transceiver" be
>>> ok or do you want to go with can-transceiver?
>>
>>I'm somewhat perplexed at this point...
> 
> If my reasoning still didn't change your views then I'll make the switch.

I went ahead and made your suggested switch in my v4. Thanks for taking
the time to review this series.
>>
>> MBR, Sergei

[PATCH v4 2/4] dt-bindings: can: can-transceiver: Document new binding

2017-08-09 Thread Franklin S Cooper Jr

Add documentation to describe usage of the new can-transceiver binding.
This new binding is applicable for any CAN device therefore it exists as
its own document.

Signed-off-by: Franklin S Cooper Jr 
---
Version 4 changes:
Drop unit address.
Switch from using fixed-transceiver to can-transceiver

 .../bindings/net/can/can-transceiver.txt   | 24 ++
 1 file changed, 24 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/net/can/can-transceiver.txt

diff --git a/Documentation/devicetree/bindings/net/can/can-transceiver.txt 
b/Documentation/devicetree/bindings/net/can/can-transceiver.txt
new file mode 100644
index 000..2c31dc0
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/can/can-transceiver.txt
@@ -0,0 +1,24 @@
+Generic CAN transceiver Device Tree binding
+--
+
+CAN transceiver typically limits the max speed in standard CAN and CAN FD
+modes. Typically these limitations are static and the transceivers themselves
+provide no way to detect this limitation at runtime. For this situation,
+the "can-transceiver" node can be used.
+
+Required Properties:
+ max-bitrate:  a positive non 0 value that determines the max
+   speed that CAN/CAN-FD can run. Any other value
+   will be ignored.
+
+Examples:
+
+Based on Texas Instrument's TCAN1042HGV CAN Transceiver
+
+m_can0 {
+   
+   can-transceiver@ {
+   max-bitrate = <500>;
+   };
+   ...
+};
-- 
2.9.4.dirty

[PATCH v4 0/4] can: Support transceiver based bit rate limit

2017-08-09 Thread Franklin S Cooper Jr

Add a new generic binding that CAN drivers can be used to specify the max
bit rate supported by a transceiver. This is useful since in some instances
since the maximum speeds may be limited by the transceiver used. However,
transceivers may not provide a means to determine this limitation at
runtime. Therefore, create a new binding that mimics "fixed-link" that
allows a user to hardcode the max speeds that can be used.

Also add support for this new binding in the MCAN driver.

Note this is an optional subnode so even if a driver adds support for
parsing can-transceiver the user does not have to define it in their
device tree.

Version 4 changes:
Switch from fixed-transceiver to can-transceiver
Drop unit address that snuck back in again.
Indicate that can-transceiver is a subnode and not a property in
documentation

Version 3 changes:
Switch from having two "max bitrates" to one universal bitrate.

Version 2 changes:
Rename function
Define proper variable default
Drop unit address
Move check to changelink function
Reword commit message
Reword documentation

Franklin S Cooper Jr (4):
  can: dev: Add support for limiting configured bitrate
  dt-bindings: can: can-transceiver: Document new binding
  dt-bindings: can: m_can: Document new can transceiver binding
  can: m_can: Add call to of_can_transceiver

 .../bindings/net/can/can-transceiver.txt   | 24 +++
 .../devicetree/bindings/net/can/m_can.txt  |  9 
 drivers/net/can/dev.c  | 50 ++
 drivers/net/can/m_can/m_can.c  |  2 +
 include/linux/can/dev.h|  5 +++
 5 files changed, 90 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/net/can/can-transceiver.txt

-- 
2.9.4.dirty

[PATCH v4 1/4] can: dev: Add support for limiting configured bitrate

2017-08-09 Thread Franklin S Cooper Jr

Various CAN or CAN-FD IP may be able to run at a faster rate than
what the transceiver the CAN node is connected to. This can lead to
unexpected errors. However, CAN transceivers typically have fixed
limitations and provide no means to discover these limitations at
runtime. Therefore, add support for a can-transceiver node that
can be reused by other CAN peripheral drivers to determine for both
CAN and CAN-FD what the max bitrate that can be used. If the user
tries to configure CAN to pass these maximum bitrates it will throw
an error.

Signed-off-by: Franklin S Cooper Jr 
---
Version 4 changes:
Used can-transceiver instead of fixed-transceiver.

 drivers/net/can/dev.c   | 50 +
 include/linux/can/dev.h |  5 +
 2 files changed, 55 insertions(+)

diff --git a/drivers/net/can/dev.c b/drivers/net/can/dev.c
index 365a8cc..372108f 100644
--- a/drivers/net/can/dev.c
+++ b/drivers/net/can/dev.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #define MOD_DESC "CAN device driver interface"
@@ -814,6 +815,39 @@ int open_candev(struct net_device *dev)
 }
 EXPORT_SYMBOL_GPL(open_candev);
 
+#ifdef CONFIG_OF
+/*
+ * Common function that can be used to understand the limitation of
+ * a transceiver when it provides no means to determine these limitations
+ * at runtime.
+ */
+void of_can_transceiver(struct net_device *dev)
+{
+   struct device_node *dn;
+   struct can_priv *priv = netdev_priv(dev);
+   struct device_node *np;
+   unsigned int max_bitrate;
+   int ret;
+
+   np = dev->dev.parent->of_node;
+
+   dn = of_get_child_by_name(np, "can-transceiver");
+   if (!dn)
+   return;
+
+   max_bitrate = 0;
+   ret = of_property_read_u32(dn, "max-bitrate", _bitrate);
+
+   if (max_bitrate > 0) {
+   priv->max_bitrate = max_bitrate;
+   priv->is_bitrate_limited = true;
+   } else if (ret != -EINVAL) {
+   netdev_warn(dev, "Invalid value for transceiver max bitrate\n");
+   }
+}
+EXPORT_SYMBOL(of_can_transceiver);
+#endif
+
 /*
  * Common close function for cleanup before the device gets closed.
  *
@@ -913,6 +947,14 @@ static int can_changelink(struct net_device *dev, struct 
nlattr *tb[],
priv->bitrate_const_cnt);
if (err)
return err;
+
+   if (priv->is_bitrate_limited &&
+   bt.bitrate > priv->max_bitrate) {
+   netdev_err(dev, "arbitration bitrate surpasses 
transceiver capabilities of %d bps\n",
+  priv->max_bitrate);
+   return -EINVAL;
+   }
+
memcpy(>bittiming, , sizeof(bt));
 
if (priv->do_set_bittiming) {
@@ -997,6 +1039,14 @@ static int can_changelink(struct net_device *dev, struct 
nlattr *tb[],
priv->data_bitrate_const_cnt);
if (err)
return err;
+
+   if (priv->is_bitrate_limited &&
+   dbt.bitrate > priv->max_bitrate) {
+   netdev_err(dev, "canfd data bitrate surpasses 
transceiver capabilities of %d bps\n",
+  priv->max_bitrate);
+   return -EINVAL;
+   }
+
memcpy(>data_bittiming, , sizeof(dbt));
 
if (priv->do_set_data_bittiming) {
diff --git a/include/linux/can/dev.h b/include/linux/can/dev.h
index 141b05a..5519f59 100644
--- a/include/linux/can/dev.h
+++ b/include/linux/can/dev.h
@@ -47,6 +47,9 @@ struct can_priv {
unsigned int data_bitrate_const_cnt;
struct can_clock clock;
 
+   unsigned int max_bitrate;
+   bool is_bitrate_limited;
+
enum can_state state;
 
/* CAN controller features - see include/uapi/linux/can/netlink.h */
@@ -165,6 +168,8 @@ void can_put_echo_skb(struct sk_buff *skb, struct 
net_device *dev,
 unsigned int can_get_echo_skb(struct net_device *dev, unsigned int idx);
 void can_free_echo_skb(struct net_device *dev, unsigned int idx);
 
+void of_can_transceiver(struct net_device *dev);
+
 struct sk_buff *alloc_can_skb(struct net_device *dev, struct can_frame **cf);
 struct sk_buff *alloc_canfd_skb(struct net_device *dev,
struct canfd_frame **cfd);
-- 
2.9.4.dirty

[PATCH v4 3/4] dt-bindings: can: m_can: Document new can transceiver binding

2017-08-09 Thread Franklin S Cooper Jr

Add information regarding can-transceiver binding. This is especially
important for MCAN since the IP allows CAN FD mode to run significantly
faster than what most transceivers are capable of.

Signed-off-by: Franklin S Cooper Jr 
---
Drop unit address.
Switch from using fixed-transceiver to can-transceiver
Indicate that can-transceiver is an optional subnode not a property.

 Documentation/devicetree/bindings/net/can/m_can.txt | 9 +
 1 file changed, 9 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/can/m_can.txt 
b/Documentation/devicetree/bindings/net/can/m_can.txt
index 9e33177..ee90aac 100644
--- a/Documentation/devicetree/bindings/net/can/m_can.txt
+++ b/Documentation/devicetree/bindings/net/can/m_can.txt
@@ -43,6 +43,11 @@ Required properties:
  Please refer to 2.4.1 Message RAM Configuration in
  Bosch M_CAN user manual for details.
 
+Optional Subnode:
+- can-transceiver  : Can-transceiver subnode describing maximum speed
+ that can be used for CAN/CAN-FD modes. See
+ 
Documentation/devicetree/bindings/net/can/can-transceiver.txt
+ for details.
 Example:
 SoC dtsi:
 m_can1: can@020e8000 {
@@ -64,4 +69,8 @@ Board dts:
pinctrl-names = "default";
pinctrl-0 = <_m_can1>;
status = "enabled";
+
+   can-transceiver@ {
+   max-bitrate = <500>;
+   };
 };
-- 
2.9.4.dirty

Re: [PATCH v1 net] TCP_USER_TIMEOUT and tcp_keepalive should conform to RFC5482

2017-08-09 Thread David Miller

From: Rao Shoaib 
Date: Wed, 9 Aug 2017 17:47:57 -0700

> 
> 
> On 08/09/2017 05:30 PM, David Miller wrote:
>> From: Joe Smith 
>> Date: Wed, 9 Aug 2017 17:20:32 -0700
>>
>>> Making Linux conform to standards and behavior that is logical seems
>>> like a good enough reason.
>> That's an awesome attitude to have when we're implementing something
>> new and don't have the facility already.
>>
>> But when we have something already the only important consideration is
>> not breaking existing apps which rely on that behavior.
>>
>> That is much, much, more important than standards compliance.
>>
>> If users are confused, just fix the documentation.
> David,
> 
> If it was just confusion than sure fixing the documentation is
> fine. What if the logic is incorrect, does not conform to the standard
> that is says it is implementing and easy to fix with little or no risk
> of breakage.
> 
> The proposed patch changes a feature that no one uses. It also imposes
> the relation ship between keepalive and timeout values that is
> required by the RFC and make sense.
> 
> You are the final authority, if you say we should just fix the
> documentation than that is fine.

I want to hear more about what hkchu and ycheng have to say about
this.

Re: [PATCH v1 net] TCP_USER_TIMEOUT and tcp_keepalive should conform to RFC5482

2017-08-09 Thread Rao Shoaib

On 08/09/2017 05:30 PM, David Miller wrote:

From: Joe Smith 
Date: Wed, 9 Aug 2017 17:20:32 -0700

Making Linux conform to standards and behavior that is logical seems
like a good enough reason.

That's an awesome attitude to have when we're implementing something
new and don't have the facility already.

But when we have something already the only important consideration is
not breaking existing apps which rely on that behavior.

That is much, much, more important than standards compliance.

If users are confused, just fix the documentation.

David,

If it was just confusion than sure fixing the documentation is fine. 
What if the logic is incorrect, does not conform to the standard that is 
says it is implementing and easy to fix with little or no risk of breakage.

The proposed patch changes a feature that no one uses. It also imposes 
the relation ship between keepalive and timeout values that is required 
by the RFC and make sense.

You are the final authority, if you say we should just fix the 
documentation than that is fine.

Shoaib

[PATCH net-next 04/10] netvsc: check error return when restoring channels and mtu

2017-08-09 Thread Stephen Hemminger

If setting new values fails, and the attempt to restore original
settings fails. Then log an error and leave device down.
This should never happen, but if it does don't go down in flames.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc_drv.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 07015b1c42c6..c7391889938b 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -845,7 +845,13 @@ static int netvsc_set_channels(struct net_device *net,
} else {
ret = PTR_ERR(nvdev);
device_info.num_chn = orig;
-   rndis_filter_device_add(dev, _info);
+   nvdev = rndis_filter_device_add(dev, _info);
+
+   if (IS_ERR(nvdev)) {
+   netdev_err(net, "restoring channel setting failed: 
%ld\n",
+  PTR_ERR(nvdev));
+   return ret;
+   }
}
 
if (was_opened)
@@ -953,10 +959,16 @@ static int netvsc_change_mtu(struct net_device *ndev, int 
mtu)
 
/* Attempt rollback to original MTU */
ndev->mtu = orig_mtu;
-   rndis_filter_device_add(hdev, _info);
+   nvdev = rndis_filter_device_add(hdev, _info);
 
if (vf_netdev)
dev_set_mtu(vf_netdev, orig_mtu);
+
+   if (IS_ERR(nvdev)) {
+   netdev_err(ndev, "restoring mtu failed: %ld\n",
+  PTR_ERR(nvdev));
+   return ret;
+   }
}
 
if (was_opened)
-- 
2.11.0

[PATCH net-next 08/10] netvsc: remove unnecessary check for NULL hdr

2017-08-09 Thread Stephen Hemminger

The function init_page_array is always called with a valid pointer
to RNDIS header. No check for NULL is needed.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc_drv.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 9f89de17b5fa..7b465e40869b 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -333,10 +333,9 @@ static u32 init_page_array(void *hdr, u32 len, struct 
sk_buff *skb,
 * 2. skb linear data
 * 3. skb fragment data
 */
-   if (hdr != NULL)
-   slots_used += fill_pg_buf(virt_to_page(hdr),
-   offset_in_page(hdr),
-   len, [slots_used]);
+   slots_used += fill_pg_buf(virt_to_page(hdr),
+ offset_in_page(hdr),
+ len, [slots_used]);
 
packet->rmsg_size = len;
packet->rmsg_pgcnt = slots_used;
-- 
2.11.0

[PATCH net-next 09/10] netvsc: allow controlling send/recv buffer size

2017-08-09 Thread Stephen Hemminger

Control the size of the buffer areas via ethtool ring settings.
They aren't really traditional hardware rings, but host API breaks
receive and send buffer into chunks. The final size of the chunks are
controlled by the host.

The default value of send and receive buffer area for host DMA
is much larger than it needs to be. Experimentation shows that
4M receive and 1M send is sufficient.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/hyperv_net.h |   9 ++--
 drivers/net/hyperv/netvsc.c |  70 +---
 drivers/net/hyperv/netvsc_drv.c | 117 ++--
 3 files changed, 157 insertions(+), 39 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index f620c90307ed..4ccfc335a356 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -148,6 +148,8 @@ struct netvsc_device_info {
unsigned char mac_adr[ETH_ALEN];
int  ring_size;
u32  num_chn;
+   u32  send_sections;
+   u32  recv_sections;
 };
 
 enum rndis_device_state {
@@ -634,12 +636,12 @@ struct nvsp_message {
 #define NETVSC_SEND_BUFFER_SIZE(1024 * 1024 * 15)   /* 
15MB */
 #define NETVSC_INVALID_INDEX   -1
 
+#define NETVSC_SEND_SECTION_SIZE   6144
+#define NETVSC_RECV_SECTION_SIZE   1728
 
 #define NETVSC_RECEIVE_BUFFER_ID   0xcafe
 #define NETVSC_SEND_BUFFER_ID  0
 
-#define NETVSC_PACKET_SIZE  4096
-
 #define VRSS_SEND_TAB_SIZE 16  /* must be power of 2 */
 #define VRSS_CHANNEL_MAX 64
 #define VRSS_CHANNEL_DEFAULT 8
@@ -754,14 +756,13 @@ struct netvsc_device {
 
/* Receive buffer allocated by us but manages by NetVSP */
void *recv_buf;
-   u32 recv_buf_size;
u32 recv_buf_gpadl_handle;
u32 recv_section_cnt;
+   u32 recv_section_size;
u32 recv_completion_cnt;
 
/* Send buffer allocated by us */
void *send_buf;
-   u32 send_buf_size;
u32 send_buf_gpadl_handle;
u32 send_section_cnt;
u32 send_section_size;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 91b0674cfdb0..5b3a2b8fa77e 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -75,6 +75,10 @@ static struct netvsc_device *alloc_net_device(void)
atomic_set(_device->open_cnt, 0);
net_device->max_pkt = RNDIS_MAX_PKT_DEFAULT;
net_device->pkt_align = RNDIS_PKT_ALIGN_DEFAULT;
+
+   net_device->recv_section_size = NETVSC_RECV_SECTION_SIZE;
+   net_device->send_section_size = NETVSC_SEND_SECTION_SIZE;
+
init_completion(_device->channel_init_wait);
 
return net_device;
@@ -142,6 +146,7 @@ static void netvsc_destroy_buf(struct hv_device *device)
"revoke receive buffer to netvsp\n");
return;
}
+   net_device->recv_section_cnt = 0;
}
 
/* Teardown the gpadl on the vsp end */
@@ -172,7 +177,7 @@ static void netvsc_destroy_buf(struct hv_device *device)
 * NVSP_MSG1_TYPE_SEND_SEND_BUF msg) therefore, we need
 * to send a revoke msg here
 */
-   if (net_device->send_section_size) {
+   if (net_device->send_section_cnt) {
/* Send the revoke receive buffer */
revoke_packet = _device->revoke_packet;
memset(revoke_packet, 0, sizeof(struct nvsp_message));
@@ -204,6 +209,7 @@ static void netvsc_destroy_buf(struct hv_device *device)
   "revoke send buffer to netvsp\n");
return;
}
+   net_device->send_section_cnt = 0;
}
/* Teardown the gpadl on the vsp end */
if (net_device->send_buf_gpadl_handle) {
@@ -243,18 +249,25 @@ int netvsc_alloc_recv_comp_ring(struct netvsc_device 
*net_device, u32 q_idx)
 }
 
 static int netvsc_init_buf(struct hv_device *device,
-  struct netvsc_device *net_device)
+  struct netvsc_device *net_device,
+  const struct netvsc_device_info *device_info)
 {
struct nvsp_1_message_send_receive_buffer_complete *resp;
struct net_device *ndev = hv_get_drvdata(device);
struct nvsp_message *init_packet;
+   unsigned int buf_size;
size_t map_words;
int ret = 0;
 
-   net_device->recv_buf = vzalloc(net_device->recv_buf_size);
+   /* Get receive buffer area. */
+   buf_size = device_info->recv_sections * net_device->recv_section_size;
+   buf_size = roundup(buf_size, PAGE_SIZE);
+
+   net_device->recv_buf = vzalloc(buf_size);
if (!net_device->recv_buf) {
-   netdev_err(ndev, "unable to allocate receive "
-   "buffer of size %d\n", net_device->recv_buf_size);
+

[PATCH net-next 07/10] netvsc: remove unnecessary cast of void pointer

2017-08-09 Thread Stephen Hemminger

Assignment to a typed pointer is sufficient in C.
No cast is needed.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc_drv.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 3219d2e8918f..9f89de17b5fa 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -523,9 +523,9 @@ static int netvsc_start_xmit(struct sk_buff *skb, struct 
net_device *net)
 
rndis_msg_size += NDIS_VLAN_PPI_SIZE;
ppi = init_ppi_data(rndis_msg, NDIS_VLAN_PPI_SIZE,
-   IEEE_8021Q_INFO);
-   vlan = (struct ndis_pkt_8021q_info *)((void *)ppi +
-   ppi->ppi_offset);
+   IEEE_8021Q_INFO);
+
+   vlan = (void *)ppi + ppi->ppi_offset;
vlan->vlanid = skb->vlan_tci & VLAN_VID_MASK;
vlan->pri = (skb->vlan_tci & VLAN_PRIO_MASK) >>
VLAN_PRIO_SHIFT;
@@ -538,8 +538,7 @@ static int netvsc_start_xmit(struct sk_buff *skb, struct 
net_device *net)
ppi = init_ppi_data(rndis_msg, NDIS_LSO_PPI_SIZE,
TCP_LARGESEND_PKTINFO);
 
-   lso_info = (struct ndis_tcp_lso_info *)((void *)ppi +
-   ppi->ppi_offset);
+   lso_info = (void *)ppi + ppi->ppi_offset;
 
lso_info->lso_v2_transmit.type = 
NDIS_TCP_LARGE_SEND_OFFLOAD_V2_TYPE;
if (skb->protocol == htons(ETH_P_IP)) {
-- 
2.11.0

[PATCH net-next 03/10] netvsc: propagate MAC address change to VF slave

2017-08-09 Thread Stephen Hemminger

If VF is slaved to synthetic device, then any change to netvsc
MAC address should be propagated to the slave device.

If slave device doesn't support MAC address change then it
should also be an error to attempt to change synthetic NIC MAC
address.

It also fixes the error unwind in the original code.
If give a bad address, the old code would change the device
MAC address anyway.

Reviewed-by: Haiyang Zhang 
Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc_drv.c | 26 +++---
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index e059375a6d8c..07015b1c42c6 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1053,27 +1053,31 @@ static void netvsc_get_stats64(struct net_device *net,
 static int netvsc_set_mac_addr(struct net_device *ndev, void *p)
 {
struct net_device_context *ndc = netdev_priv(ndev);
+   struct net_device *vf_netdev = rtnl_dereference(ndc->vf_netdev);
struct netvsc_device *nvdev = rtnl_dereference(ndc->nvdev);
struct sockaddr *addr = p;
-   char save_adr[ETH_ALEN];
-   unsigned char save_aatype;
int err;
 
-   memcpy(save_adr, ndev->dev_addr, ETH_ALEN);
-   save_aatype = ndev->addr_assign_type;
-
-   err = eth_mac_addr(ndev, p);
-   if (err != 0)
+   err = eth_prepare_mac_addr_change(ndev, p);
+   if (err)
return err;
 
if (!nvdev)
return -ENODEV;
 
+   if (vf_netdev) {
+   err = dev_set_mac_address(vf_netdev, addr);
+   if (err)
+   return err;
+   }
+
err = rndis_filter_set_device_mac(nvdev, addr->sa_data);
-   if (err != 0) {
-   /* roll back to saved MAC */
-   memcpy(ndev->dev_addr, save_adr, ETH_ALEN);
-   ndev->addr_assign_type = save_aatype;
+   if (!err) {
+   eth_commit_mac_addr_change(ndev, p);
+   } else if (vf_netdev) {
+   /* rollback change on VF */
+   memcpy(addr->sa_data, ndev->dev_addr, ETH_ALEN);
+   dev_set_mac_address(vf_netdev, addr);
}
 
return err;
-- 
2.11.0

[PATCH net-next 05/10] netvsc: no need to allocate send/receive on numa node

2017-08-09 Thread Stephen Hemminger

The send and receive buffers are both per-device (not per-channel).
The associated NUMA node is a property of the CPU which is per-channel
therefore it makes no sense to force the receive/send buffer to be
allocated on a particular node (since it is a shared resource).

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc.c | 19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index c842265d1df9..91b0674cfdb0 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -245,20 +245,13 @@ int netvsc_alloc_recv_comp_ring(struct netvsc_device 
*net_device, u32 q_idx)
 static int netvsc_init_buf(struct hv_device *device,
   struct netvsc_device *net_device)
 {
-   int ret = 0;
-   struct nvsp_message *init_packet;
struct nvsp_1_message_send_receive_buffer_complete *resp;
-   struct net_device *ndev;
+   struct net_device *ndev = hv_get_drvdata(device);
+   struct nvsp_message *init_packet;
size_t map_words;
-   int node;
-
-   ndev = hv_get_drvdata(device);
-
-   node = cpu_to_node(device->channel->target_cpu);
-   net_device->recv_buf = vzalloc_node(net_device->recv_buf_size, node);
-   if (!net_device->recv_buf)
-   net_device->recv_buf = vzalloc(net_device->recv_buf_size);
+   int ret = 0;
 
+   net_device->recv_buf = vzalloc(net_device->recv_buf_size);
if (!net_device->recv_buf) {
netdev_err(ndev, "unable to allocate receive "
"buffer of size %d\n", net_device->recv_buf_size);
@@ -339,9 +332,7 @@ static int netvsc_init_buf(struct hv_device *device,
goto cleanup;
 
/* Now setup the send buffer. */
-   net_device->send_buf = vzalloc_node(net_device->send_buf_size, node);
-   if (!net_device->send_buf)
-   net_device->send_buf = vzalloc(net_device->send_buf_size);
+   net_device->send_buf = vzalloc(net_device->send_buf_size);
if (!net_device->send_buf) {
netdev_err(ndev, "unable to allocate send "
   "buffer of size %d\n", net_device->send_buf_size);
-- 
2.11.0

[PATCH net-next 01/10] netvsc: delay setup of VF device

2017-08-09 Thread Stephen Hemminger

When VF device is discovered, delay bring it automatically up in
order to allow userspace to some simple changes (like renaming).

Reported-by: Vitaly Kuznetsov 
Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/hyperv_net.h |  2 +-
 drivers/net/hyperv/netvsc_drv.c | 15 ---
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index d1ea99a12cf2..f620c90307ed 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -723,7 +723,7 @@ struct net_device_context {
/* State to manage the associated VF interface. */
struct net_device __rcu *vf_netdev;
struct netvsc_vf_pcpu_stats __percpu *vf_stats;
-   struct work_struct vf_takeover;
+   struct delayed_work vf_takeover;
 
/* 1: allocated, serial number is valid. 0: not allocated */
u32 vf_alloc;
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index eb0023f55fe1..e059375a6d8c 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -47,6 +47,7 @@
 
 #define RING_SIZE_MIN 64
 #define LINKCHANGE_INT (2 * HZ)
+#define VF_TAKEOVER_INT (HZ / 10)
 
 static int ring_size = 128;
 module_param(ring_size, int, S_IRUGO);
@@ -1559,7 +1560,9 @@ static int netvsc_vf_join(struct net_device *vf_netdev,
/* set slave flag before open to prevent IPv6 addrconf */
vf_netdev->flags |= IFF_SLAVE;
 
-   schedule_work(_ctx->vf_takeover);
+   schedule_delayed_work(_ctx->vf_takeover, VF_TAKEOVER_INT);
+
+   call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
 
netdev_info(vf_netdev, "joined to %s\n", ndev->name);
return 0;
@@ -1575,8 +1578,6 @@ static void __netvsc_vf_setup(struct net_device *ndev,
 {
int ret;
 
-   call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
-
/* Align MTU of VF with master */
ret = dev_set_mtu(vf_netdev, ndev->mtu);
if (ret)
@@ -1597,12 +1598,12 @@ static void __netvsc_vf_setup(struct net_device *ndev,
 static void netvsc_vf_setup(struct work_struct *w)
 {
struct net_device_context *ndev_ctx
-   = container_of(w, struct net_device_context, vf_takeover);
+   = container_of(w, struct net_device_context, vf_takeover.work);
struct net_device *ndev = hv_get_drvdata(ndev_ctx->device_ctx);
struct net_device *vf_netdev;
 
if (!rtnl_trylock()) {
-   schedule_work(w);
+   schedule_delayed_work(_ctx->vf_takeover, 0);
return;
}
 
@@ -1706,7 +1707,7 @@ static int netvsc_unregister_vf(struct net_device 
*vf_netdev)
return NOTIFY_DONE;
 
net_device_ctx = netdev_priv(ndev);
-   cancel_work_sync(_device_ctx->vf_takeover);
+   cancel_delayed_work_sync(_device_ctx->vf_takeover);
 
netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
 
@@ -1748,7 +1749,7 @@ static int netvsc_probe(struct hv_device *dev,
 
spin_lock_init(_device_ctx->lock);
INIT_LIST_HEAD(_device_ctx->reconfig_events);
-   INIT_WORK(_device_ctx->vf_takeover, netvsc_vf_setup);
+   INIT_DELAYED_WORK(_device_ctx->vf_takeover, netvsc_vf_setup);
 
net_device_ctx->vf_stats
= netdev_alloc_pcpu_stats(struct netvsc_vf_pcpu_stats);
-- 
2.11.0

[PATCH net-next 02/10] netvsc: don't signal host twice if empty

2017-08-09 Thread Stephen Hemminger

When hv_pkt_iter_next() returns NULL, it has already called
hv_pkt_iter_close(). Calling it twice can lead to extra host signal.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 208f03aa83de..c842265d1df9 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1190,10 +1190,6 @@ int netvsc_poll(struct napi_struct *napi, int budget)
nvchan->desc = hv_pkt_iter_next(channel, nvchan->desc);
}
 
-   /* if ring is empty, signal host */
-   if (!nvchan->desc)
-   hv_pkt_iter_close(channel);
-
/* If send of pending receive completions suceeded
 *   and did not exhaust NAPI budget this time
 *   and not doing busy poll
-- 
2.11.0

[PATCH net-next 06/10] netvsc: whitespace cleanup

2017-08-09 Thread Stephen Hemminger

Fix some minor indentation issues.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc_drv.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index c7391889938b..3219d2e8918f 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -165,7 +165,7 @@ static int netvsc_close(struct net_device *net)
 }
 
 static void *init_ppi_data(struct rndis_message *msg, u32 ppi_size,
-   int pkt_type)
+  int pkt_type)
 {
struct rndis_packet *rndis_pkt;
struct rndis_per_packet_info *ppi;
@@ -286,7 +286,7 @@ static u16 netvsc_select_queue(struct net_device *ndev, 
struct sk_buff *skb,
 }
 
 static u32 fill_pg_buf(struct page *page, u32 offset, u32 len,
-   struct hv_page_buffer *pb)
+  struct hv_page_buffer *pb)
 {
int j = 0;
 
@@ -626,6 +626,7 @@ static int netvsc_start_xmit(struct sk_buff *skb, struct 
net_device *net)
++net_device_ctx->eth_stats.tx_no_memory;
goto drop;
 }
+
 /*
  * netvsc_linkstatus_callback - Link up/down notification
  */
@@ -649,8 +650,8 @@ void netvsc_linkstatus_callback(struct hv_device 
*device_obj,
if (indicate->status == RNDIS_STATUS_LINK_SPEED_CHANGE) {
u32 speed;
 
-   speed = *(u32 *)((void *)indicate + indicate->
-status_buf_offset) / 1;
+   speed = *(u32 *)((void *)indicate
++ indicate->status_buf_offset) / 1;
ndev_ctx->speed = speed;
return;
}
@@ -1018,7 +1019,7 @@ static void netvsc_get_stats64(struct net_device *net,
struct net_device_context *ndev_ctx = netdev_priv(net);
struct netvsc_device *nvdev = rcu_dereference_rtnl(ndev_ctx->nvdev);
struct netvsc_vf_pcpu_stats vf_tot;
-   int i;
+   int i;
 
if (!nvdev)
return;
-- 
2.11.0

[PATCH net-next 10/10] netvsc: keep track of some non-fatal overload conditions

2017-08-09 Thread Stephen Hemminger

Add ethtool statistics for case where send chimmeny buffer is
exhausted and driver has to fall back to doing scatter/gather
send. Also, add statistic for case where ring buffer is full and
receive completions are delayed.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/hyperv_net.h |  2 ++
 drivers/net/hyperv/netvsc.c | 19 +--
 drivers/net/hyperv/netvsc_drv.c |  2 ++
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 4ccfc335a356..345786321879 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -680,6 +680,8 @@ struct netvsc_ethtool_stats {
unsigned long tx_no_space;
unsigned long tx_too_big;
unsigned long tx_busy;
+   unsigned long tx_send_full;
+   unsigned long rx_comp_busy;
 };
 
 struct netvsc_vf_pcpu_stats {
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 5b3a2b8fa77e..938c9f3d2ea6 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -882,7 +882,9 @@ int netvsc_send(struct net_device_context *ndev_ctx,
} else if (pktlen + net_device->pkt_align <
   net_device->send_section_size) {
section_index = netvsc_get_next_send_section(net_device);
-   if (section_index != NETVSC_INVALID_INDEX) {
+   if (unlikely(section_index == NETVSC_INVALID_INDEX)) {
+   ++ndev_ctx->eth_stats.tx_send_full;
+   } else {
move_pkt_msd(_send, _skb, msdp);
msd_len = 0;
}
@@ -948,9 +950,10 @@ int netvsc_send(struct net_device_context *ndev_ctx,
 }
 
 /* Send pending recv completions */
-static int send_recv_completions(struct netvsc_channel *nvchan)
+static int send_recv_completions(struct net_device *ndev,
+struct netvsc_device *nvdev,
+struct netvsc_channel *nvchan)
 {
-   struct netvsc_device *nvdev = nvchan->net_device;
struct multi_recv_comp *mrc = >mrc;
struct recv_comp_msg {
struct nvsp_message_header hdr;
@@ -968,8 +971,12 @@ static int send_recv_completions(struct netvsc_channel 
*nvchan)
msg.status = rcd->status;
ret = vmbus_sendpacket(nvchan->channel, , sizeof(msg),
   rcd->tid, VM_PKT_COMP, 0);
-   if (unlikely(ret))
+   if (unlikely(ret)) {
+   struct net_device_context *ndev_ctx = netdev_priv(ndev);
+
+   ++ndev_ctx->eth_stats.rx_comp_busy;
return ret;
+   }
 
if (++mrc->first == nvdev->recv_completion_cnt)
mrc->first = 0;
@@ -1010,7 +1017,7 @@ static void enq_receive_complete(struct net_device *ndev,
recv_comp_slot_avail(nvdev, mrc, , );
 
if (unlikely(filled > NAPI_POLL_WEIGHT)) {
-   send_recv_completions(nvchan);
+   send_recv_completions(ndev, nvdev, nvchan);
recv_comp_slot_avail(nvdev, mrc, , );
}
 
@@ -1193,7 +1200,7 @@ int netvsc_poll(struct napi_struct *napi, int budget)
 * then re-enable host interrupts
 * and reschedule if ring is not empty.
 */
-   if (send_recv_completions(nvchan) == 0 &&
+   if (send_recv_completions(ndev, net_device, nvchan) == 0 &&
work_done < budget &&
napi_complete_done(napi, work_done) &&
hv_end_read(>inbound)) {
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 873c83a66cc2..b33f0507c373 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1112,6 +1112,8 @@ static const struct {
{ "tx_no_space",  offsetof(struct netvsc_ethtool_stats, tx_no_space) },
{ "tx_too_big",   offsetof(struct netvsc_ethtool_stats, tx_too_big) },
{ "tx_busy",  offsetof(struct netvsc_ethtool_stats, tx_busy) },
+   { "tx_send_full", offsetof(struct netvsc_ethtool_stats, tx_send_full) },
+   { "rx_comp_busy", offsetof(struct netvsc_ethtool_stats, rx_comp_busy) },
 }, vf_stats[] = {
{ "vf_rx_packets", offsetof(struct netvsc_vf_pcpu_stats, rx_packets) },
{ "vf_rx_bytes",   offsetof(struct netvsc_vf_pcpu_stats, rx_bytes) },
-- 
2.11.0

[PATCH net-next 00/10] netvsc: minor fixes and improvements

2017-08-09 Thread Stephen Hemminger

These are non-critical bug fixes, related to functionality now in net-next.
 1. delaying the automatic bring up of VF device to allow udev to change name.
 2. performance improvement
 3. handle MAC address change with VF; mostly propogate the error that VF gives.
 4. minor cleanups
 5. allow setting send/receive buffer size with ethtool.

Stephen Hemminger (10):
  netvsc: delay setup of VF device
  netvsc: don't signal host twice if empty
  netvsc: propagate MAC address change to VF slave
  netvsc: check error return when restoring channels and mtu
  netvsc: no need to allocate send/receive on numa node
  netvsc: whitespace cleanup
  netvsc: remove unnecessary cast of void pointer
  netvsc: remove unnecessary check for NULL hdr
  netvsc: allow controlling send/recv buffer size
  netvsc: keep track of some non-fatal overload conditions

 drivers/net/hyperv/hyperv_net.h |  13 ++-
 drivers/net/hyperv/netvsc.c | 106 ++---
 drivers/net/hyperv/netvsc_drv.c | 203 
 3 files changed, 227 insertions(+), 95 deletions(-)

-- 
2.11.0

Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-09 Thread John Stultz

On Wed, Aug 9, 2017 at 5:36 PM, Wei Wang  wrote:
>
> Does your USB adapter get an IPv6 address?

Yes, it does.

> If you see the problem starts to happen on commit
> 9514528d92d4cbe086499322370155ed69f5d06c, could you try reverting all
> the following commits:
> (from new to old)
> 1eb04e7c9e63 net: reorder all the dst flags
> a4c2fd7f7891 net: remove DST_NOCACHE flag
> b2a9c0ed75a3 net: remove DST_NOGC flag
> 5b7c9a8ff828 net: remove dst gc related code
> db916649b5dd ipv6: get rid of icmp6 dst garbage collector
> 587fea741134 ipv6: mark DST_NOGC and remove the operation of dst_free()
> ad65a2f05695 ipv6: call dst_hold_safe() properly
> 9514528d92d4 ipv6: call dst_dev_put() properly
>
> and try if it starts to work?

I'll give that a shot!

Thanks so much for the help! I really appreciate it!
-john

Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-09 Thread Wei Wang

On Wed, Aug 9, 2017 at 4:44 PM, John Stultz  wrote:
> On Wed, Aug 9, 2017 at 4:34 PM, Cong Wang  wrote:
>> (Cc'ing Wei whose commit was blamed)
>>
>> On Mon, Aug 7, 2017 at 2:15 PM, John Stultz  wrote:
>>> On Mon, Aug 7, 2017 at 2:05 PM, John Stultz  wrote:
 So, with recent testing with my HiKey board, I've been noticing some
 quirky behavior with my USB eth adapter.

 Basically, pluging the usb eth adapter in and then removing it, when
 plugging it back in I often find that its not detected, and the system
 slowly spits out the following message over and over:
   unregister_netdevice: waiting for eth0 to become free. Usage count = 1
>>>
>>> The other bit is that after this starts printing, the board will no
>>> longer reboot (it hangs continuing to occasionally print the above
>>> message), and I have to manually reset the device.
>>>
>>
>> So this warning is not temporarily shown but lasts until a reboot,
>> right? If so it is a dst refcnt leak.
>
> Correct, once I get into the state it lasts until a reboot.
>
>> How reproducible is it for you? From my reading, it seems always
>> reproduced when you unplug and plug your usb eth interface?
>> Is there anything else involved? For example, network namespace.
>
> So with 4.13-rc3/4 I seem to trigger it easily, often with the first
> unplug of the USB eth adapter.
>
> But as I get back closer to 4.12, it seemingly becomes harder to
> trigger, but sometimes still happens.
>
> So far, I've not been able to trigger it with 4.12.
>
> I don't think network namespaces are involved?  Though its out of my
> area, so AOSP may be using them these days.  Is there a simple way to
> check?
>
> I'll also do another bisection to see if the bad point moves back any further.
>
> thanks
> -john

Hi John,

Does your USB adapter get an IPv6 address?

If you see the problem starts to happen on commit
9514528d92d4cbe086499322370155ed69f5d06c, could you try reverting all
the following commits:
(from new to old)
1eb04e7c9e63 net: reorder all the dst flags
a4c2fd7f7891 net: remove DST_NOCACHE flag
b2a9c0ed75a3 net: remove DST_NOGC flag
5b7c9a8ff828 net: remove dst gc related code
db916649b5dd ipv6: get rid of icmp6 dst garbage collector
587fea741134 ipv6: mark DST_NOGC and remove the operation of dst_free()
ad65a2f05695 ipv6: call dst_hold_safe() properly
9514528d92d4 ipv6: call dst_dev_put() properly

and try if it starts to work?

By only reverting 9514528d92d4 definitely won't work as all the later
commits depend on this one.

Thanks a lot.
Wei

Re: [PATCH iproute2 master] bpf: unbreak libelf linkage for bpf obj loader

2017-08-09 Thread Stephen Hemminger

On Thu, 10 Aug 2017 00:15:41 +0200
Daniel Borkmann  wrote:

> Commit 69fed534a533 ("change how Config is used in Makefile's") moved
> HAVE_MNL specific CFLAGS/LDLIBS for building with libmnl out of the
> top level Makefile into sub-Makefiles. However, it also removed the
> HAVE_ELF specific CFLAGS/LDLIBS entirely, which breaks the BPF object
> loader for tc and ip with "No ELF library support compiled in." despite
> having libelf detected in configure script. Fix it similarly as in
> 69fed534a533 for HAVE_ELF.
> 
> Fixes: 69fed534a533 ("change how Config is used in Makefile's")
> Reported-by: Jeffrey Panneman 
> Signed-off-by: Daniel Borkmann 

Thanks, but I am thinking maybe a better solution long term would be to
move all the package specific stuff into the generated Config file.

That way only the generation shell script would have to change.

Also, all the flags should probably be using pkg-config to get the values.

Re: [PATCH v1 net] TCP_USER_TIMEOUT and tcp_keepalive should conform to RFC5482

2017-08-09 Thread Rao Shoaib




On 08/09/2017 05:20 PM, Joe Smith wrote:

On Wed, Aug 9, 2017 at 4:52 PM, Jerry Chu  wrote:

[try to recover from long lost memory]

On Tue, Aug 8, 2017 at 10:25 AM, Yuchung Cheng  wrote:

On Mon, Aug 7, 2017 at 11:16 AM, Rao Shoaib  wrote:

Change from version 0: Rationale behind the change:

The man page for tcp(7) states

when used with the TCP keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will
override keepalive to  determine  when to close a connection due to keepalive
failure.

This is ambigious at best. user expectation is most likely that the connection
will be reset after TCP_USER_TIMEOUT milliseconds of inactivity.

ccing the original author Jerry Chu who can tell more.


There was a reason for the above otherwise I wouldn't have explicitly
spelled it out in
my commit msg. But unfortunately it was seven years ago and I can't
remember why.
It could range from micro-optimization (saving a syscall() because
this facility was
used by servers handling millions of Android clients) to something more critical
but I can't remember.

The issue is that the man page is ambiguous and does not conform to
any standard.
Whether  RFC 5482 is in little use or not that was cited as the basis
of this change and
I want to change the behavior to conform to it as users are confused.

I doubt that saving a syscall is of any benefit when the connection
has been idle for 2hrs. If anything the user expects the keep alive
probes to start after TCP_USER_TIMEOUT of inactivity. In which case
keep alive should be adjusted.


The code however waits for the keepalive to kick-in (default 2hrs) and than
after one failure resets the conenction.

What is the rationale for that ? The same effect can be obtained by simply
changing the value of tcp_keep_alive_probes.

Since the TCP_USER_TIMEOUT option was added based on RFC 5482 we need to follow
the RFC. Which states

Well the patch has little to do with RFC5482 other than borrowing the name, and
also conveniently providing a mechanism for RFC5482 apps to program the local
timeout value. As far as I knew back when I worked on the patch, RFC5482 was
under little use (told directly by Lars).

Your proposed change may not be unreasonable but my fear is it may
cause breakage
on apps that depend on "TCP_USER_TIMEOUT will overtake keepalive to determine
when to close a connection due to keepalive failure". What is your
case for "RFC5482
compliance" after all? I know the TCP_USER_TIMEOUT option has been very popular
among apps since its inception.

The only use of TCP_USER_TIMEOUT has been for flushing unacknowledged
data (evident from all the fixes). That behavior is not being touched.

Making Linux conform to standards and behavior that is logical seems
like a good enough reason. Mixing keep alive and TCP_USER_TIMEOUT does
not make any sense. I doubt very much if this change will break
anything but if it does than we need to see why that is needed and
implement a proper fix and document it.


The behavior for the main use case has been previously changed as part 
of bug fixes. This is a very low risk change and makes the code logical 
and clean.


Shoaib

Re: [PATCH v1 net] TCP_USER_TIMEOUT and tcp_keepalive should conform to RFC5482

2017-08-09 Thread David Miller

From: Joe Smith 
Date: Wed, 9 Aug 2017 17:20:32 -0700

> Making Linux conform to standards and behavior that is logical seems
> like a good enough reason.

That's an awesome attitude to have when we're implementing something
new and don't have the facility already.

But when we have something already the only important consideration is
not breaking existing apps which rely on that behavior.

That is much, much, more important than standards compliance.

If users are confused, just fix the documentation.

Re: [PATCH v4 05/12] Documentation: net: phy: Add phy-is-internal binding

2017-08-09 Thread Andrew Lunn

On Wed, Aug 09, 2017 at 03:47:34PM -0700, Florian Fainelli wrote:
> On August 9, 2017 5:10:30 AM PDT, David Wu  wrote:
> >Add the documentation for internal phy. A boolean property
> >indicates that a internal phy will be used.
> >
> >Signed-off-by: David Wu 
> >---
> > Documentation/devicetree/bindings/net/phy.txt | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> >diff --git a/Documentation/devicetree/bindings/net/phy.txt
> >b/Documentation/devicetree/bindings/net/phy.txt
> >index b558576..942c892 100644
> >--- a/Documentation/devicetree/bindings/net/phy.txt
> >+++ b/Documentation/devicetree/bindings/net/phy.txt
> >@@ -52,6 +52,9 @@ Optional Properties:
> >   Mark the corresponding energy efficient ethernet mode as broken and
> >   request the ethernet to stop advertising it.
> > 
> >+- phy-is-internal: If set, indicates that phy will connect to the MAC
> >as a
> >+  internal phy.
> 
> Something along the lines of:
> 
> If set, indicates that the PHY is integrated into the same physical package 
> as the Ethernet MAC.

Hi Florian, David.

I'm happy with the property name. But i think the text needs more
description. We deal with Ethernet switches with integrated PHYs. Yet
for us, this property is unneeded.

Seeing this property means some bit of software needs to ensure the
internal PHY should be used, when given the choice between an internal
and external PHY. So i would say something like:

If set, indicates that the PHY is integrated into the same
physical package as the Ethernet MAC. If needed, muxers should be
configured to ensure the internal PHY is used. The absence of this
property indicates the muxers should be configured so that the
external PHY is used.

This last part is important. If the bootloader has set the internal
PHY to be used, you need to reset it. Otherwise we are going to get
into a mess sometime later and need to add a phy-is-external property.

 Andrew

Re: [PATCH v2 net-next 0/7] rtnetlink: allow selected handlers to run without rtnl

2017-08-09 Thread David Miller

From: Florian Westphal 
Date: Wed,  9 Aug 2017 20:41:46 +0200

> Changes since v1:
>  In patch 6, don't make ipv6 route handlers lockless, they all have
>  assumptions on rtnl being held.  Other patches are unchanged.
> 
> The RTNL mutex is used to serialize both rtnetlink calls and
> dump requests.
> Its also used to protect other things such as the list of current
> net namespaces.
> 
> Unfortunately RTNL mutex is a performance issue, e.g. a cpu adding an
> ip address prevents other cpus from seemingly unrelated tasks such as
> dumping tc classifiers or doing rtnetlink route lookups.
> 
> This patch set adds basic infrastructure to start pushing the rtnl lock
> down to those places that need it, or even elide it entirely in some cases.
> 
> Subsystems can now indicate that their doit() callback can run without
> RTNL mutex, such callbacks can then run in parallel.
> 
> This will obviously need a lot of followup work; all current
> users need to be audited/changed to benefit from this.
> Initial no-rtnl spot is netns new/getid.
> 
> We have various 'get' handlers that are also a tempting target,
> however, several of these depend on rtnl mutex to prevent information
> from changing while objects are being read by rtnl handlers; however,
> it doesn't appear impossible to change this.
> 
> Dumps are another problem entirely, see
> commit 2907c35ff64708065 ("net: hold rtnl again in dump callbacks"),
> this patchset doesn't touch dump requests.

Ok series applied, let's see where this goes :-)

Re: [PATCH v1 net] TCP_USER_TIMEOUT and tcp_keepalive should conform to RFC5482

2017-08-09 Thread Joe Smith

On Wed, Aug 9, 2017 at 4:52 PM, Jerry Chu  wrote:
> [try to recover from long lost memory]
>
> On Tue, Aug 8, 2017 at 10:25 AM, Yuchung Cheng  wrote:
>> On Mon, Aug 7, 2017 at 11:16 AM, Rao Shoaib  wrote:
>>> Change from version 0: Rationale behind the change:
>>>
>>> The man page for tcp(7) states
>>>
>>> when used with the TCP keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT 
>>> will
>>> override keepalive to  determine  when to close a connection due to 
>>> keepalive
>>> failure.
>>>
>>> This is ambigious at best. user expectation is most likely that the 
>>> connection
>>> will be reset after TCP_USER_TIMEOUT milliseconds of inactivity.
>> ccing the original author Jerry Chu who can tell more.
>>
>
> There was a reason for the above otherwise I wouldn't have explicitly
> spelled it out in
> my commit msg. But unfortunately it was seven years ago and I can't
> remember why.
> It could range from micro-optimization (saving a syscall() because
> this facility was
> used by servers handling millions of Android clients) to something more 
> critical
> but I can't remember.

The issue is that the man page is ambiguous and does not conform to
any standard.
Whether  RFC 5482 is in little use or not that was cited as the basis
of this change and
I want to change the behavior to conform to it as users are confused.

I doubt that saving a syscall is of any benefit when the connection
has been idle for 2hrs. If anything the user expects the keep alive
probes to start after TCP_USER_TIMEOUT of inactivity. In which case
keep alive should be adjusted.

>
>>>
>>> The code however waits for the keepalive to kick-in (default 2hrs) and than
>>> after one failure resets the conenction.
>>>
>>> What is the rationale for that ? The same effect can be obtained by simply
>>> changing the value of tcp_keep_alive_probes.
>>>
>>> Since the TCP_USER_TIMEOUT option was added based on RFC 5482 we need to 
>>> follow
>>> the RFC. Which states
>
> Well the patch has little to do with RFC5482 other than borrowing the name, 
> and
> also conveniently providing a mechanism for RFC5482 apps to program the local
> timeout value. As far as I knew back when I worked on the patch, RFC5482 was
> under little use (told directly by Lars).
>
> Your proposed change may not be unreasonable but my fear is it may
> cause breakage
> on apps that depend on "TCP_USER_TIMEOUT will overtake keepalive to determine
> when to close a connection due to keepalive failure". What is your
> case for "RFC5482
> compliance" after all? I know the TCP_USER_TIMEOUT option has been very 
> popular
> among apps since its inception.

The only use of TCP_USER_TIMEOUT has been for flushing unacknowledged
data (evident from all the fixes). That behavior is not being touched.

Making Linux conform to standards and behavior that is logical seems
like a good enough reason. Mixing keep alive and TCP_USER_TIMEOUT does
not make any sense. I doubt very much if this change will break
anything but if it does than we need to see why that is needed and
implement a proper fix and document it.

Shoaib

>
>>>
>>> 4.2 TCP keep-Alives:
>>>Some TCP implementations, such as those in BSD systems, use a
>>>different abort policy for TCP keep-alives than for user data.  Thus,
>>>the TCP keep-alive mechanism might abort a connection that would
>>>otherwise have survived the transient period without connectivity.
>>>Therefore, if a connection that enables keep-alives is also using the
>>>TCP User Timeout Option, then the keep-alive timer MUST be set to a
>>>value larger than that of the adopted USER TIMEOUT.
>>>
>>> This patch enforces the MUST and also dis-associates user timeout from keep
>>> alive.  A man page patch will be submitted separately.
>>>
>>> Signed-off-by: Rao Shoaib 
>>> ---
>>>  net/ipv4/tcp.c   | 10 --
>>>  net/ipv4/tcp_timer.c |  9 +
>>>  2 files changed, 9 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>>> index 71ce33d..f2af44d 100644
>>> --- a/net/ipv4/tcp.c
>>> +++ b/net/ipv4/tcp.c
>>> @@ -2628,7 +2628,9 @@ static int do_tcp_setsockopt(struct sock *sk, int 
>>> level,
>>> break;
>>>
>>> case TCP_KEEPIDLE:
>>> -   if (val < 1 || val > MAX_TCP_KEEPIDLE)
>>> +   /* Per RFC5482 keepalive_time must be > user_timeout */
>>> +   if (val < 1 || val > MAX_TCP_KEEPIDLE ||
>>> +   ((val * HZ) <= icsk->icsk_user_timeout))
>>> err = -EINVAL;
>>> else {
>>> tp->keepalive_time = val * HZ;
>>> @@ -2724,8 +2726,12 @@ static int do_tcp_setsockopt(struct sock *sk, int 
>>> level,
>>> case TCP_USER_TIMEOUT:
>>> /* Cap the max time in ms TCP will retry or probe the window
>>>  * before giving up and aborting (ETIMEDOUT) a connection.
>>> +

Re: [PATCH net-next 0/2] zerocopy fixes

2017-08-09 Thread David Miller

From: Willem de Bruijn 
Date: Wed,  9 Aug 2017 19:09:42 -0400

> Fix two issues introduced in the msg_zerocopy patchset.

Series applied, thanks for following up on this so quickly.

Re: [PATCH net-next v2 0/9] Add BPF_J{LT,LE,SLT,SLE} instructions

2017-08-09 Thread David Miller

From: Daniel Borkmann 
Date: Thu, 10 Aug 2017 01:39:54 +0200

> This set adds BPF_J{LT,LE,SLT,SLE} instructions to the BPF
> insn set, interpreter, JIT hardening code and all JITs are
> also updated to support the new instructions. Basic idea is
> to reduce register pressure by avoiding BPF_J{GT,GE,SGT,SGE}
> rewrites. Removing the workaround for the rewrites in LLVM,
> this can result in shorter BPF programs, less stack usage
> and less verification complexity. First patch provides some
> more details on rationale and integration.
> 
> Thanks a lot!
> 
> v1 -> v2:
>   - Reworded commit msg in patch 1

Ok this looks a lot better.

Thanks for taking care of all of the JITs as well.

Series applied.

Re: [net-next 00/12][pull request] 1GbE Intel Wired LAN Driver Updates 2017-08-08

2017-08-09 Thread David Miller

From: Jeff Kirsher 
Date: Wed,  9 Aug 2017 14:47:34 -0700

> This series contains updates to e1000e and igb/igbvf.
 ...
> The following are changes since commit 
> 53b948356554376ec6f89016376825d48bf396c3:
>   net: vrf: Add extack messages for newlink failures
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 1GbE

Pulled, thanks Jeff.

Re: [PATCH v1 net] TCP_USER_TIMEOUT and tcp_keepalive should conform to RFC5482

2017-08-09 Thread Jerry Chu

[try to recover from long lost memory]

On Tue, Aug 8, 2017 at 10:25 AM, Yuchung Cheng  wrote:
> On Mon, Aug 7, 2017 at 11:16 AM, Rao Shoaib  wrote:
>> Change from version 0: Rationale behind the change:
>>
>> The man page for tcp(7) states
>>
>> when used with the TCP keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will
>> override keepalive to  determine  when to close a connection due to keepalive
>> failure.
>>
>> This is ambigious at best. user expectation is most likely that the 
>> connection
>> will be reset after TCP_USER_TIMEOUT milliseconds of inactivity.
> ccing the original author Jerry Chu who can tell more.
>

There was a reason for the above otherwise I wouldn't have explicitly
spelled it out in
my commit msg. But unfortunately it was seven years ago and I can't
remember why.
It could range from micro-optimization (saving a syscall() because
this facility was
used by servers handling millions of Android clients) to something more critical
but I can't remember.

>>
>> The code however waits for the keepalive to kick-in (default 2hrs) and than
>> after one failure resets the conenction.
>>
>> What is the rationale for that ? The same effect can be obtained by simply
>> changing the value of tcp_keep_alive_probes.
>>
>> Since the TCP_USER_TIMEOUT option was added based on RFC 5482 we need to 
>> follow
>> the RFC. Which states

Well the patch has little to do with RFC5482 other than borrowing the name, and
also conveniently providing a mechanism for RFC5482 apps to program the local
timeout value. As far as I knew back when I worked on the patch, RFC5482 was
under little use (told directly by Lars).

Your proposed change may not be unreasonable but my fear is it may
cause breakage
on apps that depend on "TCP_USER_TIMEOUT will overtake keepalive to determine
when to close a connection due to keepalive failure". What is your
case for "RFC5482
compliance" after all? I know the TCP_USER_TIMEOUT option has been very popular
among apps since its inception.

>>
>> 4.2 TCP keep-Alives:
>>Some TCP implementations, such as those in BSD systems, use a
>>different abort policy for TCP keep-alives than for user data.  Thus,
>>the TCP keep-alive mechanism might abort a connection that would
>>otherwise have survived the transient period without connectivity.
>>Therefore, if a connection that enables keep-alives is also using the
>>TCP User Timeout Option, then the keep-alive timer MUST be set to a
>>value larger than that of the adopted USER TIMEOUT.
>>
>> This patch enforces the MUST and also dis-associates user timeout from keep
>> alive.  A man page patch will be submitted separately.
>>
>> Signed-off-by: Rao Shoaib 
>> ---
>>  net/ipv4/tcp.c   | 10 --
>>  net/ipv4/tcp_timer.c |  9 +
>>  2 files changed, 9 insertions(+), 10 deletions(-)
>>
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index 71ce33d..f2af44d 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -2628,7 +2628,9 @@ static int do_tcp_setsockopt(struct sock *sk, int 
>> level,
>> break;
>>
>> case TCP_KEEPIDLE:
>> -   if (val < 1 || val > MAX_TCP_KEEPIDLE)
>> +   /* Per RFC5482 keepalive_time must be > user_timeout */
>> +   if (val < 1 || val > MAX_TCP_KEEPIDLE ||
>> +   ((val * HZ) <= icsk->icsk_user_timeout))
>> err = -EINVAL;
>> else {
>> tp->keepalive_time = val * HZ;
>> @@ -2724,8 +2726,12 @@ static int do_tcp_setsockopt(struct sock *sk, int 
>> level,
>> case TCP_USER_TIMEOUT:
>> /* Cap the max time in ms TCP will retry or probe the window
>>  * before giving up and aborting (ETIMEDOUT) a connection.
>> +* Per RFC5482 TCP user timeout must be < keepalive_time.
>> +* If the default value changes later -- all bets are off.
>>  */
>> -   if (val < 0)
>> +   if (val < 0 || (tp->keepalive_time &&
>> +   tp->keepalive_time <= msecs_to_jiffies(val)) 
>> ||
>> +  net->ipv4.sysctl_tcp_keepalive_time <= 
>> msecs_to_jiffies(val))
>> err = -EINVAL;
>> else
>> icsk->icsk_user_timeout = msecs_to_jiffies(val);
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index c0f..d39fe60 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -664,14 +664,7 @@ static void tcp_keepalive_timer (unsigned long data)
>> elapsed = keepalive_time_elapsed(tp);
>>
>> if (elapsed >= keepalive_time_when(tp)) {
>> -   /* If the TCP_USER_TIMEOUT option is enabled, use that
>> -* to determine when to timeout instead.
>> -*/
>> -   if ((icsk->icsk_user_timeout != 0 &&
>> -

Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-09 Thread John Stultz

On Wed, Aug 9, 2017 at 4:34 PM, Cong Wang  wrote:
> (Cc'ing Wei whose commit was blamed)
>
> On Mon, Aug 7, 2017 at 2:15 PM, John Stultz  wrote:
>> On Mon, Aug 7, 2017 at 2:05 PM, John Stultz  wrote:
>>> So, with recent testing with my HiKey board, I've been noticing some
>>> quirky behavior with my USB eth adapter.
>>>
>>> Basically, pluging the usb eth adapter in and then removing it, when
>>> plugging it back in I often find that its not detected, and the system
>>> slowly spits out the following message over and over:
>>>   unregister_netdevice: waiting for eth0 to become free. Usage count = 1
>>
>> The other bit is that after this starts printing, the board will no
>> longer reboot (it hangs continuing to occasionally print the above
>> message), and I have to manually reset the device.
>>
>
> So this warning is not temporarily shown but lasts until a reboot,
> right? If so it is a dst refcnt leak.

Correct, once I get into the state it lasts until a reboot.

> How reproducible is it for you? From my reading, it seems always
> reproduced when you unplug and plug your usb eth interface?
> Is there anything else involved? For example, network namespace.

So with 4.13-rc3/4 I seem to trigger it easily, often with the first
unplug of the USB eth adapter.

But as I get back closer to 4.12, it seemingly becomes harder to
trigger, but sometimes still happens.

So far, I've not been able to trigger it with 4.12.

I don't think network namespaces are involved?  Though its out of my
area, so AOSP may be using them these days.  Is there a simple way to
check?

I'll also do another bisection to see if the bad point moves back any further.

thanks
-john

[PATCH net-next v2 4/9] bpf, sparc64: implement jiting of BPF_J{LT, LE, SLT, SLE}

2017-08-09 Thread Daniel Borkmann

This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the sparc64 eBPF JIT.

Signed-off-by: Daniel Borkmann 
---
 arch/sparc/net/bpf_jit_comp_64.c | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/arch/sparc/net/bpf_jit_comp_64.c b/arch/sparc/net/bpf_jit_comp_64.c
index 8799ae9..c340af7 100644
--- a/arch/sparc/net/bpf_jit_comp_64.c
+++ b/arch/sparc/net/bpf_jit_comp_64.c
@@ -128,6 +128,8 @@ static u32 WDISP10(u32 off)
 
 #define BA (BRANCH | CONDA)
 #define BG (BRANCH | CONDG)
+#define BL (BRANCH | CONDL)
+#define BLE(BRANCH | CONDLE)
 #define BGU(BRANCH | CONDGU)
 #define BLEU   (BRANCH | CONDLEU)
 #define BGE(BRANCH | CONDGE)
@@ -715,9 +717,15 @@ static int emit_compare_and_branch(const u8 code, const u8 
dst, u8 src,
case BPF_JGT:
br_opcode = BGU;
break;
+   case BPF_JLT:
+   br_opcode = BLU;
+   break;
case BPF_JGE:
br_opcode = BGEU;
break;
+   case BPF_JLE:
+   br_opcode = BLEU;
+   break;
case BPF_JSET:
case BPF_JNE:
br_opcode = BNE;
@@ -725,9 +733,15 @@ static int emit_compare_and_branch(const u8 code, const u8 
dst, u8 src,
case BPF_JSGT:
br_opcode = BG;
break;
+   case BPF_JSLT:
+   br_opcode = BL;
+   break;
case BPF_JSGE:
br_opcode = BGE;
break;
+   case BPF_JSLE:
+   br_opcode = BLE;
+   break;
default:
/* Make sure we dont leak kernel information to the
 * user.
@@ -746,18 +760,30 @@ static int emit_compare_and_branch(const u8 code, const 
u8 dst, u8 src,
case BPF_JGT:
cbcond_opcode = CBCONDGU;
break;
+   case BPF_JLT:
+   cbcond_opcode = CBCONDLU;
+   break;
case BPF_JGE:
cbcond_opcode = CBCONDGEU;
break;
+   case BPF_JLE:
+   cbcond_opcode = CBCONDLEU;
+   break;
case BPF_JNE:
cbcond_opcode = CBCONDNE;
break;
case BPF_JSGT:
cbcond_opcode = CBCONDG;
break;
+   case BPF_JSLT:
+   cbcond_opcode = CBCONDL;
+   break;
case BPF_JSGE:
cbcond_opcode = CBCONDGE;
break;
+   case BPF_JSLE:
+   cbcond_opcode = CBCONDLE;
+   break;
default:
/* Make sure we dont leak kernel information to the
 * user.
@@ -1176,10 +1202,14 @@ static int build_insn(const struct bpf_insn *insn, 
struct jit_ctx *ctx)
/* IF (dst COND src) JUMP off */
case BPF_JMP | BPF_JEQ | BPF_X:
case BPF_JMP | BPF_JGT | BPF_X:
+   case BPF_JMP | BPF_JLT | BPF_X:
case BPF_JMP | BPF_JGE | BPF_X:
+   case BPF_JMP | BPF_JLE | BPF_X:
case BPF_JMP | BPF_JNE | BPF_X:
case BPF_JMP | BPF_JSGT | BPF_X:
+   case BPF_JMP | BPF_JSLT | BPF_X:
case BPF_JMP | BPF_JSGE | BPF_X:
+   case BPF_JMP | BPF_JSLE | BPF_X:
case BPF_JMP | BPF_JSET | BPF_X: {
int err;
 
@@ -1191,10 +1221,14 @@ static int build_insn(const struct bpf_insn *insn, 
struct jit_ctx *ctx)
/* IF (dst COND imm) JUMP off */
case BPF_JMP | BPF_JEQ | BPF_K:
case BPF_JMP | BPF_JGT | BPF_K:
+   case BPF_JMP | BPF_JLT | BPF_K:
case BPF_JMP | BPF_JGE | BPF_K:
+   case BPF_JMP | BPF_JLE | BPF_K:
case BPF_JMP | BPF_JNE | BPF_K:
case BPF_JMP | BPF_JSGT | BPF_K:
+   case BPF_JMP | BPF_JSLT | BPF_K:
case BPF_JMP | BPF_JSGE | BPF_K:
+   case BPF_JMP | BPF_JSLE | BPF_K:
case BPF_JMP | BPF_JSET | BPF_K: {
int err;
 
-- 
1.9.3

[PATCH net-next v2 0/9] Add BPF_J{LT,LE,SLT,SLE} instructions

2017-08-09 Thread Daniel Borkmann

This set adds BPF_J{LT,LE,SLT,SLE} instructions to the BPF
insn set, interpreter, JIT hardening code and all JITs are
also updated to support the new instructions. Basic idea is
to reduce register pressure by avoiding BPF_J{GT,GE,SGT,SGE}
rewrites. Removing the workaround for the rewrites in LLVM,
this can result in shorter BPF programs, less stack usage
and less verification complexity. First patch provides some
more details on rationale and integration.

Thanks a lot!

v1 -> v2:
  - Reworded commit msg in patch 1

Daniel Borkmann (9):
  bpf: add BPF_J{LT,LE,SLT,SLE} instructions
  bpf, x86: implement jiting of BPF_J{LT,LE,SLT,SLE}
  bpf, arm64: implement jiting of BPF_J{LT,LE,SLT,SLE}
  bpf, sparc64: implement jiting of BPF_J{LT, LE, SLT, SLE}
  bpf, s390x: implement jiting of BPF_J{LT,LE,SLT,SLE}
  bpf, ppc64: implement jiting of BPF_J{LT,LE,SLT,SLE}
  bpf, nfp: implement jiting of BPF_J{LT,LE}
  bpf: enable BPF_J{LT,LE,SLT,SLE} opcodes in verifier
  bpf: add test cases for new BPF_J{LT, LE, SLT, SLE} instructions

 Documentation/networking/filter.txt  |   4 +
 arch/arm64/net/bpf_jit.h |   4 +
 arch/arm64/net/bpf_jit_comp.c|  20 ++
 arch/powerpc/net/bpf_jit.h   |   1 +
 arch/powerpc/net/bpf_jit_comp64.c|  20 ++
 arch/s390/net/bpf_jit_comp.c |  24 ++
 arch/sparc/net/bpf_jit_comp_64.c |  34 +++
 arch/x86/net/bpf_jit_comp.c  |  26 ++
 drivers/net/ethernet/netronome/nfp/bpf/jit.c |  24 ++
 include/uapi/linux/bpf.h |   5 +
 kernel/bpf/core.c|  60 +
 kernel/bpf/verifier.c|  62 -
 lib/test_bpf.c   | 364 +++
 net/core/filter.c|  21 +-
 tools/include/uapi/linux/bpf.h   |   5 +
 tools/testing/selftests/bpf/test_verifier.c  | 313 +++
 16 files changed, 979 insertions(+), 8 deletions(-)

-- 
1.9.3

[PATCH net-next v2 5/9] bpf, s390x: implement jiting of BPF_J{LT,LE,SLT,SLE}

2017-08-09 Thread Daniel Borkmann

This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the s390x eBPF JIT.

Signed-off-by: Daniel Borkmann 
Acked-by: Michael Holzheu 
---
 arch/s390/net/bpf_jit_comp.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
index 01c6fbc..b5228ca 100644
--- a/arch/s390/net/bpf_jit_comp.c
+++ b/arch/s390/net/bpf_jit_comp.c
@@ -1093,15 +1093,27 @@ static noinline int bpf_jit_insn(struct bpf_jit *jit, 
struct bpf_prog *fp, int i
case BPF_JMP | BPF_JSGT | BPF_K: /* ((s64) dst > (s64) imm) */
mask = 0x2000; /* jh */
goto branch_ks;
+   case BPF_JMP | BPF_JSLT | BPF_K: /* ((s64) dst < (s64) imm) */
+   mask = 0x4000; /* jl */
+   goto branch_ks;
case BPF_JMP | BPF_JSGE | BPF_K: /* ((s64) dst >= (s64) imm) */
mask = 0xa000; /* jhe */
goto branch_ks;
+   case BPF_JMP | BPF_JSLE | BPF_K: /* ((s64) dst <= (s64) imm) */
+   mask = 0xc000; /* jle */
+   goto branch_ks;
case BPF_JMP | BPF_JGT | BPF_K: /* (dst_reg > imm) */
mask = 0x2000; /* jh */
goto branch_ku;
+   case BPF_JMP | BPF_JLT | BPF_K: /* (dst_reg < imm) */
+   mask = 0x4000; /* jl */
+   goto branch_ku;
case BPF_JMP | BPF_JGE | BPF_K: /* (dst_reg >= imm) */
mask = 0xa000; /* jhe */
goto branch_ku;
+   case BPF_JMP | BPF_JLE | BPF_K: /* (dst_reg <= imm) */
+   mask = 0xc000; /* jle */
+   goto branch_ku;
case BPF_JMP | BPF_JNE | BPF_K: /* (dst_reg != imm) */
mask = 0x7000; /* jne */
goto branch_ku;
@@ -1119,15 +1131,27 @@ static noinline int bpf_jit_insn(struct bpf_jit *jit, 
struct bpf_prog *fp, int i
case BPF_JMP | BPF_JSGT | BPF_X: /* ((s64) dst > (s64) src) */
mask = 0x2000; /* jh */
goto branch_xs;
+   case BPF_JMP | BPF_JSLT | BPF_X: /* ((s64) dst < (s64) src) */
+   mask = 0x4000; /* jl */
+   goto branch_xs;
case BPF_JMP | BPF_JSGE | BPF_X: /* ((s64) dst >= (s64) src) */
mask = 0xa000; /* jhe */
goto branch_xs;
+   case BPF_JMP | BPF_JSLE | BPF_X: /* ((s64) dst <= (s64) src) */
+   mask = 0xc000; /* jle */
+   goto branch_xs;
case BPF_JMP | BPF_JGT | BPF_X: /* (dst > src) */
mask = 0x2000; /* jh */
goto branch_xu;
+   case BPF_JMP | BPF_JLT | BPF_X: /* (dst < src) */
+   mask = 0x4000; /* jl */
+   goto branch_xu;
case BPF_JMP | BPF_JGE | BPF_X: /* (dst >= src) */
mask = 0xa000; /* jhe */
goto branch_xu;
+   case BPF_JMP | BPF_JLE | BPF_X: /* (dst <= src) */
+   mask = 0xc000; /* jle */
+   goto branch_xu;
case BPF_JMP | BPF_JNE | BPF_X: /* (dst != src) */
mask = 0x7000; /* jne */
goto branch_xu;
-- 
1.9.3

[PATCH net-next v2 1/9] bpf: add BPF_J{LT,LE,SLT,SLE} instructions

2017-08-09 Thread Daniel Borkmann

Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.

This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.

Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).

The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)

  [1] https://github.com/borkmann/llvm/tree/bpf-insns

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 Documentation/networking/filter.txt |   4 +
 include/uapi/linux/bpf.h|   5 +
 kernel/bpf/core.c   |  60 ++
 lib/test_bpf.c  | 364 
 net/core/filter.c   |  21 ++-
 tools/include/uapi/linux/bpf.h  |   5 +
 6 files changed, 455 insertions(+), 4 deletions(-)

diff --git a/Documentation/networking/filter.txt 
b/Documentation/networking/filter.txt
index d0fdba7..6a0df8d 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -906,6 +906,10 @@ If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of:
   BPF_JSGE  0x70  /* eBPF only: signed '>=' */
   BPF_CALL  0x80  /* eBPF only: function call */
   BPF_EXIT  0x90  /* eBPF only: function return */
+  BPF_JLT   0xa0  /* eBPF only: unsigned '<' */
+  BPF_JLE   0xb0  /* eBPF only: unsigned '<=' */
+  BPF_JSLT  0xc0  /* eBPF only: signed '<' */
+  BPF_JSLE  0xd0  /* eBPF only: signed '<=' */
 
 So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
 and eBPF. There are only two registers in classic BPF, so it means A += X.
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1d06be1..91da837 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -30,9 +30,14 @@
 #define BPF_FROM_LEBPF_TO_LE
 #define BPF_FROM_BEBPF_TO_BE
 
+/* jmp encodings */
 #define BPF_JNE0x50/* jump != */
+#define BPF_JLT0xa0/* LT is unsigned, '<' */
+#define BPF_JLE0xb0/* LE is unsigned, '<=' */
 #define BPF_JSGT   0x60/* SGT is signed '>', GT in x86 */
 #define BPF_JSGE   0x70/* SGE is signed '>=', GE in x86 */
+#define BPF_JSLT   0xc0/* SLT is signed, '<' */
+#define BPF_JSLE   0xd0/* SLE is signed, '<=' */
 #define BPF_CALL   0x80/* function call */
 #define BPF_EXIT   0x90/* function return */
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ad5f559..c69e7f5 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -595,9 +595,13 @@ static int bpf_jit_blind_insn(const struct bpf_insn *from,
case BPF_JMP | BPF_JEQ  | BPF_K:
case BPF_JMP | BPF_JNE  | BPF_K:
case BPF_JMP | BPF_JGT  | BPF_K:
+   case BPF_JMP | BPF_JLT  | BPF_K:
case BPF_JMP | BPF_JGE  | BPF_K:
+   case BPF_JMP | BPF_JLE  | BPF_K:
case BPF_JMP | BPF_JSGT | BPF_K:
+   case BPF_JMP | BPF_JSLT | BPF_K:
case BPF_JMP

[PATCH net-next v2 3/9] bpf, arm64: implement jiting of BPF_J{LT,LE,SLT,SLE}

2017-08-09 Thread Daniel Borkmann

This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the arm64 eBPF JIT.

Signed-off-by: Daniel Borkmann 
---
 arch/arm64/net/bpf_jit.h  |  4 
 arch/arm64/net/bpf_jit_comp.c | 20 
 2 files changed, 24 insertions(+)

diff --git a/arch/arm64/net/bpf_jit.h b/arch/arm64/net/bpf_jit.h
index b02a926..783de51 100644
--- a/arch/arm64/net/bpf_jit.h
+++ b/arch/arm64/net/bpf_jit.h
@@ -44,8 +44,12 @@
 #define A64_COND_NEAARCH64_INSN_COND_NE /* != */
 #define A64_COND_CSAARCH64_INSN_COND_CS /* unsigned >= */
 #define A64_COND_HIAARCH64_INSN_COND_HI /* unsigned > */
+#define A64_COND_LSAARCH64_INSN_COND_LS /* unsigned <= */
+#define A64_COND_CCAARCH64_INSN_COND_CC /* unsigned < */
 #define A64_COND_GEAARCH64_INSN_COND_GE /* signed >= */
 #define A64_COND_GTAARCH64_INSN_COND_GT /* signed > */
+#define A64_COND_LEAARCH64_INSN_COND_LE /* signed <= */
+#define A64_COND_LTAARCH64_INSN_COND_LT /* signed < */
 #define A64_B_(cond, imm19) A64_COND_BRANCH(cond, (imm19) << 2)
 
 /* Unconditional branch (immediate) */
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index f32144b..ba38d40 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -527,10 +527,14 @@ static int build_insn(const struct bpf_insn *insn, struct 
jit_ctx *ctx)
/* IF (dst COND src) JUMP off */
case BPF_JMP | BPF_JEQ | BPF_X:
case BPF_JMP | BPF_JGT | BPF_X:
+   case BPF_JMP | BPF_JLT | BPF_X:
case BPF_JMP | BPF_JGE | BPF_X:
+   case BPF_JMP | BPF_JLE | BPF_X:
case BPF_JMP | BPF_JNE | BPF_X:
case BPF_JMP | BPF_JSGT | BPF_X:
+   case BPF_JMP | BPF_JSLT | BPF_X:
case BPF_JMP | BPF_JSGE | BPF_X:
+   case BPF_JMP | BPF_JSLE | BPF_X:
emit(A64_CMP(1, dst, src), ctx);
 emit_cond_jmp:
jmp_offset = bpf2a64_offset(i + off, i, ctx);
@@ -542,9 +546,15 @@ static int build_insn(const struct bpf_insn *insn, struct 
jit_ctx *ctx)
case BPF_JGT:
jmp_cond = A64_COND_HI;
break;
+   case BPF_JLT:
+   jmp_cond = A64_COND_CC;
+   break;
case BPF_JGE:
jmp_cond = A64_COND_CS;
break;
+   case BPF_JLE:
+   jmp_cond = A64_COND_LS;
+   break;
case BPF_JSET:
case BPF_JNE:
jmp_cond = A64_COND_NE;
@@ -552,9 +562,15 @@ static int build_insn(const struct bpf_insn *insn, struct 
jit_ctx *ctx)
case BPF_JSGT:
jmp_cond = A64_COND_GT;
break;
+   case BPF_JSLT:
+   jmp_cond = A64_COND_LT;
+   break;
case BPF_JSGE:
jmp_cond = A64_COND_GE;
break;
+   case BPF_JSLE:
+   jmp_cond = A64_COND_LE;
+   break;
default:
return -EFAULT;
}
@@ -566,10 +582,14 @@ static int build_insn(const struct bpf_insn *insn, struct 
jit_ctx *ctx)
/* IF (dst COND imm) JUMP off */
case BPF_JMP | BPF_JEQ | BPF_K:
case BPF_JMP | BPF_JGT | BPF_K:
+   case BPF_JMP | BPF_JLT | BPF_K:
case BPF_JMP | BPF_JGE | BPF_K:
+   case BPF_JMP | BPF_JLE | BPF_K:
case BPF_JMP | BPF_JNE | BPF_K:
case BPF_JMP | BPF_JSGT | BPF_K:
+   case BPF_JMP | BPF_JSLT | BPF_K:
case BPF_JMP | BPF_JSGE | BPF_K:
+   case BPF_JMP | BPF_JSLE | BPF_K:
emit_a64_mov_i(1, tmp, imm, ctx);
emit(A64_CMP(1, dst, tmp), ctx);
goto emit_cond_jmp;
-- 
1.9.3

[PATCH net-next v2 6/9] bpf, ppc64: implement jiting of BPF_J{LT,LE,SLT,SLE}

2017-08-09 Thread Daniel Borkmann

This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the ppc64 eBPF JIT.

Signed-off-by: Daniel Borkmann 
Acked-by: Naveen N. Rao 
Tested-by: Naveen N. Rao 
---
 arch/powerpc/net/bpf_jit.h|  1 +
 arch/powerpc/net/bpf_jit_comp64.c | 20 
 2 files changed, 21 insertions(+)

diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index 30cf03f..47fc666 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -263,6 +263,7 @@ static inline bool is_nearbranch(int offset)
 #define COND_EQ(CR0_EQ | COND_CMP_TRUE)
 #define COND_NE(CR0_EQ | COND_CMP_FALSE)
 #define COND_LT(CR0_LT | COND_CMP_TRUE)
+#define COND_LE(CR0_GT | COND_CMP_FALSE)
 
 #endif
 
diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
b/arch/powerpc/net/bpf_jit_comp64.c
index 861c5af..faf2016 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -795,12 +795,24 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 
*image,
case BPF_JMP | BPF_JSGT | BPF_X:
true_cond = COND_GT;
goto cond_branch;
+   case BPF_JMP | BPF_JLT | BPF_K:
+   case BPF_JMP | BPF_JLT | BPF_X:
+   case BPF_JMP | BPF_JSLT | BPF_K:
+   case BPF_JMP | BPF_JSLT | BPF_X:
+   true_cond = COND_LT;
+   goto cond_branch;
case BPF_JMP | BPF_JGE | BPF_K:
case BPF_JMP | BPF_JGE | BPF_X:
case BPF_JMP | BPF_JSGE | BPF_K:
case BPF_JMP | BPF_JSGE | BPF_X:
true_cond = COND_GE;
goto cond_branch;
+   case BPF_JMP | BPF_JLE | BPF_K:
+   case BPF_JMP | BPF_JLE | BPF_X:
+   case BPF_JMP | BPF_JSLE | BPF_K:
+   case BPF_JMP | BPF_JSLE | BPF_X:
+   true_cond = COND_LE;
+   goto cond_branch;
case BPF_JMP | BPF_JEQ | BPF_K:
case BPF_JMP | BPF_JEQ | BPF_X:
true_cond = COND_EQ;
@@ -817,14 +829,18 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 
*image,
 cond_branch:
switch (code) {
case BPF_JMP | BPF_JGT | BPF_X:
+   case BPF_JMP | BPF_JLT | BPF_X:
case BPF_JMP | BPF_JGE | BPF_X:
+   case BPF_JMP | BPF_JLE | BPF_X:
case BPF_JMP | BPF_JEQ | BPF_X:
case BPF_JMP | BPF_JNE | BPF_X:
/* unsigned comparison */
PPC_CMPLD(dst_reg, src_reg);
break;
case BPF_JMP | BPF_JSGT | BPF_X:
+   case BPF_JMP | BPF_JSLT | BPF_X:
case BPF_JMP | BPF_JSGE | BPF_X:
+   case BPF_JMP | BPF_JSLE | BPF_X:
/* signed comparison */
PPC_CMPD(dst_reg, src_reg);
break;
@@ -834,7 +850,9 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 
*image,
case BPF_JMP | BPF_JNE | BPF_K:
case BPF_JMP | BPF_JEQ | BPF_K:
case BPF_JMP | BPF_JGT | BPF_K:
+   case BPF_JMP | BPF_JLT | BPF_K:
case BPF_JMP | BPF_JGE | BPF_K:
+   case BPF_JMP | BPF_JLE | BPF_K:
/*
 * Need sign-extended load, so only positive
 * values can be used as imm in cmpldi
@@ -849,7 +867,9 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 
*image,
}
break;
case BPF_JMP | BPF_JSGT | BPF_K:
+   case BPF_JMP | BPF_JSLT | BPF_K:
case BPF_JMP | BPF_JSGE | BPF_K:
+   case BPF_JMP | BPF_JSLE | BPF_K:
/*
 * signed comparison, so any 16-bit value
 * can be used in cmpdi
-- 
1.9.3

[PATCH net-next v2 7/9] bpf, nfp: implement jiting of BPF_J{LT,LE}

2017-08-09 Thread Daniel Borkmann

This work implements jiting of BPF_J{LT,LE} instructions with
BPF_X/BPF_K variants for the nfp eBPF JIT. The two BPF_J{SLT,SLE}
instructions have not been added yet given BPF_J{SGT,SGE} are
not supported yet either.

Signed-off-by: Daniel Borkmann 
Acked-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 8e57fda..239dfbe 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -1238,6 +1238,16 @@ static int jge_imm(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
return wrp_cmp_imm(nfp_prog, meta, BR_BHS, true);
 }
 
+static int jlt_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
+{
+   return wrp_cmp_imm(nfp_prog, meta, BR_BHS, false);
+}
+
+static int jle_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
+{
+   return wrp_cmp_imm(nfp_prog, meta, BR_BLO, true);
+}
+
 static int jset_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
const struct bpf_insn *insn = >insn;
@@ -1325,6 +1335,16 @@ static int jge_reg(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
return wrp_cmp_reg(nfp_prog, meta, BR_BHS, true);
 }
 
+static int jlt_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
+{
+   return wrp_cmp_reg(nfp_prog, meta, BR_BHS, false);
+}
+
+static int jle_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
+{
+   return wrp_cmp_reg(nfp_prog, meta, BR_BLO, true);
+}
+
 static int jset_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
return wrp_test_reg(nfp_prog, meta, ALU_OP_AND, BR_BNE);
@@ -1383,11 +1403,15 @@ static int goto_out(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
[BPF_JMP | BPF_JEQ | BPF_K] =   jeq_imm,
[BPF_JMP | BPF_JGT | BPF_K] =   jgt_imm,
[BPF_JMP | BPF_JGE | BPF_K] =   jge_imm,
+   [BPF_JMP | BPF_JLT | BPF_K] =   jlt_imm,
+   [BPF_JMP | BPF_JLE | BPF_K] =   jle_imm,
[BPF_JMP | BPF_JSET | BPF_K] =  jset_imm,
[BPF_JMP | BPF_JNE | BPF_K] =   jne_imm,
[BPF_JMP | BPF_JEQ | BPF_X] =   jeq_reg,
[BPF_JMP | BPF_JGT | BPF_X] =   jgt_reg,
[BPF_JMP | BPF_JGE | BPF_X] =   jge_reg,
+   [BPF_JMP | BPF_JLT | BPF_X] =   jlt_reg,
+   [BPF_JMP | BPF_JLE | BPF_X] =   jle_reg,
[BPF_JMP | BPF_JSET | BPF_X] =  jset_reg,
[BPF_JMP | BPF_JNE | BPF_X] =   jne_reg,
[BPF_JMP | BPF_EXIT] =  goto_out,
-- 
1.9.3

[PATCH net-next v2 8/9] bpf: enable BPF_J{LT,LE,SLT,SLE} opcodes in verifier

2017-08-09 Thread Daniel Borkmann

Enable the newly added jump opcodes, main parts are in two
different areas, namely direct packet access and dynamic map
value access. For the direct packet access, we now allow for
the following two new patterns to match in order to trigger
markings with find_good_pkt_pointers():

Variant 1 (access ok when taking the branch):

  0: (61) r2 = *(u32 *)(r1 +76)
  1: (61) r3 = *(u32 *)(r1 +80)
  2: (bf) r0 = r2
  3: (07) r0 += 8
  4: (ad) if r0 < r3 goto pc+2
  R0=pkt(id=0,off=8,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
  R3=pkt_end R10=fp
  5: (b7) r0 = 0
  6: (95) exit

  from 4 to 7: R0=pkt(id=0,off=8,r=8) R1=ctx
   R2=pkt(id=0,off=0,r=8) R3=pkt_end R10=fp
  7: (71) r0 = *(u8 *)(r2 +0)
  8: (05) goto pc-4
  5: (b7) r0 = 0
  6: (95) exit
  processed 11 insns, stack depth 0

Variant 2 (access ok on fall-through):

  0: (61) r2 = *(u32 *)(r1 +76)
  1: (61) r3 = *(u32 *)(r1 +80)
  2: (bf) r0 = r2
  3: (07) r0 += 8
  4: (bd) if r3 <= r0 goto pc+1
  R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8)
  R3=pkt_end R10=fp
  5: (71) r0 = *(u8 *)(r2 +0)
  6: (b7) r0 = 1
  7: (95) exit

  from 4 to 6: R0=pkt(id=0,off=8,r=0) R1=ctx
   R2=pkt(id=0,off=0,r=0) R3=pkt_end R10=fp
  6: (b7) r0 = 1
  7: (95) exit
  processed 10 insns, stack depth 0

The above two basically just swap the branches where we need
to handle an exception and allow packet access compared to the
two already existing variants for find_good_pkt_pointers().

For the dynamic map value access, we add the new instructions
to reg_set_min_max() and reg_set_min_max_inv() in order to
learn bounds. Verifier test cases for both are added in a
follow-up patch.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
Acked-by: John Fastabend 
---
 kernel/bpf/verifier.c | 62 +++
 1 file changed, 58 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8160a81..ecc590e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -312,11 +312,15 @@ static void print_verifier_state(struct 
bpf_verifier_state *state)
[BPF_JA >> 4]   = "jmp",
[BPF_JEQ >> 4]  = "==",
[BPF_JGT >> 4]  = ">",
+   [BPF_JLT >> 4]  = "<",
[BPF_JGE >> 4]  = ">=",
+   [BPF_JLE >> 4]  = "<=",
[BPF_JSET >> 4] = "&",
[BPF_JNE >> 4]  = "!=",
[BPF_JSGT >> 4] = "s>",
+   [BPF_JSLT >> 4] = "s<",
[BPF_JSGE >> 4] = "s>=",
+   [BPF_JSLE >> 4] = "s<=",
[BPF_CALL >> 4] = "call",
[BPF_EXIT >> 4] = "exit",
 };
@@ -2383,27 +2387,37 @@ static void find_good_pkt_pointers(struct 
bpf_verifier_state *state,
 */
return;
 
-   /* LLVM can generate two kind of checks:
+   /* LLVM can generate four kind of checks:
 *
-* Type 1:
+* Type 1/2:
 *
 *   r2 = r3;
 *   r2 += 8;
 *   if (r2 > pkt_end) goto 
 *   
 *
+*   r2 = r3;
+*   r2 += 8;
+*   if (r2 < pkt_end) goto 
+*   
+*
 *   Where:
 * r2 == dst_reg, pkt_end == src_reg
 * r2=pkt(id=n,off=8,r=0)
 * r3=pkt(id=n,off=0,r=0)
 *
-* Type 2:
+* Type 3/4:
 *
 *   r2 = r3;
 *   r2 += 8;
 *   if (pkt_end >= r2) goto 
 *   
 *
+*   r2 = r3;
+*   r2 += 8;
+*   if (pkt_end <= r2) goto 
+*   
+*
 *   Where:
 * pkt_end == dst_reg, r2 == src_reg
 * r2=pkt(id=n,off=8,r=0)
@@ -2471,6 +2485,14 @@ static void reg_set_min_max(struct bpf_reg_state 
*true_reg,
false_reg->smax_value = min_t(s64, false_reg->smax_value, val);
true_reg->smin_value = max_t(s64, true_reg->smin_value, val + 
1);
break;
+   case BPF_JLT:
+   false_reg->umin_value = max(false_reg->umin_value, val);
+   true_reg->umax_value = min(true_reg->umax_value, val - 1);
+   break;
+   case BPF_JSLT:
+   false_reg->smin_value = max_t(s64, false_reg->smin_value, val);
+   true_reg->smax_value = min_t(s64, true_reg->smax_value, val - 
1);
+   break;
case BPF_JGE:
false_reg->umax_value = min(false_reg->umax_value, val - 1);
true_reg->umin_value = max(true_reg->umin_value, val);
@@ -2479,6 +2501,14 @@ static void reg_set_min_max(struct bpf_reg_state 
*true_reg,
false_reg->smax_value = min_t(s64, false_reg->smax_value, val - 
1);
true_reg->smin_value = max_t(s64, true_reg->smin_value, val);
break;
+   case BPF_JLE:
+   false_reg->umin_value = max(false_reg->umin_value, val + 1);
+   true_reg->umax_value = min(true_reg->umax_value, val);
+   break;
+   case BPF_JSLE:

[PATCH net-next v2 2/9] bpf, x86: implement jiting of BPF_J{LT,LE,SLT,SLE}

2017-08-09 Thread Daniel Borkmann

This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the x86_64 eBPF JIT.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 arch/x86/net/bpf_jit_comp.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index e1324f2..8194696 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -94,7 +94,9 @@ static int bpf_size_to_x86_bytes(int bpf_size)
 #define X86_JNE 0x75
 #define X86_JBE 0x76
 #define X86_JA  0x77
+#define X86_JL  0x7C
 #define X86_JGE 0x7D
+#define X86_JLE 0x7E
 #define X86_JG  0x7F
 
 static void bpf_flush_icache(void *start, void *end)
@@ -888,9 +890,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, 
u8 *image,
case BPF_JMP | BPF_JEQ | BPF_X:
case BPF_JMP | BPF_JNE | BPF_X:
case BPF_JMP | BPF_JGT | BPF_X:
+   case BPF_JMP | BPF_JLT | BPF_X:
case BPF_JMP | BPF_JGE | BPF_X:
+   case BPF_JMP | BPF_JLE | BPF_X:
case BPF_JMP | BPF_JSGT | BPF_X:
+   case BPF_JMP | BPF_JSLT | BPF_X:
case BPF_JMP | BPF_JSGE | BPF_X:
+   case BPF_JMP | BPF_JSLE | BPF_X:
/* cmp dst_reg, src_reg */
EMIT3(add_2mod(0x48, dst_reg, src_reg), 0x39,
  add_2reg(0xC0, dst_reg, src_reg));
@@ -911,9 +917,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, 
u8 *image,
case BPF_JMP | BPF_JEQ | BPF_K:
case BPF_JMP | BPF_JNE | BPF_K:
case BPF_JMP | BPF_JGT | BPF_K:
+   case BPF_JMP | BPF_JLT | BPF_K:
case BPF_JMP | BPF_JGE | BPF_K:
+   case BPF_JMP | BPF_JLE | BPF_K:
case BPF_JMP | BPF_JSGT | BPF_K:
+   case BPF_JMP | BPF_JSLT | BPF_K:
case BPF_JMP | BPF_JSGE | BPF_K:
+   case BPF_JMP | BPF_JSLE | BPF_K:
/* cmp dst_reg, imm8/32 */
EMIT1(add_1mod(0x48, dst_reg));
 
@@ -935,18 +945,34 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, 
u8 *image,
/* GT is unsigned '>', JA in x86 */
jmp_cond = X86_JA;
break;
+   case BPF_JLT:
+   /* LT is unsigned '<', JB in x86 */
+   jmp_cond = X86_JB;
+   break;
case BPF_JGE:
/* GE is unsigned '>=', JAE in x86 */
jmp_cond = X86_JAE;
break;
+   case BPF_JLE:
+   /* LE is unsigned '<=', JBE in x86 */
+   jmp_cond = X86_JBE;
+   break;
case BPF_JSGT:
/* signed '>', GT in x86 */
jmp_cond = X86_JG;
break;
+   case BPF_JSLT:
+   /* signed '<', LT in x86 */
+   jmp_cond = X86_JL;
+   break;
case BPF_JSGE:
/* signed '>=', GE in x86 */
jmp_cond = X86_JGE;
break;
+   case BPF_JSLE:
+   /* signed '<=', LE in x86 */
+   jmp_cond = X86_JLE;
+   break;
default: /* to silence gcc warning */
return -EFAULT;
}
-- 
1.9.3

[PATCH net-next v2 9/9] bpf: add test cases for new BPF_J{LT, LE, SLT, SLE} instructions

2017-08-09 Thread Daniel Borkmann

Add test cases to the verifier selftest suite in order to verify that
i) direct packet access, and ii) dynamic map value access is working
with the changes related to the new instructions.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 tools/testing/selftests/bpf/test_verifier.c | 313 
 1 file changed, 313 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 65aa562..70973c1 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -2830,6 +2830,79 @@ struct test_val {
.result = ACCEPT,
},
{
+   "direct packet access: test25 (marking on <, good access)",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+   offsetof(struct __sk_buff, data)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+   offsetof(struct __sk_buff, data_end)),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+   BPF_JMP_REG(BPF_JLT, BPF_REG_0, BPF_REG_3, 2),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_2, 0),
+   BPF_JMP_IMM(BPF_JA, 0, 0, -4),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   },
+   {
+   "direct packet access: test26 (marking on <, bad access)",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+   offsetof(struct __sk_buff, data)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+   offsetof(struct __sk_buff, data_end)),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+   BPF_JMP_REG(BPF_JLT, BPF_REG_0, BPF_REG_3, 3),
+   BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_2, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   BPF_JMP_IMM(BPF_JA, 0, 0, -3),
+   },
+   .result = REJECT,
+   .errstr = "invalid access to packet",
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   },
+   {
+   "direct packet access: test27 (marking on <=, good access)",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+   offsetof(struct __sk_buff, data)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+   offsetof(struct __sk_buff, data_end)),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+   BPF_JMP_REG(BPF_JLE, BPF_REG_3, BPF_REG_0, 1),
+   BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_2, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   },
+   {
+   "direct packet access: test28 (marking on <=, bad access)",
+   .insns = {
+   BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+   offsetof(struct __sk_buff, data)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+   offsetof(struct __sk_buff, data_end)),
+   BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+   BPF_JMP_REG(BPF_JLE, BPF_REG_3, BPF_REG_0, 2),
+   BPF_MOV64_IMM(BPF_REG_0, 1),
+   BPF_EXIT_INSN(),
+   BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_2, 0),
+   BPF_JMP_IMM(BPF_JA, 0, 0, -4),
+   },
+   .result = REJECT,
+   .errstr = "invalid access to packet",
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   },
+   {
"helper access to packet: test1, valid packet_ptr range",
.insns = {
BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
@@ -4488,6 +4561,246 @@ struct test_val {
.prog_type = BPF_PROG_TYPE_TRACEPOINT,
},
{
+   "helper access to map: bounds check using <, good access",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+

Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-09 Thread Cong Wang

(Cc'ing Wei whose commit was blamed)

On Mon, Aug 7, 2017 at 2:15 PM, John Stultz  wrote:
> On Mon, Aug 7, 2017 at 2:05 PM, John Stultz  wrote:
>> So, with recent testing with my HiKey board, I've been noticing some
>> quirky behavior with my USB eth adapter.
>>
>> Basically, pluging the usb eth adapter in and then removing it, when
>> plugging it back in I often find that its not detected, and the system
>> slowly spits out the following message over and over:
>>   unregister_netdevice: waiting for eth0 to become free. Usage count = 1
>
> The other bit is that after this starts printing, the board will no
> longer reboot (it hangs continuing to occasionally print the above
> message), and I have to manually reset the device.
>

So this warning is not temporarily shown but lasts until a reboot,
right? If so it is a dst refcnt leak.

How reproducible is it for you? From my reading, it seems always
reproduced when you unplug and plug your usb eth interface?
Is there anything else involved? For example, network namespace.

Thanks.

Re: skb allocation from interrupt handler?

2017-08-09 Thread Stephen Hemminger

On Thu, 10 Aug 2017 00:29:19 +0200
Francois Romieu  wrote:

> Murali Karicheri  :
> [...]
> > The internal memory or FIFO can store only up to 3 MTU sized packets. So 
> > that has to
> > be processed before PRU gets another packets to send to CPU. So per above, 
> > it is not ideal to run NAPI for this scenario, right? Also for NetCP we use
> > about 128 descriptors with MTU size buffers to handle 1Gbps Ethernet link.
> > Based on that roughly we would need at least 10-12 buffers in the FIFO.
> > 
> > Currently we have a NAPI implementation in use that gives throughput of 
> > 95Mbps for
> > MTU sized packets, but our UDP iperf tests shows less than 1% packet loss 
> > for an
> > offered traffic of 95Mbps with MTU sized packets.  This is not good for 
> > industrial
> > network using HSR/PRP protocol for network redundancy. We need to have zero 
> > packet
> > loss for MTU sized packets at 95Mbps throughput. That is the problem 
> > description.  
> 
> Imvho you should instrument the kernel to figure where the excess latency that
> prevents NAPI processing to take place within 125 us of physical packet 
> reception
> comes from.
> 
> > As an experiment, I have moved the packet processing to irq handler to see 
> > if we 
> > can take advantage of CPU cycle to processing the packet instead of NAPI
> > and to check if the firmware encounters buffer overflow. The result is 
> > positive 
> > with no buffer overflow seen at the firmware and no packet loss in the 
> > iperf test.
> > But we want to do more testing as an experiment and ran into a uart console 
> > locks
> > up after running traffic for about 2 minutes. So I tried enabling the DEBUG 
> > HACK 
> > options to get some clue on what is happening and ran into the trace I 
> > shared 
> > earlier. So what function can I use to allocate SKB from interrupt handler 
> > ?  
> 
> Is your design also so tight on memory that you can't even refill your own
> software skb pool from some non-irq context then only swap buffers in the
> irq handler ?
> 

The current best practice in network drivers is to receive into
an allocated page, then create skb meta data with build_skb() in the NAPI poll
routine.

Re: Some traffic stress testing with 4.13.0-rc4-next-20170808 and BUG: [61183.212237] BUG: Bad page state in process ksoftirqd/52 pfn:855f2f

2017-08-09 Thread Cong Wang

On Wed, Aug 9, 2017 at 8:50 AM, Paweł Staszewski  wrote:
>
>
>
> [60961.112120] BUG: Bad page state in process ksoftirqd/44 pfn:855f2f
> [60961.112123] page:ea002157cbc0 count:51910 mapcount:0 mapping:
> (null) index:0x0
> [60961.112259] flags: 0x600()
> [60961.112375] raw: 0600  
> cac6
> [60961.112376] raw: dead0100 dead0200 
> 
> [60961.112377] page dumped because: nonzero _count
> [60961.112377] Modules linked in: sch_fq x86_pkg_temp_thermal ipmi_si
> [60961.112381] CPU: 44 PID: 232 Comm: ksoftirqd/44 Tainted: G B   W
> 4.13.0-rc4-next-20170808 #1
> [60961.112382] Call Trace:
> [60961.112390]  dump_stack+0x4d/0x63
> [60961.112394]  bad_page+0xf3/0x10f
> [60961.112396]  check_new_page_bad+0x73/0x75
> [60961.112397]  get_page_from_freelist+0x2a9/0x63f
> [60961.112398]  __alloc_pages_nodemask+0xf2/0x187
> [60961.112403]  alloc_pages_current+0x85/0x8c
> [60961.112405]  new_slab+0xc2/0x2f7
> [60961.112407]  ___slab_alloc+0x350/0x3cd
> [60961.112409]  ? fq_enqueue+0x24e/0x3e3 [sch_fq]
> [60961.112411]  __slab_alloc+0x12/0x17
> [60961.112412]  ? __slab_alloc+0x12/0x17
> [60961.112413]  kmem_cache_alloc+0x50/0x112
> [60961.112414]  fq_enqueue+0x24e/0x3e3 [sch_fq]
> [60961.112419]  ? ixgbe_select_queue+0x15/0x17
> [60961.112422]  __dev_queue_xmit+0x2a5/0x4b9
> [60961.112423]  dev_queue_xmit+0xb/0xd


Is this the first stack trace in your dmesg?

It doesn't look like a bug in sch_fq, but like we have
a bad mm page somewhere else, perhaps ixgbe driver.



> [60961.112423]  ? dev_queue_xmit+0xb/0xd
> [60961.112425]  vlan_dev_hard_start_xmit+0x81/0xaa
> [60961.112426]  dev_hard_start_xmit+0xc3/0x197
> [60961.112427]  __dev_queue_xmit+0x38d/0x4b9
> [60961.112429]  ? eth_header+0x27/0xab
> [60961.112429]  dev_queue_xmit+0xb/0xd
> [60961.112430]  ? dev_queue_xmit+0xb/0xd
> [60961.112432]  neigh_connected_output+0x9b/0xb2
> [60961.112436]  ip_finish_output2+0x24e/0x292
> [60961.112437]  ip_finish_output+0x11f/0x12b
> [60961.112438]  ip_output+0x56/0xa7
> [60961.112441]  ? ip_route_input_rcu+0x489/0x7d3
> [60961.112442]  ip_forward_finish+0x53/0x58
> [60961.112443]  ip_forward+0x2ff/0x350
> [60961.112443]  ? ip_frag_mem+0x1e/0x1e
> [60961.112444]  ip_rcv_finish+0x27f/0x28a
> [60961.112445]  ip_rcv+0x2c0/0x30d
> [60961.112448]  __netif_receive_skb_core+0x325/0x4c0
> [60961.112449]  __netif_receive_skb+0x18/0x5a
> [60961.112450]  ? dev_gro_receive+0x2a9/0x3b4
> [60961.112451]  ? __netif_receive_skb+0x18/0x5a
> [60961.112452]  netif_receive_skb_internal+0x4b/0x96
> [60961.112453]  napi_gro_receive+0x75/0xcc
> [60961.112455]  ixgbe_poll+0xed8/0xeef
> [60961.112459]  ? __printk_ratelimit+0x8/0x15
> [60961.112460]  net_rx_action+0xd3/0x22d
> [60961.112462]  __do_softirq+0xe4/0x23a
> [60961.112465]  ? sort_range+0x1d/0x1d
> [60961.112466]  run_ksoftirqd+0x15/0x2a
> [60961.112467]  smpboot_thread_fn+0x128/0x13f
> [60961.112470]  kthread+0xf7/0xfc
> [60961.112471]  ? init_completion+0x24/0x24
> [60961.112475]  ret_from_fork+0x22/0x30
> [60983.541418] capability: warning: `turbostat' uses 32-bit capabilities
> (legacy support in use)
> [61179.159683] ixgbe :d8:00.1 enp216s0f1: detected SFP+: 4
> [61179.503705] ixgbe :d8:00.1 enp216s0f1: NIC Link is Up 10 Gbps, Flow
> Control: None
> [61183.212237] BUG: Bad page state in process ksoftirqd/52 pfn:855f2f
> [61183.212240] page:ea002157cbc0 count:-1 mapcount:0 mapping:
> (null) index:0x0
> [61183.212375] flags: 0x600()
> [61183.212492] raw: 0600  
> 
> [61183.212493] raw: dead0100 dead0200 
> 
> [61183.212493] page dumped because: nonzero _count
> [61183.212494] Modules linked in: sch_fq x86_pkg_temp_thermal ipmi_si
> [61183.212498] CPU: 52 PID: 272 Comm: ksoftirqd/52 Tainted: G B   W
> 4.13.0-rc4-next-20170808 #1
> [61183.212500] Call Trace:
> [61183.212507]  dump_stack+0x4d/0x63
> [61183.212512]  bad_page+0xf3/0x10f
> [61183.212514]  check_new_page_bad+0x73/0x75
> [61183.212515]  get_page_from_freelist+0x2a9/0x63f
> [61183.212517]  __alloc_pages_nodemask+0xf2/0x187
> [61183.212523]  ixgbe_alloc_rx_buffers+0x77/0x1bb
> [61183.212524]  ixgbe_poll+0x421/0xeef
> [61183.212529]  ? cfs_rq_util_change+0x1e/0x20
> [61183.212532]  net_rx_action+0xd3/0x22d
> [61183.212534]  __do_softirq+0xe4/0x23a
> [61183.212536]  ? sort_range+0x1d/0x1d
> [61183.212537]  run_ksoftirqd+0x15/0x2a
> [61183.212538]  smpboot_thread_fn+0x128/0x13f
> [61183.212540]  kthread+0xf7/0xfc
> [61183.212541]  ? init_completion+0x24/0x24
> [61183.212546]  ret_from_fork+0x22/0x30
> [61183.283732] ixgbe :d8:00.0 enp216s0f0: detected SFP+: 3
> [61183.527747] ixgbe :d8:00.0 enp216s0f0: NIC Link is Up 10 Gbps, Flow
> Control: None
> [61223.651684] ixgbe :d8:00.1 enp216s0f1: detected SFP+: 4
> [61223.891712] ixgbe :d8:00.1 enp216s0f1: NIC Link

[PATCH net-next 0/2] zerocopy fixes

2017-08-09 Thread Willem de Bruijn

From: Willem de Bruijn 

Fix two issues introduced in the msg_zerocopy patchset.

Willem de Bruijn (2):
  sock: fix zerocopy panic in mem accounting
  sock: fix zerocopy_success regression with msg_zerocopy

 include/linux/skbuff.h | 9 +++--
 net/core/skbuff.c  | 4 ++--
 2 files changed, 9 insertions(+), 4 deletions(-)

-- 
2.14.0.434.g98096fd7a8-goog

[PATCH net-next 1/2] sock: fix zerocopy panic in mem accounting

2017-08-09 Thread Willem de Bruijn

From: Willem de Bruijn 

Only call mm_unaccount_pinned_pages when releasing a struct ubuf_info
that has initialized its field uarg->mmp.

Before this patch, a vhost-net with experimental_zcopytx can crash in

  mm_unaccount_pinned_pages
  sock_zerocopy_put
  skb_zcopy_clear
  skb_release_data

Only sock_zerocopy_alloc initializes this field. Move the unaccount
call from generic sock_zerocopy_put to its specific callback
sock_zerocopy_callback.

Fixes: a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages")
Reported-by: David Ahern 
Signed-off-by: Willem de Bruijn 
---
 net/core/skbuff.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 42b62c716a33..cb123590c674 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1044,6 +1044,8 @@ void sock_zerocopy_callback(struct ubuf_info *uarg, bool 
success)
u32 lo, hi;
u16 len;
 
+   mm_unaccount_pinned_pages(>mmp);
+
/* if !len, there was only 1 call, and it was aborted
 * so do not queue a completion notification
 */
@@ -1084,8 +1086,6 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_callback);
 void sock_zerocopy_put(struct ubuf_info *uarg)
 {
if (uarg && atomic_dec_and_test(>refcnt)) {
-   mm_unaccount_pinned_pages(>mmp);
-
if (uarg->callback)
uarg->callback(uarg, uarg->zerocopy);
else
-- 
2.14.0.434.g98096fd7a8-goog

[PATCH net-next 2/2] sock: fix zerocopy_success regression with msg_zerocopy

2017-08-09 Thread Willem de Bruijn

From: Willem de Bruijn 

Do not use uarg->zerocopy outside msg_zerocopy. In other paths the
field is not explicitly initialized and aliases another field.

Those paths have only one reference so do not need this intermediate
variable. Call uarg->callback directly.

Fixes: 1f8b977ab32d ("sock: enable MSG_ZEROCOPY")
Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8c0708d2e5e6..7594e19bce62 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1273,8 +1273,13 @@ static inline void skb_zcopy_clear(struct sk_buff *skb, 
bool zerocopy)
struct ubuf_info *uarg = skb_zcopy(skb);
 
if (uarg) {
-   uarg->zerocopy = uarg->zerocopy && zerocopy;
-   sock_zerocopy_put(uarg);
+   if (uarg->callback == sock_zerocopy_callback) {
+   uarg->zerocopy = uarg->zerocopy && zerocopy;
+   sock_zerocopy_put(uarg);
+   } else {
+   uarg->callback(uarg, zerocopy);
+   }
+
skb_shinfo(skb)->tx_flags &= ~SKBTX_ZEROCOPY_FRAG;
}
 }
-- 
2.14.0.434.g98096fd7a8-goog

RE: [ovs-dev] [PATCH net-next] openvswitch: add NSH support

2017-08-09 Thread Yang, Yi Y

struct ovs_action_encap_nsh is the only one way we transfer all the data for 
encap_nsh, netlink allows variable attribute, so I don't think we break netlink 
convention or abuse this variable feature.

Even if we bring nested attributes to handle this, OVS_ACTION_ATTR_ENCAP_NSH is 
still length-variable, OVS_NSH_ATTR_MD2 is also length-variable (it can be from 
0 to 248), so I don't think such way can avoid the issue you're addressing.

The result will be worse, it will make many difficulties when we transfer all 
the data for encap_nsh between OVS' components.

-Original Message-
From: Ben Pfaff [mailto:b...@ovn.org] 
Sent: Thursday, August 10, 2017 4:54 AM
To: Yang, Yi Y 
Cc: Jan Scheurich ; d...@openvswitch.org; 
netdev@vger.kernel.org; Jiri Benc ; da...@davemloft.net; 
Zoltán Balogh 
Subject: Re: [ovs-dev] [PATCH net-next] openvswitch: add NSH support

On Wed, Aug 09, 2017 at 08:12:36PM +, Yang, Yi Y wrote:
> Ben, do you mean we bring two new attributes (OVS_NSH_ATTR_MD1 and
> OVS_NSH_ATTR_MD2) and embed one of them in OVS_ACTION_ATTR_ENCAP_NSH?

Yes.

> Anyway we need to use a struct or something else to transfer those 
> metadata between functions, how do you think we can handle this 
> without metadata in struct ovs_action_encap_nsh? I mean how we handle 
> the arguments for function encap_nsh.

I guess that a pointer to the embedded nlattr with type OVS_NSH_ATTR_MD1 or 
OVS_NSH_ATTR2 should work OK.

Keep in mind that I'm not a kernel-side maintainer of any kind.  I am only 
passing along what I've perceived to be common Netlink protocol design patterns.

> -Original Message-
> From: netdev-ow...@vger.kernel.org 
> [mailto:netdev-ow...@vger.kernel.org] On Behalf Of Ben Pfaff
> Sent: Thursday, August 10, 2017 2:09 AM
> To: Yang, Yi Y 
> Cc: Jan Scheurich ; d...@openvswitch.org; 
> netdev@vger.kernel.org; Jiri Benc ; 
> da...@davemloft.net; Zoltán Balogh 
> Subject: Re: [ovs-dev] [PATCH net-next] openvswitch: add NSH support
> 
> On Wed, Aug 09, 2017 at 09:41:51AM +, Yang, Yi Y wrote:
> > Hi,  Jan
> > 
> > I have worked out a patch, will send it quickly for Ben. In 
> > addition, I also will send out a patch to change encap_nsh 
> > _nsh to push_nsh and pop_nsh. Per comments from all the 
> > people, we all agreed to do so :-)
> > 
> > diff --git a/datapath/linux/compat/include/linux/openvswitch.h
> > b/datapath/linux/compat/include/linux/openvswitch.h
> > index bc6c94b..4936c12 100644
> > --- a/datapath/linux/compat/include/linux/openvswitch.h
> > +++ b/datapath/linux/compat/include/linux/openvswitch.h
> > @@ -793,7 +793,7 @@ struct ovs_action_push_eth {
> > struct ovs_key_ethernet addresses;  };
> > 
> > -#define OVS_ENCAP_NSH_MAX_MD_LEN 16
> > +#define OVS_ENCAP_NSH_MAX_MD_LEN 248
> >  /*
> >   * struct ovs_action_encap_nsh - %OVS_ACTION_ATTR_ENCAP_NSH
> >   * @flags: NSH header flags.
> > @@ -809,7 +809,7 @@ struct ovs_action_encap_nsh {
> >  uint8_t mdlen;
> >  uint8_t np;
> >  __be32 path_hdr;
> > -uint8_t metadata[OVS_ENCAP_NSH_MAX_MD_LEN];
> > +uint8_t metadata[];
> >  };
> 
> This brings the overall format of ovs_action_encap_nsh to:
> 
> struct ovs_action_encap_nsh {
> uint8_t flags;
> uint8_t mdtype;
> uint8_t mdlen;
> uint8_t np;
> __be32 path_hdr;
> uint8_t metadata[];
> };
> 
> This is an unusual format for a Netlink attribute.  More commonly, one would 
> put variable-length data into an attribute of its own, which allows that data 
> to be handled using the regular Netlink means.  Then the mdlen and metadata 
> members could be removed, since they would be part of the additional 
> attribute, and one might expect the mdtype member to be removed as well since 
> each type of metadata would be in a different attribute type.
> 
> So, a format closer to what I expect to see in Netlink is something 
> like
> this:
> 
> /**
>  * enum ovs_nsh_attr - Metadata attributes for %OVS_ACTION_ENCAP_NSH action.
>  *
>  * @OVS_NSH_ATTR_MD1: Contains 16-byte NSH type-1 metadata.
>  * @OVS_NSH_ATTR_MD2: Contains 0- to 255-byte variable-length NSH 
> type-2
>  * metadata. */
> enum ovs_nsh_attr {
> OVS_NSH_ATTR_MD1,
> OVS_NSH_ATTR_MD2
> };
> 
> /*
>  * struct ovs_action_encap_nsh - %OVS_ACTION_ATTR_ENCAP_NSH
>  *
>  * @path_hdr: NSH service path id and service index.
>  * @flags: NSH header flags.
>  * @np: NSH next_protocol: Inner packet type.
>  *
>  * Followed by either %OVS_NSH_ATTR_MD1 or %OVS_NSH_ATTR_MD2 attribute.
>  */
> struct ovs_action_encap_nsh {
> __be32 path_hdr;
> uint8_t flags;
> uint8_t np;
> };

Re: [PATCH v4 10/12] ARM: dts: rk3228-evb: Enable the internal phy for gmac

2017-08-09 Thread Florian Fainelli

On August 9, 2017 5:13:19 AM PDT, David Wu  wrote:
>This patch enables the internal phy for rk3228 evb board
>by default.
>To use the external 1000M phy on evb board, need to make
>some switch of evb board to be on.
>
>Signed-off-by: David Wu 


LGTM

Reviewed-by: Florian Fainelli 

-- 
Florian

Re: [PATCH v4 05/12] Documentation: net: phy: Add phy-is-internal binding

2017-08-09 Thread Florian Fainelli

On August 9, 2017 5:10:30 AM PDT, David Wu  wrote:
>Add the documentation for internal phy. A boolean property
>indicates that a internal phy will be used.
>
>Signed-off-by: David Wu 
>---
> Documentation/devicetree/bindings/net/phy.txt | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/Documentation/devicetree/bindings/net/phy.txt
>b/Documentation/devicetree/bindings/net/phy.txt
>index b558576..942c892 100644
>--- a/Documentation/devicetree/bindings/net/phy.txt
>+++ b/Documentation/devicetree/bindings/net/phy.txt
>@@ -52,6 +52,9 @@ Optional Properties:
>   Mark the corresponding energy efficient ethernet mode as broken and
>   request the ethernet to stop advertising it.
> 
>+- phy-is-internal: If set, indicates that phy will connect to the MAC
>as a
>+  internal phy.

Something along the lines of:

If set, indicates that the PHY is integrated into the same physical package as 
the Ethernet MAC.

Please always capitalize PHY.


-- 
Florian

Re: [PATCH v3 05/11] net: stmmac: dwmac-rk: Add internal phy support

2017-08-09 Thread Florian Fainelli

On August 9, 2017 1:45:41 AM PDT, Corentin Labbe  
wrote:
>On Thu, Aug 03, 2017 at 07:06:33PM +0800, Chen-Yu Tsai wrote:
>> On Thu, Aug 3, 2017 at 1:38 AM, Florian Fainelli
> wrote:
>> > On 08/01/2017 11:21 PM, David Wu wrote:
>> >> To make internal phy work, need to configure the phy_clock,
>> >> phy cru_reset and related registers.
>> >>
>> >> Signed-off-by: David Wu 
>> >> ---
>> >>  .../devicetree/bindings/net/rockchip-dwmac.txt |  6 +-
>> >>  drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c | 81
>++
>> >>  2 files changed, 86 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git
>a/Documentation/devicetree/bindings/net/rockchip-dwmac.txt
>b/Documentation/devicetree/bindings/net/rockchip-dwmac.txt
>> >> index 8f42755..ec39b31 100644
>> >> --- a/Documentation/devicetree/bindings/net/rockchip-dwmac.txt
>> >> +++ b/Documentation/devicetree/bindings/net/rockchip-dwmac.txt
>> >> @@ -25,7 +25,8 @@ Required properties:
>> >>   - clock-names: One name for each entry in the clocks property.
>> >>   - phy-mode: See ethernet.txt file in the same directory.
>> >>   - pinctrl-names: Names corresponding to the numbered pinctrl
>states.
>> >> - - pinctrl-0: pin-control mode. can be <_pins> or
><_pins>.
>> >> + - pinctrl-0: pin-control mode. can be <_pins>,
><_pins> or led pins
>> >> +   for internal phy mode.
>> >>   - clock_in_out: For RGMII, it must be "input", means main
>clock(125MHz)
>> >> is not sourced from SoC's PLL, but input from PHY; For RMII,
>"input" means
>> >> PHY provides the reference clock(50MHz), "output" means GMAC
>provides the
>> >> @@ -40,6 +41,9 @@ Optional properties:
>> >>   - tx_delay: Delay value for TXD timing. Range value is 0~0x7F,
>0x30 as default.
>> >>   - rx_delay: Delay value for RXD timing. Range value is 0~0x7F,
>0x10 as default.
>> >>   - phy-supply: phandle to a regulator if the PHY needs one
>> >> + - clocks: < MAC_PHY>: Clock selector for internal macphy
>> >> + - phy-is-internal: A boolean property allows us to know that MAC
>will connect to
>> >> +   internal phy.
>> >
>> > This is incorrect in two ways:
>> >
>> > - this is a property of the PHY, so it should be documented as such
>in
>> > Documentation/devicetree/bindings/net/phy.txt so other bindings can
>> > re-use it
>> >
>> > - if it was specific to your MAC you would expect a vendor prefix
>to
>> > this property, which is not there
>> >
>> > An alternative way to implement the external/internal logic
>selection
>> > would be mandate that your Ethernet PHY node have a compatible
>string
>> > like this:
>> >
>> > phy@0 {
>> > compatible = "ethernet-phy-id1234.d400",
>"ethernet-phy-802.3-c22";
>> > };
>> >
>> > Then you don't need this phy-is-internal property, because you can
>> > locate the PHY node by the phy-handle (see more about that in a
>reply to
>> > patch 10) and you can determine ahead of time whether this PHY is
>> > internal or not based on its compatible string.
>> 
>> This doesn't really work for us (sunxi). The "internal" PHY of the H3
>> is also available in the X-Powers AC200 combo chip, which would be
>> external. Same ID. So if someone were to be stupid and put the two
>> together, you wouldn't know which one it actually is. Generically
>> it doesn't make sense to match against the ID either. The PHY is
>> only integrated or inlined into the SoC, meaning it could be moved
>> elsewhere or even be a discreet part.

It actually makes a lot of sense to differentiate on the PHY ID because you are 
supposed to allocate an unique ID based on how the integration of the PHY is 
done. Not doing that is making a sloppy job at integrating HW blocks, but such 
is life and there is no shortage of creativity amongst HW engineers when they 
are not given feedback from SW people.

>> 
>> The way I see it is more like a reversed pinmux. The system should
>> select either the internal set or external set of xMII pins to use.
>> 
>> A phy-is-internal property in the PHY node would work for us. We
>> already need a PHY node to describe its clocks and resets.
>> 
>> A side note to this solution is that, since the internal PHY is
>defined
>> at the .dtsi level, any external PHYs at the same address would need
>to
>> make sure to delete the property, which is kind of counterintuitive,
>but
>> it is how device tree works. One would want to put the internal PHY's
>> address, assuming it is configurable, on something that is rarely
>used.
>> 
>
>Hello David, Florian, Andrew
>
>Could someone ack this ? or nack if you think that the chance for
>having two PHY id both internal and external is too low.
>Anyway, we need a choice.

Let's move forward with the 'phy-is-integrated' Boolean property 
('phy-is-internal' is too close from the proprietary "internal" phy-mode IMHO). 
Andrew, does that also work for you?

-- 
Florian

Help with debugging CPU stall

2017-08-09 Thread Joe Smith

Hi,

I am debugging a system where rcu_sched detects cpu stall. The system
is running a test which runs fine if IPSec is not used. With IPSec it
stalls. I have tried reducing the value of netdev_budget to 100 but
that seems to have no impact. I have included two stack traces below.


How do I figure out where the system is looping/stuck.


Thanks for the help.

Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410509]
[] ? do_IRQ+0x69/0xe0
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410513]
[] ? common_interrupt+0x13/0x13
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410520]
[] ? sched_slice+0x4c/0x90
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410524]
[] print_cpu_stall+0x42/0xa0
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410528]
[] check_cpu_stall+0xca/0xe0
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410533]
[] __rcu_pending+0x30/0x140
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410537]
[] rcu_pending+0x37/0x90
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410541]
[] rcu_check_callbacks+0x85/0xa0
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410546]
[] update_process_times+0x46/0x90
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410553]
[] tick_sched_timer+0x66/0xd0
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410557]
[] ? tick_clock_notify+0x60/0x60
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410563]
[] __run_hrtimer+0x83/0x1e0
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410566]
[] hrtimer_interrupt+0xe6/0x240
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410573]
[] local_apic_timer_interrupt+0x3b/0x70
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410578]
[] smp_apic_timer_interrupt+0x45/0x5a
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410582]
[] apic_timer_interrupt+0x13/0x20
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410586]
[] shash_finup_unaligned+0x30/0x40
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410592]
[] ? dec128+0x742/0x818 [aes_x86_64]
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410597]
[] ? aes_decrypt+0x12/0x30 [aes_x86_64]
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410602]
[] ? crypto_cbc_decrypt_inplace+0xc2/0x120 [cbc]
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410607]
[] ? dec128+0x818/0x818 [aes_x86_64]
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410612]
[] ? crypto_cbc_decrypt+0x7d/0x90 [cbc]
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410617]
[] ? async_decrypt+0x3d/0x40
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410621]
[] ? crypto_authenc_decrypt+0xa8/0xb0 [authenc]
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410626]
[] ? esp_input+0x1e8/0x350 [esp4]
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410630]
[] ? xfrm_state_lookup+0x69/0x90
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410634]
[] ? xfrm_input+0x691/0x6e0
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410639]
[] ? xfrm4_rcv_encap+0x20/0x30
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410643]
[] ? xfrm4_rcv+0x24/0x30
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410647]
[] ? ip_local_deliver_finish+0x129/0x280
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410650]
[] ? ip_local_deliver+0x40/0xa0
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410654]
[] ? ip_rcv_finish+0x179/0x380
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410658]
[] ? ip_rcv+0x222/0x320
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410662]
[] ? __netif_receive_skb+0x25b/0x500
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410667]
[] ? netif_receive_skb+0x7d/0x90
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410671]
[] ? swiotlb_sync_single+0x2d/0x70
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410675]
[] ? napi_skb_finish+0x48/0x60
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410679]
[] ? napi_gro_receive+0xfb/0x130
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410690]
[] ? ixgbe_rx_skb+0x41/0xd0 [ixgbe]
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410701]
[] ? ixgbe_clean_rx_irq+0x10c/0x1d0 [ixgbe]
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410712]
[] ? ixgbe_poll+0xa5/0x130 [ixgbe]
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410716]
[] ? net_rx_action+0x14a/0x230
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410721]
[] ? rcu_check_quiescent_state+0x21/0x60
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410725]
[] ? __do_softirq+0xb9/0x1d0
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410729]
[] ? rcu_irq_exit+0xe/0x10
Aug  8 15:37:38 scad01adm01 kernel: [ 1884.410733]
[] ? call_softirq+0x1c/0x30

Another Trace:

Aug  9 14:12:05 scad01adm01 kernel: [82985.293794] Call Trace:
Aug  9 14:12:05 scad01adm01 kernel: [82985.293795]  
Aug  9 14:12:05 scad01adm01 kernel: [82985.293800]
[] delay_tsc+0x4d/0x80
Aug  9 14:12:05 scad01adm01 kernel: [82985.293804]
[] __delay+0xf/0x20
Aug  9 14:12:05 scad01adm01 kernel: [82985.293807]
[] __const_udelay+0x2c/0x30
Aug  9 14:12:05 scad01adm01 kernel: [82985.293814]
[] wait_for_xmitr+0x30/0xa0
Aug  9 14:12:05 scad01adm01 kernel: [82985.293818]
[] ? wait_for_xmitr+0xa0/0xa0
Aug  9 14:12:05 scad01adm01 kernel: [82985.293822]
[] serial8250_console_putchar+0x26/0x40
Aug  9 14:12:05 scad01adm01 kernel: [82985.293826]
[] uart_console_write+0x3d/0x70
Aug  9 14:12:05 scad01adm01 kernel: [82985.293831]
[]

Re: skb allocation from interrupt handler?

2017-08-09 Thread Francois Romieu

Murali Karicheri  :
[...]
> The internal memory or FIFO can store only up to 3 MTU sized packets. So that 
> has to
> be processed before PRU gets another packets to send to CPU. So per above, 
> it is not ideal to run NAPI for this scenario, right? Also for NetCP we use
> about 128 descriptors with MTU size buffers to handle 1Gbps Ethernet link.
> Based on that roughly we would need at least 10-12 buffers in the FIFO.
> 
> Currently we have a NAPI implementation in use that gives throughput of 
> 95Mbps for
> MTU sized packets, but our UDP iperf tests shows less than 1% packet loss for 
> an
> offered traffic of 95Mbps with MTU sized packets.  This is not good for 
> industrial
> network using HSR/PRP protocol for network redundancy. We need to have zero 
> packet
> loss for MTU sized packets at 95Mbps throughput. That is the problem 
> description.

Imvho you should instrument the kernel to figure where the excess latency that
prevents NAPI processing to take place within 125 us of physical packet 
reception
comes from.

> As an experiment, I have moved the packet processing to irq handler to see if 
> we 
> can take advantage of CPU cycle to processing the packet instead of NAPI
> and to check if the firmware encounters buffer overflow. The result is 
> positive 
> with no buffer overflow seen at the firmware and no packet loss in the iperf 
> test.
> But we want to do more testing as an experiment and ran into a uart console 
> locks
> up after running traffic for about 2 minutes. So I tried enabling the DEBUG 
> HACK 
> options to get some clue on what is happening and ran into the trace I shared 
> earlier. So what function can I use to allocate SKB from interrupt handler ?

Is your design also so tight on memory that you can't even refill your own
software skb pool from some non-irq context then only swap buffers in the
irq handler ?

-- 
Ueimor

[PATCH iproute2 master] bpf: unbreak libelf linkage for bpf obj loader

2017-08-09 Thread Daniel Borkmann

Commit 69fed534a533 ("change how Config is used in Makefile's") moved
HAVE_MNL specific CFLAGS/LDLIBS for building with libmnl out of the
top level Makefile into sub-Makefiles. However, it also removed the
HAVE_ELF specific CFLAGS/LDLIBS entirely, which breaks the BPF object
loader for tc and ip with "No ELF library support compiled in." despite
having libelf detected in configure script. Fix it similarly as in
69fed534a533 for HAVE_ELF.

Fixes: 69fed534a533 ("change how Config is used in Makefile's")
Reported-by: Jeffrey Panneman 
Signed-off-by: Daniel Borkmann 
---
 ip/Makefile  | 4 
 lib/Makefile | 4 
 tc/Makefile  | 4 
 3 files changed, 12 insertions(+)

diff --git a/ip/Makefile b/ip/Makefile
index 572604d..a754c04 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -19,6 +19,10 @@ ifeq ($(IP_CONFIG_SETNS),y)
CFLAGS += -DHAVE_SETNS
 endif
 
+ifeq ($(HAVE_ELF),y)
+   CFLAGS += -DHAVE_ELF
+   LDLIBS += -lelf
+endif
 ifeq ($(HAVE_MNL),y)
CFLAGS += -DHAVE_LIBMNL $(shell $(PKG_CONFIG) libmnl --cflags)
LDLIBS += $(shell $(PKG_CONFIG) libmnl --libs)
diff --git a/lib/Makefile b/lib/Makefile
index 637fe48..b7b1d56 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -4,6 +4,10 @@ ifeq ($(IP_CONFIG_SETNS),y)
CFLAGS += -DHAVE_SETNS
 endif
 
+ifeq ($(HAVE_ELF),y)
+   CFLAGS += -DHAVE_ELF
+endif
+
 ifeq ($(HAVE_MNL),y)
CFLAGS += -DHAVE_LIBMNL $(shell $(PKG_CONFIG) libmnl --cflags)
 endif
diff --git a/tc/Makefile b/tc/Makefile
index c364a05..a9b4b8e 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -102,6 +102,10 @@ endif
 TCOBJ += $(TCMODULES)
 LDLIBS += -L. -lm
 
+ifeq ($(HAVE_ELF),y)
+   CFLAGS += -DHAVE_ELF
+   LDLIBS += -lelf
+endif
 ifeq ($(HAVE_MNL),y)
CFLAGS += -DHAVE_LIBMNL $(shell $(PKG_CONFIG) libmnl --cflags)
LDLIBS += $(shell $(PKG_CONFIG) libmnl --libs)
-- 
1.9.3

[net-next 03/12] e1000e: add check on e1e_wphy() return value

2017-08-09 Thread Jeff Kirsher

From: Gustavo A R Silva 

Check return value from call to e1e_wphy(). This value is being
checked during previous calls to function e1e_wphy() and it seems
a check was missing here.

Addresses-Coverity-ID: 1226905
Signed-off-by: Gustavo A R Silva 
Reviewed-by: Ethan Zhao 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/e1000e/ich8lan.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c 
b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index 68ea8b4555ab..d6d4ed7acf03 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -2437,6 +2437,8 @@ static s32 e1000_hv_phy_workarounds_ich8lan(struct 
e1000_hw *hw)
if (hw->phy.revision < 2) {
e1000e_phy_sw_reset(hw);
ret_val = e1e_wphy(hw, MII_BMCR, 0x3140);
+   if (ret_val)
+   return ret_val;
}
}
 
-- 
2.13.3

[net-next 02/12] igb: protect TX timestamping from API misuse

2017-08-09 Thread Jeff Kirsher

From: Cliff Spradlin 

HW timestamping can only be requested for a packet if the NIC is first
setup via ioctl(SIOCSHWTSTAMP). If this step was skipped, then the igb
driver still allowed TX packets to request HW timestamping. In this
situation, the _IGB_PTP_TX_IN_PROGRESS flag was set and would never
clear. This prevented any future HW timestamping requests to succeed.

Fix this by checking that the NIC is configured for HW TX timestamping
before accepting a HW TX timestamping request.

Signed-off-by: Cliff Spradlin 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 6a63ea564a57..5d0a75c1ba0c 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5380,7 +5380,8 @@ netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)) {
struct igb_adapter *adapter = netdev_priv(tx_ring->netdev);
 
-   if (!test_and_set_bit_lock(__IGB_PTP_TX_IN_PROGRESS,
+   if (adapter->tstamp_config.tx_type & HWTSTAMP_TX_ON &&
+   !test_and_set_bit_lock(__IGB_PTP_TX_IN_PROGRESS,
   >state)) {
skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
tx_flags |= IGB_TX_FLAGS_TSTAMP;
-- 
2.13.3

[net-next 01/12] igb: Fix error of RX network flow classification

2017-08-09 Thread Jeff Kirsher

From: Gangfeng Huang 

After add an ethertype filter, if user change the adapter speed several
times, the error "ethtool -N: etype filters are all used" is reported by
igb driver.

In older patch, function igb_nfc_filter_exit() and igb_nfc_filter_restore()
is not paried. igb_nfc_filter_restore() exist in igb_up(), but function
igb_nfc_filter_exit() is exist in __igb_close(). In the process of speed
changing, only igb_nfc_filter_restore() is called, it will take a position
of ethertype bitmap.

Reproduce steps:
Step 1: Add a etype filter by ethtool
$ethtool -N eth0 flow-type ether proto 0x88F8 action 1
Step 2: Change the adapter speed to 100M/full duplex
$ethtool -s eth0 speed 100 duplex full
Step 3: Change the adapter speed to 1000M/full duplex
ethtool -s eth0 speed 1000 duplex full
Repeat step2 and step3, then dmesg the system log, you can find the error
message, add new ethtype filter is also failed.

This fixing is move igb_nfc_filter_exit() from __igb_close() to igb_down()
to make igb_nfc_filter_restore()/igb_nfc_filter_exit() is paired.

Signed-off-by: Gangfeng Huang 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index ec62410b035a..6a63ea564a57 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1791,6 +1791,8 @@ void igb_down(struct igb_adapter *adapter)
wr32(E1000_RCTL, rctl & ~E1000_RCTL_EN);
/* flush and sleep below */
 
+   igb_nfc_filter_exit(adapter);
+
netif_carrier_off(netdev);
netif_tx_stop_all_queues(netdev);
 
@@ -3317,8 +3319,6 @@ static int __igb_close(struct net_device *netdev, bool 
suspending)
igb_down(adapter);
igb_free_irq(adapter);
 
-   igb_nfc_filter_exit(adapter);
-
igb_free_all_tx_resources(adapter);
igb_free_all_rx_resources(adapter);
 
-- 
2.13.3

[net-next 04/12] igb: Remove incorrect "unexpected SYS WRAP" log message

2017-08-09 Thread Jeff Kirsher

From: Corinna Vinschen 

TSAUXC.DisableSystime is never set, so SYSTIM runs into a SYS WRAP
every 1100 secs on 80580/i350/i354 (40 bit SYSTIM) and every 35000
secs on 80576 (45 bit SYSTIM).

This wrap event sets the TSICR.SysWrap bit unconditionally.

However, checking TSIM at interrupt time shows that this event does not
actually cause the interrupt.  Rather, it's just bycatch while the
actual interrupt is caused by, for instance, TSICR.TXTS.

The conclusion is that the SYS WRAP is actually expected, so the
"unexpected SYS WRAP" message is entirely bogus and just helps to
confuse users.  Drop it.

Signed-off-by: Corinna Vinschen 
Acked-by: Jacob Keller 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 5d0a75c1ba0c..1a99164d5d11 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5746,8 +5746,6 @@ static void igb_tsync_interrupt(struct igb_adapter 
*adapter)
event.type = PTP_CLOCK_PPS;
if (adapter->ptp_caps.pps)
ptp_clock_event(adapter->ptp_clock, );
-   else
-   dev_err(>pdev->dev, "unexpected SYS WRAP");
ack |= TSINTR_SYS_WRAP;
}
 
-- 
2.13.3

[net-next 06/12] igb: expose mailbox unlock method

2017-08-09 Thread Jeff Kirsher

From: Greg Edwards 

Add a mailbox unlock method to e1000_mbx_operations, which will be used
to unlock the PF/VF mailbox by the PF.

Signed-off-by: Greg Edwards 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/e1000_hw.h  |  1 +
 drivers/net/ethernet/intel/igb/e1000_mbx.c | 39 ++
 drivers/net/ethernet/intel/igb/e1000_mbx.h |  1 +
 3 files changed, 41 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/e1000_hw.h 
b/drivers/net/ethernet/intel/igb/e1000_hw.h
index fd7865a8d2e3..6076f258a0a5 100644
--- a/drivers/net/ethernet/intel/igb/e1000_hw.h
+++ b/drivers/net/ethernet/intel/igb/e1000_hw.h
@@ -499,6 +499,7 @@ struct e1000_mbx_operations {
s32 (*check_for_msg)(struct e1000_hw *hw, u16 mbx_id);
s32 (*check_for_ack)(struct e1000_hw *hw, u16 mbx_id);
s32 (*check_for_rst)(struct e1000_hw *hw, u16 mbx_id);
+   s32 (*unlock)(struct e1000_hw *hw, u16 mbx_id);
 };
 
 struct e1000_mbx_stats {
diff --git a/drivers/net/ethernet/intel/igb/e1000_mbx.c 
b/drivers/net/ethernet/intel/igb/e1000_mbx.c
index 00e263f0c030..6aa44723507b 100644
--- a/drivers/net/ethernet/intel/igb/e1000_mbx.c
+++ b/drivers/net/ethernet/intel/igb/e1000_mbx.c
@@ -125,6 +125,24 @@ s32 igb_check_for_rst(struct e1000_hw *hw, u16 mbx_id)
 }
 
 /**
+ *  igb_unlock_mbx - unlock the mailbox
+ *  @hw: pointer to the HW structure
+ *  @mbx_id: id of mailbox to check
+ *
+ *  returns SUCCESS if the mailbox was unlocked or else ERR_MBX
+ **/
+s32 igb_unlock_mbx(struct e1000_hw *hw, u16 mbx_id)
+{
+   struct e1000_mbx_info *mbx = >mbx;
+   s32 ret_val = -E1000_ERR_MBX;
+
+   if (mbx->ops.unlock)
+   ret_val = mbx->ops.unlock(hw, mbx_id);
+
+   return ret_val;
+}
+
+/**
  *  igb_poll_for_msg - Wait for message notification
  *  @hw: pointer to the HW structure
  *  @mbx_id: id of mailbox to write
@@ -341,6 +359,26 @@ static s32 igb_obtain_mbx_lock_pf(struct e1000_hw *hw, u16 
vf_number)
 }
 
 /**
+ *  igb_release_mbx_lock_pf - release mailbox lock
+ *  @hw: pointer to the HW structure
+ *  @vf_number: the VF index
+ *
+ *  return SUCCESS if we released the mailbox lock
+ **/
+static s32 igb_release_mbx_lock_pf(struct e1000_hw *hw, u16 vf_number)
+{
+   u32 p2v_mailbox;
+
+   /* drop PF lock of mailbox, if set */
+   p2v_mailbox = rd32(E1000_P2VMAILBOX(vf_number));
+   if (p2v_mailbox & E1000_P2VMAILBOX_PFU)
+   wr32(E1000_P2VMAILBOX(vf_number),
+p2v_mailbox & ~E1000_P2VMAILBOX_PFU);
+
+   return 0;
+}
+
+/**
  *  igb_write_mbx_pf - Places a message in the mailbox
  *  @hw: pointer to the HW structure
  *  @msg: The message buffer
@@ -437,6 +475,7 @@ s32 igb_init_mbx_params_pf(struct e1000_hw *hw)
mbx->ops.check_for_msg = igb_check_for_msg_pf;
mbx->ops.check_for_ack = igb_check_for_ack_pf;
mbx->ops.check_for_rst = igb_check_for_rst_pf;
+   mbx->ops.unlock = igb_release_mbx_lock_pf;
 
mbx->stats.msgs_tx = 0;
mbx->stats.msgs_rx = 0;
diff --git a/drivers/net/ethernet/intel/igb/e1000_mbx.h 
b/drivers/net/ethernet/intel/igb/e1000_mbx.h
index 73d90aeb48b2..a98c5dc60afd 100644
--- a/drivers/net/ethernet/intel/igb/e1000_mbx.h
+++ b/drivers/net/ethernet/intel/igb/e1000_mbx.h
@@ -72,6 +72,7 @@ s32 igb_write_mbx(struct e1000_hw *hw, u32 *msg, u16 size, 
u16 mbx_id);
 s32 igb_check_for_msg(struct e1000_hw *hw, u16 mbx_id);
 s32 igb_check_for_ack(struct e1000_hw *hw, u16 mbx_id);
 s32 igb_check_for_rst(struct e1000_hw *hw, u16 mbx_id);
+s32 igb_unlock_mbx(struct e1000_hw *hw, u16 mbx_id);
 s32 igb_init_mbx_params_pf(struct e1000_hw *hw);
 
 #endif /* _E1000_MBX_H_ */
-- 
2.13.3

[net-next 05/12] igb: add argument names to mailbox op function declarations

2017-08-09 Thread Jeff Kirsher

From: Greg Edwards 

Signed-off-by: Greg Edwards 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/e1000_hw.h  | 15 ---
 drivers/net/ethernet/intel/igb/e1000_mbx.h | 12 ++--
 2 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_hw.h 
b/drivers/net/ethernet/intel/igb/e1000_hw.h
index 2fb2213cd562..fd7865a8d2e3 100644
--- a/drivers/net/ethernet/intel/igb/e1000_hw.h
+++ b/drivers/net/ethernet/intel/igb/e1000_hw.h
@@ -491,13 +491,14 @@ struct e1000_fc_info {
 
 struct e1000_mbx_operations {
s32 (*init_params)(struct e1000_hw *hw);
-   s32 (*read)(struct e1000_hw *, u32 *, u16,  u16);
-   s32 (*write)(struct e1000_hw *, u32 *, u16, u16);
-   s32 (*read_posted)(struct e1000_hw *, u32 *, u16,  u16);
-   s32 (*write_posted)(struct e1000_hw *, u32 *, u16, u16);
-   s32 (*check_for_msg)(struct e1000_hw *, u16);
-   s32 (*check_for_ack)(struct e1000_hw *, u16);
-   s32 (*check_for_rst)(struct e1000_hw *, u16);
+   s32 (*read)(struct e1000_hw *hw, u32 *msg, u16 size, u16 mbx_id);
+   s32 (*write)(struct e1000_hw *hw, u32 *msg, u16 size, u16 mbx_id);
+   s32 (*read_posted)(struct e1000_hw *hw, u32 *msg, u16 size, u16 mbx_id);
+   s32 (*write_posted)(struct e1000_hw *hw, u32 *msg, u16 size,
+   u16 mbx_id);
+   s32 (*check_for_msg)(struct e1000_hw *hw, u16 mbx_id);
+   s32 (*check_for_ack)(struct e1000_hw *hw, u16 mbx_id);
+   s32 (*check_for_rst)(struct e1000_hw *hw, u16 mbx_id);
 };
 
 struct e1000_mbx_stats {
diff --git a/drivers/net/ethernet/intel/igb/e1000_mbx.h 
b/drivers/net/ethernet/intel/igb/e1000_mbx.h
index 3e7fed73df15..73d90aeb48b2 100644
--- a/drivers/net/ethernet/intel/igb/e1000_mbx.h
+++ b/drivers/net/ethernet/intel/igb/e1000_mbx.h
@@ -67,11 +67,11 @@
 
 #define E1000_PF_CONTROL_MSG   0x0100 /* PF control message */
 
-s32 igb_read_mbx(struct e1000_hw *, u32 *, u16, u16);
-s32 igb_write_mbx(struct e1000_hw *, u32 *, u16, u16);
-s32 igb_check_for_msg(struct e1000_hw *, u16);
-s32 igb_check_for_ack(struct e1000_hw *, u16);
-s32 igb_check_for_rst(struct e1000_hw *, u16);
-s32 igb_init_mbx_params_pf(struct e1000_hw *);
+s32 igb_read_mbx(struct e1000_hw *hw, u32 *msg, u16 size, u16 mbx_id);
+s32 igb_write_mbx(struct e1000_hw *hw, u32 *msg, u16 size, u16 mbx_id);
+s32 igb_check_for_msg(struct e1000_hw *hw, u16 mbx_id);
+s32 igb_check_for_ack(struct e1000_hw *hw, u16 mbx_id);
+s32 igb_check_for_rst(struct e1000_hw *hw, u16 mbx_id);
+s32 igb_init_mbx_params_pf(struct e1000_hw *hw);
 
 #endif /* _E1000_MBX_H_ */
-- 
2.13.3

[net-next 07/12] igb: do not drop PF mailbox lock after read of VF message

2017-08-09 Thread Jeff Kirsher

From: Greg Edwards 

When the PF receives a mailbox message from the VF, it grabs the mailbox
lock, reads the VF message from the mailbox, ACKs the message and drops
the lock.

While the PF is performing the action for the VF message, nothing
prevents another VF message from being posted to the mailbox.  The
current code handles this condition by just dropping any new VF messages
without processing them.  This results in a mailbox timeout in the VM
for posted messages waiting for an ACK, and the VF is reset by the
igbvf_watchdog_task in the VM.

Given the right sequence of VF messages and mailbox timeouts, this
condition can go on ad infinitum.

Modify the PF mailbox read method to take an 'unlock' argument that
optionally leaves the mailbox locked by the PF after reading the VF
message.  This ensures another VF message is not posted to the mailbox
until after the PF has completed processing the VF message and written
its reply.

Signed-off-by: Greg Edwards 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/e1000_hw.h  |  3 ++-
 drivers/net/ethernet/intel/igb/e1000_mbx.c | 18 --
 drivers/net/ethernet/intel/igb/e1000_mbx.h |  3 ++-
 drivers/net/ethernet/intel/igb/igb_main.c  | 14 ++
 4 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_hw.h 
b/drivers/net/ethernet/intel/igb/e1000_hw.h
index 6076f258a0a5..6ea9f702ba0f 100644
--- a/drivers/net/ethernet/intel/igb/e1000_hw.h
+++ b/drivers/net/ethernet/intel/igb/e1000_hw.h
@@ -491,7 +491,8 @@ struct e1000_fc_info {
 
 struct e1000_mbx_operations {
s32 (*init_params)(struct e1000_hw *hw);
-   s32 (*read)(struct e1000_hw *hw, u32 *msg, u16 size, u16 mbx_id);
+   s32 (*read)(struct e1000_hw *hw, u32 *msg, u16 size, u16 mbx_id,
+   bool unlock);
s32 (*write)(struct e1000_hw *hw, u32 *msg, u16 size, u16 mbx_id);
s32 (*read_posted)(struct e1000_hw *hw, u32 *msg, u16 size, u16 mbx_id);
s32 (*write_posted)(struct e1000_hw *hw, u32 *msg, u16 size,
diff --git a/drivers/net/ethernet/intel/igb/e1000_mbx.c 
b/drivers/net/ethernet/intel/igb/e1000_mbx.c
index 6aa44723507b..bffd58f7b2a1 100644
--- a/drivers/net/ethernet/intel/igb/e1000_mbx.c
+++ b/drivers/net/ethernet/intel/igb/e1000_mbx.c
@@ -32,7 +32,8 @@
  *
  *  returns SUCCESS if it successfully read message from buffer
  **/
-s32 igb_read_mbx(struct e1000_hw *hw, u32 *msg, u16 size, u16 mbx_id)
+s32 igb_read_mbx(struct e1000_hw *hw, u32 *msg, u16 size, u16 mbx_id,
+bool unlock)
 {
struct e1000_mbx_info *mbx = >mbx;
s32 ret_val = -E1000_ERR_MBX;
@@ -42,7 +43,7 @@ s32 igb_read_mbx(struct e1000_hw *hw, u32 *msg, u16 size, u16 
mbx_id)
size = mbx->size;
 
if (mbx->ops.read)
-   ret_val = mbx->ops.read(hw, msg, size, mbx_id);
+   ret_val = mbx->ops.read(hw, msg, size, mbx_id, unlock);
 
return ret_val;
 }
@@ -222,7 +223,7 @@ static s32 igb_read_posted_mbx(struct e1000_hw *hw, u32 
*msg, u16 size,
ret_val = igb_poll_for_msg(hw, mbx_id);
 
if (!ret_val)
-   ret_val = mbx->ops.read(hw, msg, size, mbx_id);
+   ret_val = mbx->ops.read(hw, msg, size, mbx_id, true);
 out:
return ret_val;
 }
@@ -423,13 +424,14 @@ static s32 igb_write_mbx_pf(struct e1000_hw *hw, u32 
*msg, u16 size,
  *  @msg: The message buffer
  *  @size: Length of buffer
  *  @vf_number: the VF index
+ *  @unlock: unlock the mailbox when done?
  *
  *  This function copies a message from the mailbox buffer to the caller's
  *  memory buffer.  The presumption is that the caller knows that there was
  *  a message due to a VF request so no polling for message is needed.
  **/
 static s32 igb_read_mbx_pf(struct e1000_hw *hw, u32 *msg, u16 size,
-  u16 vf_number)
+  u16 vf_number, bool unlock)
 {
s32 ret_val;
u16 i;
@@ -443,8 +445,12 @@ static s32 igb_read_mbx_pf(struct e1000_hw *hw, u32 *msg, 
u16 size,
for (i = 0; i < size; i++)
msg[i] = array_rd32(E1000_VMBMEM(vf_number), i);
 
-   /* Acknowledge the message and release buffer */
-   wr32(E1000_P2VMAILBOX(vf_number), E1000_P2VMAILBOX_ACK);
+   /* Acknowledge the message and release mailbox lock (or not) */
+   if (unlock)
+   wr32(E1000_P2VMAILBOX(vf_number), E1000_P2VMAILBOX_ACK);
+   else
+   wr32(E1000_P2VMAILBOX(vf_number),
+E1000_P2VMAILBOX_ACK | E1000_P2VMAILBOX_PFU);
 
/* update stats */
hw->mbx.stats.msgs_rx++;
diff --git a/drivers/net/ethernet/intel/igb/e1000_mbx.h 
b/drivers/net/ethernet/intel/igb/e1000_mbx.h
index a98c5dc60afd..a62b08e1572e 100644
--- a/drivers/net/ethernet/intel/igb/e1000_mbx.h
+++ b/drivers/net/ethernet/intel/igb/e1000_mbx.h
@@

[net-next 08/12] e1000e: Initial Support for IceLake

2017-08-09 Thread Jeff Kirsher

From: Sasha Neftin 

i219 (8) and i219 (9) are the next LOM generations that will be available
on the next Intel Client platform (IceLake).
This patch provides the initial support for these devices

Signed-off-by: Sasha Neftin 
Reviewed-by: Raanan Avargil 
Reviewed-by: Dima Ruinskiy 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/e1000e/hw.h | 4 
 drivers/net/ethernet/intel/e1000e/netdev.c | 4 
 2 files changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/intel/e1000e/hw.h 
b/drivers/net/ethernet/intel/e1000e/hw.h
index 66bd5060a65b..d803b1a12349 100644
--- a/drivers/net/ethernet/intel/e1000e/hw.h
+++ b/drivers/net/ethernet/intel/e1000e/hw.h
@@ -100,6 +100,10 @@ struct e1000_hw;
 #define E1000_DEV_ID_PCH_CNP_I219_V6   0x15BE
 #define E1000_DEV_ID_PCH_CNP_I219_LM7  0x15BB
 #define E1000_DEV_ID_PCH_CNP_I219_V7   0x15BC
+#define E1000_DEV_ID_PCH_ICP_I219_LM8  0x15DF
+#define E1000_DEV_ID_PCH_ICP_I219_V8   0x15E0
+#define E1000_DEV_ID_PCH_ICP_I219_LM9  0x15E1
+#define E1000_DEV_ID_PCH_ICP_I219_V9   0x15E2
 
 #define E1000_REVISION_4   4
 
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 2dcb5463d9b8..327dfe5bedc0 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -7544,6 +7544,10 @@ static const struct pci_device_id e1000_pci_tbl[] = {
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_CNP_I219_V6), board_pch_cnp },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_CNP_I219_LM7), board_pch_cnp },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_CNP_I219_V7), board_pch_cnp },
+   { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_ICP_I219_LM8), board_pch_cnp },
+   { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_ICP_I219_V8), board_pch_cnp },
+   { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_ICP_I219_LM9), board_pch_cnp },
+   { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_ICP_I219_V9), board_pch_cnp },
 
{ 0, 0, 0, 0, 0, 0, 0 } /* terminate list */
 };
-- 
2.13.3

[net-next 00/12][pull request] 1GbE Intel Wired LAN Driver Updates 2017-08-08

2017-08-09 Thread Jeff Kirsher

This series contains updates to e1000e and igb/igbvf.

Gangfeng Huang fixes an issue with receive network flow classification,
where igb_nfc_filter_exit() was not being called in igb_down() which
would cause the filter tables to "fill up" if a user where to change
the adapter settings (such as speed) which requires a reset of the
adapter.

Cliff Spradlin fixes a timestamping issue, where igb was allowing requests
for hardware timestamping even if it was not configured for hardware
transmit timestamping.

Corinna Vinschen removes the error message that there was an "unexpected
SYS WRAP", when it is actually expected.  So remove the message to not
confuse users.

Greg Edwards provides several patches for the mailbox interface between
the PF and VF drivers.  Added a mailbox unlock method to be used to unlock
the PF/VF mailbox by the PF.  Added a lock around the VF mailbox ops to
prevent the VF from sending another message while the PF is still
processing the previous message.  Fixed a "scheduling while atomic" issue
by changing msleep() to mdelay().

Sasha adds support for the next LOM generations i219 (v8 & v9) which
will be available in the next Intel client platform IceLake.

John Linville adds support for a Broadcom PHY to the igb driver, since
there are designs out in the world which use the igb MAC and a third
party PHY.  This allows the driver to load and function as expected on
these designs.

The following are changes since commit 53b948356554376ec6f89016376825d48bf396c3:
  net: vrf: Add extack messages for newlink failures
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 1GbE

Cliff Spradlin (1):
  igb: protect TX timestamping from API misuse

Corinna Vinschen (1):
  igb: Remove incorrect "unexpected SYS WRAP" log message

Gangfeng Huang (1):
  igb: Fix error of RX network flow classification

Greg Edwards (6):
  igb: add argument names to mailbox op function declarations
  igb: expose mailbox unlock method
  igb: do not drop PF mailbox lock after read of VF message
  igbvf: add lock around mailbox ops
  igbvf: after mailbox write, wait for reply
  igbvf: convert msleep to mdelay in atomic context

Gustavo A R Silva (1):
  e1000e: add check on e1e_wphy() return value

John W Linville (1):
  igb: support BCM54616 PHY

Sasha Neftin (1):
  e1000e: Initial Support for IceLake

 drivers/net/ethernet/intel/e1000e/hw.h |  4 ++
 drivers/net/ethernet/intel/e1000e/ich8lan.c|  2 +
 drivers/net/ethernet/intel/e1000e/netdev.c |  4 ++
 drivers/net/ethernet/intel/igb/e1000_82575.c   |  6 +++
 drivers/net/ethernet/intel/igb/e1000_defines.h |  1 +
 drivers/net/ethernet/intel/igb/e1000_hw.h  | 18 
 drivers/net/ethernet/intel/igb/e1000_mbx.c | 57 +++---
 drivers/net/ethernet/intel/igb/e1000_mbx.h | 14 ---
 drivers/net/ethernet/intel/igb/igb_main.c  | 23 +++
 drivers/net/ethernet/intel/igbvf/ethtool.c |  4 ++
 drivers/net/ethernet/intel/igbvf/mbx.c |  4 ++
 drivers/net/ethernet/intel/igbvf/netdev.c  | 47 +
 drivers/net/ethernet/intel/igbvf/vf.c  | 12 --
 drivers/net/ethernet/intel/igbvf/vf.h  |  1 +
 14 files changed, 166 insertions(+), 31 deletions(-)

-- 
2.13.3

[net-next 12/12] igb: support BCM54616 PHY

2017-08-09 Thread Jeff Kirsher

From: John W Linville 

The management port on an Edgecore AS7712-32 switch uses an igb MAC, but
it uses a BCM54616 PHY. Without a patch like this, loading the igb
module produces dmesg output like this:

[3.439125] igb: Copyright (c) 2007-2014 Intel Corporation.
[3.439866] igb: probe of :00:14.0 failed with error -2

Signed-off-by: John W Linville 
Cc: Jeff Kirsher 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/e1000_82575.c   | 6 ++
 drivers/net/ethernet/intel/igb/e1000_defines.h | 1 +
 drivers/net/ethernet/intel/igb/e1000_hw.h  | 1 +
 3 files changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/e1000_82575.c 
b/drivers/net/ethernet/intel/igb/e1000_82575.c
index 4a50870e0fa7..c37cc8bccf47 100644
--- a/drivers/net/ethernet/intel/igb/e1000_82575.c
+++ b/drivers/net/ethernet/intel/igb/e1000_82575.c
@@ -340,6 +340,9 @@ static s32 igb_init_phy_params_82575(struct e1000_hw *hw)
phy->ops.set_d3_lplu_state = igb_set_d3_lplu_state_82580;
phy->ops.force_speed_duplex = igb_phy_force_speed_duplex_m88;
break;
+   case BCM54616_E_PHY_ID:
+   phy->type = e1000_phy_bcm54616;
+   break;
default:
ret_val = -E1000_ERR_PHY;
goto out;
@@ -1659,6 +1662,9 @@ static s32 igb_setup_copper_link_82575(struct e1000_hw 
*hw)
case e1000_phy_82580:
ret_val = igb_copper_link_setup_82580(hw);
break;
+   case e1000_phy_bcm54616:
+   ret_val = 0;
+   break;
default:
ret_val = -E1000_ERR_PHY;
break;
diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h 
b/drivers/net/ethernet/intel/igb/e1000_defines.h
index d8517779439b..1de82f247312 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -889,6 +889,7 @@
 #define I210_I_PHY_ID0x01410C00
 #define M88E1543_E_PHY_ID0x01410EA0
 #define M88E1512_E_PHY_ID0x01410DD0
+#define BCM54616_E_PHY_ID0x03625D10
 
 /* M88E1000 Specific Registers */
 #define M88E1000_PHY_SPEC_CTRL 0x10  /* PHY Specific Control Register */
diff --git a/drivers/net/ethernet/intel/igb/e1000_hw.h 
b/drivers/net/ethernet/intel/igb/e1000_hw.h
index 6ea9f702ba0f..6c9485ab4b57 100644
--- a/drivers/net/ethernet/intel/igb/e1000_hw.h
+++ b/drivers/net/ethernet/intel/igb/e1000_hw.h
@@ -128,6 +128,7 @@ enum e1000_phy_type {
e1000_phy_ife,
e1000_phy_82580,
e1000_phy_i210,
+   e1000_phy_bcm54616,
 };
 
 enum e1000_bus_type {
-- 
2.13.3

[net-next 11/12] igbvf: convert msleep to mdelay in atomic context

2017-08-09 Thread Jeff Kirsher

From: Greg Edwards 

This fixes a "scheduling while atomic" splat seen with
CONFIG_DEBUG_ATOMIC_SLEEP enabled.

Signed-off-by: Greg Edwards 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igbvf/vf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igbvf/vf.c 
b/drivers/net/ethernet/intel/igbvf/vf.c
index 1d3aa9adcaa8..9577ccf4b26a 100644
--- a/drivers/net/ethernet/intel/igbvf/vf.c
+++ b/drivers/net/ethernet/intel/igbvf/vf.c
@@ -149,7 +149,7 @@ static s32 e1000_reset_hw_vf(struct e1000_hw *hw)
msgbuf[0] = E1000_VF_RESET;
mbx->ops.write_posted(hw, msgbuf, 1);
 
-   msleep(10);
+   mdelay(10);
 
/* set our "perm_addr" based on info provided by PF */
ret_val = mbx->ops.read_posted(hw, msgbuf, 3);
-- 
2.13.3

1 2 3 >

1 - 100 of 298 matches

Mail list logo