date:20180829

Re: [PATCH 0/3] IB/ipoib: Use dev_port to disambiguate port numbers

2018-08-29 Thread Leon Romanovsky

On Wed, Aug 29, 2018 at 12:01:14AM +0300, Arseny Maslennikov wrote:
> Pre-3.15 userspace had trouble distinguishing different ports
> of a NIC on a single PCI bus/device/function. To solve this,
> a sysfs field `dev_port' was introduced quite a while ago
> (commit v3.14-rc3-739-g3f85944fe207), and some relevant device
> drivers were fixed to use it, but not in case of IPoIB.
>
> The convention for some reason never got documented in the kernel, but
> was immediately adopted by userspace (notably udev[1][2], biosdevname[3])
>
> 3/3 documents the sysfs field — that's why I'm CC-ing netdev.
>
> This series was tested on and applies to 4.19-rc1.
>
> [1] https://lists.freedesktop.org/archives/systemd-devel/2014-June/020788.html
> [2] https://lists.freedesktop.org/archives/systemd-devel/2014-July/020804.html
> [3] 
> https://github.com/CloudAutomationNTools/biosdevname/blob/c795d51dd93a5309652f0d635f12a3ecfabfaa72/src/eths.c#L38
>
> Arseny Maslennikov (3):
>   IB/ipoib: Use dev_port to expose network interface port numbers
>   IB/ipoib: Stop using dev_id to expose port numbers

I completely agree with previous Yuval's comment, it makes no sense to
start separate commits for every line.

Please decide what is best and right behavior and do it, instead of
pushing it up to be the maintainer's problem.

Thanks

>   Documentation/ABI: document /sys/class/net/*/dev_port
>
>  Documentation/ABI/testing/sysfs-class-net | 10 ++
>  drivers/infiniband/ulp/ipoib/ipoib_main.c |  2 +-
>  2 files changed, 11 insertions(+), 1 deletion(-)
>
> --
> 2.18.0
>


signature.asc
Description: PGP signature

Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute

2018-08-29 Thread Jiri Pirko

Wed, Aug 29, 2018 at 05:24:28PM CEST, step...@networkplumber.org wrote:
>On Wed, 29 Aug 2018 09:18:55 +0200
>Jiri Pirko  wrote:
>
>> Tue, Aug 28, 2018 at 10:58:11PM CEST, mar...@holtmann.org wrote:
>> >The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs
>> >file as DEVTYPE= information. To avoid any kind of race conditions
>> >between netlink messages and reading from sysfs, it is useful to add the
>> >same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK
>> >messages.
>> >
>> >For network managing daemons that have to classify ARPHRD_ETHER network
>> >devices into different types (like Wireless LAN, Bluetooth etc.), this
>> >avoids the extra round trip to sysfs and parsing of the uevent file.
>> >
>> >Signed-off-by: Marcel Holtmann 
>> >---
>> > include/uapi/linux/if_link.h |  2 ++
>> > net/core/rtnetlink.c | 12 
>> > 2 files changed, 14 insertions(+)
>> >
>> >diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>> >index 43391e2d1153..781294972bb4 100644
>> >--- a/include/uapi/linux/if_link.h
>> >+++ b/include/uapi/linux/if_link.h
>> >@@ -166,6 +166,8 @@ enum {
>> >IFLA_NEW_IFINDEX,
>> >IFLA_MIN_MTU,
>> >IFLA_MAX_MTU,
>> >+   IFLA_DEVTYPE,   /* Name value from SET_NETDEV_DEVTYPE */  
>> 
>> This is not something netdev-related. dev->dev.type is struct device_type.
>> This is a generic "device" thing. Incorrect to expose over
>> netdev-specific API. Please use "device" API for this.
>
>There is no device API in netlink. The whole point of this patch is to
>do it in one message. It might be a performance optimization, but I can't
>see how it could be a race condition. Devices set type before registering.

I understand. My point is to avoid exposing generic "struct device"
values over rtnetlink. It mixes apples and oranges.

Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute

2018-08-29 Thread Jiri Pirko

Wed, Aug 29, 2018 at 05:23:58PM CEST, mar...@holtmann.org wrote:
>Hi Jiri,
>
>>> The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs
>>> file as DEVTYPE= information. To avoid any kind of race conditions
>>> between netlink messages and reading from sysfs, it is useful to add the
>>> same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK
>>> messages.
>>> 
>>> For network managing daemons that have to classify ARPHRD_ETHER network
>>> devices into different types (like Wireless LAN, Bluetooth etc.), this
>>> avoids the extra round trip to sysfs and parsing of the uevent file.
>>> 
>>> Signed-off-by: Marcel Holtmann 
>>> ---
>>> include/uapi/linux/if_link.h |  2 ++
>>> net/core/rtnetlink.c | 12 
>>> 2 files changed, 14 insertions(+)
>>> 
>>> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>>> index 43391e2d1153..781294972bb4 100644
>>> --- a/include/uapi/linux/if_link.h
>>> +++ b/include/uapi/linux/if_link.h
>>> @@ -166,6 +166,8 @@ enum {
>>> IFLA_NEW_IFINDEX,
>>> IFLA_MIN_MTU,
>>> IFLA_MAX_MTU,
>>> +   IFLA_DEVTYPE,   /* Name value from SET_NETDEV_DEVTYPE */
>> 
>> This is not something netdev-related. dev->dev.type is struct device_type.
>> This is a generic "device" thing. Incorrect to expose over
>> netdev-specific API. Please use "device" API for this.
>
>it is not just "device" related since this is a sub-classification of 
>ARPHRD_ETHER type as a wrote above. Don't get hang up that this information is 
>part of struct device. The dev->dev.type contains strings like "wlan", 
>"bluetooth", "wimax", "gadget" etc. so that you can tell what kind of Ethernet 
>card you have.

I understand. But using dev->dev.type for that purpose seems like an
abuse. If this is subtype of netdev, it should not be stored in parent
device. I believe that you need to re-introduce this as a part of struct
net_device - preferably an enum - and that you can expose over rtnl (and
net-sysfs).

>
>We can revive the patches for adding this information as IFLA_INFO_KIND, but 
>last time they where reverted since NetworkManager did all the wrong decisions 
>based on that. That part is fixed now and we can put that back and declare the 
>sub-type classification of Ethernet device in two places. Or we just take the 
>information that we added a long time ago for exactly this sub-classification 
>and provide them to userspace via RTNL.

IFLA_INFO_KIND serves for different purpose. It is for drivers that
register rtnl_link_ops. I don't think it would be wise to mix it with
this.

Btw, could you wrap the sentences at 72 cols? Would be easier to read
that way.


>
>Regards
>
>Marcel
>

Re: [PATCH net v2 0/2] net_sched: reject unknown tcfa_action values

2018-08-29 Thread David Miller

From: Paolo Abeni 
Date: Wed, 29 Aug 2018 10:22:32 +0200

> As agreed some time ago, this changeset reject unknown tcfa_action values,
> instead of changing such values under the hood.
> 
> A tdc test is included to verify the new behavior.
> 
> v1 -> v2:
>  - helper is now static and renamed according to act_* convention
>  - updated extack message, according to the new behavior

Series applied, thank you.

Re: [PATCH v2] net: mvpp2: initialize port of_node pointer

2018-08-29 Thread David Miller

From: Baruch Siach 
Date: Wed, 29 Aug 2018 09:44:39 +0300

> Without a valid of_node in struct device we can't find the mvpp2 port
> device by its DT node. Specifically, this breaks
> of_find_net_device_by_node().
> 
> For example, the Armada 8040 based Clearfog GT-8K uses Marvell 88E6141
> switch connected to the _eth2 port:
> 
> _mdio {
>   ...
> 
>   switch0: switch0@4 {
>   compatible = "marvell,mv88e6085";
>   ...
> 
>   ports {
>   ...
> 
>   port@5 {
>   reg = <5>;
>   label = "cpu";
>   ethernet = <_eth2>;
>   };
>   };
>   };
> };
> 
> Without this patch, dsa_register_switch() returns -EPROBE_DEFER because
> of_find_net_device_by_node() can't find the device_node of the _eth2
> device.
> 
> Reviewed-by: Andrew Lunn 
> Signed-off-by: Baruch Siach 

Applied.

Re: [PATCH][net-next] vxlan: reduce dirty cache line in vxlan_find_mac

2018-08-29 Thread David Miller

From: Li RongQing 
Date: Wed, 29 Aug 2018 11:52:10 +0800

> vxlan_find_mac() unconditionally set f->used for every packet,
> this causes a cache miss for every packet, since remote, hlist
> and used of vxlan_fdb share the same cache line, which are
> accessed when send every packets.
> 
> so f->used is set only if not equal to jiffies, to reduce dirty
> cache line times, this gives 3% speed-up with small packets.
> 
> Signed-off-by: Zhang Yu 
> Signed-off-by: Li RongQing 

Applied.

Re: [PATCH 3/4] clk: x86: Stop marking clocks as CLK_IS_CRITICAL

2018-08-29 Thread Stephen Boyd

Quoting Hans de Goede (2018-08-27 07:31:59)
> Commit d31fd43c0f9a ("clk: x86: Do not gate clocks enabled by the
> firmware"), which added the code to mark clocks as CLK_IS_CRITICAL, causes
> all unclaimed PMC clocks on Cherry Trail devices to be on all the time,
> resulting on the device not being able to reach S0i3 when suspended.
> 
> The reason for this commit is that on some Bay Trail / Cherry Trail devices
> the r8169 ethernet controller uses pmc_plt_clk_4. Now that the clk-pmc-atom
> driver exports an "ether_clk" alias for pmc_plt_clk_4 and the r8169 driver
> has been modified to get and enable this clock (if present) the marking of
> the clocks as CLK_IS_CRITICAL is no longer necessary.
> 
> This commit removes the CLK_IS_CRITICAL marking, fixing Cherry Trail
> devices not being able to reach S0i3 greatly decreasing their battery
> drain when suspended.
> 
> Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=193891#c102
> Cc: Johannes Stezenbach 
> Cc: Carlo Caione 
> Signed-off-by: Hans de Goede 
> ---

Acked-by: Stephen Boyd

Re: [PATCH net-next 0/4] liquidio: improve soft command/response handling

2018-08-29 Thread David Miller

From: Felix Manlunas 
Date: Tue, 28 Aug 2018 18:50:58 -0700

> From: Weilin Chang 
> 
> Change soft command handling to fix the possible race condition when the
> process handles a response of a soft command that was already freed by an
> application which got timeout for this request.

Series applied, thank you.

Re: [net-next 02/15] i40e: move ethtool stats boiler plate code to i40e_ethtool_stats.h

2018-08-29 Thread David Miller

From: Jeff Kirsher 
Date: Wed, 29 Aug 2018 15:48:21 -0700

> diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h 
> b/drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h
> new file mode 100644
> index ..0290ade7494b
> --- /dev/null
> +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h
> @@ -0,0 +1,221 @@
...
> +/**
> + * __i40e_add_stat_strings - copy stat strings into ethtool buffer
> + * @p: ethtool supplied buffer
> + * @stats: stat definitions array
> + * @size: size of the stats array
> + *
> + * Format and copy the strings described by stats into the buffer pointed at
> + * by p.
> + **/
> +static void __i40e_add_stat_strings(u8 **p, const struct i40e_stats stats[],
> + const unsigned int size, ...)

Need to be marked inline.

Re: [PATCH net-next,v5] net/tls: Calculate nsg for zerocopy path without skb_cow_data.

2018-08-29 Thread David Miller

From: Doron Roberts-Kedes 
Date: Tue, 28 Aug 2018 16:33:57 -0700

> decrypt_skb fails if the number of sg elements required to map it
> is greater than MAX_SKB_FRAGS. nsg must always be calculated, but
> skb_cow_data adds unnecessary memcpy's for the zerocopy case.
> 
> The new function skb_nsg calculates the number of scatterlist elements
> required to map the skb without the extra overhead of skb_cow_data.
> This patch reduces memcpy by 50% on my encrypted NBD benchmarks.
> 
> Reported-by: Vakul Garg 
> Reviewed-by: Vakul Garg 
> Tested-by: Vakul Garg 
> Signed-off-by: Doron Roberts-Kedes 

Applied, thank you.

Re: [PATCH net-next 1/2] ip: fail fast on IP defrag errors

2018-08-29 Thread David Miller

From: Peter Oskolkov 
Date: Tue, 28 Aug 2018 11:36:19 -0700

> The current behavior of IP defragmentation is inconsistent:
> - some overlapping/wrong length fragments are dropped without
>   affecting the queue;
> - most overlapping fragments cause the whole frag queue to be dropped.
> 
> This patch brings consistency: if a bad fragment is detected,
> the whole frag queue is dropped. Two major benefits:
> - fail fast: corrupted frag queues are cleared immediately, instead of
>   by timeout;
> - testing of overlapping fragments is now much easier: any kind of
>   random fragment length mutation now leads to the frag queue being
>   discarded (IP packet dropped); before this patch, some overlaps were
>   "corrected", with tests not seeing expected packet drops.
> 
> Note that in one case (see "if (end&7)" conditional) the current
> behavior is preserved as there are concerns that this could be
> legitimate padding.
> 
> Signed-off-by: Peter Oskolkov 
> Reviewed-by: Eric Dumazet 
> Reviewed-by: Willem de Bruijn 

Applied.

Re: [PATCH net-next 2/2] selftests/net: add ip_defrag selftest

2018-08-29 Thread David Miller

From: Peter Oskolkov 
Date: Tue, 28 Aug 2018 11:36:20 -0700

> This test creates a raw IPv4 socket, fragments a largish UDP
> datagram and sends the fragments out of order.
> 
> Then repeats in a loop with different message and fragment lengths.
> 
> Then does the same with overlapping fragments (with overlapping
> fragments the expectation is that the recv times out).
> 
> Tested:
> 
> root@# time ./ip_defrag.sh
> ipv4 defrag
> PASS
> ipv4 defrag with overlaps
> PASS
> 
> real1m7.679s
> user0m0.628s
> sys 0m2.242s
> 
> A similar test for IPv6 is to follow.
> 
> Signed-off-by: Peter Oskolkov 
> Reviewed-by: Willem de Bruijn 

Applied.

Re: [PATCH net-next] liquidio: fix race condition in instruction completion processing

2018-08-29 Thread David Miller

From: Felix Manlunas 
Date: Tue, 28 Aug 2018 11:32:55 -0700

> From: Rick Farrington 
> 
> In lio_enable_irq, the pkt_in_done count register was being cleared to
> zero.  However, there could be some completed instructions which were not
> yet processed due to budget and limit constraints.
> So, only write this register with the number of actual completions
> that were processed.
> 
> Signed-off-by: Rick Farrington 
> Signed-off-by: Felix Manlunas 

Applied.

Re: [PATCH net-next] liquidio: remove unnecessary delay when processing IQ responses

2018-08-29 Thread David Miller

From: Felix Manlunas 
Date: Tue, 28 Aug 2018 11:19:54 -0700

> From: Rick Farrington 
> 
> Signed-off-by: Rick Farrington 
> Signed-off-by: Felix Manlunas 

Applied.

Re: [PATCH net-next] net: thunderbolt: Convert to use SPDX identifier

2018-08-29 Thread David Miller

From: Mika Westerberg 
Date: Tue, 28 Aug 2018 19:58:43 +0300

> This gets rid of the licence boilerblate in favor of SPDX identifier
> which only takes a single line comment.
> 
> Signed-off-by: Mika Westerberg 

Applied.

Re: [PATCH net 0/3] ipv6: fix error path of inet6_init()

2018-08-29 Thread David Miller

From: Sabrina Dubroca 
Date: Tue, 28 Aug 2018 13:40:50 +0200

> The error path of inet6_init() can trigger multiple kernel panics,
> mostly due to wrong ordering of cleanups. This series fixes those
> issues.

Series applied, thank you.

Re: [RFC PATCH v2 bpf-next 0/2] verifier liveness simplification

2018-08-29 Thread Alexei Starovoitov

On Wed, Aug 22, 2018 at 08:00:46PM +0100, Edward Cree wrote:
> The first patch is a simplification of register liveness tracking by using
>  a separate parentage chain for each register and stack slot, thus avoiding
>  the need for logic to handle callee-saved registers when applying read
>  marks.  In the future this idea may be extended to form use-def chains.
> The second patch adds information about misc/zero data on the stack to the
>  state dumps emitted to the log at various points; this information was
>  found essential in debugging the first patch, and may be useful elsewhere.

I think this set is a great improvement in liveness tracking,
so depsite seeing the issues I applied it to bpf-next.

I think it's a better base to continue debugging.
In particular:
1. we have instability issue in the verifier.
 from time to time the verifier goes to process extra 7 instructions on one
 of the cilium tests. This was happening before and after this set.
2. there is a nice improvement in number of processed insns with this set,
 but the difference I cannot explain, hence it has to debugged.
 In theory the liveness rewrite shouldn't cause the difference in processed 
insns.

If not for the issue 1 I would argue that the issue 2 means that the set has to
be debugged before going in, but since the verifier is unstable it's better
to debug from this base with this patch set applied (because it greatly
simplifies liveness and adds additional debug in patch2)
and once we figure out the issue 1, I hope, the issue 2 will be resolved 
automatically.

The numbers on cilium bpf programs:
  before1before2   after1  after2
bpf_lb-DLB_L3.o 2003  2003  20032003
bpf_lb-DLB_L4.o 3173  3173  31733173
bpf_lb-DUNKNOWN.o   1080  1080  10801080
bpf_lxc-DDROP_ALL.o 29587 29587 29587   29587
bpf_lxc-DUNKNOWN.o  37204 37211 36926   36933
bpf_netdev.o11283 11283 11188   11188
bpf_overlay.o   6679  6679  66796679
bpf_lcx_jit.o   39657 39657 39561   39561

notice how bpf_lxc-DUNKNOWN.o fluctuates with +7 before and after. That is 
issue 1.
bpf_lxc-DUNKNOWN.o, bpf_netdev.o, and bpf_lcx_jit.o have small improvements.
That is issue 2.

To reproduce above numbers clone this repo: 
https://github.com/4ast/bpf_cilium_test
and run .sh. The .o files in there are pretty old cilium bpf programs.
I kept them frozen and didn't recompile for more than a year to keep stable
base line and track the progress of the verifier in 'processed insns'.

Thanks

[PATCH net-next] net/ncsi: remove duplicated include from ncsi-netlink.c

2018-08-29 Thread YueHaibing

Remove duplicated include.

Signed-off-by: YueHaibing 
---
 net/ncsi/ncsi-netlink.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ncsi/ncsi-netlink.c b/net/ncsi/ncsi-netlink.c
index 45f33d6..32cb775 100644
--- a/net/ncsi/ncsi-netlink.c
+++ b/net/ncsi/ncsi-netlink.c
@@ -12,7 +12,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include

Re: [PATCH 1/2] dt-bindings: net: cpsw: Document cpsw-phy-sel usage but prefer phandle

2018-08-29 Thread Andrew Lunn

> In the long run cpsw should be really treated as an
> interconnect instance with it's control module providing
> standard Linux framework services such as clock /
> regulator / phy / pinctrl / iio whatever for the other
> modules.

Some of us have been applying pressure for a new driver. This sounds
like another argument for such a re-write.

   Andrew

Re: [PATCH net] net/sched: act_pedit: fix dump of extended layered op

2018-08-29 Thread David Miller

From: Davide Caratti 
Date: Mon, 27 Aug 2018 22:56:22 +0200

> in the (rare) case of failure in nla_nest_start(), missing NULL checks in
> tcf_pedit_key_ex_dump() can make the following command
> 
>  # tc action add action pedit ex munge ip ttl set 64
> 
> dereference a NULL pointer:
 ...
> Like it's done for other TC actions, give up dumping pedit rules and return
> an error if nla_nest_start() returns NULL.
> 
> Fixes: 71d0ed7079df ("net/act_pedit: Support using offset relative to the 
> conventional network headers")
> Signed-off-by: Davide Caratti 

Applied and queued up for -stable, thanks.

Re: [PATCH] r8169: set RxConfig after tx/rx is enabled for RTL8169sb/8110sb devices

2018-08-29 Thread David Miller

From: Azat Khuzhin 
Date: Sun, 26 Aug 2018 17:03:09 +0300

> I have two Ethernet adapters:
>   r8169 :03:01.0 eth0: RTL8169sb/8110sb, 00:14:d1:14:2d:49, XID 1000, 
> IRQ 18
>   r8169 :01:00.0 eth0: RTL8168e/8111e, 64:66:b3:11:14:5d, XID 2c20, 
> IRQ 30
> And after upgrading from linux 4.15 [1] to linux 4.18+ [2] RTL8169sb failed to
> receive any packets. tcpdump shows a lot of checksum mismatch.
> 
>   [1]: a0f79386a4968b4925da6db2d1daffd0605a4402
>   [2]: 0519359784328bfa92bf0931bf0cff3b58c16932 (4.19 merge window opened)
> 
> I started bisecting and the found that [3] breaks it. According to [4]:
>   "For 8110S, 8110SB, and 8110SC series, the initial value of RxConfig
>   needs to be set after the tx/rx is enabled."
> So I moved rtl_init_rxcfg() after enabling tx/rs and now my adapter works
> (RTL8168e works too).
> 
>   [3]: 3559d81e76bfe3803e89f2e04cf6ef7ab4f3aace
>   [4]: e542a2269f232d61270ceddd42b73a4348dee2bb ("r8169: adjust the RxConfig
> settings.")
> 
> Also drop "rx" from rtl_set_rx_tx_config_registers(), since it does nothing
> with it already.
> 
> Fixes: 3559d81e76bfe3803e89f2e04cf6ef7ab4f3aace ("r8169: simplify
> rtl_hw_start_8169")
> 
> Cc: Heiner Kallweit 
> Cc: David S. Miller 
> Cc: netdev@vger.kernel.org
> Cc: Realtek linux nic maintainers 
> Signed-off-by: Azat Khuzhin 

Applied and queued up for -stable.

Re: [Patch net] tipc: fix a missing rhashtable_walk_exit()

2018-08-29 Thread David Miller

From: Cong Wang 
Date: Thu, 23 Aug 2018 16:19:44 -0700

> rhashtable_walk_exit() must be paired with rhashtable_walk_enter().
> 
> Fixes: 40f9f4397060 ("tipc: Fix tipc_sk_reinit race conditions")
> Cc: Herbert Xu 
> Cc: Ying Xue 
> Signed-off-by: Cong Wang 

Applied and queued up for -stable, thanks Cong.

RE: [PATCH] hv_netvsc: Fix a deadlock by getting rtnl_lock earlier in netvsc_probe()

2018-08-29 Thread Dexuan Cui

> From: David Miller 
> Sent: Wednesday, August 29, 2018 17:49
> 
> From: Dexuan Cui 
> Date: Wed, 22 Aug 2018 21:20:03 +
> 
> > ---
> >  drivers/net/hyperv/netvsc_drv.c | 11 ++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> >
> > FYI: these are the related 3 paths which show the deadlock:
> 
> This incredibly useful information belongs in the commit log
> message, and therefore before the --- and signoffs.

Hi David,
I was afraid the call-traces are too detailed. :-)

Can you please move the info to before the --- line?

Or, should I resend the patch with the commit log updated?

Thanks,
-- Dexuan

Re: [PATCH net] vti6: remove !skb->ignore_df check from vti6_xmit()

2018-08-29 Thread David Miller

From: Alexey Kodanev 
Date: Thu, 23 Aug 2018 19:49:54 +0300

> Before the commit d6990976af7c ("vti6: fix PMTU caching and reporting
> on xmit") '!skb->ignore_df' check was always true because the function
> skb_scrub_packet() was called before it, resetting ignore_df to zero.
> 
> In the commit, skb_scrub_packet() was moved below, and now this check
> can be false for the packet, e.g. when sending it in the two fragments,
> this prevents successful PMTU updates in such case. The next attempts
> to send the packet lead to the same tx error. Moreover, vti6 initial
> MTU value relies on PMTU adjustments.
> 
> This issue can be reproduced with the following LTP test script:
> udp_ipsec_vti.sh -6 -p ah -m tunnel -s 2000
> 
> Fixes: ccd740cbc6e0 ("vti6: Add pmtu handling to vti6_xmit.")
> Signed-off-by: Alexey Kodanev 

Applied and queued up for -stable, thank you.

Re: [PATCH 1/2] dt-bindings: net: cpsw: Document cpsw-phy-sel usage but prefer phandle

2018-08-29 Thread Tony Lindgren

* Grygorii Strashko  [180830 00:12]:
> Hi Tony,
> 
> On 08/29/2018 10:00 AM, Tony Lindgren wrote:
> > The current cpsw usage for cpsw-phy-sel is undocumented but is used for
> > all the boards using cpsw. And cpsw-phy-sel is not really a child of
> > the cpsw device, it lives in the system control module instead.
> > 
> > Let's document the existing usage, and improve it a bit where we prefer
> > to use a phandle instead of a child device for it. That way we can
> > properly describe the hardware in dts files for things like genpd.
> 
> I'm ok with this series, but I really don't like cpsw-phy-sel in general.

Yeah this binding predates any standards. This series
only fixes the nasty issue of cpsw claiming a module as a
child that's outside it's IO range.

> It was introduced long time back and now I'm thinking about possibility to 
> replace it with
> one of current generic interfaces - for example mux-controller. 
> Each port will control up to 3 muxes (port mode, idmode and rmii_ext_clk) and 
>  
> transform phy-mode => mux states.
> What do you think?

Sure a mux-controller here makes sense.

> Another option is to use phy, but it'd be complicated.

For the port muxes, how about a phy driver just using
a pinctrl driver?

In general, it seems cpsw is just an interconnect instance
(L4_FAST) with a control module (CPSW_WR) and a pile of
independent other modules. That's described nicely in
am437x TRM chapter "2.1.4 L4 Fast Peripheral Memory Map".
So from that point of view the binding reg entries right
now are all wrong :)

In the long run cpsw should be really treated as an
interconnect instance with it's control module providing
standard Linux framework services such as clock /
regulator / phy / pinctrl / iio whatever for the other
modules.

Just my 2c based on looking at the interconnect, I'm
not too familiar with cpsw otherwise.

Regards,

Tony

Re: [PATCH net] ebpf: fix bpf_msg_pull_data

2018-08-29 Thread Tushar Dave





On 08/29/2018 05:07 PM, Tushar Dave wrote:

While doing some preliminary testing it is found that bpf helper
bpf_msg_pull_data does not calculate the data and data_end offset
correctly. Fix it!

Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Tushar Dave 
Acked-by: Sowmini Varadhan 
---
  net/core/filter.c | 38 +-
  1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index c25eb36..3eeb3d6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2285,7 +2285,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
  BPF_CALL_4(bpf_msg_pull_data,
   struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
  {
-   unsigned int len = 0, offset = 0, copy = 0;
+   unsigned int len = 0, offset = 0, copy = 0, off = 0;
struct scatterlist *sg = msg->sg_data;
int first_sg, last_sg, i, shift;
unsigned char *p, *to, *from;
@@ -2299,22 +2299,30 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
*msg)
i = msg->sg_start;
do {
len = sg[i].length;
-   offset += len;
if (start < offset + len)
break;
+   offset += len;
i++;
if (i == MAX_SKB_FRAGS)
i = 0;
-   } while (i != msg->sg_end);
+   } while (i <= msg->sg_end);
  
+	/* return error if start is out of range */

if (unlikely(start >= offset + len))
return -EINVAL;
  
-	if (!msg->sg_copy[i] && bytes <= len)

-   goto out;
+   /* return error if i is last entry in sglist and end is out of range */
+   if (msg->sg_copy[i] && end > offset + len)
+   return -EINVAL;
  
  	first_sg = i;
  
+	/* if i is not last entry in sg list and end (i.e start + bytes) is

+* within this sg[i] then goto out and calculate data and data_end
+*/
+   if (!msg->sg_copy[i] && end <= offset + len)
+   goto out;
+
/* At this point we need to linearize multiple scatterlist
 * elements or a single shared page. Either way we need to
 * copy into a linear buffer exclusively owned by BPF. Then
@@ -2330,9 +2338,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
i++;
if (i == MAX_SKB_FRAGS)
i = 0;
-   if (bytes < copy)
+   if (end < copy)
break;
-   } while (i != msg->sg_end);
+   } while (i <= msg->sg_end);
+
+   /* return error if i is last entry in sglist and end is out of range */
+   if (i > msg->sg_end && end > offset + copy)
+   return -EINVAL;
+
last_sg = i;
  
  	if (unlikely(copy < end - start))

@@ -2342,23 +2355,22 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
*msg)
if (unlikely(!page))
return -ENOMEM;
p = page_address(page);
-   offset = 0;
  
  	i = first_sg;

do {
from = sg_virt([i]);
len = sg[i].length;
-   to = p + offset;
+   to = p + off;
  
  		memcpy(to, from, len);

-   offset += len;
+   off += len;
sg[i].length = 0;
put_page(sg_page([i]));
  
  		i++;

if (i == MAX_SKB_FRAGS)
i = 0;
-   } while (i != last_sg);
+   } while (i < last_sg);
  
  	sg[first_sg].length = copy;

sg_set_page([first_sg], page, copy, 0);
@@ -2380,7 +2392,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
else
move_from = i + shift;
  
-		if (move_from == msg->sg_end)

+   if (move_from > msg->sg_end)
break;
  
  		sg[i] = sg[move_from];

@@ -2396,7 +2408,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
if (msg->sg_end < 0)
msg->sg_end += MAX_SKB_FRAGS;
  out:
-   msg->data = sg_virt([i]) + start - offset;
+   msg->data = sg_virt([first_sg]) + start - offset;
msg->data_end = msg->data + bytes;
  
  	return 0;




Please discard this patch. I just noticed that Daniel Borkmann sent 
some similar fixes for bpf_msg_pull_data.



-Tushar

[PATCH net-next] net: dsa: mv88e6xxx: Share main switch IRQ

2018-08-29 Thread Marek Behún

On some boards the interrupt can be shared between multiple devices.
For example on Turris Mox the interrupt is shared between all switches.

Signed-off-by: Marek Behun 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 8da3d39e3218..b57f5403982a 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -434,7 +434,7 @@ static int mv88e6xxx_g1_irq_setup(struct mv88e6xxx_chip 
*chip)
 
err = request_threaded_irq(chip->irq, NULL,
   mv88e6xxx_g1_irq_thread_fn,
-  IRQF_ONESHOT,
+  IRQF_ONESHOT | IRQF_SHARED,
   dev_name(chip->dev), chip);
if (err)
mv88e6xxx_g1_irq_free_common(chip);
-- 
2.16.4

Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e

2018-08-29 Thread William Tu

> Thanks for working on this, LGTM! Are you also planning to get ixgbe
> out after that?
>

I currently don't have i40e nic to test, so
I'm also looking forward to the ixgbe patch!

Thank you
William

Re: [PATCH 1/2] dt-bindings: net: cpsw: Document cpsw-phy-sel usage but prefer phandle

2018-08-29 Thread Grygorii Strashko

Hi Tony,

On 08/29/2018 10:00 AM, Tony Lindgren wrote:
> The current cpsw usage for cpsw-phy-sel is undocumented but is used for
> all the boards using cpsw. And cpsw-phy-sel is not really a child of
> the cpsw device, it lives in the system control module instead.
> 
> Let's document the existing usage, and improve it a bit where we prefer
> to use a phandle instead of a child device for it. That way we can
> properly describe the hardware in dts files for things like genpd.

I'm ok with this series, but I really don't like cpsw-phy-sel in general.

It was introduced long time back and now I'm thinking about possibility to 
replace it with
one of current generic interfaces - for example mux-controller. 
Each port will control up to 3 muxes (port mode, idmode and rmii_ext_clk) and  
transform phy-mode => mux states.
What do you think?

Another option is to use phy, but it'd be complicated.

> 
> Cc: devicet...@vger.kernel.org
> Cc: Andrew Lunn 
> Cc: Grygorii Strashko 
> Cc: Ivan Khoronzhuk 
> Cc: Mark Rutland 
> Cc: Murali Karicheri 
> Cc: Rob Herring 
> Signed-off-by: Tony Lindgren 
> ---
>   Documentation/devicetree/bindings/net/cpsw.txt | 6 ++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/cpsw.txt 
> b/Documentation/devicetree/bindings/net/cpsw.txt
> --- a/Documentation/devicetree/bindings/net/cpsw.txt
> +++ b/Documentation/devicetree/bindings/net/cpsw.txt
> @@ -19,6 +19,10 @@ Required properties:
>   - slaves: Specifies number for slaves
>   - active_slave  : Specifies the slave to use for time stamping,
> ethtool and SIOCGMIIPHY
> +- cpsw-phy-sel   : Specifies the phandle to the CPSW phy mode 
> selection
> +   device. See also cpsw-phy-sel.txt for it's binding.
> +   Note that in legacy cases cpsw-phy-sel may be
> +   a child device instead of a phandle.
>   
>   Optional properties:
>   - ti,hwmods : Must be "cpgmac0"
> @@ -75,6 +79,7 @@ Examples:
>   cpts_clock_mult = <0x8000>;
>   cpts_clock_shift = <29>;
>   syscon = <>;
> + cpsw-phy-sel = <_sel>;
>   cpsw_emac0: slave@0 {
>   phy_id = <_mdio>, <0>;
>   phy-mode = "rgmii-txid";
> @@ -103,6 +108,7 @@ Examples:
>   cpts_clock_mult = <0x8000>;
>   cpts_clock_shift = <29>;
>   syscon = <>;
> + cpsw-phy-sel = <_sel>;
>   cpsw_emac0: slave@0 {
>   phy_id = <_mdio>, <0>;
>   phy-mode = "rgmii-txid";
> 

-- 
regards,
-grygorii

[PATCH net] ebpf: fix bpf_msg_pull_data

2018-08-29 Thread Tushar Dave

While doing some preliminary testing it is found that bpf helper
bpf_msg_pull_data does not calculate the data and data_end offset
correctly. Fix it!

Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Tushar Dave 
Acked-by: Sowmini Varadhan 
---
 net/core/filter.c | 38 +-
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index c25eb36..3eeb3d6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2285,7 +2285,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 BPF_CALL_4(bpf_msg_pull_data,
   struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
 {
-   unsigned int len = 0, offset = 0, copy = 0;
+   unsigned int len = 0, offset = 0, copy = 0, off = 0;
struct scatterlist *sg = msg->sg_data;
int first_sg, last_sg, i, shift;
unsigned char *p, *to, *from;
@@ -2299,22 +2299,30 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
*msg)
i = msg->sg_start;
do {
len = sg[i].length;
-   offset += len;
if (start < offset + len)
break;
+   offset += len;
i++;
if (i == MAX_SKB_FRAGS)
i = 0;
-   } while (i != msg->sg_end);
+   } while (i <= msg->sg_end);
 
+   /* return error if start is out of range */
if (unlikely(start >= offset + len))
return -EINVAL;
 
-   if (!msg->sg_copy[i] && bytes <= len)
-   goto out;
+   /* return error if i is last entry in sglist and end is out of range */
+   if (msg->sg_copy[i] && end > offset + len)
+   return -EINVAL;
 
first_sg = i;
 
+   /* if i is not last entry in sg list and end (i.e start + bytes) is
+* within this sg[i] then goto out and calculate data and data_end
+*/
+   if (!msg->sg_copy[i] && end <= offset + len)
+   goto out;
+
/* At this point we need to linearize multiple scatterlist
 * elements or a single shared page. Either way we need to
 * copy into a linear buffer exclusively owned by BPF. Then
@@ -2330,9 +2338,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
i++;
if (i == MAX_SKB_FRAGS)
i = 0;
-   if (bytes < copy)
+   if (end < copy)
break;
-   } while (i != msg->sg_end);
+   } while (i <= msg->sg_end);
+
+   /* return error if i is last entry in sglist and end is out of range */
+   if (i > msg->sg_end && end > offset + copy)
+   return -EINVAL;
+
last_sg = i;
 
if (unlikely(copy < end - start))
@@ -2342,23 +2355,22 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
*msg)
if (unlikely(!page))
return -ENOMEM;
p = page_address(page);
-   offset = 0;
 
i = first_sg;
do {
from = sg_virt([i]);
len = sg[i].length;
-   to = p + offset;
+   to = p + off;
 
memcpy(to, from, len);
-   offset += len;
+   off += len;
sg[i].length = 0;
put_page(sg_page([i]));
 
i++;
if (i == MAX_SKB_FRAGS)
i = 0;
-   } while (i != last_sg);
+   } while (i < last_sg);
 
sg[first_sg].length = copy;
sg_set_page([first_sg], page, copy, 0);
@@ -2380,7 +2392,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
else
move_from = i + shift;
 
-   if (move_from == msg->sg_end)
+   if (move_from > msg->sg_end)
break;
 
sg[i] = sg[move_from];
@@ -2396,7 +2408,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
if (msg->sg_end < 0)
msg->sg_end += MAX_SKB_FRAGS;
 out:
-   msg->data = sg_virt([i]) + start - offset;
+   msg->data = sg_virt([first_sg]) + start - offset;
msg->data_end = msg->data + bytes;
 
return 0;
-- 
1.8.3.1

[PATCH net-next] net/ipv6: Do not reset nl_net in ip6_route_info_create

2018-08-29 Thread dsahern

From: David Ahern 

nl_net is set on entry to ip6_route_info_create. Only devices
within that namespace are considered so no need to reset it
before returning.

Signed-off-by: David Ahern 
---
 net/ipv6/route.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index c4ea13e8360b..b5f385d2b0e9 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3138,8 +3138,6 @@ static struct fib6_info *ip6_route_info_create(struct 
fib6_config *cfg,
rt->fib6_nh.nh_dev = dev;
rt->fib6_table = table;
 
-   cfg->fc_nlinfo.nl_net = dev_net(dev);
-
if (idev)
in6_dev_put(idev);
 
-- 
2.11.0

[PATCH net-next] net/ipv4: Add extack message that dev is required for ONLINK

2018-08-29 Thread dsahern

From: David Ahern 

Make IPv4 consistent with IPv6 and return an extack message that the
ONLINK flag requires a nexthop device.

Signed-off-by: David Ahern 
---
 net/ipv4/fib_semantics.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index f3c89ccf14c5..bee8db979195 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -797,8 +797,10 @@ static int fib_check_nh(struct fib_config *cfg, struct 
fib_nh *nh,
return -EINVAL;
}
dev = __dev_get_by_index(net, nh->nh_oif);
-   if (!dev)
+   if (!dev) {
+   NL_SET_ERR_MSG(extack, "Nexthop device required 
for onlink");
return -ENODEV;
+   }
if (!(dev->flags & IFF_UP)) {
NL_SET_ERR_MSG(extack,
   "Nexthop device is not up");
-- 
2.11.0

Re: pull-request: bpf 2018-08-29

2018-08-29 Thread David Miller

From: Daniel Borkmann 
Date: Wed, 29 Aug 2018 21:07:24 +0200

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) Fix a build error in sk_reuseport_convert_ctx_access() when
>compiling with clang which cannot resolve hweight_long() at
>build time inside the BUILD_BUG_ON() assertion, from Stefan.
> 
> 2) Several fixes for BPF sockmap, four of them in getting the
>bpf_msg_pull_data() helper to work, one use after free case
>in bpf_tcp_close() and one refcount leak in bpf_tcp_recvmsg(),
>from Daniel.
> 
> 3) Another fix for BPF sockmap where we misaccount sk_mem_uncharge()
>in the socket redirect error case from unwinding scatterlist
>twice, from John.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.

[net-next 15/15] i40e: Prevent deleting MAC address from VF when set by PF

2018-08-29 Thread Jeff Kirsher

From: Patryk Małek 

To prevent VF from deleting MAC address that was assigned by the
PF we need to check for that scenario when we try to delete a MAC
address from a VF.

Signed-off-by: Patryk Małek 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index f56c645588f3..3e707c7e6782 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -2569,6 +2569,16 @@ static int i40e_vc_del_mac_addr_msg(struct i40e_vf *vf, 
u8 *msg, u16 msglen)
ret = I40E_ERR_INVALID_MAC_ADDR;
goto error_param;
}
+
+   if (vf->pf_set_mac &&
+   ether_addr_equal(al->list[i].addr,
+vf->default_lan_addr.addr)) {
+   dev_err(>pdev->dev,
+   "MAC addr %pM has been set by PF, cannot delete 
it for VF %d, reset VF to change MAC addr\n",
+   vf->default_lan_addr.addr, vf->vf_id);
+   ret = I40E_ERR_PARAM;
+   goto error_param;
+   }
}
vsi = pf->vsi[vf->lan_vsi_idx];
 
-- 
2.17.1

[net-next 11/15] i40e: Check and correct speed values for link on open

2018-08-29 Thread Jeff Kirsher

From: Jan Sokolowski 

If our card has been put in an unstable state due to
other drivers interacting with it, speed settings
might be incorrect. If incorrect, forcefully reset them
on open to known default values.

Signed-off-by: Jan Sokolowski 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 27 ++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6cb4076e8fba..c30feb27e1c0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -6570,6 +6570,24 @@ static i40e_status i40e_force_link_state(struct i40e_pf 
*pf, bool is_up)
struct i40e_hw *hw = >hw;
i40e_status err;
u64 mask;
+   u8 speed;
+
+   /* Card might've been put in an unstable state by other drivers
+* and applications, which causes incorrect speed values being
+* set on startup. In order to clear speed registers, we call
+* get_phy_capabilities twice, once to get initial state of
+* available speeds, and once to get current PHY config.
+*/
+   err = i40e_aq_get_phy_capabilities(hw, false, true, ,
+  NULL);
+   if (err) {
+   dev_err(>pdev->dev,
+   "failed to get phy cap., ret =  %s last_status =  %s\n",
+   i40e_stat_str(hw, err),
+   i40e_aq_str(hw, hw->aq.asq_last_status));
+   return err;
+   }
+   speed = abilities.link_speed;
 
/* Get the current phy config */
err = i40e_aq_get_phy_capabilities(hw, false, false, ,
@@ -6583,9 +6601,9 @@ static i40e_status i40e_force_link_state(struct i40e_pf 
*pf, bool is_up)
}
 
/* If link needs to go up, but was not forced to go down,
-* no need for a flap
+* and its speed values are OK, no need for a flap
 */
-   if (is_up && abilities.phy_type != 0)
+   if (is_up && abilities.phy_type != 0 && abilities.link_speed != 0)
return I40E_SUCCESS;
 
/* To force link we need to set bits for all supported PHY types,
@@ -6597,7 +6615,10 @@ static i40e_status i40e_force_link_state(struct i40e_pf 
*pf, bool is_up)
config.phy_type_ext = is_up ? (u8)((mask >> 32) & 0xff) : 0;
/* Copy the old settings, except of phy_type */
config.abilities = abilities.abilities;
-   config.link_speed = abilities.link_speed;
+   if (abilities.link_speed != 0)
+   config.link_speed = abilities.link_speed;
+   else
+   config.link_speed = speed;
config.eee_capability = abilities.eee_capability;
config.eeer = abilities.eeer_val;
config.low_power_ctrl = abilities.d3_lpan;
-- 
2.17.1

[net-next 08/15] virtchnl: use u8 type for a field in the virtchnl_filter struct

2018-08-29 Thread Jeff Kirsher

From: Harshitha Ramamurthy 

The virtchnl_filter struct has a field called field_flags. A previous
commit mistakenly had the type to be a __u8. What we want is for the
field to be an unsigned 8 bit value, so let's just use the existing
kernel type u8 for that.

Signed-off-by: Harshitha Ramamurthy 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 include/linux/avf/virtchnl.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/avf/virtchnl.h b/include/linux/avf/virtchnl.h
index 212b3822d180..b41f7bc958ef 100644
--- a/include/linux/avf/virtchnl.h
+++ b/include/linux/avf/virtchnl.h
@@ -573,7 +573,7 @@ struct virtchnl_filter {
enumvirtchnl_flow_type flow_type;
enumvirtchnl_action action;
u32 action_meta;
-   __u8field_flags;
+   u8  field_flags;
 };
 
 VIRTCHNL_CHECK_STRUCT_LEN(272, virtchnl_filter);
-- 
2.17.1

[net-next 07/15] i40evf: set IFF_UNICAST_FLT flag for the VF

2018-08-29 Thread Jeff Kirsher

From: Lihong Yang 

Set IFF_UNICAST_FLT flag for the VF to prevent it from entering
promiscuous mode when macvlan is added to the VF.

Signed-off-by: Lihong Yang 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 5906c1c1d19d..07f187b5bcac 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -3358,6 +3358,8 @@ int i40evf_process_config(struct i40evf_adapter *adapter)
if (vfres->vf_cap_flags & VIRTCHNL_VF_OFFLOAD_VLAN)
netdev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
 
+   netdev->priv_flags |= IFF_UNICAST_FLT;
+
/* Do not turn on offloads when they are requested to be turned off.
 * TSO needs minimum 576 bytes to work correctly.
 */
-- 
2.17.1

[net-next 01/15] i40e: convert queue stats to i40e_stats array

2018-08-29 Thread Jeff Kirsher

From: Jacob Keller 

Use an i40e_stats array to handle the queue stats, instead of coding
similar functionality separately. Because of how the queue stats are
accessed on some kernels, we can't easily use i40e_add_ethtool_stats.

Instead, implement a separate helper, i40e_add_queue_stats, which we'll
use instead. This helper will correctly implement the
u64_stats_fetch_begin_irq logic and allow retries until successful. We
share the most complex code by re-using i40e_add_one_ethtool_stat.

This logic additionally easily supports skipping disabled rings by using
a ternary operator before calling the u64_stats_fetch_begin_irq()
function, so that we correctly zero-out the stats values without having
to perform two separate sections of code.

This significantly reduces the boiler plate code in
i40e_get_ethtool_stats, and helps keep the complex logic contained to as
few functions as possible.

With this patch, we've finally converted all the statistics to use the
helpers and the i40e_stats function.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 .../net/ethernet/intel/i40e/i40e_ethtool.c| 148 +++---
 1 file changed, 89 insertions(+), 59 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 5ff6caa83948..235012b3bd42 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -33,6 +33,8 @@ struct i40e_stats {
I40E_STAT(struct i40e_veb, _name, _stat)
 #define I40E_PFC_STAT(_name, _stat) \
I40E_STAT(struct i40e_pfc_stats, _name, _stat)
+#define I40E_QUEUE_STAT(_name, _stat) \
+   I40E_STAT(struct i40e_ring, _name, _stat)
 
 static const struct i40e_stats i40e_gstrings_net_stats[] = {
I40E_NETDEV_STAT(rx_packets),
@@ -171,20 +173,16 @@ static const struct i40e_stats i40e_gstrings_pfc_stats[] 
= {
I40E_PFC_STAT("port.rx_priority_%u_xon_2_xoff", priority_xon_2_xoff),
 };
 
-/* We use num_tx_queues here as a proxy for the maximum number of queues
- * available because we always allocate queues symmetrically.
- */
-#define I40E_MAX_NUM_QUEUES(n) ((n)->num_tx_queues)
-#define I40E_QUEUE_STATS_LEN(n)  \
-  (I40E_MAX_NUM_QUEUES(n)   \
-   * 2 /* Tx and Rx together */ \
-   * (sizeof(struct i40e_queue_stats) / sizeof(u64)))
-#define I40E_GLOBAL_STATS_LEN  ARRAY_SIZE(i40e_gstrings_stats)
+static const struct i40e_stats i40e_gstrings_queue_stats[] = {
+   I40E_QUEUE_STAT("%s-%u.packets", stats.packets),
+   I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes),
+};
+
 #define I40E_NETDEV_STATS_LEN  ARRAY_SIZE(i40e_gstrings_net_stats)
+
 #define I40E_MISC_STATS_LENARRAY_SIZE(i40e_gstrings_misc_stats)
-#define I40E_VSI_STATS_LEN(n)  (I40E_NETDEV_STATS_LEN + \
-I40E_MISC_STATS_LEN + \
-I40E_QUEUE_STATS_LEN((n)))
+
+#define I40E_VSI_STATS_LEN (I40E_NETDEV_STATS_LEN + I40E_MISC_STATS_LEN)
 
 #define I40E_PFC_STATS_LEN (ARRAY_SIZE(i40e_gstrings_pfc_stats) * \
 I40E_MAX_USER_PRIORITY)
@@ -193,10 +191,15 @@ static const struct i40e_stats i40e_gstrings_pfc_stats[] 
= {
 (ARRAY_SIZE(i40e_gstrings_veb_tc_stats) * \
  I40E_MAX_TRAFFIC_CLASS))
 
-#define I40E_PF_STATS_LEN(n)   (I40E_GLOBAL_STATS_LEN + \
+#define I40E_GLOBAL_STATS_LEN  ARRAY_SIZE(i40e_gstrings_stats)
+
+#define I40E_PF_STATS_LEN  (I40E_GLOBAL_STATS_LEN + \
 I40E_PFC_STATS_LEN + \
 I40E_VEB_STATS_LEN + \
-I40E_VSI_STATS_LEN((n)))
+I40E_VSI_STATS_LEN)
+
+/* Length of stats for a single queue */
+#define I40E_QUEUE_STATS_LEN   ARRAY_SIZE(i40e_gstrings_queue_stats)
 
 enum i40e_ethtool_test_id {
I40E_ETH_TEST_REG = 0,
@@ -1701,11 +1704,30 @@ static int i40e_get_stats_count(struct net_device 
*netdev)
struct i40e_netdev_priv *np = netdev_priv(netdev);
struct i40e_vsi *vsi = np->vsi;
struct i40e_pf *pf = vsi->back;
+   int stats_len;
 
if (vsi == pf->vsi[pf->lan_vsi] && pf->hw.partition_id == 1)
-   return I40E_PF_STATS_LEN(netdev);
+   stats_len = I40E_PF_STATS_LEN;
else
-   return I40E_VSI_STATS_LEN(netdev);
+   stats_len = I40E_VSI_STATS_LEN;
+
+   /* The number of stats reported for a given net_device must remain
+* constant throughout the life of that device.
+*
+* This is because the API for obtaining the size, strings, and stats
+* is spread out over three separate ethtool ioctls. There is no safe
+* way to lock the number of stats across these calls, so we must
+

[net-next 05/15] i40evf: Validate the number of queues a PF sends

2018-08-29 Thread Jeff Kirsher

From: Paul M Stillwell Jr 

A PF can send any number of queues to the VF and the VF may not
be able to support that many. Check to see that the number of
queues is less than or equal to the max number of queues the
VF can have.

Signed-off-by: Paul M Stillwell Jr 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 .../ethernet/intel/i40evf/i40evf_virtchnl.c   | 32 +++
 1 file changed, 32 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
index 79b7be83b5c3..6579dabab78c 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
@@ -153,6 +153,32 @@ int i40evf_send_vf_config_msg(struct i40evf_adapter 
*adapter)
  NULL, 0);
 }
 
+/**
+ * i40evf_validate_num_queues
+ * @adapter: adapter structure
+ *
+ * Validate that the number of queues the PF has sent in
+ * VIRTCHNL_OP_GET_VF_RESOURCES is not larger than the VF can handle.
+ **/
+static void i40evf_validate_num_queues(struct i40evf_adapter *adapter)
+{
+   if (adapter->vf_res->num_queue_pairs > I40EVF_MAX_REQ_QUEUES) {
+   struct virtchnl_vsi_resource *vsi_res;
+   int i;
+
+   dev_info(>pdev->dev, "Received %d queues, but can only 
have a max of %d\n",
+adapter->vf_res->num_queue_pairs,
+I40EVF_MAX_REQ_QUEUES);
+   dev_info(>pdev->dev, "Fixing by reducing queues to 
%d\n",
+I40EVF_MAX_REQ_QUEUES);
+   adapter->vf_res->num_queue_pairs = I40EVF_MAX_REQ_QUEUES;
+   for (i = 0; i < adapter->vf_res->num_vsis; i++) {
+   vsi_res = >vf_res->vsi_res[i];
+   vsi_res->num_queue_pairs = I40EVF_MAX_REQ_QUEUES;
+   }
+   }
+}
+
 /**
  * i40evf_get_vf_config
  * @adapter: private adapter structure
@@ -195,6 +221,11 @@ int i40evf_get_vf_config(struct i40evf_adapter *adapter)
err = (i40e_status)le32_to_cpu(event.desc.cookie_low);
memcpy(adapter->vf_res, event.msg_buf, min(event.msg_len, len));
 
+   /* some PFs send more queues than we should have so validate that
+* we aren't getting too many queues
+*/
+   if (!err)
+   i40evf_validate_num_queues(adapter);
i40e_vf_parse_hw_config(hw, adapter->vf_res);
 out_alloc:
kfree(event.msg_buf);
@@ -1329,6 +1360,7 @@ void i40evf_virtchnl_completion(struct i40evf_adapter 
*adapter,
  I40E_MAX_VF_VSI *
  sizeof(struct virtchnl_vsi_resource);
memcpy(adapter->vf_res, msg, min(msglen, len));
+   i40evf_validate_num_queues(adapter);
i40e_vf_parse_hw_config(>hw, adapter->vf_res);
if (is_zero_ether_addr(adapter->hw.mac.addr)) {
/* restore current mac address */
-- 
2.17.1

[net-next 12/15] i40evf: Don't enable vlan stripping when rx offload is turned on

2018-08-29 Thread Jeff Kirsher

From: Patryk Małek 

With current implementation of i40evf_set_features when user sets
any offload via ethtool we set I40EVF_FLAG_AQ_ENABLE_VLAN_STRIPPING
as a required aq which triggers driver to call
i40evf_enable_vlan_stripping. This shouldn't take place.
This patches fixes it by setting the flag only when VLAN offload
is turned on.

Signed-off-by: Patryk Małek 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 07f187b5bcac..c7048cf484fb 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -3120,18 +3120,19 @@ static int i40evf_set_features(struct net_device 
*netdev,
 {
struct i40evf_adapter *adapter = netdev_priv(netdev);
 
-   /* Don't allow changing VLAN_RX flag when VLAN is set for VF
-* and return an error in this case
+   /* Don't allow changing VLAN_RX flag when adapter is not capable
+* of VLAN offload
 */
-   if (VLAN_ALLOWED(adapter)) {
+   if (!VLAN_ALLOWED(adapter)) {
+   if ((netdev->features ^ features) & NETIF_F_HW_VLAN_CTAG_RX)
+   return -EINVAL;
+   } else if ((netdev->features ^ features) & NETIF_F_HW_VLAN_CTAG_RX) {
if (features & NETIF_F_HW_VLAN_CTAG_RX)
adapter->aq_required |=
I40EVF_FLAG_AQ_ENABLE_VLAN_STRIPPING;
else
adapter->aq_required |=
I40EVF_FLAG_AQ_DISABLE_VLAN_STRIPPING;
-   } else if ((netdev->features ^ features) & NETIF_F_HW_VLAN_CTAG_RX) {
-   return -EINVAL;
}
 
return 0;
-- 
2.17.1

[net-next 02/15] i40e: move ethtool stats boiler plate code to i40e_ethtool_stats.h

2018-08-29 Thread Jeff Kirsher

From: Jacob Keller 

Move the boiler plate structures and helper functions we recently
added into their own header file, so that the complete collection is
located together.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 .../net/ethernet/intel/i40e/i40e_ethtool.c| 182 +--
 .../ethernet/intel/i40e/i40e_ethtool_stats.h  | 221 ++
 2 files changed, 222 insertions(+), 181 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 235012b3bd42..d7d3974beca2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -6,25 +6,8 @@
 #include "i40e.h"
 #include "i40e_diag.h"
 
-struct i40e_stats {
-   /* The stat_string is expected to be a format string formatted using
-* vsnprintf by i40e_add_stat_strings. Every member of a stats array
-* should use the same format specifiers as they will be formatted
-* using the same variadic arguments.
-*/
-   char stat_string[ETH_GSTRING_LEN];
-   int sizeof_stat;
-   int stat_offset;
-};
-
-#define I40E_STAT(_type, _name, _stat) { \
-   .stat_string = _name, \
-   .sizeof_stat = FIELD_SIZEOF(_type, _stat), \
-   .stat_offset = offsetof(_type, _stat) \
-}
+#include "i40e_ethtool_stats.h"
 
-#define I40E_NETDEV_STAT(_net_stat) \
-   I40E_STAT(struct rtnl_link_stats64, #_net_stat, _net_stat)
 #define I40E_PF_STAT(_name, _stat) \
I40E_STAT(struct i40e_pf, _name, _stat)
 #define I40E_VSI_STAT(_name, _stat) \
@@ -173,11 +156,6 @@ static const struct i40e_stats i40e_gstrings_pfc_stats[] = 
{
I40E_PFC_STAT("port.rx_priority_%u_xon_2_xoff", priority_xon_2_xoff),
 };
 
-static const struct i40e_stats i40e_gstrings_queue_stats[] = {
-   I40E_QUEUE_STAT("%s-%u.packets", stats.packets),
-   I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes),
-};
-
 #define I40E_NETDEV_STATS_LEN  ARRAY_SIZE(i40e_gstrings_net_stats)
 
 #define I40E_MISC_STATS_LENARRAY_SIZE(i40e_gstrings_misc_stats)
@@ -1749,128 +1727,6 @@ static int i40e_get_sset_count(struct net_device 
*netdev, int sset)
}
 }
 
-/**
- * i40e_add_one_ethtool_stat - copy the stat into the supplied buffer
- * @data: location to store the stat value
- * @pointer: basis for where to copy from
- * @stat: the stat definition
- *
- * Copies the stat data defined by the pointer and stat structure pair into
- * the memory supplied as data. Used to implement i40e_add_ethtool_stats and
- * i40e_add_queue_stats. If the pointer is null, data will be zero'd.
- */
-static inline void
-i40e_add_one_ethtool_stat(u64 *data, void *pointer,
- const struct i40e_stats *stat)
-{
-   char *p;
-
-   if (!pointer) {
-   /* ensure that the ethtool data buffer is zero'd for any stats
-* which don't have a valid pointer.
-*/
-   *data = 0;
-   return;
-   }
-
-   p = (char *)pointer + stat->stat_offset;
-   switch (stat->sizeof_stat) {
-   case sizeof(u64):
-   *data = *((u64 *)p);
-   break;
-   case sizeof(u32):
-   *data = *((u32 *)p);
-   break;
-   case sizeof(u16):
-   *data = *((u16 *)p);
-   break;
-   case sizeof(u8):
-   *data = *((u8 *)p);
-   break;
-   default:
-   WARN_ONCE(1, "unexpected stat size for %s",
- stat->stat_string);
-   *data = 0;
-   }
-}
-
-/**
- * __i40e_add_ethtool_stats - copy stats into the ethtool supplied buffer
- * @data: ethtool stats buffer
- * @pointer: location to copy stats from
- * @stats: array of stats to copy
- * @size: the size of the stats definition
- *
- * Copy the stats defined by the stats array using the pointer as a base into
- * the data buffer supplied by ethtool. Updates the data pointer to point to
- * the next empty location for successive calls to __i40e_add_ethtool_stats.
- * If pointer is null, set the data values to zero and update the pointer to
- * skip these stats.
- **/
-static inline void
-__i40e_add_ethtool_stats(u64 **data, void *pointer,
-const struct i40e_stats stats[],
-const unsigned int size)
-{
-   unsigned int i;
-
-   for (i = 0; i < size; i++)
-   i40e_add_one_ethtool_stat((*data)++, pointer, [i]);
-}
-
-/**
- * i40e_add_ethtool_stats - copy stats into ethtool supplied buffer
- * @data: ethtool stats buffer
- * @pointer: location where stats are stored
- * @stats: static const array of stat definitions
- *
- * Macro to ease the use of __i40e_add_ethtool_stats by taking a static
- * constant stats array and passing the ARRAY_SIZE(). This avoids typos by
- * ensuring that we pass the size

[net-next 06/15] i40e: use correct length for strncpy

2018-08-29 Thread Jeff Kirsher

From: Mitch Williams 

Caught by GCC 8. When we provide a length for strncpy, we should not
include the terminating null. So we must tell it one less than the size
of the destination buffer.

Signed-off-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ptp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ptp.c 
b/drivers/net/ethernet/intel/i40e/i40e_ptp.c
index 35f2866b38c6..1199f0502d6d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ptp.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ptp.c
@@ -694,7 +694,8 @@ static long i40e_ptp_create_clock(struct i40e_pf *pf)
if (!IS_ERR_OR_NULL(pf->ptp_clock))
return 0;
 
-   strncpy(pf->ptp_caps.name, i40e_driver_name, sizeof(pf->ptp_caps.name));
+   strncpy(pf->ptp_caps.name, i40e_driver_name,
+   sizeof(pf->ptp_caps.name) - 1);
pf->ptp_caps.owner = THIS_MODULE;
pf->ptp_caps.max_adj = 9;
pf->ptp_caps.n_ext_ts = 0;
-- 
2.17.1

[net-next 13/15] i40e: hold the rtnl lock on clearing interrupt scheme

2018-08-29 Thread Jeff Kirsher

From: Patryk Małek 

Hold the rtnl lock when we're clearing interrupt scheme
in i40e_shutdown and in i40e_remove.

Signed-off-by: Patryk Małek 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index c30feb27e1c0..112245f32d7d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -14182,6 +14182,7 @@ static void i40e_remove(struct pci_dev *pdev)
mutex_destroy(>aq.asq_mutex);
 
/* Clear all dynamic memory lists of rings, q_vectors, and VSIs */
+   rtnl_lock();
i40e_clear_interrupt_scheme(pf);
for (i = 0; i < pf->num_alloc_vsi; i++) {
if (pf->vsi[i]) {
@@ -14190,6 +14191,7 @@ static void i40e_remove(struct pci_dev *pdev)
pf->vsi[i] = NULL;
}
}
+   rtnl_unlock();
 
for (i = 0; i < I40E_MAX_VEB; i++) {
kfree(pf->veb[i]);
@@ -14401,7 +14403,13 @@ static void i40e_shutdown(struct pci_dev *pdev)
wr32(hw, I40E_PFPM_WUFC,
 (pf->wol_en ? I40E_PFPM_WUFC_MAG_MASK : 0));
 
+   /* Since we're going to destroy queues during the
+* i40e_clear_interrupt_scheme() we should hold the RTNL lock for this
+* whole section
+*/
+   rtnl_lock();
i40e_clear_interrupt_scheme(pf);
+   rtnl_unlock();
 
if (system_state == SYSTEM_POWER_OFF) {
pci_wake_from_d3(pdev, pf->wol_en);
-- 
2.17.1

[net-next 09/15] i40e: static analysis report from community

2018-08-29 Thread Jeff Kirsher

From: Martyna Szapar 

Static analysis tools report a problem from original driver submission.
Removing unnecessary check in condition.

Signed-off-by: Martyna Szapar 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index ac685ad4d877..62f2b5bce6bb 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -13033,7 +13033,7 @@ struct i40e_veb *i40e_veb_setup(struct i40e_pf *pf, u16 
flags,
for (vsi_idx = 0; vsi_idx < pf->num_alloc_vsi; vsi_idx++)
if (pf->vsi[vsi_idx] && pf->vsi[vsi_idx]->seid == vsi_seid)
break;
-   if (vsi_idx >= pf->num_alloc_vsi && vsi_seid != 0) {
+   if (vsi_idx == pf->num_alloc_vsi && vsi_seid != 0) {
dev_info(>pdev->dev, "vsi seid %d not found\n",
 vsi_seid);
return NULL;
-- 
2.17.1

[net-next 10/15] i40e: report correct statistics when XDP is enabled

2018-08-29 Thread Jeff Kirsher

From: Björn Töpel 

When XDP is enabled, the driver will report incorrect
statistics. Received frames will reported as transmitted frames.

This commits fixes the i40e implementation of ndo_get_stats64 (struct
net_device_ops), so that iproute2 will report correct statistics
(e.g. when running "ip -stats link show dev eth0") even when XDP is
enabled.

Reported-by: Jesper Dangaard Brouer 
Fixes: 74608d17fe29 ("i40e: add support for XDP_TX action")
Signed-off-by: Björn Töpel 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 24 +++--
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 62f2b5bce6bb..6cb4076e8fba 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -420,9 +420,9 @@ static void i40e_get_netdev_stats_struct(struct net_device 
*netdev,
  struct rtnl_link_stats64 *stats)
 {
struct i40e_netdev_priv *np = netdev_priv(netdev);
-   struct i40e_ring *tx_ring, *rx_ring;
struct i40e_vsi *vsi = np->vsi;
struct rtnl_link_stats64 *vsi_stats = i40e_get_vsi_stats_struct(vsi);
+   struct i40e_ring *ring;
int i;
 
if (test_bit(__I40E_VSI_DOWN, vsi->state))
@@ -436,24 +436,26 @@ static void i40e_get_netdev_stats_struct(struct 
net_device *netdev,
u64 bytes, packets;
unsigned int start;
 
-   tx_ring = READ_ONCE(vsi->tx_rings[i]);
-   if (!tx_ring)
+   ring = READ_ONCE(vsi->tx_rings[i]);
+   if (!ring)
continue;
-   i40e_get_netdev_stats_struct_tx(tx_ring, stats);
+   i40e_get_netdev_stats_struct_tx(ring, stats);
 
-   rx_ring = _ring[1];
+   if (i40e_enabled_xdp_vsi(vsi)) {
+   ring++;
+   i40e_get_netdev_stats_struct_tx(ring, stats);
+   }
 
+   ring++;
do {
-   start = u64_stats_fetch_begin_irq(_ring->syncp);
-   packets = rx_ring->stats.packets;
-   bytes   = rx_ring->stats.bytes;
-   } while (u64_stats_fetch_retry_irq(_ring->syncp, start));
+   start   = u64_stats_fetch_begin_irq(>syncp);
+   packets = ring->stats.packets;
+   bytes   = ring->stats.bytes;
+   } while (u64_stats_fetch_retry_irq(>syncp, start));
 
stats->rx_packets += packets;
stats->rx_bytes   += bytes;
 
-   if (i40e_enabled_xdp_vsi(vsi))
-   i40e_get_netdev_stats_struct_tx(_ring[1], stats);
}
rcu_read_unlock();
 
-- 
2.17.1

[net-next 14/15] i40evf: cancel workqueue sync for adminq when a VF is removed

2018-08-29 Thread Jeff Kirsher

From: Lihong Yang 

If a VF is being removed, there is no need to continue with the
workqueue sync for the adminq task, thus cancel it. Without this call,
when VFs are created and removed right away, there might be a chance for
the driver to crash with events stuck in the adminq.

Signed-off-by: Lihong Yang 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index c7048cf484fb..174d1da2857b 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -3910,6 +3910,8 @@ static void i40evf_remove(struct pci_dev *pdev)
if (adapter->watchdog_timer.function)
del_timer_sync(>watchdog_timer);
 
+   cancel_work_sync(>adminq_task);
+
i40evf_free_rss(adapter);
 
if (hw->aq.asq.count)
-- 
2.17.1

[net-next 03/15] i40evf: update ethtool stats code and use helper functions

2018-08-29 Thread Jeff Kirsher

From: Jacob Keller 

Fix a bug in the way we handled VF queues, by always showing stats for
the maximum number of queues, even if they aren't allocated. It is not
safe to change the number of strings reported to ethtool, as grabbing
statistics occurs over multiple ethtool ops for which the rtnl_lock()
cannot be held the entire time.

Avoid this by always reporting queue stats for the maximum number of
queues in the netdevice. Share some of the helper functionality for
adding stats with the PF code in i40e_ethtool_stats.h

This should reduce the chance of potential future bugs, and make adding
new statistics easier.

Note for the queue stats, unlike the PF driver we do not keep an array
of queue pointers, but an array of queues, so care must be taken to
avoid accessing queue memory that hasn't yet been allocated.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 .../intel/i40evf/i40e_ethtool_stats.h | 221 ++
 .../ethernet/intel/i40evf/i40evf_ethtool.c| 137 ++-
 2 files changed, 298 insertions(+), 60 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h

diff --git a/drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h 
b/drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h
new file mode 100644
index ..60b595dd8c39
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h
@@ -0,0 +1,221 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2013 - 2018 Intel Corporation. */
+
+/* ethtool statistics helpers */
+
+/**
+ * struct i40e_stats - definition for an ethtool statistic
+ * @stat_string: statistic name to display in ethtool -S output
+ * @sizeof_stat: the sizeof() the stat, must be no greater than sizeof(u64)
+ * @stat_offset: offsetof() the stat from a base pointer
+ *
+ * This structure defines a statistic to be added to the ethtool stats buffer.
+ * It defines a statistic as offset from a common base pointer. Stats should
+ * be defined in constant arrays using the I40E_STAT macro, with every element
+ * of the array using the same _type for calculating the sizeof_stat and
+ * stat_offset.
+ *
+ * The @sizeof_stat is expected to be sizeof(u8), sizeof(u16), sizeof(u32) or
+ * sizeof(u64). Other sizes are not expected and will produce a WARN_ONCE from
+ * the i40e_add_ethtool_stat() helper function.
+ *
+ * The @stat_string is interpreted as a format string, allowing formatted
+ * values to be inserted while looping over multiple structures for a given
+ * statistics array. Thus, every statistic string in an array should have the
+ * same type and number of format specifiers, to be formatted by variadic
+ * arguments to the i40e_add_stat_string() helper function.
+ **/
+struct i40e_stats {
+   char stat_string[ETH_GSTRING_LEN];
+   int sizeof_stat;
+   int stat_offset;
+};
+
+/* Helper macro to define an i40e_stat structure with proper size and type.
+ * Use this when defining constant statistics arrays. Note that @_type expects
+ * only a type name and is used multiple times.
+ */
+#define I40E_STAT(_type, _name, _stat) { \
+   .stat_string = _name, \
+   .sizeof_stat = FIELD_SIZEOF(_type, _stat), \
+   .stat_offset = offsetof(_type, _stat) \
+}
+
+/* Helper macro for defining some statistics directly copied from the netdev
+ * stats structure.
+ */
+#define I40E_NETDEV_STAT(_net_stat) \
+   I40E_STAT(struct rtnl_link_stats64, #_net_stat, _net_stat)
+
+/* Helper macro for defining some statistics related to queues */
+#define I40E_QUEUE_STAT(_name, _stat) \
+   I40E_STAT(struct i40e_ring, _name, _stat)
+
+/* Stats associated with a Tx or Rx ring */
+static const struct i40e_stats i40e_gstrings_queue_stats[] = {
+   I40E_QUEUE_STAT("%s-%u.packets", stats.packets),
+   I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes),
+};
+
+/**
+ * i40evf_add_one_ethtool_stat - copy the stat into the supplied buffer
+ * @data: location to store the stat value
+ * @pointer: basis for where to copy from
+ * @stat: the stat definition
+ *
+ * Copies the stat data defined by the pointer and stat structure pair into
+ * the memory supplied as data. Used to implement i40e_add_ethtool_stats and
+ * i40evf_add_queue_stats. If the pointer is null, data will be zero'd.
+ */
+static inline void
+i40evf_add_one_ethtool_stat(u64 *data, void *pointer,
+   const struct i40e_stats *stat)
+{
+   char *p;
+
+   if (!pointer) {
+   /* ensure that the ethtool data buffer is zero'd for any stats
+* which don't have a valid pointer.
+*/
+   *data = 0;
+   return;
+   }
+
+   p = (char *)pointer + stat->stat_offset;
+   switch (stat->sizeof_stat) {
+   case sizeof(u64):
+   *data = *((u64 *)p);
+   break;
+   case sizeof(u32):
+   *data = *((u32 *)p);
+   break;
+   case sizeof(u16):
+

[net-next 04/15] i40evf: Change a VF mac without reloading the VF driver

2018-08-29 Thread Jeff Kirsher

From: Paweł Jabłoński 

Add possibility to change a VF mac address from host side
without reloading the VF driver on the guest side. Without
this patch it is not possible to change the VF mac because
executing i40evf_virtchnl_completion function with
VIRTCHNL_OP_GET_VF_RESOURCES opcode resets the VF mac
address to previous value.

Signed-off-by: Paweł Jabłoński 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c  |  8 +---
 drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c | 11 +--
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index c6d24eaede18..f56c645588f3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -2458,7 +2458,7 @@ static inline int i40e_check_vf_permission(struct i40e_vf 
*vf,
!is_multicast_ether_addr(addr) && vf->pf_set_mac &&
!ether_addr_equal(addr, vf->default_lan_addr.addr)) {
dev_err(>pdev->dev,
-   "VF attempting to override administratively set 
MAC address, reload the VF driver to resume normal operation\n");
+   "VF attempting to override administratively set 
MAC address, bring down and up the VF interface to resume normal operation\n");
return -EPERM;
}
}
@@ -3873,9 +3873,11 @@ int i40e_ndo_set_vf_mac(struct net_device *netdev, int 
vf_id, u8 *mac)
 mac, vf_id);
}
 
-   /* Force the VF driver stop so it has to reload with new MAC address */
+   /* Force the VF interface down so it has to bring up with new MAC
+* address
+*/
i40e_vc_disable_vf(vf);
-   dev_info(>pdev->dev, "Reload the VF driver to make this change 
effective.\n");
+   dev_info(>pdev->dev, "Bring down and up the VF interface to make 
this change effective.\n");
 
 error_param:
return ret;
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
index 565677de5ba3..79b7be83b5c3 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
@@ -1330,8 +1330,15 @@ void i40evf_virtchnl_completion(struct i40evf_adapter 
*adapter,
  sizeof(struct virtchnl_vsi_resource);
memcpy(adapter->vf_res, msg, min(msglen, len));
i40e_vf_parse_hw_config(>hw, adapter->vf_res);
-   /* restore current mac address */
-   ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr);
+   if (is_zero_ether_addr(adapter->hw.mac.addr)) {
+   /* restore current mac address */
+   ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr);
+   } else {
+   /* refresh current mac address if changed */
+   ether_addr_copy(netdev->dev_addr, adapter->hw.mac.addr);
+   ether_addr_copy(netdev->perm_addr,
+   adapter->hw.mac.addr);
+   }
i40evf_process_config(adapter);
}
break;
-- 
2.17.1

[net-next 00/15][pull request] 40GbE Intel Wired LAN Driver Updates 2018-08-29

2018-08-29 Thread Jeff Kirsher

This series contains updates to i40e, i40evf and virtchnl.

Jake implements helper functions to use an array to handle the queue
stats which reduces the boiler plate code as well as keep the complexity
localized to a few functions.

Paweł adds the ability to change a VF's MAC address from the host side
without having to reload the VF driver on the guest side.

Paul adds a check to ensure that the number of queues that the PF sends
to the VF is equal to or less than the maximum number of queues the VF
can support.

Mitch fixes an issue caught by GCC 8, where we need to not include the
terminating null in the length of the string for strncpy().

Lihong fixes a VF issue to ensure that it does not enter into
promiscuous mode when macvlan is added to the VF.  Fixed a potential
crash after a VF is removed, since the workqueue sync for the adminq
task was not being cancelled.

Harshitha fixes the type for field_flags in the virtchnl_filter struct.

Martyna removes an unnecessary check in a conditional if statement.

Björn fixes an issue reported by Jesper Dangaard Brouer, where the
driver was reporting incorrect statistics when XDP was enabled.

Jan fixes the potential reporting of incorrect speed settings.

Patryk fixed an issue where the flag
I40EVF_FLAG_AQ_ENABLE_VLAN_STRIPPING was getting set when any offload is
set via ethtool.  Resolved by only setting this flag when VLAN offload
is enabled.  Also ensure we hold the rtnl lock when we are clearing the
interrupt scheme.  Added a check when deleting the MAC address from the
VF to ensure that the MAC address was not set by the PF and if it was,
do not delete it. 

The following are changes since commit 817e60a7a2bb1f22052f18562990d675cb3a3762:
  Merge branch 'nfp-add-NFP5000-support'
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Björn Töpel (1):
  i40e: report correct statistics when XDP is enabled

Harshitha Ramamurthy (1):
  virtchnl: use u8 type for a field in the virtchnl_filter struct

Jacob Keller (3):
  i40e: convert queue stats to i40e_stats array
  i40e: move ethtool stats boiler plate code to i40e_ethtool_stats.h
  i40evf: update ethtool stats code and use helper functions

Jan Sokolowski (1):
  i40e: Check and correct speed values for link on open

Lihong Yang (2):
  i40evf: set IFF_UNICAST_FLT flag for the VF
  i40evf: cancel workqueue sync for adminq when a VF is removed

Martyna Szapar (1):
  i40e: static analysis report from community

Mitch Williams (1):
  i40e: use correct length for strncpy

Patryk Małek (3):
  i40evf: Don't enable vlan stripping when rx offload is turned on
  i40e: hold the rtnl lock on clearing interrupt scheme
  i40e: Prevent deleting MAC address from VF when set by PF

Paul M Stillwell Jr (1):
  i40evf: Validate the number of queues a PF sends

Paweł Jabłoński (1):
  i40evf: Change a VF mac without reloading the VF driver

 .../net/ethernet/intel/i40e/i40e_ethtool.c| 238 --
 .../ethernet/intel/i40e/i40e_ethtool_stats.h  | 221 
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  61 +++--
 drivers/net/ethernet/intel/i40e/i40e_ptp.c|   3 +-
 .../ethernet/intel/i40e/i40e_virtchnl_pf.c|  18 +-
 .../intel/i40evf/i40e_ethtool_stats.h | 221 
 .../ethernet/intel/i40evf/i40evf_ethtool.c| 137 +-
 .../net/ethernet/intel/i40evf/i40evf_main.c   |  15 +-
 .../ethernet/intel/i40evf/i40evf_virtchnl.c   |  43 +++-
 include/linux/avf/virtchnl.h  |   2 +-
 10 files changed, 678 insertions(+), 281 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h
 create mode 100644 drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h

-- 
2.17.1

[PATCH net-next] tcp: change IPv6 flow-label upon receiving spurious retransmission

2018-08-29 Thread Yuchung Cheng

Currently a Linux IPv6 TCP sender will change the flow label upon
timeouts to potentially steer away from a data path that has gone
bad. However this does not help if the problem is on the ACK path
and the data path is healthy. In this case the receiver is likely
to receive repeated spurious retransmission because the sender
couldn't get the ACKs in time and has recurring timeouts.

This patch adds another feature to mitigate this problem. It
leverages the DSACK states in the receiver to change the flow
label of the ACKs to speculatively re-route the ACK packets.
In order to allow triggering on the second consecutive spurious
RTO, the receiver changes the flow label upon sending a second
consecutive DSACK for a sequence number below RCV.NXT.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp.c   |  2 ++
 net/ipv4/tcp_input.c | 13 +
 2 files changed, 15 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b8af2fec5ad5..8c4235c098fd 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2595,6 +2595,8 @@ int tcp_disconnect(struct sock *sk, int flags)
tp->compressed_ack = 0;
tp->bytes_sent = 0;
tp->bytes_retrans = 0;
+   tp->duplicate_sack[0].start_seq = 0;
+   tp->duplicate_sack[0].end_seq = 0;
tp->dsack_dups = 0;
tp->reord_seen = 0;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4c2dd9f863f7..62508a2f9b21 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4199,6 +4199,17 @@ static void tcp_dsack_extend(struct sock *sk, u32 seq, 
u32 end_seq)
tcp_sack_extend(tp->duplicate_sack, seq, end_seq);
 }
 
+static void tcp_rcv_spurious_retrans(struct sock *sk, const struct sk_buff 
*skb)
+{
+   /* When the ACK path fails or drops most ACKs, the sender would
+* timeout and spuriously retransmit the same segment repeatedly.
+* The receiver remembers and reflects via DSACKs. Leverage the
+* DSACK state and change the txhash to re-route speculatively.
+*/
+   if (TCP_SKB_CB(skb)->seq == tcp_sk(sk)->duplicate_sack[0].start_seq)
+   sk_rethink_txhash(sk);
+}
+
 static void tcp_send_dupack(struct sock *sk, const struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
@@ -4211,6 +4222,7 @@ static void tcp_send_dupack(struct sock *sk, const struct 
sk_buff *skb)
if (tcp_is_sack(tp) && sock_net(sk)->ipv4.sysctl_tcp_dsack) {
u32 end_seq = TCP_SKB_CB(skb)->end_seq;
 
+   tcp_rcv_spurious_retrans(sk, skb);
if (after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt))
end_seq = tp->rcv_nxt;
tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, end_seq);
@@ -4755,6 +4767,7 @@ static void tcp_data_queue(struct sock *sk, struct 
sk_buff *skb)
}
 
if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
+   tcp_rcv_spurious_retrans(sk, skb);
/* A retransmit, 2nd most common case.  Force an immediate ack. 
*/
NET_INC_STATS(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, 
TCP_SKB_CB(skb)->end_seq);
-- 
2.19.0.rc0.228.g281dcd1b4d0-goog

[PATCH bpf-next 0/3] bpf: implement percpu map pretty print for bpffs and bpftool

2018-08-29 Thread Yonghong Song

Commit a26ca7c982cb ("bpf: btf: Add pretty print support to the
basic arraymap") and Commit 699c86d6ec21 ("bpf: btf: add pretty print
for hash/lru_hash maps") added bpffs pretty print for array, hash and
lru hash maps. The pretty print gives users a structurally formatted
dump for keys/values which much easy to understand than raw bytes.

This patch set implemented bpffs pretty print support for
percpu arraymap, percpu hashmap and percpu lru hashmap.
For complex key/value types, the pretty print here is even more useful
due to
  . large volumne of data making it even harder to correlate bytes
to a particular field in a particular cpu.
  . kernel rounds the value size for each cpu to multiple of 8.
User has to be aware of this otherwise wrong value may be
derived from cpu 1/2/...

For example, we may have a bpffs pretty print like below:
   43602: {
cpu0: 
{43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
cpu1: 
{43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
cpu2: 
{43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
cpu3: 
{43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
   }
for a percpu map.

This patch also added percpu formatted print on bpftool. For example,
bpftool may print like below:
{
"key": 0,
"values": [{
"cpu": 0,
"value": {
"ui32": 0,
"ui16": 0,
}
},{
"cpu": 1,
"value": {
"ui32": 1,
"ui16": 0,
}
},{
"cpu": 2,
"value": {
"ui32": 2,
"ui16": 0,
}
},{
"cpu": 3,
"value": {
"ui32": 3,
"ui16": 0,
}
}
]
}

Patch #1 implemented bpffs pretty print for percpu arraymap/hash/lru_hash
in kernel. Patch #2 added the test case in tools bpf selftest test_btf.
Patch #3 added percpu map btf based dump.

Yonghong Song (3):
  bpf: add bpffs pretty print for percpu arraymap/hash/lru_hash
  tools/bpf: add bpffs percpu map pretty print tests in test_btf
  tools/bpf: bpftool: add btf percpu map formated dump

 kernel/bpf/arraymap.c  |  24 +
 kernel/bpf/hashtab.c   |  31 ++
 tools/bpf/bpftool/map.c|  33 +-
 tools/testing/selftests/bpf/test_btf.c | 179 ++---
 4 files changed, 230 insertions(+), 37 deletions(-)

-- 
2.14.3

[PATCH bpf-next 1/3] bpf: add bpffs pretty print for percpu arraymap/hash/lru_hash

2018-08-29 Thread Yonghong Song

Added bpffs pretty print for percpu arraymap, percpu hashmap
and percpu lru hashmap.

For each map  pair, the format is:
   : {
cpu0: 
cpu1: 
...
cpun: 
   }

For example, on my VM, there are 4 cpus, and
for test_btf test in the next patch:
   cat /sys/fs/bpf/pprint_test_percpu_hash

You may get:
   ...
   43602: {
cpu0: 
{43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
cpu1: 
{43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
cpu2: 
{43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
cpu3: 
{43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
   }
   72847: {
cpu0: 
{72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
cpu1: 
{72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
cpu2: 
{72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
cpu3: 
{72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
   }
   ...

Signed-off-by: Yonghong Song 
---
 kernel/bpf/arraymap.c | 24 
 kernel/bpf/hashtab.c  | 31 +++
 2 files changed, 55 insertions(+)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 0c17aab3ce5f..f9d24121be99 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -358,6 +358,29 @@ static void array_map_seq_show_elem(struct bpf_map *map, 
void *key,
rcu_read_unlock();
 }
 
+static void percpu_array_map_seq_show_elem(struct bpf_map *map, void *key,
+  struct seq_file *m)
+{
+   struct bpf_array *array = container_of(map, struct bpf_array, map);
+   u32 index = *(u32 *)key;
+   void __percpu *pptr;
+   int cpu;
+
+   rcu_read_lock();
+
+   seq_printf(m, "%u: {\n", *(u32 *)key);
+   pptr = array->pptrs[index & array->index_mask];
+   for_each_possible_cpu(cpu) {
+   seq_printf(m, "\tcpu%d: ", cpu);
+   btf_type_seq_show(map->btf, map->btf_value_type_id,
+ per_cpu_ptr(pptr, cpu), m);
+   seq_puts(m, "\n");
+   }
+   seq_puts(m, "}\n");
+
+   rcu_read_unlock();
+}
+
 static int array_map_check_btf(const struct bpf_map *map,
   const struct btf_type *key_type,
   const struct btf_type *value_type)
@@ -398,6 +421,7 @@ const struct bpf_map_ops percpu_array_map_ops = {
.map_lookup_elem = percpu_array_map_lookup_elem,
.map_update_elem = array_map_update_elem,
.map_delete_elem = array_map_delete_elem,
+   .map_seq_show_elem = percpu_array_map_seq_show_elem,
.map_check_btf = array_map_check_btf,
 };
 
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 03cc59ee9c95..2c1790288138 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1285,6 +1285,35 @@ int bpf_percpu_hash_update(struct bpf_map *map, void 
*key, void *value,
return ret;
 }
 
+static void htab_percpu_map_seq_show_elem(struct bpf_map *map, void *key,
+ struct seq_file *m)
+{
+   struct htab_elem *l;
+   void __percpu *pptr;
+   int cpu;
+
+   rcu_read_lock();
+
+   l = __htab_map_lookup_elem(map, key);
+   if (!l) {
+   rcu_read_unlock();
+   return;
+   }
+
+   btf_type_seq_show(map->btf, map->btf_key_type_id, key, m);
+   seq_puts(m, ": {\n");
+   pptr = htab_elem_get_ptr(l, map->key_size);
+   for_each_possible_cpu(cpu) {
+   seq_printf(m, "\tcpu%d: ", cpu);
+   btf_type_seq_show(map->btf, map->btf_value_type_id,
+ per_cpu_ptr(pptr, cpu), m);
+   seq_puts(m, "\n");
+   }
+   seq_puts(m, "}\n");
+
+   rcu_read_unlock();
+}
+
 const struct bpf_map_ops htab_percpu_map_ops = {
.map_alloc_check = htab_map_alloc_check,
.map_alloc = htab_map_alloc,
@@ -1293,6 +1322,7 @@ const struct bpf_map_ops htab_percpu_map_ops = {
.map_lookup_elem = htab_percpu_map_lookup_elem,
.map_update_elem = htab_percpu_map_update_elem,
.map_delete_elem = htab_map_delete_elem,
+   .map_seq_show_elem = htab_percpu_map_seq_show_elem,
 };
 
 const struct bpf_map_ops htab_lru_percpu_map_ops = {
@@ -1303,6 +1333,7 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = {
.map_lookup_elem = htab_lru_percpu_map_lookup_elem,
.map_update_elem = htab_lru_percpu_map_update_elem,
.map_delete_elem = htab_lru_map_delete_elem,
+   .map_seq_show_elem = htab_percpu_map_seq_show_elem,
 };
 
 static int fd_htab_map_alloc_check(union bpf_attr *attr)
-- 
2.14.3

[PATCH bpf-next 2/3] tools/bpf: add bpffs percpu map pretty print tests in test_btf

2018-08-29 Thread Yonghong Song

The bpf selftest test_btf is extended to test bpffs
percpu map pretty print for percpu array, percpu hash and
percpu lru hash.

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/test_btf.c | 179 ++---
 1 file changed, 144 insertions(+), 35 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_btf.c 
b/tools/testing/selftests/bpf/test_btf.c
index 6b5cfeb7a9cc..f42b3396d622 100644
--- a/tools/testing/selftests/bpf/test_btf.c
+++ b/tools/testing/selftests/bpf/test_btf.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -45,7 +46,6 @@ static int count_result(int err)
return err;
 }
 
-#define min(a, b) ((a) < (b) ? (a) : (b))
 #define __printf(a, b) __attribute__((format(printf, a, b)))
 
 __printf(1, 2)
@@ -130,6 +130,7 @@ struct btf_raw_test {
bool map_create_err;
bool ordered_map;
bool lossless_map;
+   bool percpu_map;
int hdr_len_delta;
int type_off_delta;
int str_off_delta;
@@ -2157,6 +2158,7 @@ static struct btf_pprint_test_meta {
const char *map_name;
bool ordered_map;
bool lossless_map;
+   bool percpu_map;
 } pprint_tests_meta[] = {
 {
.descr = "BTF pretty print array",
@@ -2164,6 +2166,7 @@ static struct btf_pprint_test_meta {
.map_name = "pprint_test_array",
.ordered_map = true,
.lossless_map = true,
+   .percpu_map = false,
 },
 
 {
@@ -2172,6 +2175,7 @@ static struct btf_pprint_test_meta {
.map_name = "pprint_test_hash",
.ordered_map = false,
.lossless_map = true,
+   .percpu_map = false,
 },
 
 {
@@ -2180,30 +2184,83 @@ static struct btf_pprint_test_meta {
.map_name = "pprint_test_lru_hash",
.ordered_map = false,
.lossless_map = false,
+   .percpu_map = false,
+},
+
+{
+   .descr = "BTF pretty print percpu array",
+   .map_type = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .map_name = "pprint_test_percpu_array",
+   .ordered_map = true,
+   .lossless_map = true,
+   .percpu_map = true,
+},
+
+{
+   .descr = "BTF pretty print percpu hash",
+   .map_type = BPF_MAP_TYPE_PERCPU_HASH,
+   .map_name = "pprint_test_percpu_hash",
+   .ordered_map = false,
+   .lossless_map = true,
+   .percpu_map = true,
+},
+
+{
+   .descr = "BTF pretty print lru percpu hash",
+   .map_type = BPF_MAP_TYPE_LRU_PERCPU_HASH,
+   .map_name = "pprint_test_lru_percpu_hash",
+   .ordered_map = false,
+   .lossless_map = false,
+   .percpu_map = true,
 },
 
 };
 
 
-static void set_pprint_mapv(struct pprint_mapv *v, uint32_t i)
+static void set_pprint_mapv(struct pprint_mapv *v, uint32_t i,
+   int num_cpus, int rounded_value_size)
 {
-   v->ui32 = i;
-   v->si32 = -i;
-   v->unused_bits2a = 3;
-   v->bits28 = i;
-   v->unused_bits2b = 3;
-   v->ui64 = i;
-   v->aenum = i & 0x03;
+   int cpu;
+
+   for (cpu = 0; cpu < num_cpus; cpu++) {
+   v->ui32 = i + cpu;
+   v->si32 = -i;
+   v->unused_bits2a = 3;
+   v->bits28 = i;
+   v->unused_bits2b = 3;
+   v->ui64 = i;
+   v->aenum = i & 0x03;
+   v = (void *)v + rounded_value_size;
+   }
 }
 
+static int check_line(const char *expected_line, int nexpected_line,
+ int expected_line_len, const char *line)
+{
+   if (CHECK(nexpected_line == expected_line_len,
+ "expected_line is too long"))
+   return -1;
+
+   if (strcmp(expected_line, line)) {
+   fprintf(stderr, "unexpected pprint output\n");
+   fprintf(stderr, "expected: %s", expected_line);
+   fprintf(stderr, "read: %s", line);
+   return -1;
+   }
+
+   return 0;
+}
+
+
 static int do_test_pprint(void)
 {
const struct btf_raw_test *test = _test_template;
struct bpf_create_map_attr create_attr = {};
+   bool ordered_map, lossless_map, percpu_map;
+   int err, ret, num_cpus, rounded_value_size;
+   struct pprint_mapv *mapv = NULL;
unsigned int key, nr_read_elems;
-   bool ordered_map, lossless_map;
int map_fd = -1, btf_fd = -1;
-   struct pprint_mapv mapv = {};
unsigned int raw_btf_size;
char expected_line[255];
FILE *pin_file = NULL;
@@ -2212,7 +2269,6 @@ static int do_test_pprint(void)
char *line = NULL;
uint8_t *raw_btf;
ssize_t nread;
-   int err, ret;
 
fprintf(stderr, "%s..", test->descr);
raw_btf = btf_raw_create(_tmpl, test->raw_types,
@@ -2261,9 +2317,18 @@ static int do_test_pprint(void)
if (CHECK(err, "bpf_obj_pin(%s): errno:%d.", pin_path, errno))
goto done;
 
+   percpu_map = test->percpu_map;
+   num_cpus = percpu_map ? bpf_num_possible_cpus() : 1;
+

[PATCH bpf-next 3/3] tools/bpf: bpftool: add btf percpu map formated dump

2018-08-29 Thread Yonghong Song

The btf pretty print is added to percpu arraymap,
percpu hashmap and percpu lru hashmap.
For each  pair, the following will be
added to plain/json output:

   {
   "key": ,
   "values": [{
 "cpu": 0,
 "value": 
  },{
 "cpu": 1,
 "value": 
  },{
  
  },{
 "cpu": n,
 "value": 
  }
   ]
   }

For example, the following could be part of plain or json formatted
output:
{
"key": 0,
"values": [{
"cpu": 0,
"value": {
"ui32": 0,
"ui16": 0,
}
},{
"cpu": 1,
"value": {
"ui32": 1,
"ui16": 0,
}
},{
"cpu": 2,
"value": {
"ui32": 2,
"ui16": 0,
}
},{
"cpu": 3,
"value": {
"ui32": 3,
"ui16": 0,
}
}
]
}

Signed-off-by: Yonghong Song 
---
 tools/bpf/bpftool/map.c | 33 +++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index b2ec20e562bd..df175bc33c5d 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -169,9 +169,28 @@ static int do_dump_btf(const struct btf_dumper *d,
if (ret)
goto err_end_obj;
 
-   jsonw_name(d->jw, "value");
+   if (!map_is_per_cpu(map_info->type)) {
+   jsonw_name(d->jw, "value");
+   ret = btf_dumper_type(d, map_info->btf_value_type_id, value);
+   } else {
+   unsigned int i, n, step;
 
-   ret = btf_dumper_type(d, map_info->btf_value_type_id, value);
+   jsonw_name(d->jw, "values");
+   jsonw_start_array(d->jw);
+   n = get_possible_cpus();
+   step = round_up(map_info->value_size, 8);
+   for (i = 0; i < n; i++) {
+   jsonw_start_object(d->jw);
+   jsonw_int_field(d->jw, "cpu", i);
+   jsonw_name(d->jw, "value");
+   ret = btf_dumper_type(d, map_info->btf_value_type_id,
+ value + i * step);
+   jsonw_end_object(d->jw);
+   if (ret)
+   break;
+   }
+   jsonw_end_array(d->jw);
+   }
 
 err_end_obj:
/* end of key-value pair */
@@ -298,6 +317,16 @@ static void print_entry_json(struct bpf_map_info *info, 
unsigned char *key,
jsonw_end_object(json_wtr);
}
jsonw_end_array(json_wtr);
+   if (btf) {
+   struct btf_dumper d = {
+   .btf = btf,
+   .jw = json_wtr,
+   .is_plain_text = false,
+   };
+
+   jsonw_name(json_wtr, "formatted");
+   do_dump_btf(, info, key, value);
+   }
}
 
jsonw_end_object(json_wtr);
-- 
2.14.3

Motorcycle Owners List

2018-08-29 Thread Marlene Royle




Hi,

Greeting of the day!

Would you are interested in acquiring an email list of "Motorcycle Owners"? 
from USA.

We also having data of Harley Davidson Owners, Car Owners List, BMW Owners 
List, Luxury Car Owners List, RV Owners, Pick Up Truck Owners, Boat Owners, RV 
Owners List, HNI,Travelers List and many more...

Each record we will provide you with: Contact (First and Last name), Mailing 
Address and Emails Address.

All the contacts are opt-in verified, complete permission based and can be used 
for unlimited multi-channel marketing.

Let me know if you'd be interested in hearing more about it.

Waiting for your valuable and sincere reply.


Best Regards,
Marlene Royle
Research Analyst

Re: GPL compliance issue with liquidio/lio_23xx_vsw.bin firmware

2018-08-29 Thread Ben Hutchings

On Mon, 2018-08-27 at 17:04 -0700, Felix Manlunas wrote:
> On Mon, Aug 27, 2018 at 05:01:10PM +0200, Florian Weimer wrote:
> > liquidio/lio_23xx_vsw.bin contains a compiled MIPS Linux kernel:
> > 
> > $ tail --bytes=+1313 liquidio/lio_23xx_vsw.bin > elf
> > $ readelf -aW elf
> > […]
> >   [ 6] __ksymtab PROGBITS80e495f8 64a5f8 00d130
> > 00   A  0   0  8
> >   [ 7] __ksymtab_gpl PROGBITS80e56728 657728 008400
> > 00   A  0   0  8
> >   [ 8] __ksymtab_strings PROGBITS80e5eb28 65fb28 018868
> > 00   A  0   0  1
> > […]
> > Symbol table '.symtab' contains 1349 entries:
> >Num:Value  Size TypeBind   Vis  Ndx Name
> >  0:  0 NOTYPE  LOCAL  DEFAULT  UND
> >  1:  0 FILELOCAL  DEFAULT  ABS
> > arch/mips/kernel/head.o
> >  2:  0 FILELOCAL  DEFAULT  ABS init/main.c
> >  3:  0 FILELOCAL  DEFAULT  ABS
> > include/linux/types.h
> > […]
> > 
> > Yet there is no corresponding source provided, and LICENCE.cavium lacks
> > the required notices.
> > 
> > Thanks,
> > Florian
> 
> Cavium apologizes for the oversight.  Cavium has been advertising the
> appropriate license terms including the existence of GPL in the firmware
> in our outbox releases. We will update the license terms in LICENCE.cavium
> in our upstream contribution in collaboration with our legal team.

Everything added to linux-firmware.git needs to be safe for Linux
distributions to redistribute.  (There are some ancient firmware images
with unclear licensing, but we don't want to add any more.)

My understanding is that GPL v2 requires that commercial distributors
either provide the complete and corresponding source alongside the
binaries, or include a written offer to provide it later.  They
*cannot* simply point to your offer to do this (only non-commercial
distributors can do that).  So the source needs to be published too.

Adding the complete Linux kernel source code to linux-firmware.git
doesn't seem like a sensible step, so maybe this particular firmware
needs to live elsewhere.

Ben.

-- 
Ben Hutchings
For every complex problem
there is a solution that is simple, neat, and wrong.

signature.asc
Description: This is a digitally signed message part

Wine Enthusiasts List

2018-08-29 Thread Julie Wider




Hi,

Greeting of the day!

Would you be interested in reaching out to "Wine Enthusiasts list" from USA?

Our Databases:- 1.Beer Enthusiasts List2.Alcohol Enthusiasts List
3.Beverage Consumers   4.Liquor Enthusiasts List
5.Chocolate Enthusiasts List   6.Students List
7.Luxury Brand Buyers List 8.Food Lovers List
9.Spa and Resort Visitors  10.Sports Enthusiasts List 
and many more.

Each record in the list contains Contact Name (First and Last Name), Mailing 
Address, List type and Opt-in email address.

All the contacts are opt-in verified, complete permission based and can be used 
for unlimited multi-channel marketing.

Let me know if you'd be interested in hearing more about it.

Waiting for your valuable and sincere reply.

Best Regards,
Julie Wider
Research Analyst

Re: Fw: [Bug 200943] New: Repeating tcp_mark_head_lost in dmesg

2018-08-29 Thread Yuchung Cheng

On Wed, Aug 29, 2018 at 8:02 AM, Stephen Hemminger
 wrote:
>
>
>
> Begin forwarded message:
>
> Date: Sun, 26 Aug 2018 22:24:12 +
> From: bugzilla-dae...@bugzilla.kernel.org
> To: step...@networkplumber.org
> Subject: [Bug 200943] New: Repeating tcp_mark_head_lost in dmesg
>
>
> https://bugzilla.kernel.org/show_bug.cgi?id=200943
>
> Bug ID: 200943
>Summary: Repeating tcp_mark_head_lost in dmesg
>Product: Networking
>Version: 2.5
> Kernel Version: 4.14.66
>   Hardware: All
> OS: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: IPV4
>   Assignee: step...@networkplumber.org
>   Reporter: rm+...@romanrm.net
> Regression: No
>
> Getting a bunch of these now every hour during continuous ~100 Mbit of network
> traffic.
> What's up with that? Seems harmless, as in the kernel doesn't crash and the
> network connection is not interrupted. (Maybe the particular TCP session is?)
> If there are no ill-effects from this condition, is such spammy WARN_ON really
> necessary?
This warning is likely triggered by buggy remote SACK behaviors, and
is pretty harmless - in my opinion the warning tcp_verify_left_out()
is still worthy to detect other inflight states inconsistencies.

The good news the particular loss recovery code path is disabled by
default on 4.18+ kernels by this patch

commit b38a51fec1c1f693f03b1aa19d0622123634d4b7
Author: Yuchung Cheng 
Date:   Wed May 16 16:40:11 2018 -0700

tcp: disable RFC6675 loss detection


>
> [Mon Aug 27 02:16:11 2018] [ cut here ]
> [Mon Aug 27 02:16:11 2018] WARNING: CPU: 5 PID: 0 at net/ipv4/tcp_input.c:2263
> tcp_mark_head_lost+0x247/0x260
> [Mon Aug 27 02:16:11 2018] Modules linked in: dm_snapshot loop vhost_net vhost
> tap tun ip6t_MASQUERADE nf_nat_masquerade_ipv6 ipt_MASQUERADE
> nf_nat_masquerade_ipv4 xt_DSCP xt_mark ip6t_REJECT nf_reject_ipv6 ipt_REJECT
> nf_reject_ipv4 xt_owner xt_tcpudp xt_set ip_set_hash_net ip_set nfnetlink
> xt_limit xt_length xt_multiport xt_conntrack ip6t_rpfilter ipt_rpfilter
> ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw
> wireguard ip6_udp_tunnel udp_tunnel ip6table_mangle iptable_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw
> iptable_mangle ip6table_filter ip6_tables matroxfb_base matroxfb_g450
> matroxfb_Ti3026 matroxfb_accel matroxfb_DAC1064 g450_pll matroxfb_misc
> iptable_filter ip_tables x_tables cpufreq_powersave cpufreq_userspace
> cpufreq_conservative 8021q garp mrp
> [Mon Aug 27 02:16:11 2018]  bridge stp llc bonding tcp_bbr sch_fq tcp_illinois
> fuse radeon ttm drm_kms_helper drm i2c_algo_bit it87 hwmon_vid eeepc_wmi
> asus_wmi sparse_keymap rfkill video wmi_bmof mxm_wmi edac_mce_amd kvm_amd kvm
> snd_pcm snd_timer snd soundcore joydev evdev pcspkr k10temp fam15h_power
> sp5100_tco sg shpchp wmi pcc_cpufreq acpi_cpufreq button ext4 crc16 mbcache
> jbd2 fscrypto btrfs zstd_decompress zstd_compress xxhash algif_skcipher af_alg
> dm_crypt dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_mod
> hid_generic usbhid hid raid10 raid456 async_raid6_recov async_memcpy async_pq
> async_xor async_tx xor sd_mod raid6_pq libcrc32c crc32c_generic raid1 raid0
> multipath linear md_mod vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio
> ohci_pci crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
> [Mon Aug 27 02:16:11 2018]  pcbc aesni_intel aes_x86_64 crypto_simd 
> glue_helper
> cryptd r8169 ahci xhci_pci libahci ohci_hcd ehci_pci mii xhci_hcd ehci_hcd
> i2c_piix4 libata usbcore scsi_mod bnx2
> [Mon Aug 27 02:16:11 2018] CPU: 5 PID: 0 Comm: swapper/5 Tainted: GW
>4.14.66-rm1+ #132
> [Mon Aug 27 02:16:11 2018] Hardware name: To be filled by O.E.M. To be filled
> by O.E.M./SABERTOOTH 990FX R2.0, BIOS 2901 05/04/2016
> [Mon Aug 27 02:16:11 2018] task: 8ba79c679dc0 task.stack: b4d741928000
> [Mon Aug 27 02:16:11 2018] RIP: 0010:tcp_mark_head_lost+0x247/0x260
> [Mon Aug 27 02:16:11 2018] RSP: 0018:8ba7aed437d8 EFLAGS: 00010202
> [Mon Aug 27 02:16:11 2018] RAX: 0018 RBX: 8ba3901a0800 RCX:
> 
> [Mon Aug 27 02:16:11 2018] RDX: 0017 RSI: 0001 RDI:
> 8ba4d47e9000
> [Mon Aug 27 02:16:11 2018] RBP: 8ba4d47e9000 R08: 000d R09:
> 
> [Mon Aug 27 02:16:11 2018] R10: 100c R11:  R12:
> 0001
> [Mon Aug 27 02:16:11 2018] R13: 8ba4d47e9158 R14: 0001 R15:
> 9d0b6708
> [Mon Aug 27 02:16:11 2018] FS:  ()
> GS:8ba7aed4() knlGS:
> [Mon Aug 27 02:16:11 2018] CS:  0010 DS:  ES:  CR0: 80050033
> [Mon Aug 27 02:16:11 2018] CR2: 01fbdff0 CR3: 00040c8d6000 CR4:
> 000406e0
> [Mon Aug 27 02:16:11 2018] Call Trace:
> [Mon

[PATCH net] nfp: wait for posted reconfigs when disabling the device

2018-08-29 Thread Jakub Kicinski

To avoid leaking a running timer we need to wait for the
posted reconfigs after netdev is unregistered.  In common
case the process of deinitializing the device will perform
synchronous reconfigs which wait for posted requests, but
especially with VXLAN ports being actively added and removed
there can be a race condition leaving a timer running after
adapter structure is freed leading to a crash.

Add an explicit flush after deregistering and for a good
measure a warning to check if timer is running just before
structures are freed.

Fixes: 3d780b926a12 ("nfp: add async reconfiguration mechanism")
Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 .../ethernet/netronome/nfp/nfp_net_common.c   | 48 +--
 1 file changed, 33 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index a8b9fbab5f73..253bdaef1505 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -229,29 +229,16 @@ static void nfp_net_reconfig_post(struct nfp_net *nn, u32 
update)
spin_unlock_bh(>reconfig_lock);
 }
 
-/**
- * nfp_net_reconfig() - Reconfigure the firmware
- * @nn:  NFP Net device to reconfigure
- * @update:  The value for the update field in the BAR config
- *
- * Write the update word to the BAR and ping the reconfig queue.  The
- * poll until the firmware has acknowledged the update by zeroing the
- * update word.
- *
- * Return: Negative errno on error, 0 on success
- */
-int nfp_net_reconfig(struct nfp_net *nn, u32 update)
+static void nfp_net_reconfig_sync_enter(struct nfp_net *nn)
 {
bool cancelled_timer = false;
u32 pre_posted_requests;
-   int ret;
 
spin_lock_bh(>reconfig_lock);
 
nn->reconfig_sync_present = true;
 
if (nn->reconfig_timer_active) {
-   del_timer(>reconfig_timer);
nn->reconfig_timer_active = false;
cancelled_timer = true;
}
@@ -260,14 +247,43 @@ int nfp_net_reconfig(struct nfp_net *nn, u32 update)
 
spin_unlock_bh(>reconfig_lock);
 
-   if (cancelled_timer)
+   if (cancelled_timer) {
+   del_timer_sync(>reconfig_timer);
nfp_net_reconfig_wait(nn, nn->reconfig_timer.expires);
+   }
 
/* Run the posted reconfigs which were issued before we started */
if (pre_posted_requests) {
nfp_net_reconfig_start(nn, pre_posted_requests);
nfp_net_reconfig_wait(nn, jiffies + HZ * NFP_NET_POLL_TIMEOUT);
}
+}
+
+static void nfp_net_reconfig_wait_posted(struct nfp_net *nn)
+{
+   nfp_net_reconfig_sync_enter(nn);
+
+   spin_lock_bh(>reconfig_lock);
+   nn->reconfig_sync_present = false;
+   spin_unlock_bh(>reconfig_lock);
+}
+
+/**
+ * nfp_net_reconfig() - Reconfigure the firmware
+ * @nn:  NFP Net device to reconfigure
+ * @update:  The value for the update field in the BAR config
+ *
+ * Write the update word to the BAR and ping the reconfig queue.  The
+ * poll until the firmware has acknowledged the update by zeroing the
+ * update word.
+ *
+ * Return: Negative errno on error, 0 on success
+ */
+int nfp_net_reconfig(struct nfp_net *nn, u32 update)
+{
+   int ret;
+
+   nfp_net_reconfig_sync_enter(nn);
 
nfp_net_reconfig_start(nn, update);
ret = nfp_net_reconfig_wait(nn, jiffies + HZ * NFP_NET_POLL_TIMEOUT);
@@ -3633,6 +3649,7 @@ struct nfp_net *nfp_net_alloc(struct pci_dev *pdev, bool 
needs_netdev,
  */
 void nfp_net_free(struct nfp_net *nn)
 {
+   WARN_ON(timer_pending(>reconfig_timer) || nn->reconfig_posted);
if (nn->dp.netdev)
free_netdev(nn->dp.netdev);
else
@@ -3920,4 +3937,5 @@ void nfp_net_clean(struct nfp_net *nn)
return;
 
unregister_netdev(nn->dp.netdev);
+   nfp_net_reconfig_wait_posted(nn);
 }
-- 
2.17.1

Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e

2018-08-29 Thread Alexei Starovoitov

On Tue, Aug 28, 2018 at 02:44:24PM +0200, Björn Töpel wrote:
> From: Björn Töpel 
> 
> This patch set introduces zero-copy AF_XDP support for Intel's i40e
> driver. In the first preparatory patch we also add support for
> XDP_REDIRECT for zero-copy allocated frames so that XDP programs can
> redirect them. This was a ToDo from the first AF_XDP zero-copy patch
> set from early June. Special thanks to Alex Duyck and Jesper Dangaard
> Brouer for reviewing earlier versions of this patch set.
> 
> The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note
> that in the interest of time, to get an AF_XDP zero-copy implementation
> out there for people to try, some code paths have been copied from the
> XDP path to the zero-copy path. It is out goal to merge the two paths
> in later patch sets.
> 
> In contrast to the implementation from beginning of June, this patch
> set does not require any extra HW queues for AF_XDP zero-copy
> TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and
> AF_XDP zero-copy TX.
> 
> Jeff, given that most of changes are in i40e, it is up to you how you
> would like to route these patches. The set is tagged bpf-next, but
> if taking it via the Intel driver tree is easier, let us know.
> 
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> NIC is Intel I40E 40Gbit/s using the i40e driver.
> 
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by a commercial packet generator HW
> outputing packets at full 40 Gbit/s line rate. The results are with
> retpoline and all other spectre and meltdown fixes, so these results
> are not comparable to the ones from the zero-copy patch set in June.
> 
> AF_XDP performance 64 byte packets.
> Benchmark   XDP_SKBXDP_DRVXDP_DRV with zerocopy
> rxdrop   2.68.2 15.0
> txpush   2.2-   21.9
> l2fwd1.72.3 11.3
> 
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV XDP_DRV with zerocopy
> rxdrop   2.03.3 3.3
> l2fwd1.31.7 3.1
> 
> XDP performance on our system as a base line:
> 
> 64 byte packets:
> XDP stats   CPU pps issue-pps
> XDP-RX CPU  16  18.4M  0
> 
> 1500 byte packets:
> XDP stats   CPU pps issue-pps
> XDP-RX CPU  16  3.3M0
> 
> The structure of the patch set is as follows:
> 
> Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames
> Patches 2-4: Preparatory patches to common xsk and net code
> Patches 5-7: Preparatory patches to i40e driver code for RX
> Patch 8: i40e zero-copy support for RX
> Patch 9: Preparatory patch to i40e driver code for TX
> Patch 10: i40e zero-copy support for TX
> Patch 11: Add flags to sample application to force zero-copy/copy mode
> 
> We based this patch set on bpf-next commit 050cdc6c9501 ("Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
> 
> 
> Magnus & Björn
> 
> Björn Töpel (8):
>   xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
>   xdp: export xdp_rxq_info_unreg_mem_model
>   xsk: expose xdp_umem_get_{data,dma} to drivers
>   i40e: added queue pair disable/enable functions
>   i40e: refactor Rx path for re-use
>   i40e: move common Rx functions to i40e_txrx_common.h
>   i40e: add AF_XDP zero-copy Rx support
>   samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
> 
> Magnus Karlsson (3):
>   net: add napi_if_scheduled_mark_missed
>   i40e: move common Tx functions to i40e_txrx_common.h
>   i40e: add AF_XDP zero-copy Tx support

Applied to bpf-next.
Thanks a lot to the authors, code reviewers and testers.

I hope all the issues brought up during code review can be quickly
addressed in the follow up patches and zerocopy feature in i40e
and other drivers will be mainline ready for the next merge window.

Re: Kernel warnings from tcp.c

2018-08-29 Thread Stephen Hemminger

On Wed, 29 Aug 2018 14:38:38 -0400
Adam Mitchell  wrote:

> Anyone with experience in tcp.c have an idea what's causing this on
> our busy database server?  Comes from this macro at line 2278:
>  WARN_ON(sock_owned_by_user(sk));
> 
> 
> [726780.788201] WARNING: CPU: 15 PID: 52245 at net/ipv4/tcp.c:2278
> tcp_close+0x40f/0x430
> 
> [726780.794947] Modules linked in: binfmt_misc nf_conntrack_netlink
> nfnetlink_queue tcp_diag inet_diag isofs ip6table_mangle ip6table_raw
> ip6table_nat nf_nat_ipv6 iptable_security xt_CT iptable_raw
> iptable_nat nf_nat_ipv4 nf_nat iptable_mangle xt_pkttype xt_NFLOG
> nfnetlink_log xt_u32 xt_multiport xt_set xt_conntrack
> ip_set_hash_netport ip_set_hash_ipport ip_set_hash_net ip_set_hash_ip
> ip_set nfnetlink nf_conntrack_proto_gre nf_conntrack_ipv6
> nf_defrag_ipv6 ip6table_filter ip6_tables xt_LOG nf_conntrack_tftp
> nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack
> iptable_filter zfs(PO) zunicode(PO) zavl(PO) icp(PO) sb_edac
> intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
> pcbc aesni_intel crypto_simd glue_helper cryptd cirrus snd_seq
> zcommon(PO) ttm intel_rapl_perf snd_seq_device
> [726780.848696]  drm_kms_helper znvpair(PO) snd_pcm snd_timer spl(O)
> drm snd soundcore syscopyarea sysfillrect pcspkr sysimgblt input_leds
> fb_sys_fops i2c_piix4 ip_tables xfs libcrc32c ata_generic pata_acpi
> ata_piix xen_blkfront crc32c_intel libata ena(O) serio_raw floppy
> sunrpc
> [726780.863916] CPU: 15 PID: 52245 Comm: mysqld Tainted: PW  O
> 4.16.13-1.el7.elrepo.x86_64 #1
>

Sorry, you have a tainted kernel with modules that are not upstream
therefore don't expect any support from kernel community

Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support

2018-08-29 Thread Alexei Starovoitov

On Tue, Aug 28, 2018 at 02:44:32PM +0200, Björn Töpel wrote:
> From: Björn Töpel 
> 
> This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
> allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
> allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
> queue.
> 
> All AF_XDP specific functions are added to a new file, i40e_xsk.c.
> 
> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
> will allocate a new buffer and copy the zero-copy frame prior passing
> it to the kernel stack.
> 
> Signed-off-by: Björn Töpel 
...
> +/**
> + * i40e_construct_skb_zc - Create skbufff from zero-copy Rx buffer

typo sk_buff

overall the set looks great to me. The driver side changes look
particularly clean and independent from skb path of the driver.
I think the issues brought up by other reviews are not blockers
for the set, but certainly should be addressed in the follow up patches.

[RFC] net: xsk: add a simple buffer reuse queue

2018-08-29 Thread Jakub Kicinski

XSK UMEM is strongly single producer single consumer so reuse of
frames is challenging.  Add a simple "stash" of FILL packets to
reuse for drivers to optionally make use of.  This is useful
when driver has to free (ndo_stop) or resize a ring with an active
AF_XDP ZC socket.

Signed-off-by: Jakub Kicinski 
---
 include/net/xdp_sock.h | 44 +
 net/xdp/xdp_umem.c |  2 ++
 net/xdp/xsk_queue.c| 56 ++
 net/xdp/xsk_queue.h|  3 +++
 4 files changed, 105 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 6871e4755975..108c1c100de4 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -14,6 +14,7 @@
 #include 
 
 struct net_device;
+struct xdp_umem_fq_reuse;
 struct xsk_queue;
 
 struct xdp_umem_props {
@@ -41,6 +42,7 @@ struct xdp_umem {
struct page **pgs;
u32 npgs;
struct net_device *dev;
+   struct xdp_umem_fq_reuse *fq_reuse;
u16 queue_id;
bool zc;
spinlock_t xsk_list_lock;
@@ -110,4 +112,46 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem 
*umem, u64 addr)
return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
 }
 
+struct xdp_umem_fq_reuse {
+   u32 nentries;
+   u32 length;
+   u64 handles[];
+};
+
+/* Following functions are not thread-safe in any way */
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries);
+struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
+ struct xdp_umem_fq_reuse *newq);
+void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
+
+/* Reuse-queue aware version of FILL queue helpers */
+static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   if (!rq->length) {
+   return xsk_umem_peek_addr(umem, addr);
+   } else {
+   *addr = rq->handles[rq->length - 1];
+   return addr;
+   }
+}
+
+static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   if (!rq->length)
+   xsk_umem_discard_addr(umem);
+   else
+   rq->length--;
+}
+
+static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   rq->handles[rq->length++] = addr;
+}
+
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index e762310c9bee..40303e24c954 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -170,6 +170,8 @@ static void xdp_umem_release(struct xdp_umem *umem)
umem->cq = NULL;
}
 
+   xsk_reuseq_destroy(umem);
+
xdp_umem_unpin_pages(umem);
 
task = get_pid_task(umem->pid, PIDTYPE_PID);
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index 6c32e92e98fc..f9ee40a13a9a 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -3,7 +3,9 @@
  * Copyright(c) 2018 Intel Corporation.
  */
 
+#include 
 #include 
+#include 
 
 #include "xsk_queue.h"
 
@@ -61,3 +63,57 @@ void xskq_destroy(struct xsk_queue *q)
page_frag_free(q->ring);
kfree(q);
 }
+
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries)
+{
+   struct xdp_umem_fq_reuse *newq;
+
+   /* Check for overflow */
+   if (nentries > (u32)roundup_pow_of_two(nentries))
+   return NULL;
+   nentries = roundup_pow_of_two(nentries);
+
+   newq = kvmalloc(struct_size(newq, handles, nentries), GFP_KERNEL);
+   if (!newq)
+   return NULL;
+   memset(newq, 0, offsetof(typeof(*newq), handles));
+
+   newq->nentries = nentries;
+   return newq;
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_prepare);
+
+struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
+ struct xdp_umem_fq_reuse *newq)
+{
+   struct xdp_umem_fq_reuse *oldq = umem->fq_reuse;
+
+   if (!oldq) {
+   umem->fq_reuse = newq;
+   return NULL;
+   }
+
+   if (newq->nentries < oldq->length)
+   return newq;
+
+
+   memcpy(newq->handles, oldq->handles,
+  array_size(oldq->length, sizeof(u64)));
+   newq->length = oldq->length;
+
+   umem->fq_reuse = newq;
+   return oldq;
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_swap);
+
+void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq)
+{
+   kvfree(rq);
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_free);
+
+void xsk_reuseq_destroy(struct xdp_umem *umem)
+{
+   xsk_reuseq_free(umem->fq_reuse);
+   umem->fq_reuse = NULL;
+}
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 8a64b150be54..7a480e3eb35d 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -257,4 +257,7 @@ void xskq_set_umem(struct xsk_queue *q, struct 
xdp_umem_props *umem_props);
 struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);

Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support

2018-08-29 Thread Jakub Kicinski

On Tue, 28 Aug 2018 14:44:32 +0200, Björn Töpel wrote:
> From: Björn Töpel 
> 
> This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
> allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
> allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
> queue.
> 
> All AF_XDP specific functions are added to a new file, i40e_xsk.c.
> 
> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
> will allocate a new buffer and copy the zero-copy frame prior passing
> it to the kernel stack.
> 
> Signed-off-by: Björn Töpel 

Mm.. I'm surprised you don't run into buffer reuse issues that I had
when playing with AF_XDP.  What happens in i40e if someone downs the
interface?  Will UMEMs get destroyed?  Will the RX buffers get freed?

I'll shortly send an RFC with my quick and dirty RX buffer reuse queue,
FWIW.

pull-request: bpf 2018-08-29

2018-08-29 Thread Daniel Borkmann

Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a build error in sk_reuseport_convert_ctx_access() when
   compiling with clang which cannot resolve hweight_long() at
   build time inside the BUILD_BUG_ON() assertion, from Stefan.

2) Several fixes for BPF sockmap, four of them in getting the
   bpf_msg_pull_data() helper to work, one use after free case
   in bpf_tcp_close() and one refcount leak in bpf_tcp_recvmsg(),
   from Daniel.

3) Another fix for BPF sockmap where we misaccount sk_mem_uncharge()
   in the socket redirect error case from unwinding scatterlist
   twice, from John.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit 53ae914d898e5dd5984d352d5fa0b23410f966a0:

  net/rds: Use rdma_read_gids to get connection SGID/DGID in IPv6 (2018-08-27 
15:26:01 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to d65e6c80c6bb72ced46ce90dea4016d913a8ddd4:

  Merge branch 'bpf_msg_pull_data-fixes' (2018-08-29 10:47:18 -0700)


Alexei Starovoitov (1):
  Merge branch 'bpf_msg_pull_data-fixes'

Daniel Borkmann (6):
  bpf, sockmap: fix potential use after free in bpf_tcp_close
  bpf, sockmap: fix psock refcount leak in bpf_tcp_recvmsg
  bpf: fix several offset tests in bpf_msg_pull_data
  bpf: fix msg->data/data_end after sg shift repair in bpf_msg_pull_data
  bpf: fix shift upon scatterlist ring wrap-around in bpf_msg_pull_data
  bpf: fix sg shift repair start offset in bpf_msg_pull_data

John Fastabend (1):
  bpf: sockmap, decrement copied count correctly in redirect error case

Stefan Agner (1):
  bpf: fix build error with clang

 kernel/bpf/sockmap.c | 52 +---
 net/core/filter.c| 52 +---
 2 files changed, 54 insertions(+), 50 deletions(-)

Re: bpfilter causes a leftover kernel process

2018-08-29 Thread Olivier Brunel

On Tue, 28 Aug 2018 22:35:38 -0700
Alexei Starovoitov  wrote:

> > Yeah, I have a similar thing happening on shutdown, except that
> > we're talking about a kernel thread here, so that process is
> > ignored by the mentionned killing spree as a result, thus leaving
> > that process running.  
> 
> it's not a kernel thread and sounds like there is a bug in your pid 1
> that is worth fixing.

Well, it sure does look like one. By which I mean that looking at
its /proc/$PID entry, one can see that it has an empty command line and
kthreadd as parent (ppid 2), which usually are only true for kernel
threads (unless I'm mistaken).

The tool I'm using to send the signals on shutdown does indeed not use
kill(-1,sig) but instead scans /proc and determines whether or not to
signal a process (thus allowing to ignore a few specific processes to
be killed later on).

Kernel threads are obviously to be skipped, and to identify them it
uses the classic method of checking whether or not it has an empty
command line. As I said, this bpfilter helper does and that's why it is
considered a kernel thread & thus not killed.

This is also what other tools do, like ps or top, which is indeed why
they also show it as a kernel thread ("[none]").

So is this way of doing this wrong? And if so, what would be a better
way to identify kernel threads?

Out of curiosity I also had a look at how systemd does things, and it
looks to me like it does the exact same thing[1], also identifying
kernel threads by their empty command line. So I would think that a
similar issue could be observed under systemd as well.

Thanks.

[1]
https://github.com/systemd/systemd/blob/master/src/core/killall.c#L44

[PATCH net] Revert "packet: switch kvzalloc to allocate memory"

2018-08-29 Thread Eric Dumazet

This reverts commit 71e41286203c017d24f041a7cd71abea7ca7b1e0.

mmap()/munmap() can not be backed by kmalloced pages :

We fault in :

VM_BUG_ON_PAGE(PageSlab(page), page);

unmap_single_vma+0x8a/0x110
unmap_vmas+0x4b/0x90
unmap_region+0xc9/0x140
do_munmap+0x274/0x360
vm_munmap+0x81/0xc0
SyS_munmap+0x2b/0x40
do_syscall_64+0x13e/0x1c0
entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: 71e41286203c ("packet: switch kvzalloc to allocate memory")
Signed-off-by: Eric Dumazet 
Reported-by: John Sperbeck 
Bisected-by: John Sperbeck 
Cc: Zhang Yu 
Cc: Li RongQing 
---
 net/packet/af_packet.c | 44 +-
 net/packet/internal.h  |  1 +
 2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 
5610061e7f2e00b935ce44dd9cf82d10eb77a7bf..75c92a87e7b2481141161c8945f5e7eef8e0abf8
 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -4137,36 +4137,52 @@ static const struct vm_operations_struct 
packet_mmap_ops = {
.close  =   packet_mm_close,
 };
 
-static void free_pg_vec(struct pgv *pg_vec, unsigned int len)
+static void free_pg_vec(struct pgv *pg_vec, unsigned int order,
+   unsigned int len)
 {
int i;
 
for (i = 0; i < len; i++) {
if (likely(pg_vec[i].buffer)) {
-   kvfree(pg_vec[i].buffer);
+   if (is_vmalloc_addr(pg_vec[i].buffer))
+   vfree(pg_vec[i].buffer);
+   else
+   free_pages((unsigned long)pg_vec[i].buffer,
+  order);
pg_vec[i].buffer = NULL;
}
}
kfree(pg_vec);
 }
 
-static char *alloc_one_pg_vec_page(unsigned long size)
+static char *alloc_one_pg_vec_page(unsigned long order)
 {
char *buffer;
+   gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP |
+ __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
 
-   buffer = kvzalloc(size, GFP_KERNEL);
+   buffer = (char *) __get_free_pages(gfp_flags, order);
if (buffer)
return buffer;
 
-   buffer = kvzalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+   /* __get_free_pages failed, fall back to vmalloc */
+   buffer = vzalloc(array_size((1 << order), PAGE_SIZE));
+   if (buffer)
+   return buffer;
+
+   /* vmalloc failed, lets dig into swap here */
+   gfp_flags &= ~__GFP_NORETRY;
+   buffer = (char *) __get_free_pages(gfp_flags, order);
+   if (buffer)
+   return buffer;
 
-   return buffer;
+   /* complete and utter failure */
+   return NULL;
 }
 
-static struct pgv *alloc_pg_vec(struct tpacket_req *req)
+static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
 {
unsigned int block_nr = req->tp_block_nr;
-   unsigned long size = req->tp_block_size;
struct pgv *pg_vec;
int i;
 
@@ -4175,7 +4191,7 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req)
goto out;
 
for (i = 0; i < block_nr; i++) {
-   pg_vec[i].buffer = alloc_one_pg_vec_page(size);
+   pg_vec[i].buffer = alloc_one_pg_vec_page(order);
if (unlikely(!pg_vec[i].buffer))
goto out_free_pgvec;
}
@@ -4184,7 +4200,7 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req)
return pg_vec;
 
 out_free_pgvec:
-   free_pg_vec(pg_vec, block_nr);
+   free_pg_vec(pg_vec, order, block_nr);
pg_vec = NULL;
goto out;
 }
@@ -4194,9 +4210,9 @@ static int packet_set_ring(struct sock *sk, union 
tpacket_req_u *req_u,
 {
struct pgv *pg_vec = NULL;
struct packet_sock *po = pkt_sk(sk);
+   int was_running, order = 0;
struct packet_ring_buffer *rb;
struct sk_buff_head *rb_queue;
-   int was_running;
__be16 num;
int err = -EINVAL;
/* Added to avoid minimal code churn */
@@ -4258,7 +4274,8 @@ static int packet_set_ring(struct sock *sk, union 
tpacket_req_u *req_u,
goto out;
 
err = -ENOMEM;
-   pg_vec = alloc_pg_vec(req);
+   order = get_order(req->tp_block_size);
+   pg_vec = alloc_pg_vec(req, order);
if (unlikely(!pg_vec))
goto out;
switch (po->tp_version) {
@@ -4312,6 +4329,7 @@ static int packet_set_ring(struct sock *sk, union 
tpacket_req_u *req_u,
rb->frame_size = req->tp_frame_size;
spin_unlock_bh(_queue->lock);
 
+   swap(rb->pg_vec_order, order);
swap(rb->pg_vec_len, req->tp_block_nr);
 
rb->pg_vec_pages = req->tp_block_size/PAGE_SIZE;
@@ -4337,7 +4355,7 @@ static int packet_set_ring(struct sock *sk, union 
tpacket_req_u *req_u,
}
 
if

Kernel warnings from tcp.c

2018-08-29 Thread Adam Mitchell

Anyone with experience in tcp.c have an idea what's causing this on
our busy database server?  Comes from this macro at line 2278:
 WARN_ON(sock_owned_by_user(sk));


[726780.788201] WARNING: CPU: 15 PID: 52245 at net/ipv4/tcp.c:2278
tcp_close+0x40f/0x430

[726780.794947] Modules linked in: binfmt_misc nf_conntrack_netlink
nfnetlink_queue tcp_diag inet_diag isofs ip6table_mangle ip6table_raw
ip6table_nat nf_nat_ipv6 iptable_security xt_CT iptable_raw
iptable_nat nf_nat_ipv4 nf_nat iptable_mangle xt_pkttype xt_NFLOG
nfnetlink_log xt_u32 xt_multiport xt_set xt_conntrack
ip_set_hash_netport ip_set_hash_ipport ip_set_hash_net ip_set_hash_ip
ip_set nfnetlink nf_conntrack_proto_gre nf_conntrack_ipv6
nf_defrag_ipv6 ip6table_filter ip6_tables xt_LOG nf_conntrack_tftp
nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack
iptable_filter zfs(PO) zunicode(PO) zavl(PO) icp(PO) sb_edac
intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
pcbc aesni_intel crypto_simd glue_helper cryptd cirrus snd_seq
zcommon(PO) ttm intel_rapl_perf snd_seq_device
[726780.848696]  drm_kms_helper znvpair(PO) snd_pcm snd_timer spl(O)
drm snd soundcore syscopyarea sysfillrect pcspkr sysimgblt input_leds
fb_sys_fops i2c_piix4 ip_tables xfs libcrc32c ata_generic pata_acpi
ata_piix xen_blkfront crc32c_intel libata ena(O) serio_raw floppy
sunrpc
[726780.863916] CPU: 15 PID: 52245 Comm: mysqld Tainted: PW  O
4.16.13-1.el7.elrepo.x86_64 #1
[726780.869486] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[726780.873756] RIP: 0010:tcp_close+0x40f/0x430
[726780.877049] RSP: 0018:c90063a87dd0 EFLAGS: 00010202
[726780.880913] RAX: 0001 RBX: 8839cbb4b300 RCX:
0001
[726780.885708] RDX: 0041 RSI: 00023540 RDI:
002b
[726780.890412] RBP: c90063a87df0 R08:  R09:
0101
[726780.895192] R10: 03ff R11: 2057 R12:
8839cbb4b388
[726780.900029] R13: 0009 R14: 8839cbb4b3c8 R15:
883c5ed14900
[726780.904777] FS:  7f6acd467700() GS:883c8b1c()
knlGS:
[726780.909964] CS:  0010 DS:  ES:  CR0: 80050033
[726780.914078] CR2: 7f6cd5d430c8 CR3: 003c7e6b6005 CR4:
001606e0
[726780.918837] DR0:  DR1:  DR2:

[726780.923433] DR3:  DR6: fffe0ff0 DR7:
0400
[726780.927882] Call Trace:
[726780.930410]  inet_release+0x42/0x70
[726780.933423]  inet6_release+0x30/0x40
[726780.936439]  sock_release+0x25/0x80
[726780.939421]  sock_close+0x12/0x20
[726780.942342]  __fput+0xea/0x220
[726780.945170]  fput+0xe/0x10
[726780.947963]  task_work_run+0x8c/0xb0
[726780.951052]  exit_to_usermode_loop+0x6b/0x95
[726780.954480]  do_syscall_64+0x182/0x1b0
[726780.957586]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[726780.961288] RIP: 0033:0x7fd82b27085d
[726780.964349] RSP: 002b:7f6acd466b50 EFLAGS: 0293 ORIG_RAX:
0003
[726780.969280] RAX:  RBX: 7f65b2418220 RCX:
7fd82b27085d
[726780.973987] RDX: 0003 RSI: 7f6d5e9990c0 RDI:
02d8
[726780.978689] RBP: 7f6acd466c00 R08: 01622848 R09:
01f8
[726780.983373] R10:  R11: 0293 R12:

[726780.988049] R13: 7f6d5e9990c0 R14: 01d87cc0 R15:
02d8
[726780.992700] Code: ff 48 8b 43 28 31 f6 48 89 df 48 8b 40 10 e8 49
2d 4e 00 48 8b 43 30 48 8b 80 98 01 00 00 65 48 ff 80 90 01 00 00 e9
49 ff ff ff <0f> 0b e9 ab fc ff ff 48 8b 43 28 31 f6 48 89 df 48 8b 40
10 e8
[726781.004156] ---[ end trace 8525f27644ac4631 ]---

Re: [PATCH 0/2] net/sched: Add hardware specific counters to TC actions

2018-08-29 Thread Jakub Kicinski

On Wed, 29 Aug 2018 11:43:47 +0200, Eelco Chaudron wrote:
> On 23 Aug 2018, at 20:14, Jakub Kicinski wrote:
> 
> > On Mon, 20 Aug 2018 16:03:40 +0200, Eelco Chaudron wrote:  
> >> On 17 Aug 2018, at 13:27, Jakub Kicinski wrote:  
> >>> On Thu, 16 Aug 2018 14:02:44 +0200, Eelco Chaudron wrote:  
>  On 11 Aug 2018, at 21:06, David Miller wrote:
>   
> > From: Jakub Kicinski 
> > Date: Thu, 9 Aug 2018 20:26:08 -0700
> >  
> >> It is not immediately clear why this is needed.  The memory and
> >> updating two sets of counters won't come for free, so perhaps a
> >> stronger justification than troubleshooting is due? :S
> >>
> >> Netdev has counters for fallback vs forwarded traffic, so you'd
> >> know
> >> that traffic hits the SW datapath, plus the rules which are in_hw
> >> will
> >> most likely not match as of today for flower (assuming
> >> correctness).  
> 
>  I strongly believe that these counters are a requirement for a 
>  mixed
>  software/hardware (flow) based forwarding environment. The global
>  counters will not help much here as you might have chosen to have
>  certain traffic forwarded by software.
> 
>  These counters are probably the only option you have to figure out
>  why
>  forwarding is not as fast as expected, and you want to blame the TC
>  offload NIC.  
> >>>
> >>> The suggested debugging flow would be:
> >>>  (1) check the global counter for fallback are incrementing;
> >>>  (2) find a flow with high stats but no in_hw flag set.
> >>>
> >>> The in_hw indication should be sufficient in most cases (unless 
> >>> there
> >>> are shared blocks between netdevs of different ASICs...).  
> >>
> >> I guess the aim is to find miss behaving hardware, i.e. having the 
> >> in_hw
> >> flag set, but flows still coming to the kernel.  
> >
> > For misbehaving hardware in_hw will not work indeed.  Whether we need
> > these extra always-on stats for such use case could be debated :)
> >  
> >> I'm slightly concerned about potential performance impact, would
> >> you
> >> be able to share some numbers for non-trivial number of flows 
> >> (100k
> >> active?)?  
> >
> > Agreed, features used for diagnostics cannot have a harmful 
> > penalty
> > for fast path performance.  
> 
>  Fast path performance is not affected as these counters are not
>  incremented there. They are only incremented by the nic driver when
>  they
>  gather their statistics from hardware.  
> >>>
> >>> Not by much, you are adding state to performance-critical 
> >>> structures,
> >>> though, for what is effectively debugging purposes.
> >>>
> >>> I was mostly talking about the HW offload stat updates (sorry for 
> >>> not
> >>> being clear).
> >>>
> >>> We can have some hundreds of thousands active offloaded flows, each 
> >>> of
> >>> them can have multiple actions, and stats have to be updated 
> >>> multiple
> >>> times per second and dumped probably around once a second, too.  On 
> >>> a
> >>> busy system the stats will get evicted from cache between each 
> >>> round.
> >>>
> >>> But I'm speculating let's see if I can get some numbers on it (if 
> >>> you
> >>> could get some too, that would be great!).  
> >>
> >> I’ll try to measure some of this later this week/early next week.  
> >
> > I asked Louis to run some tests while I'm travelling, and he reports
> > that my worry about reporting the extra stats was unfounded.  Update
> > function does not show up in traces at all.  It seems under stress
> > (generated with stress-ng) the thread dumping the stats in userspace
> > (in OvS it would be the revalidator) actually consumes less CPU in
> > __gnet_stats_copy_basic (0.4% less for ~2.0% total).
> >
> > Would this match with your results?  I'm not sure why dumping would be
> > faster with your change..  
> 
> Tested with OVS and https://github.com/chaudron/ovs_perf using 300K TC 
> rules installed in HW.
> 
> For __gnet_stats_copy_basic() being faster I have (had) a theory. Now 
> this function is called twice, and I assumed the first call would cache 
> memory and the second call would be faster.
> 
> Sampling a lot of perf data, I get an average of 1115ns with the base 
> kernel and 954ns with the fix applied, so about ~14%.
> 
> Thought I would perf tcf_action_copy_stats() as it is the place updating 
> the additional counter. But even in this case, I see a better 
> performance with the patch applied.
> 
> In average 13581ns with the fix, vs base kernel at 1391ns, so about 
> 2.3%.
> 
> I guess the changes to the tc_action structure got better cache 
> alignment.

Interesting you could reproduce the speed up too!  +1 for the guess.
Seems like my caution about slowing down SW paths to support HW offload
landed on a very unfortunate patch :)

Re: [bpf-next, 01/11] xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY

2018-08-29 Thread Maciek Fijalkowski

From: Maciej Fijalkowski 

> From: Björn Töpel 
> 
> This commit adds proper MEM_TYPE_ZERO_COPY support for
> convert_to_xdp_frame. Converting a MEM_TYPE_ZERO_COPY xdp_buff to an
> xdp_frame is done by transforming the MEM_TYPE_ZERO_COPY buffer into a
> MEM_TYPE_PAGE_ORDER0 frame. This is costly, and in the future it might
> make sense to implement a more sophisticated thread-safe alloc/free
> scheme for MEM_TYPE_ZERO_COPY, so that no allocation and copy is
> required in the fast-path.
> 
> Signed-off-by: Björn Töpel 
> ---
>  include/net/xdp.h |  5 +++--
>  net/core/xdp.c| 39 +++
>  2 files changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 76b95256c266..0d5c6fb4b2e2 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -91,6 +91,8 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
>   frame->dev_rx = NULL;
>  }
>  
> +struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
> +
>  /* Convert xdp_buff to xdp_frame */
>  static inline
>  struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
> @@ -99,9 +101,8 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff 
> *xdp)
>   int metasize;
>   int headroom;
>  
> - /* TODO: implement clone, copy, use "native" MEM_TYPE */
>   if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
> - return NULL;
> + return xdp_convert_zc_to_xdp_frame(xdp);
>  
>   /* Assure headroom is available for storing info */
>   headroom = xdp->data - xdp->data_hard_start;
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 89b6785cef2a..be6cb2f0e722 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -398,3 +398,42 @@ void xdp_attachment_setup(struct xdp_attachment_info 
> *info,
>   info->flags = bpf->flags;
>  }
>  EXPORT_SYMBOL_GPL(xdp_attachment_setup);
> +
> +struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
> +{
> + unsigned int metasize, headroom, totsize;
> + void *addr, *data_to_copy;
> + struct xdp_frame *xdpf;
> + struct page *page;
> +
> + /* Clone into a MEM_TYPE_PAGE_ORDER0 xdp_frame. */
> + metasize = xdp_data_meta_unsupported(xdp) ? 0 :
> +xdp->data - xdp->data_meta;
> + headroom = xdp->data - xdp->data_hard_start;

You actually don't use the headroom calculated here;
xdpf->headroom is assigned to 0 - was this your intention?
If so, the local variable can be removed.

> + totsize = xdp->data_end - xdp->data + metasize;
> +
> + if (sizeof(*xdpf) + totsize > PAGE_SIZE)
> + return NULL;
> +
> + page = dev_alloc_page();
> + if (!page)
> + return NULL;
> +
> + addr = page_to_virt(page);
> + xdpf = addr;
> + memset(xdpf, 0, sizeof(*xdpf));
> +
> + addr += sizeof(*xdpf);
> + data_to_copy = metasize ? xdp->data_meta : xdp->data;
> + memcpy(addr, data_to_copy, totsize);
> +
> + xdpf->data = addr + metasize;
> + xdpf->len = totsize - metasize;
> + xdpf->headroom = 0;
> + xdpf->metasize = metasize;
> + xdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
> +
> + xdp_return_buff(xdp);
> + return xdpf;
> +}
> +EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);

Re: [PATCH 0/2] net/sched: Add hardware specific counters to TC actions

2018-08-29 Thread Jakub Kicinski

On Wed, 29 Aug 2018 12:23:15 +0200, Paolo Abeni wrote:
> On Thu, 2018-08-23 at 20:14 +0200, Jakub Kicinski wrote:
> > I asked Louis to run some tests while I'm travelling, and he reports
> > that my worry about reporting the extra stats was unfounded.  Update
> > function does not show up in traces at all.  It seems under stress
> > (generated with stress-ng) the thread dumping the stats in userspace
> > (in OvS it would be the revalidator) actually consumes less CPU in
> > __gnet_stats_copy_basic (0.4% less for ~2.0% total).
> > 
> > Would this match with your results?  I'm not sure why dumping would be
> > faster with your change..  
> 
> Wild guess on my side: the relevant patch changes a bit the binary
> layout of the 'tc_action' struct, possibly (I still need to check with
> pahole) moving the tcf_lock and the stats field on different
> cachelines, reducing false sharing that could affect badly such test.

I think in our tests we tried with and without pinning relevant
processing to one core, and both results shown improvement.  I don't
have the actual samples any more, just perf script dump without CPU
IDs to confirm things were pinned correctly.. :(

Re: [PATCH bpf 0/3] Three fixes for bpf_msg_pull_data

2018-08-29 Thread Alexei Starovoitov

On Wed, Aug 29, 2018 at 04:50:33PM +0200, Daniel Borkmann wrote:
> This set contains three more fixes for the bpf_msg_pull_data()
> mainly for correcting scatterlist ring wrap-arounds as well as
> fixing up data pointers. For details please see individual patches.
> Thanks!

Applied to bpf tree, Thanks

[Patch net-nnext] Revert "net: sched: act: add extack for lookup callback"

2018-08-29 Thread Cong Wang

This reverts commit 331a9295de23 ("net: sched: act: add extack for lookup 
callback").

This extack is never used after 6 months... In fact, it can be just
set in the caller, right after ->lookup().

Cc: Alexander Aring 
Signed-off-by: Cong Wang 
---
 include/net/act_api.h  | 3 +--
 net/sched/act_api.c| 6 --
 net/sched/act_bpf.c| 3 +--
 net/sched/act_connmark.c   | 3 +--
 net/sched/act_csum.c   | 3 +--
 net/sched/act_gact.c   | 3 +--
 net/sched/act_ife.c| 3 +--
 net/sched/act_ipt.c| 6 ++
 net/sched/act_mirred.c | 3 +--
 net/sched/act_nat.c| 3 +--
 net/sched/act_pedit.c  | 3 +--
 net/sched/act_police.c | 3 +--
 net/sched/act_sample.c | 3 +--
 net/sched/act_simple.c | 3 +--
 net/sched/act_skbedit.c| 3 +--
 net/sched/act_skbmod.c | 3 +--
 net/sched/act_tunnel_key.c | 3 +--
 net/sched/act_vlan.c   | 3 +--
 18 files changed, 22 insertions(+), 38 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 970303448c90..c6f195b3c706 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -85,8 +85,7 @@ struct tc_action_ops {
   struct tcf_result *); /* called under RCU BH lock*/
int (*dump)(struct sk_buff *, struct tc_action *, int, int);
void(*cleanup)(struct tc_action *);
-   int (*lookup)(struct net *net, struct tc_action **a, u32 index,
- struct netlink_ext_ack *extack);
+   int (*lookup)(struct net *net, struct tc_action **a, u32 index);
int (*init)(struct net *net, struct nlattr *nla,
struct nlattr *est, struct tc_action **act, int ovr,
int bind, bool rtnl_held,
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index db83dac1e7f4..398c752ff529 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1067,12 +1067,14 @@ static struct tc_action *tcf_action_get_1(struct net 
*net, struct nlattr *nla,
err = -EINVAL;
ops = tc_lookup_action(tb[TCA_ACT_KIND]);
if (!ops) { /* could happen in batch of actions */
-   NL_SET_ERR_MSG(extack, "Specified TC action not found");
+   NL_SET_ERR_MSG(extack, "Specified TC action kind not found");
goto err_out;
}
err = -ENOENT;
-   if (ops->lookup(net, , index, extack) == 0)
+   if (ops->lookup(net, , index) == 0) {
+   NL_SET_ERR_MSG(extack, "TC action with specified index not 
found");
goto err_mod;
+   }
 
module_put(ops->owner);
return a;
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 0c68bc9cf0b4..c7633843e223 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -387,8 +387,7 @@ static int tcf_bpf_walker(struct net *net, struct sk_buff 
*skb,
return tcf_generic_walker(tn, skb, cb, type, ops, extack);
 }
 
-static int tcf_bpf_search(struct net *net, struct tc_action **a, u32 index,
- struct netlink_ext_ack *extack)
+static int tcf_bpf_search(struct net *net, struct tc_action **a, u32 index)
 {
struct tc_action_net *tn = net_generic(net, bpf_net_id);
 
diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c
index 6f0f273f1139..e869c0ee63c8 100644
--- a/net/sched/act_connmark.c
+++ b/net/sched/act_connmark.c
@@ -190,8 +190,7 @@ static int tcf_connmark_walker(struct net *net, struct 
sk_buff *skb,
return tcf_generic_walker(tn, skb, cb, type, ops, extack);
 }
 
-static int tcf_connmark_search(struct net *net, struct tc_action **a, u32 
index,
-  struct netlink_ext_ack *extack)
+static int tcf_connmark_search(struct net *net, struct tc_action **a, u32 
index)
 {
struct tc_action_net *tn = net_generic(net, connmark_net_id);
 
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index b8a67ae3105a..3dc25b7806d7 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -646,8 +646,7 @@ static int tcf_csum_walker(struct net *net, struct sk_buff 
*skb,
return tcf_generic_walker(tn, skb, cb, type, ops, extack);
 }
 
-static int tcf_csum_search(struct net *net, struct tc_action **a, u32 index,
-  struct netlink_ext_ack *extack)
+static int tcf_csum_search(struct net *net, struct tc_action **a, u32 index)
 {
struct tc_action_net *tn = net_generic(net, csum_net_id);
 
diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index cd1d9bd32ef9..aa44d14b43c7 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -222,8 +222,7 @@ static int tcf_gact_walker(struct net *net, struct sk_buff 
*skb,
return tcf_generic_walker(tn, skb, cb, type, ops, extack);
 }
 
-static int tcf_gact_search(struct net *net, struct tc_action **a, u32 index,
-  struct netlink_ext_ack *extack)
+static int tcf_gact_search(struct net *net, struct tc_action **a, u32 index)
 {
struct tc_action_net *tn =

[Patch net-nnext] net_sched: add missing tcf_lock for act_connmark

2018-08-29 Thread Cong Wang

According to the new locking rule, we have to take tcf_lock
for both ->init() and ->dump(), as RTNL will be removed.
However, it is missing for act_connmark.

Cc: Vlad Buslov 
Signed-off-by: Cong Wang 
---
 net/sched/act_connmark.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c
index e869c0ee63c8..8475913f2070 100644
--- a/net/sched/act_connmark.c
+++ b/net/sched/act_connmark.c
@@ -143,8 +143,10 @@ static int tcf_connmark_init(struct net *net, struct 
nlattr *nla,
return -EEXIST;
}
/* replacing action and zone */
+   spin_lock_bh(>tcf_lock);
ci->tcf_action = parm->action;
ci->zone = parm->zone;
+   spin_unlock_bh(>tcf_lock);
ret = 0;
}
 
@@ -156,16 +158,16 @@ static inline int tcf_connmark_dump(struct sk_buff *skb, 
struct tc_action *a,
 {
unsigned char *b = skb_tail_pointer(skb);
struct tcf_connmark_info *ci = to_connmark(a);
-
struct tc_connmark opt = {
.index   = ci->tcf_index,
.refcnt  = refcount_read(>tcf_refcnt) - ref,
.bindcnt = atomic_read(>tcf_bindcnt) - bind,
-   .action  = ci->tcf_action,
-   .zone   = ci->zone,
};
struct tcf_t t;
 
+   spin_lock_bh(>tcf_lock);
+   opt.action = ci->tcf_action;
+   opt.zone = ci->zone;
if (nla_put(skb, TCA_CONNMARK_PARMS, sizeof(opt), ))
goto nla_put_failure;
 
@@ -173,9 +175,12 @@ static inline int tcf_connmark_dump(struct sk_buff *skb, 
struct tc_action *a,
if (nla_put_64bit(skb, TCA_CONNMARK_TM, sizeof(t), ,
  TCA_CONNMARK_PAD))
goto nla_put_failure;
+   spin_unlock_bh(>tcf_lock);
 
return skb->len;
+
 nla_put_failure:
+   spin_unlock_bh(>tcf_lock);
nlmsg_trim(skb, b);
return -1;
 }
-- 
2.14.4

Re: [PATCH 2/4] r8169: Get and enable optional ether_clk clock

2018-08-29 Thread Hans de Goede


Hi,

On 27-08-18 21:14, Stephen Boyd wrote:

Quoting Hans de Goede (2018-08-27 11:53:19)

On 27-08-18 20:47, Stephen Boyd wrote:

How would you know that a clk device driver hasn't probed yet and isn't
the driver that's actually providing the clk to this device on x86
systems? With DT systems we can figure that out by looking at the DT and
seeing if the device driver requesting the clk has the clocks property.
On x86 systems it's all clkdev which doesn't really lend itself to
solving this problem.


Right on x86 the assumption is that the clk driver will be builtin and
will probe before the consumer. In this case that is true as the
pmc-atom-clk driver can only be builtin and its platform device is
instantiated from the acpi_lpss code and acpi init happens before
the PCI bus is scanned.


If we can go with this assumption then we can make the optional clk API
work even on clkdev based systems. Maybe if x86 had some way of
indicating that all builtin clks are registered?


Unfortunately there is no such thing I'm afraid.


That might work but
it's not very clean. Or if we could check to see if we're running on an
ACPI based system in clkdev we could use that to assume that clk_get()
will only be called after all providers have registered their lookups.


Yes some check for x86 + ACPI (ARM also uses ACPI, but there we
should no do this AFAICT) is probably best. That or not use the
new optional clk API on x86, but that means that any cross platform
driver cannot use it, which would be a pain.

BTW does your Acked-by indicate you are ok with merging this series
through the netdev tree as I suggested in the cover-letter? If so
can I also add your Acked-by to the 3th patch ?

Regards,

Hans

[Patch iproute2 v2] ss: add UNIX_DIAG_VFS and UNIX_DIAG_ICONS for unix sockets

2018-08-29 Thread Cong Wang

UNIX_DIAG_VFS and UNIX_DIAG_ICONS are never used by ss,
make them available in ss -e output.

Cc: Stephen Hemminger 
Signed-off-by: Cong Wang 
---
 misc/ss.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/misc/ss.c b/misc/ss.c
index 41e7762b..b2c634c8 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -3604,6 +3605,21 @@ static int unix_show_sock(const struct sockaddr_nl 
*addr, struct nlmsghdr *nlh,
out(" %c-%c",
mask & 1 ? '-' : '<', mask & 2 ? '-' : '>');
}
+   if (tb[UNIX_DIAG_VFS]) {
+   struct unix_diag_vfs *uv = RTA_DATA(tb[UNIX_DIAG_VFS]);
+
+   out(" ino:%u dev:%u/%u", uv->udiag_vfs_ino, 
major(uv->udiag_vfs_dev),
+minor(uv->udiag_vfs_dev));
+   }
+   if (tb[UNIX_DIAG_ICONS]) {
+   int len = RTA_PAYLOAD(tb[UNIX_DIAG_ICONS]);
+   __u32 *peers = RTA_DATA(tb[UNIX_DIAG_ICONS]);
+   int i;
+
+   out(" peers:");
+   for (i = 0; i < len / sizeof(__u32); i++)
+   out(" %u", peers[i]);
+   }
}
 
return 0;
@@ -3641,6 +3657,8 @@ static int unix_show_netlink(struct filter *f)
req.r.udiag_show = UDIAG_SHOW_NAME | UDIAG_SHOW_PEER | UDIAG_SHOW_RQLEN;
if (show_mem)
req.r.udiag_show |= UDIAG_SHOW_MEMINFO;
+   if (show_details)
+   req.r.udiag_show |= UDIAG_SHOW_VFS | UDIAG_SHOW_ICONS;
 
return handle_netlink_request(f, , sizeof(req), unix_show_sock);
 }
-- 
2.14.4

Re: [PATCH 0/4] clk-pmc-atom + r8169: Add ether_clk handling to fix suspend issues

2018-08-29 Thread Hans de Goede


Hi,

On 29-08-18 18:31, Andy Shevchenko wrote:

On Mon, Aug 27, 2018 at 04:31:56PM +0200, Hans de Goede wrote:

Hi All,

This series has as goal to revert commit d31fd43c0f9a ("clk: x86: Do not gate
clocks enabled by the firmware"), because that commit causes almost all
Cherry Trail devices to not use the S0i3 powerstate when suspending, leading
to them quickly draining their battery when suspended.

This commit was added to fix issues with the r8169 NIC on some Bay Trail
devices, where the pmc_plt_clk_4 was used as clk for the r8169 chip.

This series fixes this properly, so that we can undo the trouble some
commit.

The first patch adds an "ether_clk" alias to the clk-pmc-atom driver so
that the r8169 can pass "ether_clk" as generic id to clk_get instead of
having to add x86 specific code to the r8169 driver.

The second patch makes the r8169 driver do a clk_get for "ether_clk"
(ignoring -ENOENT errors so this is optional) and if that succeeds then
it enables the clock.

The third patch effectively revert the troublesome commit.

This series has been tested on a HP Stream x360 - 11-p000nd, which uses
a Bay Trail SoC with a r8169 ethernet chip and I can confirm that the
pmc_plt_clk_4 gets properly enabled, so this series should not cause any
regressions wrt devices needing pmc_plt_clk_4 enabled (which is not the
case on the Stream x360).

To avoid regressions we need to have patches 1 and 2 merged before 3,
so it is probably best if this entire series gets merged through a single
tree. Given that clk-pmc-atom.c has not seen any changes for over a
year I suggest that all 3 patches are merged through the netdev tree,
with an ack from the clk maintainers. Assuming that is ok with the clk
maintainers of course.

Note there is a fourth patch in this series, this patch is necessary to
reach S0i3 state on Bay Trail devices wich have a r8169 ethernet chip,
since I do not have access to hardware where the r8169 actually needs
the pmc_plt_clk_4 I have not been able to test this, hence it is marked
as RFC for now.



What a nice stuff, thanks!

Reviewed-by: Andy Shevchenko 


Thank you (and also thank you for the other reviews)


Btw, you probably better to refer to
https://bugzilla.kernel.org/show_bug.cgi?id=196861 which is dedicated for S0ix
issue.


Ok, I've added a link to that. I've also added:

Reported-by: Johannes Stezenbach 

To honor Johannes for figuring out that leaving the clocks enabled
was a problem in the first place.

This will all be included in v2 of the series when I send it out.


Another thing, clk_prepare_enable() can fail, I dunno what you can do at
->resume() stage with it failed.


Right, I know that it can fail, but in practice it never fails unless
things are seriously foo-barred already and there is not much we can
do when it fails, so I decided to just ignore the return code.

Regards,

Hans

Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-29 Thread Alexander Duyck

On Wed, Aug 29, 2018 at 8:43 AM Alex Vesker  wrote:
>
>
> > On Wed, Aug 1, 2018 at 4:13 PM, Saeed Mahameed
> >  wrote:
> >> On Wed, Aug 1, 2018 at 3:34 PM, Alexander Duyck
> >>  wrote:
> >>> On Wed, Aug 1, 2018 at 2:52 PM, Saeed Mahameed 
> >>> wrote:
>  Hi Dave,
> 
>  This series provides devlink parameters updates to both devlink API
>  and
>  mlx5 driver, it is a 2nd iteration of the dropped patches sent in a
>  previous
>  mlx5 submission "net/mlx5: Support PCIe buffer congestion handling via
>  Devlink" to address review comments [1].
> 
>  Changes from the original series:
>  - According to the discussion outcome, we are keeping the
>  congestion control
>    setting as mlx5 device specific for the current HW generation.
>  - Changed the congestion_mode and congestion action param type to
>  string
>  - Added patches to fix devlink handling of param type string
>  - Added a patch which adds extack messages support for param set.
>  - At the end of this series, I've added yet another mlx5 devlink
>  related
>   feature, firmware snapshot support.
> 
>  For more information please see tag log below.
> 
>  Please pull and let me know if there's any problem.
> 
>  [1] https://patchwork.ozlabs.org/patch/945996/
> 
>  Thanks,
>  Saeed.
> 
>  ---
> 
>  The following changes since commit
>  e6476c21447c4b17c47e476aade6facf050f31e8:
> 
>    net: remove bogus RCU annotations on socket.wq (2018-07-31
>  12:40:22 -0700)
> 
>  are available in the Git repository at:
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git
>  tags/mlx5-updates-2018-08-01
> 
>  for you to fetch changes up to
>  2ac6108c65ffcb1e5eab1fba1fd59272604d1c32:
> 
>    net/mlx5: Use devlink region_snapshot parameter (2018-08-01
>  14:49:09 -0700)
> 
>  
>  mlx5-updates-2018-08-01
> 
>  This series provides devlink parameters updates to both devlink API
>  and
>  mlx5 driver,
> 
>  1) Devlink changes: (Moshe Shemesh)
>  The first two patches fix devlink param infrastructure for string type
>  params.
>  The third patch adds a devlink helper function to safely copy
>  string from
>  driver to devlink.
>  The forth patch adds extack support for param set.
> 
>  2) mlx5 specific congestion parameters: (Eran Ben Elisha)
>  Next three patches add new devlink driver specific params for
>  controlling
>  congestion action and mode, using string type params and extack
>  messages support.
> 
>  This congestion mode enables hw workaround in specific devices
>  which is
>  controlled by devlink driver-specific params. The workaround is device
>  specific for this NIC generation, the next NIC will not need it.
> 
>  Congestion parameters:
>   - Congestion action
>  HW W/A mechanism in the PCIe buffer which monitors the
>  amount of
>  consumed PCIe buffer per host.  This mechanism supports
>  the
>  following actions in case of threshold overflow:
>  - Disabled - NOP (Default)
>  - Drop
>  - Mark - Mark CE bit in the CQE of received packet
>  - Congestion mode
>  - Aggressive - Aggressive static trigger threshold
>  (Default)
>  - Dynamic - Dynamically change the trigger threshold
> 
>  3) mlx5 firmware snapshot support via devlink: (Alex Vesker)
>  Last three patches, add the support for capturing region snapshot
>  of the
>  firmware crspace during critical errors, using devlink region_snapshot
>  parameter.
> 
>  -Saeed.
> 
>  
>  Alex Vesker (3):
>    net/mlx5: Add Vendor Specific Capability access gateway
>    net/mlx5: Add Crdump FW snapshot support
>    net/mlx5: Use devlink region_snapshot parameter
> 
>  Eran Ben Elisha (3):
>    net/mlx5: Move all devlink related functions calls to devlink.c
>    net/mlx5: Add MPEGC register configuration functionality
>    net/mlx5: Enable PCIe buffer congestion handling workaround
>  via devlink
> 
>  Moshe Shemesh (4):
>    devlink: Fix param set handling for string type
>    devlink: Fix param cmode driverinit for string type
>    devlink: Add helper function for safely copy string param
>    devlink: Add extack messages support to param set
> 
>   drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c  |   3 +-
>   drivers/net/ethernet/mellanox/mlx4/main.c  |   6 +-
>   drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   3 +-
>

Re: [PATCH 0/4] clk-pmc-atom + r8169: Add ether_clk handling to fix suspend issues

2018-08-29 Thread Andy Shevchenko

On Mon, Aug 27, 2018 at 04:31:56PM +0200, Hans de Goede wrote:
> Hi All,
> 
> This series has as goal to revert commit d31fd43c0f9a ("clk: x86: Do not gate
> clocks enabled by the firmware"), because that commit causes almost all
> Cherry Trail devices to not use the S0i3 powerstate when suspending, leading
> to them quickly draining their battery when suspended.
> 
> This commit was added to fix issues with the r8169 NIC on some Bay Trail
> devices, where the pmc_plt_clk_4 was used as clk for the r8169 chip.
> 
> This series fixes this properly, so that we can undo the trouble some
> commit.
> 
> The first patch adds an "ether_clk" alias to the clk-pmc-atom driver so
> that the r8169 can pass "ether_clk" as generic id to clk_get instead of
> having to add x86 specific code to the r8169 driver.
> 
> The second patch makes the r8169 driver do a clk_get for "ether_clk"
> (ignoring -ENOENT errors so this is optional) and if that succeeds then
> it enables the clock.
> 
> The third patch effectively revert the troublesome commit.
> 
> This series has been tested on a HP Stream x360 - 11-p000nd, which uses
> a Bay Trail SoC with a r8169 ethernet chip and I can confirm that the
> pmc_plt_clk_4 gets properly enabled, so this series should not cause any
> regressions wrt devices needing pmc_plt_clk_4 enabled (which is not the
> case on the Stream x360).
> 
> To avoid regressions we need to have patches 1 and 2 merged before 3,
> so it is probably best if this entire series gets merged through a single
> tree. Given that clk-pmc-atom.c has not seen any changes for over a
> year I suggest that all 3 patches are merged through the netdev tree,
> with an ack from the clk maintainers. Assuming that is ok with the clk
> maintainers of course.
> 
> Note there is a fourth patch in this series, this patch is necessary to
> reach S0i3 state on Bay Trail devices wich have a r8169 ethernet chip,
> since I do not have access to hardware where the r8169 actually needs
> the pmc_plt_clk_4 I have not been able to test this, hence it is marked
> as RFC for now.
> 

What a nice stuff, thanks!

Reviewed-by: Andy Shevchenko 

Btw, you probably better to refer to
https://bugzilla.kernel.org/show_bug.cgi?id=196861 which is dedicated for S0ix
issue.

Another thing, clk_prepare_enable() can fail, I dunno what you can do at
->resume() stage with it failed.

> Carlos can you test the 4th patch (when you have time) and let us know if
> it works?
> 
> Regards,
> 
> Hans
> 

-- 
With Best Regards,
Andy Shevchenko

[PATCH net-next 2/2] hv_netvsc: pair VF based on serial number

2018-08-29 Thread Stephen Hemminger

Matching network device based on MAC address is problematic
since a non-VF network device can be created with a duplicate MAC
address causing confusion and problems.  The VMBus API provides
a serial number that is a better matching method.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc.c |  3 ++
 drivers/net/hyperv/netvsc_drv.c | 58 +++--
 2 files changed, 36 insertions(+), 25 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 31c3d77b4733..fe01e141c8f8 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1203,6 +1203,9 @@ static void netvsc_send_vf(struct net_device *ndev,
 
net_device_ctx->vf_alloc = nvmsg->msg.v4_msg.vf_assoc.allocated;
net_device_ctx->vf_serial = nvmsg->msg.v4_msg.vf_assoc.serial;
+   netdev_info(ndev, "VF slot %u %s\n",
+   net_device_ctx->vf_serial,
+   net_device_ctx->vf_alloc ? "added" : "removed");
 }
 
 static  void netvsc_receive_inband(struct net_device *ndev,
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 1121a1ec407c..9dedc1463e88 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1894,20 +1894,6 @@ static void netvsc_link_change(struct work_struct *w)
rtnl_unlock();
 }
 
-static struct net_device *get_netvsc_bymac(const u8 *mac)
-{
-   struct net_device_context *ndev_ctx;
-
-   list_for_each_entry(ndev_ctx, _dev_list, list) {
-   struct net_device *dev = hv_get_drvdata(ndev_ctx->device_ctx);
-
-   if (ether_addr_equal(mac, dev->perm_addr))
-   return dev;
-   }
-
-   return NULL;
-}
-
 static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
 {
struct net_device_context *net_device_ctx;
@@ -2036,26 +2022,48 @@ static void netvsc_vf_setup(struct work_struct *w)
rtnl_unlock();
 }
 
+/* Find netvsc by VMBus serial number.
+ * The PCI hyperv controller records the serial number as the slot.
+ */
+static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
+{
+   struct device *parent = vf_netdev->dev.parent;
+   struct net_device_context *ndev_ctx;
+   struct pci_dev *pdev;
+
+   if (!parent || !dev_is_pci(parent))
+   return NULL; /* not a PCI device */
+
+   pdev = to_pci_dev(parent);
+   if (!pdev->slot) {
+   netdev_notice(vf_netdev, "no PCI slot information\n");
+   return NULL;
+   }
+
+   list_for_each_entry(ndev_ctx, _dev_list, list) {
+   if (!ndev_ctx->vf_alloc)
+   continue;
+
+   if (ndev_ctx->vf_serial == pdev->slot->number)
+   return hv_get_drvdata(ndev_ctx->device_ctx);
+   }
+
+   netdev_notice(vf_netdev,
+ "no netdev found for slot %u\n", pdev->slot->number);
+   return NULL;
+}
+
 static int netvsc_register_vf(struct net_device *vf_netdev)
 {
-   struct net_device *ndev;
struct net_device_context *net_device_ctx;
-   struct device *pdev = vf_netdev->dev.parent;
struct netvsc_device *netvsc_dev;
+   struct net_device *ndev;
int ret;
 
if (vf_netdev->addr_len != ETH_ALEN)
return NOTIFY_DONE;
 
-   if (!pdev || !dev_is_pci(pdev) || dev_is_pf(pdev))
-   return NOTIFY_DONE;
-
-   /*
-* We will use the MAC address to locate the synthetic interface to
-* associate with the VF interface. If we don't find a matching
-* synthetic interface, move on.
-*/
-   ndev = get_netvsc_bymac(vf_netdev->perm_addr);
+   ndev = get_netvsc_byslot(vf_netdev);
if (!ndev)
return NOTIFY_DONE;
 
-- 
2.18.0

[PATCH net-next 0/2] hv_netvsc: associate VF and PV device by serial number

2018-08-29 Thread Stephen Hemminger

The Hyper-V implementation of PCI controller has concept of 32 bit serial number
(not to be confused with PCI-E serial number).  This value is sent in the 
protocol
from the host to indicate SR-IOV VF device is attached to a synthetic NIC.

Using the serial number (instead of MAC address) to associate the two devices
avoids lots of potential problems when there are duplicate MAC addresses from
tunnels or layered devices.

The patch set is broken into two parts, one is for the PCI controller
and the other is for the netvsc device. Normally, these go through different
trees but sending them together here for better review. The PCI changes
were submitted previously, but the main review comment was "why do you
need this?". This is why.

Stephen Hemminger (2):
  PCI: hv: support reporting serial number as slot information
  hv_netvsc: pair VF based on serial number

 drivers/net/hyperv/netvsc.c |  3 ++
 drivers/net/hyperv/netvsc_drv.c | 58 -
 drivers/pci/controller/pci-hyperv.c | 30 +++
 3 files changed, 66 insertions(+), 25 deletions(-)

-- 
2.18.0

[PATCH net-next 1/2] PCI: hv: support reporting serial number as slot information

2018-08-29 Thread Stephen Hemminger

The Hyper-V host API for PCI provides a unique "serial number" which
can be used as basis for sysfs PCI slot table. This can be useful
for cases where userspace wants to find the PCI device based on
serial number.

When an SR-IOV NIC is added, the host sends an attach message
with serial number. The kernel doesn't use the serial number, but
it is useful when doing the same thing in a userspace driver such
as the DPDK. By having /sys/bus/pci/slots/N it provides a direct
way to find the matching PCI device.

There may be some cases where serial number is not unique such
as when using GPU's. But the PCI slot infrastructure will handle
that by adding suffix "2-1" etc.

This has a side effect which may also be useful. The common udev
network device naming policy uses the slot information (rather
than PCI address). This causes udev to give shorter network device
names for VF devices.  It does not break applications or startup
because the VF device must never be configured directly.

Signed-off-by: Stephen Hemminger 
---
 drivers/pci/controller/pci-hyperv.c | 30 +
 1 file changed, 30 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c 
b/drivers/pci/controller/pci-hyperv.c
index c00f82cc54aa..e6a6c1146a41 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -89,6 +89,8 @@ static enum pci_protocol_version_t pci_protocol_version;
 
 #define STATUS_REVISION_MISMATCH 0xC059
 
+#define SLOT_NAME_SIZE 21
+
 /*
  * Message Types
  */
@@ -494,6 +496,7 @@ struct hv_pci_dev {
struct list_head list_entry;
refcount_t refs;
enum hv_pcichild_state state;
+   struct pci_slot *pci_slot;
struct pci_function_description desc;
bool reported_missing;
struct hv_pcibus_device *hbus;
@@ -1457,6 +1460,28 @@ static void prepopulate_bars(struct hv_pcibus_device 
*hbus)
spin_unlock_irqrestore(>device_list_lock, flags);
 }
 
+static void hv_pci_assign_slots(struct hv_pcibus_device *hbus)
+{
+   struct hv_pci_dev *hpdev;
+   char name[SLOT_NAME_SIZE];
+   unsigned long flags;
+   int slot_nr;
+
+   spin_lock_irqsave(>device_list_lock, flags);
+   list_for_each_entry(hpdev, >children, list_entry) {
+   if (hpdev->pci_slot)
+   continue;
+
+   slot_nr = PCI_SLOT(wslot_to_devfn(hpdev->desc.win_slot.slot));
+   snprintf(name, SLOT_NAME_SIZE, "%u", hpdev->desc.ser);
+   hpdev->pci_slot = pci_create_slot(hbus->pci_bus, slot_nr,
+ name, NULL);
+   if (!hpdev->pci_slot)
+   pr_warn("pci_create slot %s failed\n", name);
+   }
+   spin_unlock_irqrestore(>device_list_lock, flags);
+}
+
 /**
  * create_root_hv_pci_bus() - Expose a new root PCI bus
  * @hbus:  Root PCI bus, as understood by this driver
@@ -1480,6 +1505,7 @@ static int create_root_hv_pci_bus(struct hv_pcibus_device 
*hbus)
pci_lock_rescan_remove();
pci_scan_child_bus(hbus->pci_bus);
pci_bus_assign_resources(hbus->pci_bus);
+   hv_pci_assign_slots(hbus);
pci_bus_add_devices(hbus->pci_bus);
pci_unlock_rescan_remove();
hbus->state = hv_pcibus_installed;
@@ -1742,6 +1768,7 @@ static void pci_devices_present_work(struct work_struct 
*work)
 */
pci_lock_rescan_remove();
pci_scan_child_bus(hbus->pci_bus);
+   hv_pci_assign_slots(hbus);
pci_unlock_rescan_remove();
break;
 
@@ -1858,6 +1885,9 @@ static void hv_eject_device_work(struct work_struct *work)
list_del(>list_entry);
spin_unlock_irqrestore(>hbus->device_list_lock, flags);
 
+   if (hpdev->pci_slot)
+   pci_destroy_slot(hpdev->pci_slot);
+
memset(, 0, sizeof(ctxt));
ejct_pkt = (struct pci_eject_response *)
ejct_pkt->message_type.type = PCI_EJECTION_COMPLETE;
-- 
2.18.0

Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e

2018-08-29 Thread Daniel Borkmann

On 08/28/2018 02:44 PM, Björn Töpel wrote:
> From: Björn Töpel 
> 
> This patch set introduces zero-copy AF_XDP support for Intel's i40e
> driver. In the first preparatory patch we also add support for
> XDP_REDIRECT for zero-copy allocated frames so that XDP programs can
> redirect them. This was a ToDo from the first AF_XDP zero-copy patch
> set from early June. Special thanks to Alex Duyck and Jesper Dangaard
> Brouer for reviewing earlier versions of this patch set.
> 
> The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note
> that in the interest of time, to get an AF_XDP zero-copy implementation
> out there for people to try, some code paths have been copied from the
> XDP path to the zero-copy path. It is out goal to merge the two paths
> in later patch sets.
> 
> In contrast to the implementation from beginning of June, this patch
> set does not require any extra HW queues for AF_XDP zero-copy
> TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and
> AF_XDP zero-copy TX.
> 
> Jeff, given that most of changes are in i40e, it is up to you how you
> would like to route these patches. The set is tagged bpf-next, but
> if taking it via the Intel driver tree is easier, let us know.
> 
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> NIC is Intel I40E 40Gbit/s using the i40e driver.
> 
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by a commercial packet generator HW
> outputing packets at full 40 Gbit/s line rate. The results are with
> retpoline and all other spectre and meltdown fixes, so these results
> are not comparable to the ones from the zero-copy patch set in June.
> 
> AF_XDP performance 64 byte packets.
> Benchmark   XDP_SKBXDP_DRVXDP_DRV with zerocopy
> rxdrop   2.68.2 15.0
> txpush   2.2-   21.9
> l2fwd1.72.3 11.3
> 
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV XDP_DRV with zerocopy
> rxdrop   2.03.3 3.3
> l2fwd1.31.7 3.1
> 
> XDP performance on our system as a base line:
> 
> 64 byte packets:
> XDP stats   CPU pps issue-pps
> XDP-RX CPU  16  18.4M  0
> 
> 1500 byte packets:
> XDP stats   CPU pps issue-pps
> XDP-RX CPU  16  3.3M0
> 
> The structure of the patch set is as follows:
> 
> Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames
> Patches 2-4: Preparatory patches to common xsk and net code
> Patches 5-7: Preparatory patches to i40e driver code for RX
> Patch 8: i40e zero-copy support for RX
> Patch 9: Preparatory patch to i40e driver code for TX
> Patch 10: i40e zero-copy support for TX
> Patch 11: Add flags to sample application to force zero-copy/copy mode
> 
> We based this patch set on bpf-next commit 050cdc6c9501 ("Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
> 
> 
> Magnus & Björn
> 
> Björn Töpel (8):
>   xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
>   xdp: export xdp_rxq_info_unreg_mem_model
>   xsk: expose xdp_umem_get_{data,dma} to drivers
>   i40e: added queue pair disable/enable functions
>   i40e: refactor Rx path for re-use
>   i40e: move common Rx functions to i40e_txrx_common.h
>   i40e: add AF_XDP zero-copy Rx support
>   samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
> 
> Magnus Karlsson (3):
>   net: add napi_if_scheduled_mark_missed
>   i40e: move common Tx functions to i40e_txrx_common.h
>   i40e: add AF_XDP zero-copy Tx support

Thanks for working on this, LGTM! Are you also planning to get ixgbe
out after that?

For the series:

Acked-by: Daniel Borkmann 

Thanks,
Daniel

Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-29 Thread Alex Vesker





On Wed, Aug 1, 2018 at 4:13 PM, Saeed Mahameed
 wrote:

On Wed, Aug 1, 2018 at 3:34 PM, Alexander Duyck
 wrote:
On Wed, Aug 1, 2018 at 2:52 PM, Saeed Mahameed  
wrote:

Hi Dave,

This series provides devlink parameters updates to both devlink API 
and
mlx5 driver, it is a 2nd iteration of the dropped patches sent in a 
previous

mlx5 submission "net/mlx5: Support PCIe buffer congestion handling via
Devlink" to address review comments [1].

Changes from the original series:
- According to the discussion outcome, we are keeping the 
congestion control

  setting as mlx5 device specific for the current HW generation.
- Changed the congestion_mode and congestion action param type to 
string

- Added patches to fix devlink handling of param type string
- Added a patch which adds extack messages support for param set.
- At the end of this series, I've added yet another mlx5 devlink 
related

 feature, firmware snapshot support.

For more information please see tag log below.

Please pull and let me know if there's any problem.

[1] https://patchwork.ozlabs.org/patch/945996/

Thanks,
Saeed.

---

The following changes since commit 
e6476c21447c4b17c47e476aade6facf050f31e8:


  net: remove bogus RCU annotations on socket.wq (2018-07-31 
12:40:22 -0700)


are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git 
tags/mlx5-updates-2018-08-01


for you to fetch changes up to 
2ac6108c65ffcb1e5eab1fba1fd59272604d1c32:


  net/mlx5: Use devlink region_snapshot parameter (2018-08-01 
14:49:09 -0700)



mlx5-updates-2018-08-01

This series provides devlink parameters updates to both devlink API 
and

mlx5 driver,

1) Devlink changes: (Moshe Shemesh)
The first two patches fix devlink param infrastructure for string type
params.
The third patch adds a devlink helper function to safely copy 
string from

driver to devlink.
The forth patch adds extack support for param set.

2) mlx5 specific congestion parameters: (Eran Ben Elisha)
Next three patches add new devlink driver specific params for 
controlling
congestion action and mode, using string type params and extack 
messages support.


This congestion mode enables hw workaround in specific devices 
which is

controlled by devlink driver-specific params. The workaround is device
specific for this NIC generation, the next NIC will not need it.

Congestion parameters:
 - Congestion action
    HW W/A mechanism in the PCIe buffer which monitors the 
amount of
    consumed PCIe buffer per host.  This mechanism supports 
the

    following actions in case of threshold overflow:
    - Disabled - NOP (Default)
    - Drop
    - Mark - Mark CE bit in the CQE of received packet
    - Congestion mode
    - Aggressive - Aggressive static trigger threshold 
(Default)

    - Dynamic - Dynamically change the trigger threshold

3) mlx5 firmware snapshot support via devlink: (Alex Vesker)
Last three patches, add the support for capturing region snapshot 
of the

firmware crspace during critical errors, using devlink region_snapshot
parameter.

-Saeed.


Alex Vesker (3):
  net/mlx5: Add Vendor Specific Capability access gateway
  net/mlx5: Add Crdump FW snapshot support
  net/mlx5: Use devlink region_snapshot parameter

Eran Ben Elisha (3):
  net/mlx5: Move all devlink related functions calls to devlink.c
  net/mlx5: Add MPEGC register configuration functionality
  net/mlx5: Enable PCIe buffer congestion handling workaround 
via devlink


Moshe Shemesh (4):
  devlink: Fix param set handling for string type
  devlink: Fix param cmode driverinit for string type
  devlink: Add helper function for safely copy string param
  devlink: Add extack messages support to param set

 drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c  |   3 +-
 drivers/net/ethernet/mellanox/mlx4/main.c  |   6 +-
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   3 +-
 drivers/net/ethernet/mellanox/mlx5/core/devlink.c  | 388 
+

 drivers/net/ethernet/mellanox/mlx5/core/devlink.h  |  13 +
 .../net/ethernet/mellanox/mlx5/core/diag/crdump.c  | 223 
 drivers/net/ethernet/mellanox/mlx5/core/health.c   |   3 +
 drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h |   4 +
 .../net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c  | 320 
+

 .../net/ethernet/mellanox/mlx5/core/lib/pci_vsc.h  |  56 +++
 drivers/net/ethernet/mellanox/mlx5/core/main.c |  10 +-
 include/linux/mlx5/driver.h    |   5 +
 include/net/devlink.h  |  15 +-
 net/core/devlink.c |  44 ++-
 14 files changed, 1076 insertions(+), 17 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.c
 create mode 100644

Re: [Patch iproute2] ss: add UNIX_DIAG_VFS and UNIX_DIAG_ICONS for unix sockets

2018-08-29 Thread Stephen Hemminger

On Tue, 28 Aug 2018 16:16:49 -0700
Cong Wang  wrote:

> On Mon, Aug 27, 2018 at 3:27 PM Stephen Hemminger
>  wrote:
> >
> > On Mon, 27 Aug 2018 14:46:52 -0700
> > Cong Wang  wrote:
> >  
> > > UNIX_DIAG_VFS and UNIX_DIAG_ICONS are never used by ss,
> > > make them available in ss -e output.
> > >
> > > Cc: Stephen Hemminger 
> > > Signed-off-by: Cong Wang 
> > > ---
> > >  misc/ss.c | 25 +
> > >  1 file changed, 25 insertions(+)
> > >
> > > diff --git a/misc/ss.c b/misc/ss.c
> > > index 41e7762b..d28bc1ec 100644
> > > --- a/misc/ss.c
> > > +++ b/misc/ss.c
> > > @@ -16,6 +16,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include   
> >
> > Why is this included, it isn't on my system.  
> 
> It is for major() and minor().

Ok on Debian, these are in the architecture include, so this will work fine.

Re: [bpf-next PATCH 0/2] bpf: test_sockmap updates

2018-08-29 Thread Daniel Borkmann

On 08/28/2018 06:10 PM, John Fastabend wrote:
> Two small test sockmap updates for bpf-next. These help me run some
> additional tests with test_sockmap.

Applied to bpf-next, thanks!

Re: [PATCH bpf-next] bpf: remove duplicated include from syscall.c

2018-08-29 Thread Daniel Borkmann

On 08/28/2018 09:42 AM, YueHaibing wrote:
> Remove duplicated include.
> 
> Signed-off-by: YueHaibing 

Applied to bpf-next, thanks!

Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute

2018-08-29 Thread Marcel Holtmann

Hi Stephen,

>>> The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs
>>> file as DEVTYPE= information. To avoid any kind of race conditions
>>> between netlink messages and reading from sysfs, it is useful to add the
>>> same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK
>>> messages.
>>> 
>>> For network managing daemons that have to classify ARPHRD_ETHER network
>>> devices into different types (like Wireless LAN, Bluetooth etc.), this
>>> avoids the extra round trip to sysfs and parsing of the uevent file.
>>> 
>>> Signed-off-by: Marcel Holtmann 
>>> ---
>>> include/uapi/linux/if_link.h |  2 ++
>>> net/core/rtnetlink.c | 12 
>>> 2 files changed, 14 insertions(+)
>>> 
>>> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>>> index 43391e2d1153..781294972bb4 100644
>>> --- a/include/uapi/linux/if_link.h
>>> +++ b/include/uapi/linux/if_link.h
>>> @@ -166,6 +166,8 @@ enum {
>>> IFLA_NEW_IFINDEX,
>>> IFLA_MIN_MTU,
>>> IFLA_MAX_MTU,
>>> +   IFLA_DEVTYPE,   /* Name value from SET_NETDEV_DEVTYPE */  
>> 
>> This is not something netdev-related. dev->dev.type is struct device_type.
>> This is a generic "device" thing. Incorrect to expose over
>> netdev-specific API. Please use "device" API for this.
> 
> There is no device API in netlink. The whole point of this patch is to
> do it in one message. It might be a performance optimization, but I can't
> see how it could be a race condition. Devices set type before registering.

the only way right now to pick up the DEVTYPE= value is from the 
/sys/class/net/*/uevent file. That is based on the ifname and not the index. 
When udev + systemd start renaming things behind your back, your daemon does 
not have a clean one-shot way of getting that information. As stated, the 
information in DEVTYPE= are a sub-classification of ARPHRD_ETHER and allows to 
differentiate a wired Ethernet card from a WiFi interface, from a Bluetooth 
interface, from WiMAX and so on. They just happen to be in dev.dev_type data 
structure at the moment and I didn't want to duplicate that information.

I am actually fine doing this via IFLA_INFO_KIND since NetworkManager seems to 
be fixed now or do this via a different method or maybe just a different 
attribute name. I really just want to get the sub-classification of 
ARPHRD_ETHER that we need in userspace networking daemons from the kernel 
without having to go poke left and right over sysfs or interact with udev or 
systemd.

Regards

Marcel

Re: Kernel Panic on high bandwidth transfer over wifi

2018-08-29 Thread Eric Dumazet




On 08/29/2018 04:42 AM, Nathaniel Munk wrote:
> Hi all,
> I'm running Arch Linux on kernel 4.18.5 (same issue on both arch-provided 
> kernel and mainline built-from-source). There is an issue whereby the kernel 
> crashes when transferring at high bandwidths (approx 6mB/s) over a specific 
> wifi connection. I can only reproduce the issue when using the Personal 
> Hotspot on my iPhone 6S+, but can reproduce it very consistently on that 
> connection.
> 
> More often than not, any download reaching this speed will cause a panic, but 
> if the download is immediately terminated at the first error the system can 
> recover (and doing this I have obtained the attached logs). Unfortunately, I 
> have not had access to a second machine to obtain the netconsole printout of 
> the panic.
> 
> As above, high-bandwidth transfers on other wifi networks do not cause the 
> issue (nor on ethernet connections).
> 
> As you can see from the attached log, the issue appears at tcp_recvmsg+0x579 
> and net_tx_action+0x1fe. At both these positions (net/ipv4/tcp.c:2000 and 
> net/core/dev.c:4279 in mainline 4.18.5), a member of the skb struct is called.
> 
> Thank you for your time (and I apologize if this is spurious or badly worded, 
> this is my first bug report), and please don't hesitate to let me know if 
> there's anything else I can do to help work this out.
> 
> Regards,
> ---
> Nathaniel Munk
> nathan...@munk.com.au
> 

Unfortunately there is no attached log ;)

Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute

2018-08-29 Thread Stephen Hemminger

On Wed, 29 Aug 2018 09:18:55 +0200
Jiri Pirko  wrote:

> Tue, Aug 28, 2018 at 10:58:11PM CEST, mar...@holtmann.org wrote:
> >The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs
> >file as DEVTYPE= information. To avoid any kind of race conditions
> >between netlink messages and reading from sysfs, it is useful to add the
> >same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK
> >messages.
> >
> >For network managing daemons that have to classify ARPHRD_ETHER network
> >devices into different types (like Wireless LAN, Bluetooth etc.), this
> >avoids the extra round trip to sysfs and parsing of the uevent file.
> >
> >Signed-off-by: Marcel Holtmann 
> >---
> > include/uapi/linux/if_link.h |  2 ++
> > net/core/rtnetlink.c | 12 
> > 2 files changed, 14 insertions(+)
> >
> >diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> >index 43391e2d1153..781294972bb4 100644
> >--- a/include/uapi/linux/if_link.h
> >+++ b/include/uapi/linux/if_link.h
> >@@ -166,6 +166,8 @@ enum {
> > IFLA_NEW_IFINDEX,
> > IFLA_MIN_MTU,
> > IFLA_MAX_MTU,
> >+IFLA_DEVTYPE,   /* Name value from SET_NETDEV_DEVTYPE */  
> 
> This is not something netdev-related. dev->dev.type is struct device_type.
> This is a generic "device" thing. Incorrect to expose over
> netdev-specific API. Please use "device" API for this.

There is no device API in netlink. The whole point of this patch is to
do it in one message. It might be a performance optimization, but I can't
see how it could be a race condition. Devices set type before registering.

Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute

2018-08-29 Thread Marcel Holtmann

Hi Jiri,

>> The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs
>> file as DEVTYPE= information. To avoid any kind of race conditions
>> between netlink messages and reading from sysfs, it is useful to add the
>> same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK
>> messages.
>> 
>> For network managing daemons that have to classify ARPHRD_ETHER network
>> devices into different types (like Wireless LAN, Bluetooth etc.), this
>> avoids the extra round trip to sysfs and parsing of the uevent file.
>> 
>> Signed-off-by: Marcel Holtmann 
>> ---
>> include/uapi/linux/if_link.h |  2 ++
>> net/core/rtnetlink.c | 12 
>> 2 files changed, 14 insertions(+)
>> 
>> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>> index 43391e2d1153..781294972bb4 100644
>> --- a/include/uapi/linux/if_link.h
>> +++ b/include/uapi/linux/if_link.h
>> @@ -166,6 +166,8 @@ enum {
>>  IFLA_NEW_IFINDEX,
>>  IFLA_MIN_MTU,
>>  IFLA_MAX_MTU,
>> +IFLA_DEVTYPE,   /* Name value from SET_NETDEV_DEVTYPE */
> 
> This is not something netdev-related. dev->dev.type is struct device_type.
> This is a generic "device" thing. Incorrect to expose over
> netdev-specific API. Please use "device" API for this.

it is not just "device" related since this is a sub-classification of 
ARPHRD_ETHER type as a wrote above. Don't get hang up that this information is 
part of struct device. The dev->dev.type contains strings like "wlan", 
"bluetooth", "wimax", "gadget" etc. so that you can tell what kind of Ethernet 
card you have.

We can revive the patches for adding this information as IFLA_INFO_KIND, but 
last time they where reverted since NetworkManager did all the wrong decisions 
based on that. That part is fixed now and we can put that back and declare the 
sub-type classification of Ethernet device in two places. Or we just take the 
information that we added a long time ago for exactly this sub-classification 
and provide them to userspace via RTNL.

Regards

Marcel

mlx5 driver loading failing on v4.19 / net-next / bpf-next

2018-08-29 Thread Jesper Dangaard Brouer

Hi Saeed,

I'm having issues loading mlx5 driver on v4.19 kernels (tested both
net-next and bpf-next), while kernel v4.18 seems to work.  It happens
with a Mellanox ConnectX-5 NIC (and also a CX4-Lx but I removed that
from the system now).

One pain point is very long boot-time, caused by some timeout code in
the driver. The kernel console log (dmesg) says:

[5.763330] mlx5_core :03:00.0: firmware version: 16.22.1002
[5.769367] mlx5_core :03:00.0: 126.016 Gb/s available PCIe bandwidth, 
limited by 8 GT/s x16 link at :00:02.0 (capable of 252.048 Gb/s with 16 
GT/s x16 link)

(...) other drivers loading

[   66.816635] mlx5_core :03:00.0: wait_func:964:(pid 112): 
ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource
[   66.828123] mlx5_core :03:00.0: enable hca failed
[   66.845516] mlx5_core :03:00.0: mlx5_load_one failed with error code -110
[   66.852802] mlx5_core: probe of :03:00.0 failed with error -110

[   66.859347] mlx5_core :03:00.1: firmware version: 16.22.1002
[   66.865388] mlx5_core :03:00.1: 126.016 Gb/s available PCIe bandwidth, 
limited by 8 GT/s x16 link at :00:02.0 (capable of 252.048 Gb/s with 16 
GT/s x16 link)

[  125.787395] XFS (sda3): Mounting V5 Filesystem
[  125.848509] XFS (sda3): Ending clean mount
[  127.984784] mlx5_core :03:00.1: wait_func:964:(pid 5): ENABLE_HCA(0x104) 
timeout. Will cause a leak of a command resource
[  127.996090] mlx5_core :03:00.1: enable hca failed
[  128.013819] mlx5_core :03:00.1: mlx5_load_one failed with error code -110
[  128.021076] mlx5_core: probe of :03:00.1 failed with error -110


Do you have any idea what could be causing this?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [PATCH v2 iproute2-next 2/5] bridge: colorize output and use JSON print library

2018-08-29 Thread Stephen Hemminger

On Sat, 14 Jul 2018 18:41:03 -0700
Roopa Prabhu  wrote:

> On Tue, Feb 20, 2018 at 11:24 AM, Stephen Hemminger
>  wrote:
> > From: Stephen Hemminger 
> >
> > Use new functions from json_print to simplify code.
> > Provide standard flag for colorizing output.
> >
> > The shortened -c flag is ambiguous it could mean color or
> > compressvlan; it is now changed to mean color for consistency
> > with other iproute2 commands.
> >
> > Signed-off-by: Stephen Hemminger 
> > ---
> >  bridge/br_common.h |   2 +-
> >  bridge/bridge.c|  10 +-
> >  bridge/fdb.c   | 281 +++--
> >  bridge/mdb.c   | 362 
> > ++---
> >  bridge/vlan.c  | 276 +++-
> >  5 files changed, 363 insertions(+), 568 deletions(-)
> >
> > diff --git a/bridge/br_common.h b/bridge/br_common.h
> > index b25f61e50e05..2f1cb8fd9f3d 100644
> > --- a/bridge/br_common.h
> > +++ b/bridge/br_common.h
> > @@ -6,7 +6,7 @@
> >  #define MDB_RTR_RTA(r) \
> > ((struct rtattr *)(((char *)(r)) + 
> > RTA_ALIGN(sizeof(__u32
> >
> > -extern void print_vlan_info(FILE *fp, struct rtattr *tb, int ifindex);
> > +extern void print_vlan_info(FILE *fp, struct rtattr *tb);
> >  extern int print_linkinfo(const struct sockaddr_nl *who,
> >   struct nlmsghdr *n,
> >   void *arg);
> > diff --git a/bridge/bridge.c b/bridge/bridge.c
> > index 4b112e3b8da9..e5b4c3c2198f 100644
> > --- a/bridge/bridge.c
> > +++ b/bridge/bridge.c
> > @@ -16,12 +16,15 @@
> >  #include "utils.h"
> >  #include "br_common.h"
> >  #include "namespace.h"
> > +#include "color.h"
> >
> >  struct rtnl_handle rth = { .fd = -1 };
> >  int preferred_family = AF_UNSPEC;
> >  int oneline;
> >  int show_stats;
> >  int show_details;
> > +int show_pretty;
> > +int color;
> >  int compress_vlans;
> >  int json;
> >  int timestamp;
> > @@ -39,7 +42,7 @@ static void usage(void)
> >  "where OBJECT := { link | fdb | mdb | vlan | monitor }\n"
> >  "  OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] |\n"
> >  "   -o[neline] | -t[imestamp] | -n[etns] name |\n"
> > -"   -c[ompressvlans] -p[retty] -j{son} }\n");
> > +"   -c[ompressvlans] -color -p[retty] -j{son} }\n");
> > exit(-1);
> >  }
> >
> > @@ -170,6 +173,8 @@ main(int argc, char **argv)
> > NEXT_ARG();
> > if (netns_switch(argv[1]))
> > exit(-1);
> > +   } else if (matches(opt, "-color") == 0) {
> > +   enable_color();
> > } else if (matches(opt, "-compressvlans") == 0) {
> > ++compress_vlans;
> > } else if (matches(opt, "-force") == 0) {
> > @@ -195,6 +200,9 @@ main(int argc, char **argv)
> >
> > _SL_ = oneline ? "\\" : "\n";
> >
> > +   if (json)
> > +   check_if_color_enabled();
> > +
> > if (batch_file)
> > return batch(batch_file);
> >
> > diff --git a/bridge/fdb.c b/bridge/fdb.c
> > index 93b5b2e694e3..b4f6e8b3a01b 100644
> > --- a/bridge/fdb.c
> > +++ b/bridge/fdb.c
> > @@ -22,9 +22,9 @@
> >  #include 
> >  #include 
> >  #include 
> > -#include 
> >  #include 
> >
> > +#include "json_print.h"
> >  #include "libnetlink.h"
> >  #include "br_common.h"
> >  #include "rt_names.h"
> > @@ -32,8 +32,6 @@
> >
> >  static unsigned int filter_index, filter_vlan, filter_state;
> >
> > -json_writer_t *jw_global;
> > -
> >  static void usage(void)
> >  {
> > fprintf(stderr,
> > @@ -83,13 +81,46 @@ static int state_a2n(unsigned int *s, const char *arg)
> > return 0;
> >  }
> >
> > -static void start_json_fdb_flags_array(bool *fdb_flags)
> > +static void fdb_print_flags(FILE *fp, unsigned int flags)
> > +{
> > +   open_json_array(PRINT_JSON,
> > +   is_json_context() ?  "flags" : "");
> > +
> > +   if (flags & NTF_SELF)
> > +   print_string(PRINT_ANY, NULL, "%s ", "self");
> > +
> > +   if (flags & NTF_ROUTER)
> > +   print_string(PRINT_ANY, NULL, "%s ", "router");
> > +
> > +   if (flags & NTF_EXT_LEARNED)
> > +   print_string(PRINT_ANY, NULL, "%s ", "extern_learn");
> > +
> > +   if (flags & NTF_OFFLOADED)
> > +   print_string(PRINT_ANY, NULL, "%s ", "offload");
> > +
> > +   if (flags & NTF_MASTER)
> > +   print_string(PRINT_ANY, NULL, "%s ", "master");
> > +
> > +   close_json_array(PRINT_JSON, NULL);
> > +}
> > +
> > +static void fdb_print_stats(FILE *fp, const struct nda_cacheinfo *ci)
> >  {
> > -   if (*fdb_flags)
> > -   return;
> > -   jsonw_name(jw_global, "flags");
> > -   jsonw_start_array(jw_global);
> > -   *fdb_flags = true;
> > +   static int hz;
> > +
> > +   if (!hz)
> > +   hz = get_user_hz();
> > +
> > +   if

Fw: [Bug 200943] New: Repeating tcp_mark_head_lost in dmesg

2018-08-29 Thread Stephen Hemminger

Begin forwarded message:

Date: Sun, 26 Aug 2018 22:24:12 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 200943] New: Repeating tcp_mark_head_lost in dmesg

https://bugzilla.kernel.org/show_bug.cgi?id=200943

Bug ID: 200943
   Summary: Repeating tcp_mark_head_lost in dmesg
   Product: Networking
   Version: 2.5
Kernel Version: 4.14.66
  Hardware: All
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: IPV4
  Assignee: step...@networkplumber.org
  Reporter: rm+...@romanrm.net
Regression: No

Getting a bunch of these now every hour during continuous ~100 Mbit of network
traffic.
What's up with that? Seems harmless, as in the kernel doesn't crash and the
network connection is not interrupted. (Maybe the particular TCP session is?)
If there are no ill-effects from this condition, is such spammy WARN_ON really
necessary?

[Mon Aug 27 02:16:11 2018] [ cut here ]
[Mon Aug 27 02:16:11 2018] WARNING: CPU: 5 PID: 0 at net/ipv4/tcp_input.c:2263
tcp_mark_head_lost+0x247/0x260
[Mon Aug 27 02:16:11 2018] Modules linked in: dm_snapshot loop vhost_net vhost
tap tun ip6t_MASQUERADE nf_nat_masquerade_ipv6 ipt_MASQUERADE
nf_nat_masquerade_ipv4 xt_DSCP xt_mark ip6t_REJECT nf_reject_ipv6 ipt_REJECT
nf_reject_ipv4 xt_owner xt_tcpudp xt_set ip_set_hash_net ip_set nfnetlink
xt_limit xt_length xt_multiport xt_conntrack ip6t_rpfilter ipt_rpfilter
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw
wireguard ip6_udp_tunnel udp_tunnel ip6table_mangle iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw
iptable_mangle ip6table_filter ip6_tables matroxfb_base matroxfb_g450
matroxfb_Ti3026 matroxfb_accel matroxfb_DAC1064 g450_pll matroxfb_misc
iptable_filter ip_tables x_tables cpufreq_powersave cpufreq_userspace
cpufreq_conservative 8021q garp mrp
[Mon Aug 27 02:16:11 2018]  bridge stp llc bonding tcp_bbr sch_fq tcp_illinois
fuse radeon ttm drm_kms_helper drm i2c_algo_bit it87 hwmon_vid eeepc_wmi
asus_wmi sparse_keymap rfkill video wmi_bmof mxm_wmi edac_mce_amd kvm_amd kvm
snd_pcm snd_timer snd soundcore joydev evdev pcspkr k10temp fam15h_power
sp5100_tco sg shpchp wmi pcc_cpufreq acpi_cpufreq button ext4 crc16 mbcache
jbd2 fscrypto btrfs zstd_decompress zstd_compress xxhash algif_skcipher af_alg
dm_crypt dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_mod
hid_generic usbhid hid raid10 raid456 async_raid6_recov async_memcpy async_pq
async_xor async_tx xor sd_mod raid6_pq libcrc32c crc32c_generic raid1 raid0
multipath linear md_mod vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio
ohci_pci crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
[Mon Aug 27 02:16:11 2018]  pcbc aesni_intel aes_x86_64 crypto_simd glue_helper
cryptd r8169 ahci xhci_pci libahci ohci_hcd ehci_pci mii xhci_hcd ehci_hcd
i2c_piix4 libata usbcore scsi_mod bnx2
[Mon Aug 27 02:16:11 2018] CPU: 5 PID: 0 Comm: swapper/5 Tainted: GW   
   4.14.66-rm1+ #132
[Mon Aug 27 02:16:11 2018] Hardware name: To be filled by O.E.M. To be filled
by O.E.M./SABERTOOTH 990FX R2.0, BIOS 2901 05/04/2016
[Mon Aug 27 02:16:11 2018] task: 8ba79c679dc0 task.stack: b4d741928000
[Mon Aug 27 02:16:11 2018] RIP: 0010:tcp_mark_head_lost+0x247/0x260
[Mon Aug 27 02:16:11 2018] RSP: 0018:8ba7aed437d8 EFLAGS: 00010202
[Mon Aug 27 02:16:11 2018] RAX: 0018 RBX: 8ba3901a0800 RCX:

[Mon Aug 27 02:16:11 2018] RDX: 0017 RSI: 0001 RDI:
8ba4d47e9000
[Mon Aug 27 02:16:11 2018] RBP: 8ba4d47e9000 R08: 000d R09:

[Mon Aug 27 02:16:11 2018] R10: 100c R11:  R12:
0001
[Mon Aug 27 02:16:11 2018] R13: 8ba4d47e9158 R14: 0001 R15:
9d0b6708
[Mon Aug 27 02:16:11 2018] FS:  ()
GS:8ba7aed4() knlGS:
[Mon Aug 27 02:16:11 2018] CS:  0010 DS:  ES:  CR0: 80050033
[Mon Aug 27 02:16:11 2018] CR2: 01fbdff0 CR3: 00040c8d6000 CR4:
000406e0
[Mon Aug 27 02:16:11 2018] Call Trace:
[Mon Aug 27 02:16:11 2018]  
[Mon Aug 27 02:16:11 2018]  tcp_fastretrans_alert+0x5c3/0xa20
[Mon Aug 27 02:16:11 2018]  tcp_ack+0x95a/0x1170
[Mon Aug 27 02:16:11 2018]  ? __slab_free.isra.70+0x79/0x200
[Mon Aug 27 02:16:11 2018]  tcp_rcv_established+0x16a/0x5a0
[Mon Aug 27 02:16:11 2018]  ? tcp_v4_inbound_md5_hash+0x76/0x1e0
[Mon Aug 27 02:16:11 2018]  tcp_v4_do_rcv+0x130/0x1f0
[Mon Aug 27 02:16:11 2018]  tcp_v4_rcv+0x9ac/0xaa0
[Mon Aug 27 02:16:11 2018]  ip_local_deliver_finish+0x9a/0x1c0
[Mon Aug 27 02:16:11 2018]  ip_local_deliver+0x6b/0xe0
[Mon Aug 27 02:16:11 2018]  ? ip_rcv_finish+0x440/0x440
[Mon Aug 27 02:16:11 2018]  ip_rcv+0x2b0/0x3c0
[Mon Aug 27 02:16:11 2018]  ?

Fw: [Bug 200967] New: No network with U.S. Robotics USR997902

2018-08-29 Thread Stephen Hemminger

Begin forwarded message:

Date: Wed, 29 Aug 2018 04:36:20 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 200967] New: No network with U.S. Robotics USR997902

https://bugzilla.kernel.org/show_bug.cgi?id=200967

Bug ID: 200967
   Summary: No network with U.S. Robotics USR997902
   Product: Networking
   Version: 2.5
Kernel Version: 4.18.5
  Hardware: x86-64
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: IPV4
  Assignee: step...@networkplumber.org
  Reporter: kyri...@alumni.princeton.edu
Regression: No

Created attachment 278193
  --> https://bugzilla.kernel.org/attachment.cgi?id=278193=edit  
dmesg and hwinfo output

(I am reporting this upstream as instructed by OpenSUSE, where I had originally
reported this problem.)

After upgrading to kernel 4.18 (originally noticed the problem with kernel
4.18.0, the problem persists with kernel 4.18.5), I could no longer connect to
the network.

A similar machine, which I upgraded at the same time, had no problem, so I
thought that the problem might be due to my using a network card instead of the
motherboard's built-in network controller. Sure enough, configuring the
built-in controller and connecting that to the network worked fine.

According to lspci, the card that doesn't work with the new kernel is:
U.S. Robotics USR997902 10/100/1000 Mbps PCI Network Card (rev 10)

Ifconfig shows that the network controller has been recognized as such, but
that it has not obtained an IP address.

The problem did not occur with kernel 4.17.14.

I have attached the output of hwinfo and dmesg for kernels 4.17.14 and 4.18.5,
with the network cable connected to the U.S. Robotics controller (enp5s0).

-- 
You are receiving this mail because:
You are the assignee for the bug.

[PATCH 2/2] net: ethernet: cpsw-phy-sel: prefer phandle for phy sel

2018-08-29 Thread Tony Lindgren

The cpsw-phy-sel device is not a child of the cpsw interconnect target
module. It lives in the system control module.

Let's fix this issue by trying to use cpsw-phy-sel phandle first if it
exists and if not fall back to current usage of trying to find the
cpsw-phy-sel child. That way the phy sel driver can be a child of the
system control module where it belongs in the device tree.

Without this fix, we cannot have a proper interconnect target module
hierarchy in device tree for things like genpd.

Note that deferred probe is mostly not supported by cpsw and this patch
does not attempt to fix that. In case deferred probe support is needed,
this could be added to cpsw_slave_open() and phy_connect() so they start
handling and returning errors.

For documenting it, looks like the cpsw-phy-sel is used for all cpsw device
tree nodes. It's missing the related binding documentation, so let's also
update the binding documentation accordingly.

Cc: devicet...@vger.kernel.org
Cc: Andrew Lunn 
Cc: Grygorii Strashko 
Cc: Ivan Khoronzhuk 
Cc: Mark Rutland 
Cc: Murali Karicheri 
Cc: Rob Herring 
Signed-off-by: Tony Lindgren 
---
 drivers/net/ethernet/ti/cpsw-phy-sel.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw-phy-sel.c 
b/drivers/net/ethernet/ti/cpsw-phy-sel.c
--- a/drivers/net/ethernet/ti/cpsw-phy-sel.c
+++ b/drivers/net/ethernet/ti/cpsw-phy-sel.c
@@ -170,10 +170,13 @@ void cpsw_phy_sel(struct device *dev, phy_interface_t 
phy_mode, int slave)
struct device_node *node;
struct cpsw_phy_sel_priv *priv;
 
-   node = of_get_child_by_name(dev->of_node, "cpsw-phy-sel");
+   node = of_parse_phandle(dev->of_node, "cpsw-phy-sel", 0);
if (!node) {
-   dev_err(dev, "Phy mode driver DT not found\n");
-   return;
+   node = of_get_child_by_name(dev->of_node, "cpsw-phy-sel");
+   if (!node) {
+   dev_err(dev, "Phy mode driver DT not found\n");
+   return;
+   }
}
 
dev = bus_find_device(_bus_type, NULL, node, match);
-- 
2.18.0

[PATCH 1/2] dt-bindings: net: cpsw: Document cpsw-phy-sel usage but prefer phandle

2018-08-29 Thread Tony Lindgren

The current cpsw usage for cpsw-phy-sel is undocumented but is used for
all the boards using cpsw. And cpsw-phy-sel is not really a child of
the cpsw device, it lives in the system control module instead.

Let's document the existing usage, and improve it a bit where we prefer
to use a phandle instead of a child device for it. That way we can
properly describe the hardware in dts files for things like genpd.

Cc: devicet...@vger.kernel.org
Cc: Andrew Lunn 
Cc: Grygorii Strashko 
Cc: Ivan Khoronzhuk 
Cc: Mark Rutland 
Cc: Murali Karicheri 
Cc: Rob Herring 
Signed-off-by: Tony Lindgren 
---
 Documentation/devicetree/bindings/net/cpsw.txt | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/cpsw.txt 
b/Documentation/devicetree/bindings/net/cpsw.txt
--- a/Documentation/devicetree/bindings/net/cpsw.txt
+++ b/Documentation/devicetree/bindings/net/cpsw.txt
@@ -19,6 +19,10 @@ Required properties:
 - slaves   : Specifies number for slaves
 - active_slave : Specifies the slave to use for time stamping,
  ethtool and SIOCGMIIPHY
+- cpsw-phy-sel : Specifies the phandle to the CPSW phy mode selection
+ device. See also cpsw-phy-sel.txt for it's binding.
+ Note that in legacy cases cpsw-phy-sel may be
+ a child device instead of a phandle.
 
 Optional properties:
 - ti,hwmods: Must be "cpgmac0"
@@ -75,6 +79,7 @@ Examples:
cpts_clock_mult = <0x8000>;
cpts_clock_shift = <29>;
syscon = <>;
+   cpsw-phy-sel = <_sel>;
cpsw_emac0: slave@0 {
phy_id = <_mdio>, <0>;
phy-mode = "rgmii-txid";
@@ -103,6 +108,7 @@ Examples:
cpts_clock_mult = <0x8000>;
cpts_clock_shift = <29>;
syscon = <>;
+   cpsw-phy-sel = <_sel>;
cpsw_emac0: slave@0 {
phy_id = <_mdio>, <0>;
phy-mode = "rgmii-txid";
-- 
2.18.0

[PATCH bpf 3/3] bpf: fix sg shift repair start offset in bpf_msg_pull_data

2018-08-29 Thread Daniel Borkmann

When we perform the sg shift repair for the scatterlist ring, we
currently start out at i = first_sg + 1. However, this is not
correct since the first_sg could point to the sge sitting at slot
MAX_SKB_FRAGS - 1, and a subsequent i = MAX_SKB_FRAGS will access
the scatterlist ring (sg) out of bounds. Add the sk_msg_iter_var()
helper for iterating through the ring, and apply the same rule
for advancing to the next ring element as we do elsewhere. Later
work will use this helper also in other places.

Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Daniel Borkmann 
Acked-by: John Fastabend 
---
 net/core/filter.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 43ba5f8..2c7801f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2282,6 +2282,13 @@ static const struct bpf_func_proto 
bpf_msg_cork_bytes_proto = {
.arg2_type  = ARG_ANYTHING,
 };
 
+#define sk_msg_iter_var(var)   \
+   do {\
+   var++;  \
+   if (var == MAX_SKB_FRAGS)   \
+   var = 0;\
+   } while (0)
+
 BPF_CALL_4(bpf_msg_pull_data,
   struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
 {
@@ -2302,9 +2309,7 @@ BPF_CALL_4(bpf_msg_pull_data,
if (start < offset + len)
break;
offset += len;
-   i++;
-   if (i == MAX_SKB_FRAGS)
-   i = 0;
+   sk_msg_iter_var(i);
} while (i != msg->sg_end);
 
if (unlikely(start >= offset + len))
@@ -2330,9 +2335,7 @@ BPF_CALL_4(bpf_msg_pull_data,
 */
do {
copy += sg[i].length;
-   i++;
-   if (i == MAX_SKB_FRAGS)
-   i = 0;
+   sk_msg_iter_var(i);
if (bytes_sg_total <= copy)
break;
} while (i != msg->sg_end);
@@ -2358,9 +2361,7 @@ BPF_CALL_4(bpf_msg_pull_data,
sg[i].length = 0;
put_page(sg_page([i]));
 
-   i++;
-   if (i == MAX_SKB_FRAGS)
-   i = 0;
+   sk_msg_iter_var(i);
} while (i != last_sg);
 
sg[first_sg].length = copy;
@@ -2377,7 +2378,8 @@ BPF_CALL_4(bpf_msg_pull_data,
if (!shift)
goto out;
 
-   i = first_sg + 1;
+   i = first_sg;
+   sk_msg_iter_var(i);
do {
int move_from;
 
@@ -2394,9 +2396,7 @@ BPF_CALL_4(bpf_msg_pull_data,
sg[move_from].page_link = 0;
sg[move_from].offset = 0;
 
-   i++;
-   if (i == MAX_SKB_FRAGS)
-   i = 0;
+   sk_msg_iter_var(i);
} while (1);
msg->sg_end -= shift;
if (msg->sg_end < 0)
-- 
2.9.5

[PATCH bpf 0/3] Three fixes for bpf_msg_pull_data

2018-08-29 Thread Daniel Borkmann

This set contains three more fixes for the bpf_msg_pull_data()
mainly for correcting scatterlist ring wrap-arounds as well as
fixing up data pointers. For details please see individual patches.
Thanks!

Daniel Borkmann (3):
  bpf: fix msg->data/data_end after sg shift repair in bpf_msg_pull_data
  bpf: fix shift upon scatterlist ring wrap-around in bpf_msg_pull_data
  bpf: fix sg shift repair start offset in bpf_msg_pull_data

 net/core/filter.c | 36 +++-
 1 file changed, 19 insertions(+), 17 deletions(-)

-- 
2.9.5

[PATCH bpf 2/3] bpf: fix shift upon scatterlist ring wrap-around in bpf_msg_pull_data

2018-08-29 Thread Daniel Borkmann

If first_sg and last_sg wraps around in the scatterlist ring, then we
need to account for that in the shift as well. E.g. crafting such msgs
where this is the case leads to a hang as shift becomes negative. E.g.
consider the following scenario:

  first_sg := 14 |=>shift := -12 msg->sg_start := 10
  last_sg  :=  3 |   msg->sg_end   :=  5

round  1:  i := 15, move_from :=   3, sg[15] := sg[  3]
round  2:  i :=  0, move_from := -12, sg[ 0] := sg[-12]
round  3:  i :=  1, move_from := -11, sg[ 1] := sg[-11]
round  4:  i :=  2, move_from := -10, sg[ 2] := sg[-10]
[...]
round 13:  i := 11, move_from :=  -1, sg[ 2] := sg[ -1]
round 14:  i := 12, move_from :=   0, sg[ 2] := sg[  0]
round 15:  i := 13, move_from :=   1, sg[ 2] := sg[  1]
round 16:  i := 14, move_from :=   2, sg[ 2] := sg[  2]
round 17:  i := 15, move_from :=   3, sg[ 2] := sg[  3]
[...]

This means we will loop forever and never hit the msg->sg_end condition
to break out of the loop. When we see that the ring wraps around, then
the shift should be MAX_SKB_FRAGS - first_sg + last_sg - 1. Meaning,
the remainder slots from the tail of the ring and the head until last_sg
combined.

Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Daniel Borkmann 
Acked-by: John Fastabend 
---
 net/core/filter.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index b9225c5..43ba5f8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2370,7 +2370,10 @@ BPF_CALL_4(bpf_msg_pull_data,
 * had a single entry though we can just replace it and
 * be done. Otherwise walk the ring and shift the entries.
 */
-   shift = last_sg - first_sg - 1;
+   WARN_ON_ONCE(last_sg == first_sg);
+   shift = last_sg > first_sg ?
+   last_sg - first_sg - 1 :
+   MAX_SKB_FRAGS - first_sg + last_sg - 1;
if (!shift)
goto out;
 
-- 
2.9.5

1 2 >

1 - 100 of 128 matches

Mail list logo