Re: [PATCH net-next] ip: silence udp zerocopy smatch false positive

2018-12-08 Thread David Miller

From: Willem de Bruijn 
Date: Sat,  8 Dec 2018 06:22:46 -0500

> From: Willem de Bruijn 
> 
> extra_uref is used in __ip(6)_append_data only if uarg is set.
> 
> Smatch sees that the variable is passed to sock_zerocopy_put_abort.
> This function accesses it only when uarg is set, but smatch cannot
> infer this.
> 
> Make this dependency explicit.
> 
> Fixes: 52900d22288e ("udp: elide zerocopy operation in hot path")
> Signed-off-by: Willem de Bruijn 

I looked and can't figure out a better way to fix this :)

Applied, thanks Willem.

Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread David Miller

From: Ilias Apalodimas 
Date: Sat, 8 Dec 2018 16:57:28 +0200

> The patchset speeds up the mvneta driver on the default network
> stack. The only change that was needed was to adapt the driver to
> using the page_pool API. The speed improvements we are seeing on
> specific workloads (i.e 256b < packet < 400b) are almost 3x.
> 
> Lots of high speed drivers are doing similar recycling tricks themselves (and
> there's no common code, everyone is doing something similar though). All we
> are trying to do is provide a unified API to make that easier for the rest.
> Another advantage is that if some drivers switch to the API, adding XDP
> functionality on them is pretty trivial.

Yeah this is a very important point moving forward.

Jesse Brandeburg brought the following up to me at LPC and I'd like to
develop it further.

Right now we tell driver authors to write a new driver as SKB based,
and once they've done all of that work we tell them to basically
shoe-horn XDP support into that somewhat different framework.

Instead, the model should be the other way around, because with a raw
meta-data free set of data buffers we can always construct an SKB or
pass it to XDP.

So drivers should be targeting some raw data buffer kind of interface
which takes care of all of this stuff.  If the buffers get wrapped
into an SKB and get pushed into the traditional networking stack, the
driver shouldn't know or care.  Likewise if it ends up being processed
with XDP, it should not need to know or care.

All of those details should be behind a common layer.  Then we can
control:

1) Buffer handling, recycling, "fast paths"

2) Statistics

3) XDP feature sets

We can consolidate behavior and semantics across all of the drivers
if we do this.  No more talk about "supporting all XDP features",
and the inconsistencies we have because of that.

The whole common statistics discussion could be resolved with this
common layer as well.

We'd be able to control and properly optimize everything.
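
As a rough illustration of the layering described above, here is a small userspace sketch in C. Every name in it (raw_buf, rx_layer, common_rx, drop_runts) is hypothetical and invented for this example; none of it is an existing kernel interface. It only shows the shape of the idea: the driver hands over a metadata-free buffer, and the common layer owns the statistics and decides whether the buffer is dropped by an XDP-like hook or passed on towards the stack.

```c
#include <stddef.h>

/* Hypothetical: a metadata-free receive buffer as a driver would hand it up. */
struct raw_buf {
	unsigned char *data;
	size_t len;
};

enum rx_verdict { RX_DROP, RX_TO_STACK };

/* Hypothetical per-device state owned by the common layer, not the driver. */
struct rx_layer {
	/* Optional XDP-like hook; NULL means no program is attached. */
	enum rx_verdict (*prog)(const struct raw_buf *buf);
	unsigned long rx_packets;	/* statistics kept in one place */
	unsigned long rx_dropped;
	unsigned long to_stack;		/* buffers that would be wrapped in an SKB */
};

/* The only entry point a driver would call; it does not know which path runs. */
static void common_rx(struct rx_layer *l, const struct raw_buf *buf)
{
	l->rx_packets++;
	if (l->prog && l->prog(buf) == RX_DROP) {
		l->rx_dropped++;	/* the buffer could be recycled here */
		return;
	}
	l->to_stack++;			/* stand-in for building an SKB */
}

/* Example hook: drop anything shorter than a minimal Ethernet header. */
static enum rx_verdict drop_runts(const struct raw_buf *buf)
{
	return buf->len < 14 ? RX_DROP : RX_TO_STACK;
}
```

A driver written against such a layer would call common_rx() for every received buffer and never touch the counters or the hook itself.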

Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread David Miller

From: Jesper Dangaard Brouer 
Date: Sat, 8 Dec 2018 12:36:10 +0100

> The annoying part is actually that whether there is a cache-line split,
> where mem_info gets moved into the next cacheline, depends on the kernel
> config options CONFIG_XFRM, CONFIG_NF_CONNTRACK and
> CONFIG_BRIDGE_NETFILTER.

Note that Florian Westphal's work (trying to help MP-TCP) would
eliminate this variability.

Re: [net-next PATCH RFC 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-07 Thread David Miller

From: Jesper Dangaard Brouer 
Date: Fri, 07 Dec 2018 00:25:47 +0100

> @@ -744,6 +745,10 @@ struct sk_buff {
>   head_frag:1,
>   xmit_more:1,
>   pfmemalloc:1;
> + /* TODO: Future idea, extend mem_info with __u8 flags, and
> +  * move bits head_frag and pfmemalloc there.
> +  */
> + struct xdp_mem_info mem_info;

This is 4 bytes right?

I guess I can live with this.

Please do some microbenchmarks to make sure this doesn't show any
obvious regressions.

Thanks.

Re: [net-next PATCH RFC 1/8] page_pool: add helper functions for DMA

2018-12-07 Thread David Miller

From: Jesper Dangaard Brouer 
Date: Fri, 07 Dec 2018 00:25:32 +0100

> From: Ilias Apalodimas 
> 
> Add helper functions for retrieving dma_addr_t stored in page_private and
> unmapping dma addresses, mapped via the page_pool API.
> 
> Signed-off-by: Ilias Apalodimas 
> Signed-off-by: Jesper Dangaard Brouer 

This isn't going to work on 32-bit platforms where dma_addr_t is a u64,
because the page private is unsigned long.

Grep for PHYS_ADDR_T_64BIT under arch/ to see the vast majority of the
cases where this happens, then ARCH_DMA_ADDR_T_64BIT.
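
The truncation concern can be shown with a small userspace model. This is only a sketch: the fixed-width typedefs below are stand-ins chosen to mimic a 32-bit platform with a 64-bit dma_addr_t, and fake_page is a hypothetical structure, not struct page.

```c
#include <stdint.h>

/* Assumption: models a 32-bit platform where unsigned long is 32 bits
 * but dma_addr_t is 64 bits (for example with ARCH_DMA_ADDR_T_64BIT set). */
typedef uint32_t model_ulong;	/* stand-in for a 32-bit unsigned long */
typedef uint64_t model_dma_addr;	/* stand-in for a 64-bit dma_addr_t */

/* Hypothetical page with a private field as wide as unsigned long. */
struct fake_page {
	model_ulong private_field;
};

static void store_dma(struct fake_page *p, model_dma_addr addr)
{
	p->private_field = (model_ulong)addr;	/* upper 32 bits are silently lost */
}

static model_dma_addr load_dma(const struct fake_page *p)
{
	return (model_dma_addr)p->private_field;
}

/* Returns 1 if the address survives the round trip, 0 if it was truncated. */
static int dma_round_trips(model_dma_addr addr)
{
	struct fake_page p;

	store_dma(&p, addr);
	return load_dma(&p) == addr;
}
```

Any address at or above 4 GiB fails the round trip in this model, which is the failure mode being pointed out.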

Re: [PATCH] Revert "net/ibm/emac: wrong bit is used for STA control"

2018-12-07 Thread David Miller

From: Benjamin Herrenschmidt 
Date: Fri, 07 Dec 2018 15:05:04 +1100

> This reverts commit 624ca9c33c8a853a4a589836e310d776620f4ab9.
> 
> This commit is completely bogus. The STACR register has two formats, old
> and new, depending on the version of the IP block used. There's a pair of
> device-tree properties that can be used to specify the format used:
> 
>   has-inverted-stacr-oc
>   has-new-stacr-staopc
> 
> What this commit did was to change the bit definition used with the old
> parts to match the new parts. This of course breaks the driver on all
> the old ones.
> 
> Instead, the author should have set the appropriate properties in the
> device-tree for the variant used on his board.
> 
> Signed-off-by: Benjamin Herrenschmidt 
> ---
> 
> Found while setting up some old ppc440 boxes for test/CI

Applied, thanks.

Re: [PATCH] net-udp: deprioritize cpu match for udp socket lookup

2018-12-07 Thread David Miller

From: Maciej Żenczykowski 
Date: Fri, 7 Dec 2018 16:46:36 -0800

>> This doesn't apply to the current net tree.
>>
>> Also "net-udp: " is a weird subsystem prefix, just use "udp: ".
>>
>> Thank you.
> 
> Interesting... this patch was on top of net-next/master, and it still
> rebases cleanly on current net-next/master.
> 
> Would you like it on net/master instead?  It indeed doesn't apply
> cleanly there...

Well, it is a bug fix isn't it?  Or is this more like a behavioral feature?

Re: [PATCH net-next 0/4] tc-testing: implement command timeouts and better results tracking

2018-12-07 Thread David Miller

From: Lucas Bates 
Date: Thu,  6 Dec 2018 17:42:23 -0500

> Patch 1 adds a timeout feature for any command tdc launches in a subshell.
> This prevents tdc from hanging indefinitely.
> 
> Patches 2-4 introduce a new method for tracking and generating test case
> results, and implements it across the core script and all applicable
> plugins.

Series applied.

Re: [PATCH net v2 0/2] Fix slab out-of-bounds on insufficient headroom for IPv6 packets

2018-12-07 Thread David Miller

From: Stefano Brivio 
Date: Thu,  6 Dec 2018 19:30:35 +0100

> Patch 1/2 fixes a slab out-of-bounds occurring with short SCTP packets over
> IPv4 over L2TP over IPv6 on a configuration with relatively low HEADER_MAX.
> 
> Patch 2/2 makes sure we avoid writing before the allocated buffer in
> neigh_hh_output() in case the headroom is enough for the unaligned hardware
> header size, but not enough for the aligned one, and that we warn if we hit
> this condition.

Series applied and queued up for -stable, thanks.
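
The condition described for patch 2/2 can be sketched as follows. The 16-byte modulus is an assumption about how cached hardware headers are aligned, and both helper names are hypothetical.

```c
#include <stddef.h>

#define HH_MOD 16	/* assumed alignment modulus for cached hardware headers */

/* Round a hardware header length up to the next multiple of HH_MOD. */
static size_t hh_align(size_t len)
{
	return (len + (HH_MOD - 1)) & ~(size_t)(HH_MOD - 1);
}

/* Returns 1 when copying the aligned header would start before the buffer:
 * the headroom covers the raw header length but not the aligned one. */
static int hh_would_underflow(size_t headroom, size_t hh_len)
{
	return headroom >= hh_len && headroom < hh_align(hh_len);
}
```

With a 14-byte Ethernet header, 14 or 15 bytes of headroom fall into the problematic window, while 16 bytes is enough.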

Re: [PATCH net] tcp: lack of available data can also cause TSO defer

2018-12-07 Thread David Miller

From: Eric Dumazet 
Date: Thu,  6 Dec 2018 09:58:24 -0800

> tcp_tso_should_defer() can return true in three different cases :
> 
>  1) We are cwnd-limited
>  2) We are rwnd-limited
>  3) We are application limited.
> 
> Neal pointed out that my recent fix went too far, since
> it assumed that if we were not in 1) case, we must be rwnd-limited
> 
> Fix this by properly populating the is_cwnd_limited and
> is_rwnd_limited booleans.
> 
> After this change, we can finally move the silly check for FIN
> flag only for the application-limited case.
> 
> The same move for EOR bit will be handled in net-next,
> since commit 1c09f7d073b1 ("tcp: do not try to defer skbs
> with eor mark (MSG_EOR)") is scheduled for linux-4.21
> 
> Tested by running 200 concurrent netperf -t TCP_RR -- -r 6,100
> and checking none of them was rwnd_limited in the chrono_stat
> output from "ss -ti" command.
> 
> Fixes: 41727549de3e ("tcp: Do not underestimate rwnd_limited")
> Signed-off-by: Eric Dumazet 
> Suggested-by: Neal Cardwell 
> Reviewed-by: Neal Cardwell 
> Acked-by: Soheil Hassas Yeganeh 
> Reviewed-by: Yuchung Cheng 

Applied.
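
A simplified model of the three cases from the commit message, as a sketch only. The struct and its fields are illustrative, and the real tcp_tso_should_defer() contains further heuristics that are left out here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model; field names are illustrative, not kernel structures. */
struct tso_state {
	uint32_t send_win;	/* bytes the receive window still allows */
	uint32_t cong_win;	/* bytes the congestion window still allows */
	uint32_t skb_len;	/* bytes currently queued in this skb */
	bool fin;		/* skb carries FIN: no more data will arrive */
};

/* Returns true to defer. At most one of the two flags is set; deferring
 * with neither flag set means the sender is application limited. */
static bool tso_should_defer(const struct tso_state *s,
			     bool *is_cwnd_limited, bool *is_rwnd_limited)
{
	*is_cwnd_limited = false;
	*is_rwnd_limited = false;

	if (s->cong_win < s->send_win) {
		if (s->cong_win <= s->skb_len) {
			*is_cwnd_limited = true;
			return true;
		}
	} else if (s->send_win <= s->skb_len) {
		*is_rwnd_limited = true;
		return true;
	}

	/* Application-limited case: only here does the FIN check apply. */
	if (s->fin)
		return false;
	return true;
}
```

The point of the model is that "not cwnd-limited" does not imply "rwnd-limited": the third outcome leaves both booleans false.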

Re: [PATCH] net-udp: deprioritize cpu match for udp socket lookup

2018-12-07 Thread David Miller

From: Maciej Żenczykowski 
Date: Wed,  5 Dec 2018 12:59:17 -0800

> From: Maciej Żenczykowski 
> 
> During udp socket lookup cpu match should be lowest priority,
> hence it should increase score by only 1.
> 
> The next priority is delivering v4 to v4 sockets, and v6 to v6 sockets.
> The v6 code path doesn't have to deal with this so it always gets
> a score of '4'.  The v4 code path uses '4' or '2' depending on
> whether we're delivering to a v4 socket or a dualstack v6 socket.
> 
> This is more important than cpu match, so has to be greater than
> the '1' bump in score from cpu match.
> 
> All other matches (src/dst ip, src port) are even *more* important,
> so need to bump score by 4 for ipv4.
> 
> For ipv6 we could simply bump by 2, but let's keep the two code
> paths as similar as possible.
> 
> (also, while at it, remove two unnecessary unconditional score bumps)
> 
> Signed-off-by: Maciej Żenczykowski 

This doesn't apply to the current net tree.

Also "net-udp: " is a weird subsystem prefix, just use "udp: ".

Thank you.
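
The scoring scheme from the patch description, for the IPv4 lookup path, could be modelled like this. It is a sketch built only from the description above; the struct and function names are hypothetical and the real lookup checks more conditions.

```c
#include <stdbool.h>

/* Hypothetical inputs for one candidate socket on the IPv4 lookup path. */
struct udp4_match {
	bool sk_is_v4;		/* true: v4 socket, false: dual-stack v6 socket */
	bool daddr_match;	/* socket is bound to the packet's destination IP */
	bool saddr_match;	/* socket is connected to the packet's source IP */
	bool sport_match;	/* socket is connected to the packet's source port */
	bool cpu_match;		/* socket's CPU equals the current CPU */
};

static int udp4_score(const struct udp4_match *m)
{
	int score = m->sk_is_v4 ? 4 : 2;	/* family preference */

	if (m->daddr_match)
		score += 4;
	if (m->saddr_match)
		score += 4;
	if (m->sport_match)
		score += 4;
	if (m->cpu_match)
		score += 1;			/* lowest-priority tie-breaker */
	return score;
}
```

With these weights a CPU match can never outrank a family match (2 + 1 < 4), and any address or port match outranks family plus CPU combined (2 + 4 > 4 + 1).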

Re: [Patch v2 net-next] call sk_dst_reset when set SO_DONTROUTE

2018-12-07 Thread David Miller

From: yupeng 
Date: Wed,  5 Dec 2018 18:56:28 -0800

> after set SO_DONTROUTE to 1, the IP layer should not route packets if
> the dest IP address is not in link scope. But if the socket has cached
> the dst_entry, such packets would be routed until the sk_dst_cache
> expires. So we should clean the sk_dst_cache when a user sets the
> SO_DONTROUTE option. Below are server/client python scripts which
> could reproduce this issue:
 ...
> Signed-off-by: yupeng 

Applied.
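
For reference, toggling the option from userspace looks roughly like the sketch below. It is not the elided reproducer from the patch, and it does not demonstrate the cached-route behaviour; it only shows the setsockopt() call that the fix hooks into. The two helper names are made up for this example.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Creates a UDP socket with SO_DONTROUTE set to `on`.
 * Returns the file descriptor, or -1 on failure. */
static int udp_socket_dontroute(int on)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_DONTROUTE, &on, sizeof(on)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Reads the option back; returns 0 or 1, or -1 on error. */
static int get_dontroute(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_SOCKET, SO_DONTROUTE, &val, &len) < 0)
		return -1;
	return val != 0;
}
```

To observe the issue one would presumably send to an off-link destination first, set the option on the same socket, and send again.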

Re: [PATCH v2 net-next] neighbor: Improve garbage collection

2018-12-07 Thread David Miller

From: David Ahern 
Date: Fri,  7 Dec 2018 12:24:57 -0800

> From: David Ahern 
> 
> The existing garbage collection algorithm has a number of problems:
 ...
> This patch addresses these problems as follows:
> 
> 1. Use of a separate list_head to track entries that can be garbage
>    collected along with a separate counter. PERMANENT entries are not
>    added to this list.
> 
>    The gc_thresh parameters are only compared to the new counter, not the
>    total entries in the table. The forced_gc function is updated to only
>    walk this new gc_list looking for entries to evict.
> 
> 2. Entries are added to the list head at the tail and removed from the
>    front.
> 
> 3. Entries are only evicted if they were last updated more than 5 seconds
>    ago, adhering to the original intent of gc_thresh2.
> 
> 4. Forced gc is stopped once the number of gc_entries drops below
>    gc_thresh2.
> 
> 5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
>    when allocating a new neighbor for a PERMANENT entry. By extension this
>    means there are no explicit limits on the number of PERMANENT entries
>    that can be created, but this is no different than FIB entries or FDB
>    entries.
> 
> Signed-off-by: David Ahern 
> ---
> v2
> - remove on_gc_list boolean in favor of !list_empty
> - fix neigh_alloc to add new entry to tail of list_head

Again, looks great, applied.
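
The eviction policy in points 1 to 4 can be sketched in userspace C as below. This is a simplified model, not the patch itself: the structures and function names are hypothetical, a singly linked queue stands in for the kernel list_head, and time is a plain number of seconds.

```c
#include <stddef.h>

#define GC_MIN_AGE 5	/* seconds; mirrors the 5 second rule in the description */

/* Hypothetical, simplified neighbor entry: only what gc needs. */
struct gc_entry {
	struct gc_entry *next;
	long updated;		/* time of last update, in seconds */
};

/* Hypothetical gc list: entries are appended at the tail, scanned from the head. */
struct gc_list {
	struct gc_entry *head, *tail;
	int gc_entries;		/* PERMANENT entries are never added, so never counted */
};

static void gc_add_tail(struct gc_list *l, struct gc_entry *e)
{
	e->next = NULL;
	if (l->tail)
		l->tail->next = e;
	else
		l->head = e;
	l->tail = e;
	l->gc_entries++;
}

/* Walk from the front, evicting entries last updated more than GC_MIN_AGE
 * seconds ago, and stop once gc_entries drops below thresh2.
 * Returns the number of entries evicted. */
static int forced_gc(struct gc_list *l, long now, int thresh2)
{
	struct gc_entry **pp = &l->head, *prev = NULL;
	int evicted = 0;

	while (*pp && l->gc_entries >= thresh2) {
		struct gc_entry *e = *pp;

		if (now - e->updated > GC_MIN_AGE) {
			*pp = e->next;		/* unlink; the caller owns the memory */
			if (l->tail == e)
				l->tail = prev;
			l->gc_entries--;
			evicted++;
		} else {
			prev = e;
			pp = &e->next;
		}
	}
	return evicted;
}
```

Recently updated entries are skipped rather than evicted, and the walk ends early once the counter is below the threshold, so older entries further down the list can survive a gc run.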

Re: [PATCH V2] net: dsa: ksz: Add reset GPIO handling

2018-12-07 Thread David Miller

From: Marek Vasut 
Date: Fri, 7 Dec 2018 23:59:58 +0100

> On 12/07/2018 11:24 PM, Andrew Lunn wrote:
>> On Fri, Dec 07, 2018 at 10:51:36PM +0100, Marek Vasut wrote:
>>> Add code to handle optional reset GPIO in the KSZ switch driver. The switch
>>> has a reset GPIO line which can be controlled by the CPU, so make sure it is
>>> configured correctly in such setups.
>> 
>> Hi Marek
> 
> Hi Andrew,
> 
>> Please make this a patch series, not two individual patches.
> 
> This actually is an individual patch, it doesn't depend on anything.
> Or do you mean a series with the DT documentation change ?

Yes, but all of this stuff is building up for one single purpose,
and that is to support a new mode of operation with DSA or whatever.

So please group them together in a series with an appropriate
header posting.

Re: [PATCH net-next] neighbor: Add protocol attribute

2018-12-07 Thread David Miller

From: Eric Dumazet 
Date: Fri, 7 Dec 2018 15:03:04 -0800

> On 12/07/2018 02:24 PM, David Ahern wrote:
>> On 12/7/18 3:20 PM, Eric Dumazet wrote:
>> 
>> /* --- cacheline 3 boundary (192 bytes) --- */
>> struct hh_cache            hh;                   /*   192    48 */
>> 
>> ...
>> 
>> but does not change the actual allocation size which is rounded to 512.
>> 
> 
> I have not talked about the allocation size, but alignment of ->ha field,
> which is kind of assuming long alignment, in a strange way.

Right, neigh->ha[] should probably be kept 8-byte aligned.

Re: [PATCH net-next] neighbor: Add protocol attribute

2018-12-07 Thread David Ahern

On 12/7/18 3:20 PM, Eric Dumazet wrote:
> 
> 
> On 12/07/2018 01:49 PM, David Ahern wrote:
>> From: David Ahern 
>>
>> Similar to routes and rules, add protocol attribute to neighbor entries
>> for easier tracking of how each was created.
>>
>> Signed-off-by: David Ahern 
>> ---
>>  include/net/neighbour.h|  2 ++
>>  include/uapi/linux/neighbour.h |  1 +
>>  net/core/neighbour.c   | 24 +++-
>>  3 files changed, 26 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/net/neighbour.h b/include/net/neighbour.h
>> index 6c13072910ab..e93c59df9501 100644
>> --- a/include/net/neighbour.h
>> +++ b/include/net/neighbour.h
>> @@ -149,6 +149,7 @@ struct neighbour {
>>      __u8            nud_state;
>>      __u8            type;
>>      __u8            dead;
>> +    u8              protocol;
>>      seqlock_t       ha_lock;
>>      unsigned char   ha[ALIGN(MAX_ADDR_LEN, sizeof(unsigned long))];
> 
> This looks like ha[] alignment would change, I am not sure how critical it is.

Just adds 4 bytes to neighbour:

...
/* --- cacheline 2 boundary (128 bytes) --- */
long unsigned int          used;                 /*   128     8 */
atomic_t                   probes;               /*   136     4 */
__u8                       flags;                /*   140     1 */
__u8                       nud_state;            /*   141     1 */
__u8                       type;                 /*   142     1 */
__u8                       dead;                 /*   143     1 */
u8                         protocol;             /*   144     1 */

/* XXX 3 bytes hole, try to pack */
seqlock_t                  ha_lock;              /*   148     8 */
unsigned char              ha[32];               /*   156    32 */
/* XXX 4 bytes hole, try to pack */

/* --- cacheline 3 boundary (192 bytes) --- */
struct hh_cache            hh;                   /*   192    48 */

...

but does not change the actual allocation size which is rounded to 512.
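
The alignment shift of ha[] can be reproduced with a reduced layout. The structs below are illustrative subsets with fixed-width stand-in types (fake_seqlock is an assumed 8-byte, 4-byte-aligned replacement for seqlock_t), so the absolute offsets differ from the dump above; only the offset modulo 8 is meant to match.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for seqlock_t in the layout above: 8 bytes, 4-byte aligned. */
struct fake_seqlock {
	uint32_t seq;
	uint32_t lock;
};

/* Tail of the struct before the patch (field subset, illustrative only). */
struct neigh_before {
	uint64_t used;
	uint32_t probes;
	uint8_t flags, nud_state, type, dead;
	struct fake_seqlock ha_lock;
	unsigned char ha[32];
};

/* Same fields with the new one-byte protocol member added. */
struct neigh_after {
	uint64_t used;
	uint32_t probes;
	uint8_t flags, nud_state, type, dead;
	uint8_t protocol;
	struct fake_seqlock ha_lock;
	unsigned char ha[32];
};

/* Offset of ha[] modulo 8: 0 means ha[] starts on an 8-byte boundary. */
static size_t ha_misalign_before(void)
{
	return offsetof(struct neigh_before, ha) % 8;
}

static size_t ha_misalign_after(void)
{
	return offsetof(struct neigh_after, ha) % 8;
}
```

In this model the extra byte plus its 3-byte hole pushes ha[] from an 8-byte boundary to a 4-byte one, which agrees with offset 156 in the dump.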

[PATCH net-next] neighbor: Add protocol attribute

2018-12-07 Thread David Ahern

From: David Ahern 

Similar to routes and rules, add protocol attribute to neighbor entries
for easier tracking of how each was created.

Signed-off-by: David Ahern 
---
 include/net/neighbour.h|  2 ++
 include/uapi/linux/neighbour.h |  1 +
 net/core/neighbour.c   | 24 +++-
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 6c13072910ab..e93c59df9501 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -149,6 +149,7 @@ struct neighbour {
     __u8            nud_state;
     __u8            type;
     __u8            dead;
+    u8              protocol;
     seqlock_t       ha_lock;
     unsigned char   ha[ALIGN(MAX_ADDR_LEN, sizeof(unsigned long))];
     struct hh_cache hh;
@@ -173,6 +174,7 @@ struct pneigh_entry {
     possible_net_t      net;
     struct net_device   *dev;
     u8                  flags;
+    u8                  protocol;
     u8                  key[0];
 };
 
diff --git a/include/uapi/linux/neighbour.h b/include/uapi/linux/neighbour.h
index 998155444e0d..cd144e3099a3 100644
--- a/include/uapi/linux/neighbour.h
+++ b/include/uapi/linux/neighbour.h
@@ -28,6 +28,7 @@ enum {
NDA_MASTER,
NDA_LINK_NETNSID,
NDA_SRC_VNI,
+   NDA_PROTOCOL,  /* Originator of entry */
__NDA_MAX
 };
 
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index c3b58712e98b..56984695585d 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1799,6 +1799,7 @@ static int neigh_add(struct sk_buff *skb, struct nlmsghdr *nlh,
struct net_device *dev = NULL;
struct neighbour *neigh;
void *dst, *lladdr;
+   u8 protocol = 0;
int err;
 
ASSERT_RTNL();
@@ -1838,6 +1839,14 @@ static int neigh_add(struct sk_buff *skb, struct nlmsghdr *nlh,
dst = nla_data(tb[NDA_DST]);
lladdr = tb[NDA_LLADDR] ? nla_data(tb[NDA_LLADDR]) : NULL;
 
+   if (tb[NDA_PROTOCOL]) {
+   if (nla_len(tb[NDA_PROTOCOL]) != sizeof(u8)) {
+   NL_SET_ERR_MSG(extack, "Invalid protocol attribute");
+   goto out;
+   }
+   protocol = nla_get_u8(tb[NDA_PROTOCOL]);
+   }
+
if (ndm->ndm_flags & NTF_PROXY) {
struct pneigh_entry *pn;
 
@@ -1845,6 +1854,8 @@ static int neigh_add(struct sk_buff *skb, struct nlmsghdr *nlh,
pn = pneigh_lookup(tbl, net, dst, dev, 1);
if (pn) {
pn->flags = ndm->ndm_flags;
+   if (protocol)
+   pn->protocol = protocol;
err = 0;
}
goto out;
@@ -1893,6 +1904,10 @@ static int neigh_add(struct sk_buff *skb, struct nlmsghdr *nlh,
} else
err = __neigh_update(neigh, lladdr, ndm->ndm_state, flags,
 NETLINK_CB(skb).portid, extack);
+
+   if (protocol)
+   neigh->protocol = protocol;
+
neigh_release(neigh);
 
 out:
@@ -2386,6 +2401,9 @@ static int neigh_fill_info(struct sk_buff *skb, struct neighbour *neigh,
nla_put(skb, NDA_CACHEINFO, sizeof(ci), &ci))
goto nla_put_failure;
 
+   if (neigh->protocol && nla_put_u8(skb, NDA_PROTOCOL, neigh->protocol))
+   goto nla_put_failure;
+
nlmsg_end(skb, nlh);
return 0;
 
@@ -2417,6 +2435,9 @@ static int pneigh_fill_info(struct sk_buff *skb, struct pneigh_entry *pn,
if (nla_put(skb, NDA_DST, tbl->key_len, pn->key))
goto nla_put_failure;
 
+   if (pn->protocol && nla_put_u8(skb, NDA_PROTOCOL, pn->protocol))
+   goto nla_put_failure;
+
nlmsg_end(skb, nlh);
return 0;
 
@@ -3072,7 +3093,8 @@ static inline size_t neigh_nlmsg_size(void)
   + nla_total_size(MAX_ADDR_LEN) /* NDA_DST */
   + nla_total_size(MAX_ADDR_LEN) /* NDA_LLADDR */
   + nla_total_size(sizeof(struct nda_cacheinfo))
-  + nla_total_size(4); /* NDA_PROBES */
+  + nla_total_size(4)  /* NDA_PROBES */
+  + nla_total_size(1); /* NDA_PROTOCOL */
 }
 
 static void __neigh_notify(struct neighbour *n, int type, int flags,
-- 
2.11.0

Re: [PATCH iproute2-next 0/2] devlink: Add support for 'fw_load_policy' generic parameter

2018-12-07 Thread David Ahern

On 12/4/18 3:14 AM, Shalom Toledo wrote:
> Patch #1 add string to uint conversion support for generic parameters.
> Patch #2 add string to uint support for 'fw_load_policy' generic parameter
> 
> Shalom Toledo (2):
>   devlink: Add string to uint{8,16,32} conversion for generic parameters
>   devlink: Add support for 'fw_load_policy' generic parameter
> 
>  devlink/devlink.c| 156 ---
>  include/uapi/linux/devlink.h |   5 ++
>  2 files changed, 151 insertions(+), 10 deletions(-)
> 

applied to iproute2-next. Thanks

Re: [PATCH 1/5] net: dsa: ksz: Add MIB counter reading support

2018-12-07 Thread David Miller



Every patch series should have a header posting with Subject of
the form "[PATCH 0/N] ..." explaining what the series does at
a high level, how it does it, and why it does it that way.

Re: [PATCH v2 net-next 0/4] net: aquantia: add RSS configuration

2018-12-07 Thread David Miller

From: Igor Russkikh 
Date: Fri, 7 Dec 2018 14:00:09 +

> In this patchset few bugs related to RSS are fixed and RSS table and
> hash key configuration is added.
> 
> We also increase the max number of HW rings up to 8.
> 
> v2: removed extra arg check

Series applied.

[PATCH v2 net-next] neighbor: Improve garbage collection

2018-12-07 Thread David Ahern

From: David Ahern 

The existing garbage collection algorithm has a number of problems:

1. The gc algorithm will not evict PERMANENT entries as those entries
   are managed by userspace, yet the existing algorithm walks the entire
   hash table which means it always considers PERMANENT entries when
   looking for entries to evict. In some use cases (e.g., EVPN) there
   can be tens of thousands of PERMANENT entries leading to wasted
   CPU cycles when gc kicks in. As an example, with 32k permanent
   entries, neigh_alloc has been observed taking more than 4 msec per
   invocation.

2. Currently, when the number of neighbor entries hits gc_thresh2 and
   the last flush for the table was more than 5 seconds ago, gc kicks in
   and walks the entire hash table evicting *all* entries not in PERMANENT
   or REACHABLE state and not marked as externally learned. There is no
   discriminator on when the neigh entry was created or if it just moved
   from REACHABLE to another NUD_VALID state (e.g., NUD_STALE).

   It is possible for entries to be created or for established neighbor
   entries to be moved to STALE (e.g., an external node sends an ARP
   request) right before the 5 second window lapses:

        -----|---------x|----------|-----
            t-5         t         t+5

   If that happens those entries are evicted during gc causing unnecessary
   thrashing on neighbor entries and userspace caches trying to track them.

   Further, this contradicts the description of gc_thresh2 which says
   "Entries older than 5 seconds will be cleared".

   One workaround is to make gc_thresh2 == gc_thresh3 but that negates the
   whole point of having separate thresholds.

3. Clearing *all* neigh non-PERMANENT/REACHABLE/externally learned entries
   when gc_thresh2 is exceeded is overkill and contributes to thrashing
   especially during startup.

This patch addresses these problems as follows:

1. Use of a separate list_head to track entries that can be garbage
   collected along with a separate counter. PERMANENT entries are not
   added to this list.

   The gc_thresh parameters are only compared to the new counter, not the
   total entries in the table. The forced_gc function is updated to only
   walk this new gc_list looking for entries to evict.

2. Entries are added to the list head at the tail and removed from the
   front.

3. Entries are only evicted if they were last updated more than 5 seconds
   ago, adhering to the original intent of gc_thresh2.

4. Forced gc is stopped once the number of gc_entries drops below
   gc_thresh2.

5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
   when allocating a new neighbor for a PERMANENT entry. By extension this
   means there are no explicit limits on the number of PERMANENT entries
   that can be created, but this is no different than FIB entries or FDB
   entries.

Signed-off-by: David Ahern 
---
v2
- remove on_gc_list boolean in favor of !list_empty
- fix neigh_alloc to add new entry to tail of list_head

 Documentation/networking/ip-sysctl.txt |   4 +-
 include/net/neighbour.h|   3 +
 net/core/neighbour.c   | 119 +++--
 3 files changed, 90 insertions(+), 36 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index af2a69439b93..acdfb5d2bcaa 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -108,8 +108,8 @@ neigh/default/gc_thresh2 - INTEGER
Default: 512
 
 neigh/default/gc_thresh3 - INTEGER
-   Maximum number of neighbor entries allowed.  Increase this
-   when using large numbers of interfaces and when communicating
+   Maximum number of non-PERMANENT neighbor entries allowed.  Increase
+   this when using large numbers of interfaces and when communicating
with large numbers of directly-connected peers.
Default: 1024
 
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index f58b384aa6c9..6c13072910ab 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -154,6 +154,7 @@ struct neighbour {
     struct hh_cache         hh;
     int                     (*output)(struct neighbour *, struct sk_buff *);
     const struct neigh_ops  *ops;
+    struct list_head        gc_list;
     struct rcu_head         rcu;
     struct net_device       *dev;
     u8                      primary_key[0];
@@ -214,6 +215,8 @@ struct neigh_table {
     struct timer_list       proxy_timer;
     struct sk_buff_head     proxy_queue;
     atomic_t                entries;
+    atomic_t                gc_entries;
+    struct list_head        gc_list;
     rwlock_t                lock;
     unsigned long           last_rand;
     struct neigh_statistics __percpu *stats;
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 6d479b5562be..c3b58712e98b 100644
--- a/net/core/neighbo

Re: [PATCH net] ipv6: sr: properly initialize flowi6 prior passing to ip6_route_output

2018-12-07 Thread David Miller

From: Shmulik Ladkani 
Date: Fri,  7 Dec 2018 09:50:17 +0200

> In 'seg6_output', stack variable 'struct flowi6 fl6' was missing
> initialization.
> 
> Fixes: 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and 
> injection with lwtunnels")
> Signed-off-by: Shmulik Ladkani 

Applied and queued up for -stable, thanks.

Re: [PATCH net-next] neighbour: Improve garbage collection

2018-12-07 Thread David Ahern

On 12/6/18 8:59 PM, David Miller wrote:
> But why do you need the on_gc_list boolean state?

mental blockage.

v2 coming up.

Re: [PATCH] Revert "net/ibm/emac: wrong bit is used for STA control"

2018-12-06 Thread David Miller



Looks like your posting was empty?

Re: [PATCH net-next] neighbour: Improve garbage collection

2018-12-06 Thread David Miller

From: David Ahern 
Date: Thu,  6 Dec 2018 14:38:44 -0800

> The existing garbage collection algorithm has a number of problems:

Thanks for working on this!

I totally agree with what you are doing, especially the separate
gc_list.

But why do you need the on_gc_list boolean state?  That's equivalent
to "!list_empty(&n->gc_list)" and seems redundant.
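
To see why the boolean is redundant, here is a minimal userspace re-creation of the circular list idiom. The helper names mirror the kernel ones in spirit but are reimplemented here, and struct nentry is a hypothetical stand-in for the neighbour entry. The equivalence holds as long as the node is initialised up front and removed with a delete that re-initialises it.

```c
#include <stdbool.h>

/* Minimal userspace re-creation of the kernel's circular list idiom. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink and re-initialise, so list_empty(n) is true again afterwards. */
static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	init_list_head(n);
}

/* Hypothetical entry: membership is derived from the node, no extra flag. */
struct nentry {
	struct list_head gc_list;
};

static bool entry_on_gc_list(const struct nentry *e)
{
	return !list_empty(&e->gc_list);
}
```

A node that points at itself is on no list, so the node's own state already answers the membership question.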

[PATCH net-next] neighbour: Improve garbage collection

2018-12-06 Thread David Ahern

From: David Ahern 

The existing garbage collection algorithm has a number of problems:

1. The gc algorithm will not evict PERMANENT entries as those entries
   are managed by userspace, yet the existing algorithm walks the entire
   hash table which means it always considers PERMANENT entries when
   looking for entries to evict. In some use cases (e.g., EVPN) there
   can be tens of thousands of PERMANENT entries leading to wasted
   CPU cycles when gc kicks in. As an example, with 32k permanent
   entries, neigh_alloc has been observed taking more than 4 msec per
   invocation.

2. Currently, when the number of neighbor entries hits gc_thresh2 and
   the last flush for the table was more than 5 seconds ago, gc kicks in
   and walks the entire hash table evicting *all* entries not in PERMANENT
   or REACHABLE state and not marked as externally learned. There is no
   discriminator on when the neigh entry was created or if it just moved
   from REACHABLE to another NUD_VALID state (e.g., NUD_STALE).

   It is possible for entries to be created or for established neighbor
   entries to be moved to STALE (e.g., an external node sends an ARP
   request) right before the 5 second window lapses:

        -----|---------x|----------|-----
            t-5         t         t+5

   If that happens those entries are evicted during gc causing unnecessary
   thrashing on neighbor entries and userspace caches trying to track them.

   Further, this contradicts the description of gc_thresh2 which says
   "Entries older than 5 seconds will be cleared".

   One workaround is to make gc_thresh2 == gc_thresh3 but that negates the
   whole point of having separate thresholds.

3. Clearing *all* neigh non-PERMANENT/REACHABLE/externally learned entries
   when gc_thresh2 is exceeded is overkill and contributes to thrashing
   especially during startup.

This patch addresses these problems as follows:
1. use of a separate list_head to track entries that can be garbage
   collected along with a separate counter. PERMANENT entries are not
   added to this list.

   The gc_thresh parameters are only compared to the new counter, not the
   total entries in the table. The forced_gc function is updated to only
   walk this new gc_list looking for entries to evict.

2. Entries are added to the list head at the tail and removed from the
   front.

3. Entries are only evicted if they were last updated more than 5 seconds
   ago, adhering to the original intent of gc_thresh2.

4. Forced gc is stopped once the number of gc_entries drops below
   gc_thresh2.

5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
   when allocating a new neighbor for a PERMANENT entry. By extension this
   means there are no explicit limits on the number of PERMANENT entries
   that can be created, but this is no different than FIB entries or FDB
   entries.

Signed-off-by: David Ahern 
---
 Documentation/networking/ip-sysctl.txt |   4 +-
 include/net/neighbour.h|   4 ++
 net/core/neighbour.c   | 122 +++--
 3 files changed, 93 insertions(+), 37 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index af2a69439b93..acdfb5d2bcaa 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -108,8 +108,8 @@ neigh/default/gc_thresh2 - INTEGER
Default: 512
 
 neigh/default/gc_thresh3 - INTEGER
-   Maximum number of neighbor entries allowed.  Increase this
-   when using large numbers of interfaces and when communicating
+   Maximum number of non-PERMANENT neighbor entries allowed.  Increase
+   this when using large numbers of interfaces and when communicating
with large numbers of directly-connected peers.
Default: 1024
 
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index f58b384aa6c9..846ad8da91eb 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -154,6 +154,8 @@ struct neighbour {
     struct hh_cache         hh;
     int                     (*output)(struct neighbour *, struct sk_buff *);
     const struct neigh_ops  *ops;
+    struct list_head        gc_list;
+    bool                    on_gc_list;
     struct rcu_head         rcu;
     struct net_device       *dev;
     u8                      primary_key[0];
@@ -214,6 +216,8 @@ struct neigh_table {
     struct timer_list       proxy_timer;
     struct sk_buff_head     proxy_queue;
     atomic_t                entries;
+    atomic_t                gc_entries;
+    struct list_head        gc_list;
     rwlock_t                lock;
     unsigned long           last_rand;
     struct neigh_statistics __percpu *stats;
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 6d479b5562be..ab11e94ec44d 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -118,6 +118,36 @@ uns

Re: [PATCH net-next 2/2] net: dsa: Set the master device's MTU to account for DSA overheads

2018-12-06 Thread David Miller

From: Andrew Lunn 
Date: Thu, 6 Dec 2018 21:48:46 +0100

> David has already accepted the patchset, so i will add a followup
> patch.

Yeah sorry for jumping the gun, the changes looked pretty
straightforward to me. :-/

Re: [PATCH net-next v2 0/8] Pass extack to NETDEV_PRE_UP

2018-12-06 Thread David Miller

From: Petr Machata 
Date: Thu, 6 Dec 2018 17:05:35 +

> Drivers may need to validate configuration of a device that's about to
> be upped. An example is mlxsw, which needs to check the configuration of
> a VXLAN device attached to an offloaded bridge. Should the validation
> fail, there's currently no way to communicate details of the failure to
> the user, beyond an error number.
> 
> Therefore this patch set extends the NETDEV_PRE_UP event to include
> extack, if available.
 ...

Series applied, thank you.

Re: [PATCH net 0/4] mlxsw: Various fixes

2018-12-06 Thread David Miller

From: Ido Schimmel 
Date: Thu, 6 Dec 2018 17:44:48 +

> Patches #1 and #2 fix two VxLAN related issues. The first patch removes
> warnings that can currently be triggered from user space. Second patch
> avoids leaking a FID in an error path.
> 
> Patch #3 fixes a too strict check that causes certain host routes not to
> be promoted to perform GRE decapsulation in hardware.
> 
> Last patch avoids a use-after-free when deleting a VLAN device via an
> ioctl when it is enslaved to a bridge. I have a patchset for net-next
> that reworks this code and makes the driver more robust.

Series applied.

Re: mv88e6060: Turn e6060 driver into e6065 driver

2018-12-06 Thread David Miller

From: Pavel Machek 
Date: Thu, 6 Dec 2018 14:03:45 +0100

> @@ -79,7 +82,7 @@ static enum dsa_tag_protocol 
> mv88e6060_get_tag_protocol(struct dsa_switch *ds,
>  {
>//return DSA_TAG_PROTO_QCA;
>//return DSA_TAG_PROTO_TRAILER;

These C++ style comments are not in any of my tree(s).

Your patch submission really needs to shape up if you want your patches
to be considered seriously.

Thank you.

Re: [PATCH] mv88e6060: Warn about errors

2018-12-06 Thread David Miller



Plain "printk" calls are never appropriate.

Please explicitly use pr_warn() or similar.  If there is a device context
available, either a generic device or a netdev, use one of the dev_*()
or netdev_*() variants.

Re: [PATCH] tcp: fix code style in tcp_recvmsg()

2018-12-06 Thread David Miller

From: Pedro Tammela 
Date: Thu,  6 Dec 2018 10:45:28 -0200

> Two goto labels are indented with a tab. Remove the tabs and
> keep the code style consistent.
> 
> Signed-off-by: Pedro Tammela 

Applied to net-next.

Re: [PATCH net-next 0/2] Adjust MTU of DSA master interface

2018-12-06 Thread David Miller

From: Andrew Lunn 
Date: Thu,  6 Dec 2018 11:36:03 +0100

> DSA makes use of additional headers to direct a frame in/out of a
> specific port of the switch. When the slave interfaces uses an MTU of
> 1500, the master interface can be asked to handle frames with an MTU
> of 1504, or 1508 bytes. Some Ethernet interfaces won't
> transmit/receive frames which are bigger than their MTU.
> 
> Automate the increasing of the MTU on the master interface, by adding
> to each tagging driver how much overhead they need, and then calling
> dev_set_mtu() of the master interface to increase its MTU as needed.

Series applied, thanks Andrew.
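
The arithmetic in the cover letter can be sketched in plain C. This is a userspace illustration under stated assumptions, not the DSA code: `struct tagger`, `master_mtu_for()` and the overhead values are hypothetical names picked for the example, and the rule of never shrinking an already larger master MTU is an assumption of the sketch. The actual series calls dev_set_mtu() on the master interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a DSA tagging driver: only the number of
 * extra bytes the tag adds to each frame matters here. */
struct tagger {
	const char *name;
	unsigned int overhead;	/* tag bytes added per frame */
};

/* MTU the master needs so that a slave frame of slave_mtu bytes plus
 * the tag still fits. Assumption of this sketch: an already larger
 * master MTU is left alone. */
static unsigned int master_mtu_for(unsigned int master_mtu,
				   unsigned int slave_mtu,
				   const struct tagger *tag)
{
	unsigned int needed;

	assert(tag != NULL);
	needed = slave_mtu + tag->overhead;
	return needed > master_mtu ? needed : master_mtu;
}
```

With a 4-byte or 8-byte tag and a slave MTU of 1500, this yields the 1504 and 1508 figures quoted above.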

Re: [PATCH][net-next] tun: align write-heavy flow entry members to a cache line

2018-12-06 Thread David Miller

From: Li RongQing 
Date: Thu,  6 Dec 2018 16:08:17 +0800

> The 'updated' field of a tun flow entry is written on every received
> packet. A flow that keeps receiving packets through a particular flow
> entry therefore causes false sharing with every other reader that has
> looked that entry up, so move the field into its own cache line.
> 
> Also update the 'queue_index' and 'updated' fields only when their
> values change, to further reduce cache false sharing.
> 
> Signed-off-by: Zhang Yu 
> Signed-off-by: Wang Li 
> Signed-off-by: Li RongQing 

Applied.
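
The two ideas in this commit message (give the write-heavy field its own cache line, and store only when the value changes) can be sketched in userspace C11. The struct layout, the 64-byte line size and the name `flow_update()` are assumptions made for this illustration; the real tun flow entry has more fields and uses the kernel's own alignment annotations.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64	/* assumed cache line size for this sketch */

/* Hypothetical flow entry: read-mostly lookup fields first, the
 * per-packet timestamp pushed onto its own cache line. */
struct flow_entry {
	unsigned int rxhash;
	unsigned int queue_index;
	_Alignas(CACHE_LINE) unsigned long updated;
};

/* Store only on change, so an unchanged entry does not dirty the
 * cache line. Returns the number of stores actually performed. */
static int flow_update(struct flow_entry *e, unsigned int queue_index,
		       unsigned long now)
{
	int stores = 0;

	assert(e != NULL);
	if (e->queue_index != queue_index) {
		e->queue_index = queue_index;
		stores++;
	}
	if (e->updated != now) {
		e->updated = now;
		stores++;
	}
	return stores;
}
```

The compare-before-store costs a read on every packet, but a read keeps the line shared between CPUs whereas a store forces it exclusive.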

Re: [PATCH][net-next] tun: remove unnecessary check in tun_flow_update

2018-12-06 Thread David Miller

From: Li RongQing 
Date: Thu,  6 Dec 2018 16:28:11 +0800

> caller has guaranteed that rxhash is not zero
> 
> Signed-off-by: Li RongQing 

Applied.

Re: [PATCH 1/2] net: linkwatch: send change uevent on link changes

2018-12-06 Thread David Miller

From: Jouke Witteveen 
Date: Thu, 6 Dec 2018 09:59:20 +0100

> On Thu, Dec 6, 2018 at 1:34 AM David Miller  wrote:
>>
>> From: Jouke Witteveen 
>> Date: Wed, 5 Dec 2018 23:38:17 +0100
>>
>> > Can you elaborate a bit? I may not be aware of the policy you have in
>> > mind.
>>
>> When we have a user facing interface to do something, we don't create
>> another one unless it is absolutely, positively, unavoidable.
> 
> Obviously, if I would have known this I would not have gone through
> the trouble of investigating and proposing this patch. It was an
> honest attempt at making the kernel better.
> Where could I have found this policy? I have looked on kernel.org/doc,
> but couldn't find it.

It is not formally documented but it is a concern we raise every time
a duplicate piece of user facing functionality is proposed.

Re: [PATCH net] sctp: fix pr_warn max_data argument type mismatch

2018-12-06 Thread David Miller

From: Jakub Audykowicz 
Date: Thu,  6 Dec 2018 08:58:37 +0100

> My previous patch introduced a compilation warning regarding a type
> mismatch (int vs size_t). This is a one-letter fix for good housekeeping.
> 
> Signed-off-by: Jakub Audykowicz 

Still wrong and I fixed it when I applied your patch.

You need to use the 'Z' prefix for size_t, so %Zu in this case.
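
For comparison, ISO C99 printf spells the size_t length modifier as a lowercase `z` (`%zu`). A minimal userspace sketch using snprintf(), not the kernel's printk; `format_max_data()` is a hypothetical helper named for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format a size_t without a cast by using the C99 'z' length
 * modifier. Returns what snprintf() returns: the number of
 * characters the full string needs, excluding the terminator. */
static int format_max_data(char *buf, size_t len, size_t max_data)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "max_data=%zu", max_data);
}
```

Using a plain `%d` or `%u` here is what produces the int vs size_t mismatch warning the patch description mentions.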

Re: [PATCH net-next] neighbor: Add extack messages for add and delete commands

2018-12-05 Thread David Miller

From: David Ahern 
Date: Wed,  5 Dec 2018 20:02:29 -0800

> From: David Ahern 
> 
> Add extack messages for failures in neigh_add and neigh_delete.
> 
> Signed-off-by: David Ahern 

Looks good, applied, thanks David.
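
The shape of this change (an internal function that takes an extack, plus the old exported entry point that passes NULL) can be sketched in userspace C. Everything here is a stand-in: `struct ext_ack`, `SET_ERR_MSG`, `entry_update_extack()` and the -EINVAL return value are hypothetical and only illustrate the pattern; the kernel uses struct netlink_ext_ack and NL_SET_ERR_MSG.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct netlink_ext_ack. */
struct ext_ack {
	const char *msg;
};

/* Set the message only when the caller supplied an extack, so
 * callers without one can simply pass NULL. */
#define SET_ERR_MSG(extack, text)			\
	do {						\
		struct ext_ack *e_ = (extack);		\
		if (e_)					\
			e_->msg = (text);		\
	} while (0)

/* Internal variant: reports the reason through extack if present. */
static int entry_update_extack(int dead, struct ext_ack *extack)
{
	if (dead) {
		SET_ERR_MSG(extack, "Neighbor entry is now dead");
		return -EINVAL;	/* illustrative error code */
	}
	return 0;
}

/* Existing entry point keeps its signature and passes no extack. */
static int entry_update(int dead)
{
	return entry_update_extack(dead, NULL);
}
```

Existing callers are untouched, while the netlink handlers can call the internal variant and hand the message back to user space.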

Re: [PATCH net] ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes

2018-12-05 Thread David Miller

From: Jiri Wiesner 
Date: Wed, 5 Dec 2018 16:55:29 +0100

> The *_frag_reasm() functions are susceptible to miscalculating the byte
> count of packet fragments in case the truesize of a head buffer changes.
> The truesize member may be changed by the call to skb_unclone(), leaving
> the fragment memory limit counter unbalanced even if all fragments are
> processed. This miscalculation goes unnoticed as long as the network
> namespace which holds the counter is not destroyed.
> 
> Should an attempt be made to destroy a network namespace that holds an
> unbalanced fragment memory limit counter the cleanup of the namespace
> never finishes. The thread handling the cleanup gets stuck in
> inet_frags_exit_net() waiting for the percpu counter to reach zero. The
> thread is usually in running state with a stacktrace similar to:
> 
>  PID: 1073   TASK: 880626711440  CPU: 1   COMMAND: "kworker/u48:4"
>   #5 [880621563d48] _raw_spin_lock at 815f5480
>   #6 [880621563d48] inet_evict_bucket at 8158020b
>   #7 [880621563d80] inet_frags_exit_net at 8158051c
>   #8 [880621563db0] ops_exit_list at 814f5856
>   #9 [880621563dd8] cleanup_net at 814f67c0
>  #10 [880621563e38] process_one_work at 81096f14
> 
> It is not possible to create new network namespaces, and processes
> that call unshare() end up being stuck in uninterruptible sleep state
> waiting to acquire the net_mutex.
> 
> The bug was observed in the IPv6 netfilter code by Per Sundstrom.
> I thank him for his analysis of the problem. The parts of this patch
> that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
> 
> Signed-off-by: Jiri Wiesner 
> Reported-by: Per Sundstrom 

Nice catch.

Applied and queued up for -stable, thanks!

Re: [PATCH net 1/3] flex_array: make FLEX_ARRAY_BASE_SIZE the same value of FLEX_ARRAY_PART_SIZE

2018-12-05 Thread David Miller

From: Xin Long 
Date: Wed,  5 Dec 2018 14:49:40 +0800

> This patch is to separate the base data memory from struct flex_array and
> save it into a page. With this change, total_nr_elements of a flex_array
> can grow or shrink without having the old element's memory changed when
> the new size of the flex_array crosses FLEX_ARRAY_BASE_SIZE, which will
> be added in the next patch.
> 
> Suggested-by: Neil Horman 
> Signed-off-by: Xin Long 

This needs to be reviewed by the flex array hackers and lkml.

It can't just get reviewed on netdev alone.

Re: [PATCH v2 net-next 1/1] net: netem: use a list in addition to rbtree

2018-12-05 Thread David Miller

From: Peter Oskolkov 
Date: Tue,  4 Dec 2018 11:55:56 -0800

> When testing high-bandwidth TCP streams with large windows,
> high latency, and low jitter, netem consumes a lot of CPU cycles
> doing rbtree rebalancing.
> 
> This patch uses a linear list/queue in addition to the rbtree:
> if an incoming packet is past the tail of the linear queue, it is
> added there, otherwise it is inserted into the rbtree.
> 
> Without this patch, perf shows netem_enqueue, netem_dequeue,
> and rb_* functions among the top offenders. With this patch,
> only netem_enqueue is noticeable if jitter is low/absent.
> 
> Suggested-by: Eric Dumazet 
> Signed-off-by: Peter Oskolkov 

Applied, thanks.
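
The enqueue decision described in the commit message can be sketched as follows. This is a simplified userspace model, not the netem code: `struct pkt`, `struct tq` and `tq_enqueue()` are hypothetical names, and the rbtree is reduced to a counter so the sketch stays short.

```c
#include <assert.h>
#include <stddef.h>

struct pkt {
	unsigned long time_to_send;
	struct pkt *next;
};

/* Hypothetical two-tier queue: head/tail form the linear list for
 * in-order arrivals; slow_inserts counts packets that would have
 * gone to the rbtree instead (the tree itself is omitted). */
struct tq {
	struct pkt *head, *tail;
	unsigned int fast_appends, slow_inserts;
};

/* Returns 1 if the packet took the O(1) tail-append path, 0 if it
 * has to take the ordered (rbtree) path. */
static int tq_enqueue(struct tq *q, struct pkt *p)
{
	assert(q != NULL && p != NULL);
	p->next = NULL;
	if (!q->tail || p->time_to_send >= q->tail->time_to_send) {
		if (q->tail)
			q->tail->next = p;
		else
			q->head = p;
		q->tail = p;
		q->fast_appends++;
		return 1;
	}
	q->slow_inserts++;	/* real code: insert into the rbtree */
	return 0;
}
```

With low jitter nearly every packet is due after the current tail, so almost all enqueues take the append path and no tree rebalancing happens.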

Re: [PATCH net] sctp: frag_point sanity check

2018-12-05 Thread David Miller

From: Jakub Audykowicz 
Date: Tue,  4 Dec 2018 20:27:41 +0100

> If for some reason an association's fragmentation point is zero,
> sctp_datamsg_from_user will endlessly try to divide a message
> into zero-sized chunks. This eventually causes a kernel panic due to
> running out of memory.
> 
> Although this situation is quite unlikely, it has occurred before as
> reported. I propose to add this simple last-ditch sanity check due to
> the severity of the potential consequences.
> 
> Signed-off-by: Jakub Audykowicz 

Applied.
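
Why a zero fragmentation point never terminates, and what a last-ditch check looks like, can be sketched in userspace C. `chunks_needed()` is a hypothetical helper for illustration only; it is not the SCTP code and the -1 return value is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks needed for msg_len bytes at frag_point bytes per
 * chunk. Returns -1 when frag_point is zero: without that check the
 * loop below would take 0 bytes per iteration and never finish. */
static long chunks_needed(size_t msg_len, size_t frag_point)
{
	long chunks = 0;

	if (frag_point == 0)
		return -1;	/* last-ditch sanity check */

	while (msg_len > 0) {
		size_t take = msg_len < frag_point ? msg_len : frag_point;

		msg_len -= take;
		chunks++;
	}
	assert(msg_len == 0);
	return chunks;
}
```

In the kernel case each iteration also allocates a chunk, which is why the endless loop ends in memory exhaustion rather than a plain hang.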

[PATCH net-next] neighbor: Add extack messages for add and delete commands

2018-12-05 Thread David Ahern

From: David Ahern 

Add extack messages for failures in neigh_add and neigh_delete.

Signed-off-by: David Ahern 
---
 net/core/neighbour.c | 55 +---
 1 file changed, 39 insertions(+), 16 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 41954e42a2de..6d479b5562be 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1137,8 +1137,9 @@ static void neigh_update_hhs(struct neighbour *neigh)
Caller MUST hold reference count on the entry.
  */
 
-int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
-u32 flags, u32 nlmsg_pid)
+static int __neigh_update(struct neighbour *neigh, const u8 *lladdr,
+ u8 new, u32 flags, u32 nlmsg_pid,
+ struct netlink_ext_ack *extack)
 {
u8 old;
int err;
@@ -1155,8 +1156,10 @@ int neigh_update(struct neighbour *neigh, const u8 
*lladdr, u8 new,
if (!(flags & NEIGH_UPDATE_F_ADMIN) &&
(old & (NUD_NOARP | NUD_PERMANENT)))
goto out;
-   if (neigh->dead)
+   if (neigh->dead) {
+   NL_SET_ERR_MSG(extack, "Neighbor entry is now dead");
goto out;
+   }
 
neigh_update_ext_learned(neigh, flags, &notify);
 
@@ -1193,8 +1196,10 @@ int neigh_update(struct neighbour *neigh, const u8 
*lladdr, u8 new,
   use it, otherwise discard the request.
 */
err = -EINVAL;
-   if (!(old & NUD_VALID))
+   if (!(old & NUD_VALID)) {
+   NL_SET_ERR_MSG(extack, "No link layer address given");
goto out;
+   }
lladdr = neigh->ha;
}
 
@@ -1307,6 +1312,12 @@ int neigh_update(struct neighbour *neigh, const u8 
*lladdr, u8 new,
 
return err;
 }
+
+int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
+u32 flags, u32 nlmsg_pid)
+{
+   return __neigh_update(neigh, lladdr, new, flags, nlmsg_pid, NULL);
+}
 EXPORT_SYMBOL(neigh_update);
 
 /* Update the neigh to listen temporarily for probe responses, even if it is
@@ -1678,8 +1689,10 @@ static int neigh_delete(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
 
dst_attr = nlmsg_find_attr(nlh, sizeof(*ndm), NDA_DST);
-   if (dst_attr == NULL)
+   if (!dst_attr) {
+   NL_SET_ERR_MSG(extack, "Network address not specified");
goto out;
+   }
 
ndm = nlmsg_data(nlh);
if (ndm->ndm_ifindex) {
@@ -1694,8 +1707,10 @@ static int neigh_delete(struct sk_buff *skb, struct 
nlmsghdr *nlh,
if (tbl == NULL)
return -EAFNOSUPPORT;
 
-   if (nla_len(dst_attr) < (int)tbl->key_len)
+   if (nla_len(dst_attr) < (int)tbl->key_len) {
+   NL_SET_ERR_MSG(extack, "Invalid network address");
goto out;
+   }
 
if (ndm->ndm_flags & NTF_PROXY) {
err = pneigh_delete(tbl, net, nla_data(dst_attr), dev);
@@ -1711,10 +1726,9 @@ static int neigh_delete(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
}
 
-   err = neigh_update(neigh, NULL, NUD_FAILED,
-  NEIGH_UPDATE_F_OVERRIDE |
-  NEIGH_UPDATE_F_ADMIN,
-  NETLINK_CB(skb).portid);
+   err = __neigh_update(neigh, NULL, NUD_FAILED,
+NEIGH_UPDATE_F_OVERRIDE | NEIGH_UPDATE_F_ADMIN,
+NETLINK_CB(skb).portid, extack);
write_lock_bh(&tbl->lock);
neigh_release(neigh);
neigh_remove_one(neigh, tbl);
@@ -1744,8 +1758,10 @@ static int neigh_add(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
 
err = -EINVAL;
-   if (tb[NDA_DST] == NULL)
+   if (!tb[NDA_DST]) {
+   NL_SET_ERR_MSG(extack, "Network address not specified");
goto out;
+   }
 
ndm = nlmsg_data(nlh);
if (ndm->ndm_ifindex) {
@@ -1755,16 +1771,21 @@ static int neigh_add(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
}
 
-   if (tb[NDA_LLADDR] && nla_len(tb[NDA_LLADDR]) < dev->addr_len)
+   if (tb[NDA_LLADDR] && nla_len(tb[NDA_LLADDR]) < dev->addr_len) {
+   NL_SET_ERR_MSG(extack, "Invalid link address");
goto out;
+   }
}
 
tbl = neigh_find_table(ndm->ndm_family);
if (tbl == NULL)
return -EAFNOSUPPORT;
 
-   if (nla_len(tb[NDA_DST]) < (int)tbl->key_len)
+   if (nla_len(tb[NDA_DST]) < (int)tbl->key_len) {
+   NL_SET_ERR_MSG(extack, "Invalid network address");
goto

Re: [PATCH net-next 2/7] neighbor: Fold ___neigh_lookup_noref into __neigh_lookup_noref

2018-12-05 Thread David Miller

From: David Ahern 
Date: Wed, 5 Dec 2018 17:46:37 -0700

> ok. patches 5-7 are not dependent on 1-4. Should I re-send outside of
> this set?

Yes, please respin.

Thanks David.

Re: [pull request][net-next V2 0/7] Mellanox, mlx5e updates 2018-12-04

2018-12-05 Thread David Miller

From: Saeed Mahameed 
Date: Wed,  5 Dec 2018 16:12:58 -0800

> The following series is for mlx5e netdevice driver, it adds ethtool
> support for RX hash fields configuration and some misc updates, please
> see tag log below.
> 
> Please pull and let me know if there's any problem.
> 
> v1->v2:
>  - Move static const array to c file.
>  - Remove unnecessary blank line
>  - Add #include 
>  - Print priv flag name rather than its hex value

Pulled, thanks Saeed.

Re: [PATCH net-next 2/7] neighbor: Fold ___neigh_lookup_noref into __neigh_lookup_noref

2018-12-05 Thread David Ahern

On 12/5/18 5:46 PM, David Ahern wrote:
> ok. patches 5-7 are not dependent on 1-4. Should I re-send outside of
> this set?

bleh. 5 is. I'll re-send.

Re: [PATCH net-next 2/7] neighbor: Fold ___neigh_lookup_noref into __neigh_lookup_noref

2018-12-05 Thread David Ahern

On 12/5/18 5:44 PM, David Miller wrote:
> From: David Ahern 
> Date: Wed,  5 Dec 2018 15:34:09 -0800
> 
>> @@ -270,37 +270,25 @@ static inline bool neigh_key_eq128(const struct 
>> neighbour *n, const void *pkey)
>>  (n32[2] ^ p32[2]) | (n32[3] ^ p32[3])) == 0;
>>  }
>>  
>> -static inline struct neighbour *___neigh_lookup_noref(
>> -struct neigh_table *tbl,
>> -bool (*key_eq)(const struct neighbour *n, const void *pkey),
>> -__u32 (*hash)(const void *pkey,
>> -  const struct net_device *dev,
>> -  __u32 *hash_rnd),
>> -const void *pkey,
>> -struct net_device *dev)
>> +static inline struct neighbour *__neigh_lookup_noref(struct neigh_table 
>> *tbl,
>> + const void *pkey,
>> + struct net_device *dev)
>>  {
> 
> Sorry, we can't do this.
> 
> The whole point of how this is laid out is so that the entire hash traversal,
> including the hash function, is expanded inline.
> 
> This demux is extremely critical on the output side, it must be the
> smallest number of cycles possible.  It was the only way I could justify
> not caching neigh entries in the routes any more when I wrote this code.
> 
> Even before retpoline, putting an indirect call here is painful.  With
> retpoline it is deadly.
> 
> Please avoid removing the full inline expansion of the neigh lookup in the 
> ipv6
> and ipv4 data paths.
> 

ok. patches 5-7 are not dependent on 1-4. Should I re-send outside of
this set?

Re: [PATCH net-next 2/7] neighbor: Fold ___neigh_lookup_noref into __neigh_lookup_noref

2018-12-05 Thread David Miller

From: David Ahern 
Date: Wed,  5 Dec 2018 15:34:09 -0800

> @@ -270,37 +270,25 @@ static inline bool neigh_key_eq128(const struct 
> neighbour *n, const void *pkey)
>   (n32[2] ^ p32[2]) | (n32[3] ^ p32[3])) == 0;
>  }
>  
> -static inline struct neighbour *___neigh_lookup_noref(
> - struct neigh_table *tbl,
> - bool (*key_eq)(const struct neighbour *n, const void *pkey),
> - __u32 (*hash)(const void *pkey,
> -   const struct net_device *dev,
> -   __u32 *hash_rnd),
> - const void *pkey,
> - struct net_device *dev)
> +static inline struct neighbour *__neigh_lookup_noref(struct neigh_table *tbl,
> +  const void *pkey,
> +  struct net_device *dev)
>  {

Sorry, we can't do this.

The whole point of how this is laid out is so that the entire hash traversal,
including the hash function, is expanded inline.

This demux is extremely critical on the output side, it must be the
smallest number of cycles possible.  It was the only way I could justify
not caching neigh entries in the routes any more when I wrote this code.

Even before retpoline, putting an indirect call here is painful.  With
retpoline it is deadly.

Please avoid removing the full inline expansion of the neigh lookup in the ipv6
and ipv4 data paths.

Thank you.
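
The layout being defended here can be sketched in userspace C: a generic inline walker takes the hash and key-compare functions as arguments, and each protocol-specific wrapper passes fixed functions. When the walker is inlined into such a wrapper, the compiler is able to resolve the callbacks at compile time instead of emitting indirect calls; whether it does so depends on the compiler and optimization level. All names below (`lookup_generic()`, `lookup_u32()`, `hash_u32()`, `key_eq_u32()`) and the multiplicative hash are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHIFT 4	/* 16 buckets, illustrative */

struct entry {
	uint32_t key;
	struct entry *next;
};

static inline uint32_t hash_u32(uint32_t key)
{
	return key * 2654435761u;	/* illustrative multiplicative hash */
}

static inline int key_eq_u32(const struct entry *e, uint32_t key)
{
	return e->key == key;
}

/* Generic walker: the callbacks arrive as parameters. Once inlined
 * into a caller that passes fixed functions, the calls below can be
 * expanded directly rather than going through a pointer. */
static inline struct entry *lookup_generic(struct entry **buckets,
		int (*key_eq)(const struct entry *, uint32_t),
		uint32_t (*hash)(uint32_t),
		uint32_t key)
{
	struct entry *e;

	assert(buckets != NULL);
	for (e = buckets[hash(key) >> (32 - SHIFT)]; e; e = e->next)
		if (key_eq(e, key))
			return e;
	return NULL;
}

/* Specialised entry point: both callbacks are compile-time constants. */
static inline struct entry *lookup_u32(struct entry **buckets, uint32_t key)
{
	return lookup_generic(buckets, key_eq_u32, hash_u32, key);
}
```

Loading the callbacks from a table at run time (the `tbl->hash` and `tbl->key_eq` form in the proposed patch) removes that opportunity, which is the objection raised above.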

Re: [PATCH net] tcp: fix NULL ref in tail loss probe

2018-12-05 Thread David Miller

From: Yuchung Cheng 
Date: Wed,  5 Dec 2018 14:38:38 -0800

> The TCP loss probe timer may fire when the retransmission queue is empty
> but tp->packets_out is non-zero. tcp_send_loss_probe then calls
> tcp_rearm_rto, which triggers a NULL pointer dereference by fetching the
> retransmission queue head in its sub-routines.
> 
> Add a more detailed warning to help catch the root cause of the inflight
> accounting inconsistency.
> 
> Reported-by: Rafael Tinoco 
> Signed-off-by: Yuchung Cheng 
> Signed-off-by: Eric Dumazet 
> Signed-off-by: Neal Cardwell 

Applied, thanks for working to diagnose this so quickly.

Re: [PATCH 1/2] net: linkwatch: send change uevent on link changes

2018-12-05 Thread David Miller

From: Jouke Witteveen 
Date: Wed, 5 Dec 2018 23:38:17 +0100

> Can you elaborate a bit? I may not be aware of the policy you have in
> mind.

When we have a user facing interface to do something, we don't create
another one unless it is absolutely, positively, unavoidable.

Re: [PATCH net] tcp: Do not underestimate rwnd_limited

2018-12-05 Thread David Miller

From: Eric Dumazet 
Date: Wed,  5 Dec 2018 14:24:31 -0800

> If available rwnd is too small, tcp_tso_should_defer()
> can decide it is worth waiting before splitting a TSO packet.
> 
> This really means we are rwnd limited.
> 
> Fixes: 5615f88614a4 ("tcp: instrument how long TCP is limited by receive window")
> Signed-off-by: Eric Dumazet 

Applied and queued up for -stable, thanks Eric.

Re: pull-request: bpf 2018-12-05

2018-12-05 Thread David Miller

From: Alexei Starovoitov 
Date: Wed, 5 Dec 2018 13:23:22 -0800

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) fix bpf uapi pointers for 32-bit architectures, from Daniel.
> 
> 2) improve verifier ability to handle progs with a lot of branches, from Alexei.
> 
> 3) strict btf checks, from Yonghong.
> 
> 4) bpf_sk_lookup api cleanup, from Joe.
> 
> 5) other misc fixes
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thank you.

Re: [PATCH net-next 0/6] u32 to linkmode fixes

2018-12-05 Thread David Miller

From: Andrew Lunn 
Date: Wed,  5 Dec 2018 21:49:39 +0100

> This patchset fixes issues found in the last patchset which converted
> the phydev advertise etc, from a u32 to a linux bitmap. Most of the
> issues are the result of clearing bits which should not have been
> cleared. To make the API clearer, the idea from Heiner Kallweit was
> used, with _mod_ to indicate the function modifies just the bits it
> needs to, or _to_ to clear all bits and just set the bits that need to
> be set.

Series applied, thanks Andrew.

Please always list the Fixes tag first in the future.  I fixed it up
for you this time.

Thanks again.

Re: [PATCH net] net: use skb_list_del_init() to remove from RX sublists

2018-12-05 Thread David Miller

From: Edward Cree 
Date: Tue, 4 Dec 2018 17:37:57 +

> list_del() leaves the skb->next pointer poisoned, which can then lead to
>  a crash in e.g. OVS forwarding.  For example, setting up an OVS VXLAN
>  forwarding bridge on sfc as per:
 ...
> So, in all listified-receive handling, instead pull skbs off the lists with
>  skb_list_del_init().
> 
> Fixes: 9af86f933894 ("net: core: fix use-after-free in __netif_receive_skb_list_core")
> Fixes: 7da517a3bc52 ("net: core: Another step of skb receive list processing")
> Fixes: a4ca8b7df73c ("net: ipv4: fix drop handling in ip_list_rcv() and ip_list_rcv_finish()")
> Fixes: d8269e2cbf90 ("net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()")
> Signed-off-by: Edward Cree 

Applied and queued up for -stable

> I'm not sure if these are the right Fixes tags, or if I should instead be
>  fingering some commit that made dev_hard_start_xmit() more sensitive to
>  skb->next.
> Also, I only saw a crash from the list_del() in 
> __netif_receive_skb_list_core()
>  but I converted all of them in the listified RX path, in case any others
>  have similar ways to escape into paths that care about skb->next.

I think we should use skb_list_del_init() on skb->list in all cases except
where we immediately queue it onto another list in a trivially auditable
way.

Therefore I think what you did is the way to go.

Thanks.
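
The difference between the two removal styles can be sketched with a minimal doubly linked list in userspace C. The names `node_del()` and `node_del_init()` and the `POISON` sentinel are hypothetical stand-ins for list_del(), skb_list_del_init() and the kernel's list poison values; the sketch only models the effect on the `next` pointer.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next, *prev;
};

/* Assumed sentinel, standing in for the kernel's list poison values. */
#define POISON ((struct node *)0x100)

static void unlink_node(struct node *n)
{
	assert(n != NULL && n->next != NULL && n->prev != NULL);
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* list_del()-style: unlink and poison, so any later use of the
 * stale pointers follows an invalid address. */
static void node_del(struct node *n)
{
	unlink_node(n);
	n->next = POISON;
	n->prev = POISON;
}

/* skb_list_del_init()-style: unlink and clear next, so later code
 * that tests next sees "no further element". */
static void node_del_init(struct node *n)
{
	unlink_node(n);
	n->next = NULL;
}
```

Code further down the path that walks `next` until it is NULL is safe after the second form and follows a poisoned pointer after the first.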

[PATCH net-next 2/7] neighbor: Fold ___neigh_lookup_noref into __neigh_lookup_noref

2018-12-05 Thread David Ahern

From: David Ahern 

There are no more direct callers of ___neigh_lookup_noref so no need
for it to be a standalone helper.

Signed-off-by: David Ahern 
---
 include/net/neighbour.h | 22 +-
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index f58b384aa6c9..aac87bc2d96b 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -270,37 +270,25 @@ static inline bool neigh_key_eq128(const struct neighbour 
*n, const void *pkey)
(n32[2] ^ p32[2]) | (n32[3] ^ p32[3])) == 0;
 }
 
-static inline struct neighbour *___neigh_lookup_noref(
-   struct neigh_table *tbl,
-   bool (*key_eq)(const struct neighbour *n, const void *pkey),
-   __u32 (*hash)(const void *pkey,
- const struct net_device *dev,
- __u32 *hash_rnd),
-   const void *pkey,
-   struct net_device *dev)
+static inline struct neighbour *__neigh_lookup_noref(struct neigh_table *tbl,
+const void *pkey,
+struct net_device *dev)
 {
struct neigh_hash_table *nht = rcu_dereference_bh(tbl->nht);
struct neighbour *n;
u32 hash_val;
 
-   hash_val = hash(pkey, dev, nht->hash_rnd) >> (32 - nht->hash_shift);
+   hash_val = tbl->hash(pkey, dev, nht->hash_rnd) >> (32 - 
nht->hash_shift);
for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
 n != NULL;
 n = rcu_dereference_bh(n->next)) {
-   if (n->dev == dev && key_eq(n, pkey))
+   if (n->dev == dev && tbl->key_eq(n, pkey))
return n;
}
 
return NULL;
 }
 
-static inline struct neighbour *__neigh_lookup_noref(struct neigh_table *tbl,
-const void *pkey,
-struct net_device *dev)
-{
-   return ___neigh_lookup_noref(tbl, tbl->key_eq, tbl->hash, pkey, dev);
-}
-
 void neigh_table_init(int index, struct neigh_table *tbl);
 int neigh_table_clear(int index, struct neigh_table *tbl);
 struct neighbour *neigh_lookup(struct neigh_table *tbl, const void *pkey,
-- 
2.11.0

[PATCH net-next 5/7] neighbor: Create a neigh_hash helper

2018-12-05 Thread David Ahern

From: David Ahern 

Consolidate calculations of the neighbor hash into a single helper.

Signed-off-by: David Ahern 
---
 include/net/neighbour.h | 10 +-
 net/core/neighbour.c| 15 +--
 2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index aac87bc2d96b..092493a8c91b 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -270,6 +270,14 @@ static inline bool neigh_key_eq128(const struct neighbour 
*n, const void *pkey)
(n32[2] ^ p32[2]) | (n32[3] ^ p32[3])) == 0;
 }
 
+static inline u32 neigh_hash(struct neigh_table *tbl,
+struct neigh_hash_table *nht,
+const void *pkey,
+struct net_device *dev)
+{
+   return tbl->hash(pkey, dev, nht->hash_rnd) >> (32 - nht->hash_shift);
+}
+
 static inline struct neighbour *__neigh_lookup_noref(struct neigh_table *tbl,
 const void *pkey,
 struct net_device *dev)
@@ -278,7 +286,7 @@ static inline struct neighbour *__neigh_lookup_noref(struct 
neigh_table *tbl,
struct neighbour *n;
u32 hash_val;
 
-   hash_val = tbl->hash(pkey, dev, nht->hash_rnd) >> (32 - 
nht->hash_shift);
+   hash_val = neigh_hash(tbl, nht, pkey, dev);
for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
 n != NULL;
 n = rcu_dereference_bh(n->next)) {
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 41954e42a2de..53e30c15882d 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -151,9 +151,8 @@ bool neigh_remove_one(struct neighbour *ndel, struct 
neigh_table *tbl)
 
nht = rcu_dereference_protected(tbl->nht,
lockdep_is_held(&tbl->lock));
-   hash_val = tbl->hash(pkey, ndel->dev, nht->hash_rnd);
-   hash_val = hash_val >> (32 - nht->hash_shift);
 
+   hash_val = neigh_hash(tbl, nht, pkey, ndel->dev);
np = &nht->hash_buckets[hash_val];
while ((n = rcu_dereference_protected(*np,
  lockdep_is_held(&tbl->lock)))) {
@@ -434,10 +433,7 @@ static struct neigh_hash_table *neigh_hash_grow(struct 
neigh_table *tbl,
   lockdep_is_held(&tbl->lock));
 n != NULL;
 n = next) {
-   hash = tbl->hash(n->primary_key, n->dev,
-new_nht->hash_rnd);
-
-   hash >>= (32 - new_nht->hash_shift);
+   hash = neigh_hash(tbl, new_nht, n->primary_key, n->dev);
next = rcu_dereference_protected(n->next,
lockdep_is_held(&tbl->lock));
 
@@ -485,9 +481,9 @@ struct neighbour *neigh_lookup_nodev(struct neigh_table 
*tbl, struct net *net,
NEIGH_CACHE_STAT_INC(tbl, lookups);
 
rcu_read_lock_bh();
-   nht = rcu_dereference_bh(tbl->nht);
-   hash_val = tbl->hash(pkey, NULL, nht->hash_rnd) >> (32 - 
nht->hash_shift);
 
+   nht = rcu_dereference_bh(tbl->nht);
+   hash_val = neigh_hash(tbl, nht, pkey, NULL);
for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
 n != NULL;
 n = rcu_dereference_bh(n->next)) {
@@ -553,13 +549,12 @@ struct neighbour *__neigh_create(struct neigh_table *tbl, 
const void *pkey,
if (atomic_read(&tbl->entries) > (1 << nht->hash_shift))
nht = neigh_hash_grow(tbl, nht->hash_shift + 1);
 
-   hash_val = tbl->hash(n->primary_key, dev, nht->hash_rnd) >> (32 - 
nht->hash_shift);
-
if (n->parms->dead) {
rc = ERR_PTR(-EINVAL);
goto out_tbl_unlock;
}
 
+   hash_val = neigh_hash(tbl, nht, n->primary_key, dev);
for (n1 = rcu_dereference_protected(nht->hash_buckets[hash_val],
lockdep_is_held(&tbl->lock));
 n1 != NULL;
-- 
2.11.0
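
The expression this patch consolidates, `hash >> (32 - hash_shift)`, selects a bucket from the top bits of a 32-bit hash. A minimal userspace sketch; `bucket_index()` is a hypothetical name and the range check is an addition of the sketch, not something the kernel helper performs:

```c
#include <assert.h>
#include <stdint.h>

/* Pick a bucket from the top `shift` bits of a 32-bit hash, for a
 * table of 2^shift buckets. Valid for 1 <= shift <= 32; a shift of 0
 * would shift by 32 bits, which is undefined behaviour in C. */
static inline uint32_t bucket_index(uint32_t hash, unsigned int shift)
{
	assert(shift >= 1 && shift <= 32);
	return hash >> (32 - shift);
}
```

Taking the high bits suits multiplicative hashes such as the ARP one in this series, where the upper bits of the product are the better mixed ones.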

[PATCH net-next 6/7] neighbor: Skip the duplicate lookup in neigh_add

2018-12-05 Thread David Ahern

From: David Ahern 

When adding a new neighbor via rtnetlink, neigh_add does a lookup
and if the result is NULL calls __neigh_lookup_errno to create a
new entry if the NLM_F_CREATE flag is set. But, __neigh_lookup_errno
calls neigh_lookup again before neigh_create; the neigh_lookup is
redundant.

Replace the call to __neigh_lookup_errno with a call to __neigh_create
to more efficiently achieve the same result and prepare for the next
patch.

Signed-off-by: David Ahern 
---
 net/core/neighbour.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 53e30c15882d..e324467e9a71 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1785,7 +1785,7 @@ static int neigh_add(struct sk_buff *skb, struct nlmsghdr 
*nlh,
goto out;
}
 
-   neigh = __neigh_lookup_errno(tbl, dst, dev);
+   neigh = __neigh_create(tbl, dst, dev, true);
if (IS_ERR(neigh)) {
err = PTR_ERR(neigh);
goto out;
-- 
2.11.0

[PATCH net-next 0/7] neighbor: cleanups plus extack for add and delete

2018-12-05 Thread David Ahern

From: David Ahern 

cleanups:
- remove open coding of key and hash functions for ipv4 and ipv6
  and then collapse hash functions
- collapse now unnecessary ___neigh_lookup_noref helper
- create helper for neigh hash computation
- remove duplicate lookup in neigh_add

After that add extack messages for neighbor add and delete.

David Ahern (7):
  neighbor: Remove open coding of key and hash functions
  neighbor: Fold ___neigh_lookup_noref into __neigh_lookup_noref
  net/ipv4: Move arp_hashfn into arp_hash
  net/ipv6: Move ndisc_hashfn to ndisc_hash
  neighbor: Create a neigh_hash helper
  neighbor: Skip the duplicate lookup in neigh_add
  neighbor: Add extack messages for add and delete commands

 include/net/arp.h   | 10 +--
 include/net/ndisc.h | 12 +
 include/net/neighbour.h | 30 +
 net/core/filter.c   |  3 +--
 net/core/neighbour.c| 72 ++---
 net/ipv4/arp.c  |  5 +++-
 net/ipv6/ndisc.c|  7 -
 7 files changed, 71 insertions(+), 68 deletions(-)

-- 
2.11.0

[PATCH net-next 7/7] neighbor: Add extack messages for add and delete commands

2018-12-05 Thread David Ahern

From: David Ahern 

Add extack messages for failures in neigh_add and neigh_delete.

Also, require NDA_DST length to be exactly the key length for the
table otherwise it is an unexpected address and can lead to unexpected
entries. e.g., an IPv4 table sent an IPv6 address (using a modified ip):
$ ip neigh add 2001:db8:1::1 dev foo
$ ip neigh ls dev foo
32.1.13.184 dev foo lladdr 72:ed:f1:d9:20:9a PERMANENT
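
The odd-looking entry follows directly from the bytes involved: 2001:db8:1::1 begins with 0x20 0x01 0x0d 0xb8, and a table with a 4-byte key keeps only those. A userspace sketch of that truncation and of an exact-length check; `first4_as_ipv4()` and `key_len_ok()` are hypothetical helpers for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format the first 4 bytes of an address as a dotted quad, which is
 * what a 4-byte-key table keeps when handed a longer address.
 * Returns snprintf()'s result, or -1 if the address is too short. */
static int first4_as_ipv4(const unsigned char *addr, size_t addr_len,
			  char *out, size_t out_len)
{
	assert(addr != NULL && out != NULL);
	if (addr_len < 4)
		return -1;
	return snprintf(out, out_len, "%u.%u.%u.%u",
			(unsigned int)addr[0], (unsigned int)addr[1],
			(unsigned int)addr[2], (unsigned int)addr[3]);
}

/* Exact-length check of the kind the commit message asks for:
 * a longer attribute is rejected rather than silently truncated. */
static int key_len_ok(size_t attr_len, size_t key_len)
{
	return attr_len == key_len;
}
```

A "not shorter than the key" check lets the 16-byte address through; an equality check rejects it.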

Signed-off-by: David Ahern 
---
 net/core/neighbour.c | 55 +---
 1 file changed, 39 insertions(+), 16 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index e324467e9a71..916a99fbb306 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1132,8 +1132,9 @@ static void neigh_update_hhs(struct neighbour *neigh)
Caller MUST hold reference count on the entry.
  */
 
-int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
-u32 flags, u32 nlmsg_pid)
+static int __neigh_update(struct neighbour *neigh, const u8 *lladdr,
+ u8 new, u32 flags, u32 nlmsg_pid,
+ struct netlink_ext_ack *extack)
 {
u8 old;
int err;
@@ -1150,8 +1151,10 @@ int neigh_update(struct neighbour *neigh, const u8 
*lladdr, u8 new,
if (!(flags & NEIGH_UPDATE_F_ADMIN) &&
(old & (NUD_NOARP | NUD_PERMANENT)))
goto out;
-   if (neigh->dead)
+   if (neigh->dead) {
+   NL_SET_ERR_MSG(extack, "Neighbor entry is now dead");
goto out;
+   }
 
neigh_update_ext_learned(neigh, flags, &notify);
 
@@ -1188,8 +1191,10 @@ int neigh_update(struct neighbour *neigh, const u8 
*lladdr, u8 new,
   use it, otherwise discard the request.
 */
err = -EINVAL;
-   if (!(old & NUD_VALID))
+   if (!(old & NUD_VALID)) {
+   NL_SET_ERR_MSG(extack, "No link layer address given");
goto out;
+   }
lladdr = neigh->ha;
}
 
@@ -1302,6 +1307,12 @@ int neigh_update(struct neighbour *neigh, const u8 
*lladdr, u8 new,
 
return err;
 }
+
+int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
+u32 flags, u32 nlmsg_pid)
+{
+   return __neigh_update(neigh, lladdr, new, flags, nlmsg_pid, NULL);
+}
 EXPORT_SYMBOL(neigh_update);
 
 /* Update the neigh to listen temporarily for probe responses, even if it is
@@ -1673,8 +1684,10 @@ static int neigh_delete(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
 
dst_attr = nlmsg_find_attr(nlh, sizeof(*ndm), NDA_DST);
-   if (dst_attr == NULL)
+   if (!dst_attr) {
+   NL_SET_ERR_MSG(extack, "Network address not specified");
goto out;
+   }
 
ndm = nlmsg_data(nlh);
if (ndm->ndm_ifindex) {
@@ -1689,8 +1702,10 @@ static int neigh_delete(struct sk_buff *skb, struct 
nlmsghdr *nlh,
if (tbl == NULL)
return -EAFNOSUPPORT;
 
-   if (nla_len(dst_attr) < (int)tbl->key_len)
+   if (nla_len(dst_attr) < (int)tbl->key_len) {
+   NL_SET_ERR_MSG(extack, "Invalid network address");
goto out;
+   }
 
if (ndm->ndm_flags & NTF_PROXY) {
err = pneigh_delete(tbl, net, nla_data(dst_attr), dev);
@@ -1706,10 +1721,9 @@ static int neigh_delete(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
}
 
-   err = neigh_update(neigh, NULL, NUD_FAILED,
-  NEIGH_UPDATE_F_OVERRIDE |
-  NEIGH_UPDATE_F_ADMIN,
-  NETLINK_CB(skb).portid);
+   err = __neigh_update(neigh, NULL, NUD_FAILED,
+NEIGH_UPDATE_F_OVERRIDE | NEIGH_UPDATE_F_ADMIN,
+NETLINK_CB(skb).portid, extack);
write_lock_bh(&tbl->lock);
neigh_release(neigh);
neigh_remove_one(neigh, tbl);
@@ -1739,8 +1753,10 @@ static int neigh_add(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
 
err = -EINVAL;
-   if (tb[NDA_DST] == NULL)
+   if (!tb[NDA_DST]) {
+   NL_SET_ERR_MSG(extack, "Network address not specified");
goto out;
+   }
 
ndm = nlmsg_data(nlh);
if (ndm->ndm_ifindex) {
@@ -1750,16 +1766,21 @@ static int neigh_add(struct sk_buff *skb, struct 
nlmsghdr *nlh,
goto out;
}
 
-   if (tb[NDA_LLADDR] && nla_len(tb[NDA_LLADDR]) < dev->addr_len)
+   if (tb[NDA_LLADDR] && nla_len(tb[NDA_LLADDR]) < dev->addr_len) {
+   NL_SET_ERR_MSG(extack, "Invalid link address");
g

[PATCH net-next 3/7] net/ipv4: Move arp_hashfn into arp_hash

2018-12-05 Thread David Ahern

From: David Ahern 

There are no more direct references to arp_hashfn so fold it into
arp_hash, the hash callback for arp.

Signed-off-by: David Ahern 
---
 include/net/arp.h | 8 
 net/ipv4/arp.c| 5 -
 2 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/include/net/arp.h b/include/net/arp.h
index a5091f13cd3e..9f433c077b67 100644
--- a/include/net/arp.h
+++ b/include/net/arp.h
@@ -10,14 +10,6 @@
 
 extern struct neigh_table arp_tbl;
 
-static inline u32 arp_hashfn(const void *pkey, const struct net_device *dev, 
u32 *hash_rnd)
-{
-   u32 key = *(const u32 *)pkey;
-   u32 val = key ^ hash32_ptr(dev);
-
-   return val * hash_rnd[0];
-}
-
 static inline struct neighbour *__ipv4_neigh_lookup_noref(struct net_device 
*dev, u32 key)
 {
if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 850a6f13a082..6b88211287ae 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -213,7 +213,10 @@ static u32 arp_hash(const void *pkey,
const struct net_device *dev,
__u32 *hash_rnd)
 {
-   return arp_hashfn(pkey, dev, hash_rnd);
+   u32 key = *(const u32 *)pkey;
+   u32 val = key ^ hash32_ptr(dev);
+
+   return val * hash_rnd[0];
 }
 
 static bool arp_key_eq(const struct neighbour *neigh, const void *pkey)
-- 
2.11.0
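
The hash that this patch folds into arp_hash() can be tried out in user space. What follows is only a sketch under stated assumptions: sketch_hash32_ptr() is a hypothetical stand-in for the kernel's hash32_ptr() (the real helper mixes the pointer differently), and the device is passed as an opaque pointer.

```c
#include <stdint.h>

/* Hypothetical stand-in for hash32_ptr(): fold a pointer into 32 bits.
 * The kernel helper uses a different mixing function. */
static uint32_t sketch_hash32_ptr(const void *ptr)
{
	uint64_t v = (uint64_t)(uintptr_t)ptr;

	return (uint32_t)(v ^ (v >> 32));
}

/* Same shape as the body moved into arp_hash(): XOR the IPv4 key with
 * the device hash, then multiply by the first random seed word. */
static uint32_t sketch_arp_hash(const void *pkey, const void *dev,
				const uint32_t *hash_rnd)
{
	uint32_t key = *(const uint32_t *)pkey;
	uint32_t val = key ^ sketch_hash32_ptr(dev);

	return val * hash_rnd[0];
}
```

The multiplication is done on unsigned 32-bit values, so it wraps modulo 2^32 rather than overflowing.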

[PATCH net-next 4/7] net/ipv6: Move ndisc_hashfn to ndisc_hash

2018-12-05 Thread David Ahern

From: David Ahern 

There are no more direct references to ndisc_hashfn so fold it into
ndisc_hash, the hash callback for ndisc.

Signed-off-by: David Ahern 
---
 include/net/ndisc.h | 10 --
 net/ipv6/ndisc.c|  7 ++-
 2 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/include/net/ndisc.h b/include/net/ndisc.h
index c354345c679b..83a84f68901b 100644
--- a/include/net/ndisc.h
+++ b/include/net/ndisc.h
@@ -364,16 +364,6 @@ static inline u8 *ndisc_opt_addr_data(struct nd_opt_hdr *p,
 ndisc_addr_option_pad(dev->type));
 }
 
-static inline u32 ndisc_hashfn(const void *pkey, const struct net_device *dev, 
__u32 *hash_rnd)
-{
-   const u32 *p32 = pkey;
-
-   return (((p32[0] ^ hash32_ptr(dev)) * hash_rnd[0]) +
-   (p32[1] * hash_rnd[1]) +
-   (p32[2] * hash_rnd[2]) +
-   (p32[3] * hash_rnd[3]));
-}
-
 static inline struct neighbour *__ipv6_neigh_lookup_noref(struct net_device 
*dev, const void *pkey)
 {
return __neigh_lookup_noref(&nd_tbl, pkey, dev);
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 659ecf4e4b3c..304a32b3c3f5 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -311,7 +311,12 @@ static u32 ndisc_hash(const void *pkey,
  const struct net_device *dev,
  __u32 *hash_rnd)
 {
-   return ndisc_hashfn(pkey, dev, hash_rnd);
+   const u32 *p32 = pkey;
+
+   return (((p32[0] ^ hash32_ptr(dev)) * hash_rnd[0]) +
+(p32[1] * hash_rnd[1]) +
+(p32[2] * hash_rnd[2]) +
+(p32[3] * hash_rnd[3]));
 }
 
 static bool ndisc_key_eq(const struct neighbour *n, const void *pkey)
-- 
2.11.0

[PATCH net-next 1/7] neighbor: Remove open coding of key and hash functions

2018-12-05 Thread David Ahern

From: David Ahern 

___neigh_lookup_noref takes the key and hash functions as inputs, yet
those are part of the operations listed in the neigh_table which is
also passed as an argument. Remove the open coding of these internal
implementations by converting uses of ___neigh_lookup_noref to
__neigh_lookup_noref.

For IPv4, arp_key_eq is essentially a call to neigh_key_eq32 and
arp_hash is a call to arp_hashfn. Similarly for IPv6, ndisc_key_eq
calls neigh_key_eq128 and ndisc_hash calls ndisc_hashfn. So the change
in helpers is a no-op.

Signed-off-by: David Ahern 
---
 include/net/arp.h   | 2 +-
 include/net/ndisc.h | 2 +-
 net/core/filter.c   | 3 +--
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/net/arp.h b/include/net/arp.h
index 977aabfcdc03..a5091f13cd3e 100644
--- a/include/net/arp.h
+++ b/include/net/arp.h
@@ -23,7 +23,7 @@ static inline struct neighbour 
*__ipv4_neigh_lookup_noref(struct net_device *dev
if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
key = INADDR_ANY;
 
-   return ___neigh_lookup_noref(&arp_tbl, neigh_key_eq32, arp_hashfn, &key, dev);
+   return __neigh_lookup_noref(&arp_tbl, &key, dev);
 }
 
 static inline struct neighbour *__ipv4_neigh_lookup(struct net_device *dev, 
u32 key)
diff --git a/include/net/ndisc.h b/include/net/ndisc.h
index ddfbb591e2c5..c354345c679b 100644
--- a/include/net/ndisc.h
+++ b/include/net/ndisc.h
@@ -376,7 +376,7 @@ static inline u32 ndisc_hashfn(const void *pkey, const 
struct net_device *dev, _
 
 static inline struct neighbour *__ipv6_neigh_lookup_noref(struct net_device 
*dev, const void *pkey)
 {
-   return ___neigh_lookup_noref(&nd_tbl, neigh_key_eq128, ndisc_hashfn, pkey, dev);
+   return __neigh_lookup_noref(&nd_tbl, pkey, dev);
 }
 
 static inline struct neighbour *__ipv6_neigh_lookup(struct net_device *dev, 
const void *pkey)
diff --git a/net/core/filter.c b/net/core/filter.c
index bd0df75dc7b6..f10cc675783c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4668,8 +4668,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct 
bpf_fib_lookup *params,
 * not needed here. Can not use __ipv6_neigh_lookup_noref here
 * because we need to get nd_tbl via the stub
 */
-   neigh = ___neigh_lookup_noref(ipv6_stub->nd_tbl, neigh_key_eq128,
- ndisc_hashfn, dst, dev);
+   neigh = __neigh_lookup_noref(ipv6_stub->nd_tbl, dst, dev);
if (!neigh)
return BPF_FIB_LKUP_RET_NO_NEIGH;
 
-- 
2.11.0
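
The idea behind this patch, that the table already carries its key and hash callbacks so callers need not name them again, can be sketched in user space. Everything here is hypothetical and heavily reduced: struct sketch_table is not struct neigh_table, the device argument is omitted, and hash_u32()/eq_u32() are illustrative callbacks only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_entry {
	uint32_t key;
	struct sketch_entry *next;
};

/* Hypothetical, reduced analogue of a neighbour table: the table
 * itself stores the hash and key-compare callbacks. */
struct sketch_table {
	uint32_t (*hash)(const void *pkey);
	bool (*key_eq)(const struct sketch_entry *e, const void *pkey);
	struct sketch_entry *buckets[8];
};

/* Low-level form: the caller names the callbacks explicitly. */
static struct sketch_entry *
lookup_explicit(struct sketch_table *tbl,
		bool (*key_eq)(const struct sketch_entry *, const void *),
		uint32_t (*hash)(const void *), const void *pkey)
{
	struct sketch_entry *e = tbl->buckets[hash(pkey) % 8];

	for (; e; e = e->next)
		if (key_eq(e, pkey))
			return e;
	return NULL;
}

/* Wrapper form: the callbacks are taken from the table. */
static struct sketch_entry *lookup(struct sketch_table *tbl, const void *pkey)
{
	return lookup_explicit(tbl, tbl->key_eq, tbl->hash, pkey);
}

/* Illustrative callbacks for 32-bit keys. */
static uint32_t hash_u32(const void *pkey)
{
	return *(const uint32_t *)pkey * 2654435761u;
}

static bool eq_u32(const struct sketch_entry *e, const void *pkey)
{
	return e->key == *(const uint32_t *)pkey;
}
```

As long as the table's callbacks are the same functions a caller would have passed by hand, switching to the wrapper does not change the result, which is the "no-op" argument made in the commit message.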

Re: [PATCH net] macvlan: remove duplicate check

2018-12-05 Thread David Miller

From: Matteo Croce 
Date: Tue,  4 Dec 2018 18:05:42 +0100

> Following commit 59f997b088d2 ("macvlan: return correct error value"),
> there is a duplicate check for mac addresses both in macvlan_sync_address()
> and macvlan_set_mac_address().
> As the former calls the latter, remove the one in macvlan_set_mac_address()
> and move the one in macvlan_sync_address() before any other check.
> 
> Signed-off-by: Matteo Croce 

Hmmm, doesn't this change behavior?

For the handling of the NETDEV_CHANGEADDR event in macvlan_device_event()
we would make it to macvlan_sync_address(), and if IFF_UP is false,
we would elide the macvlan_addr_busy() check and just copy the MAC address
over and return.

Now, we would always perform the macvlan_addr_busy() check.

Please, if this is OK, explain and document this behavioral change in
the commit message.

Thank you.

Re: [PATCH 0/3] net: macb: DMA race condition fixes

2018-12-05 Thread David Miller

From: Anssi Hannula 
Date: Fri, 30 Nov 2018 20:21:34 +0200

> Here are a couple of race condition fixes for the macb driver. The first
> two are issues observed on real HW.

It looks like there is still an active discussion about the memory
barriers in patch #3 being excessive.

Once that is sorted out to everyone's satisfaction, would you
please repost this series with appropriate ACKs, reviewed-by's,
tested-by's, etc. added?

Thank you.

Re: [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA

2018-12-05 Thread David Miller

From: Anssi Hannula 
Date: Fri, 30 Nov 2018 20:21:35 +0200

> @@ -682,6 +682,11 @@ static void macb_set_addr(struct macb *bp, struct 
> macb_dma_desc *desc, dma_addr_
>   if (bp->hw_dma_cap & HW_DMA_CAP_64B) {
>   desc_64 = macb_64b_desc(bp, desc);
>   desc_64->addrh = upper_32_bits(addr);
> + /* The low bits of RX address contain the RX_USED bit, clearing
> +  * of which allows packet RX. Make sure the high bits are also
> +  * visible to HW at that point.
> +  */
> + dma_wmb();
>   }

I agree that dma_wmb() is what should be used here.

We are ordering CPU stores with DMA visibility, which is exactly what
the dma_*() are for.

If it doesn't work properly on some architecture's implementation of dma_*(),
those should be fixed rather than papering over it in the drivers.
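
The ordering requirement in the quoted hunk can be illustrated with a user-space analogy. This is only a sketch under loud assumptions: struct sketch_desc and SKETCH_RX_USED are hypothetical, and a C11 release store orders stores as seen by other CPU threads, which is not the same guarantee dma_wmb() gives towards a device.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_RX_USED 0x1u	/* hypothetical "used" flag in the low word */

/* Hypothetical two-word descriptor, loosely modelled on the hunk. */
struct sketch_desc {
	_Atomic uint32_t addr_lo;	/* low address bits plus the used flag */
	uint32_t addr_hi;		/* high address bits */
};

static void sketch_set_addr(struct sketch_desc *desc, uint64_t addr)
{
	/* Write the high half first. */
	desc->addr_hi = (uint32_t)(addr >> 32);

	/* The release store plays the role of dma_wmb() followed by the
	 * low-word write: an observer that sees the used flag cleared
	 * also sees the high half written above. */
	atomic_store_explicit(&desc->addr_lo,
			      (uint32_t)addr & ~SKETCH_RX_USED,
			      memory_order_release);
}
```

The point of the pattern is that the word carrying the ownership flag is written last, with a barrier between it and everything the consumer will read after seeing the flag.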

Re: [PATCH 1/2] net: linkwatch: send change uevent on link changes

2018-12-05 Thread David Miller

From: Jouke Witteveen 
Date: Wed, 5 Dec 2018 14:50:31 +0100

> For example, I maintain a network manager that delegates the actual
> networking work to specialized programs.

Basically "I've implemented things using separate programs"

> Basically, it is an implementation of network manager logic in shell
> script. For such a shell script, it is easy to respond to uevents
> (via udev, or alternatives), but responding to rtnetlink messages
> would require a separate program.

And "In order to use rtnetlink I'll need a separate program!"

(╯°□°）╯︵ ┻━┻

So it's ok to use the separate program paradigm for dividing up
the tasks, but not for processing events?

I'm not convinced.

Either use the facility we have or extend it to fill a valid missing
need.

I'm not applying these patches, your logic doesn't add up and it's
inconsistent with our clear goals of not duplicating functionality.

Re: [PATCH bpf-next 2/7] ppc: bpf: implement jitting of BPF_ALU | BPF_ARSH | BPF_*

2018-12-05 Thread David Miller

From: Jiong Wang 
Date: Wed, 05 Dec 2018 11:28:32 +

> Indeed. Doubled checked the ISA doc,"Bit 32 of RS is replicated to fill
> RA0:31.".
> 
> Will fix both places in v2.

See, sparc64 isn't so weird :-)

Re: [PATCH net-next] tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT

2018-12-04 Thread David Miller

From: Eric Dumazet 
Date: Tue,  4 Dec 2018 07:58:17 -0800

> TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12
> as a step to enable bigger tcp sndbuf limits.
> 
> It works reasonably well, but the following happens :
> 
> Once the limit is reached, TCP stack generates
> an [E]POLLOUT event for every incoming ACK packet.
> 
> This causes a high number of context switches.
> 
> This patch implements the strategy David Miller added
> in sock_def_write_space() :
> 
>  - If TCP socket has a notsent_lowat constraint of X bytes,
>allow sendmsg() to fill up to X bytes, but send [E]POLLOUT
>only if number of notsent bytes is below X/2
> 
> This considerably reduces TCP_NOTSENT_LOWAT overhead,
> while allowing to keep the pipe full.
 ...
> Signed-off-by: Eric Dumazet 
> Acked-by: Soheil Hassas Yeganeh 

Applied, thanks Eric.
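
The two thresholds described in the quoted commit message can be written as a single predicate. This is a minimal user-space sketch, not the kernel function: the name sketch_stream_memory_free() and its plain integer arguments are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* wake == 0: may sendmsg() queue more data?    notsent < lowat
 * wake == 1: should [E]POLLOUT be signalled?   2 * notsent < lowat
 *
 * Shifting by 'wake' doubles the unsent byte count for the wakeup
 * test, which is the same as comparing against lowat / 2. */
static bool sketch_stream_memory_free(uint32_t notsent_bytes,
				      uint32_t lowat, int wake)
{
	return ((uint64_t)notsent_bytes << wake) < lowat;
}
```

With a limit of 1000 bytes, a socket holding 600 unsent bytes still accepts writes but does not generate a wakeup; the wakeup fires only once the unsent amount has dropped below 500.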

Re: [PATCH v2 2/2] net: mvpp2: fix phylink handling of invalid PHY modes

2018-12-04 Thread David Miller

From: Baruch Siach 
Date: Tue,  4 Dec 2018 16:03:53 +0200

> The .validate phylink callback should empty the supported bitmap when
> the interface mode is invalid.
> 
> Cc: Maxime Chevallier 
> Cc: Antoine Tenart 
> Reported-by: Russell King 
> Signed-off-by: Baruch Siach 

Applied.

Re: [PATCH v2 1/2] net: mvpp2: fix detection of 10G SFP modules

2018-12-04 Thread David Miller

From: Baruch Siach 
Date: Tue,  4 Dec 2018 16:03:52 +0200

> The mvpp2_phylink_validate() relies on the interface field of
> phylink_link_state to determine valid link modes. However, when called
> from phylink_sfp_module_insert() this field in not initialized. The
> default switch case then excludes 10G link modes. This allows 10G SFP
> modules that are detected correctly to be configured at max rate of
> 2.5G.
> 
> Catch the uninitialized PHY mode case, and allow 10G rates.
> 
> Fixes: d97c9f4ab000b ("net: mvpp2: 1000baseX support")
> Cc: Maxime Chevallier 
> Cc: Antoine Tenart 
> Acked-by: Russell King 
> Signed-off-by: Baruch Siach 

Applied.

Re: [PATCH v2 net-next] ip6_tunnel: Adding support of mapping rules for MAP-E tunnel

2018-12-04 Thread David Miller

From: Felix Jia 
Date: Mon,  3 Dec 2018 16:39:31 +1300

> +int
> +ip6_get_addrport(struct iphdr *iph, __be32 *saddr4, __be32 *daddr4,
> +  __be16 *sport4, __be16 *dport4, __u8 *proto, int *icmperr)
> +{

This looks like something the flow dissector can do already, please look into
utilizing that common piece of infrastructure instead of reimplementing it.

> + u8 *ptr;
> + struct iphdr *icmpiph = NULL;
> + struct tcphdr *tcph, *icmptcph;
> + struct udphdr *udph, *icmpudph;
> + struct icmphdr *icmph, *icmpicmph;

Please always order local variables from longest to shortest line.

Please audit your entire submission for this problem.

> +static struct ip6_tnl_rule *ip6_tnl_rule_find(struct net_device *dev,
> +   __be32 _dst)
> +{
> + u32 dst = ntohl(_dst);
> + struct ip6_rule_list *pos = NULL;
> + struct ip6_tnl *t = netdev_priv(dev);
> +
> + list_for_each_entry(pos, &t->rules.list, list) {
> + int mask =
> + 0x ^ ((1 << (32 - pos->data.ipv4_prefixlen)) - 1);
> + if ((dst & mask) == ntohl(pos->data.ipv4_subnet.s_addr))
> + return &pos->data;
> + }
> + return NULL;
> +}

How will this scale with large numbers of rules?

This rule facility seems to be designed in a way that sophisticated
(at least as fast as "O(log N)") lookup schemes aren't even possible,
and that even worse the ordering matters.
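
The per-rule test in the quoted loop boils down to a prefix-mask comparison, repeated for every rule until one matches. Here is a hedged sketch of that single step, assuming the constant in the quoted hunk is an all-ones 32-bit value; the function names are hypothetical, and unlike the quoted code the subnet is masked as well.

```c
#include <stdint.h>

/* Network mask for an IPv4 prefix length, host byte order.
 * Lengths 0 and 32 are handled separately because shifting a 32-bit
 * value by 32 bits is undefined behaviour in C. */
static uint32_t sketch_prefix_mask(unsigned int prefixlen)
{
	if (prefixlen == 0)
		return 0;
	if (prefixlen >= 32)
		return 0xffffffffu;
	return 0xffffffffu ^ ((1u << (32 - prefixlen)) - 1);
}

/* Does 'dst' fall inside subnet/prefixlen? */
static int sketch_rule_matches(uint32_t dst, uint32_t subnet,
			       unsigned int prefixlen)
{
	uint32_t mask = sketch_prefix_mask(prefixlen);

	return (dst & mask) == (subnet & mask);
}
```

Walking a list and applying this test to each entry costs O(N) per packet, and with overlapping prefixes the first match depends on list order. That is the scaling concern raised above.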

Re: [PATCH net-next V2 0/2] net/sched: act_tunnel_key: support key-less tunnels

2018-12-04 Thread David Miller

From: Or Gerlitz 
Date: Sun,  2 Dec 2018 14:55:19 +0200

> This short series from Adi Nissim allows to support key-less tunnels
> by the tc tunnel key actions, which is needed for some GRE use-cases.
> 
> changes from V0:
>  - addresses build warning spotted by kbuild, make sure to always init
>to zero the tunnel key

Series applied to net-next, thank you.

Re: [PATCH 1/2] net: linkwatch: send change uevent on link changes

2018-12-04 Thread David Miller

From: Jouke Witteveen 
Date: Sat, 1 Dec 2018 17:00:21 +0100

> Make it easy for userspace to respond to acquisition/loss of carrier.
> The uevent is picked up by udev and, on systems with systemd, the
> device unit of the interface announces a configuration reload.
> 
> Signed-off-by: Jouke Witteveen 
> ---
> I did not want to change the commit message into a systemd-howto, but
> subscribing to udev events can be done through a line like
> ReloadPropagatedFrom=sys-subsystem-net-devices-%i.device
> in a systemd unit file.

I want to hear more about "why".

If we have the rtnetlink message that can be listened for, userspace
ought to use that.  That's what it is there for.

Re: [PATCH net] rtnetlink: ndo_dflt_fdb_dump() only work for ARPHRD_ETHER devices

2018-12-04 Thread David Miller

From: Eric Dumazet 
Date: Tue,  4 Dec 2018 09:40:35 -0800

> kmsan was able to trigger a kernel-infoleak using a gre device [1]
> 
> nlmsg_populate_fdb_fill() has a hard coded assumption
> that dev->addr_len is ETH_ALEN, as normally guaranteed
> for ARPHRD_ETHER devices.
> 
> A similar issue was fixed recently in commit da71577545a5
> ("rtnetlink: Disallow FDB configuration for non-Ethernet device")
 ...
> Fixes: d83b06036048 ("net: add fdb generic dump routine")
> Signed-off-by: Eric Dumazet 

Applied and queued up for -stable, thanks Eric.

Re: [PATCH net-next v2 0/3] net: bridge: convert multicast to generic rhashtable

2018-12-04 Thread David Miller

From: Nikolay Aleksandrov 
Date: Wed, 5 Dec 2018 01:45:16 +0200

> On a related note I saw Paul's call_rcu patches hit, so I'll wait for those
> to go in and will rebase on top of them before sending the v3 as the bridge
> change will have a conflict with this set.

They aren't going in via my tree, so I wouldn't wait for that before
you respin.

Re: [RFC bpf-next 1/7] bpf: interpreter support BPF_ALU | BPF_ARSH

2018-12-04 Thread David Miller

From: Alexei Starovoitov 
Date: Tue, 4 Dec 2018 12:16:04 -0800

> You already did :)

Amazing, I'll take the rest of the day off, thanks! :)

Re: [RFC bpf-next 1/7] bpf: interpreter support BPF_ALU | BPF_ARSH

2018-12-04 Thread David Miller

From: Jiong Wang 
Date: Tue, 4 Dec 2018 20:14:11 +

> On 04/12/2018 20:10, David Miller wrote:
>> From: Alexei Starovoitov 
>> Date: Tue, 4 Dec 2018 11:29:55 -0800
>>
>>> I guess sparc doesn't really have 32 subregisters.  All registers
>>> are considered 64-bit. It has 32-bit alu ops on 64-bit registers
>>> instead.
>> Right.
>>
>> Anyways, sparc will require two instructions because of this, the
>> 'sra' then a 'srl' by zero bits to clear the top 32-bits.
>>
>> I'll code up the sparc JIT part when this goes in.
> 
> Hmm, I had been going through all JIT backends, and saw there is
> do_alu32_trunc after jitting sra for BPF_ALU. That's what needed?

Yes, it clears the top 32-bits of a register after a 32-bit
ALU op because BPF's semantics require it.

In fact, we call it too much, we even call it for 32-bit
shift right instructions which automatically clear those
top bits.  I've been meaning to optimize that.

Meanwhile, again the answer to your question is yes.

Re: [RFC bpf-next 1/7] bpf: interpreter support BPF_ALU | BPF_ARSH

2018-12-04 Thread David Miller

From: Alexei Starovoitov 
Date: Tue, 4 Dec 2018 11:29:55 -0800

> I guess sparc doesn't really have 32 subregisters.  All registers
> are considered 64-bit. It has 32-bit alu ops on 64-bit registers
> instead.

Right.

Anyways, sparc will require two instructions because of this, the
'sra' then a 'srl' by zero bits to clear the top 32-bits.

I'll code up the sparc JIT part when this goes in.

Re: [PATCH net-next] net: netem: use a list in addition to rbtree

2018-12-04 Thread David Miller

From: Peter Oskolkov 
Date: Tue, 4 Dec 2018 11:10:55 -0800

> Thanks, Stephen!
> 
> I don't care much about braces either. David, do you want me to send a
> new patch with braces moved around?

Single statement basic blocks definitely must not have curly braces,
please remove them and repost.

Thank you.

RE: [PATCH net 1/2] net/mlx4_en: Change min MTU size to ETH_MIN_MTU

2018-12-04 Thread David Laight

From: Eric Dumazet
> Sent: 04 December 2018 17:04
> On 12/04/2018 08:59 AM, David Laight wrote:
> > From: Tariq Toukan
> >> Sent: 02 December 2018 12:35
> >> From: Eran Ben Elisha 
> >>
> >> NIC driver minimal MTU size shall be set to ETH_MIN_MTU, as defined in
> >> the RFC791 and in the network stack. Remove old mlx4_en only define for
> >> it, which was set to wrong value.
> > ...
> >>
> >> -  /* MTU range: 46 - hw-specific max */
> >> -  dev->min_mtu = MLX4_EN_MIN_MTU;
> >> +  /* MTU range: 68 - hw-specific max */
> >> +  dev->min_mtu = ETH_MIN_MTU;
> >>dev->max_mtu = priv->max_mtu;
> >
> > Where does 68 come from?
> 
> Min IPv4 MTU per RFC791

Maybe I'm just confused and these are the ranges that the 'maximum mtu'
can be set to.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

RE: [PATCH net 1/2] net/mlx4_en: Change min MTU size to ETH_MIN_MTU

2018-12-04 Thread David Laight

From: Eric Dumazet
> Sent: 04 December 2018 17:04
> 
> On 12/04/2018 08:59 AM, David Laight wrote:
> > From: Tariq Toukan
> >> Sent: 02 December 2018 12:35
> >> From: Eran Ben Elisha 
> >>
> >> NIC driver minimal MTU size shall be set to ETH_MIN_MTU, as defined in
> >> the RFC791 and in the network stack. Remove old mlx4_en only define for
> >> it, which was set to wrong value.
> > ...
> >>
> >> -  /* MTU range: 46 - hw-specific max */
> >> -  dev->min_mtu = MLX4_EN_MIN_MTU;
> >> +  /* MTU range: 68 - hw-specific max */
> >> +  dev->min_mtu = ETH_MIN_MTU;
> >>dev->max_mtu = priv->max_mtu;
> >
> > Where does 68 come from?
> 
> Min IPv4 MTU per RFC791

Which has nothing to do with an ethernet driver.
Indeed, IIRC, it is the smallest maximum frame size that IPv4
can work over.

David

RE: [PATCH net 1/2] net/mlx4_en: Change min MTU size to ETH_MIN_MTU

2018-12-04 Thread David Laight

From: Tariq Toukan
> Sent: 02 December 2018 12:35
> From: Eran Ben Elisha 
> 
> NIC driver minimal MTU size shall be set to ETH_MIN_MTU, as defined in
> the RFC791 and in the network stack. Remove old mlx4_en only define for
> it, which was set to wrong value.
...
> 
> - /* MTU range: 46 - hw-specific max */
> - dev->min_mtu = MLX4_EN_MIN_MTU;
> + /* MTU range: 68 - hw-specific max */
> + dev->min_mtu = ETH_MIN_MTU;
>   dev->max_mtu = priv->max_mtu;

Where does 68 come from?
The minimum size of an ethernet packet including the mac addresses
and CRC is 64 bytes - but that would never be an 'mtu'.

Since 64 - 46 = 18, the 46 probably excludes both MAC addresses,
the ethertype/length and the CRC.
This is 'sort of' the minimum mtu for an ethernet frame.

I'm not sure which values are supposed to be in dev->min/max_mtu.

David
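
The arithmetic in this message, and the origin of the value 68 asked about, can be spelled out. This is a sketch with hypothetical constant names; the numbers are the usual Ethernet field sizes and the RFC 791 requirement that a 68-octet datagram (a 60-octet maximum header plus an 8-octet minimum fragment) be forwardable without further fragmentation.

```c
enum {
	SKETCH_ETH_ALEN      = 6,	/* one MAC address */
	SKETCH_ETH_TYPE_LEN  = 2,	/* ethertype/length field */
	SKETCH_ETH_FCS_LEN   = 4,	/* CRC */
	SKETCH_ETH_MIN_FRAME = 64,	/* minimum frame on the wire */
	SKETCH_IPV4_MAX_HDR  = 60,	/* largest IPv4 header */
	SKETCH_IPV4_MIN_FRAG = 8,	/* smallest fragment payload */
};

/* Both MAC addresses, the ethertype and the CRC: 6 + 6 + 2 + 4. */
static int sketch_eth_overhead(void)
{
	return 2 * SKETCH_ETH_ALEN + SKETCH_ETH_TYPE_LEN + SKETCH_ETH_FCS_LEN;
}

/* Smallest payload that fills a minimum-size frame. */
static int sketch_min_payload(void)
{
	return SKETCH_ETH_MIN_FRAME - sketch_eth_overhead();
}

/* Smallest MTU an IPv4 link has to support. */
static int sketch_ipv4_min_mtu(void)
{
	return SKETCH_IPV4_MAX_HDR + SKETCH_IPV4_MIN_FRAG;
}
```

So 46 is a property of the Ethernet framing, while 68 is a property of IPv4.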

Re: [PATCH net-next 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

2018-12-04 Thread David Miller

From: Paolo Abeni 
Date: Tue, 04 Dec 2018 12:27:51 +0100

> On Mon, 2018-12-03 at 10:04 -0800, Eric Dumazet wrote:
>> On 12/03/2018 03:40 AM, Paolo Abeni wrote:
>> > This header define a bunch of helpers that allow avoiding the
>> > retpoline overhead when calling builtin functions via function pointers.
>> > It boils down to explicitly comparing the function pointers to
>> > known builtin functions and eventually invoke directly the latter.
>> > 
>> > The macros defined here implement the boilerplate for the above schema
>> > and will be used by the next patches.
>> > 
>> > rfc -> v1:
>> >  - use branch prediction hint, as suggested by Eric
>> > 
>> > Suggested-by: Eric Dumazet 
>> > Signed-off-by: Paolo Abeni 
>> > ---
>> >  include/linux/indirect_call_wrapper.h | 77 +++
>> >  1 file changed, 77 insertions(+)
>> >  create mode 100644 include/linux/indirect_call_wrapper.h
>> 
>> This needs to be discussed more broadly, please include lkml 
> 
> Agreed. @David: please let me know if you prefer a repost or a v2 with
> the expanded recipients list.

v2 probably works better and will help me better keep track of things.

Thanks for asking.

Re: [RFC bpf-next 1/7] bpf: interpreter support BPF_ALU | BPF_ARSH

2018-12-04 Thread David Miller

From: Jiong Wang 
Date: Tue,  4 Dec 2018 04:56:29 -0500

> This patch implements interpreting BPF_ALU | BPF_ARSH. Do arithmetic right
> shift on low 32-bit sub-register, and zero the high 32 bits.
> 
> Reviewed-by: Jakub Kicinski 
> Signed-off-by: Jiong Wang 

I just want to say that this behavior is interesting because on most
cpus that have a 32-bit and 64-bit variant, the 32-bit arithmetic
right shift typically sign extends to 64-bit rather than zero extends
which is what is being defined here.

Well, definitely, sparc64 behaves this way.
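
The difference between the two behaviours can be shown side by side. This is a user-space sketch with hypothetical function names; it assumes a compiler on which right-shifting a negative signed value is an arithmetic shift (true for gcc and clang), and the "& 31" masking is added here only to keep the shift count in range.

```c
#include <stdint.h>

/* 32-bit arithmetic right shift as described in the patch: operate on
 * the low 32 bits, then zero the upper 32 bits of the result. */
static uint64_t sketch_alu32_arsh_zext(uint64_t dst, uint32_t shift)
{
	int32_t lo = (int32_t)(uint32_t)dst;

	return (uint64_t)(uint32_t)(lo >> (shift & 31));
}

/* For contrast: the same shift with the 32-bit result sign-extended
 * to 64 bits. */
static uint64_t sketch_alu32_arsh_sext(uint64_t dst, uint32_t shift)
{
	int32_t lo = (int32_t)(uint32_t)dst;

	return (uint64_t)(int64_t)(lo >> (shift & 31));
}
```

For an input whose low word is 0x80000000 and a shift of 4, the first variant yields 0x00000000f8000000 and the second 0xfffffffff8000000.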

Re: [PATCH net-next 0/4] mlxsw: Add one-armed router support

2018-12-04 Thread David Miller

From: Ido Schimmel 
Date: Tue, 4 Dec 2018 08:15:09 +

> Up until now, when a packet was routed by the ASIC through the same
> router interface (RIF) from which it ingressed from, the ASIC passed the
> sole copy of the packet to the kernel. This allowed the kernel to route
> the packet and also potentially generate an ICMP redirect.
> 
> There are scenarios (e.g., "one-armed router") where packets are
> intentionally routed this way and are therefore not deemed as
> exceptions. In such scenarios the current method of trapping packets to
> the CPU is problematic, as it results in major packet loss.
> 
> This patchset solves the problem by having the ASIC forward the packet,
> but also send a copy to the CPU, which gives the kernel the opportunity
> to generate required exceptions.
> 
> To prevent the kernel from forwarding such packets again, the driver
> marks them with 'offload_l3_fwd_mark', which causes the kernel to
> consume them in ip{,6}_forward_finish().
> 
> Patch #1 renames 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark'. When
> set, the field indicates that a packet was already forwarded in L3
> (unicast / multicast) by a capable device.
> 
> Patch #2 teaches the kernel to consume unicast packets that have
> 'offload_l3_fwd_mark' set.
> 
> Patch #3 changes mlxsw to mirror loopbacked (iRIF == eRIF) packets,
> instead of trapping them.
> 
> Patch #4 adds a test case for above mentioned scenario.

Series applied, thank you.

Re: consistency for statistics with XDP mode

2018-12-03 Thread David Miller

From: David Ahern 
Date: Mon, 3 Dec 2018 17:15:03 -0700

> So, instead of a program tag which the program writer controls, how
> about some config knob that an admin controls that says at attach time
> use standard stats?

How about, instead of replacing it is in addition to, and admin can
override?

I'm all for choice so how can I object? :)

Re: [PATCH net-next 0/2] mlx4_core cleanups

2018-12-03 Thread David Miller

From: Tariq Toukan 
Date: Sun,  2 Dec 2018 17:40:24 +0200

> This patchset by Erez contains cleanups to the mlx4_core driver.
> 
> Patch 1 replaces -EINVAL with -EOPNOTSUPP for unsupported operations.
> Patch 2 fixes some coding style issues.
> 
> Series generated against net-next commit:
> 97e6c858a26e net: usb: aqc111: Initialize wol_cfg with memset in 
> aqc111_suspend

Series applied, thanks.

Re: [PATCH net-next v2] net: phy: Also request modules for C45 IDs

2018-12-03 Thread David Miller

From: Jose Abreu 
Date: Sun,  2 Dec 2018 16:33:14 +0100

> Logic of phy_device_create() requests PHY modules according to PHY ID
> but for C45 PHYs we use different field for the IDs.
> 
> Let's also request the modules for these IDs.
> 
> Changes from v1:
> - Only request C22 modules if C45 are not present (Andrew)
> 
> Signed-off-by: Jose Abreu 

Applied, thanks Jose.

Florian, just for the record, I actually like the changelogs to be in
the commit messages.  It can help people understand that something
was deliberately implemented a certain way and alternative approaches
were considered.

Re: [PATCH net-next v2 00/14] octeontx2-af: NIX and NPC enhancements

2018-12-03 Thread David Miller

From: Jerin Jacob 
Date: Sun,  2 Dec 2018 18:17:35 +0530

> This patchset is a continuation to earlier submitted four patch
> series to add a new driver for Marvell's OcteonTX2 SOC's
> Resource virtualization unit (RVU) admin function driver.
> 
> 1. octeontx2-af: Add RVU Admin Function driver
>https://www.spinics.net/lists/netdev/msg528272.html
> 2. octeontx2-af: NPA and NIX blocks initialization
>https://www.spinics.net/lists/netdev/msg529163.html
> 3. octeontx2-af: NPC parser and NIX blocks initialization
>https://www.spinics.net/lists/netdev/msg530252.html
> 4. octeontx2-af: NPC MCAM support and FLR handling
>https://www.spinics.net/lists/netdev/msg534392.html
> 
> This patch series adds support for below
> 
> NPC block:
> - Add NPC(mkex) profile support for various Key extraction configurations
> 
> NIX block:
> - Enable dynamic RSS flow key algorithm configuration
> - Enhancements on Rx checksum and error checks
> - Add support for Tx packet marking support
> - TL1 schedule queue allocation enhancements
> - Add LSO format configuration mbox
> - VLAN TPID configuration
> - Skip multicast entry init for broadcast tables
 ...

Series applied, thanks.

Re: [PATCH net 0/2] mlx4 fixes for 4.20-rc

2018-12-03 Thread David Miller

From: Tariq Toukan 
Date: Sun,  2 Dec 2018 14:34:35 +0200

> This patchset includes small fixes for the mlx4_en driver.
> 
> First patch by Eran fixes the value used to init the netdevice's
> min_mtu field.
> Please queue it to -stable >= v4.10.
> 
> Second patch by Saeed adds missing Kconfig build dependencies.
> 
> Series generated against net commit:
> 35b827b6d061 tun: forbid iface creation with rtnl ops

Series applied and patch #1 queued up for -stable, thanks.

Re: consistency for statistics with XDP mode

2018-12-03 Thread David Ahern

On 12/3/18 5:00 PM, David Miller wrote:
> From: Toke Høiland-Jørgensen 
> Date: Mon, 03 Dec 2018 22:00:32 +0200
> 
>> I wonder if it would be possible to support both the "give me user
>> normal stats" case and the "let me do whatever I want" case by a
>> combination of userspace tooling and maybe a helper or two?
>>
>> I.e., create a "do_stats()" helper (please pick a better name), which
>> will either just increment the desired counters, or set a flag so the
>> driver can do it at napi poll exit. With this, the userspace tooling
>> could have a "--give-me-normal-stats" switch (or some other interface),
>> which would inject a call instruction to that helper at the start of the
>> program.
>>
>> This would enable the normal counters in a relatively painless way,
>> while still letting people opt out if they don't want to pay the cost in
>> terms of overhead. And having the userspace tooling inject the helper
>> call helps support the case where the admin didn't write the XDP
>> programs being loaded.
>>
>> Any reason why that wouldn't work?
> 
> I think this is a good idea, or even an attribute tag that gets added
> to the XDP program that controls stats handling.
> 

My argument is that the ebpf program writer should *not* get that
choice; the admin of the box should. Program writers make mistakes. Box
admins / customer support are the ones that have to deal with those
mistakes. Program writers - especially for xdp - are going to be focused
on benchmarks; admins are focused on the big picture and should be given
the option of trading a small amount of performance for simpler management.

So, instead of a program tag which the program writer controls, how
about some config knob that an admin controls that says at attach time
use standard stats?

Re: [iproute2-next PATCH v6] tc: flower: Classify packets based port ranges

2018-12-03 Thread David Ahern

On 12/3/18 4:58 PM, Nambiar, Amritha wrote:
> A previous version v3 of this patch was already applied to iproute2-next.
> https://patchwork.ozlabs.org/patch/998644/
> 
> I think that needs to be reverted for this v6 to apply clean.

ugh. That's embarrassing. Looks like I inadvertently pushed the older
one. Reverted and applied. Thanks,

Re: [PATCH net] macvlan: return correct error value

2018-12-03 Thread David Miller

From: Matteo Croce 
Date: Sat,  1 Dec 2018 00:26:27 +0100

> A MAC address must be unique among all the macvlan devices with the same
> lower device. The only exception is the passthru [sic] mode,
> which shares the lower device address.
> 
> When duplicate addresses are detected, EBUSY is returned when bringing
> the interface up:
> 
> # ip link add macvlan0 link eth0 type macvlan
> # read addr
> # ip link set macvlan0 address $addr
> # ip link set macvlan0 up
> RTNETLINK answers: Device or resource busy
> 
> Use correct error code which is EADDRINUSE, and do the check also
> earlier, on address change:
> 
> # ip link set macvlan0 address $addr
> RTNETLINK answers: Address already in use
> 
> Signed-off-by: Matteo Croce 

Applied, thanks Matteo.
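
The two RTNETLINK messages quoted in the commit message are the textual forms of the two error codes. A small sketch, assuming a libc whose strerror() strings match the quoted output (glibc is expected to; other C libraries may word them differently); the function names are hypothetical.

```c
#include <errno.h>
#include <string.h>

/* Message for the error returned before the patch, at interface up. */
static const char *sketch_old_error(void)
{
	return strerror(EBUSY);
}

/* Message for the error returned after the patch, already at the time
 * the address is changed. */
static const char *sketch_new_error(void)
{
	return strerror(EADDRINUSE);
}
```

Printing both strings on a given system shows what a user of the ip tool would see before and after the change.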

Re: consistency for statistics with XDP mode

2018-12-03 Thread David Miller

From: Toke Høiland-Jørgensen 
Date: Mon, 03 Dec 2018 22:00:32 +0200

> I wonder if it would be possible to support both the "give me user
> normal stats" case and the "let me do whatever I want" case by a
> combination of userspace tooling and maybe a helper or two?
> 
> I.e., create a "do_stats()" helper (please pick a better name), which
> will either just increment the desired counters, or set a flag so the
> driver can do it at napi poll exit. With this, the userspace tooling
> could have a "--give-me-normal-stats" switch (or some other interface),
> which would inject a call instruction to that helper at the start of the
> program.
> 
> This would enable the normal counters in a relatively painless way,
> while still letting people opt out if they don't want to pay the cost in
> terms of overhead. And having the userspace tooling inject the helper
> call helps support the case where the admin didn't write the XDP
> programs being loaded.
> 
> Any reason why that wouldn't work?

I think this is a good idea, or even an attribute tag that gets added
to the XDP program that controls stats handling.

Re: [PATCH net-next v4 0/3] udp msg_zerocopy

2018-12-03 Thread David Miller

From: Willem de Bruijn 
Date: Fri, 30 Nov 2018 15:32:38 -0500

> Enable MSG_ZEROCOPY for udp sockets

Series applied, thanks for keeping up with this.

Re: [PATCH 0/3] net: macb: DMA race condition fixes

2018-12-03 Thread David Miller

From: 
Date: Mon, 3 Dec 2018 08:26:52 +

> Can you please delay a bit the acceptance of this series, I would like 
> that we assess these findings with tests on our hardware before applying 
> them.

Sure.

Re: [PATCH net] sctp: kfree_rcu asoc

2018-12-03 Thread David Miller

From: Xin Long 
Date: Sat,  1 Dec 2018 01:36:59 +0800

> In sctp_hash_transport/sctp_epaddr_lookup_transport, it dereferences
> a transport's asoc under rcu_read_lock while asoc is freed not after
> a grace period, which leads to a use-after-free panic.
> 
> This patch fixes it by calling kfree_rcu to make asoc be freed after
> a grace period.
> 
> Note that only the asoc's memory is delayed to free in the patch, it
> won't cause sk to linger longer.
> 
> Thanks Neil and Marcelo to make this clear.
> 
> Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport 
> rhashtable")
> Fixes: cd2b70875058 ("sctp: check duplicate node before inserting a new 
> transport")
> Reported-by: syzbot+0b05d8aa7cb185107...@syzkaller.appspotmail.com
> Reported-by: syzbot+aad231d51b1923158...@syzkaller.appspotmail.com
> Suggested-by: Neil Horman 
> Signed-off-by: Xin Long 

Applied and queued up for -stable, thanks.

Re: [PATCH net] net/ibmvnic: Fix RTNL deadlock during device reset

2018-12-03 Thread David Miller

From: Thomas Falcon 
Date: Fri, 30 Nov 2018 10:59:08 -0600

> Commit a5681e20b541 ("net/ibmnvic: Fix deadlock problem
> in reset") made the change to hold the RTNL lock during
> driver reset but still calls netdev_notify_peers, which
> results in a deadlock. Instead, use call_netdevice_notifiers,
> which is functionally the same except that it does not
> take the RTNL lock again.
> 
> Fixes: a5681e20b541 ("net/ibmnvic: Fix deadlock problem in reset")
> 
> Signed-off-by: Thomas Falcon 

Applied.

Re: [PATCH net] rtnetlink: Refine sanity checks in rtnl_fdb_{add|del}

2018-12-03 Thread David Miller

From: Ido Schimmel 
Date: Fri, 30 Nov 2018 19:00:24 +0200

> Yes, agree. Patch is good. I'll tag your v2.

This means, I assume, that a new version of this fix is coming.

Eric, is this correct?
