[PATCH] tipc: fix a potential missing-check bug

2018-04-30 Thread Wenwen Wang
In tipc_link_xmit(), when imp equals TIPC_SYSTEM_IMPORTANCE, the member
field "len" of l->backlog[imp] must be less than the member field "limit"
of l->backlog[imp]; otherwise, the error code -ENOBUFS is returned. This
is enforced by a security check. However, at the end of tipc_link_xmit(),
the length of "list" is added to l->backlog[imp].len without any further
check, which can produce unexpected values of l->backlog[imp].len. Even
if imp equals TIPC_SYSTEM_IMPORTANCE and the original value of
l->backlog[imp].len is less than l->backlog[imp].limit, after this
addition l->backlog[imp].len could be larger than l->backlog[imp].limit.
That means the security check can be bypassed, especially when an
adversary controls the length of "list".

This patch repeats the check after the modification of
l->backlog[imp].len (when imp is TIPC_SYSTEM_IMPORTANCE) to avoid such
security issues. An error code is returned if an unexpected value of
l->backlog[imp].len is produced.

Signed-off-by: Wenwen Wang 
---
 net/tipc/link.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index 695acb7..62972fa 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -948,6 +948,11 @@ int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list,
continue;
}
l->backlog[imp].len += skb_queue_len(list);
+   if (imp == TIPC_SYSTEM_IMPORTANCE &&
+   l->backlog[imp].len >= l->backlog[imp].limit) {
+   pr_warn("%s<%s>, link overflow", link_rst_msg, l->name);
+   return -ENOBUFS;
+   }
skb_queue_splice_tail_init(list, backlogq);
}
l->snd_nxt = seqno;
-- 
2.7.4



Re: [PATCH] ipv6: Allow non-gateway ECMP for IPv6

2018-04-30 Thread David Ahern
On 4/30/18 3:15 PM, Thomas Winter wrote:
> It is valid to have static routes where the nexthop
> is an interface not an address such as tunnels.
> For IPv4 it was possible to use ECMP on these routes
> but not for IPv6.
> 
> Signed-off-by: Thomas Winter 
> Cc: David Ahern 
> Cc: "David S. Miller" 
> Cc: Alexey Kuznetsov 
> Cc: Hideaki YOSHIFUJI 
> ---
>  include/net/ip6_route.h | 3 +--
>  net/ipv6/ip6_fib.c  | 3 ---
>  2 files changed, 1 insertion(+), 5 deletions(-)
> 

Interesting. Existing code inserts the dev nexthop as a separate route.

Change looks good to me.

Acked-by: David Ahern 


Re: [PATCH net-next 0/4] net/smc: fixes 2018/04/30

2018-04-30 Thread David Miller
From: Ursula Braun 
Date: Mon, 30 Apr 2018 16:51:15 +0200

> From: Ursula Braun 
> 
> Dave,
> 
> here are 4 smc patches for net-next covering different areas:
>* link health check
>* diagnostics for IPv6 smc sockets
>* ioctl
>* improvement for vlan determination

You say "fixes" in your Subject line, but adding ipv6 smc
socket diag support is a feature, not a fix.

Actually, generally speaking your patch submissions are
confusing.

You do submit really pure bug fixes, but for some odd
reason you target the net-next tree instead of net.

Maybe you have a good reason for doing this and you can
explain it to me?


Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit

2018-04-30 Thread Toshiaki Makita
On 2018/05/01 2:27, Jesper Dangaard Brouer wrote:
> On Thu, 26 Apr 2018 19:52:40 +0900
> Toshiaki Makita  wrote:
> 
>> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:
>>> On Tue, 24 Apr 2018 23:39:20 +0900
>>> Toshiaki Makita  wrote:
>>>   
 +static int veth_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
 +{
 +  struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 +  int headroom = frame->data - (void *)frame;
 +  struct net_device *rcv;
 +  int err = 0;
 +
 +  rcv = rcu_dereference(priv->peer);
 +  if (unlikely(!rcv))
 +  return -ENXIO;
 +
 +  rcv_priv = netdev_priv(rcv);
 +  /* xdp_ring is initialized on receive side? */
 +  if (rcu_access_pointer(rcv_priv->xdp_prog)) {
 +  err = xdp_ok_fwd_dev(rcv, frame->len);
 +  if (unlikely(err))
 +  return err;
 +
 +  err = veth_xdp_enqueue(rcv_priv, veth_xdp_to_ptr(frame));
 +  } else {
 +  struct sk_buff *skb;
 +
 +  skb = veth_build_skb(frame, headroom, frame->len, 0);
 +  if (unlikely(!skb))
 +  return -ENOMEM;
 +
 +  /* Get page ref in case skb is dropped in netif_rx.
 +   * The caller is responsible for freeing the page on error.
 +   */
 +  get_page(virt_to_page(frame->data));  
>>>
>>> I'm not sure you can make this assumption, that xdp_frames coming from
>>> another device driver uses a refcnt based memory model. But maybe I'm
>>> confused, as this looks like an SKB receive path, but in the
>>> ndo_xdp_xmit().  
>>
>> I find this path similar to cpumap, which creates skb from redirected
>> xdp frame. Once it is converted to skb, skb head is freed by
>> page_frag_free, so anyway I needed to get the refcount here regardless
>> of memory model.
> 
> Yes I know, I wrote cpumap ;-)
> 
> First of all, I don't want to see such xdp_frame to SKB conversion code
> in every driver, because that increases the chance of errors.  And
> when looking at the details, it seems that you have made the
> mistake of making it possible to leak xdp_frame info to the SKB (which
> cpumap takes into account).

Do you mean that leaving xdp_frame in skb->head is leaking something? How?

> 
> Second, I think the refcnt scheme here is wrong. The xdp_frame should
> be "owned" by XDP and have the proper refcnt to deliver it directly to
> the network stack.
> 
> Third, if we choose that we want a fallback, in-case XDP is not enabled
> on egress dev (but it have an ndo_xdp_xmit), then it should be placed
> in the generic/core code.  E.g. __bpf_tx_xdp_map() could look at the
> return code from dev->netdev_ops->ndo_xdp() and create an SKB.  (Hint,
> this would make it easy to implement TX bulking towards the dev).

Right, this is a much cleaner way.
Although I feel like we should add this fallback for veth because it
requires something which is different from other drivers (enabling XDP
on the peer device of the egress device), I'll drop the part for now. It
should not be resolved in the driver code.

-- 
Toshiaki Makita



[PATCH net-next] udp: Complement partial checksum for GSO packet

2018-04-30 Thread Sean Tranchetti
Using the udp_v4_check() function to calculate the pseudo-header
checksum for the newly segmented UDP packets results in assigning the
complement of the value to the UDP header checksum field.

Always undo the complement of the partial checksum value in order to
match the case where GSO is not used on the UDP transmit path.

Fixes: ee80d1ebe5ba ("udp: add udp gso")
Signed-off-by: Sean Tranchetti 
Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 net/ipv4/udp_offload.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index f78fb36..0062570 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -223,6 +223,7 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
csum_replace2(&uh->check, htons(mss),
  htons(seg->len - hdrlen - sizeof(*uh)));
 
+   uh->check = ~uh->check;
seg->destructor = sock_wfree;
seg->sk = sk;
sum_truesize += seg->truesize;
-- 
1.9.1



Re: [RFC net-next 0/5] Support for PHY test modes

2018-04-30 Thread Andrew Lunn
> Turning these tests on will typically result in the link partner
> dropping the link with us, and the interface will be non-functional as
> far as the data path is concerned (similar to an isolation mode). This
> might warrant properly reporting that to user-space through e.g: a
> private IFF_* value maybe?

Hi Florian

I think an IFF_* value would be a good idea. We want to give the user
some indication of why they don't have working networking; ip link show
showing PHY-TEST-MODE would help.

Andrew


Re: [RFC net-next 4/5] net: phy: Add support for IEEE standard test modes

2018-04-30 Thread Andrew Lunn
> +/* genphy_set_test - Make a PHY enter one of the standard IEEE defined
> + * test modes
> + * @phydev: the PHY device instance
> + * @test: the desired test mode
> + * @data: test specific data (none)
> + *
> + * This function makes the designated @phydev enter the desired standard
> + * 100BaseT2 or 1000BaseT test mode as defined in IEEE 802.3-2012 section TWO
> + * and THREE under 32.6.1.2.1 and 40.6.1.1.2 respectively
> + */
> +int genphy_set_test(struct phy_device *phydev,
> + struct ethtool_phy_test *test, const u8 *data)
> +{
> + u16 shift, base, bmcr = 0;
> + int ret;
> +
> + /* Exit test mode */
> + if (test->mode == PHY_STD_TEST_MODE_NORMAL) {
> + ret = phy_read(phydev, MII_CTRL1000);
> + if (ret < 0)
> + return ret;
> +
> + ret &= ~GENMASK(15, 13);
> +
> + return phy_write(phydev, MII_CTRL1000, ret);
> + }

Hi Florian

I looked at the Marvell SDK for PHYs. It performs a soft reset after
switching back to normal mode. I assume the Broadcom PHY does not need
this? But maybe we can add it anyway?

> +
> + switch (test->mode) {
> + case PHY_STD_TEST_MODE_100BASET2_1:
> + case PHY_STD_TEST_MODE_100BASET2_2:
> + case PHY_STD_TEST_MODE_100BASET2_3:
> + if (!(phydev->supported & PHY_100BT_FEATURES))
> + return -EOPNOTSUPP;
> +
> + shift = 14;
> + base = test->mode - PHY_STD_TEST_MODE_NORMAL;
> + bmcr = BMCR_SPEED100;
> + break;
> +
> + case PHY_STD_TEST_MODE_1000BASET_1:
> + case PHY_STD_TEST_MODE_1000BASET_2:
> + case PHY_STD_TEST_MODE_1000BASET_3:
> + case PHY_STD_TEST_MODE_1000BASET_4:
> + if (!(phydev->supported & PHY_1000BT_FEATURES))
> + return -EOPNOTSUPP;
> +
> + shift = 13;
> + base = test->mode - PHY_STD_TEST_MODE_100BASET2_MAX;
> + bmcr = BMCR_SPEED1000;
> + break;
> +
> + default:
> + /* Let an upper driver deal with additional modes it may
> +  * support
> +  */
> + return -EOPNOTSUPP;
> + }
> +
> + /* Force speed and duplex */
> + ret = phy_write(phydev, MII_BMCR, bmcr | BMCR_FULLDPLX);
> + if (ret < 0)
> + return ret;

Should there be something to undo this when returning to normal mode?

   Andrew


Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc

2018-04-30 Thread Cong Wang
On Mon, Apr 30, 2018 at 2:27 PM, Dave Taht  wrote:
> On Mon, Apr 30, 2018 at 2:21 PM, Cong Wang  wrote:
>> On Sun, Apr 29, 2018 at 2:34 PM, Toke Høiland-Jørgensen  wrote:
>>> sch_cake targets the home router use case and is intended to squeeze the
>>> most bandwidth and latency out of even the slowest ISP links and routers,
>>> while presenting an API simple enough that even an ISP can configure it.
>>>
>>> Example of use on a cable ISP uplink:
>>>
>>> tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
>>>
>>> To shape a cable download link (ifb and tc-mirred setup elided)
>>>
>>> tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash
>>>
>>> CAKE is filled with:
>>>
>>> * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
>>>   derived Flow Queuing system, which autoconfigures based on the bandwidth.
>>> * A novel "triple-isolate" mode (the default) which balances per-host
>>>   and per-flow FQ even through NAT.
>>> * A deficit-based shaper that can also be used in an unlimited mode.
>>> * 8 way set associative hashing to reduce flow collisions to a minimum.
>>> * A reasonable interpretation of various diffserv latency/loss tradeoffs.
>>> * Support for zeroing diffserv markings for entering and exiting traffic.
>>> * Support for interacting well with Docsis 3.0 shaper framing.
>>> * Extensive support for DSL framing types.
>>> * Support for ack filtering.
>>
>> Why this TCP ACK filtering has to be built into CAKE qdisc rather than
>> an independent TC filter? Why other qdisc's can't use it?
>
> I actually have a tc-bpf based ack filter, from the development of
> cake's ack-thinner, that I should submit one of these days. It
> proved to be of limited use.

Yeah.

>
> Probably the biggest mistake we made is by calling this cake feature a
> filter. It isn't.


It inspects the payload of each packet and drops packets; therefore
it is a filter by definition, no matter what you name it.

>
> Maybe we should have called it a "thinner" or something like that? In
> order to properly "thin" or "reduce" an ack stream
> you have to have a queue to look at and some related state. TC filters
> do not operate on queues, qdiscs do. Thus the "ack-filter" here is
> deeply embedded into cake's flow isolation and queue structures.


TC filters are installed on qdiscs, and in the beginning qdiscs were
queues, for example pfifo. We already have flow-based filters too
(cls_flower), so we can make them work together, although it is
probably not straightforward.


Re: [PATCH net-next 2/4] net/smc: ipv6 support for smc_diag.c

2018-04-30 Thread kbuild test robot
Hi Karsten,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Ursula-Braun/net-smc-periodic-testlink-support/20180501-045940
config: x86_64-randconfig-x016-201817 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/sock_diag.h:8:0,
from net/smc/smc_diag.c:15:
   net/smc/smc_diag.c: In function 'smc_diag_msg_common_fill':
>> include/net/sock.h:350:37: error: 'struct sock_common' has no member named 
>> 'skc_v6_rcv_saddr'; did you mean 'skc_rcv_saddr'?
#define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
^
>> net/smc/smc_diag.c:49:47: note: in expansion of macro 'sk_v6_rcv_saddr'
  memcpy(>id.idiag_src, >clcsock->sk->sk_v6_rcv_saddr,
  ^~~
>> include/net/sock.h:350:37: error: 'struct sock_common' has no member named 
>> 'skc_v6_rcv_saddr'; did you mean 'skc_rcv_saddr'?
#define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
^
   net/smc/smc_diag.c:50:35: note: in expansion of macro 'sk_v6_rcv_saddr'
 sizeof(smc->clcsock->sk->sk_v6_rcv_saddr));
  ^~~
>> include/net/sock.h:349:34: error: 'struct sock_common' has no member named 
>> 'skc_v6_daddr'; did you mean 'skc_daddr'?
#define sk_v6_daddr  __sk_common.skc_v6_daddr
 ^
>> net/smc/smc_diag.c:51:47: note: in expansion of macro 'sk_v6_daddr'
  memcpy(>id.idiag_dst, >clcsock->sk->sk_v6_daddr,
  ^~~
>> include/net/sock.h:349:34: error: 'struct sock_common' has no member named 
>> 'skc_v6_daddr'; did you mean 'skc_daddr'?
#define sk_v6_daddr  __sk_common.skc_v6_daddr
 ^
   net/smc/smc_diag.c:52:35: note: in expansion of macro 'sk_v6_daddr'
 sizeof(smc->clcsock->sk->sk_v6_daddr));
  ^~~
--
   In file included from include/linux/sock_diag.h:8:0,
from net//smc/smc_diag.c:15:
   net//smc/smc_diag.c: In function 'smc_diag_msg_common_fill':
>> include/net/sock.h:350:37: error: 'struct sock_common' has no member named 
>> 'skc_v6_rcv_saddr'; did you mean 'skc_rcv_saddr'?
#define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
^
   net//smc/smc_diag.c:49:47: note: in expansion of macro 'sk_v6_rcv_saddr'
  memcpy(>id.idiag_src, >clcsock->sk->sk_v6_rcv_saddr,
  ^~~
>> include/net/sock.h:350:37: error: 'struct sock_common' has no member named 
>> 'skc_v6_rcv_saddr'; did you mean 'skc_rcv_saddr'?
#define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
^
   net//smc/smc_diag.c:50:35: note: in expansion of macro 'sk_v6_rcv_saddr'
 sizeof(smc->clcsock->sk->sk_v6_rcv_saddr));
  ^~~
>> include/net/sock.h:349:34: error: 'struct sock_common' has no member named 
>> 'skc_v6_daddr'; did you mean 'skc_daddr'?
#define sk_v6_daddr  __sk_common.skc_v6_daddr
 ^
   net//smc/smc_diag.c:51:47: note: in expansion of macro 'sk_v6_daddr'
  memcpy(>id.idiag_dst, >clcsock->sk->sk_v6_daddr,
  ^~~
>> include/net/sock.h:349:34: error: 'struct sock_common' has no member named 
>> 'skc_v6_daddr'; did you mean 'skc_daddr'?
#define sk_v6_daddr  __sk_common.skc_v6_daddr
 ^
   net//smc/smc_diag.c:52:35: note: in expansion of macro 'sk_v6_daddr'
 sizeof(smc->clcsock->sk->sk_v6_daddr));
  ^~~

vim +/sk_v6_rcv_saddr +49 net/smc/smc_diag.c

  > 15  #include 
16  #include 
17  #include 
18  #include 
19  #include 
20  
21  #include "smc.h"
22  #include "smc_core.h"
23  
24  static void smc_gid_be16_convert(__u8 *buf, u8 *gid_raw)
25  {
26  sprintf(buf, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x",
27  be16_to_cpu(((__be16 *)gid_raw)[0]),
28  be16_to_cpu(((__be16 *)gid_raw)[1]),
29  be16_to_cpu(((__be16 *)gid_raw)[2]),
30  be16_to_cpu(((__be16 *)gid_raw)[3]),
31  be16_to_cpu(((__be16 *)gid_raw)[4]),
32  be16_to_cpu(((__be16 *)gid_raw)[5]),
33  be16_to_cpu(((__be16 *)gid_raw)[6]),
34  be16_to_cpu(((__be16 *)gid_raw)[7]));
35  }
36  
37  static void smc_diag_msg_common_fill(struct 

[PATCH bpf-next 2/3] bpf: centre subprog information fields

2018-04-30 Thread Jiong Wang
It is better to centre all subprog information fields in one structure.
This structure could later serve as a function node in a call graph.

Signed-off-by: Jiong Wang 
---
 include/linux/bpf_verifier.h |  9 ---
 kernel/bpf/verifier.c| 62 +++-
 2 files changed, 38 insertions(+), 33 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index f655b92..8f70dc1 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -173,6 +173,11 @@ static inline bool bpf_verifier_log_needed(const struct bpf_verifier_log *log)
 
 #define BPF_MAX_SUBPROGS 256
 
+struct bpf_subprog_info {
+   u32 start; /* insn idx of function entry point */
+   u16 stack_depth; /* max. stack depth used by this function */
+};
+
 /* single container for all structs
  * one verifier_env per bpf_check() call
  */
@@ -191,9 +196,7 @@ struct bpf_verifier_env {
bool seen_direct_write;
struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
struct bpf_verifier_log log;
-   u32 subprog_starts[BPF_MAX_SUBPROGS + 1];
-   /* computes the stack depth of each bpf function */
-   u16 subprog_stack_depth[BPF_MAX_SUBPROGS + 1];
+   struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS + 1];
u32 subprog_cnt;
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 16ec977..9764b9b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -738,18 +738,19 @@ enum reg_arg_type {
 
 static int cmp_subprogs(const void *a, const void *b)
 {
-   return *(int *)a - *(int *)b;
+   return ((struct bpf_subprog_info *)a)->start -
+  ((struct bpf_subprog_info *)b)->start;
 }
 
 static int find_subprog(struct bpf_verifier_env *env, int off)
 {
-   u32 *p;
+   struct bpf_subprog_info *p;
 
-   p = bsearch(&off, env->subprog_starts, env->subprog_cnt,
-   sizeof(env->subprog_starts[0]), cmp_subprogs);
+   p = bsearch(&off, env->subprog_info, env->subprog_cnt,
+   sizeof(env->subprog_info[0]), cmp_subprogs);
if (!p)
return -ENOENT;
-   return p - env->subprog_starts;
+   return p - env->subprog_info;
 
 }
 
@@ -769,15 +770,16 @@ static int add_subprog(struct bpf_verifier_env *env, int off)
verbose(env, "too many subprograms\n");
return -E2BIG;
}
-   env->subprog_starts[env->subprog_cnt++] = off;
-   sort(env->subprog_starts, env->subprog_cnt,
-sizeof(env->subprog_starts[0]), cmp_subprogs, NULL);
+   env->subprog_info[env->subprog_cnt++].start = off;
+   sort(env->subprog_info, env->subprog_cnt,
+sizeof(env->subprog_info[0]), cmp_subprogs, NULL);
return 0;
 }
 
 static int check_subprogs(struct bpf_verifier_env *env)
 {
int i, ret, subprog_start, subprog_end, off, cur_subprog = 0;
+   struct bpf_subprog_info *subprog = env->subprog_info;
struct bpf_insn *insn = env->prog->insnsi;
int insn_cnt = env->prog->len;
 
@@ -807,14 +809,14 @@ static int check_subprogs(struct bpf_verifier_env *env)
 
if (env->log.level > 1)
for (i = 0; i < env->subprog_cnt; i++)
-   verbose(env, "func#%d @%d\n", i, env->subprog_starts[i]);
+   verbose(env, "func#%d @%d\n", i, subprog[i].start);
 
/* now check that all jumps are within the same subprog */
subprog_start = 0;
if (env->subprog_cnt == cur_subprog + 1)
subprog_end = insn_cnt;
else
-   subprog_end = env->subprog_starts[cur_subprog + 1];
+   subprog_end = subprog[cur_subprog + 1].start;
for (i = 0; i < insn_cnt; i++) {
u8 code = insn[i].code;
 
@@ -843,8 +845,7 @@ static int check_subprogs(struct bpf_verifier_env *env)
if (env->subprog_cnt == cur_subprog + 1)
subprog_end = insn_cnt;
else
-   subprog_end =
-   env->subprog_starts[cur_subprog + 1];
+   subprog_end = subprog[cur_subprog + 1].start;
}
}
return 0;
@@ -1477,13 +1478,13 @@ static int update_stack_depth(struct bpf_verifier_env *env,
  const struct bpf_func_state *func,
  int off)
 {
-   u16 stack = env->subprog_stack_depth[func->subprogno];
+   u16 stack = env->subprog_info[func->subprogno].stack_depth;
 
if (stack >= -off)
return 0;
 
/* update known max for given subprogram */
-   env->subprog_stack_depth[func->subprogno] = -off;
+   env->subprog_info[func->subprogno].stack_depth = -off;
return 0;
 }
 
@@ -1495,7 +1496,8 @@ static int update_stack_depth(struct bpf_verifier_env *env,
  */

[PATCH bpf-next 0/3] bpf: cleanups on managing subprog information

2018-04-30 Thread Jiong Wang
This patch set cleans up some code logic related to managing subprog
information.

Part of the set is inspired by Edwin's code in his RFC:

  "bpf/verifier: subprog/func_call simplifications"

but with clearer separation so it is easier to review.

  - Patch 1 unifies the main prog and subprogs. All of them are registered
    in env->subprog_starts.

  - After patch 1, it is clear that subprog_starts and subprog_stack_depth
    could be merged, as both of them now have the main prog and subprogs
    unified. Patch 2 therefore does the merge; all subprog information is
    centred in bpf_subprog_info.

  - Patch 3 goes further to introduce a new fake "exit" subprog which
    serves as an ending marker for the subprog list. We can then turn the
    following code snippet, repeated across the verifier:

   if (env->subprog_cnt == cur_subprog + 1)
   subprog_end = insn_cnt;
   else
   subprog_end = env->subprog_info[cur_subprog + 1].start;

into:
   subprog_end = env->subprog_info[cur_subprog + 1].start;

There is no functional change by this patch set.
No bpf selftest regression found after this patch set.

Jiong Wang (3):
  bpf: unify main prog and subprog
  bpf: centre subprog information fields
  bpf: add faked "ending" subprog

 include/linux/bpf_verifier.h |   9 ++--
 kernel/bpf/verifier.c| 118 +--
 2 files changed, 65 insertions(+), 62 deletions(-)

-- 
2.7.4



[PATCH bpf-next 3/3] bpf: add faked "ending" subprog

2018-04-30 Thread Jiong Wang
There are quite a few code snippets like the following in the verifier:

   subprog_start = 0;
   if (env->subprog_cnt == cur_subprog + 1)
   subprog_end = insn_cnt;
   else
   subprog_end = env->subprog_info[cur_subprog + 1].start;

The reason is that there is no marker in the subprog_info array to tell
where it ends.

We can resolve this issue by introducing a fake "ending" subprog. This
special subprog has "insn_cnt" as its start offset, so it serves as an
end marker whenever we iterate over all subprogs.

Signed-off-by: Jiong Wang 
---
 kernel/bpf/verifier.c | 31 ---
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9764b9b..4a081e0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -766,7 +766,7 @@ static int add_subprog(struct bpf_verifier_env *env, int off)
ret = find_subprog(env, off);
if (ret >= 0)
return 0;
-   if (env->subprog_cnt > BPF_MAX_SUBPROGS) {
+   if (env->subprog_cnt >= BPF_MAX_SUBPROGS) {
verbose(env, "too many subprograms\n");
return -E2BIG;
}
@@ -807,16 +807,18 @@ static int check_subprogs(struct bpf_verifier_env *env)
return ret;
}
 
+   /* Add a fake 'exit' subprog which could simplify subprog iteration
+* logic. 'subprog_cnt' should not be increased.
+*/
+   subprog[env->subprog_cnt].start = insn_cnt;
+
if (env->log.level > 1)
for (i = 0; i < env->subprog_cnt; i++)
verbose(env, "func#%d @%d\n", i, subprog[i].start);
 
/* now check that all jumps are within the same subprog */
-   subprog_start = 0;
-   if (env->subprog_cnt == cur_subprog + 1)
-   subprog_end = insn_cnt;
-   else
-   subprog_end = subprog[cur_subprog + 1].start;
+   subprog_start = subprog[cur_subprog].start;
+   subprog_end = subprog[cur_subprog + 1].start;
for (i = 0; i < insn_cnt; i++) {
u8 code = insn[i].code;
 
@@ -840,11 +842,9 @@ static int check_subprogs(struct bpf_verifier_env *env)
verbose(env, "last insn is not an exit or jmp\n");
return -EINVAL;
}
-   cur_subprog++;
subprog_start = subprog_end;
-   if (env->subprog_cnt == cur_subprog + 1)
-   subprog_end = insn_cnt;
-   else
+   cur_subprog++;
+   if (cur_subprog < env->subprog_cnt)
subprog_end = subprog[cur_subprog + 1].start;
}
}
@@ -1499,7 +1499,6 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
int depth = 0, frame = 0, idx = 0, i = 0, subprog_end;
struct bpf_subprog_info *subprog = env->subprog_info;
struct bpf_insn *insn = env->prog->insnsi;
-   int insn_cnt = env->prog->len;
int ret_insn[MAX_CALL_FRAMES];
int ret_prog[MAX_CALL_FRAMES];
 
@@ -1514,10 +1513,7 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
return -EACCES;
}
 continue_func:
-   if (env->subprog_cnt == idx + 1)
-   subprog_end = insn_cnt;
-   else
-   subprog_end = subprog[idx + 1].start;
+   subprog_end = subprog[idx + 1].start;
for (; i < subprog_end; i++) {
if (insn[i].code != (BPF_JMP | BPF_CALL))
continue;
@@ -5268,10 +5264,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 
for (i = 0; i < env->subprog_cnt; i++) {
subprog_start = subprog_end;
-   if (env->subprog_cnt == i + 1)
-   subprog_end = prog->len;
-   else
-   subprog_end = env->subprog_info[i + 1].start;
+   subprog_end = env->subprog_info[i + 1].start;
 
len = subprog_end - subprog_start;
func[i] = bpf_prog_alloc(bpf_prog_size(len), GFP_USER);
-- 
2.7.4



[PATCH bpf-next 1/3] bpf: unify main prog and subprog

2018-04-30 Thread Jiong Wang
Currently, the verifier treats the main prog and subprogs differently.
All detected subprogs are kept in env->subprog_starts while the main prog
is not; instead, the main prog is implicitly defined as the prog starting
at offset 0.

There is actually no difference between the main prog and a subprog, so
it is better to unify them and register all detected progs in
env->subprog_starts.

This also helps simplify some code logic.

Signed-off-by: Jiong Wang 
---
 include/linux/bpf_verifier.h |  2 +-
 kernel/bpf/verifier.c| 57 
 2 files changed, 32 insertions(+), 27 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 7e61c39..f655b92 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -191,7 +191,7 @@ struct bpf_verifier_env {
bool seen_direct_write;
struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
struct bpf_verifier_log log;
-   u32 subprog_starts[BPF_MAX_SUBPROGS];
+   u32 subprog_starts[BPF_MAX_SUBPROGS + 1];
/* computes the stack depth of each bpf function */
u16 subprog_stack_depth[BPF_MAX_SUBPROGS + 1];
u32 subprog_cnt;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index eb1a596..16ec977 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -765,7 +765,7 @@ static int add_subprog(struct bpf_verifier_env *env, int off)
ret = find_subprog(env, off);
if (ret >= 0)
return 0;
-   if (env->subprog_cnt >= BPF_MAX_SUBPROGS) {
+   if (env->subprog_cnt > BPF_MAX_SUBPROGS) {
verbose(env, "too many subprograms\n");
return -E2BIG;
}
@@ -781,6 +781,11 @@ static int check_subprogs(struct bpf_verifier_env *env)
struct bpf_insn *insn = env->prog->insnsi;
int insn_cnt = env->prog->len;
 
+   /* Add entry function. */
+   ret = add_subprog(env, 0);
+   if (ret < 0)
+   return ret;
+
/* determine subprog starts. The end is one before the next starts */
for (i = 0; i < insn_cnt; i++) {
if (insn[i].code != (BPF_JMP | BPF_CALL))
@@ -806,10 +811,10 @@ static int check_subprogs(struct bpf_verifier_env *env)
 
/* now check that all jumps are within the same subprog */
subprog_start = 0;
-   if (env->subprog_cnt == cur_subprog)
+   if (env->subprog_cnt == cur_subprog + 1)
subprog_end = insn_cnt;
else
-   subprog_end = env->subprog_starts[cur_subprog++];
+   subprog_end = env->subprog_starts[cur_subprog + 1];
for (i = 0; i < insn_cnt; i++) {
u8 code = insn[i].code;
 
@@ -833,11 +838,13 @@ static int check_subprogs(struct bpf_verifier_env *env)
verbose(env, "last insn is not an exit or jmp\n");
return -EINVAL;
}
+   cur_subprog++;
subprog_start = subprog_end;
-   if (env->subprog_cnt == cur_subprog)
+   if (env->subprog_cnt == cur_subprog + 1)
subprog_end = insn_cnt;
else
-   subprog_end = env->subprog_starts[cur_subprog++];
+   subprog_end =
+   env->subprog_starts[cur_subprog + 1];
}
}
return 0;
@@ -1505,10 +1512,10 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
return -EACCES;
}
 continue_func:
-   if (env->subprog_cnt == subprog)
+   if (env->subprog_cnt == subprog + 1)
subprog_end = insn_cnt;
else
-   subprog_end = env->subprog_starts[subprog];
+   subprog_end = env->subprog_starts[subprog + 1];
for (; i < subprog_end; i++) {
if (insn[i].code != (BPF_JMP | BPF_CALL))
continue;
@@ -1526,7 +1533,6 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
  i);
return -EFAULT;
}
-   subprog++;
frame++;
if (frame >= MAX_CALL_FRAMES) {
WARN_ONCE(1, "verifier bug. Call stack is too deep\n");
@@ -1558,7 +1564,6 @@ static int get_callee_stack_depth(struct bpf_verifier_env *env,
  start);
return -EFAULT;
}
-   subprog++;
return env->subprog_stack_depth[subprog];
 }
 #endif
@@ -2087,7 +2092,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
case BPF_FUNC_tail_call:
if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
goto error;
-   if (env->subprog_cnt) {
+   if 

Re: Request for stable 4.14.x inclusion: net: don't call update_pmtu unconditionally

2018-04-30 Thread Thomas Deutschmann
Hi,

On 2018-04-30 20:22, Greg KH wrote:
> The geneve hunk doesn't apply at all to the 4.14.y tree, so I think
> someone has a messed up tree somewhere...
> 
> I'll go look into this now.

Mh?

> $ git clone 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
> $ cd linux-stable
> $ git checkout v4.14.38
> $ git cherry-pick 52a589d51f1008f62569bf89e95b26221ee76690

Works for me... then I cherry-pick
f15ca723c1ebe6c1a06bc95fda6b62cd87b44559 on top, adjust
"net/ipv6/ip6_tunnel.c" as shown in my previous mail, and everything is
fine for me...


-- 
Regards,
Thomas Deutschmann / Gentoo Linux Developer
C4DD 695F A713 8F24 2AA1 5638 5849 7EE5 1D5D 74A5


[PATCH RESEND] connector: add parent pid and tgid to coredump and exit events

2018-04-30 Thread Stefan Strogin
The intention is to get notified of process failures as soon as
possible, before a possibly lengthy core dump, e.g. in a process
manager. The coredump and exit process events are perfect for such use
cases (see 2b5faa4c553f "connector: Added coredumping event to the
process connector").

The problem is that, as of now, a process manager cannot learn the
parent of a dying process from the connector events. This matters when
the process manager should monitor failures only for children of
certain parents, so the coredump and exit events can be filtered by
parent process and/or thread ID.

Add the parent pid and tgid to the coredump and exit process connector
event data.

Signed-off-by: Stefan Strogin 
Acked-by: Evgeniy Polyakov 
---
 drivers/connector/cn_proc.c  | 4 ++++
 include/uapi/linux/cn_proc.h | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c
index a782ce87715c..ed5e42461094 100644
--- a/drivers/connector/cn_proc.c
+++ b/drivers/connector/cn_proc.c
@@ -262,6 +262,8 @@ void proc_coredump_connector(struct task_struct *task)
ev->what = PROC_EVENT_COREDUMP;
ev->event_data.coredump.process_pid = task->pid;
ev->event_data.coredump.process_tgid = task->tgid;
+   ev->event_data.coredump.parent_pid = task->real_parent->pid;
+   ev->event_data.coredump.parent_tgid = task->real_parent->tgid;
 
memcpy(>id, _proc_event_id, sizeof(msg->id));
msg->ack = 0; /* not used */
@@ -288,6 +290,8 @@ void proc_exit_connector(struct task_struct *task)
ev->event_data.exit.process_tgid = task->tgid;
ev->event_data.exit.exit_code = task->exit_code;
ev->event_data.exit.exit_signal = task->exit_signal;
+   ev->event_data.exit.parent_pid = task->real_parent->pid;
+   ev->event_data.exit.parent_tgid = task->real_parent->tgid;
 
memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id));
msg->ack = 0; /* not used */
diff --git a/include/uapi/linux/cn_proc.h b/include/uapi/linux/cn_proc.h
index 68ff25414700..db210625cee8 100644
--- a/include/uapi/linux/cn_proc.h
+++ b/include/uapi/linux/cn_proc.h
@@ -116,12 +116,16 @@ struct proc_event {
struct coredump_proc_event {
__kernel_pid_t process_pid;
__kernel_pid_t process_tgid;
+   __kernel_pid_t parent_pid;
+   __kernel_pid_t parent_tgid;
} coredump;
 
struct exit_proc_event {
__kernel_pid_t process_pid;
__kernel_pid_t process_tgid;
__u32 exit_code, exit_signal;
+   __kernel_pid_t parent_pid;
+   __kernel_pid_t parent_tgid;
} exit;
 
} event_data;
-- 
2.16.1



[PATCH net-next v3 2/2] openvswitch: Support conntrack zone limit

2018-04-30 Thread Yi-Hung Wei
Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers is limited by nf_conntrack_max.  In this
case, if one of the VMs/containers abuses the usage of the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table.  Even if we could put each VM in a different network
namespace, the current nf_conntrack_max configuration is too rigid to
allow different VMs/containers to have different # of conntrack entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate a different zone to each VM,
and set a conntrack limit on each zone.  By providing this isolation, a
misbehaving VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, users can
set different conntrack limits on different zones based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connections is above the configured limit, OVS will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero, i.e. no limitation, which is backward compatible with
the behavior without this patch.

The following high-level APIs are provided to the userspace:
  - OVS_CT_LIMIT_CMD_SET:
* set default connection limit for all zones
* set the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_DEL:
* remove the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_GET:
* get the default connection limit for all zones
* get the connection limit for a particular zone
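The per-zone accounting described above can be sketched as a small user-space model in plain C (illustrative only; function names like `ct_commit` and the flat array are made up for the sketch — the kernel code hashes struct ovs_ct_limit entries and asks nf_conncount for the live count):

```c
#include <assert.h>
#include <stdint.h>

/* User-space model of the per-zone conntrack limit described above.
 * Limit 0 means "no limitation", matching the backward-compatible
 * default in the patch. */
#define MAX_ZONES 16
#define LIMIT_UNLIMITED 0

struct zone_state {
    uint32_t limit;              /* 0 here means "fall back to default" */
    uint32_t count;              /* current # of conntrack entries */
};

static struct zone_state zones[MAX_ZONES];
static uint32_t default_limit = LIMIT_UNLIMITED;

static uint32_t effective_limit(uint16_t zone)
{
    uint32_t l = zones[zone].limit;

    return l ? l : default_limit;
}

/* Returns 0 on success, -1 (standing in for -ENOMEM) when the zone is full. */
static int ct_commit(uint16_t zone)
{
    uint32_t limit = effective_limit(zone);

    if (limit != LIMIT_UNLIMITED && zones[zone].count >= limit)
        return -1;
    zones[zone].count++;
    return 0;
}
```

A misbehaving zone hits its own limit without consuming entries from other zones, which is the isolation property the commit message claims.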

Signed-off-by: Yi-Hung Wei 
---
 net/openvswitch/Kconfig |   3 +-
 net/openvswitch/conntrack.c | 508 +++-
 net/openvswitch/conntrack.h |   9 +-
 net/openvswitch/datapath.c  |   7 +-
 net/openvswitch/datapath.h  |   1 +
 5 files changed, 522 insertions(+), 6 deletions(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 2650205cdaf9..89da9512ec1e 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -9,7 +9,8 @@ config OPENVSWITCH
   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
 (!NF_NAT || NF_NAT) && \
 (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
-(!NF_NAT_IPV6 || NF_NAT_IPV6)))
+(!NF_NAT_IPV6 || NF_NAT_IPV6) && \
+(!NETFILTER_CONNCOUNT || NETFILTER_CONNCOUNT)))
select LIBCRC32C
select MPLS
select NET_MPLS_GSO
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index c5904f629091..8234964889d9 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -16,8 +16,11 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -76,6 +79,39 @@ struct ovs_conntrack_info {
 #endif
 };
 
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+#define OVS_CT_LIMIT_UNLIMITED 0
+#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
+#define CT_LIMIT_HASH_BUCKETS 512
+DEFINE_STATIC_KEY_FALSE(ovs_ct_limit_enabled);
+
+struct ovs_ct_limit {
+   /* Elements in ovs_ct_limit_info->limits hash table */
+   struct hlist_node hlist_node;
+   struct rcu_head rcu;
+   u16 zone;
+   u32 limit;
+};
+
+struct ovs_ct_limit_info {
+   u32 default_limit;
+   struct hlist_head *limits;
+   struct nf_conncount_data *data __aligned(8);
+};
+
+static const struct nla_policy ct_limit_policy[OVS_CT_LIMIT_ATTR_MAX + 1] = {
+   [OVS_CT_LIMIT_ATTR_OPTION] = { .type = NLA_NESTED, },
+};
+
+static const struct nla_policy
+   ct_zone_limit_policy[OVS_CT_ZONE_LIMIT_ATTR_MAX + 1] = {
+   [OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT] = { .type = NLA_U32, },
+   [OVS_CT_ZONE_LIMIT_ATTR_ZONE] = { .type = NLA_U16, },
+   [OVS_CT_ZONE_LIMIT_ATTR_LIMIT] = { .type = NLA_U32, },
+   [OVS_CT_ZONE_LIMIT_ATTR_COUNT] = { .type = NLA_U32, },
+};
+#endif
+
 static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
 
 static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
@@ -1036,6 +1072,94 @@ static bool labels_nonzero(const struct ovs_key_ct_labels *labels)
return false;
 }
 
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static struct hlist_head *ct_limit_hash_bucket(
+   const struct ovs_ct_limit_info *info, u16 zone)
+{
+   return 

[PATCH net-next v3 0/2] openvswitch: Support conntrack zone limit

2018-04-30 Thread Yi-Hung Wei
Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers is limited by nf_conntrack_max.  In this
case, if one of the VMs/containers abuses the usage of the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table.  Even if we could put each VM in a different network
namespace, the current nf_conntrack_max configuration is too rigid to
allow different VMs/containers to have different # of conntrack entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate a different zone to each VM,
and set a conntrack limit on each zone.  By providing this isolation, a
misbehaving VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, users can
set different conntrack limits on different zones based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connections is above the configured limit, OVS will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero, i.e. no limitation, which is backward compatible with
the behavior without this patch.

The first patch defines the conntrack limit netlink definition, and the
second patch provides the implementation.

v2->v3:
  - Addressed comments from Pravin: use static keys to check whether the
    ovs_ct_limit feature is used, only check ct_limit when a ct entry
    is unconfirmed, and report rate-limited warning messages when the ct
    limit is reached.
  - Rebases to master.

v1->v2:
  - Fixes commit log typos suggested by Greg.
  - Fixes memory free issue that Julia found.


Yi-Hung Wei (2):
  openvswitch: Add conntrack limit netlink definition
  openvswitch: Support conntrack zone limit

 include/uapi/linux/openvswitch.h |  62 +
 net/openvswitch/Kconfig  |   3 +-
 net/openvswitch/conntrack.c  | 508 ++-
 net/openvswitch/conntrack.h  |   9 +-
 net/openvswitch/datapath.c   |   7 +-
 net/openvswitch/datapath.h   |   1 +
 6 files changed, 584 insertions(+), 6 deletions(-)

-- 
2.7.4



[PATCH net-next v3 1/2] openvswitch: Add conntrack limit netlink definition

2018-04-30 Thread Yi-Hung Wei
Define netlink messages and attributes to support user kernel
communication that uses the conntrack limit feature.

Signed-off-by: Yi-Hung Wei 
---
 include/uapi/linux/openvswitch.h | 62 
 1 file changed, 62 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 713e56ce681f..ca63c16375ce 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -937,4 +937,66 @@ enum ovs_meter_band_type {
 
 #define OVS_METER_BAND_TYPE_MAX (__OVS_METER_BAND_TYPE_MAX - 1)
 
+/* Conntrack limit */
+#define OVS_CT_LIMIT_FAMILY  "ovs_ct_limit"
+#define OVS_CT_LIMIT_MCGROUP "ovs_ct_limit"
+#define OVS_CT_LIMIT_VERSION 0x1
+
+enum ovs_ct_limit_cmd {
+   OVS_CT_LIMIT_CMD_UNSPEC,
+   OVS_CT_LIMIT_CMD_SET,   /* Add or modify ct limit. */
+   OVS_CT_LIMIT_CMD_DEL,   /* Delete ct limit. */
+   OVS_CT_LIMIT_CMD_GET/* Get ct limit. */
+};
+
+enum ovs_ct_limit_attr {
+   OVS_CT_LIMIT_ATTR_UNSPEC,
+   OVS_CT_LIMIT_ATTR_OPTION,   /* Nested OVS_CT_LIMIT_ATTR_* */
+   __OVS_CT_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1)
+
+/**
+ * @OVS_CT_ZONE_LIMIT_ATTR_SET_REQ: Contains either
+ * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a pair of
+ * OVS_CT_ZONE_LIMIT_ATTR_ZONE and OVS_CT_ZONE_LIMIT_ATTR_LIMIT.
+ * @OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
+ * @OVS_CT_ZONE_LIMIT_ATTR_GET_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
+ * @OVS_CT_ZONE_LIMIT_ATTR_GET_RLY: Contains either
+ * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a triple of
+ * OVS_CT_ZONE_LIMIT_ATTR_ZONE, OVS_CT_ZONE_LIMIT_ATTR_LIMIT and
+ * OVS_CT_ZONE_LIMIT_ATTR_COUNT.
+ */
+enum ovs_ct_limit_option_attr {
+   OVS_CT_LIMIT_OPTION_ATTR_UNSPEC,
+   OVS_CT_ZONE_LIMIT_ATTR_SET_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   OVS_CT_ZONE_LIMIT_ATTR_GET_REQ, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   OVS_CT_ZONE_LIMIT_ATTR_GET_RLY, /* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+* attributes. */
+   __OVS_CT_LIMIT_OPTION_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_OPTION_ATTR_MAX (__OVS_CT_LIMIT_OPTION_ATTR_MAX - 1)
+
+enum ovs_ct_zone_limit_attr {
+   OVS_CT_ZONE_LIMIT_ATTR_UNSPEC,
+   OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT,   /* u32 default conntrack limit
+* for all zones. */
+   OVS_CT_ZONE_LIMIT_ATTR_ZONE,/* u16 conntrack zone id. */
+   OVS_CT_ZONE_LIMIT_ATTR_LIMIT,   /* u32 max number of conntrack
+* entries allowed in the
+* corresponding zone. */
+   OVS_CT_ZONE_LIMIT_ATTR_COUNT,   /* u32 number of conntrack
+* entries in the corresponding
+* zone. */
+   __OVS_CT_ZONE_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_ZONE_LIMIT_ATTR_MAX (__OVS_CT_ZONE_LIMIT_ATTR_MAX - 1)
+
 #endif /* _LINUX_OPENVSWITCH_H */
-- 
2.7.4



Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc

2018-04-30 Thread Dave Taht
On Mon, Apr 30, 2018 at 2:21 PM, Cong Wang  wrote:
> On Sun, Apr 29, 2018 at 2:34 PM, Toke Høiland-Jørgensen  wrote:
>> sch_cake targets the home router use case and is intended to squeeze the
>> most bandwidth and latency out of even the slowest ISP links and routers,
>> while presenting an API simple enough that even an ISP can configure it.
>>
>> Example of use on a cable ISP uplink:
>>
>> tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
>>
>> To shape a cable download link (ifb and tc-mirred setup elided)
>>
>> tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash
>>
>> CAKE is filled with:
>>
>> * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
>>   derived Flow Queuing system, which autoconfigures based on the bandwidth.
>> * A novel "triple-isolate" mode (the default) which balances per-host
>>   and per-flow FQ even through NAT.
>> * A deficit-based shaper, that can also be used in an unlimited mode.
>> * 8 way set associative hashing to reduce flow collisions to a minimum.
>> * A reasonable interpretation of various diffserv latency/loss tradeoffs.
>> * Support for zeroing diffserv markings for entering and exiting traffic.
>> * Support for interacting well with Docsis 3.0 shaper framing.
>> * Extensive support for DSL framing types.
>> * Support for ack filtering.
>
> Why does this TCP ACK filtering have to be built into the CAKE qdisc rather
> than an independent TC filter? Why can't other qdiscs use it?

I actually wrote a tc-bpf based ack filter during the development of
cake's ack-thinner, which I should submit one of these days. It
proved to be of limited use.

Probably the biggest mistake we made was calling this cake feature a
filter. It isn't.

Maybe we should have called it a "thinner" or something like that? In
order to properly "thin" or "reduce" an ack stream
you have to have a queue to look at and some related state. TC filters
do not operate on queues, qdiscs do. Thus the "ack-filter" here is
deeply embedded into cake's flow isolation and queue structures.
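As a rough illustration of why queue state is needed: an older pure ACK can only be dropped when a newer pure ACK for the same flow is already sitting behind it in the queue. A minimal user-space model (illustrative only — cake's real thinner also checks SACK blocks, ECN bits and window updates before declaring an ACK redundant):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal model of ACK thinning over a per-flow queue: a queued pure
 * ACK is redundant if a later pure ACK in the same queue acknowledges
 * at least as much data. */
struct pkt {
    int is_pure_ack;      /* no payload, ACK flag only */
    uint32_t ack_seq;     /* cumulative acknowledgment number */
};

/* Returns the number of packets kept, compacting q in place. */
static size_t thin_acks(struct pkt *q, size_t n)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        int redundant = 0;

        if (q[i].is_pure_ack) {
            /* Is a newer pure ACK queued behind this one? */
            for (size_t j = i + 1; j < n; j++) {
                if (q[j].is_pure_ack && q[j].ack_seq >= q[i].ack_seq) {
                    redundant = 1;
                    break;
                }
            }
        }
        if (!redundant)
            q[kept++] = q[i];
    }
    return kept;
}
```

A stateless TC filter sees one packet at a time and has no such queue to scan, which is why the thinner lives inside the qdisc.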

>
>
>> * Extensive statistics for measuring, loss, ecn markings, latency
>>   variation.
>>
>> A paper describing the design of CAKE is available at
>> https://arxiv.org/abs/1804.07617
>>
>
> Thanks.



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619


Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc

2018-04-30 Thread Cong Wang
On Sun, Apr 29, 2018 at 2:34 PM, Toke Høiland-Jørgensen  wrote:
> sch_cake targets the home router use case and is intended to squeeze the
> most bandwidth and latency out of even the slowest ISP links and routers,
> while presenting an API simple enough that even an ISP can configure it.
>
> Example of use on a cable ISP uplink:
>
> tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
>
> To shape a cable download link (ifb and tc-mirred setup elided)
>
> tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash
>
> CAKE is filled with:
>
> * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
>   derived Flow Queuing system, which autoconfigures based on the bandwidth.
> * A novel "triple-isolate" mode (the default) which balances per-host
>   and per-flow FQ even through NAT.
> * A deficit-based shaper, that can also be used in an unlimited mode.
> * 8 way set associative hashing to reduce flow collisions to a minimum.
> * A reasonable interpretation of various diffserv latency/loss tradeoffs.
> * Support for zeroing diffserv markings for entering and exiting traffic.
> * Support for interacting well with Docsis 3.0 shaper framing.
> * Extensive support for DSL framing types.
> * Support for ack filtering.

Why does this TCP ACK filtering have to be built into the CAKE qdisc rather
than an independent TC filter? Why can't other qdiscs use it?


> * Extensive statistics for measuring, loss, ecn markings, latency
>   variation.
>
> A paper describing the design of CAKE is available at
> https://arxiv.org/abs/1804.07617
>

Thanks.


[PATCH net-next] net: core: Inline netdev_features_size_check()

2018-04-30 Thread Florian Fainelli
We do not require this inline function to be used in multiple different
locations, just inline it where it gets used in register_netdevice().

Suggested-by: David Miller 
Suggested-by: Stephen Hemminger 
Signed-off-by: Florian Fainelli 
---
 include/linux/netdevice.h | 6 --
 net/core/dev.c| 3 ++-
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9e09dd897b74..82f5a9aba578 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4108,12 +4108,6 @@ const char *netdev_drivername(const struct net_device *dev);
 
 void linkwatch_run_queue(void);
 
-static inline void netdev_features_size_check(void)
-{
-   BUILD_BUG_ON(sizeof(netdev_features_t) * BITS_PER_BYTE <
-NETDEV_FEATURE_COUNT);
-}
-
 static inline netdev_features_t netdev_intersect_features(netdev_features_t f1,
  netdev_features_t f2)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index e01c21a88cae..3263c14c607f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7879,7 +7879,8 @@ int register_netdevice(struct net_device *dev)
int ret;
struct net *net = dev_net(dev);
 
-   netdev_features_size_check();
+   BUILD_BUG_ON(sizeof(netdev_features_t) * BITS_PER_BYTE <
+NETDEV_FEATURE_COUNT);
BUG_ON(dev_boot_phase);
ASSERT_RTNL();
 
-- 
2.14.1



[PATCH] ipv6: Allow non-gateway ECMP for IPv6

2018-04-30 Thread Thomas Winter
It is valid to have static routes where the nexthop
is an interface rather than an address, such as tunnels.
For IPv4 it is possible to use ECMP on these routes,
but not for IPv6.

Signed-off-by: Thomas Winter 
Cc: David Ahern 
Cc: "David S. Miller" 
Cc: Alexey Kuznetsov 
Cc: Hideaki YOSHIFUJI 
---
 include/net/ip6_route.h | 3 +--
 net/ipv6/ip6_fib.c  | 3 ---
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 08b132381984..abceb5864d99 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -68,8 +68,7 @@ static inline bool rt6_need_strict(const struct in6_addr *daddr)
 
 static inline bool rt6_qualify_for_ecmp(const struct rt6_info *rt)
 {
-   return (rt->rt6i_flags & (RTF_GATEWAY|RTF_ADDRCONF|RTF_DYNAMIC)) ==
-  RTF_GATEWAY;
+   return (rt->rt6i_flags & (RTF_ADDRCONF | RTF_DYNAMIC)) == 0;
 }
 
 void ip6_route_input(struct sk_buff *skb);
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index deab2db6692e..3c97c29d4401 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -934,9 +934,6 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
 * list.
 * Only static routes (which don't have flag
 * RTF_EXPIRES) are used for ECMPv6.
-*
-* To avoid long list, we only had siblings if the
-* route have a gateway.
 */
if (rt_can_ecmp &&
rt6_qualify_for_ecmp(iter))
-- 
2.17.0
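The effect of the rt6_qualify_for_ecmp() change above can be modeled with the flag values alone (flag constants as in include/uapi/linux/ipv6_route.h; the two predicates mirror the before/after expressions in the diff):

```c
#include <assert.h>
#include <stdint.h>

/* Flag values from include/uapi/linux/ipv6_route.h */
#define RTF_GATEWAY  0x0002
#define RTF_DYNAMIC  0x0010
#define RTF_ADDRCONF 0x040000

/* Old rule: only static routes *with* a gateway qualify for ECMP. */
static int qualify_old(uint32_t flags)
{
    return (flags & (RTF_GATEWAY | RTF_ADDRCONF | RTF_DYNAMIC)) == RTF_GATEWAY;
}

/* New rule: any static route qualifies, gateway or device route. */
static int qualify_new(uint32_t flags)
{
    return (flags & (RTF_ADDRCONF | RTF_DYNAMIC)) == 0;
}
```

The interesting case is the device route (no RTF_GATEWAY), e.g. a tunnel nexthop: it fails the old check and passes the new one, while RA-learned routes (RTF_ADDRCONF) and redirects (RTF_DYNAMIC) stay excluded.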



Proposal

2018-04-30 Thread Miss Zeliha Omer Faruk



Hello

   Greetings to you today i asked before but i did't get a response please
i know this might come to you as a surprise because you do not know me
personally i have a business proposal for our mutual benefit please let
me know if you are interested.



Best Regards,

Esentepe Mahallesi Büyükdere
Caddesi Kristal Kule Binasi
No:215
 Sisli - Istanbul, Turkey





Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-30 Thread Mikulas Patocka


On Mon, 30 Apr 2018, John Stoffel wrote:

> > "Mikulas" == Mikulas Patocka  writes:
> 
> Mikulas> On Thu, 26 Apr 2018, John Stoffel wrote:
> 
> Mikulas> I see your point - and I think the misunderstanding is this.
> 
> Thanks.
> 
> Mikulas> This patch is not really helping people to debug existing crashes. 
> It is 
> Mikulas> not like "you get a crash" - "you google for some keywords" - "you 
> get a 
> Mikulas> page that suggests to turn this option on" - "you turn it on and 
> solve the 
> Mikulas> crash".
> 
> Mikulas> What this patch really does is that - it makes the kernel 
> deliberately 
> Mikulas> crash in a situation when the code violates the specification, but 
> it 
> Mikulas> would not crash otherwise or it would crash very rarely. It helps to 
> Mikulas> detect specification violations.
> 
> Mikulas> If the kernel developer (or tester) doesn't use this option, his 
> buggy 
> Mikulas> code won't crash - and if it won't crash, he won't fix the bug or 
> report 
> Mikulas> it. How is the user or developer supposed to learn about this 
> option, if 
> Mikulas> he gets no crash at all?
> 
> So why do we make this a KConfig option at all?

Because other people see the KConfig option (so, they may enable it) and 
they don't see the kernel parameter (so, they won't enable it).

Close your eyes and say how many kernel parameters you remember :-)

> Just turn it on and let it rip.

I can't test if all the networking drivers use kvmalloc properly, because 
I don't have the hardware. You can't test it either. No one has all the
hardware that is supported by Linux.

Driver issues can only be tested by a mass of users. And if the users 
don't know about the debugging option, they won't enable it.

> >> I agree with James here.  Looking at the SLAB vs SLUB Kconfig entries
> >> tells me *nothing* about why I should pick one or the other, as an
> >> example.

BTW. You can enable slub debugging either with CONFIG_SLUB_DEBUG_ON or 
with the kernel parameter "slub_debug" - and most users who compile their 
own kernel use CONFIG_SLUB_DEBUG_ON - just because it is visible.

> Now I also think that Linus has the right idea to not just sprinkle 
> BUG_ONs into the code, just dump and oops and keep going if you can.  
> If it's a filesystem or a device, turn it read only so that people 
> notice right away.

This vmalloc fallback is similar to CONFIG_DEBUG_KOBJECT_RELEASE. 
CONFIG_DEBUG_KOBJECT_RELEASE changes the behavior of kobject_put in order 
to cause deliberate crashes (that wouldn't happen otherwise) in drivers 
that misuse kobject_put. In the same sense, we want to cause deliberate 
crashes (that wouldn't happen otherwise) in drivers that misuse kvmalloc.

The crashes will only happen in debugging kernels, not in production 
kernels.
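As a user-space analogy, the debug option amounts to failing the small-allocation path on purpose so the fallback path gets exercised (a sketch only; `debug_force_fallback` and the fake_* allocator names are made up for illustration, they are not the kernel's API):

```c
#include <assert.h>
#include <stdlib.h>

/* User-space analogy of the kvmalloc fallback fault injection: with the
 * debug knob on, the "kmalloc" path is failed deliberately so callers
 * are forced through the "vmalloc" fallback.  Code that is only correct
 * for kmalloc memory then misbehaves in testing, not in production. */
static int debug_force_fallback;     /* the made-up debug knob */
static int used_fallback;            /* records which path satisfied the call */

static void *fake_kmalloc(size_t size)
{
    if (debug_force_fallback)
        return NULL;                 /* injected failure */
    return malloc(size);
}

static void *fake_kvmalloc(size_t size)
{
    void *p = fake_kmalloc(size);

    used_fallback = (p == NULL);
    return p ? p : malloc(size);     /* "vmalloc" fallback */
}
```

With the knob off, callers almost never see the fallback, which is exactly why misuse goes unnoticed without the debug option.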

Mikulas


[PATCH 0/6] Fix XSA-155-like bugs in frontend drivers

2018-04-30 Thread Marek Marczykowski-Górecki
Patches in original Xen Security Advisory 155 cared only about backend drivers
while leaving frontend patches to be "developed and released (publicly) after
the embargo date". This is said series.

Marek Marczykowski-Górecki (6):
  xen: Add RING_COPY_RESPONSE()
  xen-netfront: copy response out of shared buffer before accessing it
  xen-netfront: do not use data already exposed to backend
  xen-netfront: add range check for Tx response id
  xen-blkfront: make local copy of response before using it
  xen-blkfront: prepare request locally, only then put it on the shared ring

 drivers/block/xen-blkfront.c| 110 ++---
 drivers/net/xen-netfront.c  |  61 +-
 include/xen/interface/io/ring.h |  14 -
 3 files changed, 106 insertions(+), 79 deletions(-)

base-commit: 6d08b06e67cd117f6992c46611dfb4ce267cd71e
-- 
git-series 0.9.1


[PATCH 2/6] xen-netfront: copy response out of shared buffer before accessing it

2018-04-30 Thread Marek Marczykowski-Górecki
Make local copy of the response, otherwise backend might modify it while
frontend is already processing it - leading to time of check / time of
use issue.

This is complementary to XSA155.
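The pattern is plain defensive copying: snapshot the shared slot once, then validate and use only the snapshot. A user-space sketch of the idea (struct layout simplified for illustration; the real macro is RING_COPY_RESPONSE() added in patch 1/6, not shown here):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the time-of-check/time-of-use fix: the backend may rewrite
 * the shared slot at any moment, so copy it once and only ever look at
 * the private copy afterwards. */
struct tx_response {
    uint16_t id;
    int16_t status;
};

#define NET_TX_RING_SIZE 256

/* Stand-in for RING_COPY_RESPONSE(): one snapshot of the shared slot. */
static void copy_response(volatile struct tx_response *shared,
                          struct tx_response *local)
{
    memcpy(local, (const void *)shared, sizeof(*local));
}

/* Validate the snapshot, never the shared slot. */
static int response_id_ok(const struct tx_response *local)
{
    return local->id < NET_TX_RING_SIZE;
}
```

Any later write by the backend can no longer change what the frontend checks and what it uses, closing the race.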

Cc: sta...@vger.kernel.org
Signed-off-by: Marek Marczykowski-Górecki 
---
 drivers/net/xen-netfront.c | 51 +++
 1 file changed, 25 insertions(+), 26 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 4dd0668..dc99763 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -387,13 +387,13 @@ static void xennet_tx_buf_gc(struct netfront_queue *queue)
rmb(); /* Ensure we see responses up to 'rp'. */
 
for (cons = queue->tx.rsp_cons; cons != prod; cons++) {
-   struct xen_netif_tx_response *txrsp;
+   struct xen_netif_tx_response txrsp;
 
-   txrsp = RING_GET_RESPONSE(&queue->tx, cons);
-   if (txrsp->status == XEN_NETIF_RSP_NULL)
+   RING_COPY_RESPONSE(&queue->tx, cons, &txrsp);
+   if (txrsp.status == XEN_NETIF_RSP_NULL)
continue;
 
-   id  = txrsp->id;
+   id  = txrsp.id;
skb = queue->tx_skbs[id].skb;
if (unlikely(gnttab_query_foreign_access(
queue->grant_tx_ref[id]) != 0)) {
@@ -741,7 +741,7 @@ static int xennet_get_extras(struct netfront_queue *queue,
 RING_IDX rp)
 
 {
-   struct xen_netif_extra_info *extra;
+   struct xen_netif_extra_info extra;
struct device *dev = &queue->info->netdev->dev;
RING_IDX cons = queue->rx.rsp_cons;
int err = 0;
@@ -757,24 +757,23 @@ static int xennet_get_extras(struct netfront_queue *queue,
break;
}
 
-   extra = (struct xen_netif_extra_info *)
-   RING_GET_RESPONSE(&queue->rx, ++cons);
+   RING_COPY_RESPONSE(&queue->rx, ++cons, &extra);
 
-   if (unlikely(!extra->type ||
-extra->type >= XEN_NETIF_EXTRA_TYPE_MAX)) {
+   if (unlikely(!extra.type ||
+extra.type >= XEN_NETIF_EXTRA_TYPE_MAX)) {
if (net_ratelimit())
dev_warn(dev, "Invalid extra type: %d\n",
-   extra->type);
+   extra.type);
err = -EINVAL;
} else {
-   memcpy(&extras[extra->type - 1], extra,
-  sizeof(*extra));
+   memcpy(&extras[extra.type - 1], &extra,
+  sizeof(extra));
}
 
skb = xennet_get_rx_skb(queue, cons);
ref = xennet_get_rx_ref(queue, cons);
xennet_move_rx_slot(queue, skb, ref);
-   } while (extra->flags & XEN_NETIF_EXTRA_FLAG_MORE);
+   } while (extra.flags & XEN_NETIF_EXTRA_FLAG_MORE);
 
queue->rx.rsp_cons = cons;
return err;
@@ -784,28 +783,28 @@ static int xennet_get_responses(struct netfront_queue *queue,
struct netfront_rx_info *rinfo, RING_IDX rp,
struct sk_buff_head *list)
 {
-   struct xen_netif_rx_response *rx = &rinfo->rx;
+   struct xen_netif_rx_response rx = rinfo->rx;
struct xen_netif_extra_info *extras = rinfo->extras;
struct device *dev = &queue->info->netdev->dev;
RING_IDX cons = queue->rx.rsp_cons;
struct sk_buff *skb = xennet_get_rx_skb(queue, cons);
grant_ref_t ref = xennet_get_rx_ref(queue, cons);
-   int max = MAX_SKB_FRAGS + (rx->status <= RX_COPY_THRESHOLD);
+   int max = MAX_SKB_FRAGS + (rx.status <= RX_COPY_THRESHOLD);
int slots = 1;
int err = 0;
unsigned long ret;
 
-   if (rx->flags & XEN_NETRXF_extra_info) {
+   if (rx.flags & XEN_NETRXF_extra_info) {
err = xennet_get_extras(queue, extras, rp);
cons = queue->rx.rsp_cons;
}
 
for (;;) {
-   if (unlikely(rx->status < 0 ||
-rx->offset + rx->status > XEN_PAGE_SIZE)) {
+   if (unlikely(rx.status < 0 ||
+rx.offset + rx.status > XEN_PAGE_SIZE)) {
if (net_ratelimit())
dev_warn(dev, "rx->offset: %u, size: %d\n",
-rx->offset, rx->status);
+rx.offset, rx.status);
xennet_move_rx_slot(queue, skb, ref);
err = -EINVAL;
goto next;
@@ -819,7 +818,7 @@ static int xennet_get_responses(struct netfront_queue *queue,
if (ref == GRANT_INVALID_REF) {
  

[PATCH 3/6] xen-netfront: do not use data already exposed to backend

2018-04-30 Thread Marek Marczykowski-Górecki
Backend may freely modify anything on shared page, so use data which was
supposed to be written there, instead of reading it back from the shared
page.

This is complementary to XSA155.

CC: sta...@vger.kernel.org
Signed-off-by: Marek Marczykowski-Górecki 
---
 drivers/net/xen-netfront.c |  9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index dc99763..934b8a4 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -458,7 +458,7 @@ static void xennet_tx_setup_grant(unsigned long gfn, 
unsigned int offset,
tx->flags = 0;
 
info->tx = tx;
-   info->size += tx->size;
+   info->size += len;
 }
 
 static struct xen_netif_tx_request *xennet_make_first_txreq(
@@ -574,7 +574,7 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
int slots;
struct page *page;
unsigned int offset;
-   unsigned int len;
+   unsigned int len, this_len;
unsigned long flags;
struct netfront_queue *queue = NULL;
unsigned int num_queues = dev->real_num_tx_queues;
@@ -634,14 +634,15 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
}
 
/* First request for the linear area. */
+   this_len = min_t(unsigned int, XEN_PAGE_SIZE - offset, len);
first_tx = tx = xennet_make_first_txreq(queue, skb,
page, offset, len);
-   offset += tx->size;
+   offset += this_len;
if (offset == PAGE_SIZE) {
page++;
offset = 0;
}
-   len -= tx->size;
+   len -= this_len;
 
if (skb->ip_summed == CHECKSUM_PARTIAL)
/* local packet? */
-- 
git-series 0.9.1


[PATCH 4/6] xen-netfront: add range check for Tx response id

2018-04-30 Thread Marek Marczykowski-Górecki
Tx response ID is fetched from shared page, so make sure it is sane
before using it as an array index.

CC: sta...@vger.kernel.org
Signed-off-by: Marek Marczykowski-Górecki 
---
 drivers/net/xen-netfront.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 934b8a4..55c9b25 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -394,6 +394,7 @@ static void xennet_tx_buf_gc(struct netfront_queue *queue)
continue;
 
id  = txrsp.id;
+   BUG_ON(id >= NET_TX_RING_SIZE);
skb = queue->tx_skbs[id].skb;
if (unlikely(gnttab_query_foreign_access(
queue->grant_tx_ref[id]) != 0)) {
-- 
git-series 0.9.1


Good News

2018-04-30 Thread Mrs Julie Leach
You are a recipient to Mrs Julie Leach Donation of $2 million USD. Contact 
(julieleach...@gmail.com) for claims.


Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-30 Thread Samudrala, Sridhar

On 4/30/2018 12:12 AM, Jiri Pirko wrote:

Mon, Apr 30, 2018 at 05:00:33AM CEST, sridhar.samudr...@intel.com wrote:

On 4/28/2018 1:24 AM, Jiri Pirko wrote:

Fri, Apr 27, 2018 at 07:06:59PM CEST, sridhar.samudr...@intel.com wrote:

This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.

The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a VF

Why? Both datapaths could be enabled at a time. Why the loop on
hypervisor side would be a problem. This in not an issue for
bonding/team as well.

Somehow the hypervisor needs to make sure that the broadcasts/multicasts from 
the VM
sent over the VF datapath don't get looped back to the VM via the virtio-net 
datapath.

Why? Please see below.



This can happen if both datapaths are enabled at the same time.

I would think this is an issue with bonding/team as well, when virtio-net
and the VF are backed by the same PF.



I believe that the scenario is the same as on an ordinary nic/swich
network:

...

   host
  
bond0

   / \
 eth0   eth1
  |   |
...
  |   |
  p1  p2

   switch

...

It is perfectly valid for p1 and p2 to be up and "bridged" together. Bond
has to cope with looped-back frames. The "failover driver" should too,
it's the same scenario.


OK. So it looks like we should be able to handle this by returning
RX_HANDLER_EXACT for frames received on the standby device when the primary
is present.
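That rx_handler decision can be modeled in a few lines (illustrative user-space sketch; the real handler operates on struct sk_buff and struct net_device, and RX_HANDLER_ANOTHER is used after retargeting the skb to the failover netdev):

```c
#include <assert.h>

/* Model of the failover rx_handler decision discussed above: when the
 * primary (VF) datapath is up, frames looped back in via the standby
 * (virtio-net) device must not be delivered through the failover
 * interface.  RX_HANDLER_EXACT means "deliver only to exact-match
 * protocol taps", which drops the looped-back copies. */
enum rx_result { RX_HANDLER_ANOTHER, RX_HANDLER_EXACT };
enum rx_dev { DEV_PRIMARY, DEV_STANDBY };

static int primary_present;

static enum rx_result failover_handle_frame(enum rx_dev dev)
{
    if (dev == DEV_STANDBY && primary_present)
        return RX_HANDLER_EXACT;     /* drop looped-back copies */
    return RX_HANDLER_ANOTHER;       /* deliver via the failover netdev */
}
```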



Re: [PATCH net-next v6 0/3] kernel: add support to collect hardware logs in crash recovery kernel

2018-04-30 Thread Eric W. Biederman
Rahul Lakkireddy  writes:

> v6:
> - Reworked device dump elf note name to contain vendor identifier.
> - Added vmcoredd_header that precedes actual dump in the Elf Note.
> - Device dump's name is moved inside vmcoredd_header.
> - Added "CHELSIO" string as vendor identifier in the Elf Note name
>   for cxgb4 device dumps.

Yep you did and that is not correct.  My apologies if I was unclear.

An elf note looks like this:

/* Note header in a PT_NOTE section */
typedef struct elf32_note {
  Elf32_Wordn_namesz;   /* Name size */
  Elf32_Wordn_descsz;   /* Content size */
  Elf32_Wordn_type; /* Content type */
} Elf32_Nhdr;

n_descsz is is the length of the body.

n_namesz is the length of the ``vendor'' but not the vendor of your
driver.  It is the ``vendor'' that defines n_type.  So "LINUX" in our
case.

The pair "LINUX", NT_VMCOREDD go together.  That pair is what a
subsequent program must look at to decide how to understand the note.

Please don't use CRASH_CORE_NOTE_HEAD_BYTES; that obscures things
unnecessarily, and really is not applicable to this use case.
ALIGN(sizeof(struct elf_note), 4) is almost as short and it makes
it clear what you are talking about.


Also please don't look too closely at the other note stuff for Elf core
dumps.  That stuff was not well done from an Elf standards and ABI
perspective.  Unfortunately by the time I noticed it was years later,
so not something that is worth the breakage in tools to change now.

Looking at struct vmcoredd_header.

That seems a reasonable structure.  However it is a uapi structure so
we need to carefully define it that way.

Perhaps:

#define VMCOREDD_MAX_NAME_BYTES  32

struct vmcoredd_header {
__u32 header_len;   /* Length of this header */
__u32 reserved;
__u64 data_len; /* Length of this device dump */
__u8 dump_name[VMCOREDD_MAX_NAME_BYTES];
__u8 reserved2[16];
};


Looking at that I see another significant issue.  We can't let the
device dump be more than 32 bits long.  The length of an elf
note is defined by an elf word which is universally a __u32.

So perhaps vmcoredd_header should look like:

#define VMCOREDD_MAX_NAME_BYTES  44

struct vmcoredd_header {
__u32   n_namesz;   /* Name size */
__u32   n_descsz;   /* Content size */
__u32   n_type; /* NT_VMDOREDD */
__u8    name[8];        /* LINUX\0\0\0 */
__u8    dump_name[VMCOREDD_MAX_NAME_BYTES];
};

The total length of the data dump would be descsz - sizeof(struct
vmcoredd_header).  The header winds up being a constant 64 bytes this
way, and includes the elf note header.
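For what it's worth, the 64-byte arithmetic above can be sanity-checked from
userspace. A minimal Python sketch (the NT_VMCOREDD value and the
zero-padding of the name are assumptions for illustration, not the kernel
implementation):

```python
import struct

VMCOREDD_MAX_NAME_BYTES = 44   # from the proposed header above
NT_VMCOREDD = 0x700            # assumed note type value, for illustration only

def build_vmcoredd_note(dump_name: bytes, data: bytes) -> bytes:
    """Pack an Elf note as sketched above: 12-byte header, 8-byte name, desc."""
    name = b"LINUX".ljust(8, b"\x00")  # "LINUX\0\0\0"; n_namesz covers all 8
    # the desc starts with the fixed-size dump_name, then the device data
    desc = dump_name.ljust(VMCOREDD_MAX_NAME_BYTES, b"\x00") + data
    # n_namesz, n_descsz, n_type are 32-bit Elf words
    hdr = struct.pack("<III", len(name), len(desc), NT_VMCOREDD)
    return hdr + name + desc

note = build_vmcoredd_note(b"cxgb4", b"\x00" * 4096)
# the constant part (note header + name + dump_name) comes to 64 bytes
print(len(note) - 4096)  # -> 64
```

Note how n_descsz caps the whole dump (dump_name plus data) at what a
32-bit Elf word can express, which is the constraint discussed above.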

Eric


Re: [PATCH net-next v2 4/5] ipv6: sr: Add seg6local action End.BPF

2018-04-30 Thread Mathieu Xhonneux
2018-04-28 2:01 GMT+02:00 Alexei Starovoitov :
>
> On Fri, Apr 27, 2018 at 10:59:19AM -0400, David Miller wrote:
> > From: Mathieu Xhonneux 
> > Date: Tue, 24 Apr 2018 18:44:15 +0100
> >
> > > This patch adds the End.BPF action to the LWT seg6local infrastructure.
> > > This action works like any other seg6local End action, meaning that an 
> > > IPv6
> > > header with SRH is needed, whose DA has to be equal to the SID of the
> > > action. It will also advance the SRH to the next segment, the BPF program
> > > does not have to take care of this.
> >
> > I'd like to see some BPF developers review this change.
> >
> > But on my side I wonder if, instead of validating the whole thing 
> > afterwards,
> > we should make the helpers accessible by the eBPF program validate the 
> > changes
> > as they are made.
>
> Looking at the code I don't think it's possible to keep it valid all the time
> while building, so seg6_validate_srh() after the program run seems necessary.


Indeed, e.g. to add a TLV in the SRH one needs to call
bpf_lwt_seg6_adjust_srh (to add some room for the TLV), then
bpf_lwt_seg6_store_bytes() (to fill the space with the TLV). Between
those two calls, the SRH is in an invalid state.

>
>
> I think the whole set should be targeting bpf-next tree.
> Please fix kbuild errors, rebase and document new helper in man-page style.
> Things like:
> +   test_btf_haskv.o test_btf_nokv.o test_lwt_seg6local.o
> +>>> selftests/bpf: test for seg6local End.BPF action
> should be fixed properly.


Oops, I didn't catch this one, thanks. I'll send a v3 towards bpf-next.


[PATCH net-next] udp: disable gso with no_check_tx

2018-04-30 Thread Willem de Bruijn
From: Willem de Bruijn 

Syzbot managed to send a udp gso packet without checksum offload into
the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This
triggered the skb_warn_bad_offload.

  RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658
   skb_gso_segment include/linux/netdevice.h:4038 [inline]
   validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120
   __dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577
   dev_queue_xmit+0x17/0x20 net/core/dev.c:3618

UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the
udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to
catch and fail in this case.

After the integrity checks jump directly to the CHECKSUM_PARTIAL case
to avoid reading the no_check_tx flags again (a TOCTTOU race).

Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT")
Signed-off-by: Willem de Bruijn 
---
 net/ipv4/udp.c | 4 
 net/ipv6/udp.c | 4 
 2 files changed, 8 insertions(+)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 794aeafeb782..dd3102a37ef9 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -786,11 +786,14 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4,
return -EINVAL;
if (skb->len > cork->gso_size * UDP_MAX_SEGMENTS)
return -EINVAL;
+   if (sk->sk_no_check_tx)
+   return -EINVAL;
if (skb->ip_summed != CHECKSUM_PARTIAL || is_udplite)
return -EIO;
 
skb_shinfo(skb)->gso_size = cork->gso_size;
skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
+   goto csum_partial;
}
 
if (is_udplite)  /* UDP-Lite  */
@@ -802,6 +805,7 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4,
goto send;
 
} else if (skb->ip_summed == CHECKSUM_PARTIAL) { /* UDP hardware csum */
+csum_partial:
 
udp4_hwcsum(skb, fl4->saddr, fl4->daddr);
goto send;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 6acfdd3e442b..a34e28ac03a7 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1051,11 +1051,14 @@ static int udp_v6_send_skb(struct sk_buff *skb, struct flowi6 *fl6,
return -EINVAL;
if (skb->len > cork->gso_size * UDP_MAX_SEGMENTS)
return -EINVAL;
+   if (udp_sk(sk)->no_check6_tx)
+   return -EINVAL;
if (skb->ip_summed != CHECKSUM_PARTIAL || is_udplite)
return -EIO;
 
skb_shinfo(skb)->gso_size = cork->gso_size;
skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
+   goto csum_partial;
}
 
if (is_udplite)
@@ -1064,6 +1067,7 @@ static int udp_v6_send_skb(struct sk_buff *skb, struct flowi6 *fl6,
skb->ip_summed = CHECKSUM_NONE;
goto send;
} else if (skb->ip_summed == CHECKSUM_PARTIAL) { /* UDP hardware csum */
+csum_partial:
 		udp6_hwcsum_outgoing(sk, skb, &fl6->saddr, &fl6->daddr, len);
goto send;
} else
-- 
2.17.0.441.gb46fe60e1d-goog



Re: Representing cpu-port of a switch

2018-04-30 Thread Florian Fainelli
On 04/30/2018 06:06 AM, sk.syed2 wrote:
>> Again, pretty standard. What you probably meant is that for host destined or 
>> initiated traffic you must use that internal port to egress/ingress frames 
>> towards the other external ports.
>>
> Yes.
> 
>>
>>> Without tagging, we can't really use DSA, and hide the cpu/dsa port. So
>>> if we expose this cpu port as a interface with fixed-phy
>>> infrastructure does it create any problems?
>>
>> Well the fact that you don't have a tagging protocol does not really mean 
>> you cannot use DSA. You could create your own tagging format which in your 
>> case could be as simple as keeping the prepended DMA packet descriptor and 
>> have the parsing of that descriptor be done in a DSA tagger driver.
> This would need a HW change, won't it?
> 
>  Not that I would necessarily recommend that though, see below.
>>
> 
>>  DSA documentation says one
>>> cannot open a socket on cpu/dsa port and send/receive traffic. Is it
>>> fairly common to use internal/cpu port as a network interface- i.e,
>>> creating a socket and send/receive traffic?
>>
>> In the context of DSA, all external ports are represented by a network 
>> device. This means that the CPU/management port is only used to 
>> ingress/egress frames that include the tag which either the switch hardware 
>> inserts on its way to the host or conversely that the host must insert to 
>> have the switch do the appropriate switching operation. If you do not use 
>> tags and you still have a way to target specific external ports the same 
>> representation should happen and you do not want to expose the internal port 
>> because it will only be used to send/receive traffic from the external ports 
>> and it will not be used to send or receive traffic to itself (so to speak).
> Just like with DSA you should have the ability to create network
> devices that are per-port and you can use the HW provided information
> to deliver packets to the appropriate destination port, conversely
> send from the appropriate source port.
>>
>>
> The switch HW doesn't have any provision to target specific external
> port. I think I didn't make this clear. We would like to integrate a
> switch along with an endpoint into a single HW. Meaning we would like
> to use internal cpu port as any normal network port to send and
> receive traffic. The external network ports will not be used to
> send/receive normal traffic(except for control/mgmt frames). So, the
> two external front panel ports tied to phys are lets say eth1, eth2.

The only problem with that approach is that eth1 and eth2 are only
control interfaces, they cannot transfer any data, eth0 does that, see
below.

> The cpu port tied to DMAs is represented using fixed-link/always_on
> interface say ep0(endpoint). Now applications can open a socket and
> send/receive traffic from ep0. The switch does forwarding based on
> programmed CAM. Do you see any problem with this approach?

No, this is entirely similar to what we do in DSA with DSA_TAG_PROTO_NONE.

In premise, if you created a VLAN identifier for each of your front
panel port, you could emulate in SW what you do not have in HW which is
to target specific front panel ports.

> 
> 
> 
>>> One problem is how to report back network errors (like if both
>>> front panel ports are disconnected, the expectation is to bring this
>>> cpu port down?).
>>
>> The CPU port should be considered always UP and the external ports must have 
>> proper link notifications in place through PHYLIB/PHYLINK preferably. With 
>> link management in place the carrier state is properly managed and the 
>> network stack won't send traffic from a port that is not UP.
>>
>>> We also need to offload all the switch configuration to switch-dev. So
>>> the question is using switch-dev without DSA and representing a cpu
>>> port as a normal network interface would be ok?
>>
>> Using switchdev without DSA is absolutely okay, see rocker and mlxsw, but 
>> neither of those do represent their CPU/management port for the reasons 
>> outlined above that it is only used to convey traffic to/from other ports 
>> that have a proper network device representation.
>>
> May be this is something elementary, but what do you mean by proper
> network device representation that cpu port lacks?

CPU ports in these cases are not created as a separate net_device
instance, they exist in the HW and at the SW level because we need to
use them to direct traffic to/from specific front-panel port using a
specific DMA capability (e.g: a special set of bits in a DMA
descriptor). In your case, you don't have that ability in your HW, so
you must actually create a CPU net_device for applications to be able to
send traffic.
-- 
Florian


Re: [RFC net-next 0/5] Support for PHY test modes

2018-04-30 Thread Florian Fainelli
On 04/30/2018 09:40 AM, Andrew Lunn wrote:
>> Turning these tests on will typically result in the link partner
>> dropping the link with us, and the interface will be non-functional as
>> far as the data path is concerned (similar to an isolation mode). This
>> might warrant properly reporting that to user-space through e.g: a
>> private IFF_* value maybe?
> 
> Hi Florian
> 
> I've not looked at the code yet
> 
> Is it also necessary to kick off auto-neg again after the test has
> finished, in order to reestablish the link?

It would, yes. Right now there is a test mode exposed named "normal"
which really means: bring the PHY back to an operational state. This
state likely does not belong here; we would have to introduce flags
instead, such as start and stop of the test.

Please review the patches when you get a chance, because I suspect this
is not the only issue they have ;)
-- 
Florian


Re: [PATCH net-next v9 1/4] virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit

2018-04-30 Thread Samudrala, Sridhar


On 4/30/2018 12:03 AM, Jiri Pirko wrote:

Mon, Apr 30, 2018 at 04:47:03AM CEST, sridhar.samudr...@intel.com wrote:

On 4/28/2018 12:50 AM, Jiri Pirko wrote:

Fri, Apr 27, 2018 at 07:06:57PM CEST,sridhar.samudr...@intel.com  wrote:

This feature bit can be used by the hypervisor to indicate that the
virtio_net device should act as a standby for another device with the
same MAC address.

VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.

Signed-off-by: Sridhar Samudrala
---
drivers/net/virtio_net.c| 2 +-
include/uapi/linux/virtio_net.h | 3 +++
2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3b5991734118..51a085b1a242 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2999,7 +2999,7 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
VIRTIO_NET_F_CTRL_MAC_ADDR, \
VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-   VIRTIO_NET_F_SPEED_DUPLEX
+   VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY

This is not part of current qemu master (head 6f0c4706b35dead265509115ddbd2a8d1af516c1).
Where can I find the qemu code?

Also, I think it makes sense to push HW (qemu HW in this case) first
and only then the driver.

I had sent a qemu patch with a couple of earlier versions of this patchset.
Will include it when I send out v10.

The point was, don't you want to push it to qemu first? Did you at least
send RFC to qemu?


Yes. Here is the link to the RFC patch.
https://patchwork.ozlabs.org/patch/859521/





Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-30 Thread John Stoffel
> "Mikulas" == Mikulas Patocka  writes:

Mikulas> On Thu, 26 Apr 2018, John Stoffel wrote:

>> > "James" == James Bottomley  
>> > writes:
>> 
James> I may be an atypical developer but I'd rather have a root canal
James> than browse through menuconfig options.  The way to get people
James> to learn about new debugging options is to blog about it (or
James> write an lwn.net article) which google will find the next time
James> I ask it how I debug XXX.  Google (probably as a service to
James> humanity) rarely turns up Kconfig options in response to a
James> query.
>> 
>> I agree with James here.  Looking at the SLAB vs SLUB Kconfig entries
>> tells me *nothing* about why I should pick one or the other, as an
>> example.
>> 
>> John

Mikulas> I see your point - and I think the misunderstanding is this.

Thanks.

Mikulas> This patch is not really helping people to debug existing crashes. It is
Mikulas> not like "you get a crash" - "you google for some keywords" - "you get a
Mikulas> page that suggests to turn this option on" - "you turn it on and solve the
Mikulas> crash".

Mikulas> What this patch really does is that - it makes the kernel deliberately
Mikulas> crash in a situation when the code violates the specification, but it
Mikulas> would not crash otherwise or it would crash very rarely. It helps to
Mikulas> detect specification violations.

Mikulas> If the kernel developer (or tester) doesn't use this option, his buggy
Mikulas> code won't crash - and if it won't crash, he won't fix the bug or report
Mikulas> it. How is the user or developer supposed to learn about this option, if
Mikulas> he gets no crash at all?

So why do we make this a Kconfig option at all?  Just turn it on and
let it rip.  Now I also think that Linus has the right idea to not
just sprinkle BUG_ONs into the code, just dump and oops and keep going
if you can.  If it's a filesystem or a device, turn it read only so
that people notice right away.



Re: Request for stable 4.14.x inclusion: net: don't call update_pmtu unconditionally

2018-04-30 Thread Greg KH
On Fri, Apr 27, 2018 at 07:43:52PM +0100, Eddie Chapman wrote:
> On 27/04/18 19:07, Thomas Deutschmann wrote:
> > Hi Greg,
> > 
> > first, we need to cherry-pick another patch first:
> > >  From 52a589d51f1008f62569bf89e95b26221ee76690 Mon Sep 17 00:00:00 2001
> > > From: Xin Long 
> > > Date: Mon, 25 Dec 2017 14:43:58 +0800
> > > Subject: [PATCH] geneve: update skb dst pmtu on tx path
> > > 
> > > Commit a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path") has fixed
> > > a performance issue caused by the change of lower dev's mtu for vxlan.
> > > 
> > > The same thing needs to be done for geneve as well.
> > > 
> > > Note that geneve cannot adjust it's mtu according to lower dev's mtu
> > > when creating it. The performance is very low later when netperfing
> > > over it without fixing the mtu manually. This patch could also avoid
> > > this issue.
> > > 
> > > Signed-off-by: Xin Long 
> > > Signed-off-by: David S. Miller 
> 
> Oops, I completely missed that the coreos patch doesn't have the geneve hunk
> that is in the original 4.15 patch. I don't load the geneve module on my box
> hence why no problems surfaced on my machine.

The geneve hunk doesn't apply at all to the 4.14.y tree, so I think
someone has a messed up tree somewhere...

I'll go look into this now.

greg k-h


INFO: rcu detected stall in kfree_skbmem

2018-04-30 Thread syzbot

Hello,

syzbot found the following crash on:

HEAD commit:    5d1365940a68 Merge git://git.kernel.org/pub/scm/linux/kerne...

git tree:   net-next
console output: https://syzkaller.appspot.com/x/log.txt?id=5667997129637888
kernel config:  https://syzkaller.appspot.com/x/.config?id=-5947642240294114534

dashboard link: https://syzkaller.appspot.com/bug?extid=fc78715ba3b3257caf6a
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+fc78715ba3b3257ca...@syzkaller.appspotmail.com

INFO: rcu_sched self-detected stall on CPU
	1-...!: (1 GPs behind) idle=a3e/1/4611686018427387908 softirq=71980/71983 fqs=33

 (t=125000 jiffies g=39438 c=39437 q=958)
rcu_sched kthread starved for 124829 jiffies! g39438 c39437 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0

RCU grace-period kthread stack dump:
rcu_sched   R  running task23768 9  2 0x8000
Call Trace:
 context_switch kernel/sched/core.c:2848 [inline]
 __schedule+0x801/0x1e30 kernel/sched/core.c:3490
 schedule+0xef/0x430 kernel/sched/core.c:3549
 schedule_timeout+0x138/0x240 kernel/time/timer.c:1801
 rcu_gp_kthread+0x6b5/0x1940 kernel/rcu/tree.c:2231
 kthread+0x345/0x410 kernel/kthread.c:238
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:411
NMI backtrace for cpu 1
CPU: 1 PID: 20560 Comm: syz-executor4 Not tainted 4.16.0+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Call Trace:
 
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
 nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
 arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
 trigger_single_cpu_backtrace include/linux/nmi.h:156 [inline]
 rcu_dump_cpu_stacks+0x175/0x1c2 kernel/rcu/tree.c:1376
 print_cpu_stall kernel/rcu/tree.c:1525 [inline]
 check_cpu_stall.isra.61.cold.80+0x36c/0x59a kernel/rcu/tree.c:1593
 __rcu_pending kernel/rcu/tree.c:3356 [inline]
 rcu_pending kernel/rcu/tree.c:3401 [inline]
 rcu_check_callbacks+0x21b/0xad0 kernel/rcu/tree.c:2763
 update_process_times+0x2d/0x70 kernel/time/timer.c:1636
 tick_sched_handle+0x9f/0x180 kernel/time/tick-sched.c:173
 tick_sched_timer+0x45/0x130 kernel/time/tick-sched.c:1283
 __run_hrtimer kernel/time/hrtimer.c:1386 [inline]
 __hrtimer_run_queues+0x3e3/0x10a0 kernel/time/hrtimer.c:1448
 hrtimer_interrupt+0x286/0x650 kernel/time/hrtimer.c:1506
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1025 [inline]
 smp_apic_timer_interrupt+0x15d/0x710 arch/x86/kernel/apic/apic.c:1050
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:862
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:783 [inline]
RIP: 0010:kmem_cache_free+0xb3/0x2d0 mm/slab.c:3757
RSP: 0018:8801db105228 EFLAGS: 0282 ORIG_RAX: ff13
RAX: 0007 RBX: 8800b055c940 RCX: 11003b2345a5
RDX:  RSI: 8801d91a2d80 RDI: 0282
RBP: 8801db105248 R08: 8801d91a2cb8 R09: 0002
R10: 8801d91a2480 R11:  R12: 8801d9848e40
R13: 0282 R14: 85b7f27c R15: 
 kfree_skbmem+0x13c/0x210 net/core/skbuff.c:582
 __kfree_skb net/core/skbuff.c:642 [inline]
 kfree_skb+0x19d/0x560 net/core/skbuff.c:659
 enqueue_to_backlog+0x2fc/0xc90 net/core/dev.c:3968
 netif_rx_internal+0x14d/0xae0 net/core/dev.c:4181
 netif_rx+0xba/0x400 net/core/dev.c:4206
 loopback_xmit+0x283/0x741 drivers/net/loopback.c:91
 __netdev_start_xmit include/linux/netdevice.h:4087 [inline]
 netdev_start_xmit include/linux/netdevice.h:4096 [inline]
 xmit_one net/core/dev.c:3053 [inline]
 dev_hard_start_xmit+0x264/0xc10 net/core/dev.c:3069
 __dev_queue_xmit+0x2724/0x34c0 net/core/dev.c:3584
 dev_queue_xmit+0x17/0x20 net/core/dev.c:3617
 neigh_hh_output include/net/neighbour.h:472 [inline]
 neigh_output include/net/neighbour.h:480 [inline]
 ip6_finish_output2+0x134e/0x2810 net/ipv6/ip6_output.c:120
 ip6_finish_output+0x5fe/0xbc0 net/ipv6/ip6_output.c:154
 NF_HOOK_COND include/linux/netfilter.h:277 [inline]
 ip6_output+0x227/0x9b0 net/ipv6/ip6_output.c:171
 dst_output include/net/dst.h:444 [inline]
 NF_HOOK include/linux/netfilter.h:288 [inline]
 ip6_xmit+0xf51/0x23f0 net/ipv6/ip6_output.c:277
 sctp_v6_xmit+0x4a5/0x6b0 net/sctp/ipv6.c:225
 sctp_packet_transmit+0x26f6/0x3ba0 net/sctp/output.c:650
 sctp_outq_flush+0x1373/0x4370 net/sctp/outqueue.c:1197
 sctp_outq_uncork+0x6a/0x80 net/sctp/outqueue.c:776
 sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
 sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
 sctp_do_sm+0x596/0x7160 net/sctp/sm_sideeffect.c:1191
 sctp_generate_heartbeat_event+0x218/0x450 net/sctp/sm_sideeffect.c:406
 call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
 expire_timers 

INFO: rcu detected stall in kmem_cache_alloc_node_trace

2018-04-30 Thread syzbot

Hello,

syzbot found the following crash on:

HEAD commit:    17dec0a94915 Merge branch 'userns-linus' of git://git.kerne...

git tree:   net-next
console output: https://syzkaller.appspot.com/x/log.txt?id=6093051722203136
kernel config:  https://syzkaller.appspot.com/x/.config?id=-2735707888269579554

dashboard link: https://syzkaller.appspot.com/bug?extid=deec965c578bb9b81613
compiler:   gcc (GCC) 8.0.1 20180301 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+deec965c578bb9b81...@syzkaller.appspotmail.com

sctp: [Deprecated]: syz-executor3 (pid 10218) Use of int in max_burst socket option.

Use struct sctp_assoc_value instead
sctp: [Deprecated]: syz-executor3 (pid 10218) Use of int in max_burst socket option.

Use struct sctp_assoc_value instead
random: crng init done
INFO: rcu_sched self-detected stall on CPU
	0-: (120712 ticks this GP) idle=ac6/1/4611686018427387908 softirq=31693/31693 fqs=31173

 (t=125001 jiffies g=17039 c=17038 q=303419)
NMI backtrace for cpu 0
CPU: 0 PID: 10218 Comm: syz-executor3 Not tainted 4.16.0+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Call Trace:
 
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x1b9/0x29f lib/dump_stack.c:53
 nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
 nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
 arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
 trigger_single_cpu_backtrace include/linux/nmi.h:156 [inline]
 rcu_dump_cpu_stacks+0x175/0x1c2 kernel/rcu/tree.c:1376
 print_cpu_stall kernel/rcu/tree.c:1525 [inline]
 check_cpu_stall.isra.61.cold.80+0x36c/0x59a kernel/rcu/tree.c:1593
 __rcu_pending kernel/rcu/tree.c:3356 [inline]
 rcu_pending kernel/rcu/tree.c:3401 [inline]
 rcu_check_callbacks+0x21b/0xad0 kernel/rcu/tree.c:2763
 update_process_times+0x2d/0x70 kernel/time/timer.c:1636
 tick_sched_handle+0xa0/0x180 kernel/time/tick-sched.c:162
 tick_sched_timer+0x42/0x130 kernel/time/tick-sched.c:1170
 __run_hrtimer kernel/time/hrtimer.c:1349 [inline]
 __hrtimer_run_queues+0x3e3/0x10a0 kernel/time/hrtimer.c:1411
 hrtimer_interrupt+0x2f3/0x750 kernel/time/hrtimer.c:1469
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1025 [inline]
 smp_apic_timer_interrupt+0x15d/0x710 arch/x86/kernel/apic/apic.c:1050
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:862
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:783 [inline]
RIP: 0010:lock_is_held_type+0x18b/0x210 kernel/locking/lockdep.c:3960
RSP: 0018:8801db006400 EFLAGS: 0282 ORIG_RAX: ff12
RAX: dc00 RBX: 0282 RCX: 
RDX: 11162e55 RSI: 88b90c60 RDI: 0282
RBP: 8801db006420 R08: ed003b6046c3 R09: ed003b6046c2
R10: ed003b6046c2 R11: 8801db023613 R12: 8801b2f623c0
R13:  R14: 88009932bb00 R15: 
 lock_is_held include/linux/lockdep.h:344 [inline]
 rcu_read_lock_sched_held+0x108/0x120 kernel/rcu/update.c:117
 trace_kmalloc_node include/trace/events/kmem.h:100 [inline]
 kmem_cache_alloc_node_trace+0x34e/0x770 mm/slab.c:3652
 __do_kmalloc_node mm/slab.c:3669 [inline]
 __kmalloc_node_track_caller+0x33/0x70 mm/slab.c:3684
 __kmalloc_reserve.isra.38+0x3a/0xe0 net/core/skbuff.c:137
 __alloc_skb+0x14d/0x780 net/core/skbuff.c:205
 alloc_skb include/linux/skbuff.h:987 [inline]
 sctp_packet_transmit+0x45e/0x3ba0 net/sctp/output.c:585
 sctp_outq_flush+0x1373/0x4370 net/sctp/outqueue.c:1197
 sctp_outq_uncork+0x6a/0x80 net/sctp/outqueue.c:776
 sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
 sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
 sctp_do_sm+0x596/0x7160 net/sctp/sm_sideeffect.c:1191
 sctp_generate_heartbeat_event+0x218/0x450 net/sctp/sm_sideeffect.c:406
 call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
 expire_timers kernel/time/timer.c:1363 [inline]
 __run_timers+0x79e/0xc50 kernel/time/timer.c:1666
 run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
 __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1d1/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:525 [inline]
 smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:862
 
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:783 [inline]
RIP: 0010:console_unlock+0xcdf/0x1100 kernel/printk/printk.c:2403
RSP: 0018:8801946eec00 EFLAGS: 0212 ORIG_RAX: ff12
RAX: 0004 RBX: 0200 RCX: c90002ee8000
RDX: 4461 RSI: 815f3446 RDI: 0212
RBP: 8801946eed68 R08: 8801b2f62c38 R09: 0006
R10: 8801b2f623c0 R11:  R12: 
R13: 

Re: Performance regressions in TCP_STREAM tests in Linux 4.15 (and later)

2018-04-30 Thread Eric Dumazet


On 04/30/2018 09:36 AM, Eric Dumazet wrote:
> 
> 
> On 04/30/2018 09:14 AM, Ben Greear wrote:
>> On 04/27/2018 08:11 PM, Steven Rostedt wrote:
>>>
>>> We'd like this email archived in netdev list, but since netdev is
>>> notorious for blocking outlook email as spam, it didn't go through. So
>>> I'm replying here to help get it into the archives.
>>>
>>> Thanks!
>>>
>>> -- Steve
>>>
>>>
>>> On Fri, 27 Apr 2018 23:05:46 +
>>> Michael Wenig  wrote:
>>>
 As part of VMware's performance testing with the Linux 4.15 kernel,
 we identified CPU cost and throughput regressions when comparing to
 the Linux 4.14 kernel. The impacted test cases are mostly TCP_STREAM
 send tests when using small message sizes. The regressions are
 significant (up 3x) and were tracked down to be a side effect of Eric
 Dumazat's RB tree changes that went into the Linux 4.15 kernel.
 Further investigation showed our use of the TCP_NODELAY flag in
 conjunction with Eric's change caused the regressions to show and
 simply disabling TCP_NODELAY brought performance back to normal.
 Eric's change also resulted into significant improvements in our
 TCP_RR test cases.



 Based on these results, our theory is that Eric's change made the
 system overall faster (reduced latency) but as a side effect less
 aggregation is happening (with TCP_NODELAY) and that results in lower
 throughput. Previously even though TCP_NODELAY was set, system was
 slower and we still got some benefit of aggregation. Aggregation
 helps in better efficiency and higher throughput although it can
 increase the latency. If you are seeing a regression in your
 application throughput after this change, using TCP_NODELAY might
 help bring performance back however that might increase latency.
>>
>> I guess you mean _disabling_ TCP_NODELAY instead of _using_ TCP_NODELAY?
>>
> 
> Yeah, I guess auto-corking does not work as intended.

I would try the following patch :

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 44be7f43455e4aefde8db61e2d941a69abcc642a..c9d00ef54deca15d5760bcbe154001a96fa1e2a7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -697,7 +697,7 @@ static bool tcp_should_autocork(struct sock *sk, struct sk_buff *skb,
 {
 	return skb->len < size_goal &&
 	       sock_net(sk)->ipv4.sysctl_tcp_autocorking &&
-	       skb != tcp_write_queue_head(sk) &&
+	       !tcp_rtx_queue_empty(sk) &&
 	       refcount_read(&sk->sk_wmem_alloc) > skb->truesize;
 }
 

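For readers following along: the aggregation trade-off discussed above is
controlled per-socket via TCP_NODELAY. A minimal Python sketch of toggling
it, illustrative only:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# disable Nagle: small writes are sent immediately, so less aggregation
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0

# clear TCP_NODELAY again to allow Nagle/autocorking-style aggregation
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) == 0

s.close()
```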

Re: [PATCH bpf-next] bpf: relax constraints on formatting for eBPF helper documentation

2018-04-30 Thread Quentin Monnet
2018-04-30 17:33 UTC+0100 ~ Edward Cree 
> On 30/04/18 16:59, Quentin Monnet wrote:
>> The Python script used to parse and extract eBPF helpers documentation
>> from include/uapi/linux/bpf.h expects a very specific formatting for the
>> descriptions (single dots represent a space, '>' stands for a tab):
>>
>> /*
>>  ...
>>  *.int bpf_helper(list of arguments)
>>  *.>Description
>>  *.>>   Start of description
>>  *.>>   Another line of description
>>  *.>>   And yet another line of description
>>  *.>Return
>>  *.>>   0 on success, or a negative error in case of failure
>>  ...
>>  */
>>
>> This is too strict, and painful for developers who want to add
>> documentation for new helpers. Worse, it is extremely difficult to
>> check that the formatting is correct during reviews. Change the
>> format expected by the script and make it more flexible. The script now
>> works whether or not the initial space (right after the star) is
>> present, and accepts both tabs and white spaces (or a combination of
>> both) for indenting description sections and contents.
>>
>> Concretely, something like the following would now be supported:
>>
>> /*
>>  ...
>>  *int bpf_helper(list of arguments)
>>  *..Description
>>  *.>>   Start of description...
>>  *> >   Another line of description
>>  *..And yet another line of description
>>  *> Return
>>  *.>0 on success, or a negative error in case of failure
>>  ...
>>  */
>>
>> Signed-off-by: Quentin Monnet 
>> ---
>>  scripts/bpf_helpers_doc.py | 10 +-
>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
>> index 30ba0fee36e4..717547e6f0a6 100755
>> --- a/scripts/bpf_helpers_doc.py
>> +++ b/scripts/bpf_helpers_doc.py
>> @@ -87,7 +87,7 @@ class HeaderParser(object):
>>  #   - Same as above, with "const" and/or "struct" in front of type
>>  #   - "..." (undefined number of arguments, for bpf_trace_printk())
>>  # There is at least one term ("void"), and at most five arguments.
>> -p = re.compile('^ \* ((.+) \**\w+\((((const )?(struct )?(\w+|\.\.\.)( \**\w+)?)(, )?){1,5}\))$')
>> +p = re.compile('^ \* ?((.+) \**\w+\((((const )?(struct )?(\w+|\.\.\.)( \**\w+)?)(, )?){1,5}\))$')
> The proper coding style for such things is to go straight to tabs after
>  the star and not have the space.  So if we're going to make the script
>  flexible here (and leave coding style enforcement to other tools such
>  as checkpatch), maybe the regexen should just begin '^ \*\s+' and avoid
>  relying on counting indentation to delimit sections (e.g. scan for the
>  section headers like '^ \*\s+Description$' instead).

Thanks Edward! I agree it would be cleaner. However, with the current
format of the doc, I see two shortcomings.

- First we need a way to detect the end of a section. There is no
"Return" section for helper returning void, so we cannot rely on it to
end the "Description" section. And there is no delimiter to indicate the
end of the description of a given helper. We cannot assume that a string
matching a function definition, alone on its line, indicate the start of
a new helper (this is not the case). So as I see it, this would at least
require some delimiter between the descriptions of different functions
in bpf.h. I could add them if you think this is better.

- Also, we lose the possibility to further indent the text from the
description. Think about code snippets in descriptions: were we to
extract the lines with a regex such as / *\s+(.*)/, I see no way to get
the additional indent that should appear in the man page, if we do not
know what indent level was used for the helper description. I do not see
any simple workaround.

This being said, I am ready to bring whatever changes are needed to make
writing new helper doc easier, so I am open to suggestions if you have
workarounds for these or if the consensus is that the formatting should
be completely revised.

> Btw, leading '^' is unnecessary as re.match() is already implicitly
>  anchored at start-of-string.  (The trailing '$' are still needed.)

Oh, thanks! I'll fix that.

Quentin


Re: [PATCH net-next 1/1] inet_diag: fetch cong algo info when socket is destroyed

2018-04-30 Thread Jamal Hadi Salim

On 29/04/18 08:31 PM, David Miller wrote:


Well, two things:

1) The congestion control info is opt-in, meaning that the user gets
it in the dump if they ask for it.

This information is opt-in, because otherwise the dumps get really
large.

Therefore, emitting this stuff by default on destroys in a
non-starter.



There are two options that I investigated:
Add a setsockopt() for a new group that indicates "give me the congestion
info in addition", or add a similar knob at bind() time. Either of those
approaches would require bigger surgeries. If you think either of those
is reasonable i will work in that direction.

Note: Vegas adds 4 32-bit words; BBR 5 32-bit words; the congestion
name another 16B worst case.
In the larger scope of things that is very small extra data and saves
all the complexity of the other approaches.
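As a rough sanity check on the figures quoted above, the worst-case extra
payload per destroy event can be sketched as follows (the struct names are
illustrative stand-ins, not the real kernel layouts):

```c
#include <stdint.h>

/* 32-bit words quoted above: Vegas adds 4, BBR adds 5; the congestion
 * algorithm name is at most 16 bytes. Illustrative types only. */
struct vegas_words { uint32_t v[4]; };
struct bbr_words   { uint32_t v[5]; };
#define CA_NAME_MAX 16

static unsigned int worst_case_extra_bytes(void)
{
    unsigned int vegas = sizeof(struct vegas_words) + CA_NAME_MAX; /* 32 */
    unsigned int bbr   = sizeof(struct bbr_words)   + CA_NAME_MAX; /* 36 */
    return bbr > vegas ? bbr : vegas;
}
```

So the worst case lands around 36 bytes per event, which is the "very
small extra data" argument being made here.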


2) The TCP_TIME_WAIT test is not there for looks.  You need to add it
also to the destroy case, and guess what?  All the sockets you will
see will not pass that test.



The TCP_TIME_WAIT test makes sense for a live socket. This sock is
past that stage.


I'm not applying this, sorry.  I really think things are fine as-is, and
if you really truly want the congestion control information you can
ask for it while the socket is still alive, and is in the proper state
to sample the congestion control state before you kill it off.


I am avoiding the polling for scaling reasons. It worked fine for a
small number of sockets.

cheers,
jamal


[PATCH v2] ethtool: fix a potential missing-check bug

2018-04-30 Thread Wenwen Wang
In ethtool_get_rxnfc(), the object "info" is first copied from
user-space. If the FLOW_RSS flag is set in the member field flow_type of
"info" (and cmd is ETHTOOL_GRXFH), info needs to be copied again from
user-space because FLOW_RSS is newer and has a new definition, as mentioned
in the comment. However, given that the user data resides in user-space, a
malicious user can race to change the data after the first copy. By doing
so, the user can inject inconsistent data. For example, in the second
copy, the FLOW_RSS flag could be cleared in the field flow_type of "info".
In the following execution, "info" will be used in the function
ops->get_rxnfc(). Such inconsistent data can potentially lead to unexpected
information leakage since ops->get_rxnfc() will prepare various types of
data according to flow_type, and the prepared data will be eventually
copied to user-space. This inconsistent data may also cause undefined
behaviors based on how ops->get_rxnfc() is implemented.

This patch simply re-verifies the flow_type field of "info" after the
second copy. If the value is not as expected, an error code will be
returned.
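The double-fetch hazard described above reduces to re-validating, after the
second copy, the flag that was used to decide the copy size. A minimal
sketch of that check (the FLOW_RSS value and error code here are
illustrative, not taken from the kernel headers):

```c
#include <stdint.h>

#define FLOW_RSS   0x20000000u  /* illustrative flag value */
#define EINVAL_ERR (-22)        /* stand-in for -EINVAL */

/* first_flow_type comes from the first copy_from_user(); second_flow_type
 * from the second, larger copy. A racing writer may have cleared the flag
 * between the two fetches, so the second snapshot must be checked again. */
static int validate_second_fetch(uint32_t first_flow_type,
                                 uint32_t second_flow_type)
{
    if (!(first_flow_type & FLOW_RSS))
        return 0;               /* no second fetch was needed */
    if (!(second_flow_type & FLOW_RSS))
        return EINVAL_ERR;      /* inconsistent data: reject */
    return 0;
}
```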

Signed-off-by: Wenwen Wang 
---
 net/core/ethtool.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 03416e6..ba02f0d 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1032,6 +1032,11 @@ static noinline_for_stack int ethtool_get_rxnfc(struct 
net_device *dev,
info_size = sizeof(info);
if (copy_from_user(&info, useraddr, info_size))
return -EFAULT;
+   /* Since malicious users may modify the original data,
+* we need to check whether FLOW_RSS is still requested.
+*/
+   if (!(info.flow_type & FLOW_RSS))
+   return -EINVAL;
}
 
if (info.cmd == ETHTOOL_GRXCLSRLALL) {
-- 
2.7.4



Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit

2018-04-30 Thread Jesper Dangaard Brouer
On Thu, 26 Apr 2018 19:52:40 +0900
Toshiaki Makita  wrote:

> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:
> > On Tue, 24 Apr 2018 23:39:20 +0900
> > Toshiaki Makita  wrote:
> >   
> >> +static int veth_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
> >> +{
> >> +  struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
> >> +  int headroom = frame->data - (void *)frame;
> >> +  struct net_device *rcv;
> >> +  int err = 0;
> >> +
> >> +  rcv = rcu_dereference(priv->peer);
> >> +  if (unlikely(!rcv))
> >> +  return -ENXIO;
> >> +
> >> +  rcv_priv = netdev_priv(rcv);
> >> +  /* xdp_ring is initialized on receive side? */
> >> +  if (rcu_access_pointer(rcv_priv->xdp_prog)) {
> >> +  err = xdp_ok_fwd_dev(rcv, frame->len);
> >> +  if (unlikely(err))
> >> +  return err;
> >> +
> >> +  err = veth_xdp_enqueue(rcv_priv, veth_xdp_to_ptr(frame));
> >> +  } else {
> >> +  struct sk_buff *skb;
> >> +
> >> +  skb = veth_build_skb(frame, headroom, frame->len, 0);
> >> +  if (unlikely(!skb))
> >> +  return -ENOMEM;
> >> +
> >> +  /* Get page ref in case skb is dropped in netif_rx.
> >> +   * The caller is responsible for freeing the page on error.
> >> +   */
> >> +  get_page(virt_to_page(frame->data));  
> > 
> > I'm not sure you can make this assumption, that xdp_frames coming from
> > another device driver uses a refcnt based memory model. But maybe I'm
> > confused, as this looks like an SKB receive path, but in the
> > ndo_xdp_xmit().  
> 
> I find this path similar to cpumap, which creates skb from redirected
> xdp frame. Once it is converted to skb, skb head is freed by
> page_frag_free, so anyway I needed to get the refcount here regardless
> of memory model.

Yes I know, I wrote cpumap ;-)

First of all, I don't want to see such xdp_frame to SKB conversion code
in every driver.  Because that increases the chances of errors.  And
when looking at the details, then it seems that you have made the
mistake of making it possible to leak xdp_frame info to the SKB (which
cpumap takes into account).

Second, I think the refcnt scheme here is wrong. The xdp_frame should
be "owned" by XDP and have the proper refcnt to deliver it directly to
the network stack.

Third, if we choose that we want a fallback, in case XDP is not enabled
on the egress dev (but it has an ndo_xdp_xmit), then it should be placed
in the generic/core code.  E.g. __bpf_tx_xdp_map() could look at the
return code from dev->netdev_ops->ndo_xdp() and create an SKB.  (Hint,
this would make it easy to implement TX bulking towards the dev).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+ int buflen)
+{
+   struct sk_buff *skb;
+
+   if (!buflen) {
+   buflen = SKB_DATA_ALIGN(headroom + len) +
+SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+   }
+   skb = build_skb(head, buflen);
+   if (!skb)
+   return NULL;
+
+   skb_reserve(skb, headroom);
+   skb_put(skb, len);
+
+   return skb;
+}



[PATCH bpf-next] bpf/verifier: enable ctx + const + 0.

2018-04-30 Thread William Tu
The existing verifier does not allow 'ctx + const + const'.  However, due to
compiler optimization, there is a case where the BPF compiler generates
'ctx + const + 0', as shown below:

  599: (1d) if r2 == r4 goto pc+2
   R0=inv(id=0) R1=ctx(id=0,off=40,imm=0)
   R2=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
   R3=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R4=inv0
   R6=ctx(id=0,off=0,imm=0) R7=inv2
  600: (bf) r1 = r6 // r1 is ctx
  601: (07) r1 += 36// r1 has offset 36
  602: (61) r4 = *(u32 *)(r1 +0)// r1 + 0
  dereference of modified ctx ptr R1 off=36+0, ctx+const is allowed,
  ctx+const+const is not

The reason the BPF backend generates this code is an optimization like
the following, explained by Yonghong:
if (...)
*(ctx + 60)
else
*(ctx + 56)

The compiler translates it to
if (...)
   ptr = ctx + 60
else
   ptr = ctx + 56
*(ptr + 0)

So the load through ptr becomes an example of 'ctx + const + 0'.  This
patch enables support for this case.
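The transformation Yonghong describes can be reproduced in plain C: both
branches compute an offset from the same base pointer, and the compiler
sinks the dereference into a single '*(ptr + 0)' load. A hedged stand-in
(field names and padding are illustrative, not from any real ctx struct):

```c
#include <stdint.h>

/* pad[14] puts f56 at byte offset 56 and f60 at byte offset 60,
 * mirroring the ctx+56 / ctx+60 offsets in the example above. */
struct fake_ctx {
    int32_t pad[14];
    int32_t f56;
    int32_t f60;
};

static int32_t pick_field(struct fake_ctx *c, int cond)
{
    /* ptr = ctx + 60 or ptr = ctx + 56 ... */
    int32_t *ptr = cond ? &c->f60 : &c->f56;
    return *ptr;  /* ... followed by one shared load: *(ptr + 0) */
}
```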

Fixes: f8ddadc4db6c7 ("Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Cc: Yonghong Song 
Signed-off-by: Yifeng Sun 
Signed-off-by: William Tu 
---
 kernel/bpf/verifier.c   |  2 +-
 tools/testing/selftests/bpf/test_verifier.c | 13 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 712d8655e916..c9a791b9cf2a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1638,7 +1638,7 @@ static int check_mem_access(struct bpf_verifier_env *env, 
int insn_idx, u32 regn
/* ctx accesses must be at a fixed offset, so that we can
 * determine what type of data were returned.
 */
-   if (reg->off) {
+   if (reg->off && off != reg->off) {
verbose(env,
"dereference of modified ctx ptr R%d off=%d+%d, 
ctx+const is allowed, ctx+const+const is not\n",
regno, reg->off, off - reg->off);
diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 1acafe26498b..95ad5d5723ae 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -8452,6 +8452,19 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},
{
+   "arithmetic ops make PTR_TO_CTX + const + 0 valid",
+   .insns = {
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_1,
+ offsetof(struct __sk_buff, data) -
+ offsetof(struct __sk_buff, mark)),
+   BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   },
+   {
"pkt_end - pkt_start is allowed",
.insns = {
BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
-- 
2.7.4



[net-next 8/9] i40e/i40evf: take into account queue map from vf when handling queues

2018-04-30 Thread Jeff Kirsher
From: Harshitha Ramamurthy 

The expectation of the ops VIRTCHNL_OP_ENABLE_QUEUES and
VIRTCHNL_OP_DISABLE_QUEUES is that the queue map sent by
the VF is taken into account when enabling/disabling
queues in the VF VSI. This patch makes sure that happens.

By breaking out the individual queue set up functions so
that they can be called directly from the i40e_virtchnl_pf.c
file, only the queues as specified by the queue bit map that
accompanies the enable/disable queues ops will be handled.

Signed-off-by: Harshitha Ramamurthy 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |  3 +
 drivers/net/ethernet/intel/i40e/i40e_main.c| 40 +
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 69 +-
 3 files changed, 99 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index 1d59ab6ca90f..7a80652e2500 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -987,6 +987,9 @@ void i40e_service_event_schedule(struct i40e_pf *pf);
 void i40e_notify_client_of_vf_msg(struct i40e_vsi *vsi, u32 vf_id,
  u8 *msg, u16 len);
 
+int i40e_control_wait_tx_q(int seid, struct i40e_pf *pf, int pf_q, bool is_xdp,
+  bool enable);
+int i40e_control_wait_rx_q(struct i40e_pf *pf, int pf_q, bool enable);
 int i40e_vsi_start_rings(struct i40e_vsi *vsi);
 void i40e_vsi_stop_rings(struct i40e_vsi *vsi);
 void i40e_vsi_stop_rings_no_wait(struct  i40e_vsi *vsi);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 0babde10fa15..b500bbf6c43f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -4235,8 +4235,8 @@ static void i40e_control_tx_q(struct i40e_pf *pf, int 
pf_q, bool enable)
  * @is_xdp: true if the queue is used for XDP
  * @enable: start or stop the queue
  **/
-static int i40e_control_wait_tx_q(int seid, struct i40e_pf *pf, int pf_q,
- bool is_xdp, bool enable)
+int i40e_control_wait_tx_q(int seid, struct i40e_pf *pf, int pf_q,
+  bool is_xdp, bool enable)
 {
int ret;
 
@@ -4281,7 +4281,6 @@ static int i40e_vsi_control_tx(struct i40e_vsi *vsi, bool 
enable)
if (ret)
break;
}
-
return ret;
 }
 
@@ -4320,9 +4319,9 @@ static int i40e_pf_rxq_wait(struct i40e_pf *pf, int pf_q, 
bool enable)
  * @pf_q: the PF queue to configure
  * @enable: start or stop the queue
  *
- * This function enables or disables a single queue. Note that any delay
- * required after the operation is expected to be handled by the caller of
- * this function.
+ * This function enables or disables a single queue. Note that
+ * any delay required after the operation is expected to be
+ * handled by the caller of this function.
  **/
 static void i40e_control_rx_q(struct i40e_pf *pf, int pf_q, bool enable)
 {
@@ -4351,6 +4350,30 @@ static void i40e_control_rx_q(struct i40e_pf *pf, int 
pf_q, bool enable)
wr32(hw, I40E_QRX_ENA(pf_q), rx_reg);
 }
 
+/**
+ * i40e_control_wait_rx_q
+ * @pf: the PF structure
+ * @pf_q: queue being configured
+ * @enable: start or stop the rings
+ *
+ * This function enables or disables a single queue along with waiting
+ * for the change to finish. The caller of this function should handle
+ * the delays needed in the case of disabling queues.
+ **/
+int i40e_control_wait_rx_q(struct i40e_pf *pf, int pf_q, bool enable)
+{
+   int ret = 0;
+
+   i40e_control_rx_q(pf, pf_q, enable);
+
+   /* wait for the change to finish */
+   ret = i40e_pf_rxq_wait(pf, pf_q, enable);
+   if (ret)
+   return ret;
+
+   return ret;
+}
+
 /**
  * i40e_vsi_control_rx - Start or stop a VSI's rings
  * @vsi: the VSI being configured
@@ -4363,10 +4386,7 @@ static int i40e_vsi_control_rx(struct i40e_vsi *vsi, 
bool enable)
 
pf_q = vsi->base_queue;
for (i = 0; i < vsi->num_queue_pairs; i++, pf_q++) {
-   i40e_control_rx_q(pf, pf_q, enable);
-
-   /* wait for the change to finish */
-   ret = i40e_pf_rxq_wait(pf, pf_q, enable);
+   ret = i40e_control_wait_rx_q(pf, pf_q, enable);
if (ret) {
dev_info(&pf->pdev->dev,
 "VSI seid %d Rx ring %d %sable timeout\n",
diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 2ceea63cc6cf..c6d24eaede18 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -2153,6 +2153,51 @@ static int 

[net-next 9/9] i40e: use %pI4b instead of byte swapping before dev_err

2018-04-30 Thread Jeff Kirsher
From: Jacob Keller 

Fix warnings regarding restricted __be32 type usage by strictly
specifying the type of the ipv4 address being printed in the dev_err
statement.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index b500bbf6c43f..c8659fbd7111 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7213,8 +7213,7 @@ static int i40e_parse_cls_flower(struct i40e_vsi *vsi,
if (mask->dst == cpu_to_be32(0x)) {
field_flags |= I40E_CLOUD_FIELD_IIP;
} else {
-   mask->dst = be32_to_cpu(mask->dst);
-   dev_err(&pf->pdev->dev, "Bad ip dst mask 
%pI4\n",
+   dev_err(&pf->pdev->dev, "Bad ip dst mask 
%pI4b\n",
&mask->dst);
return I40E_ERR_CONFIG;
}
@@ -7224,8 +7223,7 @@ static int i40e_parse_cls_flower(struct i40e_vsi *vsi,
if (mask->src == cpu_to_be32(0x)) {
field_flags |= I40E_CLOUD_FIELD_IIP;
} else {
-   mask->src = be32_to_cpu(mask->src);
-   dev_err(&pf->pdev->dev, "Bad ip src mask 
%pI4\n",
+   dev_err(&pf->pdev->dev, "Bad ip src mask 
%pI4b\n",
&mask->src);
return I40E_ERR_CONFIG;
}
-- 
2.14.3



[net-next 7/9] i40e: avoid overflow in i40e_ptp_adjfreq()

2018-04-30 Thread Jeff Kirsher
From: Jacob Keller 

When operating at 1GbE, the base incval for the PTP clock is so large
that multiplying it by numbers close to the max_adj can overflow the
u64.

Rather than attempting to limit the max_adj to a value small enough to
avoid overflow, instead calculate the incvalue adjustment based on the
40GbE incvalue, and then multiply that by the scaling factor for the
link speed.

This sacrifices a small amount of precision in the adjustment but we
avoid erratic behavior of the clock due to the overflow caused if ppb is
very near the maximum adjustment.
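The arithmetic can be checked in isolation. With the 40GbE base incval of
0x0100000000 and a 1GbE link-speed multiplier of 20 (both from the patch),
multiplying the per-speed incval by a ppb value near 10^9 exceeds 2^64,
while adjusting first and scaling afterwards stays in range (the max_adj
value used below is an assumption for illustration):

```c
#include <stdint.h>

#define PTP_40GB_INCVAL 0x0100000000ULL  /* base incval from the patch */
#define PTP_1GB_MULT    20ULL            /* link-speed multiplier */
#define MAX_PPB         999999999ULL     /* assumed maximum adjustment */

/* Old scheme: scale the (large) per-speed incval by ppb first.
 * Returns nonzero if incval * ppb would not fit in a u64. */
static int old_way_overflows(uint64_t ppb)
{
    uint64_t incval_1g = PTP_40GB_INCVAL * PTP_1GB_MULT;
    return ppb != 0 && incval_1g > UINT64_MAX / ppb;
}

/* New scheme: compute the adjustment on the 40GbE incval, divide by 1e9,
 * then apply the link-speed multiplier; no intermediate exceeds 64 bits. */
static uint64_t new_way_adj(uint64_t ppb, uint64_t mult)
{
    uint64_t diff = (PTP_40GB_INCVAL * ppb) / 1000000000ULL;
    return (PTP_40GB_INCVAL + diff) * mult;
}
```

At 1GbE the old product is roughly 8.6e19, past the u64 limit of about
1.8e19, which is exactly the erratic-clock case the patch avoids.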

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_ptp.c | 41 --
 2 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index 70d369e9139c..1d59ab6ca90f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -586,7 +586,7 @@ struct i40e_pf {
unsigned long ptp_tx_start;
struct hwtstamp_config tstamp_config;
struct mutex tmreg_lock; /* Used to protect the SYSTIME registers. */
-   u64 ptp_base_adj;
+   u32 ptp_adj_mult;
u32 tx_hwtstamp_timeouts;
u32 tx_hwtstamp_skipped;
u32 rx_hwtstamp_cleared;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ptp.c 
b/drivers/net/ethernet/intel/i40e/i40e_ptp.c
index 43d7c44d6d9f..aa3daec2049d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ptp.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ptp.c
@@ -16,9 +16,9 @@
  * At 1Gb link, the period is multiplied by 20. (32ns)
  * 1588 functionality is not supported at 100Mbps.
  */
-#define I40E_PTP_40GB_INCVAL 0x0100000000ULL
-#define I40E_PTP_10GB_INCVAL 0x0333333333ULL
-#define I40E_PTP_1GB_INCVAL  0x2000000000ULL
+#define I40E_PTP_40GB_INCVAL   0x0100000000ULL
+#define I40E_PTP_10GB_INCVAL_MULT  2
+#define I40E_PTP_1GB_INCVAL_MULT   20
 
 #define I40E_PRTTSYN_CTL1_TSYNTYPE_V1  BIT(I40E_PRTTSYN_CTL1_TSYNTYPE_SHIFT)
 #define I40E_PRTTSYN_CTL1_TSYNTYPE_V2  (2 << \
@@ -106,17 +106,24 @@ static int i40e_ptp_adjfreq(struct ptp_clock_info *ptp, 
s32 ppb)
ppb = -ppb;
}
 
-   smp_mb(); /* Force any pending update before accessing. */
-   adj = READ_ONCE(pf->ptp_base_adj);
-
-   freq = adj;
+   freq = I40E_PTP_40GB_INCVAL;
freq *= ppb;
diff = div_u64(freq, 1000000000ULL);
 
if (neg_adj)
-   adj -= diff;
+   adj = I40E_PTP_40GB_INCVAL - diff;
else
-   adj += diff;
+   adj = I40E_PTP_40GB_INCVAL + diff;
+
+   /* At some link speeds, the base incval is so large that directly
+* multiplying by ppb would result in arithmetic overflow even when
+* using a u64. Avoid this by instead calculating the new incval
+* always in terms of the 40GbE clock rate and then multiplying by the
+* link speed factor afterwards. This does result in slightly lower
+* precision at lower link speeds, but it is fairly minor.
+*/
+   smp_mb(); /* Force any pending update before accessing. */
+   adj *= READ_ONCE(pf->ptp_adj_mult);
 
wr32(hw, I40E_PRTTSYN_INC_L, adj & 0xFFFFFFFF);
wr32(hw, I40E_PRTTSYN_INC_H, adj >> 32);
@@ -438,6 +445,7 @@ void i40e_ptp_set_increment(struct i40e_pf *pf)
struct i40e_link_status *hw_link_info;
struct i40e_hw *hw = &pf->hw;
u64 incval;
+   u32 mult;
 
hw_link_info = >phy.link_info;
 
@@ -445,10 +453,10 @@ void i40e_ptp_set_increment(struct i40e_pf *pf)
 
switch (hw_link_info->link_speed) {
case I40E_LINK_SPEED_10GB:
-   incval = I40E_PTP_10GB_INCVAL;
+   mult = I40E_PTP_10GB_INCVAL_MULT;
break;
case I40E_LINK_SPEED_1GB:
-   incval = I40E_PTP_1GB_INCVAL;
+   mult = I40E_PTP_1GB_INCVAL_MULT;
break;
case I40E_LINK_SPEED_100MB:
{
@@ -459,15 +467,20 @@ void i40e_ptp_set_increment(struct i40e_pf *pf)
 "1588 functionality is not supported at 100 
Mbps. Stopping the PHC.\n");
warn_once++;
}
-   incval = 0;
+   mult = 0;
break;
}
case I40E_LINK_SPEED_40GB:
default:
-   incval = I40E_PTP_40GB_INCVAL;
+   mult = 1;
break;
}
 
+   /* The increment value is calculated by taking the base 40GbE incvalue
+* and multiplying it by a factor based on the link speed.
+*/
+   incval = I40E_PTP_40GB_INCVAL * mult;
+
/* Write the new increment value into the increment register. The
   

Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread Soheil Hassas Yeganeh
On Mon, Apr 30, 2018 at 12:10 PM, David Miller  wrote:
> From: Eric Dumazet 
> Date: Mon, 30 Apr 2018 09:01:47 -0700
>
>> TCP sockets are read by a single thread really (or synchronized
>> threads), or garbage is ensured, regardless of how the kernel
>> ensures locking while reporting "queue length"
>
> Whatever applications "typically do", we should never return
> garbage, and that is what this code allowing to happen.
>
> Everything else in recvmsg() operates on state under the proper socket
> lock, to ensure consistency.
>
> The only reason we are releasing the socket lock first it to make sure
> the backlog is processed and we have the most update information
> available.
>
> It seems like one is striving for correctness and better accuracy, no?
> :-)
>
> Look, this can be fixed really simply.  And if you are worried about
> unbounded loops if two apps maliciously do recvmsg() in parallel,
> then don't even loop, just fallback to full socket locking and make
> the "non-typical" application pay the price:
>
> tmp1 = A;
> tmp2 = B;
> barrier();
> tmp3 = A;
> if (unlikely(tmp1 != tmp3)) {
> lock_sock(sk);
> tmp1 = A;
> tmp2 = B;
> release_sock(sk);
> }
>
> I'm seriously not applying the patch as-is, sorry.  This issue
> must be addressed somehow.

Thank you David for the suggestion. Sure, I'll send a V3 with what you
suggested above.

Thanks,
Soheil
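Dave's retry-then-lock pattern above can be sketched as a self-contained
user-space analogue; lock_sock/release_sock here are stub spinlock helpers
standing in for the kernel's socket lock, and A/B are the two fields that
must be read consistently:

```c
#include <stdatomic.h>

static _Atomic long A, B;
static atomic_flag sk_lock = ATOMIC_FLAG_INIT;

static void lock_sock(void)    { while (atomic_flag_test_and_set(&sk_lock)) ; }
static void release_sock(void) { atomic_flag_clear(&sk_lock); }

/* Read A and B consistently: re-read A and fall back to the full lock
 * only if a writer raced between the two reads. */
static void read_consistent(long *a, long *b)
{
    long tmp1 = atomic_load(&A);
    long tmp2 = atomic_load(&B);
    long tmp3 = atomic_load(&A);

    if (tmp1 != tmp3) {         /* unlikely: a writer interfered */
        lock_sock();
        tmp1 = atomic_load(&A);
        tmp2 = atomic_load(&B);
        release_sock();
    }
    *a = tmp1;
    *b = tmp2;
}
```

The common, uncontended path never takes the lock; only the "non-typical"
racing reader pays the locking price, which is the trade-off being
proposed.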


[net-next 0/9][pull request] 40GbE Intel Wired LAN Driver Updates 2018-04-30

2018-04-30 Thread Jeff Kirsher
This series contains updates to i40e and i40evf only.

Jia-Ju Bai replaces an instance of GFP_ATOMIC with GFP_KERNEL, since
i40evf is not in atomic context when i40evf_add_vlan() is called.

Jake cleans up function header comments to ensure that the function
parameter comments actually match the function parameters.  Fixed a
possible overflow error in the PTP clock code.  Fixed warnings regarding
restricted __be32 type usage.

Mariusz fixes the reading of the LLDP configuration, which moves from
using relative values to calculating the absolute address.

Jakub adds a check for 10G LR mode for i40e.

Paweł fixes an issue where changing the MTU would turn on TSO, GSO and
GRO.

Alex fixes a couple of issues with the UDP tunnel filter configuration.
First being that the tunnels did not have mutual exclusion in place to
prevent a race condition between a user request to add/remove a port and
an update.  The second issue was we were deleting filters that were not
associated with the actual filter we wanted to delete.

Harshitha ensures that the queue map sent by the VF is taken into
account when enabling/disabling queues in the VF VSI.

The following are changes since commit 76c2a96d42ca3bdac12c463ff27fec3bb2982e3f:
  liquidio: fix spelling mistake: "mac_tx_multi_collison" -> 
"mac_tx_multi_collision"
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Alexander Duyck (1):
  i40e: Fix multiple issues with UDP tunnel offload filter configuration

Harshitha Ramamurthy (1):
  i40e/i40evf: take into account queue map from vf when handling queues

Jacob Keller (3):
  i40e/i40evf: cleanup incorrect function doxygen comments
  i40e: avoid overflow in i40e_ptp_adjfreq()
  i40e: use %pI4b instead of byte swapping before dev_err

Jakub Pawlak (1):
  i40e: Add advertising 10G LR mode

Jia-Ju Bai (1):
  i40evf: Replace GFP_ATOMIC with GFP_KERNEL in i40evf_add_vlan

Mariusz Stachura (1):
  i40e: fix reading LLDP configuration

Paweł Jabłoński (1):
  i40evf: Fix turning TSO, GSO and GRO on after

 drivers/net/ethernet/intel/i40e/i40e.h |   7 +-
 drivers/net/ethernet/intel/i40e/i40e_client.c  |   6 +-
 drivers/net/ethernet/intel/i40e/i40e_common.c  |  37 +++---
 drivers/net/ethernet/intel/i40e/i40e_dcb.c |  91 --
 drivers/net/ethernet/intel/i40e/i40e_dcb_nl.c  |  11 +-
 drivers/net/ethernet/intel/i40e/i40e_debugfs.c |   8 +-
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |  28 +++--
 drivers/net/ethernet/intel/i40e/i40e_hmc.c |   1 -
 drivers/net/ethernet/intel/i40e/i40e_main.c| 134 -
 drivers/net/ethernet/intel/i40e/i40e_nvm.c |   1 +
 drivers/net/ethernet/intel/i40e/i40e_ptp.c |  45 ---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|   6 +-
 drivers/net/ethernet/intel/i40e/i40e_type.h|   8 +-
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  85 +++--
 drivers/net/ethernet/intel/i40evf/i40e_common.c|   1 +
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c  |   4 +-
 drivers/net/ethernet/intel/i40evf/i40e_type.h  |  10 +-
 drivers/net/ethernet/intel/i40evf/i40evf_client.c  |   4 +-
 drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c |   7 +-
 drivers/net/ethernet/intel/i40evf/i40evf_main.c|  25 +++-
 .../net/ethernet/intel/i40evf/i40evf_virtchnl.c|  11 +-
 21 files changed, 401 insertions(+), 129 deletions(-)

-- 
2.14.3



[net-next 4/9] i40e: Add advertising 10G LR mode

2018-04-30 Thread Jeff Kirsher
From: Jakub Pawlak 

Advertising the 10G LR mode should be possible,
but the check for it is missing in i40e_set_link_ksettings().
This patch adds a check for the 10000baseLR_Full
flag for 10G modes.

Signed-off-by: Jakub Pawlak 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index c1bbfb913e49..fc6a5eef141c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -953,7 +953,9 @@ static int i40e_set_link_ksettings(struct net_device 
*netdev,
ethtool_link_ksettings_test_link_mode(ks, advertising,
  10000baseCR_Full) ||
ethtool_link_ksettings_test_link_mode(ks, advertising,
- 10000baseSR_Full))
+ 10000baseSR_Full) ||
+   ethtool_link_ksettings_test_link_mode(ks, advertising,
+ 10000baseLR_Full))
config.link_speed |= I40E_LINK_SPEED_10GB;
if (ethtool_link_ksettings_test_link_mode(ks, advertising,
  20000baseKR2_Full))
-- 
2.14.3



[net-next 1/9] i40evf: Replace GFP_ATOMIC with GFP_KERNEL in i40evf_add_vlan

2018-04-30 Thread Jeff Kirsher
From: Jia-Ju Bai 

i40evf_add_vlan() is never called in atomic context.

i40evf_add_vlan() is only called by i40evf_vlan_rx_add_vid(),
which is only set as ".ndo_vlan_rx_add_vid" in struct net_device_ops.
".ndo_vlan_rx_add_vid" is not called in atomic context.

Despite never getting called from atomic context,
i40evf_add_vlan() calls kzalloc() with GFP_ATOMIC,
which does not sleep for allocation.
GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
which can sleep and improves the likelihood of successful allocation.

This was found by a static analysis tool named DCNS that I wrote,
and I have also checked it manually.

Signed-off-by: Jia-Ju Bai 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 97cda4a8f8e0..8f775a82e2fa 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -681,7 +681,7 @@ i40evf_vlan_filter *i40evf_add_vlan(struct i40evf_adapter 
*adapter, u16 vlan)
 
f = i40evf_find_vlan(adapter, vlan);
if (!f) {
-   f = kzalloc(sizeof(*f), GFP_ATOMIC);
+   f = kzalloc(sizeof(*f), GFP_KERNEL);
if (!f)
goto clearout;
 
-- 
2.14.3



[net-next 2/9] i40e/i40evf: cleanup incorrect function doxygen comments

2018-04-30 Thread Jeff Kirsher
From: Jacob Keller 

Recent versions of the Linux kernel now warn about incorrect parameter
definitions for function comments. Fix up several function comments to
correctly reflect the current function arguments. This cleans up the
warnings and helps ensure our documentation is accurate.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_client.c  |  6 ++--
 drivers/net/ethernet/intel/i40e/i40e_common.c  | 37 +-
 drivers/net/ethernet/intel/i40e/i40e_dcb_nl.c  | 11 ---
 drivers/net/ethernet/intel/i40e/i40e_debugfs.c |  8 ++---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 24 --
 drivers/net/ethernet/intel/i40e/i40e_hmc.c |  1 -
 drivers/net/ethernet/intel/i40e/i40e_main.c| 22 +
 drivers/net/ethernet/intel/i40e/i40e_nvm.c |  1 +
 drivers/net/ethernet/intel/i40e/i40e_ptp.c |  4 +--
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|  6 ++--
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 16 +-
 drivers/net/ethernet/intel/i40evf/i40e_common.c|  1 +
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c  |  4 ++-
 drivers/net/ethernet/intel/i40evf/i40evf_client.c  |  4 +--
 drivers/net/ethernet/intel/i40evf/i40evf_ethtool.c |  7 ++--
 drivers/net/ethernet/intel/i40evf/i40evf_main.c|  5 ++-
 .../net/ethernet/intel/i40evf/i40evf_virtchnl.c| 11 +--
 17 files changed, 95 insertions(+), 73 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_client.c 
b/drivers/net/ethernet/intel/i40e/i40e_client.c
index 2041757f948c..5f3b8b9ff511 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_client.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_client.c
@@ -40,7 +40,7 @@ static struct i40e_ops i40e_lan_ops = {
 /**
  * i40e_client_get_params - Get the params that can change at runtime
  * @vsi: the VSI with the message
- * @param: clinet param struct
+ * @params: client param struct
  *
  **/
 static
@@ -566,7 +566,7 @@ static int i40e_client_virtchnl_send(struct i40e_info *ldev,
  * i40e_client_setup_qvlist
  * @ldev: pointer to L2 context.
  * @client: Client pointer.
- * @qv_info: queue and vector list
+ * @qvlist_info: queue and vector list
  *
  * Return 0 on success or < 0 on error
  **/
@@ -641,7 +641,7 @@ static int i40e_client_setup_qvlist(struct i40e_info *ldev,
  * i40e_client_request_reset
  * @ldev: pointer to L2 context.
  * @client: Client pointer.
- * @level: reset level
+ * @reset_level: reset level
  **/
 static void i40e_client_request_reset(struct i40e_info *ldev,
  struct i40e_client *client,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c 
b/drivers/net/ethernet/intel/i40e/i40e_common.c
index 6f8fd70d606a..eb2d1530d331 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -1671,6 +1671,8 @@ enum i40e_status_code i40e_aq_set_phy_config(struct 
i40e_hw *hw,
 /**
  * i40e_set_fc
  * @hw: pointer to the hw struct
+ * @aq_failures: buffer to return AdminQ failure information
+ * @atomic_restart: whether to enable atomic link restart
  *
  * Set the requested flow control mode using set_phy_config.
  **/
@@ -2807,8 +2809,8 @@ i40e_status i40e_aq_remove_macvlan(struct i40e_hw *hw, 
u16 seid,
  * @mr_list: list of mirrored VSI SEIDs or VLAN IDs
  * @cmd_details: pointer to command details structure or NULL
  * @rule_id: Rule ID returned from FW
- * @rule_used: Number of rules used in internal switch
- * @rule_free: Number of rules free in internal switch
+ * @rules_used: Number of rules used in internal switch
+ * @rules_free: Number of rules free in internal switch
  *
  * Add/Delete a mirror rule to a specific switch. Mirror rules are supported 
for
  * VEBs/VEPA elements only
@@ -2868,8 +2870,8 @@ static i40e_status i40e_mirrorrule_op(struct i40e_hw *hw,
  * @mr_list: list of mirrored VSI SEIDs or VLAN IDs
  * @cmd_details: pointer to command details structure or NULL
  * @rule_id: Rule ID returned from FW
- * @rule_used: Number of rules used in internal switch
- * @rule_free: Number of rules free in internal switch
+ * @rules_used: Number of rules used in internal switch
+ * @rules_free: Number of rules free in internal switch
  *
  * Add mirror rule. Mirror rules are supported for VEBs or VEPA elements only
  **/
@@ -2899,8 +2901,8 @@ i40e_status i40e_aq_add_mirrorrule(struct i40e_hw *hw, 
u16 sw_seid,
  * add_mirrorrule.
  * @mr_list: list of mirrored VLAN IDs to be removed
  * @cmd_details: pointer to command details structure or NULL
- * @rule_used: Number of rules used in internal switch
- * @rule_free: Number of rules free in internal switch
+ * @rules_used: Number of rules used in internal switch
+ * @rules_free: Number of rules free in internal switch
  *
  * Delete 

[net-next 6/9] i40e: Fix multiple issues with UDP tunnel offload filter configuration

2018-04-30 Thread Jeff Kirsher
From: Alexander Duyck 

This fixes at least 2 issues I have found with the UDP tunnel filter
configuration.

The first issue is the fact that the tunnels didn't have any sort of mutual
exclusion in place to prevent an update from racing with a user request to
add/remove a port. As such you could request to add and remove a port
before the port update code had a chance to respond, which would produce a
very confusing result. To address it I have made two changes. First I added
the RTNL mutex wrapper around our updating of the pending, port, and
filter_index bits. Second I added logic so that we cannot use a port that
has a pending deletion since we need to free the space in hardware before
we can allow software to reuse it.

The second issue addressed is the fact that we were not recording the
actual filter index provided to us by the admin queue. As a result we were
deleting filters that were not associated with the actual filter we wanted
to delete. To fix that I added a filter_index member to the UDP port
tracking structure.

Signed-off-by: Alexander Duyck 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h  |  2 +
 drivers/net/ethernet/intel/i40e/i40e_main.c | 66 +++--
 2 files changed, 56 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index f573108faec3..70d369e9139c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -310,10 +310,12 @@ struct i40e_tc_configuration {
struct i40e_tc_info tc_info[I40E_MAX_TRAFFIC_CLASS];
 };
 
+#define I40E_UDP_PORT_INDEX_UNUSED 255
 struct i40e_udp_port_config {
/* AdminQ command interface expects port number in Host byte order */
u16 port;
u8 type;
+   u8 filter_index;
 };
 
 /* macros related to FLX_PIT */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index ad01bfc5ec80..0babde10fa15 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -9672,9 +9672,9 @@ static void i40e_handle_mdd_event(struct i40e_pf *pf)
i40e_flush(hw);
 }
 
-static const char *i40e_tunnel_name(struct i40e_udp_port_config *port)
+static const char *i40e_tunnel_name(u8 type)
 {
-   switch (port->type) {
+   switch (type) {
case UDP_TUNNEL_TYPE_VXLAN:
return "vxlan";
case UDP_TUNNEL_TYPE_GENEVE:
@@ -9708,37 +9708,68 @@ static void i40e_sync_udp_filters(struct i40e_pf *pf)
 static void i40e_sync_udp_filters_subtask(struct i40e_pf *pf)
 {
	struct i40e_hw *hw = &pf->hw;
-   i40e_status ret;
+   u8 filter_index, type;
u16 port;
int i;
 
if (!test_and_clear_bit(__I40E_UDP_FILTER_SYNC_PENDING, pf->state))
return;
 
+   /* acquire RTNL to maintain state of flags and port requests */
+   rtnl_lock();
+
for (i = 0; i < I40E_MAX_PF_UDP_OFFLOAD_PORTS; i++) {
if (pf->pending_udp_bitmap & BIT_ULL(i)) {
+   struct i40e_udp_port_config *udp_port;
+   i40e_status ret = 0;
+
+   udp_port = &pf->udp_ports[i];
pf->pending_udp_bitmap &= ~BIT_ULL(i);
-   port = pf->udp_ports[i].port;
+
+   port = READ_ONCE(udp_port->port);
+   type = READ_ONCE(udp_port->type);
+   filter_index = READ_ONCE(udp_port->filter_index);
+
+   /* release RTNL while we wait on AQ command */
+   rtnl_unlock();
+
if (port)
ret = i40e_aq_add_udp_tunnel(hw, port,
-   pf->udp_ports[i].type,
-   NULL, NULL);
-   else
-   ret = i40e_aq_del_udp_tunnel(hw, i, NULL);
+type,
+&filter_index,
+NULL);
+   else if (filter_index != I40E_UDP_PORT_INDEX_UNUSED)
+   ret = i40e_aq_del_udp_tunnel(hw, filter_index,
+NULL);
+
+   /* reacquire RTNL so we can update filter_index */
+   rtnl_lock();
 
if (ret) {
dev_info(>pdev->dev,
 "%s %s port %d, index %d failed, err 
%s aq_err %s\n",
-i40e_tunnel_name(&pf->udp_ports[i]),
+   

[net-next 3/9] i40e: fix reading LLDP configuration

2018-04-30 Thread Jeff Kirsher
From: Mariusz Stachura 

The previous method for reading the LLDP config was based on hard-coded
offsets. It happened to work because of the structured architecture of
the NVM memory. In the new approach, known as FLAT, we need to
calculate the absolute address instead of using relative values.
The needed defines for the memory locations were added.

Signed-off-by: Mariusz Stachura 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_dcb.c| 91 ---
 drivers/net/ethernet/intel/i40e/i40e_type.h   |  8 ++-
 drivers/net/ethernet/intel/i40evf/i40e_type.h | 10 ++-
 3 files changed, 99 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_dcb.c 
b/drivers/net/ethernet/intel/i40e/i40e_dcb.c
index 69e7d4967b1c..56bff8faf371 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_dcb.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_dcb.c
@@ -920,6 +920,70 @@ i40e_status i40e_init_dcb(struct i40e_hw *hw)
return ret;
 }
 
+/**
+ * _i40e_read_lldp_cfg - generic read of LLDP Configuration data from NVM
+ * @hw: pointer to the HW structure
+ * @lldp_cfg: pointer to hold lldp configuration variables
+ * @module: address of the module pointer
+ * @word_offset: offset of LLDP configuration
+ *
+ * Reads the LLDP configuration data from NVM using passed addresses
+ **/
+static i40e_status _i40e_read_lldp_cfg(struct i40e_hw *hw,
+  struct i40e_lldp_variables *lldp_cfg,
+  u8 module, u32 word_offset)
+{
+   u32 address, offset = (2 * word_offset);
+   i40e_status ret;
+   __le16 raw_mem;
+   u16 mem;
+
+   ret = i40e_acquire_nvm(hw, I40E_RESOURCE_READ);
+   if (ret)
+   return ret;
+
+   ret = i40e_aq_read_nvm(hw, 0x0, module * 2, sizeof(raw_mem), &raw_mem,
+  true, NULL);
+   i40e_release_nvm(hw);
+   if (ret)
+   return ret;
+
+   mem = le16_to_cpu(raw_mem);
+   /* Check if this pointer needs to be read in word size or 4K sector
+* units.
+*/
+   if (mem & I40E_PTR_TYPE)
+   address = (0x7FFF & mem) * 4096;
+   else
+   address = (0x7FFF & mem) * 2;
+
+   ret = i40e_acquire_nvm(hw, I40E_RESOURCE_READ);
+   if (ret)
+   goto err_lldp_cfg;
+
+   ret = i40e_aq_read_nvm(hw, module, offset, sizeof(raw_mem), &raw_mem,
+  true, NULL);
+   i40e_release_nvm(hw);
+   if (ret)
+   return ret;
+
+   mem = le16_to_cpu(raw_mem);
+   offset = mem + word_offset;
+   offset *= 2;
+
+   ret = i40e_acquire_nvm(hw, I40E_RESOURCE_READ);
+   if (ret)
+   goto err_lldp_cfg;
+
+   ret = i40e_aq_read_nvm(hw, 0, address + offset,
+  sizeof(struct i40e_lldp_variables), lldp_cfg,
+  true, NULL);
+   i40e_release_nvm(hw);
+
+err_lldp_cfg:
+   return ret;
+}
+
 /**
  * i40e_read_lldp_cfg - read LLDP Configuration data from NVM
  * @hw: pointer to the HW structure
@@ -931,21 +995,34 @@ i40e_status i40e_read_lldp_cfg(struct i40e_hw *hw,
   struct i40e_lldp_variables *lldp_cfg)
 {
i40e_status ret = 0;
-   u32 offset = (2 * I40E_NVM_LLDP_CFG_PTR);
+   u32 mem;
 
if (!lldp_cfg)
return I40E_ERR_PARAM;
 
ret = i40e_acquire_nvm(hw, I40E_RESOURCE_READ);
if (ret)
-   goto err_lldp_cfg;
+   return ret;
 
-   ret = i40e_aq_read_nvm(hw, I40E_SR_EMP_MODULE_PTR, offset,
-  sizeof(struct i40e_lldp_variables),
-  (u8 *)lldp_cfg,
-  true, NULL);
+   ret = i40e_aq_read_nvm(hw, I40E_SR_NVM_CONTROL_WORD, 0, sizeof(mem),
+  &mem, true, NULL);
i40e_release_nvm(hw);
+   if (ret)
+   return ret;
+
+   /* Read a bit that holds information whether we are running flat or
+* structured NVM image. Flat image has LLDP configuration in shadow
+* ram, so there is a need to pass different addresses for both cases.
+*/
+   if (mem & I40E_SR_NVM_MAP_STRUCTURE_TYPE) {
+   /* Flat NVM case */
+   ret = _i40e_read_lldp_cfg(hw, lldp_cfg, I40E_SR_EMP_MODULE_PTR,
+ I40E_SR_LLDP_CFG_PTR);
+   } else {
+   /* Good old structured NVM image */
+   ret = _i40e_read_lldp_cfg(hw, lldp_cfg, I40E_EMP_MODULE_PTR,
+ I40E_NVM_LLDP_CFG_PTR);
+   }
 
-err_lldp_cfg:
return ret;
 }
diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h 
b/drivers/net/ethernet/intel/i40e/i40e_type.h
index 40968a4216a7..7df969c59855 100644
--- 

[net-next 5/9] i40evf: Fix turning TSO, GSO and GRO on after

2018-04-30 Thread Jeff Kirsher
From: Paweł Jabłoński 

This patch fixes the problem where each MTU change turns TSO,
GSO and GRO back on from the off state.

Now when TSO, GSO or GRO is turned off, MTU change does not
turn them on.

Signed-off-by: Paweł Jabłoński 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 28a8cc4a14cb..3f04a182903d 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -3357,6 +3357,24 @@ int i40evf_process_config(struct i40evf_adapter *adapter)
if (vfres->vf_cap_flags & VIRTCHNL_VF_OFFLOAD_VLAN)
netdev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
 
+   /* Do not turn on offloads when they are requested to be turned off.
+* TSO needs minimum 576 bytes to work correctly.
+*/
+   if (netdev->wanted_features) {
+   if (!(netdev->wanted_features & NETIF_F_TSO) ||
+   netdev->mtu < 576)
+   netdev->features &= ~NETIF_F_TSO;
+   if (!(netdev->wanted_features & NETIF_F_TSO6) ||
+   netdev->mtu < 576)
+   netdev->features &= ~NETIF_F_TSO6;
+   if (!(netdev->wanted_features & NETIF_F_TSO_ECN))
+   netdev->features &= ~NETIF_F_TSO_ECN;
+   if (!(netdev->wanted_features & NETIF_F_GRO))
+   netdev->features &= ~NETIF_F_GRO;
+   if (!(netdev->wanted_features & NETIF_F_GSO))
+   netdev->features &= ~NETIF_F_GSO;
+   }
+
adapter->vsi.id = adapter->vsi_res->vsi_id;
 
adapter->vsi.back = adapter;
-- 
2.14.3



Re: [PATCH] ethtool: fix a potential missing-check bug

2018-04-30 Thread Shannon Nelson

On 4/29/2018 6:31 PM, Wenwen Wang wrote:

In ethtool_get_rxnfc(), the object "info" is first copied from
user-space. If the FLOW_RSS flag is set in the member field flow_type of
"info" (and cmd is ETHTOOL_GRXFH), info needs to be copied again from
user-space because FLOW_RSS is newer and has a new definition, as mentioned
in the comment. However, given that the user data resides in user-space, a
malicious user can race to change the data after the first copy. By doing
so, the user can inject inconsistent data. For example, in the second
copy, the FLOW_RSS flag could be cleared in the field flow_type of "info".
In the following execution, "info" will be used in the function
ops->get_rxnfc(). Such inconsistent data can potentially lead to unexpected
information leakage since ops->get_rxnfc() will prepare various types of
data according to flow_type, and the prepared data will be eventually
copied to user-space. This inconsistent data may also cause undefined
behaviors based on how ops->get_rxnfc() is implemented.

This patch re-verifies the flow_type field of "info" after the second copy.
If the value is not as expected, an error code will be returned.

Signed-off-by: Wenwen Wang 
---
  net/core/ethtool.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 03416e6..a121034 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1032,6 +1032,8 @@ static noinline_for_stack int ethtool_get_rxnfc(struct 
net_device *dev,
info_size = sizeof(info);
	if (copy_from_user(&info, useraddr, info_size))
return -EFAULT;


You might add a comment here to explain why the second check; otherwise 
someone might come along later and remove this check as redundant code.


sln


+   if (!(info.flow_type & FLOW_RSS))
+   return -EINVAL;
}
  
  	if (info.cmd == ETHTOOL_GRXCLSRLALL) {




Re: [PATCH net-next v3 0/6] mlxsw: SPAN: Support routes pointing at bridges

2018-04-30 Thread David Miller
From: Ido Schimmel 
Date: Sun, 29 Apr 2018 10:56:07 +0300

> Petr says:
> 
> When mirroring to a gretap or ip6gretap netdevice, the route that
> directs the encapsulated packets can reference a bridge. In that case,
> in the software model, the packet is switched.
> 
> Thus when offloading mirroring like that, take into consideration FDB,
> STP, PVID configured at the bridge, and whether that VLAN ID should be
> tagged on egress.
> 
> Patch #1 introduces functions to get bridge PVID, VLAN flags and to look
> up an FDB entry.
> 
> Patches #2 and #3 refactor some existing code and introduce a new
> accessor function.
> 
> With patches #4 and #5 mlxsw calls mlxsw_sp_span_respin() on switchdev
> events as well. There is no impact yet, because bridge as an underlay
> device is still not allowed.
> 
> That is implemented in patch #6, which uses the new interfaces to figure
> out on which one port the mirroring should be configured, and whether
> the mirrored packets should be VLAN-tagged and how.
> 
> Changes from v2 to v3:
> 
> - Rename the suite of bridge accessor function to br_vlan_get_pvid(),
>   br_vlan_get_info() and br_fdb_find_port(). The _get bit is to avoid
>   clashing with an existing static function.
> 
> Changes from v1 to v2:
> 
> - Change the suite of bridge accessor functions to br_vlan_pvid_rtnl(),
>   br_vlan_info_rtnl(), br_fdb_find_port_rtnl().

Series applied, thank you.


Re: [PATCH net-next] net: core: Assert the size of netdev_featres_t

2018-04-30 Thread Stephen Hemminger
On Fri, 27 Apr 2018 13:11:14 -0700
Florian Fainelli  wrote:

> We have about 53 netdev_features_t bits defined and counting, add a
> build time check to catch when an u64 type will not be enough and we
> will have to convert that to a bitmap. This is done in
> register_netdevice() for convenience.
> 
> Signed-off-by: Florian Fainelli 
> ---
>  include/linux/netdevice.h | 6 ++
>  net/core/dev.c| 1 +
>  2 files changed, 7 insertions(+)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 366c32891158..4326bc6b27d1 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -4121,6 +4121,12 @@ const char *netdev_drivername(const struct net_device 
> *dev);
>  
>  void linkwatch_run_queue(void);
>  
> +static inline void netdev_features_size_check(void)
> +{
> + BUILD_BUG_ON(sizeof(netdev_features_t) * BITS_PER_BYTE <
> +  NETDEV_FEATURE_COUNT);
> +}
> +
>  static inline netdev_features_t netdev_intersect_features(netdev_features_t 
> f1,
> netdev_features_t f2)
>  {
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0a2d46424069..23e6c1aa78c6 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -7881,6 +7881,7 @@ int register_netdevice(struct net_device *dev)
>   int ret;
>   struct net *net = dev_net(dev);
>  
> + netdev_features_size_check();
>   BUG_ON(dev_boot_phase);
>   ASSERT_RTNL();
>  

You don't have do this kind of inline function stuff to get the check.
Why not just put BUILD_BUG_ON directly in net/core/dev.c  Could be anywhere.
Rather than adding inline in the header file.



Re: [RFC net-next 0/5] Support for PHY test modes

2018-04-30 Thread Andrew Lunn
> Turning these tests on will typically result in the link partner
> dropping the link with us, and the interface will be non-functional as
> far as the data path is concerned (similar to an isolation mode). This
> might warrant properly reporting that to user-space through e.g: a
> private IFF_* value maybe?

Hi Florian

I've not looked at the code yet

Is it also necessary to kick off auto-neg again after the test has
finished, in order to reestablish the link?

  Andrew


Re: Performance regressions in TCP_STREAM tests in Linux 4.15 (and later)

2018-04-30 Thread Eric Dumazet


On 04/30/2018 09:14 AM, Ben Greear wrote:
> On 04/27/2018 08:11 PM, Steven Rostedt wrote:
>>
>> We'd like this email archived in netdev list, but since netdev is
>> notorious for blocking outlook email as spam, it didn't go through. So
>> I'm replying here to help get it into the archives.
>>
>> Thanks!
>>
>> -- Steve
>>
>>
>> On Fri, 27 Apr 2018 23:05:46 +
>> Michael Wenig  wrote:
>>
>>> As part of VMware's performance testing with the Linux 4.15 kernel,
>>> we identified CPU cost and throughput regressions when comparing to
>>> the Linux 4.14 kernel. The impacted test cases are mostly TCP_STREAM
>>> send tests when using small message sizes. The regressions are
>>> significant (up 3x) and were tracked down to be a side effect of Eric
>>> Dumazat's RB tree changes that went into the Linux 4.15 kernel.
>>> Further investigation showed our use of the TCP_NODELAY flag in
>>> conjunction with Eric's change caused the regressions to show and
>>> simply disabling TCP_NODELAY brought performance back to normal.
>>> Eric's change also resulted into significant improvements in our
>>> TCP_RR test cases.
>>>
>>>
>>>
>>> Based on these results, our theory is that Eric's change made the
>>> system overall faster (reduced latency) but as a side effect less
>>> aggregation is happening (with TCP_NODELAY) and that results in lower
>>> throughput. Previously even though TCP_NODELAY was set, system was
>>> slower and we still got some benefit of aggregation. Aggregation
>>> helps in better efficiency and higher throughput although it can
>>> increase the latency. If you are seeing a regression in your
>>> application throughput after this change, using TCP_NODELAY might
>>> help bring performance back however that might increase latency.
> 
> I guess you mean _disabling_ TCP_NODELAY instead of _using_ TCP_NODELAY?
>

Yeah, I guess auto-corking does not work as intended.




Re: [PATCH bpf-next] bpf: relax constraints on formatting for eBPF helper documentation

2018-04-30 Thread Edward Cree
On 30/04/18 16:59, Quentin Monnet wrote:
> The Python script used to parse and extract eBPF helpers documentation
> from include/uapi/linux/bpf.h expects a very specific formatting for the
> descriptions (single dots represent a space, '>' stands for a tab):
>
> /*
>  ...
>  *.int bpf_helper(list of arguments)
>  *.>Description
>  *.>>   Start of description
>  *.>>   Another line of description
>  *.>>   And yet another line of description
>  *.>Return
>  *.>>   0 on success, or a negative error in case of failure
>  ...
>  */
>
> This is too strict, and painful for developers who want to add
> documentation for new helpers. Worse, it is extremely difficult to
> check that the formatting is correct during reviews. Change the
> format expected by the script and make it more flexible. The script now
> works whether or not the initial space (right after the star) is
> present, and accepts both tabs and white spaces (or a combination of
> both) for indenting description sections and contents.
>
> Concretely, something like the following would now be supported:
>
> /*
>  ...
>  *int bpf_helper(list of arguments)
>  *..Description
>  *.>>   Start of description...
>  *> >   Another line of description
>  *..And yet another line of description
>  *> Return
>  *.>0 on success, or a negative error in case of failure
>  ...
>  */
>
> Signed-off-by: Quentin Monnet 
> ---
>  scripts/bpf_helpers_doc.py | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
> index 30ba0fee36e4..717547e6f0a6 100755
> --- a/scripts/bpf_helpers_doc.py
> +++ b/scripts/bpf_helpers_doc.py
> @@ -87,7 +87,7 @@ class HeaderParser(object):
>  #   - Same as above, with "const" and/or "struct" in front of type
>  #   - "..." (undefined number of arguments, for bpf_trace_printk())
>  # There is at least one term ("void"), and at most five arguments.
> -p = re.compile('^ \* ((.+) \**\w+\(((const )?(struct )?(\w+|\.\.\.)( \**\w+)?)(, )?){1,5}\))$')
> +p = re.compile('^ \* ?((.+) \**\w+\(((const )?(struct )?(\w+|\.\.\.)( \**\w+)?)(, )?){1,5}\))$')
The proper coding style for such things is to go straight to tabs after
 the star and not have the space.  So if we're going to make the script
 flexible here (and leave coding style enforcement to other tools such
 as checkpatch), maybe the regexen should just begin '^ \*\s+' and avoid
 relying on counting indentation to delimit sections (e.g. scan for the
 section headers like '^ \*\s+Description$' instead).
Btw, leading '^' is unnecessary as re.match() is already implicitly
 anchored at start-of-string.  (The trailing '$' are still needed.)

-Ed


Re: Performance regressions in TCP_STREAM tests in Linux 4.15 (and later)

2018-04-30 Thread Steven Rostedt
On Mon, 30 Apr 2018 09:14:04 -0700
Ben Greear  wrote:

> >> As part of VMware's performance testing with the Linux 4.15 kernel,
> >> we identified CPU cost and throughput regressions when comparing to
> >> the Linux 4.14 kernel. The impacted test cases are mostly TCP_STREAM
> >> send tests when using small message sizes. The regressions are
> >> significant (up 3x) and were tracked down to be a side effect of Eric
> >> Dumazat's RB tree changes that went into the Linux 4.15 kernel.
> >> Further investigation showed our use of the TCP_NODELAY flag in
> >> conjunction with Eric's change caused the regressions to show and
> >> simply disabling TCP_NODELAY brought performance back to normal.
> >> Eric's change also resulted into significant improvements in our
> >> TCP_RR test cases.
> >>
> >>
> >>
> >> Based on these results, our theory is that Eric's change made the
> >> system overall faster (reduced latency) but as a side effect less
> >> aggregation is happening (with TCP_NODELAY) and that results in lower
> >> throughput. Previously even though TCP_NODELAY was set, system was
> >> slower and we still got some benefit of aggregation. Aggregation
> >> helps in better efficiency and higher throughput although it can
> >> increase the latency. If you are seeing a regression in your
> >> application throughput after this change, using TCP_NODELAY might
> >> help bring performance back however that might increase latency.  
> 
> I guess you mean _disabling_ TCP_NODELAY instead of _using_ TCP_NODELAY?

Yes, thank you for catching that.

-- Steve



Re: [RFC net-next 0/5] Support for PHY test modes

2018-04-30 Thread Florian Fainelli
On 04/29/2018 07:55 PM, David Miller wrote:
> From: Florian Fainelli 
> Date: Fri, 27 Apr 2018 17:32:30 -0700
> 
>> This patch series adds support for specifying PHY test modes through
>> ethtool and paves the ground for adding support for more complex
>> test modes that might require data to be exchanged between user and
>> kernel space.
>>
>> As an example, patches are included to add support for the IEEE
>> electrical test modes for 100BaseT2 and 1000BaseT. Those do not
>> require data to be passed back and forth.
>>
>> I believe the infrastructure to be usable enough to add support for
>> other things like:
>>
>> - cable diagnostics
>> - pattern generator/waveform generator with specific pattern being
>>   indicated for instance
>>
>> Questions for Andrew, and others:
>>
>> - there could be room for adding additional ETH_TEST_FL_* values in order to
>>   help determine how the test should be running
>> - some of these tests can be disruptive to connectivity, the minimum we could
>>   do is stop the PHY state machine and restart it when "normal" is used to 
>> exit
>>   those test modes
>>
>> Comments welcome!
> 
> Generally, no objection to providing this in the general manner you
> have implemented it via ethtool.

Thanks for taking a look!

> 
> I think in order to answer the disruptive question, you need to give
> some information about what kind of context this stuff would be
> used in, and if in those contexts what the user expectations are
> or might be.
> 
> Are these test modes something that usually would be initiated with
> the interface down?

We expect that these commands/tests would likely be issued when the
interface is up (not necessarily with a carrier state ON though) because
we know for sure that drivers will have successfully connected to their
PHY and there is no power management (or there is, like runtime PM)
which will not prevent accesses to the MDIO interface from working.

Turning these tests on will typically result in the link partner
dropping the link with us, and the interface will be non-functional as
far as the data path is concerned (similar to an isolation mode). This
might warrant properly reporting that to user-space through e.g: a
private IFF_* value maybe?
-- 
Florian


[PATCH] net/mlx4: fix spelling mistake: "failedi" -> "failed"

2018-04-30 Thread Colin King
From: Colin Ian King 

trivial fix to spelling mistake in mlx4_warn message.

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c 
b/drivers/net/ethernet/mellanox/mlx4/main.c
index bfef69235d71..211578ffc70d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -1317,7 +1317,7 @@ static int mlx4_mf_unbond(struct mlx4_dev *dev)
 
ret = mlx4_unbond_fs_rules(dev);
if (ret)
-   mlx4_warn(dev, "multifunction unbond for flow rules failedi 
(%d)\n", ret);
+   mlx4_warn(dev, "multifunction unbond for flow rules failed 
(%d)\n", ret);
ret1 = mlx4_unbond_mac_table(dev);
if (ret1) {
mlx4_warn(dev, "multifunction unbond for MAC table failed 
(%d)\n", ret1);
-- 
2.17.0



Re: [net-next v2] ipv6: sr: extract the right key values for "seg6_make_flowlabel"

2018-04-30 Thread David Miller
From: Ahmed Abdelsalam 
Date: Sat, 28 Apr 2018 12:18:35 +0200

> The seg6_make_flowlabel() is used by seg6_do_srh_encap() to compute the
> flowlabel from a given skb. It relies on skb_get_hash() which eventually
> calls __skb_flow_dissect() to extract the flow_keys struct values from
> the skb.
> 
> In case of IPv4 traffic, calling seg6_make_flowlabel() after skb_push(),
> skb_reset_network_header(), and skb_mac_header_rebuild() will result in a
> flow_keys struct with all key values set to zero.
> 
> This patch calls seg6_make_flowlabel() before resetting the headers of skb
> to get the right key values.
> 
> Extracted key values are based on the type of the inner packet as follows:
> 1) IPv6 traffic: src_IP, dst_IP, L4 proto, and flowlabel of inner packet.
> 2) IPv4 traffic: src_IP, dst_IP, L4 proto, src_port, and dst_port
> 3) L2 traffic: depends on what kind of traffic carried into the L2
> frame. IPv6 and IPv4 traffic works as discussed 1) and 2)
> 
> Here a hex_dump of struct flow_keys for IPv4 and IPv6 traffic
> 10.100.1.100: 47302 > 30.0.0.2: 5001
> 0000: 14 00 02 00 00 00 00 00 08 00 11 00 00 00 00 00
> 0010: 00 00 00 00 00 00 00 00 13 89 b8 c6 1e 00 00 02
> 0020: 0a 64 01 64
> 
> fc00:a1:a > b2::2
> 0000: 28 00 03 00 00 00 00 00 86 dd 11 00 99 f9 02 00
> 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 b2 00 00
> 0020: 00 00 00 00 00 00 00 00 00 00 00 02 fc 00 00 a1
> 0030: 00 00 00 00 00 00 00 00 00 00 00 0a
> 
> Signed-off-by: Ahmed Abdelsalam 

Looks good, applied, thank you.


Re: Performance regressions in TCP_STREAM tests in Linux 4.15 (and later)

2018-04-30 Thread Ben Greear

On 04/27/2018 08:11 PM, Steven Rostedt wrote:


We'd like this email archived in netdev list, but since netdev is
notorious for blocking outlook email as spam, it didn't go through. So
I'm replying here to help get it into the archives.

Thanks!

-- Steve


On Fri, 27 Apr 2018 23:05:46 +
Michael Wenig  wrote:


As part of VMware's performance testing with the Linux 4.15 kernel,
we identified CPU cost and throughput regressions when comparing to
the Linux 4.14 kernel. The impacted test cases are mostly TCP_STREAM
send tests when using small message sizes. The regressions are
significant (up 3x) and were tracked down to be a side effect of Eric
Dumazat's RB tree changes that went into the Linux 4.15 kernel.
Further investigation showed our use of the TCP_NODELAY flag in
conjunction with Eric's change caused the regressions to show and
simply disabling TCP_NODELAY brought performance back to normal.
Eric's change also resulted into significant improvements in our
TCP_RR test cases.



Based on these results, our theory is that Eric's change made the
system overall faster (reduced latency) but as a side effect less
aggregation is happening (with TCP_NODELAY) and that results in lower
throughput. Previously even though TCP_NODELAY was set, system was
slower and we still got some benefit of aggregation. Aggregation
helps in better efficiency and higher throughput although it can
increase the latency. If you are seeing a regression in your
application throughput after this change, using TCP_NODELAY might
help bring performance back however that might increase latency.


I guess you mean _disabling_ TCP_NODELAY instead of _using_ TCP_NODELAY?

Thanks,
Ben


--
Ben Greear 
Candela Technologies Inc  http://www.candelatech.com



Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread David Miller
From: Eric Dumazet 
Date: Mon, 30 Apr 2018 09:01:47 -0700

> TCP sockets are read by a single thread really (or synchronized
> threads), or garbage is ensured, regardless of how the kernel
> ensures locking while reporting "queue length"

Whatever applications "typically do", we should never return
garbage, and that is what this code is allowing to happen.

Everything else in recvmsg() operates on state under the proper socket
lock, to ensure consistency.

The only reason we are releasing the socket lock first is to make sure
the backlog is processed and we have the most up-to-date information
available.

It seems like one is striving for correctness and better accuracy, no?
:-)

Look, this can be fixed really simply.  And if you are worried about
unbounded loops if two apps maliciously do recvmsg() in parallel,
then don't even loop, just fallback to full socket locking and make
the "non-typical" application pay the price:

tmp1 = A;
tmp2 = B;
barrier();
tmp3 = A;
if (unlikely(tmp1 != tmp3)) {
lock_sock(sk);
tmp1 = A;
tmp2 = B;
release_sock(sk);
}

I'm seriously not applying the patch as-is, sorry.  This issue
must be addressed somehow.

Thank you.


[PATCH bpf-next] bpf: relax constraints on formatting for eBPF helper documentation

2018-04-30 Thread Quentin Monnet
The Python script used to parse and extract eBPF helpers documentation
from include/uapi/linux/bpf.h expects a very specific formatting for the
descriptions (single dots represent a space, '>' stands for a tab):

/*
 ...
 *.int bpf_helper(list of arguments)
 *.>Description
 *.>>   Start of description
 *.>>   Another line of description
 *.>>   And yet another line of description
 *.>Return
 *.>>   0 on success, or a negative error in case of failure
 ...
 */

This is too strict, and painful for developers who want to add
documentation for new helpers. Worse, it is extremely difficult to
check that the formatting is correct during reviews. Change the
format expected by the script and make it more flexible. The script now
works whether or not the initial space (right after the star) is
present, and accepts both tabs and white spaces (or a combination of
both) for indenting description sections and contents.

Concretely, something like the following would now be supported:

/*
 ...
 *int bpf_helper(list of arguments)
 *..Description
 *.>>   Start of description...
 *> >   Another line of description
 *..And yet another line of description
 *> Return
 *.>0 on success, or a negative error in case of failure
 ...
 */

Signed-off-by: Quentin Monnet 
---
 scripts/bpf_helpers_doc.py | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
index 30ba0fee36e4..717547e6f0a6 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_helpers_doc.py
@@ -87,7 +87,7 @@ class HeaderParser(object):
 #   - Same as above, with "const" and/or "struct" in front of type
 #   - "..." (undefined number of arguments, for bpf_trace_printk())
 # There is at least one term ("void"), and at most five arguments.
-p = re.compile('^ \* ((.+) \**\w+\((((const )?(struct )?(\w+|\.\.\.))( \**\w+)?)(, )?){1,5}\))$')
+p = re.compile('^ \* ?((.+) \**\w+\((((const )?(struct )?(\w+|\.\.\.))( \**\w+)?)(, )?){1,5}\))$')
 capture = p.match(self.line)
 if not capture:
 raise NoHelperFound
@@ -95,7 +95,7 @@ class HeaderParser(object):
 return capture.group(1)
 
 def parse_desc(self):
-p = re.compile('^ \* \tDescription$')
+p = re.compile('^ \* ?(?:\t| {6,8})Description$')
 capture = p.match(self.line)
 if not capture:
 # Helper can have empty description and we might be parsing another
@@ -109,7 +109,7 @@ class HeaderParser(object):
 if self.line == ' *\n':
 desc += '\n'
 else:
-p = re.compile('^ \* \t\t(.*)')
+p = re.compile('^ \* ?(?:\t| {6,8})(?:\t| {8})(.*)')
 capture = p.match(self.line)
 if capture:
 desc += capture.group(1) + '\n'
@@ -118,7 +118,7 @@ class HeaderParser(object):
 return desc
 
 def parse_ret(self):
-p = re.compile('^ \* \tReturn$')
+p = re.compile('^ \* ?(?:\t| {6,8})Return$')
 capture = p.match(self.line)
 if not capture:
 # Helper can have empty retval and we might be parsing another
@@ -132,7 +132,7 @@ class HeaderParser(object):
 if self.line == ' *\n':
 ret += '\n'
 else:
-p = re.compile('^ \* \t\t(.*)')
+p = re.compile('^ \* ?(?:\t| {6,8})(?:\t| {8})(.*)')
 capture = p.match(self.line)
 if capture:
 ret += capture.group(1) + '\n'
-- 
2.14.1
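
The relaxed patterns from the diff can be exercised directly with Python's
re module. A quick sanity check (not part of the script itself):

```python
import re

# The two relaxed patterns from the patch: the space after the leading
# '*' is optional, and the section header accepts a tab or 6-8 spaces;
# section contents take one more level of tab or 8-space indent.
desc_re = re.compile(r'^ \* ?(?:\t| {6,8})Description$')
body_re = re.compile(r'^ \* ?(?:\t| {6,8})(?:\t| {8})(.*)')

# The old strict form still matches...
assert desc_re.match(' * \tDescription')
# ...and so does a space-indented variant with no space after '*'.
assert desc_re.match(' *      Description')   # six spaces after '*'

m = body_re.match(' * \t\tStart of description')
assert m.group(1) == 'Start of description'
```
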



Re: [PATCH bpf-next] tools include uapi: Grab a copy of linux/erspan.h

2018-04-30 Thread Daniel Borkmann
On 04/30/2018 05:45 PM, Y Song wrote:
> On Mon, Apr 30, 2018 at 7:33 AM, Daniel Borkmann  wrote:
>> On 04/30/2018 04:26 PM, William Tu wrote:
>>> Bring the erspan uapi header file so BPF tunnel helpers can use it.
>>>
>>> Fixes: 933a741e3b82 ("selftests/bpf: bpf tunnel test.")
>>> Reported-by: Yonghong Song 
>>> Signed-off-by: William Tu 
>>
>> Thanks for the patch, William! I also Cc'ed Yonghong here, so he has a
>> chance to try it out.
> 
> Just tried it out. It works. Thanks for fixing!
> Acked-by: Yonghong Song 

Applied to bpf-next, thanks everyone!


Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread Eric Dumazet


On 04/30/2018 08:56 AM, David Miller wrote:
> From: Eric Dumazet 
> Date: Mon, 30 Apr 2018 08:43:50 -0700
> 
>> I say sort of, because by the time we have any number, TCP might
>> have received more packets anyway.
> 
> That's fine.
> 
> However, the number reported should have been true at least at some
> finite point in time.
> 
> If you allow overlapping changes to either of the two variables during
> the sampling, then you are reporting a number which was never true at
> any point in time.
> 
> It is essentially garbage.


Correct.

TCP sockets are really read by a single thread (or synchronized threads);
otherwise garbage is ensured, regardless of how the kernel handles locking
while reporting "queue length".




Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread Soheil Hassas Yeganeh
On Mon, Apr 30, 2018 at 11:43 AM, Eric Dumazet  wrote:
> On 04/30/2018 08:38 AM, David Miller wrote:
>> From: Soheil Hassas Yeganeh 
>> Date: Fri, 27 Apr 2018 14:57:32 -0400
>>
>>> Since the socket lock is not held when calculating the size of
>>> receive queue, TCP_INQ is a hint.  For example, it can overestimate
>>> the queue size by one byte, if FIN is received.
>>
>> I think it is even worse than that.
>>
>> If another application comes in and does a recvmsg() in parallel with
>> these calculations, you could even report a negative value.

Thank you, David. In addition to Eric's point, for TCP specifically,
it is quite uncommon to have multiple threads calling recvmsg() for
the same socket in parallel, because the application is interested in
the streamed, in-sequence bytes. Except when the application just
wants to discard the incoming stream or has predefined frame sizes,
this wouldn't be an issue. For such cases, the proposed INQ hint is
not going to be useful.

Could you please let me know whether you have any other example in mind?

Thanks!
Soheil


Re: [PATCH] change the comment of vti6_ioctl

2018-04-30 Thread David Miller
From: Sun Lianwen 
Date: Sun, 29 Apr 2018 15:05:52 +0800

> The comment of vti6_ioctl() is wrong. which use vti6_tnl_ioctl
> instead of vti6_ioctl.
> 
> Signed-off-by: Sun Lianwen 

Please CC: the IPSEC maintainers on future patch submissions to IPSEC
files, as per the top level MAINTAINERS file.

Steffen, please queue this up, thank you.

> ---
>  net/ipv6/ip6_vti.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
> index c214ffec02f0..deadc4c3703b 100644
> --- a/net/ipv6/ip6_vti.c
> +++ b/net/ipv6/ip6_vti.c
> @@ -743,7 +743,7 @@ vti6_parm_to_user(struct ip6_tnl_parm2 *u, const struct 
> __ip6_tnl_parm *p)
>  }
>  
>  /**
> - * vti6_tnl_ioctl - configure vti6 tunnels from userspace
> + * vti6_ioctl - configure vti6 tunnels from userspace
>   *   @dev: virtual device associated with tunnel
>   *   @ifr: parameters passed from userspace
>   *   @cmd: command to be performed
> -- 
> 2.17.0
> 
> 
> 


Re: [PATCH net-next 0/2 v5] netns: uevent filtering

2018-04-30 Thread Eric W. Biederman
Christian Brauner  writes:

> Hey everyone,
>
> This is the new approach to uevent filtering as discussed (see the
> threads in [1], [2], and [3]). It only contains *non-functional
> changes*.
>
> This series deals with with fixing up uevent filtering logic:
> - uevent filtering logic is simplified
> - locking time on uevent_sock_list is minimized
> - tagged and untagged kobjects are handled in separate codepaths
> - permissions for userspace are fixed for network device uevents in
>   network namespaces owned by non-initial user namespaces
>   Udev is now able to see those events correctly, which it wasn't before.
>   For example, moving a physical device into a network namespace not
>   owned by the initial user namespaces before gave:
>
>   root@xen1:~# udevadm --debug monitor -k
>   calling: monitor
>   monitor will print the received events for:
>   KERNEL - the kernel uevent
>
>   sender uid=65534, message ignored
>   sender uid=65534, message ignored
>   sender uid=65534, message ignored
>   sender uid=65534, message ignored
>   sender uid=65534, message ignored
>
>   and now after the discussion and solution in [3] correctly gives:
>
>   root@xen1:~# udevadm --debug monitor -k
>   calling: monitor
>   monitor will print the received events for:
>   KERNEL - the kernel uevent
>
>   KERNEL[625.301042] add  
> /devices/pci:00/:00:02.0/:01:00.1/net/enp1s0f1 (net)
>   KERNEL[625.301109] move 
> /devices/pci:00/:00:02.0/:01:00.1/net/enp1s0f1 (net)
>   KERNEL[625.301138] move 
> /devices/pci:00/:00:02.0/:01:00.1/net/eth1 (net)
>   KERNEL[655.333272] remove 
> /devices/pci:00/:00:02.0/:01:00.1/net/eth1 (net)
>
> Thanks!
> Christian
>
> [1]: https://lkml.org/lkml/2018/4/4/739
> [2]: https://lkml.org/lkml/2018/4/26/767
> [3]: https://lkml.org/lkml/2018/4/26/738

Acked-by: "Eric W. Biederman" 

>
> Christian Brauner (2):
>   uevent: add alloc_uevent_skb() helper
>   netns: restrict uevents
>
>  lib/kobject_uevent.c | 178 ++-
>  1 file changed, 126 insertions(+), 52 deletions(-)

Eric


Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread David Miller
From: Eric Dumazet 
Date: Mon, 30 Apr 2018 08:43:50 -0700

> I say sort of, because by the time we have any number, TCP might
> have received more packets anyway.

That's fine.

However, the number reported should have been true at least at some
finite point in time.

If you allow overlapping changes to either of the two variables during
the sampling, then you are reporting a number which was never true at
any point in time.

It is essentially garbage.


Re: [PATCH net-next] libcxgb,cxgb4: use __skb_put_zero to simplfy code

2018-04-30 Thread David Miller
From: YueHaibing 
Date: Sat, 28 Apr 2018 12:35:22 +0800

> use helper __skb_put_zero to replace the pattern of __skb_put() && memset()
> 
> Signed-off-by: YueHaibing 

Applied, thank you.


Re: [PATCH bpf-next] tools include uapi: Grab a copy of linux/erspan.h

2018-04-30 Thread Y Song
On Mon, Apr 30, 2018 at 7:33 AM, Daniel Borkmann  wrote:
> On 04/30/2018 04:26 PM, William Tu wrote:
>> Bring the erspan uapi header file so BPF tunnel helpers can use it.
>>
>> Fixes: 933a741e3b82 ("selftests/bpf: bpf tunnel test.")
>> Reported-by: Yonghong Song 
>> Signed-off-by: William Tu 
>
> Thanks for the patch, William! I also Cc'ed Yonghong here, so he has a
> chance to try it out.

Just tried it out. It works. Thanks for fixing!
Acked-by: Yonghong Song 


RE: smsc95xx: aligment issues

2018-04-30 Thread Woojung.Huh
Hi Stefan,

Thanks for the report. We will try to reproduce the issue and contact you if
we need more details.

Regards,
Woojung

> -Original Message-
> From: Stefan Wahren [mailto:stefan.wah...@i2se.com]
> Sent: Saturday, April 28, 2018 3:59 AM
> To: Nisar Sayed - I17970 ; Woojung Huh - C21699
> 
> Cc: David S. Miller ; linux-usb 
> ; netdev
> ; popcorn mix ; James Hughes
> 
> Subject: net: smsc95xx: aligment issues
> 
> Hi,
> after connecting a Raspberry Pi 1 B to my local network I'm seeing alignment
> issues under
> /proc/cpu/alignment:
> 
> User: 0
> System:   142 (_decode_session4+0x12c/0x3c8)
> Skipped:  0
> Half: 0
> Word: 0
> DWord:127
> Multi:15
> User faults:  2 (fixup)
> 
> I've also seen outputs with _csum_ipv6_magic.
> 
> Kernel config: bcm2835_defconfig
> Reproducible kernel trees: current linux-next, 4.17-rc2 and 4.14.37 (I didn't
> test older versions)
> 
> Please tell if you need more information to narrow down this issue.
> 
> Best regards
> Stefan


Re: [PATCH net-next] erspan: auto detect truncated packets.

2018-04-30 Thread David Miller
From: William Tu 
Date: Fri, 27 Apr 2018 14:16:32 -0700

> Currently the truncated bit is set only when the mirrored packet
> is larger than the MTU.  For certain cases, the packet might already
> have been truncated before being sent to the erspan tunnel.  In this case,
> the patch detects whether the IP header's total length is larger
> than the actual skb->len.  If true, this indicates that the
> mirrored packet was truncated, and the erspan truncate bit is set.
> 
> I tested the patch using bpf_skb_change_tail helper function to
> shrink the packet size and send to erspan tunnel.
> 
> Reported-by: Xiaoyan Jin 
> Signed-off-by: William Tu 

Applied, thanks.
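
The check described in the commit message can be sketched in userspace terms:
compare the IPv4 header's total-length field against the bytes actually
present. A hedged Python illustration, not the kernel implementation:

```python
import struct

def ip_says_truncated(packet: bytes) -> bool:
    """True if the IPv4 total-length field claims more bytes than the
    buffer holds -- the userspace analogue of comparing the IP header's
    total length against skb->len."""
    if len(packet) < 20:            # not even a minimal IPv4 header
        return True
    tot_len = struct.unpack('!H', packet[2:4])[0]
    return tot_len > len(packet)

# A bare 20-byte header whose total-length field claims a 100-byte
# datagram: the mirrored packet was truncated before reaching us.
hdr = struct.pack('!BBHHHBBH4s4s', 0x45, 0, 100, 0, 0, 64, 6, 0,
                  b'\x0a\x00\x00\x01', b'\x0a\x00\x00\x02')
assert ip_says_truncated(hdr)
assert not ip_says_truncated(hdr[:2] + struct.pack('!H', 20) + hdr[4:])
```
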


Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread Eric Dumazet


On 04/30/2018 08:38 AM, David Miller wrote:
> From: Soheil Hassas Yeganeh 
> Date: Fri, 27 Apr 2018 14:57:32 -0400
> 
>> Since the socket lock is not held when calculating the size of
>> receive queue, TCP_INQ is a hint.  For example, it can overestimate
>> the queue size by one byte, if FIN is received.
> 
> I think it is even worse than that.
> 
> If another application comes in and does a recvmsg() in parallel with
> these calculations, you could even report a negative value.
> 
> These READ_ONCE() make it look like some of these issues are being
> addressed but they are not.
> 
> You could freeze the values just by taking sk->sk_lock.slock, but I
> don't know if that cost is considered acceptable or not.
> 
> Another idea is to sample both values in a loop, similar to a sequence
> lock sequence:
> 
> again:
>   tmp1 = A;
>   tmp2 = B;
>   barrier();
>   tmp3 = A;
>   if (tmp1 != tmp3)
>   goto again;
> 
> But the current state of affairs is not going to work well.
> 

We want a hint, and max_t(int, 0, ) does not return a negative value?

If the hint is wrong in 0.1% of the cases, we really do not care; it is not
meant to replace the existing precise (well, sort of) mechanism.

I say sort of, because by the time we have any number, TCP might have received 
more packets anyway.



Re: KASAN: use-after-free Read in perf_trace_rpc_stats_latency

2018-04-30 Thread Chuck Lever


> On Apr 30, 2018, at 9:34 AM, syzbot 
>  wrote:
> 
> Hello,
> 
> syzbot hit the following crash on bpf-next commit
> f60ad0a0c441530280a4918eca781a6a94dffa50 (Sun Apr 29 15:45:55 2018 +)
> Merge branch 'bpf_get_stack'
> syzbot dashboard link: 
> https://syzkaller.appspot.com/bug?extid=27db1f90e2b972a5f2d3
> 
> Unfortunately, I don't have any reproducer for this crash yet.
> Raw console output: 
> https://syzkaller.appspot.com/x/log.txt?id=6741221342969856
> Kernel config: https://syzkaller.appspot.com/x/.config?id=4410550353033654931
> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+27db1f90e2b972a5f...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for details.
> If you forward the report, please keep this part and the footer.
> 
> rpcbind: RPC call returned error 22
> rpcbind: RPC call returned error 22
> rpcbind: RPC call returned error 22
> rpcbind: RPC call returned error 22
> ==
> BUG: KASAN: use-after-free in strlen+0x83/0xa0 lib/string.c:482
> Read of size 1 at addr 8801d6f0a1c0 by task syz-executor7/5079
> 
> CPU: 1 PID: 5079 Comm: syz-executor7 Not tainted 4.17.0-rc2+ #16
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
> __dump_stack lib/dump_stack.c:77 [inline]
> dump_stack+0x1b9/0x294 lib/dump_stack.c:113
> print_address_description+0x6c/0x20b mm/kasan/report.c:256
> kasan_report_error mm/kasan/report.c:354 [inline]
> kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
> __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
> strlen+0x83/0xa0 lib/string.c:482
> trace_event_get_offsets_rpc_stats_latency include/trace/events/sunrpc.h:215 
> [inline]
> perf_trace_rpc_stats_latency+0x318/0x10d0 include/trace/events/sunrpc.h:215
> trace_rpc_stats_latency include/trace/events/sunrpc.h:215 [inline]
> rpc_count_iostats_metrics+0x594/0x8a0 net/sunrpc/stats.c:182
> rpc_count_iostats+0x76/0x90 net/sunrpc/stats.c:195
> xprt_release+0xa3b/0x1110 net/sunrpc/xprt.c:1351
> rpc_release_resources_task+0x20/0xa0 net/sunrpc/sched.c:1024
> rpc_release_task net/sunrpc/sched.c:1068 [inline]
> __rpc_execute+0x5e9/0xf50 net/sunrpc/sched.c:833
> rpc_execute+0x37f/0x480 net/sunrpc/sched.c:852
> rpc_run_task+0x615/0x8c0 net/sunrpc/clnt.c:1053
> rpc_call_sync+0x196/0x290 net/sunrpc/clnt.c:1082
> rpc_ping+0x155/0x1f0 net/sunrpc/clnt.c:2513
> rpc_create_xprt+0x282/0x3f0 net/sunrpc/clnt.c:479
> rpc_create+0x52e/0x900 net/sunrpc/clnt.c:587
> nfs_create_rpc_client+0x63e/0x850 fs/nfs/client.c:523
> nfs_init_client+0x74/0x100 fs/nfs/client.c:634
> nfs_get_client+0x1065/0x1500 fs/nfs/client.c:425
> nfs_init_server+0x364/0xfb0 fs/nfs/client.c:670
> nfs_create_server+0x86/0x5f0 fs/nfs/client.c:953
> nfs_try_mount+0x177/0xab0 fs/nfs/super.c:1884
> nfs_fs_mount+0x17de/0x2efd fs/nfs/super.c:2695
> mount_fs+0xae/0x328 fs/super.c:1267
> vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037
> vfs_kern_mount fs/namespace.c:1027 [inline]
> do_new_mount fs/namespace.c:2518 [inline]
> do_mount+0x564/0x3070 fs/namespace.c:2848
> ksys_mount+0x12d/0x140 fs/namespace.c:3064
> __do_sys_mount fs/namespace.c:3078 [inline]
> __se_sys_mount fs/namespace.c:3075 [inline]
> __x64_sys_mount+0xbe/0x150 fs/namespace.c:3075
> do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x455979
> RSP: 002b:7f1e2785bc68 EFLAGS: 0246 ORIG_RAX: 00a5
> RAX: ffda RBX: 7f1e2785c6d4 RCX: 00455979
> RDX: 20fb5ffc RSI: 20343ff8 RDI: 2091dff8
> RBP: 0072bf50 R08: 2000a000 R09: 
> R10:  R11: 0246 R12: 
> R13: 0440 R14: 006fa6a0 R15: 0001
> 
> Allocated by task 5079:
> save_stack+0x43/0xd0 mm/kasan/kasan.c:448
> set_track mm/kasan/kasan.c:460 [inline]
> kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
> __do_kmalloc mm/slab.c:3718 [inline]
> __kmalloc_track_caller+0x14a/0x760 mm/slab.c:3733
> kstrdup+0x39/0x70 mm/util.c:56
> xs_format_common_peer_ports+0x130/0x370 net/sunrpc/xprtsock.c:290
> xs_format_peer_addresses net/sunrpc/xprtsock.c:303 [inline]
> xs_setup_udp+0x5ea/0x880 net/sunrpc/xprtsock.c:3037
> xprt_create_transport+0x1d7/0x596 net/sunrpc/xprt.c:1433
> rpc_create+0x489/0x900 net/sunrpc/clnt.c:573
> nfs_create_rpc_client+0x63e/0x850 fs/nfs/client.c:523
> nfs_init_client+0x74/0x100 fs/nfs/client.c:634
> nfs_get_client+0x1065/0x1500 fs/nfs/client.c:425
> nfs_init_server+0x364/0xfb0 fs/nfs/client.c:670
> nfs_create_server+0x86/0x5f0 fs/nfs/client.c:953
> nfs_try_mount+0x177/0xab0 fs/nfs/super.c:1884
> nfs_fs_mount+0x17de/0x2efd fs/nfs/super.c:2695
> mount_fs+0xae/0x328 fs/super.c:1267
> 

Re: [PATCH V2 net-next 1/2] tcp: send in-queue bytes in cmsg upon read

2018-04-30 Thread David Miller
From: Soheil Hassas Yeganeh 
Date: Fri, 27 Apr 2018 14:57:32 -0400

> Since the socket lock is not held when calculating the size of
> receive queue, TCP_INQ is a hint.  For example, it can overestimate
> the queue size by one byte, if FIN is received.

I think it is even worse than that.

If another application comes in and does a recvmsg() in parallel with
these calculations, you could even report a negative value.

These READ_ONCE() make it look like some of these issues are being
addressed but they are not.

You could freeze the values just by taking sk->sk_lock.slock, but I
don't know if that cost is considered acceptable or not.

Another idea is to sample both values in a loop, similar to a sequence
lock sequence:

again:
tmp1 = A;
tmp2 = B;
barrier();
tmp3 = A;
if (tmp1 != tmp3)
goto again;

But the current state of affairs is not going to work well.


Re: simplify procfs code for seq_file instances V2

2018-04-30 Thread David Howells
Note that your kernel hits the:

inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
swapper/0/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
(ptrval) (fs_reclaim){?.+.}, at: fs_reclaim_acquire+0x12/0x35
{HARDIRQ-ON-W} state was registered at:
  fs_reclaim_acquire+0x32/0x35
  kmem_cache_alloc_node_trace+0x49/0x2cf
  alloc_worker+0x1d/0x49
  init_rescuer.part.7+0x19/0x8f
  workqueue_init+0xc0/0x1fe
  kernel_init_freeable+0xdc/0x433
  kernel_init+0xa/0xf5
  ret_from_fork+0x24/0x30

bug, as described here:


https://groups.google.com/forum/#!msg/syzkaller-bugs/sJC3Y3hOM08/aO3z9JXoAgAJ

David


Re: [PATCH RFC iproute2-next 2/2] rdma: print provider resource attributes

2018-04-30 Thread Stephen Hemminger
On Mon, 30 Apr 2018 07:36:18 -0700
Steve Wise  wrote:

> +#define nla_type(attr) ((attr)->nla_type & NLA_TYPE_MASK)
> +
> +void newline(struct rd *rd)
> +{
> + if (rd->json_output)
> + jsonw_end_array(rd->jw);
> + else
> + pr_out("\n");
> +}
> +
> +void newline_indent(struct rd *rd)
> +{
> + newline(rd);
> + if (!rd->json_output)
> + pr_out("");
> +}
> +
> +static int print_provider_string(struct rd *rd, const char *key_str,
> +  const char *val_str)
> +{
> + if (rd->json_output) {
> + jsonw_string_field(rd->jw, key_str, val_str);
> + return 0;
> + } else {
> + return pr_out("%s %s ", key_str, val_str);
> + }
> +}
> +
> +static int print_provider_s32(struct rd *rd, const char *key_str, int32_t 
> val,
> +   enum rdma_nldev_print_type print_type)
> +{
> + if (rd->json_output) {
> + jsonw_int_field(rd->jw, key_str, val);
> + return 0;
> + }
> + switch (print_type) {
> + case RDMA_NLDEV_PRINT_TYPE_UNSPEC:
> + return pr_out("%s %d ", key_str, val);
> + case RDMA_NLDEV_PRINT_TYPE_HEX:
> + return pr_out("%s 0x%x ", key_str, val);
> + default:
> + return -EINVAL;
> + }
> +}
> +

This code should be converted to the json_print library, which handles the
different output modes, rather than rolling its own equivalent functionality.
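
The print-type dispatch in the quoted print_provider_s32() hunk can be
illustrated with a short, self-contained sketch. The constants mirror the
patch's enum rdma_nldev_print_type names and are illustrative only, not a
real library API:

```python
# Hypothetical mirror of the patch's print-type dispatch: a signed
# 32-bit attribute renders as "%d " by default, or "0x%x " when the
# tuple carries the HEX print type.
PRINT_TYPE_UNSPEC = 0
PRINT_TYPE_HEX = 1

def render_s32(key, val, print_type=PRINT_TYPE_UNSPEC):
    """Render a signed 32-bit provider attribute: decimal by default,
    hex when the tuple carries the HEX print type."""
    if print_type == PRINT_TYPE_HEX:
        return '%s 0x%x ' % (key, val)
    return '%s %d ' % (key, val)

assert render_s32('cidx', 85) == 'cidx 85 '
assert render_s32('flags', 0, PRINT_TYPE_HEX) == 'flags 0x0 '
```

A shared helper like this (json or plain, decimal or hex) is the kind of
single dispatch point the review is asking the tool to reuse.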


Re: [PATCH] connector: add parent pid and tgid to coredump and exit events

2018-04-30 Thread David Miller
From: Evgeniy Polyakov 
Date: Mon, 30 Apr 2018 18:01:30 +0300

> Stefan, hi
> 
> Sorry for delay.
> 
> 26.04.2018, 15:04, "Stefan Strogin" :
>> Hi David, Evgeniy,
>>
>> Sorry to bother you, but could you please comment about the UAPI change and 
>> the patch?
> 
> With a 4-byte pid_t everything looks fine, and I do not know of an arch where
> pid is currently larger, so it looks safe.
> 
> David, please pull it into your tree, or should it go via different path?
> 
> Acked-by: Evgeniy Polyakov 

After this much time it needs to be resubmitted.


Re: [PATCH bpf-next 2/3] bpf: fix formatting for bpf_get_stack() helper doc

2018-04-30 Thread Quentin Monnet
2018-04-30 09:12 UTC-0600 ~ David Ahern 
> On 4/30/18 9:08 AM, Alexei Starovoitov wrote:
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 530ff6588d8f..8daef7326bb7 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -1770,33 +1770,33 @@ union bpf_attr {
>>>   *
>>>   * int bpf_get_stack(struct pt_regs *regs, void *buf, u32 size, u64 flags)
>>>   * Description
>>> - * Return a user or a kernel stack in bpf program provided buffer.
>>> - * To achieve this, the helper needs *ctx*, which is a pointer
>>> + * Return a user or a kernel stack in bpf program provided 
>>> buffer.
>>> + * To achieve this, the helper needs *ctx*, which is a 
>>> pointer
>> I still don't quite get the difference.
>> It's replacing 2 tabs in above with 1 space + 2 tabs ?

Yes, exactly. (Plus in this case, the "::" a few lines below has a missing
tab).

>> Can you please teach the python script to accept both?
>> I bet that will be recurring mistake and it's impossible to spot in code 
>> review.
> And checkpatch throws an error on the 1 space + 2 tabs so it gets
> confusing on which format should be used.

Sorry about that :/. I will send a patch to make the script more flexible.

Quentin



Re: [PATCH bpf-next 2/3] bpf: fix formatting for bpf_get_stack() helper doc

2018-04-30 Thread David Ahern
On 4/30/18 9:08 AM, Alexei Starovoitov wrote:
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 530ff6588d8f..8daef7326bb7 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1770,33 +1770,33 @@ union bpf_attr {
>>   *
>>   * int bpf_get_stack(struct pt_regs *regs, void *buf, u32 size, u64 flags)
>>   *  Description
>> - *  Return a user or a kernel stack in bpf program provided buffer.
>> - *  To achieve this, the helper needs *ctx*, which is a pointer
>> + *  Return a user or a kernel stack in bpf program provided buffer.
>> + *  To achieve this, the helper needs *ctx*, which is a pointer
> 
> I still don't quite get the difference.
> It's replacing 2 tabs in above with 1 space + 2 tabs ?
> Can you please teach the python script to accept both?
> I bet that will be recurring mistake and it's impossible to spot in code 
> review.
> 

And checkpatch throws an error on the 1 space + 2 tabs so it gets
confusing on which format should be used.


[PATCH RFC iproute2-next 2/2] rdma: print provider resource attributes

2018-04-30 Thread Steve Wise
This enhancement allows printing rdma device-specific state, if provided
by the kernel.  This is done in a generic manner, so rdma tool doesn't
need to know about the details of every type of rdma device.

Provider attributes for an rdma resource are in the form of key/value
tuples, where the key is a string and the value can
be any supported provider attribute.  The print_type attribute, if present,
provides a print format to use vs the standard print format for the type.
For example, the default print type for a PROVIDER_S32 value is "%d ",
but "0x%x " if the print_type of PRINT_TYPE_HEX is included in the tuple.

Provider resources are only printed when the -dd flag is present.
If -p is present, then the output is formatted to not exceed 80 columns,
otherwise it is printed as a single row to be grep/awk friendly.

Example output:

# rdma resource show qp lqpn 1028 -dd -p
link cxgb4_0/- lqpn 1028 rqpn 0 type RC state RTS rq-psn 0 sq-psn 0 
path-mig-state MIGRATED pid 0 comm [nvme_rdma]
sqid 1028 flushed 0 memsize 123968 cidx 85 pidx 85 wq_pidx 106 flush_cidx 
85 in_use 0
size 386 flags 0x0 rqid 1029 memsize 16768 cidx 43 pidx 41 wq_pidx 171 msn 
44 rqt_hwaddr 0x2a8a5d00
rqt_size 256 in_use 128 size 130 idx 43 wr_id 0x881057c03408 idx 40 
wr_id 0x881057c033f0

Signed-off-by: Steve Wise 
---
 rdma/rdma.c  |   7 ++-
 rdma/rdma.h  |  11 
 rdma/res.c   |  30 +++--
 rdma/utils.c | 194 +++
 4 files changed, 221 insertions(+), 21 deletions(-)

diff --git a/rdma/rdma.c b/rdma/rdma.c
index b43e538..c7c8b83 100644
--- a/rdma/rdma.c
+++ b/rdma/rdma.c
@@ -132,6 +132,7 @@ int main(int argc, char **argv)
const char *batch_file = NULL;
bool pretty_output = false;
bool show_details = false;
+   bool show_provider_details = false;
bool json_output = false;
bool force = false;
char *filename;
@@ -152,7 +153,10 @@ int main(int argc, char **argv)
pretty_output = true;
break;
case 'd':
-   show_details = true;
+   if (show_details)
+   show_provider_details = true;
+   else
+   show_details = true;
break;
case 'j':
json_output = true;
@@ -180,6 +184,7 @@ int main(int argc, char **argv)
argv += optind;
 
rd.show_details = show_details;
+   rd.show_provider_details = show_provider_details;
rd.json_output = json_output;
rd.pretty_output = pretty_output;
 
diff --git a/rdma/rdma.h b/rdma/rdma.h
index 1908fc4..e9581fe 100644
--- a/rdma/rdma.h
+++ b/rdma/rdma.h
@@ -55,6 +55,7 @@ struct rd {
char **argv;
char *filename;
bool show_details;
+   bool show_provider_details;
struct list_head dev_map_list;
uint32_t dev_idx;
uint32_t port_idx;
@@ -115,4 +116,14 @@ int rd_recv_msg(struct rd *rd, mnl_cb_t callback, void 
*data, uint32_t seq);
 void rd_prepare_msg(struct rd *rd, uint32_t cmd, uint32_t *seq, uint16_t 
flags);
 int rd_dev_init_cb(const struct nlmsghdr *nlh, void *data);
 int rd_attr_cb(const struct nlattr *attr, void *data);
+int rd_attr_check(const struct nlattr *attr, int *typep);
+
+/*
+ * Print helpers
+ */
+void print_provider_table(struct rd *rd, struct nlattr *tb);
+void newline(struct rd *rd);
+void newline_indent(struct rd *rd);
+#define MAX_LINE_LENGTH 80
+
 #endif /* _RDMA_TOOL_H_ */
diff --git a/rdma/res.c b/rdma/res.c
index 1a0aab6..bc0aef5 100644
--- a/rdma/res.c
+++ b/rdma/res.c
@@ -439,10 +439,8 @@ static int res_qp_parse_cb(const struct nlmsghdr *nlh, 
void *data)
if (nla_line[RDMA_NLDEV_ATTR_RES_PID])
free(comm);
 
-   if (rd->json_output)
-   jsonw_end_array(rd->jw);
-   else
-   pr_out("\n");
+   print_provider_table(rd, nla_line[RDMA_NLDEV_ATTR_PROVIDER]);
+   newline(rd);
}
return MNL_CB_OK;
 }
@@ -678,10 +676,8 @@ static int res_cm_id_parse_cb(const struct nlmsghdr *nlh, 
void *data)
if (nla_line[RDMA_NLDEV_ATTR_RES_PID])
free(comm);
 
-   if (rd->json_output)
-   jsonw_end_array(rd->jw);
-   else
-   pr_out("\n");
+   print_provider_table(rd, nla_line[RDMA_NLDEV_ATTR_PROVIDER]);
+   newline(rd);
}
return MNL_CB_OK;
 }
@@ -804,10 +800,8 @@ static int res_cq_parse_cb(const struct nlmsghdr *nlh, 
void *data)
if (nla_line[RDMA_NLDEV_ATTR_RES_PID])
free(comm);
 
-   if (rd->json_output)
-   jsonw_end_array(rd->jw);
-   else
- 

[PATCH RFC iproute2-next 1/2] rdma: update rdma_netlink.h to get provider attrs

2018-04-30 Thread Steve Wise
Signed-off-by: Steve Wise 
---
 rdma/include/uapi/rdma/rdma_netlink.h | 37 ++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/rdma/include/uapi/rdma/rdma_netlink.h 
b/rdma/include/uapi/rdma/rdma_netlink.h
index 45474f1..faea9d5 100644
--- a/rdma/include/uapi/rdma/rdma_netlink.h
+++ b/rdma/include/uapi/rdma/rdma_netlink.h
@@ -249,10 +249,22 @@ enum rdma_nldev_command {
RDMA_NLDEV_NUM_OPS
 };
 
+enum {
+   RDMA_NLDEV_ATTR_ENTRY_STRLEN = 16,
+};
+
+enum rdma_nldev_print_type {
+   RDMA_NLDEV_PRINT_TYPE_UNSPEC,
+   RDMA_NLDEV_PRINT_TYPE_HEX,
+};
+
 enum rdma_nldev_attr {
/* don't change the order or add anything between, this is ABI! */
RDMA_NLDEV_ATTR_UNSPEC,
 
+   /* Pad attribute for 64b alignment */
+   RDMA_NLDEV_ATTR_PAD = RDMA_NLDEV_ATTR_UNSPEC,
+
/* Identifier for ib_device */
RDMA_NLDEV_ATTR_DEV_INDEX,  /* u32 */
 
@@ -387,8 +399,31 @@ enum rdma_nldev_attr {
RDMA_NLDEV_ATTR_RES_PD_ENTRY,   /* nested table */
RDMA_NLDEV_ATTR_RES_LOCAL_DMA_LKEY, /* u32 */
RDMA_NLDEV_ATTR_RES_UNSAFE_GLOBAL_RKEY, /* u32 */
+   /*
+* provider-specific attributes.
+*/
+   RDMA_NLDEV_ATTR_PROVIDER,   /* nested table */
+   RDMA_NLDEV_ATTR_PROVIDER_ENTRY, /* nested table */
+   RDMA_NLDEV_ATTR_PROVIDER_STRING,/* string */
+   /*
+* u8 values from enum rdma_nldev_print_type
+*/
+   RDMA_NLDEV_ATTR_PROVIDER_PRINT_TYPE,/* u8 */
+   RDMA_NLDEV_ATTR_PROVIDER_S32,   /* s32 */
+   RDMA_NLDEV_ATTR_PROVIDER_U32,   /* u32 */
+   RDMA_NLDEV_ATTR_PROVIDER_S64,   /* s64 */
+   RDMA_NLDEV_ATTR_PROVIDER_U64,   /* u64 */
 
-   /* Netdev information for relevant protocols, like RoCE and iWARP */
+   /*
+* Provides logical name and index of netdevice which is
+* connected to physical port. This information is relevant
+* for RoCE and iWARP.
+*
+* The netdevices which are associated with containers are
+* supposed to be exported together with GID table once it
+* will be exposed through the netlink. Because the
+* associated netdevices are properties of GIDs.
+*/
RDMA_NLDEV_ATTR_NDEV_INDEX, /* u32 */
RDMA_NLDEV_ATTR_NDEV_NAME,  /* string */
 
-- 
1.8.3.1



[PATCH RFC iproute2-next 0/2] RDMA tool provider resource tracking

2018-04-30 Thread Steve Wise
Hello,

This series enhances the iproute2 rdma tool to include displaying
provider-specific resource attributes.  It is the user-space part of
the kernel provider resource tracking series currently under
review [1].

This is an RFC and should not be merged yet.  Once [1] is in the
linux-rdma for-next branch (and all reviewing is complete), I'll post
a final version and request merging.

Thanks,

Steve.

[1] https://www.spinics.net/lists/linux-rdma/msg64013.html

Steve Wise (2):
  rdma: update rdma_netlink.h to get provider attrs
  rdma: print provider resource attributes

 rdma/include/uapi/rdma/rdma_netlink.h |  37 ++-
 rdma/rdma.c   |   7 +-
 rdma/rdma.h   |  11 ++
 rdma/res.c|  30 ++
 rdma/utils.c  | 194 ++
 5 files changed, 257 insertions(+), 22 deletions(-)

-- 
1.8.3.1



Re: [PATCH bpf-next 2/3] bpf: fix formatting for bpf_get_stack() helper doc

2018-04-30 Thread Alexei Starovoitov
On Mon, Apr 30, 2018 at 11:39:04AM +0100, Quentin Monnet wrote:
> Fix formatting (indent) for bpf_get_stack() helper documentation, so
> that the doc is rendered correctly with the Python script.
> 
> Fixes: c195651e565a ("bpf: add bpf_get_stack helper")
> Cc: Yonghong Song 
> Signed-off-by: Quentin Monnet 
> ---
> 
> Note: The error was a missing space between the '*' marking the
> comments, and the tabs. This expected mixed indent comes from the fact I
> started to write the doc as a RST, then copied my contents (tabs
> included) in the header file and added a " * " (with a space) prefix
> everywhere.
> 
> On a second thought, using such indent style was maybe... not my best idea
> ever. Anyway, if indent for documenting eBPF helpers really gets too painful, we
> could relax parsing rules in the Python script to make things easier.
> ---
>  include/uapi/linux/bpf.h | 54 
> 
>  1 file changed, 27 insertions(+), 27 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 530ff6588d8f..8daef7326bb7 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1770,33 +1770,33 @@ union bpf_attr {
>   *
>   * int bpf_get_stack(struct pt_regs *regs, void *buf, u32 size, u64 flags)
>   *   Description
> - *   Return a user or a kernel stack in bpf program provided buffer.
> - *   To achieve this, the helper needs *ctx*, which is a pointer
> + *   Return a user or a kernel stack in bpf program provided buffer.
> + *   To achieve this, the helper needs *ctx*, which is a pointer

I still don't quite get the difference.
It's replacing 2 tabs above with 1 space + 2 tabs?
Can you please teach the python script to accept both?
I bet that will be a recurring mistake, and it's impossible to spot in code review.



Re: [PATCH] connector: add parent pid and tgid to coredump and exit events

2018-04-30 Thread Evgeniy Polyakov
Stefan, hi

Sorry for delay.

26.04.2018, 15:04, "Stefan Strogin" :
> Hi David, Evgeniy,
>
> Sorry to bother you, but could you please comment about the UAPI change and 
> the patch?

With a 4-byte pid_t everything looks fine, and I do not know of an arch where pid is 
larger currently, so it looks safe.

David, please pull it into your tree, or should it go via a different path?

Acked-by: Evgeniy Polyakov 


>>  I don't see how it breaks UAPI. The point is that structures
>>  coredump_proc_event and exit_proc_event are members of *union*
>>  event_data, thus position of the existing data in the structure is
>>  unchanged. Furthermore, this change won't increase size of struct
>>  proc_event, because comm_proc_event (also a member of event_data) is
>>  of bigger size than the changed structures.
>>
>>  If I'm wrong, could you please explain what exactly will the change
>>  break in UAPI?
>>
>>  On 30/03/18 19:59, David Miller wrote:
>>>  From: Stefan Strogin 
>>>  Date: Thu, 29 Mar 2018 17:12:47 +0300
>>>
  diff --git a/include/uapi/linux/cn_proc.h b/include/uapi/linux/cn_proc.h
  index 68ff25414700..db210625cee8 100644
  --- a/include/uapi/linux/cn_proc.h
  +++ b/include/uapi/linux/cn_proc.h
  @@ -116,12 +116,16 @@ struct proc_event {
   struct coredump_proc_event {
   __kernel_pid_t process_pid;
   __kernel_pid_t process_tgid;
  + __kernel_pid_t parent_pid;
  + __kernel_pid_t parent_tgid;
   } coredump;

   struct exit_proc_event {
   __kernel_pid_t process_pid;
   __kernel_pid_t process_tgid;
   __u32 exit_code, exit_signal;
  + __kernel_pid_t parent_pid;
  + __kernel_pid_t parent_tgid;
   } exit;

   } event_data;
>>>
>>>  I don't think you can add these members without breaking UAPI.



[PATCH net-next 3/4] net/smc: handle ioctls SIOCINQ, SIOCOUTQ, and SIOCOUTQNSD

2018-04-30 Thread Ursula Braun
SIOCINQ returns the amount of unread data in the RMB.
SIOCOUTQ returns the amount of unsent or unacked sent data in the send
buffer.
SIOCOUTQNSD returns the amount of data prepared for sending, but
not yet sent.

Signed-off-by: Ursula Braun 
---
 net/smc/af_smc.c | 33 ++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 961b8eff9553..823ea3371575 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "smc.h"
 #include "smc_clc.h"
@@ -1389,12 +1390,38 @@ static int smc_ioctl(struct socket *sock, unsigned int cmd,
 unsigned long arg)
 {
struct smc_sock *smc;
+   int answ;
 
smc = smc_sk(sock->sk);
-   if (smc->use_fallback)
+   if (smc->use_fallback) {
+   if (!smc->clcsock)
+   return -EBADF;
return smc->clcsock->ops->ioctl(smc->clcsock, cmd, arg);
-   else
-   return sock_no_ioctl(sock, cmd, arg);
+   }
+   switch (cmd) {
+   case SIOCINQ: /* same as FIONREAD */
+   if (smc->sk.sk_state == SMC_LISTEN)
+   return -EINVAL;
+   answ = atomic_read(&smc->conn.bytes_to_rcv);
+   break;
+   case SIOCOUTQ:
+   /* output queue size (not send + not acked) */
+   if (smc->sk.sk_state == SMC_LISTEN)
+   return -EINVAL;
+   answ = smc->conn.sndbuf_size -
+   atomic_read(&smc->conn.sndbuf_space);
+   break;
+   case SIOCOUTQNSD:
+   /* output queue size (not send only) */
+   if (smc->sk.sk_state == SMC_LISTEN)
+   return -EINVAL;
+   answ = smc_tx_prepared_sends(&smc->conn);
+   break;
+   default:
+   return -ENOIOCTLCMD;
+   }
+
+   return put_user(answ, (int __user *)arg);
 }
 
 static ssize_t smc_sendpage(struct socket *sock, struct page *page,
-- 
2.13.5



[PATCH net-next 4/4] net/smc: determine vlan_id of stacked net_device

2018-04-30 Thread Ursula Braun
An SMC link group is bound to a specific vlan_id. Its link uses
the RoCE-GIDs established for the specific vlan_id. This patch makes
sure the appropriate vlan_id is determined for stacked scenarios like
for instance a master bonding device with vlan devices enslaved.

Signed-off-by: Ursula Braun 
---
 net/smc/smc_core.c | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index d9247765aff3..1f3ea62fac5c 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -360,7 +360,8 @@ void smc_lgr_terminate(struct smc_link_group *lgr)
 static int smc_vlan_by_tcpsk(struct socket *clcsock, unsigned short *vlan_id)
 {
struct dst_entry *dst = sk_dst_get(clcsock->sk);
-   int rc = 0;
+   struct net_device *ndev;
+   int i, nest_lvl, rc = 0;
 
*vlan_id = 0;
if (!dst) {
@@ -372,8 +373,27 @@ static int smc_vlan_by_tcpsk(struct socket *clcsock, unsigned short *vlan_id)
goto out_rel;
}
 
-   if (is_vlan_dev(dst->dev))
-   *vlan_id = vlan_dev_vlan_id(dst->dev);
+   ndev = dst->dev;
+   if (is_vlan_dev(ndev)) {
+   *vlan_id = vlan_dev_vlan_id(ndev);
+   goto out_rel;
+   }
+
+   rtnl_lock();
+   nest_lvl = dev_get_nest_level(ndev);
+   for (i = 0; i < nest_lvl; i++) {
+   struct list_head *lower = &ndev->adj_list.lower;
+
+   if (list_empty(lower))
+   break;
+   lower = lower->next;
+   ndev = (struct net_device *)netdev_lower_get_next(ndev, &lower);
+   if (is_vlan_dev(ndev)) {
+   *vlan_id = vlan_dev_vlan_id(ndev);
+   break;
+   }
+   }
+   rtnl_unlock();
 
 out_rel:
dst_release(dst);
-- 
2.13.5



[PATCH net-next 1/4] net/smc: periodic testlink support

2018-04-30 Thread Ursula Braun
From: Karsten Graul 

Add periodic LLC testlink support to ensure the link is still active.
The interval time is initialized using the value of
sysctl_tcp_keepalive_time.

Signed-off-by: Karsten Graul 
Signed-off-by: Ursula Braun 
---
 net/smc/af_smc.c   |  6 --
 net/smc/smc_core.c |  2 ++
 net/smc/smc_core.h |  4 
 net/smc/smc_llc.c  | 62 +-
 net/smc/smc_llc.h  |  3 +++
 net/smc/smc_wr.c   |  1 +
 6 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 20aa4175b9f8..961b8eff9553 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -294,6 +294,7 @@ static void smc_copy_sock_settings_to_smc(struct smc_sock 
*smc)
 
 static int smc_clnt_conf_first_link(struct smc_sock *smc)
 {
+   struct net *net = sock_net(smc->clcsock->sk);
struct smc_link_group *lgr = smc->conn.lgr;
struct smc_link *link;
int rest;
@@ -353,7 +354,7 @@ static int smc_clnt_conf_first_link(struct smc_sock *smc)
if (rc < 0)
return SMC_CLC_DECL_TCL;
 
-   link->state = SMC_LNK_ACTIVE;
+   smc_llc_link_active(link, net->ipv4.sysctl_tcp_keepalive_time);
 
return 0;
 }
@@ -715,6 +716,7 @@ void smc_close_non_accepted(struct sock *sk)
 
 static int smc_serv_conf_first_link(struct smc_sock *smc)
 {
+   struct net *net = sock_net(smc->clcsock->sk);
struct smc_link_group *lgr = smc->conn.lgr;
struct smc_link *link;
int rest;
@@ -769,7 +771,7 @@ static int smc_serv_conf_first_link(struct smc_sock *smc)
return rc;
}
 
-   link->state = SMC_LNK_ACTIVE;
+   smc_llc_link_active(link, net->ipv4.sysctl_tcp_keepalive_time);
 
return 0;
 }
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index f44f6803f7ff..d9247765aff3 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -310,6 +310,7 @@ static void smc_lgr_free_bufs(struct smc_link_group *lgr)
 /* remove a link group */
 void smc_lgr_free(struct smc_link_group *lgr)
 {
+   smc_llc_link_flush(&lgr->lnk[SMC_SINGLE_LINK]);
smc_lgr_free_bufs(lgr);
smc_link_clear(>lnk[SMC_SINGLE_LINK]);
kfree(lgr);
@@ -332,6 +333,7 @@ void smc_lgr_terminate(struct smc_link_group *lgr)
struct rb_node *node;
 
smc_lgr_forget(lgr);
+   smc_llc_link_inactive(&lgr->lnk[SMC_SINGLE_LINK]);
 
write_lock_bh(&lgr->conns_lock);
node = rb_first(&lgr->conns_all);
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 07e2a393e6d9..97339f03ba79 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -79,6 +79,7 @@ struct smc_link {
dma_addr_t  wr_rx_dma_addr; /* DMA address of wr_rx_bufs */
u64 wr_rx_id;   /* seq # of last recv WR */
u32 wr_rx_cnt;  /* number of WR recv buffers */
+   unsigned long   wr_rx_tstamp;   /* jiffies when last buf rx */
 
struct ib_reg_wrwr_reg; /* WR register memory region */
wait_queue_head_t   wr_reg_wait;/* wait for wr_reg result */
@@ -101,6 +102,9 @@ struct smc_link {
int llc_confirm_resp_rc; /* rc from conf_resp msg */
struct completion   llc_add;/* wait for rx of add link */
struct completion   llc_add_resp;   /* wait for rx of add link rsp*/
+   struct delayed_work llc_testlink_wrk; /* testlink worker */
+   struct completion   llc_testlink_resp; /* wait for rx of testlink */
+   int llc_testlink_time; /* testlink interval */
 };
 
 /* For now we just allow one parallel link per link group. The SMC protocol
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index ea4b21981b4b..33b4d856f4c6 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -397,7 +397,8 @@ static void smc_llc_rx_test_link(struct smc_link *link,
 struct smc_llc_msg_test_link *llc)
 {
if (llc->hd.flags & SMC_LLC_FLAG_RESP) {
-   /* unused as long as we don't send this type of msg */
+   if (link->state == SMC_LNK_ACTIVE)
+   complete(&link->llc_testlink_resp);
} else {
smc_llc_send_test_link(link, llc->user_data, SMC_LLC_RESP);
}
@@ -502,6 +503,65 @@ static void smc_llc_rx_handler(struct ib_wc *wc, void *buf)
}
 }
 
+/* worker /
+
+static void smc_llc_testlink_work(struct work_struct *work)
+{
+   struct smc_link *link = container_of(to_delayed_work(work),
+struct smc_link, llc_testlink_wrk);
+   unsigned long next_interval;
+   struct smc_link_group *lgr;
+   unsigned long expire_time;
+   u8 user_data[16] = { 0 };
+   int rc;
+
+   lgr = container_of(link, struct smc_link_group, 

[PATCH net-next 2/4] net/smc: ipv6 support for smc_diag.c

2018-04-30 Thread Ursula Braun
From: Karsten Graul 

Update smc_diag.c to support ipv6 addresses on the diagnosis interface.

Signed-off-by: Karsten Graul 
Signed-off-by: Ursula Braun 
---
 net/smc/smc_diag.c | 37 -
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/net/smc/smc_diag.c b/net/smc/smc_diag.c
index 427b91c1c964..9a8d0db7bb88 100644
--- a/net/smc/smc_diag.c
+++ b/net/smc/smc_diag.c
@@ -38,17 +38,25 @@ static void smc_diag_msg_common_fill(struct smc_diag_msg *r, struct sock *sk)
 {
struct smc_sock *smc = smc_sk(sk);
 
-   r->diag_family = sk->sk_family;
if (!smc->clcsock)
return;
r->id.idiag_sport = htons(smc->clcsock->sk->sk_num);
r->id.idiag_dport = smc->clcsock->sk->sk_dport;
r->id.idiag_if = smc->clcsock->sk->sk_bound_dev_if;
sock_diag_save_cookie(sk, r->id.idiag_cookie);
-   memset(&r->id.idiag_src, 0, sizeof(r->id.idiag_src));
-   memset(&r->id.idiag_dst, 0, sizeof(r->id.idiag_dst));
-   r->id.idiag_src[0] = smc->clcsock->sk->sk_rcv_saddr;
-   r->id.idiag_dst[0] = smc->clcsock->sk->sk_daddr;
+   if (sk->sk_protocol == SMCPROTO_SMC6) {
+   r->diag_family = PF_INET6;
+   memcpy(&r->id.idiag_src, &smc->clcsock->sk->sk_v6_rcv_saddr,
+  sizeof(smc->clcsock->sk->sk_v6_rcv_saddr));
+   memcpy(&r->id.idiag_dst, &smc->clcsock->sk->sk_v6_daddr,
+  sizeof(smc->clcsock->sk->sk_v6_daddr));
+   } else {
+   r->diag_family = PF_INET;
+   memset(&r->id.idiag_src, 0, sizeof(r->id.idiag_src));
+   memset(&r->id.idiag_dst, 0, sizeof(r->id.idiag_dst));
+   r->id.idiag_src[0] = smc->clcsock->sk->sk_rcv_saddr;
+   r->id.idiag_dst[0] = smc->clcsock->sk->sk_daddr;
+   }
 }
 
 static int smc_diag_msg_attrs_fill(struct sock *sk, struct sk_buff *skb,
@@ -153,7 +161,8 @@ static int __smc_diag_dump(struct sock *sk, struct sk_buff *skb,
return -EMSGSIZE;
 }
 
-static int smc_diag_dump(struct sk_buff *skb, struct netlink_callback *cb)
+static int smc_diag_dump_proto(struct proto *prot, struct sk_buff *skb,
+  struct netlink_callback *cb)
 {
struct net *net = sock_net(skb->sk);
struct nlattr *bc = NULL;
@@ -161,8 +170,8 @@ static int smc_diag_dump(struct sk_buff *skb, struct netlink_callback *cb)
struct sock *sk;
int rc = 0;
 
-   read_lock(&smc_proto.h.smc_hash->lock);
-   head = &smc_proto.h.smc_hash->ht;
+   read_lock(&prot->h.smc_hash->lock);
+   head = &prot->h.smc_hash->ht;
if (hlist_empty(head))
goto out;
 
@@ -175,7 +184,17 @@ static int smc_diag_dump(struct sk_buff *skb, struct netlink_callback *cb)
}
 
 out:
-   read_unlock(&smc_proto.h.smc_hash->lock);
+   read_unlock(&prot->h.smc_hash->lock);
+   return rc;
+}
+
+static int smc_diag_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+   int rc = 0;
+
+   rc = smc_diag_dump_proto(&smc_proto, skb, cb);
+   if (!rc)
+   rc = smc_diag_dump_proto(&smc_proto6, skb, cb);
return rc;
 }
 
-- 
2.13.5



[PATCH net-next 0/4] net/smc: fixes 2018/04/30

2018-04-30 Thread Ursula Braun
From: Ursula Braun 

Dave,

here are 4 smc patches for net-next covering different areas:
   * link health check
   * diagnostics for IPv6 smc sockets
   * ioctl
   * improvement for vlan determination

Thanks, Ursula

Karsten Graul (2):
  net/smc: periodic testlink support
  net/smc: ipv6 support for smc_diag.c

Ursula Braun (2):
  net/smc: handle ioctls SIOCINQ, SIOCOUTQ, and SIOCOUTQNSD
  net/smc: determine vlan_id of stacked net_device

 net/smc/af_smc.c   | 39 +-
 net/smc/smc_core.c | 28 +---
 net/smc/smc_core.h |  4 
 net/smc/smc_diag.c | 37 
 net/smc/smc_llc.c  | 62 +-
 net/smc/smc_llc.h  |  3 +++
 net/smc/smc_wr.c   |  1 +
 7 files changed, 156 insertions(+), 18 deletions(-)

-- 
2.13.5



Re: [PATCH 20/39] afs: simplify procfs code

2018-04-30 Thread David Howells
Christoph Hellwig  wrote:

> I don't think you should need any of these.  seq_file_net or
> seq_file_single_net will return you the net_ns based on a struct
> seq_file.  And even from your write routines you can reach the
> seq_file in file->private pretty easily.

You've taken away things like single_open/release_net() which means I can't
supply my own fops and use the proc_net stuff.  I wonder if I should add a
write op to struct proc_dir_entry.

David


Re: [PATCH bpf-next] tools include uapi: Grab a copy of linux/erspan.h

2018-04-30 Thread Daniel Borkmann
On 04/30/2018 04:26 PM, William Tu wrote:
> Bring the erspan uapi header file so BPF tunnel helpers can use it.
> 
> Fixes: 933a741e3b82 ("selftests/bpf: bpf tunnel test.")
> Reported-by: Yonghong Song 
> Signed-off-by: William Tu 

Thanks for the patch, William! I also Cc'ed Yonghong here, so he has a
chance to try it out.

