
Re: [PATCH] net: xfrm: use preempt-safe this_cpu_read() in ipcomp_alloc_tfms()

2018-03-07 Thread Herbert Xu
On Wed, Mar 07, 2018 at 11:24:16AM -0800, Greg Hackmann wrote:
> f7c83bcbfaf5 ("net: xfrm: use __this_cpu_read per-cpu helper") added a
> __this_cpu_read() call inside ipcomp_alloc_tfms().  Since this call was
> introduced, the rules around per-cpu accessors have been tightened and
> __this_cpu_read() cannot be used in a preemptible context.
> 
> syzkaller reported this leading to the following kernel BUG while
> fuzzing sendmsg:

How about reverting f7c83bcbfaf5 instead?

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [bpf-next V1 PATCH 09/15] mlx5: register a memory model when XDP is enabled

2018-03-07 Thread Jesper Dangaard Brouer
On Wed, 7 Mar 2018 13:50:19 +0200
Tariq Toukan  wrote:

> On 06/03/2018 11:48 PM, Jesper Dangaard Brouer wrote:
> > Now all the users of ndo_xdp_xmit have been converted to use 
> > xdp_return_frame.
> > This enable a different memory model, thus activating another code path
> > in the xdp_return_frame API.
> > 
> > Signed-off-by: Jesper Dangaard Brouer 
> > ---
> >   drivers/net/ethernet/mellanox/mlx5/core/en_main.c |7 +++
> >   1 file changed, 7 insertions(+)
> > 
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
> > b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > index da94c8cba5ee..51482943c583 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > @@ -506,6 +506,13 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
> > rq->mkey_be = c->mkey_be;
> > }
> >   
> > +   /* This must only be activate for order-0 pages */
> > +   if (rq->xdp_prog)
> > +   err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
> > +MEM_TYPE_PAGE_ORDER0, NULL);
> > +   if (err < 0)
> > +   goto err_rq_wq_destroy;
> > +  
> 
> Use "if (err)" here, instead of changing this in next patch.
> Also, get it into the "if (rq->xdp_prog)" block.

Good point, fixed!

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


[PATCH iproute2-next v4 3/4] iplink: Follow documented behaviour when "index" is given

2018-03-07 Thread Serhey Popovych
Both ip-link(8) and the error message printed when the "index" parameter
is given in the set/delete case say that index can only be given during
network device creation.

Follow this documented behaviour and get rid of the ambiguity when both
"dev" and "index" are specified in the ip link delete scenario
(previously "index" was silently ignored in favour of "dev").

Prohibit "index" when configuring/deleting a group of network devices.

Signed-off-by: Serhey Popovych 
---
 ip/iplink.c |   11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/ip/iplink.c b/ip/iplink.c
index b4307ab..6d3ebde 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -964,6 +964,12 @@ static int iplink_modify(int cmd, unsigned int flags, int 
argc, char **argv)
argc -= ret;
argv += ret;
 
+   if (!(flags & NLM_F_CREATE) && index) {
+   fprintf(stderr,
+   "index can be used only when creating devices.\n");
+   exit(-1);
+   }
+
if (group != -1) {
if (dev)
addattr_l(&req.n, sizeof(req), IFLA_GROUP,
@@ -994,11 +1000,6 @@ static int iplink_modify(int cmd, unsigned int flags, int 
argc, char **argv)
"Not enough information: \"dev\" argument is 
required.\n");
exit(-1);
}
-   if (cmd == RTM_NEWLINK && index) {
-   fprintf(stderr,
-   "index can be used only when creating 
devices.\n");
-   exit(-1);
-   }
 
req.i.ifi_index = ll_name_to_index(dev);
if (!req.i.ifi_index)
-- 
1.7.10.4



[PATCH iproute2-next v4 4/4] iplink: Perform most of request buffer setups and checks in iplink_parse()

2018-03-07 Thread Serhey Popovych
Let other users of iplink_parse() (e.g. link_veth.c) benefit from the
additional attribute checks and setups previously made in
iplink_modify(). This catches most weird combinations of parameters in
peer device configuration.

Drop @name, @dev, @link, @group and @index from the iplink_parse()
parameter list: they are not needed outside.

While there, change return -1 to exit(-1) for group parsing errors: we
want to stop further command processing (unless the -force option is
given) so the failing line is easy to identify.

Signed-off-by: Serhey Popovych 
---
 ip/ip_common.h|4 +-
 ip/iplink.c   |  143 ++---
 ip/iplink_vxcan.c |   23 +++--
 ip/link_veth.c|   23 +++--
 4 files changed, 82 insertions(+), 111 deletions(-)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index e4e628b..1b89795 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -132,9 +132,7 @@ struct link_util {
 
 struct link_util *get_link_kind(const char *kind);
 
-int iplink_parse(int argc, char **argv, struct iplink_req *req,
-char **name, char **type, char **link, char **dev,
-int *group, int *index);
+int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type);
 
 /* iplink_bridge.c */
 void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len);
diff --git a/ip/iplink.c b/ip/iplink.c
index 6d3ebde..02042ed 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -569,10 +569,11 @@ static int iplink_parse_vf(int vf, int *argcp, char 
***argvp,
return 0;
 }
 
-int iplink_parse(int argc, char **argv, struct iplink_req *req,
-char **name, char **type, char **link, char **dev,
-int *group, int *index)
+int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type)
 {
+   char *name = NULL;
+   char *dev = NULL;
+   char *link = NULL;
int ret, len;
char abuf[32];
int qlen = -1;
@@ -583,9 +584,10 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
int numrxqueues = -1;
int dev_index = 0;
int link_netnsid = -1;
+   int index = 0;
+   int group = -1;
int addr_len = 0;
 
-   *group = -1;
ret = argc;
 
while (argc > 0) {
@@ -597,25 +599,25 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
req->i.ifi_flags &= ~IFF_UP;
} else if (strcmp(*argv, "name") == 0) {
NEXT_ARG();
-   if (*name)
+   if (name)
duparg("name", *argv);
if (check_ifname(*argv))
invarg("\"name\" not a valid ifname", *argv);
-   *name = *argv;
-   if (!*dev) {
-   *dev = *name;
-   dev_index = ll_name_to_index(*dev);
+   name = *argv;
+   if (!dev) {
+   dev = name;
+   dev_index = ll_name_to_index(dev);
}
} else if (strcmp(*argv, "index") == 0) {
NEXT_ARG();
-   if (*index)
+   if (index)
duparg("index", *argv);
-   *index = atoi(*argv);
-   if (*index <= 0)
+   index = atoi(*argv);
+   if (index <= 0)
invarg("Invalid \"index\" value", *argv);
} else if (matches(*argv, "link") == 0) {
NEXT_ARG();
-   *link = *argv;
+   link = *argv;
} else if (matches(*argv, "address") == 0) {
NEXT_ARG();
addr_len = ll_addr_a2n(abuf, sizeof(abuf), *argv);
@@ -661,8 +663,8 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
  generic, drv, offload))
exit(-1);
 
-   if (offload && *name == *dev)
-   *dev = NULL;
+   if (offload && name == dev)
+   dev = NULL;
} else if (strcmp(*argv, "netns") == 0) {
NEXT_ARG();
if (netns != -1)
@@ -755,8 +757,8 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
return -1;
addattr_nest_end(&req->n, vflist);
 
-   if (*name == *dev)
-   *dev = NULL;
+   if (name == dev)
+   dev = NULL;
} else if (matches(*argv, "master") == 0) {
int ifindex;
 
@@ -806,10 +808,11 @@ int iplink_parse(int argc, char 

[PATCH iproute2-next v4 2/4] iplink: Use "dev" and "name" parameters interchangeable when possible

2018-03-07 Thread Serhey Popovych
Both of them accept a network device name as argument, but they have
different meanings:

  dev  - identifies a device by its name,
  name - the name for a specific device.

The only case where they are treated separately is the network device
rename case, where both the ifindex and the new name must be specified.
In the rest of the cases we can assume that dev == name.

With this change we do the following:

  1) Kill the ambiguity when both "dev" and "name" parameters are given
     the same name:

       ip link {add|set} dev veth100a name veth100a ...

  2) Make sure we do not accept "name" more than once.

  3) For VF and XDP treat "name" as "dev". Fail if "dev" is given after
     VF and/or XDP parsing.

  4) Make veth and vxcan accept both "name" and "dev" as their peer
     parameters, effectively following the general ip-link(8) behaviour
     on link create:

       ip link add {name|dev} veth1a type veth peer {name|dev} veth1b

Signed-off-by: Serhey Popovych 
---
 ip/iplink.c |   34 ++
 1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/ip/iplink.c b/ip/iplink.c
index 5471626..b4307ab 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -597,9 +597,15 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
req->i.ifi_flags &= ~IFF_UP;
} else if (strcmp(*argv, "name") == 0) {
NEXT_ARG();
+   if (*name)
+   duparg("name", *argv);
if (check_ifname(*argv))
invarg("\"name\" not a valid ifname", *argv);
*name = *argv;
+   if (!*dev) {
+   *dev = *name;
+   dev_index = ll_name_to_index(*dev);
+   }
} else if (strcmp(*argv, "index") == 0) {
NEXT_ARG();
if (*index)
@@ -654,6 +660,9 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
if (xdp_parse(&argc, &argv, req, dev_index,
  generic, drv, offload))
exit(-1);
+
+   if (offload && *name == *dev)
+   *dev = NULL;
} else if (strcmp(*argv, "netns") == 0) {
NEXT_ARG();
if (netns != -1)
@@ -745,6 +754,9 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
if (len < 0)
return -1;
addattr_nest_end(&req->n, vflist);
+
+   if (*name == *dev)
+   *dev = NULL;
} else if (matches(*argv, "master") == 0) {
int ifindex;
 
@@ -895,7 +907,7 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
 
if (strcmp(*argv, "dev") == 0)
NEXT_ARG();
-   if (*dev)
+   if (*dev != *name)
duparg2("dev", *argv);
if (check_ifname(*argv))
invarg("\"dev\" not a valid ifname", *argv);
@@ -905,6 +917,14 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
argc--; argv++;
}
 
+   /* Allow "ip link add dev" and "ip link add name" */
+   if (!*name)
+   *name = *dev;
+   else if (!*dev)
+   *dev = *name;
+   else if (!strcmp(*name, *dev))
+   *name = *dev;
+
if (dev_index && addr_len) {
int halen = nl_get_ll_addr_len(dev_index);
 
@@ -983,10 +1003,16 @@ static int iplink_modify(int cmd, unsigned int flags, 
int argc, char **argv)
req.i.ifi_index = ll_name_to_index(dev);
if (!req.i.ifi_index)
return nodev(dev);
+
+   /* Not renaming to the same name */
+   if (name == dev)
+   name = NULL;
} else {
-   /* Allow "ip link add dev" and "ip link add name" */
-   if (!name)
-   name = dev;
+   if (name != dev) {
+   fprintf(stderr,
+   "both \"name\" and \"dev\" cannot be used when 
creating devices.\n");
+   exit(-1);
+   }
 
if (link) {
int ifindex;
-- 
1.7.10.4



[PATCH iproute2-next v4 0/4] iplink: Improve iplink_parse()

2018-03-07 Thread Serhey Popovych
This is the main routine to parse ip-link(8) configuration parameters.

Move all code related to command line parsing and validation into it
from iplink_modify(). As a benefit, we reduce the number of arguments
and check most of the weird cases in a single place, to the benefit of
iplink_parse() users.

See the individual patch descriptions for more information.

v4
  Drop patches intended to reduce the number of arguments to
  iplink_parse(): postponed to a series with real use cases.

  Save only ifi_index in iplink_vxcan.c and link_veth.c: no need
  to save the whole ifinfomsg data structure.

  Note that there is no sense in introducing a custom version of
  iplink_parse() for use in iplink_vxcan.c and link_veth.c, because
  there are too many parameters we need to support (all except VF and
  a few others), which would mean huge code duplication.

v3
  Move the vxcan/veth ifinfomsg save/restore to a separate patch to
  make clear the change that performs most of the request buffer
  setups and checks in iplink_parse().

  Update commit message descriptions and remove an extra new line from
  the "utils: Introduce and use nodev() helper routine" patch.

v2
  Terminate via exit() when failing to parse command line arguments
  to help identify failing line in batch mode.

Thanks,
Serhii

Serhey Popovych (4):
  utils: Introduce and use nodev() helper routine
  iplink: Use "dev" and "name" parameters interchangeable when possible
  iplink: Follow documented behaviour when "index" is given
  iplink: Perform most of request buffer setups and checks in
iplink_parse()

 bridge/fdb.c  |   17 ++
 bridge/link.c |8 +--
 bridge/mdb.c  |   19 ++
 bridge/vlan.c |7 +--
 include/utils.h   |1 +
 ip/ip6tunnel.c|6 +-
 ip/ip_common.h|4 +-
 ip/ipaddress.c|7 +--
 ip/iplink.c   |  163 +++--
 ip/iplink_bond.c  |4 +-
 ip/iplink_bridge.c|7 +--
 ip/iplink_vxcan.c |   23 ++-
 ip/iplink_vxlan.c |7 +--
 ip/ipmroute.c |7 +--
 ip/ipneigh.c  |   14 ++---
 ip/ipntable.c |6 +-
 ip/iproute.c  |   36 ---
 ip/iproute_lwtunnel.c |4 +-
 ip/iptunnel.c |6 +-
 ip/link_gre.c |7 +--
 ip/link_gre6.c|7 +--
 ip/link_ip6tnl.c  |4 +-
 ip/link_iptnl.c   |4 +-
 ip/link_veth.c|   23 ++-
 ip/link_vti.c |7 +--
 ip/link_vti6.c|7 +--
 lib/utils.c   |6 ++
 tc/m_mirred.c |6 +-
 tc/tc_class.c |   14 ++---
 tc/tc_filter.c|   18 ++
 tc/tc_qdisc.c |   12 ++--
 31 files changed, 196 insertions(+), 265 deletions(-)

-- 
1.7.10.4



[PATCH iproute2-next v4 1/4] utils: Introduce and use nodev() helper routine

2018-03-07 Thread Serhey Popovych
There are a couple of places where we report an error when no network
device is found. In all of them we output a message in the same format
to stderr and either return -1 or 1 to the caller or exit with -1.

Introduce a new helper function nodev() that takes the name of the
network device that caused the error and returns -1 to its caller.
Callers either call exit() or return, preserving the behaviour before
the change.

Use -nodev() in traffic control (tc) code to return 1.

Simplify the expression checking for an argument being 0/NULL in the
@if statement.

Signed-off-by: Serhey Popovych 
---
 bridge/fdb.c  |   17 ++---
 bridge/link.c |8 +++-
 bridge/mdb.c  |   19 ++-
 bridge/vlan.c |7 ++-
 include/utils.h   |1 +
 ip/ip6tunnel.c|6 ++
 ip/ipaddress.c|7 +++
 ip/iplink.c   |   13 -
 ip/iplink_bond.c  |4 ++--
 ip/iplink_bridge.c|7 ++-
 ip/iplink_vxlan.c |7 ++-
 ip/ipmroute.c |7 +++
 ip/ipneigh.c  |   14 +++---
 ip/ipntable.c |6 ++
 ip/iproute.c  |   36 
 ip/iproute_lwtunnel.c |4 ++--
 ip/iptunnel.c |6 ++
 ip/link_gre.c |7 ++-
 ip/link_gre6.c|7 ++-
 ip/link_ip6tnl.c  |4 ++--
 ip/link_iptnl.c   |4 ++--
 ip/link_vti.c |7 ++-
 ip/link_vti6.c|7 ++-
 lib/utils.c   |6 ++
 tc/m_mirred.c |6 ++
 tc/tc_class.c |   14 ++
 tc/tc_filter.c|   18 ++
 tc/tc_qdisc.c |   12 
 28 files changed, 97 insertions(+), 164 deletions(-)

diff --git a/bridge/fdb.c b/bridge/fdb.c
index b4f6e8b..205b4fa 100644
--- a/bridge/fdb.c
+++ b/bridge/fdb.c
@@ -311,11 +311,8 @@ static int fdb_show(int argc, char **argv)
/*we'll keep around filter_dev for older kernels */
if (filter_dev) {
filter_index = ll_name_to_index(filter_dev);
-   if (filter_index == 0) {
-   fprintf(stderr, "Cannot find device \"%s\"\n",
-   filter_dev);
-   return -1;
-   }
+   if (!filter_index)
+   return nodev(filter_dev);
req.ifm.ifi_index = filter_index;
}
 
@@ -391,8 +388,8 @@ static int fdb_modify(int cmd, int flags, int argc, char 
**argv)
} else if (strcmp(*argv, "via") == 0) {
NEXT_ARG();
via = ll_name_to_index(*argv);
-   if (via == 0)
-   invarg("invalid device\n", *argv);
+   if (!via)
+   exit(nodev(*argv));
} else if (strcmp(*argv, "self") == 0) {
req.ndm.ndm_flags |= NTF_SELF;
} else if (matches(*argv, "master") == 0) {
@@ -467,10 +464,8 @@ static int fdb_modify(int cmd, int flags, int argc, char 
**argv)
addattr32(&req.n, sizeof(req), NDA_IFINDEX, via);
 
req.ndm.ndm_ifindex = ll_name_to_index(d);
-   if (req.ndm.ndm_ifindex == 0) {
-   fprintf(stderr, "Cannot find device \"%s\"\n", d);
-   return -1;
-   }
+   if (!req.ndm.ndm_ifindex)
+   return nodev(d);
 
if (rtnl_talk(&rth, &req.n, NULL) < 0)
return -1;
diff --git a/bridge/link.c b/bridge/link.c
index 69c08ec..579d57e 100644
--- a/bridge/link.c
+++ b/bridge/link.c
@@ -485,11 +485,9 @@ static int brlink_show(int argc, char **argv)
}
 
if (filter_dev) {
-   if ((filter_index = ll_name_to_index(filter_dev)) == 0) {
-   fprintf(stderr, "Cannot find device \"%s\"\n",
-   filter_dev);
-   return -1;
-   }
+   filter_index = ll_name_to_index(filter_dev);
+   if (!filter_index)
+   return nodev(filter_dev);
}
 
if (show_details) {
diff --git a/bridge/mdb.c b/bridge/mdb.c
index 8c08baf..f38dc67 100644
--- a/bridge/mdb.c
+++ b/bridge/mdb.c
@@ -287,11 +287,8 @@ static int mdb_show(int argc, char **argv)
 
if (filter_dev) {
filter_index = ll_name_to_index(filter_dev);
-   if (filter_index == 0) {
-   fprintf(stderr, "Cannot find device \"%s\"\n",
-   filter_dev);
-   return -1;
-   }
+   if (!filter_index)
+   return nodev(filter_dev);
}
 
new_json_obj(json);
@@ -360,16 +357,12 @@ static int mdb_modify(int cmd, int flags, int argc, char 
**argv)
}
 
req.bpm.ifindex = ll_name_to_index(d);
-   if (req.bpm.ifindex == 0) {
-   fprintf(stderr, "Cannot fi

Re: [PATCH iproute2-next 2/3] ipmroute: don't complain about unicast routes

2018-03-07 Thread Sergei Shtylyov

Hello!

On 3/7/2018 4:03 AM, Stephen Hemminger wrote:


From: Stephen Hemminger 

Every non-multicast route prints an error message.
The kernel doesn't filter out unicast routes; it is up to the filter
function to do this.

Signed-off-by: Stephen Hemminger 
---
  ip/ipmroute.c | 7 +++
  1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/ip/ipmroute.c b/ip/ipmroute.c
index aa5029b44f41..03ca0575e571 100644
--- a/ip/ipmroute.c
+++ b/ip/ipmroute.c
@@ -75,15 +75,14 @@ int print_mroute(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
fprintf(stderr, "BUG: wrong nlmsg len %d\n", len);
return -1;
}
-   if (r->rtm_type != RTN_MULTICAST) {
-   fprintf(stderr, "Not a multicast route (type: %s)\n",
-   rtnl_rtntype_n2a(r->rtm_type, b1, sizeof(b1)));
+
+   if (r->rtm_type != RTN_MULTICAST)
return 0;
-   }
  
  	parse_rtattr(tb, RTA_MAX, RTM_RTA(r), len);

table = rtm_get_table(r, tb);
  
+


   Why?


if (filter.tb > 0 && filter.tb != table)
return 0;
  


MBR, Sergei


Re: [PATCH net-next] net/mlx4_en: try to use high order pages for RX rings

2018-03-07 Thread Tariq Toukan


On 06/03/2018 9:12 PM, Eric Dumazet wrote:

From: Eric Dumazet 

RX rings can fit most of the time in a contiguous piece of memory,
so let's use kvzalloc_node/kvfree instead of vzalloc_node/vfree

Note that kvzalloc_node() automatically falls back to another node,
there is no need to do the fallback ourselves.

Signed-off-by: Eric Dumazet 
---
  drivers/net/ethernet/mellanox/mlx4/en_rx.c |   13 +
  1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 
c2c6bd7578fd42ded9b10c3993742a244b8f..05787efef492b1c0c6ce540ef73647fad91ce282
 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -291,13 +291,10 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
  
  	tmp = size * roundup_pow_of_two(MLX4_EN_MAX_RX_FRAGS *

sizeof(struct mlx4_en_rx_alloc));
-   ring->rx_info = vzalloc_node(tmp, node);
+   ring->rx_info = kvzalloc_node(tmp, GFP_KERNEL, node);
if (!ring->rx_info) {
-   ring->rx_info = vzalloc(tmp);
-   if (!ring->rx_info) {
-   err = -ENOMEM;
-   goto err_xdp_info;
-   }
+   err = -ENOMEM;
+   goto err_xdp_info;
}
  
  	en_dbg(DRV, priv, "Allocated rx_info ring at addr:%p size:%d\n",

@@ -318,7 +315,7 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
return 0;
  
  err_info:

-   vfree(ring->rx_info);
+   kvfree(ring->rx_info);
ring->rx_info = NULL;
  err_xdp_info:
xdp_rxq_info_unreg(&ring->xdp_rxq);
@@ -447,7 +444,7 @@ void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
bpf_prog_put(old_prog);
xdp_rxq_info_unreg(&ring->xdp_rxq);
mlx4_free_hwq_res(mdev->dev, &ring->wqres, size * stride + TXBB_SIZE);
-   vfree(ring->rx_info);
+   kvfree(ring->rx_info);
ring->rx_info = NULL;
kfree(ring);
*pring = NULL;



Reviewed-by: Tariq Toukan 

Thanks Eric.


[PATCH iproute2-next] rdma: Update device capabilities flags

2018-03-07 Thread Leon Romanovsky
From: Leon Romanovsky 

In kernel commit e1d2e8873369 ("IB/core: Add PCI write
end padding flags for WQ and QP"), we introduced a new
device capability to advertise PCI write end padding.

PCI write end padding is the device's ability to pad the ending of
incoming packets (scatter) to full cache line such that the last
upstream write generated by an incoming packet will be a full cache
line.

This commit updates RDMAtool to present this field.

Signed-off-by: Leon Romanovsky 
---
 rdma/dev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/rdma/dev.c b/rdma/dev.c
index 03ab8683..e2eafe47 100644
--- a/rdma/dev.c
+++ b/rdma/dev.c
@@ -53,7 +53,8 @@ static const char *dev_caps_to_str(uint32_t idx)
x(SG_GAPS_REG, 32) \
x(VIRTUAL_FUNCTION, 33) \
x(RAW_SCATTER_FCS, 34) \
-   x(RDMA_NETDEV_OPA_VNIC, 35)
+   x(RDMA_NETDEV_OPA_VNIC, 35) \
+   x(PCI_WRITE_END_PADDING, 36)
 
enum { RDMA_DEV_FLAGS(RDMA_BITMAP_ENUM) };
 
-- 
2.16.2



Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed

2018-03-07 Thread Kirill Tkhai
On 06.03.2018 19:50, Eric Dumazet wrote:
> On Tue, 2018-03-06 at 19:24 +0300, Kirill Tkhai wrote:
>> After unshare test kworker hangs for ages:
>>
>> $ while :; do unshare -n true; done &
>>
>> $ perf report 
>> -   88,82% 0,00%  kworker/u16:0  [kernel.vmlinux]  [k]
>> cleanup_net
>>  cleanup_net
>>    - ops_exit_list.isra.9
>>   - 85,68% igmp6_net_exit
>>  - 53,31% sock_release
>> - inet_release
>>    - 25,33% rawv6_close
>>   - ip6mr_sk_done
>>  + 23,38% synchronize_rcu
>>
>> Keep in mind, this perf report shows the time a function was
>> executing, and
>> it does not show the time, it was sleeping. But it's easy to imagine,
>> how
>> much it is sleeping, if synchronize_rcu() execution takes the most
>> time.
>> Top shows the kworker R time is less then 1%.
>>
>> This happen, because of in ip6mr_sk_done() we do too many
>> synchronize_rcu(),
>> even for the sockets, that are not referenced in mr_table, and which
>> are not
>> need it. So, the progress of kworker becomes very slow.
>>
>> The patch introduces apparent solution, and it makes ip6mr_sk_done()
>> to skip
>> synchronize_rcu() for sockets, that are not need that. After the
>> patch,
>> kworker becomes able to warm the cpu up as expected.
>>
>> Signed-off-by: Kirill Tkhai 
>> ---
>>  net/ipv6/ip6mr.c |4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
>> index 2a38f9de45d3..290a8d0d5eac 100644
>> --- a/net/ipv6/ip6mr.c
>> +++ b/net/ipv6/ip6mr.c
>> @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk)
>>  }
>>  }
>>  rtnl_unlock();
>> -synchronize_rcu();
>> +
>> +if (!err)
>> +synchronize_rcu();
>>  
> 
> 
> But... what is this synchronize_rcu() doing exactly ?
> 
> This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c
> 
> ("ip6mr: Make mroute_sk rcu-based")
> 
> Typically on a delete, the synchronize_rcu() would be needed before
> freeing the deleted object.
> 
> But nowadays we have better way : SOCK_RCU_FREE

Hm. I agree with you. This is a hot path, and mroute sockets created
from userspace will delay userspace tasks' close() and exit(). Since
there may be many such sockets, we may get a zombie task which can't
be reaped for ages. This slows down the system directly.

The fix for pernet_operations works, but we need a generic solution instead.

The commit "8571ab479a6e1ef46ead5ebee567e128a422767c" says:

ip6mr: Make mroute_sk rcu-based

In ipmr the mr_table socket is handled under RCU. Introduce the same
for ip6mr.

It does not point to any improvement it brings, or to a problem it
solves. The description looks like a cleanup. It's too expensive a
cleanup if it worsens the performance a hundred times.

Can't we simply revert it?!

Yuval, do you have ideas to fix that (maybe, via SOCK_RCU_FREE suggested by 
Eric)?

We actually use rcu_dereference() in ip6mr_cache_report() only. The
only user of the dereference is sock_queue_rcv_skb(). A superficial
analysis shows we purge the queue in inet_sock_destruct().

Thanks,
Kirill


Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed

2018-03-07 Thread Kirill Tkhai
On 07.03.2018 12:22, Kirill Tkhai wrote:
> On 06.03.2018 19:50, Eric Dumazet wrote:
>> On Tue, 2018-03-06 at 19:24 +0300, Kirill Tkhai wrote:
>>> After unshare test kworker hangs for ages:
>>>
>>> $ while :; do unshare -n true; done &
>>>
>>> $ perf report 
>>> -   88,82% 0,00%  kworker/u16:0  [kernel.vmlinux]  [k]
>>> cleanup_net
>>>  cleanup_net
>>>    - ops_exit_list.isra.9
>>>   - 85,68% igmp6_net_exit
>>>  - 53,31% sock_release
>>> - inet_release
>>>    - 25,33% rawv6_close
>>>   - ip6mr_sk_done
>>>  + 23,38% synchronize_rcu
>>>
>>> Keep in mind, this perf report shows the time a function was
>>> executing, and
>>> it does not show the time, it was sleeping. But it's easy to imagine,
>>> how
>>> much it is sleeping, if synchronize_rcu() execution takes the most
>>> time.
>>> Top shows the kworker R time is less then 1%.
>>>
>>> This happen, because of in ip6mr_sk_done() we do too many
>>> synchronize_rcu(),
>>> even for the sockets, that are not referenced in mr_table, and which
>>> are not
>>> need it. So, the progress of kworker becomes very slow.
>>>
>>> The patch introduces apparent solution, and it makes ip6mr_sk_done()
>>> to skip
>>> synchronize_rcu() for sockets, that are not need that. After the
>>> patch,
>>> kworker becomes able to warm the cpu up as expected.
>>>
>>> Signed-off-by: Kirill Tkhai 
>>> ---
>>>  net/ipv6/ip6mr.c |4 +++-
>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
>>> index 2a38f9de45d3..290a8d0d5eac 100644
>>> --- a/net/ipv6/ip6mr.c
>>> +++ b/net/ipv6/ip6mr.c
>>> @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk)
>>>     }
>>>     }
>>>     rtnl_unlock();
>>> -   synchronize_rcu();
>>> +
>>> +   if (!err)
>>> +   synchronize_rcu();
>>>  
>>
>>
>> But... what is this synchronize_rcu() doing exactly ?
>>
>> This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c
>>
>> ("ip6mr: Make mroute_sk rcu-based")
>>
>> Typically on a delete, the synchronize_rcu() would be needed before
>> freeing the deleted object.
>>
>> But nowadays we have better way : SOCK_RCU_FREE
> 
> Hm. I'm agree with you. This is hot path, and mroute sockets created from 
> userspace
> will delay userspace tasks close() and exit(). Since there may be many such 
> sockets,
> we may get a zombie task, which can't be reaped for ages. This slows down the 
> system
> directly.
> 
> Fix for pernet_operations works, but we need generic solution instead.
> 
> The commit "8571ab479a6e1ef46ead5ebee567e128a422767c" says:
> 
> ip6mr: Make mroute_sk rcu-based
> 
> In ipmr the mr_table socket is handled under RCU. Introduce the same
> for ip6mr.
> 
> There is no pointing to improvements it invents, or to the problem it solves. 
> The description
> looks like a cleanup. It's too expensive cleanup, if it worsens the 
> performance a hundred
> times.
> 
> Can't we simply revert it?!
> 
> Yuval, do you have ideas to fix that (maybe, via SOCK_RCU_FREE suggested by 
> Eric)?
> 
> We actually use rcu_dereference() in ip6mr_cache_report() only. The only user 
> of dereference
> is sock_queue_rcv_skb(). Superficial analysis shows we purge the queue in 
> inet_sock_destruct().

+ So this change should be safe.

> Thanks,
> Kirill
> 


[PATCH net-next 00/16] Converting pernet_operations (part #5)

2018-03-07 Thread Kirill Tkhai
Hi,

this series continues to review and to convert pernet_operations
to make them possible to be executed in parallel for several
net namespaces in the same time. There are mostly netfilter
operations (and they should be the last netfilter's), also
there are two patches touching pktgen and xfrm.

Thanks,
Kirill
---

Kirill Tkhai (16):
  net: Convert ip6 tables pernet_operations
  net: Convert xfrm_user_net_ops
  net: Convert nf_tables_net_ops
  net: Convert nfnetlink_net_ops
  net: Convert nfnl_acct_ops
  net: Convert cttimeout_ops
  net: Convert nfnl_log_net_ops
  net: Convert nfnl_queue_net_ops
  net: Convert pg_net_ops
  net: Convert arptable_filter_net_ops
  net: Convert iptable_mangle_net_ops
  net: Convert iptable_nat_net_ops
  net: Convert iptable_raw_net_ops
  net: Convert iptable_security_net_ops
  net: Convert ipv4_net_ops
  net: Convert ipv6_net_ops


 net/core/pktgen.c  |1 +
 net/ipv4/netfilter/arptable_filter.c   |1 +
 net/ipv4/netfilter/iptable_mangle.c|1 +
 net/ipv4/netfilter/iptable_nat.c   |1 +
 net/ipv4/netfilter/iptable_raw.c   |1 +
 net/ipv4/netfilter/iptable_security.c  |1 +
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |1 +
 net/ipv6/netfilter/ip6table_filter.c   |1 +
 net/ipv6/netfilter/ip6table_mangle.c   |1 +
 net/ipv6/netfilter/ip6table_nat.c  |1 +
 net/ipv6/netfilter/ip6table_raw.c  |1 +
 net/ipv6/netfilter/ip6table_security.c |1 +
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |1 +
 net/netfilter/nf_tables_api.c  |1 +
 net/netfilter/nfnetlink.c  |1 +
 net/netfilter/nfnetlink_acct.c |1 +
 net/netfilter/nfnetlink_cttimeout.c|1 +
 net/netfilter/nfnetlink_log.c  |1 +
 net/netfilter/nfnetlink_queue.c|1 +
 net/xfrm/xfrm_user.c   |1 +
 20 files changed, 20 insertions(+)

--
Signed-off-by: Kirill Tkhai 


[PATCH net-next 01/16] net: Convert ip6 tables pernet_operations

2018-03-07 Thread Kirill Tkhai
The pernet_operations:

ip6table_filter_net_ops
ip6table_mangle_net_ops
ip6table_nat_net_ops
ip6table_raw_net_ops
ip6table_security_net_ops

have exit methods which call ip6t_unregister_table().
ip6table_filter_net_ops also has an init method registering the
filter table.

Since there must be no in-flight ipv6 packets at the time of
pernet_operations execution, and since pernet_operations don't send
ipv6 packets to each other, these pernet_operations are safe to mark
as async.

Signed-off-by: Kirill Tkhai 
---
 net/ipv6/netfilter/ip6table_filter.c   |1 +
 net/ipv6/netfilter/ip6table_mangle.c   |1 +
 net/ipv6/netfilter/ip6table_nat.c  |1 +
 net/ipv6/netfilter/ip6table_raw.c  |1 +
 net/ipv6/netfilter/ip6table_security.c |1 +
 5 files changed, 5 insertions(+)

diff --git a/net/ipv6/netfilter/ip6table_filter.c 
b/net/ipv6/netfilter/ip6table_filter.c
index 1343077dde93..06561c84c0bc 100644
--- a/net/ipv6/netfilter/ip6table_filter.c
+++ b/net/ipv6/netfilter/ip6table_filter.c
@@ -87,6 +87,7 @@ static void __net_exit ip6table_filter_net_exit(struct net 
*net)
 static struct pernet_operations ip6table_filter_net_ops = {
.init = ip6table_filter_net_init,
.exit = ip6table_filter_net_exit,
+   .async = true,
 };
 
 static int __init ip6table_filter_init(void)
diff --git a/net/ipv6/netfilter/ip6table_mangle.c 
b/net/ipv6/netfilter/ip6table_mangle.c
index b0524b18c4fb..a11e25936b45 100644
--- a/net/ipv6/netfilter/ip6table_mangle.c
+++ b/net/ipv6/netfilter/ip6table_mangle.c
@@ -107,6 +107,7 @@ static void __net_exit ip6table_mangle_net_exit(struct net 
*net)
 
 static struct pernet_operations ip6table_mangle_net_ops = {
.exit = ip6table_mangle_net_exit,
+   .async = true,
 };
 
 static int __init ip6table_mangle_init(void)
diff --git a/net/ipv6/netfilter/ip6table_nat.c 
b/net/ipv6/netfilter/ip6table_nat.c
index 47306e45a80a..4475fd300bb6 100644
--- a/net/ipv6/netfilter/ip6table_nat.c
+++ b/net/ipv6/netfilter/ip6table_nat.c
@@ -131,6 +131,7 @@ static void __net_exit ip6table_nat_net_exit(struct net 
*net)
 
 static struct pernet_operations ip6table_nat_net_ops = {
.exit   = ip6table_nat_net_exit,
+   .async  = true,
 };
 
 static int __init ip6table_nat_init(void)
diff --git a/net/ipv6/netfilter/ip6table_raw.c 
b/net/ipv6/netfilter/ip6table_raw.c
index 710fa0806c37..a88f3b1995b1 100644
--- a/net/ipv6/netfilter/ip6table_raw.c
+++ b/net/ipv6/netfilter/ip6table_raw.c
@@ -75,6 +75,7 @@ static void __net_exit ip6table_raw_net_exit(struct net *net)
 
 static struct pernet_operations ip6table_raw_net_ops = {
.exit = ip6table_raw_net_exit,
+   .async = true,
 };
 
 static int __init ip6table_raw_init(void)
diff --git a/net/ipv6/netfilter/ip6table_security.c 
b/net/ipv6/netfilter/ip6table_security.c
index cf26ccb04056..320048c008dc 100644
--- a/net/ipv6/netfilter/ip6table_security.c
+++ b/net/ipv6/netfilter/ip6table_security.c
@@ -74,6 +74,7 @@ static void __net_exit ip6table_security_net_exit(struct net 
*net)
 
 static struct pernet_operations ip6table_security_net_ops = {
.exit = ip6table_security_net_exit,
+   .async = true,
 };
 
 static int __init ip6table_security_init(void)
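
For context, the conversion pattern this series applies can be summarized in a minimal sketch. The `foo` subsystem below is hypothetical and invented purely for illustration; the `.async` field and the shape of the ops struct follow the diffs above:

```c
/* Hedged sketch of the conversion pattern used across this series.
 * "foo" is a hypothetical pernet subsystem; the real ops are shown
 * in the diffs above. Setting .async = true tells the pernet
 * (un)register machinery that this subsystem's init/exit methods
 * may run for several net namespaces in parallel, instead of being
 * serialized behind the global net_mutex.
 */
#include <net/net_namespace.h>

static int __net_init foo_net_init(struct net *net)
{
	/* per-net setup only: no references to foreign namespaces */
	return 0;
}

static void __net_exit foo_net_exit(struct net *net)
{
	/* per-net teardown only */
}

static struct pernet_operations foo_net_ops = {
	.init  = foo_net_init,
	.exit  = foo_net_exit,
	.async = true,	/* safe: methods touch only this net's state */
};

static int __init foo_init(void)
{
	return register_pernet_subsys(&foo_net_ops);
}
```

The safety argument is the same for each patch in the series: as long as init/exit touch only per-net state and no other pernet_operations send packets or messages into this namespace during setup/teardown, parallel execution is safe.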



[PATCH net-next 03/16] net: Convert nf_tables_net_ops

2018-03-07 Thread Kirill Tkhai
These pernet_operations look nicely separated per net.
The exit method unregisters the net's nf_tables objects.
We allow them to be executed in parallel.

Signed-off-by: Kirill Tkhai 
---
 net/netfilter/nf_tables_api.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 558593e6a0a3..8e19c86d1aa6 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -6596,6 +6596,7 @@ static void __net_exit nf_tables_exit_net(struct net *net)
 static struct pernet_operations nf_tables_net_ops = {
.init   = nf_tables_init_net,
.exit   = nf_tables_exit_net,
+   .async  = true,
 };
 
 static int __init nf_tables_module_init(void)



[PATCH net-next 05/16] net: Convert nfnl_acct_ops

2018-03-07 Thread Kirill Tkhai
These pernet_operations look self-contained, and there
are no other users of net::nfnl_acct_list outside of them.
They are safe to be executed for several net namespaces
in parallel.

Signed-off-by: Kirill Tkhai 
---
 net/netfilter/nfnetlink_acct.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/nfnetlink_acct.c b/net/netfilter/nfnetlink_acct.c
index 88d427f9f9e6..8d9f18bb8840 100644
--- a/net/netfilter/nfnetlink_acct.c
+++ b/net/netfilter/nfnetlink_acct.c
@@ -515,6 +515,7 @@ static void __net_exit nfnl_acct_net_exit(struct net *net)
 static struct pernet_operations nfnl_acct_ops = {
 .init   = nfnl_acct_net_init,
 .exit   = nfnl_acct_net_exit,
+   .async  = true,
 };
 
 static int __init nfnl_acct_init(void)



[PATCH net-next 04/16] net: Convert nfnetlink_net_ops

2018-03-07 Thread Kirill Tkhai
These pernet_operations create and destroy the net::nfnl
socket of NETLINK_NETFILTER type. There are no other places
where a socket of this type is created, except these
pernet_operations. It seems other pernet_operations depending
on CONFIG_NETFILTER_NETLINK send messages to this socket.
So, we mark it async.

Signed-off-by: Kirill Tkhai 
---
 net/netfilter/nfnetlink.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 03ead8a9e90c..84fc4954862d 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -566,6 +566,7 @@ static void __net_exit nfnetlink_net_exit_batch(struct 
list_head *net_exit_list)
 static struct pernet_operations nfnetlink_net_ops = {
.init   = nfnetlink_net_init,
.exit_batch = nfnetlink_net_exit_batch,
+   .async  = true,
 };
 
 static int __init nfnetlink_init(void)



[PATCH net-next 02/16] net: Convert xfrm_user_net_ops

2018-03-07 Thread Kirill Tkhai
These pernet_operations create and destroy the net::xfrm::nlsk
socket of NETLINK_XFRM. There is only one entry point where
it is dereferenced: xfrm_user_rcv_msg(). There are no
in-kernel senders to this socket.

Signed-off-by: Kirill Tkhai 
---
 net/xfrm/xfrm_user.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 7f52b8eb177d..aff2e84ec761 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -3258,6 +3258,7 @@ static void __net_exit xfrm_user_net_exit(struct 
list_head *net_exit_list)
 static struct pernet_operations xfrm_user_net_ops = {
.init   = xfrm_user_net_init,
.exit_batch = xfrm_user_net_exit,
+   .async  = true,
 };
 
 static int __init xfrm_user_init(void)



[PATCH net-next 07/16] net: Convert nfnl_log_net_ops

2018-03-07 Thread Kirill Tkhai
These pernet_operations create and destroy /proc entries.
Also, the exit method unsets nfulnl_logger. The logger is not
set by default; it becomes bound via a userspace request.
So, they look safe to be made async.

Signed-off-by: Kirill Tkhai 
---
 net/netfilter/nfnetlink_log.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
index 7b46aa4c478d..b21ef79849a1 100644
--- a/net/netfilter/nfnetlink_log.c
+++ b/net/netfilter/nfnetlink_log.c
@@ -1108,6 +1108,7 @@ static struct pernet_operations nfnl_log_net_ops = {
.exit   = nfnl_log_net_exit,
.id = &nfnl_log_net_id,
.size   = sizeof(struct nfnl_log_net),
+   .async  = true,
 };
 
 static int __init nfnetlink_log_init(void)



[PATCH net-next 06/16] net: Convert cttimeout_ops

2018-03-07 Thread Kirill Tkhai
These pernet_operations also look self-contained.
The exit method touches only per-net structures, so it's
safe to execute them for several net namespaces in parallel.

Signed-off-by: Kirill Tkhai 
---
 net/netfilter/nfnetlink_cttimeout.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/nfnetlink_cttimeout.c 
b/net/netfilter/nfnetlink_cttimeout.c
index 95b04702a655..6819300f7fb7 100644
--- a/net/netfilter/nfnetlink_cttimeout.c
+++ b/net/netfilter/nfnetlink_cttimeout.c
@@ -586,6 +586,7 @@ static void __net_exit cttimeout_net_exit(struct net *net)
 static struct pernet_operations cttimeout_ops = {
.init   = cttimeout_net_init,
.exit   = cttimeout_net_exit,
+   .async  = true,
 };
 
 static int __init cttimeout_init(void)



[PATCH net-next 08/16] net: Convert nfnl_queue_net_ops

2018-03-07 Thread Kirill Tkhai
These pernet_operations register and unregister net::nf::queue_handler
and a /proc entry. The handler is accessed only under RCU, so it looks
safe to convert them.

Signed-off-by: Kirill Tkhai 
---
 net/netfilter/nfnetlink_queue.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index 8bba23160a68..59e2833c17f1 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -1528,6 +1528,7 @@ static struct pernet_operations nfnl_queue_net_ops = {
.exit_batch = nfnl_queue_net_exit_batch,
.id = &nfnl_queue_net_id,
.size   = sizeof(struct nfnl_queue_net),
+   .async  = true,
 };
 
 static int __init nfnetlink_queue_init(void)



[PATCH net-next 10/16] net: Convert arptable_filter_net_ops

2018-03-07 Thread Kirill Tkhai
These pernet_operations unregister net::ipv4::arptable_filter.
Other pernet_operations do not send ARP packets to foreign
net namespaces. So, we mark them async.

Signed-off-by: Kirill Tkhai 
---
 net/ipv4/netfilter/arptable_filter.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/netfilter/arptable_filter.c 
b/net/ipv4/netfilter/arptable_filter.c
index 8f8713b4388f..49c2490193ae 100644
--- a/net/ipv4/netfilter/arptable_filter.c
+++ b/net/ipv4/netfilter/arptable_filter.c
@@ -65,6 +65,7 @@ static void __net_exit arptable_filter_net_exit(struct net 
*net)
 
 static struct pernet_operations arptable_filter_net_ops = {
.exit = arptable_filter_net_exit,
+   .async = true,
 };
 
 static int __init arptable_filter_init(void)



[PATCH net-next 09/16] net: Convert pg_net_ops

2018-03-07 Thread Kirill Tkhai
These pernet_operations create per-net pktgen threads
and /proc entries. This pernet subsystem looks self-contained,
and there are no pernet_operations outside this file that
are interested in the threads. The init and exit methods
look safe to be executed in parallel.

Signed-off-by: Kirill Tkhai 
---
 net/core/pktgen.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index b8ab5c829511..d81bddd1bb80 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3851,6 +3851,7 @@ static struct pernet_operations pg_net_ops = {
.exit = pg_net_exit,
.id   = &pg_net_id,
.size = sizeof(struct pktgen_net),
+   .async = true,
 };
 
 static int __init pg_init(void)



[PATCH net-next 11/16] net: Convert iptable_mangle_net_ops

2018-03-07 Thread Kirill Tkhai
These pernet_operations unregister the net::ipv4::iptable_mangle table.
Other pernet_operations do not send IPv4 packets to foreign
net namespaces. So, we mark them async.

Signed-off-by: Kirill Tkhai 
---
 net/ipv4/netfilter/iptable_mangle.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/netfilter/iptable_mangle.c 
b/net/ipv4/netfilter/iptable_mangle.c
index dea138ca8925..f6074059531a 100644
--- a/net/ipv4/netfilter/iptable_mangle.c
+++ b/net/ipv4/netfilter/iptable_mangle.c
@@ -113,6 +113,7 @@ static void __net_exit iptable_mangle_net_exit(struct net 
*net)
 
 static struct pernet_operations iptable_mangle_net_ops = {
.exit = iptable_mangle_net_exit,
+   .async = true,
 };
 
 static int __init iptable_mangle_init(void)



[PATCH net-next 12/16] net: Convert iptable_nat_net_ops

2018-03-07 Thread Kirill Tkhai
These pernet_operations unregister the net::ipv4::nat_table table.
Other pernet_operations do not send IPv4 packets to foreign
net namespaces. So, we mark them async.

Signed-off-by: Kirill Tkhai 
---
 net/ipv4/netfilter/iptable_nat.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/netfilter/iptable_nat.c b/net/ipv4/netfilter/iptable_nat.c
index 0f7255cc65ee..b771af74be79 100644
--- a/net/ipv4/netfilter/iptable_nat.c
+++ b/net/ipv4/netfilter/iptable_nat.c
@@ -129,6 +129,7 @@ static void __net_exit iptable_nat_net_exit(struct net *net)
 
 static struct pernet_operations iptable_nat_net_ops = {
.exit   = iptable_nat_net_exit,
+   .async  = true,
 };
 
 static int __init iptable_nat_init(void)



[PATCH net-next 13/16] net: Convert iptable_raw_net_ops

2018-03-07 Thread Kirill Tkhai
These pernet_operations unregister the net::ipv4::iptable_raw table.
Other pernet_operations do not send IPv4 packets to foreign
net namespaces. So, we mark them async.

Signed-off-by: Kirill Tkhai 
---
 net/ipv4/netfilter/iptable_raw.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/netfilter/iptable_raw.c b/net/ipv4/netfilter/iptable_raw.c
index 960625aabf04..963753e50842 100644
--- a/net/ipv4/netfilter/iptable_raw.c
+++ b/net/ipv4/netfilter/iptable_raw.c
@@ -76,6 +76,7 @@ static void __net_exit iptable_raw_net_exit(struct net *net)
 
 static struct pernet_operations iptable_raw_net_ops = {
.exit = iptable_raw_net_exit,
+   .async = true,
 };
 
 static int __init iptable_raw_init(void)



[PATCH net-next 15/16] net: Convert ipv4_net_ops

2018-03-07 Thread Kirill Tkhai
These pernet_operations register and unregister a bunch
of nf_conntrack_l4proto entries. The exit method unregisters
the related sysctls; the init method calls init_net and
get_net_proto. The whole builtin_l4proto4 array has pretty
simple init_net and get_net_proto methods: the first registers
a sysctl table, the second is just a read-only memory dereference.
So, these pernet_operations are safe to be marked as async.

Signed-off-by: Kirill Tkhai 
---
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c 
b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index b50721d9d30e..6531f69db010 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -399,6 +399,7 @@ static struct pernet_operations ipv4_net_ops = {
.exit = ipv4_net_exit,
.id = &conntrack4_net_id,
.size = sizeof(struct conntrack4_net),
+   .async = true,
 };
 
 static int __init nf_conntrack_l3proto_ipv4_init(void)



[PATCH net-next 16/16] net: Convert ipv6_net_ops

2018-03-07 Thread Kirill Tkhai
These pernet_operations are similar to ipv4_net_ops.
They are safe to be async.

Signed-off-by: Kirill Tkhai 
---
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c 
b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
index 663827ee3cf8..ba54bb3bd1e4 100644
--- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
@@ -401,6 +401,7 @@ static struct pernet_operations ipv6_net_ops = {
.exit = ipv6_net_exit,
.id = &conntrack6_net_id,
.size = sizeof(struct conntrack6_net),
+   .async = true,
 };
 
 static int __init nf_conntrack_l3proto_ipv6_init(void)



[PATCH net-next 14/16] net: Convert iptable_security_net_ops

2018-03-07 Thread Kirill Tkhai
These pernet_operations unregister the net::ipv4::iptable_security table.
Other pernet_operations do not send IPv4 packets to foreign
net namespaces. So, we mark them async.

Signed-off-by: Kirill Tkhai 
---
 net/ipv4/netfilter/iptable_security.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/netfilter/iptable_security.c 
b/net/ipv4/netfilter/iptable_security.c
index e5379fe57b64..c40d6b3d8b6a 100644
--- a/net/ipv4/netfilter/iptable_security.c
+++ b/net/ipv4/netfilter/iptable_security.c
@@ -76,6 +76,7 @@ static void __net_exit iptable_security_net_exit(struct net 
*net)
 
 static struct pernet_operations iptable_security_net_ops = {
.exit = iptable_security_net_exit,
+   .async = true,
 };
 
 static int __init iptable_security_init(void)



[PATCH net-next 2/2] net: cdc_eem: clean up bind error path

2018-03-07 Thread Johan Hovold
Drop bogus call to usb_driver_release_interface() from an error path in
the usbnet bind() callback, which is called during interface probe. At
this point the interface is not bound and usb_driver_release_interface()
returns early.

Also remove the bogus call to clear the interface data, which is owned
by the usbnet driver and would not even have been set by the time bind()
is called.

Signed-off-by: Johan Hovold 
---
 drivers/net/usb/cdc_eem.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/usb/cdc_eem.c b/drivers/net/usb/cdc_eem.c
index f7180f8db39e..61ea4eaace5d 100644
--- a/drivers/net/usb/cdc_eem.c
+++ b/drivers/net/usb/cdc_eem.c
@@ -83,11 +83,8 @@ static int eem_bind(struct usbnet *dev, struct usb_interface 
*intf)
int status = 0;
 
status = usbnet_get_endpoints(dev, intf);
-   if (status < 0) {
-   usb_set_intfdata(intf, NULL);
-   usb_driver_release_interface(driver_of(intf), intf);
+   if (status < 0)
return status;
-   }
 
/* no jumbogram (16K) support for now */
 
-- 
2.16.2



[PATCH net-next 1/2] net: kalmia: clean up bind error path

2018-03-07 Thread Johan Hovold
Drop bogus call to usb_driver_release_interface() from an error path in
the usbnet bind() callback, which is called during interface probe. At
this point the interface is not bound and usb_driver_release_interface()
returns early.

Also remove the bogus call to clear the interface data, which is owned
by the usbnet driver and would not even have been set by the time bind()
is called.

Signed-off-by: Johan Hovold 
---
 drivers/net/usb/kalmia.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/net/usb/kalmia.c b/drivers/net/usb/kalmia.c
index 1ec523b0e932..bd2ba3659028 100644
--- a/drivers/net/usb/kalmia.c
+++ b/drivers/net/usb/kalmia.c
@@ -150,12 +150,8 @@ kalmia_bind(struct usbnet *dev, struct usb_interface *intf)
dev->rx_urb_size = dev->hard_mtu * 10; // Found as optimal after testing
 
status = kalmia_init_and_get_ethernet_addr(dev, ethernet_addr);
-
-   if (status) {
-   usb_set_intfdata(intf, NULL);
-   usb_driver_release_interface(driver_of(intf), intf);
+   if (status)
return status;
-   }
 
memcpy(dev->net->dev_addr, ethernet_addr, ETH_ALEN);
 
-- 
2.16.2



Re: [PATCH v9 crypto 01/12] tls: tls_device struct to register TLS drivers

2018-03-07 Thread Sabrina Dubroca
Hello Atul,

One quick note before you start replying: please fix your email
client, as you've been told before. Quoting has to add a quoting
marker (the '>' character) at the beginning of the line, otherwise
it's impossible to separate your reply from the email you're quoting.

2018-03-06, 21:06:20 +0530, Atul Gupta wrote:
> tls_device structure to register Inline TLS drivers
> with net/tls
> 
> Signed-off-by: Atul Gupta 
> ---
>  include/net/tls.h | 26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/include/net/tls.h b/include/net/tls.h
> index 4913430..9bfb91f 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
> @@ -55,6 +55,28 @@
>  #define TLS_RECORD_TYPE_DATA 0x17
>  
>  #define TLS_AAD_SPACE_SIZE   13
> +#define TLS_DEVICE_NAME_MAX  32

Why 32 characters?


> +enum {
> + TLS_BASE_TX,
> + TLS_SW_TX,
> + TLS_FULL_HW, /* TLS record processed Inline */
> + TLS_NUM_CONFIG,
> +};
> +
> +extern struct proto tls_prots[TLS_NUM_CONFIG];

Don't break bisection. The code has to compile after every
patch. These are already defined in net/tls/tls_main.c.

> +struct tls_device {
> + char name[TLS_DEVICE_NAME_MAX];

I could find only one use of it, in chtls_register_dev. Is this
actually needed?

> + struct list_head dev_list;
> +
> + /* netdev present in registered inline tls driver */
> + int (*netdev)(struct tls_device *device,
> +   struct net_device *netdev);

I was trying to figure out what this did, because the name is really
not explicit, and the comment doesn't make sense, but noticed it's
never actually called.

> + int (*feature)(struct tls_device *device);
> + int (*hash)(struct tls_device *device, struct sock *sk);
> + void (*unhash)(struct tls_device *device, struct sock *sk);

I think you should add a kerneldoc comment, like all the ndo_*
methods have.

> +};
>  
>  struct tls_sw_context {
>   struct crypto_aead *aead_send;
> @@ -115,6 +137,8 @@ struct tls_context {
>   int  (*getsockopt)(struct sock *sk, int level,
>  int optname, char __user *optval,
>  int __user *optlen);
> + int (*hash)(struct sock *sk);
> + void (*unhash)(struct sock *sk);
>  };
>  
>  int wait_on_pending_writer(struct sock *sk, long *timeo);
> @@ -256,5 +280,7 @@ static inline struct tls_offload_context *tls_offload_ctx(
>  
>  int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg,
> unsigned char *record_type);
> +void tls_register_device(struct tls_device *device);
> +void tls_unregister_device(struct tls_device *device);

Prototype without implementation, please don't do that.  This happens
because you split your patchset so that each patch has all the changes
for exactly one file.

-- 
Sabrina
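
For reference, the driver-side registration that the quoted header implies would look roughly as follows. This is a hypothetical sketch against the v9 interface under review, not the final API: the field and function names (`name`, `feature`, `hash`, `unhash`, `tls_register_device`, `TLS_DEVICE_NAME_MAX`) come from the quoted diff, while all callback bodies are placeholders:

```c
/* Hypothetical driver-side use of the tls_device interface quoted
 * above (v9 of the patchset, still under review). Only the struct
 * layout and the register/unregister prototypes are taken from the
 * quoted diff; everything else is invented for illustration.
 */
static int chtls_feature(struct tls_device *dev)
{
	return 1;	/* placeholder: report inline-TLS capability */
}

static int chtls_hash(struct tls_device *dev, struct sock *sk)
{
	return 0;	/* placeholder: program hardware state for sk */
}

static void chtls_unhash(struct tls_device *dev, struct sock *sk)
{
	/* placeholder: tear down hardware state for sk */
}

static struct tls_device chtls_dev = {
	.name	 = "chtls",	/* bounded by TLS_DEVICE_NAME_MAX (32) */
	.feature = chtls_feature,
	.hash	 = chtls_hash,
	.unhash	 = chtls_unhash,
};

static int __init chtls_init(void)
{
	tls_register_device(&chtls_dev);
	return 0;
}

static void __exit chtls_exit(void)
{
	tls_unregister_device(&chtls_dev);
}
```

This also illustrates Sabrina's point: a prototype such as tls_register_device() is only usable by a driver once its implementation lands in the same patch.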


Re: [PATCH v9 crypto 00/12] Chelsio Inline TLS

2018-03-07 Thread Sabrina Dubroca
2018-03-06, 21:05:23 +0530, Atul Gupta wrote:
> Series for Chelsio Inline TLS driver (chtls)
> 
> Use tls ULP infrastructure to register chtls as Inline TLS driver.
> Chtls use TCP Sockets to transmit and receive TLS record. TCP proto_ops is 
> extended to offload TLS record.
> 
> T6 adapter provides the following features:
> -TLS record offload, TLS header, encrypt, digest and transmit
> -TLS record receive and decrypt
> -TLS keys store
> -TCP/IP engine
> -TLS engine
> -GCM crypto engine [support CBC also]
> 
> TLS provides security at the transport layer. It uses TCP to provide reliable 
> end-to-end transport of application data. It relies on TCP for any 
> retransmission. TLS session comprises of three parts:

Please wrap long lines.

[snip]

> v9: corrected __u8 and similar usage

That's not the only changes since v8, actually. There's also some new
code in patch 3.

-- 
Sabrina


Re: [for-next V2 03/13] net/mlx5e: Wait for FPGA command responses with a timeout

2018-03-07 Thread Kirill Tkhai
On 07.03.2018 09:35, Saeed Mahameed wrote:
> From: Aviad Yehezkel 
> 
> Generally, FPGA IPSec commands must always complete.
> We want to wait for one minute for them to complete gracefully also
> when killing a process.
> 
> Signed-off-by: Aviad Yehezkel 
> Signed-off-by: Saeed Mahameed 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
> index 35d0e33381ca..95f9c5a8619b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/ipsec.c
> @@ -39,6 +39,7 @@
>  #include "fpga/core.h"
>  
>  #define SBU_QP_QUEUE_SIZE 8
> +#define MLX5_FPGA_IPSEC_CMD_TIMEOUT_MSEC (60 * 1000)
>  
>  enum mlx5_ipsec_response_syndrome {
>   MLX5_IPSEC_RESPONSE_SUCCESS = 0,
> @@ -217,12 +218,14 @@ void *mlx5_fpga_ipsec_sa_cmd_exec(struct mlx5_core_dev 
> *mdev,
>  int mlx5_fpga_ipsec_sa_cmd_wait(void *ctx)
>  {
>   struct mlx5_ipsec_command_context *context = ctx;
> + unsigned long timeout =
> + msecs_to_jiffies(MLX5_FPGA_IPSEC_CMD_TIMEOUT_MSEC);
>   int res;
>  
> - res = wait_for_completion_killable(&context->complete);
> - if (res) {
> + res = wait_for_completion_timeout(&context->complete, timeout);

You had the possibility to kill the waiting task before the patch,
but that is lost after the patch.
Why not wait_for_completion_killable_timeout()?

> + if (!res) {
>   mlx5_fpga_warn(context->dev, "Failure waiting for IPSec command 
> response\n");
> - return -EINTR;
> + return -ETIMEDOUT;
>   }
>  
>   if (context->status == MLX5_FPGA_IPSEC_SACMD_COMPLETE)

Thanks,
Kirill
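
A rough sketch of the wait using the suggested variant, which keeps both the timeout and the ability to kill the waiting task (a hypothetical alternative to the hunk above, reusing its `context` and `timeout` locals, not a tested patch):

```c
/* Hedged sketch of the killable + timed wait suggested above.
 * wait_for_completion_killable_timeout() returns:
 *   -ERESTARTSYS  if the waiting task was killed,
 *   0             on timeout,
 *   >0            (remaining jiffies) on completion.
 */
long res;

res = wait_for_completion_killable_timeout(&context->complete, timeout);
if (res == 0) {
	mlx5_fpga_warn(context->dev,
		       "Failure waiting for IPSec command response\n");
	return -ETIMEDOUT;
}
if (res < 0)
	return res;	/* task was killed: -ERESTARTSYS */
```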


Re: [PATCH net-next 2/2] net: cdc_eem: clean up bind error path

2018-03-07 Thread Oliver Neukum
Am Mittwoch, den 07.03.2018, 10:46 +0100 schrieb Johan Hovold:
> Drop bogus call to usb_driver_release_interface() from an error path in
> the usbnet bind() callback, which is called during interface probe. At
> this point the interface is not bound and usb_driver_release_interface()
> returns early.
> 
> Also remove the bogus call to clear the interface data, which is owned
> by the usbnet driver and would not even have been set by the time bind()
> is called.
> 
> Signed-off-by: Johan Hovold 
Acked-by: Oliver Neukum 
> ---
>  drivers/net/usb/cdc_eem.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/drivers/net/usb/cdc_eem.c b/drivers/net/usb/cdc_eem.c
> index f7180f8db39e..61ea4eaace5d 100644
> --- a/drivers/net/usb/cdc_eem.c
> +++ b/drivers/net/usb/cdc_eem.c
> @@ -83,11 +83,8 @@ static int eem_bind(struct usbnet *dev, struct 
> usb_interface *intf)
>   int status = 0;
>  
>   status = usbnet_get_endpoints(dev, intf);
> - if (status < 0) {
> - usb_set_intfdata(intf, NULL);
> - usb_driver_release_interface(driver_of(intf), intf);
> + if (status < 0)
>   return status;
> - }
>  
>   /* no jumbogram (16K) support for now */
>  



Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond

2018-03-07 Thread Jiri Pirko
Mon, Mar 05, 2018 at 02:28:30PM CET, john.hur...@netronome.com wrote:
>Allow drivers to register netdev callbacks for tc offload in linux bonds.
>If a netdev has registered and is a slave of a given bond, then any tc
>rules offloaded to the bond will be relayed to it if both the bond and the
>slave permit hw offload.
>
>Because the bond itself is not offloaded, just the rules, we don't care
>about whether the bond ports are on the same device or whether some of
>slaves are representor ports and some are not.
>
>Signed-off-by: John Hurley 
>---
> drivers/net/bonding/bond_main.c | 195 +++-
> include/net/bonding.h   |   7 ++
> 2 files changed, 201 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index e6415f6..d9e41cf 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c

[...]


>+EXPORT_SYMBOL_GPL(tc_setup_cb_bond_register);

Please, no "bond" specific calls from drivers. That would be wrong.
The idea behing block callbacks was that anyone who is interested could
register to receive those. In this case, slave device is interested.
So it should register to receive block callbacks in the same way as if
the block was directly on top of the slave device. The only thing you
need to handle is to propagate block bind/unbind from master down to the
slaves.


Re: [PATCH v9 crypto 01/12] tls: tls_device struct to register TLS drivers

2018-03-07 Thread Atul Gupta


On 3/7/2018 3:46 PM, Sabrina Dubroca wrote:
> Hello Atul,
>
> One quick note before you start replying: please fix your email
> client, as you've been told before. Quoting has to add a quoting
> marker (the '>' character) at the beginning of the line, otherwise
> it's impossible to separate your reply from the email you're quoting.
>
> 2018-03-06, 21:06:20 +0530, Atul Gupta wrote:
>> tls_device structure to register Inline TLS drivers
>> with net/tls
>>
>> Signed-off-by: Atul Gupta 
>> ---
>>  include/net/tls.h | 26 ++
>>  1 file changed, 26 insertions(+)
>>
>> diff --git a/include/net/tls.h b/include/net/tls.h
>> index 4913430..9bfb91f 100644
>> --- a/include/net/tls.h
>> +++ b/include/net/tls.h
>> @@ -55,6 +55,28 @@
>>  #define TLS_RECORD_TYPE_DATA0x17
>>  
>>  #define TLS_AAD_SPACE_SIZE  13
>> +#define TLS_DEVICE_NAME_MAX 32
> Why 32 characters?
This was chosen to accommodate the registered device name; should it be bigger?
>
>
>> +enum {
>> +TLS_BASE_TX,
>> +TLS_SW_TX,
>> +TLS_FULL_HW, /* TLS record processed Inline */
>> +TLS_NUM_CONFIG,
>> +};
>> +
>> +extern struct proto tls_prots[TLS_NUM_CONFIG];
> Don't break bisection. The code has to compile after every
> patch. These are already defined in net/tls/tls_main.c.
Will take care
>
>> +struct tls_device {
>> +char name[TLS_DEVICE_NAME_MAX];
> I could find only one use of it, in chtls_register_dev. Is this
> actually needed?
It helps to identify Inline TLS drivers registered with net/tls.
>
>> +struct list_head dev_list;
>> +
>> +/* netdev present in registered inline tls driver */
>> +int (*netdev)(struct tls_device *device,
>> +  struct net_device *netdev);
> I was trying to figure out what this did, because the name is really
> not explicit, and the comment doesn't make sense, but noticed it's
> never actually called.
It was used initially and removed in the last few cleanups [thanks for pointing it out].
>
>> +int (*feature)(struct tls_device *device);
>> +int (*hash)(struct tls_device *device, struct sock *sk);
>> +void (*unhash)(struct tls_device *device, struct sock *sk);
> I think you should add a kerneldoc comment, like all the ndo_*
> methods have.
Will take care
>
>> +};
>>  
>>  struct tls_sw_context {
>>  struct crypto_aead *aead_send;
>> @@ -115,6 +137,8 @@ struct tls_context {
>>  int  (*getsockopt)(struct sock *sk, int level,
>> int optname, char __user *optval,
>> int __user *optlen);
>> +int (*hash)(struct sock *sk);
>> +void (*unhash)(struct sock *sk);
>>  };
>>  
>>  int wait_on_pending_writer(struct sock *sk, long *timeo);
>> @@ -256,5 +280,7 @@ static inline struct tls_offload_context 
>> *tls_offload_ctx(
>>  
>>  int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg,
>>unsigned char *record_type);
>> +void tls_register_device(struct tls_device *device);
>> +void tls_unregister_device(struct tls_device *device);
> Prototype without implementation, please don't do that.  This happens
> because you split your patchset so that each patch has all the changes
> for exactly one file.
will have declaration and definition together
>



Re: [PATCH v9 crypto 00/12] Chelsio Inline TLS

2018-03-07 Thread Atul Gupta


On 3/7/2018 3:53 PM, Sabrina Dubroca wrote:
> 2018-03-06, 21:05:23 +0530, Atul Gupta wrote:
>> Series for Chelsio Inline TLS driver (chtls)
>>
>> Use tls ULP infrastructure to register chtls as Inline TLS driver.
>> Chtls use TCP Sockets to transmit and receive TLS record. TCP proto_ops is 
>> extended to offload TLS record.
>>
>> T6 adapter provides the following features:
>> -TLS record offload, TLS header, encrypt, digest and transmit
>> -TLS record receive and decrypt
>> -TLS keys store
>> -TCP/IP engine
>> -TLS engine
>> -GCM crypto engine [support CBC also]
>>
>> TLS provides security at the transport layer. It uses TCP to provide 
>> reliable end-to-end transport of application data. It relies on TCP for any 
>> retransmission. TLS session comprises of three parts:
> Please wrap long lines.
will edit in next ver
>
> [snip]
>
>> v9: corrected __u8 and similar usage
> That's not the only changes since v8, actually. There's also some new
> code in patch 3.
tls_hw_prot is done before the sk_state != TCP_ESTABLISHED check; that
check was introduced in net-next and pulled into the crypto tree later.
>



Re: [bpf-next V1 PATCH 09/15] mlx5: register a memory model when XDP is enabled

2018-03-07 Thread Tariq Toukan



On 06/03/2018 11:48 PM, Jesper Dangaard Brouer wrote:

Now all the users of ndo_xdp_xmit have been converted to use xdp_return_frame.
This enable a different memory model, thus activating another code path
in the xdp_return_frame API.

Signed-off-by: Jesper Dangaard Brouer 
---
  drivers/net/ethernet/mellanox/mlx5/core/en_main.c |7 +++
  1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index da94c8cba5ee..51482943c583 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -506,6 +506,13 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
rq->mkey_be = c->mkey_be;
}
  
+	/* This must only be activate for order-0 pages */

+   if (rq->xdp_prog)
+   err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+MEM_TYPE_PAGE_ORDER0, NULL);
+   if (err < 0)
+   goto err_rq_wq_destroy;
+


Use "if (err)" here, instead of changing this in next patch.
Also, get it into the "if (rq->xdp_prog)" block.



for (i = 0; i < wq_sz; i++) {
struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
  



Re: [PATCH v3 net-next 2/5] net/ipv6: Address checks need to consider the L3 domain

2018-03-07 Thread Kirill Tkhai
On 07.03.2018 06:58, David Ahern wrote:
> ipv6_chk_addr_and_flags determines if an address is a local address. It
> is called by ip6_route_info_create to validate a gateway address is not a
> local address. It currently does not consider L3 domains and as a result
> does not allow a route to be added in one VRF if the nexthop points to
> an address in a second VRF. e.g.,
> 
> $ ip route add 2001:db8:1::/64 vrf r2 via 2001:db8:102::23
> Error: Invalid gateway address.
> 
> where 2001:db8:102::23 is an address on an interface in vrf r1.
> 
> Resolve by comparing the l3mdev for the passed in device and requiring an
> l3mdev match with the device containing an address. The intent of checking
> for an address on the specified device versus any device in the domain is
> maintained by a new argument to skip the check between the passed in device
> and the device with the address.
> 
> Update the handful of users of ipv6_chk_addr with a NULL dev argument:
> - anycast to call ipv6_chk_addr_and_flags. If the device is given by the
>   user, look for the given address across the L3 domain. If the index is
>   not given, the default table is presumed so only addresses on devices
>   not enslaved are considered.
> 
> - ip6_tnl_rcv_ctl - local address must exist on device, remote address
>   can not exist in L3 domain; only remote check needs to be updated but
>   do both for consistency.
> 
> ip6_validate_gw needs to handle 2 cases - one where the device is given
> as part of the nexthop spec and the other where the device is resolved.
> There is at least 1 VRF case where deferring the check to only after
> the route lookup has resolved the device fails with an unintuitive error
> "RTNETLINK answers: No route to host" as opposed to the preferred
> "Error: Gateway can not be a local address." The 'no route to host'
> error is because of the fallback to a full lookup.
> 
> Signed-off-by: David Ahern 
> ---
>  include/net/addrconf.h |  4 ++--
>  net/ipv6/addrconf.c| 26 ++
>  net/ipv6/anycast.c |  9 ++---
>  net/ipv6/datagram.c|  5 +++--
>  net/ipv6/ip6_tunnel.c  | 12 
>  net/ipv6/ndisc.c   |  2 +-
>  net/ipv6/route.c   | 37 -
>  7 files changed, 70 insertions(+), 25 deletions(-)
> 
> diff --git a/include/net/addrconf.h b/include/net/addrconf.h
> index c4185a7b0e90..132e5b95167a 100644
> --- a/include/net/addrconf.h
> +++ b/include/net/addrconf.h
> @@ -69,8 +69,8 @@ int addrconf_set_dstaddr(struct net *net, void __user *arg);
>  int ipv6_chk_addr(struct net *net, const struct in6_addr *addr,
> const struct net_device *dev, int strict);
>  int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr,
> - const struct net_device *dev, int strict,
> - u32 banned_flags);
> + const struct net_device *dev, bool skip_dev_check,
> + int strict, u32 banned_flags);

This function already has 5 arguments, and this patch adds one more.
Can't we use the new flags argument for both of them?

Also, the function name and its parameters are already so long that they
don't fit on a single line, and your patch adds more parameters.
Can't we make it slimmer? Something like ipv6_chk_addr_fl() instead of the
current name.

>  #if defined(CONFIG_IPV6_MIP6) || defined(CONFIG_IPV6_MIP6_MODULE)
>  int ipv6_chk_home_addr(struct net *net, const struct in6_addr *addr);
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index b5fd116c046a..17d5d3f42d21 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -1851,22 +1851,40 @@ static int ipv6_count_addresses(const struct inet6_dev *idev)
>  int ipv6_chk_addr(struct net *net, const struct in6_addr *addr,
> const struct net_device *dev, int strict)
>  {
> - return ipv6_chk_addr_and_flags(net, addr, dev, strict, IFA_F_TENTATIVE);
> + return ipv6_chk_addr_and_flags(net, addr, dev, !dev,
> +strict, IFA_F_TENTATIVE);
>  }
>  EXPORT_SYMBOL(ipv6_chk_addr);

This function was not introduced by this commit, but since the commit
modifies it, and the function is pretty simple, we could declare it as
static inline in a header in a separate patch.

>  
> +/* device argument is used to find the L3 domain of interest. If
> + * skip_dev_check is set, then the ifp device is not checked against
> + * the passed in dev argument. So the 2 cases for addresses checks are:
> + *   1. does the address exist in the L3 domain that dev is part of
> + *  (skip_dev_check = true), or
> + *
> + *   2. does the address exist on the specific device
> + *  (skip_dev_check = false)
> + */
>  int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr,
> - const struct net_device *dev, int strict,
> - u32 banned_flags)
> + const struct 

Re: [PATCH v3 net-next 2/5] net/ipv6: Address checks need to consider the L3 domain

2018-03-07 Thread Kirill Tkhai
On 07.03.2018 14:53, Kirill Tkhai wrote:
> On 07.03.2018 06:58, David Ahern wrote:
>> ipv6_chk_addr_and_flags determines if an address is a local address. It
>> is called by ip6_route_info_create to validate a gateway address is not a
>> local address. It currently does not consider L3 domains and as a result
>> does not allow a route to be added in one VRF if the nexthop points to
>> an address in a second VRF. e.g.,
>>
>> $ ip route add 2001:db8:1::/64 vrf r2 via 2001:db8:102::23
>> Error: Invalid gateway address.
>>
>> where 2001:db8:102::23 is an address on an interface in vrf r1.
>>
>> Resolve by comparing the l3mdev for the passed in device and requiring an
>> l3mdev match with the device containing an address. The intent of checking
>> for an address on the specified device versus any device in the domain is
>> maintained by a new argument to skip the check between the passed in device
>> and the device with the address.
>>
>> Update the handful of users of ipv6_chk_addr with a NULL dev argument:
>> - anycast to call ipv6_chk_addr_and_flags. If the device is given by the
>>   user, look for the given address across the L3 domain. If the index is
>>   not given, the default table is presumed so only addresses on devices
>>   not enslaved are considered.
>>
>> - ip6_tnl_rcv_ctl - local address must exist on device, remote address
>>   can not exist in L3 domain; only remote check needs to be updated but
>>   do both for consistency.
>>
>> ip6_validate_gw needs to handle 2 cases - one where the device is given
>> as part of the nexthop spec and the other where the device is resolved.
>> There is at least 1 VRF case where deferring the check to only after
>> the route lookup has resolved the device fails with an unintuitive error
>> "RTNETLINK answers: No route to host" as opposed to the preferred
>> "Error: Gateway can not be a local address." The 'no route to host'
>> error is because of the fallback to a full lookup.
>>
>> Signed-off-by: David Ahern 
>> ---
>>  include/net/addrconf.h |  4 ++--
>>  net/ipv6/addrconf.c| 26 ++
>>  net/ipv6/anycast.c |  9 ++---
>>  net/ipv6/datagram.c|  5 +++--
>>  net/ipv6/ip6_tunnel.c  | 12 
>>  net/ipv6/ndisc.c   |  2 +-
>>  net/ipv6/route.c   | 37 -
>>  7 files changed, 70 insertions(+), 25 deletions(-)
>>
>> diff --git a/include/net/addrconf.h b/include/net/addrconf.h
>> index c4185a7b0e90..132e5b95167a 100644
>> --- a/include/net/addrconf.h
>> +++ b/include/net/addrconf.h
>> @@ -69,8 +69,8 @@ int addrconf_set_dstaddr(struct net *net, void __user 
>> *arg);
>>  int ipv6_chk_addr(struct net *net, const struct in6_addr *addr,
>>const struct net_device *dev, int strict);
>>  int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr,
>> -const struct net_device *dev, int strict,
>> -u32 banned_flags);
>> +const struct net_device *dev, bool skip_dev_check,
>> +int strict, u32 banned_flags);
> 
> This function already has 5 arguments, while this patch adds one more.
> Can't we use new flags argument for both of them?

They are skip_dev_check and strict.
 
> Also, the name of function and input parameters are already so big, that they
> don't fit a single line already, while your patch adds more parameters.
> Can't we make it more slim? Something like ipv6_chk_addr_fl() instead of 
> current
> name.
> 
>>  #if defined(CONFIG_IPV6_MIP6) || defined(CONFIG_IPV6_MIP6_MODULE)
>>  int ipv6_chk_home_addr(struct net *net, const struct in6_addr *addr);
>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
>> index b5fd116c046a..17d5d3f42d21 100644
>> --- a/net/ipv6/addrconf.c
>> +++ b/net/ipv6/addrconf.c
>> @@ -1851,22 +1851,40 @@ static int ipv6_count_addresses(const struct 
>> inet6_dev *idev)
>>  int ipv6_chk_addr(struct net *net, const struct in6_addr *addr,
>>const struct net_device *dev, int strict)
>>  {
>> -return ipv6_chk_addr_and_flags(net, addr, dev, strict, IFA_F_TENTATIVE);
>> +return ipv6_chk_addr_and_flags(net, addr, dev, !dev,
>> +   strict, IFA_F_TENTATIVE);
>>  }
>>  EXPORT_SYMBOL(ipv6_chk_addr);
> 
> This function was not introduced by this commit, but since the commit 
> modifies it,
> and the function is pretty simple, we could declare it as static inline in 
> header
> in separate patch.
> 
>>  
>> +/* device argument is used to find the L3 domain of interest. If
>> + * skip_dev_check is set, then the ifp device is not checked against
>> + * the passed in dev argument. So the 2 cases for addresses checks are:
>> + *   1. does the address exist in the L3 domain that dev is part of
>> + *  (skip_dev_check = true), or
>> + *
>> + *   2. does the address exist on the specific device
>> + *  (skip_dev_check = false)
>> + */
>>  int ipv6_chk_addr_and_flags(struct net *net, 

Re: [bpf-next V1 PATCH 13/15] mlx5: use page_pool for xdp_return_frame call

2018-03-07 Thread Tariq Toukan



On 06/03/2018 11:48 PM, Jesper Dangaard Brouer wrote:

This patch shows how it is possible to have both the driver local page
cache, which uses elevated refcnt for "catching"/avoiding SKB
put_page.  And at the same time, have pages getting returned to the
page_pool from ndo_xdp_xmit DMA completion.

Performance is surprisingly good. Tested DMA-TX completion on ixgbe,
that calls "xdp_return_frame", which calls page_pool_put_page().
Stats show DMA-TX-completion runs on CPU#9 and mlx5 RX runs on CPU#5.
(Internally page_pool uses ptr_ring, which is what gives the good
cross CPU performance).

Show adapter(s) (ixgbe2 mlx5p2) statistics (ONLY that changed!)
Ethtool(ixgbe2  ) stat:732863573 (732,863,573) <= tx_bytes /sec
Ethtool(ixgbe2  ) stat:781724427 (781,724,427) <= tx_bytes_nic /sec
Ethtool(ixgbe2  ) stat: 12214393 ( 12,214,393) <= tx_packets /sec
Ethtool(ixgbe2  ) stat: 12214435 ( 12,214,435) <= tx_pkts_nic /sec
Ethtool(mlx5p2  ) stat: 12211786 ( 12,211,786) <= rx3_cache_empty /sec
Ethtool(mlx5p2  ) stat: 36506736 ( 36,506,736) <= rx_64_bytes_phy /sec
Ethtool(mlx5p2  ) stat:   2336430575 (  2,336,430,575) <= rx_bytes_phy /sec
Ethtool(mlx5p2  ) stat: 12211786 ( 12,211,786) <= rx_cache_empty /sec
Ethtool(mlx5p2  ) stat: 22823073 ( 22,823,073) <= rx_discards_phy /sec
Ethtool(mlx5p2  ) stat:  1471860 (  1,471,860) <= rx_out_of_buffer /sec
Ethtool(mlx5p2  ) stat: 36506715 ( 36,506,715) <= rx_packets_phy /sec
Ethtool(mlx5p2  ) stat:   2336542282 (  2,336,542,282) <= rx_prio0_bytes /sec
Ethtool(mlx5p2  ) stat: 13683921 ( 13,683,921) <= rx_prio0_packets /sec
Ethtool(mlx5p2  ) stat:821015537 (821,015,537) <= rx_vport_unicast_bytes /sec
Ethtool(mlx5p2  ) stat: 13683608 ( 13,683,608) <= rx_vport_unicast_packets /sec

Before this patch: single flow performance was 6Mpps, and if I started
two flows the collective performance drop to 4Mpps, because we hit the
page allocator lock (further negative scaling occurs).

Signed-off-by: Jesper Dangaard Brouer 
---
  drivers/net/ethernet/mellanox/mlx5/core/en.h  |3 ++
  drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   39 ++---
  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   10 -
  3 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 28cc26debeda..ab91166f7c5a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -53,6 +53,8 @@
  #include "mlx5_core.h"
  #include "en_stats.h"
  
+struct page_pool;
+

you can have the include here instead:
#include <net/page_pool.h>

and remove it from the .c files.


  #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
  
  #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)

@@ -535,6 +537,7 @@ struct mlx5e_rq {
/* XDP */
struct bpf_prog   *xdp_prog;
struct mlx5e_xdpsq xdpsq;
+   struct page_pool  *page_pool;
  
  	/* control */

struct mlx5_wq_ctrlwq_ctrl;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 49732c8c27c1..fbe27110ff02 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -44,6 +44,8 @@
  #include "accel/ipsec.h"
  #include "vxlan.h"
  
+#include <net/page_pool.h>
+
  struct mlx5e_rq_param {
u32 rqc[MLX5_ST_SZ_DW(rqc)];
struct mlx5_wq_paramwq;
@@ -396,6 +398,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
int err;
int i;
  
+	struct page_pool_params pp_params = { 0 };
+
rqp->wq.db_numa_node = cpu_to_node(c->cpu);
  
  	err = mlx5_wq_ll_create(mdev, &rqp->wq, rqc_wq, &rq->wq,

@@ -506,12 +510,33 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
rq->mkey_be = c->mkey_be;
}
  
-	/* This must only be activate for order-0 pages */
-	if (rq->xdp_prog)
-		err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
-						 MEM_TYPE_PAGE_ORDER0, NULL);
-	if (err)
+	/* Create a page_pool and register it with rxq */
+	pp_params.size      = PAGE_POOL_PARAMS_SIZE;
+	pp_params.order     = rq->buff.page_order;
+	pp_params.dev       = c->pdev;
+	pp_params.nid       = cpu_to_node(c->cpu);
+	pp_params.dma_dir   = rq->buff.map_dir;
+	pp_params.pool_size = 1 << params->log_rq_size;

if rq->wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ, need to multiply by MLX5_MPWRQ_PAGES_PER_WQE.



+   pp_params.flags = 0; /* No-internal DMA mapping in page_pool */
+
+   /* page_pool can be used even when there is no rq->xdp_prog,
+* given page_pool does not handle DMA mapping there is no
+* required state to clear. And page_pool gracefully ha
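
The recycling scheme the stats above illustrate — a cache of buffers that refills from the allocator only when the cache is empty — can be sketched as a tiny userspace pool. This is a hypothetical single-threaded model, not the kernel page_pool API; the empty-cache fallback mirrors the rx_cache_empty counter in the ethtool output.

```c
#include <stdlib.h>

#define POOL_SIZE 4

/* Toy buffer recycler: alloc pops a recycled buffer from a small ring
 * and falls back to the system allocator when the ring is empty; free
 * pushes the buffer back, dropping to the allocator when the ring is
 * full. All names are hypothetical. */
struct pool_sketch {
	void *ring[POOL_SIZE];
	int head, tail, count;
	int fallback_allocs;	/* ~ the rx_cache_empty counter */
};

void *pool_alloc(struct pool_sketch *p, size_t sz)
{
	if (p->count) {
		void *buf = p->ring[p->tail];	/* recycled buffer */

		p->tail = (p->tail + 1) % POOL_SIZE;
		p->count--;
		return buf;
	}
	p->fallback_allocs++;			/* cache empty */
	return malloc(sz);
}

void pool_free(struct pool_sketch *p, void *buf)
{
	if (p->count < POOL_SIZE) {
		p->ring[p->head] = buf;		/* recycle */
		p->head = (p->head + 1) % POOL_SIZE;
		p->count++;
		return;
	}
	free(buf);	/* ring full: return to system allocator */
}
```

Freeing into the ring and allocating back out of it avoids the allocator lock that the cover letter identifies as the bottleneck once two flows are active.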

Re: [PATCH v9 crypto 02/12] ethtool: enable Inline TLS in HW

2018-03-07 Thread Sabrina Dubroca
Since you're saying the driver supports offloading TLS records to the
HW, why not call the feature "record offloading"?  With, for example,
NETIF_F_HW_TLS_RECORD as the feature, and maybe "tls-hw-record" for
the ethtool string.  This "Inline TLS" name sounds rather like
marketing to me.

2018-03-06, 21:07:22 +0530, Atul Gupta wrote:
> Signed-off-by: Atul Gupta 

There should be a description here of what the feature covers, so that
other drivers can decide if it matches what their HW supports.

-- 
Sabrina


Re: [PATCH v9 crypto 02/12] ethtool: enable Inline TLS in HW

2018-03-07 Thread Atul Gupta


On 3/7/2018 6:05 PM, Sabrina Dubroca wrote:
> Since you're saying the driver supports offloading TLS records to the
> HW, why not call the feature "record offloading"?  With, for example,
> NETIF_F_HW_TLS_RECORD as the feature, and maybe "tls-hw-record" for
> the ethtool string.  This "Inline TLS" name sounds rather like
> marketing to me.
Sure, will use record.
>
> 2018-03-06, 21:07:22 +0530, Atul Gupta wrote:
>> Signed-off-by: Atul Gupta 
> There should be a description here of what the feature covers, so that
> other drivers can decide if it matches what their HW supports.
Will add

Thanks
Atul
>



Re: [PATCH net] xfrm: Verify MAC header exists before overwriting eth_hdr(skb)->h_proto

2018-03-07 Thread Steffen Klassert
On Sun, Mar 04, 2018 at 09:03:52PM +0200, yoss...@mellanox.com wrote:
> From: Yossi Kuperman 
> 
> Artem Savkov reported that commit 5efec5c655dd leads to a packet loss under
> IPSec configuration. It appears that his setup consists of a TUN device,
> which does not have a MAC header.
> 
> Make sure MAC header exists.
> 
> Note: TUN device sets a MAC header pointer, although it does not have one.
> 
> Fixes: 5efec5c655dd ("xfrm: Fix eth_hdr(skb)->h_proto to reflect inner IP version")
> Reported-by: Artem Savkov 
> Tested-by: Artem Savkov 
> Signed-off-by: Yossi Kuperman 

Applied, thanks Yossi!


[patch net-next 2/2] selftests: forwarding: fix flags passed to first drop rule in gact_drop_and_ok_test

2018-03-07 Thread Jiri Pirko
From: Jiri Pirko 

Fix copy&paste error and pass proper flags.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 tools/testing/selftests/net/forwarding/tc_actions.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/forwarding/tc_actions.sh 
b/tools/testing/selftests/net/forwarding/tc_actions.sh
index ac6b6a1057d8..3a6385ebd5d0 100755
--- a/tools/testing/selftests/net/forwarding/tc_actions.sh
+++ b/tools/testing/selftests/net/forwarding/tc_actions.sh
@@ -81,7 +81,7 @@ gact_drop_and_ok_test()
RET=0
 
tc filter add dev $swp1 ingress protocol ip pref 2 handle 102 flower \
-   skip_hw dst_ip 192.0.2.2 action drop
+   $tcflags dst_ip 192.0.2.2 action drop
 
$MZ $h1 -c 1 -p 64 -a $h1mac -b $h2mac -A 192.0.2.1 -B 192.0.2.2 \
-t ip -q
-- 
2.14.3



[patch net-next 1/2] selftests: forwarding: fix "ok" action test

2018-03-07 Thread Jiri Pirko
From: Jiri Pirko 

Fix the "ok" action test so it checks that a packet that is okayed does not
continue to be processed by other rules. Fix the error message as well.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 tools/testing/selftests/net/forwarding/tc_actions.sh | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/forwarding/tc_actions.sh 
b/tools/testing/selftests/net/forwarding/tc_actions.sh
index 6b18ba2d3982..ac6b6a1057d8 100755
--- a/tools/testing/selftests/net/forwarding/tc_actions.sh
+++ b/tools/testing/selftests/net/forwarding/tc_actions.sh
@@ -96,7 +96,10 @@ gact_drop_and_ok_test()
-t ip -q
 
tc_check_packets "dev $swp1 ingress" 101 1
-   check_err $? "Did not see trapped packet"
+   check_err $? "Did not see passed packet"
+
+   tc_check_packets "dev $swp1 ingress" 102 2
+   check_fail $? "Packet was dropped and it should not reach here"
 
tc filter del dev $swp1 ingress protocol ip pref 2 handle 102 flower
tc filter del dev $swp1 ingress protocol ip pref 1 handle 101 flower
-- 
2.14.3



[patch net-next 0/2] selftests: forwarding: couple of fixes

2018-03-07 Thread Jiri Pirko
From: Jiri Pirko 

Jiri Pirko (2):
  selftests: forwarding: fix "ok" action test
  selftests: forwarding: fix flags passed to first drop rule in
gact_drop_and_ok_test

 tools/testing/selftests/net/forwarding/tc_actions.sh | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

-- 
2.14.3



Re: [PATCH] xfrm_policy: use true and false for boolean values

2018-03-07 Thread Steffen Klassert
On Mon, Mar 05, 2018 at 03:49:59PM -0600, Gustavo A. R. Silva wrote:
> Assign true or false to boolean variables instead of an integer value.
> 
> This issue was detected with the help of Coccinelle.
> 
> Signed-off-by: Gustavo A. R. Silva 

Patch applied to ipsec-next, thanks!


Re: [PATCH v9 crypto 06/12] cxgb4: LLD driver changes to enable TLS

2018-03-07 Thread Sabrina Dubroca
2018-03-06, 21:09:25 +0530, Atul Gupta wrote:
> Read FW capability. Read key area size. Dump the TLS record count.

That's not a really helpful commit message. Have a look at other
commit messages and try to be more descriptive.
It's also not clear if those changes belong together in one patch, but
I can't judge because the commit message is really too terse.

> Signed-off-by: Atul Gupta 
> ---
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 32 +---
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  |  7 ++
>  drivers/net/ethernet/chelsio/cxgb4/sge.c| 98 -
>  3 files changed, 126 insertions(+), 11 deletions(-)

[snip]
> +int cxgb4_immdata_send(struct net_device *dev, unsigned int idx,
> +const void *src, unsigned int len)
> +{
> + struct adapter *adap = netdev2adap(dev);
> + struct sge_uld_txq_info *txq_info;
> + struct sge_uld_txq *txq;
> + int ret;
> +
> + local_bh_disable();
> + txq_info = adap->sge.uld_txq_info[CXGB4_TX_OFLD];
> + if (unlikely(!txq_info)) {
> + WARN_ON(true);
> + return NET_XMIT_DROP;

Don't you need to local_bh_enable() before returning? If not, this
needs a comment to explain why.

> + }
> + txq = &txq_info->uldtxq[idx];
> +
> + ret = ofld_xmit_direct(txq, src, len);
> + local_bh_enable();
> + return net_xmit_eval(ret);
> +}
> +EXPORT_SYMBOL(cxgb4_immdata_send);

-- 
Sabrina
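
The error-path concern raised above — every exit from the function must pair local_bh_disable() with local_bh_enable() — is the usual single-exit cleanup idiom. A plain userspace sketch, with a mutex standing in for bottom-half disabling (all names hypothetical):

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t sketch_lock = PTHREAD_MUTEX_INITIALIZER;
static void *sketch_txq_info;	/* stands in for adap->sge.uld_txq_info[] */

/* Every exit path, including the early error return, goes through the
 * single unlock, just as every exit from cxgb4_immdata_send() should
 * go through local_bh_enable(). */
int immdata_send_sketch(const void *src, unsigned int len)
{
	int ret = 0;

	(void)src;
	(void)len;

	pthread_mutex_lock(&sketch_lock);	/* ~ local_bh_disable() */
	if (!sketch_txq_info) {
		ret = -1;			/* ~ NET_XMIT_DROP */
		goto out;	/* early exit still reaches the unlock */
	}
	/* ... queue src/len for transmission ... */
out:
	pthread_mutex_unlock(&sketch_lock);	/* ~ local_bh_enable() */
	return ret;
}
```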


[PATCH net 0/1] net/smc: fix 2018-03-07

2018-03-07 Thread Ursula Braun
Hi Dave,

please apply the following smc bug fix for net.

Thanks, Ursula

Ursula Braun (1):
  net/smc: wakeup closing listen socket

 net/smc/af_smc.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

-- 
2.13.5



[PATCH net 1/1] net/smc: wakeup closing listen socket

2018-03-07 Thread Ursula Braun
Closing of a listen socket wakes up kernel_accept() of
smc_tcp_listen_worker(), and then has to wait till smc_tcp_listen_worker()
gives up the internal clcsock. This wait is non-interruptible and does
not match the interruptible wakeup in smc_tcp_listen_worker().
Thus wakeup fails, and waiting for the closed internal clcsock never ends
before the wait timeout is reached.
This patch introduces the matching non-interruptible wakeup.

Fixes: 127f49705823 ("net/smc: release clcsock from tcp_listen_worker")
Reported-by: Hans Wippel 
Signed-off-by: Ursula Braun 
---
 net/smc/af_smc.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 8cc97834d4f6..ad4c43eb7e7f 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -946,6 +946,20 @@ static void smc_listen_work(struct work_struct *work)
goto enqueue; /* queue new sock with sk_err set */
 }
 
+/* a non-interruptible variant of sock_def_wakeup() for wakeup of
+ * smc_close_wait_listen_clcsock()
+ */
+static void smc_sock_wakeup(struct sock *sk)
+{
+   struct socket_wq *wq;
+
+   rcu_read_lock();
+   wq = rcu_dereference(sk->sk_wq);
+   if (skwq_has_sleeper(wq))
+   wake_up_all(&wq->wait);
+   rcu_read_unlock();
+}
+
 static void smc_tcp_listen_work(struct work_struct *work)
 {
struct smc_sock *lsmc = container_of(work, struct smc_sock,
@@ -981,7 +995,7 @@ static void smc_tcp_listen_work(struct work_struct *work)
/* no more listening, wake up smc_close_wait_listen_clcsock and
 * accept
 */
-   lsk->sk_state_change(lsk);
+   smc_sock_wakeup(lsk);
sock_put(&lsmc->sk); /* sock_hold in smc_listen */
 }
 
-- 
2.13.5



Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-07 Thread Daniel Borkmann
On 03/07/2018 04:25 AM, John Fastabend wrote:
[...]
> Thought about this a bit more and chatted with Daniel a bit. I think
> a better solution is to set data_start = data_end = 0 by default in the
> sendpage case. This will disallow any read/writes into the sendpage
> data. Then if the user needs to read/write data we can use a helper
> bpf_sk_msg_pull_data(start_byte, end_byte) which can pull the data into a
> linear buffer as needed. This will ensure any user writes will not
> change data after the BPF verdict (your concern). Also it will minimize
> the amount of data that needs to be copied (my concern). In some of my
> use cases where no data is needed we can simple not use the helper. Then
> on the sendmsg side we can continue to set the (data_start, data_end)
> pointers to the first scatterlist element. But, also use this helper to
> set the data pointers past the first scatterlist element if needed. So
> if someone wants to read past the first 4k bytes on a large send for
> example this can be done with the above helper. BPF programs just
> need to check (start,end) data pointers and can be oblivious to
> if the program is being invoked by a call from sendpage or sendmsg.
> 
> I think this is a fairly elegant solution. Finally we can further
> optimize later with a flag if needed to cover the case where we
> want to read lots of bytes but _not_ do the copy. We can debate
> the usefulness of this later with actual perf data.
> 
> All this has the added bonus that all I need is another patch on
> top to add the helper. Pseudo code might look like this,
> 
> my_bpf_prog(struct sk_msg_md *msg) {
>   void *data_end = msg->data_end;
>   void *data_start = msg->data_start;
> 
>   need = PARSE_BYTES;
> 
>   // ensure user actually sent full header
>   if (msg->size < PARSE_BYTES) {
>   bpf_msg_cork(PARSE_BYTES);
>   return SK_DROP;
>   }
> 
>   /* ensure we can read full header, if this is a
>* sendmsg system call AND PARSE_BYTES are all in
>* the first scatterlist elem this is a no-op.
>* If this is a sendpage call will put PARSE_BYTES
>* in a psock buffer to avoid user modifications.
>*/
>   if (data_end - data_start < PARSE_BYTES) {

I think it might need to look like 'data_start + PARSE_BYTES > data_end'
for verifier to recognize (unless LLVM generates code that way).

>   err = bpf_sk_msg_pull_data(0, PARSE_BYTES, flags);
>   if (err)
>   return SK_DROP;

Above should be:

if (unlikely(err || data_start + PARSE_BYTES > data_end))
return SK_DROP;

Here for the successful case, you need to recheck since data pointers
were invalidated due to the helper call. bpf_sk_msg_pull_data() would
for the very first case potentially be called unconditionally at prog
start though when you start out with 0 len anyway, basically right after
msg->size test.

>   }
> 
>   // we have the full header parse it now
>   verdict = my_bpf_header_parser(msg);
>   return verdict;
> }
> 
> Future optimization can work with prologue to pull in bytes
> more efficiently. And for what its worth I found a couple bugs
> in the error path of the sendpage hook I can fix in the v2 as well.
> 
> What do you think? 
> 
> @Daniel, sound more or less like what you were thinking?

Yes, absolutely what I was thinking.

We have exactly the same logic in tc/BPF today for the case when the direct
packet access test fails and we want to pull in skb data from non-linear
area, so we can in such case just call bpf_skb_pull_data(skb, len) and redo
the test to access it privately after that.
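
The recheck-after-pull pattern discussed here can be sketched in plain userspace C: a pull helper may relocate the data, so the old pointers are invalid and the bounds test must be redone. The names below are hypothetical stand-ins, not the real BPF helpers or verifier.

```c
#include <string.h>
#include <stddef.h>

#define PARSE_BYTES 8

/* Hypothetical message with direct-access pointers, mimicking the
 * (data_start, data_end) window of sk_msg_md. */
struct msg_sketch {
	unsigned char *data_start;
	unsigned char *data_end;
	unsigned char linear[PARSE_BYTES];	/* private linear copy */
};

/* Pretend-pull (~ bpf_sk_msg_pull_data / bpf_skb_pull_data): copy the
 * first n bytes into a private linear buffer and repoint
 * data_start/data_end at it, invalidating the old pointers. */
static int pull_data_sketch(struct msg_sketch *m, const unsigned char *frag,
			    size_t frag_len, size_t n)
{
	if (n > sizeof(m->linear) || frag_len < n)
		return -1;
	memcpy(m->linear, frag, n);
	m->data_start = m->linear;
	m->data_end = m->linear + n;
	return 0;
}

int parse_header_sketch(struct msg_sketch *m, const unsigned char *frag,
			size_t frag_len)
{
	/* first direct-access test */
	if (m->data_start + PARSE_BYTES > m->data_end) {
		if (pull_data_sketch(m, frag, frag_len, PARSE_BYTES))
			return -1;		/* ~ SK_DROP */
		/* pointers were invalidated by the pull: recheck */
		if (m->data_start + PARSE_BYTES > m->data_end)
			return -1;		/* ~ SK_DROP */
	}
	return m->data_start[0];	/* direct access is safe now */
}
```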

Thanks,
Daniel


Re: [PATCH] net: don't unnecessarily load kernel modules in dev_ioctl()

2018-03-07 Thread Paul Moore
On Tue, Mar 6, 2018 at 6:59 PM, Stephen Hemminger
 wrote:
> On Tue, 06 Mar 2018 17:27:44 -0500
> Paul Moore  wrote:
>> From: Paul Moore 
>>
>> Starting with v4.16-rc1 we've been seeing a higher than usual number
>> of requests for the kernel to load networking modules, even on events
>> which shouldn't trigger a module load (e.g. ioctl(TCGETS)).  Stephen
>> Smalley suggested the problem may lie in commit 44c02a2c3dc5
>> ("dev_ioctl(): move copyin/copyout to callers") which changes
>> the network dev_ioctl() function to always call dev_load(),
>> regardless of the requested ioctl.
>>
>> This patch moves the dev_load() calls back into the individual ioctls
>> while preserving the rest of the original patch.
>>
>> Reported-by: Dominick Grift 
>> Suggested-by: Stephen Smalley 
>> Signed-off-by: Paul Moore 
>> ---
>>  net/core/dev_ioctl.c |7 +--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
>> index 0ab1af04296c..a04e1e88bf3a 100644
>> --- a/net/core/dev_ioctl.c
>> +++ b/net/core/dev_ioctl.c
>> @@ -402,8 +402,6 @@ int dev_ioctl(struct net *net, unsigned int cmd, struct 
>> ifreq *ifr, bool *need_c
>>   if (colon)
>>   *colon = 0;
>>
>> - dev_load(net, ifr->ifr_name);
>
> Actually dev_load by ethernet name is really a legacy thing that should
> just die.
>
> It was kept around so that some very old tunnel configurations using
> special names keep working, e.g.:
>
> # ifconfig sit0
>
> which probably several web pages still tell users to do...
> We have much better control now with ip commands so that this is just
> baggage.

In an effort to get this regression fixed quickly, and not get tangled
up in a user education issue, can we at least restore the old ioctl()
behavior and worry about removing dev_load() later?

-- 
paul moore
www.paul-moore.com


RE: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed

2018-03-07 Thread Yuval Mintz
> >>> After unshare test kworker hangs for ages:
> >>>
> >>> $ while :; do unshare -n true; done &
> >>>
> >>> $ perf report 
> >>> -   88,82% 0,00%  kworker/u16:0  [kernel.vmlinux]  [k]
> >>> cleanup_net
> >>>  cleanup_net
> >>>    - ops_exit_list.isra.9
> >>>   - 85,68% igmp6_net_exit
> >>>  - 53,31% sock_release
> >>> - inet_release
> >>>    - 25,33% rawv6_close
> >>>   - ip6mr_sk_done
> >>>  + 23,38% synchronize_rcu
> >>>
> >>> Keep in mind, this perf report shows the time a function was
> >>> executing; it does not show the time it was sleeping. But it's easy
> >>> to imagine how much it is sleeping, if synchronize_rcu() execution
> >>> takes the most time.
> >>> Top shows the kworker R time is less then 1%.
> >>>
> >>> This happens because in ip6mr_sk_done() we do too many
> >>> synchronize_rcu() calls, even for sockets that are not referenced
> >>> in mr_table and do not need it. So the progress of the kworker
> >>> becomes very slow.
> >>>
> >>> The patch introduces the apparent solution: it makes ip6mr_sk_done()
> >>> skip synchronize_rcu() for sockets that do not need it. After the
> >>> patch, the kworker becomes able to warm the cpu up as expected.
> >>>
> >>> Signed-off-by: Kirill Tkhai 
> >>> ---
> >>>  net/ipv6/ip6mr.c |4 +++-
> >>>  1 file changed, 3 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
> >>> index 2a38f9de45d3..290a8d0d5eac 100644
> >>> --- a/net/ipv6/ip6mr.c
> >>> +++ b/net/ipv6/ip6mr.c
> >>> @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk)
> >>>   }
> >>>   }
> >>>   rtnl_unlock();
> >>> - synchronize_rcu();
> >>> +
> >>> + if (!err)
> >>> + synchronize_rcu();
> >>>
> >>
> >>
> >> But... what is this synchronize_rcu() doing exactly ?
> >>
> >> This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c
> >>
> >> ("ip6mr: Make mroute_sk rcu-based")
> >>
> >> Typically on a delete, the synchronize_rcu() would be needed before
> >> freeing the deleted object.
> >>
> >> But nowadays we have better way : SOCK_RCU_FREE
> >
> > Hm. I'm agree with you. This is hot path, and mroute sockets created from
> userspace
> > will delay userspace tasks close() and exit(). Since there may be many such
> sockets,
> > we may get a zombie task, which can't be reaped for ages. This slows down
> the system
> > directly.
> >
> > Fix for pernet_operations works, but we need generic solution instead.
> >
> > The commit "8571ab479a6e1ef46ead5ebee567e128a422767c" says:
> >
> > ip6mr: Make mroute_sk rcu-based
> >
> > In ipmr the mr_table socket is handled under RCU. Introduce the same
> > for ip6mr.
> >
> > It does not point to any improvement it brings, or to the problem it
> > solves. The description looks like a cleanup. It's too expensive a
> > cleanup, if it worsens the performance a hundred times.
> >
> > Can't we simply revert it?!
> >
> > Yuval, do you have ideas to fix that (maybe, via SOCK_RCU_FREE suggested
> by Eric)?

Sorry, failed to notice ip6mr_sk_done() is called unconditionally from
rawv6_close(). But even with your suggested fix it should be ~resolved
[as only sockets used for ip6mr would reach the sync].
Or are you claiming there's still some performance hit even with your
suggested change?

> >
> > We actually use rcu_dereference() in ip6mr_cache_report() only. The only
> user of dereference
> > is sock_queue_rcv_skb(). Superficial analysis shows we purge the queue in
> inet_sock_destruct().
> 
> + So this change should be safe.

I might have misunderstood the comment from
commit 4c9687098f24 ("ipmr: RCU conversion of mroute_sk") when writing
this; Thought comment regarding ip_ra_destroy() meant that for the IPv6 case
we DO have to make sure there's a grace-period before destroying the socket.
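
The trade-off under discussion — a blocking synchronize_rcu() in every close() versus call_rcu()/SOCK_RCU_FREE-style deferred freeing — can be sketched with a toy single-threaded model (hypothetical names): closes only queue the object, and one grace-period flush frees everything queued before it, so many closes share a single grace period instead of each waiting out its own.

```c
#include <stdlib.h>

#define MAX_DEFERRED 16

/* Toy deferred-free queue standing in for the call_rcu /
 * SOCK_RCU_FREE mechanism; hypothetical, not the kernel RCU API. */
struct deferred_free {
	void *objs[MAX_DEFERRED];
	int n;
};

/* Fast path: close() just queues the object and returns immediately,
 * instead of blocking in a synchronize_rcu()-style wait. */
int close_deferred(struct deferred_free *df, void *sock)
{
	if (df->n >= MAX_DEFERRED)
		return -1;	/* toy limit; the kernel queue is unbounded */
	df->objs[df->n++] = sock;
	return 0;
}

/* Runs once a grace period has elapsed, freeing everything queued
 * before it: many closes amortized over one grace period. */
int grace_period_flush(struct deferred_free *df)
{
	int freed = df->n;

	for (int i = 0; i < df->n; i++)
		free(df->objs[i]);
	df->n = 0;
	return freed;
}
```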

> 
> > Thanks,
> > Kirill
> >


[PATCH net v4 0/2] rhashtable: Fix rhltable duplicates insertion

2018-03-07 Thread Paul Blakey
On our mlx5 driver fs_core.c, we use the rhltable interface to store
flow groups. We noticed that sometimes we get a warning that a flow group
isn't found at removal. This rare case occurred in a specific scenario:
insertion of a flow group with a similar match criteria (a duplicate), but
only when the flow group's rhash_head was second (or not first) on the
relevant rhashtable bucket list.

The first patch fixes it, and the second one adds a test that shows
it is now working.

Paul.

v4 --> v3 changes:
* Added Herbert Xu's ack (thanks)
* Removed extra commit tags
v3 --> v2 changes:
* Added missing fix in rhashtable_lookup_one code path as well.

v2 --> v1 changes:
* Changed commit messages to better reflect the change

Paul Blakey (2):
  rhashtable: Fix rhlist duplicates insertion
  test_rhashtable: add test case for rhltable with duplicate objects

 include/linux/rhashtable.h |   4 +-
 lib/rhashtable.c   |   4 +-
 lib/test_rhashtable.c  | 134 +
 3 files changed, 140 insertions(+), 2 deletions(-)

-- 
1.8.4.3



[PATCH net v4 2/2] test_rhashtable: add test case for rhltable with duplicate objects

2018-03-07 Thread Paul Blakey
Tries to insert duplicates in the middle of bucket's chain:
bucket 1:  [[val 21 (tid=1)]] -> [[ val 1 (tid=2),  val 1 (tid=0) ]]

Reuses tid to distinguish the elements' insertion order.

Signed-off-by: Paul Blakey 
---
 lib/test_rhashtable.c | 134 ++
 1 file changed, 134 insertions(+)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 76d3667..f4000c1 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -79,6 +79,21 @@ struct thread_data {
struct test_obj *objs;
 };
 
+static u32 my_hashfn(const void *data, u32 len, u32 seed)
+{
+   const struct test_obj_rhl *obj = data;
+
+   return (obj->value.id % 10) << RHT_HASH_RESERVED_SPACE;
+}
+
+static int my_cmpfn(struct rhashtable_compare_arg *arg, const void *obj)
+{
+   const struct test_obj_rhl *test_obj = obj;
+   const struct test_obj_val *val = arg->key;
+
+   return test_obj->value.id - val->id;
+}
+
 static struct rhashtable_params test_rht_params = {
.head_offset = offsetof(struct test_obj, node),
.key_offset = offsetof(struct test_obj, value),
@@ -87,6 +102,17 @@ struct thread_data {
.nulls_base = (3U << RHT_BASE_SHIFT),
 };
 
+static struct rhashtable_params test_rht_params_dup = {
+   .head_offset = offsetof(struct test_obj_rhl, list_node),
+   .key_offset = offsetof(struct test_obj_rhl, value),
+   .key_len = sizeof(struct test_obj_val),
+   .hashfn = jhash,
+   .obj_hashfn = my_hashfn,
+   .obj_cmpfn = my_cmpfn,
+   .nelem_hint = 128,
+   .automatic_shrinking = false,
+};
+
 static struct semaphore prestart_sem;
 static struct semaphore startup_sem = __SEMAPHORE_INITIALIZER(startup_sem, 0);
 
@@ -465,6 +491,112 @@ static int __init test_rhashtable_max(struct test_obj 
*array,
return err;
 }
 
+static unsigned int __init print_ht(struct rhltable *rhlt)
+{
+   struct rhashtable *ht;
+   const struct bucket_table *tbl;
+   char buff[512] = "";
+   unsigned int i, cnt = 0;
+
+   ht = &rhlt->ht;
+   tbl = rht_dereference(ht->tbl, ht);
+   for (i = 0; i < tbl->size; i++) {
+   struct rhash_head *pos, *next;
+   struct test_obj_rhl *p;
+
+   pos = rht_dereference(tbl->buckets[i], ht);
+   next = !rht_is_a_nulls(pos) ? rht_dereference(pos->next, ht) : 
NULL;
+
+   if (!rht_is_a_nulls(pos)) {
+   sprintf(buff, "%s\nbucket[%d] -> ", buff, i);
+   }
+
+   while (!rht_is_a_nulls(pos)) {
+   struct rhlist_head *list = container_of(pos, struct 
rhlist_head, rhead);
+   sprintf(buff, "%s[[", buff);
+   do {
+   pos = &list->rhead;
+   list = rht_dereference(list->next, ht);
+   p = rht_obj(ht, pos);
+
+   sprintf(buff, "%s val %d (tid=%d)%s", buff, 
p->value.id, p->value.tid,
+   list? ", " : " ");
+   cnt++;
+   } while (list);
+
+   pos = next,
+   next = !rht_is_a_nulls(pos) ?
+   rht_dereference(pos->next, ht) : NULL;
+
+   sprintf(buff, "%s]]%s", buff, !rht_is_a_nulls(pos) ? " 
-> " : "");
+   }
+   }
+   printk(KERN_ERR "\n ht: %s\n-\n", buff);
+
+   return cnt;
+}
+
+static int __init test_insert_dup(struct test_obj_rhl *rhl_test_objects,
+ int cnt, bool slow)
+{
+   struct rhltable rhlt;
+   unsigned int i, ret;
+   const char *key;
+   int err = 0;
+
+   err = rhltable_init(&rhlt, &test_rht_params_dup);
+   if (WARN_ON(err))
+   return err;
+
+   for (i = 0; i < cnt; i++) {
+   rhl_test_objects[i].value.tid = i;
+   key = rht_obj(&rhlt.ht, &rhl_test_objects[i].list_node.rhead);
+   key += test_rht_params_dup.key_offset;
+
+   if (slow) {
+   err = PTR_ERR(rhashtable_insert_slow(&rhlt.ht, key,
+
&rhl_test_objects[i].list_node.rhead));
+   if (err == -EAGAIN)
+   err = 0;
+   } else
+   err = rhltable_insert(&rhlt,
+ &rhl_test_objects[i].list_node,
+ test_rht_params_dup);
+   if (WARN(err, "error %d on element %d/%d (%s)\n", err, i, cnt, 
slow? "slow" : "fast"))
+   goto skip_print;
+   }
+
+   ret = print_ht(&rhlt);
+   WARN(ret != cnt, "missing rhltable elements (%d != %d, %s)\n", ret, 
cnt, slow? "slow" : "fast");
+
+skip_print:
+   rhltable_destroy(&rhlt);
+
+   ret

[PATCH net v4 1/2] rhashtable: Fix rhlist duplicates insertion

2018-03-07 Thread Paul Blakey
When inserting duplicate objects (those with the same key),
the current rhlist implementation messes up the chain pointers by
updating the bucket pointer, instead of the previous element's next
pointer, to point to the newly inserted node. This causes missing
elements on removal and traversal.

Fix that by properly updating pprev pointer to point to
the correct rhash_head next pointer.
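The pointer manipulation behind this fix can be modelled in user space. Below is a sketch (Python, illustrative names only, not kernel code) of a bucket-chain insert that advances a pprev reference past non-matching nodes, so a duplicate found mid-chain is linked in through its predecessor's next pointer instead of through the bucket head:

```python
class Node:
    """Minimal stand-in for rhash_head: a key plus a next pointer."""
    def __init__(self, key):
        self.key = key
        self.next = None

class Bucket:
    """Stand-in for the bucket head pointer (an object with .next)."""
    def __init__(self):
        self.next = None

def insert_before_match(bucket, new):
    """Insert `new` in front of the first node with the same key.

    `pprev` tracks whichever pointer (bucket head or a node's .next)
    must be rewritten.  The bug was never advancing it, so a match in
    the middle of the chain overwrote the bucket head and unlinked
    every node before the match.
    """
    pprev = bucket
    head = bucket.next
    while head is not None and head.key != new.key:
        pprev = head          # the fix: advance past non-matching nodes
        head = head.next
    new.next = head
    pprev.next = new          # correct pointer: bucket head OR predecessor

def chain(bucket):
    """Return the list of keys reachable from the bucket head."""
    out, p = [], bucket.next
    while p is not None:
        out.append(p.key)
        p = p.next
    return out
```

With the bug (pprev stuck at the bucket head), inserting the second duplicate of key 1 into the chain [21, 1] would have replaced the head and made node 21 unreachable; with the fix, the chain stays 21 -> 1 -> 1, matching the scenario exercised by patch 2/2.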

Fixes: ca26893f05e8 ('rhashtable: Add rhlist interface')
Signed-off-by: Paul Blakey 
---
 include/linux/rhashtable.h | 4 +++-
 lib/rhashtable.c   | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index c9df252..668a21f 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -766,8 +766,10 @@ static inline void *__rhashtable_insert_fast(
if (!key ||
(params.obj_cmpfn ?
 params.obj_cmpfn(&arg, rht_obj(ht, head)) :
-rhashtable_compare(&arg, rht_obj(ht, head
+rhashtable_compare(&arg, rht_obj(ht, head {
+   pprev = &head->next;
continue;
+   }
 
data = rht_obj(ht, head);
 
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 3825c30..47de025 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -506,8 +506,10 @@ static void *rhashtable_lookup_one(struct rhashtable *ht,
if (!key ||
(ht->p.obj_cmpfn ?
 ht->p.obj_cmpfn(&arg, rht_obj(ht, head)) :
-rhashtable_compare(&arg, rht_obj(ht, head
+rhashtable_compare(&arg, rht_obj(ht, head {
+   pprev = &head->next;
continue;
+   }
 
if (!ht->rhlist)
return rht_obj(ht, head);
-- 
1.8.4.3



[PATCH net 0/5] tcp: fixes to non-SACK TCP

2018-03-07 Thread Ilpo Järvinen
Here is a series of fixes to issues that occur when SACK is not
enabled for a TCP connection:

tcp: feed correct number of pkts acked to cc modules
tcp: prevent bogus FRTO undos with non-SACK flows
tcp: move false FR condition into
tcp: prevent bogus undos when SACK is not enabled
tcp: send real dupACKs by locking advertized window


[PATCH net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows

2018-03-07 Thread Ilpo Järvinen
Currently, the TCP code is overly eager to update the window on
every ACK. This makes some of the ACKs that the receiver should
send as dupACKs look like window updates, which are not considered
real dupACKs by the non-SACK sender-side code.

Make sure that when an ofo segment is received, no change to
window is applied if we are going to send a dupACK. It's ok
to change the window for non-dupACKs (such as the first ACK
after ofo arrivals start if that ACK was using delayed ACKs).
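As a rough user-space model (Python, with a dict standing in for tcp_sock and illustrative field names, not the kernel implementation), the receiver-side behaviour described above looks like this: when an out-of-order arrival has set the lock flag, the previously advertised window is reused so the outgoing ACK is a pure dupACK:

```python
def select_window(tp):
    """Sketch of the patched window selection.

    When dupack_wnd_lock is set (an ofo segment arrived on a non-SACK
    flow), return the previously advertised window unchanged so the ACK
    counts as a real dupACK at the sender.  Otherwise recompute the
    offer; max() here is a crude stand-in for __tcp_select_window()
    plus the "never shrink the offered window" rule.
    """
    if tp["dupack_wnd_lock"]:
        tp["dupack_wnd_lock"] = False
        return tp["rcv_wnd"]          # unchanged window => pure dupACK
    new_win = max(tp["free_space"], tp["rcv_wnd"])
    tp["rcv_wnd"] = new_win
    return new_win
```

The first ACK after the lock clears is free to advertise the grown window again, which corresponds to the delayed-ACK case mentioned above.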

Signed-off-by: Ilpo Järvinen 
---
 include/linux/tcp.h   |  3 ++-
 net/ipv4/tcp_input.c  |  5 -
 net/ipv4/tcp_output.c | 43 +--
 3 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 8f4c549..e239662 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -225,7 +225,8 @@ struct tcp_sock {
fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
fastopen_no_cookie:1, /* Allow send/recv SYN+data without a 
cookie */
is_sack_reneg:1,/* in recovery from loss with SACK reneg? */
-   unused:2;
+   dupack_wnd_lock :1, /* Non-SACK constant rwnd dupacks needed? */
+   unused:1;
u8  nonagle : 4,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
unused1 : 1,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b689915..77a289f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4633,6 +4633,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, 
size_t size)
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
+   struct inet_connection_sock *icsk = inet_csk(sk);
bool fragstolen;
int eaten;
 
@@ -4676,7 +4677,7 @@ static void tcp_data_queue(struct sock *sk, struct 
sk_buff *skb)
 * gap in queue is filled.
 */
if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
-   inet_csk(sk)->icsk_ack.pingpong = 0;
+   icsk->icsk_ack.pingpong = 0;
}
 
if (tp->rx_opt.num_sacks)
@@ -4726,6 +4727,8 @@ static void tcp_data_queue(struct sock *sk, struct 
sk_buff *skb)
goto queue_and_out;
}
 
+   if (tcp_is_reno(tp) && !(icsk->icsk_ack.pending & ICSK_ACK_TIMER))
+   tp->dupack_wnd_lock = 1;
tcp_data_queue_ofo(sk, skb);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6818042..45fbe92 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -249,25 +249,32 @@ static u16 tcp_select_window(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
u32 old_win = tp->rcv_wnd;
-   u32 cur_win = tcp_receive_window(tp);
-   u32 new_win = __tcp_select_window(sk);
-
-   /* Never shrink the offered window */
-   if (new_win < cur_win) {
-   /* Danger Will Robinson!
-* Don't update rcv_wup/rcv_wnd here or else
-* we will not be able to advertise a zero
-* window in time.  --DaveM
-*
-* Relax Will Robinson.
-*/
-   if (new_win == 0)
-   NET_INC_STATS(sock_net(sk),
- LINUX_MIB_TCPWANTZEROWINDOWADV);
-   new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+   u32 cur_win;
+   u32 new_win;
+
+   if (tp->dupack_wnd_lock) {
+   new_win = old_win;
+   tp->dupack_wnd_lock = 0;
+   } else {
+   cur_win = tcp_receive_window(tp);
+   new_win = __tcp_select_window(sk);
+   /* Never shrink the offered window */
+   if (new_win < cur_win) {
+   /* Danger Will Robinson!
+* Don't update rcv_wup/rcv_wnd here or else
+* we will not be able to advertise a zero
+* window in time.  --DaveM
+*
+* Relax Will Robinson.
+*/
+   if (new_win == 0)
+   NET_INC_STATS(sock_net(sk),
+ LINUX_MIB_TCPWANTZEROWINDOWADV);
+   new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+   }
+   tp->rcv_wnd = new_win;
+   tp->rcv_wup = tp->rcv_nxt;
}
-   tp->rcv_wnd = new_win;
-   tp->rcv_wup = tp->rcv_nxt;
 
/* Make sure we do not exceed the maximum possible
 * scaled window.
-- 
2.7.4



[PATCH net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows

2018-03-07 Thread Ilpo Järvinen
In a non-SACK case, any non-retransmitted segment acknowledged will
set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
no indication that it would have been delivered for real (the
scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
case). This causes bogus undos in ordinary RTO recoveries where
segments are lost here and there, with a few delivered segments in
between losses. A cumulative ACK will cover the retransmitted segments
at the bottom and the non-retransmitted ones following them, causing
FLAG_ORIG_SACK_ACKED to be set in tcp_clean_rtx_queue and resulting
in a spurious FRTO undo.

We need to make the check stricter for the non-SACK case and check
that none of the cumulatively ACKed segments were retransmitted,
which would be the case for the last step of the FRTO algorithm, as
we sent out only new segments previously. Only then allow the FRTO
undo to proceed in the non-SACK case.
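A minimal decision-function sketch of the stricter check (Python; the flag values are made up for illustration, only the logic mirrors the patch):

```python
# Illustrative flag bits, not the kernel's actual values.
FLAG_ORIG_SACK_ACKED    = 0x1  # never-retransmitted data was (s)acked
FLAG_RETRANS_DATA_ACKED = 0x2  # some retransmitted data was acked

def frto_undo_allowed(flag, is_sack):
    """Step 3.b of F-RTO (RFC 5682) with the stricter non-SACK check.

    With SACK, FLAG_ORIG_SACK_ACKED reliably means original data was
    delivered.  Without SACK the flag can be set by a cumulative ACK
    that also covers retransmissions, so additionally require that no
    retransmitted data was ACKed before declaring the RTO spurious.
    """
    if not (flag & FLAG_ORIG_SACK_ACKED):
        return False
    return is_sack or not (flag & FLAG_RETRANS_DATA_ACKED)
```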

Signed-off-by: Ilpo Järvinen 
---
 net/ipv4/tcp_input.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0305f6d..1a33752 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2629,8 +2629,13 @@ static void tcp_process_loss(struct sock *sk, int flag, 
bool is_dupack,
if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
/* Step 3.b. A timeout is spurious if not all data are
 * lost, i.e., never-retransmitted data are (s)acked.
+*
+* As the non-SACK case does not keep track of 
TCPCB_SACKED_ACKED,
+* we need to ensure that none of the cumulative ACKed segments
+* was retransmitted to confirm the validity of 
FLAG_ORIG_SACK_ACKED.
 */
if ((flag & FLAG_ORIG_SACK_ACKED) &&
+   (tcp_is_sack(tp) || !(flag & FLAG_RETRANS_DATA_ACKED)) &&
tcp_try_undo_loss(sk, true))
return;
 
-- 
2.7.4



[PATCH net 4/5] tcp: prevent bogus undos when SACK is not enabled

2018-03-07 Thread Ilpo Järvinen
A bogus undo may/will trigger when the loss recovery state is
kept until snd_una is above high_seq. If tcp_any_retrans_done
is zero, retrans_stamp is cleared in this transient state. On
the next ACK, tcp_try_undo_recovery again executes and
tcp_may_undo will always return true because tcp_packet_delayed
has this condition:
return !tp->retrans_stamp || ...

Check for the false fast retransmit transient condition in
tcp_packet_delayed to avoid bogus undos. Since snd_una may have
advanced on this ACK but CA state still remains unchanged,
prior_snd_una needs to be passed instead of tp->snd_una.
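A simplified model of the patched tcp_packet_delayed() logic (Python, with a dict standing in for tcp_sock; field names are illustrative and timestamp handling is reduced to a single comparison):

```python
def packet_delayed(tp, prior_snd_una):
    """Sketch of the patched delayed-packet detection.

    retrans_stamp == 0 no longer unconditionally means "nothing was
    retransmitted": in the transient state where old recovery state is
    held until something above high_seq is ACKed (non-SACK only),
    retrans_stamp was cleared on purpose, so do not report a delayed
    packet there -- that is what produced the bogus undos.
    """
    def false_fast_retrans_possible():
        return (not tp["is_sack"]) and prior_snd_una == tp["high_seq"]

    if tp["retrans_stamp"] == 0:
        return not false_fast_retrans_possible()
    # Normal path: echoed timestamp predates the first retransmission.
    return tp["rcv_tsecr"] < tp["retrans_stamp"]
```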

Signed-off-by: Ilpo Järvinen 
---
 net/ipv4/tcp_input.c | 50 ++
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e20f9ad..b689915 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -98,6 +98,7 @@ int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 #define FLAG_UPDATE_TS_RECENT  0x4000 /* tcp_replace_ts_recent() */
 #define FLAG_NO_CHALLENGE_ACK  0x8000 /* do not call tcp_send_challenge_ack()  
*/
 #define FLAG_ACK_MAYBE_DELAYED 0x1 /* Likely a delayed ACK */
+#define FLAG_PACKET_DELAYED0x2 /* 0 rexmits or tstamps reveal delayed 
pkt */
 
 #define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
 #define FLAG_NOT_DUP   (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
@@ -2243,10 +2244,19 @@ static bool tcp_skb_spurious_retrans(const struct 
tcp_sock *tp,
 /* Nothing was retransmitted or returned timestamp is less
  * than timestamp of the first retransmission.
  */
-static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
+static inline bool tcp_packet_delayed(const struct sock *sk,
+ const u32 prior_snd_una)
 {
-   return !tp->retrans_stamp ||
-  tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
+   const struct tcp_sock *tp = tcp_sk(sk);
+
+   if (!tp->retrans_stamp) {
+   /* Sender will be in a transient state with cleared
+* retrans_stamp during false fast retransmit prevention
+* mechanism
+*/
+   return !tcp_false_fast_retrans_possible(sk, prior_snd_una);
+   }
+   return tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
 }
 
 /* Undo procedures. */
@@ -2336,17 +2346,20 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, 
bool unmark_loss)
tp->rack.advanced = 1; /* Force RACK to re-exam losses */
 }
 
-static inline bool tcp_may_undo(const struct tcp_sock *tp)
+static inline bool tcp_may_undo(const struct sock *sk, const int flag)
 {
-   return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp));
+   const struct tcp_sock *tp = tcp_sk(sk);
+
+   return tp->undo_marker &&
+  (!tp->undo_retrans || (flag & FLAG_PACKET_DELAYED));
 }
 
 /* People celebrate: "We love our President!" */
-static bool tcp_try_undo_recovery(struct sock *sk)
+static bool tcp_try_undo_recovery(struct sock *sk, const int flag)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (tcp_may_undo(tp)) {
+   if (tcp_may_undo(sk, flag)) {
int mib_idx;
 
/* Happy end! We did not retransmit anything
@@ -2393,11 +2406,11 @@ static bool tcp_try_undo_dsack(struct sock *sk)
 }
 
 /* Undo during loss recovery after partial ACK or using F-RTO. */
-static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
+static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo, const int flag)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (frto_undo || tcp_may_undo(tp)) {
+   if (frto_undo || tcp_may_undo(sk, flag)) {
tcp_undo_cwnd_reduction(sk, true);
 
DBGUNDO(sk, "partial loss");
@@ -2636,7 +2649,7 @@ static void tcp_process_loss(struct sock *sk, int flag, 
bool is_dupack,
bool recovered = !before(tp->snd_una, tp->high_seq);
 
if ((flag & FLAG_SND_UNA_ADVANCED) &&
-   tcp_try_undo_loss(sk, false))
+   tcp_try_undo_loss(sk, false, flag))
return;
 
if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
@@ -2649,7 +2662,7 @@ static void tcp_process_loss(struct sock *sk, int flag, 
bool is_dupack,
 */
if ((flag & FLAG_ORIG_SACK_ACKED) &&
(tcp_is_sack(tp) || !(flag & FLAG_RETRANS_DATA_ACKED)) &&
-   tcp_try_undo_loss(sk, true))
+   tcp_try_undo_loss(sk, true, flag))
return;
 
if (after(tp->snd_nxt, tp->high_seq)) {
@@ -2672,7 +2685,7 @@ static void tcp_process_loss(struct sock *sk, int flag, 
bool is_dupack,
 
if (recovered) {
/* F-RTO RFC5682 sec 3.1 step 2.a and 1st part of step 3.a */
-   tcp_try_undo_recovery(sk);
+   tcp_try_undo_recovery(sk, flag);
return;
}
if (tcp_is_reno

[PATCH net 3/5] tcp: move false FR condition into tcp_false_fast_retrans_possible()

2018-03-07 Thread Ilpo Järvinen
No functional changes.

Signed-off-by: Ilpo Järvinen 
---
 net/ipv4/tcp_input.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1a33752..e20f9ad 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2211,6 +2211,19 @@ static void tcp_update_scoreboard(struct sock *sk, int 
fast_rexmit)
}
 }
 
+/* False fast retransmits may occur when SACK is not in use under certain
+ * conditions (RFC6582). The sender MUST hold old state until something
+ * *above* high_seq is ACKed to prevent triggering such false fast
+ * retransmits. SACK TCP is safe.
+ */
+static bool tcp_false_fast_retrans_possible(const struct sock *sk,
+   const u32 snd_una)
+{
+   const struct tcp_sock *tp = tcp_sk(sk);
+
+   return ((snd_una == tp->high_seq) && tcp_is_reno(tp));
+}
+
 static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when)
 {
return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
@@ -2350,10 +2363,10 @@ static bool tcp_try_undo_recovery(struct sock *sk)
} else if (tp->rack.reo_wnd_persist) {
tp->rack.reo_wnd_persist--;
}
-   if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
-   /* Hold old state until something *above* high_seq
-* is ACKed. For Reno it is MUST to prevent false
-* fast retransmits (RFC2582). SACK TCP is safe. */
+   if (tcp_false_fast_retrans_possible(sk, tp->snd_una)) {
+   /* Hold old state until something *above* high_seq is ACKed
+* if false fast retransmit is possible.
+*/
if (!tcp_any_retrans_done(sk))
tp->retrans_stamp = 0;
return true;
-- 
2.7.4



[PATCH net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

2018-03-07 Thread Ilpo Järvinen
The pkts_acked is clearly not the correct value for the slow start
after RTO, as it may include segments that were not lost and
therefore did not need retransmission in the slow start following
the RTO. tcp_slow_start() will then add the excess into cwnd,
bloating it and triggering a burst.

Instead, we want to pass only the number of retransmitted segments
that were covered by the cumulative ACK (and potentially newly sent
data segments too if the cumulative ACK covers that far).
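The accounting change can be summarised by a small sketch (Python, illustrative only; the real code accumulates acked_pcount per skb while walking the retransmission queue):

```python
TCP_CA_Loss = "Loss"  # illustrative stand-in for the kernel CA state

def reno_pkts_acked(ca_state, pkts_acked, rexmit_acked, newdata_acked):
    """Value fed to the congestion-control module on a non-SACK flow.

    In TCP_CA_Loss a cumulative ACK may cover segments that were never
    lost and never retransmitted; counting those bloats cwnd during the
    post-RTO slow start.  Restrict the count to ACKed retransmissions
    plus ACKed new data sent after recovery started.
    """
    if ca_state == TCP_CA_Loss:
        return rexmit_acked + newdata_acked
    return pkts_acked
```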

Signed-off-by: Ilpo Järvinen 
---
 net/ipv4/tcp_input.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9a1b3c1..0305f6d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
long seq_rtt_us = -1L;
long ca_rtt_us = -1L;
u32 pkts_acked = 0;
+   u32 rexmit_acked = 0;
+   u32 newdata_acked = 0;
u32 last_in_flight = 0;
bool rtt_update;
int flag = 0;
@@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
}
 
if (unlikely(sacked & TCPCB_RETRANS)) {
-   if (sacked & TCPCB_SACKED_RETRANS)
+   if (sacked & TCPCB_SACKED_RETRANS) {
tp->retrans_out -= acked_pcount;
+   rexmit_acked += acked_pcount;
+   }
flag |= FLAG_RETRANS_DATA_ACKED;
} else if (!(sacked & TCPCB_SACKED_ACKED)) {
last_ackt = skb->skb_mstamp;
@@ -3068,8 +3072,11 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
last_in_flight = TCP_SKB_CB(skb)->tx.in_flight;
if (before(start_seq, reord))
reord = start_seq;
-   if (!after(scb->end_seq, tp->high_seq))
+   if (!after(scb->end_seq, tp->high_seq)) {
flag |= FLAG_ORIG_SACK_ACKED;
+   } else {
+   newdata_acked += acked_pcount;
+   }
}
 
if (sacked & TCPCB_SACKED_ACKED) {
@@ -3151,6 +3158,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
prior_fack,
}
 
if (tcp_is_reno(tp)) {
+   /* Due to discontinuity on RTO in the artificial
+* sacked_out calculations, TCP must restrict
+* pkts_acked without SACK to rexmits and new data
+* segments
+*/
+   if (icsk->icsk_ca_state == TCP_CA_Loss)
+   pkts_acked = rexmit_acked + newdata_acked;
+
tcp_remove_reno_sacks(sk, pkts_acked);
} else {
int delta;
-- 
2.7.4



[PATCH net-next 2/2] net: mvpp2: Add support for unicast filtering

2018-03-07 Thread Maxime Chevallier
The Marvell PPv2 controller can be used to implement packet filtering based
on the destination MAC address. This is already used to implement
multicast filtering. This patch adds support for unicast filtering.

Filtering is implemented with so-called "TCAM entries".
Due to their limited number and the fact that these are also used for
other purposes, we reserve 80 entries for both unicast and multicast
filters. On top of the broadcast address, and each interface's own MAC
address, we reserve 25 entries per port, 4 for unicast filters, 21 for
multicast.

Whenever the unicast or multicast range for one port is full, filtering
is disabled and the port goes into promiscuous mode for the given type
of addresses.

Signed-off-by: Maxime Chevallier 
---
 drivers/net/ethernet/marvell/mvpp2.c | 296 +++
 1 file changed, 161 insertions(+), 135 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
b/drivers/net/ethernet/marvell/mvpp2.c
index 7820906eca8a..84709b0a9903 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -601,6 +601,9 @@ enum mvpp2_tag_type {
 #define MVPP2_PRS_TCAM_PROTO_MASK  0xff
 #define MVPP2_PRS_TCAM_PROTO_MASK_L0x3f
 #define MVPP2_PRS_DBL_VLANS_MAX100
+#define MVPP2_PRS_CAST_MASKBIT(0)
+#define MVPP2_PRS_MCAST_VALBIT(0)
+#define MVPP2_PRS_UCAST_VAL0x0
 
 /* Tcam structure:
  * - lookup ID - 4 bits
@@ -624,6 +627,19 @@ enum mvpp2_tag_type {
 
 #define MVPP2_PRS_VID_TCAM_BYTE 2
 
+/* TCAM range for unicast and multicast filtering. We have 25 entries per port,
+ * with 4 dedicated to UC filtering and the rest to multicast filtering.
+ * Additionally we reserve one entry for the broadcast address, and one for
+ * each port's own address.
+ */
+#define MVPP2_PRS_MAC_UC_MC_FILT_MAX   25
+#define MVPP2_PRS_MAC_RANGE_SIZE   80
+
+/* Number of entries per port dedicated to UC and MC filtering */
+#define MVPP2_PRS_MAC_UC_FILT_MAX  4
+#define MVPP2_PRS_MAC_MC_FILT_MAX  (MVPP2_PRS_MAC_UC_MC_FILT_MAX - \
+MVPP2_PRS_MAC_UC_FILT_MAX)
+
 /* There is a TCAM range reserved for VLAN filtering entries, range size is 33
  * 10 VLAN ID filter entries per port
  * 1 default VLAN filter entry per port
@@ -639,36 +655,40 @@ enum mvpp2_tag_type {
 #define MVPP2_PE_DROP_ALL  0
 #define MVPP2_PE_FIRST_FREE_TID1
 
+/* MAC filtering range */
+#define MVPP2_PE_MAC_RANGE_END (MVPP2_PE_VID_FILT_RANGE_START - 1)
+#define MVPP2_PE_MAC_RANGE_START   (MVPP2_PE_MAC_RANGE_END - \
+   MVPP2_PRS_MAC_RANGE_SIZE + 1)
 /* VLAN filtering range */
 #define MVPP2_PE_VID_FILT_RANGE_END (MVPP2_PRS_TCAM_SRAM_SIZE - 31)
 #define MVPP2_PE_VID_FILT_RANGE_START   (MVPP2_PE_VID_FILT_RANGE_END - \
 MVPP2_PRS_VLAN_FILT_RANGE_SIZE + 1)
 #define MVPP2_PE_LAST_FREE_TID  (MVPP2_PE_VID_FILT_RANGE_START - 1)
 #define MVPP2_PE_IP6_EXT_PROTO_UN  (MVPP2_PRS_TCAM_SRAM_SIZE - 30)
-#define MVPP2_PE_MAC_MC_IP6(MVPP2_PRS_TCAM_SRAM_SIZE - 29)
-#define MVPP2_PE_IP6_ADDR_UN   (MVPP2_PRS_TCAM_SRAM_SIZE - 28)
-#define MVPP2_PE_IP4_ADDR_UN   (MVPP2_PRS_TCAM_SRAM_SIZE - 27)
-#define MVPP2_PE_LAST_DEFAULT_FLOW (MVPP2_PRS_TCAM_SRAM_SIZE - 26)
-#define MVPP2_PE_FIRST_DEFAULT_FLOW(MVPP2_PRS_TCAM_SRAM_SIZE - 21)
-#define MVPP2_PE_EDSA_TAGGED   (MVPP2_PRS_TCAM_SRAM_SIZE - 20)
-#define MVPP2_PE_EDSA_UNTAGGED (MVPP2_PRS_TCAM_SRAM_SIZE - 19)
-#define MVPP2_PE_DSA_TAGGED(MVPP2_PRS_TCAM_SRAM_SIZE - 18)
-#define MVPP2_PE_DSA_UNTAGGED  (MVPP2_PRS_TCAM_SRAM_SIZE - 17)
-#define MVPP2_PE_ETYPE_EDSA_TAGGED (MVPP2_PRS_TCAM_SRAM_SIZE - 16)
-#define MVPP2_PE_ETYPE_EDSA_UNTAGGED   (MVPP2_PRS_TCAM_SRAM_SIZE - 15)
-#define MVPP2_PE_ETYPE_DSA_TAGGED  (MVPP2_PRS_TCAM_SRAM_SIZE - 14)
-#define MVPP2_PE_ETYPE_DSA_UNTAGGED(MVPP2_PRS_TCAM_SRAM_SIZE - 13)
-#define MVPP2_PE_MH_DEFAULT(MVPP2_PRS_TCAM_SRAM_SIZE - 12)
-#define MVPP2_PE_DSA_DEFAULT   (MVPP2_PRS_TCAM_SRAM_SIZE - 11)
-#define MVPP2_PE_IP6_PROTO_UN  (MVPP2_PRS_TCAM_SRAM_SIZE - 10)
-#define MVPP2_PE_IP4_PROTO_UN  (MVPP2_PRS_TCAM_SRAM_SIZE - 9)
-#define MVPP2_PE_ETH_TYPE_UN   (MVPP2_PRS_TCAM_SRAM_SIZE - 8)
-#define MVPP2_PE_VID_FLTR_DEFAULT  (MVPP2_PRS_TCAM_SRAM_SIZE - 7)
-#define MVPP2_PE_VID_EDSA_FLTR_DEFAULT (MVPP2_PRS_TCAM_SRAM_SIZE - 6)
-#define MVPP2_PE_VLAN_DBL  (MVPP2_PRS_TCAM_SRAM_SIZE - 5)
-#define MVPP2_PE_VLAN_NONE (MVPP2_PRS_TCAM_SRAM_SIZE - 4)
-#define MVPP2_PE_MAC_MC_ALL(MVPP2_PRS_TCAM_SRAM_SIZE - 3)
-#define MVPP2_PE_MAC_PROMISCUOUS   (MVPP2_PRS_TCAM_SRAM_SIZE - 2)
+#define MVPP2_PE_IP6_ADDR_UN   (MVPP2_PRS_TCAM_SRAM_SIZE - 29)
+#define MVPP2_PE_IP4_ADDR_UN   (MVPP2_PRS_TCAM_SRAM_SIZE - 28)
+#defin

[PATCH net-next 1/2] net: mvpp2: Simplify MAC filtering function parameters

2018-03-07 Thread Maxime Chevallier
The mvpp2_prs_mac_da_accept function takes as parameters both the
struct representing the controller and the port id. This is meaningful
when we want to create TCAM entries for non-initialized ports, but in
this case we expect the port to be initialized before we start adding or
removing MAC addresses to the per-port filter.

This commit changes the function so that it takes struct mvpp2_port as
a parameter instead.

Signed-off-by: Maxime Chevallier 
---
 drivers/net/ethernet/marvell/mvpp2.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
b/drivers/net/ethernet/marvell/mvpp2.c
index ac0a0dc8f157..7820906eca8a 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -3817,16 +3817,17 @@ mvpp2_prs_mac_da_range_find(struct mvpp2 *priv, int 
pmap, const u8 *da,
 }
 
 /* Update parser's mac da entry */
-static int mvpp2_prs_mac_da_accept(struct mvpp2 *priv, int port,
-  const u8 *da, bool add)
+static int mvpp2_prs_mac_da_accept(struct mvpp2_port *port, const u8 *da,
+  bool add)
 {
-   struct mvpp2_prs_entry *pe;
-   unsigned int pmap, len, ri;
unsigned char mask[ETH_ALEN] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
+   struct mvpp2 *priv = port->priv;
+   unsigned int pmap, len, ri;
+   struct mvpp2_prs_entry *pe;
int tid;
 
	/* Scan TCAM and see if entry with this <da, port> already exist */
-   pe = mvpp2_prs_mac_da_range_find(priv, (1 << port), da, mask,
+   pe = mvpp2_prs_mac_da_range_find(priv, BIT(port->id), da, mask,
 MVPP2_PRS_UDF_MAC_DEF);
 
/* No such entry */
@@ -3861,7 +3862,7 @@ static int mvpp2_prs_mac_da_accept(struct mvpp2 *priv, 
int port,
}
 
/* Update port mask */
-   mvpp2_prs_tcam_port_set(pe, port, add);
+   mvpp2_prs_tcam_port_set(pe, port->id, add);
 
/* Invalidate the entry if no ports are left enabled */
pmap = mvpp2_prs_tcam_port_map_get(pe);
@@ -3917,13 +3918,12 @@ static int mvpp2_prs_update_mac_da(struct net_device 
*dev, const u8 *da)
int err;
 
/* Remove old parser entry */
-   err = mvpp2_prs_mac_da_accept(port->priv, port->id, dev->dev_addr,
- false);
+   err = mvpp2_prs_mac_da_accept(port, dev->dev_addr, false);
if (err)
return err;
 
/* Add new parser entry */
-   err = mvpp2_prs_mac_da_accept(port->priv, port->id, da, true);
+   err = mvpp2_prs_mac_da_accept(port, da, true);
if (err)
return err;
 
@@ -3959,7 +3959,8 @@ static void mvpp2_prs_mcast_del_all(struct mvpp2 *priv, 
int port)
 
if (is_multicast_ether_addr(da) && !is_broadcast_ether_addr(da))
/* Delete this entry */
-   mvpp2_prs_mac_da_accept(priv, port, da, false);
+   mvpp2_prs_mac_da_accept(priv->port_list[port], da,
+   false);
}
 }
 
@@ -7380,15 +7381,14 @@ static int mvpp2_open(struct net_device *dev)
0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
int err;
 
-   err = mvpp2_prs_mac_da_accept(port->priv, port->id, mac_bcast, true);
+   err = mvpp2_prs_mac_da_accept(port, mac_bcast, true);
if (err) {
netdev_err(dev, "mvpp2_prs_mac_da_accept BC failed\n");
return err;
}
-   err = mvpp2_prs_mac_da_accept(port->priv, port->id,
- dev->dev_addr, true);
+   err = mvpp2_prs_mac_da_accept(port, dev->dev_addr, true);
if (err) {
-   netdev_err(dev, "mvpp2_prs_mac_da_accept MC failed\n");
+   netdev_err(dev, "mvpp2_prs_mac_da_accept own addr failed\n");
return err;
}
err = mvpp2_prs_tag_mode_set(port->priv, port->id, MVPP2_TAG_TYPE_MH);
@@ -7520,7 +7520,7 @@ static void mvpp2_set_rx_mode(struct net_device *dev)
 
if (!allmulti) {
netdev_for_each_mc_addr(ha, dev) {
-   if (mvpp2_prs_mac_da_accept(priv, id, ha->addr, true)) {
+   if (mvpp2_prs_mac_da_accept(port, ha->addr, true)) {
allmulti = true;
goto retry;
}
-- 
2.11.0



[PATCH net-next 0/2] net: mvpp2: Add Unicast filtering capabilities

2018-03-07 Thread Maxime Chevallier
Hi all,

This series adds unicast filtering support to the Marvell PPv2 controller.

This is implemented using the header parser capabilities of the PPv2,
which allows for generic packet filtering based on matching patterns in
the packet headers.

The PPv2 controller only has 256 of these entries, and we need to share them
with other features, such as VLAN filtering.

For each interface, we have 5 entries dedicated to unicast filtering (the
controller's own address, and 4 other), and 21 to multicast filtering.

When this number is reached, the controller switches to unicast or
multicast promiscuous mode.

The first patch reworks the function that adds and removes addresses to the
filter. This is preparatory work to ease UC filter implementation.

The second patch adds the UC filtering feature.

Maxime Chevallier (2):
  net: mvpp2: Simplify MAC filtering function parameters
  net: mvpp2: Add support for unicast filtering

 drivers/net/ethernet/marvell/mvpp2.c | 320 +++
 1 file changed, 173 insertions(+), 147 deletions(-)

-- 
2.11.0



Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed

2018-03-07 Thread Eric Dumazet
On Wed, 2018-03-07 at 13:51 +, Yuval Mintz wrote:
> > > > > After unshare test kworker hangs for ages:
> > > > > 
> > > > > $ while :; do unshare -n true; done &
> > > > > 
> > > > > $ perf report 
> > > > > -   88,82% 0,00%  kworker/u16:0  [kernel.vmlinux]  [k
> > > > > ]
> > > > > cleanup_net
> > > > >  cleanup_net
> > > > >    - ops_exit_list.isra.9
> > > > >   - 85,68% igmp6_net_exit
> > > > >  - 53,31% sock_release
> > > > > - inet_release
> > > > >    - 25,33% rawv6_close
> > > > >   - ip6mr_sk_done
> > > > >  + 23,38% synchronize_rcu
> > > > > 
> > > > > Keep in mind, this perf report shows the time a function was
> > > > > executing; it does not show the time it was sleeping. But it's
> > > > > easy to imagine how much it is sleeping, if synchronize_rcu()
> > > > > execution takes most of the time. Top shows the kworker R time
> > > > > is less than 1%.
> > > > > 
> > > > > This happens because in ip6mr_sk_done() we do too many
> > > > > synchronize_rcu() calls, even for the sockets that are not
> > > > > referenced in mr_table and do not need it. So, the progress
> > > > > of the kworker becomes very slow.
> > > > > 
> > > > > The patch introduces an obvious solution: it makes
> > > > > ip6mr_sk_done() skip synchronize_rcu() for sockets that do
> > > > > not need it. After the patch, the kworker is able to warm the
> > > > > cpu up as expected.
> > > > > 
> > > > > Signed-off-by: Kirill Tkhai 
> > > > > ---
> > > > >  net/ipv6/ip6mr.c |4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
> > > > > index 2a38f9de45d3..290a8d0d5eac 100644
> > > > > --- a/net/ipv6/ip6mr.c
> > > > > +++ b/net/ipv6/ip6mr.c
> > > > > @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk)
> > > > >   }
> > > > >   }
> > > > >   rtnl_unlock();
> > > > > - synchronize_rcu();
> > > > > +
> > > > > + if (!err)
> > > > > + synchronize_rcu();
> > > > > 
> > > > 
> > > > 
> > > > But... what is this synchronize_rcu() doing exactly ?
> > > > 
> > > > This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c
> > > > 
> > > > ("ip6mr: Make mroute_sk rcu-based")
> > > > 
> > > > Typically on a delete, the synchronize_rcu() would be needed
> > > > before
> > > > freeing the deleted object.
> > > > 
> > > > But nowadays we have better way : SOCK_RCU_FREE
> > > 
> > > Hm. I agree with you. This is a hot path, and mroute sockets
> > > created from
> > 
> > userspace
> > > will delay userspace tasks close() and exit(). Since there may be
> > > many such
> > 
> > sockets,
> > > we may get a zombie task, which can't be reaped for ages. This
> > > slows down
> > 
> > the system
> > > directly.
> > > 
> > > Fix for pernet_operations works, but we need generic solution
> > > instead.
> > > 
> > > The commit "8571ab479a6e1ef46ead5ebee567e128a422767c" says:
> > > 
> > > ip6mr: Make mroute_sk rcu-based
> > > 
> > > In ipmr the mr_table socket is handled under RCU. Introduce
> > > the same
> > > for ip6mr.
> > > 
> > > It does not point to any improvement it brings, or to the
> > > problem it solves.
> > 
> > The description
> > > looks like a cleanup. It's too expensive a cleanup if it worsens
> > > the
> > 
> > performance a hundred
> > > times.
> > > 
> > > Can't we simply revert it?!
> > > 
> > > Yuval, do you have ideas to fix that (maybe, via SOCK_RCU_FREE
> > > suggested
> > 
> > by Eric)?
> 
> Sorry, failed to notice ip6mr_sk_done() is called unconditionally
> from
> rawv6_close(). But even with your suggested fix it should be
> ~resolved
> [as only sockets used for ip6mr would reach the sync]

Yes but some user setups could still hit this synchronize_rcu() quite
hard.

A new synchronize_rcu() should not be added in possibly critical paths
unless absolutely needed (if there really is no other way).

We have to be very careful here, since this kind of mistake can have a
100x impact.

> .
> Or are you claiming there's still some performance hit even with your
> suggested change?
> 
> > > 
> > > We actually use rcu_dereference() in ip6mr_cache_report() only.
> > > The only
> > 
> > user of dereference
> > > is sock_queue_rcv_skb(). Superficial analysis shows we purge the
> > > queue in
> > 
> > inet_sock_destruct().
> > 
> > + So this change should be safe.
> 
> I might have misunderstood the comment from
> commit 4c9687098f24 ("ipmr: RCU conversion of mroute_sk") when
> writing
> this; Thought comment regarding ip_ra_destroy() meant that for the
> IPv6 case
> we DO have to make sure there's a grace-period before destroying the
> socket.

Yeah but since 2010 things have changed a lot in the kernel ;)

And we now have SOCK_RCU_FREE for the few sockets that need an RCU grace
period.
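The tradeoff in this thread can be sketched with a toy cost model (purely illustrative, not kernel code; the grace-period latency is an assumed number): paying a synchronize_rcu() in every close() scales with the number of sockets, Kirill's patch only pays it for sockets the mr_table actually referenced, and SOCK_RCU_FREE defers the free via call_rcu() so close() never waits at all.

```python
GRACE_MS = 10  # assumed RCU grace-period latency, illustrative only

def close_cost_unconditional(n_sockets):
    # before the patch: every rawv6_close() pays a full grace period
    return n_sockets * GRACE_MS

def close_cost_patched(n_sockets, n_mroute):
    # after the patch: only sockets referenced from the mr_table pay it
    return n_mroute * GRACE_MS

def close_cost_sock_rcu_free(n_sockets):
    # SOCK_RCU_FREE model: the free is deferred behind call_rcu(),
    # so close() itself never blocks on a grace period
    return 0
```

With 100 plain raw sockets and one real mroute socket, the per-close cost drops from 100 grace periods to one, and to none with the deferred-free approach.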

[PATCH net-next 1/1] tools: tc-testing: Can pause just before post-suite

2018-03-07 Thread Brenda J. Butler
At debug level 5 or above, the test script will pause just before
the post_suite functions are called.  This allows the tester to
inspect the system before it is torn down.

Signed-off-by: Brenda J. Butler 
---
 tools/testing/selftests/tc-testing/tdc.py | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/testing/selftests/tc-testing/tdc.py 
b/tools/testing/selftests/tc-testing/tdc.py
index 241eea37e4a4..3f5b0a696df2 100755
--- a/tools/testing/selftests/tc-testing/tdc.py
+++ b/tools/testing/selftests/tc-testing/tdc.py
@@ -308,6 +308,7 @@ def test_runner(pm, args, filtered_tests):
 # if we failed in setup or teardown,
 # fill in the remaining tests with ok-skipped
 count = index
+
 if not args.notap:
 tap += 'about to flush the tap output if tests need to be skipped\n'
 if tcount + 1 != index:
@@ -318,6 +319,12 @@ def test_runner(pm, args, filtered_tests):
 count += 1
 
 tap += 'done flushing skipped test tap output\n'
+
+if args.verbose > 4:
+print('Want to pause\nPress enter to continue ...')
+if input(sys.stdin):
+print('got something on stdin')
+
 pm.call_post_suite(index)
 
 return tap
-- 
2.15.1



[PATCH net-next 1/1] tools: tc-testing: Can refer to $TESTID in test spec

2018-03-07 Thread Brenda J. Butler
At debug level 5 or above, the test script will pause just before
the post_suite functions are called.  This allows the tester to
inspect the system before it is torn down.

When processing the commands in the test cases, substitute
the test id for $TESTID.  This helps to make more flexible
tests.  For example, the testid can be given as a command
line argument.

As an example, if we wish to save the test output to a file
named for the test case, we can write in the test case:

"cmdUnderTest": "some test command | tee -a $TESTID.out"

Signed-off-by: Brenda J. Butler 
---
 tools/testing/selftests/tc-testing/tdc.py | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/tools/testing/selftests/tc-testing/tdc.py 
b/tools/testing/selftests/tc-testing/tdc.py
index 241eea37e4a4..c05b9f0f3db2 100755
--- a/tools/testing/selftests/tc-testing/tdc.py
+++ b/tools/testing/selftests/tc-testing/tdc.py
@@ -177,6 +177,7 @@ def prepare_env(args, pm, stage, prefix, cmdlist, output = 
None):
 '"{}" did not complete successfully'.format(prefix))
 
 def run_one_test(pm, args, index, tidx):
+global NAMES
 result = True
 tresult = ""
 tap = ""
@@ -184,6 +185,9 @@ def run_one_test(pm, args, index, tidx):
 print("\t\n=> ", end="")
 print("Test " + tidx["id"] + ": " + tidx["name"])
 
+# populate NAMES with TESTID for this test
+NAMES['TESTID'] = tidx['id']
+
 pm.call_pre_case(index, tidx['id'])
 prepare_env(args, pm, 'setup', "-> prepare stage", tidx["setup"])
 
@@ -227,6 +231,8 @@ def run_one_test(pm, args, index, tidx):
 
 index += 1
 
+# remove TESTID from NAMES
+del(NAMES['TESTID'])
 return tap
 
 def test_runner(pm, args, filtered_tests):
-- 
2.15.1
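The $TESTID expansion this patch enables can be sketched with Python's `string.Template` (a hypothetical stand-in for tdc's own NAMES substitution; the real tdc code may expand names differently, but the pattern of populating NAMES per test and deleting the key afterwards mirrors the diff above):

```python
from string import Template

NAMES = {}  # tdc keeps a NAMES mapping; TESTID is added per test case

def expand_cmd(cmd, testid):
    """Expand $TESTID (and any other $-names in NAMES) in a test command."""
    NAMES['TESTID'] = testid
    try:
        # safe_substitute leaves unrecognized $-names untouched
        return Template(cmd).safe_substitute(NAMES)
    finally:
        del NAMES['TESTID']  # mirrors the del(NAMES['TESTID']) in the patch
```

For example, `expand_cmd("some test command | tee -a $TESTID.out", "e9a3")` yields `"some test command | tee -a e9a3.out"`, letting the output file be named after the test case.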



[PATCH iproute2 1/1] tc: updated tc-bpf man page

2018-03-07 Thread Roman Mashak
Added description of direct-action parameter.

Signed-off-by: Roman Mashak 
---
 man/man8/tc-bpf.8 | 9 +
 1 file changed, 9 insertions(+)

diff --git a/man/man8/tc-bpf.8 b/man/man8/tc-bpf.8
index 2e9812ede028..d311f295615b 100644
--- a/man/man8/tc-bpf.8
+++ b/man/man8/tc-bpf.8
@@ -14,6 +14,10 @@ CLS_NAME ] [
 UDS_FILE ] [
 .B verbose
 ] [
+.B direct-action
+|
+.B da
+] [
 .B skip_hw
 |
 .B skip_sw
@@ -141,6 +145,11 @@ if set, it will dump the eBPF verifier output, even if 
loading the eBPF
 program was successful. By default, only on error, the verifier log is
 being emitted to the user.
 
+.SS direct-action | da
+instructs eBPF classifier to not invoke external TC actions, instead use the
+TC actions return codes (\fBTC_ACT_OK\fR, \fBTC_ACT_SHOT\fR etc.) for
+classifiers.
+
 .SS skip_hw | skip_sw
 hardware offload control flags. By default TC will try to offload
 filters to hardware if possible.
-- 
2.7.4



Re: [PATCH iproute2 1/1] tc: updated tc-bpf man page

2018-03-07 Thread Daniel Borkmann
On 03/07/2018 03:35 PM, Roman Mashak wrote:
> Added description of direct-action parameter.
> 
> Signed-off-by: Roman Mashak 

Yeah, man page is pretty outdated by now. Thanks!

Acked-by: Daniel Borkmann 


Re: [PATCH net-next 1/2] net: kalmia: clean up bind error path

2018-03-07 Thread Greg KH
On Wed, Mar 07, 2018 at 10:46:57AM +0100, Johan Hovold wrote:
> Drop bogus call to usb_driver_release_interface() from an error path in
> the usbnet bind() callback, which is called during interface probe. At
> this point the interface is not bound and usb_driver_release_interface()
> returns early.
> 
> Also remove the bogus call to clear the interface data, which is owned
> by the usbnet driver and would not even have been set by the time bind()
> is called.
> 
> Signed-off-by: Johan Hovold 

Reviewed-by: Greg Kroah-Hartman 


Re: [PATCH net-next 2/2] net: cdc_eem: clean up bind error path

2018-03-07 Thread Greg KH
On Wed, Mar 07, 2018 at 10:46:58AM +0100, Johan Hovold wrote:
> Drop bogus call to usb_driver_release_interface() from an error path in
> the usbnet bind() callback, which is called during interface probe. At
> this point the interface is not bound and usb_driver_release_interface()
> returns early.
> 
> Also remove the bogus call to clear the interface data, which is owned
> by the usbnet driver and would not even have been set by the time bind()
> is called.
> 
> Signed-off-by: Johan Hovold 

Reviewed-by: Greg Kroah-Hartman 


Re: [PATCH v9 crypto 08/12] chtls: Key program

2018-03-07 Thread Sabrina Dubroca
2018-03-06, 21:09:27 +0530, Atul Gupta wrote:
[snip]
> +static int chtls_set_tcb_field(struct sock *sk, u16 word, u64 mask, u64 val)
> +{
> + struct chtls_sock *csk = rcu_dereference_sk_user_data(sk);
> + struct sk_buff *skb;
> + struct cpl_set_tcb_field *req;
> + struct ulptx_idata *sc;
> + unsigned int wrlen = roundup(sizeof(*req) + sizeof(*sc), 16);
> + unsigned int credits_needed = DIV_ROUND_UP(wrlen, 16);
> +
> + skb = alloc_skb(wrlen, GFP_ATOMIC);

I haven't followed the whole call chain, but does this really need to
be an atomic allocation?
Should this skb be charged to the socket's memory accounting?

> + if (!skb)
> + return -ENOMEM;
> +
> + __set_tcb_field(sk, skb, word, mask, val, 0, 1);
> + set_queue(skb, (csk->txq_idx << 1) | CPL_PRIORITY_DATA, sk);
> + csk->wr_credits -= credits_needed;
> + csk->wr_unacked += credits_needed;
> + enqueue_wr(csk, skb);
> + cxgb4_ofld_send(csk->egress_dev, skb);

Should the return value be checked?

> + return 0;
> +}


[snip]
> +static void *chtls_alloc_mem(unsigned long size)
> +{
> + void *p = kmalloc(size, GFP_KERNEL);
> +
> + if (!p)
> + p = vmalloc(size);
> + if (p)
> + memset(p, 0, size);
> + return p;
> +}

I think you can replace this with kvzalloc().

> +static void chtls_free_mem(void *addr)
> +{
> + unsigned long p = (unsigned long)addr;
> +
> + if (p >= VMALLOC_START && p < VMALLOC_END)
> + vfree(addr);
> + else
> + kfree(addr);
> +}

Use kvfree().

> +/* TLS Key bitmap processing */
> +int chtls_init_kmap(struct chtls_dev *cdev, struct cxgb4_lld_info *lldi)
> +{
> + unsigned int num_key_ctx, bsize;
> +
> + num_key_ctx = (lldi->vr->key.size / TLS_KEY_CONTEXT_SZ);
> + bsize = BITS_TO_LONGS(num_key_ctx);
> +
> + cdev->kmap.size = num_key_ctx;
> + cdev->kmap.available = bsize;
> + cdev->kmap.addr = chtls_alloc_mem(sizeof(*cdev->kmap.addr) *
> +   bsize);
> + if (!cdev->kmap.addr)
> + return -1;

The return value of this function is never checked.

> +
> + cdev->kmap.start = lldi->vr->key.start;
> + spin_lock_init(&cdev->kmap.lock);
> + return 0;
> +}
> +
> +void chtls_free_kmap(struct chtls_dev *cdev)
> +{
> + if (cdev->kmap.addr)
> + chtls_free_mem(cdev->kmap.addr);
> +}

I think this wrapper function is not necessary.

> +static int get_new_keyid(struct chtls_sock *csk, u32 optname)
> +{
> + struct chtls_dev *cdev = csk->cdev;
> + struct chtls_hws *hws = &csk->tlshws;
> + struct net_device *dev = csk->egress_dev;
> + struct adapter *adap = netdev2adap(dev);
> + int keyid;
> +
> + spin_lock_bh(&cdev->kmap.lock);
> + keyid = find_first_zero_bit(cdev->kmap.addr, cdev->kmap.size);
> + if (keyid < cdev->kmap.size) {
> + __set_bit(keyid, cdev->kmap.addr);
> + if (optname == TLS_RX)

TLS_RX only gets defined in patch 11, so the only reason you're not
breaking the build is because all these new files aren't getting
compiled until patch 12.

> + hws->rxkey = keyid;
> + else
> + hws->txkey = keyid;
> + atomic_inc(&adap->chcr_stats.tls_key);
> + } else {
> + keyid = -1;
> + }
> + spin_unlock_bh(&cdev->kmap.lock);
> + return keyid;
> +}
> +

[snip]
> +int chtls_setkey(struct chtls_sock *csk, u32 keylen, u32 optname)
> +{
...
>+  skb = alloc_skb(len, GFP_KERNEL);
>+  if (!skb)
>+  return -ENOMEM;

This return code is also ignored by its caller.  Please review the
whole patchset for that problem.

Same question as before, does this need to be accounted?


> + /* ulptx command */
> + kwr->req.cmd = cpu_to_be32(ULPTX_CMD_V(ULP_TX_MEM_WRITE) |
> + T5_ULP_MEMIO_ORDER_V(1) |
> + T5_ULP_MEMIO_IMM_V(1));
> + kwr->req.len16 = cpu_to_be32((csk->tid << 8) |
> +   DIV_ROUND_UP(len - sizeof(kwr->wr), 16));
> + kwr->req.dlen = cpu_to_be32(ULP_MEMIO_DATA_LEN_V(klen >> 5));
> + kwr->req.lock_addr = cpu_to_be32(ULP_MEMIO_ADDR_V(keyid_to_addr
> + (cdev->kmap.start, keyid)));

Breaking this line in that way makes it really hard to read for
humans.

-- 
Sabrina
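The key-id bitmap that get_new_keyid() walks can be modeled in a few lines. This is a toy sketch of the find_first_zero_bit()/__set_bit() semantics only, ignoring the kmap.lock locking and the TLS_RX/TLS_TX bookkeeping:

```python
def get_new_keyid(kmap):
    # find the first clear bit, set it, return its index; -1 when full,
    # matching the keyid = -1 path in the kernel code above
    for keyid in range(kmap['size']):
        if not kmap['bits'][keyid]:
            kmap['bits'][keyid] = True
            return keyid
    return -1
```

A map of size N hands out ids 0..N-1 in order and then reports exhaustion.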


Re: Use of Indirect function calls

2018-03-07 Thread David Miller
From: Rao Shoaib 
Date: Tue, 6 Mar 2018 21:53:19 -0800

> Also while I have your attention can I ask your opinion about
> breaking up some TCP functions, mostly control functions into
> smaller units so that if a little different behavior is desired it
> can be achieved and common code could still be shared. Of course you
> can not say much without looking at the code but will you even
> entertain such a change ?

You answered your own question, I'd have to see what the change
looks like to make a determination.


Re: [PATCH net-next] net: Make account struct net to memcg

2018-03-07 Thread David Miller
From: Kirill Tkhai 
Date: Thu, 01 Mar 2018 15:23:28 +0300

> The patch adds SLAB_ACCOUNT to flags of net_cachep cache,
> which enables accounting of struct net memory to memcg kmem.
> Since the number of net namespaces may be significant, users
> want to know how much memory was consumed, and to control it.
> 
> Note that we do not account net_generic to the same memcg
> where net was accounted; moreover, we don't do this at all (*).
> We do not want the situation where a single memcg memory deficit
> prevents us from registering new pernet_operations.
> 
> (*) Even though there is already !current process accounting
> available in linux-next. See kmalloc_memcg() there for the details.
> 
> Signed-off-by: Kirill Tkhai 

Applied, thank you.


Re: [PATCH v2] dt-bindings: net: renesas-ravb: Make stream buffer optional

2018-03-07 Thread David Miller
From: Geert Uytterhoeven 
Date: Fri,  2 Mar 2018 16:01:48 +0100

> The Stream Buffer for EtherAVB-IF (STBE) is an optional component, and
> is not present on all SoCs.
> 
> Document this in the DT bindings, including a list of SoCs that do have
> it.
> 
> Fixes: 785ec87483d1e24a ("ravb: document R8A77970 bindings")
> Fixes: f231c4178a655b09 ("dt-bindings: net: renesas-ravb: Add support for 
> R8A77995 RAVB")
> Signed-off-by: Geert Uytterhoeven 
> Reviewed-by: Simon Horman 
> Acked-by: Sergei Shtylyov 
> Reviewed-by: Rob Herring 
> ---
> v2:
>   - Add Reviewed-by, Acked-by,
>   - Add R-Car M3-N.

Applied, thanks Geert.


Re: [PATCH RESEND net-next 0/2] ntuple filters with RSS

2018-03-07 Thread David Miller
From: Edward Cree 
Date: Fri, 2 Mar 2018 16:01:47 +

> On 01/03/18 18:36, David Miller wrote:
>> We really should have the ethtool interfaces under deep freeze until we
>> convert it to netlink or similar.
>> Second, this is a real hackish way to extend ethtool with new
>> semantics.  A structure changes layout based upon a flag bit setting
>> in an earlier member?  Yikes...
> Yeah, while I'm reasonably confident it's ABI-compatible (presence of that
>  flag in the past should always have led to drivers complaining they didn't
>  recognise it), and it is somewhat similar to the existing FLOW_EXT flag,
>  it is indeed rather ugly.  This is the only way I could see to do it
>  without adding a whole new command number, which I felt might also be
>  contentious (see: deep freeze) but is probably a better approach.
> 
>> Lastly, there has been feedback asking how practical and useful this
>> facility actually is, and you must address that.
> According to our marketing folks, there is end-user demand for this feature
>  or something like it.  I didn't see any arguments why this isn't useful,
>  just that other things might be useful too.  (Also, sorry it took me so
>  long to address their feedback, but I had to do a bit of background
>  reading before I could understand what Jakub was suggesting.)

Ok.

Since nobody is really working on the ethtool --> devlink/netlink conversion,
it really isn't reasonable for me to block useful changes like yours.

So please resubmit this series and I will apply it.

Thanks.


Re: [PATCH net-next 2/4] net: stmmac: use correct barrier between coherent memory and MMIO

2018-03-07 Thread David Miller
From: Niklas Cassel 
Date: Sat, 3 Mar 2018 00:28:53 +0100

> However, the last write we do is "DMA start transmission",
> this is a register in the IP, i.e. it is a write to the cache
> incoherent MMIO region (rather than a write to cache coherent memory).
> To ensure that all writes to cache coherent memory have
> completed before we start the DMA, we have to use the barrier
> wmb() (which performs a more extensive flush compared to
> dma_wmb()).

There is an implicit memory barrier between physical memory writes
and those to MMIO register space.

So as long as you place the dma_wmb() to ensure the correct
ordering within the descriptor words, you need nothing else
after the last descriptor word write.


Re: [PATCH v2] net: ethernet: Drop unnecessary continue

2018-03-07 Thread David Miller
From: Arushi Singhal 
Date: Sat, 3 Mar 2018 21:44:56 +0530

> Continue statements at the bottom of a loop are removed.
> Issue found using drop_continue.cocci Coccinelle script.
> 
> Signed-off-by: Arushi Singhal 
> ---
> Changes in v2:
> - Braces are dropped from if with a single statement.

Sorry there is no way I am applying this.

Just blindly removing continue statements that some static checker
or compiler warning mentions is completely unacceptable.

Actually _LOOK_ at this code:

>   if(cards[i].vendor_id) {
>   for(j=0;j<3;j++)
> - if(inb(ioaddr+cards[i].addr_offset+j) != 
> cards[i].vendor_id[j]) {
> + if(inb(ioaddr+cards[i].addr_offset+j) != 
> cards[i].vendor_id[j])
>   release_region(ioaddr, 
> cards[i].total_size);
> - continue;
> -   }
>   }

The extraneous continue statement is the _least_ of the problems here.

This code, if the if() statement triggers, will release the ioaddr
region and then _CONTINUE_ the for(j) loop, and try to access the
registers using the same 'ioaddr' again.

That's actually a _REAL_ bug, and you're making it seem like
the behavior is intentional by editing it like this.

And the bug probably exists because this entire sequence has
misaligned closing curly braces.  It makes it look like
the continue is for the outer loop, which would be fine.

But it's not.  It's for the inner loop, and this causes the "use
ioaddr after releasing it" bug.

Please do not submit patches like this one, it makes for a lot of
wasted auditing and review for people.  The onus is on you to read
and understand the code you are editing before submitting your
changes.

Thank you.
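The bug Dave points out can be reduced to a small model: `continue` inside the inner for(j) loop only skips to the next register byte, so after release_region() the loop keeps touching ioaddr. A sketch (illustrative Python, not the driver code; "touching ioaddr" is counted instead of performing real I/O):

```python
def probe_buggy(regs, vendor):
    released = False
    touched_after_release = 0
    for r, v in zip(regs, vendor):
        if released:
            touched_after_release += 1   # reading ioaddr after release: the bug
        if r != v:
            released = True              # release_region(ioaddr, ...)
            continue                     # continues the INNER loop only
    return touched_after_release

def probe_fixed(regs, vendor):
    released = False
    touched_after_release = 0
    for r, v in zip(regs, vendor):
        if released:
            touched_after_release += 1
        if r != v:
            released = True
            break                        # stop using ioaddr for this card
    return touched_after_release
```

With a mismatch in the middle of the vendor-id bytes, the buggy version accesses the region once after releasing it; breaking out of the inner loop (and moving on to the next card in the outer loop) avoids that.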



Re: [PATCH RESEND net-next 0/2] ntuple filters with RSS

2018-03-07 Thread Edward Cree
On 07/03/18 15:24, David Miller wrote:
> Ok.
>
> Since nobody is really working on the ethtool --> devlink/netlink conversion,
> it really isn't reasonable for me to block useful changes like yours.
>
> So please resubmit this series and I will apply it.
>
> Thanks.
Ok, thanks.  Should I stick with the hackish union-and-flag-bit, or define a
 new ethtool command number for the extended command?


Re: [PATCH] ravb: remove erroneous comment

2018-03-07 Thread David Miller
From: Niklas Söderlund 
Date: Sat,  3 Mar 2018 23:39:54 +0100

> When addressing a review comment in an early version of the offending
> patch, a comment was left in which should have been removed. Remove the
> comment to keep it consistent with the code.
> 
> Fixes: 75efa06f457bbed3 ("ravb: add support for changing MTU")
> Reported-by: Sergei Shtylyov 
> Signed-off-by: Niklas Söderlund 

Applied to net-next, thanks.


Re: [PATCH net v3 0/2] rhashtable: Fix rhltable duplicates insertion

2018-03-07 Thread David Miller
From: Paul Blakey 
Date: Sun,  4 Mar 2018 17:29:47 +0200

> On our mlx5 driver fs_core.c, we use the rhltable interface to store
> flow groups. We noticed that sometimes we get a warning that flow group isn't
> found at removal. This rare case was caused when a specific scenario happened,
> insertion of a flow group with a similar match criteria (a duplicate),
> but only where the flow group rhash_head was second (or not first)
> on the relevant rhashtable bucket list.
> 
> The first patch fixes it, and the second one adds a test that show
> it is now working.

Series applied, thank you.
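The symptom in the cover letter can be modeled with a single hash bucket that chains duplicates: a lookup that only compares the head of the chain misses a duplicate that was inserted second, so its removal "fails". This is a toy illustration of the rhltable behavior being fixed, not the actual rhashtable code:

```python
bucket = []  # one hash bucket: entries with identical match criteria chain here

def insert(entry):
    bucket.append(entry)

def remove_buggy(entry):
    # only considers the head of the bucket list
    if bucket and bucket[0] is entry:
        bucket.pop(0)
        return True
    return False  # "flow group isn't found at removal"

def remove_fixed(entry):
    # walks the whole duplicate chain
    for i, e in enumerate(bucket):
        if e is entry:
            del bucket[i]
            return True
    return False
```

The rare warning only fired when the entry happened to be second (or later) on the bucket list, which is exactly the case the head-only comparison misses.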


Re: [PATCH] fsl/fman: avoid sleeping in atomic context while adding an address

2018-03-07 Thread David Miller
From: Denis Kirjanov 
Date: Sun,  4 Mar 2018 21:48:17 +0300

> __dev_mc_add grabs an address spinlock so use
> atomic context in kmalloc.
 ...
> Signed-off-by: Denis Kirjanov 

Applied, thank you.


Re: [PATCH iproute2 net-next v3] iprule: support for ip_proto, sport and dport match options

2018-03-07 Thread Roopa Prabhu
On Tue, Mar 6, 2018 at 6:44 PM, Stephen Hemminger
 wrote:
> On Tue,  6 Mar 2018 18:07:59 -0800
> Roopa Prabhu  wrote:
>
>> + if (tb[FRA_IP_PROTO]) {
>> + SPRINT_BUF(pbuf);
>> + fprintf(fp, "ip_proto %s ",
>> + inet_proto_n2a(rta_getattr_u8(tb[FRA_IP_PROTO]), pbuf,
>> +sizeof(pbuf)));
>> + }
>> +
>> + if (tb[FRA_SPORT_RANGE]) {
>> + struct fib_rule_port_range *r = RTA_DATA(tb[FRA_SPORT_RANGE]);
>> +
>> + if (r->start == r->end)
>> + fprintf(fp, "sport %hu ", r->start);
>> + else
>> + fprintf(fp, "sport %hu-%hu ", r->start, r->end);
>> + }
>> +
>> + if (tb[FRA_DPORT_RANGE]) {
>> + struct fib_rule_port_range *r = RTA_DATA(tb[FRA_DPORT_RANGE]);
>> +
>> + if (r->start == r->end)
>> + fprintf(fp, "dport %hu ", r->start);
>> + else
>> + fprintf(fp, "dport %hu-%hu ", r->start, r->end);
>> + }
>> +
>
> in net-next this is all JSON now.

ok i will rebase, thanks stephen
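The sport/dport printing convention in the quoted hunks (a single port when start == end, a "start-end" range otherwise) can be stated as a small helper. Sketch only; the iproute2 code prints with fprintf and %hu:

```python
def fmt_port_range(kind, start, end):
    """Format a fib_rule_port_range the way the patch prints it."""
    if start == end:
        return f"{kind} {start} "
    return f"{kind} {start}-{end} "
```

So a rule matching a single source port prints as `sport 80 `, while a range prints as `dport 1000-2000 `.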


Re: pull request: bluetooth 2018-03-05

2018-03-07 Thread David Miller
From: Johan Hedberg 
Date: Mon, 5 Mar 2018 14:29:12 +0200

> Here are a few more Bluetooth fixes for the 4.16 kernel:
> 
>  - btusb: reset/resume fixes for Yoga 920 and Dell OptiPlex 3060
>  - Fix for missing encryption refresh with the Security Manager protocol
> 
> Please let me know if there are any issues pulling. Thanks.

Pulled, thanks.


Re: [PATCH v2] KEYS: DNS: limit the length of option strings

2018-03-07 Thread David Howells
Eric Biggers  wrote:

> Fix it by limiting option strings (combined name + value) to a much more
> reasonable 128 bytes.  The exact limit is arbitrary, but currently the
> only recognized option is formatted as "dnserror=%lu" which fits well
> within this limit.

There will be more options coming ("ipv4", "ipv6") but they shouldn't overrun
this limit and we can always extend the limit if need be.

David
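The limit under discussion is simple to state: the combined "name=value" option string must fit within 128 bytes. A sketch of the check (illustrative only; the actual kernel validation lives in the DNS resolver key parsing):

```python
MAX_OPT_LEN = 128  # combined "name=value" limit from the patch

def option_len_ok(opt):
    # reject empty options and anything over the limit
    return 0 < len(opt) <= MAX_OPT_LEN
```

The currently recognized option, formatted like `dnserror=22`, fits with plenty of headroom, as would short future options such as `ipv4` or `ipv6`.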


Re: [PATCH iproute2-next 2/3] ipmroute: don't complain about unicast routes

2018-03-07 Thread Stephen Hemminger
On Wed, 7 Mar 2018 11:43:33 +0300
Sergei Shtylyov  wrote:

> Hello!
> 
> On 3/7/2018 4:03 AM, Stephen Hemminger wrote:
> 
> > From: Stephen Hemminger 
> > 
> > Every non-multicast route prints an error message.
> > Kernel doesn't filter out unicast routes, it is up to filter function
> > to do this.
> > 
> > Signed-off-by: Stephen Hemminger 
> > ---
> >   ip/ipmroute.c | 7 +++
> >   1 file changed, 3 insertions(+), 4 deletions(-)
> > 
> > diff --git a/ip/ipmroute.c b/ip/ipmroute.c
> > index aa5029b44f41..03ca0575e571 100644
> > --- a/ip/ipmroute.c
> > +++ b/ip/ipmroute.c
> > @@ -75,15 +75,14 @@ int print_mroute(const struct sockaddr_nl *who, struct 
> > nlmsghdr *n, void *arg)
> > fprintf(stderr, "BUG: wrong nlmsg len %d\n", len);
> > return -1;
> > }
> > -   if (r->rtm_type != RTN_MULTICAST) {
> > -   fprintf(stderr, "Not a multicast route (type: %s)\n",
> > -   rtnl_rtntype_n2a(r->rtm_type, b1, sizeof(b1)));
> > +
> > +   if (r->rtm_type != RTN_MULTICAST)
> > return 0;
> > -   }
> >   
> > parse_rtattr(tb, RTA_MAX, RTM_RTA(r), len);
> > table = rtm_get_table(r, tb);
> >   
> > +  
> 
> Why?
> 
> > if (filter.tb > 0 && filter.tb != table)
> > return 0;
> > 
> 
> MBR, Sergei

The kernel dumps all routes in response to the RTM_GETROUTE and therefore all 
routes show
up in print_mroute.  On my system (which has no mcast), I get this with 
standard 4.14
kernel and ip commands.

$ ip mroute
Not a multicast route (type: unicast)
Not a multicast route (type: unicast)
Not a multicast route (type: unicast)
Not a multicast route (type: unicast)
Not a multicast route (type: unicast)
Not a multicast route (type: broadcast)
Not a multicast route (type: local)
Not a multicast route (type: local)
Not a multicast route (type: broadcast)
Not a multicast route (type: broadcast)
Not a multicast route (type: local)



Re: [PATCH net 3/5] tcp: move false FR condition into tcp_false_fast_retrans_possible()

2018-03-07 Thread Eric Dumazet
On Wed, 2018-03-07 at 14:59 +0200, Ilpo Järvinen wrote:
> No functional changes.
> 
> Signed-off-by: Ilpo Järvinen 
> ---
>  net/ipv4/tcp_input.c | 21 +
>  1 file changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 1a33752..e20f9ad 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2211,6 +2211,19 @@ static void tcp_update_scoreboard(struct sock
> *sk, int fast_rexmit)
>   }
>  }
>  
> +/* False fast retransmits may occur when SACK is not in use under
> certain
> + * conditions (RFC6582). The sender MUST hold old state until
> something
> + * *above* high_seq is ACKed to prevent triggering such false fast
> + * retransmits. SACK TCP is safe.
> + */
> +static bool tcp_false_fast_retrans_possible(const struct sock *sk,
> + const u32 snd_una)
> +{
> + const struct tcp_sock *tp = tcp_sk(sk);
> +
> + return ((snd_una == tp->high_seq) && tcp_is_reno(tp));

return (EXPR);

should use instead :

return EXPR;

> +}
> +
>  static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32
> when)
>  {
>   return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
> @@ -2350,10 +2363,10 @@ static bool tcp_try_undo_recovery(struct sock
> *sk)
>   } else if (tp->rack.reo_wnd_persist) {
>   tp->rack.reo_wnd_persist--;
>   }
> - if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
> - /* Hold old state until something *above* high_seq
> -  * is ACKed. For Reno it is MUST to prevent false
> -  * fast retransmits (RFC2582). SACK TCP is safe. */
> + if (tcp_false_fast_retrans_possible(sk, tp->snd_una)) {
> + /* Hold old state until something *above* high_seq
> is ACKed
> +  * if false fast retransmit is possible.
> +  */
>   if (!tcp_any_retrans_done(sk))
>   tp->retrans_stamp = 0;
>   return true;
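The helper being added encodes an RFC 6582 rule: without SACK, the sender must hold old state until something above high_seq is ACKed. As a standalone predicate it reads as follows (illustrative model of the patch's logic, with Eric's `return EXPR;` style note applied):

```python
def false_fast_retrans_possible(snd_una, high_seq, sack_enabled):
    # Reno (non-SACK) must hold old state at snd_una == high_seq to avoid
    # triggering false fast retransmits; SACK connections are safe
    return snd_una == high_seq and not sack_enabled
```

Only the non-SACK case at exactly snd_una == high_seq is dangerous; advancing past high_seq, or running with SACK, makes the condition false.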





Re: [PATCH iproute2-next 2/3] ipmroute: don't complain about unicast routes

2018-03-07 Thread Sergei Shtylyov
On 03/07/2018 06:54 PM, Stephen Hemminger wrote:

>> Hello!
>>
>> On 3/7/2018 4:03 AM, Stephen Hemminger wrote:
>>
>>> From: Stephen Hemminger 
>>>
>>> Every non-multicast route prints an error message.
>>> Kernel doesn't filter out unicast routes, it is up to filter function
>>> to do this.
>>>
>>> Signed-off-by: Stephen Hemminger 
>>> ---
>>>   ip/ipmroute.c | 7 +++
>>>   1 file changed, 3 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/ip/ipmroute.c b/ip/ipmroute.c
>>> index aa5029b44f41..03ca0575e571 100644
>>> --- a/ip/ipmroute.c
>>> +++ b/ip/ipmroute.c
>>> @@ -75,15 +75,14 @@ int print_mroute(const struct sockaddr_nl *who, struct 
>>> nlmsghdr *n, void *arg)
>>> fprintf(stderr, "BUG: wrong nlmsg len %d\n", len);
>>> return -1;
>>> }
>>> -   if (r->rtm_type != RTN_MULTICAST) {
>>> -   fprintf(stderr, "Not a multicast route (type: %s)\n",
>>> -   rtnl_rtntype_n2a(r->rtm_type, b1, sizeof(b1)));
>>> +
>>> +   if (r->rtm_type != RTN_MULTICAST)
>>> return 0;
>>> -   }
>>>   
>>> parse_rtattr(tb, RTA_MAX, RTM_RTA(r), len);
>>> table = rtm_get_table(r, tb);
>>>   
>>> +  
>>
>> Why?
>>
>>> if (filter.tb > 0 && filter.tb != table)
>>> return 0;
>>> 
>>
>> MBR, Sergei
> 
> The kernel dumps all routes in response to the RTM_GETROUTE and therefore all 
> routes show
> up in print_mroute.  On my system (which has no mcast), I get this with 
> standard 4.14
> kernel and ip commands.

   I was merely asking about a double new line... :-)

MBR, Sergei


Re: [PATCH net-next 0/3] sctp: add support for some msg_control options from RFC6458

2018-03-07 Thread David Miller
From: Xin Long 
Date: Mon,  5 Mar 2018 20:44:17 +0800

> This patchset is to add support for 3 msg_control options described
> in RFC6458:
> 
> 5.3.7.  SCTP PR-SCTP Information Structure (SCTP_PRINFO)
> 5.3.9.  SCTP Destination IPv4 Address Structure (SCTP_DSTADDRV4)
> 5.3.10. SCTP Destination IPv6 Address Structure (SCTP_DSTADDRV6)
> 
> one send flag described in RFC6458:
> 
> SCTP_SENDALL:  This flag, if set, will cause a one-to-many
> style socket to send the message to all associations that
> are currently established on this socket.  For the one-to-
> one style socket, this flag has no effect.
> 
> Note there is another msg_control option:
> 
> 5.3.8.  SCTP AUTH Information Structure (SCTP_AUTHINFO)
> 
> It's a little complicated, I will post it in another patchset after
> this.

Series applied, thanks Xin.


Re: [PATCH net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows

2018-03-07 Thread Eric Dumazet
On Wed, 2018-03-07 at 14:59 +0200, Ilpo Järvinen wrote:
> Currently, the TCP code is overly eager to update window on
> every ACK. It makes some of the ACKs that the receiver should
> send as dupACKs look like window updates, which are not
> considered real dupACKs by the non-SACK sender-side code.
> 
> Make sure that when an ofo segment is received, no change to
> window is applied if we are going to send a dupACK. It's ok
> to change the window for non-dupACKs (such as the first ACK
> after ofo arrivals start if that ACK was using delayed ACKs).

This looks dangerous to me.

We certainly want to lower the window at some point, or risk expensive
pruning and/or drops.

It is not clear by reading your changelog/patch, but it looks like some
malicious peers could hurt us.

By current standards, non TCP sack flows are not worth any potential
issues.
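For reference, the dupACK test that non-SACK senders apply (per RFC 5681) is what the receiver-side change above tries to satisfy: an ACK counts as a duplicate only if, among other things, it neither advances snd_una nor changes the advertised window. A minimal model (simplified; the real check also requires outstanding data and no SYN/FIN):

```python
def is_dupack(ack_seq, adv_win, snd_una, last_win, carries_data):
    # simplified RFC 5681 duplicate-ACK criteria: no payload, no
    # advancement of snd_una, and an unchanged advertised window
    return (not carries_data and
            ack_seq == snd_una and
            adv_win == last_win)
```

An otherwise-duplicate ACK that shrinks or grows the window fails this test, which is why the patch locks the advertised window on ofo arrivals; Eric's concern is that refusing to shrink the window in that state has its own costs.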



Re: [PATCH iproute2-next 2/3] ipmroute: don't complain about unicast routes

2018-03-07 Thread Stephen Hemminger
On Tue,  6 Mar 2018 17:03:54 -0800
Stephen Hemminger  wrote:

> From: Stephen Hemminger 
> 
> Every non-multicast route prints an error message.
> Kernel doesn't filter out unicast routes, it is up to filter function
> to do this.
> 
> Signed-off-by: Stephen Hemminger 
> ---
>  ip/ipmroute.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/ip/ipmroute.c b/ip/ipmroute.c
> index aa5029b44f41..03ca0575e571 100644
> --- a/ip/ipmroute.c
> +++ b/ip/ipmroute.c
> @@ -75,15 +75,14 @@ int print_mroute(const struct sockaddr_nl *who, struct 
> nlmsghdr *n, void *arg)
>   fprintf(stderr, "BUG: wrong nlmsg len %d\n", len);
>   return -1;
>   }
> - if (r->rtm_type != RTN_MULTICAST) {
> - fprintf(stderr, "Not a multicast route (type: %s)\n",
> - rtnl_rtntype_n2a(r->rtm_type, b1, sizeof(b1)));
> +
> + if (r->rtm_type != RTN_MULTICAST)
>   return 0;
> - }
>  
>   parse_rtattr(tb, RTA_MAX, RTM_RTA(r), len);
>   table = rtm_get_table(r, tb);
>  
> +
>   if (filter.tb > 0 && filter.tb != table)
>   return 0;
>  

Actually, this is a recent kernel regression. It should be fixed there.


Re: [PATCH net v3] sch_netem: fix skb leak in netem_enqueue()

2018-03-07 Thread David Miller
From: Alexey Kodanev 
Date: Mon,  5 Mar 2018 20:52:54 +0300

> When we exceed current packets limit and we have more than one
> segment in the list returned by skb_gso_segment(), netem drops
> only the first one, skipping the rest, hence kmemleak reports:
 ...
> Fix it by adding the rest of the segments, if any, to skb 'to_free'
> list. Add new __qdisc_drop_all() and qdisc_drop_all() functions
> because they can be useful in the future if we need to drop segmented
> GSO packets in other places.
> 
> Fixes: 6071bd1aa13e ("netem: Segment GSO packets on enqueue")
> Signed-off-by: Alexey Kodanev 
> ---
> 
> v3: use skb->prev to find the tail of the list. skb->prev can be NULL
> for not segmented skbs, so check it too.
> 
> v2: add new __qdisc_drop_all() and qdisc_drop_all(), and use
> qdisc_drop_all() in sch_netem.

Applied and queued up for -stable, thanks!
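The core of the fix is that __qdisc_drop_all() splices the entire segment list onto the to_free list in one step, using skb->prev to reach the tail (per the v3 note, prev can be NULL for an unsegmented skb). A user-space sketch of that splice, with a simplified sk_buff stand-in (field names mirror the kernel, but this is a model, not kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct sk_buff: segments are singly linked
 * via ->next, and the head's ->prev points at the tail of the segment
 * list (NULL for an unsegmented skb). */
struct skb {
    struct skb *next;
    struct skb *prev;
};

/* Splice the whole segment list headed by skb onto *to_free, so every
 * segment gets freed, not just the first one (the leak being fixed). */
static void drop_all(struct skb *skb, struct skb **to_free)
{
    if (skb->prev)                    /* segmented: tail is known */
        skb->prev->next = *to_free;
    else                              /* single, unsegmented skb */
        skb->next = *to_free;
    *to_free = skb;
}

static int list_len(const struct skb *head)
{
    int n = 0;
    for (; head; head = head->next)
        n++;
    return n;
}
```

Compare with dropping only the head: the remaining segments would be unreachable from to_free and never freed, which is what kmemleak was reporting.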


Re: [PATCH net v4 0/2] rhashtable: Fix rhltable duplicates insertion

2018-03-07 Thread David Miller
From: Paul Blakey 
Date: Wed,  7 Mar 2018 16:00:11 +0200

> In our mlx5 driver (fs_core.c), we use the rhltable interface to store
> flow groups. We noticed that sometimes we get a warning that a flow group
> isn't found at removal. This rare case occurred in a specific scenario:
> insertion of a flow group with a similar match criteria (a duplicate),
> but only when the flow group's rhash_head was second (or otherwise not
> first) on the relevant rhashtable bucket list.
> 
> The first patch fixes it, and the second one adds a test that show
> it is now working.
> 
> Paul.
> 
> v4 --> v3 changes:
> * Added Herbert Xu's ack (thanks)
> * Removed extra commit tags

I already applied v3, sorry.  But I did add Herbert's ACKs.
:-)
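For background, an rhltable keeps one bucket-chain entry per distinct key and hangs duplicates of that key off the entry itself; the bug described in the cover letter bit when the matching entry was not first on the bucket chain. A toy single-bucket model of that duplicate chaining (this is an illustration of the idea, not the kernel rhashtable API):

```c
#include <assert.h>
#include <stddef.h>

/* One hashed object; duplicates of the same key hang off ->dup. */
struct entry {
    int key;
    struct entry *next;   /* bucket chain of distinct keys */
    struct entry *dup;    /* chain of duplicates of this key */
};

/* Insert into a single bucket.  The essential point of the fix: the
 * duplicate check must walk the whole bucket chain, not just compare
 * against the first entry. */
static void bucket_insert(struct entry **bucket, struct entry *e)
{
    for (struct entry *p = *bucket; p; p = p->next) {
        if (p->key == e->key) {       /* duplicate: chain it */
            e->dup = p->dup;
            p->dup = e;
            return;
        }
    }
    e->next = *bucket;                /* new distinct key */
    *bucket = e;
}

/* Count entries for a key, duplicates included. */
static int count_key(const struct entry *bucket, int key)
{
    for (const struct entry *p = bucket; p; p = p->next)
        if (p->key == key) {
            int n = 1;
            for (const struct entry *d = p->dup; d; d = d->dup)
                n++;
            return n;
        }
    return 0;
}
```

If the insert only compared the first bucket entry, a duplicate whose key sat second on the chain would be added as a separate head, and a later removal through the duplicate chain would fail to find it, matching the "flow group isn't found at removal" warning.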


Re: linux-next: manual merge of the selinux tree with the net-next tree

2018-03-07 Thread Paul Moore
On Mon, Mar 5, 2018 at 2:03 AM, Xin Long  wrote:
> On Mon, Mar 5, 2018 at 9:40 AM, Stephen Rothwell  
> wrote:
>> Hi Paul,
>>
>> Today's linux-next merge of the selinux tree got a conflict in:
>>
>>   net/sctp/socket.c
>>
>> between several refactoring commits from the net-next tree and commit:
>>
>>   2277c7cd75e3 ("sctp: Add LSM hooks")
>>
>> from the selinux tree.
>>
>> I fixed it up (I think - see below) and can carry the fix as
> The fixup is great! It is the same as I mentioned in:
> https://patchwork.ozlabs.org/patch/879898/
> for net-next.git
>
>> necessary. This is now fixed as far as linux-next is concerned, but any
>> non trivial conflicts should be mentioned to your upstream maintainer
>> when your tree is submitted for merging.  You may also want to consider
>> cooperating with the maintainer of the conflicting tree to minimise any
>> particularly complex conflicts.
>
> The patchset "[net-next,0/9] sctp: clean up sctp_sendmsg" was just applied
> in net-next, so I guess it might not yet have been there when the selinux
> tree was being submitted.

The selinux/next branch is based on v4.16-rc1 and doesn't feed into
the netdev tree, it goes straight to Linus during the merge window so
unfortunately I think we may need to carry this for some time and
relay this fix-up patch up to Linus during the merge window.

-- 
paul moore
www.paul-moore.com


Re: linux-next: manual merge of the selinux tree with the net-next tree

2018-03-07 Thread David Miller
From: Paul Moore 
Date: Wed, 7 Mar 2018 11:34:31 -0500

> On Mon, Mar 5, 2018 at 2:03 AM, Xin Long  wrote:
>> On Mon, Mar 5, 2018 at 9:40 AM, Stephen Rothwell  
>> wrote:
>>> Hi Paul,
>>>
>>> Today's linux-next merge of the selinux tree got a conflict in:
>>>
>>>   net/sctp/socket.c
>>>
>>> between several refactoring commits from the net-next tree and commit:
>>>
>>>   2277c7cd75e3 ("sctp: Add LSM hooks")
>>>
>>> from the selinux tree.
>>>
>>> I fixed it up (I think - see below) and can carry the fix as
>> The fixup is great! It is the same as I mentioned in:
>> https://patchwork.ozlabs.org/patch/879898/
>> for net-next.git
>>
>>> necessary. This is now fixed as far as linux-next is concerned, but any
>>> non trivial conflicts should be mentioned to your upstream maintainer
>>> when your tree is submitted for merging.  You may also want to consider
>>> cooperating with the maintainer of the conflicting tree to minimise any
>>> particularly complex conflicts.
>>
>> The patchset "[net-next,0/9] sctp: clean up sctp_sendmsg" was just applied
>> in net-next, so I guess it might not yet have been there when the selinux
>> tree was being submitted.
> 
> The selinux/next branch is based on v4.16-rc1 and doesn't feed into
> the netdev tree, it goes straight to Linus during the merge window so
> unfortunately I think we may need to carry this for some time and
> relay this fix-up patch up to Linus during the merge window.

What a mess.

The SCTP option changes should have gone through my tree in retrospect.


[PATCH net-next] ip6mr: remove synchronize_rcu() in favor of SOCK_RCU_FREE

2018-03-07 Thread Eric Dumazet
From: Eric Dumazet 

Kirill found that the recently added synchronize_rcu() call in
ip6mr_sk_done() was slowing down netns dismantle, and posted a patch
to invoke it only if the socket was found.

I instead suggested getting rid of this call and using SOCK_RCU_FREE.

We might later change the IPv4 side to use the same technique and unify
both stacks. IPv4 does not use synchronize_rcu(), but has a call_rcu()
that could be replaced by SOCK_RCU_FREE.

Tested:
 time for i in {1..1000}; do unshare -n /bin/false;done

 Before : real 7m18.911s
 After : real 10.187s

Fixes: 8571ab479a6e ("ip6mr: Make mroute_sk rcu-based")
Signed-off-by: Eric Dumazet 
Reported-by: Kirill Tkhai 
Cc: Yuval Mintz 
---
 net/ipv6/ip6mr.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 2a38f9de45d399fafc9f4bcc662b44be17279e51..7345bd6c4b7dda39c0d73d542e9ca9a5366542ff 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -1443,6 +1443,7 @@ static int ip6mr_sk_init(struct mr_table *mrt, struct 
sock *sk)
err = -EADDRINUSE;
} else {
rcu_assign_pointer(mrt->mroute_sk, sk);
+   sock_set_flag(sk, SOCK_RCU_FREE);
net->ipv6.devconf_all->mc_forwarding++;
}
write_unlock_bh(&mrt_lock);
@@ -1472,6 +1473,10 @@ int ip6mr_sk_done(struct sock *sk)
if (sk == rtnl_dereference(mrt->mroute_sk)) {
write_lock_bh(&mrt_lock);
RCU_INIT_POINTER(mrt->mroute_sk, NULL);
+   /* Note that mroute_sk had SOCK_RCU_FREE set,
+* so the RCU grace period before sk freeing
+* is guaranteed by sk_destruct()
+*/
net->ipv6.devconf_all->mc_forwarding--;
write_unlock_bh(&mrt_lock);
inet6_netconf_notify_devconf(net, RTM_NEWNETCONF,
@@ -1485,7 +1490,6 @@ int ip6mr_sk_done(struct sock *sk)
}
}
rtnl_unlock();
-   synchronize_rcu();
 
return err;
 }
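The ~40x speedup comes from the fact that SOCK_RCU_FREE defers the socket's free to a grace-period callback (like call_rcu() does), while synchronize_rcu() stalls the caller for a full grace period on every netns teardown. A toy user-space model of the two approaches (the mini-RCU here is purely illustrative, not the kernel API):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy deferred-free queue standing in for call_rcu(): frees are
 * queued and only run later, once a (simulated) grace period ends. */
struct rcu_cb {
    void *obj;
    struct rcu_cb *next;
};

static struct rcu_cb *pending;
static int blocked_waits;   /* times a caller stalled for a grace period */

/* SOCK_RCU_FREE style: queue the free and return immediately. */
static void toy_call_rcu(void *obj)
{
    struct rcu_cb *cb = malloc(sizeof(*cb));
    cb->obj = obj;
    cb->next = pending;
    pending = cb;
}

/* Old style: every teardown blocks for a full grace period. */
static void toy_synchronize_rcu(void)
{
    blocked_waits++;
}

/* Called once a grace period has elapsed: release everything queued. */
static int toy_rcu_flush(void)
{
    int n = 0;
    while (pending) {
        struct rcu_cb *cb = pending;
        pending = cb->next;
        free(cb->obj);
        free(cb);
        n++;
    }
    return n;
}
```

With many netns dismantles in flight, the deferred variant pays for one grace period shared by all queued frees, while the blocking variant pays one full grace period per teardown, which matches the 7m18s-to-10s improvement in the Tested section.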


