[PATCH net] ipv6: Do not consider linkdown nexthops during multipath

2017-11-20 Thread Ido Schimmel
When the 'ignore_routes_with_linkdown' sysctl is set, we should not
consider linkdown nexthops during route lookup.

While the code correctly verifies that the initially selected route
('match') has a carrier, it does not perform the same check in the
subsequent multipath selection, resulting in a potential packet loss.

In case the chosen route does not have a carrier and the sysctl is set,
choose the initially selected route.

Fixes: 35103d11173b ("net: ipv6 sysctl option to ignore routes when nexthop link is down")
Signed-off-by: Ido Schimmel 
---
 net/ipv6/route.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 05eb7bc36156..0363db914c7a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -472,6 +472,11 @@ static struct rt6_info *rt6_multipath_select(struct rt6_info *match,
 				&match->rt6i_siblings, rt6i_siblings) {
 			route_choosen--;
 			if (route_choosen == 0) {
+				struct inet6_dev *idev = sibling->rt6i_idev;
+
+				if (!netif_carrier_ok(sibling->dst.dev) &&
+				    idev->cnf.ignore_routes_with_linkdown)
+					break;
 				if (rt6_score_route(sibling, oif, strict) < 0)
 					break;
 				match = sibling;
-- 
2.14.3
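
The fallback rule described in this patch can be modelled in userspace
roughly as follows. This is a minimal sketch under stated assumptions, not
the kernel code: struct nexthop, select_nexthop() and the carrier_ok flag
are hypothetical stand-ins for struct rt6_info, rt6_multipath_select() and
netif_carrier_ok(), the sibling index is passed in directly instead of
being derived from a flow hash, and the rt6_score_route() check is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a route/nexthop; not a kernel structure. */
struct nexthop {
	const char *name;
	bool carrier_ok;	/* models netif_carrier_ok(dev) */
};

/*
 * Pick siblings[chosen]. If that sibling has no carrier and
 * ignore_linkdown is set, fall back to 'match', which the caller is
 * assumed to have verified already.
 */
static const struct nexthop *
select_nexthop(const struct nexthop *match, const struct nexthop *siblings,
	       size_t nsiblings, size_t chosen, bool ignore_linkdown)
{
	const struct nexthop *sibling;

	assert(match != NULL);
	if (chosen >= nsiblings)
		return match;

	sibling = &siblings[chosen];
	if (!sibling->carrier_ok && ignore_linkdown)
		return match;	/* mirrors the 'break' added by the patch */

	return sibling;
}
```

With ignore_linkdown set, a sibling without carrier is never returned; with
it cleared, the behaviour is the same as before the patch.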



Re: [PATCH v2 1/2] r8169: fix RTL8111EVL EEE and green settings

2017-11-20 Thread Heiner Kallweit
On 21.11.2017 at 02:34, Andrew Lunn wrote:
> Hi Heiner
Hi Andrew,

> 
> Do you have access to the data sheet?
> 
Not to more recent ones. I only have two older data sheets for early
members of the rtl8169 family.

> I had a quick look through the driver. It would be nice to refactor it
> to follow the usual Linux conventions:
> 
> Turn the MDIO read/write functions into an MDIO bus driver.
> 
I thought the same when looking at this driver.
It's a nightmare to maintain a driver with almost 9,000 lines of code
and numerous "switch mac_id" clauses.
I'm not sure whether new members of this chip family are still being
developed that we may have to add in the future.

My first thought was to factor out support for the original 8169 family
(mac id <= 6) into a separate driver as first step as it differs
significantly from later members of the chip family (e.g. TBI support
in addition to MII).

> Move the PHY code into drivers/net/phy/realtek.c, and in the process,
> replace all the magic numbers with #defines.
> 
Yes, this definitely would be desirable. However, I found that the
available datasheets for the external PHYs usually only document
the registers on page 0. All the magic settings on other pages are often
just copied from vendor drivers.
Or are there other, more comprehensive versions of the datasheets
available under NDA?

> Do you have any interest in doing this?
> 
Would be a nice challenge. I'm willing to look into this if I can
get hold of the official datasheets.

>Andrew
> 
Heiner


Re: Linux ECN Handling

2017-11-20 Thread Steve Ibanez
Hi Neal,

I tried your suggestion to disable tcp_tso_should_defer() and it does
indeed look like it is preventing the host from entering timeouts.
I'll have to do a bit more digging to try and find where the packets
are being dropped. I've verified that the bottleneck link queue is
at about the configured marking threshold when the timeout
occurs, so the drops may be happening at the NIC interfaces or perhaps
somewhere unexpected in the switch.

I wonder if you can explain why the TLP doesn't fire when in the CWR
state? It seems like that might be worth having for cases like this.

Btw, thank you very much for all the help! It is greatly appreciated :)

Best,
-Steve
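
The arithmetic and the three-step stall sequence from Neal's quoted mail
can be written down as a small model. This is a sketch for illustration
only: unacked_bytes() and classify_stall() are hypothetical helpers, the
two-value enum is a simplified subset of the kernel's TCP_CA_* states, and
none of this is how the kernel structures the decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum ca_state { CA_OPEN, CA_CWR };	/* simplified subset of TCP_CA_* */
enum stall_outcome { SENDS_NOW, TLP_PROBE, RTO_TIMEOUT };

/* Bytes sent but not yet acknowledged, as computed in the quoted mail. */
static uint32_t unacked_bytes(uint32_t last_seq, uint32_t last_len,
			      uint32_t snd_una)
{
	return last_seq + last_len - snd_una;
}

/*
 * (1) nothing is sent if cwnd is full or TSO deferral decides to wait,
 * (2) a tail loss probe is only assumed to fire in the Open state,
 * (3) otherwise the only remaining recovery is the RTO.
 */
static enum stall_outcome classify_stall(enum ca_state state, uint32_t cwnd,
					 uint32_t in_flight, bool tso_defers)
{
	bool cwnd_room = in_flight < cwnd;

	assert(state == CA_OPEN || state == CA_CWR);
	if (cwnd_room && !tso_defers)
		return SENDS_NOW;
	if (state == CA_OPEN)
		return TLP_PROBE;
	return RTO_TIMEOUT;
}
```

Feeding the trace values 352553398, 1448 and 352489686 into unacked_bytes()
gives the 65160 bytes mentioned in the mail.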


On Mon, Nov 20, 2017 at 7:01 AM, Neal Cardwell  wrote:
> Going back to one of your Oct 19 trace snapshots (attached), AFAICT at the
> time of the timeout there is actually almost 64KBytes  (352553398 + 1448 -
> 352489686 = 65160) of unacknowledged data. So there really does seem to be a
> significant chunk of packets that were in-flight that were then declared
> lost.
>
> So here is a possibility: perhaps the combination of CWR+PRR plus
> tcp_tso_should_defer() means that PRR can make cwnd so gentle that
> tcp_tso_should_defer() thinks we should wait for another ACK to send, and
> that ACK doesn't come. Breaking it down, the potential sequence would be:
>
> (1) tcp_write_xmit() does not send, because the CWR behavior, using PRR,
> does not leave enough cwnd for tcp_tso_should_defer() to think we should
> send (PRR was originally designed for recovery, which did not have TSO
> deferral)
>
> (2) TLP does not fire, because we are in state CWR, not Open
>
> (3) The only remaining option is an RTO, which fires.
>
> In other words, the possibility is that, at the time of the stall, the cwnd
> is reasonably high, but tcp_packets_in_flight() is also quite high, so
> either there is (a) literally no unused cwnd left ( tcp_packets_in_flight()
> == cwnd), or (b) some mechanism like tcp_tso_should_defer() is deciding that
> there is not enough available cwnd for it to make sense to chop off a
> fraction of a TSO skb to send now.
>
> One way to test that conjecture would be to disable tcp_tso_should_defer()
> by adding a:
>
>goto send_now;
>
> at the top of tcp_tso_should_defer().
>
> If that doesn't prevent the freezes then I would recommend adding printks or
> other instrumentation to  tcp_write_xmit() to log:
>
> - time
> - ca_state
> - cwnd
> - ssthresh
> - tcp_packets_in_flight()
> - the reason for breaking out of the tcp_write_xmit() loop (tso deferral, no
> packets left, tcp_snd_wnd_test, tcp_nagle_test, etc)
>
> cheers,
> neal
>
>
>
> On Mon, Nov 20, 2017 at 2:31 AM, Steve Ibanez  wrote:
>>
>> Hi Folks,
>>
>> I wanted to check back in on this for another update and to solicit
>> some more suggestions. I did a bit more digging to try and isolate the
>> problem.
>>
>> As I explained earlier, the log generated by tcp_probe indicates that
>> the snd_cwnd is set to 1 just before the end host receives an ECN
>> marked ACK and unexpectedly enters a timeout (
>> https://drive.google.com/open?id=1iyt8PvBxQga2jpRpBJ8KdQw3Q_mPTzZF ).
>> I was trying to track down where this is happening, but the only place
>> I could find that might be setting the snd_cwnd to 1 is in the
>> tcp_enter_loss() function. I inserted a printk() call in this function
>> to see when it is being invoked and it looks like it is only called by
>> the tcp_retransmit_timer() function after the timer expires.
>>
>> I decided to try recording the snd_cwnd, ssthresh, and icsk_ca_state
>> inside the tcp_fastretrans_alert() function whenever it processes an
>> ECN marked ACK (
>> https://drive.google.com/open?id=17GD77lb9lkCSu0_s9p40GZ5r4EU8B4VB )
>> This plot also shows when the tcp_retransmit_timer() and
>> tcp_enter_loss() functions are invoked (red and purple dots
>> respectively). And I see that the ACK state machine is always either
>> in the TCP_CA_Open or TCP_CA_CWR state whenever the
>> tcp_fastretrans_alert() function processes ECN marked ACKs (
>> https://drive.google.com/open?id=1xwuPxjgwriT9DSblFx2uILfQ95Fy-Eqq ).
>> So I'm not sure where the snd_cwnd is being set to 1 (or possibly 0 as
>> Neal suggested) just before entering a timeout. Any suggestions here?
>>
>> In order to do a bit of profiling of the tcp_dctcp code I added
>> support into tcp_probe for recording the dctcp alpha parameter. I see
>> that alpha oscillates around about 0.1 when the flow rates have
>> converged, it goes to zero when the other host enters a timeout, and I
>> don't see any unexpected behavior just before the timeout (
>> https://drive.google.com/open?id=1zPdyS57TrUYZIekbid9p1UNyraLYrdw7 ).
>>
>> So I haven't had much luck yet trying to track down where the problem
>> is. If you have any suggestions that would help me to focus my search
>> efforts, I would appreciate the comments.
>>
>> Thanks!
>> -Steve
>>
>>
>> On Mon, Nov 6, 2017 at 

Re: [PATCH net-next] netlink: optimize err assignment

2017-11-20 Thread kbuild test robot
Hi yuan,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/yuan-linyu/netlink-optimize-err-assignment/20171121-100409
config: i386-randconfig-x001-201747 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   net/netlink/af_netlink.c: In function 'netlink_setsockopt':
>> net/netlink/af_netlink.c:1614:19: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized]
  if (!val || val - 1 >= nlk->ngroups)
  ^~~

vim +/val +1614 net/netlink/af_netlink.c

84659eb52 Johannes Berg      2007-07-18  1584  
9a4595bc7 Patrick McHardy    2005-08-15  1585  static int netlink_setsockopt(struct socket *sock, int level, int optname,
b7058842c David S. Miller    2009-09-30  1586  			      char __user *optval, unsigned int optlen)
9a4595bc7 Patrick McHardy    2005-08-15  1587  {
9a4595bc7 Patrick McHardy    2005-08-15  1588  	struct sock *sk = sock->sk;
9a4595bc7 Patrick McHardy    2005-08-15  1589  	struct netlink_sock *nlk = nlk_sk(sk);
f3b56fc83 yuan linyu         2017-11-19  1590  	unsigned int val;
f3b56fc83 yuan linyu         2017-11-19  1591  	int err = 0;
9a4595bc7 Patrick McHardy    2005-08-15  1592  
9a4595bc7 Patrick McHardy    2005-08-15  1593  	if (level != SOL_NETLINK)
9a4595bc7 Patrick McHardy    2005-08-15  1594  		return -ENOPROTOOPT;
9a4595bc7 Patrick McHardy    2005-08-15  1595  
d1b4c689d Florian Westphal   2016-02-18  1596  	if (optlen >= sizeof(int) &&
eb4965344 Johannes Berg      2007-07-18  1597  	    get_user(val, (unsigned int __user *)optval))
9a4595bc7 Patrick McHardy    2005-08-15  1598  		return -EFAULT;
9a4595bc7 Patrick McHardy    2005-08-15  1599  
9a4595bc7 Patrick McHardy    2005-08-15  1600  	switch (optname) {
9a4595bc7 Patrick McHardy    2005-08-15  1601  	case NETLINK_PKTINFO:
9a4595bc7 Patrick McHardy    2005-08-15  1602  		if (val)
cc3a572fe Nicolas Dichtel    2015-05-07  1603  			nlk->flags |= NETLINK_F_RECV_PKTINFO;
9a4595bc7 Patrick McHardy    2005-08-15  1604  		else
cc3a572fe Nicolas Dichtel    2015-05-07  1605  			nlk->flags &= ~NETLINK_F_RECV_PKTINFO;
9a4595bc7 Patrick McHardy    2005-08-15  1606  		break;
9a4595bc7 Patrick McHardy    2005-08-15  1607  	case NETLINK_ADD_MEMBERSHIP:
9a4595bc7 Patrick McHardy    2005-08-15  1608  	case NETLINK_DROP_MEMBERSHIP: {
5187cd055 Eric W. Biederman  2014-04-23  1609  		if (!netlink_allowed(sock, NL_CFG_F_NONROOT_RECV))
9a4595bc7 Patrick McHardy    2005-08-15  1610  			return -EPERM;
b4ff4f041 Johannes Berg      2007-07-18  1611  		err = netlink_realloc_groups(sk);
513c25000 Patrick McHardy    2005-09-06  1612  		if (err)
513c25000 Patrick McHardy    2005-09-06  1613  			return err;
9a4595bc7 Patrick McHardy    2005-08-15 @1614  		if (!val || val - 1 >= nlk->ngroups)
9a4595bc7 Patrick McHardy    2005-08-15  1615  			return -EINVAL;
7774d5e03 Richard Guy Briggs 2014-04-22  1616  		if (optname == NETLINK_ADD_MEMBERSHIP && nlk->netlink_bind) {
023e2cfa3 Johannes Berg      2014-12-23  1617  			err = nlk->netlink_bind(sock_net(sk), val);
4f5209005 Richard Guy Briggs 2014-04-22  1618  			if (err)
4f5209005 Richard Guy Briggs 2014-04-22  1619  				return err;
4f5209005 Richard Guy Briggs 2014-04-22  1620  		}
9a4595bc7 Patrick McHardy    2005-08-15  1621  		netlink_table_grab();
84659eb52 Johannes Berg      2007-07-18  1622  		netlink_update_socket_mc(nlk, val,
84659eb52 Johannes Berg      2007-07-18  1623  					 optname == NETLINK_ADD_MEMBERSHIP);
9a4595bc7 Patrick McHardy    2005-08-15  1624  		netlink_table_ungrab();
7774d5e03 Richard Guy Briggs 2014-04-22  1625  		if (optname == NETLINK_DROP_MEMBERSHIP && nlk->netlink_unbind)
023e2cfa3 Johannes Berg      2014-12-23  1626  			nlk->netlink_unbind(sock_net(sk), val);
9a4595bc7 Patrick McHardy    2005-08-15  1627  		break;
9a4595bc7 Patrick McHardy    2005-08-15  1628  	}
be0c22a46 Pablo Neira Ayuso  2009-02-18  1629  	case NETLINK_BROADCAST_ERROR:
be0c22a46 Pablo Neira Ayuso  2009-02-18  1630  		if (val)
cc3a572fe Nicolas Dichtel    2015-05-07  1631  			nlk->flags |= NETLINK_F_BROADCAST_SEND_ERROR;
be0c22a46 Pablo Neira Ayuso  2009-02-18  1632  		else
cc3a572fe Nicolas Dichtel    2015-05-07  1633  			nlk->flags &= ~NETLINK_F_BROADCAST_SEND_ERROR;
be0c22a46 Pablo Neira Ayuso  2009-02-18  1634  		break;
38938bfe3 Pablo 
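
In the listing above, 'val' is declared without an initializer at line 1590
and is only written by get_user() when optlen >= sizeof(int), so a shorter
option would reach the check at line 1614 with 'val' unset. The userspace
model below shows that pattern with one possible remedy, an explicit
default. It is a sketch, not a proposed kernel change: parse_group_opt() is
a hypothetical helper and memcpy() stands in for get_user().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model of the option parsing at lines 1596-1615. */
static int parse_group_opt(const void *optval, size_t optlen,
			   unsigned int ngroups)
{
	unsigned int val = 0;	/* explicit default: short options read as 0 */

	assert(optval != NULL || optlen == 0);
	if (optlen >= sizeof(val))
		memcpy(&val, optval, sizeof(val));	/* models get_user() */

	/* Same range check as line 1614; val is always defined here. */
	if (!val || val - 1 >= ngroups)
		return -EINVAL;

	return (int)val;
}
```

With the default in place, a too-short option is rejected with -EINVAL
instead of being compared against an indeterminate value.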

[PATCH iproute2/net-next v4]tc: B.W limits can now be specified in %.

2017-11-20 Thread Nishanth Devarajan
This patch adapts the tc command line interface to allow bandwidth limits
to be specified as a percentage of the interface's capacity.

Adding this functionality requires passing the specified device string to
each class/qdisc which changes the prototype for a couple of functions: the
.parse_qopt and .parse_copt interfaces. The device string is a required
parameter for tc-qdisc and tc-class, and when not specified, the kernel
returns ENODEV. In this patch, if the user tries to specify a bandwidth
percentage without naming the device, we return an error from userspace.

v2:
* Modified and moved int read_prop() from ip/iptuntap.c to lib/utils.c,
to make it accessible to tc.

v3:
* Modified and moved int parse_percent() from tc/q_netem.c to lib/utils.c for
use in tc.
* Changed a couple of variable names in int parse_percent_rate().
* Handled showing an error message when the device speed is unknown.
* Updated the man page to warn users that when specifying rates in %, tc only
uses the current device speed and does not recalculate if it changes afterwards.

During cases when properties (like device speed) are unknown, read_prop()
assumes that if the property file can be opened but not read, it means
that the property is unknown.

v4:
* int read_prop() in lib/utils.c was using the strtoul() API; this was changed
to strtol().
* The 'const' qualifier was added to the device string arguments in the
.parse_qopt and .parse_copt interface headers.

Signed-off-by: Nishanth Devarajan
---
 include/utils.h |  2 ++
 ip/iptuntap.c   | 32 ---
 lib/utils.c | 68 +
 man/man8/tc.8   |  5 -
 tc/q_atm.c  |  4 ++--
 tc/q_cbq.c  | 25 -
 tc/q_cbs.c  |  2 +-
 tc/q_choke.c|  9 ++--
 tc/q_clsact.c   |  2 +-
 tc/q_codel.c|  2 +-
 tc/q_drr.c  |  4 ++--
 tc/q_dsmark.c   |  4 ++--
 tc/q_fifo.c |  2 +-
 tc/q_fq.c   | 16 +++---
 tc/q_fq_codel.c |  2 +-
 tc/q_gred.c |  9 ++--
 tc/q_hfsc.c | 45 +-
 tc/q_hhf.c  |  2 +-
 tc/q_htb.c  | 18 +++
 tc/q_ingress.c  |  2 +-
 tc/q_mqprio.c   |  2 +-
 tc/q_multiq.c   |  2 +-
 tc/q_netem.c| 23 ++-
 tc/q_pie.c  |  2 +-
 tc/q_prio.c |  2 +-
 tc/q_qfq.c  |  4 ++--
 tc/q_red.c  |  9 ++--
 tc/q_rr.c   |  2 +-
 tc/q_sfb.c  |  2 +-
 tc/q_sfq.c  |  2 +-
 tc/q_tbf.c  | 16 +++---
 tc/tc.c |  2 +-
 tc/tc_class.c   |  2 +-
 tc/tc_qdisc.c   |  2 +-
 tc/tc_util.c| 63 
 tc/tc_util.h|  7 --
 36 files changed, 285 insertions(+), 112 deletions(-)

diff --git a/include/utils.h b/include/utils.h
index 10749fb..9c37c61 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -88,6 +88,8 @@ int get_prefix(inet_prefix *dst, char *arg, int family);
 int mask2bits(__u32 netmask);
 int get_addr_ila(__u64 *val, const char *arg);
 
+int read_prop(const char *dev, char *prop, long *value);
+int parse_percent(double *val, const char *str);
 int get_hex(char c);
 int get_integer(int *val, const char *arg, int base);
 int get_unsigned(unsigned *val, const char *arg, int base);
diff --git a/ip/iptuntap.c b/ip/iptuntap.c
index b46e452..09f2be2 100644
--- a/ip/iptuntap.c
+++ b/ip/iptuntap.c
@@ -223,38 +223,6 @@ static int do_del(int argc, char **argv)
return tap_del_ioctl();
 }
 
-static int read_prop(char *dev, char *prop, long *value)
-{
-   char fname[IFNAMSIZ+25], buf[80], *endp;
-   ssize_t len;
-   int fd;
-   long result;
-
-   sprintf(fname, "/sys/class/net/%s/%s", dev, prop);
-   fd = open(fname, O_RDONLY);
-   if (fd < 0) {
-   if (strcmp(prop, "tun_flags"))
-   fprintf(stderr, "open %s: %s\n", fname,
-   strerror(errno));
-   return -1;
-   }
-   len = read(fd, buf, sizeof(buf)-1);
-   close(fd);
-   if (len < 0) {
-   fprintf(stderr, "read %s: %s", fname, strerror(errno));
-   return -1;
-   }
-
-   buf[len] = 0;
-   result = strtol(buf, &endp, 0);
-   if (*endp != '\n') {
-   fprintf(stderr, "Failed to parse %s\n", fname);
-   return -1;
-   }
-   *value = result;
-   return 0;
-}
-
 static void print_flags(long flags)
 {
if (flags & IFF_TUN)
diff --git a/lib/utils.c b/lib/utils.c
index 48cead1..7ced8c0 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -38,6 +38,74 @@
 int resolve_hosts;
 int timestamp_short;
 
+int read_prop(const char *dev, char *prop, long *value)
+{
+   char fname[128], buf[80], *endp, *nl;
+   FILE *fp;
+   long result;
+   int ret;
+
+   ret = snprintf(fname, sizeof(fname), "/sys/class/net/%s/%s",
+   dev, prop);
+
+   if (ret <= 0 || ret >= sizeof(fname)) {
+   fprintf(stderr, "could not build pathname for property\n");
+  
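
The percentage handling described in the commit message can be sketched as
below. This is an illustration under stated assumptions, not the code from
the patch: percent_to_fraction() and percent_rate() are hypothetical names,
the device speed is assumed to be the Mbit/s value read from the sysfs
"speed" attribute, and the result is expressed in bytes per second.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a string such as "50%" into 0.5. Returns 0 on success, -1 on error. */
static int percent_to_fraction(const char *str, double *frac)
{
	char *end;
	double v;

	assert(str != NULL && frac != NULL);
	errno = 0;
	v = strtod(str, &end);
	if (end == str || errno == ERANGE)
		return -1;		/* no number, or out of range */
	if (end[0] != '%' || end[1] != '\0')
		return -1;		/* must end in exactly one '%' */
	if (v < 0.0 || v > 100.0)
		return -1;
	*frac = v / 100.0;
	return 0;
}

/* Convert a percentage of the link speed (Mbit/s) into bytes per second. */
static int percent_rate(const char *str, long dev_mbit, uint64_t *bytes_per_sec)
{
	double frac;

	if (dev_mbit <= 0)
		return -1;		/* device speed unknown */
	if (percent_to_fraction(str, &frac) < 0)
		return -1;
	*bytes_per_sec = (uint64_t)(frac * (double)dev_mbit * 1000000.0 / 8.0);
	return 0;
}
```

As the v3 notes say, the rate is computed once from the current device
speed; a sketch like this has no way to follow later speed changes either.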

Re: [PATCH v2 1/2] r8169: fix RTL8111EVL EEE and green settings

2017-11-20 Thread Andrew Lunn
Hi Heiner

Do you have access to the data sheet?

I had a quick look through the driver. It would be nice to refactor it
to follow the usual Linux conventions:

Turn the MDIO read/write functions into an MDIO bus driver.

Move the PHY code into drivers/net/phy/realtek.c, and in the process,
replace all the magic numbers with #defines.

Do you have any interest in doing this?

   Andrew


[PATCH net] nfp: flower: add missing kdoc

2017-11-20 Thread Jakub Kicinski
Commit 0115552eac14 ("nfp: remove false positive offloads
in flower vxlan") missed adding kdoc for a new parameter
of nfp_flower_add_offload().

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index f5d73b83dcc2..553f94f55dce 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -315,6 +315,7 @@ nfp_flower_allocate_new(struct nfp_fl_key_ls *key_layer)
  * @app:   Pointer to the APP handle
  * @netdev:netdev structure.
  * @flow:  TC flower classifier offload structure.
+ * @egress:NFP netdev is the egress.
  *
  * Adds a new flow to the repeated hash structure and action payload.
  *
-- 
2.14.1



[PATCH 2/3] net: use dev_alloc_name_ns instead of dev_get_valid_name

2017-11-20 Thread Rasmus Villemoes
The latter is simply a wrapper for the former; no need to keep both, so
call dev_alloc_name_ns directly.

Signed-off-by: Rasmus Villemoes 
---
 drivers/net/tun.c | 2 +-
 net/core/dev.c| 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 6bb1e604aadd..849a95505f80 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2253,7 +2253,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 
if (!dev)
return -ENOMEM;
-   err = dev_get_valid_name(net, dev, name);
+   err = dev_alloc_name_ns(net, dev, name);
if (err < 0)
goto err_free_dev;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 9575e7329823..0de42f39d280 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1183,7 +1183,7 @@ int dev_change_name(struct net_device *dev, const char *newname)
 
memcpy(oldname, dev->name, IFNAMSIZ);
 
-   err = dev_get_valid_name(net, dev, newname);
+   err = dev_alloc_name_ns(net, dev, newname);
if (err < 0) {
write_seqcount_end(&devnet_rename_seq);
return err;
@@ -7615,7 +7615,7 @@ int register_netdevice(struct net_device *dev)
spin_lock_init(&dev->addr_list_lock);
netdev_set_addr_lockdep_class(dev);
 
-   ret = dev_get_valid_name(net, dev, dev->name);
+   ret = dev_alloc_name_ns(net, dev, dev->name);
if (ret < 0)
goto out;
 
@@ -8354,7 +8354,7 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
/* We get here if we can't use the current device name */
if (!pat)
goto out;
-   if (dev_get_valid_name(net, dev, pat) < 0)
+   if (dev_alloc_name_ns(net, dev, pat) < 0)
goto out;
}
 
-- 
2.11.0



[PATCH 3/3] net: core: remove dev_get_valid_name

2017-11-20 Thread Rasmus Villemoes
No users left.

Signed-off-by: Rasmus Villemoes 
---
 include/linux/netdevice.h | 2 --
 net/core/dev.c| 7 ---
 2 files changed, 9 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e249d3d0ff85..7b057ef42906 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3732,8 +3732,6 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
unsigned char name_assign_type,
void (*setup)(struct net_device *),
unsigned int txqs, unsigned int rxqs);
-int dev_get_valid_name(struct net *net, struct net_device *dev,
-  const char *name);
 
 #define alloc_netdev(sizeof_priv, name, name_assign_type, setup) \
alloc_netdev_mqs(sizeof_priv, name, name_assign_type, setup, 1, 1)
diff --git a/net/core/dev.c b/net/core/dev.c
index 0de42f39d280..405598c27195 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1144,13 +1144,6 @@ int dev_alloc_name(struct net_device *dev, const char *name)
 }
 EXPORT_SYMBOL(dev_alloc_name);
 
-int dev_get_valid_name(struct net *net, struct net_device *dev,
-  const char *name)
-{
-   return dev_alloc_name_ns(net, dev, name);
-}
-EXPORT_SYMBOL(dev_get_valid_name);
-
 /**
  * dev_change_name - change name of a device
  * @dev: device
-- 
2.11.0



[PATCH 1/3] net: core: export dev_alloc_name_ns

2017-11-20 Thread Rasmus Villemoes
dev_alloc_name_ns and dev_get_valid_name now do exactly the same
thing. Let's expose this functionality as dev_alloc_name_ns
(obviously, a core function like this won't return an invalid
name...).

Signed-off-by: Rasmus Villemoes 
---
 include/linux/netdevice.h | 1 +
 net/core/dev.c| 7 ---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6b274bfe489f..e249d3d0ff85 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2447,6 +2447,7 @@ struct net_device *dev_get_by_name(struct net *net, const char *name);
 struct net_device *dev_get_by_name_rcu(struct net *net, const char *name);
 struct net_device *__dev_get_by_name(struct net *net, const char *name);
 int dev_alloc_name(struct net_device *dev, const char *name);
+int dev_alloc_name_ns(struct net *net, struct net_device *dev, const char *name);
 int dev_open(struct net_device *dev);
 void dev_close(struct net_device *dev);
 void dev_close_many(struct list_head *head, bool unlink);
diff --git a/net/core/dev.c b/net/core/dev.c
index 8ee29f4f5fa9..9575e7329823 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1109,9 +1109,9 @@ static int __dev_alloc_name(struct net *net, const char *name, char *buf)
return p ? -ENFILE : -EEXIST;
 }
 
-static int dev_alloc_name_ns(struct net *net,
-struct net_device *dev,
-const char *name)
+int dev_alloc_name_ns(struct net *net,
+ struct net_device *dev,
+ const char *name)
 {
char buf[IFNAMSIZ];
int ret;
@@ -1122,6 +1122,7 @@ static int dev_alloc_name_ns(struct net *net,
strlcpy(dev->name, buf, IFNAMSIZ);
return ret;
 }
+EXPORT_SYMBOL(dev_alloc_name_ns);
 
 /**
  * dev_alloc_name - allocate a name for a device
-- 
2.11.0



Re: [PATCH net v2 00/10] bpf: offload: check netdev pointer in the drivers and namespace trouble

2017-11-20 Thread Daniel Borkmann
On 11/21/2017 12:21 AM, Jakub Kicinski wrote:
> Hi!
> 
> This series addresses some late comments and moves checking if program
> has been loaded for the correct device to the drivers.  There are also
> some problems with net namespaces which I didn't take into consideration.
> On the kernel side we will now simply ignore namespace moves.  Since the
> user space API is not reporting any namespace identification we have to
> remove the ifindex until a correct way of reporting is agreed upon.
> 
> v2:
>  - fix ext ack reporting for XDP (David A);
>  - add Jiri's Ack.

Series applied to bpf tree, thanks Jakub!


Re: [E1000-devel] Questions about crashes and GRO

2017-11-20 Thread Alexander Duyck
On Mon, Nov 20, 2017 at 3:35 PM, Sarah Newman  wrote:
> On 11/20/2017 02:56 PM, Alexander Duyck wrote:
>> On Mon, Nov 20, 2017 at 2:38 PM, Sarah Newman wrote:
>>> On 11/20/2017 08:36 AM, Alexander Duyck wrote:
>>>> Hi Sarah,
>>>>
>>>> I am adding the netdev mailing list as I am not certain this is an
>>>> i350 specific issue. The traces themselves aren't anything I recognize
>>>> as an existing issue. From what I can tell it looks like you are
>>>> running Xen, so would I be correct in assuming you are bridging
>>>> between VMs? If so are you using any sort of tunnels on your network,
>>>> if so what type? This information would be useful as we may be looking
>>>> at a bug in a tunnel offload for GRO.
>>>
>>> Yes, there's bridging. The traffic on the physical device is tagged with 
>>> vlans and the bridges use untagged traffic. There are no tunnels. I do not
>>> own the VMs traffic.
>>>
>>> Because I have only seen this on a single server with unique hardware, I 
>>> think it's most likely related to the hardware or to a particular VM on that
>>> server.
>>
>> So I would suspect traffic coming from the VM if anything. The i350 is
>> a pretty common device. If we were seeing issues specific to it I would
>> expect we would have more reports than just the one so far.
>
> My confusion was primarily related to the release notes for an older version 
> of a different intel driver.
>
> But regarding traffic coming from a VM, the backtraces both include igb_poll. 
> Doesn't that mean the problem is related to inbound traffic on the igb
> device and not traffic direct from a local VM?
>
> --Sarah

All the igb driver is doing is taking the data off of the network,
populating sk_buff structures, and then handing them off to the stack.
The format of the sk_buff's has been pretty consistent for the last
several years so I am not really suspecting a driver issue.

The issue with network traffic is that it is usually symmetric meaning
if the VM sends something it will get some sort of reply.  The actual
traffic itself and how the kernel handles it has changed quite a bit
over the years, and a VM could be setting up a tunnel, or stack of
VLANs, or some other type of traffic that the kernel might have
recognized and tried to do GRO for but didn't fully support. If
turning off GRO solves the problem then the issue is likely in the GRO
code, not in the igb driver.

- Alex


[PATCH RFC net-next 4/4] tcp: Enable 2nd listener hashtable in TCP

2017-11-20 Thread Martin KaFai Lau
Enable the second listener hashtable in TCP.
The scale is the same as UDP which is one slot per 2MB.

Signed-off-by: Martin KaFai Lau 
---
 net/ipv4/tcp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bf97317e6c97..180311636023 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3577,6 +3577,9 @@ void __init tcp_init(void)
percpu_counter_init(&tcp_sockets_allocated, 0, GFP_KERNEL);
percpu_counter_init(&tcp_orphan_count, 0, GFP_KERNEL);
inet_hashinfo_init(&tcp_hashinfo);
+   inet_hashinfo2_init(&tcp_hashinfo, "tcp_listen_portaddr_hash",
+   thash_entries, 21,  /* one slot per 2 MB*/
+   0, 64 * 1024);
tcp_hashinfo.bind_bucket_cachep =
kmem_cache_create("tcp_bind_bucket",
  sizeof(struct inet_bind_bucket), 0,
-- 
2.9.5



[PATCH RFC net-next 1/4] inet: Add a count to struct inet_listen_hashbucket

2017-11-20 Thread Martin KaFai Lau
This patch adds a count to the 'struct inet_listen_hashbucket'.
It counts how many sks are hashed to a bucket.  It will be
used to decide if the (to-be-added) portaddr listener's hashtable
should be used during inet[6]_lookup_listener().

Signed-off-by: Martin KaFai Lau 
---
 include/net/inet_hashtables.h |  1 +
 net/ipv4/inet_hashtables.c| 11 +--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 2dbbbff5e1e3..4cce516c41ac 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -111,6 +111,7 @@ struct inet_bind_hashbucket {
  */
 struct inet_listen_hashbucket {
spinlock_t  lock;
+   unsigned intcount;
struct hlist_head   head;
 };
 
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index e7d15fb0d94d..bf16aa852aa3 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -483,6 +483,7 @@ int __inet_hash(struct sock *sk, struct sock *osk)
hlist_add_tail_rcu(&sk->sk_node, &ilb->head);
else
hlist_add_head_rcu(&sk->sk_node, &ilb->head);
+   ilb->count++;
sock_set_flag(sk, SOCK_RCU_FREE);
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
 unlock:
@@ -509,6 +510,7 @@ EXPORT_SYMBOL_GPL(inet_hash);
 void inet_unhash(struct sock *sk)
 {
struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+   struct inet_listen_hashbucket *ilb;
spinlock_t *lock;
bool listener = false;
int done;
@@ -517,7 +519,8 @@ void inet_unhash(struct sock *sk)
return;
 
if (sk->sk_state == TCP_LISTEN) {
-   lock = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)].lock;
+   ilb = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)];
+   lock = &ilb->lock;
listener = true;
} else {
lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
@@ -529,8 +532,11 @@ void inet_unhash(struct sock *sk)
done = __sk_del_node_init(sk);
else
done = __sk_nulls_del_node_init_rcu(sk);
-   if (done)
+   if (done) {
+   if (listener)
+   ilb->count--;
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
+   }
spin_unlock_bh(lock);
 }
 EXPORT_SYMBOL_GPL(inet_unhash);
@@ -665,6 +671,7 @@ void inet_hashinfo_init(struct inet_hashinfo *h)
for (i = 0; i < INET_LHTABLE_SIZE; i++) {
spin_lock_init(&h->listening_hash[i].lock);
INIT_HLIST_HEAD(&h->listening_hash[i].head);
+   h->listening_hash[i].count = 0;
}
 }
 EXPORT_SYMBOL_GPL(inet_hashinfo_init);
-- 
2.9.5



[PATCH RFC net-next 0/4] tcp: Add a 2nd listener hashtable (port+addr)

2017-11-20 Thread Martin KaFai Lau
This patch set adds a 2nd listener hashtable.  It is to resolve
the performance issue when a process is listening at many IP
addresses with the same port (e.g. [IP1]:443, [IP2]:443... [IPN]:443)

Martin KaFai Lau (4):
  inet: Add a count to struct inet_listen_hashbucket
  udp: Move udp[46]_portaddr_hash() to net/ip[v6].h
  inet: Add a 2nd listener hashtable (port+addr)
  tcp: Enable 2nd listener hashtable in TCP

 include/net/inet_connection_sock.h |   1 +
 include/net/inet_hashtables.h  |  16 
 include/net/ip.h   |   9 ++
 include/net/ipv6.h |  17 
 net/ipv4/inet_hashtables.c | 180 +++--
 net/ipv4/tcp.c |   3 +
 net/ipv4/udp.c |  22 ++---
 net/ipv6/inet6_hashtables.c|  73 +++
 net/ipv6/udp.c |  32 ++-
 9 files changed, 307 insertions(+), 46 deletions(-)

-- 
2.9.5
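
The motivation in this cover letter can be illustrated with a small
userspace model. It assumes, as the [IP1]:443 ... [IPN]:443 example
suggests, that the existing listener table is keyed on the port alone. The
sketch is not the kernel implementation: NBUCKETS, port_bucket(),
portaddr_bucket() and max_chain() are hypothetical, and a multiplicative
hash stands in for the kernel's jhash-based functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS	32u	/* power of two */
#define BUCKET_SHIFT	27	/* keep the top 5 bits of a 32-bit hash */

/* Port-only key: every listener on the same port shares one bucket. */
static uint32_t port_bucket(uint16_t port)
{
	return port & (NBUCKETS - 1);
}

/* Port+address key; a multiplicative hash stands in for jhash here. */
static uint32_t portaddr_bucket(uint32_t addr, uint16_t port)
{
	uint32_t h = (addr * 2654435761u) >> BUCKET_SHIFT;

	return h ^ (port & (NBUCKETS - 1));
}

/* Longest chain when n listeners on consecutive addresses share a port. */
static unsigned int max_chain(size_t n, uint16_t port, int use_addr)
{
	unsigned int count[NBUCKETS] = { 0 };
	unsigned int max = 0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++) {
		uint32_t addr = 0x0a000001u + (uint32_t)i;	/* 10.0.0.1, ... */
		uint32_t b = use_addr ? portaddr_bucket(addr, port)
				      : port_bucket(port);

		if (++count[b] > max)
			max = count[b];
	}
	return max;
}
```

With the port-only key, max_chain(n, 443, 0) equals n, so a lookup has to
walk every listener; with the port+address key the same listeners are
spread over several buckets.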



[PATCH RFC net-next 2/4] udp: Move udp[46]_portaddr_hash() to net/ip[v6].h

2017-11-20 Thread Martin KaFai Lau
This patch moves the udp[46]_portaddr_hash()
to net/ip[v6].h.  The function name is renamed to
ipv[46]_portaddr_hash().

It will be used by a later patch which adds a second listener
hashtable hashed by the address and port.

Signed-off-by: Martin KaFai Lau 
---
 include/net/ip.h   |  9 +
 include/net/ipv6.h | 17 +
 net/ipv4/udp.c | 22 --
 net/ipv6/udp.c | 32 
 4 files changed, 42 insertions(+), 38 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 9896f46cbbf1..fc9bf1b1fe2c 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -26,12 +26,14 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #define IPV4_MAX_PMTU  65535U  /* RFC 2675, Section 5.1 */
 
@@ -521,6 +523,13 @@ static inline unsigned int ipv4_addr_hash(__be32 ip)
return (__force unsigned int) ip;
 }
 
+static inline u32 ipv4_portaddr_hash(const struct net *net,
+__be32 saddr,
+unsigned int port)
+{
+   return jhash_1word((__force u32)saddr, net_hash_mix(net)) ^ port;
+}
+
 bool ip_call_ra_chain(struct sk_buff *skb);
 
 /*
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index ec14f0d5a3a1..e337ecb3284f 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define SIN6_LEN_RFC2133   24
 
@@ -673,6 +674,22 @@ static inline bool ipv6_addr_v4mapped(const struct in6_addr *a)
cpu_to_be32(0x0000ffff))) == 0UL;
 }
 
+static inline u32 ipv6_portaddr_hash(const struct net *net,
+const struct in6_addr *addr6,
+unsigned int port)
+{
+   unsigned int hash, mix = net_hash_mix(net);
+
+   if (ipv6_addr_any(addr6))
+   hash = jhash_1word(0, mix);
+   else if (ipv6_addr_v4mapped(addr6))
+   hash = jhash_1word((__force u32)addr6->s6_addr32[3], mix);
+   else
+   hash = jhash2((__force u32 *)addr6->s6_addr32, 4, mix);
+
+   return hash ^ port;
+}
+
 /*
  * Check for a RFC 4843 ORCHID address
  * (Overlay Routable Cryptographic Hash Identifiers)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e4ff25c947c5..057392a47c92 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -357,18 +357,12 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
 }
 EXPORT_SYMBOL(udp_lib_get_port);
 
-static u32 udp4_portaddr_hash(const struct net *net, __be32 saddr,
- unsigned int port)
-{
-   return jhash_1word((__force u32)saddr, net_hash_mix(net)) ^ port;
-}
-
 int udp_v4_get_port(struct sock *sk, unsigned short snum)
 {
unsigned int hash2_nulladdr =
-   udp4_portaddr_hash(sock_net(sk), htonl(INADDR_ANY), snum);
+   ipv4_portaddr_hash(sock_net(sk), htonl(INADDR_ANY), snum);
unsigned int hash2_partial =
-   udp4_portaddr_hash(sock_net(sk), inet_sk(sk)->inet_rcv_saddr, 
0);
+   ipv4_portaddr_hash(sock_net(sk), inet_sk(sk)->inet_rcv_saddr, 
0);
 
/* precompute partial secondary hash */
udp_sk(sk)->udp_portaddr_hash = hash2_partial;
@@ -492,7 +486,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 
saddr,
u32 hash = 0;
 
if (hslot->count > 10) {
-   hash2 = udp4_portaddr_hash(net, daddr, hnum);
+   hash2 = ipv4_portaddr_hash(net, daddr, hnum);
slot2 = hash2 & udptable->mask;
hslot2 = &udptable->hash2[slot2];
if (hslot->count < hslot2->count)
@@ -503,7 +497,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 
saddr,
  exact_dif, hslot2, skb);
if (!result) {
unsigned int old_slot2 = slot2;
-   hash2 = udp4_portaddr_hash(net, htonl(INADDR_ANY), 
hnum);
+   hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), 
hnum);
slot2 = hash2 & udptable->mask;
/* avoid searching the same slot again. */
if (unlikely(slot2 == old_slot2))
@@ -1775,7 +1769,7 @@ EXPORT_SYMBOL(udp_lib_rehash);
 
 static void udp_v4_rehash(struct sock *sk)
 {
-   u16 new_hash = udp4_portaddr_hash(sock_net(sk),
+   u16 new_hash = ipv4_portaddr_hash(sock_net(sk),
  inet_sk(sk)->inet_rcv_saddr,
  inet_sk(sk)->inet_num);
udp_lib_rehash(sk, new_hash);
@@ -1966,9 +1960,9 @@ static int __udp4_lib_mcast_deliver(struct net *net, 
struct sk_buff *skb,
struct sk_buff *nskb;
 
if (use_hash2) {
-   hash2_any = udp4_portaddr_hash(net, htonl(INADDR_ANY), hnum) &
+   hash2_any = 

[PATCH RFC net-next 3/4] inet: Add a 2nd listener hashtable (port+addr)

2017-11-20 Thread Martin KaFai Lau
The current listener hashtable is hashed by port only.
When a process is listening at many IP addresses with the same port (e.g.
[IP1]:443, [IP2]:443... [IPN]:443), the inet[6]_lookup_listener()
performance degrades to a linked-list walk.  This makes it prone to SYN
attacks.

UDP had a similar issue and a second hashtable was added to resolve it.

This patch adds a second hashtable for the listener's sockets.
The second hashtable is hashed by port and address.

It cannot reuse the existing skc_portaddr_node which is shared
with skc_bind_node.  TCP listener needs to use skc_bind_node.
Instead, this patch adds a hlist_node 'icsk_listen_portaddr_node' to
the inet_connection_sock which the listener (like TCP) also belongs to.

The new portaddr hashtable may need two lookups (first by IP:PORT,
then by INADDR_ANY:PORT if IP:PORT is not found).  Hence,
it implements a cutoff similar to UDP's: it only consults the
new portaddr hashtable if the current port-only hashtable has >10
sockets in the linked list.

Signed-off-by: Martin KaFai Lau 
---
 include/net/inet_connection_sock.h |   1 +
 include/net/inet_hashtables.h  |  15 
 net/ipv4/inet_hashtables.c | 175 +++--
 net/ipv6/inet6_hashtables.c|  73 
 4 files changed, 255 insertions(+), 9 deletions(-)
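
Note for readers (not part of the patch): the lookup order described in
the commit message can be modelled in user space roughly as below.  This
is only a sketch under assumptions: toy_portaddr_hash(), struct listener,
struct tables, the single port bucket and LHASH2_SIZE are hypothetical
stand-ins, and the mixing function is not jhash.  Only the >10 cutoff and
the "IP:PORT first, then INADDR_ANY:PORT" order come from the text above.

```c
#include <stddef.h>
#include <stdint.h>

#define ANY_ADDR      0u   /* stand-in for INADDR_ANY */
#define LHASH2_CUTOFF 10   /* cutoff named in the commit message */
#define LHASH2_SIZE   16   /* hypothetical table size, power of two */

struct listener {
	uint32_t addr;
	uint16_t port;
	struct listener *next_port;     /* chain in the port-only bucket */
	struct listener *next_portaddr; /* chain in the port+addr table */
};

struct tables {
	struct listener *port_head;     /* one port-only bucket, for brevity */
	unsigned int port_count;        /* sockets chained in that bucket */
	struct listener *lhash2[LHASH2_SIZE];
};

/* Toy mixing function (hypothetical); the kernel uses jhash instead. */
static uint32_t toy_portaddr_hash(uint32_t addr, uint16_t port)
{
	return (addr * 2654435761u) ^ port;
}

static void add_listener(struct tables *t, struct listener *l)
{
	uint32_t slot = toy_portaddr_hash(l->addr, l->port) & (LHASH2_SIZE - 1);

	l->next_port = t->port_head;
	t->port_head = l;
	t->port_count++;
	l->next_portaddr = t->lhash2[slot];
	t->lhash2[slot] = l;
}

/* Walk one port+addr bucket looking for an exact addr:port match. */
static struct listener *walk_lhash2(struct tables *t, uint32_t addr,
				    uint16_t port)
{
	uint32_t slot = toy_portaddr_hash(addr, port) & (LHASH2_SIZE - 1);
	struct listener *l;

	for (l = t->lhash2[slot]; l; l = l->next_portaddr)
		if (l->addr == addr && l->port == port)
			return l;
	return NULL;
}

static struct listener *lookup(struct tables *t, uint32_t daddr,
			       uint16_t port)
{
	struct listener *l, *wild = NULL;

	/* Short chain: a single walk of the port-only bucket is enough. */
	if (t->port_count <= LHASH2_CUTOFF) {
		for (l = t->port_head; l; l = l->next_port) {
			if (l->port != port)
				continue;
			if (l->addr == daddr)
				return l;
			if (l->addr == ANY_ADDR)
				wild = l;
		}
		return wild;
	}

	/* Long chain: first IP:PORT, then fall back to ANY:PORT. */
	l = walk_lhash2(t, daddr, port);
	if (!l)
		l = walk_lhash2(t, ANY_ADDR, port);
	return l;
}
```

The cutoff keeps the common case (few listeners per port) at one list
walk, and pays for up to two hashed lookups only once the port-only
chain is already long.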

diff --git a/include/net/inet_connection_sock.h 
b/include/net/inet_connection_sock.h
index 0358745ea059..f1b6919d18b2 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -101,6 +101,7 @@ struct inet_connection_sock {
const struct inet_connection_sock_af_ops *icsk_af_ops;
const struct tcp_ulp_ops  *icsk_ulp_ops;
void  *icsk_ulp_data;
+   struct hlist_node icsk_listen_portaddr_node;
unsigned int  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
__u8  icsk_ca_state:6,
  icsk_ca_setsockopt:1,
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 4cce516c41ac..ebce55d694e7 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -152,8 +152,19 @@ struct inet_hashinfo {
 */
struct inet_listen_hashbucket   listening_hash[INET_LHTABLE_SIZE]
cacheline_aligned_in_smp;
+   struct inet_listen_hashbucket   *lhash2;
+   unsigned intlhash2_mask;
 };
 
+#define inet_lhash2_for_each_icsk_rcu(__icsk, list) \
+   hlist_for_each_entry_rcu(__icsk, list, icsk_listen_portaddr_node)
+
+static inline struct inet_listen_hashbucket *
+inet_lhash2_bucket(struct inet_hashinfo *h, u32 hash)
+{
+   return &h->lhash2[hash & h->lhash2_mask];
+}
+
 static inline struct inet_ehash_bucket *inet_ehash_bucket(
struct inet_hashinfo *hashinfo,
unsigned int hash)
@@ -209,6 +220,10 @@ int __inet_inherit_port(const struct sock *sk, struct sock 
*child);
 void inet_put_port(struct sock *sk);
 
 void inet_hashinfo_init(struct inet_hashinfo *h);
+void inet_hashinfo2_init(struct inet_hashinfo *h, const char *name,
+unsigned long numentries, int scale,
+unsigned long low_limit,
+unsigned long high_limit);
 
 bool inet_ehash_insert(struct sock *sk, struct sock *osk);
 bool inet_ehash_nolisten(struct sock *sk, struct sock *osk);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index bf16aa852aa3..899a21c0f68d 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -168,6 +169,60 @@ int __inet_inherit_port(const struct sock *sk, struct sock 
*child)
 }
 EXPORT_SYMBOL_GPL(__inet_inherit_port);
 
+static struct inet_listen_hashbucket *
+inet_lhash2_bucket_sk(struct inet_hashinfo *h, struct sock *sk)
+{
+   u32 hash;
+
+#if IS_ENABLED(CONFIG_IPV6)
+   if (sk->sk_family == AF_INET6)
+   hash = ipv6_portaddr_hash(sock_net(sk),
+ &sk->sk_v6_rcv_saddr,
+ inet_sk(sk)->inet_num);
+   else
+#endif
+   hash = ipv4_portaddr_hash(sock_net(sk),
+ inet_sk(sk)->inet_rcv_saddr,
+ inet_sk(sk)->inet_num);
+   return inet_lhash2_bucket(h, hash);
+}
+
+static void inet_hash2(struct inet_hashinfo *h, struct sock *sk)
+{
+   struct inet_listen_hashbucket *ilb2;
+
+   if (!h->lhash2)
+   return;
+
+   ilb2 = inet_lhash2_bucket_sk(h, sk);
+
+   spin_lock(&ilb2->lock);
+   if (sk->sk_reuseport && sk->sk_family == AF_INET6)
+   hlist_add_tail_rcu(&inet_csk(sk)->icsk_listen_portaddr_node,
+  &ilb2->head);
+   else
+   hlist_add_head_rcu(&inet_csk(sk)->icsk_listen_portaddr_node,

Re: [E1000-devel] Questions about crashes and GRO

2017-11-20 Thread Sarah Newman
On 11/20/2017 02:56 PM, Alexander Duyck wrote:
> On Mon, Nov 20, 2017 at 2:38 PM, Sarah Newman  
> wrote:
>> On 11/20/2017 08:36 AM, Alexander Duyck wrote:
>>> Hi Sarah,
>>>
>>> I am adding the netdev mailing list as I am not certain this is an
>>> i350 specific issue. The traces themselves aren't anything I recognize
>>> as an existing issue. From what I can tell it looks like you are
>>> running Xen, so would I be correct in assuming you are bridging
>>> between VMs? If so are you using any sort of tunnels on your network,
>>> if so what type? This information would be useful as we may be looking
>>> at a bug in a tunnel offload for GRO.
>>
>> Yes, there's bridging. The traffic on the physical device is tagged with 
>> vlans and the bridges use untagged traffic. There are no tunnels. I do not
>> own the VMs traffic.
>>
>> Because I have only seen this on a single server with unique hardware, I 
>> think it's most likely related to the hardware or to a particular VM on that
>> server.
> 
> So I would suspect traffic coming from the VM if anything. The i350 is
> a pretty common device. If we were seeing issues specific to it I would
> expect we would have more reports than just the one so far.

My confusion was primarily related to the release notes for an older version of 
a different intel driver.

But regarding traffic coming from a VM, the backtraces both include igb_poll. 
Doesn't that mean the problem is related to inbound traffic on the igb
device and not traffic direct from a local VM?

--Sarah


[PATCH net v2 01/10] bpf: offload: add comment warning developers about double destroy

2017-11-20 Thread Jakub Kicinski
Offload state may get destroyed either because the device for which
it was constructed is going away, or because the refcount of bpf
program itself has reached 0.  In both of those cases we will call
__bpf_prog_offload_destroy() to unlink the offload from the device.
We may in fact call it twice, which works just fine, but we should
make clear this is intended and caution others trying to extend the
function.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 kernel/bpf/offload.c | 4 
 1 file changed, 4 insertions(+)
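
Note for readers (not part of the patch): one common way to make a
teardown function safe to call twice is to guard the real work with a
state flag.  The sketch below illustrates that pattern only; struct
offload_state, its fields and offload_destroy() are hypothetical names
and this is not the body of __bpf_prog_offload_destroy().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-program offload state. */
struct offload_state {
	bool dev_state;     /* true while the device holds resources for us */
	int destroy_calls;  /* counts how often the real teardown ran */
};

/* Safe to call from both the "device going away" and "refcount hit 0" paths. */
static void offload_destroy(struct offload_state *o)
{
	if (!o->dev_state)
		return;         /* second call: nothing left to unlink */

	o->dev_state = false;
	o->destroy_calls++;     /* stands in for the driver callback */
}
```

Anyone extending such a function needs to keep every added step
idempotent as well, which is the point of the comment the patch adds.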

diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 2816feb38be1..fd696d3dd429 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -85,6 +85,10 @@ static void __bpf_prog_offload_destroy(struct bpf_prog *prog)
struct bpf_dev_offload *offload = prog->aux->offload;
struct netdev_bpf data = {};
 
+   /* Caution - if netdev is destroyed before the program, this function
+* will be called twice.
+*/
+
data.offload.prog = prog;
 
if (offload->verifier_running)
-- 
2.14.1



[PATCH net v2 06/10] bpf: turn bpf_prog_get_type() into a wrapper

2017-11-20 Thread Jakub Kicinski
bpf_prog_get_type() is identical to bpf_prog_get_type_dev(),
with false passed as attach_drv.  Instead of keeping it as
an exported symbol turn it into static inline wrapper.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 include/linux/bpf.h  | 13 ++---
 kernel/bpf/syscall.c | 10 --
 2 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f82be640731e..37bbab8c0f56 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -334,7 +334,6 @@ extern const struct bpf_verifier_ops 
tc_cls_act_analyzer_ops;
 extern const struct bpf_verifier_ops xdp_analyzer_ops;
 
 struct bpf_prog *bpf_prog_get(u32 ufd);
-struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type);
 struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
   bool attach_drv);
 struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog, int i);
@@ -425,12 +424,6 @@ static inline struct bpf_prog *bpf_prog_get(u32 ufd)
return ERR_PTR(-EOPNOTSUPP);
 }
 
-static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
-enum bpf_prog_type type)
-{
-   return ERR_PTR(-EOPNOTSUPP);
-}
-
 static inline struct bpf_prog *bpf_prog_get_type_dev(u32 ufd,
 enum bpf_prog_type type,
 bool attach_drv)
@@ -514,6 +507,12 @@ static inline int cpu_map_enqueue(struct bpf_cpu_map_entry 
*rcpu,
 }
 #endif /* CONFIG_BPF_SYSCALL */
 
+static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
+enum bpf_prog_type type)
+{
+   return bpf_prog_get_type_dev(ufd, type, false);
+}
+
 int bpf_prog_offload_compile(struct bpf_prog *prog);
 void bpf_prog_offload_destroy(struct bpf_prog *prog);
 u32 bpf_prog_offload_ifindex(struct bpf_prog *prog);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 38da55905ab0..41509cf825d8 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1097,16 +1097,6 @@ struct bpf_prog *bpf_prog_get(u32 ufd)
return __bpf_prog_get(ufd, NULL, false);
 }
 
-struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type)
-{
-   struct bpf_prog *prog = __bpf_prog_get(ufd, &type, false);
-
-   if (!IS_ERR(prog))
-   trace_bpf_prog_get_type(prog);
-   return prog;
-}
-EXPORT_SYMBOL_GPL(bpf_prog_get_type);
-
 struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
   bool attach_drv)
 {
-- 
2.14.1



[PATCH net v2 03/10] bpf: offload: rename the ifindex field

2017-11-20 Thread Jakub Kicinski
prog_target_ifindex seems long and clunky; rename it to prog_ifindex.
We don't want to call this field just ifindex, because maps
may need a similar field in the future and bpf_attr members for
programs and maps are unnamed.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 include/uapi/linux/bpf.h   | 2 +-
 kernel/bpf/offload.c   | 2 +-
 kernel/bpf/syscall.c   | 4 ++--
 tools/include/uapi/linux/bpf.h | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e880ae6434ee..3f626df42516 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -262,7 +262,7 @@ union bpf_attr {
__u32   kern_version;   /* checked when 
prog_type=kprobe */
__u32   prog_flags;
charprog_name[BPF_OBJ_NAME_LEN];
-   __u32   prog_target_ifindex;/* ifindex of netdev to 
prep for */
+   __u32   prog_ifindex;   /* ifindex of netdev to prep 
for */
};
 
struct { /* anonymous struct used by BPF_OBJ_* commands */
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index ac187f9ee182..a778e5df7e26 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -29,7 +29,7 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union 
bpf_attr *attr)
init_waitqueue_head(&offload->verifier_done);
 
rtnl_lock();
-   offload->netdev = __dev_get_by_index(net, attr->prog_target_ifindex);
+   offload->netdev = __dev_get_by_index(net, attr->prog_ifindex);
if (!offload->netdev) {
rtnl_unlock();
kfree(offload);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 09badc37e864..8e9d065bb7cd 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1118,7 +1118,7 @@ struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum 
bpf_prog_type type,
 EXPORT_SYMBOL_GPL(bpf_prog_get_type_dev);
 
 /* last field in 'union bpf_attr' used by this command */
-#defineBPF_PROG_LOAD_LAST_FIELD prog_target_ifindex
+#defineBPF_PROG_LOAD_LAST_FIELD prog_ifindex
 
 static int bpf_prog_load(union bpf_attr *attr)
 {
@@ -1181,7 +1181,7 @@ static int bpf_prog_load(union bpf_attr *attr)
atomic_set(&prog->aux->refcnt, 1);
prog->gpl_compatible = is_gpl ? 1 : 0;
 
-   if (attr->prog_target_ifindex) {
+   if (attr->prog_ifindex) {
err = bpf_prog_offload_init(prog, attr);
if (err)
goto free_prog;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e880ae6434ee..3f626df42516 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -262,7 +262,7 @@ union bpf_attr {
__u32   kern_version;   /* checked when 
prog_type=kprobe */
__u32   prog_flags;
charprog_name[BPF_OBJ_NAME_LEN];
-   __u32   prog_target_ifindex;/* ifindex of netdev to 
prep for */
+   __u32   prog_ifindex;   /* ifindex of netdev to prep 
for */
};
 
struct { /* anonymous struct used by BPF_OBJ_* commands */
-- 
2.14.1



[PATCH net v2 00/10] bpf: offload: check netdev pointer in the drivers and namespace trouble

2017-11-20 Thread Jakub Kicinski
Hi!

This series addresses some late comments and moves checking if program
has been loaded for the correct device to the drivers.  There are also
some problems with net namespaces which I didn't take into consideration.
On the kernel side we will now simply ignore namespace moves.  Since the
user space API is not reporting any namespace identification we have to
remove the ifindex until a correct way of reporting is agreed upon.

v2:
 - fix ext ack reporting for XDP (David A);
 - add Jiri's Ack.

Jakub Kicinski (10):
  bpf: offload: add comment warning developers about double destroy
  bpf: offload: limit offload to cls_bpf and xdp programs only
  bpf: offload: rename the ifindex field
  bpf: offload: move offload device validation out to the drivers
  net: xdp: don't allow device-bound programs in driver mode
  bpf: turn bpf_prog_get_type() into a wrapper
  bpf: offload: ignore namespace moves
  bpftool: revert printing program device bound info
  bpf: revert report offload info to user space
  bpf: make bpf_prog_offload_verifier_prep() static inline

 drivers/net/ethernet/netronome/nfp/bpf/offload.c | 10 --
 include/linux/bpf.h  | 18 +--
 include/linux/bpf_verifier.h |  2 +-
 include/uapi/linux/bpf.h |  8 +
 kernel/bpf/offload.c | 27 +++-
 kernel/bpf/syscall.c | 40 
 net/core/dev.c   | 14 ++---
 net/sched/cls_bpf.c  |  8 ++---
 tools/bpf/bpftool/prog.c | 31 --
 tools/include/uapi/linux/bpf.h   |  8 +
 10 files changed, 56 insertions(+), 110 deletions(-)

-- 
2.14.1



[PATCH net v2 07/10] bpf: offload: ignore namespace moves

2017-11-20 Thread Jakub Kicinski
We are currently destroying the device offload state when device
moves to another net namespace.  This doesn't break with current
NFP code, because offload state is not used on program removal,
but it's not correct behaviour.

Ignore the device unregister notifications on namespace move.

Signed-off-by: Jakub Kicinski 
---
 kernel/bpf/offload.c | 4 
 1 file changed, 4 insertions(+)
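
Note for readers (not part of the patch): the filter added here can be
modelled as a small predicate.  This is a sketch under assumptions: the
enums and struct toy_netdev are hypothetical stand-ins for NETDEV_*
events, NETREG_* states and struct net_device.  It relies on the
behaviour the commit message describes, namely that a namespace move
delivers an unregister notification while the device is not actually
in the unregistering state.

```c
#include <stdbool.h>

enum toy_reg_state { REG_REGISTERED, REG_UNREGISTERING }; /* models NETREG_* */
enum toy_event { EV_UNREGISTER, EV_OTHER };               /* models NETDEV_* */

struct toy_netdev {
	enum toy_reg_state reg_state;
};

/* Returns true only when offload state for the device should be torn down. */
static bool should_destroy_offloads(enum toy_event ev,
				    const struct toy_netdev *dev)
{
	if (ev != EV_UNREGISTER)
		return false;

	/* A namespace move also raises an unregister event; ignore it. */
	return dev->reg_state == REG_UNREGISTERING;
}
```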

diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index a778e5df7e26..d4267c674fec 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -174,6 +174,10 @@ static int bpf_offload_notification(struct notifier_block 
*notifier,
 
switch (event) {
case NETDEV_UNREGISTER:
+   /* ignore namespace changes */
+   if (netdev->reg_state != NETREG_UNREGISTERING)
+   break;
+
list_for_each_entry_safe(offload, tmp, &bpf_prog_offload_devs,
 offloads) {
if (offload->netdev == netdev)
-- 
2.14.1



[PATCH net v2 10/10] bpf: make bpf_prog_offload_verifier_prep() static inline

2017-11-20 Thread Jakub Kicinski
Header implementation of bpf_prog_offload_verifier_prep() which
is used if CONFIG_NET=n should be a static inline.

Signed-off-by: Jakub Kicinski 
---
 include/linux/bpf_verifier.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 07b96aaca256..b61482d354a2 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -171,7 +171,7 @@ static inline struct bpf_reg_state *cur_regs(struct 
bpf_verifier_env *env)
 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
 int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env);
 #else
-int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env)
+static inline int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env)
 {
return -EOPNOTSUPP;
 }
-- 
2.14.1



[PATCH net v2 05/10] net: xdp: don't allow device-bound programs in driver mode

2017-11-20 Thread Jakub Kicinski
Currently device-bound programs are not able to run on the host
to save resources (host JIT is not invoked).  Don't allow XDP
programs to be attached without the HW_MODE flag.  In theory
if program is already translated for device offload the driver
should choose to offload it instead of loading it in the driver.
However, offloading translated program may still fail resulting
in device-bound program being run on the host.

Prevent this by refusing to attach device bound programs if
XDP_FLAGS_HW_MODE is not set.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 net/core/dev.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 09525a27319c..5e2ba133fba7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7143,6 +7143,13 @@ int dev_change_xdp_fd(struct net_device *dev, struct 
netlink_ext_ack *extack,
 bpf_op == ops->ndo_bpf);
if (IS_ERR(prog))
return PTR_ERR(prog);
+
+   if (!(flags & XDP_FLAGS_HW_MODE) &&
+   bpf_prog_is_dev_bound(prog->aux)) {
+   NL_SET_ERR_MSG(extack, "using device-bound program 
without HW_MODE flag is not supported");
+   bpf_prog_put(prog);
+   return -EINVAL;
+   }
}
 
err = dev_xdp_install(dev, bpf_op, extack, flags, prog);
-- 
2.14.1



[PATCH net v2 04/10] bpf: offload: move offload device validation out to the drivers

2017-11-20 Thread Jakub Kicinski
With TC shared block changes we can't depend on correct netdev
pointer being available in cls_bpf.  Move the device validation
to the driver.  Core will only make sure that offloaded programs
are always attached in the driver (or in HW by the driver).  We
trust that drivers which implement offload callbacks will perform
necessary checks.

Moving the checks to the driver is generally useful: in practice
the check should be against a switchdev instance, not a netdev,
given that most ASICs will probably allow using the same program
on many ports.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/netronome/nfp/bpf/offload.c | 10 --
 include/linux/bpf.h  |  4 ++--
 kernel/bpf/syscall.c | 23 ---
 net/core/dev.c   |  7 ++-
 net/sched/cls_bpf.c  |  8 +++-
 5 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c 
b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index b6cee71f49d3..bc879aeb62d4 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -214,8 +214,14 @@ int nfp_net_bpf_offload(struct nfp_net *nn, struct 
bpf_prog *prog,
 {
int err;
 
-   if (prog && !prog->aux->offload)
-   return -EINVAL;
+   if (prog) {
+   struct bpf_dev_offload *offload = prog->aux->offload;
+
+   if (!offload)
+   return -EINVAL;
+   if (offload->netdev != nn->dp.netdev)
+   return -EINVAL;
+   }
 
if (prog && old_prog) {
u8 cap;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c397934f91dd..f82be640731e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -336,7 +336,7 @@ extern const struct bpf_verifier_ops xdp_analyzer_ops;
 struct bpf_prog *bpf_prog_get(u32 ufd);
 struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type);
 struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
-  struct net_device *netdev);
+  bool attach_drv);
 struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog, int i);
 void bpf_prog_sub(struct bpf_prog *prog, int i);
 struct bpf_prog * __must_check bpf_prog_inc(struct bpf_prog *prog);
@@ -433,7 +433,7 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
 
 static inline struct bpf_prog *bpf_prog_get_type_dev(u32 ufd,
 enum bpf_prog_type type,
-struct net_device *netdev)
+bool attach_drv)
 {
return ERR_PTR(-EOPNOTSUPP);
 }
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8e9d065bb7cd..38da55905ab0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1057,22 +1057,23 @@ struct bpf_prog *bpf_prog_inc_not_zero(struct bpf_prog 
*prog)
 }
 EXPORT_SYMBOL_GPL(bpf_prog_inc_not_zero);
 
-static bool bpf_prog_can_attach(struct bpf_prog *prog,
-   enum bpf_prog_type *attach_type,
-   struct net_device *netdev)
+static bool bpf_prog_get_ok(struct bpf_prog *prog,
+   enum bpf_prog_type *attach_type, bool attach_drv)
 {
-   struct bpf_dev_offload *offload = prog->aux->offload;
+   /* not an attachment, just a refcount inc, always allow */
+   if (!attach_type)
+   return true;
 
if (prog->type != *attach_type)
return false;
-   if (offload && offload->netdev != netdev)
+   if (bpf_prog_is_dev_bound(prog->aux) && !attach_drv)
return false;
 
return true;
 }
 
 static struct bpf_prog *__bpf_prog_get(u32 ufd, enum bpf_prog_type 
*attach_type,
-  struct net_device *netdev)
+  bool attach_drv)
 {
struct fd f = fdget(ufd);
struct bpf_prog *prog;
@@ -1080,7 +1081,7 @@ static struct bpf_prog *__bpf_prog_get(u32 ufd, enum 
bpf_prog_type *attach_type,
prog = bpf_prog_get(f);
if (IS_ERR(prog))
return prog;
-   if (attach_type && !bpf_prog_can_attach(prog, attach_type, netdev)) {
+   if (!bpf_prog_get_ok(prog, attach_type, attach_drv)) {
prog = ERR_PTR(-EINVAL);
goto out;
}
@@ -1093,12 +1094,12 @@ static struct bpf_prog *__bpf_prog_get(u32 ufd, enum 
bpf_prog_type *attach_type,
 
 struct bpf_prog *bpf_prog_get(u32 ufd)
 {
-   return 

[PATCH net v2 08/10] bpftool: revert printing program device bound info

2017-11-20 Thread Jakub Kicinski
This reverts commit 928631e05495 ("bpftool: print program device bound
info").  We will remove this API and redo it right in -next.

Signed-off-by: Jakub Kicinski 
---
 tools/bpf/bpftool/prog.c   | 31 ---
 tools/include/uapi/linux/bpf.h |  6 --
 2 files changed, 37 deletions(-)

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index f45c44ef9bec..ad619b96c276 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -41,7 +41,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -230,21 +229,6 @@ static void print_prog_json(struct bpf_prog_info *info, 
int fd)
 info->tag[0], info->tag[1], info->tag[2], info->tag[3],
 info->tag[4], info->tag[5], info->tag[6], info->tag[7]);
 
-   if (info->status & BPF_PROG_STATUS_DEV_BOUND) {
-   jsonw_name(json_wtr, "dev");
-   if (info->ifindex) {
-   char name[IF_NAMESIZE];
-
-   if (!if_indextoname(info->ifindex, name))
-   jsonw_printf(json_wtr, "\"ifindex:%d\"",
-info->ifindex);
-   else
-   jsonw_printf(json_wtr, "\"%s\"", name);
-   } else {
-   jsonw_printf(json_wtr, "\"unknown\"");
-   }
-   }
-
if (info->load_time) {
char buf[32];
 
@@ -302,21 +286,6 @@ static void print_prog_plain(struct bpf_prog_info *info, 
int fd)
 
printf("tag ");
fprint_hex(stdout, info->tag, BPF_TAG_SIZE, "");
-   printf(" ");
-
-   if (info->status & BPF_PROG_STATUS_DEV_BOUND) {
-   printf("dev ");
-   if (info->ifindex) {
-   char name[IF_NAMESIZE];
-
-   if (!if_indextoname(info->ifindex, name))
-   printf("ifindex:%d ", info->ifindex);
-   else
-   printf("%s ", name);
-   } else {
-   printf("unknown ");
-   }
-   }
printf("\n");
 
if (info->load_time) {
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3f626df42516..4c223ab30293 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -897,10 +897,6 @@ enum sk_action {
 
 #define BPF_TAG_SIZE   8
 
-enum bpf_prog_status {
-   BPF_PROG_STATUS_DEV_BOUND   = (1 << 0),
-};
-
 struct bpf_prog_info {
__u32 type;
__u32 id;
@@ -914,8 +910,6 @@ struct bpf_prog_info {
__u32 nr_map_ids;
__aligned_u64 map_ids;
char name[BPF_OBJ_NAME_LEN];
-   __u32 ifindex;
-   __u32 status;
 } __attribute__((aligned(8)));
 
 struct bpf_map_info {
-- 
2.14.1



[PATCH net v2 09/10] bpf: revert report offload info to user space

2017-11-20 Thread Jakub Kicinski
This reverts commit bd601b6ada11 ("bpf: report offload info to user
space").  The ifindex by itself is not sufficient, we should provide
information on which network namespace this ifindex belongs to.
After considering some options we concluded that it's best to just
remove this API for now, and rework it in -next.

Signed-off-by: Jakub Kicinski 
---
 include/linux/bpf.h  |  1 -
 include/uapi/linux/bpf.h |  6 --
 kernel/bpf/offload.c | 12 
 kernel/bpf/syscall.c |  5 -
 4 files changed, 24 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 37bbab8c0f56..76c577281d78 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -515,7 +515,6 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
 
 int bpf_prog_offload_compile(struct bpf_prog *prog);
 void bpf_prog_offload_destroy(struct bpf_prog *prog);
-u32 bpf_prog_offload_ifindex(struct bpf_prog *prog);
 
 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3f626df42516..4c223ab30293 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -897,10 +897,6 @@ enum sk_action {
 
 #define BPF_TAG_SIZE   8
 
-enum bpf_prog_status {
-   BPF_PROG_STATUS_DEV_BOUND   = (1 << 0),
-};
-
 struct bpf_prog_info {
__u32 type;
__u32 id;
@@ -914,8 +910,6 @@ struct bpf_prog_info {
__u32 nr_map_ids;
__aligned_u64 map_ids;
char name[BPF_OBJ_NAME_LEN];
-   __u32 ifindex;
-   __u32 status;
 } __attribute__((aligned(8)));
 
 struct bpf_map_info {
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index d4267c674fec..68ec884440b7 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -149,18 +149,6 @@ int bpf_prog_offload_compile(struct bpf_prog *prog)
return bpf_prog_offload_translate(prog);
 }
 
-u32 bpf_prog_offload_ifindex(struct bpf_prog *prog)
-{
-   struct bpf_dev_offload *offload = prog->aux->offload;
-   u32 ifindex;
-
-   rtnl_lock();
-   ifindex = offload->netdev ? offload->netdev->ifindex : 0;
-   rtnl_unlock();
-
-   return ifindex;
-}
-
 const struct bpf_prog_ops bpf_offload_prog_ops = {
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 41509cf825d8..2c4cfeaa8d5e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1616,11 +1616,6 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
return -EFAULT;
}
 
-   if (bpf_prog_is_dev_bound(prog->aux)) {
-   info.status |= BPF_PROG_STATUS_DEV_BOUND;
-   info.ifindex = bpf_prog_offload_ifindex(prog);
-   }
-
 done:
if (copy_to_user(uinfo, &info, info_len) ||
put_user(info_len, &uattr->info.info_len))
-- 
2.14.1



[PATCH net v2 02/10] bpf: offload: limit offload to cls_bpf and xdp programs only

2017-11-20 Thread Jakub Kicinski
We are currently only allowing attachment of device-bound
cls_bpf and XDP programs.  Make this restriction explicit in
the BPF offload code.  This way we can potentially reuse the
ifindex field in the future.

Since XDP and cls_bpf programs can only be loaded by admin,
we can drop the explicit capability check from offload code.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 kernel/bpf/offload.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index fd696d3dd429..ac187f9ee182 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -14,8 +14,9 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union 
bpf_attr *attr)
struct net *net = current->nsproxy->net_ns;
struct bpf_dev_offload *offload;
 
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
+   if (attr->prog_type != BPF_PROG_TYPE_SCHED_CLS &&
+   attr->prog_type != BPF_PROG_TYPE_XDP)
+   return -EINVAL;
 
if (attr->prog_flags)
return -EINVAL;
-- 
2.14.1



Re: [E1000-devel] Questions about crashes and GRO

2017-11-20 Thread Alexander Duyck
On Mon, Nov 20, 2017 at 2:38 PM, Sarah Newman  wrote:
> On 11/20/2017 08:36 AM, Alexander Duyck wrote:
>> Hi Sarah,
>>
>> I am adding the netdev mailing list as I am not certain this is an
>> i350 specific issue. The traces themselves aren't anything I recognize
>> as an existing issue. From what I can tell it looks like you are
>> running Xen, so would I be correct in assuming you are bridging
>> between VMs? If so are you using any sort of tunnels on your network,
>> if so what type? This information would be useful as we may be looking
>> at a bug in a tunnel offload for GRO.
>
> Yes, there's bridging. The traffic on the physical device is tagged with 
> vlans and the bridges use untagged traffic. There are no tunnels. I do not
> own the VMs traffic.
>
> Because I have only seen this on a single server with unique hardware, I 
> think it's most likely related to the hardware or to a particular VM on that
> server.

So I would suspect traffic coming from the VM if anything. The i350 is
a pretty common device. If we were seeing issues specific to it I
would expect we would have more reports than just the one so far.

>>
>> On Fri, Nov 17, 2017 at 3:28 PM, Sarah Newman  
>> wrote:
>>> Hi,
>>>
>>> I have an X10 supermicro with two I350's that has crashed twice now under 
>>> v4.9.39 within the last 3 weeks, with no crashes before v4.9.39:
>>
>> What was the last kernel you tested before v4.9.39? Just wondering as
>> it will help to rule out certain patches as possibly being the issue.
>
> 4.9.31.
>
> If the problem is related to a particular VM, then I don't think the last 
> known good kernel is necessarily pertinent, as the problematic traffic could
> have started at any time.
>
>>> I see in the release notes 
>>> https://downloadmirror.intel.com/22919/eng/README.txt " Do Not Use LRO When 
>>> Routing Packets."
>>>
>>> We are bridging traffic, not routing, and the crashes are in the GRO code.
>>>
>>> Is it possible there are problems with GRO for bridging in the igb driver 
>>> now? If I disable GRO can I have some confidence it will fix the issue?
>>
>> As far as LRO not being used when routing, just so you know LRO and
>> GRO are two very different things. One of the issues with LRO is that
>> it wasn't reversible in some cases and so could lead to the packet
>> being changed if they were rerouted. With GRO that shouldn't be the
>> case as we should be able to get back out the original packets that
>> were put into a frame. So there shouldn't be any issues using GRO with
>> bridging or routing.
>
> In some very old release notes for the ixgbe 
> https://downloadmirror.intel.com/22919/eng/README.txt it said to disable GRO 
> for bridging/routing, and it
> wasn't clear it was not specific to the driver. I didn't originally notice 
> how old the release notes were and that the notice was removed in newer
> versions, I apologize.
>
>>> First crash:
>>>
>>> [4083386.299221] [ cut here ]
>>> [4083386.299358] WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:1473 
>>> inet_gro_complete+0xbb/0xd0
>>> [4083386.299520] Modules linked in: sb_edac edac_core 8021q mrp garp 
>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev 
>>> ip6table_filter
>>> ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gnt
>>> alloc xenfs xen_privcmd xen_evtchn xen_blkback tun sch_htb fuse ext2 
>>> ebt_mark ebt_ip ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw
>>> br_netfilter bridge stp llc iTCO_wdt iTCO_vendor_support pcspkr raid456 
>>> async_raid6_recov async_pq
>>>  async_xor xor async_memcpy async_tx raid10 raid6_pq libcrc32c joydev 
>>> shpchp i2c_i801 i2c_smbus mei_me mei lpc_ich fjes ipmi_si ipmi_msghandler
>>> acpi_power_meter ioatdma igb dca raid1 mlx4_en mlx4_ib ib_core ptp pps_core 
>>> mlx4_core mpt3sas
>>>  scsi_transport_sas raid_class wmi ast ttm
>>> [4083386.300888] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.39 #1
>>> [4083386.301002] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 
>>> 2.0a 09/16/2016
>>> [4083386.301109]  880306603d90 813f5935  
>>> 
>>> [4083386.301221]  880306603dd0 810a7e01 05c18174578a 
>>> 8802f94a9a00
>>> [4083386.301333]  8802f0824450  0040 
>>> 0040
>>> [4083386.301445] Call Trace:
>>> [4083386.301483]   [4083386.301519]   dump_stack+0x63/0x8e
>>> [4083386.301596]   __warn+0xd1/0xf0
>>> [4083386.301665]   warn_slowpath_null+0x1d/0x20
>>> [4083386.301747]   inet_gro_complete+0xbb/0xd0
>>> [4083386.301830]   napi_gro_complete+0x73/0xa0
>>> [4083386.301911]   napi_gro_flush+0x5f/0x80
>>> [4083386.301988]   napi_complete_done+0x6a/0xb0
>>> [4083386.302075]   igb_poll+0x38d/0x720 [igb]
>>> [4083386.302156]   ? igb_msix_ring+0x2e/0x40 [igb]
>>> [4083386.302255]   ? __handle_irq_event_percpu+0x4b/0x1a0
>>> [4083386.302349]   net_rx_action+0x158/0x360
>>> [4083386.302430]   

Re: [E1000-devel] Questions about crashes and GRO

2017-11-20 Thread Sarah Newman
On 11/20/2017 08:36 AM, Alexander Duyck wrote:
> Hi Sarah,
> 
> I am adding the netdev mailing list as I am not certain this is an
> i350 specific issue. The traces themselves aren't anything I recognize
> as an existing issue. From what I can tell it looks like you are
> running Xen, so would I be correct in assuming you are bridging
> between VMs? If so are you using any sort of tunnels on your network,
> if so what type? This information would be useful as we may be looking
> at a bug in a tunnel offload for GRO.

Yes, there's bridging. The traffic on the physical device is tagged with vlans 
and the bridges use untagged traffic. There are no tunnels. I do not
own the VMs traffic.

Because I have only seen this on a single server with unique hardware, I think 
it's most likely related to the hardware or to a particular VM on that
server.

> 
> On Fri, Nov 17, 2017 at 3:28 PM, Sarah Newman  
> wrote:
>> Hi,
>>
>> I have an X10 supermicro with two I350's that has crashed twice now under 
>> v4.9.39 within the last 3 weeks, with no crashes before v4.9.39:
> 
> What was the last kernel you tested before v4.9.39? Just wondering as
> it will help to rule out certain patches as possibly being the issue.

4.9.31.

If the problem is related to a particular VM, then I don't think the last known 
good kernel is necessarily pertinent, as the problematic traffic could
have started at any time.

>> I see in the release notes 
>> https://downloadmirror.intel.com/22919/eng/README.txt " Do Not Use LRO When 
>> Routing Packets."
>>
>> We are bridging traffic, not routing, and the crashes are in the GRO code.
>>
>> Is it possible there are problems with GRO for bridging in the igb driver 
>> now? If I disable GRO can I have some confidence it will fix the issue?
> 
> As far as LRO not being used when routing, just so you know LRO and
> GRO are two very different things. One of the issues with LRO is that
> it wasn't reversible in some cases and so could lead to the packet
> being changed if they were rerouted. With GRO that shouldn't be the
> case as we should be able to get back out the original packets that
> were put into a frame. So there shouldn't be any issues using GRO with
> bridging or routing.

In some very old release notes for the ixgbe 
https://downloadmirror.intel.com/22919/eng/README.txt it said to disable GRO 
for bridging/routing, and it
wasn't clear it was not specific to the driver. I didn't originally notice how 
old the release notes were and that the notice was removed in newer
versions, I apologize.

>> First crash:
>>
>> [4083386.299221] [ cut here ]
>> [4083386.299358] WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:1473 
>> inet_gro_complete+0xbb/0xd0
>> [4083386.299520] Modules linked in: sb_edac edac_core 8021q mrp garp 
>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev 
>> ip6table_filter
>> ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gnt
>> alloc xenfs xen_privcmd xen_evtchn xen_blkback tun sch_htb fuse ext2 
>> ebt_mark ebt_ip ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw
>> br_netfilter bridge stp llc iTCO_wdt iTCO_vendor_support pcspkr raid456 
>> async_raid6_recov async_pq
>>  async_xor xor async_memcpy async_tx raid10 raid6_pq libcrc32c joydev shpchp 
>> i2c_i801 i2c_smbus mei_me mei lpc_ich fjes ipmi_si ipmi_msghandler
>> acpi_power_meter ioatdma igb dca raid1 mlx4_en mlx4_ib ib_core ptp pps_core 
>> mlx4_core mpt3sas
>>  scsi_transport_sas raid_class wmi ast ttm
>> [4083386.300888] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.39 #1
>> [4083386.301002] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 
>> 2.0a 09/16/2016
>> [4083386.301109]  880306603d90 813f5935  
>> 
>> [4083386.301221]  880306603dd0 810a7e01 05c18174578a 
>> 8802f94a9a00
>> [4083386.301333]  8802f0824450  0040 
>> 0040
>> [4083386.301445] Call Trace:
>> [4083386.301483]   [4083386.301519]   dump_stack+0x63/0x8e
>> [4083386.301596]   __warn+0xd1/0xf0
>> [4083386.301665]   warn_slowpath_null+0x1d/0x20
>> [4083386.301747]   inet_gro_complete+0xbb/0xd0
>> [4083386.301830]   napi_gro_complete+0x73/0xa0
>> [4083386.301911]   napi_gro_flush+0x5f/0x80
>> [4083386.301988]   napi_complete_done+0x6a/0xb0
>> [4083386.302075]   igb_poll+0x38d/0x720 [igb]
>> [4083386.302156]   ? igb_msix_ring+0x2e/0x40 [igb]
>> [4083386.302255]   ? __handle_irq_event_percpu+0x4b/0x1a0
>> [4083386.302349]   net_rx_action+0x158/0x360
>> [4083386.302430]   __do_softirq+0xd1/0x283
>> [4083386.302507]   irq_exit+0xe9/0x100
>> [4083386.302580]   xen_evtchn_do_upcall+0x35/0x50
>> [4083386.302665]   xen_do_hypervisor_callback+0x1e/0x40
>> [4083386.302754]   [4083386.302787]   ? xen_hypercall_sched_op+0xa/0x20
>> [4083386.302876]   ? xen_hypercall_sched_op+0xa/0x20
>> [4083386.302965]   ? xen_safe_halt+0x10/0x20
>> [4083386.303043]   ? 

[PATCH ipsec] xfrm: add documentation for xfrm device offload api

2017-11-20 Thread Shannon Nelson
Add a writeup on how to use the XFRM device offload API, and
mention this new file in the index.

Signed-off-by: Shannon Nelson 
---
 Documentation/networking/00-INDEX|   2 +
 Documentation/networking/xfrm_device.txt | 132 +++
 2 files changed, 134 insertions(+)
 create mode 100644 Documentation/networking/xfrm_device.txt

diff --git a/Documentation/networking/00-INDEX 
b/Documentation/networking/00-INDEX
index 7a79b35..f5d642c 100644
--- a/Documentation/networking/00-INDEX
+++ b/Documentation/networking/00-INDEX
@@ -228,6 +228,8 @@ x25.txt
- general info on X.25 development.
 x25-iface.txt
- description of the X.25 Packet Layer to LAPB device interface.
+xfrm_device.txt
+   - description of XFRM offload API
 xfrm_proc.txt
- description of the statistics package for XFRM.
 xfrm_sync.txt
diff --git a/Documentation/networking/xfrm_device.txt 
b/Documentation/networking/xfrm_device.txt
new file mode 100644
index 000..2d9d588c
--- /dev/null
+++ b/Documentation/networking/xfrm_device.txt
@@ -0,0 +1,132 @@
+
+===============================================
+XFRM device - offloading the IPsec computations
+===============================================
+Shannon Nelson 
+
+
+Overview
+========
+
+IPsec is a useful feature for securing network traffic, but the
+computational cost is high: a 10Gbps link can easily be brought down
+to under 1Gbps, depending on the traffic and link configuration.
+Luckily, there are NICs that offer a hardware based IPsec offload which
+can radically increase throughput and decrease CPU utilization.  The XFRM
+Device interface allows NIC drivers to offer to the stack access to the
+hardware offload.
+
+Userland access to the offload is typically through a system such as
+libreswan or KAME/racoon, but the iproute2 'ip xfrm' command set can
+be handy when experimenting.  An example command might look something
+like this:
+
+  ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
+ reqid 0x07 replay-window 32 \
+ aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \
+ sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
+ offload dev eth4 dir in
+
+Yes, that's ugly, but that's what shell scripts and/or libreswan are for.
+
+
+
+Callbacks to implement
+======================
+
+/* from include/linux/netdevice.h */
+struct xfrmdev_ops {
+   int (*xdo_dev_state_add) (struct xfrm_state *x);
+   void(*xdo_dev_state_delete) (struct xfrm_state *x);
+   void(*xdo_dev_state_free) (struct xfrm_state *x);
+   bool(*xdo_dev_offload_ok) (struct sk_buff *skb,
+  struct xfrm_state *x);
+};
+
+The NIC driver offering ipsec offload will need to implement these
+callbacks to make the offload available to the network stack's
+XFRM subsystem.  Additionally, the feature bits NETIF_F_HW_ESP and
+NETIF_F_HW_ESP_TX_CSUM will signal the availability of the offload.
+
+
+
+Flow
+====
+
+At probe time and before the call to register_netdev(), the driver should
+set up local data structures and XFRM callbacks, and set the feature bits.
+The XFRM code's listener will finish the setup on NETDEV_REGISTER.
+
+   adapter->netdev->xfrmdev_ops = _xfrmdev_ops;
+   adapter->netdev->features |= NETIF_F_HW_ESP;
+   adapter->netdev->hw_enc_features |= NETIF_F_HW_ESP;
+
+When new SAs are set up with a request for "offload" feature, the
+driver's xdo_dev_state_add() will be given the new SA to be offloaded
+and an indication of whether it is for Rx or Tx.  The driver should
+   - verify the algorithm is supported for offloads
+   - store the SA information (key, salt, target-ip, protocol, etc)
+   - enable the HW offload of the SA
+
+The driver can also set an offload_handle in the SA, an opaque void pointer
+that can be used to convey context into the fast-path offload requests.
+
+   xs->xso.offload_handle = context;
+
+
+When the network stack is preparing an IPsec packet for an SA that has
+been setup for offload, it first calls into xdo_dev_offload_ok() with
+the skb and the intended offload state to ask the driver if the offload
+will be serviceable.  This can check the packet information to be sure the
+offload can be supported (e.g. IPv4 or IPv6, no IPv4 options, etc) and
+return true or false to signify its support.
+
+When ready to send, the driver needs to inspect the Tx packet for the
+offload information, including the opaque context, and set up the packet
+send accordingly.
+
+   xs = xfrm_input_state(skb);
+   context = xs->xso.offload_handle;
+   set up HW for send
+
+The stack has already inserted the appropriate IPsec headers in the
+packet data, the offload just needs to do the encryption and fix up the
+header values.
+
+
+When a packet is received and the HW has indicated 

Re: [PATCH net 05/10] net: xdp: don't allow device-bound programs in driver mode

2017-11-20 Thread Daniel Borkmann
Hi Jakub,

On 11/20/2017 11:02 PM, Jakub Kicinski wrote:
> On Mon, 20 Nov 2017 07:36:39 -0700, David Ahern wrote:
>> On 11/19/17 9:55 PM, Jakub Kicinski wrote:
>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>> index 09525a27319c..21de2d37a0ba 100644
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -7143,6 +7143,13 @@ int dev_change_xdp_fd(struct net_device *dev, struct 
>>> netlink_ext_ack *extack,
>>>  bpf_op == ops->ndo_bpf);
>>> if (IS_ERR(prog))
>>> return PTR_ERR(prog);
>>> +
>>> +   if (!(flags & XDP_FLAGS_HW_MODE) &&
>>> +   bpf_prog_is_dev_bound(prog->aux)) {
>>> +   NL_SET_ERR_MSG_MOD(extack, "using device-bound program 
>>> without HW_MODE flag not supported");  
>>
>> I don't see dev_change_xdp_fd called by device drivers, so that should
>> just be NL_SET_ERR_MSG. Also, "is not supported" sounds better to me
>> than just "not supported".
> 
> Thanks, I'll give others a couple more hours and respin!

The rest of the series looks good to me, please respin with the minor
changes requested.

Thanks,
Daniel


Re: [PATCH net 05/10] net: xdp: don't allow device-bound programs in driver mode

2017-11-20 Thread Jakub Kicinski
On Mon, 20 Nov 2017 07:36:39 -0700, David Ahern wrote:
> On 11/19/17 9:55 PM, Jakub Kicinski wrote:
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 09525a27319c..21de2d37a0ba 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -7143,6 +7143,13 @@ int dev_change_xdp_fd(struct net_device *dev, struct 
> > netlink_ext_ack *extack,
> >  bpf_op == ops->ndo_bpf);
> > if (IS_ERR(prog))
> > return PTR_ERR(prog);
> > +
> > +   if (!(flags & XDP_FLAGS_HW_MODE) &&
> > +   bpf_prog_is_dev_bound(prog->aux)) {
> > +   NL_SET_ERR_MSG_MOD(extack, "using device-bound program 
> > without HW_MODE flag not supported");  
> 
> I don't see dev_change_xdp_fd called by device drivers, so that should
> just be NL_SET_ERR_MSG. Also, "is not supported" sounds better to me
> than just "not supported".

Thanks, I'll give others a couple more hours and respin!


Re: [RFC PATCH 5/5] selinux: Add SCTP support

2017-11-20 Thread Paul Moore
On Tue, Nov 14, 2017 at 4:52 PM, Richard Haines
 wrote:
> On Mon, 2017-11-13 at 17:40 -0500, Paul Moore wrote:
>> On Mon, Nov 13, 2017 at 5:05 PM, Richard Haines
>>  wrote:
>> > On Mon, 2017-11-06 at 19:09 -0500, Paul Moore wrote:
>> > > On Tue, Oct 17, 2017 at 9:59 AM, Richard Haines
>> > >  wrote:
>> > > > The SELinux SCTP implementation is explained in:
>> > > > Documentation/security/SELinux-sctp.txt
>> > > >
>> > > > Signed-off-by: Richard Haines 
>> > > > ---
>> > > >  Documentation/security/SELinux-sctp.txt | 108 +
>> > > >  security/selinux/hooks.c| 268
>> > > > ++--
>> > > >  security/selinux/include/classmap.h |   3 +-
>> > > >  security/selinux/include/netlabel.h |   9 +-
>> > > >  security/selinux/include/objsec.h   |   5 +
>> > > >  security/selinux/netlabel.c |  52 ++-
>> > > >  6 files changed, 427 insertions(+), 18 deletions(-)
>> > > >  create mode 100644 Documentation/security/SELinux-sctp.txt
>>
>> ...
>>
>> > > > +Policy Statements
>> > > > +==
>> > > > +The following class and permissions to support SCTP are
>> > > > available
>> > > > within the
>> > > > +kernel:
>> > > > +class sctp_socket inherits socket { node_bind }
>> > > > +
>> > > > +whenever the following policy capability is enabled:
>> > > > +policycap extended_socket_class;
>> > > > +
>> > > > +The SELinux SCTP support adds the additional permissions that
>> > > > are
>> > > > explained
>> > > > +in the sections below:
>> > > > +association bindx connectx
>> > >
>> > > Is the distinction between bind and bindx significant?  The same
>> > > question applies to connect/connectx.  I think we can probably
>> > > just
>> > > reuse bind and connect in these cases.
>> >
>> > This has been discussed before with Marcelo and keeping
>> > bindx/connectx
>> > is a useful distinction.
>>
>> My apologies, I must have forgotten/missed that discussion.  Do you
>> have an archive pointer?
>
> No this was off list, however I've copied the relevant bits:
>
>> SCTP Socket Option Permissions
>> ===
>> Permissions that are validated on setsockopt(2) calls (note that the
>> sctp_socket SETOPT permission must be allowed):
>>
>> This option requires the BINDX_ADDR permission:
>> SCTP_SOCKOPT_BINDX_REM - Remove additional bind address.
>
> Can't see an usage for this one.
>
>>
>> These options require the SET_PARAMS permission:
>> SCTP_PEER_ADDR_PARAMS  - Set heartbeats and address max
>> retransmissions.
>> SCTP_PEER_ADDR_THLDS  - Set thresholds.
>> SCTP_ASSOCINFO- Set association / endpoint parameters.
>
> Also for these, considering we are not willing to go as deep as to only
> allow these if within a given threshold. But still even then, sounds
> like too much.
>
>>
>>
>> SCTP Bind, Connect and ASCONF Chunk Parameter Permission Checks
>> ==
>> The hook security_sctp_addr_list() is called by SCTP when processing
>> various options (@optname) to check permissions required for the list
>> of ipv4/ipv6 addresses (@address) as follows:
>> 
>> |sctp_socket BIND type permission checks  |
>> |(The socket must also have the BIND permission)  |
>> |  @optname| Permission  |  @address  |
>> |--|-|-|
>> |SCTP_SOCKOPT_BINDX_ADD|BINDX_ADDRS  |One or more ipv4/ipv6 adr|
>
> This one can be useful, for that privilege-dropping case.
>
> Paul note: I later changed BINDX_ADDRS to just BINDX
>
>> |SCTP_PRIMARY_ADDR|SET_PRI_ADDR |Single ipv4 or ipv6 adr  |
>> |SCTP_SET_PEER_PRIMARY_ADDR|SET_PEER_ADDR|Single ipv4 or ipv6 adr  |
>
> But these, can't use an use-case.
>
>> 
>> 
>> |sctp_socket CONNECT type permission checks|
>> |(The socket must also have the CONNECT permission)|
>> |  @optname| Permission  |  @address  |
>> |--|-|-|
>> |SCTP_SOCKOPT_CONNECTX|CONNECTX|One or more ipv4/ipv6 adr|
>> |SCTP_PARAM_ADD_IP|BINDX_ADDRS  |One or more ipv4/ipv6 adr|
>
> The 2 above, can be useful.
>
>> |SCTP_PARAM_DEL_IP|BINDX_ADDRS  |One or more ipv4/ipv6 adr|
>> |SCTP_PARAM_SET_PRIMARY|SET_PRI_ADDR |Single ipv4 or ipv6 adr  |
>
> But not these two..
>
>> 
>>
>> SCTP_SOCKOPT_BINDX_ADD - Allows additional bind addresses to be
>> associated after (optionally) calling
>> bind(3).
>> 

Re: [PATCH] net: sched: crash on blocks with goto chain action

2017-11-20 Thread Roman Kapl

On 11/20/2017 06:54 PM, Cong Wang wrote:
> On Sun, Nov 19, 2017 at 8:17 AM, Roman Kapl wrote:
>> tcf_block_put_ext has assumed that all filters (and thus their goto
>> actions) are destroyed in RCU callback and thus can not race with our
>> list iteration. However, that is not true during netns cleanup (see
>> tcf_exts_get_net comment).
>>
>> Prevent the use-after-free by holding the current list element we are
>> iterating over (foreach_safe is not enough).
>
> Hmm...
>
> Looks like we need to restore the trick we used previously, that is
> holding refcnt for all list entries before this list iteration.

Was there a reason to hold all list entries in that trick? I thought
that holding just the current element would be enough, but maybe not.




[PATCH] net: sched: crash on blocks with goto chain action

2017-11-20 Thread Roman Kapl
tcf_block_put_ext has assumed that all filters (and thus their goto
actions) are destroyed in RCU callback and thus can not race with our
list iteration. However, that is not true during netns cleanup (see
tcf_exts_get_net comment).

Prevent the use-after-free by holding the current list element we are
iterating over (foreach_safe is not enough).

To reproduce, run the following in a netns and then delete the ns:
ip link add dtest type dummy
tc qdisc add dev dtest ingress
tc filter add dev dtest chain 1 parent : handle 1 prio 1 flower action 
goto chain 2

Signed-off-by: Roman Kapl 
---
The mail was originally rejected by vger, this is a re-send to netdev@vger only
(with the same message ID). Sorry for any confusion.
---
 net/sched/cls_api.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 7d97f612c9b9..58fed2ea3379 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -344,16 +344,26 @@ static void tcf_block_put_final(struct work_struct *work)
 }
 
 /* XXX: Standalone actions are not allowed to jump to any chain, and bound
- * actions should be all removed after flushing. However, filters are now
- * destroyed in tc filter workqueue with RTNL lock, they can not race here.
+ * actions should be all removed after flushing.
  */
 void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
   struct tcf_block_ext_info *ei)
 {
-   struct tcf_chain *chain, *tmp;
+   struct tcf_chain *chain, *last = NULL;
 
-   list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
+   list_for_each_entry(chain, &block->chain_list, list) {
+   if (last)
+   tcf_chain_put(last);
+   /* Flushing a chain may release any other chain in this block,
+* so we have to hold on to the chain across flush to know
+* which one comes next.
+*/
+   tcf_chain_hold(chain);
tcf_chain_flush(chain);
+   last = chain;
+   }
+   if (last)
+   tcf_chain_put(last);
 
tcf_block_offload_unbind(block, q, ei);
 
-- 
2.15.0



Re: [PATCH] qed: fix unnecessary call to memset cocci warnings

2017-11-20 Thread Vasyl Gomonovych
It doesn't apply because of an identical patch, "qed: use kzalloc instead of
kmalloc and memset".


[PATCH v2] net: sched: fix crash when deleting secondary chains

2017-11-20 Thread Roman Kapl
If you flush (delete) a filter chain other than chain 0 (such as when
deleting the device), the kernel may run into a use-after-free. The
chain refcount must not be decremented unless we are sure we are done
with the chain.

To reproduce the bug, run:
ip link add dtest type dummy
tc qdisc add dev dtest ingress
tc filter add dev dtest chain 1  parent : flower
ip link del dtest

Introduced in: commit f93e1cdcf42c ("net/sched: fix filter flushing"),
but unless you have KAsan or luck, you won't notice it until
commit 0dadc117ac8b ("cls_flower: use tcf_exts_get_net() before call_rcu()")

Fixes: f93e1cdcf42c ("net/sched: fix filter flushing")
Acked-by: Jiri Pirko 
Signed-off-by: Roman Kapl 
---
v1 -> v2: Added Fixes and Acked-by tags

The mail was originally rejected by vger, this is a re-send to netdev@vger only
(with the same message ID). Sorry for any confusion.
---

 net/sched/cls_api.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index ab255b421781..7d97f612c9b9 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -205,13 +205,14 @@ static void tcf_chain_head_change(struct tcf_chain *chain,
 
 static void tcf_chain_flush(struct tcf_chain *chain)
 {
-   struct tcf_proto *tp;
+   struct tcf_proto *tp = rtnl_dereference(chain->filter_chain);
 
tcf_chain_head_change(chain, NULL);
-   while ((tp = rtnl_dereference(chain->filter_chain)) != NULL) {
+   while (tp) {
RCU_INIT_POINTER(chain->filter_chain, tp->next);
-   tcf_chain_put(chain);
tcf_proto_destroy(tp);
+   tp = rtnl_dereference(chain->filter_chain);
+   tcf_chain_put(chain);
}
 }
 
-- 
2.15.0



[PATCH] net: phy: cortina: add missing MODULE_DESCRIPTION/AUTHOR/LICENSE

2017-11-20 Thread Jesse Chan
This change resolves a new compile-time warning
when built as a loadable module:

WARNING: modpost: missing MODULE_LICENSE() in drivers/net/phy/cortina.o
see include/linux/module.h for more information

This adds the license as "GPL", which matches the header of the file.

MODULE_DESCRIPTION and MODULE_AUTHOR are also added.

Signed-off-by: Jesse Chan 
---
 drivers/net/phy/cortina.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/phy/cortina.c b/drivers/net/phy/cortina.c
index 72f4228a63bb..9442db221834 100644
--- a/drivers/net/phy/cortina.c
+++ b/drivers/net/phy/cortina.c
@@ -116,3 +116,7 @@ static struct mdio_device_id __maybe_unused cortina_tbl[] = 
{
 };
 
 MODULE_DEVICE_TABLE(mdio, cortina_tbl);
+
+MODULE_DESCRIPTION("Cortina EDC CDR 10G Ethernet PHY driver");
+MODULE_AUTHOR("NXP");
+MODULE_LICENSE("GPL");
-- 
2.14.1



[net 1/1] tipc: fix access of released memory

2017-11-20 Thread Jon Maloy
When the function tipc_group_filter_msg() finds that a member event
indicates that the member is leaving the group, it first deletes the
member instance, and then purges the message queue being handled
by the call. But the message queue is an aggregated field in the
just deleted item, leading the purge call to access freed memory.

We fix this by swapping the order of the two actions.

Signed-off-by: Jon Maloy 
---
 net/tipc/group.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/group.c b/net/tipc/group.c
index 7821085..12777ca 100644
--- a/net/tipc/group.c
+++ b/net/tipc/group.c
@@ -539,8 +539,8 @@ void tipc_group_filter_msg(struct tipc_group *grp, struct 
sk_buff_head *inputq,
tipc_group_proto_xmit(grp, m, GRP_ACK_MSG, xmitq);
 
if (leave) {
-   tipc_group_delete_member(grp, m);
__skb_queue_purge(defq);
+   tipc_group_delete_member(grp, m);
break;
}
if (!update)
-- 
2.1.4



Re: [PATCH v15 5/5] PCI: Remove PCI pool macro functions

2017-11-20 Thread Bjorn Helgaas
On Mon, Nov 20, 2017 at 08:32:47PM +0100, Romain Perier wrote:
> From: Romain Perier 
> 
> Now that all the drivers use dma pool API, we can remove the macro
> functions for PCI pool.
> 
> Signed-off-by: Romain Perier 
> Reviewed-by: Peter Senna Tschudin 

I already acked this once on Oct 24.  Please keep that ack and include
it in any future postings so I don't have to deal with this again.

Acked-by: Bjorn Helgaas 

> ---
>  include/linux/pci.h | 9 -
>  1 file changed, 9 deletions(-)
> 
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 96c94980d1ff..d03b4a20033d 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1324,15 +1324,6 @@ int pci_set_vga_state(struct pci_dev *pdev, bool 
> decode,
>  #include 
>  #include 
>  
> -#define  pci_pool dma_pool
> -#define pci_pool_create(name, pdev, size, align, allocation) \
> - dma_pool_create(name, &pdev->dev, size, align, allocation)
> -#define  pci_pool_destroy(pool) dma_pool_destroy(pool)
> -#define  pci_pool_alloc(pool, flags, handle) dma_pool_alloc(pool, flags, 
> handle)
> -#define  pci_pool_zalloc(pool, flags, handle) \
> - dma_pool_zalloc(pool, flags, handle)
> -#define  pci_pool_free(pool, vaddr, addr) dma_pool_free(pool, vaddr, 
> addr)
> -
>  struct msix_entry {
>   u32 vector; /* kernel uses to write allocated vector */
>   u16 entry;  /* driver uses to specify entry, OS writes */
> -- 
> 2.14.1
> 


[PATCH next] tools/hv: Fix IP reporting by KVP daemon with SRIOV

2017-11-20 Thread Haiyang Zhang
From: Haiyang Zhang 

On Hyper-V the VF NIC has the same MAC as the related synthetic NIC.
VF NIC can work under the synthetic NIC transparently, without its
own IP address. The existing KVP daemon only gets IP from the first
NIC matching a MAC address, and may not be able to find the IP in
this case.

This patch fixes the problem by searching the NIC matching the MAC,
and having an IP address. So, the IP address will be found and
reported to the host successfully.

Signed-off-by: Haiyang Zhang 
---
 tools/hv/hv_kvp_daemon.c | 138 ++-
 1 file changed, 65 insertions(+), 73 deletions(-)

diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
index eaa3bec273c8..16964a7811a9 100644
--- a/tools/hv/hv_kvp_daemon.c
+++ b/tools/hv/hv_kvp_daemon.c
@@ -676,64 +676,6 @@ static char *kvp_if_name_to_mac(char *if_name)
return mac_addr;
 }
 
-
-/*
- * Retrieve the interface name given tha MAC address.
- */
-
-static char *kvp_mac_to_if_name(char *mac)
-{
-   DIR *dir;
-   struct dirent *entry;
-   FILE*file;
-   char*p, *x;
-   char*if_name = NULL;
-   charbuf[256];
-   char dev_id[PATH_MAX];
-   unsigned int i;
-
-   dir = opendir(KVP_NET_DIR);
-   if (dir == NULL)
-   return NULL;
-
-   while ((entry = readdir(dir)) != NULL) {
-   /*
-* Set the state for the next pass.
-*/
-   snprintf(dev_id, sizeof(dev_id), "%s%s/address", KVP_NET_DIR,
-entry->d_name);
-
-   file = fopen(dev_id, "r");
-   if (file == NULL)
-   continue;
-
-   p = fgets(buf, sizeof(buf), file);
-   if (p) {
-   x = strchr(p, '\n');
-   if (x)
-   *x = '\0';
-
-   for (i = 0; i < strlen(p); i++)
-   p[i] = toupper(p[i]);
-
-   if (!strcmp(p, mac)) {
-   /*
-* Found the MAC match; return the interface
-* name. The caller will free the memory.
-*/
-   if_name = strdup(entry->d_name);
-   fclose(file);
-   break;
-   }
-   }
-   fclose(file);
-   }
-
-   closedir(dir);
-   return if_name;
-}
-
-
 static void kvp_process_ipconfig_file(char *cmd,
char *config_buf, unsigned int len,
int element_size, int offset)
@@ -1039,6 +981,70 @@ kvp_get_ip_info(int family, char *if_name, int op,
return error;
 }
 
+/*
+ * Retrieve the IP given the MAC address.
+ */
+static int kvp_mac_to_ip(struct hv_kvp_ipaddr_value *kvp_ip_val)
+{
+   char *mac = (char *)kvp_ip_val->adapter_id;
+   DIR *dir;
+   struct dirent *entry;
+   FILE*file;
+   char*p, *x;
+   char*if_name = NULL;
+   charbuf[256];
+   char dev_id[PATH_MAX];
+   unsigned int i;
+   int error = HV_E_FAIL;
+
+   dir = opendir(KVP_NET_DIR);
+   if (dir == NULL)
+   return HV_E_FAIL;
+
+   while ((entry = readdir(dir)) != NULL) {
+   /*
+* Set the state for the next pass.
+*/
+   snprintf(dev_id, sizeof(dev_id), "%s%s/address", KVP_NET_DIR,
+entry->d_name);
+
+   file = fopen(dev_id, "r");
+   if (file == NULL)
+   continue;
+
+   p = fgets(buf, sizeof(buf), file);
+   fclose(file);
+   if (!p)
+   continue;
+
+   x = strchr(p, '\n');
+   if (x)
+   *x = '\0';
+
+   for (i = 0; i < strlen(p); i++)
+   p[i] = toupper(p[i]);
+
+   if (strcmp(p, mac))
+   continue;
+
+   /*
+* Found the MAC match.
+* A NIC (e.g. VF) matching the MAC, but without IP, is skipped.
+*/
+   if_name = entry->d_name;
+   if (!if_name)
+   continue;
+
+   error = kvp_get_ip_info(0, if_name, KVP_OP_GET_IP_INFO,
+   kvp_ip_val, MAX_IP_ADDR_SIZE * 2);
+
+   if (!error && strlen((char *)kvp_ip_val->ip_addr))
+   break;
+   }
+
+   closedir(dir);
+   return error;
+}
 
 static int expand_ipv6(char *addr, int type)
 {
@@ -1514,26 +1520,12 @@ int main(int argc, char *argv[])
switch (op) {
case KVP_OP_GET_IP_INFO:
kvp_ip_val = 

[PATCH v15 2/5] net: e100: Replace PCI pool old API

2017-11-20 Thread Romain Perier
From: Romain Perier 

The PCI pool API is deprecated. This commit replaces the PCI pool old
API by the appropriate function with the DMA pool API.

Signed-off-by: Romain Perier 
Acked-by: Peter Senna Tschudin 
Acked-by: Jeff Kirsher 
Tested-by: Peter Senna Tschudin 
---
 drivers/net/ethernet/intel/e100.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/e100.c 
b/drivers/net/ethernet/intel/e100.c
index 44b3937f7e81..29486478836e 100644
--- a/drivers/net/ethernet/intel/e100.c
+++ b/drivers/net/ethernet/intel/e100.c
@@ -607,7 +607,7 @@ struct nic {
struct mem *mem;
dma_addr_t dma_addr;
 
-   struct pci_pool *cbs_pool;
+   struct dma_pool *cbs_pool;
dma_addr_t cbs_dma_addr;
u8 adaptive_ifs;
u8 tx_threshold;
@@ -1892,7 +1892,7 @@ static void e100_clean_cbs(struct nic *nic)
nic->cb_to_clean = nic->cb_to_clean->next;
nic->cbs_avail++;
}
-   pci_pool_free(nic->cbs_pool, nic->cbs, nic->cbs_dma_addr);
+   dma_pool_free(nic->cbs_pool, nic->cbs, nic->cbs_dma_addr);
nic->cbs = NULL;
nic->cbs_avail = 0;
}
@@ -1910,7 +1910,7 @@ static int e100_alloc_cbs(struct nic *nic)
nic->cb_to_use = nic->cb_to_send = nic->cb_to_clean = NULL;
nic->cbs_avail = 0;
 
-   nic->cbs = pci_pool_zalloc(nic->cbs_pool, GFP_KERNEL,
+   nic->cbs = dma_pool_zalloc(nic->cbs_pool, GFP_KERNEL,
   &nic->cbs_dma_addr);
if (!nic->cbs)
return -ENOMEM;
@@ -2960,8 +2960,8 @@ static int e100_probe(struct pci_dev *pdev, const struct 
pci_device_id *ent)
netif_err(nic, probe, nic->netdev, "Cannot register net device, 
aborting\n");
goto err_out_free;
}
-   nic->cbs_pool = pci_pool_create(netdev->name,
-  nic->pdev,
+   nic->cbs_pool = dma_pool_create(netdev->name,
+  &nic->pdev->dev,
   nic->params.cbs.max * sizeof(struct cb),
   sizeof(u32),
   0);
@@ -3001,7 +3001,7 @@ static void e100_remove(struct pci_dev *pdev)
unregister_netdev(netdev);
e100_free(nic);
pci_iounmap(pdev, nic->csr);
-   pci_pool_destroy(nic->cbs_pool);
+   dma_pool_destroy(nic->cbs_pool);
free_netdev(netdev);
pci_release_regions(pdev);
pci_disable_device(pdev);
-- 
2.14.1



[PATCH v15 1/5] block: DAC960: Replace PCI pool old API

2017-11-20 Thread Romain Perier
From: Romain Perier 

The PCI pool API is deprecated. This commit replaces the old PCI pool
API with the corresponding DMA pool API functions.

Signed-off-by: Romain Perier 
Acked-by: Peter Senna Tschudin 
Tested-by: Peter Senna Tschudin 
---
 drivers/block/DAC960.c | 38 ++
 drivers/block/DAC960.h |  4 ++--
 2 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/drivers/block/DAC960.c b/drivers/block/DAC960.c
index 255591ab3716..2a8950ee382c 100644
--- a/drivers/block/DAC960.c
+++ b/drivers/block/DAC960.c
@@ -268,17 +268,17 @@ static bool 
DAC960_CreateAuxiliaryStructures(DAC960_Controller_T *Controller)
   void *AllocationPointer = NULL;
   void *ScatterGatherCPU = NULL;
   dma_addr_t ScatterGatherDMA;
-  struct pci_pool *ScatterGatherPool;
+  struct dma_pool *ScatterGatherPool;
   void *RequestSenseCPU = NULL;
   dma_addr_t RequestSenseDMA;
-  struct pci_pool *RequestSensePool = NULL;
+  struct dma_pool *RequestSensePool = NULL;
 
   if (Controller->FirmwareType == DAC960_V1_Controller)
 {
   CommandAllocationLength = offsetof(DAC960_Command_T, V1.EndMarker);
   CommandAllocationGroupSize = DAC960_V1_CommandAllocationGroupSize;
-  ScatterGatherPool = pci_pool_create("DAC960_V1_ScatterGather",
-   Controller->PCIDevice,
+  ScatterGatherPool = dma_pool_create("DAC960_V1_ScatterGather",
+   &Controller->PCIDevice->dev,
DAC960_V1_ScatterGatherLimit * sizeof(DAC960_V1_ScatterGatherSegment_T),
sizeof(DAC960_V1_ScatterGatherSegment_T), 0);
   if (ScatterGatherPool == NULL)
@@ -290,18 +290,18 @@ static bool 
DAC960_CreateAuxiliaryStructures(DAC960_Controller_T *Controller)
 {
   CommandAllocationLength = offsetof(DAC960_Command_T, V2.EndMarker);
   CommandAllocationGroupSize = DAC960_V2_CommandAllocationGroupSize;
-  ScatterGatherPool = pci_pool_create("DAC960_V2_ScatterGather",
-   Controller->PCIDevice,
+  ScatterGatherPool = dma_pool_create("DAC960_V2_ScatterGather",
+   &Controller->PCIDevice->dev,
DAC960_V2_ScatterGatherLimit * sizeof(DAC960_V2_ScatterGatherSegment_T),
sizeof(DAC960_V2_ScatterGatherSegment_T), 0);
   if (ScatterGatherPool == NULL)
return DAC960_Failure(Controller,
"AUXILIARY STRUCTURE CREATION (SG)");
-  RequestSensePool = pci_pool_create("DAC960_V2_RequestSense",
-   Controller->PCIDevice, sizeof(DAC960_SCSI_RequestSense_T),
+  RequestSensePool = dma_pool_create("DAC960_V2_RequestSense",
+   &Controller->PCIDevice->dev, sizeof(DAC960_SCSI_RequestSense_T),
sizeof(int), 0);
   if (RequestSensePool == NULL) {
-   pci_pool_destroy(ScatterGatherPool);
+   dma_pool_destroy(ScatterGatherPool);
return DAC960_Failure(Controller,
"AUXILIARY STRUCTURE CREATION (SG)");
   }
@@ -335,16 +335,16 @@ static bool 
DAC960_CreateAuxiliaryStructures(DAC960_Controller_T *Controller)
   Command->Next = Controller->FreeCommands;
   Controller->FreeCommands = Command;
   Controller->Commands[CommandIdentifier-1] = Command;
-  ScatterGatherCPU = pci_pool_alloc(ScatterGatherPool, GFP_ATOMIC,
+  ScatterGatherCPU = dma_pool_alloc(ScatterGatherPool, GFP_ATOMIC,
&ScatterGatherDMA);
   if (ScatterGatherCPU == NULL)
  return DAC960_Failure(Controller, "AUXILIARY STRUCTURE CREATION");
 
   if (RequestSensePool != NULL) {
- RequestSenseCPU = pci_pool_alloc(RequestSensePool, GFP_ATOMIC,
+ RequestSenseCPU = dma_pool_alloc(RequestSensePool, GFP_ATOMIC,
&RequestSenseDMA);
  if (RequestSenseCPU == NULL) {
-pci_pool_free(ScatterGatherPool, ScatterGatherCPU,
+dma_pool_free(ScatterGatherPool, ScatterGatherCPU,
 ScatterGatherDMA);
return DAC960_Failure(Controller,
"AUXILIARY STRUCTURE CREATION");
@@ -379,8 +379,8 @@ static bool 
DAC960_CreateAuxiliaryStructures(DAC960_Controller_T *Controller)
 static void DAC960_DestroyAuxiliaryStructures(DAC960_Controller_T *Controller)
 {
   int i;
-  struct pci_pool *ScatterGatherPool = Controller->ScatterGatherPool;
-  struct pci_pool *RequestSensePool = NULL;
+  struct dma_pool *ScatterGatherPool = Controller->ScatterGatherPool;
+  struct dma_pool *RequestSensePool = NULL;
   void *ScatterGatherCPU;
   dma_addr_t ScatterGatherDMA;
   void *RequestSenseCPU;
@@ -411,9 +411,9 @@ static void 
DAC960_DestroyAuxiliaryStructures(DAC960_Controller_T *Controller)
  RequestSenseDMA = Command->V2.RequestSenseDMA;
   }
   if (ScatterGatherCPU != NULL)
-  pci_pool_free(ScatterGatherPool, ScatterGatherCPU, ScatterGatherDMA);
+  

[PATCH v15 4/5] scsi: mpt3sas: Replace PCI pool old API

2017-11-20 Thread Romain Perier
The PCI pool API is deprecated. This commit replaces the old PCI pool
API with the corresponding DMA pool API functions.

Signed-off-by: Romain Perier 
---
 drivers/scsi/mpt3sas/mpt3sas_base.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c 
b/drivers/scsi/mpt3sas/mpt3sas_base.c
index 8027de465d47..08237b8659ae 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -3790,12 +3790,12 @@ _base_release_memory_pools(struct MPT3SAS_ADAPTER *ioc)
if (ioc->pcie_sgl_dma_pool) {
for (i = 0; i < ioc->scsiio_depth; i++) {
if (ioc->scsi_lookup[i].pcie_sg_list.pcie_sgl)
-   pci_pool_free(ioc->pcie_sgl_dma_pool,
+   dma_pool_free(ioc->pcie_sgl_dma_pool,
ioc->scsi_lookup[i].pcie_sg_list.pcie_sgl,
ioc->scsi_lookup[i].pcie_sg_list.pcie_sgl_dma);
}
if (ioc->pcie_sgl_dma_pool)
-   pci_pool_destroy(ioc->pcie_sgl_dma_pool);
+   dma_pool_destroy(ioc->pcie_sgl_dma_pool);
}
 
if (ioc->config_page) {
@@ -4204,21 +4204,21 @@ _base_allocate_memory_pools(struct MPT3SAS_ADAPTER *ioc)
 
sz = nvme_blocks_needed * ioc->page_size;
ioc->pcie_sgl_dma_pool =
-   pci_pool_create("PCIe SGL pool", ioc->pdev, sz, 16, 0);
+   dma_pool_create("PCIe SGL pool", &ioc->pdev->dev, sz, 16, 0);
if (!ioc->pcie_sgl_dma_pool) {
pr_info(MPT3SAS_FMT
-   "PCIe SGL pool: pci_pool_create failed\n",
+   "PCIe SGL pool: dma_pool_create failed\n",
ioc->name);
goto out;
}
for (i = 0; i < ioc->scsiio_depth; i++) {
ioc->scsi_lookup[i].pcie_sg_list.pcie_sgl =
-   pci_pool_alloc(ioc->pcie_sgl_dma_pool,
+   dma_pool_alloc(ioc->pcie_sgl_dma_pool,
GFP_KERNEL,
&ioc->scsi_lookup[i].pcie_sg_list.pcie_sgl_dma);
if (!ioc->scsi_lookup[i].pcie_sg_list.pcie_sgl) {
pr_info(MPT3SAS_FMT
-   "PCIe SGL pool: pci_pool_alloc failed\n",
+   "PCIe SGL pool: dma_pool_alloc failed\n",
ioc->name);
goto out;
}
-- 
2.14.1



[PATCH v15 0/5] Replace PCI pool by DMA pool API

2017-11-20 Thread Romain Perier
The current PCI pool API consists of simple macros that expand directly to
the corresponding DMA pool functions. The prototypes are almost the same
and the semantics are very similar. I propose to use the DMA pool
API directly and get rid of the old API.

This patch set replaces the old API with the DMA pool API
and removes the defines.
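
For illustration only, the mapping described above can be modelled in plain userspace C. This is a sketch, not kernel code: `struct device`, `struct pci_dev`, `struct dma_pool` and both `dma_pool_*` functions below are hypothetical mocks that only record their arguments, and `old_and_new_api_agree()` is an invented helper. The three defines are shaped like the ones the series removes from include/linux/pci.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-ins for kernel types (hypothetical, illustration only). */
struct device { const char *name; };
struct pci_dev { struct device dev; };
struct dma_pool { struct device *dev; size_t size; size_t align; };

/* Mock dma_pool_create(): records its arguments, allocates no DMA memory. */
static struct dma_pool *dma_pool_create(const char *name, struct device *dev,
					size_t size, size_t align,
					size_t boundary)
{
	struct dma_pool *pool;

	(void)name;
	(void)boundary;
	assert(dev != NULL);
	pool = malloc(sizeof(*pool));
	if (!pool)
		return NULL;
	pool->dev = dev;
	pool->size = size;
	pool->align = align;
	return pool;
}

/* Mock dma_pool_destroy(): accepts NULL, like free(). */
static void dma_pool_destroy(struct dma_pool *pool)
{
	free(pool);
}

/* Wrappers shaped like the defines removed from include/linux/pci.h. */
#define pci_pool dma_pool
#define pci_pool_create(name, pdev, size, align, allocation) \
	dma_pool_create(name, &(pdev)->dev, size, align, allocation)
#define pci_pool_destroy(pool) dma_pool_destroy(pool)

/* Returns 1 if the old and the new spelling build equivalent pools. */
static int old_and_new_api_agree(void)
{
	struct pci_dev pdev = { .dev = { .name = "mock" } };
	struct pci_pool *old_pool = pci_pool_create("old", &pdev, 64, 16, 0);
	struct dma_pool *new_pool = dma_pool_create("new", &pdev.dev, 64, 16, 0);
	int same = 0;

	if (old_pool && new_pool)
		same = old_pool->dev == new_pool->dev &&
		       old_pool->size == new_pool->size &&
		       old_pool->align == new_pool->align;
	pci_pool_destroy(old_pool);
	dma_pool_destroy(new_pool);
	return same;
}
```

The only difference a driver sees is the second argument: the old wrapper took the `struct pci_dev *`, while the DMA pool call takes the embedded `struct device *`, which is why each conversion in the series passes the `dev` member of the PCI device.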

Changes in v15:
- Rebased series onto next-20171120
- Added patch 04/05 for mpt3sas scsi driver

Changes in v14:
- Rebased series onto next-20171018
- Rebased patch 03/05 on latest driver

Changes in v13:
- Rebased series onto next-20170906
- Added a new commit for the hinic ethernet driver
- Remove previously merged patches

Changes in v12:
- Rebased series onto next-20170822

Changes in v11:
- Rebased series onto next-20170809
- Removed patches 08-14, these have been merged.

Changes in v10:
- Rebased series onto next-20170706
- I have fixed and improved patch "scsi: megaraid: Replace PCI pool old API"

Changes in v9:
- Rebased series onto next-20170522
- I have fixed and improved the patch for lpfc driver

Changes in v8:
- Rebased series onto next-20170428

Changes in v7:
- Rebased series onto next-20170416
- Added Acked-by, Tested-by and Reviewed-by tags

Changes in v6:
- Fixed an issue reported by kbuild test robot about changes in DAC960
- Removed patches 15/19,16/19,17/19,18/19. They have been merged by Greg
- Added Acked-by Tags

Changes in v5:
- Re-worded the cover letter (remove sentence about checkpatch.pl)
- Rebased series onto next-20170308
- Fix typos in commit message
- Added Acked-by Tags

Changes in v4:
- Rebased series onto next-20170301
- Removed patch 20/20: checks done by checkpatch.pl, no longer required.
  Thanks to Peter and Joe for their feedback.
- Added Reviewed-by tags

Changes in v3:
- Rebased series onto next-20170224
- Fix checkpatch.pl reports for patch 11/20 and patch 12/20
- Remove prefix RFC

Changes in v2:
- Introduced patch 18/20
- Fixed cosmetic changes: spaces before brace, lines over 80 characters
- Removed some of the check for NULL pointers before calling dma_pool_destroy
- Improved the regexp in checkpatch for pci_pool, thanks to Joe Perches
- Added Tested-by and Acked-by tags

Romain Perier (5):
  block: DAC960: Replace PCI pool old API
  net: e100: Replace PCI pool old API
  hinic: Replace PCI pool old API
  scsi: mpt3sas: Replace PCI pool old API
  PCI: Remove PCI pool macro functions

 drivers/block/DAC960.c| 38 +++
 drivers/block/DAC960.h|  4 +--
 drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c | 10 +++---
 drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.h |  2 +-
 drivers/net/ethernet/intel/e100.c | 12 +++
 drivers/scsi/mpt3sas/mpt3sas_base.c   | 12 +++
 include/linux/pci.h   |  9 --
 7 files changed, 38 insertions(+), 49 deletions(-)

-- 
2.14.1



[PATCH v15 3/5] hinic: Replace PCI pool old API

2017-11-20 Thread Romain Perier
From: Romain Perier 

The PCI pool API is deprecated. This commit replaces the old PCI pool
API with the corresponding DMA pool API functions.

Signed-off-by: Romain Perier 
---
 drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c | 10 +-
 drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.h |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c 
b/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c
index 7d95f0866fb0..28a81ac97af5 100644
--- a/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c
+++ b/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c
@@ -143,7 +143,7 @@ int hinic_alloc_cmdq_buf(struct hinic_cmdqs *cmdqs,
struct hinic_hwif *hwif = cmdqs->hwif;
struct pci_dev *pdev = hwif->pdev;
 
-   cmdq_buf->buf = pci_pool_alloc(cmdqs->cmdq_buf_pool, GFP_KERNEL,
+   cmdq_buf->buf = dma_pool_alloc(cmdqs->cmdq_buf_pool, GFP_KERNEL,
   &cmdq_buf->dma_addr);
if (!cmdq_buf->buf) {
dev_err(&pdev->dev, "Failed to allocate cmd from the pool\n");
@@ -161,7 +161,7 @@ int hinic_alloc_cmdq_buf(struct hinic_cmdqs *cmdqs,
 void hinic_free_cmdq_buf(struct hinic_cmdqs *cmdqs,
 struct hinic_cmdq_buf *cmdq_buf)
 {
-   pci_pool_free(cmdqs->cmdq_buf_pool, cmdq_buf->buf, cmdq_buf->dma_addr);
+   dma_pool_free(cmdqs->cmdq_buf_pool, cmdq_buf->buf, cmdq_buf->dma_addr);
 }
 
 static unsigned int cmdq_wqe_size_from_bdlen(enum bufdesc_len len)
@@ -875,7 +875,7 @@ int hinic_init_cmdqs(struct hinic_cmdqs *cmdqs, struct 
hinic_hwif *hwif,
int err;
 
cmdqs->hwif = hwif;
-   cmdqs->cmdq_buf_pool = pci_pool_create("hinic_cmdq", pdev,
+   cmdqs->cmdq_buf_pool = dma_pool_create("hinic_cmdq", &pdev->dev,
   HINIC_CMDQ_BUF_SIZE,
   HINIC_CMDQ_BUF_SIZE, 0);
if (!cmdqs->cmdq_buf_pool)
@@ -916,7 +916,7 @@ int hinic_init_cmdqs(struct hinic_cmdqs *cmdqs, struct 
hinic_hwif *hwif,
devm_kfree(&pdev->dev, cmdqs->saved_wqs);
 
 err_saved_wqs:
-   pci_pool_destroy(cmdqs->cmdq_buf_pool);
+   dma_pool_destroy(cmdqs->cmdq_buf_pool);
return err;
 }
 
@@ -942,5 +942,5 @@ void hinic_free_cmdqs(struct hinic_cmdqs *cmdqs)
 
devm_kfree(&pdev->dev, cmdqs->saved_wqs);
 
-   pci_pool_destroy(cmdqs->cmdq_buf_pool);
+   dma_pool_destroy(cmdqs->cmdq_buf_pool);
 }
diff --git a/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.h 
b/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.h
index b35583400cb6..23f8d39eab68 100644
--- a/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.h
+++ b/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.h
@@ -157,7 +157,7 @@ struct hinic_cmdq {
 struct hinic_cmdqs {
struct hinic_hwif   *hwif;
 
-   struct pci_pool *cmdq_buf_pool;
+   struct dma_pool *cmdq_buf_pool;
 
struct hinic_wq *saved_wqs;
 
-- 
2.14.1



[PATCH v15 5/5] PCI: Remove PCI pool macro functions

2017-11-20 Thread Romain Perier
From: Romain Perier 

Now that all the drivers use the DMA pool API, we can remove the
PCI pool macros.

Signed-off-by: Romain Perier 
Reviewed-by: Peter Senna Tschudin 
---
 include/linux/pci.h | 9 -
 1 file changed, 9 deletions(-)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 96c94980d1ff..d03b4a20033d 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1324,15 +1324,6 @@ int pci_set_vga_state(struct pci_dev *pdev, bool decode,
 #include 
 #include 
 
-#define pci_pool dma_pool
-#define pci_pool_create(name, pdev, size, align, allocation) \
-   dma_pool_create(name, &pdev->dev, size, align, allocation)
-#define pci_pool_destroy(pool) dma_pool_destroy(pool)
-#define pci_pool_alloc(pool, flags, handle) dma_pool_alloc(pool, flags, handle)
-#define pci_pool_zalloc(pool, flags, handle) \
-   dma_pool_zalloc(pool, flags, handle)
-#define pci_pool_free(pool, vaddr, addr) dma_pool_free(pool, vaddr, addr)
-
 struct msix_entry {
u32 vector; /* kernel uses to write allocated vector */
u16 entry;  /* driver uses to specify entry, OS writes */
-- 
2.14.1



Re: Regression in throughput between kvm guests over virtual bridge

2017-11-20 Thread Matthew Rosato
On 11/14/2017 03:11 PM, Matthew Rosato wrote:
> On 11/12/2017 01:34 PM, Wei Xu wrote:
>> On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
> This case should be quite similar with pktgen, if you got improvement with
> pktgen, usually it was also the same for UDP, could you please try to disable
> tso, gso, gro, ufo on all host tap devices and guest virtio-net devices?
> Currently the most significant tests would be like this AFAICT:
>
> Host->VM     4.12    4.13
>  TCP:
>  UDP:
> pktgen:

So, I automated these scenarios for extended overnight runs and started
experiencing OOM conditions overnight on a 40G system.  I did a bisect
and it also points to c67df11f.  I can see a leak in at least all of the
Host->VM testcases (TCP, UDP, pktgen), but the pktgen scenario shows the
fastest leak.

I enabled slub_debug on base 4.13 and ran my pktgen scenario in short
intervals until a large percentage of host memory was consumed.  The numbers
below were taken after the last pktgen run completed.  In summary, a very
large number of active skbuff_head_cache entries can be seen.  The sums of
the alloc and free calls match up, but the number of active
skbuff_head_cache entries keeps growing each time the workload is run and
never goes back down between runs.

free -h:
       total   used   free   shared   buff/cache   available
Mem:     39G    31G   6.6G     472K         1.4G        6.8G

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1001952 1000610  99%    0.75K  23856       42    763392K skbuff_head_cache
 126192  126153  99%    0.36K   2868       44     45888K ksm_rmap_item
 100485  100435  99%    0.41K   1305       77     41760K kernfs_node_cache
  63294   39598  62%    0.48K    959       66     30688K dentry
  31968   31719  99%    0.88K    888       36     28416K inode_cache

/sys/kernel/slab/skbuff_head_cache/alloc_calls :
259 __alloc_skb+0x68/0x188 age=1/135076/135741 pid=0-11776 cpus=0,2,4,18
1000351 __build_skb+0x42/0xb0 age=8114/63172/117830 pid=0-11863 cpus=0,10

/sys/kernel/slab/skbuff_head_cache/free_calls:
  13492  age=4295073614 pid=0 cpus=0
 978298 tun_do_read.part.10+0x18c/0x6a0 age=8532/63624/110571 pid=11733
cpus=1-19
  6 skb_free_datagram+0x32/0x78 age=11648/73253/110173 pid=11325
cpus=4,8,10,12,14
  3 __dev_kfree_skb_any+0x5e/0x70 age=108957/115043/118269
pid=0-11605 cpus=5,7,12
  1 netlink_broadcast_filtered+0x172/0x470 age=136165 pid=1 cpus=4
  2 netlink_dump+0x268/0x2a8 age=73236/86857/100479 pid=11325 cpus=4,12
  1 netlink_unicast+0x1ae/0x220 age=12991 pid=9922 cpus=12
  1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11776 cpus=6
  3 unix_stream_read_generic+0x810/0x908 age=15443/50904/118273
pid=9915-11581 cpus=8,16,18
  2 tap_do_read+0x16a/0x488 [tap] age=42338/74246/106155
pid=11605-11699 cpus=2,9
  1 macvlan_process_broadcast+0x17e/0x1e0 [macvlan] age=18835
pid=331 cpus=11
   8800 pktgen_thread_worker+0x80a/0x16d8 [pktgen] age=8545/62184/110571
pid=11863 cpus=0
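
As a quick cross-check of the listings above, the per-call-site counts can be added up. The sketch below is plain userspace C; the array contents are copied from the alloc_calls and free_calls output above, while `sum_counts()` and the array names are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define COUNT_OF(a) (sizeof(a) / sizeof((a)[0]))

/* Counts from /sys/kernel/slab/skbuff_head_cache/alloc_calls above. */
static const unsigned long alloc_counts[] = { 259, 1000351 };

/* Counts from /sys/kernel/slab/skbuff_head_cache/free_calls above. */
static const unsigned long free_counts[] = {
	13492, 978298, 6, 3, 1, 2, 1, 1, 3, 2, 1, 8800
};

/* Add up n per-call-site counters. */
static unsigned long sum_counts(const unsigned long *counts, size_t n)
{
	unsigned long total = 0;
	size_t i;

	assert(counts != NULL || n == 0);
	for (i = 0; i < n; i++)
		total += counts[i];
	return total;
}
```

Both arrays sum to 1000610, which is also the ACTIVE value shown for skbuff_head_cache in the slab listing, so the per-site counters account for every live object in that cache.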


By comparison, when running 4.13 with c67df11f reverted, here's the same
output after the exact same test:

free -h:
       total   used   free   shared   buff/cache   available
Mem:     39G   783M    37G     472K         637M         37G

slabtop:
   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
    714     256  35%    0.75K     17       42       544K skbuff_head_cache

/sys/kernel/slab/skbuff_head_cache/alloc_calls:
257 __alloc_skb+0x68/0x188 age=0/65252/65507 pid=1-11768 cpus=10,15
/sys/kernel/slab/skbuff_head_cache/free_calls:
255  age=4295003081 pid=0 cpus=0
  1 netlink_broadcast_filtered+0x2e8/0x4e0 age=65601 pid=1 cpus=15
  1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11768 cpus=16



Re: [PATCH v2] net: sched: fix crash when deleting secondary chains

2017-11-20 Thread Cong Wang
On Sun, Nov 19, 2017 at 9:36 AM, Roman Kapl  wrote:
> If you flush (delete) a filter chain other than chain 0 (such as when
> deleting the device), the kernel may run into a use-after-free. The
> chain refcount must not be decremented unless we are sure we are done
> with the chain.
>
> To reproduce the bug, run:
> ip link add dtest type dummy
> tc qdisc add dev dtest ingress
> tc filter add dev dtest chain 1  parent : flower
> ip link del dtest
>
> Introduced in: commit f93e1cdcf42c ("net/sched: fix filter flushing"),
> but unless you have KAsan or luck, you won't notice it until
> commit 0dadc117ac8b ("cls_flower: use tcf_exts_get_net() before call_rcu()")
>
> Fixes: f93e1cdcf42c ("net/sched: fix filter flushing")
> Acked-by: Jiri Pirko 
> Signed-off-by: Roman Kapl 

Acked-by: Cong Wang 


[PATCH v2 00/31] Replacing net_mutex with rw_semaphore

2017-11-20 Thread Kirill Tkhai
Hi,

this is the second version of the patchset introducing net_sem
instead of net_mutex. The patchset adds net_sem in addition
to net_mutex and allows pernet_operations to be marked async.
This flag means that the pernet_operations methods are safe to
execute in parallel with any other pernet_operations that are
(un)initializing another net.

If there are only async pernet_operations in the system,
net_mutex is not used either for setup_net() or for cleanup_net().

The flag is a little easier than (un)register_pernet_sys(),
as it changes only one line, and it requires fewer changes
in code. In the future, when all pernet_operations are async,
we'll just remove this struct field.

The pernet_operations converted in this patchset are enough
to build a minimal .config with working networking, and
the changes improve performance, as you can see
below:

%for i in {1..1}; do unshare -n bash -c exit; done

*before*
real 1m40,377s
user 0m9,672s
sys 0m19,928s

*after*
real 0m17,007s
user 0m5,311s
sys 0m11,779

(5.8 times faster)
---

Kirill Tkhai (31):
  net: Assign net to net_namespace_list in setup_net()
  net: Cleanup copy_net_ns()
  net: Introduce net_sem for protection of pernet_list
  net: Move mutex_unlock() in cleanup_net() up
  net: Allow pernet_operations to be executed in parallel
  net: Convert proc_net_ns_ops
  net: Convert net_ns_ops methods
  net: Convert sysctl_pernet_ops
  net: Convert netfilter_net_ops
  net: Convert nf_log_net_ops
  net: Convert net_inuse_ops
  net: Convert net_defaults_ops
  net: Convert netlink_net_ops
  net: Convert rtnetlink_net_ops
  net: Convert audit_net_ops
  net: Convert uevent_net_ops
  net: Convert proto_net_ops
  net: Convert pernet_subsys ops, registered via net_dev_init()
  net: Convert fib_* pernet_operations, registered via subsys_initcall
  net: Convert subsys_initcall() registered pernet_operations from net/sched
  net: Convert genl_pernet_ops
  net: Convert wext_pernet_ops
  net: Convert sysctl_core_ops
  net: Convert pernet_subsys, registered from inet_init()
  net: Convert unix_net_ops
  net: Convert packet_net_ops
  net: Convert ipv4_sysctl_ops
  net: Convert addrconf_ops
  net: Convert loopback_net_ops
  net: Convert default_device_ops
  net: Convert diag_net_ops


 drivers/net/loopback.c  |1 
 fs/proc/proc_net.c  |1 
 include/linux/rtnetlink.h   |1 
 include/net/net_namespace.h |6 +++
 kernel/audit.c  |1 
 lib/kobject_uevent.c|1 
 net/core/dev.c  |2 +
 net/core/fib_notifier.c |1 
 net/core/fib_rules.c|1 
 net/core/net-procfs.c   |2 +
 net/core/net_namespace.c|   94 +--
 net/core/rtnetlink.c|5 +-
 net/core/sock.c |2 +
 net/core/sock_diag.c|1 
 net/core/sysctl_net_core.c  |1 
 net/ipv4/af_inet.c  |2 +
 net/ipv4/arp.c  |1 
 net/ipv4/devinet.c  |1 
 net/ipv4/fib_frontend.c |1 
 net/ipv4/icmp.c |1 
 net/ipv4/igmp.c |1 
 net/ipv4/ip_fragment.c  |1 
 net/ipv4/ipmr.c |1 
 net/ipv4/ping.c |1 
 net/ipv4/proc.c |1 
 net/ipv4/raw.c  |1 
 net/ipv4/route.c|4 ++
 net/ipv4/sysctl_net_ipv4.c  |1 
 net/ipv4/tcp_ipv4.c |2 +
 net/ipv4/tcp_metrics.c  |1 
 net/ipv4/udp.c  |1 
 net/ipv4/udplite.c  |1 
 net/ipv4/xfrm4_policy.c |1 
 net/ipv6/addrconf.c |1 
 net/netfilter/core.c|1 
 net/netfilter/nf_log.c  |1 
 net/netlink/af_netlink.c|1 
 net/netlink/genetlink.c |1 
 net/packet/af_packet.c  |1 
 net/sched/act_api.c |1 
 net/sched/sch_api.c |1 
 net/sysctl_net.c|1 
 net/unix/af_unix.c  |1 
 net/wireless/wext-core.c|1 
 net/xfrm/xfrm_policy.c  |1 
 45 files changed, 114 insertions(+), 41 deletions(-)

--
Signed-off-by: Kirill Tkhai 


[PATCH v2 02/31] net: Cleanup copy_net_ns()

2017-11-20 Thread Kirill Tkhai
Line up the destructor actions in the reverse order
of the constructors. The next patches will add more actions,
and this ordering will make that easier.
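
The reverse-order unwinding described above can be sketched in userspace C. This is only an illustration of the goto-label pattern, not the copy_net_ns() code: the steps 'A', 'B', 'C', the `trace` buffer and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical trace of which setup/teardown steps ran, in order. */
static char trace[16];

static void record(char step)
{
	size_t n = strlen(trace);

	assert(n + 1 < sizeof(trace));
	trace[n] = step;
	trace[n + 1] = '\0';
}

/* Mock constructor: fails on request, otherwise records its name. */
static int setup_step(char name, int fail)
{
	if (fail)
		return -1;
	record(name);
	return 0;
}

/*
 * Run constructors A, B, C; on failure undo the completed ones in
 * reverse order (lowercase letters). fail_at: 0 = none, 1..3 = step.
 */
static int construct(int fail_at)
{
	int rv;

	trace[0] = '\0';
	rv = setup_step('A', fail_at == 1);
	if (rv < 0)
		goto out;
	rv = setup_step('B', fail_at == 2);
	if (rv < 0)
		goto undo_a;
	rv = setup_step('C', fail_at == 3);
	if (rv < 0)
		goto undo_b;
	return 0;

undo_b:
	record('b');
undo_a:
	record('a');
out:
	return rv;
}

/* Accessor for the most recent trace. */
static const char *last_trace(void)
{
	return trace;
}
```

With the labels in strict reverse order, adding a fourth step later means adding one constructor call and one label, without touching the existing error paths.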

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |   20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 7ecf71050ffa..2e512965bf42 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -404,27 +404,25 @@ struct net *copy_net_ns(unsigned long flags,
 
net = net_alloc();
if (!net) {
-   dec_net_namespaces(ucounts);
-   return ERR_PTR(-ENOMEM);
+   rv = -ENOMEM;
+   goto dec_ucounts;
}
-
+   refcount_set(&net->passive, 1);
+   net->ucounts = ucounts;
get_user_ns(user_ns);
 
rv = mutex_lock_killable(&net_mutex);
-   if (rv < 0) {
-   net_free(net);
-   dec_net_namespaces(ucounts);
-   put_user_ns(user_ns);
-   return ERR_PTR(rv);
-   }
+   if (rv < 0)
+   goto put_userns;
 
-   net->ucounts = ucounts;
rv = setup_net(net, user_ns);
mutex_unlock(&net_mutex);
if (rv < 0) {
-   dec_net_namespaces(ucounts);
+put_userns:
put_user_ns(user_ns);
net_drop_ns(net);
+dec_ucounts:
+   dec_net_namespaces(ucounts);
return ERR_PTR(rv);
}
return net;



Re: [PATCH] man: document ip route get mark

2017-11-20 Thread Stephen Hemminger
On Sat, 18 Nov 2017 22:56:49 +0100
Simon Ruderich  wrote:

> Signed-off-by: Simon Ruderich 
> ---
> Hello,
> 
> Just found this in an stackoverflow article from 2015 and it
> really helped. So here as patch.
> 
> Regards
> Simon


Applied man page patches,  thanks


[PATCH v2 07/31] net: Convert net_ns_ops methods

2017-11-20 Thread Kirill Tkhai
This patch starts to convert pernet_subsys registered
from pure initcalls.

The net_ns_ops methods net_ns_net_init() and net_ns_net_exit() use only
ida_simple_* functions, which do not need synchronization.

So, net_ns_ops methods can be executed
in parallel with the methods of other pernet operations.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 550c766f73aa..757765d62daf 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -615,6 +615,7 @@ static __net_exit void net_ns_net_exit(struct net *net)
 static struct pernet_operations __net_initdata net_ns_ops = {
.init = net_ns_net_init,
.exit = net_ns_net_exit,
+   .async = true,
 };
 
 static const struct nla_policy rtnl_net_policy[NETNSA_MAX + 1] = {



[PATCH v2 06/31] net: Convert proc_net_ns_ops

2017-11-20 Thread Kirill Tkhai
This patch starts to convert pernet_subsys registered
before initcalls.

proc_net_ns_ops::proc_net_ns_init()/proc_net_ns_exit()
register the per-net net->proc_net and ->proc_net_stat.

Constructors and destructors of other pernet_operations
are not interested in a foreign net's proc_net and proc_net_stat.
Proc filesystem primitives are synchronized on proc_subdir_lock.

So, proc_net_ns_ops methods can be executed
in parallel with the methods of other pernet operations.

Signed-off-by: Kirill Tkhai 
---
 fs/proc/proc_net.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index a2bf369c923d..2bf6170204b1 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -237,6 +237,7 @@ static __net_exit void proc_net_ns_exit(struct net *net)
 static struct pernet_operations __net_initdata proc_net_ns_ops = {
.init = proc_net_ns_init,
.exit = proc_net_ns_exit,
+   .async = true,
 };
 
 int __init proc_net_init(void)



[PATCH v2 04/31] net: Move mutex_unlock() in cleanup_net() up

2017-11-20 Thread Kirill Tkhai
net_sem protects pernet_list from changing, while
ops_free_list() does a simple kfree(), which can't
race with other pernet_operations callbacks.

So we may release net_mutex earlier than before.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 859dce31e37e..c4f7452906bb 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -489,11 +489,12 @@ static void cleanup_net(struct work_struct *work)
list_for_each_entry_reverse(ops, &pernet_list, list)
ops_exit_list(ops, &net_exit_list);
 
+   mutex_unlock(&net_mutex);
+
/* Free the net generic variables */
list_for_each_entry_reverse(ops, &pernet_list, list)
ops_free_list(ops, &net_exit_list);
 
-   mutex_unlock(&net_mutex);
up_read(&net_sem);
 
/* Ensure there are no outstanding rcu callbacks using this



[PATCH v2 03/31] net: Introduce net_sem for protection of pernet_list

2017-11-20 Thread Kirill Tkhai
Currently a mutex is used to protect the pernet operations list. It makes
cleanup_net() execute the ->exit methods of the same set of operations
that was used at the time of ->init, even after the net namespace is
unlinked from net_namespace_list.

But the problem is that synchronize_rcu() is needed after the net is removed
from net_namespace_list:

Destroy net_ns:
cleanup_net()
  mutex_lock(&net_mutex)
  list_del_rcu(&net->list)
  synchronize_rcu()  <--- Sleep there for ages
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_exit_list(ops, &net_exit_list)
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_free_list(ops, &net_exit_list)
  mutex_unlock(&net_mutex)

This primitive is not fast, especially on systems with many processors
and/or when preemptible RCU is enabled in the config. So, for the whole time
cleanup_net() is waiting for the RCU grace period, creation of new net
namespaces is not possible; the tasks that attempt it sleep on the same mutex:

Create net_ns:
copy_net_ns()
  mutex_lock_killable(&net_mutex)    <--- Sleep there for ages

I observed 20-30 second hangs of "unshare -n" on an ordinary 8-cpu laptop
with preemptible RCU enabled.

The solution is to convert net_mutex to an rw_semaphore and add small locks
to the really small number of pernet_operations that really need them. Then
the pernet_operations::init/::exit methods, which modify the net-related data,
will require only down_read() locking, while down_write() will be used
for changing pernet_list.
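
The reader/writer split described above can be illustrated with a toy, single-threaded model in userspace C. This is an assumption-laden sketch and not a real lock: `struct toy_rwsem` and the `toy_*` helpers are hypothetical and merely mimic the admission rules of a reader/writer semaphore (many readers or one writer).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of reader/writer admission rules; NOT a usable lock. */
struct toy_rwsem {
	int readers;
	bool writer;
};

/* A reader is admitted unless a writer holds the semaphore. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

static void toy_up_read(struct toy_rwsem *sem)
{
	assert(sem->readers > 0);
	sem->readers--;
}

/* A writer is admitted only when nobody else holds the semaphore. */
static bool toy_down_write_trylock(struct toy_rwsem *sem)
{
	if (sem->writer || sem->readers > 0)
		return false;
	sem->writer = true;
	return true;
}

static void toy_up_write(struct toy_rwsem *sem)
{
	assert(sem->writer);
	sem->writer = false;
}

/*
 * Scenario: two namespace setups hold the read side at once, and a
 * pernet_list change (write side) has to wait until both are done.
 */
static bool readers_share_writer_waits(void)
{
	struct toy_rwsem sem = { 0, false };

	if (!toy_down_read_trylock(&sem))	/* first setup */
		return false;
	if (!toy_down_read_trylock(&sem))	/* second setup, overlapping */
		return false;
	if (toy_down_write_trylock(&sem))	/* list change must not get in */
		return false;
	toy_up_read(&sem);
	toy_up_read(&sem);
	if (!toy_down_write_trylock(&sem))	/* now the writer is admitted */
		return false;
	if (toy_down_read_trylock(&sem))	/* and readers are kept out */
		return false;
	toy_up_write(&sem);
	return true;
}
```

The point of the design is visible in the first two steps: with a plain mutex the second setup would have to wait for the first, while with the read side of a semaphore they overlap.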

This gives a significant performance increase after the whole patch set is
applied, as you can see here:

%for i in {1..1}; do unshare -n bash -c exit; done

*before*
real 1m40,377s
user 0m9,672s
sys 0m19,928s

*after*
real 0m17,007s
user 0m5,311s
sys 0m11,779

(5.8 times faster)

This patch starts replacing net_mutex with net_sem. It adds the rw_semaphore,
describes the variables it protects, and uses it where appropriate.
net_mutex is still present, and the next patches will kick it out step by step.

Signed-off-by: Kirill Tkhai 
---
 include/linux/rtnetlink.h |1 +
 net/core/net_namespace.c  |   39 ++-
 net/core/rtnetlink.c  |4 ++--
 3 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 2032ce2eb20b..f640fc87fe1d 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -35,6 +35,7 @@ extern int rtnl_is_locked(void);
 
 extern wait_queue_head_t netdev_unregistering_wq;
 extern struct mutex net_mutex;
+extern struct rw_semaphore net_sem;
 
 #ifdef CONFIG_PROVE_LOCKING
 extern bool lockdep_rtnl_is_held(void);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 2e512965bf42..859dce31e37e 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -41,6 +41,11 @@ struct net init_net = {
 EXPORT_SYMBOL(init_net);
 
 static bool init_net_initialized;
+/*
+ * net_sem: protects: pernet_list, net_generic_ids,
+ * init_net_initialized and first_device pointer.
+ */
+DECLARE_RWSEM(net_sem);
 
 #define MIN_PERNET_OPS_ID  \
((sizeof(struct net_generic) + sizeof(void *) - 1) / sizeof(void *))
@@ -279,7 +284,7 @@ struct net *get_net_ns_by_id(struct net *net, int id)
  */
 static __net_init int setup_net(struct net *net, struct user_namespace 
*user_ns)
 {
-   /* Must be called with net_mutex held */
+   /* Must be called with net_sem held */
const struct pernet_operations *ops, *saved_ops;
int error = 0;
LIST_HEAD(net_exit_list);
@@ -411,12 +416,16 @@ struct net *copy_net_ns(unsigned long flags,
net->ucounts = ucounts;
get_user_ns(user_ns);
 
-   rv = mutex_lock_killable(&net_mutex);
+   rv = down_read_killable(&net_sem);
if (rv < 0)
goto put_userns;
-
+   rv = mutex_lock_killable(&net_mutex);
+   if (rv < 0)
+   goto up_read;
rv = setup_net(net, user_ns);
mutex_unlock(&net_mutex);
+up_read:
+   up_read(&net_sem);
if (rv < 0) {
 put_userns:
put_user_ns(user_ns);
@@ -443,6 +452,7 @@ static void cleanup_net(struct work_struct *work)
list_replace_init(&cleanup_list, &net_kill_list);
spin_unlock_irq(&cleanup_list_lock);
 
+   down_read(&net_sem);
mutex_lock(&net_mutex);
 
/* Don't let anyone else find us. */
@@ -484,6 +494,7 @@ static void cleanup_net(struct work_struct *work)
ops_free_list(ops, &net_exit_list);
 
mutex_unlock(&net_mutex);
+   up_read(&net_sem);
 
/* Ensure there are no outstanding rcu callbacks using this
 * network namespace.
@@ -510,8 +521,10 @@ static void cleanup_net(struct work_struct *work)
  */
 void net_ns_barrier(void)
 {
+   down_write(&net_sem);
mutex_lock(&net_mutex);
mutex_unlock(&net_mutex);
+   up_write(&net_sem);
 }
 EXPORT_SYMBOL(net_ns_barrier);
 
@@ -838,12 +851,12 @@ static int __init net_ns_init(void)
 
rcu_assign_pointer(init_net.gen, ng);
 

[PATCH v2 05/31] net: Allow pernet_operations to be executed in parallel

2017-11-20 Thread Kirill Tkhai
This adds a new pernet_operations::async flag to mark operations
whose ->init(), ->exit() and ->exit_batch() methods are allowed
to be executed in parallel with the methods of any other pernet_operations.

When there are only asynchronous pernet_operations in the system,
net_mutex is not taken for net construction and destruction.

Also, remove BUG_ON(mutex_is_locked()) from net_assign_generic()
without replacing with the equivalent net_sem check, as there is
one more lockdep assert below.

Suggested-by: Eric W. Biederman 
Signed-off-by: Kirill Tkhai 
---
 include/net/net_namespace.h |6 ++
 net/core/net_namespace.c|   29 +++--
 2 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 10f99dafd5ac..db978c4755f7 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -303,6 +303,12 @@ struct pernet_operations {
void (*exit_batch)(struct list_head *net_exit_list);
unsigned int *id;
size_t size;
+   /*
+* Indicates that the above methods are allowed to be executed in
+* parallel with methods of any other pernet_operations, i.e. they
+* do not need synchronization via net_mutex.
+*/
+   bool async;
 };
 
 /*
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index c4f7452906bb..550c766f73aa 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -41,8 +41,9 @@ struct net init_net = {
 EXPORT_SYMBOL(init_net);
 
 static bool init_net_initialized;
+static unsigned nr_sync_pernet_ops;
 /*
- * net_sem: protects: pernet_list, net_generic_ids,
+ * net_sem: protects: pernet_list, net_generic_ids, nr_sync_pernet_ops,
  * init_net_initialized and first_device pointer.
  */
 DECLARE_RWSEM(net_sem);
@@ -70,11 +71,10 @@ static int net_assign_generic(struct net *net, unsigned int 
id, void *data)
 {
struct net_generic *ng, *old_ng;
 
-   BUG_ON(!mutex_is_locked(&net_mutex));
BUG_ON(id < MIN_PERNET_OPS_ID);
 
old_ng = rcu_dereference_protected(net->gen,
-  lockdep_is_held(&net_mutex));
+  lockdep_is_held(&net_sem));
if (old_ng->s.len > id) {
old_ng->ptr[id] = data;
return 0;
@@ -419,11 +419,14 @@ struct net *copy_net_ns(unsigned long flags,
rv = down_read_killable(&net_sem);
if (rv < 0)
goto put_userns;
-   rv = mutex_lock_killable(&net_mutex);
-   if (rv < 0)
-   goto up_read;
+   if (nr_sync_pernet_ops) {
+   rv = mutex_lock_killable(&net_mutex);
+   if (rv < 0)
+   goto up_read;
+   }
rv = setup_net(net, user_ns);
-   mutex_unlock(&net_mutex);
+   if (nr_sync_pernet_ops)
+   mutex_unlock(&net_mutex);
 up_read:
up_read(&net_sem);
if (rv < 0) {
@@ -453,7 +456,8 @@ static void cleanup_net(struct work_struct *work)
spin_unlock_irq(_list_lock);
 
down_read(&net_sem);
-   mutex_lock(&net_mutex);
+   if (nr_sync_pernet_ops)
+   mutex_lock(&net_mutex);
 
/* Don't let anyone else find us. */
rtnl_lock();
@@ -489,7 +493,8 @@ static void cleanup_net(struct work_struct *work)
list_for_each_entry_reverse(ops, _list, list)
ops_exit_list(ops, _exit_list);
 
-   mutex_unlock(&net_mutex);
+   if (nr_sync_pernet_ops)
+   mutex_unlock(&net_mutex);
 
/* Free the net generic variables */
list_for_each_entry_reverse(ops, _list, list)
@@ -961,6 +966,9 @@ static int register_pernet_operations(struct list_head 
*list,
rcu_barrier();
if (ops->id)
ida_remove(_generic_ids, *ops->id);
+   } else if (!ops->async) {
+   pr_info_once("Pernet operations %ps are sync.\n", ops);
+   nr_sync_pernet_ops++;
}
 
return error;
@@ -968,7 +976,8 @@ static int register_pernet_operations(struct list_head 
*list,
 
 static void unregister_pernet_operations(struct pernet_operations *ops)
 {
-   
+   if (!ops->async)
+   BUG_ON(nr_sync_pernet_ops-- == 0);
__unregister_pernet_operations(ops);
rcu_barrier();
if (ops->id)



[PATCH v2 08/31] net: Convert sysctl_pernet_ops

2017-11-20 Thread Kirill Tkhai
This patch starts to convert pernet_subsys, registered
from core initcalls.

Methods sysctl_net_init() and sysctl_net_exit() initialize
net::sysctls table of a namespace.

The init()/exit() methods of the other pernet_operations
on the list do not touch the net::sysctls of other namespaces,
so it's safe to execute sysctl_pernet_ops's methods
in parallel with any other pernet_operations.
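
The independence argument can be sketched in user space as follows. This is only an illustration under stated assumptions: `struct model_net` and both init functions are hypothetical, and the check shows that two initializers writing disjoint fields give the same result in either order, which is the property that makes unordered execution plausible. It does not prove thread safety by itself.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified namespace: each ops owns exactly one field. */
struct model_net {
	int sysctls;	/* owned by the "sysctl" ops in this sketch */
	int hooks;	/* owned by some other, unrelated ops */
};

int model_sysctl_init(struct model_net *net)
{
	net->sysctls = 1;	/* touches only its own field */
	return 0;
}

int model_other_init(struct model_net *net)
{
	net->hooks = 2;		/* never reads or writes net->sysctls */
	return 0;
}

/* Returns 1 if both call orders leave the namespace in the same state. */
int model_inits_commute(void)
{
	struct model_net ab, ba;

	memset(&ab, 0, sizeof(ab));
	memset(&ba, 0, sizeof(ba));

	model_sysctl_init(&ab);
	model_other_init(&ab);

	model_other_init(&ba);
	model_sysctl_init(&ba);

	return ab.sysctls == ba.sysctls && ab.hooks == ba.hooks;
}
```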

Signed-off-by: Kirill Tkhai 
---
 net/sysctl_net.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sysctl_net.c b/net/sysctl_net.c
index 9aed6fe1bf1a..f424539829b7 100644
--- a/net/sysctl_net.c
+++ b/net/sysctl_net.c
@@ -89,6 +89,7 @@ static void __net_exit sysctl_net_exit(struct net *net)
 static struct pernet_operations sysctl_pernet_ops = {
.init = sysctl_net_init,
.exit = sysctl_net_exit,
+   .async = true,
 };
 
 static struct ctl_table_header *net_header;



[PATCH v2 10/31] net: Convert nf_log_net_ops

2017-11-20 Thread Kirill Tkhai
These pernet_operations would have had a problem with parallel
execution if init_net could be released.
But it cannot, and the rest is safe.
There is a memory allocation, which nobody else is interested in,
and a sysctl registration. So, we make it async.

Signed-off-by: Kirill Tkhai 
---
 net/netfilter/nf_log.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index 8bb152a7cca4..6137fb1bce66 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -578,6 +578,7 @@ static void __net_exit nf_log_net_exit(struct net *net)
 static struct pernet_operations nf_log_net_ops = {
.init = nf_log_net_init,
.exit = nf_log_net_exit,
+   .async = true,
 };
 
 int __init netfilter_log_init(void)



[PATCH v2 09/31] net: Convert netfilter_net_ops

2017-11-20 Thread Kirill Tkhai
Methods netfilter_net_init() and netfilter_net_exit()
initialize net::nf::hooks and change the net-related proc
directory of the net. Other pernet_operations are not
interested in foreign net::nf::hooks or proc entries,
so it's safe to execute them in parallel with the
methods of other pernet_operations.

Signed-off-by: Kirill Tkhai 
---
 net/netfilter/core.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 52cd2901a097..bfe2e44244ee 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -600,6 +600,7 @@ static void __net_exit netfilter_net_exit(struct net *net)
 static struct pernet_operations netfilter_net_ops = {
.init = netfilter_net_init,
.exit = netfilter_net_exit,
+   .async = true,
 };
 
 int __init netfilter_init(void)



[PATCH v2 11/31] net: Convert net_inuse_ops

2017-11-20 Thread Kirill Tkhai
net_inuse_ops methods expose statistics in /proc.
Nothing else on the pernet_subsys or pernet_device
lists touches net::core::inuse.

So, it's safe to make net_inuse_ops async.

Signed-off-by: Kirill Tkhai 
---
 net/core/sock.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index c0b5b2f17412..f04f5ec87d04 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3075,6 +3075,7 @@ static void __net_exit sock_inuse_exit_net(struct net 
*net)
 static struct pernet_operations net_inuse_ops = {
.init = sock_inuse_init_net,
.exit = sock_inuse_exit_net,
+   .async = true,
 };
 
 static __init int net_inuse_init(void)



[PATCH v2 12/31] net: Convert net_defaults_ops

2017-11-20 Thread Kirill Tkhai
net_defaults_ops provides only the net_defaults_init_net() method,
which acts on net::core::sysctl_somaxconn. That field is of no
interest to the rest of the pernet_subsys and
pernet_device lists. So, make it async.

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 757765d62daf..c91b10731498 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -332,6 +332,7 @@ static int __net_init net_defaults_init_net(struct net *net)
 
 static struct pernet_operations net_defaults_ops = {
.init = net_defaults_init_net,
+   .async = true,
 };
 
 static __init int net_defaults_init(void)



[PATCH v2 13/31] net: Convert netlink_net_ops

2017-11-20 Thread Kirill Tkhai
The methods of netlink_net_ops create and destroy the "netlink"
file, which is of no interest to foreign pernet_operations.
So, netlink_net_ops may safely be made async.

Signed-off-by: Kirill Tkhai 
---
 net/netlink/af_netlink.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b9e0ee4e22f5..1bb967bce57c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2687,6 +2687,7 @@ static void __init netlink_add_usersock_entry(void)
 static struct pernet_operations __net_initdata netlink_net_ops = {
.init = netlink_net_init,
.exit = netlink_net_exit,
+   .async = true,
 };
 
 static inline u32 netlink_hash(const void *data, u32 len, u32 seed)



[PATCH v2 14/31] net: Convert rtnetlink_net_ops

2017-11-20 Thread Kirill Tkhai
rtnetlink_net_init() and rtnetlink_net_exit()
create and destroy a netlink socket. It looks like
other pernet_operations are not interested in a
foreign net::rtnl, so rtnetlink_net_ops may be
safely made async.

Signed-off-by: Kirill Tkhai 
---
 net/core/rtnetlink.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index cb06d43c4230..fb3f58cf9351 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -4494,6 +4494,7 @@ static void __net_exit rtnetlink_net_exit(struct net *net)
 static struct pernet_operations rtnetlink_net_ops = {
.init = rtnetlink_net_init,
.exit = rtnetlink_net_exit,
+   .async = true,
 };
 
 void __init rtnetlink_init(void)



Re: [PATCH iproute2] iproute2: fixes to compile on some systems.

2017-11-20 Thread Stephen Hemminger
On Mon, 20 Nov 2017 12:57:07 +0900
Lorenzo Colitti  wrote:

> 1. Put the declarations of strlcpy and strlcat inside
>an #ifdef NEED_STRLCPY. Their declarations were already in a
>similar #ifdef.
> 2. In bpf_scm.h, include sys/un.h for struct sockaddr_un.
> 3. In utils.h, include time.h for struct timeval.
> 
> Tested: builds on ubuntu 14.04 with "make clean distclean; ./configure && make -j64"
> Tested: 4.14.1 builds on Android with Android-specific #ifndefs for missing library code
> Signed-off-by: Lorenzo Colitti 

Applied. Thanks for not forking


[PATCH v2 15/31] net: Convert audit_net_ops

2017-11-20 Thread Kirill Tkhai
This patch starts to convert pernet_subsys, registered
from postcore initcalls.

audit_net_init() creates a netlink socket, while audit_net_exit()
destroys it. The rest of the pernet_list is not interested
in the socket, so we make audit_net_ops async.

Signed-off-by: Kirill Tkhai 
---
 kernel/audit.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/audit.c b/kernel/audit.c
index 227db99b0f19..5e49b614d0e6 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1526,6 +1526,7 @@ static struct pernet_operations audit_net_ops 
__net_initdata = {
.exit = audit_net_exit,
.id = _net_id,
.size = sizeof(struct audit_net),
+   .async = true,
 };
 
 /* Initialize audit support at boot time. */



[PATCH v2 16/31] net: Convert uevent_net_ops

2017-11-20 Thread Kirill Tkhai
uevent_net_init() and uevent_net_exit() create and
destroy a netlink socket, and these actions are serialized
in the netlink code.

Parallel execution with other pernet_operations
makes the socket disappear earlier from uevent_sock_list
on ->exit. As userspace can't be interested in broadcast
messages of a dying net and, as far as I can see, no one in the
kernel listens for them, we may safely make uevent_net_ops async.

Signed-off-by: Kirill Tkhai 
---
 lib/kobject_uevent.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index c3e84edc47c9..4a2c39ae1e65 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -643,6 +643,7 @@ static void uevent_net_exit(struct net *net)
 static struct pernet_operations uevent_net_ops = {
.init   = uevent_net_init,
.exit   = uevent_net_exit,
+   .async  = true,
 };
 
 static int __init kobject_uevent_init(void)



[PATCH v2 17/31] net: Convert proto_net_ops

2017-11-20 Thread Kirill Tkhai
This patch starts to convert pernet_subsys, registered
from subsys initcalls.

It seems safe to execute in parallel with others,
as it only creates/destroys a proc entry,
which nobody else is interested in.

Signed-off-by: Kirill Tkhai 
---
 net/core/sock.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index f04f5ec87d04..d9c3de4239e6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3344,6 +3344,7 @@ static __net_exit void proto_exit_net(struct net *net)
 static __net_initdata struct pernet_operations proto_net_ops = {
.init = proto_init_net,
.exit = proto_exit_net,
+   .async = true,
 };
 
 static int __init proto_init(void)



[PATCH v2 18/31] net: Convert pernet_subsys ops, registered via net_dev_init()

2017-11-20 Thread Kirill Tkhai
There are:
1) dev_proc_ops and dev_mc_net_ops, which create and destroy a
pernet proc file and are not interested in other net namespaces;
2) netdev_net_ops, which creates a pernet hash, which is not
touched by other pernet_operations.

So, make them async.

Signed-off-by: Kirill Tkhai 
---
 net/core/dev.c|1 +
 net/core/net-procfs.c |2 ++
 2 files changed, 3 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8ee29f4f5fa9..41a576a17430 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8656,6 +8656,7 @@ static void __net_exit netdev_exit(struct net *net)
 static struct pernet_operations __net_initdata netdev_net_ops = {
.init = netdev_init,
.exit = netdev_exit,
+   .async = true,
 };
 
 static void __net_exit default_device_exit(struct net *net)
diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
index 615ccab55f38..16b250dd50ed 100644
--- a/net/core/net-procfs.c
+++ b/net/core/net-procfs.c
@@ -352,6 +352,7 @@ static void __net_exit dev_proc_net_exit(struct net *net)
 static struct pernet_operations __net_initdata dev_proc_ops = {
.init = dev_proc_net_init,
.exit = dev_proc_net_exit,
+   .async = true,
 };
 
 static int dev_mc_seq_show(struct seq_file *seq, void *v)
@@ -409,6 +410,7 @@ static void __net_exit dev_mc_net_exit(struct net *net)
 static struct pernet_operations __net_initdata dev_mc_net_ops = {
.init = dev_mc_net_init,
.exit = dev_mc_net_exit,
+   .async = true,
 };
 
 int __init dev_proc_init(void)



[PATCH v2 19/31] net: Convert fib_* pernet_operations, registered via subsys_initcall

2017-11-20 Thread Kirill Tkhai
Both of them create and initialize lists, which are not touched
by foreign pernet_operations.

Signed-off-by: Kirill Tkhai 
---
 net/core/fib_notifier.c |1 +
 net/core/fib_rules.c|1 +
 2 files changed, 2 insertions(+)

diff --git a/net/core/fib_notifier.c b/net/core/fib_notifier.c
index 0c048bdeb016..5ace0705a3f9 100644
--- a/net/core/fib_notifier.c
+++ b/net/core/fib_notifier.c
@@ -171,6 +171,7 @@ static void __net_exit fib_notifier_net_exit(struct net 
*net)
 static struct pernet_operations fib_notifier_net_ops = {
.init = fib_notifier_net_init,
.exit = fib_notifier_net_exit,
+   .async = true,
 };
 
 static int __init fib_notifier_init(void)
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 98e1066c3d55..cb071b8e8d17 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -1030,6 +1030,7 @@ static void __net_exit fib_rules_net_exit(struct net *net)
 static struct pernet_operations fib_rules_net_ops = {
.init = fib_rules_net_init,
.exit = fib_rules_net_exit,
+   .async = true,
 };
 
 static int __init fib_rules_init(void)



[PATCH v2 20/31] net: Convert subsys_initcall() registered pernet_operations from net/sched

2017-11-20 Thread Kirill Tkhai
psched_net_ops only creates and destroys a /proc entry,
and is safe to execute in parallel with any foreign
pernet_operations.

tcf_action_net_ops initializes and destructs tcf_action_net::egdev_ht,
which is not touched by foreign pernet_operations.

So, make them async.

Signed-off-by: Kirill Tkhai 
---
 net/sched/act_api.c |1 +
 net/sched/sch_api.c |1 +
 2 files changed, 2 insertions(+)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 4d33a50a8a6d..41a26f551dbb 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1464,6 +1464,7 @@ static struct pernet_operations tcf_action_net_ops = {
.exit = tcf_action_net_exit,
.id = _action_net_id,
.size = sizeof(struct tcf_action_net),
+   .async = true,
 };
 
 static int __init tc_action_init(void)
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index b6c4f536876b..09d63c83542a 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -2002,6 +2002,7 @@ static void __net_exit psched_net_exit(struct net *net)
 static struct pernet_operations psched_net_ops = {
.init = psched_net_init,
.exit = psched_net_exit,
+   .async = true,
 };
 
 static int __init pktsched_init(void)



[PATCH v2 22/31] net: Convert wext_pernet_ops

2017-11-20 Thread Kirill Tkhai
These pernet_operations initialize and purge the net::wext_nlevents
queue, which is not touched by foreign pernet_operations.

Mark them async.

Signed-off-by: Kirill Tkhai 
---
 net/wireless/wext-core.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/wireless/wext-core.c b/net/wireless/wext-core.c
index 6cdb054484d6..32c9f1c303f9 100644
--- a/net/wireless/wext-core.c
+++ b/net/wireless/wext-core.c
@@ -390,6 +390,7 @@ static void __net_exit wext_pernet_exit(struct net *net)
 static struct pernet_operations wext_pernet_ops = {
.init = wext_pernet_init,
.exit = wext_pernet_exit,
+   .async = true,
 };
 
 static int __init wireless_nlevent_init(void)



[PATCH v2 24/31] net: Convert pernet_subsys, registered from inet_init()

2017-11-20 Thread Kirill Tkhai
arp_net_ops just adds/removes a /proc entry.

devinet_ops allocates and frees duplicate of init_net tables
and (un)registers sysctl entries.

fib_net_ops allocates and frees pernet tables, creates/destroys
netlink socket and (un)initializes /proc entries. Foreign
pernet_operations do not touch them.

ip_rt_proc_ops only modifies pernet /proc entries.

xfrm_net_ops creates/destroys /proc entries, allocates/frees
pernet statistics, hashes and tables, and (un)initializes
sysctl files. These are not touched by foreign pernet_operations.

xfrm4_net_ops allocates/frees private pernet memory, and
configures sysctls.

sysctl_route_ops creates/destroys sysctls.

rt_genid_ops only initializes fields of just allocated net.

ipv4_inetpeer_ops allocates/frees net private memory.

igmp_net_ops just creates/destroys /proc files and a socket,
which no one else is interested in.

tcp_sk_ops seems to be safe, because tcp_sk_init() does not
depend on any other pernet_operations modifications. Iteration
over hash table in inet_twsk_purge() is made under RCU lock,
and it's safe to iterate the table this way. Removal from
the table happens in inet_twsk_deschedule_put(), but this
function is safe without any external locks, as it is synchronized
internally. There are many examples of it being used in different
contexts. So, it's safe to leave tcp_sk_exit_batch() unlocked.

tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe.

udplite4_net_ops only creates/destroys pernet /proc file.

icmp_sk_ops creates percpu sockets, not touched by foreign
pernet_operations.

ipmr_net_ops creates/destroys pernet fib tables, (un)registers
fib rules and /proc files. This seems to be safe to execute
in parallel with foreign pernet_operations.

af_inet_ops just sets up default parameters of newly created net.

ipv4_mib_ops creates and destroys pernet percpu statistics.

raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops
and ip_proc_ops only create/destroy pernet /proc files.

ip4_frags_ops creates and destroys sysctl file.

So, it's safe to make the pernet_operations async.

Signed-off-by: Kirill Tkhai 
---
 net/ipv4/af_inet.c  |2 ++
 net/ipv4/arp.c  |1 +
 net/ipv4/devinet.c  |1 +
 net/ipv4/fib_frontend.c |1 +
 net/ipv4/icmp.c |1 +
 net/ipv4/igmp.c |1 +
 net/ipv4/ip_fragment.c  |1 +
 net/ipv4/ipmr.c |1 +
 net/ipv4/ping.c |1 +
 net/ipv4/proc.c |1 +
 net/ipv4/raw.c  |1 +
 net/ipv4/route.c|4 
 net/ipv4/tcp_ipv4.c |2 ++
 net/ipv4/tcp_metrics.c  |1 +
 net/ipv4/udp.c  |1 +
 net/ipv4/udplite.c  |1 +
 net/ipv4/xfrm4_policy.c |1 +
 net/xfrm/xfrm_policy.c  |1 +
 18 files changed, 23 insertions(+)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index ce4aa827be05..d1a2e9afbb50 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1697,6 +1697,7 @@ static __net_exit void ipv4_mib_exit_net(struct net *net)
 static __net_initdata struct pernet_operations ipv4_mib_ops = {
.init = ipv4_mib_init_net,
.exit = ipv4_mib_exit_net,
+   .async = true,
 };
 
 static int __init init_ipv4_mibs(void)
@@ -1750,6 +1751,7 @@ static __net_exit void inet_exit_net(struct net *net)
 static __net_initdata struct pernet_operations af_inet_ops = {
.init = inet_init_net,
.exit = inet_exit_net,
+   .async = true,
 };
 
 static int __init init_inet_pernet_ops(void)
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index a8d7c5a9fb05..19bcd10a928b 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -1443,6 +1443,7 @@ static void __net_exit arp_net_exit(struct net *net)
 static struct pernet_operations arp_net_ops = {
.init = arp_net_init,
.exit = arp_net_exit,
+   .async = true,
 };
 
 static int __init arp_proc_init(void)
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index a4573bccd6da..c359bda18ff5 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -2474,6 +2474,7 @@ static __net_exit void devinet_exit_net(struct net *net)
 static __net_initdata struct pernet_operations devinet_ops = {
.init = devinet_init_net,
.exit = devinet_exit_net,
+   .async = true,
 };
 
 static struct rtnl_af_ops inet_af_ops __read_mostly = {
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index f52d27a422c3..6eb4aa5ee66f 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1361,6 +1361,7 @@ static void __net_exit fib_net_exit(struct net *net)
 static struct pernet_operations fib_net_ops = {
.init = fib_net_init,
.exit = fib_net_exit,
+   .async = true,
 };
 
 void __init ip_fib_init(void)
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 1617604c9284..cc56efa64d5c 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -1257,6 +1257,7 @@ static int __net_init icmp_sk_init(struct net *net)
 static struct pernet_operations __net_initdata icmp_sk_ops = {
.init = 

[PATCH v2 25/31] net: Convert unix_net_ops

2017-11-20 Thread Kirill Tkhai
These pernet_operations just create and destroy
/proc and sysctl entries, which are not touched by
foreign pernet_operations.

So, we are able to make them async.

Signed-off-by: Kirill Tkhai 
---
 net/unix/af_unix.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index a9ee634f3c42..1ddf77260849 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2913,6 +2913,7 @@ static void __net_exit unix_net_exit(struct net *net)
 static struct pernet_operations unix_net_ops = {
.init = unix_net_init,
.exit = unix_net_exit,
+   .async = true,
 };
 
 static int __init af_unix_init(void)



[PATCH v2 26/31] net: Convert packet_net_ops

2017-11-20 Thread Kirill Tkhai
These pernet_operations just create and destroy a /proc entry,
and other operations do not touch it.

Also, nobody else is interested in a foreign net::packet::sklist.

Signed-off-by: Kirill Tkhai 
---
 net/packet/af_packet.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 737092ca9b4e..700cdf36767b 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -4566,6 +4566,7 @@ static void __net_exit packet_net_exit(struct net *net)
 static struct pernet_operations packet_net_ops = {
.init = packet_net_init,
.exit = packet_net_exit,
+   .async = true,
 };
 
 



[PATCH v2 28/31] net: Convert addrconf_ops

2017-11-20 Thread Kirill Tkhai
These pernet_operations (un)register sysctls, which
are not touched by anybody else.

So, it's safe to make them async.

Signed-off-by: Kirill Tkhai 
---
 net/ipv6/addrconf.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index a0ae1c9d37df..fb7cf120daa7 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6523,6 +6523,7 @@ static void __net_exit addrconf_exit_net(struct net *net)
 static struct pernet_operations addrconf_ops = {
.init = addrconf_init_net,
.exit = addrconf_exit_net,
+   .async = true,
 };
 
 static struct rtnl_af_ops inet6_ops __read_mostly = {



[PATCH v2 29/31] net: Convert loopback_net_ops

2017-11-20 Thread Kirill Tkhai
These pernet_operations have only init() method. It allocates
memory for net_device, calls register_netdev() and assigns
net::loopback_dev.

register_netdev() is allowed to be used without additional locks,
as it's synchronized on rtnl_lock(). There are many examples
of using this function directly from ioctl().
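
The "synchronized inside the callee" pattern referred to here can be modelled in user space as below. This is a sketch only: `model_rtnl` is a hypothetical stand-in for rtnl_lock(), the fixed-size table replaces the real device lists, and none of the names correspond to kernel symbols.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MODEL_MAX_DEVS 8

/* Hypothetical stand-in for rtnl_lock(): one global lock owned by the callee. */
static pthread_mutex_t model_rtnl = PTHREAD_MUTEX_INITIALIZER;
static const char *model_devs[MODEL_MAX_DEVS];
static size_t model_ndevs;

/*
 * Callers need no lock of their own: the function serializes itself.
 * Returns 0 on success, -1 if the table is full.
 */
int model_register_netdev(const char *name)
{
	int rv = -1;

	pthread_mutex_lock(&model_rtnl);
	if (model_ndevs < MODEL_MAX_DEVS) {
		model_devs[model_ndevs++] = name;
		rv = 0;
	}
	pthread_mutex_unlock(&model_rtnl);
	return rv;
}

/* Reads the device count under the same lock. */
size_t model_netdev_count(void)
{
	size_t n;

	pthread_mutex_lock(&model_rtnl);
	n = model_ndevs;
	pthread_mutex_unlock(&model_rtnl);
	return n;
}
```

Because the lock is taken inside the function, it makes no difference to correctness whether the caller is an ioctl() path or an init method running without any outer mutex.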

The only difference, compared to ioctl(), is that the net is not
completely alive at this moment. But it looks like there is
no way for parallel pernet_operations to dereference
the net_device, as most of the struct net_device lists
where it is linked are related to the net, and the net is not linked.

The exceptions are net_device::unreg_list, close_list, todo_list,
used for unregistration, and ::link_watch_list, where net_device
may be linked to global lists.

Unregistration of loopback_dev obviously can't happen while
loopback_net_init() is executing, as the net is alive. It occurs
in default_device_ops, which currently requires net_mutex,
and it behaves as a barrier at the moment. It will be considered
in the next patch.

As for link_watch_list, it seems there is no way
for loopback_dev to be linked into lweventlist at registration time
and become available to other pernet_operations.

Signed-off-by: Kirill Tkhai 
---
 drivers/net/loopback.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 30612497643c..b97a907ea5aa 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -230,4 +230,5 @@ static __net_init int loopback_net_init(struct net *net)
 /* Registered in net/core/dev.c */
 struct pernet_operations __net_initdata loopback_net_ops = {
.init = loopback_net_init,
+   .async = true,
 };



[PATCH v2 31/31] net: Convert diag_net_ops

2017-11-20 Thread Kirill Tkhai
These pernet operations just create and destroy a netlink
socket. The socket is per-net and other operations don't
touch it.

Signed-off-by: Kirill Tkhai 
---
 net/core/sock_diag.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index 217f4e3b82f6..220130aee51d 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -328,6 +328,7 @@ static void __net_exit diag_net_exit(struct net *net)
 static struct pernet_operations diag_net_ops = {
.init = diag_net_init,
.exit = diag_net_exit,
+   .async = true,
 };
 
 static int __init sock_diag_init(void)



[PATCH v2 30/31] net: Convert default_device_ops

2017-11-20 Thread Kirill Tkhai
These pernet operations consist of exit() and exit_batch() methods.

default_device_exit() moves non-local and virtual devices to init_net.
There is nothing exciting here, because this may happen at any time
on a working system, and rtnl_lock() and synchronize_net() protect
us from all cases of external dereference.

The same goes for default_device_exit_batch(). Similar unregistration
may happen at any time on a system. There are several lists here (like
todo_list), which are accessed under rtnl_lock(). After rtnl_unlock() and
netdev_run_todo() all the devices are flushed.

Signed-off-by: Kirill Tkhai 
---
 net/core/dev.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 41a576a17430..914fdb260aae 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8757,6 +8757,7 @@ static void __net_exit default_device_exit_batch(struct 
list_head *net_list)
 static struct pernet_operations __net_initdata default_device_ops = {
.exit = default_device_exit,
.exit_batch = default_device_exit_batch,
+   .async = true,
 };
 
 /*



[PATCH v2 27/31] net: Convert ipv4_sysctl_ops

2017-11-20 Thread Kirill Tkhai
These pernet_operations create and destroy sysctls,
which are not touched by anybody else.

Signed-off-by: Kirill Tkhai 
---
 net/ipv4/sysctl_net_ipv4.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 93e172118a94..89683d868b37 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1219,6 +1219,7 @@ static __net_exit void ipv4_sysctl_exit_net(struct net 
*net)
 static __net_initdata struct pernet_operations ipv4_sysctl_ops = {
.init = ipv4_sysctl_init_net,
.exit = ipv4_sysctl_exit_net,
+   .async = true,
 };
 
 static __init int sysctl_ipv4_init(void)



[PATCH v2 23/31] net: Convert sysctl_core_ops

2017-11-20 Thread Kirill Tkhai
These pernet_operations register and destroy a sysctl
directory, which is of no interest to foreign
pernet_operations.

Signed-off-by: Kirill Tkhai 
---
 net/core/sysctl_net_core.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cbc3dde4cfcc..1f8c94d726da 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -520,6 +520,7 @@ static __net_exit void sysctl_core_net_exit(struct net *net)
 static __net_initdata struct pernet_operations sysctl_core_ops = {
.init = sysctl_core_net_init,
.exit = sysctl_core_net_exit,
+   .async = true,
 };
 
 static __init int sysctl_core_init(void)



[PATCH v2 21/31] net: Convert genl_pernet_ops

2017-11-20 Thread Kirill Tkhai
These pernet_operations create and destroy net::genl_sock.
Foreign pernet_operations don't touch it.

Signed-off-by: Kirill Tkhai 
---
 net/netlink/genetlink.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index d444daf1ac04..a66fad4c5ffa 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -1035,6 +1035,7 @@ static void __net_exit genl_pernet_exit(struct net *net)
 static struct pernet_operations genl_pernet_ops = {
.init = genl_pernet_init,
.exit = genl_pernet_exit,
+   .async = true,
 };
 
 static int __init genl_init(void)



[PATCH v2 01/31] net: Assign net to net_namespace_list in setup_net()

2017-11-20 Thread Kirill Tkhai
This patch merges two repeated pieces of code into one,
which now lives in setup_net().

It acts as a cleanup, even though the init_net_initialized
assignment is now reordered with the linking of the net.
This variable is needed for proc_net_init() called from:

start_kernel()->proc_root_init()->proc_net_init(),

which can't race with net_ns_init(), called from
initcall.
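
The shape of the merged code path, where linking happens inside the setup helper and only on success, can be sketched like this. It is a simplified user-space model: `struct model_ns`, the `fail_init` knob and the singly linked list are hypothetical, the list is pushed at the head rather than the tail, and the rtnl_lock()/RCU protection of the real list is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

struct model_ns {
	struct model_ns *next;
	int fail_init;	/* test knob: make initialization fail */
};

static struct model_ns *model_ns_list;	/* hypothetical global list */

/* Initialize a namespace and, only on success, publish it on the list. */
int model_setup_and_link(struct model_ns *ns)
{
	if (ns->fail_init)
		return -1;	/* nothing was linked, nothing to undo */

	ns->next = model_ns_list;
	model_ns_list = ns;
	return 0;
}

/* Count the namespaces that have been published so far. */
size_t model_ns_count(void)
{
	size_t n = 0;

	for (const struct model_ns *p = model_ns_list; p; p = p->next)
		n++;
	return n;
}
```

With the linking folded into the helper, every caller gets the same "link on success" behaviour without repeating the block.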

Signed-off-by: Kirill Tkhai 
---
 net/core/net_namespace.c |   13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index b797832565d3..7ecf71050ffa 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -296,6 +296,9 @@ static __net_init int setup_net(struct net *net, struct 
user_namespace *user_ns)
if (error < 0)
goto out_undo;
}
+   rtnl_lock();
+   list_add_tail_rcu(&net->list, &net_namespace_list);
+   rtnl_unlock();
 out:
return error;
 
@@ -417,11 +420,6 @@ struct net *copy_net_ns(unsigned long flags,
 
net->ucounts = ucounts;
rv = setup_net(net, user_ns);
-   if (rv == 0) {
-   rtnl_lock();
-   list_add_tail_rcu(&net->list, &net_namespace_list);
-   rtnl_unlock();
-   }
mutex_unlock(&net_mutex);
if (rv < 0) {
dec_net_namespaces(ucounts);
@@ -847,11 +845,6 @@ static int __init net_ns_init(void)
panic("Could not setup the initial network namespace");
 
init_net_initialized = true;
-
-   rtnl_lock();
-   list_add_tail_rcu(&init_net.list, &net_namespace_list);
-   rtnl_unlock();
-
mutex_unlock(&net_mutex);
 
register_pernet_subsys(_ns_ops);



Re: [PATCH net,stable] net: qmi_wwan: add Quectel BG96 2c7c:0296

2017-11-20 Thread Bjørn Mork
Sebastian Sjoholm  writes:

> Quectel BG96 is a Qualcomm MDM9206 based IoT modem, supporting both 
> CAT-M and NB-IoT. Tested hardware is BG96 mounted on Quectel development 
> board (EVB). The USB id is added to qmi_wwan.c to allow QMI 
> communication with the BG96.
>
> Signed-off-by: Sebastian Sjoholm 

Perfect.  Thanks.

Acked-by: Bjørn Mork 


Re: [PATCH][v4] uprobes/x86: emulate push insns for uprobe on x86

2017-11-20 Thread Yonghong Song



On 11/20/17 8:41 AM, Oleg Nesterov wrote:
> On 11/17, Yonghong Song wrote:
>>
>> On 11/17/17 9:25 AM, Oleg Nesterov wrote:
>>> On 11/15, Yonghong Song wrote:
>>>>
>>>> v3 -> v4:
>>>>    . Revert most of v3 change as 32bit emulation is not really working
>>>>      on x86_64 platform as among other issues, function emulate_push_stack()
>>>>      needs to account for 32bit app on 64bit platform.
>>>>      A separate effort is ongoing to address this issue.
>>>
>>> Reviewed-by: Oleg Nesterov
>>>
>>> Please test your patch with the fix below, in this particular case the
>>> TIF_IA32 check should be fine. Although this is not what we really want,
>>> we should probably use user_64bit_mode(regs) which checks ->cs. But this
>>> needs more changes and doesn't solve other problems (get_unmapped_area)
>>> so I still can't decide what should we do right now...
>>
>> I tested the below change with my patch. On x86_64, both 64bit and 32bit
>> program can be uprobe emulated properly.
>
> Good, so your patch is fine.

Thanks!

>> On x86_32, however, there is a
>> compilation error like below:
>
> Yes, yes, when I said "in this particular case" I meant x86_64 system only.
>
> Sorry for confusion, I asked you to test this additional change just to
> ensure that we didn't miss something and your patch has no problems with
> 32bit tasks on 64bit system, except those we need to fix anyway.

Understood. I actually tried a little to see whether I could have a
simple way to fix 32bit compilation error without using ugly "#ifdef
CONFIG_X86_64". Maybe is_64bit_mm is a good choice. But we could defer
this until you have a comprehensive fix for 32bit app uprobe on 64bit
systems as there are multiple issues for this.

> Oleg.



Re: [PATCH RFC 0/5] Support asynchronous crypto for IPsec GSO.

2017-11-20 Thread John Fastabend
On 11/20/2017 05:09 AM, David Miller wrote:
> From: Steffen Klassert 
> Date: Mon, 20 Nov 2017 08:37:47 +0100
> 
>> This patchset implements asynchronous crypto handling
>> in the layer 2 TX path. With this we can allow IPsec
>> ESP GSO for software crypto. This also merges the IPsec
>> GSO and non-GSO paths to both use validate_xmit_xfrm().
>  ...
> 
> Code looks generally fine to me.  Only thing of note is that this
> adds a new dev_requeue_skb() call site and that might intersect with
> John Fastabend's RFC work to make qdiscs lockless.

Right, fortunately it doesn't appear that the conflicts are too
hard to resolve. It looks like sch_direct_xmit() will need to be
resolved with an additional check to test if the qdisc lock is
needed. Then assuming my reading is correct validate_xmit_xfrm(),
xfrm_dev_resume(), and xfrm_dev_backlog() can run without qdisc
lock already.

Thanks,
John

> 
> Also, please adhere to the reverse christmas tree rule for the
> ordering of your function local variables.
> 
> Thank you.
> 
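For readers unfamiliar with the "reverse christmas tree" convention mentioned above: local variable declarations are ordered from the longest line to the shortest. Below is a minimal userspace sketch of that layout. The function and variable names are made up for illustration and are not taken from the xfrm patchset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of any kernel patch. It only exists to
 * show the declaration layout: longest declaration first, shortest last.
 */
size_t count_matching_bytes(const unsigned char *buf, size_t len,
			    unsigned char wanted)
{
	const unsigned char *end = buf + len;
	const unsigned char *cursor = buf;
	size_t matches = 0;

	assert(buf != NULL);

	while (cursor < end) {
		if (*cursor == wanted)
			matches++;
		cursor++;
	}
	return matches;
}
```

The ordering is purely cosmetic and has no effect on generated code; it is a netdev review preference for readability.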



[PATCH net,stable] net: qmi_wwan: add Quectel BG96 2c7c:0296

2017-11-20 Thread Sebastian Sjoholm
Quectel BG96 is a Qualcomm MDM9206 based IoT modem, supporting both 
CAT-M and NB-IoT. Tested hardware is BG96 mounted on Quectel development 
board (EVB). The USB id is added to qmi_wwan.c to allow QMI 
communication with the BG96.

Signed-off-by: Sebastian Sjoholm 

---
 drivers/net/usb/qmi_wwan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 720a3a248070..c750cf7c042b 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -1239,6 +1239,7 @@ static const struct usb_device_id products[] = {
{QMI_FIXED_INTF(0x1e0e, 0x9001, 5)},/* SIMCom 7230E */
{QMI_QUIRK_SET_DTR(0x2c7c, 0x0125, 4)}, /* Quectel EC25, EC20 R2.0  
Mini PCIe */
{QMI_QUIRK_SET_DTR(0x2c7c, 0x0121, 4)}, /* Quectel EC21 Mini PCIe */
+   {QMI_FIXED_INTF(0x2c7c, 0x0296, 4)},/* Quectel BG96 */
 
/* 4. Gobi 1000 devices */
{QMI_GOBI1K_DEVICE(0x05c6, 0x9212)},/* Acer Gobi Modem Device */
-- 
2.11.0 (Apple Git-81)



Re: [PATCH] net: sched: crash on blocks with goto chain action

2017-11-20 Thread Cong Wang
On Sun, Nov 19, 2017 at 8:17 AM, Roman Kapl  wrote:
> tcf_block_put_ext has assumed that all filters (and thus their goto
> actions) are destroyed in RCU callback and thus can not race with our
> list iteration. However, that is not true during netns cleanup (see
> tcf_exts_get_net comment).
>
> Prevent the use after free by holding the current list element we are
> iterating over (foreach_safe is not enough).

Hmm...

Looks like we need to restore the trick we used previously, that is
holding refcnt for all list entries before this list iteration.
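To make the "hold a refcnt on all list entries first" idea concrete, here is a toy userspace sketch. It is not the kernel's tcf_chain or tcf_block code: the struct, its fields and every function name are hypothetical stand-ins, and "destroy" only sets a flag so the result can be inspected.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a refcounted list element; NOT the kernel's tcf_chain. */
struct toy_chain {
	struct toy_chain *next;		/* list linkage */
	struct toy_chain *goto_target;	/* chain referenced by a "goto" action */
	int refcnt;
	int destroyed;			/* set instead of freeing memory */
};

static void toy_chain_hold(struct toy_chain *c)
{
	c->refcnt++;
}

static void toy_chain_put(struct toy_chain *c)
{
	assert(c->refcnt > 0);
	if (--c->refcnt == 0)
		c->destroyed = 1;
}

/* Dropping a chain's actions releases the reference on its goto target. */
static void toy_flush_actions(struct toy_chain *c)
{
	if (c->goto_target) {
		toy_chain_put(c->goto_target);
		c->goto_target = NULL;
	}
}

/* Pass 1 pins every element, so flushing one chain cannot destroy a
 * later one while we still have to walk over it. Pass 2 does the
 * teardown and releases the pins. Returns the number of chains visited.
 */
int toy_block_put(struct toy_chain *head)
{
	struct toy_chain *c, *next;
	int visited = 0;

	for (c = head; c; c = c->next)
		toy_chain_hold(c);

	for (c = head; c; c = next) {
		next = c->next;	/* still pinned by pass 1 */
		toy_flush_actions(c);
		toy_chain_put(c);
		visited++;
	}
	return visited;
}
```

Without the first pass, flushing the first chain in a list where it is the only holder of a reference on the second would drop the second chain to zero while the iterator still points at it.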


Re: len = bpf_probe_read_str(); bpf_perf_event_output(... len) == FAIL

2017-11-20 Thread Yonghong Song



On 11/20/17 5:31 AM, Arnaldo Carvalho de Melo wrote:

Em Tue, Nov 14, 2017 at 09:25:17PM +0100, Daniel Borkmann escreveu:

On 11/14/2017 07:15 PM, Yonghong Song wrote:

On 11/14/17 6:19 AM, Daniel Borkmann wrote:

On 11/14/2017 02:42 PM, Arnaldo Carvalho de Melo wrote:

Em Tue, Nov 14, 2017 at 02:09:34PM +0100, Daniel Borkmann escreveu:

On 11/14/2017 01:58 PM, Arnaldo Carvalho de Melo wrote:

Em Tue, Nov 14, 2017 at 01:09:39AM +0100, Daniel Borkmann escreveu:

On 11/13/2017 04:08 PM, Arnaldo Carvalho de Melo wrote:

libbpf: -- BEGIN DUMP LOG ---
libbpf:
0: (79) r3 = *(u64 *)(r1 +104)
1: (b7) r2 = 0
2: (bf) r6 = r1
3: (bf) r1 = r10
4: (07) r1 += -128
5: (b7) r2 = 128
6: (85) call bpf_probe_read_str#45
7: (bf) r1 = r0
8: (07) r1 += -1
9: (67) r1 <<= 32
10: (77) r1 >>= 32
11: (25) if r1 > 0x7f goto pc+11


Right, so the compiler is optimizing the two tests into a single one above,
which means lower bound cannot properly be derived again by the verifier due
to this and thus you'll get the error. Similar issue was seen recently [1].

Does the below hack work for you?

int prog([...])
{
  char filename[128];
  int ret = bpf_probe_read_str(filename, sizeof(filename), filename_ptr);
  if (ret > 0)
      bpf_perf_event_output(ctx, &__bpf_stdout__, BPF_F_CURRENT_CPU,
                            filename, ret & (sizeof(filename) - 1));
  return 1;
}
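As a side note, the effect of the `ret & (sizeof(filename) - 1)` mask can be checked in plain userspace C. This is only a sketch of the arithmetic, not BPF code: `BUF_SIZE` and `bounded_len()` are names invented here, and the mask only works because the buffer size is a power of two.

```c
#include <assert.h>

#define BUF_SIZE 128	/* must be a power of two for the mask to work */

/* Clamp a helper's return value the way the masked call does.
 * Errors and empty results map to 0; anything else lands in
 * [0, BUF_SIZE - 1], which gives an explicit upper bound.
 */
unsigned int bounded_len(int ret)
{
	assert((BUF_SIZE & (BUF_SIZE - 1)) == 0);

	if (ret <= 0)
		return 0;
	return (unsigned int)ret & (BUF_SIZE - 1);
}
```

One consequence worth keeping in mind: a return value exactly equal to the buffer size is masked to 0, so a completely filled buffer would be reported as empty.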

r0 should keep on tracking bounds here at least:

prog:
     0:    bf 16 00 00 00 00 00 00 r6 = r1
     1:    bf a1 00 00 00 00 00 00 r1 = r10
     2:    07 01 00 00 80 ff ff ff r1 += -128
     3:    b7 02 00 00 80 00 00 00 r2 = 128
     4:    85 00 00 00 2d 00 00 00 call 45
     5:    67 00 00 00 20 00 00 00 r0 <<= 32
     6:    c7 00 00 00 20 00 00 00 r0 s>>= 32
     7:    b7 01 00 00 01 00 00 00 r1 = 1
     8:    6d 01 0a 00 00 00 00 00 if r1 s> r0 goto 10
     9:    57 00 00 00 7f 00 00 00 r0 &= 127
    10:    bf a4 00 00 00 00 00 00 r4 = r10
    11:    07 04 00 00 80 ff ff ff r4 += -128
    12:    bf 61 00 00 00 00 00 00 r1 = r6
    13:    18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0ll
    15:    18 03 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 r3 = 4294967295ll
    17:    bf 05 00 00 00 00 00 00 r5 = r0
    18:    85 00 00 00 19 00 00 00 call 25

    [1] http://patchwork.ozlabs.org/project/netdev/list/?series=13211


Not yet:

6: (85) call bpf_probe_read_str#45
7: (bf) r1 = r0
8: (67) r1 <<= 32
9: (77) r1 >>= 32
10: (15) if r1 == 0x0 goto pc+10
   R0=inv(id=0) R1=inv(id=0,umax_value=4294967295,var_off=(0x0; 0x)) 
R6=ctx(id=0,off=0,imm=0) R10=fp0
11: (57) r0 &= 127
12: (bf) r4 = r10
13: (07) r4 += -128
14: (bf) r1 = r6
15: (18) r2 = 0x92bfc2aba840u
17: (18) r3 = 0x
19: (bf) r5 = r0
20: (85) call bpf_perf_event_output#25
invalid stack type R4 off=-128 access_size=0

I'll try updating clang/llvm...

Full details:

[root@jouet bpf]# cat open.c
#include "bpf.h"

SEC("prog=do_sys_open filename")
int prog(void *ctx, int err, const char __user *filename_ptr)
{
 char filename[128];
 const unsigned len = bpf_probe_read_str(filename, sizeof(filename), filename_ptr);


Btw, I was using 'int' here above instead of 'unsigned' as strncpy_from_unsafe()
could potentially return errors like -EFAULT.
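A small userspace illustration of why the type matters, assuming only that the helper can return a negative errno such as -EFAULT. `len_as_unsigned()` is a name invented for this sketch and is not a BPF helper.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Returns the value an 'unsigned' variable ends up holding after being
 * assigned a (possibly negative) helper return value.
 */
unsigned int len_as_unsigned(int ret)
{
	unsigned int len = ret;	/* implicit conversion, as with 'const unsigned len = ...' */

	/* A negative return can never look negative afterwards; it turns
	 * into a value above INT_MAX instead.
	 */
	assert(!(ret < 0) || len > (unsigned int)INT_MAX);
	return len;
}
```

So with `unsigned`, a check like `len < 0` can never fire, and a failed read shows up as a very large length rather than as an error.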


I changed to int, didn't help
  

Currently having a version compiled from the git tree:

# llc --version
LLVM (http://llvm.org/):
    LLVM version 6.0.0git-2d810c2
    Optimized build.
    Default target: x86_64-unknown-linux-gnu
    Host CPU: skylake


[root@jouet bpf]# llc --version
LLVM (http://llvm.org/):
    LLVM version 4.0.0svn

Old stuff! ;-) Will change, but improving these messages should be on
the radar, I think :-)


Yep, agree, I think we need a generic, better solution for this type of
issue instead of converting individual helpers to handle 0 min bound and
then only bailing out in such case; need to brainstorm a bit on that.

I think for the above in your case ...

   [...]
    6: (85) call bpf_probe_read_str#45
    7: (bf) r1 = r0
    8: (67) r1 <<= 32
    9: (77) r1 >>= 32
   10: (15) if r1 == 0x0 goto pc+10
    R0=inv(id=0) R1=inv(id=0,umax_value=4294967295,var_off=(0x0; 0x)) 
R6=ctx(id=0,off=0,imm=0) R10=fp0
   11: (57) r0 &= 127
   [...]

... the shifts on r1 might 

Re: [PATCH][v4] uprobes/x86: emulate push insns for uprobe on x86

2017-11-20 Thread Oleg Nesterov
On 11/17, Yonghong Song wrote:
>
> On 11/17/17 9:25 AM, Oleg Nesterov wrote:
> >On 11/15, Yonghong Song wrote:
> >>
> >>v3 -> v4:
> >>   . Revert most of v3 change as 32bit emulation is not really working
> >> on x86_64 platform as among other issues, function emulate_push_stack()
> >> needs to account for 32bit app on 64bit platform.
> >> A separate effort is ongoing to address this issue.
> >
> >Reviewed-by: Oleg Nesterov 
> >
> >
> >
> >Please test your patch with the fix below, in this particular case the
> >TIF_IA32 check should be fine. Although this is not what we really want,
> >we should probably use user_64bit_mode(regs) which checks ->cs. But this
> >needs more changes and doesn't solve other problems (get_unmapped_area)
> >so I still can't decide what should we do right now...
>
> I tested the below change with my patch. On x86_64, both 64bit and 32bit
> program can be uprobe emulated properly.

Good, so your patch is fine.

> On x86_32, however, there is a
> compilation error like below:

Yes, yes, when I said "in this particular case" I meant x86_64 system only.

Sorry for confusion, I asked you to test this additional change just to
ensure that we didn't miss something and your patch has no problems with
32bit tasks on 64bit system, except those we need to fix anyway.

Oleg.



Re: [E1000-devel] Questions about crashes and GRO

2017-11-20 Thread Alexander Duyck
Hi Sarah,

I am adding the netdev mailing list as I am not certain this is an
i350 specific issue. The traces themselves aren't anything I recognize
as an existing issue. From what I can tell it looks like you are
running Xen, so would I be correct in assuming you are bridging
between VMs? If so, are you using any sort of tunnels on your network,
and if so, what type? This information would be useful as we may be
looking at a bug in a tunnel offload for GRO.

On Fri, Nov 17, 2017 at 3:28 PM, Sarah Newman  wrote:
> Hi,
>
> I have an X10 supermicro with two I350's that has crashed twice now under 
> v4.9.39 within the last 3 weeks, with no crashes before v4.9.39:

What was the last kernel you tested before v4.9.39? Just wondering as
it will help to rule out certain patches as possibly being the issue.

> $ /sbin/lspci | grep -i ethernet
> 02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
> 02:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
> 04:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
> 04:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
>
> And some X9 supermicro's that have not crashed, with a single I350 I believe:
> $ /sbin/lspci | grep -i ethernet
> 06:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
> 06:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
> 06:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
> 06:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
>
> I see in the release notes 
> https://downloadmirror.intel.com/22919/eng/README.txt " Do Not Use LRO When 
> Routing Packets."
>
> We are bridging traffic, not routing, and the crashes are in the GRO code.
>
> Is it possible there are problems with GRO for bridging in the igb driver 
> now? If I disable GRO can I have some confidence it will fix the issue?

As far as LRO not being used when routing, just so you know LRO and
GRO are two very different things. One of the issues with LRO is that
it wasn't reversible in some cases and so could lead to packets
being changed if they were rerouted. With GRO that shouldn't be the
case as we should be able to get back out the original packets that
were put into a frame. So there shouldn't be any issues using GRO with
bridging or routing.

GRO isn't in the driver. It is in the network stack of the kernel
itself. The only responsibility of igb is to provide the frames in the
correct format so that they can be assembled by GRO if it is enabled.

> Here are my offload settings:
> Features for eth0:
> rx-checksumming: on
> tx-checksumming: on
> tx-checksum-ipv4: off [fixed]
> tx-checksum-ip-generic: on
> tx-checksum-ipv6: off [fixed]
> tx-checksum-fcoe-crc: off [fixed]
> tx-checksum-sctp: on
> scatter-gather: on
> tx-scatter-gather: on
> tx-scatter-gather-fraglist: off [fixed]
> tcp-segmentation-offload: on
> tx-tcp-segmentation: on
> tx-tcp-ecn-segmentation: off [fixed]
> tx-tcp-mangleid-segmentation: off
> tx-tcp6-segmentation: on
> udp-fragmentation-offload: off [fixed]
> generic-segmentation-offload: on
> generic-receive-offload: on
> large-receive-offload: off [fixed]
> rx-vlan-offload: on
> tx-vlan-offload: on
> ntuple-filters: off
> receive-hashing: on
> highdma: on [fixed]
> rx-vlan-filter: on [fixed]
> vlan-challenged: off [fixed]
> tx-lockless: off [fixed]
> netns-local: off [fixed]
> tx-gso-robust: off [fixed]
> tx-fcoe-segmentation: off [fixed]
> tx-gre-segmentation: on
> tx-gre-csum-segmentation: on
> tx-ipxip4-segmentation: on
> tx-ipxip6-segmentation: on
> tx-udp_tnl-segmentation: on
> tx-udp_tnl-csum-segmentation: on
> tx-gso-partial: on
> tx-sctp-segmentation: off [fixed]
> fcoe-mtu: off [fixed]
> tx-nocache-copy: off
> loopback: off [fixed]
> rx-fcs: off [fixed]
> rx-all: off
> tx-vlan-stag-hw-insert: off [fixed]
> rx-vlan-stag-hw-parse: off [fixed]
> rx-vlan-stag-filter: off [fixed]
> l2-fwd-offload: off [fixed]
> busy-poll: off [fixed]
> hw-tc-offload: off [fixed]
>
> First crash:
>
> [4083386.299221] ------------[ cut here ]------------
> [4083386.299358] WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:1473 inet_gro_complete+0xbb/0xd0
> [4083386.299520] Modules linked in: sb_edac edac_core 8021q mrp garp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev ip6table_filter ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gntalloc xenfs xen_privcmd xen_evtchn xen_blkback tun sch_htb fuse ext2 ebt_mark ebt_ip ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw br_netfilter bridge stp llc iTCO_wdt iTCO_vendor_support pcspkr raid456 async_raid6_recov async_pq async_xor xor async_memcpy async_tx raid10 raid6_pq libcrc32c 

Re: [PATCH 5/8] crypto: remove unused hardirq.h

2017-11-20 Thread Yang Shi

The email to Herbert bounced, so I am resending it.

Yang


On 11/17/17 3:02 PM, Yang Shi wrote:

Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by crypto at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: linux-cry...@vger.kernel.org
---
  crypto/ablk_helper.c | 1 -
  crypto/blkcipher.c   | 1 -
  crypto/mcryptd.c | 1 -
  3 files changed, 3 deletions(-)

diff --git a/crypto/ablk_helper.c b/crypto/ablk_helper.c
index 1441f07..ee52660 100644
--- a/crypto/ablk_helper.c
+++ b/crypto/ablk_helper.c
@@ -28,7 +28,6 @@
  #include 
  #include 
  #include 
-#include 
  #include 
  #include 
  #include 
diff --git a/crypto/blkcipher.c b/crypto/blkcipher.c
index 6c43a0a..01c0d4a 100644
--- a/crypto/blkcipher.c
+++ b/crypto/blkcipher.c
@@ -18,7 +18,6 @@
  #include 
  #include 
  #include 
-#include 
  #include 
  #include 
  #include 
diff --git a/crypto/mcryptd.c b/crypto/mcryptd.c
index 4e64726..9fa362c 100644
--- a/crypto/mcryptd.c
+++ b/crypto/mcryptd.c
@@ -26,7 +26,6 @@
  #include 
  #include 
  #include 
-#include 
  
  #define MCRYPTD_MAX_CPU_QLEN 100

  #define MCRYPTD_BATCH 9



Re: wcn36xx: fix iris child-node lookup

2017-11-20 Thread Kalle Valo
Johan Hovold  wrote:

> Fix child-node lookup during probe, which ended up searching the whole
> device tree depth-first starting at the parent rather than just matching
> on its children.
> 
> To make things worse, the parent mmio node was also prematurely freed.
> 
> Fixes: fd52bdae9ab0 ("wcn36xx: Disable 5GHz for wcn3620")
> Cc: Loic Poulain 
> Signed-off-by: Johan Hovold 
> Signed-off-by: Kalle Valo 

Patch applied to ath-current branch of ath.git, thanks.

1967c12896e0 wcn36xx: fix iris child-node lookup

-- 
https://patchwork.kernel.org/patch/10054441/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches



pull-request: mac80211 2017-11-20

2017-11-20 Thread Johannes Berg
Hi Dave,

Sorry this is coming now, I had a super hectic time after travel
since I had to catch up after two weeks of being away ...

That's not really an excuse, but I'm asking you anyway to pull
Kees's timer conversions with the fixes since he really wants
to make more changes on top and clean this up properly. I've had
those pending, but in the hectic week after coming back didn't
send the mac80211-next pull request that I should have done.

If you don't want that, let me know and I'll respin without it.

Otherwise, please pull and let me know if there's any problem.

Thanks,
johannes



The following changes since commit 32a72bbd5da2411eab591bf9bc2e39349106193a:

  net: vxge: Fix some indentation issues (2017-11-20 11:36:30 +0900)

are available in the git repository at:

  ssh://korg/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2017-11-20

for you to fetch changes up to 33ddd81e2bd5d9970b9f01ab383ba45035fa41ee:

  mac80211: properly free requested-but-not-started TX agg sessions (2017-11-20 17:01:31 +0100)


A few things:
 * straggler timer conversions from Kees
 * memory leak fix in hwsim
 * fix some fallout from regdb changes if wireless is built-in
 * also free aggregation sessions in startup state when station
   goes away, to avoid crashing the timer


Ben Hutchings (1):
  mac80211_hwsim: Fix memory leak in hwsim_new_radio_nl()

Johannes Berg (3):
  nl80211: don't expose wdev->ssid for most interfaces
  cfg80211: initialize regulatory keys/database later
  mac80211: properly free requested-but-not-started TX agg sessions

Kees Cook (2):
  mac80211: Convert timers to use timer_setup()
  mac80211: aggregation: Convert timers to use timer_setup()

 drivers/net/wireless/mac80211_hwsim.c |  5
 net/mac80211/agg-rx.c                 | 41
 net/mac80211/agg-tx.c                 | 49
 net/mac80211/ibss.c                   |  7
 net/mac80211/ieee80211_i.h            |  3
 net/mac80211/led.c                    | 11
 net/mac80211/main.c                   |  3
 net/mac80211/mesh.c                   | 27
 net/mac80211/mesh.h                   |  2
 net/mac80211/mesh_hwmp.c              |  4
 net/mac80211/mesh_pathtbl.c           |  3
 net/mac80211/mlme.c                   | 32
 net/mac80211/ocb.c                    | 10
 net/mac80211/sta_info.c               | 15
 net/mac80211/sta_info.h               | 12
 net/wireless/nl80211.c                | 26
 net/wireless/reg.c                    | 42
 17 files changed, 155 insertions(+), 137 deletions(-)


Request for queuing to stable: pci id's

2017-11-20 Thread Ganesh Goudar
Hi David,

Could you please queue the commits below for 4.9 stable?
They add the PCI IDs of T5 and T6 cards, and they apply
as is.

5d071c24f0cb8ce9fb5642c2a65ab5ab7f5ad244
29db39841896de99dcb3b1deaed61a13cb9d8036
12eb070babbcab4b003e060933971089864a6a54
89ff67718c900754d2aa5c8e37efbe607be36154
803d5b6ebfb06a0d2ee3699fea4f1c7593958566
34929cb4d691f7f9e217ba0e3f536978cd56aa6c
acd669a8f67ed47f5edd385741486cc7a259a446
652faa98ec383c25296fb8493f17060a2c7e3438
36bf994a80571aeee2549db1bc93e34342f40c24

Thanks
Ganesh


Re: Linux ECN Handling

2017-11-20 Thread Eric Dumazet
On Mon, Nov 20, 2017 at 7:01 AM, Neal Cardwell  wrote:
> Going back to one of your Oct 19 trace snapshots (attached), AFAICT at the
> time of the timeout there is actually almost 64KBytes  (352553398 + 1448 -
> 352489686 = 65160) of unacknowledged data. So there really does seem to be a
> significant chunk of packets that were in-flight that were then declared
> lost.
>
> So here is a possibility: perhaps the combination of CWR+PRR plus
> tcp_tso_should_defer() means that PRR can make cwnd so gentle that
> tcp_tso_should_defer() thinks we should wait for another ACK to send, and
> that ACK doesn't come. Breaking it down, the potential sequence would be:
>
> (1) tcp_write_xmit() does not send, because the CWR behavior, using PRR,
> does not leave enough cwnd for tcp_tso_should_defer() to think we should
> send (PRR was originally designed for recovery, which did not have TSO
> deferral)
>
> (2) TLP does not fire, because we are in state CWR, not Open
>
> (3) The only remaining option is an RTO, which fires.
>
> In other words, the possibility is that, at the time of the stall, the cwnd
> is reasonably high, but tcp_packets_in_flight() is also quite high, so
> either there is (a) literally no unused cwnd left ( tcp_packets_in_flight()
> == cwnd), or (b) some mechanism like tcp_tso_should_defer() is deciding that
> there is not enough available cwnd for it to make sense to chop off a
> fraction of a TSO skb to send now.
>
> One way to test that conjecture would be to disable tcp_tso_should_defer()
> by adding a:
>
>goto send_now;
>
> at the top of tcp_tso_should_defer().
>
> If that doesn't prevent the freezes then I would recommend adding printks or
> other instrumentation to  tcp_write_xmit() to log:
>
> - time
> - ca_state
> - cwnd
> - ssthresh
> - tcp_packets_in_flight()
> - the reason for breaking out of the tcp_write_xmit() loop (tso deferral, no
> packets left, tcp_snd_wnd_test, tcp_nagle_test, etc)
>
> cheers,
> neal
>
>
>
> On Mon, Nov 20, 2017 at 2:31 AM, Steve Ibanez  wrote:
>>
>> Hi Folks,
>>
>> I wanted to check back in on this for another update and to solicit
>> some more suggestions. I did a bit more digging to try to isolate the
>> problem.
>>
>> As I explained earlier, the log generated by tcp_probe indicates that
>> the snd_cwnd is set to 1 just before the end host receives an ECN
>> marked ACK and unexpectedly enters a timeout (
>> https://drive.google.com/open?id=1iyt8PvBxQga2jpRpBJ8KdQw3Q_mPTzZF ).
>> I was trying to track down where this is happening, but the only place
>> I could find that might be setting the snd_cwnd to 1 is in the
>> tcp_enter_loss() function. I inserted a printk() call in this function
>> to see when it is being invoked and it looks like it is only called by
>> the tcp_retransmit_timer() function after the timer expires.
>>
>> I decided to try recording the snd_cwnd, ss-thresh, and icsk_ca_state
>> inside the tcp_fastretrans_alert() function whenever it processes an
>> ECN marked ACK (
>> https://drive.google.com/open?id=17GD77lb9lkCSu0_s9p40GZ5r4EU8B4VB )
>> This plot also shows when the tcp_retransmit_timer() and
>> tcp_enter_loss() functions are invoked (red and purple dots
>> respectively). And I see that the ACK state machine is always either
>> in the TCP_CA_Open or TCP_CA_CWR state whenever the
>> tcp_fastretrans_alert() function processes ECN marked ACKs (
>> https://drive.google.com/open?id=1xwuPxjgwriT9DSblFx2uILfQ95Fy-Eqq ).
>> So I'm not sure where the snd_cwnd is being set to 1 (or possibly 0 as
>> Neal suggested) just before entering a timeout. Any suggestions here?
>>
>> In order to do a bit of profiling of the tcp_dctcp code I added
>> support into tcp_probe for recording the dctcp alpha parameter. I see
>> that alpha oscillates around about 0.1 when the flow rates have
>> converged, it goes to zero when the other host enters a timeout, and I
>> don't see any unexpected behavior just before the timeout (
>> https://drive.google.com/open?id=1zPdyS57TrUYZIekbid9p1UNyraLYrdw7 ).
>>
>> So I haven't had much luck yet trying to track down where the problem
>> is. If you have any suggestions that would help me to focus my search
>> efforts, I would appreciate the comments.
>>
>> Thanks!
>> -Steve

Steve, what HZ value is your kernel compiled with?


Re: [kernel-hardening] [PATCH v4] scripts: add leaking_addresses.pl

2017-11-20 Thread Petr Mladek
On Mon 2017-11-13 11:16:28, kaiwan.billimo...@gmail.com wrote:
> On Mon, 2017-11-13 at 09:21 +1100, Tobin C. Harding wrote:
> > On Fri, Nov 10, 2017 at 07:26:34PM +0530, kaiwan.billimo...@gmail.com
> > >  - it currently hard-codes a global 'PAGE_OFFSET_32BIT=0xc000', just
> > >  so I can test quickly; must figure whether to query it or pass it;
> > >  Suggestions?
> > 
> > Perhaps we should have a command line option for this.
> > 
> > --kernel-base-address
> 
> Why not just detect it programmatically? We could devise a series of
> fallbacks; something like:
> - if .config exists in the kernel source tree root, grep it for
> PAGE_OFFSET
> - if not, grep the arch-specific (arch//configs/)
> for the same
> - if for some reason we don't have enough info regarding specific
> platform and thus the defconfig filename (could happen for ARM, PPC?),
> we then fail and request the user to pass it as a parameter.

You might also check /proc/config.gz.
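Whichever config source ends up being used, the lookup itself is a line scan. Here is a rough C sketch of that step, assuming the config text is already in memory (/proc/config.gz would have to be decompressed first) and that the option appears as a `CONFIG_PAGE_OFFSET=<value>` line, which is an assumption that does not hold on every architecture. `find_page_offset()` is a hypothetical name and is not part of leaking_addresses.pl.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Look for a "CONFIG_PAGE_OFFSET=<value>" line in a kernel config passed
 * as one NUL-terminated string. Returns 0 and stores the value on
 * success, -1 if there is no such line or the value does not parse.
 */
int find_page_offset(const char *config_text, unsigned long long *value)
{
	static const char key[] = "CONFIG_PAGE_OFFSET=";
	const size_t key_len = sizeof(key) - 1;
	const char *line = config_text;
	char *end;

	assert(config_text && value);

	while (line && *line) {
		if (strncmp(line, key, key_len) == 0) {
			errno = 0;
			*value = strtoull(line + key_len, &end, 0);
			if (errno || end == line + key_len)
				return -1;
			return 0;
		}
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline to the next line */
	}
	return -1;
}
```

A caller could try each source in the order discussed above and only ask the user for the value when all of them return -1.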

Best Regards,
Petr

