date:20180831

Re: [Patch net-next] net_sched: add missing tcf_lock for act_connmark

2018-08-31 Thread David Miller

From: Cong Wang 
Date: Wed, 29 Aug 2018 10:15:36 -0700

> According to the new locking rule, we have to take tcf_lock
> for both ->init() and ->dump(), as RTNL will be removed.
> However, it is missing for act_connmark.
> 
> Cc: Vlad Buslov 
> Signed-off-by: Cong Wang 

Applied.

Re: [Patch net-next] Revert "net: sched: act: add extack for lookup callback"

2018-08-31 Thread David Miller

From: Cong Wang 
Date: Wed, 29 Aug 2018 10:15:35 -0700

> This reverts commit 331a9295de23 ("net: sched: act: add extack for lookup 
> callback").
> 
> This extack is never used after 6 months... In fact, it can be just
> set in the caller, right after ->lookup().
> 
> Cc: Alexander Aring 
> Signed-off-by: Cong Wang 

Applied.

[PATCH net-next] failover: Add missing check to validate 'slave_dev' in net_failover_slave_unregister

2018-08-31 Thread YueHaibing

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/net/net_failover.c: In function 'net_failover_slave_unregister':
drivers/net/net_failover.c:598:35: warning:
 variable 'primary_dev' set but not used [-Wunused-but-set-variable]

The warning points at a missing check: the function should validate
'slave_dev' against 'primary_dev' and 'standby_dev'.

Fixes: cfc80d9a1163 ("net: Introduce net_failover driver")
Suggested-by: Samudrala, Sridhar 
Signed-off-by: YueHaibing 
---
 drivers/net/net_failover.c | 3 +++
 1 file changed, 3 insertions(+)
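
The check this patch adds can be modelled outside the kernel. The sketch
below is a userspace illustration only: struct net_device_model,
struct failover_model and failover_slave_unregister() are hypothetical
stand-ins, not the driver's types, and the slot clearing after the check
is an assumption made to give the function something to do.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct net_device_model {
	const char *name;
};

struct failover_model {
	struct net_device_model *primary_dev;
	struct net_device_model *standby_dev;
};

/*
 * Reject a device that is neither the primary nor the standby with
 * -ENODEV before any state is touched; otherwise clear its slot.
 */
static int failover_slave_unregister(struct failover_model *fo,
				     struct net_device_model *slave)
{
	assert(fo != NULL && slave != NULL);

	if (slave != fo->primary_dev && slave != fo->standby_dev)
		return -ENODEV;

	if (slave == fo->standby_dev)
		fo->standby_dev = NULL;
	else
		fo->primary_dev = NULL;

	return 0;
}
```

Calling the function with an unrelated device leaves both slots
untouched, which is the property the early return is meant to
guarantee.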

diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
index 7ae1856..af1ece8 100644
--- a/drivers/net/net_failover.c
+++ b/drivers/net/net_failover.c
@@ -602,6 +602,9 @@ static int net_failover_slave_unregister(struct net_device *slave_dev,
nfo_info = netdev_priv(failover_dev);
primary_dev = rtnl_dereference(nfo_info->primary_dev);
standby_dev = rtnl_dereference(nfo_info->standby_dev);
+
+   if (slave_dev != primary_dev && slave_dev != standby_dev)
+   return -ENODEV;
 
vlan_vids_del_by_dev(slave_dev, failover_dev);
dev_uc_unsync(slave_dev, failover_dev);

Re: [PATCH net-next] failover: remove set but not used variable 'primary_dev'

2018-08-31 Thread YueHaibing




On 2018/9/1 0:39, Samudrala, Sridhar wrote:
> On 8/30/2018 8:46 PM, YueHaibing wrote:
>> Fixes gcc '-Wunused-but-set-variable' warning:
>>
>> drivers/net/net_failover.c: In function 'net_failover_slave_unregister':
>> drivers/net/net_failover.c:598:35: warning:
>>   variable 'primary_dev' set but not used [-Wunused-but-set-variable]
> 
> Actually this gcc option found a bug.
> We need to add this check after accessing primary_dev and standby_dev.
> 
> if (slave_dev != primary_dev && slave_dev != standby_dev)
> return -ENODEV;
> 
> Can you resubmit with the right fix?
> 

sure, thank you. will send v2

> 
>>
>> Signed-off-by: YueHaibing 
>> ---
>>   drivers/net/net_failover.c | 3 +--
>>   1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
>> index 7ae1856..e103c94e 100644
>> --- a/drivers/net/net_failover.c
>> +++ b/drivers/net/net_failover.c
>> @@ -595,12 +595,11 @@ static int net_failover_slave_pre_unregister(struct 
>> net_device *slave_dev,
>>   static int net_failover_slave_unregister(struct net_device *slave_dev,
>>struct net_device *failover_dev)
>>   {
>> -struct net_device *standby_dev, *primary_dev;
>> +struct net_device *standby_dev;
>>   struct net_failover_info *nfo_info;
>>   bool slave_is_standby;
>> nfo_info = netdev_priv(failover_dev);
>> -primary_dev = rtnl_dereference(nfo_info->primary_dev);
>>   standby_dev = rtnl_dereference(nfo_info->standby_dev);
>> vlan_vids_del_by_dev(slave_dev, failover_dev);
>>
> 
> 
> .
>

Re: [PATCH net-next 0/5] rtnetlink: add IFA_IF_NETNSID for RTM_GETADDR

2018-08-31 Thread Christian Brauner

On Thu, Aug 30, 2018 at 04:45:45PM +0200, Christian Brauner wrote:
> On Thu, Aug 30, 2018 at 11:49:31AM +0300, Kirill Tkhai wrote:
> > On 29.08.2018 21:13, Christian Brauner wrote:
> > > Hi Kirill,
> > > 
> > > Thanks for the question!
> > > 
> > > On Wed, Aug 29, 2018 at 11:30:37AM +0300, Kirill Tkhai wrote:
> > >> Hi, Christian,
> > >>
> > >> On 29.08.2018 02:18, Christian Brauner wrote:
> > >>> From: Christian Brauner 
> > >>>
> > >>> Hey,
> > >>>
> > > >>> A while back we introduced and enabled IFLA_IF_NETNSID in
> > > >>> RTM_{DEL,GET,NEW}LINK requests (cf. [1], [2], [3], [4], [5]). This has
> > > >>> led to significant performance increases since it allows userspace to
> > > >>> avoid taking the hit of a setns(netns_fd, CLONE_NEWNET), then getting
> > > >>> the interfaces from the netns associated with the netns_fd. Especially
> > > >>> when a lot of network namespaces are in use, using setns() becomes
> > > >>> increasingly problematic when performance matters.
> > >>
> > > >> could you please give a real example, when setns()+socket(AF_NETLINK)
> > > >> causes problems with the performance? You should do this only once on
> > > >> application startup, and then you have created netlink sockets in any
> > > >> net namespaces you need. What is the problem here?
> > > 
> > > So we have a daemon (LXD) that is often running thousands of containers.
> > > When users issue a lxc list request against the daemon it returns a list
> > > of all containers including all of the interfaces and addresses for each
> > > container. To retrieve those addresses we currently rely on setns() +
> > > getifaddrs() for each of those containers. That has horrible
> > > performance.
> > 
> > > Could you please provide some numbers showing that setns()
> > > introduces a significant performance decrease in the application?
> 
> Sure, might take a few days++ though since I'm traveling.

Hey Kirill,

As promised here's some code [1] that compares performance. I basically
did a setns() to the network namespace and called getifaddrs() and
compared this to the scenario where I use the newly introduced property.
I did this 1 million times and calculated the mean getifaddrs()
retrieval time based on that.
My patch cuts the time in half.

brauner@wittgenstein:~/netns_getifaddrs$ ./getifaddrs_perf 0 1178
Mean time in microseconds (netnsid): 81
Mean time in microseconds (setns): 162
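
The measurement described above (two gettimeofday() samples around each
call, then an integer mean over all iterations) can be sketched as
below. The helper names timeval_to_usec(), elapsed_usec() and
mean_usec() are illustrative assumptions and do not appear in the
appended program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

#define USEC_PER_SEC INTMAX_C(1000000)

/* Convert a struct timeval to a single microsecond count. */
static intmax_t timeval_to_usec(const struct timeval *tv)
{
	return (intmax_t)tv->tv_sec * USEC_PER_SEC + (intmax_t)tv->tv_usec;
}

/* Elapsed microseconds between two gettimeofday() samples. */
static intmax_t elapsed_usec(const struct timeval *start,
			     const struct timeval *end)
{
	return timeval_to_usec(end) - timeval_to_usec(start);
}

/* Integer mean of n samples; n must be non-zero. */
static intmax_t mean_usec(const intmax_t *samples, size_t n)
{
	intmax_t sum = 0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += samples[i];

	return sum / (intmax_t)n;
}
```

Note that the seconds field has to be scaled by 1000000; any other
factor gives wrong results whenever a sample pair straddles a second
boundary.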

Christian

I'm only appending the main file since the netns_getifaddrs() code I
used is pretty long:

[1]:

#define _GNU_SOURCE
#define __STDC_FORMAT_MACROS
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include "netns_getifaddrs.h"
#include "print_getifaddrs.h"

#define ITERATIONS 100
#define SEC_TO_MICROSEC(x) (1000000 * (x))

int main(int argc, char *argv[])
{
int i, ret;
__s32 netns_id;
pid_t netns_pid;
char path[1024];
intmax_t times[ITERATIONS];
struct timeval t1, t2;
intmax_t time_in_mcs;
int fret = EXIT_FAILURE;
intmax_t sum = 0;
int host_netns_fd = -1, netns_fd = -1;

struct ifaddrs *ifaddrs = NULL;

if (argc != 3)
goto on_error;

netns_id = atoi(argv[1]);
netns_pid = atoi(argv[2]);
printf("%d\n", netns_id);
printf("%d\n", netns_pid);

for (i = 0; i < ITERATIONS; i++) {
ret = gettimeofday(&t1, NULL);
if (ret < 0)
goto on_error;

ret = netns_getifaddrs(&ifaddrs, netns_id);
freeifaddrs(ifaddrs);
if (ret < 0)
goto on_error;

ret = gettimeofday(&t2, NULL);
if (ret < 0)
goto on_error;

time_in_mcs = (SEC_TO_MICROSEC(t2.tv_sec) + t2.tv_usec) -
  (SEC_TO_MICROSEC(t1.tv_sec) + t1.tv_usec);
times[i] = time_in_mcs;
}

for (i = 0; i < ITERATIONS; i++)
sum += times[i];

printf("Mean time in microseconds (netnsid): %ju\n",
   sum / ITERATIONS);

ret = snprintf(path, sizeof(path), "/proc/%d/ns/net", netns_pid);
if (ret < 0 || (size_t)ret >= sizeof(path))
goto on_error;

netns_fd = open(path, O_RDONLY | O_CLOEXEC);
if (netns_fd < 0)
goto on_error;

host_netns_fd = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
if (host_netns_fd < 0)
goto on_error;

memset(times, 0, sizeof(times));
for (i = 0; i < ITERATIONS; i++) {
ret = gettimeofday(&t1, NULL);
if (ret < 0)
goto on_error;

ret = setns(netns_fd, CLONE_NEWNET);
if (ret < 0)
goto on_error;

ret = getifaddrs(&ifaddrs);
freeifaddrs(ifaddrs);
if (ret < 0)

Re: pull-request: bpf-next 2018-09-01

2018-08-31 Thread David Miller

From: Daniel Borkmann 
Date: Sat,  1 Sep 2018 02:05:06 +0200

> The following pull-request contains BPF updates for your *net-next* tree.
> 
> The main changes are:
> 
> 1) Add AF_XDP zero-copy support for i40e driver (!), from Björn and Magnus.

W00t!

> 2) BPF verifier improvements by giving each register its own liveness
>chain which allows to simplify and getting rid of skip_callee() logic,
>from Edward.
> 
> 3) Add bpf fs pretty print support for percpu arraymap, percpu hashmap
>and percpu lru hashmap. Also add generic percpu formatted print on
>bpftool so the same can be dumped there, from Yonghong.
> 
> 4) Add bpf_{set,get}sockopt() helper support for TCP_SAVE_SYN and
>TCP_SAVED_SYN options to allow reflection of tos/tclass from received
>SYN packet, from Nikita.
> 
> 5) Misc improvements to the BPF sockmap test cases in terms of cgroup v2
>interaction and removal of incorrect shutdown() calls, from John.
> 
> 6) Few cleanups in xdp_umem_assign_dev() and xdpsock samples, from Prashant.

Pulled, thanks Daniel!

[PATCH net-next] liquidio: Added delayed work for periodically updating the link statistics.

2018-08-31 Thread Felix Manlunas

From: Pradeep Nalla 

Signed-off-by: Pradeep Nalla 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_core.c| 24 +++---
 drivers/net/ethernet/cavium/liquidio/lio_main.c|  9 +++-
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c |  8 +++-
 .../net/ethernet/cavium/liquidio/liquidio_common.h |  2 ++
 drivers/net/ethernet/cavium/liquidio/octeon_iq.h   |  3 +++
 .../net/ethernet/cavium/liquidio/octeon_network.h  |  4 +++-
 6 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c 
b/drivers/net/ethernet/cavium/liquidio/lio_core.c
index 30b4a60..cdc26ca 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_core.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c
@@ -1352,16 +1352,19 @@ octnet_nic_stats_callback(struct octeon_device *oct_dev,
 
resp->status = 1;
} else {
+   dev_err(&oct_dev->pci_dev->dev, "sc OPCODE_NIC_PORT_STATS command failed\n");
resp->status = -1;
}
 }
 
-int octnet_get_link_stats(struct net_device *netdev)
+void lio_fetch_stats(struct work_struct *work)
 {
-   struct lio *lio = GET_LIO(netdev);
+   struct cavium_wk *wk = (struct cavium_wk *)work;
+   struct lio *lio = wk->ctxptr;
struct octeon_device *oct_dev = lio->oct_dev;
struct octeon_soft_command *sc;
struct oct_nic_stats_resp *resp;
+   unsigned long time_in_jiffies;
int retval;
 
/* Alloc soft command */
@@ -1371,8 +1374,10 @@ int octnet_get_link_stats(struct net_device *netdev)
  sizeof(struct oct_nic_stats_resp),
  0);
 
-   if (!sc)
-   return -ENOMEM;
+   if (!sc) {
+   dev_err(&oct_dev->pci_dev->dev, "Soft command allocation failed\n");
+   goto lio_fetch_stats_exit;
+   }
 
resp = (struct oct_nic_stats_resp *)sc->virtrptr;
memset(resp, 0, sizeof(struct oct_nic_stats_resp));
@@ -1388,20 +1393,25 @@ int octnet_get_link_stats(struct net_device *netdev)
retval = octeon_send_soft_command(oct_dev, sc);
if (retval == IQ_SEND_FAILED) {
octeon_free_soft_command(oct_dev, sc);
-   return -EINVAL;
+   goto lio_fetch_stats_exit;
}
 
retval = wait_for_sc_completion_timeout(oct_dev, sc,
(2 * LIO_SC_MAX_TMO_MS));
if (retval)  {
dev_err(&oct_dev->pci_dev->dev, "sc OPCODE_NIC_PORT_STATS command failed\n");
-   return retval;
+   goto lio_fetch_stats_exit;
}
 
octnet_nic_stats_callback(oct_dev, sc->sc_status, sc);
WRITE_ONCE(sc->caller_is_done, true);
 
-   return 0;
+lio_fetch_stats_exit:
+   time_in_jiffies = msecs_to_jiffies(LIQUIDIO_NDEV_STATS_POLL_TIME_MS);
+   if (ifstate_check(lio, LIO_IFSTATE_RUNNING))
+   schedule_delayed_work(&lio->stats_wk.work, time_in_jiffies);
+
+   return;
 }
 
 int liquidio_set_speed(struct lio *lio, int speed)
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index ed5fc6e..e973662 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -1841,6 +1841,12 @@ static int liquidio_open(struct net_device *netdev)
/* tell Octeon to start forwarding packets to host */
send_rx_ctrl_cmd(lio, 1);
 
+   /* start periodical statistics fetch */
+   INIT_DELAYED_WORK(&lio->stats_wk.work, lio_fetch_stats);
+   lio->stats_wk.ctxptr = lio;
+   schedule_delayed_work(&lio->stats_wk.work, msecs_to_jiffies
+   (LIQUIDIO_NDEV_STATS_POLL_TIME_MS));
+
dev_info(&oct->pci_dev->dev, "%s interface is opened\n",
 netdev->name);
 
@@ -1881,6 +1887,8 @@ static int liquidio_stop(struct net_device *netdev)
cleanup_tx_poll_fn(netdev);
}
 
+   cancel_delayed_work_sync(&lio->stats_wk.work);
+
if (lio->ptp_clock) {
ptp_clock_unregister(lio->ptp_clock);
lio->ptp_clock = NULL;
@@ -2081,7 +2089,6 @@ liquidio_get_stats64(struct net_device *netdev,
lstats->rx_packets = pkts;
lstats->rx_dropped = drop;
 
-   octnet_get_link_stats(netdev);
lstats->multicast = oct->link_stats.fromwire.fw_total_mcast;
lstats->collisions = oct->link_stats.fromhost.total_collisions;
 
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
index 9c267b4c..fe3d935 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
@@ -917,6 +917,11 @@ static int liquidio_open(struct net_device *netdev)
netif_info(lio, ifup, lio->netdev, "Interface Open, ready for traffic\n");

[PATCH RFC net-next 16/18] net/ipv6: Allow routes to use nexthop objects

2018-08-31 Thread dsahern

From: David Ahern 

Allow users to specify a nexthop id to use with a route.

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h |  4 +++
 include/net/nexthop.h |  3 ++
 net/ipv4/nexthop.c|  5 +++
 net/ipv6/addrconf.c   |  3 ++
 net/ipv6/ip6_fib.c| 17 ---
 net/ipv6/ndisc.c  |  2 ++
 net/ipv6/route.c  | 85 +--
 7 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 1f04a26e4c65..170aadcd83b4 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -52,6 +52,7 @@ struct fib6_config {
u16 fc_type;/* only 8 bits are used */
u16 fc_delete_all_nh : 1,
__unused : 15;
+   u32 fc_nh_id;
 
struct in6_addr fc_dst;
struct in6_addr fc_src;
@@ -139,6 +140,8 @@ struct fib6_info {
struct fib6_info __rcu  *fib6_next;
struct fib6_node __rcu  *fib6_node;
 
+   struct list_headnh_list;
+
/* Multipath routes:
 * siblings is a list of fib6_info that have the the same metric/weight,
 * destination, but not the same gateway. nsiblings is just a cache
@@ -171,6 +174,7 @@ struct fib6_info {
unused:3;
 
struct rcu_head rcu;
+   struct nexthop  *nh;
struct fib6_nh  fib6_nh[0];
 };
 
diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index dae1518af3f3..759bb39e4ea7 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -175,6 +175,9 @@ static inline struct fib6_nh *nexthop_fib6_nh(struct 
nexthop *nh)
 
 static inline struct fib6_nh *fib6_info_nh(struct fib6_info *f6i)
 {
+   if (f6i->nh)
+   return nexthop_fib6_nh(f6i->nh);
+
return f6i->fib6_nh;
 }
 
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index d1fc3d21af86..1e77fa94e562 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -317,6 +317,7 @@ static void nexthop_notify(int event, struct nexthop *nh, 
struct nl_info *info)
 
 static void __remove_nexthop_fib(struct net *net, struct nexthop *nh)
 {
+   struct fib6_info *f6i, *tmp;
struct fib_info *fi;
bool do_flush;
 
@@ -328,6 +329,10 @@ static void __remove_nexthop_fib(struct net *net, struct 
nexthop *nh)
 
if (do_flush)
fib_flush(net);
+
+   list_for_each_entry_safe(f6i, tmp, &nh->f6i_list, nh_list) {
+   ip6_del_rt(net, f6i);
+   }
 }
 
 /* called on insert failure too */
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index da5102bff2a9..8131cdd472cb 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2366,6 +2366,9 @@ static struct fib6_info *addrconf_get_prefix_route(const 
struct in6_addr *pfx,
goto out;
 
for_each_fib6_node_rt_rcu(fn) {
+   /* prefix routes do not use nexthop objects */
+   if (rt->nh)
+   continue;
if (rt->fib6_nh->nh_dev->ifindex != dev->ifindex)
continue;
if ((rt->fib6_flags & flags) != flags)
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 5b0ca5b3710d..b6dc644a55cf 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -202,7 +202,10 @@ void fib6_info_destroy_rcu(struct rcu_head *head)
}
}
 
-   fib6_nh_release(f6i->fib6_nh);
+   if (f6i->nh)
+   nexthop_put(f6i->nh);
+   else
+   fib6_nh_release(f6i->fib6_nh);
 
m = f6i->fib6_metrics;
if (m != &dst_default_metrics && refcount_dec_and_test(&m->refcnt))
@@ -1302,6 +1305,8 @@ int fib6_add(struct fib6_node *root, struct fib6_info *rt,
if (!err) {
__fib6_update_sernum_upto_root(rt, sernum);
fib6_start_gc(info->nl_net, rt);
+   if (rt->nh)
+   list_add(&rt->nh_list, &rt->nh->f6i_list);
}
 
 out:
@@ -1776,6 +1781,9 @@ static void fib6_del_route(struct fib6_table *table, 
struct fib6_node *fn,
 
fib6_purge_rt(rt, fn, net);
 
+   if (rt->nh)
+   list_del(&rt->nh_list);
+
call_fib6_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, rt, NULL);
if (!info->skip_notify)
inet6_rt_notify(RTM_DELROUTE, rt, info, 0);
@@ -2251,7 +2259,6 @@ void fib6_gc_cleanup(void)
 static int ipv6_route_seq_show(struct seq_file *seq, void *v)
 {
struct fib6_info *rt = v;
-   struct fib6_nh *fib6_nh = rt->fib6_nh;
struct ipv6_route_iter *iter = seq->private;
const struct net_device *dev;
 
@@ -2262,12 +2269,12 @@ static int ipv6_route_seq_show(struct seq_file *seq, 
void *v)
 #else
seq_puts(seq, " 00 ");
 #endif
-   if (rt->fib6_flags & RTF_GATEWAY)
-   seq_printf(seq, "%pi6", &fib6_nh->nh_gw);
+   if (!rt->nh &&

[PATCH RFC net-next 11/18] net: Initial nexthop code

2018-08-31 Thread dsahern

From: David Ahern 

Initial import of nexthop code.
- Add new RTM commands for nexthop objects.
- Add new uapi attributes for creating nexthops. Attributes are similar
  to the current nexthop attributes for routes.
- Add basic helpers for ipv4 and ipv6 references to nexthop data

Similar to routes nexthops are configured per namespace, so add
netns_nexthop struct and add it to struct net.

Signed-off-by: David Ahern 
---
 include/net/net_namespace.h|2 +
 include/net/netns/nexthop.h|   18 +
 include/net/nexthop.h  |  121 +
 include/uapi/linux/nexthop.h   |   56 +++
 include/uapi/linux/rtnetlink.h |7 +
 net/ipv4/Makefile  |2 +-
 net/ipv4/nexthop.c | 1080 
 security/selinux/nlmsgtab.c|5 +-
 8 files changed, 1289 insertions(+), 2 deletions(-)
 create mode 100644 include/net/netns/nexthop.h
 create mode 100644 include/net/nexthop.h
 create mode 100644 include/uapi/linux/nexthop.h
 create mode 100644 net/ipv4/nexthop.c

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 9b5fdc50519a..d3d678814b93 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -105,6 +106,7 @@ struct net {
struct netns_mibmib;
struct netns_packet packet;
struct netns_unix   unx;
+   struct netns_nexthopnexthop;
struct netns_ipv4   ipv4;
 #if IS_ENABLED(CONFIG_IPV6)
struct netns_ipv6   ipv6;
diff --git a/include/net/netns/nexthop.h b/include/net/netns/nexthop.h
new file mode 100644
index 000000000000..91627c35e9d3
--- /dev/null
+++ b/include/net/netns/nexthop.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * nexthops in net namespaces
+ */
+
+#ifndef __NETNS_NEXTHOP_H__
+#define __NETNS_NEXTHOP_H__
+
+#include 
+
+struct netns_nexthop {
+   struct rb_root  root; /* tree of nexthops by id */
+   struct hlist_head   *devhash; /* nexthops by device */
+
+   unsigned intseq;/* protected by rtnl_mutex */
+   u32 last_id_allocated;
+};
+#endif
diff --git a/include/net/nexthop.h b/include/net/nexthop.h
new file mode 100644
index ..1c59d04d1da6
--- /dev/null
+++ b/include/net/nexthop.h
@@ -0,0 +1,121 @@
+/*
+ * Generic nexthop implementation
+ *
+ * Copyright (C) 2017-18 Cumulus Networks
+ * Copyright (c) 2017-18 David Ahern 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __LINUX_NEXTHOP_H
+#define __LINUX_NEXTHOP_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define NEXTHOP_VALID_USER_FLAGS RTNH_F_ONLINK
+
+struct nexthop;
+
+struct nh_info {
+   struct hlist_node   dev_hash;
+   struct net  *net;
+   struct nexthop  *nh_parent;
+
+   u8  family;
+   u8  reject_nh:1,
+   has_gw:1,
+   unused:6;
+
+   union {
+   /* fib_nh used for device only nexthops as well */
+   struct fib_nh   fib_nh;
+   struct fib6_nh  fib6_nh;
+   };
+};
+
+struct nexthop {
+   struct rb_node  rb_node;
+   struct list_headfi_list;/* v4 entries using nh */
+   struct list_headf6i_list;   /* v6 entries using nh */
+
+   u32 id;
+
+   u8  protocol;
+   u8  nh_flags;
+
+   refcount_t  refcnt;
+   struct rcu_head rcu;
+
+   union {
+   struct nh_info  __rcu *nh_info;
+   };
+};
+
+struct nh_config {
+   u8  nh_family;
+   u8  nh_scope;
+   u8  nh_protocol;
+   u8  nh_blackhole;
+   u32 nh_flags;
+
+   u32 nh_id;
+   u32 tclassid;
+
+   int nh_ifindex;
+   struct net_device *dev;
+   u32 nh_table;
+   union {
+   __be32  ipv4;
+   struct in6_addr ipv6;
+   } gw;
+
+   u32 nlflags;
+   struct nl_info  nlinfo;
+};
+
+void nexthop_get(struct nexthop *nh);
+void nexthop_put(struct nexthop *nh);
+
+/* caller is holding rtnl; no reference taken to nexthop */
+struct nexthop *nexthop_find_by_id(struct net *net, u32 id);
+

[PATCH RFC net-next 08/18] net/ipv4: Move device validation to helper

2018-08-31 Thread dsahern

From: David Ahern 

Move the device matching check in __fib_validate_source to a helper.
Code move only; no functional change intended.

Signed-off-by: David Ahern 
---
 net/ipv4/fib_frontend.c | 44 +++-
 1 file changed, 27 insertions(+), 17 deletions(-)
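
For readers without the kernel tree at hand, the device-matching loop
that this patch moves into a helper can be modelled in userspace. The
sketch below is an illustration under assumptions: struct dev_model,
struct nh_model, struct route_model and route_uses_dev() are
hypothetical names, and master_ifindex stands in for what
l3mdev_master_ifindex_rcu() returns (0 meaning no L3 master device).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device: master_ifindex is 0 when there is no L3 master. */
struct dev_model {
	int ifindex;
	int master_ifindex;
};

struct nh_model {
	const struct dev_model *dev;
};

struct route_model {
	size_t nhs;			/* number of nexthops */
	const struct nh_model *nh;	/* array of nhs entries */
};

/*
 * True if any nexthop of the route egresses through dev directly, or
 * through a device enslaved to dev (for example a VRF master).
 */
static bool route_uses_dev(const struct route_model *rt,
			   const struct dev_model *dev)
{
	size_t i;

	assert(rt != NULL && dev != NULL);

	for (i = 0; i < rt->nhs; i++) {
		const struct dev_model *nh_dev = rt->nh[i].dev;

		if (nh_dev == dev)
			return true;
		if (nh_dev->master_ifindex != 0 &&
		    nh_dev->master_ifindex == dev->ifindex)
			return true;
	}

	return false;
}
```

Returning from inside the loop replaces the dev_match flag and break
pairs of the original; the result is the same.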

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 2f9bf1ec2678..ec6ae186d4b0 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -315,6 +315,32 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
return inet_select_addr(dev, ip_hdr(skb)->saddr, scope);
 }
 
+static bool fib_info_nh_uses_dev(struct fib_info *fi,
+const struct net_device *dev)
+{
+   bool dev_match = false;
+   int ret;
+
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+   for (ret = 0; ret < fi->fib_nhs; ret++) {
+   struct fib_nh *nh = &fi->fib_nh[ret];
+
+   if (nh->nh_dev == dev) {
+   dev_match = true;
+   break;
+   } else if (l3mdev_master_ifindex_rcu(nh->nh_dev) == dev->ifindex) {
+   dev_match = true;
+   break;
+   }
+   }
+#else
+   if (FIB_RES_DEV(res) == dev)
+   dev_match = true;
+#endif
+
+   return dev_match;
+}
+
 /* Given (packet source, input interface) and optional (dst, oif, tos):
  * - (main) check, that source is valid i.e. not broadcast or our local
  *   address.
@@ -361,24 +387,8 @@ static int __fib_validate_source(struct sk_buff *skb, 
__be32 src, __be32 dst,
(res.type != RTN_LOCAL || !IN_DEV_ACCEPT_LOCAL(idev)))
goto e_inval;
fib_combine_itag(itag, &res);
-   dev_match = false;
-
-#ifdef CONFIG_IP_ROUTE_MULTIPATH
-   for (ret = 0; ret < res.fi->fib_nhs; ret++) {
-   struct fib_nh *nh = &res.fi->fib_nh[ret];
 
-   if (nh->nh_dev == dev) {
-   dev_match = true;
-   break;
-   } else if (l3mdev_master_ifindex_rcu(nh->nh_dev) == dev->ifindex) {
-   dev_match = true;
-   break;
-   }
-   }
-#else
-   if (FIB_RES_DEV(res) == dev)
-   dev_match = true;
-#endif
+   dev_match = fib_info_nh_uses_dev(res.fi, dev);
if (dev_match) {
ret = FIB_RES_NH(res)->nh_scope >= RT_SCOPE_HOST;
return ret;
-- 
2.11.0

[PATCH RFC net-next 03/18] net/ipv4: export fib_info_update_nh_saddr

2018-08-31 Thread dsahern

From: David Ahern 

Add scope as input argument versus relying on fib_info reference in
fib_nh and export fib_info_update_nh_saddr.

Signed-off-by: David Ahern 
---
 include/net/ip_fib.h | 5 +++--
 net/ipv4/fib_semantics.c | 9 -
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index f1c053cf9489..a4a129344098 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -173,13 +173,14 @@ struct fib_result_nl {
 #define FIB_TABLE_HASHSZ 2
 #endif
 
-__be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh);
+__be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh,
+   unsigned char scope);
 
 #define FIB_RES_SADDR(net, res)\
((FIB_RES_NH(res).nh_saddr_genid == \
  atomic_read(&(net)->ipv4.dev_addr_genid)) ?   \
 FIB_RES_NH(res).nh_saddr : \
-fib_info_update_nh_saddr((net), &FIB_RES_NH(res)))
+fib_info_update_nh_saddr((net), &FIB_RES_NH(res), (res).fi->fib_scope))
 #define FIB_RES_GW(res)(FIB_RES_NH(res).nh_gw)
 #define FIB_RES_DEV(res)   (FIB_RES_NH(res).nh_dev)
 #define FIB_RES_OIF(res)   (FIB_RES_NH(res).nh_oif)
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 7bead7c03e1b..c034d0adf590 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -984,11 +984,10 @@ static void fib_info_hash_move(struct hlist_head 
*new_info_hash,
fib_info_hash_free(old_laddrhash, bytes);
 }
 
-__be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh)
+__be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh,
+   unsigned char scope)
 {
-   nh->nh_saddr = inet_select_addr(nh->nh_dev,
-   nh->nh_gw,
-   nh->nh_parent->fib_scope);
+   nh->nh_saddr = inet_select_addr(nh->nh_dev, nh->nh_gw, scope);
nh->nh_saddr_genid = atomic_read(&net->ipv4.dev_addr_genid);
 
return nh->nh_saddr;
@@ -1238,7 +1237,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg,
}
 
change_nexthops(fi) {
-   fib_info_update_nh_saddr(net, nexthop_nh);
+   fib_info_update_nh_saddr(net, nexthop_nh, fi->fib_scope);
} endfor_nexthops(fi)
 
fib_rebalance(fi);
-- 
2.11.0

Re: [PATCH net-next v1] selftests/tls: Add test for recv(PEEK) spanning across multiple records

2018-08-31 Thread David Miller

From: Vakul Garg 
Date: Wed, 29 Aug 2018 15:30:14 +0530

> Added test case to receive multiple records with a single recvmsg()
> operation with a MSG_PEEK set.

Applied.

[PATCH RFC net-next 12/18] net/ipv4: Add nexthop helpers for ipv4 integration

2018-08-31 Thread dsahern

From: David Ahern 

Add nexthop reference to fib_info along with a list_head for tracking
the association of nexthop back to the fib_info.

Add helpers to take a fib_info and return a fib_nh, a nexthop device
and nexthop gateway.

Add helper to validate a nexthop works with a fib_info.

Signed-off-by: David Ahern 
---
 include/net/ip_fib.h  |  4 
 include/net/nexthop.h | 46 ++
 net/ipv4/nexthop.c| 39 +++
 3 files changed, 89 insertions(+)
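
The helper pattern described above (a route either references a shared
nexthop object or carries its own inline nexthop array) can be sketched
in userspace as follows. All names here (struct hop_model,
struct nexthop_model, struct v4_route_model, route_hop(), route_gw(),
route_ifindex()) are hypothetical, and the bounds handling is an
assumption of this sketch: it returns NULL for an out-of-range selector
and makes both accessors NULL-safe, which is stricter than the helpers
in the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct hop_model {
	int ifindex;
	uint32_t gw;
};

struct nexthop_model {
	uint32_t id;
	struct hop_model hop;
};

struct v4_route_model {
	struct nexthop_model *nh;	/* shared object, or NULL */
	size_t nhs;			/* valid entries in hops[] */
	struct hop_model hops[4];
};

/* Prefer the shared nexthop object; fall back to the inline array. */
static const struct hop_model *route_hop(const struct v4_route_model *rt,
					 size_t sel)
{
	assert(rt != NULL);

	if (rt->nh)
		return &rt->nh->hop;
	if (sel >= rt->nhs)
		return NULL;

	return &rt->hops[sel];
}

/* Gateway of the first nexthop, or 0 if the route has none. */
static uint32_t route_gw(const struct v4_route_model *rt)
{
	const struct hop_model *hop = route_hop(rt, 0);

	return hop ? hop->gw : 0;
}

/* Interface index of the first nexthop, or 0 if the route has none. */
static int route_ifindex(const struct v4_route_model *rt)
{
	const struct hop_model *hop = route_hop(rt, 0);

	return hop ? hop->ifindex : 0;
}
```

Keeping the selection logic in one accessor means callers do not need
to know which of the two representations a given route uses.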

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 0b40c59b8a5f..e39f55f3c3d8 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -103,9 +103,12 @@ struct fib_nh {
  * This structure contains data shared by many of routes.
  */
 
+struct nexthop;
+
 struct fib_info {
struct hlist_node   fib_hash;
struct hlist_node   fib_lhash;
+   struct list_headnh_list;
struct net  *fib_net;
int fib_treeref;
refcount_t  fib_clntref;
@@ -122,6 +125,7 @@ struct fib_info {
 #define fib_window fib_metrics->metrics[RTAX_WINDOW-1]
 #define fib_rtt fib_metrics->metrics[RTAX_RTT-1]
 #define fib_advmss fib_metrics->metrics[RTAX_ADVMSS-1]
+   struct nexthop  *nh;
int fib_nhs;
struct rcu_head rcu;
struct fib_nh   fib_nh[0];
diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index 1c59d04d1da6..c149fe8394ab 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -118,4 +118,50 @@ static inline bool nexthop_is_blackhole(struct nexthop *nh)
nhi = rcu_dereference(nh->nh_info);
return !!nhi->reject_nh;
 }
+
+static inline struct fib_nh *nexthop_fib_nh(struct nexthop *nh, int nhsel)
+{
+   struct nh_info *nhi;
+
+   nhi = rcu_dereference(nh->nh_info);
+   if (nhi->family == AF_INET ||
+   nhi->family == AF_UNSPEC)  /* dev only re-uses IPv4 struct */
+   return &nhi->fib_nh;
+
+   return NULL;
+}
+
+static inline struct fib_nh *fib_info_nh(struct fib_info *fi, int nhsel)
+{
+   if (fi->nh)
+   return nexthop_fib_nh(fi->nh, 0);
+
+   WARN_ON(nhsel > fi->fib_nhs);
+   return &fi->fib_nh[nhsel];
+}
+
+/* return fib_nh for fib_info; for historical reasons
+ * returns first nexthop only
+ */
+static inline struct net_device *fib_info_nh_dev(struct fib_info *fi)
+{
+   struct fib_nh *fib_nh = fib_info_nh(fi, 0);
+
+   return fib_nh->nh_dev;
+}
+
+/* return gateway for fib_info; for historical reasons
+ * returns gateway for first nexthop if multipath
+ */
+static inline __be32 fib_info_nh_gw(struct fib_info *fi)
+{
+   struct fib_nh *fib_nh = fib_info_nh(fi, 0);
+
+   return fib_nh ? fib_nh->nh_gw : 0;
+}
+
+int fib_check_nexthop(struct fib_info *fi, struct fib_config *cfg,
+ struct netlink_ext_ack *extack);
+
+bool nexthop_uses_dev(const struct nexthop *nh, const struct net_device *dev);
 #endif
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 24c4aa383c9d..d1fc3d21af86 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -315,6 +315,21 @@ static void nexthop_notify(int event, struct nexthop *nh, 
struct nl_info *info)
rtnl_set_sk_err(info->nl_net, RTNLGRP_IPV4_ROUTE, err);
 }
 
+static void __remove_nexthop_fib(struct net *net, struct nexthop *nh)
+{
+   struct fib_info *fi;
+   bool do_flush;
+
+   do_flush = false;
+   list_for_each_entry(fi, &nh->fi_list, nh_list) {
+   fi->fib_flags |= RTNH_F_DEAD;
+   do_flush = true;
+   }
+
+   if (do_flush)
+   fib_flush(net);
+}
+
 /* called on insert failure too */
 static void __remove_nexthop(struct net *net, struct nexthop *nh,
 bool skip_fib, struct nl_info *nlinfo)
@@ -326,6 +341,8 @@ static void __remove_nexthop(struct net *net, struct 
nexthop *nh,
dev = nh_info_dev(nhi);
if (dev)
hlist_del(&nhi->dev_hash);
+   if (!skip_fib)
+   __remove_nexthop_fib(net, nh);
 }
 
 static void remove_nexthop(struct net *net, struct nexthop *nh,
@@ -461,6 +478,28 @@ static void flush_all_nexthops(struct net *net)
}
 }
 
+/* invoked by fib add code to verify nexthop by id is ok with
+ * config for prefix; parts of fib_check_nh not done when nexthop
+ * is created
+ */
+int fib_check_nexthop(struct fib_info *fi, struct fib_config *cfg,
+ struct netlink_ext_ack *extack)
+{
+   struct nexthop *nh = fi->nh;
+   struct nh_info *nhi;
+
+   nhi = rtnl_dereference(nh->nh_info);
+   if (nhi->family != AF_UNSPEC) {
+   if (nh->nh_flags & RTNH_F_ONLINK &&
+   cfg->fc_scope >= RT_SCOPE_LINK) {
+   NL_SET_ERR_MSG(extack, "Scope mismatch with nexthop");
+   return -EINVAL;
+   }
+   }
+
+   return 0;

[PATCH RFC net-next 10/18] net/ipv6: Make fib6_nh optional at the end of fib6_info

2018-08-31 Thread dsahern

From: David Ahern 

Move fib6_nh to the end of fib6_info and make an array of
size 0. Pass a flag to fib6_info_alloc indicating if the
allocation needs to add space for a fib6_nh.

The current code path always has a fib6_nh allocated; with
nexthop objects they will not.

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h   |   8 +--
 include/net/ip6_route.h |  10 ++-
 include/trace/events/fib6.h |  15 ++--
 net/core/filter.c   |   6 +-
 net/ipv6/addrconf.c |   2 +-
 net/ipv6/ip6_fib.c  |  15 ++--
 net/ipv6/ndisc.c|  13 ++--
 net/ipv6/route.c| 165 ++--
 8 files changed, 124 insertions(+), 110 deletions(-)
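
The allocation scheme described in the commit message (a trailing array
whose storage is only reserved when the caller asks for it) can be
illustrated in userspace. This is a sketch under assumptions, not the
kernel code: it uses a C99 flexible array member instead of the
zero-length array in the patch, calloc() instead of the kernel
allocator, and a has_inline_hop flag that the patch does not have.
struct hop6_model, struct route6_model, route6_alloc() and route6_hop()
are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct hop6_model {
	int ifindex;
};

struct route6_model {
	int metric;
	bool has_inline_hop;
	struct hop6_model hop[];	/* zero or one trailing element */
};

/* Reserve room for the trailing nexthop only when requested. */
static struct route6_model *route6_alloc(bool with_hop)
{
	size_t sz = sizeof(struct route6_model);
	struct route6_model *rt;

	if (with_hop)
		sz += sizeof(struct hop6_model);

	rt = calloc(1, sz);
	if (rt)
		rt->has_inline_hop = with_hop;

	return rt;
}

/* Inline nexthop, or NULL when no space was allocated for it. */
static struct hop6_model *route6_hop(struct route6_model *rt)
{
	assert(rt != NULL);

	return rt->has_inline_hop ? rt->hop : NULL;
}
```

Because the array has to be the last member for this to work, any field
that used to follow it (rcu in the diff below) moves in front of it.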

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 2a1fae1247a9..9526eef711d5 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -170,8 +170,8 @@ struct fib6_info {
dst_host:1,
unused:3;
 
-   struct fib6_nh  fib6_nh;
struct rcu_head rcu;
+   struct fib6_nh  fib6_nh[0];
 };
 
 struct rt6_info {
@@ -274,7 +274,7 @@ static inline void ip6_rt_put(struct rt6_info *rt)
dst_release(&rt->dst);
 }
 
-struct fib6_info *fib6_info_alloc(gfp_t gfp_flags);
+struct fib6_info *fib6_info_alloc(gfp_t gfp_flags, bool with_fib6_nh);
 void fib6_info_destroy_rcu(struct rcu_head *head);
 
 static inline void fib6_info_hold(struct fib6_info *f6i)
@@ -426,13 +426,13 @@ static inline void fib6_nh_release(struct fib6_nh 
*fib6_nh)
 
 static inline struct net_device *fib6_info_nh_dev(const struct fib6_info *f6i)
 {
-   return f6i->fib6_nh.nh_dev;
+   return f6i->fib6_nh->nh_dev;
 }
 
 static inline
 struct lwtunnel_state *fib6_info_nh_lwt(const struct fib6_info *f6i)
 {
-   return f6i->fib6_nh.nh_lwtstate;
+   return f6i->fib6_nh->nh_lwtstate;
 }
 
 void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info,
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 7b9c82de11cc..b1ca637acb2a 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -274,9 +274,13 @@ static inline struct in6_addr *rt6_nexthop(struct rt6_info 
*rt,
 
 static inline bool rt6_duplicate_nexthop(struct fib6_info *a, struct fib6_info 
*b)
 {
-   return a->fib6_nh.nh_dev == b->fib6_nh.nh_dev &&
-  ipv6_addr_equal(&a->fib6_nh.nh_gw, &b->fib6_nh.nh_gw) &&
-  !lwtunnel_cmp_encap(a->fib6_nh.nh_lwtstate, 
b->fib6_nh.nh_lwtstate);
+// TO-DO:
+   //if (a->nh || b->nh)
+   //  return nexthop_cmp(a->nh, b->nh);
+
+   return a->fib6_nh->nh_dev == b->fib6_nh->nh_dev &&
+  ipv6_addr_equal(&a->fib6_nh->nh_gw, &b->fib6_nh->nh_gw) &&
+  !lwtunnel_cmp_encap(a->fib6_nh->nh_lwtstate, 
b->fib6_nh->nh_lwtstate);
 }
 
 static inline unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst)
diff --git a/include/trace/events/fib6.h b/include/trace/events/fib6.h
index b088b54d699c..037df3d2be0b 100644
--- a/include/trace/events/fib6.h
+++ b/include/trace/events/fib6.h
@@ -12,7 +12,7 @@
 
 TRACE_EVENT(fib6_table_lookup,
 
-   TP_PROTO(const struct net *net, const struct fib6_info *f6i,
+   TP_PROTO(const struct net *net, struct fib6_info *f6i,
 struct fib6_table *table, const struct flowi6 *flp),
 
TP_ARGS(net, f6i, table, flp),
@@ -36,6 +36,7 @@ TRACE_EVENT(fib6_table_lookup,
),
 
TP_fast_assign(
+   struct fib6_nh *fib6_nh = f6i->fib6_nh;
struct in6_addr *in6;
 
__entry->tb_id = table->tb6_id;
@@ -62,20 +63,20 @@ TRACE_EVENT(fib6_table_lookup,
__entry->dport = 0;
}
 
-   if (f6i->fib6_nh.nh_dev) {
-   __assign_str(name, f6i->fib6_nh.nh_dev);
+   if (fib6_nh && fib6_nh->nh_dev) {
+   __assign_str(name, fib6_nh->nh_dev);
} else {
__assign_str(name, "-");
}
-   if (f6i == net->ipv6.fib6_null_entry) {
+
+   if (!fib6_nh) {
struct in6_addr in6_zero = {};
 
in6 = (struct in6_addr *)__entry->gw;
*in6 = in6_zero;
-
-   } else if (f6i) {
+   } else {
in6 = (struct in6_addr *)__entry->gw;
-   *in6 = f6i->fib6_nh.nh_gw;
+   *in6 = fib6_nh->nh_gw;
}
),
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 0ba4c477415d..bc979edf06ca 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4428,13 +4428,13 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct 
bpf_fib_lookup *params,
return BPF_FIB_LKUP_RET_FRAG_NEEDED;
}
 
-   if (f6i->fib6_nh.nh_lwtstate)
+   if (f6i->fib6_nh->nh_lwtstate)

[PATCH RFC net-next 07/18] net: ipv4: Add fib_nh to fib_result

2018-08-31 Thread dsahern

From: David Ahern 

Add nexthop selection to fib_result and update FIB_RES macros to
use it. Right now, fib_nh in fib_result will point to a nexthop
within a fib_info. Later, fib_nh can point to data with a nexthop.
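
A minimal sketch of the idea, assuming trimmed-down stand-in types: the lookup
result carries a pointer to the selected nexthop instead of an index into the
fib_info array, so the accessor macros no longer care where the nexthop lives.
All demo_* and DEMO_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for fib_nh, fib_info and fib_result. */
struct demo_fib_nh {
	int nh_oif;
};

struct demo_fib_info {
	int fib_nhs;
	struct demo_fib_nh fib_nh[2];
};

struct demo_fib_result {
	struct demo_fib_nh *nh;    /* selected nexthop; replaces an index */
	struct demo_fib_info *fi;
};

/* Accessors go through the pointer, wherever it points. */
#define DEMO_RES_NH(res)  ((res).nh)
#define DEMO_RES_OIF(res) (DEMO_RES_NH(res)->nh_oif)

/* Path selection stores the address of the chosen entry in the result. */
static void demo_select_nh(struct demo_fib_result *res, int nhsel)
{
	assert(res != NULL && res->fi != NULL);
	assert(nhsel >= 0 && nhsel < res->fi->fib_nhs);
	res->nh = &res->fi->fib_nh[nhsel];
}
```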

Signed-off-by: David Ahern 
---
 include/net/ip_fib.h | 21 +
 net/core/filter.c|  2 +-
 net/ipv4/fib_frontend.c  |  4 ++--
 net/ipv4/fib_semantics.c |  4 ++--
 net/ipv4/fib_trie.c  |  4 ++--
 net/ipv4/route.c | 18 +-
 6 files changed, 25 insertions(+), 28 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index ce9b92485064..0b40c59b8a5f 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -141,6 +141,7 @@ struct fib_result {
unsigned char   type;
unsigned char   scope;
u32 tclassid;
+   struct fib_nh   *nh;
struct fib_info *fi;
struct fib_table *table;
struct hlist_head *fa_head;
@@ -161,11 +162,7 @@ struct fib_result_nl {
int err;
 };
 
-#ifdef CONFIG_IP_ROUTE_MULTIPATH
-#define FIB_RES_NH(res)((res).fi->fib_nh[(res).nh_sel])
-#else /* CONFIG_IP_ROUTE_MULTIPATH */
-#define FIB_RES_NH(res)((res).fi->fib_nh[0])
-#endif /* CONFIG_IP_ROUTE_MULTIPATH */
+#define FIB_RES_NH(res)((res).nh)
 
 #ifdef CONFIG_IP_MULTIPLE_TABLES
 #define FIB_TABLE_HASHSZ 256
@@ -177,13 +174,13 @@ __be32 fib_info_update_nh_saddr(struct net *net, struct 
fib_nh *nh,
unsigned char scope);
 
 #define FIB_RES_SADDR(net, res)\
-   ((FIB_RES_NH(res).nh_saddr_genid == \
+   ((FIB_RES_NH(res)->nh_saddr_genid ==\
  atomic_read(&(net)->ipv4.dev_addr_genid)) ?   \
-FIB_RES_NH(res).nh_saddr : \
-fib_info_update_nh_saddr((net), &FIB_RES_NH(res), (res).fi->fib_scope))
-#define FIB_RES_GW(res)(FIB_RES_NH(res).nh_gw)
-#define FIB_RES_DEV(res)   (FIB_RES_NH(res).nh_dev)
-#define FIB_RES_OIF(res)   (FIB_RES_NH(res).nh_oif)
+FIB_RES_NH(res)->nh_saddr :\
+fib_info_update_nh_saddr((net), FIB_RES_NH(res), (res).fi->fib_scope))
+#define FIB_RES_GW(res)(FIB_RES_NH(res)->nh_gw)
+#define FIB_RES_DEV(res)   (FIB_RES_NH(res)->nh_dev)
+#define FIB_RES_OIF(res)   (FIB_RES_NH(res)->nh_oif)
 
 #define FIB_RES_PREFSRC(net, res)  ((res).fi->fib_prefsrc ? : \
 FIB_RES_SADDR(net, res))
@@ -422,7 +419,7 @@ static inline void fib_combine_itag(u32 *itag, const struct 
fib_result *res)
 #ifdef CONFIG_IP_MULTIPLE_TABLES
u32 rtag;
 #endif
-   *itag = FIB_RES_NH(*res).nh_tclassid<<16;
+   *itag = FIB_RES_NH(*res)->nh_tclassid << 16;
 #ifdef CONFIG_IP_MULTIPLE_TABLES
rtag = res->tclassid;
if (*itag == 0)
diff --git a/net/core/filter.c b/net/core/filter.c
index c25eb36f1320..0ba4c477415d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4311,7 +4311,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct 
bpf_fib_lookup *params,
return BPF_FIB_LKUP_RET_FRAG_NEEDED;
}
 
-   nh = &res.fi->fib_nh[res.nh_sel];
+   nh = res.nh;
 
/* do not handle lwt encaps right now */
if (nh->nh_lwtstate)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index b0910d8c8bd4..2f9bf1ec2678 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -380,7 +380,7 @@ static int __fib_validate_source(struct sk_buff *skb, 
__be32 src, __be32 dst,
dev_match = true;
 #endif
if (dev_match) {
-   ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
+   ret = FIB_RES_NH(res)->nh_scope >= RT_SCOPE_HOST;
return ret;
}
if (no_addr)
@@ -392,7 +392,7 @@ static int __fib_validate_source(struct sk_buff *skb, 
__be32 src, __be32 dst,
ret = 0;
if (fib_lookup(net, &fl4, &res, FIB_LOOKUP_IGNORE_LINKSTATE) == 0) {
if (res.type == RTN_UNICAST)
-   ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
+   ret = FIB_RES_NH(res)->nh_scope >= RT_SCOPE_HOST;
}
return ret;
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 0d792666821a..53e38ecfdd58 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1728,7 +1728,7 @@ void fib_select_multipath(struct fib_result *res, int 
hash)
if (!fib_good_nh(nh))
continue;
if (!first) {
-   res->nh_sel = nhsel;
+   res->nh = &fi->fib_nh[nhsel];
first = true;
}
}
@@ -1736,7 +1736,7 @@ void fib_select_multipath(struct fib_result *res, int 
hash)

[PATCH RFC net-next 04/18] net/ipv4: export fib_check_nh

2018-08-31 Thread dsahern

From: David Ahern 

Change fib_check_nh to take net, table and scope as input arguments
instead of struct fib_config, and export it for use by nexthop code.
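
The signature change can be illustrated with a reduced userspace sketch. It is
an assumption-laden example, not the kernel function: the demo_* names are
hypothetical, only the on-link scope test visible in the diff is reproduced,
and the two constants mirror RT_SCOPE_LINK and RTNH_F_ONLINK from
<linux/rtnetlink.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Values mirror RT_SCOPE_LINK and RTNH_F_ONLINK from <linux/rtnetlink.h>. */
#define DEMO_RT_SCOPE_LINK 253
#define DEMO_RTNH_F_ONLINK 4

struct demo_nh {
	unsigned int gw;
	unsigned int flags;
};

/* Takes the individual values it needs, so callers without a route config
 * (such as standalone nexthop code) can call it too.
 */
static int demo_check_nh(const struct demo_nh *nh, unsigned int table,
			 unsigned char scope)
{
	assert(nh != NULL);
	(void)table;  /* the real function uses the table id for a lookup */

	if (nh->gw && (nh->flags & DEMO_RTNH_F_ONLINK) &&
	    scope >= DEMO_RT_SCOPE_LINK)
		return -EINVAL;
	return 0;
}

/* Hypothetical route config; the route path unpacks it at the call site. */
struct demo_config {
	unsigned int table;
	unsigned char scope;
};

static int demo_check_nh_from_config(const struct demo_config *cfg,
				     const struct demo_nh *nh)
{
	return demo_check_nh(nh, cfg->table, cfg->scope);
}
```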

Signed-off-by: David Ahern 
---
 include/net/ip_fib.h |  2 ++
 net/ipv4/fib_semantics.c | 18 +-
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index a4a129344098..19012f3ed501 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -400,6 +400,8 @@ int fib_sync_up(struct net_device *dev, unsigned int 
nh_flags);
 int fib_multipath_hash(const struct net *net, const struct flowi4 *fl4,
   const struct sk_buff *skb, struct flow_keys *flkeys);
 #endif
+int fib_check_nh(struct net *net, struct fib_nh *nh, u32 table, u8 scope,
+struct netlink_ext_ack *extack);
 bool fib_good_nh(const struct fib_nh *nh);
 void fib_select_multipath(struct fib_result *res, int hash);
 void fib_select_path(struct net *net, struct fib_result *res,
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c034d0adf590..9f8126debba5 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -777,21 +777,19 @@ bool fib_metrics_match(struct fib_config *cfg, struct 
fib_info *fi)
  * |
  * |-> {local prefix} (terminal node)
  */
-static int fib_check_nh(struct fib_config *cfg, struct fib_nh *nh,
-   struct netlink_ext_ack *extack)
+int fib_check_nh(struct net *net, struct fib_nh *nh, u32 table, u8 scope,
+struct netlink_ext_ack *extack)
 {
int err = 0;
-   struct net *net;
struct net_device *dev;
 
-   net = cfg->fc_nlinfo.nl_net;
if (nh->nh_gw) {
struct fib_result res;
 
if (nh->nh_flags & RTNH_F_ONLINK) {
unsigned int addr_type;
 
-   if (cfg->fc_scope >= RT_SCOPE_LINK) {
+   if (scope >= RT_SCOPE_LINK) {
NL_SET_ERR_MSG(extack,
   "Nexthop has invalid scope");
return -EINVAL;
@@ -822,7 +820,7 @@ static int fib_check_nh(struct fib_config *cfg, struct 
fib_nh *nh,
struct fib_table *tbl = NULL;
struct flowi4 fl4 = {
.daddr = nh->nh_gw,
-   .flowi4_scope = cfg->fc_scope + 1,
+   .flowi4_scope = scope + 1,
.flowi4_oif = nh->nh_oif,
.flowi4_iif = LOOPBACK_IFINDEX,
};
@@ -831,8 +829,8 @@ static int fib_check_nh(struct fib_config *cfg, struct 
fib_nh *nh,
if (fl4.flowi4_scope < RT_SCOPE_LINK)
fl4.flowi4_scope = RT_SCOPE_LINK;
 
-   if (cfg->fc_table)
-   tbl = fib_get_table(net, cfg->fc_table);
+   if (table)
+   tbl = fib_get_table(net, table);
 
if (tbl)
err = fib_table_lookup(tbl, &fl4, &res,
@@ -1221,7 +1219,9 @@ struct fib_info *fib_create_info(struct fib_config *cfg,
int linkdown = 0;
 
change_nexthops(fi) {
-   err = fib_check_nh(cfg, nexthop_nh, extack);
+   err = fib_check_nh(cfg->fc_nlinfo.nl_net, nexthop_nh,
+  cfg->fc_table, cfg->fc_scope,
+  extack);
if (err != 0)
goto failure;
if (nexthop_nh->nh_flags & RTNH_F_LINKDOWN)
-- 
2.11.0

[PATCH RFC net-next 17/18] net: Add support for nexthop groups

2018-08-31 Thread dsahern

From: David Ahern 

Allow the creation of nexthop groups which reference other nexthop
objects to create multipath routes.
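
The group entries in the diff below carry a weight and an upper_bound, but the
selection code is cut off in this archive. As a rough sketch only, and assuming
the group code follows the hash-threshold scheme already used for IPv4
multipath routes, weighted selection could look like this. The demo_* names
are hypothetical and the hash is assumed to be a 31-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one group member. */
struct demo_grp_entry {
	uint32_t weight;
	uint32_t upper_bound;  /* highest 31-bit hash value mapped to this entry */
};

/* Spread the 31-bit hash space across entries in proportion to weight. */
static void demo_grp_rebalance(struct demo_grp_entry *e, int n)
{
	uint64_t total = 0, w = 0;
	int i;

	for (i = 0; i < n; i++) {
		assert(e[i].weight > 0);
		total += e[i].weight;
	}
	assert(total > 0);

	for (i = 0; i < n; i++) {
		w += e[i].weight;
		/* rounded (w / total) scaled to 31 bits, minus one */
		e[i].upper_bound =
			(uint32_t)(((w << 31) + total / 2) / total) - 1;
	}
}

/* Pick the first entry whose upper bound covers the hash. */
static int demo_grp_select(const struct demo_grp_entry *e, int n, uint32_t hash)
{
	int i;

	for (i = 0; i < n; i++)
		if (hash <= e[i].upper_bound)
			return i;
	return n - 1;
}
```

With weights 1 and 3, the first entry covers a quarter of the hash space and
the second covers the rest.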

TO-DO: Add mpath support to IPv6

Signed-off-by: David Ahern 
---
 include/net/nexthop.h|  77 +--
 net/ipv4/fib_semantics.c |   5 +-
 net/ipv4/nexthop.c   | 511 ++-
 net/ipv4/route.c |  16 +-
 4 files changed, 540 insertions(+), 69 deletions(-)

diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index 759bb39e4ea7..654b67192337 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -28,6 +28,23 @@
 
 struct nexthop;
 
+struct nh_grp_entry {
+   struct nexthop   *nh;
+   u32  weight;
+   atomic_t upper_bound;
+
+   struct list_head nh_list;
+   struct nexthop   *nh_parent;  /* nexthop of group with this entry */
+};
+
+struct nh_group {
+   u16 num_nh_set;
+   u16 num_nh;
+   u8  mpath:1,
+   unused:7;
+   struct nh_grp_entry nh_entries[0];
+};
+
 struct nh_info {
struct hlist_node   dev_hash;
struct net  *net;
@@ -47,6 +64,7 @@ struct nh_info {
 
 struct nexthop {
struct rb_node  rb_node;
+   struct list_headgrp_list;  /* nh group entries using this nh */
struct list_headfi_list;/* v4 entries using nh */
struct list_headf6i_list;   /* v6 entries using nh */
 
@@ -54,12 +72,15 @@ struct nexthop {
 
u8  protocol;
u8  nh_flags;
+   u8  is_group:1,
+   unused:7;
 
refcount_t  refcnt;
struct rcu_head rcu;
 
union {
struct nh_info  __rcu *nh_info;
+   struct nh_group __rcu *nh_grp;
};
 };
 
@@ -81,6 +102,9 @@ struct nh_config {
struct in6_addr ipv6;
} gw;
 
+   struct nlattr   *nh_grp;
+   u16 nh_grp_type;
+
u32 nlflags;
struct nl_info  nlinfo;
 };
@@ -88,42 +112,61 @@ struct nh_config {
 void nexthop_get(struct nexthop *nh);
 void nexthop_put(struct nexthop *nh);
 
+static inline bool nexthop_cmp(struct nexthop *nh1, struct nexthop *nh2)
+{
+   return nh1 == nh2;
+}
+
 /* caller is holding rtnl; no reference taken to nexthop */
 struct nexthop *nexthop_find_by_id(struct net *net, u32 id);
 
-static inline bool nexthop_cmp(struct nexthop *nh1, struct nexthop *nh2)
+/* called with rcu lock */
+static inline bool nexthop_is_multipath(const struct nexthop *nh)
 {
-   return nh1 == nh2;
+   if (nh->is_group) {
+   struct nh_group *nh_grp;
+
+   nh_grp = rcu_dereference(nh->nh_grp);
+   return !!nh_grp->mpath;
+   }
+   return false;
 }
 
+struct nexthop *nexthop_mpath_select(struct nexthop *nh, int nhsel);
+
+/* called with rcu lock */
 static inline int nexthop_num_path(struct nexthop *nh)
 {
+   if (nexthop_is_multipath(nh)) {
+   struct nh_group *nh_grp;
+
+   nh_grp = rcu_dereference(nh->nh_grp);
+   return nh_grp->num_nh_set;
+   }
+
return 1;
 }
 
-/* called with rcu lock */
+void nexthop_select_path(struct net *net, struct fib_result *res, int hash);
+
 static inline bool nexthop_has_gw(struct nexthop *nh)
 {
-   struct nh_info *nhi;
-
-   nhi = rcu_dereference(nh->nh_info);
-   return !!nhi->has_gw;
+   return !!nh->nh_info->has_gw;
 }
 
-/* called with rcu lock */
 static inline bool nexthop_is_blackhole(struct nexthop *nh)
 {
-   struct nh_info *nhi;
-
-   nhi = rcu_dereference(nh->nh_info);
-   return !!nhi->reject_nh;
+   return !nexthop_is_multipath(nh) && !!nh->nh_info->reject_nh;
 }
 
 static inline struct fib_nh *nexthop_fib_nh(struct nexthop *nh, int nhsel)
 {
struct nh_info *nhi;
 
-   nhi = rcu_dereference(nh->nh_info);
+   if (nexthop_is_multipath(nh))
+   nh = nexthop_mpath_select(nh, nhsel);
+
+   nhi = nh->nh_info;
if (nhi->family == AF_INET ||
nhi->family == AF_UNSPEC)  /* dev only re-uses IPv4 struct */
return &nhi->fib_nh;
@@ -164,11 +207,11 @@ static inline __be32 fib_info_nh_gw(struct fib_info *fi)
  */
 static inline struct fib6_nh *nexthop_fib6_nh(struct nexthop *nh)
 {
-   struct nh_info *nhi;
+   if (nexthop_is_multipath(nh))
+   nh = nexthop_mpath_select(nh, 0);
 
-   nhi = rcu_dereference(nh->nh_info);
-   if (nhi->family == AF_INET6)
-   return &nhi->fib6_nh;
+   if (nh->nh_info->family == AF_INET6)
+   return &nh->nh_info->fib6_nh;
 
return NULL;
 }
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c91cdafd40ec..0ddf14512bb3 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1821,7

[PATCH RFC net-next 06/18] net/ipv4: Create init and release helpers for fib_nh

2018-08-31 Thread dsahern

From: David Ahern 

Consolidate the fib_nh initialization which is duplicated between
fib_create_info for single path and fib_get_nhs for multipath.

Move the fib_nh cleanup code from free_fib_info_rcu into a new helper,
fib_nh_release. Move classid accounting into fib_nh_release which is
called per fib_nh to make accounting symmetrical with fib_nh_init.

Export both new helpers to allow for use with nexthop objects.
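
The symmetry argument can be shown with a small userspace sketch, assuming
hypothetical demo_* types: whatever the init helper adds to the per-namespace
counter, the release helper for the same nexthop takes away again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the per-namespace counter and a nexthop. */
struct demo_net {
	int num_tclassid_users;
};

struct demo_nh {
	unsigned int tclassid;
};

/* Init counts the nexthop as a classid user when a classid is set. */
static void demo_nh_init(struct demo_net *net, struct demo_nh *nh,
			 unsigned int tclassid)
{
	assert(net != NULL && nh != NULL);
	nh->tclassid = tclassid;
	if (nh->tclassid)
		net->num_tclassid_users++;
}

/* Release undoes exactly what init did for the same nexthop. */
static void demo_nh_release(struct demo_net *net, struct demo_nh *nh)
{
	assert(net != NULL && nh != NULL);
	if (nh->tclassid)
		net->num_tclassid_users--;
	nh->tclassid = 0;
}
```

Doing the accounting per nexthop rather than per route keeps the counter
correct no matter which object owns the nexthop.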

Signed-off-by: David Ahern 
---
 include/net/ip_fib.h |   5 ++
 net/ipv4/fib_semantics.c | 185 +--
 2 files changed, 104 insertions(+), 86 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 19012f3ed501..ce9b92485064 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -400,6 +400,11 @@ int fib_sync_up(struct net_device *dev, unsigned int 
nh_flags);
 int fib_multipath_hash(const struct net *net, const struct flowi4 *fl4,
   const struct sk_buff *skb, struct flow_keys *flkeys);
 #endif
+
+int fib_nh_init(struct net *net, struct fib_nh *fib_nh,
+   struct fib_config *cfg, int nh_weight,
+   struct netlink_ext_ack *extack);
+void fib_nh_release(struct net *net, struct fib_nh *fib_nh);
 int fib_check_nh(struct net *net, struct fib_nh *nh, u32 table, u8 scope,
 struct netlink_ext_ack *extack);
 bool fib_good_nh(const struct fib_nh *nh);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 9b2d8ba6bdb3..0d792666821a 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -204,6 +204,21 @@ static void rt_fibinfo_free_cpus(struct rtable __rcu * 
__percpu *rtp)
free_percpu(rtp);
 }
 
+void fib_nh_release(struct net *net, struct fib_nh *fib_nh)
+{
+#ifdef CONFIG_IP_ROUTE_CLASSID
+   if (fib_nh->nh_tclassid)
+   net->ipv4.fib_num_tclassid_users--;
+#endif
+   if (fib_nh->nh_dev)
+   dev_put(fib_nh->nh_dev);
+
+   lwtstate_put(fib_nh->nh_lwtstate);
+   free_nh_exceptions(fib_nh);
+   rt_fibinfo_free_cpus(fib_nh->nh_pcpu_rth_output);
+   rt_fibinfo_free(&fib_nh->nh_rth_input);
+}
+
 /* Release a nexthop info record */
 static void free_fib_info_rcu(struct rcu_head *head)
 {
@@ -211,12 +226,7 @@ static void free_fib_info_rcu(struct rcu_head *head)
struct dst_metrics *m;
 
change_nexthops(fi) {
-   if (nexthop_nh->nh_dev)
-   dev_put(nexthop_nh->nh_dev);
-   lwtstate_put(nexthop_nh->nh_lwtstate);
-   free_nh_exceptions(nexthop_nh);
-   rt_fibinfo_free_cpus(nexthop_nh->nh_pcpu_rth_output);
-   rt_fibinfo_free(&nexthop_nh->nh_rth_input);
+   fib_nh_release(fi->fib_net, nexthop_nh);
} endfor_nexthops(fi);
 
m = fi->fib_metrics;
@@ -459,6 +469,52 @@ static int fib_detect_death(struct fib_info *fi, int order,
return 1;
 }
 
+int fib_nh_init(struct net *net, struct fib_nh *nh,
+   struct fib_config *cfg, int nh_weight,
+   struct netlink_ext_ack *extack)
+{
+   int err = -ENOMEM;
+
+   nh->nh_pcpu_rth_output = alloc_percpu(struct rtable __rcu *);
+   if (!nh->nh_pcpu_rth_output)
+   goto failure;
+
+   if (cfg->fc_encap) {
+   struct lwtunnel_state *lwtstate;
+
+   err = -EINVAL;
+   if (cfg->fc_encap_type == LWTUNNEL_ENCAP_NONE) {
+   NL_SET_ERR_MSG(extack, "LWT encap type not specified");
+   goto failure;
+   }
+   err = lwtunnel_build_state(cfg->fc_encap_type,
+  cfg->fc_encap, AF_INET, cfg,
+  &lwtstate, extack);
+   if (err)
+   goto failure;
+
+   nh->nh_lwtstate = lwtstate_get(lwtstate);
+   }
+
+   nh->nh_oif   = cfg->fc_oif;
+   nh->nh_gw= cfg->fc_gw;
+   nh->nh_flags = cfg->fc_flags;
+
+#ifdef CONFIG_IP_ROUTE_CLASSID
+   nh->nh_tclassid = cfg->fc_flow;
+   if (nh->nh_tclassid)
+   net->ipv4.fib_num_tclassid_users++;
+#endif
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+   nh->nh_weight = nh_weight;
+#endif
+
+   err = 0;
+
+failure:
+   return err;
+}
+
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 
 static int fib_count_nexthops(struct rtnexthop *rtnh, int remaining,
@@ -485,11 +541,15 @@ static int fib_get_nhs(struct fib_info *fi, struct 
rtnexthop *rtnh,
   int remaining, struct fib_config *cfg,
   struct netlink_ext_ack *extack)
 {
+   struct net *net = fi->fib_net;
+   struct fib_config fib_cfg;
int ret;
 
change_nexthops(fi) {
int attrlen;
 
+   memset(&fib_cfg, 0, sizeof(fib_cfg));
+
if (!rtnh_ok(rtnh, remaining)) {
NL_SET_ERR_MSG(extack,
   "Invalid nexthop configuration - extra 
data after

Re: [PATCH net-next v2] net/tls: Add support for async decryption of tls records

2018-08-31 Thread David Miller

From: Vakul Garg 
Date: Wed, 29 Aug 2018 15:26:55 +0530

> When tls records are decrypted using asynchronous accelerators such as
> NXP CAAM engine, the crypto apis return -EINPROGRESS. Presently, on
> getting -EINPROGRESS, the tls record processing stops till the time the
> crypto accelerator finishes off and returns the result. This incurs a
> context switch and is not an efficient way of accessing the crypto
> accelerators. Crypto accelerators work efficiently when they are queued
> with multiple crypto jobs without having to wait for the previous ones
> to complete.
> 
> The patch submits multiple crypto requests without having to wait for
> previous ones to complete. This has been implemented for records
> which are decrypted in zero-copy mode. At the end of recvmsg(), we wait
> for all the asynchronous decryption requests to complete.
> 
> The references to records which have been sent for async decryption are
> dropped. For cases where record decryption is not possible in zero-copy
> mode, asynchronous decryption is not used and we wait for decryption
> crypto api to complete.
> 
> For crypto requests executing in async fashion, the memory for
> aead_request, sglists and skb etc is freed from the decryption
> completion handler. The decryption completion handler wakes up the
> sleeping user context when recvmsg() flags that it has done sending
> all the decryption requests and there are no more decryption requests
> pending to be completed.
> 
> Signed-off-by: Vakul Garg 
> Reviewed-by: Dave Watson 
> ---
> 
> Changes since v1:
>   - Simplified recvmsg() so as to drop reference to skb in case it
> was submitted for async decryption.
>   - Modified tls_sw_advance_skb() to handle case when input skb is
> NULL.

Applied.
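
The pending-request accounting described in the quoted commit message can be
sketched as follows. This is a single-threaded illustration with hypothetical
demo_* names, not the code from the patch; a real implementation would need
atomic counters and a proper wait object, since completions run in a different
context from recvmsg().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical receive context. */
struct demo_rx_ctx {
	int  pending;       /* requests submitted but not yet completed */
	bool async_notify;  /* submitter is done queueing and wants a wakeup */
	bool woken;         /* stands in for completing a wait object */
};

/* Queue one decryption request without waiting for earlier ones. */
static void demo_submit(struct demo_rx_ctx *ctx)
{
	ctx->pending++;
}

/* Completion handler: wake the submitter only once it has asked for it
 * and nothing is left in flight.
 */
static void demo_complete(struct demo_rx_ctx *ctx)
{
	assert(ctx->pending > 0);
	ctx->pending--;
	if (ctx->pending == 0 && ctx->async_notify)
		ctx->woken = true;
}

/* End of the receive call: returns true if the caller has to sleep. */
static bool demo_finish_submitting(struct demo_rx_ctx *ctx)
{
	ctx->async_notify = true;
	return ctx->pending > 0;
}
```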

[PATCH RFC net-next 13/18] net/ipv4: Convert existing use of fib_info to new helpers

2018-08-31 Thread dsahern

From: David Ahern 

Remove direct accesses to fi->fib_nh in favor of the helpers added
in the previous patch.

Signed-off-by: David Ahern 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c|  4 +++-
 drivers/net/ethernet/rocker/rocker_ofdpa.c   | 20 ++--
 include/net/ip_fib.h |  1 -
 net/ipv4/fib_frontend.c  |  3 ++-
 net/ipv4/fib_rules.c |  3 ++-
 net/ipv4/fib_semantics.c | 12 
 net/ipv4/fib_trie.c  | 19 +++
 7 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 2ab9cf25a08a..3fcac0b6fa92 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "spectrum.h"
 #include "core.h"
@@ -4121,12 +4122,13 @@ mlxsw_sp_fib4_entry_type_set(struct mlxsw_sp *mlxsw_sp,
 struct mlxsw_sp_fib_entry *fib_entry)
 {
union mlxsw_sp_l3addr dip = { .addr4 = htonl(fen_info->dst) };
-   struct net_device *dev = fen_info->fi->fib_dev;
struct mlxsw_sp_ipip_entry *ipip_entry;
struct fib_info *fi = fen_info->fi;
+   struct net_device *dev;
 
switch (fen_info->type) {
case RTN_LOCAL:
+   dev = fib_info_nh_dev(fi);
ipip_entry = mlxsw_sp_ipip_entry_find_by_decap(mlxsw_sp, dev,
 MLXSW_SP_L3_PROTO_IPV4, dip);
if (ipip_entry && ipip_entry->ol_dev->flags & IFF_UP) {
diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c 
b/drivers/net/ethernet/rocker/rocker_ofdpa.c
index 6473cc68c2d5..c05d35945ea7 100644
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "rocker.h"
 #include "rocker_tlv.h"
@@ -2286,8 +2287,8 @@ static int ofdpa_port_fib_ipv4(struct ofdpa_port 
*ofdpa_port,  __be32 dst,
 
/* XXX support ECMP */
 
-   nh = fi->fib_nh;
-   nh_on_port = (fi->fib_dev == ofdpa_port->dev);
+   nh = fib_info_nh(fi, 0);
+   nh_on_port = (nh->nh_dev == ofdpa_port->dev);
has_gw = !!nh->nh_gw;
 
if (has_gw && nh_on_port) {
@@ -2747,11 +2748,13 @@ static int ofdpa_fib4_add(struct rocker *rocker,
 {
struct ofdpa *ofdpa = rocker->wpriv;
struct ofdpa_port *ofdpa_port;
+   struct net_device *dev;
int err;
 
if (ofdpa->fib_aborted)
return 0;
-   ofdpa_port = ofdpa_port_dev_lower_find(fen_info->fi->fib_dev, rocker);
+   dev = fib_info_nh_dev(fen_info->fi);
+   ofdpa_port = ofdpa_port_dev_lower_find(dev, rocker);
if (!ofdpa_port)
return 0;
err = ofdpa_port_fib_ipv4(ofdpa_port, htonl(fen_info->dst),
@@ -2768,10 +2771,12 @@ static int ofdpa_fib4_del(struct rocker *rocker,
 {
struct ofdpa *ofdpa = rocker->wpriv;
struct ofdpa_port *ofdpa_port;
+   struct net_device *dev;
 
if (ofdpa->fib_aborted)
return 0;
-   ofdpa_port = ofdpa_port_dev_lower_find(fen_info->fi->fib_dev, rocker);
+   dev = fib_info_nh_dev(fen_info->fi);
+   ofdpa_port = ofdpa_port_dev_lower_find(dev, rocker);
if (!ofdpa_port)
return 0;
fen_info->fi->fib_nh->nh_flags &= ~RTNH_F_OFFLOAD;
@@ -2794,11 +2799,14 @@ static void ofdpa_fib4_abort(struct rocker *rocker)
 
spin_lock_irqsave(&ofdpa->flow_tbl_lock, flags);
hash_for_each_safe(ofdpa->flow_tbl, bkt, tmp, flow_entry, entry) {
+   struct net_device *dev;
+
if (flow_entry->key.tbl_id !=
ROCKER_OF_DPA_TABLE_ID_UNICAST_ROUTING)
continue;
-   ofdpa_port = ofdpa_port_dev_lower_find(flow_entry->fi->fib_dev,
-  rocker);
+
+   dev = fib_info_nh_dev(flow_entry->fi);
+   ofdpa_port = ofdpa_port_dev_lower_find(dev, rocker);
if (!ofdpa_port)
continue;
flow_entry->fi->fib_nh->nh_flags &= ~RTNH_F_OFFLOAD;
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index e39f55f3c3d8..c59e0f1ba59b 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -129,7 +129,6 @@ struct fib_info {
int fib_nhs;
struct rcu_head rcu;
struct fib_nh   fib_nh[0];
-#define fib_devfib_nh[0].nh_dev
 };
 
 
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index ec6ae186d4b0..c483453bf037 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -35,6 +35,7 @@
 #include 
 #include

[PATCH RFC net-next 01/18] net: Rename net/nexthop.h net/rtnh.h

2018-08-31 Thread dsahern

From: David Ahern 

The header contains rtnh_ macros, so rename the file accordingly.
This allows the next patch to use the nexthop.h name.

Signed-off-by: David Ahern 
---
 include/net/{nexthop.h => rtnh.h} | 4 ++--
 net/core/lwtunnel.c   | 2 +-
 net/decnet/dn_fib.c   | 2 +-
 net/ipv4/fib_semantics.c  | 2 +-
 net/ipv4/ipmr.c   | 2 +-
 net/ipv6/route.c  | 2 +-
 net/mpls/af_mpls.c| 2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)
 rename include/net/{nexthop.h => rtnh.h} (94%)

diff --git a/include/net/nexthop.h b/include/net/rtnh.h
similarity index 94%
rename from include/net/nexthop.h
rename to include/net/rtnh.h
index 902ff382a6dc..aa2cfc508f7c 100644
--- a/include/net/nexthop.h
+++ b/include/net/rtnh.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __NET_NEXTHOP_H
-#define __NET_NEXTHOP_H
+#ifndef __NET_RTNH_H
+#define __NET_RTNH_H
 
 #include 
 #include 
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 0b171756453c..80c30cd5744a 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -26,7 +26,7 @@
 #include 
 #include 
 #include 
-#include <net/nexthop.h>
+#include <net/rtnh.h>
 
 #ifdef CONFIG_MODULES
 
diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index f78fe58eafc8..3757a56bbcbd 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -42,7 +42,7 @@
 #include 
 #include 
 #include 
-#include <net/nexthop.h>
+#include <net/rtnh.h>
 
 #define RT_MIN_TABLE 1
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index f3c89ccf14c5..93524a746ca8 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -42,7 +42,7 @@
 #include 
 #include 
 #include 
-#include <net/nexthop.h>
+#include <net/rtnh.h>
 #include 
 #include 
 
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 5660adcf7a04..564d4fd5a92b 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -66,7 +66,7 @@
 #include 
 #include 
 #include 
-#include <net/nexthop.h>
+#include <net/rtnh.h>
 #include 
 
 struct ipmr_rule {
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index c4ea13e8360b..07ed7812c6b4 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -59,7 +59,7 @@
 #include 
 #include 
 #include 
-#include <net/nexthop.h>
+#include <net/rtnh.h>
 #include 
 #include 
 #include 
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 7a4de6d618b1..d066e5e9b76c 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -23,7 +23,7 @@
 #include 
 #endif
 #include 
-#include <net/nexthop.h>
+#include <net/rtnh.h>
 #include "internal.h"
 
 /* max memory we will use for mpls_route */
-- 
2.11.0

[PATCH RFC net-next 09/18] net/ipv6: Create init and release helpers for fib6_nh

2018-08-31 Thread dsahern

From: David Ahern 

Refactor initialization and cleanup of fib6_nh to helpers similar to
what was done for IPv4. Add fib6_nh_init to the ipv6 stubs for use by
core code when ipv6 is built as a module.

The release helper is small enough, so make it an inline rather than
requiring it to go through ipv6 stubs.
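
The stub mechanism can be sketched in userspace C under a few assumptions: the
demo_* names are hypothetical, and a plain pointer swap stands in for what
happens when the IPv6 module registers its implementation. Core code always
calls through the table; until the module is loaded it gets -EAFNOSUPPORT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_nh {
	int oif;
};

/* Hypothetical table of operations that core code calls through. */
struct demo_ipv6_stub {
	int (*nh_init)(struct demo_nh *nh, int oif);
};

/* Default entry used while the IPv6 implementation is not available. */
static int demo_eafnosupport_nh_init(struct demo_nh *nh, int oif)
{
	(void)nh;
	(void)oif;
	return -EAFNOSUPPORT;
}

/* Entry supplied by the (hypothetical) module once it is loaded. */
static int demo_real_nh_init(struct demo_nh *nh, int oif)
{
	assert(nh != NULL);
	nh->oif = oif;
	return 0;
}

static const struct demo_ipv6_stub demo_stub_default = {
	.nh_init = demo_eafnosupport_nh_init,
};

static const struct demo_ipv6_stub demo_stub_impl = {
	.nh_init = demo_real_nh_init,
};

static const struct demo_ipv6_stub *demo_stub = &demo_stub_default;

/* Module load swaps the default table for the real one. */
static void demo_ipv6_module_load(void)
{
	demo_stub = &demo_stub_impl;
}
```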

Signed-off-by: David Ahern 
---
 include/net/addrconf.h   |   5 +
 include/net/ip6_fib.h|  11 +++
 net/ipv6/addrconf_core.c |   9 ++
 net/ipv6/af_inet6.c  |   1 +
 net/ipv6/ip6_fib.c   |   5 +-
 net/ipv6/route.c | 239 +--
 6 files changed, 153 insertions(+), 117 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 6def0351bcc3..7748b8300ca0 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -2,6 +2,8 @@
 #ifndef _ADDRCONF_H
 #define _ADDRCONF_H
 
+#include 
+
 #define MAX_RTR_SOLICITATIONS  -1  /* unlimited */
 #define RTR_SOLICITATION_INTERVAL  (4*HZ)
 #define RTR_SOLICITATION_MAX_INTERVAL  (3600*HZ)   /* 1 hour */
@@ -253,6 +255,9 @@ struct ipv6_stub {
u32 (*ip6_mtu_from_fib6)(struct fib6_info *f6i, struct in6_addr *daddr,
 struct in6_addr *saddr);
 
+   int (*fib6_nh_init)(struct net *net, struct fib6_nh *fib6_nh,
+   struct fib6_config *cfg,
+   struct netlink_ext_ack *extack);
void (*udpv6_encap_enable)(void);
void (*ndisc_send_na)(struct net_device *dev, const struct in6_addr 
*daddr,
  const struct in6_addr *solicited_addr,
diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 3d4930528db0..2a1fae1247a9 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 #define FIB6_TABLE_HASHSZ 256
@@ -413,6 +414,16 @@ int fib6_add(struct fib6_node *root, struct fib6_info *rt,
 struct nl_info *info, struct netlink_ext_ack *extack);
 int fib6_del(struct fib6_info *rt, struct nl_info *info);
 
+int fib6_nh_init(struct net *net, struct fib6_nh *fib6_nh,
+struct fib6_config *cfg, struct netlink_ext_ack *extack);
+static inline void fib6_nh_release(struct fib6_nh *fib6_nh)
+{
+   if (fib6_nh->nh_dev)
+   dev_put(fib6_nh->nh_dev);
+
+   lwtstate_put(fib6_nh->nh_lwtstate);
+}
+
 static inline struct net_device *fib6_info_nh_dev(const struct fib6_info *f6i)
 {
return f6i->fib6_nh.nh_dev;
diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c
index 5cd0029d930e..f5c712136408 100644
--- a/net/ipv6/addrconf_core.c
+++ b/net/ipv6/addrconf_core.c
@@ -168,6 +168,14 @@ eafnosupport_ip6_mtu_from_fib6(struct fib6_info *f6i, 
struct in6_addr *daddr,
return 0;
 }
 
+static int eafnosupport_fib6_nh_init(struct net *net, struct fib6_nh *fib6_nh,
+struct fib6_config *cfg,
+struct netlink_ext_ack *extack)
+{
+   NL_SET_ERR_MSG(extack, "IPv6 support not enabled in kernel");
+   return -EAFNOSUPPORT;
+}
+
 const struct ipv6_stub *ipv6_stub __read_mostly = &(struct ipv6_stub) {
.ipv6_dst_lookup   = eafnosupport_ipv6_dst_lookup,
.fib6_get_table= eafnosupport_fib6_get_table,
@@ -175,6 +183,7 @@ const struct ipv6_stub *ipv6_stub __read_mostly = &(struct 
ipv6_stub) {
.fib6_lookup   = eafnosupport_fib6_lookup,
.fib6_multipath_select = eafnosupport_fib6_multipath_select,
.ip6_mtu_from_fib6 = eafnosupport_ip6_mtu_from_fib6,
+   .fib6_nh_init  = eafnosupport_fib6_nh_init,
 };
 EXPORT_SYMBOL_GPL(ipv6_stub);
 
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 673bba31eb18..a5809bf7c229 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -895,6 +895,7 @@ static const struct ipv6_stub ipv6_stub_impl = {
.fib6_lookup   = fib6_lookup,
.fib6_multipath_select = fib6_multipath_select,
.ip6_mtu_from_fib6 = ip6_mtu_from_fib6,
+   .fib6_nh_init  = fib6_nh_init,
.udpv6_encap_enable = udpv6_encap_enable,
.ndisc_send_na = ndisc_send_na,
.nd_tbl = &nd_tbl,
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index c861a6d4671d..c1c23427a81e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -198,10 +198,7 @@ void fib6_info_destroy_rcu(struct rcu_head *head)
}
}
 
-   lwtstate_put(f6i->fib6_nh.nh_lwtstate);
-
-   if (f6i->fib6_nh.nh_dev)
-   dev_put(f6i->fib6_nh.nh_dev);
+   fib6_nh_release(&f6i->fib6_nh);
 
m = f6i->fib6_metrics;
if (m != &dst_default_metrics && refcount_dec_and_test(&m->refcnt))
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 07ed7812c6b4..aa44cd5b3217 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2844,9 +2844,11 @@ static int ip6_route_check_nh(struct net *net,
}
}

[PATCH RFC net-next 02/18] net: ipv4: export fib_good_nh and fib_flush

2018-08-31 Thread dsahern

From: David Ahern 

Export fib_good_nh for use by the nexthop code when selecting a path
within a multipath nexthop.

As nexthops are deleted, fib entries referencing them are marked dead.
Export fib_flush so those entries can be removed in a timely
manner.

Signed-off-by: David Ahern 
---
 include/net/ip_fib.h | 2 ++
 net/ipv4/fib_frontend.c  | 2 +-
 net/ipv4/fib_semantics.c | 2 +-
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 69c91d1934c1..f1c053cf9489 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -399,6 +399,7 @@ int fib_sync_up(struct net_device *dev, unsigned int 
nh_flags);
 int fib_multipath_hash(const struct net *net, const struct flowi4 *fl4,
   const struct sk_buff *skb, struct flow_keys *flkeys);
 #endif
+bool fib_good_nh(const struct fib_nh *nh);
 void fib_select_multipath(struct fib_result *res, int hash);
 void fib_select_path(struct net *net, struct fib_result *res,
 struct flowi4 *fl4, const struct sk_buff *skb);
@@ -423,6 +424,7 @@ static inline void fib_combine_itag(u32 *itag, const struct 
fib_result *res)
 #endif
 }
 
+void fib_flush(struct net *net);
 void free_fib_info(struct fib_info *fi);
 
 static inline void fib_info_hold(struct fib_info *fi)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 2998b0e47d4b..b0910d8c8bd4 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -192,7 +192,7 @@ int fib_unmerge(struct net *net)
return 0;
 }
 
-static void fib_flush(struct net *net)
+void fib_flush(struct net *net)
 {
int flushed = 0;
unsigned int h;
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 93524a746ca8..7bead7c03e1b 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1682,7 +1682,7 @@ int fib_sync_up(struct net_device *dev, unsigned int 
nh_flags)
 }
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-static bool fib_good_nh(const struct fib_nh *nh)
+bool fib_good_nh(const struct fib_nh *nh)
 {
int state = NUD_REACHABLE;
 
-- 
2.11.0

[PATCH RFC net-next 05/18] net/ipv4: Define fib_get_nhs when CONFIG_IP_ROUTE_MULTIPATH is disabled

2018-08-31 Thread dsahern

From: David Ahern 

Define fib_get_nhs to return -EINVAL when CONFIG_IP_ROUTE_MULTIPATH is
not enabled and remove the ifdef check for CONFIG_IP_ROUTE_MULTIPATH.

Signed-off-by: David Ahern 
---
 net/ipv4/fib_semantics.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 9f8126debba5..9b2d8ba6bdb3 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -603,6 +603,15 @@ static void fib_rebalance(struct fib_info *fi)
 }
 #else /* CONFIG_IP_ROUTE_MULTIPATH */
 
+static int fib_get_nhs(struct fib_info *fi, struct rtnexthop *rtnh,
+  int remaining, struct fib_config *cfg,
+  struct netlink_ext_ack *extack)
+{
+   NL_SET_ERR_MSG(extack, "Multipath support not enabled in kernel");
+
+   return -EINVAL;
+}
+
 #define fib_rebalance(fi) do { } while (0)
 
 #endif /* CONFIG_IP_ROUTE_MULTIPATH */
@@ -1112,7 +1121,6 @@ struct fib_info *fib_create_info(struct fib_config *cfg,
goto failure;
 
if (cfg->fc_mp) {
-#ifdef CONFIG_IP_ROUTE_MULTIPATH
err = fib_get_nhs(fi, cfg->fc_mp, cfg->fc_mp_len, cfg, extack);
if (err != 0)
goto failure;
@@ -1133,11 +1141,6 @@ struct fib_info *fib_create_info(struct fib_config *cfg,
goto err_inval;
}
 #endif
-#else
-   NL_SET_ERR_MSG(extack,
-  "Multipath support not enabled in kernel");
-   goto err_inval;
-#endif
} else {
struct fib_nh *nh = fi->fib_nh;
 
-- 
2.11.0
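
For illustration only, the pattern this patch applies (a stub with the
same signature under #else, so the caller needs no #ifdef) can be
sketched in userspace C. This is a minimal sketch, not the kernel code:
FEATURE_MULTIPATH, struct ext_ack, get_nhs() and create_info() are
hypothetical stand-ins for CONFIG_IP_ROUTE_MULTIPATH, netlink_ext_ack,
fib_get_nhs() and fib_create_info().

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for netlink_ext_ack: carries an error string. */
struct ext_ack {
	const char *msg;
};

#ifdef FEATURE_MULTIPATH	/* hypothetical stand-in for the Kconfig option */
static int get_nhs(int nhops, struct ext_ack *extack)
{
	(void)extack;
	return nhops > 0 ? 0 : -EINVAL;
}
#else
/* Stub: same signature, always fails, and records why. */
static int get_nhs(int nhops, struct ext_ack *extack)
{
	(void)nhops;
	if (extack)
		extack->msg = "Multipath support not enabled";
	return -EINVAL;
}
#endif

/* The caller is identical in both configurations: no #ifdef here. */
static int create_info(int nhops, struct ext_ack *extack)
{
	int err = get_nhs(nhops, extack);

	if (err != 0)
		return err;
	return 0;
}
```

Built without FEATURE_MULTIPATH, create_info() fails with -EINVAL and
the message is set by the stub rather than by the caller.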

[PATCH RFC net-next 18/18] net/ipv4: Optimization for fib_info lookup

2018-08-31 Thread dsahern

From: David Ahern 

Be optimistic about re-using a fib_info when nexthop id is given and
the route does not use metrics. Avoids a memory allocation which in
most cases is expected to be freed anyway.

Signed-off-by: David Ahern 
---
 net/ipv4/fib_semantics.c | 48 
 1 file changed, 48 insertions(+)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 0ddf14512bb3..e4411cd5514b 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -316,6 +316,19 @@ static inline unsigned int fib_devindex_hashfn(unsigned 
int val)
(val >> (DEVINDEX_HASHBITS * 2))) & mask;
 }
 
+static inline unsigned int fib_info_hashfn_cfg(const struct fib_config *cfg)
+{
+   unsigned int mask = (fib_info_hash_size - 1);
+   unsigned int val = 0;
+
+   val ^= (cfg->fc_protocol << 8) | cfg->fc_scope;
+   val ^= (__force u32)cfg->fc_prefsrc;
+   val ^= cfg->fc_priority;
+   val ^= fib_devindex_hashfn(cfg->fc_nh_id);
+
+   return (val ^ (val >> 7) ^ (val >> 12)) & mask;
+}
+
 static inline unsigned int fib_info_hashfn(const struct fib_info *fi)
 {
unsigned int mask = (fib_info_hash_size - 1);
@@ -334,6 +347,35 @@ static inline unsigned int fib_info_hashfn(const struct 
fib_info *fi)
return (val ^ (val >> 7) ^ (val >> 12)) & mask;
 }
 
+/* no metrics, only nexthop id */
+static struct fib_info *fib_find_info_nh(struct net *net,
+const struct fib_config *cfg)
+{
+   struct hlist_head *head;
+   struct fib_info *fi;
+   unsigned int hash;
+
+   hash = fib_info_hashfn_cfg(cfg);
+   head = &fib_info_hash[hash];
+
+   hlist_for_each_entry(fi, head, fib_hash) {
+   if (!net_eq(fi->fib_net, net))
+   continue;
+   if (!fi->nh || fi->nh->id != cfg->fc_nh_id)
+   continue;
+   if (cfg->fc_protocol == fi->fib_protocol &&
+   cfg->fc_scope == fi->fib_scope &&
+   cfg->fc_prefsrc == fi->fib_prefsrc &&
+   cfg->fc_priority == fi->fib_priority &&
+   cfg->fc_type == fi->fib_type &&
+   cfg->fc_table == fi->fib_tb_id &&
+   !((cfg->fc_flags ^ fi->fib_flags) & ~RTNH_COMPARE_MASK))
+   return fi;
+   }
+
+   return NULL;
+}
+
 static struct fib_info *fib_find_info(const struct fib_info *nfi)
 {
struct hlist_head *head;
@@ -1154,6 +1196,12 @@ struct fib_info *fib_create_info(struct fib_config *cfg,
goto err_inval;
}
 
+   if (!cfg->fc_mx) {
+   fi = fib_find_info_nh(net, cfg);
+   if (fi)
+   return fi;
+   }
+
nh = nexthop_find_by_id(net, cfg->fc_nh_id);
if (!nh) {
NL_SET_ERR_MSG(extack,
-- 
2.11.0
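
For illustration only, the idea of this patch (hash the request and
look for a reusable entry before allocating anything) can be sketched
in userspace C. This is a minimal sketch under stated assumptions:
struct config, struct info, hash_cfg() and get_info() are hypothetical
simplifications of fib_config, fib_info, fib_info_hashfn_cfg() and the
lookup path, and the reference counting shown is an assumption of the
sketch rather than something taken from the patch.

```c
#include <stddef.h>
#include <stdlib.h>

#define TABLE_SIZE 16	/* must be a power of two for the mask below */

/* Hypothetical, simplified stand-ins for fib_config and fib_info. */
struct config {
	unsigned int protocol, scope, priority, nh_id;
};

struct info {
	struct config key;
	struct info *next;	/* hash chain */
	int refcnt;
};

static struct info *table[TABLE_SIZE];
static int allocations;	/* counts calloc() calls, for illustration */

static unsigned int hash_cfg(const struct config *c)
{
	unsigned int val = 0;

	val ^= (c->protocol << 8) | c->scope;
	val ^= c->priority;
	val ^= c->nh_id;
	return (val ^ (val >> 7) ^ (val >> 12)) & (TABLE_SIZE - 1);
}

static int same_key(const struct config *a, const struct config *b)
{
	return a->protocol == b->protocol && a->scope == b->scope &&
	       a->priority == b->priority && a->nh_id == b->nh_id;
}

/* Look the entry up from the request alone; allocate only on a miss. */
static struct info *get_info(const struct config *c)
{
	unsigned int h = hash_cfg(c);
	struct info *fi;

	for (fi = table[h]; fi; fi = fi->next) {
		if (same_key(&fi->key, c)) {
			fi->refcnt++;
			return fi;
		}
	}

	fi = calloc(1, sizeof(*fi));
	if (!fi)
		return NULL;
	allocations++;
	fi->key = *c;
	fi->refcnt = 1;
	fi->next = table[h];
	table[h] = fi;
	return fi;
}
```

Calling get_info() twice with the same config returns the same entry
and performs a single allocation, which is the saving the commit
message describes.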

[PATCH RFC net-next 15/18] net/ipv6: Use helpers to access fib6_nh data

2018-08-31 Thread dsahern

From: David Ahern 

Similar to ipv4, add helpers for accessing fib6_nh data and convert
existing users.

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h   | 11 --
 include/net/ip6_route.h |  2 ++
 include/net/nexthop.h   | 40 +++
 include/trace/events/fib6.h |  2 +-
 net/core/filter.c   | 11 +++---
 net/ipv6/route.c| 51 ++---
 6 files changed, 81 insertions(+), 36 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 9526eef711d5..1f04a26e4c65 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -424,17 +424,6 @@ static inline void fib6_nh_release(struct fib6_nh *fib6_nh)
lwtstate_put(fib6_nh->nh_lwtstate);
 }
 
-static inline struct net_device *fib6_info_nh_dev(const struct fib6_info *f6i)
-{
-   return f6i->fib6_nh->nh_dev;
-}
-
-static inline
-struct lwtunnel_state *fib6_info_nh_lwt(const struct fib6_info *f6i)
-{
-   return f6i->fib6_nh->nh_lwtstate;
-}
-
 void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info,
 unsigned int flags);
 
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index b1ca637acb2a..0cdfe176c530 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -2,6 +2,8 @@
 #ifndef _NET_IP6_ROUTE_H
 #define _NET_IP6_ROUTE_H
 
+#include 
+
 struct route_info {
__u8type;
__u8length;
diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index c149fe8394ab..dae1518af3f3 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -160,6 +160,46 @@ static inline __be32 fib_info_nh_gw(struct fib_info *fi)
return fib_nh ? fib_nh->nh_gw : 0;
 }
 
+/* IPv6 variants
+ */
+static inline struct fib6_nh *nexthop_fib6_nh(struct nexthop *nh)
+{
+   struct nh_info *nhi;
+
+   nhi = rcu_dereference(nh->nh_info);
+   if (nhi->family == AF_INET6)
+   return &nhi->fib6_nh;
+
+   return NULL;
+}
+
+static inline struct fib6_nh *fib6_info_nh(struct fib6_info *f6i)
+{
+   return f6i->fib6_nh;
+}
+
+static inline struct net_device *fib6_info_nh_dev(struct fib6_info *f6i)
+{
+   struct fib6_nh *fib6_nh = fib6_info_nh(f6i);
+
+   return fib6_nh ? fib6_nh->nh_dev : NULL;
+}
+
+static inline struct in6_addr *fib6_info_nh_gw(struct fib6_info *f6i)
+{
+   struct fib6_nh *fib6_nh = fib6_info_nh(f6i);
+
+   return fib6_nh ? &fib6_nh->nh_gw : NULL;
+}
+
+static inline
+struct lwtunnel_state *fib6_info_nh_lwt(struct fib6_info *f6i)
+{
+   struct fib6_nh *fib6_nh = fib6_info_nh(f6i);
+
+   return fib6_nh ? fib6_nh->nh_lwtstate : NULL;
+}
+
 int fib_check_nexthop(struct fib_info *fi, struct fib_config *cfg,
  struct netlink_ext_ack *extack);
 
diff --git a/include/trace/events/fib6.h b/include/trace/events/fib6.h
index 037df3d2be0b..4e5e36cc35b9 100644
--- a/include/trace/events/fib6.h
+++ b/include/trace/events/fib6.h
@@ -36,7 +36,7 @@ TRACE_EVENT(fib6_table_lookup,
),
 
TP_fast_assign(
-   struct fib6_nh *fib6_nh = f6i->fib6_nh;
+   struct fib6_nh *fib6_nh = fib6_info_nh(f6i);
struct in6_addr *in6;
 
__entry->tb_id = table->tb6_id;
diff --git a/net/core/filter.c b/net/core/filter.c
index bc979edf06ca..4d227fae69c8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4340,6 +4340,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct 
bpf_fib_lookup *params,
 {
struct in6_addr *src = (struct in6_addr *) params->ipv6_src;
struct in6_addr *dst = (struct in6_addr *) params->ipv6_dst;
+   struct fib6_nh *fib6_nh;
struct neighbour *neigh;
struct net_device *dev;
struct inet6_dev *idev;
@@ -4428,13 +4429,17 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct 
bpf_fib_lookup *params,
return BPF_FIB_LKUP_RET_FRAG_NEEDED;
}
 
-   if (f6i->fib6_nh->nh_lwtstate)
+   fib6_nh = fib6_info_nh(f6i);
+   if (!fib6_nh)
+   return BPF_FIB_LKUP_RET_NOT_FWDED;
+
+   if (fib6_nh->nh_lwtstate)
return BPF_FIB_LKUP_RET_UNSUPP_LWT;
 
if (f6i->fib6_flags & RTF_GATEWAY)
-   *dst = f6i->fib6_nh->nh_gw;
+   *dst = fib6_nh->nh_gw;
 
-   dev = f6i->fib6_nh->nh_dev;
+   dev = fib6_nh->nh_dev;
params->rt_metric = f6i->fib6_metric;
 
/* xdp and cls_bpf programs are run in RCU-bh so rcu_read_lock_bh is
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 5792f57fdb91..2c140ce95eb4 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -533,8 +533,8 @@ static void rt6_probe(struct fib6_info *rt)
if (!rt || !(rt->fib6_flags & RTF_GATEWAY))
return;
 
-   nh_gw = &rt->fib6_nh->nh_gw;
-   dev = rt->fib6_nh->nh_dev;
+   nh_gw = fib6_info_nh_gw(rt);
+   dev = fib6_info_nh_dev(rt);
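
For illustration only, the accessor pattern this patch introduces
(callers go through a helper that hides the NULL check instead of
open-coding the pointer chain) can be sketched in plain C. This is a
minimal sketch: struct device, struct nh, struct route and the
route_nh*() helpers are hypothetical simplifications, not the kernel
structures or the helpers added to include/net/nexthop.h.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct device { int ifindex; };
struct nh { struct device *dev; };
struct route { struct nh *nh; };	/* nh may be NULL */

static inline struct nh *route_nh(const struct route *rt)
{
	return rt->nh;
}

/* Accessor hides the NULL check so callers do not open-code rt->nh->dev. */
static inline struct device *route_nh_dev(const struct route *rt)
{
	struct nh *nh = route_nh(rt);

	return nh ? nh->dev : NULL;
}
```

A caller then tests the returned pointer once, as the converted
bpf_ipv6_fib_lookup() hunk does with fib6_info_nh().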

[PATCH iproute2-next] ip: Add support for nexthop objects

2018-08-31 Thread dsahern

From: David Ahern 

Signed-off-by: David Ahern 
---
 include/uapi/linux/nexthop.h   |  56 
 include/uapi/linux/rtnetlink.h |   8 +
 ip/Makefile|   3 +-
 ip/ip.c|   3 +-
 ip/ip_common.h |   7 +-
 ip/ipmonitor.c |   6 +
 ip/ipnexthop.c | 652 +
 ip/iproute.c   |  19 +-
 8 files changed, 747 insertions(+), 7 deletions(-)
 create mode 100644 include/uapi/linux/nexthop.h
 create mode 100644 ip/ipnexthop.c

diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
new file mode 100644
index ..335182e8229a
--- /dev/null
+++ b/include/uapi/linux/nexthop.h
@@ -0,0 +1,56 @@
+#ifndef __LINUX_NEXTHOP_H
+#define __LINUX_NEXTHOP_H
+
+#include 
+
+struct nhmsg {
+   unsigned char   nh_family;
+   unsigned char   nh_scope; /* one of RT_SCOPE */
+   unsigned char   nh_protocol;  /* Routing protocol that installed nh */
+   unsigned char   resvd;
+   unsigned intnh_flags; /* RTNH_F flags */
+};
+
+struct nexthop_grp {
+   __u32   id;
+   __u32   weight;
+};
+
+enum {
+   NEXTHOP_GRP_TYPE_MPATH,  /* default type if not specified */
+   __NEXTHOP_GRP_TYPE_MAX,
+};
+
+#define NEXTHOP_GRP_TYPE_MAX (__NEXTHOP_GRP_TYPE_MAX - 1)
+
+
+/* NHA_ID  32-bit id for nexthop. id must be greater than 0.
+ * id == 0 means assign an unused id.
+ */
+enum {
+   NHA_UNSPEC,
+   NHA_ID, /* u32 */
+   NHA_GROUP,  /* array of nexthop_grp */
+   NHA_GROUP_TYPE, /* u16 one of NEXTHOP_GRP_TYPE;
+* default is NEXTHOP_GRP_TYPE_MPATH */
+
+   /* if NHA_GROUP attribute is added, no other attributes can be set */
+
+   NHA_BLACKHOLE,  /* flag; nexthop used to blackhole packets */
+   NHA_OIF,/* u32 */
+   NHA_FLOW,   /* u32 */
+
+   NHA_TABLE_ID,   /* u32 - table id to validate gateway */
+   NHA_GATEWAY,/* be32 (IPv4) or in6_addr (IPv6) gw address */
+
+   /* Dump control attributes */
+   NHA_GROUPS, /* flag; only return nexthop groups in dump */
+   NHA_MASTER, /* u32; only return nexthops with given master dev */
+
+   NHA_SADDR,  /* return only: IPv4 or IPv6 source address */
+
+   __NHA_MAX,
+};
+
+#define NHA_MAX(__NHA_MAX - 1)
+#endif
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 8c1d600bfa33..158114245b6c 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -157,6 +157,13 @@ enum {
RTM_GETCHAIN,
 #define RTM_GETCHAIN RTM_GETCHAIN
 
+   RTM_NEWNEXTHOP = 104,
+#define RTM_NEWNEXTHOP RTM_NEWNEXTHOP
+   RTM_DELNEXTHOP,
+#define RTM_DELNEXTHOP RTM_DELNEXTHOP
+   RTM_GETNEXTHOP,
+#define RTM_GETNEXTHOP RTM_GETNEXTHOP
+
__RTM_MAX,
 #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1)
 };
@@ -342,6 +349,7 @@ enum rtattr_type_t {
RTA_IP_PROTO,
RTA_SPORT,
RTA_DPORT,
+   RTA_NH_ID,
__RTA_MAX
 };
 
diff --git a/ip/Makefile b/ip/Makefile
index a88f93665ee6..7df818dbe23a 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -10,7 +10,8 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o 
ipnetns.o \
 link_iptnl.o link_gre6.o iplink_bond.o iplink_bond_slave.o iplink_hsr.o \
 iplink_bridge.o iplink_bridge_slave.o ipfou.o iplink_ipvlan.o \
 iplink_geneve.o iplink_vrf.o iproute_lwtunnel.o ipmacsec.o ipila.o \
-ipvrf.o iplink_xstats.o ipseg6.o iplink_netdevsim.o iplink_rmnet.o
+ipvrf.o iplink_xstats.o ipseg6.o iplink_netdevsim.o iplink_rmnet.o \
+ipnexthop.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/ip.c b/ip/ip.c
index 58c643df8a36..963ef140c7c4 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -51,7 +51,7 @@ static void usage(void)
 "where  OBJECT := { link | address | addrlabel | route | rule | neigh | ntable 
|\n"
 "   tunnel | tuntap | maddress | mroute | mrule | monitor | 
xfrm |\n"
 "   netns | l2tp | fou | macsec | tcp_metrics | token | 
netconf | ila |\n"
-"   vrf | sr }\n"
+"   vrf | sr | nexthop }\n"
 "   OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] | -r[esolve] |\n"
 "-h[uman-readable] | -iec | -j[son] | -p[retty] |\n"
 "-f[amily] { inet | inet6 | ipx | dnet | mpls | bridge | 
link } |\n"
@@ -101,6 +101,7 @@ static const struct cmd {
{ "netconf",do_ipnetconf },
{ "vrf",do_ipvrf},
{ "sr", do_seg6 },
+   { "nexthop",do_ipnh },
{ "help",   do_help },
{ 0 }
 };
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 200be5e23dd1..2971c1586c4e 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -56,6 +56,8 @@ int print_rule(const struct sockaddr_nl *who,
 int print_netconf(const struct sockaddr_nl *who,
  struct rtnl_ctrl_data *ctrl,
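
For illustration only, the effect of the three new message types on
RTM_MAX in the rtnetlink.h hunk of this patch can be worked through in
a few lines of C. The sketch models only the enum tail added by the
patch; the values 105 to 107 follow from RTM_NEWNEXTHOP = 104 by
ordinary enum numbering.

```c
/* Mirrors the enum tail added by the patch; only these values are modeled. */
enum {
	RTM_NEWNEXTHOP = 104,
	RTM_DELNEXTHOP,		/* 105 */
	RTM_GETNEXTHOP,		/* 106 */
	__RTM_MAX,		/* 107 */
};

/* Round __RTM_MAX up to a multiple of 4, then subtract 1. */
#define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1)
```

With __RTM_MAX at 107, (107 + 3) & ~3 is 108, so RTM_MAX evaluates to
107.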

[PATCH RFC net-next 14/18] net/ipv4: Allow routes to use nexthop objects

2018-08-31 Thread dsahern

From: David Ahern 

Add new RTA attribute to allow a user to specify a nexthop id to use
with a route instead of the current nexthop specification.

Signed-off-by: David Ahern 
---
 include/net/ip_fib.h   |   1 +
 include/uapi/linux/rtnetlink.h |   1 +
 net/ipv4/fib_frontend.c|   7 +++
 net/ipv4/fib_semantics.c   | 139 ++---
 net/ipv4/fib_trie.c|  33 +++---
 5 files changed, 136 insertions(+), 45 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index c59e0f1ba59b..d2f961de732d 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -40,6 +40,7 @@ struct fib_config {
u32 fc_flags;
u32 fc_priority;
__be32  fc_prefsrc;
+   u32 fc_nh_id;
struct nlattr   *fc_mx;
struct rtnexthop*fc_mp;
int fc_mx_len;
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 4a0615797e5e..a036368798a9 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -349,6 +349,7 @@ enum rtattr_type_t {
RTA_IP_PROTO,
RTA_SPORT,
RTA_DPORT,
+   RTA_NH_ID,
__RTA_MAX
 };
 
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index c483453bf037..cf133d4e02f2 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -322,6 +322,9 @@ static bool fib_info_nh_uses_dev(struct fib_info *fi,
bool dev_match = false;
int ret;
 
+   if (fi->nh)
+   return nexthop_uses_dev(fi->nh, dev);
+
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
for (ret = 0; ret < fi->fib_nhs; ret++) {
struct fib_nh *nh = &fi->fib_nh[ret];
@@ -663,6 +666,7 @@ const struct nla_policy rtm_ipv4_policy[RTA_MAX + 1] = {
[RTA_IP_PROTO]  = { .type = NLA_U8 },
[RTA_SPORT] = { .type = NLA_U16 },
[RTA_DPORT] = { .type = NLA_U16 },
+   [RTA_NH_ID] = { .type = NLA_U32 },
 };
 
 static int rtm_to_fib_config(struct net *net, struct sk_buff *skb,
@@ -746,6 +750,9 @@ static int rtm_to_fib_config(struct net *net, struct 
sk_buff *skb,
if (err < 0)
goto errout;
break;
+   case RTA_NH_ID:
+   cfg->fc_nh_id = nla_get_u32(attr);
+   break;
}
}
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 0cd536ad1761..c91cdafd40ec 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -226,9 +226,13 @@ static void free_fib_info_rcu(struct rcu_head *head)
struct fib_info *fi = container_of(head, struct fib_info, rcu);
struct dst_metrics *m;
 
-   change_nexthops(fi) {
-   fib_nh_release(fi->fib_net, nexthop_nh);
-   } endfor_nexthops(fi);
+   if (fi->nh) {
+   nexthop_put(fi->nh);
+   } else {
+   change_nexthops(fi) {
+   fib_nh_release(fi->fib_net, nexthop_nh);
+   } endfor_nexthops(fi);
+   }
 
m = fi->fib_metrics;
if (m != &dst_default_metrics && refcount_dec_and_test(&m->refcnt))
@@ -260,11 +264,15 @@ void fib_release_info(struct fib_info *fi)
hlist_del(&fi->fib_hash);
if (fi->fib_prefsrc)
hlist_del(&fi->fib_lhash);
-   change_nexthops(fi) {
-   if (!nexthop_nh->nh_dev)
-   continue;
-   hlist_del(&nexthop_nh->nh_hash);
-   } endfor_nexthops(fi)
+   if (fi->nh) {
+   list_del(&fi->nh_list);
+   } else {
+   change_nexthops(fi) {
+   if (!nexthop_nh->nh_dev)
+   continue;
+   hlist_del(&nexthop_nh->nh_hash);
+   } endfor_nexthops(fi)
+   }
fi->fib_dead = 1;
fib_info_put(fi);
}
@@ -275,6 +283,12 @@ static inline int nh_comp(const struct fib_info *fi, const 
struct fib_info *ofi)
 {
const struct fib_nh *onh = ofi->fib_nh;
 
+   if (fi->nh || ofi->nh)
+   return nexthop_cmp(fi->nh, ofi->nh) ? 0 : -1;
+
+   if (ofi->fib_nhs == 0)
+   return 0;
+
for_nexthops(fi) {
if (nh->nh_oif != onh->nh_oif ||
nh->nh_gw  != onh->nh_gw ||
@@ -310,10 +324,13 @@ static inline unsigned int fib_info_hashfn(const struct 
fib_info *fi)
val ^= (fi->fib_protocol << 8) | fi->fib_scope;
val ^= (__force u32)fi->fib_prefsrc;
val ^= fi->fib_priority;
-   for_nexthops(fi) {
-   val ^= fib_devindex_hashfn(nh->nh_oif);
-   } endfor_nexthops(fi)
-
+   if (fi->nh) {
+   val ^=

[PATCH RFC net-next 00/18] net: Improve route scalability via support for nexthop objects

2018-08-31 Thread dsahern

From: David Ahern 

As mentioned at netconf in Seoul, we would like to introduce nexthops as
independent objects from the routes to better align with both routing
daemons and hardware and to improve route insertion times into the kernel.

This series adds nexthop objects with their own lifecycle. The model
retains a lot of the established semantics from routes and re-uses some
of the data structures like fib_nh and fib6_nh to more easily align with
the existing code. One difference with nexthop objects is that the behavior
better aligns with the target user - routing daemons and switch ASICs.
Specifically, with the exception of the blackhole nexthop, all nexthops
must reference a netdevice (or have a gateway that resolves to a device)
and the device must be admin up with carrier.

Prefixes are then installed pointing to the nexthop by id:
  { prefix } --> { nexthop }  --> { gateway, device }

The nexthop object contains the gateway and device reference.
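
For illustration only, the indirection described above can be sketched
in plain C. This is a minimal sketch under stated assumptions: struct
nexthop, struct route, route_ifindex() and nexthop_replace() are
hypothetical simplifications, and the sketch does not model the
atomicity or RCU handling of the real implementation.

```c
#include <stddef.h>

/* Hypothetical, simplified model; not the kernel's data structures. */
struct nexthop {
	unsigned int id;
	unsigned int gateway;	/* IPv4 address in host order, for brevity */
	int ifindex;		/* egress device */
};

struct route {
	unsigned int prefix;
	unsigned int prefix_len;
	const struct nexthop *nh;	/* shared, referenced rather than copied */
};

/* Resolve a route's egress device through its nexthop object. */
static int route_ifindex(const struct route *rt)
{
	return rt->nh ? rt->nh->ifindex : -1;
}

/* Replace the nexthop's contents in place; every route that points
 * at it sees the new gateway and device without being touched. */
static void nexthop_replace(struct nexthop *nh, unsigned int gw, int ifindex)
{
	nh->gateway = gw;
	nh->ifindex = ifindex;
}
```

Two routes sharing one nexthop both resolve to the new device after a
single nexthop_replace() call, with no per-route work.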

Benchmarks
The following data shows the route insert time for 720,022 routes (a full
IPv4 internet feed from August 28th). "current" means the current code
where a route insert specifies the device and gateway inline with the
prefix; the "nexthop" columns mean use of the nexthop objects.

          1-hop      1-hop      |  2-hops     2-hops
          current    nexthop    |  current    nexthop
--------------------------------|----------------------
real      0m21.872s  0m12.982s  |  0m28.723s  0m12.406s
user      0m2.929s   0m1.816s   |  0m3.966s   0m1.935s
sys       0m13.469s  0m6.010s   |  0m18.992s  0m5.913s

With nexthop objects the time to insert the routes is reduced by more
than 30% with the kernel time cut in half. The current model has a route
insertion rate of about 32,000 prefixes / second and with nexthop objects
that increases to a little over 55,000 prefixes/second.
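
For illustration only, the rates and reductions quoted above can be
re-derived from the 1-hop columns of the table. The constants are
copied from the table; insert_rate() and reduction() are hypothetical
helper names used only for this sketch.

```c
/* Figures copied from the benchmark table (1-hop columns). */
#define NUM_ROUTES	720022.0
#define REAL_CURRENT	21.872	/* seconds */
#define REAL_NEXTHOP	12.982
#define SYS_CURRENT	13.469
#define SYS_NEXTHOP	6.010

/* Prefixes inserted per second of wall-clock time. */
static double insert_rate(double routes, double seconds)
{
	return routes / seconds;
}

/* Fractional reduction going from 'before' to 'after'. */
static double reduction(double before, double after)
{
	return (before - after) / before;
}
```

This works out to roughly 32,900 and 55,500 prefixes per second, with
wall-clock time reduced by about 41% and system time by about 55%.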

For routes with multiple nexthops the install time is cut by more than
half with system time reduced by a factor of 3. Further, with nexthop
objects insert times for multipath routes drop down to the same as
single path routes since the multipath spec is given once (i.e., with the
current model, the time to insert routes increases with the number of
paths in the route, whereas with nexthop objects the number of paths
is handled once and the prefixes referencing it are installed in constant
time).

The difference between real and system times shows there is room for
improvement with the trie implementation. As an example, increasing the
sync_pages from 128 to 1024 delays the call to synchronize_rcu, increasing
the insert rate to more than 78,000 prefixes/sec!

Some key features:
1. Allows atomic replace of any nexthop object - a nexthop or a group.
   This allows existing route entries to have their nexthop updated
   without the overhead of removing and re-inserting (or replacing)
   them. Instead, one update of the nexthop object implicitly updates
   all routes referencing it.

   One limitation with the atomic replace is that a nexthop group can
   only be replaced with a new group spec and similarly a nexthop can
   only be replaced by a nexthop spec. Specifically, a nexthop id can
   not move between a single nexthop and a group nexthop.

2. Blackhole nexthop: a nexthop object can be designated a blackhole,
   which means that for any lookup that resolves to it, packets are dropped
   as if the lookup failed with the result RTN_BLACKHOLE. Blackhole nexthops
   can not be used with nexthop groups. Combined with atomic replace
   this allows routes to be installed pointing to a blackhole nexthop
   and then switched to an actual gateway with a single nexthop replace
   command (or vice versa, a gateway nexthop is flipped to a blackhole).

3. Nexthop groups for multipath routes. A nexthop group is a nexthop
   that references other nexthops. A multipath group can not be used
   as a nexthop in another nexthop group (ie., groups can not be nested).

4. Multipath routes for IPv6 with device only nexthops. There is a
   demonstrated need for this feature and the existing route semantics
   do not allow it. This series provides a means for that end - create a
   nexthop that has a device only specification.

5. Admin and carrier up are required. If the device goes down (admin or
   carrier) the nexthop is removed in which case routes referencing the
   nexthop are evicted and any nexthop groups referencing it are adjusted.

6. Follow on patches will allow IPv6 nexthops with IPv4 routes for users
   wanting support of RFC 5549.

7. Future extensions: active / backup nexthop. The nexthop groups are
   structured to allow a new group type to be added. One example is a
   group where a nexthop has a preferred device and gateway, but should
   the device go down or the gateway not resolve, the backup nexthop is
   used.
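
For illustration only, the constraints from items 1 to 3 (a replace
keeps the kind of the object, groups can not be nested, blackholes can
not be group members) can be written down as two checks. This is a
minimal sketch: enum nh_kind, struct nh_spec, both function names and
the choice of -EINVAL as the error code are assumptions of the sketch,
not taken from the series.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the rules in items 1 to 3; names are illustrative. */
enum nh_kind { NH_SINGLE, NH_GROUP };

struct nh_spec {
	enum nh_kind kind;
	bool blackhole;		/* only meaningful for NH_SINGLE */
};

/* A replace must keep the kind: single stays single, group stays group. */
static int nh_replace_check(const struct nh_spec *old_nh,
			    const struct nh_spec *new_nh)
{
	if (old_nh->kind != new_nh->kind)
		return -EINVAL;
	return 0;
}

/* A group member must be a plain nexthop: no nested groups, no blackholes. */
static int nh_group_member_check(const struct nh_spec *member)
{
	if (member->kind == NH_GROUP || member->blackhole)
		return -EINVAL;
	return 0;
}
```

Under these rules a blackhole nexthop can be replaced by a gateway
nexthop (both are single nexthops), while a single nexthop can not be
replaced by a group.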

Additional Benefits
- smaller route notifications - messages contain a single nexthop id versus
  the detailed nexthop specification. This is

pull-request: bpf-next 2018-09-01

2018-08-31 Thread Daniel Borkmann

Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add AF_XDP zero-copy support for i40e driver (!), from Björn and Magnus.

2) BPF verifier improvements by giving each register its own liveness
   chain, which allows simplifying the code and getting rid of the
   skip_callee() logic, from Edward.

3) Add bpf fs pretty print support for percpu arraymap, percpu hashmap
   and percpu lru hashmap. Also add generic percpu formatted print on
   bpftool so the same can be dumped there, from Yonghong.

4) Add bpf_{set,get}sockopt() helper support for TCP_SAVE_SYN and
   TCP_SAVED_SYN options to allow reflection of tos/tclass from received
   SYN packet, from Nikita.

5) Misc improvements to the BPF sockmap test cases in terms of cgroup v2
   interaction and removal of incorrect shutdown() calls, from John.

6) A few cleanups in xdp_umem_assign_dev() and xdpsock samples, from Prashant.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!



The following changes since commit 817e60a7a2bb1f22052f18562990d675cb3a3762:

  Merge branch 'nfp-add-NFP5000-support' (2018-08-28 16:01:48 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to 93ee30f3e8b412c5fc2d2f7d9d002529d9a209ad:

  xsk: i40e: get rid of useless struct xdp_umem_props (2018-09-01 01:38:16 +0200)


Alexei Starovoitov (2):
  Merge branch 'AF_XDP-zerocopy-for-i40e'
  Merge branch 'verifier-liveness-simplification'

Björn Töpel (9):
  xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
  xdp: export xdp_rxq_info_unreg_mem_model
  xsk: expose xdp_umem_get_{data,dma} to drivers
  i40e: added queue pair disable/enable functions
  i40e: refactor Rx path for re-use
  i40e: move common Rx functions to i40e_txrx_common.h
  i40e: add AF_XDP zero-copy Rx support
  samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
  xsk: include XDP meta data in AF_XDP frames

Colin Ian King (1):
  xdp: remove redundant variable 'headroom'

Daniel Borkmann (1):
  Merge branch 'bpf-bpffs-bpftool-dump-with-btf'

Edward Cree (2):
  bpf/verifier: per-register parent pointers
  bpf/verifier: display non-spill stack slot types in print_verifier_state

John Fastabend (2):
  bpf: sockmap test remove shutdown() calls
  bpf: use --cgroup in test_suite if supplied

Magnus Karlsson (5):
  net: add napi_if_scheduled_mark_missed
  i40e: move common Tx functions to i40e_txrx_common.h
  i40e: add AF_XDP zero-copy Tx support
  i40e: fix possible compiler warning in xsk TX path
  xsk: i40e: get rid of useless struct xdp_umem_props

Nikita V. Shirokov (3):
  bpf: add TCP_SAVE_SYN/TCP_SAVED_SYN options for bpf_(set|get)sockopt
  bpf: add TCP_SAVE_SYN/TCP_SAVED_SYN sample program
  bpf: add selftest for bpf's (set|get)_sockopt for SAVE_SYN

Prashant Bhole (2):
  xsk: remove unnecessary assignment
  samples/bpf: xdpsock, minor fixes

Yonghong Song (3):
  bpf: add bpffs pretty print for percpu arraymap/hash/lru_hash
  tools/bpf: add bpffs percpu map pretty print tests in test_btf
  tools/bpf: bpftool: add btf percpu map formated dump

YueHaibing (1):
  bpf: remove duplicated include from syscall.c

 drivers/net/ethernet/intel/i40e/Makefile   |   3 +-
 drivers/net/ethernet/intel/i40e/i40e.h |  19 +
 drivers/net/ethernet/intel/i40e/i40e_main.c| 307 +++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c| 182 +++--
 drivers/net/ethernet/intel/i40e/i40e_txrx.h|  20 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx_common.h |  90 +++
 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 832 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.h |  25 +
 include/linux/bpf_verifier.h   |   8 +-
 include/linux/netdevice.h  |  26 +
 include/net/xdp.h  |   6 +-
 include/net/xdp_sock.h |  51 +-
 kernel/bpf/arraymap.c  |  24 +
 kernel/bpf/hashtab.c   |  31 +
 kernel/bpf/syscall.c   |   1 -
 kernel/bpf/verifier.c  | 216 ++
 net/core/filter.c  |  25 +-
 net/core/xdp.c |  53 +-
 net/xdp/xdp_umem.c |   6 +-
 net/xdp/xdp_umem.h |  10 -
 net/xdp/xdp_umem_props.h   |  14 -
 net/xdp/xsk.c  |  34 +-
 net/xdp/xsk_queue.c|   5 +-
 net/xdp/xsk_queue.h

Re: [PATCH bpf-next v2 0/2] xsk: misc code cleanup

2018-08-31 Thread Daniel Borkmann

On 08/31/2018 01:40 PM, Magnus Karlsson wrote:
> This patch set cleans up two code style issues with the xsk zero-copy
> code. The resulting code is smaller and simpler.
> 
> Changes from v1:
> 
> * Fixed bisectability problem reported by Daniel Borkmann by squashing
>   the two last patches into one.
> 
> Patch 1: Removes a potential compiler warning reported by the Intel
>  0-DAY kernel test infrastructure.
> Patch 2: Removes the xdp_umem_props structure. At some point, it
>  was used to break a dependency, but the members are these
>  days much better off in the xdp_umem since the dependency
>  does not exist anymore. Also adapts the i40e driver to this
>new interface.
> 
> I based this patch set on bpf-next commit 9c4f39811db8 ("samples/bpf:
> xdpsock, minor fixes")
> 
> Thanks: Magnus

Applied to bpf-next, thanks Magnus!

Re: [PATCH bpf-next] adding selftest for bpf's (set|get)_sockopt for SAVE_SYN

2018-08-31 Thread Daniel Borkmann

On 08/31/2018 06:43 PM, Nikita V. Shirokov wrote:
> adding selftest for feature, introduced in
> commit 9452048c79404 ("bpf: add TCP_SAVE_SYN/TCP_SAVED_SYN options for
> bpf_(set|get)sockopt")
> 
> Signed-off-by: Nikita V. Shirokov 

Applied to bpf-next. Nikita, please also add a proper subsystem prefix to
your patches in future, I've fixed them all up manually this time.

[RFC PATCH bpf-next v2 3/4] Sync uapi/bpf.h to tools/include

2018-08-31 Thread Mauricio Vasquez B

Sync both files.

Signed-off-by: Mauricio Vasquez B 
---
 tools/include/uapi/linux/bpf.h |   36 
 1 file changed, 32 insertions(+), 4 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4eba27..0a5b904ba42f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -103,6 +103,7 @@ enum bpf_cmd {
BPF_BTF_LOAD,
BPF_BTF_GET_FD_BY_ID,
BPF_TASK_FD_QUERY,
+   BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 };
 
 enum bpf_map_type {
@@ -127,6 +128,8 @@ enum bpf_map_type {
BPF_MAP_TYPE_SOCKHASH,
BPF_MAP_TYPE_CGROUP_STORAGE,
BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
+   BPF_MAP_TYPE_QUEUE,
+   BPF_MAP_TYPE_STACK,
 };
 
 enum bpf_prog_type {
@@ -459,6 +462,28 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
+ * int bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
+ * Description
+ * Push an element *value* in *map*. *flags* is one of:
+ *
+ * **BPF_EXIST**
+ * If the queue/stack is full, the oldest element is removed to
+ * make room for this.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * void *bpf_map_pop_elem(struct bpf_map *map)
+ * Description
+ * Pop an element from *map*.
+ * Return
+ * Pointer to the element or *NULL* if there is not any.
+ *
+ * void *bpf_map_peek_elem(struct bpf_map *map)
+ * Description
+ * Return an element from *map* without removing it.
+ * Return
+ * Pointer to the element or *NULL* if there is not any.
+ *
  * int bpf_probe_read(void *dst, u32 size, const void *src)
  * Description
  * For tracing programs, safely attempt to read *size* bytes from
@@ -786,14 +811,14 @@ union bpf_attr {
  *
  * int ret;
  * struct bpf_tunnel_key key = {};
- * 
+ *
  * ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
  * if (ret < 0)
  * return TC_ACT_SHOT; // drop packet
- * 
+ *
  * if (key.remote_ipv4 != 0x0a01)
  * return TC_ACT_SHOT; // drop packet
- * 
+ *
  * return TC_ACT_OK;   // accept packet
  *
  * This interface can also be used with all encapsulation devices
@@ -2226,7 +2251,10 @@ union bpf_attr {
FN(get_current_cgroup_id),  \
FN(get_local_storage),  \
FN(sk_select_reuseport),\
-   FN(skb_ancestor_cgroup_id),
+   FN(skb_ancestor_cgroup_id), \
+   FN(map_push_elem),  \
+   FN(map_pop_elem),   \
+   FN(map_peek_elem),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call

[RFC PATCH bpf-next v2 4/4] selftests/bpf: add test cases for queue and stack maps

2018-08-31 Thread Mauricio Vasquez B

Two types of tests are done:
- test_maps: only userspace api.
- test_progs: userspace api and ebpf helpers.

Signed-off-by: Mauricio Vasquez B 
---
 tools/lib/bpf/bpf.c|   12 ++
 tools/lib/bpf/bpf.h|1 
 tools/testing/selftests/bpf/Makefile   |2 
 tools/testing/selftests/bpf/bpf_helpers.h  |7 +
 tools/testing/selftests/bpf/test_maps.c|  101 
 tools/testing/selftests/bpf/test_progs.c   |   99 
 tools/testing/selftests/bpf/test_queue_map.c   |4 +
 tools/testing/selftests/bpf/test_queue_stack_map.h |   59 
 tools/testing/selftests/bpf/test_stack_map.c   |4 +
 9 files changed, 288 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_queue_map.c
 create mode 100644 tools/testing/selftests/bpf/test_queue_stack_map.h
 create mode 100644 tools/testing/selftests/bpf/test_stack_map.c

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 60aa4ca8b2c5..7056b2eb554d 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -286,6 +286,18 @@ int bpf_map_lookup_elem(int fd, const void *key, void 
*value)
return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
 }
 
+int bpf_map_lookup_and_delete_elem(int fd, const void *key, const void *value)
+{
+   union bpf_attr attr;
+
+   bzero(&attr, sizeof(attr));
+   attr.map_fd = fd;
+   attr.key = ptr_to_u64(key);
+   attr.value = ptr_to_u64(value);
+
+   return sys_bpf(BPF_MAP_LOOKUP_AND_DELETE_ELEM, &attr, sizeof(attr));
+}
+
 int bpf_map_delete_elem(int fd, const void *key)
 {
union bpf_attr attr;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 6f38164b2618..6134ed9517d3 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -86,6 +86,7 @@ int bpf_map_update_elem(int fd, const void *key, const void 
*value,
__u64 flags);
 
 int bpf_map_lookup_elem(int fd, const void *key, void *value);
+int bpf_map_lookup_and_delete_elem(int fd, const void *key, const void *value);
 int bpf_map_delete_elem(int fd, const void *key);
 int bpf_map_get_next_key(int fd, const void *key, void *next_key);
 int bpf_obj_pin(int fd, const char *pathname);
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index fff7fb1285fc..3c773a66aa5f 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o 
test_lirc_mode2_kern.o \
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-   test_skb_cgroup_id_kern.o
+   test_skb_cgroup_id_kern.o test_queue_map.o test_stack_map.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
b/tools/testing/selftests/bpf/bpf_helpers.h
index e4be7730222d..05fb5ed90b89 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -16,6 +16,13 @@ static int (*bpf_map_update_elem)(void *map, void *key, void 
*value,
(void *) BPF_FUNC_map_update_elem;
 static int (*bpf_map_delete_elem)(void *map, void *key) =
(void *) BPF_FUNC_map_delete_elem;
+static int (*bpf_map_push_elem)(void *map, void *value,
+   unsigned long long flags) =
+   (void *) BPF_FUNC_map_push_elem;
+static void *(*bpf_map_pop_elem)(void *map) =
+   (void *) BPF_FUNC_map_pop_elem;
+static void *(*bpf_map_peek_elem)(void *map) =
+   (void *) BPF_FUNC_map_peek_elem;
 static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) =
(void *) BPF_FUNC_probe_read;
 static unsigned long long (*bpf_ktime_get_ns)(void) =
diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 6f54f84144a0..754871c7c8b4 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -471,6 +472,102 @@ static void test_devmap(int task, void *data)
close(fd);
 }
 
+static void test_queuemap(int task, void *data)
+{
+   const int MAP_SIZE = 32;
+   __u32 vals[MAP_SIZE + MAP_SIZE/2], val;
+   int fd, i;
+
+   /* Fill test values to be used */
+   for (i = 0; i < MAP_SIZE + MAP_SIZE/2; i++)
+   vals[i] = rand();
+
+   fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 0, sizeof(val), MAP_SIZE,
+   map_flags);
+   if (fd < 0) {
+   printf("Failed to create queuemap '%s'!\n", strerror(errno));
+   exit(1);
+   }
+
+   /* Push MAP_SIZE elements */
+   for (i = 0; i <

[PATCH net] bpf: Fix bpf_msg_pull_data()

2018-08-31 Thread Tushar Dave

Helper bpf_msg_pull_data() mistakenly reuses variable 'offset' while
linearizing multiple scatterlist elements. Variable 'offset' is used
to find the first starting scatterlist element,
i.e. msg->data = sg_virt(&sg[first_sg]) + start - offset.

Use different variable name while linearizing multiple scatterlist
elements so that value contained in variable 'offset' won't get
overwritten.
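
The effect of the overwrite can be modelled in user space. The sketch below
is not the kernel code: struct seg, pull_data_model(), the 'reuse_offset'
switch and the 'data_off' output are hypothetical names used only for
illustration. It mirrors the two steps described above (locate the first
element, then copy the elements into one linear buffer) and shows how the
final "start - offset" goes wrong when the copy loop reuses 'offset'.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one scatterlist element. */
struct seg {
	const char *buf;
	unsigned int length;
};

/*
 * Model of the two phases. On success returns 0 and stores in *data_off
 * the position of byte 'start' relative to the linear buffer 'lin'.
 * Returns -1 if 'start' is out of range or 'lin' is too small.
 *
 * reuse_offset != 0 reproduces the bug: the copy loop resets and advances
 * 'offset', so "start - offset" no longer refers to the first element.
 */
static int pull_data_model(const struct seg *sg, int nsegs, unsigned int start,
			   char *lin, size_t lin_size, int reuse_offset,
			   int *data_off)
{
	unsigned int offset = 0, poffset = 0;
	unsigned int *cursor = reuse_offset ? &offset : &poffset;
	int first_sg, i;

	/* Phase 1: find the element that contains 'start'. */
	for (i = 0; i < nsegs; i++) {
		if (start < offset + sg[i].length)
			break;
		offset += sg[i].length;
	}
	if (i == nsegs)
		return -1;
	first_sg = i;

	if (reuse_offset)
		offset = 0;	/* buggy variant: phase-1 result is lost here */

	/* Phase 2: linearize every element from first_sg onwards. */
	for (i = first_sg; i < nsegs; i++) {
		if (*cursor + sg[i].length > lin_size)
			return -1;
		memcpy(lin + *cursor, sg[i].buf, sg[i].length);
		*cursor += sg[i].length;
	}

	*data_off = (int)start - (int)offset;
	return 0;
}
```

With three 4-byte elements and start = 5, the variant with a separate copy
cursor yields position 1 in the linear buffer, while the variant that reuses
'offset' yields a negative position.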

Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Tushar Dave 
---
 net/core/filter.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 2c7801f..aecdeba 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2292,7 +2292,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 BPF_CALL_4(bpf_msg_pull_data,
   struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
 {
-   unsigned int len = 0, offset = 0, copy = 0;
+   unsigned int len = 0, offset = 0, copy = 0, poffset = 0;
int bytes = end - start, bytes_sg_total;
struct scatterlist *sg = msg->sg_data;
int first_sg, last_sg, i, shift;
@@ -2348,16 +2348,15 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
*msg)
if (unlikely(!page))
return -ENOMEM;
p = page_address(page);
-   offset = 0;
 
i = first_sg;
do {
from = sg_virt(&sg[i]);
len = sg[i].length;
-   to = p + offset;
+   to = p + poffset;
 
memcpy(to, from, len);
-   offset += len;
+   poffset += len;
sg[i].length = 0;
put_page(sg_page(&sg[i]));
 
-- 
1.8.3.1

Re: [PATCH net] ebpf: fix bpf_msg_pull_data

2018-08-31 Thread Tushar Dave





On 08/31/2018 05:15 AM, Daniel Borkmann wrote:

On 08/31/2018 10:37 AM, Tushar Dave wrote:

On 08/30/2018 12:20 AM, Daniel Borkmann wrote:

On 08/30/2018 02:21 AM, Tushar Dave wrote:

On 08/29/2018 05:07 PM, Tushar Dave wrote:

While doing some preliminary testing it is found that bpf helper
bpf_msg_pull_data does not calculate the data and data_end offset
correctly. Fix it!

Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Tushar Dave 
Acked-by: Sowmini Varadhan 
---
    net/core/filter.c | 38 +-
    1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index c25eb36..3eeb3d6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2285,7 +2285,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
    BPF_CALL_4(bpf_msg_pull_data,
   struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
    {
-    unsigned int len = 0, offset = 0, copy = 0;
+    unsigned int len = 0, offset = 0, copy = 0, off = 0;
    struct scatterlist *sg = msg->sg_data;
    int first_sg, last_sg, i, shift;
    unsigned char *p, *to, *from;
@@ -2299,22 +2299,30 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
*msg)
    i = msg->sg_start;
    do {
    len = sg[i].length;
-    offset += len;
    if (start < offset + len)
    break;
+    offset += len;
    i++;
    if (i == MAX_SKB_FRAGS)
    i = 0;
-    } while (i != msg->sg_end);
+    } while (i <= msg->sg_end);


I don't think this condition is correct, msg operates as a scatterlist ring,
so sg_end may very well be < current i when there's a wrap-around in the
traversal ... more below.
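
The wrap-around concern can be illustrated with a small user-space model.
This is only a sketch: RING_SIZE stands in for MAX_SKB_FRAGS (the value 8 is
arbitrary), and visited_with_ne() / visited_with_le() are hypothetical
helpers that count how many slots each loop condition visits.

```c
#define RING_SIZE 8	/* stand-in for MAX_SKB_FRAGS; value is illustrative */

/* Walk from 'start' until 'end' is reached, wrapping at RING_SIZE. */
static int visited_with_ne(int start, int end)
{
	int i = start, n = 0;

	do {
		n++;
		i++;
		if (i == RING_SIZE)
			i = 0;
	} while (i != end);

	return n;
}

/* Same walk, but with a "<=" loop condition. */
static int visited_with_le(int start, int end)
{
	int i = start, n = 0;

	do {
		n++;
		i++;
		if (i == RING_SIZE)
			i = 0;
	/* the second test only keeps this model from looping forever */
	} while (i <= end && n < 2 * RING_SIZE);

	return n;
}
```

For start = 6 and end = 2 the "!=" walk visits slots 6, 7, 0 and 1, whereas
the "<=" walk stops after slot 6 because 7 <= 2 is already false.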


I'm wondering then how this is supposed to work in case the sg list is not
a ring! For RDS, we have an sg list that is not a ring. More below.




    +    /* return error if start is out of range */
    if (unlikely(start >= offset + len))
    return -EINVAL;
    -    if (!msg->sg_copy[i] && bytes <= len)
-    goto out;
+    /* return error if i is last entry in sglist and end is out of range */
+    if (msg->sg_copy[i] && end > offset + len)
+    return -EINVAL;
      first_sg = i;
    +    /* if i is not last entry in sg list and end (i.e start + bytes) is
+ * within this sg[i] then goto out and calculate data and data_end
+ */
+    if (!msg->sg_copy[i] && end <= offset + len)
+    goto out;
+
    /* At this point we need to linearize multiple scatterlist
     * elements or a single shared page. Either way we need to
     * copy into a linear buffer exclusively owned by BPF. Then
@@ -2330,9 +2338,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
    i++;
    if (i == MAX_SKB_FRAGS)
    i = 0;
-    if (bytes < copy)
+    if (end < copy)
    break;
-    } while (i != msg->sg_end);
+    } while (i <= msg->sg_end);
+
+    /* return error if i is last entry in sglist and end is out of range */
+    if (i > msg->sg_end && end > offset + copy)
+    return -EINVAL;
+
    last_sg = i;
      if (unlikely(copy < end - start))
@@ -2342,23 +2355,22 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
*msg)
    if (unlikely(!page))
    return -ENOMEM;
    p = page_address(page);
-    offset = 0;
      i = first_sg;
    do {
    from = sg_virt(&sg[i]);
    len = sg[i].length;
-    to = p + offset;
+    to = p + off;
      memcpy(to, from, len);
-    offset += len;
+    off += len;
    sg[i].length = 0;
    put_page(sg_page(&sg[i]));
      i++;
    if (i == MAX_SKB_FRAGS)
    i = 0;
-    } while (i != last_sg);
+    } while (i < last_sg);
      sg[first_sg].length = copy;
    sg_set_page(&sg[first_sg], page, copy, 0);
@@ -2380,7 +2392,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
    else
    move_from = i + shift;
    -    if (move_from == msg->sg_end)
+    if (move_from > msg->sg_end)
    break;
      sg[i] = sg[move_from];
@@ -2396,7 +2408,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
    if (msg->sg_end < 0)
    msg->sg_end += MAX_SKB_FRAGS;
    out:
-    msg->data = sg_virt(&sg[i]) + start - offset;
+    msg->data = sg_virt(&sg[first_sg]) + start - offset;
    msg->data_end = msg->data + bytes;
      return 0;



Please discard this patch. I just noticed that Daniel Borkmann sent some 
similar fixes for bpf_msg_pull_data.


Yeah I've been looking at these recently as well, please make sure you test
with the below fixes included to see if there's anything left:


I tested the latest net tree which has all the fixes you mentioned and I
am still seeing issues.

As I already mentioned before on RFC v3 thread, we need to be careful
reusing 'offset' while

[RFC PATCH bpf-next v2 1/4] bpf: add bpf queue and stack maps

2018-08-31 Thread Mauricio Vasquez B

Implement two new kinds of maps that support the peek, push and pop
operations.

A use case for this is to keep track of a pool of elements, like
network ports in a SNAT.

Signed-off-by: Mauricio Vasquez B 
---
 include/linux/bpf.h   |8 +
 include/linux/bpf_types.h |2 
 include/uapi/linux/bpf.h  |   36 
 kernel/bpf/Makefile   |2 
 kernel/bpf/helpers.c  |   44 +
 kernel/bpf/queue_stack_maps.c |  353 +
 kernel/bpf/syscall.c  |   96 +++
 kernel/bpf/verifier.c |6 +
 net/core/filter.c |6 +
 9 files changed, 543 insertions(+), 10 deletions(-)
 create mode 100644 kernel/bpf/queue_stack_maps.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 523481a3471b..1d39b9096d9f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -39,6 +39,11 @@ struct bpf_map_ops {
void *(*map_lookup_elem)(struct bpf_map *map, void *key);
int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 
flags);
int (*map_delete_elem)(struct bpf_map *map, void *key);
+   void *(*map_lookup_and_delete_elem)(struct bpf_map *map, void *key);
+
+   /* funcs callable from eBPF programs */
+   void *(*map_lookup_or_init_elem)(struct bpf_map *map, void *key,
+void *value);
 
/* funcs called by prog_array and perf_event_array map */
void *(*map_fd_get_ptr)(struct bpf_map *map, struct file *map_file,
@@ -806,6 +811,9 @@ static inline int bpf_fd_reuseport_array_update_elem(struct 
bpf_map *map,
 extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
 extern const struct bpf_func_proto bpf_map_update_elem_proto;
 extern const struct bpf_func_proto bpf_map_delete_elem_proto;
+extern const struct bpf_func_proto bpf_map_push_elem_proto;
+extern const struct bpf_func_proto bpf_map_pop_elem_proto;
+extern const struct bpf_func_proto bpf_map_peek_elem_proto;
 
 extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
 extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c090e7c0..8d955f11f1cd 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -67,3 +67,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
 #endif
 #endif
+BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, queue_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4eba27..0a5b904ba42f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -103,6 +103,7 @@ enum bpf_cmd {
BPF_BTF_LOAD,
BPF_BTF_GET_FD_BY_ID,
BPF_TASK_FD_QUERY,
+   BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 };
 
 enum bpf_map_type {
@@ -127,6 +128,8 @@ enum bpf_map_type {
BPF_MAP_TYPE_SOCKHASH,
BPF_MAP_TYPE_CGROUP_STORAGE,
BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
+   BPF_MAP_TYPE_QUEUE,
+   BPF_MAP_TYPE_STACK,
 };
 
 enum bpf_prog_type {
@@ -459,6 +462,28 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
+ * int bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
+ * Description
+ * Push an element *value* in *map*. *flags* is one of:
+ *
+ * **BPF_EXIST**
+ * If the queue/stack is full, the oldest element is removed to
+ * make room for this.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * void *bpf_map_pop_elem(struct bpf_map *map)
+ * Description
+ * Pop an element from *map*.
+ * Return
+ * Pointer to the element or *NULL* if there is not any.
+ *
+ * void *bpf_map_peek_elem(struct bpf_map *map)
+ * Description
+ * Return an element from *map* without removing it.
+ * Return
+ * Pointer to the element or *NULL* if there is not any.
+ *
  * int bpf_probe_read(void *dst, u32 size, const void *src)
  * Description
  * For tracing programs, safely attempt to read *size* bytes from
@@ -786,14 +811,14 @@ union bpf_attr {
  *
  * int ret;
  * struct bpf_tunnel_key key = {};
- * 
+ *
  * ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
  * if (ret < 0)
  * return TC_ACT_SHOT; // drop packet
- * 
+ *
  * if (key.remote_ipv4 != 0x0a01)
  * return TC_ACT_SHOT; // drop packet
- * 
+ *
  * return TC_ACT_OK;   // accept packet
  *
  * This interface can also be used with all encapsulation devices
@@ -2226,7 +2251,10 @@ union bpf_attr {

[RFC PATCH bpf-next v2 2/4] bpf: restrict use of peek/push/pop

2018-08-31 Thread Mauricio Vasquez B

Restrict the use of peek, push and pop helpers only to queue and stack
maps.

Signed-off-by: Mauricio Vasquez B 
---
 kernel/bpf/verifier.c |   14 ++
 1 file changed, 14 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5bd67feb2f07..9e177ff4a3b9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2172,6 +2172,13 @@ static int check_map_func_compatibility(struct 
bpf_verifier_env *env,
if (func_id != BPF_FUNC_sk_select_reuseport)
goto error;
break;
+   case BPF_MAP_TYPE_QUEUE:
+   case BPF_MAP_TYPE_STACK:
+   if (func_id != BPF_FUNC_map_peek_elem &&
+   func_id != BPF_FUNC_map_pop_elem &&
+   func_id != BPF_FUNC_map_push_elem)
+   goto error;
+   break;
default:
break;
}
@@ -2227,6 +2234,13 @@ static int check_map_func_compatibility(struct 
bpf_verifier_env *env,
if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY)
goto error;
break;
+   case BPF_FUNC_map_peek_elem:
+   case BPF_FUNC_map_pop_elem:
+   case BPF_FUNC_map_push_elem:
+   if (map->map_type != BPF_MAP_TYPE_QUEUE &&
+   map->map_type != BPF_MAP_TYPE_STACK)
+   goto error;
+   break;
default:
break;
}
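
The restriction is enforced in both directions: a queue/stack map accepts
only the three new helpers, and the three new helpers accept only a
queue/stack map. A minimal user-space sketch of that rule follows; the enums
and function names are illustrative stand-ins, not the definitions from
include/uapi/linux/bpf.h or the verifier.

```c
#include <stdbool.h>

/* Illustrative subsets of the real map types and helper IDs. */
enum map_type_model { MAP_HASH, MAP_QUEUE, MAP_STACK };
enum func_id_model { FUNC_LOOKUP, FUNC_PEEK, FUNC_POP, FUNC_PUSH };

static bool is_queue_stack_map(enum map_type_model m)
{
	return m == MAP_QUEUE || m == MAP_STACK;
}

static bool is_queue_stack_func(enum func_id_model f)
{
	return f == FUNC_PEEK || f == FUNC_POP || f == FUNC_PUSH;
}

/* Returns true when helper 'f' may be used with a map of type 'm'. */
static bool map_func_compatible(enum map_type_model m, enum func_id_model f)
{
	/* map -> helper: queue/stack maps only take peek/pop/push */
	if (is_queue_stack_map(m) && !is_queue_stack_func(f))
		return false;

	/* helper -> map: peek/pop/push only work on queue/stack maps */
	if (is_queue_stack_func(f) && !is_queue_stack_map(m))
		return false;

	return true;
}
```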

[RFC PATCH bpf-next v2 0/4] Implement bpf queue/stack maps

2018-08-31 Thread Mauricio Vasquez B

Some applications need a pool of free elements, for example the list of
free L4 ports in a SNAT.  None of the current maps allows this, as it is not
possible to get an arbitrary element without knowing the key it is
associated with.

This patchset implements two new kinds of eBPF maps: queue and stack.
Those maps provide the peek, push and pop operations to eBPF programs, and a
new bpf_map_lookup_and_delete_elem() is added for userspace applications.

Signed-off-by: Mauricio Vasquez B 

---

I am sending this as an RFC because there is still an issue I am not sure how
to solve.

The queue/stack maps have a linked list for saving the nodes, and a
preallocation scheme based on the pcpu_freelist already implemented and used
in the htabmap.  Each time an element is pushed into the map, a node from the
pcpu_freelist is taken and then added to the linked list.

The pop operation takes and *removes* the first node from the linked list, then
it uses *call_rcu* to postpone freeing the node, i.e. the node is only returned
to the pcpu_freelist when the rcu callback is executed.  This is needed because
an element returned by the pop() operation should remain valid for the whole
duration of the eBPF program.

The problem is that elements are not immediately returned to the free list, so
in some cases the push operation could fail because there are no free nodes
in the pcpu_freelist.

The following code snippet exposes that problem.

...
/* Push MAP_SIZE elements */
for (i = 0; i < MAP_SIZE; i++)
assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);

/* Pop all elements */
for (i = 0; i < MAP_SIZE; i++)
assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
   val == vals[i]);

  // sleep(1) <-- If I put this sleep, everything works.
/* Push MAP_SIZE elements */
for (i = 0; i < MAP_SIZE; i++)
assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
   ^^^
   This fails because there are no available elements in pcpu_freelist
...
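
The same behaviour can be reproduced with a toy model that only counts
nodes. This is a sketch under simplifying assumptions: struct pool_model and
the pool_*() functions are hypothetical, a single counter replaces the
per-cpu free list, and pool_grace_period() stands in for the moment the
deferred free callback actually runs.

```c
#include <stdbool.h>

#define POOL_SIZE 4	/* illustrative; plays the role of the map size */

/* Counters instead of real lists. */
struct pool_model {
	int free_nodes;	/* nodes available in the free list */
	int in_map;	/* nodes currently linked into the queue/stack */
	int pending;	/* popped nodes waiting for the deferred free */
};

static void pool_init(struct pool_model *p)
{
	p->free_nodes = POOL_SIZE;
	p->in_map = 0;
	p->pending = 0;
}

/* push: needs a node from the free list */
static bool pool_push(struct pool_model *p)
{
	if (p->free_nodes == 0)
		return false;
	p->free_nodes--;
	p->in_map++;
	return true;
}

/* pop: unlinks a node but does not return it to the free list yet */
static bool pool_pop(struct pool_model *p)
{
	if (p->in_map == 0)
		return false;
	p->in_map--;
	p->pending++;
	return true;
}

/* Deferred free: popped nodes become available again only now. */
static void pool_grace_period(struct pool_model *p)
{
	p->free_nodes += p->pending;
	p->pending = 0;
}
```

Filling the pool, popping everything and pushing again fails in this model
until pool_grace_period() has been called, which corresponds to the sleep(1)
in the snippet above.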

I think a possible solution is to oversize the pcpu_freelist (no idea by how
much, maybe double it, or make it 1.5 times the max elements in the map?).
I also have concerns about that: it would waste memory in many cases, and
it probably would not solve the issue either, because that code snippet
is pushing and popping elements so fast that, even if the pcpu_freelist is
much larger, at a certain instant all the elements could be in use.

Is this really an important issue?
Any idea of how to solve it?

Thanks,
---

Mauricio Vasquez B (4):
  bpf: add bpf queue and stack maps
  bpf: restrict use of peek/push/pop
  Sync uapi/bpf.h to tools/include
  selftests/bpf: add test cases for queue and stack maps


 include/linux/bpf.h|8 
 include/linux/bpf_types.h  |2 
 include/uapi/linux/bpf.h   |   36 ++
 kernel/bpf/Makefile|2 
 kernel/bpf/helpers.c   |   44 ++
 kernel/bpf/queue_stack_maps.c  |  353 
 kernel/bpf/syscall.c   |   96 +
 kernel/bpf/verifier.c  |   20 +
 net/core/filter.c  |6 
 tools/include/uapi/linux/bpf.h |   36 ++
 tools/lib/bpf/bpf.c|   12 +
 tools/lib/bpf/bpf.h|1 
 tools/testing/selftests/bpf/Makefile   |2 
 tools/testing/selftests/bpf/bpf_helpers.h  |7 
 tools/testing/selftests/bpf/test_maps.c|  101 ++
 tools/testing/selftests/bpf/test_progs.c   |   99 ++
 tools/testing/selftests/bpf/test_queue_map.c   |4 
 tools/testing/selftests/bpf/test_queue_stack_map.h |   59 +++
 tools/testing/selftests/bpf/test_stack_map.c   |4 
 19 files changed, 877 insertions(+), 15 deletions(-)
 create mode 100644 kernel/bpf/queue_stack_maps.c
 create mode 100644 tools/testing/selftests/bpf/test_queue_map.c
 create mode 100644 tools/testing/selftests/bpf/test_queue_stack_map.h
 create mode 100644 tools/testing/selftests/bpf/test_stack_map.c

--

Re: [net-next v2 00/15][pull request] 40GbE Intel Wired LAN Driver Updates 2018-08-30

2018-08-31 Thread David Miller

From: Jeff Kirsher 
Date: Thu, 30 Aug 2018 14:11:32 -0700

> This series contains updates to i40e, i40evf and virtchnl.

Pulled, thanks Jeff.

Re: [PATCH v2 1/2] IB/ipoib: Use dev_port to expose network interface port numbers

2018-08-31 Thread Doug Ledford

On Fri, 2018-08-31 at 11:57 +0300, Arseny Maslennikov wrote:
> On Thu, Aug 30, 2018 at 04:17:58PM -0400, Doug Ledford wrote:
> > On Thu, 2018-08-30 at 21:22 +0300, Arseny Maslennikov wrote:
> > > Some InfiniBand network devices have multiple ports on the same PCI
> > > function. This initializes the `dev_port' sysfs field of those
> > > network interfaces with their port number.
> > > 
> > > Prior to this the kernel erroneously used the `dev_id' sysfs
> > > field of those network interfaces to convey the port number to userspace.
> > > 
> > > The use of `dev_id' was considered correct until Linux 3.15,
> > > when another field, `dev_port', was defined for this particular
> > > purpose and `dev_id' was reserved for distinguishing stacked ifaces
> > > (e.g: VLANs) with the same hardware address as their parent device.
> > > 
> > > Similar fixes to net/mlx4_en and many other drivers, which started
> > > exporting this information through `dev_id' before 3.15, were accepted
> > > into the kernel 4 years ago.
> > > See 76a066f2a2a0 (`net/mlx4_en: Expose port number through sysfs').
> > > 
> > > Signed-off-by: Arseny Maslennikov 
> > > ---
> > >  drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
> > > b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > index e3d28f9ad9c0..ba16a63ee303 100644
> > > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > @@ -1880,7 +1880,7 @@ static int ipoib_parent_init(struct net_device 
> > > *ndev)
> > >  sizeof(union ib_gid));
> > >  
> > >   SET_NETDEV_DEV(priv->dev, priv->ca->dev.parent);
> > > - priv->dev->dev_id = priv->port - 1;
> > > + priv->dev->dev_port = priv->port - 1;
> > 
> > I don't know that we can't do this.  At least not yet.  Expose the new
> > item to make us compliant with the new docs, and deprecate the old sysfs
> > item, but we can't just yank the old item.  Existing tools/scripts might
> > (probably) rely on it (existing tools already special case IPoIB
> > interfaces and we'll need to make sure they don't special case this
> > element too).
> 
> I'm good with keeping both items for a (probably long) while to not break
> things. But how exactly should we notify users of the deprecation, so they
> don't special case this again? A comment in the code seems too little —
> everyone's obviously too busy to look there and stumble upon that.
> A distinct notice in the doc seems too much. I can't think of another place
> for the deprecation notice where people would take note of it, however.
> 
> Anyway: would it be OK to just restore both items and put a small note in
> dev_id's doc entry? If yes, I'll then send a v3.

A warn_on_once in the code so that when someone reads the dev_id entry,
we get a deprecation warning in the dmesg output at info level would be
my suggestion.  Have it output the command name as part of the warning
so we know what tools are using it.
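
As a rough user-space illustration of the idea (not the sysfs code, and all
names here are hypothetical): the read handler warns the first time the
deprecated attribute is read and includes the name of the reading command.

```c
#include <stdbool.h>
#include <stdio.h>

static bool dev_id_warned;	/* set after the first warning */
static int warnings_emitted;	/* only here so the behaviour can be checked */

/*
 * Hypothetical read handler for the deprecated attribute. 'comm' is the
 * name of the reading process, 'log' stands in for the kernel log.
 */
static int dev_id_show_model(const char *comm, int dev_id, FILE *log)
{
	if (!dev_id_warned) {
		dev_id_warned = true;
		warnings_emitted++;
		fprintf(log,
			"%s: dev_id is deprecated for port numbers, use dev_port\n",
			comm);
	}
	return dev_id;
}
```

Note that a warn-once scheme like this records only the first command that
reads the attribute; later readers are not logged.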

-- 
Doug Ledford 
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


Re: [PATCH v2 net-next] liquidio: remove set but not used variable 'irh'

2018-08-31 Thread Felix Manlunas

On Fri, Aug 31, 2018 at 12:03:56PM +, YueHaibing wrote:
> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> drivers/net/ethernet/cavium/liquidio/request_manager.c: In function 
> 'lio_process_iq_request_list':
> drivers/net/ethernet/cavium/liquidio/request_manager.c:383:27: warning:
>  variable 'irh' set but not used [-Wunused-but-set-variable]
> 
> Signed-off-by: YueHaibing 
> ---
> v2: fix patch description,remove 'cHECK-'
> ---
>  drivers/net/ethernet/cavium/liquidio/request_manager.c | 9 -
>  1 file changed, 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cavium/liquidio/request_manager.c 
> b/drivers/net/ethernet/cavium/liquidio/request_manager.c
> index bd0153e..c6f4cbd 100644
> --- a/drivers/net/ethernet/cavium/liquidio/request_manager.c
> +++ b/drivers/net/ethernet/cavium/liquidio/request_manager.c
> @@ -380,7 +380,6 @@ static inline void __copy_cmd_into_iq(struct 
> octeon_instr_queue *iq,
> u32 inst_count = 0;
> unsigned int pkts_compl = 0, bytes_compl = 0;
> struct octeon_soft_command *sc;
> -   struct octeon_instr_irh *irh;
> unsigned long flags;
> 
> while (old != iq->octeon_read_index) {
> @@ -402,14 +401,6 @@ static inline void __copy_cmd_into_iq(struct 
> octeon_instr_queue *iq,
> case REQTYPE_RESP_NET:
> case REQTYPE_SOFT_COMMAND:
> sc = buf;
> -
> -   if (OCTEON_CN23XX_PF(oct) || OCTEON_CN23XX_VF(oct))
> -   irh = (struct octeon_instr_irh *)
> -   &sc->cmd.cmd3.irh;
> -   else
> -   irh = (struct octeon_instr_irh *)
> -   >cmd.cmd2.irh;
> -
> /* We're expecting a response from Octeon.
>  * It's up to lio_process_ordered_list() to
>  * process  sc. Add sc to the ordered soft
> 

Acked-by: Felix Manlunas

Re: phys_port_id in switchdev mode?

2018-08-31 Thread Marcelo Ricardo Leitner

On Tue, Aug 28, 2018 at 08:43:51PM +0200, Jakub Kicinski wrote:
> Ugh, CC: netdev..
> 
> On Tue, 28 Aug 2018 20:05:39 +0200, Jakub Kicinski wrote:
> > Hi!
> > 
> > I wonder if we can use phys_port_id in switchdev to group together
> > interfaces of a single PCI PF?  Here is the problem:

On Mellanox cards, this is already possible via phys_switch_id, as
each PF has its own phys_switch_id. So all VFs with a given
phys_switch_id belong to the PF with that same phys_switch_id.

I understand this is a vendor-specific design, but if you have the
same phys_switch_id across PFs, does it really matter on which PF the
VF was created on?

  Marcelo

[PATCH] cxgb4: fix abort_req_rss6 struct

2018-08-31 Thread Steve Wise

Remove the incorrect WR_HDR field which can cause a misinterpretation
of this CPL by ULDs.

Fixes: a3cdaa69e4ae ("cxgb4: Adds CPL support for Shared Receive Queues")
Signed-off-by: Steve Wise 
---

Dave, Doug, and Jason,

I request this merge through the rdma repo since the only user of this
structure is iw_cxgb4.

---
 drivers/net/ethernet/chelsio/cxgb4/t4_msg.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h 
b/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
index b8f75a22fb6c..f152da1ce046 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
@@ -753,7 +753,6 @@ struct cpl_abort_req_rss {
 };
 
 struct cpl_abort_req_rss6 {
-   WR_HDR;
union opcode_tid ot;
__be32 srqidx_status;
 };
-- 
1.8.3.1

Re: [bpf-next PATCH 3/3] xdp: split code for map vs non-map redirect

2018-08-31 Thread Daniel Borkmann

On 08/31/2018 05:26 PM, Jesper Dangaard Brouer wrote:
> The compiler does an efficient job of inlining static C functions.
> Perf top clearly shows that almost everything gets inlined into the
> function call xdp_do_redirect.
> 
> The function xdp_do_redirect ends up containing and interleaving the
> map and non-map redirect code.  This is sub-optimal, as it would be
> strange for an XDP program to use both types of redirect in the same
> program. The two use-cases are separate, and interleaving the code
> just causes more instruction-cache pressure.
> 
> I would like to stress (again) that the non-map variant bpf_redirect
> is very slow compared to the bpf_redirect_map variant, approx half the
> speed.  Measured with driver i40e the difference is:
> 
> - map redirect: 13,250,350 pps
> - non-map redirect:  7,491,425 pps
> 
> For this reason, the function name of the non-map variant of redirect
> has been called xdp_do_redirect_slow.  This hopefully gives a hint
> when using perf, that this is not the optimal XDP redirect operating mode.
> 
> Signed-off-by: Jesper Dangaard Brouer 
> ---
>  net/core/filter.c |   52 ++--
>  1 file changed, 30 insertions(+), 22 deletions(-)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index ec1b4eb0d3d4..c4ad1b93167f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3170,6 +3170,32 @@ static int __bpf_tx_xdp(struct net_device *dev,
>   return 0;
>  }
>  
> +/* non-static to avoid inline by compiler */
> +int xdp_do_redirect_slow(struct net_device *dev, struct xdp_buff *xdp,

Nit: should be 'static noinline' in that case then.

> + struct bpf_prog *xdp_prog, struct bpf_redirect_info *ri)
> +{
> + struct net_device *fwd;
> + u32 index = ri->ifindex;
> + int err;
> +
> + fwd = dev_get_by_index_rcu(dev_net(dev), index);
> + ri->ifindex = 0;
> + if (unlikely(!fwd)) {
> + err = -EINVAL;
> + goto err;
> + }
> +
> + err = __bpf_tx_xdp(fwd, NULL, xdp, 0);
> + if (unlikely(err))
> + goto err;
> +
> + _trace_xdp_redirect(dev, xdp_prog, index);
> + return 0;
> +err:
> + _trace_xdp_redirect_err(dev, xdp_prog, index, err);
> + return err;
> +}
> +
>  static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
>   struct bpf_map *map,
>   struct xdp_buff *xdp,
> @@ -3264,9 +3290,9 @@ void bpf_clear_redirect_map(struct bpf_map *map)
>  }
>  
>  static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
> -struct bpf_prog *xdp_prog, struct bpf_map *map)
> +struct bpf_prog *xdp_prog, struct bpf_map *map,
> +struct bpf_redirect_info *ri)
>  {
> - struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>   u32 index = ri->ifindex;
>   void *fwd = NULL;
>   int err;
> @@ -3299,29 +3325,11 @@ int xdp_do_redirect(struct net_device *dev, struct 
> xdp_buff *xdp,
>  {
>   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>   struct bpf_map *map = READ_ONCE(ri->map);
> - struct net_device *fwd;
> - u32 index = ri->ifindex;
> - int err;
>  
>   if (likely(map))
> - return xdp_do_redirect_map(dev, xdp, xdp_prog, map);
> + return xdp_do_redirect_map(dev, xdp, xdp_prog, map, ri);
>  
> - fwd = dev_get_by_index_rcu(dev_net(dev), index);
> - ri->ifindex = 0;
> - if (unlikely(!fwd)) {
> - err = -EINVAL;
> - goto err;
> - }
> -
> - err = __bpf_tx_xdp(fwd, NULL, xdp, 0);
> - if (unlikely(err))
> - goto err;
> -
> - _trace_xdp_redirect(dev, xdp_prog, index);
> - return 0;
> -err:
> - _trace_xdp_redirect_err(dev, xdp_prog, index, err);
> - return err;
> + return xdp_do_redirect_slow(dev, xdp, xdp_prog, ri);
>  }
>  EXPORT_SYMBOL_GPL(xdp_do_redirect);
>  
>
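
For reference, the 'static noinline' pattern from the nit above looks
roughly like this outside the kernel. It is only a sketch: the kernel
provides its own 'noinline' macro, which is spelled out here with the
GCC/Clang attribute, and the *_model functions are hypothetical
placeholders for the two redirect paths.

```c
#include <errno.h>

/* Assumption: GCC or Clang; the kernel defines 'noinline' itself. */
#define noinline __attribute__((noinline))

/* Slow path: kept out of line so it does not bloat the caller. */
static noinline int redirect_slow_model(int ifindex)
{
	if (ifindex <= 0)
		return -EINVAL;
	return 0;
}

/* Fast path: left to the compiler's normal inlining decisions. */
static int redirect_map_model(int index)
{
	return index >= 0 ? 0 : -EINVAL;
}

static int do_redirect_model(int has_map, int index)
{
	if (has_map)
		return redirect_map_model(index);

	return redirect_slow_model(index);
}
```

Keeping the function static retains internal linkage, while the attribute
asks the compiler not to merge the slow path into the caller.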

Re: [PATCH net-next] veth: report NEWLINK event when moving the peer device in a new namespace

2018-08-31 Thread Lorenzo Bianconi

> On 8/31/18 10:19 AM, Lorenzo Bianconi wrote:
> >> On 8/31/18 5:43 AM, Lorenzo Bianconi wrote:
> >>> When moving a veth device to another namespace, userspace receives a
> >>> RTM_DELLINK message indicating the device has been removed from current
> >>> netns. However, the other peer does not receive a netlink event
> >>> containing new values for IFLA_LINK_NETNSID and IFLA_LINK veth
> >>> attributes.
> >>> Fix that behaviour sending to userspace a RTM_NEWLINK message in the peer
> >>> namespace to report new IFLA_LINK_NETNSID/IFLA_LINK values
> >>>
> >>
> >> A newlink message is generated in the new namespace. What information is
> >> missing from that message?
> >>
> > 
> > Hi David,
> > 
> > let's assume we have two veth paired devices (veth0 and veth1) on inet
> > namespace. When moving a veth1 to another namespace, userspace is notified
> > with RTM_DELLINK event on inet namespace to indicate that veth1 has been
> > moved to another namespace. However some userspace applications
> > (e.g. NetworkManager), listening for events on inet namespace, are 
> > interested
> > in veth1 ifindex in the new namespace. This patch sends a new RTM_NEWLINK 
> > event
> > in inet namespace to provide new values for IFLA_LINK_NETNSID/IFLA_LINK 
> 
> This is in init namespace
> $ ip li set veth2 netns foo
> 
> $ ip monitor
> Deleted 20: veth2@veth1:  mtu 1500 qdisc
> noop state DOWN group default
> link/ether c6:d0:d6:c5:23:7d brd ff:ff:ff:ff:ff:ff new-netns foo
> new-ifindex 20
> 
> It shows the new namespace in the delete message.

Oops, I had not noticed that this information was already added by
commit 38e01b30563a ("dev: advertise the new ifindex when the netns
iface changes"). Thanks for the hint.

DaveM please drop this patch.

Regards,
Lorenzo

[PATCH bpf-next] adding selftest for bpf's (set|get)_sockopt for SAVE_SYN

2018-08-31 Thread Nikita V. Shirokov

Add a selftest for the feature introduced in
commit 9452048c79404 ("bpf: add TCP_SAVE_SYN/TCP_SAVED_SYN options for
bpf_(set|get)sockopt")

Signed-off-by: Nikita V. Shirokov 
---
 .../testing/selftests/bpf/test_tcpbpf_kern.c  | 38 +--
 .../testing/selftests/bpf/test_tcpbpf_user.c  | 31 ++-
 2 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_tcpbpf_kern.c 
b/tools/testing/selftests/bpf/test_tcpbpf_kern.c
index 4b7fd540cea9..74f73b33a7b0 100644
--- a/tools/testing/selftests/bpf/test_tcpbpf_kern.c
+++ b/tools/testing/selftests/bpf/test_tcpbpf_kern.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -17,6 +18,13 @@ struct bpf_map_def SEC("maps") global_map = {
.type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(__u32),
.value_size = sizeof(struct tcpbpf_globals),
+   .max_entries = 4,
+};
+
+struct bpf_map_def SEC("maps") sockopt_results = {
+   .type = BPF_MAP_TYPE_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = sizeof(int),
.max_entries = 2,
 };
 
@@ -45,11 +53,14 @@ int _version SEC("version") = 1;
 SEC("sockops")
 int bpf_testcb(struct bpf_sock_ops *skops)
 {
-   int rv = -1;
-   int bad_call_rv = 0;
+   char header[sizeof(struct ipv6hdr) + sizeof(struct tcphdr)];
+   struct tcphdr *thdr;
int good_call_rv = 0;
-   int op;
+   int bad_call_rv = 0;
+   int save_syn = 1;
+   int rv = -1;
int v = 0;
+   int op;
 
op = (int) skops->op;
 
@@ -82,6 +93,21 @@ int bpf_testcb(struct bpf_sock_ops *skops)
v = 0xff;
rv = bpf_setsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v,
sizeof(v));
+   if (skops->family == AF_INET6) {
+   v = bpf_getsockopt(skops, IPPROTO_TCP, TCP_SAVED_SYN,
+  header, (sizeof(struct ipv6hdr) +
+   sizeof(struct tcphdr)));
+   if (!v) {
+   int offset = sizeof(struct ipv6hdr);
+
+   thdr = (struct tcphdr *)(header + offset);
+   v = thdr->syn;
+   __u32 key = 1;
+
+   bpf_map_update_elem(&sockopt_results, &key, &v,
+   BPF_ANY);
+   }
+   }
break;
case BPF_SOCK_OPS_RTO_CB:
break;
@@ -111,6 +137,12 @@ int bpf_testcb(struct bpf_sock_ops *skops)
break;
case BPF_SOCK_OPS_TCP_LISTEN_CB:
bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_STATE_CB_FLAG);
+   v = bpf_setsockopt(skops, IPPROTO_TCP, TCP_SAVE_SYN,
+  &save_syn, sizeof(save_syn));
+   /* Update global map w/ result of setsock opt */
+   __u32 key = 0;
+
+   bpf_map_update_elem(&sockopt_results, &key, &v, BPF_ANY);
break;
default:
rv = -1;
diff --git a/tools/testing/selftests/bpf/test_tcpbpf_user.c 
b/tools/testing/selftests/bpf/test_tcpbpf_user.c
index a275c2971376..e6eebda7d112 100644
--- a/tools/testing/selftests/bpf/test_tcpbpf_user.c
+++ b/tools/testing/selftests/bpf/test_tcpbpf_user.c
@@ -54,6 +54,26 @@ int verify_result(const struct tcpbpf_globals *result)
return -1;
 }
 
+int verify_sockopt_result(int sock_map_fd)
+{
+   __u32 key = 0;
+   int res;
+   int rv;
+
+   /* check setsockopt for SAVE_SYN */
+   rv = bpf_map_lookup_elem(sock_map_fd, &key, &res);
+   EXPECT_EQ(0, rv, "d");
+   EXPECT_EQ(0, res, "d");
+   key = 1;
+   /* check getsockopt for SAVED_SYN */
+   rv = bpf_map_lookup_elem(sock_map_fd, &key, &res);
+   EXPECT_EQ(0, rv, "d");
+   EXPECT_EQ(1, res, "d");
+   return 0;
+err:
+   return -1;
+}
+
 static int bpf_find_map(const char *test, struct bpf_object *obj,
const char *name)
 {
@@ -70,11 +90,11 @@ static int bpf_find_map(const char *test, struct bpf_object 
*obj,
 int main(int argc, char **argv)
 {
const char *file = "test_tcpbpf_kern.o";
+   int prog_fd, map_fd, sock_map_fd;
struct tcpbpf_globals g = {0};
const char *cg_path = "/foo";
int error = EXIT_FAILURE;
struct bpf_object *obj;
-   int prog_fd, map_fd;
int cg_fd = -1;
__u32 key = 0;
int rv;
@@ -110,6 +130,10 @@ int main(int argc, char **argv)
if (map_fd < 0)
goto err;
 
+   sock_map_fd = bpf_find_map(__func__, obj, "sockopt_results");
+   if (sock_map_fd < 0)
+   goto err;
+
rv = bpf_map_lookup_elem(map_fd, &key, &g);
if (rv != 0) {
printf("FAILED: bpf_map_lookup_elem returns %d\n", rv);
@@ -121,6 +145,11 @@ int main(int argc, char **argv)
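
For readers following the selftest: the BPF program reads the saved SYN via TCP_SAVED_SYN into a buffer sized for an IPv6 header plus a TCP header, then checks the syn bit. Below is a minimal userspace sketch of that byte-level check. It assumes a fixed 40-byte IPv6 header with no extension headers; the helper name and the SKETCH_* constants are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_IPV6_HDR_LEN  40   /* fixed IPv6 header, no extension headers */
#define SKETCH_TCP_MIN_HDR   20   /* TCP header without options */
#define SKETCH_TCP_FLAGS_OFF 13   /* offset of the flags octet in the TCP header */
#define SKETCH_TCP_SYN       0x02 /* SYN bit within the flags octet */

/*
 * Hypothetical helper: return 1 if the TCP header that follows the IPv6
 * header has SYN set, 0 if it does not, and -1 if the buffer is too short.
 */
int saved_syn_has_syn(const uint8_t *buf, size_t len)
{
	assert(buf != NULL);
	if (len < SKETCH_IPV6_HDR_LEN + SKETCH_TCP_MIN_HDR)
		return -1;
	return (buf[SKETCH_IPV6_HDR_LEN + SKETCH_TCP_FLAGS_OFF] &
		SKETCH_TCP_SYN) ? 1 : 0;
}
```

The selftest itself casts the buffer to struct tcphdr and reads thdr->syn; the sketch tests the same bit by offset so it builds without kernel headers.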

Re: [PATCH net-next] failover: remove set but not used variable 'primary_dev'

2018-08-31 Thread Samudrala, Sridhar


On 8/30/2018 8:46 PM, YueHaibing wrote:

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/net/net_failover.c: In function 'net_failover_slave_unregister':
drivers/net/net_failover.c:598:35: warning:
  variable 'primary_dev' set but not used [-Wunused-but-set-variable]


Actually this gcc option found a bug.
We need to add this check after accessing primary_dev and standby_dev.

    if (slave_dev != primary_dev && slave_dev != standby_dev)
    return -ENODEV;

Can you resubmit with the right fix?
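
To make the intent of the suggested check concrete, here is a small standalone sketch. The struct names are hypothetical stand-ins for struct net_device and struct net_failover_info, and the helper is not the driver code; it only shows the rule that an unregister request is rejected unless the device is the registered primary or standby.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device. */
struct dev_sketch {
	int ifindex;
};

/* Hypothetical stand-in for struct net_failover_info. */
struct failover_sketch {
	struct dev_sketch *primary_dev;
	struct dev_sketch *standby_dev;
};

/* Return 0 if slave is the registered primary or standby, else -ENODEV. */
int failover_check_slave(const struct failover_sketch *info,
			 const struct dev_sketch *slave)
{
	assert(info != NULL && slave != NULL);
	if (slave != info->primary_dev && slave != info->standby_dev)
		return -ENODEV;
	return 0;
}
```

With this check in place both pointers are read and used, which is also why the gcc warning goes away without deleting primary_dev.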




Signed-off-by: YueHaibing 
---
  drivers/net/net_failover.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
index 7ae1856..e103c94e 100644
--- a/drivers/net/net_failover.c
+++ b/drivers/net/net_failover.c
@@ -595,12 +595,11 @@ static int net_failover_slave_pre_unregister(struct 
net_device *slave_dev,
  static int net_failover_slave_unregister(struct net_device *slave_dev,
 struct net_device *failover_dev)
  {
-   struct net_device *standby_dev, *primary_dev;
+   struct net_device *standby_dev;
struct net_failover_info *nfo_info;
bool slave_is_standby;
  
  	nfo_info = netdev_priv(failover_dev);

-   primary_dev = rtnl_dereference(nfo_info->primary_dev);
standby_dev = rtnl_dereference(nfo_info->standby_dev);
  
  	vlan_vids_del_by_dev(slave_dev, failover_dev);

Re: [PATCH net-next v2 1/2] netlink: ipv4 igmp join notifications

2018-08-31 Thread Roopa Prabhu

On Fri, Aug 31, 2018 at 4:20 AM, Patrick Ruddy
 wrote:
> Some userspace applications need to know about IGMP joins from the kernel
> for 2 reasons
> 1. To allow the programming of multicast MAC filters in hardware
> 2. To form a multicast FORUS list for non link-local multicast
>groups to be sent to the kernel and from there to the interested
>party.
> (1) can be fulfilled by simply sending the hardware multicast MAC
> address to be programmed but (2) requires the L3 address to be sent
> since this cannot be constructed from the MAC address whereas the
> reverse translation is a standard library function.
>
> This commit provides addition and deletion of multicast addresses
> using the RTM_NEWADDR and RTM_DELADDR messages. It also provides
> the RTM_GETADDR extension to allow multicast join state to be read
> from the kernel.
>
> Signed-off-by: Patrick Ruddy 
> ---
> v2: fix kbuild warnings.

I am still going through the series, but AFAICT, user-space caches listening to
RTNLGRP_IPV4_IFADDR will now also get multicast addresses by default?
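
As background to the quoted commit message, which says the L3 group address cannot be rebuilt from the MAC address while the reverse mapping is standard: the IPv4-to-Ethernet multicast mapping (RFC 1112) keeps only the low 23 bits of the group address, so 32 groups share each MAC. A minimal sketch follows; the function name is hypothetical and the group address is taken in host byte order.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map an IPv4 multicast group (host byte order) to its Ethernet MAC:
 * 01:00:5e followed by the low 23 bits of the group address.
 */
void ipv4_mcast_to_mac(uint32_t group, uint8_t mac[6])
{
	assert((group >> 28) == 0xeu);	/* only 224.0.0.0/4 is multicast */
	mac[0] = 0x01;
	mac[1] = 0x00;
	mac[2] = 0x5e;
	mac[3] = (group >> 16) & 0x7f;	/* top bit of this octet is dropped */
	mac[4] = (group >> 8) & 0xff;
	mac[5] = group & 0xff;
}
```

For example, 224.1.1.1 and 239.129.1.1 both map to 01:00:5e:01:01:01, so a listener that only sees the MAC cannot tell which group was joined.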


>
>  include/linux/igmp.h |  4 ++
>  net/ipv4/devinet.c   | 39 +--
>  net/ipv4/igmp.c  | 90 
>  3 files changed, 122 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/igmp.h b/include/linux/igmp.h
> index 119f53941c12..644a548024ed 100644
> --- a/include/linux/igmp.h
> +++ b/include/linux/igmp.h
> @@ -19,6 +19,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>
>  static inline struct igmphdr *igmp_hdr(const struct sk_buff *skb)
> @@ -130,6 +132,8 @@ extern void ip_mc_unmap(struct in_device *);
>  extern void ip_mc_remap(struct in_device *);
>  extern void ip_mc_dec_group(struct in_device *in_dev, __be32 addr);
>  extern void ip_mc_inc_group(struct in_device *in_dev, __be32 addr);
> +extern int ip_mc_dump_ifaddr(struct sk_buff *skb, struct netlink_callback 
> *cb,
> +struct net_device *dev);
>  int ip_mc_check_igmp(struct sk_buff *skb, struct sk_buff **skb_trimmed);
>
>  #endif
> diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
> index ea4bd8a52422..42f7dcc4fb5e 100644
> --- a/net/ipv4/devinet.c
> +++ b/net/ipv4/devinet.c
> @@ -57,6 +57,7 @@
>  #endif
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -1651,6 +1652,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
> netlink_callback *cb)
> int h, s_h;
> int idx, s_idx;
> int ip_idx, s_ip_idx;
> +   int multicast, mcast_idx;
> struct net_device *dev;
> struct in_device *in_dev;
> struct in_ifaddr *ifa;
> @@ -1659,6 +1661,8 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
> netlink_callback *cb)
> s_h = cb->args[0];
> s_idx = idx = cb->args[1];
> s_ip_idx = ip_idx = cb->args[2];
> +   multicast = cb->args[3];
> +   mcast_idx = cb->args[4];
>
> for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
> idx = 0;
> @@ -1675,18 +1679,29 @@ static int inet_dump_ifaddr(struct sk_buff *skb, 
> struct netlink_callback *cb)
> if (!in_dev)
> goto cont;
>
> -   for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
> -ifa = ifa->ifa_next, ip_idx++) {
> -   if (ip_idx < s_ip_idx)
> -   continue;
> -   if (inet_fill_ifaddr(skb, ifa,
> -NETLINK_CB(cb->skb).portid,
> -cb->nlh->nlmsg_seq,
> -RTM_NEWADDR, NLM_F_MULTI) < 0) {
> -   rcu_read_unlock();
> -   goto done;
> +   if (!multicast) {
> +   for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
> +ifa = ifa->ifa_next, ip_idx++) {
> +   if (ip_idx < s_ip_idx)
> +   continue;
> +   if (inet_fill_ifaddr(skb, ifa,
> +
> NETLINK_CB(cb->skb).portid,
> +
> cb->nlh->nlmsg_seq,
> +RTM_NEWADDR,
> +NLM_F_MULTI) < 
> 0) {
> +   rcu_read_unlock();
> +   goto done;
> +   }
> +   nl_dump_check_consistent(cb,
> +
> nlmsg_hdr(skb));
> }
> -

Re: [PATCH net-next] veth: report NEWLINK event when moving the peer device in a new namespace

2018-08-31 Thread David Ahern

On 8/31/18 10:19 AM, Lorenzo Bianconi wrote:
>> On 8/31/18 5:43 AM, Lorenzo Bianconi wrote:
>>> When moving a veth device to another namespace, userspace receives a
>>> RTM_DELLINK message indicating the device has been removed from current
>>> netns. However, the other peer does not receive a netlink event
>>> containing new values for IFLA_LINK_NETNSID and IFLA_LINK veth
>>> attributes.
>>> Fix that behaviour sending to userspace a RTM_NEWLINK message in the peer
>>> namespace to report new IFLA_LINK_NETNSID/IFLA_LINK values
>>>
>>
>> A newlink message is generated in the new namespace. What information is
>> missing from that message?
>>
> 
> Hi David,
> 
> let's assume we have two veth paired devices (veth0 and veth1) on inet
> namespace. When moving a veth1 to another namespace, userspace is notified
> with RTM_DELLINK event on inet namespace to indicate that veth1 has been
> moved to another namespace. However some userspace applications
> (e.g. NetworkManager), listening for events on inet namespace, are interested
> in veth1 ifindex in the new namespace. This patch sends a new RTM_NEWLINK 
> event
> in inet namespace to provide new values for IFLA_LINK_NETNSID/IFLA_LINK 

This is in init namespace
$ ip li set veth2 netns foo

$ ip monitor
Deleted 20: veth2@veth1:  mtu 1500 qdisc
noop state DOWN group default
link/ether c6:d0:d6:c5:23:7d brd ff:ff:ff:ff:ff:ff new-netns foo
new-ifindex 20

It shows the new namespace in the delete message.

Re: [PATCH net-next] veth: report NEWLINK event when moving the peer device in a new namespace

2018-08-31 Thread Lorenzo Bianconi

> On 8/31/18 5:43 AM, Lorenzo Bianconi wrote:
> > When moving a veth device to another namespace, userspace receives a
> > RTM_DELLINK message indicating the device has been removed from current
> > netns. However, the other peer does not receive a netlink event
> > containing new values for IFLA_LINK_NETNSID and IFLA_LINK veth
> > attributes.
> > Fix that behaviour sending to userspace a RTM_NEWLINK message in the peer
> > namespace to report new IFLA_LINK_NETNSID/IFLA_LINK values
> > 
> 
> A newlink message is generated in the new namespace. What information is
> missing from that message?
> 

Hi David,

let's assume we have two paired veth devices (veth0 and veth1) in the init
namespace. When moving veth1 to another namespace, userspace is notified
with an RTM_DELLINK event in the init namespace to indicate that veth1 has been
moved to another namespace. However, some userspace applications
(e.g. NetworkManager), listening for events in the init namespace, are interested
in the veth1 ifindex in the new namespace. This patch sends a new RTM_NEWLINK event
in the init namespace to provide new values for IFLA_LINK_NETNSID/IFLA_LINK.

Regards,
Lorenzo

Re: [RFC PATCH v2 bpf-next 0/2] verifier liveness simplification

2018-08-31 Thread Edward Cree

On 30/08/18 03:18, Alexei Starovoitov wrote:
> I think it's a better base to continue debugging.
> In particular:
> 1. we have instability issue in the verifier.
>  from time to time the verifier goes to process extra 7 instructions on one
>  of the cilium tests. This was happening before and after this set.
I can't reproduce this; I always get 36926.  Can you try recording the
 verifier log and diff the output between the two cases?
> 2. there is a nice improvement in number of processed insns with this set,
>  but the difference I cannot explain, hence it has to debugged.
>  In theory the liveness rewrite shouldn't cause the difference in processed 
> insns.
I shall attack this with the same methods I used for the other delta.
 Since that one turned out to be a real bug in the patch, I'm not so
 sanguine as to dismiss this one as probably connected to issue 1.

-Ed

[bpf-next PATCH 3/3] xdp: split code for map vs non-map redirect

2018-08-31 Thread Jesper Dangaard Brouer

The compiler does an efficient job of inlining static C functions.
Perf top clearly shows that almost everything gets inlined into the
function call xdp_do_redirect.

The function xdp_do_redirect ends up containing and interleaving the
map and non-map redirect code.  This is sub-optimal, as it would be
strange for an XDP program to use both types of redirect in the same
program. The two use-cases are separate, and interleaving the code
just causes more instruction-cache pressure.

I would like to stress (again) that the non-map variant bpf_redirect
is very slow compared to the bpf_redirect_map variant, approx half the
speed.  Measured with driver i40e the difference is:

- map redirect: 13,250,350 pps
- non-map redirect:  7,491,425 pps

For this reason, the non-map variant of redirect has been named
xdp_do_redirect_slow.  This hopefully gives a hint, when using perf,
that this is not the optimal XDP redirect operating mode.
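
The two rates above can also be read as a per-packet time budget: roughly 75.5 ns per packet for map redirect versus roughly 133.5 ns for non-map redirect, a difference of about 58 ns per packet. A trivial sketch of that conversion, with a hypothetical helper name:

```c
#include <assert.h>

/* Average time per packet, in nanoseconds, for a given rate in pps. */
double ns_per_packet(double pps)
{
	assert(pps > 0.0);
	return 1e9 / pps;
}
```

Calling ns_per_packet(13250350.0) and ns_per_packet(7491425.0) reproduces the two figures; only the pps numbers come from the measurement above.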

Signed-off-by: Jesper Dangaard Brouer 
---
 net/core/filter.c |   52 ++--
 1 file changed, 30 insertions(+), 22 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index ec1b4eb0d3d4..c4ad1b93167f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3170,6 +3170,32 @@ static int __bpf_tx_xdp(struct net_device *dev,
return 0;
 }
 
+/* non-static to avoid inline by compiler */
+int xdp_do_redirect_slow(struct net_device *dev, struct xdp_buff *xdp,
+   struct bpf_prog *xdp_prog, struct bpf_redirect_info *ri)
+{
+   struct net_device *fwd;
+   u32 index = ri->ifindex;
+   int err;
+
+   fwd = dev_get_by_index_rcu(dev_net(dev), index);
+   ri->ifindex = 0;
+   if (unlikely(!fwd)) {
+   err = -EINVAL;
+   goto err;
+   }
+
+   err = __bpf_tx_xdp(fwd, NULL, xdp, 0);
+   if (unlikely(err))
+   goto err;
+
+   _trace_xdp_redirect(dev, xdp_prog, index);
+   return 0;
+err:
+   _trace_xdp_redirect_err(dev, xdp_prog, index, err);
+   return err;
+}
+
 static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
struct bpf_map *map,
struct xdp_buff *xdp,
@@ -3264,9 +3290,9 @@ void bpf_clear_redirect_map(struct bpf_map *map)
 }
 
 static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
-  struct bpf_prog *xdp_prog, struct bpf_map *map)
+  struct bpf_prog *xdp_prog, struct bpf_map *map,
+  struct bpf_redirect_info *ri)
 {
-   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
u32 index = ri->ifindex;
void *fwd = NULL;
int err;
@@ -3299,29 +3325,11 @@ int xdp_do_redirect(struct net_device *dev, struct 
xdp_buff *xdp,
 {
struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
struct bpf_map *map = READ_ONCE(ri->map);
-   struct net_device *fwd;
-   u32 index = ri->ifindex;
-   int err;
 
if (likely(map))
-   return xdp_do_redirect_map(dev, xdp, xdp_prog, map);
+   return xdp_do_redirect_map(dev, xdp, xdp_prog, map, ri);
 
-   fwd = dev_get_by_index_rcu(dev_net(dev), index);
-   ri->ifindex = 0;
-   if (unlikely(!fwd)) {
-   err = -EINVAL;
-   goto err;
-   }
-
-   err = __bpf_tx_xdp(fwd, NULL, xdp, 0);
-   if (unlikely(err))
-   goto err;
-
-   _trace_xdp_redirect(dev, xdp_prog, index);
-   return 0;
-err:
-   _trace_xdp_redirect_err(dev, xdp_prog, index, err);
-   return err;
+   return xdp_do_redirect_slow(dev, xdp, xdp_prog, ri);
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
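
The hot/cold split made by this patch can be illustrated outside the kernel with a small sketch. It assumes GCC or Clang; all names are hypothetical. Note that the sketch uses an explicit noinline attribute, a swapped-in technique, whereas the patch relies on making the slow function non-static.

```c
#include <assert.h>
#include <stddef.h>

#define likely(x) __builtin_expect(!!(x), 1)

/* Hypothetical per-request state standing in for struct bpf_redirect_info. */
struct redirect_sketch {
	const int *map;	/* non-NULL selects the fast (map) path */
	int index;
};

/* Cold path: kept out of line so its body does not bloat the hot caller. */
__attribute__((noinline))
int redirect_slow(struct redirect_sketch *ri)
{
	int index = ri->index;

	ri->index = 0;	/* consume the request */
	return index < 0 ? -1 : index;
}

/* Hot path: small enough to be inlined; the caller is trusted on bounds. */
static inline int redirect_map(struct redirect_sketch *ri)
{
	int index = ri->index;

	ri->index = 0;
	return ri->map[index];
}

int do_redirect(struct redirect_sketch *ri)
{
	assert(ri != NULL);
	if (likely(ri->map))
		return redirect_map(ri);
	return redirect_slow(ri);
}
```

Passing the state pointer down to both helpers, as the patch does with ri, avoids looking it up a second time in the callee.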

[bpf-next PATCH 2/3] xdp: explicit inline __xdp_map_lookup_elem

2018-08-31 Thread Jesper Dangaard Brouer

The compiler chooses not to inline the function __xdp_map_lookup_elem,
because it can see that it is used by both the Generic-XDP and native-XDP
do-redirect calls (xdp_do_generic_redirect_map and xdp_do_redirect_map).

The compiler cannot know that this is a bad choice, as it cannot know
that a net device cannot run both XDP modes (Generic or Native) at the
same time.  Thus, mark this function inline, even though we normally
leave this up to the compiler.

Signed-off-by: Jesper Dangaard Brouer 
---
 net/core/filter.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 520f5e9e0b73..ec1b4eb0d3d4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3232,7 +3232,7 @@ void xdp_do_flush_map(void)
 }
 EXPORT_SYMBOL_GPL(xdp_do_flush_map);
 
-static void *__xdp_map_lookup_elem(struct bpf_map *map, u32 index)
+static inline void *__xdp_map_lookup_elem(struct bpf_map *map, u32 index)
 {
switch (map->map_type) {
case BPF_MAP_TYPE_DEVMAP:
@@ -3275,7 +3275,7 @@ static int xdp_do_redirect_map(struct net_device *dev, 
struct xdp_buff *xdp,
WRITE_ONCE(ri->map, NULL);
 
fwd = __xdp_map_lookup_elem(map, index);
-   if (!fwd) {
+   if (unlikely(!fwd)) {
err = -EINVAL;
goto err;
}
@@ -3303,7 +3303,7 @@ int xdp_do_redirect(struct net_device *dev, struct 
xdp_buff *xdp,
u32 index = ri->ifindex;
int err;
 
-   if (map)
+   if (likely(map))
return xdp_do_redirect_map(dev, xdp, xdp_prog, map);
 
fwd = dev_get_by_index_rcu(dev_net(dev), index);

[bpf-next PATCH 1/3] xdp: unlikely instrumentation for xdp map redirect

2018-08-31 Thread Jesper Dangaard Brouer

Notice that the compiler-generated ASM code layout was suboptimal.  It
treated map enqueue errors as the likely case, which they are not.
It also treated the xdp_do_flush_map() call as the likely case; that call
is only needed when maps change between packets, which should be very unlikely.

Signed-off-by: Jesper Dangaard Brouer 
---
 net/core/filter.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index c25eb36f1320..520f5e9e0b73 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3182,7 +3182,7 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, 
void *fwd,
struct bpf_dtab_netdev *dst = fwd;
 
err = dev_map_enqueue(dst, xdp, dev_rx);
-   if (err)
+   if (unlikely(err))
return err;
__dev_map_insert_ctx(map, index);
break;
@@ -3191,7 +3191,7 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, 
void *fwd,
struct bpf_cpu_map_entry *rcpu = fwd;
 
err = cpu_map_enqueue(rcpu, xdp, dev_rx);
-   if (err)
+   if (unlikely(err))
return err;
__cpu_map_insert_ctx(map, index);
break;
@@ -3279,7 +3279,7 @@ static int xdp_do_redirect_map(struct net_device *dev, 
struct xdp_buff *xdp,
err = -EINVAL;
goto err;
}
-   if (ri->map_to_flush && ri->map_to_flush != map)
+   if (ri->map_to_flush && unlikely(ri->map_to_flush != map))
xdp_do_flush_map();
 
err = __bpf_tx_xdp_map(dev, fwd, map, xdp, index);

[bpf-next PATCH 0/3] XDP micro optimizations for redirect

2018-08-31 Thread Jesper Dangaard Brouer

This patchset contains XDP micro optimizations for the redirect core.
These are not functional changes.  The optimizations revolve around
getting the compiler to lay out the code in a way that reflects how XDP
redirect is used.

Today the compiler chooses to inline and uninline (static C functions)
in a suboptimal way, compared to how XDP redirect can be used. Perf
top clearly shows that almost everything gets inlined into the
function call xdp_do_redirect.

The way the compiler chooses to inline does not reflect how XDP
redirect is used, as the compiler cannot know this.

---

Jesper Dangaard Brouer (3):
  xdp: unlikely instrumentation for xdp map redirect
  xdp: explicit inline __xdp_map_lookup_elem
  xdp: split code for map vs non-map redirect


 net/core/filter.c |   64 ++---
 1 file changed, 36 insertions(+), 28 deletions(-)

--

Re: [PATCH net-next] veth: report NEWLINK event when moving the peer device in a new namespace

2018-08-31 Thread David Ahern

On 8/31/18 5:43 AM, Lorenzo Bianconi wrote:
> When moving a veth device to another namespace, userspace receives a
> RTM_DELLINK message indicating the device has been removed from current
> netns. However, the other peer does not receive a netlink event
> containing new values for IFLA_LINK_NETNSID and IFLA_LINK veth
> attributes.
> Fix that behaviour sending to userspace a RTM_NEWLINK message in the peer
> namespace to report new IFLA_LINK_NETNSID/IFLA_LINK values
> 

A newlink message is generated in the new namespace. What information is
missing from that message?

Re: [PATCH RFC net-next 00/11] udp gso

2018-08-31 Thread Willem de Bruijn

On Fri, Aug 31, 2018 at 9:44 AM Paolo Abeni  wrote:
>
> On Fri, 2018-08-31 at 09:08 -0400, Willem de Bruijn wrote:
> > On Fri, Aug 31, 2018 at 5:09 AM Paolo Abeni  wrote:
> > >
> > > Hi,
> > >
> > > On Tue, 2018-04-17 at 17:07 -0400, Willem de Bruijn wrote:
> > > > That said, for negotiated flows an inverse GRO feature could
> > > > conceivably be implemented to reduce rx stack traversal, too.
> > > > Though due to interleaving of packets on the wire, it aggregation
> > > > would be best effort, similar to TCP TSO and GRO using the
> > > > PSH bit as packetization signal.
> > >
> > > Reviving this old thread, before I forgot again. I have some local
> > > patches implementing UDP GRO in a dual way to current GSO_UDP_L4
> > > implementation: several datagram with the same length are aggregated
> > > into a single one, and the user space receive a single larger packet
> > > instead of multiple ones. I hope quic can leverage such scenario, but I
> > > really know nothing about the protocol.
> > >
> > > I measure roughly a 50% performance improvement with udpgso_bench in
> > > respect to UDP GSO, and ~100% using a pktgen sender, and a reduced CPU
> > > usage on the receiver[1]. Some additional hacking to the general GRO
> > > bits is required to avoid useless socket lookups for ingress UDP
> > > packets when UDP_GSO is not enabled.
> > >
> > > If there is interest on this topic, I can share some RFC patches
> > > (hopefully somewhat next week).
> >
> > As Eric pointed out, QUIC reception on mobile clients over the WAN
> > may not see much gain. But apparently there is a non-trivial amount
> > of traffic the other way, to servers. Again, WAN might limit whatever
> > gain we get, but I do want to look into that. And there are other UDP high
> > throughput workloads (with or without QUIC) between servers.
> >
> > If you have patches, please do share them.
>
> I'll try to clean them up and send them next week (as RFC).
>
> > I actually also have a rough
> > patch that I did not consider ready to share yet. Based on Tom's existing
> > socket lookup in udp_gro_receive to detect whether a local destination
> > exists and whether it has set an option to support receiving coalesced
> > payloads (along with a cmsg to share the segment size).
>
> That is more or less what I'm doing here.
> Side note: I had test it in baremetal, as veth/lo do not trigger the
> GRO path: selftest of this feature is not so straightforward.
>
> > Converting udp_recvmsg to split apart gso packets to make this
> > transparent seems to me to be too complex and not worth the effort.
>
> Agreed. Moreover doing many, small, recvmsg() instead of a single,
> large, one will hit the performances very badly due to PTI and
> HARDENED_USERCOPY.
>
> > If a local socket is not found in udp_gro_receive, this could also be
> > tentative interpreted as a non-local path (with false positives), enabling
> > transparent use of GRO + GSO batching on the forwarding path.
>
> That sounds interesting, even if false positive looks dangerous to me.
> Just to be on the same page, which false positive examples are you
> thinking at? UDP sockets bound to local address behind NAT?

Any packets that would otherwise get dropped, such as packets with
local destination, but no local bound socket. This may increase the
amount of cycles spent on such packets, potentially increasing a DoS
vector.

IPv6 neighbor discovery issues on 4.18

2018-08-31 Thread Brian Rak

We've upgraded a few machines to a 4.18.3 kernel and we're running into 
weird IPv6 neighbor discovery issues.  Basically, the machines stop 
responding to inbound IPv6 neighbor solicitation requests, which very 
quickly breaks all IPv6 connectivity.


It seems like the routing table gets confused:

# ip -6 route get fe80::4e16:fc00:c7a0:7800 dev br0
RTNETLINK answers: Network is unreachable
# ping6 fe80::4e16:fc00:c7a0:7800 -I br0
connect: Network is unreachable
yet

# ip -6 route | grep fe80 | grep br0
fe80::/64 dev br0 proto kernel metric 256 pref medium

fe80::4e16:fc00:c7a0:7800 is the link-local IP of the server's default 
gateway.


In this case, br0 has a single adapter attached to it.

I haven't been able to come up with any sort of reproduction steps here, 
this seems to happen after a few days of uptime in our environment.  The 
last known good release we have here is 4.17.13.


Any suggestions for troubleshooting this?  Sometimes we see machines fix 
themselves, but we haven't been able to figure out what's happening that 
helps.

Re: [PATCH RFC net-next 00/11] udp gso

2018-08-31 Thread Paolo Abeni

On Fri, 2018-08-31 at 09:08 -0400, Willem de Bruijn wrote:
> On Fri, Aug 31, 2018 at 5:09 AM Paolo Abeni  wrote:
> > 
> > Hi,
> > 
> > On Tue, 2018-04-17 at 17:07 -0400, Willem de Bruijn wrote:
> > > That said, for negotiated flows an inverse GRO feature could
> > > conceivably be implemented to reduce rx stack traversal, too.
> > > Though due to interleaving of packets on the wire, it aggregation
> > > would be best effort, similar to TCP TSO and GRO using the
> > > PSH bit as packetization signal.
> > 
> > Reviving this old thread, before I forgot again. I have some local
> > patches implementing UDP GRO in a dual way to current GSO_UDP_L4
> > implementation: several datagram with the same length are aggregated
> > into a single one, and the user space receive a single larger packet
> > instead of multiple ones. I hope quic can leverage such scenario, but I
> > really know nothing about the protocol.
> > 
> > I measure roughly a 50% performance improvement with udpgso_bench in
> > respect to UDP GSO, and ~100% using a pktgen sender, and a reduced CPU
> > usage on the receiver[1]. Some additional hacking to the general GRO
> > bits is required to avoid useless socket lookups for ingress UDP
> > packets when UDP_GSO is not enabled.
> > 
> > If there is interest on this topic, I can share some RFC patches
> > (hopefully somewhat next week).
> 
> As Eric pointed out, QUIC reception on mobile clients over the WAN
> may not see much gain. But apparently there is a non-trivial amount
> of traffic the other way, to servers. Again, WAN might limit whatever
> gain we get, but I do want to look into that. And there are other UDP high
> throughput workloads (with or without QUIC) between servers.
> 
> If you have patches, please do share them. 

I'll try to clean them up and send them next week (as RFC).

> I actually also have a rough
> patch that I did not consider ready to share yet. Based on Tom's existing
> socket lookup in udp_gro_receive to detect whether a local destination
> exists and whether it has set an option to support receiving coalesced
> payloads (along with a cmsg to share the segment size).

That is more or less what I'm doing here.
Side note: I had to test it on bare metal, as veth/lo do not trigger the
GRO path: a selftest for this feature is not so straightforward.

> Converting udp_recvmsg to split apart gso packets to make this
> transparent seems to me to be too complex and not worth the effort.

Agreed. Moreover, doing many small recvmsg() calls instead of a single
large one will hurt performance very badly due to PTI and
HARDENED_USERCOPY.

> If a local socket is not found in udp_gro_receive, this could also be
> tentative interpreted as a non-local path (with false positives), enabling
> transparent use of GRO + GSO batching on the forwarding path.

That sounds interesting, even if false positives look dangerous to me.
Just to be on the same page, which false positive examples are you
thinking of? UDP sockets bound to a local address behind NAT?

Cheers,

Paolo

Re: [PATCH net] ipv6: don't get lwtstate twice in ip6_rt_copy_init()

2018-08-31 Thread Alexey Kodanev

On 30.08.2018 19:10, David Ahern wrote:
> On 8/30/18 10:11 AM, Alexey Kodanev wrote:
...
>> unreferenced object 0x880b6aaa14e0 (size 64):
>>   comm "ip", pid 10577, jiffies 4295149341 (age 1273.903s)
>>   hex dump (first 32 bytes):
>> 01 00 04 00 04 00 00 00 10 00 00 00 00 00 00 00  
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>>   backtrace:
>> [<18664623>] lwtunnel_build_state+0x1bc/0x420
>> [] ip6_route_info_create+0x9f7/0x1fd0
>> [] ip6_route_add+0x14/0x70
>> [<8537b55c>] inet6_rtm_newroute+0xd9/0xe0
>> [<2acc50f5>] rtnetlink_rcv_msg+0x66f/0x8e0
>> [<8d9cd381>] netlink_rcv_skb+0x268/0x3b0
>> [<4c893c76>] netlink_unicast+0x417/0x5a0
>> [] netlink_sendmsg+0x70b/0xc30
>> [<890ff0aa>] sock_sendmsg+0xb1/0xf0
>> [] ___sys_sendmsg+0x659/0x950
>> [<1e7426c8>] __sys_sendmsg+0xde/0x170
>> [] do_syscall_64+0x9f/0x4a0
>> [<1be7b28b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [<6d21f353>] 0x
> 
> What test did you run to uncover this? Curious as to why my testing that
> found the need for 80f1a0f4e0cd did not hit this.

I was using an IPv6 route with MPLS. I will submit MPLS tests to LTP soon;
they will include that test as well.

Meanwhile, these commands below are able to trigger it:

  ip route add $new_route encap mpls 50 via inet6 $ip_rhost
  ping6 $ip_new_route
  ip route del $new_route

Thanks,
Alexey

Re: [PATCH RFC net-next 00/11] udp gso

2018-08-31 Thread Willem de Bruijn

On Fri, Aug 31, 2018 at 5:09 AM Paolo Abeni  wrote:
>
> Hi,
>
> On Tue, 2018-04-17 at 17:07 -0400, Willem de Bruijn wrote:
> > That said, for negotiated flows an inverse GRO feature could
> > conceivably be implemented to reduce rx stack traversal, too.
> > Though due to interleaving of packets on the wire, it aggregation
> > would be best effort, similar to TCP TSO and GRO using the
> > PSH bit as packetization signal.
>
> Reviving this old thread, before I forget again. I have some local
> patches implementing UDP GRO in a dual way to the current GSO_UDP_L4
> implementation: several datagrams with the same length are aggregated
> into a single one, and user space receives a single larger packet
> instead of multiple ones. I hope QUIC can leverage such a scenario, but I
> really know nothing about the protocol.
>
> I measure roughly a 50% performance improvement with udpgso_bench
> with respect to UDP GSO, and ~100% using a pktgen sender, and a reduced CPU
> usage on the receiver[1]. Some additional hacking to the general GRO
> bits is required to avoid useless socket lookups for ingress UDP
> packets when UDP_GSO is not enabled.
>
> If there is interest in this topic, I can share some RFC patches
> (hopefully sometime next week).

As Eric pointed out, QUIC reception on mobile clients over the WAN
may not see much gain. But apparently there is a non-trivial amount
of traffic the other way, to servers. Again, WAN might limit whatever
gain we get, but I do want to look into that. And there are other UDP high
throughput workloads (with or without QUIC) between servers.

If you have patches, please do share them. I actually also have a rough
patch that I did not consider ready to share yet. It is based on Tom's
existing socket lookup in udp_gro_receive, which detects whether a local
destination exists and whether it has set an option to support receiving
coalesced payloads (along with a cmsg to share the segment size).

Converting udp_recvmsg to split apart gso packets to make this
transparent seems to me to be too complex and not worth the effort.

If a local socket is not found in udp_gro_receive, this could also be
tentatively interpreted as a non-local path (with false positives), enabling
transparent use of GRO + GSO batching on the forwarding path.
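
For illustration only, a minimal userspace sketch of what consuming a
coalesced payload together with a segment size could look like. Nothing
here comes from the patches under discussion: for_each_segment() and
record_last_len() are hypothetical names, and the sketch simply assumes
the receiver has already obtained the buffer and the segment size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: invoked once per original datagram. */
typedef void (*segment_fn)(const unsigned char *seg, size_t len, void *ctx);

/*
 * Walk a coalesced buffer of buf_len bytes in which every datagram except
 * possibly the last one is seg_size bytes long.
 * Returns the number of datagrams visited.
 */
static size_t for_each_segment(const unsigned char *buf, size_t buf_len,
                               size_t seg_size, segment_fn fn, void *ctx)
{
    size_t off = 0, count = 0;

    assert(seg_size > 0);
    while (off < buf_len) {
        size_t len = buf_len - off;

        if (len > seg_size)
            len = seg_size;     /* full-sized segment */
        if (fn)
            fn(buf + off, len, ctx);
        off += len;
        count++;
    }
    return count;
}

/* Example callback: remember the length of the last segment seen. */
static void record_last_len(const unsigned char *seg, size_t len, void *ctx)
{
    (void)seg;
    *(size_t *)ctx = len;
}
```

With a 3500-byte buffer and a 1400-byte segment size, the walk visits three
datagrams, the last of which is the short 700-byte tail.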

[PATCH net-next] cxgb4: collect hardware queue descriptors

2018-08-31 Thread Rahul Lakkireddy

Collect descriptors of all ULD and LLD hardware queues managed
by LLD.

Signed-off-by: Rahul Lakkireddy 
Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cudbg_entity.h |  42 +++
 drivers/net/ethernet/chelsio/cxgb4/cudbg_if.h |   3 +-
 drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c| 336 ++
 drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.h|   5 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h|   7 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c  |   4 +
 6 files changed, 396 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cudbg_entity.h 
b/drivers/net/ethernet/chelsio/cxgb4/cudbg_entity.h
index 36d25883d123..b2d617abcf49 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cudbg_entity.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cudbg_entity.h
@@ -315,6 +315,48 @@ struct cudbg_pbt_tables {
u32 pbt_data[CUDBG_PBT_DATA_ENTRIES];
 };
 
+enum cudbg_qdesc_qtype {
+   CUDBG_QTYPE_UNKNOWN = 0,
+   CUDBG_QTYPE_NIC_TXQ,
+   CUDBG_QTYPE_NIC_RXQ,
+   CUDBG_QTYPE_NIC_FLQ,
+   CUDBG_QTYPE_CTRLQ,
+   CUDBG_QTYPE_FWEVTQ,
+   CUDBG_QTYPE_INTRQ,
+   CUDBG_QTYPE_PTP_TXQ,
+   CUDBG_QTYPE_OFLD_TXQ,
+   CUDBG_QTYPE_RDMA_RXQ,
+   CUDBG_QTYPE_RDMA_FLQ,
+   CUDBG_QTYPE_RDMA_CIQ,
+   CUDBG_QTYPE_ISCSI_RXQ,
+   CUDBG_QTYPE_ISCSI_FLQ,
+   CUDBG_QTYPE_ISCSIT_RXQ,
+   CUDBG_QTYPE_ISCSIT_FLQ,
+   CUDBG_QTYPE_CRYPTO_TXQ,
+   CUDBG_QTYPE_CRYPTO_RXQ,
+   CUDBG_QTYPE_CRYPTO_FLQ,
+   CUDBG_QTYPE_TLS_RXQ,
+   CUDBG_QTYPE_TLS_FLQ,
+   CUDBG_QTYPE_MAX,
+};
+
+#define CUDBG_QDESC_REV 1
+
+struct cudbg_qdesc_entry {
+   u32 data_size;
+   u32 qtype;
+   u32 qid;
+   u32 desc_size;
+   u32 num_desc;
+   u8 data[0]; /* Must be last */
+};
+
+struct cudbg_qdesc_info {
+   u32 qdesc_entry_size;
+   u32 num_queues;
+   u8 data[0]; /* Must be last */
+};
+
 #define IREG_NUM_ELEM 4
 
 static const u32 t6_tp_pio_array[][IREG_NUM_ELEM] = {
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cudbg_if.h 
b/drivers/net/ethernet/chelsio/cxgb4/cudbg_if.h
index 215fe6260fd7..dec63c15c0ba 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cudbg_if.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cudbg_if.h
@@ -81,7 +81,8 @@ enum cudbg_dbg_entity_type {
CUDBG_MBOX_LOG = 66,
CUDBG_HMA_INDIRECT = 67,
CUDBG_HMA = 68,
-   CUDBG_MAX_ENTITY = 70,
+   CUDBG_QDESC = 70,
+   CUDBG_MAX_ENTITY = 71,
 };
 
 struct cudbg_init {
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c 
b/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c
index d97e0d7e541a..02fc350f81c9 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c
@@ -19,6 +19,7 @@
 
 #include "t4_regs.h"
 #include "cxgb4.h"
+#include "cxgb4_cudbg.h"
 #include "cudbg_if.h"
 #include "cudbg_lib_common.h"
 #include "cudbg_entity.h"
@@ -2890,3 +2891,338 @@ int cudbg_collect_hma_indirect(struct cudbg_init 
*pdbg_init,
}
return cudbg_write_and_release_buff(pdbg_init, &temp_buff, dbg_buff);
 }
+
+static inline u32 cudbg_uld_txq_to_qtype(u32 uld)
+{
+   switch (uld) {
+   case CXGB4_TX_OFLD:
+   return CUDBG_QTYPE_OFLD_TXQ;
+   case CXGB4_TX_CRYPTO:
+   return CUDBG_QTYPE_CRYPTO_TXQ;
+   }
+
+   return CUDBG_QTYPE_UNKNOWN;
+}
+
+static inline u32 cudbg_uld_rxq_to_qtype(u32 uld)
+{
+   switch (uld) {
+   case CXGB4_ULD_RDMA:
+   return CUDBG_QTYPE_RDMA_RXQ;
+   case CXGB4_ULD_ISCSI:
+   return CUDBG_QTYPE_ISCSI_RXQ;
+   case CXGB4_ULD_ISCSIT:
+   return CUDBG_QTYPE_ISCSIT_RXQ;
+   case CXGB4_ULD_CRYPTO:
+   return CUDBG_QTYPE_CRYPTO_RXQ;
+   case CXGB4_ULD_TLS:
+   return CUDBG_QTYPE_TLS_RXQ;
+   }
+
+   return CUDBG_QTYPE_UNKNOWN;
+}
+
+static inline u32 cudbg_uld_flq_to_qtype(u32 uld)
+{
+   switch (uld) {
+   case CXGB4_ULD_RDMA:
+   return CUDBG_QTYPE_RDMA_FLQ;
+   case CXGB4_ULD_ISCSI:
+   return CUDBG_QTYPE_ISCSI_FLQ;
+   case CXGB4_ULD_ISCSIT:
+   return CUDBG_QTYPE_ISCSIT_FLQ;
+   case CXGB4_ULD_CRYPTO:
+   return CUDBG_QTYPE_CRYPTO_FLQ;
+   case CXGB4_ULD_TLS:
+   return CUDBG_QTYPE_TLS_FLQ;
+   }
+
+   return CUDBG_QTYPE_UNKNOWN;
+}
+
+static inline u32 cudbg_uld_ciq_to_qtype(u32 uld)
+{
+   switch (uld) {
+   case CXGB4_ULD_RDMA:
+   return CUDBG_QTYPE_RDMA_CIQ;
+   }
+
+   return CUDBG_QTYPE_UNKNOWN;
+}
+
+static inline void cudbg_fill_qdesc_txq(const struct sge_txq *txq,
+   enum cudbg_qdesc_qtype type,
+   struct cudbg_qdesc_entry *entry)
+{
+   entry->qtype = type;
+   entry->qid = txq->cntxt_id;
+   entry->desc_size = sizeof(struct tx_desc);
+   entry->num_desc =

Re: [PATCH net] ebpf: fix bpf_msg_pull_data

2018-08-31 Thread Daniel Borkmann

On 08/31/2018 10:37 AM, Tushar Dave wrote:
> On 08/30/2018 12:20 AM, Daniel Borkmann wrote:
>> On 08/30/2018 02:21 AM, Tushar Dave wrote:
>>> On 08/29/2018 05:07 PM, Tushar Dave wrote:
 While doing some preliminary testing it is found that bpf helper
 bpf_msg_pull_data does not calculate the data and data_end offset
 correctly. Fix it!

 Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
 Signed-off-by: Tushar Dave 
 Acked-by: Sowmini Varadhan 
 ---
    net/core/filter.c | 38 +-
    1 file changed, 25 insertions(+), 13 deletions(-)

 diff --git a/net/core/filter.c b/net/core/filter.c
 index c25eb36..3eeb3d6 100644
 --- a/net/core/filter.c
 +++ b/net/core/filter.c
 @@ -2285,7 +2285,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
 *msg)
    BPF_CALL_4(bpf_msg_pull_data,
   struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
    {
 -    unsigned int len = 0, offset = 0, copy = 0;
 +    unsigned int len = 0, offset = 0, copy = 0, off = 0;
    struct scatterlist *sg = msg->sg_data;
    int first_sg, last_sg, i, shift;
    unsigned char *p, *to, *from;
 @@ -2299,22 +2299,30 @@ struct sock *do_msg_redirect_map(struct 
 sk_msg_buff *msg)
    i = msg->sg_start;
    do {
    len = sg[i].length;
 -    offset += len;
    if (start < offset + len)
    break;
 +    offset += len;
    i++;
    if (i == MAX_SKB_FRAGS)
    i = 0;
 -    } while (i != msg->sg_end);
 +    } while (i <= msg->sg_end);
>>
>> I don't think this condition is correct, msg operates as a scatterlist ring,
>> so sg_end may very well be < current i when there's a wrap-around in the
>> traversal ... more below.
> 
> I'm wondering then how this is supposed to work in case the sg list is not
> a ring! For RDS, we have an sg list that is not a ring. More below.
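
For illustration only, a minimal userspace sketch of the two loop
termination conditions discussed above. It is not the kernel code:
RING_SIZE stands in for MAX_SKB_FRAGS (the value 17 is an assumption),
walk_ring_ne() and walk_ring_le() are hypothetical names, and "end" is
assumed to be exclusive.

```c
#include <assert.h>

#define RING_SIZE 17 /* stand-in for MAX_SKB_FRAGS; the value is an assumption */

/* Count slots visited from start up to (not including) end, with wrap-around. */
static int walk_ring_ne(int start, int end)
{
    int i = start, visited = 0;

    assert(start >= 0 && start < RING_SIZE && end >= 0 && end < RING_SIZE);
    do {
        visited++;
        i++;
        if (i == RING_SIZE)
            i = 0;
    } while (i != end);
    return visited;
}

/*
 * Same walk, but terminating on "i <= end". The extra bound on visited is
 * only there so that this sketch cannot loop forever.
 */
static int walk_ring_le(int start, int end)
{
    int i = start, visited = 0;

    assert(start >= 0 && start < RING_SIZE && end >= 0 && end < RING_SIZE);
    do {
        visited++;
        i++;
        if (i == RING_SIZE)
            i = 0;
    } while (i <= end && visited < RING_SIZE);
    return visited;
}
```

In this sketch, a wrapped range such as start=14, end=2 is walked fully by
the "!=" form (five slots) but stops after one slot with "<=". For a
non-wrapped range such as start=0, end=3 the "!=" form visits three slots,
while "<=" also visits slot 3.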
> 
>>
    +    /* return error if start is out of range */
    if (unlikely(start >= offset + len))
    return -EINVAL;
    -    if (!msg->sg_copy[i] && bytes <= len)
 -    goto out;
 +    /* return error if i is last entry in sglist and end is out of range 
 */
 +    if (msg->sg_copy[i] && end > offset + len)
 +    return -EINVAL;
      first_sg = i;
    +    /* if i is not last entry in sg list and end (i.e start + bytes) is
 + * within this sg[i] then goto out and calculate data and data_end
 + */
 +    if (!msg->sg_copy[i] && end <= offset + len)
 +    goto out;
 +
    /* At this point we need to linearize multiple scatterlist
     * elements or a single shared page. Either way we need to
     * copy into a linear buffer exclusively owned by BPF. Then
 @@ -2330,9 +2338,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
 *msg)
    i++;
    if (i == MAX_SKB_FRAGS)
    i = 0;
 -    if (bytes < copy)
 +    if (end < copy)
    break;
 -    } while (i != msg->sg_end);
 +    } while (i <= msg->sg_end);
 +
 +    /* return error if i is last entry in sglist and end is out of range 
 */
 +    if (i > msg->sg_end && end > offset + copy)
 +    return -EINVAL;
 +
    last_sg = i;
      if (unlikely(copy < end - start))
 @@ -2342,23 +2355,22 @@ struct sock *do_msg_redirect_map(struct 
 sk_msg_buff *msg)
    if (unlikely(!page))
    return -ENOMEM;
    p = page_address(page);
 -    offset = 0;
      i = first_sg;
    do {
     from = sg_virt(&sg[i]);
    len = sg[i].length;
 -    to = p + offset;
 +    to = p + off;
      memcpy(to, from, len);
 -    offset += len;
 +    off += len;
    sg[i].length = 0;
     put_page(sg_page(&sg[i]));
      i++;
    if (i == MAX_SKB_FRAGS)
    i = 0;
 -    } while (i != last_sg);
 +    } while (i < last_sg);
      sg[first_sg].length = copy;
     sg_set_page(&sg[first_sg], page, copy, 0);
 @@ -2380,7 +2392,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
 *msg)
    else
    move_from = i + shift;
    -    if (move_from == msg->sg_end)
 +    if (move_from > msg->sg_end)
    break;
      sg[i] = sg[move_from];
 @@ -2396,7 +2408,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
 *msg)
    if (msg->sg_end < 0)
    msg->sg_end += MAX_SKB_FRAGS;
    out:
 -    msg->data = sg_virt([i]) + start - offset;
 +    msg->data = sg_virt([first_sg]) + start - offset;
    msg->data_end

Re: [PATCH net] sctp: hold transport before accessing its asoc in sctp_transport_get_next

2018-08-31 Thread Neil Horman

On Fri, Aug 31, 2018 at 03:09:05PM +0800, Xin Long wrote:
> On Wed, Aug 29, 2018 at 7:36 PM Neil Horman  wrote:
> >
> > On Wed, Aug 29, 2018 at 12:08:40AM +0800, Xin Long wrote:
> > > On Mon, Aug 27, 2018 at 9:08 PM Neil Horman  wrote:
> > > >
> > > > On Mon, Aug 27, 2018 at 06:38:31PM +0800, Xin Long wrote:
> > > > > As Marcelo noticed, in sctp_transport_get_next, it is iterating over
> > > > > transports but then also accessing the association directly, without
> > > > > checking any refcnts before that, which can cause an use-after-free
> > > > > Read.
> > > > >
> > > > > So fix it by holding transport before accessing the association. With
> > > > > that, sctp_transport_hold calls can be removed in the later places.
> > > > >
> > > > > Fixes: 626d16f50f39 ("sctp: export some apis or variables for 
> > > > > sctp_diag and reuse some for proc")
> > > > > Reported-by: syzbot+fe62a0c9aa6a85c6d...@syzkaller.appspotmail.com
> > > > > Signed-off-by: Xin Long 
> > > > > ---
> > > > >  net/sctp/proc.c   |  4 
> > > > >  net/sctp/socket.c | 22 +++---
> > > > >  2 files changed, 15 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/net/sctp/proc.c b/net/sctp/proc.c
> > > > > index ef5c9a8..4d6f1c8 100644
> > > > > --- a/net/sctp/proc.c
> > > > > +++ b/net/sctp/proc.c
> > > > > @@ -264,8 +264,6 @@ static int sctp_assocs_seq_show(struct seq_file 
> > > > > *seq, void *v)
> > > > >   }
> > > > >
> > > > >   transport = (struct sctp_transport *)v;
> > > > > - if (!sctp_transport_hold(transport))
> > > > > - return 0;
> > > > >   assoc = transport->asoc;
> > > > > >   epb = &assoc->base;
> > > > >   sk = epb->sk;
> > > > > @@ -322,8 +320,6 @@ static int sctp_remaddr_seq_show(struct seq_file 
> > > > > *seq, void *v)
> > > > >   }
> > > > >
> > > > >   transport = (struct sctp_transport *)v;
> > > > > - if (!sctp_transport_hold(transport))
> > > > > - return 0;
> > > > >   assoc = transport->asoc;
> > > > >
> > > > > >   list_for_each_entry_rcu(tsp, &assoc->peer.transport_addr_list,
> > > > > diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> > > > > index e96b15a..aa76586 100644
> > > > > --- a/net/sctp/socket.c
> > > > > +++ b/net/sctp/socket.c
> > > > > @@ -5005,9 +5005,14 @@ struct sctp_transport 
> > > > > *sctp_transport_get_next(struct net *net,
> > > > >   break;
> > > > >   }
> > > > >
> > > > > + if (!sctp_transport_hold(t))
> > > > > + continue;
> > > > > +
> > > > >   if (net_eq(sock_net(t->asoc->base.sk), net) &&
> > > > >   t->asoc->peer.primary_path == t)
> > > > >   break;
> > > > > +
> > > > > + sctp_transport_put(t);
> > > > >   }
> > > > >
> > > > >   return t;
> > > > > @@ -5017,13 +5022,18 @@ struct sctp_transport 
> > > > > *sctp_transport_get_idx(struct net *net,
> > > > > struct rhashtable_iter 
> > > > > *iter,
> > > > > int pos)
> > > > >  {
> > > > > - void *obj = SEQ_START_TOKEN;
> > > > > + struct sctp_transport *t;
> > > > >
> > > > > - while (pos && (obj = sctp_transport_get_next(net, iter)) &&
> > > > > -!IS_ERR(obj))
> > > > > - pos--;
> > > > > + if (!pos)
> > > > > + return SEQ_START_TOKEN;
> > > > >
> > > > > - return obj;
> > > > > + while ((t = sctp_transport_get_next(net, iter)) && !IS_ERR(t)) {
> > > > > + if (!--pos)
> > > > > + break;
> > > > > + sctp_transport_put(t);
> > > > > + }
> > > > > +
> > > > > + return t;
> > > > >  }
> > > > >
> > > > >  int sctp_for_each_endpoint(int (*cb)(struct sctp_endpoint *, void *),
> > > > > @@ -5082,8 +5092,6 @@ int sctp_for_each_transport(int (*cb)(struct 
> > > > > sctp_transport *, void *),
> > > > >
> > > > > >   tsp = sctp_transport_get_idx(net, &hti, *pos + 1);
> > > > > >   for (; !IS_ERR_OR_NULL(tsp); tsp = sctp_transport_get_next(net, 
> > > > > > &hti)) {
> > > > > - if (!sctp_transport_hold(tsp))
> > > > > - continue;
> > > > >   ret = cb(tsp, p);
> > > > >   if (ret)
> > > > >   break;
> > > > > --
> > > > > 2.1.0
> > > > >
> > > > >
> > > > Acked-by: Neil Horman 
> > > >
> > > > Additionally, it's not germane to this particular fix, but why are we still
> > > > using that pos variable in sctp_transport_get_idx?  With the conversion to
> > > > rhashtables, it doesn't seem particularly useful anymore.
> > > For proc, seems so, hti is saved into seq->private.
> > > But for diag, "hti" in sctp_for_each_transport() is a local variable.
> > > do you think where we can save it?
> > >
> > Sorry, wasn't suggesting that it had to be removed from sctp_for_each_transport,
> > it's clearly used as both an input and output there.  All I was

[PATCH v2 net-next] liquidio: remove set but not used variable 'irh'

2018-08-31 Thread YueHaibing

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/net/ethernet/cavium/liquidio/request_manager.c: In function 
'lio_process_iq_request_list':
drivers/net/ethernet/cavium/liquidio/request_manager.c:383:27: warning:
 variable 'irh' set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing 
---
v2: fix patch description, remove 'cHECK-'
---
 drivers/net/ethernet/cavium/liquidio/request_manager.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/request_manager.c 
b/drivers/net/ethernet/cavium/liquidio/request_manager.c
index bd0153e..c6f4cbd 100644
--- a/drivers/net/ethernet/cavium/liquidio/request_manager.c
+++ b/drivers/net/ethernet/cavium/liquidio/request_manager.c
@@ -380,7 +380,6 @@ static inline void __copy_cmd_into_iq(struct 
octeon_instr_queue *iq,
u32 inst_count = 0;
unsigned int pkts_compl = 0, bytes_compl = 0;
struct octeon_soft_command *sc;
-   struct octeon_instr_irh *irh;
unsigned long flags;
 
while (old != iq->octeon_read_index) {
@@ -402,14 +401,6 @@ static inline void __copy_cmd_into_iq(struct 
octeon_instr_queue *iq,
case REQTYPE_RESP_NET:
case REQTYPE_SOFT_COMMAND:
sc = buf;
-
-   if (OCTEON_CN23XX_PF(oct) || OCTEON_CN23XX_VF(oct))
-   irh = (struct octeon_instr_irh *)
-   &sc->cmd.cmd3.irh;
-   else
-   irh = (struct octeon_instr_irh *)
-   &sc->cmd.cmd2.irh;
-
/* We're expecting a response from Octeon.
 * It's up to lio_process_ordered_list() to
 * process  sc. Add sc to the ordered soft

Re: [PATCH net-next] liquidio: cHECK-remove set but not used variable 'irh'

2018-08-31 Thread YueHaibing

sorry, patch description is messy, will fix it in V2.

On 2018/8/31 19:53, YueHaibing wrote:
> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> drivers/net/ethernet/cavium/liquidio/request_manager.c: In function 
> 'lio_process_iq_request_list':
> drivers/net/ethernet/cavium/liquidio/request_manager.c:383:27: warning:
>  variable 'irh' set but not used [-Wunused-but-set-variable]
> 
> Signed-off-by: YueHaibing 
> ---
>  drivers/net/ethernet/cavium/liquidio/request_manager.c | 9 -
>  1 file changed, 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cavium/liquidio/request_manager.c 
> b/drivers/net/ethernet/cavium/liquidio/request_manager.c
> index bd0153e..c6f4cbd 100644
> --- a/drivers/net/ethernet/cavium/liquidio/request_manager.c
> +++ b/drivers/net/ethernet/cavium/liquidio/request_manager.c
> @@ -380,7 +380,6 @@ static inline void __copy_cmd_into_iq(struct 
> octeon_instr_queue *iq,
>   u32 inst_count = 0;
>   unsigned int pkts_compl = 0, bytes_compl = 0;
>   struct octeon_soft_command *sc;
> - struct octeon_instr_irh *irh;
>   unsigned long flags;
>  
>   while (old != iq->octeon_read_index) {
> @@ -402,14 +401,6 @@ static inline void __copy_cmd_into_iq(struct 
> octeon_instr_queue *iq,
>   case REQTYPE_RESP_NET:
>   case REQTYPE_SOFT_COMMAND:
>   sc = buf;
> -
> - if (OCTEON_CN23XX_PF(oct) || OCTEON_CN23XX_VF(oct))
> - irh = (struct octeon_instr_irh *)
> - &sc->cmd.cmd3.irh;
> - else
> - irh = (struct octeon_instr_irh *)
> - &sc->cmd.cmd2.irh;
> -
>   /* We're expecting a response from Octeon.
>* It's up to lio_process_ordered_list() to
>* process  sc. Add sc to the ordered soft
> 
> 
> .
>

[PATCH net-next] veth: report NEWLINK event when moving the peer device in a new namespace

2018-08-31 Thread Lorenzo Bianconi

When moving a veth device to another namespace, userspace receives a
RTM_DELLINK message indicating the device has been removed from current
netns. However, the other peer does not receive a netlink event
containing new values for IFLA_LINK_NETNSID and IFLA_LINK veth
attributes.
Fix that behaviour by sending an RTM_NEWLINK message to userspace in the
peer namespace to report the new IFLA_LINK_NETNSID/IFLA_LINK values.
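
For illustration only, a minimal sketch of how a userspace listener might
pick the two attributes out of the attribute area of an RTM_NEWLINK
message. It assumes the Linux UAPI headers are available and uses the
standard RTA_* macros; struct link_info and parse_link_attrs() are
hypothetical names, not part of this patch or of any library.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_link.h>
#include <linux/rtnetlink.h>

/* Hypothetical result holder for the two attributes of interest. */
struct link_info {
    int has_link, has_netnsid;
    int link;       /* IFLA_LINK: ifindex of the peer device */
    int netnsid;    /* IFLA_LINK_NETNSID: nsid of the peer's namespace */
};

/* Walk the rtattr area that follows struct ifinfomsg in a link message. */
static void parse_link_attrs(struct rtattr *rta, int len, struct link_info *out)
{
    assert(out != NULL);
    memset(out, 0, sizeof(*out));

    for (; RTA_OK(rta, len); rta = RTA_NEXT(rta, len)) {
        if ((size_t)RTA_PAYLOAD(rta) < sizeof(int))
            continue;   /* too short to hold a 32-bit value */

        switch (rta->rta_type) {
        case IFLA_LINK:
            memcpy(&out->link, RTA_DATA(rta), sizeof(int));
            out->has_link = 1;
            break;
        case IFLA_LINK_NETNSID:
            memcpy(&out->netnsid, RTA_DATA(rta), sizeof(int));
            out->has_netnsid = 1;
            break;
        default:
            break;      /* ignore everything else */
        }
    }
}
```

A real listener would obtain the buffer from a NETLINK_ROUTE socket
subscribed to link notifications and skip the nlmsghdr and ifinfomsg
headers before calling a helper like this one.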

Reviewed-by: Stefano Brivio 
Signed-off-by: Lorenzo Bianconi 
---
Changes since RFC:
- export rtmsg_ifinfo_build_skb() symbol and do not use rtmsg_ifinfo_newnet()
  directly
---
 drivers/net/veth.c   | 66 +++-
 net/core/rtnetlink.c |  1 +
 2 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 8d679c8b7f25..0caffc93d74d 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1242,18 +1242,82 @@ static struct rtnl_link_ops veth_link_ops = {
.get_link_net   = veth_get_link_net,
 };
 
+static int veth_notify(struct notifier_block *this,
+  unsigned long event, void *ptr)
+{
+   struct net_device *peer, *dev = netdev_notifier_info_to_dev(ptr);
+   struct net *peer_net, *net = dev_net(dev);
+   int nsid, ret = NOTIFY_DONE;
+   struct veth_priv *priv;
+   struct sk_buff *skb;
+
+   if (dev->netdev_ops != &veth_netdev_ops)
+   return NOTIFY_DONE;
+
+   if (event != NETDEV_REGISTER)
+   return NOTIFY_DONE;
+
+   priv = netdev_priv(dev);
+
+   rcu_read_lock();
+
+   peer = rcu_dereference(priv->peer);
+   if (!peer)
+   goto out;
+
+   peer_net = dev_net(peer);
+   /* do not forward events if both veth devices
+* are in the same namespace
+*/
+   if (peer_net == net)
+   goto out;
+
+   /* notify on peer namespace new IFLA_LINK_NETNSID
+* and IFLA_LINK values
+*/
+   nsid = peernet2id_alloc(peer_net, net);
+   skb = rtmsg_ifinfo_build_skb(RTM_NEWLINK, peer, ~0U,
+IFLA_EVENT_NONE, GFP_ATOMIC,
+&nsid, dev->ifindex);
+   if (skb) {
+   rtnl_notify(skb, peer_net, 0, RTNLGRP_LINK, NULL,
+   GFP_ATOMIC);
+   ret = NOTIFY_OK;
+   }
+
+out:
+   rcu_read_unlock();
+
+   return ret;
+}
+
+static struct notifier_block veth_notifier = {
+   .notifier_call = veth_notify,
+};
+
 /*
  * init/fini
  */
 
 static __init int veth_init(void)
 {
-   return rtnl_link_register(&veth_link_ops);
+   int err;
+
+   err = register_netdevice_notifier(&veth_notifier);
+   if (err < 0)
+   return err;
+
+   err = rtnl_link_register(&veth_link_ops);
+   if (err < 0)
+   unregister_netdevice_notifier(&veth_notifier);
+
+   return err;
 }
 
 static __exit void veth_exit(void)
 {
 rtnl_link_unregister(&veth_link_ops);
+   unregister_netdevice_notifier(&veth_notifier);
 }
 
 module_init(veth_init);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 24431e578310..b2df84db7670 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3326,6 +3326,7 @@ struct sk_buff *rtmsg_ifinfo_build_skb(int type, struct 
net_device *dev,
rtnl_set_sk_err(net, RTNLGRP_LINK, err);
return NULL;
 }
+EXPORT_SYMBOL(rtmsg_ifinfo_build_skb);
 
 void rtmsg_ifinfo_send(struct sk_buff *skb, struct net_device *dev, gfp_t 
flags)
 {
-- 
2.17.1

[PATCH net-next] liquidio: cHECK-remove set but not used variable 'irh'

2018-08-31 Thread YueHaibing

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/net/ethernet/cavium/liquidio/request_manager.c: In function 
'lio_process_iq_request_list':
drivers/net/ethernet/cavium/liquidio/request_manager.c:383:27: warning:
 variable 'irh' set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing 
---
 drivers/net/ethernet/cavium/liquidio/request_manager.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/request_manager.c 
b/drivers/net/ethernet/cavium/liquidio/request_manager.c
index bd0153e..c6f4cbd 100644
--- a/drivers/net/ethernet/cavium/liquidio/request_manager.c
+++ b/drivers/net/ethernet/cavium/liquidio/request_manager.c
@@ -380,7 +380,6 @@ static inline void __copy_cmd_into_iq(struct 
octeon_instr_queue *iq,
u32 inst_count = 0;
unsigned int pkts_compl = 0, bytes_compl = 0;
struct octeon_soft_command *sc;
-   struct octeon_instr_irh *irh;
unsigned long flags;
 
while (old != iq->octeon_read_index) {
@@ -402,14 +401,6 @@ static inline void __copy_cmd_into_iq(struct 
octeon_instr_queue *iq,
case REQTYPE_RESP_NET:
case REQTYPE_SOFT_COMMAND:
sc = buf;
-
-   if (OCTEON_CN23XX_PF(oct) || OCTEON_CN23XX_VF(oct))
-   irh = (struct octeon_instr_irh *)
-   &sc->cmd.cmd3.irh;
-   else
-   irh = (struct octeon_instr_irh *)
-   &sc->cmd.cmd2.irh;
-
/* We're expecting a response from Octeon.
 * It's up to lio_process_ordered_list() to
 * process  sc. Add sc to the ordered soft

[PATCH bpf-next v2 2/2] xsk: i40e: get rid of useless struct xdp_umem_props

2018-08-31 Thread Magnus Karlsson

This commit gets rid of the structure xdp_umem_props. It was there to
be able to break a dependency at one point, but this is no longer
needed. The values in the struct are instead stored directly in the
xdp_umem structure. This simplifies the xsk code as well as af_xdp
zero-copy drivers and as a bonus gets rid of one internal header file.

The i40e driver is also adapted to the new interface in this commit.

Signed-off-by: Magnus Karlsson 
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c |  4 ++--
 include/net/xdp_sock.h |  8 ++--
 net/xdp/xdp_umem.c |  4 ++--
 net/xdp/xdp_umem_props.h   | 14 --
 net/xdp/xsk.c  | 10 ++
 net/xdp/xsk_queue.c|  5 +++--
 net/xdp/xsk_queue.h| 13 +++--
 7 files changed, 22 insertions(+), 36 deletions(-)
 delete mode 100644 net/xdp/xdp_umem_props.h

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 41ca7e1..2ebfc78 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -442,7 +442,7 @@ static void i40e_reuse_rx_buffer_zc(struct i40e_ring 
*rx_ring,
struct i40e_rx_buffer *old_bi)
 {
 struct i40e_rx_buffer *new_bi = &rx_ring->rx_bi[rx_ring->next_to_alloc];
-   unsigned long mask = (unsigned long)rx_ring->xsk_umem->props.chunk_mask;
+   unsigned long mask = (unsigned long)rx_ring->xsk_umem->chunk_mask;
u64 hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
u16 nta = rx_ring->next_to_alloc;
 
@@ -477,7 +477,7 @@ void i40e_zca_free(struct zero_copy_allocator *alloc, 
unsigned long handle)
 
rx_ring = container_of(alloc, struct i40e_ring, zca);
hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
-   mask = rx_ring->xsk_umem->props.chunk_mask;
+   mask = rx_ring->xsk_umem->chunk_mask;
 
nta = rx_ring->next_to_alloc;
 bi = &rx_ring->rx_bi[nta];
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 56994ad..932ca0d 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -16,11 +16,6 @@
 struct net_device;
 struct xsk_queue;
 
-struct xdp_umem_props {
-   u64 chunk_mask;
-   u64 size;
-};
-
 struct xdp_umem_page {
void *addr;
dma_addr_t dma;
@@ -30,7 +25,8 @@ struct xdp_umem {
struct xsk_queue *fq;
struct xsk_queue *cq;
struct xdp_umem_page *pages;
-   struct xdp_umem_props props;
+   u64 chunk_mask;
+   u64 size;
u32 headroom;
u32 chunk_size_nohr;
struct user_struct *user;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index d179732..b3b632c 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -312,8 +312,8 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct 
xdp_umem_reg *mr)
 
umem->pid = get_task_pid(current, PIDTYPE_PID);
umem->address = (unsigned long)addr;
-   umem->props.chunk_mask = ~((u64)chunk_size - 1);
-   umem->props.size = size;
+   umem->chunk_mask = ~((u64)chunk_size - 1);
+   umem->size = size;
umem->headroom = headroom;
umem->chunk_size_nohr = chunk_size - headroom;
umem->npgs = size / PAGE_SIZE;
diff --git a/net/xdp/xdp_umem_props.h b/net/xdp/xdp_umem_props.h
deleted file mode 100644
index 40eab10..000
--- a/net/xdp/xdp_umem_props.h
+++ /dev/null
@@ -1,14 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/* XDP user-space packet buffer
- * Copyright(c) 2018 Intel Corporation.
- */
-
-#ifndef XDP_UMEM_PROPS_H_
-#define XDP_UMEM_PROPS_H_
-
-struct xdp_umem_props {
-   u64 chunk_mask;
-   u64 size;
-};
-
-#endif /* XDP_UMEM_PROPS_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 569048e..5a432df 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -470,8 +470,10 @@ static int xsk_bind(struct socket *sock, struct sockaddr 
*addr, int addr_len)
goto out_unlock;
} else {
/* This xsk has its own umem. */
-   xskq_set_umem(xs->umem->fq, &xs->umem->props);
-   xskq_set_umem(xs->umem->cq, &xs->umem->props);
+   xskq_set_umem(xs->umem->fq, xs->umem->size,
+ xs->umem->chunk_mask);
+   xskq_set_umem(xs->umem->cq, xs->umem->size,
+ xs->umem->chunk_mask);
 
err = xdp_umem_assign_dev(xs->umem, dev, qid, flags);
if (err)
@@ -481,8 +483,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr 
*addr, int addr_len)
xs->dev = dev;
xs->zc = xs->umem->zc;
xs->queue_id = qid;
-   xskq_set_umem(xs->rx, &xs->umem->props);
-   xskq_set_umem(xs->tx, &xs->umem->props);
+   xskq_set_umem(xs->rx, xs->umem->size, xs->umem->chunk_mask);
+   xskq_set_umem(xs->tx, xs->umem->size, xs->umem->chunk_mask);

[PATCH bpf-next v2 1/2] i40e: fix possible compiler warning in xsk TX path

2018-08-31 Thread Magnus Karlsson

With certain gcc versions, it was possible to get the warning
"'tx_desc' may be used uninitialized in this function" for
i40e_xmit_zc. The uninitialized use could not actually happen, but this
commit simplifies the code path so that the warning is no longer emitted.
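
For illustration only, a minimal userspace sketch of the pattern the diff
below applies: a pointer initialized to NULL doubles as the "did any work"
flag, so no separate counter is needed and the compiler can see that the
pointer is set before use. struct desc, DESC_FLAG_IRQ and fill_ring() are
hypothetical names, not driver code.

```c
#include <assert.h>
#include <stddef.h>

struct desc {                   /* hypothetical descriptor */
    unsigned int flags;
};

#define DESC_FLAG_IRQ 0x1u      /* hypothetical "raise interrupt" flag */

/*
 * Fill descriptors for up to budget frames and flag the last one filled.
 * Returns the number of descriptors used.
 */
static size_t fill_ring(struct desc *ring, size_t ring_len,
                        size_t frames, size_t budget)
{
    struct desc *last = NULL;   /* stays NULL if the loop never runs */
    size_t used = 0;

    while (used < ring_len && used < frames && used < budget) {
        last = &ring[used];
        last->flags = 0;
        used++;
    }

    if (last)                   /* replaces a separate "count > 0" test */
        last->flags |= DESC_FLAG_IRQ;

    return used;
}
```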

Signed-off-by: Magnus Karlsson 
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 94947a8..41ca7e1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -668,9 +668,8 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int 
budget)
  **/
 static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 {
-   unsigned int total_packets = 0;
+   struct i40e_tx_desc *tx_desc = NULL;
struct i40e_tx_buffer *tx_bi;
-   struct i40e_tx_desc *tx_desc;
bool work_done = true;
dma_addr_t dma;
u32 len;
@@ -697,14 +696,13 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, 
unsigned int budget)
build_ctob(I40E_TX_DESC_CMD_ICRC
   | I40E_TX_DESC_CMD_EOP,
   0, len, 0);
-   total_packets++;
 
xdp_ring->next_to_use++;
if (xdp_ring->next_to_use == xdp_ring->count)
xdp_ring->next_to_use = 0;
}
 
-   if (total_packets > 0) {
+   if (tx_desc) {
/* Request an interrupt for the last frame and bump tail ptr. */
tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
 I40E_TXD_QW1_CMD_SHIFT);
-- 
2.7.4

[PATCH bpf-next v2 0/2] xsk: misc code cleanup

2018-08-31 Thread Magnus Karlsson

This patch set cleans up two code style issues with the xsk zero-copy
code. The resulting code is smaller and simpler.

Changes from v1:

* Fixed bisectability problem reported by Daniel Borkmann by squashing
  the two last patches into one.

Patch 1: Removes a potential compiler warning reported by the Intel
 0-DAY kernel test infrastructure.
Patch 2: Removes the xdp_umem_props structure. At some point, it
 was used to break a dependency, but the members are these
 days much better off in the xdp_umem since the dependency
 does not exist anymore. Also adapts the i40e driver to this
 new interface.

I based this patch set on bpf-next commit 9c4f39811db8 ("samples/bpf:
xdpsock, minor fixes")

Thanks: Magnus

Magnus Karlsson (2):
  i40e: fix possible compiler warning in xsk TX path
  xsk: i40e: get rid of useless struct xdp_umem_props

 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 10 --
 include/net/xdp_sock.h |  8 ++--
 net/xdp/xdp_umem.c |  4 ++--
 net/xdp/xdp_umem_props.h   | 14 --
 net/xdp/xsk.c  | 10 ++
 net/xdp/xsk_queue.c|  5 +++--
 net/xdp/xsk_queue.h| 13 +++--
 7 files changed, 24 insertions(+), 40 deletions(-)
 delete mode 100644 net/xdp/xdp_umem_props.h

--
2.7.4

[PATCH v2] xfrm6: call kfree_skb when skb is toobig

2018-08-31 Thread Thadeu Lima de Souza Cascardo

After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching
and reporting on xmit"), some too big skbs might be potentially passed down to
__xfrm6_output, causing it to fail to transmit but not free the skb, causing a
leak of skb, and consequently a leak of dst references.

After running pmtu.sh, that shows as failure to unregister devices in a namespace:

[  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage count = 1

The fix is to call kfree_skb in case of transmit failures.
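
For illustration only, a minimal userspace model of the ownership rule
behind this fix: once a buffer has been handed to the output function, that
function must release it on every path, including the error return.
struct buf, buf_alloc(), buf_free(), output() and the live_bufs counter are
hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int live_bufs;           /* buffers allocated but not yet freed */

struct buf {                    /* hypothetical stand-in for an skb */
    size_t len;
};

static struct buf *buf_alloc(size_t len)
{
    struct buf *b = malloc(sizeof(*b));

    if (b) {
        b->len = len;
        live_bufs++;
    }
    return b;
}

static void buf_free(struct buf *b)
{
    if (b) {
        free(b);
        live_bufs--;
    }
}

/* Takes ownership of b in all cases, mirroring the convention the fix relies on. */
static int output(struct buf *b, size_t mtu)
{
    if (b->len > mtu) {
        buf_free(b);            /* without this, the error path leaks b */
        return -EMSGSIZE;
    }

    buf_free(b);                /* stands in for handing b to the next layer */
    return 0;
}
```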

Signed-off-by: Thadeu Lima de Souza Cascardo 
Reviewed-by: Sabrina Dubroca 
Fixes: dd767856a36e ("xfrm6: Don't call icmpv6_send on local error")
---
 net/ipv6/xfrm6_output.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/xfrm6_output.c b/net/ipv6/xfrm6_output.c
index 5959ce9620eb..6a74080005cf 100644
--- a/net/ipv6/xfrm6_output.c
+++ b/net/ipv6/xfrm6_output.c
@@ -170,9 +170,11 @@ static int __xfrm6_output(struct net *net, struct sock 
*sk, struct sk_buff *skb)
 
if (toobig && xfrm6_local_dontfrag(skb)) {
xfrm6_local_rxpmtu(skb, mtu);
+   kfree_skb(skb);
return -EMSGSIZE;
} else if (!skb->ignore_df && toobig && skb->sk) {
xfrm_local_error(skb, mtu);
+   kfree_skb(skb);
return -EMSGSIZE;
}
 
-- 
2.17.1

[PATCH RFC] net/mlx5_en: switch to Toeplitz RSS hash by default

2018-08-31 Thread Konstantin Khlebnikov

XOR (MLX5_RX_HASH_FN_INVERTED_XOR8) gives only 8 bits.
That does not seem to be enough for RFS. All other drivers use Toeplitz.

Driver mlx4_en uses Toeplitz by default and warns if hash XOR is used
together with NETIF_F_RXHASH (enabled by default too): "Enabling both
XOR Hash function and RX Hashing can limit RPS functionality".

XOR is default in mlx5_en since commit 2be6967cdbc9
("net/mlx5e: Support ETH_RSS_HASH_XOR").

The hash function can be set via ethtool, but it would be nice to have a
single standard across drivers, or a proper description of why this one is
special.

Signed-off-by: Konstantin Khlebnikov 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 5a7939e70190..def9fb5dcbff 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4558,7 +4558,7 @@ void mlx5e_build_nic_params(struct mlx5_core_dev *mdev,
params->tx_min_inline_mode = mlx5e_params_calculate_tx_min_inline(mdev);
 
/* RSS */
-   params->rss_hfunc = ETH_RSS_HASH_XOR;
+   params->rss_hfunc = ETH_RSS_HASH_TOP;
netdev_rss_key_fill(params->toeplitz_hash_key, 
sizeof(params->toeplitz_hash_key));
mlx5e_build_default_indir_rqt(params->indirection_rqt,
  MLX5E_INDIR_RQT_SIZE, max_channels);
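
For readers unfamiliar with the hash being selected here, the following is a minimal user-space sketch of a Toeplitz hash, which yields a full 32-bit value. It is a generic bit-serial formulation, not the mlx5 or kernel implementation; the names toeplitz_hash() and key_bit() are made up for the example, and the key is assumed to be at least 4 bytes long.

```c
#include <stddef.h>
#include <stdint.h>

/* Return bit number `pos` of the key, counting from the most significant
 * bit of key[0]; positions past the end of the key read as zero.
 */
static uint32_t key_bit(const uint8_t *key, size_t key_len, size_t pos)
{
	if (pos >= key_len * 8)
		return 0;
	return (key[pos / 8] >> (7 - pos % 8)) & 1u;
}

/* Toeplitz hash over `len` input bytes. The key needs at least 4 bytes. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
			      const uint8_t *data, size_t len)
{
	/* 32-bit window of the key aligned with the current input bit */
	uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
			  ((uint32_t)key[2] << 8) | (uint32_t)key[3];
	uint32_t hash = 0;
	size_t next = 32;	/* next key bit to shift into the window */

	for (size_t i = 0; i < len; i++) {
		for (int bit = 7; bit >= 0; bit--) {
			/* every set input bit XORs in the current key window */
			if (data[i] & (1u << bit))
				hash ^= window;
			window = (window << 1) | key_bit(key, key_len, next++);
		}
	}
	return hash;
}
```

Because each set input bit contributes a different 32-bit slice of the key, the result is spread over 32 bits rather than 8.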

[PATCH net-next v2 2/2] netlink: ipv6 MLD join notifications

2018-08-31 Thread Patrick Ruddy

Some userspace applications need to know about MLD joins from the
kernel for 2 reasons:
1. To allow the programming of multicast MAC filters in hardware
2. To form a multicast FORUS list for non link-local multicast
   groups to be sent to the kernel and from there to the interested
   party.
(1) can be fulfilled by simply sending the hardware multicast MAC
address to be programmed, but (2) requires the L3 address to be sent,
since this cannot be constructed from the MAC address, whereas the
reverse translation is a standard library function.

This commit provides addition and deletion of multicast addresses
using the RTM_NEWADDR and RTM_DELADDR messages. It also provides
the RTM_GETADDR extension to allow multicast join state to be read
from the kernel.

Signed-off-by: Patrick Ruddy 
---
v2: fix kbuild issues.

 net/ipv6/addrconf.c | 44 +-
 net/ipv6/mcast.c| 66 +
 2 files changed, 98 insertions(+), 12 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index d51a8c0b3372..0b609c7897b4 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4855,11 +4855,13 @@ static int inet6_fill_ifaddr(struct sk_buff *skb, 
struct inet6_ifaddr *ifa,
 }
 
 static int inet6_fill_ifmcaddr(struct sk_buff *skb, struct ifmcaddr6 *ifmca,
-   u32 portid, u32 seq, int event, u16 flags)
+   u32 portid, u32 seq, int event, u16 flags)
 {
struct nlmsghdr  *nlh;
u8 scope = RT_SCOPE_UNIVERSE;
int ifindex = ifmca->idev->dev->ifindex;
+   int addr_type = (event == RTM_GETMULTICAST) ? IFA_MULTICAST :
+   IFA_ADDRESS;
 
if (ipv6_addr_scope(&ifmca->mca_addr) & IFA_SITE)
scope = RT_SCOPE_SITE;
@@ -4869,7 +4871,7 @@ static int inet6_fill_ifmcaddr(struct sk_buff *skb, 
struct ifmcaddr6 *ifmca,
return -EMSGSIZE;
 
put_ifaddrmsg(nlh, 128, IFA_F_PERMANENT, scope, ifindex);
-   if (nla_put_in6_addr(skb, IFA_MULTICAST, &ifmca->mca_addr) < 0 ||
+   if (nla_put_in6_addr(skb, addr_type, &ifmca->mca_addr) < 0 ||
put_cacheinfo(skb, ifmca->mca_cstamp, ifmca->mca_tstamp,
  INFINITY_LIFE_TIME, INFINITY_LIFE_TIME) < 0) {
nlmsg_cancel(skb, nlh);
@@ -4916,7 +4918,7 @@ enum addr_type_t {
 /* called with rcu_read_lock() */
 static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
  struct netlink_callback *cb, enum addr_type_t type,
- int s_ip_idx, int *p_ip_idx)
+ int s_ip_idx, int *p_ip_idx, int msg_type)
 {
struct ifmcaddr6 *ifmca;
struct ifacaddr6 *ifaca;
@@ -4935,7 +4937,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct 
sk_buff *skb,
err = inet6_fill_ifaddr(skb, ifa,
NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq,
-   RTM_NEWADDR,
+   msg_type,
NLM_F_MULTI);
if (err < 0)
break;
@@ -4952,7 +4954,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct 
sk_buff *skb,
err = inet6_fill_ifmcaddr(skb, ifmca,
  NETLINK_CB(cb->skb).portid,
  cb->nlh->nlmsg_seq,
- RTM_GETMULTICAST,
+ msg_type,
  NLM_F_MULTI);
if (err < 0)
break;
@@ -4967,7 +4969,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct 
sk_buff *skb,
err = inet6_fill_ifacaddr(skb, ifaca,
  NETLINK_CB(cb->skb).portid,
  cb->nlh->nlmsg_seq,
- RTM_GETANYCAST,
+ msg_type,
  NLM_F_MULTI);
if (err < 0)
break;
@@ -4982,7 +4984,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct 
sk_buff *skb,
 }
 
 static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
-  enum addr_type_t type)
+  enum addr_type_t type, int msg_type)
 {
struct net *net = sock_net(skb->sk);
int h, s_h;
@@ -5012,7 +5014,7 @@ static int inet6_dump_addr(struct sk_buff *skb, struct 
netlink_callback *cb,
goto cont;
 
if (in6_dump_addrs(idev, skb, cb, type,
-
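
As an illustration of why the L3 address cannot be rebuilt from the MAC address, here is a small sketch of the IPv6 multicast to Ethernet mapping from RFC 2464 (33:33 followed by the low 32 bits of the group address). The function name ipv6_mcast_to_mac() is made up for the example and is not part of the patch.

```c
#include <stdint.h>
#include <string.h>

/* Map a 16-byte IPv6 multicast address to its Ethernet multicast MAC:
 * 33:33 followed by the low 32 bits of the address (RFC 2464).
 */
static void ipv6_mcast_to_mac(const uint8_t addr[16], uint8_t mac[6])
{
	mac[0] = 0x33;
	mac[1] = 0x33;
	memcpy(&mac[2], &addr[12], 4);	/* upper 96 bits are discarded */
}
```

Since only 32 of the 128 address bits survive, groups such as ff02::1 and ff05::1 map to the same MAC address, so the translation only works in the L3-to-MAC direction.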

[PATCH net-next v2 1/2] netlink: ipv4 igmp join notifications

2018-08-31 Thread Patrick Ruddy

Some userspace applications need to know about IGMP joins from the kernel
for 2 reasons:
1. To allow the programming of multicast MAC filters in hardware
2. To form a multicast FORUS list for non link-local multicast
   groups to be sent to the kernel and from there to the interested
   party.
(1) can be fulfilled by simply sending the hardware multicast MAC
address to be programmed, but (2) requires the L3 address to be sent,
since this cannot be constructed from the MAC address, whereas the
reverse translation is a standard library function.

This commit provides addition and deletion of multicast addresses
using the RTM_NEWADDR and RTM_DELADDR messages. It also provides
the RTM_GETADDR extension to allow multicast join state to be read
from the kernel.

Signed-off-by: Patrick Ruddy 
---
v2: fix kbuild warnings.

 include/linux/igmp.h |  4 ++
 net/ipv4/devinet.c   | 39 +--
 net/ipv4/igmp.c  | 90 
 3 files changed, 122 insertions(+), 11 deletions(-)

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index 119f53941c12..644a548024ed 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -19,6 +19,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 static inline struct igmphdr *igmp_hdr(const struct sk_buff *skb)
@@ -130,6 +132,8 @@ extern void ip_mc_unmap(struct in_device *);
 extern void ip_mc_remap(struct in_device *);
 extern void ip_mc_dec_group(struct in_device *in_dev, __be32 addr);
 extern void ip_mc_inc_group(struct in_device *in_dev, __be32 addr);
+extern int ip_mc_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb,
+struct net_device *dev);
 int ip_mc_check_igmp(struct sk_buff *skb, struct sk_buff **skb_trimmed);
 
 #endif
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index ea4bd8a52422..42f7dcc4fb5e 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -57,6 +57,7 @@
 #endif
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1651,6 +1652,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
netlink_callback *cb)
int h, s_h;
int idx, s_idx;
int ip_idx, s_ip_idx;
+   int multicast, mcast_idx;
struct net_device *dev;
struct in_device *in_dev;
struct in_ifaddr *ifa;
@@ -1659,6 +1661,8 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
netlink_callback *cb)
s_h = cb->args[0];
s_idx = idx = cb->args[1];
s_ip_idx = ip_idx = cb->args[2];
+   multicast = cb->args[3];
+   mcast_idx = cb->args[4];
 
for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
idx = 0;
@@ -1675,18 +1679,29 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
netlink_callback *cb)
if (!in_dev)
goto cont;
 
-   for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
-ifa = ifa->ifa_next, ip_idx++) {
-   if (ip_idx < s_ip_idx)
-   continue;
-   if (inet_fill_ifaddr(skb, ifa,
-NETLINK_CB(cb->skb).portid,
-cb->nlh->nlmsg_seq,
-RTM_NEWADDR, NLM_F_MULTI) < 0) {
-   rcu_read_unlock();
-   goto done;
+   if (!multicast) {
+   for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
+ifa = ifa->ifa_next, ip_idx++) {
+   if (ip_idx < s_ip_idx)
+   continue;
+   if (inet_fill_ifaddr(skb, ifa,
+
NETLINK_CB(cb->skb).portid,
+cb->nlh->nlmsg_seq,
+RTM_NEWADDR,
+NLM_F_MULTI) < 0) {
+   rcu_read_unlock();
+   goto done;
+   }
+   nl_dump_check_consistent(cb,
+
nlmsg_hdr(skb));
}
-   nl_dump_check_consistent(cb, nlmsg_hdr(skb));
+   /* set for multicast loop */
+   multicast++;
+   }
+   /* loop over multicast addresses */
+   if (ip_mc_dump_ifaddr(skb, cb, dev) < 0) {
+   rcu_read_unlock();
+   goto done;
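
The IPv4 case of the "reverse translation" mentioned in the commit message can be sketched as follows, using the RFC 1112 mapping (01:00:5e followed by the low 23 bits of the group address). The helper name ipv4_mcast_to_mac() is hypothetical, and the group address is assumed to be passed in host byte order.

```c
#include <stdint.h>

/* Map an IPv4 multicast group (host byte order) to its Ethernet MAC:
 * 01:00:5e followed by the low 23 bits of the group address (RFC 1112).
 */
static void ipv4_mcast_to_mac(uint32_t group, uint8_t mac[6])
{
	mac[0] = 0x01;
	mac[1] = 0x00;
	mac[2] = 0x5e;
	mac[3] = (group >> 16) & 0x7f;	/* top bit of this byte is always 0 */
	mac[4] = (group >> 8) & 0xff;
	mac[5] = group & 0xff;
}
```

For example, 224.1.1.1 and 225.129.1.1 both map to 01:00:5e:01:01:01, which is why a MAC filter entry alone does not identify the joined group.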

Re: [PATCH bpf-next] samples/bpf: xdpsock, minor fixes

2018-08-31 Thread Daniel Borkmann

On 08/31/2018 03:00 AM, Prashant Bhole wrote:
> - xsks_map size was fixed to 4, changed it to MAX_SOCKS
> - Remove redundant definition of MAX_SOCKS in xdpsock_user.c
> - In dump_stats(), add NULL check for xsks[i]
> 
> Signed-off-by: Prashant Bhole 

Applied, thanks Prashant!

Re: [PATCH bpf-next] xsk: remove unnecessary assignment

2018-08-31 Thread Daniel Borkmann

On 08/31/2018 02:59 AM, Prashant Bhole wrote:
> Since xdp_umem_query() was added one assignment of bpf.command was
> missed from cleanup. Removing the assignment statement.
> 
> Fixes: 84c6b86875e01a0 ("xsk: don't allow umem replace at stack level")
> Signed-off-by: Prashant Bhole 

Applied, thanks Prashant!

[PATCH net-next 1/1] qed: Lower the severity of a dcbx log message.

2018-08-31 Thread Sudarsana Reddy Kalluru

Driver displays an error message for each unrecognized dcbx TLV that's
received from the peer or configured on the device. It is observed that
syslog will be flooded with such messages in certain scenarios e.g.,
frequent link-flaps/lldp-transactions. Changing the severity of this
message to verbose level as it's not an error scenario/message.

Signed-off-by: Sudarsana Reddy Kalluru 
---
 drivers/net/ethernet/qlogic/qed/qed_dcbx.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c 
b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
index 6bb76e6..6ce9a76 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
@@ -253,8 +253,9 @@ static bool qed_dcbx_roce_v2_tlv(u32 app_info_bitmap, u16 
proto_id, bool ieee)
*type = DCBX_PROTOCOL_ROCE_V2;
} else {
*type = DCBX_MAX_PROTOCOL_TYPE;
-   DP_ERR(p_hwfn, "No action required, App TLV entry = 0x%x\n",
-  app_prio_bitmap);
+   DP_VERBOSE(p_hwfn, QED_MSG_DCB,
+  "No action required, App TLV entry = 0x%x\n",
+  app_prio_bitmap);
return false;
}
 
-- 
1.8.3.1

Re: [PATCH bpf-next 3/3] i40e: adapt driver to new xdp_umem structure

2018-08-31 Thread Daniel Borkmann

On 08/30/2018 03:56 PM, Magnus Karlsson wrote:
> The struct xdp_umem_props was removed in the xsk code and this commit
> adapts the i40e af_xdp zero-copy driver code to the new xdp_umem
> structure.
> 
> Signed-off-by: Magnus Karlsson 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
> b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index 41ca7e1..2ebfc78 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -442,7 +442,7 @@ static void i40e_reuse_rx_buffer_zc(struct i40e_ring 
> *rx_ring,
>   struct i40e_rx_buffer *old_bi)
>  {
>   struct i40e_rx_buffer *new_bi = &rx_ring->rx_bi[rx_ring->next_to_alloc];
> - unsigned long mask = (unsigned long)rx_ring->xsk_umem->props.chunk_mask;
> + unsigned long mask = (unsigned long)rx_ring->xsk_umem->chunk_mask;

That is buggy indeed, kbuild bot complained that this hurts bisectability
because you removed struct xdp_umem_props props from umem in prior patch.
Please respin the full series with this one squashed into prior commit instead
and I'll toss the old one from bpf-next tree.

>   u64 hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
>   u16 nta = rx_ring->next_to_alloc;
>  
> @@ -477,7 +477,7 @@ void i40e_zca_free(struct zero_copy_allocator *alloc, 
> unsigned long handle)
>  
>   rx_ring = container_of(alloc, struct i40e_ring, zca);
>   hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
> - mask = rx_ring->xsk_umem->props.chunk_mask;
> + mask = rx_ring->xsk_umem->chunk_mask;
>  
>   nta = rx_ring->next_to_alloc;
>   bi = &rx_ring->rx_bi[nta];
>

Re: [PATCH RFC net-next 00/11] udp gso

2018-08-31 Thread Eric Dumazet

On 08/31/2018 02:09 AM, Paolo Abeni wrote:
> I hope quic can leverage such scenario, but I
> really know nothing about the protocol.
>

Most QUIC receivers are mobile phones, laptops, with wifi without GRO anyway...

Even if they had GRO, the inter-packet delay would be too high for GRO to be 
successful.

(vast majority of QUIC flows are < 100 Mbits because of the last mile hop)

GSO UDP is used on servers with clear gains, but there are not
really high speed receivers where GRO could be used.

Re: [PATCH RFC net-next 00/11] udp gso

2018-08-31 Thread Paolo Abeni

Hi,

On Tue, 2018-04-17 at 17:07 -0400, Willem de Bruijn wrote:
> That said, for negotiated flows an inverse GRO feature could
> conceivably be implemented to reduce rx stack traversal, too.
> Though due to interleaving of packets on the wire, aggregation
> would be best effort, similar to TCP TSO and GRO using the
> PSH bit as packetization signal.

Reviving this old thread before I forget again. I have some local
patches implementing UDP GRO as the dual of the current GSO_UDP_L4
implementation: several datagrams with the same length are aggregated
into a single one, and user space receives a single larger packet
instead of multiple ones. I hope QUIC can leverage such a scenario, but I
really know nothing about the protocol.

I measure roughly a 50% performance improvement with udpgso_bench
relative to UDP GSO, and ~100% using a pktgen sender, along with reduced
CPU usage on the receiver[1]. Some additional hacking on the general GRO
bits is required to avoid useless socket lookups for ingress UDP
packets when UDP_GSO is not enabled.

If there is interest in this topic, I can share some RFC patches
(hopefully sometime next week).

Cheers,

Paolo

[1] With udpgso_bench_tx, the bottleneck is again the sender, even
with GSO enabled. With a pktgen sender, the bottleneck becomes the rx
softirqd, and I see a lot of time consumed due to retpolines in the GRO
code. In both scenarios skb_release_data() becomes the topmost perf
offender for the user space process.
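
To make the aggregation idea concrete, here is a hedged sketch of how a receiver could walk a coalesced buffer. It assumes the same layout UDP GSO uses on the send side, where every datagram has the same size except possibly the last one; the helper names segment_count() and segment_len() are made up, and nothing here reflects the actual RFC patches.

```c
#include <stddef.h>

/* Number of datagrams packed into a coalesced buffer of `total` bytes when
 * every datagram except possibly the last is `seg` bytes long.
 */
static size_t segment_count(size_t total, size_t seg)
{
	if (seg == 0)
		return 0;
	return (total + seg - 1) / seg;
}

/* Length of datagram number `idx` (0-based) within such a buffer,
 * or 0 if `idx` is past the end.
 */
static size_t segment_len(size_t total, size_t seg, size_t idx)
{
	size_t off = idx * seg;

	if (seg == 0 || off >= total)
		return 0;
	return total - off < seg ? total - off : seg;
}
```

With a 4000-byte buffer and 1472-byte segments this gives three datagrams of 1472, 1472 and 1056 bytes, delivered to user space in a single receive call instead of three.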

Re: [PATCH v2 1/2] IB/ipoib: Use dev_port to expose network interface port numbers

2018-08-31 Thread Arseny Maslennikov

On Thu, Aug 30, 2018 at 04:17:58PM -0400, Doug Ledford wrote:
> On Thu, 2018-08-30 at 21:22 +0300, Arseny Maslennikov wrote:
> > Some InfiniBand network devices have multiple ports on the same PCI
> > function. This initializes the `dev_port' sysfs field of those
> > network interfaces with their port number.
> > 
> > Prior to this the kernel erroneously used the `dev_id' sysfs
> > field of those network interfaces to convey the port number to userspace.
> > 
> > The use of `dev_id' was considered correct until Linux 3.15,
> > when another field, `dev_port', was defined for this particular
> > purpose and `dev_id' was reserved for distinguishing stacked ifaces
> > (e.g: VLANs) with the same hardware address as their parent device.
> > 
> > Similar fixes to net/mlx4_en and many other drivers, which started
> > exporting this information through `dev_id' before 3.15, were accepted
> > into the kernel 4 years ago.
> > See 76a066f2a2a0 (`net/mlx4_en: Expose port number through sysfs').
> > 
> > Signed-off-by: Arseny Maslennikov 
> > ---
> >  drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
> > b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > index e3d28f9ad9c0..ba16a63ee303 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > @@ -1880,7 +1880,7 @@ static int ipoib_parent_init(struct net_device *ndev)
> >sizeof(union ib_gid));
> >  
> > SET_NETDEV_DEV(priv->dev, priv->ca->dev.parent);
> > -   priv->dev->dev_id = priv->port - 1;
> > +   priv->dev->dev_port = priv->port - 1;
> 
> I don't know that we can do this.  At least not yet.  Expose the new
> item to make us compliant with the new docs, and deprecate the old sysfs
> item, but we can't just yank the old item.  Existing tools/scripts might
> (probably) rely on it (existing tools already special case IPoIB
> interfaces and we'll need to make sure they don't special case this
> element too).

I'm good with keeping both items for a (probably long) while to not break
things. But how exactly should we notify users of the deprecation, so they
don't special case this again? A comment in the code seems too little —
everyone's obviously too busy to look there and stumble upon that.
A distinct notice in the doc seems too much. I can't think of another place
for the deprecation notice where people would take note of it, however.

Anyway: would it be OK to just restore both items and put a small note in
dev_id's doc entry? If yes, I'll then send a v3.



[PATCH net] ip6_tunnel: respect ttl inherit for ip6tnl

2018-08-31 Thread Hangbin Liu

The ttl section of man ip-tunnel says:
0 is a special value meaning that packets inherit the TTL value.

IPv4 tunnels respect this in ip_tunnel_xmit(), but IPv6 tunnels do not
implement it yet. To make IPv6 tunnels behave consistently with IPv4 tunnels,
add TTL inherit support to IPv6 tunnels.

Signed-off-by: Hangbin Liu 
---
 net/ipv6/ip6_tunnel.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 5df2a58..419960b 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1188,7 +1188,15 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
init_tel_txopt(&opt, encap_limit);
ipv6_push_frag_opts(skb, &opt.ops, &proto);
}
-   hop_limit = hop_limit ? : ip6_dst_hoplimit(dst);
+
+   if (hop_limit == 0) {
+   if (skb->protocol == htons(ETH_P_IP))
+   hop_limit = ip_hdr(skb)->ttl;
+   else if (skb->protocol == htons(ETH_P_IPV6))
+   hop_limit = ipv6_hdr(skb)->hop_limit;
+   else
+   hop_limit = ip6_dst_hoplimit(dst);
+   }
 
/* Calculate max headroom for all the headers and adjust
 * needed_headroom if necessary.
-- 
2.5.5
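
The selection logic added by this patch can be restated as a small stand-alone function. This is only a user-space paraphrase of the hunk above: the enum inner_proto and the function pick_hop_limit() are hypothetical names, and the inner TTL and route default are passed in as plain values instead of being read from the skb and dst.

```c
#include <stdint.h>

enum inner_proto { INNER_IPV4, INNER_IPV6, INNER_OTHER };

/* Choose the outer hop limit for a tunnelled packet.
 * configured:    tunnel's configured hop limit, 0 meaning "inherit"
 * inner_ttl:     TTL or hop limit of the inner IPv4/IPv6 header
 * route_default: fallback taken from the route (ip6_dst_hoplimit() in the patch)
 */
static uint8_t pick_hop_limit(uint8_t configured, enum inner_proto proto,
			      uint8_t inner_ttl, uint8_t route_default)
{
	if (configured != 0)
		return configured;	/* explicit value always wins */
	if (proto == INNER_IPV4 || proto == INNER_IPV6)
		return inner_ttl;	/* inherit from the inner header */
	return route_default;		/* non-IP payload: nothing to inherit */
}
```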

Re: [PATCH bpf-next] tools/bpf: bpftool, add xskmap in map types

2018-08-31 Thread Jakub Kicinski

On Fri, 31 Aug 2018 15:32:42 +0900, Prashant Bhole wrote:
> When listing all maps, bpftool currently shows (null) for xskmap.
> Add the xskmap type to map_type_name[] to show the correct type.
> 
> Signed-off-by: Prashant Bhole 

Acked-by: Jakub Kicinski 

Thank you!  I feel tempted to suggest considering the bpf tree, but
perhaps that's a stretch..

> diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
> index df175bc33c5d..9c55077ca5dd 100644
> --- a/tools/bpf/bpftool/map.c
> +++ b/tools/bpf/bpftool/map.c
> @@ -68,6 +68,7 @@ static const char * const map_type_name[] = {
>   [BPF_MAP_TYPE_DEVMAP]   = "devmap",
>   [BPF_MAP_TYPE_SOCKMAP]  = "sockmap",
>   [BPF_MAP_TYPE_CPUMAP]   = "cpumap",
> + [BPF_MAP_TYPE_XSKMAP]   = "xskmap",
>   [BPF_MAP_TYPE_SOCKHASH] = "sockhash",
>   [BPF_MAP_TYPE_CGROUP_STORAGE]   = "cgroup_storage",
>  };
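
The "(null)" output comes from a gap in a designated-initializer table: any slot that is not named stays NULL. The sketch below reproduces that situation with made-up enum values and a made-up map_name() helper that falls back to "unknown"; it is not bpftool's code.

```c
#include <stddef.h>

/* Hypothetical map types for the sketch; not the BPF_MAP_TYPE_* values. */
enum map_type { MAP_HASH, MAP_ARRAY, MAP_XSK, MAP_SOCKHASH, MAP_TYPE_MAX };

static const char * const map_type_name[] = {
	[MAP_HASH]	= "hash",
	[MAP_ARRAY]	= "array",
	/* MAP_XSK is deliberately left out, so its slot is NULL */
	[MAP_SOCKHASH]	= "sockhash",
};

/* Look up a printable name, guarding against both out-of-range values
 * and holes in the table.
 */
static const char *map_name(unsigned int type)
{
	size_t n = sizeof(map_type_name) / sizeof(map_type_name[0]);

	if (type < n && map_type_name[type])
		return map_type_name[type];
	return "unknown";
}
```

Passing the NULL slot straight to printf("%s") is what glibc renders as "(null)"; adding the missing table entry, as the patch does, removes the hole.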

Re: [PATCH net] ebpf: fix bpf_msg_pull_data

2018-08-31 Thread Tushar Dave





On 08/30/2018 12:20 AM, Daniel Borkmann wrote:

On 08/30/2018 02:21 AM, Tushar Dave wrote:

On 08/29/2018 05:07 PM, Tushar Dave wrote:

While doing some preliminary testing it is found that bpf helper
bpf_msg_pull_data does not calculate the data and data_end offset
correctly. Fix it!

Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Tushar Dave 
Acked-by: Sowmini Varadhan 
---
   net/core/filter.c | 38 +-
   1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index c25eb36..3eeb3d6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2285,7 +2285,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
   BPF_CALL_4(bpf_msg_pull_data,
  struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
   {
-    unsigned int len = 0, offset = 0, copy = 0;
+    unsigned int len = 0, offset = 0, copy = 0, off = 0;
   struct scatterlist *sg = msg->sg_data;
   int first_sg, last_sg, i, shift;
   unsigned char *p, *to, *from;
@@ -2299,22 +2299,30 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
*msg)
   i = msg->sg_start;
   do {
   len = sg[i].length;
-    offset += len;
   if (start < offset + len)
   break;
+    offset += len;
   i++;
   if (i == MAX_SKB_FRAGS)
   i = 0;
-    } while (i != msg->sg_end);
+    } while (i <= msg->sg_end);


I don't think this condition is correct, msg operates as a scatterlist ring,
so sg_end may very well be < current i when there's a wrap-around in the
traversal ... more below.
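
A small user-space sketch of the wrap-around point: the ring below uses plain arrays instead of a scatterlist, RING_SIZE stands in for MAX_SKB_FRAGS, and ring_find() is a made-up helper. It assumes the ring holds at least one valid slot.

```c
#include <stddef.h>

#define RING_SIZE 8	/* stand-in for MAX_SKB_FRAGS; value chosen for the sketch */

/* Find the ring slot whose data contains byte offset `start`.
 * lens[] holds per-slot lengths; valid slots run from `first` up to but
 * not including `end`, wrapping past RING_SIZE - 1 back to 0.
 * On success returns the slot and stores the offset of its first byte in
 * *base; returns -1 if `start` is out of range.
 */
static int ring_find(const unsigned int lens[RING_SIZE], int first, int end,
		     unsigned int start, unsigned int *base)
{
	unsigned int offset = 0;
	int i = first;

	do {
		if (start < offset + lens[i]) {
			*base = offset;
			return i;
		}
		offset += lens[i];
		if (++i == RING_SIZE)
			i = 0;
	} while (i != end);	/* "i <= end" would stop early after a wrap */

	return -1;
}
```

With first = 6 and end = 2 the valid slots are 6, 7, 0, 1. A "less than or equal" loop condition would already fail at i = 7, while the inequality test walks all four slots. The data pointer then follows as slot address plus start minus *base.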


I'm wondering then how this is supposed to work in case the sg list is not
a ring! For RDS, we have an sg list that is not a ring. More below.




   +    /* return error if start is out of range */
   if (unlikely(start >= offset + len))
   return -EINVAL;
   -    if (!msg->sg_copy[i] && bytes <= len)
-    goto out;
+    /* return error if i is last entry in sglist and end is out of range */
+    if (msg->sg_copy[i] && end > offset + len)
+    return -EINVAL;
     first_sg = i;
   +    /* if i is not last entry in sg list and end (i.e start + bytes) is
+ * within this sg[i] then goto out and calculate data and data_end
+ */
+    if (!msg->sg_copy[i] && end <= offset + len)
+    goto out;
+
   /* At this point we need to linearize multiple scatterlist
    * elements or a single shared page. Either way we need to
    * copy into a linear buffer exclusively owned by BPF. Then
@@ -2330,9 +2338,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
   i++;
   if (i == MAX_SKB_FRAGS)
   i = 0;
-    if (bytes < copy)
+    if (end < copy)
   break;
-    } while (i != msg->sg_end);
+    } while (i <= msg->sg_end);
+
+    /* return error if i is last entry in sglist and end is out of range */
+    if (i > msg->sg_end && end > offset + copy)
+    return -EINVAL;
+
   last_sg = i;
     if (unlikely(copy < end - start))
@@ -2342,23 +2355,22 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
*msg)
   if (unlikely(!page))
   return -ENOMEM;
   p = page_address(page);
-    offset = 0;
     i = first_sg;
   do {
   from = sg_virt(&sg[i]);
   len = sg[i].length;
-    to = p + offset;
+    to = p + off;
     memcpy(to, from, len);
-    offset += len;
+    off += len;
   sg[i].length = 0;
   put_page(sg_page(&sg[i]));
     i++;
   if (i == MAX_SKB_FRAGS)
   i = 0;
-    } while (i != last_sg);
+    } while (i < last_sg);
     sg[first_sg].length = copy;
   sg_set_page(&sg[first_sg], page, copy, 0);
@@ -2380,7 +2392,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
   else
   move_from = i + shift;
   -    if (move_from == msg->sg_end)
+    if (move_from > msg->sg_end)
   break;
     sg[i] = sg[move_from];
@@ -2396,7 +2408,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
   if (msg->sg_end < 0)
   msg->sg_end += MAX_SKB_FRAGS;
   out:
-    msg->data = sg_virt(&sg[i]) + start - offset;
+    msg->data = sg_virt(&sg[first_sg]) + start - offset;
   msg->data_end = msg->data + bytes;
     return 0;



Please discard this patch. I just noticed that Daniel Borkmann sent some 
similar fixes for bpf_msg_pull_data.


Yeah I've been looking at these recently as well, please make sure you test
with the below fixes included to see if there's anything left:


I tested the latest net tree which has all the fixes you mentioned and I
am still seeing issues.

As I already mentioned before on RFC v3 thread, we need to be careful
reusing 'offset' while linearizing multiple scatterlist
elements.
Variable 'offset' is used to calculate 'msg->data',
i.e. "msg->data = sg_virt(&sg[first_sg]) + start - offset"

We

Re: [RFC] net: xsk: add a simple buffer reuse queue

2018-08-31 Thread Björn Töpel


On 2018-08-29 21:19, Jakub Kicinski wrote:

XSK UMEM is strongly single producer single consumer so reuse of
frames is challenging.  Add a simple "stash" of FILL packets to
reuse for drivers to optionally make use of.  This is useful
when driver has to free (ndo_stop) or resize a ring with an active
AF_XDP ZC socket.



I'll take a stab using this in i40e. I have a couple of
comments/thoughts on this RFC, but let me get back when I have an actual
patch in place. :-)


Thanks!
Björn



Signed-off-by: Jakub Kicinski 
---
  include/net/xdp_sock.h | 44 +
  net/xdp/xdp_umem.c |  2 ++
  net/xdp/xsk_queue.c| 56 ++
  net/xdp/xsk_queue.h|  3 +++
  4 files changed, 105 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 6871e4755975..108c1c100de4 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -14,6 +14,7 @@
  #include 
  
  struct net_device;

+struct xdp_umem_fq_reuse;
  struct xsk_queue;
  
  struct xdp_umem_props {

@@ -41,6 +42,7 @@ struct xdp_umem {
struct page **pgs;
u32 npgs;
struct net_device *dev;
+   struct xdp_umem_fq_reuse *fq_reuse;
u16 queue_id;
bool zc;
spinlock_t xsk_list_lock;
@@ -110,4 +112,46 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem 
*umem, u64 addr)
return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
  }
  
+struct xdp_umem_fq_reuse {

+   u32 nentries;
+   u32 length;
+   u64 handles[];
+};
+
+/* Following functions are not thread-safe in any way */
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries);
+struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
+ struct xdp_umem_fq_reuse *newq);
+void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
+
+/* Reuse-queue aware version of FILL queue helpers */
+static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   if (!rq->length) {
+   return xsk_umem_peek_addr(umem, addr);
+   } else {
+   *addr = rq->handles[rq->length - 1];
+   return addr;
+   }
+}
+
+static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   if (!rq->length)
+   xsk_umem_discard_addr(umem);
+   else
+   rq->length--;
+}
+
+static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   rq->handles[rq->length++] = addr;
+}
+
  #endif /* _LINUX_XDP_SOCK_H */
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index e762310c9bee..40303e24c954 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -170,6 +170,8 @@ static void xdp_umem_release(struct xdp_umem *umem)
umem->cq = NULL;
}
  
+	xsk_reuseq_destroy(umem);

+
xdp_umem_unpin_pages(umem);
  
  	task = get_pid_task(umem->pid, PIDTYPE_PID);

diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index 6c32e92e98fc..f9ee40a13a9a 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -3,7 +3,9 @@
   * Copyright(c) 2018 Intel Corporation.
   */
  
+#include 

  #include 
+#include 
  
  #include "xsk_queue.h"
  
@@ -61,3 +63,57 @@ void xskq_destroy(struct xsk_queue *q)

page_frag_free(q->ring);
kfree(q);
  }
+
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries)
+{
+   struct xdp_umem_fq_reuse *newq;
+
+   /* Check for overflow */
+   if (nentries > (u32)roundup_pow_of_two(nentries))
+   return NULL;
+   nentries = roundup_pow_of_two(nentries);
+
+   newq = kvmalloc(struct_size(newq, handles, nentries), GFP_KERNEL);
+   if (!newq)
+   return NULL;
+   memset(newq, 0, offsetof(typeof(*newq), handles));
+
+   newq->nentries = nentries;
+   return newq;
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_prepare);
+
+struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
+ struct xdp_umem_fq_reuse *newq)
+{
+   struct xdp_umem_fq_reuse *oldq = umem->fq_reuse;
+
+   if (!oldq) {
+   umem->fq_reuse = newq;
+   return NULL;
+   }
+
+   if (newq->nentries < oldq->length)
+   return newq;
+
+
+   memcpy(newq->handles, oldq->handles,
+  array_size(oldq->length, sizeof(u64)));
+   newq->length = oldq->length;
+
+   umem->fq_reuse = newq;
+   return oldq;
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_swap);
+
+void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq)
+{
+   kvfree(rq);
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_free);
+
+void xsk_reuseq_destroy(struct xdp_umem *umem)
+{
+   xsk_reuseq_free(umem->fq_reuse);
+   umem->fq_reuse = NULL;
+}
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index
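
The "stash" in this RFC behaves like a small LIFO stack of buffer handles. The following user-space sketch mirrors that idea with hypothetical names (struct reuse_queue, rq_alloc(), rq_push(), rq_pop()). Unlike xsk_umem_fq_reuse() above, it adds a capacity check on push; that check is an addition for the sketch, not something the RFC does.

```c
#include <stdint.h>
#include <stdlib.h>

struct reuse_queue {
	uint32_t nentries;	/* capacity */
	uint32_t length;	/* handles currently stashed */
	uint64_t handles[];
};

static struct reuse_queue *rq_alloc(uint32_t nentries)
{
	struct reuse_queue *rq;

	rq = calloc(1, sizeof(*rq) + (size_t)nentries * sizeof(uint64_t));
	if (rq)
		rq->nentries = nentries;
	return rq;
}

/* Stash a handle for later reuse; returns 0 on success, -1 if full. */
static int rq_push(struct reuse_queue *rq, uint64_t handle)
{
	if (rq->length == rq->nentries)
		return -1;
	rq->handles[rq->length++] = handle;
	return 0;
}

/* Take the most recently stashed handle; returns 0 on success, -1 if the
 * stash is empty, in which case a driver would fall back to the FILL ring.
 */
static int rq_pop(struct reuse_queue *rq, uint64_t *handle)
{
	if (!rq->length)
		return -1;
	*handle = rq->handles[--rq->length];
	return 0;
}
```

Like the helpers in the RFC, none of this is thread-safe; the caller is assumed to serialize access.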

Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support

2018-08-31 Thread Jakub Kicinski

On Thu, 30 Aug 2018 14:06:24 +0200, Björn Töpel wrote:
> On 2018-08-29 21:14, Jakub Kicinski wrote:
>  > On Tue, 28 Aug 2018 14:44:32 +0200, Björn Töpel wrote:  
>  >> From: Björn Töpel 
>  >>
>  >> This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
>  >> allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
>  >> allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
>  >> queue.
>  >>
>  >> All AF_XDP specific functions are added to a new file, i40e_xsk.c.
>  >>
>  >> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
>  >> will allocate a new buffer and copy the zero-copy frame prior passing
>  >> it to the kernel stack.
>  >>
>  >> Signed-off-by: Björn Töpel   
>  >
>  > Mm.. I'm surprised you don't run into buffer reuse issues that I had
>  > when playing with AF_XDP.  What happens in i40e if someone downs the
>  > interface?  Will UMEMs get destroyed?  Will the RX buffers get freed?
>  >  
> 
> The UMEM will linger in the driver until the sockets are dead.
> 
>  > I'll shortly send an RFC with my quick and dirty RX buffer reuse queue,
>  > FWIW.
>  >  
> 
> Some background for folks that don't know the details: A zero-copy
> capable driver picks buffers off the fill ring and places them on the
> hardware Rx ring to be completed at a later point when DMA is
> complete. Analogous for the Tx side; The driver picks buffers off the
> Tx ring and places them on the Tx hardware ring.
> 
> In the typical flow, the Rx buffer will be placed onto an Rx ring
> (completed to the user), and the Tx buffer will be placed on the
> completion ring to notify the user that the transfer is done.
> 
> However, if the driver needs to tear down the hardware rings for some
> reason (interface goes down, reconfiguration and such), what should be
> done with the Rx and Tx buffers that have been given to the driver?
> 
> So, to frame the problem: What should a driver do when this happens,
> so that buffers aren't leaked?
> 
> Note that when the UMEM is going down, there's no need to complete
> anything, since the sockets are dying/dead already.
> 
> This is, as you state, a missing piece in the implementation and needs
> to be fixed.
> 
> Now on to possible solutions:
> 
> 1. Complete the buffers back to the user. For Tx, this is probably the
> best way -- just place the buffers onto the completion ring.
> 
> For Rx, we can give buffers back to user space by setting the
> length in the Rx descriptor to zero and putting them on the Rx
> ring. However, one complication here is that we do not have any
> back-pressure mechanism for the Rx side like we have on Tx. If the
> Rx ring(s) is (are) full the kernel will have to leak them or
> implement a retry mechanism (ugly and should be avoided).
> 
> Another option to solve this without needing any retry or leaking
> for Rx is to implement the same back-pressure mechanism that we
> have on the Tx path in the Rx path. In the Tx path, the driver will
> only get a Tx packet to send if there is space for it in the
> completion ring. On Rx, this would be that the driver would only
> get a buffer from the fill ring if there is space for it to put it
> on the Rx ring. The drawback of this is that it would likely impact
> performance negatively since the Rx ring would have to be touched one
> more time (in the Tx path, it increased performance since it made
> it possible to implement the Tx path without any buffering), but it
> would guarantee that all buffers can always be returned to user
> space, making this solution a viable option.
> 
> 2. Store the buffers internally in the driver, and make sure that they
> are inserted into the "normal flow" again. For Rx that would be
> putting the buffers back into the allocation scheme that the driver
> is using. For Tx, placing the buffers back onto the Tx HW ring
> (plus all the logic for making sure that all corner cases work).
> 
> 3. Mark the socket(s) as in error state, and require the user to redo
> the setup. This is a bit harsh...

That's a good summary, one more (bad) option:

4. Disallow any operations on the device which could lead to RX buffer
   discards if any UMEMs are attached.
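
To make the Rx back-pressure idea from option 1 a bit more concrete, here
is a minimal user-space sketch. It is not driver or kernel code: the ring
layout, the in-flight counter and all names (struct ring, fill_take,
rx_complete) are assumptions made up for illustration. The only point it
tries to show is that a buffer is taken from the fill ring only when a slot
on the Rx ring is already reserved for it, so returning it can never fail.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 4 /* hypothetical size, must be a power of two */

/* Hypothetical single-producer/single-consumer ring of buffer addresses. */
struct ring {
	uint64_t entries[RING_SIZE];
	uint32_t prod; /* free-running producer index */
	uint32_t cons; /* free-running consumer index */
};

uint32_t ring_used(const struct ring *r)
{
	return r->prod - r->cons;
}

uint32_t ring_free(const struct ring *r)
{
	return RING_SIZE - ring_used(r);
}

bool ring_push(struct ring *r, uint64_t addr)
{
	if (ring_free(r) == 0)
		return false;
	r->entries[r->prod++ & (RING_SIZE - 1)] = addr;
	return true;
}

bool ring_pop(struct ring *r, uint64_t *addr)
{
	if (ring_used(r) == 0)
		return false;
	*addr = r->entries[r->cons++ & (RING_SIZE - 1)];
	return true;
}

/*
 * Back-pressure: hand out a fill-ring buffer only if the Rx ring has room
 * for every buffer the driver already holds plus this one.
 */
bool fill_take(struct ring *fill, const struct ring *rx,
	       uint32_t *in_flight, uint64_t *addr)
{
	if (ring_free(rx) <= *in_flight)
		return false;
	if (!ring_pop(fill, addr))
		return false;
	(*in_flight)++;
	return true;
}

/* Return a held buffer to user space; a slot was reserved by fill_take(). */
void rx_complete(struct ring *rx, uint32_t *in_flight, uint64_t addr)
{
	ring_push(rx, addr);
	(*in_flight)--;
}
```

On teardown, a driver following this scheme could call rx_complete() for
every buffer it still holds (with a zero length in a real descriptor). The
extra ring_free() check on each take is the additional touch of the Rx ring
that the quoted text expects to cost performance.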

> For i40e I think #2 for Rx (buffers reside in kernel, return to
> allocator) and #1 for Tx (complete to userland).
> 
> Your RFC is plumbing to implement #2 for Rx in a driver. I'm not a fan
> of extending the umem with the "reuse queue". This decision is really
> up to the driver. Some drivers might prefer another scheme, or simply
> prefer storing the buffers in another manner.

The only performance cost is the extra pointer in xdp_umem.  Drivers
have to opt-in by using the *_rq() helpers.  We can move the extra
pointer to driver structs, and have them pass it to the helpers, so that
would be zero extra cost, and we can reuse the implementation.
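
As a rough sketch of the "pointer lives in the driver struct and is passed
to the helpers" variant, assuming a simple LIFO of buffer handles: the names
below (struct reuse_queue, rq_alloc, rq_put, rq_get) are hypothetical and
are not the *_rq() helpers from the RFC.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reuse queue: a LIFO of buffer handles owned by the driver. */
struct reuse_queue {
	uint32_t nentries;  /* capacity */
	uint32_t length;    /* handles currently stored */
	uint64_t handles[]; /* flexible array member */
};

/* Allocate an empty queue with room for nentries handles. */
struct reuse_queue *rq_alloc(uint32_t nentries)
{
	struct reuse_queue *rq;

	rq = malloc(sizeof(*rq) + nentries * sizeof(rq->handles[0]));
	if (!rq)
		return NULL;
	rq->nentries = nentries;
	rq->length = 0;
	return rq;
}

/* Stash a buffer the driver could not use; returns -1 when the queue is full. */
int rq_put(struct reuse_queue *rq, uint64_t handle)
{
	if (rq->length == rq->nentries)
		return -1;
	rq->handles[rq->length++] = handle;
	return 0;
}

/* Take back the most recently stashed buffer; returns -1 when empty. */
int rq_get(struct reuse_queue *rq, uint64_t *handle)
{
	if (rq->length == 0)
		return -1;
	*handle = rq->handles[--rq->length];
	return 0;
}
```

A driver would keep the struct reuse_queue pointer in its own per-ring
state and try rq_get() before going to the fill queue, so drivers that do
not opt in carry no extra field at all.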

> Looking forward, as both you and Jesper have alluded to, we need

Re: [PATCH bpf-next] samples/bpf: xdpsock, minor fixes

2018-08-31 Thread Björn Töpel

On Fri, 31 Aug 2018 at 03:04, Prashant Bhole wrote:
>
> - xsks_map size was fixed to 4, changed it to MAX_SOCKS
> - Remove redundant definition of MAX_SOCKS in xdpsock_user.c
> - In dump_stats(), add NULL check for xsks[i]
>

Thanks for the cleanup!

Acked-by: Björn Töpel 

> Signed-off-by: Prashant Bhole 
> ---
>  samples/bpf/xdpsock_kern.c | 2 +-
>  samples/bpf/xdpsock_user.c | 3 +--
>  2 files changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
> index d8806c41362e..b8ccd0802b3f 100644
> --- a/samples/bpf/xdpsock_kern.c
> +++ b/samples/bpf/xdpsock_kern.c
> @@ -16,7 +16,7 @@ struct bpf_map_def SEC("maps") xsks_map = {
> .type = BPF_MAP_TYPE_XSKMAP,
> .key_size = sizeof(int),
> .value_size = sizeof(int),
> -   .max_entries = 4,
> +   .max_entries = MAX_SOCKS,
>  };
>
>  struct bpf_map_def SEC("maps") rr_map = {
> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> index b3906111bedb..57ecadc58403 100644
> --- a/samples/bpf/xdpsock_user.c
> +++ b/samples/bpf/xdpsock_user.c
> @@ -118,7 +118,6 @@ struct xdpsock {
> unsigned long prev_tx_npkts;
>  };
>
> -#define MAX_SOCKS 4
>  static int num_socks;
>  struct xdpsock *xsks[MAX_SOCKS];
>
> @@ -596,7 +595,7 @@ static void dump_stats(void)
>
> prev_time = now;
>
> -   for (i = 0; i < num_socks; i++) {
> +   for (i = 0; i < num_socks && xsks[i]; i++) {
> char *fmt = "%-15s %'-11.0f %'-11lu\n";
> double rx_pps, tx_pps;
>
> --
> 2.17.1
>
>
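
For reference, a minimal stand-alone sketch of the loop condition added to
dump_stats() in the patch quoted above. The struct and function names here
(struct sock_stats, total_rx) are hypothetical; only the condition
`i < num_socks && socks[i]` mirrors the patch. Note that it stops at the
first unset slot rather than skipping over it.

```c
#include <stddef.h>

/* Hypothetical per-socket counters, standing in for struct xdpsock. */
struct sock_stats {
	unsigned long rx_npkts;
};

/*
 * Sum the Rx counters. The loop ends at the first NULL slot, the same way
 * `i < num_socks && xsks[i]` does, so a NULL entry is never dereferenced.
 */
unsigned long total_rx(struct sock_stats *socks[], int num_socks)
{
	unsigned long total = 0;

	for (int i = 0; i < num_socks && socks[i]; i++)
		total += socks[i]->rx_npkts;
	return total;
}
```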

Re: [PATCH bpf-next] xsk: remove unnecessary assignment

2018-08-31 Thread Björn Töpel

On Fri, 31 Aug 2018 at 03:02, Prashant Bhole wrote:
>
> When xdp_umem_query() was added, one assignment of bpf.command was
> missed in the cleanup. Remove that assignment.
>

Good catch!

Acked-by: Björn Töpel 

> Fixes: 84c6b86875e01a0 ("xsk: don't allow umem replace at stack level")
> Signed-off-by: Prashant Bhole 
> ---
>  net/xdp/xdp_umem.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index bfe2dbea480b..d179732617dc 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -76,8 +76,6 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
> if (!dev->netdev_ops->ndo_bpf || !dev->netdev_ops->ndo_xsk_async_xmit)
> return force_zc ? -EOPNOTSUPP : 0; /* fail or fallback */
>
> -   bpf.command = XDP_QUERY_XSK_UMEM;
> -
> rtnl_lock();
> err = xdp_umem_query(dev, queue_id);
> if (err) {
> --
> 2.17.1
>
>

Re: [PATCH net] sctp: hold transport before accessing its asoc in sctp_transport_get_next

2018-08-31 Thread Xin Long

On Wed, Aug 29, 2018 at 7:36 PM Neil Horman  wrote:
>
> On Wed, Aug 29, 2018 at 12:08:40AM +0800, Xin Long wrote:
> > On Mon, Aug 27, 2018 at 9:08 PM Neil Horman  wrote:
> > >
> > > On Mon, Aug 27, 2018 at 06:38:31PM +0800, Xin Long wrote:
> > > > As Marcelo noticed, in sctp_transport_get_next, it is iterating over
> > > > transports but then also accessing the association directly, without
> > > > checking any refcnts before that, which can cause a use-after-free
> > > > read.
> > > >
> > > > So fix it by holding transport before accessing the association. With
> > > > that, sctp_transport_hold calls can be removed in the later places.
> > > >
> > > > Fixes: 626d16f50f39 ("sctp: export some apis or variables for sctp_diag and reuse some for proc")
> > > > Reported-by: syzbot+fe62a0c9aa6a85c6d...@syzkaller.appspotmail.com
> > > > Signed-off-by: Xin Long 
> > > > ---
> > > >  net/sctp/proc.c   |  4 
> > > >  net/sctp/socket.c | 22 +++---
> > > >  2 files changed, 15 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/net/sctp/proc.c b/net/sctp/proc.c
> > > > index ef5c9a8..4d6f1c8 100644
> > > > --- a/net/sctp/proc.c
> > > > +++ b/net/sctp/proc.c
> > > > @@ -264,8 +264,6 @@ static int sctp_assocs_seq_show(struct seq_file 
> > > > *seq, void *v)
> > > >   }
> > > >
> > > >   transport = (struct sctp_transport *)v;
> > > > - if (!sctp_transport_hold(transport))
> > > > - return 0;
> > > >   assoc = transport->asoc;
> > > >   epb = &assoc->base;
> > > >   sk = epb->sk;
> > > > @@ -322,8 +320,6 @@ static int sctp_remaddr_seq_show(struct seq_file 
> > > > *seq, void *v)
> > > >   }
> > > >
> > > >   transport = (struct sctp_transport *)v;
> > > > - if (!sctp_transport_hold(transport))
> > > > - return 0;
> > > >   assoc = transport->asoc;
> > > >
> > > >   list_for_each_entry_rcu(tsp, &assoc->peer.transport_addr_list,
> > > > diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> > > > index e96b15a..aa76586 100644
> > > > --- a/net/sctp/socket.c
> > > > +++ b/net/sctp/socket.c
> > > > @@ -5005,9 +5005,14 @@ struct sctp_transport 
> > > > *sctp_transport_get_next(struct net *net,
> > > >   break;
> > > >   }
> > > >
> > > > + if (!sctp_transport_hold(t))
> > > > + continue;
> > > > +
> > > >   if (net_eq(sock_net(t->asoc->base.sk), net) &&
> > > >   t->asoc->peer.primary_path == t)
> > > >   break;
> > > > +
> > > > + sctp_transport_put(t);
> > > >   }
> > > >
> > > >   return t;
> > > > @@ -5017,13 +5022,18 @@ struct sctp_transport 
> > > > *sctp_transport_get_idx(struct net *net,
> > > > struct rhashtable_iter 
> > > > *iter,
> > > > int pos)
> > > >  {
> > > > - void *obj = SEQ_START_TOKEN;
> > > > + struct sctp_transport *t;
> > > >
> > > > - while (pos && (obj = sctp_transport_get_next(net, iter)) &&
> > > > -!IS_ERR(obj))
> > > > - pos--;
> > > > + if (!pos)
> > > > + return SEQ_START_TOKEN;
> > > >
> > > > - return obj;
> > > > + while ((t = sctp_transport_get_next(net, iter)) && !IS_ERR(t)) {
> > > > + if (!--pos)
> > > > + break;
> > > > + sctp_transport_put(t);
> > > > + }
> > > > +
> > > > + return t;
> > > >  }
> > > >
> > > >  int sctp_for_each_endpoint(int (*cb)(struct sctp_endpoint *, void *),
> > > > @@ -5082,8 +5092,6 @@ int sctp_for_each_transport(int (*cb)(struct sctp_transport *, void *),
> > > >
> > > >   tsp = sctp_transport_get_idx(net, &hti, *pos + 1);
> > > >   for (; !IS_ERR_OR_NULL(tsp); tsp = sctp_transport_get_next(net, &hti)) {
> > > > - if (!sctp_transport_hold(tsp))
> > > > - continue;
> > > >   ret = cb(tsp, p);
> > > >   if (ret)
> > > >   break;
> > > > --
> > > > 2.1.0
> > > >
> > > >
> > > Acked-by: Neil Horman 
> > >
> > > Additionally, it's not germane to this particular fix, but why are we still
> > > using that pos variable in sctp_transport_get_idx?  With the conversion to
> > > rhashtables, it doesn't seem particularly useful anymore.
> > For proc, it seems so, as hti is saved into seq->private.
> > But for diag, "hti" in sctp_for_each_transport() is a local variable.
> > Where do you think we can save it?
> >
> Sorry, I wasn't suggesting that it had to be removed from sctp_for_each_transport;
> it's clearly used as both an input and output there.  All I was suggesting was
> that, in sctp_transport_get_idx, the pos variable might no longer be needed
> there specifically, as sctp_transport_get_next should terminate the loop on its
> own.  Or is there another purpose for that positional variable I am missing?
Yes, Neil, all transports/asocs could not be dumped once

Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-08-31 Thread Steffen Klassert

On Thu, Aug 30, 2018 at 08:53:50PM +0200, Wolfgang Walter wrote:
> Hello,
> 
> kernels > 4.12 do not work on one of our main routers. They crash as soon
> as ipsec-tunnels are configured and ipsec-traffic actually flows.

Can you please send the backtrace of this crash?

Thanks!

Re: [PATCH 1/2] xfrm6: call kfree_skb when skb is toobig

2018-08-31 Thread Steffen Klassert

On Thu, Aug 30, 2018 at 03:23:11PM +0200, Sabrina Dubroca wrote:
> 2018-08-30, 09:58:16 -0300, Thadeu Lima de Souza Cascardo wrote:
> > After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching
> > and reporting on xmit"), some too-big skbs might be passed down to
> > __xfrm6_output, causing it to fail to transmit but not free the skb, which
> > leaks the skb and, consequently, dst references.
> > 
> > After running pmtu.sh, that shows up as a failure to unregister devices in a
> > namespace:
> > 
> > [  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage count = 1
> > 
> > The fix is to call kfree_skb in case of transmit failures.
> > 
> > Signed-off-by: Thadeu Lima de Souza Cascardo 

Good catch!

> Reviewed-by: Sabrina Dubroca 
> 
> I was about to post the same patch. Arguably, the commit introducing
> this bug is the one that added those "return -EMSGSIZE" to
> __xfrm6_output without freeing.
> 
> Either way, it's missing a Fixes: tag, which should be one of those,
> or both:
> 
> Fixes: d6990976af7c ("vti6: fix PMTU caching and reporting on xmit")
> Fixes: dd767856a36e ("xfrm6: Don't call icmpv6_send on local error")

This bug can be triggered even without vti6, so the correct
Fixes tag would be the latter.

Thadeu, please resend this one with the Fixes tag.

Thanks!
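
The ownership rule behind this fix can be sketched in plain user-space C.
This is only an illustration under stated assumptions: struct pkt,
pkt_alloc(), pkt_free() and pkt_output() are hypothetical stand-ins, not
the sk_buff API, and the free counter exists only so the behaviour can be
observed. The point is that an output function which takes ownership of the
buffer has to release it on the -EMSGSIZE path as well.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for an sk_buff; free_count lets callers observe frees. */
struct pkt {
	size_t len;
	int *free_count;
};

struct pkt *pkt_alloc(size_t len, int *free_count)
{
	struct pkt *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->len = len;
	p->free_count = free_count;
	return p;
}

void pkt_free(struct pkt *p)
{
	if (p->free_count)
		(*p->free_count)++;
	free(p);
}

/* Takes ownership of p on every path, including the error path. */
int pkt_output(struct pkt *p, size_t mtu)
{
	if (p->len > mtu) {
		pkt_free(p); /* without this, the too-big packet would leak */
		return -EMSGSIZE;
	}
	/* A real transmit would pass p on; the sketch simply consumes it. */
	pkt_free(p);
	return 0;
}
```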

Re: [PATCH 2/2] vti6: do not check for ignore_df in order to update pmtu

2018-08-31 Thread Steffen Klassert

On Thu, Aug 30, 2018 at 09:58:17AM -0300, Thadeu Lima de Souza Cascardo wrote:
> Before commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching
> and reporting on xmit"), skb was scrubbed before checking for ignore_df. The
> scrubbing meant ignore_df was false, making the check irrelevant. Now that the
> scrubbing happens after that, some packets might fail the check and dst
> will not have its pmtu updated.
> 
> Not only that, but a too-big skb will potentially be passed down to
> __xfrm6_output, causing it to fail to transmit but not free the skb, which
> leaks the skb and, consequently, dst references.
> 
> After running pmtu.sh, that shows up as a failure to unregister devices in a
> namespace:
> 
> [  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage count = 1
> 
> Signed-off-by: Thadeu Lima de Souza Cascardo 
> ---
>  net/ipv6/ip6_vti.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
> index c72ae3a4fe09..fbd3752ea587 100644
> --- a/net/ipv6/ip6_vti.c
> +++ b/net/ipv6/ip6_vti.c
> @@ -481,7 +481,7 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, struct flowi *fl)
>   }
>  
>   mtu = dst_mtu(dst);
> - if (!skb->ignore_df && skb->len > mtu) {
> + if (skb->len > mtu) {
>   skb_dst_update_pmtu(skb, mtu);

The very same patch already went into the net tree two days ago:

commit 9f2895461439fda2801a7906fb4c5fb3dbb37a0a
vti6: remove !skb->ignore_df check from vti6_xmit()

[PATCH bpf-next] tools/bpf: bpftool, add xskmap in map types

2018-08-31 Thread Prashant Bhole

When listing all maps, bpftool currently shows (null) for xskmap.
Add the xskmap type to map_type_name[] so that the correct type is shown.

Signed-off-by: Prashant Bhole 
---
 tools/bpf/bpftool/map.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index df175bc33c5d..9c55077ca5dd 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -68,6 +68,7 @@ static const char * const map_type_name[] = {
[BPF_MAP_TYPE_DEVMAP]   = "devmap",
[BPF_MAP_TYPE_SOCKMAP]  = "sockmap",
[BPF_MAP_TYPE_CPUMAP]   = "cpumap",
+   [BPF_MAP_TYPE_XSKMAP]   = "xskmap",
[BPF_MAP_TYPE_SOCKHASH] = "sockhash",
[BPF_MAP_TYPE_CGROUP_STORAGE]   = "cgroup_storage",
 };
-- 
2.17.1
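
As a stand-alone illustration of why a missing entry shows up as (null):
with designated initializers, any index that is not named is
zero-initialized, so the slot holds a NULL pointer. The enum and the
map_type_str() fallback helper below are hypothetical and only mimic the
shape of the table; they are not bpftool code.

```c
#include <stddef.h>

/* Hypothetical enum mirroring the order of a few BPF map types. */
enum map_type {
	MAP_TYPE_DEVMAP,
	MAP_TYPE_SOCKMAP,
	MAP_TYPE_CPUMAP,
	MAP_TYPE_XSKMAP,
	MAP_TYPE_SOCKHASH,
};

/* MAP_TYPE_XSKMAP is left out on purpose: its slot is implicitly NULL. */
static const char * const map_type_name[] = {
	[MAP_TYPE_DEVMAP]   = "devmap",
	[MAP_TYPE_SOCKMAP]  = "sockmap",
	[MAP_TYPE_CPUMAP]   = "cpumap",
	[MAP_TYPE_SOCKHASH] = "sockhash",
};

/* Assumed helper: return a fallback string for out-of-range or unnamed types. */
const char *map_type_str(unsigned int type)
{
	size_t n = sizeof(map_type_name) / sizeof(map_type_name[0]);

	if (type >= n || !map_type_name[type])
		return "unknown";
	return map_type_name[type];
}
```

Passing such a NULL slot straight to a "%s" conversion is what produces
"(null)" with glibc; adding the missing table entry, as the patch does,
removes the gap.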
