[PATCH net] net: move __dev_notify_flags from void to int

2018-07-03 Thread Hangbin Liu
Since call_netdevice_notifiers() and call_netdevice_notifiers_info() return
a value, propagate that value from __dev_notify_flags() as well instead of
discarding it.

Signed-off-by: Hangbin Liu 
---
 include/linux/netdevice.h |  2 +-
 net/core/dev.c            | 30 ++++++++++++++++++++++----------
 2 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3d0cc0b..b1c145e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3411,7 +3411,7 @@ int dev_ethtool(struct net *net, struct ifreq *);
 unsigned int dev_get_flags(const struct net_device *);
 int __dev_change_flags(struct net_device *, unsigned int flags);
 int dev_change_flags(struct net_device *, unsigned int);
-void __dev_notify_flags(struct net_device *, unsigned int old_flags,
+int __dev_notify_flags(struct net_device *, unsigned int old_flags,
unsigned int gchanges);
 int dev_change_name(struct net_device *, const char *);
 int dev_set_alias(struct net_device *, const char *, size_t);
diff --git a/net/core/dev.c b/net/core/dev.c
index a5aa1c7..0994533 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6798,8 +6798,10 @@ static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
 
dev_change_rx_flags(dev, IFF_PROMISC);
}
+
if (notify)
-   __dev_notify_flags(dev, old_flags, IFF_PROMISC);
+   return __dev_notify_flags(dev, old_flags, IFF_PROMISC);
+
return 0;
 }
 
@@ -6854,8 +6856,8 @@ static int __dev_set_allmulti(struct net_device *dev, int inc, bool notify)
dev_change_rx_flags(dev, IFF_ALLMULTI);
dev_set_rx_mode(dev);
if (notify)
-   __dev_notify_flags(dev, old_flags,
-  dev->gflags ^ old_gflags);
+   return  __dev_notify_flags(dev, old_flags,
+  dev->gflags ^ old_gflags);
}
return 0;
 }
@@ -7016,21 +7018,26 @@ int __dev_change_flags(struct net_device *dev, unsigned int flags)
return ret;
 }
 
-void __dev_notify_flags(struct net_device *dev, unsigned int old_flags,
-   unsigned int gchanges)
+int __dev_notify_flags(struct net_device *dev, unsigned int old_flags,
+  unsigned int gchanges)
 {
unsigned int changes = dev->flags ^ old_flags;
+   int err = 0;
 
if (gchanges)
rtmsg_ifinfo(RTM_NEWLINK, dev, gchanges, GFP_ATOMIC);
 
if (changes & IFF_UP) {
if (dev->flags & IFF_UP)
-   call_netdevice_notifiers(NETDEV_UP, dev);
+   err = call_netdevice_notifiers(NETDEV_UP, dev);
else
-   call_netdevice_notifiers(NETDEV_DOWN, dev);
+   err = call_netdevice_notifiers(NETDEV_DOWN, dev);
}
 
+   err = notifier_to_errno(err);
+   if (err)
+   goto out;
+
if (dev->flags & IFF_UP &&
(changes & ~(IFF_UP | IFF_PROMISC | IFF_ALLMULTI | IFF_VOLATILE))) {
struct netdev_notifier_change_info change_info = {
@@ -7040,8 +7047,12 @@ void __dev_notify_flags(struct net_device *dev, unsigned int old_flags,
.flags_changed = changes,
};
 
-   call_netdevice_notifiers_info(NETDEV_CHANGE, &change_info.info);
-   err = call_netdevice_notifiers_info(NETDEV_CHANGE,
+  &change_info.info);
+   err = notifier_to_errno(err);
}
+
+out:
+   return err;
 }
 
 /**
@@ -7062,8 +7073,7 @@ int dev_change_flags(struct net_device *dev, unsigned int flags)
return ret;
 
changes = (old_flags ^ dev->flags) | (old_gflags ^ dev->gflags);
-   __dev_notify_flags(dev, old_flags, changes);
-   return ret;
+   return __dev_notify_flags(dev, old_flags, changes);
 }
 EXPORT_SYMBOL(dev_change_flags);
 
-- 
2.5.5



Re: [PATCH v2 net] net/ipv6: Revert attempt to simplify route replace and append

2018-07-03 Thread David Miller
From: dsah...@kernel.org
Date: Tue,  3 Jul 2018 14:36:21 -0700

> From: David Ahern 
> 
> NetworkManager likes to manage linklocal prefix routes and does so with
> the NLM_F_APPEND flag, breaking attempts to simplify the IPv6 route
> code and by extension enable multipath routes with device only nexthops.
> 
> Revert f34436a43092 and these followup patches:
> 6eba08c3626b ("ipv6: Only emit append events for appended routes").
> ce45bded6435 ("mlxsw: spectrum_router: Align with new route replace logic")
> 53b562df8c20 ("mlxsw: spectrum_router: Allow appending to dev-only routes")
> 
> Update the fib_tests cases to reflect the old behavior.
> 
> Fixes: f34436a43092 ("net/ipv6: Simplify route replace and appending into multipath route")
> Signed-off-by: David Ahern 

Applied.

> The gre_multipath tests only exist in net-next. Those will need to be
> updated separately.

Ok, please send me the net-next update once this gets merged there.

Thanks.


[BUG] mlx5 have problems with ipv4-ipv6 tunnels in linux 4.4

2018-07-03 Thread Konstantin Khlebnikov

I'm seeing problems with tunnelled traffic on a Mellanox Technologies MT27710
Family [ConnectX-4 Lx] using the vanilla driver from linux 4.4.y.

Packets with a payload bigger than 116 bytes are not transmitted.
Smaller packets and plain ipv6 work fine.

In linux 4.9, 4.14 and the out-of-tree driver everything seems fine for now.
It's hard to guess or bisect the commit: there are a lot of changes, and
something was wrong with the driver or swiotlb in 4.7..4.8.
4.6 is affected too, so the fix should be somewhere between 4.6 and 4.9.

Probably this case was fixed indirectly by adding some kind of offload, and
the non-offloaded path is still broken.
Please give me a hint which commit it could be.

In 4.4 ethtool shows this:

# ethtool -i eth0
driver: mlx5_core
version: 3.0-1 (January 2015)
firmware-version: 14.21.2500
expansion-rom-version:
bus-info: :81:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

# ethtool -k eth0
Features for eth0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp6-segmentation: off
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
busy-poll: off [fixed]



in 4.9


# ethtool -i eth0
driver: mlx5_core
version: 3.0-1 (January 2015)
firmware-version: 14.21.2500
expansion-rom-version:
bus-info: :81:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes


# ethtool -k eth0
Features for eth0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: off
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
busy-poll: off [fixed]
hw-tc-offload: off

diff

@@ -12,6 +12,7 @@
 tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off [fixed]
+   tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: off
 udp-fragmentation-offload: off [fixed]
 generic-segmentation-offload: on
@@ -19,7 +20,7 @@
 large-receive-offload: off
 rx-vlan-offload: on
 tx-vlan-offload: on
-ntuple-filters: off [fixed]
+ntuple-filters: off
 receive-hashing: on
 highdma: on [fixed]
 rx-vlan-filter: on
@@ -29,16 +30,21 @@
 tx-gso-robust: off [fixed]
 tx-fcoe-segmentation: off [fixed]
 tx-gre-segmentation: off [fixed]
-tx-ipip-segmentation: off [fixed]
-tx-sit-segmentation: off [fixed]
-tx-udp_tnl-segmentation: off [fixed]
+tx-gre-csum-segmentation: off [fixed]
+tx-ipxip4-segmentation: off [fixed]
+tx-ipxip6-segmentation: off [fixed]
+tx-udp_tnl-segmentation: on
+tx-udp_tnl-csum-segmentation: on
+tx-gso-partial: on
+tx-sctp-segmentation: off [fixed]
 fcoe-mtu: off [fixed]
 tx-nocache-copy: off
 loopback: off [fixed]
 rx-fcs

Re: [PATCH] rhashtable: add restart routine in rhashtable_free_and_destroy()

2018-07-03 Thread Herbert Xu
On Tue, Jul 03, 2018 at 10:19:09PM +0900, Taehee Yoo wrote:
>
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 0e04947..8ea27fa 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -1134,6 +1134,7 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
>   mutex_lock(&ht->mutex);
>   tbl = rht_dereference(ht->tbl, ht);
>   if (free_fn) {
> +restart:
>   for (i = 0; i < tbl->size; i++) {
>   struct rhash_head *pos, *next;
>  
> @@ -1147,9 +1148,11 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
>   rht_dereference(pos->next, ht) : NULL)
>   rhashtable_free_one(ht, pos, free_fn, arg);
>   }
> + tbl = rht_dereference(tbl->future_tbl, ht);
> + if (tbl)
> + goto restart;
>   }
> -
> - bucket_table_free(tbl);
> + bucket_table_free(rht_dereference(ht->tbl, ht));

Good catch.  But don't we need to call bucket_table_free on all
the tables too rather than just the first table?

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH net-next 0/2] More mirror-to-gretap tests with bridge in UL

2018-07-03 Thread David Miller
From: Petr Machata 
Date: Mon, 02 Jul 2018 19:58:44 +0200

> This patchset adds two more tests where the mirror-to-gretap has a
> bridge in underlay packet path, without a VLAN above or below that
> bridge.
> 
> In patch #1, a non-VLAN-filtering bridge is tested.
> 
> In patch #2, a VLAN-filtering bridge is tested.

Series applied.


Re: [PATCH v4 net-next 0/9] Handle multiple received packets at each stage

2018-07-03 Thread David Miller
From: Edward Cree 
Date: Mon, 2 Jul 2018 16:11:36 +0100

> This patch series adds the capability for the network stack to receive a
>  list of packets and process them as a unit, rather than handling each
>  packet singly in sequence.  This is done by factoring out the existing
>  datapath code at each layer and wrapping it in list handling code.
 ...

This is really nice stuff.

I'll apply this, but please work on the ipv6 side too.

I hope that driver maintainers take a look at using the new
netif_receive_skb_list() interface and see how much it helps
performance with their devices.

Thanks!


Re: [PATCH][net-next] net: increase MAX_GRO_SKBS to 64

2018-07-03 Thread David Miller
From: Li RongQing 
Date: Tue, 3 Jul 2018 14:21:48 +0800

> On 7/2/18, David Miller  wrote:
>> From: Li RongQing 
>> Date: Mon,  2 Jul 2018 19:41:43 +0800
>>
>>> After 07d78363dcffd ("net: Convert NAPI gro list into a small hash table")
>>> there are 8 hash buckets, which allows more flows to be held for merging.
>>>
>>> Keep each list at its original length, so increase MAX_GRO_SKBS to 64.
>>>
>>> Signed-off-by: Li RongQing 
>>
>> I would like to hear some feedback from Eric, 64 might be too big.
>>
> How about the below change?
> 
> commit 6270b973a973b2944fedb4b5f9926ed3e379d0c2 (HEAD -> master)
> Author: Li RongQing 
> Date:   Mon Jul 2 19:08:37 2018 +0800
> 
> net: limit each hash list length to MAX_GRO_SKBS

Yes, this is much better.

> @@ -324,6 +324,7 @@ struct napi_struct {
>  #endif
> struct net_device   *dev;
> struct list_headgro_hash[GRO_HASH_BUCKETS];
> +   int list_len[GRO_HASH_BUCKETS];

For cache locality, it is probably best to make gro_hash an array of structures
whose members are:

struct list_headhash_chain;
int list_len;


Re: [PATCH iproute2] tc: Fix the bug not to display prio and quantum options of htb

2018-07-03 Thread Cong Wang
On Tue, Jul 3, 2018 at 8:33 PM fumihiko kakuma  wrote:
>
> A commandline like 'tc -d class show dev dev-name' does not
> display value of prio and quantum option when we use htb qdisc.
> This patch fixes the bug.
>
> Signed-off-by: Fumihiko Kakuma 

Good catch!

Acked-by: Cong Wang 


Re: QDisc Implementation: Setting bit rate and getting RTT

2018-07-03 Thread Cong Wang
On Tue, Jul 3, 2018 at 4:14 PM Taran Lynn  wrote:
>
> Hello,
> I'm new to linux development and am working on creating a qdisc module,
> similar to those under net/sched/sch_*.c. Currently I'm stuck on two
> things.
>
> 1. What's the best way to set the maximum bit rate?
> 2. How do I determine the RTT for packets?
>
> For (1) I'm currently tracking the number of bits sent, and only sending
> packets if they stay within the appropriate rate bound. Is there a better
> way to do this? I've looked at net/sched/sch_netem.c, and as far as I can
> tell it determines the delay between packets for a given rate and sends
> them after that delay has passed. However, when I tried to do this I got
> inconsistent rates, so I think I may be missing something.

It depends on what you want to do for bandwidth control.

If you simply want traffic policing, take a look at gen estimator
in net/core/gen_estimator.c. net/sched/act_police.c is an example
of how it is used.

If you want traffic shaping, it is harder, the simplest one in kernel
is probably net/sched/sch_tbf.c.

>
> As for (2) I'm not sure where to start.

Why do you care about RTT in the Qdisc layer? RTT is a TCP-level notion,
while the Qdisc sits at L2...


Re: [PATCH net-next] net: sched: act_pedit: fix possible memory leak in tcf_pedit_init()

2018-07-03 Thread Cong Wang
On Tue, Jul 3, 2018 at 6:36 AM Wei Yongjun  wrote:
>
> 'keys_ex' is malloced by tcf_pedit_keys_ex_parse() in tcf_pedit_init(),
> but not all of the error handling paths free it, which may cause a memory
> leak. This patch fixes it.
>
> Fixes: 71d0ed7079df ("net/act_pedit: Support using offset relative to the conventional network headers")
> Signed-off-by: Wei Yongjun 

Acked-by: Cong Wang 


[PATCH iproute2] tc: Fix the bug not to display prio and quantum options of htb

2018-07-03 Thread fumihiko kakuma
A command line like 'tc -d class show dev dev-name' does not
display the values of the prio and quantum options when the htb qdisc is used.
This patch fixes the bug.

Signed-off-by: Fumihiko Kakuma 
---
 tc/q_htb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tc/q_htb.c b/tc/q_htb.c
index 7d5f6ce..b93d31d 100644
--- a/tc/q_htb.c
+++ b/tc/q_htb.c
@@ -291,9 +291,9 @@ static int htb_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
if (RTA_PAYLOAD(tb[TCA_HTB_PARMS])  < sizeof(*hopt)) return -1;
 
if (!hopt->level) {
-   print_int(PRINT_ANY, "prio", "prio ", (int)hopt->prio);
+   print_int(PRINT_ANY, "prio", "prio %d ", (int)hopt->prio);
if (show_details)
-   print_int(PRINT_ANY, "quantum", "quantum ",
+   print_int(PRINT_ANY, "quantum", "quantum %d ",
  (int)hopt->quantum);
}
 
-- 
2.7.4



[PATCH bpf-next 04/11] tools: bpftool: add support for loading programs for offload

2018-07-03 Thread Jakub Kicinski
Extend the bpftool prog load command to also accept a "dev"
parameter, which will allow us to load programs onto devices.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 .../bpftool/Documentation/bpftool-prog.rst|  6 ++--
 tools/bpf/bpftool/bash-completion/bpftool | 23 ++--
 tools/bpf/bpftool/prog.c  | 35 +--
 3 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
index 43d34a5c3ec5..41723c6acaa6 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
@@ -24,7 +24,7 @@ MAP COMMANDS
|  **bpftool** **prog dump xlated** *PROG* [{**file** *FILE* | **opcodes** | **visual**}]
|  **bpftool** **prog dump jited**  *PROG* [{**file** *FILE* | **opcodes**}]
 |  **bpftool** **prog pin** *PROG* *FILE*
-|  **bpftool** **prog load** *OBJ* *FILE*
+|  **bpftool** **prog load** *OBJ* *FILE* [**dev** *NAME*]
 |  **bpftool** **prog help**
 |
 |  *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* }
@@ -64,8 +64,10 @@ DESCRIPTION
 
  Note: *FILE* must be located in *bpffs* mount.
 
-   **bpftool prog load** *OBJ* *FILE*
+   **bpftool prog load** *OBJ* *FILE* [**dev** *NAME*]
  Load bpf program from binary *OBJ* and pin as *FILE*.
+ If **dev** *NAME* is specified program will be loaded onto
+ given networking device (offload).
 
  Note: *FILE* must be located in *bpffs* mount.
 
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index fffd76f4998b..ab044f528a85 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -99,6 +99,12 @@ _bpftool_get_prog_tags()
 command sed -n 's/.*"tag": "\(.*\)",$/\1/p' )" -- "$cur" ) )
 }
 
+_sysfs_get_netdevs()
+{
+COMPREPLY+=( $( compgen -W "$( ls /sys/class/net 2>/dev/null )" -- \
+"$cur" ) )
+}
+
 # For bpftool map update: retrieve type of the map to update.
 _bpftool_map_update_map_type()
 {
@@ -262,8 +268,21 @@ _bpftool()
 return 0
 ;;
 load)
-_filedir
-return 0
+if [[ ${#words[@]} -lt 6 ]]; then
+_filedir
+return 0
+fi
+
+case $prev in
+dev)
+_sysfs_get_netdevs
+return 0
+;;
+*)
+_bpftool_once_attr 'dev'
+return 0
+;;
+esac
 ;;
 *)
 [[ $prev == $object ]] && \
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index a5ef46c59029..21c74de7156f 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -681,6 +682,9 @@ static int do_pin(int argc, char **argv)
 
 static int do_load(int argc, char **argv)
 {
+   struct bpf_prog_load_attr attr = {
+   .prog_type  = BPF_PROG_TYPE_UNSPEC,
+   };
const char *objfile, *pinfile;
struct bpf_object *obj;
int prog_fd;
@@ -690,7 +694,34 @@ static int do_load(int argc, char **argv)
objfile = GET_ARG();
pinfile = GET_ARG();
 
-   if (bpf_prog_load(objfile, BPF_PROG_TYPE_UNSPEC, &obj, &prog_fd)) {
+   while (argc) {
+   if (is_prefix(*argv, "dev")) {
+   NEXT_ARG();
+
+   if (attr.ifindex) {
+   p_err("offload device already specified");
+   return -1;
+   }
+   if (!REQ_ARGS(1))
+   return -1;
+
+   attr.ifindex = if_nametoindex(*argv);
+   if (!attr.ifindex) {
+   p_err("unrecognized netdevice '%s': %s",
+ *argv, strerror(errno));
+   return -1;
+   }
+   NEXT_ARG();
+   } else {
+   p_err("expected no more arguments or 'dev', got: '%s'?",
+ *argv);
+   return -1;
+   }
+   }
+
+   attr.file = objfile;
+
+   if (bpf_prog_load_xattr(&attr, &obj, &prog_fd)) {
p_err("failed to load program");
return -1;
}
@@ -722,7 +753,7 @@ static int do_help(int argc, char **argv)
"   %s %s dump xlated PROG [{ file F

[PATCH bpf-next 05/11] tools: libbpf: expose the prog type guessing from section name logic

2018-07-03 Thread Jakub Kicinski
libbpf can guess the program type based on ELF section names.  As libbpf
becomes more popular, its association between section name strings and
types becomes more of a standard.  Allow libbpf users to use the same
logic for matching strings to types, e.g. when the string originates
from the command line.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/lib/bpf/libbpf.c | 43 --
 tools/lib/bpf/libbpf.h |  3 +++
 2 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 38ed3e92e393..30f3e58bd563 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2081,25 +2081,33 @@ static const struct {
 #undef BPF_S_PROG_SEC
 #undef BPF_SA_PROG_SEC
 
-static int bpf_program__identify_section(struct bpf_program *prog)
+int libbpf_prog_type_by_string(const char *name, enum bpf_prog_type *prog_type,
+  enum bpf_attach_type *expected_attach_type)
 {
int i;
 
-   if (!prog->section_name)
-   goto err;
-
-   for (i = 0; i < ARRAY_SIZE(section_names); i++)
-   if (strncmp(prog->section_name, section_names[i].sec,
-   section_names[i].len) == 0)
-   return i;
-
-err:
-   pr_warning("failed to guess program type based on section name %s\n",
-  prog->section_name);
+   if (!name)
+   return -1;
 
+   for (i = 0; i < ARRAY_SIZE(section_names); i++) {
+   if (strncmp(name, section_names[i].sec, section_names[i].len))
+   continue;
+   *prog_type = section_names[i].prog_type;
+   *expected_attach_type = section_names[i].expected_attach_type;
+   return 0;
+   }
return -1;
 }
 
+static int
+bpf_program__identify_section(struct bpf_program *prog,
+ enum bpf_prog_type *prog_type,
+ enum bpf_attach_type *expected_attach_type)
+{
+   return libbpf_prog_type_by_string(prog->section_name, prog_type,
+ expected_attach_type);
+}
+
 int bpf_map__fd(struct bpf_map *map)
 {
return map ? map->fd : -EINVAL;
@@ -2230,7 +2238,6 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
enum bpf_prog_type prog_type;
struct bpf_object *obj;
struct bpf_map *map;
-   int section_idx;
int err;
 
if (!attr)
@@ -2252,14 +2259,14 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
prog->prog_ifindex = attr->ifindex;
expected_attach_type = attr->expected_attach_type;
if (prog_type == BPF_PROG_TYPE_UNSPEC) {
-   section_idx = bpf_program__identify_section(prog);
-   if (section_idx < 0) {
+   err = bpf_program__identify_section(prog, &prog_type,
+   &expected_attach_type);
+   if (err < 0) {
+   pr_warning("failed to guess program type based on section name %s\n",
+  prog->section_name);
bpf_object__close(obj);
return -EINVAL;
}
-   prog_type = section_names[section_idx].prog_type;
-   expected_attach_type =
-   section_names[section_idx].expected_attach_type;
}
 
bpf_program__set_type(prog, prog_type);
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 564f4be9bae0..617dacfc6704 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -92,6 +92,9 @@ int bpf_object__set_priv(struct bpf_object *obj, void *priv,
 bpf_object_clear_priv_t clear_priv);
 void *bpf_object__priv(struct bpf_object *prog);
 
+int libbpf_prog_type_by_string(const char *name, enum bpf_prog_type *prog_type,
+  enum bpf_attach_type *expected_attach_type);
+
 /* Accessors of bpf_program */
 struct bpf_program;
 struct bpf_program *bpf_program__next(struct bpf_program *prog,
-- 
2.17.1



[PATCH bpf-next 01/11] selftests/bpf: remove duplicated word from test offloads

2018-07-03 Thread Jakub Kicinski
Trivial removal of duplicated "mode" in error message.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/testing/selftests/bpf/test_offload.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_offload.py b/tools/testing/selftests/bpf/test_offload.py
index be800d0e7a84..a257e4b08392 100755
--- a/tools/testing/selftests/bpf/test_offload.py
+++ b/tools/testing/selftests/bpf/test_offload.py
@@ -830,7 +830,7 @@ netns = []
 check_extack_nsim(err, "program loaded with different flags.", args)
 ret, _, err = sim.unset_xdp("", force=True,
 fail=False, include_stderr=True)
-fail(ret == 0, "Removed program with a bad mode mode")
+fail(ret == 0, "Removed program with a bad mode")
 check_extack_nsim(err, "program loaded with different flags.", args)
 
 start_test("Test MTU restrictions...")
-- 
2.17.1



[PATCH bpf-next 08/11] tools: libbpf: add extended attributes version of bpf_object__open()

2018-07-03 Thread Jakub Kicinski
Similarly to bpf_prog_load(), users of bpf_object__open() may need
to specify the expected program type.  The program type is needed at
open time to avoid the kernel version check for program types which
don't require it.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/lib/bpf/libbpf.c | 21 +
 tools/lib/bpf/libbpf.h |  6 ++
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index edc3b0b3737d..5b0e84fbcf71 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1520,7 +1520,8 @@ __bpf_object__open(const char *path, void *obj_buf, size_t obj_buf_sz,
return ERR_PTR(err);
 }
 
-struct bpf_object *bpf_object__open(const char *path)
+struct bpf_object *bpf_object__open_xattr(const char *path,
+ struct bpf_object_open_attr *attr)
 {
/* param validation */
if (!path)
@@ -1528,7 +1529,17 @@ struct bpf_object *bpf_object__open(const char *path)
 
pr_debug("loading %s\n", path);
 
-   return __bpf_object__open(path, NULL, 0, true);
+   return __bpf_object__open(path, NULL, 0,
+ bpf_prog_type__needs_kver(attr->prog_type));
+}
+
+struct bpf_object *bpf_object__open(const char *path)
+{
+   struct bpf_object_open_attr attr = {
+   .prog_type  = BPF_PROG_TYPE_UNSPEC,
+   };
+
+   return bpf_object__open_xattr(path, &attr);
 }
 
 struct bpf_object *bpf_object__open_buffer(void *obj_buf,
@@ -2238,6 +2249,9 @@ int bpf_prog_load(const char *file, enum bpf_prog_type type,
 int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
struct bpf_object **pobj, int *prog_fd)
 {
+   struct bpf_object_open_attr open_attr = {
+   .prog_type  = attr->prog_type,
+   };
struct bpf_program *prog, *first_prog = NULL;
enum bpf_attach_type expected_attach_type;
enum bpf_prog_type prog_type;
@@ -2250,8 +2264,7 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
if (!attr->file)
return -EINVAL;
 
-   obj = __bpf_object__open(attr->file, NULL, 0,
-bpf_prog_type__needs_kver(attr->prog_type));
+   obj = bpf_object__open_xattr(attr->file, &open_attr);
if (IS_ERR_OR_NULL(obj))
return -ENOENT;
 
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 3122d74f2643..60593ac44700 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -66,7 +66,13 @@ void libbpf_set_print(libbpf_print_fn_t warn,
 /* Hide internal to user */
 struct bpf_object;
 
+struct bpf_object_open_attr {
+   enum bpf_prog_type prog_type;
+};
+
 struct bpf_object *bpf_object__open(const char *path);
+struct bpf_object *bpf_object__open_xattr(const char *path,
+ struct bpf_object_open_attr *attr);
 struct bpf_object *bpf_object__open_buffer(void *obj_buf,
   size_t obj_buf_sz,
   const char *name);
-- 
2.17.1



[PATCH bpf-next 00/11] tools: bpf: extend bpftool prog load

2018-07-03 Thread Jakub Kicinski
Hi!

This series starts with two minor clean ups to test_offload.py
selftest script.

The next 9 patches extend the abilities of bpftool prog load
beyond the simple cgroup use cases.  Three new parameters are
added:

 - type - allows specifying program type, independent of how
  code sections are named;
 - map  - allows reusing existing maps, instead of creating a new
  map on every program load;
 - dev  - offload/binding to a device.

A number of changes to libbpf are required to accomplish the task.
The section-name to program-type mapping logic is exposed.  We should
probably aim to use the libbpf program section naming everywhere.
For reuse of maps we need to allow users to set the FD for a bpf map
object in libbpf.

Examples

Load program my_xdp.o and pin it as /sys/fs/bpf/my_xdp, for xdp
program type:

$ bpftool prog load my_xdp.o /sys/fs/bpf/my_xdp \
  type xdp

As above but for offload:

$ bpftool prog load my_xdp.o /sys/fs/bpf/my_xdp \
  type xdp \
  dev netdevsim0

Load program my_maps.o, but for the first map reuse map id 17,
and for the map called "other_map" reuse pinned map /sys/fs/bpf/map0:

$ bpftool prog load my_maps.o /sys/fs/bpf/prog \
  map idx 0 id 17 \
  map name other_map pinned /sys/fs/bpf/map0

---
This set needs a net-next -> bpf-next merge back, which I believe is
imminent.  It should not conflict with Okash's work on BTF.

Jakub Kicinski (11):
  selftests/bpf: remove duplicated word from test offloads
  selftests/bpf: add Error: prefix in check_extack helper
  tools: bpftool: refactor argument parsing for prog load
  tools: bpftool: add support for loading programs for offload
  tools: libbpf: expose the prog type guessing from section name logic
  tools: bpftool: allow users to specify program type for prog load
  tools: libbpf: recognize offload neutral maps
  tools: libbpf: add extended attributes version of bpf_object__open()
  tools: bpftool: reimplement bpf_prog_load() for prog load
  tools: libbpf: allow map reuse
  tools: bpftool: allow reuse of maps with bpftool prog load

 .../bpftool/Documentation/bpftool-prog.rst|  33 ++-
 tools/bpf/bpftool/bash-completion/bpftool |  96 ++-
 tools/bpf/bpftool/main.h  |  18 ++
 tools/bpf/bpftool/map.c   |   4 +-
 tools/bpf/bpftool/prog.c  | 245 +-
 tools/lib/bpf/libbpf.c| 107 ++--
 tools/lib/bpf/libbpf.h|  11 +
 tools/testing/selftests/bpf/test_offload.py   |  10 +-
 8 files changed, 476 insertions(+), 48 deletions(-)

-- 
2.17.1



[PATCH bpf-next 06/11] tools: bpftool: allow users to specify program type for prog load

2018-07-03 Thread Jakub Kicinski
Sometimes program section names don't match libbpf's expectations.
In particular, XDP's default section names differ between libbpf and
iproute2.  Allow users to pass the program type on the command line.
Name the types after the section names libbpf expects.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 .../bpftool/Documentation/bpftool-prog.rst| 15 ++-
 tools/bpf/bpftool/bash-completion/bpftool |  6 +++
 tools/bpf/bpftool/prog.c  | 44 +--
 3 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
index 41723c6acaa6..e53e1ad2caf0 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
@@ -24,10 +24,19 @@ MAP COMMANDS
|  **bpftool** **prog dump xlated** *PROG* [{**file** *FILE* | **opcodes** | **visual**}]
|  **bpftool** **prog dump jited**  *PROG* [{**file** *FILE* | **opcodes**}]
 |  **bpftool** **prog pin** *PROG* *FILE*
-|  **bpftool** **prog load** *OBJ* *FILE* [**dev** *NAME*]
+|  **bpftool** **prog load** *OBJ* *FILE* [**type** *TYPE*] [**dev** *NAME*]
 |  **bpftool** **prog help**
 |
 |  *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* }
+|  *TYPE* := {
+|  **socket** | **kprobe** | **kretprobe** | **classifier** | **action** |
+|  **tracepoint** | **raw_tracepoint** | **xdp** | **perf_event** | **cgroup/skb** |
+|  **cgroup/sock** | **cgroup/dev** | **lwt_in** | **lwt_out** | **lwt_xmit** |
+|  **lwt_seg6local** | **sockops** | **sk_skb** | **sk_msg** | **lirc_mode2** |
+|  **cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | **cgroup/post_bind6** |
+|  **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6**
+|  }
+
 
 DESCRIPTION
 ===
@@ -64,8 +73,10 @@ DESCRIPTION
 
  Note: *FILE* must be located in *bpffs* mount.
 
-   **bpftool prog load** *OBJ* *FILE* [**dev** *NAME*]
+   **bpftool prog load** *OBJ* *FILE* [**type** *TYPE*] [**dev** *NAME*]
  Load bpf program from binary *OBJ* and pin as *FILE*.
+ **type** is optional, if not specified program type will be
+ inferred from section names.
  If **dev** *NAME* is specified program will be loaded onto
  given networking device (offload).
 
diff --git a/tools/bpf/bpftool/bash-completion/bpftool 
b/tools/bpf/bpftool/bash-completion/bpftool
index ab044f528a85..9ae33a73d732 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -274,11 +274,17 @@ _bpftool()
 fi
 
 case $prev in
+type)
+COMPREPLY=( $( compgen -W "socket kprobe kretprobe 
classifier action tracepoint raw_tracepoint xdp perf_event cgroup/skb 
cgroup/sock cgroup/dev lwt_in lwt_out lwt_xmit lwt_seg6local sockops sk_skb 
sk_msg lirc_mode2 cgroup/bind4 cgroup/bind6 cgroup/connect4 cgroup/connect6 
cgroup/sendmsg4 cgroup/sendmsg6 cgroup/post_bind4 cgroup/post_bind6" -- \
+   "$cur" ) )
+return 0
+;;
 dev)
 _sysfs_get_netdevs
 return 0
 ;;
 *)
+_bpftool_once_attr 'type'
 _bpftool_once_attr 'dev'
 return 0
 ;;
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 21c74de7156f..7a06fd4c5d27 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -688,6 +688,7 @@ static int do_load(int argc, char **argv)
const char *objfile, *pinfile;
struct bpf_object *obj;
int prog_fd;
+   int err;
 
if (!REQ_ARGS(2))
return -1;
@@ -695,7 +696,37 @@ static int do_load(int argc, char **argv)
pinfile = GET_ARG();
 
while (argc) {
-   if (is_prefix(*argv, "dev")) {
+   if (is_prefix(*argv, "type")) {
+   char *type;
+
+   NEXT_ARG();
+
+   if (attr.prog_type != BPF_PROG_TYPE_UNSPEC) {
+   p_err("program type already specified");
+   return -1;
+   }
+   if (!REQ_ARGS(1))
+   return -1;
+
+   /* Put a '/' at the end of type to appease libbpf */
+   type = malloc(strlen(*argv) + 2);
+   if (!type) {
+   p_err("mem alloc failed");
+ 

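The hunk above is truncated in this archive, but the allocation it begins — `strlen(*argv) + 2` — leaves room for the trailing '/' that libbpf's section-name matching expects, plus the NUL terminator. A self-contained sketch of that string handling (the function name is illustrative, not bpftool's):

```c
#include <stdlib.h>
#include <string.h>

/* Copy "type" and append a trailing '/' so it matches libbpf's
 * section-name convention, e.g. "xdp" -> "xdp/".  Caller frees.
 */
static char *type_with_slash(const char *type)
{
	size_t len = strlen(type);
	char *buf = malloc(len + 2);	/* +1 for '/', +1 for NUL */

	if (!buf)
		return NULL;
	memcpy(buf, type, len);
	buf[len] = '/';
	buf[len + 1] = '\0';
	return buf;
}
```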
[PATCH bpf-next 09/11] tools: bpftool: reimplement bpf_prog_load() for prog load

2018-07-03 Thread Jakub Kicinski
bpf_prog_load() is a very useful helper but it doesn't give us full
flexibility of modifying the BPF objects before loading.  Open code
bpf_prog_load() in bpftool so we can add extra logic in following
commits.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/bpf/bpftool/prog.c | 57 
 1 file changed, 46 insertions(+), 11 deletions(-)

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 7a06fd4c5d27..267d653c93f5 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -43,6 +43,8 @@
 #include 
 #include 
 
+#include 
+
 #include 
 #include 
 
@@ -682,12 +684,15 @@ static int do_pin(int argc, char **argv)
 
 static int do_load(int argc, char **argv)
 {
-   struct bpf_prog_load_attr attr = {
+   enum bpf_attach_type expected_attach_type;
+   struct bpf_object_open_attr attr = {
.prog_type  = BPF_PROG_TYPE_UNSPEC,
};
const char *objfile, *pinfile;
+   struct bpf_program *prog;
struct bpf_object *obj;
-   int prog_fd;
+   struct bpf_map *map;
+   __u32 ifindex = 0;
int err;
 
if (!REQ_ARGS(2))
@@ -719,7 +724,7 @@ static int do_load(int argc, char **argv)
strcat(type, "/");
 
err = libbpf_prog_type_by_string(type, &attr.prog_type,
-&attr.expected_attach_type);
+&expected_attach_type);
free(type);
if (err < 0) {
p_err("unknown program type '%s'", *argv);
@@ -729,15 +734,15 @@ static int do_load(int argc, char **argv)
} else if (is_prefix(*argv, "dev")) {
NEXT_ARG();
 
-   if (attr.ifindex) {
+   if (ifindex) {
p_err("offload device already specified");
return -1;
}
if (!REQ_ARGS(1))
return -1;
 
-   attr.ifindex = if_nametoindex(*argv);
-   if (!attr.ifindex) {
+   ifindex = if_nametoindex(*argv);
+   if (!ifindex) {
p_err("unrecognized netdevice '%s': %s",
  *argv, strerror(errno));
return -1;
@@ -750,14 +755,44 @@ static int do_load(int argc, char **argv)
}
}
 
-   attr.file = objfile;
-
-   if (bpf_prog_load_xattr(&attr, &obj, &prog_fd)) {
-   p_err("failed to load program");
+   obj = bpf_object__open_xattr(objfile, &attr);
+   if (IS_ERR_OR_NULL(obj)) {
+   p_err("failed to open object file");
return -1;
}
 
-   if (do_pin_fd(prog_fd, pinfile))
+   prog = bpf_program__next(NULL, obj);
+   if (!prog) {
+   p_err("object file doesn't contain any bpf program");
+   goto err_close_obj;
+   }
+
+   bpf_program__set_ifindex(prog, ifindex);
+   if (attr.prog_type == BPF_PROG_TYPE_UNSPEC) {
+   const char *sec_name = bpf_program__title(prog, false);
+
+   err = libbpf_prog_type_by_string(sec_name, &attr.prog_type,
+&expected_attach_type);
+   if (err < 0) {
+			p_err("failed to guess program type based on section name %s\n",
+			      sec_name);
+   goto err_close_obj;
+   }
+   }
+   bpf_program__set_type(prog, attr.prog_type);
+   bpf_program__set_expected_attach_type(prog, expected_attach_type);
+
+   bpf_map__for_each(map, obj)
+   if (!bpf_map__is_offload_neutral(map))
+   bpf_map__set_ifindex(map, ifindex);
+
+   err = bpf_object__load(obj);
+   if (err) {
+   p_err("failed to load object file");
+   goto err_close_obj;
+   }
+
+   if (do_pin_fd(bpf_program__fd(prog), pinfile))
goto err_close_obj;
 
if (json_output)
-- 
2.17.1
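The open-coded flow above falls back to guessing the program type from the ELF section title via libbpf_prog_type_by_string(). That helper is essentially a prefix match over a table of known section names; a simplified, self-contained sketch of the idea (the table contents and enum are illustrative stand-ins, not libbpf's actual table):

```c
#include <string.h>

/* Illustrative stand-ins for libbpf's program-type enum and its
 * section-name prefix table.
 */
enum sketch_prog_type { PT_UNSPEC, PT_SOCKET, PT_KPROBE, PT_XDP };

static const struct {
	const char *prefix;
	enum sketch_prog_type type;
} sec_map[] = {
	{ "socket",  PT_SOCKET },
	{ "kprobe/", PT_KPROBE },
	{ "xdp",     PT_XDP },
};

/* Return 0 and set *type when the section name starts with a known
 * prefix, -1 otherwise -- mirroring the helper's error contract.
 */
static int prog_type_by_section(const char *sec, enum sketch_prog_type *type)
{
	for (size_t i = 0; i < sizeof(sec_map) / sizeof(sec_map[0]); i++) {
		if (!strncmp(sec, sec_map[i].prefix, strlen(sec_map[i].prefix))) {
			*type = sec_map[i].type;
			return 0;
		}
	}
	return -1;	/* unknown section prefix */
}
```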



[PATCH bpf-next 11/11] tools: bpftool: allow reuse of maps with bpftool prog load

2018-07-03 Thread Jakub Kicinski
Add a map parameter to prog load to allow reuse of existing
maps instead of creating new ones.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 .../bpftool/Documentation/bpftool-prog.rst|  20 ++-
 tools/bpf/bpftool/bash-completion/bpftool |  67 +++-
 tools/bpf/bpftool/main.h  |   3 +
 tools/bpf/bpftool/map.c   |   4 +-
 tools/bpf/bpftool/prog.c  | 148 --
 5 files changed, 219 insertions(+), 23 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst 
b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
index e53e1ad2caf0..64156a16d530 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
@@ -24,9 +24,10 @@ MAP COMMANDS
 |  **bpftool** **prog dump xlated** *PROG* [{**file** *FILE* | **opcodes** 
| **visual**}]
 |  **bpftool** **prog dump jited**  *PROG* [{**file** *FILE* | 
**opcodes**}]
 |  **bpftool** **prog pin** *PROG* *FILE*
-|  **bpftool** **prog load** *OBJ* *FILE* [**type** *TYPE*] [**dev** 
*NAME*]
+|  **bpftool** **prog load** *OBJ* *FILE* [**type** *TYPE*] [**map** 
{**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
 |  **bpftool** **prog help**
 |
+|  *MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
 |  *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* }
 |  *TYPE* := {
 |  **socket** | **kprobe** | **kretprobe** | **classifier** | 
**action** |
@@ -73,10 +74,17 @@ DESCRIPTION
 
  Note: *FILE* must be located in *bpffs* mount.
 
-   **bpftool prog load** *OBJ* *FILE* [**type** *TYPE*] [**dev** *NAME*]
+   **bpftool prog load** *OBJ* *FILE* [**type** *TYPE*] [**map** {**idx** 
*IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
  Load bpf program from binary *OBJ* and pin as *FILE*.
  **type** is optional, if not specified program type will be
  inferred from section names.
+ By default bpftool will create new maps as declared in the ELF
+ object being loaded.  **map** parameter allows for the reuse
+ of existing maps.  It can be specified multiple times, each
+ time for a different map.  *IDX* refers to index of the map
+ to be replaced in the ELF file counting from 0, while *NAME*
+ allows to replace a map by name.  *MAP* specifies the map to
+ use, referring to it by **id** or through a **pinned** file.
  If **dev** *NAME* is specified program will be loaded onto
  given networking device (offload).
 
@@ -172,6 +180,14 @@ EXAMPLES
 mov%rbx,0x0(%rbp)
 48 89 5d 00
 
+|
+| **# bpftool prog load xdp1_kern.o /sys/fs/bpf/xdp1 type xdp map name rxcnt 
id 7**
+| **# bpftool prog show pinned /sys/fs/bpf/xdp1**
+|   9: xdp  name xdp_prog1  tag 539ec6ce11b52f98  gpl
+|  loaded_at 2018-06-25T16:17:31-0700  uid 0
+|  xlated 488B  jited 336B  memlock 4096B  map_ids 7
+| **# rm /sys/fs/bpf/xdp1**
+|
 
 SEE ALSO
 
diff --git a/tools/bpf/bpftool/bash-completion/bpftool 
b/tools/bpf/bpftool/bash-completion/bpftool
index 9ae33a73d732..626598964cee 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -99,6 +99,29 @@ _bpftool_get_prog_tags()
 command sed -n 's/.*"tag": "\(.*\)",$/\1/p' )" -- "$cur" ) )
 }
 
+_bpftool_get_obj_map_names()
+{
+local obj
+
+obj=$1
+
+maps=$(objdump -j maps -t $obj 2>/dev/null | \
+command awk '/g . maps/ {print $NF}')
+
+COMPREPLY+=( $( compgen -W "$maps" -- "$cur" ) )
+}
+
+_bpftool_get_obj_map_idxs()
+{
+local obj
+
+obj=$1
+
+nmaps=$(objdump -j maps -t $obj 2>/dev/null | grep -c 'g . maps')
+
+COMPREPLY+=( $( compgen -W "$(seq 0 $((nmaps - 1)))" -- "$cur" ) )
+}
+
 _sysfs_get_netdevs()
 {
 COMPREPLY+=( $( compgen -W "$( ls /sys/class/net 2>/dev/null )" -- \
@@ -220,12 +243,14 @@ _bpftool()
 # Completion depends on object and command in use
 case $object in
 prog)
-case $prev in
-id)
-_bpftool_get_prog_ids
-return 0
-;;
-esac
+if [[ $command != "load" ]]; then
+case $prev in
+id)
+_bpftool_get_prog_ids
+return 0
+;;
+esac
+fi
 
 local PROG_TYPE='id pinned tag'
 case $command in
@@ -268,22 +293,52 @@ _bpftool()
 return 0
 ;;
 load)
+local obj
+
 if [[ ${#words[@]} -lt 6 ]]; then
 _filedir
 return 0
 fi
 
+obj=${w

[PATCH bpf-next 07/11] tools: libbpf: recognize offload neutral maps

2018-07-03 Thread Jakub Kicinski
Add a helper to libbpf for recognizing maps which should not have
their ifindex set when the program is loaded.  These maps only contain
host metadata and therefore are not marked for offload, e.g.
the perf event map.

Use this helper in bpf_prog_load_xattr().

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/lib/bpf/libbpf.c | 8 +++-
 tools/lib/bpf/libbpf.h | 1 +
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 30f3e58bd563..edc3b0b3737d 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2154,6 +2154,11 @@ void *bpf_map__priv(struct bpf_map *map)
return map ? map->priv : ERR_PTR(-EINVAL);
 }
 
+bool bpf_map__is_offload_neutral(struct bpf_map *map)
+{
+   return map->def.type == BPF_MAP_TYPE_PERF_EVENT_ARRAY;
+}
+
 void bpf_map__set_ifindex(struct bpf_map *map, __u32 ifindex)
 {
map->map_ifindex = ifindex;
@@ -2278,7 +2283,8 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr 
*attr,
}
 
bpf_map__for_each(map, obj) {
-   map->map_ifindex = attr->ifindex;
+   if (!bpf_map__is_offload_neutral(map))
+   map->map_ifindex = attr->ifindex;
}
 
if (!first_prog) {
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 617dacfc6704..3122d74f2643 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -255,6 +255,7 @@ typedef void (*bpf_map_clear_priv_t)(struct bpf_map *, void 
*);
 int bpf_map__set_priv(struct bpf_map *map, void *priv,
  bpf_map_clear_priv_t clear_priv);
 void *bpf_map__priv(struct bpf_map *map);
+bool bpf_map__is_offload_neutral(struct bpf_map *map);
 void bpf_map__set_ifindex(struct bpf_map *map, __u32 ifindex);
 int bpf_map__pin(struct bpf_map *map, const char *path);
 
-- 
2.17.1
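The logic of the hunk above can be sketched standalone: when an offload ifindex is applied to an object's maps, offload-neutral maps (the perf event array, which only carries host metadata) are skipped. Types and names below are simplified stand-ins for libbpf's internals:

```c
#include <stdbool.h>

/* Simplified stand-ins for bpf map types and the map object. */
enum sketch_map_type { MT_HASH, MT_ARRAY, MT_PERF_EVENT_ARRAY };

struct sketch_map {
	enum sketch_map_type type;
	unsigned int ifindex;
};

/* Offload-neutral maps hold host-side metadata only and must not be
 * bound to the offload device.
 */
static bool map_is_offload_neutral(const struct sketch_map *map)
{
	return map->type == MT_PERF_EVENT_ARRAY;
}

static void maps_set_ifindex(struct sketch_map *maps, int n,
			     unsigned int ifindex)
{
	for (int i = 0; i < n; i++)
		if (!map_is_offload_neutral(&maps[i]))
			maps[i].ifindex = ifindex;
}
```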



[PATCH bpf-next 10/11] tools: libbpf: allow map reuse

2018-07-03 Thread Jakub Kicinski
More advanced applications may want to only replace programs without
destroying associated maps.  Allow libbpf users to achieve that.
Instead of always creating all of the maps at load time, expose to
users an API to reconstruct the map object from an already existing
map.

The map parameters are read from the kernel and replace the parameters
of the ELF map.  libbpf does not restrict the map replacement, i.e.
the reused map does not have to be compatible with the ELF map
definition.  We rely on the verifier to check the compatibility
between maps and programs.  The ELF map definition is completely
overwritten by the information read from the kernel, to make sure
libbpf's view of the map object corresponds to the actual map.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/lib/bpf/libbpf.c | 35 +++
 tools/lib/bpf/libbpf.h |  1 +
 2 files changed, 36 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 5b0e84fbcf71..916810207c2a 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -214,6 +214,7 @@ struct bpf_map {
int fd;
char *name;
size_t offset;
+   bool fd_preset;
int map_ifindex;
struct bpf_map_def def;
uint32_t btf_key_type_id;
@@ -1081,6 +1082,34 @@ static int bpf_map_find_btf_info(struct bpf_map *map, 
const struct btf *btf)
return 0;
 }
 
+int bpf_map__reuse_fd(struct bpf_map *map, int fd)
+{
+   struct bpf_map_info info = {};
+   __u32 len = sizeof(info);
+   int err;
+
+   err = bpf_obj_get_info_by_fd(fd, &info, &len);
+   if (err)
+   return err;
+
+   map->fd = dup(fd);
+   if (map->fd < 0)
+   return map->fd;
+   map->fd_preset = true;
+
+   free(map->name);
+   map->name = strdup(info.name);
+   map->def.type = info.type;
+   map->def.key_size = info.key_size;
+   map->def.value_size = info.value_size;
+   map->def.max_entries = info.max_entries;
+   map->def.map_flags = info.map_flags;
+   map->btf_key_type_id = info.btf_key_type_id;
+   map->btf_value_type_id = info.btf_value_type_id;
+
+   return 0;
+}
+
 static int
 bpf_object__create_maps(struct bpf_object *obj)
 {
@@ -1093,6 +1122,12 @@ bpf_object__create_maps(struct bpf_object *obj)
struct bpf_map_def *def = &map->def;
int *pfd = &map->fd;
 
+   if (map->fd_preset) {
+   pr_debug("skip map create (preset) %s: fd=%d\n",
+map->name, map->fd);
+   continue;
+   }
+
create_attr.name = map->name;
create_attr.map_ifindex = map->map_ifindex;
create_attr.map_type = def->type;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 60593ac44700..8e709a74f47c 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -261,6 +261,7 @@ typedef void (*bpf_map_clear_priv_t)(struct bpf_map *, void 
*);
 int bpf_map__set_priv(struct bpf_map *map, void *priv,
  bpf_map_clear_priv_t clear_priv);
 void *bpf_map__priv(struct bpf_map *map);
+int bpf_map__reuse_fd(struct bpf_map *map, int fd);
 bool bpf_map__is_offload_neutral(struct bpf_map *map);
 void bpf_map__set_ifindex(struct bpf_map *map, __u32 ifindex);
 int bpf_map__pin(struct bpf_map *map, const char *path);
-- 
2.17.1
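The reuse flow above condenses into a standalone sketch: the kernel-reported parameters overwrite the ELF definition, and fd_preset marks the map so bpf_object__create_maps() skips it. The structures here are simplified stand-ins (the real code also dup()s the fd and copies the BTF type ids):

```c
#include <stdbool.h>

/* Simplified stand-ins for libbpf's map definition and the info
 * struct returned by bpf_obj_get_info_by_fd().
 */
struct sketch_map_def { unsigned int key_size, value_size, max_entries; };

struct sketch_map {
	int fd;
	bool fd_preset;
	struct sketch_map_def def;
};

struct sketch_map_info { unsigned int key_size, value_size, max_entries; };

/* Adopt an existing map: take its fd, remember it is preset so map
 * creation is skipped, and let the kernel's view of the parameters
 * completely overwrite the ELF definition.
 */
static void map_reuse(struct sketch_map *map, int fd,
		      const struct sketch_map_info *info)
{
	map->fd = fd;		/* the real code dup()s the fd */
	map->fd_preset = true;
	map->def.key_size = info->key_size;
	map->def.value_size = info->value_size;
	map->def.max_entries = info->max_entries;
}
```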



[PATCH bpf-next 03/11] tools: bpftool: refactor argument parsing for prog load

2018-07-03 Thread Jakub Kicinski
Add a new macro for printing a more informative message than a straight
usage() when parameters are missing, and use it in prog do_load().
Save the object and pin path arguments to variables for clarity.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/bpf/bpftool/main.h | 15 +++
 tools/bpf/bpftool/prog.c | 11 +++
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/tools/bpf/bpftool/main.h b/tools/bpf/bpftool/main.h
index d39f7ef01d23..15b6c49ae533 100644
--- a/tools/bpf/bpftool/main.h
+++ b/tools/bpf/bpftool/main.h
@@ -50,6 +50,21 @@
 #define NEXT_ARG() ({ argc--; argv++; if (argc < 0) usage(); })
 #define NEXT_ARGP()({ (*argc)--; (*argv)++; if (*argc < 0) usage(); })
 #define BAD_ARG()  ({ p_err("what is '%s'?", *argv); -1; })
+#define GET_ARG()  ({ argc--; *argv++; })
+#define REQ_ARGS(cnt)  \
+   ({  \
+   int _cnt = (cnt);   \
+   bool _res;  \
+   \
+   if (argc < _cnt) {  \
+   p_err("'%s' needs at least %d arguments, %d found", \
+ argv[-1], _cnt, argc);\
+   _res = false;   \
+   } else {\
+   _res = true;\
+   }   \
+   _res;   \
+   })
 
 #define ERR_MAX_LEN1024
 
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index a740da99d477..a5ef46c59029 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -681,18 +681,21 @@ static int do_pin(int argc, char **argv)
 
 static int do_load(int argc, char **argv)
 {
+   const char *objfile, *pinfile;
struct bpf_object *obj;
int prog_fd;
 
-   if (argc != 2)
-   usage();
+   if (!REQ_ARGS(2))
+   return -1;
+   objfile = GET_ARG();
+   pinfile = GET_ARG();
 
-   if (bpf_prog_load(argv[0], BPF_PROG_TYPE_UNSPEC, &obj, &prog_fd)) {
+   if (bpf_prog_load(objfile, BPF_PROG_TYPE_UNSPEC, &obj, &prog_fd)) {
p_err("failed to load program");
return -1;
}
 
-   if (do_pin_fd(prog_fd, argv[1]))
+   if (do_pin_fd(prog_fd, pinfile))
goto err_close_obj;
 
if (json_output)
-- 
2.17.1
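The macros above rely on GCC statement expressions and on capturing the enclosing function's argc/argv by name. A standalone demo of the pattern, with p_err() swapped for fprintf() so it compiles outside bpftool:

```c
#include <stdbool.h>
#include <stdio.h>

/* Consume one argument; decrements argc and advances argv like
 * bpftool's GET_ARG().
 */
#define GET_ARG()	({ argc--; *argv++; })

/* Check that at least cnt arguments remain; argv[-1] is the command
 * word that was consumed before the handler was entered.
 */
#define REQ_ARGS(cnt)						\
	({							\
		int _cnt = (cnt);				\
		bool _res;					\
								\
		if (argc < _cnt) {				\
			fprintf(stderr,				\
				"'%s' needs at least %d arguments, %d found\n", \
				argv[-1], _cnt, argc);		\
			_res = false;				\
		} else {					\
			_res = true;				\
		}						\
		_res;						\
	})

/* Minimal handler in the style of do_load(): two required args. */
static int parse_load(int argc, char **argv,
		      const char **objfile, const char **pinfile)
{
	if (!REQ_ARGS(2))
		return -1;
	*objfile = GET_ARG();
	*pinfile = GET_ARG();
	return 0;
}
```

Because the macros reference `argc`/`argv` directly, they only work inside functions that declare parameters with exactly those names — a deliberate design choice that keeps every bpftool command handler terse.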



[PATCH bpf-next 02/11] selftests/bpf: add Error: prefix in check_extack helper

2018-07-03 Thread Jakub Kicinski
Currently the test only checks errors, not warnings, so save typing
and prefix the extack messages with "Error:" inside the check helper.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/testing/selftests/bpf/test_offload.py | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_offload.py 
b/tools/testing/selftests/bpf/test_offload.py
index a257e4b08392..f8d9bd81d9a4 100755
--- a/tools/testing/selftests/bpf/test_offload.py
+++ b/tools/testing/selftests/bpf/test_offload.py
@@ -547,11 +547,11 @@ netns = [] # net namespaces to be removed
 if skip_extack:
 return
 lines = output.split("\n")
-comp = len(lines) >= 2 and lines[1] == reference
+comp = len(lines) >= 2 and lines[1] == 'Error: ' + reference
 fail(not comp, "Missing or incorrect netlink extack message")
 
 def check_extack_nsim(output, reference, args):
-check_extack(output, "Error: netdevsim: " + reference, args)
+check_extack(output, "netdevsim: " + reference, args)
 
 def check_no_extack(res, needle):
 fail((res[1] + res[2]).count(needle) or (res[1] + 
res[2]).count("Warning:"),
@@ -654,7 +654,7 @@ netns = []
 ret, _, err = sim.cls_bpf_add_filter(obj, skip_sw=True,
  fail=False, include_stderr=True)
 fail(ret == 0, "TC filter loaded without enabling TC offloads")
-check_extack(err, "Error: TC offload is disabled on net device.", args)
+check_extack(err, "TC offload is disabled on net device.", args)
 sim.wait_for_flush()
 
 sim.set_ethtool_tc_offloads(True)
@@ -694,7 +694,7 @@ netns = []
  skip_sw=True,
  fail=False, include_stderr=True)
 fail(ret == 0, "Offloaded a filter to chain other than 0")
-check_extack(err, "Error: Driver supports only offload of chain 0.", args)
+check_extack(err, "Driver supports only offload of chain 0.", args)
 sim.tc_flush_filters()
 
 start_test("Test TC replace...")
-- 
2.17.1



Re: [PATCH] net: phy: marvell: change default m88e1510 LED configuration

2018-07-03 Thread David Miller
From: Wang Dongsheng 
Date: Sun, 1 Jul 2018 23:15:46 -0700

> The m88e1121 LED default configuration does not apply m88e151x.
> So add a function to replace the m88e1121 LED configuration.
> 
> Signed-off-by: Wang Dongsheng 

Applied, thank you.


Re: [PATCHv2 net] sctp: fix the issue that pathmtu may be set lower than MINSEGMENT

2018-07-03 Thread Marcelo Ricardo Leitner
On Tue, Jul 03, 2018 at 04:30:47PM +0800, Xin Long wrote:
> After commit b6c5734db070 ("sctp: fix the handling of ICMP Frag Needed
> for too small MTUs"), sctp_transport_update_pmtu would refetch pathmtu
> from the dst and set it to transport's pathmtu without any check.
> 
> The new pathmtu may be lower than MINSEGMENT if the dst is obsolete and
> updated by .get_dst() in sctp_transport_update_pmtu. In this case, it
> could have a smaller MTU as well, and thus we should validate it
> against MINSEGMENT instead.
> 
> Syzbot reported a warning in sctp_mtu_payload caused by this.
> 
> This patch refetches the pathmtu by calling sctp_dst_mtu where it does
> the check against MINSEGMENT.
> 
> v1->v2:
>   - refetch the pathmtu by calling sctp_dst_mtu instead as Marcelo's
> suggestion.
> 
> Fixes: b6c5734db070 ("sctp: fix the handling of ICMP Frag Needed for too 
> small MTUs")
> Reported-by: syzbot+f0d9d7cba052f9344...@syzkaller.appspotmail.com
> Suggested-by: Marcelo Ricardo Leitner 
> Signed-off-by: Xin Long 

Acked-by: Marcelo Ricardo Leitner 

> ---
>  net/sctp/transport.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sctp/transport.c b/net/sctp/transport.c
> index 445b7ef..12cac85 100644
> --- a/net/sctp/transport.c
> +++ b/net/sctp/transport.c
> @@ -282,7 +282,7 @@ bool sctp_transport_update_pmtu(struct sctp_transport *t, 
> u32 pmtu)
>  
>   if (dst) {
>   /* Re-fetch, as under layers may have a higher minimum size */
> - pmtu = SCTP_TRUNC4(dst_mtu(dst));
> + pmtu = sctp_dst_mtu(dst);
>   change = t->pathmtu != pmtu;
>   }
>   t->pathmtu = pmtu;
> -- 
> 2.1.0
> 

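The fix works because sctp_dst_mtu() applies the MINSEGMENT lower bound that the raw dst_mtu() fetch lacked. A standalone sketch of that clamping (the 512-byte constant matches the kernel's SCTP_DEFAULT_MINSEGMENT; the helper name here is illustrative):

```c
/* Lower-bound the dst MTU at the SCTP minimum segment size and
 * truncate to a multiple of 4, in the spirit of sctp_dst_mtu().
 */
#define SCTP_DEFAULT_MINSEGMENT	512
#define SCTP_TRUNC4(x)		((x) & ~3u)

static unsigned int sketch_dst_mtu(unsigned int dst_mtu)
{
	unsigned int mtu = dst_mtu > SCTP_DEFAULT_MINSEGMENT ?
			   dst_mtu : SCTP_DEFAULT_MINSEGMENT;

	return SCTP_TRUNC4(mtu);
}
```

With this bound in place, a stale dst reporting a tiny MTU can no longer push the transport's pathmtu below the minimum that triggered the syzbot warning.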

Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-03 Thread Saeed Mahameed
On Tue, 2018-07-03 at 16:01 -0700, Alexei Starovoitov wrote:
> On Tue, Jun 26, 2018 at 07:46:11PM -0700, Saeed Mahameed wrote:
> > The idea from this patch is to define a well known structure for
> > XDP meta
> > data fields format and offset placement inside the xdp data meta
> > buffer.
> > 
> > For that driver will need some static information to know what to
> > provide and where, enters struct xdp_md_info and xdp_md_info_arr:
> > 
> > struct xdp_md_info {
> >__u16 present:1;
> >__u16 offset:15; /* offset from data_meta in xdp_md buffer
> > */
> > };
> > 
> > /* XDP meta data offsets info array
> >  * present bit describes if a meta data is or will be present in
> > xdp_md buff
> >  * offset describes where a meta data is or should be placed in
> > xdp_md buff
> >  *
> >  * Kernel builds this array using xdp_md_info_build helper on
> > demand.
> >  * User space builds it statically in the xdp program.
> >  */
> > typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
> > 
> > Offsets in xdp_md_info_arr are always in ascending order and only
> > for
> > requested meta data:
> > example : flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
> > 
> > xdp_md_info_arr mdi = {
> > [XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
> > [XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
> > [XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash),
> > .present = 1},
> > [XDP_DATA_META_CSUM] = {.offset = 0, .present = 0},
> > }
> > 
> > i.e: hash fields will always appear first then the vlan for every
> > xdp_md.
> > 
> > Once requested to provide xdp meta data, device driver will use a
> > pre-built
> > xdp_md_info_arr which was built via xdp_md_info_build on xdp setup,
> > xdp_md_info_arr will tell the driver what is the offset of each
> > meta data.
> > 
> > This patch also provides helper functions for the device drivers to
> > store
> > meta data into xdp_buff, and helper function for XDP programs to
> > fetch
> > specific xdp meta data from xdp_md buffer.
> > 
> > Signed-off-by: Saeed Mahameed 
> 
> ...
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 59b19b6a40d7..e320e7421565 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -2353,6 +2353,120 @@ struct xdp_md {
> > __u32 rx_queue_index;  /* rxq->queue_index  */
> >  };
> >  
> > +enum {
> > +   XDP_DATA_META_HASH = 0,
> > +   XDP_DATA_META_MARK,
> > +   XDP_DATA_META_VLAN,
> > +   XDP_DATA_META_CSUM_COMPLETE,
> > +
> > +   XDP_DATA_META_MAX,
> > +};
> > +
> > +struct xdp_md_hash {
> > +   __u32 hash;
> > +   __u8  type;
> > +} __attribute__((aligned(4)));
> > +
> > +struct xdp_md_mark {
> > +   __u32 mark;
> > +} __attribute__((aligned(4)));
> > +
> > +struct xdp_md_vlan {
> > +   __u16 vlan;
> > +} __attribute__((aligned(4)));
> > +
> > +struct xdp_md_csum {
> > +   __u16 csum;
> > +} __attribute__((aligned(4)));
> > +
> > +static const __u16 xdp_md_size[XDP_DATA_META_MAX] = {
> > +   sizeof(struct xdp_md_hash), /* XDP_DATA_META_HASH */
> > +   sizeof(struct xdp_md_mark), /* XDP_DATA_META_MARK */
> > +   sizeof(struct xdp_md_vlan), /* XDP_DATA_META_VLAN */
> > +   sizeof(struct xdp_md_csum), /* XDP_DATA_META_CSUM_COMPLETE
> > */
> > +};
> > +
> > +struct xdp_md_info {
> > +   __u16 present:1;
> > +   __u16 offset:15; /* offset from data_meta in xdp_md buffer
> > */
> > +};
> > +
> > +/* XDP meta data offsets info array
> > + * present bit describes if a meta data is or will be present in
> > xdp_md buff
> > + * offset describes where a meta data is or should be placed in
> > xdp_md buff
> > + *
> > + * Kernel builds this array using xdp_md_info_build helper on
> > demand.
> > + * User space builds it statically in the xdp program.
> > + */
> > +typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
> > +
> > +static __always_inline __u8
> > +xdp_data_meta_present(xdp_md_info_arr mdi, int meta_data)
> > +{
> > +   return mdi[meta_data].present;
> > +}
> > +
> > +static __always_inline void
> > +xdp_data_meta_set_hash(xdp_md_info_arr mdi, void *data_meta, __u32
> > hash, __u8 htype)
> > +{
> > +   struct xdp_md_hash *hash_md;
> > +
> > +   hash_md = (struct xdp_md_hash *)((char*)data_meta +
> > mdi[XDP_DATA_META_HASH].offset);
> > +   hash_md->hash = hash;
> > +   hash_md->type = htype;
> > +}
> > +
> > +static __always_inline struct xdp_md_hash *
> > +xdp_data_meta_get_hash(xdp_md_info_arr mdi, void *data_meta)
> > +{
> > +   return (struct xdp_md_hash *)((char*)data_meta +
> > mdi[XDP_DATA_META_HASH].offset);
> > +}
> 
> I'm afraid this is not scalable uapi.
> This looks very much mlx specific. Every NIC vendor cannot add its
> own
> struct definitions into uapi/bpf.h. It doesn't scale.

No, this is not mlx specific: all you can find here are the standard
hash/csum/vlan/flow-mark fields as described in the SKB.  We want them
to be well defined so you can construct SKBs with these standard values
from xdp_md with no surprises.
I agree this is 

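The layout rule from the RFC — enabled entries packed at ascending offsets, sized by xdp_md_size[] — can be sketched as a standalone builder. The flag bit values and the builder body below are guesses at what xdp_md_info_build would do, not kernel code; the sizes follow the aligned(4) structs in the RFC (hash is 8 bytes, the rest 4):

```c
#include <string.h>

enum {
	XDP_DATA_META_HASH = 0,
	XDP_DATA_META_MARK,
	XDP_DATA_META_VLAN,
	XDP_DATA_META_CSUM_COMPLETE,
	XDP_DATA_META_MAX,
};

/* Assumed flag encoding: one bit per metadata kind. */
#define XDP_FLAGS_META_HASH	(1U << XDP_DATA_META_HASH)
#define XDP_FLAGS_META_VLAN	(1U << XDP_DATA_META_VLAN)

struct xdp_md_info {
	unsigned short present:1;
	unsigned short offset:15;
};

/* Per-entry sizes of the RFC's aligned(4) structs. */
static const unsigned short xdp_md_size[XDP_DATA_META_MAX] = { 8, 4, 4, 4 };

/* Pack requested entries at ascending offsets, as the RFC's example
 * (hash at 0, vlan at sizeof(struct xdp_md_hash)) implies.
 */
static void md_info_build(struct xdp_md_info mdi[XDP_DATA_META_MAX],
			  unsigned int flags)
{
	unsigned short off = 0;

	memset(mdi, 0, sizeof(struct xdp_md_info) * XDP_DATA_META_MAX);
	for (int i = 0; i < XDP_DATA_META_MAX; i++) {
		if (!(flags & (1U << i)))
			continue;
		mdi[i].present = 1;
		mdi[i].offset = off;
		off += xdp_md_size[i];
	}
}
```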
Re: pull-request: bpf-next 2018-07-03

2018-07-03 Thread David Miller
From: Daniel Borkmann 
Date: Tue,  3 Jul 2018 23:18:13 +0200

> The following pull-request contains BPF updates for your *net-next* tree.
> 
> The main changes are:
 ...
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Pulled, thanks Daniel.


Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-03 Thread Saeed Mahameed
On Mon, 2018-07-02 at 10:01 +0200, Daniel Borkmann wrote:
> On 06/27/2018 07:55 PM, Saeed Mahameed wrote:
> > On Wed, 2018-06-27 at 16:15 +0200, Jesper Dangaard Brouer wrote:
> > > On Tue, 26 Jun 2018 19:46:11 -0700
> > > Saeed Mahameed  wrote:
> > > 
> > > > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > > > index 2deea7166a34..afe302613ae1 100644
> > > > --- a/include/net/xdp.h
> > > > +++ b/include/net/xdp.h
> > > > @@ -138,6 +138,12 @@ xdp_set_data_meta_invalid(struct xdp_buff
> > > > *xdp)
> > > > xdp->data_meta = xdp->data + 1;
> > > >  }
> > > >  
> > > > +static __always_inline void
> > > > +xdp_reset_data_meta(struct xdp_buff *xdp)
> > > > +{
> > > > +   xdp->data_meta = xdp->data_hard_start;
> > > > +}
> > > 
> > > This is WRONG ... it should be:
> > > 
> > >  xdp->data_meta = xdp->data;
> > 
> > maybe the name of the function is not suitable for the use case.
> > i need to set xdp->data_meta = xdp->data in the driver to start
> > storing
> > meta data.
> 
> The xdp_set_data_meta_invalid() is a straight forward way for XDP
> drivers
> to tell they do not support xdp->data_meta, since setting xdp->data +
> 1 will
> fail the checks for direct (meta) packet access in the BPF code and
> at the
> same time bpf_xdp_adjust_meta() will know to bail out with error when
> program
> attempts to make headroom for meta data that driver cannot handle
> later on.
> 
> So later setting 'xdp->data_meta = xdp->data' to enable it is as
> setting any
> other of the initializers in xdp_buff, and done so today in the i40e,
> ixgbe,
> ixgbevf and nfp drivers. (Theoretically it wouldn't have to be
> exactly set to
> xdp->data, but anything <= xdp->data works, if the driver would
> prepend info
> from hw in front of it that program can then use or later override.)
> 

Right!  As Jesper pointed out as well,
the driver should set xdp->data_meta = xdp->data
and then adjust it to the offset of the first meta data.



QDisc Implementation: Setting bit rate and getting RTT

2018-07-03 Thread Taran Lynn
Hello,
I'm new to Linux development and am working on creating a qdisc module,
similar to those under net/sched/sch_*.c. Currently I'm stuck on two
things.

1. What's the best way to set the maximum bit rate?
2. How do I determine the RTT for packets?

For (1) I'm currently tracking the number of bits sent, and only sending
packets if they stay within the appropriate rate bound. Is there a better
way to do this? I've looked at net/sched/sch_netem.c, and as far as I can
tell it determines the delay between packets for a given rate and sends
them after that delay has passed. However, when I tried to do this I got
inconsistent rates, so I think I may be missing something.

As for (2) I'm not sure where to start.

Thanks in advance for any advice.

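For question (1), the usual approach — and roughly what netem's rate handling does — is to convert each packet's length into its time on the wire at the configured rate and delay the next dequeue by that amount; inconsistent rates often come from doing this arithmetic in 32 bits (overflow) or ignoring per-packet framing overhead. A hedged sketch of the core conversion:

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Time on the wire for `len` bytes at `rate` bytes/sec, in
 * nanoseconds.  64-bit math avoids overflow for realistic rates;
 * the kernel's psched helpers do an equivalent length-to-time
 * conversion.
 */
static uint64_t pkt_time_ns(uint32_t len, uint64_t rate_bytes_per_sec)
{
	if (!rate_bytes_per_sec)
		return 0;
	return (uint64_t)len * NSEC_PER_SEC / rate_bytes_per_sec;
}
```

A qdisc would schedule its watchdog for `now + pkt_time_ns(skb->len, rate)` after each dequeue rather than counting bits against wall-clock windows, which is where rounding drift tends to creep in.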

Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-03 Thread Alexei Starovoitov
On Tue, Jun 26, 2018 at 07:46:11PM -0700, Saeed Mahameed wrote:
> The idea from this patch is to define a well known structure for XDP meta
> data fields format and offset placement inside the xdp data meta buffer.
> 
> For that driver will need some static information to know what to
> provide and where, enters struct xdp_md_info and xdp_md_info_arr:
> 
> struct xdp_md_info {
>__u16 present:1;
>__u16 offset:15; /* offset from data_meta in xdp_md buffer */
> };
> 
> /* XDP meta data offsets info array
>  * present bit describes if a meta data is or will be present in xdp_md buff
>  * offset describes where a meta data is or should be placed in xdp_md buff
>  *
>  * Kernel builds this array using xdp_md_info_build helper on demand.
>  * User space builds it statically in the xdp program.
>  */
> typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
> 
> Offsets in xdp_md_info_arr are always in ascending order and only for
> requested meta data:
> example : flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
> 
> xdp_md_info_arr mdi = {
>   [XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
>   [XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
>   [XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present 
> = 1},
>   [XDP_DATA_META_CSUM] = {.offset = 0, .present = 0},
> }
> 
> i.e: hash fields will always appear first then the vlan for every
> xdp_md.
> 
> Once requested to provide xdp meta data, device driver will use a pre-built
> xdp_md_info_arr which was built via xdp_md_info_build on xdp setup,
> xdp_md_info_arr will tell the driver what is the offset of each meta data.
> 
> This patch also provides helper functions for the device drivers to store
> meta data into xdp_buff, and helper function for XDP programs to fetch
> specific xdp meta data from xdp_md buffer.
> 
> Signed-off-by: Saeed Mahameed 
...
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 59b19b6a40d7..e320e7421565 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2353,6 +2353,120 @@ struct xdp_md {
>   __u32 rx_queue_index;  /* rxq->queue_index  */
>  };
>  
> +enum {
> + XDP_DATA_META_HASH = 0,
> + XDP_DATA_META_MARK,
> + XDP_DATA_META_VLAN,
> + XDP_DATA_META_CSUM_COMPLETE,
> +
> + XDP_DATA_META_MAX,
> +};
> +
> +struct xdp_md_hash {
> + __u32 hash;
> + __u8  type;
> +} __attribute__((aligned(4)));
> +
> +struct xdp_md_mark {
> + __u32 mark;
> +} __attribute__((aligned(4)));
> +
> +struct xdp_md_vlan {
> + __u16 vlan;
> +} __attribute__((aligned(4)));
> +
> +struct xdp_md_csum {
> + __u16 csum;
> +} __attribute__((aligned(4)));
> +
> +static const __u16 xdp_md_size[XDP_DATA_META_MAX] = {
> + sizeof(struct xdp_md_hash), /* XDP_DATA_META_HASH */
> + sizeof(struct xdp_md_mark), /* XDP_DATA_META_MARK */
> + sizeof(struct xdp_md_vlan), /* XDP_DATA_META_VLAN */
> + sizeof(struct xdp_md_csum), /* XDP_DATA_META_CSUM_COMPLETE */
> +};
> +
> +struct xdp_md_info {
> + __u16 present:1;
> + __u16 offset:15; /* offset from data_meta in xdp_md buffer */
> +};
> +
> +/* XDP meta data offsets info array
> + * present bit describes if a meta data is or will be present in xdp_md buff
> + * offset describes where a meta data is or should be placed in xdp_md buff
> + *
> + * Kernel builds this array using xdp_md_info_build helper on demand.
> + * User space builds it statically in the xdp program.
> + */
> +typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
> +
> +static __always_inline __u8
> +xdp_data_meta_present(xdp_md_info_arr mdi, int meta_data)
> +{
> + return mdi[meta_data].present;
> +}
> +
> +static __always_inline void
> +xdp_data_meta_set_hash(xdp_md_info_arr mdi, void *data_meta, __u32 hash, __u8 htype)
> +{
> + struct xdp_md_hash *hash_md;
> +
> + hash_md = (struct xdp_md_hash *)((char *)data_meta + mdi[XDP_DATA_META_HASH].offset);
> + hash_md->hash = hash;
> + hash_md->type = htype;
> +}
> +
> +static __always_inline struct xdp_md_hash *
> +xdp_data_meta_get_hash(xdp_md_info_arr mdi, void *data_meta)
> +{
> + return (struct xdp_md_hash *)((char *)data_meta + mdi[XDP_DATA_META_HASH].offset);
> +}

I'm afraid this is not a scalable uapi.
This looks very much mlx specific. NIC vendors cannot each add their own
struct definitions into uapi/bpf.h. It doesn't scale.
Also doing 15 bit bitfield extraction using bpf instructions is not that simple.
mlx should consider doing plain u8 or u16 fields in firmware instead.
The metadata that "hw" provides is a joint work of actual asic and firmware.
I suspect the format can and will change with different firmware.
Baking this stuff into uapi/bpf.h is not an option.
How about we make driver+firmware provide a BTF definition of metadata that they
can provide? There can be multiple definitions of such structs.
Then in userspace we can have a BTF->plain C converter.
(bpftool is practically ready to do that.)

[PATCH v2 iproute2 2/2] tc: Add support for the ETF Qdisc

2018-07-03 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes 

The "Earliest TxTime First" (ETF) queueing discipline allows precise
control of the transmission time of packets by providing a sorted
time-based scheduling of packets.

The syntax is:

tc qdisc add dev DEV parent NODE etf delta NANOS \
 clockid CLOCKID [offload] [deadline_mode]
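
For concreteness, an invocation following this syntax might look like the
following (the device name and delta value are illustrative only):

```
# Software mode, packets released 300 us before their txtime:
tc qdisc add dev eth0 parent 100:1 etf delta 300000 clockid CLOCK_TAI

# Same, with HW offload and deadline mode enabled:
tc qdisc add dev eth0 parent 100:1 etf delta 300000 clockid CLOCK_TAI \
        offload deadline_mode
```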

Signed-off-by: Vinicius Costa Gomes 
Signed-off-by: Jesus Sanchez-Palencia 
---
 tc/Makefile |   1 +
 tc/q_etf.c  | 168 
 2 files changed, 169 insertions(+)
 create mode 100644 tc/q_etf.c

diff --git a/tc/Makefile b/tc/Makefile
index dfd00267..4525c0fb 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -71,6 +71,7 @@ TCMODULES += q_clsact.o
 TCMODULES += e_bpf.o
 TCMODULES += f_matchall.o
 TCMODULES += q_cbs.o
+TCMODULES += q_etf.o
 
 TCSO :=
 ifeq ($(TC_CONFIG_ATM),y)
diff --git a/tc/q_etf.c b/tc/q_etf.c
new file mode 100644
index ..5db1dd6f
--- /dev/null
+++ b/tc/q_etf.c
@@ -0,0 +1,168 @@
+/*
+ * q_etf.c Earliest TxTime First (ETF).
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Vinicius Costa Gomes 
+ * Jesus Sanchez-Palencia 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tc_util.h"
+
+#define CLOCKID_INVALID (-1)
+static void explain(void)
+{
+   fprintf(stderr, "Usage: ... etf delta NANOS clockid CLOCKID [offload] [deadline_mode]\n");
+   fprintf(stderr, "CLOCKID must be a valid SYS-V id (i.e. CLOCK_TAI)\n");
+}
+
+static void explain1(const char *arg, const char *val)
+{
+   fprintf(stderr, "etf: illegal value for \"%s\": \"%s\"\n", arg, val);
+}
+
+static void explain_clockid(const char *val)
+{
+   fprintf(stderr, "etf: illegal value for \"clockid\": \"%s\".\n", val);
+   fprintf(stderr, "It must be a valid SYS-V id (i.e. CLOCK_TAI)");
+}
+
+static int get_clockid(__s32 *val, const char *arg)
+{
+   const struct static_clockid {
+   const char *name;
+   clockid_t clockid;
+   } clockids_sysv[] = {
+   { "CLOCK_REALTIME", CLOCK_REALTIME },
+   { "CLOCK_TAI", CLOCK_TAI },
+   { "CLOCK_BOOTTIME", CLOCK_BOOTTIME },
+   { "CLOCK_MONOTONIC", CLOCK_MONOTONIC },
+   { NULL }
+   };
+
+   const struct static_clockid *c;
+
+   for (c = clockids_sysv; c->name; c++) {
+   if (strncasecmp(c->name, arg, 25) == 0) {
+   *val = c->clockid;
+
+   return 0;
+   }
+   }
+
+   return -1;
+}
+
+
+static int etf_parse_opt(struct qdisc_util *qu, int argc,
+char **argv, struct nlmsghdr *n, const char *dev)
+{
+   struct tc_etf_qopt opt = {
+   .clockid = CLOCKID_INVALID,
+   };
+   struct rtattr *tail;
+
+   while (argc > 0) {
+   if (matches(*argv, "offload") == 0) {
+   if (opt.flags & TC_ETF_OFFLOAD_ON) {
+   fprintf(stderr, "etf: duplicate \"offload\" specification\n");
+   return -1;
+   }
+
+   opt.flags |= TC_ETF_OFFLOAD_ON;
+   } else if (matches(*argv, "deadline_mode") == 0) {
+   if (opt.flags & TC_ETF_DEADLINE_MODE_ON) {
+   fprintf(stderr, "etf: duplicate \"deadline_mode\" specification\n");
+   return -1;
+   }
+
+   opt.flags |= TC_ETF_DEADLINE_MODE_ON;
+   } else if (matches(*argv, "delta") == 0) {
+   NEXT_ARG();
+   if (opt.delta) {
+   fprintf(stderr, "etf: duplicate \"delta\" specification\n");
+   return -1;
+   }
+   if (get_s32(&opt.delta, *argv, 0)) {
+   explain1("delta", *argv);
+   return -1;
+   }
+   } else if (matches(*argv, "clockid") == 0) {
+   NEXT_ARG();
+   if (opt.clockid != CLOCKID_INVALID) {
+   fprintf(stderr, "etf: duplicate \"clockid\" specification\n");
+   return -1;
+   }
+   if (get_clockid(&opt.clockid, *argv)) {
+   explain_clockid(*argv);
+   return -1;
+   }
+   } else if (strcmp(*argv, "help") == 0) {
+   explain();
+   retu

[PATCH v2 iproute2 1/2] uapi pkt_sched: Add etf info - DO NOT COMMIT

2018-07-03 Thread Jesus Sanchez-Palencia
This should come from the next uapi headers update.
Sending it now just as a convenience so anyone can build tc with etf
and taprio support.

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/uapi/linux/pkt_sched.h | 21 +
 1 file changed, 21 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096a..94911846 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -539,6 +539,7 @@ enum {
TCA_NETEM_LATENCY64,
TCA_NETEM_JITTER64,
TCA_NETEM_SLOT,
+   TCA_NETEM_SLOT_DIST,
__TCA_NETEM_MAX,
 };
 
@@ -581,6 +582,8 @@ struct tc_netem_slot {
__s64   max_delay;
__s32   max_packets;
__s32   max_bytes;
+   __s64   dist_delay; /* nsec */
+   __s64   dist_jitter; /* nsec */
 };
 
 enum {
@@ -934,4 +937,22 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+
+/* ETF */
+struct tc_etf_qopt {
+   __s32 delta;
+   __s32 clockid;
+   __u32 flags;
+#define TC_ETF_DEADLINE_MODE_ON	BIT(0)
+#define TC_ETF_OFFLOAD_ON	BIT(1)
+};
+
+enum {
+   TCA_ETF_UNSPEC,
+   TCA_ETF_PARMS,
+   __TCA_ETF_MAX,
+};
+
+#define TCA_ETF_MAX (__TCA_ETF_MAX - 1)
+
 #endif
-- 
2.17.1



[PATCH v2 net-next 02/14] net: Add a new socket option for a future transmit time.

2018-07-03 Thread Jesus Sanchez-Palencia
From: Richard Cochran 

This patch introduces SO_TXTIME. User space enables this option in
order to pass a desired future transmit time in a CMSG when calling
sendmsg(2). The argument to this socket option is an 8-byte struct
provided by the uapi header net_tstamp.h, defined as:

struct sock_txtime {
clockid_t   clockid;
u32 flags;
};

Note that the new fields were added to struct sock by filling a 2-byte
hole found in the struct. For that reason, neither the struct size nor
the number of cachelines was altered.

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 arch/alpha/include/uapi/asm/socket.h  |  3 +++
 arch/ia64/include/uapi/asm/socket.h   |  3 +++
 arch/mips/include/uapi/asm/socket.h   |  3 +++
 arch/parisc/include/uapi/asm/socket.h |  3 +++
 arch/s390/include/uapi/asm/socket.h   |  3 +++
 arch/sparc/include/uapi/asm/socket.h  |  3 +++
 arch/xtensa/include/uapi/asm/socket.h |  3 +++
 include/net/sock.h| 10 
 include/uapi/asm-generic/socket.h |  3 +++
 include/uapi/linux/net_tstamp.h   | 15 
 net/core/sock.c   | 35 +++
 11 files changed, 84 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h 
b/arch/alpha/include/uapi/asm/socket.h
index be14f16149d5..065fb372e355 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -112,4 +112,7 @@
 
 #define SO_ZEROCOPY	60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/ia64/include/uapi/asm/socket.h 
b/arch/ia64/include/uapi/asm/socket.h
index 3efba40adc54..c872c4e6bafb 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -114,4 +114,7 @@
 
 #define SO_ZEROCOPY	60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h 
b/arch/mips/include/uapi/asm/socket.h
index 49c3d4795963..71370fb3ceef 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -123,4 +123,7 @@
 
 #define SO_ZEROCOPY	60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h 
b/arch/parisc/include/uapi/asm/socket.h
index 1d0fdc3b5d22..061b9cf2a779 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -104,4 +104,7 @@
 
 #define SO_ZEROCOPY	0x4035
 
+#define SO_TXTIME  0x4036
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h 
b/arch/s390/include/uapi/asm/socket.h
index 3510c0fd06f4..39d901476ee5 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -111,4 +111,7 @@
 
 #define SO_ZEROCOPY	60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h 
b/arch/sparc/include/uapi/asm/socket.h
index d58520c2e6ff..7ea35e5601b6 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -101,6 +101,9 @@
 
 #define SO_ZEROCOPY	0x003e
 
+#define SO_TXTIME  0x003f
+#define SCM_TXTIME SO_TXTIME
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION 0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT   0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h 
b/arch/xtensa/include/uapi/asm/socket.h
index 75a07b8119a9..1de07a7f7680 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -116,4 +116,7 @@
 
 #define SO_ZEROCOPY	60
 
+#define SO_TXTIME  61
+#define SCM_TXTIME SO_TXTIME
+
 #endif /* _XTENSA_SOCKET_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 2ed99bfa4595..68347b9821c6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -319,6 +319,9 @@ struct sock_common {
   *@sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
   *@sk_reuseport_cb: reuseport group container
   *@sk_rcu: used during RCU grace period
+  *@sk_clockid: clockid used by time-based scheduling (SO_TXTIME)
+  *@sk_txtime_deadline_mode: set deadline mode for SO_TXTIME
+  *@sk_txtime_unused: unused txtime flags
   */
 struct sock {
/*
@@ -475,6 +478,11 @@ struct sock {
u8  sk_shutdown;
u32 sk_tskey;
atomic_tsk_zckey;
+
+   u8  sk_clockid;
+   u8  sk_txtime_deadline_mode : 1,
+   sk_txtime_unused : 7;
+
st

[PATCH v2 net-next 00/14] Scheduled packet Transmission: ETF

2018-07-03 Thread Jesus Sanchez-Palencia


Changes since v1:
  - moved struct sock_txtime from socket.h to uapi net_tstamp.h;
  - sk_clockid was changed from u16 to u8;
  - sk_txtime_flags was changed from u16 to a u8 bit field in struct sock;
  - the socket option flags are now validated in sock_setsockopt();
  - added SO_EE_ORIGIN_TXTIME;
  - sockc.transmit_time is now initialized from all IPv4 Tx paths;
  - added support for the IPv6 Tx path;


Overview


This work consists of a set of kernel interfaces that can be used by
applications that require (time-based) scheduled Tx of packets.
It comprises 3 new components added to the kernel:

  - SO_TXTIME: socket option + cmsg programming interfaces.

  - etf: the "earliest txtime first" qdisc, that provides per-queue
 TxTime-based scheduling. This has been renamed from 'tbs' to
 'etf' to better describe its functionality.

  - taprio: the "time-aware priority scheduler" qdisc, that provides
per-port Time-Aware scheduling;

This patchset provides the first 2 components, which have been in
development longer. The taprio qdisc will be shared as an RFC separately
(shortly).

Note that this series is a follow up of the "Time based packet
transmission" RFCv3 [1].



etf (formerly known as 'tbs')
=

For applications/systems for which the concept of time slices isn't
precise enough, the etf qdisc allows applications to control the instant
when a packet should leave the network controller. When used in
conjunction with taprio, it can also be used in case the application
needs to control with greater guarantee the offset into each time slice
at which a packet will be sent. Another use case of etf is when only a
small number of applications on a system are time sensitive, so it can
then be used with a more traditional root qdisc (like mqprio).

The etf qdisc is designed so it buffers packets until a configurable
time before their deadline (Tx time). The qdisc uses an rbtree
internally, so the buffered packets are always ordered by their txtime
(deadline) and will be dequeued earliest txtime first.

It relies on the SO_TXTIME API set for receiving the per-packet timestamp
(txtime) as well as the config flags for each socket: the clockid to be
used as a reference, if the expected mode of txtime for that socket is
deadline or strict mode, and if packet drops should be reported on the
socket's error queue or not.

The qdisc will drop any packet with a Tx time in the past, or any packet
that expires while waiting to be dequeued. Drops can be reported as
errors back to userspace through the socket's error queue.

Example configuration:

$ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 20 \
clockid CLOCK_TAI

Here, the Qdisc will use HW offload for the txtime control.
Packets will be dequeued by the qdisc "delta" (20) nanoseconds before
their transmission time. Because this will be using HW offload and
since dynamic clocks are not supported by hrtimers, the system clock
and the PHC clock must be synchronized for this mode to behave as expected.

A more complete example can be found here, with instructions of how to
test it:

https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f [2]


Note that we haven't modified the qdisc to use a timerqueue, because the
change required would have increased the number of cachelines of an sk_buff.



This series is also hosted on github and can be found at [3].
The companion iproute2 patches can be found at [4].


[1] https://patchwork.ozlabs.org/cover/882342/

[2] github doesn't make it clear, but the gist can be cloned like this:
$ git clone https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f 
scheduled-tx-tests

[3] https://github.com/jeez/linux/tree/etf-v2

[4] https://github.com/jeez/iproute2/tree/etf-v2



Jesus Sanchez-Palencia (10):
  net: Clear skb->tstamp only on the forwarding path
  net: ipv4: Hook into time based transmission
  net: ipv6: Hook into time based transmission
  net/sched: Add HW offloading capability to ETF
  igb: Refactor igb_configure_cbs()
  igb: Only change Tx arbitration when CBS is on
  igb: Refactor igb_offload_cbs()
  igb: Only call skb_tx_timestamp after descriptors are ready
  igb: Add support for ETF offload
  net/sched: Make etf report drops on error_queue

Richard Cochran (2):
  net: Add a new socket option for a future transmit time.
  net: packet: Hook into time based transmission.

Vinicius Costa Gomes (2):
  net/sched: Allow creating a Qdisc watchdog with other clocks
  net/sched: Introduce the ETF Qdisc

 arch/alpha/include/uapi/asm/socket.h  |   3 +
 arch/ia64/include/uapi/asm/socket.h   |   3 +
 arch/mips/include/uapi/asm/socket.h   |   3 +
 arch/parisc/include/uapi/asm/socket.h |   3 +
 arch/s390/include/uapi/asm/socket.h   |   3 +
 arch/sparc/include/uapi/asm/socket.h  |   3 +
 arch/xtensa/include/uapi/asm/socket.h |   3 +
 .../net/ethernet/intel/igb/e1000_defines.h 

[PATCH v2 net-next 03/14] net: ipv4: Hook into time based transmission

2018-07-03 Thread Jesus Sanchez-Palencia
Add a transmit_time field to struct inet_cork, then copy the
timestamp from the CMSG cookie at ip_setup_cork() so we can
safely copy it into the skb later during __ip_make_skb().

For the raw fast path, just perform the copy at raw_send_hdrinc().

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 include/net/inet_sock.h | 1 +
 net/ipv4/icmp.c | 2 ++
 net/ipv4/ip_output.c| 3 +++
 net/ipv4/ping.c | 1 +
 net/ipv4/raw.c  | 2 ++
 net/ipv4/udp.c  | 1 +
 6 files changed, 10 insertions(+)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 83d5b3c2ac42..314be484c696 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -148,6 +148,7 @@ struct inet_cork {
__s16   tos;
char	priority;
__u16   gso_size;
+   u64 transmit_time;
 };
 
 struct inet_cork_full {
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 1617604c9284..937239afd68d 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -437,6 +437,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct 
sk_buff *skb)
ipc.tx_flags = 0;
ipc.ttl = 0;
ipc.tos = -1;
+   ipc.sockc.transmit_time = 0;
 
if (icmp_param->replyopts.opt.opt.optlen) {
ipc.opt = &icmp_param->replyopts.opt;
@@ -715,6 +716,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, 
__be32 info)
ipc.tx_flags = 0;
ipc.ttl = 0;
ipc.tos = -1;
+   ipc.sockc.transmit_time = 0;
 
rt = icmp_route_lookup(net, &fl4, skb_in, iph, saddr, tos, mark,
   type, code, &icmp_param);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index b3308e9d9762..135fb5036d18 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1153,6 +1153,7 @@ static int ip_setup_cork(struct sock *sk, struct 
inet_cork *cork,
cork->tos = ipc->tos;
cork->priority = ipc->priority;
cork->tx_flags = ipc->tx_flags;
+   cork->transmit_time = ipc->sockc.transmit_time;
 
return 0;
 }
@@ -1413,6 +1414,7 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
 
skb->priority = (cork->tos != -1) ? cork->priority: sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = cork->transmit_time;
/*
 * Steal rt from cork.dst to avoid a pair of atomic_inc/atomic_dec
 * on dst refcount
@@ -1550,6 +1552,7 @@ void ip_send_unicast_reply(struct sock *sk, struct 
sk_buff *skb,
ipc.tx_flags = 0;
ipc.ttl = 0;
ipc.tos = -1;
+   ipc.sockc.transmit_time = 0;
 
if (replyopts.opt.opt.optlen) {
ipc.opt = &replyopts.opt;
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 2ed64bca54e3..b47492205507 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -746,6 +746,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
ipc.tx_flags = 0;
ipc.ttl = 0;
ipc.tos = -1;
+   ipc.sockc.transmit_time = 0;
 
if (msg->msg_controllen) {
err = ip_cmsg_send(sk, msg, &ipc, false);
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index abb3c9490c55..446af7be2b55 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -381,6 +381,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 
*fl4,
 
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = sockc->transmit_time;
skb_dst_set(skb, &rt->dst);
*rtp = NULL;
 
@@ -562,6 +563,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
}
 
ipc.sockc.tsflags = sk->sk_tsflags;
+   ipc.sockc.transmit_time = 0;
ipc.addr = inet->inet_saddr;
ipc.opt = NULL;
ipc.tx_flags = 0;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 24e116ddae79..5c76ba0666ec 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -930,6 +930,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t 
len)
ipc.tx_flags = 0;
ipc.ttl = 0;
ipc.tos = -1;
+   ipc.sockc.transmit_time = 0;
 
getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
 
-- 
2.18.0



[PATCH v2 net-next 09/14] igb: Refactor igb_configure_cbs()

2018-07-03 Thread Jesus Sanchez-Palencia
Make this function retrieve what it needs from the Tx ring being
addressed, since it already relies on what was saved there before.
Also, since this function will be used by the upcoming Launchtime
patches rename it to better reflect its intention. Note that
Launchtime is not part of what 802.1Qav specifies, but the i210
datasheet refers to this set of functionality as "Qav Transmission
Mode".

Here we also perform a tiny refactor at is_any_cbs_enabled(), and add
further documentation to igb_setup_tx_mode().

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 60 +++
 1 file changed, 28 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index f1e3397bd405..15f6b9c57ccf 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1655,23 +1655,17 @@ static void set_queue_mode(struct e1000_hw *hw, int 
queue, enum queue_mode mode)
 }
 
 /**
- *  igb_configure_cbs - Configure Credit-Based Shaper (CBS)
+ *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
  *  @queue: queue number
- *  @enable: true = enable CBS, false = disable CBS
- *  @idleslope: idleSlope in kbps
- *  @sendslope: sendSlope in kbps
- *  @hicredit: hiCredit in bytes
- *  @locredit: loCredit in bytes
  *
- *  Configure CBS for a given hardware queue. When disabling, idleslope,
- *  sendslope, hicredit, locredit arguments are ignored. Returns 0 if
- *  success. Negative otherwise.
+ *  Configure CBS for a given hardware queue. Parameters are retrieved
+ *  from the correct Tx ring, so igb_save_cbs_params() should be used
+ *  for setting those correctly prior to this function being called.
  **/
-static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
- bool enable, int idleslope, int sendslope,
- int hicredit, int locredit)
+static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 {
+   struct igb_ring *ring = adapter->tx_ring[queue];
struct net_device *netdev = adapter->netdev;
struct e1000_hw *hw = &adapter->hw;
u32 tqavcc;
@@ -1680,7 +1674,7 @@ static void igb_configure_cbs(struct igb_adapter 
*adapter, int queue,
WARN_ON(hw->mac.type != e1000_i210);
WARN_ON(queue < 0 || queue > 1);
 
-   if (enable || queue == 0) {
+   if (ring->cbs_enable || queue == 0) {
/* i210 does not allow the queue 0 to be in the Strict
 * Priority mode while the Qav mode is enabled, so,
 * instead of disabling strict priority mode, we give
@@ -1690,10 +1684,10 @@ static void igb_configure_cbs(struct igb_adapter 
*adapter, int queue,
 * Queue0 QueueMode must be set to 1b when
 * TransmitMode is set to Qav."
 */
-   if (queue == 0 && !enable) {
+   if (queue == 0 && !ring->cbs_enable) {
/* max "linkspeed" idleslope in kbps */
-   idleslope = 100;
-   hicredit = ETH_FRAME_LEN;
+   ring->idleslope = 100;
+   ring->hicredit = ETH_FRAME_LEN;
}
 
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
@@ -1756,14 +1750,15 @@ static void igb_configure_cbs(struct igb_adapter 
*adapter, int queue,
 *   calculated value, so the resulting bandwidth might
 *   be slightly higher for some configurations.
 */
-   value = DIV_ROUND_UP_ULL(idleslope * 61034ULL, 100);
+   value = DIV_ROUND_UP_ULL(ring->idleslope * 61034ULL, 100);
 
tqavcc = rd32(E1000_I210_TQAVCC(queue));
tqavcc &= ~E1000_TQAVCC_IDLESLOPE_MASK;
tqavcc |= value;
wr32(E1000_I210_TQAVCC(queue), tqavcc);
 
-   wr32(E1000_I210_TQAVHC(queue), 0x80000000 + hicredit * 0x7735);
+   wr32(E1000_I210_TQAVHC(queue),
+        0x80000000 + ring->hicredit * 0x7735);
} else {
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW);
set_queue_mode(hw, queue, QUEUE_MODE_STRICT_PRIORITY);
@@ -1783,8 +1778,9 @@ static void igb_configure_cbs(struct igb_adapter 
*adapter, int queue,
 */
 
netdev_dbg(netdev, "CBS %s: queue %d idleslope %d sendslope %d hiCredit %d locredit %d\n",
-  (enable) ? "enabled" : "disabled", queue,
-  idleslope, sendslope, hicredit, locredit);
+  (ring->cbs_enable) ? "enabled" : "disabled", queue,
+  ring->idleslope, ring->sendslope, ring->hicredit,
+  ring->locredit);
 }
 
 static int igb_save_cbs_params(struct igb_adapter *adapter, int queue,
@@ -1809,19 +1805,25 @@ 

[PATCH v2 net-next 11/14] igb: Refactor igb_offload_cbs()

2018-07-03 Thread Jesus Sanchez-Palencia
Split code into a separate function (igb_offload_apply()) that will be
used by ETF offload implementation.

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 8c90f1e51add..c30ab7b260cc 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2474,6 +2474,19 @@ igb_features_check(struct sk_buff *skb, struct 
net_device *dev,
return features;
 }
 
+static void igb_offload_apply(struct igb_adapter *adapter, s32 queue)
+{
+   if (!is_fqtss_enabled(adapter)) {
+   enable_fqtss(adapter, true);
+   return;
+   }
+
+   igb_config_tx_modes(adapter, queue);
+
+   if (!is_any_cbs_enabled(adapter))
+   enable_fqtss(adapter, false);
+}
+
 static int igb_offload_cbs(struct igb_adapter *adapter,
   struct tc_cbs_qopt_offload *qopt)
 {
@@ -2494,15 +2507,7 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
if (err)
return err;
 
-   if (is_fqtss_enabled(adapter)) {
-   igb_config_tx_modes(adapter, qopt->queue);
-
-   if (!is_any_cbs_enabled(adapter))
-   enable_fqtss(adapter, false);
-
-   } else {
-   enable_fqtss(adapter, true);
-   }
+   igb_offload_apply(adapter, qopt->queue);
 
return 0;
 }
-- 
2.18.0



[PATCH v2 net-next 13/14] igb: Add support for ETF offload

2018-07-03 Thread Jesus Sanchez-Palencia
Implement HW offload support for SO_TXTIME through igb's Launchtime
feature. This is done by extending igb_setup_tc() so it supports
TC_SETUP_QDISC_ETF and configuring i210 so time based transmit
arbitration is enabled.

The FQTSS transmission mode added before is extended so strict
priority (SP) queues wait for stream reservation (SR) ones.
igb_config_tx_modes() is extended so it can support enabling/disabling
Launchtime following the previous approach used for the credit-based
shaper (CBS).

As in the previous flow, FQTSS transmission mode is enabled automatically
by the driver once Launchtime (or CBS, as before) is enabled.
Similarly, it's automatically disabled when the feature is disabled
for the last queue that had it set up.

The driver just consumes the transmit times from the skbuffs directly,
so no special handling is done in case an 'invalid' time is provided.
We assume this has been handled by the ETF qdisc already.

Signed-off-by: Jesus Sanchez-Palencia 
---
 .../net/ethernet/intel/igb/e1000_defines.h|  16 ++
 drivers/net/ethernet/intel/igb/igb.h  |   1 +
 drivers/net/ethernet/intel/igb/igb_main.c | 138 +++---
 3 files changed, 138 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h 
b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 252440a418dc..8a28f3388f69 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -1048,6 +1048,22 @@
 #define E1000_TQAVCTRL_XMIT_MODE   BIT(0)
 #define E1000_TQAVCTRL_DATAFETCHARBBIT(4)
 #define E1000_TQAVCTRL_DATATRANARB BIT(8)
+#define E1000_TQAVCTRL_DATATRANTIM BIT(9)
+#define E1000_TQAVCTRL_SP_WAIT_SR  BIT(10)
+/* Fetch Time Delta - bits 31:16
+ *
+ * This field holds the value to be reduced from the launch time for
+ * fetch time decision. The FetchTimeDelta value is defined in 32 ns
+ * granularity.
+ *
+ * This field is 16 bits wide, and so the maximum value is:
+ *
+ * 65535 * 32 = 2097120 ~= 2.1 msec
+ *
+ * XXX: We are configuring the max value here since we couldn't come up
+ * with a reason for not doing so.
+ */
+#define E1000_TQAVCTRL_FETCHTIME_DELTA	(0xFFFF << 16)
 
 /* TX Qav Credit Control fields */
 #define E1000_TQAVCC_IDLESLOPE_MASK	0xFFFF
diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index 9643b5b3d444..ca54e268d157 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -262,6 +262,7 @@ struct igb_ring {
u16 count;  /* number of desc. in the ring */
u8 queue_index; /* logical index of the ring*/
u8 reg_idx; /* physical index of the ring */
+   bool launchtime_enable; /* true if LaunchTime is enabled */
bool cbs_enable;/* indicates if CBS is enabled */
s32 idleslope;  /* idleSlope in kbps */
s32 sendslope;  /* sendSlope in kbps */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 445da8285d9b..e3a0c02721c9 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1666,13 +1666,26 @@ static bool is_any_cbs_enabled(struct igb_adapter 
*adapter)
return false;
 }
 
+static bool is_any_txtime_enabled(struct igb_adapter *adapter)
+{
+   int i;
+
+   for (i = 0; i < adapter->num_tx_queues; i++) {
+   if (adapter->tx_ring[i]->launchtime_enable)
+   return true;
+   }
+
+   return false;
+}
+
 /**
  *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
  *  @queue: queue number
  *
- *  Configure CBS for a given hardware queue. Parameters are retrieved
- *  from the correct Tx ring, so igb_save_cbs_params() should be used
+ *  Configure CBS and Launchtime for a given hardware queue.
+ *  Parameters are retrieved from the correct Tx ring, so
+ *  igb_save_cbs_params() and igb_save_txtime_params() should be used
  *  for setting those correctly prior to this function being called.
  **/
 static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
@@ -1686,6 +1699,19 @@ static void igb_config_tx_modes(struct igb_adapter 
*adapter, int queue)
WARN_ON(hw->mac.type != e1000_i210);
WARN_ON(queue < 0 || queue > 1);
 
+   /* If any of the Qav features is enabled, configure queues as SR and
+* with HIGH PRIO. If none is, then configure them with LOW PRIO and
+* as SP.
+*/
+   if (ring->cbs_enable || ring->launchtime_enable) {
+   set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
+   set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
+   } else {
+   set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW);
+   set_queue_mode(hw, queue, QU

[PATCH v2 net-next 10/14] igb: Only change Tx arbitration when CBS is on

2018-07-03 Thread Jesus Sanchez-Palencia
Currently the data transmission arbitration algorithm - DataTranARB
field on TQAVCTRL reg - is always set to CBS when the Tx mode is
changed from legacy to 'Qav' mode.

Make that configuration a bit more granular in preparation for the
upcoming Launchtime enabling patches, since CBS and Launchtime can be
enabled separately. That is achieved by moving the DataTranARB setup
to igb_config_tx_modes() instead.

Similarly, when disabling CBS we must check if it has been disabled
for all queues, and clear the DataTranARB accordingly.

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 49 +++
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 15f6b9c57ccf..8c90f1e51add 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1654,6 +1654,18 @@ static void set_queue_mode(struct e1000_hw *hw, int queue, enum queue_mode mode)
wr32(E1000_I210_TQAVCC(queue), val);
 }
 
+static bool is_any_cbs_enabled(struct igb_adapter *adapter)
+{
+   int i;
+
+   for (i = 0; i < adapter->num_tx_queues; i++) {
+   if (adapter->tx_ring[i]->cbs_enable)
+   return true;
+   }
+
+   return false;
+}
+
 /**
  *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
@@ -1668,7 +1680,7 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
struct igb_ring *ring = adapter->tx_ring[queue];
struct net_device *netdev = adapter->netdev;
struct e1000_hw *hw = &adapter->hw;
-   u32 tqavcc;
+   u32 tqavcc, tqavctrl;
u16 value;
 
WARN_ON(hw->mac.type != e1000_i210);
@@ -1693,6 +1705,14 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
 
+   /* Always set data transfer arbitration to credit-based
+* shaper algorithm on TQAVCTRL if CBS is enabled for any of
+* the queues.
+*/
+   tqavctrl = rd32(E1000_I210_TQAVCTRL);
+   tqavctrl |= E1000_TQAVCTRL_DATATRANARB;
+   wr32(E1000_I210_TQAVCTRL, tqavctrl);
+
/* According to i210 datasheet section 7.2.7.7, we should set
 * the 'idleSlope' field from TQAVCC register following the
 * equation:
@@ -1770,6 +1790,16 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 
/* Set hiCredit to zero. */
wr32(E1000_I210_TQAVHC(queue), 0);
+
+   /* If CBS is not enabled for any queues anymore, then return to
+* the default state of Data Transmission Arbitration on
+* TQAVCTRL.
+*/
+   if (!is_any_cbs_enabled(adapter)) {
+   tqavctrl = rd32(E1000_I210_TQAVCTRL);
+   tqavctrl &= ~E1000_TQAVCTRL_DATATRANARB;
+   wr32(E1000_I210_TQAVCTRL, tqavctrl);
+   }
}
 
/* XXX: In i210 controller the sendSlope and loCredit parameters from
@@ -1803,18 +1833,6 @@ static int igb_save_cbs_params(struct igb_adapter *adapter, int queue,
return 0;
 }
 
-static bool is_any_cbs_enabled(struct igb_adapter *adapter)
-{
-   int i;
-
-   for (i = 0; i < adapter->num_tx_queues; i++) {
-   if (adapter->tx_ring[i]->cbs_enable)
-   return true;
-   }
-
-   return false;
-}
-
 /**
  *  igb_setup_tx_mode - Switch to/from Qav Tx mode when applicable
  *  @adapter: pointer to adapter struct
@@ -1838,11 +1856,10 @@ static void igb_setup_tx_mode(struct igb_adapter *adapter)
int i, max_queue;
 
/* Configure TQAVCTRL register: set transmit mode to 'Qav',
-* set data fetch arbitration to 'round robin' and set data
-* transfer arbitration to 'credit shaper algorithm.
+* set data fetch arbitration to 'round robin'.
 */
val = rd32(E1000_I210_TQAVCTRL);
-   val |= E1000_TQAVCTRL_XMIT_MODE | E1000_TQAVCTRL_DATATRANARB;
+   val |= E1000_TQAVCTRL_XMIT_MODE;
val &= ~E1000_TQAVCTRL_DATAFETCHARB;
wr32(E1000_I210_TQAVCTRL, val);
 
-- 
2.18.0



[PATCH v2 net-next 08/14] net/sched: Add HW offloading capability to ETF

2018-07-03 Thread Jesus Sanchez-Palencia
Add infrastructure so the etf qdisc supports HW offload of time-based transmission.

For HW offload, the time-sorted list is still used, so packets are
always dequeued in order of txtime.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 10 \
   clockid CLOCK_REALTIME

In this example, the Qdisc will use HW offload for the control of the
transmission time through the network adapter. The hrtimer used for
packet scheduling inside the qdisc will use the clockid CLOCK_REALTIME
as reference and packets leave the Qdisc "delta" (10) nanoseconds
before their transmission time. Because this will be using HW offload and
since dynamic clocks are not supported by the hrtimer, the system clock
and the PHC clock must be synchronized for this mode to behave as
expected.

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/net/pkt_sched.h|  5 +++
 include/uapi/linux/pkt_sched.h |  1 +
 net/sched/sch_etf.c| 71 +-
 3 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 2466ea143d01..7dc769e5452b 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -155,4 +155,9 @@ struct tc_cbs_qopt_offload {
s32 sendslope;
 };
 
+struct tc_etf_qopt_offload {
+   u8 enable;
+   s32 queue;
+};
+
 #endif
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index d5e933ce1447..949118461009 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -944,6 +944,7 @@ struct tc_etf_qopt {
__s32 clockid;
__u32 flags;
 #define TC_ETF_DEADLINE_MODE_ONBIT(0)
+#define TC_ETF_OFFLOAD_ON  BIT(1)
 };
 
 enum {
diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
index 4b7f4903ac17..932a136db568 100644
--- a/net/sched/sch_etf.c
+++ b/net/sched/sch_etf.c
@@ -20,8 +20,10 @@
#include <net/sock.h>
 
 #define DEADLINE_MODE_IS_ON(x) ((x)->flags & TC_ETF_DEADLINE_MODE_ON)
+#define OFFLOAD_IS_ON(x) ((x)->flags & TC_ETF_OFFLOAD_ON)
 
 struct etf_sched_data {
+   bool offload;
bool deadline_mode;
int clockid;
int queue;
@@ -45,6 +47,9 @@ static inline int validate_input_params(struct tc_etf_qopt *qopt,
 *  * Dynamic clockids are not supported.
 *
 *  * Delta must be a positive integer.
+*
+* Also note that for the HW offload case, we must
+* expect that system clocks have been synchronized to PHC.
 */
if (qopt->clockid < 0) {
NL_SET_ERR_MSG(extack, "Dynamic clockids are not supported");
@@ -225,6 +230,56 @@ static struct sk_buff *etf_dequeue_timesortedlist(struct Qdisc *sch)
return skb;
 }
 
+static void etf_disable_offload(struct net_device *dev,
+   struct etf_sched_data *q)
+{
+   struct tc_etf_qopt_offload etf = { };
+   const struct net_device_ops *ops;
+   int err;
+
+   if (!q->offload)
+   return;
+
+   ops = dev->netdev_ops;
+   if (!ops->ndo_setup_tc)
+   return;
+
+   etf.queue = q->queue;
+   etf.enable = 0;
+
+   err = ops->ndo_setup_tc(dev, TC_SETUP_QDISC_ETF, &etf);
+   if (err < 0)
+   pr_warn("Couldn't disable ETF offload for queue %d\n",
+   etf.queue);
+}
+
+static int etf_enable_offload(struct net_device *dev, struct etf_sched_data *q,
+ struct netlink_ext_ack *extack)
+{
+   const struct net_device_ops *ops = dev->netdev_ops;
+   struct tc_etf_qopt_offload etf = { };
+   int err;
+
+   if (q->offload)
+   return 0;
+
+   if (!ops->ndo_setup_tc) {
+   NL_SET_ERR_MSG(extack, "Specified device does not support ETF offload");
+   return -EOPNOTSUPP;
+   }
+
+   etf.queue = q->queue;
+   etf.enable = 1;
+
+   err = ops->ndo_setup_tc(dev, TC_SETUP_QDISC_ETF, &etf);
+   if (err < 0) {
+   NL_SET_ERR_MSG(extack, "Specified device failed to setup ETF hardware offload");
+   return err;
+   }
+
+   return 0;
+}
+
 static int etf_init(struct Qdisc *sch, struct nlattr *opt,
struct netlink_ext_ack *extack)
 {
@@ -251,8 +306,9 @@ static int etf_init(struct Qdisc *sch, struct nlattr *opt,
 
qopt = nla_data(tb[TCA_ETF_PARMS]);
 
-   pr_debug("delta %d clockid %d deadline %s\n",
+   pr_debug("delta %d clockid %d offload %s deadline %s\n",
 qopt->delta, qopt->clockid,
+OFFLOAD_IS_ON(qopt) ? "on" : "off",
 DEADLINE_MODE_IS_ON(qopt) ? "on" : "off");
 
err = validate_input_params(qopt, extack);
@@ -261,9 +317,16 @@ static int etf_init(struct Qdisc *sch, struct nlattr *opt,
 
q->queue = sch->dev_queue - netdev_ge

[PATCH v2 net-next 12/14] igb: Only call skb_tx_timestamp after descriptors are ready

2018-07-03 Thread Jesus Sanchez-Palencia
Currently, skb_tx_timestamp() is being called before the Tx
descriptors are prepared in igb_xmit_frame_ring(), which happens
during either the igb_tso() or igb_tx_csum() calls.

Given that skb->tstamp might now be used to carry the timestamp
for SO_TXTIME, we must only call skb_tx_timestamp() after the
information has been copied into the Tx descriptors.

Signed-off-by: Jesus Sanchez-Palencia 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index c30ab7b260cc..445da8285d9b 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6033,8 +6033,6 @@ netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
}
}
 
-   skb_tx_timestamp(skb);
-
if (skb_vlan_tag_present(skb)) {
tx_flags |= IGB_TX_FLAGS_VLAN;
tx_flags |= (skb_vlan_tag_get(skb) << IGB_TX_FLAGS_VLAN_SHIFT);
@@ -6050,6 +6048,8 @@ netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
else if (!tso)
igb_tx_csum(tx_ring, first);
 
+   skb_tx_timestamp(skb);
+
if (igb_tx_map(tx_ring, first, hdr_len))
goto cleanup_tx_tstamp;
 
-- 
2.18.0



[PATCH v2 net-next 05/14] net: packet: Hook into time based transmission.

2018-07-03 Thread Jesus Sanchez-Palencia
From: Richard Cochran 

For raw layer-2 packets, copy the desired future transmit time from
the CMSG cookie into the skb.
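
On the userspace side, the cookie referred to above is populated from an SCM_TXTIME control message. The following is a hedged sketch of how a sender would attach a txtime to a packet; the SCM_TXTIME value is defined locally as an assumption for the case where libc headers predate this series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SCM_TXTIME
#define SCM_TXTIME 61 /* assumed value, matching SO_TXTIME from this series */
#endif

/* Fill a msghdr's control buffer with one SCM_TXTIME cmsg carrying the
 * desired transmit time in nanoseconds. Returns the control bytes used. */
static size_t attach_txtime(struct msghdr *msg, char *cbuf, size_t clen,
			    uint64_t txtime_ns)
{
	struct cmsghdr *cm;

	memset(cbuf, 0, clen);
	msg->msg_control = cbuf;
	msg->msg_controllen = clen;

	cm = CMSG_FIRSTHDR(msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_TXTIME;
	cm->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
	memcpy(CMSG_DATA(cm), &txtime_ns, sizeof(txtime_ns));

	return CMSG_SPACE(sizeof(txtime_ns));
}
```

A real sender would set msg.msg_controllen to the returned size before calling sendmsg(2), and must also have enabled SO_TXTIME on the socket so the kernel accepts the cmsg.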

Signed-off-by: Richard Cochran 
Signed-off-by: Jesus Sanchez-Palencia 
---
 net/packet/af_packet.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 57634bc3da74..3428f7739ae9 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1951,6 +1951,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
goto out_unlock;
}
 
+   sockc.transmit_time = 0;
sockc.tsflags = sk->sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(sk, msg, &sockc);
@@ -1962,6 +1963,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = sockc.transmit_time;
 
sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
 
@@ -2457,6 +2459,7 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
skb->dev = dev;
skb->priority = po->sk.sk_priority;
skb->mark = po->sk.sk_mark;
+   skb->tstamp = sockc->transmit_time;
sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
skb_shinfo(skb)->destructor_arg = ph.raw;
 
@@ -2633,6 +2636,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_put;
 
+   sockc.transmit_time = 0;
sockc.tsflags = po->sk.sk_tsflags;
if (msg->msg_controllen) {
err = sock_cmsg_send(&po->sk, msg, &sockc);
@@ -2829,6 +2833,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
if (unlikely(!(dev->flags & IFF_UP)))
goto out_unlock;
 
+   sockc.transmit_time = 0;
sockc.tsflags = sk->sk_tsflags;
sockc.mark = sk->sk_mark;
if (msg->msg_controllen) {
@@ -2903,6 +2908,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sockc.mark;
+   skb->tstamp = sockc.transmit_time;
 
if (has_vnet_hdr) {
err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
-- 
2.18.0



[PATCH v2 net-next 07/14] net/sched: Introduce the ETF Qdisc

2018-07-03 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes 

The ETF (Earliest TxTime First) qdisc uses the information added
earlier in this series (the socket option SO_TXTIME and the new
role of sk_buff->tstamp) to schedule packets transmission based
on absolute time.

For some workloads, just bandwidth enforcement is not enough, and
precise control of the transmission of packets is necessary.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 etf delta 10 \
   clockid CLOCK_TAI

In this example, the Qdisc will provide SW best-effort for the control
of the transmission time to the network adapter, the time stamp in the
socket will be in reference to the clockid CLOCK_TAI and packets
will leave the qdisc "delta" (10) nanoseconds before their transmission
time.

The ETF qdisc will buffer packets sorted by their txtime. It will drop
packets on enqueue() if their skbuff clockid does not match the clock
reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped
if it has expired while waiting in the queue.

The qdisc also supports the SO_TXTIME deadline mode. For this mode, it
will dequeue a packet as soon as possible and change the skb timestamp
to 'now' during etf_dequeue().

Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid
to be configured, but it's been decided that usage of CLOCK_TAI should
be enforced until we decide to allow for other clockids to be used.
The rationale here is that PTP times are usually in the TAI scale, thus
no other clocks should be necessary. For now, the qdisc will return
EINVAL if any clocks other than CLOCK_TAI are used.
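
The enqueue-time acceptance rule described above can be summarized in a small sketch. This is illustrative logic only, not the kernel implementation, which also covers the deadline-mode and watchdog details.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A packet is rejected on enqueue if its clock reference differs from
 * the qdisc's configured clockid, or if its txtime has already passed. */
static bool etf_would_accept(int pkt_clockid, int qdisc_clockid,
			     int64_t txtime_ns, int64_t now_ns)
{
	if (pkt_clockid != qdisc_clockid)
		return false;	/* clockid mismatch: dropped on enqueue() */
	if (txtime_ns < now_ns)
		return false;	/* txtime already in the past: dropped */
	return true;
}
```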

Signed-off-by: Jesus Sanchez-Palencia 
Signed-off-by: Vinicius Costa Gomes 
---
 include/linux/netdevice.h  |   1 +
 include/uapi/linux/pkt_sched.h |  17 ++
 net/sched/Kconfig  |  11 +
 net/sched/Makefile |   1 +
 net/sched/sch_etf.c| 384 +
 5 files changed, 414 insertions(+)
 create mode 100644 net/sched/sch_etf.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 64480a0f2c16..610df79b9845 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -798,6 +798,7 @@ enum tc_setup_type {
TC_SETUP_QDISC_RED,
TC_SETUP_QDISC_PRIO,
TC_SETUP_QDISC_MQ,
+   TC_SETUP_QDISC_ETF,
 };
 
 /* These structures hold the attributes of bpf state that are being passed
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index bad3c03bcf43..d5e933ce1447 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -937,4 +937,21 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+
+/* ETF */
+struct tc_etf_qopt {
+   __s32 delta;
+   __s32 clockid;
+   __u32 flags;
+#define TC_ETF_DEADLINE_MODE_ONBIT(0)
+};
+
+enum {
+   TCA_ETF_UNSPEC,
+   TCA_ETF_PARMS,
+   __TCA_ETF_MAX,
+};
+
+#define TCA_ETF_MAX (__TCA_ETF_MAX - 1)
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a01169fb5325..fcc89706745b 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -183,6 +183,17 @@ config NET_SCH_CBS
  To compile this code as a module, choose M here: the
  module will be called sch_cbs.
 
+config NET_SCH_ETF
+   tristate "Earliest TxTime First (ETF)"
+   help
+ Say Y here if you want to use the Earliest TxTime First (ETF) packet
+ scheduling algorithm.
+
+ See the top of <file:net/sched/sch_etf.c> for more details.
+
+ To compile this code as a module, choose M here: the
+ module will be called sch_etf.
+
 config NET_SCH_GRED
tristate "Generic Random Early Detection (GRED)"
---help---
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 8811d3804878..9a5a7077d217 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -54,6 +54,7 @@ obj-$(CONFIG_NET_SCH_FQ)  += sch_fq.o
 obj-$(CONFIG_NET_SCH_HHF)  += sch_hhf.o
 obj-$(CONFIG_NET_SCH_PIE)  += sch_pie.o
 obj-$(CONFIG_NET_SCH_CBS)  += sch_cbs.o
+obj-$(CONFIG_NET_SCH_ETF)  += sch_etf.o
 
 obj-$(CONFIG_NET_CLS_U32)  += cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)   += cls_route.o
diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
new file mode 100644
index ..4b7f4903ac17
--- /dev/null
+++ b/net/sched/sch_etf.c
@@ -0,0 +1,384 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* net/sched/sch_etf.c  Earliest TxTime First queueing discipline.
+ *
+ * Authors:Jesus Sanchez-Palencia 
+ * Vinicius Costa Gomes 
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/rbtree.h>
+#include <linux/skbuff.h>
+#include <linux/posix-timers.h>
+#include <net/netlink.h>
+#include <net/sch_generic.h>
+#include <net/pkt_sched.h>
+#include <net/sock.h>
+
+#define DEADLINE_MODE_IS_ON(x) ((x)->flags & TC_ETF_DEADLINE_MODE_ON)
+
+struct etf_sched_data {
+   bool deadline_mode;
+   int clockid;
+   int queue;
+   s32 delta; /* in ns */
+   ktime_t last; /* The

[PATCH v2 net-next 14/14] net/sched: Make etf report drops on error_queue

2018-07-03 Thread Jesus Sanchez-Palencia
Use the socket error queue for reporting dropped packets if the
socket has enabled that feature through the SO_TXTIME API.

Packets are dropped either on enqueue() if they aren't accepted by the
qdisc or on dequeue() if the system misses their deadline. Those are
reported as different errors so applications can react accordingly.

Userspace can retrieve the errors through the socket error queue and the
corresponding cmsg interfaces. A struct sock_extended_err* is used for
returning the error data, and the packet's timestamp can be retrieved by
adding both ee_data and ee_info fields as e.g.:

((__u64) serr->ee_data << 32) + serr->ee_info
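
A receiver of the error queue would rebuild the 64-bit timestamp exactly as the expression above shows; as a sketch, with the two fields taken from struct sock_extended_err:

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild the dropped packet's txtime (in ns) from the two 32-bit error
 * fields: ee_data carries the high word, ee_info the low word. */
static uint64_t txtime_from_serr(uint32_t ee_data, uint32_t ee_info)
{
	return ((uint64_t)ee_data << 32) + ee_info;
}
```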

This feature is disabled by default and must be explicitly enabled by
applications. Enabling it can bring some overhead for the Tx cycles
of the application.

Signed-off-by: Jesus Sanchez-Palencia 
---
 include/net/sock.h  |  3 ++-
 include/uapi/linux/errqueue.h   |  4 
 include/uapi/linux/net_tstamp.h |  5 -
 net/core/sock.c |  4 
 net/sched/sch_etf.c | 35 +++--
 5 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 68347b9821c6..e0eac9ef44b5 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -481,7 +481,8 @@ struct sock {
 
u8  sk_clockid;
u8  sk_txtime_deadline_mode : 1,
-   sk_txtime_unused : 7;
+   sk_txtime_report_errors : 1,
+   sk_txtime_unused : 6;
 
struct socket   *sk_socket;
void*sk_user_data;
diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index dc64cfaf13da..c0151200f7d1 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -20,12 +20,16 @@ struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP6 3
 #define SO_EE_ORIGIN_TXSTATUS  4
 #define SO_EE_ORIGIN_ZEROCOPY  5
+#define SO_EE_ORIGIN_TXTIME6
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1))
 
 #define SO_EE_CODE_ZEROCOPY_COPIED 1
 
+#define SO_EE_CODE_TXTIME_INVALID_PARAM1
+#define SO_EE_CODE_TXTIME_MISSED   2
+
 /**
  * struct scm_timestamping - timestamps exposed through cmsg
  *
diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index c9a77c353b98..f8f4539f1135 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -147,8 +147,11 @@ struct scm_ts_pktinfo {
  */
 enum txtime_flags {
SOF_TXTIME_DEADLINE_MODE = (1 << 0),
+   SOF_TXTIME_REPORT_ERRORS = (1 << 1),
 
-   SOF_TXTIME_FLAGS_MASK = (SOF_TXTIME_DEADLINE_MODE)
+   SOF_TXTIME_FLAGS_LAST = SOF_TXTIME_REPORT_ERRORS,
+   SOF_TXTIME_FLAGS_MASK = (SOF_TXTIME_FLAGS_LAST - 1) |
+SOF_TXTIME_FLAGS_LAST
 };
 
 struct sock_txtime {
diff --git a/net/core/sock.c b/net/core/sock.c
index fe64b839f1b2..03fdea5b0f57 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1087,6 +1087,8 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
sk->sk_clockid = sk_txtime.clockid;
sk->sk_txtime_deadline_mode =
!!(sk_txtime.flags & SOF_TXTIME_DEADLINE_MODE);
+   sk->sk_txtime_report_errors =
+   !!(sk_txtime.flags & SOF_TXTIME_REPORT_ERRORS);
}
break;
 
@@ -1429,6 +1431,8 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
v.txtime.clockid = sk->sk_clockid;
v.txtime.flags |= sk->sk_txtime_deadline_mode ?
  SOF_TXTIME_DEADLINE_MODE : 0;
+   v.txtime.flags |= sk->sk_txtime_report_errors ?
+ SOF_TXTIME_REPORT_ERRORS : 0;
break;
 
default:
diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
index 932a136db568..1538d6fa8165 100644
--- a/net/sched/sch_etf.c
+++ b/net/sched/sch_etf.c
@@ -11,6 +11,7 @@
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/errno.h>
+#include <linux/errqueue.h>
#include <linux/rbtree.h>
#include <linux/skbuff.h>
#include <linux/posix-timers.h>
@@ -123,6 +124,32 @@ static void reset_watchdog(struct Qdisc *sch)
qdisc_watchdog_schedule_ns(&q->watchdog, ktime_to_ns(next));
 }
 
+static void report_sock_error(struct sk_buff *skb, u32 err, u8 code)
+{
+   struct sock_exterr_skb *serr;
+   struct sk_buff *clone;
+   ktime_t txtime = skb->tstamp;
+
+   if (!skb->sk || !(skb->sk->sk_txtime_report_errors))
+   return;
+
+   clone = skb_clone(skb, GFP_ATOMIC);
+   if (!clone)
+   return;
+
+   serr = SKB_EXT_ERR(clone);
+   serr->ee.ee_errno = err;
+   serr->ee.ee_origin = SO_EE_ORIGIN_TXTIME;
+   serr->ee.ee_type = 0;
+   serr->ee.ee_code = code;
+   serr->ee.ee_pad = 0;
+

[PATCH v2 net-next 01/14] net: Clear skb->tstamp only on the forwarding path

2018-07-03 Thread Jesus Sanchez-Palencia
This is done in preparation for the upcoming time based transmission
patchset. Now that skb->tstamp will be used to hold the packet's txtime,
we must ensure that it is being cleared when traversing namespaces.
Also, doing that from skb_scrub_packet() before the early return would
break our feature when tunnels are used.

Signed-off-by: Jesus Sanchez-Palencia 
---
 net/core/skbuff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1357f36c8a5e..c4e24ac27464 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4898,7 +4898,6 @@ EXPORT_SYMBOL(skb_try_coalesce);
  */
 void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 {
-   skb->tstamp = 0;
skb->pkt_type = PACKET_HOST;
skb->skb_iif = 0;
skb->ignore_df = 0;
@@ -4912,6 +4911,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 
ipvs_reset(skb);
skb->mark = 0;
+   skb->tstamp = 0;
 }
 EXPORT_SYMBOL_GPL(skb_scrub_packet);
 
-- 
2.18.0



[PATCH v2 net-next 04/14] net: ipv6: Hook into time based transmission

2018-07-03 Thread Jesus Sanchez-Palencia
Add a struct sockcm_cookie parameter to ip6_setup_cork() so
we can easily re-use the transmit_time field from struct inet_cork
for most paths, by copying the timestamp from the CMSG cookie.
This is later copied into the skb during __ip6_make_skb().

For the raw fast path, also pass the sockcm_cookie as a parameter
so we can just perform the copy at rawv6_send_hdrinc() directly.

Signed-off-by: Jesus Sanchez-Palencia 
---
 net/ipv6/ip6_output.c | 11 ---
 net/ipv6/raw.c|  7 +--
 net/ipv6/udp.c|  1 +
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a14fb4fcdf18..f48af7e62f12 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1158,7 +1158,8 @@ static void ip6_append_data_mtu(unsigned int *mtu,
 
 static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork,
  struct inet6_cork *v6_cork, struct ipcm6_cookie *ipc6,
- struct rt6_info *rt, struct flowi6 *fl6)
+ struct rt6_info *rt, struct flowi6 *fl6,
+ const struct sockcm_cookie *sockc)
 {
struct ipv6_pinfo *np = inet6_sk(sk);
unsigned int mtu;
@@ -1226,6 +1227,8 @@ static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork,
cork->base.flags |= IPCORK_ALLFRAG;
cork->base.length = 0;
 
+   cork->base.transmit_time = sockc->transmit_time;
+
return 0;
 }
 
@@ -1575,7 +1578,7 @@ int ip6_append_data(struct sock *sk,
 * setup for corking
 */
err = ip6_setup_cork(sk, &inet->cork, &np->cork,
-ipc6, rt, fl6);
+ipc6, rt, fl6, sockc);
if (err)
return err;
 
@@ -1673,6 +1676,8 @@ struct sk_buff *__ip6_make_skb(struct sock *sk,
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
 
+   skb->tstamp = cork->base.transmit_time;
+
skb_dst_set(skb, dst_clone(&rt->dst));
IP6_UPD_PO_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUT, skb->len);
if (proto == IPPROTO_ICMPV6) {
@@ -1765,7 +1770,7 @@ struct sk_buff *ip6_make_skb(struct sock *sk,
cork->base.opt = NULL;
cork->base.dst = NULL;
v6_cork.opt = NULL;
-   err = ip6_setup_cork(sk, cork, &v6_cork, ipc6, rt, fl6);
+   err = ip6_setup_cork(sk, cork, &v6_cork, ipc6, rt, fl6, sockc);
if (err) {
ip6_cork_release(cork, &v6_cork);
return ERR_PTR(err);
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index afc307c89d1a..5737c50f16eb 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -620,7 +620,7 @@ static int rawv6_push_pending_frames(struct sock *sk, struct flowi6 *fl6,
 
 static int rawv6_send_hdrinc(struct sock *sk, struct msghdr *msg, int length,
struct flowi6 *fl6, struct dst_entry **dstp,
-   unsigned int flags)
+   unsigned int flags, const struct sockcm_cookie *sockc)
 {
struct ipv6_pinfo *np = inet6_sk(sk);
struct net *net = sock_net(sk);
@@ -650,6 +650,7 @@ static int rawv6_send_hdrinc(struct sock *sk, struct msghdr *msg, int length,
skb->protocol = htons(ETH_P_IPV6);
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+   skb->tstamp = sockc->transmit_time;
skb_dst_set(skb, &rt->dst);
*dstp = NULL;
 
@@ -848,6 +849,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
fl6.flowi6_oif = sk->sk_bound_dev_if;
 
sockc.tsflags = sk->sk_tsflags;
+   sockc.transmit_time = 0;
if (msg->msg_controllen) {
opt = &opt_space;
memset(opt, 0, sizeof(struct ipv6_txoptions));
@@ -921,7 +923,8 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 back_from_confirm:
if (inet->hdrincl)
-   err = rawv6_send_hdrinc(sk, msg, len, &fl6, &dst, msg->msg_flags);
+   err = rawv6_send_hdrinc(sk, msg, len, &fl6, &dst,
+   msg->msg_flags, &sockc);
else {
ipc6.opt = opt;
lock_sock(sk);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e6645cae403e..ac6fc6728903 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1148,6 +1148,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
ipc6.dontfrag = -1;
ipc6.gso_size = up->gso_size;
sockc.tsflags = sk->sk_tsflags;
+   sockc.transmit_time = 0;
 
/* destination address check */
if (sin6) {
-- 
2.18.0



[PATCH v2 net-next 06/14] net/sched: Allow creating a Qdisc watchdog with other clocks

2018-07-03 Thread Jesus Sanchez-Palencia
From: Vinicius Costa Gomes 

This adds 'qdisc_watchdog_init_clockid()' that allows a clockid to be
passed, this allows other time references to be used when scheduling
the Qdisc to run.

Signed-off-by: Vinicius Costa Gomes 
---
 include/net/pkt_sched.h |  2 ++
 net/sched/sch_api.c | 11 +--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 815b92a23936..2466ea143d01 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -72,6 +72,8 @@ struct qdisc_watchdog {
struct Qdisc*qdisc;
 };
 
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc *qdisc,
+clockid_t clockid);
 void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc);
 void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires);
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 54eca685420f..98541c6399db 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -596,12 +596,19 @@ static enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer)
return HRTIMER_NORESTART;
 }
 
-void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc *qdisc,
+clockid_t clockid)
 {
-   hrtimer_init(&wd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+   hrtimer_init(&wd->timer, clockid, HRTIMER_MODE_ABS_PINNED);
wd->timer.function = qdisc_watchdog;
wd->qdisc = qdisc;
 }
+EXPORT_SYMBOL(qdisc_watchdog_init_clockid);
+
+void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+{
+   qdisc_watchdog_init_clockid(wd, qdisc, CLOCK_MONOTONIC);
+}
 EXPORT_SYMBOL(qdisc_watchdog_init);
 
 void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires)
-- 
2.18.0



[PATCH v2 net] net/ipv6: Revert attempt to simplify route replace and append

2018-07-03 Thread dsahern
From: David Ahern 

NetworkManager likes to manage linklocal prefix routes and does so with
the NLM_F_APPEND flag, breaking attempts to simplify the IPv6 route
code and by extension enable multipath routes with device only nexthops.

Revert f34436a43092 and these followup patches:
6eba08c3626b ("ipv6: Only emit append events for appended routes").
ce45bded6435 ("mlxsw: spectrum_router: Align with new route replace logic")
53b562df8c20 ("mlxsw: spectrum_router: Allow appending to dev-only routes")

Update the fib_tests cases to reflect the old behavior.

Fixes: f34436a43092 ("net/ipv6: Simplify route replace and appending into multipath route")
Signed-off-by: David Ahern 
---
The gre_multipath tests only exist in net-next. Those will need to be
updated separately.

 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |  48 +++
 include/net/ip6_route.h|   6 +
 net/ipv6/ip6_fib.c | 156 -
 net/ipv6/route.c   |   3 +-
 tools/testing/selftests/net/fib_tests.sh   |  41 --
 5 files changed, 117 insertions(+), 137 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 6aaaf3d9ba31..77b2adb29341 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -4756,6 +4756,12 @@ static void mlxsw_sp_rt6_destroy(struct mlxsw_sp_rt6 *mlxsw_sp_rt6)
kfree(mlxsw_sp_rt6);
 }
 
+static bool mlxsw_sp_fib6_rt_can_mp(const struct fib6_info *rt)
+{
+   /* RTF_CACHE routes are ignored */
+   return (rt->fib6_flags & (RTF_GATEWAY | RTF_ADDRCONF)) == RTF_GATEWAY;
+}
+
 static struct fib6_info *
 mlxsw_sp_fib6_entry_rt(const struct mlxsw_sp_fib6_entry *fib6_entry)
 {
@@ -4765,11 +4771,11 @@ mlxsw_sp_fib6_entry_rt(const struct mlxsw_sp_fib6_entry *fib6_entry)
 
 static struct mlxsw_sp_fib6_entry *
 mlxsw_sp_fib6_node_mp_entry_find(const struct mlxsw_sp_fib_node *fib_node,
-const struct fib6_info *nrt, bool append)
+const struct fib6_info *nrt, bool replace)
 {
struct mlxsw_sp_fib6_entry *fib6_entry;
 
-   if (!append)
+   if (!mlxsw_sp_fib6_rt_can_mp(nrt) || replace)
return NULL;
 
list_for_each_entry(fib6_entry, &fib_node->entry_list, common.list) {
@@ -4784,7 +4790,8 @@ mlxsw_sp_fib6_node_mp_entry_find(const struct mlxsw_sp_fib_node *fib_node,
break;
if (rt->fib6_metric < nrt->fib6_metric)
continue;
-   if (rt->fib6_metric == nrt->fib6_metric)
+   if (rt->fib6_metric == nrt->fib6_metric &&
+   mlxsw_sp_fib6_rt_can_mp(rt))
return fib6_entry;
if (rt->fib6_metric > nrt->fib6_metric)
break;
@@ -5163,7 +5170,7 @@ static struct mlxsw_sp_fib6_entry *
 mlxsw_sp_fib6_node_entry_find(const struct mlxsw_sp_fib_node *fib_node,
  const struct fib6_info *nrt, bool replace)
 {
-   struct mlxsw_sp_fib6_entry *fib6_entry;
+   struct mlxsw_sp_fib6_entry *fib6_entry, *fallback = NULL;
 
list_for_each_entry(fib6_entry, &fib_node->entry_list, common.list) {
struct fib6_info *rt = mlxsw_sp_fib6_entry_rt(fib6_entry);
@@ -5172,13 +5179,18 @@ mlxsw_sp_fib6_node_entry_find(const struct mlxsw_sp_fib_node *fib_node,
continue;
if (rt->fib6_table->tb6_id != nrt->fib6_table->tb6_id)
break;
-   if (replace && rt->fib6_metric == nrt->fib6_metric)
-   return fib6_entry;
+   if (replace && rt->fib6_metric == nrt->fib6_metric) {
+   if (mlxsw_sp_fib6_rt_can_mp(rt) ==
+   mlxsw_sp_fib6_rt_can_mp(nrt))
+   return fib6_entry;
+   if (mlxsw_sp_fib6_rt_can_mp(nrt))
+   fallback = fallback ?: fib6_entry;
+   }
if (rt->fib6_metric > nrt->fib6_metric)
-   return fib6_entry;
+   return fallback ?: fib6_entry;
}
 
-   return NULL;
+   return fallback;
 }
 
 static int
@@ -5304,8 +5316,7 @@ static void mlxsw_sp_fib6_entry_replace(struct mlxsw_sp *mlxsw_sp,
 }
 
 static int mlxsw_sp_router_fib6_add(struct mlxsw_sp *mlxsw_sp,
-   struct fib6_info *rt, bool replace,
-   bool append)
+   struct fib6_info *rt, bool replace)
 {
struct mlxsw_sp_fib6_entry *fib6_entry;
struct mlxsw_sp_fib_node *fib_node;
@@ -5331,7 +5342,7 @@ static int mlxsw_sp_router_fib6_add(struct mlxsw_sp *mlxsw_sp,
/* Before creating a new entry, try to append rou

Re: [PATCH bpf] bpf: hash_map: decrement counter on error

2018-07-03 Thread Daniel Borkmann
On 07/03/2018 10:28 PM, Alexei Starovoitov wrote:
> On Sun, Jul 01, 2018 at 11:33:58AM -0500, Mauricio Vasquez wrote:
>> On 06/30/2018 06:20 PM, Daniel Borkmann wrote:
>>> On 06/29/2018 02:48 PM, Mauricio Vasquez B wrote:
 Decrement the number of elements in the map in case the allocation
 of a new node fails.

 Signed-off-by: Mauricio Vasquez B 
>>> Thanks for the fix, Mauricio!
>>>
>>> Could you reply with a Fixes: tag in order to track the commit originally
>>> introducing this bug?
>>
>> Sure Daniel,
>>
>> Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
> 
> Good catch. Thanks for the fix.
> Acked-by: Alexei Starovoitov 

Applied to bpf, thanks Mauricio!


pull-request: bpf-next 2018-07-03

2018-07-03 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Various improvements to bpftool and libbpf, that is, bpftool build
   speed improvements, missing BPF program types added for detection
   by section name, ability to load programs from '.text' section is
   made to work again, and better bash completion handling, from Jakub.

2) Improvements to nfp JIT's map read handling which allows for optimizing
   memcpy from map to packet, from Jiong.

3) New BPF sample is added which demonstrates XDP in combination with
   bpf_perf_event_output() helper to sample packets on all CPUs, from Toke.

4) Add a new BPF kselftest case for tracking connect(2) BPF hooks
   infrastructure in combination with TFO, from Andrey.

5) Extend the XDP/BPF xdp_rxq_info sample code with a cmdline option to
   read payload from packet data in order to use it for benchmarking.
   Also for '--action XDP_TX' option implement swapping of MAC addresses
   to avoid drops on some hardware seen during testing, from Jesper.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!



The following changes since commit b1a5046b2497e39cea9eb585358f3749442fb3f7:

  Merge branch 'Multipath-tests-for-tunnel-devices' (2018-06-27 10:42:13 +0900)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to 0b9e3d543f9fa6a8abdac04f974176ee02312860:

  Merge branch 'bpf-bpftool-libbpf-improvements' (2018-07-01 01:01:52 +0200)


Andrey Ignatov (1):
  selftests/bpf: Test sys_connect BPF hooks with TFO

Daniel Borkmann (1):
  Merge branch 'bpf-bpftool-libbpf-improvements'

Jakub Kicinski (8):
  tools: bpftool: use correct make variable type to improve compilation time
  tools: libbpf: add section names for missing program types
  tools: libbpf: allow setting ifindex for programs and maps
  tools: libbpf: restore the ability to load programs from .text section
  tools: libbpf: don't return '.text' as a program for multi-function programs
  tools: bpftool: drop unnecessary Author comments
  tools: bpftool: add missing --bpffs to completions
  tools: bpftool: deal with options upfront

Jesper Dangaard Brouer (2):
  samples/bpf: extend xdp_rxq_info to read packet payload
  samples/bpf: xdp_rxq_info action XDP_TX must adjust MAC-addrs

Jiong Wang (1):
  nfp: bpf: allow source ptr type be map ptr in memcpy optimization

Toke Høiland-Jørgensen (2):
  trace_helpers.c: Add helpers to poll multiple perf FDs for events
  samples/bpf: Add xdp_sample_pkts example

 drivers/net/ethernet/netronome/nfp/bpf/jit.c |   5 +-
 samples/bpf/Makefile |   4 +
 samples/bpf/xdp_rxq_info_kern.c  |  43 +++
 samples/bpf/xdp_rxq_info_user.c  |  45 ++-
 samples/bpf/xdp_sample_pkts_kern.c   |  66 +++
 samples/bpf/xdp_sample_pkts_user.c   | 169 +++
 tools/bpf/bpftool/Makefile   |   2 +-
 tools/bpf/bpftool/bash-completion/bpftool|  32 +++--
 tools/bpf/bpftool/common.c   |   2 -
 tools/bpf/bpftool/main.c |   4 +-
 tools/bpf/bpftool/main.h |   2 -
 tools/bpf/bpftool/map.c  |   2 -
 tools/bpf/bpftool/prog.c |   4 +-
 tools/lib/bpf/libbpf.c   |  49 ++--
 tools/lib/bpf/libbpf.h   |   2 +
 tools/testing/selftests/bpf/test_sock_addr.c |  37 +-
 tools/testing/selftests/bpf/trace_helpers.c  |  48 +++-
 tools/testing/selftests/bpf/trace_helpers.h  |   4 +
 18 files changed, 471 insertions(+), 49 deletions(-)
 create mode 100644 samples/bpf/xdp_sample_pkts_kern.c
 create mode 100644 samples/bpf/xdp_sample_pkts_user.c


[PATCH net] net: phy: fix flag masking in __set_phy_supported

2018-07-03 Thread Heiner Kallweit
Currently also the pause flags are removed from phydev->supported because
they're not included in PHY_DEFAULT_FEATURES. I don't think this is
intended, especially when considering that this function can be called
via phy_set_max_speed() anywhere in a driver. Change the masking to mask
out only the values we're going to change. In addition remove the
misleading comment, job of this small function is just to adjust the
supported and advertised speeds.

Fixes: f3a6bd393c2c ("phylib: Add phy_set_max_speed helper")
Signed-off-by: Heiner Kallweit 
---
 drivers/net/phy/phy_device.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index bd0f339f..b9f5f40a 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1724,11 +1724,8 @@ EXPORT_SYMBOL(genphy_loopback);
 
 static int __set_phy_supported(struct phy_device *phydev, u32 max_speed)
 {
-   /* The default values for phydev->supported are provided by the PHY
-* driver "features" member, we want to reset to sane defaults first
-* before supporting higher speeds.
-*/
-   phydev->supported &= PHY_DEFAULT_FEATURES;
+   phydev->supported &= ~(PHY_1000BT_FEATURES | PHY_100BT_FEATURES |
+  PHY_10BT_FEATURES);
 
switch (max_speed) {
default:
-- 
2.18.0



Re: [PATCH bpf] bpf: hash_map: decrement counter on error

2018-07-03 Thread Alexei Starovoitov
On Sun, Jul 01, 2018 at 11:33:58AM -0500, Mauricio Vasquez wrote:
> 
> On 06/30/2018 06:20 PM, Daniel Borkmann wrote:
> > On 06/29/2018 02:48 PM, Mauricio Vasquez B wrote:
> > > Decrement the number of elements in the map in case the allocation
> > > of a new node fails.
> > > 
> > > Signed-off-by: Mauricio Vasquez B 
> > Thanks for the fix, Mauricio!
> > 
> > Could you reply with a Fixes: tag in order to track the commit originally
> > introducing this bug?
> > 
> > Thanks,
> > Daniel
> > 
> 
> Sure Daniel,
> 
> Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")

Good catch. Thanks for the fix.
Acked-by: Alexei Starovoitov 



Re: [PATCH net] net/ipv6: Revert attempt to simplify route replace and append

2018-07-03 Thread David Ahern
On 7/3/18 7:43 AM, Ido Schimmel wrote:
> On Mon, Jul 02, 2018 at 03:03:12PM -0700, dsah...@kernel.org wrote:
>> From: David Ahern 
>>
>> NetworkManager likes to manage linklocal prefix routes and does so with
>> the NLM_F_APPEND flag, breaking attempts to simplify the IPv6 route
>> code and by extension enable multipath routes with device only nexthops.
>>
>> Revert f34436a43092 and its followup
>> 6eba08c3626b ("ipv6: Only emit append events for appended routes").
>> Update the test cases to reflect the old behavior.
>>
>> Fixes: f34436a43092 ("net/ipv6: Simplify route replace and appending into 
>> multipath route")
>> Signed-off-by: David Ahern 
>> ---
>> Ido: I left 5a15a1b07c51. FIB_EVENT_ENTRY_APPEND is not generated for
>>  IPv6, so no harm in leaving it.
> 
> OK, but I had a follow-up series that did further changes in mlxsw
> (merge commit eab9a2d5f323) to support the new behavior. You want to
> revert it and squash to v2 or I'll send another revert?

I'll squash. The ability to add dev only nexthops is clearly needed.
Reverting in 1 patch will help on the next attempt.

It is unfortunate that mlxsw has to replicate the node lookup code.

> 
> Also, Petr added a multipath test that relies on IPv6 device only
> nexthops. See commit 54818c4c4b937.
> 

ok. I'll remove the ipv6 mpath tests as well.


Re: [PATCH net-next 01/10] r8169: add basic phylib support

2018-07-03 Thread Heiner Kallweit
On 03.07.2018 18:42, Florian Fainelli wrote:
> 
> 
> On 07/02/2018 02:15 PM, Heiner Kallweit wrote:
>> On 02.07.2018 23:02, Andrew Lunn wrote:
 +static int r8169_mdio_read_reg(struct mii_bus *mii_bus, int phyaddr, int phyreg)
 +{
 +  struct rtl8169_private *tp = mii_bus->priv;
 +
 +  return rtl_readphy(tp, phyreg);
>>>
>>> So there is no support for phyaddr?
>>>
>> Right, the chip can access only the one internal PHY, therefore it
>> doesn't support phyaddr.
> 
> Then you might also want to set mii_bus->phy_mask accordingly such that
> only the internal PHY address bit is cleared there?
> 
That's something I'm doing already, see following line in r8169_mdio_register():
new_bus->phy_mask = ~1;
But thanks for the hint anyway.


Re: [PATCH net-next 02/10] r8169: use phy_resume/phy_suspend

2018-07-03 Thread Heiner Kallweit
On 03.07.2018 18:44, Florian Fainelli wrote:
> 
> 
> On 07/02/2018 02:24 PM, Heiner Kallweit wrote:
>> On 02.07.2018 23:06, Andrew Lunn wrote:
  static void r8168_pll_power_down(struct rtl8169_private *tp)
  {
if (r8168_check_dash(tp))
 @@ -4510,7 +4469,8 @@ static void r8168_pll_power_down(struct rtl8169_private *tp)
if (rtl_wol_pll_power_down(tp))
return;
  
 -  r8168_phy_power_down(tp);
 +  /* cover the case that PHY isn't connected */
 +  phy_suspend(mdiobus_get_phy(tp->mii_bus, 0));
>>>
>>> This could do some more explanation. Why would it not be connected?
>>>
>> The PHY gets connected when the net_device is opened. If a network
>> port isn't used then it will be runtime-suspended a few seconds after
>> boot. In this case we call r8168_pll_power_down() with the PHY not
>> being connected. 
> 
> Would not the !netif_running() check in rtl8169_net_suspend() take care
> of that though?
> 
We don't get that far.
If the interface is down then tp->TxDescArray is NULL. Means we call
rtl_pll_power_down() in rtl8169_runtime_suspend() before reaching
rtl8169_net_suspend().



Re: [PATCHv2 net-next 2/2] selftests: add a selftest for directed broadcast forwarding

2018-07-03 Thread David Ahern
On 7/3/18 5:36 AM, Xin Long wrote:
> On Mon, Jul 2, 2018 at 11:12 PM, David Ahern  wrote:
>> On 7/2/18 12:30 AM, Xin Long wrote:
>>> +ping_ipv4()
>>> +{
>>> + sysctl_set net.ipv4.icmp_echo_ignore_broadcasts 0
>>> + bc_forwarding_disable
>>> + ping_test $h1 198.51.100.255
>>> +
>>> + iptables -A INPUT -i vrf-r1 -p icmp -j DROP
>>> + bc_forwarding_restore
>>> + bc_forwarding_enable
>>> + ping_test $h1 198.51.100.255
>>> +
>>> + bc_forwarding_restore
>>> + iptables -D INPUT -i vrf-r1 -p icmp -j DROP
>>> + sysctl_restore net.ipv4.icmp_echo_ignore_broadcasts
>>> +}
>>
>> Both tests fail for me:
>> TEST: ping  [FAIL]
>> TEST: ping  [FAIL]
> I think 'ip vrf exec ...' is not working in your env, while the test
> is using "ip vrf exec vrf-h1 ping ..."
> 
> You can test it by:
> # ip link add dev vrf-test type vrf table 
> # ip vrf exec vrf-test ls

well, that's embarrassing. yes, I updated ip and forgot to apply the bpf
workaround to define the syscall number (not defined in jessie).

> 
>>
>> Why the need for the iptables rule?
> This iptables rule is to block the echo_request packet going to the
> router's local_in.
> When bc_forwarding is NOT forwarding properly and the packet instead
> goes to the router's local_in, the ping will fail.
> 
> Without this rule, the 2nd ping would always succeed, and we couldn't tell
> whether the echo_reply came from the router or from h2.
> 
> Or you have a better way to test this?

your commands are not a proper test. The test should succeed and fail
based on the routing lookup, not iptables rules.

> 
>>
>> And, PAUSE_ON_FAIL is not working to take a look at why tests are
>> failing. e.g.,
>>
>> PAUSE_ON_FAIL=yes ./router_broadcast.sh
>>
>> just continues on. Might be something with the infrastructure scripts.
> Yes, in ./router_broadcast.sh, it loads lib.sh where it loads 
> forwarding.config
> where it has "PAUSE_ON_FAIL=no", which would override your
> "PAUSE_ON_FAIL=yes".
> 

ack. bit by that as well.


Re: [Intel-wired-lan] [jkirsher/next-queue PATCH v2 0/7] Add support for L2 Fwd Offload w/o ndo_select_queue

2018-07-03 Thread Jeff Kirsher
On Tue, Jun 12, 2018 at 8:18 AM, Alexander Duyck
 wrote:
> This patch series is meant to allow support for the L2 forward offload, aka
> MACVLAN offload without the need for using ndo_select_queue.
>
> The existing solution currently requires that we use ndo_select_queue in
> the transmit path if we want to associate specific Tx queues with a given
> MACVLAN interface. In order to get away from this we need to repurpose the
> tc_to_txq array and XPS pointer for the MACVLAN interface and use those as
> a means of accessing the queues on the lower device. As a result we cannot
> offload a device that is configured as multiqueue, however it doesn't
> really make sense to configure a macvlan interfaced as being multiqueue
> anyway since it doesn't really have a qdisc of its own in the first place.
>
> I am submitting this as an RFC for the netdev mailing list, and officially
> submitting it for testing to Jeff Kirsher's next-queue in order to validate
> the ixgbe specific bits.
>
> The big changes in this set are:
>   Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN
>   Disable XPS for single queue devices
>   Replace accel_priv with sb_dev in ndo_select_queue
>   Add sb_dev parameter to fallback function for ndo_select_queue
>   Consolidated ndo_select_queue functions that appeared to be duplicates
>
> v2: Implement generic "select_queue" functions instead of "fallback" 
> functions.
> Tweak last two patches to account for changes in dev_pick_tx_xxx 
> functions.
>
> ---
>
> Alexander Duyck (7):
>   net-sysfs: Drop support for XPS and traffic_class on single queue device
>   net: Add support for subordinate device traffic classes
>   ixgbe: Add code to populate and use macvlan tc to Tx queue map
>   net: Add support for subordinate traffic classes to netdev_pick_tx
>   net: Add generic ndo_select_queue functions
>   net: allow ndo_select_queue to pass netdev
>   net: allow fallback function to pass netdev
>

Alex, there were recent changes to Dave's net-next which caused a
conflict with one or more of your patches.  So I have removed this
series from my tree for now, and will work with you on updating the
series to work with the latest net-next tree.


Re: [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent

2018-07-03 Thread Lawrence Brakmo
On 7/3/18, 6:15 AM, "Neal Cardwell"  wrote:

On Mon, Jul 2, 2018 at 7:49 PM Yuchung Cheng  wrote:
>
> On Mon, Jul 2, 2018 at 2:39 PM, Lawrence Brakmo  wrote:
> >
> > DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
> > notifications to keep track if it needs to send an ACK for packets that
> > were received with a particular ECN state but whose ACK was delayed.
> >
> > Under some circumstances, for example when a delayed ACK is sent with a
> > data packet, DCTCP state was not being updated due to a lack of
> > notification that the previously delayed ACK was sent. As a result, it
> > would sometimes send a duplicate ACK when a new data packet arrived.
> >
> > This patch ensures that DCTCP's state is correctly updated so it will
> > not send the duplicate ACK.
> Sorry to chime-in late here (lame excuse: IETF deadline)
>
> IIRC this issue would exist prior to 4.11 kernel. While it'd be good
> to fix that, it's not clear which patch introduced the regression
> between 4.11 and 4.16? I assume you tested Eric's most recent quickack
> fix.

I don't think Larry is saying there's a regression between 4.11 and
4.16. His recent "tcp: force cwnd at least 2 in tcp_cwnd_reduction"
patch here:

  
https://patchwork.ozlabs.org/patch/935233/

said that 4.11 was bad (high tail latency and lots of RTOs) and 4.16
was still bad but with different netstat counters (no RTOs but still
high tail latency):

"""
On 4.11, pcap traces indicate that in some instances the 1st packet of
the RPC is received but no ACK is sent before the packet is
retransmitted. On 4.11 netstat shows TCP timeouts, with some of them
spurious.

On 4.16, we don't see retransmits in netstat but the high tail latencies
are still there.
"""

I suspect the RTOs disappeared but latencies remained too high because
between 4.11 and 4.16 we introduced:
  tcp: allow TLP in ECN CWR
  
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=b4f70c3d4ec32a2ff4c62e1e2da0da5f55fe12bd

So the RTOs probably disappeared because this commit turned them into
TLPs. But the latencies remained high because the fundamental bug
remained throughout 4.11 and 4.16 and today: the DCTCP use of
tcp_send_ack() with an old rcv_nxt caused delayed ACKs to be cancelled
when they should not have been.

> In terms of the fix itself, it seems odd the tcp_send_ack() call in
> DCTCP generates NON_DELAYED_ACK event to toggle DCTCP's
> delayed_ack_reserved bit. Shouldn't the fix to have DCTCP send the
> "prior" ACK w/o cancelling delayed-ACK and mis-recording that it's
> cancelled, because that prior-ACK really is a supplementary old ACK.

This patch is fixing an issue that's orthogonal to the one you are
talking about. Using the taxonomy from our team's internal discussion
yesterday, the issue you mention where the DCTCP "prior" ACK is
cancelling delayed ACKs is issue (4); the issue that this particular
"tcp: notify when a delayed ack is sent" patch from Larry fixes is
issue (3). It's a bit tricky because both issues appeared in Larry's
trace summary and packetdrill script to reproduce the issue.

neal

I was able to track the patch that introduced the problem:

commit 3759824da87b30ce7a35b4873b62b0ba38905ef5
Author: Yuchung Cheng 
Date:   Wed Jul 1 14:11:15 2015 -0700

tcp: PRR uses CRB mode by default and SS mode conditionally

I tested a kernel which reverted the relevant change (see diff below) and the 
high tail latencies of more than 40ms disappeared. However, the 10KB high 
percentile latencies are 4ms vs. less than 200us with my patches. It looks like 
the patch above ended up reducing the cwnd to 1 in the scenarios that were 
triggering the high tail latencies. That is, it increased the likelihood of 
triggering actual bugs in the network stack code that my patch set fixes.


diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2c5d70bc294e..50fabb07d739 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2468,13 +2468,14 @@ void tcp_cwnd_reduction(struct sock *sk, int newly_acked_sacked, int flag)
u64 dividend = (u64)tp->snd_ssthresh * tp->prr_delivered +
   tp->prior_cwnd - 1;
sndcnt = div_u64(dividend, tp->prior_cwnd) - tp->prr_out;
-   } else if ((flag & FLAG_RETRANS_DATA_ACKED) &&
-  !(flag & FLAG_LOST_RETRANS)) {
+   } else {
+// } else if ((flag & FLAG_RETRANS_DATA_ACKED) &&
+//!(flag & FLAG_LOST_RETRANS)) {
sndcnt =

Re: [PATCH net-next 07/10] r8169: migrate speed_down function to phylib

2018-07-03 Thread Florian Fainelli



On 07/02/2018 02:31 PM, Heiner Kallweit wrote:
> On 02.07.2018 23:20, Andrew Lunn wrote:
>>  On Mon, Jul 02, 2018 at 09:37:08PM +0200, Heiner Kallweit wrote:
>>> Change rtl_speed_down() to use phylib.
>>>
>>> Signed-off-by: Heiner Kallweit 
>>> ---
>>>  drivers/net/ethernet/realtek/r8169.c | 33 +---
>>>  1 file changed, 15 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/realtek/r8169.c 
>>> b/drivers/net/ethernet/realtek/r8169.c
>>> index 311321ee..807fbc75 100644
>>> --- a/drivers/net/ethernet/realtek/r8169.c
>>> +++ b/drivers/net/ethernet/realtek/r8169.c
>>> @@ -4240,6 +4240,10 @@ static void rtl8169_init_phy(struct net_device *dev, struct rtl8169_private *tp)
>>> rtl_writephy(tp, 0x0b, 0x); //w 0x0b 15 0 0
>>> }
>>>  
>>> +   /* We may have called rtl_speed_down before */
>>> +   dev->phydev->advertising = dev->phydev->supported;
>>> +   genphy_config_aneg(dev->phydev);
>>> +
>>> genphy_soft_reset(dev->phydev);
>>>  
>>> rtl8169_set_speed(dev, AUTONEG_ENABLE, SPEED_1000, DUPLEX_FULL,
>>> @@ -4323,28 +4327,21 @@ static void rtl_init_mdio_ops(struct rtl8169_private *tp)
>>> }
>>>  }
>>>  
>>> +#define BASET10(ADVERTISED_10baseT_Half | ADVERTISED_10baseT_Full)
>>> +#define BASET100   (ADVERTISED_100baseT_Half | ADVERTISED_100baseT_Full)
>>> +#define BASET1000  (ADVERTISED_1000baseT_Half | ADVERTISED_1000baseT_Full)
>>> +
>>>  static void rtl_speed_down(struct rtl8169_private *tp)
>>>  {
>>> -   u32 adv;
>>> -   int lpa;
>>> +   struct phy_device *phydev = tp->dev->phydev;
>>> +   u32 adv = phydev->lp_advertising & phydev->supported;
>>>  
>>> -   rtl_writephy(tp, 0x1f, 0x);
>>> -   lpa = rtl_readphy(tp, MII_LPA);
>>> +   if (adv & BASET10)
>>> +   phydev->advertising &= ~(BASET100 | BASET1000);
>>> +   else if (adv & BASET100)
>>> +   phydev->advertising &= ~BASET1000;
>>>  
>>> -   if (lpa & (LPA_10HALF | LPA_10FULL))
>>> -   adv = ADVERTISED_10baseT_Half | ADVERTISED_10baseT_Full;
>>> -   else if (lpa & (LPA_100HALF | LPA_100FULL))
>>> -   adv = ADVERTISED_10baseT_Half | ADVERTISED_10baseT_Full |
>>> - ADVERTISED_100baseT_Half | ADVERTISED_100baseT_Full;
>>> -   else
>>> -   adv = ADVERTISED_10baseT_Half | ADVERTISED_10baseT_Full |
>>> - ADVERTISED_100baseT_Half | ADVERTISED_100baseT_Full |
>>> - (tp->mii.supports_gmii ?
>>> -  ADVERTISED_1000baseT_Half |
>>> -  ADVERTISED_1000baseT_Full : 0);
>>> -
>>> -   rtl8169_set_speed(tp->dev, AUTONEG_ENABLE, SPEED_1000, DUPLEX_FULL,
>>> - adv);
>>> +   genphy_config_aneg(phydev);
>>>  }
>>
>> It probably it is me being too tired, but i don't get what this is
>> doing? Changing the local advertisement based on what the remote is
>> advertising. Why?
>>
> It also took me some time to understand what this speed_down is doing.
> If we suspend and wait for a WoL packet, then we don't have to burn all
> the energy for a GBit connection. Therefore we switch to the lowest
> speed supported by chip and link partner. This is done by removing
> higher speeds from the advertised modes and restarting autonegotiation.

This is something that the tg3 driver also does, we should probably
consider doing this as part of a generic PHY library helpers since I was
told by several HW engineers that usually 10Mbits for WoL is much more
energy efficient.

One thing that bothers me a bit is that this should ideally be offered
as both blocking and non-blocking options, because we might want to make
sure that at the time we suspend, and we already had a link established,
we successfully re-negotiate the link with the partner. I agree that
there could be any sort of link disruption happening at any point though..
-- 
Florian


Re: [PATCH net-next 02/10] r8169: use phy_resume/phy_suspend

2018-07-03 Thread Florian Fainelli



On 07/02/2018 02:24 PM, Heiner Kallweit wrote:
> On 02.07.2018 23:06, Andrew Lunn wrote:
>>>  static void r8168_pll_power_down(struct rtl8169_private *tp)
>>>  {
>>> if (r8168_check_dash(tp))
>>> @@ -4510,7 +4469,8 @@ static void r8168_pll_power_down(struct rtl8169_private *tp)
>>> if (rtl_wol_pll_power_down(tp))
>>> return;
>>>  
>>> -   r8168_phy_power_down(tp);
>>> +   /* cover the case that PHY isn't connected */
>>> +   phy_suspend(mdiobus_get_phy(tp->mii_bus, 0));
>>
>> This could do some more explanation. Why would it not be connected?
>>
> The PHY gets connected when the net_device is opened. If a network
> port isn't used then it will be runtime-suspended a few seconds after
> boot. In this case we call r8168_pll_power_down() with the PHY not
> being connected. 

Would not the !netif_running() check in rtl8169_net_suspend() take care
of that though?
-- 
Florian


Re: [PATCH net-next 01/10] r8169: add basic phylib support

2018-07-03 Thread Florian Fainelli



On 07/02/2018 02:15 PM, Heiner Kallweit wrote:
> On 02.07.2018 23:02, Andrew Lunn wrote:
>>> +static int r8169_mdio_read_reg(struct mii_bus *mii_bus, int phyaddr, int phyreg)
>>> +{
>>> +   struct rtl8169_private *tp = mii_bus->priv;
>>> +
>>> +   return rtl_readphy(tp, phyreg);
>>
>> So there is no support for phyaddr?
>>
> Right, the chip can access only the one internal PHY, therefore it
> doesn't support phyaddr.

Then you might also want to set mii_bus->phy_mask accordingly such that
only the internal PHY address bit is cleared there?
-- 
Florian


[PATCH net-next v3 1/2] tcp: notify when a delayed ack is sent

2018-07-03 Thread Lawrence Brakmo
DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
notifications to keep track of whether it needs to send an ACK for packets that
were received with a particular ECN state but whose ACK was delayed.

Under some circumstances, for example when a delayed ACK is sent with a
data packet, DCTCP state was not being updated due to a lack of
notification that the previously delayed ACK was sent. As a result, it
would sometimes send a duplicate ACK when a new data packet arrived.

This patch ensures that DCTCP's state is correctly updated so it will
not send the duplicate ACK.

Improved based on comments from Neal Cardwell .

Signed-off-by: Lawrence Brakmo 
---
 net/ipv4/tcp_output.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f8f6129160dd..acefb64e8280 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -172,6 +172,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
__sock_put(sk);
}
tcp_dec_quickack_mode(sk, pkts);
+   if (inet_csk_ack_scheduled(sk))
+   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
 }
 
@@ -3567,8 +3569,6 @@ void tcp_send_ack(struct sock *sk)
if (sk->sk_state == TCP_CLOSE)
return;
 
-   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
-
/* We are not putting this on the write queue, so
 * tcp_transmit_skb() will set the ownership to this
 * sock.
-- 
2.17.1



[PATCH net-next v3 2/2] tcp: ack immediately when a cwr packet arrives

2018-07-03 Thread Lawrence Brakmo
We observed high 99% and 99.9% latencies when doing RPCs with DCTCP. The
problem is triggered when the last packet of a request arrives CE
marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
to 1 (because there are no packets in flight). When the 1st packet of
the next request arrives, the ACK was sometimes delayed even though it
is CWR marked, adding up to 40ms to the RPC latency.

This patch ensures that CWR marked data packets arriving will be acked
immediately.

Modified based on comments by Neal Cardwell 

Signed-off-by: Lawrence Brakmo 
---
 net/ipv4/tcp_input.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2c5d70bc294e..23c2a43de8a4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -246,8 +246,15 @@ static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
 
 static void tcp_ecn_accept_cwr(struct tcp_sock *tp, const struct sk_buff *skb)
 {
-   if (tcp_hdr(skb)->cwr)
+   if (tcp_hdr(skb)->cwr) {
tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+
+   /* If the sender is telling us it has entered CWR, then its
+* cwnd may be very low (even just 1 packet), so we should ACK
+* immediately.
+*/
+   tcp_enter_quickack_mode((struct sock *)tp, 2);
+   }
 }
 
 static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp)
-- 
2.17.1



[PATCH net-next v3 0/2] tcp: fix high tail latencies in DCTCP

2018-07-03 Thread Lawrence Brakmo
We have observed high tail latencies when using DCTCP for RPCs as
compared to using Cubic. For example, in one setup there are 2 hosts
sending to a 3rd one, with each sender having 3 flows (1 stream,
1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
table shows the 99% and 99.9% latencies for both Cubic and dctcp:

             Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9%
1MB RPCs       2.6ms       5.5ms        43ms       208ms
10KB RPCs      1.1ms       1.3ms        53ms       212ms

Tcpdump traces showed that there are two causes for the
latency:

  1) RTOs caused by the receiver sending a dup ACK and not ACKing
 the last (and only) packet sent.
  2) Delaying ACKs when the sender has a cwnd of 1, so everything
 pauses for the duration of the delayed ACK.

The first patch fixes the cause of the dup ACKs, not updating DCTCP
state when an ACK that was initially delayed has been sent with a
data packet.

The second patch ensures that an ACK is sent immediately when a
CWR marked packet arrives.

With the patches the latencies for DCTCP now look like:

             dctcp 99%  dctcp 99.9%
1MB RPCs       5.8ms      6.9ms
10KB RPCs      146us      203us

Note that while the 1MB RPCs tail latencies are higher than Cubic's,
the 10KB latencies are much smaller than Cubic's. These patches fix
issues on the receiver, but tcpdump traces indicate there is an
opportunity to also fix an issue at the sender that adds about 3ms
to the tail latencies.

The following trace shows the issue that triggers an RTO (fixed by these
patches):

   Host A sends the last packets of the request
   Host B receives them, and the last packet is marked with congestion (CE)
   Host B sends ACKs for packets not marked with congestion
   Host B sends data packet with reply and ACK for packet marked with
  congestion (TCP flag ECE)
   Host A receives ACKs with no ECE flag
   Host A receives data packet with ACK for the last packet of request
  and which has TCP ECE bit set
   Host A sends 1st data packet of the next request with TCP flag CWR
   Host B receives the packet (as seen in tcpdump at B), no CE flag
   Host B sends a dup ACK that also has the TCP ECE flag
   Host A RTO timer fires!
   Host A to send the next packet
   Host A receives an ACK for everything it has sent (i.e. Host B
  did receive 1st packet of request)
   Host A sends more packets…

v2: Removed call to tcp_ca_event from tcp_send_ack since I added one in
tcp_event_ack_sent. Based on Neal Cardwell feedback.
Modified tcp_ecn_check_ce (and renamed it tcp_ecn_check) instead of
modifying tcp_ack_send_check to ensure an ACK when cwr is received.
v3: Handling cwr in tcp_ecn_accept_cwr instead of in tcp_ecn_check.

[PATCH net-next v3 1/2] tcp: notify when a delayed ack is sent
[PATCH net-next v3 2/2] tcp: ack immediately when a cwr packet

 net/ipv4/tcp_input.c  | 9 -
 net/ipv4/tcp_output.c | 4 ++--
 2 files changed, 10 insertions(+), 3 deletions(-)




Re: [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP

2018-07-03 Thread Neal Cardwell
On Tue, Jul 3, 2018 at 11:10 AM Lawrence Brakmo  wrote:
>
> On 7/2/18, 5:52 PM, "netdev-ow...@vger.kernel.org on behalf of Neal Cardwell" 
>  wrote:
>
> On Mon, Jul 2, 2018 at 5:39 PM Lawrence Brakmo  wrote:
> >
> > We have observed high tail latencies when using DCTCP for RPCs as
> > compared to using Cubic. For example, in one setup there are 2 hosts
> > sending to a 3rd one, with each sender having 3 flows (1 stream,
> > 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
> > table shows the 99% and 99.9% latencies for both Cubic and dctcp:
> >
> >             Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9%
> >  1MB RPCs     2.6ms       5.5ms        43ms       208ms
> > 10KB RPCs     1.1ms       1.3ms        53ms       212ms
> >
> > Looking at tcpdump traces showed that there are two causes for the
> > latency.
> >
> >   1) RTOs caused by the receiver sending a dup ACK and not ACKing
> >  the last (and only) packet sent.
> >   2) Delaying ACKs when the sender has a cwnd of 1, so everything
> >  pauses for the duration of the delayed ACK.
> >
> > The first patch fixes the cause of the dup ACKs, not updating DCTCP
> > state when an ACK that was initially delayed has been sent with a
> > data packet.
> >
> > The second patch ensures that an ACK is sent immediately when a
> > CWR marked packet arrives.
> >
> > With the patches the latencies for DCTCP now look like:
> >
> >             dctcp 99%  dctcp 99.9%
> >  1MB RPCs     5.8ms      6.9ms
> > 10KB RPCs     146us      203us
> >
> > Note that while the 1MB RPCs tail latencies are higher than Cubic's,
> > the 10KB latencies are much smaller than Cubic's. These patches fix
> > issues on the receiver, but tcpdump traces indicate there is an
> > opportunity to also fix an issue at the sender that adds about 3ms
> > to the tail latencies.
> >
> > The following trace shows the issue that triggers an RTO (fixed by these 
> patches):
> >
> >Host A sends the last packets of the request
> >Host B receives them, and the last packet is marked with congestion 
> (CE)
> >Host B sends ACKs for packets not marked with congestion
> >Host B sends data packet with reply and ACK for packet marked with
> >   congestion (TCP flag ECE)
> >Host A receives ACKs with no ECE flag
> >Host A receives data packet with ACK for the last packet of request
> >   and which has TCP ECE bit set
> >Host A sends 1st data packet of the next request with TCP flag CWR
> >Host B receives the packet (as seen in tcpdump at B), no CE flag
> >Host B sends a dup ACK that also has the TCP ECE flag
> >Host A RTO timer fires!
> >Host A retransmits the packet
> >Host A receives an ACK for everything it has sent (i.e. Host B
> >   did receive 1st packet of request)
> >Host A sends more packets…
> >
> > [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
> > [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet
> >
> >  net/ipv4/tcp_input.c  | 16 +++-
> >  net/ipv4/tcp_output.c |  4 ++--
> >  2 files changed, 13 insertions(+), 7 deletions(-)
>
> Thanks, Larry. Just for context, can you please let us know whether
> your tests included zero, one, or both of Eric's recent commits
> (listed below) that tuned the number of ACKs after ECN events? (Or
> maybe the tests were literally using a net-next kernel?) Just wanted
> to get a better handle on any possible interactions there.
>
> Thanks!
>
> neal
>
> Yes, my test kernel includes both patches listed below.

OK, great.

> BTW, I will send a new
> patch where I move the call to tcp_incr_quickack from tcp_ecn_check_ce to 
> tcp_ecn_accept_cwr.

OK, sounds good to me.

thanks,
neal


Re: [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent

2018-07-03 Thread Lawrence Brakmo
On 7/2/18, 4:50 PM, "Yuchung Cheng"  wrote:

On Mon, Jul 2, 2018 at 2:39 PM, Lawrence Brakmo  wrote:
>
> DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
> notifications to keep track if it needs to send an ACK for packets that
> were received with a particular ECN state but whose ACK was delayed.
>
> Under some circumstances, for example when a delayed ACK is sent with a
> data packet, DCTCP state was not being updated due to a lack of
> notification that the previously delayed ACK was sent. As a result, it
> would sometimes send a duplicate ACK when a new data packet arrived.
>
> This patch ensures that DCTCP's state is correctly updated so it will
> not send the duplicate ACK.
Sorry to chime-in late here (lame excuse: IETF deadline)

IIRC this issue would exist prior to 4.11 kernel. While it'd be good
to fix that, it's not clear which patch introduced the regression
between 4.11 and 4.16? I assume you tested Eric's most recent quickack
fix.

In terms of the fix itself, it seems odd the tcp_send_ack() call in
DCTCP generates NON_DELAYED_ACK event to toggle DCTCP's
delayed_ack_reserved bit. Shouldn't the fix be to have DCTCP send the
"prior" ACK w/o cancelling the delayed ACK and mis-recording that it's
cancelled, because that prior ACK really is a supplementary old ACK?

But it's still unclear how this bug introduced the regression between 4.11 and 4.16
   
Feedback is always appreciated! This issue is also present in 4.11 (that is 
where I discovered it). I think the bug was introduced much earlier.

Yes, I tested with Eric's quickack fix, it did not fix either of the two issues 
that are fixed with this patch set.

As I mentioned earlier, the bug was introduced before 4.11. I am not sure I 
understand your comments. Yes, at some level it would make sense to change the 
delayed_ack_reserved bit directly, but we would still need to do it whenever we 
send the ACK, so I do not think it can be helped. Please clarify if I 
misunderstood your comment.

>
> Improved based on comments from Neal Cardwell .
>
> Signed-off-by: Lawrence Brakmo 
> ---
>  net/ipv4/tcp_output.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index f8f6129160dd..acefb64e8280 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -172,6 +172,8 @@ static inline void tcp_event_ack_sent(struct sock 
*sk, unsigned int pkts)
> __sock_put(sk);
> }
> tcp_dec_quickack_mode(sk, pkts);
> +   if (inet_csk_ack_scheduled(sk))
> +   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
> inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
>  }
>
> @@ -3567,8 +3569,6 @@ void tcp_send_ack(struct sock *sk)
> if (sk->sk_state == TCP_CLOSE)
> return;
>
> -   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
> -
> /* We are not putting this on the write queue, so
>  * tcp_transmit_skb() will set the ownership to this
>  * sock.
> --
> 2.17.1
>




Re: [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP

2018-07-03 Thread Lawrence Brakmo
On 7/2/18, 5:52 PM, "netdev-ow...@vger.kernel.org on behalf of Neal Cardwell" 
 wrote:

On Mon, Jul 2, 2018 at 5:39 PM Lawrence Brakmo  wrote:
>
> We have observed high tail latencies when using DCTCP for RPCs as
> compared to using Cubic. For example, in one setup there are 2 hosts
> sending to a 3rd one, with each sender having 3 flows (1 stream,
> 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
> table shows the 99% and 99.9% latencies for both Cubic and dctcp:
>
>Cubic 99%  Cubic 99.9%   dctcp 99%dctcp 99.9%
>  1MB RPCs2.6ms   5.5ms 43ms  208ms
> 10KB RPCs1.1ms   1.3ms 53ms  212ms
>
> Looking at tcpdump traces showed that there are two causes for the
> latency.
>
>   1) RTOs caused by the receiver sending a dup ACK and not ACKing
>  the last (and only) packet sent.
>   2) Delaying ACKs when the sender has a cwnd of 1, so everything
>  pauses for the duration of the delayed ACK.
>
> The first patch fixes the cause of the dup ACKs, not updating DCTCP
> state when an ACK that was initially delayed has been sent with a
> data packet.
>
> The second patch ensures that an ACK is sent immediately when a
> CWR marked packet arrives.
>
> With the patches the latencies for DCTCP now look like:
>
>dctcp 99%  dctcp 99.9%
>  1MB RPCs5.8ms   6.9ms
> 10KB RPCs146us   203us
>
> Note that while the 1MB RPCs tail latencies are higher than Cubic's,
> the 10KB latencies are much smaller than Cubic's. These patches fix
> issues on the receiver, but tcpdump traces indicate there is an
> opportunity to also fix an issue at the sender that adds about 3ms
> to the tail latencies.
>
> The following trace shows the issue that triggers an RTO (fixed by these 
patches):
>
>Host A sends the last packets of the request
>Host B receives them, and the last packet is marked with congestion 
(CE)
>Host B sends ACKs for packets not marked with congestion
>Host B sends data packet with reply and ACK for packet marked with
>   congestion (TCP flag ECE)
>Host A receives ACKs with no ECE flag
>Host A receives data packet with ACK for the last packet of request
>   and which has TCP ECE bit set
>Host A sends 1st data packet of the next request with TCP flag CWR
>Host B receives the packet (as seen in tcpdump at B), no CE flag
>Host B sends a dup ACK that also has the TCP ECE flag
>Host A RTO timer fires!
>Host A retransmits the packet
>Host A receives an ACK for everything it has sent (i.e. Host B
>   did receive 1st packet of request)
>Host A sends more packets…
>
> [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
> [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet
>
>  net/ipv4/tcp_input.c  | 16 +++-
>  net/ipv4/tcp_output.c |  4 ++--
>  2 files changed, 13 insertions(+), 7 deletions(-)

Thanks, Larry. Just for context, can you please let us know whether
your tests included zero, one, or both of Eric's recent commits
(listed below) that tuned the number of ACKs after ECN events? (Or
maybe the tests were literally using a net-next kernel?) Just wanted
to get a better handle on any possible interactions there.

Thanks!

neal

Yes, my test kernel includes both patches listed below. BTW, I will send a new 
patch where I move the call to tcp_incr_quickack from tcp_ecn_check_ce to 
tcp_ecn_accept_cwr.

---

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=522040ea5fdd1c33bbf75e1d7c7c0422b96a94ef
commit 522040ea5fdd1c33bbf75e1d7c7c0422b96a94ef
Author: Eric Dumazet 
Date:   Mon May 21 15:08:57 2018 -0700

tcp: do not aggressively quick ack after ECN events

ECN signals currently forces TCP to enter quickack mode for
up to 16 (TCP_MAX_QUICKACKS) following incoming packets.

We believe this is not needed, and only sending one immediate ack
for the current packet should be enough.

This should reduce the extra load noticed in DCTCP environments,
after congestion events.

This is part 2 of our effort to reduce pure ACK packets.

Signed-off-by: Eric Dumazet 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
Acked-by: Neal Cardwell 
Signed-off-by: David S. Miller 


https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=15ecbe94a45ef88491ca459b26efdd02f91edb6d
commit 15ecbe94a45ef88491ca459b26efdd02f91edb6d
Author: Eric Dumazet 
Date:   Wed Jun 27 08:47:21 2018 -0700

tcp: add one more quick 

Re: [bpf PATCH 1/2] bpf: sockmap, error path can not release psock in multi-map case

2018-07-03 Thread Daniel Borkmann
On 06/30/2018 03:51 PM, John Fastabend wrote:
> The current code, in the error path of sock_hash_ctx_update_elem,
> checks if the sock has a psock in the user data and if so decrements
> the reference count of the psock. However, if the error happens early
> in the error path we may have never incremented the psock reference
> count and if the psock exists because the sock is in another map then
> we may inadvertently decrement the reference count.
> 
> Fix this by making the error path only call smap_release_sock if the
> error happens after the increment.
> 
> Reported-by: syzbot+d464d2c20c717ef5a...@syzkaller.appspotmail.com
> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
> Signed-off-by: John Fastabend 
> ---
>  0 files changed
> 
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 4fc2cb1..63fb047 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -1896,7 +1896,7 @@ static int __sock_map_ctx_update_elem(struct bpf_map 
> *map,
>   e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN);
>   if (!e) {
>   err = -ENOMEM;
> - goto out_progs;
> + goto out_free;
>   }
>   }
>  
> @@ -2324,7 +2324,12 @@ static int sock_hash_ctx_update_elem(struct 
> bpf_sock_ops_kern *skops,
>   if (err)
>   goto err;
>  
> - /* bpf_map_update_elem() can be called in_irq() */
> + psock = smap_psock_sk(sock);
> + if (unlikely(!psock)) {
> + err = -EINVAL;
> + goto err;
> + }

Is an error even possible at this point? If __sock_map_ctx_update_elem() 
succeeds,
we either allocated and linked a new psock to the sock or we inc'ed the existing
one's refcount. From my reading it seems we should always succeed the subsequent
smap_psock_sk(). If we would have failed here in between it would mean we'd have
a refcount imbalance somewhere?

> +
>   raw_spin_lock_bh(&b->lock);
>   l_old = lookup_elem_raw(head, hash, key, key_size);
>   if (l_old && map_flags == BPF_NOEXIST) {
> @@ -2342,12 +2347,6 @@ static int sock_hash_ctx_update_elem(struct 
> bpf_sock_ops_kern *skops,
>   goto bucket_err;
>   }
>  
> - psock = smap_psock_sk(sock);
> - if (unlikely(!psock)) {
> - err = -EINVAL;
> - goto bucket_err;
> - }
> -
>   rcu_assign_pointer(e->hash_link, l_new);
>   rcu_assign_pointer(e->htab,
>  container_of(map, struct bpf_htab, map));
> @@ -2370,12 +2369,10 @@ static int sock_hash_ctx_update_elem(struct 
> bpf_sock_ops_kern *skops,
>   raw_spin_unlock_bh(&b->lock);
>   return 0;
>  bucket_err:
> + smap_release_sock(psock, sock);
>   raw_spin_unlock_bh(&b->lock);
>  err:
>   kfree(e);
> - psock = smap_psock_sk(sock);
> - if (psock)
> - smap_release_sock(psock, sock);
>   return err;
>  }
>  
> 

Thanks,
Daniel


[PATCH net] smsc75xx: Add workaround for gigabit link up hardware errata.

2018-07-03 Thread Yuiko Oshino
In certain conditions, the device may not be able to link in gigabit mode. This 
software workaround ensures that the device will not enter the failure state.

Fixes: d0cad871703b898a442e4049c532ec39168e5b57 ("SMSC75XX USB 2.0 Gigabit 
Ethernet Devices")
Signed-off-by: Yuiko Oshino 
---
 drivers/net/usb/smsc75xx.c | 62 ++
 1 file changed, 62 insertions(+)

diff --git a/drivers/net/usb/smsc75xx.c b/drivers/net/usb/smsc75xx.c
index 7a6a1fe..05553d2 100644
--- a/drivers/net/usb/smsc75xx.c
+++ b/drivers/net/usb/smsc75xx.c
@@ -82,6 +82,9 @@ static bool turbo_mode = true;
 module_param(turbo_mode, bool, 0644);
 MODULE_PARM_DESC(turbo_mode, "Enable multiple frames per Rx transaction");
 
+static int smsc75xx_link_ok_nopm(struct usbnet *dev);
+static int smsc75xx_phy_gig_workaround(struct usbnet *dev);
+
 static int __must_check __smsc75xx_read_reg(struct usbnet *dev, u32 index,
u32 *data, int in_pm)
 {
@@ -852,6 +855,9 @@ static int smsc75xx_phy_initialize(struct usbnet *dev)
return -EIO;
}
 
+   /* phy workaround for gig link */
+   smsc75xx_phy_gig_workaround(dev);
+
smsc75xx_mdio_write(dev->net, dev->mii.phy_id, MII_ADVERTISE,
ADVERTISE_ALL | ADVERTISE_CSMA | ADVERTISE_PAUSE_CAP |
ADVERTISE_PAUSE_ASYM);
@@ -987,6 +993,62 @@ static int smsc75xx_wait_ready(struct usbnet *dev, int 
in_pm)
return -EIO;
 }
 
+static int smsc75xx_phy_gig_workaround(struct usbnet *dev)
+{
+   struct mii_if_info *mii = &dev->mii;
+   int ret = 0, timeout = 0;
+   u32 buf, link_up = 0;
+
+   /* Set the phy in Gig loopback */
+   smsc75xx_mdio_write(dev->net, mii->phy_id, MII_BMCR, 0x4040);
+
+   /* Wait for the link up */
+   do {
+   link_up = smsc75xx_link_ok_nopm(dev);
+   usleep_range(10000, 20000);
+   timeout++;
+   } while ((!link_up) && (timeout < 1000));
+
+   if (timeout >= 1000) {
+   netdev_warn(dev->net, "Timeout waiting for PHY link up\n");
+   return -EIO;
+   }
+
+   /* phy reset */
+   ret = smsc75xx_read_reg(dev, PMT_CTL, &buf);
+   if (ret < 0) {
+   netdev_warn(dev->net, "Failed to read PMT_CTL: %d\n", ret);
+   return ret;
+   }
+
+   buf |= PMT_CTL_PHY_RST;
+
+   ret = smsc75xx_write_reg(dev, PMT_CTL, buf);
+   if (ret < 0) {
+   netdev_warn(dev->net, "Failed to write PMT_CTL: %d\n", ret);
+   return ret;
+   }
+
+   timeout = 0;
+   do {
+   usleep_range(10000, 20000);
+   ret = smsc75xx_read_reg(dev, PMT_CTL, &buf);
+   if (ret < 0) {
+   netdev_warn(dev->net, "Failed to read PMT_CTL: %d\n",
+   ret);
+   return ret;
+   }
+   timeout++;
+   } while ((buf & PMT_CTL_PHY_RST) && (timeout < 100));
+
+   if (timeout >= 100) {
+   netdev_warn(dev->net, "timeout waiting for PHY Reset\n");
+   return -EIO;
+   }
+
+   return 0;
+}
+
 static int smsc75xx_reset(struct usbnet *dev)
 {
struct smsc75xx_priv *pdata = (struct smsc75xx_priv *)(dev->data[0]);
-- 
2.7.4



Re: [PATCH net] r8169: fix mac address change

2018-07-03 Thread David Miller
From: Heiner Kallweit 
Date: Mon, 2 Jul 2018 22:49:35 +0200

> Network core refuses to change mac address because flag
> IFF_LIVE_ADDR_CHANGE isn't set. Set this missing flag.
> 
> Fixes: 1f7aa2bc268e ("r8169: simplify rtl_set_mac_address")
> Reported-by: Corinna Vinschen 
> Signed-off-by: Heiner Kallweit 

Applied and queued up for -stable.


Re: [PATCH] lib: rhashtable: Correct self-assignment in rhashtable.c

2018-07-03 Thread David Miller
From: Rishabh Bhatnagar 
Date: Mon,  2 Jul 2018 09:35:34 -0700

> In file lib/rhashtable.c line 777, skip variable is assigned to
> itself. The following error was observed:
> 
> lib/rhashtable.c:777:41: warning: explicitly assigning value of
> variable of type 'int' to itself [-Wself-assign] error, forbidden
> warning: rhashtable.c:777
> This error was found when compiling with Clang 6.0. Change it to iter->skip.
> 
> Signed-off-by: Rishabh Bhatnagar 
> Acked-by: Herbert Xu 
> Reviewed-by: NeilBrown 

Applied, thanks.


Re: [PATCH net-next v3 0/5] net: aquantia: various ethtool ops implementation

2018-07-03 Thread David Miller
From: Igor Russkikh 
Date: Mon,  2 Jul 2018 17:03:34 +0300

> In this patchset Anton Mikaev and I added some useful ethtool operations:
> - ring size changes
> - link renegotioation
> - flow control management
> 
> The patch also improves init/deinit sequence.
> 
> V3 changes:
> - After review and analysis it is clear that rtnl lock (which is 
>   captured by default on ethtool ops) is enough to secure possible
>   overlapping of dev open/close. Thus, just dropping internal mutex.
> 
> V2 changes:
> - using mutex to secure simultaneous dev close/open
> - using state var to store/restore dev state

Series applied.


Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr

2018-07-03 Thread Ka-Cheong Poon

On 06/30/2018 04:50 PM, David Miller wrote:

From: Ka-Cheong Poon 
Date: Wed, 27 Jun 2018 03:23:27 -0700


This patch changes the internal representation of an IP address to use
struct in6_addr.  IPv4 address is stored as an IPv4 mapped address.
All the functions which take an IP address as argument are also
changed to use struct in6_addr.  But RDS socket layer is not modified
such that it still does not accept IPv6 address from an application.
And RDS layer does not accept nor initiate IPv6 connections.

v2: Fixed sparse warnings.

Signed-off-by: Ka-Cheong Poon 


I really don't like this.

An ipv4 mapped ipv6 address is not the same as an ipv4 address.

You are effectively preventing the use of ipv6 connections
using ipv4 mapped addresses.



Could you please clarify what is meant by an IPv6
connection using an IPv4-mapped address?  An IPv6 packet
cannot use an IPv4 mapped address as source or destination
address.  Do you mean an app uses an IPv4 mapped address
in a struct sockaddr_in6 to set up an IPv4 connection?

Please note that this patch is patch #1.  This patch
alone does not support RDS/IPv6.  Hence this patch has
checks to prevent an app from using an IPv6 address.  Those
checks will be removed in patch #2.



Also, assuming the sockaddr type based upon size is wrong.
You have to check the family field, then you can decide
to interpret the rest of the sockaddr in one way or another
and also validate its length.




--
K. Poon
ka-cheong.p...@oracle.com




Re: [PATCH net] net/ipv6: Revert attempt to simplify route replace and append

2018-07-03 Thread Ido Schimmel
On Mon, Jul 02, 2018 at 03:03:12PM -0700, dsah...@kernel.org wrote:
> From: David Ahern 
> 
> NetworkManager likes to manage linklocal prefix routes and does so with
> the NLM_F_APPEND flag, breaking attempts to simplify the IPv6 route
> code and by extension enable multipath routes with device only nexthops.
> 
> Revert f34436a43092 and its followup
> 6eba08c3626b ("ipv6: Only emit append events for appended routes").
> Update the test cases to reflect the old behavior.
> 
> Fixes: f34436a43092 ("net/ipv6: Simplify route replace and appending into 
> multipath route")
> Signed-off-by: David Ahern 
> ---
> Ido: I left 5a15a1b07c51. FIB_EVENT_ENTRY_APPEND is not generated for
>  IPv6, so no harm in leaving it.

OK, but I had a follow-up series that did further changes in mlxsw
(merge commit eab9a2d5f323) to support the new behavior. You want to
revert it and squash to v2 or I'll send another revert?

Also, Petr added a multipath test that relies on IPv6 device only
nexthops. See commit 54818c4c4b937.


[PATCH net-next] net: sched: act_pedit: fix possible memory leak in tcf_pedit_init()

2018-07-03 Thread Wei Yongjun
'keys_ex' is malloced by tcf_pedit_keys_ex_parse() in tcf_pedit_init(),
but not all of the error handling paths free it, which may cause a memory
leak. This patch fixes it.

Fixes: 71d0ed7079df ("net/act_pedit: Support using offset relative to the 
conventional network headers")
Signed-off-by: Wei Yongjun 
---
 net/sched/act_pedit.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/net/sched/act_pedit.c b/net/sched/act_pedit.c
index 55bc96b..e43aef2 100644
--- a/net/sched/act_pedit.c
+++ b/net/sched/act_pedit.c
@@ -175,32 +175,35 @@ static int tcf_pedit_init(struct net *net, struct nlattr 
*nla,
if (!tcf_idr_check(tn, parm->index, a, bind)) {
if (!parm->nkeys) {
NL_SET_ERR_MSG_MOD(extack, "Pedit requires keys to be 
passed");
-   return -EINVAL;
+   ret = -EINVAL;
+   goto out_free;
}
ret = tcf_idr_create(tn, parm->index, est, a,
 &act_pedit_ops, bind, false);
if (ret)
-   return ret;
+   goto out_free;
p = to_pedit(*a);
keys = kmalloc(ksize, GFP_KERNEL);
if (!keys) {
tcf_idr_release(*a, bind);
-   kfree(keys_ex);
-   return -ENOMEM;
+   ret = -ENOMEM;
+   goto out_free;
}
ret = ACT_P_CREATED;
} else {
if (bind)
-   return 0;
+   goto out_free;
tcf_idr_release(*a, bind);
-   if (!ovr)
-   return -EEXIST;
+   if (!ovr) {
+   ret = -EEXIST;
+   goto out_free;
+   }
p = to_pedit(*a);
if (p->tcfp_nkeys && p->tcfp_nkeys != parm->nkeys) {
keys = kmalloc(ksize, GFP_KERNEL);
if (!keys) {
-   kfree(keys_ex);
-   return -ENOMEM;
+   ret = -ENOMEM;
+   goto out_free;
}
}
}
@@ -222,6 +225,10 @@ static int tcf_pedit_init(struct net *net, struct nlattr 
*nla,
if (ret == ACT_P_CREATED)
tcf_idr_insert(tn, *a);
return ret;
+out_free:
+   kfree(keys_ex);
+   return ret;
+
 }
 
 static void tcf_pedit_cleanup(struct tc_action *a)



[PATCH] rhashtable: add restart routine in rhashtable_free_and_destroy()

2018-07-03 Thread Taehee Yoo
rhashtable_free_and_destroy() cancels the deferred re-hash work and then
walks and destroys the elements. At this point, some elements can still
be in future_tbl, and those elements are not destroyed.

test case:
 nft_rhash_destroy() calls rhashtable_free_and_destroy() to destroy
 all elements of sets before destroying sets and chains.
 But rhashtable_free_and_destroy() doesn't destroy elements of future_tbl,
 so the splat below occurred.

test script:
   %cat test.nft
   table ip aa {
   map map1 {
   type ipv4_addr : verdict;
   elements = {
   0 : jump a0,
   1 : jump a0,
   2 : jump a0,
   3 : jump a0,
   4 : jump a0,
   5 : jump a0,
   6 : jump a0,
   7 : jump a0,
   8 : jump a0,
   9 : jump a0,
   }
   }
   chain a0 {
   }
   }
   flush ruleset
   table ip aa {
   map map1 {
   type ipv4_addr : verdict;
   elements = {
   0 : jump a0,
   1 : jump a0,
   2 : jump a0,
   3 : jump a0,
   4 : jump a0,
   5 : jump a0,
   6 : jump a0,
   7 : jump a0,
   8 : jump a0,
   9 : jump a0,
   }
   }
   chain a0 {
   }
   }
   flush ruleset

   %while :; do nft -f test.nft; done

Splat looks like:
[  200.795603] kernel BUG at net/netfilter/nf_tables_api.c:1363!
[  200.806944] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  200.812253] CPU: 1 PID: 1582 Comm: nft Not tainted 4.17.0+ #24
[  200.820297] Hardware name: To be filled by O.E.M. To be filled by 
O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[  200.830309] RIP: 0010:nf_tables_chain_destroy.isra.34+0x62/0x240 [nf_tables]
[  200.838317] Code: 43 50 85 c0 74 26 48 8b 45 00 48 8b 4d 08 ba 54 05 00 00 
48 c7 c6 60 6d 29 c0 48 c7 c7 c0 65 29 c0 4c 8b 40 08 e8 58 e5 fd f8 <0f> 0b 48 
89 da 48 b8 00 00 00 00 00 fc ff
[  200.860366] RSP: :880118dbf4d0 EFLAGS: 00010282
[  200.866354] RAX: 0061 RBX: 88010cdeaf08 RCX: 
[  200.874355] RDX: 0061 RSI: 0008 RDI: ed00231b7e90
[  200.882361] RBP: 880118dbf4e8 R08: ed002373bcfb R09: ed002373bcfa
[  200.890354] R10:  R11: ed002373bcfb R12: dead0200
[  200.898356] R13: dead0100 R14: bb62af38 R15: dc00
[  200.906354] FS:  7fefc31fd700() GS:88011b80() 
knlGS:
[  200.915533] CS:  0010 DS:  ES:  CR0: 80050033
[  200.922355] CR2: 557f1c8e9128 CR3: 00010688 CR4: 001006e0
[  200.930353] Call Trace:
[  200.932351]  ? nf_tables_commit+0x26f6/0x2c60 [nf_tables]
[  200.939525]  ? nf_tables_setelem_notify.constprop.49+0x1a0/0x1a0 [nf_tables]
[  200.947525]  ? nf_tables_delchain+0x6e0/0x6e0 [nf_tables]
[  200.952383]  ? nft_add_set_elem+0x1700/0x1700 [nf_tables]
[  200.959532]  ? nla_parse+0xab/0x230
[  200.963529]  ? nfnetlink_rcv_batch+0xd06/0x10d0 [nfnetlink]
[  200.968384]  ? nfnetlink_net_init+0x130/0x130 [nfnetlink]
[  200.975525]  ? debug_show_all_locks+0x290/0x290
[  200.980363]  ? debug_show_all_locks+0x290/0x290
[  200.986356]  ? sched_clock_cpu+0x132/0x170
[  200.990352]  ? find_held_lock+0x39/0x1b0
[  200.994355]  ? sched_clock_local+0x10d/0x130
[  200.999531]  ? memset+0x1f/0x40


Signed-off-by: Taehee Yoo 
---
 lib/rhashtable.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 0e04947..8ea27fa 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -1134,6 +1134,7 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
mutex_lock(&ht->mutex);
tbl = rht_dereference(ht->tbl, ht);
if (free_fn) {
+restart:
for (i = 0; i < tbl->size; i++) {
struct rhash_head *pos, *next;
 
@@ -1147,9 +1148,11 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
rht_dereference(pos->next, ht) : NULL)
rhashtable_free_one(ht, pos, free_fn, arg);
}
+   tbl = rht_dereference(tbl->future_tbl, ht);
+   if (tbl)
+   goto restart;
}
-
-   bucket_table_free(tbl);
+   bucket_table_free(rht_dereference(ht->tbl, ht));
mutex_unlock(&ht->mutex);
 }
 EXPORT_SYMBOL_GPL(rhashtable_free_and_destroy);
-- 
2.9.3



Re: [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent

2018-07-03 Thread Neal Cardwell
On Mon, Jul 2, 2018 at 7:49 PM Yuchung Cheng  wrote:
>
> On Mon, Jul 2, 2018 at 2:39 PM, Lawrence Brakmo  wrote:
> >
> > DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
> > notifications to keep track if it needs to send an ACK for packets that
> > were received with a particular ECN state but whose ACK was delayed.
> >
> > Under some circumstances, for example when a delayed ACK is sent with a
> > data packet, DCTCP state was not being updated due to a lack of
> > notification that the previously delayed ACK was sent. As a result, it
> > would sometimes send a duplicate ACK when a new data packet arrived.
> >
> > This patch ensures that DCTCP's state is correctly updated so it will
> > not send the duplicate ACK.
> Sorry to chime-in late here (lame excuse: IETF deadline)
>
> IIRC this issue would exist prior to 4.11 kernel. While it'd be good
> to fix that, it's not clear which patch introduced the regression
> between 4.11 and 4.16? I assume you tested Eric's most recent quickack
> fix.

I don't think Larry is saying there's a regression between 4.11 and
4.16. His recent "tcp: force cwnd at least 2 in tcp_cwnd_reduction"
patch here:

  https://patchwork.ozlabs.org/patch/935233/

said that 4.11 was bad (high tail latency and lots of RTOs) and 4.16
was still bad but with different netstat counters (no RTOs but still
high tail latency):

"""
On 4.11, pcap traces indicate that in some instances the 1st packet of
the RPC is received but no ACK is sent before the packet is
retransmitted. On 4.11 netstat shows TCP timeouts, with some of them
spurious.

On 4.16, we don't see retransmits in netstat but the high tail latencies
are still there.
"""

I suspect the RTOs disappeared but latencies remained too high because
between 4.11 and 4.16 we introduced:
  tcp: allow TLP in ECN CWR
  
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=b4f70c3d4ec32a2ff4c62e1e2da0da5f55fe12bd

So the RTOs probably disappeared because this commit turned them into
TLPs. But the latencies remained high because the fundamental bug
remained throughout 4.11 and 4.16 and today: the DCTCP use of
tcp_send_ack() with an old rcv_nxt caused delayed ACKs to be cancelled
when they should not have been.

> In terms of the fix itself, it seems odd the tcp_send_ack() call in
> DCTCP generates NON_DELAYED_ACK event to toggle DCTCP's
> delayed_ack_reserved bit. Shouldn't the fix be to have DCTCP send the
> "prior" ACK w/o cancelling the delayed ACK and mis-recording that it's
> cancelled, because that prior ACK really is a supplementary old ACK?

This patch is fixing an issue that's orthogonal to the one you are
talking about. Using the taxonomy from our team's internal discussion
yesterday, the issue you mention where the DCTCP "prior" ACK is
cancelling delayed ACKs is issue (4); the issue that this particular
"tcp: notify when a delayed ack is sent" patch from Larry fixes is
issue (3). It's a bit tricky because both issues appeared in Larry's
trace summary and packetdrill script to reproduce the issue.

neal


[PATCH iproute2] tc: Fix output of ip attributes

2018-07-03 Thread Roi Dayan
Example output is of tos and ttl.
Before this fix, the %x format caused the pointer to be printed
instead of the intended string created in the out variable.

Fixes: e28b88a464c4 ("tc: jsonify flower filter")
Signed-off-by: Roi Dayan 
---
 tc/f_flower.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tc/f_flower.c b/tc/f_flower.c
index c710765179fb..1dfd57d286d9 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -1134,7 +1134,7 @@ static void flower_print_ip_attr(char *name, struct 
rtattr *key_attr,
if (mask_attr)
sprintf(out + done, "/%x", rta_getattr_u8(mask_attr));
 
-   sprintf(namefrm, "\n  %s %%x", name);
+   sprintf(namefrm, "\n  %s %%s", name);
print_string(PRINT_ANY, name, namefrm, out);
 }
 
-- 
2.7.5



[PATCH net-next 1/2] selftests: forwarding: lib: extract ping and ping6 so they can be reused

2018-07-03 Thread Nikolay Aleksandrov
Extract ping and ping6 command execution so the return value can be
checked by the caller; this is needed for port isolation tests that are
intended to fail.

Signed-off-by: Nikolay Aleksandrov 
---
 tools/testing/selftests/net/forwarding/lib.sh | 22 --
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh 
b/tools/testing/selftests/net/forwarding/lib.sh
index e073918bbe15..2bb9cf303c53 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -659,30 +659,40 @@ multipath_eval()
 ##
 # Tests
 
-ping_test()
+ping_do()
 {
local if_name=$1
local dip=$2
local vrf_name
 
-   RET=0
-
vrf_name=$(master_name_get $if_name)
ip vrf exec $vrf_name $PING $dip -c 10 -i 0.1 -w 2 &> /dev/null
+}
+
+ping_test()
+{
+   RET=0
+
+   ping_do $1 $2
check_err $?
log_test "ping"
 }
 
-ping6_test()
+ping6_do()
 {
local if_name=$1
local dip=$2
local vrf_name
 
-   RET=0
-
vrf_name=$(master_name_get $if_name)
ip vrf exec $vrf_name $PING6 $dip -c 10 -i 0.1 -w 2 &> /dev/null
+}
+
+ping6_test()
+{
+   RET=0
+
+   ping6_do $1 $2
check_err $?
log_test "ping6"
 }
-- 
2.11.0
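The point of the split is that a caller can now assert on the ping result directly, including negative tests that expect the ping to fail. A self-contained sketch of that usage; the functions below are stubs standing in for the real lib.sh helpers, not the actual selftest code:

```shell
# Stub for the extracted helper; the real one runs:
#   ip vrf exec $vrf_name $PING $dip -c 10 -i 0.1 -w 2
ping_do()
{
	false	# simulate: isolated ports cannot reach each other
}

# lib.sh-style semantics: RET=0 (pass) when the command failed as expected
check_fail()
{
	if [ "$1" -ne 0 ]; then RET=0; else RET=1; fi
}

RET=-1
ping_do sw1p1 192.0.2.2 && rc=0 || rc=$?
check_fail $rc
echo "isolation ping test RET=$RET"
```

Run as above, the expected failure is recorded as a pass (RET=0), which is exactly what the port isolation tests in patch 2/2 need.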



[PATCH iproute2 net-next] bridge: add support for isolated option

2018-07-03 Thread Nikolay Aleksandrov
This patch adds support for the new isolated port option which, if set,
would allow the isolated ports to communicate only with non-isolated
ports and the bridge device. The option can be set via the bridge or ip
link type bridge_slave commands, e.g.:
$ ip link set dev eth0 type bridge_slave isolated on
$ bridge link set dev eth0 isolated on

Signed-off-by: Nikolay Aleksandrov 
---
 bridge/link.c| 11 +++
 ip/iplink_bridge_slave.c |  9 +
 man/man8/bridge.8|  6 ++
 man/man8/ip-link.8.in|  6 --
 4 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/bridge/link.c b/bridge/link.c
index 8d89aca2e638..9656ca338782 100644
--- a/bridge/link.c
+++ b/bridge/link.c
@@ -152,6 +152,9 @@ static void print_protinfo(FILE *fp, struct rtattr *attr)
if (prtb[IFLA_BRPORT_VLAN_TUNNEL])
print_onoff(fp, "vlan_tunnel",

rta_getattr_u8(prtb[IFLA_BRPORT_VLAN_TUNNEL]));
+   if (prtb[IFLA_BRPORT_ISOLATED])
+   print_onoff(fp, "isolated",
+   rta_getattr_u8(prtb[IFLA_BRPORT_ISOLATED]));
} else
print_portstate(rta_getattr_u8(attr));
 }
@@ -250,6 +253,7 @@ static void usage(void)
fprintf(stderr, "   [ mcast_flood {on | 
off} ]\n");
fprintf(stderr, "   [ neigh_suppress {on | 
off} ]\n");
fprintf(stderr, "   [ vlan_tunnel {on | 
off} ]\n");
+   fprintf(stderr, "   [ isolated {on | off} 
]\n");
fprintf(stderr, "   [ hwmode {vepa | veb} 
]\n");
fprintf(stderr, "   [ self ] [ master ]\n");
fprintf(stderr, "   bridge link show [dev DEV]\n");
@@ -291,6 +295,7 @@ static int brlink_modify(int argc, char **argv)
__s8 flood = -1;
__s8 vlan_tunnel = -1;
__s8 mcast_flood = -1;
+   __s8 isolated = -1;
__s8 hairpin = -1;
__s8 bpdu_guard = -1;
__s8 fast_leave = -1;
@@ -386,6 +391,10 @@ static int brlink_modify(int argc, char **argv)
if (!on_off("vlan_tunnel", &vlan_tunnel,
*argv))
return -1;
+   } else if (strcmp(*argv, "isolated") == 0) {
+   NEXT_ARG();
+   if (!on_off("isolated", &isolated, *argv))
+   return -1;
} else {
usage();
}
@@ -444,6 +453,8 @@ static int brlink_modify(int argc, char **argv)
if (vlan_tunnel != -1)
addattr8(&req.n, sizeof(req), IFLA_BRPORT_VLAN_TUNNEL,
 vlan_tunnel);
+   if (isolated != -1)
+   addattr8(&req.n, sizeof(req), IFLA_BRPORT_ISOLATED, isolated);
 
addattr_nest_end(&req.n, nest);
 
diff --git a/ip/iplink_bridge_slave.c b/ip/iplink_bridge_slave.c
index 3fbfb878cdc4..5a6e48559781 100644
--- a/ip/iplink_bridge_slave.c
+++ b/ip/iplink_bridge_slave.c
@@ -40,6 +40,7 @@ static void print_explain(FILE *f)
"[ group_fwd_mask MASK ]\n"
"[ neigh_suppress {on | off} ]\n"
"[ vlan_tunnel {on | off} ]\n"
+   "[ isolated {on | off} ]\n"
);
 }
 
@@ -274,6 +275,10 @@ static void bridge_slave_print_opt(struct link_util *lu, FILE *f,
if (tb[IFLA_BRPORT_VLAN_TUNNEL])
_print_onoff(f, "vlan_tunnel", "vlan_tunnel",
 rta_getattr_u8(tb[IFLA_BRPORT_VLAN_TUNNEL]));
+
+   if (tb[IFLA_BRPORT_ISOLATED])
+   _print_onoff(f, "isolated", "isolated",
+rta_getattr_u8(tb[IFLA_BRPORT_ISOLATED]));
 }
 
 static void bridge_slave_parse_on_off(char *arg_name, char *arg_val,
@@ -379,6 +384,10 @@ static int bridge_slave_parse_opt(struct link_util *lu, int argc, char **argv,
NEXT_ARG();
bridge_slave_parse_on_off("vlan_tunnel", *argv, n,
  IFLA_BRPORT_VLAN_TUNNEL);
+   } else if (matches(*argv, "isolated") == 0) {
+   NEXT_ARG();
+   bridge_slave_parse_on_off("isolated", *argv, n,
+ IFLA_BRPORT_ISOLATED);
} else if (matches(*argv, "help") == 0) {
explain();
return -1;
diff --git a/man/man8/bridge.8 b/man/man8/bridge.8
index e7f7148315e1..f6d228c5ebfe 100644
--- a/man/man8/bridge.8
+++ b/man/man8/bridge.8
@@ -48,6 +48,7 @@ bridge \- show / manipulate bridge addresses and devices
 .BR mcast_flood " { " on " | " off " } ] [ "
 .BR neigh_suppress " { " on " | " off " } ] [

[PATCH net-next 2/2] selftests: forwarding: test for bridge port isolation

2018-07-03 Thread Nikolay Aleksandrov
This test checks if the bridge port isolation feature works as expected
by performing ping/ping6 tests between hosts that are isolated (should
not work) and between an isolated and a non-isolated host (should work).
The same test is performed for flooding from and to isolated and
non-isolated ports.

Signed-off-by: Nikolay Aleksandrov 
---
 .../net/forwarding/bridge_port_isolation.sh| 151 +
 1 file changed, 151 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/bridge_port_isolation.sh

diff --git a/tools/testing/selftests/net/forwarding/bridge_port_isolation.sh b/tools/testing/selftests/net/forwarding/bridge_port_isolation.sh
new file mode 100755
index ..a43b4645c4de
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/bridge_port_isolation.sh
@@ -0,0 +1,151 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+ALL_TESTS="ping_ipv4 ping_ipv6 flooding"
+NUM_NETIFS=6
+CHECK_TC="yes"
+source lib.sh
+
+h1_create()
+{
+   simple_if_init $h1 192.0.2.1/24 2001:db8:1::1/64
+}
+
+h1_destroy()
+{
+   simple_if_fini $h1 192.0.2.1/24 2001:db8:1::1/64
+}
+
+h2_create()
+{
+   simple_if_init $h2 192.0.2.2/24 2001:db8:1::2/64
+}
+
+h2_destroy()
+{
+   simple_if_fini $h2 192.0.2.2/24 2001:db8:1::2/64
+}
+
+h3_create()
+{
+   simple_if_init $h3 192.0.2.3/24 2001:db8:1::3/64
+}
+
+h3_destroy()
+{
+   simple_if_fini $h3 192.0.2.3/24 2001:db8:1::3/64
+}
+
+switch_create()
+{
+   ip link add dev br0 type bridge
+
+   ip link set dev $swp1 master br0
+   ip link set dev $swp2 master br0
+   ip link set dev $swp3 master br0
+
+   ip link set dev $swp1 type bridge_slave isolated on
+   check_err $? "Can't set isolation on port $swp1"
+   ip link set dev $swp2 type bridge_slave isolated on
+   check_err $? "Can't set isolation on port $swp2"
+   ip link set dev $swp3 type bridge_slave isolated off
+   check_err $? "Can't disable isolation on port $swp3"
+
+   ip link set dev br0 up
+   ip link set dev $swp1 up
+   ip link set dev $swp2 up
+   ip link set dev $swp3 up
+}
+
+switch_destroy()
+{
+   ip link set dev $swp3 down
+   ip link set dev $swp2 down
+   ip link set dev $swp1 down
+
+   ip link del dev br0
+}
+
+setup_prepare()
+{
+   h1=${NETIFS[p1]}
+   swp1=${NETIFS[p2]}
+
+   swp2=${NETIFS[p3]}
+   h2=${NETIFS[p4]}
+
+   swp3=${NETIFS[p5]}
+   h3=${NETIFS[p6]}
+
+   vrf_prepare
+
+   h1_create
+   h2_create
+   h3_create
+
+   switch_create
+}
+
+cleanup()
+{
+   pre_cleanup
+
+   switch_destroy
+
+   h3_destroy
+   h2_destroy
+   h1_destroy
+
+   vrf_cleanup
+}
+
+ping_ipv4()
+{
+   RET=0
+   ping_do $h1 192.0.2.2
+   check_fail $? "Ping worked when it should not have"
+
+   RET=0
+   ping_do $h3 192.0.2.2
+   check_err $? "Ping didn't work when it should have"
+
+   log_test "Isolated port ping"
+}
+
+ping_ipv6()
+{
+   RET=0
+   ping6_do $h1 2001:db8:1::2
+   check_fail $? "Ping6 worked when it should not have"
+
+   RET=0
+   ping6_do $h3 2001:db8:1::2
+   check_err $? "Ping6 didn't work when it should have"
+
+   log_test "Isolated port ping6"
+}
+
+flooding()
+{
+   local mac=de:ad:be:ef:13:37
+   local ip=192.0.2.100
+
+   RET=0
+   flood_test_do false $mac $ip $h1 $h2
+   check_err $? "Packet was flooded when it should not have been"
+
+   RET=0
+   flood_test_do true $mac $ip $h3 $h2
+   check_err $? "Packet was not flooded when it should have been"
+
+   log_test "Isolated port flooding"
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tests_run
+
+exit $EXIT_STATUS
-- 
2.11.0



[PATCH net-next 0/2] bridge: iproute2 isolated port and selftests

2018-07-03 Thread Nikolay Aleksandrov
Add support to iproute2 for port isolation config and selftests for it.

Nikolay Aleksandrov (2):
  selftests: forwarding: lib: extract ping and ping6 so they can be
reused
  selftests: forwarding: test for bridge port isolation

 .../net/forwarding/bridge_port_isolation.sh| 151 +
 tools/testing/selftests/net/forwarding/lib.sh  |  22 ++-
 2 files changed, 167 insertions(+), 6 deletions(-)
 create mode 100755 tools/testing/selftests/net/forwarding/bridge_port_isolation.sh

-- 
2.11.0



Re: [PATCH v4 net-next 7/9] net: ipv4: listified version of ip_rcv

2018-07-03 Thread Florian Westphal
Pablo Neira Ayuso  wrote:
> On Mon, Jul 02, 2018 at 04:14:12PM +0100, Edward Cree wrote:
> > Also involved adding a way to run a netfilter hook over a list of packets.
> >  Rather than attempting to make netfilter know about lists (which would be
> >  a major project in itself) we just let it call the regular okfn (in this
> >  case ip_rcv_finish()) for any packets it steals, and have it give us back
> >  a list of packets it's synchronously accepted (which normally NF_HOOK
> >  would automatically call okfn() on, but we want to be able to potentially
> >  pass the list to a listified version of okfn().)
> > The netfilter hooks themselves are indirect calls that still happen per-
> >  packet (see nf_hook_entry_hookfn()), but again, changing that can be left
> >  for future work.
> > 
> > There is potential for out-of-order receives if the netfilter hook ends up
> >  synchronously stealing packets, as they will be processed before any
> >  accepts earlier in the list.  However, it was already possible for an
> >  asynchronous accept to cause out-of-order receives, so presumably this is
> >  considered OK.
> 
> I think we can simplify things if these chained packets don't follow
> the standard forwarding path, this would require to revisit many
> subsystems to handle these new chained packets - potentially a lot of
> work and likely breaking many things - and I would expect we (and
> other subsystems too) will not get very much benefits from these
> chained packets.

I guess it depends on what kind of 'bundles' we can expect on the wire.
For an average netfilter ruleset more unbundling will be required,
because we might have

ipv4/udp -> ipv4/tcp -> ipv6/udp -> ipv4/tcp -> ipv4/udp

From Ed's patch set, this would be rebundled to

ipv4/udp -> ipv4/tcp -> ipv4/tcp -> ipv4/udp
and
ipv6/udp

nf ipv4 rule eval has to untangle again, to
ipv4/udp -> ipv4/udp
ipv4/tcp -> ipv4/tcp

HOWEVER, there are hooks that are L4 agnostic, such as defragmentation,
so we might still be able to take advantage.

Defrag is extra silly at the moment: we eat the indirect call cost
only to return NF_ACCEPT without doing anything in 99.99% of all cases.

So I still think there is value in exploring to pass the bundle
into the nf core, via new nf_hook_slow_list() that can be used
from forward path (ingress, prerouting, input, forward, postrouting).

We can do this step by step, first splitting everything in
nf_hook_slow_list() and just calling the hook functions for each skb in the list.

> In general I like this infrastructure, but I think we can get
> something simpler if we combine it with the flowtable idea, so chained
> packets follow the non-standard flowtable forwarding path as described
> in [1].

Yes, in case all packets are in the flow table (software fallback for
offload) we might be able to keep the list as-is, provided everything
has the same nexthop.

However, I don't see any need to make changes to this series for this
now, it can be added on top.

Did I miss anything?

I do agree that from the netfilter p.o.v. the ingress hook looks like
a good initial candidate for listification.


Re: [PATCHv2 net-next 1/2] route: add support for directed broadcast forwarding

2018-07-03 Thread Xin Long
On Mon, Jul 2, 2018 at 11:05 PM, David Ahern  wrote:
> On 7/2/18 12:30 AM, Xin Long wrote:
>> @@ -2143,6 +2149,10 @@ static int devinet_conf_proc(struct ctl_table *ctl, int write,
>>   if ((new_value == 0) && (old_value != 0))
>>   rt_cache_flush(net);
>>
>> + if (i == IPV4_DEVCONF_BC_FORWARDING - 1 ||
>> + new_value != old_value)
>> + rt_cache_flush(net);
>> +
>
> Shouldn't that be '&&' instead of '||' otherwise you are bumping the fib
> gen id any time the old and new values do not equal regardless of which
> setting is changed.
Ah right, it's a copy-paste mistake, sorry.


Re: [PATCHv2 net-next 2/2] selftests: add a selftest for directed broadcast forwarding

2018-07-03 Thread Xin Long
On Mon, Jul 2, 2018 at 11:12 PM, David Ahern  wrote:
> On 7/2/18 12:30 AM, Xin Long wrote:
>> +ping_ipv4()
>> +{
>> + sysctl_set net.ipv4.icmp_echo_ignore_broadcasts 0
>> + bc_forwarding_disable
>> + ping_test $h1 198.51.100.255
>> +
>> + iptables -A INPUT -i vrf-r1 -p icmp -j DROP
>> + bc_forwarding_restore
>> + bc_forwarding_enable
>> + ping_test $h1 198.51.100.255
>> +
>> + bc_forwarding_restore
>> + iptables -D INPUT -i vrf-r1 -p icmp -j DROP
>> + sysctl_restore net.ipv4.icmp_echo_ignore_broadcasts
>> +}
>
> Both tests fail for me:
> TEST: ping  [FAIL]
> TEST: ping  [FAIL]
I think 'ip vrf exec ...' is not working in your environment; the test
uses "ip vrf exec vrf-h1 ping ...".

You can test it by:
# ip link add dev vrf-test type vrf table 
# ip vrf exec vrf-test ls

>
> Why the need for the iptables rule?
This iptables rule blocks echo_request packets from reaching the
router's local_in.
If bc_forwarding is not actually forwarding and the packet goes to the
router's local_in instead, the test will fail.

Without this rule, the 2nd ping would always succeed, and we couldn't
tell whether the echo_reply came from the router or from h2.

Or you have a better way to test this?

>
> And, PAUSE_ON_FAIL is not working to take a look at why tests are
> failing. e.g.,
>
> PAUSE_ON_FAIL=yes ./router_broadcast.sh
>
> just continues on. Might be something with the infrastructure scripts.
Yes, ./router_broadcast.sh loads lib.sh, which in turn loads
forwarding.config, where "PAUSE_ON_FAIL=no" overrides your
"PAUSE_ON_FAIL=yes".


Re: [PATCHv2 net-next 0/5] sctp: fully support for dscp and flowlabel per transport

2018-07-03 Thread Neil Horman
On Mon, Jul 02, 2018 at 06:21:10PM +0800, Xin Long wrote:
> Now dscp and flowlabel are set from sock when sending the packets,
> but being multi-homing, sctp also supports for dscp and flowlabel
> per transport, which is described in section 8.1.12 in RFC6458.
> 
> v1->v2:
>   - define ip_queue_xmit as inline in net/ip.h, instead of exporting
> it in Patch 1/5 according to David's suggestion.
>   - fix the param len check in sctp_s/getsockopt_peer_addr_params()
> in Patch 3/5 to guarantee that an old app built with old kernel
> headers could work on the newer kernel per Marcelo's point.
> 
> Xin Long (5):
>   ipv4: add __ip_queue_xmit() that supports tos param
>   sctp: add support for dscp and flowlabel per transport
>   sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparams
>   sctp: add support for setting flowlabel when adding a transport
>   sctp: check for ipv6_pinfo legal sndflow with flowlabel in
> sctp_v6_get_dst
> 
>  include/linux/sctp.h   |   7 ++
>  include/net/ip.h   |   9 ++-
>  include/net/sctp/structs.h |   9 +++
>  include/uapi/linux/sctp.h  |   4 +
>  net/ipv4/ip_output.c   |   9 ++-
>  net/sctp/associola.c   |  15 
>  net/sctp/ipv6.c|  20 -
>  net/sctp/protocol.c|  16 +++-
>  net/sctp/socket.c  | 182 +++--
>  9 files changed, 254 insertions(+), 17 deletions(-)
> 
> -- 
> 2.1.0
> 
> 
For the Series
Acked-by: Neil Horman 



Re: [PATCH v4 net-next 7/9] net: ipv4: listified version of ip_rcv

2018-07-03 Thread Pablo Neira Ayuso
On Mon, Jul 02, 2018 at 04:14:12PM +0100, Edward Cree wrote:
> Also involved adding a way to run a netfilter hook over a list of packets.
>  Rather than attempting to make netfilter know about lists (which would be
>  a major project in itself) we just let it call the regular okfn (in this
>  case ip_rcv_finish()) for any packets it steals, and have it give us back
>  a list of packets it's synchronously accepted (which normally NF_HOOK
>  would automatically call okfn() on, but we want to be able to potentially
>  pass the list to a listified version of okfn().)
> The netfilter hooks themselves are indirect calls that still happen per-
>  packet (see nf_hook_entry_hookfn()), but again, changing that can be left
>  for future work.
> 
> There is potential for out-of-order receives if the netfilter hook ends up
>  synchronously stealing packets, as they will be processed before any
>  accepts earlier in the list.  However, it was already possible for an
>  asynchronous accept to cause out-of-order receives, so presumably this is
>  considered OK.

I think we can simplify things if these chained packets don't follow
the standard forwarding path, this would require to revisit many
subsystems to handle these new chained packets - potentially a lot of
work and likely breaking many things - and I would expect we (and
other subsystems too) will not get very much benefits from these
chained packets.

In general I like this infrastructure, but I think we can get
something simpler if we combine it with the flowtable idea, so chained
packets follow the non-standard flowtable forwarding path as described
in [1].

We could generalize and place the flowtable code in the core if
needed, and make it not netfilter dependent if that's a problem.

Thanks.

[1] https://marc.info/?l=netfilter-devel&m=152898601419841&w=2


Re: [PATCH net] r8169: fix mac address change

2018-07-03 Thread Corinna Vinschen
On Jul  2 22:49, Heiner Kallweit wrote:
> Network core refuses to change mac address because flag
> IFF_LIVE_ADDR_CHANGE isn't set. Set this missing flag.
> 
> Fixes: 1f7aa2bc268e ("r8169: simplify rtl_set_mac_address")
> Reported-by: Corinna Vinschen 
> Signed-off-by: Heiner Kallweit 
> ---
>  drivers/net/ethernet/realtek/r8169.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index f80ac894..a390db27 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -7607,6 +7607,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>   NETIF_F_HW_VLAN_CTAG_RX;
>   dev->vlan_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO |
>   NETIF_F_HIGHDMA;
> + dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
>  
>   tp->cp_cmd |= RxChkSum | RxVlan;
>  
> -- 
> 2.18.0

Tested-by: Corinna Vinschen 

This patch allows changing the MAC address at any time, just as before
1f7aa2bc268e, and thus also fixes the problem reported in
https://www.spinics.net/lists/netdev/msg510926.html


Thanks,
Corinna


[PATCHv2 net] sctp: fix the issue that pathmtu may be set lower than MINSEGMENT

2018-07-03 Thread Xin Long
After commit b6c5734db070 ("sctp: fix the handling of ICMP Frag Needed
for too small MTUs"), sctp_transport_update_pmtu would refetch pathmtu
from the dst and set it to transport's pathmtu without any check.

The new pathmtu may be lower than MINSEGMENT if the dst is obsolete and
updated by .get_dst() in sctp_transport_update_pmtu. In this case the
refetched dst could report a smaller MTU as well, and thus we should
validate it against MINSEGMENT instead.

Syzbot reported a warning in sctp_mtu_payload caused by this.

This patch refetches the pathmtu by calling sctp_dst_mtu where it does
the check against MINSEGMENT.

v1->v2:
  - refetch the pathmtu by calling sctp_dst_mtu instead as Marcelo's
suggestion.

Fixes: b6c5734db070 ("sctp: fix the handling of ICMP Frag Needed for too small MTUs")
Reported-by: syzbot+f0d9d7cba052f9344...@syzkaller.appspotmail.com
Suggested-by: Marcelo Ricardo Leitner 
Signed-off-by: Xin Long 
---
 net/sctp/transport.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index 445b7ef..12cac85 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -282,7 +282,7 @@ bool sctp_transport_update_pmtu(struct sctp_transport *t, u32 pmtu)
 
if (dst) {
/* Re-fetch, as under layers may have a higher minimum size */
-   pmtu = SCTP_TRUNC4(dst_mtu(dst));
+   pmtu = sctp_dst_mtu(dst);
change = t->pathmtu != pmtu;
}
t->pathmtu = pmtu;
-- 
2.1.0



Re: [PATCH v4 net-next 0/9] Handle multiple received packets at each stage

2018-07-03 Thread Paolo Abeni
On Mon, 2018-07-02 at 09:40 -0600, David Ahern wrote:
> On 7/2/18 9:11 AM, Edward Cree wrote:
> > This patch series adds the capability for the network stack to receive a
> >  list of packets and process them as a unit, rather than handling each
> >  packet singly in sequence.  This is done by factoring out the existing
> >  datapath code at each layer and wrapping it in list handling code.
> > 
> 
> ...
> 
> >  drivers/net/ethernet/sfc/efx.c|  12 +++
> >  drivers/net/ethernet/sfc/net_driver.h |   3 +
> >  drivers/net/ethernet/sfc/rx.c |   7 +-
> >  include/linux/list.h  |  30 ++
> >  include/linux/netdevice.h |   4 +
> >  include/linux/netfilter.h |  22 +
> >  include/net/ip.h  |   2 +
> >  include/trace/events/net.h|   7 ++
> >  net/core/dev.c| 174 --
> >  net/ipv4/af_inet.c|   1 +
> >  net/ipv4/ip_input.c   | 114 --
> >  11 files changed, 360 insertions(+), 16 deletions(-)
> > 
> 
> Nice work. Have you looked at IPv6 support yet?

I think this work opens opportunities for a lot of follow-ups, if there
is agreement on extending this approach to other areas. Another item
I'd like to investigate is TC processing.

Cheers,

Paolo