Re: [PATCH] cxgb4: fix signed wrap around when decrementing index idx

2016-09-21 Thread David Miller
From: Colin King 
Date: Tue, 20 Sep 2016 15:48:45 +0100

> From: Colin Ian King 
> 
> Change the pre-decrement compare to a post-decrement compare to avoid
> an unsigned integer wrap-around comparison when decrementing idx in
> the while loop.
> 
> For example, when idx is zero, the current code pre-decrements idx in
> the while loop, wrapping idx around to a huge value and causing out of
> bounds reads on rxq_info->msix_tbl[idx].
> 
> Signed-off-by: Colin Ian King 

This doesn't apply cleanly to the 'net' tree, please respin.
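
For reference, a minimal standalone illustration of the wrap-around
described above (not the driver code):

	#include <stdio.h>

	int main(void)
	{
		unsigned int idx = 0;

		/* Pre-decrement: 0 wraps to UINT_MAX before the compare,
		 * so the test passes and a table lookup with idx here
		 * would read far out of bounds. */
		if (--idx > 0)
			printf("pre-decrement entered, idx=%u\n", idx);

		idx = 0;
		/* Post-decrement: the old value 0 is compared first, so
		 * the body is never entered (idx still wraps afterwards,
		 * but is not used as an index). */
		if (idx-- > 0)
			printf("post-decrement entered\n");

		return 0;
	}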


Re: [PATCH net] net/mlx4_core: Fix to clean devlink resources

2016-09-21 Thread David Miller
From: Tariq Toukan 
Date: Tue, 20 Sep 2016 14:55:31 +0300

> From: Kamal Heib 
> 
> This patch cleans devlink resources by calling devlink_port_unregister()
> to avoid the following issues:
> 
> - Kernel panic when triggering reset flow.
> - Memory leak due to unfreed resources in mlx4_init_port_info().
> 
> Fixes: 09d4d087cd48 ("mlx4: Implement devlink interface")
> Signed-off-by: Kamal Heib 
> Signed-off-by: Tariq Toukan 
> ---
> Please push it to -stable  >= 4.6 as well. Thanks.

Applied and queued up for -stable, thanks.
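
The shape of the fix, as a hedged sketch (the function body below is
illustrative, not quoted from the patch; devlink_port_unregister() is
the real devlink API):

	static void mlx4_cleanup_port_info(struct mlx4_port_info *info)
	{
		/* Undo the devlink_port_register() performed in
		 * mlx4_init_port_info(); without this, a reset flow
		 * re-registers the port (panic) and the resources from
		 * mlx4_init_port_info() are never freed (leak). */
		devlink_port_unregister(&info->devlink_port);

		/* ... existing attribute/sysfs teardown follows ... */
	}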


Re: [PATCH net-next v3 0/5] cxgb4: add support for offloading TC u32 filters

2016-09-21 Thread David Miller
From: Rahul Lakkireddy 
Date: Tue, 20 Sep 2016 17:13:05 +0530

> This series of patches adds support for offloading TC u32 filters onto
> Chelsio NICs.
> 
> Patch 1 moves current common filter code to separate files
> in order to provide a common api for performing packet classification
> and filtering in Chelsio NICs.
> 
> Patch 2 enables filters for normal NIC configuration and implements
> common api for setting and deleting filters.
> 
> Patches 3-5 add support for TC u32 offload via ndo_setup_tc.

Series applied.
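
As a rough sketch of how such an offload is typically wired up via
ndo_setup_tc in this kernel generation (the handler names below are
illustrative; the dispatch shape follows existing users like ixgbe):

	static int cxgb_setup_tc(struct net_device *dev, u32 handle,
				 __be16 proto, struct tc_to_netdev *tc)
	{
		if (tc->type != TC_SETUP_CLSU32)
			return -EOPNOTSUPP;

		switch (tc->cls_u32->command) {
		case TC_CLSU32_NEW_KNODE:
		case TC_CLSU32_REPLACE_KNODE:
			return cxgb4_config_knode(dev, proto, tc->cls_u32);
		case TC_CLSU32_DELETE_KNODE:
			return cxgb4_delete_knode(dev, proto, tc->cls_u32);
		default:
			return -EOPNOTSUPP;
		}
	}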


Re: [RFC PATCH v3 2/7] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-09-21 Thread hejianet



On 9/22/16 2:24 AM, Marcelo wrote:

On Thu, Sep 22, 2016 at 12:18:46AM +0800, hejianet wrote:

Hi Marcelo

sorry for the late reply, just came back from a vacation.

Hi, no problem. Hope your batteries are recharged now :-)


On 9/14/16 7:55 PM, Marcelo wrote:

Hi Jia,

On Wed, Sep 14, 2016 at 01:58:42PM +0800, hejianet wrote:

Hi Marcelo


On 9/13/16 2:57 AM, Marcelo wrote:

On Fri, Sep 09, 2016 at 02:33:57PM +0800, Jia He wrote:

This uses the generic interface snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu
sequentially. snmp_seq_show and netstat_seq_show are then each split
into two parts to avoid the "frame size larger than 1024" build warning
on s390.

Yeah about that, did you test it with stack overflow detection?
These arrays can be quite large.

One more below..

Do you think it is acceptable if the stack usage is a little larger than 1024?
e.g. 1120
I can't find any other way to reduce the stack usage except using "static"
before unsigned long buff[TCP_MIB_MAX].

PS. sizeof(buff) is about TCP_MIB_MAX(116)*8 = 928 bytes.
B.R.

That's pretty much the question. Linux has the option on some archs to
run with 4Kb stacks (4KSTACKS option), so this function alone would be
using ~25% of it in this last case, while on x86_64 the stack is 16Kb
(6538b8ea886e ("x86_64: expand kernel stack to 16K")).

Adding static to it is not an option as it actually makes the variable
shared amongst the CPUs (and then you have concurrency issues), plus the
fact that it's always allocated, even while not in use.

Others here certainly know better than me if it's okay to make such
usage of the stack.

What about this patch instead?
It is a trade-off: I split the aggregation process into two parts. It
will increase the cache misses a little bit, but it reduces the stack
usage. After this, stack usage is 672 bytes:
objdump -d vmlinux | ./scripts/checkstack.pl ppc64 | grep seq_show
0xc07f7cc0 netstat_seq_show_tcpext.isra.3 [vmlinux]:672

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index c6ee8a2..cc41590 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -486,22 +486,37 @@ static const struct file_operations snmp_seq_fops = {
   */
  static int netstat_seq_show_tcpext(struct seq_file *seq, void *v)
  {
-   int i;
-   unsigned long buff[LINUX_MIB_MAX];
+   int i, c;
+   unsigned long buff[LINUX_MIB_MAX/2 + 1];
 struct net *net = seq->private;

-   memset(buff, 0, sizeof(unsigned long) * LINUX_MIB_MAX);
+   memset(buff, 0, sizeof(unsigned long) * (LINUX_MIB_MAX/2 + 1));

 seq_puts(seq, "TcpExt:");
 for (i = 0; snmp4_net_list[i].name; i++)
 seq_printf(seq, " %s", snmp4_net_list[i].name);

 seq_puts(seq, "\nTcpExt:");
-	snmp_get_cpu_field_batch(buff, snmp4_net_list,
-				 net->mib.net_statistics);
-	for (i = 0; snmp4_net_list[i].name; i++)
+	for_each_possible_cpu(c) {
+		for (i = 0; i < LINUX_MIB_MAX/2; i++)
+			buff[i] += snmp_get_cpu_field(
+					net->mib.net_statistics,
+					c, snmp4_net_list[i].entry);
+	}
+	for (i = 0; i < LINUX_MIB_MAX/2; i++)
 		seq_printf(seq, " %lu", buff[i]);
 
+	memset(buff, 0, sizeof(unsigned long) * (LINUX_MIB_MAX/2 + 1));
+	for_each_possible_cpu(c) {
+		for (i = LINUX_MIB_MAX/2; snmp4_net_list[i].name; i++)
+			buff[i - LINUX_MIB_MAX/2] += snmp_get_cpu_field(
+					net->mib.net_statistics,
+					c, snmp4_net_list[i].entry);
+	}
+	for (i = LINUX_MIB_MAX/2; snmp4_net_list[i].name; i++)
+		seq_printf(seq, " %lu", buff[i - LINUX_MIB_MAX/2]);
+
 return 0;
  }

Yep, it halves the stack usage, but it doesn't look good, heh.

But well, you may try to post the patchset (with or without this last
change, you pick) officially and see how it goes. Posted as an RFC,
it's not being evaluated as seriously.

Thanks for the suggestion, I will remove it in a future patch version.


FWIW, I tested your patches, using your test and /proc/net/snmp file on
a x86_64 box, Intel(R) Xeon(R) CPU E5-2643 v3.

Before the patches:

 Performance counter stats for './test /proc/net/snmp':

          5.225  cache-misses
 12.708.673.785  L1-dcache-loads
  1.288.450.174  L1-dcache-load-misses     #  10,14% of all L1-dcache hits
  1.271.857.028  LLC-loads
          4.122  LLC-load-misses           #   0,00% of all LL-cache hits

    9,174936524 seconds time elapsed

After:

 Performance counter stats for './test /proc/net/snmp':

          2.865  cache-misses
 30.203.883.807  L1-dcache-loads
  1.215.774.643  L1-dcache-load-misses     #   4,03% of all L1-dcache hits
  1.181.662.831  LLC-loads
          2.685  LLC-load-misses           #   0,00%
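
For readers following along, the loop-order change being measured here is
essentially the following (a sketch using the snmp_get_cpu_field() helper
from the thread): iterating CPUs in the outer loop walks each CPU's
per-cpu MIB block once, sequentially, instead of touching every CPU once
per item.

	static void snmp_fold_batch(unsigned long *buff,
				    const struct snmp_mib *itemlist,
				    void __percpu *mib)
	{
		int i, c;

		for_each_possible_cpu(c)
			for (i = 0; itemlist[i].name; i++)
				buff[i] += snmp_get_cpu_field(mib, c,
							itemlist[i].entry);
	}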

Re: [PATCH v3 net-next 1/2] net: skbuff: Remove erroneous length validation in skb_vlan_pop()

2016-09-21 Thread David Miller
From: Shmulik Ladkani 
Date: Tue, 20 Sep 2016 12:48:36 +0300

> In 93515d53b1
>   "net: move vlan pop/push functions into common code"
> skb_vlan_pop was moved from its private location in openvswitch to
> skbuff common code.
> 
> In case skb has non hw-accel vlan tag, the original 'pop_vlan()' assured
> that skb->len is sufficient (if skb->len < VLAN_ETH_HLEN then pop was
> considered a no-op).
> 
> This validation was moved as is into the new common 'skb_vlan_pop'.
> 
> Alas, in its original location (openvswitch), there was a guarantee that
> 'data' points to the mac_header, therefore the 'skb->len < VLAN_ETH_HLEN'
> condition made sense.
> However there's no such guarantee in the generic 'skb_vlan_pop'.
> 
> For short packets received in the rx path going through 'skb_vlan_pop',
> this causes 'skb_vlan_pop' to fail popping a valid vlan hdr (in the non
> hw-accel case) or to fail moving the next tag into the hw-accel tag.
> 
> Remove the 'skb->len < VLAN_ETH_HLEN' condition entirely:
> It is superfluous since inner '__skb_vlan_pop' already verifies there
> are VLAN_ETH_HLEN writable bytes at the mac_header.
> 
> Note this presents a slight change to skb_vlan_pop() users:
> In case total length is smaller than VLAN_ETH_HLEN, skb_vlan_pop() now
> returns an error, as opposed to previous "no-op" behavior.
> Existing callers (e.g. tc act vlan, ovs) usually drop the packet if
> 'skb_vlan_pop' fails.
> 
> Fixes: 93515d53b1 ("net: move vlan pop/push functions into common code")
> Signed-off-by: Shmulik Ladkani 

Applied.


Re: [PATCH v3 net-next 2/2] net: skbuff: Coding: Use eth_type_vlan() instead of open coding it

2016-09-21 Thread David Miller
From: Shmulik Ladkani 
Date: Tue, 20 Sep 2016 12:48:37 +0300

> Fix 'skb_vlan_pop' to use eth_type_vlan instead of directly comparing
> skb->protocol to ETH_P_8021Q or ETH_P_8021AD.
> 
> Signed-off-by: Shmulik Ladkani 

Applied.
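
For reference, the helper being adopted reads roughly as follows (a
sketch matching the description above, not a verbatim quote of
include/linux/if_vlan.h):

	static inline bool eth_type_vlan(__be16 ethertype)
	{
		switch (ethertype) {
		case htons(ETH_P_8021Q):
		case htons(ETH_P_8021AD):
			return true;
		default:
			return false;
		}
	}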


Re: [PATCH v2 0/2] act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action

2016-09-21 Thread David Miller
From: Shmulik Ladkani 
Date: Mon, 19 Sep 2016 19:11:08 +0300

> TCA_VLAN_ACT_MODIFY allows one to change an existing tag.
> 
> It accepts same attributes as TCA_VLAN_ACT_PUSH (protocol, id,
> priority).
> If packet is vlan tagged, then the tag gets overwritten according to
> user specified attributes.
> 
> For example, this allows user to replace a tag's vid while preserving
> its priority bits (as opposed to "action vlan pop pipe action vlan push").

Series applied.
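
The priority-preserving rewrite described above amounts to something like
this hedged sketch (new_proto/new_vid are stand-ins, not the patch's
variables):

	if (skb_vlan_tag_present(skb)) {
		u16 prio = (skb->vlan_tci & VLAN_PRIO_MASK) >>
			   VLAN_PRIO_SHIFT;

		/* Overwrite the hw-accelerated tag in place, keeping
		 * the existing priority bits. */
		__vlan_hwaccel_put_tag(skb, new_proto,
				       new_vid | prio << VLAN_PRIO_SHIFT);
	}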


Re: [PATCH] net: explicitly whitelist sysctls for unpriv namespaces

2016-09-21 Thread David Miller
From: Jann Horn 
Date: Sun, 18 Sep 2016 22:58:20 +0200

> There were two net sysctls that could be written from unprivileged net
> namespaces, but weren't actually namespaced.
> 
> To fix the existing issues and prevent this from happening again in
> the future, explicitly whitelist permitted sysctls.
> 
> Note: The current whitelist is "allow everything that was previously
> accessible and that doesn't obviously modify global state".
> 
> On my system, this patch just removes the write permissions for
> ipv4/netfilter/ip_conntrack_max, which would have been usable for a local
> DoS. With a different config, the ipv4/vs/debug_level sysctl would also be
> affected.
> 
> Maximum impact of this seems to be local DoS, and it's a fairly large
> commit, so I'm sending this publicly directly.
> 
> An alternative (and much smaller) fix would be to just change the
> permissions of the two files in question to be 0444 in non-privileged
> namespaces, but I believe that this solution is slightly less error-prone.
> If you think I should switch to the simple fix, let me know.
> 
> Signed-off-by: Jann Horn 

And actually this patch doesn't apply cleanly to net-next, please
respin.

Thanks.


Re: [PATCH] net: explicitly whitelist sysctls for unpriv namespaces

2016-09-21 Thread David Miller
From: Jann Horn 
Date: Sun, 18 Sep 2016 22:58:20 +0200

> There were two net sysctls that could be written from unprivileged net
> namespaces, but weren't actually namespaced.
> 
> To fix the existing issues and prevent this from happening again in
> the future, explicitly whitelist permitted sysctls.
> 
> Note: The current whitelist is "allow everything that was previously
> accessible and that doesn't obviously modify global state".
> 
> On my system, this patch just removes the write permissions for
> ipv4/netfilter/ip_conntrack_max, which would have been usable for a local
> DoS. With a different config, the ipv4/vs/debug_level sysctl would also be
> affected.
> 
> Maximum impact of this seems to be local DoS, and it's a fairly large
> commit, so I'm sending this publicly directly.
> 
> An alternative (and much smaller) fix would be to just change the
> permissions of the two files in question to be 0444 in non-privileged
> namespaces, but I believe that this solution is slightly less error-prone.
> If you think I should switch to the simple fix, let me know.
> 
> Signed-off-by: Jann Horn 

I think this is fine for net-next and will apply it there.

But for 'net' and 'stable', please also submit the simpler fix.

Thanks.
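
The "simpler fix" being discussed would look roughly like the existing
permission hook in net/sysctl_net.c (a sketch under that assumption, not
Jann's patch):

	/* Drop write bits when the netns owner lacks CAP_NET_ADMIN,
	 * making the file effectively 0444 for unprivileged users. */
	static int net_ctl_permissions_sketch(struct ctl_table_header *head,
					      struct ctl_table *table)
	{
		struct net *net = container_of(head->set, struct net,
					       sysctls);

		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
			return table->mode & ~0222;
		return table->mode;
	}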


Re: [PATCH net-next 9/9] rxrpc: Reduce the number of ACK-Requests sent

2016-09-21 Thread kbuild test robot
Hi David,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/David-Howells/rxrpc-Preparation-for-slow-start-algorithm/20160922-085242
config: arm-omap2plus_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm

All errors (new ones prefixed by >>):

>> ERROR: "__aeabi_ldivmod" [net/rxrpc/af-rxrpc.ko] undefined!
   ERROR: "__aeabi_uldivmod" [net/rxrpc/af-rxrpc.ko] undefined!

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation
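
These __aeabi_*divmod link errors are the classic sign of a plain '/' or
'%' on a 64-bit integer in code built for 32-bit ARM: gcc emits a call
into libgcc, which the kernel does not link. The usual fix is to route
the division through the div64 helpers, e.g. (a sketch of the technique,
not the eventual rxrpc patch):

	#include <linux/math64.h>

	static u64 rtt_avg_sketch(u64 sum_ns, u32 count)
	{
		/* div_u64() compiles to a libgcc-free 64/32 division,
		 * avoiding __aeabi_uldivmod on ARM32. */
		return div_u64(sum_ns, count);
	}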




Re: [PATCH nf v3] netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack

2016-09-21 Thread Eric Dumazet
On Thu, 2016-09-22 at 10:22 +0800, f...@ikuai8.com wrote:
> From: Gao Feng 
> 
> It is valid for a TCP RST packet not to set the ack flag, with the
> bytes of the ack number being zero. But the current seqadj code would
> adjust that "0" ack to an invalid ack number. seqadj needs to check
> the ack flag before adjusting it for these RST packets.
> 
> The following is my test case
> 
> The client is 10.26.98.245; add one iptables rule:
> iptables -I INPUT -p tcp --sport 12345 -m connbytes --connbytes 2:
> --connbytes-dir reply --connbytes-mode packets -j REJECT --reject-with
> tcp-reset
> This iptables rule generates a TCP RST without the ack flag.
> 
> server: 10.172.135.55
> Enable synproxy with seqadj using the following iptables rules:
> iptables -t raw -A PREROUTING -i eth0 -p tcp -d 10.172.135.55 --dport 12345
> -m tcp --syn -j CT --notrack
> 
> iptables -A INPUT -i eth0 -p tcp -d 10.172.135.55 --dport 12345 -m conntrack
> --ctstate INVALID,UNTRACKED -j SYNPROXY --sack-perm --timestamp --wscale 7
> --mss 1460
> iptables -A OUTPUT -o eth0 -p tcp -s 10.172.135.55 --sport 12345 -m conntrack
> --ctstate INVALID,UNTRACKED -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j ACCEPT
> 
> The following is my test result.
> 
> 1. packet trace on client
> root@routers:/tmp# tcpdump -i eth0 tcp port 12345 -n
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
> IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [S], seq 3695959829,
> win 29200, options [mss 1460,sackOK,TS val 452367884 ecr 0,nop,wscale 7],
> length 0
> IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [S.], seq 546723266,
> ack 3695959830, win 0, options [mss 1460,sackOK,TS val 15643479 ecr 452367884,
> nop,wscale 7], length 0
> IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [.], ack 1, win 229,
> options [nop,nop,TS val 452367885 ecr 15643479], length 0
> IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [.], ack 1, win 226,
> options [nop,nop,TS val 15643479 ecr 452367885], length 0
> IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [R], seq 3695959830,
> win 0, length 0
> 
> 2. seqadj log on server
> [62873.867319] Adjusting sequence number from 602341895->546723267,
> ack from 3695959830->3695959830
> [62873.867644] Adjusting sequence number from 602341895->546723267,
> ack from 3695959830->3695959830
> [62873.869040] Adjusting sequence number from 3695959830->3695959830,
> ack from 0->55618628
> 
> To summarize, it is clear that the seqadj code adjusts the 0 ack when
> receiving a TCP RST packet without the ack flag.
> 
> Signed-off-by: Gao Feng 
> ---
>  v3: Add the reproduce steps and packet trace
>  v2: Regenerate because the first patch is removed
>  v1: Initial patch
> 
>  net/netfilter/nf_conntrack_seqadj.c | 34 +++---
>  1 file changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_seqadj.c 
> b/net/netfilter/nf_conntrack_seqadj.c
> index dff0f0c..3bd9c7e 100644
> --- a/net/netfilter/nf_conntrack_seqadj.c
> +++ b/net/netfilter/nf_conntrack_seqadj.c
> @@ -179,30 +179,34 @@ int nf_ct_seq_adjust(struct sk_buff *skb,
>  
>   tcph = (void *)skb->data + protoff;
>  	spin_lock_bh(&ct->lock);
> +

Please do not add style change during a bug fix.

>   if (after(ntohl(tcph->seq), this_way->correction_pos))
>   seqoff = this_way->offset_after;
>   else
>   seqoff = this_way->offset_before;
>  
> - if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
> -   other_way->correction_pos))
> - ackoff = other_way->offset_after;
> - else
> - ackoff = other_way->offset_before;
> -
>   newseq = htonl(ntohl(tcph->seq) + seqoff);
> - newack = htonl(ntohl(tcph->ack_seq) - ackoff);
> -
>  	inet_proto_csum_replace4(&tcph->check, skb, tcph->seq, newseq, false);
> -	inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq, newack,
> -				 false);
> -
> - pr_debug("Adjusting sequence number from %u->%u, ack from %u->%u\n",
> -  ntohl(tcph->seq), ntohl(newseq), ntohl(tcph->ack_seq),
> -  ntohl(newack));
>  
> + pr_debug("Adjusting sequence number from %u->%u\n",
> +  ntohl(tcph->seq), ntohl(newseq));
>   tcph->seq = newseq;
> - tcph->ack_seq = newack;
> +
> + if (likely(tcph->ack)) {
> + if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
> +   other_way->correction_pos))
> + ackoff = other_way->offset_after;
> + else
> + ackoff = other_way->offset_before;
> +
> + newack = htonl(ntohl(tcph->ack_seq) - ackoff);
> +	inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq,
> +				 newack, false);
> +
> + pr_debug("Adjusting ack number from %u->%u\n",
> +  

Re: [patch net-next 2/6] fib: introduce FIB info offload flag helpers

2016-09-21 Thread Ido Schimmel
On Wed, Sep 21, 2016 at 01:53:10PM +0200, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> These helpers are to be used in case someone offloads the FIB entry. The
> result is that if the entry is offloaded to at least one device, the
> offload flag is set.
> 
> Signed-off-by: Jiri Pirko 

Reviewed-by: Ido Schimmel 

Thanks


Re: [patch net-next 1/6] fib: introduce FIB notification infrastructure

2016-09-21 Thread Ido Schimmel
On Wed, Sep 21, 2016 at 01:53:09PM +0200, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> This allows to pass information about added/deleted FIB entries/rules to
> whoever is interested. This is done in a very similar way as devinet
> notifies address additions/removals.
> 
> Signed-off-by: Jiri Pirko 

[...]

>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -84,6 +85,44 @@
>  #include 
>  #include "fib_lookup.h"
>  
> +static BLOCKING_NOTIFIER_HEAD(fib_chain);
> +
> +int register_fib_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_register(&fib_chain, nb);
> +}
> +EXPORT_SYMBOL(register_fib_notifier);

If we remove and insert the switch driver, then the existing FIB entries
should be replayed when we register our notification block. Otherwise,
all of these entries will be missing from the switch's tables. I believe
it should be handled like register_netdevice_notifier(), where
"registration and up events are replayed".


Re: [PATCH net-next 3/9] rxrpc: Add per-peer RTT tracker

2016-09-21 Thread kbuild test robot
Hi David,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/David-Howells/rxrpc-Preparation-for-slow-start-algorithm/20160922-085242
config: i386-randconfig-h0-09220655 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   net/built-in.o: In function `rxrpc_peer_add_rtt':
>> (.text+0x239e99): undefined reference to `__udivdi3'

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH net-next 3/9] rxrpc: Add per-peer RTT tracker

2016-09-21 Thread kbuild test robot
Hi David,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/David-Howells/rxrpc-Preparation-for-slow-start-algorithm/20160922-085242
config: arm-omap2plus_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm

All errors (new ones prefixed by >>):

>> ERROR: "__aeabi_uldivmod" [net/rxrpc/af-rxrpc.ko] undefined!

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




[PATCH] net: VRF: Fix receiving multicast traffic

2016-09-21 Thread Mark Tomlinson
The previous patch to ensure that the original iif was used when
checking for forwarding also meant that this same interface was used to
determine whether multicast packets should be received or not. This was
incorrect, and would cause multicast packets to be dropped.

The fix here is to use skb->dev when checking multicast addresses.
skb->dev has been set to the l3mdev by this point, so the check will be
against that, rather than the ingress interface.

Fixes: "net:VRF: Pass original iif to ip_route_input()"
Signed-off-by: Mark Tomlinson 
---
 net/ipv4/route.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a1f2830..75e1de6 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1971,7 +1971,7 @@ int ip_route_input_noref(struct sk_buff *skb, __be32 daddr, __be32 saddr,
   route cache entry is created eventually.
 */
if (ipv4_is_multicast(daddr)) {
-   struct in_device *in_dev = __in_dev_get_rcu(dev);
+   struct in_device *in_dev = __in_dev_get_rcu(skb->dev);
 
if (in_dev) {
int our = ip_check_mc_rcu(in_dev, daddr, saddr,
-- 
2.9.3



Re: [PATCH net-next v2 0/3] add support for RGMII on GMAC0 through TRGMII hardware module

2016-09-21 Thread Florian Fainelli
On 21/09/2016 19:33, sean.w...@mediatek.com wrote:
> From: Sean Wang 
> 
> By default, GMAC0 is connected to the built-in switch, called
> MT7530, through a proprietary interface called Turbo RGMII
> (TRGMII). TRGMII also works well for plain RGMII as used by generic
> external PHYs, but that requires some slight changes to the TRGMII
> setup and is not well supported by the current driver.
> 
> So this patchset
> 1) provides the slight setup changes needed so that RGMII can work
>    through TRGMII
> 2) adds the additional phy-mode setting "trgmii", as
>    PHY_INTERFACE_MODE_TRGMII, to the device tree so that GMAC0 can
>    distinguish which mode it runs in
> 3) dynamically changes the source clock, TX/RX delay and interface
>    mode on TRGMII to adapt to the various link speeds
> 
> Changes since v1:
> - fixed the style of comments which didn't have a space at
>    the beginning and end of the comment lines
> - added support for phy-mode "trgmii" as PHY_INTERFACE_MODE_TRGMII
>    in linux/phy.h
> - enhanced the Documentation about the device tree binding for
>    trgmii, which is applicable only to GMAC0, which uses fixed-link

Looks good to me:

Reviewed-by: Florian Fainelli 

Thanks Sean!
-- 
Florian


[PATCH net-next v2 3/3] net: ethernet: mediatek: add the dts property to set if TRGMII supported on GMAC0

2016-09-21 Thread sean.wang
From: Sean Wang 

Add the dts property for the capability of TRGMII support on GMAC0.

Signed-off-by: Sean Wang 
---
 Documentation/devicetree/bindings/net/mediatek-net.txt | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/net/mediatek-net.txt 
b/Documentation/devicetree/bindings/net/mediatek-net.txt
index 6103e55..7111278 100644
--- a/Documentation/devicetree/bindings/net/mediatek-net.txt
+++ b/Documentation/devicetree/bindings/net/mediatek-net.txt
@@ -31,7 +31,10 @@ Optional properties:
 Required properties:
 - compatible: Should be "mediatek,eth-mac"
 - reg: The number of the MAC
-- phy-handle: see ethernet.txt file in the same directory.
+- phy-handle: see ethernet.txt file in the same directory. The
+	phy-mode "trgmii" is required when reg is equal to 0 and the
+	MAC uses a fixed-link to connect with an internal switch such
+	as the MT7530.
 
 Example:
 
-- 
1.9.1



[PATCH net-next v2 1/3] net: ethernet: mediatek: add extension of phy-mode for TRGMII

2016-09-21 Thread sean.wang
From: Sean Wang 

Add the PHY-mode "trgmii" as an extension of the operation modes of
the PHY interface, as PHY_INTERFACE_MODE_TRGMII, and add a variable
trgmii inside mtk_mac indicating whether the MAC is connected to the
internal switch or to an external PHY, according to the configuration
given for the board, so that the corresponding setup can be performed
on the TRGMII hardware module.

Signed-off-by: Sean Wang 
Cc: Florian Fainelli 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 2 ++
 drivers/net/ethernet/mediatek/mtk_eth_soc.h | 3 +++
 include/linux/phy.h | 3 +++
 3 files changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index ca6b501..827f4bd 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -244,6 +244,8 @@ static int mtk_phy_connect(struct mtk_mac *mac)
return -ENODEV;
 
switch (of_get_phy_mode(np)) {
+   case PHY_INTERFACE_MODE_TRGMII:
+   mac->trgmii = true;
case PHY_INTERFACE_MODE_RGMII_TXID:
case PHY_INTERFACE_MODE_RGMII_RXID:
case PHY_INTERFACE_MODE_RGMII_ID:
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index 7c5e534..e3b9525 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -529,6 +529,8 @@ struct mtk_eth {
  * @hw:Backpointer to our main datastruture
  * @hw_stats:  Packet statistics counter
  * @phy_dev:   The attached PHY if available
+ * @trgmii:	Indicate if the MAC uses TRGMII connected to internal
+ *		switch
  */
 struct mtk_mac {
int id;
@@ -539,6 +541,7 @@ struct mtk_mac {
struct phy_device   *phy_dev;
__be32  hwlro_ip[MTK_MAX_LRO_IP_CNT];
int hwlro_ip_cnt;
+   booltrgmii;
 };
 
 /* the struct describing the SoC. these are declared in the soc_xyz.c files */
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 2d24b28..e25f183 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -80,6 +80,7 @@ typedef enum {
PHY_INTERFACE_MODE_XGMII,
PHY_INTERFACE_MODE_MOCA,
PHY_INTERFACE_MODE_QSGMII,
+   PHY_INTERFACE_MODE_TRGMII,
PHY_INTERFACE_MODE_MAX,
 } phy_interface_t;
 
@@ -123,6 +124,8 @@ static inline const char *phy_modes(phy_interface_t interface)
return "moca";
case PHY_INTERFACE_MODE_QSGMII:
return "qsgmii";
+   case PHY_INTERFACE_MODE_TRGMII:
+   return "trgmii";
default:
return "unknown";
}
-- 
1.9.1



[PATCH net-next v2 2/3] net: ethernet: mediatek: add support for GMAC0 connecting with external PHY through TRGMII

2016-09-21 Thread sean.wang
From: Sean Wang 

Dynamically change the source clock, TX/RX delay and interface mode
used by the TRGMII hardware module inside the PHY capability polling
routine, adapting to the various RGMII speeds used by the external
PHY on GMAC0.

Signed-off-by: Sean Wang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 32 -
 drivers/net/ethernet/mediatek/mtk_eth_soc.h | 31 +++-
 2 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 827f4bd..73c7904 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -52,7 +52,7 @@ static const struct mtk_ethtool_stats {
 };
 
 static const char * const mtk_clks_source_name[] = {
-   "ethif", "esw", "gp1", "gp2"
+   "ethif", "esw", "gp1", "gp2", "trgpll"
 };
 
 void mtk_w32(struct mtk_eth *eth, u32 val, unsigned reg)
@@ -135,6 +135,33 @@ static int mtk_mdio_read(struct mii_bus *bus, int phy_addr, int phy_reg)
return _mtk_mdio_read(eth, phy_addr, phy_reg);
 }
 
+static void mtk_gmac0_rgmii_adjust(struct mtk_eth *eth, int speed)
+{
+   u32 val;
+   int ret;
+
+   val = (speed == SPEED_1000) ?
+   INTF_MODE_RGMII_1000 : INTF_MODE_RGMII_10_100;
+   mtk_w32(eth, val, INTF_MODE);
+
+   regmap_update_bits(eth->ethsys, ETHSYS_CLKCFG0,
+  ETHSYS_TRGMII_CLK_SEL362_5,
+  ETHSYS_TRGMII_CLK_SEL362_5);
+
+   val = (speed == SPEED_1000) ? 25000 : 5;
+   ret = clk_set_rate(eth->clks[MTK_CLK_TRGPLL], val);
+   if (ret)
+   dev_err(eth->dev, "Failed to set trgmii pll: %d\n", ret);
+
+   val = (speed == SPEED_1000) ?
+   RCK_CTRL_RGMII_1000 : RCK_CTRL_RGMII_10_100;
+   mtk_w32(eth, val, TRGMII_RCK_CTRL);
+
+   val = (speed == SPEED_1000) ?
+   TCK_CTRL_RGMII_1000 : TCK_CTRL_RGMII_10_100;
+   mtk_w32(eth, val, TRGMII_TCK_CTRL);
+}
+
 static void mtk_phy_link_adjust(struct net_device *dev)
 {
struct mtk_mac *mac = netdev_priv(dev);
@@ -157,6 +184,9 @@ static void mtk_phy_link_adjust(struct net_device *dev)
break;
};
 
+   if (mac->id == 0 && !mac->trgmii)
+   mtk_gmac0_rgmii_adjust(mac->hw, mac->phy_dev->speed);
+
if (mac->phy_dev->link)
mcr |= MAC_MCR_FORCE_LINK;
 
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
index e3b9525..e521156 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h
@@ -313,6 +313,30 @@
 MAC_MCR_FORCE_TX_FC | MAC_MCR_SPEED_1000 | \
 MAC_MCR_FORCE_DPX | MAC_MCR_FORCE_LINK)
 
+/* TRGMII RXC control register */
+#define TRGMII_RCK_CTRL0x10300
+#define DQSI0(x)   ((x << 0) & GENMASK(6, 0))
+#define DQSI1(x)   ((x << 8) & GENMASK(14, 8))
+#define RXCTL_DMWTLAT(x)   ((x << 16) & GENMASK(18, 16))
+#define RXC_DQSISELBIT(30)
+#define RCK_CTRL_RGMII_1000(RXC_DQSISEL | RXCTL_DMWTLAT(2) | DQSI1(16))
+#define RCK_CTRL_RGMII_10_100  RXCTL_DMWTLAT(2)
+
+/* TRGMII TXC control register */
+#define TRGMII_TCK_CTRL0x10340
+#define TXCTL_DMWTLAT(x)   ((x << 16) & GENMASK(18, 16))
+#define TXC_INVBIT(30)
+#define TCK_CTRL_RGMII_1000TXCTL_DMWTLAT(2)
+#define TCK_CTRL_RGMII_10_100  (TXC_INV | TXCTL_DMWTLAT(2))
+
+/* TRGMII Interface mode register */
+#define INTF_MODE  0x10390
+#define TRGMII_INTF_DISBIT(0)
+#define TRGMII_MODEBIT(1)
+#define TRGMII_CENTRAL_ALIGNED BIT(2)
+#define INTF_MODE_RGMII_1000(TRGMII_MODE | TRGMII_CENTRAL_ALIGNED)
+#define INTF_MODE_RGMII_10_100  0
+
 /* GPIO port control registers for GMAC 2*/
 #define GPIO_OD33_CTRL80x4c0
 #define GPIO_BIAS_CTRL 0xed0
@@ -323,7 +347,11 @@
 #define SYSCFG0_GE_MASK0x3
 #define SYSCFG0_GE_MODE(x, y)  (x << (12 + (y * 2)))
 
-/*ethernet reset control register*/
+/* ethernet subsystem clock register */
+#define ETHSYS_CLKCFG0 0x2c
+#define ETHSYS_TRGMII_CLK_SEL362_5 BIT(11)
+
+/* ethernet reset control register */
 #define ETHSYS_RSTCTRL 0x34
 #define RSTCTRL_FE BIT(6)
 #define RSTCTRL_PPEBIT(31)
@@ -389,6 +417,7 @@ enum mtk_clks_map {
MTK_CLK_ESW,
MTK_CLK_GP1,
MTK_CLK_GP2,
+   MTK_CLK_TRGPLL,
MTK_CLK_MAX
 };
 
-- 
1.9.1



[PATCH net-next v2 0/3] add support for RGMII on GMAC0 through TRGMII hardware module

2016-09-21 Thread sean.wang
From: Sean Wang 

By default, GMAC0 is connected to the built-in switch, called
MT7530, through a proprietary interface called Turbo RGMII
(TRGMII). TRGMII also works well for plain RGMII as used by generic
external PHYs, but that requires some slight changes to the TRGMII
setup and is not well supported by the current driver.

So this patchset
1) provides the slight setup changes needed so that RGMII can work
   through TRGMII
2) adds the additional phy-mode setting "trgmii", as
   PHY_INTERFACE_MODE_TRGMII, to the device tree so that GMAC0 can
   distinguish which mode it runs in
3) dynamically changes the source clock, TX/RX delay and interface
   mode on TRGMII to adapt to the various link speeds

Changes since v1:
- fixed the style of comments which didn't have a space at
   the beginning and end of the comment lines
- added support for phy-mode "trgmii" as PHY_INTERFACE_MODE_TRGMII
   in linux/phy.h
- enhanced the Documentation about the device tree binding for
   trgmii, which is applicable only to GMAC0, which uses fixed-link

Sean Wang (3):
  net: ethernet: mediatek: add extension of phy-mode for TRGMII
  net: ethernet: mediatek: add support for GMAC0 connecting with
external PHY through TRGMII
  net: ethernet: mediatek: add the dts property to set if TRGMII
supported on GMAC0

 .../devicetree/bindings/net/mediatek-net.txt   |  5 +++-
 drivers/net/ethernet/mediatek/mtk_eth_soc.c| 34 +-
 drivers/net/ethernet/mediatek/mtk_eth_soc.h| 34 +-
 include/linux/phy.h|  3 ++
 4 files changed, 73 insertions(+), 3 deletions(-)

-- 
1.9.1



[PATCH nf v3] netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack

2016-09-21 Thread fgao
From: Gao Feng 

It is valid for a TCP RST packet not to set the ack flag, with the
bytes of the ack number being zero. But the current seqadj code would
adjust that "0" ack to an invalid ack number. seqadj needs to check
the ack flag before adjusting it for these RST packets.

The following is my test case

The client is 10.26.98.245; add one iptables rule:
iptables -I INPUT -p tcp --sport 12345 -m connbytes --connbytes 2:
--connbytes-dir reply --connbytes-mode packets -j REJECT --reject-with
tcp-reset
This iptables rule generates a TCP RST without the ack flag.

server: 10.172.135.55
Enable synproxy with seqadj using the following iptables rules:
iptables -t raw -A PREROUTING -i eth0 -p tcp -d 10.172.135.55 --dport 12345
-m tcp --syn -j CT --notrack

iptables -A INPUT -i eth0 -p tcp -d 10.172.135.55 --dport 12345 -m conntrack
--ctstate INVALID,UNTRACKED -j SYNPROXY --sack-perm --timestamp --wscale 7
--mss 1460
iptables -A OUTPUT -o eth0 -p tcp -s 10.172.135.55 --sport 12345 -m conntrack
--ctstate INVALID,UNTRACKED -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j ACCEPT

The following is my test result.

1. packet trace on client
root@routers:/tmp# tcpdump -i eth0 tcp port 12345 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [S], seq 3695959829,
win 29200, options [mss 1460,sackOK,TS val 452367884 ecr 0,nop,wscale 7],
length 0
IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [S.], seq 546723266,
ack 3695959830, win 0, options [mss 1460,sackOK,TS val 15643479 ecr 452367884,
nop,wscale 7], length 0
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [.], ack 1, win 229,
options [nop,nop,TS val 452367885 ecr 15643479], length 0
IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [.], ack 1, win 226,
options [nop,nop,TS val 15643479 ecr 452367885], length 0
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [R], seq 3695959830,
win 0, length 0

2. seqadj log on server
[62873.867319] Adjusting sequence number from 602341895->546723267,
ack from 3695959830->3695959830
[62873.867644] Adjusting sequence number from 602341895->546723267,
ack from 3695959830->3695959830
[62873.869040] Adjusting sequence number from 3695959830->3695959830,
ack from 0->55618628

To summarize, it is clear that the seqadj code adjusts the 0 ack when
receiving a TCP RST packet without the ack flag.

Signed-off-by: Gao Feng 
---
 v3: Add the reproduce steps and packet trace
 v2: Regenerate because the first patch is removed
 v1: Initial patch

 net/netfilter/nf_conntrack_seqadj.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/net/netfilter/nf_conntrack_seqadj.c 
b/net/netfilter/nf_conntrack_seqadj.c
index dff0f0c..3bd9c7e 100644
--- a/net/netfilter/nf_conntrack_seqadj.c
+++ b/net/netfilter/nf_conntrack_seqadj.c
@@ -179,30 +179,34 @@ int nf_ct_seq_adjust(struct sk_buff *skb,
 
tcph = (void *)skb->data + protoff;
 	spin_lock_bh(&ct->lock);
+
if (after(ntohl(tcph->seq), this_way->correction_pos))
seqoff = this_way->offset_after;
else
seqoff = this_way->offset_before;
 
-   if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
- other_way->correction_pos))
-   ackoff = other_way->offset_after;
-   else
-   ackoff = other_way->offset_before;
-
newseq = htonl(ntohl(tcph->seq) + seqoff);
-   newack = htonl(ntohl(tcph->ack_seq) - ackoff);
-
 	inet_proto_csum_replace4(&tcph->check, skb, tcph->seq, newseq, false);
-	inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq, newack,
-				 false);
-
-   pr_debug("Adjusting sequence number from %u->%u, ack from %u->%u\n",
-ntohl(tcph->seq), ntohl(newseq), ntohl(tcph->ack_seq),
-ntohl(newack));
 
+   pr_debug("Adjusting sequence number from %u->%u\n",
+ntohl(tcph->seq), ntohl(newseq));
tcph->seq = newseq;
-   tcph->ack_seq = newack;
+
+   if (likely(tcph->ack)) {
+   if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
+ other_way->correction_pos))
+   ackoff = other_way->offset_after;
+   else
+   ackoff = other_way->offset_before;
+
+   newack = htonl(ntohl(tcph->ack_seq) - ackoff);
+	inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq,
+				 newack, false);
+
+   pr_debug("Adjusting ack number from %u->%u\n",
+ntohl(tcph->ack_seq), ntohl(newack));
+   tcph->ack_seq = newack;
+   }
 
res = nf_ct_sack_adjust(skb, protoff, tcph, ct, ctinfo);
 	spin_unlock_bh(&ct->lock);
-- 
1.9.1



Re: [PATCH net] tcp: fix under-accounting retransmit SNMP counters

2016-09-21 Thread Eric Dumazet
On Wed, 2016-09-21 at 16:16 -0700, Yuchung Cheng wrote:
> This patch fixes these under-accounting SNMP rtx stats
> LINUX_MIB_TCPFORWARDRETRANS
> LINUX_MIB_TCPFASTRETRANS
> LINUX_MIB_TCPSLOWSTARTRETRANS
> when retransmitting TSO packets
> 
> Fixes: 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time")
> Signed-off-by: Yuchung Cheng 
> ---

Good catch, thanks Yuchung !

Acked-by: Eric Dumazet 




Re: [PATCH net-next V2 0/4] mlx4 misc cleanups and improvements

2016-09-21 Thread David Miller
From: Tariq Toukan 
Date: Tue, 20 Sep 2016 14:39:38 +0300

> This patchset contains some cleanups and improvements from the team
> to the mlx4 Eth and core drivers.
> 
> Series generated against net-next commit:
> 5a7aa362 'net sched: stylistic cleanups'

Series applied, thanks.


Re: pull-request: wireless-drivers 2016-09-20

2016-09-21 Thread David Miller
From: Kalle Valo 
Date: Tue, 20 Sep 2016 13:20:46 +0300

> last pull request for 4.8, unless something really drastic comes up. And
> a small one even, just a small fix to iwlwifi to avoid a firmware crash.
> 
> Please let me know if there are any problems.

Pulled, thanks.


Re: [PATCH net-next 8/9] rxrpc: Reduce the number of PING ACKs sent

2016-09-21 Thread kbuild test robot
Hi David,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/David-Howells/rxrpc-Preparation-for-slow-start-algorithm/20160922-085242
config: i386-randconfig-x009-201638 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

Note: the linux-review/David-Howells/rxrpc-Preparation-for-slow-start-algorithm/20160922-085242
      HEAD f739ad653a9471b67eb8fc01185c01c2ca1dcb4b builds fine.
      It only hurts bisectability.

All errors (new ones prefixed by >>):

   net/rxrpc/input.c: In function 'rxrpc_send_ping':
>> net/rxrpc/input.c:50:42: error: 'struct rxrpc_peer' has no member named 'rtt_last_req'; did you mean 'rtt_cache'?
     ktime_before(ktime_add_ms(call->peer->rtt_last_req, 1000), now))
                                           ^~

vim +50 net/rxrpc/input.c

    44				    int skew)
    45	{
    46		struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
    47		ktime_t now = skb->tstamp;
    48	
    49		if (call->peer->rtt_usage < 3 ||
  > 50	    ktime_before(ktime_add_ms(call->peer->rtt_last_req, 1000), now))
    51			rxrpc_propose_ACK(call, RXRPC_ACK_PING, skew, sp->hdr.serial,
    52					  true, true);
    53	

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




[PATCH net-next 8/9] rxrpc: Reduce the number of PING ACKs sent

2016-09-21 Thread David Howells
We don't want to send a PING ACK for every new incoming call as that just
adds to the network traffic.  Instead, we send a PING ACK to the first
three that we receive and then once per second thereafter.

This could probably be made adjustable in future.

Signed-off-by: David Howells 
---

 net/rxrpc/input.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index c121949de3c8..cbb5d53f09d7 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -44,9 +44,12 @@ static void rxrpc_send_ping(struct rxrpc_call *call, struct sk_buff *skb,
int skew)
 {
struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
+   ktime_t now = skb->tstamp;
 
-   rxrpc_propose_ACK(call, RXRPC_ACK_PING, skew, sp->hdr.serial,
- true, true);
+   if (call->peer->rtt_usage < 3 ||
+   ktime_before(ktime_add_ms(call->peer->rtt_last_req, 1000), now))
+   rxrpc_propose_ACK(call, RXRPC_ACK_PING, skew, sp->hdr.serial,
+ true, true);
 }
 
 /*



[PATCH net-next 7/9] rxrpc: Obtain RTT data by requesting ACKs on DATA packets

2016-09-21 Thread David Howells
In addition to sending a PING ACK to gain RTT data, we can set the
RXRPC_REQUEST_ACK flag on a DATA packet and get a REQUESTED-ACK ACK.  The
ACK packet contains the serial number of the packet it is in response to,
so we can look through the Tx buffer for a matching DATA packet.

This requires that the data packets be stamped with the time of
transmission as a ktime rather than having the resend_at time in jiffies.

This further requires the resend code to do the resend determination in
ktimes and convert to jiffies to set the timer.

Signed-off-by: David Howells 
---

 net/rxrpc/ar-internal.h |7 +++
 net/rxrpc/call_event.c  |   20 ++--
 net/rxrpc/input.c   |   35 +++
 net/rxrpc/misc.c|6 --
 net/rxrpc/output.c  |7 +--
 net/rxrpc/sendmsg.c |1 -
 net/rxrpc/sysctl.c  |2 +-
 7 files changed, 58 insertions(+), 20 deletions(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 8b47f468eb9d..1c4597b2c6cd 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -142,10 +142,7 @@ struct rxrpc_host_header {
  */
 struct rxrpc_skb_priv {
 	union {
-		unsigned long	resend_at;	/* time in jiffies at which to resend */
-		struct {
-			u8	nr_jumbo;	/* Number of jumbo subpackets */
-		};
+		u8	nr_jumbo;	/* Number of jumbo subpackets */
 	};
 	union {
 		unsigned int	offset;		/* offset into buffer of next read */
@@ -663,6 +660,7 @@ extern const char rxrpc_recvmsg_traces[rxrpc_recvmsg__nr_trace][5];
 
 enum rxrpc_rtt_tx_trace {
rxrpc_rtt_tx_ping,
+   rxrpc_rtt_tx_data,
rxrpc_rtt_tx__nr_trace
 };
 
@@ -670,6 +668,7 @@ extern const char rxrpc_rtt_tx_traces[rxrpc_rtt_tx__nr_trace][5];
 
 enum rxrpc_rtt_rx_trace {
rxrpc_rtt_rx_ping_response,
+   rxrpc_rtt_rx_requested_ack,
rxrpc_rtt_rx__nr_trace
 };
 
diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
index 34ad967f2d81..77802da14456 100644
--- a/net/rxrpc/call_event.c
+++ b/net/rxrpc/call_event.c
@@ -142,12 +142,14 @@ static void rxrpc_resend(struct rxrpc_call *call)
struct rxrpc_skb_priv *sp;
struct sk_buff *skb;
rxrpc_seq_t cursor, seq, top;
-   unsigned long resend_at, now;
+   ktime_t now = ktime_get_real(), max_age, oldest, resend_at;
int ix;
u8 annotation, anno_type;
 
_enter("{%d,%d}", call->tx_hard_ack, call->tx_top);
 
+   max_age = ktime_sub_ms(now, rxrpc_resend_timeout);
+
 	spin_lock_bh(&call->lock);
 
cursor = call->tx_hard_ack;
@@ -160,8 +162,7 @@ static void rxrpc_resend(struct rxrpc_call *call)
 * the packets in the Tx buffer we're going to resend and what the new
 * resend timeout will be.
 */
-   now = jiffies;
-   resend_at = now + rxrpc_resend_timeout;
+   oldest = now;
for (seq = cursor + 1; before_eq(seq, top); seq++) {
ix = seq & RXRPC_RXTX_BUFF_MASK;
annotation = call->rxtx_annotations[ix];
@@ -175,9 +176,9 @@ static void rxrpc_resend(struct rxrpc_call *call)
sp = rxrpc_skb(skb);
 
if (anno_type == RXRPC_TX_ANNO_UNACK) {
-   if (time_after(sp->resend_at, now)) {
-   if (time_before(sp->resend_at, resend_at))
-   resend_at = sp->resend_at;
+   if (ktime_after(skb->tstamp, max_age)) {
+   if (ktime_before(skb->tstamp, oldest))
+   oldest = skb->tstamp;
continue;
}
}
@@ -186,7 +187,9 @@ static void rxrpc_resend(struct rxrpc_call *call)
call->rxtx_annotations[ix] = RXRPC_TX_ANNO_RETRANS | annotation;
}
 
-   call->resend_at = resend_at;
+   resend_at = ktime_sub(ktime_add_ns(oldest, rxrpc_resend_timeout), now);
+   call->resend_at = jiffies +
+   usecs_to_jiffies(ktime_to_ns(resend_at) / NSEC_PER_USEC);
 
/* Now go through the Tx window and perform the retransmissions.  We
 * have to drop the lock for each send.  If an ACK comes in whilst the
@@ -205,15 +208,12 @@ static void rxrpc_resend(struct rxrpc_call *call)
 	spin_unlock_bh(&call->lock);
 
if (rxrpc_send_data_packet(call, skb) < 0) {
-   call->resend_at = now + 2;
rxrpc_free_skb(skb, rxrpc_skb_tx_freed);
return;
}
 
if (rxrpc_is_client_call(call))
rxrpc_expose_client_call(call);
-   sp = rxrpc_skb(skb);
-   sp->resend_at = now + rxrpc_resend_timeout;
 
rxrpc_free_skb(skb, 

[PATCH net-next 6/9] rxrpc: Add ktime_sub_ms()

2016-09-21 Thread David Howells
Add a ktime_sub_ms() to go with ktime_add_ms() and co. for use in AF_RXRPC
RTT determination.

Signed-off-by: David Howells 
---

 include/linux/ktime.h |5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/ktime.h b/include/linux/ktime.h
index 2b6a204bd8d4..aa118bad1407 100644
--- a/include/linux/ktime.h
+++ b/include/linux/ktime.h
@@ -231,6 +231,11 @@ static inline ktime_t ktime_sub_us(const ktime_t kt, const u64 usec)
return ktime_sub_ns(kt, usec * NSEC_PER_USEC);
 }
 
+static inline ktime_t ktime_sub_ms(const ktime_t kt, const u64 msec)
+{
+   return ktime_sub_ns(kt, msec * NSEC_PER_MSEC);
+}
+
 extern ktime_t ktime_add_safe(const ktime_t lhs, const ktime_t rhs);
 
 /**
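
For context, patch 7 in this series uses the new helper when computing
the resend cut-off:

	max_age = ktime_sub_ms(now, rxrpc_resend_timeout);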



[PATCH net-next 1/9] rxrpc: Don't store the rxrpc header in the Tx queue sk_buffs

2016-09-21 Thread David Howells
Don't store the rxrpc protocol header in sk_buffs on the transmit queue,
but rather generate it on the fly and pass it to kernel_sendmsg() as a
separate iov.  This reduces the amount of storage required.

Note that the security header is still stored in the sk_buff as it may get
encrypted along with the data (and doesn't change with each transmission).

Signed-off-by: David Howells 
---

 net/rxrpc/ar-internal.h |5 +--
 net/rxrpc/call_event.c  |   11 +-
 net/rxrpc/conn_object.c |1 -
 net/rxrpc/output.c  |   83 ---
 net/rxrpc/rxkad.c   |8 ++---
 net/rxrpc/sendmsg.c |   51 +
 6 files changed, 71 insertions(+), 88 deletions(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 034f525f2235..f021df4a6a22 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -385,10 +385,9 @@ struct rxrpc_connection {
 	int			debug_id;	/* debug ID for printks */
 	atomic_t		serial;		/* packet serial number counter */
 	unsigned int		hi_serial;	/* highest serial number received */
+	u32			security_nonce;	/* response re-use preventer */
 	u8			size_align;	/* data size alignment (for security) */
-	u8			header_size;	/* rxrpc + security header size */
 	u8			security_size;	/* security header size */
-	u32			security_nonce;	/* response re-use preventer */
 	u8			security_ix;	/* security type */
 	u8			out_clientflag;	/* RXRPC_CLIENT_INITIATED if we are client */
 };
@@ -946,7 +945,7 @@ extern const s8 rxrpc_ack_priority[];
  * output.c
  */
 int rxrpc_send_call_packet(struct rxrpc_call *, u8);
-int rxrpc_send_data_packet(struct rxrpc_connection *, struct sk_buff *);
+int rxrpc_send_data_packet(struct rxrpc_call *, struct sk_buff *);
 void rxrpc_reject_packets(struct rxrpc_local *);
 
 /*
diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
index 7d1b99824ed9..6247ce25eb21 100644
--- a/net/rxrpc/call_event.c
+++ b/net/rxrpc/call_event.c
@@ -139,7 +139,6 @@ void rxrpc_propose_ACK(struct rxrpc_call *call, u8 
ack_reason,
  */
 static void rxrpc_resend(struct rxrpc_call *call)
 {
-   struct rxrpc_wire_header *whdr;
struct rxrpc_skb_priv *sp;
struct sk_buff *skb;
rxrpc_seq_t cursor, seq, top;
@@ -201,15 +200,8 @@ static void rxrpc_resend(struct rxrpc_call *call)
skb = call->rxtx_buffer[ix];
rxrpc_get_skb(skb, rxrpc_skb_tx_got);
 	spin_unlock_bh(&call->lock);
-   sp = rxrpc_skb(skb);
-
-   /* Each Tx packet needs a new serial number */
-	sp->hdr.serial = atomic_inc_return(&call->conn->serial);
 
-   whdr = (struct rxrpc_wire_header *)skb->head;
-   whdr->serial = htonl(sp->hdr.serial);
-
-   if (rxrpc_send_data_packet(call->conn, skb) < 0) {
+   if (rxrpc_send_data_packet(call, skb) < 0) {
call->resend_at = now + 2;
rxrpc_free_skb(skb, rxrpc_skb_tx_freed);
return;
@@ -217,6 +209,7 @@ static void rxrpc_resend(struct rxrpc_call *call)
 
if (rxrpc_is_client_call(call))
rxrpc_expose_client_call(call);
+   sp = rxrpc_skb(skb);
sp->resend_at = now + rxrpc_resend_timeout;
 
rxrpc_free_skb(skb, rxrpc_skb_tx_freed);
diff --git a/net/rxrpc/conn_object.c b/net/rxrpc/conn_object.c
index 3b55aee0c436..e1e83af47866 100644
--- a/net/rxrpc/conn_object.c
+++ b/net/rxrpc/conn_object.c
@@ -53,7 +53,6 @@ struct rxrpc_connection *rxrpc_alloc_connection(gfp_t gfp)
 	spin_lock_init(&conn->state_lock);
 	conn->debug_id = atomic_inc_return(&rxrpc_debug_id);
conn->size_align = 4;
-   conn->header_size = sizeof(struct rxrpc_wire_header);
conn->idle_timestamp = jiffies;
}
 
diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c
index 16e18a94ffa6..817fb0e82d6a 100644
--- a/net/rxrpc/output.c
+++ b/net/rxrpc/output.c
@@ -208,19 +208,42 @@ out:
 /*
  * send a packet through the transport endpoint
  */
-int rxrpc_send_data_packet(struct rxrpc_connection *conn, struct sk_buff *skb)
+int rxrpc_send_data_packet(struct rxrpc_call *call, struct sk_buff *skb)
 {
-   struct kvec iov[1];
+   struct rxrpc_connection *conn = call->conn;
+   struct rxrpc_wire_header whdr;
+   struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
struct msghdr msg;
+   struct kvec iov[2];
+   rxrpc_serial_t serial;
+   size_t len;
int ret, opt;
 
_enter(",{%d}", skb->len);
 
-   iov[0].iov_base = skb->head;
-   iov[0].iov_len = skb->len;
+   /* Each 

[PATCH net-next 5/9] rxrpc: Expedite ping response transmission

2016-09-21 Thread David Howells
Expedite the transmission of a response to a PING ACK by sending it from
sendmsg if one is pending.  We're most likely to see a PING ACK during the
client call Tx phase as the other side may use it to determine a number of
parameters, such as the client's receive window size, the RTT and whether
the client is doing slow start (similar to RFC5681).

If we don't expedite it, it's left to the background processing thread to
transmit.

Signed-off-by: David Howells 
---

 net/rxrpc/sendmsg.c |4 
 1 file changed, 4 insertions(+)

diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c
index 814b17f23971..3c969de3ef05 100644
--- a/net/rxrpc/sendmsg.c
+++ b/net/rxrpc/sendmsg.c
@@ -180,6 +180,10 @@ static int rxrpc_send_data(struct rxrpc_sock *rx,
 
copied = 0;
do {
+   /* Check to see if there's a ping ACK to reply to. */
+   if (call->ackr_reason == RXRPC_ACK_PING_RESPONSE)
+   rxrpc_send_call_packet(call, RXRPC_PACKET_TYPE_ACK);
+
if (!skb) {
size_t size, chunk, max, space;
 



[PATCH net-next 9/9] rxrpc: Reduce the number of ACK-Requests sent

2016-09-21 Thread David Howells
Reduce the number of ACK-Requests we set on DATA packets that we're sending
to reduce network traffic.  We set the flag on odd-numbered DATA packets to
start off the RTT cache until we have at least three entries in it and then
probe once per second thereafter to keep it topped up.

This could be made tunable in future.

Note that from this point, the RXRPC_REQUEST_ACK flag is set on DATA
packets as we transmit them and not stored statically in the sk_buff.

Signed-off-by: David Howells 
---

 net/rxrpc/ar-internal.h |1 +
 net/rxrpc/output.c  |   13 +++--
 net/rxrpc/peer_object.c |1 +
 net/rxrpc/sendmsg.c |2 --
 4 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 1c4597b2c6cd..b13754a6dd7a 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -255,6 +255,7 @@ struct rxrpc_peer {
 
 	/* calculated RTT cache */
 #define RXRPC_RTT_CACHE_SIZE 32
+	ktime_t			rtt_last_req;	/* Time of last RTT request */
 	u64			rtt;		/* Current RTT estimate (in nS) */
 	u64			rtt_sum;	/* Sum of cache contents */
 	u64			rtt_cache[RXRPC_RTT_CACHE_SIZE]; /* Determined RTT cache */
diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c
index db01fbb70d23..282cb1e36d06 100644
--- a/net/rxrpc/output.c
+++ b/net/rxrpc/output.c
@@ -270,6 +270,12 @@ int rxrpc_send_data_packet(struct rxrpc_call *call, struct sk_buff *skb)
msg.msg_controllen = 0;
msg.msg_flags = 0;
 
+   /* If our RTT cache needs working on, request an ACK. */
+   if ((call->peer->rtt_usage < 3 && sp->hdr.seq & 1) ||
+   ktime_before(ktime_add_ms(call->peer->rtt_last_req, 1000),
+		 ktime_get_real()))
+   whdr.flags |= RXRPC_REQUEST_ACK;
+
if (IS_ENABLED(CONFIG_AF_RXRPC_INJECT_LOSS)) {
static int lose;
if ((lose++ & 7) == 7) {
@@ -301,11 +307,14 @@ int rxrpc_send_data_packet(struct rxrpc_call *call, struct sk_buff *skb)
 
 done:
if (ret >= 0) {
-   skb->tstamp = ktime_get_real();
+   ktime_t now = ktime_get_real();
+   skb->tstamp = now;
smp_wmb();
sp->hdr.serial = serial;
-   if (whdr.flags & RXRPC_REQUEST_ACK)
+   if (whdr.flags & RXRPC_REQUEST_ACK) {
+   call->peer->rtt_last_req = now;
trace_rxrpc_rtt_tx(call, rxrpc_rtt_tx_data, serial);
+   }
}
_leave(" = %d [%u]", ret, call->peer->maxdata);
return ret;
diff --git a/net/rxrpc/peer_object.c b/net/rxrpc/peer_object.c
index f3e5766910fd..941b724d523b 100644
--- a/net/rxrpc/peer_object.c
+++ b/net/rxrpc/peer_object.c
@@ -244,6 +244,7 @@ static void rxrpc_init_peer(struct rxrpc_peer *peer, unsigned long hash_key)
peer->hash_key = hash_key;
rxrpc_assess_MTU_size(peer);
peer->mtu = peer->if_mtu;
+   peer->rtt_last_req = ktime_get_real();
 
switch (peer->srx.transport.family) {
case AF_INET:
diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c
index 607223f4f871..ca7c3be60ad2 100644
--- a/net/rxrpc/sendmsg.c
+++ b/net/rxrpc/sendmsg.c
@@ -299,8 +299,6 @@ static int rxrpc_send_data(struct rxrpc_sock *rx,
else if (call->tx_top - call->tx_hard_ack <
 call->tx_winsize)
sp->hdr.flags |= RXRPC_MORE_PACKETS;
-   if (seq & 1)
-   sp->hdr.flags |= RXRPC_REQUEST_ACK;
 
ret = conn->security->secure_packet(
call, skb, skb->mark, skb->head);



[PATCH net-next 0/9] rxrpc: Preparation for slow-start algorithm

2016-09-21 Thread David Howells

Here are some patches that prepare for improvements in ACK generation and
for the implementation of the slow-start part of the protocol:

 (1) Stop storing the protocol header in the Tx socket buffers, but rather
 generate it on the fly.  This potentially saves a little space and
 makes it easier to alter the header just before transmission (the
 flags may get altered and the serial number has to be changed).

 (2) Mask off the Tx buffer annotations and add a flag to record which ones
 have already been resent.

 (3) Track RTT on a per-peer basis for use in future changes.  Tracepoints
 are added to log this.

 (4) Send PING ACKs in response to incoming calls to elicit a PING-RESPONSE
 ACK from which RTT data can be calculated.  The response also carries
 other useful information.

 (5) Expedite PING-RESPONSE ACK generation from sendmsg.  If we're actively
 using sendmsg, this allows us, under some circumstances, to avoid
 having to rely on the background work item to run to generate this
 ACK.

 This requires ktime_sub_ms() to be added.

 (6) Set the REQUEST-ACK flag on some DATA packets to elicit ACK-REQUESTED
 ACKs from which RTT data can be calculated.

 (7) Limit the use of pings and ACK requests for RTT determination.

The patches can be found here also:


http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
rxrpc-rewrite-20160922

David
---
David Howells (9):
  rxrpc: Don't store the rxrpc header in the Tx queue sk_buffs
  rxrpc: Add re-sent Tx annotation
  rxrpc: Add per-peer RTT tracker
  rxrpc: Send pings to get RTT data
  rxrpc: Expedite ping response transmission
  rxrpc: Add ktime_sub_ms()
  rxrpc: Obtain RTT data by requesting ACKs on DATA packets
  rxrpc: Reduce the number of PING ACKs sent
  rxrpc: Reduce the number of ACK-Requests sent


 include/linux/ktime.h|5 ++
 include/trace/events/rxrpc.h |   61 ++
 net/rxrpc/ar-internal.h  |   47 -
 net/rxrpc/call_event.c   |   57 +++-
 net/rxrpc/conn_object.c  |1 
 net/rxrpc/input.c|  100 ++--
 net/rxrpc/misc.c |   25 ++---
 net/rxrpc/output.c   |  117 --
 net/rxrpc/peer_event.c   |   39 ++
 net/rxrpc/peer_object.c  |1 
 net/rxrpc/rxkad.c|8 +--
 net/rxrpc/sendmsg.c  |   56 
 net/rxrpc/sysctl.c   |2 -
 13 files changed, 389 insertions(+), 130 deletions(-)



[PATCH net-next 4/9] rxrpc: Send pings to get RTT data

2016-09-21 Thread David Howells
Send a PING ACK packet to the peer when we get a new incoming call from a
peer we don't have a record for.  The PING RESPONSE ACK packet will tell us
the following about the peer:

 (1) its receive window size

 (2) its MTU sizes

 (3) its support for jumbo DATA packets

 (4) if it supports slow start (similar to RFC 5681)

 (5) an estimate of the RTT

This is necessary because the peer won't normally send us an ACK until it
gets to the Rx phase and we send it a packet, but we would like to know
some of this information before we start sending packets.

A pair of tracepoints are added so that RTT determination can be observed.

Signed-off-by: David Howells 
---

 net/rxrpc/ar-internal.h |7 +--
 net/rxrpc/input.c   |   48 ++-
 net/rxrpc/misc.c|   11 ++-
 net/rxrpc/output.c  |   22 ++
 4 files changed, 80 insertions(+), 8 deletions(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 79c671e552c3..8b47f468eb9d 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -403,6 +403,7 @@ enum rxrpc_call_flag {
RXRPC_CALL_EXPOSED, /* The call was exposed to the world */
RXRPC_CALL_RX_LAST, /* Received the last packet (at rxtx_top) */
RXRPC_CALL_TX_LAST, /* Last packet in Tx buffer (at rxtx_top) */
+   RXRPC_CALL_PINGING, /* Ping in process */
 };
 
 /*
@@ -487,6 +488,8 @@ struct rxrpc_call {
u32 call_id;/* call ID on connection  */
u32 cid;/* connection ID plus channel index */
int debug_id;   /* debug ID for printks */
+   unsigned short  rx_pkt_offset;  /* Current recvmsg packet offset */
+   unsigned short  rx_pkt_len; /* Current recvmsg packet len */
 
/* Rx/Tx circular buffer, depending on phase.
 *
@@ -530,8 +533,8 @@ struct rxrpc_call {
u16 ackr_skew;  /* skew on packet being ACK'd */
rxrpc_serial_t  ackr_serial;/* serial of packet being ACK'd */
rxrpc_seq_t ackr_prev_seq;  /* previous sequence number received */
-   unsigned short  rx_pkt_offset;  /* Current recvmsg packet offset */
-   unsigned short  rx_pkt_len; /* Current recvmsg packet len */
+   rxrpc_serial_t  ackr_ping;  /* Last ping sent */
+   ktime_t ackr_ping_time; /* Time last ping sent */
 
/* transmission-phase ACK management */
rxrpc_serial_t  acks_latest;/* serial number of latest ACK received */
diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index aa261df9fc9e..a0a5bd108c9e 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -37,6 +37,19 @@ static void rxrpc_proto_abort(const char *why,
 }
 
 /*
+ * Ping the other end to fill our RTT cache and to retrieve the rwind
+ * and MTU parameters.
+ */
+static void rxrpc_send_ping(struct rxrpc_call *call, struct sk_buff *skb,
+   int skew)
+{
+   struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
+
+   rxrpc_propose_ACK(call, RXRPC_ACK_PING, skew, sp->hdr.serial,
+ true, true);
+}
+
+/*
  * Apply a hard ACK by advancing the Tx window.
  */
 static void rxrpc_rotate_tx_window(struct rxrpc_call *call, rxrpc_seq_t to)
@@ -343,6 +356,32 @@ ack:
 }
 
 /*
+ * Process a ping response.
+ */
+static void rxrpc_input_ping_response(struct rxrpc_call *call,
+ ktime_t resp_time,
+ rxrpc_serial_t orig_serial,
+ rxrpc_serial_t ack_serial)
+{
+   rxrpc_serial_t ping_serial;
+   ktime_t ping_time;
+
+   ping_time = call->ackr_ping_time;
+   smp_rmb();
+   ping_serial = call->ackr_ping;
+
+   if (!test_bit(RXRPC_CALL_PINGING, &call->flags) ||
+   before(orig_serial, ping_serial))
+   return;
+   clear_bit(RXRPC_CALL_PINGING, &call->flags);
+   if (after(orig_serial, ping_serial))
+   return;
+
+   rxrpc_peer_add_rtt(call, rxrpc_rtt_rx_ping_response,
+  orig_serial, ack_serial, ping_time, resp_time);
+}
+
+/*
  * Process the extra information that may be appended to an ACK packet
  */
 static void rxrpc_input_ackinfo(struct rxrpc_call *call, struct sk_buff *skb,
@@ -438,6 +477,7 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb,
struct rxrpc_ackinfo info;
u8 acks[RXRPC_MAXACKS];
} buf;
+   rxrpc_serial_t acked_serial;
rxrpc_seq_t first_soft_ack, hard_ack;
int nr_acks, offset;
 
@@ -449,6 +489,7 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb,
}
sp->offset += sizeof(buf.ack);
 
+   acked_serial = 

[PATCH net-next 2/9] rxrpc: Add re-sent Tx annotation

2016-09-21 Thread David Howells
Add a Tx-phase annotation for packet buffers to indicate that a buffer has
already been retransmitted.  This will be used by future congestion
management.  Re-retransmissions of a packet don't affect the congestion
window management in the same way as initial retransmissions.
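
For illustration, the way the new mask and flag partition the u8
annotation (a standalone sketch using the values added below; not the
kernel code itself):

    /* Low two bits: state (ACK/UNACK/NAK/RETRANS); bit 2: RESENT.
     * Mirrors RXRPC_TX_ANNO_MASK and RXRPC_TX_ANNO_RESENT. */
    #define ANNO_MASK   0x03
    #define ANNO_RESENT 0x04

    static unsigned char anno_mark_resent(unsigned char anno)
    {
            return anno | ANNO_RESENT;      /* state bits preserved */
    }

    static unsigned char anno_state(unsigned char anno)
    {
            return anno & ANNO_MASK;        /* RESENT flag stripped */
    }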

Signed-off-by: David Howells 
---

 net/rxrpc/ar-internal.h |2 ++
 net/rxrpc/call_event.c  |   28 +++-
 net/rxrpc/input.c   |   14 +++---
 3 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index f021df4a6a22..dcf54e3fb478 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -505,6 +505,8 @@ struct rxrpc_call {
 #define RXRPC_TX_ANNO_UNACK1
 #define RXRPC_TX_ANNO_NAK  2
 #define RXRPC_TX_ANNO_RETRANS  3
+#define RXRPC_TX_ANNO_MASK 0x03
+#define RXRPC_TX_ANNO_RESENT   0x04
#define RXRPC_RX_ANNO_JUMBO    0x3f    /* Jumbo subpacket number + 1 if not zero */
#define RXRPC_RX_ANNO_JLAST    0x40    /* Set if last element of a jumbo packet */
#define RXRPC_RX_ANNO_VERIFIED 0x80    /* Set if verified and decrypted */
diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
index 6247ce25eb21..34ad967f2d81 100644
--- a/net/rxrpc/call_event.c
+++ b/net/rxrpc/call_event.c
@@ -144,7 +144,7 @@ static void rxrpc_resend(struct rxrpc_call *call)
rxrpc_seq_t cursor, seq, top;
unsigned long resend_at, now;
int ix;
-   u8 annotation;
+   u8 annotation, anno_type;
 
_enter("{%d,%d}", call->tx_hard_ack, call->tx_top);
 
@@ -165,14 +165,16 @@ static void rxrpc_resend(struct rxrpc_call *call)
for (seq = cursor + 1; before_eq(seq, top); seq++) {
ix = seq & RXRPC_RXTX_BUFF_MASK;
annotation = call->rxtx_annotations[ix];
-   if (annotation == RXRPC_TX_ANNO_ACK)
+   anno_type = annotation & RXRPC_TX_ANNO_MASK;
+   annotation &= ~RXRPC_TX_ANNO_MASK;
+   if (anno_type == RXRPC_TX_ANNO_ACK)
continue;
 
skb = call->rxtx_buffer[ix];
rxrpc_see_skb(skb, rxrpc_skb_tx_seen);
sp = rxrpc_skb(skb);
 
-   if (annotation == RXRPC_TX_ANNO_UNACK) {
+   if (anno_type == RXRPC_TX_ANNO_UNACK) {
if (time_after(sp->resend_at, now)) {
if (time_before(sp->resend_at, resend_at))
resend_at = sp->resend_at;
@@ -181,7 +183,7 @@ static void rxrpc_resend(struct rxrpc_call *call)
}
 
/* Okay, we need to retransmit a packet. */
-   call->rxtx_annotations[ix] = RXRPC_TX_ANNO_RETRANS;
+   call->rxtx_annotations[ix] = RXRPC_TX_ANNO_RETRANS | annotation;
}
 
call->resend_at = resend_at;
@@ -194,7 +196,8 @@ static void rxrpc_resend(struct rxrpc_call *call)
for (seq = cursor + 1; before_eq(seq, top); seq++) {
ix = seq & RXRPC_RXTX_BUFF_MASK;
annotation = call->rxtx_annotations[ix];
-   if (annotation != RXRPC_TX_ANNO_RETRANS)
+   anno_type = annotation & RXRPC_TX_ANNO_MASK;
+   if (anno_type != RXRPC_TX_ANNO_RETRANS)
continue;
 
skb = call->rxtx_buffer[ix];
@@ -220,10 +223,17 @@ static void rxrpc_resend(struct rxrpc_call *call)
 * received and the packet might have been hard-ACK'd (in which
 * case it will no longer be in the buffer).
 */
-   if (after(seq, call->tx_hard_ack) &&
-   (call->rxtx_annotations[ix] == RXRPC_TX_ANNO_RETRANS ||
-call->rxtx_annotations[ix] == RXRPC_TX_ANNO_NAK))
-   call->rxtx_annotations[ix] = RXRPC_TX_ANNO_UNACK;
+   if (after(seq, call->tx_hard_ack)) {
+   annotation = call->rxtx_annotations[ix];
+   anno_type = annotation & RXRPC_TX_ANNO_MASK;
+   if (anno_type == RXRPC_TX_ANNO_RETRANS ||
+   anno_type == RXRPC_TX_ANNO_NAK) {
+   annotation &= ~RXRPC_TX_ANNO_MASK;
+   annotation |= RXRPC_TX_ANNO_UNACK;
+   }
+   annotation |= RXRPC_TX_ANNO_RESENT;
+   call->rxtx_annotations[ix] = annotation;
+   }
 
if (after(call->tx_hard_ack, seq))
seq = call->tx_hard_ack;
diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index 7ac1edf3aac7..aa261df9fc9e 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -388,17 +388,25 @@ static void rxrpc_input_soft_acks(struct rxrpc_call *call, u8 *acks,
 {
bool resend = false;
int ix;
+   u8 annotation, anno_type;
 
for (; nr_acks > 0; nr_acks--, seq++) 

[PATCH net-next 3/9] rxrpc: Add per-peer RTT tracker

2016-09-21 Thread David Howells
Add a function to track the average RTT for a peer.  Sources of RTT data
will be added in subsequent patches.

The RTT data will be useful in the future for determining resend timeouts
and for handling the slow-start part of the Rx protocol.

Also add a pair of tracepoints, one to log transmissions to elicit a
response for RTT purposes and one to log responses that contribute RTT
data.
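
The averaging itself lives in peer_event.c, which this excerpt does not
show; a rough sketch of a ring-buffer mean over the fields added below
(a hypothetical helper, not the actual rxrpc_peer_add_rtt()):

    static void peer_add_rtt_sample(struct rxrpc_peer *peer, u64 rtt_ns)
    {
            u8 cursor = peer->rtt_cursor;

            if (peer->rtt_usage == RXRPC_RTT_CACHE_SIZE)
                    peer->rtt_sum -= peer->rtt_cache[cursor]; /* evict oldest */
            else
                    peer->rtt_usage++;

            peer->rtt_cache[cursor] = rtt_ns;
            peer->rtt_sum += rtt_ns;
            peer->rtt_cursor = (cursor + 1) % RXRPC_RTT_CACHE_SIZE;
            peer->rtt = peer->rtt_sum / peer->rtt_usage; /* running mean */
    }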

Signed-off-by: David Howells 
---

 include/trace/events/rxrpc.h |   61 ++
 net/rxrpc/ar-internal.h  |   25 ++---
 net/rxrpc/misc.c |8 ++
 net/rxrpc/peer_event.c   |   39 +++
 4 files changed, 129 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h
index 75a5d8bf50e1..e8f2afbbe0bf 100644
--- a/include/trace/events/rxrpc.h
+++ b/include/trace/events/rxrpc.h
@@ -353,6 +353,67 @@ TRACE_EVENT(rxrpc_recvmsg,
  __entry->ret)
);
 
+TRACE_EVENT(rxrpc_rtt_tx,
+   TP_PROTO(struct rxrpc_call *call, enum rxrpc_rtt_tx_trace why,
+rxrpc_serial_t send_serial),
+
+   TP_ARGS(call, why, send_serial),
+
+   TP_STRUCT__entry(
+   __field(struct rxrpc_call *,call)
+   __field(enum rxrpc_rtt_tx_trace,why )
+   __field(rxrpc_serial_t, send_serial )
+),
+
+   TP_fast_assign(
+   __entry->call = call;
+   __entry->why = why;
+   __entry->send_serial = send_serial;
+  ),
+
+   TP_printk("c=%p %s sr=%08x",
+ __entry->call,
+ rxrpc_rtt_tx_traces[__entry->why],
+ __entry->send_serial)
+   );
+
+TRACE_EVENT(rxrpc_rtt_rx,
+   TP_PROTO(struct rxrpc_call *call, enum rxrpc_rtt_rx_trace why,
+rxrpc_serial_t send_serial, rxrpc_serial_t resp_serial,
+s64 rtt, u8 nr, s64 avg),
+
+   TP_ARGS(call, why, send_serial, resp_serial, rtt, nr, avg),
+
+   TP_STRUCT__entry(
+   __field(struct rxrpc_call *,call)
+   __field(enum rxrpc_rtt_rx_trace,why )
+   __field(u8, nr  )
+   __field(rxrpc_serial_t, send_serial )
+   __field(rxrpc_serial_t, resp_serial )
+   __field(s64,rtt )
+   __field(u64,avg )
+),
+
+   TP_fast_assign(
+   __entry->call = call;
+   __entry->why = why;
+   __entry->send_serial = send_serial;
+   __entry->resp_serial = resp_serial;
+   __entry->rtt = rtt;
+   __entry->nr = nr;
+   __entry->avg = avg;
+  ),
+
+   TP_printk("c=%p %s sr=%08x rr=%08x rtt=%lld nr=%u avg=%lld",
+ __entry->call,
+ rxrpc_rtt_rx_traces[__entry->why],
+ __entry->send_serial,
+ __entry->resp_serial,
+ __entry->rtt,
+ __entry->nr,
+ __entry->avg)
+   );
+
 #endif /* _TRACE_RXRPC_H */
 
 /* This part must be outside protection */
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index dcf54e3fb478..79c671e552c3 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -258,10 +258,11 @@ struct rxrpc_peer {
 
/* calculated RTT cache */
 #define RXRPC_RTT_CACHE_SIZE 32
-   suseconds_t rtt;/* current RTT estimate (in uS) */
-   unsigned int rtt_point;  /* next entry at which to insert */
-   unsigned int rtt_usage;  /* amount of cache actually used */
-   suseconds_t rtt_cache[RXRPC_RTT_CACHE_SIZE]; /* calculated RTT cache */
+   u64 rtt;/* Current RTT estimate (in nS) */
+   u64 rtt_sum;/* Sum of cache contents */
+   u64 rtt_cache[RXRPC_RTT_CACHE_SIZE]; /* Determined RTT cache */
+   u8  rtt_cursor; /* next entry at which to insert */
+   u8  rtt_usage;  /* amount of cache actually used */
 };
 
 /*
@@ -657,6 +658,20 @@ enum rxrpc_recvmsg_trace {
 
 extern const char rxrpc_recvmsg_traces[rxrpc_recvmsg__nr_trace][5];
 
+enum rxrpc_rtt_tx_trace {
+   rxrpc_rtt_tx_ping,
+   rxrpc_rtt_tx__nr_trace
+};
+
+extern const char rxrpc_rtt_tx_traces[rxrpc_rtt_tx__nr_trace][5];
+
+enum rxrpc_rtt_rx_trace {
+   

Re: XDP (eXpress Data Path) documentation

2016-09-21 Thread Tom Herbert
On Tue, Sep 20, 2016 at 2:08 AM, Jesper Dangaard Brouer
 wrote:
> Hi all,
>
> As promised, I've started documenting the XDP eXpress Data Path):
>
>  [1] https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html
>
> IMHO the documentation has reached a stage where it is useful for the
> XDP project, BUT I request collaboration on improving the documentation
> from all. (Native English speakers are encouraged to send grammar fixes ;-))
>
Hi Jesper,

Thanks for taking the initiative on this. The document reads more
like a design doc than a description right now; that's probably okay,
since we could use a design doc.

Under "Important to understand" there are some disclaimers that XDP
does not implement qdiscs or BQL and fairness otherwise. This is true
for its own traffic, but it does not (or at least should not) affect
these mechanisms or normal stack traffic running simultaneously. I
think we've made assumptions about fairness between XDP and non-XDP
queues; we probably want to clarify fairness (and also validate
whatever assumptions we've made with testing).

Thanks,
Tom

> You wouldn't believe it: But this pretty looking documentation actually
> follows the new Kernel documentation format.  It is actually just
> ".rst" text files stored in my github repository under kernel/Documentation 
> [2]
>
>  [2] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/Documentation
>
> Thus, just git clone my repository and started editing and send me
> patches (or github pull requests). Like:
>
>  $ git clone https://github.com/netoptimizer/prototype-kernel
>  $ cd prototype-kernel/kernel/Documentation/
>  $ make html
>  $ firefox _build/html/index.html &
>
> This new documentation format combines the best of two worlds, pretty
> online browser documentation with almost plain text files, and changes
> being tracked via git commits [3] (and auto git hooks to generate the
> readthedocs.org page). You got to love it! :-)
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
>
> [3] https://github.com/netoptimizer/prototype-kernel/commits/master


Re: [PATCHv7 net-next 00/15] BPF hardware offload (cls_bpf for now)

2016-09-21 Thread David Miller
From: Jakub Kicinski 
Date: Wed, 21 Sep 2016 11:43:52 +0100

> In the last year a lot of progress has been made on offloading
> simpler TC classifiers.  There is also growing interest in using
> BPF for generic high-speed packet processing in the kernel.
> It seems beneficial to tie those two trends together and think
> about hardware offloads of BPF programs.  This patch set presents
> such offload to Netronome smart NICs.  cls_bpf is extended with
> hardware offload capabilities and NFP driver gets a JIT translator
> which in presence of capable firmware can be used to offload
> the BPF program onto the card.
> 
> BPF JIT implementation is not 100% complete (e.g. missing instructions)
> but it is functional.  Encouragingly it should be possible to
> offload most (if not all) advanced BPF features onto the NIC - 
> including packet modification, maps, tunnel encap/decap etc.
 ...

Series applied, thanks.


Re: [PATCH iproute2 net-next] iptnl: add support for collect_md flag in IPv4 and IPv6 tunnels

2016-09-21 Thread Stephen Hemminger
On Mon, 19 Sep 2016 17:03:14 -0700
Alexei Starovoitov  wrote:

> Signed-off-by: Alexei Starovoitov 

Applied, please send update to man page as well.


Re: [PATCH iproute2] ipmonitor: fix ip monitor can't work when NET_NS is not enabled

2016-09-21 Thread Stephen Hemminger
On Tue, 20 Sep 2016 02:09:02 -0700
Liping Zhang  wrote:

> From: Liping Zhang 
> 
> In ip monitor, netns_map_init will check getnsid is supported or not.
> But when /proc/self/ns/net does not exist, we just print out error
> messages and exit. So user cannot use ip monitor anymore when
> CONFIG_NET_NS is disabled:
>   # ip monitor
>   open("/proc/self/ns/net"): No such file or directory
> 
> If open "/proc/self/ns/net" failed, set have_rtnl_getnsid to false.
> 
> Fixes: d652ccbf8195 ("netns: allow to dump and monitor nsid")
> Signed-off-by: Liping Zhang 

Makes sense. Applied.
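
The fix amounts to treating a missing /proc/self/ns/net as "no nsid
support" rather than a fatal error; in rough outline (hypothetical
shape, the real check lives in ip/ipnetns.c):

    #include <fcntl.h>
    #include <unistd.h>

    static int probe_getnsid_support(void)
    {
            int fd = open("/proc/self/ns/net", O_RDONLY);

            if (fd < 0)
                    return 0;  /* no CONFIG_NET_NS: disable getnsid */
            /* ... otherwise probe RTM_GETNSID against this fd ... */
            close(fd);
            return 1;
    }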


Re: [PATCH net-next 2/3] udp: implement memory accounting helpers

2016-09-21 Thread Eric Dumazet
On Wed, 2016-09-21 at 19:23 +0200, Paolo Abeni wrote:
> Avoid usage of common memory accounting functions, since
> the logic is pretty much different.
> 
> To account for forward allocation, a couple of new atomic_t
> members are added to udp_sock: 'mem_alloced' and 'mem_freed'.
> The current forward allocation is estimated as 'mem_alloced'
>  minus 'mem_freed' minus 'sk_rmem_alloc'.
> 
> When the forward allocation can't cope with the packet to be
> enqueued, 'mem_alloced' is incremented by the packet size
> rounded-up to the next SK_MEM_QUANTUM.
> After a dequeue, we try to partially reclaim of the forward
> allocated memory rounded down to an SK_MEM_QUANTUM and
> 'mem_freed' is increased by that amount.
> sk->sk_forward_alloc is set after each allocated/freed memory
> update, to the currently estimated forward allocation, without
> any lock or protection.
> This value is updated/maintained only to expose some
> semi-reasonable value to the eventual reader, and is guaranteed
> to be 0 at socket destruction time.
> 
> The above needs custom memory reclaiming on shutdown, provided
> by the udp_destruct_sock() helper, which completely reclaim
> the allocated forward memory.
> 
> Helpers are provided for skb free, consume and purge, respecting
> the above constraints.
> 
> The socket lock is still used to protect the updates to sk_peek_off,
> but is acquired only if peeking with offset is enabled.
> 
> As a consequence of the above schema, enqueue to sk_error_queue
> will cause larger forward allocation on following normal data
> (due to sk_rmem_alloc grow), but this allows amortizing the cost
> of the atomic operation on SK_MEM_QUANTUM/skb->truesize packets.
> The use of separate atomics for 'mem_alloced' and 'mem_freed'
> allows the use of a single atomic operation to protect against
> concurrent dequeue.
> 
> Acked-by: Hannes Frederic Sowa 
> Signed-off-by: Paolo Abeni 
> ---
>  include/linux/udp.h |   2 +
>  include/net/udp.h   |   5 ++
>  net/ipv4/udp.c  | 151 
> 
>  3 files changed, 158 insertions(+)
> 
> diff --git a/include/linux/udp.h b/include/linux/udp.h
> index d1fd8cd..cd72645 100644
> --- a/include/linux/udp.h
> +++ b/include/linux/udp.h
> @@ -42,6 +42,8 @@ static inline u32 udp_hashfn(const struct net *net, u32 
> num, u32 mask)
>  struct udp_sock {
>   /* inet_sock has to be the first member */
>   struct inet_sock inet;
> + atomic_t mem_allocated;
> + atomic_t mem_freed;

Hi Paolo, thanks for working on this.

All this code looks quite invasive to me ?

Also does inet_diag properly give the forward_alloc to user ?

$ ss -mua
State  Recv-Q Send-Q Local Address:Port Peer Address:Port
UNCONN 51584  0  *:52460  *:*
 skmem:(r51584,rb327680,t0,tb327680,f1664,w0,o0,bl0,d575)


Couldn't we instead use an union of an atomic_t and int for
sk->sk_forward_alloc ?

All udp queues/dequeues would manipulate the atomic_t using regular
atomic ops and use a special skb destructor (instead of sock_rfree())

Also I would not bother 'reclaiming' forward_alloc at dequeue, unless
udp is under memory pressure.

Please share your performance numbers, thanks !
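
For readers following the changelog arithmetic, a minimal sketch of the
forward-allocation estimate it describes (illustrative only, not the
patch code; field names match the diff above):

    /* Current forward allocation: the delta between the two
     * monotonically increasing counters, minus what is still
     * charged to the receive queue. */
    static int udp_forward_alloc_estimate(struct udp_sock *up)
    {
            struct sock *sk = &up->inet.sk;

            return atomic_read(&up->mem_allocated) -
                   atomic_read(&up->mem_freed) -
                   atomic_read(&sk->sk_rmem_alloc);
    }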




Re: [PATCH iproute2 3/3] ss: output TCP BBR diag information

2016-09-21 Thread Stephen Hemminger
On Tue, 20 Sep 2016 22:43:44 -0400
Neal Cardwell  wrote:

> Dump useful TCP BBR state information from a struct tcp_bbr_info that
> was grabbed using the inet_diag API.
> 
> We tolerate info that is shorter or longer than expected, in case the
> kernel is older or newer than the ss binary. We simply print the
> minimum of what is expected from the kernel and what is provided from
> the kernel. We use the same trick as that used for struct tcp_info:
> when the info from the kernel is shorter than we hoped, we pad the end
> with zeroes, and don't print fields if they are zero.
> 
> The BBR output looks like:
>   bbr:(bw:1.2Mbps,mrtt:18.965,pacing_gain:2.88672,cwnd_gain:2.88672)
> 
> The motivation here is to be consistent with DCTCP, which looks like:
>   dctcp(ce_state:23,alpha:23,ab_ecn:23,ab_tot:23)
> 
> Signed-off-by: Neal Cardwell 
> Signed-off-by: Yuchung Cheng 
> Signed-off-by: Eric Dumazet 
> Signed-off-by: Soheil Hassas Yeganeh 

Applied, to net-next.
The first two patches were unnecessary. Already picked up current net-next headers.
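
The shorter/longer tolerance described above is a min-length copy with
zero padding; a sketch of the pattern (hypothetical helper, not the
actual ss source; struct tcp_bbr_info comes from linux/inet_diag.h):

    #include <string.h>
    #include <linux/inet_diag.h>

    static void parse_bbr_info(struct tcp_bbr_info *out,
                               const void *data, size_t len)
    {
            size_t n = len < sizeof(*out) ? len : sizeof(*out);

            memset(out, 0, sizeof(*out));   /* unknown fields read as 0 */
            memcpy(out, data, n);           /* extra kernel bytes ignored */
    }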



[PATCH net] tcp: fix under-accounting retransmit SNMP counters

2016-09-21 Thread Yuchung Cheng
This patch fixes the under-accounting of these SNMP retransmit stats:
LINUX_MIB_TCPFORWARDRETRANS
LINUX_MIB_TCPFASTRETRANS
LINUX_MIB_TCPSLOWSTARTRETRANS
when retransmitting TSO packets, which stand for more than one segment.
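
For reference, tcp_skb_pcount() reports how many MSS-sized segments the
skb stands for (paraphrased from include/net/tcp.h, for illustration
only):

    static inline int tcp_skb_pcount(const struct sk_buff *skb)
    {
            return TCP_SKB_CB(skb)->tcp_gso_segs;
    }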

Fixes: 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Yuchung Cheng 
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 810be35..5725822 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2822,7 +2822,7 @@ begin_fwd:
if (tcp_retransmit_skb(sk, skb, segs))
return;
 
-   NET_INC_STATS(sock_net(sk), mib_idx);
+   NET_ADD_STATS(sock_net(sk), mib_idx, tcp_skb_pcount(skb));
 
if (tcp_in_cwnd_reduction(sk))
tp->prr_out += tcp_skb_pcount(skb);
-- 
2.8.0.rc3.226.g39d4020



[PATCH net] tcp: properly account Fast Open SYN-ACK retrans

2016-09-21 Thread Yuchung Cheng
Since the TFO socket is accepted right off SYN-data, the socket
owner can call getsockopt(TCP_INFO) to collect ongoing SYN-ACK
retransmission or timeout stats (i.e., tcpi_total_retrans,
tcpi_retransmits). Currently those stats are only updated
when the handshake completes. This patch fixes it.
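
For reference, the stats in question are read from the accepted socket
with plain getsockopt() (standard usage, not part of the patch):

    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* Read retransmission counters for a TFO-accepted socket fd. */
    static int read_retrans_stats(int fd, unsigned int *total,
                                  unsigned int *current_rtx)
    {
            struct tcp_info ti;
            socklen_t len = sizeof(ti);

            memset(&ti, 0, sizeof(ti));
            if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0)
                    return -1;
            *total = ti.tcpi_total_retrans;     /* lifetime rtx count */
            *current_rtx = ti.tcpi_retransmits; /* current rtx streak */
            return 0;
    }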

Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Signed-off-by: Neal Cardwell 
Signed-off-by: Soheil Hassas Yeganeh 
---
 net/ipv4/tcp_input.c  | 2 +-
 net/ipv4/tcp_output.c | 2 ++
 net/ipv4/tcp_timer.c  | 1 +
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d6c8f4cd0..8faf97e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5852,7 +5852,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 * so release it.
 */
if (req) {
-   tp->total_retrans = req->num_retrans;
+   inet_csk(sk)->icsk_retransmits = 0;
reqsk_fastopen_remove(sk, req, false);
} else {
/* Make sure socket is routed, for correct metrics. */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8bd9911..810be35 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3559,6 +3559,8 @@ int tcp_rtx_synack(const struct sock *sk, struct request_sock *req)
if (!res) {
__TCP_INC_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS);
__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
+   if (unlikely(tcp_passive_fastopen(sk)))
+   tcp_sk(sk)->total_retrans++;
}
return res;
 }
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index debdd8b..39bc5b2 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -346,6 +346,7 @@ static void tcp_fastopen_synack_timer(struct sock *sk)
 */
inet_rtx_syn_ack(sk, req);
req->num_timeout++;
+   icsk->icsk_retransmits++;
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
  TCP_TIMEOUT_INIT << req->num_timeout, TCP_RTO_MAX);
 }
-- 
2.8.0.rc3.226.g39d4020



[PATCH net-next v2 3/6] net/faraday: Adapt for Aspeed SoCs

2016-09-21 Thread Joel Stanley
The RXDES and TXDES register bits in the ftgmac100 indicate EDO{R,T}R
at bit position 15 for the Faraday Tech IP. However, the version of this
IP present in the Aspeed SoCs has these bits at position 30 in the
registers.

It appears that ast2400 SoCs support both positions, with the 15th bit
marked as reserved but still functional. In the ast2500 this bit is
reused for another function, so we need a workaround.

This was confirmed with engineers from Aspeed that using bit 30 is
correct for both the ast2400 and ast2500 SoCs.

Signed-off-by: Joel Stanley 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index 62a88d1a1f99..47f512224b57 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1345,9 +1345,6 @@ static int ftgmac100_probe(struct platform_device *pdev)
priv->netdev = netdev;
priv->dev = &pdev->dev;
 
-   priv->rxdes0_edorr_mask = BIT(15);
-   priv->txdes0_edotr_mask = BIT(15);
-
spin_lock_init(&priv->tx_lock);
 
/* initialize NAPI */
@@ -1381,6 +1378,16 @@ static int ftgmac100_probe(struct platform_device *pdev)
  FTGMAC100_INT_PHYSTS_CHG |
  FTGMAC100_INT_RPKT_BUF |
  FTGMAC100_INT_NO_RXBUF);
+
+   if (of_machine_is_compatible("aspeed,ast2400") ||
+   of_machine_is_compatible("aspeed,ast2500")) {
+   priv->rxdes0_edorr_mask = BIT(30);
+   priv->txdes0_edotr_mask = BIT(30);
+   } else {
+   priv->rxdes0_edorr_mask = BIT(15);
+   priv->txdes0_edotr_mask = BIT(15);
+   }
+
if (pdev->dev.of_node &&
of_get_property(pdev->dev.of_node, "use-ncsi", NULL)) {
if (!IS_ENABLED(CONFIG_NET_NCSI)) {
-- 
2.9.3



[PATCH net-next v2 6/6] net/faraday: Mask out PHYSTS_CHG interrupt

2016-09-21 Thread Joel Stanley
The PHYSTS_CHG (the ftgmac100's PHY IRQ) is telling the system to go
look at the PHY registers for a link status change.

The interrupt was causing issues on Aspeed SoCs, where some board designs
had an active-high configuration, some active-low, and in some cases the
line was repurposed for other functions. When misconfigured, Linux would
chew 100% of CPU cycles servicing interrupts:

 [   20.28] ftgmac100 1e66.ethernet eth0: [ISR] = 0x200: PHYSTS_CHG
 [   20.28] ftgmac100 1e66.ethernet eth0: [ISR] = 0x200: PHYSTS_CHG
 [   20.28] ftgmac100 1e66.ethernet eth0: [ISR] = 0x200: PHYSTS_CHG
 [   20.30] ftgmac100 1e66.ethernet eth0: [ISR] = 0x200: PHYSTS_CHG

While the ftgmac100 IP can be configured for high, low, or edge
sensitivity, the current driver always polls the PHY, so we chose to mask
out the interrupt.

See https://patchwork.ozlabs.org/patch/672099/ for more discussion.

Signed-off-by: Joel Stanley 
---
v2:
 - Reworked to mask out PHYSTS_CHG instead of trying to determine the IRQ line
   level

 drivers/net/ethernet/faraday/ftgmac100.c | 10 +++---
 drivers/net/ethernet/faraday/ftgmac100.h |  1 +
 2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index e3653b14008a..90f9c5481290 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1075,14 +1075,12 @@ static int ftgmac100_poll(struct napi_struct *napi, int budget)
}
 
if (status & priv->int_mask_all & (FTGMAC100_INT_NO_RXBUF |
-   FTGMAC100_INT_RPKT_LOST | FTGMAC100_INT_AHB_ERR |
-   FTGMAC100_INT_PHYSTS_CHG)) {
+   FTGMAC100_INT_RPKT_LOST | FTGMAC100_INT_AHB_ERR)) {
if (net_ratelimit())
-   netdev_info(netdev, "[ISR] = 0x%x: %s%s%s%s\n", status,
+   netdev_info(netdev, "[ISR] = 0x%x: %s%s%s\n", status,
status & FTGMAC100_INT_NO_RXBUF ? "NO_RXBUF " : "",
status & FTGMAC100_INT_RPKT_LOST ? "RPKT_LOST " : "",
-   status & FTGMAC100_INT_AHB_ERR ? "AHB_ERR " : "",
-   status & FTGMAC100_INT_PHYSTS_CHG ? "PHYSTS_CHG" : "");
+   status & FTGMAC100_INT_AHB_ERR ? "AHB_ERR " : "");
 
if (status & FTGMAC100_INT_NO_RXBUF) {
/* RX buffer unavailable */
@@ -1390,7 +1388,6 @@ static int ftgmac100_probe(struct platform_device *pdev)
  FTGMAC100_INT_XPKT_ETH |
  FTGMAC100_INT_XPKT_LOST |
  FTGMAC100_INT_AHB_ERR |
- FTGMAC100_INT_PHYSTS_CHG |
  FTGMAC100_INT_RPKT_BUF |
  FTGMAC100_INT_NO_RXBUF);
 
@@ -1412,7 +1409,6 @@ static int ftgmac100_probe(struct platform_device *pdev)
 
dev_info(>dev, "Using NCSI interface\n");
priv->use_ncsi = true;
-   priv->int_mask_all &= ~FTGMAC100_INT_PHYSTS_CHG;
priv->ndev = ncsi_register_dev(netdev, ftgmac100_ncsi_handler);
if (!priv->ndev)
goto err_ncsi_dev;
diff --git a/drivers/net/ethernet/faraday/ftgmac100.h b/drivers/net/ethernet/faraday/ftgmac100.h
index 8a377ab1d127..a7ce0ac8858a 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.h
+++ b/drivers/net/ethernet/faraday/ftgmac100.h
@@ -157,6 +157,7 @@
 #define FTGMAC100_MACCR_FULLDUP(1 << 8)
 #define FTGMAC100_MACCR_GIGA_MODE  (1 << 9)
 #define FTGMAC100_MACCR_CRC_APD(1 << 10)
+#define FTGMAC100_MACCR_PHY_LINK_LEVEL (1 << 11)
 #define FTGMAC100_MACCR_RX_RUNT(1 << 12)
 #define FTGMAC100_MACCR_JUMBO_LF   (1 << 13)
 #define FTGMAC100_MACCR_RX_ALL (1 << 14)
-- 
2.9.3



[PATCH net-next v2 1/6] net/faraday: Separate rx page storage from rxdesc

2016-09-21 Thread Joel Stanley
From: Andrew Jeffery 

The ftgmac100 hardware revision in e.g. the Aspeed AST2500 no longer
reserves all bits in RXDES#2 but instead uses the bottom 16 bits to
store MAC frame metadata. Avoid corruption by shifting struct page
pointers out to their own member in struct ftgmac100.

Signed-off-by: Andrew Jeffery 
Signed-off-by: Joel Stanley 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 25 ++---
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index 36361f8bf894..40622567159a 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -60,6 +60,8 @@ struct ftgmac100 {
struct ftgmac100_descs *descs;
dma_addr_t descs_dma_addr;
 
+   struct page *rx_pages[RX_QUEUE_ENTRIES];
+
unsigned int rx_pointer;
unsigned int tx_clean_pointer;
unsigned int tx_pointer;
@@ -341,18 +343,27 @@ static bool ftgmac100_rxdes_ipcs_err(struct ftgmac100_rxdes *rxdes)
return rxdes->rxdes1 & cpu_to_le32(FTGMAC100_RXDES1_IP_CHKSUM_ERR);
 }
 
+static inline struct page **ftgmac100_rxdes_page_slot(struct ftgmac100 *priv,
+ struct ftgmac100_rxdes *rxdes)
+{
+   return &priv->rx_pages[rxdes - priv->descs->rxdes];
+}
+
 /*
  * rxdes2 is not used by hardware. We use it to keep track of page.
  * Since hardware does not touch it, we can skip cpu_to_le32()/le32_to_cpu().
  */
-static void ftgmac100_rxdes_set_page(struct ftgmac100_rxdes *rxdes, struct page *page)
+static void ftgmac100_rxdes_set_page(struct ftgmac100 *priv,
+struct ftgmac100_rxdes *rxdes,
+struct page *page)
 {
-   rxdes->rxdes2 = (unsigned int)page;
+   *ftgmac100_rxdes_page_slot(priv, rxdes) = page;
 }
 
-static struct page *ftgmac100_rxdes_get_page(struct ftgmac100_rxdes *rxdes)
+static struct page *ftgmac100_rxdes_get_page(struct ftgmac100 *priv,
+struct ftgmac100_rxdes *rxdes)
 {
-   return (struct page *)rxdes->rxdes2;
+   return *ftgmac100_rxdes_page_slot(priv, rxdes);
 }
 
 /**
@@ -501,7 +512,7 @@ static bool ftgmac100_rx_packet(struct ftgmac100 *priv, int *processed)
 
do {
dma_addr_t map = ftgmac100_rxdes_get_dma_addr(rxdes);
-   struct page *page = ftgmac100_rxdes_get_page(rxdes);
+   struct page *page = ftgmac100_rxdes_get_page(priv, rxdes);
unsigned int size;
 
dma_unmap_page(priv->dev, map, RX_BUF_SIZE, DMA_FROM_DEVICE);
@@ -779,7 +790,7 @@ static int ftgmac100_alloc_rx_page(struct ftgmac100 *priv,
return -ENOMEM;
}
 
-   ftgmac100_rxdes_set_page(rxdes, page);
+   ftgmac100_rxdes_set_page(priv, rxdes, page);
ftgmac100_rxdes_set_dma_addr(rxdes, map);
ftgmac100_rxdes_set_dma_own(rxdes);
return 0;
@@ -791,7 +802,7 @@ static void ftgmac100_free_buffers(struct ftgmac100 *priv)
 
for (i = 0; i < RX_QUEUE_ENTRIES; i++) {
struct ftgmac100_rxdes *rxdes = &priv->descs->rxdes[i];
-   struct page *page = ftgmac100_rxdes_get_page(rxdes);
+   struct page *page = ftgmac100_rxdes_get_page(priv, rxdes);
dma_addr_t map = ftgmac100_rxdes_get_dma_addr(rxdes);
 
if (!page)
-- 
2.9.3



Re: [PATCH net-next 0/7] ftgmac100 support for ast2500

2016-09-21 Thread Joel Stanley
Please ignore this one.

On Thu, Sep 22, 2016 at 8:33 AM, Joel Stanley  wrote:
> Hello Dave,
>
> This series adds support to the ftgmac100 driver for the Aspeed ast2400 and
> ast2500 SoCs. In particular, they ensure the driver works correctly on the
> ast2500 where the MAC block has seen some changes in register layout.
>
> They have been tested on ast2400 and ast2500 systems with the NCSI stack and
> with a directly attached PHY.
>
> Cheers,
>
> Joel
>
> Andrew Jeffery (2):
>   net/ftgmac100: Separate rx page storage from rxdesc
>   net/ftgmac100: Make EDO{R,T}R bits configurable
>
> Gavin Shan (2):
>   net/faraday: Avoid PHYSTS_CHG interrupt
>   net/faraday: Clear stale interrupts
>
> Joel Stanley (3):
>   net/ftgmac100: Adapt for Aspeed SoCs
>   net/faraday: Fix phy link irq on Aspeed G5 SoCs
>   net/faraday: Configure old MDIO interface on Aspeed SoCs
>
>  drivers/net/ethernet/faraday/ftgmac100.c | 92 
> 
>  drivers/net/ethernet/faraday/ftgmac100.h |  8 ++-
>  2 files changed, 77 insertions(+), 23 deletions(-)
>
> --
> 2.9.3
>


[PATCH net-next v2 4/6] net/faraday: Clear stale interrupts

2016-09-21 Thread Joel Stanley
From: Gavin Shan 

There can be a stale interrupt (PHYSTS_CHG in the ISR, bit#6 of register
0x0) left over from the bootloader (u-boot) when enabling the MAC. Such
stale interrupts don't belong to the kernel and should be cleared.

This clears the stale interrupts in ISR (0x0) when enabling the MAC.

Signed-off-by: Gavin Shan 
Signed-off-by: Joel Stanley 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index 47f512224b57..189373743ddf 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1112,6 +1112,7 @@ static int ftgmac100_poll(struct napi_struct *napi, int budget)
 static int ftgmac100_open(struct net_device *netdev)
 {
struct ftgmac100 *priv = netdev_priv(netdev);
+   unsigned int status;
int err;
 
err = ftgmac100_alloc_buffers(priv);
@@ -1137,6 +1138,11 @@ static int ftgmac100_open(struct net_device *netdev)
 
ftgmac100_init_hw(priv);
ftgmac100_start_hw(priv, priv->use_ncsi ? 100 : 10);
+
+   /* Clear stale interrupts */
+   status = ioread32(priv->base + FTGMAC100_OFFSET_ISR);
+   iowrite32(status, priv->base + FTGMAC100_OFFSET_ISR);
+
if (netdev->phydev)
phy_start(netdev->phydev);
else if (priv->use_ncsi)
-- 
2.9.3



[PATCH net-next v2 5/6] net/faraday: Configure old MDIO interface on Aspeed SoCs

2016-09-21 Thread Joel Stanley
The Aspeed G4 and G5 SoCs offer a new MDIO interface as an option. The
old one is still available, so select it in order to remain
compatible with the ftgmac100 driver.

Signed-off-by: Joel Stanley 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 9 +
 drivers/net/ethernet/faraday/ftgmac100.h | 5 +
 2 files changed, 14 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index 189373743ddf..e3653b14008a 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1252,12 +1252,21 @@ static int ftgmac100_setup_mdio(struct net_device *netdev)
struct ftgmac100 *priv = netdev_priv(netdev);
struct platform_device *pdev = to_platform_device(priv->dev);
int i, err = 0;
+   u32 reg;
 
/* initialize mdio bus */
priv->mii_bus = mdiobus_alloc();
if (!priv->mii_bus)
return -EIO;
 
+   if (of_machine_is_compatible("aspeed,ast2400") ||
+   of_machine_is_compatible("aspeed,ast2500")) {
+   /* This driver supports the old MDIO interface */
+   reg = ioread32(priv->base + FTGMAC100_OFFSET_REVR);
+   reg &= ~FTGMAC100_REVR_NEW_MDIO_INTERFACE;
+   iowrite32(reg, priv->base + FTGMAC100_OFFSET_REVR);
+   };
+
priv->mii_bus->name = "ftgmac100_mdio";
snprintf(priv->mii_bus->id, MII_BUS_ID_SIZE, "%s-%d",
 pdev->name, pdev->id);
diff --git a/drivers/net/ethernet/faraday/ftgmac100.h b/drivers/net/ethernet/faraday/ftgmac100.h
index c258586ce4a4..8a377ab1d127 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.h
+++ b/drivers/net/ethernet/faraday/ftgmac100.h
@@ -134,6 +134,11 @@
 #define FTGMAC100_DMAFIFOS_TXDMA_REQ   (1 << 31)
 
 /*
+ * Feature Register
+ */
+#define FTGMAC100_REVR_NEW_MDIO_INTERFACE  BIT(31)
+
+/*
  * Receive buffer size register
  */
 #define FTGMAC100_RBSR_SIZE(x) ((x) & 0x3fff)
-- 
2.9.3



[PATCH net-next v2 0/6] ftgmac100 support for ast2500

2016-09-21 Thread Joel Stanley
Hello Dave,

This series adds support to the ftgmac100 driver for the Aspeed ast2400 and
ast2500 SoCs. In particular, they ensure the driver works correctly on the
ast2500 where the MAC block has seen some changes in register layout.

They have been tested on ast2400 and ast2500 systems with the NCSI stack and
with a directly attached PHY.

V2 reworks the two patches relating to PHYSTS_CHG into the one patch that
disables the interrupt instead of playing with interrupt sensitivity. I kept
patch 4 'net/faraday: Clear stale interrupts' which was first introduced to
clear the stale PHYSTS_CHG interrupt, as it helps keep us safe from unhygienic
(vendor) bootloaders.

Cheers,

Joel

Andrew Jeffery (2):
  net/faraday: Separate rx page storage from rxdesc
  net/faraday: Make EDO{R,T}R bits configurable

Gavin Shan (1):
  net/faraday: Clear stale interrupts

Joel Stanley (3):
  net/faraday: Adapt for Aspeed SoCs
  net/faraday: Configure old MDIO interface on Aspeed SoCs
  net/faraday: Mask out PHYSTS_CHG interrupt

 drivers/net/ethernet/faraday/ftgmac100.c | 97 +++-
 drivers/net/ethernet/faraday/ftgmac100.h |  8 ++-
 2 files changed, 75 insertions(+), 30 deletions(-)

-- 
2.9.3



[PATCH net-next v2 2/6] net/faraday: Make EDO{R,T}R bits configurable

2016-09-21 Thread Joel Stanley
From: Andrew Jeffery 

These bits are #defined at a fixed location. In order to support future
hardware that moves these bits around, store the bit masks in members
of struct ftgmac100.

Signed-off-by: Andrew Jeffery 
Signed-off-by: Joel Stanley 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 40 +---
 drivers/net/ethernet/faraday/ftgmac100.h |  2 --
 2 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index 40622567159a..62a88d1a1f99 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -79,6 +79,9 @@ struct ftgmac100 {
int int_mask_all;
bool use_ncsi;
bool enabled;
+
+   u32 rxdes0_edorr_mask;
+   u32 txdes0_edotr_mask;
 };
 
 static int ftgmac100_alloc_rx_page(struct ftgmac100 *priv,
@@ -259,10 +262,11 @@ static bool ftgmac100_rxdes_packet_ready(struct ftgmac100_rxdes *rxdes)
return rxdes->rxdes0 & cpu_to_le32(FTGMAC100_RXDES0_RXPKT_RDY);
 }
 
-static void ftgmac100_rxdes_set_dma_own(struct ftgmac100_rxdes *rxdes)
+static void ftgmac100_rxdes_set_dma_own(const struct ftgmac100 *priv,
+   struct ftgmac100_rxdes *rxdes)
 {
/* clear status bits */
-   rxdes->rxdes0 &= cpu_to_le32(FTGMAC100_RXDES0_EDORR);
+   rxdes->rxdes0 &= cpu_to_le32(priv->rxdes0_edorr_mask);
 }
 
 static bool ftgmac100_rxdes_rx_error(struct ftgmac100_rxdes *rxdes)
@@ -300,9 +304,10 @@ static bool ftgmac100_rxdes_multicast(struct ftgmac100_rxdes *rxdes)
return rxdes->rxdes0 & cpu_to_le32(FTGMAC100_RXDES0_MULTICAST);
 }
 
-static void ftgmac100_rxdes_set_end_of_ring(struct ftgmac100_rxdes *rxdes)
+static void ftgmac100_rxdes_set_end_of_ring(const struct ftgmac100 *priv,
+   struct ftgmac100_rxdes *rxdes)
 {
-   rxdes->rxdes0 |= cpu_to_le32(FTGMAC100_RXDES0_EDORR);
+   rxdes->rxdes0 |= cpu_to_le32(priv->rxdes0_edorr_mask);
 }
 
 static void ftgmac100_rxdes_set_dma_addr(struct ftgmac100_rxdes *rxdes,
@@ -393,7 +398,7 @@ ftgmac100_rx_locate_first_segment(struct ftgmac100 *priv)
if (ftgmac100_rxdes_first_segment(rxdes))
return rxdes;
 
-   ftgmac100_rxdes_set_dma_own(rxdes);
+   ftgmac100_rxdes_set_dma_own(priv, rxdes);
ftgmac100_rx_pointer_advance(priv);
rxdes = ftgmac100_current_rxdes(priv);
}
@@ -464,7 +469,7 @@ static void ftgmac100_rx_drop_packet(struct ftgmac100 *priv)
if (ftgmac100_rxdes_last_segment(rxdes))
done = true;
 
-   ftgmac100_rxdes_set_dma_own(rxdes);
+   ftgmac100_rxdes_set_dma_own(priv, rxdes);
ftgmac100_rx_pointer_advance(priv);
rxdes = ftgmac100_current_rxdes(priv);
} while (!done && ftgmac100_rxdes_packet_ready(rxdes));
@@ -556,10 +561,11 @@ static bool ftgmac100_rx_packet(struct ftgmac100 *priv, int *processed)
 /**
  * internal functions (transmit descriptor)
  */
-static void ftgmac100_txdes_reset(struct ftgmac100_txdes *txdes)
+static void ftgmac100_txdes_reset(const struct ftgmac100 *priv,
+ struct ftgmac100_txdes *txdes)
 {
/* clear all except end of ring bit */
-   txdes->txdes0 &= cpu_to_le32(FTGMAC100_TXDES0_EDOTR);
+   txdes->txdes0 &= cpu_to_le32(priv->txdes0_edotr_mask);
txdes->txdes1 = 0;
txdes->txdes2 = 0;
txdes->txdes3 = 0;
@@ -580,9 +586,10 @@ static void ftgmac100_txdes_set_dma_own(struct ftgmac100_txdes *txdes)
txdes->txdes0 |= cpu_to_le32(FTGMAC100_TXDES0_TXDMA_OWN);
 }
 
-static void ftgmac100_txdes_set_end_of_ring(struct ftgmac100_txdes *txdes)
+static void ftgmac100_txdes_set_end_of_ring(const struct ftgmac100 *priv,
+   struct ftgmac100_txdes *txdes)
 {
-   txdes->txdes0 |= cpu_to_le32(FTGMAC100_TXDES0_EDOTR);
+   txdes->txdes0 |= cpu_to_le32(priv->txdes0_edotr_mask);
 }
 
 static void ftgmac100_txdes_set_first_segment(struct ftgmac100_txdes *txdes)
@@ -701,7 +708,7 @@ static bool ftgmac100_tx_complete_packet(struct ftgmac100 *priv)
 
dev_kfree_skb(skb);
 
-   ftgmac100_txdes_reset(txdes);
+   ftgmac100_txdes_reset(priv, txdes);
 
ftgmac100_tx_clean_pointer_advance(priv);
 
@@ -792,7 +799,7 @@ static int ftgmac100_alloc_rx_page(struct ftgmac100 *priv,
 
ftgmac100_rxdes_set_page(priv, rxdes, page);
ftgmac100_rxdes_set_dma_addr(rxdes, map);
-   ftgmac100_rxdes_set_dma_own(rxdes);
+   ftgmac100_rxdes_set_dma_own(priv, rxdes);
return 0;
 }
 
@@ -839,7 +846,8 @@ static int 

[PATCH net-next 0/7] ftgmac100 support for ast2500

2016-09-21 Thread Joel Stanley
Hello Dave,

This series adds support to the ftgmac100 driver for the Aspeed ast2400 and
ast2500 SoCs. In particular, they ensure the driver works correctly on the
ast2500 where the MAC block has seen some changes in register layout.

They have been tested on ast2400 and ast2500 systems with the NCSI stack and
with a directly attached PHY.

Cheers,

Joel

Andrew Jeffery (2):
  net/ftgmac100: Separate rx page storage from rxdesc
  net/ftgmac100: Make EDO{R,T}R bits configurable

Gavin Shan (2):
  net/faraday: Avoid PHYSTS_CHG interrupt
  net/faraday: Clear stale interrupts

Joel Stanley (3):
  net/ftgmac100: Adapt for Aspeed SoCs
  net/faraday: Fix phy link irq on Aspeed G5 SoCs
  net/faraday: Configure old MDIO interface on Aspeed SoCs

 drivers/net/ethernet/faraday/ftgmac100.c | 92 
 drivers/net/ethernet/faraday/ftgmac100.h |  8 ++-
 2 files changed, 77 insertions(+), 23 deletions(-)

-- 
2.9.3



Re: [net-next 01/15] i40e: Introduce VF port representor/control netdevs

2016-09-21 Thread Jeff Kirsher
On Wed, 2016-09-21 at 22:21 +0300, Or Gerlitz wrote:
> On Wed, Sep 21, 2016 at 7:59 PM, Samudrala, Sridhar
>  wrote:
> > On 9/21/2016 12:04 AM, Or Gerlitz wrote:
> 
> 
> >> so what happens after this patchset is applied and before the future
> work is
> >> submitted? RX/TX slow path through the VFPRs isn't supported and what
> >> about fast path? in other words what happens when someone
> >> loads the driver, sets SRIOV (--> the driver set itself to switchdev
> mode
> >> and VFPRs are created) and then a VF sends a packet? do you still put
> >> into the HW the legacy DMAC based switching rules? I am not
> following...
> 
> > The VF driver requests adding the dmac based filter rules via mailbox
> > messages to PF and that is not changed in this patchset.
> > Once we have VFPR TX/RX support, we will not allow the VF driver to add
> > these rules, Instead a host based
> > program will be able to add these rules to enable the fast path.
> 
> I see, this means that when this patch set is applied your driver
> reports through devlink that they are in switchdev mode, but the
> operational state of the VFs and VFPRs isn't such - as the VFs dictate
> the steering and the VFPRs don't support slow path TX/RX --- in an
> earlier comment you made on this thread you said that you will be
> submitting RX/TX support in the next patchset. Maybe it would be best
> if you can take the VFPRs patches out of this series and roll a follow
> up series with all what's needed? unless you need more time and gonna
> miss 4.9 as of that... if the patches are ready, I say lets have them
> all in one series, if not, I wonder what other people think on the
> matter. I am basically half+ good to have also the half baked code
> base merged
> 
> Anyway, there's no point to report through ethtool something (VF vport
> HW stats) you can report in the standard and convenient manner, so
> this one please do address regardless of the prev comment.

I will drop Sridhar's changes from this series for now, so that he can do
the re-work AND provide the additional patches he referred to earlier at a
later date.



[PATCH] L2TP:Adjust intf MTU, add underlay L3, overlay L2

2016-09-21 Thread R. Parameswaran


Take into account all of the tunnel encapsulation headers when setting
up the MTU on the L2TP logical interface device. Otherwise, packets
created by the applications on top of the L2TP layer are larger
than they ought to be, relative to the underlay MTU, leading to
needless fragmentation once the outer IP encap is added.

Specifically, take into account the (outer, underlay) IP header
imposed on the encapsulated L2TP packet, and the Layer 2 header
imposed on the inner IP packet prior to L2TP encapsulation.

Do not assume an Ethernet (non-jumbo) underlay. Use the PMTU mechanism
and the dst entry in the L2TP tunnel socket to directly pull up
the underlay MTU (as the baseline number on top of which the
encapsulation headers are factored in).  Fall back to Ethernet MTU
if this fails.
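
A worked example of the overhead arithmetic, assuming an Ethernet
underlay with a 1500-byte PMTU and IPv4/UDP encapsulation (hdr_len is
the session's L2TP header length and varies with configuration):

    unsigned int overhead = ETH_HLEN               /* 14: overlay L2 hdr */
                          + sizeof(struct iphdr)   /* 20: outer IPv4 hdr */
                          + sizeof(struct udphdr)  /*  8: UDP encap hdr  */
                          + session->hdr_len;
    unsigned int mtu = 1500 - overhead;            /* 1458 - hdr_len */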

Signed-off-by: Ramkumar Parameswaran 

Reviewed-by: N. Prachanda
Reviewed-by: R. Shearman
Reviewed-by: D. Fawcus


---
 net/l2tp/l2tp_eth.c | 48 
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index 57fc5a4..dbcd6bd 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -30,6 +30,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 

 #include "l2tp_core.h"

@@ -206,6 +209,46 @@ static void l2tp_eth_show(struct seq_file *m, void *arg)

 }
 #endif

+static void l2tp_eth_adjust_mtu(struct l2tp_tunnel *tunnel,
+   struct l2tp_session *session,
+   struct net_device *dev)
+{
+   unsigned int overhead = 0;
+   struct dst_entry *dst;
+
+   if (session->mtu != 0) {
+   dev->mtu = session->mtu;
+   dev->needed_headroom += session->hdr_len;
+   if (tunnel->encap == L2TP_ENCAPTYPE_UDP)
+   dev->needed_headroom += sizeof(struct udphdr);
+   return;
+   }
+   overhead = session->hdr_len;
+   /* Adjust MTU, factor overhead - underlay L3 hdr, overlay L2 hdr*/
+   if (tunnel->sock->sk_family == AF_INET)
+   overhead += (ETH_HLEN + sizeof(struct iphdr));
+   else if (tunnel->sock->sk_family == AF_INET6)
+   overhead += (ETH_HLEN + sizeof(struct ipv6hdr));
+   /* Additionally, if the encap is UDP, account for UDP header size */
+   if (tunnel->encap == L2TP_ENCAPTYPE_UDP)
+   overhead += sizeof(struct udphdr);
+   /* If PMTU discovery was enabled, use discovered MTU on L2TP device */
+   dst = sk_dst_get(tunnel->sock);
+   if (dst) {
+   u32 pmtu = dst_mtu(dst);
+
+   if (pmtu != 0)
+   dev->mtu = pmtu;
+   dst_release(dst);
+   }
+   /* else (no PMTUD) L2TP dev MTU defaulted to Ethernet MTU in caller */
+   session->mtu = dev->mtu - overhead;
+   dev->mtu = session->mtu;
+   dev->needed_headroom += session->hdr_len;
+   if (tunnel->encap == L2TP_ENCAPTYPE_UDP)
+   dev->needed_headroom += sizeof(struct udphdr);
+}
+
static int l2tp_eth_create(struct net *net, u32 tunnel_id, u32 session_id, u32 peer_session_id, struct l2tp_session_cfg *cfg)

 {
struct net_device *dev;
@@ -255,11 +298,8 @@ static int l2tp_eth_create(struct net *net, u32 tunnel_id, u32 session_id, u32 p

}

dev_net_set(dev, net);
-   if (session->mtu == 0)
-   session->mtu = dev->mtu - session->hdr_len;
-   dev->mtu = session->mtu;
-   dev->needed_headroom += session->hdr_len;

+   l2tp_eth_adjust_mtu(tunnel, session, dev);
priv = netdev_priv(dev);
priv->dev = dev;
priv->session = session;
--
2.1.4


Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-21 Thread Jesper Dangaard Brouer

On Wed, 21 Sep 2016 08:08:34 -0700 Tom Herbert  wrote:

> On Wed, Sep 21, 2016 at 7:48 AM, Thomas Graf  wrote:
> > On 09/21/16 at 07:19am, Tom Herbert wrote:  
> >> certain design that because of constraints on one kernel interface. As
> >> a kernel developer I want flexibility on how we design and implement
> >> things!  
> >
> > Perfectly valid argument. I reviewed your ILA changes and did not
> > object to them.
> >
> >  
> >> I think there are two questions that this patch set poses for the
> >> community wrt XDP:
> >>
> >> #1: Should we allow alternate code to run in XDP other than BPF?
> >> #2: If #1 is true what is the best way to implement that?
> >>
> >> If the answer to #1 is "no" then the answer to #2 is irrelevant. So
> >> with this RFC I'm hoping we can come the agreement on questions #1.  

I vote yes to #1.

> > I'm not opposed to running non-BPF code at XDP. I'm against adding
> > a linked list of hook consumers.

I also worry about the performance impact of a linked list.  We should
simple benchmark it instead of discussing it! ;-)


> > Would anyone require to run XDP-BPF in combination ILA? Or XDP-BPF
> > in combination with a potential XDP-nftables? We don't know yet I
> > guess.
> >  
> Right. Admittedly, I feel like we owe a bit of reciprocity to
> nftables. For ILA we are using the NF_INET_PRE_ROUTING hook with our
> own code (looks like ipvlan set nfhooks as well). This works really
> well and saves the value of early demux in ILA. Had we not had the
> ability to use nfhooks in this fashion it's likely we would have had
> to create another hook (we did try putting translation in nftables
> rules but that was too inefficient for ILA).

Thinking about it, I actually think Tom is proposing a very valid user
of the XDP hook, which is the kernel itself.  And Tom even has a real
first user, ILA.  The way I read the ILA-RFC-draft[1], the XDP hook
would benefit the NVE (Network Virtualization Edge) component, which
can run separately or run on the Tenant System, where the latter case
could use XDP_PASS.

[1] https://www.ietf.org/archive/id/draft-herbert-nvo3-ila-02.txt
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC v2 07/12] qedr: Add support for memory registration verbs

2016-09-21 Thread Sagi Grimberg



+static int qedr_set_page(struct ib_mr *ibmr, u64 addr)
+{
+   struct qedr_mr *mr = get_qedr_mr(ibmr);
+   struct qedr_pbl *pbl_table;
+   struct regpair *pbe;
+   u32 pbes_in_page;
+
+   if (unlikely(mr->npages == mr->info.pbl_info.num_pbes)) {
+   DP_ERR(mr->dev, "qedr_set_page failes when %d\n", mr->npages);
+   return -ENOMEM;
+   }
+
+   DP_VERBOSE(mr->dev, QEDR_MSG_MR, "qedr_set_page pages[%d] = 0x%llx\n",
+  mr->npages, addr);
+
+   pbes_in_page = mr->info.pbl_info.pbl_size / sizeof(u64);
+   pbl_table = mr->info.pbl_table + (mr->npages / pbes_in_page);
+   pbe = (struct regpair *)pbl_table->va;
+   pbe +=  mr->npages % pbes_in_page;
+   pbe->lo = cpu_to_le32((u32)addr);
+   pbe->hi = cpu_to_le32((u32)upper_32_bits(addr));
+
+   mr->npages++;
+
+   return 0;
+}


Looks better.

Reviewed-by: Sagi Grimberg 
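
For readers skimming: the div/mod pair in qedr_set_page() is a
two-level lookup; in generic form (hypothetical names, same arithmetic
as the code quoted above):

    unsigned int per_tbl = pbl_size / sizeof(u64); /* PBEs per PBL page */
    struct qedr_pbl *tbl = pbl_base + n / per_tbl; /* which PBL page    */
    struct regpair *pbe  = (struct regpair *)tbl->va + n % per_tbl;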


Re: [net-next 01/15] i40e: Introduce VF port representor/control netdevs

2016-09-21 Thread Or Gerlitz
On Wed, Sep 21, 2016 at 7:59 PM, Samudrala, Sridhar
 wrote:
> On 9/21/2016 12:04 AM, Or Gerlitz wrote:


>> so what happens after this patchset is applied and before the future work is
>> submitted? RX/TX slow path through the VFPRs isn't supported and what
>> about fast path? in other words what happens when someone
>> loads the driver, sets SRIOV (--> the driver set itself to switchdev mode
>> and VFPRs are created) and then a VF sends a packet? do you still put
>> into the HW the legacy DMAC based switching rules? I am not following...

> The VF driver requests adding the dmac based filter rules via mailbox
> messages to PF and that is not changed in this patchset.
> Once we have VFPR TX/RX support, we will not allow the VF driver to add
> these rules, Instead a host based
> program will be able to add these rules to enable the fast path.

I see, this means that when this patch set is applied your driver
reports through devlink that they are in switchdev mode, but the
operational state of the VFs and VFPRs isn't such - as the VFs dictate
the steering and the VFPRs don't support slow path TX/RX --- in an
earlier comment you made on this thread you said that you will be
submitting RX/TX support in the next patchset. Maybe it would be best
if you can take the VFPRs patches out of this series and roll a follow
up series with all what's needed? unless you need more time and gonna
miss 4.9 as of that... if the patches are ready, I say lets have them
all in one series, if not, I wonder what other people think on the
matter. I am basically half+ good to have also the half baked code
base merged

Anyway, there's no point to report through ethtool something (VF vport
HW stats) you can report in the standard and convenient manner, so
this one please do address regardless of the prev comment.

Or.


Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-21 Thread Thomas Graf
On 09/21/16 at 11:50am, Tom Herbert wrote:
> 50 lines in one driver is not a big deal, 50 lines in a hundred
> drivers is! I learned this lesson in BQL which was well abstracted out
> to be minimally invasive but we still saw many issues because of the
> peculiarities of different drivers.

You want to enable XDP in a hundred drivers? Are you planning to
deploy ISA NIC based ILA routers? ;-)


Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-21 Thread Jakub Kicinski
On Wed, 21 Sep 2016 11:50:06 -0700, Tom Herbert wrote:
> On Wed, Sep 21, 2016 at 11:45 AM, Jakub Kicinski  wrote:
> > On Wed, 21 Sep 2016 10:39:40 -0700, Tom Herbert wrote:  
> >> On Wed, Sep 21, 2016 at 10:26 AM, Jakub Kicinski  wrote:  
> >> > On Tue, 20 Sep 2016 17:01:39 -0700, Alexei Starovoitov wrote:  
> >> >>  >  - Reduces the amount of code and complexity needed in drivers to
> >> >>  >manage XDP  
> >> >>
> >> >> hmm:
> >> >> 534 insertions(+), 144 deletions(-)
> >> >> looks like increase in complexity instead.  
> >> >
> >> > and more to come to tie this with HW offloads.  
> >>
> >> The amount of driver code did decrease with these patches:
> >>
> >> drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 64 
> >> --
> >> drivers/net/ethernet/mellanox/mlx4/en_rx.c | 25 --
> >> drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 -
> >>
> >> Minimizing complexity being added to drivers for XDP is critical since
> >> we basically asking every driver to replicate the function. This
> >> property also should also apply to HW offloads, the more complexity we
> >> can abstract out drivers into a common backend infrastructure the
> >> better for supporting across different drivers.  
> >
> > I'm in the middle of writing/testing XDP support for the Netronome's
> > driver and generic infra is very much appreciated ;)  In my experience
> > the 50 lines of code which are required for assigning the programs and
> > freeing them aren't really a big deal, though.
> >  
> 
> 50 lines in one driver is not a big deal, 50 lines in a hundred
> drivers is! I learned this lesson in BQL which was well abstracted out
> to be minimally invasive but we still saw many issues because of the
> pecularities of different drivers.

Agreed, I just meant to say that splitting rings and rewriting the RX
path to behave differently for the XDP vs non-XDP case is way more
brain-consuming than a bit of boilerplate code, so if anyone could solve those
two it would be much appreciated :)  My main point was what I wrote
below, though.

> > Let's also separate putting xdp_prog in netdevice/napi_struct from the
> > generic hook infra.  All the simplifications to the driver AFAICS come
> > from the former.  If everyone is fine with growing napi_struct we can do
> > that but IMHO this is not an argument for the generic infra :)  



Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-21 Thread Tom Herbert
On Wed, Sep 21, 2016 at 11:45 AM, Jakub Kicinski  wrote:
> On Wed, 21 Sep 2016 10:39:40 -0700, Tom Herbert wrote:
>> On Wed, Sep 21, 2016 at 10:26 AM, Jakub Kicinski  wrote:
>> > On Tue, 20 Sep 2016 17:01:39 -0700, Alexei Starovoitov wrote:
>> >>  >  - Reduces the amount of code and complexity needed in drivers to
>> >>  >manage XDP
>> >>
>> >> hmm:
>> >> 534 insertions(+), 144 deletions(-)
>> >> looks like increase in complexity instead.
>> >
>> > and more to come to tie this with HW offloads.
>>
>> The amount of driver code did decrease with these patches:
>>
>> drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 64 --
>> drivers/net/ethernet/mellanox/mlx4/en_rx.c | 25 --
>> drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 -
>>
>> Minimizing complexity being added to drivers for XDP is critical since
>> we are basically asking every driver to replicate the function. This
>> property should also apply to HW offloads, the more complexity we
>> can abstract out of drivers into a common backend infrastructure the
>> better for supporting across different drivers.
>
> I'm in the middle of writing/testing XDP support for the Netronome's
> driver and generic infra is very much appreciated ;)  In my experience
> the 50 lines of code which are required for assigning the programs and
> freeing them aren't really a big deal, though.
>

50 lines in one driver is not a big deal, 50 lines in a hundred
drivers is! I learned this lesson in BQL which was well abstracted out
to be minimally invasive but we still saw many issues because of the
peculiarities of different drivers.

> Let's also separate putting xdp_prog in netdevice/napi_struct from the
> generic hook infra.  All the simplifications to the driver AFAICS come
> from the former.  If everyone is fine with growing napi_struct we can do
> that but IMHO this is not an argument for the generic infra :)


Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-09-21 Thread Thomas Graf
On 09/21/16 at 05:45pm, Pablo Neira Ayuso wrote:
> On Tue, Sep 20, 2016 at 06:43:35PM +0200, Daniel Mack wrote:
> > The point is that from an application's perspective, restricting the
> > ability to bind a port and dropping packets that are being sent is a
> > very different thing. Applications will start to behave differently if
> > they can't bind to a port, and that's something we do not want to happen.
> 
> What is exactly the problem? Applications are not checking for return
> value from bind? They should be fixed. If you want to collect
> statistics, I see no reason why you couldn't collect them for every
> EACCESS on each bind() call.

It's not about applications not checking the return value of bind().
Unfortunately, many applications (or the respective libraries they use)
retry on connect() failure but handle bind() errors as a hard failure
and exit. Yes, it's an application or library bug, but these
applications have very specific expectations of how something fails.
Sometimes even going from drop to RST will break applications.

Paranoia speaking: by returning errors where no error was returned
before, undefined behaviour occurs. In Murphy speak: things break.

This is given and we can't fix it from the kernel side. Returning at
system call level has many benefits but it's not always an option.

Adding the late hook does not prevent filtering at the socket layer from
also being added. I think we need both.
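
A minimal user-space sketch of the application pattern described above
(purely illustrative; not taken from any real library) -- bind() errors
are treated as fatal while connect() errors are retried, which is why a
new error from bind() breaks things that a silent egress drop does not:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(8080),
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	struct sockaddr_in dst = { .sin_family = AF_INET,
				   .sin_port = htons(80) };

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("bind");	/* hard failure: give up immediately */
		exit(1);
	}

	inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);
	while (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
		sleep(1);	/* transient failure: keep retrying */

	close(fd);
	return 0;
}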


Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-21 Thread Jakub Kicinski
On Wed, 21 Sep 2016 10:39:40 -0700, Tom Herbert wrote:
> On Wed, Sep 21, 2016 at 10:26 AM, Jakub Kicinski  wrote:
> > On Tue, 20 Sep 2016 17:01:39 -0700, Alexei Starovoitov wrote:  
> >>  >  - Reduces the amount of code and complexity needed in drivers to
> >>  >manage XDP  
> >>
> >> hmm:
> >> 534 insertions(+), 144 deletions(-)
> >> looks like increase in complexity instead.  
> >
> > and more to come to tie this with HW offloads.  
> 
> The amount of driver code did decrease with these patches:
> 
> drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 64 --
> drivers/net/ethernet/mellanox/mlx4/en_rx.c | 25 --
> drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 -
> 
> Minimizing complexity being added to drivers for XDP is critical since
> we are basically asking every driver to replicate the function. This
> property should also apply to HW offloads, the more complexity we
> can abstract out of drivers into a common backend infrastructure the
> better for supporting across different drivers.

I'm in the middle of writing/testing XDP support for the Netronome's
driver and generic infra is very much appreciated ;)  In my experience
the 50 lines of code which are required for assigning the programs and
freeing them aren't really a big deal, though.

Let's also separate putting xdp_prog in netdevice/napi_struct from the
generic hook infra.  All the simplifications to the driver AFAICS come
from the former.  If everyone is fine with growing napi_struct we can do
that but IMHO this is not an argument for the generic infra :)


[PATCH v2] tcp: fix wrong checksum calculation on MTU probing

2016-09-21 Thread Douglas Caetano dos Santos
With TCP MTU probing enabled and offload TX checksumming disabled,
tcp_mtu_probe() calculated the wrong checksum when a fragment being copied
into the probe's SKB had an odd length. This was caused by the direct use
of skb_copy_and_csum_bits() to calculate the checksum, as it pads the
fragment being copied, if needed. When this fragment was not the last, a
subsequent call used the previous checksum without considering this
padding.

The effect was a stale connection in one way, as even retransmissions
wouldn't solve the problem, because the checksum was never recalculated for
the full SKB length.

Signed-off-by: Douglas Caetano dos Santos 
---
 net/ipv4/tcp_output.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f53d0cc..767135e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1968,10 +1968,12 @@ static int tcp_mtu_probe(struct sock *sk)
copy = min_t(int, skb->len, probe_size - len);
if (nskb->ip_summed)
skb_copy_bits(skb, 0, skb_put(nskb, copy), copy);
-   else
-   nskb->csum = skb_copy_and_csum_bits(skb, 0,
-   skb_put(nskb, copy),
-   copy, nskb->csum);
+   else {
+   __wsum csum = skb_copy_and_csum_bits(skb, 0,
+skb_put(nskb, copy),
+copy, 0);
+   nskb->csum = csum_block_add(nskb->csum, csum, len);
+   }

if (skb->len <= copy) {
/* We've eaten all the data from this skb.
-- 
2.5.0
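
The failure mode can be reproduced in user space with a simple
big-endian one's-complement sum standing in for the kernel csum
helpers; a sketch under that assumption (not the kernel API):

#include <stdint.h>
#include <stdio.h>

/* fold a 32-bit accumulator down to 16 bits */
static uint32_t fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return sum;
}

/* naive big-endian one's complement sum; an odd-length buffer is
 * implicitly padded with a zero byte, as skb_copy_and_csum_bits()
 * does for an odd-length fragment */
static uint32_t csum(const uint8_t *p, int len, uint32_t sum)
{
	int i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)p[i] << 8 | p[i + 1];
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;
	return sum;
}

/* combine a block's checksum at a given offset, swapping the bytes
 * when the offset is odd -- the csum_block_add() idea */
static uint32_t block_add(uint32_t sum, uint32_t part, int off)
{
	part = fold(part);
	if (off & 1)
		part = ((part & 0xff) << 8) | (part >> 8);
	return fold(sum + part);
}

int main(void)
{
	uint8_t frag1[3] = { 0x11, 0x22, 0x33 };	/* odd length */
	uint8_t frag2[2] = { 0x44, 0x55 };
	uint8_t whole[5] = { 0x11, 0x22, 0x33, 0x44, 0x55 };

	uint32_t good  = fold(csum(whole, 5, 0));
	/* buggy: reuse frag1's padded sum as the seed for frag2 */
	uint32_t buggy = fold(csum(frag2, 2, csum(frag1, 3, 0)));
	/* fixed: checksum each fragment alone, combine at its offset */
	uint32_t fixed = block_add(fold(csum(frag1, 3, 0)),
				   csum(frag2, 2, 0), 3);

	/* prints whole=9966 buggy=8877 fixed=9966 */
	printf("whole=%04x buggy=%04x fixed=%04x\n", good, buggy, fixed);
	return 0;
}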



Re: [RFC PATCH v3 2/7] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-09-21 Thread Marcelo
On Thu, Sep 22, 2016 at 12:18:46AM +0800, hejianet wrote:
> Hi Marcelo
> 
> sorry for the late, just came back from a vacation.

Hi, no problem. Hope your batteries are recharged now :-)

> 
> On 9/14/16 7:55 PM, Marcelo wrote:
> > Hi Jia,
> > 
> > On Wed, Sep 14, 2016 at 01:58:42PM +0800, hejianet wrote:
> > > Hi Marcelo
> > > 
> > > 
> > > On 9/13/16 2:57 AM, Marcelo wrote:
> > > > On Fri, Sep 09, 2016 at 02:33:57PM +0800, Jia He wrote:
> > > > > This is to use the generic interface snmp_get_cpu_field{,64}_batch to
> > > > > aggregate the data by going through all the items of each cpu 
> > > > > sequentially.
> > > > > Then snmp_seq_show and netstat_seq_show are split into 2 parts to 
> > > > > avoid build
> > > > > warning "the frame size" larger than 1024 on s390.
> > > > Yeah about that, did you test it with stack overflow detection?
> > > > These arrays can be quite large.
> > > > 
> > > > One more below..
> > > Do you think it is acceptable if the stack usage is a little larger than 
> > > 1024?
> > > e.g. 1120
> > > I can't find any other way to reduce the stack usage except use "static" 
> > > before
> > > unsigned long buff[TCP_MIB_MAX]
> > > 
> > > PS. sizeof buff is about TCP_MIB_MAX(116)*8=928
> > > B.R.
> > That's pretty much the question. Linux has the option on some archs to
> > run with 4Kb (4KSTACKS option), so this function alone would be using
> > 25% of it in this last case. While on x86_64, it uses 16Kb (6538b8ea886e
> > ("x86_64: expand kernel stack to 16K")).
> > 
> > Adding static to it is not an option as it actually makes the variable
> > shared amongst the CPUs (and then you have concurrency issues), plus the
> > fact that it's always allocated, even while not in use.
> > 
> > Others here certainly know better than me if it's okay to make such
> > usage of the stack.
> What about this patch instead?
> It is a trade-off: I split the aggregation process into 2 parts, which will
> increase the cache misses a little bit but reduces the stack usage.
> After this, stack usage is 672 bytes:
> objdump -d vmlinux | ./scripts/checkstack.pl ppc64 | grep seq_show
> 0xc07f7cc0 netstat_seq_show_tcpext.isra.3 [vmlinux]:672
> 
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index c6ee8a2..cc41590 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -486,22 +486,37 @@ static const struct file_operations snmp_seq_fops = {
>   */
>  static int netstat_seq_show_tcpext(struct seq_file *seq, void *v)
>  {
> -   int i;
> -   unsigned long buff[LINUX_MIB_MAX];
> +   int i, c;
> +   unsigned long buff[LINUX_MIB_MAX/2 + 1];
> struct net *net = seq->private;
> 
> -   memset(buff, 0, sizeof(unsigned long) * LINUX_MIB_MAX);
> +   memset(buff, 0, sizeof(unsigned long) * (LINUX_MIB_MAX/2 + 1));
> 
> seq_puts(seq, "TcpExt:");
> for (i = 0; snmp4_net_list[i].name; i++)
> seq_printf(seq, " %s", snmp4_net_list[i].name);
> 
> seq_puts(seq, "\nTcpExt:");
> -   snmp_get_cpu_field_batch(buff, snmp4_net_list,
> -net->mib.net_statistics);
> -   for (i = 0; snmp4_net_list[i].name; i++)
> +   for_each_possible_cpu(c) {
> +   for (i = 0; i < LINUX_MIB_MAX/2; i++)
> +   buff[i] += snmp_get_cpu_field(
> + net->mib.net_statistics,
> +   c, snmp4_net_list[i].entry);
> +   }
> +   for (i = 0; i < LINUX_MIB_MAX/2; i++)
> seq_printf(seq, " %lu", buff[i]);
> 
> +   memset(buff, 0, sizeof(unsigned long) * (LINUX_MIB_MAX/2 + 1));
> +   for_each_possible_cpu(c) {
> +   for (i = LINUX_MIB_MAX/2; snmp4_net_list[i].name; i++)
> +   buff[i - LINUX_MIB_MAX/2] += snmp_get_cpu_field(
> +   net->mib.net_statistics,
> +   c,
> +   snmp4_net_list[i].entry);
> +   }
> +   for (i = LINUX_MIB_MAX/2; snmp4_net_list[i].name; i++)
> +           seq_printf(seq, " %lu", buff[i - LINUX_MIB_MAX/2]);
> +
> return 0;
>  }

Yep, it halves the stack usage, but it doesn't look good heh

But well, you may try to post the patchset (with or without this last
change, you pick) officially and see how it goes. As you're posting as
RFC, it's not being evaluated as seriously.

FWIW, I tested your patches, using your test and /proc/net/snmp file on
a x86_64 box, Intel(R) Xeon(R) CPU E5-2643 v3.

Before the patches:

 Performance counter stats for './test /proc/net/snmp':

          5.225  cache-misses
 12.708.673.785  L1-dcache-loads
  1.288.450.174  L1-dcache-load-misses    #   10,14% of all L1-dcache hits
  1.271.857.028  LLC-loads
          4.122  LLC-load-misses          #    0,00% of all LL-cache hits

Re: [ANNOUNCE] ndiv: line-rate network traffic processing

2016-09-21 Thread Willy Tarreau
Hi Tom,

On Wed, Sep 21, 2016 at 10:16:45AM -0700, Tom Herbert wrote:
> This does seem interesting and indeed the driver datapath looks very
> much like XDP. It would be quite interesting if you could rebase and
> then maybe look at how this can work with XDP that would be helpful.

OK I'll assign some time to rebase it then.

> The return actions are identical,

I'm not surprised that the same needs lead to the same designs when
these designs are constrained by CPU cycle count :-)

> but processing descriptor meta data
> (like checksum, vlan) is not yet implemented in XDP-- maybe this is
> something we can leverage from ndiv?

Yes, possibly. It's not a big job, but it's absolutely mandatory if you
don't want to waste some smart devices' valuable performance improvements.
We changed our API when porting it to ixgbe to support what this NIC (and
many other ones) supports, so that the application code doesn't have to
deal with checksums etc. By the way, VLAN is not yet implemented in the
mvneta driver. But this choice ensures that no application has to deal
with it, nor can any application introduce bugs there.

Cheers,
Willy


Re: [ANNOUNCE] ndiv: line-rate network traffic processing

2016-09-21 Thread Willy Tarreau
Hi Jesper!

On Wed, Sep 21, 2016 at 06:26:39PM +0200, Jesper Dangaard Brouer wrote:
> I definitely want to study it!

Great, at least I've not put this online for nothing :-)

> You mention XDP.  If you didn't notice, I've created some documentation
> on XDP (it is very "live" documentation at this point and it will
> hopefully "materialize" later in the process).  But it should be a good
> starting point for understanding XDP:
> 
>  https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html

Thanks, I'll read it. We'll need to educate ourselves to see how to port
our anti-ddos to XDP in the future I guess, so better ensure the design
is fit from the beginning!

> > I presented it in 2014 at kernel recipes :
> >   
> > http://kernel-recipes.org/en/2014/ndiv-a-low-overhead-network-traffic-diverter/
> 
> Cool, and it even have a video!

Yep, with a horrible english accent :-)

> > It now supports drivers mvneta, ixgbe, e1000e, e1000 and igb. It is
> > very light, and retrieves the packets in the NIC's driver before they
> > are converted to an skb, then submits them to a registered RX handler
> > running in softirq context so we have the best of all worlds by
> > benefitting from CPU scalability, delayed processing, and not paying
> > the cost of switching to userland. Also an rx_done() function allows
> > handlers to batch their processing. 
> 
> Wow - it does sound a lot like XDP!  I would say that sort of
> validates the current direction of XDP, and that there are real
> use-cases for this stuff.

Absolutely! In fact what drove us to this architecture is that we first
wrote our anti-ddos in userland using netmap. While userland might be OK
for switches and routers, in our case we have haproxy listening on TCP
sockets and waiting for these packets. So the packets were bouncing from
kernel to user, then to kernel again, losing checksums, GRO, GSO, etc...
We modified it to support all of these but the performance was still poor,
capping at about 8 Gbps of forwarded traffic instead of ~40. Thus we thought
that the processing would definitely need to be placed in the kernel to avoid
this bouncing, and to avoid turning rings into newer rings all the time.
That's when I realized that it could possibly also cover my needs for a
sniffer and we redesigned the initial code to support both use cases. Now
we don't even see it in regular traffic, which is pretty nice.

> > The RX handler returns an action
> > among accepting the packet as-is, accepting it modified (eg: vlan or
> > tunnel decapsulation), dropping it, postponing the processing
> > (equivalent to EAGAIN), or building a new packet to send back.
> 
> I'll be very interested in studying in-details how you implemented and
> choose what actions to implement.

OK. The HTTP server is a good use case to study because it lets packets
pass through, be dropped, or be responded to, and the code is very
small, so easy to analyse.

> What was the need for postponing the processing (EAGAIN)?

Our SYN cookie generator. If the NIC's Tx queue is full and we cannot build
a SYN-ACK, we prefer to break out of the Rx loop because there's still room
in the Rx ring (statistically speaking).

> > This last function is the one requiring the most changes in existing
> > drivers, but offers the widest range of possibilities. We use it to
> > send SYN cookies, but I have also implemented a stateless HTTP server
> > supporting keep-alive using it, achieving line-rate traffic processing
> > on a single CPU core when the NIC supports it. It's very convenient to
> > test various stateful TCP components as it's easy to sustain millions
> > of connections per second on it.
> 
> Interesting, and controversial use-case.  One controversial use-case
> I imagined for XDP was implementing a DNS accelerator, which
> answers simple and frequent requests.

We thought about such a use case as well, just like a ping responder
(rate limited to avoid serving as a DDoS responder).

> You took it a step further with a HTTP server!

It's a fake HTTP server. You ask it to return 1kB of data and it sends you
1kB. It can even do multiple segments but then you're facing the risk of
losses that you'd preferably avoid. But in our case it's very useful for
testing various things including netfilter, LVS and haproxy, because it
needs so little power to reach performance levels that they cannot even
match that you can set it up on a small machine (eg: a cheap USB-powered
ARM board saturates the GigE link with 340 kcps, 663 krps). However I found
that it *could* be fun to improve it to deliver favicons or small error
pages.

> > It does not support forwarding between NICs. It was my first goal
> > because I wanted to implement a TAP with it, bridging the traffic
> > between two ports, but figured it was adding some complexity to the
> > system back then. 
> 
> With all the XDP features at the moment, we have avoided going through
> the page allocator, by relying on different page 

RE: [PATCH] net: fec: set mac address unconditionally

2016-09-21 Thread Andy Duan
From: Gavin Schenk  Sent: Wednesday, September 21, 2016 9:31 PM
> To: Andy Duan 
> Cc: netdev@vger.kernel.org; ker...@pengutronix.de; Gavin Schenk
> 
> Subject: [PATCH] net: fec: set mac address unconditionally
> 
> Fixes: 9638d19e4816 ("net: fec: add netif status check before set mac
> address")
> 
> If the mac address origin is not dt, you can only safely assign a mac address
> after "link up" of the device. If the link is down the clocks are disabled and
> because of issues assigning registers when clocks are down the new mac
> address is discarded on some SoCs. This fix sets the mac address
> unconditionally in fec_restart(...) and ensures consistency between fec
> registers and the network layer.
> 
> Signed-off-by: Gavin Schenk 
> ---

It make sense, thanks.

Acked-by: Fugang Duan 

>  drivers/net/ethernet/freescale/fec_main.c | 12 +---
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> index 2a03857cca18..bdabea6cd981 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -903,13 +903,11 @@ fec_restart(struct net_device *ndev)
>* enet-mac reset will reset mac address registers too,
>* so need to reconfigure it.
>*/
> - if (fep->quirks & FEC_QUIRK_ENET_MAC) {
> - memcpy(&temp_mac, ndev->dev_addr, ETH_ALEN);
> - writel((__force u32)cpu_to_be32(temp_mac[0]),
> -fep->hwp + FEC_ADDR_LOW);
> - writel((__force u32)cpu_to_be32(temp_mac[1]),
> -fep->hwp + FEC_ADDR_HIGH);
> - }
> + memcpy(&temp_mac, ndev->dev_addr, ETH_ALEN);
> + writel((__force u32)cpu_to_be32(temp_mac[0]),
> +fep->hwp + FEC_ADDR_LOW);
> + writel((__force u32)cpu_to_be32(temp_mac[1]),
> +fep->hwp + FEC_ADDR_HIGH);
> 
>   /* Clear any outstanding interrupt. */
>   writel(0x, fep->hwp + FEC_IEVENT);
> --
> 1.9.1


Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-21 Thread Tom Herbert
On Wed, Sep 21, 2016 at 10:26 AM, Jakub Kicinski  wrote:
> On Tue, 20 Sep 2016 17:01:39 -0700, Alexei Starovoitov wrote:
>>  >  - Reduces the amount of code and complexity needed in drivers to
>>  >manage XDP
>>
>> hmm:
>> 534 insertions(+), 144 deletions(-)
>> looks like increase in complexity instead.
>
> and more to come to tie this with HW offloads.

The amount of driver code did decrease with these patches:

drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 64 --
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 25 --
drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 -

Minimizing complexity being added to drivers for XDP is critical since
we are basically asking every driver to replicate the function. This
property should also apply to HW offloads, the more complexity we
can abstract out of drivers into a common backend infrastructure the
better for supporting across different drivers.

Tom



Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-21 Thread Jakub Kicinski
On Tue, 20 Sep 2016 17:01:39 -0700, Alexei Starovoitov wrote:
>  >  - Reduces the amount of code and complexity needed in drivers to
>  >manage XDP  
> 
> hmm:
> 534 insertions(+), 144 deletions(-)
> looks like increase in complexity instead.

and more to come to tie this with HW offloads.


Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-21 Thread Jakub Kicinski
On Wed, 21 Sep 2016 13:55:45 +0200, Thomas Graf wrote:
> > I am looking at using this for ILA router. The problem I am hitting is
> > that not all packets that we need to translate go through the XDP
> > path. Some would go through the kernel path, some through XDP path but  
> 
> When you say kernel path, what do you mean specifically? One aspect of
> XDP I love is that XDP can act as an acceleration option for existing
> BPF programs attached to cls_bpf. Support for direct packet read and
> write at clsact level have made it straight forward to write programs
> which are compatible or at minimum share a lot of common code. They
> can share data structures, lookup functionality, etc.

My very humble dream was that XDP would be transparently offloaded from
cls_bpf if the program was simple enough.  That ship has most likely sailed
because XDP has different abort behaviour.  When possible though, trying
to offload higher-level hooks when the rules don't require access to
full skb would be really cool.


[PATCH net-next 1/3] net/socket: factor out helpers for memory and queue manipulation

2016-09-21 Thread Paolo Abeni
Basic sock operations that udp code can use with its own
memory accounting schema. No functional change is introduced
in the existing APIs.

Acked-by: Hannes Frederic Sowa 
Signed-off-by: Paolo Abeni 
---
 include/linux/skbuff.h |  2 +-
 include/net/sock.h |  5 +++
 net/core/datagram.c| 36 +++
 net/core/skbuff.c  |  3 +-
 net/core/sock.c| 96 +-
 5 files changed, 94 insertions(+), 48 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index cfb7219..49c489d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3016,7 +3016,7 @@ static inline void skb_frag_list_init(struct sk_buff *skb)
 #define skb_walk_frags(skb, iter)  \
for (iter = skb_shinfo(skb)->frag_list; iter; iter = iter->next)
 
-
+void sock_rmem_free(struct sk_buff *skb);
 int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
const struct sk_buff *skb);
 struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
diff --git a/include/net/sock.h b/include/net/sock.h
index c797c57..a37362c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1274,7 +1274,9 @@ static inline struct inode *SOCK_INODE(struct socket 
*socket)
 /*
  * Functions for memory accounting
  */
+int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind);
 int __sk_mem_schedule(struct sock *sk, int size, int kind);
+void __sk_mem_reduce_allocated(struct sock *sk, int amount);
 void __sk_mem_reclaim(struct sock *sk, int amount);
 
 #define SK_MEM_QUANTUM ((int)PAGE_SIZE)
@@ -1940,6 +1942,9 @@ void sk_reset_timer(struct sock *sk, struct timer_list 
*timer,
 
 void sk_stop_timer(struct sock *sk, struct timer_list *timer);
 
+int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
+   unsigned int flags);
+void __sock_enqueue_skb(struct sock *sk, struct sk_buff *skb);
 int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 
diff --git a/net/core/datagram.c b/net/core/datagram.c
index b7de71f..bfb973a 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -323,6 +323,27 @@ void __skb_free_datagram_locked(struct sock *sk, struct 
sk_buff *skb, int len)
 }
 EXPORT_SYMBOL(__skb_free_datagram_locked);
 
+int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
+   unsigned int flags)
+{
+   int err = 0;
+
+   if (flags & MSG_PEEK) {
+   err = -ENOENT;
+   spin_lock_bh(&sk->sk_receive_queue.lock);
+   if (skb == skb_peek(&sk->sk_receive_queue)) {
+   __skb_unlink(skb, &sk->sk_receive_queue);
+   atomic_dec(&skb->users);
+   err = 0;
+   }
+   spin_unlock_bh(&sk->sk_receive_queue.lock);
+   }
+
+   atomic_inc(&sk->sk_drops);
+   return err;
+}
+EXPORT_SYMBOL(__sk_queue_drop_skb);
+
 /**
  * skb_kill_datagram - Free a datagram skbuff forcibly
  * @sk: socket
@@ -346,23 +367,10 @@ EXPORT_SYMBOL(__skb_free_datagram_locked);
 
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags)
 {
-   int err = 0;
-
-   if (flags & MSG_PEEK) {
-   err = -ENOENT;
-   spin_lock_bh(&sk->sk_receive_queue.lock);
-   if (skb == skb_peek(&sk->sk_receive_queue)) {
-   __skb_unlink(skb, &sk->sk_receive_queue);
-   atomic_dec(&skb->users);
-   err = 0;
-   }
-   spin_unlock_bh(&sk->sk_receive_queue.lock);
-   }
+   int err = __sk_queue_drop_skb(sk, skb, flags);

kfree_skb(skb);
-   atomic_inc(&sk->sk_drops);
sk_mem_reclaim_partial(sk);
-
return err;
 }
 EXPORT_SYMBOL(skb_kill_datagram);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3864b4b6..4dce605 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3657,12 +3657,13 @@ int skb_cow_data(struct sk_buff *skb, int tailbits, 
struct sk_buff **trailer)
 }
 EXPORT_SYMBOL_GPL(skb_cow_data);
 
-static void sock_rmem_free(struct sk_buff *skb)
+void sock_rmem_free(struct sk_buff *skb)
 {
struct sock *sk = skb->sk;
 
atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
 }
+EXPORT_SYMBOL_GPL(sock_rmem_free);
 
 /*
  * Note: We dont mem charge error packets (no sk_forward_alloc changes)
diff --git a/net/core/sock.c b/net/core/sock.c
index 51a7304..752308d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -405,24 +405,12 @@ static void sock_disable_timestamp(struct sock *sk, 
unsigned long flags)
 }
 
 
-int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+void __sock_enqueue_skb(struct sock *sk, struct sk_buff *skb)
 {
unsigned long flags;
struct sk_buff_head *list = &sk->sk_receive_queue;
 
-   if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf) {
- 

[PATCH net-next 2/3] udp: implement memory accounting helpers

2016-09-21 Thread Paolo Abeni
Avoid usage of common memory accounting functions, since
the logic is pretty much different.

To account for forward allocation, a couple of new atomic_t
members are added to udp_sock: 'mem_allocated' and 'mem_freed'.
The current forward allocation is estimated as 'mem_allocated'
minus 'mem_freed' minus 'sk_rmem_alloc'.

When the forward allocation can't cope with the packet to be
enqueued, 'mem_allocated' is incremented by the packet size
rounded up to the next SK_MEM_QUANTUM.
After a dequeue, we try to partially reclaim the forward
allocated memory, rounded down to an SK_MEM_QUANTUM multiple, and
'mem_freed' is increased by that amount.
sk->sk_forward_alloc is set after each allocated/freed memory
update, to the currently estimated forward allocation, without
any lock or protection.
This value is updated/maintained only to expose some
semi-reasonable value to the eventual reader, and is guaranteed
to be 0 at socket destruction time.

The above needs custom memory reclaiming on shutdown, provided
by the udp_destruct_sock() helper, which completely reclaim
the allocated forward memory.

Helpers are provided for skb free, consume and purge, respecting
the above constraints.

The socket lock is still used to protect the updates to sk_peek_off,
but is acquired only if peeking with offset is enabled.

As a consequence of the above schema, enqueue to sk_error_queue
will cause larger forward allocation on following normal data
(due to sk_rmem_alloc grow), but this allows amortizing the cost
of the atomic operation on SK_MEM_QUANTUM/skb->truesize packets.
The use of separate atomics for 'mem_allocated' and 'mem_freed'
allows the use of a single atomic operation to protect against
concurrent dequeue.

Acked-by: Hannes Frederic Sowa 
Signed-off-by: Paolo Abeni 
---
 include/linux/udp.h |   2 +
 include/net/udp.h   |   5 ++
 net/ipv4/udp.c  | 151 
 3 files changed, 158 insertions(+)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index d1fd8cd..cd72645 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -42,6 +42,8 @@ static inline u32 udp_hashfn(const struct net *net, u32 num, 
u32 mask)
 struct udp_sock {
/* inet_sock has to be the first member */
struct inet_sock inet;
+   atomic_t mem_allocated;
+   atomic_t mem_freed;
 #define udp_port_hash  inet.sk.__sk_common.skc_u16hashes[0]
 #define udp_portaddr_hash  inet.sk.__sk_common.skc_u16hashes[1]
 #define udp_portaddr_node  inet.sk.__sk_common.skc_portaddr_node
diff --git a/include/net/udp.h b/include/net/udp.h
index ea53a87..86307a4 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -246,6 +246,10 @@ static inline __be16 udp_flow_src_port(struct net *net, 
struct sk_buff *skb,
 }
 
 /* net/ipv4/udp.c */
+void skb_free_udp(struct sock *sk, struct sk_buff *skb);
+void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
+int udp_rmem_schedule(struct sock *sk, struct sk_buff *skb);
+
 void udp_v4_early_demux(struct sk_buff *skb);
 int udp_get_port(struct sock *sk, unsigned short snum,
 int (*saddr_cmp)(const struct sock *,
@@ -258,6 +262,7 @@ void udp_flush_pending_frames(struct sock *sk);
 void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst);
 int udp_rcv(struct sk_buff *skb);
 int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
+int udp_init_sock(struct sock *sk);
 int udp_disconnect(struct sock *sk, int flags);
 unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait);
 struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 058c312..98480af 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1178,6 +1178,157 @@ out:
return ret;
 }
 
+static inline int __udp_forward(struct udp_sock *up, int freed, int rmem)
+{
+   return atomic_read(&up->mem_allocated) - freed - rmem;
+}
+
+static int skb_unref(struct sk_buff *skb)
+{
+   if (likely(atomic_read(&skb->users) == 1))
+   smp_rmb();
+   else if (likely(!atomic_dec_and_test(&skb->users)))
+   return 0;
+
+   return skb->truesize;
+}
+
+static inline int udp_try_release(struct sock *sk, int *fwd, int partial)
+{
+   struct udp_sock *up = udp_sk(sk);
+   int freed_old, freed_new, amt;
+
+   freed_old = atomic_read(&up->mem_freed);
+   *fwd = __udp_forward(up, freed_old, atomic_read(&sk->sk_rmem_alloc));
+   if (*fwd < SK_MEM_QUANTUM + partial)
+   return 0;
+
+   /* we can have concurrent release; if we catch any conflict
+    * via atomic_cmpxchg, let only one of them release the memory
+    */
+   amt = sk_mem_pages(*fwd - partial) << SK_MEM_QUANTUM_SHIFT;
+   freed_new = atomic_cmpxchg(&up->mem_freed, freed_old, freed_old + amt);
+   return (freed_new == freed_old) ? amt : 0;
+}
+
+/* reclaim the allocated forward memory, except 'partial' quanta */
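
A toy rendition of the accounting arithmetic described in the changelog,
in user space (a sketch of the schema only, not the patch code; the skb
size is made up):

#include <stdio.h>

#define SK_MEM_QUANTUM		4096	/* one accounting quantum */
#define SK_MEM_QUANTUM_SHIFT	12

/* forward allocation as estimated by the patch:
 * mem_allocated - mem_freed - sk_rmem_alloc */
static int fwd_alloc(int allocated, int freed, int rmem)
{
	return allocated - freed - rmem;
}

int main(void)
{
	int allocated = 0, freed = 0, rmem = 0;

	/* enqueue a 700-byte-truesize skb with no forward credit:
	 * grow mem_allocated by the size rounded up to a quantum */
	allocated += SK_MEM_QUANTUM;
	rmem += 700;
	printf("after enqueue: fwd=%d\n", fwd_alloc(allocated, freed, rmem));

	/* dequeue: rmem drops, then whole quanta of the now unused
	 * forward credit are moved into mem_freed */
	rmem -= 700;
	int fwd = fwd_alloc(allocated, freed, rmem);
	freed += (fwd >> SK_MEM_QUANTUM_SHIFT) << SK_MEM_QUANTUM_SHIFT;
	printf("after reclaim: fwd=%d\n", fwd_alloc(allocated, freed, rmem));
	return 0;
}

This prints fwd=3396 after the enqueue and fwd=0 after the reclaim,
matching the "estimate, round up on alloc, round down on reclaim" flow
described above.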

[PATCH net-next 3/3] udp: use its own memory accounting schema

2016-09-21 Thread Paolo Abeni
Completely avoid default sock memory accounting and replace it
with udp-specific accounting.

Since the new memory accounting model does not require socket
locking, remove the lock on enqueue and free and avoid using the
backlog on enqueue.

Be sure to clean-up rx queue memory on socket destruction, using
udp its own sk_destruct.

Acked-by: Hannes Frederic Sowa 
Signed-off-by: Paolo Abeni 
---
 net/ipv4/udp.c| 39 ++-
 net/ipv6/udp.c| 28 +---
 net/sunrpc/svcsock.c  | 22 ++
 net/sunrpc/xprtsock.c |  2 +-
 4 files changed, 38 insertions(+), 53 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 98480af..cb617ee 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1358,13 +1358,8 @@ static int first_packet_length(struct sock *sk)
res = skb ? skb->len : -1;
spin_unlock_bh(&rcvq->lock);
 
-   if (!skb_queue_empty(&list_kill)) {
-   bool slow = lock_sock_fast(sk);
-
-   __skb_queue_purge(&list_kill);
-   sk_mem_reclaim_partial(sk);
-   unlock_sock_fast(sk, slow);
-   }
+   if (!skb_queue_empty(&list_kill))
+   udp_queue_purge(sk, &list_kill, 1);
return res;
 }
 
@@ -1413,7 +1408,6 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int noblock,
int err;
int is_udplite = IS_UDPLITE(sk);
bool checksum_valid = false;
-   bool slow;
 
if (flags & MSG_ERRQUEUE)
return ip_recv_error(sk, msg, len, addr_len);
@@ -1454,13 +1448,12 @@ try_again:
}
 
if (unlikely(err)) {
-   trace_kfree_skb(skb, udp_recvmsg);
if (!peeked) {
atomic_inc(&sk->sk_drops);
UDP_INC_STATS(sock_net(sk),
  UDP_MIB_INERRORS, is_udplite);
}
-   skb_free_datagram_locked(sk, skb);
+   skb_free_udp(sk, skb);
return err;
}
 
@@ -1485,16 +1478,15 @@ try_again:
if (flags & MSG_TRUNC)
err = ulen;
 
-   __skb_free_datagram_locked(sk, skb, peeking ? -err : err);
+   skb_consume_udp(sk, skb, peeking ? -err : err);
return err;
 
 csum_copy_err:
-   slow = lock_sock_fast(sk);
-   if (!skb_kill_datagram(sk, skb, flags)) {
+   if (!__sk_queue_drop_skb(sk, skb, flags)) {
UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
}
-   unlock_sock_fast(sk, slow);
+   skb_free_udp(sk, skb);
 
/* starting over for a new packet, but check if we need to yield */
cond_resched();
@@ -1613,7 +1605,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
sk_incoming_cpu_update(sk);
}
 
-   rc = __sock_queue_rcv_skb(sk, skb);
+   rc = udp_rmem_schedule(sk, skb);
if (rc < 0) {
int is_udplite = IS_UDPLITE(sk);
 
@@ -1627,8 +1619,8 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
return -1;
}
 
+   __sock_enqueue_skb(sk, skb);
return 0;
-
 }
 
 static struct static_key udp_encap_needed __read_mostly;
@@ -1650,7 +1642,6 @@ EXPORT_SYMBOL(udp_encap_enable);
 int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
struct udp_sock *up = udp_sk(sk);
-   int rc;
int is_udplite = IS_UDPLITE(sk);
 
/*
@@ -1743,19 +1734,8 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff 
*skb)
goto drop;
}
 
-   rc = 0;
-
ipv4_pktinfo_prepare(sk, skb);
-   bh_lock_sock(sk);
-   if (!sock_owned_by_user(sk))
-   rc = __udp_queue_rcv_skb(sk, skb);
-   else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
-   bh_unlock_sock(sk);
-   goto drop;
-   }
-   bh_unlock_sock(sk);
-
-   return rc;
+   return __udp_queue_rcv_skb(sk, skb);
 
 csum_error:
__UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
@@ -2365,6 +2345,7 @@ struct proto udp_prot = {
.connect   = ip4_datagram_connect,
.disconnect= udp_disconnect,
.ioctl = udp_ioctl,
+   .init  = udp_init_sock,
.destroy   = udp_destroy_sock,
.setsockopt= udp_setsockopt,
.getsockopt= udp_getsockopt,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 9aa7c1c..6f8e160 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -334,7 +334,6 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len,
int is_udplite = IS_UDPLITE(sk);
bool checksum_valid = false;
int is_udp4;
-   bool slow;
 
if (flags & MSG_ERRQUEUE)
return ipv6_recv_error(sk, msg, len, addr_len);
@@ 

[PATCH net-next 0/3] udp: refactor memory accounting

2016-09-21 Thread Paolo Abeni
This patch series refactors the udp memory accounting, replacing the
generic implementation with a custom one, in order to remove the need for
locking the socket on the enqueue and dequeue operations. The socket backlog
usage is dropped, as well.

The first patch factors out core pieces of some queue and memory management
socket helpers, so that they can later be used by the udp memory accounting
functions.
The second patch adds the memory accounting helpers, without using them.
The third patch replaces the old rx memory accounting path for udp over ipv4 and
udp over ipv6. In-kernel UDP users are updated, as well.

The memory accounting schema is described in detail in the individual patch
commit message.

The performance gain depends on the specific scenario; with few flows (and
little contention in the original code) the differences are in the noise range,
while with several flows contending for the same socket, the measured speed-up
is relevant (e.g. even over 100% in case of extreme contention).

Paolo Abeni (3):
  net/socket: factor out helpers for memory and queue manipulation
  udp: implement memory accounting helpers
  udp: use its own memory accounting schema

 include/linux/skbuff.h |   2 +-
 include/linux/udp.h|   2 +
 include/net/sock.h |   5 ++
 include/net/udp.h  |   5 ++
 net/core/datagram.c|  36 ++
 net/core/skbuff.c  |   3 +-
 net/core/sock.c|  96 -
 net/ipv4/udp.c | 190 +
 net/ipv6/udp.c |  28 +++-
 net/sunrpc/svcsock.c   |  22 --
 net/sunrpc/xprtsock.c  |   2 +-
 11 files changed, 290 insertions(+), 101 deletions(-)

-- 
1.8.3.1



Re: [ANNOUNCE] ndiv: line-rate network traffic processing

2016-09-21 Thread Tom Herbert
On Wed, Sep 21, 2016 at 4:28 AM, Willy Tarreau  wrote:
> Hi,
>
> Over the last 3 years I've been working a bit on high traffic processing
> for various reasons. It started with the wish to capture line-rate GigE
> traffic on very small fanless ARM machines and the framework has evolved
> to be used at my company as a basis for our anti-DDoS engine capable of
> dealing with multiple 10G links saturated with floods.
>
> I know it comes a bit late now that there is XDP, but it's my first
> vacation since then and I needed to have a bit of calm time to collect
> the patches from the various branches and put them together. Anyway I'm
> sending this here in case it can be of interest to anyone, for use or
> just to study it.
>
> I presented it in 2014 at kernel recipes :
>   
> http://kernel-recipes.org/en/2014/ndiv-a-low-overhead-network-traffic-diverter/
>
> It now supports drivers mvneta, ixgbe, e1000e, e1000 and igb. It is
> very light, and retrieves the packets in the NIC's driver before they
> are converted to an skb, then submits them to a registered RX handler
> running in softirq context so we have the best of all worlds by
> benefitting from CPU scalability, delayed processing, and not paying
> the cost of switching to userland. Also an rx_done() function allows
> handlers to batch their processing. The RX handler returns an action
> among accepting the packet as-is, accepting it modified (eg: vlan or
> tunnel decapsulation), dropping it, postponing the processing
> (equivalent to EAGAIN), or building a new packet to send back.
>
> This last function is the one requiring the most changes in existing
> drivers, but offers the widest range of possibilities. We use it to
> send SYN cookies, but I have also implemented a stateless HTTP server
> supporting keep-alive using it, achieving line-rate traffic processing
> on a single CPU core when the NIC supports it. It's very convenient to
> test various stateful TCP components as it's easy to sustain millions
> of connections per second on it.
>
> It does not support forwarding between NICs. It was my first goal
> because I wanted to implement a TAP with it, bridging the traffic
> between two ports, but figured it was adding some complexity to the
> system back then. However since then we've implemented traffic
> capture in our product, exploiting this framework to capture without
> losses at 14 Mpps. I may find some time to try to extract it later.
> It uses the /sys API so that you can simply plug tcpdump -r on a
> file there, though there's also an mmap version which uses less CPU
> (that's important at 10G).
>
> In its current form since the initial code's intent was to limit
> core changes, it happens not to modify anything in the kernel by
> default and to reuse the net_device's ax25_ptr to attach devices
> (idea borrowed from netmap), so it can be used on an existing
> kernel just by loading the patched network drivers (yes, I know
> it's not a valid solution for the long term).
>
> The current code is available here :
>
>   http://git.kernel.org/cgit/linux/kernel/git/wtarreau/ndiv.git/
>
> Please let me know if there could be some interest in rebasing it
> on more recent versions (currently 3.10, 3.14 and 4.4 are supported).
> I don't have much time to assign to it since it works fine as-is,
> but will be glad to do so if that can be useful.
>
Hi Willy,

This does seem interesting and indeed the driver datapath looks very
much like XDP. It would be quite interesting if you could rebase and
then maybe look at how this can work with XDP that would be helpful.
The return actions are identical, but processing descriptor meta data
(like checksum, vlan) is not yet implemented in XDP-- maybe this is
something we can leverage from ndiv?

Tom

> Also the stateless HTTP server provided in it definitely is a nice
> use case for testing such a framework.
>
> Regards,
> Willy
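
For reference while reading the description above, the handler contract
could look roughly like this; all names and signatures below are
invented for illustration and are not ndiv's actual API:

/* hypothetical shape of a pre-skb RX hook with the action set Willy
 * describes; invented names, not ndiv's real definitions */
enum rx_action {
	RX_ACCEPT,	/* pass the packet up unmodified */
	RX_ACCEPT_MOD,	/* pass it up after in-place modification */
	RX_DROP,	/* discard it */
	RX_AGAIN,	/* postpone processing (EAGAIN-like) */
	RX_REPLY,	/* handler built a packet to send back */
};

struct rx_handler {
	enum rx_action (*rx)(void *priv, void *data, unsigned int len);
	void (*rx_done)(void *priv);	/* end-of-batch hook */
};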


Re: [PATCH net-next V2] net/vxlan: Avoid unaligned access in vxlan_build_skb()

2016-09-21 Thread Alexei Starovoitov
On Wed, Sep 21, 2016 at 09:53:31AM -0700, Eric Dumazet wrote:
> On Wed, 2016-09-21 at 09:14 -0700, Alexei Starovoitov wrote:
> 
> > 
> > I think it's the opposite. Even on x86 compiler will use byte loads.
> 
> Unless you tweaked gcc, it should still use word loads on x86.

> I checked that on x86-64 actually. Also clearly visible here:

ahh. ok. good to know. thanks guys!



Re: [RFC PATCH v3 2/7] proc: Reduce cache miss in {snmp,netstat}_seq_show

2016-09-21 Thread hejianet

Hi Marcelo

sorry for the late, just came back from a vacation.

On 9/14/16 7:55 PM, Marcelo wrote:

Hi Jia,

On Wed, Sep 14, 2016 at 01:58:42PM +0800, hejianet wrote:

Hi Marcelo


On 9/13/16 2:57 AM, Marcelo wrote:

On Fri, Sep 09, 2016 at 02:33:57PM +0800, Jia He wrote:

This is to use the generic interface snmp_get_cpu_field{,64}_batch to
aggregate the data by going through all the items of each cpu sequentially.
Then snmp_seq_show and netstat_seq_show are split into 2 parts to avoid build
warning "the frame size" larger than 1024 on s390.

Yeah about that, did you test it with stack overflow detection?
These arrays can be quite large.

One more below..

Do you think it is acceptable if the stack usage is a little larger than 1024?
e.g. 1120
I can't find any other way to reduce the stack usage except use "static" before
unsigned long buff[TCP_MIB_MAX]

PS. sizeof buff is about TCP_MIB_MAX(116)*8=928
B.R.

That's pretty much the question. Linux has the option on some archs to
run with 4Kb (4KSTACKS option), so this function alone would be using
25% of it in this last case. While on x86_64, it uses 16Kb (6538b8ea886e
("x86_64: expand kernel stack to 16K")).

Adding static to it is not an option as it actually makes the variable
shared amongst the CPUs (and then you have concurrency issues), plus the
fact that it's always allocated, even while not in use.

Others here certainly know better than me if it's okay to make such
usage of the stack.

What about this patch instead?
It is a trade-off: I split the aggregation process into 2 parts, which will
increase the cache misses a little bit but reduces the stack usage.
After this, stack usage is 672 bytes:
objdump -d vmlinux | ./scripts/checkstack.pl ppc64 | grep seq_show
0xc07f7cc0 netstat_seq_show_tcpext.isra.3 [vmlinux]:672

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index c6ee8a2..cc41590 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -486,22 +486,37 @@ static const struct file_operations snmp_seq_fops = {
  */
 static int netstat_seq_show_tcpext(struct seq_file *seq, void *v)
 {
-   int i;
-   unsigned long buff[LINUX_MIB_MAX];
+   int i, c;
+   unsigned long buff[LINUX_MIB_MAX/2 + 1];
struct net *net = seq->private;

-   memset(buff, 0, sizeof(unsigned long) * LINUX_MIB_MAX);
+   memset(buff, 0, sizeof(unsigned long) * (LINUX_MIB_MAX/2 + 1));

seq_puts(seq, "TcpExt:");
for (i = 0; snmp4_net_list[i].name; i++)
seq_printf(seq, " %s", snmp4_net_list[i].name);

seq_puts(seq, "\nTcpExt:");
-   snmp_get_cpu_field_batch(buff, snmp4_net_list,
-net->mib.net_statistics);
-   for (i = 0; snmp4_net_list[i].name; i++)
+   for_each_possible_cpu(c) {
+   for (i = 0; i < LINUX_MIB_MAX/2; i++)
+   buff[i] += snmp_get_cpu_field(
+ net->mib.net_statistics,
+   c, snmp4_net_list[i].entry);
+   }
+   for (i = 0; i < LINUX_MIB_MAX/2; i++)
seq_printf(seq, " %lu", buff[i]);

+   memset(buff, 0, sizeof(unsigned long) * (LINUX_MIB_MAX/2 + 1));
+   for_each_possible_cpu(c) {
+   for (i = LINUX_MIB_MAX/2; snmp4_net_list[i].name; i++)
+   buff[i - LINUX_MIB_MAX/2] += snmp_get_cpu_field(
+   net->mib.net_statistics,
+   c,
+   snmp4_net_list[i].entry);
+   }
+   for (i = LINUX_MIB_MAX/2; snmp4_net_list[i].name; i++)
+           seq_printf(seq, " %lu", buff[i - LINUX_MIB_MAX/2]);
+
return 0;
 }


+static int netstat_seq_show_ipext(struct seq_file *seq, void *v)
+{
+   int i;
+   u64 buff64[IPSTATS_MIB_MAX];
+   struct net *net = seq->private;
seq_puts(seq, "\nIpExt:");
for (i = 0; snmp4_ipextstats_list[i].name != NULL; i++)
seq_printf(seq, " %s", snmp4_ipextstats_list[i].name);
seq_puts(seq, "\nIpExt:");

You're missing a memset() call here.

Not sure if you missed this one or not..

indeed, thanks
B.R.
Jia

Thanks,
Marcelo
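
The locality trade-off discussed in this thread can be sketched in
isolation; a toy model of the two traversal orders (illustrative only,
not the kernel's snmp helpers):

#define NR_CPUS		4
#define NR_ITEMS	116	/* roughly TCP_MIB_MAX */

static unsigned long percpu[NR_CPUS][NR_ITEMS];

/* item-major: one sweep across all CPUs per item, so consecutive
 * reads land in different per-CPU blocks -- poor locality */
static unsigned long get_field(int item)
{
	unsigned long sum = 0;
	int c;

	for (c = 0; c < NR_CPUS; c++)
		sum += percpu[c][item];
	return sum;
}

/* cpu-major batch: walk each CPU's block sequentially into one local
 * buffer -- the locality a batched getter buys, at the price of the
 * on-stack buffer this thread is trying to shrink */
static void get_batch(unsigned long *buff)
{
	int c, i;

	for (c = 0; c < NR_CPUS; c++)
		for (i = 0; i < NR_ITEMS; i++)
			buff[i] += percpu[c][i];
}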





Re: [PATCH net-next 1/3] net: ethernet: mediatek: add extension of phy-mode for TRGMII

2016-09-21 Thread Florian Fainelli
On 09/21/2016 12:33 AM, Sean Wang wrote:
> Date: Tue, 20 Sep 2016 14:23:24 -0700, Florian Fainelli 
>  wrote:
>> On 09/20/2016 12:59 AM, sean.w...@mediatek.com wrote:
>>> From: Sean Wang 
>>>
>>> adds PHY-mode "trgmii" as an extension for the operation
>>> mode of the PHY interface, TRGMII can be compatible with
>>> RGMII, so the extended mode doesn't really have effects on
>>> the target MAC and PHY, is used as the indication if the
>>> current MAC is connected to an internal switch or external
>>> PHY respectively by the given configuration on the board and
>>> then to perform the corresponding setup on TRGMII hardware
>>> module.
>>
>> Based on my googling, it seems like Turbo RGMII is a Mediatek-specific
>> thing for now, but this could become standard and used by other vendors
>> at some point, so I would be inclined to just extend the phy-mode
>> property to support trgmii as another interface type.
>>
>> If you do so, do you also mind proposing an update to the Device Tree
>> specification:
>>
>> https://www.devicetree.org/specifications/
>>
>> Thanks!
> 
> I am willing to do these things:
> 
> 1)
> in the next version, I will add trgmii as
> another interface type, PHY_INTERFACE_MODE_TRGMII,
> defined in linux/phy.h instead of an extension only inside
> the current driver. This change also helps to save some code.
> 
> 2)
> I will send another separate patch updating the Device Tree
> specification with a description of TRGMII
> 
> are these all okay for you?

Absolutely, thanks a lot!
-- 
Florian


Re: [net-next 01/15] i40e: Introduce VF port representor/control netdevs

2016-09-21 Thread Samudrala, Sridhar



On 9/21/2016 12:04 AM, Or Gerlitz wrote:

On Wed, Sep 21, 2016 at 8:45 AM, Samudrala, Sridhar
 wrote:

On 9/20/2016 9:22 PM, Or Gerlitz wrote:

On Wed, Sep 21, 2016 at 6:43 AM, Jeff Kirsher
 wrote:

From: Sridhar Samudrala 
This patch enables creation of a VF Port representor/Control netdev
associated with each VF. These netdevs can be used to control and
configure
VFs from PFs namespace. They enable exposing VF statistics, configuring
link state, mtu, fdb/vlan entries etc.

What happens if someone does an xmit on the VF representor, does the
packet show up @ the VF?
And what happens if the VF xmits and there's no HW steering rule that
matches this, does the frame show up @ the VF rep on the host?

TX/RX are not yet supported via VFPR netdevs in this patch series.
Will be submitting this support in the next patchset.

Okay, good.


In other words, can these VF reps serve for setting up host SW based
switching which you
can later offload (through TC, bridge, netfilter, etc)?

Yes. These offloads will be possible  via VFPRs.

cool


I am posing these questions because in a downstream patch you are adding
devlink support for setting/getting the e-switch mode, and you declare the
default mode to be switchdev.
When the switchdev mode was introduced in 4.8 these RX/TX
characteristics were defined
to be an essential (== requirement) part for a driver to support that mode.

The current patchset introduces the basic VFPR support starting with
exposing VF stats and syncing link state between VFs and VFPRs.
We decided to declare the default mode to be switchdev so that the new code
paths will get exercised by default during normal testing.

so what happens after this patchset is applied and before the future
work is submitted?
RX/TX slow path through the VFPRs isn't supported and what about fast
path? in other words
what happens when someone loads the driver, sets SRIOV (--> the driver
set itself to switchdev mode
and VFPRs are created) and then a VF sends a packet? do you still put
into the HW the legacy DMAC
based switching rules? I am not following...


The VF driver requests adding the dmac based filter rules via mailbox
messages to PF and that is not changed in this patchset.
Once we have VFPR TX/RX support, we will not allow the VF driver to add
these rules. Instead a host based program will be able to add these rules
to enable the fast path.

Thanks
Sridhar



Re: [PATCH net-next] MAINTAINERS: Update b44 maintainer.

2016-09-21 Thread Florian Fainelli
On 09/20/2016 08:33 PM, Michael Chan wrote:
> Taking over as maintainer since Gary Zambrano is no longer working
> for Broadcom.
> 
> Signed-off-by: Michael Chan 

Acked-by: Florian Fainelli 

Thanks Michael!
-- 
Florian


Re: [PATCH net-next V2] net/vxlan: Avoid unaligned access in vxlan_build_skb()

2016-09-21 Thread Eric Dumazet
On Wed, 2016-09-21 at 09:14 -0700, Alexei Starovoitov wrote:

> 
> I think it's the opposite. Even on x86 compiler will use byte loads.

Unless you tweaked gcc, it should still use word loads on x86.





Re: [PATCH net-next V2] net/vxlan: Avoid unaligned access in vxlan_build_skb()

2016-09-21 Thread Hannes Frederic Sowa
On 21.09.2016 18:14, Alexei Starovoitov wrote:
> On Wed, Sep 21, 2016 at 12:10:55PM +0200, Hannes Frederic Sowa wrote:
>> On 20.09.2016 20:57, Sowmini Varadhan wrote:
>>> The vxlan header may not be aligned to 4 bytes in
>>> vxlan_build_skb (e.g., for MLD packets). This patch
>>> avoids unaligned access traps from vxlan_build_skb
>>> (in platforms like sparc) by making struct vxlanhdr __packed.
>>>
>>> Signed-off-by: Sowmini Varadhan 
>>
>> Performance wise this should only affect code generation for archs where
>> it matters anyway.
> 
> I think it's the opposite. Even on x86 compiler will use byte loads.

I checked that on x86-64 actually. Also clearly visible here:

https://godbolt.org/g/xsW2P1

Bye,
Hannes
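
For reference, the comparison on godbolt boils down to something like
the following stand-alone sketch (not the vxlan code itself):

#include <stdint.h>

struct hdr_aligned {
	uint32_t vx_flags;
	uint32_t vx_vni;
};

struct hdr_packed {
	uint32_t vx_flags;
	uint32_t vx_vni;
} __attribute__((packed));

/* on x86-64 both accessors compile to a single 32-bit load; on
 * strict-alignment archs (e.g. sparc) only the packed variant falls
 * back to byte loads, which is the trade-off being discussed */
uint32_t read_aligned(const struct hdr_aligned *h) { return h->vx_vni; }
uint32_t read_packed(const struct hdr_packed *h)   { return h->vx_vni; }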




Re: [ANNOUNCE] ndiv: line-rate network traffic processing

2016-09-21 Thread Jesper Dangaard Brouer

On Wed, 21 Sep 2016 13:28:52 +0200 Willy Tarreau  wrote:

> Over the last 3 years I've been working a bit on high traffic processing
> for various reasons. It started with the wish to capture line-rate GigE
> traffic on very small fanless ARM machines and the framework has evolved
> to be used at my company as a basis for our anti-DDoS engine capable of
> dealing with multiple 10G links saturated with floods.
> 
> I know it comes a bit late now that there is XDP, but it's my first
> vacation since then and I needed to have a bit of calm time to collect
> the patches from the various branches and put them together. Anyway I'm
> sending this here in case it can be of interest to anyone, for use or
> just to study it.

I definitely want to study it!

You mention XDP.  If you didn't notice, I've created some documentation
on XDP (it is very "live" documentation at this point and it will
hopefully "materialize" later in the process).  But it should be a good
starting point for understanding XDP:

 https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html


> I presented it in 2014 at kernel recipes :
>   
> http://kernel-recipes.org/en/2014/ndiv-a-low-overhead-network-traffic-diverter/

Cool, and it even have a video!
 
> It now supports drivers mvneta, ixgbe, e1000e, e1000 and igb. It is
> very light, and retrieves the packets in the NIC's driver before they
> are converted to an skb, then submits them to a registered RX handler
> running in softirq context so we have the best of all worlds by
> benefitting from CPU scalability, delayed processing, and not paying
> the cost of switching to userland. Also an rx_done() function allows
> handlers to batch their processing. 

Wow - it does sound a lot like XDP!  I would say that sort of
validates the current direction of XDP, and that there are real
use-cases for this stuff.

> The RX handler returns an action
> among accepting the packet as-is, accepting it modified (eg: vlan or
> tunnel decapsulation), dropping it, postponing the processing
> (equivalent to EAGAIN), or building a new packet to send back.

I'll be very interested in studying in-details how you implemented and
choose what actions to implement.

What was the need for postponing the processing (EAGAIN)?

 
> This last function is the one requiring the most changes in existing
> drivers, but offers the widest range of possibilities. We use it to
> send SYN cookies, but I have also implemented a stateless HTTP server
> supporting keep-alive using it, achieving line-rate traffic processing
> on a single CPU core when the NIC supports it. It's very convenient to
> test various stateful TCP components as it's easy to sustain millions
> of connections per second on it.

Interesting, and controversial use-case.  One controversial use-case
I imagined for XDP was implementing a DNS accelerator, which
answers simple and frequent requests.  You took it a step further with
an HTTP server!


> It does not support forwarding between NICs. It was my first goal
> because I wanted to implement a TAP with it, bridging the traffic
> between two ports, but figured it was adding some complexity to the
> system back then. 

With all the XDP features at the moment, we have avoided going through
the page allocator, by relying on different page recycling tricks.

When doing forwarding between NICs it is harder to do these page
recycling tricks.  I've measured that the page allocator's fast path
("recycling" the same page) costs approx 270 cycles, and the per-packet
cycle budget at 14 Mpps on this 4GHz CPU is 268 cycles.  Thus, it is a
non-starter...
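
The budget arithmetic behind those numbers, spelled out (assuming 10GbE
line rate with minimum-size frames):

#include <stdio.h>

int main(void)
{
	double cpu_hz = 4.0e9;		/* the 4GHz CPU above */
	double pps = 14.88e6;		/* 10GbE line rate, 64-byte frames */

	/* roughly 269 cycles per packet -- one ~270-cycle allocator
	 * round trip already exceeds the whole budget */
	printf("budget: %.0f cycles/packet\n", cpu_hz / pps);
	return 0;
}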

Did you have to modify the page allocator?
Or implement some kind of recycling?

> However since then we've implemented traffic
> capture in our product, exploiting this framework to capture without
> losses at 14 Mpps. I may find some time to try to extract it later.
> It uses the /sys API so that you can simply plug tcpdump -r on a
> file there, though there's also an mmap version which uses less CPU
> (that's important at 10G).

Interesting.  I do see an XDP use-case for RAW packet capture, but I've
postponed that work until later.  I would be interested in how you solved
it?  E.g. do you support zero-copy?


> In its current form since the initial code's intent was to limit
> core changes, it happens not to modify anything in the kernel by
> default and to reuse the net_device's ax25_ptr to attach devices
> (idea borrowed from netmap), so it can be used on an existing
> kernel just by loading the patched network drivers (yes, I know
> it's not a valid solution for the long term).
> 
> The current code is available here :
> 
>   http://git.kernel.org/cgit/linux/kernel/git/wtarreau/ndiv.git/

I was just about to complain that the link was broken... but it fixed
itself while writing this email ;-)

Can you instead explain what branch to look at?

> 
> Please let me know if there could be some interest in rebasing it
> on more recent versions (currently 3.10, 3.14 and 

[PATCH net] act_ife: Add support for machines with hard_header_len != mac_len

2016-09-21 Thread Yotam Gigi
Without that fix, the following could occur:
 - On encode ingress, the total amount of skb_pushes (in lines 751 and
   753) was more than specified in cow.
 - On machines with hard_header_len > mac_len, the packet format was not
   according to the ife specifications, as the place reserved at the
   beginning of the packet is for hard_header_len, but only mac_len bytes
   get initialized in it. Acting upon a simple ping packet, the following
   tc commands:

   tc qdisc add dev sw1p5 handle 1: root prio
   tc filter add dev sw1p5 parent 1: protocol ip \
matchall skip_hw \
action ife encode type 0xdead use mark 0x12

   when netdev sw1p5 has hard_header_len of 30, created the following
   packet:

   0x0000:  e41d 2da5 f1d3 e41d 2d46 ffb5 dead   ..-.-F..
   0x0010:         000a  
   0x0020:  0001 0004  0012 e41d 2da5 f1d3 e41d  ..-.
   0x0031:  2d46 ffb5 0800 4500 0054 55e9 4000 4001  -FE..TU.@.@.
   0x0040:  ceba 0b00 0005 0b00 0001 0800 d2ea 63b7  ..c.
   0x0050:  0002 a360 e257   7ad0 0200   ...`.Wz.
   0x0060:   1011 1213 1415 1617 1819 1a1b 1c1d  
   0x0070:  1e1f 2021 2223 2425 2627 2829 2a2b 2c2d  ...!"#$%&'()*+,-
   0x0080:  2e2f 3031 3233 3435 3637 ./01234567

   and it can be seen that bytes 0x0e to 0x01e are not initialized and
   contain random memory data, while bytes 0x01e to 0x028 contain the ife
   header.

To fix those problems, add the total_push and reserve variables, which
indicate the exact number of pushes needed and the exact amount of
headroom the packet should have. Thus, it is possible to take care of
both cases:
 - on ingress, the mac header must be pushed back, as the ingress parser
   already parses the mac header and pulls it
 - on egress, the code should reserve an extra hard_header_len bytes of
   headroom for the driver to put the rest of the headers (a worked
   example of the arithmetic follows)
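As a rough worked example of that arithmetic, using the numbers from
the dump above (assuming a single mark TLV, so metalen = 8, with
IFE_METAHDRLEN = 2, mac_len = 14 and hard_header_len = 30):

   egress:  hdrm = total_push    = 8 + 14 + 2 = 24
            reserve              = 8 + 30 + 2 = 40
   ingress: hdrm                 = 8 + 14 + 2 = 24
            total_push = reserve = 24 + 14    = 38

The 0x000a at offset 0x1e in the dump is metalen + IFE_METAHDRLEN = 10,
the total metadata length carried in the ife header.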

Fixes: ef6980b6becb ("net sched: introduce IFE action")
Signed-off-by: Yotam Gigi 
---
 net/sched/act_ife.c | 34 +-
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index e87cd81..27b19ca 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -708,11 +708,13 @@ static int tcf_ife_encode(struct sk_buff *skb, const 
struct tc_action *a,
   where ORIGDATA = original ethernet header ...
 */
u16 metalen = ife_get_sz(skb, ife);
-   int hdrm = metalen + skb->dev->hard_header_len + IFE_METAHDRLEN;
-   unsigned int skboff = skb->dev->hard_header_len;
u32 at = G_TC_AT(skb->tc_verd);
-   int new_len = skb->len + hdrm;
bool exceed_mtu = false;
+   unsigned int skboff;
+   int total_push;
+   int reserve;
+   int new_len;
+   int hdrm;
int err;
 
if (at & AT_EGRESS) {
@@ -724,6 +726,22 @@ static int tcf_ife_encode(struct sk_buff *skb, const 
struct tc_action *a,
bstats_update(&ife->tcf_bstats, skb);
tcf_lastuse_update(&ife->tcf_tm);
 
+   if (at & AT_EGRESS) {
+   /* on egress, reserve space for hard_header_len instead of
+* mac_len
+*/
+   skb_reset_mac_len(skb);
+   hdrm = metalen + skb->mac_len + IFE_METAHDRLEN;
+   total_push = hdrm;
+   reserve = metalen + skb->dev->hard_header_len + IFE_METAHDRLEN;
+   } else {
+   /* on ingress, push mac_len as it already get parsed from tc */
+   hdrm = metalen + skb->mac_len + IFE_METAHDRLEN;
+   total_push = hdrm + skb->mac_len;
+   reserve = total_push;
+   }
+   new_len =  skb->len + hdrm;
+
if (!metalen) { /* no metadata to send */
/* abuse overlimits to count when we allow packet
 * with no metadata
@@ -742,19 +760,17 @@ static int tcf_ife_encode(struct sk_buff *skb, const 
struct tc_action *a,
 
iethh = eth_hdr(skb);
 
-   err = skb_cow_head(skb, hdrm);
+   err = skb_cow_head(skb, reserve);
if (unlikely(err)) {
ife->tcf_qstats.drops++;
spin_unlock(&ife->tcf_lock);
return TC_ACT_SHOT;
}
 
-   if (!(at & AT_EGRESS))
-   skb_push(skb, skb->dev->hard_header_len);
-
-   __skb_push(skb, hdrm);
+   __skb_push(skb, total_push);
memcpy(skb->data, iethh, skb->mac_len);
skb_reset_mac_header(skb);
+   skboff += skb->mac_len;
oethh = eth_hdr(skb);
 
/*total metadata length */
@@ -792,7 +808,7 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct 
tc_action *a,
oethh->h_proto = htons(ife->eth_type);
 
if (!(at & AT_EGRESS))
-   skb_pull(skb, skb->dev->hard_header_len);
+   skb_pull(skb, skb->mac_len);
 
spin_unlock(&ife->tcf_lock);
 
-- 
2.4.11



Re: [PATCH net-next V2] net/vxlan: Avoid unaligned access in vxlan_build_skb()

2016-09-21 Thread Alexei Starovoitov
On Wed, Sep 21, 2016 at 12:10:55PM +0200, Hannes Frederic Sowa wrote:
> On 20.09.2016 20:57, Sowmini Varadhan wrote:
> > The vxlan header may not be aligned to 4 bytes in
> > vxlan_build_skb (e.g., for MLD packets). This patch
> > avoids unaligned access traps from vxlan_build_skb
> > (in platforms like sparc) by making struct vxlanhdr __packed.
> > 
> > Signed-off-by: Sowmini Varadhan 
> 
> Performance wise this should only affect code generation for archs where
> it matters anyway.

I think it's the opposite. Even on x86 the compiler will use byte loads.
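The alternative that keeps word loads on arches that handle unaligned
access fine is to leave the struct unpacked and make the one unaligned
access explicit (a sketch using the stock get_unaligned() helper;
vx_vni is the field from the current vxlanhdr layout):

#include <asm/unaligned.h>
#include <net/vxlan.h>

/* read the VNI from a possibly 2-byte-aligned header without
 * marking the whole struct __packed */
static __be32 vxlan_get_vni(const struct vxlanhdr *vxh)
{
	return get_unaligned(&vxh->vx_vni);
}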



[PATCH iproute2] ss: Support displaying and filtering on socket marks.

2016-09-21 Thread Lorenzo Colitti
This allows the user to dump sockets with a given mark (via
"fwmark = 0x1234/0x1234" or "fwmark = 12345", etc.) , and to
display the socket marks of dumped sockets.

The relevant kernel commits are d545caca827b ("net: inet: diag:
expose the socket mark to privileged processes.") and
a52e95abf772 ("net: diag: allow socket bytecode filters to
match socket marks").

Signed-off-by: Lorenzo Colitti 
---
 include/linux/inet_diag.h |  7 +++
 misc/ss.c | 52 +++
 misc/ssfilter.h   |  2 ++
 misc/ssfilter.y   | 23 +++--
 4 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/include/linux/inet_diag.h b/include/linux/inet_diag.h
index beb74ee..016de88 100644
--- a/include/linux/inet_diag.h
+++ b/include/linux/inet_diag.h
@@ -73,6 +73,7 @@ enum {
INET_DIAG_BC_S_COND,
INET_DIAG_BC_D_COND,
INET_DIAG_BC_DEV_COND,   /* u32 ifindex */
+   INET_DIAG_BC_MARK_COND,
 };
 
 struct inet_diag_hostcond {
@@ -82,6 +83,11 @@ struct inet_diag_hostcond {
__be32  addr[0];
 };
 
+struct inet_diag_markcond {
+   __u32 mark;
+   __u32 mask;
+};
+
 /* Base info structure. It contains socket identity (addrs/ports/cookie)
  * and, alas, the information shown by netstat. */
 struct inet_diag_msg {
@@ -117,6 +123,7 @@ enum {
INET_DIAG_LOCALS,
INET_DIAG_PEERS,
INET_DIAG_PAD,
+   INET_DIAG_MARK,
__INET_DIAG_MAX,
 };
 
diff --git a/misc/ss.c b/misc/ss.c
index 3b268d9..83fb01f 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -737,6 +737,7 @@ struct sockstat {
unsigned long long  sk;
char *name;
char *peer_name;
+   __u32   mark;
 };
 
 struct dctcpstat {
@@ -807,6 +808,9 @@ static void sock_details_print(struct sockstat *s)
 
printf(" ino:%u", s->ino);
printf(" sk:%llx", s->sk);
+
+   if (s->mark)
+   printf(" fwmark:0x%x", s->mark);
 }
 
 static void sock_addr_print_width(int addr_len, const char *addr, char *delim,
@@ -1046,6 +1050,8 @@ struct aafilter {
inet_prefix addr;
int port;
unsigned intiface;
+   __u32   mark;
+   __u32   mask;
struct aafilter *next;
 };
 
@@ -1166,6 +1172,12 @@ static int run_ssfilter(struct ssfilter *f, struct 
sockstat *s)
 
return s->iface == a->iface;
}
+   case SSF_MARKMASK:
+   {
+   struct aafilter *a = (void *)f->pred;
+
+   return (s->mark & a->mask) == a->mark;
+   }
/* Yup. It is recursion. Sorry. */
case SSF_AND:
return run_ssfilter(f->pred, s) && run_ssfilter(f->post, s);
@@ -1341,6 +1353,23 @@ static int ssfilter_bytecompile(struct ssfilter *f, char 
**bytecode)
/* bytecompile for SSF_DEVCOND not supported yet */
return 0;
}
+   case SSF_MARKMASK:
+   {
+   struct aafilter *a = (void *)f->pred;
+   struct instr {
+   struct inet_diag_bc_op op;
+   struct inet_diag_markcond cond;
+   };
+   int inslen = sizeof(struct instr);
+
+   if (!(*bytecode = malloc(inslen))) abort();
+   ((struct instr *)*bytecode)[0] = (struct instr) {
+   { INET_DIAG_BC_MARK_COND, inslen, inslen + 4 },
+   { a->mark, a->mask},
+   };
+
+   return inslen;
+   }
default:
abort();
}
@@ -1620,6 +1649,25 @@ out:
return res;
 }
 
+void *parse_markmask(const char *markmask)
+{
+   struct aafilter a, *res;
+
+   if (strchr(markmask, '/')) {
+   if (sscanf(markmask, "%i/%i", &a.mark, &a.mask) != 2)
+   return NULL;
+   } else {
+   a.mask = 0xffffffff;
+   if (sscanf(markmask, "%i", &a.mark) != 1)
+   return NULL;
+   }
+
+   res = malloc(sizeof(*res));
+   if (res)
+   memcpy(res, &a, sizeof(a));
+   return res;
+}
+
 static char *proto_name(int protocol)
 {
switch (protocol) {
@@ -2107,6 +2155,10 @@ static void parse_diag_msg(struct nlmsghdr *nlh, struct 
sockstat *s)
s->iface= r->id.idiag_if;
s->sk   = cookie_sk_get(&r->id.idiag_cookie[0]);
 
+   s->mark = 0;
+   if (tb[INET_DIAG_MARK])
+   s->mark = *(__u32 *) RTA_DATA(tb[INET_DIAG_MARK]);
+
if (s->local.family == AF_INET)
s->local.bytelen = s->remote.bytelen = 4;
else
diff --git a/misc/ssfilter.h b/misc/ssfilter.h
index c7db8ee..dfc5b93 100644
--- a/misc/ssfilter.h
+++ b/misc/ssfilter.h
@@ -9,6 +9,7 @@
 #define SSF_S_LE  8
 #define SSF_S_AUTO  9
 #define SSF_DEVCOND 10
+#define SSF_MARKMASK 11
 
 #include 
 
@@ -22,3 +23,4 @@ struct ssfilter
 int ssfilter_parse(struct 

Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-09-21 Thread Pablo Neira Ayuso
Hi Daniel,

On Tue, Sep 20, 2016 at 06:43:35PM +0200, Daniel Mack wrote:
> Hi Pablo,
> 
> On 09/20/2016 04:29 PM, Pablo Neira Ayuso wrote:
> > On Mon, Sep 19, 2016 at 10:56:14PM +0200, Daniel Mack wrote:
> > [...]
> >> Why would we artificially limit the use-cases of this implementation if
> >> the way it stands, both filtering and introspection are possible?
> > 
> > Why should we place infrastructure in the kernel to filter packets so
> > late, and why at postrouting btw, when we can do this way earlier
> > before any packet is actually sent?
> 
> The point is that from an application's perspective, restricting the
> ability to bind a port and dropping packets that are being sent is a
> very different thing. Applications will start to behave differently if
> they can't bind to a port, and that's something we do not want to happen.

What exactly is the problem? Applications not checking the return
value of bind()? They should be fixed. If you want to collect
statistics, I see no reason why you couldn't collect them for every
EACCES on each bind() call.
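For instance, a well-behaved application would treat such a denial as
a first-class event (a plain userspace sketch, independent of whatever
mechanism does the denying):

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>

/* sketch: account a policy denial at bind() time instead of
 * silently retrying */
static int bind_or_report(int fd, const struct sockaddr *sa,
			  socklen_t len)
{
	if (bind(fd, sa, len) < 0) {
		if (errno == EACCES)
			fprintf(stderr, "bind denied by policy\n");
		return -errno;
	}
	return 0;
}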

> Looking at packets and making a verdict on them is the only way to
> implement what we have in mind. Given that's in line with what netfilter
> does, it can't be all that wrong, can it?

That output hook was added ~20 years ago... At that time we didn't
have anything better than dropping locally generated packets. Today we
can probably do something better.

> > No performance impact, no need for
> > skbuff allocation and *no cycles wasted to evaluate if every packet is
> > wanted or not*.
> 
> Hmm, not sure why this keeps coming up. As I said - for accounting,
> there is no other option than to look at every packet and its size.
> 
> Regarding the performance concerns, are you saying a netfilter based
> implementation that uses counters for that purpose would be more
> efficient?

> Because in my tests, just loading the netfilter modules with no
> rules in place at all has more impact than running the code from 6/6
> on every packet.

You must be talking about iptables. When did you test this? We now
have on-demand hook registration per netns anyway; in nftables you only
register what you need.

Every time you mention performance, it sounds like there is no
room to improve what we have... and we indeed have room and ideas to
get this flying faster, while keeping in mind good integration with our
generic network stack and extensible interfaces; that's important too.
On top of that, I have started working on some preliminary patches to
add an nftables JIT, and will be talking about this during the NetDev
1.2 netfilter/nftables workshop. I would expect numbers close to what
you're observing with this solution.


Re: [RFC v2 06/12] qedr: Add support for QP verbs

2016-09-21 Thread Jason Gunthorpe
On Wed, Sep 21, 2016 at 02:23:46PM +, Amrani, Ram wrote:
> > Ugh, each patch keeps adding to this?
> 
> The logic in the patch series is to have each patch contain only
> what is necessary for the specific functionality it adds.  This made
> it harder on us to prepare but, IMHO, easier for the reviewer.  If
> you'd like to have this file in one chunk, I can do this. What do
> you prefer?

I wouldn't change anything at this point, but I'm not sure it is
easier to review like this than the one patch per file scheme.

Do you have a git tree?

Jason


Re: [PATCH] netfilter: replace list_head with single linked list

2016-09-21 Thread Aaron Conole
Aaron Conole  writes:

> The netfilter hook list never uses the prev pointer, and so can be trimmed to
> be a simple singly-linked list.
>
> In addition to having a more lightweight structure for hook traversal,
> struct net becomes 5568 bytes (down from 6400) and struct net_device becomes
> 2176 bytes (down from 2240).
>
> Signed-off-by: Aaron Conole 
> Signed-off-by: Florian Westphal 

Apologies, all.  This subject prefix is incorrect.  It should be:
[PATCH nf-next v3 7/7]

Should I repost?

Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-21 Thread Alexei Starovoitov

On 9/20/16 11:39 PM, Jesper Dangaard Brouer wrote:

We are in the early stages of XDP development. Users cannot consider XDP
a stable UAPI yet.  I added a big fat warning to the docs here[1].

If you already consider this a stable API, then I will suggest that we
disable XDP or rip the whole thing out again!!!


the doc that you wrote is a great description of your understanding
of what XDP is about. It's not an official spec or design document.
Until it is reviewed and lands in kernel.org, please do not
make any assumptions about the present or future XDP API based on it.



[PATCH] netfilter: replace list_head with single linked list

2016-09-21 Thread Aaron Conole
The netfilter hook list never uses the prev pointer, and so can be trimmed to
be a simple singly-linked list.

In addition to having a more lightweight structure for hook traversal,
struct net becomes 5568 bytes (down from 6400) and struct net_device becomes
2176 bytes (down from 2240).
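
With this layout, hook traversal reduces to a plain pointer walk.  A
minimal sketch under RCU (illustrative only, not the exact nf_hook_slow
change in this patch):

#include <linux/netfilter.h>
#include <linux/rcupdate.h>
#include <linux/skbuff.h>

static unsigned int nf_run_hooks(struct nf_hook_entry *entry,
				 struct sk_buff *skb,
				 const struct nf_hook_state *state)
{
	unsigned int verdict = NF_ACCEPT;

	while (entry) {
		verdict = entry->ops.hook(entry->ops.priv, skb, state);
		if (verdict != NF_ACCEPT)
			break;		/* drop, queue, steal, ... */
		entry = rcu_dereference(entry->next);
	}
	return verdict;
}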

Signed-off-by: Aaron Conole 
Signed-off-by: Florian Westphal 
---
 include/linux/netdevice.h |   2 +-
 include/linux/netfilter.h |  61 +
 include/linux/netfilter_ingress.h |  16 +++--
 include/net/netfilter/nf_queue.h  |   3 +-
 include/net/netns/netfilter.h |   2 +-
 net/bridge/br_netfilter_hooks.c   |  19 ++---
 net/netfilter/core.c  | 141 +-
 net/netfilter/nf_internals.h  |  10 +--
 net/netfilter/nf_queue.c  |  18 ++---
 net/netfilter/nfnetlink_queue.c   |   8 ++-
 10 files changed, 165 insertions(+), 115 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 67bb978..41f49f5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1783,7 +1783,7 @@ struct net_device {
 #endif
struct netdev_queue __rcu *ingress_queue;
 #ifdef CONFIG_NETFILTER_INGRESS
-   struct list_headnf_hooks_ingress;
+   struct nf_hook_entry __rcu *nf_hooks_ingress;
 #endif
 
unsigned char   broadcast[MAX_ADDR_LEN];
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index ad444f0..17c90b0 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -55,12 +55,34 @@ struct nf_hook_state {
struct net_device *out;
struct sock *sk;
struct net *net;
-   struct list_head *hook_list;
+   struct nf_hook_entry __rcu *hook_entries;
int (*okfn)(struct net *, struct sock *, struct sk_buff *);
 };
 
+typedef unsigned int nf_hookfn(void *priv,
+  struct sk_buff *skb,
+  const struct nf_hook_state *state);
+struct nf_hook_ops {
+   struct list_headlist;
+
+   /* User fills in from here down. */
+   nf_hookfn   *hook;
+   struct net_device   *dev;
+   void*priv;
+   u_int8_tpf;
+   unsigned inthooknum;
+   /* Hooks are ordered in ascending priority. */
+   int priority;
+};
+
+struct nf_hook_entry {
+   struct nf_hook_entry __rcu  *next;
+   struct nf_hook_ops  ops;
+   const struct nf_hook_ops*orig_ops;
+};
+
 static inline void nf_hook_state_init(struct nf_hook_state *p,
- struct list_head *hook_list,
+ struct nf_hook_entry *hook_entry,
  unsigned int hook,
  int thresh, u_int8_t pf,
  struct net_device *indev,
@@ -76,26 +98,11 @@ static inline void nf_hook_state_init(struct nf_hook_state 
*p,
p->out = outdev;
p->sk = sk;
p->net = net;
-   p->hook_list = hook_list;
+   RCU_INIT_POINTER(p->hook_entries, hook_entry);
p->okfn = okfn;
 }
 
-typedef unsigned int nf_hookfn(void *priv,
-  struct sk_buff *skb,
-  const struct nf_hook_state *state);
 
-struct nf_hook_ops {
-   struct list_headlist;
-
-   /* User fills in from here down. */
-   nf_hookfn   *hook;
-   struct net_device   *dev;
-   void*priv;
-   u_int8_tpf;
-   unsigned inthooknum;
-   /* Hooks are ordered in ascending priority. */
-   int priority;
-};
 
 struct nf_sockopt_ops {
struct list_head list;
@@ -161,7 +168,8 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int 
hook,
 int (*okfn)(struct net *, struct sock *, 
struct sk_buff *),
 int thresh)
 {
-   struct list_head *hook_list;
+   struct nf_hook_entry *hook_head;
+   int ret = 1;
 
 #ifdef HAVE_JUMP_LABEL
if (__builtin_constant_p(pf) &&
@@ -170,22 +178,19 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned 
int hook,
return 1;
 #endif
 
-   hook_list = &net->nf.hooks[pf][hook];
+   rcu_read_lock();
 
-   if (!list_empty(hook_list)) {
+   hook_head = rcu_dereference(net->nf.hooks[pf][hook]);
+   if (hook_head) {
struct nf_hook_state state;
-   int ret;
 
-   /* We may already have this, but read-locks nest anyway */
-   rcu_read_lock();
-   nf_hook_state_init(&state, hook_list, hook, thresh,
+   nf_hook_state_init(&state, hook_head, hook, thresh,
   pf, indev, outdev, sk, net, okfn);
 
ret = 

Re: [PATCH net-next 3/3] net: ethernet: mediatek: add the dts property to set if TRGMII supported on GMAC0

2016-09-21 Thread Sean Wang
Date: Wed, 21 Sep 2016 16:17:20 +0200, Andrew Lunn  wrote:
>On Wed, Sep 21, 2016 at 02:16:30PM +0800, Sean Wang wrote:
>> Date: Tue, 20 Sep 2016 21:37:58 +0200, Andrew Lunn  wrote:
>> >On Tue, Sep 20, 2016 at 03:59:20PM +0800, sean.w...@mediatek.com wrote:
>> >> From: Sean Wang 
>> >>
>> >> Add the dts property to indicate whether TRGMII is supported on GMAC0
>> >>
>> >> Signed-off-by: Sean Wang 

 deleted

>> >In this case the switch is an MDIO device, not a PHY. It will not
>> >have a phy-mode. It cannot have a phy-mode; it is not a PHY.
>> >
>> >Or am i missing something here?
>> >
>> >Thanks
>> >
>> 
>> 1)
>> 
>> The switch driver does not support DSA yet,
>> but DSA is a good thing and I will try to make it happen
>> in the near future.
>
>O.K. But if i understand correctly, the TRGMII is so you can use the
>switch. So it needs to work when you have DSA.
>

yes, you are right. TRGMII for now is dedicated to the switch,
and furthermore it needs calibration between the host and
the switch before it works; I expect to put
the calibration logic into the setup callback of the DSA driver.


>> And another question about DSA: if I use DSA for the switch,
>> how do I know the relationship between the MAC and DSA,
>> the same way I can know the relationship
>> between a MAC and a PHY via phy-handle?
>
>It will look like what i stated above. But i missed the cpu node in
>the ports, which is what you are asking about. There will also be a
>node like:
>
>port@6 {
> reg = <6>;
> label = "cpu";
> ethernet = <&gmac0>;
> };
>
>And this is how you couple the MAC to DSA.

thanks, it answers my question: I can get the relationship from
the cpu port node pointing to the MAC it runs on.

>> The reason I ask is because I think it's good if the topology
>> of MACs/PHYs/switch can be known just from the dts files.
>> 
>> 2)
>> 
>> The phy-mode I mention is for fixed-link. The current MAC driver
>> just uses a fixed PHY to adapt to the switch side, so the
>> device tree looks something like the below.
>> 
>> &eth {
>> status = "okay";
>> gmac0: mac@0 {
>> compatible = "mediatek,eth-mac";
>> reg = <0>;
>> phy-mode = "trgmii";
>> fixed-link {
>> speed = <1000>;
>> full-duplex;
>> pause;
>> };
>> };
>> 
>> gmac1: mac@1 {
>> compatible = "mediatek,eth-mac";
>> reg = <1>;
>> phy-handle = <>;
>> };
>
>
>static int mtk_phy_connect(struct mtk_mac *mac)
>{
>struct mtk_eth *eth = mac->hw;
>struct device_node *np;
>u32 val;
>
>np = of_parse_phandle(mac->of_node, "phy-handle", 0);
>if (!np && of_phy_is_fixed_link(mac->of_node))
>if (!of_phy_register_fixed_link(mac->of_node))
>np = of_node_get(mac->of_node);
>   ...
>...
>mtk_phy_connect_node(eth, mac, np);
>
>
>So in the case of a fixed-phy, you do look in the MAC node, and when
>there is a phy-handle, you look in the PHY node.
>
>So this does work

yes, that is all

>
>   Andrew


Re: [PATCH RFC 1/3] xdp: Infrastructure to generalize XDP

2016-09-21 Thread Alexei Starovoitov

On 9/21/16 7:19 AM, Tom Herbert wrote:

#1: Should we allow alternate code to run in XDP other than BPF?


separate nft hook - yes
generic hook - no
since it's one step away from kernel modules abusing this hook.
pass/drop/tx of raw buffer at the driver level is a perfect
interface to bypass everything in the stack.
The tighter we make it the better.
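
To make the "tightness" concrete: all a program ever sees is the raw
buffer and a verdict.  A minimal XDP program against the 4.8-era uapi
(the "xdp" section name is just a loader convention assumed here):

#include <linux/bpf.h>

/* smallest possible XDP program: inspect nothing, pass everything */
__attribute__((section("xdp"), used))
int xdp_pass(struct xdp_md *ctx)
{
	return XDP_PASS;
}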

If nft and bpf are both not flexible enough to express
dataplane functionality we should extend them instead of
writing C code or kernel modules.

On bpf side we're trying very hard to kill any dream of
interoperability with kernel modules.
The map and prog type registration is done in a way to make
it impossible for kernel modules to register their own
map and program types or provide their own helper functions.

The nf hooks approach is very lax in that regard, and imo it was a
mistake, since there are plenty of out-of-tree modules using nf hooks
and plenty of in-tree modules that are barely maintained.


#2: If #1 is true what is the best way to implement that?


Add separate nft hook that doesn't interfere in any way
with bpf hook at xdp level.
The order nft-first or bpf-first or exclusive attach
doesn't matter to me. These are details to be discussed.



Re: [PATCH next v3 0/2] Rename WORD_TRUNC/ROUND macros and use them

2016-09-21 Thread Neil Horman
On Wed, Sep 21, 2016 at 08:45:54AM -0300, Marcelo Ricardo Leitner wrote:
> This patchset aims to rename these macros to non-confusing names, as
> reported by David Laight and David Miller, and to update the last
> remaining place to make use of them.
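> 
> For context, the renamed helpers are one-liners; modulo the exact
> spelling in sctp.h, they boil down to:
> 
>    /* round a length down / up to the 4-byte alignment the RFC
>     * requires; formerly WORD_TRUNC / WORD_ROUND */
>    #define SCTP_TRUNC4(x) ((x) & ~3)
>    #define SCTP_PAD4(x)   (((x) + 3) & ~3)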
> 
> v3:
> - Name it SCTP_PAD4 instead of SCTP_ALIGN4, as suggested by David Laight
> v2:
> - fixed 2nd patch summary
> 
> Details on the specific changelogs.
> 
> Thanks!
> 
> Marcelo Ricardo Leitner (2):
>   sctp: rename WORD_TRUNC/ROUND macros
>   sctp: make use of WORD_TRUNC macro
> 
>  include/net/sctp/sctp.h  | 10 +-
>  net/netfilter/xt_sctp.c  |  2 +-
>  net/sctp/associola.c |  2 +-
>  net/sctp/chunk.c | 13 +++--
>  net/sctp/input.c |  8 
>  net/sctp/inqueue.c   |  2 +-
>  net/sctp/output.c| 12 ++--
>  net/sctp/sm_make_chunk.c | 28 ++--
>  net/sctp/sm_statefuns.c  |  6 +++---
>  net/sctp/transport.c |  4 ++--
>  net/sctp/ulpevent.c  |  4 ++--
>  11 files changed, 46 insertions(+), 45 deletions(-)
> 
> -- 
> 2.7.4
> 

Acked-by: Neil Horman 



[PATCH nf-next v3 5/7] nf_register_net_hook: Only allow sane values

2016-09-21 Thread Aaron Conole
This commit adds an upfront check for sane values to be passed when
registering a netfilter hook.  This will be used in a future patch for a
simplified hook list traversal.

Signed-off-by: Aaron Conole 
---
 net/netfilter/core.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index c8faf81..67b7428 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -89,6 +89,11 @@ int nf_register_net_hook(struct net *net, const struct 
nf_hook_ops *reg)
struct nf_hook_entry *entry;
struct nf_hook_ops *elem;
 
+   if (reg->pf == NFPROTO_NETDEV &&
+   (reg->hooknum != NF_NETDEV_INGRESS ||
+!reg->dev || dev_net(reg->dev) != net))
+   return -EINVAL;
+
entry = kmalloc(sizeof(*entry), GFP_KERNEL);
if (!entry)
return -ENOMEM;
-- 
2.7.4



[PATCH nf-next v3 1/7] netfilter: bridge: add and use br_nf_hook_thresh

2016-09-21 Thread Aaron Conole
From: Florian Westphal 

This replaces the last uses of NF_HOOK_THRESH().
Followup patch will remove it and rename nf_hook_thresh.

The reason is that inet (non-bridge) netfilter no longer invokes the
hooks from hooks, so we no longer need the thresh value to skip hooks
with a lower priority.

The bridge netfilter however may need to do this. br_nf_hook_thresh is a
wrapper that is supposed to do this, i.e. only call hooks with a
priority that exceeds NF_BR_PRI_BRNF.

It's used only in the recursion cases of br_netfilter.  It invokes
nf_hook_slow while holding an rcu read-side critical section to make a
future cleanup simpler.

Signed-off-by: Florian Westphal 
Signed-off-by: Aaron Conole 
---
 include/net/netfilter/br_netfilter.h |  6 
 net/bridge/br_netfilter_hooks.c  | 60 ++--
 net/bridge/br_netfilter_ipv6.c   | 12 +++-
 3 files changed, 62 insertions(+), 16 deletions(-)

diff --git a/include/net/netfilter/br_netfilter.h 
b/include/net/netfilter/br_netfilter.h
index e8d1448..0b0c35c 100644
--- a/include/net/netfilter/br_netfilter.h
+++ b/include/net/netfilter/br_netfilter.h
@@ -15,6 +15,12 @@ static inline struct nf_bridge_info *nf_bridge_alloc(struct 
sk_buff *skb)
 
 void nf_bridge_update_protocol(struct sk_buff *skb);
 
+int br_nf_hook_thresh(unsigned int hook, struct net *net, struct sock *sk,
+ struct sk_buff *skb, struct net_device *indev,
+ struct net_device *outdev,
+ int (*okfn)(struct net *, struct sock *,
+ struct sk_buff *));
+
 static inline struct nf_bridge_info *
 nf_bridge_info_get(const struct sk_buff *skb)
 {
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 77e7f69..6029af4 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -395,11 +396,10 @@ bridged_dnat:
skb->dev = nf_bridge->physindev;
nf_bridge_update_protocol(skb);
nf_bridge_push_encap_header(skb);
-   NF_HOOK_THRESH(NFPROTO_BRIDGE,
-  NF_BR_PRE_ROUTING,
-  net, sk, skb, skb->dev, NULL,
-  br_nf_pre_routing_finish_bridge,
-  1);
+   br_nf_hook_thresh(NF_BR_PRE_ROUTING,
+ net, sk, skb, skb->dev,
+ NULL,
+ br_nf_pre_routing_finish);
return 0;
}
ether_addr_copy(eth_hdr(skb)->h_dest, dev->dev_addr);
@@ -417,10 +417,8 @@ bridged_dnat:
skb->dev = nf_bridge->physindev;
nf_bridge_update_protocol(skb);
nf_bridge_push_encap_header(skb);
-   NF_HOOK_THRESH(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, net, sk, skb,
-  skb->dev, NULL,
-  br_handle_frame_finish, 1);
-
+   br_nf_hook_thresh(NF_BR_PRE_ROUTING, net, sk, skb, skb->dev, NULL,
+ br_handle_frame_finish);
return 0;
 }
 
@@ -992,6 +990,50 @@ static struct notifier_block brnf_notifier __read_mostly = 
{
.notifier_call = brnf_device_event,
 };
 
+/* recursively invokes nf_hook_slow (again), skipping already-called
+ * hooks (< NF_BR_PRI_BRNF).
+ *
+ * Called with rcu read lock held.
+ */
+int br_nf_hook_thresh(unsigned int hook, struct net *net,
+ struct sock *sk, struct sk_buff *skb,
+ struct net_device *indev,
+ struct net_device *outdev,
+ int (*okfn)(struct net *, struct sock *,
+ struct sk_buff *))
+{
+   struct nf_hook_ops *elem;
+   struct nf_hook_state state;
+   struct list_head *head;
+   int ret;
+
+   head = &net->nf.hooks[NFPROTO_BRIDGE][hook];
+
+   list_for_each_entry_rcu(elem, head, list) {
+   struct nf_hook_ops *next;
+
+   next = list_entry_rcu(list_next_rcu(&elem->list),
+ struct nf_hook_ops, list);
+   if (next->priority <= NF_BR_PRI_BRNF)
+   continue;
+   }
+
+   if (&elem->list == head)
+   return okfn(net, sk, skb);
+
+   /* We may already have this, but read-locks nest anyway */
+   rcu_read_lock();
+   nf_hook_state_init(&state, head, hook, NF_BR_PRI_BRNF + 1,
+  NFPROTO_BRIDGE, indev, outdev, sk, net, okfn);
+
+   ret = nf_hook_slow(skb, &state);
+   rcu_read_unlock();
+   if (ret == 1)
+   ret = 

[PATCH nf-next v3 2/7] netfilter: call nf_hook_state_init with rcu_read_lock held

2016-09-21 Thread Aaron Conole
From: Florian Westphal 

This makes things simpler because we can store the head of the list
in the nf_state structure without worrying about concurrent add/delete
of hook elements from the list.

A future commit will make use of this to implement a simpler
linked-list.

Signed-off-by: Florian Westphal 
Signed-off-by: Aaron Conole 
---
 include/linux/netfilter.h | 8 +++-
 include/linux/netfilter_ingress.h | 1 +
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 9230f9a..ad444f0 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -174,10 +174,16 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned 
int hook,
 
if (!list_empty(hook_list)) {
struct nf_hook_state state;
+   int ret;
 
+   /* We may already have this, but read-locks nest anyway */
+   rcu_read_lock();
nf_hook_state_init(&state, hook_list, hook, thresh,
   pf, indev, outdev, sk, net, okfn);
-   return nf_hook_slow(skb, &state);
+
+   ret = nf_hook_slow(skb, &state);
+   rcu_read_unlock();
+   return ret;
}
return 1;
 }
diff --git a/include/linux/netfilter_ingress.h 
b/include/linux/netfilter_ingress.h
index 5fcd375..6965ba0 100644
--- a/include/linux/netfilter_ingress.h
+++ b/include/linux/netfilter_ingress.h
@@ -14,6 +14,7 @@ static inline bool nf_hook_ingress_active(const struct 
sk_buff *skb)
return !list_empty(&skb->dev->nf_hooks_ingress);
 }
 
+/* caller must hold rcu_read_lock */
 static inline int nf_hook_ingress(struct sk_buff *skb)
 {
struct nf_hook_state state;
-- 
2.7.4



[PATCH nf-next v3 3/7] netfilter: call nf_hook_ingress with rcu_read_lock

2016-09-21 Thread Aaron Conole
This commit ensures that the rcu read-side lock is held while the
ingress hook is called.  This ensures that a call to nf_hook_slow (and
ultimately nf_ingress) will be read protected.

Signed-off-by: Aaron Conole 
---
 net/core/dev.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 34b5322..0649194 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4040,12 +4040,17 @@ static inline int nf_ingress(struct sk_buff *skb, 
struct packet_type **pt_prev,
 {
 #ifdef CONFIG_NETFILTER_INGRESS
if (nf_hook_ingress_active(skb)) {
+   int ingress_retval;
+
if (*pt_prev) {
*ret = deliver_skb(skb, *pt_prev, orig_dev);
*pt_prev = NULL;
}
 
-   return nf_hook_ingress(skb);
+   rcu_read_lock();
+   ingress_retval = nf_hook_ingress(skb);
+   rcu_read_unlock();
+   return ingress_retval;
}
 #endif /* CONFIG_NETFILTER_INGRESS */
return 0;
-- 
2.7.4


