Inquiry 09-10-2018

2018-10-09 Thread Daniel Murray
Hi, friend,

This is Daniel Murray and I am from Sinara Group Co., Ltd in Russia.
We are glad to know about your company from the web and we are interested in
your products.
Could you kindly send us your latest catalog and price list for our trial order?

Best Regards,

Daniel Murray
Purchasing Manager




Re: [PATCH bpf-next] tools/bpf: use proper type and uapi perf_event.h header for libbpf

2018-10-09 Thread Alexei Starovoitov
On Tue, Oct 09, 2018 at 04:14:47PM -0700, Yonghong Song wrote:
> Use __u32 instead of u32 in libbpf.c and also use
> uapi perf_event.h instead of tools/perf/perf-sys.h.
> 
> Signed-off-by: Yonghong Song 

Applied, Thanks



Re: [bpf-next V2 PATCH 0/3] bpf/xdp: fix generic-XDP and demonstrate VLAN manipulation

2018-10-09 Thread Alexei Starovoitov
On Tue, Oct 09, 2018 at 12:04:37PM +0200, Jesper Dangaard Brouer wrote:
> While implementing PoC building blocks for eBPF code XDP+TC that can
> manipulate VLAN headers, I discovered a bug in generic-XDP.
> 
> The fix should be backported to stable kernels.  Even though
> generic-XDP was introduced in v4.12, I think the bug is not exposed
> until v4.14 in the mentioned Fixes commit.

Applied, Thanks



Re: [PATCH bpf-next 0/6] Error handling when map lookup isn't supported

2018-10-09 Thread Alexei Starovoitov
On Tue, Oct 09, 2018 at 10:04:48AM +0900, Prashant Bhole wrote:
> Currently, when a map lookup fails, the user space API cannot make any
> distinction whether the given key was not found or lookup is not supported
> by the particular map.
> 
> In this series we modify return value of maps which do not support
> lookup. Lookup on such map implementation will return -EOPNOTSUPP.
> bpf() syscall with BPF_MAP_LOOKUP_ELEM command will set EOPNOTSUPP
> errno. We also handle this error in bpftool to print appropriate
> message.
> 
> Patch 1: adds handling of the BPF_MAP_LOOKUP_ELEM command of the bpf syscall
> such that errno will be set to EOPNOTSUPP when the map doesn't support lookup
> 
> Patch 2: Modifies the return value of map_lookup_elem() to EOPNOTSUPP
> for maps which do not support lookup
> 
> Patch 3: Splits do_dump() in bpftool/map.c. Element printing code is
> moved out into new function dump_map_elem(). This was done in order to
> reduce deep indentation and accommodate further changes.
> 
> Patch 4: Changes in bpftool to print a strerror() message when a lookup
> error occurs. This will result in an appropriate message like
> "Operation not supported" when the map doesn't support lookup.
> 
> Patch 5: test_verifier: change fixup map naming convention as
> suggested by Alexei
> 
> Patch 6: Added verifier tests to check whether the verifier rejects calls
> to bpf_map_lookup_elem from a bpf program, for all map types that
> do not support map lookup.

Applied, Thanks
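
As a minimal user-space sketch of the distinction this series enables (the
map fd and key here are hypothetical; bpf_map_lookup_elem() is libbpf's thin
wrapper over the syscall, which returns -1 and sets errno on failure):

#include <errno.h>
#include <stdio.h>
#include <bpf/bpf.h>    /* libbpf: bpf_map_lookup_elem() */

static void try_lookup(int map_fd, const void *key, void *value)
{
        if (bpf_map_lookup_elem(map_fd, key, value) == 0) {
                printf("value found\n");
        } else if (errno == ENOENT) {
                printf("key not present\n");
        } else if (errno == EOPNOTSUPP) {
                /* new with this series: this map type has no lookup */
                printf("lookup not supported by this map type\n");
        } else {
                perror("bpf_map_lookup_elem");
        }
}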



Re: [PATCH net-next v3] net/ncsi: Extend NC-SI Netlink interface to allow user space to send NC-SI command

2018-10-09 Thread Samuel Mendoza-Jonas
On Mon, 2018-10-08 at 23:13 +, justin.l...@dell.com wrote:
> The new command (NCSI_CMD_SEND_CMD) is added to allow a user space 
> application to send NC-SI commands to the network card.
> Also, add a new attribute (NCSI_ATTR_DATA) for transferring the request and 
> response.
> 
> The workflow is as below.
> 
> Request:
> User space application
>   -> Netlink interface (msg)
>   -> new Netlink handler - ncsi_send_cmd_nl()
>   -> ncsi_xmit_cmd()
> 
> Response:
> Response received - ncsi_rcv_rsp()
>   -> internal response handler - ncsi_rsp_handler_xxx()
>   -> ncsi_rsp_handler_netlink()
>   -> ncsi_send_netlink_rsp ()
>   -> Netlink interface (msg)
>   -> user space application
> 
> Command timeout - ncsi_request_timeout()
>   -> ncsi_send_netlink_timeout ()
>   -> Netlink interface (msg with zero data length)
>   -> user space application
> 
> Error:
> Error detected
>   -> ncsi_send_netlink_err ()
>   -> Netlink interface (err msg)
>   -> user space application

Hi Justin,

I've built and tested this and it works as expected, except for some very
minor comments below:

Reviewed-by: Samuel Mendoza-Jonas 

> 
> V3: Based on http://patchwork.ozlabs.org/patch/979688/ to remove the 
> duplicated code.
> V2: Remove non-related debug message and clean up the code.

It's better to put these change notes under the --- below so they're not
included in the commit message, but thanks for including them!

> 
> 
> Signed-off-by: Justin Lee  
> 
> 
> ---
>  include/uapi/linux/ncsi.h |   3 +
>  net/ncsi/internal.h   |  10 ++-
>  net/ncsi/ncsi-cmd.c   |   8 ++
>  net/ncsi/ncsi-manage.c|  16 
>  net/ncsi/ncsi-netlink.c   | 204 ++
>  net/ncsi/ncsi-netlink.h   |  12 +++
>  net/ncsi/ncsi-rsp.c   |  67 +--
>  7 files changed, 314 insertions(+), 6 deletions(-)
> 
> diff --git a/include/uapi/linux/ncsi.h b/include/uapi/linux/ncsi.h
> index 4c292ec..4992bfc 100644
> --- a/include/uapi/linux/ncsi.h
> +++ b/include/uapi/linux/ncsi.h
> @@ -30,6 +30,7 @@ enum ncsi_nl_commands {
>   NCSI_CMD_PKG_INFO,
>   NCSI_CMD_SET_INTERFACE,
>   NCSI_CMD_CLEAR_INTERFACE,
> + NCSI_CMD_SEND_CMD,
>  
>   __NCSI_CMD_AFTER_LAST,
>   NCSI_CMD_MAX = __NCSI_CMD_AFTER_LAST - 1
> @@ -43,6 +44,7 @@ enum ncsi_nl_commands {
>   * @NCSI_ATTR_PACKAGE_LIST: nested array of NCSI_PKG_ATTR attributes
>   * @NCSI_ATTR_PACKAGE_ID: package ID
>   * @NCSI_ATTR_CHANNEL_ID: channel ID
> + * @NCSI_ATTR_DATA: command payload
>   * @NCSI_ATTR_MAX: highest attribute number
>   */
>  enum ncsi_nl_attrs {
> @@ -51,6 +53,7 @@ enum ncsi_nl_attrs {
>   NCSI_ATTR_PACKAGE_LIST,
>   NCSI_ATTR_PACKAGE_ID,
>   NCSI_ATTR_CHANNEL_ID,
> + NCSI_ATTR_DATA,
>  
>   __NCSI_ATTR_AFTER_LAST,
>   NCSI_ATTR_MAX = __NCSI_ATTR_AFTER_LAST - 1
> diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
> index 3d0a33b..e9db100 100644
> --- a/net/ncsi/internal.h
> +++ b/net/ncsi/internal.h
> @@ -175,6 +175,8 @@ struct ncsi_package;
>  #define NCSI_RESERVED_CHANNEL0x1f
>  #define NCSI_CHANNEL_INDEX(c)((c) & ((1 << NCSI_PACKAGE_SHIFT) - 1))
>  #define NCSI_TO_CHANNEL(p, c)(((p) << NCSI_PACKAGE_SHIFT) | (c))
> +#define NCSI_MAX_PACKAGE 8
> +#define NCSI_MAX_CHANNEL 32
>  
>  struct ncsi_channel {
>   unsigned char   id;
> @@ -219,12 +221,17 @@ struct ncsi_request {
> unsigned char id; /* Request ID - 0 to 255   */
>   bool used;/* Request that has been assigned  */
>   unsigned int flags;   /* NCSI request property   */
> -#define NCSI_REQ_FLAG_EVENT_DRIVEN   1
> +#define NCSI_REQ_FLAG_EVENT_DRIVEN   1
> +#define NCSI_REQ_FLAG_NETLINK_DRIVEN 2
>   struct ncsi_dev_priv *ndp;/* Associated NCSI device  */
>   struct sk_buff   *cmd;/* Associated NCSI command packet  */
>   struct sk_buff   *rsp;/* Associated NCSI response packet */
> struct timer_list timer;  /* Timer on waiting for response   */
>   bool enabled; /* Time has been enabled or not*/
> +
> + u32  snd_seq; /* netlink sending sequence number */
> + u32  snd_portid;  /* netlink portid of sender*/
> + struct nlmsghdr  nlhdr;   /* netlink message header  */
>  };
>  
>  enum {
> @@ -310,6 +317,7 @@ struct ncsi_cmd_arg {
>   unsigned int   dwords[4];
>   };
>   unsigned char*data;   /* NCSI OEM data */
> + struct genl_info *info;   /* Netlink information   */
>  };
>  
>  extern struct list_head ncsi_dev_list;
> diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
> index 82b7d92..356af47 100644
> --- a/net/ncsi/ncsi-cmd.c
> +++ b/net/ncsi/ncsi-cmd.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
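
A user-space sketch of driving the new command over generic netlink (libnl-3;
the "NCSI" genl family name, the NCSI_ATTR_IFINDEX attribute and the
raw-payload layout of NCSI_ATTR_DATA are assumptions based on the flow
described above, not taken verbatim from this patch):

#include <netlink/netlink.h>
#include <netlink/genl/genl.h>
#include <netlink/genl/ctrl.h>
#include <linux/ncsi.h>

/* Send a raw NC-SI command to package 0/channel 0 of an interface.
 * The response comes back on the same socket, per the flow above.
 * Error handling trimmed for brevity.
 */
static int ncsi_send_cmd(struct nl_sock *sk, int ifindex,
                         const void *data, int len)
{
        struct nl_msg *msg = nlmsg_alloc();
        int family = genl_ctrl_resolve(sk, "NCSI");

        genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family, 0, 0,
                    NCSI_CMD_SEND_CMD, 0);
        nla_put_u32(msg, NCSI_ATTR_IFINDEX, ifindex);
        nla_put_u32(msg, NCSI_ATTR_PACKAGE_ID, 0);
        nla_put_u32(msg, NCSI_ATTR_CHANNEL_ID, 0);
        nla_put(msg, NCSI_ATTR_DATA, len, data);

        return nl_send_auto(sk, msg);
}

The caller is assumed to have done nl_socket_alloc() and genl_connect()
beforehand and to read the response/timeout message off the same socket.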

Re: [PATCH stable 4.9 00/29] backport of IP fragmentation fixes

2018-10-09 Thread Florian Fainelli



On 10/09/18 17:46, Eric Dumazet wrote:
> 
> 
> On 10/09/2018 03:48 PM, Florian Fainelli wrote:
>> This is based on Stephen's v4.14 patches, with the necessary merge
>> conflicts, and the lack of timer_setup() on the 4.9 baseline.
>>
>> Perf results on a gigabit capable system, before and after are below.
>>
>> Series can also be found here:
>>
>> https://github.com/ffainelli/linux/commits/fragment-stack-v4.9
>>
>>
>>PerfTop: 457 irqs/sec  kernel:74.4%  exact:  0.0% [4000Hz cycles],  
>> (all, 4 CPUs)
>> ---
>>
>> 29.62%  [kernel]   [k] ip_defrag  
>>  6.57%  [kernel]   [k] arch_cpu_idle  
>>  1.72%  [kernel]   [k] v7_dma_inv_range   
>>  1.68%  [kernel]   [k] __netif_receive_skb_core   
>>  1.43%  [kernel]   [k] fib_table_lookup   
>>  1.30%  [kernel]   [k] finish_task_switch 
>>  1.08%  [kernel]   [k] ip_rcv 
>>  1.01%  [kernel]   [k] skb_release_data   
>>  0.99%  [kernel]   [k] __slab_free
>>  0.96%  [kernel]   [k] bcm_sysport_poll   
>>  0.88%  [kernel]   [k] __netdev_alloc_skb 
>>  0.87%  [kernel]   [k] tick_nohz_idle_enter   
>>  0.86%  [kernel]   [k] dev_gro_receive
>>  0.85%  [kernel]   [k] _raw_spin_unlock_irqrestore
>>  0.84%  [kernel]   [k] __memzero  
>>  0.74%  [kernel]   [k] tick_nohz_idle_exit
>>  0.73%  ld-2.24.so [.] do_lookup_x
>>  0.66%  [kernel]   [k] kmem_cache_free
>>  0.66%  [kernel]   [k] bcm_sysport_rx_refill  
>>  0.65%  [kernel]   [k] eth_type_trans 
>>
>>
>> After patching:
>>
>>   PerfTop: 170 irqs/sec  kernel:86.5%  exact:  0.0% [4000Hz cycles],  
>> (all, 4 CPUs)
>> ---
>>
>>  7.79%  [kernel]   [k] arch_cpu_idle  
>>  5.14%  [kernel]   [k] v7_dma_inv_range   
>>  4.20%  [kernel]   [k] ip_defrag  
>>  3.89%  [kernel]   [k] __netif_receive_skb_core   
>>  3.65%  [kernel]   [k] fib_table_lookup   
>>  2.16%  [kernel]   [k] finish_task_switch 
>>  1.93%  [kernel]   [k] _raw_spin_unlock_irqrestore
>>  1.90%  [kernel]   [k] ip_rcv 
>>  1.84%  [kernel]   [k] bcm_sysport_poll   
>>  1.83%  [kernel]   [k] __memzero  
>>  1.65%  [kernel]   [k] __netdev_alloc_skb 
>>  1.60%  [kernel]   [k] __slab_free
>>  1.49%  [kernel]   [k] __do_softirq   
>>  1.49%  [kernel]   [k] bcm_sysport_rx_refill  
>>  1.31%  [kernel]   [k] dma_cache_maint_page   
>>  1.25%  [kernel]   [k] tick_nohz_idle_enter   
>>  1.24%  [kernel]   [k] ip_route_input_noref   
>>  1.17%  [kernel]   [k] eth_type_trans 
>>  1.06%  [kernel]   [k] fib_validate_source
>>  1.03%  [kernel]   [k] inet_frag_find
>>
>> Dan Carpenter (1):
>>   ipv4: frags: precedence bug in ip_expire()
>>
>> Eric Dumazet (22):
>>   inet: frags: change inet_frags_init_net() return value
>>   inet: frags: add a pointer to struct netns_frags
>>   inet: frags: refactor ipfrag_init()
>>   inet: frags: refactor ipv6_frag_init()
>>   inet: frags: refactor lowpan_net_frag_init()
>>   ipv6: export ip6 fragments sysctl to unprivileged users
>>   rhashtable: add schedule points
>>   inet: frags: use rhashtables for reassembly units
>>   inet: frags: remove some helpers
>>   inet: frags: get rid of inet_frag_evicting()
>>   inet: frags: remove inet_frag_maybe_warn_overflow()
>>   inet: frags: break the 2GB limit for frags storage
>>   inet: frags: do not clone skb in ip_expire()
>>   ipv6: frags: rewrite ip6_expire_frag_queue()
>>   rhashtable: reorganize struct rhashtable layout
>>   inet: frags: reorganize struct netns_frags
>>   inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
>>   inet: frags: fix ip6frag_low_thresh boundary
>>   net: speed up skb_rbtree_purge()
>>   net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
>>   net: add rb_to_skb() and other rb tree helpers
>>   net: sk_buff rbnode reorg
>>
>> Florian Westphal (1):
>>   ipv6: defrag: drop non-last frags smaller than min mtu
>>
>> Peter Oskolkov (4):
>>   ip: discard IPv4 datagrams with overlapping segments.
>>   net: modify skb_rbtree_purge to return the truesize of all purged
>> skbs.
>>   ip: add helpers to process in-order fragments faster.
>>   ip: process in-order fragments efficiently
>>
>> Taehee Yoo (1):
>>   ip: frags: fix crash in ip_do_fragment()
>>
> 
> Strange, I do not see "ip: use rb trees for IP frag queue." in this list ?

And it was 

Re: [PATCH net-next] net: enable RPS on vlan devices

2018-10-09 Thread Eric Dumazet



On 10/09/2018 07:11 PM, Shannon Nelson wrote:
> 
> Hence the reason we sent this as an RFC a couple of weeks ago.  We got no 
> response, so followed up with this patch in order to get some input. Do you 
> have any suggestions for how we might accomplish this in a less ugly way?

I dunno, maybe a modern way for all these very specific needs would be to use
an eBPF hook to implement whatever combination of RPS/RFS/what_have_you.

Then, we no longer have to review what various strategies are used by users.


Re: [PATCH net-next] net: enable RPS on vlan devices

2018-10-09 Thread Shannon Nelson

On 10/9/2018 6:04 PM, Eric Dumazet wrote:


On 10/09/2018 05:41 PM, Shannon Nelson wrote:

From: Silviu Smarandache 

This patch modifies the RPS processing code so that it searches
for a matching vlan interface on the packet and then uses the
RPS settings of the vlan interface.  If no vlan interface
is found or the vlan interface does not have RPS enabled,
it will fall back to the RPS settings of the underlying device.

In supporting VMs where we can't control the OS being used,
we'd like to separate the VM cpu processing from the host's
cpus as a way to help mitigate the impact of the L1TF issue.
When running the VM's traffic on a vlan we can stick the Rx
processing on one set of cpus separate from the VM's cpus.
Yes, choosing to use this may cause a bit of throughput pain
when the packets are actually passed into the VM and have to
move from one cache to another.

Orabug: 28645929

Signed-off-by: Silviu Smarandache 
Signed-off-by: Shannon Nelson 
---
  net/core/dev.c | 59 +++---
  1 file changed, 56 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 0b2d777..1da3f63 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3971,8 +3971,8 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
   * CPU from the RPS map of the receiving queue for a given skb.
   * rcu_read_lock must be held on entry.
   */
-static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
-  struct rps_dev_flow **rflowp)
+static int __get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
+struct rps_dev_flow **rflowp)
  {
const struct rps_sock_flow_table *sock_flow_table;
struct netdev_rx_queue *rxqueue = dev->_rx;
@@ -4066,6 +4066,35 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
return cpu;
  }
  
+static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
+  struct rps_dev_flow **rflowp)
+{
+   /* Check for a vlan device with RPS settings */
+   if (skb_vlan_tag_present(skb)) {
+   struct net_device *vdev;
+   u16 vid;
+
+   vid = skb_vlan_tag_get_id(skb);
+   vdev = __vlan_find_dev_deep_rcu(dev, skb->vlan_proto, vid);
+   if (vdev) {
+   /* recorded queue is not referring to the vlan device.
+* Save and restore it
+*/
+   int cpu;
+   u16 queue_mapping = skb_get_queue_mapping(skb);
+
+   skb_set_queue_mapping(skb, 0);
+   cpu = __get_rps_cpu(vdev, skb, rflowp);
+   skb_set_queue_mapping(skb, queue_mapping);


This is really ugly :/


Hence the reason we sent this as an RFC a couple of weeks ago.  We got 
no response, so followed up with this patch in order to get some input. 
Do you have any suggestions for how we might accomplish this in a less 
ugly way?




Also what makes vlan so special compared to say macvlan ?


Only that vlan was the itch that Silviu needed to scratch.  If we can 
solve this for vlan, then perhaps we'll have a template to follow for 
other upper devices.


sln



[PATCH v3] rxrpc: use correct kvec num while send response packet in rxrpc_reject_packets

2018-10-09 Thread YueHaibing
Fixes gcc '-Wunused-but-set-variable' warning:

net/rxrpc/output.c: In function 'rxrpc_reject_packets':
net/rxrpc/output.c:527:11: warning:
 variable 'ioc' set but not used [-Wunused-but-set-variable]

'ioc' is the correct kvec num when sending the response packet.

Fixes: ece64fec164f ("rxrpc: Emit BUSY packets when supposed to rather than 
ABORTs")
Signed-off-by: YueHaibing 
---
v3: remove 'commit' from Fixes info.
v2: use 'ioc' rather than remove it.
---
 net/rxrpc/output.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c
index e8fb892..a141ee3 100644
--- a/net/rxrpc/output.c
+++ b/net/rxrpc/output.c
@@ -572,7 +572,8 @@ void rxrpc_reject_packets(struct rxrpc_local *local)
whdr.flags  ^= RXRPC_CLIENT_INITIATED;
whdr.flags  &= RXRPC_CLIENT_INITIATED;
 
-   ret = kernel_sendmsg(local->socket, &msg, iov, 2, size);
+   ret = kernel_sendmsg(local->socket, &msg,
+iov, ioc, size);
if (ret < 0)
trace_rxrpc_tx_fail(local->debug_id, 0, ret,
rxrpc_tx_point_reject);



Re: [PATCH v2] rxrpc: use correct kvec num while send response packet in rxrpc_reject_packets

2018-10-09 Thread YueHaibing
On 2018/10/9 23:34, Sergei Shtylyov wrote:
> On 10/09/2018 05:15 PM, YueHaibing wrote:
> 
>> Fixes gcc '-Wunused-but-set-variable' warning:
>>
>> net/rxrpc/output.c: In function 'rxrpc_reject_packets':
>> net/rxrpc/output.c:527:11: warning:
>>  variable 'ioc' set but not used [-Wunused-but-set-variable]
>>
>> 'ioc' is the correct kvec num when sending the response packet.
>>
>> Fixes: commit ece64fec164f ("rxrpc: Emit BUSY packets when supposed to 
>> rather than 
> ABORTs")
> 
>"commit" not needed here.

Thank you for review.

> 
>> Signed-off-by: YueHaibing 
> [...]
> 
> MBR, Sergei
> 
> 
> 



Re: [sky2 driver] 88E8056 PCI-E Gigabit Ethernet Controller not working after suspend

2018-10-09 Thread Laurent Bigonville

On 9/10/18 at 22:09, Stephen Hemminger wrote:

On Tue, 9 Oct 2018 19:30:30 +0200
Laurent Bigonville  wrote:


Hello,

On my desktop (Asus MB with dual Ethernet port), when waking up after
suspend, the network card is not detecting the link.

I have to rmmod the sky2 driver and then modprobing it again.

lspci shows me:

04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E
Gigabit Ethernet Controller (rev 12)
05:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E
Gigabit Ethernet Controller (rev 12)

Any idea what's wrong here?

Kind regards,

Laurent Bigonville


I used to have that motherboard (about 8 years ago). Long dead by now.

There was some issue with how the power management worked. I forgot the
workaround; you might have to dig in the mailing list archive.


I've done some tests and it seems that this was working in 4.14 and then
broke in 4.15 (using the Debian kernel packages), so it was working not that
long ago:


The only commit I see to the sky2 driver is the following:

commit e99e88a9d2b067465adaa9c111ada99a041bef9a
Author: Kees Cook 
Date:   Mon Oct 16 14:43:17 2017 -0700

    treewide: setup_timer() -> timer_setup()

    This converts all remaining cases of the old setup_timer() API into
    using timer_setup(), where the callback argument is the structure already
    holding the struct timer_list. These should have no behavioral changes,
    since they just change which pointer is passed into the callback with
    the same available pointers after conversion. It handles the following
    examples, in addition to some other variations.




Re: [PATCH net-next] net: enable RPS on vlan devices

2018-10-09 Thread Eric Dumazet



On 10/09/2018 05:41 PM, Shannon Nelson wrote:
> From: Silviu Smarandache 
> 
> This patch modifies the RPS processing code so that it searches
> for a matching vlan interface on the packet and then uses the
> RPS settings of the vlan interface.  If no vlan interface
> is found or the vlan interface does not have RPS enabled,
> it will fall back to the RPS settings of the underlying device.
> 
> In supporting VMs where we can't control the OS being used,
> we'd like to separate the VM cpu processing from the host's
> cpus as a way to help mitigate the impact of the L1TF issue.
> When running the VM's traffic on a vlan we can stick the Rx
> processing on one set of cpus separate from the VM's cpus.
> Yes, choosing to use this may cause a bit of throughput pain
> when the packets are actually passed into the VM and have to
> move from one cache to another.
> 
> Orabug: 28645929
> 
> Signed-off-by: Silviu Smarandache 
> Signed-off-by: Shannon Nelson 
> ---
>  net/core/dev.c | 59 +++---
>  1 file changed, 56 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0b2d777..1da3f63 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3971,8 +3971,8 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
>   * CPU from the RPS map of the receiving queue for a given skb.
>   * rcu_read_lock must be held on entry.
>   */
> -static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
> -struct rps_dev_flow **rflowp)
> +static int __get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
> +  struct rps_dev_flow **rflowp)
>  {
>   const struct rps_sock_flow_table *sock_flow_table;
>   struct netdev_rx_queue *rxqueue = dev->_rx;
> @@ -4066,6 +4066,35 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
>   return cpu;
>  }
>  
> +static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
> +struct rps_dev_flow **rflowp)
> +{
> + /* Check for a vlan device with RPS settings */
> + if (skb_vlan_tag_present(skb)) {
> + struct net_device *vdev;
> + u16 vid;
> +
> + vid = skb_vlan_tag_get_id(skb);
> + vdev = __vlan_find_dev_deep_rcu(dev, skb->vlan_proto, vid);
> + if (vdev) {
> + /* recorded queue is not referring to the vlan device.
> +  * Save and restore it
> +  */
> + int cpu;
> + u16 queue_mapping = skb_get_queue_mapping(skb);
> +
> + skb_set_queue_mapping(skb, 0);
> + cpu = __get_rps_cpu(vdev, skb, rflowp);
> + skb_set_queue_mapping(skb, queue_mapping);

This is really ugly :/

Also what makes vlan so special compared to say macvlan ?



Re: [PATCH stable 4.9 00/29] backport of IP fragmentation fixes

2018-10-09 Thread Eric Dumazet



On 10/09/2018 03:48 PM, Florian Fainelli wrote:
> This is based on Stephen's v4.14 patches, with the necessary merge
> conflicts, and the lack of timer_setup() on the 4.9 baseline.
> 
> Perf results on a gigabit capable system, before and after are below.
> 
> Series can also be found here:
> 
> https://github.com/ffainelli/linux/commits/fragment-stack-v4.9
> 
> 
>PerfTop: 457 irqs/sec  kernel:74.4%  exact:  0.0% [4000Hz cycles],  
> (all, 4 CPUs)
> ---
> 
> 29.62%  [kernel]   [k] ip_defrag  
>  6.57%  [kernel]   [k] arch_cpu_idle  
>  1.72%  [kernel]   [k] v7_dma_inv_range   
>  1.68%  [kernel]   [k] __netif_receive_skb_core   
>  1.43%  [kernel]   [k] fib_table_lookup   
>  1.30%  [kernel]   [k] finish_task_switch 
>  1.08%  [kernel]   [k] ip_rcv 
>  1.01%  [kernel]   [k] skb_release_data   
>  0.99%  [kernel]   [k] __slab_free
>  0.96%  [kernel]   [k] bcm_sysport_poll   
>  0.88%  [kernel]   [k] __netdev_alloc_skb 
>  0.87%  [kernel]   [k] tick_nohz_idle_enter   
>  0.86%  [kernel]   [k] dev_gro_receive
>  0.85%  [kernel]   [k] _raw_spin_unlock_irqrestore
>  0.84%  [kernel]   [k] __memzero  
>  0.74%  [kernel]   [k] tick_nohz_idle_exit
>  0.73%  ld-2.24.so [.] do_lookup_x
>  0.66%  [kernel]   [k] kmem_cache_free
>  0.66%  [kernel]   [k] bcm_sysport_rx_refill  
>  0.65%  [kernel]   [k] eth_type_trans 
> 
> 
> After patching:
> 
>   PerfTop: 170 irqs/sec  kernel:86.5%  exact:  0.0% [4000Hz cycles],  
> (all, 4 CPUs)
> ---
> 
>  7.79%  [kernel]   [k] arch_cpu_idle  
>  5.14%  [kernel]   [k] v7_dma_inv_range   
>  4.20%  [kernel]   [k] ip_defrag  
>  3.89%  [kernel]   [k] __netif_receive_skb_core   
>  3.65%  [kernel]   [k] fib_table_lookup   
>  2.16%  [kernel]   [k] finish_task_switch 
>  1.93%  [kernel]   [k] _raw_spin_unlock_irqrestore
>  1.90%  [kernel]   [k] ip_rcv 
>  1.84%  [kernel]   [k] bcm_sysport_poll   
>  1.83%  [kernel]   [k] __memzero  
>  1.65%  [kernel]   [k] __netdev_alloc_skb 
>  1.60%  [kernel]   [k] __slab_free
>  1.49%  [kernel]   [k] __do_softirq   
>  1.49%  [kernel]   [k] bcm_sysport_rx_refill  
>  1.31%  [kernel]   [k] dma_cache_maint_page   
>  1.25%  [kernel]   [k] tick_nohz_idle_enter   
>  1.24%  [kernel]   [k] ip_route_input_noref   
>  1.17%  [kernel]   [k] eth_type_trans 
>  1.06%  [kernel]   [k] fib_validate_source
>  1.03%  [kernel]   [k] inet_frag_find
> 
> Dan Carpenter (1):
>   ipv4: frags: precedence bug in ip_expire()
> 
> Eric Dumazet (22):
>   inet: frags: change inet_frags_init_net() return value
>   inet: frags: add a pointer to struct netns_frags
>   inet: frags: refactor ipfrag_init()
>   inet: frags: refactor ipv6_frag_init()
>   inet: frags: refactor lowpan_net_frag_init()
>   ipv6: export ip6 fragments sysctl to unprivileged users
>   rhashtable: add schedule points
>   inet: frags: use rhashtables for reassembly units
>   inet: frags: remove some helpers
>   inet: frags: get rid of inet_frag_evicting()
>   inet: frags: remove inet_frag_maybe_warn_overflow()
>   inet: frags: break the 2GB limit for frags storage
>   inet: frags: do not clone skb in ip_expire()
>   ipv6: frags: rewrite ip6_expire_frag_queue()
>   rhashtable: reorganize struct rhashtable layout
>   inet: frags: reorganize struct netns_frags
>   inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
>   inet: frags: fix ip6frag_low_thresh boundary
>   net: speed up skb_rbtree_purge()
>   net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
>   net: add rb_to_skb() and other rb tree helpers
>   net: sk_buff rbnode reorg
> 
> Florian Westphal (1):
>   ipv6: defrag: drop non-last frags smaller than min mtu
> 
> Peter Oskolkov (4):
>   ip: discard IPv4 datagrams with overlapping segments.
>   net: modify skb_rbtree_purge to return the truesize of all purged
> skbs.
>   ip: add helpers to process in-order fragments faster.
>   ip: process in-order fragments efficiently
> 
> Taehee Yoo (1):
>   ip: frags: fix crash in ip_do_fragment()
>

Strange, I do not see "ip: use rb trees for IP frag queue." in this list ?

Thanks !



[PATCH net-next] net: enable RPS on vlan devices

2018-10-09 Thread Shannon Nelson
From: Silviu Smarandache 

This patch modifies the RPS processing code so that it searches
for a matching vlan interface on the packet and then uses the
RPS settings of the vlan interface.  If no vlan interface
is found or the vlan interface does not have RPS enabled,
it will fall back to the RPS settings of the underlying device.

In supporting VMs where we can't control the OS being used,
we'd like to separate the VM cpu processing from the host's
cpus as a way to help mitigate the impact of the L1TF issue.
When running the VM's traffic on a vlan we can stick the Rx
processing on one set of cpus separate from the VM's cpus.
Yes, choosing to use this may cause a bit of throughput pain
when the packets are actually passed into the VM and have to
move from one cache to another.

Orabug: 28645929

Signed-off-by: Silviu Smarandache 
Signed-off-by: Shannon Nelson 
---
 net/core/dev.c | 59 +++---
 1 file changed, 56 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 0b2d777..1da3f63 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3971,8 +3971,8 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
  * CPU from the RPS map of the receiving queue for a given skb.
  * rcu_read_lock must be held on entry.
  */
-static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
-  struct rps_dev_flow **rflowp)
+static int __get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
+struct rps_dev_flow **rflowp)
 {
const struct rps_sock_flow_table *sock_flow_table;
struct netdev_rx_queue *rxqueue = dev->_rx;
@@ -4066,6 +4066,35 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
return cpu;
 }
 
+static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
+  struct rps_dev_flow **rflowp)
+{
+   /* Check for a vlan device with RPS settings */
+   if (skb_vlan_tag_present(skb)) {
+   struct net_device *vdev;
+   u16 vid;
+
+   vid = skb_vlan_tag_get_id(skb);
+   vdev = __vlan_find_dev_deep_rcu(dev, skb->vlan_proto, vid);
+   if (vdev) {
+   /* recorded queue is not referring to the vlan device.
+* Save and restore it
+*/
+   int cpu;
+   u16 queue_mapping = skb_get_queue_mapping(skb);
+
+   skb_set_queue_mapping(skb, 0);
+   cpu = __get_rps_cpu(vdev, skb, rflowp);
+   skb_set_queue_mapping(skb, queue_mapping);
+   if (cpu != -1)
+   return cpu;
+   }
+   }
+
+   /* Fall back to RPS settings of original device */
+   return __get_rps_cpu(dev, skb, rflowp);
+}
+
 #ifdef CONFIG_RFS_ACCEL
 
 /**
@@ -4437,12 +4466,23 @@ static int netif_rx_internal(struct sk_buff *skb)
preempt_disable();
rcu_read_lock();
 
+   /* strip any vlan tag before calling get_rps_cpu() */
+   if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
+   skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
+   skb = skb_vlan_untag(skb);
+   if (unlikely(!skb)) {
+   ret = NET_RX_DROP;
+   goto unlock;
+   }
+   }
+
cpu = get_rps_cpu(skb->dev, skb, &rflow);
if (cpu < 0)
cpu = smp_processor_id();
 
ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 
+unlock:
rcu_read_unlock();
preempt_enable();
} else
@@ -5095,8 +5135,19 @@ static int netif_receive_skb_internal(struct sk_buff 
*skb)
 #ifdef CONFIG_RPS
if (static_key_false(&rps_needed)) {
struct rps_dev_flow voidflow, *rflow = &voidflow;
-   int cpu = get_rps_cpu(skb->dev, skb, &rflow);
+   int cpu;
+
+   /* strip any vlan tag before calling get_rps_cpu() */
+   if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
+   skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
+   skb = skb_vlan_untag(skb);
+   if (unlikely(!skb)) {
+   ret = NET_RX_DROP;
+   goto out;
+   }
+   }
 
+   cpu = get_rps_cpu(skb->dev, skb, &rflow);
if (cpu >= 0) {
ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
rcu_read_unlock();
@@ -5105,6 +5156,8 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
}
 #endif
ret = __netif_receive_skb(skb);
+
+out:
rcu_read_unlock();
return ret;
 }
-- 
2.7.4
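
With the patch applied, steering would be configured through the existing
per-queue sysfs knob, only now on the vlan device itself; a hypothetical
example, assuming a vlan device eth0.100 whose Rx processing should land on
CPUs 0-3:

  # steer eth0.100's receive processing to CPUs 0-3; the underlying
  # eth0 rps_cpus mask can stay 0 and acts only as the fallback
  echo f > /sys/class/net/eth0.100/queues/rx-0/rps_cpus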



[PATCH net 2/2] net: dsa: bcm_sf2: Call setup during switch resume

2018-10-09 Thread Florian Fainelli
There is no reason to open code what the switch setup function does, in
fact, because we just issued a switch reset, we would make all the
register get their default values, including for instance, having unused
port be enabled again and wasting power and leading to an inappropriate
switch core clock being selected.

Fixes: 8cfa94984c9c ("net: dsa: bcm_sf2: add suspend/resume callbacks")
Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2.c | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index b6d8e849a949..fc8b48adf38b 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -703,7 +703,6 @@ static int bcm_sf2_sw_suspend(struct dsa_switch *ds)
 static int bcm_sf2_sw_resume(struct dsa_switch *ds)
 {
struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
-   unsigned int port;
int ret;
 
ret = bcm_sf2_sw_rst(priv);
@@ -715,14 +714,7 @@ static int bcm_sf2_sw_resume(struct dsa_switch *ds)
if (priv->hw_params.num_gphy == 1)
bcm_sf2_gphy_enable_set(ds, true);
 
-   for (port = 0; port < DSA_MAX_PORTS; port++) {
-   if (dsa_is_user_port(ds, port))
-   bcm_sf2_port_setup(ds, port, NULL);
-   else if (dsa_is_cpu_port(ds, port))
-   bcm_sf2_imp_setup(ds, port);
-   }
-
-   bcm_sf2_enable_acb(ds);
+   ds->ops->setup(ds);
 
return 0;
 }
-- 
2.17.1



[PATCH net 1/2] net: dsa: bcm_sf2: Fix unbind ordering

2018-10-09 Thread Florian Fainelli
The order in which we release resources is unfortunately leading to bus
errors while dismantling the port. This is because we set
priv->wol_ports_mask to 0 to tell bcm_sf2_sw_suspend() that it is now
permissible to clock gate the switch. Later on, when dsa_slave_destroy()
comes in from dsa_unregister_switch() and calls
dsa_switch_ops::port_disable, we perform the same dismantling again, and
this time we hit registers that are clock gated.

Make sure that dsa_unregister_switch() is the first thing that happens,
which takes care of releasing all user visible resources, then proceed
with clock gating hardware. We still need to set priv->wol_ports_mask to
0 to make sure that an enabled port properly gets disabled in case it
was previously used as part of Wake-on-LAN.

Fixes: d9338023fb8e ("net: dsa: bcm_sf2: Make it a real platform device driver")
Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index e0066adcd2f3..b6d8e849a949 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -1173,10 +1173,10 @@ static int bcm_sf2_sw_remove(struct platform_device *pdev)
 {
struct bcm_sf2_priv *priv = platform_get_drvdata(pdev);
 
-   /* Disable all ports and interrupts */
priv->wol_ports_mask = 0;
-   bcm_sf2_sw_suspend(priv->dev->ds);
dsa_unregister_switch(priv->dev->ds);
+   /* Disable all ports and interrupts */
+   bcm_sf2_sw_suspend(priv->dev->ds);
bcm_sf2_mdio_unregister(priv);
 
return 0;
-- 
2.17.1



[PATCH net 0/2] net: dsa: bcm_sf2: Couple of fixes

2018-10-09 Thread Florian Fainelli
Hi David,

Here are two fixes for the bcm_sf2 driver that were found during testing
unbind and analysing another issue during system suspend/resume.

Thanks!

Florian Fainelli (2):
  net: dsa: bcm_sf2: Fix unbind ordering
  net: dsa: bcm_sf2: Call setup during switch resume

 drivers/net/dsa/bcm_sf2.c | 14 +++---
 1 file changed, 3 insertions(+), 11 deletions(-)

-- 
2.17.1



Re: [PATCH bpf-next] tools/bpf: use proper type and uapi perf_event.h header for libbpf

2018-10-09 Thread Song Liu



> On Oct 9, 2018, at 4:14 PM, Yonghong Song  wrote:
> 
> Use __u32 instead of u32 in libbpf.c and also use
> uapi perf_event.h instead of tools/perf/perf-sys.h.
> 
> Signed-off-by: Yonghong Song 

Acked-by: Song Liu 

> ---
> tools/lib/bpf/Makefile | 2 +-
> tools/lib/bpf/libbpf.c | 8 
> 2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
> index 6ad27257fd67..79d84413ddf2 100644
> --- a/tools/lib/bpf/Makefile
> +++ b/tools/lib/bpf/Makefile
> @@ -69,7 +69,7 @@ FEATURE_USER = .libbpf
> FEATURE_TESTS = libelf libelf-mmap bpf reallocarray
> FEATURE_DISPLAY = libelf bpf
> 
> -INCLUDES = -I. -I$(srctree)/tools/include -I$(srctree)/tools/arch/$(ARCH)/include/uapi -I$(srctree)/tools/include/uapi -I$(srctree)/tools/perf
> +INCLUDES = -I. -I$(srctree)/tools/include -I$(srctree)/tools/arch/$(ARCH)/include/uapi -I$(srctree)/tools/include/uapi
> FEATURE_CHECK_CFLAGS-bpf = $(INCLUDES)
> 
> check_feat := 1
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index ceb918c14d80..176cf5523728 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -19,7 +19,6 @@
> #include 
> #include 
> #include 
> -#include 
> #include 
> #include 
> #include 
> @@ -27,6 +26,7 @@
> #include 
> #include 
> #include 
> +#include 
> #include 
> #include 
> #include 
> @@ -169,7 +169,7 @@ static LIST_HEAD(bpf_objects_list);
> 
> struct bpf_object {
>   char license[64];
> - u32 kern_version;
> + __u32 kern_version;
> 
>   struct bpf_program *programs;
>   size_t nr_programs;
> @@ -540,7 +540,7 @@ static int
> bpf_object__init_kversion(struct bpf_object *obj,
> void *data, size_t size)
> {
> - u32 kver;
> + __u32 kver;
> 
>   if (size != sizeof(kver)) {
>   pr_warning("invalid kver section in %s\n", obj->path);
> @@ -1295,7 +1295,7 @@ static int bpf_object__collect_reloc(struct bpf_object *obj)
> static int
> load_program(enum bpf_prog_type type, enum bpf_attach_type expected_attach_type,
>const char *name, struct bpf_insn *insns, int insns_cnt,
> -  char *license, u32 kern_version, int *pfd, int prog_ifindex)
> +  char *license, __u32 kern_version, int *pfd, int prog_ifindex)
> {
>   struct bpf_load_program_attr load_attr;
>   char *cp, errmsg[STRERR_BUFSIZE];
> -- 
> 2.17.1
> 



Re: [PATCH net-next] net/ipv6: Add knob to skip DELROUTE message on device down

2018-10-09 Thread David Ahern
On 10/9/18 3:27 PM, David Ahern wrote:
> From: David Ahern 
> 
> Another difference between IPv4 and IPv6 is the generation of RTM_DELROUTE
> notifications when a device is taken down (admin down) or deleted. IPv4
> does not generate a message for routes evicted by the down or delete;
> IPv6 does. A NOS at scale really needs to avoid these messages and have
> IPv4 and IPv6 behave similarly, relying on userspace to handle link
> notifications and evict the routes.
> 
> At this point existing user behavior needs to be preserved. Since
> notifications are a global action (not per app) the only way to preserve
> existing behavior and allow the messages to be skipped is to add a new
> sysctl (net/ipv6/route/skip_notify_on_dev_down) which can be set to
> disable the notifications.
> 
> IPv6 route code already supports the option to skip the message (it is
> used for multipath routes for example). Besides the new sysctl we need
> to pass the skip_notify setting through the generic fib6_clean and
> fib6_walk functions to fib6_clean_node and to set skip_notify on calls
> to __ip_del_rt for the addrconf_ifdown path.
> 
> Signed-off-by: David Ahern 
> ---
>  Documentation/networking/ip-sysctl.txt |  8 +++
>  include/net/addrconf.h |  3 ++-
>  include/net/ip6_fib.h  |  3 +++
>  include/net/ip6_route.h|  1 +
>  include/net/netns/ipv6.h   |  1 +
>  net/ipv6/addrconf.c| 44 ++
>  net/ipv6/anycast.c | 10 +---
>  net/ipv6/ip6_fib.c | 20 
>  net/ipv6/route.c   | 30 ++-

I should have noticed this before sending the patch: the addrconf and
anycast changes are not needed. addrconf_ifdown calls rt6_disable_ip
which calls rt6_sync_down_dev. The last one evicts all routes for the
device, so the delete route calls done later in addrconf and anycast are
superfluous.
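
For reference, the knob would then be flipped like any other sysctl; a
sketch, assuming the semantics described in the patch (0 keeps today's
RTM_DELROUTE notifications, 1 skips them, matching IPv4 behavior):

  # stop generating per-route RTM_DELROUTE messages on device down
  sysctl -w net.ipv6.route.skip_notify_on_dev_down=1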


[PATCH bpf-next] tools/bpf: use proper type and uapi perf_event.h header for libbpf

2018-10-09 Thread Yonghong Song
Use __u32 instead of u32 in libbpf.c and also use
uapi perf_event.h instead of tools/perf/perf-sys.h.

Signed-off-by: Yonghong Song 
---
 tools/lib/bpf/Makefile | 2 +-
 tools/lib/bpf/libbpf.c | 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
index 6ad27257fd67..79d84413ddf2 100644
--- a/tools/lib/bpf/Makefile
+++ b/tools/lib/bpf/Makefile
@@ -69,7 +69,7 @@ FEATURE_USER = .libbpf
 FEATURE_TESTS = libelf libelf-mmap bpf reallocarray
 FEATURE_DISPLAY = libelf bpf
 
-INCLUDES = -I. -I$(srctree)/tools/include -I$(srctree)/tools/arch/$(ARCH)/include/uapi -I$(srctree)/tools/include/uapi -I$(srctree)/tools/perf
+INCLUDES = -I. -I$(srctree)/tools/include -I$(srctree)/tools/arch/$(ARCH)/include/uapi -I$(srctree)/tools/include/uapi
 FEATURE_CHECK_CFLAGS-bpf = $(INCLUDES)
 
 check_feat := 1
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index ceb918c14d80..176cf5523728 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -19,7 +19,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -27,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -169,7 +169,7 @@ static LIST_HEAD(bpf_objects_list);
 
 struct bpf_object {
char license[64];
-   u32 kern_version;
+   __u32 kern_version;
 
struct bpf_program *programs;
size_t nr_programs;
@@ -540,7 +540,7 @@ static int
 bpf_object__init_kversion(struct bpf_object *obj,
  void *data, size_t size)
 {
-   u32 kver;
+   __u32 kver;
 
if (size != sizeof(kver)) {
pr_warning("invalid kver section in %s\n", obj->path);
@@ -1295,7 +1295,7 @@ static int bpf_object__collect_reloc(struct bpf_object *obj)
 static int
 load_program(enum bpf_prog_type type, enum bpf_attach_type expected_attach_type,
 const char *name, struct bpf_insn *insns, int insns_cnt,
-char *license, u32 kern_version, int *pfd, int prog_ifindex)
+char *license, __u32 kern_version, int *pfd, int prog_ifindex)
 {
struct bpf_load_program_attr load_attr;
char *cp, errmsg[STRERR_BUFSIZE];
-- 
2.17.1



[PATCH net-next v2 2/2] FDDI: defza: Support capturing outgoing SMT traffic

2018-10-09 Thread Maciej W. Rozycki
DEC FDDIcontroller 700 (DEFZA) uses a Tx/Rx queue pair to communicate 
SMT frames with the adapter's firmware.  Any SMT frame received from the 
RMC via the Rx queue is queued back by the driver to the SMT Rx queue for 
the firmware to process.  Similarly the firmware uses the SMT Tx queue 
to supply the driver with SMT frames which are queued back to the Tx 
queue for the RMC to send to the ring.

When a network tap is attached to an FDDI interface handled by `defza', 
any incoming SMT frames captured are queued to our usual processing of 
network data received, which in turn delivers them to any listening 
taps.

However the outgoing SMT frames produced by the firmware bypass our 
network protocol stack and are therefore not delivered to taps.  This in 
turn means that taps are missing a part of network traffic sent by the 
adapter, which may make it more difficult to track down network problems 
or do general traffic analysis.

Call `dev_queue_xmit_nit' then in the SMT Tx path, having checked that
a network tap is attached, with a newly-created `dev_nit_active' helper
wrapping the usual condition used in the transmit path.

Signed-off-by: Maciej W. Rozycki 
---
New in v2.
---
 drivers/net/fddi/defza.c  |   33 +++--
 include/linux/netdevice.h |1 +
 net/core/dev.c|   13 -
 3 files changed, 44 insertions(+), 3 deletions(-)

linux-defza-1.1.4-pcap.patch
Index: net-next-20181008-4maxp64/drivers/net/fddi/defza.c
===
--- net-next-20181008-4maxp64.orig/drivers/net/fddi/defza.c
+++ net-next-20181008-4maxp64/drivers/net/fddi/defza.c
@@ -797,11 +797,40 @@ static void fza_tx_smt(struct net_device
smt_tx_ptr = fp->mmio + readl_u(&fp->ring_smt_tx[i].buffer);
len = readl_u(&fp->ring_smt_tx[i].rmc) & FZA_RING_PBC_MASK;
 
-   /* Queue the frame to the RMC transmit ring. */
-   if (!netif_queue_stopped(dev))
+   if (!netif_queue_stopped(dev)) {
+   if (dev_nit_active(dev)) {
+   struct sk_buff *skb;
+
+   /* Length must be a multiple of 4 as only word
+* reads are permitted!
+*/
+   skb = fza_alloc_skb_irq(dev, (len + 3) & ~3);
+   if (!skb)
+   goto err_no_skb;/* Drop. */
+
+   skb_data_ptr = (struct fza_buffer_tx *)
+  skb->data;
+
+   fza_reads(smt_tx_ptr, skb_data_ptr,
+ (len + 3) & ~3);
+   skb->dev = dev;
+   skb_reserve(skb, 3);/* Skip over PRH. */
+   skb_put(skb, len - 3);
+   skb_reset_network_header(skb);
+
+   dev_queue_xmit_nit(skb, dev);
+
+   dev_kfree_skb_irq(skb);
+
+err_no_skb:
+   ;
+   }
+
+   /* Queue the frame to the RMC transmit ring. */
fza_do_xmit((union fza_buffer_txp)
{ .mmio_ptr = smt_tx_ptr },
len, dev, 1);
+   }
 
writel_o(FZA_RING_OWN_FZA, &fp->ring_smt_tx[i].own);
fp->ring_smt_tx_index =
Index: net-next-20181008-4maxp64/include/linux/netdevice.h
===
--- net-next-20181008-4maxp64.orig/include/linux/netdevice.h
+++ net-next-20181008-4maxp64/include/linux/netdevice.h
@@ -3632,6 +3632,7 @@ static __always_inline int dev_forwa
return 0;
 }
 
+bool dev_nit_active(struct net_device *dev);
 void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev);
 
 extern int netdev_budget;
Index: net-next-20181008-4maxp64/net/core/dev.c
===
--- net-next-20181008-4maxp64.orig/net/core/dev.c
+++ net-next-20181008-4maxp64/net/core/dev.c
@@ -1954,6 +1954,17 @@ static inline bool skb_loop_sk(struct pa
return false;
 }
 
+/**
+ * dev_nit_active - return true if any network interface taps are in use
+ *
+ * @dev: network device to check for the presence of taps
+ */
+bool dev_nit_active(struct net_device *dev)
+{
+   return !list_empty(&ptype_all) || !list_empty(&dev->ptype_all);
+}
+EXPORT_SYMBOL_GPL(dev_nit_active);
+
 /*
  * Support routine. Sends outgoing frames to any network
  * taps currently in use.
@@ -3211,7 +3222,7 @@ static int xmit_one(struct sk_buff *skb,
unsigned int len;
int rc;
 
-   if (!list_empty(&ptype_all) || !list_empty(&dev->ptype_all))
+   if (dev_nit_active(dev))

[PATCH net-next v2 1/2] FDDI: defza: Add support for DEC FDDIcontroller 700 TURBOchannel adapter

2018-10-09 Thread Maciej W. Rozycki
Add support for the DEC FDDIcontroller 700 (DEFZA), Digital Equipment 
Corporation's first-generation FDDI network interface adapter, made for 
TURBOchannel and based on a discrete version of what eventually became 
Motorola's widely used CAMEL chipset.

The CAMEL chipset is present for example in the DEC FDDIcontroller 
TURBOchannel, EISA and PCI adapters (DEFTA/DEFEA/DEFPA) that we support 
with the `defxx' driver, however the host bus interface logic and the 
firmware API are different in the DEFZA and hence a separate driver is 
required.

There isn't much to say about the driver except that it works, but there 
is one peculiarity to mention.  The adapter implements two Tx/Rx queue 
pairs.

Of these one pair is the usual network Tx/Rx queue pair, in this case 
used by the adapter to exchange frames with the ring, via the RMC (Ring 
Memory Controller) chip.  The Tx queue is handled directly by the RMC 
chip and resides in onboard packet memory.  The Rx queue is maintained 
via DMA in host memory by adapter's firmware copying received data 
stored by the RMC in onboard packet memory.

The other pair is used to communicate SMT frames with adapter's 
firmware.  Any SMT frame received from the RMC via the Rx queue must be 
queued back by the driver to the SMT Rx queue for the firmware to 
process.  Similarly the firmware uses the SMT Tx queue to supply the 
driver with SMT frames that must be queued back to the Tx queue for the 
RMC to send to the ring.

This solution was chosen because the designers ran out of PCB space and 
could not squeeze in more logic onto the board that would be required to 
handle this SMT frame traffic without the need to involve the driver, as 
with the later DEFTA/DEFEA/DEFPA adapters.

Finally the driver does some Frame Control byte decoding, so to avoid 
magic numbers some macros are added to .

Signed-off-by: Maciej W. Rozycki 
---
Changes from v1:

- driver version 1.1.4, due to completed 700-C adapter variant support,

- DEC FDDIcontroller 700-C support verified and driver code updated
  accordingly: PMD types added and the associated debug message removed,

- the spelling of 700C corrected to 700-C throughout, according to board 
  marking and documentation,

- driver notes (defza.txt) updated, reworded and reformatted for 72
  columns,

- `fza_start_xmit' return type now `netdev_tx_t',

- switched from the removed `init_timer' interface to `timer_setup',

- reworked MMIO accesses according to memory ordering guarantees,

- outgoing SMT traffic packet capturing split off to 2/2,

- reformatted to shut up `checkpatch.pl',

- change description revised.

---
 Documentation/networking/00-INDEX  |2 
 Documentation/networking/defza.txt |   57 +
 MAINTAINERS|5 
 drivers/net/fddi/Kconfig   |   11 
 drivers/net/fddi/Makefile  |1 
 drivers/net/fddi/defza.c   | 1535 +
 drivers/net/fddi/defza.h   |  791 +++
 include/uapi/linux/if_fddi.h   |   21 
 8 files changed, 2420 insertions(+), 3 deletions(-)

linux-defza-1.1.4.patch
Index: net-next-20181008-4maxp64/Documentation/networking/00-INDEX
===
--- net-next-20181008-4maxp64.orig/Documentation/networking/00-INDEX
+++ net-next-20181008-4maxp64/Documentation/networking/00-INDEX
@@ -56,6 +56,8 @@ de4x5.txt
- the Digital EtherWORKS DE4?? and DE5?? PCI Ethernet driver
 decnet.txt
- info on using the DECnet networking layer in Linux.
+defza.txt
+   - the DEC FDDIcontroller 700 (DEFZA-xx) TURBOchannel FDDI driver
 dl2k.txt
- README for D-Link DL2000-based Gigabit Ethernet Adapters (dl2k.ko).
 dm9000.txt
Index: net-next-20181008-4maxp64/Documentation/networking/defza.txt
===
--- /dev/null
+++ net-next-20181008-4maxp64/Documentation/networking/defza.txt
@@ -0,0 +1,57 @@
+Notes on the DEC FDDIcontroller 700 (DEFZA-xx) driver v.1.1.4.
+
+
+DEC FDDIcontroller 700 is DEC's first-generation TURBOchannel FDDI
+network card, designed in 1990 specifically for the DECstation 5000
+model 200 workstation.  The board is a single attachment station and
+it was manufactured in two variations, both of which are supported.
+
+First is the SAS MMF DEFZA-AA option, the original design implementing
+the standard MMF-PMD, however with a pair of ST connectors rather than
+the usual MIC connector.  The other one is the SAS ThinWire/STP DEFZA-CA
+option, denoted 700-C, with the network medium selectable by a switch
+between the DEC proprietary ThinWire-PMD using a BNC connector and the
+standard STP-PMD using a DE-9F connector.  This option can interface to
+a DECconcentrator 500 device and, in the case of the STP-PMD, also other
+FDDI equipment and was designed to make it easier to transition from
+existing IEEE 802.3 10BASE2 Ethernet and IEEE 802.5 Token Ring networks
+by providing means to 

[PATCH net-next v2 0/2] FDDI: DEC FDDIcontroller 700 TURBOchannel adapter support

2018-10-09 Thread Maciej W. Rozycki
Hi,

 This is an update to .  I 
believe I have addressed all the requests made in the previous review 
round.

 There is still one `checkpatch.pl' warning remaining:

WARNING: quoted string split across lines
#1652: FILE: drivers/net/fddi/defza.c:1442:
+   pr_info("%s: ROM rev. %.4s, firmware rev. %.4s, RMC rev. %.4s, "
+   "SMT ver. %u\n", fp->name, rom_rev, fw_rev, rmc_rev, smt_ver);

total: 0 errors, 1 warnings, 2458 lines checked

however I think the value of staying within 80 columns is higher than the 
value of having the string on a single line.  This is because with all the 
formatting specifiers there it is not directly greppable based on the 
final output produced to the kernel log on one hand, e.g.:

tc2: ROM rev. 1.0, firmware rev. 1.2, RMC rev. A, SMT ver. 1

while it can be easily tracked down by grepping for an obvious substring 
such as "RMC rev" on the other.

 The issue with MMIO barriers I discussed in the course of the original 
review turned out mostly irrelevant to this driver, because as I have 
learnt in a recent Alpha/Linux discussion starting here: 

 
our MMIO API mandates the `readX' and `writeX' accessors to be strongly 
ordered with respect to each other, even if that is not implicitly 
enforced by hardware.

 Consequently I have removed all the explicit ordering barriers and 
instead submitted a fix for MIPS MMIO implementation, which currently does 
not guarantee strong ordering (the MIPS architecture does not define bus
ordering rules except in terms of SYNC barriers), as recorded here: 
.  

 Enforcing strong MMIO ordering can be costly however and is often 
unnecessary, e.g. when using PIO to access network frame data in onboard 
packet memory.  I have therefore retained the information that would be 
lost by the removal of barriers, by defining accessor wrappers suffixed by 
`_o' and `_u', for accesses that have to be ordered and can be unordered 
respectively.

 If we ever have an API defined for weakly-ordered MMIO accesses, then 
these wrappers can be redefined accordingly.  Right now they all expand to 
the respective `_relaxed' accessors, because, again, enforcing the 
ordering WRT DMA transfers can be costly and we don't need it here except 
in one place, where I chose to use explicit `dma_rmb' instead.

 Similarly I have replaced the completion barriers with a read back from 
the respective MMIO location (all adapter MMIO registers can be read with 
no side effects incurred), which will serve its purpose on the basis of 
MMIO being strongly ordered (although a read from TURBOchannel is going to 
be slower than `iob', making the delay incurred unnecessarily longer).
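
 For illustration, the read-back mentioned above is the classic way to
flush a posted MMIO write; a generic sketch only, not actual defza code
(the register pointer is hypothetical):

        /* The read cannot complete until the preceding posted write
         * has reached the device, so reading back any side-effect-free
         * register doubles as a completion barrier.
         */
        writel(val, csr_reg);   /* may be posted by the bus */
        (void)readl(csr_reg);   /* forces the write out to the device */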

 And last but not least, I have split off the SMT Tx network tap support 
to a separate change, 2/2 in this series, so that it does not block the 
driver proper and can be discussed separately.

 I think it has value in that it makes the view of the outgoing network 
traffic complete, as if one actually physically tapped into the outgoing 
line of the ring, between the station being examined and its downstream 
neighbour.  Without this part only traffic passed from applications 
through the whole protocol stack can be captured and this is only a part 
of the view.

 With the `dev_queue_xmit_nit' interface now exported it's only 
`ptype_all' that remains private, and to define a properly abstracted API 
I propose to provide an exported `dev_nit_active' predicate that tells 
whether any taps are active.  This predicate is then used accordingly.

 NB if there is a long-term maintenance concern about the `dev_nit_active' 
predicate, then well, corresponding inline code currently present in 
`xmit_one' has to be maintained anyway, and if the resulting changes 
require `defza' to be updated accordingly, then I am going to handle it; 
after some 20 years with Linux it's not that I am going to disappear 
anywhere anytime.  And once I am dead, which is inevitably going to happen 
sooner or later, then the driver can simply be ripped from the kernel.  
Though I suspect that at that point no DECstation Linux users may survive 
anymore, even though hardware, being as sturdy as it is, likely will.

 I have a patch for `tcpdump' to actually decode SMT frames, which I plan 
to upstream sometime.  Here's a sample of SMT traffic captured through the 
`defza' driver in a small network of 4 stations and no concentrators, 
printed in the most verbose mode:

01:16:59.138381 4f 00:60:b0:58:41:e7 00:60:b0:58:41:e7 73: SMT NIF ann vid:1 
tid:0270 sid:00-00-00-60-b0-58-41-e7 len:40: UNA: 00 00 00 06 0d 1a 02 ae 
StationDescr: 00 01 02 00 StationState: 00 00 30 00 MACFrameStatusFunctions.3: 
00 00 00 01
01:17:00.332750 4f 08:00:2b:a3:a3:29 

[PATCH stable 4.9 27/29] ip: add helpers to process in-order fragments faster.

2018-10-09 Thread Florian Fainelli
From: Peter Oskolkov 

This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run, as is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).

Reported-by: Willem de Bruijn 
Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
(cherry picked from commit 353c9cb360874e737fb000545f783df756c06f9a)
---
 include/net/inet_frag.h |  6 
 net/ipv4/ip_fragment.c  | 73 +
 2 files changed, 79 insertions(+)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 1ff0433d94a7..a3812e9c8fee 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -56,7 +56,9 @@ struct frag_v6_compare_key {
  * @lock: spinlock protecting this frag
  * @refcnt: reference count of the queue
  * @fragments: received fragments head
+ * @rb_fragments: received fragments rb-tree root
  * @fragments_tail: received fragments tail
+ * @last_run_head: the head of the last "run". see ip_fragment.c
  * @stamp: timestamp of the last received fragment
  * @len: total length of the original datagram
  * @meat: length of received fragments so far
@@ -77,6 +79,7 @@ struct inet_frag_queue {
struct sk_buff  *fragments;  /* Used in IPv6. */
struct rb_root  rb_fragments; /* Used in IPv4. */
struct sk_buff  *fragments_tail;
+   struct sk_buff  *last_run_head;
ktime_t stamp;
int len;
int meat;
@@ -112,6 +115,9 @@ void inet_frag_kill(struct inet_frag_queue *q);
 void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key);
 
+/* Free all skbs in the queue; return the sum of their truesizes. */
+unsigned int inet_frag_rbtree_purge(struct rb_root *root);
+
 static inline void inet_frag_put(struct inet_frag_queue *q)
 {
if (atomic_dec_and_test(&q->refcnt))
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 4d243fcb02f7..e20a5afaf6e5 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -56,6 +56,57 @@
  */
 static const char ip_frag_cache_name[] = "ip4-frags";
 
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+   struct inet_skb_parm    h;
+   struct sk_buff  *next_frag;
+   int frag_run_len;
+};
+
+#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void ip4_frag_init_run(struct sk_buff *skb)
+{
+   BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+
+   FRAG_CB(skb)->next_frag = NULL;
+   FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void ip4_frag_append_to_last_run(struct inet_frag_queue *q,
+   struct sk_buff *skb)
+{
+   RB_CLEAR_NODE(&skb->rbnode);
+   FRAG_CB(skb)->next_frag = NULL;
+
+   FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+   FRAG_CB(q->fragments_tail)->next_frag = skb;
+   q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void ip4_frag_create_run(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+   if (q->last_run_head)
+   rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+                &q->last_run_head->rbnode.rb_right);
+   else
+   rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+   rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+   ip4_frag_init_run(skb);
+   q->fragments_tail = skb;
+   q->last_run_head = skb;
+}
+
 /* Describe an entry in the "incomplete datagrams" queue. */
 struct ipq {
struct inet_frag_queue q;
@@ -652,6 +703,28 @@ struct sk_buff *ip_check_defrag(struct net *net, struct 
sk_buff *skb, u32 user)
 }
 EXPORT_SYMBOL(ip_check_defrag);
 
+unsigned int inet_frag_rbtree_purge(struct 

[PATCH stable 4.9 28/29] ip: process in-order fragments efficiently

2018-10-09 Thread Florian Fainelli
From: Peter Oskolkov 

This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H  -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s

Reported-by: Willem de Bruijn 
Signed-off-by: Peter Oskolkov 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
(cherry picked from commit a4fd284a1f8fd4b6c59aa59db2185b1e17c5c11c)
---
 net/ipv4/inet_fragment.c |   2 +-
 net/ipv4/ip_fragment.c   | 110 ---
 2 files changed, 70 insertions(+), 42 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 535fa57af51e..8323d33c0ce2 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -145,7 +145,7 @@ void inet_frag_destroy(struct inet_frag_queue *q)
fp = xp;
} while (fp);
} else {
-   sum_truesize = skb_rbtree_purge(&q->rb_fragments);
+   sum_truesize = inet_frag_rbtree_purge(&q->rb_fragments);
}
sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index e20a5afaf6e5..8f899c13a392 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -125,8 +125,8 @@ static u8 ip4_frag_ecn(u8 tos)
 
 static struct inet_frags ip4_frags;
 
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
-struct net_device *dev);
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
+struct sk_buff *prev_tail, struct net_device *dev);
 
 
 static void ip4_frag_init(struct inet_frag_queue *q, const void *a)
@@ -217,7 +217,12 @@ static void ip_expire(unsigned long arg)
	head = skb_rb_first(&qp->q.rb_fragments);
if (!head)
goto out;
-   rb_erase(&head->rbnode, &qp->q.rb_fragments);
+   if (FRAG_CB(head)->next_frag)
+   rb_replace_node(&head->rbnode,
+   &FRAG_CB(head)->next_frag->rbnode,
+   &qp->q.rb_fragments);
+   else
+   rb_erase(&head->rbnode, &qp->q.rb_fragments);
	memset(&head->rbnode, 0, sizeof(head->rbnode));
barrier();
}
@@ -318,7 +323,7 @@ static int ip_frag_reinit(struct ipq *qp)
return -ETIMEDOUT;
}
 
-   sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
+   sum_truesize = inet_frag_rbtree_purge(&qp->q.rb_fragments);
sub_frag_mem_limit(qp->q.net, sum_truesize);
 
qp->q.flags = 0;
@@ -327,6 +332,7 @@ static int ip_frag_reinit(struct ipq *qp)
qp->q.fragments = NULL;
qp->q.rb_fragments = RB_ROOT;
qp->q.fragments_tail = NULL;
+   qp->q.last_run_head = NULL;
qp->iif = 0;
qp->ecn = 0;
 
@@ -338,7 +344,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 {
struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct rb_node **rbn, *parent;
-   struct sk_buff *skb1;
+   struct sk_buff *skb1, *prev_tail;
struct net_device *dev;
unsigned int fragsize;
int flags, offset;
@@ -416,38 +422,41 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 */
 
/* Find out where to put this fragment.  */
-   skb1 = qp->q.fragments_tail;
-   if (!skb1) {
-   /* This is the first fragment we've received. */
-   rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
-   qp->q.fragments_tail = skb;
-   } else if ((skb1->ip_defrag_offset + skb1->len) < end) {
-   /* This is the common/special case: skb goes to the end. */
+   prev_tail = qp->q.fragments_tail;
+   if (!prev_tail)
+   ip4_frag_create_run(&qp->q, skb);  /* First fragment. */
+   else if (prev_tail->ip_defrag_offset + prev_tail->len < end) {
+   /* This is the common case: skb goes to the end. */
/* Detect and discard overlaps. */
-   if (offset < (skb1->ip_defrag_offset + skb1->len))
+   if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
goto discard_qp;
-   /* Insert after skb1. */
-   rb_link_node(&skb->rbnode, &skb1->rbnode, &skb1->rbnode.rb_right);
-   qp->q.fragments_tail = skb;
+   if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
+   ip4_frag_append_to_last_run(&qp->q, skb);
+   else
+   ip4_frag_create_run(&qp->q, skb);
} else {
-   /* Binary search. Note that skb can become the first 

[PATCH stable 4.9 13/29] inet: frags: do not clone skb in ip_expire()

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

An skb_clone() was added in commit ec4fbd64751d ("inet: frag: release
spinlock before calling icmp_send()")

While fixing the bug at that time, it also added a very high cost
for DDOS frags, as the ICMP rate limit is applied after this
expensive operation (skb_clone() + consume_skb(), implying memory
allocations, copy, and freeing)

We can use skb_get(head) here; all we want is to make sure the skb won't
be freed by another cpu.
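The pattern is the usual pin-then-unlock one: take a reference under the
lock, release the lock, do the slow work, then drop the reference. A
minimal userspace sketch of the idea (C11 atomics standing in for the
skb refcount; obj_get/obj_put are invented names):

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct obj {
	atomic_int refcnt;
	int payload;
};

static struct obj *obj_get(struct obj *o)
{
	atomic_fetch_add(&o->refcnt, 1);	/* like skb_get() */
	return o;
}

static void obj_put(struct obj *o)
{
	if (atomic_fetch_sub(&o->refcnt, 1) == 1)	/* last reference */
		free(o);				/* like kfree_skb() */
}

int main(void)
{
	struct obj *o = malloc(sizeof(*o));

	atomic_init(&o->refcnt, 1);
	o->payload = 42;

	obj_get(o);	/* pin before releasing the lock */
	/* lock dropped here; the slow work (icmp_send) is now safe */
	printf("payload=%d\n", o->payload);
	obj_put(o);	/* drop the pin */
	obj_put(o);	/* original reference */
	return 0;
}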

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 1eec5d5670084ee644597bd26c25e22c69b9f748)
---
 net/ipv4/ip_fragment.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 3dd19bebeb55..e235f62dab58 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -141,8 +141,8 @@ static bool frag_expire_skip_icmp(u32 user)
  */
 static void ip_expire(unsigned long arg)
 {
-   struct sk_buff *clone, *head;
const struct iphdr *iph;
+   struct sk_buff *head;
struct net *net;
struct ipq *qp;
int err;
@@ -185,16 +185,12 @@ static void ip_expire(unsigned long arg)
(skb_rtable(head)->rt_type != RTN_LOCAL))
goto out;
 
-   clone = skb_clone(head, GFP_ATOMIC);
+   skb_get(head);
+   spin_unlock(&qp->q.lock);
+   icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
+   kfree_skb(head);
+   goto out_rcu_unlock;
 
-   /* Send an ICMP "Fragment Reassembly Timeout" message. */
-   if (clone) {
-   spin_unlock(&qp->q.lock);
-   icmp_send(clone, ICMP_TIME_EXCEEDED,
- ICMP_EXC_FRAGTIME, 0);
-   consume_skb(clone);
-   goto out_rcu_unlock;
-   }
 out:
	spin_unlock(&qp->q.lock);
 out_rcu_unlock:
-- 
2.17.1



[PATCH stable 4.9 29/29] ip: frags: fix crash in ip_do_fragment()

2018-10-09 Thread Florian Fainelli
From: Taehee Yoo 

commit 5d407b071dc369c26a38398326ee2be53651cfe4 upstream

A kernel crash occurs when a defragmented packet is fragmented
in ip_do_fragment().
In the defragment routine, skb_orphan() is called and
skb->ip_defrag_offset is set. But skb->sk and
skb->ip_defrag_offset are members of the same union, so
frag->sk is not NULL.
Hence a crash occurs in the skb->sk check in ip_do_fragment() when
the defragmented packet is fragmented again.

test commands:
   %iptables -t nat -I POSTROUTING -j MASQUERADE
   %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000

splat looks like:
[  261.069429] kernel BUG at net/ipv4/ip_output.c:636!
[  261.075753] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
[  261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
[  261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff 
ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 
0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
[  261.127015] RSP: 0018:8801031cf2c0 EFLAGS: 00010202
[  261.134156] RAX: 11002297537b RBX: ed0020639e6e RCX: 0004
[  261.142156] RDX:  RSI:  RDI: 880114ba9bd8
[  261.150157] RBP: 880114ba8a40 R08: ed0022975395 R09: ed0022975395
[  261.158157] R10: 0001 R11: ed0022975394 R12: 880114ba9ca4
[  261.166159] R13: 0010 R14: 880114ba9bc0 R15: dc00
[  261.174169] FS:  7fbae2199700() GS:88011b40() 
knlGS:
[  261.183012] CS:  0010 DS:  ES:  CR0: 80050033
[  261.189013] CR2: 5579244fe000 CR3: 000119bf4000 CR4: 001006e0
[  261.198158] Call Trace:
[  261.199018]  ? dst_output+0x180/0x180
[  261.205011]  ? save_trace+0x300/0x300
[  261.209018]  ? ip_copy_metadata+0xb00/0xb00
[  261.213034]  ? sched_clock_local+0xd4/0x140
[  261.218158]  ? kill_l4proto+0x120/0x120 [nf_conntrack]
[  261.223014]  ? rt_cpu_seq_stop+0x10/0x10
[  261.227014]  ? find_held_lock+0x39/0x1c0
[  261.233008]  ip_finish_output+0x51d/0xb50
[  261.237006]  ? ip_fragment.constprop.56+0x220/0x220
[  261.243011]  ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
[  261.250152]  ? rcu_is_watching+0x77/0x120
[  261.255010]  ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
[  261.261033]  ? nf_hook_slow+0xb1/0x160
[  261.265007]  ip_output+0x1c7/0x710
[  261.269005]  ? ip_mc_output+0x13f0/0x13f0
[  261.273002]  ? __local_bh_enable_ip+0xe9/0x1b0
[  261.278152]  ? ip_fragment.constprop.56+0x220/0x220
[  261.282996]  ? nf_hook_slow+0xb1/0x160
[  261.287007]  raw_sendmsg+0x21f9/0x4420
[  261.291008]  ? dst_output+0x180/0x180
[  261.297003]  ? sched_clock_cpu+0x126/0x170
[  261.301003]  ? find_held_lock+0x39/0x1c0
[  261.306155]  ? stop_critical_timings+0x420/0x420
[  261.311004]  ? check_flags.part.36+0x450/0x450
[  261.315005]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.320995]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.326142]  ? cyc2ns_read_end+0x10/0x10
[  261.330139]  ? raw_bind+0x280/0x280
[  261.334138]  ? sched_clock_cpu+0x126/0x170
[  261.338995]  ? check_flags.part.36+0x450/0x450
[  261.342991]  ? __lock_acquire+0x4500/0x4500
[  261.348994]  ? inet_sendmsg+0x11c/0x500
[  261.352989]  ? dst_output+0x180/0x180
[  261.357012]  inet_sendmsg+0x11c/0x500
[ ... ]

v2:
 - clear skb->sk in the reassembly routine. (Eric Dumazet)

Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
Suggested-by: Eric Dumazet 
Signed-off-by: Taehee Yoo 
Reviewed-by: Eric Dumazet 
Signed-off-by: David S. Miller 
---
 net/ipv4/ip_fragment.c  | 1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 8f899c13a392..cc8c6ac84d08 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -597,6 +597,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff 
*skb,
	nextp = &fp->next;
fp->prev = NULL;
	memset(&fp->rbnode, 0, sizeof(fp->rbnode));
+   fp->sk = NULL;
head->data_len += fp->len;
head->len += fp->len;
if (head->ip_summed != fp->ip_summed)
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 907c2d5753dd..b9147558a8f2 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -452,6 +452,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff 
*prev,  struct net_devic
else if (head->ip_summed == CHECKSUM_COMPLETE)
head->csum = csum_add(head->csum, fp->csum);
head->truesize += fp->truesize;
+   fp->sk = NULL;
}
sub_frag_mem_limit(fq->q.net, head->truesize);
 
-- 
2.17.1



[PATCH stable 4.9 25/29] net: sk_buff rbnode reorg

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

commit bffa72cf7f9df842f0016ba03586039296b4caaf upstream

skb->rbnode shares space with skb->next, skb->prev and skb->tstamp

Current uses (TCP receive ofo queue and netem) need to save/restore
tstamp, while skb->dev is either NULL (TCP) or a constant for a given
queue (netem).

Since we plan using an RB tree for TCP retransmit queue to speedup SACK
processing with large BDP, this patch exchanges skb->dev and
skb->tstamp.

This saves some overhead in both TCP and netem.

v2: removes the swtstamp field from struct tcp_skb_cb
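For readers tracking the union shuffling, a small userspace sketch of
the constraint involved (illustrative only, not the real struct sk_buff
layout): members of a union share storage, so once rbnode is in use the
next/prev/dev view must be left alone, while fields outside the union,
like tstamp after this patch, survive rb-tree usage:

#include <stdio.h>

struct fake_skb {
	union {
		struct {
			void *next;
			void *prev;
			union {
				void *dev;
				unsigned long dev_scratch;
			};
		};
		struct {
			void *rb_parent;	/* stand-in for rb_node */
			void *rb_right;
			void *rb_left;
		} rbnode;
	};
	long tstamp;	/* outside the union, per this patch */
};

int main(void)
{
	struct fake_skb skb = { .tstamp = 123 };

	skb.rbnode.rb_left = &skb;	/* clobbers next/prev/dev... */
	printf("tstamp still %ld\n", skb.tstamp);	/* ...but not tstamp */
	return 0;
}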

Signed-off-by: Eric Dumazet 
Cc: Soheil Hassas Yeganeh 
Cc: Wei Wang 
Cc: Willem de Bruijn 
Acked-by: Soheil Hassas Yeganeh 
Signed-off-by: David S. Miller 
---
 include/linux/skbuff.h  |  18 ++-
 include/net/inet_frag.h |   3 +-
 net/ipv4/inet_fragment.c|  16 ++-
 net/ipv4/ip_fragment.c  | 182 +---
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c   |   1 +
 6 files changed, 128 insertions(+), 93 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7e7e12aeaf82..d966cc688750 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -639,19 +639,27 @@ struct sk_buff {
struct sk_buff  *prev;
 
union {
-   ktime_t tstamp;
-   struct skb_mstamp skb_mstamp;
+   struct net_device   *dev;
+   /* Some protocols might use this space to store information,
+* while device pointer would be NULL.
+* UDP receive path is one user.
+*/
+   unsigned long   dev_scratch;
};
};
-   struct rb_node  rbnode; /* used in netem & tcp stack */
+   struct rb_node  rbnode; /* used in netem, ip4 defrag, and tcp stack */
+   struct list_headlist;
};
 
union {
+   struct sock *sk;
int ip_defrag_offset;
};
 
-   struct sock *sk;
-   struct net_device   *dev;
+   union {
+   ktime_t tstamp;
+   struct skb_mstamp skb_mstamp;
+   };
 
/*
 * This is the control buffer. It is free to use for every
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index f47678d2ccc2..1ff0433d94a7 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -74,7 +74,8 @@ struct inet_frag_queue {
struct timer_list   timer;
spinlock_t  lock;
	atomic_t        refcnt;
-   struct sk_buff  *fragments;
+   struct sk_buff  *fragments;  /* Used in IPv6. */
+   struct rb_root  rb_fragments; /* Used in IPv4. */
struct sk_buff  *fragments_tail;
ktime_t stamp;
int len;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 47c240f50b99..535fa57af51e 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -136,12 +136,16 @@ void inet_frag_destroy(struct inet_frag_queue *q)
fp = q->fragments;
nf = q->net;
f = nf->f;
-   while (fp) {
-   struct sk_buff *xp = fp->next;
-
-   sum_truesize += fp->truesize;
-   kfree_skb(fp);
-   fp = xp;
+   if (fp) {
+   do {
+   struct sk_buff *xp = fp->next;
+
+   sum_truesize += fp->truesize;
+   kfree_skb(fp);
+   fp = xp;
+   } while (fp);
+   } else {
-   sum_truesize = skb_rbtree_purge(&q->rb_fragments);
}
sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 8bfb34e9ea32..11d3dc649ef0 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -134,7 +134,7 @@ static bool frag_expire_skip_icmp(u32 user)
 static void ip_expire(unsigned long arg)
 {
const struct iphdr *iph;
-   struct sk_buff *head;
+   struct sk_buff *head = NULL;
struct net *net;
struct ipq *qp;
int err;
@@ -150,14 +150,31 @@ static void ip_expire(unsigned long arg)
 
ipq_kill(qp);
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
-
-   head = qp->q.fragments;
-
__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-   if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+   if (!qp->q.flags & INET_FRAG_FIRST_IN)
goto out;
 
+   /* sk_buff::dev and sk_buff::rbnode are unionized. So we
+* pull the head out of the tree in order to be able to
+* deal with 

[PATCH stable 4.9 26/29] ipv4: frags: precedence bug in ip_expire()

2018-10-09 Thread Florian Fainelli
From: Dan Carpenter 

We accidentally removed the parentheses here, but they are required
because '!' has higher precedence than '&'.
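A standalone demonstration of the precedence trap, for the archives
(the flag constants mirror the kernel enum; compiles as plain C):

#include <stdio.h>

#define INET_FRAG_FIRST_IN 0x01	/* BIT(0) */
#define INET_FRAG_LAST_IN  0x02	/* BIT(1) */

int main(void)
{
	unsigned int flags = INET_FRAG_LAST_IN;	/* first frag NOT in */

	/* '!' binds tighter than '&': parses as (!flags) & 0x01 == 0 */
	printf("buggy:   %d\n", !flags & INET_FRAG_FIRST_IN);

	/* intended test: "is the first fragment missing?" -> 1 */
	printf("correct: %d\n", !(flags & INET_FRAG_FIRST_IN));
	return 0;
}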

Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
Signed-off-by: Dan Carpenter 
Signed-off-by: David S. Miller 
(cherry picked from commit 70837ffe3085c9a91488b52ca13ac84424da1042)
---
 net/ipv4/ip_fragment.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 11d3dc649ef0..4d243fcb02f7 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -152,7 +152,7 @@ static void ip_expire(unsigned long arg)
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-   if (!qp->q.flags & INET_FRAG_FIRST_IN)
+   if (!(qp->q.flags & INET_FRAG_FIRST_IN))
goto out;
 
/* sk_buff::dev and sk_buff::rbnode are unionized. So we
-- 
2.17.1



[PATCH stable 4.9 11/29] inet: frags: remove inet_frag_maybe_warn_overflow()

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

This function is obsolete, after rhashtable addition to inet defrag.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 2d44ed22e607f9a285b049de2263e3840673a260)
---
 include/net/inet_frag.h |  2 --
 net/ieee802154/6lowpan/reassembly.c |  5 ++---
 net/ipv4/inet_fragment.c| 11 ---
 net/ipv4/ip_fragment.c  |  5 ++---
 net/ipv6/netfilter/nf_conntrack_reasm.c |  5 ++---
 net/ipv6/reassembly.c   |  5 ++---
 6 files changed, 8 insertions(+), 25 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 7e984045b2b7..23161bf5d899 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -109,8 +109,6 @@ void inet_frags_exit_net(struct netns_frags *nf);
 void inet_frag_kill(struct inet_frag_queue *q);
 void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key);
-void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
-  const char *prefix);
 
 static inline void inet_frag_put(struct inet_frag_queue *q)
 {
diff --git a/net/ieee802154/6lowpan/reassembly.c 
b/net/ieee802154/6lowpan/reassembly.c
index a63360a05108..b54015981af9 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -83,10 +83,9 @@ fq_find(struct net *net, const struct lowpan_802154_cb *cb,
struct inet_frag_queue *q;
 
	q = inet_frag_find(&ieee802154_lowpan->frags, &key);
-   if (IS_ERR_OR_NULL(q)) {
-   inet_frag_maybe_warn_overflow(q, pr_fmt());
+   if (!q)
return NULL;
-   }
+
return container_of(q, struct lowpan_frag_queue, q);
 }
 
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index a50ac25878aa..47c240f50b99 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -217,14 +217,3 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags 
*nf, void *key)
return inet_frag_create(nf, key);
 }
 EXPORT_SYMBOL(inet_frag_find);
-
-void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
-  const char *prefix)
-{
-   static const char msg[] = "inet_frag_find: Fragment hash bucket"
-   " list length grew over limit. Dropping fragment.\n";
-
-   if (PTR_ERR(q) == -ENOBUFS)
-   net_dbg_ratelimited("%s%s", prefix, msg);
-}
-EXPORT_SYMBOL(inet_frag_maybe_warn_overflow);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 308592a8ba97..696bfef06caa 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -219,10 +219,9 @@ static struct ipq *ip_find(struct net *net, struct iphdr 
*iph,
struct inet_frag_queue *q;
 
	q = inet_frag_find(&net->ipv4.frags, &key);
-   if (IS_ERR_OR_NULL(q)) {
-   inet_frag_maybe_warn_overflow(q, pr_fmt());
+   if (!q)
return NULL;
-   }
+
return container_of(q, struct ipq, q);
 }
 
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 314568d8b84a..267f2ae2d05c 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -177,10 +177,9 @@ static struct frag_queue *fq_find(struct net *net, __be32 
id, u32 user,
struct inet_frag_queue *q;
 
	q = inet_frag_find(&net->nf_frag.frags, &key);
-   if (IS_ERR_OR_NULL(q)) {
-   inet_frag_maybe_warn_overflow(q, pr_fmt());
+   if (!q)
return NULL;
-   }
+
return container_of(q, struct frag_queue, q);
 }
 
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 629a45a4c79f..6de4cec69054 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -154,10 +154,9 @@ fq_find(struct net *net, __be32 id, const struct ipv6hdr 
*hdr, int iif)
key.iif = 0;
 
	q = inet_frag_find(&net->ipv6.frags, &key);
-   if (IS_ERR_OR_NULL(q)) {
-   inet_frag_maybe_warn_overflow(q, pr_fmt());
+   if (!q)
return NULL;
-   }
+
return container_of(q, struct frag_queue, q);
 }
 
-- 
2.17.1



[PATCH stable 4.9 18/29] inet: frags: fix ip6frag_low_thresh boundary

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

Giving an integer to proc_doulongvec_minmax() is dangerous on 64-bit arches,
since the linker might place a non-zero value next to it, preventing any
change to ip6frag_low_thresh.

ip6frag_low_thresh is not used anymore in the kernel, but we do not
want to prematurely break user scripts that want to change it.

Since specifying a minimal value of 0 for proc_doulongvec_minmax()
is moot, let's remove these zero values in all defrag units.
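To make the hazard concrete, a userspace sketch (assumptions: LP64 and
little-endian; the struct stands in for whatever the linker lays out
next to a 4-byte "zero"): a handler that reads sizeof(long) bytes
through .extra1 picks up the neighbouring value in the upper half,
turning the minimum into a huge number and rejecting every write:

#include <stdio.h>
#include <string.h>

static struct {
	int zero;	/* the 4-byte minimum, as in the removed code */
	int neighbor;	/* whatever the linker placed after it */
} vars = { 0, 1 };

int main(void)
{
	long min;

	/* a ulongvec handler reads sizeof(long) bytes from .extra1 */
	memcpy(&min, &vars.zero, sizeof(min));
	printf("effective minimum: %ld\n", min);	/* 4294967296 here */
	return 0;
}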

Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh")
Signed-off-by: Eric Dumazet 
Reported-by: Maciej Żenczykowski 
Signed-off-by: David S. Miller 
(cherry picked from commit 3d23401283e80ceb03f765842787e0e79ff598b7)
---
 net/ieee802154/6lowpan/reassembly.c |  2 --
 net/ipv4/ip_fragment.c  | 40 ++---
 net/ipv6/netfilter/nf_conntrack_reasm.c |  2 --
 net/ipv6/reassembly.c   |  4 +--
 4 files changed, 17 insertions(+), 31 deletions(-)

diff --git a/net/ieee802154/6lowpan/reassembly.c 
b/net/ieee802154/6lowpan/reassembly.c
index 122a625d9a66..6fca75581e13 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -410,7 +410,6 @@ int lowpan_frag_rcv(struct sk_buff *skb, u8 frag_type)
 }
 
 #ifdef CONFIG_SYSCTL
-static long zero;
 
 static struct ctl_table lowpan_frags_ns_ctl_table[] = {
{
@@ -427,7 +426,6 @@ static struct ctl_table lowpan_frags_ns_ctl_table[] = {
.maxlen = sizeof(unsigned long),
.mode   = 0644,
.proc_handler   = proc_doulongvec_minmax,
-   .extra1 = &zero,
	.extra2 = &init_net.ieee802154_lowpan.frags.high_thresh
},
{
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index e235f62dab58..73c0adc61a65 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -56,14 +56,6 @@
  */
 static const char ip_frag_cache_name[] = "ip4-frags";
 
-struct ipfrag_skb_cb
-{
-   struct inet_skb_parm    h;
-   int offset;
-};
-
-#define FRAG_CB(skb)   ((struct ipfrag_skb_cb *)((skb)->cb))
-
 /* Describe an entry in the "incomplete datagrams" queue. */
 struct ipq {
struct inet_frag_queue q;
@@ -351,13 +343,13 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 * this fragment, right?
 */
prev = qp->q.fragments_tail;
-   if (!prev || FRAG_CB(prev)->offset < offset) {
+   if (!prev || prev->ip_defrag_offset < offset) {
next = NULL;
goto found;
}
prev = NULL;
for (next = qp->q.fragments; next != NULL; next = next->next) {
-   if (FRAG_CB(next)->offset >= offset)
+   if (next->ip_defrag_offset >= offset)
break;  /* bingo! */
prev = next;
}
@@ -368,7 +360,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 * any overlaps are eliminated.
 */
if (prev) {
-   int i = (FRAG_CB(prev)->offset + prev->len) - offset;
+   int i = (prev->ip_defrag_offset + prev->len) - offset;
 
if (i > 0) {
offset += i;
@@ -385,8 +377,8 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
 
err = -ENOMEM;
 
-   while (next && FRAG_CB(next)->offset < end) {
-   int i = end - FRAG_CB(next)->offset; /* overlap is 'i' bytes */
+   while (next && next->ip_defrag_offset < end) {
+   int i = end - next->ip_defrag_offset; /* overlap is 'i' bytes */
 
if (i < next->len) {
int delta = -next->truesize;
@@ -399,7 +391,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
delta += next->truesize;
if (delta)
add_frag_mem_limit(qp->q.net, delta);
-   FRAG_CB(next)->offset += i;
+   next->ip_defrag_offset += i;
qp->q.meat -= i;
if (next->ip_summed != CHECKSUM_UNNECESSARY)
next->ip_summed = CHECKSUM_NONE;
@@ -423,7 +415,13 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
}
}
 
-   FRAG_CB(skb)->offset = offset;
+   /* Note : skb->ip_defrag_offset and skb->dev share the same location */
+   dev = skb->dev;
+   if (dev)
+   qp->iif = dev->ifindex;
+   /* Makes sure compiler wont do silly aliasing games */
+   barrier();
+   skb->ip_defrag_offset = offset;
 
/* Insert this fragment in the chain of fragments. */
skb->next = next;
@@ -434,11 +432,6 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
else
qp->q.fragments = skb;
 
-   dev = skb->dev;
-   if (dev) {
-   qp->iif = dev->ifindex;
-   

[PATCH stable 4.9 04/29] inet: frags: refactor ipv6_frag_init()

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

We want to call inet_frags_init() earlier.

This is a prereq to "inet: frags: use rhashtables for reassembly units"

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 5b975bab23615cd0fdf67af6c9298eb01c4b9f61)
---
 net/ipv6/reassembly.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 436e6d594f25..9440bb9bdab7 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -740,10 +740,21 @@ int __init ipv6_frag_init(void)
 {
int ret;
 
-   ret = inet6_add_protocol(&frag_protocol, IPPROTO_FRAGMENT);
+   ip6_frags.hashfn = ip6_hashfn;
+   ip6_frags.constructor = ip6_frag_init;
+   ip6_frags.destructor = NULL;
+   ip6_frags.qsize = sizeof(struct frag_queue);
+   ip6_frags.match = ip6_frag_match;
+   ip6_frags.frag_expire = ip6_frag_expire;
+   ip6_frags.frags_cache_name = ip6_frag_cache_name;
+   ret = inet_frags_init(&ip6_frags);
if (ret)
goto out;
 
+   ret = inet6_add_protocol(&frag_protocol, IPPROTO_FRAGMENT);
+   if (ret)
+   goto err_protocol;
+
ret = ip6_frags_sysctl_register();
if (ret)
goto err_sysctl;
@@ -752,16 +763,6 @@ int __init ipv6_frag_init(void)
if (ret)
goto err_pernet;
 
-   ip6_frags.hashfn = ip6_hashfn;
-   ip6_frags.constructor = ip6_frag_init;
-   ip6_frags.destructor = NULL;
-   ip6_frags.qsize = sizeof(struct frag_queue);
-   ip6_frags.match = ip6_frag_match;
-   ip6_frags.frag_expire = ip6_frag_expire;
-   ip6_frags.frags_cache_name = ip6_frag_cache_name;
-   ret = inet_frags_init(&ip6_frags);
-   if (ret)
-   goto err_pernet;
 out:
return ret;
 
@@ -769,6 +770,8 @@ int __init ipv6_frag_init(void)
ip6_frags_sysctl_unregister();
 err_sysctl:
	inet6_del_protocol(&frag_protocol, IPPROTO_FRAGMENT);
+err_protocol:
+   inet_frags_fini(&ip6_frags);
goto out;
 }
 
-- 
2.17.1



[PATCH stable 4.9 16/29] inet: frags: reorganize struct netns_frags

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

Put the read-mostly fields in a separate cache line
at the beginning of struct netns_frags, to reduce
false sharing noticed in inet_frag_kill()
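The layout idea in a compilable userspace sketch (illustrative only;
the field names loosely follow netns_frags, and C11 _Alignas stands in
for the kernel's ____cacheline_aligned_in_smp):

#include <stddef.h>
#include <stdio.h>

#define CACHELINE 64

struct demo_frags {
	/* read-mostly, sysctl-style fields share the first cache line */
	long high_thresh;
	long low_thresh;
	int timeout;

	/* the frequently written counter gets its own cache line, so
	 * writers do not invalidate the line the readers sit on */
	_Alignas(CACHELINE) long mem;
};

int main(void)
{
	printf("offsetof(mem)=%zu sizeof=%zu\n",
	       offsetof(struct demo_frags, mem),
	       sizeof(struct demo_frags));
	return 0;
}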

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit c2615cf5a761b32bf74e85bddc223dfff3d9b9f0)
---
 include/net/inet_frag.h | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index dea175f3418a..f47678d2ccc2 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -4,16 +4,17 @@
 #include <linux/rhashtable.h>
 
 struct netns_frags {
-   struct rhashtable   rhashtable ____cacheline_aligned_in_smp;
-
-   /* Keep atomic mem on separate cachelines in structs that include it */
-   atomic_long_t   mem ____cacheline_aligned_in_smp;
/* sysctls */
	long            high_thresh;
	long            low_thresh;
int timeout;
int max_dist;
struct inet_frags   *f;
+
+   struct rhashtable   rhashtable ____cacheline_aligned_in_smp;
+
+   /* Keep atomic mem on separate cachelines in structs that include it */
+   atomic_long_t   mem ____cacheline_aligned_in_smp;
 };
 
 /**
-- 
2.17.1



[PATCH stable 4.9 21/29] net: modify skb_rbtree_purge to return the truesize of all purged skbs.

2018-10-09 Thread Florian Fainelli
From: Peter Oskolkov 

Tested: see the next patch in the series.

Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
Signed-off-by: David S. Miller 
(cherry picked from commit 385114dec8a49b5e5945e77ba7de6356106713f4)
---
 include/linux/skbuff.h | 2 +-
 net/core/skbuff.c  | 6 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c7cca35fcf6d..724c6abdb9e6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2418,7 +2418,7 @@ static inline void __skb_queue_purge(struct sk_buff_head 
*list)
kfree_skb(skb);
 }
 
-void skb_rbtree_purge(struct rb_root *root);
+unsigned int skb_rbtree_purge(struct rb_root *root);
 
 void *netdev_alloc_frag(unsigned int fragsz);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 96a553da1518..4e545e4432eb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2425,23 +2425,27 @@ EXPORT_SYMBOL(skb_queue_purge);
 /**
  * skb_rbtree_purge - empty a skb rbtree
  * @root: root of the rbtree to empty
+ * Return value: the sum of truesizes of all purged skbs.
  *
 * Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
  * the list and one reference dropped. This function does not take
  * any lock. Synchronization should be handled by the caller (e.g., TCP
  * out-of-order queue is protected by the socket lock).
  */
-void skb_rbtree_purge(struct rb_root *root)
+unsigned int skb_rbtree_purge(struct rb_root *root)
 {
struct rb_node *p = rb_first(root);
+   unsigned int sum = 0;
 
while (p) {
struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
p = rb_next(p);
	rb_erase(&skb->rbnode, root);
+   sum += skb->truesize;
kfree_skb(skb);
}
+   return sum;
 }
 
 /**
-- 
2.17.1



[PATCH stable 4.9 19/29] ip: discard IPv4 datagrams with overlapping segments.

2018-10-09 Thread Florian Fainelli
From: Peter Oskolkov 

This behavior is required in IPv6, and there is little need
to tolerate overlapping fragments in IPv4. This change
simplifies the code and eliminates potential DDoS attack vectors.

Tested: ran ip_defrag selftest (not yet available upstream).

Suggested-by: David S. Miller 
Signed-off-by: Peter Oskolkov 
Signed-off-by: Eric Dumazet 
Cc: Florian Westphal 
Acked-by: Stephen Hemminger 
Signed-off-by: David S. Miller 
(cherry picked from commit 7969e5c40dfd04799d4341f1b7cd266b6e47f227)
---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/ip_fragment.c| 75 ++-
 net/ipv4/proc.c   |  1 +
 3 files changed, 21 insertions(+), 56 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index e7a31f830690..3442a26d36d9 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -55,6 +55,7 @@ enum
IPSTATS_MIB_ECT1PKTS,   /* InECT1Pkts */
IPSTATS_MIB_ECT0PKTS,   /* InECT0Pkts */
IPSTATS_MIB_CEPKTS, /* InCEPkts */
+   IPSTATS_MIB_REASM_OVERLAPS, /* ReasmOverlaps */
__IPSTATS_MIB_MAX
 };
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 73c0adc61a65..8bfb34e9ea32 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -275,6 +275,7 @@ static int ip_frag_reinit(struct ipq *qp)
 /* Add new segment to existing queue. */
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
+   struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct sk_buff *prev, *next;
struct net_device *dev;
unsigned int fragsize;
@@ -355,65 +356,23 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff 
*skb)
}
 
 found:
-   /* We found where to put this one.  Check for overlap with
-* preceding fragment, and, if needed, align things so that
-* any overlaps are eliminated.
+   /* RFC5722, Section 4, amended by Errata ID : 3089
+*  When reassembling an IPv6 datagram, if
+*   one or more its constituent fragments is determined to be an
+*   overlapping fragment, the entire datagram (and any constituent
+*   fragments) MUST be silently discarded.
+*
+* We do the same here for IPv4.
 */
-   if (prev) {
-   int i = (prev->ip_defrag_offset + prev->len) - offset;
 
-   if (i > 0) {
-   offset += i;
-   err = -EINVAL;
-   if (end <= offset)
-   goto err;
-   err = -ENOMEM;
-   if (!pskb_pull(skb, i))
-   goto err;
-   if (skb->ip_summed != CHECKSUM_UNNECESSARY)
-   skb->ip_summed = CHECKSUM_NONE;
-   }
-   }
+   /* Is there an overlap with the previous fragment? */
+   if (prev &&
+   (prev->ip_defrag_offset + prev->len) > offset)
+   goto discard_qp;
 
-   err = -ENOMEM;
-
-   while (next && next->ip_defrag_offset < end) {
-   int i = end - next->ip_defrag_offset; /* overlap is 'i' bytes */
-
-   if (i < next->len) {
-   int delta = -next->truesize;
-
-   /* Eat head of the next overlapped fragment
-* and leave the loop. The next ones cannot overlap.
-*/
-   if (!pskb_pull(next, i))
-   goto err;
-   delta += next->truesize;
-   if (delta)
-   add_frag_mem_limit(qp->q.net, delta);
-   next->ip_defrag_offset += i;
-   qp->q.meat -= i;
-   if (next->ip_summed != CHECKSUM_UNNECESSARY)
-   next->ip_summed = CHECKSUM_NONE;
-   break;
-   } else {
-   struct sk_buff *free_it = next;
-
-   /* Old fragment is completely overridden with
-* new one drop it.
-*/
-   next = next->next;
-
-   if (prev)
-   prev->next = next;
-   else
-   qp->q.fragments = next;
-
-   qp->q.meat -= free_it->len;
-   sub_frag_mem_limit(qp->q.net, free_it->truesize);
-   kfree_skb(free_it);
-   }
-   }
+   /* Is there an overlap with the next fragment? */
+   if (next && next->ip_defrag_offset < end)
+   goto discard_qp;
 
/* Note : skb->ip_defrag_offset and skb->dev share the same location */
dev = skb->dev;
@@ -461,6 +420,10 @@ static int ip_frag_queue(struct 

[PATCH stable 4.9 17/29] inet: frags: get rid of ipfrag_skb_cb/FRAG_CB

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
this integer is currently in a different cache line than skb->next,
meaning that we use two cache lines per skb when finding the insertion point.

By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
in a single cache line and save precious memory bandwidth.

Note that after the fast path added by Changli Gao in commit
d6bebca92c66 ("fragment: add fast path for in-order fragments")
this change won't help the fast path, since we still need
to access prev->len (2nd cache line), but will show great
benefits when the slow path is entered, since we perform
a linear scan of a potentially long list.

Also, note that this potential long list is an attack vector,
we might consider also using an rb-tree there eventually.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit bf66337140c64c27fa37222b7abca7e49d63fb57)
---
 include/linux/skbuff.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1f207dd22757..c7cca35fcf6d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -645,6 +645,11 @@ struct sk_buff {
};
struct rb_node  rbnode; /* used in netem & tcp stack */
};
+
+   union {
+   int ip_defrag_offset;
+   };
+
struct sock *sk;
struct net_device   *dev;
 
-- 
2.17.1



[PATCH stable 4.9 20/29] net: speed up skb_rbtree_purge()

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

As measured in my prior patch ("sch_netem: faster rb tree removal"),
rbtree_postorder_for_each_entry_safe() is nice looking but much slower
than using rb_next() directly, except when tree is small enough
to fit in CPU caches (then the cost is the same)

Also note that there is not even an increase of text size :
$ size net/core/skbuff.o.before net/core/skbuff.o
   text    data     bss     dec     hex filename
  40711    1298       0   42009    a419 net/core/skbuff.o.before
  40711    1298       0   42009    a419 net/core/skbuff.o

From: Eric Dumazet 

Signed-off-by: David S. Miller 
(cherry picked from commit 7c90584c66cc4b033a3b684b0e0950f79e7b7166)
---
 net/core/skbuff.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 84c731aef0d8..96a553da1518 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2433,12 +2433,15 @@ EXPORT_SYMBOL(skb_queue_purge);
  */
 void skb_rbtree_purge(struct rb_root *root)
 {
-   struct sk_buff *skb, *next;
+   struct rb_node *p = rb_first(root);
 
-   rbtree_postorder_for_each_entry_safe(skb, next, root, rbnode)
-   kfree_skb(skb);
+   while (p) {
+   struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
-   *root = RB_ROOT;
+   p = rb_next(p);
+   rb_erase(&skb->rbnode, root);
+   kfree_skb(skb);
+   }
 }
 
 /**
-- 
2.17.1



[PATCH stable 4.9 06/29] ipv6: export ip6 fragments sysctl to unprivileged users

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

IPv4 was changed in commit 52a773d645e9 ("net: Export ip fragment
sysctl to unprivileged users")

The only sysctl that is not per-netns is not used :
ip6frag_secret_interval

Signed-off-by: Eric Dumazet 
Cc: Nikolay Borisov 
Signed-off-by: David S. Miller 
(cherry picked from commit 18dcbe12fe9fca0ab825f7eff993060525ac2503)
---
 net/ipv6/reassembly.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 9440bb9bdab7..9fe99475bddf 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -643,10 +643,6 @@ static int __net_init ip6_frags_ns_sysctl_register(struct 
net *net)
	table[1].data = &net->ipv6.frags.low_thresh;
	table[1].extra2 = &net->ipv6.frags.high_thresh;
	table[2].data = &net->ipv6.frags.timeout;
-
-   /* Don't export sysctls to unprivileged users */
-   if (net->user_ns != &init_user_ns)
-   table[0].procname = NULL;
}
 
hdr = register_net_sysctl(net, "net/ipv6", table);
-- 
2.17.1



[PATCH stable 4.9 12/29] inet: frags: break the 2GB limit for frags storage

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

Some users are willing to provision huge amounts of memory to be able
to perform reassembly reasonably well under pressure.

Current memory tracking is using one atomic_t and integers.

Switch to atomic_long_t so that 64bit arches can use more than 2GB,
without any cost for 32bit arches.

Note that this patch avoids an overflow error, if high_thresh was set
to ~2GB, since this test in inet_frag_alloc() was never true :

if (... || frag_mem_limit(nf) > nf->high_thresh)
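In other words, with a 32-bit counter the eviction test can never fire
once high_thresh is at or above 2GB. A tiny userspace illustration
(long long is used so the arithmetic is well defined on 32-bit hosts too):

#include <limits.h>
#include <stdio.h>

int main(void)
{
	long long high_thresh = 2LL * 1024 * 1024 * 1024;	/* 2GB */
	int mem32 = INT_MAX;			/* atomic_t can go no higher */
	long long mem64 = (long long)INT_MAX + 4096;	/* atomic_long_t keeps counting */

	printf("int counter fires:  %d\n", mem32 > high_thresh);	/* 0, always */
	printf("long counter fires: %d\n", mem64 > high_thresh);	/* 1 */
	return 0;
}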

Tested:

$ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh

<frag DDOS>

$ grep FRAG /proc/net/sockstat
FRAG: inuse 14705885 memory 16000002880

$ nstat -n ; sleep 1 ; nstat | grep Reas
IpReasmReqds                    3317150            0.0
IpReasmFails                    3317112            0.0

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 3e67f106f619dcfaf6f4e2039599bdb69848c714)
---
 Documentation/networking/ip-sysctl.txt  |  4 ++--
 include/net/inet_frag.h | 20 ++--
 net/ieee802154/6lowpan/reassembly.c | 10 +-
 net/ipv4/ip_fragment.c  | 10 +-
 net/ipv4/proc.c |  2 +-
 net/ipv6/netfilter/nf_conntrack_reasm.c | 10 +-
 net/ipv6/proc.c |  2 +-
 net/ipv6/reassembly.c   |  6 +++---
 8 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index 6cd632578ce8..dbdc4130e149 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -122,10 +122,10 @@ min_adv_mss - INTEGER
 
 IP Fragmentation:
 
-ipfrag_high_thresh - INTEGER
+ipfrag_high_thresh - LONG INTEGER
Maximum memory used to reassemble IP fragments.
 
-ipfrag_low_thresh - INTEGER
+ipfrag_low_thresh - LONG INTEGER
(Obsolete since linux-4.17)
Maximum memory used to reassemble IP fragments before the kernel
begins to remove incomplete fragment queues to free up resources.
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 23161bf5d899..dea175f3418a 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -7,11 +7,11 @@ struct netns_frags {
	struct rhashtable   rhashtable ____cacheline_aligned_in_smp;
 
/* Keep atomic mem on separate cachelines in structs that include it */
-   atomic_t        mem ____cacheline_aligned_in_smp;
+   atomic_long_t   mem ____cacheline_aligned_in_smp;
/* sysctls */
+   long            high_thresh;
+   long            low_thresh;
int timeout;
-   int high_thresh;
-   int low_thresh;
int max_dist;
struct inet_frags   *f;
 };
@@ -101,7 +101,7 @@ void inet_frags_fini(struct inet_frags *);
 
 static inline int inet_frags_init_net(struct netns_frags *nf)
 {
-   atomic_set(&nf->mem, 0);
+   atomic_long_set(&nf->mem, 0);
	return rhashtable_init(&nf->rhashtable, &nf->f->rhash_params);
 }
 void inet_frags_exit_net(struct netns_frags *nf);
@@ -118,19 +118,19 @@ static inline void inet_frag_put(struct inet_frag_queue 
*q)
 
 /* Memory Tracking Functions. */
 
-static inline int frag_mem_limit(struct netns_frags *nf)
+static inline long frag_mem_limit(const struct netns_frags *nf)
 {
-   return atomic_read(&nf->mem);
+   return atomic_long_read(&nf->mem);
 }
 
-static inline void sub_frag_mem_limit(struct netns_frags *nf, int i)
+static inline void sub_frag_mem_limit(struct netns_frags *nf, long val)
 {
-   atomic_sub(i, &nf->mem);
+   atomic_long_sub(val, &nf->mem);
 }
 
-static inline void add_frag_mem_limit(struct netns_frags *nf, int i)
+static inline void add_frag_mem_limit(struct netns_frags *nf, long val)
 {
-   atomic_add(i, &nf->mem);
+   atomic_long_add(val, &nf->mem);
 }
 
 /* RFC 3168 support :
diff --git a/net/ieee802154/6lowpan/reassembly.c 
b/net/ieee802154/6lowpan/reassembly.c
index b54015981af9..122a625d9a66 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -410,23 +410,23 @@ int lowpan_frag_rcv(struct sk_buff *skb, u8 frag_type)
 }
 
 #ifdef CONFIG_SYSCTL
-static int zero;
+static long zero;
 
 static struct ctl_table lowpan_frags_ns_ctl_table[] = {
{
.procname   = "6lowpanfrag_high_thresh",
	.data   = &init_net.ieee802154_lowpan.frags.high_thresh,
-   .maxlen = sizeof(int),
+   .maxlen = sizeof(unsigned long),
.mode   = 0644,
-   .proc_handler   = proc_dointvec_minmax,
+   .proc_handler   = proc_doulongvec_minmax,
	.extra1 = &init_net.ieee802154_lowpan.frags.low_thresh
},
{
.procname   = "6lowpanfrag_low_thresh",
.data   = 

[PATCH stable 4.9 22/29] ipv6: defrag: drop non-last frags smaller than min mtu

2018-10-09 Thread Florian Fainelli
From: Florian Westphal 

don't bother with pathological cases, they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).

v3: don't use awkward "-offset + len"
v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
There were concerns that there could be even smaller frags
generated by intermediate nodes, e.g. on radio networks.
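The added check in both reassembly paths boils down to the following
predicate (simplified sketch; frag_too_small is an invented helper name):

#include <stdbool.h>
#include <stdio.h>

#define IPV6_MIN_MTU 1280

/* a non-last fragment smaller than the IPv6 minimum MTU can only come
 * from a broken or malicious sender, so it is dropped up front */
static bool frag_too_small(int frag_len, bool more_fragments)
{
	return more_fragments && frag_len < IPV6_MIN_MTU;
}

int main(void)
{
	printf("%d\n", frag_too_small(512, true));	/* 1: dropped */
	printf("%d\n", frag_too_small(512, false));	/* 0: last frag, kept */
	return 0;
}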

Cc: Peter Oskolkov 
Cc: Eric Dumazet 
Signed-off-by: Florian Westphal 
Signed-off-by: David S. Miller 
(cherry picked from commit 0ed4229b08c13c84a3c301a08defdc9e7f4467e6)
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 4 
 net/ipv6/reassembly.c   | 4 
 2 files changed, 8 insertions(+)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index ff49d1f2c8cb..b81541701346 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -564,6 +564,10 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff 
*skb, u32 user)
hdr = ipv6_hdr(skb);
fhdr = (struct frag_hdr *)skb_transport_header(skb);
 
+   if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   return -EINVAL;
+
skb_orphan(skb);
fq = fq_find(net, fhdr->identification, user, hdr,
 skb->dev ? skb->dev->ifindex : 0);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index dbe726c9a2ae..78656bbe50e7 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -516,6 +516,10 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
return 1;
}
 
+   if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   goto fail_hdr;
+
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
-- 
2.17.1



[PATCH stable 4.9 15/29] rhashtable: reorganize struct rhashtable layout

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

While under frags DDOS I noticed unfortunate false sharing between
@nelems and @params.automatic_shrinking

Move @nelems to the end of struct rhashtable so that the first cache line
is shared between all cpus, because it is almost never dirtied.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit e5d672a0780d9e7118caad4c171ec88b8299398d)
---
 include/linux/rhashtable.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 85d1ffc90285..4421e5ccb092 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -138,7 +138,6 @@ struct rhashtable_params {
 /**
  * struct rhashtable - Hash table handle
  * @tbl: Bucket table
- * @nelems: Number of elements in table
  * @key_len: Key length for hashfn
  * @elasticity: Maximum chain length before rehash
  * @p: Configuration parameters
@@ -146,10 +145,10 @@ struct rhashtable_params {
  * @run_work: Deferred worker to expand/shrink asynchronously
  * @mutex: Mutex to protect current/future table swapping
  * @lock: Spin lock to protect walker list
+ * @nelems: Number of elements in table
  */
 struct rhashtable {
struct bucket_table __rcu   *tbl;
-   atomic_t            nelems;
unsigned intkey_len;
unsigned intelasticity;
struct rhashtable_paramsp;
@@ -157,6 +156,7 @@ struct rhashtable {
struct work_struct  run_work;
struct mutexmutex;
spinlock_t  lock;
+   atomic_t            nelems;
 };
 
 /**
-- 
2.17.1



[PATCH stable 4.9 08/29] inet: frags: use rhashtables for reassembly units

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

Some applications still rely on IP fragmentation, and, to be fair, the
linux reassembly unit is not working under any serious load.

It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)

A work queue is supposed to garbage collect items when host is under memory
pressure, and doing a hash rebuild, changing seed used in hash computations.

This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.

Then there is the problem of sharing this hash table for all netns.

It is time to switch to rhashtables, and allocate one of them per netns
to speedup netns dismantle, since this is a critical metric these days.

Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from prior implementation and save
a couple of atomic operations.

Before this patch, 16 cpus (16 RX queue NIC) could not handle more
than 1 Mpps frags DDOS.

After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
of storage for the fragments (exact number depends on frags being evicted
after timeout)

$ grep FRAG /proc/net/sockstat
FRAG: inuse 1966916 memory 2140004608

A followup patch will change the limits for 64bit arches.
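For reviewers who have not used rhashtable before, the shape of the
conversion is roughly the following kernel-C sketch (simplified from
the patch; demo_frag, demo_params and demo_find are invented names and
all error handling is elided):

#include <linux/rhashtable.h>

struct demo_frag {
	struct rhash_head node;			/* linkage in the table */
	struct frag_v4_compare_key key;		/* lookup key, as added below */
};

static const struct rhashtable_params demo_params = {
	.head_offset		= offsetof(struct demo_frag, node),
	.key_offset		= offsetof(struct demo_frag, key),
	.key_len		= sizeof(struct frag_v4_compare_key),
	.automatic_shrinking	= true,
};

/* one table per netns: rhashtable_init(&nf->rhashtable, &demo_params) */

static struct demo_frag *demo_find(struct rhashtable *ht,
				   const struct frag_v4_compare_key *key)
{
	/* lockless RCU lookup, as described above */
	return rhashtable_lookup_fast(ht, key, demo_params);
}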

Signed-off-by: Eric Dumazet 
Cc: Kirill Tkhai 
Cc: Herbert Xu 
Cc: Florian Westphal 
Cc: Jesper Dangaard Brouer 
Cc: Alexander Aring 
Cc: Stefan Schmidt 
Signed-off-by: David S. Miller 
(cherry picked from commit 648700f76b03b7e8149d13cc2bdb3355035258a9)
---
 Documentation/networking/ip-sysctl.txt  |   7 +-
 include/net/inet_frag.h |  81 +++---
 include/net/ipv6.h  |  16 +-
 net/ieee802154/6lowpan/6lowpan_i.h  |  26 +-
 net/ieee802154/6lowpan/reassembly.c |  91 +++---
 net/ipv4/inet_fragment.c| 349 +---
 net/ipv4/ip_fragment.c  | 112 
 net/ipv6/netfilter/nf_conntrack_reasm.c |  51 +---
 net/ipv6/reassembly.c   | 110 
 9 files changed, 267 insertions(+), 576 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index 3db8c67d2c8d..6cd632578ce8 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -123,13 +123,10 @@ min_adv_mss - INTEGER
 IP Fragmentation:
 
 ipfrag_high_thresh - INTEGER
-   Maximum memory used to reassemble IP fragments. When
-   ipfrag_high_thresh bytes of memory is allocated for this purpose,
-   the fragment handler will toss packets until ipfrag_low_thresh
-   is reached. This also serves as a maximum limit to namespaces
-   different from the initial one.
+   Maximum memory used to reassemble IP fragments.
 
 ipfrag_low_thresh - INTEGER
+   (Obsolete since linux-4.17)
Maximum memory used to reassemble IP fragments before the kernel
begins to remove incomplete fragment queues to free up resources.
The kernel still accepts new fragments for defragmentation.
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 19a2ccda0300..b146f1e456a5 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -1,7 +1,11 @@
 #ifndef __NET_FRAG_H__
 #define __NET_FRAG_H__
 
+#include <linux/rhashtable.h>
+
 struct netns_frags {
+   struct rhashtable   rhashtable ____cacheline_aligned_in_smp;
+
/* Keep atomic mem on separate cachelines in structs that include it */
	atomic_t        mem ____cacheline_aligned_in_smp;
/* sysctls */
@@ -25,12 +29,30 @@ enum {
INET_FRAG_COMPLETE  = BIT(2),
 };
 
+struct frag_v4_compare_key {
+   __be32  saddr;
+   __be32  daddr;
+   u32 user;
+   u32 vif;
+   __be16  id;
+   u16 protocol;
+};
+
+struct frag_v6_compare_key {
+   struct in6_addr saddr;
+   struct in6_addr daddr;
+   u32 user;
+   __be32  id;
+   u32 iif;
+};
+
 /**
  * struct inet_frag_queue - fragment queue
  *
- * @lock: spinlock protecting the queue
+ * @node: rhash node
+ * @key: keys identifying this frag.
  * @timer: queue expiration timer
- * @list: hash bucket list
+ * @lock: spinlock protecting this frag
  * @refcnt: reference count of the queue
  * @fragments: received fragments head
  * @fragments_tail: received fragments tail
@@ -40,12 +62,16 @@ enum {
  * @flags: fragment queue flags
  * @max_size: maximum received fragment size
  * @net: namespace that this frag belongs to
- * @list_evictor: list of queues to forcefully evict (e.g. due to low memory)
+ * @rcu: rcu head for freeing deferall
  */
 struct inet_frag_queue {
-   spinlock_t  lock;
+   struct rhash_head   node;
+   union {
+   struct frag_v4_compare_key v4;
+   struct frag_v6_compare_key v6;
+   } key;
struct timer_list   timer;
-   struct hlist_node   list;

[PATCH stable 4.9 07/29] rhashtable: add schedule points

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

Rehashing and destroying large hash table takes a lot of time,
and happens in process context. It is safe to add cond_resched()
in rhashtable_rehash_table() and rhashtable_free_and_destroy()

Signed-off-by: Eric Dumazet 
Acked-by: Herbert Xu 
Signed-off-by: David S. Miller 
(cherry picked from commit ae6da1f503abb5a5081f9f6c4a6881de97830f3e)
---
 lib/rhashtable.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 101dac085c62..fdffd6232365 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -251,8 +251,10 @@ static int rhashtable_rehash_table(struct rhashtable *ht)
if (!new_tbl)
return 0;
 
-   for (old_hash = 0; old_hash < old_tbl->size; old_hash++)
+   for (old_hash = 0; old_hash < old_tbl->size; old_hash++) {
rhashtable_rehash_chain(ht, old_hash);
+   cond_resched();
+   }
 
/* Publish the new table pointer. */
rcu_assign_pointer(ht->tbl, new_tbl);
@@ -993,6 +995,7 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
for (i = 0; i < tbl->size; i++) {
struct rhash_head *pos, *next;
 
+   cond_resched();
for (pos = rht_dereference(tbl->buckets[i], ht),
 next = !rht_is_a_nulls(pos) ?
rht_dereference(pos->next, ht) : NULL;
-- 
2.17.1



[PATCH stable 4.9 09/29] inet: frags: remove some helpers

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()

Also since we use rhashtable we can bring back the number of fragments
in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
removed in commit 434d305405ab ("inet: frag: don't account number
of fragment queues")

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 6befe4a78b1553edb6eed3a78b4bcd9748526672)
---
 include/net/inet_frag.h | 5 -
 include/net/ip.h| 1 -
 include/net/ipv6.h  | 7 ---
 net/ipv4/ip_fragment.c  | 5 -
 net/ipv4/proc.c | 6 +++---
 net/ipv6/proc.c | 5 +++--
 6 files changed, 6 insertions(+), 23 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index b146f1e456a5..a49cdf25cef0 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -140,11 +140,6 @@ static inline void add_frag_mem_limit(struct netns_frags 
*nf, int i)
	atomic_add(i, &nf->mem);
 }
 
-static inline int sum_frag_mem_limit(struct netns_frags *nf)
-{
-   return atomic_read(&nf->mem);
-}
-
 /* RFC 3168 support :
  * We want to check ECN values of all fragments, do detect invalid 
combinations.
  * In ipq->ecn, we store the OR value of each ip4_frag_ecn() fragment value.
diff --git a/include/net/ip.h b/include/net/ip.h
index bc9b4deeb60e..8646da034851 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -548,7 +548,6 @@ static inline struct sk_buff *ip_check_defrag(struct net 
*net, struct sk_buff *s
return skb;
 }
 #endif
-int ip_frag_mem(struct net *net);
 
 /*
  * Functions provided by ip_forward.c
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 6b9b551f653b..7cb100d25bb5 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -330,13 +330,6 @@ static inline bool ipv6_accept_ra(struct inet6_dev *idev)
idev->cnf.accept_ra;
 }
 
-#if IS_ENABLED(CONFIG_IPV6)
-static inline int ip6_frag_mem(struct net *net)
-{
-   return sum_frag_mem_limit(&net->ipv6.frags);
-}
-#endif
-
 #define IPV6_FRAG_HIGH_THRESH  (4 * 1024*1024) /* 4194304 */
 #define IPV6_FRAG_LOW_THRESH   (3 * 1024*1024) /* 3145728 */
 #define IPV6_FRAG_TIMEOUT  (60 * HZ)   /* 60 seconds */
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 29038513f3ca..8c4072b19296 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -82,11 +82,6 @@ static u8 ip4_frag_ecn(u8 tos)
 
 static struct inet_frags ip4_frags;
 
-int ip_frag_mem(struct net *net)
-{
-   return sum_frag_mem_limit(&net->ipv4.frags);
-}
-
 static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 struct net_device *dev);
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 7143ca1a6af9..b7a2d002cb27 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -54,7 +54,6 @@
 static int sockstat_seq_show(struct seq_file *seq, void *v)
 {
struct net *net = seq->private;
-   unsigned int frag_mem;
int orphans, sockets;
 
local_bh_disable();
@@ -74,8 +73,9 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
   sock_prot_inuse_get(net, &udplite_prot));
seq_printf(seq, "RAW: inuse %d\n",
   sock_prot_inuse_get(net, &raw_prot));
-   frag_mem = ip_frag_mem(net);
-   seq_printf(seq,  "FRAG: inuse %u memory %u\n", !!frag_mem, frag_mem);
+   seq_printf(seq,  "FRAG: inuse %u memory %u\n",
+  atomic_read(&net->ipv4.frags.rhashtable.nelems),
+  frag_mem_limit(&net->ipv4.frags));
return 0;
 }
 
diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index e88bcb8ff0fd..5704ec3d3178 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -38,7 +38,6 @@
 static int sockstat6_seq_show(struct seq_file *seq, void *v)
 {
struct net *net = seq->private;
-   unsigned int frag_mem = ip6_frag_mem(net);
 
seq_printf(seq, "TCP6: inuse %d\n",
   sock_prot_inuse_get(net, &tcpv6_prot));
@@ -48,7 +47,9 @@ static int sockstat6_seq_show(struct seq_file *seq, void *v)
	   sock_prot_inuse_get(net, &udplitev6_prot));
seq_printf(seq, "RAW6: inuse %d\n",
   sock_prot_inuse_get(net, &rawv6_prot));
-   seq_printf(seq, "FRAG6: inuse %u memory %u\n", !!frag_mem, frag_mem);
+   seq_printf(seq, "FRAG6: inuse %u memory %u\n",
+  atomic_read(&net->ipv6.frags.rhashtable.nelems),
+  frag_mem_limit(&net->ipv6.frags));
return 0;
 }
 
-- 
2.17.1



[PATCH stable 4.9 14/29] ipv6: frags: rewrite ip6_expire_frag_queue()

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

Make it similar to IPv4 ip_expire(), and release the lock
before calling icmp functions.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 05c0b86b9696802fd0ce5676a92a63f1b455bdf3)
---
 net/ipv6/reassembly.c | 24 
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 8a4ece339c19..1cb45a0d1a0e 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -92,7 +92,9 @@ EXPORT_SYMBOL(ip6_frag_init);
 void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
 {
struct net_device *dev = NULL;
+   struct sk_buff *head;
 
+   rcu_read_lock();
	spin_lock(&fq->q.lock);
 
if (fq->q.flags & INET_FRAG_COMPLETE)
@@ -100,28 +102,34 @@ void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
 
	inet_frag_kill(&fq->q);
 
-   rcu_read_lock();
dev = dev_get_by_index_rcu(net, fq->iif);
if (!dev)
-   goto out_rcu_unlock;
+   goto out;
 
__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMFAILS);
__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMTIMEOUT);
 
/* Don't send error if the first segment did not arrive. */
-   if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !fq->q.fragments)
-   goto out_rcu_unlock;
+   head = fq->q.fragments;
+   if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !head)
+   goto out;
 
/* But use as source device on which LAST ARRIVED
 * segment was received. And do not use fq->dev
 * pointer directly, device might already disappeared.
 */
-   fq->q.fragments->dev = dev;
-   icmpv6_send(fq->q.fragments, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
-out_rcu_unlock:
-   rcu_read_unlock();
+   head->dev = dev;
+   skb_get(head);
+   spin_unlock(&fq->q.lock);
+
+   icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
+   kfree_skb(head);
+   goto out_rcu_unlock;
+
 out:
	spin_unlock(&fq->q.lock);
+out_rcu_unlock:
+   rcu_read_unlock();
	inet_frag_put(&fq->q);
 }
 EXPORT_SYMBOL(ip6_expire_frag_queue);
-- 
2.17.1
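
The shape of the fix is the classic take-a-reference / drop-the-lock /
do-slow-work / put-the-reference pattern: skb_get() pins the head skb so the
spinlock can be released before calling into icmpv6_send(). A standalone
sketch of the same idea using pthreads (names are illustrative, not from the
kernel):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct obj {
        pthread_mutex_t lock;
        int refcnt;             /* protected by lock in this toy */
        int data;
    };

    static void obj_put(struct obj *o)
    {
        pthread_mutex_lock(&o->lock);
        int last = (--o->refcnt == 0);
        pthread_mutex_unlock(&o->lock);
        if (last)
            free(o);
    }

    /* Stands in for icmpv6_send(): must not run under o->lock. */
    static void slow_work(struct obj *o)
    {
        printf("working on %d\n", o->data);
    }

    int main(void)
    {
        struct obj *o = malloc(sizeof(*o));

        pthread_mutex_init(&o->lock, NULL);
        o->refcnt = 1;
        o->data = 42;

        pthread_mutex_lock(&o->lock);
        o->refcnt++;                    /* like skb_get(head) */
        pthread_mutex_unlock(&o->lock); /* spin_unlock(&fq->q.lock) */
        slow_work(o);                   /* icmpv6_send(head, ...) */
        obj_put(o);                     /* kfree_skb(head) */

        obj_put(o);                     /* drop the original reference */
        return 0;
    }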



[PATCH stable 4.9 23/29] net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

After working on IP defragmentation lately, I found that some large
packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
zero paddings on the last (small) fragment.

While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
to CHECKSUM_NONE, forcing a full csum validation, even if all prior
fragments had CHECKSUM_COMPLETE set.

We can instead compute the checksum of the part we are trimming,
usually smaller than the part we keep.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 88078d98d1bb085d72af8437707279e203524fa5)
---
 include/linux/skbuff.h |  5 ++---
 net/core/skbuff.c  | 14 ++
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 724c6abdb9e6..11974e63a41d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2954,6 +2954,7 @@ static inline unsigned char *skb_push_rcsum(struct sk_buff *skb,
return skb->data;
 }
 
+int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len);
 /**
  * pskb_trim_rcsum - trim received skb and update checksum
  * @skb: buffer to trim
@@ -2967,9 +2968,7 @@ static inline int pskb_trim_rcsum(struct sk_buff *skb, unsigned int len)
 {
if (likely(len >= skb->len))
return 0;
-   if (skb->ip_summed == CHECKSUM_COMPLETE)
-   skb->ip_summed = CHECKSUM_NONE;
-   return __pskb_trim(skb, len);
+   return pskb_trim_rcsum_slow(skb, len);
 }
 
 static inline int __skb_trim_rcsum(struct sk_buff *skb, unsigned int len)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4e545e4432eb..038ec74fa131 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1578,6 +1578,20 @@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
 }
 EXPORT_SYMBOL(___pskb_trim);
 
+/* Note : use pskb_trim_rcsum() instead of calling this directly
+ */
+int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len)
+{
+   if (skb->ip_summed == CHECKSUM_COMPLETE) {
+   int delta = skb->len - len;
+
+   skb->csum = csum_sub(skb->csum,
+skb_checksum(skb, len, delta, 0));
+   }
+   return __pskb_trim(skb, len);
+}
+EXPORT_SYMBOL(pskb_trim_rcsum_slow);
+
 /**
  * __pskb_pull_tail - advance tail of skb header
  * @skb: buffer to reallocate
-- 
2.17.1
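
The optimization leans on a property of the 16-bit one's-complement Internet
checksum: the sum of a buffer equals the one's-complement sum of its parts,
so the checksum of the trimmed tail can simply be subtracted from the total.
A small userspace demonstration (illustrative; csum16()/csum_sub16() are
simplified stand-ins for the kernel's skb_checksum()/csum_sub(), and the
trim offset is kept even so 16-bit word boundaries line up):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Folded one's-complement sum over a byte buffer. */
    static uint16_t csum16(const uint8_t *buf, size_t len)
    {
        uint32_t sum = 0;
        size_t i;

        for (i = 0; i + 1 < len; i += 2)
            sum += (buf[i] << 8) | buf[i + 1];
        if (len & 1)
            sum += buf[len - 1] << 8;
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
    }

    /* One's-complement subtraction, like the kernel's csum_sub(). */
    static uint16_t csum_sub16(uint16_t a, uint16_t b)
    {
        uint32_t d = a + (uint16_t)~b;

        return (uint16_t)((d & 0xffff) + (d >> 16));
    }

    int main(void)
    {
        uint8_t pkt[64];
        size_t keep = 40;       /* bytes kept after the trim */

        memset(pkt, 0xab, sizeof(pkt));
        pkt[5] = 0x01;
        pkt[50] = 0x7f;

        uint16_t full = csum16(pkt, sizeof(pkt));
        uint16_t tail = csum16(pkt + keep, sizeof(pkt) - keep);
        uint16_t head = csum16(pkt, keep);

        /* csum(full) - csum(tail) == csum(head) */
        printf("%04x == %04x\n", csum_sub16(full, tail), head);
        return 0;
    }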



[PATCH stable 4.9 10/29] inet: frags: get rid of inet_frag_evicting()

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

This refactors ip_expire() since one indentation level is removed.

Note: in the future, we should try hard to avoid the skb_clone()
since this is a serious performance cost.
Under DDOS, the ICMP message won't be sent because of rate limits.

Fact that ip6_expire_frag_queue() does not use skb_clone() is
disturbing too. Presumably IPv6 should have the same
issue than the one we fixed in commit ec4fbd64751d
("inet: frag: release spinlock before calling icmp_send()")

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 399d1404be660d355192ff4df5ccc3f4159ec1e4)
---
 include/net/inet_frag.h |  5 
 net/ipv4/ip_fragment.c  | 65 -
 net/ipv6/reassembly.c   |  4 ---
 3 files changed, 32 insertions(+), 42 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index a49cdf25cef0..7e984045b2b7 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -118,11 +118,6 @@ static inline void inet_frag_put(struct inet_frag_queue *q)
inet_frag_destroy(q);
 }
 
-static inline bool inet_frag_evicting(struct inet_frag_queue *q)
-{
-   return false;
-}
-
 /* Memory Tracking Functions. */
 
 static inline int frag_mem_limit(struct netns_frags *nf)
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 8c4072b19296..308592a8ba97 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -141,8 +141,11 @@ static bool frag_expire_skip_icmp(u32 user)
  */
 static void ip_expire(unsigned long arg)
 {
-   struct ipq *qp;
+   struct sk_buff *clone, *head;
+   const struct iphdr *iph;
struct net *net;
+   struct ipq *qp;
+   int err;
 
qp = container_of((struct inet_frag_queue *) arg, struct ipq, q);
net = container_of(qp->q.net, struct net, ipv4.frags);
@@ -156,45 +159,41 @@ static void ip_expire(unsigned long arg)
ipq_kill(qp);
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
 
-   if (!inet_frag_evicting(&qp->q)) {
-   struct sk_buff *clone, *head = qp->q.fragments;
-   const struct iphdr *iph;
-   int err;
+   head = qp->q.fragments;
 
-   __IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
+   __IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-   if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !qp->q.fragments)
-   goto out;
+   if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+   goto out;
 
-   head->dev = dev_get_by_index_rcu(net, qp->iif);
-   if (!head->dev)
-   goto out;
+   head->dev = dev_get_by_index_rcu(net, qp->iif);
+   if (!head->dev)
+   goto out;
 
 
-   /* skb has no dst, perform route lookup again */
-   iph = ip_hdr(head);
-   err = ip_route_input_noref(head, iph->daddr, iph->saddr,
+   /* skb has no dst, perform route lookup again */
+   iph = ip_hdr(head);
+   err = ip_route_input_noref(head, iph->daddr, iph->saddr,
   iph->tos, head->dev);
-   if (err)
-   goto out;
+   if (err)
+   goto out;
 
-   /* Only an end host needs to send an ICMP
-* "Fragment Reassembly Timeout" message, per RFC792.
-*/
-   if (frag_expire_skip_icmp(qp->q.key.v4.user) &&
-   (skb_rtable(head)->rt_type != RTN_LOCAL))
-   goto out;
-
-   clone = skb_clone(head, GFP_ATOMIC);
-
-   /* Send an ICMP "Fragment Reassembly Timeout" message. */
-   if (clone) {
+   spin_unlock(&qp->q.lock);
-   icmp_send(clone, ICMP_TIME_EXCEEDED,
- ICMP_EXC_FRAGTIME, 0);
-   consume_skb(clone);
-   goto out_rcu_unlock;
-   }
+   /* Only an end host needs to send an ICMP
+* "Fragment Reassembly Timeout" message, per RFC792.
+*/
+   if (frag_expire_skip_icmp(qp->q.key.v4.user) &&
+   (skb_rtable(head)->rt_type != RTN_LOCAL))
+   goto out;
+
+   clone = skb_clone(head, GFP_ATOMIC);
+
+   /* Send an ICMP "Fragment Reassembly Timeout" message. */
+   if (clone) {
+   spin_unlock(&qp->q.lock);
+   icmp_send(clone, ICMP_TIME_EXCEEDED,
+ ICMP_EXC_FRAGTIME, 0);
+   consume_skb(clone);
+   goto out_rcu_unlock;
}
 out:
	spin_unlock(&qp->q.lock);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 8fe3836d1410..629a45a4c79f 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -106,10 +106,6 @@ void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
goto out_rcu_unlock;
 
__IP6_INC_STATS(net, __in6_dev_get(dev), 

[PATCH stable 4.9 24/29] net: add rb_to_skb() and other rb tree helpers

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

Generalize private netem_rb_to_skb()

TCP rtx queue will soon be converted to rb-tree,
so we will need skb_rbtree_walk() helpers.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 18a4c0eab2623cc95be98a1e6af1ad18e7695977)
---
 include/linux/skbuff.h | 18 ++
 net/ipv4/tcp_input.c   | 33 -
 2 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 11974e63a41d..7e7e12aeaf82 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2988,6 +2988,12 @@ static inline int __skb_grow_rcsum(struct sk_buff *skb, unsigned int len)
 
 #define rb_to_skb(rb) rb_entry_safe(rb, struct sk_buff, rbnode)
 
+#define rb_to_skb(rb) rb_entry_safe(rb, struct sk_buff, rbnode)
+#define skb_rb_first(root) rb_to_skb(rb_first(root))
+#define skb_rb_last(root)  rb_to_skb(rb_last(root))
+#define skb_rb_next(skb)   rb_to_skb(rb_next(&(skb)->rbnode))
+#define skb_rb_prev(skb)   rb_to_skb(rb_prev(&(skb)->rbnode))
+
#define skb_queue_walk(queue, skb) \
	for (skb = (queue)->next;					\
	     skb != (struct sk_buff *)(queue);				\
@@ -3002,6 +3008,18 @@ static inline int __skb_grow_rcsum(struct sk_buff *skb, unsigned int len)
	for (; skb != (struct sk_buff *)(queue);			\
	     skb = skb->next)

+#define skb_rbtree_walk(skb, root)					\
+	for (skb = skb_rb_first(root); skb != NULL;			\
+	     skb = skb_rb_next(skb))
+
+#define skb_rbtree_walk_from(skb)					\
+	for (; skb != NULL;						\
+	     skb = skb_rb_next(skb))
+
+#define skb_rbtree_walk_from_safe(skb, tmp)				\
+	for (; tmp = skb ? skb_rb_next(skb) : NULL, (skb != NULL);	\
+	     skb = tmp)
+
#define skb_queue_walk_from_safe(queue, skb, tmp)			\
	for (tmp = skb->next;						\
	     skb != (struct sk_buff *)(queue);				\
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9d0b73aa649f..c169a2b261be 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4406,7 +4406,7 @@ static void tcp_ofo_queue(struct sock *sk)
 
	p = rb_first(&tp->out_of_order_queue);
while (p) {
-   skb = rb_entry(p, struct sk_buff, rbnode);
+   skb = rb_to_skb(p);
if (after(TCP_SKB_CB(skb)->seq, tp->rcv_nxt))
break;
 
@@ -4470,7 +4470,7 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
 static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
-   struct rb_node **p, *q, *parent;
+   struct rb_node **p, *parent;
struct sk_buff *skb1;
u32 seq, end_seq;
bool fragstolen;
@@ -4529,7 +4529,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
parent = NULL;
while (*p) {
parent = *p;
-   skb1 = rb_entry(parent, struct sk_buff, rbnode);
+   skb1 = rb_to_skb(parent);
if (before(seq, TCP_SKB_CB(skb1)->seq)) {
	p = &parent->rb_left;
continue;
@@ -4574,9 +4574,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 
 merge_right:
/* Remove other segments covered by skb. */
-   while ((q = rb_next(&skb->rbnode)) != NULL) {
-   skb1 = rb_entry(q, struct sk_buff, rbnode);
-
+   while ((skb1 = skb_rb_next(skb)) != NULL) {
if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
break;
if (before(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
@@ -4591,7 +4589,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
tcp_drop(sk, skb1);
}
/* If there is no skb after us, we are the last_skb ! */
-   if (!q)
+   if (!skb1)
tp->ooo_last_skb = skb;
 
 add_sack:
@@ -4792,7 +4790,7 @@ static struct sk_buff *tcp_skb_next(struct sk_buff *skb, struct sk_buff_head *li
if (list)
return !skb_queue_is_last(list, skb) ? skb->next : NULL;
 
-   return rb_entry_safe(rb_next(&skb->rbnode), struct sk_buff, rbnode);
+   return skb_rb_next(skb);
 }
 
 static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
@@ -4821,7 +4819,7 @@ static void tcp_rbtree_insert(struct rb_root *root, struct sk_buff *skb)
 
while (*p) {
parent = *p;
-   skb1 = rb_entry(parent, struct sk_buff, rbnode);
+
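
The _safe variant uses the comma operator to latch the successor into tmp
before the loop body runs, so the body may safely free the current node.
This is the same trick skb_rbtree_walk_from_safe() plays above. A standalone
sketch on a plain singly-linked list (illustrative only):

    #include <stdio.h>
    #include <stdlib.h>

    struct node {
        int val;
        struct node *next;
    };

    /* Same shape as skb_rbtree_walk_from_safe(): fetch the successor
     * into tmp first, then test the loop condition. */
    #define list_walk_from_safe(pos, tmp)                           \
        for (; tmp = (pos) ? (pos)->next : NULL, ((pos) != NULL);   \
             pos = tmp)

    int main(void)
    {
        struct node *head = NULL, *pos, *tmp;
        int i;

        for (i = 3; i >= 1; i--) {
            struct node *n = malloc(sizeof(*n));

            n->val = i;
            n->next = head;
            head = n;
        }

        pos = head;
        list_walk_from_safe(pos, tmp) {
            printf("%d\n", pos->val);
            free(pos);      /* safe: successor already saved in tmp */
        }
        return 0;
    }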

[PATCH stable 4.9 01/29] inet: frags: change inet_frags_init_net() return value

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

We will soon initialize one rhashtable per struct netns_frags
in inet_frags_init_net().

This patch changes the return value to eventually propagate an
error.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 787bea7748a76130566f881c2342a0be4127d182)
---
 include/net/inet_frag.h |  3 ++-
 net/ieee802154/6lowpan/reassembly.c | 11 ---
 net/ipv4/ip_fragment.c  | 12 +---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 12 +---
 net/ipv6/reassembly.c   | 11 +--
 5 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 634d19203e7d..3fb9c48dedfe 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -103,9 +103,10 @@ struct inet_frags {
 int inet_frags_init(struct inet_frags *);
 void inet_frags_fini(struct inet_frags *);
 
-static inline void inet_frags_init_net(struct netns_frags *nf)
+static inline int inet_frags_init_net(struct netns_frags *nf)
 {
	atomic_set(&nf->mem, 0);
+   return 0;
 }
 void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f);
 
diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index f85b08baff16..9757ce6c077a 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -580,14 +580,19 @@ static int __net_init lowpan_frags_init_net(struct net *net)
 {
struct netns_ieee802154_lowpan *ieee802154_lowpan =
net_ieee802154_lowpan(net);
+   int res;
 
ieee802154_lowpan->frags.high_thresh = IPV6_FRAG_HIGH_THRESH;
ieee802154_lowpan->frags.low_thresh = IPV6_FRAG_LOW_THRESH;
ieee802154_lowpan->frags.timeout = IPV6_FRAG_TIMEOUT;
 
-   inet_frags_init_net(&ieee802154_lowpan->frags);
-
-   return lowpan_frags_ns_sysctl_register(net);
+   res = inet_frags_init_net(&ieee802154_lowpan->frags);
+   if (res < 0)
+   return res;
+   res = lowpan_frags_ns_sysctl_register(net);
+   if (res < 0)
+   inet_frags_exit_net(&ieee802154_lowpan->frags, &lowpan_frags);
+   return res;
 }
 
 static void __net_exit lowpan_frags_exit_net(struct net *net)
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 752711cd4834..803914569858 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -849,6 +849,8 @@ static void __init ip4_frags_ctl_register(void)
 
 static int __net_init ipv4_frags_init_net(struct net *net)
 {
+   int res;
+
/* Fragment cache limits.
 *
 * The fragment memory accounting code, (tries to) account for
@@ -874,9 +876,13 @@ static int __net_init ipv4_frags_init_net(struct net *net)
 
net->ipv4.frags.max_dist = 64;
 
-   inet_frags_init_net(&net->ipv4.frags);
-
-   return ip4_frags_ns_ctl_register(net);
+   res = inet_frags_init_net(&net->ipv4.frags);
+   if (res < 0)
+   return res;
+   res = ip4_frags_ns_ctl_register(net);
+   if (res < 0)
+   inet_frags_exit_net(&net->ipv4.frags, &ip4_frags);
+   return res;
 }
 
 static void __net_exit ipv4_frags_exit_net(struct net *net)
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index ee33a6743f3b..afa9ea76155d 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -630,12 +630,18 @@ EXPORT_SYMBOL_GPL(nf_ct_frag6_gather);
 
 static int nf_ct_net_init(struct net *net)
 {
+   int res;
+
net->nf_frag.frags.high_thresh = IPV6_FRAG_HIGH_THRESH;
net->nf_frag.frags.low_thresh = IPV6_FRAG_LOW_THRESH;
net->nf_frag.frags.timeout = IPV6_FRAG_TIMEOUT;
-   inet_frags_init_net(&net->nf_frag.frags);
-
-   return nf_ct_frag6_sysctl_register(net);
+   res = inet_frags_init_net(&net->nf_frag.frags);
+   if (res < 0)
+   return res;
+   res = nf_ct_frag6_sysctl_register(net);
+   if (res < 0)
+   inet_frags_exit_net(&net->nf_frag.frags, &nf_frags);
+   return res;
 }
 
 static void nf_ct_net_exit(struct net *net)
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index e585c0a2591c..ee8095eaca5d 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -709,13 +709,20 @@ static void ip6_frags_sysctl_unregister(void)
 
 static int __net_init ipv6_frags_init_net(struct net *net)
 {
+   int res;
+
net->ipv6.frags.high_thresh = IPV6_FRAG_HIGH_THRESH;
net->ipv6.frags.low_thresh = IPV6_FRAG_LOW_THRESH;
net->ipv6.frags.timeout = IPV6_FRAG_TIMEOUT;
 
-   inet_frags_init_net(&net->ipv6.frags);
+   res = inet_frags_init_net(&net->ipv6.frags);
+   if (res < 0)
+   return res;
 
-   return ip6_frags_ns_sysctl_register(net);
+   res = ip6_frags_ns_sysctl_register(net);
+   if (res < 0)
+   inet_frags_exit_net(&net->ipv6.frags, &ip6_frags);
+   return res;
 }
 
 static void __net_exit 
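
The recurring shape in this patch is two-stage init with rollback: if the
second step fails, the first is torn down before the error is propagated.
A condensed, compilable sketch of the idiom (names are illustrative):

    #include <stdio.h>

    static int subsys_a_init(void)  { puts("A up");    return 0; }
    static void subsys_a_exit(void) { puts("A down");  }
    static int subsys_b_init(void)  { puts("B fails"); return -1; }

    /* Mirrors ipv4_frags_init_net() above: inet_frags_init_net() is
     * subsys_a, ip4_frags_ns_ctl_register() is subsys_b. */
    static int my_init_net(void)
    {
        int res;

        res = subsys_a_init();
        if (res < 0)
            return res;
        res = subsys_b_init();
        if (res < 0)
            subsys_a_exit();        /* rollback the first stage */
        return res;
    }

    int main(void)
    {
        return my_init_net() < 0 ? 1 : 0;
    }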

[PATCH stable 4.9 02/29] inet: frags: add a pointer to struct netns_frags

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

In order to simplify the API, add a pointer to struct inet_frags.
This will allow us to make things less complex.

These functions no longer have a struct inet_frags parameter :

inet_frag_destroy(struct inet_frag_queue *q  /*, struct inet_frags *f */)
inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 093ba72914b696521e4885756a68a3332782c8de)
---
 include/net/inet_frag.h | 11 ++-
 include/net/ipv6.h  |  3 +--
 net/ieee802154/6lowpan/reassembly.c | 13 +++--
 net/ipv4/inet_fragment.c| 17 ++---
 net/ipv4/ip_fragment.c  |  9 +
 net/ipv6/netfilter/nf_conntrack_reasm.c | 16 +---
 net/ipv6/reassembly.c   | 20 ++--
 7 files changed, 48 insertions(+), 41 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 3fb9c48dedfe..19a2ccda0300 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -9,6 +9,7 @@ struct netns_frags {
int high_thresh;
int low_thresh;
int max_dist;
+   struct inet_frags   *f;
 };
 
 /**
@@ -108,20 +109,20 @@ static inline int inet_frags_init_net(struct netns_frags *nf)
	atomic_set(&nf->mem, 0);
return 0;
 }
-void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f);
+void inet_frags_exit_net(struct netns_frags *nf);
 
-void inet_frag_kill(struct inet_frag_queue *q, struct inet_frags *f);
-void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f);
+void inet_frag_kill(struct inet_frag_queue *q);
+void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
struct inet_frags *f, void *key, unsigned int hash);
 
 void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
   const char *prefix);
 
-static inline void inet_frag_put(struct inet_frag_queue *q, struct inet_frags *f)
+static inline void inet_frag_put(struct inet_frag_queue *q)
 {
if (atomic_dec_and_test(>refcnt))
-   inet_frag_destroy(q, f);
+   inet_frag_destroy(q);
 }
 
 static inline bool inet_frag_evicting(struct inet_frag_queue *q)
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 64b0e9df31c7..903a10a5f259 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -559,8 +559,7 @@ struct frag_queue {
u8  ecn;
 };
 
-void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq,
-  struct inet_frags *frags);
+void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq);
 
 static inline bool ipv6_addr_any(const struct in6_addr *a)
 {
diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index 9757ce6c077a..9ccb8458b5c3 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -93,10 +93,10 @@ static void lowpan_frag_expire(unsigned long data)
if (fq->q.flags & INET_FRAG_COMPLETE)
goto out;
 
-   inet_frag_kill(&fq->q, &lowpan_frags);
+   inet_frag_kill(&fq->q);
 out:
	spin_unlock(&fq->q.lock);
-   inet_frag_put(&fq->q, &lowpan_frags);
+   inet_frag_put(&fq->q);
 }
 
 static inline struct lowpan_frag_queue *
@@ -229,7 +229,7 @@ static int lowpan_frag_reasm(struct lowpan_frag_queue *fq, struct sk_buff *prev,
struct sk_buff *fp, *head = fq->q.fragments;
int sum_truesize;
 
-   inet_frag_kill(&fq->q, &lowpan_frags);
+   inet_frag_kill(&fq->q);
 
/* Make the one we just received the head. */
if (prev) {
@@ -437,7 +437,7 @@ int lowpan_frag_rcv(struct sk_buff *skb, u8 frag_type)
ret = lowpan_frag_queue(fq, skb, frag_type);
	spin_unlock(&fq->q.lock);
 
-   inet_frag_put(&fq->q, &lowpan_frags);
+   inet_frag_put(&fq->q);
return ret;
}
 
@@ -585,13 +585,14 @@ static int __net_init lowpan_frags_init_net(struct net *net)
ieee802154_lowpan->frags.high_thresh = IPV6_FRAG_HIGH_THRESH;
ieee802154_lowpan->frags.low_thresh = IPV6_FRAG_LOW_THRESH;
ieee802154_lowpan->frags.timeout = IPV6_FRAG_TIMEOUT;
+   ieee802154_lowpan->frags.f = &lowpan_frags;
 
	res = inet_frags_init_net(&ieee802154_lowpan->frags);
if (res < 0)
return res;
res = lowpan_frags_ns_sysctl_register(net);
if (res < 0)
-   inet_frags_exit_net(&ieee802154_lowpan->frags, &lowpan_frags);
+   inet_frags_exit_net(&ieee802154_lowpan->frags);
return res;
 }
 
@@ -601,7 +602,7 @@ static void __net_exit 
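
Storing the back-pointer once at init time is what lets every later call
site drop its struct inet_frags argument. A minimal sketch of the same API
simplification (illustrative names, not kernel code):

    #include <stdio.h>

    struct ops {
        void (*destructor)(void);
    };

    struct ctx {
        const struct ops *f;    /* like the new netns_frags.f */
        int refcnt;
    };

    static void my_destructor(void) { puts("destroyed"); }
    static const struct ops my_ops = { .destructor = my_destructor };

    /* Before: ctx_put(struct ctx *c, const struct ops *f).
     * After: the ops pointer travels inside the object itself. */
    static void ctx_put(struct ctx *c)
    {
        if (--c->refcnt == 0)
            c->f->destructor();
    }

    int main(void)
    {
        struct ctx c = { .f = &my_ops, .refcnt = 1 };

        ctx_put(&c);    /* no ops argument needed any more */
        return 0;
    }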

[PATCH stable 4.9 00/29] backport of IP fragmentation fixes

2018-10-09 Thread Florian Fainelli
This is based on Stephen's v4.14 patches, with the necessary merge
conflicts, and the lack of timer_setup() on the 4.9 baseline.

Perf results on a gigabit capable system, before and after are below.

Series can also be found here:

https://github.com/ffainelli/linux/commits/fragment-stack-v4.9


   PerfTop: 457 irqs/sec  kernel:74.4%  exact:  0.0% [4000Hz cycles],  (all, 4 CPUs)
---

29.62%  [kernel]   [k] ip_defrag  
 6.57%  [kernel]   [k] arch_cpu_idle  
 1.72%  [kernel]   [k] v7_dma_inv_range   
 1.68%  [kernel]   [k] __netif_receive_skb_core   
 1.43%  [kernel]   [k] fib_table_lookup   
 1.30%  [kernel]   [k] finish_task_switch 
 1.08%  [kernel]   [k] ip_rcv 
 1.01%  [kernel]   [k] skb_release_data   
 0.99%  [kernel]   [k] __slab_free
 0.96%  [kernel]   [k] bcm_sysport_poll   
 0.88%  [kernel]   [k] __netdev_alloc_skb 
 0.87%  [kernel]   [k] tick_nohz_idle_enter   
 0.86%  [kernel]   [k] dev_gro_receive
 0.85%  [kernel]   [k] _raw_spin_unlock_irqrestore
 0.84%  [kernel]   [k] __memzero  
 0.74%  [kernel]   [k] tick_nohz_idle_exit
 0.73%  ld-2.24.so [.] do_lookup_x
 0.66%  [kernel]   [k] kmem_cache_free
 0.66%  [kernel]   [k] bcm_sysport_rx_refill  
 0.65%  [kernel]   [k] eth_type_trans 


After patching:

  PerfTop: 170 irqs/sec  kernel:86.5%  exact:  0.0% [4000Hz cycles],  (all, 4 CPUs)
---

 7.79%  [kernel]   [k] arch_cpu_idle  
 5.14%  [kernel]   [k] v7_dma_inv_range   
 4.20%  [kernel]   [k] ip_defrag  
 3.89%  [kernel]   [k] __netif_receive_skb_core   
 3.65%  [kernel]   [k] fib_table_lookup   
 2.16%  [kernel]   [k] finish_task_switch 
 1.93%  [kernel]   [k] _raw_spin_unlock_irqrestore
 1.90%  [kernel]   [k] ip_rcv 
 1.84%  [kernel]   [k] bcm_sysport_poll   
 1.83%  [kernel]   [k] __memzero  
 1.65%  [kernel]   [k] __netdev_alloc_skb 
 1.60%  [kernel]   [k] __slab_free
 1.49%  [kernel]   [k] __do_softirq   
 1.49%  [kernel]   [k] bcm_sysport_rx_refill  
 1.31%  [kernel]   [k] dma_cache_maint_page   
 1.25%  [kernel]   [k] tick_nohz_idle_enter   
 1.24%  [kernel]   [k] ip_route_input_noref   
 1.17%  [kernel]   [k] eth_type_trans 
 1.06%  [kernel]   [k] fib_validate_source
 1.03%  [kernel]   [k] inet_frag_find

Dan Carpenter (1):
  ipv4: frags: precedence bug in ip_expire()

Eric Dumazet (22):
  inet: frags: change inet_frags_init_net() return value
  inet: frags: add a pointer to struct netns_frags
  inet: frags: refactor ipfrag_init()
  inet: frags: refactor ipv6_frag_init()
  inet: frags: refactor lowpan_net_frag_init()
  ipv6: export ip6 fragments sysctl to unprivileged users
  rhashtable: add schedule points
  inet: frags: use rhashtables for reassembly units
  inet: frags: remove some helpers
  inet: frags: get rid of inet_frag_evicting()
  inet: frags: remove inet_frag_maybe_warn_overflow()
  inet: frags: break the 2GB limit for frags storage
  inet: frags: do not clone skb in ip_expire()
  ipv6: frags: rewrite ip6_expire_frag_queue()
  rhashtable: reorganize struct rhashtable layout
  inet: frags: reorganize struct netns_frags
  inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
  inet: frags: fix ip6frag_low_thresh boundary
  net: speed up skb_rbtree_purge()
  net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
  net: add rb_to_skb() and other rb tree helpers
  net: sk_buff rbnode reorg

Florian Westphal (1):
  ipv6: defrag: drop non-last frags smaller than min mtu

Peter Oskolkov (4):
  ip: discard IPv4 datagrams with overlapping segments.
  net: modify skb_rbtree_purge to return the truesize of all purged
skbs.
  ip: add helpers to process in-order fragments faster.
  ip: process in-order fragments efficiently

Taehee Yoo (1):
  ip: frags: fix crash in ip_do_fragment()

 Documentation/networking/ip-sysctl.txt  |  13 +-
 include/linux/rhashtable.h  |   4 +-
 include/linux/skbuff.h  |  48 +-
 include/net/inet_frag.h | 133 +++---
 include/net/ip.h|   1 -
 include/net/ipv6.h  |  26 +-
 include/uapi/linux/snmp.h   |   1 +
 lib/rhashtable.c|   5 +-
 net/core/skbuff.c   |  31 +-
 

[PATCH stable 4.9 03/29] inet: frags: refactor ipfrag_init()

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

We need to call inet_frags_init() before register_pernet_subsys(),
as a prereq for following patch ("inet: frags: use rhashtables for reassembly 
units")

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 483a6e4fa055123142d8956866fe2aa9c98d546d)
---
 net/ipv4/ip_fragment.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 0fd3b475b929..b6fe05e8a431 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -899,8 +899,6 @@ static struct pernet_operations ip4_frags_ops = {
 
 void __init ipfrag_init(void)
 {
-   ip4_frags_ctl_register();
-   register_pernet_subsys(&ip4_frags_ops);
ip4_frags.hashfn = ip4_hashfn;
ip4_frags.constructor = ip4_frag_init;
ip4_frags.destructor = ip4_frag_free;
@@ -910,4 +908,6 @@ void __init ipfrag_init(void)
ip4_frags.frags_cache_name = ip_frag_cache_name;
	if (inet_frags_init(&ip4_frags))
panic("IP: failed to allocate ip4_frags cache\n");
+   ip4_frags_ctl_register();
+   register_pernet_subsys(&ip4_frags_ops);
 }
-- 
2.17.1



[PATCH stable 4.9 05/29] inet: frags: refactor lowpan_net_frag_init()

2018-10-09 Thread Florian Fainelli
From: Eric Dumazet 

We want to call lowpan_net_frag_init() earlier.
Similar to commit "inet: frags: refactor ipv6_frag_init()"

This is a prereq to "inet: frags: use rhashtables for reassembly units"

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
(cherry picked from commit 807f1844df4ac23594268fa9f41902d0549e92aa)
---
 net/ieee802154/6lowpan/reassembly.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index 9ccb8458b5c3..977b4ed58112 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -614,14 +614,6 @@ int __init lowpan_net_frag_init(void)
 {
int ret;
 
-   ret = lowpan_frags_sysctl_register();
-   if (ret)
-   return ret;
-
-   ret = register_pernet_subsys(&lowpan_frags_ops);
-   if (ret)
-   goto err_pernet;
-
lowpan_frags.hashfn = lowpan_hashfn;
lowpan_frags.constructor = lowpan_frag_init;
lowpan_frags.destructor = NULL;
@@ -631,11 +623,21 @@ int __init lowpan_net_frag_init(void)
lowpan_frags.frags_cache_name = lowpan_frags_cache_name;
	ret = inet_frags_init(&lowpan_frags);
if (ret)
-   goto err_pernet;
+   goto out;
 
+   ret = lowpan_frags_sysctl_register();
+   if (ret)
+   goto err_sysctl;
+
+   ret = register_pernet_subsys(&lowpan_frags_ops);
+   if (ret)
+   goto err_pernet;
+out:
return ret;
 err_pernet:
lowpan_frags_sysctl_unregister();
+err_sysctl:
+   inet_frags_fini(&lowpan_frags);
return ret;
 }
 
-- 
2.17.1



[net-next:master 375/375] drivers/net//ethernet/chelsio/cxgb4/cxgb4_thermal.c:48:11: error: 'struct adapter' has no member named 'ch_thermal'

2018-10-09 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   b187191577629b5358acf4e234809ee8d441ceb4
commit: b187191577629b5358acf4e234809ee8d441ceb4 [375/375] cxgb4: Add thermal zone support
config: parisc-allmodconfig (attached as .config)
compiler: hppa-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout b187191577629b5358acf4e234809ee8d441ceb4
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=parisc 

All errors (new ones prefixed by >>):

   drivers/net//ethernet/chelsio/cxgb4/cxgb4_thermal.c: In function 'cxgb4_thermal_get_trip_type':
>> drivers/net//ethernet/chelsio/cxgb4/cxgb4_thermal.c:48:11: error: 'struct adapter' has no member named 'ch_thermal'
     if (!adap->ch_thermal.trip_temp)
          ^~
   drivers/net//ethernet/chelsio/cxgb4/cxgb4_thermal.c:51:14: error: 'struct adapter' has no member named 'ch_thermal'
     *type = adap->ch_thermal.trip_type;
             ^~
   drivers/net//ethernet/chelsio/cxgb4/cxgb4_thermal.c: In function 'cxgb4_thermal_get_trip_temp':
   drivers/net//ethernet/chelsio/cxgb4/cxgb4_thermal.c:60:11: error: 'struct adapter' has no member named 'ch_thermal'
     if (!adap->ch_thermal.trip_temp)
          ^~
   drivers/net//ethernet/chelsio/cxgb4/cxgb4_thermal.c:63:14: error: 'struct adapter' has no member named 'ch_thermal'
     *temp = adap->ch_thermal.trip_temp;
             ^~
   drivers/net//ethernet/chelsio/cxgb4/cxgb4_thermal.c: In function 'cxgb4_thermal_init':
   drivers/net//ethernet/chelsio/cxgb4/cxgb4_thermal.c:75:39: error: 'struct adapter' has no member named 'ch_thermal'
     struct ch_thermal *ch_thermal = &adap->ch_thermal;
                                      ^~
>> drivers/net//ethernet/chelsio/cxgb4/cxgb4_thermal.c:92:13: error: dereferencing pointer to incomplete type 'struct ch_thermal'
      ch_thermal->trip_temp = val * 1000;
                ^~
   drivers/net//ethernet/chelsio/cxgb4/cxgb4_thermal.c: In function 'cxgb4_thermal_remove':
   drivers/net//ethernet/chelsio/cxgb4/cxgb4_thermal.c:111:10: error: 'struct adapter' has no member named 'ch_thermal'
     if (adap->ch_thermal.tzdev)
         ^~
   drivers/net//ethernet/chelsio/cxgb4/cxgb4_thermal.c:112:38: error: 'struct adapter' has no member named 'ch_thermal'
      thermal_zone_device_unregister(adap->ch_thermal.tzdev);
                                         ^~

vim +48 drivers/net//ethernet/chelsio/cxgb4/cxgb4_thermal.c

42  
43  static int cxgb4_thermal_get_trip_type(struct thermal_zone_device *tzdev,
44 int trip, enum thermal_trip_type *type)
45  {
46  struct adapter *adap = tzdev->devdata;
47  
  > 48  if (!adap->ch_thermal.trip_temp)
49  return -EINVAL;
50  
51  *type = adap->ch_thermal.trip_type;
52  return 0;
53  }
54  
55  static int cxgb4_thermal_get_trip_temp(struct thermal_zone_device *tzdev,
56 int trip, int *temp)
57  {
58  struct adapter *adap = tzdev->devdata;
59  
60  if (!adap->ch_thermal.trip_temp)
61  return -EINVAL;
62  
  > 63  *temp = adap->ch_thermal.trip_temp;
64  return 0;
65  }
66  
67  static struct thermal_zone_device_ops cxgb4_thermal_ops = {
68  .get_temp = cxgb4_thermal_get_temp,
69  .get_trip_type = cxgb4_thermal_get_trip_type,
70  .get_trip_temp = cxgb4_thermal_get_trip_temp,
71  };
72  
73  int cxgb4_thermal_init(struct adapter *adap)
74  {
75  struct ch_thermal *ch_thermal = &adap->ch_thermal;
76  int num_trip = CXGB4_NUM_TRIPS;
77  u32 param, val;
78  int ret;
79  
80  /* on older firmwares we may not get the trip temperature,
81   * set the num of trips to 0.
82   */
83  param = (FW_PARAMS_MNEM_V(FW_PARAMS_MNEM_DEV) |
84   FW_PARAMS_PARAM_X_V(FW_PARAMS_PARAM_DEV_DIAG) |
85   FW_PARAMS_PARAM_Y_V(FW_PARAM_DEV_DIAG_MAXTMPTHRESH));
86  
87  ret = t4_query_params(adap, adap->mbox, adap->pf, 0, 1,
88                            &param, &val);
89  if (ret < 0) {
90  num_trip = 0; /* could not get trip temperature */
91  } else {
  > 92  ch_thermal->trip_temp = val * 1000;
93  ch_thermal->trip_type = THERMAL_TRIP_CRITICAL;
94  }
95  
96  ch_thermal->tzdev = thermal_zone_device_register("cxgb4", num_trip,
97   0, adap,
98 
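
The errors suggest the new ch_thermal member (and its struct) are only
compiled in under the thermal Kconfig option while cxgb4_thermal.c is built
in configs where it is not. One common cure, guarding the member and its
users with the same condition, is sketched below (hypothetical, not the
actual fix; IS_ENABLED()/CONFIG_THERMAL are mocked so this compiles
standalone):

    /* Toy stand-ins for the kernel's Kconfig machinery. */
    #define IS_ENABLED(x) x
    #define CONFIG_THERMAL 1    /* flip to 0 to reproduce the breakage */

    #if IS_ENABLED(CONFIG_THERMAL)
    struct ch_thermal {
        int trip_temp;
        int trip_type;
    };
    #endif

    struct adapter {
        int mbox;
    #if IS_ENABLED(CONFIG_THERMAL)
        struct ch_thermal ch_thermal;   /* exists only with thermal on */
    #endif
    };

    int main(void)
    {
    #if IS_ENABLED(CONFIG_THERMAL)
        struct adapter a = { .ch_thermal = { .trip_temp = 85000 } };

        return a.ch_thermal.trip_temp ? 0 : 1;
    #else
        return 0;
    #endif
    }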

[PATCH net-next] net: sched: avoid writing on noop_qdisc

2018-10-09 Thread Eric Dumazet
While noop_qdisc.gso_skb and noop_qdisc.skb_bad_txq are not used
in other places, it seems not correct to overwrite their fields
in dev_init_scheduler_queue().

noop_qdisc is essentially a shared and read-only object, even if
it is not marked as const because of some implementation detail.

Signed-off-by: Eric Dumazet 
---
 net/sched/sch_generic.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 3023929852e8c4aaa4172861d2d0beff17e25f27..de1663f7d3ad6e2c06cd2e031036cef5979366a5 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -572,6 +572,18 @@ struct Qdisc noop_qdisc = {
	.dev_queue  =   &noop_netdev_queue,
.running=   SEQCNT_ZERO(noop_qdisc.running),
.busylock   =   __SPIN_LOCK_UNLOCKED(noop_qdisc.busylock),
+   .gso_skb = {
+   .next = (struct sk_buff *)&noop_qdisc.gso_skb,
+   .prev = (struct sk_buff *)&noop_qdisc.gso_skb,
+   .qlen = 0,
+   .lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.gso_skb.lock),
+   },
+   .skb_bad_txq = {
+   .next = (struct sk_buff *)&noop_qdisc.skb_bad_txq,
+   .prev = (struct sk_buff *)&noop_qdisc.skb_bad_txq,
+   .qlen = 0,
+   .lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.skb_bad_txq.lock),
+   },
 };
 EXPORT_SYMBOL(noop_qdisc);
 
@@ -1273,8 +1285,6 @@ static void dev_init_scheduler_queue(struct net_device *dev,
 
rcu_assign_pointer(dev_queue->qdisc, qdisc);
dev_queue->qdisc_sleeping = qdisc;
-   __skb_queue_head_init(&dev_queue->gso_skb);
-   __skb_queue_head_init(&dev_queue->skb_bad_txq);
 }
 
 void dev_init_scheduler(struct net_device *dev)
-- 
2.19.0.605.g01d371f741-goog
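
The self-referential next/prev initializers are the static image of an empty
circular doubly-linked list: an empty queue head points at itself, which is
exactly what __skb_queue_head_init() would have produced at run time. A
standalone illustration with toy types (not kernel code):

    #include <stdio.h>

    struct list_head {
        struct list_head *next, *prev;
    };

    /* Static equivalent of initializing an empty list at run time:
     * the head points at itself, so emptiness checks just work. */
    static struct list_head q = { .next = &q, .prev = &q };

    static int list_empty(const struct list_head *h)
    {
        return h->next == h;
    }

    int main(void)
    {
        printf("empty: %d\n", list_empty(&q));  /* prints 1 */
        return 0;
    }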



Re: [PATCH bpf-next 0/6] Consistent prefixes for libbpf interfaces

2018-10-09 Thread Daniel Borkmann
On 10/09/2018 08:43 AM, Yonghong Song wrote:
> On 10/4/18 7:22 AM, Daniel Borkmann wrote:
>> [ +Yonghong ]
>>
>> On 10/04/2018 12:26 AM, Andrey Ignatov wrote:
>>> This patch set renames a few interfaces in libbpf, mostly netlink related,
>>> so that all symbols provided by the library have only three possible
>>> prefixes:
>>>
>>> % nm -D tools/lib/bpf/libbpf.so  | \
>>>  awk '$2 == "T" {sub(/[_\(].*/, "", $3); if ($3) print $3}' | \
>>>  sort | \
>>>  uniq -c
>>>   91 bpf
>>>8 btf
>>>   14 libbpf
>>>
>>> libbpf is used more and more outside kernel tree. That means the library
>>> should follow good practices in library design and implementation to
>>> play well with third party code that uses it.
>>>
>>> One of such practices is to have a common prefix (or a few) for every
>>> interface, function or data structure, library provides. It helps to
>>> avoid name conflicts with other libraries and keeps API/ABI consistent.
>>>
>>> Inconsistent names in libbpf already cause problems in real life. E.g.
>>> an application can't use both libbpf and libnl due to conflicting
>>> symbols (specifically nla_parse, nla_parse_nested and a few others).
>>>
>>> Some of problematic global symbols are not part of ABI and can be
>>> restricted from export with either visibility attribute/pragma or export
>>> map (what is useful by itself and can be done in addition). That won't
>>> solve the problem for those that are part of ABI though. Also export
>>> restrictions would help only in DSO case. If third party application links
>>> libbpf statically it won't help, and people do it (e.g. Facebook links
>>> most of libraries statically, including libbpf).
>>>
>>> libbpf already uses the following prefixes for its interfaces:
>>> * bpf_ for bpf system call wrappers, program/map/elf-object
>>>abstractions and a few other things;
>>> * btf_ for BTF related API;
>>> * libbpf_ for everything else.
>>>
>>> The patch adds libbpf_ prefix to interfaces that use none of mentioned
>>> above prefixes and don't fit well into the first two categories.
>>>
>>> Long term benefits of having common prefix should outweigh possible
>>> inconvenience of changing API for those functions now.
>>>
>>> Patches 2-4 add libbpf_ prefix to libbpf interfaces: separate patch per
>>> header. Other patches are simple improvements in API.
>>>
>>>
>>> Andrey Ignatov (6):
>>>libbpf: Move __dump_nlmsg_t from API to implementation
>>>libbpf: Consistent prefixes for interfaces in libbpf.h.
>>>libbpf: Consistent prefixes for interfaces in nlattr.h.
>>>libbpf: Consistent prefixes for interfaces in str_error.h.
>>>libbpf: Make include guards consistent
>>>libbpf: Use __u32 instead of u32 in bpf_program__load
>>>
>>>   tools/bpf/bpftool/net.c| 41 ++-
>>>   tools/bpf/bpftool/netlink_dumper.c | 32 ---
>>>   tools/lib/bpf/bpf.h|  6 +--
>>>   tools/lib/bpf/btf.h|  6 +--
>>>   tools/lib/bpf/libbpf.c | 22 +-
>>>   tools/lib/bpf/libbpf.h | 31 +++---
>>>   tools/lib/bpf/netlink.c| 48 --
>>>   tools/lib/bpf/nlattr.c | 64 +++--
>>>   tools/lib/bpf/nlattr.h | 65 +++---
>>>   tools/lib/bpf/str_error.c  |  2 +-
>>>   tools/lib/bpf/str_error.h  |  8 ++--
>>>   11 files changed, 171 insertions(+), 154 deletions(-)
>>
>> Overall agree that this is needed, and I've therefore applied the
>> set, thanks for cleaning up, Andrey!
>>
>> But, I would actually like to see this going one step further, in
>> particular, we should keep all the netlink related stuff inside
>> libbpf-/only/. Meaning, the goal of libbpf is not to provide yet
>> another set of netlink primitives but instead e.g. for the bpftool
>> dumper this should be abstracted away such that we pass in a callback
>> from bpftool side and have an iterator object which will then be
>> populated from inside the libbpf logic, meaning, bpftool shouldn't
>> even be aware of anything netlink there.
> 
> Agreed. This indeed makes sense; the user really only cares about a few fields
> like devname/id, attachment_type, prog_id, etc. I will take a look at
> this later if nobody works on it.

Awesome, that would be great, thanks!
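
For the export-restriction idea mentioned in the thread, the usual
compile-time tool is symbol visibility (which, as noted, only helps the DSO
case). A minimal sketch (illustrative, not libbpf's actual macros):

    /* lib.c - build with: gcc -shared -fPIC -fvisibility=hidden lib.c -o lib.so
     * Only symbols explicitly marked default are exported; internal
     * helpers stay out of the dynamic symbol table. */
    #define LIB_API __attribute__((visibility("default")))

    static int internal_helper(int x)       /* never exported */
    {
        return x * 2;
    }

    LIB_API int libfoo_do_thing(int x)      /* part of the ABI */
    {
        return internal_helper(x) + 1;
    }

Checking with "nm -D lib.so" should then show only libfoo_do_thing.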


Re: [PATCH net v2] net/sched: cls_api: add missing validation of netlink attributes

2018-10-09 Thread Cong Wang
On Tue, Oct 9, 2018 at 2:26 PM Davide Caratti  wrote:
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1322,6 +1322,7 @@ const struct nla_policy rtm_tca_policy[TCA_MAX + 1] = {
> [TCA_INGRESS_BLOCK] = { .type = NLA_U32 },
> [TCA_EGRESS_BLOCK]  = { .type = NLA_U32 },
>  };
> +EXPORT_SYMBOL(rtm_tca_policy);

cls_api.c is not a module, so you don't need to export it.


[PATCH net-next] net/ipv6: Add knob to skip DELROUTE message on device down

2018-10-09 Thread David Ahern
From: David Ahern 

Another difference between IPv4 and IPv6 is the generation of RTM_DELROUTE
notifications when a device is taken down (admin down) or deleted. IPv4
does not generate a message for routes evicted by the down or delete;
IPv6 does. A NOS at scale really needs to avoid these messages and have
IPv4 and IPv6 behave similarly, relying on userspace to handle link
notifications and evict the routes.

At this point existing user behavior needs to be preserved. Since
notifications are a global action (not per app) the only way to preserve
existing behavior and allow the messages to be skipped is to add a new
sysctl (net/ipv6/route/skip_notify_on_dev_down) which can be set to
disable the notifications.

IPv6 route code already supports the option to skip the message (it is
used for multipath routes for example). Besides the new sysctl we need
to pass the skip_notify setting through the generic fib6_clean and
fib6_walk functions to fib6_clean_node and to set skip_notify on calls
to __ip_del_rt for the addrconf_ifdown path.

Signed-off-by: David Ahern 
---
 Documentation/networking/ip-sysctl.txt |  8 +++
 include/net/addrconf.h |  3 ++-
 include/net/ip6_fib.h  |  3 +++
 include/net/ip6_route.h|  1 +
 include/net/netns/ipv6.h   |  1 +
 net/ipv6/addrconf.c| 44 ++
 net/ipv6/anycast.c | 10 +---
 net/ipv6/ip6_fib.c | 20 
 net/ipv6/route.c   | 30 ++-
 9 files changed, 95 insertions(+), 25 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 960de8fe3f40..163b5ff1073c 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1442,6 +1442,14 @@ max_hbh_length - INTEGER
header.
Default: INT_MAX (unlimited)
 
+skip_notify_on_dev_down - BOOLEAN
+   Controls whether an RTM_DELROUTE message is generated for routes
+   removed when a device is taken down or deleted. IPv4 does not
+   generate this message; IPv6 does by default. Setting this sysctl
+   to true skips the message, making IPv4 and IPv6 on par in relying
+   on userspace caches to track link events and evict routes.
+   Default: false (generate message)
+
 IPv6 Fragmentation:
 
 ip6frag_high_thresh - INTEGER
diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 6def0351bcc3..ee6292f64c86 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -306,7 +306,8 @@ int ipv6_sock_ac_drop(struct sock *sk, int ifindex,
 void ipv6_sock_ac_close(struct sock *sk);
 
 int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr);
-int __ipv6_dev_ac_dec(struct inet6_dev *idev, const struct in6_addr *addr);
+int __ipv6_dev_ac_dec(struct inet6_dev *idev, const struct in6_addr *addr,
+ bool skip_notify);
 void ipv6_ac_destroy_dev(struct inet6_dev *idev);
 bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 const struct in6_addr *addr);
diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index f06e968f1992..caabfd84a098 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -407,6 +407,9 @@ struct fib6_node *fib6_locate(struct fib6_node *root,
 
 void fib6_clean_all(struct net *net, int (*func)(struct fib6_info *, void *arg),
void *arg);
+void fib6_clean_all_skip_notify(struct net *net,
+   int (*func)(struct fib6_info *, void *arg),
+   void *arg);
 
 int fib6_add(struct fib6_node *root, struct fib6_info *rt,
 struct nl_info *info, struct netlink_ext_ack *extack);
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index cef186dbd2ce..7c140cb2eeb0 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -104,6 +104,7 @@ int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags,
  struct netlink_ext_ack *extack);
 int ip6_ins_rt(struct net *net, struct fib6_info *f6i);
 int ip6_del_rt(struct net *net, struct fib6_info *f6i);
+int ip6_del_rt_skip_notify(struct net *net, struct fib6_info *f6i);
 
 void rt6_flush_exceptions(struct fib6_info *f6i);
 void rt6_age_exceptions(struct fib6_info *f6i, struct fib6_gc_args *gc_args,
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index f0e396ab9bec..ef1ed529f33c 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -45,6 +45,7 @@ struct netns_sysctl_ipv6 {
int max_dst_opts_len;
int max_hbh_opts_len;
int seg6_flowlabel;
+   bool skip_notify_on_dev_down;
 };
 
 struct netns_ipv6 {
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 2496b12bf721..cf591cf66884 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -164,7 +164,7 @@ static struct workqueue_struct 
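
Usage-wise the knob is an ordinary boolean sysctl; a tiny sketch of flipping
it from userspace (path taken from the patch, error handling minimal):

    #include <stdio.h>

    int main(void)
    {
        const char *path =
            "/proc/sys/net/ipv6/route/skip_notify_on_dev_down";
        FILE *f = fopen(path, "w");

        if (!f) {
            perror(path);   /* kernel without this patch, or no perms */
            return 1;
        }
        fputs("1\n", f);    /* suppress RTM_DELROUTE on device down */
        fclose(f);
        return 0;
    }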

[PATCH net v2] net/sched: cls_api: add missing validation of netlink attributes

2018-10-09 Thread Davide Caratti
Similarly to what has been done in 8b4c3cdd9dd8 ("net: sched: Add policy
validation for tc attributes"), fix classifier code to add validation of
TCA_CHAIN and TCA_KIND netlink attributes.

tested with:
 # ./tdc.py -c filter

v2: let sch_api and cls_api share nla_policy they have in common, thanks
to David Ahern

Fixes: 5bc1701881e39 ("net: sched: introduce multichain support for filters")
Signed-off-by: Davide Caratti 
---
 include/net/sch_generic.h |  2 ++
 net/sched/cls_api.c   | 11 ++-
 net/sched/sch_api.c   |  1 +
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index a6d00093f35e..8f4335c0a6c8 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -487,6 +487,8 @@ extern struct Qdisc_ops pfifo_fast_ops;
 extern struct Qdisc_ops mq_qdisc_ops;
 extern struct Qdisc_ops noqueue_qdisc_ops;
 extern const struct Qdisc_ops *default_qdisc_ops;
+extern const struct nla_policy rtm_tca_policy[];
+
 static inline const struct Qdisc_ops *
 get_default_qdisc_ops(const struct net_device *dev, int ntx)
 {
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 0a75cb2e5e7b..5f72190bf36c 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1211,7 +1211,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 replay:
tp_created = 0;
 
-   err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL, extack);
+   err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, rtm_tca_policy, extack);
if (err < 0)
return err;
 
@@ -1360,7 +1360,7 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
-   err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL, extack);
+   err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, rtm_tca_policy, extack);
if (err < 0)
return err;
 
@@ -1475,7 +1475,7 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
void *fh = NULL;
int err;
 
-   err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL, extack);
+   err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, rtm_tca_policy, extack);
if (err < 0)
return err;
 
@@ -1838,7 +1838,7 @@ static int tc_ctl_chain(struct sk_buff *skb, struct nlmsghdr *n,
return -EPERM;
 
 replay:
-   err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL, extack);
+   err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, rtm_tca_policy, extack);
if (err < 0)
return err;
 
@@ -1949,7 +1949,8 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
if (nlmsg_len(cb->nlh) < sizeof(*tcm))
return skb->len;
 
-   err = nlmsg_parse(cb->nlh, sizeof(*tcm), tca, TCA_MAX, NULL, NULL);
+   err = nlmsg_parse(cb->nlh, sizeof(*tcm), tca, TCA_MAX, rtm_tca_policy,
+ NULL);
if (err)
return err;
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 85e73f48e48f..99eae8007567 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1322,6 +1322,7 @@ const struct nla_policy rtm_tca_policy[TCA_MAX + 1] = {
[TCA_INGRESS_BLOCK] = { .type = NLA_U32 },
[TCA_EGRESS_BLOCK]  = { .type = NLA_U32 },
 };
+EXPORT_SYMBOL(rtm_tca_policy);
 
 static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
struct netlink_ext_ack *extack)
-- 
2.17.1



Re: [PATCH V1 net-next 00/12] Improving performance and reducing latencies, by using latest capabilities exposed in ENA device

2018-10-09 Thread Bshara, Saeed

Jesper,
currently the driver allocates a page per rx buffer, but we are considering
supporting a mode where the page is split into 2 buffers in order to overcome
memory fragmentation issues on low-memory systems. But this won't work with
XDP, right? What's your advice?



From: Bshara, Nafea
Sent: Tuesday, October 9, 2018 10:33 PM
To: Jesper Dangaard Brouer; Kiyanovski, Arthur
Cc: da...@davemloft.net; netdev@vger.kernel.org; Woodhouse, David; Machulsky, 
Zorik; Matushevsky, Alexander; Bshara, Saeed; Wilson, Matt; Liguori, Anthony; 
Tzalik, Guy; Belgazal, Netanel; Saidi, Ali; Björn Töpel
Subject: Re: [PATCH V1 net-next 00/12] Improving performance and reducing 
latencies, by using latest capabilities exposed in ENA device
    
It is high priority for us right after this major release get merged.

On 10/9/18, 12:31 PM, "Jesper Dangaard Brouer"  wrote:

    
    On Tue, 9 Oct 2018 21:44:57 +0300  wrote:
    
    > From: Arthur Kiyanovski 
    > 
    > This patchset introduces the following:
    > 1. A new placement policy of Tx headers and descriptors, which takes
    > advantage of an option to place headers + descriptors in device memory
    > space. This is sometimes referred to as LLQ - low latency queue.
    > The patch set defines the admin capability, maps the device memory as
    > write-combined, and adds a mode in transmit datapath to do header +
    > descriptor placement on the device.
    > 2. Support for RX checksum offloading
    > 3. Miscelaneous small improvements and code cleanups
    
    What are your plans for XDP?
    
    You are unsure ask your-colleague David Woodhouse, who I've discussed
    this with when he attended my talk at Kernel-Recipes[1], slide[2].
    
    [1]  
https://kernel-recipes.org/en/2018/talks/xdp-a-new-programmable-network-layer/
    [2]  
http://people.netfilter.org/hawk/presentations/KernelRecipes2018/XDP_Kernel_Recipes_2018.pdf
    -- 
    Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
    



Re: [PATCH V1 net-next 00/12] Improving performance and reducing latencies, by using latest capabilities exposed in ENA device

2018-10-09 Thread Machulsky, Zorik
Acked-by: Zorik Machulsky 

On 10/9/18, 11:45 AM, "akiy...@amazon.com"  wrote:

From: Arthur Kiyanovski 

This patchset introduces the following:
1. A new placement policy of Tx headers and descriptors, which takes
advantage of an option to place headers + descriptors in device memory
space. This is sometimes referred to as LLQ - low latency queue.
The patch set defines the admin capability, maps the device memory as
write-combined, and adds a mode in transmit datapath to do header +
descriptor placement on the device.
2. Support for RX checksum offloading
3. Miscellaneous small improvements and code cleanups

Arthur Kiyanovski (12):
  net: ena: minor performance improvement
  net: ena: complete host info to match latest ENA spec
  net: ena: introduce Low Latency Queues data structures according to
ENA spec
  net: ena: add functions for handling Low Latency Queues in ena_com
  net: ena: add functions for handling Low Latency Queues in ena_netdev
  net: ena: use CSUM_CHECKED device indication to report skb's checksum
status
  net: ena: explicit casting and initialization, and clearer error
handling
  net: ena: limit refill Rx threshold to 256 to avoid latency issues
  net: ena: change rx copybreak default to reduce kernel memory pressure
  net: ena: remove redundant parameter in ena_com_admin_init()
  net: ena: update driver version to 2.0.1
  net: ena: fix indentations in ena_defs for better readability

 drivers/net/ethernet/amazon/ena/ena_admin_defs.h  | 425 
-
 drivers/net/ethernet/amazon/ena/ena_com.c | 302 +--
 drivers/net/ethernet/amazon/ena/ena_com.h |  71 +++-
 drivers/net/ethernet/amazon/ena/ena_common_defs.h |   4 +-
 drivers/net/ethernet/amazon/ena/ena_eth_com.c | 277 +-
 drivers/net/ethernet/amazon/ena/ena_eth_com.h |  72 +++-
 drivers/net/ethernet/amazon/ena/ena_eth_io_defs.h | 229 ++-
 drivers/net/ethernet/amazon/ena/ena_ethtool.c |   2 +
 drivers/net/ethernet/amazon/ena/ena_netdev.c  | 446 
++
 drivers/net/ethernet/amazon/ena/ena_netdev.h  |  42 +-
 drivers/net/ethernet/amazon/ena/ena_regs_defs.h   | 206 +-
 11 files changed, 1334 insertions(+), 742 deletions(-)

-- 
2.7.4





Re: [PATCH net-next v2] net: core: change bool members of struct net_device to bitfield members

2018-10-09 Thread Eric Dumazet



On 10/09/2018 01:24 PM, Heiner Kallweit wrote:

> Reordering the struct members to fill the holes could be a little tricky
> and could have side effects because it may make a performance difference
> whether certain members are in one cacheline or not.
> And whether it's worth to spend this effort (incl. the related risks)
> just to save a few bytes (also considering that typically we have quite
> few instances of struct net_device)?

Not really.

In fact we probably should spend time reordering fields for performance,
since some new fields were added a bit randomly, breaking the goal of data 
locality.

Some fields are used in the control path only and could be moved out of the
cache lines needed in the data path (fast path).



Re: [PATCH net-next v2] net: core: change bool members of struct net_device to bitfield members

2018-10-09 Thread David Ahern
On 10/9/18 2:24 PM, Heiner Kallweit wrote:
> Reordering the struct members to fill the holes could be a little tricky
> and could have side effects because it may make a performance difference
> whether certain members are in one cacheline or not.
> And whether it's worth to spend this effort (incl. the related risks)
> just to save a few bytes (also considering that typically we have quite
> few instances of struct net_device)?
> 

It would be good to get net_device below 2048 without affecting
performance. Anything else is just moving elements around for the same
end allocation (rounds up to 4096).


Re: [PATCH net-next v2] net: core: change bool members of struct net_device to bitfield members

2018-10-09 Thread Heiner Kallweit
On 09.10.2018 17:20, David Ahern wrote:
> On 10/8/18 2:17 PM, Heiner Kallweit wrote:
>> bool is good as parameter type or function return type, but if used
>> for struct members it consumes more memory than needed.
>> Changing the bool members of struct net_device to bitfield members
>> allows to decrease the memory footprint of this struct.
> 
> What does pahole show for the size of the struct before and after? I
> suspect you have not really changed the size and certainly not the
> actual memory allocated.
> 
> 
Thanks for the hint to use pahole. Indeed we gain nothing,
so there's no justification for this patch.

before:
/* size: 2496, cachelines: 39, members: 116 */
/* sum members: 2396, holes: 8, sum holes: 80 */
/* padding: 20 */
/* paddings: 4, sum paddings: 19 */
/* bit_padding: 31 bits */

after:  
/* size: 2496, cachelines: 39, members: 116 */
/* sum members: 2394, holes: 8, sum holes: 82 */
/* bit holes: 1, sum bit holes: 8 bits */
/* padding: 20 */
/* paddings: 4, sum paddings: 19 */
/* bit_padding: 27 bits */

The biggest hole is here, because _tx is annotated to be cacheline-aligned.

struct hlist_node  index_hlist;  /*   888    16 */

/* XXX 56 bytes hole, try to pack */

/* --- cacheline 15 boundary (960 bytes) --- */
struct netdev_queue *  _tx;  /*   960 8 */

Reordering the struct members to fill the holes could be a little tricky
and could have side effects because it may make a performance difference
whether certain members are in one cacheline or not.
And whether it's worth to spend this effort (incl. the related risks)
just to save a few bytes (also considering that typically we have quite
few instances of struct net_device)?
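
The conclusion generalizes: turning bool members into bitfields cannot
shrink a struct whose size is dominated by alignment padding from members
like the cacheline-aligned _tx. A standalone demonstration (illustrative;
prints the same size for both layouts on a typical LP64 target):

    #include <stdbool.h>
    #include <stdio.h>

    struct with_bools {
        bool a, b, c;
        /* padding up to the aligned member follows ... */
        void *p __attribute__((aligned(64)));   /* like _tx */
    };

    struct with_bits {
        unsigned int a:1, b:1, c:1;
        /* ... the hole just moves; the aligned member still wins */
        void *p __attribute__((aligned(64)));
    };

    int main(void)
    {
        printf("bools: %zu, bitfields: %zu\n",
               sizeof(struct with_bools), sizeof(struct with_bits));
        return 0;
    }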


Re: [PATCH] qtnfmac: avoid uninitialized variable access

2018-10-09 Thread Sergey Matyukevich
Hello Arnd,

> When qtnf_trans_send_cmd_with_resp() fails, we have not yet initialized
> 'resp', as pointed out by a valid gcc warning:
> 
> drivers/net/wireless/quantenna/qtnfmac/commands.c: In function 
> 'qtnf_cmd_send_with_reply':
> drivers/net/wireless/quantenna/qtnfmac/commands.c:133:54: error: 'resp' may 
> be used uninitialized in this function [-Werror=maybe-uninitialized]
> 
> Since 'resp_skb' is also not set here, we can skip all further
> processing and just print the warning and return the failure code.
> 
> Fixes: c6ed298ffe09 ("qtnfmac: cleanup and unify command error handling")
> Signed-off-by: Arnd Bergmann 

Thanks for the patch! And for reminding me that I forgot to enable
gcc warnings in CI builds in addition to sparse checks.

Reviewed-by: Sergey Matyukevich 

Regards,
Sergey


Re: [sky2 driver] 88E8056 PCI-E Gigabit Ethernet Controller not working after suspend

2018-10-09 Thread Stephen Hemminger
On Tue, 9 Oct 2018 19:30:30 +0200
Laurent Bigonville  wrote:

> Hello,
> 
> On my desktop (Asus MB with dual Ethernet port), when waking up after 
> suspend, the network card is not detecting the link.
> 
> I have to rmmod the sky2 driver and then modprobing it again.
> 
> lspci shows me:
> 
> 04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E 
> Gigabit Ethernet Controller (rev 12)
> 05:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E 
> Gigabit Ethernet Controller (rev 12)
> 
> An idea what's wrong here?
> 
> Kind regards,
> 
> Laurent Bigonville
> 

I used to have that motherboard (about 8 years ago). Long dead by now.

There was some issue with how the power management worked. I forgot the
workaround; you might have to dig in the mailing list archive.



Re: [PATCH V1 net-next 00/12] Improving performance and reducing latencies, by using latest capabilities exposed in ENA device

2018-10-09 Thread Bshara, Nafea
It is high priority for us right after this major release get merged.

On 10/9/18, 12:31 PM, "Jesper Dangaard Brouer"  wrote:


On Tue, 9 Oct 2018 21:44:57 +0300  wrote:

> From: Arthur Kiyanovski 
> 
> This patchset introduces the following:
> 1. A new placement policy of Tx headers and descriptors, which takes
> advantage of an option to place headers + descriptors in device memory
> space. This is sometimes referred to as LLQ - low latency queue.
> The patch set defines the admin capability, maps the device memory as
> write-combined, and adds a mode in transmit datapath to do header +
> descriptor placement on the device.
> 2. Support for RX checksum offloading
> 3. Miscellaneous small improvements and code cleanups

What are your plans for XDP?

If you are unsure, ask your colleague David Woodhouse, with whom I discussed
this when he attended my talk at Kernel Recipes[1], slide[2].

[1] 
https://kernel-recipes.org/en/2018/talks/xdp-a-new-programmable-network-layer/
[2] 
http://people.netfilter.org/hawk/presentations/KernelRecipes2018/XDP_Kernel_Recipes_2018.pdf
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer




Re: [PATCH V1 net-next 00/12] Improving performance and reducing latencies, by using latest capabilities exposed in ENA device

2018-10-09 Thread Jesper Dangaard Brouer


On Tue, 9 Oct 2018 21:44:57 +0300  wrote:

> From: Arthur Kiyanovski 
> 
> This patchset introduces the following:
> 1. A new placement policy of Tx headers and descriptors, which takes
> advantage of an option to place headers + descriptors in device memory
> space. This is sometimes referred to as LLQ - low latency queue.
> The patch set defines the admin capability, maps the device memory as
> write-combined, and adds a mode in transmit datapath to do header +
> descriptor placement on the device.
> 2. Support for RX checksum offloading
> 3. Miscellaneous small improvements and code cleanups

What are your plans for XDP?

If you are unsure, ask your colleague David Woodhouse, with whom I discussed
this when he attended my talk at Kernel Recipes[1], slide[2].

[1] 
https://kernel-recipes.org/en/2018/talks/xdp-a-new-programmable-network-layer/
[2] 
http://people.netfilter.org/hawk/presentations/KernelRecipes2018/XDP_Kernel_Recipes_2018.pdf
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



Re: [bpf-next V2 PATCH 0/3] bpf/xdp: fix generic-XDP and demonstrate VLAN manipulation

2018-10-09 Thread Song Liu
For the series:

Acked-by: Song Liu 


On Tue, Oct 9, 2018 at 3:04 AM Jesper Dangaard Brouer  wrote:
>
> While implementing PoC building blocks for eBPF code XDP+TC that can
> manipulate VLANs headers, I discovered a bug in generic-XDP.
>
> The fix should be backported to stable kernels.  Even though
> generic-XDP was introduced in v4.12, I think the bug is not exposed
> until v4.14 in the mentioned fixes commit.
>
> ---
>
> Jesper Dangaard Brouer (3):
>   net: fix generic XDP to handle if eth header was mangled
>   bpf: make TC vlan bpf_helpers avail to selftests
>   selftests/bpf: add XDP selftests for modifying and popping VLAN headers
>
>
>  net/core/dev.c   |   14 +
>  tools/testing/selftests/bpf/Makefile |6 -
>  tools/testing/selftests/bpf/bpf_helpers.h|4
>  tools/testing/selftests/bpf/test_xdp_vlan.c  |  292 
> ++
>  tools/testing/selftests/bpf/test_xdp_vlan.sh |  195 +
>  5 files changed, 509 insertions(+), 2 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/test_xdp_vlan.c
>  create mode 100755 tools/testing/selftests/bpf/test_xdp_vlan.sh
>
> --
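
For readers who haven't opened the selftest, the VLAN-pop building block
looks roughly like the sketch below. This is not the test's actual code,
just a minimal illustration assuming the selftests' bpf_helpers.h and
bpf_endian.h: it saves the MAC header, chops the 4-byte tag off the
front, then rewrites the header with the inner ethertype.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include "bpf_helpers.h"
#include "bpf_endian.h"

struct _vlan_hdr {
	__be16 h_vlan_TCI;
	__be16 h_vlan_encapsulated_proto;
};

SEC("xdp")
int xdp_vlan_pop(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct _vlan_hdr *vh = data + sizeof(*eth);
	struct ethhdr eth_copy;

	if ((void *)(vh + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_8021Q))
		return XDP_PASS;

	/* save MAC header with the inner ethertype patched in */
	eth_copy = *eth;
	eth_copy.h_proto = vh->h_vlan_encapsulated_proto;

	/* shrink packet by one VLAN header at the front */
	if (bpf_xdp_adjust_head(ctx, (int)sizeof(*vh)))
		return XDP_DROP;

	/* pointers are invalidated by adjust_head: reload and re-check */
	data = (void *)(long)ctx->data;
	data_end = (void *)(long)ctx->data_end;
	if (data + sizeof(eth_copy) > data_end)
		return XDP_DROP;
	__builtin_memcpy(data, &eth_copy, sizeof(eth_copy));

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";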


[PATCH V1 net-next 12/12] net: ena: fix indentations in ena_defs for better readability

2018-10-09 Thread akiyano
From: Arthur Kiyanovski 

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_admin_defs.h  | 334 +-
 drivers/net/ethernet/amazon/ena/ena_eth_io_defs.h | 223 +++
 drivers/net/ethernet/amazon/ena/ena_regs_defs.h   | 206 +++--
 3 files changed, 338 insertions(+), 425 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h 
b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
index b439ec1..9f80b73 100644
--- a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
+++ b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
@@ -32,119 +32,81 @@
 #ifndef _ENA_ADMIN_H_
 #define _ENA_ADMIN_H_
 
-enum ena_admin_aq_opcode {
-   ENA_ADMIN_CREATE_SQ = 1,
-
-   ENA_ADMIN_DESTROY_SQ= 2,
-
-   ENA_ADMIN_CREATE_CQ = 3,
-
-   ENA_ADMIN_DESTROY_CQ= 4,
-
-   ENA_ADMIN_GET_FEATURE   = 8,
 
-   ENA_ADMIN_SET_FEATURE   = 9,
-
-   ENA_ADMIN_GET_STATS = 11,
+enum ena_admin_aq_opcode {
+   ENA_ADMIN_CREATE_SQ = 1,
+   ENA_ADMIN_DESTROY_SQ= 2,
+   ENA_ADMIN_CREATE_CQ = 3,
+   ENA_ADMIN_DESTROY_CQ= 4,
+   ENA_ADMIN_GET_FEATURE   = 8,
+   ENA_ADMIN_SET_FEATURE   = 9,
+   ENA_ADMIN_GET_STATS = 11,
 };
 
 enum ena_admin_aq_completion_status {
-   ENA_ADMIN_SUCCESS   = 0,
-
-   ENA_ADMIN_RESOURCE_ALLOCATION_FAILURE   = 1,
-
-   ENA_ADMIN_BAD_OPCODE= 2,
-
-   ENA_ADMIN_UNSUPPORTED_OPCODE= 3,
-
-   ENA_ADMIN_MALFORMED_REQUEST = 4,
-
+   ENA_ADMIN_SUCCESS   = 0,
+   ENA_ADMIN_RESOURCE_ALLOCATION_FAILURE   = 1,
+   ENA_ADMIN_BAD_OPCODE= 2,
+   ENA_ADMIN_UNSUPPORTED_OPCODE= 3,
+   ENA_ADMIN_MALFORMED_REQUEST = 4,
/* Additional status is provided in ACQ entry extended_status */
-   ENA_ADMIN_ILLEGAL_PARAMETER = 5,
-
-   ENA_ADMIN_UNKNOWN_ERROR = 6,
-
-   ENA_ADMIN_RESOURCE_BUSY = 7,
+   ENA_ADMIN_ILLEGAL_PARAMETER = 5,
+   ENA_ADMIN_UNKNOWN_ERROR = 6,
+   ENA_ADMIN_RESOURCE_BUSY = 7,
 };
 
 enum ena_admin_aq_feature_id {
-   ENA_ADMIN_DEVICE_ATTRIBUTES = 1,
-
-   ENA_ADMIN_MAX_QUEUES_NUM= 2,
-
-   ENA_ADMIN_HW_HINTS  = 3,
-
-   ENA_ADMIN_LLQ   = 4,
-
-   ENA_ADMIN_RSS_HASH_FUNCTION = 10,
-
-   ENA_ADMIN_STATELESS_OFFLOAD_CONFIG  = 11,
-
-   ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG  = 12,
-
-   ENA_ADMIN_MTU   = 14,
-
-   ENA_ADMIN_RSS_HASH_INPUT= 18,
-
-   ENA_ADMIN_INTERRUPT_MODERATION  = 20,
-
-   ENA_ADMIN_AENQ_CONFIG   = 26,
-
-   ENA_ADMIN_LINK_CONFIG   = 27,
-
-   ENA_ADMIN_HOST_ATTR_CONFIG  = 28,
-
-   ENA_ADMIN_FEATURES_OPCODE_NUM   = 32,
+   ENA_ADMIN_DEVICE_ATTRIBUTES = 1,
+   ENA_ADMIN_MAX_QUEUES_NUM= 2,
+   ENA_ADMIN_HW_HINTS  = 3,
+   ENA_ADMIN_LLQ   = 4,
+   ENA_ADMIN_RSS_HASH_FUNCTION = 10,
+   ENA_ADMIN_STATELESS_OFFLOAD_CONFIG  = 11,
+   ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG  = 12,
+   ENA_ADMIN_MTU   = 14,
+   ENA_ADMIN_RSS_HASH_INPUT= 18,
+   ENA_ADMIN_INTERRUPT_MODERATION  = 20,
+   ENA_ADMIN_AENQ_CONFIG   = 26,
+   ENA_ADMIN_LINK_CONFIG   = 27,
+   ENA_ADMIN_HOST_ATTR_CONFIG  = 28,
+   ENA_ADMIN_FEATURES_OPCODE_NUM   = 32,
 };
 
 enum ena_admin_placement_policy_type {
/* descriptors and headers are in host memory */
-   ENA_ADMIN_PLACEMENT_POLICY_HOST = 1,
-
+   ENA_ADMIN_PLACEMENT_POLICY_HOST = 1,
/* descriptors and headers are in device memory (a.k.a Low Latency
 * Queue)
 */
-   ENA_ADMIN_PLACEMENT_POLICY_DEV  = 3,
+   ENA_ADMIN_PLACEMENT_POLICY_DEV  = 3,
 };
 
 enum ena_admin_link_types {
-   ENA_ADMIN_LINK_SPEED_1G = 0x1,
-
-   ENA_ADMIN_LINK_SPEED_2_HALF_G   = 0x2,
-
-   ENA_ADMIN_LINK_SPEED_5G = 0x4,
-
-   ENA_ADMIN_LINK_SPEED_10G= 0x8,
-
-   ENA_ADMIN_LINK_SPEED_25G= 0x10,
-
-   ENA_ADMIN_LINK_SPEED_40G= 0x20,
-
-   ENA_ADMIN_LINK_SPEED_50G= 0x40,
-
-   ENA_ADMIN_LINK_SPEED_100G   = 0x80,
-
-   ENA_ADMIN_LINK_SPEED_200G   = 0x100,
-
-   ENA_ADMIN_LINK_SPEED_400G   = 0x200,
+   

[PATCH V1 net-next 11/12] net: ena: update driver version to 2.0.1

2018-10-09 Thread akiyano
From: Arthur Kiyanovski 

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h 
b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index d241dfc..5218736 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -43,9 +43,9 @@
 #include "ena_com.h"
 #include "ena_eth_com.h"
 
-#define DRV_MODULE_VER_MAJOR   1
-#define DRV_MODULE_VER_MINOR   5
-#define DRV_MODULE_VER_SUBMINOR 0
+#define DRV_MODULE_VER_MAJOR   2
+#define DRV_MODULE_VER_MINOR   0
+#define DRV_MODULE_VER_SUBMINOR 1
 
 #define DRV_MODULE_NAME"ena"
 #ifndef DRV_MODULE_VERSION
-- 
2.7.4



[PATCH V1 net-next 08/12] net: ena: limit refill Rx threshold to 256 to avoid latency issues

2018-10-09 Thread akiyano
From: Arthur Kiyanovski 

Currently Rx refill is done when the number of required descriptors is
above 1/8 of the queue size. With a default of 1024 entries per queue the
threshold is 128 descriptors.
There is an intention to increase the queue size to 8192 entries.
In this case a threshold of 1024 descriptors is too large and can hurt
latency.
Add another limitation to the Rx threshold to be at most 256 descriptors.
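
As a stand-alone illustration of the clamp (plain user-space C mirroring
the min_t() in the diff below, constant names taken from the patch):

#include <stdio.h>

#define ENA_RX_REFILL_THRESH_DIVIDER	8
#define ENA_RX_REFILL_THRESH_PACKET	256

static int refill_threshold(int ring_size)
{
	int div = ring_size / ENA_RX_REFILL_THRESH_DIVIDER;

	return div < ENA_RX_REFILL_THRESH_PACKET ?
	       div : ENA_RX_REFILL_THRESH_PACKET;
}

int main(void)
{
	/* 1024-entry ring: threshold stays 128, as before */
	printf("1024 -> %d\n", refill_threshold(1024));
	/* 8192-entry ring: clamped to 256 instead of 1024 */
	printf("8192 -> %d\n", refill_threshold(8192));
	return 0;
}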

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 4 +++-
 drivers/net/ethernet/amazon/ena/ena_netdev.h | 5 +++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index e345220..c4c33b1 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -1122,7 +1122,9 @@ static int ena_clean_rx_irq(struct ena_ring *rx_ring, 
struct napi_struct *napi,
rx_ring->next_to_clean = next_to_clean;
 
refill_required = ena_com_free_desc(rx_ring->ena_com_io_sq);
-   refill_threshold = rx_ring->ring_size / ENA_RX_REFILL_THRESH_DIVIDER;
+   refill_threshold =
+   min_t(int, rx_ring->ring_size / ENA_RX_REFILL_THRESH_DIVIDER,
+ ENA_RX_REFILL_THRESH_PACKET);
 
/* Optimization, try to batch new rx buffers */
if (refill_required > refill_threshold) {
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h 
b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index a16baf0..0cf35ae 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -106,10 +106,11 @@
  */
 #define ENA_TX_POLL_BUDGET_DIVIDER 4
 
-/* Refill Rx queue when number of available descriptors is below
- * QUEUE_SIZE / ENA_RX_REFILL_THRESH_DIVIDER
+/* Refill Rx queue when number of required descriptors is above
+ * QUEUE_SIZE / ENA_RX_REFILL_THRESH_DIVIDER or ENA_RX_REFILL_THRESH_PACKET
  */
 #define ENA_RX_REFILL_THRESH_DIVIDER   8
+#define ENA_RX_REFILL_THRESH_PACKET256
 
 /* Number of queues to check for missing queues per timer service */
 #define ENA_MONITORED_TX_QUEUES4
-- 
2.7.4



[PATCH V1 net-next 10/12] net: ena: remove redundant parameter in ena_com_admin_init()

2018-10-09 Thread akiyano
From: Arthur Kiyanovski 

Remove redundant spinlock acquire parameter from ena_com_admin_init()

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_com.c| 6 ++
 drivers/net/ethernet/amazon/ena/ena_com.h| 5 +
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 2 +-
 3 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
b/drivers/net/ethernet/amazon/ena/ena_com.c
index 5c468b2..420cede 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -1701,8 +1701,7 @@ void ena_com_mmio_reg_read_request_write_dev_addr(struct 
ena_com_dev *ena_dev)
 }
 
 int ena_com_admin_init(struct ena_com_dev *ena_dev,
-  struct ena_aenq_handlers *aenq_handlers,
-  bool init_spinlock)
+  struct ena_aenq_handlers *aenq_handlers)
 {
struct ena_com_admin_queue *admin_queue = _dev->admin_queue;
u32 aq_caps, acq_caps, dev_sts, addr_low, addr_high;
@@ -1728,8 +1727,7 @@ int ena_com_admin_init(struct ena_com_dev *ena_dev,
 
atomic_set(_queue->outstanding_cmds, 0);
 
-   if (init_spinlock)
-   spin_lock_init(_queue->q_lock);
+   spin_lock_init(_queue->q_lock);
 
ret = ena_com_init_comp_ctxt(admin_queue);
if (ret)
diff --git a/drivers/net/ethernet/amazon/ena/ena_com.h 
b/drivers/net/ethernet/amazon/ena/ena_com.h
index 25af8d0..ae8b485 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.h
+++ b/drivers/net/ethernet/amazon/ena/ena_com.h
@@ -436,8 +436,6 @@ void ena_com_mmio_reg_read_request_destroy(struct 
ena_com_dev *ena_dev);
 /* ena_com_admin_init - Init the admin and the async queues
  * @ena_dev: ENA communication layer struct
  * @aenq_handlers: Those handlers to be called upon event.
- * @init_spinlock: Indicate if this method should init the admin spinlock or
- * the spinlock was init before (for example, in a case of FLR).
  *
  * Initialize the admin submission and completion queues.
  * Initialize the asynchronous events notification queues.
@@ -445,8 +443,7 @@ void ena_com_mmio_reg_read_request_destroy(struct 
ena_com_dev *ena_dev);
  * @return - 0 on success, negative value on failure.
  */
 int ena_com_admin_init(struct ena_com_dev *ena_dev,
-  struct ena_aenq_handlers *aenq_handlers,
-  bool init_spinlock);
+  struct ena_aenq_handlers *aenq_handlers);
 
 /* ena_com_admin_destroy - Destroy the admin and the async events queues.
  * @ena_dev: ENA communication layer struct
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index c4c33b1..284a0a6 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2508,7 +2508,7 @@ static int ena_device_init(struct ena_com_dev *ena_dev, 
struct pci_dev *pdev,
}
 
/* ENA admin level init */
-   rc = ena_com_admin_init(ena_dev, _handlers, true);
+   rc = ena_com_admin_init(ena_dev, _handlers);
if (rc) {
dev_err(dev,
"Can not initialize ena admin queue with device\n");
-- 
2.7.4



[PATCH V1 net-next 09/12] net: ena: change rx copybreak default to reduce kernel memory pressure

2018-10-09 Thread akiyano
From: Arthur Kiyanovski 

Improves socket memory utilization when receiving packets larger
than 128 bytes (the previous rx copybreak) and smaller than 256 bytes.
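
For context, the copybreak idea in a hedged, driver-agnostic sketch (not
ENA's exact receive path): packets below the threshold are copied into a
small freshly allocated skb, so the large Rx buffer is reposted instead
of sitting pinned on the socket queue.

#include <linux/skbuff.h>

static struct sk_buff *rx_copybreak_skb(struct napi_struct *napi,
					const void *buf, unsigned int len,
					unsigned int rx_copybreak)
{
	struct sk_buff *skb;

	if (len > rx_copybreak)
		return NULL;		/* caller uses the zero-copy path */

	skb = napi_alloc_skb(napi, len);
	if (!skb)
		return NULL;

	skb_copy_to_linear_data(skb, buf, len);
	skb_put(skb, len);
	return skb;
}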

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h 
b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index 0cf35ae..d241dfc 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -81,7 +81,7 @@
 #define ENA_DEFAULT_RING_SIZE  (1024)
 
 #define ENA_TX_WAKEUP_THRESH   (MAX_SKB_FRAGS + 2)
-#define ENA_DEFAULT_RX_COPYBREAK   (128 - NET_IP_ALIGN)
+#define ENA_DEFAULT_RX_COPYBREAK   (256 - NET_IP_ALIGN)
 
 /* limit the buffer size to 600 bytes to handle MTU changes from very
  * small to very large, in which case the number of buffers per packet
-- 
2.7.4



[PATCH V1 net-next 04/12] net: ena: add functions for handling Low Latency Queues in ena_com

2018-10-09 Thread akiyano
From: Arthur Kiyanovski 

This patch introduces APIs for detection, initialization, configuration
and actual usage of low latency queues (LLQ). It extends the transmit API
with creation of LLQ descriptors in device memory (which include host
buffer descriptors as well as the packet header).
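
In rough outline (a hedged sketch of the pattern, not the exact ena_com
code), a transmit entry is staged in a host-side bounce buffer and then
pushed to the write-combined device memory window in one burst before
the doorbell:

#include <linux/io.h>

static void llq_flush_bounce_buffer(void __iomem *dev_mem_entry,
				    const u8 *bounce_buf, size_t entry_size)
{
	/* one streaming copy keeps the WC writes combined */
	memcpy_toio(dev_mem_entry, bounce_buf, entry_size);

	/* entry must be visible before the subsequent doorbell write */
	wmb();
}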

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_com.c | 249 --
 drivers/net/ethernet/amazon/ena/ena_com.h |  28 +++
 drivers/net/ethernet/amazon/ena/ena_eth_com.c | 231 ++--
 drivers/net/ethernet/amazon/ena/ena_eth_com.h |  25 ++-
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |  21 +--
 5 files changed, 474 insertions(+), 80 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
b/drivers/net/ethernet/amazon/ena/ena_com.c
index b6e6a47..5220c75 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -58,6 +58,8 @@
 
 #define ENA_MMIO_READ_TIMEOUT 0x
 
+#define ENA_COM_BOUNCE_BUFFER_CNTRL_CNT4
+
 #define ENA_REGS_ADMIN_INTR_MASK 1
 
 #define ENA_POLL_MS5
@@ -352,21 +354,48 @@ static int ena_com_init_io_sq(struct ena_com_dev *ena_dev,
_sq->desc_addr.phys_addr,
GFP_KERNEL);
}
-   } else {
+
+   if (!io_sq->desc_addr.virt_addr) {
+   pr_err("memory allocation failed");
+   return -ENOMEM;
+   }
+   }
+
+   if (io_sq->mem_queue_type == ENA_ADMIN_PLACEMENT_POLICY_DEV) {
+   /* Allocate bounce buffers */
+   io_sq->bounce_buf_ctrl.buffer_size =
+   ena_dev->llq_info.desc_list_entry_size;
+   io_sq->bounce_buf_ctrl.buffers_num =
+   ENA_COM_BOUNCE_BUFFER_CNTRL_CNT;
+   io_sq->bounce_buf_ctrl.next_to_use = 0;
+
+   size = io_sq->bounce_buf_ctrl.buffer_size *
+io_sq->bounce_buf_ctrl.buffers_num;
+
dev_node = dev_to_node(ena_dev->dmadev);
set_dev_node(ena_dev->dmadev, ctx->numa_node);
-   io_sq->desc_addr.virt_addr =
+   io_sq->bounce_buf_ctrl.base_buffer =
devm_kzalloc(ena_dev->dmadev, size, GFP_KERNEL);
set_dev_node(ena_dev->dmadev, dev_node);
-   if (!io_sq->desc_addr.virt_addr) {
-   io_sq->desc_addr.virt_addr =
+   if (!io_sq->bounce_buf_ctrl.base_buffer)
+   io_sq->bounce_buf_ctrl.base_buffer =
devm_kzalloc(ena_dev->dmadev, size, GFP_KERNEL);
+
+   if (!io_sq->bounce_buf_ctrl.base_buffer) {
+   pr_err("bounce buffer memory allocation failed");
+   return -ENOMEM;
}
-   }
 
-   if (!io_sq->desc_addr.virt_addr) {
-   pr_err("memory allocation failed");
-   return -ENOMEM;
+   memcpy(_sq->llq_info, _dev->llq_info,
+  sizeof(io_sq->llq_info));
+
+   /* Initiate the first bounce buffer */
+   io_sq->llq_buf_ctrl.curr_bounce_buf =
+   ena_com_get_next_bounce_buffer(_sq->bounce_buf_ctrl);
+   memset(io_sq->llq_buf_ctrl.curr_bounce_buf,
+  0x0, io_sq->llq_info.desc_list_entry_size);
+   io_sq->llq_buf_ctrl.descs_left_in_line =
+   io_sq->llq_info.descs_num_before_header;
}
 
io_sq->tail = 0;
@@ -554,6 +583,156 @@ static int 
ena_com_wait_and_process_admin_cq_polling(struct ena_comp_ctx *comp_c
return ret;
 }
 
+/**
+ * Set the LLQ configurations of the firmware
+ *
+ * The driver provides only the enabled feature values to the device,
+ * which in turn, checks if they are supported.
+ */
+static int ena_com_set_llq(struct ena_com_dev *ena_dev)
+{
+   struct ena_com_admin_queue *admin_queue;
+   struct ena_admin_set_feat_cmd cmd;
+   struct ena_admin_set_feat_resp resp;
+   struct ena_com_llq_info *llq_info = _dev->llq_info;
+   int ret;
+
+   memset(, 0x0, sizeof(cmd));
+   admin_queue = _dev->admin_queue;
+
+   cmd.aq_common_descriptor.opcode = ENA_ADMIN_SET_FEATURE;
+   cmd.feat_common.feature_id = ENA_ADMIN_LLQ;
+
+   cmd.u.llq.header_location_ctrl_enabled = llq_info->header_location_ctrl;
+   cmd.u.llq.entry_size_ctrl_enabled = llq_info->desc_list_entry_size_ctrl;
+   cmd.u.llq.desc_num_before_header_enabled = 
llq_info->descs_num_before_header;
+   cmd.u.llq.descriptors_stride_ctrl_enabled = llq_info->desc_stride_ctrl;
+
+   ret = ena_com_execute_admin_command(admin_queue,
+   (struct ena_admin_aq_entry *),
+   sizeof(cmd),
+   (struct 

[PATCH V1 net-next 05/12] net: ena: add functions for handling Low Latency Queues in ena_netdev

2018-10-09 Thread akiyano
From: Arthur Kiyanovski 

This patch includes all code changes necessary in ena_netdev to enable
packet sending via the LLQ placement mode.

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_ethtool.c |   1 +
 drivers/net/ethernet/amazon/ena/ena_netdev.c  | 387 --
 drivers/net/ethernet/amazon/ena/ena_netdev.h  |   6 +
 3 files changed, 251 insertions(+), 143 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_ethtool.c 
b/drivers/net/ethernet/amazon/ena/ena_ethtool.c
index 521607b..fd28bd0 100644
--- a/drivers/net/ethernet/amazon/ena/ena_ethtool.c
+++ b/drivers/net/ethernet/amazon/ena/ena_ethtool.c
@@ -81,6 +81,7 @@ static const struct ena_stats ena_stats_tx_strings[] = {
ENA_STAT_TX_ENTRY(doorbells),
ENA_STAT_TX_ENTRY(prepare_ctx_err),
ENA_STAT_TX_ENTRY(bad_req_id),
+   ENA_STAT_TX_ENTRY(llq_buffer_copy),
ENA_STAT_TX_ENTRY(missed_tx),
 };
 
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index e732bd2..fcdfaf0 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -237,6 +237,17 @@ static int ena_setup_tx_resources(struct ena_adapter 
*adapter, int qid)
}
}
 
+   size = tx_ring->tx_max_header_size;
+   tx_ring->push_buf_intermediate_buf = vzalloc_node(size, node);
+   if (!tx_ring->push_buf_intermediate_buf) {
+   tx_ring->push_buf_intermediate_buf = vzalloc(size);
+   if (!tx_ring->push_buf_intermediate_buf) {
+   vfree(tx_ring->tx_buffer_info);
+   vfree(tx_ring->free_tx_ids);
+   return -ENOMEM;
+   }
+   }
+
/* Req id ring for TX out of order completions */
for (i = 0; i < tx_ring->ring_size; i++)
tx_ring->free_tx_ids[i] = i;
@@ -265,6 +276,9 @@ static void ena_free_tx_resources(struct ena_adapter 
*adapter, int qid)
 
vfree(tx_ring->free_tx_ids);
tx_ring->free_tx_ids = NULL;
+
+   vfree(tx_ring->push_buf_intermediate_buf);
+   tx_ring->push_buf_intermediate_buf = NULL;
 }
 
 /* ena_setup_all_tx_resources - allocate I/O Tx queues resources for All queues
@@ -602,6 +616,36 @@ static void ena_free_all_rx_bufs(struct ena_adapter 
*adapter)
ena_free_rx_bufs(adapter, i);
 }
 
+static inline void ena_unmap_tx_skb(struct ena_ring *tx_ring,
+   struct ena_tx_buffer *tx_info)
+{
+   struct ena_com_buf *ena_buf;
+   u32 cnt;
+   int i;
+
+   ena_buf = tx_info->bufs;
+   cnt = tx_info->num_of_bufs;
+
+   if (unlikely(!cnt))
+   return;
+
+   if (tx_info->map_linear_data) {
+   dma_unmap_single(tx_ring->dev,
+dma_unmap_addr(ena_buf, paddr),
+dma_unmap_len(ena_buf, len),
+DMA_TO_DEVICE);
+   ena_buf++;
+   cnt--;
+   }
+
+   /* unmap remaining mapped pages */
+   for (i = 0; i < cnt; i++) {
+   dma_unmap_page(tx_ring->dev, dma_unmap_addr(ena_buf, paddr),
+  dma_unmap_len(ena_buf, len), DMA_TO_DEVICE);
+   ena_buf++;
+   }
+}
+
 /* ena_free_tx_bufs - Free Tx Buffers per Queue
  * @tx_ring: TX ring for which buffers be freed
  */
@@ -612,9 +656,6 @@ static void ena_free_tx_bufs(struct ena_ring *tx_ring)
 
for (i = 0; i < tx_ring->ring_size; i++) {
struct ena_tx_buffer *tx_info = _ring->tx_buffer_info[i];
-   struct ena_com_buf *ena_buf;
-   int nr_frags;
-   int j;
 
if (!tx_info->skb)
continue;
@@ -630,21 +671,7 @@ static void ena_free_tx_bufs(struct ena_ring *tx_ring)
   tx_ring->qid, i);
}
 
-   ena_buf = tx_info->bufs;
-   dma_unmap_single(tx_ring->dev,
-ena_buf->paddr,
-ena_buf->len,
-DMA_TO_DEVICE);
-
-   /* unmap remaining mapped pages */
-   nr_frags = tx_info->num_of_bufs - 1;
-   for (j = 0; j < nr_frags; j++) {
-   ena_buf++;
-   dma_unmap_page(tx_ring->dev,
-  ena_buf->paddr,
-  ena_buf->len,
-  DMA_TO_DEVICE);
-   }
+   ena_unmap_tx_skb(tx_ring, tx_info);
 
dev_kfree_skb_any(tx_info->skb);
}
@@ -735,8 +762,6 @@ static int ena_clean_tx_irq(struct ena_ring *tx_ring, u32 
budget)
while (tx_pkts < budget) {
struct ena_tx_buffer *tx_info;
struct sk_buff *skb;
-   struct ena_com_buf 

[PATCH V1 net-next 06/12] net: ena: use CSUM_CHECKED device indication to report skb's checksum status

2018-10-09 Thread akiyano
From: Arthur Kiyanovski 

Set skb->ip_summed to the correct value as reported by the device.
Add counter for the case where rx csum offload is enabled but
device didn't check it.
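
The resulting decision, sketched out of context (see the ena_netdev.c
hunk below for the real code with the statistics update):

#include <linux/skbuff.h>

static void rx_set_csum(struct sk_buff *skb, bool csum_checked, bool csum_err)
{
	if (!csum_checked || csum_err) {
		/* device didn't (successfully) verify L4: let the
		 * stack compute the checksum in software
		 */
		skb->ip_summed = CHECKSUM_NONE;
		return;
	}
	skb->ip_summed = CHECKSUM_UNNECESSARY;
}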

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_eth_com.c |  3 +++
 drivers/net/ethernet/amazon/ena/ena_eth_com.h |  1 +
 drivers/net/ethernet/amazon/ena/ena_eth_io_defs.h | 10 --
 drivers/net/ethernet/amazon/ena/ena_ethtool.c |  1 +
 drivers/net/ethernet/amazon/ena/ena_netdev.c  | 13 -
 drivers/net/ethernet/amazon/ena/ena_netdev.h  |  1 +
 6 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_eth_com.c 
b/drivers/net/ethernet/amazon/ena/ena_eth_com.c
index 17107ca..f6c2d38 100644
--- a/drivers/net/ethernet/amazon/ena/ena_eth_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_eth_com.c
@@ -354,6 +354,9 @@ static inline void ena_com_rx_set_flags(struct 
ena_com_rx_ctx *ena_rx_ctx,
ena_rx_ctx->l4_csum_err =
!!((cdesc->status & ENA_ETH_IO_RX_CDESC_BASE_L4_CSUM_ERR_MASK) 
>>
ENA_ETH_IO_RX_CDESC_BASE_L4_CSUM_ERR_SHIFT);
+   ena_rx_ctx->l4_csum_checked =
+   !!((cdesc->status & 
ENA_ETH_IO_RX_CDESC_BASE_L4_CSUM_CHECKED_MASK) >>
+   ENA_ETH_IO_RX_CDESC_BASE_L4_CSUM_CHECKED_SHIFT);
ena_rx_ctx->hash = cdesc->hash;
ena_rx_ctx->frag =
(cdesc->status & ENA_ETH_IO_RX_CDESC_BASE_IPV4_FRAG_MASK) >>
diff --git a/drivers/net/ethernet/amazon/ena/ena_eth_com.h 
b/drivers/net/ethernet/amazon/ena/ena_eth_com.h
index bcc8407..340d02b 100644
--- a/drivers/net/ethernet/amazon/ena/ena_eth_com.h
+++ b/drivers/net/ethernet/amazon/ena/ena_eth_com.h
@@ -67,6 +67,7 @@ struct ena_com_rx_ctx {
enum ena_eth_io_l4_proto_index l4_proto;
bool l3_csum_err;
bool l4_csum_err;
+   u8 l4_csum_checked;
/* fragmented packet */
bool frag;
u32 hash;
diff --git a/drivers/net/ethernet/amazon/ena/ena_eth_io_defs.h 
b/drivers/net/ethernet/amazon/ena/ena_eth_io_defs.h
index f320c58..4c5ccaa 100644
--- a/drivers/net/ethernet/amazon/ena/ena_eth_io_defs.h
+++ b/drivers/net/ethernet/amazon/ena/ena_eth_io_defs.h
@@ -242,9 +242,13 @@ struct ena_eth_io_rx_cdesc_base {
 *checksum error detected, or, the controller didn't
 *validate the checksum. This bit is valid only when
 *l4_proto_idx indicates TCP/UDP packet, and,
-*ipv4_frag is not set
+*ipv4_frag is not set. This bit is valid only when
+*l4_csum_checked below is set.
 * 15 : ipv4_frag - Indicates IPv4 fragmented packet
-* 23:16 : reserved16
+* 16 : l4_csum_checked - L4 checksum was verified
+*(could be OK or error), when cleared the status of
+*checksum is unknown
+* 23:17 : reserved17 - MBZ
 * 24 : phase
 * 25 : l3_csum2 - second checksum engine result
 * 26 : first - Indicates first descriptor in
@@ -390,6 +394,8 @@ struct ena_eth_io_numa_node_cfg_reg {
 #define ENA_ETH_IO_RX_CDESC_BASE_L4_CSUM_ERR_MASK BIT(14)
 #define ENA_ETH_IO_RX_CDESC_BASE_IPV4_FRAG_SHIFT 15
 #define ENA_ETH_IO_RX_CDESC_BASE_IPV4_FRAG_MASK BIT(15)
+#define ENA_ETH_IO_RX_CDESC_BASE_L4_CSUM_CHECKED_SHIFT 16
+#define ENA_ETH_IO_RX_CDESC_BASE_L4_CSUM_CHECKED_MASK BIT(16)
 #define ENA_ETH_IO_RX_CDESC_BASE_PHASE_SHIFT 24
 #define ENA_ETH_IO_RX_CDESC_BASE_PHASE_MASK BIT(24)
 #define ENA_ETH_IO_RX_CDESC_BASE_L3_CSUM2_SHIFT 25
diff --git a/drivers/net/ethernet/amazon/ena/ena_ethtool.c 
b/drivers/net/ethernet/amazon/ena/ena_ethtool.c
index fd28bd0..f3a5a38 100644
--- a/drivers/net/ethernet/amazon/ena/ena_ethtool.c
+++ b/drivers/net/ethernet/amazon/ena/ena_ethtool.c
@@ -97,6 +97,7 @@ static const struct ena_stats ena_stats_rx_strings[] = {
ENA_STAT_RX_ENTRY(rx_copybreak_pkt),
ENA_STAT_RX_ENTRY(bad_req_id),
ENA_STAT_RX_ENTRY(empty_rx_ring),
+   ENA_STAT_RX_ENTRY(csum_unchecked),
 };
 
 static const struct ena_stats ena_stats_ena_com_strings[] = {
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index fcdfaf0..35b0ce5 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -994,8 +994,19 @@ static inline void ena_rx_checksum(struct ena_ring 
*rx_ring,
return;
}
 
-   skb->ip_summed = CHECKSUM_UNNECESSARY;
+   if (likely(ena_rx_ctx->l4_csum_checked)) {
+   skb->ip_summed = CHECKSUM_UNNECESSARY;
+   } else {
+   u64_stats_update_begin(_ring->syncp);
+   rx_ring->rx_stats.csum_unchecked++;
+   u64_stats_update_end(_ring->syncp);
+   skb->ip_summed = CHECKSUM_NONE;
+   }
+   } else {
+   skb->ip_summed = CHECKSUM_NONE;
+ 

[PATCH V1 net-next 07/12] net: ena: explicit casting and initialization, and clearer error handling

2018-10-09 Thread akiyano
From: Arthur Kiyanovski 

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_com.c| 39 
 drivers/net/ethernet/amazon/ena/ena_netdev.c |  5 ++--
 drivers/net/ethernet/amazon/ena/ena_netdev.h | 22 
 3 files changed, 36 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
b/drivers/net/ethernet/amazon/ena/ena_com.c
index 5220c75..5c468b2 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -235,7 +235,7 @@ static struct ena_comp_ctx 
*__ena_com_submit_admin_cmd(struct ena_com_admin_queu
tail_masked = admin_queue->sq.tail & queue_size_mask;
 
/* In case of queue FULL */
-   cnt = atomic_read(_queue->outstanding_cmds);
+   cnt = (u16)atomic_read(_queue->outstanding_cmds);
if (cnt >= admin_queue->q_depth) {
pr_debug("admin queue is full.\n");
admin_queue->stats.out_of_space++;
@@ -304,7 +304,7 @@ static struct ena_comp_ctx *ena_com_submit_admin_cmd(struct 
ena_com_admin_queue
 struct ena_admin_acq_entry 
*comp,
 size_t comp_size_in_bytes)
 {
-   unsigned long flags;
+   unsigned long flags = 0;
struct ena_comp_ctx *comp_ctx;
 
spin_lock_irqsave(_queue->q_lock, flags);
@@ -332,7 +332,7 @@ static int ena_com_init_io_sq(struct ena_com_dev *ena_dev,
 
memset(_sq->desc_addr, 0x0, sizeof(io_sq->desc_addr));
 
-   io_sq->dma_addr_bits = ena_dev->dma_addr_bits;
+   io_sq->dma_addr_bits = (u8)ena_dev->dma_addr_bits;
io_sq->desc_entry_size =
(io_sq->direction == ENA_COM_IO_QUEUE_DIRECTION_TX) ?
sizeof(struct ena_eth_io_tx_desc) :
@@ -486,7 +486,7 @@ static void ena_com_handle_admin_completion(struct 
ena_com_admin_queue *admin_qu
 
/* Go over all the completions */
while ((READ_ONCE(cqe->acq_common_descriptor.flags) &
-   ENA_ADMIN_ACQ_COMMON_DESC_PHASE_MASK) == phase) {
+   ENA_ADMIN_ACQ_COMMON_DESC_PHASE_MASK) == phase) {
/* Do not read the rest of the completion entry before the
 * phase bit was validated
 */
@@ -537,7 +537,8 @@ static int ena_com_comp_status_to_errno(u8 comp_status)
 static int ena_com_wait_and_process_admin_cq_polling(struct ena_comp_ctx 
*comp_ctx,
 struct ena_com_admin_queue 
*admin_queue)
 {
-   unsigned long flags, timeout;
+   unsigned long flags = 0;
+   unsigned long timeout;
int ret;
 
timeout = jiffies + usecs_to_jiffies(admin_queue->completion_timeout);
@@ -736,7 +737,7 @@ static int ena_com_config_llq_info(struct ena_com_dev 
*ena_dev,
 static int ena_com_wait_and_process_admin_cq_interrupts(struct ena_comp_ctx 
*comp_ctx,
struct 
ena_com_admin_queue *admin_queue)
 {
-   unsigned long flags;
+   unsigned long flags = 0;
int ret;
 
wait_for_completion_timeout(_ctx->wait_event,
@@ -782,7 +783,7 @@ static u32 ena_com_reg_bar_read32(struct ena_com_dev 
*ena_dev, u16 offset)
volatile struct ena_admin_ena_mmio_req_read_less_resp *read_resp =
mmio_read->read_resp;
u32 mmio_read_reg, ret, i;
-   unsigned long flags;
+   unsigned long flags = 0;
u32 timeout = mmio_read->reg_read_to;
 
might_sleep();
@@ -1426,7 +1427,7 @@ void ena_com_abort_admin_commands(struct ena_com_dev 
*ena_dev)
 void ena_com_wait_for_abort_completion(struct ena_com_dev *ena_dev)
 {
struct ena_com_admin_queue *admin_queue = _dev->admin_queue;
-   unsigned long flags;
+   unsigned long flags = 0;
 
spin_lock_irqsave(_queue->q_lock, flags);
while (atomic_read(_queue->outstanding_cmds) != 0) {
@@ -1470,7 +1471,7 @@ bool ena_com_get_admin_running_state(struct ena_com_dev 
*ena_dev)
 void ena_com_set_admin_running_state(struct ena_com_dev *ena_dev, bool state)
 {
struct ena_com_admin_queue *admin_queue = _dev->admin_queue;
-   unsigned long flags;
+   unsigned long flags = 0;
 
spin_lock_irqsave(_queue->q_lock, flags);
ena_dev->admin_queue.running_state = state;
@@ -1504,7 +1505,7 @@ int ena_com_set_aenq_config(struct ena_com_dev *ena_dev, 
u32 groups_flag)
}
 
if ((get_resp.u.aenq.supported_groups & groups_flag) != groups_flag) {
-   pr_warn("Trying to set unsupported aenq events. supported flag: 
%x asked flag: %x\n",
+   pr_warn("Trying to set unsupported aenq events. supported flag: 
0x%x asked flag: 0x%x\n",
get_resp.u.aenq.supported_groups, groups_flag);
return -EOPNOTSUPP;
}
@@ -1652,7 +1653,7 @@ int ena_com_mmio_reg_read_request_init(struct ena_com_dev 
*ena_dev)
 

[PATCH V1 net-next 00/12] Improving performance and reducing latencies, by using latest capabilities exposed in ENA device

2018-10-09 Thread akiyano
From: Arthur Kiyanovski 

This patchset introduces the following:
1. A new placement policy of Tx headers and descriptors, which takes
advantage of an option to place headers + descriptors in device memory
space. This is sometimes referred to as LLQ - low latency queue.
The patch set defines the admin capability, maps the device memory as
write-combined, and adds a mode in transmit datapath to do header +
descriptor placement on the device.
2. Support for RX checksum offloading
3. Miscelaneous small improvements and code cleanups

Arthur Kiyanovski (12):
  net: ena: minor performance improvement
  net: ena: complete host info to match latest ENA spec
  net: ena: introduce Low Latency Queues data structures according to
ENA spec
  net: ena: add functions for handling Low Latency Queues in ena_com
  net: ena: add functions for handling Low Latency Queues in ena_netdev
  net: ena: use CSUM_CHECKED device indication to report skb's checksum
status
  net: ena: explicit casting and initialization, and clearer error
handling
  net: ena: limit refill Rx threshold to 256 to avoid latency issues
  net: ena: change rx copybreak default to reduce kernel memory pressure
  net: ena: remove redundant parameter in ena_com_admin_init()
  net: ena: update driver version to 2.0.1
  net: ena: fix indentations in ena_defs for better readability

 drivers/net/ethernet/amazon/ena/ena_admin_defs.h  | 425 -
 drivers/net/ethernet/amazon/ena/ena_com.c | 302 +--
 drivers/net/ethernet/amazon/ena/ena_com.h |  71 +++-
 drivers/net/ethernet/amazon/ena/ena_common_defs.h |   4 +-
 drivers/net/ethernet/amazon/ena/ena_eth_com.c | 277 +-
 drivers/net/ethernet/amazon/ena/ena_eth_com.h |  72 +++-
 drivers/net/ethernet/amazon/ena/ena_eth_io_defs.h | 229 ++-
 drivers/net/ethernet/amazon/ena/ena_ethtool.c |   2 +
 drivers/net/ethernet/amazon/ena/ena_netdev.c  | 446 ++
 drivers/net/ethernet/amazon/ena/ena_netdev.h  |  42 +-
 drivers/net/ethernet/amazon/ena/ena_regs_defs.h   | 206 +-
 11 files changed, 1334 insertions(+), 742 deletions(-)

-- 
2.7.4



[PATCH V1 net-next 03/12] net: ena: introduce Low Latency Queues data structures according to ENA spec

2018-10-09 Thread akiyano
From: Arthur Kiyanovski 

Low Latency Queues (LLQ) allow usage of the device's memory for descriptors
and headers. Such queues decrease processing time since data is already
located on the device when the driver rings the doorbell.
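
The negotiation follows a supported/enabled pattern: the device
advertises a bitfield of options, and the driver picks one and writes it
back into the matching *_enabled field. A hedged sketch using the
entry-size enum from the header below (the preference order here is an
assumption, not the driver's actual policy):

#include <linux/types.h>
#include <linux/errno.h>

static int llq_select_entry_size(u16 supported, u16 *enabled)
{
	/* prefer the largest entry size that both sides support */
	if (supported & ENA_ADMIN_LIST_ENTRY_SIZE_256B)
		*enabled = ENA_ADMIN_LIST_ENTRY_SIZE_256B;
	else if (supported & ENA_ADMIN_LIST_ENTRY_SIZE_192B)
		*enabled = ENA_ADMIN_LIST_ENTRY_SIZE_192B;
	else if (supported & ENA_ADMIN_LIST_ENTRY_SIZE_128B)
		*enabled = ENA_ADMIN_LIST_ENTRY_SIZE_128B;
	else
		return -EINVAL;	/* no common option */
	return 0;
}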

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_admin_defs.h | 90 +++-
 drivers/net/ethernet/amazon/ena/ena_com.h| 38 ++
 drivers/net/ethernet/amazon/ena/ena_netdev.c |  6 +-
 3 files changed, 128 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h 
b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
index d735164..b439ec1 100644
--- a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
+++ b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
@@ -74,6 +74,8 @@ enum ena_admin_aq_feature_id {
 
ENA_ADMIN_HW_HINTS  = 3,
 
+   ENA_ADMIN_LLQ   = 4,
+
ENA_ADMIN_RSS_HASH_FUNCTION = 10,
 
ENA_ADMIN_STATELESS_OFFLOAD_CONFIG  = 11,
@@ -485,8 +487,85 @@ struct ena_admin_device_attr_feature_desc {
u32 max_mtu;
 };
 
+enum ena_admin_llq_header_location {
+   /* header is in descriptor list */
+   ENA_ADMIN_INLINE_HEADER = 1,
+   /* header in a separate ring, implies 16B descriptor list entry */
+   ENA_ADMIN_HEADER_RING   = 2,
+};
+
+enum ena_admin_llq_ring_entry_size {
+   ENA_ADMIN_LIST_ENTRY_SIZE_128B  = 1,
+   ENA_ADMIN_LIST_ENTRY_SIZE_192B  = 2,
+   ENA_ADMIN_LIST_ENTRY_SIZE_256B  = 4,
+};
+
+enum ena_admin_llq_num_descs_before_header {
+   ENA_ADMIN_LLQ_NUM_DESCS_BEFORE_HEADER_0 = 0,
+   ENA_ADMIN_LLQ_NUM_DESCS_BEFORE_HEADER_1 = 1,
+   ENA_ADMIN_LLQ_NUM_DESCS_BEFORE_HEADER_2 = 2,
+   ENA_ADMIN_LLQ_NUM_DESCS_BEFORE_HEADER_4 = 4,
+   ENA_ADMIN_LLQ_NUM_DESCS_BEFORE_HEADER_8 = 8,
+};
+
+/* packet descriptor list entry always starts with one or more descriptors,
+ * followed by a header. The rest of the descriptors are located in the
+ * beginning of the subsequent entry. Stride refers to how the rest of the
+ * descriptors are placed. This field is relevant only for inline header
+ * mode
+ */
+enum ena_admin_llq_stride_ctrl {
+   ENA_ADMIN_SINGLE_DESC_PER_ENTRY = 1,
+   ENA_ADMIN_MULTIPLE_DESCS_PER_ENTRY  = 2,
+};
+
+struct ena_admin_feature_llq_desc {
+   u32 max_llq_num;
+
+   u32 max_llq_depth;
+
+   /*  specify the header locations the device supports. bitfield of
+*enum ena_admin_llq_header_location.
+*/
+   u16 header_location_ctrl_supported;
+
+   /* the header location the driver selected to use. */
+   u16 header_location_ctrl_enabled;
+
+   /* if inline header is specified - this is the size of descriptor
+*list entry. If header in a separate ring is specified - this is
+*the size of header ring entry. bitfield of enum
+*ena_admin_llq_ring_entry_size. specify the entry sizes the device
+*supports
+*/
+   u16 entry_size_ctrl_supported;
+
+   /* the entry size the driver selected to use. */
+   u16 entry_size_ctrl_enabled;
+
+   /* valid only if inline header is specified. First entry associated
+*with the packet includes descriptors and header. Rest of the
+*entries occupied by descriptors. This parameter defines the max
+*number of descriptors preceding the header in the first entry.
+*The field is bitfield of enum
+*ena_admin_llq_num_descs_before_header and specify the values the
+*device supports
+*/
+   u16 desc_num_before_header_supported;
+
+   /* the desired value the driver selected to use */
+   u16 desc_num_before_header_enabled;
+
+   /* valid only if inline was chosen. bitfield of enum
+*ena_admin_llq_stride_ctrl
+*/
+   u16 descriptors_stride_ctrl_supported;
+
+   /* the stride control the driver selected to use */
+   u16 descriptors_stride_ctrl_enabled;
+};
+
 struct ena_admin_queue_feature_desc {
-   /* including LLQs */
u32 max_sq_num;
 
u32 max_sq_depth;
@@ -495,9 +574,9 @@ struct ena_admin_queue_feature_desc {
 
u32 max_cq_depth;
 
-   u32 max_llq_num;
+   u32 max_legacy_llq_num;
 
-   u32 max_llq_depth;
+   u32 max_legacy_llq_depth;
 
u32 max_header_size;
 
@@ -822,6 +901,8 @@ struct ena_admin_get_feat_resp {
 
struct ena_admin_device_attr_feature_desc dev_attr;
 
+   struct ena_admin_feature_llq_desc llq;
+
struct ena_admin_queue_feature_desc max_queue;
 
struct ena_admin_feature_aenq_desc aenq;
@@ -869,6 +950,9 @@ struct ena_admin_set_feat_cmd {
 
/* rss indirection table */
struct ena_admin_feature_rss_ind_table 

[PATCH V1 net-next 02/12] net: ena: complete host info to match latest ENA spec

2018-10-09 Thread akiyano
From: Arthur Kiyanovski 

Add new fields and definitions to host info and fill them
according to the latest ENA spec version.
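
For example, the new driver_version layout packs four byte-wide fields
into one u32; a runnable sketch using the shift values implied by the
masks added below:

#include <stdio.h>
#include <stdint.h>

#define HOST_INFO_MINOR_SHIFT		8
#define HOST_INFO_SUB_MINOR_SHIFT	16
#define HOST_INFO_MODULE_TYPE_SHIFT	24

int main(void)
{
	/* driver 2.0.1, module type 0 */
	uint32_t ver = 2 |				/* major, bits 7:0 */
		       (0u << HOST_INFO_MINOR_SHIFT) |
		       (1u << HOST_INFO_SUB_MINOR_SHIFT) |
		       (0u << HOST_INFO_MODULE_TYPE_SHIFT);

	printf("driver_version = 0x%08x\n", ver);	/* 0x00010002 */
	return 0;
}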

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_admin_defs.h  | 31 ++-
 drivers/net/ethernet/amazon/ena/ena_com.c | 12 +++--
 drivers/net/ethernet/amazon/ena/ena_common_defs.h |  4 +--
 drivers/net/ethernet/amazon/ena/ena_netdev.c  | 10 +---
 4 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h 
b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
index 4532e57..d735164 100644
--- a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
+++ b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
@@ -63,6 +63,8 @@ enum ena_admin_aq_completion_status {
ENA_ADMIN_ILLEGAL_PARAMETER = 5,
 
ENA_ADMIN_UNKNOWN_ERROR = 6,
+
+   ENA_ADMIN_RESOURCE_BUSY = 7,
 };
 
 enum ena_admin_aq_feature_id {
@@ -702,6 +704,10 @@ enum ena_admin_os_type {
ENA_ADMIN_OS_FREEBSD= 4,
 
ENA_ADMIN_OS_IPXE   = 5,
+
+   ENA_ADMIN_OS_ESXI   = 6,
+
+   ENA_ADMIN_OS_GROUPS_NUM = 6,
 };
 
 struct ena_admin_host_info {
@@ -723,11 +729,27 @@ struct ena_admin_host_info {
/* 7:0 : major
 * 15:8 : minor
 * 23:16 : sub_minor
+* 31:24 : module_type
 */
u32 driver_version;
 
/* features bitmap */
-   u32 supported_network_features[4];
+   u32 supported_network_features[2];
+
+   /* ENA spec version of driver */
+   u16 ena_spec_version;
+
+   /* ENA device's Bus, Device and Function
+* 2:0 : function
+* 7:3 : device
+* 15:8 : bus
+*/
+   u16 bdf;
+
+   /* Number of CPUs */
+   u16 num_cpus;
+
+   u16 reserved;
 };
 
 struct ena_admin_rss_ind_table_entry {
@@ -1008,6 +1030,13 @@ struct ena_admin_ena_mmio_req_read_less_resp {
 #define ENA_ADMIN_HOST_INFO_MINOR_MASK GENMASK(15, 8)
 #define ENA_ADMIN_HOST_INFO_SUB_MINOR_SHIFT 16
 #define ENA_ADMIN_HOST_INFO_SUB_MINOR_MASK GENMASK(23, 16)
+#define ENA_ADMIN_HOST_INFO_MODULE_TYPE_SHIFT 24
+#define ENA_ADMIN_HOST_INFO_MODULE_TYPE_MASK GENMASK(31, 24)
+#define ENA_ADMIN_HOST_INFO_FUNCTION_MASK GENMASK(2, 0)
+#define ENA_ADMIN_HOST_INFO_DEVICE_SHIFT 3
+#define ENA_ADMIN_HOST_INFO_DEVICE_MASK GENMASK(7, 3)
+#define ENA_ADMIN_HOST_INFO_BUS_SHIFT 8
+#define ENA_ADMIN_HOST_INFO_BUS_MASK GENMASK(15, 8)
 
 /* aenq_common_desc */
 #define ENA_ADMIN_AENQ_COMMON_DESC_PHASE_MASK BIT(0)
diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
b/drivers/net/ethernet/amazon/ena/ena_com.c
index 7635c38..b6e6a47 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -41,9 +41,6 @@
 #define ENA_ASYNC_QUEUE_DEPTH 16
 #define ENA_ADMIN_QUEUE_DEPTH 32
 
-#define MIN_ENA_VER (((ENA_COMMON_SPEC_VERSION_MAJOR) << \
-   ENA_REGS_VERSION_MAJOR_VERSION_SHIFT) \
-   | (ENA_COMMON_SPEC_VERSION_MINOR))
 
 #define ENA_CTRL_MAJOR 0
 #define ENA_CTRL_MINOR 0
@@ -1400,11 +1397,6 @@ int ena_com_validate_version(struct ena_com_dev *ena_dev)
ENA_REGS_VERSION_MAJOR_VERSION_SHIFT,
ver & ENA_REGS_VERSION_MINOR_VERSION_MASK);
 
-   if (ver < MIN_ENA_VER) {
-   pr_err("ENA version is lower than the minimal version the 
driver supports\n");
-   return -1;
-   }
-
pr_info("ena controller version: %d.%d.%d implementation version %d\n",
(ctrl_ver & ENA_REGS_CONTROLLER_VERSION_MAJOR_VERSION_MASK) >>
ENA_REGS_CONTROLLER_VERSION_MAJOR_VERSION_SHIFT,
@@ -2441,6 +2433,10 @@ int ena_com_allocate_host_info(struct ena_com_dev 
*ena_dev)
if (unlikely(!host_attr->host_info))
return -ENOMEM;
 
+   host_attr->host_info->ena_spec_version =
+   ((ENA_COMMON_SPEC_VERSION_MAJOR << 
ENA_REGS_VERSION_MAJOR_VERSION_SHIFT) |
+   (ENA_COMMON_SPEC_VERSION_MINOR));
+
return 0;
 }
 
diff --git a/drivers/net/ethernet/amazon/ena/ena_common_defs.h 
b/drivers/net/ethernet/amazon/ena/ena_common_defs.h
index bb8d736..23beb7e 100644
--- a/drivers/net/ethernet/amazon/ena/ena_common_defs.h
+++ b/drivers/net/ethernet/amazon/ena/ena_common_defs.h
@@ -32,8 +32,8 @@
 #ifndef _ENA_COMMON_H_
 #define _ENA_COMMON_H_
 
-#define ENA_COMMON_SPEC_VERSION_MAJOR  0 /*  */
-#define ENA_COMMON_SPEC_VERSION_MINOR  10 /*  */
+#define ENA_COMMON_SPEC_VERSION_MAJOR2
+#define ENA_COMMON_SPEC_VERSION_MINOR0
 
 /* ENA operates with 48-bit memory addresses. ena_mem_addr_t */
 struct ena_common_mem_addr {
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 69a4978..0c9c0d3 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2206,7 +2206,8 @@ static u16 

[PATCH V1 net-next 01/12] net: ena: minor performance improvement

2018-10-09 Thread akiyano
From: Arthur Kiyanovski 

Reduce fastpath overhead by making ena_com_tx_comp_req_id_get() inline.
Also move it to the ena_eth_com.h file along with its dependency,
ena_com_cq_inc_head().

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_eth_com.c | 43 -
 drivers/net/ethernet/amazon/ena/ena_eth_com.h | 46 +--
 2 files changed, 44 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_eth_com.c 
b/drivers/net/ethernet/amazon/ena/ena_eth_com.c
index 2b3ff0c..9c0511e 100644
--- a/drivers/net/ethernet/amazon/ena/ena_eth_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_eth_com.c
@@ -59,15 +59,6 @@ static inline struct ena_eth_io_rx_cdesc_base 
*ena_com_get_next_rx_cdesc(
return cdesc;
 }
 
-static inline void ena_com_cq_inc_head(struct ena_com_io_cq *io_cq)
-{
-   io_cq->head++;
-
-   /* Switch phase bit in case of wrap around */
-   if (unlikely((io_cq->head & (io_cq->q_depth - 1)) == 0))
-   io_cq->phase ^= 1;
-}
-
 static inline void *get_sq_desc(struct ena_com_io_sq *io_sq)
 {
u16 tail_masked;
@@ -477,40 +468,6 @@ int ena_com_add_single_rx_desc(struct ena_com_io_sq *io_sq,
return 0;
 }
 
-int ena_com_tx_comp_req_id_get(struct ena_com_io_cq *io_cq, u16 *req_id)
-{
-   u8 expected_phase, cdesc_phase;
-   struct ena_eth_io_tx_cdesc *cdesc;
-   u16 masked_head;
-
-   masked_head = io_cq->head & (io_cq->q_depth - 1);
-   expected_phase = io_cq->phase;
-
-   cdesc = (struct ena_eth_io_tx_cdesc *)
-   ((uintptr_t)io_cq->cdesc_addr.virt_addr +
-   (masked_head * io_cq->cdesc_entry_size_in_bytes));
-
-   /* When the current completion descriptor phase isn't the same as the
-* expected, it mean that the device still didn't update
-* this completion.
-*/
-   cdesc_phase = READ_ONCE(cdesc->flags) & ENA_ETH_IO_TX_CDESC_PHASE_MASK;
-   if (cdesc_phase != expected_phase)
-   return -EAGAIN;
-
-   dma_rmb();
-   if (unlikely(cdesc->req_id >= io_cq->q_depth)) {
-   pr_err("Invalid req id %d\n", cdesc->req_id);
-   return -EINVAL;
-   }
-
-   ena_com_cq_inc_head(io_cq);
-
-   *req_id = READ_ONCE(cdesc->req_id);
-
-   return 0;
-}
-
 bool ena_com_cq_empty(struct ena_com_io_cq *io_cq)
 {
struct ena_eth_io_rx_cdesc_base *cdesc;
diff --git a/drivers/net/ethernet/amazon/ena/ena_eth_com.h 
b/drivers/net/ethernet/amazon/ena/ena_eth_com.h
index 2f76572..4930324 100644
--- a/drivers/net/ethernet/amazon/ena/ena_eth_com.h
+++ b/drivers/net/ethernet/amazon/ena/ena_eth_com.h
@@ -86,8 +86,6 @@ int ena_com_add_single_rx_desc(struct ena_com_io_sq *io_sq,
   struct ena_com_buf *ena_buf,
   u16 req_id);
 
-int ena_com_tx_comp_req_id_get(struct ena_com_io_cq *io_cq, u16 *req_id);
-
 bool ena_com_cq_empty(struct ena_com_io_cq *io_cq);
 
 static inline void ena_com_unmask_intr(struct ena_com_io_cq *io_cq,
@@ -159,4 +157,48 @@ static inline void ena_com_comp_ack(struct ena_com_io_sq 
*io_sq, u16 elem)
io_sq->next_to_comp += elem;
 }
 
+static inline void ena_com_cq_inc_head(struct ena_com_io_cq *io_cq)
+{
+   io_cq->head++;
+
+   /* Switch phase bit in case of wrap around */
+   if (unlikely((io_cq->head & (io_cq->q_depth - 1)) == 0))
+   io_cq->phase ^= 1;
+}
+
+static inline int ena_com_tx_comp_req_id_get(struct ena_com_io_cq *io_cq,
+u16 *req_id)
+{
+   u8 expected_phase, cdesc_phase;
+   struct ena_eth_io_tx_cdesc *cdesc;
+   u16 masked_head;
+
+   masked_head = io_cq->head & (io_cq->q_depth - 1);
+   expected_phase = io_cq->phase;
+
+   cdesc = (struct ena_eth_io_tx_cdesc *)
+   ((uintptr_t)io_cq->cdesc_addr.virt_addr +
+   (masked_head * io_cq->cdesc_entry_size_in_bytes));
+
+   /* When the current completion descriptor phase isn't the same as the
+* expected, it means that the device still didn't update
+* this completion.
+*/
+   cdesc_phase = READ_ONCE(cdesc->flags) & ENA_ETH_IO_TX_CDESC_PHASE_MASK;
+   if (cdesc_phase != expected_phase)
+   return -EAGAIN;
+
+   dma_rmb();
+
+   *req_id = READ_ONCE(cdesc->req_id);
+   if (unlikely(*req_id >= io_cq->q_depth)) {
+   pr_err("Invalid req id %d\n", cdesc->req_id);
+   return -EINVAL;
+   }
+
+   ena_com_cq_inc_head(io_cq);
+
+   return 0;
+}
+
 #endif /* ENA_ETH_COM_H_ */
-- 
2.7.4



[PATCH net-next] tun: Consistently configure generic netdev params via rtnetlink

2018-10-09 Thread Serhey Popovych
Configuring generic network device parameters on tun will fail in the
presence of an IFLA_INFO_KIND attribute in the IFLA_LINKINFO nested
attribute, since tun_validate() always returns failure.

This can be visualized with the following ip-link(8) command sequences:

  # ip link set dev tun0 group 100
  # ip link set dev tun0 group 100 type tun
  RTNETLINK answers: Invalid argument

with contrast to dummy and veth drivers:

  # ip link set dev dummy0 group 100
  # ip link set dev dummy0 type dummy

  # ip link set dev veth0 group 100
  # ip link set dev veth0 group 100 type veth

Fix by returning zero in tun_validate() when @data is NULL, which is
always the case since rtnl_link_ops->maxtype is zero in the tun driver.

Fixes: f019a7a594d9 ("tun: Implement ip link del tunXXX")
Signed-off-by: Serhey Popovych 
---
 drivers/net/tun.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b174342..a3e8a43 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2298,6 +2298,8 @@ static void tun_setup(struct net_device *dev)
 static int tun_validate(struct nlattr *tb[], struct nlattr *data[],
struct netlink_ext_ack *extack)
 {
+   if (!data)
+   return 0;
return -EINVAL;
 }
 
-- 
1.8.3.1



Re: [PATCH net-next] cxgb4: Add thermal zone support

2018-10-09 Thread David Miller
From: Ganesh Goudar 
Date: Tue,  9 Oct 2018 19:14:13 +0530

> Add thermal zone support to monitor ASIC's temperature.
> 
> Signed-off-by: Ganesh Goudar 

Applied.


[PATCH net-next] net/mpls: Implement handler for strict data checking on dumps

2018-10-09 Thread David Ahern
From: David Ahern 

Without CONFIG_INET enabled, builds fail with:

net/mpls/af_mpls.o: In function `mpls_dump_routes':
af_mpls.c:(.text+0xed0): undefined reference to `ip_valid_fib_dump_req'

The preference is for MPLS to use the same handler as ipv4 and ipv6
to allow consistency when doing a dump for AF_UNSPEC which walks
all address families invoking the route dump handler. If INET is
disabled then fallback to an MPLS version which can be tighter on
the data checks.

Fixes: e8ba330ac0c5 ("rtnetlink: Update fib dumps for strict data checking")
Reported-by: Randy Dunlap 
Reported-by: Arnd Bergmann 
Signed-off-by: David Ahern 
---
 net/mpls/af_mpls.c | 36 +++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 7f891c05..5fe274c47c41 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -2031,6 +2031,40 @@ static int mpls_dump_route(struct sk_buff *skb, u32 
portid, u32 seq, int event,
return -EMSGSIZE;
 }
 
+#if IS_ENABLED(CONFIG_INET)
+static int mpls_valid_fib_dump_req(const struct nlmsghdr *nlh,
+  struct netlink_ext_ack *extack)
+{
+   return ip_valid_fib_dump_req(nlh, extack);
+}
+#else
+static int mpls_valid_fib_dump_req(const struct nlmsghdr *nlh,
+  struct netlink_ext_ack *extack)
+{
+   struct rtmsg *rtm;
+
+   if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm))) {
+   NL_SET_ERR_MSG_MOD(extack, "Invalid header for FIB dump 
request");
+   return -EINVAL;
+   }
+
+   rtm = nlmsg_data(nlh);
+   if (rtm->rtm_dst_len || rtm->rtm_src_len  || rtm->rtm_tos   ||
+   rtm->rtm_table   || rtm->rtm_protocol || rtm->rtm_scope ||
+   rtm->rtm_type|| rtm->rtm_flags) {
+   NL_SET_ERR_MSG_MOD(extack, "Invalid values in header for FIB 
dump request");
+   return -EINVAL;
+   }
+
+   if (nlmsg_attrlen(nlh, sizeof(*rtm))) {
+   NL_SET_ERR_MSG_MOD(extack, "Invalid data after header in FIB 
dump request");
+   return -EINVAL;
+   }
+
+   return 0;
+}
+#endif
+
 static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb)
 {
const struct nlmsghdr *nlh = cb->nlh;
@@ -2042,7 +2076,7 @@ static int mpls_dump_routes(struct sk_buff *skb, struct 
netlink_callback *cb)
ASSERT_RTNL();
 
if (cb->strict_check) {
-   int err = ip_valid_fib_dump_req(nlh, cb->extack);
+   int err = mpls_valid_fib_dump_req(nlh, cb->extack);
 
if (err < 0)
return err;
-- 
2.11.0



Re: [PATCH bpf-next 4/6] bpf: add queue and stack maps

2018-10-09 Thread Song Liu
On Tue, Oct 9, 2018 at 6:05 AM Mauricio Vasquez
 wrote:
>
>
>
> On 10/08/2018 08:36 PM, Song Liu wrote:
> > On Mon, Oct 8, 2018 at 12:12 PM Mauricio Vasquez B
> >  wrote:
> >> Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
> >> These maps support peek, pop and push operations that are exposed to eBPF
> >> programs through the new bpf_map[peek/pop/push] helpers.  Those operations
> >> are exposed to userspace applications through the already existing
> >> syscalls in the following way:
> >>
> >> BPF_MAP_LOOKUP_ELEM-> peek
> >> BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
> >> BPF_MAP_UPDATE_ELEM-> push
> >>
> >> Queue/stack maps are implemented using a buffer, tail and head indexes,
> >> hence BPF_F_NO_PREALLOC is not supported.
> >>
> >> As opposite to other maps, queue and stack do not use RCU for protecting
> >> maps values, the bpf_map[peek/pop] have a ARG_PTR_TO_UNINIT_MAP_VALUE
> >> argument that is a pointer to a memory zone where to save the value of a
> >> map.  Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size does not
> >> have to be passed as an extra argument.
> >>
> >> Our main motivation for implementing queue/stack maps was to keep track
> >> of a pool of elements, like network ports in a SNAT; however, we foresee
> >> other use cases, like for example saving the last N kernel events in a map
> >> and then analysing them from userspace.
> >>
> >> Signed-off-by: Mauricio Vasquez B 
> >> ---
> >>   include/linux/bpf.h   |7 +
> >>   include/linux/bpf_types.h |2
> >>   include/uapi/linux/bpf.h  |   35 -
> >>   kernel/bpf/Makefile   |2
> >>   kernel/bpf/core.c |3
> >>   kernel/bpf/helpers.c  |   43 ++
> >>   kernel/bpf/queue_stack_maps.c |  288 
> >> +
> >>   kernel/bpf/syscall.c  |   30 +++-
> >>   kernel/bpf/verifier.c |   28 +++-
> >>   net/core/filter.c |6 +
> >>   10 files changed, 426 insertions(+), 18 deletions(-)
> >>   create mode 100644 kernel/bpf/queue_stack_maps.c
> >>
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index 98c7eeb6d138..cad3bc5cffd1 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> >> @@ -40,6 +40,9 @@ struct bpf_map_ops {
> >>  int (*map_update_elem)(struct bpf_map *map, void *key, void 
> >> *value, u64 flags);
> >>  int (*map_delete_elem)(struct bpf_map *map, void *key);
> >>  void *(*map_lookup_and_delete_elem)(struct bpf_map *map, void 
> >> *key);
> >> +   int (*map_push_elem)(struct bpf_map *map, void *value, u64 flags);
> >> +   int (*map_pop_elem)(struct bpf_map *map, void *value);
> >> +   int (*map_peek_elem)(struct bpf_map *map, void *value);
> >>
> >>  /* funcs called by prog_array and perf_event_array map */
> >>  void *(*map_fd_get_ptr)(struct bpf_map *map, struct file 
> >> *map_file,
> >> @@ -139,6 +142,7 @@ enum bpf_arg_type {
> >>  ARG_CONST_MAP_PTR,  /* const argument used as pointer to 
> >> bpf_map */
> >>  ARG_PTR_TO_MAP_KEY, /* pointer to stack used as map key */
> >>  ARG_PTR_TO_MAP_VALUE,   /* pointer to stack used as map value */
> >> +   ARG_PTR_TO_UNINIT_MAP_VALUE,/* pointer to valid memory used to 
> >> store a map value */
> > How about we put ARG_PTR_TO_UNINIT_MAP_VALUE and related logic to a
> > separate patch?
>
> I thought so too, but this is a really small change (6 additions, 3
> deletions). Is it worth a separate patch?

I think a separate patch is better. You can also put small changes in the
uapi header in a separate patch.

Thanks,
Song


> >
> >>  /* the following constraints used to prototype bpf_memcmp() and 
> >> other
> >>   * functions that access data on eBPF program stack
> >> @@ -825,6 +829,9 @@ static inline int 
> >> bpf_fd_reuseport_array_update_elem(struct bpf_map *map,
> >>   extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
> >>   extern const struct bpf_func_proto bpf_map_update_elem_proto;
> >>   extern const struct bpf_func_proto bpf_map_delete_elem_proto;
> >> +extern const struct bpf_func_proto bpf_map_push_elem_proto;
> >> +extern const struct bpf_func_proto bpf_map_pop_elem_proto;
> >> +extern const struct bpf_func_proto bpf_map_peek_elem_proto;
> >>
> >>   extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
> >>   extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
> >> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> >> index 658509daacd4..a2ec73aa1ec7 100644
> >> --- a/include/linux/bpf_types.h
> >> +++ b/include/linux/bpf_types.h
> >> @@ -69,3 +69,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
> >>   BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
> >>   #endif
> >>   #endif
> >> +BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
> >> +BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
> >> diff --git 
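
From user space, the mapping described in the cover letter boils down to
three familiar calls. A hedged sketch (BPF_MAP_TYPE_QUEUE and the
bpf_map_lookup_and_delete_elem() wrapper are assumed to come from this
very series; queue maps take no key, so key_size is 0 and NULL is passed
in its place):

#include <stdio.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>

int main(void)
{
	int fd = bpf_create_map(BPF_MAP_TYPE_QUEUE, 0 /* key_size */,
				sizeof(__u32), 16 /* max_entries */, 0);
	__u32 val = 42, out = 0;

	if (fd < 0)
		return 1;

	bpf_map_update_elem(fd, NULL, &val, 0);		/* push */
	bpf_map_lookup_elem(fd, NULL, &out);		/* peek */
	bpf_map_lookup_and_delete_elem(fd, NULL, &out);	/* pop  */

	printf("popped %u\n", out);
	return 0;
}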

Re: selftests/bpf: test_kmod.sh hangs on all devices

2018-10-09 Thread Naresh Kamboju
Hi Shuah,

On Mon, 8 Oct 2018 at 20:46, Shuah Khan  wrote:
>
> Hi Naresh,
>
> Please use sh...@kernel.org for faster responses. I updated MAINTAINERS
> entry a while back removing shua...@osg.samsung.com address due to IT
> infrastructure changes at Samsung.
+1

Thank you.

Best Regards
Naresh Kamboju


Re: selftests/bpf: test_kmod.sh hangs on all devices

2018-10-09 Thread Naresh Kamboju
> > > OTOH,
> > > There is a kernel BUG,
> >
> > This is quite an old linux-next kernel; it should be fixed by 100811936f89
> > ("bpf: test_bpf: add init_net to dev for flow_dissector"). Please make sure
> > you have that commit included in your testing:

This patch is included in the linux-next tree.

>
> I will re-validate on latest code base and let you know.

test_kmod.sh PASS on linux-next.

+ ./test_kmod.sh
[ JIT enabled:0 hardened:0 ]
[   25.886816] test_bpf: #0 TAX jited:0 441 435 436 PASS
[   25.889223] test_bpf: #1 TXA jited:0 144 143 143 PASS

[  105.557950] test_bpf: #377 JNE signed compare, test 7 jited:1 46 PASS
[  105.564354] test_bpf: Summary: 378 PASSED, 0 FAILED, [366/366 JIT'ed]
test_bpf: ok
[  105.564354] test_bpf: Summary: 378 PASSED, 0 FAILED

Thank you
- Naresh


Re: [PATCH V2 net 0/4] minor bug fixes for ENA Ethernet driver

2018-10-09 Thread David Miller
From: 
Date: Tue, 9 Oct 2018 11:21:26 +0300

> From: Arthur Kiyanovski 
> 
> Arthur Kiyanovski (4):
>   net: ena: fix warning in rmmod caused by double iounmap
>   net: ena: fix rare bug when failed restart/resume is followed by
> driver removal
>   net: ena: fix NULL dereference due to untimely napi initialization
>   net: ena: fix auto casting to boolean

Series applied.


Re: [PATCH net] net/sched: cls_api: add missing validation of netlink attributes

2018-10-09 Thread David Ahern
On 10/9/18 10:12 AM, Davide Caratti wrote:
>>> --- a/net/sched/cls_api.c
>>> +++ b/net/sched/cls_api.c
>>> @@ -37,6 +37,11 @@ static LIST_HEAD(tcf_proto_base);
>>>  /* Protects list of registered TC modules. It is pure SMP lock. */
>>>  static DEFINE_RWLOCK(cls_mod_lock);
>>>  
>>> +const struct nla_policy cls_tca_policy[TCA_MAX + 1] = {
>>> +   [TCA_KIND]  = { .type = NLA_STRING },
>>> +   [TCA_CHAIN] = { .type = NLA_U32 },
>>> +};
>>> +
>>
> 
>> it'd be nice to have a tc_common module so this stuff does not have to be
>> defined multiple times.
> 
> it makes sense to avoid duplicating the declaration of that array. But I
> don't think we can put it in a module, because CONFIG_NET_SCHED is 'bool'
> and
> 
> obj-$(CONFIG_NET_SCHED) += sch_api.o
> 
> I can try a v2 where the 'rtm_tca_policy' symbol in sch_api is exported and
> used by the cls_api.c code. WDYT?

since NET_SCHED is a bool, that should work.
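
For reference, a minimal sketch of what that v2 could look like; the file placement, header choice and call site are assumptions based on this discussion, not the final patch:

/* net/sched/sch_api.c -- single shared definition (sketch) */
const struct nla_policy rtm_tca_policy[TCA_MAX + 1] = {
	[TCA_KIND]	= { .type = NLA_STRING },
	[TCA_CHAIN]	= { .type = NLA_U32 },
};

/* shared declaration, e.g. in include/net/pkt_sched.h (assumed) */
extern const struct nla_policy rtm_tca_policy[TCA_MAX + 1];

/* net/sched/cls_api.c -- reuse the shared policy when parsing; n, t,
 * tca and extack stand for the usual locals at the parse site */
err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, rtm_tca_policy, extack);

Since both files are built in when CONFIG_NET_SCHED=y, a plain extern is enough and no EXPORT_SYMBOL is required.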


[sky2 driver] 88E8056 PCI-E Gigabit Ethernet Controller not working after suspend

2018-10-09 Thread Laurent Bigonville

Hello,

On my desktop (an Asus motherboard with dual Ethernet ports), the network
card does not detect the link when waking up after suspend.

I have to rmmod the sky2 driver and then modprobe it again.

lspci shows me:

04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E 
Gigabit Ethernet Controller (rev 12)
05:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E 
Gigabit Ethernet Controller (rev 12)


Any idea what's wrong here?

Kind regards,

Laurent Bigonville



Re: BBR and TCP internal pacing causing interrupt storm with pfifo_fast

2018-10-09 Thread Eric Dumazet
On Tue, Oct 9, 2018 at 10:22 AM Gasper Zejn  wrote:
>
> On 09. 10. 2018 19:00, Eric Dumazet wrote:
> >
> > On 10/09/2018 09:38 AM, Gasper Zejn wrote:
> >> Hello,
> >>
> >> I am seeing interrupt storms of over 100k-900k local timer interrupts
> >> when changing between network devices or networks with open TCP
> >> connections when not using sch_fq (I was using pfifo_fast). Using sch_fq
> >> makes the bug with interrupt storm go away.
> >>
> > That is for what kind of traffic ?
> >
> > If your TCP flows send 100k-3M packets per second, then yes, the pacing 
> > timers
> > could be setup in the 100k-900k range.
> >
> Traffic is nowhere near that range; think of having a few browser tabs of
> JavaScript-rich web pages open, mostly idle, for example Slack, Gmail or
> TweetDeck. No significant packet rate is needed, just open connections.

No idea what is going on, really. A repro would be nice.


Re: BBR and TCP internal pacing causing interrupt storm with pfifo_fast

2018-10-09 Thread Gasper Zejn
On 09. 10. 2018 19:00, Eric Dumazet wrote:
>
> On 10/09/2018 09:38 AM, Gasper Zejn wrote:
>> Hello,
>>
>> I am seeing interrupt storms of over 100k-900k local timer interrupts
>> when changing between network devices or networks with open TCP
>> connections when not using sch_fq (I was using pfifo_fast). Using sch_fq
>> makes the bug with interrupt storm go away.
>>
> That is for what kind of traffic ?
>
> If your TCP flows send 100k-3M packets per second, then yes, the pacing timers
> could be setup in the 100k-900k range.
>
Traffic is nowhere near that range; think of having a few browser tabs of
JavaScript-rich web pages open, mostly idle, for example Slack, Gmail or
TweetDeck. No significant packet rate is needed, just open connections.

>> The interrupts all called tcp_pace_kick (according to perf), which seems
>> to return HRTIMER_NORESTART, but apparently something elsewhere calls
>> another function that does restart the timer.
>>
>> The bug is fairly easy to reproduce. Congestion control needs to be BBR,
>> the network scheduler was pfifo_fast, and there need to be open TCP
>> connections when changing networks in such a way that the TCP connections
>> cannot continue to work (e.g. different client IP addresses). The more
>> connections, the more interrupts. The connection handling code will cause
>> an interrupt storm, which eventually settles down as the connections time
>> out. It is a bit annoying, as the high interrupt rate does not show up as
>> load. I successfully reproduced this with 4.18.12, but this has been
>> happening for some time, with previous kernel versions too.
>>
>>
>> I'd like to thank you for the comment regarding use of sch_fq with BBR
>> above the tcp_needs_internal_pacing function. It has pointed me in the
>> direction to find the workaround.
>>
> Well, BBR has been very clear about sch_fq being the best packet scheduler
>
> net/ipv4/tcp_bbr.c currently says :
>
> /* ...
>  *
>  * NOTE: BBR might be used with the fq qdisc ("man tc-fq") with pacing 
> enabled,
>  * otherwise TCP stack falls back to an internal pacing using one high
>  * resolution timer per TCP socket and may use more resources.
>  */
>
I am not disputing that FQ is the best packet scheduler; it does seem,
however, that some effort has been made to make BBR work without FQ too.
Using more resources in that case is perfectly fine. But going from roughly
a thousand interrupts to a few hundred thousand interrupts (and in the
process consuming most of the CPU) seems to indicate that a corner case was
somehow hit, as this happens the moment the network is changed and not
before.





Re: BBR and TCP internal pacing causing interrupt storm with pfifo_fast

2018-10-09 Thread Eric Dumazet



On 10/09/2018 09:38 AM, Gasper Zejn wrote:
> Hello,
> 
> I am seeing interrupt storms of over 100k-900k local timer interrupts
> when changing between network devices or networks with open TCP
> connections when not using sch_fq (I was using pfifo_fast). Using sch_fq
> makes the bug with interrupt storm go away.
> 

That is for what kind of traffic ?

If your TCP flows send 100k-3M packets per second, then yes, the pacing timers
could be setup in the 100k-900k range.

> The interrupts all called tcp_pace_kick (according to perf), which seems
> to return HRTIMER_NORESTART, but apparently something elsewhere calls
> another function that does restart the timer.
>
> The bug is fairly easy to reproduce. Congestion control needs to be BBR,
> the network scheduler was pfifo_fast, and there need to be open TCP
> connections when changing networks in such a way that the TCP connections
> cannot continue to work (e.g. different client IP addresses). The more
> connections, the more interrupts. The connection handling code will cause
> an interrupt storm, which eventually settles down as the connections time
> out. It is a bit annoying, as the high interrupt rate does not show up as
> load. I successfully reproduced this with 4.18.12, but this has been
> happening for some time, with previous kernel versions too.
> 
> 
> I'd like to thank you for the comment regarding use of sch_fq with BBR
> above the tcp_needs_internal_pacing function. It has pointed me in the
> direction to find the workaround.
>

Well, BBR has been very clear about sch_fq being the best packet scheduler

net/ipv4/tcp_bbr.c currently says :

/* ...
 *
 * NOTE: BBR might be used with the fq qdisc ("man tc-fq") with pacing enabled,
 * otherwise TCP stack falls back to an internal pacing using one high
 * resolution timer per TCP socket and may use more resources.
 */





[PATCH net] net/xfrm: fix out-of-bounds packet access

2018-10-09 Thread Alexei Starovoitov
BUG: KASAN: slab-out-of-bounds in _decode_session6+0x1331/0x14e0
net/ipv6/xfrm6_policy.c:161
Read of size 1 at addr 8801d882eec7 by task syz-executor1/6667
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
  print_address_description+0x6c/0x20b mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412
  __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
  _decode_session6+0x1331/0x14e0 net/ipv6/xfrm6_policy.c:161
  __xfrm_decode_session+0x71/0x140 net/xfrm/xfrm_policy.c:2299
  xfrm_decode_session include/net/xfrm.h:1232 [inline]
  vti6_tnl_xmit+0x3c3/0x1bc1 net/ipv6/ip6_vti.c:542
  __netdev_start_xmit include/linux/netdevice.h:4313 [inline]
  netdev_start_xmit include/linux/netdevice.h:4322 [inline]
  xmit_one net/core/dev.c:3217 [inline]
  dev_hard_start_xmit+0x272/0xc10 net/core/dev.c:3233
  __dev_queue_xmit+0x2ab2/0x3870 net/core/dev.c:3803
  dev_queue_xmit+0x17/0x20 net/core/dev.c:3836

Reported-by: syzbot+acffccec848dc13fe...@syzkaller.appspotmail.com
Reported-by: Eric Dumazet 
Signed-off-by: Alexei Starovoitov 
---
 net/ipv6/xfrm6_policy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index ef3defaf43b9..d35bcf92969c 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -146,8 +146,8 @@ _decode_session6(struct sk_buff *skb, struct flowi *fl, int reverse)
fl6->daddr = reverse ? hdr->saddr : hdr->daddr;
fl6->saddr = reverse ? hdr->daddr : hdr->saddr;
 
-   while (nh + offset + 1 < skb->data ||
-  pskb_may_pull(skb, nh + offset + 1 - skb->data)) {
+   while (nh + offset + sizeof(*exthdr) < skb->data ||
+  pskb_may_pull(skb, nh + offset + sizeof(*exthdr) - skb->data)) {
nh = skb_network_header(skb);
exthdr = (struct ipv6_opt_hdr *)(nh + offset);
 
-- 
2.17.1
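
The out-of-bounds read is easier to see next to the extension header layout. A short editorial sketch of the reasoning (struct layout as in include/uapi/linux/ipv6.h; the comment is a reading of the loop above, not part of the patch):

/* Every IPv6 extension header starts with these two bytes: */
struct ipv6_opt_hdr {
	__u8 nexthdr;	/* type of the following header */
	__u8 hdrlen;	/* length in 8-octet units, excluding the first 8 */
};

/*
 * The old condition only guaranteed that offset + 1 bytes past the
 * network header were in the linear area, i.e. that exthdr->nexthdr
 * was readable. The loop body also reads exthdr->hdrlen, one byte
 * further, which is the 1-byte out-of-bounds read KASAN flagged when
 * the header sat exactly at the end of the pulled data. Requiring
 * sizeof(*exthdr) == 2 bytes makes both fields readable before the
 * dereference.
 */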





BBR and TCP internal pacing causing interrupt storm with pfifo_fast

2018-10-09 Thread Gasper Zejn
Hello,

I am seeing interrupt storms of over 100k-900k local timer interrupts
when changing between network devices or networks with open TCP
connections when not using sch_fq (I was using pfifo_fast). Using sch_fq
makes the bug with interrupt storm go away.

The interrupts all called tcp_pace_kick (according to perf), which seems
to return HRTIMER_NORESTART, but apparently something elsewhere calls
another function that does restart the timer.

The bug is fairly easy to reproduce. Congestion control needs to be BBR,
the network scheduler was pfifo_fast, and there need to be open TCP
connections when changing networks in such a way that the TCP connections
cannot continue to work (e.g. different client IP addresses). The more
connections, the more interrupts. The connection handling code will cause
an interrupt storm, which eventually settles down as the connections time
out. It is a bit annoying, as the high interrupt rate does not show up as
load. I successfully reproduced this with 4.18.12, but this has been
happening for some time, with previous kernel versions too.


I'd like to thank you for the comment regarding use of sch_fq with BBR
above the tcp_needs_internal_pacing function. It has pointed me in the
direction to find the workaround.


Kind regards,

Gasper Zejn
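
A note on the timer API mentioned above: returning HRTIMER_NORESTART only tells the hrtimer core not to re-arm the timer from that callback invocation; any other code path can still re-arm it with hrtimer_start(). A minimal sketch of the pattern (names are illustrative, not taken from the TCP pacing code):

#include <linux/hrtimer.h>
#include <linux/ktime.h>

static struct hrtimer pacing_timer;	/* illustrative */

static enum hrtimer_restart pace_cb(struct hrtimer *t)
{
	/* do the work, then decline to re-arm from the callback */
	return HRTIMER_NORESTART;
}

static void pacing_setup(void)
{
	hrtimer_init(&pacing_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	pacing_timer.function = pace_cb;
}

/* ...but if some other path keeps calling this with short expiries,
 * the result looks like an interrupt storm even though the callback
 * itself returns HRTIMER_NORESTART: */
static void rearm_soon(void)
{
	hrtimer_start(&pacing_timer, ns_to_ktime(1000), HRTIMER_MODE_REL);
}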


