[PATCH net-next] bridge: ebtables: Avoid resetting limit rule state

2017-11-24 Thread Linus Lüssing
So far, any change made with ebtables resets the state of limit rules,
leading to spikes in traffic. This is especially noticeable if changes
are made frequently, for instance by a daemon.

This patch fixes this by bailing out from (re)setting if the limit
rule was initialized before.
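
The effect can be reproduced with a small userspace model. This is only
a sketch under simplifying assumptions, not the kernel code: time and
credit are counted in milliseconds rather than in jiffies-based credits,
a table reload every 30 s is modelled as one call to the check function,
and the names limit_state, limit_check(), limit_match() and simulate()
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified token bucket; times and credits are in milliseconds. */
struct limit_state {
	long prev;        /* time of last update, 0 means "not set up yet" */
	long credit;      /* credit currently available */
	long credit_cap;  /* credit for a full burst */
	long cost;        /* credit consumed per packet */
};

/* Same idea as the patch: leave the state alone if it is already set up. */
static void limit_check(struct limit_state *s, long now, bool keep_state)
{
	if (keep_state && s->prev)
		return;
	s->cost = 1000;                /* --limit 1/sec */
	s->credit_cap = 50 * s->cost;  /* --limit-burst 50 */
	s->credit = s->credit_cap;
	s->prev = now;
}

static bool limit_match(struct limit_state *s, long now)
{
	assert(now >= s->prev);        /* the simulated clock never runs backwards */
	s->credit += now - s->prev;
	s->prev = now;
	if (s->credit > s->credit_cap)
		s->credit = s->credit_cap;
	if (s->credit < s->cost)
		return false;
	s->credit -= s->cost;
	return true;
}

/* One packet every 250 ms for 600 s, with a table reload every 30 s. */
static int simulate(bool keep_state)
{
	struct limit_state s = {0};
	const long start = 1000;       /* keep timestamps non-zero */
	int passed = 0;

	for (long t = 0; t < 600000; t += 250) {
		if (t % 30000 == 0)
			limit_check(&s, start + t, keep_state);
		if (limit_match(&s, start + t))
			passed++;
	}
	return passed;
}
```

In this model simulate(false) refills the burst on every reload and
passes roughly 1600 packets, while simulate(true) passes roughly 650.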

When sending packets every 250ms for 600s, with a
"--limit 1/sec --limit-burst 50" rule and a command like this
in the background:

$ ebtables -N VOIDCHAIN
$ while true; do ebtables -F VOIDCHAIN; sleep 30; done

The results are:

Before: ~1600 packets
After: 650 packets

Signed-off-by: Linus Lüssing 
---
 net/bridge/netfilter/ebt_limit.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/bridge/netfilter/ebt_limit.c b/net/bridge/netfilter/ebt_limit.c
index 61a9f1be1263..f74b48633feb 100644
--- a/net/bridge/netfilter/ebt_limit.c
+++ b/net/bridge/netfilter/ebt_limit.c
@@ -69,6 +69,10 @@ static int ebt_limit_mt_check(const struct xt_mtchk_param *par)
 {
struct ebt_limit_info *info = par->matchinfo;
 
+   /* Do not reset state on unrelated table changes */
+   if (info->prev)
+   return 0;
+
/* Check for overflow. */
if (info->burst == 0 ||
user2credits(info->avg * info->burst) < user2credits(info->avg)) {
-- 
2.11.0



Re: [PATCH] uapi: add SPDX identifier to vm_sockets_diag.h

2017-11-24 Thread Stefan Hajnoczi
On Fri, Nov 24, 2017 at 8:08 PM, Stephen Hemminger
 wrote:
> New file seems to have missed the SPDX license scan and update.
>
> Signed-off-by: Stephen Hemminger 
> ---
>  include/uapi/linux/vm_sockets_diag.h | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Stefan Hajnoczi 


Re: [RFC net-next 0/6] xdp: make stack perform remove and tests

2017-11-24 Thread Jakub Kicinski
On Fri, 24 Nov 2017 00:02:32 -0800, Jakub Kicinski wrote:
> >>Something I'm still battling with, and would appreciate help of
> >>wiser people is that occasionally during the test something makes
> >>the refcount of init_net drop to 0 :S  I tried to create a simple
> >>reproducer, but seems like just running the script in the loop is
> >>the easiest way to go...  Could it have something to do with the
> >>recent TC work?  The driver is pretty simple and never touches  
> >
> > I don't see how...  
> 
> To be clear I meant the changes made to destruction of filters, not
> your work. The BPF code doesn't touch ref counts and cls exts do seem
> to hold a ref on the net...  but perhaps that's just pointing the
> finger unnecessarily :)  I will try to investigate again tomorrow.

Looks like I was lazy when adding the offload and just called
__cls_bpf_delete_prog() instead of extending the error path.  
Cong missed this extra call in aae2c35ec892 ("cls_bpf: use
tcf_exts_get_net() before call_rcu()").  We need something like 
this:

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index a9f3e317055c..40d4289aea28 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -514,12 +514,8 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
goto errout_idr;
 
ret = cls_bpf_offload(tp, prog, oldprog);
-   if (ret) {
-   if (!oldprog)
-   idr_remove_ext(>handle_idr, prog->handle);
-   __cls_bpf_delete_prog(prog);
-   return ret;
-   }
+   if (ret)
+   goto errout_parms;
 
if (!tc_in_hw(prog->gen_flags))
prog->gen_flags |= TCA_CLS_FLAGS_NOT_IN_HW;
@@ -537,6 +533,13 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
*arg = prog;
return 0;
 
+errout_parms:
+   if (cls_bpf_is_ebpf(prog))
+   bpf_prog_put(prog->filter);
+   else
+   bpf_prog_destroy(prog->filter);
+   kfree(prog->bpf_name);
+   kfree(prog->bpf_ops);
 errout_idr:
if (!oldprog)
idr_remove_ext(>handle_idr, prog->handle);
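
For readers less familiar with the pattern, the point of extending the
error path is that each label undoes exactly the steps that completed
before the failure, in reverse order. A minimal userspace sketch of that
idea follows; struct prog, demo_alloc(), demo_free() and prog_setup()
are hypothetical stand-ins and not the cls_bpf code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object that owns two allocations. */
struct prog {
	char *name;
	char *ops;
};

static int live_allocs;  /* outstanding allocations, for the demo only */

static void *demo_alloc(size_t n)
{
	void *p = malloc(n);

	if (p)
		live_allocs++;
	return p;
}

static void demo_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/* offload_fails simulates an error in the last setup step. */
static int prog_setup(struct prog *p, int offload_fails)
{
	int ret;

	assert(p != NULL);

	p->name = demo_alloc(16);
	if (!p->name)
		return -ENOMEM;

	p->ops = demo_alloc(64);
	if (!p->ops) {
		ret = -ENOMEM;
		goto err_name;
	}

	if (offload_fails) {
		ret = -EINVAL;
		goto err_ops;
	}
	return 0;

	/* Labels fall through, so later failures release more. */
err_ops:
	demo_free(p->ops);
	p->ops = NULL;
err_name:
	demo_free(p->name);
	p->name = NULL;
	return ret;
}
```

A failing prog_setup() leaves live_allocs unchanged, which is the
property a shared error path is meant to guarantee.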



Re: [RFC net-next 3/6] net: xdp: make the stack take care of the tear down

2017-11-24 Thread Jakub Kicinski
On Sat, 25 Nov 2017 00:24:50 +0100, Daniel Borkmann wrote:
> > +static void dev_xdp_uninstall(struct net_device *dev)
> > +{
> > +   struct netdev_bpf xdp;
> > +   bpf_op_t ndo_bpf;  
> 
> Can you add a comment here stating that generic XDP does not
> need to be handled since we drop the prog from free_netdev()?
> Potentially we could also drop the generic one from here, that
> way we'd make no difference and have a dev_xdp_install() and
> one dev_xdp_uninstall() for all kinds of attach types. Given
> generic XDP should simulate native XDP anyway, probably better
> to just do that.

I will move the freeing of generic XDP here and add a simple test to
the last patch.  Thanks!

> > +   ndo_bpf = dev->netdev_ops->ndo_bpf;
> > +   if (!ndo_bpf)
> > +   return;
> > +
> > +   __dev_xdp_query(dev, ndo_bpf, &xdp);
> > +   if (xdp.prog_attached == XDP_ATTACHED_NONE)
> > +   return;
> > +
> > +   /* Program removal should always succeed */
> > +   WARN_ON(dev_xdp_install(dev, ndo_bpf, NULL, xdp.prog_flags, NULL));
> > +}


Re: [RFC net-next 3/6] net: xdp: make the stack take care of the tear down

2017-11-24 Thread Daniel Borkmann
On 11/24/2017 03:36 AM, Jakub Kicinski wrote:
> Since day one of XDP, drivers have had to remember to free the program
> on the remove path.  This leads to code duplication and is error
> prone.  Make the stack query the installed programs on unregister
> and if something is installed, remove the program.
> 
> Because the remove will now be called before notifiers are
> invoked, BPF offload state of the program will not get destroyed
> before uninstall.
> 
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Simon Horman 
[...]

Nice work, the series looks good to me! Just one really minor
comment below:

> diff --git a/net/core/dev.c b/net/core/dev.c
> index 3f271c9cb5e0..a3e932f98419 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -7110,6 +7110,23 @@ static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
>   return bpf_op(dev, &xdp);
>  }
>  
> +static void dev_xdp_uninstall(struct net_device *dev)
> +{
> + struct netdev_bpf xdp;
> + bpf_op_t ndo_bpf;

Can you add a comment here stating that generic XDP does not
need to be handled since we drop the prog from free_netdev()?
Potentially we could also drop the generic one from here, that
way we'd make no difference and have a dev_xdp_install() and
one dev_xdp_uninstall() for all kinds of attach types. Given
generic XDP should simulate native XDP anyway, probably better
to just do that.

> + ndo_bpf = dev->netdev_ops->ndo_bpf;
> + if (!ndo_bpf)
> + return;
> +
> + __dev_xdp_query(dev, ndo_bpf, &xdp);
> + if (xdp.prog_attached == XDP_ATTACHED_NONE)
> + return;
> +
> + /* Program removal should always succeed */
> + WARN_ON(dev_xdp_install(dev, ndo_bpf, NULL, xdp.prog_flags, NULL));
> +}
> +
>  /**
>   *   dev_change_xdp_fd - set or clear a bpf program for a device rx path
>   *   @dev: device
> @@ -7240,6 +7257,7 @@ static void rollback_registered_many(struct list_head *head)
>   /* Shutdown queueing discipline. */
>   dev_shutdown(dev);
>  
> + dev_xdp_uninstall(dev);
>  
>   /* Notify protocols, that we are about to destroy
>* this device. They should clean all the things.
> 

Thanks,
Daniel
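
To illustrate the suggestion of handling generic and native XDP in one
uninstall helper, here is a toy model. It is a sketch only: struct
mock_dev, mock_driver_install() and mock_xdp_uninstall() are
hypothetical and do not reflect the real net_device layout or the
ndo_bpf calling convention.

```c
#include <assert.h>
#include <stddef.h>

struct mock_dev;
typedef int (*mock_install_t)(struct mock_dev *dev, const void *prog);

struct mock_dev {
	mock_install_t ndo_bpf;    /* NULL when the driver has no native XDP */
	const void *native_prog;   /* program installed through the driver */
	const void *generic_prog;  /* program run by the generic hook */
};

/* Toy driver callback: installing NULL removes the program. */
static int mock_driver_install(struct mock_dev *dev, const void *prog)
{
	dev->native_prog = prog;
	return 0;
}

/* Drops whatever is attached; returns the number of programs removed. */
static int mock_xdp_uninstall(struct mock_dev *dev)
{
	int removed = 0;

	assert(dev != NULL);

	/* Generic XDP needs no driver support, so handle it first. */
	if (dev->generic_prog) {
		dev->generic_prog = NULL;
		removed++;
	}

	if (!dev->ndo_bpf)
		return removed;

	/* Query step: nothing to do if no native program is attached. */
	if (!dev->native_prog)
		return removed;

	/* Removal is expected to always succeed. */
	if (dev->ndo_bpf(dev, NULL) == 0)
		removed++;

	return removed;
}
```

The single entry point means a caller on the unregister path does not
need to know which attach type was used.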


Re: [PATCH net] net: dsa: fix 'increment on 0' warning

2017-11-24 Thread Florian Fainelli


On 11/24/2017 08:36 AM, Vivien Didelot wrote:
> Setting the refcount to 0 when allocating a tree to match the number of
> switch devices it holds may cause an 'increment on 0; use-after-free',
> if CONFIG_REFCOUNT_FULL is enabled.
> 
> To fix this, do not decrement the refcount of a newly allocated tree,
> increment it when an already allocated tree is found, and decrement it
> after the probing of a switch, as done with the previous behavior.
> 
> At the same time, make dsa_tree_get and dsa_tree_put accept a NULL
> argument to simplify callers, and return the tree after incrementation,
> as most kref users like of_node_get and of_node_put do.
> 
> Fixes: 8e5bf9759a06 ("net: dsa: simplify tree reference counting")
> Signed-off-by: Vivien Didelot 

Reviewed-by: Florian Fainelli 
Tested-by: Florian Fainelli 

Thanks!
-- 
Florian


Re: [PATCH iproute2/net-next v3]tc: B.W limits can now be specified in %.

2017-11-24 Thread Nishanth Devarajan
On Fri, Nov 24, 2017 at 11:25:28AM -0800, Stephen Hemminger wrote:
> On Sat, 18 Nov 2017 02:13:38 +0530
> Nishanth Devarajan  wrote:
> 
> > This patch adapts the tc command line interface to allow bandwidth limits
> > to be specified as a percentage of the interface's capacity.
> > 
> > Adding this functionality requires passing the specified device string to
> > each class/qdisc which changes the prototype for a couple of functions: the
> > .parse_qopt and .parse_copt interfaces. The device string is a required
> > parameter for tc-qdisc and tc-class, and when not specified, the kernel
> > returns ENODEV. In this patch, if the user tries to specify a bandwidth
> > percentage without naming the device, we return an error from userspace.
> > 
> > v2:
> > * Modified and moved int read_prop() from ip/iptuntap.c to lib/utils.c,
> > to make it accessible to tc. 
> > 
> > v3:
> > * Modified and moved int parse_percent() from tc/q_netem.c to lib/utils.c for
> > use in tc.
> > 
> > * Changed couple variable names in int parse_percent_rate().
> > 
> > * Handled showing error message when device speed is unknown.
> > 
> > * Updated man page to warn users that when specifying rates in %, tc only
> > uses the current device speed and does not recalculate if it changes after.
> > 
> > During cases when properties (like device speed) are unknown, read_prop()
> > assumes that if the property file can be opened but not read, it means
> > that the property is unknown.
> > 
> > Signed-off by: Nishanth Devarajan
> > 
> 
> Applied, but there were three things that I needed to change:
>   1. The DCO tag is "Signed-off-by" not "Signed-off by"
>   2. The revision history should be below the cut line --- in the mail message
>  so that it doesn't end up in the commit message.
>   3. The qopt function declarations now are a really long line.
>  I will break them up.
>

Thanks for the help, will do. I'll keep the feedback in mind for
future patches.

-Nishanth


[PATCH iproute2] SPDX license identifiers

2017-11-24 Thread Stephen Hemminger
For all files in iproute2 which do not already have an obvious license
identification, mark them with GPL-2.

If any of the original authors want a more permissive license
than that, please let me know.

Signed-off-by: Stephen Hemminger 
---
 Makefile | 1 +
 bridge/Makefile  | 1 +
 bridge/br_common.h   | 2 ++
 bridge/bridge.c  | 1 +
 bridge/fdb.c | 1 +
 bridge/link.c| 1 +
 bridge/mdb.c | 1 +
 bridge/vlan.c| 1 +
 configure| 1 +
 devlink/Makefile | 1 +
 examples/bpf/bpf_tailcall.c  | 1 +
 genl/Makefile| 1 +
 genl/genl_utils.h| 1 +
 genl/static-syms.c   | 1 +
 include/bpf_api.h| 1 +
 include/bpf_elf.h| 1 +
 include/bpf_scm.h| 1 +
 include/color.h  | 1 +
 include/dlfcn.h  | 1 +
 include/ip6tables.h  | 1 +
 include/iptables.h   | 1 +
 include/iptables/internal.h  | 1 +
 include/libgenl.h| 1 +
 include/libiptc/ipt_kernel_headers.h | 1 +
 include/libiptc/libip6tc.h   | 1 +
 include/libiptc/libiptc.h| 1 +
 include/libiptc/libxtc.h | 1 +
 include/libiptc/xtcshared.h  | 1 +
 include/libnetlink.h | 1 +
 include/list.h   | 1 +
 include/ll_map.h | 1 +
 include/names.h  | 1 +
 include/namespace.h  | 1 +
 include/rt_names.h   | 1 +
 include/rtm_map.h| 1 +
 include/utils.h  | 1 +
 include/xt-internal.h| 1 +
 include/xtables.h| 1 +
 ip/Makefile  | 1 +
 ip/ifcfg | 1 +
 ip/ila_common.h  | 1 +
 ip/ip_common.h   | 1 +
 ip/iplink_dummy.c| 1 +
 ip/iplink_ifb.c  | 1 +
 ip/iplink_nlmon.c| 1 +
 ip/iplink_team.c | 1 +
 ip/iplink_vcan.c | 1 +
 ip/ipnetns.c | 1 +
 ip/iproute_lwtunnel.h| 1 +
 ip/routef| 1 +
 ip/routel| 2 +-
 ip/rtpr  | 1 +
 ip/static-syms.c | 1 +
 ip/xdp.h | 1 +
 lib/Makefile | 1 +
 lib/color.c  | 1 +
 lib/dnet_ntop.c  | 1 +
 lib/dnet_pton.c  | 1 +
 lib/exec.c   | 1 +
 lib/ipx_ntop.c   | 1 +
 lib/ipx_pton.c   | 1 +
 lib/libgenl.c| 1 +
 lib/mpls_ntop.c  | 2 ++
 lib/mpls_pton.c  | 2 ++
 man/Makefile | 1 +
 man/man3/Makefile| 1 +
 man/man7/Makefile| 1 +
 man/man8/Makefile| 1 +
 misc/Makefile| 1 +
 misc/lnstat.h| 1 +
 misc/ssfilter.h  | 1 +
 netem/Makefile   | 1 +
 rdma/Makefile| 1 +
 tc/Makefile  | 1 +
 tc/emp_ematch.l  | 1 +
 tc/f_tcindex.c   | 1 +
 tc/m_ematch.h| 1 +
 tc/q_atm.c   | 1 +
 tc/q_clsact.c| 1 +
 tc/q_dsmark.c| 1 +
 tc/q_hhf.c   | 1 +
 tc/static-syms.c | 1 +
 tc/tc_cbq.h  | 1 +
 tc/tc_common.h   | 1 +
 tc/tc_core.h | 1 +
 tc/tc_red.h  | 1 +
 tc/tc_util.h | 1 +
 testsuite/Makefile   | 1 +
 testsuite/iproute2/Makefile  | 1 +
 testsuite/tools/Makefile | 1 +
 tipc/Makefile| 1 +
 91 files changed, 94 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 6ad961043052..6a51e0db9107 100644
--- a/Makefile
+++ b/Makefile
@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
 # Top level Makefile for iproute2
 
 ifeq ($(VERBOSE),0)
diff --git a/bridge/Makefile b/bridge/Makefile
index b2ae0a4ed04d..c6b7d08dade4 100644
--- a/bridge/Makefile
+++ b/bridge/Makefile
@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
 BROBJ = bridge.o fdb.o monitor.o link.o mdb.o vlan.o
 
 include ../config.mk
diff --git a/bridge/br_common.h b/bridge/br_common.h
index 01447ddca337..f07c7d1c9090 100644
--- a/bridge/br_common.h
+++ b/bridge/br_common.h
@@ -1,3 +1,5 @@
+/* SPDX-License-Identifier: 

Re: [PATCH 1/3] net: core: export dev_alloc_name_ns

2017-11-24 Thread David Miller
From: Rasmus Villemoes 
Date: Tue, 21 Nov 2017 01:34:37 +0100

> dev_alloc_name_ns and dev_get_valid_name now do exactly the same
> thing. Let's expose this functionality as dev_alloc_name_ns
> (obviously, a core function like this won't return an invalid
> name...).
> 
> Signed-off-by: Rasmus Villemoes 

If you're going to keep one of the routines, keep the one with
the simpler and smaller name, "dev_get_valid_name".


Re: [PATCHv2 net-next 1/1] forcedeth: replace pci_unmap_page with dma_unmap_page

2017-11-24 Thread David Miller
From: Zhu Yanjun 
Date: Sun, 19 Nov 2017 22:21:08 -0500

> The function pci_unmap_page is obsolete. So it is replaced with
> the function dma_unmap_page.
> 
> CC: Srinivas Eeda 
> CC: Joe Jin 
> CC: Junxiao Bi 
> Signed-off-by: Zhu Yanjun 
> ---
> V1->V2: fix direction flag error.

Applied, thank you.


[PATCH] uapi: add SPDX identifier to vm_sockets_diag.h

2017-11-24 Thread Stephen Hemminger
New file seems to have missed the SPDX license scan and update.

Signed-off-by: Stephen Hemminger 
---
 include/uapi/linux/vm_sockets_diag.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/vm_sockets_diag.h b/include/uapi/linux/vm_sockets_diag.h
index 14cd7dc5a187..0b4dd54f3d1e 100644
--- a/include/uapi/linux/vm_sockets_diag.h
+++ b/include/uapi/linux/vm_sockets_diag.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
 /* AF_VSOCK sock_diag(7) interface for querying open sockets */
 
 #ifndef _UAPI__VM_SOCKETS_DIAG_H__
-- 
2.11.0



Re: [PATCH net-next 00/12] rxrpc: Fixes and improvements

2017-11-24 Thread David Miller
From: David Howells 
Date: Fri, 24 Nov 2017 14:37:39 +

> Is it too late for this to go to Linus in this merge window?

These look predominantly like fixes so I'll pull this in.

Thanks.


Re: [PATCH iproute2/net-next v3]tc: B.W limits can now be specified in %.

2017-11-24 Thread Stephen Hemminger
On Sat, 18 Nov 2017 02:13:38 +0530
Nishanth Devarajan  wrote:

> This patch adapts the tc command line interface to allow bandwidth limits
> to be specified as a percentage of the interface's capacity.
> 
> Adding this functionality requires passing the specified device string to
> each class/qdisc which changes the prototype for a couple of functions: the
> .parse_qopt and .parse_copt interfaces. The device string is a required
> parameter for tc-qdisc and tc-class, and when not specified, the kernel
> returns ENODEV. In this patch, if the user tries to specify a bandwidth
> percentage without naming the device, we return an error from userspace.
> 
> v2:
> * Modified and moved int read_prop() from ip/iptuntap.c to lib/utils.c,
> to make it accessible to tc. 
> 
> v3:
> * Modified and moved int parse_percent() from tc/q_netem.c to lib/utils.c for
> use in tc.
> 
> * Changed couple variable names in int parse_percent_rate().
> 
> * Handled showing error message when device speed is unknown.
> 
> * Updated man page to warn users that when specifying rates in %, tc only
> uses the current device speed and does not recalculate if it changes after.
> 
> During cases when properties (like device speed) are unknown, read_prop()
> assumes that if the property file can be opened but not read, it means
> that the property is unknown.
> 
> Signed-off by: Nishanth Devarajan
> 

Applied, but there were three things that I needed to change:
  1. The DCO tag is "Signed-off-by" not "Signed-off by"
  2. The revision history should be below the cut line --- in the mail message
 so that it doesn't end up in the commit message.
  3. The qopt function declarations now are a really long line.
 I will break them up.
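
For context, the conversion the patch description talks about boils down
to scaling the device speed by the given percentage. The following is a
rough sketch, not the iproute2 implementation: percent_to_rate() is a
hypothetical name, the speed is assumed to be given in Mbit/s (as sysfs
reports it), and a non-positive speed is taken to mean "unknown".

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: turn a string such as "50%" plus a link speed in
 * Mbit/s into a rate in bytes per second.
 * Returns 0 on success and -EINVAL on bad input or unknown speed.
 */
static int percent_to_rate(const char *str, long speed_mbit,
			   unsigned long long *rate)
{
	char *end;
	double pct;

	assert(str != NULL && rate != NULL);

	if (speed_mbit <= 0)  /* device speed unknown */
		return -EINVAL;

	pct = strtod(str, &end);
	if (end == str || end[0] != '%' || end[1] != '\0')
		return -EINVAL;

	/* Written this way so that NaN is rejected as well. */
	if (!(pct >= 0.0 && pct <= 100.0))
		return -EINVAL;

	*rate = (unsigned long long)(pct / 100.0 * speed_mbit * 1000000.0 / 8.0);
	return 0;
}
```

With a 1000 Mbit/s device, "50%" yields 62500000 bytes per second. The
value is computed once from the speed passed in, which matches the man
page warning that the rate is not recalculated if the speed changes.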



Re: [PATCH] net-sysfs: export gso_max_size attribute

2017-11-24 Thread Eric Dumazet
On Fri, 2017-11-24 at 11:43 -0700, David Ahern wrote:
> On 11/24/17 11:32 AM, Eric Dumazet wrote:
> > On Fri, 2017-11-24 at 10:14 -0700, David Ahern wrote:
> > > On 11/22/17 5:30 PM, Solio Sarabia wrote:
> > > > The netdevice gso_max_size is exposed to allow users fine-control on
> > > > systems with multiple NICs with different GSO buffer sizes, and where
> > > > the virtual devices like bridge and veth, need to be aware of the GSO
> > > > size of the underlying devices.
> > > > 
> > > > In a virtualized environment, setting the right GSO sizes for physical
> > > > and virtual devices makes all TSO work to be on physical NIC, improving
> > > > throughput and reducing CPU util. If virtual devices send buffers
> > > > greater than what NIC supports, it forces host to do TSO for buffers
> > > > exceeding the limit, increasing CPU utilization in host.
> > > > 
> > > > Suggested-by: Shiny Sebastian 
> > > > Signed-off-by: Solio Sarabia 
> > > > ---
> > > 
> > > This should be added to rtnetlink rather than sysfs.
> > 
> > This is already exposed by rtnetlink [1]
> 
> It currently is read-only. This patch wants to control setting it.
> 
> > 
> > Please lets not add yet another net-sysfs knob.
> 
> Which is my main point - no more sysfs files.
> 

I was not objecting to your point, sorry if this was not obvious.

I usually hit reply on the latest email, not the first one in the
thread.

Proper support for changing these attributes is more complex than that
trivial change. Bonding and team devices, and tunnels come to mind.




Re: [PATCH] net-sysfs: export gso_max_size attribute

2017-11-24 Thread David Ahern
On 11/24/17 11:32 AM, Eric Dumazet wrote:
> On Fri, 2017-11-24 at 10:14 -0700, David Ahern wrote:
>> On 11/22/17 5:30 PM, Solio Sarabia wrote:
>>> The netdevice gso_max_size is exposed to allow users fine-control on
>>> systems with multiple NICs with different GSO buffer sizes, and where
>>> the virtual devices like bridge and veth, need to be aware of the GSO
>>> size of the underlying devices.
>>>
>>> In a virtualized environment, setting the right GSO sizes for physical
>>> and virtual devices makes all TSO work to be on physical NIC, improving
>>> throughput and reducing CPU util. If virtual devices send buffers
>>> greater than what NIC supports, it forces host to do TSO for buffers
>>> exceeding the limit, increasing CPU utilization in host.
>>>
>>> Suggested-by: Shiny Sebastian 
>>> Signed-off-by: Solio Sarabia 
>>> ---
>>
>> This should be added to rtnetlink rather than sysfs.
> 
> This is already exposed by rtnetlink [1]

It currently is read-only. This patch wants to control setting it.

> 
> Please lets not add yet another net-sysfs knob.

Which is my main point - no more sysfs files.



Re: [PATCH] net-sysfs: export gso_max_size attribute

2017-11-24 Thread Eric Dumazet
On Fri, 2017-11-24 at 10:14 -0700, David Ahern wrote:
> On 11/22/17 5:30 PM, Solio Sarabia wrote:
> > The netdevice gso_max_size is exposed to allow users fine-control on
> > systems with multiple NICs with different GSO buffer sizes, and where
> > the virtual devices like bridge and veth, need to be aware of the GSO
> > size of the underlying devices.
> > 
> > In a virtualized environment, setting the right GSO sizes for physical
> > and virtual devices makes all TSO work to be on physical NIC, improving
> > throughput and reducing CPU util. If virtual devices send buffers
> > greater than what NIC supports, it forces host to do TSO for buffers
> > exceeding the limit, increasing CPU utilization in host.
> > 
> > Suggested-by: Shiny Sebastian 
> > Signed-off-by: Solio Sarabia 
> > ---
> 
> This should be added to rtnetlink rather than sysfs.

This is already exposed by rtnetlink [1]

Please lets not add yet another net-sysfs knob.

[1] c70ce028e834f8e51306217dbdbd441d851c64d3 net/rtnetlink: add IFLA_GSO_MAX_SEGS and IFLA_GSO_MAX_SIZE attributes




Re: [PATCH net] net: qmi_wwan: add support for Cinterion PLS8

2017-11-24 Thread Bjørn Mork
Reinhard Speyerer  writes:

> before posting this problem report
> https://developer.gemalto.com/threads/ipv6dualstack-problems-pls8-e-revision-03017
> in the Gemalto developer forum I tested the qmi_wwan/cdc_ether changes
> you suggested above and apart from having two working QMI interfaces
> the IPv6/dualstack problems observed with AT^SWWAN/cdc_ether were
> also gone when using WDSStartNetworkInterface and the QMI interface in
> raw IP mode instead.

Right. I did not know about the "carrier off" issue. But messed up
ethernet headers is a well known problem with all these Qualcomm based
modems. Switching them to raw IP mode is often the only way to make them
work consistently.

Having seen this problem with multiple vendors, where some even have
borrowed our workarounds for their own out-of-tree drivers, makes me
pretty sure that it isn't easily fixable. It's a Qualcomm bug, and I
guess no one is allowed to even look at the code.  Much less change it.
Which makes sense given the mess it must be...

> Unfortunately Gemalto does not seem to be willing to provide an
> alternative USB composition which includes QMI interfaces for the
> PLS8. Therefore applying the above changes to qmi_wwan/cdc_ether might
> make the PLS8 network interfaces stop working when Gemalto decides to
> replace their f_rmnet gadget in CDCECM mode with a f_ecm gadget when
> releasing a firmware update.

I don't think this is necessarily a problem. Only the QMI control
channel will stop working should this happen.  The qmi_wwan driver will
provide the same network device support as cdc_ether, using CDC ECM
framing.

And to be honest, such a redesign of the modem application for a mature
product is very unlikely, isn't it?  Why would Gemalto want to do all
that extra work, taking the risks involved?  For what possible purpose?
This is probably the reason they don't want to mess with alternative USB
compositions either.

In any case, I think it is worth adding this device to qmi_wwan if it
works with current firmwares and you, or anyone else, finds it useful.
And it does sound like that based on the IPv6 issues you mention.

But I'll leave the decision to you or anyone else with such a device.


Bjørn




Re: sunrpc: infinite unkillable console spam in xs_tcp_setup_socket

2017-11-24 Thread Trond Myklebust
On Mon, 2017-11-20 at 14:02 +0100, Dmitry Vyukov wrote:
> Hello,
> 
> The following program triggers infinite stream of the following
> output
> on console. The program is unkillable and this effectively brings the
> machine down:
> 
> 
> ** 16 printk messages dropped ** [12875.022917] xs_tcp_setup_socket:
> connect returned unhandled error -113
>

Does the following fix the issue?

8<-
From f48d3f01df45f50f0145060f5272ccf1aea855ac Mon Sep 17 00:00:00 2001
From: Trond Myklebust 
Date: Fri, 24 Nov 2017 12:00:24 -0500
Subject: [PATCH] SUNRPC: Allow connect to return EHOSTUNREACH

Reported-by: Dmitry Vyukov 
Signed-off-by: Trond Myklebust 
---
 net/sunrpc/xprtsock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 4dad5da388d6..8cb40f8ffa5b 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2437,6 +2437,7 @@ static void xs_tcp_setup_socket(struct work_struct *work)
case -ECONNREFUSED:
case -ECONNRESET:
case -ENETUNREACH:
+   case -EHOSTUNREACH:
case -EADDRINUSE:
case -ENOBUFS:
/*
-- 
2.14.3

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.mykleb...@primarydata.com
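
As background to the one-line fix: the -113 in the report corresponds to
EHOSTUNREACH on common Linux architectures, and the patch adds it to the
set of connect errors that are handled like the other transient ones.
The sketch below only illustrates that classification in userspace;
connect_error_is_retryable() is a hypothetical helper and not part of
the SUNRPC code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: true for connect errors that are treated as
 * transient, so the caller backs off and retries instead of reporting
 * an unhandled error. Takes a kernel-style negative errno (or 0).
 */
static bool connect_error_is_retryable(int err)
{
	assert(err <= 0);

	switch (err) {
	case -ECONNREFUSED:
	case -ECONNRESET:
	case -ENETUNREACH:
	case -EHOSTUNREACH:  /* the case added by the patch */
	case -EADDRINUSE:
	case -ENOBUFS:
		return true;
	default:
		return false;
	}
}
```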


Re: [PATCH iproute 0/5] ila: additional configuration support

2017-11-24 Thread Stephen Hemminger
On Wed, 22 Nov 2017 12:05:32 -0800
Tom Herbert  wrote:

> Add configuration support for checksum neutral-map-auto, identifier
> types, and hook type (for LWT).
> 
> Tom Herbert (5):
>   ila: Fix reporting of ILA locators and locator match
>   ila: added csum neutral support to ipila
>   ila: support to configure checksum neutral-map-auto
>   ila: support for configuring identifier and hook types
>   ila: create ila_common.h
> 
>  ip/ila_common.h   | 105 
> ++
>  ip/ipila.c|  57 +--
>  ip/iproute_lwtunnel.c |  68 +++-
>  3 files changed, 200 insertions(+), 30 deletions(-)
>  create mode 100644 ip/ila_common.h
> 

Applied, thanks.


Re: [PATCH] net-sysfs: export gso_max_size attribute

2017-11-24 Thread David Ahern
On 11/22/17 5:30 PM, Solio Sarabia wrote:
> The netdevice gso_max_size is exposed to allow users fine-control on
> systems with multiple NICs with different GSO buffer sizes, and where
> the virtual devices like bridge and veth, need to be aware of the GSO
> size of the underlying devices.
> 
> In a virtualized environment, setting the right GSO sizes for physical
> and virtual devices makes all TSO work to be on physical NIC, improving
> throughput and reducing CPU util. If virtual devices send buffers
> greater than what NIC supports, it forces host to do TSO for buffers
> exceeding the limit, increasing CPU utilization in host.
> 
> Suggested-by: Shiny Sebastian 
> Signed-off-by: Solio Sarabia 
> ---

This should be added to rtnetlink rather than sysfs.


Re: [patch iproute2] tc: move action cookie print out of the stats if

2017-11-24 Thread Stephen Hemminger
On Fri, 24 Nov 2017 09:28:21 +0100
Jiri Pirko  wrote:

> From: Jiri Pirko 
> 
> Cookie print was made dependent on show_stats for no good reason. Fix
> this by pushing cookie print out of the stats if.
> 
> Fixes: fd8b3d2c1b9b ("actions: Add support for user cookies")
> Signed-off-by: Jiri Pirko 
> ---
>  tc/m_action.c | 17 -
>  1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/tc/m_action.c b/tc/m_action.c
> index 0dce97f..c2fc4f1 100644
> --- a/tc/m_action.c
> +++ b/tc/m_action.c
> @@ -301,19 +301,18 @@ static int tc_print_one_action(FILE *f, struct rtattr *arg)
>   return err;
>  
>   if (show_stats && tb[TCA_ACT_STATS]) {
> -
>   fprintf(f, "\tAction statistics:\n");
>   print_tcstats2_attr(f, tb[TCA_ACT_STATS], "\t", NULL);
> - if (tb[TCA_ACT_COOKIE]) {
> - int strsz = RTA_PAYLOAD(tb[TCA_ACT_COOKIE]);
> - char b1[strsz * 2 + 1];
> -
> - fprintf(f, "\n\tcookie len %d %s ", strsz,
> - hexstring_n2a(RTA_DATA(tb[TCA_ACT_COOKIE]),
> -   strsz, b1, sizeof(b1)));
> - }
>   fprintf(f, "\n");
>   }
> + if (tb[TCA_ACT_COOKIE]) {
> + int strsz = RTA_PAYLOAD(tb[TCA_ACT_COOKIE]);
> + char b1[strsz * 2 + 1];
> +
> + fprintf(f, "\tcookie len %d %s\n", strsz,
> + hexstring_n2a(RTA_DATA(tb[TCA_ACT_COOKIE]),
> +   strsz, b1, sizeof(b1)));
> + }
>  
>   return 0;
>  }

Yes, it should not be under the stats flag.
The general model is that -s is for statistics only, and -d is for read-only
detail values. So this makes sense.

The problem is that the format of the action cookie needs to be the same on the
command line argument and on display, i.e. drop the length part of the display.
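
A small sketch of what that could look like, with the cookie printed
independently of the stats flag and without the length. This is an
illustration only: hex_encode() and format_action() are hypothetical
helpers, not the tc code, and the statistics line is a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: hex-encode len bytes; out needs 2 * len + 1 bytes. */
static const char *hex_encode(const unsigned char *data, size_t len,
			      char *out, size_t outlen)
{
	size_t i;

	if (outlen < 2 * len + 1)
		return NULL;
	for (i = 0; i < len; i++)
		snprintf(out + 2 * i, 3, "%02x", data[i]);
	out[2 * len] = '\0';
	return out;
}

/* Formats one action; the cookie does not depend on show_stats. */
static void format_action(char *buf, size_t buflen, int show_stats,
			  const unsigned char *cookie, size_t cookie_len)
{
	char hex[64];
	size_t used;

	assert(buf != NULL && buflen > 0);
	buf[0] = '\0';

	if (show_stats)
		snprintf(buf, buflen, "\tAction statistics: (omitted)\n");

	used = strlen(buf);
	if (cookie && hex_encode(cookie, cookie_len, hex, sizeof(hex)))
		snprintf(buf + used, buflen - used, "\tcookie %s\n", hex);
}
```

Printing only the hex string keeps the displayed value in the same form
a user would type it.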


[PATCH net] net: dsa: fix 'increment on 0' warning

2017-11-24 Thread Vivien Didelot
Setting the refcount to 0 when allocating a tree to match the number of
switch devices it holds may cause an 'increment on 0; use-after-free',
if CONFIG_REFCOUNT_FULL is enabled.

To fix this, do not decrement the refcount of a newly allocated tree,
increment it when an already allocated tree is found, and decrement it
after the probing of a switch, as done with the previous behavior.

At the same time, make dsa_tree_get and dsa_tree_put accept a NULL
argument to simplify callers, and return the tree after incrementation,
as most kref users like of_node_get and of_node_put do.

Fixes: 8e5bf9759a06 ("net: dsa: simplify tree reference counting")
Signed-off-by: Vivien Didelot 
---
 net/dsa/dsa2.c | 27 +++
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 44e3fb7dec8c..1e287420ff49 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -51,9 +51,7 @@ static struct dsa_switch_tree *dsa_tree_alloc(int index)
INIT_LIST_HEAD(>list);
list_add_tail(_tree_list, >list);
 
-   /* Initialize the reference counter to the number of switches, not 1 */
kref_init(&dst->refcount);
-   refcount_set(&dst->refcount.refcount, 0);
 
return dst;
 }
@@ -64,20 +62,23 @@ static void dsa_tree_free(struct dsa_switch_tree *dst)
kfree(dst);
 }
 
+static struct dsa_switch_tree *dsa_tree_get(struct dsa_switch_tree *dst)
+{
+   if (dst)
+   kref_get(&dst->refcount);
+
+   return dst;
+}
+
 static struct dsa_switch_tree *dsa_tree_touch(int index)
 {
struct dsa_switch_tree *dst;
 
dst = dsa_tree_find(index);
-   if (!dst)
-   dst = dsa_tree_alloc(index);
-
-   return dst;
-}
-
-static void dsa_tree_get(struct dsa_switch_tree *dst)
-{
-   kref_get(&dst->refcount);
+   if (dst)
+   return dsa_tree_get(dst);
+   else
+   return dsa_tree_alloc(index);
 }
 
 static void dsa_tree_release(struct kref *ref)
@@ -91,7 +92,8 @@ static void dsa_tree_release(struct kref *ref)
 
 static void dsa_tree_put(struct dsa_switch_tree *dst)
 {
-   kref_put(&dst->refcount, dsa_tree_release);
+   if (dst)
+   kref_put(&dst->refcount, dsa_tree_release);
 }
 
 static bool dsa_port_is_dsa(struct dsa_port *port)
@@ -765,6 +767,7 @@ int dsa_register_switch(struct dsa_switch *ds)
 
mutex_lock(_mutex);
err = dsa_switch_probe(ds);
+   dsa_tree_put(ds->dst);
mutex_unlock(_mutex);
 
return err;
-- 
2.15.0
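
The reference counting scheme described in the commit message can be
modelled in a few lines of userspace C. This is a sketch under
assumptions, not the DSA code: a plain int stands in for the kref, a
single global slot stands in for the tree list, and the idea that a
successfully probed switch holds its own reference is an assumption made
for the demo. All names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct tree {
	int index;
	int refcount;
};

static struct tree *the_tree;  /* one-slot "list" of trees for the demo */

static struct tree *tree_alloc(int index)
{
	struct tree *t = calloc(1, sizeof(*t));

	if (!t)
		return NULL;
	t->index = index;
	t->refcount = 1;  /* starts at 1, never at 0 */
	the_tree = t;
	return t;
}

/* Accepts NULL and returns its argument, like of_node_get(). */
static struct tree *tree_get(struct tree *t)
{
	if (t)
		t->refcount++;
	return t;
}

/* Accepts NULL; frees the tree when the last reference is dropped. */
static void tree_put(struct tree *t)
{
	if (!t)
		return;
	assert(t->refcount > 0);
	if (--t->refcount == 0) {
		if (the_tree == t)
			the_tree = NULL;
		free(t);
	}
}

/* Existing tree: take a reference. Otherwise: allocate a new one. */
static struct tree *tree_touch(int index)
{
	if (the_tree && the_tree->index == index)
		return tree_get(the_tree);
	return tree_alloc(index);
}

/* probe_ok models a probe after which the switch keeps a reference. */
static int register_switch(int index, int probe_ok)
{
	struct tree *t = tree_touch(index);

	if (!t)
		return -1;
	if (probe_ok)
		tree_get(t);
	tree_put(t);  /* drop the reference taken by tree_touch() */
	return probe_ok ? 0 : -1;
}
```

After each successful registration the count equals the number of
registered switches, and a failed probe on a fresh tree frees it again,
without the counter ever being incremented from 0.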



Re: 8e5bf9759a ("net: dsa: simplify tree reference counting"): WARNING: CPU: 1 PID: 27 at lib/refcount.c:153 refcount_inc

2017-11-24 Thread Vivien Didelot
Hi Fengguang,

Fengguang Wu  writes:

> It looks like linus/master and linux-next still have this issue.

I sent a fix to net-next before it closes but it hasn't been picked.
Now that it's in the net tree, I'm sending an alternative fix right now.

Thanks for the note!

  Vivien


[PATCH net-next 00/12] rxrpc: Fixes and improvements

2017-11-24 Thread David Howells

Hi David,

Is it too late for this to go to Linus in this merge window?

---

Here's a set of patches that fix and improve some stuff in the AF_RXRPC
protocol.

The patches are:

 (1) Unlock mutex returned by rxrpc_accept_call().

 (2) Don't set connection upgrade by default.

 (3) Differentiate the call->user_mutex used by the kernel from that used
 by userspace calling sendmsg() to avoid lockdep warnings.

 (4) Delay terminal ACK transmission to a work queue so that it can be
 replaced by the next call if there is one.

 (5) Split the call parameters from the connection parameters so that more
 call-specific parameters can be passed through.

 (6) Fix the call timeouts to work the same as for other RxRPC/AFS
 implementations.

 (7) Don't transmit DELAY ACKs immediately, but instead delay them slightly
     so that they can be discarded or can represent more packets.

 (8) Use RTT to calculate certain protocol timeouts.

 (9) Add a timeout to detect lost ACK/DATA packets.

(10) Add a keepalive function so that we ping the peer if we haven't
 transmitted for a short while, thereby keeping intervening firewall
 routes open.

(11) Make service endpoints expire like they're supposed to so that the UDP
 port can be reused.

(12) Fix connection expiry timers to make cleanup happen in a more timely
 fashion.

The patches can be found here also:


http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-fixes

Tagged thusly:

git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
rxrpc-fixes-20171124

David
---
David Howells (12):
  rxrpc: The mutex lock returned by rxrpc_accept_call() needs releasing
  rxrpc: Don't set upgrade by default in sendmsg()
  rxrpc: Provide a different lockdep key for call->user_mutex for kernel calls
  rxrpc: Delay terminal ACK transmission on a client call
  rxrpc: Split the call params from the operation params
  rxrpc: Fix call timeouts
  rxrpc: Don't transmit DELAY ACKs immediately on proposal
  rxrpc: Express protocol timeouts in terms of RTT
  rxrpc: Add a timeout for detecting lost ACKs/lost DATA
  rxrpc: Add keepalive for a call
  rxrpc: Fix service endpoint expiry
  rxrpc: Fix conn expiry timers


 include/trace/events/rxrpc.h |   86 
 include/uapi/linux/rxrpc.h   |1 
 net/rxrpc/af_rxrpc.c |   23 
 net/rxrpc/ar-internal.h  |  103 ---
 net/rxrpc/call_accept.c  |2 
 net/rxrpc/call_event.c   |  229 --
 net/rxrpc/call_object.c  |   62 +++
 net/rxrpc/conn_client.c  |   54 --
 net/rxrpc/conn_event.c   |   74 +++---
 net/rxrpc/conn_object.c  |   76 +-
 net/rxrpc/input.c|   74 +-
 net/rxrpc/misc.c |   19 +--
 net/rxrpc/net_ns.c   |   33 +-
 net/rxrpc/output.c   |   43 
 net/rxrpc/recvmsg.c  |   12 +-
 net/rxrpc/sendmsg.c  |  126 ++-
 net/rxrpc/sysctl.c   |   60 +--
 17 files changed, 752 insertions(+), 325 deletions(-)



[PATCH net-next 01/12] rxrpc: The mutex lock returned by rxrpc_accept_call() needs releasing

2017-11-24 Thread David Howells
The caller of rxrpc_accept_call() must release the lock on call->user_mutex
returned by that function.
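
To illustrate the rule (a userspace sketch only, not the rxrpc code): a
function that returns an object with its lock held obliges the caller to
drop that lock on every exit path, which is what a shared exit label such
as out_put_unlock is for.  The names struct call, accept_call() and do_op()
are made up for this example, and a plain flag stands in for the mutex.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a call object; 'locked' models user_mutex. */
struct call {
	int locked;	/* 1 while the "mutex" is held */
	int refs;	/* reference count */
};

/* Returns the call with its lock held and an extra reference taken. */
static struct call *accept_call(struct call *c)
{
	if (!c)
		return NULL;
	assert(!c->locked);
	c->locked = 1;
	c->refs++;
	return c;
}

/* Caller side: every path after a successful accept unlocks, then puts. */
static int do_op(struct call *c, int fail)
{
	struct call *call = accept_call(c);
	int ret = 0;

	if (!call)
		return -1;	/* nothing locked, nothing to undo */
	if (fail) {
		ret = -2;
		goto out_put_unlock;
	}
	/* ... normal work on the locked call would go here ... */

out_put_unlock:
	call->locked = 0;	/* models mutex_unlock() */
	call->refs--;		/* models dropping the reference */
	return ret;
}
```

Both the success path and the failure path leave through the same label, so
the lock and the reference cannot be leaked by one of them.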

Signed-off-by: David Howells 
---

 net/rxrpc/sendmsg.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c
index 7d2595582c09..3a99b1a908df 100644
--- a/net/rxrpc/sendmsg.c
+++ b/net/rxrpc/sendmsg.c
@@ -619,8 +619,8 @@ int rxrpc_do_sendmsg(struct rxrpc_sock *rx, struct msghdr *msg, size_t len)
/* The socket is now unlocked. */
if (IS_ERR(call))
return PTR_ERR(call);
-   rxrpc_put_call(call, rxrpc_call_put);
-   return 0;
+   ret = 0;
+   goto out_put_unlock;
}
 
call = rxrpc_find_call_by_user_ID(rx, p.user_call_ID);
@@ -689,6 +689,7 @@ int rxrpc_do_sendmsg(struct rxrpc_sock *rx, struct msghdr *msg, size_t len)
ret = rxrpc_send_data(rx, call, msg, len, NULL);
}
 
+out_put_unlock:
mutex_unlock(&call->user_mutex);
 error_put:
rxrpc_put_call(call, rxrpc_call_put);



[PATCH net-next 03/12] rxrpc: Provide a different lockdep key for call->user_mutex for kernel calls

2017-11-24 Thread David Howells
Provide a different lockdep key for rxrpc_call::user_mutex when the call is
made on a kernel socket, such as by the AFS filesystem.

The problem is that lockdep registers a false positive between two cases:
userspace calling the sendmsg syscall on a user socket, where
call->user_mutex is held whilst userspace memory is accessed, and the AFS
filesystem, which may perform operations with mmap_sem held by the caller.

In such a case, the following warning is produced.

==
WARNING: possible circular locking dependency detected
4.14.0-fscache+ #243 Tainted: GE
--
modpost/16701 is trying to acquire lock:
 (&vnode->io_lock){+.+.}, at: [] afs_begin_vnode_operation+0x33/0x77 [kafs]

but task is already holding lock:
 (&mm->mmap_sem){}, at: [] __do_page_fault+0x1ef/0x486

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&mm->mmap_sem){}:
   __might_fault+0x61/0x89
   _copy_from_iter_full+0x40/0x1fa
   rxrpc_send_data+0x8dc/0xff3
   rxrpc_do_sendmsg+0x62f/0x6a1
   rxrpc_sendmsg+0x166/0x1b7
   sock_sendmsg+0x2d/0x39
   ___sys_sendmsg+0x1ad/0x22b
   __sys_sendmsg+0x41/0x62
   do_syscall_64+0x89/0x1be
   return_from_SYSCALL_64+0x0/0x75

-> #2 (&call->user_mutex){+.+.}:
   __mutex_lock+0x86/0x7d2
   rxrpc_new_client_call+0x378/0x80e
   rxrpc_kernel_begin_call+0xf3/0x154
   afs_make_call+0x195/0x454 [kafs]
   afs_vl_get_capabilities+0x193/0x198 [kafs]
   afs_vl_lookup_vldb+0x5f/0x151 [kafs]
   afs_create_volume+0x2e/0x2f4 [kafs]
   afs_mount+0x56a/0x8d7 [kafs]
   mount_fs+0x6a/0x109
   vfs_kern_mount+0x67/0x135
   do_mount+0x90b/0xb57
   SyS_mount+0x72/0x98
   do_syscall_64+0x89/0x1be
   return_from_SYSCALL_64+0x0/0x75

-> #1 (k-sk_lock-AF_RXRPC){+.+.}:
   lock_sock_nested+0x74/0x8a
   rxrpc_kernel_begin_call+0x8a/0x154
   afs_make_call+0x195/0x454 [kafs]
   afs_fs_get_capabilities+0x17a/0x17f [kafs]
   afs_probe_fileserver+0xf7/0x2f0 [kafs]
   afs_select_fileserver+0x83f/0x903 [kafs]
   afs_fetch_status+0x89/0x11d [kafs]
   afs_iget+0x16f/0x4f8 [kafs]
   afs_mount+0x6c6/0x8d7 [kafs]
   mount_fs+0x6a/0x109
   vfs_kern_mount+0x67/0x135
   do_mount+0x90b/0xb57
   SyS_mount+0x72/0x98
   do_syscall_64+0x89/0x1be
   return_from_SYSCALL_64+0x0/0x75

-> #0 (&vnode->io_lock){+.+.}:
   lock_acquire+0x174/0x19f
   __mutex_lock+0x86/0x7d2
   afs_begin_vnode_operation+0x33/0x77 [kafs]
   afs_fetch_data+0x80/0x12a [kafs]
   afs_readpages+0x314/0x405 [kafs]
   __do_page_cache_readahead+0x203/0x2ba
   filemap_fault+0x179/0x54d
   __do_fault+0x17/0x60
   __handle_mm_fault+0x6d7/0x95c
   handle_mm_fault+0x24e/0x2a3
   __do_page_fault+0x301/0x486
   do_page_fault+0x236/0x259
   page_fault+0x22/0x30
   __clear_user+0x3d/0x60
   padzero+0x1c/0x2b
   load_elf_binary+0x785/0xdc7
   search_binary_handler+0x81/0x1ff
   do_execveat_common.isra.14+0x600/0x888
   do_execve+0x1f/0x21
   SyS_execve+0x28/0x2f
   do_syscall_64+0x89/0x1be
   return_from_SYSCALL_64+0x0/0x75

other info that might help us debug this:

Chain exists of:
  &vnode->io_lock --> &call->user_mutex --> &mm->mmap_sem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mm->mmap_sem);
                               lock(&call->user_mutex);
                               lock(&mm->mmap_sem);
  lock(&vnode->io_lock);

 *** DEADLOCK ***

1 lock held by modpost/16701:
 #0:  (&mm->mmap_sem){}, at: [] __do_page_fault+0x1ef/0x486

stack backtrace:
CPU: 0 PID: 16701 Comm: modpost Tainted: GE   4.14.0-fscache+ #243
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
 dump_stack+0x67/0x8e
 print_circular_bug+0x341/0x34f
 check_prev_add+0x11f/0x5d4
 ? add_lock_to_list.isra.12+0x8b/0x8b
 ? add_lock_to_list.isra.12+0x8b/0x8b
 ? __lock_acquire+0xf77/0x10b4
 __lock_acquire+0xf77/0x10b4
 lock_acquire+0x174/0x19f
 ? afs_begin_vnode_operation+0x33/0x77 [kafs]
 __mutex_lock+0x86/0x7d2
 ? afs_begin_vnode_operation+0x33/0x77 [kafs]
 ? afs_begin_vnode_operation+0x33/0x77 [kafs]
 ? afs_begin_vnode_operation+0x33/0x77 [kafs]
 afs_begin_vnode_operation+0x33/0x77 [kafs]
 afs_fetch_data+0x80/0x12a [kafs]
 afs_readpages+0x314/0x405 [kafs]
 __do_page_cache_readahead+0x203/0x2ba
 ? filemap_fault+0x179/0x54d
 filemap_fault+0x179/0x54d
 __do_fault+0x17/0x60
 __handle_mm_fault+0x6d7/0x95c
 handle_mm_fault+0x24e/0x2a3
 __do_page_fault+0x301/0x486
 do_page_fault+0x236/0x259
 page_fault+0x22/0x30
RIP: 0010:__clear_user+0x3d/0x60
RSP: 0018:880071e93da0 EFLAGS: 00010202
RAX:  RBX: 011c RCX: 011c
RDX:  RSI: 0008 RDI: 0060f720
RBP: 0060f720 R08: 0001 R09: 
R10: 

[PATCH net-next 05/12] rxrpc: Split the call params from the operation params

2017-11-24 Thread David Howells
When rxrpc_sendmsg() parses the control message buffer, it places the
parameters extracted into a structure, but lumps together call parameters
(such as user call ID) with operation parameters (such as whether to send
data, send an abort or accept a call).

Split the call parameters out into their own structure, a copy of which is
then embedded in the operation parameters struct.

The call parameters struct is then passed down into the places that need it
instead of passing the individual parameters.  This allows for extra call
parameters to be added.

Signed-off-by: David Howells 
---

 net/rxrpc/af_rxrpc.c|8 ++-
 net/rxrpc/ar-internal.h |   31 -
 net/rxrpc/call_object.c |   15 ++
 net/rxrpc/sendmsg.c |   51 ---
 4 files changed, 60 insertions(+), 45 deletions(-)

diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index 9b5c46b052fd..c0cdcf980ffc 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -285,6 +285,7 @@ struct rxrpc_call *rxrpc_kernel_begin_call(struct socket *sock,
   bool upgrade)
 {
struct rxrpc_conn_parameters cp;
+   struct rxrpc_call_params p;
struct rxrpc_call *call;
struct rxrpc_sock *rx = rxrpc_sk(sock->sk);
int ret;
@@ -302,6 +303,10 @@ struct rxrpc_call *rxrpc_kernel_begin_call(struct socket *sock,
if (key && !key->payload.data[0])
key = NULL; /* a no-security key */
 
+   memset(&p, 0, sizeof(p));
+   p.user_call_ID = user_call_ID;
+   p.tx_total_len = tx_total_len;
+
memset(&cp, 0, sizeof(cp));
cp.local= rx->local;
cp.key  = key;
@@ -309,8 +314,7 @@ struct rxrpc_call *rxrpc_kernel_begin_call(struct socket *sock,
cp.exclusive= false;
cp.upgrade  = upgrade;
cp.service_id   = srx->srx_service;
-   call = rxrpc_new_client_call(rx, &cp, srx, user_call_ID, tx_total_len,
-gfp);
+   call = rxrpc_new_client_call(rx, &cp, srx, &p, gfp);
/* The socket has been unlocked. */
if (!IS_ERR(call)) {
call->notify_rx = notify_rx;
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index d1213d503f30..ba63f2231107 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -643,6 +643,35 @@ struct rxrpc_ack_summary {
u8  cumulative_acks;
 };
 
+/*
+ * sendmsg() cmsg-specified parameters.
+ */
+enum rxrpc_command {
+   RXRPC_CMD_SEND_DATA,/* send data message */
+   RXRPC_CMD_SEND_ABORT,   /* request abort generation */
+   RXRPC_CMD_ACCEPT,   /* [server] accept incoming call */
+   RXRPC_CMD_REJECT_BUSY,  /* [server] reject a call as busy */
+};
+
+struct rxrpc_call_params {
+   s64 tx_total_len;   /* Total Tx data length (if send data) */
+   unsigned long   user_call_ID;   /* User's call ID */
+   struct {
+       u32 hard;   /* Maximum lifetime (sec) */
+       u32 idle;   /* Max time since last data packet (msec) */
+       u32 normal; /* Max time since last call packet (msec) */
+   } timeouts;
+   u8  nr_timeouts;    /* Number of timeouts specified */
+};
+
+struct rxrpc_send_params {
+   struct rxrpc_call_params call;
+   u32 abort_code; /* Abort code to Tx (if abort) */
+   enum rxrpc_command  command : 8;    /* The command to implement */
+   bool    exclusive;  /* Shared or exclusive call */
+   bool    upgrade;    /* If the connection is upgradeable */
+};
+
 #include 
 
 /*
@@ -687,7 +716,7 @@ struct rxrpc_call *rxrpc_alloc_call(struct rxrpc_sock *, gfp_t);
 struct rxrpc_call *rxrpc_new_client_call(struct rxrpc_sock *,
 struct rxrpc_conn_parameters *,
 struct sockaddr_rxrpc *,
-unsigned long, s64, gfp_t);
+struct rxrpc_call_params *, gfp_t);
 int rxrpc_retry_client_call(struct rxrpc_sock *,
struct rxrpc_call *,
struct rxrpc_conn_parameters *,
diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index 1f141dc08ad2..c3e1fa854471 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -208,8 +208,7 @@ static void rxrpc_start_call_timer(struct rxrpc_call *call)
 struct rxrpc_call *rxrpc_new_client_call(struct rxrpc_sock *rx,
 struct rxrpc_conn_parameters *cp,
 struct sockaddr_rxrpc *srx,
- 

[PATCH net-next 04/12] rxrpc: Delay terminal ACK transmission on a client call

2017-11-24 Thread David Howells
Delay terminal ACK transmission on a client call by deferring it to the
connection processor.  This allows it to be skipped if we can send the next
call instead, the first DATA packet of which will implicitly ack this call.
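
A rough model of the idea, assuming a per-channel "final ACK pending" bit
like the RXRPC_CONN_FINAL_ACK_n flags and the two-tick delay used in the
patch.  Everything else here (struct conn, the function names, counting
ACKs instead of transmitting them) is hypothetical and only meant to show
the schedule/cancel/process cycle, not the real connection processor.

```c
#include <assert.h>

#define NR_CHANNELS 4

/* Hypothetical, simplified connection state. */
struct conn {
	unsigned long final_ack_mask;			/* bit n: channel n owes a final ACK */
	unsigned long final_ack_at[NR_CHANNELS];	/* tick at which to send it */
	int acks_sent;					/* counts "transmitted" ACKs */
};

/* A call ended on 'chan': schedule its final ACK slightly in the future. */
static void schedule_final_ack(struct conn *c, int chan, unsigned long now)
{
	assert(chan >= 0 && chan < NR_CHANNELS);
	c->final_ack_at[chan] = now + 2;
	c->final_ack_mask |= 1UL << chan;
}

/* A new call starts on 'chan': its first DATA packet acks the old call. */
static void activate_channel(struct conn *c, int chan)
{
	assert(chan >= 0 && chan < NR_CHANNELS);
	c->final_ack_mask &= ~(1UL << chan);
}

/* Timer/work handler: send the final ACKs that are due and still wanted. */
static void process_final_acks(struct conn *c, unsigned long now)
{
	for (int chan = 0; chan < NR_CHANNELS; chan++) {
		if (!(c->final_ack_mask & (1UL << chan)))
			continue;
		if ((long)(now - c->final_ack_at[chan]) < 0)
			continue;	/* not due yet */
		c->final_ack_mask &= ~(1UL << chan);
		c->acks_sent++;		/* stands in for the transmission */
	}
}
```

If a follow-on call is activated on the channel before the deadline, the
bit is cleared and no separate ACK packet is produced for the old call.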

Signed-off-by: David Howells 
---

 net/rxrpc/ar-internal.h |   17 +++
 net/rxrpc/conn_client.c |   18 +++
 net/rxrpc/conn_event.c  |   74 +++
 net/rxrpc/conn_object.c |   10 ++
 net/rxrpc/recvmsg.c |2 +
 5 files changed, 108 insertions(+), 13 deletions(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index a972887b3f5d..d1213d503f30 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -338,8 +338,17 @@ enum rxrpc_conn_flag {
RXRPC_CONN_DONT_REUSE,  /* Don't reuse this connection */
RXRPC_CONN_COUNTED, /* Counted by rxrpc_nr_client_conns */
RXRPC_CONN_PROBING_FOR_UPGRADE, /* Probing for service upgrade */
+   RXRPC_CONN_FINAL_ACK_0, /* Need final ACK for channel 0 */
+   RXRPC_CONN_FINAL_ACK_1, /* Need final ACK for channel 1 */
+   RXRPC_CONN_FINAL_ACK_2, /* Need final ACK for channel 2 */
+   RXRPC_CONN_FINAL_ACK_3, /* Need final ACK for channel 3 */
 };
 
+#define RXRPC_CONN_FINAL_ACK_MASK ((1UL << RXRPC_CONN_FINAL_ACK_0) |   \
+  (1UL << RXRPC_CONN_FINAL_ACK_1) |\
+  (1UL << RXRPC_CONN_FINAL_ACK_2) |\
+  (1UL << RXRPC_CONN_FINAL_ACK_3))
+
 /*
  * Events that can be raised upon a connection.
  */
@@ -393,6 +402,7 @@ struct rxrpc_connection {
 #define RXRPC_ACTIVE_CHANS_MASK((1 << RXRPC_MAXCALLS) - 1)
struct list_headwaiting_calls;  /* Calls waiting for channels */
struct rxrpc_channel {
+   unsigned long   final_ack_at;   /* Time at which to issue final ACK */
struct rxrpc_call __rcu *call;  /* Active call */
u32 call_id;/* ID of current call */
u32 call_counter;   /* Call ID counter */
@@ -404,6 +414,7 @@ struct rxrpc_connection {
};
} channels[RXRPC_MAXCALLS];
 
+   struct timer_list   timer;  /* Conn event timer */
struct work_struct  processor;  /* connection event processor */
union {
struct rb_node  client_node;    /* Node in local->client_conns */
@@ -861,6 +872,12 @@ static inline void rxrpc_put_connection(struct rxrpc_connection *conn)
rxrpc_put_service_conn(conn);
 }
 
+static inline void rxrpc_reduce_conn_timer(struct rxrpc_connection *conn,
+  unsigned long expire_at)
+{
+   timer_reduce(&conn->timer, expire_at);
+}
+
 /*
  * conn_service.c
  */
diff --git a/net/rxrpc/conn_client.c b/net/rxrpc/conn_client.c
index 5f9624bd311c..cfb997593da9 100644
--- a/net/rxrpc/conn_client.c
+++ b/net/rxrpc/conn_client.c
@@ -554,6 +554,11 @@ static void rxrpc_activate_one_channel(struct rxrpc_connection *conn,
 
trace_rxrpc_client(conn, channel, rxrpc_client_chan_activate);
 
+   /* Cancel the final ACK on the previous call if it hasn't been sent yet
+* as the DATA packet will implicitly ACK it.
+*/
+   clear_bit(RXRPC_CONN_FINAL_ACK_0 + channel, &conn->flags);
+
write_lock_bh(&call->state_lock);
if (!test_bit(RXRPC_CALL_TX_LASTQ, &call->flags))
call->state = RXRPC_CALL_CLIENT_SEND_REQUEST;
@@ -813,6 +818,19 @@ void rxrpc_disconnect_client_call(struct rxrpc_call *call)
goto out_2;
}
 
+   /* Schedule the final ACK to be transmitted in a short while so that it
+* can be skipped if we find a follow-on call.  The first DATA packet
+* of the follow on call will implicitly ACK this call.
+*/
+   if (test_bit(RXRPC_CALL_EXPOSED, &call->flags)) {
+   unsigned long final_ack_at = jiffies + 2;
+
+   WRITE_ONCE(chan->final_ack_at, final_ack_at);
+   smp_wmb(); /* vs rxrpc_process_delayed_final_acks() */
+   set_bit(RXRPC_CONN_FINAL_ACK_0 + channel, &conn->flags);
+   rxrpc_reduce_conn_timer(conn, final_ack_at);
+   }
+
/* Things are more complex and we need the cache lock.  We might be
 * able to simply idle the conn or it might now be lurking on the wait
 * list.  It might even get moved back to the active list whilst we're
diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c
index 59a51a56e7c8..9e9a8db1bc9c 100644
--- a/net/rxrpc/conn_event.c
+++ b/net/rxrpc/conn_event.c
@@ -24,9 +24,10 @@
  * Retransmit terminal ACK or ABORT of the previous call.
  */
 static void rxrpc_conn_retransmit_call(struct rxrpc_connection *conn,
-  struct sk_buff *skb)
+

[PATCH net-next 02/12] rxrpc: Don't set upgrade by default in sendmsg()

2017-11-24 Thread David Howells
Don't set upgrade by default when creating a call from sendmsg().  This is
a holdover from when I was testing the code.

Signed-off-by: David Howells 
---

 net/rxrpc/sendmsg.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c
index 3a99b1a908df..94555c94b2d8 100644
--- a/net/rxrpc/sendmsg.c
+++ b/net/rxrpc/sendmsg.c
@@ -602,7 +602,7 @@ int rxrpc_do_sendmsg(struct rxrpc_sock *rx, struct msghdr *msg, size_t len)
.abort_code = 0,
.command= RXRPC_CMD_SEND_DATA,
.exclusive  = false,
-   .upgrade= true,
+   .upgrade= false,
};
 
_enter("");



[PATCH net-next 07/12] rxrpc: Don't transmit DELAY ACKs immediately on proposal

2017-11-24 Thread David Howells
Don't transmit a DELAY ACK immediately on proposal when the Rx window is
rotated, but rather defer it to the work function.  This means that we have
a chance to queue/consume more received packets before we actually send the
DELAY ACK, or even cancel it entirely, thereby reducing the number of
packets transmitted.

We do, however, want to continue sending other types of packet immediately,
particularly REQUESTED ACKs, as they may be used for RTT calculation by the
other side.

Signed-off-by: David Howells 
---

 net/rxrpc/recvmsg.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index 0b6609da80b7..fad5f42a3abd 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -219,9 +219,9 @@ static void rxrpc_rotate_rx_window(struct rxrpc_call *call)
after_eq(top, call->ackr_seen + 2) ||
(hard_ack == top && after(hard_ack, call->ackr_consumed)))
rxrpc_propose_ACK(call, RXRPC_ACK_DELAY, 0, serial,
- true, false,
+ true, true,
  rxrpc_propose_ack_rotate_rx);
-   if (call->ackr_reason)
+   if (call->ackr_reason && call->ackr_reason != RXRPC_ACK_DELAY)
rxrpc_send_ack_packet(call, false);
}
 }



[PATCH net-next 06/12] rxrpc: Fix call timeouts

2017-11-24 Thread David Howells
Fix the rxrpc call expiration timeouts and make them settable from
userspace.  By analogy with other rx implementations, there should be three
timeouts:

 (1) "Normal timeout"

 This is set for all calls and is triggered if we haven't received any
 packets from the peer in a while.  It is measured from the last time
 we received any packet on that call.  This is not reset by any
 connection packets (such as CHALLENGE/RESPONSE packets).

 If a service operation takes a long time, the server should generate
 PING ACKs at a duration that's substantially less than the normal
 timeout so as to keep both sides alive.  This is set at 1/6 of the normal
 timeout.

 (2) "Idle timeout"

 This is set only for a service call and is triggered if we stop
 receiving the DATA packets that comprise the request data.  It is
 measured from the last time we received a DATA packet.

 (3) "Hard timeout"

 This can be set for a call and specifies the maximum lifetime of that
 call.  It should not be specified by default.  Some operations (such
 as volume transfer) take a long time.

Allow userspace to set/change the timeouts on a call with sendmsg, using a
control message:

RXRPC_SET_CALL_TIMEOUTS

The data to the message is a number of 32-bit words, not all of which need
be given:

u32 hard_timeout;   /* sec from first packet */
u32 idle_timeout;   /* msec from DATA packet Rx */
u32 normal_timeout; /* msec from any packet Rx */

This can be set in combination with any other sendmsg() that affects a
call.
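
As a sketch of how a payload of "up to three 32-bit words" might be
decoded, assuming the word order given above (hard, idle, normal) and
assuming that anything other than one to three whole words is rejected.
The struct and function names are hypothetical, omitted words are simply
left at zero here, and this is not taken from the kernel's cmsg parser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded form of the control message payload. */
struct call_timeouts {
	uint32_t hard;		/* sec */
	uint32_t idle;		/* msec */
	uint32_t normal;	/* msec */
	uint8_t  nr;		/* how many words were supplied */
};

/* Returns 0 on success, -1 if the payload is not 1..3 whole 32-bit words. */
static int parse_timeouts(const void *data, size_t len, struct call_timeouts *t)
{
	uint32_t words[3] = { 0, 0, 0 };

	assert(t != NULL);
	if (len == 0 || len > sizeof(words) || len % sizeof(uint32_t) != 0)
		return -1;

	/* Copy only what was given; trailing words stay zero. */
	memcpy(words, data, len);
	t->hard   = words[0];
	t->idle   = words[1];
	t->normal = words[2];
	t->nr     = (uint8_t)(len / sizeof(uint32_t));
	return 0;
}
```

Recording the count separately lets the consumer tell "not specified" apart
from an explicit zero.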

Signed-off-by: David Howells 
---

 include/trace/events/rxrpc.h |   69 +++-
 include/uapi/linux/rxrpc.h   |1 
 net/rxrpc/ar-internal.h  |   37 ++---
 net/rxrpc/call_event.c   |  179 --
 net/rxrpc/call_object.c  |   27 --
 net/rxrpc/conn_client.c  |4 -
 net/rxrpc/input.c|   34 +++-
 net/rxrpc/misc.c |   19 ++--
 net/rxrpc/recvmsg.c  |2 
 net/rxrpc/sendmsg.c  |   59 +++---
 net/rxrpc/sysctl.c   |   60 +++---
 11 files changed, 290 insertions(+), 201 deletions(-)

diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h
index ebe96796027a..01dcbc2164b5 100644
--- a/include/trace/events/rxrpc.h
+++ b/include/trace/events/rxrpc.h
@@ -138,10 +138,20 @@ enum rxrpc_rtt_rx_trace {
 
 enum rxrpc_timer_trace {
rxrpc_timer_begin,
+   rxrpc_timer_exp_ack,
+   rxrpc_timer_exp_hard,
+   rxrpc_timer_exp_idle,
+   rxrpc_timer_exp_normal,
+   rxrpc_timer_exp_ping,
+   rxrpc_timer_exp_resend,
rxrpc_timer_expired,
rxrpc_timer_init_for_reply,
rxrpc_timer_init_for_send_reply,
+   rxrpc_timer_restart,
rxrpc_timer_set_for_ack,
+   rxrpc_timer_set_for_hard,
+   rxrpc_timer_set_for_idle,
+   rxrpc_timer_set_for_normal,
rxrpc_timer_set_for_ping,
rxrpc_timer_set_for_resend,
rxrpc_timer_set_for_send,
@@ -296,12 +306,22 @@ enum rxrpc_congest_change {
 #define rxrpc_timer_traces \
EM(rxrpc_timer_begin,   "Begin ") \
EM(rxrpc_timer_expired, "*EXPR*") \
+   EM(rxrpc_timer_exp_ack, "ExpAck") \
+   EM(rxrpc_timer_exp_hard,"ExpHrd") \
+   EM(rxrpc_timer_exp_idle,"ExpIdl") \
+   EM(rxrpc_timer_exp_normal,  "ExpNml") \
+   EM(rxrpc_timer_exp_ping,"ExpPng") \
+   EM(rxrpc_timer_exp_resend,  "ExpRsn") \
EM(rxrpc_timer_init_for_reply,  "IniRpl") \
EM(rxrpc_timer_init_for_send_reply, "SndRpl") \
+   EM(rxrpc_timer_restart, "Restrt") \
EM(rxrpc_timer_set_for_ack, "SetAck") \
+   EM(rxrpc_timer_set_for_hard,"SetHrd") \
+   EM(rxrpc_timer_set_for_idle,"SetIdl") \
+   EM(rxrpc_timer_set_for_normal,  "SetNml") \
EM(rxrpc_timer_set_for_ping,"SetPng") \
EM(rxrpc_timer_set_for_resend,  "SetRTx") \
-   E_(rxrpc_timer_set_for_send,"SetTx ")
+   E_(rxrpc_timer_set_for_send,"SetSnd")
 
 #define rxrpc_propose_ack_traces \
EM(rxrpc_propose_ack_client_tx_end, "ClTxEnd") \
@@ -932,39 +952,44 @@ TRACE_EVENT(rxrpc_rtt_rx,
 
 TRACE_EVENT(rxrpc_timer,
TP_PROTO(struct rxrpc_call *call, enum rxrpc_timer_trace why,
-ktime_t now, unsigned long now_j),
+unsigned long now),
 
-   TP_ARGS(call, why, now, now_j),
+   TP_ARGS(call, why, now),
 
TP_STRUCT__entry(
__field(struct rxrpc_call *,call)
__field(enum rxrpc_timer_trace, why )
-   __field_struct(ktime_t, now   

[PATCH net-next 09/12] rxrpc: Add a timeout for detecting lost ACKs/lost DATA

2017-11-24 Thread David Howells
Add an extra timeout that is set/updated when we send a DATA packet that
has the request-ack flag set.  This allows us to detect if we don't get an
ACK in response to the latest flagged packet.

The ACK packet is adjudged to have been lost if it doesn't turn up within
2*RTT of the transmission.
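
A minimal sketch of the 2*RTT rule, assuming a free-running tick counter
compared with signed arithmetic in the style of time_after_eq().  The
struct, the 'armed' flag and the function names are hypothetical; in
particular, the real code tracks this through call->ack_lost_at and the
call timer rather than through a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Wrap-safe "a is at or after b" for tick counters. */
static int tick_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Hypothetical, cut-down call state. */
struct call {
	unsigned long ack_lost_at;	/* deadline for the ACK to a flagged DATA packet */
	int armed;			/* 1 while we are waiting for that ACK */
};

/* Sent a DATA packet with request-ack set: (re)arm at now + 2*RTT. */
static void sent_request_ack(struct call *c, unsigned long now, unsigned long rtt)
{
	assert(c != NULL);
	c->ack_lost_at = now + 2 * rtt;
	c->armed = 1;
}

/* Returns 1 once if the ACK is judged lost (a PING ACK would then be sent). */
static int ack_lost(struct call *c, unsigned long now)
{
	if (!c->armed || !tick_after_eq(now, c->ack_lost_at))
		return 0;
	c->armed = 0;
	return 1;
}
```

Each newly flagged DATA packet pushes the deadline out again, so only the
latest request is being timed.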

If the timeout occurs, we schedule the sending of a PING ACK to find out
the state of the other side.  If a new DATA packet is ready to go sooner,
we cancel the sending of the ping and set the request-ack flag on that
instead.

If we get back a PING-RESPONSE ACK that indicates a lower tx_top than what
we had at the time of the ping transmission, we adjudge all the DATA
packets sent between the response tx_top and the ping-time tx_top to have
been lost and retransmit immediately.

Rather than sending a PING ACK, we could just pick a DATA packet and
speculatively retransmit that with request-ack set.  It should result in
either a REQUESTED ACK or a DUPLICATE ACK which we can then use in lieu of the
PING-RESPONSE ACK mentioned above.

Signed-off-by: David Howells 
---

 include/trace/events/rxrpc.h |   11 +--
 net/rxrpc/ar-internal.h  |6 +-
 net/rxrpc/call_event.c   |   26 ++
 net/rxrpc/call_object.c  |1 +
 net/rxrpc/input.c|   40 
 net/rxrpc/output.c   |   20 ++--
 net/rxrpc/recvmsg.c  |4 ++--
 net/rxrpc/sendmsg.c  |2 +-
 8 files changed, 98 insertions(+), 12 deletions(-)

diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h
index 01dcbc2164b5..84ade8b76a19 100644
--- a/include/trace/events/rxrpc.h
+++ b/include/trace/events/rxrpc.h
@@ -141,6 +141,7 @@ enum rxrpc_timer_trace {
rxrpc_timer_exp_ack,
rxrpc_timer_exp_hard,
rxrpc_timer_exp_idle,
+   rxrpc_timer_exp_lost_ack,
rxrpc_timer_exp_normal,
rxrpc_timer_exp_ping,
rxrpc_timer_exp_resend,
@@ -151,6 +152,7 @@ enum rxrpc_timer_trace {
rxrpc_timer_set_for_ack,
rxrpc_timer_set_for_hard,
rxrpc_timer_set_for_idle,
+   rxrpc_timer_set_for_lost_ack,
rxrpc_timer_set_for_normal,
rxrpc_timer_set_for_ping,
rxrpc_timer_set_for_resend,
@@ -309,6 +311,7 @@ enum rxrpc_congest_change {
EM(rxrpc_timer_exp_ack, "ExpAck") \
EM(rxrpc_timer_exp_hard,"ExpHrd") \
EM(rxrpc_timer_exp_idle,"ExpIdl") \
+   EM(rxrpc_timer_exp_lost_ack,"ExpLoA") \
EM(rxrpc_timer_exp_normal,  "ExpNml") \
EM(rxrpc_timer_exp_ping,"ExpPng") \
EM(rxrpc_timer_exp_resend,  "ExpRsn") \
@@ -318,6 +321,7 @@ enum rxrpc_congest_change {
EM(rxrpc_timer_set_for_ack, "SetAck") \
EM(rxrpc_timer_set_for_hard,"SetHrd") \
EM(rxrpc_timer_set_for_idle,"SetIdl") \
+   EM(rxrpc_timer_set_for_lost_ack,"SetLoA") \
EM(rxrpc_timer_set_for_normal,  "SetNml") \
EM(rxrpc_timer_set_for_ping,"SetPng") \
EM(rxrpc_timer_set_for_resend,  "SetRTx") \
@@ -961,6 +965,7 @@ TRACE_EVENT(rxrpc_timer,
__field(enum rxrpc_timer_trace, why )
__field(long,   now )
__field(long,   ack_at  )
+   __field(long,   ack_lost_at )
__field(long,   resend_at   )
__field(long,   ping_at )
__field(long,   expect_rx_by)
@@ -974,6 +979,7 @@ TRACE_EVENT(rxrpc_timer,
__entry->why= why;
__entry->now= now;
__entry->ack_at = call->ack_at;
+   __entry->ack_lost_at= call->ack_lost_at;
__entry->resend_at  = call->resend_at;
__entry->expect_rx_by   = call->expect_rx_by;
__entry->expect_req_by  = call->expect_req_by;
@@ -981,10 +987,11 @@ TRACE_EVENT(rxrpc_timer,
__entry->timer  = call->timer.expires;
   ),
 
-   TP_printk("c=%p %s a=%ld r=%ld xr=%ld xq=%ld xt=%ld t=%ld",
+   TP_printk("c=%p %s a=%ld la=%ld r=%ld xr=%ld xq=%ld xt=%ld t=%ld",
  __entry->call,
  __print_symbolic(__entry->why, rxrpc_timer_traces),
  __entry->ack_at - __entry->now,
+ __entry->ack_lost_at - __entry->now,
  __entry->resend_at - __entry->now,
  __entry->expect_rx_by - __entry->now,
   

[PATCH net-next 11/12] rxrpc: Fix service endpoint expiry

2017-11-24 Thread David Howells
Make RxRPC service endpoints expire like they're supposed to by the following
means:

 (1) Mark dead rxrpc_net structs (with ->live) rather than twiddling the
 global service conn timeout, otherwise the first rxrpc_net struct to
 die will cause connections on all others to expire immediately from
 then on.

 (2) Mark local service endpoints for which the socket has been closed
 (->service_closed) so that the expiration timeout can be much
 shortened for service and client connections going through that
 endpoint.

 (3) rxrpc_put_service_conn() needs to schedule the reaper when the usage
 count reaches 1, not 0, as idle conns have a 1 count.

 (4) The accumulator for the earliest time we might want to schedule for
 should be initialised to jiffies + MAX_JIFFY_OFFSET, not ULONG_MAX as
 the comparison functions use signed arithmetic.

 (5) Simplify the expiration handling, adding the expiration value to the
 idle timestamp each time rather than keeping track of the time in the
 past before which the idle timestamp must go to be expired.  This is
 much easier to read.

 (6) Ignore the timeouts if the net namespace is dead.

 (7) Restart the service reaper work item rather than the client reaper.
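
Point (4) can be seen in a small userspace sketch.  It assumes the usual
signed-subtraction form of the time comparison and a MAX_JIFFY_OFFSET
defined as roughly half the signed range (written here from memory of the
kernel's definition, so treat it as an assumption).  The helper names are
hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Assumed to mirror the kernel's definition: about half the signed range. */
#define MAX_JIFFY_OFFSET ((LONG_MAX >> 1) - 1)

/* Wrap-safe "a is before b", relying on signed subtraction. */
static int tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Accumulate the earliest of n deadlines, starting from 'init'. */
static unsigned long earliest(unsigned long init, const unsigned long *at, int n)
{
	unsigned long e = init;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		if (tick_before(at[i], e))
			e = at[i];
	return e;
}
```

With now = 1000 and deadlines of 1500 and 1200, starting from
now + MAX_JIFFY_OFFSET yields 1200.  Starting from ULONG_MAX yields
ULONG_MAX, because under signed arithmetic ULONG_MAX looks like a time just
before 0 and therefore already earlier than every real deadline.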

Signed-off-by: David Howells 
---

 include/trace/events/rxrpc.h |2 ++
 net/rxrpc/af_rxrpc.c |   13 +
 net/rxrpc/ar-internal.h  |3 +++
 net/rxrpc/conn_client.c  |2 ++
 net/rxrpc/conn_object.c  |   42 --
 net/rxrpc/net_ns.c   |3 +++
 6 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h
index e98fed6de497..36cb50c111a6 100644
--- a/include/trace/events/rxrpc.h
+++ b/include/trace/events/rxrpc.h
@@ -49,6 +49,7 @@ enum rxrpc_conn_trace {
rxrpc_conn_put_client,
rxrpc_conn_put_service,
rxrpc_conn_queued,
+   rxrpc_conn_reap_service,
rxrpc_conn_seen,
 };
 
@@ -221,6 +222,7 @@ enum rxrpc_congest_change {
EM(rxrpc_conn_put_client,   "PTc") \
EM(rxrpc_conn_put_service,  "PTs") \
EM(rxrpc_conn_queued,   "QUE") \
+   EM(rxrpc_conn_reap_service, "RPs") \
E_(rxrpc_conn_seen, "SEE")
 
 #define rxrpc_client_traces \
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index c0cdcf980ffc..abb524c2b8f8 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -867,6 +867,19 @@ static int rxrpc_release_sock(struct sock *sk)
sock_orphan(sk);
sk->sk_shutdown = SHUTDOWN_MASK;
 
+   /* We want to kill off all connections from a service socket
+* as fast as possible because we can't share these; client
+* sockets, on the other hand, can share an endpoint.
+*/
+   switch (sk->sk_state) {
+   case RXRPC_SERVER_BOUND:
+   case RXRPC_SERVER_BOUND2:
+   case RXRPC_SERVER_LISTENING:
+   case RXRPC_SERVER_LISTEN_DISABLED:
+   rx->local->service_closed = true;
+   break;
+   }
+
spin_lock_bh(&sk->sk_receive_queue.lock);
sk->sk_state = RXRPC_CLOSE;
spin_unlock_bh(&sk->sk_receive_queue.lock);
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index cdcbc798f921..a0082c407005 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -84,6 +84,7 @@ struct rxrpc_net {
unsigned intnr_client_conns;
unsigned intnr_active_client_conns;
boolkill_all_client_conns;
+   boollive;
spinlock_t  client_conn_cache_lock; /* Lock for ->*_client_conns */
spinlock_t  client_conn_discard_lock; /* Prevent multiple discarders */
struct list_headwaiting_client_conns;
@@ -265,6 +266,7 @@ struct rxrpc_local {
rwlock_tservices_lock;  /* lock for services list */
int debug_id;   /* debug ID for printks */
booldead;
+   boolservice_closed; /* Service socket closed */
struct sockaddr_rxrpc   srx;/* local address */
 };
 
@@ -881,6 +883,7 @@ void rxrpc_process_connection(struct work_struct *);
  * conn_object.c
  */
 extern unsigned int rxrpc_connection_expiry;
+extern unsigned int rxrpc_closed_conn_expiry;
 
 struct rxrpc_connection *rxrpc_alloc_connection(gfp_t);
 struct rxrpc_connection *rxrpc_find_connection_rcu(struct rxrpc_local *,
diff --git a/net/rxrpc/conn_client.c b/net/rxrpc/conn_client.c
index 97f6a8de4845..785dfdb9fef1 100644
--- a/net/rxrpc/conn_client.c
+++ b/net/rxrpc/conn_client.c
@@ -1079,6 +1079,8 @@ void rxrpc_discard_expired_client_conns(struct work_struct *work)
expiry = rxrpc_conn_idle_client_expiry;
if (nr_conns > rxrpc_reap_client_connections)
 

[PATCH net-next 10/12] rxrpc: Add keepalive for a call

2017-11-24 Thread David Howells
We need to transmit a packet every so often to act as a keepalive for the
peer (which has a timeout from the last time it received a packet) and also
to prevent any intervening firewalls from closing the route.

Do this by resetting a timer every time we transmit a packet.  If the timer
ever expires, we transmit a PING ACK packet and thereby also elicit a PING
RESPONSE ACK from the other side - which prevents our last-rx timeout from
expiring.

The timer is set to 1/6 of the last-rx timeout so that we can detect the
other side going away if it misses 6 replies in a row.
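
The timer arithmetic can be sketched as follows.  This is a userspace
model under the assumption that every transmission, including the
keepalive ping itself, pushes the keepalive deadline out to one sixth of
the last-rx timeout.  The struct layout and function names are
hypothetical and a counter stands in for sending a PING ACK.

```c
#include <assert.h>

/* Hypothetical, cut-down call state. */
struct call {
	unsigned long rx_timeout;	/* last-rx timeout, in ticks */
	unsigned long keepalive_at;	/* when to send a keepalive ping */
	int pings;			/* counts "transmitted" PING ACKs */
};

/* Any packet transmitted: the peer has heard from us, so push the deadline out. */
static void note_transmit(struct call *c, unsigned long now)
{
	assert(c->rx_timeout >= 6);
	c->keepalive_at = now + c->rx_timeout / 6;
}

/* Timer handler: ping if nothing has been transmitted for rx_timeout/6. */
static void timer_tick(struct call *c, unsigned long now)
{
	if ((long)(now - c->keepalive_at) >= 0) {
		c->pings++;		/* stands in for sending a PING ACK */
		note_transmit(c, now);	/* the ping is itself a transmission */
	}
}
```

With a 600-tick timeout the ping fires after 100 idle ticks, and ordinary
traffic in the meantime postpones it.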

This is particularly necessary for servers where the processing of the
service function may take a significant amount of time.

Signed-off-by: David Howells 
---

 include/trace/events/rxrpc.h |6 ++
 net/rxrpc/ar-internal.h  |1 +
 net/rxrpc/call_event.c   |   10 ++
 net/rxrpc/output.c   |   23 +++
 4 files changed, 40 insertions(+)

diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h
index 84ade8b76a19..e98fed6de497 100644
--- a/include/trace/events/rxrpc.h
+++ b/include/trace/events/rxrpc.h
@@ -141,6 +141,7 @@ enum rxrpc_timer_trace {
rxrpc_timer_exp_ack,
rxrpc_timer_exp_hard,
rxrpc_timer_exp_idle,
+   rxrpc_timer_exp_keepalive,
rxrpc_timer_exp_lost_ack,
rxrpc_timer_exp_normal,
rxrpc_timer_exp_ping,
@@ -152,6 +153,7 @@ enum rxrpc_timer_trace {
rxrpc_timer_set_for_ack,
rxrpc_timer_set_for_hard,
rxrpc_timer_set_for_idle,
+   rxrpc_timer_set_for_keepalive,
rxrpc_timer_set_for_lost_ack,
rxrpc_timer_set_for_normal,
rxrpc_timer_set_for_ping,
@@ -162,6 +164,7 @@ enum rxrpc_timer_trace {
 enum rxrpc_propose_ack_trace {
rxrpc_propose_ack_client_tx_end,
rxrpc_propose_ack_input_data,
+   rxrpc_propose_ack_ping_for_keepalive,
rxrpc_propose_ack_ping_for_lost_ack,
rxrpc_propose_ack_ping_for_lost_reply,
rxrpc_propose_ack_ping_for_params,
@@ -311,6 +314,7 @@ enum rxrpc_congest_change {
EM(rxrpc_timer_exp_ack, "ExpAck") \
EM(rxrpc_timer_exp_hard,"ExpHrd") \
EM(rxrpc_timer_exp_idle,"ExpIdl") \
+   EM(rxrpc_timer_exp_keepalive,   "ExpKA ") \
EM(rxrpc_timer_exp_lost_ack,"ExpLoA") \
EM(rxrpc_timer_exp_normal,  "ExpNml") \
EM(rxrpc_timer_exp_ping,"ExpPng") \
@@ -321,6 +325,7 @@ enum rxrpc_congest_change {
EM(rxrpc_timer_set_for_ack, "SetAck") \
EM(rxrpc_timer_set_for_hard,"SetHrd") \
EM(rxrpc_timer_set_for_idle,"SetIdl") \
+   EM(rxrpc_timer_set_for_keepalive,   "KeepAl") \
EM(rxrpc_timer_set_for_lost_ack,"SetLoA") \
EM(rxrpc_timer_set_for_normal,  "SetNml") \
EM(rxrpc_timer_set_for_ping,"SetPng") \
@@ -330,6 +335,7 @@ enum rxrpc_congest_change {
 #define rxrpc_propose_ack_traces \
EM(rxrpc_propose_ack_client_tx_end, "ClTxEnd") \
EM(rxrpc_propose_ack_input_data,"DataIn ") \
+   EM(rxrpc_propose_ack_ping_for_keepalive, "KeepAlv") \
EM(rxrpc_propose_ack_ping_for_lost_ack, "LostAck") \
EM(rxrpc_propose_ack_ping_for_lost_reply, "LostRpl") \
EM(rxrpc_propose_ack_ping_for_params,   "Params ") \
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 7e7b817c69f0..cdcbc798f921 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -519,6 +519,7 @@ struct rxrpc_call {
unsigned long   ack_lost_at;    /* When ACK is figured as lost */
unsigned long   resend_at;      /* When next resend needs to happen */
unsigned long   ping_at;        /* When next to send a ping */
+   unsigned long   keepalive_at;   /* When next to send a keepalive ping */
unsigned long   expect_rx_by;   /* When we expect to get a packet by */
unsigned long   expect_req_by;  /* When we expect to get a request DATA packet by */
unsigned long   expect_term_by; /* When we expect call termination by */
diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
index c65666b2f39e..bda952ffe6a6 100644
--- a/net/rxrpc/call_event.c
+++ b/net/rxrpc/call_event.c
@@ -366,6 +366,15 @@ void rxrpc_process_call(struct work_struct *work)
set_bit(RXRPC_CALL_EV_ACK_LOST, &call->events);
}
 
+   t = READ_ONCE(call->keepalive_at);
+   if (time_after_eq(now, t)) {
+   trace_rxrpc_timer(call, rxrpc_timer_exp_keepalive, now);
+   cmpxchg(&call->keepalive_at, t, now + MAX_JIFFY_OFFSET);
+   rxrpc_propose_ACK(call, RXRPC_ACK_PING, 0, 0, true, true,
+ rxrpc_propose_ack_ping_for_keepalive);
+   set_bit(RXRPC_CALL_EV_PING, 

[PATCH net-next 12/12] rxrpc: Fix conn expiry timers

2017-11-24 Thread David Howells
Fix the rxrpc connection expiry timers so that connections for closed
AF_RXRPC sockets get deleted in a more timely fashion, freeing up the
transport UDP port much more quickly.

 (1) Replace the delayed work items with work items plus timers so that
 timer_reduce() can be used to shorten them and so that the timer
 doesn't requeue the work item if the net namespace is dead.

 (2) Don't use queue_delayed_work() as that won't alter the timeout if the
 timer is already running.

 (3) Don't rearm the timers if the network namespace is dead.
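
The three points above can be sketched in userspace as follows. This is only a model under stated assumptions: the struct and function names are hypothetical, and timer_reduce_model() merely imitates the documented behaviour of the kernel's timer_reduce(), which starts an idle timer and otherwise only ever moves the expiry earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a one-shot timer; not the kernel's timer_list. */
struct reap_timer_model {
	bool pending;
	unsigned long expires;
};

/* Hypothetical model of the per-namespace state relevant to reaping. */
struct net_model {
	bool live;			/* false once the namespace is being torn down */
	struct reap_timer_model reap_timer;
};

/* Start the timer if idle; otherwise only shorten it, never lengthen it. */
static void timer_reduce_model(struct reap_timer_model *t, unsigned long expires)
{
	if (!t->pending || (long)(expires - t->expires) < 0) {
		t->pending = true;
		t->expires = expires;
	}
}

/* Arm the reap timer, but never for a dead namespace. */
static void set_reap_timer_model(struct net_model *net, unsigned long now,
				 unsigned long idle_expiry)
{
	assert(net != NULL);
	if (net->live)
		timer_reduce_model(&net->reap_timer, now + idle_expiry);
}
```

The point of the design is that a later request with a longer timeout cannot postpone a reap that is already scheduled, while a shorter one takes effect immediately.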

Signed-off-by: David Howells 
---

 net/rxrpc/af_rxrpc.c|2 ++
 net/rxrpc/ar-internal.h |6 --
 net/rxrpc/conn_client.c |   30 +++---
 net/rxrpc/conn_object.c |   28 +---
 net/rxrpc/net_ns.c  |   30 ++
 5 files changed, 68 insertions(+), 28 deletions(-)

diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index abb524c2b8f8..8f7cf4c042be 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -895,6 +895,8 @@ static int rxrpc_release_sock(struct sock *sk)
rxrpc_release_calls_on_socket(rx);
flush_workqueue(rxrpc_workqueue);
rxrpc_purge_queue(&sk->sk_receive_queue);
+   rxrpc_queue_work(&rx->local->rxnet->service_conn_reaper);
+   rxrpc_queue_work(&rx->local->rxnet->client_conn_reaper);
 
rxrpc_put_local(rx->local);
rx->local = NULL;
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index a0082c407005..416688381eb7 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -79,7 +79,8 @@ struct rxrpc_net {
struct list_head    conn_proc_list; /* List of conns in this namespace for proc */
struct list_head    service_conns;  /* Service conns in this namespace */
rwlock_t            conn_lock;      /* Lock for ->conn_proc_list, ->service_conns */
-   struct delayed_work service_conn_reaper;
+   struct work_struct  service_conn_reaper;
+   struct timer_list   service_conn_reap_timer;
 
unsigned int        nr_client_conns;
unsigned int        nr_active_client_conns;
@@ -90,7 +91,8 @@ struct rxrpc_net {
struct list_head    waiting_client_conns;
struct list_head    active_client_conns;
struct list_head    idle_client_conns;
-   struct delayed_work client_conn_reaper;
+   struct work_struct  client_conn_reaper;
+   struct timer_list   client_conn_reap_timer;
 
struct list_head    local_endpoints;
struct mutex        local_mutex;    /* Lock for ->local_endpoints */
diff --git a/net/rxrpc/conn_client.c b/net/rxrpc/conn_client.c
index 785dfdb9fef1..7f74ca3059f8 100644
--- a/net/rxrpc/conn_client.c
+++ b/net/rxrpc/conn_client.c
@@ -691,7 +691,7 @@ int rxrpc_connect_call(struct rxrpc_call *call,
 
_enter("{%d,%lx},", call->debug_id, call->user_call_ID);
 
-   rxrpc_discard_expired_client_conns(&rxnet->client_conn_reaper.work);
+   rxrpc_discard_expired_client_conns(&rxnet->client_conn_reaper);
rxrpc_cull_active_client_conns(rxnet);
 
ret = rxrpc_get_client_conn(call, cp, srx, gfp);
@@ -757,6 +757,18 @@ void rxrpc_expose_client_call(struct rxrpc_call *call)
 }
 
 /*
+ * Set the reap timer.
+ */
+static void rxrpc_set_client_reap_timer(struct rxrpc_net *rxnet)
+{
+   unsigned long now = jiffies;
+   unsigned long reap_at = now + rxrpc_conn_idle_client_expiry;
+
+   if (rxnet->live)
+   timer_reduce(&rxnet->client_conn_reap_timer, reap_at);
+}
+
+/*
  * Disconnect a client call.
  */
 void rxrpc_disconnect_client_call(struct rxrpc_call *call)
@@ -896,9 +908,7 @@ void rxrpc_disconnect_client_call(struct rxrpc_call *call)
list_move_tail(&conn->cache_link, &rxnet->idle_client_conns);
if (rxnet->idle_client_conns.next == &conn->cache_link &&
!rxnet->kill_all_client_conns)
-   queue_delayed_work(rxrpc_workqueue,
-  &rxnet->client_conn_reaper,
-  rxrpc_conn_idle_client_expiry);
+   rxrpc_set_client_reap_timer(rxnet);
} else {
trace_rxrpc_client(conn, channel, rxrpc_client_to_inactive);
conn->cache_state = RXRPC_CONN_CLIENT_INACTIVE;
@@ -1036,8 +1046,7 @@ void rxrpc_discard_expired_client_conns(struct 
work_struct *work)
 {
struct rxrpc_connection *conn;
struct rxrpc_net *rxnet =
-   container_of(to_delayed_work(work),
-struct rxrpc_net, client_conn_reaper);
+   container_of(work, struct rxrpc_net, client_conn_reaper);
unsigned long expiry, conn_expires_at, now;
unsigned int nr_conns;
bool did_discard = false;
@@ -1116,9 +1125,8 @@ void rxrpc_discard_expired_client_conns(struct 
work_struct *work)
 */

[PATCH net-next 08/12] rxrpc: Express protocol timeouts in terms of RTT

2017-11-24 Thread David Howells
Express protocol timeouts for data retransmission and deferred ack
generation in terms of RTT rather than specified timeouts once we have
sufficient RTT samples.

For the moment, this requires just one RTT sample to be able to use this
for ack deferral and two for data retransmission.

The data retransmission timeout is set at RTT*1.5 and the ACK deferral
timeout is set at RTT.

Note that the calculated timeout is limited to a minimum of 4ns to make
sure it doesn't happen too quickly.
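
The selection rules above can be sketched in userspace as follows. This is a sketch only: struct peer_model, the function names, and the explicit fallback_ns and floor_ns parameters are hypothetical stand-ins for the configured timeouts and the minimum mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the per-peer RTT state; not the kernel struct. */
struct peer_model {
	unsigned int rtt_usage;	/* number of RTT samples collected so far */
	uint64_t rtt_ns;	/* current RTT estimate in nanoseconds */
};

/*
 * Data retransmission: RTT * 1.5 once there are at least two samples,
 * otherwise the configured fallback; never below the given floor.
 */
static uint64_t resend_timeout_ns(const struct peer_model *p,
				  uint64_t fallback_ns, uint64_t floor_ns)
{
	uint64_t t;

	assert(p->rtt_ns <= UINT64_MAX / 3);	/* keep rtt * 3 from overflowing */
	t = p->rtt_usage > 1 ? p->rtt_ns * 3 / 2 : fallback_ns;
	return t < floor_ns ? floor_ns : t;
}

/* ACK deferral: one RTT once there is at least one sample. */
static uint64_t ack_delay_ns(const struct peer_model *p, uint64_t fallback_ns)
{
	return p->rtt_usage > 0 ? p->rtt_ns : fallback_ns;
}
```

For a peer with two samples and a 1ms RTT, the resend timeout comes out at 1.5ms; with only one sample the fallback is used for retransmission, but the ACK delay already follows the RTT.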

Signed-off-by: David Howells 
---

 net/rxrpc/call_event.c |   22 ++
 net/rxrpc/sendmsg.c|7 +++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
index c14395d5ad8c..da91f16ac77c 100644
--- a/net/rxrpc/call_event.c
+++ b/net/rxrpc/call_event.c
@@ -52,7 +52,7 @@ static void __rxrpc_propose_ACK(struct rxrpc_call *call, u8 
ack_reason,
enum rxrpc_propose_ack_trace why)
 {
enum rxrpc_propose_ack_outcome outcome = rxrpc_propose_ack_use;
-   unsigned long now, ack_at, expiry = rxrpc_soft_ack_delay;
+   unsigned long expiry = rxrpc_soft_ack_delay;
s8 prior = rxrpc_ack_priority[ack_reason];
 
/* Pings are handled specially because we don't want to accidentally
@@ -116,7 +116,13 @@ static void __rxrpc_propose_ACK(struct rxrpc_call *call, 
u8 ack_reason,
background)
rxrpc_queue_call(call);
} else {
-   now = jiffies;
+   unsigned long now = jiffies, ack_at;
+
+   if (call->peer->rtt_usage > 0)
+   ack_at = nsecs_to_jiffies(call->peer->rtt);
+   else
+   ack_at = expiry;
+
ack_at = jiffies + expiry;
if (time_before(ack_at, call->ack_at)) {
WRITE_ONCE(call->ack_at, ack_at);
@@ -160,14 +166,22 @@ static void rxrpc_resend(struct rxrpc_call *call, 
unsigned long now_j)
struct sk_buff *skb;
unsigned long resend_at;
rxrpc_seq_t cursor, seq, top;
-   ktime_t now, max_age, oldest, ack_ts;
+   ktime_t now, max_age, oldest, ack_ts, timeout, min_timeo;
int ix;
u8 annotation, anno_type, retrans = 0, unacked = 0;
 
_enter("{%d,%d}", call->tx_hard_ack, call->tx_top);
 
+   if (call->peer->rtt_usage > 1)
+   timeout = ns_to_ktime(call->peer->rtt * 3 / 2);
+   else
+   timeout = ms_to_ktime(rxrpc_resend_timeout);
+   min_timeo = ns_to_ktime((1000000000 / HZ) * 4);
+   if (ktime_before(timeout, min_timeo))
+   timeout = min_timeo;
+
now = ktime_get_real();
-   max_age = ktime_sub_ms(now, rxrpc_resend_timeout * 1000 / HZ);
+   max_age = ktime_sub(now, timeout);
 
spin_lock_bh(&call->lock);
 
diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c
index 03e0676db28c..c56ee54fdd1f 100644
--- a/net/rxrpc/sendmsg.c
+++ b/net/rxrpc/sendmsg.c
@@ -226,6 +226,13 @@ static void rxrpc_queue_packet(struct rxrpc_sock *rx, 
struct rxrpc_call *call,
} else {
unsigned long now = jiffies, resend_at;
 
+   if (call->peer->rtt_usage > 1)
+   resend_at = nsecs_to_jiffies(call->peer->rtt * 3 / 2);
+   else
+   resend_at = rxrpc_resend_timeout;
+   if (resend_at < 1)
+   resend_at = 1;
+
resend_at = now + rxrpc_resend_timeout;
WRITE_ONCE(call->resend_at, resend_at);
rxrpc_reduce_call_timer(call, resend_at, now,



[PATCH v2] VSOCK: Don't call vsock_stream_has_data in atomic context

2017-11-24 Thread Jorgen Hansen
When using the host personality, VMCI will grab a mutex for any
queue pair access. In the detach callback for the vmci vsock
transport, we call vsock_stream_has_data while holding a spinlock,
and vsock_stream_has_data will access a queue pair.

To avoid this, we can simply omit calling vsock_stream_has_data
for host side queue pairs, since the QPs are empty by default
when the guest has detached.
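
A minimal userspace sketch of that short-circuit, assuming a hypothetical struct sock_model and a counter that records whether the potentially blocking query was made. MODEL_CID_HOST mirrors the value of VMADDR_CID_HOST; everything else is illustrative and not taken from vmci_transport.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CID_HOST 2u	/* same value as VMADDR_CID_HOST */

/* Hypothetical model of a vsock endpoint; not the kernel struct. */
struct sock_model {
	unsigned int local_cid;	/* context ID of the local endpoint */
	long queued;		/* bytes left in the consume queue */
	int has_data_calls;	/* how often the blocking query was made */
};

/* Stand-in for the queue pair query that may sleep on the host side. */
static long stream_has_data_model(struct sock_model *s)
{
	s->has_data_calls++;
	return s->queued;
}

/*
 * On detach, close at once for host endpoints without touching the
 * queue pair; only guest endpoints ask whether data is still queued.
 */
static bool should_close_on_detach(struct sock_model *s)
{
	assert(s != NULL);
	return s->local_cid == MODEL_CID_HOST || stream_has_data_model(s) <= 0;
}
```

Because `||` short-circuits, the query is never reached for a host endpoint, which is what keeps the sleeping call out of the atomic section.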

This bug affects users of VMware Workstation using kernel version
4.4 and later.

Testing: Ran vsock tests between guest and host, and verified that
with this change, the host isn't calling vsock_stream_has_data
during detach. Ran mixedTest between guest and host using both
guest and host as server.

v2: Rebased on top of recent change to sk_state values
Reviewed-by: Adit Ranadive 
Reviewed-by: Aditya Sarwade 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Jorgen Hansen 
---
 net/vmw_vsock/vmci_transport.c |   10 +++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index 391775e..56573dc 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -797,9 +797,13 @@ static void vmci_transport_handle_detach(struct sock *sk)
 
/* We should not be sending anymore since the peer won't be
 * there to receive, but we can still receive if there is data
-* left in our consume queue.
+* left in our consume queue. If the local endpoint is a host,
+* we can't call vsock_stream_has_data, since that may block,
+* but a host endpoint can't read data once the VM has
+* detached, so there is no available data in that case.
 */
-   if (vsock_stream_has_data(vsk) <= 0) {
+   if (vsk->local_addr.svm_cid == VMADDR_CID_HOST ||
+   vsock_stream_has_data(vsk) <= 0) {
sk->sk_state = TCP_CLOSE;
 
if (sk->sk_state == TCP_SYN_SENT) {
@@ -2144,7 +2148,7 @@ static void __exit vmci_transport_exit(void)
 
 MODULE_AUTHOR("VMware, Inc.");
 MODULE_DESCRIPTION("VMCI transport for Virtual Sockets");
-MODULE_VERSION("1.0.4.0-k");
+MODULE_VERSION("1.0.5.0-k");
 MODULE_LICENSE("GPL v2");
 MODULE_ALIAS("vmware_vsock");
 MODULE_ALIAS_NETPROTO(PF_VSOCK);
-- 
1.7.0



[PATCH] atm: lanai: use setup_timer instead of init_timer

2017-11-24 Thread Colin King
From: Colin Ian King 

Use setup_timer function instead of initializing timer with the
function and data fields.
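
For readers unfamiliar with the pattern, the following userspace sketch shows what folding the separate field assignments into one setup call amounts to. struct timer_model and its helpers are hypothetical and only imitate the shape of the kernel interface; casting a pointer through unsigned long follows the kernel idiom and assumes the two have the same size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a timer with a callback and an opaque argument. */
struct timer_model {
	void (*function)(unsigned long);
	unsigned long data;
	unsigned long expires;
};

/* One call sets the callback and its argument together. */
static void timer_model_setup(struct timer_model *t,
			      void (*fn)(unsigned long), unsigned long data)
{
	assert(fn != NULL);
	t->function = fn;
	t->data = data;
	t->expires = 0;
}

/* Simulate expiry by invoking the callback with its argument. */
static void timer_model_fire(struct timer_model *t)
{
	t->function(t->data);
}

/* Example callback: data carries a pointer to an int counter. */
static void count_poll_model(unsigned long data)
{
	int *counter = (int *)data;

	(*counter)++;
}
```

Setting both fields in one place makes it harder to arm a timer whose callback or argument was left uninitialized.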

Signed-off-by: Colin Ian King 
---
 drivers/atm/lanai.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/atm/lanai.c b/drivers/atm/lanai.c
index 2351dad78ff5..87e8b5dfac39 100644
--- a/drivers/atm/lanai.c
+++ b/drivers/atm/lanai.c
@@ -1790,10 +1790,8 @@ static void lanai_timed_poll(unsigned long arg)
 
 static inline void lanai_timed_poll_start(struct lanai_dev *lanai)
 {
-   init_timer(&lanai->timer);
+   setup_timer(&lanai->timer, lanai_timed_poll, (unsigned long)lanai);
lanai->timer.expires = jiffies + LANAI_POLL_PERIOD;
-   lanai->timer.data = (unsigned long) lanai;
-   lanai->timer.function = lanai_timed_poll;
add_timer(&lanai->timer);
 }
 
-- 
2.14.1



[PATCH] atm: firestream: use setup_timer instead of init_timer

2017-11-24 Thread Colin King
From: Colin Ian King 

Use setup_timer function instead of initializing timer with the
function and data fields.

Signed-off-by: Colin Ian King 
---
 drivers/atm/firestream.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/atm/firestream.c b/drivers/atm/firestream.c
index 6b6368a56526..534001270be5 100644
--- a/drivers/atm/firestream.c
+++ b/drivers/atm/firestream.c
@@ -1885,9 +1885,7 @@ static int fs_init(struct fs_dev *dev)
}
 
 #ifdef FS_POLL_FREQ
-   init_timer (&dev->timer);
-   dev->timer.data = (unsigned long) dev;
-   dev->timer.function = fs_poll;
+   setup_timer(&dev->timer, fs_poll, (unsigned long)dev);
dev->timer.expires = jiffies + FS_POLL_FREQ;
add_timer (&dev->timer);
 #endif
-- 
2.14.1



Re: [PATCH v2] net: sched: crash on blocks with goto chain action

2017-11-24 Thread Jiri Pirko
Fri, Nov 24, 2017 at 12:27:58PM CET, c...@rkapl.cz wrote:
>tcf_block_put_ext has assumed that all filters (and thus their goto
>actions) are destroyed in RCU callback and thus can not race with our
>list iteration. However, that is not true during netns cleanup (see
>tcf_exts_get_net comment).
>
>Prevent the use-after-free by holding all chains (except 0, that one is
>already held). foreach_safe is not enough in this case.
>
>To reproduce, run the following in a netns and then delete the ns:
>ip link add dtest type dummy
>tc qdisc add dev dtest ingress
>tc filter add dev dtest chain 1 parent ffff: handle 1 prio 1 flower action goto chain 2
>
>Fixes: 822e86d997 ("net_sched: remove tcf_block_put_deferred()")
>Signed-off-by: Roman Kapl 
>---
>v1 -> v2: Hold all chains instead of just the currently iterated one,
>  the code should be more clear this way.
>---
> net/sched/cls_api.c | 17 -
> 1 file changed, 12 insertions(+), 5 deletions(-)
>
>diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>index 7d97f612c9b9..ddcf04b4ab43 100644
>--- a/net/sched/cls_api.c
>+++ b/net/sched/cls_api.c
>@@ -336,7 +336,8 @@ static void tcf_block_put_final(struct work_struct *work)
>   struct tcf_chain *chain, *tmp;
> 
>   rtnl_lock();
>-  /* Only chain 0 should be still here. */
>+
>+  /* At this point, all the chains should have refcnt == 1. */
>   list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
>   tcf_chain_put(chain);
>   rtnl_unlock();
>@@ -344,15 +345,21 @@ static void tcf_block_put_final(struct work_struct *work)
> }
> 
> /* XXX: Standalone actions are not allowed to jump to any chain, and bound
>- * actions should be all removed after flushing. However, filters are now
>- * destroyed in tc filter workqueue with RTNL lock, they can not race here.
>+ * actions should be all removed after flushing.
>  */
> void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
>  struct tcf_block_ext_info *ei)
> {
>-  struct tcf_chain *chain, *tmp;
>+  struct tcf_chain *chain;
> 
>-  list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
>+  /* Hold a refcnt for all chains, except 0, so that they don't disappear
>+   * while we are iterating.

Would be perhaps nice to mention that the appropriate tcf_chain_put
is done in tcf_block_put_final()

Regardless of this:

Acked-by: Jiri Pirko 



>+   */
>+  list_for_each_entry(chain, &block->chain_list, list)
>+  if (chain->index)
>+  tcf_chain_hold(chain);
>+
>+  list_for_each_entry(chain, &block->chain_list, list)
>   tcf_chain_flush(chain);
> 
>   tcf_block_offload_unbind(block, q, ei);
>-- 
>2.15.0
>


Re: [RFC 0/9] net: create adaptive software irq moderation library

2017-11-24 Thread Saeed Mahameed
On Fri, Nov 24, 2017 at 4:05 AM, Saeed Mahameed
 wrote:
> On Sun, Nov 5, 2017 at 9:44 PM, Andy Gospodarek  wrote:
>> From: Andy Gospodarek 
>>
>> This RFC converts the adaptive interrupt moderation library from the
>> mlx5_en driver into a library so it can be used by any driver.  The last
>> patch in this set adds support for interrupt moderation in the bnxt_en
>> driver.
>>
>> The main purpose of this code in the mlx5_en driver is to allow an
>> administrator to make sure that default coalesce settings are optimized
>> for low latency, but quickly adapt to handle high throughput traffic and
>> optimize how many packets are received during each napi poll.
>>
>> For any new driver the following changes would be needed to use this
>> library:
>>
>> - add elements in ring struct to track items needed by this library
>> - create function that can be called to actually set coalesce settings
>>   for the driver
>>
>> My main reason for making this an RFC is that I would like verification
>> from Mellanox that the performance of their driver does not change in an
>> unintended way.  I did some basic testing (netperf) and did not note a
>> statistically significant change in throughput or CPU utilization before
>> and after this set.
>>
>> Credit to Rob Rice and Lee Reed for doing some of the initial proof of
>> concept and testing for this patch.
>
> Hi Andy,
>
> Following our conversation in netdev 2.2,  i would like to suggest the
> following:
>
> Instead of introducing a new API which demands from the driver to
> provide callbacks and function pointers to the adaptive moderation
> logic, which might be called on every irq interrupt, and to avoid
> performance hit, we can move the generic code and the core adaptive
> moderation logic to a header file.
>

I would also like to suggest adding Tal Gilboa as the official
maintainer for this new file, as he is the current maintainer and
co-author of this feature in mlx5.

> the mlx5e am logic and data structures are already written in a very
> modular way and can be stripped out of mlx5e fairly easily.
> And i would like to suggest to do it in the following manner:
>
> 1. naming convention:
> I would like to change the generic code naming convention to have the
> words DIM (Dynamically-Tuned Interrupt Moderation) instead of mlx5e_am
> or am, Following our public blog [1] of the matter and the official
> name we prefer for this feature.
>
> [1] https://community.mellanox.com/docs/DOC-2511
>
> Suggested naming convention instead of rx_am:  net_dim (DIM for net
> applications).
> As the rx_am or (dim) logic can be applied to other applications.
>
> 2. Data types:
>
> All below mlx5e am data types can be used as is as they hold nothing
> mlx5 related.
>
> struct mlx5e_rx_am_sample
>   - Holds the current stats sample with ktime stamp
>   - rename to: net_dim_sample
>
> struct mlx5e_rx_am_stats
>  - Holds the needed stats (delta) calculation of last 2 samples
>  - rename to: net_dim_stats
>
> struct mlx5e_rx_am
>  - Adaptive moderation handle
>  - rename to: net_dim
>
> 3. static inline generic functions API (based on the usage from
> mlx5e_rx_am function)
>
> //Make a DIM measurement:
> net_dim_sample(struct *net_dim_sample sample, packets, bytes, event_ctr)
> - previously mlx5e_am_sample()
> - Fills a sample struct with the provided stats and the current timestamp
>
>
> //start a new DIM measurement and handles the DIM state machine initial state:
> net_dim_start_sample(struct *net_dim rx_dim)
>  - Makes a new measurement
>  - stores it into rx_dim->start
>  - rx_dim->state = DIM_MEASURE_IN_PROGRESS
>
>
> // Takes a new sample (curr_sample) and makes the decision (handles
> DIM_MEASURE_IN_PROGRESS state)
> net_dim_decision(struct *net_dim rx_dim, curr_sample)
>   -  previously mlx5e_am_decision
>   - Note, instead of providing the current_stats (delta between start
> and current_sample) I suggest to provide the current_sample and move
> the stats calculation logic into net_dim_decision.
>- All the logic in this function will move to the generic code.
>
> 4. Driver implementation: (according to the above suggested API)
>-  Driver should initialize struct net_dim rx_dim, and provide a
> work function to handle "dim apply new profile" decision.
>- in napi_poll driver should implement the rx_dim state machine
> using the above API before arming the completion event queues as
> follows:
>
> mlx5e_rx_am:
>
> void mlx5e_rx_am(struct mlx5e_rq *rq)
> {
>struct net_dim *rx_dim = &rq->dim;
>struct net_dim_sample end_sample;
>u16 nevents;
>
>switch (rx_dim->state) {
>case DIM_MEASURE_IN_PROGRESS:
>// driver specific pre condition to decide whether to
> continue or skip
>// Note that here we only sample and don't calc the delta
> stats, this logic moved into net_dim_decision
>net_dim_sample(rq, &end_sample, rq->packets, 

Re: [PATCH net] net: qmi_wwan: add support for Cinterion PLS8

2017-11-24 Thread Oliver Graute
On 24/11/17, Reinhard Speyerer wrote:
> before posting this problem report
> https://developer.gemalto.com/threads/ipv6dualstack-problems-pls8-e-revision-03017
> in the Gemalto developer forum I tested the qmi_wwan/cdc_ether changes
> you suggested above and apart from having two working QMI interfaces
> the IPv6/dualstack problems observed with AT^SWWAN/cdc_ether were
> also gone when using WDSStartNetworkInterface and the QMI interface in
> raw IP mode instead.

Thanks for sharing this information. IPv6 with the PLS8-E is also a
topic on our side.

Best Regards,

Oliver


Re: [PATCH net] net: qmi_wwan: add support for Cinterion PLS8

2017-11-24 Thread Oliver Graute
On 23/11/17, Bjørn Mork wrote:
> 
> This is also consistent with the Windows drivers.  And being a proper
> CDC ECM class function, it should Just Work with the cdc_ether driver.
> Except for the "RmNet" part, which I guess is the reason you want to
> add this device to qmi_wwan.  Which is fine, *if* we can be reasonably
> certain that it does support QMI.  The description string is a strong
> indication, but it would be even better to know this was tested.
> 
> But adding this to qmi_wwan is not enough.  You also need to add a
> blacklist entry to cdc_ether.  Both should use a device+class match,
> similar to the Novatel entries.  This will make the interface numbering
> irrelevant, and will allow a single entry to match both QMI/rmnet
> functions.

ok I tried it this way:

+++ b/drivers/net/usb/cdc_ether.c
@@ -562,6 +562,7 @@ static void usbnet_cdc_zte_status(struct usbnet *dev, 
struct urb *urb)
 #define MICROSOFT_VENDOR_ID    0x045e
 #define UBLOX_VENDOR_ID        0x1546
 #define TPLINK_VENDOR_ID       0x2357
+#define CINTERION_VENDOR_ID    0x1e2d
 
 static const struct usb_device_id  products[] = {
 /* BLACKLIST !!
@@ -821,6 +822,13 @@ static void usbnet_cdc_zte_status(struct usbnet *dev, 
struct urb *urb)
.driver_info = 0,
 },
 
+/* Cinterion PLS8 - handled by qmi_wwan */
+{
+   USB_DEVICE_AND_INTERFACE_INFO(CINTERION_VENDOR_ID, 0x0061, USB_CLASS_COMM,
+   USB_CDC_SUBCLASS_ETHERNET, USB_CDC_PROTO_NONE),
+   .driver_info = 0,
+},
+
 /* WHITELIST!!!
  *
  * CDC Ether uses two interfaces, not necessarily consecutive.
diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 720a3a2..93e102e 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -1221,6 +1221,7 @@ static int qmi_wwan_resume(struct usb_interface *intf)
{QMI_FIXED_INTF(0x0b3c, 0xc00a, 6)},/* Olivetti Olicard 160 */
{QMI_FIXED_INTF(0x0b3c, 0xc00b, 4)},/* Olivetti Olicard 500 */
{QMI_FIXED_INTF(0x1e2d, 0x0060, 4)},/* Cinterion PLxx */
+   {QMI_FIXED_INTF(0x1e2d, 0x0061, 3)},/* Cinterion PLS8 LTE */
{QMI_FIXED_INTF(0x1e2d, 0x0053, 4)},/* Cinterion PHxx,PXxx */
{QMI_FIXED_INTF(0x1e2d, 0x0082, 4)},/* Cinterion PHxx,PXxx (2 RmNet) */
{QMI_FIXED_INTF(0x1e2d, 0x0082, 5)},/* Cinterion PHxx,PXxx (2 RmNet) */

But now I'm missing a ttyACM4 interface, and the cdc_ether registration
is not working anymore.

[  124.310611] usb 2-1: new high-speed USB device number 2 using ci_hdrc
[  124.457029] usb 2-1: New USB device found, idVendor=1e2d, idProduct=0061
[  124.463938] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[  124.471307] usb 2-1: Product: LTE Modem
[  124.475278] usb 2-1: Manufacturer: Cinterion
[  124.536219] cdc_acm 2-1:1.0: ttyACM0: USB ACM device
[  124.563155] cdc_acm 2-1:1.2: ttyACM1: USB ACM device
[  124.589625] cdc_acm 2-1:1.4: ttyACM2: USB ACM device
[  124.613517] cdc_acm 2-1:1.6: ttyACM3: USB ACM device

in my working old setup with kernel 3.9.11 it looks like this:

[  129.710622] usb 2-1: new high-speed USB device number 2 using ci_hdrc
[  129.873985] usb 2-1: New USB device found, idVendor=1e2d, idProduct=0061
[  129.888573] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[  129.902973] usb 2-1: Product: LTE Modem
[  129.906927] usb 2-1: Manufacturer: Cinterion
[  129.928389] cdc_acm 2-1:1.0: ttyACM0: USB ACM device
[  129.959324] cdc_acm 2-1:1.2: ttyACM1: USB ACM device
[  129.992714] cdc_acm 2-1:1.4: ttyACM2: USB ACM device
[  130.019416] cdc_acm 2-1:1.6: ttyACM3: USB ACM device
[  130.045248] cdc_acm 2-1:1.8: This device cannot do calls on its own. It is not a modem.
[  130.073929] cdc_acm 2-1:1.8: ttyACM4: USB ACM device
[  130.100982] cdc_ether 2-1:1.10 usb0: register 'cdc_ether' at usb-ci_hdrc.1-1, CDC Ethernet Device, de:ad:be:ef:00:00
[  130.136438] cdc_ether 2-1:1.12 usb1: register 'cdc_ether' at usb-ci_hdrc.1-1, CDC Ethernet Device, de:ad:be:ef:00:01

Any clue what I'm doing wrong here?

Best regards,

Oliver


Re: [RFC 0/9] net: create adaptive software irq moderation library

2017-11-24 Thread Saeed Mahameed
On Sun, Nov 5, 2017 at 9:44 PM, Andy Gospodarek  wrote:
> From: Andy Gospodarek 
>
> This RFC converts the adaptive interrupt moderation library from the
> mlx5_en driver into a library so it can be used by any driver.  The last
> patch in this set adds support for interrupt moderation in the bnxt_en
> driver.
>
> The main purpose of this code in the mlx5_en driver is to allow an
> administrator to make sure that default coalesce settings are optimized
> for low latency, but quickly adapt to handle high throughput traffic and
> optimize how many packets are received during each napi poll.
>
> For any new driver the following changes would be needed to use this
> library:
>
> - add elements in ring struct to track items needed by this library
> - create function that can be called to actually set coalesce settings
>   for the driver
>
> My main reason for making this an RFC is that I would like verification
> from Mellanox that the performance of their driver does not change in an
> unintended way.  I did some basic testing (netperf) and did not note a
> statistically significant change in throughput or CPU utilization before
> and after this set.
>
> Credit to Rob Rice and Lee Reed for doing some of the initial proof of
> concept and testing for this patch.

Hi Andy,

Following our conversation at netdev 2.2, I would like to suggest the
following:

Instead of introducing a new API that requires the driver to provide
callbacks and function pointers to the adaptive moderation logic, which
might be called on every interrupt and so risks a performance hit, we
can move the generic code and the core adaptive moderation logic to a
header file.

The mlx5e am logic and data structures are already written in a very
modular way and can be stripped out of mlx5e fairly easily.
I would like to suggest doing it in the following manner:

1. naming convention:
I would like to change the generic code naming convention to use the
term DIM (Dynamically-Tuned Interrupt Moderation) instead of mlx5e_am
or am, following our public blog post [1] on the matter and the
official name we prefer for this feature.

[1] https://community.mellanox.com/docs/DOC-2511

Suggested naming convention instead of rx_am: net_dim (DIM for net
applications), since the rx_am (DIM) logic can also be applied to
other applications.

2. Data types:

All of the mlx5e am data types below can be used as is, since they hold
nothing mlx5-specific.

struct mlx5e_rx_am_sample
  - Holds the current stats sample with ktime stamp
  - rename to: net_dim_sample

struct mlx5e_rx_am_stats
 - Holds the needed stats (delta) calculation of last 2 samples
 - rename to: net_dim_stats

struct mlx5e_rx_am
 - Adaptive moderation handle
 - rename to: net_dim

3. static inline generic functions API (based on the usage from
mlx5e_rx_am function)

// Make a DIM measurement:
net_dim_sample(struct net_dim_sample *sample, packets, bytes, event_ctr)
- previously mlx5e_am_sample()
- Fills a sample struct with the provided stats and the current timestamp


// Start a new DIM measurement and handle the DIM state machine's initial state:
net_dim_start_sample(struct net_dim *rx_dim)
 - Makes a new measurement
 - stores it into rx_dim->start
 - rx_dim->state = DIM_MEASURE_IN_PROGRESS


// Take a new sample (curr_sample) and make the decision (handles the
// DIM_MEASURE_IN_PROGRESS state):
net_dim_decision(struct net_dim *rx_dim, curr_sample)
  - previously mlx5e_am_decision
  - Note: instead of providing current_stats (the delta between start
and current_sample), I suggest providing current_sample and moving
the stats calculation logic into net_dim_decision.
   - All the logic in this function will move to the generic code.
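
To make the sample/stats split in item 3 concrete, here is a userspace sketch of taking two samples and deriving per-millisecond rates from their delta. All names are hypothetical (this is not the proposed net_dim API), the timestamp is passed in instead of being read from a clock, and plain integer division is assumed where a real implementation may round differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one measurement; not the proposed kernel struct. */
struct dim_sample_model {
	uint64_t time_ns;
	uint64_t packets;
	uint64_t bytes;
	uint64_t events;
};

/* Rates derived from two samples, expressed per millisecond. */
struct dim_stats_model {
	uint64_t ppms;	/* packets per ms */
	uint64_t bpms;	/* bytes per ms */
	uint64_t epms;	/* events per ms */
};

/* Fill a sample with the provided counters and timestamp. */
static void dim_take_sample(struct dim_sample_model *s, uint64_t now_ns,
			    uint64_t packets, uint64_t bytes, uint64_t events)
{
	s->time_ns = now_ns;
	s->packets = packets;
	s->bytes = bytes;
	s->events = events;
}

/* Compute the delta stats; returns false if no time has elapsed. */
static bool dim_calc_stats(const struct dim_sample_model *start,
			   const struct dim_sample_model *end,
			   struct dim_stats_model *out)
{
	uint64_t delta_us;

	assert(end->time_ns >= start->time_ns);
	delta_us = (end->time_ns - start->time_ns) / 1000;
	if (delta_us == 0)
		return false;

	out->ppms = (end->packets - start->packets) * 1000 / delta_us;
	out->bpms = (end->bytes - start->bytes) * 1000 / delta_us;
	out->epms = (end->events - start->events) * 1000 / delta_us;
	return true;
}
```

Keeping the delta calculation next to the decision logic means the driver only has to hand over raw counters, which is the point of the suggestion above.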

4. Driver implementation: (according to the above suggested API)
   -  Driver should initialize struct net_dim rx_dim, and provide a
work function to handle "dim apply new profile" decision.
   - in napi_poll driver should implement the rx_dim state machine
using the above API before arming the completion event queues as
follows:

mlx5e_rx_am:

void mlx5e_rx_am(struct mlx5e_rq *rq)
{
   struct net_dim *rx_dim = &rq->dim;
   struct net_dim_sample end_sample;
   u16 nevents;

   switch (rx_dim->state) {
   case DIM_MEASURE_IN_PROGRESS:
      // driver-specific precondition to decide whether to continue or skip
      // Note that here we only sample and don't calculate the delta
      // stats; that logic moved into net_dim_decision
      net_dim_sample(rq, &end_sample, rq->packets, rq->bytes, cq->events);
      if (net_dim_decision(rx_dim, &end_sample)) {
         rx_dim->state = DIM_APPLY_NEW_PROFILE;
         schedule_work(&rx_dim->work);
         break;
      }
      /* fall through */
   case DIM_START_MEASURE:
      net_dim_start_sample(rx_dim);
      break;
   case DIM_APPLY_NEW_PROFILE:
      break;
   }
}

Thanks,
Saeed.

>
> Andy Gospodarek (9):
>   mlx5_en: move interrupt moderation structs to new file
>   

[PATCH net-next] net: thunderx: Set max queue count taking XDP_TX into account

2017-11-24 Thread Aleksey Makarov
From: Sunil Goutham 

On T81 there are only 4 cores, so setting the max queue count to 4
would leave nothing for XDP_TX. This patch fixes this by doubling
the max queue count in such cases.

Signed-off-by: Sunil Goutham 
Signed-off-by: cjacob 
Signed-off-by: Aleksey Makarov 
---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index b82e28262c57..52b3a6044f85 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1891,6 +1891,11 @@ static int nicvf_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
nic->pdev = pdev;
nic->pnicvf = nic;
nic->max_queues = qcount;
+   /* If no of CPUs are too low, there won't be any queues left
+* for XDP_TX, hence double it.
+*/
+   if (!nic->t88)
+   nic->max_queues *= 2;
 
/* MAP VF's configuration registers */
nic->reg_base = pcim_iomap(pdev, PCI_CFG_REG_BAR_NUM, 0);
-- 
2.15.0



[PATCH net-next] net: thunderx: Add support for xdp redirect

2017-11-24 Thread Aleksey Makarov
From: Sunil Goutham 

This patch adds support for XDP_REDIRECT. Flush is not
yet supported.

Signed-off-by: Sunil Goutham 
Signed-off-by: cjacob 
Signed-off-by: Aleksey Makarov 
---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   | 110 -
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c |  11 ++-
 drivers/net/ethernet/cavium/thunder/nicvf_queues.h |   4 +
 3 files changed, 94 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index a063c36c4c58..b82e28262c57 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -65,6 +65,11 @@ module_param(cpi_alg, int, S_IRUGO);
 MODULE_PARM_DESC(cpi_alg,
 "PFC algorithm (0=none, 1=VLAN, 2=VLAN16, 3=IP Diffserv)");
 
+struct nicvf_xdp_tx {
+   u64 dma_addr;
+   u8  qidx;
+};
+
 static inline u8 nicvf_netdev_qidx(struct nicvf *nic, u8 qidx)
 {
if (nic->sqs_mode)
@@ -500,14 +505,29 @@ static int nicvf_init_resources(struct nicvf *nic)
return 0;
 }
 
+static void nicvf_unmap_page(struct nicvf *nic, struct page *page, u64 dma_addr)
+{
+   /* Check if it's a recycled page, if not unmap the DMA mapping.
+* Recycled page holds an extra reference.
+*/
+   if (page_ref_count(page) == 1) {
+   dma_addr &= PAGE_MASK;
+   dma_unmap_page_attrs(&nic->pdev->dev, dma_addr,
+RCV_FRAG_LEN + XDP_HEADROOM,
+DMA_FROM_DEVICE,
+DMA_ATTR_SKIP_CPU_SYNC);
+   }
+}
+
 static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
struct cqe_rx_t *cqe_rx, struct snd_queue *sq,
struct sk_buff **skb)
 {
struct xdp_buff xdp;
struct page *page;
+   struct nicvf_xdp_tx *xdp_tx = NULL;
u32 action;
-   u16 len, offset = 0;
+   u16 len, err, offset = 0;
u64 dma_addr, cpu_addr;
void *orig_data;
 
@@ -521,7 +541,7 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
cpu_addr = (u64)phys_to_virt(cpu_addr);
page = virt_to_page((void *)cpu_addr);
 
-   xdp.data_hard_start = page_address(page);
+   xdp.data_hard_start = page_address(page) + RCV_BUF_HEADROOM;
xdp.data = (void *)cpu_addr;
xdp_set_data_meta_invalid(&xdp);
xdp.data_end = xdp.data + len;
@@ -540,18 +560,7 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
 
switch (action) {
case XDP_PASS:
-   /* Check if it's a recycled page, if not
-* unmap the DMA mapping.
-*
-* Recycled page holds an extra reference.
-*/
-   if (page_ref_count(page) == 1) {
-   dma_addr &= PAGE_MASK;
-   dma_unmap_page_attrs(&nic->pdev->dev, dma_addr,
-RCV_FRAG_LEN + XDP_PACKET_HEADROOM,
-DMA_FROM_DEVICE,
-DMA_ATTR_SKIP_CPU_SYNC);
-   }
+   nicvf_unmap_page(nic, page, dma_addr);
 
/* Build SKB and pass on packet to network stack */
*skb = build_skb(xdp.data,
@@ -564,6 +573,20 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
case XDP_TX:
nicvf_xdp_sq_append_pkt(nic, sq, (u64)xdp.data, dma_addr, len);
return true;
+   case XDP_REDIRECT:
+   /* Save DMA address for use while transmitting */
+   xdp_tx = (struct nicvf_xdp_tx *)page_address(page);
+   xdp_tx->dma_addr = dma_addr;
+   xdp_tx->qidx = nicvf_netdev_qidx(nic, cqe_rx->rq_idx);
+
+   err = xdp_do_redirect(nic->pnicvf->netdev, &xdp, prog);
+   if (!err)
+   return true;
+
+   /* Free the page on error */
+   nicvf_unmap_page(nic, page, dma_addr);
+   put_page(page);
+   break;
default:
bpf_warn_invalid_xdp_action(action);
/* fall through */
@@ -571,18 +594,7 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
trace_xdp_exception(nic->netdev, prog, action);
/* fall through */
case XDP_DROP:
-   /* Check if it's a recycled page, if not
-* unmap the DMA mapping.
-*
-* Recycled page holds an extra reference.
-*/
-   if (page_ref_count(page) == 1) {
-   dma_addr &= PAGE_MASK;
-   

[PATCH v2] net: sched: crash on blocks with goto chain action

2017-11-24 Thread Roman Kapl
tcf_block_put_ext has assumed that all filters (and thus their goto
actions) are destroyed in RCU callback and thus can not race with our
list iteration. However, that is not true during netns cleanup (see
tcf_exts_get_net comment).

Prevent the use after free by holding all chains (except 0, that one is
already held). foreach_safe is not enough in this case.

To reproduce, run the following in a netns and then delete the ns:
ip link add dtest type dummy
tc qdisc add dev dtest ingress
tc filter add dev dtest chain 1 parent : handle 1 prio 1 flower action goto chain 2

Fixes: 822e86d997 ("net_sched: remove tcf_block_put_deferred()")
Signed-off-by: Roman Kapl 
---
v1 -> v2: Hold all chains instead of just the currently iterated one,
  the code should be more clear this way.
---
 net/sched/cls_api.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 7d97f612c9b9..ddcf04b4ab43 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -336,7 +336,8 @@ static void tcf_block_put_final(struct work_struct *work)
struct tcf_chain *chain, *tmp;
 
rtnl_lock();
-   /* Only chain 0 should be still here. */
+
+   /* At this point, all the chains should have refcnt == 1. */
list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
tcf_chain_put(chain);
rtnl_unlock();
@@ -344,15 +345,21 @@ static void tcf_block_put_final(struct work_struct *work)
 }
 
 /* XXX: Standalone actions are not allowed to jump to any chain, and bound
- * actions should be all removed after flushing. However, filters are now
- * destroyed in tc filter workqueue with RTNL lock, they can not race here.
+ * actions should be all removed after flushing.
  */
 void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
   struct tcf_block_ext_info *ei)
 {
-   struct tcf_chain *chain, *tmp;
+   struct tcf_chain *chain;
 
-   list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
+   /* Hold a refcnt for all chains, except 0, so that they don't disappear
+* while we are iterating.
+*/
+   list_for_each_entry(chain, &block->chain_list, list)
+   if (chain->index)
+   tcf_chain_hold(chain);
+
+   list_for_each_entry(chain, &block->chain_list, list)
tcf_chain_flush(chain);
 
tcf_block_offload_unbind(block, q, ei);
-- 
2.15.0



[PATCH] net: thunderbolt: Stop using zero to mean no valid DMA mapping

2017-11-24 Thread Mika Westerberg
Commit 86dabda426ac ("net: thunderbolt: Clear finished Tx frame bus
address in tbnet_tx_callback()") fixed a DMA-API violation where the
driver called dma_unmap_page() in tbnet_free_buffers() for a bus address
that might already be unmapped. The fix was to zero out the bus address
of a frame in tbnet_tx_callback().

However, as pointed out by David Miller, zero might well be valid
mapping (at least in theory) so it is not good idea to use it here.

It turns out that we don't need the whole map/unmap dance for Tx buffers
at all. Instead we can map the buffers when they are initially allocated
and unmap them when the interface is brought down. In between we just
DMA sync the buffers for the CPU or device as needed.

Signed-off-by: Mika Westerberg 
---
 drivers/net/thunderbolt.c | 57 ---
 1 file changed, 24 insertions(+), 33 deletions(-)

diff --git a/drivers/net/thunderbolt.c b/drivers/net/thunderbolt.c
index 228d4aa6d9ae..ca5e375de27c 100644
--- a/drivers/net/thunderbolt.c
+++ b/drivers/net/thunderbolt.c
@@ -335,7 +335,7 @@ static void tbnet_free_buffers(struct tbnet_ring *ring)
if (ring->ring->is_tx) {
dir = DMA_TO_DEVICE;
order = 0;
-   size = tbnet_frame_size(tf);
+   size = TBNET_FRAME_SIZE;
} else {
dir = DMA_FROM_DEVICE;
order = TBNET_RX_PAGE_ORDER;
@@ -512,6 +512,7 @@ static int tbnet_alloc_rx_buffers(struct tbnet *net, unsigned int nbuffers)
 static struct tbnet_frame *tbnet_get_tx_buffer(struct tbnet *net)
 {
struct tbnet_ring *ring = &net->tx_ring;
+   struct device *dma_dev = tb_ring_dma_device(ring->ring);
struct tbnet_frame *tf;
unsigned int index;
 
@@ -522,7 +523,9 @@ static struct tbnet_frame *tbnet_get_tx_buffer(struct tbnet *net)
 
tf = &ring->frames[index];
tf->frame.size = 0;
-   tf->frame.buffer_phy = 0;
+
+   dma_sync_single_for_cpu(dma_dev, tf->frame.buffer_phy,
+   tbnet_frame_size(tf), DMA_TO_DEVICE);
 
return tf;
 }
@@ -531,13 +534,8 @@ static void tbnet_tx_callback(struct tb_ring *ring, struct ring_frame *frame,
  bool canceled)
 {
struct tbnet_frame *tf = container_of(frame, typeof(*tf), frame);
-   struct device *dma_dev = tb_ring_dma_device(ring);
struct tbnet *net = netdev_priv(tf->dev);
 
-   dma_unmap_page(dma_dev, tf->frame.buffer_phy, tbnet_frame_size(tf),
-  DMA_TO_DEVICE);
-   tf->frame.buffer_phy = 0;
-
/* Return buffer to the ring */
net->tx_ring.prod++;
 
@@ -548,10 +546,12 @@ static void tbnet_tx_callback(struct tb_ring *ring, struct ring_frame *frame,
 static int tbnet_alloc_tx_buffers(struct tbnet *net)
 {
struct tbnet_ring *ring = &net->tx_ring;
+   struct device *dma_dev = tb_ring_dma_device(ring->ring);
unsigned int i;
 
for (i = 0; i < TBNET_RING_SIZE; i++) {
struct tbnet_frame *tf = &ring->frames[i];
+   dma_addr_t dma_addr;
 
tf->page = alloc_page(GFP_KERNEL);
if (!tf->page) {
@@ -559,7 +559,17 @@ static int tbnet_alloc_tx_buffers(struct tbnet *net)
return -ENOMEM;
}
 
+   dma_addr = dma_map_page(dma_dev, tf->page, 0, TBNET_FRAME_SIZE,
+   DMA_TO_DEVICE);
+   if (dma_mapping_error(dma_dev, dma_addr)) {
+   __free_page(tf->page);
+   tf->page = NULL;
+   tbnet_free_buffers(ring);
+   return -ENOMEM;
+   }
+
tf->dev = net->dev;
+   tf->frame.buffer_phy = dma_addr;
tf->frame.callback = tbnet_tx_callback;
tf->frame.sof = TBIP_PDF_FRAME_START;
tf->frame.eof = TBIP_PDF_FRAME_END;
@@ -881,19 +891,6 @@ static int tbnet_stop(struct net_device *dev)
return 0;
 }
 
-static bool tbnet_xmit_map(struct device *dma_dev, struct tbnet_frame *tf)
-{
-   dma_addr_t dma_addr;
-
-   dma_addr = dma_map_page(dma_dev, tf->page, 0, tbnet_frame_size(tf),
-   DMA_TO_DEVICE);
-   if (dma_mapping_error(dma_dev, dma_addr))
-   return false;
-
-   tf->frame.buffer_phy = dma_addr;
-   return true;
-}
-
 static bool tbnet_xmit_csum_and_map(struct tbnet *net, struct sk_buff *skb,
struct tbnet_frame **frames, u32 frame_count)
 {
@@ -908,13 +905,14 @@ static bool tbnet_xmit_csum_and_map(struct tbnet *net, struct sk_buff *skb,
 
if (skb->ip_summed != CHECKSUM_PARTIAL) {
/* No need to calculate checksum so we just update the
-* total frame count and map the frames for DMA.
+* total frame count and sync the frames for 

Re: [PATCH v7 3/5] bpf: add a bpf_override_function helper

2017-11-24 Thread Daniel Borkmann
On 11/22/2017 10:23 PM, Josef Bacik wrote:
> From: Josef Bacik 
> 
> Error injection is sloppy and very ad-hoc.  BPF could fill this niche
> perfectly with it's kprobe functionality.  We could make sure errors are
> only triggered in specific call chains that we care about with very
> specific situations.  Accomplish this with the bpf_override_funciton
> helper.  This will modify the probe'd callers return value to the
> specified value and set the PC to an override function that simply
> returns, bypassing the originally probed function.  This gives us a nice
> clean way to implement systematic error injection for all of our code
> paths.
> 
> Acked-by: Alexei Starovoitov 
> Acked-by: Ingo Molnar 
> Signed-off-by: Josef Bacik 

Series looks good to me as well; BPF bits:

Acked-by: Daniel Borkmann 


Re: [PATCH] dsa: dsa2: fix compile error for !CONFIG_OF

2017-11-24 Thread Arend van Spriel

On 11/24/2017 3:28 AM, Andrew Lunn wrote:

On Thu, Nov 23, 2017 at 08:27:48PM +0100, Arend Van Spriel wrote:

+ Arnd

On Thu, Nov 23, 2017 at 8:12 PM, Arend Van Spriel
 wrote:

On Thu, Nov 23, 2017 at 3:04 PM, Andrew Lunn  wrote:

On Thu, Nov 23, 2017 at 01:00:51PM +0100, Arend van Spriel wrote:

Compilation fails building on x86_64 platform which does not
have CONFIG_OF enabled.

Signed-off-by: Arend van Spriel 
---
After rebasing my branch to v4.14 I attempted to build the kernel and hit
the following compile issue:

net/dsa/dsa2.c: In function ‘dsa_switch_parse_member_of’:
net/dsa/dsa2.c:678:2: error: implicit declaration of function
'of_property_read_variable_u32_array'


Hi Arend

https://lkml.org/lkml/2017/11/6/493


So my email/patch did get through initially. Sorry for the noise and
thanks for the info.


Hi Andrew,

Getting back to this. It seems that this patch did not get in. At
least I searched for it in v4.14.1 but no luck.


Hi Arend

The use of of_property_read_variable_u32_array was added in
975e6e32215e ("net: dsa: rework switch parsing"). This patch is not in
v4.14. It is in linus/master, so v4.15-rc1 should have it. And the fix
is also in linus/master.

So there does not appear to be anything wrong. I just built v4.14.1
for x86_64 with DSA without problems.


Thanks, Andrew

I am actually using the wireless-testing tree, which is based on 4.14 and
throws in net-next and the wireless trees. I assume the fix did not go
through net-next. Sorry for the confusion.


Regards,
Arend


[PATCH] atm: nicstar: use the setup_timer helper

2017-11-24 Thread Colin King
From: Colin Ian King 

Replace init_timer and two explicit assignments with the setup_timer
helper.

Signed-off-by: Colin Ian King 
---
 drivers/atm/nicstar.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/atm/nicstar.c b/drivers/atm/nicstar.c
index a9702836cbae..335447ed0ba4 100644
--- a/drivers/atm/nicstar.c
+++ b/drivers/atm/nicstar.c
@@ -284,10 +284,8 @@ static int __init nicstar_init(void)
XPRINTK("nicstar: nicstar_init() returned.\n");
 
if (!error) {
-   init_timer(&ns_timer);
+   setup_timer(&ns_timer, ns_poll, 0UL);
ns_timer.expires = jiffies + NS_POLL_PERIOD;
-   ns_timer.data = 0UL;
-   ns_timer.function = ns_poll;
add_timer(&ns_timer);
}
 
-- 
2.14.1



Re: [PATCH] r8152: disable rx checksum offload on Dell TB dock

2017-11-24 Thread Kai Heng Feng

> Also the MAC address is different, can you just trigger off of Dell's
> MAC address space instead of the address space of the dongle device?

A really good idea, never thought of this. Thanks for the hint :)
Still, I need to ask Dell folks to get all the answers.

Kai-Heng



Re: [PATCH] r8152: disable rx checksum offload on Dell TB dock

2017-11-24 Thread Kai Heng Feng


> On 24 Nov 2017, at 4:28 PM, Greg KH  wrote:
> 
> The bcdDevice is different between the dock device and the "real"
> device, why not use that?

Yea, I’ll poke around and see if bcdDevice alone can be a good predicate.

> Then there is still a bug.  Who at ASMedia is working on this, have they
> posted anything to the linux-usb mailing list about it?

I think they are doing this internally. I’ll advise them to ask questions here
if they encounter any problem.

> Maybe.  Have you tried using usbmon to see what the data streams are for
> the two devices and where they have problems and diverge?  Is the dock
> device doing something different in response to something from the host
> that the "real" device does not do?

No I haven’t.
Not really sure how to debug network packets over USB. I’ll do some research
on the topic.

Kai-Heng


Re: [PATCH] r8152: disable rx checksum offload on Dell TB dock

2017-11-24 Thread Greg KH
On Fri, Nov 24, 2017 at 11:44:02AM +0800, Kai Heng Feng wrote:
> 
> 
> > On 23 Nov 2017, at 5:24 PM, Greg KH  wrote:
> > 
> > On Thu, Nov 23, 2017 at 04:53:41PM +0800, Kai Heng Feng wrote:
> >> 
> >> What I want to do here is to find this connection:
> >> Realtek r8153 <-> SMSC hub (USB ID: 0424:5537) <->
> >> ASMedia XHCI controller (PCI ID: 1b21:1142).
> >> 
> >> Is there a safer way to do this?
> > 
> > Nope!  You can't do that at all from within a USB driver, sorry.  As you
> > really should not care at all :)
> 
> Got it :)
> 
> The r8153 in Dell TB dock has version information, RTL_VER_05.
> We can use it to check for workaround, but many working RTL_VER_05 devices
> will also be affected.
> Do you think it’s an acceptable compromise?

I think all of the users of this device that is working just fine for
them would not like that to happen :(

> >> I have a r8153 <-> USB 3.0 dongle which work just fine. I can’t find any 
> >> information to differentiate them. Hence I want to use the connection to
> >> identify if r8153 is on a Dell TB dock.
> > 
> > Are you sure there is nothing different in the version or release number
> > of the device?  'lsusb -v' shows the exact same information for both
> > devices?
> 
> Yes. I attached `lsusb -v` for r8153 on Dell TB dock, on a RJ45 <-> USB 3.0 
> dongle,
> and on a RJ45 <-> USB Type-C dongle.

The bcdDevice is different between the dock device and the "real"
device, why not use that?

> >> Yes. From what I know, ASMedia is working on it, but not sure how long it
> >> will take. In the meantime, I’d like to workaround this issue for the 
> >> users.
> > 
> > Again, it's a host controller bug, it should be fixed there, don't try
> > to paper over the real issue in different individual drivers.
> > 
> > I think I've seen various patches on the linux-usb list for this
> > controller already, have you tried them?
> 
> Yes. These patches are all in mainline Linux now.

Then there is still a bug.  Who at ASMedia is working on this, have they
posted anything to the linux-usb mailing list about it?

> >> Actually no.
> >> I just plugged r8153 dongle into the same hub, surprisingly the issue
> >> doesn’t happen in this scenario.
> > 
> > Then something seems to be wrong with the device itself, as that would
> > be the same exact electrical/logical path, right?
> 
> I have no idea why externally plugged one doesn’t have this issue.
> Maybe it’s related how it’s wired inside the Dell TB dock...

Maybe.  Have you tried using usbmon to see what the data streams are for
the two devices and where they have problems and diverge?  Is the dock
device doing something different in response to something from the host
that the "real" device does not do?

thanks,

greg k-h


[patch iproute2] tc: move action cookie print out of the stats if

2017-11-24 Thread Jiri Pirko
From: Jiri Pirko 

Cookie print was made dependent on show_stats for no good reason. Fix
this by pushing cookie print out of the stats if.

Fixes: fd8b3d2c1b9b ("actions: Add support for user cookies")
Signed-off-by: Jiri Pirko 
---
 tc/m_action.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/tc/m_action.c b/tc/m_action.c
index 0dce97f..c2fc4f1 100644
--- a/tc/m_action.c
+++ b/tc/m_action.c
@@ -301,19 +301,18 @@ static int tc_print_one_action(FILE *f, struct rtattr *arg)
return err;
 
if (show_stats && tb[TCA_ACT_STATS]) {
-
fprintf(f, "\tAction statistics:\n");
print_tcstats2_attr(f, tb[TCA_ACT_STATS], "\t", NULL);
-   if (tb[TCA_ACT_COOKIE]) {
-   int strsz = RTA_PAYLOAD(tb[TCA_ACT_COOKIE]);
-   char b1[strsz * 2 + 1];
-
-   fprintf(f, "\n\tcookie len %d %s ", strsz,
-   hexstring_n2a(RTA_DATA(tb[TCA_ACT_COOKIE]),
- strsz, b1, sizeof(b1)));
-   }
fprintf(f, "\n");
}
+   if (tb[TCA_ACT_COOKIE]) {
+   int strsz = RTA_PAYLOAD(tb[TCA_ACT_COOKIE]);
+   char b1[strsz * 2 + 1];
+
+   fprintf(f, "\tcookie len %d %s\n", strsz,
+   hexstring_n2a(RTA_DATA(tb[TCA_ACT_COOKIE]),
+ strsz, b1, sizeof(b1)));
+   }
 
return 0;
 }
-- 
2.9.5



Re: [PATCH] r8152: disable rx checksum offload on Dell TB dock

2017-11-24 Thread Greg KH
On Fri, Nov 24, 2017 at 09:28:05AM +0100, Greg KH wrote:
> On Fri, Nov 24, 2017 at 11:44:02AM +0800, Kai Heng Feng wrote:
> > 
> > 
> > > On 23 Nov 2017, at 5:24 PM, Greg KH  wrote:
> > > 
> > > On Thu, Nov 23, 2017 at 04:53:41PM +0800, Kai Heng Feng wrote:
> > >> 
> > >> What I want to do here is to find this connection:
> > >> Realtek r8153 <-> SMSC hub (USB ID: 0424:5537) <->
> > >> ASMedia XHCI controller (PCI ID: 1b21:1142).
> > >> 
> > >> Is there a safer way to do this?
> > > 
> > > Nope!  You can't do that at all from within a USB driver, sorry.  As you
> > > really should not care at all :)
> > 
> > Got it :)
> > 
> > The r8153 in Dell TB dock has version information, RTL_VER_05.
> > We can use it to check for workaround, but many working RTL_VER_05 devices
> > will also be affected.
> > Do you think it’s an acceptable compromise?
> 
> I think all of the users of this device that is working just fine for
> them would not like that to happen :(
> 
> > >> I have a r8153 <-> USB 3.0 dongle which work just fine. I can’t find any 
> > >> information to differentiate them. Hence I want to use the connection to
> > >> identify if r8153 is on a Dell TB dock.
> > > 
> > > Are you sure there is nothing different in the version or release number
> > > of the device?  'lsusb -v' shows the exact same information for both
> > > devices?
> > 
> > Yes. I attached `lsusb -v` for r8153 on Dell TB dock, on a RJ45 <-> USB 3.0 
> > dongle,
> > and on a RJ45 <-> USB Type-C dongle.
> 
> The bcdDevice is different between the dock device and the "real"
> device, why not use that?

Also the MAC address is different, can you just trigger off of Dell's
MAC address space instead of the address space of the dongle device?

thanks,

greg k-h


Re: [PATCH 1/6] perf: Add new type PERF_TYPE_PROBE

2017-11-24 Thread Peter Zijlstra
On Thu, Nov 23, 2017 at 10:31:29PM -0800, Alexei Starovoitov wrote:
> unfortunately 32-bit is more screwed than it seems:
> 
> $ cat align.c
> #include 
> 
> struct S {
>   unsigned long long a;
> } s;
> 
> struct U {
>   unsigned long long a;
> } u;
> 
> int main()
> {
> printf("%d, %d\n", sizeof(unsigned long long),
>__alignof__(unsigned long long));
> printf("%d, %d\n", sizeof(s), __alignof__(s));
> printf("%d, %d\n", sizeof(u), __alignof__(u));
> }
> $ gcc -m32 align.c
> $ ./a.out
> 8, 8
> 8, 4
> 8, 4

*blink* how is that even correct? I understood the spec to say the
alignment of composite types should be the max alignment of any of its
member types (otherwise it cannot guarantee the alignment of its
members).

> so we have to use __aligned_u64 in uapi.

Ideally yes, but effectively it most often doesn't matter.

> Otherwise, yes, we could have used config1 and config2 to pass pointers
> to the kernel, but since they're defined as __u64 already we cannot
> change them and have to do this ugly dance around 'config' field.

I don't understand the reasoning why you cannot use them. Even if they
are not naturally aligned on x86_32, why would it matter?

x86_32 needs two loads in any case, but there is no concurrency, so
split loads is not a problem. Add to that that 'intptr_t' on ILP32
is in fact only a single u32 and thus the other u32 will always be 0.

So yes, alignment is screwy, but I really don't see who cares and why it
would matter in practise.

Please explain.


Re: [RFC net-next 4/6] netdevsim: add software driver for testing offloads

2017-11-24 Thread Jiri Pirko
Fri, Nov 24, 2017 at 08:49:17AM CET, jakub.kicin...@netronome.com wrote:
>On Thu, Nov 23, 2017 at 11:24 PM, Jiri Pirko  wrote:
>> Fri, Nov 24, 2017 at 03:36:11AM CET, jakub.kicin...@netronome.com wrote:
>>>To be able to run selftests without any hardware required we
>>>need a software model.  The model can also serve as an example
>>>implementation for those implementing actual HW offloads.
>>>The dummy driver have previously been extended to test SR-IOV,
>>>but the general consensus seems to be against adding further
>>>features to it.
>>>
>>>Signed-off-by: Jakub Kicinski 
>>>Reviewed-by: Simon Horman 
>>>---
>>
>> [...]
>>
>>
>>>+++ b/drivers/net/netdevsim/netdev.c
>>>@@ -0,0 +1,136 @@
>>>+/*
>>>+ * Copyright (C) 2017 Netronome Systems, Inc.
>>>+ *
>>>+ * This software is dual licensed under the GNU General License Version 2,
>>>+ * June 1991 as shown in the file COPYING in the top-level directory of this
>>>+ * source tree or the BSD 2-Clause License provided below.  You have the
>>>+ * option to license this software under the complete terms of either 
>>>license.
>>>+ *
>>>+ * The BSD 2-Clause License:
>>
>> Why gpl2 is not enough for this?
>
>It's the license I got from legal, I will request permission to use
>pure gpl2.  Thanks!

Yeah, I semi-understand need for bsd for actual hw driver (we have it
for mlxsw as well). But for this testing driver, it really does not make
sense.


Re: [RFC net-next 0/6] xdp: make stack perform remove and tests

2017-11-24 Thread Jakub Kicinski
On Thu, Nov 23, 2017 at 11:45 PM, Jiri Pirko  wrote:
> Fri, Nov 24, 2017 at 03:36:07AM CET, jakub.kicin...@netronome.com wrote:
>>Hi!
>>
>>The purpose of this series is to add a software model of BPF offloads
>>to make it easier for everyone to test them and make some of the more
>>arcane rules and assumptions more clear.
>>
>>The series starts with 3 patches aiming to make XDP handling in the
>>drivers less error prone.  Currently driver authors have to remember
>>to free XDP programs if XDP is active during unregister.  With this
>>series the core will disable XDP on its own.  It will take place
>>after close, drivers are not expected to perform reconfiguration
>>when disabling XDP on a downed device.
>>
>>Next two patches add the software netdev driver.  Last but not least
>
> I wonder if for this it is needed to split the driver into multiple
> files. I think that a single file would be better as I don't expect the
> driver would get big.

I was hoping other offloads will be added to their separate files, to
make it easier for people to find "all code relevant when implementing
X".

Sort of related to your comment on the license, I'm hoping to be able
to use SPDX one-line header to lower the overhead of many files.  Has
anyone managed to get an OK to do that?

>>there is a python test which exercises all the corner cases which
>>came to my mind.
>>
>>Test needs to be run as root.  It will print basic information to
>>stdout, but can also create a more detailed log of all commands
>>when --log option is passed.  Log is in Emacs Org-mode format.
>>
>>  ./tools/testing/selftests/bpf/test_offload.py --log /tmp/log
>>
>>Something I'm still battling with, and would appreciate help of
>>wiser people is that occasionally during the test something makes
>>the refcount of init_net drop to 0 :S  I tried to create a simple
>>reproducer, but seems like just running the script in the loop is
>>the easiest way to go...  Could it have something to do with the
>>recent TC work?  The driver is pretty simple and never touches
>
> I don't see how...

To be clear I meant the changes made to destruction of filters, not
your work. The BPF code doesn't touch ref counts and cls exts do seem
to hold a ref on the net...  but perhaps that's just pointing the
finger unnecessarily :)  I will try to investigate again tomorrow.

>>ref counts.  The only slightly unusual thing is that the BPF code
>>sleeps for a bit on remove in the netdev notifier.
>>
>>
>>Jakub Kicinski (6):
>>  net: xdp: avoid output parameters when querying XDP prog
>>  net: xdp: report flags program was installed with on query
>>  net: xdp: make the stack take care of the tear down
>>  netdevsim: add software driver for testing offloads
>>  netdevsim: add bpf offload support
>>  selftests/bpf: add offload test based on netdevsim
>
> Patchset looks fine to me.
> Thanks for this!

Thanks!