Re: [PATCH net] ipvlan: fix IFLA_MTU ignored on NEWLINK

2018-06-20 Thread David Miller
From: Xin Long 
Date: Thu, 21 Jun 2018 12:56:04 +0800

> Commit 296d48568042 ("ipvlan: inherit MTU from master device") adjusted
> the mtu from the master device when creating a ipvlan device, but it
> would also override the mtu value set in rtnl_create_link. It causes
> IFLA_MTU param not to take effect.
> 
> So this patch is to not adjust the mtu if IFLA_MTU param is set when
> creating a ipvlan device.
> 
> Fixes: 296d48568042 ("ipvlan: inherit MTU from master device")
> Reported-by: Jianlin Shi 
> Signed-off-by: Xin Long 

Applied and queued up for -stable, thanks.


Re: [PATCH bpf 2/2] tools: bpftool: remember to close the libbpf object after prog load

2018-06-20 Thread Song Liu
On Wed, Jun 20, 2018 at 11:42 AM, Jakub Kicinski
 wrote:
> Remembering to close all descriptors and free memory may not seem
> important in a user space tool like bpftool, but if we were to run
> in batch mode the consumed resources start to add up quickly.  Make
> sure program load closes the libbpf object (which unloads and frees
> it).
>
> Fixes: 49a086c201a9 ("bpftool: implement prog load command")
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Quentin Monnet 

Acked-by: Song Liu 

> ---
>  tools/bpf/bpftool/prog.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> index 12b694fe0404..959aa53ab678 100644
> --- a/tools/bpf/bpftool/prog.c
> +++ b/tools/bpf/bpftool/prog.c
> @@ -695,12 +695,18 @@ static int do_load(int argc, char **argv)
> }
>
> if (do_pin_fd(prog_fd, argv[1]))
> -   return -1;
> +   goto err_close_obj;
>
> if (json_output)
> jsonw_null(json_wtr);
>
> +   bpf_object__close(obj);
> +
> return 0;
> +
> +err_close_obj:
> +   bpf_object__close(obj);
> +   return -1;
>  }
>
>  static int do_help(int argc, char **argv)
> --
> 2.17.1
>


Re: [PATCH bpf 1/2] tools: bpftool: remove duplicated error message on prog load

2018-06-20 Thread Song Liu
On Wed, Jun 20, 2018 at 11:42 AM, Jakub Kicinski
 wrote:
> do_pin_fd() will already print out an error message if something
> goes wrong.  Printing another error is unnecessary and will break
> JSON output, since error messages are full objects:
>
> $ bpftool -jp prog load tracex1_kern.o /sys/fs/bpf/a
> {
> "error": "can't pin the object (/sys/fs/bpf/a): File exists"
> },{
> "error": "failed to pin program"
> }
>
> Fixes: 49a086c201a9 ("bpftool: implement prog load command")
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Quentin Monnet 

Acked-by: Song Liu 

> ---
>  tools/bpf/bpftool/prog.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> index 05f42a46d6ed..12b694fe0404 100644
> --- a/tools/bpf/bpftool/prog.c
> +++ b/tools/bpf/bpftool/prog.c
> @@ -694,10 +694,8 @@ static int do_load(int argc, char **argv)
> return -1;
> }
>
> -   if (do_pin_fd(prog_fd, argv[1])) {
> -   p_err("failed to pin program");
> +   if (do_pin_fd(prog_fd, argv[1]))
> return -1;
> -   }
>
> if (json_output)
> jsonw_null(json_wtr);
> --
> 2.17.1
>


Re: Route fallback issue

2018-06-20 Thread Grant Taylor

On 06/20/2018 01:00 PM, Julian Anastasov wrote:

You can also try alternative routes.


"Alternative routes"?  I can't say as I've heard that description as a 
specific technique / feature / capability before.


Is that it's official name?

Where can I find out more about it?

But as the kernel supports only default alternative routes, you can put 
them in their own table:


I don't know that that is the case any more.

I was able to issue the following commands without a problem:

# ip route append 192.0.2.128/26 via 192.0.2.62
# ip route append 192.0.2.128/26 via 192.0.2.126

I crated two network namespaces and had a pair of vEths between them 
(192.0.2.0/26 and 192.0.2.64/26).  I added a dummy network to each NetNS 
(192.0.2.128/26 and 192.0.2.192/26).


I ran the following commands while a persistent ping was running from 
one NetNS to the IP on the other's dummy0 interface:


# ip link set ns2b up && ip route append 192.0.2.192/26 via 192.0.2.126 
&& ip link set ns2a down

(pause and watch things)
# ip link set ns2a up && ip route append 192.0.2.192/26 via 192.0.2.62 
&& ip link set ns2b down

(pause and watch things)

I could iterate between the two above commands and pings continued to work.

So, I think that it's now possible to use "alternate routes" (new to me) 
on specific prefixes in addition to the default.  Thus there is no 
longer any need for a separate table and the associated IP rule.


I'm running kernel version 4.9.76.

I did go ahead and set net.ipv4.conf.ns2b.ignore_routes_with_linkdown to 1.

for i in /proc/sys/net/ipv4/conf/*/ignore_routes_with_linkdown; do echo 
1 > $i; done


Doing that dropped the number of dropped pings from 60 ~ 90 (1 / second) 
to 0 ~ 5 (1 / second).  (Rarely, maybe 1 out of 20 flips, would it take 
upwards of 10 pings / seconds.)



# Alternative routes use same metric!!!
ip route append default via 192.168.1.254 dev eno1 table 100
ip route append default via 192.168.2.254 dev eno2 table 100
ip rule add prio 100 to 172.16.0.0/12 table 100


I did have to "append" the route.  I couldn't just "add" the route. 
When I tried to "add" the second route, I got an error about the route 
already existing.  Using "append" instead of "add" with everything else 
the same worked just fine.


Note:  I did go ahead and remove the single route that was added via 
"add" and used "append" for both.


Of course, you will get better results if an user space tool puts only 
alive routes in service after doing health checks of all near gateways.


I've got to say, with as well as this is working, I don't feel any need 
for a user space monitoring daemon.  I agree that I've felt the need for 
such in the past before I learned about "alternative routes".


I still want to learn more about "alternative routes".

Here's a diagram of the test network if someone wants to try to 
reproduce my findings:


+-++-+
| NS1 || NS2 |
|ns2a +-vEth-A-+ ns1a|
| || |
+ dummy0  ||  dummy0 +
| || |
|ns2b +-vEth-B-+ ns1b|
| || |
+-++-+

(vEths get the name of the NS that they face.)

NS1:ns2a 192.0.2.1 /26
NS1:ns2b 192.0.2.65/26
NS1:dummy0   192.0.2.129   /26
NS2:ns1a 192.0.2.62/26
NS2:ns1b 192.0.2.126   /26
NS2:dummy0   192.0.2.254   /26



--
Grant. . . .
unix || die


[PATCH net] ipvlan: fix IFLA_MTU ignored on NEWLINK

2018-06-20 Thread Xin Long
Commit 296d48568042 ("ipvlan: inherit MTU from master device") adjusted
the mtu from the master device when creating a ipvlan device, but it
would also override the mtu value set in rtnl_create_link. It causes
IFLA_MTU param not to take effect.

So this patch is to not adjust the mtu if IFLA_MTU param is set when
creating a ipvlan device.

Fixes: 296d48568042 ("ipvlan: inherit MTU from master device")
Reported-by: Jianlin Shi 
Signed-off-by: Xin Long 
---
 drivers/net/ipvlan/ipvlan_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index d02f0a7..23c1d660 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -594,7 +594,8 @@ int ipvlan_link_new(struct net *src_net, struct net_device 
*dev,
ipvlan->phy_dev = phy_dev;
ipvlan->dev = dev;
ipvlan->sfeatures = IPVLAN_FEATURES;
-   ipvlan_adjust_mtu(ipvlan, phy_dev);
+   if (!tb[IFLA_MTU])
+   ipvlan_adjust_mtu(ipvlan, phy_dev);
INIT_LIST_HEAD(&ipvlan->addrs);
spin_lock_init(&ipvlan->addrs_lock);
 
-- 
2.1.0



Re: [PATCH net] bpf: enforce correct alignment for instructions

2018-06-20 Thread Eric Dumazet



On 06/20/2018 08:46 PM, David Miller wrote:
> From: Eric Dumazet 
> Date: Wed, 20 Jun 2018 17:24:09 -0700
> 
>> After commit 9facc336876f ("bpf: reject any prog that failed read-only lock")
>> offsetof(struct bpf_binary_header, image) became 3 instead of 4,
>> breaking powerpc BPF badly, since instructions need to be word aligned.
>>
>> Fixes: 9facc336876f ("bpf: reject any prog that failed read-only lock")
>> Signed-off-by: Eric Dumazet 
> 
> I'll apply this directly, thanks Eric.
> 

Thanks David :)


Re: [PATCH net] sctp: fix erroneous inc of snmp SctpFragUsrMsgs

2018-06-20 Thread David Miller
From: Marcelo Ricardo Leitner 
Date: Wed, 20 Jun 2018 12:47:52 -0300

> Currently it is incrementing SctpFragUsrMsgs when the user message size
> is of the exactly same size as the maximum fragment size, which is wrong.
> 
> The fix is to increment it only when user message is bigger than the
> maximum fragment size.
> 
> Fixes: bfd2e4b8734d ("sctp: refactor sctp_datamsg_from_user")
> Signed-off-by: Marcelo Ricardo Leitner 

Applied.


Re: [PATCH net] bpf: enforce correct alignment for instructions

2018-06-20 Thread David Miller
From: Eric Dumazet 
Date: Wed, 20 Jun 2018 17:24:09 -0700

> After commit 9facc336876f ("bpf: reject any prog that failed read-only lock")
> offsetof(struct bpf_binary_header, image) became 3 instead of 4,
> breaking powerpc BPF badly, since instructions need to be word aligned.
> 
> Fixes: 9facc336876f ("bpf: reject any prog that failed read-only lock")
> Signed-off-by: Eric Dumazet 

I'll apply this directly, thanks Eric.


[net:master 6/6] drivers/net/ethernet/mscc/ocelot.c:377:17: sparse: incorrect type in argument 2 (different base types)

2018-06-20 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master
head:   08d02364b12faa54d76dbfea2090321fd27996f2
commit: 08d02364b12faa54d76dbfea2090321fd27996f2 [6/6] net: mscc: fix the 
injection header
reproduce:
# apt-get install sparse
git checkout 08d02364b12faa54d76dbfea2090321fd27996f2
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> drivers/net/ethernet/mscc/ocelot.c:377:17: sparse: incorrect type in 
>> argument 2 (different base types) @@expected unsigned int [unsigned] 
>> [usertype] val @@got ed int [unsigned] [usertype] val @@
   drivers/net/ethernet/mscc/ocelot.c:377:17:expected unsigned int 
[unsigned] [usertype] val
   drivers/net/ethernet/mscc/ocelot.c:377:17:got restricted __be32 
[usertype] 
   include/linux/device.h:678:13: sparse: undefined identifier 
'__builtin_mul_overflow'
   include/linux/device.h:678:13: sparse: call with no type!

vim +377 drivers/net/ethernet/mscc/ocelot.c

   353  
   354  static int ocelot_port_xmit(struct sk_buff *skb, struct net_device *dev)
   355  {
   356  struct ocelot_port *port = netdev_priv(dev);
   357  struct ocelot *ocelot = port->ocelot;
   358  u32 val, ifh[IFH_LEN];
   359  struct frame_info info = {};
   360  u8 grp = 0; /* Send everything on CPU group 0 */
   361  unsigned int i, count, last;
   362  
   363  val = ocelot_read(ocelot, QS_INJ_STATUS);
   364  if (!(val & QS_INJ_STATUS_FIFO_RDY(BIT(grp))) ||
   365  (val & QS_INJ_STATUS_WMARK_REACHED(BIT(grp
   366  return NETDEV_TX_BUSY;
   367  
   368  ocelot_write_rix(ocelot, QS_INJ_CTRL_GAP_SIZE(1) |
   369   QS_INJ_CTRL_SOF, QS_INJ_CTRL, grp);
   370  
   371  info.port = BIT(port->chip_port);
   372  info.tag_type = IFH_TAG_TYPE_C;
   373  info.vid = skb_vlan_tag_get(skb);
   374  ocelot_gen_ifh(ifh, &info);
   375  
   376  for (i = 0; i < IFH_LEN; i++)
 > 377  ocelot_write_rix(ocelot, cpu_to_be32(ifh[i]), 
 > QS_INJ_WR, grp);
   378  
   379  count = (skb->len + 3) / 4;
   380  last = skb->len % 4;
   381  for (i = 0; i < count; i++) {
   382  ocelot_write_rix(ocelot, ((u32 *)skb->data)[i], 
QS_INJ_WR, grp);
   383  }
   384  
   385  /* Add padding */
   386  while (i < (OCELOT_BUFFER_CELL_SZ / 4)) {
   387  ocelot_write_rix(ocelot, 0, QS_INJ_WR, grp);
   388  i++;
   389  }
   390  
   391  /* Indicate EOF and valid bytes in last word */
   392  ocelot_write_rix(ocelot, QS_INJ_CTRL_GAP_SIZE(1) |
   393   QS_INJ_CTRL_VLD_BYTES(skb->len < 
OCELOT_BUFFER_CELL_SZ ? 0 : last) |
   394   QS_INJ_CTRL_EOF,
   395   QS_INJ_CTRL, grp);
   396  
   397  /* Add dummy CRC */
   398  ocelot_write_rix(ocelot, 0, QS_INJ_WR, grp);
   399  skb_tx_timestamp(skb);
   400  
   401  dev->stats.tx_packets++;
   402  dev->stats.tx_bytes += skb->len;
   403  dev_kfree_skb_any(skb);
   404  
   405  return NETDEV_TX_OK;
   406  }
   407  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


[PATCH v2 bpf-net] bpf: Change bpf_fib_lookup to return lookup status

2018-06-20 Thread dsahern
From: David Ahern 

For ACLs implemented using either FIB rules or FIB entries, the BPF
program needs the FIB lookup status to be able to drop the packet.
Since the bpf_fib_lookup API has not reached a released kernel yet,
change the return code to contain an encoding of the FIB lookup
result and return the nexthop device index in the params struct.

In addition, inform the BPF program of any post FIB lookup reason as
to why the packet needs to go up the stack.

The fib result for unicast routes must have an egress device, so remove
the check that it is non-NULL.

Signed-off-by: David Ahern 
---
v2
- drop BPF_FIB_LKUP_RET_NO_NHDEV; check in dev in fib result not needed
- enhance documentation of BPF_FIB_LKUP_RET_ codes

 include/uapi/linux/bpf.h   | 28 ++
 net/core/filter.c  | 72 ++
 samples/bpf/xdp_fwd_kern.c |  8 +++---
 3 files changed, 74 insertions(+), 34 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59b19b6a40d7..b7db3261c62d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1857,7 +1857,8 @@ union bpf_attr {
  * is resolved), the nexthop address is returned in ipv4_dst
  * or ipv6_dst based on family, smac is set to mac address of
  * egress device, dmac is set to nexthop mac address, rt_metric
- * is set to metric from route (IPv4/IPv6 only).
+ * is set to metric from route (IPv4/IPv6 only), and ifindex
+ * is set to the device index of the nexthop from the FIB lookup.
  *
  * *plen* argument is the size of the passed in struct.
  * *flags* argument can be a combination of one or more of the
@@ -1873,9 +1874,10 @@ union bpf_attr {
  * *ctx* is either **struct xdp_md** for XDP programs or
  * **struct sk_buff** tc cls_act programs.
  * Return
- * Egress device index on success, 0 if packet needs to continue
- * up the stack for further processing or a negative error in case
- * of failure.
+ * * < 0 if any input argument is invalid
+ * *   0 on success (packet is forwarded, nexthop neighbor exists)
+ * * > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
+ * * packet is not forwarded or needs assist from full stack
  *
  * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map 
*map, void *key, u64 flags)
  * Description
@@ -2612,6 +2614,18 @@ struct bpf_raw_tracepoint_args {
 #define BPF_FIB_LOOKUP_DIRECT  BIT(0)
 #define BPF_FIB_LOOKUP_OUTPUT  BIT(1)
 
+enum {
+   BPF_FIB_LKUP_RET_SUCCESS,  /* lookup successful */
+   BPF_FIB_LKUP_RET_BLACKHOLE,/* dest is blackholed; can be dropped */
+   BPF_FIB_LKUP_RET_UNREACHABLE,  /* dest is unreachable; can be dropped */
+   BPF_FIB_LKUP_RET_PROHIBIT, /* dest not allowed; can be dropped */
+   BPF_FIB_LKUP_RET_NOT_FWDED,/* packet is not forwarded */
+   BPF_FIB_LKUP_RET_FWD_DISABLED, /* fwding is not enabled on ingress */
+   BPF_FIB_LKUP_RET_UNSUPP_LWT,   /* fwd requires encapsulation */
+   BPF_FIB_LKUP_RET_NO_NEIGH, /* no neighbor entry for nh */
+   BPF_FIB_LKUP_RET_FRAG_NEEDED,  /* fragmentation required to fwd */
+};
+
 struct bpf_fib_lookup {
/* input:  network family for lookup (AF_INET, AF_INET6)
 * output: network family of egress nexthop
@@ -2625,7 +2639,11 @@ struct bpf_fib_lookup {
 
/* total length of packet from network header - used for MTU check */
__u16   tot_len;
-   __u32   ifindex;  /* L3 device index for lookup */
+
+   /* input: L3 device index for lookup
+* output: device index from FIB lookup
+*/
+   __u32   ifindex;
 
union {
/* inputs to lookup */
diff --git a/net/core/filter.c b/net/core/filter.c
index e7f12e9f598c..f8dd8aa89de4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4073,8 +4073,9 @@ static int bpf_fib_set_fwd_params(struct bpf_fib_lookup 
*params,
memcpy(params->smac, dev->dev_addr, ETH_ALEN);
params->h_vlan_TCI = 0;
params->h_vlan_proto = 0;
+   params->ifindex = dev->ifindex;
 
-   return dev->ifindex;
+   return 0;
 }
 #endif
 
@@ -4098,7 +4099,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct 
bpf_fib_lookup *params,
/* verify forwarding is enabled on this interface */
in_dev = __in_dev_get_rcu(dev);
if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
-   return 0;
+   return BPF_FIB_LKUP_RET_FWD_DISABLED;
 
if (flags & BPF_FIB_LOOKUP_OUTPUT) {
fl4.flowi4_iif = 1;
@@ -4123,7 +4124,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct 
bpf_fib_lookup *params,
 
tb = fib_get_table(net, tbid);
if (unlikely(!tb))
-   return 0;
+ 

Re: [PATCH net] ipvlan: call dev_change_flags when reset ipvlan mode

2018-06-20 Thread Hangbin Liu
On Wed, Jun 20, 2018 at 10:45:39AM -0700, Cong Wang wrote:
> On Tue, Jun 19, 2018 at 10:31 PM, David Miller  wrote:
> > From: Hangbin Liu 
> > Date: Wed, 20 Jun 2018 11:22:54 +0800
> >
> >> The only case dev_change_flags() return an err is when we change IFF_UP 
> >> flag.
> >> Since we only set/reset IFF_NOARP, do you think we still need to check the
> >> return value?
> >
> > It is bad to try and take shortcuts on error handling using assumptions
> > like that.
> >
> > If dev_change_flags() is adjusted to return error codes in more
> > situations, nobody is going to remember to undo your "optimziation"
> > here.
> >
> > Please check for errors, thank you.
> 
> Yeah. Also since the notifier is triggered in this case:
> 
> if (dev->flags & IFF_UP &&
> (changes & ~(IFF_UP | IFF_PROMISC | IFF_ALLMULTI | 
> IFF_VOLATILE))) {
> struct netdev_notifier_change_info change_info = {
> .info = {
> .dev = dev,
> },
> .flags_changed = changes,
> };
> 
> call_netdevice_notifiers_info(NETDEV_CHANGE, 
> &change_info.info);
> }
> 
> the return value of call_netdevice_notifiers_info() isn't captured
> either, but it should be.

Thanks for the explanation. I will fix it.

Regards
Hangbin


Re: [PATCH] r8169: Fix netpoll oops

2018-06-20 Thread David Miller
From: Ville Syrjala 
Date: Wed, 20 Jun 2018 15:01:53 +0300

> From: Ville Syrjälä 
> 
> Pass the correct thing to rtl8169_interrupt() from netpoll.
> 
> Cc: Realtek linux nic maintainers 
> Cc: netdev@vger.kernel.org
> Cc: Heiner Kallweit 
> Cc: David S. Miller 
> Fixes: ebcd5daa7ffd ("r8169: change interrupt handler argument type")
> Signed-off-by: Ville Syrjälä 

Applied.


Re: [PATCH net-next 0/2] fixes for ipsec selftests

2018-06-20 Thread Shannon Nelson

On 6/20/2018 4:18 PM, Anders Roxell wrote:

On Thu, 21 Jun 2018 at 00:26, Shannon Nelson  wrote:


On 6/20/2018 12:09 PM, Anders Roxell wrote:

On Wed, 20 Jun 2018 at 07:42, Shannon Nelson  wrote:


A couple of bad behaviors in the ipsec selftest were pointed out
by Anders Roxell  and are addressed here.

Shannon Nelson (2):
selftests: rtnetlink: hide complaint from terminated monitor
selftests: rtnetlink: use a local IP address for IPsec tests

   tools/testing/selftests/net/rtnetlink.sh | 11 +++
   1 file changed, 7 insertions(+), 4 deletions(-)

--
2.7.4



Hi Shannon,

With this patches applied and my config patch.

I still get this error when I run the ipsec test:

FAIL: can't add fou port , skipping test
RTNETLINK answers: Operation not supported
FAIL: can't add macsec interface, skipping test
RTNETLINK answers: Protocol not supported
RTNETLINK answers: No such process
RTNETLINK answers: No such process
FAIL: ipsec


One of the odd things I noticed about this script is that there really
aren't any diagnosis messages, just PASS or FAIL.  I followed this
custom when I added the ipsec tests, but I think this is something that
should change so we can get some idea of what breaks.

I'm curious about the "RTNETLINK answers" messages and where they might
be coming from, especially "RTNETLINK answers: Protocol not supported".


I added: "set -x" in the beginning of the rtnetlink.sh script.
+ ip x s add proto esp src 10.66.17.140 dst 10.66.17.141 spi 0x07 mode
transport reqid 0x07 replay-window 32 aead 'rfc4106(gcm(aes))'
0x3132333435
363738393031323334353664636261 128 sel src 10.66.17.140/24 dst 10.66.17.141/24
RTNETLINK answers: Protocol not supported


Okay, so ip didn't like this command...


What are the XFRM and AES settings in your kernel config - what is the
output from
 egrep -i "xfrm|_aes" .config


CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
CONFIG_INET_XFRM_MODE_TUNNEL=y
CONFIG_INET6_XFRM_MODE_TRANSPORT=y
CONFIG_INET6_XFRM_MODE_TUNNEL=y
CONFIG_INET6_XFRM_MODE_BEET=y
CONFIG_CRYPTO_AES=y


And this is probably why - there seem to be a few config variables 
missing, including CONFIG_INET_XFRM_MODE_TRANSPORT, which might be why 
the ip command fails above.


Here's what I have in my config:
CONFIG_XFRM=y
CONFIG_XFRM_OFFLOAD=y
CONFIG_XFRM_ALGO=m
CONFIG_XFRM_USER=m
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
CONFIG_XFRM_STATISTICS=y
CONFIG_XFRM_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=m
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=m
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION=m
CONFIG_SECURITY_NETWORK_XFRM=y
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_TI is not set
CONFIG_CRYPTO_AES_X86_64=m
CONFIG_CRYPTO_AES_NI_INTEL=m
CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64=m
CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64=m
CONFIG_CRYPTO_DEV_PADLOCK_AES=m

Can I talk you into adding CONFIG_INET_XFRM_MODE_TRANSPORT to your 
config and trying again?


sln



Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition

2018-06-20 Thread Eric Dumazet



On 06/20/2018 04:41 PM, Saeed Mahameed wrote:
> 
> I see, I wanted to use _stride_ as grantee for how much a page frag can
> grow, for example in mlx5 we need the whole stride to build_skb  around
> the frag, since we always need the trailer, but it is different in here
> and we can avoid resource waste.
> 
> so how a bout this: (As suggested by Martin).
> currently as mlx4_en_complete_rx_desc assumes that priv->rx_headroom
> is always 0 in non-XDP setup, hence:
> 
> frags->page_offset += sz_align;
> 
> where it really should be:
> frags->page_offset += sz_align + priv->rx_headroom;
> 
> we can use it as a hint to not reuse as below:
> what do you think ?
> 
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 9f54ccbddea7..f14c7a574cc8 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -474,10 +474,10 @@ static int mlx4_en_complete_rx_desc(struct
> mlx4_en_priv *priv,
>  {
> const struct mlx4_en_frag_info *frag_info = priv->frag_info;
> unsigned int truesize = 0;
> +   bool release = true;
> int nr, frag_size;
> struct page *page;
> dma_addr_t dma;
> -   bool release;
> index 9f54ccbddea7..f14c7a574cc8 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> 
> /* Collect used fragments while replacing them in the HW
> descriptors */
> for (nr = 0;; frags++) {
> @@ -500,7 +500,7 @@ static int mlx4_en_complete_rx_desc(struct
> mlx4_en_priv *priv,
> release = page_count(page) != 1 ||
>   page_is_pfmemalloc(page) ||
>   page_to_nid(page) != numa_mem_id();
> -   } else {
> +   } elseif(!priv->rx_headroom) {
> u32 sz_align = ALIGN(frag_size,
> SMP_CACHE_BYTES);
> 
> frags->page_offset += sz_align;
> 

I guess that would work, please double check priv->rx_headroom wont need 
another cache line,
thanks !



Re: [PATCH] bpfilter: fix user mode helper cross compilation

2018-06-20 Thread David Miller
From: Matteo Croce 
Date: Wed, 20 Jun 2018 16:04:34 +0200

> Use $(OBJDUMP) instead of literal 'objdump' to avoid
> using host toolchain when cross compiling.
> 
> Fixes: 421780fd4983 ("bpfilter: fix build error")
> Signed-off-by: Matteo Croce 

Applied.


[PATCH net] bpf: enforce correct alignment for instructions

2018-06-20 Thread Eric Dumazet
After commit 9facc336876f ("bpf: reject any prog that failed read-only lock")
offsetof(struct bpf_binary_header, image) became 3 instead of 4,
breaking powerpc BPF badly, since instructions need to be word aligned.

Fixes: 9facc336876f ("bpf: reject any prog that failed read-only lock")
Signed-off-by: Eric Dumazet 
Cc: Daniel Borkmann 
Cc: Martin KaFai Lau 
Cc: Alexei Starovoitov 
---
 include/linux/filter.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 
b615df57b7d5b2ccb468c411c3a2aae103cd2aea..20f2659dd829256d7fef206087ab3262e1e291f5
 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -472,7 +472,9 @@ struct sock_fprog_kern {
 struct bpf_binary_header {
u16 pages;
u16 locked:1;
-   u8 image[];
+
+   /* Some arches need word alignment for their instructions */
+   u8 image[] __aligned(4);
 };
 
 struct bpf_prog {
-- 
2.18.0.rc1.244.gcf134e6275-goog



Re: net-next compilation failures

2018-06-20 Thread David Miller
From: "Chopra, Manish" 
Date: Wed, 20 Jun 2018 21:00:40 +

> I am trying to compile net-next kernel and I face these below
> compilation errros for some reason. Attached the kernel .config
> used.  Any idea for what reason these failures could be stemming ?

The net-next tree isn't open and therefore you shouldn't be doing
work against it.

When it is closed, I don't try hard to keep net-next up to date with
upstream and thus the recent upstream fixes for build problems or
other bugs.


Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition

2018-06-20 Thread Saeed Mahameed
On Tue, 2018-06-19 at 17:25 -0700, Eric Dumazet wrote:
> 
> On 06/19/2018 11:05 AM, Saeed Mahameed wrote:
> 
> > this is only true for XDP setup, for non XDP max stride_size can
> > only
> > be around ~3k and only for mtu > ~6k
> > 
> > For XDP setup you suggested:
> > -   priv->frag_info[0].frag_size = eff_mtu;
> > +   priv->frag_info[0].frag_size = PAGE_SIZE;
> > 
> > currently the condition is:
> > 
> > release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;
> > 
> > so my solution and yours have the same problem you described above.
> > 
> > the problem is not with the initial values or with stride/farg size
> > math, it just that in XDP we shouldn't reuse at ALL. I agree with
> > you
> > that we need to optimize and maybe for PAGE_SIZE > 8k we need to
> > allow
> > XDP setup to reuses. but for now there is a data corruption to
> > handle.
> 
> 
> Sure, we all agree there is a bug to fix.
> 
> The way you are fixing it is kind of illogical.
> 
> The NIC can use a frag if its _size_ is big enough to receive the
> frame.
> 
> The _stride_  is an abstraction created by the driver to report an
> estimation of the _truesize_,
> or memory consumption, so that linux can better track overall memory
> usage.
> 
> For example, if MTU=1500, the size of the fragment is 1536 bytes, but
> since we can put only
> 2 fragments per 4KB page (on x86), we declare the _stride_ to be 2048
> bytes.
> 
> Declaring that a final blob of a page, being 1600 bytes, not able to
> receive a frame because
> _stride_ is 2048 is illogical and waste resources.
> 
> 

I see, I wanted to use _stride_ as grantee for how much a page frag can
grow, for example in mlx5 we need the whole stride to build_skb  around
the frag, since we always need the trailer, but it is different in here
and we can avoid resource waste.

so how a bout this: (As suggested by Martin).
currently as mlx4_en_complete_rx_desc assumes that priv->rx_headroom
is always 0 in non-XDP setup, hence:

frags->page_offset += sz_align;

where it really should be:
frags->page_offset += sz_align + priv->rx_headroom;

we can use it as a hint to not reuse as below:
what do you think ?


diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 9f54ccbddea7..f14c7a574cc8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -474,10 +474,10 @@ static int mlx4_en_complete_rx_desc(struct
mlx4_en_priv *priv,
 {
const struct mlx4_en_frag_info *frag_info = priv->frag_info;
unsigned int truesize = 0;
+   bool release = true;
int nr, frag_size;
struct page *page;
dma_addr_t dma;
-   bool release;
index 9f54ccbddea7..f14c7a574cc8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c

/* Collect used fragments while replacing them in the HW
descriptors */
for (nr = 0;; frags++) {
@@ -500,7 +500,7 @@ static int mlx4_en_complete_rx_desc(struct
mlx4_en_priv *priv,
release = page_count(page) != 1 ||
  page_is_pfmemalloc(page) ||
  page_to_nid(page) != numa_mem_id();
-   } else {
+   } elseif(!priv->rx_headroom) {
u32 sz_align = ALIGN(frag_size,
SMP_CACHE_BYTES);

frags->page_offset += sz_align;

Re: [PATCH net-next 0/2] fixes for ipsec selftests

2018-06-20 Thread Anders Roxell
On Thu, 21 Jun 2018 at 00:26, Shannon Nelson  wrote:
>
> On 6/20/2018 12:09 PM, Anders Roxell wrote:
> > On Wed, 20 Jun 2018 at 07:42, Shannon Nelson  
> > wrote:
> >>
> >> A couple of bad behaviors in the ipsec selftest were pointed out
> >> by Anders Roxell  and are addressed here.
> >>
> >> Shannon Nelson (2):
> >>selftests: rtnetlink: hide complaint from terminated monitor
> >>selftests: rtnetlink: use a local IP address for IPsec tests
> >>
> >>   tools/testing/selftests/net/rtnetlink.sh | 11 +++
> >>   1 file changed, 7 insertions(+), 4 deletions(-)
> >>
> >> --
> >> 2.7.4
> >>
> >
> > Hi Shannon,
> >
> > With this patches applied and my config patch.
> >
> > I still get this error when I run the ipsec test:
> >
> > FAIL: can't add fou port , skipping test
> > RTNETLINK answers: Operation not supported
> > FAIL: can't add macsec interface, skipping test
> > RTNETLINK answers: Protocol not supported
> > RTNETLINK answers: No such process
> > RTNETLINK answers: No such process
> > FAIL: ipsec
>
> One of the odd things I noticed about this script is that there really
> aren't any diagnosis messages, just PASS or FAIL.  I followed this
> custom when I added the ipsec tests, but I think this is something that
> should change so we can get some idea of what breaks.
>
> I'm curious about the "RTNETLINK answers" messages and where they might
> be coming from, especially "RTNETLINK answers: Protocol not supported".

I added: "set -x" in the beginning of the rtnetlink.sh script.
+ ip x s add proto esp src 10.66.17.140 dst 10.66.17.141 spi 0x07 mode
transport reqid 0x07 replay-window 32 aead 'rfc4106(gcm(aes))'
0x3132333435
363738393031323334353664636261 128 sel src 10.66.17.140/24 dst 10.66.17.141/24
RTNETLINK answers: Protocol not supported

> What version of iproute2 are you using?  Is it older than iproute2-ss130716?

I use iproute2 release 4.17.0.

>
> What distro and kernel are you running?

for this test linux-next tag: next-20180620 distro OE (morty)

>
> What are the XFRM and AES settings in your kernel config - what is the
> output from
> egrep -i "xfrm|_aes" .config

CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
CONFIG_INET_XFRM_MODE_TUNNEL=y
CONFIG_INET6_XFRM_MODE_TRANSPORT=y
CONFIG_INET6_XFRM_MODE_TUNNEL=y
CONFIG_INET6_XFRM_MODE_BEET=y
CONFIG_CRYPTO_AES=y

>
> I did also notice that the ipsec test should set ret=0 at its start.

did.

> Can you either add this or comment out all the other tests in
> kci_test_rtnl() so that only the kci_test_ipsec is run and send me the
> output?

done.

Same result as before... added "set -x" and this is the output:
+ devdummy=test-dummy0
+ ret=0
+ ksft_skip=4
++ id -u
+ '[' 0 -ne 0 ']'
+ for x in ip tc
+ ip -Version
+ '[' 0 -ne 0 ']'
+ for x in ip tc
+ tc -Version
+ '[' 0 -ne 0 ']'
+ kci_test_rtnl
+ kci_test_ipsec
+ ret=0
++ ip -o addr
++ awk '/inet / { print $4; }'
++ grep -v '^127'
++ head -1
++ cut -f1 -d/
+ srcip=10.66.17.140
++ echo 10.66.17.140
++ cut -f1-3 -d.
+ net=10.66.17
++ echo 10.66.17.140
++ cut -f4 -d.
+ base=140
++ expr 140 + 1
+ dstip=10.66.17.141
+ algo='aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 128'
+ ip x s flush
+ ip x p flush
+ check_err 0
+ '[' 0 -eq 0 ']'
+ ret=0
++ mktemp ipsectestXXX
+ tmpfile=ipsectestHFP
+ mpid=3339
+ sleep 0.2
+ ipsecid='proto esp src 10.66.17.140 dst 10.66.17.141 spi 0x07'
+ ip x s add proto esp src 10.66.17.140 dst 10.66.17.141 spi 0x07 mode
transport reqid 0x07 replay-window 32 aead 'rfc4106(gcm(aes))'
0x3132333435
363738393031323334353664636261 128 sel src 10.66.17.140/24 dst 10.66.17.141/24
RTNETLINK answers: Protocol not supported
+ check_err 2
+ '[' 0 -eq 0 ']'
+ ret=2
++ ip x s list
++ grep 10.66.17.140
++ grep 10.66.17.141
++ wc -l
+ lines=0
+ test 0 -eq 2
+ check_err 1
+ '[' 2 -eq 0 ']'
+ ip x s count
+ grep -q 'SAD count 1'
+ check_err 1
+ '[' 2 -eq 0 ']'
++ ip x s get proto esp src 10.66.17.140 dst 10.66.17.141 spi 0x07
++ grep 10.66.17.140
++ grep 10.66.17.141
++ wc -l
RTNETLINK answers: No such process
+ lines=0
+ test 0 -eq 2
+ check_err 1
+ '[' 2 -eq 0 ']'
+ ip x s delete proto esp src 10.66.17.140 dst 10.66.17.141 spi 0x07
RTNETLINK answers: No such process
+ check_err 2
+ '[' 2 -eq 0 ']'
++ ip x s list
++ wc -l
+ lines=0
+ test 0 -eq 0
+ check_err 0
+ '[' 2 -eq 0 ']'
+ ipsecsel='dir out src 10.66.17.140/24 dst 10.66.17.141

Re: WARNING: CPU: 3 PID: 0 at net/sched/sch_hfsc.c:1388 hfsc_dequeue+0x319/0x350 [sch_hfsc]

2018-06-20 Thread Cong Wang
On Wed, Jun 20, 2018 at 12:00 AM, Marco Berizzi  wrote:
>> Il 19 giugno 2018 alle 13.42 Marco Berizzi  ha scritto:
>>
>> > Il 18 giugno 2018 alle 21.28 Cong Wang  ha 
>> > scritto:
>> > Can you test the attached patch?
>> >
>> > It almost certainly fixes the warning, but I am not sure if
>> > it papers out any other real problem. Please make sure
>> > hfsc still works as expected. :)
>>
>> Hi Cong,
>>
>> Thanks a lot for the patch. I'm building 4.17.2 + your patch.
>> Within 24 hours I will update you.
>
> Hi Cong,
>
> I have applied your patch to 4.17.2
> From my point of view hfsc is working as expected.
> However my setup is pretty simple:
>
> tc class add dev eth2 parent 1:2 classid 1:16 hfsc ls rate 3000kbit ul rate 
> 2kbit
>
> I can see hfsc is shaping my traffic to 20MBit/s.
> The error message has not popped up :-)
> I will drop you a message withing 48 hours for confirmation.

Please also test HFSC_RSC ("rt") if possible.


> Will this patch merged in a future kernel release?
>

If you can confirm nothing breaks, I will send it out formally.

Thanks for testing!


Re: [PATCH net 1/1] net/smc: coordinate wait queues for nonblocking connect

2018-06-20 Thread Cong Wang
On Wed, Jun 20, 2018 at 1:07 AM, Ursula Braun  wrote:
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index da7f02edcd37..21c84b924ffb 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -605,6 +605,8 @@ static int smc_connect(struct socket *sock, struct 
> sockaddr *addr,
>
> smc_copy_sock_settings_to_clc(smc);
> tcp_sk(smc->clcsock->sk)->syn_smc = 1;
> +   if (flags & O_NONBLOCK)
> +   sock->sk->sk_wq = smc->clcsock->sk->sk_wq;
> rc = kernel_connect(smc->clcsock, addr, alen, flags);
> if (rc)
> goto out;
> @@ -1285,12 +1287,9 @@ static __poll_t smc_poll_mask(struct socket *sock, 
> __poll_t events)
>
> smc = smc_sk(sock->sk);
> sock_hold(sk);
> -   lock_sock(sk);
> if ((sk->sk_state == SMC_INIT) || smc->use_fallback) {
> /* delegate to CLC child sock */
> -   release_sock(sk);
> mask = smc->clcsock->ops->poll_mask(smc->clcsock, events);
> -   lock_sock(sk);
> sk->sk_err = smc->clcsock->sk->sk_err;
> if (sk->sk_err) {
> mask |= EPOLLERR;
> @@ -1299,7 +1298,10 @@ static __poll_t smc_poll_mask(struct socket *sock, 
> __poll_t events)
> if (sk->sk_state == SMC_INIT &&
> mask & EPOLLOUT &&
> smc->clcsock->sk->sk_state != TCP_CLOSE) {
> +   sock->sk->sk_wq = smc->smcwq;
> +   lock_sock(sk);

I think you need to use proper RCU API to protect these
assignment of sk->sk_wq.


Re: [PATCH net-next 0/2] fixes for ipsec selftests

2018-06-20 Thread Shannon Nelson

On 6/20/2018 12:09 PM, Anders Roxell wrote:

On Wed, 20 Jun 2018 at 07:42, Shannon Nelson  wrote:


A couple of bad behaviors in the ipsec selftest were pointed out
by Anders Roxell  and are addressed here.

Shannon Nelson (2):
   selftests: rtnetlink: hide complaint from terminated monitor
   selftests: rtnetlink: use a local IP address for IPsec tests

  tools/testing/selftests/net/rtnetlink.sh | 11 +++
  1 file changed, 7 insertions(+), 4 deletions(-)

--
2.7.4



Hi Shannon,

With this patches applied and my config patch.

I still get this error when I run the ipsec test:

FAIL: can't add fou port , skipping test
RTNETLINK answers: Operation not supported
FAIL: can't add macsec interface, skipping test
RTNETLINK answers: Protocol not supported
RTNETLINK answers: No such process
RTNETLINK answers: No such process
FAIL: ipsec


One of the odd things I noticed about this script is that there really 
aren't any diagnosis messages, just PASS or FAIL.  I followed this 
custom when I added the ipsec tests, but I think this is something that 
should change so we can get some idea of what breaks.


I'm curious about the "RTNETLINK answers" messages and where they might 
be coming from, especially "RTNETLINK answers: Protocol not supported". 
What version of iproute2 are you using?  Is it older than iproute2-ss130716?


What distro and kernel are you running?

What are the XFRM and AES settings in your kernel config - what is the 
output from

egrep -i "xfrm|_aes" .config

I did also notice that the ipsec test should set ret=0 at its start. 
Can you either add this or comment out all the other tests in 
kci_test_rtnl() so that only the kci_test_ipsec is run and send me the 
output?


Thanks,
sln




Can you please cc the kselftest list when sending patches to
tools/testing/selftests/ ?

Cheers,
Anders



Re: [bpf PATCH v2 2/6] bpf: sockmap only allow ESTABLISHED sock state

2018-06-20 Thread John Fastabend
On 06/18/2018 02:17 PM, Martin KaFai Lau wrote:
> On Mon, Jun 18, 2018 at 07:50:19AM -0700, John Fastabend wrote:
>> On 06/14/2018 05:18 PM, Martin KaFai Lau wrote:
>>> On Thu, Jun 14, 2018 at 09:44:52AM -0700, John Fastabend wrote:
 Per the note in the TLS ULP (which is actually a generic statement
 regarding ULPs)

  /* The TLS ulp is currently supported only for TCP sockets
   * in ESTABLISHED state.
   * Supporting sockets in LISTEN state will require us
   * to modify the accept implementation to clone rather then
   * share the ulp context.
   */
>>> Can you explain how that apply to bpf_tcp ulp?
>>>
>>> My understanding is the "ulp context" referred in TLS ulp is
>>> the tls_context stored in icsk_ulp_data but I don't see bpf_tcp's
>>> ulp is using icsk_ulp_data.
>>>
>>> Others LGTM.
>>>
>>
>> So I think you are right we could probably allow it
>> here but I am thinking I'll leave the check for now
>> anyways for a couple reasons. First, we will shortly
>> add support to allow ULP types to coexist. At the moment
>> the two ULP types can not coexist. When this happens it
>> looks like we will need to restrict to only ESTABLISHED
>> types or somehow make all ULPs work in all states.
>>
>> Second, I don't have any use cases (nor can I think of
>> any) for the sock{map|hash} ULP to be running on a non
>> ESTABLISHED socket. Its not clear to me that having the
>> sendmsg/sendpage hooks for a LISTEN socket makes sense.
>> I would rather restrict it now and if we add something
>> later where it makes sense to run on non-ESTABLISHED
>> socks we can remove the check.
> Make sense if there is no use case.  It will be helpful if the commit log
> is updated accordingly.  Thanks!
> 
> Acked-by: Martin KaFai Lau 
> 

The fall-out of this patch got a bit ugly. It doesn't
make much sense to me to allow transitioning into
ESTABLISH state (via tcp_disconnect) and not allow adding
the socks up front. But the fix via unhash callback, subsequent
patch, ended up causing a few issues. First to avoid racing
with transitions through update logic we had to use
sock_lock(sk) in the update handler. Which means we
can't use the normal ./kernel/bpf/syscall.c map update
logic and had to special case it so that preempt and
rcu were not used until after the lock was taken because
sock_lock can sleep. Then after running over night I noticed
a couple WARNINGS related to sk_forward_alloc not being
zero'd correctly on sock teardown. The issue is unhash
doesn't have sock_lock either and can be done while a
sendmsg/sendpage are running resulting in incorrectly
removing scatterlists. :(

All in all the "fix" got ugly so lets stay with the minimal
required set and allow non-established socks. It shouldn't
hurt anything even if from a use case perspective its a bit
useless. Then in bpf-next when we allow ULPs to coexist we
need to have some fix to handle this. But I would rather do
that in *next branches instead of bpf/net branches.

Thanks for all the useful feedback. I'm going to send a
set of three patches now to address the specific issues,
(a) IPV6 socks, (b) bucket lock missing and (c) uref release
missing.

Thanks,
John


862591bf4("xfrm: skip policies marked as dead while rehashing")

2018-06-20 Thread Zubin Mithra
Hello,

Syzkaller has reported a crash here[1] for a slab OOB read in
xfrm_hash_rebuild.

Could the following 2 patches be applied in order to on 4.4.y?

6916fb3b10("xfrm: Ignore socket policies when rebuilding hash tables")
862591bf4f("xfrm: skip policies marked as dead while rehashing")

[1] 
https://syzkaller.appspot.com/bug?id=1c11a638b7d27e871aa297f3b4d5fd5bc90f0cb4

Thanks,
- Zubin



Re: [PATCH RFC v2 ipsec-next 0/3] Virtual xfrm interfaces

2018-06-20 Thread Benedict Wong
This set of patches entire resolves issues Android was having with
VTIs, specifically that we need multiple IPsec tunnels with the same
source/destination endpoints.

These tests pass the following regression tests
(https://android-review.googlesource.com/c/kernel/tests/+/588259/24):
- Outbound encapsulation
- Inbound decapsulation
- Rekey
- ICMP error path
- Netfilter rejections of outbound paths.

-- Benedict Wong


On Tue, Jun 12, 2018 at 12:56 AM Steffen Klassert
 wrote:
>
> This patchset introduces new virtual xfrm interfaces.
> The design of virtual xfrm interfaces interfaces was
> discussed at the Linux IPsec workshop 2018. This patchset
> implements these interfaces as the IPsec userspace and
> kernel developers agreed. The purpose of these interfaces
> is to overcome the design limitations that the existing
> VTI devices have.
>
> The main limitations that we see with the current VTI are the
> following:
>
> - VTI interfaces are L3 tunnels with configurable endpoints.
>   For xfrm, the tunnel endpoint are already determined by the SA.
>   So the VTI tunnel endpoints must be either the same as on the
>   SA or wildcards. In case VTI tunnel endpoints are same as on
>   the SA, we get a one to one correlation between the SA and
>   the tunnel. So each SA needs its own tunnel interface.
>
>   On the other hand, we can have only one VTI tunnel with
>   wildcard src/dst tunnel endpoints in the system because the
>   lookup is based on the tunnel endpoints. The existing tunnel
>   lookup won't work with multiple tunnels with wildcard
>   tunnel endpoints. Some usecases require more than on
>   VTI tunnel of this type, for example if somebody has multiple
>   namespaces and every namespace requires such a VTI.
>
> - VTI needs separate interfaces for IPv4 and IPv6 tunnels.
>   So when routing to a VTI, we have to know to which address
>   family this traffic class is going to be encapsulated.
>   This is a lmitation because it makes routing more complex
>   and it is not always possible to know what happens behind the
>   VTI, e.g. when the VTI is move to some namespace.
>
> - VTI works just with tunnel mode SAs. We need generic interfaces
>   that ensures transfomation, regardless of the xfrm mode and
>   the encapsulated address family.
>
> - VTI is configured with a combination GRE keys and xfrm marks.
>   With this we have to deal with some extra cases in the generic
>   tunnel lookup because the GRE keys on the VTI are actually
>   not GRE keys, the GRE keys were just reused for something else.
>   All extensions to the VTI interfaces would require to add
>   even more complexity to the generic tunnel lookup.
>
> To overcome this, we started with the following design goal:
>
> - It should be possible to tunnel IPv4 and IPv6 through the same
>   interface.
>
> - No limitation on xfrm mode (tunnel, transport and beet).
>
> - Should be a generic virtual interface that ensures IPsec
>   transformation, no need to know what happens behind the
>   interface.
>
> - Interfaces should be configured with a new key that must match a
>   new policy/SA lookup key.
>
> - The lookup logic should stay in the xfrm codebase, no need to
>   change or extend generic routing and tunnel lookups.
>
> - Should be possible to use IPsec hardware offloads of the underlying
>   interface.
>
> Changes from v1:
>
> - Document the limitations of VTI interfaces and the design of
>   the new xfrm interfaces more explicit in the commit messages.
>
> - No code changes.


[PATCH net-next] tcp_bbr: fix bbr pacing rate for internal pacing

2018-06-20 Thread Kevin Yang
From: Eric Dumazet 

This commit makes BBR use only the MSS (without any headers) to
calculate pacing rates when internal TCP-layer pacing is used.

This is necessary to achieve the correct pacing behavior in this case,
since tcp_internal_pacing() uses only the payload length to calculate
pacing delays.

Signed-off-by: Kevin Yang 
Signed-off-by: Eric Dumazet 
Reviewed-by: Neal Cardwell 
---
 include/net/tcp.h | 11 +++
 net/ipv4/tcp_bbr.c|  6 +-
 net/ipv4/tcp_output.c | 14 --
 3 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0448e7c5d2b4..822ee49ed0f9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1184,6 +1184,17 @@ static inline bool tcp_is_cwnd_limited(const struct sock 
*sk)
return tp->is_cwnd_limited;
 }
 
+/* BBR congestion control needs pacing.
+ * Same remark for SO_MAX_PACING_RATE.
+ * sch_fq packet scheduler is efficiently handling pacing,
+ * but is not always installed/used.
+ * Return true if TCP stack should pace packets itself.
+ */
+static inline bool tcp_needs_internal_pacing(const struct sock *sk)
+{
+   return smp_load_acquire(&sk->sk_pacing_status) == SK_PACING_NEEDED;
+}
+
 /* Something is really bad, we could not queue an additional packet,
  * because qdisc is full or receiver sent a 0 window.
  * We do not want to add fuel to the fire, or abort too early,
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 58e2f479ffb4..3b5f45b9e81e 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -205,7 +205,11 @@ static u32 bbr_bw(const struct sock *sk)
  */
 static u64 bbr_rate_bytes_per_sec(struct sock *sk, u64 rate, int gain)
 {
-   rate *= tcp_mss_to_mtu(sk, tcp_sk(sk)->mss_cache);
+   unsigned int mss = tcp_sk(sk)->mss_cache;
+
+   if (!tcp_needs_internal_pacing(sk))
+   mss = tcp_mss_to_mtu(sk, mss);
+   rate *= mss;
rate *= gain;
rate >>= BBR_SCALE;
rate *= USEC_PER_SEC;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8e08b409c71e..f8f6129160dd 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -973,17 +973,6 @@ enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)
return HRTIMER_NORESTART;
 }
 
-/* BBR congestion control needs pacing.
- * Same remark for SO_MAX_PACING_RATE.
- * sch_fq packet scheduler is efficiently handling pacing,
- * but is not always installed/used.
- * Return true if TCP stack should pace packets itself.
- */
-static bool tcp_needs_internal_pacing(const struct sock *sk)
-{
-   return smp_load_acquire(&sk->sk_pacing_status) == SK_PACING_NEEDED;
-}
-
 static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
 {
u64 len_ns;
@@ -995,9 +984,6 @@ static void tcp_internal_pacing(struct sock *sk, const 
struct sk_buff *skb)
if (!rate || rate == ~0U)
return;
 
-   /* Should account for header sizes as sch_fq does,
-* but lets make things simple.
-*/
len_ns = (u64)skb->len * NSEC_PER_SEC;
do_div(len_ns, rate);
hrtimer_start(&tcp_sk(sk)->pacing_timer,
-- 
2.18.0.rc1.244.gcf134e6275-goog



Re: [PATCH net-next 0/2] fixes for ipsec selftests

2018-06-20 Thread Anders Roxell
On Wed, 20 Jun 2018 at 07:42, Shannon Nelson  wrote:
>
> A couple of bad behaviors in the ipsec selftest were pointed out
> by Anders Roxell  and are addressed here.
>
> Shannon Nelson (2):
>   selftests: rtnetlink: hide complaint from terminated monitor
>   selftests: rtnetlink: use a local IP address for IPsec tests
>
>  tools/testing/selftests/net/rtnetlink.sh | 11 +++
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> --
> 2.7.4
>

Hi Shannon,

With this patches applied and my config patch.

I still get this error when I run the ipsec test:

FAIL: can't add fou port , skipping test
RTNETLINK answers: Operation not supported
FAIL: can't add macsec interface, skipping test
RTNETLINK answers: Protocol not supported
RTNETLINK answers: No such process
RTNETLINK answers: No such process
FAIL: ipsec

Can you please cc the kselftest list when sending patches to
tools/testing/selftests/ ?

Cheers,
Anders


Re: Route fallback issue

2018-06-20 Thread Julian Anastasov


Hello,

On Wed, 20 Jun 2018, Akshat Kakkar wrote:

> Hi netdev community,
> 
> I have 2 interfaces
> eno1 : 192.168.1.10/24
> eno2 : 192.168.2.10/24
> 
> I added routes as
> 172.16.0.0/12 via 192.168.1.254 metric 1
> 172.16.0.0/12 via 192.168.2.254 metric 2
> 
> My intention : All traffic to 172.16.0.0/12 should go thru eno1. If
> 192.168.1.254 is not reachable (no arp entry or link down), then it
> should fall back to eno2.

You can also try alternative routes. But as the
kernel supports only default alternative routes, you can
put them in their own table:

# Alternative routes use same metric!!!
ip route append default via 192.168.1.254 dev eno1 table 100
ip route append default via 192.168.2.254 dev eno2 table 100
ip rule add prio 100 to 172.16.0.0/12 table 100

Of course, you will get better results if an user space
tool puts only alive routes in service after doing health
checks of all near gateways.

Regards

--
Julian Anastasov 


Re: [PATCH] r8169: Fix netpoll oops

2018-06-20 Thread Heiner Kallweit
On 20.06.2018 14:01, Ville Syrjala wrote:
> From: Ville Syrjälä 
> 
> Pass the correct thing to rtl8169_interrupt() from netpoll.
> 
Indeed, this place I missed to change.

Only small remark: The patch should be annotated as "net" so
that it can go to 4.18.

> Cc: Realtek linux nic maintainers 
> Cc: netdev@vger.kernel.org
> Cc: Heiner Kallweit 
> Cc: David S. Miller 
> Fixes: ebcd5daa7ffd ("r8169: change interrupt handler argument type")
> Signed-off-by: Ville Syrjälä 
> ---
>  drivers/net/ethernet/realtek/r8169.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/realtek/r8169.c 
> b/drivers/net/ethernet/realtek/r8169.c
> index 75dfac0248f4..f4cae2be0fda 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -7148,7 +7148,7 @@ static void rtl8169_netpoll(struct net_device *dev)
>  {
>   struct rtl8169_private *tp = netdev_priv(dev);
>  
> - rtl8169_interrupt(pci_irq_vector(tp->pci_dev, 0), dev);
> + rtl8169_interrupt(pci_irq_vector(tp->pci_dev, 0), tp);
>  }
>  #endif
>  
> 



[PATCH bpf 2/2] tools: bpftool: remember to close the libbpf object after prog load

2018-06-20 Thread Jakub Kicinski
Remembering to close all descriptors and free memory may not seem
important in a user space tool like bpftool, but if we were to run
in batch mode the consumed resources start to add up quickly.  Make
sure program load closes the libbpf object (which unloads and frees
it).

Fixes: 49a086c201a9 ("bpftool: implement prog load command")
Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/bpf/bpftool/prog.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 12b694fe0404..959aa53ab678 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -695,12 +695,18 @@ static int do_load(int argc, char **argv)
}
 
if (do_pin_fd(prog_fd, argv[1]))
-   return -1;
+   goto err_close_obj;
 
if (json_output)
jsonw_null(json_wtr);
 
+   bpf_object__close(obj);
+
return 0;
+
+err_close_obj:
+   bpf_object__close(obj);
+   return -1;
 }
 
 static int do_help(int argc, char **argv)
-- 
2.17.1



[PATCH bpf 1/2] tools: bpftool: remove duplicated error message on prog load

2018-06-20 Thread Jakub Kicinski
do_pin_fd() will already print out an error message if something
goes wrong.  Printing another error is unnecessary and will break
JSON output, since error messages are full objects:

$ bpftool -jp prog load tracex1_kern.o /sys/fs/bpf/a
{
"error": "can't pin the object (/sys/fs/bpf/a): File exists"
},{
"error": "failed to pin program"
}

Fixes: 49a086c201a9 ("bpftool: implement prog load command")
Signed-off-by: Jakub Kicinski 
Reviewed-by: Quentin Monnet 
---
 tools/bpf/bpftool/prog.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 05f42a46d6ed..12b694fe0404 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -694,10 +694,8 @@ static int do_load(int argc, char **argv)
return -1;
}
 
-   if (do_pin_fd(prog_fd, argv[1])) {
-   p_err("failed to pin program");
+   if (do_pin_fd(prog_fd, argv[1]))
return -1;
-   }
 
if (json_output)
jsonw_null(json_wtr);
-- 
2.17.1



[PATCH bpf 0/2] tools: bpftool: small fixes for error handling in prog load

2018-06-20 Thread Jakub Kicinski
Hi!

Two small fixes for error handling in bpftool prog load, first patch
removes a duplicated message, second one frees resources correctly.
Multiple error messages break JSON:

# bpftool -jp prog load tracex1_kern.o /sys/fs/bpf/a
{
"error": "can't pin the object (/sys/fs/bpf/a): File exists"
},{
"error": "failed to pin program"
}   

Jakub Kicinski (2):
  tools: bpftool: remove duplicated error message on prog load
  tools: bpftool: remember to close the libbpf object on prog load error

 tools/bpf/bpftool/prog.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

-- 
2.17.1



Re: [PATCH v4 net-next] net:sched: add action inheritdsfield to skbedit

2018-06-20 Thread Marcelo Ricardo Leitner
On Wed, Jun 20, 2018 at 07:02:41PM +0200, Davide Caratti wrote:
...
> > I agree that we should update the drop counter, but given that 
> > you're already converting the stats to be per-cpu counters, whatever we 
> > add now will be just symbolic since you're going to change it anyway.

It wouldn't be symbolic. One thing is to convert a given increment
into something else, another is to start increasing it for some (new)
reason.

> 
> that's ok for me also, as I can use the current v4 code for the rebase
> (and not wait for another respin) _ but let's hear what reviewers think.
> 
> >  If 
> > reviewers think that Qiaobin's patch must add the update line, could you 
> > provide the exact line and location so we avoid going to v6 of this patch?
> 
> In case, I was thinking of something like: 
> 
> https://elixir.bootlin.com/linux/v4.18-rc1/source/net/sched/act_ipt.c#L249
> 
> so, between 'err:' and 'spin_unlock(&d->tcf_lock)', insert a line like:
> 
> d->tcf_qstats.drop++;

I prefer the more complete version. Then it will have a more complete
(git) history and help people when troubleshooting.

Thanks,
Marcelo


Re: [PATCH net] cls_flower: fix use after free in flower S/W path

2018-06-20 Thread Cong Wang
On Wed, Jun 20, 2018 at 10:34 AM, Paolo Abeni  wrote:
>
> +static void fl_mask_free(struct fl_flow_mask *mask)
> +{
> +   rhashtable_destroy(&mask->ht);

I don't believe you can call rhashtable_destroy() in BH
context, it acquires a mutex...


Re: [RFC v2, net-next, PATCH 4/4] net/cpsw_switchdev: add switchdev mode of operation on cpsw driver

2018-06-20 Thread Ilias Apalodimas
Hi Florian,

On Wed, Jun 20, 2018 at 10:58:26AM -0700, Florian Fainelli wrote:
> On 06/20/2018 10:51 AM, Ilias Apalodimas wrote:
> > Hello Ivan,
> > On Wed, Jun 20, 2018 at 02:56:48PM +0200, Ivan Vecera wrote:
> >> On 18.6.2018 22:19, Ilias Apalodimas wrote:
> >>> Jiri proposed using devlink, which makes sense, but i am not sure it's
> >>> applicable on this patchset. This will change the driver completely and 
> >>> will
> >>> totally break backwards compatibility.
> >>
> >> Another good reason for a new driver.
> >>
> >> I.
> > This is actually conflicting at least to my understanding. Jiri proposed 
> > using 
> > devlink was used as an alternative method to enable a new mode instead of 
> > adding it on a .config option. A new driver wouldn't have a need for that 
> > right?
> 
> Correct, with a new driver would likely behave correctly upon being
> probed such that you could have your switch ports act as normal network
> devices from which you could run IP-config and do NFS boot.
The current driver also does NFS properly and the 2 ethernet ports act as normal
network interfaces. The NFS section in the cover letter
is to cover the cases were users running on NFS need to change the running
switch configuration(starting from adding the 2 interfaces on a bridge). 
Since iproute2 is located on the NFS filesystem the moment
network connectivity is lost, you loose the ability to perform further
configuration and in certian configuration scenarios render the device
unusable.
> -- 
> Florian

Thanks
Ilias


Re: [RFC v2, net-next, PATCH 4/4] net/cpsw_switchdev: add switchdev mode of operation on cpsw driver

2018-06-20 Thread Florian Fainelli
On 06/20/2018 10:51 AM, Ilias Apalodimas wrote:
> Hello Ivan,
> On Wed, Jun 20, 2018 at 02:56:48PM +0200, Ivan Vecera wrote:
>> On 18.6.2018 22:19, Ilias Apalodimas wrote:
>>> Jiri proposed using devlink, which makes sense, but i am not sure it's
>>> applicable on this patchset. This will change the driver completely and will
>>> totally break backwards compatibility.
>>
>> Another good reason for a new driver.
>>
>> I.
> This is actually conflicting at least to my understanding. Jiri proposed 
> using 
> devlink was used as an alternative method to enable a new mode instead of 
> adding it on a .config option. A new driver wouldn't have a need for that 
> right?

Correct, with a new driver would likely behave correctly upon being
probed such that you could have your switch ports act as normal network
devices from which you could run IP-config and do NFS boot.
-- 
Florian


Re: [RFC v2, net-next, PATCH 4/4] net/cpsw_switchdev: add switchdev mode of operation on cpsw driver

2018-06-20 Thread Andrew Lunn
> > Another good reason for a new driver.
> > 
> > I.
> This is actually conflicting at least to my understanding. Jiri proposed 
> using 
> devlink was used as an alternative method to enable a new mode instead of 
> adding it on a .config option. A new driver wouldn't have a need for that 
> right?

A new driver would only do switchdev. So correct, there would not be
any configuration need. It would also give you a chance to have a new
device tree binding, if that helps at all.

Andrew


Re: [RFC v2, net-next, PATCH 4/4] net/cpsw_switchdev: add switchdev mode of operation on cpsw driver

2018-06-20 Thread Ilias Apalodimas
Hello Ivan,
On Wed, Jun 20, 2018 at 02:56:48PM +0200, Ivan Vecera wrote:
> On 18.6.2018 22:19, Ilias Apalodimas wrote:
> > Jiri proposed using devlink, which makes sense, but i am not sure it's
> > applicable on this patchset. This will change the driver completely and will
> > totally break backwards compatibility.
> 
> Another good reason for a new driver.
> 
> I.
This is actually conflicting at least to my understanding. Jiri proposed using 
devlink was used as an alternative method to enable a new mode instead of 
adding it on a .config option. A new driver wouldn't have a need for that right?

Thanks
Ilias


Re: [PATCH net] ipvlan: call dev_change_flags when reset ipvlan mode

2018-06-20 Thread Cong Wang
On Tue, Jun 19, 2018 at 10:31 PM, David Miller  wrote:
> From: Hangbin Liu 
> Date: Wed, 20 Jun 2018 11:22:54 +0800
>
>> The only case dev_change_flags() return an err is when we change IFF_UP flag.
>> Since we only set/reset IFF_NOARP, do you think we still need to check the
>> return value?
>
> It is bad to try and take shortcuts on error handling using assumptions
> like that.
>
> If dev_change_flags() is adjusted to return error codes in more
> situations, nobody is going to remember to undo your "optimziation"
> here.
>
> Please check for errors, thank you.

Yeah. Also since the notifier is triggered in this case:

if (dev->flags & IFF_UP &&
(changes & ~(IFF_UP | IFF_PROMISC | IFF_ALLMULTI | IFF_VOLATILE))) {
struct netdev_notifier_change_info change_info = {
.info = {
.dev = dev,
},
.flags_changed = changes,
};

call_netdevice_notifiers_info(NETDEV_CHANGE, &change_info.info);
}

the return value of call_netdevice_notifiers_info() isn't captured
either, but it should be.


[PATCH net] cls_flower: fix use after free in flower S/W path

2018-06-20 Thread Paolo Abeni
If flower filter is created without the skip_sw flag, fl_mask_put()
can race with fl_classify() and we can destroy the mask rhashtable
while a lookup operation is accessing it.

 BUG: unable to handle kernel paging request at 000911d1
 PGD 0 P4D 0
 SMP PTI
 CPU: 3 PID: 5582 Comm: vhost-5541 Not tainted 4.18.0-rc1.vanilla+ #1950
 Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
 RIP: 0010:rht_bucket_nested+0x20/0x60
 Code: 31 c8 c1 c1 18 29 c8 c3 66 90 8b 4f 04 ba 01 00 00 00 8b 07 48 8b bf 80 
00 00 0
 RSP: 0018:afc5cfbb7a48 EFLAGS: 00010206
 RAX: 1978 RBX: 9f12dff88a00 RCX: 9f12
 RDX: 000911d1 RSI: 0148 RDI: 0001
 RBP: 9f12dff88a00 R08: 5f1cc119 R09: a715fae2
 R10: afc5cfbb7aa8 R11: 9f1cb4be804e R12: 9f1265e13000
 R13:  R14: afc5cfbb7b48 R15: 9f12dff88b68
 FS:  () GS:9f1d3f0c() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 000911d1 CR3: 001575a94006 CR4: 001626e0
 Call Trace:
  fl_lookup+0x134/0x140 [cls_flower]
  fl_classify+0xf3/0x180 [cls_flower]
  tcf_classify+0x78/0x150
  __netif_receive_skb_core+0x69e/0xa50
  netif_receive_skb_internal+0x42/0xf0
  tun_get_user+0xdd5/0xfd0 [tun]
  tun_sendmsg+0x52/0x70 [tun]
  handle_tx+0x2b3/0x5f0 [vhost_net]
  vhost_worker+0xab/0x100 [vhost]
  kthread+0xf8/0x130
  ret_from_fork+0x35/0x40
 Modules linked in: act_mirred act_gact cls_flower vhost_net vhost tap 
sch_ingress
 CR2: 000911d1

Fix the above waiting for a RCU grace period before destroying the
rhashtable.

Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority")
Signed-off-by: Paolo Abeni 
---
 net/sched/cls_flower.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 2b5be42a9f1c..5e7333c6472d 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -203,6 +203,17 @@ static int fl_init(struct tcf_proto *tp)
return rhashtable_init(&head->ht, &mask_ht_params);
 }
 
+static void fl_mask_free(struct fl_flow_mask *mask)
+{
+   rhashtable_destroy(&mask->ht);
+   kfree(mask);
+}
+
+static void fl_mask_free_rcu(struct rcu_head *entry)
+{
+   fl_mask_free(container_of(entry, struct fl_flow_mask, rcu));
+}
+
 static bool fl_mask_put(struct cls_fl_head *head, struct fl_flow_mask *mask,
bool async)
 {
@@ -210,12 +221,11 @@ static bool fl_mask_put(struct cls_fl_head *head, struct 
fl_flow_mask *mask,
return false;
 
rhashtable_remove_fast(&head->ht, &mask->ht_node, mask_ht_params);
-   rhashtable_destroy(&mask->ht);
list_del_rcu(&mask->list);
if (async)
-   kfree_rcu(mask, rcu);
+   call_rcu(&mask->rcu, fl_mask_free_rcu);
else
-   kfree(mask);
+   fl_mask_free(mask);
 
return true;
 }
-- 
2.17.1



Re: [PATCH iproute2-next 1/1] tc: jsonify nat action

2018-06-20 Thread David Ahern
On 6/18/18 12:57 PM, Keara Leibovitz wrote:
> Add json output support for nat action
> 

...

> 
> Signed-off-by: Keara Leibovitz 
> ---
>  tc/m_nat.c | 32 +++-
>  1 file changed, 19 insertions(+), 13 deletions(-)
> 


applied to iproute2-next. Thanks



Re: [PATCH iproute2-next v3] ip-xfrm: Add support for OUTPUT_MARK

2018-06-20 Thread David Ahern
On 6/15/18 8:32 PM, Subash Abhinov Kasiviswanathan wrote:
> This patch adds support for OUTPUT_MARK in xfrm state to exercise the
> functionality added by kernel commit 077fbac405bf
> ("net: xfrm: support setting an output mark.").
> 

...

> v1->v2: Moved the XFRMA_OUTPUT_MARK print after XFRMA_MARK in
> xfrm_xfrma_print() as mentioned by Lorenzo
> 
> v2->v3: Fix one help formatting error as mentioned by Lorenzo.
> Keep mark and output-mark on the same line and add man page info as
> mentioned by David.
> 
> Signed-off-by: Subash Abhinov Kasiviswanathan 
> ---
>  ip/ipxfrm.c| 17 -
>  ip/xfrm_state.c|  9 +
>  man/man8/ip-xfrm.8 | 11 +++
>  3 files changed, 36 insertions(+), 1 deletion(-)
> 

applied to iproute2-next. Thanks




Re: [PATCH] net: Fix device name resolving crash in default_device_exit()

2018-06-20 Thread David Ahern
On 6/20/18 2:57 AM, Kirill Tkhai wrote:
> From: Kirill Tkhai 
> 
> The following script makes kernel to crash since it can't obtain
> a name for a device, when the name is occupied by another device:
> 
> #!/bin/bash
> ifconfig eth0 down
> ifconfig eth1 down
> index=`cat /sys/class/net/eth1/ifindex`
> ip link set eth1 name dev$index
> unshare -n sleep 1h &
> pid=$!
> while [[ "`readlink /proc/self/ns/net`" == "`readlink /proc/$pid/ns/net`" ]]; 
> do continue; done
> ip link set dev$index netns $pid
> ip link set eth0 name dev$index
> kill -9 $pid
> 
> Kernel messages:
> 
> virtio_net virtio1 dev3: renamed from eth1
> virtio_net virtio0 dev3: renamed from eth0
> default_device_exit: failed to move dev3 to init_net: -17
> [ cut here ]
> kernel BUG at net/core/dev.c:8978!
> invalid opcode:  [#1] PREEMPT SMP
> CPU: 1 PID: 276 Comm: kworker/u8:3 Not tainted 4.17.0+ #292
> Workqueue: netns cleanup_net
> RIP: 0010:default_device_exit+0x9c/0xb0
> [stack trace snipped]
> 
> This patch gives more variability during choosing new name
> of device and fixes the problem.
> 
> Signed-off-by: Kirill Tkhai 
> ---
> 
> Since there is no suggestions how to fix this in another way, I'm resending 
> the patch.

This patch does not remove the BUG, so does not really solve the
problem. ie., it is fairly trivial to write a script (32k dev%d named
devices in init_net) that triggers it again, so your commit subject and
commit log are not correct with the references to 'fixing the problem'.

The change does provide more variability in naming and reduces the
likelihood of not being able to push a device back to init_net.


Re: [PATCH net] sctp: fix erroneous inc of snmp SctpFragUsrMsgs

2018-06-20 Thread Neil Horman
On Wed, Jun 20, 2018 at 12:47:52PM -0300, Marcelo Ricardo Leitner wrote:
> Currently it is incrementing SctpFragUsrMsgs when the user message size
> is of the exactly same size as the maximum fragment size, which is wrong.
> 
> The fix is to increment it only when user message is bigger than the
> maximum fragment size.
> 
> Fixes: bfd2e4b8734d ("sctp: refactor sctp_datamsg_from_user")
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  net/sctp/chunk.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index 
> 79daa98208c391c780440144d69bc7be875c3476..bfb9f812e2ef9fa605b08dc1f534781573c3abf8
>  100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -237,7 +237,9 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct 
> sctp_association *asoc,
>   /* Account for a different sized first fragment */
>   if (msg_len >= first_len) {
>   msg->can_delay = 0;
> - SCTP_INC_STATS(sock_net(asoc->base.sk), SCTP_MIB_FRAGUSRMSGS);
> + if (msg_len > first_len)
> + SCTP_INC_STATS(sock_net(asoc->base.sk),
> +SCTP_MIB_FRAGUSRMSGS);
>   } else {
>   /* Which may be the only one... */
>   first_len = msg_len;
> -- 
> 2.14.4
> 
> 
Acked-by: Neil Horman 



Re: [PATCH v4 net-next] net:sched: add action inheritdsfield to skbedit

2018-06-20 Thread Davide Caratti
On Wed, 2018-06-20 at 12:47 -0400, Michel Machado wrote:
> On 06/20/2018 12:08 PM, Davide Caratti wrote:
> > On Tue, 2018-06-12 at 15:42 +, Fu, Qiaobin wrote:
> > > The new action inheritdsfield copies the field DS of
> > > IPv4 and IPv6 packets into skb->priority. This enables
> > > later classification of packets based on the DS field.
> > > 
> > > v4:
> > > *Not allow setting flags other than the expected ones.
> > > 
> > > *Allow dumping the pure flags.
> > > 
> > > Original idea by Jamal Hadi Salim 
> > > 
> > > Signed-off-by: Qiaobin Fu 
> > > Reviewed-by: Michel Machado 
> > > ---
> > > 
> > 
> > [...]
> > 
> > > @@ -41,6 +44,25 @@ static int tcf_skbedit(struct sk_buff *skb, const 
> > > struct tc_action *a,
> > >   
> > >   if (d->flags & SKBEDIT_F_PRIORITY)
> > >   skb->priority = d->priority;
> > > + if (d->flags & SKBEDIT_F_INHERITDSFIELD) {
> > > + int wlen = skb_network_offset(skb);
> > > +
> > > + switch (tc_skb_protocol(skb)) {
> > > + case htons(ETH_P_IP):
> > > + wlen += sizeof(struct iphdr);
> > > + if (!pskb_may_pull(skb, wlen))
> > > + goto err;
> > > + skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
> > > + break;
> > > +
> > > + case htons(ETH_P_IPV6):
> > > + wlen += sizeof(struct ipv6hdr);
> > > + if (!pskb_may_pull(skb, wlen))
> > > + goto err;
> > > + skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
> > > + break;
> > > + }
> > > + }
> > >   if (d->flags & SKBEDIT_F_QUEUE_MAPPING &&
> > >   skb->dev->real_num_tx_queues > d->queue_mapping)
> > >   skb_set_queue_mapping(skb, d->queue_mapping);
> > > @@ -53,6 +75,10 @@ static int tcf_skbedit(struct sk_buff *skb, const 
> > > struct tc_action *a,
> > >   
> > >   spin_unlock(&d->tcf_lock);
> > >   return d->tcf_action;
> > > +
> > > +err:
> > > + spin_unlock(&d->tcf_lock);
> > > + return TC_ACT_SHOT;
> > >   }
> > >   
> > 
> > sorry for asking this when the patch is a v4...
> > 
> > I spotted this, as I'm rebasing a small series that removes the tcf_lock
> > from the data plane of skbedit to gain some speed, and it converts the
> > stats to be per-cpu counters.
> > 
> > in the code above, you are catching failures of pskb_may_pull(skb) and
> > then you return TC_ACT_SHOT. That's OK, but I think you should update the
> > drop counter, like other TC actions do.
> > 
> > If you (author / reviewers) think this is a minor issue, like I do think,
> > then I can add the missing update in my series and post it when net-next
> > reopens.
> > 
> > WDYT?
> > 
> > thank you in advance!
> > regards,
> > 
> 
> Hi Davide,
> 
> I agree that we should update the drop counter, but given that 
> you're already converting the stats to be per-cpu counters, whatever we 
> add now will be just symbolic since you're going to change it anyway.

that's ok for me also, as I can use the current v4 code for the rebase
(and not wait for another respin) _ but let's hear what reviewers think.

>  If 
> reviewers think that Qiaobin's patch must add the update line, could you 
> provide the exact line and location so we avoid going to v6 of this patch?

In case, I was thinking of something like: 

https://elixir.bootlin.com/linux/v4.18-rc1/source/net/sched/act_ipt.c#L249

so, between 'err:' and 'spin_unlock(&d->tcf_lock)', insert a line like:

d->tcf_qstats.drop++;

regards,
-- 
davide 




Re: [PATCH v4 net-next] net:sched: add action inheritdsfield to skbedit

2018-06-20 Thread Michel Machado

On 06/20/2018 12:08 PM, Davide Caratti wrote:

On Tue, 2018-06-12 at 15:42 +, Fu, Qiaobin wrote:

The new action inheritdsfield copies the field DS of
IPv4 and IPv6 packets into skb->priority. This enables
later classification of packets based on the DS field.

v4:
*Not allow setting flags other than the expected ones.

*Allow dumping the pure flags.

Original idea by Jamal Hadi Salim 

Signed-off-by: Qiaobin Fu 
Reviewed-by: Michel Machado 
---



[...]


@@ -41,6 +44,25 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
tc_action *a,
  
  	if (d->flags & SKBEDIT_F_PRIORITY)

skb->priority = d->priority;
+   if (d->flags & SKBEDIT_F_INHERITDSFIELD) {
+   int wlen = skb_network_offset(skb);
+
+   switch (tc_skb_protocol(skb)) {
+   case htons(ETH_P_IP):
+   wlen += sizeof(struct iphdr);
+   if (!pskb_may_pull(skb, wlen))
+   goto err;
+   skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
+   break;
+
+   case htons(ETH_P_IPV6):
+   wlen += sizeof(struct ipv6hdr);
+   if (!pskb_may_pull(skb, wlen))
+   goto err;
+   skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
+   break;
+   }
+   }
if (d->flags & SKBEDIT_F_QUEUE_MAPPING &&
skb->dev->real_num_tx_queues > d->queue_mapping)
skb_set_queue_mapping(skb, d->queue_mapping);
@@ -53,6 +75,10 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
tc_action *a,
  
  	spin_unlock(&d->tcf_lock);

return d->tcf_action;
+
+err:
+   spin_unlock(&d->tcf_lock);
+   return TC_ACT_SHOT;
  }
  


sorry for asking this when the patch is a v4...

I spotted this, as I'm rebasing a small series that removes the tcf_lock
from the data plane of skbedit to gain some speed, and it converts the
stats to be per-cpu counters.

in the code above, you are catching failures of pskb_may_pull(skb) and
then you return TC_ACT_SHOT. That's OK, but I think you should update the
drop counter, like other TC actions do.

If you (author / reviewers) think this is a minor issue, like I do think,
then I can add the missing update in my series and post it when net-next
reopens.

WDYT?

thank you in advance!
regards,



Hi Davide,

   I agree that we should update the drop counter, but given that 
you're already converting the stats to be per-cpu counters, whatever we 
add now will be just symbolic since you're going to change it anyway. If 
reviewers think that Qiaobin's patch must add the update line, could you 
provide the exact line and location so we avoid going to v6 of this patch?


[ ]'s
Michel Machado


Re: [PATCH] tc, bpf: add option to dump bpf verifier as C program fragment

2018-06-20 Thread Stephen Hemminger
On Wed, 20 Jun 2018 00:13:52 +0200
Daniel Borkmann  wrote:

> On 06/18/2018 11:44 PM, David Ahern wrote:
> > On 6/18/18 2:18 PM, Jakub Kicinski wrote:  
> >> On Sun, 17 Jun 2018 08:48:41 +, Ophir Munk wrote:  
> >>> Similar to cbpf used within tcpdump utility with a "-d" option to dump
> >>> the compiled packet-matching code in a human readable form - tc has the
> >>> "verbose" option to dump ebpf verifier output.
> >>> Another useful option of cbpf using tcpdump "-dd" option is to dump
> >>> packet-matching code a C program fragment. Similar to this - this commit
> >>> adds a new tc ebpf option named "code" to dump ebpf verifier as C program
> >>> fragment.
> >>>
> >>> Existing "verbose" option sample output:
> >>>
> >>> Verifier analysis:
> >>> 0: (61) r2 = *(u32 *)(r1 +52)
> >>> 1: (18) r3 = 0xdeadbeef
> >>> 3: (63) *(u32 *)(r10 -4) = r3
> >>> .
> >>> .
> >>> 11: (63) *(u32 *)(r1 +52) = r2
> >>> 12: (18) r0 = 0x
> >>> 14: (95) exit
> >>>
> >>> New "code" option sample output:
> >>>
> >>> /* struct bpf_insn cls_q_code[] = { */
> >>> {0x61,2,1,   52, 0x},
> >>> {0x18,3,0,0, 0xdeadbeef},
> >>> {0x00,0,0,0, 0x},
> >>> .
> >>> .
> >>> {0x63,1,2,   52, 0x},
> >>> {0x18,0,0,0, 0x},
> >>> {0x00,0,0,0, 0x},
> >>> {0x95,0,0,0, 0x},
> >>>
> >>> Signed-off-by: Ophir Munk   
> >>
> >> Hmm... printing C arrays looks like hacky integration with some C
> >> code...  Would you not be better served by simply using libbpf in
> >> whatever is consuming this output?  
> > 
> > I was thinking the same. bpftool would provide options too -- print the
> > above, print in macro encodings and verifier. I gave an example of this
> > side by side dump at netconf 2.1. Does not look like the slides made it
> > online; see attached.  
> 
> +1, I would also doubt that this adds a lot in terms of debuggability
> when you're trying to load an object file with thousands of insns. Better
> way would be to use llvm-objdump on the obj file to get to the annotated
> disassembly, see also example in [0]. A .o to .c converter is wip for
> libbpf/bpftool as presented in [1], which should provide the flexibility
> to embedd an obj file.
> 
> Cheers,
> Daniel
> 
>   [0] http://cilium.readthedocs.io/en/latest/bpf/#llvm
>   [1] 
> http://vger.kernel.org/netconf2018_files/AlexeiStarovoitov_netconf2018.pdf 
> page 22

I am going to not accept this for now. Please respin for iproute2 next if
you think bpftool won't be able to handle this.


Re: [PATCH] tc: fix batch force option

2018-06-20 Thread Stephen Hemminger
On Wed, 20 Jun 2018 10:24:21 +0300
Vlad Buslov  wrote:

> When sending accumulated compound command results an error, check 'force'
> option before exiting. Move return code check after putting batch bufs and
> freeing iovs to prevent memory leak. Break from loop, instead of returning
> error code to allow cleanup at the end of batch function. Don't reset ret
> code on each iteration.
> 
> Fixes: 485d0c6001c4 ("tc: Add batchsize feature for filter and actions")
> Reviewed-by: Roi Dayan 
> Reviewed-by: Chris Mi 
> Signed-off-by: Vlad Buslov 

Applied


[PATCH v2] ucc_geth: Add BQL support

2018-06-20 Thread Joakim Tjernlund
Signed-off-by: Joakim Tjernlund 
---

 v2 - Reoder varibles according to Dave
  Add call to netdev_reset_queue(dev) open/close
 drivers/net/ethernet/freescale/ucc_geth.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth.c 
b/drivers/net/ethernet/freescale/ucc_geth.c
index f77ba9fa257b..e8debbde0a34 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -3096,6 +3096,7 @@ static int ucc_geth_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
ugeth_vdbg("%s: IN", __func__);
 
+   netdev_sent_queue(dev, skb->len);
spin_lock_irqsave(&ugeth->lock, flags);
 
dev->stats.tx_bytes += skb->len;
@@ -3240,6 +3241,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
 {
/* Start from the next BD that should be filled */
struct ucc_geth_private *ugeth = netdev_priv(dev);
+   unsigned int bytes_sent = 0;
+   int howmany = 0;
u8 __iomem *bd; /* BD pointer */
u32 bd_status;
 
@@ -3257,7 +3260,8 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
skb = ugeth->tx_skbuff[txQ][ugeth->skb_dirtytx[txQ]];
if (!skb)
break;
-
+   howmany++;
+   bytes_sent += skb->len;
dev->stats.tx_packets++;
 
dev_consume_skb_any(skb);
@@ -3279,6 +3283,7 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
bd_status = in_be32((u32 __iomem *)bd);
}
ugeth->confBd[txQ] = bd;
+   netdev_completed_queue(dev, howmany, bytes_sent);
return 0;
 }
 
@@ -3479,6 +3484,7 @@ static int ucc_geth_open(struct net_device *dev)
 
phy_start(ugeth->phydev);
napi_enable(&ugeth->napi);
+   netdev_reset_queue(dev);
netif_start_queue(dev);
 
device_set_wakeup_capable(&dev->dev,
@@ -3509,6 +3515,7 @@ static int ucc_geth_close(struct net_device *dev)
free_irq(ugeth->ug_info->uf_info.irq, ugeth->ndev);
 
netif_stop_queue(dev);
+   netdev_reset_queue(dev);
 
return 0;
 }
-- 
2.13.6



Re: [PATCH v4 net-next] net:sched: add action inheritdsfield to skbedit

2018-06-20 Thread Davide Caratti
On Tue, 2018-06-12 at 15:42 +, Fu, Qiaobin wrote:
> The new action inheritdsfield copies the field DS of
> IPv4 and IPv6 packets into skb->priority. This enables
> later classification of packets based on the DS field.
> 
> v4:
> *Not allow setting flags other than the expected ones.
> 
> *Allow dumping the pure flags.
> 
> Original idea by Jamal Hadi Salim 
> 
> Signed-off-by: Qiaobin Fu 
> Reviewed-by: Michel Machado 
> ---
> 

[...]

> @@ -41,6 +44,25 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
> tc_action *a,
>  
>   if (d->flags & SKBEDIT_F_PRIORITY)
>   skb->priority = d->priority;
> + if (d->flags & SKBEDIT_F_INHERITDSFIELD) {
> + int wlen = skb_network_offset(skb);
> +
> + switch (tc_skb_protocol(skb)) {
> + case htons(ETH_P_IP):
> + wlen += sizeof(struct iphdr);
> + if (!pskb_may_pull(skb, wlen))
> + goto err;
> + skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
> + break;
> +
> + case htons(ETH_P_IPV6):
> + wlen += sizeof(struct ipv6hdr);
> + if (!pskb_may_pull(skb, wlen))
> + goto err;
> + skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
> + break;
> + }
> + }
>   if (d->flags & SKBEDIT_F_QUEUE_MAPPING &&
>   skb->dev->real_num_tx_queues > d->queue_mapping)
>   skb_set_queue_mapping(skb, d->queue_mapping);
> @@ -53,6 +75,10 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
> tc_action *a,
>  
>   spin_unlock(&d->tcf_lock);
>   return d->tcf_action;
> +
> +err:
> + spin_unlock(&d->tcf_lock);
> + return TC_ACT_SHOT;
>  }
>  

sorry for asking this when the patch is a v4... 

I spotted this, as I'm rebasing a small series that removes the tcf_lock
from the data plane of skbedit to gain some speed, and it converts the
stats to be per-cpu counters.

in the code above, you are catching failures of pskb_may_pull(skb) and
then you return TC_ACT_SHOT. That's OK, but I think you should update the
drop counter, like other TC actions do.

If you (author / reviewers) think this is a minor issue, like I do think,
then I can add the missing update in my series and post it when net-next
reopens.

WDYT?

thank you in advance!
regards,
-- 
davide
  



[PATCH net] sctp: fix erroneous inc of snmp SctpFragUsrMsgs

2018-06-20 Thread Marcelo Ricardo Leitner
Currently it is incrementing SctpFragUsrMsgs when the user message size
is of the exactly same size as the maximum fragment size, which is wrong.

The fix is to increment it only when user message is bigger than the
maximum fragment size.

Fixes: bfd2e4b8734d ("sctp: refactor sctp_datamsg_from_user")
Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/chunk.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index 
79daa98208c391c780440144d69bc7be875c3476..bfb9f812e2ef9fa605b08dc1f534781573c3abf8
 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -237,7 +237,9 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct 
sctp_association *asoc,
/* Account for a different sized first fragment */
if (msg_len >= first_len) {
msg->can_delay = 0;
-   SCTP_INC_STATS(sock_net(asoc->base.sk), SCTP_MIB_FRAGUSRMSGS);
+   if (msg_len > first_len)
+   SCTP_INC_STATS(sock_net(asoc->base.sk),
+  SCTP_MIB_FRAGUSRMSGS);
} else {
/* Which may be the only one... */
first_len = msg_len;
-- 
2.14.4



Re: Route fallback issue

2018-06-20 Thread Grant Taylor

On 06/20/2018 09:18 AM, Grant Taylor wrote:
Where can I find more information on ignore_routes_with_linkdown?  I 
don't see it listed in $Kernel/Documentation/networking/ip-sysctl.txt. 
(I do see fib_multipath_use_neigh documented there in.)


I'm specifically interested in if ignore_routes_with_linkdown and / or 
fib_multipath_use_neigh will cause Linux to fall back to an alternate 
(higher metric) route if the link is still up but the neighbor is not 
accessible across it.


  +---+
  | Linux |
  +---+---+
  |
+-+   +---++   +-+
| R 1 +---+ Switch +---+ R 2 |
+-+   ++   +-+

A typical scenario is where Linux is connected to a DSL or Cable modem 
where the physical link stays up even if the neighbor R {1,2} goes 
offline.  It's not possible to rely on the local link (MII) status to 
determine that a neighbor is not reachable.  I.e. R 1 going away like below.


  +---+
  | Linux |
  +---+---+
  |
+-+   +---++   +-+
| R 1 X   X Switch +---+ R 2 |
+-+   ++   +-+



--
Grant. . . .
unix || die



smime.p7s
Description: S/MIME Cryptographic Signature


Re: Incomplete fix for be9c798 / 7b2ee50 hv_netvsc: common detach logic?

2018-06-20 Thread Thomas Walker
On Tue, Jun 19, 2018 at 04:14:40PM -0700, Stephen Hemminger wrote:
> On Wed, 20 Jun 2018 08:01:50 +0900
> Greg KH  wrote:
> 
> > On Tue, Jun 19, 2018 at 03:19:41PM -0400, Thomas Walker wrote:
> > > Upon updating some internal kernels from 4.14.18 to 4.14.49, our hyper-v 
> > > guests failed to bring up network interfaces on boot, logging "A link 
> > > change request failed with some changes committed already. Interface eth0 
> > > may have been left with an inconsistent configuration, please check."  
> > > Running 'ifconfig eth0 up' appears to fix the problem temporarily so I 
> > > went about bisecting which landed on:
> > > 
> > > commit be9c798d0d13ae609a91177323ac816545c39d28
> > > Author: Stephen Hemminger 
> > > Date:   Mon May 14 15:32:18 2018 -0700
> > > 
> > > hv_netvsc: common detach logic
> > > 
> > > [ Commit 7b2ee50c0cd513a176a26a71f2989facdd75bfea upstream. ]
> > > 
> > > Make common function for detaching internals of device
> > > during changes to MTU and RSS. Make sure no more packets
> > > are transmitted and all packets have been received before
> > > doing device teardown.
> > > 
> > > Change the wait logic to be common and use usleep_range().
> > > 
> > > Changes transmit enabling logic so that transmit queues are disabled
> > > during the period when lower device is being changed. And enabled
> > > only after sub channels are setup. This avoids issue where it could
> > > be that a packet was being sent while subchannel was not initialized.
> > > 
> > > Fixes: 8195b1396ec8 ("hv_netvsc: fix deadlock on hotplug")
> > > Signed-off-by: Stephen Hemminger 
> > > Signed-off-by: David S. Miller 
> > > Signed-off-by: Greg Kroah-Hartman 
> > > 
> > > 
> > > The removal of which does indeed fix the problem.  That led me to:
> > > 
> > > commit 52acf73b6e9a6962045feb2ba5a8921da2201915
> > > Author: Dexuan Cui 
> > > Date:   Wed Jun 6 21:32:51 2018 +
> > > 
> > > hv_netvsc: Fix a network regression after ifdown/ifup
> > > 
> > > Recently people reported the NIC stops working after
> > > "ifdown eth0; ifup eth0". It turns out in this case the TX queues are 
> > > not
> > > enabled, after the refactoring of the common detach logic: when the 
> > > NIC
> > > has sub-channels, usually we enable all the TX queues after all
> > > sub-channels are set up: see rndis_set_subchannel() ->
> > > netif_device_attach(), but in the case of "ifdown eth0; ifup eth0" 
> > > where
> > > the number of channels doesn't change, we also must make sure the TX 
> > > queues
> > > are enabled. The patch fixes the regression.
> > > 
> > > Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> > > Signed-off-by: Dexuan Cui 
> > > Cc: Stephen Hemminger 
> > > Cc: K. Y. Srinivasan 
> > > Cc: Haiyang Zhang 
> > > Signed-off-by: David S. Miller 
> > > 
> > > Which sounded very promising, but does not seem to fully fix the issue.  
> > > Doing some more digging, I was able to determine that the message 
> > > coincides with 'ip link set dev eth0 mtu 1300 up' very shortly (>~1 
> > > second) after the hv_netvsc driver loads.  If I delay the mtu change 
> > > until well after the driver loads, everything works fine.  If I unload 
> > > hv_netvsc and then reload it and apply the mtu change immediately, the 
> > > failure re-occurs.  So something is racy here, and the above doesn't 
> > > entirely address it.
> > > 
> > > I'm happy to test out any suggested patches and/or do additional 
> > > debugging if anyone has any suggestions.
> > > 
> > > (oh, and I did also try 4.18-rc1 and the problem still persists)  
> > 
> > Always cc: the authors of the patch you are having problems with, along
> > with the mailing list for the networking subsystem, so that the right
> > people know to look at this.
> 
> How are you changing the MTU? and starting the device.
> When MTU changes the device has to reconnect with the host. This takes a 
> small amount of time
> and no changes to device state are possible then.
> 
> If MTU change happens and at the same time some other script tries to bring 
> up the connection,
> then that script will get an error.

Simply setting 'mtu 1300' in /etc/network/interfaces (Debian userland) which 
results in ifup executing:

ip addr add 192.168.1.100/255.255.255.0 broadcast 192.168.1.255 dev eth0 label 
eth0
ip link set dev eth0 mtu 1300 up   
ip route add default via 192.168.1.1 dev eth0   

Running those commands standalone (outside of ifup) reproduces the problem as 
well.  Trying lots of different permutations of loading the driver and 
different commands with different delays, my initial suspicion that this was 
somehow related to proximity to the driver load appears incorrect.  The only 
thing that seems to matter is setting the mtu and link state in one go.  If I 
separate the 'ip link' command 

Re: Route fallback issue

2018-06-20 Thread Grant Taylor

On 06/20/2018 07:48 AM, David Ahern wrote:
See the ignore_routes_with_linkdown and fib_multipath_use_neigh sysctl 
settings.


Where can I find more information on ignore_routes_with_linkdown?  I 
don't see it listed in $Kernel/Documentation/networking/ip-sysctl.txt. 
(I do see fib_multipath_use_neigh documented there in.)



A feature is in the works to have fallback nexthops.


O.o?



--
Grant. . . .
unix || die



smime.p7s
Description: S/MIME Cryptographic Signature


Re: [PATCH v4 net-next] net:sched: add action inheritdsfield to skbedit

2018-06-20 Thread Marcelo Ricardo Leitner
On Tue, Jun 12, 2018 at 03:42:55PM +, Fu, Qiaobin wrote:
> The new action inheritdsfield copies the field DS of
> IPv4 and IPv6 packets into skb->priority. This enables
> later classification of packets based on the DS field.
> 
> v4:
> *Not allow setting flags other than the expected ones.
> 
> *Allow dumping the pure flags.
> 
> Original idea by Jamal Hadi Salim 
> 
> Signed-off-by: Qiaobin Fu 
> Reviewed-by: Michel Machado 

Reviewed-by: Marcelo Ricardo Leitner 

> ---
> 
> Note that the motivation for this patch is found in the following discussion:
> https://www.spinics.net/lists/netdev/msg501061.html
> ---
> diff --git a/include/uapi/linux/tc_act/tc_skbedit.h 
> b/include/uapi/linux/tc_act/tc_skbedit.h
> index fbcfe27a4e6c..6de6071ebed6 100644
> --- a/include/uapi/linux/tc_act/tc_skbedit.h
> +++ b/include/uapi/linux/tc_act/tc_skbedit.h
> @@ -30,6 +30,7 @@
>  #define SKBEDIT_F_MARK   0x4
>  #define SKBEDIT_F_PTYPE  0x8
>  #define SKBEDIT_F_MASK   0x10
> +#define SKBEDIT_F_INHERITDSFIELD 0x20
>  
>  struct tc_skbedit {
>   tc_gen;
> @@ -45,6 +46,7 @@ enum {
>   TCA_SKBEDIT_PAD,
>   TCA_SKBEDIT_PTYPE,
>   TCA_SKBEDIT_MASK,
> + TCA_SKBEDIT_FLAGS,
>   __TCA_SKBEDIT_MAX
>  };
>  #define TCA_SKBEDIT_MAX (__TCA_SKBEDIT_MAX - 1)
> diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
> index 6138d1d71900..9adbcfa3f5fe 100644
> --- a/net/sched/act_skbedit.c
> +++ b/net/sched/act_skbedit.c
> @@ -23,6 +23,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
>  
>  #include 
>  #include 
> @@ -41,6 +44,25 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
> tc_action *a,
>  
>   if (d->flags & SKBEDIT_F_PRIORITY)
>   skb->priority = d->priority;
> + if (d->flags & SKBEDIT_F_INHERITDSFIELD) {
> + int wlen = skb_network_offset(skb);
> +
> + switch (tc_skb_protocol(skb)) {
> + case htons(ETH_P_IP):
> + wlen += sizeof(struct iphdr);
> + if (!pskb_may_pull(skb, wlen))
> + goto err;
> + skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
> + break;
> +
> + case htons(ETH_P_IPV6):
> + wlen += sizeof(struct ipv6hdr);
> + if (!pskb_may_pull(skb, wlen))
> + goto err;
> + skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
> + break;
> + }
> + }
>   if (d->flags & SKBEDIT_F_QUEUE_MAPPING &&
>   skb->dev->real_num_tx_queues > d->queue_mapping)
>   skb_set_queue_mapping(skb, d->queue_mapping);
> @@ -53,6 +75,10 @@ static int tcf_skbedit(struct sk_buff *skb, const struct 
> tc_action *a,
>  
>   spin_unlock(&d->tcf_lock);
>   return d->tcf_action;
> +
> +err:
> + spin_unlock(&d->tcf_lock);
> + return TC_ACT_SHOT;
>  }
>  
>  static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = {
> @@ -62,6 +88,7 @@ static const struct nla_policy 
> skbedit_policy[TCA_SKBEDIT_MAX + 1] = {
>   [TCA_SKBEDIT_MARK]  = { .len = sizeof(u32) },
>   [TCA_SKBEDIT_PTYPE] = { .len = sizeof(u16) },
>   [TCA_SKBEDIT_MASK]  = { .len = sizeof(u32) },
> + [TCA_SKBEDIT_FLAGS] = { .len = sizeof(u64) },
>  };
>  
>  static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
> @@ -73,6 +100,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr 
> *nla,
>   struct tc_skbedit *parm;
>   struct tcf_skbedit *d;
>   u32 flags = 0, *priority = NULL, *mark = NULL, *mask = NULL;
> + u64 *pure_flags = NULL;
>   u16 *queue_mapping = NULL, *ptype = NULL;
>   bool exists = false;
>   int ret = 0, err;
> @@ -114,6 +142,12 @@ static int tcf_skbedit_init(struct net *net, struct 
> nlattr *nla,
>   mask = nla_data(tb[TCA_SKBEDIT_MASK]);
>   }
>  
> + if (tb[TCA_SKBEDIT_FLAGS] != NULL) {
> + pure_flags = nla_data(tb[TCA_SKBEDIT_FLAGS]);
> + if (*pure_flags & SKBEDIT_F_INHERITDSFIELD)
> + flags |= SKBEDIT_F_INHERITDSFIELD;
> + }
> +
>   parm = nla_data(tb[TCA_SKBEDIT_PARMS]);
>  
>   exists = tcf_idr_check(tn, parm->index, a, bind);
> @@ -178,6 +212,7 @@ static int tcf_skbedit_dump(struct sk_buff *skb, struct 
> tc_action *a,
>   .action  = d->tcf_action,
>   };
>   struct tcf_t t;
> + u64 pure_flags = 0;
>  
>   if (nla_put(skb, TCA_SKBEDIT_PARMS, sizeof(opt), &opt))
>   goto nla_put_failure;
> @@ -196,6 +231,11 @@ static int tcf_skbedit_dump(struct sk_buff *skb, struct 
> tc_action *a,
>   if ((d->flags & SKBEDIT_F_MASK) &&
>   nla_put_u32(skb, TCA_SKBEDIT_MASK, d->mask))
>   goto nla_put_failure;
> + if (d->flags & SKBEDIT_F_INHERITDSFIELD)
>

Re: [PATCH] bpfilter: fix user mode helper cross compilation

2018-06-20 Thread Stefano Brivio
On Wed, 20 Jun 2018 16:04:34 +0200
Matteo Croce  wrote:

> Use $(OBJDUMP) instead of literal 'objdump' to avoid
> using host toolchain when cross compiling.
> 
> Fixes: 421780fd4983 ("bpfilter: fix build error")
> Signed-off-by: Matteo Croce 

Reported-by: Stefano Brivio 

-- 
Stefano


Re: [PATCH] bpfilter: fix build error

2018-06-20 Thread Matteo Croce
On Wed, Jun 20, 2018 at 12:39 PM Stefano Brivio  wrote:
>
> On Tue, 19 Jun 2018 17:16:20 +0200
> Matteo Croce  wrote:
>
> > bpfilter Makefile assumes that the system locale is en_US, and the
> > parsing of objdump output fails.
> > Set LC_ALL=C and, while at it, rewrite the objdump parsing so it spawns
> > only 2 processes instead of 7.
> >
> > Fixes: d2ba09c17a064 ("net: add skeleton of bpfilter kernel module")
> > Signed-off-by: Matteo Croce 
> > ---
> >  net/bpfilter/Makefile | 6 --
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
> > index e0bbe7583e58..dd86b022eff0 100644
> > --- a/net/bpfilter/Makefile
> > +++ b/net/bpfilter/Makefile
> > @@ -21,8 +21,10 @@ endif
> >  # which bpfilter_kern.c passes further into umh blob loader at run-time
> >  quiet_cmd_copy_umh = GEN $@
> >cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
> > -  $(OBJCOPY) -I binary -O `$(OBJDUMP) -f $<|grep format|cut -d' ' -f8` 
> > \
> > -  -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
> > +  $(OBJCOPY) -I binary \
> > +  `LC_ALL=C objdump -f net/bpfilter/bpfilter_umh \
>
> Why do you use objdump instead of $(OBJDUMP) now? I guess this might
> cause issues if you're cross-compiling.
>
> --
> Stefano

Right, I've sent a proper fix.

Thanks,
-- 
Matteo Croce
per aspera ad upstream


[PATCH] bpfilter: fix user mode helper cross compilation

2018-06-20 Thread Matteo Croce
Use $(OBJDUMP) instead of literal 'objdump' to avoid
using host toolchain when cross compiling.

Fixes: 421780fd4983 ("bpfilter: fix build error")
Signed-off-by: Matteo Croce 
---
 net/bpfilter/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index dd86b022eff0..051dc18b8ccb 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -22,7 +22,7 @@ endif
 quiet_cmd_copy_umh = GEN $@
   cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
   $(OBJCOPY) -I binary \
-  `LC_ALL=C objdump -f net/bpfilter/bpfilter_umh \
+  `LC_ALL=C $(OBJDUMP) -f net/bpfilter/bpfilter_umh \
   |awk -F' |,' '/file format/{print "-O",$$NF} \
   /^architecture:/{print "-B",$$2}'` \
   --rename-section .data=.init.rodata $< $@
-- 
2.17.1



Re: [RFC v2, net-next, PATCH 4/4] net/cpsw_switchdev: add switchdev mode of operation on cpsw driver

2018-06-20 Thread Ivan Vecera
On 20.6.2018 14:59, Ilias Apalodimas wrote:
> On Wed, Jun 20, 2018 at 02:53:47PM +0200, Ivan Vecera wrote:
>> On 20.6.2018 09:08, Jiri Pirko wrote:
>>> Tue, Jun 19, 2018 at 01:19:00AM CEST, grygorii.stras...@ti.com wrote:


 On 06/14/2018 06:43 AM, Ilias Apalodimas wrote:
> On Thu, Jun 14, 2018 at 01:39:58PM +0200, Jiri Pirko wrote:
>> Thu, Jun 14, 2018 at 01:34:04PM CEST, ilias.apalodi...@linaro.org wrote:
>>> On Thu, Jun 14, 2018 at 01:30:28PM +0200, Jiri Pirko wrote:
 Thu, Jun 14, 2018 at 01:11:30PM CEST, ilias.apalodi...@linaro.org 
 wrote:

 [...]

> @@ -2711,6 +2789,10 @@ static int cpsw_probe_dt(struct 
> cpsw_platform_data *data,
>   if (of_property_read_bool(node, "dual_emac"))
>   data->switch_mode = CPSW_DUAL_EMAC;
>
> + /* switchdev overrides DTS */
> + if (IS_ENABLED(CONFIG_TI_CPSW_SWITCHDEV))
> + data->switch_mode = CPSW_SWITCHDEV;

 So you force CPSW_SWITCHDEV mode if the CONFIG_TI_CPSW_SWITCHDEV is
 enabled. That does not sound right. I think that user should tell what
 mode does he want regardless what the kernel config is.
>>> We discussed this during the V1 of the RFC. Yes it doesn't seem good, 
>>> but the
>>> device currently configures the modes using DTS (which is not correct). 
>>> I choose
>>> the .config due to that. I can't think of anything better, but i am 
>>> open to
>>> suggestions
>>
>> Agreed that DTS does fit as well. I think that this might be a job for
>> devlink parameters (patchset is going to be sent upstream next week).
>> You do have 1 bus address for the whole device (both ports), right?
>>
> Yes devlink sounds reasonable. I thyink there's only one bus for it, but 
> then
> again i am far from an expert on the hardware interrnals. Grygorii can 
> correct
> me if i am wrong.

 Devlink and NFS boot are not compatible as per my understanding, so .. 
>>>
>>> ? Why do you say so?
>>
>> Why aren't they compatible?? You can have devlink binary in initramfs and
>> configure the ASIC and ports via devlink batch file - prior mounting NFS 
>> root.
> This is doable but, are we taking into consideration device reconfiguration 
> (which was the reason for creating the minimal filesystem) or are we just 
> talking about device booting/initial config?
Devlink calls should be placed in setup.sh

I.



Re: Route fallback issue

2018-06-20 Thread David Ahern
On 6/20/18 2:26 AM, Akshat Kakkar wrote:
> Hi netdev community,
> 
> I have 2 interfaces
> eno1 : 192.168.1.10/24
> eno2 : 192.168.2.10/24
> 
> I added routes as
> 172.16.0.0/12 via 192.168.1.254 metric 1
> 172.16.0.0/12 via 192.168.2.254 metric 2
> 
> My intention : All traffic to 172.16.0.0/12 should go thru eno1. If
> 192.168.1.254 is not reachable (no arp entry or link down), then it
> should fall back to eno2.

See the ignore_routes_with_linkdown and fib_multipath_use_neigh sysctl
settings.


> On Wed, Jun 20, 2018 at 1:49 PM, Erik Auerswald
>  wrote:
>> Hi,
>>
>> I have usually used the "replace" keyword of iproute2 for similar
>> purposes. I would suggest a script as well, run via cron unless 1 minute
>> failover times are not acceptable. The logic could be as follows:
>>
>> if ping -c1 $PRIMARY_NH >/dev/null 2>&1; then
>>   ip route replace $PREFIX via $PRIMARY_NH
>> elif ping -c1 $SECONDARY_NH >/dev/null 2>&1; then
>>   ip route replace $PREFIX via $SECONDARY_NH
>> else
>>   ip route del $PREFIX
>> fi
>>
>> Alternatively, one could look into a routing daemon that supports static
>> routing (Zebra/Quagga/FRRouting, BIRD, ...) and check if that supports
>> some form of next-hop tracking or at least removes static routes with
>> unreachable next-hops as one would expect from experience with dedicated
>> networking devices.

A feature is in the works to have fallback nexthops.


>>
>> IMHO static route handling as done by the Linux kernel does not seem
>> useful for networking devices. I have even had bad experiences with
>> Arista switches and static routing because they relied too much on the
>> Linux kernel (probably still do).

Useful how? what did not work as expected?

Do not confuse Arista's NOS with Linux's capabilities or any NOS truly
based on Linux and using a modern kernel. A lot of work has been put
into bringing Linux up to par with NOS features. If something is not
working, demonstrate the problem on the latest kernel and inquire if
someone is working on it.




Re: [PATCH net 1/5] net sched actions: fix coding style in pedit action

2018-06-20 Thread Roman Mashak
Davide Caratti  writes:

> On Tue, 2018-06-19 at 12:56 -0400, Roman Mashak wrote:
>> Fix coding style issues in tc pedit action detected by the
>> checkpatch script.
>> 
>> Signed-off-by: Roman Mashak 
> ...
>
>> ---
>> @@ -316,16 +318,15 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
>> tc_action *a,
>>  hoffset + tkey->at);
>>  goto bad;
>>  }
>> -d = skb_header_pointer(skb, hoffset + tkey->at, 
>> 1,
>> -   &_d);
>> +d = skb_header_pointer(skb, hoffset + tkey->at,
>> +   1, &_d);
>>  if (!d)
>>  goto bad;
>>  offset += (*d & tkey->offmask) >> tkey->shift;
>>  }
>
> hello Roman,
>
> nit: while we are here, what about changing the declaration of _d and *d
> to u8, so that the bitwise operation is done on unsigned?

Yes makes sense, I will send v2 in net-next once opened. Thanks Davide.

> BTW: the patch (and the series) looks ok, but I guess it will better
> target net-next when the branch reopens


Re: [RFC v2, net-next, PATCH 4/4] net/cpsw_switchdev: add switchdev mode of operation on cpsw driver

2018-06-20 Thread Ilias Apalodimas
On Wed, Jun 20, 2018 at 02:53:47PM +0200, Ivan Vecera wrote:
> On 20.6.2018 09:08, Jiri Pirko wrote:
> > Tue, Jun 19, 2018 at 01:19:00AM CEST, grygorii.stras...@ti.com wrote:
> >>
> >>
> >> On 06/14/2018 06:43 AM, Ilias Apalodimas wrote:
> >>> On Thu, Jun 14, 2018 at 01:39:58PM +0200, Jiri Pirko wrote:
>  Thu, Jun 14, 2018 at 01:34:04PM CEST, ilias.apalodi...@linaro.org wrote:
> > On Thu, Jun 14, 2018 at 01:30:28PM +0200, Jiri Pirko wrote:
> >> Thu, Jun 14, 2018 at 01:11:30PM CEST, ilias.apalodi...@linaro.org 
> >> wrote:
> >>
> >> [...]
> >>
> >>> @@ -2711,6 +2789,10 @@ static int cpsw_probe_dt(struct 
> >>> cpsw_platform_data *data,
> >>>   if (of_property_read_bool(node, "dual_emac"))
> >>>   data->switch_mode = CPSW_DUAL_EMAC;
> >>>
> >>> + /* switchdev overrides DTS */
> >>> + if (IS_ENABLED(CONFIG_TI_CPSW_SWITCHDEV))
> >>> + data->switch_mode = CPSW_SWITCHDEV;
> >>
> >> So you force CPSW_SWITCHDEV mode if the CONFIG_TI_CPSW_SWITCHDEV is
> >> enabled. That does not sound right. I think that user should tell what
> >> mode does he want regardless what the kernel config is.
> > We discussed this during the V1 of the RFC. Yes it doesn't seem good, 
> > but the
> > device currently configures the modes using DTS (which is not correct). 
> > I choose
> > the .config due to that. I can't think of anything better, but i am 
> > open to
> > suggestions
> 
>  Agreed that DTS does fit as well. I think that this might be a job for
>  devlink parameters (patchset is going to be sent upstream next week).
>  You do have 1 bus address for the whole device (both ports), right?
> 
> >>> Yes devlink sounds reasonable. I thyink there's only one bus for it, but 
> >>> then
> >>> again i am far from an expert on the hardware interrnals. Grygorii can 
> >>> correct
> >>> me if i am wrong.
> >>
> >> Devlink and NFS boot are not compatible as per my understanding, so .. 
> > 
> > ? Why do you say so?
> 
> Why aren't they compatible?? You can have devlink binary in initramfs and
> configure the ASIC and ports via devlink batch file - prior mounting NFS root.
This is doable but, are we taking into consideration device reconfiguration 
(which was the reason for creating the minimal filesystem) or are we just 
talking about device booting/initial config?
> 
> >>
> >> Again, current driver, as is, supports NFS boot in all modes
> >> (how good is the cur driver is question which out of scope of this 
> >> discussion).
> >>
> >> And we discussed this in RFC v1 - driver mode selection will be changed 
> >> if we will proceed and it will be new driver for proper switch support.
> >>
> >> Not sure about about Devlink (need to study it and we never got any 
> >> requests from end
> >> users for this as I know), and I'd like to note (again) that this is 
> >> embedded 
> >> (industrial/automotive etc), so everything preferred to be simple, fast and
> >> preferably configured statically (in most of the cases end users what boot 
> >> time 
> >> configuration).
> > 
> > You need to study it indeed.
> > 
> >>
> >> -- 
> >> regards,
> >> -grygorii
> 

Regards
Ilias


Re: [RFC v2, net-next, PATCH 4/4] net/cpsw_switchdev: add switchdev mode of operation on cpsw driver

2018-06-20 Thread Ivan Vecera
On 18.6.2018 22:19, Ilias Apalodimas wrote:
> Jiri proposed using devlink, which makes sense, but i am not sure it's
> applicable on this patchset. This will change the driver completely and will
> totally break backwards compatibility.

Another good reason for a new driver.

I.


Re: [RFC v2, net-next, PATCH 4/4] net/cpsw_switchdev: add switchdev mode of operation on cpsw driver

2018-06-20 Thread Ivan Vecera
On 20.6.2018 09:08, Jiri Pirko wrote:
> Tue, Jun 19, 2018 at 01:19:00AM CEST, grygorii.stras...@ti.com wrote:
>>
>>
>> On 06/14/2018 06:43 AM, Ilias Apalodimas wrote:
>>> On Thu, Jun 14, 2018 at 01:39:58PM +0200, Jiri Pirko wrote:
 Thu, Jun 14, 2018 at 01:34:04PM CEST, ilias.apalodi...@linaro.org wrote:
> On Thu, Jun 14, 2018 at 01:30:28PM +0200, Jiri Pirko wrote:
>> Thu, Jun 14, 2018 at 01:11:30PM CEST, ilias.apalodi...@linaro.org wrote:
>>
>> [...]
>>
>>> @@ -2711,6 +2789,10 @@ static int cpsw_probe_dt(struct 
>>> cpsw_platform_data *data,
>>> if (of_property_read_bool(node, "dual_emac"))
>>> data->switch_mode = CPSW_DUAL_EMAC;
>>>
>>> +   /* switchdev overrides DTS */
>>> +   if (IS_ENABLED(CONFIG_TI_CPSW_SWITCHDEV))
>>> +   data->switch_mode = CPSW_SWITCHDEV;
>>
>> So you force CPSW_SWITCHDEV mode if the CONFIG_TI_CPSW_SWITCHDEV is
>> enabled. That does not sound right. I think that user should tell what
>> mode does he want regardless what the kernel config is.
> We discussed this during the V1 of the RFC. Yes it doesn't seem good, but 
> the
> device currently configures the modes using DTS (which is not correct). I 
> choose
> the .config due to that. I can't think of anything better, but i am open 
> to
> suggestions

 Agreed that DTS does fit as well. I think that this might be a job for
 devlink parameters (patchset is going to be sent upstream next week).
 You do have 1 bus address for the whole device (both ports), right?

>>> Yes devlink sounds reasonable. I thyink there's only one bus for it, but 
>>> then
>>> again i am far from an expert on the hardware interrnals. Grygorii can 
>>> correct
>>> me if i am wrong.
>>
>> Devlink and NFS boot are not compatible as per my understanding, so .. 
> 
> ? Why do you say so?

Why aren't they compatible?? You can have devlink binary in initramfs and
configure the ASIC and ports via devlink batch file - prior mounting NFS root.

>>
>> Again, current driver, as is, supports NFS boot in all modes
>> (how good is the cur driver is question which out of scope of this 
>> discussion).
>>
>> And we discussed this in RFC v1 - driver mode selection will be changed 
>> if we will proceed and it will be new driver for proper switch support.
>>
>> Not sure about about Devlink (need to study it and we never got any requests 
>> from end
>> users for this as I know), and I'd like to note (again) that this is 
>> embedded 
>> (industrial/automotive etc), so everything preferred to be simple, fast and
>> preferably configured statically (in most of the cases end users what boot 
>> time 
>> configuration).
> 
> You need to study it indeed.
> 
>>
>> -- 
>> regards,
>> -grygorii



Re: [PATCH net] net: sungem: fix rx checksum support

2018-06-20 Thread Eric Dumazet



On 06/20/2018 04:22 AM, Mathieu Malaterre wrote:
 
> Thanks for the stable tag.
> 
> IMHO the commit message should have also reference commit 7ce5a27f2ef8
> ("Revert "net: Handle CHECKSUM_COMPLETE more adequately in
> pskb_trim_rcsum().""). Which means that commit 88078d98d1bb ("net:
> pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends") should be ready
> for stable also, since it seems that the only driver responsible for
> the revert was sungem.
> 
> my 2cts
> 

I disagree.

commit 88078d98d1bb is a performance improvement, and not needed for stable.

We might also discover another buggy driver later.

Thanks.


Re: [PATCH v4 net-next] net:sched: add action inheritdsfield to skbedit

2018-06-20 Thread Michel Machado

On 06/20/2018 07:53 AM, Jamal Hadi Salim wrote:

On 19/06/18 08:39 AM, Michel Machado wrote:

 Notice that, different from skbmod, there's no field parm->flags in 
skbedit. Skbedit infers the flags in d->flags from the presence of the 
parameters of each of its actions. But SKBEDIT_F_INHERITDSFIELD has no 
parameter and adding field parm->flags breaks backward compatibility 
with user space as pointed out by Marcelo Ricardo Leitner. Our 
solution was to add TCA_SKBEDIT_FLAGS, so SKBEDIT_F_INHERITDSFIELD and 
future flag-only actions can be added.


Ok, that makes sense - thanks. I am not so sure about using
64 bits (32 bits would have been fine to match the size of
the kernel flags), but other than that LGTM.


   The choice for the u64 is meant to keep the interface between kernel 
and user space the same for hopefully a longer time than it would be 
with a u32. Changing from u32 to u64 in the kernel, when the need 
arrives, won't impact applications. This interface choice was motivated 
by the backward compatibility issue mentioned above.


   Thank you for the review, Jamal.

[ ]'s
Michel Machado


Re: [PATCH] bpfilter: fix build error

2018-06-20 Thread Stefano Brivio
On Tue, 19 Jun 2018 17:16:20 +0200
Matteo Croce  wrote:

> bpfilter Makefile assumes that the system locale is en_US, and the
> parsing of objdump output fails.
> Set LC_ALL=C and, while at it, rewrite the objdump parsing so it spawns
> only 2 processes instead of 7.
> 
> Fixes: d2ba09c17a064 ("net: add skeleton of bpfilter kernel module")
> Signed-off-by: Matteo Croce 
> ---
>  net/bpfilter/Makefile | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
> index e0bbe7583e58..dd86b022eff0 100644
> --- a/net/bpfilter/Makefile
> +++ b/net/bpfilter/Makefile
> @@ -21,8 +21,10 @@ endif
>  # which bpfilter_kern.c passes further into umh blob loader at run-time
>  quiet_cmd_copy_umh = GEN $@
>cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
> -  $(OBJCOPY) -I binary -O `$(OBJDUMP) -f $<|grep format|cut -d' ' -f8` \
> -  -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
> +  $(OBJCOPY) -I binary \
> +  `LC_ALL=C objdump -f net/bpfilter/bpfilter_umh \

Why do you use objdump instead of $(OBJDUMP) now? I guess this might
cause issues if you're cross-compiling.

-- 
Stefano


Re: [PATCH net 1/5] net sched actions: fix coding style in pedit action

2018-06-20 Thread Davide Caratti
On Tue, 2018-06-19 at 12:56 -0400, Roman Mashak wrote:
> Fix coding style issues in tc pedit action detected by the
> checkpatch script.
> 
> Signed-off-by: Roman Mashak 
...

> ---
> @@ -316,16 +318,15 @@ static int tcf_pedit(struct sk_buff *skb, const struct 
> tc_action *a,
>   hoffset + tkey->at);
>   goto bad;
>   }
> - d = skb_header_pointer(skb, hoffset + tkey->at, 
> 1,
> -&_d);
> + d = skb_header_pointer(skb, hoffset + tkey->at,
> +1, &_d);
>   if (!d)
>   goto bad;
>   offset += (*d & tkey->offmask) >> tkey->shift;
>   }

hello Roman,

nit: while we are here, what about changing the declaration of _d and *d
to u8, so that the bitwise operation is done on unsigned?

BTW: the patch (and the series) looks ok, but I guess it will better
target net-next when the branch reopens

thanks!
-- 
davide

 






[PATCH] r8169: Fix netpoll oops

2018-06-20 Thread Ville Syrjala
From: Ville Syrjälä 

Pass the correct thing to rtl8169_interrupt() from netpoll.

Cc: Realtek linux nic maintainers 
Cc: netdev@vger.kernel.org
Cc: Heiner Kallweit 
Cc: David S. Miller 
Fixes: ebcd5daa7ffd ("r8169: change interrupt handler argument type")
Signed-off-by: Ville Syrjälä 
---
 drivers/net/ethernet/realtek/r8169.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 75dfac0248f4..f4cae2be0fda 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -7148,7 +7148,7 @@ static void rtl8169_netpoll(struct net_device *dev)
 {
struct rtl8169_private *tp = netdev_priv(dev);
 
-   rtl8169_interrupt(pci_irq_vector(tp->pci_dev, 0), dev);
+   rtl8169_interrupt(pci_irq_vector(tp->pci_dev, 0), tp);
 }
 #endif
 
-- 
2.16.4



Re: [PATCH v4 net-next] net:sched: add action inheritdsfield to skbedit

2018-06-20 Thread Jamal Hadi Salim

On 19/06/18 08:39 AM, Michel Machado wrote:

 Notice that, different from skbmod, there's no field parm->flags in 
skbedit. Skbedit infers the flags in d->flags from the presence of the 
parameters of each of its actions. But SKBEDIT_F_INHERITDSFIELD has no 
parameter and adding field parm->flags breaks backward compatibility 
with user space as pointed out by Marcelo Ricardo Leitner. Our solution 
was to add TCA_SKBEDIT_FLAGS, so SKBEDIT_F_INHERITDSFIELD and future 
flag-only actions can be added.


Ok, that makes sense - thanks. I am not so sure about using
64 bits (32 bits would have been fine to match the size of
the kernel flags), but other than that LGTM.

Acked-by: Jamal Hadi Salim 

cheers,
jamal



Re: [PATCH net] net: sungem: fix rx checksum support

2018-06-20 Thread Mathieu Malaterre
On Wed, Jun 20, 2018 at 7:31 AM David Miller  wrote:
>
> From: Eric Dumazet 
> Date: Tue, 19 Jun 2018 19:18:50 -0700
>
> > After commit 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE
> > are friends"), sungem owners reported the infamous "eth0: hw csum failure"
> > message.
> >
> > CHECKSUM_COMPLETE has in fact never worked for this driver, but this
> > was masked by the fact that upper stacks had to strip the FCS, and
> > therefore skb->ip_summed was set back to CHECKSUM_NONE before
> > my recent change.
> >
> > Driver configures a number of bytes to skip when the chip computes
> > the checksum, and for some reason only half of the Ethernet header
> > was skipped.
> >
> > Then a second problem is that we should strip the FCS by default,
> > unless the driver is updated to eventually support NETIF_F_RXFCS in
> > the future.
> >
> > Finally, a driver should check if NETIF_F_RXCSUM feature is enabled
> > or not, so that the admin can turn off rx checksum if wanted.
> >
> > Many thanks to Andreas Schwab and Mathieu Malaterre for their
> > help in debugging this issue.
> >
> > Signed-off-by: Eric Dumazet 
> > Reported-by: Meelis Roos 
> > Reported-by: Mathieu Malaterre 
> > Reported-by: Andreas Schwab 
> > Tested-by: Andreas Schwab 
>
> Applied and queued up for -stable, thanks Eric.

Thanks for the stable tag.

IMHO the commit message should have also reference commit 7ce5a27f2ef8
("Revert "net: Handle CHECKSUM_COMPLETE more adequately in
pskb_trim_rcsum().""). Which means that commit 88078d98d1bb ("net:
pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends") should be ready
for stable also, since it seems that the only driver responsible for
the revert was sungem.

my 2cts


Re: Grant

2018-06-20 Thread Maratovich M. Fridman





--
I Mikhail Fridman. has selected you specially as one of my beneficiaries
for my Charitable Donation, Just as I have declared on May 23, 2016 to give
my fortune as charity.

Check the link below for confirmation:

http://www.ibtimes.co.uk/russias-second-wealthiest-man-mikhail-fridman-plans-leaving-14-2bn-fortune-charity-1561604

Reply as soon as possible with further directives.

Best Regards,
Mikhail Fridman.


Re: [PATCH][v2] xfrm: replace NR_CPU with nr_cpu_ids

2018-06-20 Thread Florian Westphal
Steffen Klassert  wrote:
> On Tue, Jun 19, 2018 at 09:53:49AM +0200, Florian Westphal wrote:
> > Li RongQing  wrote:
> > > The default NR_CPUS can be very large, but actual possible nr_cpu_ids
> > > usually is very small. For some x86 distribution, the NR_CPUS is 8192
> > > and nr_cpu_ids is 4, so replace NR_CPU to save some memory
> > 
> > Steffen,
> > 
> > I will soon submit a patch to remove the percpu cache; removal
> > improved performance for at least one user (and by quite a sizeable
> > amount).
> > 
> > Would you consider such removal for ipsec or ipsec-next?
> 
> I think this removel would better fit to ipsec-next.

Agree, it slows things down further for me in my tests.
Problem is that I get quite good re-use of pcpu cache due to
unidirectional flows and only one tunnel.

I suspect that even with tunnel the removal is a win in practice
though, netperf is quite artifical, so I rather trust Kristians results
(real world) than my own.

> considered to apply it to ipsec-next. If you plan
> to remove it, I'll wait for that.

I'll submit once net-next opens.


[PATCH] tc: fix batch force option

2018-06-20 Thread Vlad Buslov
When sending accumulated compound command results an error, check 'force'
option before exiting. Move return code check after putting batch bufs and
freeing iovs to prevent memory leak. Break from loop, instead of returning
error code to allow cleanup at the end of batch function. Don't reset ret
code on each iteration.

Fixes: 485d0c6001c4 ("tc: Add batchsize feature for filter and actions")
Reviewed-by: Roi Dayan 
Reviewed-by: Chris Mi 
Signed-off-by: Vlad Buslov 
---
 tc/tc.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/tc/tc.c b/tc/tc.c
index 0d223281ba25..62d54186ec66 100644
--- a/tc/tc.c
+++ b/tc/tc.c
@@ -331,6 +331,7 @@ static int batch(const char *name)
int batchsize = 0;
size_t len = 0;
int ret = 0;
+   int err;
bool send;
 
batch_mode = 1;
@@ -399,9 +400,9 @@ static int batch(const char *name)
continue;   /* blank line */
}
 
-   ret = do_cmd(largc, largv, tail == NULL ? NULL : tail->buf,
+   err = do_cmd(largc, largv, tail == NULL ? NULL : tail->buf,
 tail == NULL ? 0 : sizeof(tail->buf));
-   if (ret != 0) {
+   if (err != 0) {
fprintf(stderr, "Command failed %s:%d\n", name,
cmdlineno - 1);
ret = 1;
@@ -423,15 +424,17 @@ static int batch(const char *name)
iov->iov_len = n->nlmsg_len;
}
 
-   ret = rtnl_talk_iov(&rth, iovs, batchsize, NULL);
-   if (ret < 0) {
+   err = rtnl_talk_iov(&rth, iovs, batchsize, NULL);
+   put_batch_bufs(&buf_pool, &head, &tail);
+   free(iovs);
+   if (err < 0) {
fprintf(stderr, "Command failed %s:%d\n", name,
-   cmdlineno - (batchsize + ret) - 1);
-   return 2;
+   cmdlineno - (batchsize + err) - 1);
+   ret = 1;
+   if (!force)
+   break;
}
-   put_batch_bufs(&buf_pool, &head, &tail);
batchsize = 0;
-   free(iovs);
}
} while (!lastline);
 
-- 
2.7.5



RE: ICT Service Desk

2018-06-20 Thread Concepcion Bas



From: Concepcion Bas
Sent: Wednesday, June 20, 2018 2:18 AM
To: Concepcion Bas
Subject: ICT Service Desk

ICT Service Desk vereist dat u bijwerkt / opnieuw valideert naar de meest 
recente e-mail Outlook Web Apps 2018, vriendelijk Klik op Service 
Desk om opnieuw te valideren 
/ upgrade naar de nieuwste e-mail Outlook Web Apps 2018

Verbonden met Microsoft Exchange
© 2018 Microsoft Co-oporation. Alle rechten voorbehouden


[PATCH] net: Fix device name resolving crash in default_device_exit()

2018-06-20 Thread Kirill Tkhai
From: Kirill Tkhai 

The following script makes kernel to crash since it can't obtain
a name for a device, when the name is occupied by another device:

#!/bin/bash
ifconfig eth0 down
ifconfig eth1 down
index=`cat /sys/class/net/eth1/ifindex`
ip link set eth1 name dev$index
unshare -n sleep 1h &
pid=$!
while [[ "`readlink /proc/self/ns/net`" == "`readlink /proc/$pid/ns/net`" ]]; 
do continue; done
ip link set dev$index netns $pid
ip link set eth0 name dev$index
kill -9 $pid

Kernel messages:

virtio_net virtio1 dev3: renamed from eth1
virtio_net virtio0 dev3: renamed from eth0
default_device_exit: failed to move dev3 to init_net: -17
[ cut here ]
kernel BUG at net/core/dev.c:8978!
invalid opcode:  [#1] PREEMPT SMP
CPU: 1 PID: 276 Comm: kworker/u8:3 Not tainted 4.17.0+ #292
Workqueue: netns cleanup_net
RIP: 0010:default_device_exit+0x9c/0xb0
[stack trace snipped]

This patch gives more variability during choosing new name
of device and fixes the problem.

Signed-off-by: Kirill Tkhai 
---

Since there is no suggestions how to fix this in another way, I'm resending the 
patch.

 net/core/dev.c |4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 6e18242a1cae..6c9b9303ded6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8959,7 +8959,6 @@ static void __net_exit default_device_exit(struct net 
*net)
rtnl_lock();
for_each_netdev_safe(net, dev, aux) {
int err;
-   char fb_name[IFNAMSIZ];
 
/* Ignore unmoveable devices (i.e. loopback) */
if (dev->features & NETIF_F_NETNS_LOCAL)
@@ -8970,8 +8969,7 @@ static void __net_exit default_device_exit(struct net 
*net)
continue;
 
/* Push remaining network devices to init_net */
-   snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex);
-   err = dev_change_net_namespace(dev, &init_net, fb_name);
+   err = dev_change_net_namespace(dev, &init_net, "dev%d");
if (err) {
pr_emerg("%s: failed to move %s to init_net: %d\n",
 __func__, dev->name, err);


Re: WARNING: CPU: 3 PID: 0 at net/sched/sch_hfsc.c:1388 hfsc_dequeue+0x319/0x350 [sch_hfsc]

2018-06-20 Thread Marco Berizzi
> Il 19 giugno 2018 alle 13.42 Marco Berizzi  ha scritto:
> 
> > Il 18 giugno 2018 alle 21.28 Cong Wang  ha 
> > scritto:
> > Can you test the attached patch?
> > 
> > It almost certainly fixes the warning, but I am not sure if
> > it papers out any other real problem. Please make sure
> > hfsc still works as expected. :)
> 
> Hi Cong,
> 
> Thanks a lot for the patch. I'm building 4.17.2 + your patch.
> Within 24 hours I will update you.

Hi Cong,

I have applied your patch to 4.17.2
>From my point of view hfsc is working as expected.
However my setup is pretty simple:

tc class add dev eth2 parent 1:2 classid 1:16 hfsc ls rate 3000kbit ul rate 
2kbit

I can see hfsc is shaping my traffic to 20MBit/s.
The error message has not popped up :-)
I will drop you a message withing 48 hours for confirmation.
Will this patch merged in a future kernel release?

Thanks a lot Cong.

Marco


Re: [RFC v2, net-next, PATCH 4/4] net/cpsw_switchdev: add switchdev mode of operation on cpsw driver

2018-06-20 Thread Jiri Pirko
Tue, Jun 19, 2018 at 01:19:00AM CEST, grygorii.stras...@ti.com wrote:
>
>
>On 06/14/2018 06:43 AM, Ilias Apalodimas wrote:
>> On Thu, Jun 14, 2018 at 01:39:58PM +0200, Jiri Pirko wrote:
>>> Thu, Jun 14, 2018 at 01:34:04PM CEST, ilias.apalodi...@linaro.org wrote:
 On Thu, Jun 14, 2018 at 01:30:28PM +0200, Jiri Pirko wrote:
> Thu, Jun 14, 2018 at 01:11:30PM CEST, ilias.apalodi...@linaro.org wrote:
>
> [...]
>
>> @@ -2711,6 +2789,10 @@ static int cpsw_probe_dt(struct 
>> cpsw_platform_data *data,
>>  if (of_property_read_bool(node, "dual_emac"))
>>  data->switch_mode = CPSW_DUAL_EMAC;
>>
>> +/* switchdev overrides DTS */
>> +if (IS_ENABLED(CONFIG_TI_CPSW_SWITCHDEV))
>> +data->switch_mode = CPSW_SWITCHDEV;
>
> So you force CPSW_SWITCHDEV mode if the CONFIG_TI_CPSW_SWITCHDEV is
> enabled. That does not sound right. I think that user should tell what
> mode does he want regardless what the kernel config is.
 We discussed this during the V1 of the RFC. Yes it doesn't seem good, but 
 the
 device currently configures the modes using DTS (which is not correct). I 
 choose
 the .config due to that. I can't think of anything better, but i am open to
 suggestions
>>>
>>> Agreed that DTS does fit as well. I think that this might be a job for
>>> devlink parameters (patchset is going to be sent upstream next week).
>>> You do have 1 bus address for the whole device (both ports), right?
>>>
>> Yes devlink sounds reasonable. I thyink there's only one bus for it, but then
>> again i am far from an expert on the hardware interrnals. Grygorii can 
>> correct
>> me if i am wrong.
>
>Devlink and NFS boot are not compatible as per my understanding, so .. 

? Why do you say so?


>
>Again, current driver, as is, supports NFS boot in all modes
>(how good is the cur driver is question which out of scope of this discussion).
>
>And we discussed this in RFC v1 - driver mode selection will be changed 
>if we will proceed and it will be new driver for proper switch support.
>
>Not sure about about Devlink (need to study it and we never got any requests 
>from end
>users for this as I know), and I'd like to note (again) that this is embedded 
>(industrial/automotive etc), so everything preferred to be simple, fast and
>preferably configured statically (in most of the cases end users what boot 
>time 
>configuration).

You need to study it indeed.

> 
>-- 
>regards,
>-grygorii


[PATCH net 1/1] net/smc: coordinate wait queues for nonblocking connect

2018-06-20 Thread Ursula Braun
The recent poll change may lead to stalls for non-blocking connecting
SMC sockets, since sock_poll_wait is no longer performed on the
internal CLC socket, but on the outer SMC socket.  kernel_connect() on
the internal CLC socket returns with -EINPROGRESS, but the wake up
logic does not work in all cases. If the internal CLC socket is still
in state TCP_SYN_SENT when polled, sock_poll_wait() from sock_poll()
does not sleep. It is supposed to sleep till the state of the internal
CLC socket switches to TCP_ESTABLISHED.

This patch temporarily propagates the wait queue from the internal
CLC sock to the SMC sock, till the non-blocking connect() is
finished.

In addition locking is reduced due to the removed poll waits.

Fixes: c0129a061442 ("smc: convert to ->poll_mask")
Signed-off-by: Ursula Braun 
---
 net/smc/af_smc.c | 10 ++
 net/smc/smc.h|  1 +
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index da7f02edcd37..21c84b924ffb 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -605,6 +605,8 @@ static int smc_connect(struct socket *sock, struct sockaddr 
*addr,
 
smc_copy_sock_settings_to_clc(smc);
tcp_sk(smc->clcsock->sk)->syn_smc = 1;
+   if (flags & O_NONBLOCK)
+   sock->sk->sk_wq = smc->clcsock->sk->sk_wq;
rc = kernel_connect(smc->clcsock, addr, alen, flags);
if (rc)
goto out;
@@ -1285,12 +1287,9 @@ static __poll_t smc_poll_mask(struct socket *sock, 
__poll_t events)
 
smc = smc_sk(sock->sk);
sock_hold(sk);
-   lock_sock(sk);
if ((sk->sk_state == SMC_INIT) || smc->use_fallback) {
/* delegate to CLC child sock */
-   release_sock(sk);
mask = smc->clcsock->ops->poll_mask(smc->clcsock, events);
-   lock_sock(sk);
sk->sk_err = smc->clcsock->sk->sk_err;
if (sk->sk_err) {
mask |= EPOLLERR;
@@ -1299,7 +1298,10 @@ static __poll_t smc_poll_mask(struct socket *sock, 
__poll_t events)
if (sk->sk_state == SMC_INIT &&
mask & EPOLLOUT &&
smc->clcsock->sk->sk_state != TCP_CLOSE) {
+   sock->sk->sk_wq = smc->smcwq;
+   lock_sock(sk);
rc = __smc_connect(smc);
+   release_sock(sk);
if (rc < 0)
mask |= EPOLLERR;
/* success cases including fallback */
@@ -1334,7 +1336,6 @@ static __poll_t smc_poll_mask(struct socket *sock, 
__poll_t events)
mask |= EPOLLPRI;
 
}
-   release_sock(sk);
sock_put(sk);
 
return mask;
@@ -1663,6 +1664,7 @@ static int smc_create(struct net *net, struct socket 
*sock, int protocol,
sk_common_release(sk);
goto out;
}
+   smc->smcwq = sk->sk_wq;
smc->sk.sk_sndbuf = max(smc->clcsock->sk->sk_sndbuf, SMC_BUF_MIN_SIZE);
smc->sk.sk_rcvbuf = max(smc->clcsock->sk->sk_rcvbuf, SMC_BUF_MIN_SIZE);
 
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 51ae1f10d81a..89d6d7ef973f 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -190,6 +190,7 @@ struct smc_connection {
 struct smc_sock {  /* smc sock container */
struct sock sk;
struct socket   *clcsock;   /* internal tcp socket */
+   struct socket_wq*smcwq; /* original smcsock wq */
struct smc_connection   conn;   /* smc connection */
struct smc_sock *listen_smc;/* listen parent */
struct work_struct  tcp_listen_work;/* handle tcp socket accepts */
-- 
2.16.4



Re: [PATCH net 0/5] net sched actions: code style cleanup and fixes

2018-06-20 Thread Simon Horman
On Tue, Jun 19, 2018 at 12:56:03PM -0400, Roman Mashak wrote:
> The patchset fixes a few code stylistic issues and typos, as well as one
> detected by sparse semantic checker tool.
> 
> No functional changes introduced.
> 
> Patch 1 & 2 fix coding style bits caught by the checkpatch.pl script
> Patch 3 fixes an issue with a shadowed variable
> Patch 4 adds sizeof() operator instead of magic number for buffer length
> Patch 5 fixes typos in diagnostics messages
> 
> Roman Mashak (5):
>   net sched actions: fix coding style in pedit action
>   net sched actions: fix coding style in pedit headers
>   net sched actions: fix sparse warning
>   net sched actions: use sizeof operator for buffer length
>   net sched actions: fix misleading text strings in pedit action

All patches:

Reviewed-by: Simon Horman 



Re: Route fallback issue

2018-06-20 Thread Akshat Kakkar
Hi netdev community,

I have 2 interfaces
eno1 : 192.168.1.10/24
eno2 : 192.168.2.10/24

I added routes as
172.16.0.0/12 via 192.168.1.254 metric 1
172.16.0.0/12 via 192.168.2.254 metric 2

My intention : All traffic to 172.16.0.0/12 should go thru eno1. If
192.168.1.254 is not reachable (no arp entry or link down), then it
should fall back to eno2.

But this is not working. My box keeps on looking for 192.168.1.254
(i.e. sending arp requests) and never falls back.

I have posted this in lartc but looks like solution, if any, has to be
from netdev.

Your views on this.

Do we have some plan/roadmap to resolve this in linux kernel?



On Wed, Jun 20, 2018 at 1:49 PM, Erik Auerswald
 wrote:
> Hi,
>
> I have usually used the "replace" keyword of iproute2 for similar
> purposes. I would suggest a script as well, run via cron unless 1 minute
> failover times are not acceptable. The logic could be as follows:
>
> if ping -c1 $PRIMARY_NH >/dev/null 2>&1; then
>   ip route replace $PREFIX via $PRIMARY_NH
> elif ping -c1 $SECONDARY_NH >/dev/null 2>&1; then
>   ip route replace $PREFIX via $SECONDARY_NH
> else
>   ip route del $PREFIX
> fi
>
> Alternatively, one could look into a routing daemon that supports static
> routing (Zebra/Quagga/FRRouting, BIRD, ...) and check if that supports
> some form of next-hop tracking or at least removes static routes with
> unreachable next-hops as one would expect from experience with dedicated
> networking devices.
>
> IMHO static route handling as done by the Linux kernel does not seem
> useful for networking devices. I have even had bad experiences with
> Arista switches and static routing because they relied too much on the
> Linux kernel (probably still do).
>
> Thanks,
> Erik
> --
> Bufferbloat just waits in hiding to get you when you try to use the network.
> -- Jim Gettys
>
> On Wed, Jun 20, 2018 at 04:20:11AM +0100, cronolog+lartc wrote:
>> Hi,
>>
>> I /think/ Linux continues sending ARP requests and doesn't fall back
>> to the other route because the route to the failed next hop still
>> exists in the routing table with highest metric, so it continues
>> looking for this next hop.  I get the same behaviour as you when
>> labbing this up, I could not see a straightforward option to mark a
>> route as invalid under changes in reachability, I'd also like to
>> know if this feature is built in and exists.
>>
>>
>> However in the enterprise Cisco world, we can do what you are trying
>> to do very easily using "route tracking" and "IP SLA" features.
>> Basically we define tests e.g. reachability via ping with
>> appropriate frequency and threshholds, then attach these tests to
>> one or more preferred routes.  If the test fails, the associated
>> route is automatically uninstalled from the forwarding table, so any
>> existing lower metric routes get exposed and are used instead.  When
>> the test passes again, the preferred routes are reapplied.
>>
>> The underlying logic of this can certainly be scripted under Linux
>> to get very similar functionality, then put into a cron job or a
>> while loop or similar.  Something along the lines of (pseudocode):
>>if [the test such as ping fails] ; then
>>   if [preferred route exists] ; then ip route delete ... ; fi
>>else  ## ping is successful
>>   if [preferred route doesn't exist] ; then ip route add ... ; fi
>>fi
>>
>>
>> Hope that helps.  I'm also interested in any other solutions to do
>> this under Linux.
>>
>>
>> On 2018-06-19 13:18, Akshat Kakkar wrote:
>> >I have 2 interfaces
>> >eno1 : 192.168.1.10/24
>> >eno2 : 192.168.2.10/24
>> >
>> >I added routes as
>> >172.16.0.0/12 via 192.168.1.254 metric 1
>> >172.16.0.0/12 via 192.168.2.254 metric 2
>> >
>> >My intention : All traffic to 172.16.0.0/12 should go thru eno1. If
>> >192.168.1.254 is not reachable (no arp entry or link down), then it
>> >should fall back to eno2.
>> >
>> >But this is not working. My box keeps on looking for 192.168.1.254
>> >(i.e. sending arp requests) and never falls back.
>> >
>> >Can anyone help?
>> >