Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Florian Westphal
Jesper Dangaard Brouer  wrote:
> From: Ilias Apalodimas 
> 
> This patch is changing struct sk_buff, and is thus per-definition
> controversial.
> 
> Place a new member 'mem_info' of type struct xdp_mem_info, just after
> members (flags) head_frag and pfmemalloc, and not in between
> headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
> Copying mem_info during skb_clone() is required.  This makes sure that
> pages are correctly freed or recycled during the altered
> skb_free_head() invocation.

I read this to mean that this 'info' isn't accessed/needed until the skb
is freed.  Any reason it's not added at the end?

This would avoid moving other fields that are probably accessed
more frequently during processing.
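
For illustration only, a userspace toy (field names invented, not the real
struct sk_buff layout) showing how inserting a member early in a structure
pushes later members to higher offsets, possibly across a 64-byte cache
line boundary:

/* toy_layout.c -- illustrative only, not the real sk_buff.
 * Build with: gcc -Wall -o toy_layout toy_layout.c
 */
#include <stdio.h>
#include <stddef.h>

struct toy_skb_before {
	void *head, *data;		/* 16 bytes on 64-bit */
	char flags_and_misc[44];
	unsigned int hot_field;		/* offset 60: cache line 0 */
};

struct toy_skb_after {
	void *head, *data;
	unsigned long new_member;	/* inserted early */
	char flags_and_misc[44];
	unsigned int hot_field;		/* offset 68: cache line 1 */
};

int main(void)
{
	printf("before: hot_field at %zu (cache line %zu)\n",
	       offsetof(struct toy_skb_before, hot_field),
	       offsetof(struct toy_skb_before, hot_field) / 64);
	printf("after:  hot_field at %zu (cache line %zu)\n",
	       offsetof(struct toy_skb_after, hot_field),
	       offsetof(struct toy_skb_after, hot_field) / 64);
	return 0;
}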


[PATCH ipsec-next] xfrm: policy: fix policy hash rebuild

2018-11-27 Thread Florian Westphal
Dan Carpenter reports the following static checker warning:
 net/xfrm/xfrm_policy.c:1316 xfrm_hash_rebuild()
 warn: 'dir' is out of bounds '3' vs '2'

 |  1280  /* reset the bydst and inexact table in all directions */
 |  1281  xfrm_hash_reset_inexact_table(net);
 |  1282
 |  1283  for (dir = 0; dir < XFRM_POLICY_MAX; dir++) {
 |  ^
 |dir == XFRM_POLICY_MAX at the end of this loop.
 |  1304  /* re-insert all policies by order of creation */
 |  1305  list_for_each_entry_reverse(policy, &net->xfrm.policy_all, walk.all) {
 [..]
 |  1314  xfrm_policy_id2dir(policy->index));
 |  1315  if (!chain) {
 |  1316  void *p = xfrm_policy_inexact_insert(policy, dir, 0);

Fix this by updating 'dir' based on the current policy.  Otherwise, the
inexact policies won't be found anymore during lookup, as they get
hashed to a bogus bin.

Reported-by: Dan Carpenter 
Fixes: cc1bb845adc9 ("xfrm: policy: return NULL when inexact search needed")
Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index ddc3335dd552..be04091eb7db 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1303,15 +1303,16 @@ static void xfrm_hash_rebuild(struct work_struct *work)
 
/* re-insert all policies by order of creation */
	list_for_each_entry_reverse(policy, &net->xfrm.policy_all, walk.all) {
-   if (policy->walk.dead ||
-   xfrm_policy_id2dir(policy->index) >= XFRM_POLICY_MAX) {
+   if (policy->walk.dead)
+   continue;
+   dir = xfrm_policy_id2dir(policy->index);
+   if (dir >= XFRM_POLICY_MAX) {
/* skip socket policies */
continue;
}
newpos = NULL;
		chain = policy_hash_bysel(net, &policy->selector,
- policy->family,
- xfrm_policy_id2dir(policy->index));
+ policy->family, dir);
if (!chain) {
void *p = xfrm_policy_inexact_insert(policy, dir, 0);
 
-- 
2.18.1



Re: [RFC PATCH 0/3] sk_buff: add skb extension infrastructure

2018-11-26 Thread Florian Westphal
David Miller  wrote:
> From: Florian Westphal 
> Date: Mon, 26 Nov 2018 22:19:33 +0100
> 
> >> In the future please document what is so enormous and absolutely
> >> required that they must put it all into the SKB control block.
> > 
> > Ok, will do.
> 
> Why don't we establish the details about what MP-TCP needs in the CB
> before we discuss this further.
> 
> Could you do that for us?

Sure.

There is also an MPTCP conf call this week and I'll present this
patchset there as well.

I'll post the requirements and what possible alternatives have
been considered here as a follow-up.

Thanks,
Florian


Re: [RFC PATCH 0/3] sk_buff: add skb extension infrastructure

2018-11-26 Thread Florian Westphal
David Miller  wrote:
> > This adds an extension infrastructure for sk_buff instead:
> > 1. extension memory is released when the sk_buff is free'd.
> > 2. data is shared after cloning an skb.
> > 3. adding extension to an skb will COW the extension
> >buffer if needed.
> 
> So MP-TCP, when enabled for a connection, will have a new atomic
> operation for every packet?

Yes, at least for every kfree_skb call.

> And new tests all in the fast paths of the networking to facilitate
> this feature, a cost paid by everyone.

No, right now everyone has two non-atomic tests (skb->sp + skb->nf_bridge),
with this proposal everyone has one (skb->active_extensions), assuming that
both br_nf and xfrm are converted to use the extension system.

Test(s) occur both on copy/clone and kfree_skb, just like in current
kernels.

Atomic test(s) are done in case skb->{sp,nf_bridge} are set; with
this patch it's done if skb->active_extensions is != 0.

So from that angle the current status is kept.

The main motivation was to find a solution that does not add extra cost
for the normal cases.
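
To make that concrete, a rough sketch of the two free-path patterns being
compared (not the exact kernel functions; the real cleanup happens in the
skb free path):

/* Rough sketch, not the exact kernel code: what the skb free path has
 * to check today vs. with skb extensions.
 */

/* today: two pointer tests, each followed by a refcount drop if set */
static inline void skb_drop_extras_today(struct sk_buff *skb)
{
	secpath_put(skb->sp);		/* no-op when skb->sp == NULL */
	nf_bridge_put(skb->nf_bridge);	/* no-op when skb->nf_bridge == NULL */
}

/* with this series: one test; the atomic refcount drop only runs when
 * at least one extension was attached (see skb_ext_put() in patch 2/3).
 */
static inline void skb_drop_extras_ext(struct sk_buff *skb)
{
	if (skb->active_extensions)
		__skb_ext_put(skb->extensions);
}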

I did a quick hack to also convert skb->sp; it seems possible to do so.

In that case skbuff size is reduced by 8 bytes, as sp/nf_bridge get
replaced by a single 'extension pointer', and there is slightly less code
provided the kernel is built with both XFRM and bridge netfilter support.

> Sorry, that doesn't seem like a good idea to me.
>
> Can't they just encode whatever huge amount of crap they want to
> put into the CB by deriving the information from skb->sk and some
> tiny value like an index or something to resolve the path?

Perhaps; if that's the only way, I'm afraid that's what will need to be
used.  I did try such a scheme once in the past to get
rid of skb->nf_bridge and things became very, very fugly due to
kfree_skb() not being aware of such 'external storage', i.e. no
easy way to clean the external storage when an skbuff gets tossed.

It might be possible to use a destructor to take care of this in the mptcp
case.  I can have a look if this turns out to be the only possible way.
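
A minimal sketch of that destructor idea (purely hypothetical; struct
mptcp_ext_storage and both helpers are invented names for illustration,
not existing MPTCP or kernel code):

/* Hypothetical sketch of 'external storage freed via a destructor'.
 *
 * Caveats that make this approach ugly in practice: skb->destructor is
 * often already taken by the socket layer, and clones share the
 * skb_shared_info, so ownership of destructor_arg gets murky.
 */
struct mptcp_ext_storage {
	u64 data_seq;
	/* ... whatever does not fit into skb->cb[] ... */
};

static void mptcp_ext_destructor(struct sk_buff *skb)
{
	kfree(skb_shinfo(skb)->destructor_arg);
}

static int mptcp_attach_ext(struct sk_buff *skb)
{
	struct mptcp_ext_storage *ext = kzalloc(sizeof(*ext), GFP_ATOMIC);

	if (!ext)
		return -ENOMEM;

	skb_shinfo(skb)->destructor_arg = ext;
	skb->destructor = mptcp_ext_destructor;
	return 0;
}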

> In the future please document what is so enormous and absolutely
> required that they must put it all into the SKB control block.

Ok, will do.

> Like Eric, I am concerned about the slow creep of overhead.  Lots of
> small "not that bad" additions of extra cycles here and there over
> time adds up to impossible to fix performance regressions.

I have the same concern, which is why I am proposing the conversion
of xfrm and nf_bridge to use this instead of the current
nf_bridge/secpath maintenance.

Although MPTCP is the main motivation here, it was intended as a
standalone series, i.e., these 3 patches and a few followup changes
to convert xfrm.

> I'm sorry if this is a major disappointment for the MP-TCP folks but a
> better way needs to be found to integrate what they want to do with
> real zero cost for the rest of the world which won't be using MP-TCP
> and therefore should not be paying for it's added overhead at all.

Agreed.


[RFC PATCH 3/3] net: convert bridge_nf to use skb extension infrastructure

2018-11-26 Thread Florian Westphal
This converts the bridge netfilter (calling iptables hooks from bridge)
facility to use the extension infrastructure instead.

The bridge_nf-specific hooks in the skb clone and free paths are removed;
they have been replaced by the skb_ext hooks that do the same as the
bridge nf allocation hooks did.

Signed-off-by: Florian Westphal 
---
 include/linux/netfilter_bridge.h |  4 ++--
 include/linux/skbuff.h   | 28 ++--
 include/net/netfilter/br_netfilter.h |  8 
 net/Kconfig  |  1 +
 net/bridge/br_netfilter_hooks.c  | 20 ++--
 net/bridge/br_netfilter_ipv6.c   |  4 ++--
 net/core/skbuff.c|  3 ---
 7 files changed, 13 insertions(+), 55 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index 0a65a422587c..5f2614d02e03 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -20,12 +20,12 @@ static inline void br_drop_fake_rtable(struct sk_buff *skb)
 static inline struct nf_bridge_info *
 nf_bridge_info_get(const struct sk_buff *skb)
 {
-   return skb->nf_bridge;
+   return skb_ext_find(skb, SKB_EXT_BRIDGE_NF);
 }
 
 static inline bool nf_bridge_info_exists(const struct sk_buff *skb)
 {
-   return skb->nf_bridge != NULL;
+   return skb_ext_exist(skb, SKB_EXT_BRIDGE_NF);
 }
 
 static inline int nf_bridge_get_physinif(const struct sk_buff *skb)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 832904d71a85..c33654da09c3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -255,7 +255,6 @@ struct nf_conntrack {
 
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 struct nf_bridge_info {
-   refcount_t  use;
enum {
BRNF_PROTO_UNCHANGED,
BRNF_PROTO_8021Q,
@@ -717,9 +716,6 @@ struct sk_buff {
 #endif
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
unsigned long_nfct;
-#endif
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-   struct nf_bridge_info   *nf_bridge;
 #endif
unsigned intlen,
data_len;
@@ -4011,18 +4007,6 @@ static inline void __skb_ext_copy(struct sk_buff *d, 
const struct sk_buff *s) {}
 static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *s) 
{}
 #endif /* CONFIG_SKB_EXTENSIONS */
 
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-static inline void nf_bridge_put(struct nf_bridge_info *nf_bridge)
-{
-   if (nf_bridge && refcount_dec_and_test(&nf_bridge->use))
-   kfree(nf_bridge);
-}
-static inline void nf_bridge_get(struct nf_bridge_info *nf_bridge)
-{
-   if (nf_bridge)
-   refcount_inc(&nf_bridge->use);
-}
-#endif /* CONFIG_BRIDGE_NETFILTER */
 static inline void nf_reset(struct sk_buff *skb)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
@@ -4030,8 +4014,7 @@ static inline void nf_reset(struct sk_buff *skb)
skb->_nfct = 0;
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-   nf_bridge_put(skb->nf_bridge);
-   skb->nf_bridge = NULL;
+   skb_ext_del(skb, SKB_EXT_BRIDGE_NF);
 #endif
 }
 
@@ -4049,7 +4032,7 @@ static inline void ipvs_reset(struct sk_buff *skb)
 #endif
 }
 
-/* Note: This doesn't put any conntrack and bridge info in dst. */
+/* Note: This doesn't put any conntrack info in dst. */
 static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src,
 bool copy)
 {
@@ -4057,10 +4040,6 @@ static inline void __nf_copy(struct sk_buff *dst, const 
struct sk_buff *src,
dst->_nfct = src->_nfct;
nf_conntrack_get(skb_nfct(src));
 #endif
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-   dst->nf_bridge  = src->nf_bridge;
-   nf_bridge_get(src->nf_bridge);
-#endif
 #if IS_ENABLED(CONFIG_NETFILTER_XT_TARGET_TRACE) || defined(CONFIG_NF_TABLES)
if (copy)
dst->nf_trace = src->nf_trace;
@@ -4071,9 +4050,6 @@ static inline void nf_copy(struct sk_buff *dst, const 
struct sk_buff *src)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
nf_conntrack_put(skb_nfct(dst));
-#endif
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-   nf_bridge_put(dst->nf_bridge);
 #endif
__nf_copy(dst, src, true);
 }
diff --git a/include/net/netfilter/br_netfilter.h 
b/include/net/netfilter/br_netfilter.h
index 6efc0153987b..4cd56808ac4e 100644
--- a/include/net/netfilter/br_netfilter.h
+++ b/include/net/netfilter/br_netfilter.h
@@ -6,12 +6,12 @@
 
 static inline struct nf_bridge_info *nf_bridge_alloc(struct sk_buff *skb)
 {
-   skb->nf_bridge = kzalloc(sizeof(struct nf_bridge_info), GFP_ATOMIC);
+   struct nf_bridge_info *b = skb_ext_add(skb, SKB_EXT_BRIDGE_NF);
 
-   if (likely(skb->nf_bridge))
-   refcount_set(&(skb->nf_bridge->use), 1);
+   if (b)
+   

[RFC PATCH 0/3] sk_buff: add skb extension infrastructure

2018-11-26 Thread Florian Westphal
The (out-of-tree) Multipath-TCP implementation needs a significant amount
of extra space in the skb control buffer.

Increasing skb->cb[] size in mainline is a non-starter for memory and
performance reasons (f.e. an increase in cb size also moves several
frequently-accessed fields to other cache lines).

One approach that might work for MPTCP is to extend skb_shared_info instead
of sk_buff.  However, this comes with other drawbacks: it either
needs special skb allocation to make sure there is enough space for such an
'extended shinfo' at the end of the data buffer (which makes this only usable
for the tx path), or an increased size of skb_shared_info.

This adds an extension infrastructure for sk_buff instead:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb.
3. adding extension to an skb will COW the extension
   buffer if needed.

This is also how xfrm and bridge_nf extra data (skb->sp, skb->nf_bridge)
are handled.

In the future, protocols that need to store more than 48 bytes in skb->cb[]
could add a 'SKB_EXT_EXTRA_CB' or similar to allocate extra space.

Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
   have been enabled for this skb.
2. extension pointer, located at the end of the sk_buff.
   If active_extensions byte is 0, pointer value is undefined.

Last patch converts nf_bridge to use the extension infrastructure:
The 'nf_bridge' pointer is removed, i.e. sk_buff size remains the same.

Extra code added to skb clone and free paths (to deal with
refcount/free of extension area) replace the existing code that
deals with skb->nf_bridge.
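
For reference, a usage sketch of the proposed helpers (skb_ext_add(),
skb_ext_find() and skb_ext_del() from patches 2/3 and 3/3); SKB_EXT_MPTCP
and struct mptcp_ext below are hypothetical placeholders, not part of this
series:

/* Hypothetical user of the proposed API; SKB_EXT_MPTCP and struct
 * mptcp_ext do not exist in this series, they only illustrate how a
 * future extension could be added, looked up and removed.
 */
struct mptcp_ext {
	u64	data_seq;
	u32	subflow_seq;
};

static int mptcp_tag_skb(struct sk_buff *skb, u64 dseq, u32 sseq)
{
	struct mptcp_ext *ext = skb_ext_add(skb, SKB_EXT_MPTCP);

	if (!ext)
		return -ENOMEM;	/* allocation (or COW of shared ext area) failed */

	ext->data_seq = dseq;
	ext->subflow_seq = sseq;
	return 0;
}

static void mptcp_inspect_skb(const struct sk_buff *skb)
{
	const struct mptcp_ext *ext = skb_ext_find(skb, SKB_EXT_MPTCP);

	if (ext)
		pr_debug("data_seq=%llu\n", (unsigned long long)ext->data_seq);

	/* when no longer needed: skb_ext_del(skb, SKB_EXT_MPTCP); */
}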

Conversion of skb->sp (ipsec/xfrm secpath) to an skb extension could be
done as a followup, but I'm reluctant to work on this before there is
agreement that this is the right direction.

Comments welcome.

 include/linux/netfilter_bridge.h |   33 +---
 include/linux/skbuff.h   |  142 +--
 include/net/netfilter/br_netfilter.h |   14 ---
 net/Kconfig  |4 
 net/bridge/br_netfilter_hooks.c  |   39 +++--
 net/bridge/br_netfilter_ipv6.c   |4 
 net/core/skbuff.c|  134 -
 net/ipv4/ip_output.c |1 
 net/ipv4/netfilter/nf_reject_ipv4.c  |6 -
 net/ipv6/ip6_output.c|1 
 net/ipv6/netfilter/nf_reject_ipv6.c  |   10 +-
 net/netfilter/nf_log_common.c|   20 ++--
 net/netfilter/nf_queue.c |   50 
 net/netfilter/nfnetlink_queue.c  |   23 ++---
 net/netfilter/xt_physdev.c   |2 
 15 files changed, 368 insertions(+), 115 deletions(-)




[RFC PATCH 1/3] netfilter: avoid using skb->nf_bridge directly

2018-11-26 Thread Florian Westphal
plan is to remove this pointer from sk_buff, so use the existing helpers
in more places to reduce noise later on.

Signed-off-by: Florian Westphal 
---
 include/linux/netfilter_bridge.h | 33 +-
 include/net/netfilter/br_netfilter.h |  6 
 net/bridge/br_netfilter_hooks.c  | 19 ---
 net/ipv4/netfilter/nf_reject_ipv4.c  |  6 ++--
 net/ipv6/netfilter/nf_reject_ipv6.c  | 10 --
 net/netfilter/nf_log_common.c| 20 +--
 net/netfilter/nf_queue.c | 50 ++--
 net/netfilter/nfnetlink_queue.c  | 23 ++---
 net/netfilter/xt_physdev.c   |  2 +-
 9 files changed, 103 insertions(+), 66 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index fa0686500970..0a65a422587c 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -17,43 +17,58 @@ static inline void br_drop_fake_rtable(struct sk_buff *skb)
skb_dst_drop(skb);
 }
 
+static inline struct nf_bridge_info *
+nf_bridge_info_get(const struct sk_buff *skb)
+{
+   return skb->nf_bridge;
+}
+
+static inline bool nf_bridge_info_exists(const struct sk_buff *skb)
+{
+   return skb->nf_bridge != NULL;
+}
+
 static inline int nf_bridge_get_physinif(const struct sk_buff *skb)
 {
-   struct nf_bridge_info *nf_bridge;
+   const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
 
-   if (skb->nf_bridge == NULL)
+   if (!nf_bridge)
return 0;
 
-   nf_bridge = skb->nf_bridge;
return nf_bridge->physindev ? nf_bridge->physindev->ifindex : 0;
 }
 
 static inline int nf_bridge_get_physoutif(const struct sk_buff *skb)
 {
-   struct nf_bridge_info *nf_bridge;
+   const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
 
-   if (skb->nf_bridge == NULL)
+   if (!nf_bridge)
return 0;
 
-   nf_bridge = skb->nf_bridge;
return nf_bridge->physoutdev ? nf_bridge->physoutdev->ifindex : 0;
 }
 
 static inline struct net_device *
 nf_bridge_get_physindev(const struct sk_buff *skb)
 {
-   return skb->nf_bridge ? skb->nf_bridge->physindev : NULL;
+   const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+   return nf_bridge ? nf_bridge->physindev : NULL;
 }
 
 static inline struct net_device *
 nf_bridge_get_physoutdev(const struct sk_buff *skb)
 {
-   return skb->nf_bridge ? skb->nf_bridge->physoutdev : NULL;
+   const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+   return nf_bridge ? nf_bridge->physoutdev : NULL;
 }
 
 static inline bool nf_bridge_in_prerouting(const struct sk_buff *skb)
 {
-   return skb->nf_bridge && skb->nf_bridge->in_prerouting;
+   const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+   return nf_bridge && nf_bridge->in_prerouting;
 }
 #else
 #define br_drop_fake_rtable(skb)   do { } while (0)
diff --git a/include/net/netfilter/br_netfilter.h 
b/include/net/netfilter/br_netfilter.h
index 74af19c3a8f7..6efc0153987b 100644
--- a/include/net/netfilter/br_netfilter.h
+++ b/include/net/netfilter/br_netfilter.h
@@ -22,12 +22,6 @@ int br_nf_hook_thresh(unsigned int hook, struct net *net, 
struct sock *sk,
  int (*okfn)(struct net *, struct sock *,
  struct sk_buff *));
 
-static inline struct nf_bridge_info *
-nf_bridge_info_get(const struct sk_buff *skb)
-{
-   return skb->nf_bridge;
-}
-
 unsigned int nf_bridge_encap_header_len(const struct sk_buff *skb);
 
 static inline void nf_bridge_push_encap_header(struct sk_buff *skb)
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index c9383c470a83..c58cf68b45c5 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -247,7 +247,9 @@ static int br_validate_ipv4(struct net *net, struct sk_buff 
*skb)
 
 void nf_bridge_update_protocol(struct sk_buff *skb)
 {
-   switch (skb->nf_bridge->orig_proto) {
+   const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+   switch (nf_bridge->orig_proto) {
case BRNF_PROTO_8021Q:
skb->protocol = htons(ETH_P_8021Q);
break;
@@ -569,7 +571,8 @@ static unsigned int br_nf_forward_ip(void *priv,
struct net_device *parent;
u_int8_t pf;
 
-   if (!skb->nf_bridge)
+   nf_bridge = nf_bridge_info_get(skb);
+   if (!nf_bridge)
return NF_ACCEPT;
 
/* Need exclusive nf_bridge_info since we might have multiple
@@ -701,7 +704,9 @@ br_nf_ip_fragment(struct net *net, struct sock *sk, struct 
sk_buff *skb,
 
 static unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
 {
-   if (skb->nf_bridge->orig_proto == BRNF_PROTO_PPPOE)
+   const struct nf_bridge_info *nf_b

[RFC PATCH 2/3] sk_buff: add skb extension infrastructure

2018-11-26 Thread Florian Westphal
adds an extension infrastructure for sk_buff:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb.

This is also how xfrm and bridge netfilter skb-associated data
(skb->sp and skb->nf_bridge) are handled.

Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
   have been allocated for the skb.
2. extension pointer, located at the end of the sk_buff.
   If active_extensions is 0, its content is undefined.

The 'nf_bridge' pointer is removed in a followup patch, i.e. sk_buff size
remains the same.

This adds extra code to skb clone and free paths (to deal with
refcount/free of extension area) but replaces the existing code that
deals with skb->nf_bridge.

This patch only adds the basic infrastructure, the nf_bridge conversion
is done in the next patch.

Conversion of skb->sp (ipsec/xfrm secpath) to an skb extension is planned
as a followup.
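
To illustrate the chunked offset encoding used by struct skb_ext below,
resolving an extension id to its address boils down to something like this
(sketch based on the kernel-doc comment in this patch, not the exact
helper added by the diff):

/* Sketch only: translate an extension id into its address, following
 * the struct skb_ext kernel-doc below (offsets stored in 8-byte units,
 * relative to the data[] member).
 */
static inline void *skb_ext_ptr(struct skb_ext *ext, enum skb_ext_id id)
{
	/* a u8 offset in 8-byte units can address roughly 2k of extension
	 * data while keeping the bookkeeping to one byte per extension.
	 */
	return (void *)(ext->data + (ext->offset[id] << SKB_EXT_ALIGN_SHIFT));
}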

Signed-off-by: Florian Westphal 
---
 include/linux/skbuff.h | 124 +-
 net/Kconfig|   3 +
 net/core/skbuff.c  | 131 +
 net/ipv4/ip_output.c   |   1 +
 net/ipv6/ip6_output.c  |   1 +
 5 files changed, 259 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 73902acf2b71..832904d71a85 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -245,6 +245,7 @@ struct iov_iter;
 struct napi_struct;
 struct bpf_prog;
 union bpf_attr;
+struct skb_ext;
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 struct nf_conntrack {
@@ -633,6 +634,7 @@ typedef unsigned char *sk_buff_data_t;
  * @queue_mapping: Queue mapping for multiqueue devices
  * @xmit_more: More SKBs are pending for this queue
  * @pfmemalloc: skbuff was allocated from PFMEMALLOC reserves
+ * @active_extensions: active extensions (skb_ext_id types)
  * @ndisc_nodetype: router type (from link layer)
  * @ooo_okay: allow the mapping of a socket to a queue to be changed
  * @l4_hash: indicate hash is a canonical 4-tuple hash over transport
@@ -662,6 +664,7 @@ typedef unsigned char *sk_buff_data_t;
  * @data: Data head pointer
  * @truesize: Buffer size
  * @users: User count - see {datagram,tcp}.c
+ * @extensions: allocated extensions, valid if active_extensions is nonzero
  */
 
 struct sk_buff {
@@ -744,7 +747,9 @@ struct sk_buff {
head_frag:1,
xmit_more:1,
pfmemalloc:1;
-
+#ifdef CONFIG_SKB_EXTENSIONS
+   __u8active_extensions;
+#endif
/* fields enclosed in headers_start/headers_end are copied
 * using a single memcpy() in __copy_skb_header()
 */
@@ -866,6 +871,11 @@ struct sk_buff {
*data;
unsigned inttruesize;
refcount_t  users;
+
+#ifdef CONFIG_SKB_EXTENSIONS
+   /* only useable after checking ->active_extensions != 0 */
+   struct skb_ext  *extensions;
+#endif
 };
 
 #ifdef __KERNEL__
@@ -3889,6 +3899,118 @@ static inline void nf_conntrack_get(struct nf_conntrack 
*nfct)
	atomic_inc(&nfct->use);
 }
 #endif
+
+#ifdef CONFIG_SKB_EXTENSIONS
+enum skb_ext_id {
+#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+   SKB_EXT_BRIDGE_NF,
+#endif
+   SKB_EXT_NUM, /* must be last */
+};
+
+/* each extension aligned to this value */
+#define SKB_EXT_ALIGN  8
+/* offsets/len: left-shift needed to translate offset to bytes */
+#define SKB_EXT_ALIGN_SHIFT 3
+
+/**
+ * struct skb_ext - sk_buff extensions
+ * @refcount: 1 on allocation, deallocated on 0
+ * @offset: offset to add to @data to obtain extension address
+ * @len: size currently allocated, stored in SKB_EXT_ALIGN_SHIFT units
+ * @data: start of extension data, variable sized
+ *
+ * Note: offsets and len are stored in chunks of 8 bytes, this allows
+ * to use 'u8' types while allowing up to 2kb worth of extension data.
+ */
+struct skb_ext {
+   refcount_t refcnt;
+   u8 offset[SKB_EXT_NUM]; /* chunks of 8 bytes */
+   u8 len; /* same, i.e. size == len << 3 */
+   char data[0] __aligned(SKB_EXT_ALIGN);
+};
+
+void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id);
+void __skb_ext_del(struct sk_buff *skb, enum skb_ext_id id);
+void __skb_ext_free(struct skb_ext *ext);
+
+static inline void __skb_ext_put(struct skb_ext *ext)
+{
+   if (ext && refcount_dec_and_test(&ext->refcnt))
+   __skb_ext_free(ext);
+}
+
+static inline void skb_ext_put(struct sk_buff *skb)
+{
+   if (skb->active_extensions)
+   __skb_ext_put(skb->extensions);
+}
+
+static inline void skb_ext_get(struct sk_buff *skb)
+{
+   if (skb->active_extensions) {
+   struct skb_ext *ext 

[PATCH ipsec-next] xfrm: policy: fix netlink/pf_key policy lookups

2018-11-14 Thread Florian Westphal
Colin Ian King says:
 Static analysis with CoverityScan found a potential issue [..]
 It seems that pointer pol is set to NULL and then a check to see if it
 is non-null is used to set pol to tmp; however, this check is always
 going to be false because pol is always NULL.

Fix this and update the test script to catch it.  Output with only the
script update applied (i.e. without the kernel fix):
./xfrm_policy.sh ; echo $?
RTNETLINK answers: No such file or directory
FAIL: ip -net ns3 xfrm policy get src 10.0.1.0/24 dst 10.0.2.0/24 dir out
RTNETLINK answers: No such file or directory
[..]
PASS: policy before exception matches
PASS: ping to .254 bypassed ipsec tunnel
PASS: direct policy matches
PASS: policy matches
1

Fixes: 6be3b0db6db ("xfrm: policy: add inexact policy search tree infrastructure")
Reported-by: Colin Ian King 
Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c |  5 ++-
 tools/testing/selftests/net/xfrm_policy.sh | 38 ++
 2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index bd80b8a4322f..cff8c5b720b4 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1663,7 +1663,10 @@ struct xfrm_policy *xfrm_policy_bysel_ctx(struct net 
*net, u32 mark, u32 if_id,
tmp = __xfrm_policy_bysel_ctx(cand.res[i], mark,
  if_id, type, dir,
  sel, ctx);
-   if (tmp && pol && tmp->pos < pol->pos)
+   if (!tmp)
+   continue;
+
+   if (!pol || tmp->pos < pol->pos)
pol = tmp;
}
} else {
diff --git a/tools/testing/selftests/net/xfrm_policy.sh 
b/tools/testing/selftests/net/xfrm_policy.sh
index 6ca63a6e50e9..8db35b99457c 100755
--- a/tools/testing/selftests/net/xfrm_policy.sh
+++ b/tools/testing/selftests/net/xfrm_policy.sh
@@ -21,6 +21,7 @@
 # Kselftest framework requirement - SKIP code is 4.
 ksft_skip=4
 ret=0
+policy_checks_ok=1
 
 KEY_SHA=0xdeadbeef1234567890abcdefabcdefabcdefabcd
 KEY_AES=0x0123456789abcdef0123456789012345
@@ -45,6 +46,26 @@ do_esp() {
 ip -net $ns xfrm policy add src $rnet dst $lnet dir fwd tmpl src $remote 
dst $me proto esp mode tunnel priority 100 action allow
 }
 
+do_esp_policy_get_check() {
+local ns=$1
+local lnet=$2
+local rnet=$3
+
+ip -net $ns xfrm policy get src $lnet dst $rnet dir out > /dev/null
+if [ $? -ne 0 ] && [ $policy_checks_ok -eq 1 ] ;then
+policy_checks_ok=0
+echo "FAIL: ip -net $ns xfrm policy get src $lnet dst $rnet dir out"
+ret=1
+fi
+
+ip -net $ns xfrm policy get src $rnet dst $lnet dir fwd > /dev/null
+if [ $? -ne 0 ] && [ $policy_checks_ok -eq 1 ] ;then
+policy_checks_ok=0
+echo "FAIL: ip -net $ns xfrm policy get src $rnet dst $lnet dir fwd"
+ret=1
+fi
+}
+
 do_exception() {
 local ns=$1
 local me=$2
@@ -112,31 +133,31 @@ check_xfrm() {
# 1: iptables -m policy rule count != 0
rval=$1
ip=$2
-   ret=0
+   lret=0
 
ip netns exec ns1 ping -q -c 1 10.0.2.$ip > /dev/null
 
check_ipt_policy_count ns3
if [ $? -ne $rval ] ; then
-   ret=1
+   lret=1
fi
check_ipt_policy_count ns4
if [ $? -ne $rval ] ; then
-   ret=1
+   lret=1
fi
 
ip netns exec ns2 ping -q -c 1 10.0.1.$ip > /dev/null
 
check_ipt_policy_count ns3
if [ $? -ne $rval ] ; then
-   ret=1
+   lret=1
fi
check_ipt_policy_count ns4
if [ $? -ne $rval ] ; then
-   ret=1
+   lret=1
fi
 
-   return $ret
+   return $lret
 }
 
 #check for needed privileges
@@ -227,6 +248,11 @@ do_esp ns4 dead:3::10 dead:3::1 dead:2::/64 dead:1::/64 
$SPI2 $SPI1
 do_dummies4 ns3
 do_dummies6 ns4
 
+do_esp_policy_get_check ns3 10.0.1.0/24 10.0.2.0/24
+do_esp_policy_get_check ns4 10.0.2.0/24 10.0.1.0/24
+do_esp_policy_get_check ns3 dead:1::/64 dead:2::/64
+do_esp_policy_get_check ns4 dead:2::/64 dead:1::/64
+
 # ping to .254 should use ipsec, exception is not installed.
 check_xfrm 1 254
 if [ $? -ne 0 ]; then
-- 
2.18.1



Re: xfrm: policy: add inexact policy search tree infrastructure

2018-11-14 Thread Florian Westphal
Colin Ian King  wrote:
> Hi,
> 
> Static analysis with CoverityScan found a potential issue with the commit:
> 
> commit 6be3b0db6db82cf056a72cc18042048edd27f8ee
> Author: Florian Westphal 
> Date:   Wed Nov 7 23:00:37 2018 +0100
> 
> xfrm: policy: add inexact policy search tree infrastructure
> 
> It seems that pointer pol is set to NULL and then a check to see if it
> is non-null is used to set pol to tmp; however, this check is always
> going to be false because pol is always NULL.

Right.  This is in the control-plane code to retrieve a policy
via netlink or PF_KEY.

> The issue is reported by CoverityScan as follows:
> 
> Line
> 1658
> assignment: Assigning: pol = NULL.
> 1659pol = NULL;
> 1660for (i = 0; i < ARRAY_SIZE(cand.res); i++) {
> 1661struct xfrm_policy *tmp;
> 1662
> 1663tmp = __xfrm_policy_bysel_ctx(cand.res[i], mark,
> 1664  if_id, type, dir,
> 1665  sel, ctx);
> 
> null: At condition pol, the value of pol must be NULL.
> dead_error_condition: The condition pol cannot be true.
> 
> CID 1475480 (#1 of 1): Logically dead code
> 
> (DEADCODE) dead_error_line: Execution cannot reach the expression
> tmp->pos < pol->pos inside this statement: if (tmp && pol && tmp->pos 
> 
> 1666if (tmp && pol && tmp->pos < pol->pos)
> 1667pol = tmp;
>
> 
> I suspect this is not intentional and is probably a bug.

Right, a bug.  I would like to just break after the first 'tmp != NULL', but
that might make us return a different policy than the old linear search did.
So we should update pol in case it's NULL as well.

Steffen, let me know if you want an incremental fix or if you
prefer to squash this:

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1663,7 +1663,10 @@ struct xfrm_policy *xfrm_policy_bysel_ctx(struct net 
*net, u32 mark, u32 if_id,
tmp = __xfrm_policy_bysel_ctx(cand.res[i], mark,
  if_id, type, dir,
  sel, ctx);
-   if (tmp && pol && tmp->pos < pol->pos)
+   if (!tmp)
+   continue;
+
+   if (!pol || tmp->pos < pol->pos)
pol = tmp;
}
} else {


[PATCH ipsec-next 05/11] xfrm: policy: store inexact policies in an rhashtable

2018-11-07 Thread Florian Westphal
Switch packet-path lookups for inexact policies to rhashtable.

In this initial version, we now no longer need to search policies with
non-matching address family and type.

Next patch will add the if_id as well so lookups from the xfrm interface
driver only need to search inexact policies for that device.

Future patches will augment the hlist in each rhash bucket with a tree
and pre-sort policies according to daddr/prefix.

A single rhashtable is used.  In order to avoid a full rhashtable walk on
netns exit, the bins get placed on a pernet list, i.e. we add almost no
cost for network namespaces that had no xfrm policies.

The inexact lists are kept in place, and policies are added to both the
per-rhash-inexact list and a pernet one.

The latter is needed for the control plane to handle migrate -- these
requests do not consider the if_id, so if we'd remove the inexact_list
now we would have to search all hash buckets and then figure
out which matching policy candidate is the most recent one -- this appears
a bit harder than just keeping the 'old' inexact list for this purpose.
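
The netns-exit shortcut described above then amounts to a plain list walk;
roughly like this (sketch only, the actual exit/prune code is not shown in
this excerpt, and the prune helper with the net_exit argument only appears
later in the series):

/* Sketch of the netns-exit idea: prune only the bins on this netns'
 * inexact_bins list instead of walking the shared rhashtable.
 * __xfrm_policy_inexact_prune_bin() is the helper from patch 7/11.
 */
static void xfrm_policy_inexact_netns_exit_sketch(struct net *net)
{
	struct xfrm_pol_inexact_bin *b, *t;

	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
	list_for_each_entry_safe(b, t, &net->xfrm.inexact_bins, inexact_bins)
		__xfrm_policy_inexact_prune_bin(b, true);
	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
}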

Signed-off-by: Florian Westphal 
---
 include/net/netns/xfrm.h |   2 +
 include/net/xfrm.h   |   1 +
 net/xfrm/xfrm_policy.c   | 350 +--
 3 files changed, 335 insertions(+), 18 deletions(-)

diff --git a/include/net/netns/xfrm.h b/include/net/netns/xfrm.h
index 9991e5ef52cc..59f45b1e9dac 100644
--- a/include/net/netns/xfrm.h
+++ b/include/net/netns/xfrm.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -53,6 +54,7 @@ struct netns_xfrm {
unsigned intpolicy_count[XFRM_POLICY_MAX * 2];
struct work_struct  policy_hash_work;
struct xfrm_policy_hthresh policy_hthresh;
+   struct list_headinexact_bins;
 
 
struct sock *nlsk;
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 0eb390c205af..870fa9b27f7e 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -596,6 +596,7 @@ struct xfrm_policy {
u16 family;
struct xfrm_sec_ctx *security;
struct xfrm_tmplxfrm_vec[XFRM_MAX_DEPTH];
+   struct hlist_node   bydst_inexact_list;
struct rcu_head rcu;
 };
 
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index b00c265f6be3..5c7e7399323f 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -45,6 +46,22 @@ struct xfrm_flo {
u8 flags;
 };
 
+struct xfrm_pol_inexact_key {
+   possible_net_t net;
+   u16 family;
+   u8 dir, type;
+};
+
+struct xfrm_pol_inexact_bin {
+   struct xfrm_pol_inexact_key k;
+   struct rhash_head head;
+   struct hlist_head hhead;
+
+   /* slow path below */
+   struct list_head inexact_bins;
+   struct rcu_head rcu;
+};
+
 static DEFINE_SPINLOCK(xfrm_if_cb_lock);
 static struct xfrm_if_cb const __rcu *xfrm_if_cb __read_mostly;
 
@@ -55,6 +72,9 @@ static struct xfrm_policy_afinfo const __rcu 
*xfrm_policy_afinfo[AF_INET6 + 1]
 static struct kmem_cache *xfrm_dst_cache __ro_after_init;
 static __read_mostly seqcount_t xfrm_policy_hash_generation;
 
+static struct rhashtable xfrm_policy_inexact_table;
+static const struct rhashtable_params xfrm_pol_inexact_params;
+
 static void xfrm_init_pmtu(struct xfrm_dst **bundle, int nr);
 static int stale_bundle(struct dst_entry *dst);
 static int xfrm_bundle_ok(struct xfrm_dst *xdst);
@@ -64,6 +84,18 @@ static void __xfrm_policy_link(struct xfrm_policy *pol, int 
dir);
 static struct xfrm_policy *__xfrm_policy_unlink(struct xfrm_policy *pol,
int dir);
 
+static struct xfrm_pol_inexact_bin *
+xfrm_policy_inexact_lookup(struct net *net, u8 type, u16 family, u8 dir);
+
+static struct xfrm_pol_inexact_bin *
+xfrm_policy_inexact_lookup_rcu(struct net *net,
+  u8 type, u16 family, u8 dir);
+static struct xfrm_policy *
+xfrm_policy_insert_list(struct hlist_head *chain, struct xfrm_policy *policy,
+   bool excl);
+static void xfrm_policy_insert_inexact_list(struct hlist_head *chain,
+   struct xfrm_policy *policy);
+
 static inline bool xfrm_pol_hold_rcu(struct xfrm_policy *policy)
 {
	return refcount_inc_not_zero(&policy->refcnt);
@@ -269,6 +301,7 @@ struct xfrm_policy *xfrm_policy_alloc(struct net *net, 
gfp_t gfp)
if (policy) {
		write_pnet(&policy->xp_net, net);
		INIT_LIST_HEAD(&policy->walk.all);
+		INIT_HLIST_NODE(&policy->bydst_inexact_list);
		INIT_HLIST_NODE(&policy->bydst);
		INIT_HLIST_NODE(&policy->byidx);
		rwlock_init(&policy->lock);
@@ -563,6 +596,107 @@ static void xfrm_hash_resize(struct work_struct *work)
	mutex_unlock(&hash_resize_mutex);
 }
 
+static void xfrm_hash_reset_ine

[PATCH ipsec-next 04/11] xfrm: policy: return NULL when inexact search needed

2018-11-07 Thread Florian Westphal
currently policy_hash_bysel() returns the hash bucket list
(for exact policies), or the inexact list (when policy uses a prefix).

Searching this inexact list is slow, so it might be better to pre-sort
inexact lists into a tree or another data structure for faster
searching.

However, due to 'any' policies, which need to be searched in any case,
doing so will require that 'inexact' policies be handled
specially to decide the best search strategy.  So change policy_hash_bysel()
to return NULL if the policy can't be handled via the policy hash
table.

Right now, we simply use the inexact list when this happens, but a
future patch can then implement a different strategy.

Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 20d6815be0d7..b00c265f6be3 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -365,7 +365,7 @@ static struct hlist_head *policy_hash_bysel(struct net *net,
hash = __sel_hash(sel, family, hmask, dbits, sbits);
 
if (hash == hmask + 1)
-   return &net->xfrm.policy_inexact[dir];
+   return NULL;
 
return rcu_dereference_check(net->xfrm.policy_bydst[dir].table,
				     lockdep_is_held(&net->xfrm.xfrm_policy_lock)) + hash;
@@ -625,6 +625,8 @@ static void xfrm_hash_rebuild(struct work_struct *work)
		chain = policy_hash_bysel(net, &policy->selector,
  policy->family,
  xfrm_policy_id2dir(policy->index));
+   if (!chain)
+   chain = &net->xfrm.policy_inexact[dir];
hlist_for_each_entry(pol, chain, bydst) {
if (policy->priority >= pol->priority)
				newpos = &pol->bydst;
@@ -781,7 +783,12 @@ int xfrm_policy_insert(int dir, struct xfrm_policy 
*policy, int excl)
 
	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
	chain = policy_hash_bysel(net, &policy->selector, policy->family, dir);
-   delpol = xfrm_policy_insert_list(chain, policy, excl);
+   if (chain) {
+   delpol = xfrm_policy_insert_list(chain, policy, excl);
+   } else {
+   chain = &net->xfrm.policy_inexact[dir];
+   delpol = xfrm_policy_insert_list(chain, policy, excl);
+   }
 
if (IS_ERR(delpol)) {
		spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
@@ -829,6 +836,8 @@ struct xfrm_policy *xfrm_policy_bysel_ctx(struct net *net, 
u32 mark, u32 if_id,
*err = 0;
	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
chain = policy_hash_bysel(net, sel, sel->family, dir);
+   if (!chain)
+   chain = &net->xfrm.policy_inexact[dir];
ret = NULL;
hlist_for_each_entry(pol, chain, bydst) {
if (pol->type == type &&
-- 
2.18.1



[PATCH ipsec-next 07/11] xfrm: policy: add inexact policy search tree infrastructure

2018-11-07 Thread Florian Westphal
At this time inexact policies are all searched in order until the first
match is found.  After removal of the flow cache, this resolution has
to be performed for every packet, resulting in a major slowdown when the
number of inexact policies is high.

This adds infrastructure to later sort inexact policies into a tree.
This only introduces a single class: any:any.

Next patch will add a search tree to pre-sort policies that
have a fixed daddr/prefixlen, so in this patch the any:any class
will still be used for all policies.
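
For orientation, this is roughly how the candidate set is meant to be
consumed on the packet path once later patches add the other classes
(sketch only; selector matching against the flow is omitted, the real
lookup code is not fully visible in this excerpt, and it assumes ->pos
grows with insertion order, i.e. a larger ->pos means a younger policy):

/* Sketch: walk all candidate lists and keep the best match.
 * Lowest priority wins; on a tie, the younger policy (larger ->pos,
 * assuming ->pos grows with insertion order) wins.
 */
static struct xfrm_policy *
xfrm_pol_best_candidate_sketch(const struct xfrm_pol_inexact_candidates *cand)
{
	struct xfrm_policy *pol, *best = NULL;
	int i;

	for (i = 0; i < XFRM_POL_CAND_MAX; i++) {
		if (!cand->res[i])
			continue;

		hlist_for_each_entry(pol, cand->res[i], bydst) {
			if (!best ||
			    pol->priority < best->priority ||
			    (pol->priority == best->priority &&
			     pol->pos > best->pos))
				best = pol;
		}
	}

	return best;
}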

Signed-off-by: Florian Westphal 
---
 include/net/xfrm.h |   1 +
 net/xfrm/xfrm_policy.c | 301 +
 2 files changed, 248 insertions(+), 54 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 870fa9b27f7e..9df6dca17155 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -577,6 +577,7 @@ struct xfrm_policy {
/* This lock only affects elements except for entry. */
rwlock_tlock;
refcount_t  refcnt;
+   u32 pos;
struct timer_list   timer;
 
atomic_tgenid;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index dda27fd7b8a4..4eb12e9b40c2 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -46,6 +46,9 @@ struct xfrm_flo {
u8 flags;
 };
 
+/* prefixes smaller than this are stored in lists, not trees. */
+#define INEXACT_PREFIXLEN_IPV4 16
+#define INEXACT_PREFIXLEN_IPV6 48
 struct xfrm_pol_inexact_key {
possible_net_t net;
u32 if_id;
@@ -56,6 +59,7 @@ struct xfrm_pol_inexact_key {
 struct xfrm_pol_inexact_bin {
struct xfrm_pol_inexact_key k;
struct rhash_head head;
+   /* list containing '*:*' policies */
struct hlist_head hhead;
 
/* slow path below */
@@ -63,6 +67,16 @@ struct xfrm_pol_inexact_bin {
struct rcu_head rcu;
 };
 
+enum xfrm_pol_inexact_candidate_type {
+   XFRM_POL_CAND_ANY,
+
+   XFRM_POL_CAND_MAX,
+};
+
+struct xfrm_pol_inexact_candidates {
+   struct hlist_head *res[XFRM_POL_CAND_MAX];
+};
+
 static DEFINE_SPINLOCK(xfrm_if_cb_lock);
 static struct xfrm_if_cb const __rcu *xfrm_if_cb __read_mostly;
 
@@ -98,6 +112,12 @@ xfrm_policy_insert_list(struct hlist_head *chain, struct 
xfrm_policy *policy,
 static void xfrm_policy_insert_inexact_list(struct hlist_head *chain,
struct xfrm_policy *policy);
 
+static bool
+xfrm_policy_find_inexact_candidates(struct xfrm_pol_inexact_candidates *cand,
+   struct xfrm_pol_inexact_bin *b,
+   const xfrm_address_t *saddr,
+   const xfrm_address_t *daddr);
+
 static inline bool xfrm_pol_hold_rcu(struct xfrm_policy *policy)
 {
	return refcount_inc_not_zero(&policy->refcnt);
@@ -652,13 +672,48 @@ xfrm_policy_inexact_alloc_bin(const struct xfrm_policy 
*pol, u8 dir)
return IS_ERR(prev) ? NULL : prev;
 }
 
-static void xfrm_policy_inexact_delete_bin(struct net *net,
-  struct xfrm_pol_inexact_bin *b)
+static bool xfrm_pol_inexact_addr_use_any_list(const xfrm_address_t *addr,
+  int family, u8 prefixlen)
 {
-   lockdep_assert_held(&net->xfrm.xfrm_policy_lock);
+   if (xfrm_addr_any(addr, family))
+   return true;
+
+   if (family == AF_INET6 && prefixlen < INEXACT_PREFIXLEN_IPV6)
+   return true;
+
+   if (family == AF_INET && prefixlen < INEXACT_PREFIXLEN_IPV4)
+   return true;
+
+   return false;
+}
+
+static bool
+xfrm_policy_inexact_insert_use_any_list(const struct xfrm_policy *policy)
+{
+   const xfrm_address_t *addr;
+   bool saddr_any, daddr_any;
+   u8 prefixlen;
+
+   addr = &policy->selector.saddr;
+   prefixlen = policy->selector.prefixlen_s;
 
-   if (!hlist_empty(&b->hhead))
+   saddr_any = xfrm_pol_inexact_addr_use_any_list(addr,
+  policy->family,
+  prefixlen);
+   addr = &policy->selector.daddr;
+   prefixlen = policy->selector.prefixlen_d;
+   daddr_any = xfrm_pol_inexact_addr_use_any_list(addr,
+  policy->family,
+  prefixlen);
+   return saddr_any && daddr_any;
+}
+
+static void __xfrm_policy_inexact_prune_bin(struct xfrm_pol_inexact_bin *b, 
bool net_exit)
+{
+   if (!hlist_empty(&b->hhead)) {
+   WARN_ON_ONCE(net_exit);
return;
+   }
 
	if (rhashtable_remove_fast(&xfrm_policy_inexact_table, &b->head,
   xfrm_pol_inexact_params) == 0) {
@@ -667,14 +722,23 @@ static void xfrm_policy_inexact_delete_bin(struct net 
*net,
 

[PATCH ipsec-next 10/11] xfrm: policy: store inexact policies in a tree ordered by source address

2018-11-07 Thread Florian Westphal
This adds the 'saddr:any' search class.  It contains all policies that have
a fixed saddr/prefixlen, but 'any' destination.

Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 46 ++
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 57e28dcd7c53..38e33326c856 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -71,11 +71,20 @@ struct xfrm_pol_inexact_node {
  * | |
  * | +- coarse policies and all any:daddr policies
  * |
+ * + root_s: sorted by saddr:prefix
+ * | |
+ * |xfrm_pol_inexact_node
+ * | |
+ * | + root: unused
+ * | |
+ * | + hhead: saddr:any policies
+ * |
  * + coarse policies and all any:any policies
  *
- * Lookups return two candidate lists:
+ * Lookups return three candidate lists:
  * 1. any:any list from top-level xfrm_pol_inexact_bin
  * 2. any:daddr list from daddr tree
+ * 3. saddr:any list from saddr tree
  *
  * This result set then needs to be searched for the policy with
  * the lowest priority.  If two results have same prio, youngest one wins.
@@ -98,12 +107,16 @@ struct xfrm_pol_inexact_bin {
/* tree sorted by daddr/prefix */
struct rb_root root_d;
 
+   /* tree sorted by saddr/prefix */
+   struct rb_root root_s;
+
/* slow path below */
struct list_head inexact_bins;
struct rcu_head rcu;
 };
 
 enum xfrm_pol_inexact_candidate_type {
+   XFRM_POL_CAND_SADDR,
XFRM_POL_CAND_DADDR,
XFRM_POL_CAND_ANY,
 
@@ -696,6 +709,7 @@ xfrm_policy_inexact_alloc_bin(const struct xfrm_policy 
*pol, u8 dir)
bin->k = k;
	INIT_HLIST_HEAD(&bin->hhead);
	bin->root_d = RB_ROOT;
+   bin->root_s = RB_ROOT;
	seqcount_init(&bin->count);
 
	prev = rhashtable_lookup_get_insert_key(&xfrm_policy_inexact_table,
@@ -980,9 +994,10 @@ static void __xfrm_policy_inexact_prune_bin(struct 
xfrm_pol_inexact_bin *b, bool
 {
	write_seqcount_begin(&b->count);
	xfrm_policy_inexact_gc_tree(&b->root_d, net_exit);
+   xfrm_policy_inexact_gc_tree(&b->root_s, net_exit);
	write_seqcount_end(&b->count);
 
-   if (!RB_EMPTY_ROOT(&b->root_d) ||
+   if (!RB_EMPTY_ROOT(&b->root_d) || !RB_EMPTY_ROOT(&b->root_s) ||
	    !hlist_empty(&b->hhead)) {
WARN_ON_ONCE(net_exit);
return;
@@ -1027,11 +1042,29 @@ xfrm_policy_inexact_alloc_chain(struct 
xfrm_pol_inexact_bin *bin,
if (xfrm_policy_inexact_insert_use_any_list(policy))
		return &bin->hhead;
 
-   if (xfrm_pol_inexact_addr_use_any_list(&policy->selector.daddr,
+   /* saddr is wildcard */
+   if (xfrm_pol_inexact_addr_use_any_list(&policy->selector.saddr,
   policy->family,
-  policy->selector.prefixlen_d))
+  policy->selector.prefixlen_s))
		return &bin->hhead;
 
+   if (xfrm_pol_inexact_addr_use_any_list(&policy->selector.daddr,
+  policy->family,
+  policy->selector.prefixlen_d)) {
+   write_seqcount_begin(&bin->count);
+   n = xfrm_policy_inexact_insert_node(net,
+   &bin->root_s,
+   &policy->selector.saddr,
+   policy->family,
+   policy->selector.prefixlen_s,
+   dir);
+   write_seqcount_end(&bin->count);
+   if (!n)
+   return NULL;
+
+   return &n->hhead;
+   }
+
/* daddr is fixed */
	write_seqcount_begin(&bin->count);
n = xfrm_policy_inexact_insert_node(net,
@@ -1826,6 +1859,11 @@ xfrm_policy_find_inexact_candidates(struct 
xfrm_pol_inexact_candidates *cand,
if (n)
		cand->res[XFRM_POL_CAND_DADDR] = &n->hhead;
 
+   n = xfrm_policy_lookup_inexact_addr(&b->root_s, &b->count, saddr,
+   family);
+   if (n)
+   cand->res[XFRM_POL_CAND_SADDR] = &n->hhead;
+
return true;
 }
 
-- 
2.18.1



[PATCH ipsec-next 06/11] xfrm: policy: consider if_id when hashing inexact policy

2018-11-07 Thread Florian Westphal
This avoids searching policies that cannot match in the first
place due to a different interface id, by placing them in different bins.
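
As a small aside on the compare helper touched below: rhashtable treats a
return value of 0 as "match", and a ^ b is zero exactly when a == b, so
returning 'b->k.if_id ^ key->if_id' rejects bins whose interface id
differs.  A trivial userspace illustration (not kernel code):

/* ifid_cmp.c -- userspace illustration of the xor-as-compare idiom.
 * Build with: gcc -Wall -o ifid_cmp ifid_cmp.c
 */
#include <stdio.h>
#include <stdint.h>

static int cmp_if_id(uint32_t bin_if_id, uint32_t key_if_id)
{
	return bin_if_id ^ key_if_id;	/* 0 means match */
}

int main(void)
{
	printf("same if_id      -> %d\n", cmp_if_id(42, 42));	/* 0 */
	printf("different if_id -> %d\n", cmp_if_id(42, 7));	/* nonzero */
	return 0;
}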

Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 5c7e7399323f..dda27fd7b8a4 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -48,6 +48,7 @@ struct xfrm_flo {
 
 struct xfrm_pol_inexact_key {
possible_net_t net;
+   u32 if_id;
u16 family;
u8 dir, type;
 };
@@ -85,11 +86,12 @@ static struct xfrm_policy *__xfrm_policy_unlink(struct 
xfrm_policy *pol,
int dir);
 
 static struct xfrm_pol_inexact_bin *
-xfrm_policy_inexact_lookup(struct net *net, u8 type, u16 family, u8 dir);
+xfrm_policy_inexact_lookup(struct net *net, u8 type, u16 family, u8 dir,
+  u32 if_id);
 
 static struct xfrm_pol_inexact_bin *
 xfrm_policy_inexact_lookup_rcu(struct net *net,
-  u8 type, u16 family, u8 dir);
+  u8 type, u16 family, u8 dir, u32 if_id);
 static struct xfrm_policy *
 xfrm_policy_insert_list(struct hlist_head *chain, struct xfrm_policy *policy,
bool excl);
@@ -618,6 +620,7 @@ xfrm_policy_inexact_alloc_bin(const struct xfrm_policy 
*pol, u8 dir)
.family = pol->family,
.type = pol->type,
.dir = dir,
+   .if_id = pol->if_id,
};
struct net *net = xp_net(pol);
 
@@ -925,7 +928,8 @@ static u32 xfrm_pol_bin_key(const void *data, u32 len, u32 
seed)
const struct xfrm_pol_inexact_key *k = data;
u32 a = k->type << 24 | k->dir << 16 | k->family;
 
-   return jhash_2words(a, net_hash_mix(read_pnet(&k->net)), seed);
+   return jhash_3words(a, k->if_id, net_hash_mix(read_pnet(&k->net)),
+   seed);
 }
 
 static u32 xfrm_pol_bin_obj(const void *data, u32 len, u32 seed)
@@ -957,7 +961,7 @@ static int xfrm_pol_bin_cmp(struct rhashtable_compare_arg 
*arg,
if (ret)
return ret;
 
-   return 0;
+   return b->k.if_id ^ key->if_id;
 }
 
 static const struct rhashtable_params xfrm_pol_inexact_params = {
@@ -1094,7 +1098,7 @@ struct xfrm_policy *xfrm_policy_bysel_ctx(struct net 
*net, u32 mark, u32 if_id,
chain = policy_hash_bysel(net, sel, sel->family, dir);
if (!chain) {
bin = xfrm_policy_inexact_lookup(net, type,
-sel->family, dir);
+sel->family, dir, if_id);
if (!bin) {
			spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
return NULL;
@@ -1335,12 +1339,14 @@ static int xfrm_policy_match(const struct xfrm_policy 
*pol,
 }
 
 static struct xfrm_pol_inexact_bin *
-xfrm_policy_inexact_lookup_rcu(struct net *net, u8 type, u16 family, u8 dir)
+xfrm_policy_inexact_lookup_rcu(struct net *net, u8 type, u16 family,
+  u8 dir, u32 if_id)
 {
struct xfrm_pol_inexact_key k = {
.family = family,
.type = type,
.dir = dir,
+   .if_id = if_id,
};
 
	write_pnet(&k.net, net);
@@ -1350,14 +1356,15 @@ xfrm_policy_inexact_lookup_rcu(struct net *net, u8 
type, u16 family, u8 dir)
 }
 
 static struct xfrm_pol_inexact_bin *
-xfrm_policy_inexact_lookup(struct net *net, u8 type, u16 family, u8 dir)
+xfrm_policy_inexact_lookup(struct net *net, u8 type, u16 family,
+  u8 dir, u32 if_id)
 {
struct xfrm_pol_inexact_bin *bin;
 
lockdep_assert_held(>xfrm.xfrm_policy_lock);
 
rcu_read_lock();
-   bin = xfrm_policy_inexact_lookup_rcu(net, type, family, dir);
+   bin = xfrm_policy_inexact_lookup_rcu(net, type, family, dir, if_id);
rcu_read_unlock();
 
return bin;
@@ -1405,7 +1412,7 @@ static struct xfrm_policy 
*xfrm_policy_lookup_bytype(struct net *net, u8 type,
break;
}
}
-   bin = xfrm_policy_inexact_lookup_rcu(net, type, family, dir);
+   bin = xfrm_policy_inexact_lookup_rcu(net, type, family, dir, if_id);
if (!bin)
goto skip_inexact;
chain = >hhead;
-- 
2.18.1



[PATCH ipsec-next 01/11] selftests: add xfrm policy test script

2018-11-07 Thread Florian Westphal
add a script that sets up an ipsec tunnel between two network
namespaces plus the following policies:

.0/24 -> ipsec tunnel
.240/28 -> bypass
.253/32 -> ipsec tunnel

Then check that .254 bypasses the tunnel (matching the /28 exception),
while .2 (matching the /24) and .253 (matching the direct policy) pass
through the tunnel.

Abuses iptables to check if ping did resolve an ipsec policy or not.

Also adds a bunch of 'block' rules that are not supposed to match.

Signed-off-by: Florian Westphal 
---
 tools/testing/selftests/net/Makefile   |   3 +-
 tools/testing/selftests/net/xfrm_policy.sh | 276 +
 2 files changed, 278 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/net/xfrm_policy.sh

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 256d82d5fa87..4c7df17d4f42 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -4,7 +4,8 @@
 CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
 CFLAGS += -I../../../../usr/include/
 
-TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
rtnetlink.sh
+TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh \
+ rtnetlink.sh xfrm_policy.sh
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
 TEST_PROGS_EXTENDED := in_netns.sh
diff --git a/tools/testing/selftests/net/xfrm_policy.sh 
b/tools/testing/selftests/net/xfrm_policy.sh
new file mode 100755
index ..72f9e3dfbea1
--- /dev/null
+++ b/tools/testing/selftests/net/xfrm_policy.sh
@@ -0,0 +1,276 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Check xfrm policy resolution.  Topology:
+#
+# 1.2   1.1   3.1  3.102.1   2.2
+# eth1  eth1 veth0 veth0 eth1   eth1
+# ns1  ns3 - ns4  ns2
+#
+# ns3 and ns4 are connected via ipsec tunnel.
+# pings from ns1 to ns2 (and vice versa) are supposed to work like this:
+# ns1: ping 10.0.2.2: passes via ipsec tunnel.
+# ns2: ping 10.0.1.2: passes via ipsec tunnel.
+
+# ns1: ping 10.0.1.253: passes via ipsec tunnel (direct policy)
+# ns2: ping 10.0.2.253: passes via ipsec tunnel (direct policy)
+#
+# ns1: ping 10.0.2.254: does NOT pass via ipsec tunnel (exception)
+# ns2: ping 10.0.1.254: does NOT pass via ipsec tunnel (exception)
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+ret=0
+
+KEY_SHA=0xdeadbeef1234567890abcdefabcdefabcdefabcd
+KEY_AES=0x0123456789abcdef0123456789012345
+SPI1=0x1
+SPI2=0x2
+
+do_esp() {
+local ns=$1
+local me=$2
+local remote=$3
+local lnet=$4
+local rnet=$5
+local spi_out=$6
+local spi_in=$7
+
+ip -net $ns xfrm state add src $remote dst $me proto esp spi $spi_in  enc 
aes $KEY_AES  auth sha1 $KEY_SHA  mode tunnel sel src $rnet dst $lnet
+ip -net $ns xfrm state add src $me  dst $remote proto esp spi $spi_out enc 
aes $KEY_AES auth sha1 $KEY_SHA mode tunnel sel src $lnet dst $rnet
+
+# to encrypt packets as they go out (includes forwarded packets that need 
encapsulation)
+ip -net $ns xfrm policy add src $lnet dst $rnet dir out tmpl src $me dst 
$remote proto esp mode tunnel priority 100 action allow
+# to fwd decrypted packets after esp processing:
+ip -net $ns xfrm policy add src $rnet dst $lnet dir fwd tmpl src $remote 
dst $me proto esp mode tunnel priority 100 action allow
+}
+
+do_exception() {
+local ns=$1
+local me=$2
+local remote=$3
+local encryptip=$4
+local plain=$5
+
+# network $plain passes without tunnel
+ip -net $ns xfrm policy add dst $plain dir out priority 10 action allow
+
+# direct policy for $encryptip, use tunnel, higher prio takes precedence
+ip -net $ns xfrm policy add dst $encryptip dir out tmpl src $me dst 
$remote proto esp mode tunnel priority 1 action allow
+}
+
+# policies that are not supposed to match any packets generated in this test.
+do_dummies4() {
+local ns=$1
+
+for i in $(seq 10 16);do
+  # dummy policy with wildcard src/dst.
+  echo netns exec $ns ip xfrm policy add src 0.0.0.0/0 dst 10.$i.99.0/30 
dir out action block
+  echo netns exec $ns ip xfrm policy add src 10.$i.99.0/30 dst 0.0.0.0/0 
dir out action block
+  for j in $(seq 32 64);do
+echo netns exec $ns ip xfrm policy add src 10.$i.1.0/30 dst 
10.$i.$j.0/30 dir out action block
+# silly, as it encompasses the one above too, but its allowed:
+echo netns exec $ns ip xfrm policy add src 10.$i.1.0/29 dst 
10.$i.$j.0/29 dir out action block
+# and yet again, even more broad one.
+echo netns exec $ns ip xfrm policy add src 10.$i.1.0/24 dst 
10.$i.$j.0/24 dir out action block
+echo netns exec $ns ip xfrm policy add src 10.$i.$j.0/24 dst 
10.$i.1.0/24 dir fwd action block
+  done
+done | ip -batch /dev/stdin
+}
+
+do_dummies6() {
+local ns=$1
+
+for i in $(seq 10 16);do
+  for j in $(seq 

[PATCH ipsec-next 09/11] xfrm: policy: check reinserted policies match their node

2018-11-07 Thread Florian Westphal
validate that the re-inserted policies match the lookup node.
Policies that fail this test won't be returned in the candidate set.

This is enabled by default for now; it should not cause a noticeable
reinsert slowdown.

Such reinserts are needed when we have to merge an existing node
(e.g. 10.0.0.0/28) because an overlapping subnet was added (e.g.
10.0.0.0/24); whenever this happens, existing policies have to
be placed on the list of the new node.
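
As a concrete example of the check used below (xfrm_policy_addr_delta() == 0
meaning "the address lies within the node's prefix"), here is a small
userspace rendition of the IPv4 branch from patch 8/11; the addresses are
just for illustration:

/* delta_demo.c -- userspace rendition of the IPv4 prefix-delta check.
 * Build with: gcc -Wall -o delta_demo delta_demo.c
 */
#include <stdio.h>
#include <stdint.h>
#include <sys/socket.h>
#include <arpa/inet.h>

/* prefixlen must be 1..32 here; the kernel version also handles 0 */
static long long addr_delta_v4(uint32_t a_be, uint32_t b_be, uint8_t plen)
{
	uint32_t mask = ~0u << (32 - plen);

	return (long long)(ntohl(a_be) & mask) - (long long)(ntohl(b_be) & mask);
}

int main(void)
{
	uint32_t node, daddr;

	inet_pton(AF_INET, "10.0.0.0", &node);	/* node created for the /24 */
	inet_pton(AF_INET, "10.0.0.5", &daddr);	/* policy daddr inside it */
	printf("10.0.0.5 vs 10.0.0.0/24 -> %lld (0: matches the node)\n",
	       addr_delta_v4(daddr, node, 24));

	inet_pton(AF_INET, "10.0.1.5", &daddr);	/* outside the /24 */
	printf("10.0.1.5 vs 10.0.0.0/24 -> %lld\n",
	       addr_delta_v4(daddr, node, 24));
	return 0;
}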

Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 81447d5d02e6..57e28dcd7c53 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -806,10 +806,16 @@ static void xfrm_policy_inexact_list_reinsert(struct net 
*net,
  struct xfrm_pol_inexact_node *n,
  u16 family)
 {
+   unsigned int matched_s, matched_d;
struct hlist_node *newpos = NULL;
struct xfrm_policy *policy, *p;
 
+   matched_s = 0;
+   matched_d = 0;
+
	list_for_each_entry_reverse(policy, &net->xfrm.policy_all, walk.all) {
+   bool matches_s, matches_d;
+
if (!policy->bydst_reinsert)
continue;
 
@@ -827,6 +833,32 @@ static void xfrm_policy_inexact_list_reinsert(struct net 
*net,
			hlist_add_behind(&policy->bydst, newpos);
		else
			hlist_add_head(&policy->bydst, &n->hhead);
+
+   /* paranoia checks follow.
+* Check that the reinserted policy matches at least
+* saddr or daddr for current node prefix.
+*
+* Matching both is fine, matching saddr in one policy
+* (but not daddr) and then matching only daddr in another
+* is a bug.
+*/
+   matches_s = xfrm_policy_addr_delta(&policy->selector.saddr,
+  &n->addr,
+  n->prefixlen,
+  family) == 0;
+   matches_d = xfrm_policy_addr_delta(&policy->selector.daddr,
+  &n->addr,
+  n->prefixlen,
+  family) == 0;
+   if (matches_s && matches_d)
+   continue;
+
+   WARN_ON_ONCE(!matches_s && !matches_d);
+   if (matches_s)
+   matched_s++;
+   if (matches_d)
+   matched_d++;
+   WARN_ON_ONCE(matched_s && matched_d);
}
 }
 
-- 
2.18.1



[PATCH ipsec-next 00/11] xfrm: policy: add inexact policy search tree

2018-11-07 Thread Florian Westphal
add commands
   keeps on returning same lookup result compared to older kernel.

This work doesn't prevent us from re-adding a new flow caching scheme,
but so far I found no good solution (all have various DoS or degradation
issues).

Comments or questions welcome.

Florian Westphal (11):
  selftests: add xfrm policy test script
  xfrm: security: iterate all, not inexact lists
  xfrm: policy: split list insertion into a helper
  xfrm: policy: return NULL when inexact search needed
  xfrm: policy: store inexact policies in an rhashtable
  xfrm: policy: consider if_id when hashing inexact policy
  xfrm: policy: add inexact policy search tree infrastructure
  xfrm: policy: store inexact policies in a tree ordered by destination 
address
  xfrm: policy: check reinserted policies match their node
  xfrm: policy: store inexact policies in a tree ordered by source address
  xfrm: policy: add 2nd-level saddr trees for inexact policies

 include/net/netns/xfrm.h   |2
 include/net/xfrm.h |3
 net/xfrm/xfrm_policy.c | 1234 ++---
 tools/testing/selftests/net/Makefile   |3
 tools/testing/selftests/net/xfrm_policy.sh |  276 ++
 5 files changed, 1391 insertions(+), 127 deletions(-)



[PATCH ipsec-next 08/11] xfrm: policy: store inexact policies in a tree ordered by destination address

2018-11-07 Thread Florian Westphal
This adds inexact lists per destination network, stored in a search tree.

Inexact lookups now return two 'candidate lists', the 'any' policies
('any' destinations), and a list of policies that share the same
daddr/prefix.

Next patch will add a second search tree for 'saddr:any' policies
so we can avoid placing those on the 'any:any' list too.
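
The daddr tree lookup implied here is then a standard rbtree descent keyed
by xfrm_policy_addr_delta(); a rough sketch follows (the in-tree helper,
xfrm_policy_lookup_inexact_addr() in the later hunks, additionally copes
with differing prefix lengths and runs under the bin's seqcount, both
omitted here):

/* Sketch only: descend the daddr-sorted rbtree and return the node
 * whose prefix contains 'addr'.
 */
static struct xfrm_pol_inexact_node *
xfrm_pol_daddr_node_sketch(const struct rb_root *r,
			   const xfrm_address_t *addr, u16 family)
{
	struct rb_node *rnode = r->rb_node;

	while (rnode) {
		struct xfrm_pol_inexact_node *node;
		int delta;

		node = rb_entry(rnode, struct xfrm_pol_inexact_node, node);
		delta = xfrm_policy_addr_delta(addr, &node->addr,
					       node->prefixlen, family);
		if (delta < 0)
			rnode = rnode->rb_left;
		else if (delta > 0)
			rnode = rnode->rb_right;
		else
			return node;	/* addr falls within node's prefix */
	}

	return NULL;
}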

Signed-off-by: Florian Westphal 
---
 include/net/xfrm.h |   1 +
 net/xfrm/xfrm_policy.c | 333 -
 2 files changed, 328 insertions(+), 6 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 9df6dca17155..fa4b3c877fcf 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -590,6 +590,7 @@ struct xfrm_policy {
struct xfrm_lifetime_cur curlft;
struct xfrm_policy_walk_entry walk;
struct xfrm_policy_queue polq;
+   boolbydst_reinsert;
u8  type;
u8  action;
u8  flags;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 4eb12e9b40c2..81447d5d02e6 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -49,6 +49,38 @@ struct xfrm_flo {
 /* prefixes smaller than this are stored in lists, not trees. */
 #define INEXACT_PREFIXLEN_IPV4 16
 #define INEXACT_PREFIXLEN_IPV6 48
+
+struct xfrm_pol_inexact_node {
+   struct rb_node node;
+   union {
+   xfrm_address_t addr;
+   struct rcu_head rcu;
+   };
+   u8 prefixlen;
+
+   /* the policies matching this node, can be empty list */
+   struct hlist_head hhead;
+};
+
+/* xfrm inexact policy search tree:
+ * xfrm_pol_inexact_bin = hash(dir,type,family,if_id);
+ *  |
+ * + root_d: sorted by daddr:prefix
+ * | |
+ * |xfrm_pol_inexact_node
+ * | |
+ * | +- coarse policies and all any:daddr policies
+ * |
+ * + coarse policies and all any:any policies
+ *
+ * Lookups return two candidate lists:
+ * 1. any:any list from top-level xfrm_pol_inexact_bin
+ * 2. any:daddr list from daddr tree
+ *
+ * This result set then needs to be searched for the policy with
+ * the lowest priority.  If two results have same prio, youngest one wins.
+ */
+
 struct xfrm_pol_inexact_key {
possible_net_t net;
u32 if_id;
@@ -62,12 +94,17 @@ struct xfrm_pol_inexact_bin {
/* list containing '*:*' policies */
struct hlist_head hhead;
 
+   seqcount_t count;
+   /* tree sorted by daddr/prefix */
+   struct rb_root root_d;
+
/* slow path below */
struct list_head inexact_bins;
struct rcu_head rcu;
 };
 
 enum xfrm_pol_inexact_candidate_type {
+   XFRM_POL_CAND_DADDR,
XFRM_POL_CAND_ANY,
 
XFRM_POL_CAND_MAX,
@@ -658,6 +695,8 @@ xfrm_policy_inexact_alloc_bin(const struct xfrm_policy *pol, u8 dir)
 
bin->k = k;
INIT_HLIST_HEAD(&bin->hhead);
+   bin->root_d = RB_ROOT;
+   seqcount_init(&bin->count);
 
prev = rhashtable_lookup_get_insert_key(&xfrm_policy_inexact_table,
&bin->k, &bin->head,
@@ -708,9 +747,211 @@ xfrm_policy_inexact_insert_use_any_list(const struct xfrm_policy *policy)
return saddr_any && daddr_any;
 }
 
+static void xfrm_pol_inexact_node_init(struct xfrm_pol_inexact_node *node,
+  const xfrm_address_t *addr, u8 prefixlen)
+{
+   node->addr = *addr;
+   node->prefixlen = prefixlen;
+}
+
+static struct xfrm_pol_inexact_node *
+xfrm_pol_inexact_node_alloc(const xfrm_address_t *addr, u8 prefixlen)
+{
+   struct xfrm_pol_inexact_node *node;
+
+   node = kzalloc(sizeof(*node), GFP_ATOMIC);
+   if (node)
+   xfrm_pol_inexact_node_init(node, addr, prefixlen);
+
+   return node;
+}
+
+static int xfrm_policy_addr_delta(const xfrm_address_t *a,
+ const xfrm_address_t *b,
+ u8 prefixlen, u16 family)
+{
+   unsigned int pdw, pbi;
+   int delta = 0;
+
+   switch (family) {
+   case AF_INET:
+   if (sizeof(long) == 4 && prefixlen == 0)
+   return ntohl(a->a4) - ntohl(b->a4);
+   return (ntohl(a->a4) & ((~0UL << (32 - prefixlen)))) -
+  (ntohl(b->a4) & ((~0UL << (32 - prefixlen))));
+   case AF_INET6:
+   pdw = prefixlen >> 5;
+   pbi = prefixlen & 0x1f;
+
+   if (pdw) {
+   delta = memcmp(a->a6, b->a6, pdw << 2);
+   if (delta)
+   return delta;
+   }
+   if (pbi) {
+   u32 mask = ~0u << (32 - pbi);
+
+   delta = (ntohl(a->a6[pdw]) & mask) -
+   (ntohl(b->a6[pdw]) & mask);

[PATCH ipsec-next 11/11] xfrm: policy: add 2nd-level saddr trees for inexact policies

2018-11-07 Thread Florian Westphal
This adds the fourth and final search class, containing policies
where both saddr and daddr have prefix lengths (i.e., not wildcards).

Inexact policies now end up in one of the following four search classes:

1. "Any:Any" list, containing policies where both saddr and daddr are
   wildcards or have very coarse prefixes, e.g. 10.0.0.0/8 and the like.
2. "saddr:any" list, containing policies with a fixed saddr/prefixlen,
   but without destination restrictions.
   These lists are stored in rbtree nodes; each node contains those
   policies matching saddr/prefixlen.
3. "Any:daddr" list. Similar to 2), except for policies where only the
   destinations are specified.
4. "saddr:daddr" lists, containing only those policies that
   match the given source/destination network.
   The root of the saddr/daddr nodes gets stored in the nodes of the
   'daddr' tree.

This diagram illustrates the list classes, and their
placement in the lookup hierarchy:

xfrm_pol_inexact_bin = hash(dir,type,family,if_id);
  |
  + root_d: sorted by daddr:prefix
  | |
  |xfrm_pol_inexact_node
  | |
  | +- root: sorted by saddr/prefix
  | |  |
  | | xfrm_pol_inexact_node
  | |  |
  | |  + root: unused
  | |  |
  | |  + hhead: saddr:daddr policies
  | |
  | +- coarse policies and all any:daddr policies
  |
  + root_s: sorted by saddr:prefix
  | |
  |xfrm_pol_inexact_node
  | |
  | + root: unused
  | |
  | + hhead: saddr:any policies
  |
  + coarse policies and all any:any policies

A lookup for an inexact policy returns pointers to the four relevant list
classes, after which each of the lists needs to be searched for the policy
with the best (lowest) priority; if two have the same priority, the youngest wins.

This will only speed up lookups in case we have many policies and a
sizeable portion of these have disjunct saddr/daddr addresses.
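
As a standalone illustration of that final selection step -- again a
userspace model with made-up names, simplified to one pre-matched policy per
candidate class instead of a list, not the kernel implementation -- picking
the winner across the classes reduces to:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct policy {
	uint32_t priority;   /* lower value wins */
	uint32_t pos;        /* insertion order, larger means younger */
	const char *name;
};

/* return the best policy among the candidate classes; ties in priority
 * are broken in favour of the most recently added policy
 */
static const struct policy *best_of(const struct policy *cand[], size_t n)
{
	const struct policy *best = NULL;

	for (size_t i = 0; i < n; i++) {
		const struct policy *p = cand[i];

		if (!p)
			continue;
		if (!best || p->priority < best->priority ||
		    (p->priority == best->priority && p->pos > best->pos))
			best = p;
	}
	return best;
}

int main(void)
{
	const struct policy any_any     = { 100, 1, "any:any" };
	const struct policy any_daddr   = {  50, 2, "any:daddr" };
	const struct policy saddr_daddr = {  50, 7, "saddr:daddr" };
	/* one candidate per class; NULL means that class had no match */
	const struct policy *cand[4] = { &any_any, &any_daddr, &saddr_daddr, NULL };

	printf("winner: %s\n", best_of(cand, 4)->name);
	return 0;
}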

Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 118 +
 1 file changed, 108 insertions(+), 10 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 38e33326c856..bd80b8a4322f 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -58,6 +58,8 @@ struct xfrm_pol_inexact_node {
};
u8 prefixlen;
 
+   struct rb_root root;
+
/* the policies matching this node, can be empty list */
struct hlist_head hhead;
 };
@@ -69,6 +71,14 @@ struct xfrm_pol_inexact_node {
  * | |
  * |xfrm_pol_inexact_node
  * | |
+ * | +- root: sorted by saddr/prefix
+ * | |  |
+ * | | xfrm_pol_inexact_node
+ * | |  |
+ * | |  + root: unused
+ * | |  |
+ * | |  + hhead: saddr:daddr policies
+ * | |
  * | +- coarse policies and all any:daddr policies
  * |
  * + root_s: sorted by saddr:prefix
@@ -81,10 +91,11 @@ struct xfrm_pol_inexact_node {
  * |
  * + coarse policies and all any:any policies
  *
- * Lookups return three candidate lists:
+ * Lookups return four candidate lists:
  * 1. any:any list from top-level xfrm_pol_inexact_bin
  * 2. any:daddr list from daddr tree
- * 2. saddr:any list from saddr tree
+ * 3. saddr:daddr list from 2nd level daddr tree
+ * 4. saddr:any list from saddr tree
  *
  * This result set then needs to be searched for the policy with
  * the lowest priority.  If two results have same prio, youngest one wins.
@@ -116,6 +127,7 @@ struct xfrm_pol_inexact_bin {
 };
 
 enum xfrm_pol_inexact_candidate_type {
+   XFRM_POL_CAND_BOTH,
XFRM_POL_CAND_SADDR,
XFRM_POL_CAND_DADDR,
XFRM_POL_CAND_ANY,
@@ -876,13 +888,82 @@ static void xfrm_policy_inexact_list_reinsert(struct net *net,
}
 }
 
+static void xfrm_policy_inexact_node_reinsert(struct net *net,
+ struct xfrm_pol_inexact_node *n,
+ struct rb_root *new,
+ u16 family)
+{
+   struct rb_node **p, *parent = NULL;
+   struct xfrm_pol_inexact_node *node;
+
+   /* we should not have another subtree here */
+   WARN_ON_ONCE(!RB_EMPTY_ROOT(&n->root));
+
+   p = &new->rb_node;
+   while (*p) {
+   u8 prefixlen;
+   int delta;
+
+   parent = *p;
+   node = rb_entry(*p, struct xfrm_pol_inexact_node, node);
+
+  

[PATCH ipsec-next 03/11] xfrm: policy: split list insertion into a helper

2018-11-07 Thread Florian Westphal
... so we can reuse this later without code duplication when we add a
policy to a second inexact list.

Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 43 ++
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 39d0db2a50d9..20d6815be0d7 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -740,18 +740,12 @@ static bool xfrm_policy_mark_match(struct xfrm_policy *policy,
return false;
 }
 
-int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
+static struct xfrm_policy *xfrm_policy_insert_list(struct hlist_head *chain,
+  struct xfrm_policy *policy,
+  bool excl)
 {
-   struct net *net = xp_net(policy);
-   struct xfrm_policy *pol;
-   struct xfrm_policy *delpol;
-   struct hlist_head *chain;
-   struct hlist_node *newpos;
+   struct xfrm_policy *pol, *newpos = NULL, *delpol = NULL;
 
-   spin_lock_bh(&net->xfrm.xfrm_policy_lock);
-   chain = policy_hash_bysel(net, &policy->selector, policy->family, dir);
-   delpol = NULL;
-   newpos = NULL;
hlist_for_each_entry(pol, chain, bydst) {
if (pol->type == policy->type &&
pol->if_id == policy->if_id &&
@@ -759,24 +753,41 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
xfrm_policy_mark_match(policy, pol) &&
xfrm_sec_ctx_match(pol->security, policy->security) &&
!WARN_ON(delpol)) {
-   if (excl) {
-   spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
-   return -EEXIST;
-   }
+   if (excl)
+   return ERR_PTR(-EEXIST);
delpol = pol;
if (policy->priority > pol->priority)
continue;
} else if (policy->priority >= pol->priority) {
-   newpos = &pol->bydst;
+   newpos = pol;
continue;
}
if (delpol)
break;
}
if (newpos)
-   hlist_add_behind_rcu(&policy->bydst, newpos);
+   hlist_add_behind_rcu(&policy->bydst, &newpos->bydst);
else
hlist_add_head_rcu(&policy->bydst, chain);
+
+   return delpol;
+}
+
+int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
+{
+   struct net *net = xp_net(policy);
+   struct xfrm_policy *delpol;
+   struct hlist_head *chain;
+
+   spin_lock_bh(&net->xfrm.xfrm_policy_lock);
+   chain = policy_hash_bysel(net, &policy->selector, policy->family, dir);
+   delpol = xfrm_policy_insert_list(chain, policy, excl);
+
+   if (IS_ERR(delpol)) {
+   spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
+   return PTR_ERR(delpol);
+   }
+
__xfrm_policy_link(policy, dir);
 
/* After previous checking, family can either be AF_INET or AF_INET6 */
-- 
2.18.1



[PATCH ipsec-next 02/11] xfrm: security: iterate all, not inexact lists

2018-11-07 Thread Florian Westphal
currently all non-socket policies are either hashed in the dst table,
or placed on the 'inexact list'.  When flushing, we first walk the
table, then the (per-direction) inexact lists.

When we try to get rid of the single inexact list in favour of having "n"
inexact lists (e.g. per-af inexact lists, or lists sorted into a tree), this
walk would become more complicated.

Simplify this: walk the 'all' list and skip socket policies during
traversal so we don't need to handle exact and inexact policies
separately anymore.

Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 93 --
 1 file changed, 26 insertions(+), 67 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 119a427d9b2b..39d0db2a50d9 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -892,36 +892,19 @@ EXPORT_SYMBOL(xfrm_policy_byid);
 static inline int
 xfrm_policy_flush_secctx_check(struct net *net, u8 type, bool task_valid)
 {
-   int dir, err = 0;
+   struct xfrm_policy *pol;
+   int err = 0;
 
-   for (dir = 0; dir < XFRM_POLICY_MAX; dir++) {
-   struct xfrm_policy *pol;
-   int i;
+   list_for_each_entry(pol, &net->xfrm.policy_all, walk.all) {
+   if (pol->walk.dead ||
+   xfrm_policy_id2dir(pol->index) >= XFRM_POLICY_MAX ||
+   pol->type != type)
+   continue;
 
-   hlist_for_each_entry(pol,
-   &net->xfrm.policy_inexact[dir], bydst) {
-   if (pol->type != type)
-   continue;
-   err = security_xfrm_policy_delete(pol->security);
-   if (err) {
-   xfrm_audit_policy_delete(pol, 0, task_valid);
-   return err;
-   }
-   }
-   for (i = net->xfrm.policy_bydst[dir].hmask; i >= 0; i--) {
-   hlist_for_each_entry(pol,
-   net->xfrm.policy_bydst[dir].table + i,
-   bydst) {
-   if (pol->type != type)
-   continue;
-   err = security_xfrm_policy_delete(
-   pol->security);
-   if (err) {
-   xfrm_audit_policy_delete(pol, 0,
-task_valid);
-   return err;
-   }
-   }
+   err = security_xfrm_policy_delete(pol->security);
+   if (err) {
+   xfrm_audit_policy_delete(pol, 0, task_valid);
+   return err;
}
}
return err;
@@ -937,6 +920,7 @@ xfrm_policy_flush_secctx_check(struct net *net, u8 type, bool task_valid)
 int xfrm_policy_flush(struct net *net, u8 type, bool task_valid)
 {
int dir, err = 0, cnt = 0;
+   struct xfrm_policy *pol;
 
spin_lock_bh(&net->xfrm.xfrm_policy_lock);
 
@@ -944,46 +928,21 @@ int xfrm_policy_flush(struct net *net, u8 type, bool task_valid)
if (err)
goto out;
 
-   for (dir = 0; dir < XFRM_POLICY_MAX; dir++) {
-   struct xfrm_policy *pol;
-   int i;
-
-   again1:
-   hlist_for_each_entry(pol,
-   &net->xfrm.policy_inexact[dir], bydst) {
-   if (pol->type != type)
-   continue;
-   __xfrm_policy_unlink(pol, dir);
-   spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
-   cnt++;
-
-   xfrm_audit_policy_delete(pol, 1, task_valid);
-
-   xfrm_policy_kill(pol);
-
-   spin_lock_bh(&net->xfrm.xfrm_policy_lock);
-   goto again1;
-   }
-
-   for (i = net->xfrm.policy_bydst[dir].hmask; i >= 0; i--) {
-   again2:
-   hlist_for_each_entry(pol,
-   net->xfrm.policy_bydst[dir].table + i,
-   bydst) {
-   if (pol->type != type)
-   continue;
-   __xfrm_policy_unlink(pol, dir);
-   spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
-   cnt++;
-
-   xfrm_audit_policy_delete(pol, 1, task_valid);
-   xfrm_policy_kill(pol);
-
-   spin_lock_bh(&net->xfrm.xfrm_policy_lock);
-   

Re: Attribute failed policy validation

2018-10-24 Thread Florian Westphal
Marco Berizzi  wrote:

[ CC David ]

> root@Calimero:~# tc qdisc add dev eth0 root handle 1:0 hfsc default 1
> Error: Attribute failed policy validation.

caused by:
commit 8b4c3cdd9dd8290343ce959a132d3b334062c5b9
net: sched: Add policy validation for tc attributes

Looks like TCA_OPTIONS is NLA_BINARY in hfsc case, so
we can't enforce NLA_NESTED :-(



Re: crash in xt_policy due to skb_dst_drop() in nf_ct_frag6_gather()

2018-10-16 Thread Florian Westphal
Maciej Żenczykowski  wrote:

I am currently travelling and not able to investigate
until next week.

> commit ad8b1ffc3efae2f65080bdb11145c87d299b8f9a
> Author: Florian Westphal 
> netfilter: ipv6: nf_defrag: drop skb dst before queueing
> 
> +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
> @@ -618,6 +618,8 @@ int nf_ct_frag6_gather(struct net *net, struct
> sk_buff *skb, u32 user)
> fq->q.meat == fq->q.len &&
> nf_ct_frag6_reasm(fq, skb, dev))
> ret = 0;
> +   else
> +   skb_dst_drop(skb);

This is only supposed to drop dst of skbs that are enqueued,
i.e. frag6_gather returns NF_STOLEN.

In case the skb completes the queue, then that skb's dst_entry
is supposed to be kept, so skb_dst() does NOT return NULL.

It's not supposed to be any different than ipv4 defrag.

> const struct dst_entry *dst = skb_dst(skb); // returns NULL

That is not supposed to happen.



[PATCH ipsec] xfrm: policy: use hlist rcu variants on insert

2018-10-10 Thread Florian Westphal
bydst table/list lookups use rcu, so insertions must use rcu versions.

Fixes: a7c44247f704e ("xfrm: policy: make xfrm_policy_lookup_bytype lockless")
Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 3110c3fbee20..3cf1fd7c0869 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -632,9 +632,9 @@ static void xfrm_hash_rebuild(struct work_struct *work)
break;
}
if (newpos)
-   hlist_add_behind(&policy->bydst, newpos);
+   hlist_add_behind_rcu(&policy->bydst, newpos);
else
-   hlist_add_head(&policy->bydst, chain);
+   hlist_add_head_rcu(&policy->bydst, chain);
}
 
spin_unlock_bh(>xfrm.xfrm_policy_lock);
@@ -774,9 +774,9 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, 
int excl)
break;
}
if (newpos)
-   hlist_add_behind(&policy->bydst, newpos);
+   hlist_add_behind_rcu(&policy->bydst, newpos);
else
-   hlist_add_head(&policy->bydst, chain);
+   hlist_add_head_rcu(&policy->bydst, chain);
__xfrm_policy_link(policy, dir);
 
/* After previous checking, family can either be AF_INET or AF_INET6 */
-- 
2.18.0



Re: [PATCH ipsec] net: xfrm: pass constant family to nf_hook function

2018-09-21 Thread Florian Westphal
David Ahern  wrote:
> On 9/21/18 8:55 AM, Florian Westphal wrote:
> > David Ahern  wrote:
> >>>  David, i hope this will silence the warning, would be nice
> >>>  if you could test it.
> >>
> >> I still the warning.
> > 
> > Wait.  Do you see this warning everywhere or just in xfrm?
> > 
> 
> just the one file.

Sigh, ok, I will set up a vm and deal with this somehow.

Steffen, please toss this patch, sorry for the noise.


Re: [PATCH ipsec] net: xfrm: pass constant family to nf_hook function

2018-09-21 Thread Florian Westphal
David Ahern  wrote:
> >  David, i hope this will silence the warning, would be nice
> >  if you could test it.
> 
> I still the warning.

Wait.  Do you see this warning everywhere or just in xfrm?


[PATCH ipsec] net: xfrm: pass constant family to nf_hook function

2018-09-21 Thread Florian Westphal
Unfortunately some versions of gcc emit the following warning:
  linux/compiler.h:252:20: warning: array subscript is above array bounds [-Warray-bounds]
  hook_head = rcu_dereference(net->nf.hooks_arp[hook]);
  ^
xfrm_output_resume() passes a non-constant 'pf' argument, so the compiler
can't know that the affected statement above will never be executed (we only
pass either NFPROTO_IPV4 or NFPROTO_IPV6).  This change makes that
explicit.

Another solution would be to increase hooks_arp[] size, but that
increases struct net size needlessly.

Reported-by: David Ahern 
Signed-off-by: Florian Westphal 
---
 David, i hope this will silence the warning, would be nice
 if you could test it.

 I don't really like this patch, but I see no better solution
 except a needless increase of hooks_arp[].

 Any other idea?

 net/xfrm/xfrm_output.c | 23 ++-
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 45ba07ab3e4f..199c0e782ac7 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -152,11 +152,24 @@ int xfrm_output_resume(struct sk_buff *skb, int err)
if (!skb_dst(skb)->xfrm)
return dst_output(net, skb->sk, skb);
 
-   err = nf_hook(skb_dst(skb)->ops->family,
- NF_INET_POST_ROUTING, net, skb->sk, skb,
- NULL, skb_dst(skb)->dev, xfrm_output2);
-   if (unlikely(err != 1))
-   goto out;
+   switch (skb_dst(skb)->ops->family) {
+   case AF_INET:
+   err = nf_hook(NFPROTO_IPV4,
+ NF_INET_POST_ROUTING, net, skb->sk, skb,
+ NULL, skb_dst(skb)->dev, xfrm_output2);
+   if (unlikely(err != 1))
+   goto out;
+   break;
+   case AF_INET6:
+   err = nf_hook(NFPROTO_IPV6,
+ NF_INET_POST_ROUTING, net, skb->sk, skb,
+ NULL, skb_dst(skb)->dev, xfrm_output2);
+   if (unlikely(err != 1))
+   goto out;
+   break;
+   default:
+   break;
+   }
}
 
if (err == -EINPROGRESS)
-- 
2.16.4



Re: array bounds warning in xfrm_output_resume

2018-09-20 Thread Florian Westphal
David Ahern  wrote:
> > $ make O=kbuild/perf -j 24 -s
> > In file included from /home/dsa/kernel-3.git/include/linux/kernel.h:10:0,
> >  from /home/dsa/kernel-3.git/include/linux/list.h:9,
> >  from /home/dsa/kernel-3.git/include/linux/module.h:9,
> >  from /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c:13:
> > /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c: In function
> > ‘xfrm_output_resume’:
> > /home/dsa/kernel-3.git/include/linux/compiler.h:252:20: warning: array
> > subscript is above array bounds [-Warray-bounds]
> >__read_once_size(&(x), __u.__c, sizeof(x));  \

Does this thing avoid the warning?

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -143,6 +143,8 @@ int xfrm_output_resume(struct sk_buff *skb, int err)
struct net *net = xs_net(skb_dst(skb)->xfrm);
 
while (likely((err = xfrm_output_one(skb, err)) == 0)) {
+   unsigned short pf;
+
nf_reset(skb);
 
err = skb_dst(skb)->ops->local_out(net, skb->sk, skb);
@@ -152,11 +154,19 @@ int xfrm_output_resume(struct sk_buff *skb, int err)
if (!skb_dst(skb)->xfrm)
return dst_output(net, skb->sk, skb);
 
-   err = nf_hook(skb_dst(skb)->ops->family,
- NF_INET_POST_ROUTING, net, skb->sk, skb,
- NULL, skb_dst(skb)->dev, xfrm_output2);
-   if (unlikely(err != 1))
-   goto out;
+   pf = skb_dst(skb)->ops->family;
+   switch (pf) {
+   case AF_INET: /* fallthrough */
+   case AF_INET6:
+   err = nf_hook(pf,
+ NF_INET_POST_ROUTING, net, skb->sk, skb,
+ NULL, skb_dst(skb)->dev, xfrm_output2);
+   if (unlikely(err != 1))
+   goto out;
+   break;
+   default:
+   break;
+   }
}
 
if (err == -EINPROGRESS)


Re: array bounds warning in xfrm_output_resume

2018-09-19 Thread Florian Westphal
David Ahern  wrote:
> > I believe ef57170bbfdd6 is the commit that introduced the warning
> > 
> 
> Hi Florian:
> 
> I am still seeing this. I realize the compiler is not understanding the
> conditions properly, but it is a distraction every time I do a kernel
> build on Debian Stretch having to filter the above noise from whether I
> introduce another warning.

Sorry, I forgot about this.  I'll look into it tomorrow.


[PATCH net] tcp: do not restart timewait timer on rst reception

2018-08-30 Thread Florian Westphal
RFC 1337 says:
 ''Ignore RST segments in TIME-WAIT state.
   If the 2 minute MSL is enforced, this fix avoids all three hazards.''

So with net.ipv4.tcp_rfc1337=1, expected behaviour is to have TIME-WAIT sk
expire rather than removing it instantly when a reset is received.

However, Linux will also re-start the TIME-WAIT timer.

This causes connect to fail when trying to re-use ports, or very long
delays (until the syn retry interval exceeds MSL).

packetdrill test case:
// Demonstrate bogus rearming of TIME-WAIT timer in rfc1337 mode.
`sysctl net.ipv4.tcp_rfc1337=1`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 29200 
0.100 > S. 0:0(0) ack 1 
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

// Receive first segment
0.310 < P. 1:1001(1000) ack 1 win 46

// Send one ACK
0.310 > . 1:1(0) ack 1001

// read 1000 byte
0.310 read(4, ..., 1000) = 1000

// Application writes 100 bytes
0.350 write(4, ..., 100) = 100
0.350 > P. 1:101(100) ack 1001

// ACK
0.500 < . 1001:1001(0) ack 101 win 257

// close the connection
0.600 close(4) = 0
0.600 > F. 101:101(0) ack 1001 win 244

// Our side is in FIN_WAIT_1 & waits for ack to fin
0.7 < . 1001:1001(0) ack 102 win 244

// Our side is in FIN_WAIT_2 with no outstanding data.
0.8 < F. 1001:1001(0) ack 102 win 244
0.8 > . 102:102(0) ack 1002 win 244

// Our side is now in TIME_WAIT state, send ack for fin.
0.9 < F. 1002:1002(0) ack 102 win 244
0.9 > . 102:102(0) ack 1002 win 244

// Peer reopens with in-window SYN:
1.000 < S 1000:1000(0) win 9200 

// Therefore, reply with ACK.
1.000 > . 102:102(0) ack 1002 win 244

// Peer sends RST for this ACK.  Normally this RST results
// in tw socket removal, but rfc1337=1 setting prevents this.
1.100 < R 1002:1002(0) win 244

// second syn. Due to rfc1337=1 expect another pure ACK.
31.0 < S 1000:1000(0) win 9200 
31.0 > . 102:102(0) ack 1002 win 244

// .. and another RST from peer.
31.1 < R 1002:1002(0) win 244
31.2 `echo no timer restart;ss -m -e -a -i -n -t -o state TIME-WAIT`

// third syn after one minute.  Time-Wait socket should have expired by now.
63.0 < S 1000:1000(0) win 9200 

// so we expect a syn-ack & 3whs to proceed from here on.
63.0 > S. 0:0(0) ack 1 

Without this patch, 'ss' shows restarts of tw timer and last packet is
thus just another pure ack, more than one minute later.

This restores the original code from commit 283fd6cf0be690a83
("Merge in ANK networking jumbo patch") in netdev-vger-cvs.git .

For some reason the else branch was removed/lost in 1f28b683339f7
("Merge in TCP/UDP optimizations and [..]") and timer restart became
unconditional.

Reported-by: Michal Tesar 
Signed-off-by: Florian Westphal 
---
 net/ipv4/tcp_minisocks.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 1dda1341a223..b690132f5da2 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -184,8 +184,9 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
inet_twsk_deschedule_put(tw);
return TCP_TW_SUCCESS;
}
+   } else {
+   inet_twsk_reschedule(tw, TCP_TIMEWAIT_LEN);
}
-   inet_twsk_reschedule(tw, TCP_TIMEWAIT_LEN);
 
if (tmp_opt.saw_tstamp) {
tcptw->tw_ts_recent   = tmp_opt.rcv_tsval;
-- 
2.16.4



Re: Linux kernel error stack

2018-08-05 Thread Florian Westphal
Michal Kubecek  wrote:
> On Mon, Aug 06, 2018 at 01:15:37AM +0200, Florian Westphal wrote:
> > Michal Kubecek  wrote:
> > > Oops, exactly this issue was already discussed almost a year ago:
> > > 
> > >   http://lkml.kernel.org/r/20170824104824.2c318a0...@unicorn.suse.cz
> > > 
> > > But something more urgent came and I forgot to get back to it. :-(
> > 
> > I did not even remeber, thanks for the pointer.
> > So I think best course of action is to update man page to clearly
> > say this only works in postrouting and with udp, and is ONLY
> > intended for working around old dhcp software.
> 
> As GSO for UDP is on its way to mainline, one might get into trouble
> even with UDP if the rule is not specific enough.

Yes, we still need a fix to ignore GSO too.


Re: Linux kernel error stack

2018-08-05 Thread Florian Westphal
Michal Kubecek  wrote:
> Oops, exactly this issue was already discussed almost a year ago:
> 
>   http://lkml.kernel.org/r/20170824104824.2c318a0...@unicorn.suse.cz
> 
> But something more urgent came and I forgot to get back to it. :-(

I did not even remember, thanks for the pointer.
So I think best course of action is to update man page to clearly
say this only works in postrouting and with udp, and is ONLY
intended for working around old dhcp software.

From the kernel side, have checkentry emit a warning
when this is used for protocols other than udp,
and make this a no-op for everything else.

Same for postrouting: warn from checkentry if the hook is something
else, and have target function not do anything for !postrouting.

Does that make sense?

I can work on a patch later this week if needed.
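
Roughly along these lines, perhaps (completely untested sketch, IPv4 side
only and with struct/field names from memory; a real patch would also need
the equivalent ip6t_ip6/IP6T_F_PROTO handling):

static int checksum_tg_check(const struct xt_tgchk_param *par)
{
	const struct xt_CHECKSUM_info *einfo = par->targinfo;
	const struct ipt_ip *i4 = par->entryinfo;

	if (einfo->operation & ~XT_CHECKSUM_OP_FILL)
		return -EINVAL;

	/* only known legitimate use: -p udp in POSTROUTING, for old dhcp clients */
	if (par->family == NFPROTO_IPV4 &&
	    (i4->proto != IPPROTO_UDP || (i4->invflags & XT_INV_PROTO) ||
	     (par->hook_mask & ~(1 << NF_INET_POST_ROUTING))))
		pr_warn_once("CHECKSUM only makes sense for UDP in POSTROUTING (dhcp workaround)\n");

	return 0;
}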


Re: Linux kernel error stack

2018-08-05 Thread Florian Westphal
Satish Patel  wrote:
> Florian,
> 
> It seems those rules coming from here
> https://github.com/openstack/openstack-ansible-os_neutron/blob/master/files/post-up-metadata-checksum

This is crazy, and, as you found, it doesn't even do what they seem to
think it does.

I see no reason for these rules at all.


Re: Linux kernel error stack

2018-08-05 Thread Florian Westphal
Satish Patel  wrote:
> After reading further related DHCP checksum issue, it seems we need
> that rules when you running DHCP on same host machine where your guest
> using host DHCP service, in that case virtual nic won't do checksum.
> If your DHCP running on different host then your physical nic perform
> checksum.

Partially right, it's only needed for older dhcp software that relies
on packet-header checksum correctness (the metadata contains info that the
checksum will be filled in by the nic later; modern servers will
use that and ignore the packet-header checksums in this case).

If you remove the rule and your VMs still get leases via DHCP then
the CHECKSUM workaround isn't needed.


Re: Linux kernel error stack

2018-08-05 Thread Florian Westphal
Satish Patel  wrote:
> > [84166:59495417] -A POSTROUTING -p tcp -m tcp --sport 80 -j CHECKSUM
> > --checksum-fill
> > [68739:5153476] -A POSTROUTING -p tcp -m tcp --sport 8000 -j CHECKSUM
> > --checksum-fill

These rules make no sense to me, and are also the source of your backtrace.
Who set this up?

If this is coming from openstack, I suggest asking openstack developers
WTH this is supposed to do.

> > [755:275452] -A POSTROUTING -s 10.0.3.0/24 -o lxcbr0 -p udp -m udp
> > --dport 68 -j CHECKSUM --checksum-fill

This was needed to work around dhcpd issues w. checksum offloading but I
guess that DHCP will work fine without this rule too nowadays.

So I suggest you simply get rid of these rules.


Re: Linux kernel error stack

2018-08-05 Thread Florian Westphal
Satish Patel  wrote:
> Thanks Florian,
> 
> FYI, I don't have any CHECKSUM configure in my iptables,

You have, according to the WARN stacktrace you provided.

iptables-save -c
ip6tables-save -c


Re: Linux kernel error stack

2018-08-05 Thread Florian Westphal
Satish Patel  wrote:
> I am installing openstack and as you know i have lots of bridges and
> vlan interface on my Linux CentOS 7.5
> 
> I was getting following error stack on 3.10 kernel and found this is
> kernel bug which required kernel upgrade so now i have upgraded my
> kernel to 4.17.12 but i am still seeing same kernel stack error on my
> dmesg
> 
> I have disable TSO, LRO, SG & GSO on my NIC but still getting error
> just wanted to understand what is this and why it popping up

Get rid of CHECKSUM target in the iptables rules.
This thing was added 8 years ago to work around dhcp bugs, I don't
think its use is needed anymore.
Try removing it and see that all VMs can still retrieve IP address
via DHCP.

I'm curious as to the rules; normally the CHECKSUM target should be
limited to -p udp --dport bootp.  It's a bad idea to feed it normal
packets, and it's expensive to do this in software rather than have the
device do the checksumming.

As for a fix, I'm tempted to send a patch to make the CHECKSUM target
eval a no-op & add a deprecation warning on init...

Other "fix" is to

diff --git a/net/netfilter/xt_CHECKSUM.c b/net/netfilter/xt_CHECKSUM.c
index 9f4151ec3e06..23a17dda604d 100644
--- a/net/netfilter/xt_CHECKSUM.c
+++ b/net/netfilter/xt_CHECKSUM.c
@@ -25,8 +25,12 @@ MODULE_ALIAS("ip6t_CHECKSUM");
 static unsigned int
 checksum_tg(struct sk_buff *skb, const struct xt_action_param *par)
 {
-   if (skb->ip_summed == CHECKSUM_PARTIAL)
-   skb_checksum_help(skb);
+   if (skb->ip_summed == CHECKSUM_PARTIAL) {
+   if (skb_shinfo(skb)->gso_size)
+   skb->ip_summed = CHECKSUM_NONE;
+   else
+   skb_checksum_help(skb);
+   }
 
return XT_CONTINUE;
 }

(unfortunately, the target isn't restricted to POSTROUTING, sigh).


[PATCH net-next v3] ipv6: defrag: drop non-last frags smaller than min mtu

2018-08-02 Thread Florian Westphal
don't bother with pathological cases, they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).

v3: don't use awkward "-offset + len"
v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
There were concerns that there could be even smaller frags
generated by intermediate nodes, e.g. on radio networks.

Cc: Peter Oskolkov 
Cc: Eric Dumazet 
Signed-off-by: Florian Westphal 
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 4 
 net/ipv6/reassembly.c   | 4 
 2 files changed, 8 insertions(+)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 0610bdab721c..aed9766559c9 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -557,6 +557,10 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff 
*skb, u32 user)
hdr = ipv6_hdr(skb);
fhdr = (struct frag_hdr *)skb_transport_header(skb);
 
+   if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   return -EINVAL;
+
skb_orphan(skb);
fq = fq_find(net, fhdr->identification, user, hdr,
 skb->dev ? skb->dev->ifindex : 0);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 6edd2ac8ae4b..a976b143463b 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -455,6 +455,10 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
return 1;
}
 
+   if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   goto fail_hdr;
+
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
-- 
2.16.4



[PATCH nf-next v2 1/1] ipv6: defrag: drop non-last frags smaller than min mtu

2018-08-02 Thread Florian Westphal
don't bother with pathological cases, they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).

v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
There were concerns that there could be even smaller frags
generated by intermediate nodes, e.g. on radio networks.

Cc: Peter Oskolkov 
Cc: Eric Dumazet 
Signed-off-by: Florian Westphal 
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 4 
 net/ipv6/reassembly.c   | 4 
 2 files changed, 8 insertions(+)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 0610bdab721c..c121d534d321 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -557,6 +557,10 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff 
*skb, u32 user)
hdr = ipv6_hdr(skb);
fhdr = (struct frag_hdr *)skb_transport_header(skb);
 
+   if (-skb_network_offset(skb) + skb->len < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   return -EINVAL;
+
skb_orphan(skb);
fq = fq_find(net, fhdr->identification, user, hdr,
 skb->dev ? skb->dev->ifindex : 0);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 6edd2ac8ae4b..ff00ada6128f 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -455,6 +455,10 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
return 1;
}
 
+   if (-skb_network_offset(skb) + skb->len < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   goto fail_hdr;
+
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
-- 
2.16.4



[PATCH net-next] inet: defrag: drop non-last frags smaller than min mtu

2018-08-02 Thread Florian Westphal
don't bother with pathological cases, they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).

For IPv4, in practice, we could probably also adopt a higher limit,
but for now use ipv4 min mtu (68).

Cc: Peter Oskolkov 
Cc: Eric Dumazet 
Signed-off-by: Florian Westphal 
---
 net/ipv4/ip_fragment.c  | 5 +
 net/ipv6/netfilter/nf_conntrack_reasm.c | 4 
 net/ipv6/reassembly.c   | 4 
 3 files changed, 13 insertions(+)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 8e9528ebaa8e..19aa10abc6ab 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -605,6 +605,10 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 
user)
int vif = l3mdev_master_ifindex_rcu(dev);
struct ipq *qp;
 
+   if (-skb_network_offset(skb) + skb->len < IPV4_MIN_MTU &&
+   ip_hdr(skb)->frag_off & htons(IP_MF))
+   goto drop;
+
__IP_INC_STATS(net, IPSTATS_MIB_REASMREQDS);
skb_orphan(skb);
 
@@ -622,6 +626,7 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 
user)
return ret;
}
 
+drop:
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
kfree_skb(skb);
return -ENOMEM;
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 0610bdab721c..c121d534d321 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -557,6 +557,10 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff 
*skb, u32 user)
hdr = ipv6_hdr(skb);
fhdr = (struct frag_hdr *)skb_transport_header(skb);
 
+   if (-skb_network_offset(skb) + skb->len < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   return -EINVAL;
+
skb_orphan(skb);
fq = fq_find(net, fhdr->identification, user, hdr,
 skb->dev ? skb->dev->ifindex : 0);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 6edd2ac8ae4b..ff00ada6128f 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -455,6 +455,10 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
return 1;
}
 
+   if (-skb_network_offset(skb) + skb->len < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   goto fail_hdr;
+
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
-- 
2.16.4



Re: [PATCH net] inet: frag: enforce memory limits earlier

2018-07-31 Thread Florian Westphal
Jann Horn  wrote:
> On Tue, Jul 31, 2018 at 7:54 AM Florian Westphal  wrote:
> >
> > Eric Dumazet  wrote:
> > > We currently check current frags memory usage only when
> > > a new frag queue is created. This allows attackers to first
> > > consume the memory budget (default : 4 MB) creating thousands
> > > of frag queues, then sending tiny skbs to exceed high_thresh
> > > limit by 2 to 3 order of magnitude.
> > >
> > > Note that before commit 648700f76b03 ("inet: frags: use rhashtables
> > > for reassembly units"), work queue could be starved under DOS,
> > > getting no cpu cycles.
> > > After commit 648700f76b03, only the per frag queue timer can eventually
> > > remove an incomplete frag queue and its skbs.
> >
> > I'm not sure this is a good idea.
> >
> > This can now prevent "good" queue from completing just because attacker
> > is sending garbage.
> 
> There is only a limited amount of memory available to store fragments.
> If you receive lots of fragments that don't form complete packets,
> you'll have to drop some packets. I don't see why it matters whether
> incoming garbage only prevents the creation of new queues or also the
> completion of existing queues.

Agreed.  Objection withdrawn.

Acked-by: Florian Westphal 


Re: [PATCH net] inet: frag: enforce memory limits earlier

2018-07-30 Thread Florian Westphal
Eric Dumazet  wrote:
> We currently check current frags memory usage only when
> a new frag queue is created. This allows attackers to first
> consume the memory budget (default : 4 MB) creating thousands
> of frag queues, then sending tiny skbs to exceed high_thresh
> limit by 2 to 3 order of magnitude.
> 
> Note that before commit 648700f76b03 ("inet: frags: use rhashtables
> for reassembly units"), work queue could be starved under DOS,
> getting no cpu cycles.
> After commit 648700f76b03, only the per frag queue timer can eventually
> remove an incomplete frag queue and its skbs.

I'm not sure this is a good idea.

This can now prevent a "good" queue from completing just because an attacker
is sending garbage.


[PATCH net-next] netlink: do not store start function in netlink_cb

2018-07-24 Thread Florian Westphal
->start() is called once when dump is being initialized, there is no
need to store it in netlink_cb.

Signed-off-by: Florian Westphal 
---
 include/linux/netlink.h  | 1 -
 net/netlink/af_netlink.c | 5 ++---
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index f3075d6c7e82..71f121b66ca8 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -170,7 +170,6 @@ netlink_skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 struct netlink_callback {
struct sk_buff  *skb;
const struct nlmsghdr   *nlh;
-   int (*start)(struct netlink_callback *);
int (*dump)(struct sk_buff * skb,
struct netlink_callback *cb);
int (*done)(struct netlink_callback *cb);
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 393573a99a5a..f6ac7693d2cc 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2300,7 +2300,6 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 
cb = &nlk->cb;
memset(cb, 0, sizeof(*cb));
-   cb->start = control->start;
cb->dump = control->dump;
cb->done = control->done;
cb->nlh = nlh;
@@ -2309,8 +2308,8 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
cb->min_dump_alloc = control->min_dump_alloc;
cb->skb = skb;
 
-   if (cb->start) {
-   ret = cb->start(cb);
+   if (control->start) {
+   ret = control->start(cb);
if (ret)
goto error_put;
}
-- 
2.16.4



Re: [PATCH] netfilter: avoid stalls in nf_ct_alloc_hashtable

2018-07-24 Thread Florian Westphal
Li RongQing  wrote:
> when system forks a process with CLONE_NEWNET flag under the
> high memory pressure, it will trigger memory reclaim and stall
> for a long time because nf_ct_alloc_hashtable need to allocate
> high-order memory at that time. The calltrace as below:

>   nf_ct_alloc_hashtable
>   nf_conntrack_init_net

This call trace is from a kernel < 4.7.

commit 56d52d4892d0e478a005b99ed10d0a7f488ea8c1
netfilter: conntrack: use a single hashtable for all namespaces

removed per-netns hash table.


[PATCH net] atl1c: reserve min skb headroom

2018-07-20 Thread Florian Westphal
Got crash report with following backtrace:
BUG: unable to handle kernel paging request at 8801869daffe
RIP: 0010:[]  [] 
ip6_finish_output2+0x394/0x4c0
RSP: 0018:880186c83a98  EFLAGS: 00010283
RAX: 8801869db00e ...
  [] ip6_finish_output+0x8c/0xf0
  [] ip6_output+0x57/0x100
  [] ip6_forward+0x4b9/0x840
  [] ip6_rcv_finish+0x66/0xc0
  [] ipv6_rcv+0x319/0x530
  [] netif_receive_skb+0x1c/0x70
  [] atl1c_clean+0x1ec/0x310 [atl1c]
  ...

The bad access is in neigh_hh_output(), at skb->data - 16 (HH_DATA_MOD).
The atl1c driver provided an skb with no headroom, so 14 bytes (ethernet
header) got pulled, but then 16 are copied.

Reserve NET_SKB_PAD bytes headroom, like netdev_alloc_skb().

Compile tested only; I lack hardware.

Fixes: 7b7017642199 ("atl1c: Fix misuse of netdev_alloc_skb in refilling rx 
ring")
Signed-off-by: Florian Westphal 
---
diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c 
b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
index 94270f654b3b..7087b88550db 100644
--- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
+++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
@@ -1686,6 +1686,7 @@ static struct sk_buff *atl1c_alloc_skb(struct 
atl1c_adapter *adapter)
skb = build_skb(page_address(page) + adapter->rx_page_offset,
adapter->rx_frag_size);
if (likely(skb)) {
+   skb_reserve(skb, NET_SKB_PAD);
adapter->rx_page_offset += adapter->rx_frag_size;
if (adapter->rx_page_offset >= PAGE_SIZE)
adapter->rx_page = NULL;
-- 
2.16.4



Re: [PATCH v4 net-next 7/9] net: ipv4: listified version of ip_rcv

2018-07-03 Thread Florian Westphal
Pablo Neira Ayuso  wrote:
> On Mon, Jul 02, 2018 at 04:14:12PM +0100, Edward Cree wrote:
> > Also involved adding a way to run a netfilter hook over a list of packets.
> >  Rather than attempting to make netfilter know about lists (which would be
> >  a major project in itself) we just let it call the regular okfn (in this
> >  case ip_rcv_finish()) for any packets it steals, and have it give us back
> >  a list of packets it's synchronously accepted (which normally NF_HOOK
> >  would automatically call okfn() on, but we want to be able to potentially
> >  pass the list to a listified version of okfn().)
> > The netfilter hooks themselves are indirect calls that still happen per-
> >  packet (see nf_hook_entry_hookfn()), but again, changing that can be left
> >  for future work.
> > 
> > There is potential for out-of-order receives if the netfilter hook ends up
> >  synchronously stealing packets, as they will be processed before any
> >  accepts earlier in the list.  However, it was already possible for an
> >  asynchronous accept to cause out-of-order receives, so presumably this is
> >  considered OK.
> 
> I think we can simplify things if these chained packets don't follow
> the standard forwarding path, this would require to revisit many
> subsystems to handle these new chained packets - potentially a lot of
> work and likely breaking many things - and I would expect we (and
> other subsystems too) will not get very much benefits from these
> chained packets.

I guess it depends on what type of 'bundles' to expect on the wire.
For an average netfilter ruleset it will require more unbundling
because we might have

ipv4/udp -> ipv4/tcp -> ipv6/udp -> ipv4/tcp -> ipv4/udp

From Ed's patch set, this would be rebundled to

ipv4/udp -> ipv4/tcp -> ipv4/tcp -> ipv4/udp
and
ipv6/udp

nf ipv4 rule eval has to untangle again, to
ipv4/udp -> ipv4/udp
ipv4/tcp -> ipv4/tcp

HOWEVER, there are hooks that are L4 agnostic, such as defragmentation,
so we might still be able to take advantage.

Defrag is extra silly at the moment, we eat the indirect call cost
only to return NF_ACCEPT without doing anything in 99.99% of all cases.

So I still think there is value in exploring passing the bundle
into the nf core, via a new nf_hook_slow_list() that can be used
from the forward path (ingress, prerouting, input, forward, postrouting).

We can make this step-by-step, first splitting everything in
nf_hook_slow_list() and just calling the hook fns for each skb in the list.
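
A rough, untested sketch of what such a helper could look like on top of
Ed's skb list infrastructure (assuming the current 4-argument
nf_hook_slow() prototype):

void nf_hook_slow_list(struct list_head *head, struct nf_hook_state *state,
		       const struct nf_hook_entries *e)
{
	struct sk_buff *skb, *next;
	struct list_head sublist;

	INIT_LIST_HEAD(&sublist);

	list_for_each_entry_safe(skb, next, head, list) {
		list_del(&skb->list);
		/* != 1 means the hook dropped, stole or queued the skb */
		if (nf_hook_slow(skb, state, e, 0) == 1)
			list_add_tail(&skb->list, &sublist);
	}

	/* hand only the synchronously accepted packets back to the caller */
	list_splice(&sublist, head);
}

The caller would then feed the surviving list to the (listified) okfn;
everything else has already been consumed by the hooks.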

> In general I like this infrastructure, but I think we can get
> something simpler if we combine it with the flowtable idea, so chained
> packets follow the non-standard flowtable forwarding path as described
> in [1].

Yes, in case all packets are in the flow table (offload software
fallback) we might be able to keep the list as-is, provided everything has
the same nexthop.

However, I don't see any need to make changes to this series for this
now, it can be added on top.

Did I miss anything?

I do agree that from netfilter p.o.v. ingress hook looks like a good
initial candidate for a listification.


Re: [RFC PATCH v2 net-next 07/12] net: ipv4: listified version of ip_rcv

2018-06-27 Thread Florian Westphal
Edward Cree  wrote:
> Also involved adding a way to run a netfilter hook over a list of packets.
>  Rather than attempting to make netfilter know about lists (which would be
>  a major project in itself) we just let it call the regular okfn (in this
>  case ip_rcv_finish()) for any packets it steals, and have it give us back
>  a list of packets it's synchronously accepted (which normally NF_HOOK
>  would automatically call okfn() on, but we want to be able to potentially
>  pass the list to a listified version of okfn().)

okfn() is only used during async reinject in the NFQUEUE case: the
skb is queued in the kernel and we wait for a verdict from a userspace
process.  If that's ACCEPT, then okfn() gets called to reinject the skb
into the network stack.

A normal -j ACCEPT doesn't call okfn in the netfilter core, which is why
this occurs on '1' retval in NF_HOOK().

Only other user of okfn() is bridge netfilter, so listified version of
okfn() doesn't make too much sense to me, its not used normally
(unless such listified version makes the code simpler of course).

AFAICS it's fine to unlink/free skbs from the list to handle
drops/queueing etc., so a future version of nf_hook() could propagate the
list into nf_hook_slow and mangle the list there to deal with hooks
that steal/drop/queue skbs.

Later on we can pass the list to the hook functions themselves.

We'll have to handle non-accept verdicts in-place in the hook functions
for this, but fortunately most hookfns only return NF_ACCEPT so I think
it is manageable.

I'll look into this once the series makes it to net-next.


[PATCH v2 ipsec-next] xfrm: policy: remove pcpu policy cache

2018-06-25 Thread Florian Westphal
Kristian Evensen says:
  In a project I am involved in, we are running ipsec (Strongswan) on
  different mt7621-based routers. Each router is configured as an
  initiator and has around ~30 tunnels to different responders (running
  on misc. devices). Before the flow cache was removed (kernel 4.9), we
  got a combined throughput of around 70Mbit/s for all tunnels on one
  router. However, we recently switched to kernel 4.14 (4.14.48), and
  the total throughput is somewhere around 57Mbit/s (best-case). I.e., a
  drop of around 20%. Reverting the flow cache removal restores, as
  expected, performance levels to that of kernel 4.9.

When pcpu xdst exists, it has to be validated first before it can be
used.

A negative hit thus increases cost vs. no-cache.

As number of tunnels increases, hit rate decreases so this pcpu caching
isn't a viable strategy.

Furthermore, the xdst cache also needs to run with BH off, so when
removing this the bh disable/enable pairs can be removed too.

Kristian tested a 4.14.y backport of this change and reported
increased performance:

  In our tests, the throughput reduction has been reduced from around -20%
  to -5%. We also see that the overall throughput is independent of the
  number of tunnels, while before the throughput was reduced as the number
  of tunnels increased.

Reported-by: Kristian Evensen 
Signed-off-by: Florian Westphal 
---
 v2: rebase on top of current ipsec-next

 include/net/xfrm.h |   1 -
 net/xfrm/xfrm_device.c |  10 
 net/xfrm/xfrm_policy.c | 139 +
 net/xfrm/xfrm_state.c  |   5 +-
 4 files changed, 3 insertions(+), 152 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 3fa578a6a819..a5378613a49c 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -332,7 +332,6 @@ int xfrm_policy_register_afinfo(const struct 
xfrm_policy_afinfo *afinfo, int fam
 void xfrm_policy_unregister_afinfo(const struct xfrm_policy_afinfo *afinfo);
 void km_policy_notify(struct xfrm_policy *xp, int dir,
  const struct km_event *c);
-void xfrm_policy_cache_flush(void);
 void km_state_notify(struct xfrm_state *x, const struct km_event *c);
 
 struct xfrm_tmpl;
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 16c1230d20fa..11d56a44e9e8 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -307,12 +307,6 @@ static int xfrm_dev_register(struct net_device *dev)
return xfrm_api_check(dev);
 }
 
-static int xfrm_dev_unregister(struct net_device *dev)
-{
-   xfrm_policy_cache_flush();
-   return NOTIFY_DONE;
-}
-
 static int xfrm_dev_feat_change(struct net_device *dev)
 {
return xfrm_api_check(dev);
@@ -323,7 +317,6 @@ static int xfrm_dev_down(struct net_device *dev)
if (dev->features & NETIF_F_HW_ESP)
xfrm_dev_state_flush(dev_net(dev), dev, true);
 
-   xfrm_policy_cache_flush();
return NOTIFY_DONE;
 }
 
@@ -335,9 +328,6 @@ static int xfrm_dev_event(struct notifier_block *this, 
unsigned long event, void
case NETDEV_REGISTER:
return xfrm_dev_register(dev);
 
-   case NETDEV_UNREGISTER:
-   return xfrm_dev_unregister(dev);
-
case NETDEV_FEAT_CHANGE:
return xfrm_dev_feat_change(dev);
 
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index d960ea6657b5..ef75891450e7 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -45,8 +45,6 @@ struct xfrm_flo {
u8 flags;
 };
 
-static DEFINE_PER_CPU(struct xfrm_dst *, xfrm_last_dst);
-static struct work_struct *xfrm_pcpu_work __read_mostly;
 static DEFINE_SPINLOCK(xfrm_if_cb_lock);
 static struct xfrm_if_cb const __rcu *xfrm_if_cb __read_mostly;
 
@@ -1732,108 +1730,6 @@ static int xfrm_expand_policies(const struct flowi *fl, 
u16 family,
 
 }
 
-static void xfrm_last_dst_update(struct xfrm_dst *xdst, struct xfrm_dst *old)
-{
-   this_cpu_write(xfrm_last_dst, xdst);
-   if (old)
-   dst_release(&old->u.dst);
-}
-
-static void __xfrm_pcpu_work_fn(void)
-{
-   struct xfrm_dst *old;
-
-   old = this_cpu_read(xfrm_last_dst);
-   if (old && !xfrm_bundle_ok(old))
-   xfrm_last_dst_update(NULL, old);
-}
-
-static void xfrm_pcpu_work_fn(struct work_struct *work)
-{
-   local_bh_disable();
-   rcu_read_lock();
-   __xfrm_pcpu_work_fn();
-   rcu_read_unlock();
-   local_bh_enable();
-}
-
-void xfrm_policy_cache_flush(void)
-{
-   struct xfrm_dst *old;
-   bool found = false;
-   int cpu;
-
-   might_sleep();
-
-   local_bh_disable();
-   rcu_read_lock();
-   for_each_possible_cpu(cpu) {
-   old = per_cpu(xfrm_last_dst, cpu);
-   if (old && !xfrm_bundle_ok(old)) {
-   if (smp_processor_id() == cpu) {
-   __xfrm_pcpu_work_fn();
- 

[PATCH ipsec] xfrm: free skb if nlsk pointer is NULL

2018-06-25 Thread Florian Westphal
nlmsg_multicast() always frees the skb, so in case we cannot call
it we must do that ourselves.

Fixes: 21ee543edc0dea ("xfrm: fix race between netns cleanup and state expire 
notification")
Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_user.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 080035f056d9..5fd44fadb195 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1025,10 +1025,12 @@ static inline int xfrm_nlmsg_multicast(struct net *net, 
struct sk_buff *skb,
 {
struct sock *nlsk = rcu_dereference(net->xfrm.nlsk);
 
-   if (nlsk)
-   return nlmsg_multicast(nlsk, skb, pid, group, GFP_ATOMIC);
-   else
-   return -1;
+   if (!nlsk) {
+   kfree_skb(skb);
+   return -EPIPE;
+   }
+
+   return nlmsg_multicast(nlsk, skb, pid, group, GFP_ATOMIC);
 }
 
 static inline unsigned int xfrm_spdinfo_msgsize(void)
-- 
2.16.1



[PATCH ipsec-next] xfrm: policy: remove pcpu policy cache

2018-06-25 Thread Florian Westphal
Kristian Evensen says:
  In a project I am involved in, we are running ipsec (Strongswan) on
  different mt7621-based routers. Each router is configured as an
  initiator and has around ~30 tunnels to different responders (running
  on misc. devices). Before the flow cache was removed (kernel 4.9), we
  got a combined throughput of around 70Mbit/s for all tunnels on one
  router. However, we recently switched to kernel 4.14 (4.14.48), and
  the total throughput is somewhere around 57Mbit/s (best-case). I.e., a
  drop of around 20%. Reverting the flow cache removal restores, as
  expected, performance levels to that of kernel 4.9.

When pcpu xdst exists, it has to be validated first before it can be
used.

A negative hit thus increases cost vs. no-cache.

As number of tunnels increases, hit rate decreases so this pcpu caching
isn't a viable strategy.

Furthermore, the xdst cache also needs to run with BH off, so when
removing this the bh disable/enable pairs can be removed too.

Kristian tested a 4.14.y backport of this change and reported
increased performance:

  In our tests, the throughput reduction has been reduced from around -20%
  to -5%. We also see that the overall throughput is independent of the
  number of tunnels, while before the throughput was reduced as the number
  of tunnels increased.

Reported-by: Kristian Evensen 
Signed-off-by: Florian Westphal 
---
 include/net/xfrm.h |   1 -
 net/xfrm/xfrm_device.c |  10 
 net/xfrm/xfrm_policy.c | 139 +
 net/xfrm/xfrm_state.c  |   5 +-
 4 files changed, 3 insertions(+), 152 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 557122846e0e..b4c79692d915 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -323,7 +323,6 @@ int xfrm_policy_register_afinfo(const struct 
xfrm_policy_afinfo *afinfo, int fam
 void xfrm_policy_unregister_afinfo(const struct xfrm_policy_afinfo *afinfo);
 void km_policy_notify(struct xfrm_policy *xp, int dir,
  const struct km_event *c);
-void xfrm_policy_cache_flush(void);
 void km_state_notify(struct xfrm_state *x, const struct km_event *c);
 
 struct xfrm_tmpl;
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 175941e15a6e..b03ef5d2b4c0 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -306,12 +306,6 @@ static int xfrm_dev_register(struct net_device *dev)
return xfrm_api_check(dev);
 }
 
-static int xfrm_dev_unregister(struct net_device *dev)
-{
-   xfrm_policy_cache_flush();
-   return NOTIFY_DONE;
-}
-
 static int xfrm_dev_feat_change(struct net_device *dev)
 {
return xfrm_api_check(dev);
@@ -322,7 +316,6 @@ static int xfrm_dev_down(struct net_device *dev)
if (dev->features & NETIF_F_HW_ESP)
xfrm_dev_state_flush(dev_net(dev), dev, true);
 
-   xfrm_policy_cache_flush();
return NOTIFY_DONE;
 }
 
@@ -334,9 +327,6 @@ static int xfrm_dev_event(struct notifier_block *this, 
unsigned long event, void
case NETDEV_REGISTER:
return xfrm_dev_register(dev);
 
-   case NETDEV_UNREGISTER:
-   return xfrm_dev_unregister(dev);
-
case NETDEV_FEAT_CHANGE:
return xfrm_dev_feat_change(dev);
 
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 5f48251c1319..77d998c99541 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -45,8 +45,6 @@ struct xfrm_flo {
u8 flags;
 };
 
-static DEFINE_PER_CPU(struct xfrm_dst *, xfrm_last_dst);
-static struct work_struct *xfrm_pcpu_work __read_mostly;
 static DEFINE_SPINLOCK(xfrm_policy_afinfo_lock);
 static struct xfrm_policy_afinfo const __rcu *xfrm_policy_afinfo[AF_INET6 + 1]
__read_mostly;
@@ -1714,108 +1712,6 @@ static int xfrm_expand_policies(const struct flowi *fl, 
u16 family,
 
 }
 
-static void xfrm_last_dst_update(struct xfrm_dst *xdst, struct xfrm_dst *old)
-{
-   this_cpu_write(xfrm_last_dst, xdst);
-   if (old)
-   dst_release(&old->u.dst);
-}
-
-static void __xfrm_pcpu_work_fn(void)
-{
-   struct xfrm_dst *old;
-
-   old = this_cpu_read(xfrm_last_dst);
-   if (old && !xfrm_bundle_ok(old))
-   xfrm_last_dst_update(NULL, old);
-}
-
-static void xfrm_pcpu_work_fn(struct work_struct *work)
-{
-   local_bh_disable();
-   rcu_read_lock();
-   __xfrm_pcpu_work_fn();
-   rcu_read_unlock();
-   local_bh_enable();
-}
-
-void xfrm_policy_cache_flush(void)
-{
-   struct xfrm_dst *old;
-   bool found = false;
-   int cpu;
-
-   might_sleep();
-
-   local_bh_disable();
-   rcu_read_lock();
-   for_each_possible_cpu(cpu) {
-   old = per_cpu(xfrm_last_dst, cpu);
-   if (old && !xfrm_bundle_ok(old)) {
-   if (smp_processor_id() == cpu) {
-   __xfrm_pcpu_work_fn();
- 

Re: [PATCH][v2] xfrm: replace NR_CPU with nr_cpu_ids

2018-06-20 Thread Florian Westphal
Steffen Klassert  wrote:
> On Tue, Jun 19, 2018 at 09:53:49AM +0200, Florian Westphal wrote:
> > Li RongQing  wrote:
> > > The default NR_CPUS can be very large, but actual possible nr_cpu_ids
> > > usually is very small. For some x86 distribution, the NR_CPUS is 8192
> > > and nr_cpu_ids is 4, so replace NR_CPU to save some memory
> > 
> > Steffen,
> > 
> > I will soon submit a patch to remove the percpu cache; removal
> > improved performance for at least one user (and by quite a sizeable
> > amount).
> > 
> > Would you consider such removal for ipsec or ipsec-next?
> 
> I think this removel would better fit to ipsec-next.

Agree, it slows things down further for me in my tests.
Problem is that I get quite good re-use of pcpu cache due to
unidirectional flows and only one tunnel.

I suspect that even with a tunnel the removal is a win in practice
though; netperf is quite artificial, so I rather trust Kristian's results
(real world) than my own.

> considered to apply it to ipsec-next. If you plan
> to remove it, I'll wait for that.

I'll submit once net-next opens.


Re: array bounds warning in xfrm_output_resume

2018-06-19 Thread Florian Westphal
David Ahern  wrote:
> $ make O=kbuild/perf -j 24 -s
> In file included from /home/dsa/kernel-3.git/include/linux/kernel.h:10:0,
>  from /home/dsa/kernel-3.git/include/linux/list.h:9,
>  from /home/dsa/kernel-3.git/include/linux/module.h:9,
>  from /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c:13:
> /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c: In function
> ‘xfrm_output_resume’:
> /home/dsa/kernel-3.git/include/linux/compiler.h:252:20: warning: array
> subscript is above array bounds [-Warray-bounds]
>__read_once_size(&(x), __u.__c, sizeof(x));  \
> ^~~~
> /home/dsa/kernel-3.git/include/linux/compiler.h:258:22: note: in
> expansion of macro ‘__READ_ONCE’
>  #define READ_ONCE(x) __READ_ONCE(x, 1)
>   ^~~
> /home/dsa/kernel-3.git/include/linux/rcupdate.h:350:48: note: in
> expansion of macro ‘READ_ONCE’
>   typeof(*p) *p1 = (typeof(*p) *__force)READ_ONCE(p); \
> ^
> /home/dsa/kernel-3.git/include/linux/rcupdate.h:487:2: note: in
> expansion of macro ‘__rcu_dereference_check’
>   __rcu_dereference_check((p), (c) || rcu_read_lock_held(), __rcu)
>   ^~~
> /home/dsa/kernel-3.git/include/linux/rcupdate.h:545:28: note: in
> expansion of macro ‘rcu_dereference_check’
>  #define rcu_dereference(p) rcu_dereference_check(p, 0)
> ^
> /home/dsa/kernel-3.git/include/linux/netfilter.h:218:15: note: in
> expansion of macro ‘rcu_dereference’
>hook_head = rcu_dereference(net->nf.hooks_arp[hook]);

Hmpf. The compiler can't know that this is only called for ipv4 and
ipv6 families.

> Line in question is the nf_hook in xfrm_output_resume.
> NF_INET_POST_ROUTING = 4 which is greater than NF_ARP_NUMHOOKS = 3
> 
> I believe ef57170bbfdd6 is the commit that introduced the warning

Yes.  I will see how to best fix this, probably needs an explicit
check on skb_dst(skb)->ops->family.
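
For illustration, a rough sketch of what such a check could look like
(untested, helper name made up; not necessarily the fix that will go in):

static int xfrm_skb_pf(const struct sk_buff *skb)
{
	switch (skb_dst(skb)->ops->family) {
	case AF_INET:
		return NFPROTO_IPV4;
	case AF_INET6:
		return NFPROTO_IPV6;
	}

	/* xfrm dsts are only created for ipv4/ipv6 */
	WARN_ON_ONCE(1);
	return NFPROTO_IPV4;
}

The caller would then pass a compile-time bounded pf value to the hook
instead of the open-coded skb_dst(skb)->ops->family.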


Re: [PATCH][v2] xfrm: replace NR_CPU with nr_cpu_ids

2018-06-19 Thread Florian Westphal
Li RongQing  wrote:
> The default NR_CPUS can be very large, but actual possible nr_cpu_ids
> usually is very small. For some x86 distribution, the NR_CPUS is 8192
> and nr_cpu_ids is 4, so replace NR_CPU to save some memory

Steffen,

I will soon submit a patch to remove the percpu cache; removal
improved performance for at least one user (and by quite a sizeable
amount).

Would you consider such removal for ipsec or ipsec-next?
If -next, consider applying this patch for ipsec.git.

Otherwise consider not applying this, as the code will go away soon.


Thanks,
Florian


Re: [PATCH net-next 0/10] xfrm: remove flow cache

2018-06-13 Thread Florian Westphal
Kristian Evensen  wrote:
> Hello,
> 
> On Tue, Jul 18, 2017 at 8:15 PM, David Miller  wrote:
> > Steffen, I know you have some level of trepidation about this because
> > there is obviously some performance cost immediately for removing this
> > DoS problem.
> 
> In a project I am involved in, we are running ipsec (Strongswan) on
> different mt7621-based routers. Each router is configured as an
> initiator and has around ~30 tunnels to different responders (running
> on misc. devices). Before the flow cache was removed (kernel 4.9), we
> got a combined throughput of around 70Mbit/s for all tunnels on one
> router. However, we recently switched to kernel 4.14 (4.14.48), and
> the total throughput is somewhere around 57Mbit/s (best-case). I.e., a
> drop of around 20%. Reverting the flow cache removal restores, as
> expected, performance levels to that of kernel 4.9.

Can you test attached patch?

I'd like to see how much the pcpu cache helps or if it actually hurts
in your setup.

Subject: [TEST PATCH 4.14.y] xfrm: remove pcpu policy cache

We need to re-evaluate if this still buys anything after indirect calls
got more expensive (retpolines).
When pcpu xdst exists, it has to be validated first (which needs
indirect calls).  So even if hit rate is good, it might be cheaper to
allocate a new xdst entry.

Furthermore, the current xdst cache needs to run with BH off, which
is also not needed when its removed.

Compile tested only.

Signed-off-by: Florian Westphal 
---
 include/net/xfrm.h |   1 -
 net/xfrm/xfrm_device.c |  10 
 net/xfrm/xfrm_policy.c | 138 +
 net/xfrm/xfrm_state.c  |   5 +-
 4 files changed, 3 insertions(+), 151 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index db99efb2d1d0..bdf185ae93db 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -323,7 +323,6 @@ int xfrm_policy_register_afinfo(const struct 
xfrm_policy_afinfo *afinfo, int fam
 void xfrm_policy_unregister_afinfo(const struct xfrm_policy_afinfo *afinfo);
 void km_policy_notify(struct xfrm_policy *xp, int dir,
  const struct km_event *c);
-void xfrm_policy_cache_flush(void);
 void km_state_notify(struct xfrm_state *x, const struct km_event *c);
 
 struct xfrm_tmpl;
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 30e5746085b8..4e458fd9236a 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -153,12 +153,6 @@ static int xfrm_dev_register(struct net_device *dev)
return NOTIFY_DONE;
 }
 
-static int xfrm_dev_unregister(struct net_device *dev)
-{
-   xfrm_policy_cache_flush();
-   return NOTIFY_DONE;
-}
-
 static int xfrm_dev_feat_change(struct net_device *dev)
 {
if ((dev->features & NETIF_F_HW_ESP) && !dev->xfrmdev_ops)
@@ -178,7 +172,6 @@ static int xfrm_dev_down(struct net_device *dev)
if (dev->features & NETIF_F_HW_ESP)
xfrm_dev_state_flush(dev_net(dev), dev, true);
 
-   xfrm_policy_cache_flush();
return NOTIFY_DONE;
 }
 
@@ -190,9 +183,6 @@ static int xfrm_dev_event(struct notifier_block *this, 
unsigned long event, void
case NETDEV_REGISTER:
return xfrm_dev_register(dev);
 
-   case NETDEV_UNREGISTER:
-   return xfrm_dev_unregister(dev);
-
case NETDEV_FEAT_CHANGE:
return xfrm_dev_feat_change(dev);
 
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 9c57d6a5816c..cdfb60a9820b 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -45,8 +45,6 @@ struct xfrm_flo {
u8 flags;
 };
 
-static DEFINE_PER_CPU(struct xfrm_dst *, xfrm_last_dst);
-static struct work_struct *xfrm_pcpu_work __read_mostly;
 static DEFINE_SPINLOCK(xfrm_policy_afinfo_lock);
 static struct xfrm_policy_afinfo const __rcu *xfrm_policy_afinfo[AF_INET6 + 1]
__read_mostly;
@@ -1715,108 +1713,6 @@ static int xfrm_expand_policies(const struct flowi *fl, 
u16 family,
 
 }
 
-static void xfrm_last_dst_update(struct xfrm_dst *xdst, struct xfrm_dst *old)
-{
-   this_cpu_write(xfrm_last_dst, xdst);
-   if (old)
-   dst_release(&old->u.dst);
-}
-
-static void __xfrm_pcpu_work_fn(void)
-{
-   struct xfrm_dst *old;
-
-   old = this_cpu_read(xfrm_last_dst);
-   if (old && !xfrm_bundle_ok(old))
-   xfrm_last_dst_update(NULL, old);
-}
-
-static void xfrm_pcpu_work_fn(struct work_struct *work)
-{
-   local_bh_disable();
-   rcu_read_lock();
-   __xfrm_pcpu_work_fn();
-   rcu_read_unlock();
-   local_bh_enable();
-}
-
-void xfrm_policy_cache_flush(void)
-{
-   struct xfrm_dst *old;
-   bool found = 0;
-   int cpu;
-
-   might_sleep();
-
-   local_bh_disable();
-   rcu_read_lock();
-   for_each_possible_cpu(cpu) {
-   old = per_cpu(xfrm_last_dst, cpu);
- 

Re: [PATCH net-next] netfilter: fix null-ptr-deref in nf_nat_decode_session

2018-05-28 Thread Florian Westphal
Prashant Bhole <bhole_prashant...@lab.ntt.co.jp> wrote:
> Add null check for nat_hook in nf_nat_decode_session()

Acked-by: Florian Westphal <f...@strlen.de>


Re: [PATCH v2 net] netfilter: provide correct argument to nla_strlcpy()

2018-05-22 Thread Florian Westphal
Eric Dumazet <eduma...@google.com> wrote:
> Recent patch forgot to remove nla_data(), upsetting syzkaller a bit.

Duuuh Thanks Eric.
Acked-by: Florian Westphal <f...@strlen.de>



Re: [PATCH v2] netfilter: properly initialize xt_table_info structure

2018-05-18 Thread Florian Westphal
Greg Kroah-Hartman  wrote:
> On Thu, May 17, 2018 at 12:42:00PM +0200, Jan Engelhardt wrote:
> > 
> > On Thursday 2018-05-17 12:09, Greg Kroah-Hartman wrote:
> > >> > --- a/net/netfilter/x_tables.c
> > >> > +++ b/net/netfilter/x_tables.c
> > >> > @@ -1183,11 +1183,10 @@ struct xt_table_info 
> > >> > *xt_alloc_table_info(unsigned int size)
> > >> > * than shoot all processes down before realizing there is 
> > >> > nothing
> > >> > * more to reclaim.
> > >> > */
> > >> > -  info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> > >> > +  info = kvzalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> > >> >if (!info)
> > >> >return NULL;
> > >>
> > >> I am curious, what particular path does not later overwrite the whole 
> > >> zone ?
> > >
> > >In do_ipt_get_ctl, the IPT_SO_GET_ENTRIES: option uses a len value that
> > >can be larger than the size of the structure itself.
> > >
> > >Then the data is copied to userspace in copy_entries_to_user() for ipv4
> > >and v6, and that's where the "bad data"
> > 
> > If the kernel incorrectly copies more bytes than it should, isn't that
> > a sign that may be going going past the end of the info buffer?
> > (And thus, zeroing won't truly fix the issue)
> 
> No, the buffer size is correct, we just aren't filling up the whole
> buffer as the data requested is smaller than the buffer size.

I have no objections to the patch but I'd like to understand what
problem it's fixing.

Normal pattern is:
newinfo = xt_alloc_table_info(tmp.size);
copy_from_user(newinfo->entries, user + sizeof(tmp), tmp.size);

So inital value of the rule blob area should not matter.

Furthermore, when copying the rule blob back to userspace,
the kernel is not supposed to copy any padding back to userspace either,
since commit f32815d21d4d8287336fb9cef4d2d9e0866214c2 only the
user-relevant parts should be copied (some matches and targets allocate
kernel-private data such as pointers, and we did use to leak such pointer
values back to userspace).


Re: Silently dropped UDP packets on kernel 4.14

2018-05-02 Thread Florian Westphal
Kristian Evensen  wrote:
> I went for the early-insert approached and have patched

I'm sorry for suggesting that.

It doesn't work, because of NAT.
NAT rewrites packet content and changes the reply tuple, but the tuples
determine the hash insertion location.

I don't know how to solve this problem.


Re: Silently dropped UDP packets on kernel 4.14

2018-05-01 Thread Florian Westphal
Kristian Evensen  wrote:
> On Tue, May 1, 2018 at 8:50 PM, Kristian Evensen
>  wrote:
> > Does anyone have any idea of what could be wrong, where I should look
> > or other things I can try? I tried to space the requests out a bit in
> > time (I inserted a sleep 1 between them), and then the problem went
> > away.
> 
> I should learn to always go through everything one last time before
> sending an email. First of all, I see that both requests are treated
> as new.

Normal when nfqueue is in use.

> Second, on my router, new requests are sent to user space for
> marking, which explains the large delay in processing. When removing
> the NFQUEUE-rule + handling and marking statically, my problem goes
> away and I get an answer for both packets.

Yes, because 2nd packet finds the conntrack entry created by the first
one.

> However, I do have one question. In my application, both packets are
> assigned the same mark. Shouldn't they then match the same conntrack
> entry, or am I missing something since that seems not to be the case?

The 2nd packet creates a new conntrack entry, because the conntrack
entry created by the first one is not inserted into the global table yet.

This happens as the last step, after the packet has traversed all chains.
When nfqueue is used, this gets delayed, and the 2nd packet will be dropped
as the insertion step finds that another packet already created the same
flow.

I'm not sure what the best way to solve this is; we either need
to insert earlier in the nfqueue case, or do conflict resolution in the
nfqueue case (and perhaps also undo the nat? not sure).


Re: [PATCH] netfilter: ebtables: handle string from userspace with care

2018-04-27 Thread Florian Westphal
Paolo Abeni  wrote:
> strlcpy() can't be safely used on a user-space provided string,
> as it can try to read beyond the buffer's end, if the latter is
> not NULL terminated.

Yes.

> Leveraging the above, syzbot has been able to trigger the following
> splat:
> 
> BUG: KASAN: stack-out-of-bounds in strlcpy include/linux/string.h:300
> [inline]
> BUG: KASAN: stack-out-of-bounds in compat_mtw_from_user
> net/bridge/netfilter/ebtables.c:1957 [inline]
> BUG: KASAN: stack-out-of-bounds in ebt_size_mwt
> net/bridge/netfilter/ebtables.c:2059 [inline]
> BUG: KASAN: stack-out-of-bounds in size_entry_mwt
> net/bridge/netfilter/ebtables.c:2155 [inline]
> BUG: KASAN: stack-out-of-bounds in compat_copy_entries+0x96c/0x14a0
> net/bridge/netfilter/ebtables.c:2194
> Write of size 33 at addr 8801b0abf888 by task syz-executor0/4504

Which is weird, I don't understand this report.
The code IS wrong, but it should cause an out-of-bounds read (strlen on
src), not an out-of-bounds write.

Yes, I sent a recent patch (dceb48d86b4871984b8ce9ad5057fb2c01aa33de in
nf.git) that now allows getting rid of the strlcpy and using the
source directly.
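
For illustration, a minimal sketch of a copy helper that cannot read past
an unterminated, user-supplied name (name and signature are made up for
the example; dst is assumed to hold at least 'size' bytes):

static int ebt_copy_name(char *dst, const char *src, size_t size)
{
	/* src comes from userspace and may lack the NUL terminator */
	if (strnlen(src, size) == size)
		return -EINVAL;

	strcpy(dst, src);
	return 0;
}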



Re: [nf-next] netfilter: extend SRH match to support matching previous, next and last SID

2018-04-23 Thread Florian Westphal
Ahmed Abdelsalam  wrote:
> > > @@ -50,6 +62,12 @@ struct ip6t_srh {
> > >   __u8segs_left;
> > >   __u8last_entry;
> > >   __u16   tag;
> > > + struct in6_addr psid_addr;
> > > + struct in6_addr nsid_addr;
> > > + struct in6_addr lsid_addr;
> > > + struct in6_addr psid_msk;
> > > + struct in6_addr nsid_msk;
> > > + struct in6_addr lsid_msk;
> > 
> > This is changing something exposed through UAPI, so you will need a
> > new revision for this.
> 
> Could you please advice what should be done in this case? 

You need to add
struct ip6t_srh_v1 {
/* copy of struct ip6t_srh here */

/* new fields go here */
};


Look at xt_conntrack.c, conntrack_mt_reg[] for an example of
multi-revision match.

You can probably re-organise the code to avoid too much duplication.
See 5a786232eb69a1f870ddc0cfd69d5bdef241a2ea in nf.git for an example,
it makes v0 into a v1 struct at runtime and re-uses new v1 code
for old v0.
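
For illustration, the resulting two-revision registration would look
roughly like this (fields trimmed, function names are placeholders):

static struct xt_match srh_mt6_reg[] __read_mostly = {
	{
		.name		= "srh",
		.revision	= 0,
		.family		= NFPROTO_IPV6,
		.match		= srh_mt6,
		.matchsize	= sizeof(struct ip6t_srh),
		.me		= THIS_MODULE,
	},
	{
		.name		= "srh",
		.revision	= 1,
		.family		= NFPROTO_IPV6,
		.match		= srh1_mt6,	/* new match function for v1 */
		.matchsize	= sizeof(struct ip6t_srh_v1),
		.me		= THIS_MODULE,
	},
};

/* registered via xt_register_matches(srh_mt6_reg, ARRAY_SIZE(srh_mt6_reg)) */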




Re: [PATCH RFC iptables] iptables: Per-net ns lock

2018-04-20 Thread Florian Westphal
Kirill Tkhai  wrote:
> Pablo, Florian, could you please provide comments on this?
> 
> On 09.04.2018 19:55, Kirill Tkhai wrote:
> > In CRIU and LXC-restore we met the situation,
> > when iptables in container can't be restored
> > because of permission denied:
> > 
> > https://github.com/checkpoint-restore/criu/issues/469
> > 
> > Containers want to restore their own net ns,
> > while they may have no their own mnt ns.
> > This case they share host's /run/xtables.lock
> > file, but they may not have permission to open
> > it.
> > 
> > Patch makes /run/xtables.lock to be per-namespace,
> > i.e., to refer to the caller task's net ns.

It looks ok to me but then again the entire userspace
lock thing is an ugly band aid :-/


Re: tcp hang when socket fills up ?

2018-04-17 Thread Florian Westphal
Dominique Martinet  wrote:

[ CC Jozsef ]

> Could it have something to do with the way I setup the connection?
> I don't think the "both remotes call connect() with carefully selected
> source/dest port" is a very common case..
> 
> If you look at the tcpdump outputs I attached the sequence usually is
> something like
>  server > client SYN
>  client > server SYN
>  server > client SYNACK
>  client > server ACK
> 
> ultimately it IS a connection, but with an extra SYN packet in front of
> it (that first SYN opens up the conntrack of the nat so that the
> client's syn can come in, the client's conntrack will be that of a
> normal connection since its first SYN goes in directly after the
> server's (it didn't see the server's SYN))
> 
> Looking at my logs again, I'm seeing the same as you:
> 
> This looks like the actual SYN/SYN/SYNACK/ACK:
>  - 14.364090 seq=505004283 likely SYN coming out of server
>  - 14.661731 seq=1913287797 on next line it says receiver
> end=505004284 so likely the matching SYN from client
> Which this time gets a proper SYNACK from server:
> 14.662020 seq=505004283 ack=1913287798
> And following final dataless ACK:
> 14.687570 seq=1913287798 ack=505004284
> 
> Then as you point out some data ACK, where the scale poofs:
> 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
> end=1913287819
> 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 
> scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> 14.688824 tcp_in_window: 
> 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
> end=1913287819
> 14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 
> scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
>
> As you say, only tcp_options() will clear only on side of the scales.
> We don't have sender->td_maxwin == 0 (printed) so I see no other way
> than we are in the last else if:
>  - we have after(end, sender->td_end) (end=1913287819 > sender
> end=1913287798)
>  - I assume the tcp state machine must be confused because of the
> SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, 
> but since this is a data packet it doesn't have the tcp option for scale
> thus scale resets.

Yes, this looks correct. Jozsef, can you please have a look?

Problem seems to be that conntrack believes that ACK packet
re-initializes the connection:

 595 /*
 596  * RFC 793: "if a TCP is reinitialized ... then it need
 597  * not wait at all; it must only be sure to use sequence
 598  * numbers larger than those recently used."
 599  */
 600 sender->td_end =
 601 sender->td_maxend = end;
 602 sender->td_maxwin = (win == 0 ? 1 : win);
 603 
 604 tcp_options(skb, dataoff, tcph, sender);

and last line clears the scale value (no wscale option in data packet).


Transitions are:
 server > client SYN  sNO -> sSS
 client > server SYN  sSS -> sS2
 server > client SYNACK   sS2 -> sSR /* here */
 client > server ACK  sSR -> sES

SYN/ACK was observed in original direction so we hit
state->state == TCP_CONNTRACK_SYN_RECV && dir == IP_CT_DIR_REPLY test
when we see the ack packet and end up in the 'TCP is reinitialized' branch.

AFAICS, without this, connection would move to sES just fine,
as the data ack is in window.


Re: tcp hang when socket fills up ?

2018-04-16 Thread Florian Westphal
Dominique Martinet  wrote:
> Eric Dumazet wrote on Sun, Apr 15, 2018:
> > Are you sure you do not have some iptables/netfilter stuff ?
> 
> I have a basic firewall setup with default rules e.g. starts with
> -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
> in the INPUT chain...
> That said, I just dropped it on the server to check and that seems to
> workaround the issue?!
> When logging everything dropped it appears to decide that the connection
> is no longer established at some point, but only if there is
> tcp_timestamp, just, err, how?
> 
> And certainly enough, if I restore the firewall while a connection is up
> that just hangs; conntrack doesn't consider it connected anymore at some
> point (but it worked for a while!)
> 
> Here's the kind of logs I get from iptables:
> IN=wlp1s0 OUT= MAC=00:c2:c6:b4:7e:c7:a4:12:42:b5:5d:fc:08:00 SRC=client 
> DST=server LEN=52 TOS=0x00 PREC=0x00 TTL=52 ID=17038 DF PROTO=TCP SPT=41558 
> DPT=15609 WINDOW=1212 RES=0x00 ACK URGP=0 

You could do
echo 6 > /proc/sys/net/netfilter/nf_conntrack_log_invalid

to have conntrack log when/why it thinks packet is invalid.

You can also set
echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal

which stops conntrack from marking packets with out-of-window
acks as invalid.

(Earlier email implies this is related to timestamps, but unfortunately
 to best of my knowledge conntrack doesn't look at tcp timestamps).



Re: possible deadlock in rtnl_lock (5)

2018-03-27 Thread Florian Westphal
syzbot  wrote:
[ cc Julian and trimming cc list ]

> syzkaller688027/4497 is trying to acquire lock:
>  (rtnl_mutex){+.+.}, at: [] rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74

> but task is already holding lock:
> IPVS: stopping backup sync thread 4495 ...
>  (rtnl_mutex){+.+.}, at: [] rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>CPU0
>
>   lock(rtnl_mutex);
>   lock(rtnl_mutex);
> 
>  *** DEADLOCK ***
> 
>  May be due to missing lock nesting notation

Looks like this is real, commit e0b26cc997d57305b4097711e12e13992580ae34
("ipvs: call rtnl_lock early") added rtnl_lock when starting sync thread
but socket close invokes rtnl_lock too:

> stack backtrace:
>  rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
>  ip_mc_drop_socket+0x88/0x230 net/ipv4/igmp.c:2643
>  inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:413
>  sock_release+0x8d/0x1e0 net/socket.c:595
>  start_sync_thread+0x2213/0x2b70 net/netfilter/ipvs/ip_vs_sync.c:1924
>  do_ip_vs_set_ctl+0x1139/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2389


Re: [PATCH nf] netfilter: drop template ct when conntrack is skipped.

2018-03-22 Thread Florian Westphal
Paolo Abeni <pab...@redhat.com> wrote:
> The ipv4 nf_ct code currently skips the nf_conntrak_in() call
> for fragmented packets. As a results later matches/target can end
> up manipulating template ct entry instead of 'real' ones.
> 
> Exploiting the above, syzbot found a way to trigger the following
> splat:
> 
> WARNING: CPU: 1 PID: 4242 at net/netfilter/xt_cluster.c:55
> xt_cluster_mt+0x6c1/0x840 net/netfilter/xt_cluster.c:127
> Kernel panic - not syncing: panic_on_warn set ...

Right, template has l3 protocol 0.

> Instead of adding checks for template ct on every target/match
> manipulating skb->_nfct, simply drop the template ct when skipping
> nf_conntrack_in().

Fixes: 7b4fdf77a450ec ("netfilter: don't track fragmented packets")
Acked-by: Florian Westphal <f...@strlen.de>



Re: linux-next: ip6tables *broken* - last base chain position %u doesn't match underflow %u (hook %u

2018-03-20 Thread Florian Westphal
valdis.kletni...@vt.edu  wrote:
> (Resending because I haven't heard anything)
[ ip6tables broken ]

Sorry, did not see this email before.

I'll investigate asap, thanks for the detailed report.


Re: [RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer

2018-03-15 Thread Florian Westphal
Alexei Starovoitov  wrote:
> The way this IMR defined today looks pretty much like nft and
> it feels a bit too low level than iptable conversion would need.

It wasn't so much about a specific IMR but to avoid code duplication
between nft and iptables translators.

> I think it would be simpler to have user space only extensions
> and opcodes added to bpf for the purpose of the translation.
> Like there is no bpf instruction called 'load from IP header',
> but we can make one. Just extend extended bpf with an instruction
> like this and on the first pass do full conversion of nft
> directly into this 'extended extended bpf'.

I don't want to duplicate any ebpf conversion (and optimisations)
in the nft part.

If nft can be translated to this 'extended extended bpf' and
this then generates bpf code from nft input all is good.



Re: [PATCH 00/30] Netfilter/IPVS updates for net-next

2018-03-13 Thread Florian Westphal
David Miller  wrote:
[ flow tables ]
> Ok, that seems to constrain the exposure.
> 
> We should talk at some point about how exposed conntrack itself is.

Sure, we can do that.

If you have specific scenarios (synflood, peer that opens
100k (legitimate) connections, perpetual-fin, etc) in mind let me know,
I do think that we could still do better in some cases.


Re: [PATCH 00/30] Netfilter/IPVS updates for net-next

2018-03-13 Thread Florian Westphal
David Miller  wrote:
> From: Felix Fietkau 
> Date: Mon, 12 Mar 2018 20:30:01 +0100
> 
> > It's not dead and useless. In its current state, it has a software fast
> > path that significantly improves nftables routing/NAT throughput,
> > especially on embedded devices.
> > On some devices, I've seen "only" 20% throughput improvement (along with
> > CPU usage reduction), on others it's quite a bit lot more. This is
> > without any extra drivers or patches aside from what's posted.
> 
> I wonder if this software fast path has the exploitability problems that
> things like the ipv4 routing cache and the per-cpu flow cache both had.

No, entries in the flow table are backed by an entry in the conntrack
table, and that has an upper ceiling.

As the decision of when an entry gets placed into the flow table is
configurable via the ruleset (nftables, iptables will be coming too), one
can tie the 'fastpathing' to almost-arbitrary criteria, e.g.

'only flows from trusted internal network'
'only flows that saw two-way communication'
'only flows that sent more than 100kbyte'

or any combination thereof.

Do you see another problem that needs to be addressed?


Re: [PATCH v2] net: netfilter: Replace printk() with appropriate pr_*() macro

2018-03-11 Thread Florian Westphal
Arushi Singhal  wrote:
> On Mon, Mar 12, 2018 at 2:17 AM, Pablo Neira Ayuso 
> wrote:
> 
> > Hi Joe,
> >
> > On Sun, Mar 11, 2018 at 12:52:41PM -0700, Joe Perches wrote:
> > > On Mon, 2018-03-12 at 01:11 +0530, Arushi Singhal wrote:
> > > > Using pr_() is more concise than
> > > > printk(KERN_).
> > > > Replace printks having a log level with the appropriate
> > > > pr_*() macros.
> > > >
> > > > Signed-off-by: Arushi Singhal 
> > > > ---
> > > > changes in v2
> > > > *in v1 printk() were replaced with netdev_*()
> > >
> > > >  net/netfilter/nf_conntrack_acct.c  | 2 +-
> > > >  net/netfilter/nf_conntrack_ecache.c| 2 +-
> > > >  net/netfilter/nf_conntrack_timestamp.c | 2 +-
> > > >  net/netfilter/nf_nat_core.c| 2 +-
> > > >  net/netfilter/nfnetlink_queue.c| 4 ++--
> > > >  5 files changed, 6 insertions(+), 6 deletions(-)
> > >
> > > None of these files have a #define for pr_fmt so this
> > > should be OK.
> >
> > I think Arushi could add pr_fmt in the same go, so we skip another
> > follow up patch for this. @Arushi: I suggested this in my previous
> > email, please have a look.
> >
> > Hello Pablo
> 
> Should I send two patches, one with the conversion of printk() to pr_() and
> another for defining pr_fmt().
> 
> Or
> 
> only one patch with all the changes?

Both in one, it reduces code churn.

With pr_* + pr_fmt, the module name will be prefixed automatically, see
e.g. commit e016c5e43db51875c2b541b59bd217494d213174 as an example.
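
I.e. a one-liner like this at the top of each file (standard idiom, shown
only for illustration):

/* must come before the #includes so linux/printk.h picks it up */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

Every pr_warn()/pr_info() in that file then gets the module name prefix
for free.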

> > This would also probably allows us to save the line break in the error
> > message, which IIRC is not a good practise either, eg.
> >
> > pr_warn("nf_queue: OOM "
> > "in mangle, dropping packet\n");
> >
> > > Perhaps coalesce the formats and remove the unnecessary periods too.

The above message should be removed, it's useless (on OOM the allocator
already warns).


Re: WARNING in __proc_create

2018-03-09 Thread Florian Westphal
Cong Wang  wrote:
> On Fri, Mar 9, 2018 at 2:58 PM, Eric Dumazet  wrote:
> >
> >
> > On 03/09/2018 02:56 PM, Eric Dumazet wrote:
> >
> >>
> >> I sent a patch a while back, but Pablo/Florian wanted more than that
> >> simple fix.
> >>
> >> We also need to filter special characters like '/'
> 
> proc_create_data() itself accepts '/', so it must be xt_hashlimit doesn't
> want it.

--hashlimit-name / also triggers WARN for me.
. or .. "work" (no crash), but cause the appearance of a 2nd ./.. in
/proc/net/ipt_hashlimit, so I think it's better to disallow that too.
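
A minimal sketch of the kind of name validation meant here (helper name
made up; the real check would sit in the hashlimit checkentry/init path):

static bool hashlimit_name_ok(const char *name)
{
	if (*name == '\0')	/* empty name makes proc_create_data() WARN */
		return false;
	if (strchr(name, '/'))	/* so does a '/' in the name */
		return false;

	/* "." and ".." don't crash but create bogus-looking proc entries */
	return strcmp(name, ".") && strcmp(name, "..");
}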


Re: WARNING in __proc_create

2018-03-09 Thread Florian Westphal
Eric Dumazet  wrote:
> >>fs/proc/generic.c:354
> >
> >We need to reject empty names.
> >
> 
> I sent a patch a while back, but Pablo/Florian wanted more than that simple
> fix.
> 
> We also need to filter special characters like '/'
> 
> Or maybe I am mixing with something else.

Argh, sorry, this fell off the truck it seems :(

I'll work on this tomorrow unless someone else does it today ;)


Re: [RFC PATCH v2] bridge: make it possible for packets to traverse the bridge without hitting netfilter

2018-03-09 Thread Florian Westphal
David Woodhouse <dw...@infradead.org> wrote:
> 
> 
> On Fri, 2015-03-06 at 17:37 +0100, Florian Westphal wrote:
> > 
> > > > I did performance measurements in the following way:
> > > > 
> > > > Removed those pieces of the packet pipeline that I don't necessarily
> > > > need one-by-one.  Then measured their effect on small packet
> > > > performance.
> > > > 
> > > > This was the only part that produced considerable effect.
> > > > 
> > > > The pure speculation was about why the effect is more than 15%
> > > > increase in packet throughput, although the code path avoided
> > > > contains way less code than 15% of the packet pipeline.  It seems,
> > > > Felix Fietkau profiled similar changes, and found my guess well
> > > > founded.
> > > > 
> > > > Now could anybody explain me what else is wrong with my patch?
> > > 
> > > We have to come up with a more generic solution for this.
> > 
> > Jiri Benc suggested to allowing to attach netfilter hooks e.g. via tc
> > action, maybe that would be an option worth investigating.
> > 
> > Then you could for instance add filtering rules only to the bridge port
> > that needs it.
> > 
> > > These sysfs tweaks you're proposing look to me like an obscure way to
> > > tune this.
> > 
> > I agree, adding more tunables isn't all that helpful, in the past this
> > only helped to prolong the problem.
> 
> How feasible would it be to make it completely dynamic?
> 
> A given hook could automatically disable itself (for a given device) if
> the result of running it the first time was *tautologically* false for
> that device (i.e. regardless of the packet itself, or anything else).
> 
> The hook would need to be automatically re-enabled if the rule chain
> ever changes (and might subsequently disable itself again).
> 
> Is that something that's worth exploring for the general case?

AF_BRIDGE hooks sit in the net namespace, so it's enough for
one bridge to request filtering to bring in the hook overhead for all
bridges in the same netns.

Alternatives:
- place the bridges that need filtering in different netns
- use tc ingress for filtering
- use nftables ingress hook for filtering (it sits in almost same
  location as tc ingress hook) to attach the ruleset to those bridge
  ports that need packet filtering.

(The original request came from user with tons of bridges where only
 one single bridge needed filtering).

One alternative I see is to place the bridge hooks into the
bridge device (net_bridge struct, which is in netdev private area).

But, as you already mentioned we would need to annotate the hooks
to figure out which device(s) they are for.

This sounds rather fragile to me, so I would just use nft ingress:

#!/sbin/nft -f

table netdev ingress {
  chain in_public { type filter hook ingress device eth0 priority 0;
   ip saddr 192.168.0.1 counter
 }
}


Re: [RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer

2018-03-06 Thread Florian Westphal
Edward Cree <ec...@solarflare.com> wrote:
> On 06/03/18 16:42, Florian Westphal wrote:
> > I would also add 'highlevel' objects that are themselves translated into
> > basic operations.  Most obvious example
> > are 'fetch 4 bytes x bytes into transport header'.
> >
> > Frontend should not need to bother with ipv4 details, such as ip
> > options.  Instead IMR should take care of this and generate the needed
> > instructions to fetch iph->ihl and figure out the correct transport
> > header offset.
> Presumably then for this the IMR regs will cease to have any connection to
>  BPF regs and will simply be (SSA?) r0, r1, ... as far as needed (not
>  limited to 10 regs like BPF)?  Then register allocation all happens in
>  the IMR->BPF conversion (even for things 64 bits or smaller).
> 
> I wonder how sophisticated we should be about register allocation; whether
>  we should go the whole hog with graph-colouring algorithms or linear
>  scan, or just do something naïve like an LRU.
> 
> Relatedly, should we spill values to the stack when we run out of
>  registers, or should we just rely on being able to rematerialise them
>  from parsing the packet again?

I don't know.  I suspect we should go for a naive algorithm only,
but I would defer such a decision to Alexei/Daniel.

F.e. I don't know if using llvm is a good idea or not; I did not
intend to turn the proposed imr into a full blown compiler in any case,
I only want to avoid code duplication for the iptables/nftables -> ebpf
translator.


Re: [RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer

2018-03-06 Thread Florian Westphal
Daniel Borkmann <dan...@iogearbox.net> wrote:
> On 03/04/2018 08:40 PM, Florian Westphal wrote:
> > Its still in early stage, but I think its good enough as
> > a proof-of-concept.
> 
> be the critical part in that it needs to be flexible enough to cover
> both front ends well enough without having to make compromises to
> one of them.

Yes, that's the idea.

> The same would be for optimization passes e.g. when we
> know that two successive rules would match on TCP header bits that we
> can reuse the register that loaded/parsed it previously to that point.

Yes, I already thought about this.

First step would be to simply track the last accessed header base
(mac, nh, th) in the imr state so we can reuse the offset if possible.
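
Roughly something like this in the imr/jit state (names invented, not
part of the posted POC):

struct imr_jit_hdr_cache {
	enum imr_payload_base last_base;	/* base of last payload access */
	int base_reg;		/* ebpf register holding that base offset */
	bool valid;		/* cleared once base_reg gets clobbered */
};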

> Similar when it comes to maps when the lookup value would need to
> propagate through the linked imr objects. Do you see such optimizations
> or in general propagation of state as direct part of the IMR or rather
> somewhat hidden in IMR layer when doing the IMR to BPF 'jit' phase?

Direct part.  I would only say that it should be hidden in the sense that
frontends don't need to do anything here.
I would do such propagation in the IMR before generating code.

The IMR has enough info to figure out:
1. number of (ebpf) registers that are needed simultaneously at most
2. how many payload requests relative to (mac/network/tcp header) there are
(both in total and in relation to rule count).

So it could e.g. decide to set aside 2 registers to contain the
th/nh offset to have them ready at all times, while not covering mac
header at all because its not used anywhere.

Likewise, we could even avoid for instance extension header parsing
as IMR would see that there are no 'IPv6+transport header' rules.

> Which other parts do you think would be needed for the IMR aside
> from above basic operations? ALU ops, JMPs for the relational ops?

We'll need at least the following IMR features/objects (pseudo instructions):

- relops: gt, ge, lt etc.
- all bitops (shift, xor, and, etc)
- payload: allow stores instead of loads (this includes csum fixup)
- verdict: jumps to other chains, nfqueue verdict
- meta: (accesses to in/output interface names, skb length, and the like).
- register allocation for ipv6 addresses, interface names, etc (anything
larger than 64 bit).

I would also add 'highlevel' objects that are themselves translated into
basic operations.  The most obvious example
is 'fetch 4 bytes at offset x into the transport header'.

Frontend should not need to bother with ipv4 details, such as ip
options.  Instead IMR should take care of this and generate the needed
instructions to fetch iph->ihl and figure out the correct transport
header offset.

It could do so by adding the needed part of the rule (fetch byte containing
ihl, mask/shift result, store the offset, etc.) internally.
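
For illustration, this is the arithmetic the emitted instructions would
have to perform, written as plain C:

/* low nibble of the first IPv4 header byte is the header length
 * in 32-bit words, including options */
static unsigned int ipv4_transport_offset(const unsigned char *nh)
{
	return (nh[0] & 0x0f) * 4;
}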

I would also add:
- limit (to jit both xt_limit and nft limit)
- statistic (random sampling)

as both nft and iptables have those features and I think they're
pretty much the same, so it can be handled by common code.
Same for others extensions but those have side effects and I don't
know yet how to handle that (log, counter and the like).

Same for set lookups.

In the iptables case, they will only be 'address/network/interface is (not) in set',
i.e. only match/no match, and set manipulations (SET target).

In the nftables case it can also be something like 'retrieve value
belonging to key (dreg = lookup(map, sreg, sreg_len), and then dreg
value gets used in other ways.

I think that by making the IMR also handle the nftables case rather than
iptables one has the advantage that the IMR might be able to optimize
repetitiive iptables rules (and nft rules that do not use jump map even
if they could).

iptables .. -i eth0 -j ETH0_RULES
iptables .. -i eth1 -j ETH1_RULES
..
iptables .. -i ethX -j ETHX_RULES

could be transparently thrown into a map and the jump target retrieved
from it based on iifname...

The IMR would also need to be able to create/allocate such sets/maps
on its own.

Do you see this as something that has to be decided now?
Ideally, my feeling about the IMR is that we can probably handle this
later by extending it rather than having to throw it away and do a
rewrite :)

> I think it would be good to have clear semantics in terms of what
> it would eventually abstract away from raw BPF when sitting on top
> of it; potentially these could be details on packet access or

[..]

Yes, it would be good if the IMR could e.g. merge adjacent loads:

load 4 bytes from network header at offset 12  (relop, equal) imm 0x1
load 4 bytes from network header at offset 16  (relop, equal) imm 0x2

 ..  would be condensed to ...

load 8 bytes from network header at offset 12  (relop, equal) imm 0x10002

... before generating ebpf code.

I don't think this should be part of the frontend (too low-level a task for
the frontend) and I don't see why it should happen when generating c

Re: [PATCH net] netfilter: check for out-of-bounds while copying compat entries

2018-03-05 Thread Florian Westphal
Paolo Abeni  wrote:
> Currently, when coping ebt compat entries, no checks are in place
> for the offsets provided by user space, so that syzbot was able to
> trigger the following splat:
> ---
>  net/bridge/netfilter/ebtables.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
> index 02c4b409d317..54ceaff701fb 100644
> --- a/net/bridge/netfilter/ebtables.c
> +++ b/net/bridge/netfilter/ebtables.c
> @@ -2114,7 +2114,7 @@ static int size_entry_mwt(struct ebt_entry *entry, 
> const unsigned char *base,
>   unsigned int size;
>   char *buf = buf_start + offsets[i];
>  
> - if (offsets[i] > offsets[j])
> + if (offsets[i] > offsets[j] || offsets[j] > *total)
>   return -EINVAL;

I thought I fixed this via b71812168571fa55e44cdd0254471331b9c4c4c6,
and, after looking at it again, I still don't see why that doesn't cover
this :-(


[RFC,POC 1/3] bpfilter: add experimental IMR bpf translator

2018-03-04 Thread Florian Westphal
This is a basic intermediate representation to decouple
the ruleset representation (iptables, nftables) from the
ebpf translation.

The IMR currently assumes that translation will always be
into ebpf, its pseudo-registers map 1:1 to ebpf ones.

Objects implemented at the moment:
- relop (eq, ne only for now)
- immediate (32, 64 bit constants)
- payload, with relative addressing (mac header, network header, transport 
header)

This doesn't add a user; files will not even be compiled yet.

Signed-off-by: Florian Westphal <f...@strlen.de>
---
 net/bpfilter/imr.c | 655 +
 net/bpfilter/imr.h |  78 +++
 2 files changed, 733 insertions(+)
 create mode 100644 net/bpfilter/imr.c
 create mode 100644 net/bpfilter/imr.h

diff --git a/net/bpfilter/imr.c b/net/bpfilter/imr.c
new file mode 100644
index ..09c557ea7c21
--- /dev/null
+++ b/net/bpfilter/imr.c
@@ -0,0 +1,655 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include 
+typedef __u16 __bitwise __sum16; /* hack */
+#include 
+#include 
+
+#include "imr.h"
+#include "bpfilter_gen.h"
+
+#define EMIT(ctx, x)   \
+   do {\
+   if ((ctx)->len_cur + 1 > (ctx)->len_max)\
+   return -ENOMEM; \
+   (ctx)->img[(ctx)->len_cur++] = x;   \
+   } while (0)
+
+struct imr_object {
+   enum imr_obj_type type:8;
+   uint8_t len;
+
+   union {
+   struct {
+   union {
+   uint64_t value64;
+   uint32_t value32;
+   };
+   } immedate;
+   struct {
+   struct imr_object *left;
+   struct imr_object *right;
+   enum imr_relop op:8;
+   } relational;
+   struct {
+   uint16_t offset;
+   enum imr_payload_base base:8;
+   } payload;
+   struct {
+   enum imr_verdict verdict;
+   } verdict;
+   };
+};
+
+struct imr_state {
+   struct bpf_insn *img;
+   uint32_t len_cur;
+   uint32_t len_max;
+
+   struct imr_object *registers[IMR_REG_COUNT];
+   uint8_t regcount;
+
+   uint32_t num_objects;
+   struct imr_object **objects;
+};
+
+static int imr_jit_object(struct bpfilter_gen_ctx *ctx,
+ struct imr_state *, const struct imr_object *o);
+
+static void internal_error(const char *s)
+{
+   fprintf(stderr, "FIXME: internal error %s\n", s);
+   exit(1);
+}
+
+/* FIXME: consider len too (e.g. reserve 2 registers for len == 8) */
+static int imr_register_alloc(struct imr_state *s, uint32_t len)
+{
+   uint8_t reg = s->regcount;
+
+   if (s->regcount >= IMR_REG_COUNT)
+   return -1;
+
+   s->regcount++;
+
+   return reg;
+}
+
+static int imr_register_get(const struct imr_state *s, uint32_t len)
+{
+   if (len > sizeof(uint64_t))
+   internal_error(">64bit types not yet implemented");
+   if (s->regcount == 0)
+   internal_error("no registers in use");
+
+   return s->regcount - 1;
+}
+
+static int imr_to_bpf_reg(enum imr_reg_num n)
+{
+   /* currently maps 1:1 */
+   return (int)n;
+}
+
+static int bpf_reg_width(unsigned int len)
+{
+   switch (len) {
+   case sizeof(uint8_t): return BPF_B;
+   case sizeof(uint16_t): return BPF_H;
+   case sizeof(uint32_t): return BPF_W;
+   case sizeof(uint64_t): return BPF_DW;
+   default:
+   internal_error("reg size not supported");
+   }
+
+   return -EINVAL;
+}
+
+static void imr_register_release(struct imr_state *s)
+{
+   if (s->regcount == 0)
+   internal_error("regcount underflow");
+   s->regcount--;
+}
+
+void imr_register_store(struct imr_state *s, enum imr_reg_num reg, struct 
imr_object *o)
+{
+   s->registers[reg] = o;
+}
+
+struct imr_object *imr_register_load(const struct imr_state *s, enum 
imr_reg_num reg)
+{
+   return s->registers[reg];
+}
+
+struct imr_state *imr_state_alloc(void)
+{
+   struct imr_state *s = calloc(1, sizeof(*s));
+
+   return s;
+}
+
+void imr_state_free(struct imr_state *s)
+{
+   int i;
+
+   for (i = 0; i < s->num_objects; i++)
+   imr_object_free(s->objects[i]);
+
+   free(s);
+}
+
+struct imr_object *imr_object_alloc(enum imr_obj_type t)
+{
+   struct imr_object *o = calloc(1, sizeof(*o));
+
+   if (o)
+   o->type = t;
+
+   return o;
+}
+
+void imr_object_free(struct imr_object *o)
+{
+   switch (o->type) {
+   case IMR_OBJ_TYP

[RFC,POC 2/3] bpfilter: add nftables jit proof-of-concept

2018-03-04 Thread Florian Westphal
This adds an nftables frontend for the IMR->BPF translator.

This doesn't work via UMH yet.

AFAIU it should be possible to get transparent ebpf translation for
nftables, similar to the bpfilter/iptables UMH.

However, at this time I think it's better to get the IMR "right".

nftjit.ko currently needs libnftnl/libmnl but that's convenience on
my end and not a "must have".

Signed-off-by: Florian Westphal <f...@strlen.de>
---
 net/bpfilter/Makefile   |   7 +-
 net/bpfilter/nftables.c | 679 
 2 files changed, 685 insertions(+), 1 deletion(-)
 create mode 100644 net/bpfilter/nftables.c

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index 5a85ef7d7a4d..a4064986dc2f 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -3,7 +3,12 @@
 # Makefile for the Linux BPFILTER layer.
 #
 
-hostprogs-y := bpfilter.ko
+hostprogs-y := nftjit.ko bpfilter.ko
 always := $(hostprogs-y)
 bpfilter.ko-objs := bpfilter.o tgts.o targets.o tables.o init.o ctor.o 
sockopt.o gen.o
+
+NFT_LIBS = -lnftnl
+nftjit.ko-objs := tgts.o targets.o tables.o init.o ctor.o gen.o nftables.o 
imr.o
+HOSTLOADLIBES_nftjit.ko = `pkg-config --libs libnftnl libmnl`
+
 HOSTCFLAGS += -I. -Itools/include/
diff --git a/net/bpfilter/nftables.c b/net/bpfilter/nftables.c
new file mode 100644
index ..5a756ccd03a1
--- /dev/null
+++ b/net/bpfilter/nftables.c
@@ -0,0 +1,679 @@
+/*
+ * based on previous code from:
+ *
+ * Copyright (c) 2013 Arturo Borrero Gonzalez <art...@netfilter.org>
+ * Copyright (c) 2013 Pablo Neira Ayuso <pa...@netfilter.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "bpfilter_mod.h"
+#include "imr.h"
+
+/* Hack, we don't link bpfilter.o */
+extern long int syscall (long int __sysno, ...);
+
+int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
+{
+   return syscall(321, cmd, attr, size);
+}
+
+static int seq;
+
+static void memory_allocation_error(void) { perror("allocation failed"); 
exit(1); }
+
+static int nft_reg_to_imr_reg(int nfreg)
+{
+   switch (nfreg) {
+   case NFT_REG_VERDICT:
+   return IMR_REG_0;
+   /* old register numbers, 4 128 bit registers. */
+   case NFT_REG_1:
+   return IMR_REG_4;
+   case NFT_REG_2:
+   return IMR_REG_6;
+   case NFT_REG_3:
+   return IMR_REG_8;
+   case NFT_REG_4:
+   break;
+   /* new register numbers, 16 32 bit registers, map to old ones */
+   case NFT_REG32_00:
+   return IMR_REG_4;
+   case NFT_REG32_01:
+   return IMR_REG_5;
+   case NFT_REG32_02:
+   return IMR_REG_6;
+   default:
+   return -1;
+   }
+   return -1;
+}
+
+static int netlink_parse_immediate(const struct nftnl_expr *nle, void *out)
+{
+   struct imr_state *state = out;
+   struct imr_object *o = NULL;
+
+   if (nftnl_expr_is_set(nle, NFTNL_EXPR_IMM_DATA)) {
+   uint32_t len;
+   int reg;
+
+   nftnl_expr_get(nle, NFTNL_EXPR_IMM_DATA, &len);
+
+   switch (len) {
+   case sizeof(uint32_t):
+   o = imr_object_alloc_imm32(nftnl_expr_get_u32(nle, 
NFTNL_EXPR_IMM_DATA));
+   break;
+   case sizeof(uint64_t):
+   o = imr_object_alloc_imm64(nftnl_expr_get_u64(nle, 
NFTNL_EXPR_IMM_DATA));
+   break;
+   default:
+   return -ENOTSUPP;
+   }
+   reg = nft_reg_to_imr_reg(nftnl_expr_get_u32(nle,
+NFTNL_EXPR_IMM_DREG));
+   if (reg < 0) {
+   imr_object_free(o);
+   return reg;
+   }
+
+   imr_register_store(state, reg, o);
+   return 0;
+   } else if (nftnl_expr_is_set(nle, NFTNL_EXPR_IMM_VERDICT)) {
+   uint32_t verdict;
+   int ret;
+
+   if (nftnl_expr_is_set(nle, NFTNL_EXPR_IMM_CHAIN))
+   return -ENOTSUPP;
+
+verdict = nftnl_expr_get_u32(nle, NFTNL_EXPR_IMM_VERDICT);
+
+   switch (verdict) {
+   case NF_ACCEPT:
+   o = imr_object_alloc_verdict(IMR_VERDICT_PASS);
+   break;
+   case NF_DROP:
+   o = imr_object_alloc_verdict(IMR_VERDICT_DROP);
+   break;
+   default:
+   fpr

[RFC,POC 3/3] bpfilter: switch bpfilter to iptables->IMR translation

2018-03-04 Thread Florian Westphal
Translate basic iptables rule blob to the IMR, then ask
IMR to translate to ebpf.

IMR is shared between nft and bpfilter translators.
iptables_gen_append() is the only relevant function here,
as it demonstrates a simple 'source/destination matches x' test.

Signed-off-by: Florian Westphal <f...@strlen.de>
---
 net/bpfilter/Makefile   |  2 +-
 net/bpfilter/bpfilter_gen.h | 15 +
 net/bpfilter/bpfilter_mod.h | 16 +-
 net/bpfilter/iptables.c | 76 +
 net/bpfilter/iptables.h |  4 +++
 net/bpfilter/sockopt.c  | 73 +--
 6 files changed, 154 insertions(+), 32 deletions(-)
 create mode 100644 net/bpfilter/bpfilter_gen.h
 create mode 100644 net/bpfilter/iptables.c
 create mode 100644 net/bpfilter/iptables.h

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index a4064986dc2f..21a8afb60b7c 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -5,7 +5,7 @@
 
 hostprogs-y := nftjit.ko bpfilter.ko
 always := $(hostprogs-y)
-bpfilter.ko-objs := bpfilter.o tgts.o targets.o tables.o init.o ctor.o 
sockopt.o gen.o
+bpfilter.ko-objs := bpfilter.o tgts.o targets.o tables.o init.o ctor.o 
sockopt.o gen.o iptables.o imr.o
 
 NFT_LIBS = -lnftnl
 nftjit.ko-objs := tgts.o targets.o tables.o init.o ctor.o gen.o nftables.o 
imr.o
diff --git a/net/bpfilter/bpfilter_gen.h b/net/bpfilter/bpfilter_gen.h
new file mode 100644
index ..71c6e8a73e24
--- /dev/null
+++ b/net/bpfilter/bpfilter_gen.h
@@ -0,0 +1,15 @@
+struct bpfilter_gen_ctx {
+   struct bpf_insn *img;
+   u32 len_cur;
+   u32 len_max;
+   u32 default_verdict;
+   int fd;
+   int ifindex;
+   booloffloaded;
+};
+
+int bpfilter_gen_init(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_prologue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_epilogue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_commit(struct bpfilter_gen_ctx *ctx);
+void bpfilter_gen_destroy(struct bpfilter_gen_ctx *ctx);
diff --git a/net/bpfilter/bpfilter_mod.h b/net/bpfilter/bpfilter_mod.h
index b4209985efff..dc3a90df1788 100644
--- a/net/bpfilter/bpfilter_mod.h
+++ b/net/bpfilter/bpfilter_mod.h
@@ -4,6 +4,7 @@
 
 #include "include/uapi/linux/bpfilter.h"
 #include 
+#include "bpfilter_gen.h"
 
 struct bpfilter_table {
struct hlist_node   hash;
@@ -71,26 +72,11 @@ struct bpfilter_target {
u8  rev;
 };
 
-struct bpfilter_gen_ctx {
-   struct bpf_insn *img;
-   u32 len_cur;
-   u32 len_max;
-   u32 default_verdict;
-   int fd;
-   int ifindex;
-   booloffloaded;
-};
-
 union bpf_attr;
 int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
 
-int bpfilter_gen_init(struct bpfilter_gen_ctx *ctx);
-int bpfilter_gen_prologue(struct bpfilter_gen_ctx *ctx);
-int bpfilter_gen_epilogue(struct bpfilter_gen_ctx *ctx);
 int bpfilter_gen_append(struct bpfilter_gen_ctx *ctx,
struct bpfilter_ipt_ip *ent, int verdict);
-int bpfilter_gen_commit(struct bpfilter_gen_ctx *ctx);
-void bpfilter_gen_destroy(struct bpfilter_gen_ctx *ctx);
 
 struct bpfilter_target *bpfilter_target_get_by_name(const char *name);
 void bpfilter_target_put(struct bpfilter_target *tgt);
diff --git a/net/bpfilter/iptables.c b/net/bpfilter/iptables.c
new file mode 100644
index ..055cfa8fbf21
--- /dev/null
+++ b/net/bpfilter/iptables.c
@@ -0,0 +1,76 @@
+#include 
+#include 
+
+typedef uint16_t __sum16; /* hack */
+#include 
+
+#include "bpfilter_mod.h"
+#include "iptables.h"
+#include "imr.h"
+
+static int check_entry(const struct bpfilter_ipt_ip *ent)
+{
+#define M_FF   "\xff\xff\xff\xff"
+   static const __u8 mask1[IFNAMSIZ] = M_FF M_FF M_FF M_FF;
+   static const __u8 mask0[IFNAMSIZ] = { };
+   int ones = strlen(ent->in_iface); ones += ones > 0;
+#undef M_FF
+   if (strlen(ent->out_iface) > 0)
+   return -ENOTSUPP;
+   if (memcmp(ent->in_iface_mask, mask1, ones) ||
+   memcmp(&ent->in_iface_mask[ones], mask0, sizeof(mask0) - ones))
+   return -ENOTSUPP;
+   if ((ent->src_mask != 0 && ent->src_mask != 0x) ||
+   (ent->dst_mask != 0 && ent->dst_mask != 0x))
+   return -ENOTSUPP;
+
+   return 0;
+}
+
+int iptables_gen_append(struct imr_state *state,
+   struct bpfilter_ipt_ip *ent, int verdict)
+{
+   struct imr_object *left, *right, *relop;
+   int ret;
+
+   ret = check_entry(ent);
+   if (ret < 0)
+   return ret;
+   if (ent->src_mask == 0 && ent->ds

[RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer

2018-03-04 Thread Florian Westphal
These patches, which go on top of the 'bpfilter' RFC patches,
demonstrate an nftables to ebpf translation (done in userspace).
In order to not duplicate the ebpf code generation efforts, the rules

iptables -i lo -d 127.0.0.2 -j DROP
and
nft add rule ip filter input ip daddr 127.0.0.2 drop

are first translated to a common intermediate representation, and then
to ebpf, which attaches resulting prog to the XDP hook.

IMR representation is identical in both cases so therefore both
rules result in the same ebpf program.

The IMR currently assumes that translation will always be to ebpf.
As per previous discussion it doesn't consider other targets, so
for instance IMR pseudo-registers map 1:1 to ebpf ones.

The IMR is also supposed to be generic enough to make it easy to convert
'frontend' formats (iptables rule blob, nftables netlink) to it, and
also extend it to cover ip rule, ovs or any other inputs in the future
without the need for major changes to the IMR.

The IMR currently implements the following basic operations:
 - relational (equal, not equal)
 - immediates (32 and 64bit constants)
 - payload with relative addressing (mac, network, transport header)
 - verdict (pass, drop, next rule)

It's still in an early stage, but I think it's good enough as
a proof-of-concept.

Known differences between nftjit.ko and bpfilter.ko:
nftjit.ko currently doesn't run transparently, but that's only
because I wanted to focus on the IMR and get the POC out of the door.

It should be possible to get it transparent via the bpfilter.ko approach.

Next steps for the IMR could be the addition of binary operations for
prefixes ("-d 192.168.0.1/24"); it's also needed e.g. for tcp flag
matching (-p tcp --syn in iptables) and so on.
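
For illustration, both of these lower to a mask-and-compare; the plain C
equivalent of what the generated code would do:

/* -d 192.168.0.1/24 */
static bool match_dst_prefix24(__be32 daddr)
{
	return (daddr & htonl(0xffffff00)) == htonl(0xc0a80000); /* 192.168.0.0/24 */
}

/* -p tcp --syn, i.e. --tcp-flags SYN,RST,ACK,FIN SYN */
static bool match_tcp_syn(u8 flags)
{
	return (flags & (TCPHDR_FIN | TCPHDR_SYN | TCPHDR_RST | TCPHDR_ACK)) ==
	       TCPHDR_SYN;
}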

I'd also be interested in whether XDP is seen as an appropriate
target hook.  AFAICS the XDP and the nftables ingress hook are similar
enough to consider just (re)using the XDP hook to jit the nftables ingress
hook.  The translator could check if the hook is unused, and return
early if some other program is already attached.

Comments welcome, especially wrt. the IMR concept and what might be
next step(s) in moving forward.

The patches are also available via git at
https://git.breakpoint.cc/cgit/fw/net-next.git/log/?h=bpfilter7 .



[ANNOUNCE] nftables 0.8.3 release

2018-03-03 Thread Florian Westphal
Hi!

The Netfilter project proudly presents:

nftables 0.8.3

This release includes a few fixes since the last release plus the following
enhancements:
 - ifname_type, so it's possible to match interface names via sets:

  table inet t {
set s {
  type ifname
  elements = { "eth0",
   "eth1" }
 }
 chain c {
   iifname @s accept
   oifname @s accept
 }
  }

- raw payload support to match headers that do not yet have
a more human-readable mnemonic.  This also allows matching
udp and tcp port numbers in a single rule, because the raw
payload expression doesn't enforce a protocol dependency on
the network header.  Example:

 input meta l4proto {tcp, udp} @th,16,16 { dns, http }

 matches both udp and tcp dport 53 and 80 in a single rule.

See ChangeLog that comes attached to this email for more details.

You can download it from:

http://www.netfilter.org/projects/nftables/downloads.html#nftables-0.8.3
ftp://ftp.netfilter.org/pub/nftables/

To build the code, libnftnl 1.0.9 and libmnl >= 1.0.2 are required:

* http://netfilter.org/projects/libnftnl/index.html
* http://netfilter.org/projects/libmnl/index.html

Visit our wikipage for user documentation at:

* http://wiki.nftables.org

For the manpage reference, check man(8) nft.

In case of bugs and feature request, file them via:

* https://bugzilla.netfilter.org

Happy firewalling!
Arturo Borrero Gonzalez (4):
  nftables: rearrange files and examples
  examples: add ct helper examples
  files: add load balance example
  meta: introduce datatype ifname_type

Baruch Siach (1):
  src: fix build with older glibc

David Fabian (1):
  Added undefine/redefine keywords

Duncan Roe (1):
  doc/nft.xml: fix typo

Florian Westphal (16):
  tests: enable sets test case 27
  tests: add test case for sets updated from packet path
  payload: don't decode past last valid template
  include: fix build failure
  tests: meta.t: fix test case for anonymous set automerge
  payload: use integer_type when initializing a raw expression
  payload: don't resolve expressions using the inet pseudoheader
  src: make raw payloads work
  doc: document raw protocol expression
  tests: add raw payload test cases.
  doc: mention meta l4proto and ipv6 nexthdr issue wrt. extension headers
  doc: remove ipv6 address FIXME
  doc: add example for rule add/delete
  parser: use nf_key_proto
  src: datatype: prefer sscanf, avoid strncpy
  build: Bump version to v0.8.3

Harsha Sharma (2):
  libnftables: don't crash when no commands are specified
  src: Use snprintf() over strncpy()

Laura Garcia Liebana (1):
  parser: support of maps with timeout

Pablo Neira Ayuso (11):
  src: pass family to payload_dependency_kill()
  payload: add payload_dependency_release() helper function
  src: add payload_dependency_exists()
  src: get rid of __payload_dependency_kill()
  payload: add payload_may_dependency_kill()
  netlink_delinearize: add meta_may_dependency_kill()
  src: bail out when exporting ruleset with unsupported output
  segtree: check for overlapping elements at insertion
  tests: shell: regression test for bugzilla 1228
  configure: misc updates
  netlink: remove non-batching routines

Phil Sutter (10):
  evaluate: Enable automerge feature for anonymous sets
  Review switch statements for unmarked fall through cases
  monitor: Make trace events respect output_fp
  monitor: Make JSON/XML output respect output_fp
  cli: Drop pointless check in cli_append_multiline()
  erec: Avoid passing negative offset to fseek()
  evaluate: Fix memleak in stmt_reject_gen_dependency()
  hash: Fix potential null-pointer dereference in hash_expr_cmp()
  netlink: Complain if setting O_NONBLOCK fails
  netlink_delinearize: Fix resource leaks

Ville Skyttä (2):
  configure: Make missing docbook2man an error if man build requested
  src: Spelling fixes



Re: [PATCH] netfilter: use skb_to_full_sk in ip6_route_me_harder

2018-02-25 Thread Florian Westphal
Eric Dumazet  wrote:
> From: Eric Dumazet 
> 
> For some reason, Florian forgot to apply to ip6_route_me_harder
> the fix that went in commit 29e09229d9f2 ("netfilter: use
> skb_to_full_sk in ip_route_me_harder")

Indeed, sorry about that, thanks for taking care of this.


Re: syzcaller patch postings...

2018-02-22 Thread Florian Westphal
Dmitry Vyukov  wrote:
> On Thu, Feb 22, 2018 at 9:26 AM, Paolo Abeni  wrote:
> > On Wed, 2018-02-21 at 16:47 -0500, David Miller wrote:
> >> I have to mention this now before it gets out of control.
> >>
> >> I would like to ask that syzkaller stop posting the patch it is
> >> testing when it posts to netdev.
> >
> > There is an open issue on this topic:
> >
> > https://github.com/google/syzkaller/issues/526
> >
> > The current behaviour is that syzbot replies to all get_maintainer.pl
> > recipients after testing a patch, regardless of the test submission
> > recipient list, the idea was instead to respect such list.
>
> 
> Hi David, Florian, Paolo,
> 
> Didn't realize it triggers patchwork. This wasn't intentional, sorry.
> 
> Do I understand it correctly that if syzbot replies to the CC list
> that was in the testing request, it will resolve the problem? So if
> netdev wasn't in CC, it will not be added to CC.

Yes, that's at least my expected/desired behaviour.
This way I can even CC some other person (maintainer, reporter, etc.)
to have them informed about the test result too.

> I will go and fix it now.

Thank you Dmitry!


Re: syzcaller patch postings...

2018-02-21 Thread Florian Westphal
David Miller  wrote:
> I have to mention this now before it gets out of control.
> 
> I would like to ask that syzkaller stop posting the patch it is
> testing when it posts to netdev.

Same for netfilter-devel.

I could not get a reproducer to trigger and asked syzbot to test
the patch (great feature, thanks!) -- I did not CC any mailing list.

So I was very surprised syzbot announced the test result by adding
back all CCs from the original report.  I think it's better to
announce the result to the patch author only.



Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-21 Thread Florian Westphal
Pablo Neira Ayuso  wrote:
> On Tue, Feb 20, 2018 at 05:52:54PM -0800, Alexei Starovoitov wrote:
> > On Tue, Feb 20, 2018 at 11:44:31AM +0100, Pablo Neira Ayuso wrote:
> > > 
> > >   Don't get me wrong, no software is safe from security issues, but if you
> > >   don't abstract your resources in the right way, you have more chance to
> > >   have experimence more problems.
> > 
> > interesting point.
> > The key part of iptables and nft design is heavy use of indirect calls.
> > The execution of single iptable rule is ~3 indirect calls.
> > Quite a lot worse in nft where every expression is an indirect call.
> 
> That's right. Netfilter is probably too modular, probably we can
> revisit this to find a better balance, actually Felix Fietkau was
> recently rising concerns on this, specifically in environments with
> limited space to store the kernel image. We'll have a look, thanks for
> remind us about this.

Agree, we have too many config knobs, probably a good idea to turn some
modules into plain .o (like cmp and bitwise).

Obvious candidates are: meta, numgen, limit, objref, quota, reject.

We should probably also consider removing
CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always
build both too (at least rbtree since that offers interval).

For the indirect call issue we can use direct calls from the eval loop for
some of the more frequently used ones, similar to what we do already
for nft_cmp_fast_expr (see the sketch below).  But maybe we don't even
have to if we can get help to build a jitter that takes an nftables
netlink table dump and builds jit code from that.
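To illustrate the direct-call idea, a rough userspace sketch (all names
below are made up; in the kernel, nft_do_chain special-cases
nft_cmp_fast_expr in exactly this way):

#include <stdio.h>

/* Devirtualize the most common expression type by checking its ops
 * pointer in the eval loop, keeping the indirect call only as fallback. */

struct expr;

struct expr_ops {
    void (*eval)(const struct expr *e);
};

struct expr {
    const struct expr_ops *ops;
};

static void cmp_fast_eval(const struct expr *e)
{
    (void)e;
    puts("fast path: direct call");
}

static void generic_eval(const struct expr *e)
{
    (void)e;
    puts("slow path: indirect call");
}

static const struct expr_ops cmp_fast_ops = { .eval = cmp_fast_eval };
static const struct expr_ops generic_ops  = { .eval = generic_eval };

static void eval_one(const struct expr *e)
{
    if (e->ops == &cmp_fast_ops)
        cmp_fast_eval(e);   /* known type: direct (inlineable) call */
    else
        e->ops->eval(e);    /* everything else: indirect call */
}

int main(void)
{
    const struct expr exprs[] = {
        { .ops = &cmp_fast_ops },
        { .ops = &generic_ops },
    };

    for (unsigned int i = 0; i < sizeof(exprs) / sizeof(exprs[0]); i++)
        eval_one(&exprs[i]);
    return 0;
}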


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Florian Westphal
David Miller  wrote:
> From: Phil Sutter 
> Date: Mon, 19 Feb 2018 18:14:11 +0100
> 
> > OK, so reading between the lines you're saying that nftables project
> > has failed to provide an adequate successor to iptables?
> 
> Whilst it is great that the atomic table update problem was solved, I
> think the emphasis on flexibility often at the expense of performance
> was a bad move.

That's not true, IMO.

One idea previously discussed was to add a 'freeze' option
to our nftables syntax.  Essentially what would happen is that further
updates to the table become impossible, with the exception of named sets
(which can be changed independently, similar to ebpf maps I suppose).

As further updates to the table are then no longer allowed this would
then make it possible to e.g. jit all rules into a single program.

The table could still be removed (and recreated) of course, so it's
not impossible to make changes, but no longer at the rule level.

> Netfilter's chronic performance differential is why a lot of mindshare
> was lost to userspace networking technologies.

I think this is an unfair statement and also not true.
If you refer to the linear ruleset evaluation of iptables, this is
what ipset was added for.

Yes, it's a band aid.  But again, that problem comes from the UAPI
format/limitation of only having one source or destination address per
rule, a limitation not present in nftables.

Another reason why iptables is a bit more costly than needed (although it
IS rather fast, given there are no spinlocks in the main eval loop) is the
rule counter updates which were built into the design all those years ago.

Again, a problem solved in nftables by making the counters optional.

If you want to speed up the forward path with XDP -- fine.
But AFAIU it's still possible with XDP to have packets being sent to
the full stack, right?

If so, it would be possible to even combine nftables with XDP, f.e.
by allowing an ebpf program running on the host CPU to query netfilter
conntrack.

No Entry -> push to normal path
Entry -> check 'fastpath' flag (which would be in nf_conn struct).
Not set -> also normal path.
Otherwise continue XDP, stack bypass.
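
Purely as illustration, the decision above in plain C (this assumes a
conntrack lookup helper for XDP that does not exist yet; the struct and
flag names are stand-ins for struct nf_conn and the label set by the
rule below):

#include <stdbool.h>
#include <stdio.h>

struct fake_conn {          /* stand-in for struct nf_conn */
    bool fastpath;          /* stand-in for the 'fastpath' label/flag */
};

enum decision {
    TO_STACK,       /* hand the packet to the normal path */
    BYPASS_STACK,   /* keep processing in XDP, skip the stack */
};

static enum decision classify(const struct fake_conn *ct)
{
    if (!ct)                /* no entry -> normal path */
        return TO_STACK;
    if (!ct->fastpath)      /* entry, flag not set -> normal path */
        return TO_STACK;
    return BYPASS_STACK;    /* entry with fastpath flag -> stack bypass */
}

int main(void)
{
    struct fake_conn established = { .fastpath = true };
    struct fake_conn fresh = { .fastpath = false };

    printf("%d %d %d\n",
           classify(NULL), classify(&fresh), classify(&established));
    return 0;
}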

nftables would have a rule similar to this:
nft add rule inet forward ct state established ct label set fastpath
to switch such conntrack to xdp mode.

This decision can then be combined with nftables infra,
for example 'fastpath for tcp flows that saw more than 1mbit of data
in either direction' or the like.

Yes, this needs ebpf support for conntrack and NAT transformations,
and it does beg the question of how to handle the details, e.g. conntrack
timeouts.  I don't see any unsolvable issues with this though.

This also has similarities with the 'flow offload' proposal, i.e. we
could perhaps even reuse what we already have to provide flow
offload in software using ebpf/XDP as the offload backend.


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Florian Westphal
David Miller <da...@davemloft.net> wrote:
> From: Florian Westphal <f...@strlen.de>
> Date: Mon, 19 Feb 2018 15:53:14 +0100
> 
> > Sure, but looking at all the things that were added to iptables
> > to alleviate some of the issues (ipset for instance) show that we need a
> > meaningful re-design of how things work conceptually.
> 
> As you said iptables is in maintainenance mode.
> 
> But there are millions upon millions of users, like it or not, and
> they aren't going away for decades.  And this is the iptables binary
> ABI I'm talking about, not the iptables user command line interface.

I know.

> my house?"  Please see further than the view inside your home. 
> 
> By in large, we are stuck with iptables's data path for an extremely
> long time.

So?

> Major data centers doesn't even enable NFTABLES in their kernels, and
> there is nothing you can do about that in the short to medium term.

So?

> Therefore, for all of the beneficial reasons I have discussed we
> should make that datapath as aligned and integrated with our core
> important technologies as possible, so that they can benefit from any
> and all improvements in that area rather than just collecting dust.

See my other mail, where I explained, in great detail, the problems
of the xtables UAPI.

If you go through with this and eventually somehow get feature parity,
all of the problems remain in full effect.
You will also need to replicate the translation efforts that already
went into nftables.  The translator wasn't yet a high priority as we
lacked some features but this can be changed now that nft is catching
up.

The userspace program expectation is for iptables to be like the fib, for
instance, i.e. you can add and remove without stomping on each other's
feet.  You are setting this in stone.

You're also adding a way to make it so that I can delete entries from
the fib (bpfilter) but iproute2 will still show all entries (iptables
legacy).


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Florian Westphal
David Miller <da...@davemloft.net> wrote:
> From: Florian Westphal <f...@strlen.de>
> Date: Mon, 19 Feb 2018 15:59:35 +0100
> 
> > David Miller <da...@davemloft.net> wrote:
> >> It also means that the scope of developers who can contribute and work
> >> on the translater is much larger.
> > 
> > How so?  Translator is in userspace in nftables case too?
> 
> Florian, first of all, the whole "change the iptables binary" idea is
> a non-starter.  For the many reasons I have described in the various
> postings I have made today.
> 
> It is entirely impractical.

???
You suggest:

iptables -> setsockopt -> umh (xtables -> ebpf) -> kernel

How is this different from

iptables -> setsockopt -> umh (xtables -> nftables) -> kernel

?
EBPF can be placed within nftables in either userspace or the kernel;
there is nothing that prevents this.

> Anything designed in that nature must be distributed completely in the
> kernel tree, so that the iptables kernel ABI is provided without any
> externel dependencies.

Would you be willing to merge nftables into the kernel tools directory then?


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Florian Westphal
David Miller  wrote:
> From: Daniel Borkmann 
> Date: Mon, 19 Feb 2018 13:03:17 +0100
> 
> > Thought was that it would be more suitable to push all the complexity of
> > such translation into user space which brings couple of additional 
> > advantages
> > as well: the translation can become very complex and thus it would contain
> > all of it behind syscall boundary where natural path of loading programs
> > would go via verifier. Given the tool would reside in user space, it would
> > also allow to ease development and testing can happen w/o recompiling the
> > kernel. It would allow for all the clang sanitizers to run there and for
> > having a comprehensive test suite to verify and dry test translations 
> > against
> > traffic test patterns (e.g. bpf infra would provide possibilities on this
> > w/o complex setup). Given normal user mode helpers make this rather painful
> > since they need to be shipped as extra package by the various distros, the
> > idea was that the module loader back end could treat umh similarly as kernel
> > modules and hook them in through request_module() approach while still
> > operating out of user space. In any case, I could image this approach might
> > be interesting and useful in general also for other subsystems requiring
> > umh in one way or another.
> 
> Yes, this is a very powerful new facility.
> 
> It also means that the scope of developers who can contribute and work
> on the translater is much larger.

How so?  The translator is in userspace in the nftables case too?


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Florian Westphal
David Miller  wrote:
> > How many of those wide-spread applications are you aware of?  The two
> > projects you have pointed out (docker and kubernetes) don't. As the
> > assumption that many such tools would need to be supported drives a lot
> > of the design decisions, I would argue one needs a solid empircal basis.
> 
> I see talk about "just replacing the iptables binary".
> 
> A long time ago in a galaxy far far away, that would be a reasonable
> scheme.  But that kind of approach won't work in the realities of
> today.
> 
> You aren't going to be able to replace the iptables binary in the tens
> of thousands of container images out there, nor the virtualization
> installations either.

Why would you have to?
The iptables kernel parts are still maintained; it's not dead code that
stands in the way.

We can leave it alone, in maintenance mode, just fine.

> Like it or not iptables ABI based filtering is going to be in the data
> path for many years if not a decade or more to come.  iptables is a
> victim of it's own success, like it or not :-) Yes, the ABI is
> terrible, but obviously it was useful enough for lots of people.

Sure, but looking at all the things that were added to iptables
to alleviate some of the issues (ipset for instance) shows that we need a
meaningful re-design of how things work conceptually.

The umh helper translation that has been proposed could be applied to
transparently xlate iptables to nftables (or e.g. iptables compat32 to
iptables64), i.e. the legacy binary talks to the kernel, the kernel invokes
the umh, and the umh generates nftables netlink messages.

But I don't even see a need to do this; I don't think it's an issue
to leave it in the tree even for another decade or more if need be.


[PATCH nf] netfilter: bridge: ebt_among: add missing match size checks

2018-02-18 Thread Florian Westphal
ebt_among is special, it has a dynamic match size and is exempt
from the central size checks.

Therefore it must check that the size of the match structure
provided from userspace is sane by making sure em->match_size
is at least the minimum size of the expected structure.

The module has such a check, but it's only done after accessing
a structure that might be out of bounds.

tested with: ebtables -A INPUT ... \
--among-dst fe:fe:fe:fe:fe:fe
--among-dst fe:fe:fe:fe:fe:fe --among-src 
fe:fe:fe:fe:ff:f,fe:fe:fe:fe:fe:fb,fe:fe:fe:fe:fc:fd,fe:fe:fe:fe:fe:fd,fe:fe:fe:fe:fe:fe
--among-src 
fe:fe:fe:fe:ff:f,fe:fe:fe:fe:fe:fa,fe:fe:fe:fe:fe:fd,fe:fe:fe:fe:fe:fe,fe:fe:fe:fe:fe:fe

Reported-by: <syzbot+fe0b19af568972814...@syzkaller.appspotmail.com>
Signed-off-by: Florian Westphal <f...@strlen.de>
---
 net/bridge/netfilter/ebt_among.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/net/bridge/netfilter/ebt_among.c b/net/bridge/netfilter/ebt_among.c
index ce7152a12bd8..c5afb4232ecb 100644
--- a/net/bridge/netfilter/ebt_among.c
+++ b/net/bridge/netfilter/ebt_among.c
@@ -172,18 +172,35 @@ ebt_among_mt(const struct sk_buff *skb, struct xt_action_param *par)
return true;
 }
 
+static bool poolsize_invalid(const struct ebt_mac_wormhash *w)
+{
+   return w && w->poolsize >= (INT_MAX / sizeof(struct 
ebt_mac_wormhash_tuple));
+}
+
 static int ebt_among_mt_check(const struct xt_mtchk_param *par)
 {
const struct ebt_among_info *info = par->matchinfo;
const struct ebt_entry_match *em =
container_of(par->matchinfo, const struct ebt_entry_match, 
data);
-   int expected_length = sizeof(struct ebt_among_info);
+   unsigned int expected_length = sizeof(struct ebt_among_info);
const struct ebt_mac_wormhash *wh_dst, *wh_src;
int err;
 
+   if (expected_length > em->match_size)
+   return -EINVAL;
+
wh_dst = ebt_among_wh_dst(info);
-   wh_src = ebt_among_wh_src(info);
+   if (poolsize_invalid(wh_dst))
+   return -EINVAL;
+
expected_length += ebt_mac_wormhash_size(wh_dst);
+   if (expected_length > em->match_size)
+   return -EINVAL;
+
+   wh_src = ebt_among_wh_src(info);
+   if (poolsize_invalid(wh_src))
+   return -EINVAL;
+
expected_length += ebt_mac_wormhash_size(wh_src);
 
if (em->match_size != EBT_ALIGN(expected_length)) {
-- 
2.16.1



[PATCH nf] netfilter: ebtables: CONFIG_COMPAT: don't trust userland offsets

2018-02-18 Thread Florian Westphal
We need to make sure the offsets are not out of range of the
total size.
Also check that they are in ascending order.

The WARN_ON triggered by syzkaller (it sets panic_on_warn) is
changed to also bail out, no point in continuing parsing.

Briefly tested with simple ruleset of
-A INPUT --limit 1/s' --log
plus jump to custom chains using 32bit ebtables binary.

Reported-by: <syzbot+845a53d13171abf8b...@syzkaller.appspotmail.com>
Signed-off-by: Florian Westphal <f...@strlen.de>
---
 net/bridge/netfilter/ebtables.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 02c4b409d317..3f536c7a3354 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -2053,7 +2053,9 @@ static int ebt_size_mwt(struct compat_ebt_entry_mwt *match32,
if (match_kern)
match_kern->match_size = ret;
 
-   WARN_ON(type == EBT_COMPAT_TARGET && size_left);
+   if (WARN_ON(type == EBT_COMPAT_TARGET && size_left))
+   return -EINVAL;
+
match32 = (struct compat_ebt_entry_mwt *) buf;
}
 
@@ -2109,6 +2111,15 @@ static int size_entry_mwt(struct ebt_entry *entry, const unsigned char *base,
 *
 * offsets are relative to beginning of struct ebt_entry (i.e., 0).
 */
+   for (i = 0; i < 4 ; ++i) {
+   if (offsets[i] >= *total)
+   return -EINVAL;
+   if (i == 0)
+   continue;
+   if (offsets[i-1] > offsets[i])
+   return -EINVAL;
+   }
+
for (i = 0, j = 1 ; j < 4 ; j++, i++) {
struct compat_ebt_entry_mwt *match32;
unsigned int size;
-- 
2.16.1



Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-18 Thread Florian Westphal
Daniel Borkmann  wrote:
> As rule translation can potentially become very complex, this is performed
> entirely in user space. In order to ease deployment, request_module() code
> is extended to allow user mode helpers to be invoked. Idea is that user mode
> helpers are built as part of the kernel build and installed as traditional
> kernel modules with .ko file extension into distro specified location,
> such that from a distribution point of view, they are no different than
> regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) binaries via:
> 
>   request_module("foo") ->
> call_umh("modprobe foo") ->
>   sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)
> 
> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root)

Unrelated:  AFAIU this would allow us to e.g. move the compat32 handlers
(which are very ugly/error-prone) off to userspace?

compat_syscall -> umh_32_64_xlate -> syscall() ?

[ feel free to move this to different thread, only mentioning this
  so I won't forget ]


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal
Harald Welte  wrote:
> I believe _if_ one wants to use the approach of "hiding" eBPF behind
> iptables, then either
[..]
> b) you must introduce new 'tables', like an 'xdp' table which then has
>the notion of processing very early in processing, way before the
>normal filter table INPUT processing happens.

In nftables, the netdev ingress hook location could be used for this,
but right, iptables has no equivalent.

netdev ingress is interesting from a hw-offload point of view:
unlike all other netfilter hooks it's tied to a specific network interface
rather than owned by the network namespace.

A rule like (yes, I am making this up)
limit 1 byte/s

cannot be offloaded because it affects all packets going through
the system, i.e. you'd need to share state among all nics, which I think
won't work :-)

Same goes for any other match/target that somehow contains (global)
state and was added to the 'classic' iptables hook points.
(exception: rule restricts interface via '-i foo').


Note well: "offloaded != ebpf" in this case.

I see no reasons why ebpf cannot be used in either iptables or
nftables.  How to get there is obviously a different beast.

For iptables, I think we should put it in maintenance mode and
focus on nftables, for many reasons outlined in other replies.

And how to best make use of ebpf+nftables:

In an ideal world, nftables would have used (e)bpf from the start.
But, well, it's not an ideal world (iirc nft's origins are just a bit
too old).

That doesn't mean that we can't leverage ebpf from nftables.
It's just a question of where it makes sense and where it doesn't,
f.e. I see no reason to replace C code with ebpf just 'because you can'.

Speedup?  Good argument.
Feature enhancements that could use ebpf programs? Another good
argument.

I guess there are a lot more.

So I'd like to second Harald's question.

What is the main goal?

For nftables, I believe most important ones are:
- make kernel keeper/owner of all rules
- allow userspace to learn of rule addition/deletion
- provide fast matching (no linear evaluation of rules,
native sets with jump and verdict maps)
- provide a single tool instead of ip/ip6/arp/ebtables
- unified ipv4/ipv6 matching
- backwards compat and/or translation infrastructure


But once these are reached, we will hopefully have more:
- offloading (hardware)
- speedup via JIT compilation
- feature enhancements such as matching arbitrary packet
contents

I suspect you see that ebpf might be a fit and/or help us with
all of these things.

So, once I understand what your goals are I might be better able
to see how nftables could fit into the picture; as you can see,
I did a lot of guesswork :-)


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal
Florian Westphal <f...@strlen.de> wrote:
> David Miller <da...@davemloft.net> wrote:
> > From: Florian Westphal <f...@strlen.de>
> > Date: Fri, 16 Feb 2018 17:14:08 +0100
> > 
> > > Any particular reason why translating iptables rather than nftables
> > > (it should be possible to monitor the nftables changes that are
> > >  announced by kernel and act on those)?
> > 
> > As Daniel said, iptables is by far the most deployed of the two
> > technologies.  Therefore it provides the largest environment for
> > testing and coverage.
> 
> Right, but the approach of hooking old blob format comes with
> lots of limitations that were meant to be resolved with a netlink based
> interface which places kernel in a position to mediate all transactions
> to the rule database (which isn't fixable with old setsockopt format).
> 
> As all programs call iptables(-restore) or variants translation can
> be done in userspace to nftables so api spoken is nfnetlink.
> Such a translator already exists and can handle some cases already:
> 
> nft flush ruleset
> nft list ruleset | wc -l
> 0
> xtables-compat-multi iptables -A INPUT -i eth0 -m conntrack --ctstate 
> ESTABLISHED,RELATED -j ACCEPT
> xtables-compat-multi iptables -A REJECT_LOG -i eth0 -p tcp --tcp-flags 
> SYN,ACK SYN --dport 22:80 -m limit --limit 1/sec -j LOG --log-prefix 
> "RejectTCPConnectReq"

to be fair, for these two I had to use
$(xtables-compat-multi iptables-translate -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT)

The reason is that the 'iptables-translate' part nowadays has way more
translations available (nft gained many features since the
iptables-compat layer was added).

If given appropriate priority, however, it should be pretty
trivial to make the 'translate' descriptions available in
the 'direct' version; we already have a function in libnftables
to execute/run a command directly from a buffer, so this would
not even need fork/execve overhead (although I don't think
it's a big concern).

> (f.e. nftables misses some selinux matches/targets for netlabel so we 
> obviously
> can't translate this, same for ipsec sa/policy matching -- but this isn't
> impossible to resolve).

I am working on some poc code for the sa/policy thing now.


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal
David Miller <da...@davemloft.net> wrote:
> From: Florian Westphal <f...@strlen.de>
> Date: Fri, 16 Feb 2018 17:14:08 +0100
> 
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> >  announced by kernel and act on those)?
> 
> As Daniel said, iptables is by far the most deployed of the two
> technologies.  Therefore it provides the largest environment for
> testing and coverage.

Right, but the approach of hooking the old blob format comes with
lots of limitations that were meant to be resolved with a netlink-based
interface which places the kernel in a position to mediate all transactions
to the rule database (which isn't fixable with the old setsockopt format).

As all programs call iptables(-restore) or variants, translation to
nftables can be done in userspace, so the API spoken is nfnetlink.
Such a translator already exists and can handle some cases already:

nft flush ruleset
nft list ruleset | wc -l
0
xtables-compat-multi iptables -A INPUT -s 192.168.0.24 -j ACCEPT
xtables-compat-multi iptables -A INPUT -s 192.168.0.0/16 -p tcp --dport 22 -j ACCEPT
xtables-compat-multi iptables -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
xtables-compat-multi iptables -A INPUT -p icmp -j ACCEPT
xtables-compat-multi iptables -N REJECT_LOG
xtables-compat-multi iptables -A REJECT_LOG -i eth0 -p tcp --tcp-flags SYN,ACK SYN --dport 22:80 -m limit --limit 1/sec -j LOG --log-prefix "RejectTCPConnectReq"
xtables-compat-multi iptables -A REJECT_LOG -j DROP
xtables-compat-multi iptables -A INPUT -j REJECT_LOG

nft list ruleset
table ip filter {
chain INPUT {
type filter hook input priority 0; policy accept;
ip saddr 192.168.0.24 counter packets 0 bytes 0 accept
ip saddr 192.168.0.0/16 tcp dport 22 counter accept
iifname "eth0" ct state related,established counter accept
ip protocol icmp counter packets 0 bytes 0 accept
counter packets 0 bytes 0 jump REJECT_LOG
}

chain FORWARD {
type filter hook forward priority 0; policy accept;
}

chain OUTPUT {
type filter hook output priority 0; policy accept;
}

chain REJECT_LOG {
iifname "eth0" tcp dport 22-80 tcp flags & (syn | ack) == syn limit rate 1/second burst 5 packets counter packets 0 bytes 0 log prefix "RejectTCPConnectReq"
counter packets 0 bytes 0 drop
}
}

and, while the 'iptables' rules were added, nft monitor in a different terminal:
nft monitor
add table ip filter
add chain ip filter INPUT { type filter hook input priority 0; policy accept; }
add chain ip filter FORWARD { type filter hook forward priority 0; policy accept; }
add chain ip filter OUTPUT { type filter hook output priority 0; policy accept; }
add rule ip filter INPUT ip saddr 192.168.0.24 counter packets 0 bytes 0 accept
# new generation 9893 by process 7471 (xtables-compat-)
add rule ip filter INPUT ip saddr 192.168.0.0/16 tcp dport 22 counter accept
# new generation 9894 by process 7504 (xtables-compat-)
add rule ip filter INPUT iifname "eth0" ct state related,established counter accept
# new generation 9895 by process 7528 (xtables-compat-)
add rule ip filter INPUT ip protocol icmp counter packets 0 bytes 0 accept
# new generation 9896 by process 7542 (xtables-compat-)
add chain ip filter REJECT_LOG
# new generation 9897 by process 7595 (xtables-compat-)
add rule ip filter REJECT_LOG iifname "eth0" tcp dport 22-80 tcp flags & (syn | ack) == syn limit rate 1/second burst 5 packets counter packets 0 bytes 0 log prefix "RejectTCPConnectReq"
# new generation 9898 by process 7639 (xtables-compat-)
add rule ip filter REJECT_LOG counter packets 0 bytes 0 drop
# new generation 9899 by process 7657 (xtables-compat-)
add rule ip filter INPUT counter packets 0 bytes 0 jump REJECT_LOG
# new generation 9900 by process 7663 (xtables-compat-)

Now, does this work in all cases?

Unfortunately not -- this is still work-in-progress, so I would
not rm /sbin/iptables and replace it with a link to xtables-compat-multi just 
yet.

(f.e. nftables misses some selinux matches/targets for netlabel so we obviously
can't translate this, same for ipsec sa/policy matching -- but this isn't
impossible to resolve).

Hopefully this does show that at least some commonly used features work
and that we've come a long way to make a seamless nftables transition happen.


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal
Daniel Borkmann <dan...@iogearbox.net> wrote:
> Hi Florian,
> 
> On 02/16/2018 05:14 PM, Florian Westphal wrote:
> > Florian Westphal <f...@strlen.de> wrote:
> >> Daniel Borkmann <dan...@iogearbox.net> wrote:
> >> Several questions spinning at the moment, I will probably come up with
> >> more:
> > 
> > ... and here there are some more ...
> > 
> > One of the many pain points of xtables design is the assumption of 'used
> > only by sysadmin'.
> > 
> > This has not been true for a very long time, so by now iptables has
> > this userspace lock (yes, its fugly workaround) to serialize concurrent
> > iptables invocations in userspace.
> > 
> > AFAIU the translate-in-userspace design now brings back the old problem
> > of different tools overwriting each others iptables rules.
> 
> Right, so the behavior would need to be adapted to be exactly the same,
> given all the requests go into kernel space first via the usual uapis,
> I don't think there would be anything in the way of keeping that as is.

Uff.  This isn't solvable.  At least that's what I tried to say here.
This is a limitation of the xtables setsockopt interface design.

If $docker (or anything else) adds a new rule using plain iptables, other
daemons are not aware of it.

If someone deletes a rule added by $software, it won't learn that either.

The "solutions" in place now (periodic reloads, 'is my rule still in
place' checks, etc.) are not desirable long-term.

You'll also need 4 decoders for arp/ip/ip6/ebtables plus translations
for all matches and targets xtables currently has (almost 100, I would
guess from a quick glance).

Some of the crazier ones also have external user-visible interfaces
outside setsockopt (proc files, ipset).

> > One of the nftables advantages is that (since rule representation in
> > kernel is black-box from userspace point of view) is that the kernel
> > can announce add/delete of rules or elements from nftables sets.
> > 
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> >  announced by kernel and act on those)?
> 
> Yeah, correct, this should be possible as well. We started out with the
> iptables part in the demo as the majority of bigger infrastructure projects
> all still rely heavily on it (e.g. docker, k8s to just name two big ones).

Yes, which is why we have translation tools in place.

Just for the fun of it I tried to delete the ip/ip6tables binaries on my
fedora27 laptop and replace them with symlinks to
'xtables-compat-multi'.

Aside from two issues (SELinux denying 'iptables' the use of netlink) and
one translation issue (-m rpfilter, which can be translated in the current
upstream version), this works out of the box: the translator uses the
nftables API to the kernel (so the kernel doesn't even know which program is
talking...), 'nft monitor' displays the rules being added, and
'nft list ruleset' shows the default firewalld ruleset.

Obviously there are a few limitations, for instance ip6tables-save will
stop working once you add nft-based rules that use features that cannot
be expressed in xtables syntax (it will throw an error message similar
to 'you are using nftables features not available in xtables, please use
nft'), for instance verdict maps, sets and the like.

> Usually they have their requests to iptables baked into their code directly
> which probably won't change any time soon, so thought was that they could
> benefit initially from it once there would be sufficient coverage.

See above, the translator covers most basic use cases nowadays.
The more extreme cases are not covered because we were reluctant to
provide an equivalent in nftables (-m time comes to mind, which was always a
PITA because the kernel has no notion of timezones or DST transitions,
leading to 'magic' mismatches when the timezone changes...).

I could expand on more problem cases, but none of them are too
important, I think.

If you'd like to have more ebpf users in the kernel, then there is at
least one use case where ebpf could be very attractive for nftables
(matching dynamic headers and the like).  This would be a new
feature and would need changes on the nftables userspace side
as well (we don't have syntax/grammar to represent this in either
nft or iptables).

In its most basic form, it would be the nftables replacement for '-m string'
(and perhaps also -m bpf to some degree, depending on how it would be
realized).
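
For reference, a tiny userspace sketch of the semantics such a payload
match would have ('-m string' itself is built on the kernel textsearch
API, not on memmem(); the names below are made up):

#define _GNU_SOURCE     /* for memmem() */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* report whether 'pattern' occurs anywhere in the packet payload */
static bool payload_contains(const unsigned char *payload, size_t plen,
                             const char *pattern, size_t patlen)
{
    if (patlen == 0 || patlen > plen)
        return false;
    return memmem(payload, plen, pattern, patlen) != NULL;
}

int main(void)
{
    const unsigned char pkt[] = "GET /index.html HTTP/1.1\r\nHost: example\r\n";

    printf("%d\n", payload_contains(pkt, sizeof(pkt) - 1, "Host:", 5));
    return 0;
}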

We can discuss this more if there is interest, but I think it
would be more suitable for a conference / face-to-face discussion.


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-16 Thread Florian Westphal
Florian Westphal <f...@strlen.de> wrote:
> Daniel Borkmann <dan...@iogearbox.net> wrote:
> Several questions spinning at the moment, I will probably come up with
> more:

... and here there are some more ...

One of the many pain points of the xtables design is the assumption of 'used
only by the sysadmin'.

This has not been true for a very long time, so by now iptables has
this userspace lock (yes, it's a fugly workaround) to serialize concurrent
iptables invocations in userspace.

AFAIU the translate-in-userspace design now brings back the old problem
of different tools overwriting each other's iptables rules.

Another question -- am I correct in that each rule manipulation would
incur a 'recompilation'?  Or are there different mini programs chained
together?

One of the nftables advantages (since the rule representation in the
kernel is a black box from the userspace point of view) is that the kernel
can announce add/delete of rules or elements from nftables sets.

Any particular reason why translating iptables rather than nftables
(it should be possible to monitor the nftables changes that are
 announced by kernel and act on those)?


  1   2   3   4   5   6   7   8   9   >