Re: [net-next, RFC, 4/8] net: core: add recycle capabilities on skbs via page_pool API

2018-12-08 Thread Florian Westphal
Jesper Dangaard Brouer  wrote:
> From: Ilias Apalodimas 
> 
> This patch is changing struct sk_buff, and is thus per-definition
> controversial.
> 
> Place a new member 'mem_info' of type struct xdp_mem_info, just after
> members (flags) head_frag and pfmemalloc, and not in between
> headers_start/end to ensure skb_copy() and pskb_copy() work as-is.
> Copying mem_info during skb_clone() is required.  This makes sure that
> pages are correctly freed or recycled during the altered
> skb_free_head() invocation.

I read this to mean that this 'info' isn't accessed/needed until the skb
is freed.  Any reason it's not added at the end?

This would avoid moving other fields that are probably accessed
more frequently during processing.
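
For illustration only, a userspace toy (field names invented, not the real
struct sk_buff layout) showing how inserting a member early in a structure
pushes later members to higher offsets, possibly across a 64-byte cache
line boundary:

/* toy_layout.c -- illustrative only, not the real sk_buff.
 * Build with: gcc -Wall -o toy_layout toy_layout.c
 */
#include <stdio.h>
#include <stddef.h>

struct toy_skb_before {
	void *head, *data;		/* 16 bytes on 64-bit */
	char flags_and_misc[44];
	unsigned int hot_field;		/* offset 60: cache line 0 */
};

struct toy_skb_after {
	void *head, *data;
	unsigned long new_member;	/* inserted early */
	char flags_and_misc[44];
	unsigned int hot_field;		/* offset 68: cache line 1 */
};

int main(void)
{
	printf("before: hot_field at %zu (cache line %zu)\n",
	       offsetof(struct toy_skb_before, hot_field),
	       offsetof(struct toy_skb_before, hot_field) / 64);
	printf("after:  hot_field at %zu (cache line %zu)\n",
	       offsetof(struct toy_skb_after, hot_field),
	       offsetof(struct toy_skb_after, hot_field) / 64);
	return 0;
}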


[PATCH ipsec-next] xfrm: policy: fix policy hash rebuild

2018-11-27 Thread Florian Westphal
Dan Carpenter reports the following static checker warning:
 net/xfrm/xfrm_policy.c:1316 xfrm_hash_rebuild()
 warn: 'dir' is out of bounds '3' vs '2'

 |  1280  /* reset the bydst and inexact table in all directions */
 |  1281  xfrm_hash_reset_inexact_table(net);
 |  1282
 |  1283  for (dir = 0; dir < XFRM_POLICY_MAX; dir++) {
 |  ^
 |dir == XFRM_POLICY_MAX at the end of this loop.
 |  1304  /* re-insert all policies by order of creation */
 |  1305  list_for_each_entry_reverse(policy, &net->xfrm.policy_all, walk.all) {
 [..]
 |  1314  xfrm_policy_id2dir(policy->index));
 |  1315  if (!chain) {
 |  1316  void *p = xfrm_policy_inexact_insert(policy, dir, 0);

Fix this by updating 'dir' based on the current policy.  Otherwise, the
inexact policies won't be found anymore during lookup, as they get
hashed to a bogus bin.

Reported-by: Dan Carpenter 
Fixes: cc1bb845adc9 ("xfrm: policy: return NULL when inexact search needed")
Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index ddc3335dd552..be04091eb7db 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1303,15 +1303,16 @@ static void xfrm_hash_rebuild(struct work_struct *work)
 
/* re-insert all policies by order of creation */
	list_for_each_entry_reverse(policy, &net->xfrm.policy_all, walk.all) {
-   if (policy->walk.dead ||
-   xfrm_policy_id2dir(policy->index) >= XFRM_POLICY_MAX) {
+   if (policy->walk.dead)
+   continue;
+   dir = xfrm_policy_id2dir(policy->index);
+   if (dir >= XFRM_POLICY_MAX) {
/* skip socket policies */
continue;
}
newpos = NULL;
		chain = policy_hash_bysel(net, &policy->selector,
- policy->family,
- xfrm_policy_id2dir(policy->index));
+ policy->family, dir);
if (!chain) {
void *p = xfrm_policy_inexact_insert(policy, dir, 0);
 
-- 
2.18.1



Re: [RFC PATCH 0/3] sk_buff: add skb extension infrastructure

2018-11-26 Thread Florian Westphal
David Miller  wrote:
> From: Florian Westphal 
> Date: Mon, 26 Nov 2018 22:19:33 +0100
> 
> >> In the future please document what is so enormous and absolutely
> >> required that they must put it all into the SKB control block.
> > 
> > Ok, will do.
> 
> Why don't we establish the details about what MP-TCP needs in the CB
> before we discuss this further.
> 
> Could you do that for us?

Sure.

There is also an MPTCP conf call this week and I'll present this
patchset there as well.

I'll post the requirements and what possible alternatives have
been considered here as a follow-up.

Thanks,
Florian


Re: [RFC PATCH 0/3] sk_buff: add skb extension infrastructure

2018-11-26 Thread Florian Westphal
David Miller  wrote:
> > This adds an extension infrastructure for sk_buff instead:
> > 1. extension memory is released when the sk_buff is free'd.
> > 2. data is shared after cloning an skb.
> > 3. adding extension to an skb will COW the extension
> >buffer if needed.
> 
> So MP-TCP, when enabled for a connection, will have a new atomic
> operation for every packet?

Yes, at least for every kfree_skb call.

> And new tests all in the fast paths of the networking to facilitate
> this feature, a cost paid by everyone.

No, right now everyone has two non-atomic tests (skb->sp + skb->nf_bridge),
with this proposal everyone has one (skb->active_extensions), assuming that
both br_nf and xfrm are converted to use the extension system.

Test(s) occur both on copy/clone and kfree_skb, just like in current
kernels.

Atomic test(s) are done in case skb->{sp,nf_bridge} are set; with
this patch it's done if skb->active_extensions is != 0.

So from that angle the current status is kept.

The main motivation was to find a solution that does not add extra cost
for the normal cases.
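
To make that concrete, a rough sketch of the two free-path patterns being
compared (not the exact kernel functions; the real cleanup happens in the
skb free path):

/* Rough sketch, not the exact kernel code: what the skb free path has
 * to check today vs. with skb extensions.
 */

/* today: two pointer tests, each followed by a refcount drop if set */
static inline void skb_drop_extras_today(struct sk_buff *skb)
{
	secpath_put(skb->sp);		/* no-op when skb->sp == NULL */
	nf_bridge_put(skb->nf_bridge);	/* no-op when skb->nf_bridge == NULL */
}

/* with this series: one test; the atomic refcount drop only runs when
 * at least one extension was attached (see skb_ext_put() in patch 2/3).
 */
static inline void skb_drop_extras_ext(struct sk_buff *skb)
{
	if (skb->active_extensions)
		__skb_ext_put(skb->extensions);
}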

I did a quick hack to also convert skb->sp; it seems possible to do so.

In that case skbuff size is reduced by 8 bytes, as sp/nf_bridge get
replaced by a single 'extension pointer', and there is slightly less code
provided the kernel is built with both XFRM and bridge netfilter support.

> Sorry, that doesn't seem like a good idea to me.
>
> Can't they just encode whatever huge amount of crap they want to
> put into the CB by deriving the information from skb->sk and some
> tiny value like an index or something to resolve the path?

Perhaps; if that's the only way, I'm afraid that's what will need to be
used.  I did try such a scheme once in the past to get
rid of skb->nf_bridge and things became very, very fugly due to
kfree_skb() not being aware of such 'external storage', i.e. no
easy way to clean the external storage when an skbuff gets tossed.

It might be possible to use a destructor to take care of this in the mptcp
case.  I can have a look if this turns out to be the only possible way.
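
A minimal sketch of that destructor idea (purely hypothetical; struct
mptcp_ext_storage and both helpers are invented names for illustration,
not existing MPTCP or kernel code):

/* Hypothetical sketch of 'external storage freed via a destructor'.
 *
 * Caveats that make this approach ugly in practice: skb->destructor is
 * often already taken by the socket layer, and clones share the
 * skb_shared_info, so ownership of destructor_arg gets murky.
 */
struct mptcp_ext_storage {
	u64 data_seq;
	/* ... whatever does not fit into skb->cb[] ... */
};

static void mptcp_ext_destructor(struct sk_buff *skb)
{
	kfree(skb_shinfo(skb)->destructor_arg);
}

static int mptcp_attach_ext(struct sk_buff *skb)
{
	struct mptcp_ext_storage *ext = kzalloc(sizeof(*ext), GFP_ATOMIC);

	if (!ext)
		return -ENOMEM;

	skb_shinfo(skb)->destructor_arg = ext;
	skb->destructor = mptcp_ext_destructor;
	return 0;
}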

> In the future please document what is so enormous and absolutely
> required that they must put it all into the SKB control block.

Ok, will do.

> Like Eric, I am concerned about the slow creep of overhead.  Lots of
> small "not that bad" additions of extra cycles here and there over
> time adds up to impossible to fix performance regressions.

I have the same concern, which is why I am proposing the conversion
of xfrm and nf_bridge to use this instead of the current
nf_bridge/secpath maintenance.

Although MPTCP is the main motivation here, it was intended as a
standalone series, i.e., these 3 patches and a few followup changes
to convert xfrm.

> I'm sorry if this is a major disappointment for the MP-TCP folks but a
> better way needs to be found to integrate what they want to do with
> real zero cost for the rest of the world which won't be using MP-TCP
> and therefore should not be paying for it's added overhead at all.

Agreed.


[RFC PATCH 3/3] net: convert bridge_nf to use skb extension infrastructure

2018-11-26 Thread Florian Westphal
This converts the bridge netfilter (calling iptables hooks from bridge)
facility to use the extension infrastructure instead.

The bridge_nf-specific hooks in the skb clone and free paths are removed;
they have been replaced by the skb_ext hooks that do the same as the
bridge nf allocation hooks did.

Signed-off-by: Florian Westphal 
---
 include/linux/netfilter_bridge.h |  4 ++--
 include/linux/skbuff.h   | 28 ++--
 include/net/netfilter/br_netfilter.h |  8 
 net/Kconfig  |  1 +
 net/bridge/br_netfilter_hooks.c  | 20 ++--
 net/bridge/br_netfilter_ipv6.c   |  4 ++--
 net/core/skbuff.c|  3 ---
 7 files changed, 13 insertions(+), 55 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index 0a65a422587c..5f2614d02e03 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -20,12 +20,12 @@ static inline void br_drop_fake_rtable(struct sk_buff *skb)
 static inline struct nf_bridge_info *
 nf_bridge_info_get(const struct sk_buff *skb)
 {
-   return skb->nf_bridge;
+   return skb_ext_find(skb, SKB_EXT_BRIDGE_NF);
 }
 
 static inline bool nf_bridge_info_exists(const struct sk_buff *skb)
 {
-   return skb->nf_bridge != NULL;
+   return skb_ext_exist(skb, SKB_EXT_BRIDGE_NF);
 }
 
 static inline int nf_bridge_get_physinif(const struct sk_buff *skb)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 832904d71a85..c33654da09c3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -255,7 +255,6 @@ struct nf_conntrack {
 
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 struct nf_bridge_info {
-   refcount_t  use;
enum {
BRNF_PROTO_UNCHANGED,
BRNF_PROTO_8021Q,
@@ -717,9 +716,6 @@ struct sk_buff {
 #endif
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
unsigned long_nfct;
-#endif
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-   struct nf_bridge_info   *nf_bridge;
 #endif
unsigned intlen,
data_len;
@@ -4011,18 +4007,6 @@ static inline void __skb_ext_copy(struct sk_buff *d, 
const struct sk_buff *s) {}
 static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *s) 
{}
 #endif /* CONFIG_SKB_EXTENSIONS */
 
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-static inline void nf_bridge_put(struct nf_bridge_info *nf_bridge)
-{
-   if (nf_bridge && refcount_dec_and_test(&nf_bridge->use))
-   kfree(nf_bridge);
-}
-static inline void nf_bridge_get(struct nf_bridge_info *nf_bridge)
-{
-   if (nf_bridge)
-   refcount_inc(&nf_bridge->use);
-}
-#endif /* CONFIG_BRIDGE_NETFILTER */
 static inline void nf_reset(struct sk_buff *skb)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
@@ -4030,8 +4014,7 @@ static inline void nf_reset(struct sk_buff *skb)
skb->_nfct = 0;
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-   nf_bridge_put(skb->nf_bridge);
-   skb->nf_bridge = NULL;
+   skb_ext_del(skb, SKB_EXT_BRIDGE_NF);
 #endif
 }
 
@@ -4049,7 +4032,7 @@ static inline void ipvs_reset(struct sk_buff *skb)
 #endif
 }
 
-/* Note: This doesn't put any conntrack and bridge info in dst. */
+/* Note: This doesn't put any conntrack info in dst. */
 static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src,
 bool copy)
 {
@@ -4057,10 +4040,6 @@ static inline void __nf_copy(struct sk_buff *dst, const 
struct sk_buff *src,
dst->_nfct = src->_nfct;
nf_conntrack_get(skb_nfct(src));
 #endif
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-   dst->nf_bridge  = src->nf_bridge;
-   nf_bridge_get(src->nf_bridge);
-#endif
 #if IS_ENABLED(CONFIG_NETFILTER_XT_TARGET_TRACE) || defined(CONFIG_NF_TABLES)
if (copy)
dst->nf_trace = src->nf_trace;
@@ -4071,9 +4050,6 @@ static inline void nf_copy(struct sk_buff *dst, const 
struct sk_buff *src)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
nf_conntrack_put(skb_nfct(dst));
-#endif
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-   nf_bridge_put(dst->nf_bridge);
 #endif
__nf_copy(dst, src, true);
 }
diff --git a/include/net/netfilter/br_netfilter.h 
b/include/net/netfilter/br_netfilter.h
index 6efc0153987b..4cd56808ac4e 100644
--- a/include/net/netfilter/br_netfilter.h
+++ b/include/net/netfilter/br_netfilter.h
@@ -6,12 +6,12 @@
 
 static inline struct nf_bridge_info *nf_bridge_alloc(struct sk_buff *skb)
 {
-   skb->nf_bridge = kzalloc(sizeof(struct nf_bridge_info), GFP_ATOMIC);
+   struct nf_bridge_info *b = skb_ext_add(skb, SKB_EXT_BRIDGE_NF);
 
-   if (likely(skb->nf_bridge))
-   refcount_set(&(skb->nf_bridge->use), 1);
+   if (b)
+   

[RFC PATCH 0/3] sk_buff: add skb extension infrastructure

2018-11-26 Thread Florian Westphal
The (out-of-tree) Multipath-TCP implementation needs a significant amount
of extra space in the skb control buffer.

Increasing skb->cb[] size in mainline is a non-starter for memory and
performance reasons (f.e. an increase in cb size also moves several
frequently-accessed fields to other cache lines).

One approach that might work for MPTCP is to extend skb_shared_info instead
of sk_buff.  However, this comes with other drawbacks: it either
needs special skb allocation to make sure there is enough space for such an
'extended shinfo' at the end of the data buffer (which makes this only usable
for the tx path), or an increased size of skb_shared_info.

This adds an extension infrastructure for sk_buff instead:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb.
3. adding extension to an skb will COW the extension
   buffer if needed.

This is also how xfrm and bridge_nf extra data (skb->sp, skb->nf_bridge)
are handled.

In the future, protocols that need to store more than 48 bytes in skb->cb[]
could add a 'SKB_EXT_EXTRA_CB' or similar to allocate extra space.

Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
   have been enabled for this skb.
2. extension pointer, located at the end of the sk_buff.
   If active_extensions byte is 0, pointer value is undefined.

Last patch converts nf_bridge to use the extension infrastructure:
The 'nf_bridge' pointer is removed, i.e. sk_buff size remains the same.

Extra code added to skb clone and free paths (to deal with
refcount/free of extension area) replace the existing code that
deals with skb->nf_bridge.
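
For reference, a usage sketch of the proposed helpers (skb_ext_add(),
skb_ext_find() and skb_ext_del() from patches 2/3 and 3/3); SKB_EXT_MPTCP
and struct mptcp_ext below are hypothetical placeholders, not part of this
series:

/* Hypothetical user of the proposed API; SKB_EXT_MPTCP and struct
 * mptcp_ext do not exist in this series, they only illustrate how a
 * future extension could be added, looked up and removed.
 */
struct mptcp_ext {
	u64	data_seq;
	u32	subflow_seq;
};

static int mptcp_tag_skb(struct sk_buff *skb, u64 dseq, u32 sseq)
{
	struct mptcp_ext *ext = skb_ext_add(skb, SKB_EXT_MPTCP);

	if (!ext)
		return -ENOMEM;	/* allocation (or COW of shared ext area) failed */

	ext->data_seq = dseq;
	ext->subflow_seq = sseq;
	return 0;
}

static void mptcp_inspect_skb(const struct sk_buff *skb)
{
	const struct mptcp_ext *ext = skb_ext_find(skb, SKB_EXT_MPTCP);

	if (ext)
		pr_debug("data_seq=%llu\n", (unsigned long long)ext->data_seq);

	/* when no longer needed: skb_ext_del(skb, SKB_EXT_MPTCP); */
}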

Conversion of skb->sp (ipsec/xfrm secpath) to an skb extension could be
done as a followup, but I'm reluctant to work on this before there is
agreement that this is the right direction.

Comments welcome.

 include/linux/netfilter_bridge.h |   33 +---
 include/linux/skbuff.h   |  142 +--
 include/net/netfilter/br_netfilter.h |   14 ---
 net/Kconfig  |4 
 net/bridge/br_netfilter_hooks.c  |   39 +++--
 net/bridge/br_netfilter_ipv6.c   |4 
 net/core/skbuff.c|  134 -
 net/ipv4/ip_output.c |1 
 net/ipv4/netfilter/nf_reject_ipv4.c  |6 -
 net/ipv6/ip6_output.c|1 
 net/ipv6/netfilter/nf_reject_ipv6.c  |   10 +-
 net/netfilter/nf_log_common.c|   20 ++--
 net/netfilter/nf_queue.c |   50 
 net/netfilter/nfnetlink_queue.c  |   23 ++---
 net/netfilter/xt_physdev.c   |2 
 15 files changed, 368 insertions(+), 115 deletions(-)




[RFC PATCH 1/3] netfilter: avoid using skb->nf_bridge directly

2018-11-26 Thread Florian Westphal
plan is to remove this pointer from sk_buff, so use the existing helpers
in more places to reduce noise later on.

Signed-off-by: Florian Westphal 
---
 include/linux/netfilter_bridge.h | 33 +-
 include/net/netfilter/br_netfilter.h |  6 
 net/bridge/br_netfilter_hooks.c  | 19 ---
 net/ipv4/netfilter/nf_reject_ipv4.c  |  6 ++--
 net/ipv6/netfilter/nf_reject_ipv6.c  | 10 --
 net/netfilter/nf_log_common.c| 20 +--
 net/netfilter/nf_queue.c | 50 ++--
 net/netfilter/nfnetlink_queue.c  | 23 ++---
 net/netfilter/xt_physdev.c   |  2 +-
 9 files changed, 103 insertions(+), 66 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index fa0686500970..0a65a422587c 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -17,43 +17,58 @@ static inline void br_drop_fake_rtable(struct sk_buff *skb)
skb_dst_drop(skb);
 }
 
+static inline struct nf_bridge_info *
+nf_bridge_info_get(const struct sk_buff *skb)
+{
+   return skb->nf_bridge;
+}
+
+static inline bool nf_bridge_info_exists(const struct sk_buff *skb)
+{
+   return skb->nf_bridge != NULL;
+}
+
 static inline int nf_bridge_get_physinif(const struct sk_buff *skb)
 {
-   struct nf_bridge_info *nf_bridge;
+   const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
 
-   if (skb->nf_bridge == NULL)
+   if (!nf_bridge)
return 0;
 
-   nf_bridge = skb->nf_bridge;
return nf_bridge->physindev ? nf_bridge->physindev->ifindex : 0;
 }
 
 static inline int nf_bridge_get_physoutif(const struct sk_buff *skb)
 {
-   struct nf_bridge_info *nf_bridge;
+   const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
 
-   if (skb->nf_bridge == NULL)
+   if (!nf_bridge)
return 0;
 
-   nf_bridge = skb->nf_bridge;
return nf_bridge->physoutdev ? nf_bridge->physoutdev->ifindex : 0;
 }
 
 static inline struct net_device *
 nf_bridge_get_physindev(const struct sk_buff *skb)
 {
-   return skb->nf_bridge ? skb->nf_bridge->physindev : NULL;
+   const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+   return nf_bridge ? nf_bridge->physindev : NULL;
 }
 
 static inline struct net_device *
 nf_bridge_get_physoutdev(const struct sk_buff *skb)
 {
-   return skb->nf_bridge ? skb->nf_bridge->physoutdev : NULL;
+   const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+   return nf_bridge ? nf_bridge->physoutdev : NULL;
 }
 
 static inline bool nf_bridge_in_prerouting(const struct sk_buff *skb)
 {
-   return skb->nf_bridge && skb->nf_bridge->in_prerouting;
+   const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+   return nf_bridge && nf_bridge->in_prerouting;
 }
 #else
 #define br_drop_fake_rtable(skb)   do { } while (0)
diff --git a/include/net/netfilter/br_netfilter.h 
b/include/net/netfilter/br_netfilter.h
index 74af19c3a8f7..6efc0153987b 100644
--- a/include/net/netfilter/br_netfilter.h
+++ b/include/net/netfilter/br_netfilter.h
@@ -22,12 +22,6 @@ int br_nf_hook_thresh(unsigned int hook, struct net *net, 
struct sock *sk,
  int (*okfn)(struct net *, struct sock *,
  struct sk_buff *));
 
-static inline struct nf_bridge_info *
-nf_bridge_info_get(const struct sk_buff *skb)
-{
-   return skb->nf_bridge;
-}
-
 unsigned int nf_bridge_encap_header_len(const struct sk_buff *skb);
 
 static inline void nf_bridge_push_encap_header(struct sk_buff *skb)
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index c9383c470a83..c58cf68b45c5 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -247,7 +247,9 @@ static int br_validate_ipv4(struct net *net, struct sk_buff 
*skb)
 
 void nf_bridge_update_protocol(struct sk_buff *skb)
 {
-   switch (skb->nf_bridge->orig_proto) {
+   const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+   switch (nf_bridge->orig_proto) {
case BRNF_PROTO_8021Q:
skb->protocol = htons(ETH_P_8021Q);
break;
@@ -569,7 +571,8 @@ static unsigned int br_nf_forward_ip(void *priv,
struct net_device *parent;
u_int8_t pf;
 
-   if (!skb->nf_bridge)
+   nf_bridge = nf_bridge_info_get(skb);
+   if (!nf_bridge)
return NF_ACCEPT;
 
/* Need exclusive nf_bridge_info since we might have multiple
@@ -701,7 +704,9 @@ br_nf_ip_fragment(struct net *net, struct sock *sk, struct 
sk_buff *skb,
 
 static unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
 {
-   if (skb->nf_bridge->orig_proto == BRNF_PROTO_PPPOE)
+   const struct nf_bridge_info *nf_b

[RFC PATCH 2/3] sk_buff: add skb extension infrastructure

2018-11-26 Thread Florian Westphal
adds an extension infrastructure for sk_buff:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb.

This is also how xfrm and bridge netfilter skb-associated data
(skb->sp and skb->nf_bridge) are handled.

Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
   have been allocated for the skb.
2. extension pointer, located at the end of the sk_buff.
   If active_extensions is 0, its content is undefined.

The 'nf_bridge' pointer is removed in a followup patch, i.e. sk_buff size
remains the same.

This adds extra code to skb clone and free paths (to deal with
refcount/free of extension area) but replaces the existing code that
deals with skb->nf_bridge.

This patch only adds the basic infrastructure, the nf_bridge conversion
is done in the next patch.

Conversion of skb->sp (ipsec/xfrm secpath) to an skb extension is planned
as a followup.
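
To illustrate the chunked offset encoding used by struct skb_ext below,
resolving an extension id to its address boils down to something like this
(sketch based on the kernel-doc comment in this patch, not the exact
helper added by the diff):

/* Sketch only: translate an extension id into its address, following
 * the struct skb_ext kernel-doc below (offsets stored in 8-byte units,
 * relative to the data[] member).
 */
static inline void *skb_ext_ptr(struct skb_ext *ext, enum skb_ext_id id)
{
	/* a u8 offset in 8-byte units can address roughly 2k of extension
	 * data while keeping the bookkeeping to one byte per extension.
	 */
	return (void *)(ext->data + (ext->offset[id] << SKB_EXT_ALIGN_SHIFT));
}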

Signed-off-by: Florian Westphal 
---
 include/linux/skbuff.h | 124 +-
 net/Kconfig|   3 +
 net/core/skbuff.c  | 131 +
 net/ipv4/ip_output.c   |   1 +
 net/ipv6/ip6_output.c  |   1 +
 5 files changed, 259 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 73902acf2b71..832904d71a85 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -245,6 +245,7 @@ struct iov_iter;
 struct napi_struct;
 struct bpf_prog;
 union bpf_attr;
+struct skb_ext;
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 struct nf_conntrack {
@@ -633,6 +634,7 @@ typedef unsigned char *sk_buff_data_t;
  * @queue_mapping: Queue mapping for multiqueue devices
  * @xmit_more: More SKBs are pending for this queue
  * @pfmemalloc: skbuff was allocated from PFMEMALLOC reserves
+ * @active_extensions: active extensions (skb_ext_id types)
  * @ndisc_nodetype: router type (from link layer)
  * @ooo_okay: allow the mapping of a socket to a queue to be changed
  * @l4_hash: indicate hash is a canonical 4-tuple hash over transport
@@ -662,6 +664,7 @@ typedef unsigned char *sk_buff_data_t;
  * @data: Data head pointer
  * @truesize: Buffer size
  * @users: User count - see {datagram,tcp}.c
+ * @extensions: allocated extensions, valid if active_extensions is nonzero
  */
 
 struct sk_buff {
@@ -744,7 +747,9 @@ struct sk_buff {
head_frag:1,
xmit_more:1,
pfmemalloc:1;
-
+#ifdef CONFIG_SKB_EXTENSIONS
+   __u8active_extensions;
+#endif
/* fields enclosed in headers_start/headers_end are copied
 * using a single memcpy() in __copy_skb_header()
 */
@@ -866,6 +871,11 @@ struct sk_buff {
*data;
unsigned inttruesize;
refcount_t  users;
+
+#ifdef CONFIG_SKB_EXTENSIONS
+   /* only useable after checking ->active_extensions != 0 */
+   struct skb_ext  *extensions;
+#endif
 };
 
 #ifdef __KERNEL__
@@ -3889,6 +3899,118 @@ static inline void nf_conntrack_get(struct nf_conntrack 
*nfct)
	atomic_inc(&nfct->use);
 }
 #endif
+
+#ifdef CONFIG_SKB_EXTENSIONS
+enum skb_ext_id {
+#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+   SKB_EXT_BRIDGE_NF,
+#endif
+   SKB_EXT_NUM, /* must be last */
+};
+
+/* each extension aligned to this value */
+#define SKB_EXT_ALIGN  8
+/* offsets/len: left-shift needed to translate offset to bytes */
+#define SKB_EXT_ALIGN_SHIFT 3
+
+/**
+ * struct skb_ext - sk_buff extensions
+ * @refcount: 1 on allocation, deallocated on 0
+ * @offset: offset to add to @data to obtain extension address
+ * @len: size currently allocated, stored in SKB_EXT_ALIGN_SHIFT units
+ * @data: start of extension data, variable sized
+ *
+ * Note: offsets and len are stored in chunks of 8 bytes, this allows
+ * to use 'u8' types while allowing up to 2kb worth of extension data.
+ */
+struct skb_ext {
+   refcount_t refcnt;
+   u8 offset[SKB_EXT_NUM]; /* chunks of 8 bytes */
+   u8 len; /* same, i.e. size == len << 3 */
+   char data[0] __aligned(SKB_EXT_ALIGN);
+};
+
+void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id);
+void __skb_ext_del(struct sk_buff *skb, enum skb_ext_id id);
+void __skb_ext_free(struct skb_ext *ext);
+
+static inline void __skb_ext_put(struct skb_ext *ext)
+{
+   if (ext && refcount_dec_and_test(&ext->refcnt))
+   __skb_ext_free(ext);
+}
+
+static inline void skb_ext_put(struct sk_buff *skb)
+{
+   if (skb->active_extensions)
+   __skb_ext_put(skb->extensions);
+}
+
+static inline void skb_ext_get(struct sk_buff *skb)
+{
+   if (skb->active_extensions) {
+   struct skb_ext *ext 

[PATCH ipsec-next] xfrm: policy: fix netlink/pf_key policy lookups

2018-11-14 Thread Florian Westphal
Colin Ian King says:
 Static analysis with CoverityScan found a potential issue [..]
 It seems that pointer pol is set to NULL and then a check to see if it
 is non-null is used to set pol to tmp; however, this check is always
 going to be false because pol is always NULL.

Fix this and update the test script to catch it.  Output with only the
script update applied (i.e. without the kernel fix):
./xfrm_policy.sh ; echo $?
RTNETLINK answers: No such file or directory
FAIL: ip -net ns3 xfrm policy get src 10.0.1.0/24 dst 10.0.2.0/24 dir out
RTNETLINK answers: No such file or directory
[..]
PASS: policy before exception matches
PASS: ping to .254 bypassed ipsec tunnel
PASS: direct policy matches
PASS: policy matches
1

Fixes: 6be3b0db6db ("xfrm: policy: add inexact policy search tree infrastructure")
Reported-by: Colin Ian King 
Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c |  5 ++-
 tools/testing/selftests/net/xfrm_policy.sh | 38 ++
 2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index bd80b8a4322f..cff8c5b720b4 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1663,7 +1663,10 @@ struct xfrm_policy *xfrm_policy_bysel_ctx(struct net 
*net, u32 mark, u32 if_id,
tmp = __xfrm_policy_bysel_ctx(cand.res[i], mark,
  if_id, type, dir,
  sel, ctx);
-   if (tmp && pol && tmp->pos < pol->pos)
+   if (!tmp)
+   continue;
+
+   if (!pol || tmp->pos < pol->pos)
pol = tmp;
}
} else {
diff --git a/tools/testing/selftests/net/xfrm_policy.sh 
b/tools/testing/selftests/net/xfrm_policy.sh
index 6ca63a6e50e9..8db35b99457c 100755
--- a/tools/testing/selftests/net/xfrm_policy.sh
+++ b/tools/testing/selftests/net/xfrm_policy.sh
@@ -21,6 +21,7 @@
 # Kselftest framework requirement - SKIP code is 4.
 ksft_skip=4
 ret=0
+policy_checks_ok=1
 
 KEY_SHA=0xdeadbeef1234567890abcdefabcdefabcdefabcd
 KEY_AES=0x0123456789abcdef0123456789012345
@@ -45,6 +46,26 @@ do_esp() {
 ip -net $ns xfrm policy add src $rnet dst $lnet dir fwd tmpl src $remote 
dst $me proto esp mode tunnel priority 100 action allow
 }
 
+do_esp_policy_get_check() {
+local ns=$1
+local lnet=$2
+local rnet=$3
+
+ip -net $ns xfrm policy get src $lnet dst $rnet dir out > /dev/null
+if [ $? -ne 0 ] && [ $policy_checks_ok -eq 1 ] ;then
+policy_checks_ok=0
+echo "FAIL: ip -net $ns xfrm policy get src $lnet dst $rnet dir out"
+ret=1
+fi
+
+ip -net $ns xfrm policy get src $rnet dst $lnet dir fwd > /dev/null
+if [ $? -ne 0 ] && [ $policy_checks_ok -eq 1 ] ;then
+policy_checks_ok=0
+echo "FAIL: ip -net $ns xfrm policy get src $rnet dst $lnet dir fwd"
+ret=1
+fi
+}
+
 do_exception() {
 local ns=$1
 local me=$2
@@ -112,31 +133,31 @@ check_xfrm() {
# 1: iptables -m policy rule count != 0
rval=$1
ip=$2
-   ret=0
+   lret=0
 
ip netns exec ns1 ping -q -c 1 10.0.2.$ip > /dev/null
 
check_ipt_policy_count ns3
if [ $? -ne $rval ] ; then
-   ret=1
+   lret=1
fi
check_ipt_policy_count ns4
if [ $? -ne $rval ] ; then
-   ret=1
+   lret=1
fi
 
ip netns exec ns2 ping -q -c 1 10.0.1.$ip > /dev/null
 
check_ipt_policy_count ns3
if [ $? -ne $rval ] ; then
-   ret=1
+   lret=1
fi
check_ipt_policy_count ns4
if [ $? -ne $rval ] ; then
-   ret=1
+   lret=1
fi
 
-   return $ret
+   return $lret
 }
 
 #check for needed privileges
@@ -227,6 +248,11 @@ do_esp ns4 dead:3::10 dead:3::1 dead:2::/64 dead:1::/64 
$SPI2 $SPI1
 do_dummies4 ns3
 do_dummies6 ns4
 
+do_esp_policy_get_check ns3 10.0.1.0/24 10.0.2.0/24
+do_esp_policy_get_check ns4 10.0.2.0/24 10.0.1.0/24
+do_esp_policy_get_check ns3 dead:1::/64 dead:2::/64
+do_esp_policy_get_check ns4 dead:2::/64 dead:1::/64
+
 # ping to .254 should use ipsec, exception is not installed.
 check_xfrm 1 254
 if [ $? -ne 0 ]; then
-- 
2.18.1



Re: xfrm: policy: add inexact policy search tree infrastructure

2018-11-14 Thread Florian Westphal
Colin Ian King  wrote:
> Hi,
> 
> Static analysis with CoverityScan found a potential issue with the commit:
> 
> commit 6be3b0db6db82cf056a72cc18042048edd27f8ee
> Author: Florian Westphal 
> Date:   Wed Nov 7 23:00:37 2018 +0100
> 
> xfrm: policy: add inexact policy search tree infrastructure
> 
> It seems that pointer pol is set to NULL and then a check to see if it
> is non-null is used to set pol to tmp; however, this check is always
> going to be false because pol is always NULL.

Right.  This is in the control-plane code to retrieve a policy
via netlink or PF_KEY.

> The issue is reported by CoverityScan as follows:
> 
> Line
> 1658
> assignment: Assigning: pol = NULL.
> 1659pol = NULL;
> 1660for (i = 0; i < ARRAY_SIZE(cand.res); i++) {
> 1661struct xfrm_policy *tmp;
> 1662
> 1663tmp = __xfrm_policy_bysel_ctx(cand.res[i], mark,
> 1664  if_id, type, dir,
> 1665  sel, ctx);
> 
> null: At condition pol, the value of pol must be NULL.
> dead_error_condition: The condition pol cannot be true.
> 
> CID 1475480 (#1 of 1): Logically dead code
> 
> (DEADCODE) dead_error_line: Execution cannot reach the expression
> tmp->pos < pol->pos inside this statement: if (tmp && pol && tmp->pos 
> 
> 1666if (tmp && pol && tmp->pos < pol->pos)
> 1667pol = tmp;
>
> 
> I suspect this is not intentional and is probably a bug.

Right, a bug.  I would like to just break after the first 'tmp != NULL', but
that might make us return a different policy than the old linear search did.
So we should update pol in case it's NULL as well.

Steffen, let me know if you want an incremental fix or if you
prefer to squash this:

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1663,7 +1663,10 @@ struct xfrm_policy *xfrm_policy_bysel_ctx(struct net 
*net, u32 mark, u32 if_id,
tmp = __xfrm_policy_bysel_ctx(cand.res[i], mark,
  if_id, type, dir,
  sel, ctx);
-   if (tmp && pol && tmp->pos < pol->pos)
+   if (!tmp)
+   continue;
+
+   if (!pol || tmp->pos < pol->pos)
pol = tmp;
}
} else {


[PATCH ipsec-next 05/11] xfrm: policy: store inexact policies in an rhashtable

2018-11-07 Thread Florian Westphal
Switch packet-path lookups for inexact policies to rhashtable.

In this initial version, we now no longer need to search policies with
non-matching address family and type.

Next patch will add the if_id as well so lookups from the xfrm interface
driver only need to search inexact policies for that device.

Future patches will augment the hlist in each rhash bucket with a tree
and pre-sort policies according to daddr/prefix.

A single rhashtable is used.  In order to avoid a full rhashtable walk on
netns exit, the bins get placed on a pernet list, i.e. we add almost no
cost for network namespaces that had no xfrm policies.

The inexact lists are kept in place, and policies are added to both the
per-rhash-inexact list and a pernet one.

The latter is needed for the control plane to handle migrate -- these
requests do not consider the if_id, so if we'd remove the inexact_list
now we would have to search all hash buckets and then figure
out which matching policy candidate is the most recent one -- this appears
a bit harder than just keeping the 'old' inexact list for this purpose.
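
The netns-exit shortcut described above then amounts to a plain list walk;
roughly like this (sketch only, the actual exit/prune code is not shown in
this excerpt, and the prune helper with the net_exit argument only appears
later in the series):

/* Sketch of the netns-exit idea: prune only the bins on this netns'
 * inexact_bins list instead of walking the shared rhashtable.
 * __xfrm_policy_inexact_prune_bin() is the helper from patch 7/11.
 */
static void xfrm_policy_inexact_netns_exit_sketch(struct net *net)
{
	struct xfrm_pol_inexact_bin *b, *t;

	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
	list_for_each_entry_safe(b, t, &net->xfrm.inexact_bins, inexact_bins)
		__xfrm_policy_inexact_prune_bin(b, true);
	spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
}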

Signed-off-by: Florian Westphal 
---
 include/net/netns/xfrm.h |   2 +
 include/net/xfrm.h   |   1 +
 net/xfrm/xfrm_policy.c   | 350 +--
 3 files changed, 335 insertions(+), 18 deletions(-)

diff --git a/include/net/netns/xfrm.h b/include/net/netns/xfrm.h
index 9991e5ef52cc..59f45b1e9dac 100644
--- a/include/net/netns/xfrm.h
+++ b/include/net/netns/xfrm.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -53,6 +54,7 @@ struct netns_xfrm {
unsigned intpolicy_count[XFRM_POLICY_MAX * 2];
struct work_struct  policy_hash_work;
struct xfrm_policy_hthresh policy_hthresh;
+   struct list_headinexact_bins;
 
 
struct sock *nlsk;
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 0eb390c205af..870fa9b27f7e 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -596,6 +596,7 @@ struct xfrm_policy {
u16 family;
struct xfrm_sec_ctx *security;
struct xfrm_tmplxfrm_vec[XFRM_MAX_DEPTH];
+   struct hlist_node   bydst_inexact_list;
struct rcu_head rcu;
 };
 
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index b00c265f6be3..5c7e7399323f 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -45,6 +46,22 @@ struct xfrm_flo {
u8 flags;
 };
 
+struct xfrm_pol_inexact_key {
+   possible_net_t net;
+   u16 family;
+   u8 dir, type;
+};
+
+struct xfrm_pol_inexact_bin {
+   struct xfrm_pol_inexact_key k;
+   struct rhash_head head;
+   struct hlist_head hhead;
+
+   /* slow path below */
+   struct list_head inexact_bins;
+   struct rcu_head rcu;
+};
+
 static DEFINE_SPINLOCK(xfrm_if_cb_lock);
 static struct xfrm_if_cb const __rcu *xfrm_if_cb __read_mostly;
 
@@ -55,6 +72,9 @@ static struct xfrm_policy_afinfo const __rcu 
*xfrm_policy_afinfo[AF_INET6 + 1]
 static struct kmem_cache *xfrm_dst_cache __ro_after_init;
 static __read_mostly seqcount_t xfrm_policy_hash_generation;
 
+static struct rhashtable xfrm_policy_inexact_table;
+static const struct rhashtable_params xfrm_pol_inexact_params;
+
 static void xfrm_init_pmtu(struct xfrm_dst **bundle, int nr);
 static int stale_bundle(struct dst_entry *dst);
 static int xfrm_bundle_ok(struct xfrm_dst *xdst);
@@ -64,6 +84,18 @@ static void __xfrm_policy_link(struct xfrm_policy *pol, int 
dir);
 static struct xfrm_policy *__xfrm_policy_unlink(struct xfrm_policy *pol,
int dir);
 
+static struct xfrm_pol_inexact_bin *
+xfrm_policy_inexact_lookup(struct net *net, u8 type, u16 family, u8 dir);
+
+static struct xfrm_pol_inexact_bin *
+xfrm_policy_inexact_lookup_rcu(struct net *net,
+  u8 type, u16 family, u8 dir);
+static struct xfrm_policy *
+xfrm_policy_insert_list(struct hlist_head *chain, struct xfrm_policy *policy,
+   bool excl);
+static void xfrm_policy_insert_inexact_list(struct hlist_head *chain,
+   struct xfrm_policy *policy);
+
 static inline bool xfrm_pol_hold_rcu(struct xfrm_policy *policy)
 {
	return refcount_inc_not_zero(&policy->refcnt);
@@ -269,6 +301,7 @@ struct xfrm_policy *xfrm_policy_alloc(struct net *net, 
gfp_t gfp)
if (policy) {
		write_pnet(&policy->xp_net, net);
		INIT_LIST_HEAD(&policy->walk.all);
+		INIT_HLIST_NODE(&policy->bydst_inexact_list);
		INIT_HLIST_NODE(&policy->bydst);
		INIT_HLIST_NODE(&policy->byidx);
		rwlock_init(&policy->lock);
@@ -563,6 +596,107 @@ static void xfrm_hash_resize(struct work_struct *work)
	mutex_unlock(&hash_resize_mutex);
 }
 
+static void xfrm_hash_reset_ine

[PATCH ipsec-next 04/11] xfrm: policy: return NULL when inexact search needed

2018-11-07 Thread Florian Westphal
currently policy_hash_bysel() returns the hash bucket list
(for exact policies), or the inexact list (when policy uses a prefix).

Searching this inexact list is slow, so it might be better to pre-sort
inexact lists into a tree or another data structure for faster
searching.

However, due to 'any' policies, which need to be searched in any case,
doing so will require that 'inexact' policies be handled
specially to decide the best search strategy.  So change policy_hash_bysel()
to return NULL if the policy can't be handled via the policy hash
table.

Right now, we simply use the inexact list when this happens, but a
future patch can then implement a different strategy.

Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 20d6815be0d7..b00c265f6be3 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -365,7 +365,7 @@ static struct hlist_head *policy_hash_bysel(struct net *net,
hash = __sel_hash(sel, family, hmask, dbits, sbits);
 
if (hash == hmask + 1)
-   return &net->xfrm.policy_inexact[dir];
+   return NULL;
 
return rcu_dereference_check(net->xfrm.policy_bydst[dir].table,
				     lockdep_is_held(&net->xfrm.xfrm_policy_lock)) + hash;
@@ -625,6 +625,8 @@ static void xfrm_hash_rebuild(struct work_struct *work)
		chain = policy_hash_bysel(net, &policy->selector,
  policy->family,
  xfrm_policy_id2dir(policy->index));
+   if (!chain)
+   chain = &net->xfrm.policy_inexact[dir];
hlist_for_each_entry(pol, chain, bydst) {
if (policy->priority >= pol->priority)
				newpos = &pol->bydst;
@@ -781,7 +783,12 @@ int xfrm_policy_insert(int dir, struct xfrm_policy 
*policy, int excl)
 
	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
	chain = policy_hash_bysel(net, &policy->selector, policy->family, dir);
-   delpol = xfrm_policy_insert_list(chain, policy, excl);
+   if (chain) {
+   delpol = xfrm_policy_insert_list(chain, policy, excl);
+   } else {
+   chain = &net->xfrm.policy_inexact[dir];
+   delpol = xfrm_policy_insert_list(chain, policy, excl);
+   }
 
if (IS_ERR(delpol)) {
		spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
@@ -829,6 +836,8 @@ struct xfrm_policy *xfrm_policy_bysel_ctx(struct net *net, 
u32 mark, u32 if_id,
*err = 0;
	spin_lock_bh(&net->xfrm.xfrm_policy_lock);
chain = policy_hash_bysel(net, sel, sel->family, dir);
+   if (!chain)
+   chain = &net->xfrm.policy_inexact[dir];
ret = NULL;
hlist_for_each_entry(pol, chain, bydst) {
if (pol->type == type &&
-- 
2.18.1



[PATCH ipsec-next 07/11] xfrm: policy: add inexact policy search tree infrastructure

2018-11-07 Thread Florian Westphal
At this time inexact policies are all searched in order until the first
match is found.  After removal of the flow cache, this resolution has
to be performed for every packet, resulting in a major slowdown when the
number of inexact policies is high.

This adds infrastructure to later sort inexact policies into a tree.
This only introduces a single class: any:any.

Next patch will add a search tree to pre-sort policies that
have a fixed daddr/prefixlen, so in this patch the any:any class
will still be used for all policies.
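
For orientation, this is roughly how the candidate set is meant to be
consumed on the packet path once later patches add the other classes
(sketch only; selector matching against the flow is omitted, the real
lookup code is not fully visible in this excerpt, and it assumes ->pos
grows with insertion order, i.e. a larger ->pos means a younger policy):

/* Sketch: walk all candidate lists and keep the best match.
 * Lowest priority wins; on a tie, the younger policy (larger ->pos,
 * assuming ->pos grows with insertion order) wins.
 */
static struct xfrm_policy *
xfrm_pol_best_candidate_sketch(const struct xfrm_pol_inexact_candidates *cand)
{
	struct xfrm_policy *pol, *best = NULL;
	int i;

	for (i = 0; i < XFRM_POL_CAND_MAX; i++) {
		if (!cand->res[i])
			continue;

		hlist_for_each_entry(pol, cand->res[i], bydst) {
			if (!best ||
			    pol->priority < best->priority ||
			    (pol->priority == best->priority &&
			     pol->pos > best->pos))
				best = pol;
		}
	}

	return best;
}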

Signed-off-by: Florian Westphal 
---
 include/net/xfrm.h |   1 +
 net/xfrm/xfrm_policy.c | 301 +
 2 files changed, 248 insertions(+), 54 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 870fa9b27f7e..9df6dca17155 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -577,6 +577,7 @@ struct xfrm_policy {
/* This lock only affects elements except for entry. */
rwlock_tlock;
refcount_t  refcnt;
+   u32 pos;
struct timer_list   timer;
 
atomic_tgenid;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index dda27fd7b8a4..4eb12e9b40c2 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -46,6 +46,9 @@ struct xfrm_flo {
u8 flags;
 };
 
+/* prefixes smaller than this are stored in lists, not trees. */
+#define INEXACT_PREFIXLEN_IPV4 16
+#define INEXACT_PREFIXLEN_IPV6 48
 struct xfrm_pol_inexact_key {
possible_net_t net;
u32 if_id;
@@ -56,6 +59,7 @@ struct xfrm_pol_inexact_key {
 struct xfrm_pol_inexact_bin {
struct xfrm_pol_inexact_key k;
struct rhash_head head;
+   /* list containing '*:*' policies */
struct hlist_head hhead;
 
/* slow path below */
@@ -63,6 +67,16 @@ struct xfrm_pol_inexact_bin {
struct rcu_head rcu;
 };
 
+enum xfrm_pol_inexact_candidate_type {
+   XFRM_POL_CAND_ANY,
+
+   XFRM_POL_CAND_MAX,
+};
+
+struct xfrm_pol_inexact_candidates {
+   struct hlist_head *res[XFRM_POL_CAND_MAX];
+};
+
 static DEFINE_SPINLOCK(xfrm_if_cb_lock);
 static struct xfrm_if_cb const __rcu *xfrm_if_cb __read_mostly;
 
@@ -98,6 +112,12 @@ xfrm_policy_insert_list(struct hlist_head *chain, struct 
xfrm_policy *policy,
 static void xfrm_policy_insert_inexact_list(struct hlist_head *chain,
struct xfrm_policy *policy);
 
+static bool
+xfrm_policy_find_inexact_candidates(struct xfrm_pol_inexact_candidates *cand,
+   struct xfrm_pol_inexact_bin *b,
+   const xfrm_address_t *saddr,
+   const xfrm_address_t *daddr);
+
 static inline bool xfrm_pol_hold_rcu(struct xfrm_policy *policy)
 {
	return refcount_inc_not_zero(&policy->refcnt);
@@ -652,13 +672,48 @@ xfrm_policy_inexact_alloc_bin(const struct xfrm_policy 
*pol, u8 dir)
return IS_ERR(prev) ? NULL : prev;
 }
 
-static void xfrm_policy_inexact_delete_bin(struct net *net,
-  struct xfrm_pol_inexact_bin *b)
+static bool xfrm_pol_inexact_addr_use_any_list(const xfrm_address_t *addr,
+  int family, u8 prefixlen)
 {
-   lockdep_assert_held(&net->xfrm.xfrm_policy_lock);
+   if (xfrm_addr_any(addr, family))
+   return true;
+
+   if (family == AF_INET6 && prefixlen < INEXACT_PREFIXLEN_IPV6)
+   return true;
+
+   if (family == AF_INET && prefixlen < INEXACT_PREFIXLEN_IPV4)
+   return true;
+
+   return false;
+}
+
+static bool
+xfrm_policy_inexact_insert_use_any_list(const struct xfrm_policy *policy)
+{
+   const xfrm_address_t *addr;
+   bool saddr_any, daddr_any;
+   u8 prefixlen;
+
+   addr = &policy->selector.saddr;
+   prefixlen = policy->selector.prefixlen_s;
 
-   if (!hlist_empty(&b->hhead))
+   saddr_any = xfrm_pol_inexact_addr_use_any_list(addr,
+  policy->family,
+  prefixlen);
+   addr = &policy->selector.daddr;
+   prefixlen = policy->selector.prefixlen_d;
+   daddr_any = xfrm_pol_inexact_addr_use_any_list(addr,
+  policy->family,
+  prefixlen);
+   return saddr_any && daddr_any;
+}
+
+static void __xfrm_policy_inexact_prune_bin(struct xfrm_pol_inexact_bin *b, 
bool net_exit)
+{
+   if (!hlist_empty(&b->hhead)) {
+   WARN_ON_ONCE(net_exit);
return;
+   }
 
	if (rhashtable_remove_fast(&xfrm_policy_inexact_table, &b->head,
   xfrm_pol_inexact_params) == 0) {
@@ -667,14 +722,23 @@ static void xfrm_policy_inexact_delete_bin(struct net 
*net,
 

[PATCH ipsec-next 10/11] xfrm: policy: store inexact policies in a tree ordered by source address

2018-11-07 Thread Florian Westphal
This adds the 'saddr:any' search class.  It contains all policies that have
a fixed saddr/prefixlen, but 'any' destination.

Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 46 ++
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 57e28dcd7c53..38e33326c856 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -71,11 +71,20 @@ struct xfrm_pol_inexact_node {
  * | |
  * | +- coarse policies and all any:daddr policies
  * |
+ * + root_s: sorted by saddr:prefix
+ * | |
+ * |xfrm_pol_inexact_node
+ * | |
+ * | + root: unused
+ * | |
+ * | + hhead: saddr:any policies
+ * |
  * + coarse policies and all any:any policies
  *
- * Lookups return two candidate lists:
+ * Lookups return three candidate lists:
  * 1. any:any list from top-level xfrm_pol_inexact_bin
  * 2. any:daddr list from daddr tree
+ * 3. saddr:any list from saddr tree
  *
  * This result set then needs to be searched for the policy with
  * the lowest priority.  If two results have same prio, youngest one wins.
@@ -98,12 +107,16 @@ struct xfrm_pol_inexact_bin {
/* tree sorted by daddr/prefix */
struct rb_root root_d;
 
+   /* tree sorted by saddr/prefix */
+   struct rb_root root_s;
+
/* slow path below */
struct list_head inexact_bins;
struct rcu_head rcu;
 };
 
 enum xfrm_pol_inexact_candidate_type {
+   XFRM_POL_CAND_SADDR,
XFRM_POL_CAND_DADDR,
XFRM_POL_CAND_ANY,
 
@@ -696,6 +709,7 @@ xfrm_policy_inexact_alloc_bin(const struct xfrm_policy 
*pol, u8 dir)
bin->k = k;
	INIT_HLIST_HEAD(&bin->hhead);
	bin->root_d = RB_ROOT;
+   bin->root_s = RB_ROOT;
	seqcount_init(&bin->count);
 
	prev = rhashtable_lookup_get_insert_key(&xfrm_policy_inexact_table,
@@ -980,9 +994,10 @@ static void __xfrm_policy_inexact_prune_bin(struct 
xfrm_pol_inexact_bin *b, bool
 {
	write_seqcount_begin(&b->count);
	xfrm_policy_inexact_gc_tree(&b->root_d, net_exit);
+   xfrm_policy_inexact_gc_tree(&b->root_s, net_exit);
	write_seqcount_end(&b->count);
 
-   if (!RB_EMPTY_ROOT(&b->root_d) ||
+   if (!RB_EMPTY_ROOT(&b->root_d) || !RB_EMPTY_ROOT(&b->root_s) ||
	    !hlist_empty(&b->hhead)) {
WARN_ON_ONCE(net_exit);
return;
@@ -1027,11 +1042,29 @@ xfrm_policy_inexact_alloc_chain(struct 
xfrm_pol_inexact_bin *bin,
if (xfrm_policy_inexact_insert_use_any_list(policy))
		return &bin->hhead;
 
-   if (xfrm_pol_inexact_addr_use_any_list(&policy->selector.daddr,
+   /* saddr is wildcard */
+   if (xfrm_pol_inexact_addr_use_any_list(&policy->selector.saddr,
   policy->family,
-  policy->selector.prefixlen_d))
+  policy->selector.prefixlen_s))
		return &bin->hhead;
 
+   if (xfrm_pol_inexact_addr_use_any_list(&policy->selector.daddr,
+  policy->family,
+  policy->selector.prefixlen_d)) {
+   write_seqcount_begin(&bin->count);
+   n = xfrm_policy_inexact_insert_node(net,
+   &bin->root_s,
+   &policy->selector.saddr,
+   policy->family,
+   policy->selector.prefixlen_s,
+   dir);
+   write_seqcount_end(&bin->count);
+   if (!n)
+   return NULL;
+
+   return &n->hhead;
+   }
+
/* daddr is fixed */
	write_seqcount_begin(&bin->count);
n = xfrm_policy_inexact_insert_node(net,
@@ -1826,6 +1859,11 @@ xfrm_policy_find_inexact_candidates(struct 
xfrm_pol_inexact_candidates *cand,
if (n)
		cand->res[XFRM_POL_CAND_DADDR] = &n->hhead;
 
+   n = xfrm_policy_lookup_inexact_addr(&b->root_s, &b->count, saddr,
+   family);
+   if (n)
+   cand->res[XFRM_POL_CAND_SADDR] = &n->hhead;
+
return true;
 }
 
-- 
2.18.1



[PATCH ipsec-next 06/11] xfrm: policy: consider if_id when hashing inexact policy

2018-11-07 Thread Florian Westphal
This avoids searching policies that cannot match in the first
place due to a different interface id, by placing them in different bins.
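
As a small aside on the compare helper touched below: rhashtable treats a
return value of 0 as "match", and a ^ b is zero exactly when a == b, so
returning 'b->k.if_id ^ key->if_id' rejects bins whose interface id
differs.  A trivial userspace illustration (not kernel code):

/* ifid_cmp.c -- userspace illustration of the xor-as-compare idiom.
 * Build with: gcc -Wall -o ifid_cmp ifid_cmp.c
 */
#include <stdio.h>
#include <stdint.h>

static int cmp_if_id(uint32_t bin_if_id, uint32_t key_if_id)
{
	return bin_if_id ^ key_if_id;	/* 0 means match */
}

int main(void)
{
	printf("same if_id      -> %d\n", cmp_if_id(42, 42));	/* 0 */
	printf("different if_id -> %d\n", cmp_if_id(42, 7));	/* nonzero */
	return 0;
}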

Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 5c7e7399323f..dda27fd7b8a4 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -48,6 +48,7 @@ struct xfrm_flo {
 
 struct xfrm_pol_inexact_key {
possible_net_t net;
+   u32 if_id;
u16 family;
u8 dir, type;
 };
@@ -85,11 +86,12 @@ static struct xfrm_policy *__xfrm_policy_unlink(struct 
xfrm_policy *pol,
int dir);
 
 static struct xfrm_pol_inexact_bin *
-xfrm_policy_inexact_lookup(struct net *net, u8 type, u16 family, u8 dir);
+xfrm_policy_inexact_lookup(struct net *net, u8 type, u16 family, u8 dir,
+  u32 if_id);
 
 static struct xfrm_pol_inexact_bin *
 xfrm_policy_inexact_lookup_rcu(struct net *net,
-  u8 type, u16 family, u8 dir);
+  u8 type, u16 family, u8 dir, u32 if_id);
 static struct xfrm_policy *
 xfrm_policy_insert_list(struct hlist_head *chain, struct xfrm_policy *policy,
bool excl);
@@ -618,6 +620,7 @@ xfrm_policy_inexact_alloc_bin(const struct xfrm_policy 
*pol, u8 dir)
.family = pol->family,
.type = pol->type,
.dir = dir,
+   .if_id = pol->if_id,
};
struct net *net = xp_net(pol);
 
@@ -925,7 +928,8 @@ static u32 xfrm_pol_bin_key(const void *data, u32 len, u32 
seed)
const struct xfrm_pol_inexact_key *k = data;
u32 a = k->type << 24 | k->dir << 16 | k->family;
 
-   return jhash_2words(a, net_hash_mix(read_pnet(&k->net)), seed);
+   return jhash_3words(a, k->if_id, net_hash_mix(read_pnet(&k->net)),
+   seed);
 }
 
 static u32 xfrm_pol_bin_obj(const void *data, u32 len, u32 seed)
@@ -957,7 +961,7 @@ static int xfrm_pol_bin_cmp(struct rhashtable_compare_arg 
*arg,
if (ret)
return ret;
 
-   return 0;
+   return b->k.if_id ^ key->if_id;
 }
 
 static const struct rhashtable_params xfrm_pol_inexact_params = {
@@ -1094,7 +1098,7 @@ struct xfrm_policy *xfrm_policy_bysel_ctx(struct net 
*net, u32 mark, u32 if_id,
chain = policy_hash_bysel(net, sel, sel->family, dir);
if (!chain) {
bin = xfrm_policy_inexact_lookup(net, type,
-sel->family, dir);
+sel->family, dir, if_id);
if (!bin) {
			spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
return NULL;
@@ -1335,12 +1339,14 @@ static int xfrm_policy_match(const struct xfrm_policy 
*pol,
 }
 
 static struct xfrm_pol_inexact_bin *
-xfrm_policy_inexact_lookup_rcu(struct net *net, u8 type, u16 family, u8 dir)
+xfrm_policy_inexact_lookup_rcu(struct net *net, u8 type, u16 family,
+  u8 dir, u32 if_id)
 {
struct xfrm_pol_inexact_key k = {
.family = family,
.type = type,
.dir = dir,
+   .if_id = if_id,
};
 
	write_pnet(&k.net, net);
@@ -1350,14 +1356,15 @@ xfrm_policy_inexact_lookup_rcu(struct net *net, u8 
type, u16 family, u8 dir)
 }
 
 static struct xfrm_pol_inexact_bin *
-xfrm_policy_inexact_lookup(struct net *net, u8 type, u16 family, u8 dir)
+xfrm_policy_inexact_lookup(struct net *net, u8 type, u16 family,
+  u8 dir, u32 if_id)
 {
struct xfrm_pol_inexact_bin *bin;
 
lockdep_assert_held(>xfrm.xfrm_policy_lock);
 
rcu_read_lock();
-   bin = xfrm_policy_inexact_lookup_rcu(net, type, family, dir);
+   bin = xfrm_policy_inexact_lookup_rcu(net, type, family, dir, if_id);
rcu_read_unlock();
 
return bin;
@@ -1405,7 +1412,7 @@ static struct xfrm_policy 
*xfrm_policy_lookup_bytype(struct net *net, u8 type,
break;
}
}
-   bin = xfrm_policy_inexact_lookup_rcu(net, type, family, dir);
+   bin = xfrm_policy_inexact_lookup_rcu(net, type, family, dir, if_id);
if (!bin)
goto skip_inexact;
chain = >hhead;
-- 
2.18.1



[PATCH ipsec-next 01/11] selftests: add xfrm policy test script

2018-11-07 Thread Florian Westphal
add a script that sets up an ipsec tunnel between two network
namespaces plus the following policies:

.0/24 -> ipsec tunnel
.240/28 -> bypass
.253/32 -> ipsec tunnel

Then check that .254 bypasses the tunnel (matching the /28 exception),
while .2 (matching the /24) and .253 (matching the direct policy) pass
through the tunnel.

Abuses iptables to check if ping did resolve an ipsec policy or not.

Also adds a bunch of 'block' rules that are not supposed to match.

Signed-off-by: Florian Westphal 
---
 tools/testing/selftests/net/Makefile   |   3 +-
 tools/testing/selftests/net/xfrm_policy.sh | 276 +
 2 files changed, 278 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/net/xfrm_policy.sh

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 256d82d5fa87..4c7df17d4f42 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -4,7 +4,8 @@
 CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
 CFLAGS += -I../../../../usr/include/
 
-TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
rtnetlink.sh
+TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh \
+ rtnetlink.sh xfrm_policy.sh
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
 TEST_PROGS_EXTENDED := in_netns.sh
diff --git a/tools/testing/selftests/net/xfrm_policy.sh 
b/tools/testing/selftests/net/xfrm_policy.sh
new file mode 100755
index ..72f9e3dfbea1
--- /dev/null
+++ b/tools/testing/selftests/net/xfrm_policy.sh
@@ -0,0 +1,276 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Check xfrm policy resolution.  Topology:
+#
+# 1.2   1.1   3.1  3.102.1   2.2
+# eth1  eth1 veth0 veth0 eth1   eth1
+# ns1  ns3 - ns4  ns2
+#
+# ns3 and ns4 are connected via ipsec tunnel.
+# pings from ns1 to ns2 (and vice versa) are supposed to work like this:
+# ns1: ping 10.0.2.2: passes via ipsec tunnel.
+# ns2: ping 10.0.1.2: passes via ipsec tunnel.
+
+# ns1: ping 10.0.1.253: passes via ipsec tunnel (direct policy)
+# ns2: ping 10.0.2.253: passes via ipsec tunnel (direct policy)
+#
+# ns1: ping 10.0.2.254: does NOT pass via ipsec tunnel (exception)
+# ns2: ping 10.0.1.254: does NOT pass via ipsec tunnel (exception)
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+ret=0
+
+KEY_SHA=0xdeadbeef1234567890abcdefabcdefabcdefabcd
+KEY_AES=0x0123456789abcdef0123456789012345
+SPI1=0x1
+SPI2=0x2
+
+do_esp() {
+local ns=$1
+local me=$2
+local remote=$3
+local lnet=$4
+local rnet=$5
+local spi_out=$6
+local spi_in=$7
+
+ip -net $ns xfrm state add src $remote dst $me proto esp spi $spi_in  enc 
aes $KEY_AES  auth sha1 $KEY_SHA  mode tunnel sel src $rnet dst $lnet
+ip -net $ns xfrm state add src $me  dst $remote proto esp spi $spi_out enc 
aes $KEY_AES auth sha1 $KEY_SHA mode tunnel sel src $lnet dst $rnet
+
+# to encrypt packets as they go out (includes forwarded packets that need 
encapsulation)
+ip -net $ns xfrm policy add src $lnet dst $rnet dir out tmpl src $me dst 
$remote proto esp mode tunnel priority 100 action allow
+# to fwd decrypted packets after esp processing:
+ip -net $ns xfrm policy add src $rnet dst $lnet dir fwd tmpl src $remote 
dst $me proto esp mode tunnel priority 100 action allow
+}
+
+do_exception() {
+local ns=$1
+local me=$2
+local remote=$3
+local encryptip=$4
+local plain=$5
+
+# network $plain passes without tunnel
+ip -net $ns xfrm policy add dst $plain dir out priority 10 action allow
+
+# direct policy for $encryptip, use tunnel, higher prio takes precedence
+ip -net $ns xfrm policy add dst $encryptip dir out tmpl src $me dst 
$remote proto esp mode tunnel priority 1 action allow
+}
+
+# policies that are not supposed to match any packets generated in this test.
+do_dummies4() {
+local ns=$1
+
+for i in $(seq 10 16);do
+  # dummy policy with wildcard src/dst.
+  echo netns exec $ns ip xfrm policy add src 0.0.0.0/0 dst 10.$i.99.0/30 
dir out action block
+  echo netns exec $ns ip xfrm policy add src 10.$i.99.0/30 dst 0.0.0.0/0 
dir out action block
+  for j in $(seq 32 64);do
+echo netns exec $ns ip xfrm policy add src 10.$i.1.0/30 dst 
10.$i.$j.0/30 dir out action block
+# silly, as it encompasses the one above too, but its allowed:
+echo netns exec $ns ip xfrm policy add src 10.$i.1.0/29 dst 
10.$i.$j.0/29 dir out action block
+# and yet again, even more broad one.
+echo netns exec $ns ip xfrm policy add src 10.$i.1.0/24 dst 
10.$i.$j.0/24 dir out action block
+echo netns exec $ns ip xfrm policy add src 10.$i.$j.0/24 dst 
10.$i.1.0/24 dir fwd action block
+  done
+done | ip -batch /dev/stdin
+}
+
+do_dummies6() {
+local ns=$1
+
+for i in $(seq 10 16);do
+  for j in $(seq 

[PATCH ipsec-next 09/11] xfrm: policy: check reinserted policies match their node

2018-11-07 Thread Florian Westphal
validate that the re-inserted policies match the lookup node.
Policies that fail this test won't be returned in the candidate set.

This is enabled by default for now; it should not cause a noticeable
reinsert slowdown.

Such reinserts are needed when we have to merge an existing node
(e.g. 10.0.0.0/28) because an overlapping subnet was added (e.g.
10.0.0.0/24); whenever this happens, existing policies have to
be placed on the list of the new node.
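
As a concrete example of the check used below (xfrm_policy_addr_delta() == 0
meaning "the address lies within the node's prefix"), here is a small
userspace rendition of the IPv4 branch from patch 8/11; the addresses are
just for illustration:

/* delta_demo.c -- userspace rendition of the IPv4 prefix-delta check.
 * Build with: gcc -Wall -o delta_demo delta_demo.c
 */
#include <stdio.h>
#include <stdint.h>
#include <sys/socket.h>
#include <arpa/inet.h>

/* prefixlen must be 1..32 here; the kernel version also handles 0 */
static long long addr_delta_v4(uint32_t a_be, uint32_t b_be, uint8_t plen)
{
	uint32_t mask = ~0u << (32 - plen);

	return (long long)(ntohl(a_be) & mask) - (long long)(ntohl(b_be) & mask);
}

int main(void)
{
	uint32_t node, daddr;

	inet_pton(AF_INET, "10.0.0.0", &node);	/* node created for the /24 */
	inet_pton(AF_INET, "10.0.0.5", &daddr);	/* policy daddr inside it */
	printf("10.0.0.5 vs 10.0.0.0/24 -> %lld (0: matches the node)\n",
	       addr_delta_v4(daddr, node, 24));

	inet_pton(AF_INET, "10.0.1.5", &daddr);	/* outside the /24 */
	printf("10.0.1.5 vs 10.0.0.0/24 -> %lld\n",
	       addr_delta_v4(daddr, node, 24));
	return 0;
}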

Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 81447d5d02e6..57e28dcd7c53 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -806,10 +806,16 @@ static void xfrm_policy_inexact_list_reinsert(struct net 
*net,
  struct xfrm_pol_inexact_node *n,
  u16 family)
 {
+   unsigned int matched_s, matched_d;
struct hlist_node *newpos = NULL;
struct xfrm_policy *policy, *p;
 
+   matched_s = 0;
+   matched_d = 0;
+
	list_for_each_entry_reverse(policy, &net->xfrm.policy_all, walk.all) {
+   bool matches_s, matches_d;
+
if (!policy->bydst_reinsert)
continue;
 
@@ -827,6 +833,32 @@ static void xfrm_policy_inexact_list_reinsert(struct net 
*net,
			hlist_add_behind(&policy->bydst, newpos);
		else
			hlist_add_head(&policy->bydst, &n->hhead);
+
+   /* paranoia checks follow.
+* Check that the reinserted policy matches at least
+* saddr or daddr for current node prefix.
+*
+* Matching both is fine, matching saddr in one policy
+* (but not daddr) and then matching only daddr in another
+* is a bug.
+*/
+   matches_s = xfrm_policy_addr_delta(&policy->selector.saddr,
+  &n->addr,
+  n->prefixlen,
+  family) == 0;
+   matches_d = xfrm_policy_addr_delta(&policy->selector.daddr,
+  &n->addr,
+  n->prefixlen,
+  family) == 0;
+   if (matches_s && matches_d)
+   continue;
+
+   WARN_ON_ONCE(!matches_s && !matches_d);
+   if (matches_s)
+   matched_s++;
+   if (matches_d)
+   matched_d++;
+   WARN_ON_ONCE(matched_s && matched_d);
}
 }
 
-- 
2.18.1



[PATCH ipsec-next 00/11] xfrm: policy: add inexact policy search tree

2018-11-07 Thread Florian Westphal
add commands
   keeps on returning same lookup result compared to older kernel.

This work doesn't prevent us from re-adding a new flow caching scheme,
but so far I found no good solution (all have various DoS or degradation
issues).

Comments or questions welcome.

Florian Westphal (11):
  selftests: add xfrm policy test script
  xfrm: security: iterate all, not inexact lists
  xfrm: policy: split list insertion into a helper
  xfrm: policy: return NULL when inexact search needed
  xfrm: policy: store inexact policies in an rhashtable
  xfrm: policy: consider if_id when hashing inexact policy
  xfrm: policy: add inexact policy search tree infrastructure
  xfrm: policy: store inexact policies in a tree ordered by destination 
address
  xfrm: policy: check reinserted policies match their node
  xfrm: policy: store inexact policies in a tree ordered by source address
  xfrm: policy: add 2nd-level saddr trees for inexact policies

 include/net/netns/xfrm.h   |2
 include/net/xfrm.h |3
 net/xfrm/xfrm_policy.c | 1234 ++---
 tools/testing/selftests/net/Makefile   |3
 tools/testing/selftests/net/xfrm_policy.sh |  276 ++
 5 files changed, 1391 insertions(+), 127 deletions(-)



[PATCH ipsec-next 08/11] xfrm: policy: store inexact policies in a tree ordered by destination address

2018-11-07 Thread Florian Westphal
This adds inexact lists per destination network, stored in a search tree.

Inexact lookups now return two 'candidate lists', the 'any' policies
('any' destinations), and a list of policies that share the same
daddr/prefix.

Next patch will add a second search tree for 'saddr:any' policies
so we can avoid placing those on the 'any:any' list too.
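
The daddr tree lookup implied here is then a standard rbtree descent keyed
by xfrm_policy_addr_delta(); a rough sketch follows (the in-tree helper,
xfrm_policy_lookup_inexact_addr() in the later hunks, additionally copes
with differing prefix lengths and runs under the bin's seqcount, both
omitted here):

/* Sketch only: descend the daddr-sorted rbtree and return the node
 * whose prefix contains 'addr'.
 */
static struct xfrm_pol_inexact_node *
xfrm_pol_daddr_node_sketch(const struct rb_root *r,
			   const xfrm_address_t *addr, u16 family)
{
	struct rb_node *rnode = r->rb_node;

	while (rnode) {
		struct xfrm_pol_inexact_node *node;
		int delta;

		node = rb_entry(rnode, struct xfrm_pol_inexact_node, node);
		delta = xfrm_policy_addr_delta(addr, &node->addr,
					       node->prefixlen, family);
		if (delta < 0)
			rnode = rnode->rb_left;
		else if (delta > 0)
			rnode = rnode->rb_right;
		else
			return node;	/* addr falls within node's prefix */
	}

	return NULL;
}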

Signed-off-by: Florian Westphal 
---
 include/net/xfrm.h |   1 +
 net/xfrm/xfrm_policy.c | 333 -
 2 files changed, 328 insertions(+), 6 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 9df6dca17155..fa4b3c877fcf 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -590,6 +590,7 @@ struct xfrm_policy {
struct xfrm_lifetime_cur curlft;
struct xfrm_policy_walk_entry walk;
struct xfrm_policy_queue polq;
+   boolbydst_reinsert;
u8  type;
u8  action;
u8  flags;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 4eb12e9b40c2..81447d5d02e6 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -49,6 +49,38 @@ struct xfrm_flo {
 /* prefixes smaller than this are stored in lists, not trees. */
 #define INEXACT_PREFIXLEN_IPV4 16
 #define INEXACT_PREFIXLEN_IPV6 48
+
+struct xfrm_pol_inexact_node {
+   struct rb_node node;
+   union {
+   xfrm_address_t addr;
+   struct rcu_head rcu;
+   };
+   u8 prefixlen;
+
+   /* the policies matching this node, can be empty list */
+   struct hlist_head hhead;
+};
+
+/* xfrm inexact policy search tree:
+ * xfrm_pol_inexact_bin = hash(dir,type,family,if_id);
+ *  |
+ * + root_d: sorted by daddr:prefix
+ * | |
+ * |xfrm_pol_inexact_node
+ * | |
+ * | +- coarse policies and all any:daddr policies
+ * |
+ * + coarse policies and all any:any policies
+ *
+ * Lookups return two candidate lists:
+ * 1. any:any list from top-level xfrm_pol_inexact_bin
+ * 2. any:daddr list from daddr tree
+ *
+ * This result set then needs to be searched for the policy with
+ * the lowest priority.  If two results have same prio, youngest one wins.
+ */
+
 struct xfrm_pol_inexact_key {
possible_net_t net;
u32 if_id;
@@ -62,12 +94,17 @@ struct xfrm_pol_inexact_bin {
/* list containing '*:*' policies */
struct hlist_head hhead;
 
+   seqcount_t count;
+   /* tree sorted by daddr/prefix */
+   struct rb_root root_d;
+
/* slow path below */
struct list_head inexact_bins;
struct rcu_head rcu;
 };
 
 enum xfrm_pol_inexact_candidate_type {
+   XFRM_POL_CAND_DADDR,
XFRM_POL_CAND_ANY,
 
XFRM_POL_CAND_MAX,
@@ -658,6 +695,8 @@ xfrm_policy_inexact_alloc_bin(const struct xfrm_policy *pol, u8 dir)
 
bin->k = k;
INIT_HLIST_HEAD(&bin->hhead);
+   bin->root_d = RB_ROOT;
+   seqcount_init(&bin->count);
 
prev = rhashtable_lookup_get_insert_key(&xfrm_policy_inexact_table,
&bin->k, &bin->head,
@@ -708,9 +747,211 @@ xfrm_policy_inexact_insert_use_any_list(const struct xfrm_policy *policy)
return saddr_any && daddr_any;
 }
 
+static void xfrm_pol_inexact_node_init(struct xfrm_pol_inexact_node *node,
+  const xfrm_address_t *addr, u8 prefixlen)
+{
+   node->addr = *addr;
+   node->prefixlen = prefixlen;
+}
+
+static struct xfrm_pol_inexact_node *
+xfrm_pol_inexact_node_alloc(const xfrm_address_t *addr, u8 prefixlen)
+{
+   struct xfrm_pol_inexact_node *node;
+
+   node = kzalloc(sizeof(*node), GFP_ATOMIC);
+   if (node)
+   xfrm_pol_inexact_node_init(node, addr, prefixlen);
+
+   return node;
+}
+
+static int xfrm_policy_addr_delta(const xfrm_address_t *a,
+ const xfrm_address_t *b,
+ u8 prefixlen, u16 family)
+{
+   unsigned int pdw, pbi;
+   int delta = 0;
+
+   switch (family) {
+   case AF_INET:
+   if (sizeof(long) == 4 && prefixlen == 0)
+   return ntohl(a->a4) - ntohl(b->a4);
+   return (ntohl(a->a4) & ((~0UL << (32 - prefixlen)))) -
+  (ntohl(b->a4) & ((~0UL << (32 - prefixlen))));
+   case AF_INET6:
+   pdw = prefixlen >> 5;
+   pbi = prefixlen & 0x1f;
+
+   if (pdw) {
+   delta = memcmp(a->a6, b->a6, pdw << 2);
+   if (delta)
+   return delta;
+   }
+   if (pbi) {
+   u32 mask = ~0u << (32 - pbi);
+
+   delta = (ntohl(a->a6[pdw]) & mask) -
+   (ntohl(b->a6[pdw]) & mask);

[PATCH ipsec-next 11/11] xfrm: policy: add 2nd-level saddr trees for inexact policies

2018-11-07 Thread Florian Westphal
This adds the fourth and final search class, containing policies
where both saddr and daddr have prefix lengths (i.e., not wildcards).

Inexact policies now end up in one of the following four search classes:

1. "Any:Any" list, containing policies where both saddr and daddr are
   wildcards or have very coarse prefixes, e.g. 10.0.0.0/8 and the like.
2. "saddr:any" list, containing policies with a fixed saddr/prefixlen,
   but without destination restrictions.
   These lists are stored in rbtree nodes; each node contains those
   policies matching saddr/prefixlen.
3. "Any:daddr" list. Similar to 2), except for policies where only the
   destinations are specified.
4. "saddr:daddr" lists, containing only those policies that
   match the given source/destination network.
   The root of the saddr/daddr nodes gets stored in the nodes of the
   'daddr' tree.

This diagram illustrates the list classes, and their
placement in the lookup hierarchy:

xfrm_pol_inexact_bin = hash(dir,type,family,if_id);
  |
  + root_d: sorted by daddr:prefix
  | |
  |xfrm_pol_inexact_node
  | |
  | +- root: sorted by saddr/prefix
  | |  |
  | | xfrm_pol_inexact_node
  | |  |
  | |  + root: unused
  | |  |
  | |  + hhead: saddr:daddr policies
  | |
  | +- coarse policies and all any:daddr policies
  |
  + root_s: sorted by saddr:prefix
  | |
  |xfrm_pol_inexact_node
  | |
  | + root: unused
  | |
  | + hhead: saddr:any policies
  |
  + coarse policies and all any:any policies

A lookup for an inexact policy returns pointers to the four relevant list
classes, after which each of the lists needs to be searched for the policy
with the best (lowest) priority; if two have the same priority, the youngest wins.

This will only speed up lookups in case we have many policies and a
sizeable portion of these have disjunct saddr/daddr addresses.
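
As a standalone illustration of that final selection step -- again a
userspace model with made-up names, simplified to one pre-matched policy per
candidate class instead of a list, not the kernel implementation -- picking
the winner across the classes reduces to:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct policy {
	uint32_t priority;   /* lower value wins */
	uint32_t pos;        /* insertion order, larger means younger */
	const char *name;
};

/* return the best policy among the candidate classes; ties in priority
 * are broken in favour of the most recently added policy
 */
static const struct policy *best_of(const struct policy *cand[], size_t n)
{
	const struct policy *best = NULL;

	for (size_t i = 0; i < n; i++) {
		const struct policy *p = cand[i];

		if (!p)
			continue;
		if (!best || p->priority < best->priority ||
		    (p->priority == best->priority && p->pos > best->pos))
			best = p;
	}
	return best;
}

int main(void)
{
	const struct policy any_any     = { 100, 1, "any:any" };
	const struct policy any_daddr   = {  50, 2, "any:daddr" };
	const struct policy saddr_daddr = {  50, 7, "saddr:daddr" };
	/* one candidate per class; NULL means that class had no match */
	const struct policy *cand[4] = { &any_any, &any_daddr, &saddr_daddr, NULL };

	printf("winner: %s\n", best_of(cand, 4)->name);
	return 0;
}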

Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 118 +
 1 file changed, 108 insertions(+), 10 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 38e33326c856..bd80b8a4322f 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -58,6 +58,8 @@ struct xfrm_pol_inexact_node {
};
u8 prefixlen;
 
+   struct rb_root root;
+
/* the policies matching this node, can be empty list */
struct hlist_head hhead;
 };
@@ -69,6 +71,14 @@ struct xfrm_pol_inexact_node {
  * | |
  * |xfrm_pol_inexact_node
  * | |
+ * | +- root: sorted by saddr/prefix
+ * | |  |
+ * | | xfrm_pol_inexact_node
+ * | |  |
+ * | |  + root: unused
+ * | |  |
+ * | |  + hhead: saddr:daddr policies
+ * | |
  * | +- coarse policies and all any:daddr policies
  * |
  * + root_s: sorted by saddr:prefix
@@ -81,10 +91,11 @@ struct xfrm_pol_inexact_node {
  * |
  * + coarse policies and all any:any policies
  *
- * Lookups return three candidate lists:
+ * Lookups return four candidate lists:
  * 1. any:any list from top-level xfrm_pol_inexact_bin
  * 2. any:daddr list from daddr tree
- * 2. saddr:any list from saddr tree
+ * 3. saddr:daddr list from 2nd level daddr tree
+ * 4. saddr:any list from saddr tree
  *
  * This result set then needs to be searched for the policy with
  * the lowest priority.  If two results have same prio, youngest one wins.
@@ -116,6 +127,7 @@ struct xfrm_pol_inexact_bin {
 };
 
 enum xfrm_pol_inexact_candidate_type {
+   XFRM_POL_CAND_BOTH,
XFRM_POL_CAND_SADDR,
XFRM_POL_CAND_DADDR,
XFRM_POL_CAND_ANY,
@@ -876,13 +888,82 @@ static void xfrm_policy_inexact_list_reinsert(struct net *net,
}
 }
 
+static void xfrm_policy_inexact_node_reinsert(struct net *net,
+ struct xfrm_pol_inexact_node *n,
+ struct rb_root *new,
+ u16 family)
+{
+   struct rb_node **p, *parent = NULL;
+   struct xfrm_pol_inexact_node *node;
+
+   /* we should not have another subtree here */
+   WARN_ON_ONCE(!RB_EMPTY_ROOT(&n->root));
+
+   p = &new->rb_node;
+   while (*p) {
+   u8 prefixlen;
+   int delta;
+
+   parent = *p;
+   node = rb_entry(*p, struct xfrm_pol_inexact_node, node);
+
+  

[PATCH ipsec-next 03/11] xfrm: policy: split list insertion into a helper

2018-11-07 Thread Florian Westphal
... so we can reuse this later without code duplication when we add a
policy to a second inexact list.

Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 43 ++
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 39d0db2a50d9..20d6815be0d7 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -740,18 +740,12 @@ static bool xfrm_policy_mark_match(struct xfrm_policy *policy,
return false;
 }
 
-int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
+static struct xfrm_policy *xfrm_policy_insert_list(struct hlist_head *chain,
+  struct xfrm_policy *policy,
+  bool excl)
 {
-   struct net *net = xp_net(policy);
-   struct xfrm_policy *pol;
-   struct xfrm_policy *delpol;
-   struct hlist_head *chain;
-   struct hlist_node *newpos;
+   struct xfrm_policy *pol, *newpos = NULL, *delpol = NULL;
 
-   spin_lock_bh(&net->xfrm.xfrm_policy_lock);
-   chain = policy_hash_bysel(net, &policy->selector, policy->family, dir);
-   delpol = NULL;
-   newpos = NULL;
hlist_for_each_entry(pol, chain, bydst) {
if (pol->type == policy->type &&
pol->if_id == policy->if_id &&
@@ -759,24 +753,41 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
xfrm_policy_mark_match(policy, pol) &&
xfrm_sec_ctx_match(pol->security, policy->security) &&
!WARN_ON(delpol)) {
-   if (excl) {
-   spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
-   return -EEXIST;
-   }
+   if (excl)
+   return ERR_PTR(-EEXIST);
delpol = pol;
if (policy->priority > pol->priority)
continue;
} else if (policy->priority >= pol->priority) {
-   newpos = &pol->bydst;
+   newpos = pol;
continue;
}
if (delpol)
break;
}
if (newpos)
-   hlist_add_behind_rcu(&policy->bydst, newpos);
+   hlist_add_behind_rcu(&policy->bydst, &newpos->bydst);
else
hlist_add_head_rcu(&policy->bydst, chain);
+
+   return delpol;
+}
+
+int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl)
+{
+   struct net *net = xp_net(policy);
+   struct xfrm_policy *delpol;
+   struct hlist_head *chain;
+
+   spin_lock_bh(&net->xfrm.xfrm_policy_lock);
+   chain = policy_hash_bysel(net, &policy->selector, policy->family, dir);
+   delpol = xfrm_policy_insert_list(chain, policy, excl);
+
+   if (IS_ERR(delpol)) {
+   spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
+   return PTR_ERR(delpol);
+   }
+
__xfrm_policy_link(policy, dir);
 
/* After previous checking, family can either be AF_INET or AF_INET6 */
-- 
2.18.1



[PATCH ipsec-next 02/11] xfrm: security: iterate all, not inexact lists

2018-11-07 Thread Florian Westphal
currently all non-socket policies are either hashed in the dst table,
or placed on the 'inexact list'.  When flushing, we first walk the
table, then the (per-direction) inexact lists.

When we try to get rid of the single inexact list in favour of having "n"
inexact lists (e.g. per-af inexact lists, or lists sorted into a tree), this
walk would become more complicated.

Simplify this: walk the 'all' list and skip socket policies during
traversal so we don't need to handle exact and inexact policies
separately anymore.

Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 93 --
 1 file changed, 26 insertions(+), 67 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 119a427d9b2b..39d0db2a50d9 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -892,36 +892,19 @@ EXPORT_SYMBOL(xfrm_policy_byid);
 static inline int
 xfrm_policy_flush_secctx_check(struct net *net, u8 type, bool task_valid)
 {
-   int dir, err = 0;
+   struct xfrm_policy *pol;
+   int err = 0;
 
-   for (dir = 0; dir < XFRM_POLICY_MAX; dir++) {
-   struct xfrm_policy *pol;
-   int i;
+   list_for_each_entry(pol, &net->xfrm.policy_all, walk.all) {
+   if (pol->walk.dead ||
+   xfrm_policy_id2dir(pol->index) >= XFRM_POLICY_MAX ||
+   pol->type != type)
+   continue;
 
-   hlist_for_each_entry(pol,
-   &net->xfrm.policy_inexact[dir], bydst) {
-   if (pol->type != type)
-   continue;
-   err = security_xfrm_policy_delete(pol->security);
-   if (err) {
-   xfrm_audit_policy_delete(pol, 0, task_valid);
-   return err;
-   }
-   }
-   for (i = net->xfrm.policy_bydst[dir].hmask; i >= 0; i--) {
-   hlist_for_each_entry(pol,
-   net->xfrm.policy_bydst[dir].table + i,
-   bydst) {
-   if (pol->type != type)
-   continue;
-   err = security_xfrm_policy_delete(
-   pol->security);
-   if (err) {
-   xfrm_audit_policy_delete(pol, 0,
-task_valid);
-   return err;
-   }
-   }
+   err = security_xfrm_policy_delete(pol->security);
+   if (err) {
+   xfrm_audit_policy_delete(pol, 0, task_valid);
+   return err;
}
}
return err;
@@ -937,6 +920,7 @@ xfrm_policy_flush_secctx_check(struct net *net, u8 type, bool task_valid)
 int xfrm_policy_flush(struct net *net, u8 type, bool task_valid)
 {
int dir, err = 0, cnt = 0;
+   struct xfrm_policy *pol;
 
spin_lock_bh(&net->xfrm.xfrm_policy_lock);
 
@@ -944,46 +928,21 @@ int xfrm_policy_flush(struct net *net, u8 type, bool task_valid)
if (err)
goto out;
 
-   for (dir = 0; dir < XFRM_POLICY_MAX; dir++) {
-   struct xfrm_policy *pol;
-   int i;
-
-   again1:
-   hlist_for_each_entry(pol,
-   &net->xfrm.policy_inexact[dir], bydst) {
-   if (pol->type != type)
-   continue;
-   __xfrm_policy_unlink(pol, dir);
-   spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
-   cnt++;
-
-   xfrm_audit_policy_delete(pol, 1, task_valid);
-
-   xfrm_policy_kill(pol);
-
-   spin_lock_bh(&net->xfrm.xfrm_policy_lock);
-   goto again1;
-   }
-
-   for (i = net->xfrm.policy_bydst[dir].hmask; i >= 0; i--) {
-   again2:
-   hlist_for_each_entry(pol,
-   net->xfrm.policy_bydst[dir].table + i,
-   bydst) {
-   if (pol->type != type)
-   continue;
-   __xfrm_policy_unlink(pol, dir);
-   spin_unlock_bh(&net->xfrm.xfrm_policy_lock);
-   cnt++;
-
-   xfrm_audit_policy_delete(pol, 1, task_valid);
-   xfrm_policy_kill(pol);
-
-   spin_lock_bh(&net->xfrm.xfrm_policy_lock);
-   

Re: Attribute failed policy validation

2018-10-24 Thread Florian Westphal
Marco Berizzi  wrote:

[ CC David ]

> root@Calimero:~# tc qdisc add dev eth0 root handle 1:0 hfsc default 1
> Error: Attribute failed policy validation.

caused by:
commit 8b4c3cdd9dd8290343ce959a132d3b334062c5b9
net: sched: Add policy validation for tc attributes

Looks like TCA_OPTIONS is NLA_BINARY in hfsc case, so
we can't enforce NLA_NESTED :-(



Re: crash in xt_policy due to skb_dst_drop() in nf_ct_frag6_gather()

2018-10-16 Thread Florian Westphal
Maciej Żenczykowski  wrote:

I am currently travelling and not able to investigate
until next week.

> commit ad8b1ffc3efae2f65080bdb11145c87d299b8f9a
> Author: Florian Westphal 
> netfilter: ipv6: nf_defrag: drop skb dst before queueing
> 
> +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
> @@ -618,6 +618,8 @@ int nf_ct_frag6_gather(struct net *net, struct
> sk_buff *skb, u32 user)
> fq->q.meat == fq->q.len &&
> nf_ct_frag6_reasm(fq, skb, dev))
> ret = 0;
> +   else
> +   skb_dst_drop(skb);

This is only supposed to drop dst of skbs that are enqueued,
i.e. frag6_gather returns NF_STOLEN.

In case the skb completes the queue, then that skb's dst_entry
is supposed to be kept, so skb_dst() does NOT return NULL.

It's not supposed to be any different than ipv4 defrag.

> const struct dst_entry *dst = skb_dst(skb); // returns NULL

That is not supposed to happen.



[PATCH ipsec] xfrm: policy: use hlist rcu variants on insert

2018-10-10 Thread Florian Westphal
bydst table/list lookups use rcu, so insertions must use rcu versions.

Fixes: a7c44247f704e ("xfrm: policy: make xfrm_policy_lookup_bytype lockless")
Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 3110c3fbee20..3cf1fd7c0869 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -632,9 +632,9 @@ static void xfrm_hash_rebuild(struct work_struct *work)
break;
}
if (newpos)
-   hlist_add_behind(&policy->bydst, newpos);
+   hlist_add_behind_rcu(&policy->bydst, newpos);
else
-   hlist_add_head(&policy->bydst, chain);
+   hlist_add_head_rcu(&policy->bydst, chain);
}
 
spin_unlock_bh(>xfrm.xfrm_policy_lock);
@@ -774,9 +774,9 @@ int xfrm_policy_insert(int dir, struct xfrm_policy *policy, 
int excl)
break;
}
if (newpos)
-   hlist_add_behind(&policy->bydst, newpos);
+   hlist_add_behind_rcu(&policy->bydst, newpos);
else
-   hlist_add_head(&policy->bydst, chain);
+   hlist_add_head_rcu(&policy->bydst, chain);
__xfrm_policy_link(policy, dir);
 
/* After previous checking, family can either be AF_INET or AF_INET6 */
-- 
2.18.0



Re: [PATCH ipsec] net: xfrm: pass constant family to nf_hook function

2018-09-21 Thread Florian Westphal
David Ahern  wrote:
> On 9/21/18 8:55 AM, Florian Westphal wrote:
> > David Ahern  wrote:
> >>>  David, i hope this will silence the warning, would be nice
> >>>  if you could test it.
> >>
> >> I still the warning.
> > 
> > Wait.  Do you see this warning everywhere or just in xfrm?
> > 
> 
> just the one file.

Sigh, ok, I will set up a vm and deal with this somehow.

Steffen, please toss this patch, sorry for the noise.


Re: [PATCH ipsec] net: xfrm: pass constant family to nf_hook function

2018-09-21 Thread Florian Westphal
David Ahern  wrote:
> >  David, i hope this will silence the warning, would be nice
> >  if you could test it.
> 
> I still the warning.

Wait.  Do you see this warning everywhere or just in xfrm?


[PATCH ipsec] net: xfrm: pass constant family to nf_hook function

2018-09-21 Thread Florian Westphal
Unfortunately some versions of gcc emit the following warning:
  linux/compiler.h:252:20: warning: array subscript is above array bounds [-Warray-bounds]
  hook_head = rcu_dereference(net->nf.hooks_arp[hook]);
  ^
xfrm_output_resume() passes a non-constant 'pf' argument, so the compiler
can't know that the affected statement above will never be executed (we only
pass either NFPROTO_IPV4 or NFPROTO_IPV6).  This change makes that
explicit.

Another solution would be to increase hooks_arp[] size, but that
increases struct net size needlessly.

Reported-by: David Ahern 
Signed-off-by: Florian Westphal 
---
 David, i hope this will silence the warning, would be nice
 if you could test it.

 I don't really like this patch, but I see no better solution
 except a needless increase of hooks_arp[].

 Any other idea?

 net/xfrm/xfrm_output.c | 23 ++-
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 45ba07ab3e4f..199c0e782ac7 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -152,11 +152,24 @@ int xfrm_output_resume(struct sk_buff *skb, int err)
if (!skb_dst(skb)->xfrm)
return dst_output(net, skb->sk, skb);
 
-   err = nf_hook(skb_dst(skb)->ops->family,
- NF_INET_POST_ROUTING, net, skb->sk, skb,
- NULL, skb_dst(skb)->dev, xfrm_output2);
-   if (unlikely(err != 1))
-   goto out;
+   switch (skb_dst(skb)->ops->family) {
+   case AF_INET:
+   err = nf_hook(NFPROTO_IPV4,
+ NF_INET_POST_ROUTING, net, skb->sk, skb,
+ NULL, skb_dst(skb)->dev, xfrm_output2);
+   if (unlikely(err != 1))
+   goto out;
+   break;
+   case AF_INET6:
+   err = nf_hook(NFPROTO_IPV6,
+ NF_INET_POST_ROUTING, net, skb->sk, skb,
+ NULL, skb_dst(skb)->dev, xfrm_output2);
+   if (unlikely(err != 1))
+   goto out;
+   break;
+   default:
+   break;
+   }
}
 
if (err == -EINPROGRESS)
-- 
2.16.4



Re: array bounds warning in xfrm_output_resume

2018-09-20 Thread Florian Westphal
David Ahern  wrote:
> > $ make O=kbuild/perf -j 24 -s
> > In file included from /home/dsa/kernel-3.git/include/linux/kernel.h:10:0,
> >  from /home/dsa/kernel-3.git/include/linux/list.h:9,
> >  from /home/dsa/kernel-3.git/include/linux/module.h:9,
> >  from /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c:13:
> > /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c: In function
> > ‘xfrm_output_resume’:
> > /home/dsa/kernel-3.git/include/linux/compiler.h:252:20: warning: array
> > subscript is above array bounds [-Warray-bounds]
> >__read_once_size(&(x), __u.__c, sizeof(x));  \

Does this thing avoid the warning?

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -143,6 +143,8 @@ int xfrm_output_resume(struct sk_buff *skb, int err)
struct net *net = xs_net(skb_dst(skb)->xfrm);
 
while (likely((err = xfrm_output_one(skb, err)) == 0)) {
+   unsigned short pf;
+
nf_reset(skb);
 
err = skb_dst(skb)->ops->local_out(net, skb->sk, skb);
@@ -152,11 +154,19 @@ int xfrm_output_resume(struct sk_buff *skb, int err)
if (!skb_dst(skb)->xfrm)
return dst_output(net, skb->sk, skb);
 
-   err = nf_hook(skb_dst(skb)->ops->family,
- NF_INET_POST_ROUTING, net, skb->sk, skb,
- NULL, skb_dst(skb)->dev, xfrm_output2);
-   if (unlikely(err != 1))
-   goto out;
+   pf = skb_dst(skb)->ops->family;
+   switch (pf) {
+   case AF_INET: /* fallthrough */
+   case AF_INET6:
+   err = nf_hook(pf,
+ NF_INET_POST_ROUTING, net, skb->sk, skb,
+ NULL, skb_dst(skb)->dev, xfrm_output2);
+   if (unlikely(err != 1))
+   goto out;
+   break;
+   default:
+   break;
+   }
}
 
if (err == -EINPROGRESS)


Re: array bounds warning in xfrm_output_resume

2018-09-19 Thread Florian Westphal
David Ahern  wrote:
> > I believe ef57170bbfdd6 is the commit that introduced the warning
> > 
> 
> Hi Florian:
> 
> I am still seeing this. I realize the compiler is not understanding the
> conditions properly, but it is a distraction every time I do a kernel
> build on Debian Stretch having to filter the above noise from whether I
> introduce another warning.

Sorry, I forgot about this.  I'll look into it tomorrow.


[PATCH net] tcp: do not restart timewait timer on rst reception

2018-08-30 Thread Florian Westphal
RFC 1337 says:
 ''Ignore RST segments in TIME-WAIT state.
   If the 2 minute MSL is enforced, this fix avoids all three hazards.''

So with net.ipv4.tcp_rfc1337=1, expected behaviour is to have TIME-WAIT sk
expire rather than removing it instantly when a reset is received.

However, Linux will also re-start the TIME-WAIT timer.

This causes connect to fail when trying to re-use ports, or very long
delays (until the syn retry interval exceeds MSL).

packetdrill test case:
// Demonstrate bogus rearming of TIME-WAIT timer in rfc1337 mode.
`sysctl net.ipv4.tcp_rfc1337=1`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 29200 
0.100 > S. 0:0(0) ack 1 
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

// Receive first segment
0.310 < P. 1:1001(1000) ack 1 win 46

// Send one ACK
0.310 > . 1:1(0) ack 1001

// read 1000 byte
0.310 read(4, ..., 1000) = 1000

// Application writes 100 bytes
0.350 write(4, ..., 100) = 100
0.350 > P. 1:101(100) ack 1001

// ACK
0.500 < . 1001:1001(0) ack 101 win 257

// close the connection
0.600 close(4) = 0
0.600 > F. 101:101(0) ack 1001 win 244

// Our side is in FIN_WAIT_1 & waits for ack to fin
0.7 < . 1001:1001(0) ack 102 win 244

// Our side is in FIN_WAIT_2 with no outstanding data.
0.8 < F. 1001:1001(0) ack 102 win 244
0.8 > . 102:102(0) ack 1002 win 244

// Our side is now in TIME_WAIT state, send ack for fin.
0.9 < F. 1002:1002(0) ack 102 win 244
0.9 > . 102:102(0) ack 1002 win 244

// Peer reopens with in-window SYN:
1.000 < S 1000:1000(0) win 9200 

// Therefore, reply with ACK.
1.000 > . 102:102(0) ack 1002 win 244

// Peer sends RST for this ACK.  Normally this RST results
// in tw socket removal, but rfc1337=1 setting prevents this.
1.100 < R 1002:1002(0) win 244

// second syn. Due to rfc1337=1 expect another pure ACK.
31.0 < S 1000:1000(0) win 9200 
31.0 > . 102:102(0) ack 1002 win 244

// .. and another RST from peer.
31.1 < R 1002:1002(0) win 244
31.2 `echo no timer restart;ss -m -e -a -i -n -t -o state TIME-WAIT`

// third syn after one minute.  Time-Wait socket should have expired by now.
63.0 < S 1000:1000(0) win 9200 

// so we expect a syn-ack & 3whs to proceed from here on.
63.0 > S. 0:0(0) ack 1 

Without this patch, 'ss' shows restarts of tw timer and last packet is
thus just another pure ack, more than one minute later.

This restores the original code from commit 283fd6cf0be690a83
("Merge in ANK networking jumbo patch") in netdev-vger-cvs.git .

For some reason the else branch was removed/lost in 1f28b683339f7
("Merge in TCP/UDP optimizations and [..]") and timer restart became
unconditional.

Reported-by: Michal Tesar 
Signed-off-by: Florian Westphal 
---
 net/ipv4/tcp_minisocks.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 1dda1341a223..b690132f5da2 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -184,8 +184,9 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
inet_twsk_deschedule_put(tw);
return TCP_TW_SUCCESS;
}
+   } else {
+   inet_twsk_reschedule(tw, TCP_TIMEWAIT_LEN);
}
-   inet_twsk_reschedule(tw, TCP_TIMEWAIT_LEN);
 
if (tmp_opt.saw_tstamp) {
tcptw->tw_ts_recent   = tmp_opt.rcv_tsval;
-- 
2.16.4



Re: Linux kernel error stack

2018-08-05 Thread Florian Westphal
Michal Kubecek  wrote:
> On Mon, Aug 06, 2018 at 01:15:37AM +0200, Florian Westphal wrote:
> > Michal Kubecek  wrote:
> > > Oops, exactly this issue was already discussed almost a year ago:
> > > 
> > >   http://lkml.kernel.org/r/20170824104824.2c318a0...@unicorn.suse.cz
> > > 
> > > But something more urgent came and I forgot to get back to it. :-(
> > 
> > I did not even remeber, thanks for the pointer.
> > So I think best course of action is to update man page to clearly
> > say this only works in postrouting and with udp, and is ONLY
> > intended for working around old dhcp software.
> 
> As GSO for UDP is on its way to mainline, one might get into trouble
> even with UDP if the rule is not specific enough.

Yes, we still need a fix to ignore GSO too.


Re: Linux kernel error stack

2018-08-05 Thread Florian Westphal
Michal Kubecek  wrote:
> Oops, exactly this issue was already discussed almost a year ago:
> 
>   http://lkml.kernel.org/r/20170824104824.2c318a0...@unicorn.suse.cz
> 
> But something more urgent came and I forgot to get back to it. :-(

I did not even remember, thanks for the pointer.
So I think best course of action is to update man page to clearly
say this only works in postrouting and with udp, and is ONLY
intended for working around old dhcp software.

From the kernel side, have checkentry emit a warning
when this is used for protocols other than udp,
and make this a no-op for everything else.

Same for postrouting: warn from checkentry if the hook is something
else, and have target function not do anything for !postrouting.

Does that make sense?

I can work on a patch later this week if needed.
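
Roughly along these lines, perhaps (completely untested sketch, IPv4 side
only and with struct/field names from memory; a real patch would also need
the equivalent ip6t_ip6/IP6T_F_PROTO handling):

static int checksum_tg_check(const struct xt_tgchk_param *par)
{
	const struct xt_CHECKSUM_info *einfo = par->targinfo;
	const struct ipt_ip *i4 = par->entryinfo;

	if (einfo->operation & ~XT_CHECKSUM_OP_FILL)
		return -EINVAL;

	/* only known legitimate use: -p udp in POSTROUTING, for old dhcp clients */
	if (par->family == NFPROTO_IPV4 &&
	    (i4->proto != IPPROTO_UDP || (i4->invflags & XT_INV_PROTO) ||
	     (par->hook_mask & ~(1 << NF_INET_POST_ROUTING))))
		pr_warn_once("CHECKSUM only makes sense for UDP in POSTROUTING (dhcp workaround)\n");

	return 0;
}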


Re: Linux kernel error stack

2018-08-05 Thread Florian Westphal
Satish Patel  wrote:
> Florian,
> 
> It seems those rules coming from here
> https://github.com/openstack/openstack-ansible-os_neutron/blob/master/files/post-up-metadata-checksum

This is crazy, and, as you found, it doesn't even do what they seem to
think it does.

I see no reason for these rules at all.


Re: Linux kernel error stack

2018-08-05 Thread Florian Westphal
Satish Patel  wrote:
> After reading further related DHCP checksum issue, it seems we need
> that rules when you running DHCP on same host machine where your guest
> using host DHCP service, in that case virtual nic won't do checksum.
> If your DHCP running on different host then your physical nic perform
> checksum.

Partially right, it's only needed for older dhcp software that relies
on packet-header checksum correctness (the metadata contains info that the
checksum will be filled in by the nic later; modern servers will
use that and ignore the packet-header checksums in this case).

If you remove the rule and your VMs still get leases via DHCP then
the CHECKSUM workaround isn't needed.


Re: Linux kernel error stack

2018-08-05 Thread Florian Westphal
Satish Patel  wrote:
> > [84166:59495417] -A POSTROUTING -p tcp -m tcp --sport 80 -j CHECKSUM
> > --checksum-fill
> > [68739:5153476] -A POSTROUTING -p tcp -m tcp --sport 8000 -j CHECKSUM
> > --checksum-fill

These rules make no sense to me, and are also the source of your backtrace.
Who set this up?

If this is coming from openstack, I suggest asking openstack developers
WTH this is supposed to do.

> > [755:275452] -A POSTROUTING -s 10.0.3.0/24 -o lxcbr0 -p udp -m udp
> > --dport 68 -j CHECKSUM --checksum-fill

This was needed to work around dhcpd issues w. checksum offloading but I
guess that DHCP will work fine without this rule too nowadays.

So I suggest you simply get rid of these rules.


Re: Linux kernel error stack

2018-08-05 Thread Florian Westphal
Satish Patel  wrote:
> Thanks Florian,
> 
> FYI, I don't have any CHECKSUM configure in my iptables,

You have, according to the WARN stacktrace you provided.

iptables-save -c
ip6tables-save -c


Re: Linux kernel error stack

2018-08-05 Thread Florian Westphal
Satish Patel  wrote:
> I am installing openstack and as you know i have lots of bridges and
> vlan interface on my Linux CentOS 7.5
> 
> I was getting following error stack on 3.10 kernel and found this is
> kernel bug which required kernel upgrade so now i have upgraded my
> kernel to 4.17.12 but i am still seeing same kernel stack error on my
> dmesg
> 
> I have disable TSO, LRO, SG & GSO on my NIC but still getting error
> just wanted to understand what is this and why it popping up

Get rid of CHECKSUM target in the iptables rules.
This thing was added 8 years ago to work around dhcp bugs, I don't
think its use is needed anymore.
Try removing it and see that all VMs can still retrieve IP address
via DHCP.

I'm curious as to the rules; normally the CHECKSUM target should be
limited to -p udp --dport bootp.  It's a bad idea to feed it normal
packets, and it's expensive to do this in software rather than have the
device do the checksumming.

As for a fix, I'm tempted to send a patch to make the CHECKSUM target
eval a no-op & add a deprecation warning on init...

Other "fix" is to

diff --git a/net/netfilter/xt_CHECKSUM.c b/net/netfilter/xt_CHECKSUM.c
index 9f4151ec3e06..23a17dda604d 100644
--- a/net/netfilter/xt_CHECKSUM.c
+++ b/net/netfilter/xt_CHECKSUM.c
@@ -25,8 +25,12 @@ MODULE_ALIAS("ip6t_CHECKSUM");
 static unsigned int
 checksum_tg(struct sk_buff *skb, const struct xt_action_param *par)
 {
-   if (skb->ip_summed == CHECKSUM_PARTIAL)
-   skb_checksum_help(skb);
+   if (skb->ip_summed == CHECKSUM_PARTIAL) {
+   if (skb_shinfo(skb)->gso_size)
+   skb->ip_summed = CHECKSUM_NONE;
+   else
+   skb_checksum_help(skb);
+   }
 
return XT_CONTINUE;
 }

(unfortunately, the target isn't restricted to POSTROUTING, sigh).


[PATCH net-next v3] ipv6: defrag: drop non-last frags smaller than min mtu

2018-08-02 Thread Florian Westphal
don't bother with pathological cases, they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).

v3: don't use awkward "-offset + len"
v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
There were concerns that there could be even smaller frags
generated by intermediate nodes, e.g. on radio networks.

Cc: Peter Oskolkov 
Cc: Eric Dumazet 
Signed-off-by: Florian Westphal 
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 4 
 net/ipv6/reassembly.c   | 4 
 2 files changed, 8 insertions(+)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 0610bdab721c..aed9766559c9 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -557,6 +557,10 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff 
*skb, u32 user)
hdr = ipv6_hdr(skb);
fhdr = (struct frag_hdr *)skb_transport_header(skb);
 
+   if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   return -EINVAL;
+
skb_orphan(skb);
fq = fq_find(net, fhdr->identification, user, hdr,
 skb->dev ? skb->dev->ifindex : 0);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 6edd2ac8ae4b..a976b143463b 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -455,6 +455,10 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
return 1;
}
 
+   if (skb->len - skb_network_offset(skb) < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   goto fail_hdr;
+
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
-- 
2.16.4



[PATCH nf-next v2 1/1] ipv6: defrag: drop non-last frags smaller than min mtu

2018-08-02 Thread Florian Westphal
don't bother with pathological cases, they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).

v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
There were concerns that there could be even smaller frags
generated by intermediate nodes, e.g. on radio networks.

Cc: Peter Oskolkov 
Cc: Eric Dumazet 
Signed-off-by: Florian Westphal 
---
 net/ipv6/netfilter/nf_conntrack_reasm.c | 4 
 net/ipv6/reassembly.c   | 4 
 2 files changed, 8 insertions(+)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 0610bdab721c..c121d534d321 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -557,6 +557,10 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff 
*skb, u32 user)
hdr = ipv6_hdr(skb);
fhdr = (struct frag_hdr *)skb_transport_header(skb);
 
+   if (-skb_network_offset(skb) + skb->len < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   return -EINVAL;
+
skb_orphan(skb);
fq = fq_find(net, fhdr->identification, user, hdr,
 skb->dev ? skb->dev->ifindex : 0);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 6edd2ac8ae4b..ff00ada6128f 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -455,6 +455,10 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
return 1;
}
 
+   if (-skb_network_offset(skb) + skb->len < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   goto fail_hdr;
+
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
-- 
2.16.4



[PATCH net-next] inet: defrag: drop non-last frags smaller than min mtu

2018-08-02 Thread Florian Westphal
don't bother with pathological cases, they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).

For IPv4, in practice, we could probably also adopt a higher limit,
but for now use ipv4 min mtu (68).

Cc: Peter Oskolkov 
Cc: Eric Dumazet 
Signed-off-by: Florian Westphal 
---
 net/ipv4/ip_fragment.c  | 5 +
 net/ipv6/netfilter/nf_conntrack_reasm.c | 4 
 net/ipv6/reassembly.c   | 4 
 3 files changed, 13 insertions(+)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 8e9528ebaa8e..19aa10abc6ab 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -605,6 +605,10 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 
user)
int vif = l3mdev_master_ifindex_rcu(dev);
struct ipq *qp;
 
+   if (-skb_network_offset(skb) + skb->len < IPV4_MIN_MTU &&
+   ip_hdr(skb)->frag_off & htons(IP_MF))
+   goto drop;
+
__IP_INC_STATS(net, IPSTATS_MIB_REASMREQDS);
skb_orphan(skb);
 
@@ -622,6 +626,7 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 
user)
return ret;
}
 
+drop:
__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
kfree_skb(skb);
return -ENOMEM;
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 0610bdab721c..c121d534d321 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -557,6 +557,10 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff 
*skb, u32 user)
hdr = ipv6_hdr(skb);
fhdr = (struct frag_hdr *)skb_transport_header(skb);
 
+   if (-skb_network_offset(skb) + skb->len < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   return -EINVAL;
+
skb_orphan(skb);
fq = fq_find(net, fhdr->identification, user, hdr,
 skb->dev ? skb->dev->ifindex : 0);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 6edd2ac8ae4b..ff00ada6128f 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -455,6 +455,10 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
return 1;
}
 
+   if (-skb_network_offset(skb) + skb->len < IPV6_MIN_MTU &&
+   fhdr->frag_off & htons(IP6_MF))
+   goto fail_hdr;
+
iif = skb->dev ? skb->dev->ifindex : 0;
fq = fq_find(net, fhdr->identification, hdr, iif);
if (fq) {
-- 
2.16.4



Re: [PATCH net] inet: frag: enforce memory limits earlier

2018-07-31 Thread Florian Westphal
Jann Horn  wrote:
> On Tue, Jul 31, 2018 at 7:54 AM Florian Westphal  wrote:
> >
> > Eric Dumazet  wrote:
> > > We currently check current frags memory usage only when
> > > a new frag queue is created. This allows attackers to first
> > > consume the memory budget (default : 4 MB) creating thousands
> > > of frag queues, then sending tiny skbs to exceed high_thresh
> > > limit by 2 to 3 order of magnitude.
> > >
> > > Note that before commit 648700f76b03 ("inet: frags: use rhashtables
> > > for reassembly units"), work queue could be starved under DOS,
> > > getting no cpu cycles.
> > > After commit 648700f76b03, only the per frag queue timer can eventually
> > > remove an incomplete frag queue and its skbs.
> >
> > I'm not sure this is a good idea.
> >
> > This can now prevent "good" queue from completing just because attacker
> > is sending garbage.
> 
> There is only a limited amount of memory available to store fragments.
> If you receive lots of fragments that don't form complete packets,
> you'll have to drop some packets. I don't see why it matters whether
> incoming garbage only prevents the creation of new queues or also the
> completion of existing queues.

Agreed.  Objection withdrawn.

Acked-by: Florian Westphal 


Re: [PATCH net] inet: frag: enforce memory limits earlier

2018-07-30 Thread Florian Westphal
Eric Dumazet  wrote:
> We currently check current frags memory usage only when
> a new frag queue is created. This allows attackers to first
> consume the memory budget (default : 4 MB) creating thousands
> of frag queues, then sending tiny skbs to exceed high_thresh
> limit by 2 to 3 order of magnitude.
> 
> Note that before commit 648700f76b03 ("inet: frags: use rhashtables
> for reassembly units"), work queue could be starved under DOS,
> getting no cpu cycles.
> After commit 648700f76b03, only the per frag queue timer can eventually
> remove an incomplete frag queue and its skbs.

I'm not sure this is a good idea.

This can now prevent a "good" queue from completing just because an attacker
is sending garbage.


[PATCH net-next] netlink: do not store start function in netlink_cb

2018-07-24 Thread Florian Westphal
->start() is called once when dump is being initialized, there is no
need to store it in netlink_cb.

Signed-off-by: Florian Westphal 
---
 include/linux/netlink.h  | 1 -
 net/netlink/af_netlink.c | 5 ++---
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index f3075d6c7e82..71f121b66ca8 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -170,7 +170,6 @@ netlink_skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 struct netlink_callback {
struct sk_buff  *skb;
const struct nlmsghdr   *nlh;
-   int (*start)(struct netlink_callback *);
int (*dump)(struct sk_buff * skb,
struct netlink_callback *cb);
int (*done)(struct netlink_callback *cb);
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 393573a99a5a..f6ac7693d2cc 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2300,7 +2300,6 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 
cb = &nlk->cb;
memset(cb, 0, sizeof(*cb));
-   cb->start = control->start;
cb->dump = control->dump;
cb->done = control->done;
cb->nlh = nlh;
@@ -2309,8 +2308,8 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
cb->min_dump_alloc = control->min_dump_alloc;
cb->skb = skb;
 
-   if (cb->start) {
-   ret = cb->start(cb);
+   if (control->start) {
+   ret = control->start(cb);
if (ret)
goto error_put;
}
-- 
2.16.4



Re: [PATCH] netfilter: avoid stalls in nf_ct_alloc_hashtable

2018-07-24 Thread Florian Westphal
Li RongQing  wrote:
> when system forks a process with CLONE_NEWNET flag under the
> high memory pressure, it will trigger memory reclaim and stall
> for a long time because nf_ct_alloc_hashtable need to allocate
> high-order memory at that time. The calltrace as below:

>   nf_ct_alloc_hashtable
>   nf_conntrack_init_net

This call trace is from a kernel < 4.7.

commit 56d52d4892d0e478a005b99ed10d0a7f488ea8c1
netfilter: conntrack: use a single hashtable for all namespaces

removed per-netns hash table.


[PATCH net] atl1c: reserve min skb headroom

2018-07-20 Thread Florian Westphal
Got crash report with following backtrace:
BUG: unable to handle kernel paging request at 8801869daffe
RIP: 0010:[]  [] 
ip6_finish_output2+0x394/0x4c0
RSP: 0018:880186c83a98  EFLAGS: 00010283
RAX: 8801869db00e ...
  [] ip6_finish_output+0x8c/0xf0
  [] ip6_output+0x57/0x100
  [] ip6_forward+0x4b9/0x840
  [] ip6_rcv_finish+0x66/0xc0
  [] ipv6_rcv+0x319/0x530
  [] netif_receive_skb+0x1c/0x70
  [] atl1c_clean+0x1ec/0x310 [atl1c]
  ...

The bad access is in neigh_hh_output(), at skb->data - 16 (HH_DATA_MOD).
The atl1c driver provided an skb with no headroom, so 14 bytes (ethernet
header) got pulled, but then 16 are copied.

Reserve NET_SKB_PAD bytes headroom, like netdev_alloc_skb().

Compile tested only; I lack hardware.

Fixes: 7b7017642199 ("atl1c: Fix misuse of netdev_alloc_skb in refilling rx 
ring")
Signed-off-by: Florian Westphal 
---
diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c 
b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
index 94270f654b3b..7087b88550db 100644
--- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
+++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
@@ -1686,6 +1686,7 @@ static struct sk_buff *atl1c_alloc_skb(struct 
atl1c_adapter *adapter)
skb = build_skb(page_address(page) + adapter->rx_page_offset,
adapter->rx_frag_size);
if (likely(skb)) {
+   skb_reserve(skb, NET_SKB_PAD);
adapter->rx_page_offset += adapter->rx_frag_size;
if (adapter->rx_page_offset >= PAGE_SIZE)
adapter->rx_page = NULL;
-- 
2.16.4



Re: [PATCH v4 net-next 7/9] net: ipv4: listified version of ip_rcv

2018-07-03 Thread Florian Westphal
Pablo Neira Ayuso  wrote:
> On Mon, Jul 02, 2018 at 04:14:12PM +0100, Edward Cree wrote:
> > Also involved adding a way to run a netfilter hook over a list of packets.
> >  Rather than attempting to make netfilter know about lists (which would be
> >  a major project in itself) we just let it call the regular okfn (in this
> >  case ip_rcv_finish()) for any packets it steals, and have it give us back
> >  a list of packets it's synchronously accepted (which normally NF_HOOK
> >  would automatically call okfn() on, but we want to be able to potentially
> >  pass the list to a listified version of okfn().)
> > The netfilter hooks themselves are indirect calls that still happen per-
> >  packet (see nf_hook_entry_hookfn()), but again, changing that can be left
> >  for future work.
> > 
> > There is potential for out-of-order receives if the netfilter hook ends up
> >  synchronously stealing packets, as they will be processed before any
> >  accepts earlier in the list.  However, it was already possible for an
> >  asynchronous accept to cause out-of-order receives, so presumably this is
> >  considered OK.
> 
> I think we can simplify things if these chained packets don't follow
> the standard forwarding path, this would require to revisit many
> subsystems to handle these new chained packets - potentially a lot of
> work and likely breaking many things - and I would expect we (and
> other subsystems too) will not get very much benefits from these
> chained packets.

I guess it depends on what type of 'bundles' to expect on the wire.
For an average netfilter ruleset it will require more unbundling
because we might have

ipv4/udp -> ipv4/tcp -> ipv6/udp -> ipv4/tcp -> ipv4/udp

From Ed's patch set, this would be rebundled to

ipv4/udp -> ipv4/tcp -> ipv4/tcp -> ipv4/udp
and
ipv6/udp

nf ipv4 rule eval has to untangle again, to
ipv4/udp -> ipv4/udp
ipv4/tcp -> ipv4/tcp

HOWEVER, there are hooks that are L4 agnostic, such as defragmentation,
so we might still be able to take advantage.

Defrag is extra silly at the moment, we eat the indirect call cost
only to return NF_ACCEPT without doing anything in 99.99% of all cases.

So I still think there is value in exploring passing the bundle
into the nf core, via a new nf_hook_slow_list() that can be used
from the forward path (ingress, prerouting, input, forward, postrouting).

We can make this step-by-step, first splitting everything in
nf_hook_slow_list() and just calling the hook fns for each skb in the list.
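
A rough, untested sketch of what such a helper could look like on top of
Ed's skb list infrastructure (assuming the current 4-argument
nf_hook_slow() prototype):

void nf_hook_slow_list(struct list_head *head, struct nf_hook_state *state,
		       const struct nf_hook_entries *e)
{
	struct sk_buff *skb, *next;
	struct list_head sublist;

	INIT_LIST_HEAD(&sublist);

	list_for_each_entry_safe(skb, next, head, list) {
		list_del(&skb->list);
		/* != 1 means the hook dropped, stole or queued the skb */
		if (nf_hook_slow(skb, state, e, 0) == 1)
			list_add_tail(&skb->list, &sublist);
	}

	/* hand only the synchronously accepted packets back to the caller */
	list_splice(&sublist, head);
}

The caller would then feed the surviving list to the (listified) okfn;
everything else has already been consumed by the hooks.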

> In general I like this infrastructure, but I think we can get
> something simpler if we combine it with the flowtable idea, so chained
> packets follow the non-standard flowtable forwarding path as described
> in [1].

Yes, in case all packets are in the flow table (offload software
fallback) we might be able to keep the list as-is, provided everything has
the same nexthop.

However, I don't see any need to make changes to this series for this
now, it can be added on top.

Did I miss anything?

I do agree that from netfilter p.o.v. ingress hook looks like a good
initial candidate for a listification.


Re: [RFC PATCH v2 net-next 07/12] net: ipv4: listified version of ip_rcv

2018-06-27 Thread Florian Westphal
Edward Cree  wrote:
> Also involved adding a way to run a netfilter hook over a list of packets.
>  Rather than attempting to make netfilter know about lists (which would be
>  a major project in itself) we just let it call the regular okfn (in this
>  case ip_rcv_finish()) for any packets it steals, and have it give us back
>  a list of packets it's synchronously accepted (which normally NF_HOOK
>  would automatically call okfn() on, but we want to be able to potentially
>  pass the list to a listified version of okfn().)

okfn() is only used during async reinject in the NFQUEUE case: the
skb is queued in the kernel and we wait for a verdict from a userspace
process.  If that's ACCEPT, then okfn() gets called to reinject the skb
into the network stack.

A normal -j ACCEPT doesn't call okfn in the netfilter core, which is why
this occurs on '1' retval in NF_HOOK().

Only other user of okfn() is bridge netfilter, so listified version of
okfn() doesn't make too much sense to me, its not used normally
(unless such listified version makes the code simpler of course).

AFAICS it's fine to unlink/free skbs from the list to handle
drops/queueing etc., so a future version of nf_hook() could propagate the
list into nf_hook_slow and mangle the list there to deal with hooks
that steal/drop/queue skbs.

Later on we can pass the list to the hook functions themselves.

We'll have to handle non-accept verdicts in-place in the hook functions
for this, but fortunately most hookfns only return NF_ACCEPT so I think
it is manageable.

I'll look into this once the series makes it to net-next.


[PATCH v2 ipsec-next] xfrm: policy: remove pcpu policy cache

2018-06-25 Thread Florian Westphal
Kristian Evensen says:
  In a project I am involved in, we are running ipsec (Strongswan) on
  different mt7621-based routers. Each router is configured as an
  initiator and has around ~30 tunnels to different responders (running
  on misc. devices). Before the flow cache was removed (kernel 4.9), we
  got a combined throughput of around 70Mbit/s for all tunnels on one
  router. However, we recently switched to kernel 4.14 (4.14.48), and
  the total throughput is somewhere around 57Mbit/s (best-case). I.e., a
  drop of around 20%. Reverting the flow cache removal restores, as
  expected, performance levels to that of kernel 4.9.

When pcpu xdst exists, it has to be validated first before it can be
used.

A negative hit thus increases cost vs. no-cache.

As number of tunnels increases, hit rate decreases so this pcpu caching
isn't a viable strategy.

Furthermore, the xdst cache also needs to run with BH off, so when
removing this the bh disable/enable pairs can be removed too.

Kristian tested a 4.14.y backport of this change and reported
increased performance:

  In our tests, the throughput reduction has been reduced from around -20%
  to -5%. We also see that the overall throughput is independent of the
  number of tunnels, while before the throughput was reduced as the number
  of tunnels increased.

Reported-by: Kristian Evensen 
Signed-off-by: Florian Westphal 
---
 v2: rebase on top of current ipsec-next

 include/net/xfrm.h |   1 -
 net/xfrm/xfrm_device.c |  10 
 net/xfrm/xfrm_policy.c | 139 +
 net/xfrm/xfrm_state.c  |   5 +-
 4 files changed, 3 insertions(+), 152 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 3fa578a6a819..a5378613a49c 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -332,7 +332,6 @@ int xfrm_policy_register_afinfo(const struct 
xfrm_policy_afinfo *afinfo, int fam
 void xfrm_policy_unregister_afinfo(const struct xfrm_policy_afinfo *afinfo);
 void km_policy_notify(struct xfrm_policy *xp, int dir,
  const struct km_event *c);
-void xfrm_policy_cache_flush(void);
 void km_state_notify(struct xfrm_state *x, const struct km_event *c);
 
 struct xfrm_tmpl;
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 16c1230d20fa..11d56a44e9e8 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -307,12 +307,6 @@ static int xfrm_dev_register(struct net_device *dev)
return xfrm_api_check(dev);
 }
 
-static int xfrm_dev_unregister(struct net_device *dev)
-{
-   xfrm_policy_cache_flush();
-   return NOTIFY_DONE;
-}
-
 static int xfrm_dev_feat_change(struct net_device *dev)
 {
return xfrm_api_check(dev);
@@ -323,7 +317,6 @@ static int xfrm_dev_down(struct net_device *dev)
if (dev->features & NETIF_F_HW_ESP)
xfrm_dev_state_flush(dev_net(dev), dev, true);
 
-   xfrm_policy_cache_flush();
return NOTIFY_DONE;
 }
 
@@ -335,9 +328,6 @@ static int xfrm_dev_event(struct notifier_block *this, 
unsigned long event, void
case NETDEV_REGISTER:
return xfrm_dev_register(dev);
 
-   case NETDEV_UNREGISTER:
-   return xfrm_dev_unregister(dev);
-
case NETDEV_FEAT_CHANGE:
return xfrm_dev_feat_change(dev);
 
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index d960ea6657b5..ef75891450e7 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -45,8 +45,6 @@ struct xfrm_flo {
u8 flags;
 };
 
-static DEFINE_PER_CPU(struct xfrm_dst *, xfrm_last_dst);
-static struct work_struct *xfrm_pcpu_work __read_mostly;
 static DEFINE_SPINLOCK(xfrm_if_cb_lock);
 static struct xfrm_if_cb const __rcu *xfrm_if_cb __read_mostly;
 
@@ -1732,108 +1730,6 @@ static int xfrm_expand_policies(const struct flowi *fl, 
u16 family,
 
 }
 
-static void xfrm_last_dst_update(struct xfrm_dst *xdst, struct xfrm_dst *old)
-{
-   this_cpu_write(xfrm_last_dst, xdst);
-   if (old)
-   dst_release(&old->u.dst);
-}
-
-static void __xfrm_pcpu_work_fn(void)
-{
-   struct xfrm_dst *old;
-
-   old = this_cpu_read(xfrm_last_dst);
-   if (old && !xfrm_bundle_ok(old))
-   xfrm_last_dst_update(NULL, old);
-}
-
-static void xfrm_pcpu_work_fn(struct work_struct *work)
-{
-   local_bh_disable();
-   rcu_read_lock();
-   __xfrm_pcpu_work_fn();
-   rcu_read_unlock();
-   local_bh_enable();
-}
-
-void xfrm_policy_cache_flush(void)
-{
-   struct xfrm_dst *old;
-   bool found = false;
-   int cpu;
-
-   might_sleep();
-
-   local_bh_disable();
-   rcu_read_lock();
-   for_each_possible_cpu(cpu) {
-   old = per_cpu(xfrm_last_dst, cpu);
-   if (old && !xfrm_bundle_ok(old)) {
-   if (smp_processor_id() == cpu) {
-   __xfrm_pcpu_work_fn();
- 

[PATCH ipsec] xfrm: free skb if nlsk pointer is NULL

2018-06-25 Thread Florian Westphal
nlmsg_multicast() always frees the skb, so in case we cannot call
it we must do that ourselves.

Fixes: 21ee543edc0dea ("xfrm: fix race between netns cleanup and state expire 
notification")
Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_user.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 080035f056d9..5fd44fadb195 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1025,10 +1025,12 @@ static inline int xfrm_nlmsg_multicast(struct net *net, 
struct sk_buff *skb,
 {
struct sock *nlsk = rcu_dereference(net->xfrm.nlsk);
 
-   if (nlsk)
-   return nlmsg_multicast(nlsk, skb, pid, group, GFP_ATOMIC);
-   else
-   return -1;
+   if (!nlsk) {
+   kfree_skb(skb);
+   return -EPIPE;
+   }
+
+   return nlmsg_multicast(nlsk, skb, pid, group, GFP_ATOMIC);
 }
 
 static inline unsigned int xfrm_spdinfo_msgsize(void)
-- 
2.16.1



[PATCH ipsec-next] xfrm: policy: remove pcpu policy cache

2018-06-25 Thread Florian Westphal
Kristian Evensen says:
  In a project I am involved in, we are running ipsec (Strongswan) on
  different mt7621-based routers. Each router is configured as an
  initiator and has around ~30 tunnels to different responders (running
  on misc. devices). Before the flow cache was removed (kernel 4.9), we
  got a combined throughput of around 70Mbit/s for all tunnels on one
  router. However, we recently switched to kernel 4.14 (4.14.48), and
  the total throughput is somewhere around 57Mbit/s (best-case). I.e., a
  drop of around 20%. Reverting the flow cache removal restores, as
  expected, performance levels to that of kernel 4.9.

When pcpu xdst exists, it has to be validated first before it can be
used.

A negative hit thus increases cost vs. no-cache.

As number of tunnels increases, hit rate decreases so this pcpu caching
isn't a viable strategy.

Furthermore, the xdst cache also needs to run with BH off, so when
removing this the bh disable/enable pairs can be removed too.

Kristian tested a 4.14.y backport of this change and reported
increased performance:

  In our tests, the throughput reduction has been reduced from around -20%
  to -5%. We also see that the overall throughput is independent of the
  number of tunnels, while before the throughput was reduced as the number
  of tunnels increased.

Reported-by: Kristian Evensen 
Signed-off-by: Florian Westphal 
---
 include/net/xfrm.h |   1 -
 net/xfrm/xfrm_device.c |  10 
 net/xfrm/xfrm_policy.c | 139 +
 net/xfrm/xfrm_state.c  |   5 +-
 4 files changed, 3 insertions(+), 152 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 557122846e0e..b4c79692d915 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -323,7 +323,6 @@ int xfrm_policy_register_afinfo(const struct 
xfrm_policy_afinfo *afinfo, int fam
 void xfrm_policy_unregister_afinfo(const struct xfrm_policy_afinfo *afinfo);
 void km_policy_notify(struct xfrm_policy *xp, int dir,
  const struct km_event *c);
-void xfrm_policy_cache_flush(void);
 void km_state_notify(struct xfrm_state *x, const struct km_event *c);
 
 struct xfrm_tmpl;
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 175941e15a6e..b03ef5d2b4c0 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -306,12 +306,6 @@ static int xfrm_dev_register(struct net_device *dev)
return xfrm_api_check(dev);
 }
 
-static int xfrm_dev_unregister(struct net_device *dev)
-{
-   xfrm_policy_cache_flush();
-   return NOTIFY_DONE;
-}
-
 static int xfrm_dev_feat_change(struct net_device *dev)
 {
return xfrm_api_check(dev);
@@ -322,7 +316,6 @@ static int xfrm_dev_down(struct net_device *dev)
if (dev->features & NETIF_F_HW_ESP)
xfrm_dev_state_flush(dev_net(dev), dev, true);
 
-   xfrm_policy_cache_flush();
return NOTIFY_DONE;
 }
 
@@ -334,9 +327,6 @@ static int xfrm_dev_event(struct notifier_block *this, 
unsigned long event, void
case NETDEV_REGISTER:
return xfrm_dev_register(dev);
 
-   case NETDEV_UNREGISTER:
-   return xfrm_dev_unregister(dev);
-
case NETDEV_FEAT_CHANGE:
return xfrm_dev_feat_change(dev);
 
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 5f48251c1319..77d998c99541 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -45,8 +45,6 @@ struct xfrm_flo {
u8 flags;
 };
 
-static DEFINE_PER_CPU(struct xfrm_dst *, xfrm_last_dst);
-static struct work_struct *xfrm_pcpu_work __read_mostly;
 static DEFINE_SPINLOCK(xfrm_policy_afinfo_lock);
 static struct xfrm_policy_afinfo const __rcu *xfrm_policy_afinfo[AF_INET6 + 1]
__read_mostly;
@@ -1714,108 +1712,6 @@ static int xfrm_expand_policies(const struct flowi *fl, 
u16 family,
 
 }
 
-static void xfrm_last_dst_update(struct xfrm_dst *xdst, struct xfrm_dst *old)
-{
-   this_cpu_write(xfrm_last_dst, xdst);
-   if (old)
-   dst_release(&old->u.dst);
-}
-
-static void __xfrm_pcpu_work_fn(void)
-{
-   struct xfrm_dst *old;
-
-   old = this_cpu_read(xfrm_last_dst);
-   if (old && !xfrm_bundle_ok(old))
-   xfrm_last_dst_update(NULL, old);
-}
-
-static void xfrm_pcpu_work_fn(struct work_struct *work)
-{
-   local_bh_disable();
-   rcu_read_lock();
-   __xfrm_pcpu_work_fn();
-   rcu_read_unlock();
-   local_bh_enable();
-}
-
-void xfrm_policy_cache_flush(void)
-{
-   struct xfrm_dst *old;
-   bool found = false;
-   int cpu;
-
-   might_sleep();
-
-   local_bh_disable();
-   rcu_read_lock();
-   for_each_possible_cpu(cpu) {
-   old = per_cpu(xfrm_last_dst, cpu);
-   if (old && !xfrm_bundle_ok(old)) {
-   if (smp_processor_id() == cpu) {
-   __xfrm_pcpu_work_fn();
- 

Re: [PATCH][v2] xfrm: replace NR_CPU with nr_cpu_ids

2018-06-20 Thread Florian Westphal
Steffen Klassert  wrote:
> On Tue, Jun 19, 2018 at 09:53:49AM +0200, Florian Westphal wrote:
> > Li RongQing  wrote:
> > > The default NR_CPUS can be very large, but actual possible nr_cpu_ids
> > > usually is very small. For some x86 distribution, the NR_CPUS is 8192
> > > and nr_cpu_ids is 4, so replace NR_CPU to save some memory
> > 
> > Steffen,
> > 
> > I will soon submit a patch to remove the percpu cache; removal
> > improved performance for at least one user (and by quite a sizeable
> > amount).
> > 
> > Would you consider such removal for ipsec or ipsec-next?
> 
> I think this removel would better fit to ipsec-next.

Agree, it slows things down further for me in my tests.
Problem is that I get quite good re-use of pcpu cache due to
unidirectional flows and only one tunnel.

I suspect that even with a tunnel the removal is a win in practice
though; netperf is quite artificial, so I rather trust Kristian's results
(real world) than my own.

> considered to apply it to ipsec-next. If you plan
> to remove it, I'll wait for that.

I'll submit once net-next opens.


Re: array bounds warning in xfrm_output_resume

2018-06-19 Thread Florian Westphal
David Ahern  wrote:
> $ make O=kbuild/perf -j 24 -s
> In file included from /home/dsa/kernel-3.git/include/linux/kernel.h:10:0,
>  from /home/dsa/kernel-3.git/include/linux/list.h:9,
>  from /home/dsa/kernel-3.git/include/linux/module.h:9,
>  from /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c:13:
> /home/dsa/kernel-3.git/net/xfrm/xfrm_output.c: In function
> ‘xfrm_output_resume’:
> /home/dsa/kernel-3.git/include/linux/compiler.h:252:20: warning: array
> subscript is above array bounds [-Warray-bounds]
>__read_once_size(&(x), __u.__c, sizeof(x));  \
> ^~~~
> /home/dsa/kernel-3.git/include/linux/compiler.h:258:22: note: in
> expansion of macro ‘__READ_ONCE’
>  #define READ_ONCE(x) __READ_ONCE(x, 1)
>   ^~~
> /home/dsa/kernel-3.git/include/linux/rcupdate.h:350:48: note: in
> expansion of macro ‘READ_ONCE’
>   typeof(*p) *p1 = (typeof(*p) *__force)READ_ONCE(p); \
> ^
> /home/dsa/kernel-3.git/include/linux/rcupdate.h:487:2: note: in
> expansion of macro ‘__rcu_dereference_check’
>   __rcu_dereference_check((p), (c) || rcu_read_lock_held(), __rcu)
>   ^~~
> /home/dsa/kernel-3.git/include/linux/rcupdate.h:545:28: note: in
> expansion of macro ‘rcu_dereference_check’
>  #define rcu_dereference(p) rcu_dereference_check(p, 0)
> ^
> /home/dsa/kernel-3.git/include/linux/netfilter.h:218:15: note: in
> expansion of macro ‘rcu_dereference’
>hook_head = rcu_dereference(net->nf.hooks_arp[hook]);

Hmpf. The compiler can't know that this is only called for ipv4 and
ipv6 families.

> Line in question is the nf_hook in xfrm_output_resume.
> NF_INET_POST_ROUTING = 4 which is greater than NF_ARP_NUMHOOKS = 3
> 
> I believe ef57170bbfdd6 is the commit that introduced the warning

Yes.  I will see how to best fix this, probably needs an explicit
check on skb_dst(skb)->ops->family.
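
For illustration, a rough sketch of what such a check could look like
(untested, helper name made up; not necessarily the fix that will go in):

static int xfrm_skb_pf(const struct sk_buff *skb)
{
	switch (skb_dst(skb)->ops->family) {
	case AF_INET:
		return NFPROTO_IPV4;
	case AF_INET6:
		return NFPROTO_IPV6;
	}

	/* xfrm dsts are only created for ipv4/ipv6 */
	WARN_ON_ONCE(1);
	return NFPROTO_IPV4;
}

The caller would then pass a compile-time bounded pf value to the hook
instead of the open-coded skb_dst(skb)->ops->family.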


Re: [PATCH][v2] xfrm: replace NR_CPU with nr_cpu_ids

2018-06-19 Thread Florian Westphal
Li RongQing  wrote:
> The default NR_CPUS can be very large, but actual possible nr_cpu_ids
> usually is very small. For some x86 distribution, the NR_CPUS is 8192
> and nr_cpu_ids is 4, so replace NR_CPU to save some memory

Steffen,

I will soon submit a patch to remove the percpu cache; removal
improved performance for at least one user (and by quite a sizeable
amount).

Would you consider such removal for ipsec or ipsec-next?
If -next, consider applying this patch for ipsec.git.

Otherwise consider not applying this, as the code will go away soon.


Thanks,
Florian


Re: [PATCH net-next 0/10] xfrm: remove flow cache

2018-06-13 Thread Florian Westphal
Kristian Evensen  wrote:
> Hello,
> 
> On Tue, Jul 18, 2017 at 8:15 PM, David Miller  wrote:
> > Steffen, I know you have some level of trepidation about this because
> > there is obviously some performance cost immediately for removing this
> > DoS problem.
> 
> In a project I am involved in, we are running ipsec (Strongswan) on
> different mt7621-based routers. Each router is configured as an
> initiator and has around ~30 tunnels to different responders (running
> on misc. devices). Before the flow cache was removed (kernel 4.9), we
> got a combined throughput of around 70Mbit/s for all tunnels on one
> router. However, we recently switched to kernel 4.14 (4.14.48), and
> the total throughput is somewhere around 57Mbit/s (best-case). I.e., a
> drop of around 20%. Reverting the flow cache removal restores, as
> expected, performance levels to that of kernel 4.9.

Can you test attached patch?

I'd like to see how much the pcpu cache helps or if it actually hurts
in your setup.

Subject: [TEST PATCH 4.14.y] xfrm: remove pcpu policy cache

We need to re-evaluate if this still buys anything after indirect calls
got more expensive (retpolines).
When pcpu xdst exists, it has to be validated first (which needs
indirect calls).  So even if hit rate is good, it might be cheaper to
allocate a new xdst entry.

Furthermore, the current xdst cache needs to run with BH off, which
is also not needed when its removed.

Compile tested only.

Signed-off-by: Florian Westphal 
---
 include/net/xfrm.h |   1 -
 net/xfrm/xfrm_device.c |  10 
 net/xfrm/xfrm_policy.c | 138 +
 net/xfrm/xfrm_state.c  |   5 +-
 4 files changed, 3 insertions(+), 151 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index db99efb2d1d0..bdf185ae93db 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -323,7 +323,6 @@ int xfrm_policy_register_afinfo(const struct 
xfrm_policy_afinfo *afinfo, int fam
 void xfrm_policy_unregister_afinfo(const struct xfrm_policy_afinfo *afinfo);
 void km_policy_notify(struct xfrm_policy *xp, int dir,
  const struct km_event *c);
-void xfrm_policy_cache_flush(void);
 void km_state_notify(struct xfrm_state *x, const struct km_event *c);
 
 struct xfrm_tmpl;
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 30e5746085b8..4e458fd9236a 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -153,12 +153,6 @@ static int xfrm_dev_register(struct net_device *dev)
return NOTIFY_DONE;
 }
 
-static int xfrm_dev_unregister(struct net_device *dev)
-{
-   xfrm_policy_cache_flush();
-   return NOTIFY_DONE;
-}
-
 static int xfrm_dev_feat_change(struct net_device *dev)
 {
if ((dev->features & NETIF_F_HW_ESP) && !dev->xfrmdev_ops)
@@ -178,7 +172,6 @@ static int xfrm_dev_down(struct net_device *dev)
if (dev->features & NETIF_F_HW_ESP)
xfrm_dev_state_flush(dev_net(dev), dev, true);
 
-   xfrm_policy_cache_flush();
return NOTIFY_DONE;
 }
 
@@ -190,9 +183,6 @@ static int xfrm_dev_event(struct notifier_block *this, 
unsigned long event, void
case NETDEV_REGISTER:
return xfrm_dev_register(dev);
 
-   case NETDEV_UNREGISTER:
-   return xfrm_dev_unregister(dev);
-
case NETDEV_FEAT_CHANGE:
return xfrm_dev_feat_change(dev);
 
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 9c57d6a5816c..cdfb60a9820b 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -45,8 +45,6 @@ struct xfrm_flo {
u8 flags;
 };
 
-static DEFINE_PER_CPU(struct xfrm_dst *, xfrm_last_dst);
-static struct work_struct *xfrm_pcpu_work __read_mostly;
 static DEFINE_SPINLOCK(xfrm_policy_afinfo_lock);
 static struct xfrm_policy_afinfo const __rcu *xfrm_policy_afinfo[AF_INET6 + 1]
__read_mostly;
@@ -1715,108 +1713,6 @@ static int xfrm_expand_policies(const struct flowi *fl, 
u16 family,
 
 }
 
-static void xfrm_last_dst_update(struct xfrm_dst *xdst, struct xfrm_dst *old)
-{
-   this_cpu_write(xfrm_last_dst, xdst);
-   if (old)
-   dst_release(&old->u.dst);
-}
-
-static void __xfrm_pcpu_work_fn(void)
-{
-   struct xfrm_dst *old;
-
-   old = this_cpu_read(xfrm_last_dst);
-   if (old && !xfrm_bundle_ok(old))
-   xfrm_last_dst_update(NULL, old);
-}
-
-static void xfrm_pcpu_work_fn(struct work_struct *work)
-{
-   local_bh_disable();
-   rcu_read_lock();
-   __xfrm_pcpu_work_fn();
-   rcu_read_unlock();
-   local_bh_enable();
-}
-
-void xfrm_policy_cache_flush(void)
-{
-   struct xfrm_dst *old;
-   bool found = 0;
-   int cpu;
-
-   might_sleep();
-
-   local_bh_disable();
-   rcu_read_lock();
-   for_each_possible_cpu(cpu) {
-   old = per_cpu(xfrm_last_dst, cpu);
- 

Re: [PATCH net-next] netfilter: fix null-ptr-deref in nf_nat_decode_session

2018-05-28 Thread Florian Westphal
Prashant Bhole <bhole_prashant...@lab.ntt.co.jp> wrote:
> Add null check for nat_hook in nf_nat_decode_session()

Acked-by: Florian Westphal <f...@strlen.de>


Re: [PATCH v2 net] netfilter: provide correct argument to nla_strlcpy()

2018-05-22 Thread Florian Westphal
Eric Dumazet <eduma...@google.com> wrote:
> Recent patch forgot to remove nla_data(), upsetting syzkaller a bit.

Duuuh Thanks Eric.
Acked-by: Florian Westphal <f...@strlen.de>



Re: [PATCH v2] netfilter: properly initialize xt_table_info structure

2018-05-18 Thread Florian Westphal
Greg Kroah-Hartman  wrote:
> On Thu, May 17, 2018 at 12:42:00PM +0200, Jan Engelhardt wrote:
> > 
> > On Thursday 2018-05-17 12:09, Greg Kroah-Hartman wrote:
> > >> > --- a/net/netfilter/x_tables.c
> > >> > +++ b/net/netfilter/x_tables.c
> > >> > @@ -1183,11 +1183,10 @@ struct xt_table_info 
> > >> > *xt_alloc_table_info(unsigned int size)
> > >> > * than shoot all processes down before realizing there is 
> > >> > nothing
> > >> > * more to reclaim.
> > >> > */
> > >> > -  info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> > >> > +  info = kvzalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> > >> >if (!info)
> > >> >return NULL;
> > >>
> > >> I am curious, what particular path does not later overwrite the whole 
> > >> zone ?
> > >
> > >In do_ipt_get_ctl, the IPT_SO_GET_ENTRIES: option uses a len value that
> > >can be larger than the size of the structure itself.
> > >
> > >Then the data is copied to userspace in copy_entries_to_user() for ipv4
> > >and v6, and that's where the "bad data"
> > 
> > If the kernel incorrectly copies more bytes than it should, isn't that
> > a sign that may be going going past the end of the info buffer?
> > (And thus, zeroing won't truly fix the issue)
> 
> No, the buffer size is correct, we just aren't filling up the whole
> buffer as the data requested is smaller than the buffer size.

I have no objections to the patch but I'd like to understand what
problem it's fixing.

Normal pattern is:
newinfo = xt_alloc_table_info(tmp.size);
copy_from_user(newinfo->entries, user + sizeof(tmp), tmp.size);

So inital value of the rule blob area should not matter.

Furthermore, when copying the rule blob back to userspace,
the kernel is not supposed to copy any padding back to userspace either,
since commit f32815d21d4d8287336fb9cef4d2d9e0866214c2 only the
user-relevant parts should be copied (some matches and targets allocate
kernel-private data such as pointers, and we did use to leak such pointer
values back to userspace).


Re: Silently dropped UDP packets on kernel 4.14

2018-05-02 Thread Florian Westphal
Kristian Evensen  wrote:
> I went for the early-insert approached and have patched

I'm sorry for suggesting that.

It doesn't work, because of NAT.
NAT rewrites packet content and changes the reply tuple, but the tuples
determine the hash insertion location.

I don't know how to solve this problem.


Re: Silently dropped UDP packets on kernel 4.14

2018-05-01 Thread Florian Westphal
Kristian Evensen  wrote:
> On Tue, May 1, 2018 at 8:50 PM, Kristian Evensen
>  wrote:
> > Does anyone have any idea of what could be wrong, where I should look
> > or other things I can try? I tried to space the requests out a bit in
> > time (I inserted a sleep 1 between them), and then the problem went
> > away.
> 
> I should learn to always go through everything one last time before
> sending an email. First of all, I see that both requests are treated
> as new.

Normal when nfqueue is in use.

> Second, on my router, new requests are sent to user space for
> marking, which explains the large delay in processing. When removing
> the NFQUEUE-rule + handling and marking statically, my problem goes
> away and I get an answer for both packets.

Yes, because 2nd packet finds the conntrack entry created by the first
one.

> However, I do have one question. In my application, both packets are
> assigned the same mark. Shouldn't they then match the same conntrack
> entry, or am I missing something since that seems not to be the case?

The 2nd packet creates a new conntrack entry, because the conntrack
entry created by the first one is not inserted into the global table yet.

This happens as the last step, after the packet has traversed all chains.
When nfqueue is used, this gets delayed, and the 2nd packet will be dropped
as the insertion step finds that another packet already created the same
flow.

I'm not sure what the best way to solve this is; we either need
to insert earlier in the nfqueue case, or do conflict resolution in the
nfqueue case (and perhaps also undo the nat? not sure).


Re: [PATCH] netfilter: ebtables: handle string from userspace with care

2018-04-27 Thread Florian Westphal
Paolo Abeni  wrote:
> strlcpy() can't be safely used on a user-space provided string,
> as it can try to read beyond the buffer's end, if the latter is
> not NULL terminated.

Yes.

> Leveraging the above, syzbot has been able to trigger the following
> splat:
> 
> BUG: KASAN: stack-out-of-bounds in strlcpy include/linux/string.h:300
> [inline]
> BUG: KASAN: stack-out-of-bounds in compat_mtw_from_user
> net/bridge/netfilter/ebtables.c:1957 [inline]
> BUG: KASAN: stack-out-of-bounds in ebt_size_mwt
> net/bridge/netfilter/ebtables.c:2059 [inline]
> BUG: KASAN: stack-out-of-bounds in size_entry_mwt
> net/bridge/netfilter/ebtables.c:2155 [inline]
> BUG: KASAN: stack-out-of-bounds in compat_copy_entries+0x96c/0x14a0
> net/bridge/netfilter/ebtables.c:2194
> Write of size 33 at addr 8801b0abf888 by task syz-executor0/4504

Which is weird, I don't understand this report.
The code IS wrong, but it should cause an out-of-bounds read (strlen on
src), not an out-of-bounds write.

Yes, I sent a recent patch (dceb48d86b4871984b8ce9ad5057fb2c01aa33de in
nf.git) that now allows getting rid of the strlcpy and using the
source directly.
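
For illustration, a minimal sketch of a copy helper that cannot read past
an unterminated, user-supplied name (name and signature are made up for
the example; dst is assumed to hold at least 'size' bytes):

static int ebt_copy_name(char *dst, const char *src, size_t size)
{
	/* src comes from userspace and may lack the NUL terminator */
	if (strnlen(src, size) == size)
		return -EINVAL;

	strcpy(dst, src);
	return 0;
}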



Re: [nf-next] netfilter: extend SRH match to support matching previous, next and last SID

2018-04-23 Thread Florian Westphal
Ahmed Abdelsalam  wrote:
> > > @@ -50,6 +62,12 @@ struct ip6t_srh {
> > >   __u8segs_left;
> > >   __u8last_entry;
> > >   __u16   tag;
> > > + struct in6_addr psid_addr;
> > > + struct in6_addr nsid_addr;
> > > + struct in6_addr lsid_addr;
> > > + struct in6_addr psid_msk;
> > > + struct in6_addr nsid_msk;
> > > + struct in6_addr lsid_msk;
> > 
> > This is changing something exposed through UAPI, so you will need a
> > new revision for this.
> 
> Could you please advice what should be done in this case? 

You need to add
struct ip6t_srh_v1 {
/* copy of struct ip6t_srh here */

/* new fields go here */
};


Look at xt_conntrack.c, conntrack_mt_reg[] for an example of
multi-revision match.

You can probably re-organise the code to avoid too much duplication.
See 5a786232eb69a1f870ddc0cfd69d5bdef241a2ea in nf.git for an example,
it makes v0 into a v1 struct at runtime and re-uses new v1 code
for old v0.
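
For illustration, the resulting two-revision registration would look
roughly like this (fields trimmed, function names are placeholders):

static struct xt_match srh_mt6_reg[] __read_mostly = {
	{
		.name		= "srh",
		.revision	= 0,
		.family		= NFPROTO_IPV6,
		.match		= srh_mt6,
		.matchsize	= sizeof(struct ip6t_srh),
		.me		= THIS_MODULE,
	},
	{
		.name		= "srh",
		.revision	= 1,
		.family		= NFPROTO_IPV6,
		.match		= srh1_mt6,	/* new match function for v1 */
		.matchsize	= sizeof(struct ip6t_srh_v1),
		.me		= THIS_MODULE,
	},
};

/* registered via xt_register_matches(srh_mt6_reg, ARRAY_SIZE(srh_mt6_reg)) */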




Re: [PATCH RFC iptables] iptables: Per-net ns lock

2018-04-20 Thread Florian Westphal
Kirill Tkhai  wrote:
> Pablo, Florian, could you please provide comments on this?
> 
> On 09.04.2018 19:55, Kirill Tkhai wrote:
> > In CRIU and LXC-restore we met the situation,
> > when iptables in container can't be restored
> > because of permission denied:
> > 
> > https://github.com/checkpoint-restore/criu/issues/469
> > 
> > Containers want to restore their own net ns,
> > while they may have no their own mnt ns.
> > This case they share host's /run/xtables.lock
> > file, but they may not have permission to open
> > it.
> > 
> > Patch makes /run/xtables.lock to be per-namespace,
> > i.e., to refer to the caller task's net ns.

It looks ok to me but then again the entire userspace
lock thing is an ugly band aid :-/


Re: tcp hang when socket fills up ?

2018-04-17 Thread Florian Westphal
Dominique Martinet  wrote:

[ CC Jozsef ]

> Could it have something to do with the way I setup the connection?
> I don't think the "both remotes call connect() with carefully selected
> source/dest port" is a very common case..
> 
> If you look at the tcpdump outputs I attached the sequence usually is
> something like
>  server > client SYN
>  client > server SYN
>  server > client SYNACK
>  client > server ACK
> 
> ultimately it IS a connection, but with an extra SYN packet in front of
> it (that first SYN opens up the conntrack of the nat so that the
> client's syn can come in, the client's conntrack will be that of a
> normal connection since its first SYN goes in directly after the
> server's (it didn't see the server's SYN))
> 
> Looking at my logs again, I'm seeing the same as you:
> 
> This looks like the actual SYN/SYN/SYNACK/ACK:
>  - 14.364090 seq=505004283 likely SYN coming out of server
>  - 14.661731 seq=1913287797 on next line it says receiver
> end=505004284 so likely the matching SYN from client
> Which this time gets a proper SYNACK from server:
> 14.662020 seq=505004283 ack=1913287798
> And following final dataless ACK:
> 14.687570 seq=1913287798 ack=505004284
> 
> Then as you point out some data ACK, where the scale poofs:
> 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
> end=1913287819
> 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 
> scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> 14.688824 tcp_in_window: 
> 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 
> end=1913287819
> 14.62 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 
> scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
>
> As you say, only tcp_options() will clear only on side of the scales.
> We don't have sender->td_maxwin == 0 (printed) so I see no other way
> than we are in the last else if:
>  - we have after(end, sender->td_end) (end=1913287819 > sender
> end=1913287798)
>  - I assume the tcp state machine must be confused because of the
> SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, 
> but since this is a data packet it doesn't have the tcp option for scale
> thus scale resets.

Yes, this looks correct. Jozsef, can you please have a look?

Problem seems to be that conntrack believes that ACK packet
re-initializes the connection:

 595 /*
 596  * RFC 793: "if a TCP is reinitialized ... then it need
 597  * not wait at all; it must only be sure to use sequence
 598  * numbers larger than those recently used."
 599  */
 600 sender->td_end =
 601 sender->td_maxend = end;
 602 sender->td_maxwin = (win == 0 ? 1 : win);
 603 
 604 tcp_options(skb, dataoff, tcph, sender);

and last line clears the scale value (no wscale option in data packet).


Transitions are:
 server > client SYN  sNO -> sSS
 client > server SYN  sSS -> sS2
 server > client SYNACK   sS2 -> sSR /* here */
 client > server ACK  sSR -> sES

SYN/ACK was observed in original direction so we hit
state->state == TCP_CONNTRACK_SYN_RECV && dir == IP_CT_DIR_REPLY test
when we see the ack packet and end up in the 'TCP is reinitialized' branch.

AFAICS, without this, connection would move to sES just fine,
as the data ack is in window.


Re: tcp hang when socket fills up ?

2018-04-16 Thread Florian Westphal
Dominique Martinet  wrote:
> Eric Dumazet wrote on Sun, Apr 15, 2018:
> > Are you sure you do not have some iptables/netfilter stuff ?
> 
> I have a basic firewall setup with default rules e.g. starts with
> -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
> in the INPUT chain...
> That said, I just dropped it on the server to check and that seems to
> workaround the issue?!
> When logging everything dropped it appears to decide that the connection
> is no longer established at some point, but only if there is
> tcp_timestamp, just, err, how?
> 
> And certainly enough, if I restore the firewall while a connection is up
> that just hangs; conntrack doesn't consider it connected anymore at some
> point (but it worked for a while!)
> 
> Here's the kind of logs I get from iptables:
> IN=wlp1s0 OUT= MAC=00:c2:c6:b4:7e:c7:a4:12:42:b5:5d:fc:08:00 SRC=client 
> DST=server LEN=52 TOS=0x00 PREC=0x00 TTL=52 ID=17038 DF PROTO=TCP SPT=41558 
> DPT=15609 WINDOW=1212 RES=0x00 ACK URGP=0 

You could do
echo 6 > /proc/sys/net/netfilter/nf_conntrack_log_invalid

to have conntrack log when/why it thinks packet is invalid.

You can also set
echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal

which stops conntrack from marking packets with out-of-window
acks as invalid.

(Earlier email implies this is related to timestamps, but unfortunately
 to best of my knowledge conntrack doesn't look at tcp timestamps).



Re: possible deadlock in rtnl_lock (5)

2018-03-27 Thread Florian Westphal
syzbot  wrote:
[ cc Julian and trimming cc list ]

> syzkaller688027/4497 is trying to acquire lock:
>  (rtnl_mutex){+.+.}, at: [] rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74

> but task is already holding lock:
> IPVS: stopping backup sync thread 4495 ...
>  (rtnl_mutex){+.+.}, at: [] rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>CPU0
>
>   lock(rtnl_mutex);
>   lock(rtnl_mutex);
> 
>  *** DEADLOCK ***
> 
>  May be due to missing lock nesting notation

Looks like this is real, commit e0b26cc997d57305b4097711e12e13992580ae34
("ipvs: call rtnl_lock early") added rtnl_lock when starting sync thread
but socket close invokes rtnl_lock too:

> stack backtrace:
>  rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
>  ip_mc_drop_socket+0x88/0x230 net/ipv4/igmp.c:2643
>  inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:413
>  sock_release+0x8d/0x1e0 net/socket.c:595
>  start_sync_thread+0x2213/0x2b70 net/netfilter/ipvs/ip_vs_sync.c:1924
>  do_ip_vs_set_ctl+0x1139/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2389


Re: [PATCH nf] netfilter: drop template ct when conntrack is skipped.

2018-03-22 Thread Florian Westphal
Paolo Abeni <pab...@redhat.com> wrote:
> The ipv4 nf_ct code currently skips the nf_conntrak_in() call
> for fragmented packets. As a results later matches/target can end
> up manipulating template ct entry instead of 'real' ones.
> 
> Exploiting the above, syzbot found a way to trigger the following
> splat:
> 
> WARNING: CPU: 1 PID: 4242 at net/netfilter/xt_cluster.c:55
> xt_cluster_mt+0x6c1/0x840 net/netfilter/xt_cluster.c:127
> Kernel panic - not syncing: panic_on_warn set ...

Right, template has l3 protocol 0.

> Instead of adding checks for template ct on every target/match
> manipulating skb->_nfct, simply drop the template ct when skipping
> nf_conntrack_in().

Fixes: 7b4fdf77a450ec ("netfilter: don't track fragmented packets")
Acked-by: Florian Westphal <f...@strlen.de>



Re: linux-next: ip6tables *broken* - last base chain position %u doesn't match underflow %u (hook %u

2018-03-20 Thread Florian Westphal
valdis.kletni...@vt.edu  wrote:
> (Resending because I haven't heard anything)
[ ip6tables broken ]

Sorry, did not see this email before.

I'll investigate asap, thanks for the detailed report.


Re: [RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer

2018-03-15 Thread Florian Westphal
Alexei Starovoitov  wrote:
> The way this IMR defined today looks pretty much like nft and
> it feels a bit too low level than iptable conversion would need.

It wasn't so much about a specific IMR but to avoid code duplication
between nft and iptables translators.

> I think it would be simpler to have user space only extensions
> and opcodes added to bpf for the purpose of the translation.
> Like there is no bpf instruction called 'load from IP header',
> but we can make one. Just extend extended bpf with an instruction
> like this and on the first pass do full conversion of nft
> directly into this 'extended extended bpf'.

I don't want to duplicate any ebpf conversion (and optimisations)
in the nft part.

If nft can be translated to this 'extended extended bpf' and
this then generates bpf code from nft input all is good.



Re: [PATCH 00/30] Netfilter/IPVS updates for net-next

2018-03-13 Thread Florian Westphal
David Miller  wrote:
[ flow tables ]
> Ok, that seems to constrain the exposure.
> 
> We should talk at some point about how exposed conntrack itself is.

Sure, we can do that.

If you have specific scenarios (synflood, peer that opens
100k (legitimate) connections, perpetual-fin, etc) in mind let me know,
I do think that we could still do better in some cases.


Re: [PATCH 00/30] Netfilter/IPVS updates for net-next

2018-03-13 Thread Florian Westphal
David Miller  wrote:
> From: Felix Fietkau 
> Date: Mon, 12 Mar 2018 20:30:01 +0100
> 
> > It's not dead and useless. In its current state, it has a software fast
> > path that significantly improves nftables routing/NAT throughput,
> > especially on embedded devices.
> > On some devices, I've seen "only" 20% throughput improvement (along with
> > CPU usage reduction), on others it's quite a bit lot more. This is
> > without any extra drivers or patches aside from what's posted.
> 
> I wonder if this software fast path has the exploitability problems that
> things like the ipv4 routing cache and the per-cpu flow cache both had.

No, entries in the flow table are backed by an entry in the conntrack
table, and that has an upper ceiling.

As the decision of when an entry gets placed into the flow table is
configurable via the ruleset (nftables, iptables will be coming too), one
can tie the 'fastpathing' to almost-arbitrary criteria, e.g.

'only flows from trusted internal network'
'only flows that saw two-way communication'
'only flows that sent more than 100kbyte'

or any combination thereof.

Do you see another problem that needs to be addressed?


Re: [PATCH v2] net: netfilter: Replace printk() with appropriate pr_*() macro

2018-03-11 Thread Florian Westphal
Arushi Singhal  wrote:
> On Mon, Mar 12, 2018 at 2:17 AM, Pablo Neira Ayuso 
> wrote:
> 
> > Hi Joe,
> >
> > On Sun, Mar 11, 2018 at 12:52:41PM -0700, Joe Perches wrote:
> > > On Mon, 2018-03-12 at 01:11 +0530, Arushi Singhal wrote:
> > > > Using pr_() is more concise than
> > > > printk(KERN_).
> > > > Replace printks having a log level with the appropriate
> > > > pr_*() macros.
> > > >
> > > > Signed-off-by: Arushi Singhal 
> > > > ---
> > > > changes in v2
> > > > *in v1 printk() were replaced with netdev_*()
> > >
> > > >  net/netfilter/nf_conntrack_acct.c  | 2 +-
> > > >  net/netfilter/nf_conntrack_ecache.c| 2 +-
> > > >  net/netfilter/nf_conntrack_timestamp.c | 2 +-
> > > >  net/netfilter/nf_nat_core.c| 2 +-
> > > >  net/netfilter/nfnetlink_queue.c| 4 ++--
> > > >  5 files changed, 6 insertions(+), 6 deletions(-)
> > >
> > > None of these files have a #define for pr_fmt so this
> > > should be OK.
> >
> > I think Arushi could add pr_fmt in the same go, so we skip another
> > follow up patch for this. @Arushi: I suggested this in my previous
> > email, please have a look.
> >
> > Hello Pablo
> 
> Should I send two patches, one with the conversion of printk() to pr_() and
> another for defining pr_fmt().
> 
> Or
> 
> only one patch with all the changes?

Both in one, it reduces code churn.

With pr_* + pr_fmt, the module name will be prefixed automatically, see
e.g. commit e016c5e43db51875c2b541b59bd217494d213174 as an example.
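
I.e. a one-liner like this at the top of each file (standard idiom, shown
only for illustration):

/* must come before the #includes so linux/printk.h picks it up */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

Every pr_warn()/pr_info() in that file then gets the module name prefix
for free.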

> > This would also probably allows us to save the line break in the error
> > message, which IIRC is not a good practise either, eg.
> >
> > pr_warn("nf_queue: OOM "
> > "in mangle, dropping packet\n");
> >
> > > Perhaps coalesce the formats and remove the unnecessary periods too.

The above message should be removed, it's useless (on OOM the allocator
already warns).


Re: WARNING in __proc_create

2018-03-09 Thread Florian Westphal
Cong Wang  wrote:
> On Fri, Mar 9, 2018 at 2:58 PM, Eric Dumazet  wrote:
> >
> >
> > On 03/09/2018 02:56 PM, Eric Dumazet wrote:
> >
> >>
> >> I sent a patch a while back, but Pablo/Florian wanted more than that
> >> simple fix.
> >>
> >> We also need to filter special characters like '/'
> 
> proc_create_data() itself accepts '/', so it must be xt_hashlimit doesn't
> want it.

--hashlimit-name / also triggers WARN for me.
. or .. "work" (no crash), but cause the appearance of a 2nd ./.. in
/proc/net/ipt_hashlimit, so I think it's better to disallow that too.
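
A minimal sketch of the kind of name validation meant here (helper name
made up; the real check would sit in the hashlimit checkentry/init path):

static bool hashlimit_name_ok(const char *name)
{
	if (*name == '\0')	/* empty name makes proc_create_data() WARN */
		return false;
	if (strchr(name, '/'))	/* so does a '/' in the name */
		return false;

	/* "." and ".." don't crash but create bogus-looking proc entries */
	return strcmp(name, ".") && strcmp(name, "..");
}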


Re: WARNING in __proc_create

2018-03-09 Thread Florian Westphal
Eric Dumazet  wrote:
> >>fs/proc/generic.c:354
> >
> >We need to reject empty names.
> >
> 
> I sent a patch a while back, but Pablo/Florian wanted more than that simple
> fix.
> 
> We also need to filter special characters like '/'
> 
> Or maybe I am mixing with something else.

Argh, sorry, this fell off the truck it seems :(

I'll work on this tomorrow unless someone else does it today ;)


Re: [RFC PATCH v2] bridge: make it possible for packets to traverse the bridge without hitting netfilter

2018-03-09 Thread Florian Westphal
David Woodhouse <dw...@infradead.org> wrote:
> 
> 
> On Fri, 2015-03-06 at 17:37 +0100, Florian Westphal wrote:
> > 
> > > > I did performance measurements in the following way:
> > > > 
> > > > Removed those pieces of the packet pipeline that I don't necessarily
> > > > need one-by-one.  Then measured their effect on small packet
> > > > performance.
> > > > 
> > > > This was the only part that produced considerable effect.
> > > > 
> > > > The pure speculation was about why the effect is more than 15%
> > > > increase in packet throughput, although the code path avoided
> > > > contains way less code than 15% of the packet pipeline.  It seems,
> > > > Felix Fietkau profiled similar changes, and found my guess well
> > > > founded.
> > > > 
> > > > Now could anybody explain me what else is wrong with my patch?
> > > 
> > > We have to come up with a more generic solution for this.
> > 
> > Jiri Benc suggested to allowing to attach netfilter hooks e.g. via tc
> > action, maybe that would be an option worth investigating.
> > 
> > Then you could for instance add filtering rules only to the bridge port
> > that needs it.
> > 
> > > These sysfs tweaks you're proposing look to me like an obscure way to
> > > tune this.
> > 
> > I agree, adding more tunables isn't all that helpful, in the past this
> > only helped to prolong the problem.
> 
> How feasible would it be to make it completely dynamic?
> 
> A given hook could automatically disable itself (for a given device) if
> the result of running it the first time was *tautologically* false for
> that device (i.e. regardless of the packet itself, or anything else).
> 
> The hook would need to be automatically re-enabled if the rule chain
> ever changes (and might subsequently disable itself again).
> 
> Is that something that's worth exploring for the general case?

AF_BRIDGE hooks sit in the net namespace, so it's enough for
one bridge to request filtering to bring in the hook overhead for all
bridges in the same netns.

Alternatives:
- place the bridges that need filtering in different netns
- use tc ingress for filtering
- use nftables ingress hook for filtering (it sits in almost same
  location as tc ingress hook) to attach the ruleset to those bridge
  ports that need packet filtering.

(The original request came from user with tons of bridges where only
 one single bridge needed filtering).

One alternative I see is to place the bridge hooks into the
bridge device (net_bridge struct, which is in netdev private area).

But, as you already mentioned we would need to annotate the hooks
to figure out which device(s) they are for.

This sounds rather fragile to me, so I would just use nft ingress:

#!/sbin/nft -f

table netdev ingress {
  chain in_public { type filter hook ingress device eth0 priority 0;
   ip saddr 192.168.0.1 counter
 }
}


Re: [RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer

2018-03-06 Thread Florian Westphal
Edward Cree <ec...@solarflare.com> wrote:
> On 06/03/18 16:42, Florian Westphal wrote:
> > I would also add 'highlevel' objects that are themselves translated into
> > basic operations.  Most obvious example
> > are 'fetch 4 bytes x bytes into transport header'.
> >
> > Frontend should not need to bother with ipv4 details, such as ip
> > options.  Instead IMR should take care of this and generate the needed
> > instructions to fetch iph->ihl and figure out the correct transport
> > header offset.
> Presumably then for this the IMR regs will cease to have any connection to
>  BPF regs and will simply be (SSA?) r0, r1, ... as far as needed (not
>  limited to 10 regs like BPF)?  Then register allocation all happens in
>  the IMR->BPF conversion (even for things 64 bits or smaller).
> 
> I wonder how sophisticated we should be about register allocation; whether
>  we should go the whole hog with graph-colouring algorithms or linear
>  scan, or just do something naïve like an LRU.
> 
> Relatedly, should we spill values to the stack when we run out of
>  registers, or should we just rely on being able to rematerialise them
>  from parsing the packet again?

I don't know.  I suspect we should go for a naive algorithm only,
but I would defer such a decision to Alexei/Daniel.

F.e. I don't know if using llvm is a good idea or not; I did not
intend to turn the proposed imr into a full blown compiler in any case,
I only want to avoid code duplication for the iptables/nftables -> ebpf
translator.


Re: [RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer

2018-03-06 Thread Florian Westphal
Daniel Borkmann <dan...@iogearbox.net> wrote:
> On 03/04/2018 08:40 PM, Florian Westphal wrote:
> > Its still in early stage, but I think its good enough as
> > a proof-of-concept.
> 
> be the critical part in that it needs to be flexible enough to cover
> both front ends well enough without having to make compromises to
> one of them.

Yes, that's the idea.

> The same would be for optimization passes e.g. when we
> know that two successive rules would match on TCP header bits that we
> can reuse the register that loaded/parsed it previously to that point.

Yes, I already thought about this.

First step would be to simply track the last accessed header base
(mac, nh, th) in the imr state so we can reuse the offset if possible.
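
Roughly something like this in the imr/jit state (names invented, not
part of the posted POC):

struct imr_jit_hdr_cache {
	enum imr_payload_base last_base;	/* base of last payload access */
	int base_reg;		/* ebpf register holding that base offset */
	bool valid;		/* cleared once base_reg gets clobbered */
};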

> Similar when it comes to maps when the lookup value would need to
> propagate through the linked imr objects. Do you see such optimizations
> or in general propagation of state as direct part of the IMR or rather
> somewhat hidden in IMR layer when doing the IMR to BPF 'jit' phase?

Direct part.  I would only say that it should be hidden in the sense that
frontends don't need to do anything here.
I would do such propagation in the IMR before generating code.

The IMR has enough info to figure out:
1. number of (ebpf) registers that are needed simultaneously at most
2. how many payload requests relative to (mac/network/tcp header) there are
(both in total and in relation to rule count).

So it could e.g. decide to set aside 2 registers to contain the
th/nh offset to have them ready at all times, while not covering mac
header at all because its not used anywhere.

Likewise, we could even avoid for instance extension header parsing
as IMR would see that there are no 'IPv6+transport header' rules.

> Which other parts do you think would be needed for the IMR aside
> from above basic operations? ALU ops, JMPs for the relational ops?

We'll need at least the following IMR features/objects (pseudo instructions):

- relops: gt, ge, lt etc.
- all bitops (shift, xor, and, etc)
- payload: allow stores instead of loads (this includes csum fixup)
- verdict: jumps to other chains, nfqueue verdict
- meta: (accesses to in/output interface names, skb length, and the like).
- register allocation for ipv6 addresses, interface names, etc (anything
larger than 64 bit).

I would also add 'highlevel' objects that are themselves translated into
basic operations.  The most obvious example
is 'fetch 4 bytes at offset x into the transport header'.

Frontend should not need to bother with ipv4 details, such as ip
options.  Instead IMR should take care of this and generate the needed
instructions to fetch iph->ihl and figure out the correct transport
header offset.

It could do so by adding the needed part of the rule (fetch byte containing
ihl, mask/shift result, store the offset, etc.) internally.
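
For illustration, this is the arithmetic the emitted instructions would
have to perform, written as plain C:

/* low nibble of the first IPv4 header byte is the header length
 * in 32-bit words, including options */
static unsigned int ipv4_transport_offset(const unsigned char *nh)
{
	return (nh[0] & 0x0f) * 4;
}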

I would also add:
- limit (to jit both xt_limit and nft limit)
- statistic (random sampling)

as both nft and iptables have those features and I think they're
pretty much the same, so it can be handled by common code.
Same for others extensions but those have side effects and I don't
know yet how to handle that (log, counter and the like).

Same for set lookups.

In the iptables case, they will only be 'address/network/interface is (not) in set',
i.e. only match/no match, and set manipulations (SET target).

In the nftables case it can also be something like 'retrieve value
belonging to key (dreg = lookup(map, sreg, sreg_len), and then dreg
value gets used in other ways.

I think that by making the IMR also handle the nftables case rather than
iptables one has the advantage that the IMR might be able to optimize
repetitiive iptables rules (and nft rules that do not use jump map even
if they could).

iptables .. -i eth0 -j ETH0_RULES
iptables .. -i eth1 -j ETH1_RULES
..
iptables .. -i ethX -j ETHX_RULES

could be transparently thrown into a map and the jump target retrieved
from it based on iifname...

The IMR would also need to be able to create/allocate such sets/maps
on its own.

Do you see this as something that has to be decided now?
Ideally, my feeling about the IMR is that we can probably handle this
later by extending it rather than having to throw it away and do a
rewrite :)

> I think it would be good to have clear semantics in terms of what
> it would eventually abstract away from raw BPF when sitting on top
> of it; potentially these could be details on packet access or

[..]

Yes, it would be good if the IMR could e.g. merge adjacent loads:

load 4 bytes from network header at offset 12  (relop, equal) imm 0x1
load 4 bytes from network header at offset 16  (relop, equal) imm 0x2

 ..  would be condensed to ...

load 8 bytes from network header at offset 12  (relop, equal) imm 0x10002

... before generating ebpf code.

I don't think this should be part of the frontend (too low-level a task for
the frontend) and I don't see why it should happen when generating c

Re: [PATCH net] netfilter: check for out-of-bounds while copying compat entries

2018-03-05 Thread Florian Westphal
Paolo Abeni  wrote:
> Currently, when coping ebt compat entries, no checks are in place
> for the offsets provided by user space, so that syzbot was able to
> trigger the following splat:
> ---
>  net/bridge/netfilter/ebtables.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
> index 02c4b409d317..54ceaff701fb 100644
> --- a/net/bridge/netfilter/ebtables.c
> +++ b/net/bridge/netfilter/ebtables.c
> @@ -2114,7 +2114,7 @@ static int size_entry_mwt(struct ebt_entry *entry, 
> const unsigned char *base,
>   unsigned int size;
>   char *buf = buf_start + offsets[i];
>  
> - if (offsets[i] > offsets[j])
> + if (offsets[i] > offsets[j] || offsets[j] > *total)
>   return -EINVAL;

I thought I fixed this via b71812168571fa55e44cdd0254471331b9c4c4c6,
and, after looking at it again, I still don't see why that doesn't cover
this :-(


[RFC,POC 1/3] bpfilter: add experimental IMR bpf translator

2018-03-04 Thread Florian Westphal
This is a basic intermediate representation to decouple
the ruleset representation (iptables, nftables) from the
ebpf translation.

The IMR currently assumes that translation will always be
into ebpf, its pseudo-registers map 1:1 to ebpf ones.

Objects implemented at the moment:
- relop (eq, ne only for now)
- immediate (32, 64 bit constants)
- payload, with relative addressing (mac header, network header, transport 
header)

This doesn't add a user; files will not even be compiled yet.

Signed-off-by: Florian Westphal <f...@strlen.de>
---
 net/bpfilter/imr.c | 655 +
 net/bpfilter/imr.h |  78 +++
 2 files changed, 733 insertions(+)
 create mode 100644 net/bpfilter/imr.c
 create mode 100644 net/bpfilter/imr.h

diff --git a/net/bpfilter/imr.c b/net/bpfilter/imr.c
new file mode 100644
index ..09c557ea7c21
--- /dev/null
+++ b/net/bpfilter/imr.c
@@ -0,0 +1,655 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include 
+typedef __u16 __bitwise __sum16; /* hack */
+#include 
+#include 
+
+#include "imr.h"
+#include "bpfilter_gen.h"
+
+#define EMIT(ctx, x)   \
+   do {\
+   if ((ctx)->len_cur + 1 > (ctx)->len_max)\
+   return -ENOMEM; \
+   (ctx)->img[(ctx)->len_cur++] = x;   \
+   } while (0)
+
+struct imr_object {
+   enum imr_obj_type type:8;
+   uint8_t len;
+
+   union {
+   struct {
+   union {
+   uint64_t value64;
+   uint32_t value32;
+   };
+   } immedate;
+   struct {
+   struct imr_object *left;
+   struct imr_object *right;
+   enum imr_relop op:8;
+   } relational;
+   struct {
+   uint16_t offset;
+   enum imr_payload_base base:8;
+   } payload;
+   struct {
+   enum imr_verdict verdict;
+   } verdict;
+   };
+};
+
+struct imr_state {
+   struct bpf_insn *img;
+   uint32_t len_cur;
+   uint32_t len_max;
+
+   struct imr_object *registers[IMR_REG_COUNT];
+   uint8_t regcount;
+
+   uint32_t num_objects;
+   struct imr_object **objects;
+};
+
+static int imr_jit_object(struct bpfilter_gen_ctx *ctx,
+ struct imr_state *, const struct imr_object *o);
+
+static void internal_error(const char *s)
+{
+   fprintf(stderr, "FIXME: internal error %s\n", s);
+   exit(1);
+}
+
+/* FIXME: consider len too (e.g. reserve 2 registers for len == 8) */
+static int imr_register_alloc(struct imr_state *s, uint32_t len)
+{
+   uint8_t reg = s->regcount;
+
+   if (s->regcount >= IMR_REG_COUNT)
+   return -1;
+
+   s->regcount++;
+
+   return reg;
+}
+
+static int imr_register_get(const struct imr_state *s, uint32_t len)
+{
+   if (len > sizeof(uint64_t))
+   internal_error(">64bit types not yet implemented");
+   if (s->regcount == 0)
+   internal_error("no registers in use");
+
+   return s->regcount - 1;
+}
+
+static int imr_to_bpf_reg(enum imr_reg_num n)
+{
+   /* currently maps 1:1 */
+   return (int)n;
+}
+
+static int bpf_reg_width(unsigned int len)
+{
+   switch (len) {
+   case sizeof(uint8_t): return BPF_B;
+   case sizeof(uint16_t): return BPF_H;
+   case sizeof(uint32_t): return BPF_W;
+   case sizeof(uint64_t): return BPF_DW;
+   default:
+   internal_error("reg size not supported");
+   }
+
+   return -EINVAL;
+}
+
+static void imr_register_release(struct imr_state *s)
+{
+   if (s->regcount == 0)
+   internal_error("regcount underflow");
+   s->regcount--;
+}
+
+void imr_register_store(struct imr_state *s, enum imr_reg_num reg, struct 
imr_object *o)
+{
+   s->registers[reg] = o;
+}
+
+struct imr_object *imr_register_load(const struct imr_state *s, enum 
imr_reg_num reg)
+{
+   return s->registers[reg];
+}
+
+struct imr_state *imr_state_alloc(void)
+{
+   struct imr_state *s = calloc(1, sizeof(*s));
+
+   return s;
+}
+
+void imr_state_free(struct imr_state *s)
+{
+   int i;
+
+   for (i = 0; i < s->num_objects; i++)
+   imr_object_free(s->objects[i]);
+
+   free(s);
+}
+
+struct imr_object *imr_object_alloc(enum imr_obj_type t)
+{
+   struct imr_object *o = calloc(1, sizeof(*o));
+
+   if (o)
+   o->type = t;
+
+   return o;
+}
+
+void imr_object_free(struct imr_object *o)
+{
+   switch (o->type) {
+   case IMR_OBJ_TYP

[RFC,POC 2/3] bpfilter: add nftables jit proof-of-concept

2018-03-04 Thread Florian Westphal
This adds an nftables frontend for the IMR->BPF translator.

This doesn't work via UMH yet.

AFAIU it should be possible to get transparent ebpf translation for
nftables, similar to the bpfilter/iptables UMH.

However, at this time I think it's better to get the IMR "right".

nftjit.ko currently needs libnftnl/libmnl but that's convenience on
my end and not a "must have".

Signed-off-by: Florian Westphal <f...@strlen.de>
---
 net/bpfilter/Makefile   |   7 +-
 net/bpfilter/nftables.c | 679 
 2 files changed, 685 insertions(+), 1 deletion(-)
 create mode 100644 net/bpfilter/nftables.c

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index 5a85ef7d7a4d..a4064986dc2f 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -3,7 +3,12 @@
 # Makefile for the Linux BPFILTER layer.
 #
 
-hostprogs-y := bpfilter.ko
+hostprogs-y := nftjit.ko bpfilter.ko
 always := $(hostprogs-y)
 bpfilter.ko-objs := bpfilter.o tgts.o targets.o tables.o init.o ctor.o 
sockopt.o gen.o
+
+NFT_LIBS = -lnftnl
+nftjit.ko-objs := tgts.o targets.o tables.o init.o ctor.o gen.o nftables.o 
imr.o
+HOSTLOADLIBES_nftjit.ko = `pkg-config --libs libnftnl libmnl`
+
 HOSTCFLAGS += -I. -Itools/include/
diff --git a/net/bpfilter/nftables.c b/net/bpfilter/nftables.c
new file mode 100644
index ..5a756ccd03a1
--- /dev/null
+++ b/net/bpfilter/nftables.c
@@ -0,0 +1,679 @@
+/*
+ * based on previous code from:
+ *
+ * Copyright (c) 2013 Arturo Borrero Gonzalez <art...@netfilter.org>
+ * Copyright (c) 2013 Pablo Neira Ayuso <pa...@netfilter.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "bpfilter_mod.h"
+#include "imr.h"
+
+/* Hack, we don't link bpfilter.o */
+extern long int syscall (long int __sysno, ...);
+
+int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
+{
+   return syscall(321, cmd, attr, size);
+}
+
+static int seq;
+
+static void memory_allocation_error(void) { perror("allocation failed"); 
exit(1); }
+
+static int nft_reg_to_imr_reg(int nfreg)
+{
+   switch (nfreg) {
+   case NFT_REG_VERDICT:
+   return IMR_REG_0;
+   /* old register numbers, 4 128 bit registers. */
+   case NFT_REG_1:
+   return IMR_REG_4;
+   case NFT_REG_2:
+   return IMR_REG_6;
+   case NFT_REG_3:
+   return IMR_REG_8;
+   case NFT_REG_4:
+   break;
+   /* new register numbers, 16 32 bit registers, map to old ones */
+   case NFT_REG32_00:
+   return IMR_REG_4;
+   case NFT_REG32_01:
+   return IMR_REG_5;
+   case NFT_REG32_02:
+   return IMR_REG_6;
+   default:
+   return -1;
+   }
+   return -1;
+}
+
+static int netlink_parse_immediate(const struct nftnl_expr *nle, void *out)
+{
+   struct imr_state *state = out;
+   struct imr_object *o = NULL;
+
+   if (nftnl_expr_is_set(nle, NFTNL_EXPR_IMM_DATA)) {
+   uint32_t len;
+   int reg;
+
+   nftnl_expr_get(nle, NFTNL_EXPR_IMM_DATA, &len);
+
+   switch (len) {
+   case sizeof(uint32_t):
+   o = imr_object_alloc_imm32(nftnl_expr_get_u32(nle, 
NFTNL_EXPR_IMM_DATA));
+   break;
+   case sizeof(uint64_t):
+   o = imr_object_alloc_imm64(nftnl_expr_get_u64(nle, 
NFTNL_EXPR_IMM_DATA));
+   break;
+   default:
+   return -ENOTSUPP;
+   }
+   reg = nft_reg_to_imr_reg(nftnl_expr_get_u32(nle,
+NFTNL_EXPR_IMM_DREG));
+   if (reg < 0) {
+   imr_object_free(o);
+   return reg;
+   }
+
+   imr_register_store(state, reg, o);
+   return 0;
+   } else if (nftnl_expr_is_set(nle, NFTNL_EXPR_IMM_VERDICT)) {
+   uint32_t verdict;
+   int ret;
+
+   if (nftnl_expr_is_set(nle, NFTNL_EXPR_IMM_CHAIN))
+   return -ENOTSUPP;
+
+verdict = nftnl_expr_get_u32(nle, NFTNL_EXPR_IMM_VERDICT);
+
+   switch (verdict) {
+   case NF_ACCEPT:
+   o = imr_object_alloc_verdict(IMR_VERDICT_PASS);
+   break;
+   case NF_DROP:
+   o = imr_object_alloc_verdict(IMR_VERDICT_DROP);
+   break;
+   default:
+   fpr

[RFC,POC 3/3] bpfilter: switch bpfilter to iptables->IMR translation

2018-03-04 Thread Florian Westphal
Translate basic iptables rule blob to the IMR, then ask
IMR to translate to ebpf.

IMR is shared between nft and bpfilter translators.
iptables_gen_append() is the only relevant function here,
as it demonstrates a simple 'source/destination matches x' test.

Signed-off-by: Florian Westphal <f...@strlen.de>
---
 net/bpfilter/Makefile   |  2 +-
 net/bpfilter/bpfilter_gen.h | 15 +
 net/bpfilter/bpfilter_mod.h | 16 +-
 net/bpfilter/iptables.c | 76 +
 net/bpfilter/iptables.h |  4 +++
 net/bpfilter/sockopt.c  | 73 +--
 6 files changed, 154 insertions(+), 32 deletions(-)
 create mode 100644 net/bpfilter/bpfilter_gen.h
 create mode 100644 net/bpfilter/iptables.c
 create mode 100644 net/bpfilter/iptables.h

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index a4064986dc2f..21a8afb60b7c 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -5,7 +5,7 @@
 
 hostprogs-y := nftjit.ko bpfilter.ko
 always := $(hostprogs-y)
-bpfilter.ko-objs := bpfilter.o tgts.o targets.o tables.o init.o ctor.o 
sockopt.o gen.o
+bpfilter.ko-objs := bpfilter.o tgts.o targets.o tables.o init.o ctor.o 
sockopt.o gen.o iptables.o imr.o
 
 NFT_LIBS = -lnftnl
 nftjit.ko-objs := tgts.o targets.o tables.o init.o ctor.o gen.o nftables.o 
imr.o
diff --git a/net/bpfilter/bpfilter_gen.h b/net/bpfilter/bpfilter_gen.h
new file mode 100644
index ..71c6e8a73e24
--- /dev/null
+++ b/net/bpfilter/bpfilter_gen.h
@@ -0,0 +1,15 @@
+struct bpfilter_gen_ctx {
+   struct bpf_insn *img;
+   u32 len_cur;
+   u32 len_max;
+   u32 default_verdict;
+   int fd;
+   int ifindex;
+   booloffloaded;
+};
+
+int bpfilter_gen_init(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_prologue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_epilogue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_commit(struct bpfilter_gen_ctx *ctx);
+void bpfilter_gen_destroy(struct bpfilter_gen_ctx *ctx);
diff --git a/net/bpfilter/bpfilter_mod.h b/net/bpfilter/bpfilter_mod.h
index b4209985efff..dc3a90df1788 100644
--- a/net/bpfilter/bpfilter_mod.h
+++ b/net/bpfilter/bpfilter_mod.h
@@ -4,6 +4,7 @@
 
 #include "include/uapi/linux/bpfilter.h"
 #include 
+#include "bpfilter_gen.h"
 
 struct bpfilter_table {
struct hlist_node   hash;
@@ -71,26 +72,11 @@ struct bpfilter_target {
u8  rev;
 };
 
-struct bpfilter_gen_ctx {
-   struct bpf_insn *img;
-   u32 len_cur;
-   u32 len_max;
-   u32 default_verdict;
-   int fd;
-   int ifindex;
-   booloffloaded;
-};
-
 union bpf_attr;
 int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
 
-int bpfilter_gen_init(struct bpfilter_gen_ctx *ctx);
-int bpfilter_gen_prologue(struct bpfilter_gen_ctx *ctx);
-int bpfilter_gen_epilogue(struct bpfilter_gen_ctx *ctx);
 int bpfilter_gen_append(struct bpfilter_gen_ctx *ctx,
struct bpfilter_ipt_ip *ent, int verdict);
-int bpfilter_gen_commit(struct bpfilter_gen_ctx *ctx);
-void bpfilter_gen_destroy(struct bpfilter_gen_ctx *ctx);
 
 struct bpfilter_target *bpfilter_target_get_by_name(const char *name);
 void bpfilter_target_put(struct bpfilter_target *tgt);
diff --git a/net/bpfilter/iptables.c b/net/bpfilter/iptables.c
new file mode 100644
index ..055cfa8fbf21
--- /dev/null
+++ b/net/bpfilter/iptables.c
@@ -0,0 +1,76 @@
+#include 
+#include 
+
+typedef uint16_t __sum16; /* hack */
+#include 
+
+#include "bpfilter_mod.h"
+#include "iptables.h"
+#include "imr.h"
+
+static int check_entry(const struct bpfilter_ipt_ip *ent)
+{
+#define M_FF   "\xff\xff\xff\xff"
+   static const __u8 mask1[IFNAMSIZ] = M_FF M_FF M_FF M_FF;
+   static const __u8 mask0[IFNAMSIZ] = { };
+   int ones = strlen(ent->in_iface); ones += ones > 0;
+#undef M_FF
+   if (strlen(ent->out_iface) > 0)
+   return -ENOTSUPP;
+   if (memcmp(ent->in_iface_mask, mask1, ones) ||
+   memcmp(&ent->in_iface_mask[ones], mask0, sizeof(mask0) - ones))
+   return -ENOTSUPP;
+   if ((ent->src_mask != 0 && ent->src_mask != 0x) ||
+   (ent->dst_mask != 0 && ent->dst_mask != 0x))
+   return -ENOTSUPP;
+
+   return 0;
+}
+
+int iptables_gen_append(struct imr_state *state,
+   struct bpfilter_ipt_ip *ent, int verdict)
+{
+   struct imr_object *left, *right, *relop;
+   int ret;
+
+   ret = check_entry(ent);
+   if (ret < 0)
+   return ret;
+   if (ent->src_mask == 0 && ent->ds

[RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer

2018-03-04 Thread Florian Westphal
These patches, which go on top of the 'bpfilter' RFC patches,
demonstrate an nftables to ebpf translation (done in userspace).
In order to not duplicate the ebpf code generation efforts, the rules

iptables -i lo -d 127.0.0.2 -j DROP
and
nft add rule ip filter input ip daddr 127.0.0.2 drop

are first translated to a common intermediate representation, and then
to ebpf, which attaches resulting prog to the XDP hook.

IMR representation is identical in both cases so therefore both
rules result in the same ebpf program.

The IMR currently assumes that translation will always be to ebpf.
As per previous discussion it doesn't consider other targets, so
for instance IMR pseudo-registers map 1:1 to ebpf ones.

The IMR is also supposed to be generic enough to make it easy to convert
'frontend' formats (iptables rule blob, nftables netlink) to it, and
also extend it to cover ip rule, ovs or any other inputs in the future
without the need for major changes to the IMR.

The IMR currently implements the following basic operations:
 - relational (equal, not equal)
 - immediates (32 and 64bit constants)
 - payload with relative addressing (mac, network, transport header)
 - verdict (pass, drop, next rule)

It's still in an early stage, but I think it's good enough as
a proof-of-concept.

Known differences between nftjit.ko and bpfilter.ko:
nftjit.ko currently doesn't run transparently, but that's only
because I wanted to focus on the IMR and get the POC out of the door.

It should be possible to get it transparent via the bpfilter.ko approach.

Next steps for the IMR could be the addition of binary operations for
prefixes ("-d 192.168.0.1/24"); it's also needed e.g. for tcp flag
matching (-p tcp --syn in iptables) and so on.
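
For illustration, both of these lower to a mask-and-compare; the plain C
equivalent of what the generated code would do:

/* -d 192.168.0.1/24 */
static bool match_dst_prefix24(__be32 daddr)
{
	return (daddr & htonl(0xffffff00)) == htonl(0xc0a80000); /* 192.168.0.0/24 */
}

/* -p tcp --syn, i.e. --tcp-flags SYN,RST,ACK,FIN SYN */
static bool match_tcp_syn(u8 flags)
{
	return (flags & (TCPHDR_FIN | TCPHDR_SYN | TCPHDR_RST | TCPHDR_ACK)) ==
	       TCPHDR_SYN;
}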

I'd also be interested in whether XDP is seen as an appropriate
target hook.  AFAICS the XDP and the nftables ingress hook are similar
enough to consider just (re)using the XDP hook to jit the nftables ingress
hook.  The translator could check if the hook is unused, and return
early if some other program is already attached.

Comments welcome, especially wrt. the IMR concept and what might be
next step(s) in moving forward.

The patches are also available via git at
https://git.breakpoint.cc/cgit/fw/net-next.git/log/?h=bpfilter7 .



[ANNOUNCE] nftables 0.8.3 release

2018-03-03 Thread Florian Westphal
Hi!

The Netfilter project proudly presents:

nftables 0.8.3

This release includes a few fixes since the last release plus the following
enhancements:
 - ifname_type, so it's possible to match interface names via sets:

  table inet t {
set s {
  type ifname
  elements = { "eth0",
   "eth1" }
 }
 chain c {
   iifname @s accept
   oifname @s accept
 }
  }

- raw payload support to match headers that do not yet have
a more human-readable mnemonic.  This also allows matching
udp and tcp port numbers in a single rule, because the raw
payload expression doesn't enforce a protocol dependency on
the network header.  Example:

 input meta l4proto {tcp, udp} @th,16,16 { dns, http }

 matches both udp and tcp dport 53 and 80 in a single rule.

See ChangeLog that comes attached to this email for more details.

You can download it from:

http://www.netfilter.org/projects/nftables/downloads.html#nftables-0.8.3
ftp://ftp.netfilter.org/pub/nftables/

To build the code, libnftnl 1.0.9 and libmnl >= 1.0.2 are required:

* http://netfilter.org/projects/libnftnl/index.html
* http://netfilter.org/projects/libmnl/index.html

Visit our wikipage for user documentation at:

* http://wiki.nftables.org

For the manpage reference, check man(8) nft.

In case of bugs and feature request, file them via:

* https://bugzilla.netfilter.org

Happy firewalling!
Arturo Borrero Gonzalez (4):
  nftables: rearrange files and examples
  examples: add ct helper examples
  files: add load balance example
  meta: introduce datatype ifname_type

Baruch Siach (1):
  src: fix build with older glibc

David Fabian (1):
  Added undefine/redefine keywords

Duncan Roe (1):
  doc/nft.xml: fix typo

Florian Westphal (16):
  tests: enable sets test case 27
  tests: add test case for sets updated from packet path
  payload: don't decode past last valid template
  include: fix build failure
  tests: meta.t: fix test case for anonymous set automerge
  payload: use integer_type when initializing a raw expression
  payload: don't resolve expressions using the inet pseudoheader
  src: make raw payloads work
  doc: document raw protocol expression
  tests: add raw payload test cases.
  doc: mention meta l4proto and ipv6 nexthdr issue wrt. extension headers
  doc: remove ipv6 address FIXME
  doc: add example for rule add/delete
  parser: use nf_key_proto
  src: datatype: prefer sscanf, avoid strncpy
  build: Bump version to v0.8.3

Harsha Sharma (2):
  libnftables: don't crash when no commands are specified
  src: Use snprintf() over strncpy()

Laura Garcia Liebana (1):
  parser: support of maps with timeout

Pablo Neira Ayuso (11):
  src: pass family to payload_dependency_kill()
  payload: add payload_dependency_release() helper function
  src: add payload_dependency_exists()
  src: get rid of __payload_dependency_kill()
  payload: add payload_may_dependency_kill()
  netlink_delinearize: add meta_may_dependency_kill()
  src: bail out when exporting ruleset with unsupported output
  segtree: check for overlapping elements at insertion
  tests: shell: regression test for bugzilla 1228
  configure: misc updates
  netlink: remove non-batching routines

Phil Sutter (10):
  evaluate: Enable automerge feature for anonymous sets
  Review switch statements for unmarked fall through cases
  monitor: Make trace events respect output_fp
  monitor: Make JSON/XML output respect output_fp
  cli: Drop pointless check in cli_append_multiline()
  erec: Avoid passing negative offset to fseek()
  evaluate: Fix memleak in stmt_reject_gen_dependency()
  hash: Fix potential null-pointer dereference in hash_expr_cmp()
  netlink: Complain if setting O_NONBLOCK fails
  netlink_delinearize: Fix resource leaks

Ville Skyttä (2):
  configure: Make missing docbook2man an error if man build requested
  src: Spelling fixes



Re: [PATCH] netfilter: use skb_to_full_sk in ip6_route_me_harder

2018-02-25 Thread Florian Westphal
Eric Dumazet  wrote:
> From: Eric Dumazet 
> 
> For some reason, Florian forgot to apply to ip6_route_me_harder
> the fix that went in commit 29e09229d9f2 ("netfilter: use
> skb_to_full_sk in ip_route_me_harder")

Indeed, sorry about that, thanks for taking care of this.


Re: syzcaller patch postings...

2018-02-22 Thread Florian Westphal
Dmitry Vyukov  wrote:
> On Thu, Feb 22, 2018 at 9:26 AM, Paolo Abeni  wrote:
> > On Wed, 2018-02-21 at 16:47 -0500, David Miller wrote:
> >> I have to mention this now before it gets out of control.
> >>
> >> I would like to ask that syzkaller stop posting the patch it is
> >> testing when it posts to netdev.
> >
> > There is an open issue on this topic:
> >
> > https://github.com/google/syzkaller/issues/526
> >
> > The current behaviour is that syzbot replies to all get_maintainer.pl
> > recipients after testing a patch, regardless of the test submission
> > recipient list, the idea was instead to respect such list.
>
> 
> Hi David, Florian, Paolo,
> 
> Didn't realize it triggers patchwork. This wasn't intentional, sorry.
> 
> Do I understand it correctly that if syzbot replies to the CC list
> that was in the testing request, it will resolve the problem? So if
> netdev wasn't in CC, it will not be added to CC.

Yes, that's at least my expected/desired behaviour.
This way I can even CC some other person (maintainer, reporter, etc.)
to have them informed about the test result too.

> I will go and fix it now.

Thank you Dmitry!


Re: syzcaller patch postings...

2018-02-21 Thread Florian Westphal
David Miller  wrote:
> I have to mention this now before it gets out of control.
> 
> I would like to ask that syzkaller stop posting the patch it is
> testing when it posts to netdev.

Same for netfilter-devel.

I could not get a reproducer to trigger and asked syzbot to test
the patch (great feature, thanks!) -- I did not CC any mailing list.

So I was very surprised syzbot announced the test result by adding
back all CCs from the original report.  I think it's better to
announce the result to the patch author only.



Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-21 Thread Florian Westphal
Pablo Neira Ayuso  wrote:
> On Tue, Feb 20, 2018 at 05:52:54PM -0800, Alexei Starovoitov wrote:
> > On Tue, Feb 20, 2018 at 11:44:31AM +0100, Pablo Neira Ayuso wrote:
> > > 
> > >   Don't get me wrong, no software is safe from security issues, but if you
> > >   don't abstract your resources in the right way, you have more chance to
> > >   have experimence more problems.
> > 
> > interesting point.
> > The key part of iptables and nft design is heavy use of indirect calls.
> > The execution of single iptable rule is ~3 indirect calls.
> > Quite a lot worse in nft where every expression is an indirect call.
> 
> That's right. Netfilter is probably too modular, probably we can
> revisit this to find a better balance, actually Felix Fietkau was
> recently rising concerns on this, specifically in environments with
> limited space to store the kernel image. We'll have a look, thanks for
> remind us about this.

Agree, we have too many config knobs, probably a good idea to turn some
modules into plain .o (like cmp and bitwise).

Obvious candidates are: meta, numgen, limit, objref, quota, reject.

We should probably also consider removing
CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always
build both too (at least rbtree since that offers interval).

For the indirect call issue we can use direct calls from the eval loop for
some of the more frequently used ones, similar to what we do already
for nft_cmp_fast_expr (see the sketch below).  But maybe we don't even
have to if we can get help to build a jitter that takes an nftables
netlink table dump and builds jit code from that.
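To illustrate the direct-call idea, a rough userspace sketch (all names
below are made up; in the kernel, nft_do_chain special-cases
nft_cmp_fast_expr in exactly this way):

#include <stdio.h>

/* Devirtualize the most common expression type by checking its ops
 * pointer in the eval loop, keeping the indirect call only as fallback. */

struct expr;

struct expr_ops {
    void (*eval)(const struct expr *e);
};

struct expr {
    const struct expr_ops *ops;
};

static void cmp_fast_eval(const struct expr *e)
{
    (void)e;
    puts("fast path: direct call");
}

static void generic_eval(const struct expr *e)
{
    (void)e;
    puts("slow path: indirect call");
}

static const struct expr_ops cmp_fast_ops = { .eval = cmp_fast_eval };
static const struct expr_ops generic_ops  = { .eval = generic_eval };

static void eval_one(const struct expr *e)
{
    if (e->ops == &cmp_fast_ops)
        cmp_fast_eval(e);   /* known type: direct (inlineable) call */
    else
        e->ops->eval(e);    /* everything else: indirect call */
}

int main(void)
{
    const struct expr exprs[] = {
        { .ops = &cmp_fast_ops },
        { .ops = &generic_ops },
    };

    for (unsigned int i = 0; i < sizeof(exprs) / sizeof(exprs[0]); i++)
        eval_one(&exprs[i]);
    return 0;
}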


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Florian Westphal
David Miller  wrote:
> From: Phil Sutter 
> Date: Mon, 19 Feb 2018 18:14:11 +0100
> 
> > OK, so reading between the lines you're saying that nftables project
> > has failed to provide an adequate successor to iptables?
> 
> Whilst it is great that the atomic table update problem was solved, I
> think the emphasis on flexibility often at the expense of performance
> was a bad move.

That's not true, IMO.

One idea previously discussed was to add a 'freeze' option
to our nftables syntax.  Essentially what would happen is that further
updates to the table become impossible, with the exception of named sets
(which can be changed independently, similar to ebpf maps I suppose).

As further updates to the table are then no longer allowed this would
then make it possible to e.g. jit all rules into a single program.

The table could still be removed (and recreated) of course, so it's
not impossible to make changes, but no longer at the rule level.

> Netfilter's chronic performance differential is why a lot of mindshare
> was lost to userspace networking technologies.

I think this is an unfair statement and also not true.
If you refer to the linear ruleset evaluation of iptables, this is
what ipset was added for.

Yes, it's a band aid.  But again, that problem comes from the UAPI
format/limitation of only having one source or destination address per
rule, a limitation not present in nftables.

Another reason why iptables is a bit more costly than needed (although it
IS rather fast, given there are no spinlocks in the main eval loop) is the
rule counter updates which were built into the design all those years ago.

Again, a problem solved in nftables by making the counters optional.

If you want to speed up the forward path with XDP -- fine.
But AFAIU it's still possible with XDP to have packets being sent to
the full stack, right?

If so, it would be possible to even combine nftables with XDP, f.e.
by allowing an ebpf program running on the host CPU to query netfilter
conntrack.

No Entry -> push to normal path
Entry -> check 'fastpath' flag (which would be in nf_conn struct).
Not set -> also normal path.
Otherwise continue XDP, stack bypass.
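
Purely as illustration, the decision above in plain C (this assumes a
conntrack lookup helper for XDP that does not exist yet; the struct and
flag names are stand-ins for struct nf_conn and the label set by the
rule below):

#include <stdbool.h>
#include <stdio.h>

struct fake_conn {          /* stand-in for struct nf_conn */
    bool fastpath;          /* stand-in for the 'fastpath' label/flag */
};

enum decision {
    TO_STACK,       /* hand the packet to the normal path */
    BYPASS_STACK,   /* keep processing in XDP, skip the stack */
};

static enum decision classify(const struct fake_conn *ct)
{
    if (!ct)                /* no entry -> normal path */
        return TO_STACK;
    if (!ct->fastpath)      /* entry, flag not set -> normal path */
        return TO_STACK;
    return BYPASS_STACK;    /* entry with fastpath flag -> stack bypass */
}

int main(void)
{
    struct fake_conn established = { .fastpath = true };
    struct fake_conn fresh = { .fastpath = false };

    printf("%d %d %d\n",
           classify(NULL), classify(&fresh), classify(&established));
    return 0;
}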

nftables would have a rule similar to this:
nft add rule inet forward ct state established ct label set fastpath
to switch such conntrack to xdp mode.

This decision can then be combined with nftables infra,
for example 'fastpath for tcp flows that saw more than 1mbit of data
in either direction' or the like.

Yes, this needs ebpf support for conntrack and NAT transformations,
and it does beg the question of how to handle the details, e.g. conntrack
timeouts.  I don't see any unsolvable issues with this though.

This also has similarities with the 'flow offload' proposal, i.e. we
could perhaps even reuse what we already have to provide flow
offload in software using ebpf/XDP as the offload backend.


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Florian Westphal
David Miller <da...@davemloft.net> wrote:
> From: Florian Westphal <f...@strlen.de>
> Date: Mon, 19 Feb 2018 15:53:14 +0100
> 
> > Sure, but looking at all the things that were added to iptables
> > to alleviate some of the issues (ipset for instance) show that we need a
> > meaningful re-design of how things work conceptually.
> 
> As you said iptables is in maintainenance mode.
> 
> But there are millions upon millions of users, like it or not, and
> they aren't going away for decades.  And this is the iptables binary
> ABI I'm talking about, not the iptables user command line interface.

I know.

> my house?"  Please see further than the view inside your home. 
> 
> By in large, we are stuck with iptables's data path for an extremely
> long time.

So?

> Major data centers doesn't even enable NFTABLES in their kernels, and
> there is nothing you can do about that in the short to medium term.

So?

> Therefore, for all of the beneficial reasons I have discussed we
> should make that datapath as aligned and integrated with our core
> important technologies as possible, so that they can benefit from any
> and all improvements in that area rather than just collecting dust.

See my other mail, where I explained, in great detail, the problems
of the xtables UAPI.

If you go through with this and eventually somehow get feature parity,
all of the problems remain in full effect.
You will also need to replicate the translation efforts that already
went into nftables.  The translator wasn't yet a high priority as we
lacked some features but this can be changed now that nft is catching
up.

The userspace program expectation is for iptables to be like the fib, for
instance, i.e. you can add and remove without stomping on each other's
feet.  You are setting this in stone.

You're also adding a way to make it so that I can delete entries from
the fib (bpfilter) but iproute2 will still show all entries (iptables
legacy).


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Florian Westphal
David Miller <da...@davemloft.net> wrote:
> From: Florian Westphal <f...@strlen.de>
> Date: Mon, 19 Feb 2018 15:59:35 +0100
> 
> > David Miller <da...@davemloft.net> wrote:
> >> It also means that the scope of developers who can contribute and work
> >> on the translater is much larger.
> > 
> > How so?  Translator is in userspace in nftables case too?
> 
> Florian, first of all, the whole "change the iptables binary" idea is
> a non-starter.  For the many reasons I have described in the various
> postings I have made today.
> 
> It is entirely impractical.

???
You suggest:

iptables -> setsockopt -> umh (xtables -> ebpf) -> kernel

How is this different from

iptables -> setsockopt -> umh (xtables -> nftables) -> kernel

?
EBPF can be placed within nftables in either userspace or the kernel;
there is nothing that prevents this.

> Anything designed in that nature must be distributed completely in the
> kernel tree, so that the iptables kernel ABI is provided without any
> externel dependencies.

Would you be willing to merge nftables into the kernel tools directory then?


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Florian Westphal
David Miller  wrote:
> From: Daniel Borkmann 
> Date: Mon, 19 Feb 2018 13:03:17 +0100
> 
> > Thought was that it would be more suitable to push all the complexity of
> > such translation into user space which brings couple of additional 
> > advantages
> > as well: the translation can become very complex and thus it would contain
> > all of it behind syscall boundary where natural path of loading programs
> > would go via verifier. Given the tool would reside in user space, it would
> > also allow to ease development and testing can happen w/o recompiling the
> > kernel. It would allow for all the clang sanitizers to run there and for
> > having a comprehensive test suite to verify and dry test translations 
> > against
> > traffic test patterns (e.g. bpf infra would provide possibilities on this
> > w/o complex setup). Given normal user mode helpers make this rather painful
> > since they need to be shipped as extra package by the various distros, the
> > idea was that the module loader back end could treat umh similarly as kernel
> > modules and hook them in through request_module() approach while still
> > operating out of user space. In any case, I could image this approach might
> > be interesting and useful in general also for other subsystems requiring
> > umh in one way or another.
> 
> Yes, this is a very powerful new facility.
> 
> It also means that the scope of developers who can contribute and work
> on the translater is much larger.

How so?  The translator is in userspace in the nftables case too?


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-19 Thread Florian Westphal
David Miller  wrote:
> > How many of those wide-spread applications are you aware of?  The two
> > projects you have pointed out (docker and kubernetes) don't. As the
> > assumption that many such tools would need to be supported drives a lot
> > of the design decisions, I would argue one needs a solid empircal basis.
> 
> I see talk about "just replacing the iptables binary".
> 
> A long time ago in a galaxy far far away, that would be a reasonable
> scheme.  But that kind of approach won't work in the realities of
> today.
> 
> You aren't going to be able to replace the iptables binary in the tens
> of thousands of container images out there, nor the virtualization
> installations either.

Why would you have to?
The iptables kernel parts are still maintained; it's not dead code that
stands in the way.

We can leave it alone, in maintenance mode, just fine.

> Like it or not iptables ABI based filtering is going to be in the data
> path for many years if not a decade or more to come.  iptables is a
> victim of it's own success, like it or not :-) Yes, the ABI is
> terrible, but obviously it was useful enough for lots of people.

Sure, but looking at all the things that were added to iptables
to alleviate some of the issues (ipset for instance) shows that we need a
meaningful re-design of how things work conceptually.

The umh helper translation that has been proposed could be applied to
transparently xlate iptables to nftables (or e.g. iptables compat32 to
iptables64), i.e. the legacy binary talks to the kernel, the kernel invokes
the umh, and the umh generates nftables netlink messages.

But I don't even see a need to do this; I don't think it's an issue
to leave it in the tree even for another decade or more if need be.


[PATCH nf] netfilter: bridge: ebt_among: add missing match size checks

2018-02-18 Thread Florian Westphal
ebt_among is special, it has a dynamic match size and is exempt
from the central size checks.

Therefore it must check that the size of the match structure
provided from userspace is sane by making sure em->match_size
is at least the minimum size of the expected structure.

The module has such a check, but it's only done after accessing
a structure that might be out of bounds.

tested with: ebtables -A INPUT ... \
--among-dst fe:fe:fe:fe:fe:fe
--among-dst fe:fe:fe:fe:fe:fe --among-src 
fe:fe:fe:fe:ff:f,fe:fe:fe:fe:fe:fb,fe:fe:fe:fe:fc:fd,fe:fe:fe:fe:fe:fd,fe:fe:fe:fe:fe:fe
--among-src 
fe:fe:fe:fe:ff:f,fe:fe:fe:fe:fe:fa,fe:fe:fe:fe:fe:fd,fe:fe:fe:fe:fe:fe,fe:fe:fe:fe:fe:fe

Reported-by: <syzbot+fe0b19af568972814...@syzkaller.appspotmail.com>
Signed-off-by: Florian Westphal <f...@strlen.de>
---
 net/bridge/netfilter/ebt_among.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/net/bridge/netfilter/ebt_among.c b/net/bridge/netfilter/ebt_among.c
index ce7152a12bd8..c5afb4232ecb 100644
--- a/net/bridge/netfilter/ebt_among.c
+++ b/net/bridge/netfilter/ebt_among.c
@@ -172,18 +172,35 @@ ebt_among_mt(const struct sk_buff *skb, struct xt_action_param *par)
return true;
 }
 
+static bool poolsize_invalid(const struct ebt_mac_wormhash *w)
+{
+   return w && w->poolsize >= (INT_MAX / sizeof(struct 
ebt_mac_wormhash_tuple));
+}
+
 static int ebt_among_mt_check(const struct xt_mtchk_param *par)
 {
const struct ebt_among_info *info = par->matchinfo;
const struct ebt_entry_match *em =
container_of(par->matchinfo, const struct ebt_entry_match, 
data);
-   int expected_length = sizeof(struct ebt_among_info);
+   unsigned int expected_length = sizeof(struct ebt_among_info);
const struct ebt_mac_wormhash *wh_dst, *wh_src;
int err;
 
+   if (expected_length > em->match_size)
+   return -EINVAL;
+
wh_dst = ebt_among_wh_dst(info);
-   wh_src = ebt_among_wh_src(info);
+   if (poolsize_invalid(wh_dst))
+   return -EINVAL;
+
expected_length += ebt_mac_wormhash_size(wh_dst);
+   if (expected_length > em->match_size)
+   return -EINVAL;
+
+   wh_src = ebt_among_wh_src(info);
+   if (poolsize_invalid(wh_src))
+   return -EINVAL;
+
expected_length += ebt_mac_wormhash_size(wh_src);
 
if (em->match_size != EBT_ALIGN(expected_length)) {
-- 
2.16.1



[PATCH nf] netfilter: ebtables: CONFIG_COMPAT: don't trust userland offsets

2018-02-18 Thread Florian Westphal
We need to make sure the offsets are not out of range of the
total size.
Also check that they are in ascending order.

The WARN_ON triggered by syzkaller (it sets panic_on_warn) is
changed to also bail out, no point in continuing parsing.

Briefly tested with simple ruleset of
-A INPUT --limit 1/s' --log
plus jump to custom chains using 32bit ebtables binary.

Reported-by: <syzbot+845a53d13171abf8b...@syzkaller.appspotmail.com>
Signed-off-by: Florian Westphal <f...@strlen.de>
---
 net/bridge/netfilter/ebtables.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 02c4b409d317..3f536c7a3354 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -2053,7 +2053,9 @@ static int ebt_size_mwt(struct compat_ebt_entry_mwt *match32,
if (match_kern)
match_kern->match_size = ret;
 
-   WARN_ON(type == EBT_COMPAT_TARGET && size_left);
+   if (WARN_ON(type == EBT_COMPAT_TARGET && size_left))
+   return -EINVAL;
+
match32 = (struct compat_ebt_entry_mwt *) buf;
}
 
@@ -2109,6 +2111,15 @@ static int size_entry_mwt(struct ebt_entry *entry, const unsigned char *base,
 *
 * offsets are relative to beginning of struct ebt_entry (i.e., 0).
 */
+   for (i = 0; i < 4 ; ++i) {
+   if (offsets[i] >= *total)
+   return -EINVAL;
+   if (i == 0)
+   continue;
+   if (offsets[i-1] > offsets[i])
+   return -EINVAL;
+   }
+
for (i = 0, j = 1 ; j < 4 ; j++, i++) {
struct compat_ebt_entry_mwt *match32;
unsigned int size;
-- 
2.16.1



Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-18 Thread Florian Westphal
Daniel Borkmann  wrote:
> As rule translation can potentially become very complex, this is performed
> entirely in user space. In order to ease deployment, request_module() code
> is extended to allow user mode helpers to be invoked. Idea is that user mode
> helpers are built as part of the kernel build and installed as traditional
> kernel modules with .ko file extension into distro specified location,
> such that from a distribution point of view, they are no different than
> regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) binaries via:
> 
>   request_module("foo") ->
> call_umh("modprobe foo") ->
>   sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)
> 
> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root)

Unrelated:  AFAIU this would allow us to e.g. move the compat32 handlers
(which are very ugly/error-prone) off to userspace?

compat_syscall -> umh_32_64_xlate -> syscall() ?

[ feel free to move this to different thread, only mentioning this
  so I won't forget ]


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal
Harald Welte  wrote:
> I believe _if_ one wants to use the approach of "hiding" eBPF behind
> iptables, then either
[..]
> b) you must introduce new 'tables', like an 'xdp' table which then has
>the notion of processing very early in processing, way before the
>normal filter table INPUT processing happens.

In nftables, the netdev ingress hook location could be used for this,
but right, iptables has no equivalent.

netdev ingress is interesting from a hw-offload point of view:
unlike all other netfilter hooks it's tied to a specific network interface
rather than owned by the network namespace.

A rule like (yes, I am making this up)
limit 1 byte/s

cannot be offloaded because it affects all packets going through
the system, i.e. you'd need to share state among all nics, which I think
won't work :-)

Same goes for any other match/target that somehow contains (global)
state and was added to the 'classic' iptables hook points.
(exception: rule restricts interface via '-i foo').


Note well: "offloaded != ebpf" in this case.

I see no reasons why ebpf cannot be used in either iptables or
nftables.  How to get there is obviously a different beast.

For iptables, I think we should put it in maintenance mode and
focus on nftables, for many reasons outlined in other replies.

And how to best make use of ebpf+nftables:

In an ideal world, nftables would have used (e)bpf from the start.
But, well, it's not an ideal world (iirc nft's origins are just a bit
too old).

That doesn't mean that we can't leverage ebpf from nftables.
It's just a question of where it makes sense and where it doesn't,
f.e. I see no reason to replace C code with ebpf just 'because you can'.

Speedup?  Good argument.
Feature enhancements that could use ebpf programs? Another good
argument.

I guess there are a lot more.

So I'd like to second Harald's question.

What is the main goal?

For nftables, I believe most important ones are:
- make kernel keeper/owner of all rules
- allow userspace to learn of rule addition/deletion
- provide fast matching (no linear evaluation of rules,
native sets with jump and verdict maps)
- provide a single tool instead of ip/ip6/arp/ebtables
- unified ipv4/ipv6 matching
- backwards compat and/or translation infrastructure


But once these are reached, we will hopefully have more:
- offloading (hardware)
- speedup via JIT compilation
- feature enhancements such as matching arbitrary packet
contents

I suspect you see that ebpf might be a fit and/or help us with
all of these things.

So, once I understand what your goals are I might be better able
to see how nftables could fit into the picture; as you can see,
I did a lot of guesswork :-)


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal
Florian Westphal <f...@strlen.de> wrote:
> David Miller <da...@davemloft.net> wrote:
> > From: Florian Westphal <f...@strlen.de>
> > Date: Fri, 16 Feb 2018 17:14:08 +0100
> > 
> > > Any particular reason why translating iptables rather than nftables
> > > (it should be possible to monitor the nftables changes that are
> > >  announced by kernel and act on those)?
> > 
> > As Daniel said, iptables is by far the most deployed of the two
> > technologies.  Therefore it provides the largest environment for
> > testing and coverage.
> 
> Right, but the approach of hooking old blob format comes with
> lots of limitations that were meant to be resolved with a netlink based
> interface which places kernel in a position to mediate all transactions
> to the rule database (which isn't fixable with old setsockopt format).
> 
> As all programs call iptables(-restore) or variants translation can
> be done in userspace to nftables so api spoken is nfnetlink.
> Such a translator already exists and can handle some cases already:
> 
> nft flush ruleset
> nft list ruleset | wc -l
> 0
> xtables-compat-multi iptables -A INPUT -i eth0 -m conntrack --ctstate 
> ESTABLISHED,RELATED -j ACCEPT
> xtables-compat-multi iptables -A REJECT_LOG -i eth0 -p tcp --tcp-flags 
> SYN,ACK SYN --dport 22:80 -m limit --limit 1/sec -j LOG --log-prefix 
> "RejectTCPConnectReq"

to be fair, for these two I had to use
$(xtables-compat-multi iptables-translate -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT)

The reason is that the 'iptables-translate' part nowadays has way more
translations available (nft gained many features since the
iptables-compat layer was added).

If given appropriate priority, however, it should be pretty
trivial to make the 'translate' descriptions available in
the 'direct' version; we already have a function in libnftables
to execute/run a command directly from a buffer, so this would
not even need fork/execve overhead (although I don't think
it's a big concern).

> (f.e. nftables misses some selinux matches/targets for netlabel so we 
> obviously
> can't translate this, same for ipsec sa/policy matching -- but this isn't
> impossible to resolve).

I am working on some poc code for the sa/policy thing now.


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal
David Miller <da...@davemloft.net> wrote:
> From: Florian Westphal <f...@strlen.de>
> Date: Fri, 16 Feb 2018 17:14:08 +0100
> 
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> >  announced by kernel and act on those)?
> 
> As Daniel said, iptables is by far the most deployed of the two
> technologies.  Therefore it provides the largest environment for
> testing and coverage.

Right, but the approach of hooking the old blob format comes with
lots of limitations that were meant to be resolved with a netlink-based
interface which places the kernel in a position to mediate all transactions
to the rule database (which isn't fixable with the old setsockopt format).

As all programs call iptables(-restore) or variants, translation to
nftables can be done in userspace, so the API spoken is nfnetlink.
Such a translator already exists and can handle some cases already:

nft flush ruleset
nft list ruleset | wc -l
0
xtables-compat-multi iptables -A INPUT -s 192.168.0.24 -j ACCEPT
xtables-compat-multi iptables -A INPUT -s 192.168.0.0/16 -p tcp --dport 22 -j ACCEPT
xtables-compat-multi iptables -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
xtables-compat-multi iptables -A INPUT -p icmp -j ACCEPT
xtables-compat-multi iptables -N REJECT_LOG
xtables-compat-multi iptables -A REJECT_LOG -i eth0 -p tcp --tcp-flags SYN,ACK SYN --dport 22:80 -m limit --limit 1/sec -j LOG --log-prefix "RejectTCPConnectReq"
xtables-compat-multi iptables -A REJECT_LOG -j DROP
xtables-compat-multi iptables -A INPUT -j REJECT_LOG

nft list ruleset
table ip filter {
chain INPUT {
type filter hook input priority 0; policy accept;
ip saddr 192.168.0.24 counter packets 0 bytes 0 accept
ip saddr 192.168.0.0/16 tcp dport 22 counter accept
iifname "eth0" ct state related,established counter accept
ip protocol icmp counter packets 0 bytes 0 accept
counter packets 0 bytes 0 jump REJECT_LOG
}

chain FORWARD {
type filter hook forward priority 0; policy accept;
}

chain OUTPUT {
type filter hook output priority 0; policy accept;
}

chain REJECT_LOG {
iifname "eth0" tcp dport 22-80 tcp flags & (syn | ack) == syn limit rate 1/second burst 5 packets counter packets 0 bytes 0 log prefix "RejectTCPConnectReq"
counter packets 0 bytes 0 drop
}
}

and, while the 'iptables' rules were added, nft monitor in a different terminal:
nft monitor
add table ip filter
add chain ip filter INPUT { type filter hook input priority 0; policy accept; }
add chain ip filter FORWARD { type filter hook forward priority 0; policy accept; }
add chain ip filter OUTPUT { type filter hook output priority 0; policy accept; }
add rule ip filter INPUT ip saddr 192.168.0.24 counter packets 0 bytes 0 accept
# new generation 9893 by process 7471 (xtables-compat-)
add rule ip filter INPUT ip saddr 192.168.0.0/16 tcp dport 22 counter accept
# new generation 9894 by process 7504 (xtables-compat-)
add rule ip filter INPUT iifname "eth0" ct state related,established counter accept
# new generation 9895 by process 7528 (xtables-compat-)
add rule ip filter INPUT ip protocol icmp counter packets 0 bytes 0 accept
# new generation 9896 by process 7542 (xtables-compat-)
add chain ip filter REJECT_LOG
# new generation 9897 by process 7595 (xtables-compat-)
add rule ip filter REJECT_LOG iifname "eth0" tcp dport 22-80 tcp flags & (syn | ack) == syn limit rate 1/second burst 5 packets counter packets 0 bytes 0 log prefix "RejectTCPConnectReq"
# new generation 9898 by process 7639 (xtables-compat-)
add rule ip filter REJECT_LOG counter packets 0 bytes 0 drop
# new generation 9899 by process 7657 (xtables-compat-)
add rule ip filter INPUT counter packets 0 bytes 0 jump REJECT_LOG
# new generation 9900 by process 7663 (xtables-compat-)

Now, does this work in all cases?

Unfortunately not -- this is still work-in-progress, so I would
not rm /sbin/iptables and replace it with a link to xtables-compat-multi just 
yet.

(f.e. nftables misses some selinux matches/targets for netlabel so we obviously
can't translate this, same for ipsec sa/policy matching -- but this isn't
impossible to resolve).

Hopefully this does show that at least some commonly used features work
and that we've come a long way to make a seamless nftables transition happen.


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal
Daniel Borkmann <dan...@iogearbox.net> wrote:
> Hi Florian,
> 
> On 02/16/2018 05:14 PM, Florian Westphal wrote:
> > Florian Westphal <f...@strlen.de> wrote:
> >> Daniel Borkmann <dan...@iogearbox.net> wrote:
> >> Several questions spinning at the moment, I will probably come up with
> >> more:
> > 
> > ... and here there are some more ...
> > 
> > One of the many pain points of xtables design is the assumption of 'used
> > only by sysadmin'.
> > 
> > This has not been true for a very long time, so by now iptables has
> > this userspace lock (yes, its fugly workaround) to serialize concurrent
> > iptables invocations in userspace.
> > 
> > AFAIU the translate-in-userspace design now brings back the old problem
> > of different tools overwriting each others iptables rules.
> 
> Right, so the behavior would need to be adapted to be exactly the same,
> given all the requests go into kernel space first via the usual uapis,
> I don't think there would be anything in the way of keeping that as is.

Uff.  This isn't solvable.  At least that's what I tried to say here.
This is a limitation of the xtables setsockopt interface design.

If $docker (or anything else) adds a new rule using plain iptables, other
daemons are not aware of it.

If someone deletes a rule added by $software, it won't learn that either.

The "solutions" in place now (periodic reloads, 'is my rule still in
place' checks, etc.) are not desirable long-term.

You'll also need 4 decoders for arp/ip/ip6/ebtables plus translations
for all matches and targets xtables currently has (almost 100, I would
guess from a quick glance).

Some of the crazier ones also have external user-visible interfaces
outside setsockopt (proc files, ipset).

> > One of the nftables advantages is that (since rule representation in
> > kernel is black-box from userspace point of view) is that the kernel
> > can announce add/delete of rules or elements from nftables sets.
> > 
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> >  announced by kernel and act on those)?
> 
> Yeah, correct, this should be possible as well. We started out with the
> iptables part in the demo as the majority of bigger infrastructure projects
> all still rely heavily on it (e.g. docker, k8s to just name two big ones).

Yes, which is why we have translation tools in place.

Just for the fun of it I tried to delete the ip/ip6tables binaries on my
fedora27 laptop and replace them with symlinks to
'xtables-compat-multi'.

Aside from two issues (SELinux denying 'iptables' the use of netlink) and
one translation issue (-m rpfilter, which can be translated in the current
upstream version), this works out of the box: the translator uses the
nftables API to the kernel (so the kernel doesn't even know which program is
talking...), 'nft monitor' displays the rules being added, and
'nft list ruleset' shows the default firewalld ruleset.

Obviously there are a few limitations, for instance ip6tables-save will
stop working once you add nft-based rules that use features that cannot
be expressed in xtables syntax (it will throw an error message similar
to 'you are using nftables features not available in xtables, please use
nft'), for instance verdict maps, sets and the like.

> Usually they have their requests to iptables baked into their code directly
> which probably won't change any time soon, so thought was that they could
> benefit initially from it once there would be sufficient coverage.

See above, the translator covers most basic use cases nowadays.
The more extreme cases are not covered because we were reluctant to
provide an equivalent in nftables (-m time comes to mind, which was always a
PITA because the kernel has no notion of timezones or DST transitions,
leading to 'magic' mismatches when the timezone changes...).

I could expand on more problem cases, but none of them are too
important, I think.

If you'd like to have more ebpf users in the kernel, then there is at
least one use case where ebpf could be very attractive for nftables
(matching dynamic headers and the like).  This would be a new
feature and would need changes on the nftables userspace side
as well (we don't have syntax/grammar to represent this in either
nft or iptables).

In its most basic form, it would be the nftables replacement for '-m string'
(and perhaps also -m bpf to some degree, depending on how it would be
realized).
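
For reference, a tiny userspace sketch of the semantics such a payload
match would have ('-m string' itself is built on the kernel textsearch
API, not on memmem(); the names below are made up):

#define _GNU_SOURCE     /* for memmem() */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* report whether 'pattern' occurs anywhere in the packet payload */
static bool payload_contains(const unsigned char *payload, size_t plen,
                             const char *pattern, size_t patlen)
{
    if (patlen == 0 || patlen > plen)
        return false;
    return memmem(payload, plen, pattern, patlen) != NULL;
}

int main(void)
{
    const unsigned char pkt[] = "GET /index.html HTTP/1.1\r\nHost: example\r\n";

    printf("%d\n", payload_contains(pkt, sizeof(pkt) - 1, "Host:", 5));
    return 0;
}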

We can discuss this more if there is interest, but I think it
would be more suitable for a conference / face-to-face discussion.


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-16 Thread Florian Westphal
Florian Westphal <f...@strlen.de> wrote:
> Daniel Borkmann <dan...@iogearbox.net> wrote:
> Several questions spinning at the moment, I will probably come up with
> more:

... and here there are some more ...

One of the many pain points of the xtables design is the assumption of 'used
only by the sysadmin'.

This has not been true for a very long time, so by now iptables has
this userspace lock (yes, it's a fugly workaround) to serialize concurrent
iptables invocations in userspace.

AFAIU the translate-in-userspace design now brings back the old problem
of different tools overwriting each other's iptables rules.

Another question -- am I correct in that each rule manipulation would
incur a 'recompilation'?  Or are there different mini programs chained
together?

One of the nftables advantages (since the rule representation in the
kernel is a black box from the userspace point of view) is that the kernel
can announce add/delete of rules or elements from nftables sets.

Any particular reason why translating iptables rather than nftables
(it should be possible to monitor the nftables changes that are
 announced by kernel and act on those)?


  1   2   3   4   5   6   7   8   9   >