Re: [PATCH nf 2/2] netfilter: nf_conncount: fix list_del corruption in conn_free

2018-10-29 Thread Yi-Hung Wei
On Thu, Oct 25, 2018 at 7:56 AM Taehee Yoo  wrote:
>
> nf_conncount_tuple is an element of nft_connlimit and is deleted by
> conn_free(). Elements can be deleted both by the GC routine and by the
> data path functions (nf_conncount_lookup, nf_conncount_add), and both
> call conn_free() to free them.
> But conn_free() only protects the list, not each element, so
> list_del corruption can occur.
>
> conn_free() does not check whether an element has already been deleted.
> In order to protect elements, a dead flag is added: when an element is
> deleted, its dead flag is set.
> Since only conn_free() can delete elements, the list lock together with
> the dead flag is enough to protect them.
>
> test commands:
>%nft add table ip filter
>%nft add chain ip filter input { type filter hook input priority 0\; }
>%nft add rule filter input meter test { ip id ct count over 2 } counter
>
> [ ... ]
>
> Fixes: 5c789e131cbb ("netfilter: nf_conncount: Add list lock and gc worker, 
> and RCU for init tree search")
> Signed-off-by: Taehee Yoo 
> ---

LGTM.
Acked-by: Yi-Hung Wei 
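[Editor's note] The dead-flag scheme described above can be modeled in a few lines of single-threaded userspace C. All names below are invented for illustration, and the list lock is elided (in the kernel every function here would run under list->list_lock); only the unlink logic is modeled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the fix: the per-element dead flag turns a second
 * delete of the same element into a harmless no-op instead of a
 * double list_del(). */
struct conn {
    struct conn *prev, *next;
    bool dead;               /* set once this element was unlinked */
};

struct conn_list {
    struct conn head;        /* circular list with a sentinel node */
    unsigned int count;
};

static void conn_list_init(struct conn_list *list)
{
    list->head.prev = list->head.next = &list->head;
    list->count = 0;
}

static void conn_add_tail(struct conn_list *list, struct conn *conn)
{
    conn->dead = false;
    conn->prev = list->head.prev;
    conn->next = &list->head;
    list->head.prev->next = conn;
    list->head.prev = conn;
    list->count++;
}

/* Returns true only for the caller that actually unlinked the element. */
static bool conn_free(struct conn_list *list, struct conn *conn)
{
    if (conn->dead)          /* someone already deleted it */
        return false;
    conn->prev->next = conn->next;
    conn->next->prev = conn->prev;
    conn->dead = true;
    list->count--;
    return true;
}
```

In the kernel the flag check and the unlink happen under the same spinlock, which is what makes the "first deleter wins" rule safe against the GC/data-path race.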


Re: [PATCH nf 1/2] netfilter: nf_conncount: use spin_lock_bh instead of spin_lock

2018-10-29 Thread Yi-Hung Wei
On Thu, Oct 25, 2018 at 7:56 AM Taehee Yoo  wrote:
>
> conn_free() takes its lock with spin_lock(), and it is called by both
> nf_conncount_lookup() and nf_conncount_gc_list().
> nf_conncount_lookup() runs in bottom-half context and
> nf_conncount_gc_list() in process context, so plain spin_lock() is not
> safe. Hence conn_free() should use spin_lock_bh() instead of spin_lock().
>
> test commands:
>%nft add table ip filter
>%nft add chain ip filter input { type filter hook input priority 0\; }
>%nft add rule filter input meter test { ip saddr ct count over 2 } \
>counter
>
> Fixes: 5c789e131cbb ("netfilter: nf_conncount: Add list lock and gc worker, 
> and RCU for init tree search")
> Signed-off-by: Taehee Yoo 

Thanks for the fix.
Acked-by: Yi-Hung Wei 


[PATCH nf-next 3/6] netfilter: nf_conncount: Early exit in nf_conncount_lookup() and cleanup

2018-07-02 Thread Yi-Hung Wei
This patch is originally from Florian Westphal.

This patch does the following three tasks.

It applies the same early-exit technique to nf_conncount_lookup().

Since now we keep the number of connections in 'struct nf_conncount_list',
we no longer need to return the count in nf_conncount_lookup().

Moreover, we expose the garbage collection function nf_conncount_gc_list()
for nft_connlimit.

Signed-off-by: Yi-Hung Wei 
Signed-off-by: Florian Westphal 
---
 include/net/netfilter/nf_conntrack_count.h | 11 +
 net/netfilter/nf_conncount.c   | 38 +-
 net/netfilter/nft_connlimit.c  |  9 +++
 3 files changed, 33 insertions(+), 25 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_count.h 
b/include/net/netfilter/nf_conntrack_count.h
index e4884e0e4f69..dbec17f674b7 100644
--- a/include/net/netfilter/nf_conntrack_count.h
+++ b/include/net/netfilter/nf_conntrack_count.h
@@ -21,10 +21,10 @@ unsigned int nf_conncount_count(struct net *net,
const struct nf_conntrack_tuple *tuple,
const struct nf_conntrack_zone *zone);
 
-unsigned int nf_conncount_lookup(struct net *net, struct nf_conncount_list 
*list,
-const struct nf_conntrack_tuple *tuple,
-const struct nf_conntrack_zone *zone,
-bool *addit);
+void nf_conncount_lookup(struct net *net, struct nf_conncount_list *list,
+const struct nf_conntrack_tuple *tuple,
+const struct nf_conntrack_zone *zone,
+bool *addit);
 
 void nf_conncount_list_init(struct nf_conncount_list *list);
 
@@ -32,6 +32,9 @@ bool nf_conncount_add(struct nf_conncount_list *list,
  const struct nf_conntrack_tuple *tuple,
  const struct nf_conntrack_zone *zone);
 
+void nf_conncount_gc_list(struct net *net,
+ struct nf_conncount_list *list);
+
 void nf_conncount_cache_free(struct nf_conncount_list *list);
 
 #endif
diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 81b060adefef..7dfd9d5e6a3e 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -144,26 +144,29 @@ find_or_evict(struct net *net, struct nf_conncount_list 
*list,
return ERR_PTR(-EAGAIN);
 }
 
-unsigned int nf_conncount_lookup(struct net *net,
-struct nf_conncount_list *list,
-const struct nf_conntrack_tuple *tuple,
-const struct nf_conntrack_zone *zone,
-bool *addit)
+void nf_conncount_lookup(struct net *net,
+struct nf_conncount_list *list,
+const struct nf_conntrack_tuple *tuple,
+const struct nf_conntrack_zone *zone,
+bool *addit)
 {
const struct nf_conntrack_tuple_hash *found;
struct nf_conncount_tuple *conn, *conn_n;
struct nf_conn *found_ct;
-   unsigned int length = 0;
+   unsigned int collect = 0;
 
+   /* best effort only */
*addit = tuple ? true : false;
 
/* check the saved connections */
list_for_each_entry_safe(conn, conn_n, &list->head, node) {
+   if (collect > CONNCOUNT_GC_MAX_NODES)
+   break;
+
found = find_or_evict(net, list, conn);
if (IS_ERR(found)) {
/* Not found, but might be about to be confirmed */
if (PTR_ERR(found) == -EAGAIN) {
-   length++;
if (!tuple)
continue;
 
@@ -171,8 +174,8 @@ unsigned int nf_conncount_lookup(struct net *net,
nf_ct_zone_id(&conn->zone, conn->zone.dir) ==
nf_ct_zone_id(zone, zone->dir))
*addit = false;
-   }
-
+   } else if (PTR_ERR(found) == -ENOENT)
+   collect++;
continue;
}
 
@@ -181,9 +184,10 @@ unsigned int nf_conncount_lookup(struct net *net,
if (tuple && nf_ct_tuple_equal(&conn->tuple, tuple) &&
nf_ct_zone_equal(found_ct, zone, zone->dir)) {
/*
-* Just to be sure we have it only once in the list.
 * We should not see tuples twice unless someone hooks
 * this into a table without "-p tcp --syn".
+*
+* Attempt to avoid a re-add in this case.
 */
*addit = false;
} else if (already_clo

[PATCH nf-next 6/6] netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search

2018-07-02 Thread Yi-Hung Wei
This patch is originally from Florian Westphal.

This patch does the following 3 main tasks.

1) Add a list lock to 'struct nf_conncount_list' so that we can
alter the lists containing the individual connections without holding the
main tree lock.  This is useful when we only need to add to/remove from
a list without allocating/removing a node in the tree.  With this change,
we update nft_connlimit accordingly, since it no longer needs to maintain
its own list lock.

2) Use RCU for the initial tree search to improve tree lookup performance.

3) Add a garbage collection worker. This worker is scheduled when there
are excessive tree nodes that need to be recycled.

Moreover, the rbnode reclaim logic is moved from the tree search path to
the insert path to avoid a race condition.
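[Editor's note] The deferred-gc scheduling can be sketched in userspace C. This is a toy model with invented names, not the kernel implementation: the packet path marks a tree slot that accumulated reclaimable nodes in a pending_trees-style bitmap, and the worker later sweeps only the marked slots:

```c
#include <assert.h>

#define SLOTS 64
#define LONG_BITS (8 * (unsigned int)sizeof(unsigned long))

/* One bit per tree slot; a set bit means "this slot needs gc". */
static unsigned long pending[(SLOTS + LONG_BITS - 1) / LONG_BITS];

static void mark_pending(unsigned int slot)
{
    pending[slot / LONG_BITS] |= 1UL << (slot % LONG_BITS);
}

static int test_and_clear_pending(unsigned int slot)
{
    unsigned long bit = 1UL << (slot % LONG_BITS);
    int was_set = (pending[slot / LONG_BITS] & bit) != 0;

    pending[slot / LONG_BITS] &= ~bit;
    return was_set;
}

/* One worker pass: visit each marked slot exactly once and return how
 * many slots were handled (the real worker would gc that slot's tree). */
static unsigned int gc_worker_sweep(void)
{
    unsigned int handled = 0;

    for (unsigned int slot = 0; slot < SLOTS; slot++)
        if (test_and_clear_pending(slot))
            handled++;
    return handled;
}
```

Duplicate marks for the same slot collapse into one bit, so a busy slot costs the worker a single visit per sweep regardless of how often the packet path flagged it.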

Signed-off-by: Yi-Hung Wei 
Signed-off-by: Florian Westphal 
---
 include/net/netfilter/nf_conntrack_count.h |  17 +-
 net/netfilter/nf_conncount.c   | 253 +
 net/netfilter/nft_connlimit.c  |  17 +-
 3 files changed, 196 insertions(+), 91 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_count.h 
b/include/net/netfilter/nf_conntrack_count.h
index dbec17f674b7..4b2b2baf8ab4 100644
--- a/include/net/netfilter/nf_conntrack_count.h
+++ b/include/net/netfilter/nf_conntrack_count.h
@@ -5,9 +5,17 @@
 
 struct nf_conncount_data;
 
+enum nf_conncount_list_add {
+   NF_CONNCOUNT_ADDED, /* list add was ok */
+   NF_CONNCOUNT_ERR,   /* -ENOMEM, must drop skb */
+   NF_CONNCOUNT_SKIP,  /* list is already reclaimed by gc */
+};
+
 struct nf_conncount_list {
+   spinlock_t list_lock;
struct list_head head;  /* connections with the same filtering key */
unsigned int count; /* length of list */
+   bool dead;
 };
 
 struct nf_conncount_data *nf_conncount_init(struct net *net, unsigned int 
family,
@@ -28,11 +36,12 @@ void nf_conncount_lookup(struct net *net, struct 
nf_conncount_list *list,
 
 void nf_conncount_list_init(struct nf_conncount_list *list);
 
-bool nf_conncount_add(struct nf_conncount_list *list,
- const struct nf_conntrack_tuple *tuple,
- const struct nf_conntrack_zone *zone);
+enum nf_conncount_list_add
+nf_conncount_add(struct nf_conncount_list *list,
+const struct nf_conntrack_tuple *tuple,
+const struct nf_conntrack_zone *zone);
 
-void nf_conncount_gc_list(struct net *net,
+bool nf_conncount_gc_list(struct net *net,
  struct nf_conncount_list *list);
 
 void nf_conncount_cache_free(struct nf_conncount_list *list);
diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 3f14806b7271..02ca7df793f5 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -49,12 +49,14 @@ struct nf_conncount_tuple {
struct nf_conntrack_zonezone;
int cpu;
u32 jiffies32;
+   struct rcu_head rcu_head;
 };
 
 struct nf_conncount_rb {
struct rb_node node;
struct nf_conncount_list list;
u32 key[MAX_KEYLEN];
+   struct rcu_head rcu_head;
 };
 
 static spinlock_t nf_conncount_locks[CONNCOUNT_LOCK_SLOTS] 
__cacheline_aligned_in_smp;
@@ -62,6 +64,10 @@ static spinlock_t nf_conncount_locks[CONNCOUNT_LOCK_SLOTS] 
__cacheline_aligned_i
 struct nf_conncount_data {
unsigned int keylen;
struct rb_root root[CONNCOUNT_SLOTS];
+   struct net *net;
+   struct work_struct gc_work;
+   unsigned long pending_trees[BITS_TO_LONGS(CONNCOUNT_SLOTS)];
+   unsigned int gc_tree;
 };
 
 static u_int32_t conncount_rnd __read_mostly;
@@ -82,42 +88,70 @@ static int key_diff(const u32 *a, const u32 *b, unsigned 
int klen)
return memcmp(a, b, klen * sizeof(u32));
 }
 
-bool nf_conncount_add(struct nf_conncount_list *list,
- const struct nf_conntrack_tuple *tuple,
- const struct nf_conntrack_zone *zone)
+enum nf_conncount_list_add
+nf_conncount_add(struct nf_conncount_list *list,
+const struct nf_conntrack_tuple *tuple,
+const struct nf_conntrack_zone *zone)
 {
struct nf_conncount_tuple *conn;
 
if (WARN_ON_ONCE(list->count > INT_MAX))
-   return false;
+   return NF_CONNCOUNT_ERR;
 
conn = kmem_cache_alloc(conncount_conn_cachep, GFP_ATOMIC);
if (conn == NULL)
-   return false;
+   return NF_CONNCOUNT_ERR;
+
conn->tuple = *tuple;
conn->zone = *zone;
conn->cpu = raw_smp_processor_id();
conn->jiffies32 = (u32)jiffies;
+   spin_lock(&list->list_lock);
+   if (list->dead == true) {
+   kmem_cache_free(conncount_conn_cachep, conn);
+   spin_unlock(&list->list_lock);
+   return NF_CONNCOUNT_SKIP;
+   }
list_add_tail(>nod

[PATCH nf-next 1/6] netfilter: nf_conncount: Early exit for garbage collection

2018-07-02 Thread Yi-Hung Wei
This patch is originally from Florian Westphal.

We add an extra function with early exit for garbage collection.
It is not necessary to traverse the full list on every node, since
it is enough to zap a couple of entries per garbage-collection pass.
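[Editor's note] The early-exit walk can be modeled in plain userspace C. This is a hypothetical sketch, not the kernel code: expired entries are unlinked until a small quota is crossed, and then the walk stops rather than draining the whole list:

```c
#include <assert.h>
#include <stddef.h>

#define GC_MAX_NODES 8  /* stand-in for CONNCOUNT_GC_MAX_NODES */

struct entry {
    struct entry *next;
    int expired;        /* 1 if the conntrack lookup would fail */
};

/* Walk a singly linked list, unlink expired entries, and stop as soon
 * as the collection quota is exceeded. Returns the number collected;
 * the caller owns the storage of unlinked entries. */
static unsigned int gc_list(struct entry **headp)
{
    unsigned int collected = 0;
    struct entry **pp = headp;

    while (*pp) {
        struct entry *e = *pp;

        if (e->expired) {
            *pp = e->next;    /* unlink */
            collected++;
            if (collected > GC_MAX_NODES)
                break;        /* early exit: enough zapped this round */
            continue;
        }
        pp = &e->next;
    }
    return collected;
}
```

The point is amortization: each pass does a bounded amount of reclaim work, and later passes (or later packets) pick up the rest.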

Signed-off-by: Yi-Hung Wei 
Signed-off-by: Florian Westphal 
---
 net/netfilter/nf_conncount.c | 39 +--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 510039862aa9..81c02185b2e8 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -189,6 +189,42 @@ unsigned int nf_conncount_lookup(struct net *net, struct 
hlist_head *head,
 }
 EXPORT_SYMBOL_GPL(nf_conncount_lookup);
 
+static void nf_conncount_gc_list(struct net *net,
+struct nf_conncount_rb *rbconn)
+{
+   const struct nf_conntrack_tuple_hash *found;
+   struct nf_conncount_tuple *conn;
+   struct hlist_node *n;
+   struct nf_conn *found_ct;
+   unsigned int collected = 0;
+
+   hlist_for_each_entry_safe(conn, n, &rbconn->hhead, node) {
+   found = find_or_evict(net, conn);
+   if (IS_ERR(found)) {
+   if (PTR_ERR(found) == -ENOENT)
+   collected++;
+   continue;
+   }
+
+   found_ct = nf_ct_tuplehash_to_ctrack(found);
+   if (already_closed(found_ct)) {
+   /*
+* we do not care about connections which are
+* closed already -> ditch it
+*/
+   nf_ct_put(found_ct);
+   hlist_del(&conn->node);
+   kmem_cache_free(conncount_conn_cachep, conn);
+   collected++;
+   continue;
+   }
+
+   nf_ct_put(found_ct);
+   if (collected > CONNCOUNT_GC_MAX_NODES)
+   return;
+   }
+}
+
 static void tree_nodes_free(struct rb_root *root,
struct nf_conncount_rb *gc_nodes[],
unsigned int gc_count)
@@ -251,8 +287,7 @@ count_tree(struct net *net, struct rb_root *root,
if (no_gc || gc_count >= ARRAY_SIZE(gc_nodes))
continue;
 
-   /* only used for GC on hhead, retval and 'addit' ignored */
-   nf_conncount_lookup(net, &rbconn->hhead, tuple, zone, &addit);
+   nf_conncount_gc_list(net, rbconn);
if (hlist_empty(&rbconn->hhead))
gc_nodes[gc_count++] = rbconn;
}
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH nf-next 2/6] netfilter: nf_conncount: Switch to plain list

2018-07-02 Thread Yi-Hung Wei
Original patch is from Florian Westphal.

This patch switches from an hlist to a plain list to store the list of
connections with the same filtering key in nf_conncount. With the
plain list, we can insert new connections at the tail, so over time
the beginning of the list holds long-running connections and those get
expired first, while the newly created ones are at the end.

Later on, we could probably move checked entries to the end of the list,
so the next run has a higher chance of reclaiming stale entries at the
front.
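[Editor's note] A userspace sketch of why tail insertion helps (invented names; the kernel list and expiry test differ): since new entries go to the tail, the oldest live at the head, so an expiry scan can start at the head and stop at the first still-fresh entry:

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *prev, *next;
    unsigned int born;       /* insertion "time" */
};

struct tlist {
    struct node head;        /* circular list with a sentinel node */
};

static void tlist_init(struct tlist *l)
{
    l->head.prev = l->head.next = &l->head;
}

static void add_tail(struct tlist *l, struct node *n, unsigned int now)
{
    n->born = now;
    n->prev = l->head.prev;
    n->next = &l->head;
    l->head.prev->next = n;
    l->head.prev = n;
}

/* Oldest entries are always at the front: unlink everything older than
 * the cutoff and stop at the first fresh entry. Returns count reaped. */
static unsigned int reap_older_than(struct tlist *l, unsigned int cutoff)
{
    unsigned int reaped = 0;
    struct node *n = l->head.next;

    while (n != &l->head && n->born < cutoff) {
        struct node *next = n->next;

        n->prev->next = n->next;
        n->next->prev = n->prev;
        reaped++;
        n = next;
    }
    return reaped;
}
```

With head insertion (the old hlist behavior) the stale entries would be scattered toward the back instead, and a bounded scan from the front would mostly see fresh entries.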

Signed-off-by: Yi-Hung Wei 
Signed-off-by: Florian Westphal 
---
 include/net/netfilter/nf_conntrack_count.h | 15 --
 net/netfilter/nf_conncount.c   | 83 ++
 net/netfilter/nft_connlimit.c  | 24 -
 3 files changed, 75 insertions(+), 47 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_count.h 
b/include/net/netfilter/nf_conntrack_count.h
index 3a188a0923a3..e4884e0e4f69 100644
--- a/include/net/netfilter/nf_conntrack_count.h
+++ b/include/net/netfilter/nf_conntrack_count.h
@@ -1,8 +1,15 @@
 #ifndef _NF_CONNTRACK_COUNT_H
 #define _NF_CONNTRACK_COUNT_H
 
+#include 
+
 struct nf_conncount_data;
 
+struct nf_conncount_list {
+   struct list_head head;  /* connections with the same filtering key */
+   unsigned int count; /* length of list */
+};
+
 struct nf_conncount_data *nf_conncount_init(struct net *net, unsigned int 
family,
unsigned int keylen);
 void nf_conncount_destroy(struct net *net, unsigned int family,
@@ -14,15 +21,17 @@ unsigned int nf_conncount_count(struct net *net,
const struct nf_conntrack_tuple *tuple,
const struct nf_conntrack_zone *zone);
 
-unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
+unsigned int nf_conncount_lookup(struct net *net, struct nf_conncount_list 
*list,
 const struct nf_conntrack_tuple *tuple,
 const struct nf_conntrack_zone *zone,
 bool *addit);
 
-bool nf_conncount_add(struct hlist_head *head,
+void nf_conncount_list_init(struct nf_conncount_list *list);
+
+bool nf_conncount_add(struct nf_conncount_list *list,
  const struct nf_conntrack_tuple *tuple,
  const struct nf_conntrack_zone *zone);
 
-void nf_conncount_cache_free(struct hlist_head *hhead);
+void nf_conncount_cache_free(struct nf_conncount_list *list);
 
 #endif
diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 81c02185b2e8..81b060adefef 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -44,7 +44,7 @@
 
 /* we will save the tuples of all connections we care about */
 struct nf_conncount_tuple {
-   struct hlist_node   node;
+   struct list_headnode;
struct nf_conntrack_tuple   tuple;
struct nf_conntrack_zonezone;
int cpu;
@@ -53,7 +53,7 @@ struct nf_conncount_tuple {
 
 struct nf_conncount_rb {
struct rb_node node;
-   struct hlist_head hhead; /* connections/hosts in same subnet */
+   struct nf_conncount_list list;
u32 key[MAX_KEYLEN];
 };
 
@@ -82,12 +82,15 @@ static int key_diff(const u32 *a, const u32 *b, unsigned 
int klen)
return memcmp(a, b, klen * sizeof(u32));
 }
 
-bool nf_conncount_add(struct hlist_head *head,
+bool nf_conncount_add(struct nf_conncount_list *list,
  const struct nf_conntrack_tuple *tuple,
  const struct nf_conntrack_zone *zone)
 {
struct nf_conncount_tuple *conn;
 
+   if (WARN_ON_ONCE(list->count > INT_MAX))
+   return false;
+
conn = kmem_cache_alloc(conncount_conn_cachep, GFP_ATOMIC);
if (conn == NULL)
return false;
@@ -95,13 +98,26 @@ bool nf_conncount_add(struct hlist_head *head,
conn->zone = *zone;
conn->cpu = raw_smp_processor_id();
conn->jiffies32 = (u32)jiffies;
-   hlist_add_head(&conn->node, head);
+   list_add_tail(&conn->node, &list->head);
+   list->count++;
return true;
 }
 EXPORT_SYMBOL_GPL(nf_conncount_add);
 
+static void conn_free(struct nf_conncount_list *list,
+ struct nf_conncount_tuple *conn)
+{
+   if (WARN_ON_ONCE(list->count == 0))
+   return;
+
+   list->count--;
+   list_del(&conn->node);
+   kmem_cache_free(conncount_conn_cachep, conn);
+}
+
 static const struct nf_conntrack_tuple_hash *
-find_or_evict(struct net *net, struct nf_conncount_tuple *conn)
+find_or_evict(struct net *net, struct nf_conncount_list *list,
+ struct nf_conncount_tuple *conn)
 {
const struct nf_conntrack_tuple_hash *found;
unsigned long a, b;
@@ -121,30 +137,29 @@ find_or_evict(struct net *net, struct nf_conncount_t

[PATCH nf-next 0/6] netfilter: nf_conncount: optimize nf_conncount performance

2018-07-02 Thread Yi-Hung Wei
This patch series applies the following techniques to optimize nf_conncount
performance.

* Early exit for garbage collection
In order to reduce gc time, we skip traversing the full list on
every node when doing garbage collection, since it is enough to zap
a couple of expired entries.

* Split tree insertion and traversal
When we have a very coarse grouping, e.g. by large subnets, zone id,
etc., it is likely that we do not need to do tree rotation because
we will find a node where we can attach the new entry.  Based on this
observation, we then make traversal lockless (tree protected
by RCU), and add an extra lock in the individual nodes to protect list
insertion/deletion, thereby allowing parallel inserts/deletes in different
tree nodes.

* Add garbage collection worker
Instead of doing all of the garbage collection work in the packet
forwarding path, we schedule a garbage collection worker when the number
of nodes that can be freed exceeds a threshold.

This patch series depends on the following commit in the nf git tree.
b36e4523d4d5 ("netfilter: nf_conncount: fix garbage collection confirm race")

RFC -> v1:
* Rebase patch series to a bug fix commit on nf_conncount.
* Merge patch 7 and patch 6 to one commit to avoid a race condition.


Yi-Hung Wei (6):
  netfilter: nf_conncount: Early exit for garbage collection
  netfilter: nf_conncount: Switch to plain list
  netfilter: nf_conncount: Early exit in nf_conncount_lookup() and
cleanup
  netfilter: nf_conncount: Move locking into count_tree()
  netfilter: nf_conncount: Split insert and traversal
  netfilter: nf_conncount: Add list lock and gc worker, and RCU for init
tree search

 include/net/netfilter/nf_conntrack_count.h |  37 ++-
 net/netfilter/nf_conncount.c   | 386 ++---
 net/netfilter/nft_connlimit.c  |  36 +--
 3 files changed, 340 insertions(+), 119 deletions(-)

-- 
2.7.4



[PATCH nf-next 5/6] netfilter: nf_conncount: Split insert and traversal

2018-07-02 Thread Yi-Hung Wei
This patch is originally from Florian Westphal.

When we have a very coarse grouping, e.g. by large subnets, zone id,
etc., it is likely that we do not need to do tree rotation because
we will find a node where we can attach the new entry.  Based on this
observation, we split tree traversal and insertion.

Later on, we can make traversal lockless (tree protected
by RCU), and add an extra lock in the individual nodes to protect list
insertion/deletion, thereby allowing parallel inserts/deletes in different
tree nodes.
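[Editor's note] The traversal/insertion split can be illustrated with an ordinary unbalanced binary search tree (single-threaded, invented names; the patch uses an rb-tree and real locks). The hot path only searches and never modifies the tree; insertion is a separate step taken only on a miss:

```c
#include <assert.h>
#include <stdlib.h>

struct rbish {
    struct rbish *left, *right;
    int key;
    int count;       /* stands in for the per-node connection list */
};

/* Read-only traversal: no rotations, no tree writes (RCU-friendly). */
static struct rbish *tree_search(struct rbish *root, int key)
{
    while (root) {
        if (key < root->key)
            root = root->left;
        else if (key > root->key)
            root = root->right;
        else
            return root;       /* common case: node already exists */
    }
    return NULL;
}

/* Slow path, run under the tree lock in the real code. */
static struct rbish *tree_insert(struct rbish **rootp, int key)
{
    struct rbish *n;

    while (*rootp) {
        n = *rootp;
        if (key < n->key)
            rootp = &n->left;
        else if (key > n->key)
            rootp = &n->right;
        else
            return n;          /* unlikely: another cpu added it first */
    }
    n = calloc(1, sizeof(*n));
    n->key = key;
    *rootp = n;
    return n;
}

/* Fast path first; fall back to insertion only on a miss. */
static struct rbish *count_one(struct rbish **rootp, int key)
{
    struct rbish *n = tree_search(*rootp, key);

    if (!n)
        n = tree_insert(rootp, key);
    n->count++;
    return n;
}
```

Note the insert path re-walks from the root under the lock and tolerates finding the key already present, mirroring the "other cpu added node already" case in the patch.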

Signed-off-by: Yi-Hung Wei 
Signed-off-by: Florian Westphal 
---
 net/netfilter/nf_conncount.c | 87 ++--
 1 file changed, 67 insertions(+), 20 deletions(-)

diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index d1a4fd1c0f81..3f14806b7271 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -262,6 +262,71 @@ static void tree_nodes_free(struct rb_root *root,
 }
 
 static unsigned int
+insert_tree(struct rb_root *root,
+   unsigned int hash,
+   const u32 *key,
+   u8 keylen,
+   const struct nf_conntrack_tuple *tuple,
+   const struct nf_conntrack_zone *zone)
+{
+   struct rb_node **rbnode, *parent;
+   struct nf_conncount_rb *rbconn;
+   struct nf_conncount_tuple *conn;
+   unsigned int count = 0;
+
+   spin_lock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
+
+   parent = NULL;
+   rbnode = &(root->rb_node);
+   while (*rbnode) {
+   int diff;
+   rbconn = rb_entry(*rbnode, struct nf_conncount_rb, node);
+
+   parent = *rbnode;
+   diff = key_diff(key, rbconn->key, keylen);
+   if (diff < 0) {
+   rbnode = &((*rbnode)->rb_left);
+   } else if (diff > 0) {
+   rbnode = &((*rbnode)->rb_right);
+   } else {
+   /* unlikely: other cpu added node already */
+   if (!nf_conncount_add(&rbconn->list, tuple, zone)) {
+   count = 0; /* hotdrop */
+   goto out_unlock;
+   }
+
+   count = rbconn->list.count;
+   goto out_unlock;
+   }
+   }
+
+   /* expected case: match, insert new node */
+   rbconn = kmem_cache_alloc(conncount_rb_cachep, GFP_ATOMIC);
+   if (rbconn == NULL)
+   goto out_unlock;
+
+   conn = kmem_cache_alloc(conncount_conn_cachep, GFP_ATOMIC);
+   if (conn == NULL) {
+   kmem_cache_free(conncount_rb_cachep, rbconn);
+   goto out_unlock;
+   }
+
+   conn->tuple = *tuple;
+   conn->zone = *zone;
+   memcpy(rbconn->key, key, sizeof(u32) * keylen);
+
+   nf_conncount_list_init(&rbconn->list);
+   list_add(&conn->node, &rbconn->list.head);
+   count = 1;
+
+   rb_link_node(&rbconn->node, parent, rbnode);
+   rb_insert_color(&rbconn->node, root);
+out_unlock:
+   spin_unlock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
+   return count;
+}
+
+static unsigned int
 count_tree(struct net *net,
   struct nf_conncount_data *data,
   const u32 *key,
@@ -272,7 +337,6 @@ count_tree(struct net *net,
struct rb_root *root;
struct rb_node **rbnode, *parent;
struct nf_conncount_rb *rbconn;
-   struct nf_conncount_tuple *conn;
unsigned int gc_count, hash;
bool no_gc = false;
unsigned int count = 0;
@@ -339,27 +403,10 @@ count_tree(struct net *net,
count = 0;
if (!tuple)
goto out_unlock;
-   /* no match, need to insert new node */
-   rbconn = kmem_cache_alloc(conncount_rb_cachep, GFP_ATOMIC);
-   if (rbconn == NULL)
-   goto out_unlock;
-
-   conn = kmem_cache_alloc(conncount_conn_cachep, GFP_ATOMIC);
-   if (conn == NULL) {
-   kmem_cache_free(conncount_rb_cachep, rbconn);
-   goto out_unlock;
-   }
-
-   conn->tuple = *tuple;
-   conn->zone = *zone;
-   memcpy(rbconn->key, key, sizeof(u32) * keylen);
 
-   nf_conncount_list_init(&rbconn->list);
-   list_add(&conn->node, &rbconn->list.head);
-   count = 1;
+   spin_unlock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
+   return insert_tree(root, hash, key, keylen, tuple, zone);
 
-   rb_link_node(&rbconn->node, parent, rbnode);
-   rb_insert_color(&rbconn->node, root);
 out_unlock:
spin_unlock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
return count;
-- 
2.7.4



[PATCH nf-next 4/6] netfilter: nf_conncount: Move locking into count_tree()

2018-07-02 Thread Yi-Hung Wei
This patch is originally from Florian Westphal.

This is a preparation patch to allow lockless traversal
of the tree via RCU.
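[Editor's note] Once locking moves into count_tree(), the function derives both the tree slot and its guarding lock from the key hash itself. A small userspace sketch of that slot selection (the mixer merely stands in for jhash2(); all names are invented):

```c
#include <assert.h>
#include <stdint.h>

#define TREE_SLOTS 256
#define LOCK_SLOTS 32   /* fewer locks than trees, as in nf_conncount */

/* Deterministic 32-bit mixer standing in for jhash2(). */
static uint32_t mix32(uint32_t k)
{
    k ^= k >> 16;
    k *= 0x45d9f3bu;
    k ^= k >> 16;
    return k;
}

/* Hash the key once, derive the tree slot, and derive the lock slot
 * from it so the same tree is always guarded by the same lock. */
static void pick_slots(uint32_t key, unsigned int *tree_slot,
                       unsigned int *lock_slot)
{
    unsigned int hash = mix32(key) % TREE_SLOTS;

    *tree_slot = hash;
    *lock_slot = hash % LOCK_SLOTS;  /* one lock covers several trees */
}
```

Because TREE_SLOTS is a multiple of LOCK_SLOTS, every tree slot maps to exactly one lock, so two paths touching the same tree always contend on the same spinlock.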

Signed-off-by: Yi-Hung Wei 
Signed-off-by: Florian Westphal 
---
 net/netfilter/nf_conncount.c | 52 +---
 1 file changed, 25 insertions(+), 27 deletions(-)

diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 7dfd9d5e6a3e..d1a4fd1c0f81 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -262,18 +262,26 @@ static void tree_nodes_free(struct rb_root *root,
 }
 
 static unsigned int
-count_tree(struct net *net, struct rb_root *root,
-  const u32 *key, u8 keylen,
+count_tree(struct net *net,
+  struct nf_conncount_data *data,
+  const u32 *key,
   const struct nf_conntrack_tuple *tuple,
   const struct nf_conntrack_zone *zone)
 {
struct nf_conncount_rb *gc_nodes[CONNCOUNT_GC_MAX_NODES];
+   struct rb_root *root;
struct rb_node **rbnode, *parent;
struct nf_conncount_rb *rbconn;
struct nf_conncount_tuple *conn;
-   unsigned int gc_count;
+   unsigned int gc_count, hash;
bool no_gc = false;
+   unsigned int count = 0;
+   u8 keylen = data->keylen;
 
+   hash = jhash2(key, data->keylen, conncount_rnd) % CONNCOUNT_SLOTS;
+   root = &data->root[hash];
+
+   spin_lock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
  restart:
gc_count = 0;
parent = NULL;
@@ -292,20 +300,20 @@ count_tree(struct net *net, struct rb_root *root,
rbnode = &((*rbnode)->rb_right);
} else {
/* same source network -> be counted! */
-   unsigned int count;
-
nf_conncount_lookup(net, &rbconn->list, tuple, zone,
&addit);
count = rbconn->list.count;
 
tree_nodes_free(root, gc_nodes, gc_count);
if (!addit)
-   return count;
+   goto out_unlock;
 
if (!nf_conncount_add(&rbconn->list, tuple, zone))
-   return 0; /* hotdrop */
+   count = 0; /* hotdrop */
+   goto out_unlock;
 
-   return count + 1;
+   count++;
+   goto out_unlock;
}
 
if (no_gc || gc_count >= ARRAY_SIZE(gc_nodes))
@@ -328,18 +336,18 @@ count_tree(struct net *net, struct rb_root *root,
goto restart;
}
 
+   count = 0;
if (!tuple)
-   return 0;
-
+   goto out_unlock;
/* no match, need to insert new node */
rbconn = kmem_cache_alloc(conncount_rb_cachep, GFP_ATOMIC);
if (rbconn == NULL)
-   return 0;
+   goto out_unlock;
 
conn = kmem_cache_alloc(conncount_conn_cachep, GFP_ATOMIC);
if (conn == NULL) {
kmem_cache_free(conncount_rb_cachep, rbconn);
-   return 0;
+   goto out_unlock;
}
 
conn->tuple = *tuple;
@@ -348,10 +356,13 @@ count_tree(struct net *net, struct rb_root *root,
 
nf_conncount_list_init(&rbconn->list);
list_add(&conn->node, &rbconn->list.head);
+   count = 1;
 
rb_link_node(&rbconn->node, parent, rbnode);
rb_insert_color(&rbconn->node, root);
-   return 1;
+out_unlock:
+   spin_unlock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
+   return count;
 }
 
 /* Count and return number of conntrack entries in 'net' with particular 'key'.
@@ -363,20 +374,7 @@ unsigned int nf_conncount_count(struct net *net,
const struct nf_conntrack_tuple *tuple,
const struct nf_conntrack_zone *zone)
 {
-   struct rb_root *root;
-   int count;
-   u32 hash;
-
-   hash = jhash2(key, data->keylen, conncount_rnd) % CONNCOUNT_SLOTS;
-   root = &data->root[hash];
-
-   spin_lock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
-
-   count = count_tree(net, root, key, data->keylen, tuple, zone);
-
-   spin_unlock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
-
-   return count;
+   return count_tree(net, data, key, tuple, zone);
 }
 EXPORT_SYMBOL_GPL(nf_conncount_count);
 
-- 
2.7.4



Re: [RFC nf-next 0/7] netfilter: nf_conncount: optimize nf_conncount performance

2018-07-02 Thread Yi-Hung Wei
On Mon, Jul 2, 2018 at 9:49 AM, Florian Westphal  wrote:
> Yi-Hung Wei  wrote:
>> This patch series apply the following techniques to optimize nf_conncount
>> performance.
>
> Looks good to me, thanks Yi-Hung for working on this.
>
> Maybe just avoid the race in patch 6/7; it's possible it needs to be
> merged with the last patch. If so, that's fine with me.

Thanks, Florian, for the review. I will repost the patch series with
patches 6 and 7 merged to avoid the race. I will also rebase it on this
commit

b36e4523d4d5 ("netfilter: nf_conncount: fix garbage collection confirm race")

in nf tree that contains a fix to nf_conncount.

Thanks,

-Yi-Hung


Re: [PATCH nf v2] netfilter: nf_conncount: fix garbage collection confirm race

2018-06-25 Thread Yi-Hung Wei
On Wed, Jun 20, 2018 at 2:32 PM, Florian Westphal  wrote:

Thanks for v2. It takes care of a corner case so that a duplicated
entry won't be re-added a second time.

Just some nits in the commit message as below.

Acked-by: Yi-Hung Wei 

> When doing list walk, we lookup the tuple in the conntrack table.
> If the lookup fails we we remove this tuple from our list because
s/we we/we
> the conntrack entry is gone.


> The second constaint allows GC to be taken over by other
s/constaint/constraint
> cpu too (e.g. because a cpu was offlined or napi got moved to another
> cpu).
>
> We can't pretend the 'doubtful' entry wasn't in our list.
> Instead, when we don't find an entry indicate via IS_ERR
> that entry was removed ('did not exist' or withheld
> ('might-be-unconfirmed').
>
> This most likely also fixes a xt_connlimit imbalance earlier reported by
> Dmitry Andrianov.
>
> Cc: Dmitry Andrianov 
> Reported-by: Justin Pettit 
> Reported-by: Yi-Hung Wei 
> Signed-off-by: Florian Westphal 
> ---


[RFC nf-next 0/7] netfilter: nf_conncount: optimize nf_conncount performance

2018-06-18 Thread Yi-Hung Wei
This patch series applies the following techniques to optimize nf_conncount
performance.

* Early exit for garbage collection
In order to reduce gc time, we skip traversing the full list on
every node when doing garbage collection, since it is enough to zap
a couple of expired entries.

* Split tree insertion and traversal
When we have a very coarse grouping, e.g. by large subnets, zone id,
etc., it is likely that we do not need to do tree rotation because
we will find a node where we can attach the new entry.  Based on this
observation, we then make traversal lockless (tree protected
by RCU), and add an extra lock in the individual nodes to protect list
insertion/deletion, thereby allowing parallel inserts/deletes in different
tree nodes.

* Add garbage collection worker
Instead of doing all of the garbage collection work in the packet
forwarding path, we schedule a garbage collection worker when the number
of nodes that can be freed exceeds a threshold.

This patch series depends on the following commit in the nf git tree.
21ba8847 ("netfilter: nf_conncount: Fix garbage collection with zones")

Yi-Hung Wei (7):
  netfilter: nf_conncount: Early exit for garbage collection
  netfilter: nf_conncount: Switch to plain list
  netfilter: nf_conncount: Early exit in nf_conncount_lookup() and
cleanup
  netfilter: nf_conncount: Move locking into count_tree()
  netfilter: nf_conncount: Split insert and traversal
  netfilter: nf_conncount: Add list lock and use RCU for init tree
search
  netfilter: nf_conncount: Add garbage collection worker

 include/net/netfilter/nf_conntrack_count.h |  37 ++-
 net/netfilter/nf_conncount.c   | 374 ++---
 net/netfilter/nft_connlimit.c  |  36 +--
 3 files changed, 332 insertions(+), 115 deletions(-)

-- 
2.7.4



[RFC nf-next 1/7] netfilter: nf_conncount: Early exit for garbage collection

2018-06-18 Thread Yi-Hung Wei
From: Florian Westphal 

We add an extra function with early exit for garbage collection.
It is not necessary to traverse the full list on every node, since
it is enough to zap a couple of entries per garbage-collection pass.

Signed-off-by: Florian Westphal 
Signed-off-by: Yi-Hung Wei 
---
 net/netfilter/nf_conncount.c | 40 ++--
 1 file changed, 38 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index d8383609fe28..ed87ad40541b 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -147,6 +147,43 @@ unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
 }
 EXPORT_SYMBOL_GPL(nf_conncount_lookup);
 
+static void nf_conncount_gc_list(struct net *net,
+struct nf_conncount_rb *rbconn)
+{
+   const struct nf_conntrack_tuple_hash *found;
+   struct nf_conncount_tuple *conn;
+   struct hlist_node *n;
+   struct nf_conn *found_ct;
+   unsigned int collected = 0;
+
+   hlist_for_each_entry_safe(conn, n, &rbconn->hhead, node) {
+   found = nf_conntrack_find_get(net, &conn->zone, &conn->tuple);
+   if (found == NULL) {
+   hlist_del(&conn->node);
+   kmem_cache_free(conncount_conn_cachep, conn);
+   collected++;
+   continue;
+   }
+
+   found_ct = nf_ct_tuplehash_to_ctrack(found);
+   if (already_closed(found_ct)) {
+   /*
+* we do not care about connections which are
+* closed already -> ditch it
+*/
+   nf_ct_put(found_ct);
+   hlist_del(&conn->node);
+   kmem_cache_free(conncount_conn_cachep, conn);
+   collected++;
+   continue;
+   }
+
+   nf_ct_put(found_ct);
+   if (collected > CONNCOUNT_GC_MAX_NODES)
+   return;
+   }
+}
+
 static void tree_nodes_free(struct rb_root *root,
struct nf_conncount_rb *gc_nodes[],
unsigned int gc_count)
@@ -209,8 +246,7 @@ count_tree(struct net *net, struct rb_root *root,
if (no_gc || gc_count >= ARRAY_SIZE(gc_nodes))
continue;
 
-   /* only used for GC on hhead, retval and 'addit' ignored */
-   nf_conncount_lookup(net, &rbconn->hhead, tuple, zone, &addit);
+   nf_conncount_gc_list(net, rbconn);
if (hlist_empty(&rbconn->hhead))
gc_nodes[gc_count++] = rbconn;
}
-- 
2.7.4



[RFC nf-next 6/7] netfilter: nf_conncount: Add list lock and use RCU for init tree search

2018-06-18 Thread Yi-Hung Wei
From: Florian Westphal 

This patch adds a list lock to 'struct nf_conncount_list' so that we can
alter the lists containing the individual connections without holding the
main tree lock.  This is useful when we only need to add to or remove from
a list without allocating or removing a node in the tree.

This patch also uses RCU for the initial tree search.

We also update nft_connlimit accordingly, since we no longer need to
maintain a list lock in nft_connlimit.
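The lock-granularity idea can be illustrated in userspace C. This is a sketch under assumptions: the `locked` flag stands in for a kernel spinlock, and `conn_list_*` are invented names, not the nf_conncount API.

```c
/* Sketch: give each list its own lock so add/remove do not need the
 * tree-wide lock.  Returning "emptied" from delete mirrors how
 * conn_free() reports free_entry so the caller may reclaim the node. */
#include <assert.h>

struct conn_list {
    int locked;          /* stands in for spinlock_t list_lock */
    unsigned int count;  /* length of list */
};

static void list_lock(struct conn_list *l)   { assert(!l->locked); l->locked = 1; }
static void list_unlock(struct conn_list *l) { assert(l->locked);  l->locked = 0; }

void conn_list_init(struct conn_list *l)
{
    l->locked = 0;
    l->count = 0;
}

void conn_list_add(struct conn_list *l)
{
    list_lock(l);        /* only this list is locked, not the whole tree */
    l->count++;
    list_unlock(l);
}

/* Returns 1 when this removal emptied the list. */
int conn_list_del(struct conn_list *l)
{
    int empty;

    list_lock(l);
    if (l->count)
        l->count--;
    empty = (l->count == 0);
    list_unlock(l);
    return empty;
}
```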

Signed-off-by: Florian Westphal 
Signed-off-by: Yi-Hung Wei 
---
 include/net/netfilter/nf_conntrack_count.h |  3 +-
 net/netfilter/nf_conncount.c   | 98 +++---
 net/netfilter/nft_connlimit.c  | 15 +
 3 files changed, 66 insertions(+), 50 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_count.h b/include/net/netfilter/nf_conntrack_count.h
index dbec17f674b7..af0c1c95eccd 100644
--- a/include/net/netfilter/nf_conntrack_count.h
+++ b/include/net/netfilter/nf_conntrack_count.h
@@ -6,6 +6,7 @@
 struct nf_conncount_data;
 
 struct nf_conncount_list {
+   spinlock_t list_lock;
struct list_head head;  /* connections with the same filtering key */
unsigned int count; /* length of list */
 };
@@ -32,7 +33,7 @@ bool nf_conncount_add(struct nf_conncount_list *list,
  const struct nf_conntrack_tuple *tuple,
  const struct nf_conntrack_zone *zone);
 
-void nf_conncount_gc_list(struct net *net,
+bool nf_conncount_gc_list(struct net *net,
  struct nf_conncount_list *list);
 
 void nf_conncount_cache_free(struct nf_conncount_list *list);
diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index e6f854cc268e..760ba24db735 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -47,12 +47,14 @@ struct nf_conncount_tuple {
struct list_headnode;
struct nf_conntrack_tuple   tuple;
struct nf_conntrack_zonezone;
+   struct rcu_head rcu_head;
 };
 
 struct nf_conncount_rb {
struct rb_node node;
struct nf_conncount_list list;
u32 key[MAX_KEYLEN];
+   struct rcu_head rcu_head;
 };
 
static spinlock_t nf_conncount_locks[CONNCOUNT_LOCK_SLOTS] __cacheline_aligned_in_smp;
@@ -94,21 +96,42 @@ bool nf_conncount_add(struct nf_conncount_list *list,
return false;
conn->tuple = *tuple;
conn->zone = *zone;
+   spin_lock(&list->list_lock);
list_add_tail(&conn->node, &list->head);
list->count++;
+   spin_unlock(&list->list_lock);
return true;
 }
 EXPORT_SYMBOL_GPL(nf_conncount_add);
 
-static void conn_free(struct nf_conncount_list *list,
+static void __conn_free(struct rcu_head *h)
+{
+   struct nf_conncount_tuple *conn;
+
+   conn = container_of(h, struct nf_conncount_tuple, rcu_head);
+   kmem_cache_free(conncount_conn_cachep, conn);
+}
+
+static bool conn_free(struct nf_conncount_list *list,
  struct nf_conncount_tuple *conn)
 {
-   if (WARN_ON_ONCE(list->count == 0))
-   return;
+   bool free_entry = false;
+
+   spin_lock(&list->list_lock);
+
+   if (list->count == 0) {
+   spin_unlock(&list->list_lock);
+   return free_entry;
+   }
 
list->count--;
-   list_del(&conn->node);
-   kmem_cache_free(conncount_conn_cachep, conn);
+   list_del_rcu(&conn->node);
+   if (list->count == 0)
+   free_entry = true;
+
+   spin_unlock(&list->list_lock);
+   call_rcu(&conn->rcu_head, __conn_free);
+   return free_entry;
 }
 
 void nf_conncount_lookup(struct net *net,
@@ -166,12 +189,14 @@ EXPORT_SYMBOL_GPL(nf_conncount_lookup);
 
 void nf_conncount_list_init(struct nf_conncount_list *list)
 {
+   spin_lock_init(&list->list_lock);
INIT_LIST_HEAD(&list->head);
list->count = 1;
 }
 EXPORT_SYMBOL_GPL(nf_conncount_list_init);
 
-void nf_conncount_gc_list(struct net *net,
+/* Return true if the list is empty */
+bool nf_conncount_gc_list(struct net *net,
  struct nf_conncount_list *list)
 {
const struct nf_conntrack_tuple_hash *found;
@@ -182,7 +207,8 @@ void nf_conncount_gc_list(struct net *net,
list_for_each_entry_safe(conn, conn_n, &list->head, node) {
found = nf_conntrack_find_get(net, &conn->zone, &conn->tuple);
if (found == NULL) {
-   conn_free(list, conn);
+   if (conn_free(list, conn))
+   return true;
collected++;
continue;
}
@@ -194,18 +220,28 @@ void nf_conncount_gc_list(struct net *net,
 * closed already -> ditch it
 */
nf_ct_put(found_ct);
-   conn_free(list, conn);
+ 

[RFC nf-next 5/7] netfilter: nf_conncount: Split insert and traversal

2018-06-18 Thread Yi-Hung Wei
From: Florian Westphal 

When we have a very coarse grouping, e.g. by large subnets, zone id,
etc., it is likely that we do not need to do tree rotation because
we'll find a node where we can attach a new entry.  Based on this
observation, we split tree traversal and insertion.

Later on, we can make the traversal lockless (tree protected
by RCU), and add an extra lock in the individual nodes to protect list
insertion/deletion, thereby allowing parallel insert/delete in different
tree nodes.
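The traversal/insertion split can be sketched on a plain int BST in userspace C. All names (`find_key`, `insert_key`, `struct node`) are illustrative, not the kernel rbtree API; a concurrent version would hold the tree lock only inside the insert function.

```c
/* Sketch: a read-only descent, and a separate insert path that re-walks
 * from the root and attaches the new node. */
#include <stdlib.h>

struct node {
    int key;
    struct node *left, *right;
};

/* Traversal only: report whether key already exists. */
struct node *find_key(struct node *root, int key)
{
    while (root) {
        if (key < root->key)
            root = root->left;
        else if (key > root->key)
            root = root->right;
        else
            return root;
    }
    return NULL;
}

/* Insertion path: re-descend and link a new node.  If someone inserted
 * the key in between (the "unlikely" case in insert_tree()), reuse it. */
struct node *insert_key(struct node **root, int key)
{
    struct node **pp = root;

    while (*pp) {
        if (key < (*pp)->key)
            pp = &(*pp)->left;
        else if (key > (*pp)->key)
            pp = &(*pp)->right;
        else
            return *pp;             /* another "cpu" added it already */
    }

    struct node *n = calloc(1, sizeof(*n));
    n->key = key;
    *pp = n;
    return n;
}
```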

Signed-off-by: Florian Westphal 
Signed-off-by: Yi-Hung Wei 
---
 net/netfilter/nf_conncount.c | 87 ++--
 1 file changed, 67 insertions(+), 20 deletions(-)

diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 3ef6846f1b57..e6f854cc268e 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -220,6 +220,71 @@ static void tree_nodes_free(struct rb_root *root,
 }
 
 static unsigned int
+insert_tree(struct rb_root *root,
+   unsigned int hash,
+   const u32 *key,
+   u8 keylen,
+   const struct nf_conntrack_tuple *tuple,
+   const struct nf_conntrack_zone *zone)
+{
+   struct rb_node **rbnode, *parent;
+   struct nf_conncount_rb *rbconn;
+   struct nf_conncount_tuple *conn;
+   unsigned int count = 0;
+
+   spin_lock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
+
+   parent = NULL;
+   rbnode = &(root->rb_node);
+   while (*rbnode) {
+   int diff;
+   rbconn = rb_entry(*rbnode, struct nf_conncount_rb, node);
+
+   parent = *rbnode;
+   diff = key_diff(key, rbconn->key, keylen);
+   if (diff < 0) {
+   rbnode = &((*rbnode)->rb_left);
+   } else if (diff > 0) {
+   rbnode = &((*rbnode)->rb_right);
+   } else {
+   /* unlikely: other cpu added node already */
+   if (!nf_conncount_add(&rbconn->list, tuple, zone)) {
+   count = 0; /* hotdrop */
+   goto out_unlock;
+   }
+
+   count = rbconn->list.count;
+   goto out_unlock;
+   }
+   }
+
+   /* expected case: match, insert new node */
+   rbconn = kmem_cache_alloc(conncount_rb_cachep, GFP_ATOMIC);
+   if (rbconn == NULL)
+   goto out_unlock;
+
+   conn = kmem_cache_alloc(conncount_conn_cachep, GFP_ATOMIC);
+   if (conn == NULL) {
+   kmem_cache_free(conncount_rb_cachep, rbconn);
+   goto out_unlock;
+   }
+
+   conn->tuple = *tuple;
+   conn->zone = *zone;
+   memcpy(rbconn->key, key, sizeof(u32) * keylen);
+
+   nf_conncount_list_init(&rbconn->list);
+   list_add(&conn->node, &rbconn->list.head);
+   count = 1;
+
+   rb_link_node(&rbconn->node, parent, rbnode);
+   rb_insert_color(&rbconn->node, root);
+out_unlock:
+   spin_unlock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
+   return count;
+}
+
+static unsigned int
 count_tree(struct net *net,
   struct nf_conncount_data *data,
   const u32 *key,
@@ -230,7 +295,6 @@ count_tree(struct net *net,
struct rb_root *root;
struct rb_node **rbnode, *parent;
struct nf_conncount_rb *rbconn;
-   struct nf_conncount_tuple *conn;
unsigned int gc_count, hash;
bool no_gc = false;
unsigned int count = 0;
@@ -297,27 +361,10 @@ count_tree(struct net *net,
count = 0;
if (!tuple)
goto out_unlock;
-   /* no match, need to insert new node */
-   rbconn = kmem_cache_alloc(conncount_rb_cachep, GFP_ATOMIC);
-   if (rbconn == NULL)
-   goto out_unlock;
-
-   conn = kmem_cache_alloc(conncount_conn_cachep, GFP_ATOMIC);
-   if (conn == NULL) {
-   kmem_cache_free(conncount_rb_cachep, rbconn);
-   goto out_unlock;
-   }
-
-   conn->tuple = *tuple;
-   conn->zone = *zone;
-   memcpy(rbconn->key, key, sizeof(u32) * keylen);
 
-   nf_conncount_list_init(&rbconn->list);
-   list_add(&conn->node, &rbconn->list.head);
-   count = 1;
+   spin_unlock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
+   return insert_tree(root, hash, key, keylen, tuple, zone);
 
-   rb_link_node(&rbconn->node, parent, rbnode);
-   rb_insert_color(&rbconn->node, root);
 out_unlock:
spin_unlock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
return count;
-- 
2.7.4



[RFC nf-next 2/7] netfilter: nf_conncount: Switch to plain list

2018-06-18 Thread Yi-Hung Wei
From: Florian Westphal 

This patch switches from hlist to a plain list to store the list of
connections with the same filtering key in nf_conncount.  With a plain
list, we can insert new connections at the tail, so over time the
beginning of the list holds long-running connections and those are
expired first, while the newly created ones are at the end.

Later on, we could probably move checked entries to the end of the list,
so the next run has a higher chance of reclaiming stale entries at the front.
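The ordering property can be shown with a tiny userspace list. This is an illustration only: `struct entry`, `struct conn_list` and `list_add_tail_entry` are invented names, not the kernel `list_head` API.

```c
/* Sketch: appending at the tail keeps the oldest entries at the front,
 * so a GC pass can scan from the head and hit stale entries first. */
#include <stdlib.h>

struct entry {
    int id;
    struct entry *next;
};

struct conn_list {
    struct entry *head, *tail;
};

/* Append a new connection entry at the tail. */
void list_add_tail_entry(struct conn_list *l, int id)
{
    struct entry *e = calloc(1, sizeof(*e));

    e->id = id;
    if (l->tail)
        l->tail->next = e;
    else
        l->head = e;
    l->tail = e;
}
```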

Signed-off-by: Florian Westphal 
Signed-off-by: Yi-Hung Wei 
---
 include/net/netfilter/nf_conntrack_count.h | 15 --
 net/netfilter/nf_conncount.c   | 78 ++
 net/netfilter/nft_connlimit.c  | 24 -
 3 files changed, 71 insertions(+), 46 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_count.h b/include/net/netfilter/nf_conntrack_count.h
index 3a188a0923a3..e4884e0e4f69 100644
--- a/include/net/netfilter/nf_conntrack_count.h
+++ b/include/net/netfilter/nf_conntrack_count.h
@@ -1,8 +1,15 @@
 #ifndef _NF_CONNTRACK_COUNT_H
 #define _NF_CONNTRACK_COUNT_H
 
+#include 
+
 struct nf_conncount_data;
 
+struct nf_conncount_list {
+   struct list_head head;  /* connections with the same filtering key */
+   unsigned int count; /* length of list */
+};
+
struct nf_conncount_data *nf_conncount_init(struct net *net, unsigned int family,
unsigned int keylen);
 void nf_conncount_destroy(struct net *net, unsigned int family,
@@ -14,15 +21,17 @@ unsigned int nf_conncount_count(struct net *net,
const struct nf_conntrack_tuple *tuple,
const struct nf_conntrack_zone *zone);
 
-unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
+unsigned int nf_conncount_lookup(struct net *net, struct nf_conncount_list *list,
 const struct nf_conntrack_tuple *tuple,
 const struct nf_conntrack_zone *zone,
 bool *addit);
 
-bool nf_conncount_add(struct hlist_head *head,
+void nf_conncount_list_init(struct nf_conncount_list *list);
+
+bool nf_conncount_add(struct nf_conncount_list *list,
  const struct nf_conntrack_tuple *tuple,
  const struct nf_conntrack_zone *zone);
 
-void nf_conncount_cache_free(struct hlist_head *hhead);
+void nf_conncount_cache_free(struct nf_conncount_list *list);
 
 #endif
diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index ed87ad40541b..6eb52c2332bd 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -44,14 +44,14 @@
 
 /* we will save the tuples of all connections we care about */
 struct nf_conncount_tuple {
-   struct hlist_node   node;
+   struct list_headnode;
struct nf_conntrack_tuple   tuple;
struct nf_conntrack_zonezone;
 };
 
 struct nf_conncount_rb {
struct rb_node node;
-   struct hlist_head hhead; /* connections/hosts in same subnet */
+   struct nf_conncount_list list;
u32 key[MAX_KEYLEN];
 };
 
@@ -80,41 +80,55 @@ static int key_diff(const u32 *a, const u32 *b, unsigned int klen)
return memcmp(a, b, klen * sizeof(u32));
 }
 
-bool nf_conncount_add(struct hlist_head *head,
+bool nf_conncount_add(struct nf_conncount_list *list,
  const struct nf_conntrack_tuple *tuple,
  const struct nf_conntrack_zone *zone)
 {
struct nf_conncount_tuple *conn;
 
+   if (WARN_ON_ONCE(list->count > INT_MAX))
+   return false;
+
conn = kmem_cache_alloc(conncount_conn_cachep, GFP_ATOMIC);
if (conn == NULL)
return false;
conn->tuple = *tuple;
conn->zone = *zone;
-   hlist_add_head(&conn->node, head);
+   list_add_tail(&conn->node, &list->head);
+   list->count++;
return true;
 }
 EXPORT_SYMBOL_GPL(nf_conncount_add);
 
-unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
+static void conn_free(struct nf_conncount_list *list,
+ struct nf_conncount_tuple *conn)
+{
+   if (WARN_ON_ONCE(list->count == 0))
+   return;
+
+   list->count--;
+   list_del(&conn->node);
+   kmem_cache_free(conncount_conn_cachep, conn);
+}
+
+unsigned int nf_conncount_lookup(struct net *net,
+struct nf_conncount_list *list,
 const struct nf_conntrack_tuple *tuple,
 const struct nf_conntrack_zone *zone,
 bool *addit)
 {
const struct nf_conntrack_tuple_hash *found;
-   struct nf_conncount_tuple *conn;
-   struct hlist_node *n;
+   struct nf_conncount_tuple *conn, *conn_n;
struct nf_conn *found_ct;
unsigned

[RFC nf-next 7/7] netfilter: nf_conncount: Add garbage collection worker

2018-06-18 Thread Yi-Hung Wei
From: Florian Westphal 

Add a garbage collection worker.

Move the rbnode reclaim logic from the tree search path into the insert
path to avoid a race condition.

Signed-off-by: Florian Westphal 
Signed-off-by: Yi-Hung Wei 
---
 include/net/netfilter/nf_conntrack_count.h |  14 ++-
 net/netfilter/nf_conncount.c   | 163 +
 net/netfilter/nft_connlimit.c  |   2 +-
 3 files changed, 132 insertions(+), 47 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_count.h b/include/net/netfilter/nf_conntrack_count.h
index af0c1c95eccd..4b2b2baf8ab4 100644
--- a/include/net/netfilter/nf_conntrack_count.h
+++ b/include/net/netfilter/nf_conntrack_count.h
@@ -5,10 +5,17 @@
 
 struct nf_conncount_data;
 
+enum nf_conncount_list_add {
+   NF_CONNCOUNT_ADDED, /* list add was ok */
+   NF_CONNCOUNT_ERR,   /* -ENOMEM, must drop skb */
+   NF_CONNCOUNT_SKIP,  /* list is already reclaimed by gc */
+};
+
 struct nf_conncount_list {
spinlock_t list_lock;
struct list_head head;  /* connections with the same filtering key */
unsigned int count; /* length of list */
+   bool dead;
 };
 
struct nf_conncount_data *nf_conncount_init(struct net *net, unsigned int family,
@@ -29,9 +36,10 @@ void nf_conncount_lookup(struct net *net, struct nf_conncount_list *list,
 
 void nf_conncount_list_init(struct nf_conncount_list *list);
 
-bool nf_conncount_add(struct nf_conncount_list *list,
- const struct nf_conntrack_tuple *tuple,
- const struct nf_conntrack_zone *zone);
+enum nf_conncount_list_add
+nf_conncount_add(struct nf_conncount_list *list,
+const struct nf_conntrack_tuple *tuple,
+const struct nf_conntrack_zone *zone);
 
 bool nf_conncount_gc_list(struct net *net,
  struct nf_conncount_list *list);
diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 760ba24db735..7e93487e5856 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -62,6 +62,10 @@ static spinlock_t nf_conncount_locks[CONNCOUNT_LOCK_SLOTS] __cacheline_aligned_in_smp
 struct nf_conncount_data {
unsigned int keylen;
struct rb_root root[CONNCOUNT_SLOTS];
+   struct net *net;
+   struct work_struct gc_work;
+   unsigned long pending_trees[BITS_TO_LONGS(CONNCOUNT_SLOTS)];
+   unsigned int gc_tree;
 };
 
 static u_int32_t conncount_rnd __read_mostly;
@@ -82,25 +86,32 @@ static int key_diff(const u32 *a, const u32 *b, unsigned int klen)
return memcmp(a, b, klen * sizeof(u32));
 }
 
-bool nf_conncount_add(struct nf_conncount_list *list,
- const struct nf_conntrack_tuple *tuple,
- const struct nf_conntrack_zone *zone)
+enum nf_conncount_list_add
+nf_conncount_add(struct nf_conncount_list *list,
+const struct nf_conntrack_tuple *tuple,
+const struct nf_conntrack_zone *zone)
 {
struct nf_conncount_tuple *conn;
 
if (WARN_ON_ONCE(list->count > INT_MAX))
-   return false;
+   return NF_CONNCOUNT_ERR;
 
conn = kmem_cache_alloc(conncount_conn_cachep, GFP_ATOMIC);
if (conn == NULL)
-   return false;
+   return NF_CONNCOUNT_ERR;
+
conn->tuple = *tuple;
conn->zone = *zone;
spin_lock(&list->list_lock);
+   if (list->dead == true) {
+   kmem_cache_free(conncount_conn_cachep, conn);
+   spin_unlock(&list->list_lock);
+   return NF_CONNCOUNT_SKIP;
+   }
list_add_tail(&conn->node, &list->head);
list->count++;
spin_unlock(&list->list_lock);
-   return true;
+   return NF_CONNCOUNT_ADDED;
 }
 EXPORT_SYMBOL_GPL(nf_conncount_add);
 
@@ -192,6 +203,7 @@ void nf_conncount_list_init(struct nf_conncount_list *list)
spin_lock_init(>list_lock);
INIT_LIST_HEAD(>head);
list->count = 1;
+   list->dead = false;
 }
 EXPORT_SYMBOL_GPL(nf_conncount_list_init);
 
@@ -250,23 +262,39 @@ static void tree_nodes_free(struct rb_root *root,
 
while (gc_count) {
rbconn = gc_nodes[--gc_count];
-   rb_erase(&rbconn->node, root);
-   call_rcu(&rbconn->rcu_head, __tree_nodes_free);
+   spin_lock(&rbconn->list.list_lock);
+   if (rbconn->list.count == 0 && rbconn->list.dead == false) {
+   rbconn->list.dead = true;
+   rb_erase(&rbconn->node, root);
+   call_rcu(&rbconn->rcu_head, __tree_nodes_free);
+   }
+   spin_unlock(&rbconn->list.list_lock);
}
 }
 
+static void schedule_gc_worker(struct nf_conncount_data *data, int tree)
+{
+   set_bit(tree, data->pending_trees);
+   schedule_work(&data->gc_work);
+}
+
 static unsigned int
-insert_tree(struct rb_root *root,
+insert_tree(struct 

[RFC nf-next 4/7] netfilter: nf_conncount: Move locking into count_tree()

2018-06-18 Thread Yi-Hung Wei
From: Florian Westphal 

This is a preparation patch to allow lockless traversal
of the tree via RCU.

Signed-off-by: Florian Westphal 
Signed-off-by: Yi-Hung Wei 
---
 net/netfilter/nf_conncount.c | 52 +---
 1 file changed, 25 insertions(+), 27 deletions(-)

diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 6e77b2a2f851..3ef6846f1b57 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -220,18 +220,26 @@ static void tree_nodes_free(struct rb_root *root,
 }
 
 static unsigned int
-count_tree(struct net *net, struct rb_root *root,
-  const u32 *key, u8 keylen,
+count_tree(struct net *net,
+  struct nf_conncount_data *data,
+  const u32 *key,
   const struct nf_conntrack_tuple *tuple,
   const struct nf_conntrack_zone *zone)
 {
struct nf_conncount_rb *gc_nodes[CONNCOUNT_GC_MAX_NODES];
+   struct rb_root *root;
struct rb_node **rbnode, *parent;
struct nf_conncount_rb *rbconn;
struct nf_conncount_tuple *conn;
-   unsigned int gc_count;
+   unsigned int gc_count, hash;
bool no_gc = false;
+   unsigned int count = 0;
+   u8 keylen = data->keylen;
 
+   hash = jhash2(key, data->keylen, conncount_rnd) % CONNCOUNT_SLOTS;
+   root = &data->root[hash];
+
+   spin_lock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
  restart:
gc_count = 0;
parent = NULL;
@@ -250,20 +258,20 @@ count_tree(struct net *net, struct rb_root *root,
rbnode = &((*rbnode)->rb_right);
} else {
/* same source network -> be counted! */
-   unsigned int count;
-
nf_conncount_lookup(net, &rbconn->list, tuple, zone, &addit);
count = rbconn->list.count;
 
tree_nodes_free(root, gc_nodes, gc_count);
if (!addit)
-   return count;
+   goto out_unlock;
 
if (!nf_conncount_add(&rbconn->list, tuple, zone))
-   return 0; /* hotdrop */
+   count = 0; /* hotdrop */
+   goto out_unlock;
 
-   return count + 1;
+   count++;
+   goto out_unlock;
}
 
if (no_gc || gc_count >= ARRAY_SIZE(gc_nodes))
@@ -286,18 +294,18 @@ count_tree(struct net *net, struct rb_root *root,
goto restart;
}
 
+   count = 0;
if (!tuple)
-   return 0;
-
+   goto out_unlock;
/* no match, need to insert new node */
rbconn = kmem_cache_alloc(conncount_rb_cachep, GFP_ATOMIC);
if (rbconn == NULL)
-   return 0;
+   goto out_unlock;
 
conn = kmem_cache_alloc(conncount_conn_cachep, GFP_ATOMIC);
if (conn == NULL) {
kmem_cache_free(conncount_rb_cachep, rbconn);
-   return 0;
+   goto out_unlock;
}
 
conn->tuple = *tuple;
@@ -306,10 +314,13 @@ count_tree(struct net *net, struct rb_root *root,
 
nf_conncount_list_init(&rbconn->list);
list_add(&conn->node, &rbconn->list.head);
+   count = 1;
 
rb_link_node(&rbconn->node, parent, rbnode);
rb_insert_color(&rbconn->node, root);
-   return 1;
+out_unlock:
+   spin_unlock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
+   return count;
 }
 
 /* Count and return number of conntrack entries in 'net' with particular 'key'.
@@ -321,20 +332,7 @@ unsigned int nf_conncount_count(struct net *net,
const struct nf_conntrack_tuple *tuple,
const struct nf_conntrack_zone *zone)
 {
-   struct rb_root *root;
-   int count;
-   u32 hash;
-
-   hash = jhash2(key, data->keylen, conncount_rnd) % CONNCOUNT_SLOTS;
-   root = &data->root[hash];
-
-   spin_lock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
-
-   count = count_tree(net, root, key, data->keylen, tuple, zone);
-
-   spin_unlock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
-
-   return count;
+   return count_tree(net, data, key, tuple, zone);
 }
 EXPORT_SYMBOL_GPL(nf_conncount_count);
 
-- 
2.7.4



[RFC nf-next 3/7] netfilter: nf_conncount: Early exit in nf_conncount_lookup() and cleanup

2018-06-18 Thread Yi-Hung Wei
From: Florian Westphal 

This patch does the following three things.

It applies the same early-exit technique to nf_conncount_lookup().

Since we now keep the number of connections in 'struct nf_conncount_list',
we no longer need to return the count from nf_conncount_lookup().

Moreover, it exposes the garbage collection function nf_conncount_gc_list()
to nft_connlimit.

Signed-off-by: Florian Westphal 
Signed-off-by: Yi-Hung Wei 
---
 include/net/netfilter/nf_conntrack_count.h | 11 ++
 net/netfilter/nf_conncount.c   | 34 ++
 net/netfilter/nft_connlimit.c  |  9 
 3 files changed, 32 insertions(+), 22 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_count.h b/include/net/netfilter/nf_conntrack_count.h
index e4884e0e4f69..dbec17f674b7 100644
--- a/include/net/netfilter/nf_conntrack_count.h
+++ b/include/net/netfilter/nf_conntrack_count.h
@@ -21,10 +21,10 @@ unsigned int nf_conncount_count(struct net *net,
const struct nf_conntrack_tuple *tuple,
const struct nf_conntrack_zone *zone);
 
-unsigned int nf_conncount_lookup(struct net *net, struct nf_conncount_list *list,
-const struct nf_conntrack_tuple *tuple,
-const struct nf_conntrack_zone *zone,
-bool *addit);
+void nf_conncount_lookup(struct net *net, struct nf_conncount_list *list,
+const struct nf_conntrack_tuple *tuple,
+const struct nf_conntrack_zone *zone,
+bool *addit);
 
 void nf_conncount_list_init(struct nf_conncount_list *list);
 
@@ -32,6 +32,9 @@ bool nf_conncount_add(struct nf_conncount_list *list,
  const struct nf_conntrack_tuple *tuple,
  const struct nf_conntrack_zone *zone);
 
+void nf_conncount_gc_list(struct net *net,
+ struct nf_conncount_list *list);
+
 void nf_conncount_cache_free(struct nf_conncount_list *list);
 
 #endif
diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 6eb52c2332bd..6e77b2a2f851 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -111,24 +111,29 @@ static void conn_free(struct nf_conncount_list *list,
kmem_cache_free(conncount_conn_cachep, conn);
 }
 
-unsigned int nf_conncount_lookup(struct net *net,
-struct nf_conncount_list *list,
-const struct nf_conntrack_tuple *tuple,
-const struct nf_conntrack_zone *zone,
-bool *addit)
+void nf_conncount_lookup(struct net *net,
+struct nf_conncount_list *list,
+const struct nf_conntrack_tuple *tuple,
+const struct nf_conntrack_zone *zone,
+bool *addit)
 {
const struct nf_conntrack_tuple_hash *found;
struct nf_conncount_tuple *conn, *conn_n;
struct nf_conn *found_ct;
-   unsigned int length = 0;
+   unsigned int collect = 0;
 
+   /* best effort only */
*addit = tuple ? true : false;
 
/* check the saved connections */
list_for_each_entry_safe(conn, conn_n, &list->head, node) {
+   if (collect > CONNCOUNT_GC_MAX_NODES)
+   break;
+
found = nf_conntrack_find_get(net, &conn->zone, &conn->tuple);
if (found == NULL) {
conn_free(list, conn);
+   collect++;
continue;
}
 
@@ -137,9 +142,10 @@ unsigned int nf_conncount_lookup(struct net *net,
if (tuple && nf_ct_tuple_equal(&conn->tuple, tuple) &&
nf_ct_zone_equal(found_ct, zone, zone->dir)) {
/*
-* Just to be sure we have it only once in the list.
 * We should not see tuples twice unless someone hooks
 * this into a table without "-p tcp --syn".
+*
+* Attempt to avoid a re-add in this case.
 */
*addit = false;
} else if (already_closed(found_ct)) {
@@ -149,14 +155,12 @@ unsigned int nf_conncount_lookup(struct net *net,
 */
nf_ct_put(found_ct);
conn_free(list, conn);
+   collect++;
continue;
}
 
nf_ct_put(found_ct);
-   length++;
}
-
-   return length;
 }
 EXPORT_SYMBOL_GPL(nf_conncount_lookup);
 
@@ -167,8 +171,8 @@ void nf_conncount_list_init(struct nf_conncount_list *list)
 }
 EXPORT_SYMBOL_GPL(nf_conncount_list_init);
 
-stat

[PATCH nf] netfilter: Fix nf_conncount garbage collection

2018-06-12 Thread Yi-Hung Wei
Currently, we use check_hlist() for garbage collection. However, we
use the 'zone' of the connection being counted to look up the existing
entries in the hlist. This could be wrong when the saved entries are in
different zones, and this patch fixes the issue by storing the zone
alongside each entry.
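The bug class being fixed can be reduced to a few lines of userspace C. The types here (`struct zone`, `struct saved_conn`, `conn_matches`) are hypothetical simplifications, not conntrack's real structures.

```c
/* Sketch: each saved tuple records the zone it was added under, and any
 * later comparison must use that recorded zone, not the zone of the
 * packet currently being counted. */
#include <stdbool.h>

struct zone { int id; };

struct saved_conn {
    int tuple;
    struct zone zone;    /* zone captured when the entry was added */
};

/* Correct form: compare against the entry's own zone. */
bool conn_matches(const struct saved_conn *c, int tuple, const struct zone *z)
{
    return c->tuple == tuple && c->zone.id == z->id;
}
```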

Signed-off-by: Yi-Hung Wei 
---
This fix is based on current nf-next tree commit 44a53e0e6d55
("wcn36xx: Add support for Factory Test Mode (FTM)").
---
 include/net/netfilter/nf_conntrack_count.h |  3 ++-
 net/netfilter/nf_conncount.c   | 13 +
 net/netfilter/nft_connlimit.c  |  2 +-
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_count.h b/include/net/netfilter/nf_conntrack_count.h
index 1910b6572430..3a188a0923a3 100644
--- a/include/net/netfilter/nf_conntrack_count.h
+++ b/include/net/netfilter/nf_conntrack_count.h
@@ -20,7 +20,8 @@ unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
 bool *addit);
 
 bool nf_conncount_add(struct hlist_head *head,
- const struct nf_conntrack_tuple *tuple);
+ const struct nf_conntrack_tuple *tuple,
+ const struct nf_conntrack_zone *zone);
 
 void nf_conncount_cache_free(struct hlist_head *hhead);
 
diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 3b5059a8dcdd..d8383609fe28 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -46,6 +46,7 @@
 struct nf_conncount_tuple {
struct hlist_node   node;
struct nf_conntrack_tuple   tuple;
+   struct nf_conntrack_zonezone;
 };
 
 struct nf_conncount_rb {
@@ -80,7 +81,8 @@ static int key_diff(const u32 *a, const u32 *b, unsigned int klen)
 }
 
 bool nf_conncount_add(struct hlist_head *head,
- const struct nf_conntrack_tuple *tuple)
+ const struct nf_conntrack_tuple *tuple,
+ const struct nf_conntrack_zone *zone)
 {
struct nf_conncount_tuple *conn;
 
@@ -88,6 +90,7 @@ bool nf_conncount_add(struct hlist_head *head,
if (conn == NULL)
return false;
conn->tuple = *tuple;
+   conn->zone = *zone;
hlist_add_head(>node, head);
return true;
 }
@@ -108,7 +111,7 @@ unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
 
/* check the saved connections */
hlist_for_each_entry_safe(conn, n, head, node) {
-   found = nf_conntrack_find_get(net, zone, &conn->tuple);
+   found = nf_conntrack_find_get(net, &conn->zone, &conn->tuple);
if (found == NULL) {
hlist_del(&conn->node);
kmem_cache_free(conncount_conn_cachep, conn);
@@ -117,7 +120,8 @@ unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
 
found_ct = nf_ct_tuplehash_to_ctrack(found);
 
-   if (tuple && nf_ct_tuple_equal(&conn->tuple, tuple)) {
+   if (tuple && nf_ct_tuple_equal(&conn->tuple, tuple) &&
+   nf_ct_zone_equal(found_ct, zone, zone->dir)) {
/*
 * Just to be sure we have it only once in the list.
 * We should not see tuples twice unless someone hooks
@@ -196,7 +200,7 @@ count_tree(struct net *net, struct rb_root *root,
if (!addit)
return count;
 
-   if (!nf_conncount_add(&rbconn->hhead, tuple))
+   if (!nf_conncount_add(&rbconn->hhead, tuple, zone))
return 0; /* hotdrop */
 
return count + 1;
@@ -238,6 +242,7 @@ count_tree(struct net *net, struct rb_root *root,
}
 
conn->tuple = *tuple;
+   conn->zone = *zone;
memcpy(rbconn->key, key, sizeof(u32) * keylen);
 
INIT_HLIST_HEAD(&rbconn->hhead);
diff --git a/net/netfilter/nft_connlimit.c b/net/netfilter/nft_connlimit.c
index 50c068d660e5..a832c59f0a9c 100644
--- a/net/netfilter/nft_connlimit.c
+++ b/net/netfilter/nft_connlimit.c
@@ -52,7 +52,7 @@ static inline void nft_connlimit_do_eval(struct nft_connlimit *priv,
if (!addit)
goto out;
 
-   if (!nf_conncount_add(&priv->hhead, tuple_ptr)) {
+   if (!nf_conncount_add(&priv->hhead, tuple_ptr, zone)) {
regs->verdict.code = NF_DROP;
spin_unlock_bh(&priv->lock);
return;
-- 
2.7.4



[PATCH nf-next v2 1/2] netfilter: Refactor nf_conncount

2018-03-04 Thread Yi-Hung Wei
Remove the parameter 'family' from nf_conncount_count() and count_tree().
The parameter has been unused since commit 625c556118f3
("netfilter: connlimit: split xt_connlimit into front and backend").

Signed-off-by: Yi-Hung Wei <yihung@gmail.com>
---
 include/net/netfilter/nf_conntrack_count.h | 1 -
 net/netfilter/nf_conncount.c   | 4 +---
 net/netfilter/xt_connlimit.c   | 4 ++--
 3 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_count.h b/include/net/netfilter/nf_conntrack_count.h
index adf8db44cf86..e61184fbfb71 100644
--- a/include/net/netfilter/nf_conntrack_count.h
+++ b/include/net/netfilter/nf_conntrack_count.h
@@ -11,7 +11,6 @@ void nf_conncount_destroy(struct net *net, unsigned int family,
 unsigned int nf_conncount_count(struct net *net,
struct nf_conncount_data *data,
const u32 *key,
-   unsigned int family,
const struct nf_conntrack_tuple *tuple,
const struct nf_conntrack_zone *zone);
 #endif
diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 6d65389e308f..9305a08b4422 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -158,7 +158,6 @@ static void tree_nodes_free(struct rb_root *root,
 static unsigned int
 count_tree(struct net *net, struct rb_root *root,
   const u32 *key, u8 keylen,
-  u8 family,
   const struct nf_conntrack_tuple *tuple,
   const struct nf_conntrack_zone *zone)
 {
@@ -246,7 +245,6 @@ count_tree(struct net *net, struct rb_root *root,
 unsigned int nf_conncount_count(struct net *net,
struct nf_conncount_data *data,
const u32 *key,
-   unsigned int family,
const struct nf_conntrack_tuple *tuple,
const struct nf_conntrack_zone *zone)
 {
@@ -259,7 +257,7 @@ unsigned int nf_conncount_count(struct net *net,
 
	spin_lock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
 
-   count = count_tree(net, root, key, data->keylen, family, tuple, zone);
+   count = count_tree(net, root, key, data->keylen, tuple, zone);
 
	spin_unlock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
 
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index b1b17b9353e1..6275106ccf50 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -67,8 +67,8 @@ connlimit_mt(const struct sk_buff *skb, struct 
xt_action_param *par)
key[1] = zone->id;
}
 
-   connections = nf_conncount_count(net, info->data, key,
-xt_family(par), tuple_ptr, zone);
+   connections = nf_conncount_count(net, info->data, key, tuple_ptr,
+zone);
if (connections == 0)
/* kmalloc failed, drop it entirely */
goto hotdrop;
-- 
2.7.4



[PATCH nf-next v2 2/2] nf_conncount: Support count only use case

2018-03-04 Thread Yi-Hung Wei
Currently, nf_conncount_count() counts the number of connections that
match the key and inserts a conntrack 'tuple' with the same key into the
accounting data structure.  This patch supports another use case that
only counts the number of connections, where no 'tuple' is provided.
Therefore, proper changes are made to nf_conncount_count() to support the
case where 'tuple' is NULL.  This could be useful for querying statistics
or for debugging.

Signed-off-by: Yi-Hung Wei <yihung@gmail.com>
---
 net/netfilter/nf_conncount.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 9305a08b4422..153e690e2893 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -104,7 +104,7 @@ static unsigned int check_hlist(struct net *net,
struct nf_conn *found_ct;
unsigned int length = 0;
 
-   *addit = true;
+   *addit = tuple ? true : false;
 
/* check the saved connections */
hlist_for_each_entry_safe(conn, n, head, node) {
@@ -117,7 +117,7 @@ static unsigned int check_hlist(struct net *net,
 
found_ct = nf_ct_tuplehash_to_ctrack(found);
 
-   if (nf_ct_tuple_equal(&conn->tuple, tuple)) {
+   if (tuple && nf_ct_tuple_equal(&conn->tuple, tuple)) {
/*
 * Just to be sure we have it only once in the list.
 * We should not see tuples twice unless someone hooks
@@ -220,6 +220,9 @@ count_tree(struct net *net, struct rb_root *root,
goto restart;
}
 
+   if (!tuple)
+   return 0;
+
/* no match, need to insert new node */
rbconn = kmem_cache_alloc(conncount_rb_cachep, GFP_ATOMIC);
if (rbconn == NULL)
@@ -242,6 +245,9 @@ count_tree(struct net *net, struct rb_root *root,
return 1;
 }
 
+/* Count and return number of conntrack entries in 'net' with particular 'key'.
+ * If 'tuple' is not null, insert it into the accounting data structure.
+ */
 unsigned int nf_conncount_count(struct net *net,
struct nf_conncount_data *data,
const u32 *key,
-- 
2.7.4



Re: [PATCH nf-next 2/2] nf_conncount: Support count only use case

2018-03-01 Thread Yi-Hung Wei
On Thu, Mar 1, 2018 at 12:09 AM, Florian Westphal <f...@strlen.de> wrote:
> Yi-Hung Wei <yihung@gmail.com> wrote:
>> Currently, nf_conncount_count() counts the number of connections that
>> matches key and inserts a conntrack 'tuple' associated with the key into
>> the accounting data structure.  This patch supports another use case that
>> only counts the number of connections associated with the key without
>> providing a 'tuple'.  Therefore, proper changes are made on
>> nf_conncount_count() to support the case where 'tuple' is NULL.
>
> Normal use case is to combine this with another match to only lookup
> start of a connection (-p tcp --syn in iptables, or -m conntrack
> --ctstate NEW and the like).
>
> Could you perhaps illustrate how this is going to be used?
>

I am thinking about using the nf_conncount backend to limit the number
of connections in particular zones for OVS.  A use case for us is to
query the number of connections in a particular zone without adding
a new entry to that zone.  This could be useful for querying statistics
or for debugging.
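As an illustration of the contract being discussed (count matching entries, and insert only when a tuple is supplied), here is a minimal userspace model in C; the types and names below are hypothetical stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of the nf_conncount_count() contract discussed
 * above: with a tuple, the call counts existing entries and inserts
 * the new one; with tuple == NULL it only counts.  All names are
 * hypothetical userspace stand-ins, not kernel code. */
struct conn_list {
    int entries[64];
    unsigned int len;
};

static unsigned int count_connections(struct conn_list *list, const int *tuple)
{
    unsigned int count = list->len;

    if (tuple == NULL)                    /* count-only query: no insertion */
        return count;

    list->entries[list->len++] = *tuple;  /* normal path: insert as well */
    return count + 1;
}
```

A count-only query leaves the bookkeeping untouched, which is what the zone-statistics use case above needs.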

Thanks,

-Yi-Hung


Re: [PATCH nf-next 1/2] netfilter: nf_conncount: Refactor nf_conncount

2018-03-01 Thread Yi-Hung Wei
On Thu, Mar 1, 2018 at 12:07 AM, Florian Westphal  wrote:
>> 2. Move nf_ct_netns_get/put() to the user of nf_conncount.
>> Since nf_conncount now supports general keys, if the key is not related
>> to a particular NFPROTO_*, then it is not necessary to do
>> nf_ct_netns_get/put() in nf_conncount.
>
> I wonder if this is correct.
>
> conncount relies on all entries being backed by a conntrack entry so
> it can expire those that are no longer around.

Thanks for the comment.

You are right.  I will drop the second part of this patch in v2.

Thanks,

-Yi-Hung


[PATCH nf-next 2/2] nf_conncount: Support count only use case

2018-02-28 Thread Yi-Hung Wei
Currently, nf_conncount_count() counts the number of connections that
match the key and inserts a conntrack 'tuple' associated with the key into
the accounting data structure.  This patch supports another use case that
only counts the number of connections associated with the key, without
providing a 'tuple'.  Therefore, proper changes are made to
nf_conncount_count() to support the case where 'tuple' is NULL.

Signed-off-by: Yi-Hung Wei <yihung@gmail.com>
---
 net/netfilter/nf_conncount.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 91b13142631e..b247e82ae8e2 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -104,7 +104,7 @@ static unsigned int check_hlist(struct net *net,
struct nf_conn *found_ct;
unsigned int length = 0;
 
-   *addit = true;
+   *addit = tuple ? true : false;
 
/* check the saved connections */
hlist_for_each_entry_safe(conn, n, head, node) {
@@ -117,7 +117,7 @@ static unsigned int check_hlist(struct net *net,
 
found_ct = nf_ct_tuplehash_to_ctrack(found);
 
-   if (nf_ct_tuple_equal(&conn->tuple, tuple)) {
+   if (tuple && nf_ct_tuple_equal(&conn->tuple, tuple)) {
/*
 * Just to be sure we have it only once in the list.
 * We should not see tuples twice unless someone hooks
@@ -220,6 +220,9 @@ count_tree(struct net *net, struct rb_root *root,
goto restart;
}
 
+   if (!tuple)
+   return 0;
+
/* no match, need to insert new node */
rbconn = kmem_cache_alloc(conncount_rb_cachep, GFP_ATOMIC);
if (rbconn == NULL)
@@ -242,6 +245,9 @@ count_tree(struct net *net, struct rb_root *root,
return 1;
 }
 
+/* Count and return number of conntrack entries in 'net' with particular 'key'.
+ * If 'tuple' is not null, insert it into the accounting data structure.
+ */
 unsigned int nf_conncount_count(struct net *net,
struct nf_conncount_data *data,
const u32 *key,
-- 
2.7.4



[PATCH nf-next 1/2] netfilter: nf_conncount: Refactor nf_conncount

2018-02-28 Thread Yi-Hung Wei
This patch contains two parts.

1. Remove parameter 'family' in nf_conncount_count() and count_tree().
Before commit 625c556118f3 ("netfilter: connlimit: split xt_connlimit
into front and backend"), 'family' was used to determine the type
of nf_inet_addr, but the parameter is not useful after that commit.

2. Move nf_ct_netns_get/put() to the user of nf_conncount.
Since nf_conncount now supports general keys, if the key is not related
to a particular NFPROTO_*, then it is not necessary to do
nf_ct_netns_get/put() in nf_conncount.

Signed-off-by: Yi-Hung Wei <yihung@gmail.com>
---
 include/net/netfilter/nf_conntrack_count.h |  6 ++
 net/netfilter/nf_conncount.c   | 19 ---
 net/netfilter/xt_connlimit.c   | 16 
 3 files changed, 18 insertions(+), 23 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_count.h 
b/include/net/netfilter/nf_conntrack_count.h
index adf8db44cf86..c4f33f762ceb 100644
--- a/include/net/netfilter/nf_conntrack_count.h
+++ b/include/net/netfilter/nf_conntrack_count.h
@@ -3,15 +3,13 @@
 
 struct nf_conncount_data;
 
-struct nf_conncount_data *nf_conncount_init(struct net *net, unsigned int 
family,
+struct nf_conncount_data *nf_conncount_init(struct net *net,
unsigned int keylen);
-void nf_conncount_destroy(struct net *net, unsigned int family,
- struct nf_conncount_data *data);
+void nf_conncount_destroy(struct nf_conncount_data *data);
 
 unsigned int nf_conncount_count(struct net *net,
struct nf_conncount_data *data,
const u32 *key,
-   unsigned int family,
const struct nf_conntrack_tuple *tuple,
const struct nf_conntrack_zone *zone);
 #endif
diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 6d65389e308f..91b13142631e 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -158,7 +158,6 @@ static void tree_nodes_free(struct rb_root *root,
 static unsigned int
 count_tree(struct net *net, struct rb_root *root,
   const u32 *key, u8 keylen,
-  u8 family,
   const struct nf_conntrack_tuple *tuple,
   const struct nf_conntrack_zone *zone)
 {
@@ -246,7 +245,6 @@ count_tree(struct net *net, struct rb_root *root,
 unsigned int nf_conncount_count(struct net *net,
struct nf_conncount_data *data,
const u32 *key,
-   unsigned int family,
const struct nf_conntrack_tuple *tuple,
const struct nf_conntrack_zone *zone)
 {
@@ -259,7 +257,7 @@ unsigned int nf_conncount_count(struct net *net,
 
	spin_lock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
 
-   count = count_tree(net, root, key, data->keylen, family, tuple, zone);
+   count = count_tree(net, root, key, data->keylen, tuple, zone);
 
	spin_unlock_bh(&nf_conncount_locks[hash % CONNCOUNT_LOCK_SLOTS]);
 
@@ -267,11 +265,11 @@ unsigned int nf_conncount_count(struct net *net,
 }
 EXPORT_SYMBOL_GPL(nf_conncount_count);
 
-struct nf_conncount_data *nf_conncount_init(struct net *net, unsigned int 
family,
+struct nf_conncount_data *nf_conncount_init(struct net *net,
unsigned int keylen)
 {
struct nf_conncount_data *data;
-   int ret, i;
+   int i;
 
if (keylen % sizeof(u32) ||
keylen / sizeof(u32) > MAX_KEYLEN ||
@@ -284,12 +282,6 @@ struct nf_conncount_data *nf_conncount_init(struct net 
*net, unsigned int family
if (!data)
return ERR_PTR(-ENOMEM);
 
-   ret = nf_ct_netns_get(net, family);
-   if (ret < 0) {
-   kfree(data);
-   return ERR_PTR(ret);
-   }
-
for (i = 0; i < ARRAY_SIZE(data->root); ++i)
data->root[i] = RB_ROOT;
 
@@ -318,13 +310,10 @@ static void destroy_tree(struct rb_root *r)
}
 }
 
-void nf_conncount_destroy(struct net *net, unsigned int family,
- struct nf_conncount_data *data)
+void nf_conncount_destroy(struct nf_conncount_data *data)
 {
unsigned int i;
 
-   nf_ct_netns_put(net, family);
-
for (i = 0; i < ARRAY_SIZE(data->root); ++i)
	destroy_tree(&data->root[i]);
 
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index b1b17b9353e1..805e23155253 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -67,8 +67,8 @@ connlimit_mt(const struct sk_buff *skb, struct 
xt_action_param *par)
key[1] = zone->id;
}
 
-   connections = nf_conncount_count(net, info->data, key,
-xt_family(par), tuple_ptr,

Re: [PATCH v5 nf-next] netfilter: connlimit: split xt_connlimit into front and backend

2017-12-11 Thread Yi-Hung Wei
On Sat, Dec 9, 2017 at 12:01 PM, Florian Westphal <f...@strlen.de> wrote:
> This allows to reuse xt_connlimit infrastructure from nf_tables.
> The upcoming nf_tables frontend can just pass in an nftables register
> as input key, this allows limiting by any nft-supported key, including
> concatenations.
>
> For xt_connlimit, pass in the zone and the ip/ipv6 address.
>
> With help from Yi-Hung Wei.
>
> Signed-off-by: Florian Westphal <f...@strlen.de>
> ---
Thanks for working on this again. It looks good to me.

Acked-by: Yi-Hung Wei <yihung@gmail.com>


Re: [PATCH v4 nf-next] netfilter: connlimit: split xt_connlimit into front and backend

2017-12-08 Thread Yi-Hung Wei
On Fri, Dec 8, 2017 at 3:15 AM, Florian Westphal  wrote:
> --- /dev/null
> +++ b/include/net/netfilter/nf_conntrack_count.h
> @@ -0,0 +1,17 @@
> +#ifdef _NF_CONNTRACK_COUNT_H

#ifndef? Looks like a typo in v4.


> diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
> index a6214f235333..d69e2f41aaa6 100644
> --- a/net/netfilter/xt_connlimit.c
> +++ b/net/netfilter/xt_connlimit.c
>  static void connlimit_mt_destroy(const struct xt_mtdtor_param *par)
>  {
> const struct xt_connlimit_info *info = par->matchinfo;
> -   unsigned int i;
>
> nf_ct_netns_put(par->net, par->family);
I ran into some issues when loading and unloading xt_connlimit a couple
of times. It turns out that we do nf_ct_netns_put() twice; the other
call is in nf_conncount_destroy(). Maybe get rid of this line?

Thanks,

-Yi-Hung


Re: [PATCH v3 nf-next] netfilter: connlimit: split xt_connlimit into front/backend

2017-12-07 Thread Yi-Hung Wei
On Wed, Dec 6, 2017 at 4:41 AM, Florian Westphal  wrote:
> +static int __init nf_conncount_modinit(void)
> +{
> +   int ret, i;
> +
> +   BUILD_BUG_ON(CONNCOUNT_LOCK_SLOTS > CONNCOUNT_SLOTS);
> +   BUILD_BUG_ON((CONNCOUNT_SLOTS % CONNCOUNT_LOCK_SLOTS) != 0);
> +
> +   for (i = 0; i < CONNCOUNT_LOCK_SLOTS; ++i)
> +   spin_lock_init(&nf_conncount_locks[i]);
> +
> +   conncount_conn_cachep = kmem_cache_create("nf_conncount_tuple",
> +  sizeof(struct nf_conncount_tuple),
> +  0, 0, NULL);
> +   if (!conncount_conn_cachep)
> +   return -ENOMEM;
> +
> +   conncount_rb_cachep = kmem_cache_create("nf_conncount_rb",
> +  sizeof(struct nf_conncount_rb),
> +  0, 0, NULL);
> +   if (!conncount_rb_cachep) {
> +   kmem_cache_destroy(conncount_conn_cachep);
> +   return -ENOMEM;
> +   }
> +
> +   return ret;
I tested it with iptables -m connlimit, and nf_conncount may fail to
load sometimes.

It turns out that this is because ret is never initialized in this
function. Shall we get rid of ret and return 0 here?

Thanks,

-Yi-Hung


Re: [PATCH nf-next v2] netfilter: connlimit: split xt_connlimit into front/backend

2017-12-05 Thread Yi-Hung Wei
Thanks for working on this patch.

> --- /dev/null
> +++ b/include/net/netfilter/nf_conntrack_count.h
> @@ -0,0 +1,13 @@
Should it have something like the following in the header file?

#ifndef _NF_CONNTRACK_COUNT_H
#define _NF_CONNTRACK_COUNT_H

> +struct nf_conncount_data;
> +
> +struct nf_conncount_data *nf_conncount_init(struct net *net, unsigned int 
> family,
> +   unsigned int keylen);
> +void nf_conncount_destroy(struct net *net, unsigned int family,
> + struct nf_conncount_data *data);
> +
> +unsigned int nf_conncount_count(struct net *net,
> +   struct nf_conncount_data *data,
> +   const u32 *key,
> +   unsigned int family,
> +   const struct nf_conntrack_tuple *tuple,
> +   const struct nf_conntrack_zone *zone);

#endif /*_NF_CONNTRACK_COUNT_H*/


> --- /dev/null
> +++ b/net/netfilter/nf_conncount.c
> +
> +struct nf_conncount_data *nf_conncount_init(struct net *net, unsigned int 
> family,
> +   unsigned int keylen)
> +{
> +   struct nf_conncount_data *data;
> +   int ret, i;
> +
> +   if (keylen % sizeof(u32) ||
> +   keylen / sizeof(u32) > MAX_KEYLEN ||
> +   keylen == 0)
> +   return ERR_PTR(-EINVAL);
Just want to check the case where users count only by zone: since the
zone id is only 2 bytes, the user should claim 4 bytes as the keylen,
right?
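If that reading is right, a zone-only caller would widen the 2-byte id into one full u32 key slot, roughly like this (a hypothetical userspace sketch, not the actual caller):

```c
#include <assert.h>
#include <stdint.h>

/* nf_conncount_init() rejects keylen values that are not a multiple of
 * sizeof(u32), so a 2-byte zone id would be passed widened to one u32
 * slot.  Sketch of such a caller; names are made up. */
enum { ZONE_KEYLEN = sizeof(uint32_t) };   /* 4 bytes: one u32 slot */

static uint32_t zone_only_key(uint16_t zone_id)
{
    return (uint32_t)zone_id;              /* upper 16 bits stay zero */
}
```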


> +
> +   net_get_random_once(&conncount_rnd, sizeof(conncount_rnd));
> +
> +   ret = nf_ct_netns_get(net, family);
> +   if (ret < 0)
> +   return ERR_PTR(ret);
> +
> +   data = kmalloc(sizeof(*data), GFP_KERNEL);
> +   if (!data)
> +   return ERR_PTR(-ENOMEM);
Should we call nf_ct_netns_put() in the error case?


> +
> +   for (i = 0; i < ARRAY_SIZE(data->root); ++i)
> +   data->root[i] = RB_ROOT;
> +
> +   data->keylen = keylen / sizeof(u32);
> +
> +   return data;
> +}
> +EXPORT_SYMBOL_GPL(nf_conncount_init);

Thanks,

-Yi-Hung


Re: [RFC] connlimit for nftables and limiting ct count by zone

2017-10-31 Thread Yi-Hung Wei
> I think we could possible use maps to fetch values based on a key, e.g.
>
> nft add map filter zonelimits { type typeof(ct zone) : typeof(ct count); 
> flags interval\;}
> nft add element filter zonelimits { 1-100 : 5 }
> nft add element filter zonelimits { 101 : 6 }
>
> nft add rule filter prerouting \
>   ct count { ip saddr & 255.255.255.0 } gt { ct zone }
>
> None of this is implemented or planned, just an idea.
>
> Don't think we should bother about this at this time and
> focus on the conntrack counting infrastructure first.
You are right, the first step should be focusing on the conntrack
counting infrastructure, and supporting that using maps looks handy.


>> How would we use nft to get the ct limit and ct count on
>> particular key ? It may be useful for debugging and monitoring
>> purpose?
>
> Really?  So far my impression was that ovs does not wish
> to interact with nft in any way.
I was thinking about how to use nft to get the ct limit and ct count,
not from OVS though. Say an nft user may want to know how many active
connections there are in a particular zone.


>> It is not clear to me that why we have 'u32 key[5]'. Is it because the
>> key would be the combination of Ip/ipv6 src or dst address with mask
>> (up to 16bytes), src or dst port (2 bytes), and ct zone (2 bytes)?
>
> Yes, xt_connlimit would have
>
> u32 key[5];
>
> key[0] = zone;
> key[1] = iphdr->saddr;
>
> nf_connlimit_count(..., key, 8);
>
> key[2], [3], [4] are only needed for ipv6.
Thanks for the explanation; this part is clear now.
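Spelled out, the key layout described above could look like the following userspace sketch (helper names are made up; this is not the actual xt_connlimit code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the layout discussed above: key[0] carries the zone,
 * key[1] the IPv4 source address (keylen 8), while an IPv6 address
 * fills key[1..4] (keylen 20).  Helper names are hypothetical. */
static unsigned int build_key_v4(uint32_t key[5], uint16_t zone, uint32_t saddr)
{
    key[0] = zone;
    key[1] = saddr;
    return 2 * sizeof(uint32_t);          /* keylen passed to the backend: 8 */
}

static unsigned int build_key_v6(uint32_t key[5], uint16_t zone,
                                 const uint32_t saddr6[4])
{
    key[0] = zone;
    memcpy(&key[1], saddr6, 4 * sizeof(uint32_t));
    return 5 * sizeof(uint32_t);          /* keylen: 20 */
}
```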


>> I am wondering if there will be APIs to set and get the conntrack
>> limit on particular key,
>
> No, this would not be provided by that backend, as the backend doesn't
> know how the result is used.  It just returns the number of connections
> that match the current key.
>
>> and also query the ct count on particular key
>> in the kernel space?
>
> For what purpose would that be needed?
I think as long as user space utilities, e.g. nft, can query the ct
limit and ct count, it is not necessary to support that in kernel
space.


>> Just as you mentioned in the DESTROY event approach, I am wondering if
>> we can store an extra pointer to the tree node, maybe in struct
>> nf_ct_ext, so that we can associate the nf_conn with the tree node,
>> increase the count when the connection is confirmed, and reduce the
>> count when the connection is destroyed. In this way, with some extra
>> space, inserting and updating the count in the tree node is O(1) and
>> we can get rid of the GC scheme. I am not sure how much effort it
>> takes to modify the conntrack core though.
>
> There could be dozens of connlimit trees.
>
> For example, a user could configure per-zone limit and a per-host limit
> at the same time.
You are right, in this case we can store a list of pointers to
multiple trees. I do not have a strong opinion on this though; I just
want to check whether it can help the design.

Thanks,

-Yi-Hung


Re: [RFC] connlimit for nftables and limiting ct count by zone

2017-10-26 Thread Yi-Hung Wei
On Mon, Oct 16, 2017 at 7:42 AM, Florian Westphal  wrote:
> During NFWS we briefly discussed iptables '-m connlimit' and how
> to apply this to nftables.
>
> There is also a use case to make nf_conntrack_max more fine-grained
> by making this setting apply per conntrack zone.
>
> I'd like to ideally resolve both with a single solution.
Thanks for the proposal. It would be great to be able to resolve the
two issues at the same time.


> From nft (user front end) this would look like this:
>
> add rule inet filter input ct state new ct count { ip saddr } gt 100 drop
>
> The part enclosed in { } denotes the key/grouping that should be applied.
>
> e.g.
>
> ct count { ip saddr & 255.255.255.0 }   # count by source network
> ct count { ip saddr & 255.255.255.0 . tcp dport }   # count by source network 
> and service
> ct count { ct zone } # count by zone
Does it mean that all the ct zones will have the same limit? If so,
shall we extend this syntax to support a different limit for each
zone? How would we use nft to get the ct limit and ct count for a
particular key? It may be useful for debugging and monitoring
purposes.

Other than using nft to configure the ct limit, if other user space
utilities want to set/get the ct limit or query the current ct count,
would there be any API to achieve that?


> For this to work there are several issues that need to be resolved.
>
> 1. xt_connlimit must be split into an iptables part and a 'nf_connlimit'
>backend.
>
>nf_connlimit.c would implement the main function:
>
> unsigned int nf_connlimit_count(struct nf_connlimit_data *,
> const struct nf_conn *conn,
> const void *key,
> u16 keylen);
>
> Where 'nf_connlimit_data' is a structure that contains the (internal)
> bookkeeping structure(s), conn is the connection that is to be counted,
> and key/keylen is the (arbitrary) identifier to be used for grouping
> connections.
>
> xt_connlimits match function would then build a 'u32 key[5]' based on
> the options the user provided on the iptables command line, i.e.
> the conntrack zone and then either a source or destination network
> (or address).
It is not clear to me why we have 'u32 key[5]'. Is it because the key
would be the combination of an IPv4/IPv6 src or dst address with mask
(up to 16 bytes), src or dst port (2 bytes), and ct zone (2 bytes)?

Do we need to specify the order for the key? How do we distinguish the
case where we have a zone (u16) as a key vs. a dport (u16) as a key?


> 3. Other users, such as ovs, could also call this api, in the case of
> per-zone limiting key would simply be a u16 containing the zone identifier.
I am wondering if there will be APIs to set and get the conntrack
limit for a particular key, and also to query the ct count for a
particular key from kernel space?


> Right now, connlimit performs garbage collection from the packet path.
> This isn't a big deal now, as we limit based by single ip address or network
> only, the limit will be small.
>
> bookkeeping looks like this:
>
>   Node
>   / \
>Node  Node  -> conn1 -> conn2 -> conn3
>  /
>   Node
>
> Each node contains a single-linked list of connections that share the
> same source (address/network).
>
> When searching for the Node, all Nodes traversed get garbage-collected,
> i.e. connlimit walks the hlist attached to the node and removes any tuple
> no longer stored in the main conntrack table.  If the node then has
> empty list, it gets erased from the tree.
Thanks for the explanation. It saved me quite some time in
understanding the code structure there.
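The walk described above (prune entries whose conntrack entry is gone, report how many remain, erase the node when empty) can be modeled as a plain singly linked list walk; here alive() stands in for the lookup in the main conntrack table, and every name is illustrative rather than kernel code:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal userspace model of the per-node garbage collection: drop
 * list entries whose conntrack entry has expired and return how many
 * live connections remain on the node. */
struct conn {
    int id;                 /* stand-in for the stored tuple */
    struct conn *next;
};

static int alive(const struct conn *c)
{
    return c->id >= 0;      /* stand-in for the main-table lookup */
}

static unsigned int gc_list(struct conn **head)
{
    struct conn **pp = head;
    unsigned int count = 0;

    while (*pp) {
        struct conn *c = *pp;

        if (!alive(c)) {    /* tuple no longer in the conntrack table */
            *pp = c->next;
            free(c);
            continue;
        }
        count++;
        pp = &c->next;
    }
    return count;           /* the node can be erased when this is 0 */
}
```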


> But if we limit by zone then its entirely reasonable to have a limit
> of e.g.  10k per zone, i.e. each Node could have a list consisting
> of 10k elements.  Walking a 10k list is not acceptable from packet path.
>
> Instead, I propose to store a timestamp of last gc in each node, so
> we can skip intermediate nodes that had a very recent gc run.
>
> If we find the node (zone, source network, etc. etc) that we should
> count the new connection for, then do an on-demand garbage collection.
>
> This is unfortunately required, we don't want to say '150' if the limit
> is 150 then the new connection would be dropped, unless we really still
> have 150 active connections.
>
> To resolve this, i suggest to store the count in the node itself
> (so we do not need to walk the full list of individual connections in the
>  packet path).
>
> The hlist would be replaced with another tree:
>
> This allows us to quickly find out if the tuple we want to count now
> is already stored (e.g. because of address/port reuse) or not.
>
> It also permits us to check all tuples we see while searching for
> the 'new' tuple that should be stored in the subtree and remove those that
> are no longer in the main conntrack table.
> We could also abort a packet-path gc run if we found at least one old
> connection.
I agree that it is