Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer

2016-08-22 Thread Florian Westphal
Eric Dumazet  wrote:
> On Fri, 2016-08-19 at 18:04 +0200, Florian Westphal wrote:
> > Eric Dumazet  wrote:
> > > On Fri, 2016-08-19 at 17:16 +0200, Florian Westphal wrote:
> > > 
> > > > Hmm, nf_conntrack_find caller needs to hold rcu_read_lock,
> > > > in case object is free'd SLAB_DESTROY_BY_RCU should delay actual release
> > > > of the page.
> > > 
> > > Well, point is that SLAB_DESTROY_BY_RCU means that we have no grace
> > > period, and object can be immediately reused and recycled.
> > > 
> > > @next pointer can definitely be overwritten.
> > 
> > I see.  Isn't that detected by the nulls magic (to restart the
> > lookup if the entry was moved to another chain due to an overwritten next pointer)?
> 
> Well, you did not add the nulls magic in your code ;)

Oh.  Right, it's indeed missing in the gc code.

> It might be fine, since it should be a rare event, and garbage
> collection is best effort, so you might add a comment in gc_worker() why
> it is probably overkill to restart the loop in this unlikely event.

Seems like a good idea, I will add it.
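
Something along these lines, perhaps (gc_worker() only shows up in a
later patch of this series, so this is just a sketch of the comment,
not tested code):

	/*
	 * We purposely don't do the nulls re-check / goto-restart here:
	 * if a recycled entry moved us onto another chain we may scan a
	 * few extra entries or miss a few, but gc is best effort and the
	 * next run (or an ordinary lookup) will pick them up.
	 */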

> BTW, maybe nf_conntrack_tuple_taken() should get the nulls magic check,
> as it is currently missing.

Good point, I will investigate.
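
For reference, that would be the usual nulls re-test after the lockless
walk -- a rough sketch only, not a tested patch, and 'begin'/'hash' are
placeholders for whatever nf_conntrack_tuple_taken() ends up using:

	/* after the hlist_nulls_for_each_entry_rcu() walk: if the
	 * terminator we reached belongs to a different bucket, the
	 * entry we followed was recycled onto another chain, so rescan.
	 */
	if (get_nulls_value(n) != hash)
		goto begin;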

Thanks Eric!


Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer

2016-08-21 Thread Eric Dumazet
On Fri, 2016-08-19 at 18:04 +0200, Florian Westphal wrote:
> Eric Dumazet  wrote:
> > On Fri, 2016-08-19 at 17:16 +0200, Florian Westphal wrote:
> > 
> > > Hmm, nf_conntrack_find caller needs to hold rcu_read_lock,
> > > in case object is free'd SLAB_DESTROY_BY_RCU should delay actual release
> > > of the page.
> > 
> > Well, point is that SLAB_DESTROY_BY_RCU means that we have no grace
> > period, and object can be immediately reused and recycled.
> > 
> > @next pointer can definitely be overwritten.
> 
> I see.  Isn't that detected by the nulls magic (to restart the
> lookup if the entry was moved to another chain due to an overwritten next pointer)?

Well, you did not add the nulls magic in your code ;)

It might be fine, since it should be a rare event, and garbage
collection is best effort, so you might add a comment in gc_worker() why
it is probably overkill to restart the loop in this unlikely event.

BTW, maybe nf_conntrack_tuple_taken() should get the nulls magic check,
as it is currently missing.






Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer

2016-08-19 Thread Florian Westphal
Eric Dumazet  wrote:
> On Fri, 2016-08-19 at 17:16 +0200, Florian Westphal wrote:
> 
> > Hmm, nf_conntrack_find caller needs to hold rcu_read_lock,
> > in case object is free'd SLAB_DESTROY_BY_RCU should delay actual release
> > of the page.
> 
> Well, point is that SLAB_DESTROY_BY_RCU means that we have no grace
> period, and object can be immediately reused and recycled.
> 
> @next pointer can definitely be overwritten.

I see.  Isn't that detected by the nulls magic (to restart the
lookup if the entry was moved to another chain due to an overwritten next pointer)?

I don't see a hlist_nulls_for_each_entry_rcu_safe variant like we have
for normal lists (list_for_each_entry_safe), do I need to add one?
Currently we have this:

rcu_read_lock()
  nf_conntrack_find
    hlist_nulls_for_each_entry_rcu ...

      if (nf_ct_is_expired(ct)) { // lazy test, ct can be reused here
              nf_ct_gc_expired(ct);
              continue;
      }

nf_ct_gc_expired() does this:

static void nf_ct_gc_expired(struct nf_conn *ct)
{
	if (!atomic_inc_not_zero(&ct->ct_general.use))
		return;

	if (nf_ct_should_gc(ct))
		nf_ct_kill(ct);

	nf_ct_put(ct);
}

So we only call nf_ct_kill() when we have a reference
and we know the ct is in the hash (the check is in nf_ct_should_gc()).

So once we put, the object can be released, but the
next pointer is not altered.

In case the ct object is reused right after this last
nf_ct_put, it could already be re-inserted, e.g. into the
dying list or back into the hash table -- but is this a problem?

A 'dead' (non-recycled) element is caught by the atomic_inc_not_zero
in nf_ct_gc_expired, an about-to-be-inserted entry (not yet in the hash)
is detected by nf_ct_should_gc() (it checks the confirmed bit in
ct->status), and an already-inserted one (back in the hash table) would
make us 'magically' move to a different hash chain in
hlist_nulls_for_each_entry_rcu.

But we would then hit the wrong nulls list terminator and restart
the traversal.
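
(That restart is the usual pattern at the end of the lookup walk --
roughly what ____nf_conntrack_find() already does, paraphrased here:)

	if (get_nulls_value(n) != bucket) {
		/* chain changed under us, we ended on another bucket's
		 * terminator: redo the walk from the start.
		 */
		NF_CT_STAT_INC_ATOMIC(net, search_restart);
		goto begin;
	}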

Does that make sense?


Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer

2016-08-19 Thread Eric Dumazet
On Fri, 2016-08-19 at 17:16 +0200, Florian Westphal wrote:

> Hmm, nf_conntrack_find caller needs to hold rcu_read_lock,
> in case object is free'd SLAB_DESTROY_BY_RCU should delay actual release
> of the page.

Well, point is that SLAB_DESTROY_BY_RCU means that we have no grace
period, and object can be immediately reused and recycled.

@next pointer can definitely be overwritten.
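
(That recycling is why the find-and-get path re-checks the key after
taking its reference -- roughly the existing __nf_conntrack_find_get()
logic, paraphrased; the @next pointer itself is only covered by the
nulls terminator check:)

	h = ____nf_conntrack_find(net, zone, tuple, hash);
	if (h) {
		ct = nf_ct_tuplehash_to_ctrack(h);
		if (unlikely(nf_ct_is_dying(ct) ||
			     !atomic_inc_not_zero(&ct->ct_general.use))) {
			h = NULL;
		} else if (unlikely(!nf_ct_key_equal(h, tuple, zone, net))) {
			/* slot was recycled for another tuple meanwhile */
			nf_ct_put(ct);
			goto begin;
		}
	}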

> 
> Should be same as (old) 'death_by_timeout' timer firing during
> hlist_nulls_for_each_entry_rcu walk.
> 
> Thanks for looking at this Eric!




Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer

2016-08-19 Thread Florian Westphal
Eric Dumazet  wrote:
> > +/* caller must hold rcu readlock and none of the nf_conntrack_locks */
> > +static void nf_ct_gc_expired(struct nf_conn *ct)
> > +{
> > +   if (!atomic_inc_not_zero(&ct->ct_general.use))
> > +   return;
> > +
> > +   if (nf_ct_should_gc(ct))
> > +   nf_ct_kill(ct);
> > +
> > +   nf_ct_put(ct);
> > +}
> > +
> >  /*
> >   * Warning :
> >   * - Caller must take a reference on returned object
> > @@ -499,6 +505,17 @@ begin:
> > bucket = reciprocal_scale(hash, hsize);
> >  
> > hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[bucket], hnnode) {
> > +   struct nf_conn *ct;
> > +
> > +   ct = nf_ct_tuplehash_to_ctrack(h);
> > +   if (nf_ct_is_expired(ct)) {
> > +   nf_ct_gc_expired(ct);
> > +   continue;
> > +   }
> > +
> > +   if (nf_ct_is_dying(ct))
> > +   continue;
> > +
> > if (nf_ct_key_equal(h, tuple, zone, net)) {
> > NF_CT_STAT_INC_ATOMIC(net, found);
> > return h;
> 
> Florian, I do not see how this part is safe against concurrent lookups
> and deletes ?
> 
> At least the hlist_nulls_for_each_entry_rcu() looks buggy, since
> fetching the next pointer would trigger a use after free ?

Hmm, nf_conntrack_find caller needs to hold rcu_read_lock,
in case object is free'd SLAB_DESTROY_BY_RCU should delay actual release
of the page.

Should be same as (old) 'death_by_timeout' timer firing during
hlist_nulls_for_each_entry_rcu walk.

Thanks for looking at this Eric!


Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer

2016-08-19 Thread Eric Dumazet
On Fri, 2016-08-19 at 13:36 +0200, Florian Westphal wrote:
> With stats enabled this eats 80 bytes on x86_64 per nf_conn entry.
> 
> Remove it and use a 32bit jiffies value containing the timestamp until
> which the entry is valid.

Great work !

...

> +/* caller must hold rcu readlock and none of the nf_conntrack_locks */
> +static void nf_ct_gc_expired(struct nf_conn *ct)
> +{
> + if (!atomic_inc_not_zero(&ct->ct_general.use))
> + return;
> +
> + if (nf_ct_should_gc(ct))
> + nf_ct_kill(ct);
> +
> + nf_ct_put(ct);
> +}
> +
>  /*
>   * Warning :
>   * - Caller must take a reference on returned object
> @@ -499,6 +505,17 @@ begin:
>   bucket = reciprocal_scale(hash, hsize);
>  
>   hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[bucket], hnnode) {
> + struct nf_conn *ct;
> +
> + ct = nf_ct_tuplehash_to_ctrack(h);
> + if (nf_ct_is_expired(ct)) {
> + nf_ct_gc_expired(ct);
> + continue;
> + }
> +
> + if (nf_ct_is_dying(ct))
> + continue;
> +
>   if (nf_ct_key_equal(h, tuple, zone, net)) {
>   NF_CT_STAT_INC_ATOMIC(net, found);
>   return h;

Florian, I do not see how this part is safe against concurrent lookups
and deletes ?

At least the hlist_nulls_for_each_entry_rcu() looks buggy, since
fetching the next pointer would trigger a use after free ?

Thanks !





[PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer

2016-08-19 Thread Florian Westphal
With stats enabled this eats 80 bytes on x86_64 per nf_conn entry.

Remove it and use a 32bit jiffies value containing the timestamp until
which the entry is valid.

During conntrack lookup, even before doing the tuple comparison, check
the timeout value and evict the entry in case it is too old.
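
For illustration only (example values made up), that check relies on the
usual wrap-safe signed difference of two u32 jiffies values, as done by
nf_ct_is_expired() below:

	u32 timeout = 0x00000010;	/* ct->timeout, set after jiffies wrapped */
	u32 now     = 0xfffffff0;	/* nfct_time_stamp, just before the wrap  */

	/* timeout - now == 0x20, and (s32)0x20 == +32 > 0: not yet expired,
	 * even though timeout < now when compared as plain unsigned values.
	 */
	bool expired = (s32)(timeout - now) <= 0;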

The dying bit is used as a synchronization point to avoid races where
multiple cpus try to evict the same entry.

Because lookup is always lockless, we need to bump the refcnt once
when we evict, else we could try to evict an already-dead entry that
is being recycled.

This is the standard/expected way when conntrack entries are destroyed.

Followup patches will introduce garbage collection via a work queue
and further places where we can reap obsolete entries (e.g. during
netlink dumps); this is needed to prevent expired conntracks from
hanging around for too long when the lookup rate is low after a busy
period.

Signed-off-by: Florian Westphal 
---
 include/net/netfilter/nf_conntrack.h | 23 +++--
 net/netfilter/nf_conntrack_core.c    | 91 
 net/netfilter/nf_conntrack_netlink.c | 14 ++
 net/netfilter/nf_conntrack_pptp.c    |  3 +-
 net/netfilter/nf_nat_core.c          |  6 ---
 5 files changed, 74 insertions(+), 63 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 2a12748..6d8cf06 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -42,7 +42,6 @@ union nf_conntrack_expect_proto {
 
 #include 
 #include 
-#include 
 
 #ifdef CONFIG_NETFILTER_DEBUG
 #define NF_CT_ASSERT(x)   WARN_ON(!(x))
@@ -73,7 +72,7 @@ struct nf_conn_help {
 #include 
 
 struct nf_conn {
-   /* Usage count in here is 1 for hash table/destruct timer, 1 per skb,
+   /* Usage count in here is 1 for hash table, 1 per skb,
 * plus 1 for any connection(s) we are `master' for
 *
 * Hint, SKB address this struct and refcnt via skb->nfct and
@@ -96,8 +95,8 @@ struct nf_conn {
/* Have we seen traffic both ways yet? (bitset) */
unsigned long status;
 
-   /* Timer function; drops refcnt when it goes off. */
-   struct timer_list timeout;
+   /* jiffies32 when this ct is considered dead */
+   u32 timeout;
 
possible_net_t ct_net;
 
@@ -291,14 +290,28 @@ static inline bool nf_is_loopback_packet(const struct sk_buff *skb)
return skb->dev && skb->skb_iif && skb->dev->flags & IFF_LOOPBACK;
 }
 
+#define nfct_time_stamp ((u32)(jiffies))
+
 /* jiffies until ct expires, 0 if already expired */
 static inline unsigned long nf_ct_expires(const struct nf_conn *ct)
 {
-   long timeout = (long)ct->timeout.expires - (long)jiffies;
+   s32 timeout = ct->timeout - nfct_time_stamp;
 
return timeout > 0 ? timeout : 0;
 }
 
+static inline bool nf_ct_is_expired(const struct nf_conn *ct)
+{
+   return (__s32)(ct->timeout - nfct_time_stamp) <= 0;
+}
+
+/* use after obtaining a reference count */
+static inline bool nf_ct_should_gc(const struct nf_conn *ct)
+{
+   return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
+  !nf_ct_is_dying(ct);
+}
+
 struct kernel_param;
 
 int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 7d90a5d..38c50d1 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -371,7 +371,6 @@ destroy_conntrack(struct nf_conntrack *nfct)
 
pr_debug("destroy_conntrack(%p)\n", ct);
NF_CT_ASSERT(atomic_read(&nfct->use) == 0);
-   NF_CT_ASSERT(!timer_pending(&ct->timeout));
 
if (unlikely(nf_ct_is_template(ct))) {
nf_ct_tmpl_free(ct);
@@ -434,35 +433,30 @@ bool nf_ct_delete(struct nf_conn *ct, u32 portid, int report)
 {
struct nf_conn_tstamp *tstamp;
 
+   if (test_and_set_bit(IPS_DYING_BIT, &ct->status))
+   return false;
+
tstamp = nf_conn_tstamp_find(ct);
if (tstamp && tstamp->stop == 0)
tstamp->stop = ktime_get_real_ns();
 
-   if (nf_ct_is_dying(ct))
-   goto delete;
-
if (nf_conntrack_event_report(IPCT_DESTROY, ct,
portid, report) < 0) {
-   /* destroy event was not delivered */
+   /* destroy event was not delivered. nf_ct_put will
+* be done by event cache worker on redelivery.
+*/
nf_ct_delete_from_lists(ct);
nf_conntrack_ecache_delayed_work(nf_ct_net(ct));
return false;
}
 
nf_conntrack_ecache_work(nf_ct_net(ct));
-   set_bit(IPS_DYING_BIT, &ct->status);
- delete:
nf_ct_delete_from_lists(ct);
nf_ct_put(ct);
return true;
 }
 EXPORT_SYMBOL_GPL(nf_ct_delete);
 
-static void death_by_timeout(unsigned long ul_conntrack)
-{
-   nf_ct_delete((struct nf_conn *)ul