[PATCH nf-next 7/7] netfilter: restart search if moved to other chain

2016-08-24 Thread Florian Westphal
In case nf_conntrack_tuple_taken did not find a conflicting entry,
check that all entries in this hash slot were examined, and restart
the lookup in case an entry was moved to another chain.

With SLAB_DESTROY_BY_RCU an entry can be freed and recycled into
another chain while the lockless walk is in progress; the nulls value
that terminates each list encodes the bucket, so a nulls value that
does not match the expected hash means the traversal may have skipped
entries and has to be restarted.

Reported-by: Eric Dumazet 
Signed-off-by: Florian Westphal 
---
 net/netfilter/nf_conntrack_core.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 2f7f759..4dda1ec 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -837,6 +837,7 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
zone = nf_ct_zone(ignored_conntrack);
 
rcu_read_lock();
+ begin:
nf_conntrack_get_ht(&ct_hash, &hsize);
hash = __hash_conntrack(net, tuple, hsize);
 
@@ -858,6 +859,12 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
}
NF_CT_STAT_INC_ATOMIC(net, searched);
}
+
+   if (get_nulls_value(n) != hash) {
+   NF_CT_STAT_INC_ATOMIC(net, search_restart);
+   goto begin;
+   }
+
rcu_read_unlock();
 
return 0;
-- 
2.7.3



[PATCH nf-next v2 0/7] netfilter: get rid of per-object conntrack timers

2016-08-24 Thread Florian Westphal
During NFWS 2016 it was mentioned that per-conntrack timers have
two drawbacks:
 - the 5-day default established timeout is very large and brings
   extra constraints for the timer subsystem.
 - most distros enable timer stats, so the timer struct eats 80 bytes
   in each conntrack object.

This series replaces the per-object struct timer with a u32 jiffies
stamp and one global delayed work queue for conntrack eviction.
The size of the nf_conn struct is reduced to 256 bytes on x86_64.

Eviction is performed from the packet path when doing table lookups;
for cases where we have idle periods the work queue is used.

No changes since v1, except the addition of a comment in #3 and
a one-line change (replace cond_resched with cond_resched_rcu_qs).

Patch #7 was not part of v1.

Tested with the following script, with conntrackd running in 'reliable
event mode' and httpterm running on another host:

---
random_resize() {
while true; do
RND=$RANDOM%256000
RND=$((RND+8192))
sysctl net.netfilter.nf_conntrack_buckets=$RND
sleep $((RANDOM % 120))
done
}

random_flush() {
while true; do
sleep $((RANDOM % 120))
conntrack -F
done
}

random_startstop() {
while true; do
sleep $((RANDOM % 120))
pkill -STOP conntrackd
sleep $((RANDOM % 10))
pkill -CONT conntrackd
done
}

http-client -u 1000 -t 3 -F 192.168.10.50 -G 192.168.10.17:8001 &
http-client -u 1000 -t 3 -F 192.168.10.51 -G 192.168.10.17:8001 &
http-client -u 1000 -t 3 -F 192.168.10.52 -G 192.168.10.17:8001 &

random_resize &
random_flush &
random_startstop &

wait
---

 include/net/netfilter/nf_conntrack.h|   36 +++--
 include/net/netfilter/nf_conntrack_ecache.h |   17 +-
 net/netfilter/nf_conntrack_core.c   |  194 +---
 net/netfilter/nf_conntrack_ecache.c |   22 ++-
 net/netfilter/nf_conntrack_netlink.c|   39 -
 net/netfilter/nf_conntrack_pptp.c   |3 
 net/netfilter/nf_nat_core.c |6 
 7 files changed, 223 insertions(+), 94 deletions(-)


[PATCH v2 nf-next 5/7] netfilter: conntrack: resched gc again if eviction rate is high

2016-08-24 Thread Florian Westphal
If we evicted a large fraction of the scanned conntrack entries,
re-schedule the next gc cycle for immediate execution.

This triggers during tests where load is high and then drops to zero,
leaving many connections in TW/CLOSE state with < 30 second timeouts.

Without this change it would take several minutes until the conntrack
count comes back to normal.

Signed-off-by: Florian Westphal 
---
 net/netfilter/nf_conntrack_core.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index e0a7889..f2133e8 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -938,6 +938,7 @@ static void gc_worker(struct work_struct *work)
 {
unsigned int i, goal, buckets = 0, expired_count = 0;
unsigned long next_run = GC_INTERVAL;
+   unsigned int ratio, scanned = 0;
struct conntrack_gc_work *gc_work;
 
gc_work = container_of(work, struct conntrack_gc_work, dwork.work);
@@ -962,6 +963,7 @@ static void gc_worker(struct work_struct *work)
hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[i], hnnode) {
tmp = nf_ct_tuplehash_to_ctrack(h);
 
+   scanned++;
if (nf_ct_is_expired(tmp)) {
nf_ct_gc_expired(tmp);
expired_count++;
@@ -981,6 +983,10 @@ static void gc_worker(struct work_struct *work)
if (gc_work->exiting)
return;
 
+   ratio = scanned ? expired_count * 100 / scanned : 0;
+   if (ratio >= 90)
+   next_run = 0;
+
gc_work->last_bucket = i;
schedule_delayed_work(&gc_work->dwork, next_run);
 }
-- 
2.7.3



[PATCH v2 nf-next 2/7] netfilter: conntrack: get rid of conntrack timer

2016-08-24 Thread Florian Westphal
With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
Eric Dumazet pointed out during netfilter workshop 2016.

Remove the timer and use a 32bit jiffies value holding the timestamp
until which the entry is valid.

During conntrack lookup, even before doing the tuple comparison, check
the timeout value and evict the entry in case it is too old.

The dying bit is used as a synchronization point to avoid races where
multiple cpus try to evict the same entry.

Because lookup is always lockless, we need to bump the refcnt once
when we evict, else we could try to evict an already-dead entry that
is being recycled.

This is the standard/expected way when conntrack entries are destroyed.

Followup patches will introduce garbage collection via a work queue
and further places where we can reap obsoleted entries (e.g. during
netlink dumps); this is needed to prevent expired conntracks from
hanging around for too long when the lookup rate is low after a busy
period.
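
As a side note, here is a minimal userspace sketch (not part of the
patch; uint32_t stands in for the kernel's u32) of why the signed
difference test used by nf_ct_is_expired() below survives jiffies
wraparound:

---
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t now = 0xfffffff0u;  /* jiffies shortly before the 32bit wrap */
	uint32_t timeout = now + 30; /* expires 30 ticks later, wraps to 0x0e */

	/* a naive unsigned comparison claims the entry already expired */
	printf("naive:  expired=%d\n", timeout <= now);                /* 1, wrong */

	/* the signed difference stays correct across the wrap */
	printf("signed: expired=%d\n", (int32_t)(timeout - now) <= 0); /* 0, right */
	return 0;
}
---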

Signed-off-by: Florian Westphal 
---
 include/net/netfilter/nf_conntrack.h | 23 +++--
 net/netfilter/nf_conntrack_core.c| 91 
 net/netfilter/nf_conntrack_netlink.c | 14 ++
 net/netfilter/nf_conntrack_pptp.c|  3 +-
 net/netfilter/nf_nat_core.c  |  6 ---
 5 files changed, 74 insertions(+), 63 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 2a12748..6d8cf06 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -42,7 +42,6 @@ union nf_conntrack_expect_proto {
 
 #include 
 #include 
-#include 
 
 #ifdef CONFIG_NETFILTER_DEBUG
 #define NF_CT_ASSERT(x)WARN_ON(!(x))
@@ -73,7 +72,7 @@ struct nf_conn_help {
 #include 
 
 struct nf_conn {
-   /* Usage count in here is 1 for hash table/destruct timer, 1 per skb,
+   /* Usage count in here is 1 for hash table, 1 per skb,
 * plus 1 for any connection(s) we are `master' for
 *
 * Hint, SKB address this struct and refcnt via skb->nfct and
@@ -96,8 +95,8 @@ struct nf_conn {
/* Have we seen traffic both ways yet? (bitset) */
unsigned long status;
 
-   /* Timer function; drops refcnt when it goes off. */
-   struct timer_list timeout;
+   /* jiffies32 when this ct is considered dead */
+   u32 timeout;
 
possible_net_t ct_net;
 
@@ -291,14 +290,28 @@ static inline bool nf_is_loopback_packet(const struct sk_buff *skb)
return skb->dev && skb->skb_iif && skb->dev->flags & IFF_LOOPBACK;
 }
 
+#define nfct_time_stamp ((u32)(jiffies))
+
 /* jiffies until ct expires, 0 if already expired */
 static inline unsigned long nf_ct_expires(const struct nf_conn *ct)
 {
-   long timeout = (long)ct->timeout.expires - (long)jiffies;
+   s32 timeout = ct->timeout - nfct_time_stamp;
 
return timeout > 0 ? timeout : 0;
 }
 
+static inline bool nf_ct_is_expired(const struct nf_conn *ct)
+{
+   return (__s32)(ct->timeout - nfct_time_stamp) <= 0;
+}
+
+/* use after obtaining a reference count */
+static inline bool nf_ct_should_gc(const struct nf_conn *ct)
+{
+   return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
+  !nf_ct_is_dying(ct);
+}
+
 struct kernel_param;
 
 int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 7d90a5d..38c50d1 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -371,7 +371,6 @@ destroy_conntrack(struct nf_conntrack *nfct)
 
pr_debug("destroy_conntrack(%p)\n", ct);
NF_CT_ASSERT(atomic_read(&nfct->use) == 0);
-   NF_CT_ASSERT(!timer_pending(&ct->timeout));
 
if (unlikely(nf_ct_is_template(ct))) {
nf_ct_tmpl_free(ct);
@@ -434,35 +433,30 @@ bool nf_ct_delete(struct nf_conn *ct, u32 portid, int report)
 {
struct nf_conn_tstamp *tstamp;
 
+   if (test_and_set_bit(IPS_DYING_BIT, &ct->status))
+   return false;
+
tstamp = nf_conn_tstamp_find(ct);
if (tstamp && tstamp->stop == 0)
tstamp->stop = ktime_get_real_ns();
 
-   if (nf_ct_is_dying(ct))
-   goto delete;
-
if (nf_conntrack_event_report(IPCT_DESTROY, ct,
portid, report) < 0) {
-   /* destroy event was not delivered */
+   /* destroy event was not delivered. nf_ct_put will
+* be done by event cache worker on redelivery.
+*/
nf_ct_delete_from_lists(ct);
nf_conntrack_ecache_delayed_work(nf_ct_net(ct));
return false;
}
 
nf_conntrack_ecache_work(nf_ct_net(ct));
-   set_bit(IPS_DYING_BIT, &ct->status);
- delete:
nf_ct_delete_from_lists(ct);
nf_ct_put(ct);
return true;
 }
 EXPORT_SYMBOL_GPL(nf_ct_delete);
 
-static void death_by_timeout(unsign

[PATCH v2 nf-next 1/7] netfilter: don't rely on DYING bit to detect when destroy event was sent

2016-08-24 Thread Florian Westphal
The reliable event delivery mode currently (ab)uses the DYING bit to
detect which entries on the dying list have to be skipped when
re-delivering events from the ecache worker in reliable event mode.

Currently when we delete the conntrack from main table we only set this
bit if we could also deliver the netlink destroy event to userspace.

If we fail we move it to the dying list, the ecache worker will
reattempt event delivery for all confirmed conntracks on the dying list
that do not have the DYING bit set.

Once the timer is gone, we can no longer use if (del_timer()) to detect
when we 'stole' the reference count owned by the timer/hash entry, so
we need some other way to avoid racing with another cpu.

Pablo suggested adding a marker in the ecache extension that lets the
worker skip entries that have been unhashed from the main table but are
still waiting for the last reference count to be dropped (e.g. because
one skb waiting on an nfqueue verdict still holds a reference).

We do this by adding a tristate.
If we fail to deliver the destroy event, make a note of this in the
ecache extension.  The worker can then skip all entries that are in
a different state: either they never delivered a destroy event,
e.g. because the netlink backend was not loaded, or redelivery already
took place.
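
Summarizing the resulting tristate (paraphrasing the patch below):

  NFCT_ECACHE_UNKNOWN       initial state, no destroy event sent yet
  NFCT_ECACHE_DESTROY_FAIL  delivery of the destroy event failed
  NFCT_ECACHE_DESTROY_SENT  redelivery by the ecache worker succeeded

The worker only retries entries in NFCT_ECACHE_DESTROY_FAIL state and
leaves the other two alone.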

Once the conntrack timer is removed we will be able to replace the
del_timer() test with test_and_set_bit(DYING, &ct->status) to avoid
racing with another cpu that tries to evict the same conntrack.

Because DYING will then be set right before we report the destroy event
we can no longer skip event reporting when the dying bit is set.

Suggested-by: Pablo Neira Ayuso 
Signed-off-by: Florian Westphal 
---
 include/net/netfilter/nf_conntrack_ecache.h | 17 -
 net/netfilter/nf_conntrack_ecache.c | 22 ++
 2 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h
index fa36447..12d967b 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -12,12 +12,19 @@
 #include 
 #include 
 
+enum nf_ct_ecache_state {
+   NFCT_ECACHE_UNKNOWN,/* destroy event not sent */
+   NFCT_ECACHE_DESTROY_FAIL,   /* tried but failed to send destroy event */
+   NFCT_ECACHE_DESTROY_SENT,   /* sent destroy event after failure */
+};
+
 struct nf_conntrack_ecache {
-   unsigned long cache;/* bitops want long */
-   unsigned long missed;   /* missed events */
-   u16 ctmask; /* bitmask of ct events to be delivered */
-   u16 expmask;/* bitmask of expect events to be delivered */
-   u32 portid; /* netlink portid of destroyer */
+   unsigned long cache;/* bitops want long */
+   unsigned long missed;   /* missed events */
+   u16 ctmask; /* bitmask of ct events to be delivered */
+   u16 expmask;/* bitmask of expect events to be delivered */
+   u32 portid; /* netlink portid of destroyer */
+   enum nf_ct_ecache_state state;  /* ecache state */
 };
 
 static inline struct nf_conntrack_ecache *
diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c
index d28011b..da9df2d 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -49,8 +49,13 @@ static enum retry_state ecache_work_evict_list(struct ct_pcpu *pcpu)
 
hlist_nulls_for_each_entry(h, n, &pcpu->dying, hnnode) {
struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
+   struct nf_conntrack_ecache *e;
 
-   if (nf_ct_is_dying(ct))
+   if (!nf_ct_is_confirmed(ct))
+   continue;
+
+   e = nf_ct_ecache_find(ct);
+   if (!e || e->state != NFCT_ECACHE_DESTROY_FAIL)
continue;
 
if (nf_conntrack_event(IPCT_DESTROY, ct)) {
@@ -58,8 +63,7 @@ static enum retry_state ecache_work_evict_list(struct ct_pcpu *pcpu)
break;
}
 
-   /* we've got the event delivered, now it's dying */
-   set_bit(IPS_DYING_BIT, &ct->status);
+   e->state = NFCT_ECACHE_DESTROY_SENT;
refs[evicted] = ct;
 
if (++evicted >= ARRAY_SIZE(refs)) {
@@ -130,7 +134,7 @@ int nf_conntrack_eventmask_report(unsigned int eventmask, struct nf_conn *ct,
if (!e)
goto out_unlock;
 
-   if (nf_ct_is_confirmed(ct) && !nf_ct_is_dying(ct)) {
+   if (nf_ct_is_confirmed(ct)) {
struct nf_ct_event item = {
.ct = ct,
.portid = e->portid ? e->portid : portid,
@@ -150,11 +154,13 @@ int nf_conntrack_eventmask_report(unsigned int eventmask, struct nf_conn *ct,
 * trigge

[PATCH v2 nf-next 3/7] netfilter: evict stale entries on netlink dumps

2016-08-24 Thread Florian Westphal
When dumping we already have to look at the entire table, so we might
as well toss those entries whose timeout value is in the past.

We also look at every entry during resize operations.
However, eviction there is not as simple because we hold the
global resize lock, so we can't evict without adding an 'expired' list
to drop from later.  Considering that resizes are very rare it doesn't
seem worth doing.
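
The key pattern in the diff below, condensed into a short sketch
(simplified; not the literal kernel code): take references to expired
entries while the bucket lock is held, but do the actual kill only
after the lock has been dropped:

---
	struct nf_conn *evict[8];
	int i = 0;

	nf_conntrack_lock(lockp);
	/* ... while walking a bucket: ... */
	if (nf_ct_is_expired(ct) && i < ARRAY_SIZE(evict) &&
	    atomic_inc_not_zero(&ct->ct_general.use))
		evict[i++] = ct;	/* only grab a reference here */
	spin_unlock(lockp);

	while (i) {			/* kill outside the bucket lock */
		i--;
		if (nf_ct_should_gc(evict[i]))
			nf_ct_kill(evict[i]);
		nf_ct_put(evict[i]);
	}
---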

Signed-off-by: Florian Westphal 
---
 net/netfilter/nf_conntrack_netlink.c | 25 -
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 81fd34c..dedbe4b 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -815,14 +815,23 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
struct hlist_nulls_node *n;
struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh);
u_int8_t l3proto = nfmsg->nfgen_family;
-   int res;
+   struct nf_conn *nf_ct_evict[8];
+   int res, i;
spinlock_t *lockp;
 
last = (struct nf_conn *)cb->args[1];
+   i = 0;
 
local_bh_disable();
for (; cb->args[0] < nf_conntrack_htable_size; cb->args[0]++) {
 restart:
+   while (i) {
+   i--;
+   if (nf_ct_should_gc(nf_ct_evict[i]))
+   nf_ct_kill(nf_ct_evict[i]);
+   nf_ct_put(nf_ct_evict[i]);
+   }
+
lockp = &nf_conntrack_locks[cb->args[0] % CONNTRACK_LOCKS];
nf_conntrack_lock(lockp);
if (cb->args[0] >= nf_conntrack_htable_size) {
@@ -834,6 +843,13 @@ restart:
if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL)
continue;
ct = nf_ct_tuplehash_to_ctrack(h);
+   if (nf_ct_is_expired(ct)) {
+   if (i < ARRAY_SIZE(nf_ct_evict) &&
+   atomic_inc_not_zero(&ct->ct_general.use))
+   nf_ct_evict[i++] = ct;
+   continue;
+   }
+
if (!net_eq(net, nf_ct_net(ct)))
continue;
 
@@ -875,6 +891,13 @@ out:
if (last)
nf_ct_put(last);
 
+   while (i) {
+   i--;
+   if (nf_ct_should_gc(nf_ct_evict[i]))
+   nf_ct_kill(nf_ct_evict[i]);
+   nf_ct_put(nf_ct_evict[i]);
+   }
+
return skb->len;
 }
 
-- 
2.7.3



[PATCH v2 nf-next 4/7] netfilter: conntrack: add gc worker to remove timed-out entries

2016-08-24 Thread Florian Westphal
Conntrack gc worker to evict stale entries.

GC happens once every 5 seconds, but we only scan at most 1/64th of the
table's buckets (and not more than 8k) per run to avoid hogging the cpu.

This means that a complete scan of the table will take several minutes
of wall-clock time.
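
To put a number on that, using the constants from the diff below: with
nf_conntrack_htable_size = 65536, goal = min(65536 / 64, 8192) = 1024
buckets per run, so a full pass over the table needs 64 runs; at one
run per GC_INTERVAL (5 seconds) that is about 320 seconds, i.e. roughly
five minutes.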

Considering that during normal operation the gc run will never have to
evict any entries, because those evictions happen from the packet path,
this should be fine.

We only need gc to make sure userspace (conntrack event listeners)
eventually learn of the timeout, and for resource reclaim in case the
system becomes idle.

We do not disable BH, and we cond_resched after every bucket, so this
should not introduce noticeable latencies either.

A followup patch will add a small change to speed up GC for the extreme
case where most entries are timed out on an otherwise idle system.

v2: Use cond_resched_rcu_qs & add comment wrt. missing restart on
nulls value change in gc worker, suggested by Eric Dumazet.

Signed-off-by: Florian Westphal 
---
 net/netfilter/nf_conntrack_core.c | 78 +++
 1 file changed, 78 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 38c50d1..e0a7889 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -72,11 +72,24 @@ EXPORT_SYMBOL_GPL(nf_conntrack_expect_lock);
 struct hlist_nulls_head *nf_conntrack_hash __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_hash);
 
+struct conntrack_gc_work {
+   struct delayed_work dwork;
+   u32 last_bucket;
+   boolexiting;
+};
+
 static __read_mostly struct kmem_cache *nf_conntrack_cachep;
 static __read_mostly spinlock_t nf_conntrack_locks_all_lock;
 static __read_mostly DEFINE_SPINLOCK(nf_conntrack_locks_all_lock);
 static __read_mostly bool nf_conntrack_locks_all;
 
+#define GC_MAX_BUCKETS_DIV 64u
+#define GC_MAX_BUCKETS 8192u
+#define GC_INTERVAL(5 * HZ)
+#define GC_MAX_EVICTS  256u
+
+static struct conntrack_gc_work conntrack_gc_work;
+
 void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
 {
spin_lock(lock);
@@ -921,6 +934,63 @@ static noinline int early_drop(struct net *net, unsigned int _hash)
return false;
 }
 
+static void gc_worker(struct work_struct *work)
+{
+   unsigned int i, goal, buckets = 0, expired_count = 0;
+   unsigned long next_run = GC_INTERVAL;
+   struct conntrack_gc_work *gc_work;
+
+   gc_work = container_of(work, struct conntrack_gc_work, dwork.work);
+
+   goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);
+   i = gc_work->last_bucket;
+
+   do {
+   struct nf_conntrack_tuple_hash *h;
+   struct hlist_nulls_head *ct_hash;
+   struct hlist_nulls_node *n;
+   unsigned int hashsz;
+   struct nf_conn *tmp;
+
+   i++;
+   rcu_read_lock();
+
+   nf_conntrack_get_ht(&ct_hash, &hashsz);
+   if (i >= hashsz)
+   i = 0;
+
+   hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[i], hnnode) {
+   tmp = nf_ct_tuplehash_to_ctrack(h);
+
+   if (nf_ct_is_expired(tmp)) {
+   nf_ct_gc_expired(tmp);
+   expired_count++;
+   continue;
+   }
+   }
+
+   /* could check get_nulls_value() here and restart if ct
+* was moved to another chain.  But given gc is best-effort
+* we will just continue with next hash slot.
+*/
+   rcu_read_unlock();
+   cond_resched_rcu_qs();
+   } while (++buckets < goal &&
+expired_count < GC_MAX_EVICTS);
+
+   if (gc_work->exiting)
+   return;
+
+   gc_work->last_bucket = i;
+   schedule_delayed_work(&gc_work->dwork, next_run);
+}
+
+static void conntrack_gc_work_init(struct conntrack_gc_work *gc_work)
+{
+   INIT_DELAYED_WORK(&gc_work->dwork, gc_worker);
+   gc_work->exiting = false;
+}
+
 static struct nf_conn *
 __nf_conntrack_alloc(struct net *net,
 const struct nf_conntrack_zone *zone,
@@ -1527,6 +1597,7 @@ static int untrack_refs(void)
 
 void nf_conntrack_cleanup_start(void)
 {
+   conntrack_gc_work.exiting = true;
RCU_INIT_POINTER(ip_ct_attach, NULL);
 }
 
@@ -1536,6 +1607,9 @@ void nf_conntrack_cleanup_end(void)
while (untrack_refs() > 0)
schedule();
 
+   cancel_delayed_work_sync(&conntrack_gc_work.dwork);
+   /* can be re-scheduled once */
+   cancel_delayed_work_sync(&conntrack_gc_work.dwork);
nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size);
 
nf_conntrack_proto_fini();
@@ -1810,6 +1884,10 @@ int nf_conntrack_init_start(void)
}
/*  - and look it

[PATCH v2 nf-next 6/7] netfilter: remove __nf_ct_kill_acct helper

2016-08-24 Thread Florian Westphal
After timer removal this just calls nf_ct_delete, so remove the
__ prefix version and make nf_ct_kill a shorthand for nf_ct_delete.

Signed-off-by: Florian Westphal 
---
 include/net/netfilter/nf_conntrack.h | 13 +++--
 net/netfilter/nf_conntrack_core.c| 12 +---
 2 files changed, 8 insertions(+), 17 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 6d8cf06..a3595c2 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -219,21 +219,14 @@ static inline void nf_ct_refresh(struct nf_conn *ct,
__nf_ct_refresh_acct(ct, 0, skb, extra_jiffies, 0);
 }
 
-bool __nf_ct_kill_acct(struct nf_conn *ct, enum ip_conntrack_info ctinfo,
-  const struct sk_buff *skb, int do_acct);
-
 /* kill conntrack and do accounting */
-static inline bool nf_ct_kill_acct(struct nf_conn *ct,
-  enum ip_conntrack_info ctinfo,
-  const struct sk_buff *skb)
-{
-   return __nf_ct_kill_acct(ct, ctinfo, skb, 1);
-}
+bool nf_ct_kill_acct(struct nf_conn *ct, enum ip_conntrack_info ctinfo,
+const struct sk_buff *skb);
 
 /* kill conntrack without accounting */
 static inline bool nf_ct_kill(struct nf_conn *ct)
 {
-   return __nf_ct_kill_acct(ct, 0, NULL, 0);
+   return nf_ct_delete(ct, 0, 0);
 }
 
 /* These are for NAT.  Icky. */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index f2133e8..2f7f759 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1423,17 +1423,15 @@ acct:
 }
 EXPORT_SYMBOL_GPL(__nf_ct_refresh_acct);
 
-bool __nf_ct_kill_acct(struct nf_conn *ct,
-  enum ip_conntrack_info ctinfo,
-  const struct sk_buff *skb,
-  int do_acct)
+bool nf_ct_kill_acct(struct nf_conn *ct,
+enum ip_conntrack_info ctinfo,
+const struct sk_buff *skb)
 {
-   if (do_acct)
-   nf_ct_acct_update(ct, ctinfo, skb->len);
+   nf_ct_acct_update(ct, ctinfo, skb->len);
 
return nf_ct_delete(ct, 0, 0);
 }
-EXPORT_SYMBOL_GPL(__nf_ct_kill_acct);
+EXPORT_SYMBOL_GPL(nf_ct_kill_acct);
 
 #if IS_ENABLED(CONFIG_NF_CT_NETLINK)
 
-- 
2.7.3



[PATCH libnftnl] expr: data_reg: Fix DATA_CHAIN comparison

2016-08-24 Thread Carlos Falgueras García
Split the DATA_VERDICT and DATA_CHAIN comparisons.  A verdict can have
a NULL chain (accept, drop, etc.), so a segfault will happen when
strcmp is called.
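
For illustration, a standalone sketch of the fall-through bug (plain C,
mimicking the shape of nftnl_data_reg_cmp, not the actual libnftnl
sources):

---
#include <stdio.h>
#include <string.h>

enum { DATA_VERDICT, DATA_CHAIN };

struct reg { int verdict; const char *chain; };

static int reg_cmp(int type, const struct reg *r1, const struct reg *r2)
{
	switch (type) {
	case DATA_VERDICT:	/* before the fix this fell through below */
		return r1->verdict == r2->verdict;
	case DATA_CHAIN:
		return r1->verdict == r2->verdict &&
		       !strcmp(r1->chain, r2->chain);
	}
	return 0;
}

int main(void)
{
	/* plain verdicts like accept/drop carry no chain name */
	struct reg a = { 1, NULL }, b = { 1, NULL };

	/* with the old fall-through this would strcmp(NULL, NULL) */
	printf("equal=%d\n", reg_cmp(DATA_VERDICT, &a, &b));
	return 0;
}
---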

Signed-off-by: Carlos Falgueras García 
---
 src/expr/data_reg.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/expr/data_reg.c b/src/expr/data_reg.c
index a954e95..395e9f6 100644
--- a/src/expr/data_reg.c
+++ b/src/expr/data_reg.c
@@ -387,6 +387,7 @@ bool nftnl_data_reg_cmp(const union nftnl_data_reg *r1,
return  r1->len == r2->len &&
!memcmp(r1->val, r2->val, r1->len);
case DATA_VERDICT:
+   return  r1->verdict == r2->verdict;
case DATA_CHAIN:
return  r1->verdict == r2->verdict &&
!strcmp(r1->chain, r2->chain);
-- 
2.9.3



Re: [PATCH nf-next 7/7] netfilter: restart search if moved to other chain

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> In case nf_conntrack_tuple_taken did not find a conflicting entry
> check that all entries in this hash slot were tested and restart
> in case an entry was moved to another chain.
> 
> Reported-by: Eric Dumazet 
> Signed-off-by: Florian Westphal 
> ---

Fixes: ea781f197d6a ("netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()")
Acked-by: Eric Dumazet 

Thanks !




[PATCH nf-next 1/2] rhashtable: add rhashtable_lookup_get_insert_key()

2016-08-24 Thread Pablo Neira Ayuso
This patch modifies __rhashtable_insert_fast() so it returns the
existing element that clashes with the one that you want to insert.
This adds a new rhashtable_lookup_get_insert_key() interface to fetch
this existing element.

nf_tables needs this change to improve handling of EEXIST cases via
honoring the NLM_F_EXCL flag and by checking if the data part of the
mapping matches what we have.
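
A minimal sketch of the intended caller pattern (hypothetical
my_params/my_elem names, error handling trimmed):

---
	void *data = NULL;
	int err;

	err = rhashtable_lookup_get_insert_key(&ht, &key, &obj->node,
					       my_params, &data);
	if (err == -EEXIST && data) {
		struct my_elem *old = data;	/* the clashing element */
		/* compare old's data part with the new mapping here */
	}
---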

Cc: Herbert Xu 
Cc: Thomas Graf 
Signed-off-by: Pablo Neira Ayuso 
---

 include/linux/rhashtable.h | 32 +++-
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 3eef080..a632cf0 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -566,7 +566,7 @@ restart:
 /* Internal function, please use rhashtable_insert_fast() instead */
 static inline int __rhashtable_insert_fast(
struct rhashtable *ht, const void *key, struct rhash_head *obj,
-   const struct rhashtable_params params)
+   const struct rhashtable_params params, void **data)
 {
struct rhashtable_compare_arg arg = {
.ht = ht,
@@ -630,8 +630,11 @@ slow_path:
if (key &&
unlikely(!(params.obj_cmpfn ?
   params.obj_cmpfn(&arg, rht_obj(ht, head)) :
-  rhashtable_compare(&arg, rht_obj(ht, head)))))
+  rhashtable_compare(&arg, rht_obj(ht, head))))) {
+   if (data)
+   *data = rht_obj(ht, head);
goto out;
+   }
if (!--elasticity)
goto slow_path;
}
@@ -675,7 +678,7 @@ static inline int rhashtable_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
const struct rhashtable_params params)
 {
-   return __rhashtable_insert_fast(ht, NULL, obj, params);
+   return __rhashtable_insert_fast(ht, NULL, obj, params, NULL);
 }
 
 /**
@@ -708,7 +711,7 @@ static inline int rhashtable_lookup_insert_fast(
BUG_ON(ht->p.obj_hashfn);
 
return __rhashtable_insert_fast(ht, key + ht->p.key_offset, obj,
-   params);
+   params, NULL);
 }
 
 /**
@@ -739,7 +742,26 @@ static inline int rhashtable_lookup_insert_key(
 {
BUG_ON(!ht->p.obj_hashfn || !key);
 
-   return __rhashtable_insert_fast(ht, key, obj, params);
+   return __rhashtable_insert_fast(ht, key, obj, params, NULL);
+}
+
+/**
+ * rhashtable_lookup_get_insert_key - lookup and insert object into hash table
+ * @ht:     hash table
+ * @obj:    pointer to hash head inside object
+ * @params: hash table parameters
+ * @data:  pointer to element data already in hashes
+ *
+ * Just like rhashtable_lookup_insert_key(), but it returns the matching
+ * element in case that it already exists in the hashes.
+ */
+static inline int rhashtable_lookup_get_insert_key(
+   struct rhashtable *ht, const void *key, struct rhash_head *obj,
+   const struct rhashtable_params params, void **data)
+{
+   BUG_ON(!ht->p.obj_hashfn || !key);
+
+   return __rhashtable_insert_fast(ht, key, obj, params, data);
 }
 
 /* Internal function, please use rhashtable_remove_fast() instead */
-- 
2.1.4



[PATCH nf-next 2/2] netfilter: nf_tables: honor NLM_F_EXCL flag in set element insertion

2016-08-24 Thread Pablo Neira Ayuso
If the NLM_F_EXCL flag is set, then new elements that clash with an
existing one return EEXIST. In case you try to add an element whose
data area differs from what we have, then this returns EBUSY. If no
flag is specified at all, then this returns success to userspace.

This patch also updates the set insert operation so we can fetch the
existing element that clashes with the one you want to add; we need
this to make sure the element data doesn't differ.
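
Expressed as a (hypothetical) nft session against a verdict map x/m,
the three cases would look like this:

 # nft add element x m { 1.1.1.1 : accept }
 # nft add element x m { 1.1.1.1 : accept }     <- same data, no flag: OK
 # nft add element x m { 1.1.1.1 : drop }       <- data differs: EBUSY
 # nft create element x m { 1.1.1.1 : accept }  <- NLM_F_EXCL set: EEXIST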

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_tables.h |  3 ++-
 net/netfilter/nf_tables_api.c | 20 +++-
 net/netfilter/nft_set_hash.c  | 11 ---
 net/netfilter/nft_set_rbtree.c| 12 
 4 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index f2f1339..8972468 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -251,7 +251,8 @@ struct nft_set_ops {
 
 	int (*insert)(const struct net *net,
 		      const struct nft_set *set,
-		      const struct nft_set_elem *elem);
+		      const struct nft_set_elem *elem,
+		      struct nft_set_ext **ext);
 	void (*activate)(const struct net *net,
 			 const struct nft_set *set,
 			 const struct nft_set_elem *elem);
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 221d27f..bd9715e 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3483,12 +3483,12 @@ static int nft_setelem_parse_flags(const struct nft_set *set,
 }
 
 static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
-   const struct nlattr *attr)
+   const struct nlattr *attr, u32 nlmsg_flags)
 {
struct nlattr *nla[NFTA_SET_ELEM_MAX + 1];
struct nft_data_desc d1, d2;
struct nft_set_ext_tmpl tmpl;
-   struct nft_set_ext *ext;
+   struct nft_set_ext *ext, *ext2;
struct nft_set_elem elem;
struct nft_set_binding *binding;
struct nft_userdata *udata;
@@ -3615,9 +3615,19 @@ static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
goto err4;
 
ext->genmask = nft_genmask_cur(ctx->net) | NFT_SET_ELEM_BUSY_MASK;
-   err = set->ops->insert(ctx->net, set, &elem);
-   if (err < 0)
+   err = set->ops->insert(ctx->net, set, &elem, &ext2);
+   if (err) {
+   if (err == -EEXIST) {
+   if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA) &&
+   nft_set_ext_exists(ext2, NFT_SET_EXT_DATA) &&
+   memcmp(nft_set_ext_data(ext),
+  nft_set_ext_data(ext2), set->dlen) != 0)
+   err = -EBUSY;
+   else if (!(nlmsg_flags & NLM_F_EXCL))
+   err = 0;
+   }
goto err5;
+   }
 
nft_trans_elem(trans) = elem;
list_add_tail(&trans->list, &ctx->net->nft.commit_list);
@@ -3673,7 +3683,7 @@ static int nf_tables_newsetelem(struct net *net, struct sock *nlsk,
 	    !atomic_add_unless(&set->nelems, 1, set->size + set->ndeact))
return -ENFILE;
 
-   err = nft_add_set_elem(&ctx, set, attr);
+   err = nft_add_set_elem(&ctx, set, attr, nlh->nlmsg_flags);
if (err < 0) {
atomic_dec(&set->nelems);
break;
diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index 564fa79..95b3e74 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -126,7 +126,8 @@ err1:
 }
 
 static int nft_hash_insert(const struct net *net, const struct nft_set *set,
-  const struct nft_set_elem *elem)
+  const struct nft_set_elem *elem,
+  struct nft_set_ext **ext)
 {
struct nft_hash *priv = nft_set_priv(set);
struct nft_hash_elem *he = elem->priv;
@@ -135,9 +136,13 @@ static int nft_hash_insert(const struct net *net, const struct nft_set *set,
.set = set,
.key = elem->key.val.data,
};
+   struct nft_hash_elem *prev;
+   int ret;
 
-   return rhashtable_lookup_insert_key(&priv->ht, &arg, &he->node,
-   nft_hash_params);
+   ret = rhashtable_lookup_get_insert_key(&priv->ht, &arg, &he->node,
+  nft_hash_params, (void **)&prev);
+   *ext = &prev->ext;
+   return ret;
 }
 
 s

[PATCH nft 4/4] tests: shell: cover add and create set command

2016-08-24 Thread Pablo Neira Ayuso
This patch validates that creation of an already existing element
bails out with EEXIST.

Signed-off-by: Pablo Neira Ayuso 
---
 tests/shell/testcases/sets/0007create_element_0 | 15 +++
 1 file changed, 15 insertions(+)
 create mode 100755 tests/shell/testcases/sets/0007create_element_0

diff --git a/tests/shell/testcases/sets/0007create_element_0 b/tests/shell/testcases/sets/0007create_element_0
new file mode 100755
index 000..47b3559
--- /dev/null
+++ b/tests/shell/testcases/sets/0007create_element_0
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+# This testcase checks for add and create element commands.
+
+set -e
+$NFT add table t
+$NFT add set t s { type ipv4_addr \; }
+$NFT add element t s { 1.1.1.1 }
+if $NFT create element t s { 1.1.1.1 } 2>/dev/null ; then
+   echo "E: accepted element creation that already exists" >&2
+   exit 1
+fi
+$NFT add element t s { 1.1.1.1 }
+
+exit 0
-- 
2.1.4



[PATCH nft 3/4] src: create element command

2016-08-24 Thread Pablo Neira Ayuso
This patch adds the create command, which sends the NLM_F_EXCL flag so
nf_tables bails out if the element already exists, e.g.

 # nft add element x y { 1.1.1.1 }
 # nft create element x y { 1.1.1.1 }
 :1:1-31: Error: Could not process rule: File exists
 create element x y { 1.1.1.1 }
 ^^^

This update requires nf_tables kernel patches to honor the NLM_F_EXCL.

Signed-off-by: Pablo Neira Ayuso 
---
 include/netlink.h  |  2 +-
 src/netlink.c  | 15 ---
 src/parser_bison.y |  4 
 src/rule.c | 13 +++--
 4 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/include/netlink.h b/include/netlink.h
index 5f48707..28c11f6 100644
--- a/include/netlink.h
+++ b/include/netlink.h
 extern struct stmt *netlink_parse_set_expr(const struct set *set,
   const struct nftnl_expr *nle);
 
 extern int netlink_add_setelems(struct netlink_ctx *ctx, const struct handle *h,
-   const struct expr *expr);
+   const struct expr *expr, bool excl);
 extern int netlink_delete_setelems(struct netlink_ctx *ctx, const struct handle *h,
   const struct expr *expr);
 extern int netlink_get_setelems(struct netlink_ctx *ctx, const struct handle *h,
diff --git a/src/netlink.c b/src/netlink.c
index f897b0e..f8da2a6 100644
--- a/src/netlink.c
+++ b/src/netlink.c
@@ -1318,7 +1318,7 @@ static void alloc_setelem_cache(const struct expr *set, struct nftnl_set *nls)
 
 static int netlink_add_setelems_batch(struct netlink_ctx *ctx,
  const struct handle *h,
- const struct expr *expr)
+ const struct expr *expr, bool excl)
 {
struct nftnl_set *nls;
int err;
@@ -1327,7 +1328,8 @@ static int netlink_add_setelems_batch(struct netlink_ctx *ctx,
alloc_setelem_cache(expr, nls);
netlink_dump_set(nls);
 
-   err = mnl_nft_setelem_batch_add(nls, 0, ctx->seqnum);
+   err = mnl_nft_setelem_batch_add(nls, excl ? NLM_F_EXCL : 0,
+   ctx->seqnum);
nftnl_set_free(nls);
if (err < 0)
netlink_io_error(ctx, &expr->location,
@@ -1338,7 +1339,7 @@ static int netlink_add_setelems_batch(struct netlink_ctx *ctx,
 
 static int netlink_add_setelems_compat(struct netlink_ctx *ctx,
   const struct handle *h,
-  const struct expr *expr)
+  const struct expr *expr, bool excl)
 {
struct nftnl_set *nls;
int err;
@@ -1347,7 +1348,7 @@ static int netlink_add_setelems_compat(struct netlink_ctx *ctx,
alloc_setelem_cache(expr, nls);
netlink_dump_set(nls);
 
-   err = mnl_nft_setelem_add(nf_sock, nls, 0);
+   err = mnl_nft_setelem_add(nf_sock, nls, excl ? NLM_F_EXCL : 0);
nftnl_set_free(nls);
if (err < 0)
netlink_io_error(ctx, &expr->location,
@@ -1357,12 +1358,12 @@ static int netlink_add_setelems_compat(struct netlink_ctx *ctx,
 }
 
 int netlink_add_setelems(struct netlink_ctx *ctx, const struct handle *h,
-const struct expr *expr)
+const struct expr *expr, bool excl)
 {
if (ctx->batch_supported)
-   return netlink_add_setelems_batch(ctx, h, expr);
+   return netlink_add_setelems_batch(ctx, h, expr, excl);
else
-   return netlink_add_setelems_compat(ctx, h, expr);
+   return netlink_add_setelems_compat(ctx, h, expr, excl);
 }
 
 static int netlink_del_setelems_batch(struct netlink_ctx *ctx,
diff --git a/src/parser_bison.y b/src/parser_bison.y
index 5d5ce8c..8c0f625 100644
--- a/src/parser_bison.y
+++ b/src/parser_bison.y
@@ -788,6 +788,10 @@ create_cmd :   TABLE   table_spec
handle_merge(&$3->handle, &$2);
 			$$ = cmd_alloc(CMD_CREATE, CMD_OBJ_SET, &$2, &@$, $5);
 		}
+		|	ELEMENT	set_spec	set_expr
+		{
+			$$ = cmd_alloc(CMD_CREATE, CMD_OBJ_SETELEM, &$2, &@$, $3);
+   }
;
 
 insert_cmd :   RULErule_position   rule
diff --git a/src/rule.c b/src/rule.c
index 54edd8c..8c58bfa 100644
--- a/src/rule.c
+++ b/src/rule.c
@@ -881,20 +881,20 @@ static int do_add_chain(struct netlink_ctx *ctx, const struct handle *h,
 }
 
 static int __do_add_setelems(struct netlink_ctx *ctx, const struct handle *h,
-struct set *set, struct expr *expr)
+struct set *set, struct expr *expr, bool excl)
 {
if (set->flags & SET_F_INTERVAL &&
set_to_intervals(ctx->

[PATCH nft 2/4] tests: shell: cover add and create set command

2016-08-24 Thread Pablo Neira Ayuso
This patch validates that creation of an already existing set bails out
with EEXIST.

Signed-off-by: Pablo Neira Ayuso 
---
 tests/shell/testcases/sets/0006create_set_0 | 14 ++
 1 file changed, 14 insertions(+)
 create mode 100755 tests/shell/testcases/sets/0006create_set_0

diff --git a/tests/shell/testcases/sets/0006create_set_0 b/tests/shell/testcases/sets/0006create_set_0
new file mode 100755
index 000..ca36cf7
--- /dev/null
+++ b/tests/shell/testcases/sets/0006create_set_0
@@ -0,0 +1,14 @@
+#!/bin/bash
+
+# This testcase checks for add and create set commands.
+
+set -e
+$NFT add table t
+$NFT add set t s { type ipv4_addr \; }
+if $NFT create set t s { type ipv4_addr \; } 2>/dev/null ; then
+   echo "E: accepted set creation that already exists" >&2
+   exit 1
+fi
+$NFT add set t s { type ipv4_addr \; }
+
+exit 0
-- 
2.1.4



[PATCH nft 1/4] src: add create set command

2016-08-24 Thread Pablo Neira Ayuso
Add support for the 'create' command; we already support this in other
existing objects, so support it for sets too, e.g.

 # nft add set x y { type ipv4_addr\; }
 # nft create set x y { type ipv4_addr\; }
 :1:1-35: Error: Could not process rule: File exists
 create set x y { type ipv4_addr; }
 ^^^
 # nft add set x y { type ipv4_addr\; }
 #

This command sets the NLM_F_EXCL netlink flag, so if the object already
exists, nf_tables returns -EEXIST.

This changes the existing behaviour of 'nft add set', which was setting
this flag; that was inconsistent with the way other objects behave.

Signed-off-by: Pablo Neira Ayuso 
---
 include/netlink.h  |  2 +-
 src/netlink.c  | 17 ++---
 src/parser_bison.y | 14 ++
 src/rule.c |  8 
 4 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/include/netlink.h b/include/netlink.h
index 76a9da4..5f48707 100644
--- a/include/netlink.h
+++ b/include/netlink.h
 extern int netlink_flush_table(struct netlink_ctx *ctx, const struct handle *h,
   const struct location *loc);
 
 extern int netlink_add_set(struct netlink_ctx *ctx, const struct handle *h,
-  struct set *set);
+  struct set *set, bool excl);
 extern int netlink_delete_set(struct netlink_ctx *ctx, const struct handle *h,
  const struct location *loc);
 extern int netlink_list_sets(struct netlink_ctx *ctx, const struct handle *h,
diff --git a/src/netlink.c b/src/netlink.c
index cf24c8a..f897b0e 100644
--- a/src/netlink.c
+++ b/src/netlink.c
@@ -1117,8 +1117,10 @@ static struct set *netlink_delinearize_set(struct netlink_ctx *ctx,
 }
 
 static int netlink_add_set_compat(struct netlink_ctx *ctx,
- const struct handle *h, struct set *set)
+ const struct handle *h, struct set *set,
+ bool excl)
 {
+   unsigned int flags = excl ? NLM_F_EXCL : 0;
struct nftnl_set *nls;
int err;
 
@@ -1136,7 +1138,7 @@ static int netlink_add_set_compat(struct netlink_ctx *ctx,
}
netlink_dump_set(nls);
 
-   err = mnl_nft_set_add(nf_sock, nls, NLM_F_EXCL | NLM_F_ECHO);
+   err = mnl_nft_set_add(nf_sock, nls, NLM_F_ECHO | flags);
if (err < 0)
netlink_io_error(ctx, &set->location, "Could not add set: %s",
 strerror(errno));
@@ -1148,7 +1150,8 @@ static int netlink_add_set_compat(struct netlink_ctx *ctx,
 }
 
 static int netlink_add_set_batch(struct netlink_ctx *ctx,
-const struct handle *h, struct set *set)
+const struct handle *h, struct set *set,
+bool excl)
 {
struct nftnl_set *nls;
int err;
@@ -1183,7 +1186,7 @@ static int netlink_add_set_batch(struct netlink_ctx *ctx,
 
netlink_dump_set(nls);
 
-   err = mnl_nft_set_batch_add(nls, NLM_F_EXCL, ctx->seqnum);
+   err = mnl_nft_set_batch_add(nls, excl ? NLM_F_EXCL : 0, ctx->seqnum);
if (err < 0)
netlink_io_error(ctx, &set->location, "Could not add set: %s",
 strerror(errno));
@@ -1193,12 +1196,12 @@ static int netlink_add_set_batch(struct netlink_ctx *ctx,
 }
 
 int netlink_add_set(struct netlink_ctx *ctx, const struct handle *h,
-   struct set *set)
+   struct set *set, bool excl)
 {
if (ctx->batch_supported)
-   return netlink_add_set_batch(ctx, h, set);
+   return netlink_add_set_batch(ctx, h, set, excl);
else
-   return netlink_add_set_compat(ctx, h, set);
+   return netlink_add_set_compat(ctx, h, set, excl);
 }
 
 static int netlink_del_set_compat(struct netlink_ctx *ctx,
diff --git a/src/parser_bison.y b/src/parser_bison.y
index a3d93bf..5d5ce8c 100644
--- a/src/parser_bison.y
+++ b/src/parser_bison.y
@@ -774,6 +774,20 @@ create_cmd :   TABLE   table_spec
close_scope(state);
 			$$ = cmd_alloc(CMD_CREATE, CMD_OBJ_CHAIN, &$2, &@$, $5);
}
+		|	SET	set_spec	set_block_alloc
+   '{' set_block   '}'
+   {
+   $5->location = @5;
+   handle_merge(&$3->handle, &$2);
+			$$ = cmd_alloc(CMD_CREATE, CMD_OBJ_SET, &$2, &@$, $5);
+   }
+		|	MAP	set_spec	map_block_alloc
+   '{' map_block   '}'
+   {
+   $5->location = @5;
+

Re: [PATCH libnftnl] expr: data_reg: Fix DATA_CHAIN comparison

2016-08-24 Thread Pablo Neira Ayuso
On Wed, Aug 24, 2016 at 02:22:16PM +0200, Carlos Falgueras García wrote:
> Splits DATA_VERDICT and DATA_CHAIN comparison. A verdict can have a NULL
> chain (accept, drop, etc) so segfault will happen when strcmp is call.

Applied, thanks Carlos.


[ANNOUNCE] Netdev 1.2 weekly updates (24th August, 2016)

2016-08-24 Thread Hajime Tazaki

Hello folks,

Here is a weekly update of Netdev 1.2 Tokyo.

The early bird registration is still available.  Please
don't miss the discount ticket - and your early registration
will definitely be helpful in preparing the conference.

http://netdevconf.org/1.2/registration.html

The session proposal submission is now closed. Thanks to
everyone who submitted a proposal. We have a lot of very
exciting talks, workshops and tutorials that the tech
committee is finalizing.  We will be announcing them as they
get accepted. Please watch the website and twitter feed. We
are also going to be notifying you via email but less
frequently. We look forward to seeing you all in Tokyo.


== New accepted sessions ==

Full list of accepted sessions is available here.

http://netdevconf.org/1.2/accepted-sessions.html

* Workshop


- IPsec Workshop
  by Steffen Klassert

- Network Performance Workshop
  by Jesper Dangaard Brouer


* Talk
- Scaling with multiple network namespaces in a single
  application
  by PJ Waskiewicz

- What is an L3 Master Device?
  by David Ahern

- TLS Offload to Network Devices
  by Boris Pismenny

- The adventures of a Suricate in eBPF land
  by Eric Leblond

- Accelerating Linux IP Virtual Server with OpenNPU
  by Gilad Ben-Yossef

* Tutorial
- Microservice Networking Leveraging VRF on the Host
  by David Ahern


Our sponsors:
- Platinum
Verizon, Facebook, Cumulus Networks
- Gold
Mojatatu Networks, VMWare, Google, NTT
- Silver
NetApp, IIJ, Netronome, SolarFlare, Mellanox
- Bronze
Zen Load Balancer (new)

Twitter: https://twitter.com/netdev01
Web: http://netdevconf.org/1.2/


-- Hajime



Re: [PATCH v2 nf-next 1/7] netfilter: don't rely on DYING bit to detect when destroy event was sent

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> The reliable event delivery mode currently (ab)uses the DYING bit to
> detect which entries on the dying list have to be skipped when
> re-delivering events from the eache worker in reliable event mode.
> 
> Currently when we delete the conntrack from main table we only set this
> bit if we could also deliver the netlink destroy event to userspace.
> 
> If we fail we move it to the dying list, the ecache worker will
> reattempt event delivery for all confirmed conntracks on the dying list
> that do not have the DYING bit set.
> 
> Once timer is gone, we can no longer use if (del_timer()) to detect
> when we 'stole' the reference count owned by the timer/hash entry, so
> we need some other way to avoid racing with other cpu.
> 
> Pablo suggested to add a marker in the ecache extension that skips

Acked-by: Eric Dumazet 




Re: [PATCH v2 nf-next 2/7] netfilter: conntrack: get rid of conntrack timer

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
> Eric Dumazet pointed out during netfilter workshop 2016.

Another reason was the fact that Thomas was about to change the max
timer range:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=500462a9de657f86edaa102f8ab6bff7f7e43fc2

Acked-by: Eric Dumazet 

Thanks !




Re: [PATCH v2 nf-next 3/7] netfilter: evict stale entries on netlink dumps

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> When dumping we already have to look at the entire table, so we might
> as well toss those entries whose timeout value is in the past.
> 
> We also look at every entry during resize operations.
> However, eviction there is not as simple because we hold the
> global resize lock so we can't evict without adding a 'expired' list
> to drop from later.  Considering that resizes are very rare it doesn't
> seem worth doing it.
> 
> Signed-off-by: Florian Westphal 
> ---
Acked-by: Eric Dumazet 




Re: [PATCH v2 nf-next 4/7] netfilter: conntrack: add gc worker to remove timed-out entries

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> Conntrack gc worker to evict stale entries.


>  static struct nf_conn *
>  __nf_conntrack_alloc(struct net *net,
>const struct nf_conntrack_zone *zone,
> @@ -1527,6 +1597,7 @@ static int untrack_refs(void)
>  
>  void nf_conntrack_cleanup_start(void)
>  {
> + conntrack_gc_work.exiting = true;
>   RCU_INIT_POINTER(ip_ct_attach, NULL);
>  }
>  
> @@ -1536,6 +1607,9 @@ void nf_conntrack_cleanup_end(void)
>   while (untrack_refs() > 0)
>   schedule();
>  
> + cancel_delayed_work_sync(&conntrack_gc_work.dwork);
> + /* can be re-scheduled once */

Are you sure ?

As conntrack_gc_work.exiting = true, I do not see how this can happen ?

> + cancel_delayed_work_sync(&conntrack_gc_work.dwork);
>   nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size);
>  
>   nf_conntrack_proto_fini();
> @@ -1810,6 +1884,10 @@ int nf_conntrack_init_start(void)
>   }
>   /*  - and look it like as a confirmed connection */
>   nf_ct_untracked_status_or(IPS_CONFIRMED | IPS_UNTRACKED);
> +
> + conntrack_gc_work_init(&conntrack_gc_work);
> + schedule_delayed_work(&conntrack_gc_work.dwork, GC_INTERVAL);
> +
>   return 0;
>  
>  err_proto:




Re: [PATCH v2 nf-next 4/7] netfilter: conntrack: add gc worker to remove timed-out entries

2016-08-24 Thread Florian Westphal
Eric Dumazet  wrote:
> On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> > Conntrack gc worker to evict stale entries.
> 
> 
> >  static struct nf_conn *
> >  __nf_conntrack_alloc(struct net *net,
> >  const struct nf_conntrack_zone *zone,
> > @@ -1527,6 +1597,7 @@ static int untrack_refs(void)
> >  
> >  void nf_conntrack_cleanup_start(void)
> >  {
> > +   conntrack_gc_work.exiting = true;
> > RCU_INIT_POINTER(ip_ct_attach, NULL);
> >  }
> >  
> > @@ -1536,6 +1607,9 @@ void nf_conntrack_cleanup_end(void)
> > while (untrack_refs() > 0)
> > schedule();
> >  
> > +   cancel_delayed_work_sync(&conntrack_gc_work.dwork);
> > +   /* can be re-scheduled once */
> 
> Are you sure ?
> 
> As conntrack_gc_work.exiting = true, I do not see how this can happen ?

nf_conntrack_cleanup_start() sets exiting = true

current cpu blocks in

cancel_delayed_work_sync(&conntrack_gc_work.dwork);

If the work queue item was running on another cpu but was already past
the gc_work->exiting check, then when the first cancel_delayed_work_sync()
returns it will have re-armed itself via schedule_delayed_work().

So I think the 2nd cancel_delayed_work_sync is needed.

Let me know if you'd like to see a v3 with more verbose
comment about this.

Thanks!


Re: [PATCH v2 nf-next 4/7] netfilter: conntrack: add gc worker to remove timed-out entries

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 22:11 +0200, Florian Westphal wrote:
> Eric Dumazet  wrote:
> > On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> > > Conntrack gc worker to evict stale entries.
> > 
> > 
> > >  static struct nf_conn *
> > >  __nf_conntrack_alloc(struct net *net,
> > >const struct nf_conntrack_zone *zone,
> > > @@ -1527,6 +1597,7 @@ static int untrack_refs(void)
> > >  
> > >  void nf_conntrack_cleanup_start(void)
> > >  {
> > > + conntrack_gc_work.exiting = true;
> > >   RCU_INIT_POINTER(ip_ct_attach, NULL);
> > >  }
> > >  
> > > @@ -1536,6 +1607,9 @@ void nf_conntrack_cleanup_end(void)
> > >   while (untrack_refs() > 0)
> > >   schedule();
> > >  
> > > + cancel_delayed_work_sync(&conntrack_gc_work.dwork);
> > > + /* can be re-scheduled once */
> > 
> > Are you sure ?
> > 
> > As conntrack_gc_work.exiting = true, I do not see how this can happen ?
> 
> nf_conntrack_cleanup_start() sets exiting = true
> 
> current cpu blocks in
> 
> cancel_delayed_work_sync(&conntrack_gc_work.dwork);
> 
> Iff the work queue was running on other cpu but was already past
> gc_work->exiting check then when cancel_delayed_work_sync() (first one)
> returns it will have re-armed itself via schedule_delayed_work().
> 
> So I think the 2nd cancel_delayed_work_sync is needed.
> 
> Let me know if you'd like to see a v3 with more verbose
> comment about this.

If you were using cancel_delayed_work() (instead of
cancel_delayed_work_sync()) I would understand the concern.

But here you are using the thing that was designed to exactly avoid the
issue, of both work running on another cpu and/or re-arming itself.

If what you are saying was true, we would have to fix hundreds of
cancel_delayed_work_sync() call sites ...
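
For reference, the pattern under discussion boils down to this sketch
(simplified from the series, not the literal patch code):

---
static struct delayed_work gc_dwork;
static bool exiting;

static void gc_worker(struct work_struct *work)
{
	/* ... scan some buckets ... */
	if (exiting)
		return;
	schedule_delayed_work(&gc_dwork, 5 * HZ);	/* re-arm */
}

static void shutdown(void)
{
	exiting = true;
	/* waits for a running instance and also kills a re-arm queued
	 * by that instance, so a single call is sufficient */
	cancel_delayed_work_sync(&gc_dwork);
}
---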





Re: [PATCH nf-next 1/2] rhashtable: add rhashtable_lookup_get_insert_key()

2016-08-24 Thread Herbert Xu
On Wed, Aug 24, 2016 at 04:23:14PM +0200, Pablo Neira Ayuso wrote:
> This patch modifies __rhashtable_insert_fast() so it returns the
> existing element that clashes with the one that you want to insert.
> This adds a new rhashtable_lookup_get_insert_key() interface to fetch
> this existing element.
> 
> nf_tables needs this change to improve handling of EEXIST cases via
> honoring the NLM_F_EXCL flag and by checking if the data part of the
> mapping matches what we have.
> 
> Cc: Herbert Xu 
> Cc: Thomas Graf 
> Signed-off-by: Pablo Neira Ayuso 

I would prefer an interface where you returned the object if it
exists, NULL if it does not and the insertion was successful, and
an ERR_PTR otherwise.  But it's not a deal-breaker.

However, you missed the rhashtable_insert_slow path.  It too
needs to be updated to incorporate your new logic.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v2 nf-next 4/7] netfilter: conntrack: add gc worker to remove timed-out entries

2016-08-24 Thread Florian Westphal
Eric Dumazet  wrote:
> On Wed, 2016-08-24 at 22:11 +0200, Florian Westphal wrote:
> > Eric Dumazet  wrote:
> > > On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> > > > Conntrack gc worker to evict stale entries.
> > > 
> > > 
> > > >  static struct nf_conn *
> > > >  __nf_conntrack_alloc(struct net *net,
> > > >  const struct nf_conntrack_zone *zone,
> > > > @@ -1527,6 +1597,7 @@ static int untrack_refs(void)
> > > >  
> > > >  void nf_conntrack_cleanup_start(void)
> > > >  {
> > > > +   conntrack_gc_work.exiting = true;
> > > > RCU_INIT_POINTER(ip_ct_attach, NULL);
> > > >  }
> > > >  
> > > > @@ -1536,6 +1607,9 @@ void nf_conntrack_cleanup_end(void)
> > > > while (untrack_refs() > 0)
> > > > schedule();
> > > >  
> > > > +   cancel_delayed_work_sync(&conntrack_gc_work.dwork);
> > > > +   /* can be re-scheduled once */
> > > 
> > > Are you sure ?
> > > 
> > > As conntrack_gc_work.exiting = true, I do not see how this can happen ?
> > 
> > nf_conntrack_cleanup_start() sets exiting = true
> > 
> > current cpu blocks in
> > 
> > cancel_delayed_work_sync(&conntrack_gc_work.dwork);
> > 
> > Iff the work queue was running on other cpu but was already past
> > gc_work->exiting check then when cancel_delayed_work_sync() (first one)
> > returns it will have re-armed itself via schedule_delayed_work().
> > 
> > So I think the 2nd cancel_delayed_work_sync is needed.
> > 
> > Let me know if you'd like to see a v3 with more verbose
> > comment about this.
> 
> If you were using cancel_delayed_work() (instead of
> cancel_delayed_work_sync()) I would understand the concern.
> 
> But here you are using the thing that was designed to exactly avoid the
> issue, of both work running on another cpu and/or re-arming itself.
> 
> If what you are saying was true, we would have to fix hundreds of
> cancel_delayed_work_sync() call sites ...

Ok, I see, the _sync version indeed seems to be designed to
suppress/avoid re-arming.

I will send a v3 without the 2nd cancel_delayed_work_sync, thanks Eric!

(I will preserve/add your Acked-by tags to the unchanged patches so you
 won't need to resend them).