Re: [PATCH net 2/2] conntrack: enable to tune gc parameters
Pablo Neira Ayuso wrote:
> I would prefer not to expose sysctl knobs; if we don't really know
> what good default values are, then we cannot expect our users to
> know this for us.
>
> I would go tune this in a way that resembles the previous behaviour.

I do not see how this is possible without reverting to the old
per-conntrack timer scheme.

With a per-ct timer, userspace gets notified the moment the timer fires;
without it, the notification comes 'when the kernel detects the timeout',
which in the worst case, as Nicolas describes, is when the gc worker
comes along.

You could run the gc worker every jiffy of course, but that's just
wasting cpu cycles (and you still get a small delay).

I don't see a way to do run-time tuning except faster restarts when old
entries start accumulating. This is what the code tries to do; perhaps
you have a better idea for the 'next gc run' computation.

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
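[Editorial sketch] Florian's "faster restarts when old entries start accumulating" could, for instance, scale the delay until the next gc run by the expired ratio seen in the last scan. A hypothetical userspace model of such a policy (HZ is assumed to be 100 here; the base interval mirrors the kernel's GC_INTERVAL, but the scaling itself is an illustration, not the actual kernel code):

```c
/* Illustrative only: pick the delay until the next gc run from how many
 * of the scanned entries had already expired.  A high expired ratio
 * hints that more stale entries are queued up, so reschedule sooner;
 * an idle table can wait for the full base interval.
 */
#define HZ 100			/* assumed tick rate for this sketch */
#define GC_INTERVAL (5 * HZ)	/* base interval, as in the kernel */

static unsigned long gc_next_run(unsigned int scanned, unsigned int expired)
{
	unsigned int ratio = scanned ? expired * 100 / scanned : 0;

	if (ratio >= 90)
		return 0;	/* mostly stale: rerun immediately */
	/* Otherwise shrink the interval in proportion to the ratio. */
	return GC_INTERVAL - (GC_INTERVAL * ratio) / 100;
}
```

With this policy a fully idle table is still revisited every 5 seconds, while a table that was 50% stale on the last pass is revisited in half that time.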
Re: [PATCH net 2/2] conntrack: enable to tune gc parameters
On Fri, Oct 14, 2016 at 12:37:26PM +0200, Florian Westphal wrote:
> Nicolas Dichtel wrote:
> > On 13/10/2016 at 22:43, Florian Westphal wrote:
[...]
> > > (Or cause too many useless scans)
> > >
> > > Another idea worth trying might be to get rid of the max cap and
> > > instead break early in case too many jiffies expired.
> > >
> > > I don't want to add sysctl knobs for this unless absolutely needed;
> > > it's already possible to 'force' an eviction cycle by running
> > > 'conntrack -L'.
> > >
> > Sure, but this is not a "real" solution, just a workaround.
> > We need to find a way to deliver conntrack deletion events in a
> > reasonable delay, whatever the traffic on the machine is.
>
> Agree, but that depends on what 'reasonable' means and what kind of
> unneeded cpu churn we're willing to add.
>
> We can add a sysctl for this but we should use a low default to not do
> too much unneeded work.
>
> So what about your original patch, but only add
>
>     nf_conntrack_gc_interval
>
> (and also add instant-resched in case the entire budget was consumed)?

I would prefer not to expose sysctl knobs; if we don't really know
what good default values are, then we cannot expect our users to
know this for us.

I would go tune this in a way that resembles the previous behaviour.
Re: [PATCH net 2/2] conntrack: enable to tune gc parameters
Nicolas Dichtel wrote:
> On 13/10/2016 at 22:43, Florian Westphal wrote:
> > Nicolas Dichtel wrote:
> > > On 10/10/2016 at 16:04, Florian Westphal wrote:
> > > > Nicolas Dichtel wrote:
> > > > > After commit b87a2f9199ea ("netfilter: conntrack: add gc worker
> > > > > to remove timed-out entries"), netlink conntrack deletion events
> > > > > may be sent with a huge delay. It could be interesting to let the
> > > > > user tweak gc parameters depending on its use case.
> > > >
> > > > Hmm, care to elaborate?
> > > >
> > > > I am not against doing this but I'd like to hear/read your use case.
> > > >
> > > > The expectation is that in almost all cases eviction will happen
> > > > from the packet path. The gc worker is just there for the case
> > > > where a busy system goes idle.
> > > It was precisely that case. After a period of activity, the event is
> > > sent a long time after the timeout. If the router does not manage a
> > > lot of flows, why not try to parse more entries instead of the
> > > default 1/64 of the table?
> > > In fact, I don't understand why we use GC_MAX_BUCKETS_DIV instead of
> > > always using GC_MAX_BUCKETS, whatever the size of the table is.
> >
> > I wanted to make sure that we have a known upper bound on the number of
> > buckets we process so that we do not block other pending kworker items
> > for too long.
> I don't understand. GC_MAX_BUCKETS is the upper bound and I agree that
> it is needed. But why GC_MAX_BUCKETS_DIV (i.e. 1/64)?
> In other words, why this line:
> 	goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);
> instead of:
> 	goal = GC_MAX_BUCKETS;

Sure, we can do that. But why is a fixed size better than a fraction?

E.g. with 8k buckets and a simple goal = GC_MAX_BUCKETS we scan the
entire table on every run; currently we only scan 128.

I wanted to keep too many destroy notifications from firing at once, but
maybe I was too paranoid...
> > (Or cause too many useless scans)
> >
> > Another idea worth trying might be to get rid of the max cap and
> > instead break early in case too many jiffies expired.
> >
> > I don't want to add sysctl knobs for this unless absolutely needed;
> > it's already possible to 'force' an eviction cycle by running
> > 'conntrack -L'.
> >
> Sure, but this is not a "real" solution, just a workaround.
> We need to find a way to deliver conntrack deletion events in a
> reasonable delay, whatever the traffic on the machine is.

Agree, but that depends on what 'reasonable' means and what kind of
unneeded cpu churn we're willing to add.

We can add a sysctl for this but we should use a low default to not do
too much unneeded work.

So what about your original patch, but only add

    nf_conntrack_gc_interval

(and also add instant-resched in case the entire budget was consumed)?
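[Editorial sketch] The fraction-vs-fixed-cap trade-off being debated here can be made concrete with a small userspace model of the current goal computation (the constants are the kernel defaults quoted in this thread; the helper itself is illustrative):

```c
/* Buckets scanned per gc run under the current in-tree scheme:
 * min(htable_size / 64, 8192).  Small tables are therefore walked
 * in thin 1/64 slices per run.
 */
#define GC_MAX_BUCKETS_DIV 64u
#define GC_MAX_BUCKETS     8192u

static unsigned int gc_goal(unsigned int htable_size)
{
	unsigned int g = htable_size / GC_MAX_BUCKETS_DIV;

	return g < GC_MAX_BUCKETS ? g : GC_MAX_BUCKETS;
}
```

So with 8192 buckets only 128 are scanned per run (the "128" Florian mentions), and at the default 5 s interval a full sweep takes 64 runs, i.e. over five minutes: this is the kind of worst-case notification delay the thread is about.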
Re: [PATCH net 2/2] conntrack: enable to tune gc parameters
On 13/10/2016 at 22:43, Florian Westphal wrote:
> Nicolas Dichtel wrote:
> > On 10/10/2016 at 16:04, Florian Westphal wrote:
> > > Nicolas Dichtel wrote:
> > > > After commit b87a2f9199ea ("netfilter: conntrack: add gc worker to
> > > > remove timed-out entries"), netlink conntrack deletion events may
> > > > be sent with a huge delay. It could be interesting to let the user
> > > > tweak gc parameters depending on its use case.
> > >
> > > Hmm, care to elaborate?
> > >
> > > I am not against doing this but I'd like to hear/read your use case.
> > >
> > > The expectation is that in almost all cases eviction will happen from
> > > the packet path. The gc worker is just there for the case where a
> > > busy system goes idle.
> > It was precisely that case. After a period of activity, the event is
> > sent a long time after the timeout. If the router does not manage a
> > lot of flows, why not try to parse more entries instead of the default
> > 1/64 of the table?
> > In fact, I don't understand why we use GC_MAX_BUCKETS_DIV instead of
> > always using GC_MAX_BUCKETS, whatever the size of the table is.
>
> I wanted to make sure that we have a known upper bound on the number of
> buckets we process so that we do not block other pending kworker items
> for too long.

I don't understand. GC_MAX_BUCKETS is the upper bound and I agree that it
is needed. But why GC_MAX_BUCKETS_DIV (i.e. 1/64)?
In other words, why this line:
	goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);
instead of:
	goal = GC_MAX_BUCKETS;
?

> (Or cause too many useless scans)
>
> Another idea worth trying might be to get rid of the max cap and
> instead break early in case too many jiffies expired.
>
> I don't want to add sysctl knobs for this unless absolutely needed;
> it's already possible to 'force' an eviction cycle by running
> 'conntrack -L'.

Sure, but this is not a "real" solution, just a workaround.
We need to find a way to deliver conntrack deletion events in a
reasonable delay, whatever the traffic on the machine is.
Re: [PATCH net 2/2] conntrack: enable to tune gc parameters
Nicolas Dichtel wrote:
> On 10/10/2016 at 16:04, Florian Westphal wrote:
> > Nicolas Dichtel wrote:
> > > After commit b87a2f9199ea ("netfilter: conntrack: add gc worker to
> > > remove timed-out entries"), netlink conntrack deletion events may be
> > > sent with a huge delay. It could be interesting to let the user tweak
> > > gc parameters depending on its use case.
> >
> > Hmm, care to elaborate?
> >
> > I am not against doing this but I'd like to hear/read your use case.
> >
> > The expectation is that in almost all cases eviction will happen from
> > the packet path. The gc worker is just there for the case where a busy
> > system goes idle.
> It was precisely that case. After a period of activity, the event is
> sent a long time after the timeout. If the router does not manage a lot
> of flows, why not try to parse more entries instead of the default 1/64
> of the table?
> In fact, I don't understand why we use GC_MAX_BUCKETS_DIV instead of
> always using GC_MAX_BUCKETS, whatever the size of the table is.

I wanted to make sure that we have a known upper bound on the number of
buckets we process so that we do not block other pending kworker items
for too long.

(Or cause too many useless scans.)

Another idea worth trying might be to get rid of the max cap and instead
break early in case too many jiffies expired.

I don't want to add sysctl knobs for this unless absolutely needed; it's
already possible to 'force' an eviction cycle by running 'conntrack -L'.
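[Editorial sketch] Florian's "break early in case too many jiffies expired" alternative would bound the scan by elapsed time instead of by a bucket count fixed up front. A hypothetical userspace model (the per-bucket cost parameter stands in for reading jiffies between buckets; nothing here is from an actual kernel patch):

```c
/* Hypothetical time-budgeted scan: walk buckets until either the whole
 * table has been covered or the elapsed-jiffies budget is spent.
 * `cost_per_bucket` simulates how many jiffies one bucket takes; in the
 * kernel the worker would simply compare jiffies against a deadline.
 * Returns how many buckets were actually scanned this run.
 */
static unsigned int scan_with_budget(unsigned int htable_size,
				     unsigned long budget_jiffies,
				     unsigned long cost_per_bucket)
{
	unsigned long elapsed = 0;
	unsigned int buckets = 0;

	while (buckets < htable_size) {
		/* ... evict expired entries in bucket `buckets` ... */
		elapsed += cost_per_bucket;
		buckets++;
		if (elapsed >= budget_jiffies)
			break;	/* ran too long: yield, resume next run */
	}
	return buckets;
}
```

The appeal is that a lightly loaded system could cover the whole table in one run, while a loaded one still yields the CPU promptly; the cost is that progress per run is no longer predictable.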
Re: [PATCH net 2/2] conntrack: enable to tune gc parameters
On 10/10/2016 at 16:04, Florian Westphal wrote:
> Nicolas Dichtel wrote:
> > After commit b87a2f9199ea ("netfilter: conntrack: add gc worker to
> > remove timed-out entries"), netlink conntrack deletion events may be
> > sent with a huge delay. It could be interesting to let the user tweak
> > gc parameters depending on its use case.
>
> Hmm, care to elaborate?
>
> I am not against doing this but I'd like to hear/read your use case.
>
> The expectation is that in almost all cases eviction will happen from
> the packet path. The gc worker is just there for the case where a busy
> system goes idle.

It was precisely that case. After a period of activity, the event is sent
a long time after the timeout. If the router does not manage a lot of
flows, why not try to parse more entries instead of the default 1/64 of
the table?
In fact, I don't understand why we use GC_MAX_BUCKETS_DIV instead of
always using GC_MAX_BUCKETS, whatever the size of the table is.

> > +nf_conntrack_gc_max_evicts - INTEGER
> > +	The maximum number of entries to be evicted during a run of gc.
> > +	This sysctl is only writeable in the initial net namespace.
>
> Hmmm, do you have any advice on sizing this one?

In fact, no ;-) I really hesitated between exposing the four values and
just a subset. My goal was also to get feedback. I can remove this one.

> I think a better change might be (instead of adding this knob) to
> resched the gc worker for immediate re-execution in case the entire
> "budget" was used. What do you think?

Even if it's not directly related to my problem, I think it's a good idea.

> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -983,7 +983,7 @@ static void gc_worker(struct work_struct *work)
>  		return;
>
>  	ratio = scanned ? expired_count * 100 / scanned : 0;
> -	if (ratio >= 90)
> +	if (ratio >= 90 || expired_count == GC_MAX_EVICTS)
>  		next_run = 0;
Re: [PATCH net 2/2] conntrack: enable to tune gc parameters
Nicolas Dichtel wrote:
> After commit b87a2f9199ea ("netfilter: conntrack: add gc worker to
> remove timed-out entries"), netlink conntrack deletion events may be
> sent with a huge delay. It could be interesting to let the user tweak
> gc parameters depending on its use case.

Hmm, care to elaborate?

I am not against doing this but I'd like to hear/read your use case.

The expectation is that in almost all cases eviction will happen from the
packet path. The gc worker is just there for the case where a busy system
goes idle.

> +nf_conntrack_gc_max_evicts - INTEGER
> +	The maximum number of entries to be evicted during a run of gc.
> +	This sysctl is only writeable in the initial net namespace.

Hmmm, do you have any advice on sizing this one?

I think a better change might be (instead of adding this knob) to resched
the gc worker for immediate re-execution in case the entire "budget" was
used. What do you think?

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -983,7 +983,7 @@ static void gc_worker(struct work_struct *work)
 		return;
 
 	ratio = scanned ? expired_count * 100 / scanned : 0;
-	if (ratio >= 90)
+	if (ratio >= 90 || expired_count == GC_MAX_EVICTS)
 		next_run = 0;
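[Editorial sketch] The one-line change proposed above can be modelled in userspace to see its effect: once the scan has evicted the full per-run budget (GC_MAX_EVICTS), assume more stale entries remain and reschedule immediately. The constants mirror the kernel defaults quoted in this thread; the helper itself is illustrative, not the kernel implementation:

```c
/* Model of the proposed next_run selection: reschedule immediately
 * either when the scan was almost entirely stale entries (ratio >= 90%)
 * or when the eviction budget was fully consumed.
 */
#define HZ 100			/* assumed tick rate for this sketch */
#define GC_INTERVAL   (5 * HZ)
#define GC_MAX_EVICTS 256u

static unsigned long next_run_after_scan(unsigned int scanned,
					 unsigned int expired_count)
{
	unsigned int ratio = scanned ? expired_count * 100 / scanned : 0;

	if (ratio >= 90 || expired_count == GC_MAX_EVICTS)
		return 0;	/* budget consumed or mostly stale: rerun now */
	return GC_INTERVAL;
}
```

Without the `expired_count == GC_MAX_EVICTS` clause, a scan that hit the cap on a mostly live table (low ratio) would wait the full interval even though stale entries are known to remain.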
[PATCH net 2/2] conntrack: enable to tune gc parameters
After commit b87a2f9199ea ("netfilter: conntrack: add gc worker to remove
timed-out entries"), netlink conntrack deletion events may be sent with a
huge delay. It could be interesting to let the user tweak gc parameters
depending on its use case.

CC: Florian Westphal
Signed-off-by: Nicolas Dichtel
---
 Documentation/networking/nf_conntrack-sysctl.txt | 17 ++++++++++++++++
 include/net/netfilter/nf_conntrack_core.h        |  5 +++++
 net/netfilter/nf_conntrack_core.c                | 17 +++++++++-------
 net/netfilter/nf_conntrack_standalone.c          | 36 ++++++++++++++++++
 4 files changed, 67 insertions(+), 8 deletions(-)

diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.txt
index 399e4e866a9c..5b6ace93521d 100644
--- a/Documentation/networking/nf_conntrack-sysctl.txt
+++ b/Documentation/networking/nf_conntrack-sysctl.txt
@@ -37,6 +37,23 @@ nf_conntrack_expect_max - INTEGER
 	Maximum size of expectation table.  Default value is
 	nf_conntrack_buckets / 256. Minimum is 1.
 
+nf_conntrack_gc_interval - INTEGER
+	Maximum interval in seconds between two runs of the conntrack gc.
+	This gc is in charge of removing stale entries. It also impacts
+	the delay before notifying userland of a conntrack deletion.
+	This sysctl is only writeable in the initial net namespace.
+
+nf_conntrack_gc_max_buckets - INTEGER
+nf_conntrack_gc_max_buckets_div - INTEGER
+	During a run, the conntrack gc processes at most
+	nf_conntrack_buckets/nf_conntrack_gc_max_buckets_div (and never
+	more than nf_conntrack_gc_max_buckets) entries.
+	These sysctls are only writeable in the initial net namespace.
+
+nf_conntrack_gc_max_evicts - INTEGER
+	The maximum number of entries to be evicted during a run of gc.
+	This sysctl is only writeable in the initial net namespace.
+
 nf_conntrack_frag6_high_thresh - INTEGER
 	default 262144

diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index 62e17d1319ff..2a5ed368fb71 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -86,4 +86,9 @@ void nf_conntrack_lock(spinlock_t *lock);
 
 extern spinlock_t nf_conntrack_expect_lock;
 
+extern unsigned int nf_ct_gc_interval;
+extern unsigned int nf_ct_gc_max_buckets_div;
+extern unsigned int nf_ct_gc_max_buckets;
+extern unsigned int nf_ct_gc_max_evicts;
+
 #endif /* _NF_CONNTRACK_CORE_H */

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index ba6a1d421222..435b431e3449 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -83,10 +83,10 @@ static __read_mostly spinlock_t nf_conntrack_locks_all_lock;
 static __read_mostly DEFINE_SPINLOCK(nf_conntrack_locks_all_lock);
 static __read_mostly bool nf_conntrack_locks_all;
 
-#define GC_MAX_BUCKETS_DIV	64u
-#define GC_MAX_BUCKETS		8192u
-#define GC_INTERVAL		(5 * HZ)
-#define GC_MAX_EVICTS		256u
+unsigned int nf_ct_gc_interval = 5 * HZ;
+unsigned int nf_ct_gc_max_buckets = 8192;
+unsigned int nf_ct_gc_max_buckets_div = 64;
+unsigned int nf_ct_gc_max_evicts = 256;
 
 static struct conntrack_gc_work conntrack_gc_work;
 
@@ -936,13 +936,14 @@ static noinline int early_drop(struct net *net, unsigned int _hash)
 static void gc_worker(struct work_struct *work)
 {
 	unsigned int i, goal, buckets = 0, expired_count = 0;
-	unsigned long next_run = GC_INTERVAL;
+	unsigned long next_run = nf_ct_gc_interval;
 	unsigned int ratio, scanned = 0;
 	struct conntrack_gc_work *gc_work;
 
 	gc_work = container_of(work, struct conntrack_gc_work, dwork.work);
 
-	goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);
+	goal = min(nf_conntrack_htable_size / nf_ct_gc_max_buckets_div,
+		   nf_ct_gc_max_buckets);
 	i = gc_work->last_bucket;
 
 	do {
@@ -977,7 +978,7 @@ static void gc_worker(struct work_struct *work)
 		rcu_read_unlock();
 		cond_resched_rcu_qs();
 	} while (++buckets < goal &&
-		 expired_count < GC_MAX_EVICTS);
+		 expired_count < nf_ct_gc_max_evicts);
 
 	if (gc_work->exiting)
 		return;
@@ -1885,7 +1886,7 @@ int nf_conntrack_init_start(void)
 	nf_ct_untracked_status_or(IPS_CONFIRMED | IPS_UNTRACKED);
 
 	conntrack_gc_work_init(&conntrack_gc_work);
-	schedule_delayed_work(&conntrack_gc_work.dwork, GC_INTERVAL);
+	schedule_delayed_work(&conntrack_gc_work.dwork, nf_ct_gc_interval);
 
 	return 0;
 
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 5f446cd9f3fd..c5310fb35eca 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -445,6 +445,8 @@ static void