Re: [PATCH nf] netfilter: ipv6: nf_defrag: drop skb dst before queueing
On 07/09/2018 04:43 AM, Florian Westphal wrote:
> Eric Dumazet reports:
>  Here is a reproducer of an annoying bug detected by syzkaller on our
>  production kernel
> [..]
>  ./b78305423 enable_conntrack
>  Then :
>  sleep 60
>  dmesg | tail -10
>  [ 171.599093] unregister_netdevice: waiting for lo to become free. Usage count = 2
>  [ 181.631024] unregister_netdevice: waiting for lo to become free. Usage count = 2
>  [ 191.687076] unregister_netdevice: waiting for lo to become free. Usage count = 2
>  [ 201.703037] unregister_netdevice: waiting for lo to become free. Usage count = 2
>  [ 211.711072] unregister_netdevice: waiting for lo to become free. Usage count = 2
>  [ 221.959070] unregister_netdevice: waiting for lo to become free. Usage count = 2
>
> Reproducer sends ipv6 fragment that hits nfct defrag via LOCAL_OUT hook.
> skb gets queued until frag timer expiry -- 1 minute.
>
> Normally nf_conntrack_reasm gets called during prerouting, so skb has
> no dst yet which might explain why this wasn't spotted earlier.
>
> Reported-by: Eric Dumazet
> Reported-by: John Sperbeck
> Signed-off-by: Florian Westphal

Tested-by: Eric Dumazet
Reported-by: syzbot

Thanks !

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] netfilter: properly initialize xt_table_info structure
On 05/17/2018 02:34 AM, Greg Kroah-Hartman wrote:
> When allocating a xt_table_info structure, we should be clearing out the
> full amount of memory that was allocated, not just the "header" of the
> structure. Otherwise odd values could be passed to userspace, which is
> not a good thing.
>
> Cc: stable
> Signed-off-by: Greg Kroah-Hartman
> ---
> v2: use kvzalloc instead of kvmalloc/memset pair, as suggested by Michal
>     Kubecek
>
>  net/netfilter/x_tables.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index cb7cb300c3bc..cd22bb9b66f3 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1183,11 +1183,10 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
>  	 * than shoot all processes down before realizing there is nothing
>  	 * more to reclaim.
>  	 */
> -	info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> +	info = kvzalloc(sz, GFP_KERNEL | __GFP_NORETRY);
>  	if (!info)
>  		return NULL;
>
> -	memset(info, 0, sizeof(*info));
>  	info->size = size;
>  	return info;
>  }

I am curious, what particular path does not later overwrite the whole zone ?

Do not get me wrong, this is not fast path, but these blobs can be huge.
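For readers following along, here is a userspace sketch (not kernel code; the struct, helper names and the poisoned arena are all made up for illustration) of why zeroing only `sizeof(*info)` is insufficient when the trailing blob can later be copied back to userspace:

```c
#include <assert.h>
#include <string.h>

/* Userspace analogue of the discussion: zeroing only the header leaves
 * the trailing blob with whatever the allocator left there, while the
 * kvzalloc() pattern zeroes the full allocation. */
struct table_info_like {
	unsigned int size;		/* "header" */
	unsigned char entries[];	/* blob follows */
};

static unsigned char arena[256];

/* Simulate an allocator returning a recycled chunk full of stale data. */
static void *alloc_poisoned(size_t sz)
{
	(void)sz;
	memset(arena, 0xAA, sizeof(arena));
	return arena;
}

/* Pre-fix pattern: allocate, then memset() only the header. */
static struct table_info_like *alloc_header_zeroed(size_t sz)
{
	struct table_info_like *info = alloc_poisoned(sz);

	memset(info, 0, sizeof(*info));
	return info;
}

/* kvzalloc()-style pattern: the whole allocation is zeroed. */
static struct table_info_like *alloc_fully_zeroed(size_t sz)
{
	struct table_info_like *info = alloc_poisoned(sz);

	memset(info, 0, sz);
	return info;
}
```

With the header-only memset, the bytes past the header keep the allocator's stale contents; with full zeroing they do not.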
Re: WARNING in __proc_create
On 03/09/2018 03:32 PM, Cong Wang wrote:
> On Fri, Mar 9, 2018 at 3:21 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>> On 03/09/2018 03:05 PM, Cong Wang wrote:
>>> BTW, the warning itself is all about empty names, so perhaps it's
>>> better to fix them separately.
>>
>> Huh ? You want more syzbot reports ?
>
> I do not. I always prefer one patch to fix one problem, and, as I
> already said, checking "." and ".." is probably not as trivial as
> checking empty. This is why I believe 2 (or more) patches here are
> better than 1 single patch.
>
> BTW, I don't have any patch, it is so trivial that I don't want to
> spend my time on it.

The trivial patch has been sent, and Florian gave (very good) feedback.

Then Pablo offered to write a complete patch.

They both had more pressing issues, and quite frankly I also had more
pressing issues.

I doubt that sending again the simple fix will be accepted by netfilter
maintainers.
Re: WARNING in __proc_create
On 03/09/2018 03:05 PM, Cong Wang wrote:
> BTW, the warning itself is all about empty names, so perhaps it's
> better to fix them separately.

Huh ? You want more syzbot reports ? I do not.

I unblocked this report today [1], you can be sure that as soon as
syzbot gets the correct tag (Reported-by: ...), it will find the other
problems, which are currently on hold to avoid spam.

[1] It has been sitting for a while here at Google, since January 28th.
At some point you have to ping Pablo/Florian again ;)
Re: WARNING in __proc_create
On 03/09/2018 02:56 PM, Eric Dumazet wrote:
> I sent a patch a while back, but Pablo/Florian wanted more than that
> simple fix.
>
> We also need to filter special characters like '/'
>
> Or maybe I am mixing with something else.

Yes, Florian mentioned that we also had to reject "." and ".."
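A minimal userspace sketch of the kind of name filtering being discussed in this thread — empty names, ".", "..", and names containing '/' all rejected before the string reaches proc_create_data(). The helper name and the length limit are illustrative, not the actual kernel patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical validation helper: a proc-visible table name must be
 * NUL-terminated within maxlen, non-empty, not "." or "..", and must
 * not contain a path separator that would escape the directory. */
static bool xt_name_is_valid(const char *name, size_t maxlen)
{
	size_t len = strnlen(name, maxlen);

	if (len == 0 || len == maxlen)	/* empty, or not NUL-terminated */
		return false;
	if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
		return false;
	if (memchr(name, '/', len))	/* would act as a path component */
		return false;
	return true;
}
```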
Re: WARNING in __proc_create
On 03/09/2018 02:48 PM, Cong Wang wrote:
> On Fri, Mar 9, 2018 at 1:59 PM, syzbot wrote:
>> Hello,
>>
>> syzbot hit the following crash on net-next commit
>> 617aebe6a97efa539cc4b8a52adccd89596e6be0 (Sun Feb 4 00:25:42 2018 +)
>> Merge tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
>>
>> So far this crash happened 12 times on net-next, upstream.
>> C reproducer is attached.
>> syzkaller reproducer is attached. Raw console output is attached.
>> compiler: gcc (GCC) 7.1.1 20170620
>> .config is attached.
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+0502b00edac2a0680...@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for
>> details. If you forward the report, please keep this part and the footer.
>>
>> audit: type=1400 audit(1518223777.419:7): avc: denied { map } for pid=4159
>> comm="syzkaller598581" path="/root/syzkaller598581546" dev="sda1" ino=16481
>> scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
>> tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1
>> [ cut here ]
>> name len 0
>> WARNING: CPU: 0 PID: 4159 at fs/proc/generic.c:354 __proc_create+0x696/0x880 fs/proc/generic.c:354
>
> We need to reject empty names.

I sent a patch a while back, but Pablo/Florian wanted more than that
simple fix.

We also need to filter special characters like '/'

Or maybe I am mixing with something else.
Re: [PATCH nf v5] netfilter: bridge: ebt_among: add more missing match size checks
On 03/09/2018 05:27 AM, Florian Westphal wrote:
> ebt_among is special, it has a dynamic match size and is exempt
> from the central size checks.
>
> commit c4585a2823edf ("bridge: ebt_among: add missing match size checks")
> added validation for pool size, but missed fact that the macros
> ebt_among_wh_src/dst can already return out-of-bound result because
> they do not check value of wh_src/dst_ofs (an offset) vs. the size
> of the match that userspace gave to us.
>
> v2: check that offset has correct alignment.
> Paolo Abeni points out that we should also check that src/dst wormhash
> arrays do not overlap, and src + length lines up with start of dst
> (or vice versa).
> v3: compact wormhash_sizes_valid() part
>
> NB: Fixes tag is intentionally wrong, this bug exists from day one
> when match was added for 2.6 kernel. Tag is there so stable
> maintainers will notice this one too.
>
> Tested with same rules from the earlier patch.
>
> Fixes: c4585a2823edf ("bridge: ebt_among: add missing match size checks")
> Reported-by: <syzbot+bdabab6f1983a03fc...@syzkaller.appspotmail.com>
> Signed-off-by: Florian Westphal <f...@strlen.de>
> ---
> v5: this is really v3 again, but with explicit (int)sizeof() cast
> rather than use of temporary 'int minsize'.
> objdump shows identical output for v3/v4/v5.

SGTM, thanks Florian ;)

Reviewed-by: Eric Dumazet <eduma...@google.com>
Re: [PATCH nf v4] netfilter: bridge: ebt_among: add more missing match size checks
On 03/09/2018 02:03 AM, Florian Westphal wrote:
> ebt_among is special, it has a dynamic match size and is exempt
> from the central size checks.
>
> commit c4585a2823edf ("bridge: ebt_among: add missing match size checks")
> added validation for pool size, but missed fact that the macros
> ebt_among_wh_src/dst can already return out-of-bound result because
> they do not check value of wh_src/dst_ofs (an offset) vs. the size
> of the match that userspace gave to us.
>
> v2: check that offset has correct alignment.
> Paolo Abeni points out that we should also check that src/dst wormhash
> arrays do not overlap, and src + length lines up with start of dst
> (or vice versa).
> v3: compact wormhash_sizes_valid() part
>
> NB: Fixes tag is intentionally wrong, this bug exists from day one
> when match was added for 2.6 kernel. Tag is there so stable
> maintainers will notice this one too.
>
> Tested with same rules from the earlier patch.
>
> Fixes: c4585a2823edf ("bridge: ebt_among: add missing match size checks")
> Reported-by:
> Signed-off-by: Florian Westphal
> ---
> v4: rewrite wormhash_offset_invalid to make it clearer that 'off' is
> <= INT_MAX, objdump doesn't show change vs. v3.
>
>  net/bridge/netfilter/ebt_among.c | 37 +
>  1 file changed, 37 insertions(+)
>
> diff --git a/net/bridge/netfilter/ebt_among.c b/net/bridge/netfilter/ebt_among.c
> index c5afb4232ecb..eec7459235a2 100644
> --- a/net/bridge/netfilter/ebt_among.c
> +++ b/net/bridge/netfilter/ebt_among.c
> @@ -177,6 +177,31 @@ static bool poolsize_invalid(const struct ebt_mac_wormhash *w)
>  	return w && w->poolsize >= (INT_MAX / sizeof(struct ebt_mac_wormhash_tuple));
>  }
>
> +static bool wormhash_offset_invalid(int s_off, unsigned int len)
> +{
> +	int minsize = sizeof(struct ebt_among_info);
> +	unsigned int off = s_off;
> +
> +	if (s_off == 0) /* not present */
> +		return false;
> +
> +	if (s_off < minsize || off % __alignof__(struct ebt_mac_wormhash))
> +		return true;
> +
> +	/* off <= INT_MAX */

Yes, I guess my confusion came from the fact that you are using a signed
integer to hold sizeof(struct ebt_among_info), and this is a bit unusual,
sizeof() being fundamentally unsigned.

A very explicit check would have helped, no need for an extra temp var.

	if ((int)off < (int)sizeof(struct ebt_among_info))
		return true;

Thanks !

> +	off += sizeof(struct ebt_mac_wormhash);
> +
> +	return off > len;
> +}
> +
> +static bool wormhash_sizes_valid(const struct ebt_mac_wormhash *wh, int a, int b)
> +{
> +	if (a == 0)
> +		a = sizeof(struct ebt_among_info);
> +
> +	return ebt_mac_wormhash_size(wh) + a == b;
> +}
> +
>  static int ebt_among_mt_check(const struct xt_mtchk_param *par)
>  {
>  	const struct ebt_among_info *info = par->matchinfo;
> @@ -189,6 +214,10 @@ static int ebt_among_mt_check(const struct xt_mtchk_param *par)
>  	if (expected_length > em->match_size)
>  		return -EINVAL;
>
> +	if (wormhash_offset_invalid(info->wh_dst_ofs, em->match_size) ||
> +	    wormhash_offset_invalid(info->wh_src_ofs, em->match_size))
> +		return -EINVAL;
> +
>  	wh_dst = ebt_among_wh_dst(info);
>  	if (poolsize_invalid(wh_dst))
>  		return -EINVAL;
> @@ -201,6 +230,14 @@ static int ebt_among_mt_check(const struct xt_mtchk_param *par)
>  	if (poolsize_invalid(wh_src))
>  		return -EINVAL;
>
> +	if (info->wh_src_ofs < info->wh_dst_ofs) {
> +		if (!wormhash_sizes_valid(wh_src, info->wh_src_ofs, info->wh_dst_ofs))
> +			return -EINVAL;
> +	} else {
> +		if (!wormhash_sizes_valid(wh_dst, info->wh_dst_ofs, info->wh_src_ofs))
> +			return -EINVAL;
> +	}
> +
>  	expected_length += ebt_mac_wormhash_size(wh_src);
>
>  	if (em->match_size != EBT_ALIGN(expected_length)) {
Re: [PATCH nf v3] netfilter: bridge: ebt_among: add more missing match size checks
On 03/08/2018 04:24 PM, Florian Westphal wrote:
> Eric Dumazet <eric.duma...@gmail.com> wrote:
>>> Fixes: c4585a2823edf ("bridge: ebt_among: add missing match size checks")
>>> Reported-by: <syzbot+bdabab6f1983a03fc...@syzkaller.appspotmail.com>
>>> Signed-off-by: Florian Westphal <f...@strlen.de>
>>> ---
>>>  net/bridge/netfilter/ebt_among.c | 35 +++
>>>  1 file changed, 35 insertions(+)
>>>
>>> diff --git a/net/bridge/netfilter/ebt_among.c b/net/bridge/netfilter/ebt_among.c
>>> index c5afb4232ecb..600ac7f6671a 100644
>>> --- a/net/bridge/netfilter/ebt_among.c
>>> +++ b/net/bridge/netfilter/ebt_among.c
>>> @@ -177,6 +177,29 @@ static bool poolsize_invalid(const struct ebt_mac_wormhash *w)
>>>  	return w && w->poolsize >= (INT_MAX / sizeof(struct ebt_mac_wormhash_tuple));
>>>  }
>>>
>>> +static bool wormhash_offset_invalid(int off, unsigned int len)
>>> +{
>>> +	int minsize = sizeof(struct ebt_among_info);
>>> +
>>> +	if (off == 0) /* not present */
>>> +		return false;
>>> +
>>> +	if (off < minsize || off % __alignof__(struct ebt_mac_wormhash))
>>> +		return true;
>>> +
>>> +	off += sizeof(struct ebt_mac_wormhash);
>>
>> Can this overflow ?
>
> Yes, off can wrap.
>
>>> +	return off > len;
>
> len is unsigned though so the unsigned promotion would still catch this.

Not sure I understand.

Say the result is off==0

return off > len; will return false.

I thought we were trying to return true for invalid input.

> If you think this is too fragile let me know and I can submit a v4
> with a more explicit test (e.g. adding back "unsigned int alleged_off = off",
> and testing vs. INT_MAX.
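To make the signed/unsigned reasoning concrete, here is a userspace replica of the v4-style check, with made-up stand-in sizes (the real values come from the ebtables headers, so treat the constants and the helper as illustrative only). The trick is to validate the value as a signed int first, which bounds it to at most INT_MAX, and only then do the addition in unsigned arithmetic where it cannot wrap:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in sizes; assumed for illustration, not the real kernel values. */
#define AMONG_INFO_SIZE	32u	/* "sizeof(struct ebt_among_info)" */
#define WORMHASH_SIZE	16u	/* "sizeof(struct ebt_mac_wormhash)" */
#define WORMHASH_ALIGN	4u	/* "__alignof__(struct ebt_mac_wormhash)" */

/* Mirrors the v4/v5 shape: reject negative or too-small offsets while the
 * value is still signed, so that by the time we add WORMHASH_SIZE we know
 * 0 < off <= INT_MAX and the unsigned addition cannot wrap to a small value. */
static bool wormhash_offset_invalid(int s_off, unsigned int len)
{
	unsigned int off = s_off;

	if (s_off == 0)		/* wormhash not present */
		return false;

	if (s_off < (int)AMONG_INFO_SIZE || off % WORMHASH_ALIGN)
		return true;

	/* here off <= INT_MAX, so this addition cannot overflow */
	off += WORMHASH_SIZE;
	return off > len;
}
```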
Re: [PATCH nf-next 1/2] netfilter: SYNPROXY: set transport header properly
On 03/08/2018 07:01 AM, Serhey Popovych wrote:
> Eric Dumazet wrote:
>> On 03/08/2018 02:08 AM, Serhey Popovych wrote:
>>> We can't use skb_reset_transport_header() together with skb_put() to
>>> set skb->transport_header field because skb_put() does not touch
>>> skb->data.
>>>
>>> Do this the same way as we did for csum_data in code: subtract
>>> skb->head from tcph.
>>>
>>> Signed-off-by: Serhey Popovych <serhe.popov...@gmail.com>
>>> ---
>>>  net/ipv4/netfilter/ipt_SYNPROXY.c  | 8
>>>  net/ipv6/netfilter/ip6t_SYNPROXY.c | 8
>>>  2 files changed, 8 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c b/net/ipv4/netfilter/ipt_SYNPROXY.c
>>> index f75fc6b..4953390 100644
>>> --- a/net/ipv4/netfilter/ipt_SYNPROXY.c
>>> +++ b/net/ipv4/netfilter/ipt_SYNPROXY.c
>>> @@ -90,8 +90,8 @@
>>>  	niph = synproxy_build_ip(net, nskb, iph->daddr, iph->saddr);
>>>
>>> -	skb_reset_transport_header(nskb);
>>>  	nth = skb_put(nskb, tcp_hdr_size);
>>> +	nskb->transport_header = (unsigned char *)nth - nskb->head;
>>
>> I have no idea why you believe the current code is not correct.
>>
>> alloc_skb():
>> 	skb->head = skb->data = skb_tail_pointer(skb)
>>
>> synproxy_build_ip():
>> 	skb_reset_network_header() /* skb_network_header(skb) == skb->data */
>> 	skb_put(skb, len); /* skb->tail += len; skb->len += len */
>>
>> nth = skb_put() : nth is equal to skb->data
>
> Yes it is equal to skb->data, but in my opinion should be set to
> skb->data + ip_hdrlen(skb) to point to TCP header.

Is it really needed to set transport header offset ?

We do not intend to inject this packet to a local TCP stack ?

Nothing will care of transport header for non GSO packet.

Anyway, maybe the confusion would be avoided if the normal way of
pushing headers were used (using skb_push()).

No casts requested ;)
Re: [PATCH nf-next 1/2] netfilter: SYNPROXY: set transport header properly
On 03/08/2018 02:08 AM, Serhey Popovych wrote:
> We can't use skb_reset_transport_header() together with skb_put() to
> set skb->transport_header field because skb_put() does not touch
> skb->data.
>
> Do this the same way as we did for csum_data in code: subtract
> skb->head from tcph.
>
> Signed-off-by: Serhey Popovych
> ---
>  net/ipv4/netfilter/ipt_SYNPROXY.c  | 8
>  net/ipv6/netfilter/ip6t_SYNPROXY.c | 8
>  2 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c b/net/ipv4/netfilter/ipt_SYNPROXY.c
> index f75fc6b..4953390 100644
> --- a/net/ipv4/netfilter/ipt_SYNPROXY.c
> +++ b/net/ipv4/netfilter/ipt_SYNPROXY.c
> @@ -90,8 +90,8 @@
>  	niph = synproxy_build_ip(net, nskb, iph->daddr, iph->saddr);
>
> -	skb_reset_transport_header(nskb);
>  	nth = skb_put(nskb, tcp_hdr_size);
> +	nskb->transport_header = (unsigned char *)nth - nskb->head;

I have no idea why you believe the current code is not correct.

nth = skb_put() : nth is equal to skb->data

So skb_reset_transport_header(nskb); does exactly the same thing as your
code, but in a more readable way, without the ugly cast.
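Eric's point can be checked with a toy userspace model of the sk_buff fields involved: skb_put() only moves tail, never data, so for a freshly allocated skb (where data == tail) resetting the transport header before the put yields exactly the offset the patch computes by hand. The struct and helpers below mimic the kernel's names but are purely illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the sk_buff fields relevant to the discussion; the
 * transport header is stored as an offset from head, as in the kernel. */
struct toy_skb {
	unsigned char buf[256];
	unsigned char *head, *data, *tail;
	size_t transport_header;	/* offset from head */
};

/* alloc_skb(): head == data == tail */
static void toy_init(struct toy_skb *skb)
{
	skb->head = skb->data = skb->tail = skb->buf;
	skb->transport_header = 0;
}

/* skb_put(): extends tail and returns the old tail; data is untouched. */
static unsigned char *toy_put(struct toy_skb *skb, size_t len)
{
	unsigned char *old_tail = skb->tail;

	skb->tail += len;
	return old_tail;
}

/* skb_reset_transport_header(): transport header points at data. */
static void toy_reset_transport_header(struct toy_skb *skb)
{
	skb->transport_header = skb->data - skb->head;
}
```

Since toy_put() never moves data, calling toy_reset_transport_header() before or after the put records the same offset, which is the "(unsigned char *)nth - nskb->head" value the patch computes manually.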
Re: [Patch nf-next] netfilter: make xt_rateest hash table per net
On Thu, 2018-03-01 at 18:58 -0800, Cong Wang wrote:
> As suggested by Eric, we need to make the xt_rateest
> hash table and its lock per netns to reduce lock
> contentions.
>
> Cc: Florian Westphal <f...@strlen.de>
> Cc: Eric Dumazet <eduma...@google.com>
> Cc: Pablo Neira Ayuso <pa...@netfilter.org>
> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
> ---
>  include/net/netfilter/xt_rateest.h |  4 +-
>  net/netfilter/xt_RATEEST.c         | 91 +++---
>  net/netfilter/xt_rateest.c         | 10 ++---
>  3 files changed, 72 insertions(+), 33 deletions(-)

Very nice, thanks !

Reviewed-by: Eric Dumazet <eduma...@google.com>

Although the main reason was to avoid name collisions between different
netns. Hash table is small enough that it can be allocated for each netns.
[PATCH nf] netfilter: IDLETIMER: be syzkaller friendly
From: Eric Dumazet <eduma...@google.com>

We had one report from syzkaller [1]

First issue is that INIT_WORK() should be done before mod_timer()
or we risk timer being fired too soon, even with a 1 second timer.

Second issue is that we need to reject too big info->timeout
to avoid overflows in msecs_to_jiffies(info->timeout * 1000), or
risk looping, if result after overflow is 0.

[1]
WARNING: CPU: 1 PID: 5129 at kernel/workqueue.c:1444 __queue_work+0xdf4/0x1230 kernel/workqueue.c:1444
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 5129 Comm: syzkaller159866 Not tainted 4.16.0-rc1+ #230
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 panic+0x1e4/0x41c kernel/panic.c:183
 __warn+0x1dc/0x200 kernel/panic.c:547
 report_bug+0x211/0x2d0 lib/bug.c:184
 fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
 fixup_bug arch/x86/kernel/traps.c:247 [inline]
 do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
 invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:988
RIP: 0010:__queue_work+0xdf4/0x1230 kernel/workqueue.c:1444
RSP: 0018:8801db507538 EFLAGS: 00010006
RAX: 8801aeb46080 RBX: 8801db530200 RCX: 81481404
RDX: 0100 RSI: 86b42640 RDI: 0082
RBP: 8801db507758 R08: 11003b6a0de5 R09: 000c
R10: 8801db5073f0 R11: 0020 R12: 11003b6a0eb6
R13: 8801b1067ae0 R14: 01f8 R15: dc00
 queue_work_on+0x16a/0x1c0 kernel/workqueue.c:1488
 queue_work include/linux/workqueue.h:488 [inline]
 schedule_work include/linux/workqueue.h:546 [inline]
 idletimer_tg_expired+0x44/0x60 net/netfilter/xt_IDLETIMER.c:116
 call_timer_fn+0x228/0x820 kernel/time/timer.c:1326
 expire_timers kernel/time/timer.c:1363 [inline]
 __run_timers+0x7ee/0xb70 kernel/time/timer.c:1666
 run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:541 [inline]
 smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:829
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:777 [inline]
RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:160 [inline]
RIP: 0010:_raw_spin_unlock_irqrestore+0x5e/0xba kernel/locking/spinlock.c:184
RSP: 0018:8801c20173c8 EFLAGS: 0282 ORIG_RAX: ff12
RAX: dc00 RBX: 0282 RCX: 0006
RDX: 10d592cd RSI: 110035d68d23 RDI: 0282
RBP: 8801c20173d8 R08: 110038402e47 R09: R10: R11: R12: 8820e5c8
R13: 8801b1067ad8 R14: 8801aea7c268 R15: 8801aea7c278
 __debug_object_init+0x235/0x1040 lib/debugobjects.c:378
 debug_object_init+0x17/0x20 lib/debugobjects.c:391
 __init_work+0x2b/0x60 kernel/workqueue.c:506
 idletimer_tg_create net/netfilter/xt_IDLETIMER.c:152 [inline]
 idletimer_tg_checkentry+0x691/0xb00 net/netfilter/xt_IDLETIMER.c:213
 xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:850
 check_target net/ipv6/netfilter/ip6_tables.c:533 [inline]
 find_check_entry.isra.7+0x935/0xcf0 net/ipv6/netfilter/ip6_tables.c:575
 translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:744
 do_replace net/ipv6/netfilter/ip6_tables.c:1160 [inline]
 do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1686
 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
 nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
 ipv6_setsockopt+0x10b/0x130 net/ipv6/ipv6_sockglue.c:927
 udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2976
 SYSC_setsockopt net/socket.c:1850 [inline]
 SyS_setsockopt+0x189/0x360 net/socket.c:1829
 do_syscall_64+0x282/0x940 arch/x86/entry/common.c:287

Fixes: 0902b469bd25 ("netfilter: xtables: idletimer target implementation")
Signed-off-by: Eric Dumazet <eduma...@google.com>
Reported-by: syzkaller <syzkal...@googlegroups.com>
---
 net/netfilter/xt_IDLETIMER.c |    9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/xt_IDLETIMER.c b/net/netfilter/xt_IDLETIMER.c
index 6c2482b709b1eca341926a8393f45dfead358561..1ac6600bfafd60b6b4d5aaf88a41fb57c6ec195b 100644
--- a/net/netfilter/xt_IDLETIMER.c
+++ b/net/netfilter/xt_IDLETIMER.c
@@ -146,11 +146,11 @@ static int idletimer_tg_create(struct idletimer_tg_info *info)
 	timer_setup(&info->timer->timer, idletimer_tg_expired, 0);
 	info->timer->refcnt = 1;
 
+	INIT_WORK(&info->timer->work, idletimer_tg_work);
+
 	mod_timer(&info->timer->timer, msecs_to_jiffies(info->
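The second issue — the 32-bit overflow in info->timeout * 1000 — is easy to demonstrate in userspace. The bound below is a sketch of the kind of sanity check being argued for (helper names are invented; this is not the actual patch):

```c
#include <assert.h>
#include <stdint.h>

/* info->timeout comes from userspace as a u32, and the multiplication by
 * 1000 is done in 32-bit arithmetic, so large timeouts wrap -- possibly
 * to a tiny value, or even to a multiple that lands near 0. */
static uint32_t timeout_msecs(uint32_t timeout_secs)
{
	return timeout_secs * 1000u;	/* wraps modulo 2^32 */
}

/* Sketch of a sanity check that rejects timeouts whose conversion to
 * milliseconds would overflow a u32. */
static int timeout_is_sane(uint32_t timeout_secs)
{
	return timeout_secs <= UINT32_MAX / 1000u;
}
```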
Re: [PATCH net v2] netfilter: nat: cope with negative port range
On Wed, 2018-02-14 at 13:30 +0100, Florian Westphal wrote:
> Eric Dumazet <eric.duma...@gmail.com> wrote:
>> On Wed, 2018-02-14 at 12:13 +0100, Paolo Abeni wrote:
>>> syzbot reported a division by 0 bug in the netfilter nat code:
>>> Adding the relevant check at parse time could break existing
>>> setup, moreover we would need to read/write such values atomically
>>> to avoid possible transient negative ranges at update time.
>>
>> I do not quite follow why it is so hard to add a check at parse time.
>>
>> Breaking buggy setups would not be a concern I think.
>
> It would be possible for xtables but afaics in nft_nat.c case
> (nft_nat_eval) range.{min,max}_proto.all values are loaded from nft
> registers at runtime.

I prefer this explanation much more, I suggest we update the changelog
to explain the real reason.

Thanks Florian.
Re: [Patch net v2] xt_RATEEST: acquire xt_rateest_mutex for hash insert
On Mon, 2018-02-05 at 14:41 -0800, Cong Wang wrote:
> rateest_hash is supposed to be protected by xt_rateest_mutex,
> and, as suggested by Eric, lookup and insert should be atomic,
> so we should acquire the xt_rateest_mutex once for both.
>
> So introduce a non-locking helper for internal use and keep the
> locking one for external.
>
> Reported-by: <syzbot+5cb189720978275e4...@syzkaller.appspotmail.com>
> Fixes: 5859034d7eb8 ("[NETFILTER]: x_tables: add RATEEST target")
> Cc: Pablo Neira Ayuso <pa...@netfilter.org>
> Cc: Eric Dumazet <eric.duma...@gmail.com>
> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
> ---

Reviewed-by: Eric Dumazet <eduma...@google.com>

Thanks !
Re: [Patch net] xt_RATEEST: acquire xt_rateest_mutex for hash insert
On Wed, 2018-01-31 at 16:26 -0800, Cong Wang wrote:
> rateest_hash is supposed to be protected by xt_rateest_mutex.
>
> Reported-by:
> Fixes: 5859034d7eb8 ("[NETFILTER]: x_tables: add RATEEST target")
> Cc: Pablo Neira Ayuso
> Signed-off-by: Cong Wang
> ---
>  net/netfilter/xt_RATEEST.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/netfilter/xt_RATEEST.c b/net/netfilter/xt_RATEEST.c
> index 498b54fd04d7..83ec3a282755 100644
> --- a/net/netfilter/xt_RATEEST.c
> +++ b/net/netfilter/xt_RATEEST.c
> @@ -36,7 +36,9 @@ static void xt_rateest_hash_insert(struct xt_rateest *est)
>  	unsigned int h;
>
>  	h = xt_rateest_hash(est->name);
> +	mutex_lock(&xt_rateest_mutex);
>  	hlist_add_head(&est->list, &rateest_hash[h]);
> +	mutex_unlock(&xt_rateest_mutex);
>  }

We probably should make this module netns aware, otherwise bad things
will happen.

(It seems multiple threads could run, so getting the mutex twice
(xt_rateest_lookup then xt_rateest_hash_insert()) is racy.)
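Eric's race note — lookup and insert each taking the mutex separately leaves a window where two racing writers both miss the lookup and both insert — is the classic check-then-act bug. A userspace sketch of the single-lock lookup-or-insert pattern (all names invented; not the kernel code) looks like this:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Toy singly-linked "hash table" of estimators, keyed by name. */
struct rateest {
	char name[16];
	struct rateest *next;
};

static struct rateest *table;
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Must be called with table_lock held. */
static struct rateest *rateest_find_locked(const char *name)
{
	struct rateest *e;

	for (e = table; e; e = e->next)
		if (strcmp(e->name, name) == 0)
			return e;
	return NULL;
}

/* Lookup and insert under one mutex hold: no window for duplicates. */
static struct rateest *rateest_get_or_insert(const char *name)
{
	struct rateest *e;

	pthread_mutex_lock(&table_lock);
	e = rateest_find_locked(name);
	if (!e) {
		e = calloc(1, sizeof(*e));
		strncpy(e->name, name, sizeof(e->name) - 1);
		e->next = table;
		table = e;
	}
	pthread_mutex_unlock(&table_lock);
	return e;
}
```

Dropping and re-taking the lock between the find and the insert would reintroduce exactly the race being discussed.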
Re: [patch 1/1] net/netfilter/x_tables.c: make allocation less aggressive
On Tue, 2018-01-30 at 11:30 -0800, a...@linux-foundation.org wrote:
> From: Michal Hocko
> Subject: net/netfilter/x_tables.c: make allocation less aggressive
>
> syzbot has noticed that xt_alloc_table_info can allocate a lot of memory.
> This is an admin only interface but an admin in a namespace is sufficient
> as well. eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in
> xt_alloc_table_info()") has changed the opencoded kmalloc->vmalloc
> fallback into kvmalloc. It has dropped __GFP_NORETRY on the way because
> vmalloc has simply never fully supported __GFP_NORETRY semantic. This is
> still the case because e.g. page tables backing the vmalloc area are
> hardcoded GFP_KERNEL.
>
> Revert back to __GFP_NORETRY as a poors man defence against excessively
> large allocation request here. We will not rule out the OOM killer
> completely but __GFP_NORETRY should at least stop the large request in
> most cases.
>
> [a...@linux-foundation.org: coding-style fixes]
> Fixes: eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()")
> Link: http://lkml.kernel.org/r/20180130140104.ge21...@dhcp22.suse.cz
> Signed-off-by: Michal Hocko
> Acked-by: Florian Westphal
> Reviewed-by: Andrew Morton
> Cc: David S. Miller
> Signed-off-by: Andrew Morton
> ---
>
>  net/netfilter/x_tables.c |    7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff -puN net/netfilter/x_tables.c~net-netfilter-x_tablesc-make-allocation-less-aggressive net/netfilter/x_tables.c
> --- a/net/netfilter/x_tables.c~net-netfilter-x_tablesc-make-allocation-less-aggressive
> +++ a/net/netfilter/x_tables.c
> @@ -1008,7 +1008,12 @@ struct xt_table_info *xt_alloc_table_inf
>  	if ((size >> PAGE_SHIFT) + 2 > totalram_pages)
>  		return NULL;
>
> -	info = kvmalloc(sz, GFP_KERNEL);
> +	/* __GFP_NORETRY is not fully supported by kvmalloc but it should
> +	 * work reasonably well if sz is too large and bail out rather
> +	 * than shoot all processes down before realizing there is nothing
> +	 * more to reclaim.
> +	 */
> +	info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
>  	if (!info)
>  		return NULL;

How is __GFP_NORETRY working exactly ?

Surely, if some firewall tools attempt to load new iptables rules, we do
not want to abort them if the request can be satisfied after a few pages
are moved to swap or written back to disk.

We want to avoid huge allocations, but let reasonable ones succeed.

Thanks.
[PATCH net] netfilter: xt_recent: do not accept / in table name
From: Eric Dumazet <eduma...@google.com>

proc_create_data() will issue a WARN() otherwise, lets avoid that.

name 'syz/\xF5'
WARNING: CPU: 1 PID: 3688 at fs/proc/generic.c:163 __xlate_proc_name+0xe6/0x110 fs/proc/generic.c:163
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 3688 Comm: syzkaller153061 Not tainted 4.15.0-rc9+ #283
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 panic+0x1e4/0x41c kernel/panic.c:183
 __warn+0x1dc/0x200 kernel/panic.c:547
 report_bug+0x211/0x2d0 lib/bug.c:184
 fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
 fixup_bug arch/x86/kernel/traps.c:247 [inline]
 do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
 invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1096
RIP: 0010:__xlate_proc_name+0xe6/0x110 fs/proc/generic.c:163
RSP: 0018:8801d913ec18 EFLAGS: 00010282
RAX: dc08 RBX: 0003 RCX: 8159ebae
RDX: RSI: 11003b227d3e RDI: 8801d913e920
RBP: 8801d913ec48 R08: 11003b227d00 R09: R10: 8801d913eb10 R11: R12: 8801d92b4550
R13: R14: 8801d913ede0 R15: 8801d92b4550
 xlate_proc_name fs/proc/generic.c:179 [inline]
 __proc_create+0xcc/0x880 fs/proc/generic.c:349
 proc_create_data+0x76/0x180 fs/proc/generic.c:488
 recent_mt_check.isra.8+0xb1b/0xe70 net/netfilter/xt_recent.c:412
 recent_mt_check_v0+0xd7/0x150 net/netfilter/xt_recent.c:440
 xt_check_match+0x231/0x7d0 net/netfilter/x_tables.c:465
 check_match net/ipv4/netfilter/ip_tables.c:479 [inline]
 find_check_match net/ipv4/netfilter/ip_tables.c:495 [inline]
 find_check_entry.isra.8+0x3fc/0xcb0 net/ipv4/netfilter/ip_tables.c:544
 translate_table+0xed1/0x1610 net/ipv4/netfilter/ip_tables.c:730
 do_replace net/ipv4/netfilter/ip_tables.c:1148 [inline]
 do_ipt_set_ctl+0x370/0x5f0 net/ipv4/netfilter/ip_tables.c:1682
 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
 nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
 ip_setsockopt+0xa1/0xb0 net/ipv4/ip_sockglue.c:1256
 sctp_setsockopt+0x2a0/0x5de0 net/sctp/socket.c:4074
 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2968
 SYSC_setsockopt net/socket.c:1831 [inline]
 SyS_setsockopt+0x189/0x360 net/socket.c:1810
 entry_SYSCALL_64_fastpath+0x29/0xa0

Signed-off-by: Eric Dumazet <eduma...@google.com>
Reported-by: syzbot <syzkal...@googlegroups.com>
---
 net/netfilter/xt_recent.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/xt_recent.c b/net/netfilter/xt_recent.c
index 245fa350a7a85390e6767c4a0c5862f4213000fe..724f7cf072c1c81a912d007f6f89ea542a42eb0e 100644
--- a/net/netfilter/xt_recent.c
+++ b/net/netfilter/xt_recent.c
@@ -362,7 +362,8 @@ static int recent_mt_check(const struct xt_mtchk_param *par,
 		return -EINVAL;
 	}
 	if (info->name[0] == '\0' ||
-	    strnlen(info->name, XT_RECENT_NAME_LEN) == XT_RECENT_NAME_LEN)
+	    strnlen(info->name, XT_RECENT_NAME_LEN) == XT_RECENT_NAME_LEN ||
+	    strchr(info->name, '/'))
 		return -EINVAL;
 
 	if (ip_pkt_list_tot && info->hit_count < ip_pkt_list_tot)
[PATCH net] netfilter: xt_hashlimit: do not allow empty names
From: Eric Dumazet <eduma...@google.com> Syzbot reported a WARN() in proc_create_data() [1] Issue here is that xt_hashlimit does not check that user space provided an empty table name. [1] name len 0 WARNING: CPU: 0 PID: 3680 at fs/proc/generic.c:354 __proc_create+0x696/0x880 fs/proc/generic.c:354 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 3680 Comm: syzkaller464755 Not tainted 4.15.0-rc9+ #283 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:53 panic+0x1e4/0x41c kernel/panic.c:183 __warn+0x1dc/0x200 kernel/panic.c:547 report_bug+0x211/0x2d0 lib/bug.c:184 fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178 fixup_bug arch/x86/kernel/traps.c:247 [inline] do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315 invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1096 RIP: 0010:__proc_create+0x696/0x880 fs/proc/generic.c:354 RSP: 0018:8801d9607410 EFLAGS: 00010286 RAX: dc08 RBX: 11003b2c0e87 RCX: 8159ebae RDX: RSI: 11003b284970 RDI: 0293 RBP: 8801d9607580 R08: 11003b2c0e15 R09: R10: 8801d96072c8 R11: R12: 8801d981ef28 R13: 8801d9607558 R14: R15: 8801d9607518 proc_create_data+0x76/0x180 fs/proc/generic.c:488 htable_create net/netfilter/xt_hashlimit.c:333 [inline] hashlimit_mt_check_common.isra.9+0xaee/0x1420 net/netfilter/xt_hashlimit.c:900 hashlimit_mt_check_v1+0x48d/0x640 net/netfilter/xt_hashlimit.c:926 xt_check_match+0x231/0x7d0 net/netfilter/x_tables.c:465 check_match net/ipv4/netfilter/ip_tables.c:479 [inline] find_check_match net/ipv4/netfilter/ip_tables.c:495 [inline] find_check_entry.isra.8+0x3fc/0xcb0 net/ipv4/netfilter/ip_tables.c:544 translate_table+0xed1/0x1610 net/ipv4/netfilter/ip_tables.c:730 do_replace net/ipv4/netfilter/ip_tables.c:1148 [inline] do_ipt_set_ctl+0x370/0x5f0 net/ipv4/netfilter/ip_tables.c:1682 nf_sockopt net/netfilter/nf_sockopt.c:106 
[inline] nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115 ip_setsockopt+0xa1/0xb0 net/ipv4/ip_sockglue.c:1256 tcp_setsockopt+0x82/0xd0 net/ipv4/tcp.c:2875 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2968 SYSC_setsockopt net/socket.c:1831 [inline] SyS_setsockopt+0x189/0x360 net/socket.c:1810 entry_SYSCALL_64_fastpath+0x29/0xa0 Signed-off-by: Eric Dumazet <eduma...@google.com> Reported-by: syzbot <syzkal...@googlegroups.com> --- net/netfilter/xt_hashlimit.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c index 5da8746f7b88ff4c9446f256e542e823a6a561b0..eae732e013df92a364b500645360d4606c283a75 100644 --- a/net/netfilter/xt_hashlimit.c +++ b/net/netfilter/xt_hashlimit.c @@ -894,6 +894,8 @@ static int hashlimit_mt_check_common(const struct xt_mtchk_param *par, return -ERANGE; } + if (!name[0]) + return -EINVAL; mutex_lock(&hashlimit_mutex); *hinfo = htable_find_get(net, name, par->family); if (*hinfo == NULL) {
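Taken together with the xt_recent fix above, the name checks converge on one pattern. Below is a userspace sketch of the validation the two patches add before a user-supplied name reaches proc_create_data(); the helper name and the 16-byte length are illustrative stand-ins, not kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NAME_MAX_LEN 16  /* stand-in for the xt_recent/xt_hashlimit name-buffer sizes */

/* A proc entry name copied from user space must be non-empty,
 * NUL-terminated within its buffer, and free of '/' characters,
 * otherwise proc_create_data() warns (or worse). Note the strnlen()
 * check comes first so strchr() never runs on an unterminated buffer. */
static bool proc_name_ok(const char *name)
{
    if (name[0] == '\0')
        return false;                                /* empty name */
    if (strnlen(name, NAME_MAX_LEN) == NAME_MAX_LEN)
        return false;                                /* no NUL inside the buffer */
    if (strchr(name, '/'))
        return false;                                /* would escape the proc directory */
    return true;
}

/* Exercise the helper with a deliberately unterminated buffer. */
static bool unterminated_rejected(void)
{
    char buf[NAME_MAX_LEN];

    memset(buf, 'A', sizeof(buf));   /* 16 'A's, no NUL anywhere */
    return !proc_name_ok(buf);
}
```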
[PATCH net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_match()
From: Eric Dumazet <eduma...@google.com> It looks like syzbot found its way into netfilter territory. Issue here is that @name comes from user space and might not be null terminated. Out-of-bound reads happen, KASAN is not happy. Signed-off-by: Eric Dumazet <eduma...@google.com> Reported-by: syzbot <syzkal...@googlegroups.com> --- No Fixes: tag, bug seems to be a day-0 one. diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index 55802e97f906d1987ed78b4296584deb38e5f876..8516dc459b539342f44d2b2b3e21b140677c7826 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -210,6 +210,9 @@ xt_request_find_match(uint8_t nfproto, const char *name, uint8_t revision) { struct xt_match *match; + if (strnlen(name, XT_EXTENSION_MAXNAMELEN) == XT_EXTENSION_MAXNAMELEN) + return ERR_PTR(-EINVAL); + match = xt_find_match(nfproto, name, revision); if (IS_ERR(match)) { request_module("%st_%s", xt_prefix[nfproto], name);
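The guard is easy to model in userspace C. XT_EXTENSION_MAXNAMELEN is 29 in x_tables.h; the helper name below is illustrative:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define XT_EXTENSION_MAXNAMELEN 29

/* A match/target name from user space is only safe to pass to
 * strcmp()/request_module()-style helpers if a NUL occurs inside the
 * fixed-size buffer; strnlen() verifies that without over-reading.
 * If strnlen() hits the cap, no terminator was found. */
static int extension_name_ok(const char *name)
{
    if (strnlen(name, XT_EXTENSION_MAXNAMELEN) == XT_EXTENSION_MAXNAMELEN)
        return -EINVAL;
    return 0;
}
```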
Re: [PATCH net-next v2] net: move decnet to staging
On Mon, 2017-11-13 at 11:32 -0800, Joe Perches wrote: > On Mon, 2017-11-13 at 09:11 -0800, Stephen Hemminger wrote: > > Support for Decnet has been orphaned for some time. > > In the interest of reducing the potential bug surface and pre-holiday > > cleaning, move the decnet protocol into staging for eventual removal. > [] > > diff --git a/drivers/staging/decnet/TODO b/drivers/staging/decnet/TODO > [] > > @@ -0,0 +1,4 @@ > > +The DecNet code will be removed soon from the kernel tree as it is old, > > +obsolete, and buggy. > > Old and obsolete, well OK, but > what's buggy about decnet? > > https://bugzilla.kernel.org/buglist.cgi?quicksearch=decnet > > Zarro Boogs found. > Then that means nobody uses it. And that the syzkaller folks never bothered to add code to actually trigger the bugs that are probably there. They probably have bigger fish to fry at the moment. If we leave the code there, chances are high that some hacker will be interested in exploiting the bugs.
Re: [PATCH v3 nf-next 1/2] netfilter: x_tables: wait until old table isn't used anymore
On Wed, Oct 11, 2017 at 11:18 AM, Florian Westphal <f...@strlen.de> wrote: > Eric Dumazet <eduma...@google.com> wrote: >> On Wed, Oct 11, 2017 at 10:48 AM, Florian Westphal <f...@strlen.de> wrote: >> > Eric Dumazet <eduma...@google.com> wrote: >> >> On Wed, Oct 11, 2017 at 7:26 AM, Florian Westphal <f...@strlen.de> wrote: >> >> > xt_replace_table relies on table replacement counter retrieval (which >> >> > uses xt_recseq to synchronize pcpu counters). >> >> > >> >> > This is fine, however with large rule set get_counters() can take >> >> > a very long time -- it needs to synchronize all counters because >> >> > it has to assume concurrent modifications can occur. >> >> > >> >> > Make xt_replace_table synchronize by itself by waiting until all cpus >> >> > had an even seqcount. >> >> > >> >> > This allows a followup patch to copy the counters of the old ruleset >> >> > without any synchonization after xt_replace_table has completed. >> >> > >> >> > Cc: Dan Williams <d...@redhat.com> >> >> > Cc: Eric Dumazet <eduma...@google.com> >> >> > Signed-off-by: Florian Westphal <f...@strlen.de> >> >> > --- >> >> > v3: check for 'seq is uneven' OR 'has changed' since >> >> > last check. Its fine if seq is uneven iff its a different >> >> > sequence number than the initial one. >> >> > >> >> > v2: fix Erics email address >> >> > net/netfilter/x_tables.c | 18 +++--- >> >> > 1 file changed, 15 insertions(+), 3 deletions(-) >> >> > >> >> > >> >> >> >> Reviewed-by: Eric Dumazet <eduma...@google.com> >> >> >> >> But it seems we need an extra smp_wmb() after >> >> >> >> smp_wmb(); >> >> table->private = newinfo; >> >> + smp_wmb(); >> >> >> >> Otherwise we have no guarantee other cpus actually see the new ->private >> >> value. >> > >> > Seems to be unrelated to this change, so I will submit >> > a separate patch for nf.git that adds this. 
>> >> This is related to this change, please read the comment before the >> local_bh_enable() >> >> /* >> * Even though table entries have now been swapped, other CPU's >> * may still be using the old entries. This is okay, because >> * resynchronization happens because of the locking done >> * during the get_counters() routine. >> */ > > Hmm, but get_counters() does not issue a wmb, and the 'new' code added > here essentially is the same as get_counters(), except that we only > read seqcount until we saw a change (and not for each counter in > the rule set). > > What am I missing? Your sync code does nothing interesting if we are not sure the new table->private value is visible. Without barriers, the compiler/cpu could do this: + /* ... so wait for even xt_recseq on all cpus */ + for_each_possible_cpu(cpu) { + seqcount_t *s = &per_cpu(xt_recseq, cpu); + u32 seq = raw_read_seqcount(s); + + if (seq & 1) { + do { + cond_resched(); + cpu_relax(); + } while (seq == raw_read_seqcount(s)); + } + } /* finally, write new private value */ table->private = newinfo; Basically, your loop is now useless and you could remove it. So there is definitely a bug.
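The per-CPU condition the v3 patch implements ("seq is uneven" AND "unchanged" means keep waiting) can be modeled in plain C. This is a userspace sketch of the predicate only, not the kernel loop or its barriers:

```c
#include <assert.h>
#include <stdbool.h>

/* A CPU no longer references the old table once its xt_recseq is even
 * (no read-side critical section in progress), or once the value
 * differs from the initially sampled one (a different odd value means
 * a new critical section, entered after the table pointer was swapped). */
static bool cpu_quiesced(unsigned int initial_seq, unsigned int current_seq)
{
    return !(current_seq & 1) || current_seq != initial_seq;
}
```

As the thread notes, the predicate alone is not enough: without a write barrier before the wait loop, the store of the new table->private may not yet be visible to the CPUs being polled.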
Re: [PATCH nf] netfilter: x_tables: ensure readers see new ->private value
On Wed, Oct 11, 2017 at 11:03 AM, Florian Westphal <f...@strlen.de> wrote: > Eric Dumazet wrote: > But it seems we need an extra smp_wmb() after > smp_wmb(); > table->private = newinfo; > > Otherwise we have no guarantee other cpus actually see the new > ->private value. > > Suggested-by: Eric Dumazet <eduma...@google.com> > Signed-off-by: Florian Westphal <f...@strlen.de> I do not believe this change is needed in net (or nf) tree. See my other reply.
Re: [PATCH v3 nf-next 1/2] netfilter: x_tables: wait until old table isn't used anymore
On Wed, Oct 11, 2017 at 10:48 AM, Florian Westphal <f...@strlen.de> wrote: > Eric Dumazet <eduma...@google.com> wrote: >> On Wed, Oct 11, 2017 at 7:26 AM, Florian Westphal <f...@strlen.de> wrote: >> > xt_replace_table relies on table replacement counter retrieval (which >> > uses xt_recseq to synchronize pcpu counters). >> > >> > This is fine, however with large rule set get_counters() can take >> > a very long time -- it needs to synchronize all counters because >> > it has to assume concurrent modifications can occur. >> > >> > Make xt_replace_table synchronize by itself by waiting until all cpus >> > had an even seqcount. >> > >> > This allows a followup patch to copy the counters of the old ruleset >> > without any synchonization after xt_replace_table has completed. >> > >> > Cc: Dan Williams <d...@redhat.com> >> > Cc: Eric Dumazet <eduma...@google.com> >> > Signed-off-by: Florian Westphal <f...@strlen.de> >> > --- >> > v3: check for 'seq is uneven' OR 'has changed' since >> > last check. Its fine if seq is uneven iff its a different >> > sequence number than the initial one. >> > >> > v2: fix Erics email address >> > net/netfilter/x_tables.c | 18 +++--- >> > 1 file changed, 15 insertions(+), 3 deletions(-) >> > >> > >> >> Reviewed-by: Eric Dumazet <eduma...@google.com> >> >> But it seems we need an extra smp_wmb() after >> >> smp_wmb(); >> table->private = newinfo; >> + smp_wmb(); >> >> Otherwise we have no guarantee other cpus actually see the new ->private >> value. > > Seems to be unrelated to this change, so I will submit > a separate patch for nf.git that adds this. This is related to this change, please read the comment before the local_bh_enable() /* * Even though table entries have now been swapped, other CPU's * may still be using the old entries. This is okay, because * resynchronization happens because of the locking done * during the get_counters() routine.
*/ Since your new code happens right after table->private = newinfo; the smp_wmb() is required.
Re: [PATCH v3 nf-next 1/2] netfilter: x_tables: wait until old table isn't used anymore
On Wed, Oct 11, 2017 at 7:26 AM, Florian Westphal <f...@strlen.de> wrote: > xt_replace_table relies on table replacement counter retrieval (which > uses xt_recseq to synchronize pcpu counters). > > This is fine, however with large rule set get_counters() can take > a very long time -- it needs to synchronize all counters because > it has to assume concurrent modifications can occur. > > Make xt_replace_table synchronize by itself by waiting until all cpus > had an even seqcount. > > This allows a followup patch to copy the counters of the old ruleset > without any synchonization after xt_replace_table has completed. > > Cc: Dan Williams <d...@redhat.com> > Cc: Eric Dumazet <eduma...@google.com> > Signed-off-by: Florian Westphal <f...@strlen.de> > --- > v3: check for 'seq is uneven' OR 'has changed' since > last check. Its fine if seq is uneven iff its a different > sequence number than the initial one. > > v2: fix Erics email address > net/netfilter/x_tables.c | 18 +++++++--- > 1 file changed, 15 insertions(+), 3 deletions(-) > > Reviewed-by: Eric Dumazet <eduma...@google.com> But it seems we need an extra smp_wmb() after smp_wmb(); table->private = newinfo; + smp_wmb(); Otherwise we have no guarantee other cpus actually see the new ->private value. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net] netfilter: x_tables: avoid stack-out-of-bounds read in xt_copy_counters_from_user
From: Eric Dumazet <eduma...@google.com> syzkaller reports an out-of-bounds read in strlcpy(), triggered by xt_copy_counters_from_user(). Fix this by using memcpy(), then forcing a zero byte at the last position of the destination, as Florian did for the non COMPAT code. Fixes: d7591f0c41ce ("netfilter: x_tables: introduce and use xt_copy_counters_from_user") Signed-off-by: Eric Dumazet <eduma...@google.com> Cc: Willem de Bruijn <will...@google.com> --- net/netfilter/x_tables.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index c83a3b5e1c6c2a91b713b6681a794bd79ab3fa08..d8571f4142080a3c121fc90f0b52d81ee9df6712 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -892,7 +892,7 @@ void *xt_copy_counters_from_user(const void __user *user, unsigned int len, if (copy_from_user(&compat_tmp, user, sizeof(compat_tmp)) != 0) return ERR_PTR(-EFAULT); - strlcpy(info->name, compat_tmp.name, sizeof(info->name)); + memcpy(info->name, compat_tmp.name, sizeof(info->name) - 1); info->num_counters = compat_tmp.num_counters; user += sizeof(compat_tmp); } else @@ -905,9 +905,9 @@ void *xt_copy_counters_from_user(const void __user *user, unsigned int len, if (copy_from_user(info, user, sizeof(*info)) != 0) return ERR_PTR(-EFAULT); - info->name[sizeof(info->name) - 1] = '\0'; user += sizeof(*info); } + info->name[sizeof(info->name) - 1] = '\0'; size = sizeof(struct xt_counters); size *= info->num_counters;
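The difference matters because strlcpy() calls strlen() on its source, so a source buffer without a NUL is over-read. A userspace sketch of the pattern the patch switches to (the 16-byte length is illustrative):

```c
#include <assert.h>
#include <string.h>

#define NAME_LEN 16

/* Copy at most NAME_LEN-1 bytes without interpreting the source as a
 * C string, then force termination; the source came from user space
 * and may legitimately lack a NUL. */
static void copy_counter_name(char dst[NAME_LEN], const char src[NAME_LEN])
{
    memcpy(dst, src, NAME_LEN - 1);
    dst[NAME_LEN - 1] = '\0';
}

/* Exercise it with a completely unterminated source buffer. */
static size_t copied_len_from_unterminated(void)
{
    char src[NAME_LEN], dst[NAME_LEN];

    memset(src, 'B', sizeof(src));   /* no NUL anywhere in src */
    copy_counter_name(dst, src);
    return strlen(dst);              /* safe: dst is now terminated */
}
```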
Re: [PATCH v2] netfilter: xt_socket: Restore mark from full sockets only
On Thu, 2017-09-21 at 16:08 -0600, Subash Abhinov Kasiviswanathan wrote: > An out of bounds error was detected on an ARM64 target with > Android based kernel 4.9. This occurs while trying to > restore mark on a skb from an inet request socket. > > BUG: KASAN: slab-out-of-bounds in socket_match.isra.2+0xc8/0x1f0 > net/netfilter/xt_socket.c:248 > Read of size 4 at addr ffc06a8d824c by task syz-fuzzer/1532 > CPU: 7 PID: 1532 Comm: syz-fuzzer Tainted: GW O4.9.41+ #1 > Call trace: > > v1->v2: Change socket_mt6_v1_v2_v3() as well as mentioned by Eric > > Fixes: a9407388 ("netfilter: xt_socket: prepare for TCP_NEW_SYN_RECV > support") When my commit was written (Mon Mar 16 21:06:17 2015 -0700), this XT_SOCKET_RESTORESKMARK code was not there yet. The bug was added in commit 01555e74bde5 ("netfilter: xt_socket: add XT_SOCKET_RESTORESKMARK flag") which came later (Mon Jun 15 18:40:43 2015 -0600) Please put a correct Fixes: tag, thanks. > Signed-off-by: Subash Abhinov Kasiviswanathan> --- > net/netfilter/xt_socket.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c > index e75ef39..575d215 100644 > --- a/net/netfilter/xt_socket.c > +++ b/net/netfilter/xt_socket.c > @@ -76,7 +76,7 @@ > transparent = nf_sk_is_transparent(sk); > > if (info->flags & XT_SOCKET_RESTORESKMARK && !wildcard && > - transparent) > + transparent && sk_fullsock(sk)) > pskb->mark = sk->sk_mark; > > if (sk != skb->sk) > @@ -133,7 +133,7 @@ > transparent = nf_sk_is_transparent(sk); > > if (info->flags & XT_SOCKET_RESTORESKMARK && !wildcard && > - transparent) > + transparent && sk_fullsock(sk)) > pskb->mark = sk->sk_mark; > > if (sk != skb->sk) -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netfilter: xt_socket: Restore mark from full sockets only
On Thu, 2017-09-21 at 15:20 -0600, Subash Abhinov Kasiviswanathan wrote: > An out of bounds error was detected on an ARM64 target with > Android based kernel 4.9. This occurs while trying to > restore mark on a skb from an inet request socket. > > BUG: KASAN: slab-out-of-bounds in socket_match.isra.2+0xc8/0x1f0 > net/netfilter/xt_socket.c:248 ... > > Fixes: a9407388 ("netfilter: xt_socket: prepare for TCP_NEW_SYN_RECV > support") > Signed-off-by: Subash Abhinov Kasiviswanathan> --- > net/netfilter/xt_socket.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c > index e75ef39..3301091 100644 > --- a/net/netfilter/xt_socket.c > +++ b/net/netfilter/xt_socket.c > @@ -76,7 +76,7 @@ > transparent = nf_sk_is_transparent(sk); > > if (info->flags & XT_SOCKET_RESTORESKMARK && !wildcard && > - transparent) > + transparent && sk_fullsock(sk)) > pskb->mark = sk->sk_mark; > > if (sk != skb->sk) What about socket_mt6_v1_v2_v3() ? Thanks. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH nf] netfilter: xtables: add scheduling opportunity in get_counters
On Fri, Sep 1, 2017 at 1:41 PM, Florian Westphal <f...@strlen.de> wrote: > There are reports about spurious softlockups during iptables-restore, a > backtrace i saw points at get_counters -- it uses a sequence lock and also > has unbounded restart loop. > > Signed-off-by: Florian Westphal <f...@strlen.de> > --- > net/ipv4/netfilter/arp_tables.c | 1 + > net/ipv4/netfilter/ip_tables.c | 1 + > net/ipv6/netfilter/ip6_tables.c | 1 + > 3 files changed, 3 insertions(+) SGTM, thanks Florian ! Acked-by: Eric Dumazet <eduma...@google.com>
Re: [PATCH nf-next 1/3] netfilter: convert hook list to an array
On Wed, 2017-08-23 at 17:26 +0200, Florian Westphal wrote: > From: Aaron Conole... > -static struct nf_hook_entry __rcu **nf_hook_entry_head(struct net *net, > const struct nf_hook_ops *reg) > +static struct nf_hook_entries *allocate_hook_entries_size(u16 num) > +{ > + struct nf_hook_entries *e; > + size_t alloc = sizeof(*e) + > +sizeof(struct nf_hook_entry) * num + > +sizeof(struct nf_hook_ops *) * num; > + > + if (num == 0) > + return NULL; > + > + e = kvmalloc(alloc, GFP_KERNEL); > + if (e) { > + memset(e, 0, alloc); > + e->num_hook_entries = num; > + } nit: e = kvzalloc(alloc, GFP_KERNEL); if (e) e->num_hook_entries = num;
Re: [PATCH nf-next 4/4] netfilter: rt: add support to fetch path mss
On Tue, 2017-08-08 at 15:15 +0200, Florian Westphal wrote: > to be used in combination with tcp option set support to mimic > iptables TCPMSS --clamp-mss-to-pmtu. > > Signed-off-by: Florian Westphal> --- > include/uapi/linux/netfilter/nf_tables.h | 2 + > net/netfilter/nft_rt.c | 65 > > 2 files changed, 67 insertions(+) > > diff --git a/include/uapi/linux/netfilter/nf_tables.h > b/include/uapi/linux/netfilter/nf_tables.h > index 40fd199f7531..b49da72efa68 100644 > --- a/include/uapi/linux/netfilter/nf_tables.h > +++ b/include/uapi/linux/netfilter/nf_tables.h > @@ -811,11 +811,13 @@ enum nft_meta_keys { > * @NFT_RT_CLASSID: realm value of packet's route (skb->dst->tclassid) > * @NFT_RT_NEXTHOP4: routing nexthop for IPv4 > * @NFT_RT_NEXTHOP6: routing nexthop for IPv6 > + * @NFT_RT_TCPMSS: fetch current path tcp mss > */ > enum nft_rt_keys { > NFT_RT_CLASSID, > NFT_RT_NEXTHOP4, > NFT_RT_NEXTHOP6, > + NFT_RT_TCPMSS, > }; > > /** > diff --git a/net/netfilter/nft_rt.c b/net/netfilter/nft_rt.c > index c7383d8f88d0..69ed601d6fc6 100644 > --- a/net/netfilter/nft_rt.c > +++ b/net/netfilter/nft_rt.c > @@ -23,6 +23,41 @@ struct nft_rt { > enum nft_registers dreg:8; > }; > > +static u16 get_tcpmss(const struct nft_pktinfo *pkt, const struct dst_entry > *skbdst) > +{ > + u32 minlen = sizeof(struct ipv6hdr), mtu = dst_mtu(skbdst); > + const struct sk_buff *skb = pkt->skb; > + const struct nf_afinfo *ai; > + struct dst_entry *dst; > + struct flowi fl; > + > + memset(&fl, 0, sizeof(fl)); > + > + switch (nft_pf(pkt)) { > + case NFPROTO_IPV4: > + fl.u.ip4.daddr = ip_hdr(skb)->saddr; > + minlen = sizeof(struct iphdr); > + break; > + case NFPROTO_IPV6: > + fl.u.ip6.daddr = ipv6_hdr(skb)->saddr; > + break; > + } > + > + ai = nf_get_afinfo(nft_pf(pkt)); > + if (ai) > + ai->route(nft_net(pkt), &dst, &fl, false); > + If ai is NULL, dst is not initialized and might contain garbage.
> + if (dst) { > + mtu = min(mtu, dst_mtu(dst)); > + dst_release(dst); > + } > + > + if (mtu <= minlen || mtu > 0xffff) > + return TCP_MSS_DEFAULT; > + > + return mtu - minlen;
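Assuming the garbled upper bound in the quoted diff is 0xffff (the function returns a u16, so larger values cannot be represented), the clamp at the tail of get_tcpmss() reduces to:

```c
#include <assert.h>

#define TCP_MSS_DEFAULT 536  /* RFC 1122 default, as in the kernel */

/* Userspace model of the tail of the quoted get_tcpmss(): fall back to
 * the default MSS when the mtu is out of range, otherwise subtract the
 * fixed network-header length for the address family. */
static unsigned int path_mss(unsigned int mtu, unsigned int minlen)
{
    if (mtu <= minlen || mtu > 0xffff)
        return TCP_MSS_DEFAULT;
    return mtu - minlen;
}
```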
Re: [PATCH nf-next] netfilter: nft_set_rbtree: use seqcount to avoid lock in most cases
On Wed, 2017-07-26 at 02:09 +0200, Florian Westphal wrote: > switch to lockless lookup. write side now also increments sequence > counter. On lookup, sample counter value and only take the lock > if we did not find a match and the counter has changed. > > This avoids need to write to private area in normal (lookup) cases. > > Note that we take the non-blocking variant (raw_seqcount_begin), i.e. > read side will not wait for writer to finish. > > If we did not find a result we will fall back to use of read-lock. > > The readlock is also used during dumps to ensure we get a consistent > tree walk. > > Similar technique (rbtree+seqlock) was used by David Howells in rxrpc. Please note that in commit b145425f269a17ed344d737f746b844dfac60c82 ("inetpeer: remove AVL implementation in favor of RB tree") I chose to also pass the sequence so that the lookup could abort. I am not sure whether, during rb tree write operations, some nodes could be left in some kind of loop.
Re: [PATCH] netfilter: nfnetlink: Improve input length sanitization in nfnetlink_rcv
On Wed, 2017-06-07 at 14:35 +0200, Mateusz Jurczyk wrote: > Verify that the length of the socket buffer is sufficient to cover the > entire nlh->nlmsg_len field before accessing that field for further > input sanitization. If the client only supplies 1-3 bytes of data in > sk_buff, then nlh->nlmsg_len remains partially uninitialized and > contains leftover memory from the corresponding kernel allocation. > Operating on such data may result in indeterminate evaluation of the > nlmsg_len < NLMSG_HDRLEN expression. > > The bug was discovered by a runtime instrumentation designed to detect > use of uninitialized memory in the kernel. The patch prevents this and > other similar tools (e.g. KMSAN) from flagging this behavior in the future. > > Signed-off-by: Mateusz Jurczyk> --- > net/netfilter/nfnetlink.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c > index 80f5ecf2c3d7..c634cfca40ec 100644 > --- a/net/netfilter/nfnetlink.c > +++ b/net/netfilter/nfnetlink.c > @@ -491,7 +491,8 @@ static void nfnetlink_rcv(struct sk_buff *skb) > { > struct nlmsghdr *nlh = nlmsg_hdr(skb); > > - if (nlh->nlmsg_len < NLMSG_HDRLEN || > + if (skb->len < sizeof(nlh->nlmsg_len) || This assumes nlmsg_len is first field of the structure. offsetofend() might be more descriptive, one does not have to check the structure to make sure the code is correct. Or simply use the more common form : if (skb->len < NLMSG_HDRLEN || > + nlh->nlmsg_len < NLMSG_HDRLEN || > skb->len < nlh->nlmsg_len) > return; > -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
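The ordering being debated is easy to model in userspace C: nlmsg_len is the first member of struct nlmsghdr, so the buffer length must be checked against the header size (or at least offsetofend of nlmsg_len) before nlmsg_len itself is trusted. The struct below mirrors the nlmsghdr layout; names and the wrapper are illustrative:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct nlmsghdr; nlmsg_len comes first. */
struct hdr {
    uint32_t nlmsg_len;
    uint16_t nlmsg_type, nlmsg_flags;
    uint32_t nlmsg_seq, nlmsg_pid;
};

/* Validate in the safe order: buffer covers the header, then the
 * length claimed by the header is itself plausible. */
static int msg_ok(const unsigned char *buf, size_t buflen)
{
    struct hdr h;

    if (buflen < sizeof(h))              /* too short to hold the header */
        return 0;
    memcpy(&h, buf, sizeof(h));          /* only now is this read safe */
    return h.nlmsg_len >= sizeof(h) && h.nlmsg_len <= buflen;
}

/* Build a buffer whose header claims msg_len bytes, then validate it
 * against an actual buffer length of buflen (buflen <= 64 assumed). */
static int check(uint32_t msg_len, size_t buflen)
{
    unsigned char buf[64] = { 0 };
    struct hdr h = { .nlmsg_len = msg_len };

    memcpy(buf, &h, sizeof(h));
    return msg_ok(buf, buflen);
}
```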
Re: [PATCH nf-next] netfilter: tcp: Use TCP_MAX_WSCALE instead of literal 14
On Thu, 2017-04-20 at 08:44 +0800, Gao Feng wrote: > > On Wed, Apr 19, 2017 at 09:57:55PM +0200, Pablo Neira Ayuso wrote: > > > On Wed, Apr 19, 2017 at 09:22:08AM -0700, Eric Dumazet wrote: > > > > On Wed, 2017-04-19 at 17:58 +0200, Pablo Neira Ayuso wrote: > > > > > On Wed, Apr 19, 2017 at 09:23:42AM +0800, gfree.w...@foxmail.com > > wrote: > > > > > > From: Gao Feng <f...@ikuai8.com> > > > > > > > > > > > > The window scale may be enlarged from 14 to 15 according to the > > > > > > itef draft > https://tools.ietf.org/html/draft-nishida-tcpm-maxwin-03. > > > > > > > > > > > > Use the macro TCP_MAX_WSCALE to support it easily with TCP stack > > > > > > in the future. > > > > > > > > > > Applied, thanks. > > > > > > > > Note that linux kernel is not ready yet for a TCP_MAX_WSCALE being > > > > changed to 15. > > > > > > > > Signed 32bit sk counters can already be abused with 1GB TCP windows, > > > > for malicious peers sending SACK forcing linux to increase its > > > > memory usage above 2GB and overflows are pretty bad. > > > > > > We have tend to use our own definitions for the TCP connection > > > tracking so far. This one I checked it refers RFC1323 too. > > > > > > If this semantics may change from one way to another in a way that may > > > break conntracking, please let me know, I can toss it here. > > > > Or I can just amend the commit here to remove the "enlarged from 14 to 15" > > comment, I was going to push out this now, but I'll wait a bit. > > Thanks Eric & Pablo, > When the wscale is really enlarged to 15 one day, these Netfilter codes may > be modified. > Because it would reset the wscale value to the max value which Netfilter > support it. > if (state->td_scale > 14) > state->td_scale = 14; > It would cause the receive see a less window size than sender announced > actually. Simply because some middle boxes are enforcing the limit of 14, a change of TCP stacks on peers might be simply not possible without causing serious interoperability issues. 
This IETF draft assumes TCP peers can freely decide what they can do, but experience shows that they cannot. TCP FastOpen, for example, hit a bug in linux TCP conntracking, and some linux middle boxes still have this bug. (SYN messages with a data payload were not really considered in the past.)
Re: [PATCH nf-next] netfilter: tcp: Use TCP_MAX_WSCALE instead of literal 14
On Wed, 2017-04-19 at 17:58 +0200, Pablo Neira Ayuso wrote: > On Wed, Apr 19, 2017 at 09:23:42AM +0800, gfree.w...@foxmail.com wrote: > > From: Gao Feng> > > > The window scale may be enlarged from 14 to 15 according to the IETF > > draft https://tools.ietf.org/html/draft-nishida-tcpm-maxwin-03. > > > > Use the macro TCP_MAX_WSCALE to support it easily with TCP stack in > > the future. > > Applied, thanks. Note that the linux kernel is not yet ready for TCP_MAX_WSCALE being changed to 15. Signed 32bit sk counters can already be abused with 1GB TCP windows, for malicious peers sending SACK forcing linux to increase its memory usage above 2GB, and overflows are pretty bad.
[PATCH net] netfilter: xt_TCPMSS: add more sanity tests on tcph->doff
From: Eric Dumazet <eduma...@google.com> Denys provided an awesome KASAN report pointing to an use after free in xt_TCPMSS I have provided three patches to fix this issue, either in xt_TCPMSS or in xt_tcpudp.c. It seems xt_TCPMSS patch has the smallest possible impact. Signed-off-by: Eric Dumazet <eduma...@google.com> Reported-by: Denys Fedoryshchenko <nuclear...@nuclearcat.com> --- net/netfilter/xt_TCPMSS.c |6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/net/netfilter/xt_TCPMSS.c b/net/netfilter/xt_TCPMSS.c index 27241a767f17b4b27d24095a31e5e9a2d3e29ce4..c64aca611ac5c5f81ad7c925652bbb90554763ac 100644 --- a/net/netfilter/xt_TCPMSS.c +++ b/net/netfilter/xt_TCPMSS.c @@ -104,7 +104,7 @@ tcpmss_mangle_packet(struct sk_buff *skb, tcph = (struct tcphdr *)(skb_network_header(skb) + tcphoff); tcp_hdrlen = tcph->doff * 4; - if (len < tcp_hdrlen) + if (len < tcp_hdrlen || tcp_hdrlen < sizeof(struct tcphdr)) return -1; if (info->mss == XT_TCPMSS_CLAMP_PMTU) { @@ -152,6 +152,10 @@ tcpmss_mangle_packet(struct sk_buff *skb, if (len > tcp_hdrlen) return 0; + /* tcph->doff has 4 bits, do not wrap it to 0 */ + if (tcp_hdrlen >= 15 * 4) + return 0; + /* * MSS Option not found ?! add it.. */ -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
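Both added tests operate on tcph->doff, a 4-bit count of 32-bit words (so the encoded header length ranges 0..60). A userspace model of the arithmetic behind the two checks:

```c
#include <assert.h>

#define TCPHDR_MIN 20  /* sizeof(struct tcphdr) */

/* First check: the header length encoded by doff must fit inside the
 * packet and be at least the fixed TCP header; doff values 0..4 encode
 * lengths below 20 bytes and are bogus. */
static int doff_sane(unsigned int doff, unsigned int pkt_len)
{
    unsigned int tcp_hdrlen = doff * 4;

    return pkt_len >= tcp_hdrlen && tcp_hdrlen >= TCPHDR_MIN;
}

/* Second check: adding a 4-byte MSS option to a header already at the
 * 15-word maximum would wrap the 4-bit doff back toward 0, so the
 * patch bails out instead of growing the header. */
static int room_for_mss_option(unsigned int doff)
{
    return doff * 4 < 15 * 4;
}
```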
Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8
On Mon, 2017-04-03 at 15:14 +0300, Denys Fedoryshchenko wrote: > On 2017-04-03 15:09, Eric Dumazet wrote: > > On Mon, 2017-04-03 at 11:10 +0300, Denys Fedoryshchenko wrote: > > > >> I modified patch a little as: > >> if (th->doff * 4 < sizeof(_tcph)) { > >> par->hotdrop = true; > >> WARN_ON_ONCE(!tcpinfo->option); > >> return false; > >> } > >> > >> And it did trigger the WARN once in the morning, and didn't hit KASAN. I > >> will > >> run for a while more, to see if it is ok, and then if stable, will try > >> to enable SFQ again. > > > > Excellent news ! > > We will post an official fix today, thanks a lot for this detective > > work > > Denys. > I am not sure it is finally fixed, maybe we need to test more? > I'm doing extensive tests today with identical configuration (I had to > run fifo, because the customer cannot afford any more outages). I've added sfq > a different way now, and the identical config I will run after 3 hours approx. I did not say this bug fix would be the last one. But this check is definitely needed, otherwise xt_TCPMSS can iterate over the whole of memory, and either crash or scramble two bytes that do not belong to the frame at all. If tcp_hdrlen is 0 (which can happen when tcph->doff is 0), then: for (i = sizeof(struct tcphdr); i <= tcp_hdrlen - TCPOLEN_MSS; i += optlen(opt, i)) { can look at 4 GB of memory, until it finds a 0x02, 0x04 sequence. It can also loop forever in some cases, depending on memory content.
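The 4 GB figure comes from plain unsigned wrap-around: with tcp_hdrlen below TCPOLEN_MSS, the loop bound tcp_hdrlen - TCPOLEN_MSS underflows. A small demonstration:

```c
#include <assert.h>
#include <stdint.h>

#define TCPOLEN_MSS 4

/* The option-scan loop above iterates while i <= tcp_hdrlen - TCPOLEN_MSS.
 * When tcp_hdrlen < TCPOLEN_MSS the unsigned subtraction wraps to a value
 * near UINT32_MAX, so the bound effectively becomes ~4 GB. */
static uint32_t option_scan_bound(uint32_t tcp_hdrlen)
{
    return tcp_hdrlen - TCPOLEN_MSS;   /* wraps when tcp_hdrlen < 4 */
}
```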
Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8
On Mon, 2017-04-03 at 11:10 +0300, Denys Fedoryshchenko wrote:
> I modified patch a little as:
> if (th->doff * 4 < sizeof(_tcph)) {
> 	par->hotdrop = true;
> 	WARN_ON_ONCE(!tcpinfo->option);
> 	return false;
> }
>
> And it did triggered WARN once at morning, and didn't hit KASAN. I will
> run for a while more, to see if it is ok, and then if stable, will try
> to enable SFQ again.

Excellent news !

We will post an official fix today, thanks a lot for this detective work Denys.
Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8
On Sun, 2017-04-02 at 10:14 -0700, Eric Dumazet wrote:
> Could that be that netfilter does not abort earlier if TCP header is
> completely wrong ?

Yes, I wonder if this patch would be better, unless we replicate the th->doff sanity check in all netfilter modules dissecting TCP frames.

diff --git a/net/netfilter/xt_tcpudp.c b/net/netfilter/xt_tcpudp.c
index ade024c90f4f129a7c384e9e1cbfdb8ffe73065f..8cb4eadd5ba1c20e74bc27ee52a0bc36a5b26725 100644
--- a/net/netfilter/xt_tcpudp.c
+++ b/net/netfilter/xt_tcpudp.c
@@ -103,11 +103,11 @@ static bool tcp_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	if (!NF_INVF(tcpinfo, XT_TCP_INV_FLAGS,
 		     (((unsigned char *)th)[13] & tcpinfo->flg_mask) == tcpinfo->flg_cmp))
 		return false;
+	if (th->doff * 4 < sizeof(_tcph)) {
+		par->hotdrop = true;
+		return false;
+	}
 	if (tcpinfo->option) {
-		if (th->doff * 4 < sizeof(_tcph)) {
-			par->hotdrop = true;
-			return false;
-		}
 		if (!tcp_find_option(tcpinfo->option, skb, par->thoff,
 				     th->doff*4 - sizeof(_tcph),
 				     tcpinfo->invflags & XT_TCP_INV_OPTION,
Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8
On Sun, 2017-04-02 at 19:52 +0300, Denys Fedoryshchenko wrote:
> On 2017-04-02 15:32, Eric Dumazet wrote:
> > On Sun, 2017-04-02 at 15:25 +0300, Denys Fedoryshchenko wrote:
> >> > */
> >> I will add also WARN_ON_ONCE(tcp_hdrlen >= 15 * 4) before, for
> >> curiosity, if this condition are triggered. Is it fine like that?
> >
> > Sure.
>
> It didnt triggered WARN_ON, and with both patches here is one more KASAN.
> What i noticed also after this KASAN, there is many others start to
> trigger in TCPMSS and locking up server by flood.
> There is heavy netlink activity, it is pppoe server with lot of shapers.
> I noticed there left sfq by mistake, usually i am removing it, because
> it may trigger kernel panic too (and hard to trace reason).
> I will try with pfifo instead, after 6 hours.
>
> Here is full log with others: https://nuclearcat.com/kasan.txt

Could that be that netfilter does not abort earlier if the TCP header is completely wrong ?

diff --git a/net/netfilter/xt_TCPMSS.c b/net/netfilter/xt_TCPMSS.c
index 27241a767f17b4b27d24095a31e5e9a2d3e29ce4..dd64bff38ba8074e6feb2e6e305eacb30ce4c2da 100644
--- a/net/netfilter/xt_TCPMSS.c
+++ b/net/netfilter/xt_TCPMSS.c
@@ -104,7 +104,7 @@ tcpmss_mangle_packet(struct sk_buff *skb,
 	tcph = (struct tcphdr *)(skb_network_header(skb) + tcphoff);
 	tcp_hdrlen = tcph->doff * 4;
 
-	if (len < tcp_hdrlen)
+	if (len < tcp_hdrlen || tcp_hdrlen < sizeof(struct tcphdr))
 		return -1;
 
 	if (info->mss == XT_TCPMSS_CLAMP_PMTU) {
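The two-part check in the patch above is easy to exercise in isolation. A userspace mirror of the condition (names and the hard-coded 20 are illustrative; in the kernel the lower bound is sizeof(struct tcphdr)):

```c
/* Mirror of the sanity check in tcpmss_mangle_packet(): the TCP header,
 * as declared by doff, must both fit inside the packet (len) and be at
 * least the 20-byte minimum header, or the packet is rejected with -1. */
int tcpmss_header_ok(unsigned int len, unsigned int doff)
{
	unsigned int tcp_hdrlen = doff * 4;	/* doff counts 32-bit words */

	if (len < tcp_hdrlen || tcp_hdrlen < 20)
		return -1;
	return 0;
}
```

The second clause is the new one: it is what rejects doff == 0 before the option walk can underflow its loop bound.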
Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8
On Sun, 2017-04-02 at 15:25 +0300, Denys Fedoryshchenko wrote:
> > */
> I will add also WARN_ON_ONCE(tcp_hdrlen >= 15 * 4) before, for
> curiosity, if this condition are triggered. Is it fine like that?

Sure.
Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8
On Sun, 2017-04-02 at 04:54 -0700, Eric Dumazet wrote:
> On Sun, 2017-04-02 at 13:45 +0200, Florian Westphal wrote:
> > Eric Dumazet <eric.duma...@gmail.com> wrote:
> > > - for (i = sizeof(struct tcphdr); i <= tcp_hdrlen - TCPOLEN_MSS; i += optlen(opt, i)) {
> > > + for (i = sizeof(struct tcphdr); i < tcp_hdrlen - TCPOLEN_MSS; i += optlen(opt, i)) {
> > > 	if (opt[i] == TCPOPT_MSS && opt[i+1] == TCPOLEN_MSS) {
> > > 		u_int16_t oldmss;
> >
> > maybe I am low on caffeine but this looks fine, for tcp header with
> > only tcpmss this boils down to "20 <= 24 - 4" so we access offsets 20-23
> > which seems ok.
>
> I am definitely low on caffeine ;)
>
> An issue in this function is that we might add the missing MSS option,
> without checking that TCP options are already full.
>
> But this should not cause a KASAN splat, only some malformed TCP packet
> (tcph->doff would wrap)

Something like that maybe.

diff --git a/net/netfilter/xt_TCPMSS.c b/net/netfilter/xt_TCPMSS.c
index 27241a767f17b4b27d24095a31e5e9a2d3e29ce4..1465aaf0e3a15d69d105d0a50b0429b11b6439d3 100644
--- a/net/netfilter/xt_TCPMSS.c
+++ b/net/netfilter/xt_TCPMSS.c
@@ -151,7 +151,9 @@ tcpmss_mangle_packet(struct sk_buff *skb,
 	 */
 	if (len > tcp_hdrlen)
 		return 0;
-
+	/* tcph->doff is 4 bits wide, do not wrap its value to 0 */
+	if (tcp_hdrlen >= 15 * 4)
+		return 0;
 	/*
 	 * MSS Option not found ?! add it..
 	 */
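Why 15 * 4: doff is a 4-bit field, so a TCP header can describe at most 60 bytes. Growing an already-full header by the 4-byte MSS option would make the stored value wrap. A sketch of what the 4-bit field would end up holding (illustrative, not the kernel's code):

```c
#define TCPOLEN_MSS 4

/* Value the 4-bit doff field would hold after the header grows by
 * TCPOLEN_MSS bytes: ((hdrlen + 4) / 4), truncated to 4 bits. */
unsigned int doff_after_mss_insert(unsigned int tcp_hdrlen)
{
	return ((tcp_hdrlen + TCPOLEN_MSS) / 4) & 0xf;
}
```

A 20-byte header becomes doff 6, which is fine; a 60-byte header would need doff 16, which wraps to 0 — hence the tcp_hdrlen >= 15 * 4 bail-out before adding the option.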
Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8
On Sun, 2017-04-02 at 13:45 +0200, Florian Westphal wrote:
> Eric Dumazet <eric.duma...@gmail.com> wrote:
> > - for (i = sizeof(struct tcphdr); i <= tcp_hdrlen - TCPOLEN_MSS; i += optlen(opt, i)) {
> > + for (i = sizeof(struct tcphdr); i < tcp_hdrlen - TCPOLEN_MSS; i += optlen(opt, i)) {
> > 	if (opt[i] == TCPOPT_MSS && opt[i+1] == TCPOLEN_MSS) {
> > 		u_int16_t oldmss;
>
> maybe I am low on caffeine but this looks fine, for tcp header with
> only tcpmss this boils down to "20 <= 24 - 4" so we access offsets 20-23
> which seems ok.

I am definitely low on caffeine ;)

An issue in this function is that we might add the missing MSS option, without checking that TCP options are already full.

But this should not cause a KASAN splat, only some malformed TCP packet (tcph->doff would wrap)
Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8
On Sun, 2017-04-02 at 10:43 +0300, Denys Fedoryshchenko wrote: > Repost, due being sleepy missed few important points. > > I am searching reasons of crashes for multiple conntrack enabled > servers, usually they point to conntrack, but i suspect use after free > might be somewhere else, > so i tried to enable KASAN. > And seems i got something after few hours, and it looks related to all > crashes, because on all that servers who rebooted i had MSS adjustment > (--clamp-mss-to-pmtu or --set-mss). > Please let me know if any additional information needed. > > [25181.855611] > == > [25181.855985] BUG: KASAN: use-after-free in tcpmss_tg4+0x682/0xe9c > [xt_TCPMSS] at addr 8802976000ea > [25181.856344] Read of size 1 by task swapper/1/0 > [25181.856555] page:ea000a5d8000 count:0 mapcount:0 mapping: > (null) index:0x0 > [25181.856909] flags: 0x1000() > [25181.857123] raw: 1000 > > [25181.857630] raw: ea000b0444a0 ea000a0b1f60 > > [25181.857996] page dumped because: kasan: bad access detected > [25181.858214] CPU: 1 PID: 0 Comm: swapper/1 Not tainted > 4.10.8-build-0133-debug #3 > [25181.858571] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 > 04/02/2015 > [25181.858786] Call Trace: > [25181.859000] > [25181.859215] dump_stack+0x99/0xd4 > [25181.859423] ? _atomic_dec_and_lock+0x15d/0x15d > [25181.859644] ? __dump_page+0x447/0x4e3 > [25181.859859] ? tcpmss_tg4+0x682/0xe9c [xt_TCPMSS] > [25181.860080] kasan_report+0x577/0x69d > [25181.860291] ? __ip_route_output_key_hash+0x14ce/0x1503 > [25181.860512] ? tcpmss_tg4+0x682/0xe9c [xt_TCPMSS] > [25181.860736] __asan_report_load1_noabort+0x19/0x1b > [25181.860956] tcpmss_tg4+0x682/0xe9c [xt_TCPMSS] > [25181.861180] ? tcpmss_tg4_check+0x287/0x287 [xt_TCPMSS] > [25181.861407] ? udp_mt+0x45a/0x45a [xt_tcpudp] > [25181.861634] ? __fib_validate_source+0x46b/0xcd1 > [25181.861860] ipt_do_table+0x1432/0x1573 [ip_tables] > [25181.862088] ? igb_msix_ring+0x2d/0x35 > [25181.862318] ? 
ip_tables_net_init+0x15/0x15 [ip_tables] > [25181.862537] ? ip_route_input_slow+0xe9f/0x17e3 > [25181.862759] ? handle_irq_event_percpu+0x141/0x141 > [25181.862985] ? rt_set_nexthop+0x9a7/0x9a7 > [25181.863203] ? ip_tables_net_exit+0xe/0x15 [ip_tables] > [25181.863419] ? tcf_action_exec+0xce/0x18c > [25181.863628] ? iptable_mangle_net_exit+0x92/0x92 [iptable_mangle] > [25181.863856] ? iptable_filter_net_exit+0x92/0x92 [iptable_filter] > [25181.864084] iptable_filter_hook+0xc0/0x1c8 [iptable_filter] > [25181.864311] nf_hook_slow+0x7d/0x121 > [25181.864536] ip_forward+0x1183/0x11c6 > [25181.864752] ? ip_forward_finish+0x168/0x168 > [25181.864967] ? ip_frag_mem+0x43/0x43 > [25181.865194] ? iptable_nat_net_exit+0x92/0x92 [iptable_nat] > [25181.865423] ? nf_nat_ipv4_in+0xf0/0x209 [nf_nat_ipv4] > [25181.865648] ip_rcv_finish+0xf4c/0xf5b > [25181.865861] ip_rcv+0xb41/0xb72 > [25181.866086] ? ip_local_deliver+0x282/0x282 > [25181.866308] ? ip_local_deliver_finish+0x6e6/0x6e6 > [25181.866524] ? ip_local_deliver+0x282/0x282 > [25181.866752] __netif_receive_skb_core+0x1b27/0x21bf > [25181.866971] ? netdev_rx_handler_register+0x1a6/0x1a6 > [25181.867186] ? enqueue_hrtimer+0x232/0x240 > [25181.867401] ? hrtimer_start_range_ns+0xd1c/0xd4b > [25181.867630] ? __ppp_xmit_process+0x101f/0x104e [ppp_generic] > [25181.867852] ? hrtimer_cancel+0x20/0x20 > [25181.868081] ? ppp_push+0x1402/0x1402 [ppp_generic] > [25181.868301] ? __pskb_pull_tail+0xb0f/0xb25 > [25181.868523] ? ppp_xmit_process+0x47/0xaf [ppp_generic] > [25181.868749] __netif_receive_skb+0x5e/0x191 > [25181.868968] process_backlog+0x295/0x573 > [25181.869180] ? __netif_receive_skb+0x191/0x191 > [25181.869401] napi_poll+0x311/0x745 > [25181.869611] ? napi_complete_done+0x3b4/0x3b4 > [25181.869836] ? __qdisc_run+0x4ec/0xb7f > [25181.870061] ? sch_direct_xmit+0x60b/0x60b > [25181.870286] net_rx_action+0x2e8/0x6dc > [25181.870512] ? napi_poll+0x745/0x745 > [25181.870732] ? rps_trigger_softirq+0x181/0x1e4 > [25181.870956] ? 
rps_may_expire_flow+0x29b/0x29b > [25181.871184] ? irq_work_run+0x2c/0x2e > [25181.871411] __do_softirq+0x22b/0x5df > [25181.871629] ? smp_call_function_single_async+0x17d/0x17d > [25181.871854] irq_exit+0x8a/0xfe > [25181.872069] smp_call_function_single_interrupt+0x8d/0x90 > [25181.872297] call_function_single_interrupt+0x83/0x90 > [25181.872519] RIP: 0010:mwait_idle+0x15a/0x30d > [25181.872733] RSP: 0018:8802d1017e78 EFLAGS: 0246 ORIG_RAX: > ff04 > [25181.873091] RAX: RBX: 8802d1000c80 RCX: > > [25181.873311] RDX: 11005a200190 RSI: RDI: > > [25181.873532] RBP: 8802d1017e98 R08: 003f R09: > 7f75f7fff700 > [25181.873751] R10: 8802d1017d80 R11:
Re: [PATCH nf-next 0/2] netfilter: untracked object removal
On Wed, 2017-03-08 at 13:49 +0100, Florian Westphal wrote:
> These patches remove the percpu untracked objects, they get replaced
> with a new (kernel internal) ctinfo state.
>
> This avoids reference counter operations for untracked packets and
> removes the need to check a conntrack for the UNTRACKED status bit
> before setting connmark, labels, etc.
>
> I checked with following rule set and things appear to work as
> expected (i.e., ssh connections don't show up in conntrack -L):
>
> *raw
> :PREROUTING ACCEPT [455:34825]
> :OUTPUT ACCEPT [251:29555]
> [775:63699] -A PREROUTING -p tcp -m tcp --dport 22 -j NOTRACK
> [251:29555] -A OUTPUT -p tcp -m tcp --sport 22 -j NOTRACK
> COMMIT
> *filter
> :INPUT ACCEPT [0:0]
> :FORWARD ACCEPT [0:0]
> :OUTPUT ACCEPT [0:0]
> [0:0] -A INPUT -m conntrack --ctstate INVALID -j DROP
> [337:26377] -A INPUT -p tcp -m conntrack --ctstate UNTRACKED -m tcp --dport 22 -j ACCEPT
> [0:0] -A INPUT -m conntrack --ctstate UNTRACKED
> [102:13883] -A OUTPUT -p tcp -m conntrack --ctstate UNTRACKED -m tcp --sport 22 -j ACCEPT
> [0:0] -A OUTPUT -m conntrack --ctstate UNTRACKED
> COMMIT

Very nice series Florian, thanks !
Re: net: suspicious RCU usage in nf_hook
On Wed, 2017-02-01 at 15:48 -0800, Eric Dumazet wrote:
> On Wed, Feb 1, 2017 at 3:29 PM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
> > Not sure if it is better. The difference is caught up in
> > net_enable_timestamp(), which is called in setsockopt() path and
> > sk_clone() path, so we could be in netstamp_needed state for a long
> > time too until user-space exercises these paths.
> >
> > I am feeling we probably need to get rid of netstamp_needed_deferred,
> > and simply defer the whole static_key_slow_dec(), like the attached
> > patch (compile only).
> >
> > What do you think?
>
> I think we need to keep the atomic.
>
> If two cpus call net_disable_timestamp() roughly at the same time, the
> work will be scheduled once.

Updated patch (but not tested yet)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7f218e095361520d11c243d650e053321ea7274f..29101c98399f40b6b8e42c31a255d8f1fb6bd7a1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1695,24 +1695,19 @@ EXPORT_SYMBOL_GPL(net_dec_egress_queue);
 
 static struct static_key netstamp_needed __read_mostly;
 #ifdef HAVE_JUMP_LABEL
-/* We are not allowed to call static_key_slow_dec() from irq context
- * If net_disable_timestamp() is called from irq context, defer the
- * static_key_slow_dec() calls.
- */
 static atomic_t netstamp_needed_deferred;
-#endif
-
-void net_enable_timestamp(void)
+static void netstamp_clear(struct work_struct *work)
 {
-#ifdef HAVE_JUMP_LABEL
 	int deferred = atomic_xchg(&netstamp_needed_deferred, 0);
 
-	if (deferred) {
-		while (--deferred)
-			static_key_slow_dec(&netstamp_needed);
-		return;
-	}
+	while (deferred--)
+		static_key_slow_dec(&netstamp_needed);
+}
+static DECLARE_WORK(netstamp_work, netstamp_clear);
 #endif
+
+void net_enable_timestamp(void)
+{
 	static_key_slow_inc(&netstamp_needed);
 }
 EXPORT_SYMBOL(net_enable_timestamp);
@@ -1720,12 +1715,12 @@ EXPORT_SYMBOL(net_enable_timestamp);
 void net_disable_timestamp(void)
 {
 #ifdef HAVE_JUMP_LABEL
-	if (in_interrupt()) {
-		atomic_inc(&netstamp_needed_deferred);
-		return;
-	}
-#endif
+	/* net_disable_timestamp() can be called from non process context */
+	atomic_inc(&netstamp_needed_deferred);
+	schedule_work(&netstamp_work);
+#else
 	static_key_slow_dec(&netstamp_needed);
+#endif
 }
 EXPORT_SYMBOL(net_disable_timestamp);
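The pattern in the patch above — record decrements in an atomic counter from any context, then drain them all at once from process context — can be sketched in userspace with C11 atomics (every name here is a stand-in, not the kernel API):

```c
#include <stdatomic.h>

atomic_int needed_deferred;	/* stand-in for netstamp_needed_deferred */
int key_enable_count;		/* stand-in for the static key's refcount */

/* May be called from any context: it only touches the atomic; the real
 * code would then schedule_work(&netstamp_work). */
void disable_timestamp(void)
{
	atomic_fetch_add(&needed_deferred, 1);
}

/* Process-context "work": atomic_exchange() mirrors atomic_xchg(..., 0),
 * so increments racing with the drain are never lost — they are simply
 * picked up by the next work invocation. */
void netstamp_clear(void)
{
	int deferred = atomic_exchange(&needed_deferred, 0);

	while (deferred--)
		key_enable_count--;	/* static_key_slow_dec() */
}
```

This also shows why Eric keeps the atomic: two near-simultaneous callers may result in a single work run, and the counter is what makes that single run decrement the key twice.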
Re: net: suspicious RCU usage in nf_hook
On Wed, Feb 1, 2017 at 3:29 PM, Cong Wang wrote:
> Not sure if it is better. The difference is caught up in
> net_enable_timestamp(), which is called in setsockopt() path and
> sk_clone() path, so we could be in netstamp_needed state for a long
> time too until user-space exercises these paths.
>
> I am feeling we probably need to get rid of netstamp_needed_deferred,
> and simply defer the whole static_key_slow_dec(), like the attached
> patch (compile only).
>
> What do you think?

I think we need to keep the atomic.

If two cpus call net_disable_timestamp() roughly at the same time, the work will be scheduled once.
Re: net: suspicious RCU usage in nf_hook
On Wed, 2017-02-01 at 13:16 -0800, Eric Dumazet wrote:
> This would permanently leave the kernel in the netstamp_needed state.
>
> I would prefer the patch using a process context to perform the
> cleanup ? Note there is a race window, but probably not a big deal.
>
>  net/core/dev.c | 30 ++
>  1 file changed, 10 insertions(+), 20 deletions(-)

Patch is not complete (for the HAVE_JUMP_LABEL=n case)

Would you like to author/submit it ?

Thanks.
Re: net: suspicious RCU usage in nf_hook
On Wed, 2017-02-01 at 12:51 -0800, Cong Wang wrote:
> On Tue, Jan 31, 2017 at 7:44 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> > On Mon, 2017-01-30 at 22:19 -0800, Cong Wang wrote:
> >
> >> The context is process context (TX path before hitting qdisc), and
> >> BH is not disabled, so in_interrupt() doesn't catch it. Hmm, this
> >> makes me thinking maybe we really need to disable BH in this
> >> case for nf_hook()? But it is called in RX path too, and BH is
> >> already disabled there.
> >
> > ipt_do_table() and similar netfilter entry points disable BH.
> >
> > Maybe it is done too late.
>
> I think we need a fix like the following one for minimum impact.
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 727b6fd..eee7d63 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1720,12 +1720,10 @@ EXPORT_SYMBOL(net_enable_timestamp);
>  void net_disable_timestamp(void)
>  {
>  #ifdef HAVE_JUMP_LABEL
> -	if (in_interrupt()) {
> -		atomic_inc(&netstamp_needed_deferred);
> -		return;
> -	}
> -#endif
> +	atomic_inc(&netstamp_needed_deferred);
> +#else
>  	static_key_slow_dec(&netstamp_needed);
> +#endif
>  }
>  EXPORT_SYMBOL(net_disable_timestamp);

This would permanently leave the kernel in the netstamp_needed state.

I would prefer the patch using a process context to perform the cleanup ? Note there is a race window, but probably not a big deal.

 net/core/dev.c | 30 ++
 1 file changed, 10 insertions(+), 20 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7f218e095361520d11c243d650e053321ea7274f..1cae681b6cfd1cf2c9bee7072eb8af9cf79cced8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1695,37 +1695,27 @@ EXPORT_SYMBOL_GPL(net_dec_egress_queue);
 
 static struct static_key netstamp_needed __read_mostly;
 #ifdef HAVE_JUMP_LABEL
-/* We are not allowed to call static_key_slow_dec() from irq context
- * If net_disable_timestamp() is called from irq context, defer the
- * static_key_slow_dec() calls.
- */
 static atomic_t netstamp_needed_deferred;
-#endif
-
-void net_enable_timestamp(void)
+static void netstamp_clear(struct work_struct *work)
 {
-#ifdef HAVE_JUMP_LABEL
 	int deferred = atomic_xchg(&netstamp_needed_deferred, 0);
 
-	if (deferred) {
-		while (--deferred)
-			static_key_slow_dec(&netstamp_needed);
-		return;
-	}
+	while (deferred--)
+		static_key_slow_dec(&netstamp_needed);
+}
+static DECLARE_WORK(netstamp_work, netstamp_clear);
 #endif
+
+void net_enable_timestamp(void)
+{
 	static_key_slow_inc(&netstamp_needed);
 }
 EXPORT_SYMBOL(net_enable_timestamp);
 
 void net_disable_timestamp(void)
 {
-#ifdef HAVE_JUMP_LABEL
-	if (in_interrupt()) {
-		atomic_inc(&netstamp_needed_deferred);
-		return;
-	}
-#endif
-	static_key_slow_dec(&netstamp_needed);
+	atomic_inc(&netstamp_needed_deferred);
+	schedule_work(&netstamp_work);
 }
 EXPORT_SYMBOL(net_disable_timestamp);
Re: net: suspicious RCU usage in nf_hook
On Mon, 2017-01-30 at 22:19 -0800, Cong Wang wrote:
>
> The context is process context (TX path before hitting qdisc), and
> BH is not disabled, so in_interrupt() doesn't catch it. Hmm, this
> makes me thinking maybe we really need to disable BH in this
> case for nf_hook()? But it is called in RX path too, and BH is
> already disabled there.

ipt_do_table() and similar netfilter entry points disable BH.

Maybe it is done too late.
Re: net: suspicious RCU usage in nf_hook
On Fri, 2017-01-27 at 17:00 -0800, Cong Wang wrote: > On Fri, Jan 27, 2017 at 3:35 PM, Eric Dumazet <eric.duma...@gmail.com> wrote: > > Oh well, I forgot to submit the official patch I think, Jan 9th. > > > > https://groups.google.com/forum/#!topic/syzkaller/BhyN5OFd7sQ > > > > Hmm, but why only fragments need skb_orphan()? It seems like > any kfree_skb() inside a nf hook needs to have a preceding > skb_orphan(). > > Also, I am not convinced it is similar to commit 8282f27449bf15548 > which is on RX path. Well, we clearly see IPv6 reassembly being part of the equation in both cases. I was replying to first part of the splat [1], which was already diagnosed and had a non official patch. use after free is also a bug, regardless of jump label being used or not. I still do not really understand this nf_hook issue, I thought we were disabling BH in netfilter. So the in_interrupt() check in net_disable_timestamp() should trigger, this was the intent of netstamp_needed_deferred existence. Not sure if we can test for rcu_read_lock() as well. 
[1]
 sk_destruct+0x47/0x80 net/core/sock.c:1460
 __sk_free+0x57/0x230 net/core/sock.c:1468
 sock_wfree+0xae/0x120 net/core/sock.c:1645
 skb_release_head_state+0xfc/0x200 net/core/skbuff.c:655
 skb_release_all+0x15/0x60 net/core/skbuff.c:668
 __kfree_skb+0x15/0x20 net/core/skbuff.c:684
 kfree_skb+0x16e/0x4c0 net/core/skbuff.c:705
 inet_frag_destroy+0x121/0x290 net/ipv4/inet_fragment.c:304
 inet_frag_put include/net/inet_frag.h:133 [inline]
 nf_ct_frag6_gather+0x1106/0x3840 net/ipv6/netfilter/nf_conntrack_reasm.c:617
 ipv6_defrag+0x1be/0x2b0 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68
 nf_hook_entry_hookfn include/linux/netfilter.h:102 [inline]
 nf_hook_slow+0xc3/0x290 net/netfilter/core.c:310
 nf_hook include/linux/netfilter.h:212 [inline]
 __ip6_local_out+0x489/0x840 net/ipv6/output_core.c:160
 ip6_local_out+0x2d/0x170 net/ipv6/output_core.c:170
Re: [PATCH 02/14] tcp: fix mark propagation with fwmark_reflect enabled
On Thu, 2017-01-26 at 20:19 +0100, Pablo Neira Ayuso wrote:
> Right. This is not percpu as in IPv4.
>
> I can send a follow up patch to get this in sync with the way we do it
> in IPv4, ie. add percpu socket.
>
> Fine with this approach? Thanks!

Not really.

percpu sockets are going to slow down network namespace creation / deletion and increase memory footprint.

IPv6 is cleaner because it does not really have to use different sockets.

Ultimately we would like to have the same for IPv4.

I would rather carry the mark either in an additional parameter, or in the flow that is already passed as a parameter.
Re: [PATCH 02/14] tcp: fix mark propagation with fwmark_reflect enabled
On Thu, 2017-01-26 at 17:37 +0100, Pablo Neira Ayuso wrote:
> From: Pau Espin Pedrol
>
> Otherwise, RST packets generated by the TCP stack for non-existing
> sockets always have mark 0.
> The mark from the original packet is assigned to the netns_ipv4/6
> socket used to send the response so that it can get copied into the
> response skb when the socket sends it.
>
> Fixes: e110861f8609 ("net: add a sysctl to reflect the fwmark on replies")
> Cc: Lorenzo Colitti
> Signed-off-by: Pau Espin Pedrol
> Signed-off-by: Pablo Neira Ayuso
> ---
>  net/ipv4/ip_output.c | 1 +
>  net/ipv6/tcp_ipv6.c  | 1 +
>  2 files changed, 2 insertions(+)
>
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index fac275c48108..b67719f45953 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1629,6 +1629,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
>  	sk->sk_protocol = ip_hdr(skb)->protocol;
>  	sk->sk_bound_dev_if = arg->bound_dev_if;
>  	sk->sk_sndbuf = sysctl_wmem_default;
> +	sk->sk_mark = fl4.flowi4_mark;
>  	err = ip_append_data(sk, &fl4, ip_reply_glue_bits, arg->iov->iov_base,
>  			     len, 0, &ipc, &rt, MSG_DONTWAIT);
>  	if (unlikely(err)) {
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 73bc8fc68acd..2b20622a5824 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -840,6 +840,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
>  	dst = ip6_dst_lookup_flow(ctl_sk, &fl6, NULL);
>  	if (!IS_ERR(dst)) {
>  		skb_dst_set(buff, dst);
> +		ctl_sk->sk_mark = fl6.flowi6_mark;
>  		ip6_xmit(ctl_sk, buff, &fl6, NULL, tclass);
>  		TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
>  		if (rst)

This patch is wrong.

ctl_sk is a shared socket, and tcp_v6_send_response() can be called from many different cpus at the same time.
Re: ip_rcv_finish() NULL pointer kernel panic
On Thu, 2017-01-26 at 10:00 -0800, Eric Dumazet wrote:
> On Thu, 2017-01-26 at 17:24 +0100, Florian Westphal wrote:
>
> > I think it makes sense to set dst->incoming
> > to a stub in br_netfilter_rtable_init() to just kfree_skb()+
> > WARN_ON_ONCE(), no need to add code to ip stack or crash kernel
> > due to brnf bug.
>
> Just kfree_skb() would hide bugs.
>
> Dropping packets is not uncommon in networking...
>
> I would rather use at least a WARN_ON_ONCE() before the kfree_skb() ;)

Oh well, I obviously did not properly parse your suggestion.
Re: ip_rcv_finish() NULL pointer kernel panic
On Thu, 2017-01-26 at 17:24 +0100, Florian Westphal wrote:
> I think it makes sense to set dst->incoming
> to a stub in br_netfilter_rtable_init() to just kfree_skb()+
> WARN_ON_ONCE(), no need to add code to ip stack or crash kernel
> due to brnf bug.

Just kfree_skb() would hide bugs.

Dropping packets is not uncommon in networking...

I would rather use at least a WARN_ON_ONCE() before the kfree_skb() ;)
Re: ip_rcv_finish() NULL pointer kernel panic
On Thu, 2017-01-26 at 09:32 -0600, Roy Keene wrote:
> This bug appears to have existed for a long time:
>
> https://www.spinics.net/lists/netdev/msg222459.html
> http://www.kernelhub.org/?p=2&m=823752
>
> Though possibly with different things not setting the "input" function
> pointer in the "struct dst_entry".
>
> include/net/dst.h:
> 496 static inline int dst_input(struct sk_buff *skb) {
> 498 	return skb_dst(skb)->input(skb);
> 499 }
>
> Is there any reason not to check to see if this pointer is NULL before
> blindly calling it ?

Sure. It can not be NULL at this point.

Just look at the code in ip_rcv_finish() : It first makes sure to get a valid dst.

Something is broken, probably in bridge netfilter code.

I suspect the dst here points to &br->fake_rtable, and this is not expected. br_drop_fake_rtable() should have been called somewhere ...
Re: [PATCH net-next] netfilter: nft_counter: rework atomic dump and reset
On Sat, 2016-12-10 at 15:25 +0100, Pablo Neira Ayuso wrote:
> On Sat, Dec 10, 2016 at 03:16:55PM +0100, Pablo Neira Ayuso wrote:
> > -	nft_counter_fetch(priv, &total, reset);
> > +	nft_counter_fetch(priv, &total);
> > +	if (reset)
> > +		nft_counter_reset(priv, &total);
> >
> > 	if (nla_put_be64(skb, NFTA_COUNTER_BYTES, cpu_to_be64(total.bytes),
> > 			 NFTA_COUNTER_PAD) ||

Might be nitpicking, but you might reset the stats only if the nla_put_be64() succeeded.

But regardless of this detail, patch looks good and is very close to the one I cooked and was about to send this morning.

Thanks Pablo !
Re: [PATCH 37/50] netfilter: nf_tables: atomic dump and reset for stateful objects
On Fri, 2016-12-09 at 06:24 -0800, Eric Dumazet wrote:
> It looks that you want a seqcount, even on 64bit arches,
> so that CPU 2 can restart its loop, and more importantly you need
> to not accumulate the values you read, because they might be old/invalid.

Untested patch to give the general idea. I can polish it a bit later today.

 net/netfilter/nft_counter.c | 59 +-
 1 file changed, 23 insertions(+), 36 deletions(-)

diff --git a/net/netfilter/nft_counter.c b/net/netfilter/nft_counter.c
index f6a02c5071c2aeafca7635da3282a809aa04d6ab..57ed95b024473a2aa76298fe5bb5013bf709801b 100644
--- a/net/netfilter/nft_counter.c
+++ b/net/netfilter/nft_counter.c
@@ -31,18 +31,25 @@ struct nft_counter_percpu_priv {
 	struct nft_counter_percpu __percpu *counter;
 };
 
+static DEFINE_PER_CPU(seqcount_t, nft_counter_seq);
+
 static inline void nft_counter_do_eval(struct nft_counter_percpu_priv *priv,
 				       struct nft_regs *regs,
 				       const struct nft_pktinfo *pkt)
 {
 	struct nft_counter_percpu *this_cpu;
+	seqcount_t *myseq;
 
 	local_bh_disable();
 	this_cpu = this_cpu_ptr(priv->counter);
-	u64_stats_update_begin(&this_cpu->syncp);
+	myseq = this_cpu_ptr(&nft_counter_seq);
+
+	write_seqcount_begin(myseq);
+
 	this_cpu->counter.bytes += pkt->skb->len;
 	this_cpu->counter.packets++;
-	u64_stats_update_end(&this_cpu->syncp);
+
+	write_seqcount_end(myseq);
 	local_bh_enable();
 }
 
@@ -110,52 +117,30 @@ static void nft_counter_fetch(struct nft_counter_percpu __percpu *counter,
 	memset(total, 0, sizeof(*total));
 	for_each_possible_cpu(cpu) {
+		seqcount_t *seqp = per_cpu_ptr(&nft_counter_seq, cpu);
+
 		cpu_stats = per_cpu_ptr(counter, cpu);
 		do {
-			seq	= u64_stats_fetch_begin_irq(&cpu_stats->syncp);
+			seq	= read_seqcount_begin(seqp);
 			bytes	= cpu_stats->counter.bytes;
 			packets	= cpu_stats->counter.packets;
-		} while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, seq));
+		} while (read_seqcount_retry(seqp, seq));
 
 		total->packets += packets;
 		total->bytes += bytes;
 	}
 }
 
-static u64 __nft_counter_reset(u64 *counter)
-{
-	u64 ret, old;
-
-	do {
-		old = *counter;
-		ret = cmpxchg64(counter, old, 0);
-	} while (ret != old);
-
-	return ret;
-}
-
 static void nft_counter_reset(struct nft_counter_percpu __percpu *counter,
 			      struct nft_counter *total)
 {
 	struct nft_counter_percpu *cpu_stats;
-	u64 bytes, packets;
-	unsigned int seq;
-	int cpu;
 
-	memset(total, 0, sizeof(*total));
-	for_each_possible_cpu(cpu) {
-		bytes = packets = 0;
-
-		cpu_stats = per_cpu_ptr(counter, cpu);
-		do {
-			seq = u64_stats_fetch_begin_irq(&cpu_stats->syncp);
-			packets += __nft_counter_reset(&cpu_stats->counter.packets);
-			bytes += __nft_counter_reset(&cpu_stats->counter.bytes);
-		} while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, seq));
-
-		total->packets += packets;
-		total->bytes += bytes;
-	}
+	local_bh_disable();
+	cpu_stats = this_cpu_ptr(counter);
+	cpu_stats->counter.packets -= total->packets;
+	cpu_stats->counter.bytes -= total->bytes;
+	local_bh_enable();
 }
 
 static int nft_counter_do_dump(struct sk_buff *skb,
@@ -164,10 +149,9 @@ static int nft_counter_do_dump(struct sk_buff *skb,
 {
 	struct nft_counter total;
 
+	nft_counter_fetch(priv->counter, &total);
 	if (reset)
 		nft_counter_reset(priv->counter, &total);
-	else
-		nft_counter_fetch(priv->counter, &total);
 
 	if (nla_put_be64(skb, NFTA_COUNTER_BYTES, cpu_to_be64(total.bytes),
 			 NFTA_COUNTER_PAD) ||
@@ -285,7 +269,10 @@ static struct nft_expr_type nft_counter_type __read_mostly = {
 
 static int __init nft_counter_module_init(void)
 {
-	int err;
+	int err, cpu;
+
+	for_each_possible_cpu(cpu)
+		seqcount_init(per_cpu_ptr(&nft_counter_seq, cpu));
 
 	err = nft_register_obj(&nft_counter_obj);
 	if (err < 0
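The interesting trick in the patch above is the reset path: instead of cmpxchg-zeroing every per-cpu slot, it subtracts the dumped totals from the local cpu's counter, so the global sum drops by exactly what was reported even while other cpus keep counting. A userspace sketch of that bookkeeping (array slots stand in for per-cpu data; names are illustrative):

```c
#define NCPUS 4

unsigned long long packets[NCPUS];	/* stand-in for the percpu counters */

/* Sum all "per-cpu" slots, as nft_counter_fetch() does. */
unsigned long long fetch_total(void)
{
	unsigned long long sum = 0;

	for (int cpu = 0; cpu < NCPUS; cpu++)
		sum += packets[cpu];
	return sum;
}

/* Reset as in the patch: subtract the reported total from one slot only.
 * That slot may transiently wrap "negative", but since the counters are
 * unsigned and sums wrap modulo 2^64, fetch_total() stays exact. */
void reset_counters(int cpu, unsigned long long reported)
{
	packets[cpu] -= reported;
}
```

Concurrent increments on other slots during the reset are neither lost nor double-counted: a later fetch sees only the traffic that arrived after the dump.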
Re: [PATCH 37/50] netfilter: nf_tables: atomic dump and reset for stateful objects
On Fri, 2016-12-09 at 11:24 +0100, Pablo Neira Ayuso wrote:
> Hi Paul,

Hi Pablo

Given that bytes/packets counters are modified without cmpxchg64() :

static inline void nft_counter_do_eval(struct nft_counter_percpu_priv *priv,
				       struct nft_regs *regs,
				       const struct nft_pktinfo *pkt)
{
	struct nft_counter_percpu *this_cpu;

	local_bh_disable();
	this_cpu = this_cpu_ptr(priv->counter);
	u64_stats_update_begin(&this_cpu->syncp);
	this_cpu->counter.bytes += pkt->skb->len;
	this_cpu->counter.packets++;
	u64_stats_update_end(&this_cpu->syncp);
	local_bh_enable();
}

It means that the cmpxchg64() used to clear the stats is not good enough. It does not help to make sure stats are properly cleared.

On 64 bit, the ->syncp is not there, so the nft_counter_reset() might not see that a bytes or packets counter was modified by another cpu.

CPU 1                              CPU 2

LOAD PTR->BYTES into REG_A
                                   old = *counter;
REG_A += skb->len;
                                   cmpxchg64(counter, old, 0);
PTR->BYTES = REG_A

It looks that you want a seqcount, even on 64bit arches, so that CPU 2 can restart its loop, and more importantly you need to not accumulate the values you read, because they might be old/invalid.

Another way would be to not use cmpxchg64() at all. Way too expensive in fast path ! The percpu value would never be modified by another cpu than the owner. You need a per cpu seqcount, no need to add a syncp per nft percpu counter.

static DEFINE_PER_CPU(seqcount_t, nft_pcpu_seq);

static inline void nft_counter_do_eval(struct nft_counter_percpu_priv *priv,
				       struct nft_regs *regs,
				       const struct nft_pktinfo *pkt)
{
	struct nft_counter_percpu *this_cpu;
	seqcount_t *myseq;

	local_bh_disable();
	this_cpu = this_cpu_ptr(priv->counter);
	myseq = this_cpu_ptr(&nft_pcpu_seq);

	write_seqcount_begin(myseq);
	this_cpu->counter.bytes += pkt->skb->len;
	this_cpu->counter.packets++;
	write_seqcount_end(myseq);
	local_bh_enable();
}

Thanks !
Re: [PATCH nf-next] netfilter: xt_bpf: support ebpf
On Mon, 2016-12-05 at 15:28 -0500, Willem de Bruijn wrote: > From: Willem de Bruijn <will...@google.com> > > Add support for attaching an eBPF object by file descriptor. > > The iptables binary can be called with a path to an elf object or a > pinned bpf object. Also pass the mode and path to the kernel to be > able to return it later for iptables dump and save. > > Signed-off-by: Willem de Bruijn <will...@google.com> > --- Assuming there is no simple way to get variable matchsize in iptables, this looks good to me, thanks. Reviewed-by: Eric Dumazet <eduma...@google.com> -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] net_sched: gen_estimator: complete rewrite of rate estimators
On Sat, 2016-12-03 at 23:07 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> 1) Old code was hard to maintain, due to complex lock chains.
>    (We probably will be able to remove some kfree_rcu() in callers)
>
> 2) Using a single timer to update all estimators does not scale.
>
> 3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
>    is not supposed to work well)

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index c7adcb57654ea57d1ba6702c91743cb7d2c74d28..859b60bfa86712031186fffc09c65bc43aa065dd 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2081,6 +2081,14 @@ static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
>  	limit <<= factor;
>
>  	if (atomic_read(&sk->sk_wmem_alloc) > limit) {
> +		/* Special case where TX completion is delayed too much :
> +		 * If the skb we try to send is the first skb in write queue,
> +		 * then send it !
> +		 * No need to wait for TX completion to call us back.
> +		 */
> +		if (skb == sk->sk_write_queue.next)
> +			return false;
> +

Oups, this has nothing to do here. I will send a v2, sorry.
[PATCH net-next] net_sched: gen_estimator: complete rewrite of rate estimators
From: Eric Dumazet <eduma...@google.com> 1) Old code was hard to maintain, due to complex lock chains. (We probably will be able to remove some kfree_rcu() in callers) 2) Using a single timer to update all estimators does not scale. 3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity is not supposed to work well) In this rewrite : - I removed the RB tree that had to be scanned in gen_estimator_active(). qdisc dumps should be much faster. - Each estimator has its own timer. - Estimations are maintained in net_rate_estimator structure, instead of dirtying the qdisc. Minor, but part of the simplification. - Reading the estimator uses RCU and a seqcount to provide proper support for 32bit kernels. - We reduce memory need when estimators are not used, since we store a pointer, instead of the bytes/packets counters. - xt_rateest_mt() no longer has to grab a spinlock. (In the future, xt_rateest_tg() could be switched to per cpu counters) Signed-off-by: Eric Dumazet <eduma...@google.com> --- include/net/act_api.h |2 include/net/gen_stats.h| 17 - include/net/netfilter/xt_rateest.h | 10 include/net/sch_generic.h |2 net/core/gen_estimator.c | 298 +-- net/core/gen_stats.c | 17 - net/ipv4/tcp_output.c |8 net/netfilter/xt_RATEEST.c |4 net/netfilter/xt_rateest.c | 28 +- net/sched/act_api.c|9 net/sched/act_police.c | 21 + net/sched/sch_api.c|2 net/sched/sch_cbq.c|6 net/sched/sch_drr.c|6 net/sched/sch_generic.c|2 net/sched/sch_hfsc.c |6 net/sched/sch_htb.c|6 net/sched/sch_qfq.c|8 18 files changed, 188 insertions(+), 264 deletions(-) diff --git a/include/net/act_api.h b/include/net/act_api.h index 9dddf77a69ccbcb003cfa66bcc0de337f78f3dae..1d716449209e4753a297c61a287077a1eb96e6d8 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -36,7 +36,7 @@ struct tc_action { struct tcf_ttcfa_tm; struct gnet_stats_basic_packed tcfa_bstats; struct gnet_stats_queue tcfa_qstats; - struct gnet_stats_rate_est64tcfa_rate_est; + struct net_rate_estimator __rcu 
*tcfa_rate_est; spinlock_t tcfa_lock; struct rcu_head tcfa_rcu; struct gnet_stats_basic_cpu __percpu *cpu_bstats; diff --git a/include/net/gen_stats.h b/include/net/gen_stats.h index 231e121cc7d9c72075e7e6dde3655d631f64a1c4..8b7aa370e7a4af61fcb71ed751dba72ebead6143 100644 --- a/include/net/gen_stats.h +++ b/include/net/gen_stats.h @@ -11,6 +11,8 @@ struct gnet_stats_basic_cpu { struct u64_stats_sync syncp; }; +struct net_rate_estimator; + struct gnet_dump { spinlock_t * lock; struct sk_buff * skb; @@ -42,8 +44,7 @@ void __gnet_stats_copy_basic(const seqcount_t *running, struct gnet_stats_basic_cpu __percpu *cpu, struct gnet_stats_basic_packed *b); int gnet_stats_copy_rate_est(struct gnet_dump *d, -const struct gnet_stats_basic_packed *b, -struct gnet_stats_rate_est64 *r); +struct net_rate_estimator __rcu **ptr); int gnet_stats_copy_queue(struct gnet_dump *d, struct gnet_stats_queue __percpu *cpu_q, struct gnet_stats_queue *q, __u32 qlen); @@ -53,16 +54,16 @@ int gnet_stats_finish_copy(struct gnet_dump *d); int gen_new_estimator(struct gnet_stats_basic_packed *bstats, struct gnet_stats_basic_cpu __percpu *cpu_bstats, - struct gnet_stats_rate_est64 *rate_est, + struct net_rate_estimator __rcu **rate_est, spinlock_t *stats_lock, seqcount_t *running, struct nlattr *opt); -void gen_kill_estimator(struct gnet_stats_basic_packed *bstats, - struct gnet_stats_rate_est64 *rate_est); +void gen_kill_estimator(struct net_rate_estimator __rcu **ptr); int gen_replace_estimator(struct gnet_stats_basic_packed *bstats, struct gnet_stats_basic_cpu __percpu *cpu_bstats, - struct gnet_stats_rate_est64 *rate_est, + struct net_rate_estimator __rcu **ptr, spinlock_t *stats_lock, seqcount_t *running, struct nlattr *opt); -bool gen_estimator_active(const struct gnet_stats_basic_packed *bstats, - const struct gnet_stats_rate_est64 *rate_est); +bool gen_estimator_active(struct net_rate_estimator __rcu **ptr); +bool gen_estimator_read(struct net_rate_estimator
[PATCH v2 net-next] net_sched: gen_estimator: complete rewrite of rate estimators
From: Eric Dumazet <eduma...@google.com> 1) Old code was hard to maintain, due to complex lock chains. (We probably will be able to remove some kfree_rcu() in callers) 2) Using a single timer to update all estimators does not scale. 3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity is not supposed to work well) In this rewrite : - I removed the RB tree that had to be scanned in gen_estimator_active(). qdisc dumps should be much faster. - Each estimator has its own timer. - Estimations are maintained in net_rate_estimator structure, instead of dirtying the qdisc. Minor, but part of the simplification. - Reading the estimator uses RCU and a seqcount to provide proper support for 32bit kernels. - We reduce memory need when estimators are not used, since we store a pointer, instead of the bytes/packets counters. - xt_rateest_mt() no longer has to grab a spinlock. (In the future, xt_rateest_tg() could be switched to per cpu counters) Signed-off-by: Eric Dumazet <eduma...@google.com> --- v2: removed unwanted changes to tcp_output.c Renamed some parameters to please htmldoc include/net/act_api.h |2 include/net/gen_stats.h| 17 - include/net/netfilter/xt_rateest.h | 10 include/net/sch_generic.h |2 net/core/gen_estimator.c | 299 +-- net/core/gen_stats.c | 17 - net/netfilter/xt_RATEEST.c |4 net/netfilter/xt_rateest.c | 28 +- net/sched/act_api.c|9 net/sched/act_police.c | 21 + net/sched/sch_api.c|2 net/sched/sch_cbq.c|6 net/sched/sch_drr.c|6 net/sched/sch_generic.c|2 net/sched/sch_hfsc.c |6 net/sched/sch_htb.c|6 net/sched/sch_qfq.c|8 17 files changed, 181 insertions(+), 264 deletions(-) diff --git a/include/net/act_api.h b/include/net/act_api.h index 9dddf77a69ccbcb003cfa66bcc0de337f78f3dae..1d716449209e4753a297c61a287077a1eb96e6d8 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -36,7 +36,7 @@ struct tc_action { struct tcf_ttcfa_tm; struct gnet_stats_basic_packed tcfa_bstats; struct gnet_stats_queue tcfa_qstats; - struct 
gnet_stats_rate_est64tcfa_rate_est; + struct net_rate_estimator __rcu *tcfa_rate_est; spinlock_t tcfa_lock; struct rcu_head tcfa_rcu; struct gnet_stats_basic_cpu __percpu *cpu_bstats; diff --git a/include/net/gen_stats.h b/include/net/gen_stats.h index 231e121cc7d9c72075e7e6dde3655d631f64a1c4..8b7aa370e7a4af61fcb71ed751dba72ebead6143 100644 --- a/include/net/gen_stats.h +++ b/include/net/gen_stats.h @@ -11,6 +11,8 @@ struct gnet_stats_basic_cpu { struct u64_stats_sync syncp; }; +struct net_rate_estimator; + struct gnet_dump { spinlock_t * lock; struct sk_buff * skb; @@ -42,8 +44,7 @@ void __gnet_stats_copy_basic(const seqcount_t *running, struct gnet_stats_basic_cpu __percpu *cpu, struct gnet_stats_basic_packed *b); int gnet_stats_copy_rate_est(struct gnet_dump *d, -const struct gnet_stats_basic_packed *b, -struct gnet_stats_rate_est64 *r); +struct net_rate_estimator __rcu **ptr); int gnet_stats_copy_queue(struct gnet_dump *d, struct gnet_stats_queue __percpu *cpu_q, struct gnet_stats_queue *q, __u32 qlen); @@ -53,16 +54,16 @@ int gnet_stats_finish_copy(struct gnet_dump *d); int gen_new_estimator(struct gnet_stats_basic_packed *bstats, struct gnet_stats_basic_cpu __percpu *cpu_bstats, - struct gnet_stats_rate_est64 *rate_est, + struct net_rate_estimator __rcu **rate_est, spinlock_t *stats_lock, seqcount_t *running, struct nlattr *opt); -void gen_kill_estimator(struct gnet_stats_basic_packed *bstats, - struct gnet_stats_rate_est64 *rate_est); +void gen_kill_estimator(struct net_rate_estimator __rcu **ptr); int gen_replace_estimator(struct gnet_stats_basic_packed *bstats, struct gnet_stats_basic_cpu __percpu *cpu_bstats, - struct gnet_stats_rate_est64 *rate_est, + struct net_rate_estimator __rcu **ptr, spinlock_t *stats_lock, seqcount_t *running, struct nlattr *opt); -bool gen_estimator_active(const struct gnet_stats_basic_packed *bstats, - const struct gnet_stats_rate_est64 *rate_est); +bool gen_estimator_active(struct net_rate_estimator __rcu **ptr); +bool 
gen_estimator
Re: net/sctp: vmalloc allocation failure in sctp_setsockopt/xt_alloc_table_info
On Mon, 2016-11-28 at 19:09 +0100, Florian Westphal wrote:
> We should prevent OOM killer from running in first place (GFP_NORETRY should
> work).

Make sure that a vmalloc(8) will succeed, even if a few pages need to be swapped out. Otherwise, some scripts using iptables will die while they were okay before ?

I am not sure how GFP_NORETRY really works among different linux kernels. Maybe this flag has no effect for order-0 allocations anyway.
Re: [PATCH net-next 1/1] netfilter: xt_multiport: Fix wrong unmatch result with multiple ports
On Fri, 2016-11-25 at 11:58 +0800, f...@ikuai8.com wrote:
> From: Gao Feng
>
> I lost one test case in the commit for xt_multiport.
> For example, the rule is "-m multiport --dports 22,80,443".
> When the first port is unmatched and the second is matched, the current code
> could not return the right result.
> It would return false directly when the first port is unmatched.
>
> Fixes: dd2602d00f80 ("netfilter: xt_multiport: Use switch case instead
> of multiple condition checks")
>
> Signed-off-by: Gao Feng

Note that the Fixes: tag should immediately precede your 'Signed-off-by:' tag. No empty new line between the two tags.

Thanks.
Re: [PATCH v2 nf-next 3/3] netfilter: x_tables: pack percpu counter allocations
On Mon, 2016-11-21 at 23:07 +0100, Florian Westphal wrote: > instead of allocating each xt_counter individually, allocate 4k chunks > and then use these for counter allocation requests. > > This should speed up rule evaluation by increasing data locality, > also speeds up ruleset loading because we reduce calls to the percpu > allocator. > > As Eric points out we can't use PAGE_SIZE, page_allocator would fail on > arches with 64k page size. > > Suggested-by: Eric Dumazet <eduma...@google.com> > Signed-off-by: Florian Westphal <f...@strlen.de> > --- Acked-by: Eric Dumazet <eduma...@google.com> Very cool, thanks a lot ! -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH nf-next 3/3] netfilter: x_tables: pack percpu counter allocations
On Mon, 2016-11-21 at 14:57 +0100, Florian Westphal wrote:

...

>  #define SMP_ALIGN(x) (((x) + SMP_CACHE_BYTES-1) & ~(SMP_CACHE_BYTES-1))
> +#define XT_PCPU_BLOCK_SIZE 4096
>
>  struct compat_delta {
>  	unsigned int offset;	/* offset in kernel */
> @@ -1618,6 +1619,7 @@ EXPORT_SYMBOL_GPL(xt_proto_fini);
>  /**
>   * xt_percpu_counter_alloc - allocate x_tables rule counter
>   *
> + * @state: pointer to xt_percpu allocation state
>   * @counter: pointer to counter struct inside the ip(6)/arpt_entry struct
>   *
>   * On SMP, the packet counter [ ip(6)t_entry->counters.pcnt ] will then
> @@ -1626,20 +1628,33 @@ EXPORT_SYMBOL_GPL(xt_proto_fini);
>   * Rule evaluation needs to use xt_get_this_cpu_counter() helper
>   * to fetch the real percpu counter.
>   *
> + * To speed up allocation and improve data locality, an entire
> + * page is allocated.

	* To speed up allocation and improve data locality, a 4KB block
	* is allocated.

> + *
> + * xt_percpu_counter_alloc_state contains the base address of the
> + * allocated page and the current sub-offset.
> + *
>   * returns false on error.
> */
> -bool xt_percpu_counter_alloc(struct xt_counters *counter)
> +bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
> +			     struct xt_counters *counter)
> {
> -	void __percpu *res;
> +	BUILD_BUG_ON(XT_PCPU_BLOCK_SIZE < (sizeof(*counter) * 2));
>
> -	if (nr_cpu_ids > 1) {
> -		res = __alloc_percpu(sizeof(struct xt_counters),
> -				     sizeof(struct xt_counters));
> +	if (nr_cpu_ids <= 1)
> +		return true;
>
> -		if (!res)
> +	if (state->mem == NULL) {

	if (!state->mem) {

> +		state->mem = __alloc_percpu(XT_PCPU_BLOCK_SIZE,
> +					    XT_PCPU_BLOCK_SIZE);
> +		if (!state->mem)

(to be consistent with this test )

> 			return false;
> 	}
> -	counter->pcnt = (__force unsigned long)res;
> +	counter->pcnt = (__force unsigned long)(state->mem + state->off);
> +	state->off += sizeof(*counter);
> +	if (state->off > (XT_PCPU_BLOCK_SIZE - sizeof(*counter))) {
> +		state->mem = NULL;
> +		state->off = 0;
> +	}
Re: [PATCH nf-next 2/3] netfilter: x_tables: pass xt_counters struct to counter allocator
On Mon, 2016-11-21 at 14:57 +0100, Florian Westphal wrote: > Keeps some noise away from a followup patch. > > Signed-off-by: Florian Westphal <f...@strlen.de> > --- Acked-by: Eric Dumazet <eduma...@google.com> -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH nf-next 1/3] netfilter: x_tables: pass xt_counters struct instead of packet counter
On Mon, 2016-11-21 at 14:57 +0100, Florian Westphal wrote: > On SMP we overload the packet counter (unsigned long) to contain > percpu offset. Hide this from callers and pass xt_counters address > instead. > > Preparation patch to allocate the percpu counters in page-sized batch > chunks. > > Signed-off-by: Florian Westphal <f...@strlen.de> > --- Acked-by: Eric Dumazet <eduma...@google.com> -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: netfilter question
On Sun, 2016-11-20 at 09:31 -0800, Eric Dumazet wrote: > Thanks Eric, I will test the patch myself, because I believe we need it > asap ;) Current net-next without Florian patch : lpaa24:~# time for f in `seq 1 2000` ; do iptables -A FORWARD ; done real0m12.856s user0m0.590s sys 0m11.131s perf report ...; perf report -> 47.45% iptables [kernel.kallsyms] [k] pcpu_alloc_area 8.49% iptables [kernel.kallsyms] [k] memset_erms 7.35% iptables [kernel.kallsyms] [k] get_counters 2.87% iptables [kernel.kallsyms] [k] __memmove 2.33% iptables [kernel.kallsyms] [k] pcpu_alloc 2.07% iptables [kernel.kallsyms] [k] _find_next_bit.part.0 1.62% iptables xtables-multi [.] 0x0001bb9d 1.25% iptables [kernel.kallsyms] [k] page_fault 1.01% iptables [kernel.kallsyms] [k] memcmp 0.94% iptables [kernel.kallsyms] [k] translate_table 0.76% iptables [kernel.kallsyms] [k] find_next_bit 0.73% iptables [kernel.kallsyms] [k] filemap_map_pages 0.68% iptables [kernel.kallsyms] [k] copy_user_enhanced_fast_string 0.54% iptables [kernel.kallsyms] [k] __get_user_8 0.54% iptables [kernel.kallsyms] [k] clear_page_c_e After patch : lpaa24:~# time for f in `seq 1 2000` ; do iptables -A FORWARD ; done real0m3.867s user0m0.559s sys 0m2.216s 22.15% iptables [kernel.kallsyms] [k] get_counters 5.85% iptables xtables-multi [.] 0x0001bbac 3.99% iptables [kernel.kallsyms] [k] page_fault 2.37% iptables [kernel.kallsyms] [k] memcmp 2.19% iptables [kernel.kallsyms] [k] copy_user_enhanced_fast_string 1.89% iptables [kernel.kallsyms] [k] translate_table 1.78% iptables [kernel.kallsyms] [k] memset_erms 1.74% iptables [kernel.kallsyms] [k] clear_page_c_e 1.73% iptables [kernel.kallsyms] [k] __get_user_8 1.72% iptables [kernel.kallsyms] [k] perf_iterate_ctx 1.21% iptables [kernel.kallsyms] [k] handle_mm_fault 0.98% iptables [kernel.kallsyms] [k] unmap_page_range So this is a huge win. 
And I suspect data path will also gain from all pcpu counters being in the same area of memory (this is where I am very interested) -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: netfilter question
On Sun, 2016-11-20 at 12:22 -0500, Eric D wrote: > I'm currently abroad for work and will come back home soon. I will > test the solution and provide feedback to Florian by end of week. > > Thanks for jumping on this quickly. > > Eric > > > On Nov 20, 2016 7:33 AM, "Eric Dumazet" <eric.duma...@gmail.com> > wrote: > On Thu, 2016-11-17 at 01:07 +0100, Florian Westphal wrote: > > > + if (state->mem == NULL) { > > + state->mem = __alloc_percpu(PAGE_SIZE, > PAGE_SIZE); > > + if (!state->mem) > > + return false; > > + } > > This will fail on arches where PAGE_SIZE=65536 > > percpu allocator limit is PCPU_MIN_UNIT_SIZE ( 32 KB ) > > So maybe use a smaller value like 4096 ? > > #define XT_PCPU_BLOC_SIZE 4096 > Thanks Eric, I will test the patch myself, because I believe we need it asap ;) -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: netfilter question
On Thu, 2016-11-17 at 01:07 +0100, Florian Westphal wrote: > + if (state->mem == NULL) { > + state->mem = __alloc_percpu(PAGE_SIZE, PAGE_SIZE); > + if (!state->mem) > + return false; > + } This will fail on arches where PAGE_SIZE=65536 percpu allocator limit is PCPU_MIN_UNIT_SIZE ( 32 KB ) So maybe use a smaller value like 4096 ? #define XT_PCPU_BLOC_SIZE 4096 -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: netfilter question
On Thu, 2016-11-17 at 01:07 +0100, Florian Westphal wrote: Seems very nice ! > + > +void xt_percpu_counter_free(struct xt_counters *counters) > +{ > + unsigned long pcnt = counters->pcnt; > + > + if (nr_cpu_ids > 1 && (pcnt & (PAGE_SIZE - 1)) == 0) > + free_percpu((void __percpu *) (unsigned long)pcnt); > +} pcnt is already an "unsigned long" This packing might also speed up "iptables -nvL" dumps ;) -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: netfilter question
On Wed, 2016-11-16 at 16:02 +0100, Florian Westphal wrote: > Eric Dumazet <eduma...@google.com> wrote: > > On Wed, Nov 16, 2016 at 2:22 AM, Eric Desrochers <er...@miromedia.ca> wrote: > > > Hi Eric, > > > > > > My name is Eric and I'm reaching you today as I found your name in > > > multiple netfilter kernel commits, and was hoping we can discuss about a > > > potential regression. > > > > > > I identified (git bisect) this commit > > > [https://github.com/torvalds/linux/commit/71ae0dff02d756e4d2ca710b79f2ff5390029a5f] > > > as the one introducing a serious performance slowdown when using the > > > binary ip/ip6tables with a large number of policies. > > > > > > I also tried with the latest and greatest v4.9-rc4 mainline kernel, and > > > the slowdown is still present. > > > > > > So even commit > > > [https://github.com/torvalds/linux/commit/a1a56aaa0735c7821e85c37268a7c12e132415cb] > > > which introduce a 16 bytes alignment on xt_counter percpu allocations so > > > that bytes and packets sit in same cache line, doesn't have impact too. > > > > > > > > > Everything I found is detailed in the following bug : > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1640786 > > > > > > Of course, I'm totally aware that "iptables-restore" should be the > > > favorite choice as it is way more efficient (note that using > > > iptables-restore doesn't exhibit the problem) but some folks still rely > > > on ip/ip6tables and might face this performance slowdown. > > > > > > I found the problem today, I will continue to investigate on my side, but > > > I was wondering if we could have a discussion about this subject. > > > > > > Thanks in advance. > > [..] > > > Key point is that we really care about fast path : packet processing. > > And cited commit helps this a lot by lowering the memory foot print on > > hosts with many cores. > > This is a step into right direction. 
> > > > Now we probably should batch the percpu allocations one page at a > > time, or ask Tejun if percpu allocations could be really really fast > > (probably much harder) > > > > But really you should not use iptables one rule at a time... > > This will never compete with iptables-restore. ;) > > > > Florian, would you have time to work on a patch trying to group the > > percpu allocations one page at a time ? > > You mean something like this ? : > xt_entry_foreach(iter, entry0, newinfo->size) { > - ret = find_check_entry(iter, net, repl->name, repl->size); > - if (ret != 0) > + if (pcpu_alloc == 0) { > + pcnt = __alloc_percpu(PAGE_SIZE, sizeof(struct > xt_counters)); alignment should be a page. > + if (IS_ERR_VALUE(pcnt)) > + BUG(); well. no BUG() for sure ;) > + } > + > + iter->counters.pcnt = pcnt + pcpu_alloc; > + iter->counters.bcnt = !!pcpu_alloc; > + pcpu_alloc += sizeof(struct xt_counters); > + > + if (pcpu_alloc > PAGE_SIZE - sizeof(struct xt_counters)) > + pcpu_alloc = 0; > + > + ret = find_check_entry(iter, net, repl->name, repl->size) > ... > > This is going to be ugly since we'll have to deal with SMP vs. NONSMP (i.e. > no perpcu allocations) > in ip/ip6/arptables. Time for a common helper then ... > > Error unwind will also be a mess (we can abuse .bcnt to tell if pcpu offset > should be free'd or not). Free if the address is aligned to a page boundary ? Otherwise skip it, it already has been freed earlier. > > But maybe I don't understand what you are suggesting :) > Can you elaborate? Note that this grouping will also help data locality. I definitely have servers with a huge number of percpu allocations and I fear we might have many TLB misses because of possible spread of xt_counters. Note that percpu pages must not be shared by multiple users (ip/ip6/arptable), each table should get its own cache. 
Re: [PATCH nf] netfilter: conntrack: refine gc worker heuristics
On Tue, 2016-11-01 at 21:01 +0100, Florian Westphal wrote:
> 	schedule_delayed_work(&gc_work->dwork, next_run);
> @@ -993,6 +1029,7 @@ static void gc_worker(struct work_struct *work)
>  static void conntrack_gc_work_init(struct conntrack_gc_work *gc_work)
>  {
>  	INIT_DELAYED_WORK(&gc_work->dwork, gc_worker);
> +	gc_work->next_gc_run = GC_INTERVAL_MAX;
>  	gc_work->exiting = false;
>  }
>
> @@ -1885,7 +1922,7 @@ int nf_conntrack_init_start(void)
>  	nf_ct_untracked_status_or(IPS_CONFIRMED | IPS_UNTRACKED);
>
>  	conntrack_gc_work_init(&conntrack_gc_work);
> -	schedule_delayed_work(&conntrack_gc_work.dwork, GC_INTERVAL);
> +	schedule_delayed_work(&conntrack_gc_work.dwork, GC_INTERVAL_MAX);
>
>  	return 0;
>

We might use system_long_wq instead of system_wq ?

queue_delayed_work(system_long_wq, ...)
Re: error: 'struct net_device' has no member named 'nf_hooks_ingress'
On Wed, 2016-10-05 at 22:56 +0200, Michal Sojka wrote: > this commit is now in mainline as > e3b37f11e6e4e6b6f02cc762f182ce233d2c1c9d and it breaks my build: > > net/netfilter/core.c: In function 'nf_set_hooks_head': > net/netfilter/core.c:96:3: error: 'struct net_device' has no member named > 'nf_hooks_ingress' > > Are the fixes (see below) on the way to mainline too? Yes the fixes are already in nf tree and _will_ get pushed. Pablo and David are attending netdev 1.2 in Tokyo and have obligations. https://git.kernel.org/cgit/linux/kernel/git/pablo/nf-next.git/ Thanks. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] netfilter: xt_hashlimit: uses div_u64 for division
On Fri, 2016-09-30 at 18:05 +0200, Arnd Bergmann wrote: > The newly added support for high-resolution pps rates introduced multiple > 64-bit > division operations in one function, which fails on all 32-bit architectures: > > net/netfilter/xt_hashlimit.o: In function `user2credits': > xt_hashlimit.c:(.text.user2credits+0x3c): undefined reference to > `__aeabi_uldivmod' > xt_hashlimit.c:(.text.user2credits+0x68): undefined reference to > `__aeabi_uldivmod' > xt_hashlimit.c:(.text.user2credits+0x88): undefined reference to > `__aeabi_uldivmod' > > This replaces the division with an explicit call to div_u64 for version 2 > to documents that this is a slow operation, and reverts back to 32-bit > arguments > for the version 1 data to restore the original faster 32-bit division. > > With both changes combined, we no longer get a link error. > > Fixes: 11d5f15723c9 ("netfilter: xt_hashlimit: Create revision 2 to support > higher pps rates") > Signed-off-by: Arnd Bergmann> --- > Vishwanath Pai already sent a patch for this, and I did my version > independently. > The difference is that his version also the more expensive division for the > version 1 variant that doesn't need it. > > See also http://patchwork.ozlabs.org/patch/676713/ > --- > net/netfilter/xt_hashlimit.c | 17 ++--- > 1 file changed, 10 insertions(+), 7 deletions(-) > > diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c > index 44a095ecc7b7..3d5525df6eb3 100644 > --- a/net/netfilter/xt_hashlimit.c > +++ b/net/netfilter/xt_hashlimit.c > @@ -464,20 +464,23 @@ static u32 xt_hashlimit_len_to_chunks(u32 len) > static u64 user2credits(u64 user, int revision) > { > if (revision == 1) { > + u32 user32 = user; /* use 32-bit division */ > + This looks dangerous to me. Have you really tried all possible cases ? Caller (even if using revision == 1) does user2credits(cfg->avg * cfg->burst, revision); Since this is not a fast path, I would prefer to keep the 64bit divide. 
Vishwanath version looks safer. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH nf-next v3 1/2] netfilter: Fix potential null pointer dereference
On Wed, 2016-09-28 at 10:56 -0400, Aaron Conole wrote: > Eric Dumazet <eric.duma...@gmail.com> writes: > > > On Wed, 2016-09-28 at 09:12 -0400, Aaron Conole wrote: > >> It's possible for nf_hook_entry_head to return NULL. If two > >> nf_unregister_net_hook calls happen simultaneously with a single hook > >> entry in the list, both will enter the nf_hook_mutex critical section. > >> The first will successfully delete the head, but the second will see > >> this NULL pointer and attempt to dereference. > >> > >> This fix ensures that no null pointer dereference could occur when such > >> a condition happens. > >> > >> Signed-off-by: Aaron Conole <acon...@bytheb.org> > >> --- > >> net/netfilter/core.c | 2 +- > >> 1 file changed, 1 insertion(+), 1 deletion(-) > >> > >> diff --git a/net/netfilter/core.c b/net/netfilter/core.c > >> index 360c63d..e58e420 100644 > >> --- a/net/netfilter/core.c > >> +++ b/net/netfilter/core.c > >> @@ -160,7 +160,7 @@ void nf_unregister_net_hook(struct net *net, const > >> struct nf_hook_ops *reg) > >> > >>mutex_lock(_hook_mutex); > >>hooks_entry = nf_hook_entry_head(net, reg); > >> - if (hooks_entry->orig_ops == reg) { > >> + if (hooks_entry && hooks_entry->orig_ops == reg) { > >>nf_set_hooks_head(net, reg, > >> nf_entry_dereference(hooks_entry->next)); > >>goto unlock; > > > > When was the bug added exactly ? > > Sunday, on the nf-next tree. > > > For all bug fixes, you need to add a Fixes: tag. > > > > Like : > > > > Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked list") > > I would but it's in nf-next tree, and I'm not sure how pulls go. If > they are done via patch imports, then the sha sums will be wrong and the > commit message will be misleading. If the sums are preserved, then I > can resubmit with this information. > I gave the (12 digits) sha-1 as present in David Miller net-next tree. 
https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=e3b37f11e6e4e6b6f02cc762f182ce233d2c1c9d This won't change, because David never rebases his tree under normal operations. Thanks.
Re: [PATCH nf-next v3 1/2] netfilter: Fix potential null pointer dereference
On Wed, 2016-09-28 at 09:12 -0400, Aaron Conole wrote: > It's possible for nf_hook_entry_head to return NULL. If two > nf_unregister_net_hook calls happen simultaneously with a single hook > entry in the list, both will enter the nf_hook_mutex critical section. > The first will successfully delete the head, but the second will see > this NULL pointer and attempt to dereference. > > This fix ensures that no null pointer dereference could occur when such > a condition happens. > > Signed-off-by: Aaron Conole> --- > net/netfilter/core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/netfilter/core.c b/net/netfilter/core.c > index 360c63d..e58e420 100644 > --- a/net/netfilter/core.c > +++ b/net/netfilter/core.c > @@ -160,7 +160,7 @@ void nf_unregister_net_hook(struct net *net, const struct > nf_hook_ops *reg) > > mutex_lock(&nf_hook_mutex); > hooks_entry = nf_hook_entry_head(net, reg); > - if (hooks_entry->orig_ops == reg) { > + if (hooks_entry && hooks_entry->orig_ops == reg) { > nf_set_hooks_head(net, reg, > nf_entry_dereference(hooks_entry->next)); > goto unlock; When was the bug added exactly ? For all bug fixes, you need to add a Fixes: tag. Like : Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked list") So that 100 different people in stable teams do not have to do the archeology themselves ... Thanks.
Re: [PATCH] Fix link error in 32bit arch because of 64bit division
On Tue, 2016-09-27 at 03:42 -0400, Vishwanath Pai wrote: > Fix link error in 32bit arch because of 64bit division > > Division of 64bit integers will cause linker error undefined reference > to `__udivdi3'. Fix this by replacing divisions with div64_64 > > Signed-off-by: Vishwanath Pai> > --- > net/netfilter/xt_hashlimit.c | 15 ++++++++------- > 1 file changed, 8 insertions(+), 7 deletions(-) > > diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c > index 44a095e..7fc694e 100644 > --- a/net/netfilter/xt_hashlimit.c > +++ b/net/netfilter/xt_hashlimit.c > @@ -465,19 +465,20 @@ static u64 user2credits(u64 user, int revision) > { > if (revision == 1) { > /* If multiplying would overflow... */ > - if (user > 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY_v1)) > + if (user > div64_u64(0xFFFFFFFF, (HZ*CREDITS_PER_JIFFY_v1))) How can this be needed ? 0xFFFFFFFF is 32bits, compiler knows how to compute 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY_v1) itself, without using a 64 bit divide ! Please be selective. > /* Divide first. */ > - return (user / XT_HASHLIMIT_SCALE) *\ > + return div64_u64(user, XT_HASHLIMIT_SCALE) *\ > HZ * CREDITS_PER_JIFFY_v1; > > - return (user * HZ * CREDITS_PER_JIFFY_v1) \ > - / XT_HASHLIMIT_SCALE; > + return div64_u64((user * HZ * CREDITS_PER_JIFFY_v1), > + XT_HASHLIMIT_SCALE); > } else { > - if (user > 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY)) > - return (user / XT_HASHLIMIT_SCALE_v2) *\ Probably same remark here. > + if (user > div64_u64(0xFFFFFFFF, > (HZ*CREDITS_PER_JIFFY))) > + return div64_u64(user, XT_HASHLIMIT_SCALE_v2) *\ > HZ * CREDITS_PER_JIFFY; > > - return (user * HZ * CREDITS_PER_JIFFY) / XT_HASHLIMIT_SCALE_v2; > + return div64_u64((user * HZ * CREDITS_PER_JIFFY), > + XT_HASHLIMIT_SCALE_v2); > } > }
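Eric's point is that only divisions whose operands are runtime 64-bit values need a div64_u64() helper on 32-bit; a constant expression such as 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY) is folded at compile time and never emits a call to `__udivdi3'. A minimal userspace sketch of the overflow-guard pattern under review (the HZ, CREDITS_PER_JIFFY and SCALE values below are illustrative stand-ins, not the kernel's configuration):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants only -- not the kernel's actual values. */
#define HZ                 250ULL
#define CREDITS_PER_JIFFY  128ULL
#define SCALE              10000ULL

static uint64_t user2credits(uint64_t user)
{
	/* If multiplying would overflow 64 bits, divide first
	 * (losing some precision)... */
	if (user > UINT64_MAX / (HZ * CREDITS_PER_JIFFY))
		return (user / SCALE) * HZ * CREDITS_PER_JIFFY;
	/* ...otherwise multiply first to keep full precision. */
	return (user * HZ * CREDITS_PER_JIFFY) / SCALE;
}
```

On a 32-bit build, only the divides taking the runtime `user` operand would need the helper; the guard's divisor is a compile-time constant either way.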
Re: [PATCH nf] netfilter: nf_tables: Ensure u8 attributes are loaded from u32 within the bounds
On Thu, 2016-09-22 at 16:58 +0200, Pablo Neira Ayuso wrote: > attributes") > > Always use 12 bytes commit-ids. 4da449a is too short, given the number > of changes we're getting in the kernel tree, this may become ambiguous > at some point so it won't be unique. > > You can achieve this via: git log --oneline --abbrev=12 and Documentation/SubmittingPatches has these tips : You should also be sure to use at least the first twelve characters of the SHA-1 ID. The kernel repository holds a *lot* of objects, making collisions with shorter IDs a real possibility. Bear in mind that, even if there is no collision with your six-character ID now, that condition may change five years from now. If your patch fixes a bug in a specific commit, e.g. you found an issue using git-bisect, please use the 'Fixes:' tag with the first 12 characters of the SHA-1 ID, and the one line summary. For example: Fixes: e21d2170f366 ("video: remove unnecessary platform_set_drvdata()") The following git-config settings can be used to add a pretty format for outputting the above style in the git log or git show commands [core] abbrev = 12 [pretty] fixes = Fixes: %h (\"%s\")
Re: [PATCH nf v3] netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack
On Thu, 2016-09-22 at 10:22 +0800, f...@ikuai8.com wrote: > From: Gao Feng> > It is valid that the TCP RST packet which does not set ack flag, and bytes > of ack number are zero. But current seqadj codes would adjust the "0" ack > to invalid ack number. Actually seqadj need to check the ack flag before > adjust it for these RST packets. > > The following is my test case > > client is 10.26.98.245, and add one iptable rule: > iptables -I INPUT -p tcp --sport 12345 -m connbytes --connbytes 2: > --connbytes-dir reply --connbytes-mode packets -j REJECT --reject-with > tcp-reset > This iptables rule could generate on TCP RST without ack flag. > > server:10.172.135.55 > Enable the synproxy with seqadjust by the following iptables rules > iptables -t raw -A PREROUTING -i eth0 -p tcp -d 10.172.135.55 --dport 12345 > -m tcp --syn -j CT --notrack > > iptables -A INPUT -i eth0 -p tcp -d 10.172.135.55 --dport 12345 -m conntrack > --ctstate INVALID,UNTRACKED -j SYNPROXY --sack-perm --timestamp --wscale 7 > --mss 1460 > iptables -A OUTPUT -o eth0 -p tcp -s 10.172.135.55 --sport 12345 -m conntrack > --ctstate INVALID,UNTRACKED -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j ACCEPT > > The following is my test result. > > 1. 
packet trace on client > root@routers:/tmp# tcpdump -i eth0 tcp port 12345 -n > tcpdump: verbose output suppressed, use -v or -vv for full protocol decode > listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes > IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [S], seq 3695959829, > win 29200, options [mss 1460,sackOK,TS val 452367884 ecr 0,nop,wscale 7], > length 0 > IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [S.], seq 546723266, > ack 3695959830, win 0, options [mss 1460,sackOK,TS val 15643479 ecr 452367884, > nop,wscale 7], length 0 > IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [.], ack 1, win 229, > options [nop,nop,TS val 452367885 ecr 15643479], length 0 > IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [.], ack 1, win 226, > options [nop,nop,TS val 15643479 ecr 452367885], length 0 > IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [R], seq 3695959830, > win 0, length 0 > > 2. seqadj log on server > [62873.867319] Adjusting sequence number from 602341895->546723267, > ack from 3695959830->3695959830 > [62873.867644] Adjusting sequence number from 602341895->546723267, > ack from 3695959830->3695959830 > [62873.869040] Adjusting sequence number from 3695959830->3695959830, > ack from 0->55618628 > > To summarize, it is clear that the seqadj codes adjust the 0 ack when receive > one TCP RST packet without ack. 
> > Signed-off-by: Gao Feng > --- > v3: Add the reproduce steps and packet trace > v2: Regenerate because the first patch is removed > v1: Initial patch > > net/netfilter/nf_conntrack_seqadj.c | 34 +++--- > 1 file changed, 19 insertions(+), 15 deletions(-) > > diff --git a/net/netfilter/nf_conntrack_seqadj.c > b/net/netfilter/nf_conntrack_seqadj.c > index dff0f0c..3bd9c7e 100644 > --- a/net/netfilter/nf_conntrack_seqadj.c > +++ b/net/netfilter/nf_conntrack_seqadj.c > @@ -179,30 +179,34 @@ int nf_ct_seq_adjust(struct sk_buff *skb, > > tcph = (void *)skb->data + protoff; > spin_lock_bh(>lock); > + Please do not add style change during a bug fix. > if (after(ntohl(tcph->seq), this_way->correction_pos)) > seqoff = this_way->offset_after; > else > seqoff = this_way->offset_before; > > - if (after(ntohl(tcph->ack_seq) - other_way->offset_before, > - other_way->correction_pos)) > - ackoff = other_way->offset_after; > - else > - ackoff = other_way->offset_before; > - > newseq = htonl(ntohl(tcph->seq) + seqoff); > - newack = htonl(ntohl(tcph->ack_seq) - ackoff); > - > inet_proto_csum_replace4(>check, skb, tcph->seq, newseq, false); > - inet_proto_csum_replace4(>check, skb, tcph->ack_seq, newack, > - false); > - > - pr_debug("Adjusting sequence number from %u->%u, ack from %u->%u\n", > - ntohl(tcph->seq), ntohl(newseq), ntohl(tcph->ack_seq), > - ntohl(newack)); > > + pr_debug("Adjusting sequence number from %u->%u\n", > + ntohl(tcph->seq), ntohl(newseq)); > tcph->seq = newseq; > - tcph->ack_seq = newack; > + > + if (likely(tcph->ack)) { > + if (after(ntohl(tcph->ack_seq) - other_way->offset_before, > + other_way->correction_pos)) > + ackoff = other_way->offset_after; > + else > + ackoff = other_way->offset_before; > + > + newack = htonl(ntohl(tcph->ack_seq) - ackoff); > + inet_proto_csum_replace4(>check, skb, tcph->ack_seq, > + newack, false); > + > + pr_debug("Adjusting ack number from %u->%u\n", > +
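The quoted patch is cut off above, but the shape of the fix is clear from the hunks: the sequence number is always adjusted, while the acknowledgment number is only recomputed when the ACK flag is actually set, so a bare RST keeps its zero ack field. A toy, userspace-only model of that control flow (the struct and offset handling here are invented for illustration and are much simpler than the kernel's seqadj code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented toy header -- just enough state to model the control flow. */
struct toy_tcph {
	uint32_t seq;
	uint32_t ack_seq;
	bool ack;		/* ACK flag */
};

/* Mirror of the fixed logic: seq is adjusted unconditionally, ack_seq
 * only when the ACK flag is set, so an RST without ACK is left alone. */
static void toy_seqadj(struct toy_tcph *th, int32_t seqoff, int32_t ackoff)
{
	th->seq += seqoff;
	if (th->ack)
		th->ack_seq -= ackoff;
}
```

With the pre-fix behavior, the `ack_seq -= ackoff` line would run even for `ack == false`, turning the RST's zero ack field into a bogus value — exactly the "Adjusting ack number from 0->55618628" log shown above.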
Re: [PATCH] netfilter: xt_socket: fix transparent match for IPv6 request sockets
On Tue, 2016-09-20 at 08:01 -0700, Eric Dumazet wrote: > > Something like : > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 3ebf45b38bc3..0fccfd6cc258 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -6264,6 +6264,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops, > /* Note: tcp_v6_init_req() might override ir_iif for link locals */ > inet_rsk(req)->ir_iif = inet_request_bound_dev_if(sk, skb); > > + ireq->no_srccheck = inet_sk(sk)->transparent; Since there is no @ireq there, simply use : inet_rsk(req)->no_srccheck = inet_sk(sk)->transparent; > af_ops->init_req(req, sk, skb); >
Re: [PATCH] netfilter: xt_socket: fix transparent match for IPv6 request sockets
On Tue, 2016-09-20 at 15:26 +0200, KOVACS Krisztian wrote: > The introduction of TCP_NEW_SYN_RECV state, and the addition of request > sockets to the ehash table seems to have broken the --transparent option > of the socket match for IPv6 (around commit a9407000). > > Now that the socket lookup finds the TCP_NEW_SYN_RECV socket instead of the > listener, the --transparent option tries to match on the no_srccheck flag > of the request socket. > > Unfortunately, that flag was only set for IPv4 sockets in tcp_v4_init_req() > by copying the transparent flag of the listener socket. This effectively > causes '-m socket --transparent' not match on the ACK packet sent by the > client in a TCP handshake. > > This change adds the same code initializing no_srccheck to > tcp_v6_init_req(), rendering the above scenario working again. > > Signed-off-by: Alex Badics> Signed-off-by: KOVACS Krisztian > --- > net/ipv6/tcp_ipv6.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c > index 94f4f89..21f2e5c 100644 > --- a/net/ipv6/tcp_ipv6.c > +++ b/net/ipv6/tcp_ipv6.c > @@ -690,6 +690,7 @@ static void tcp_v6_init_req(struct request_sock *req, > > ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr; > ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr; > + ireq->no_srccheck = inet_sk(sk_listener)->transparent; > > /* So that link locals have meaning */ > if (!sk_listener->sk_bound_dev_if && Thanks a lot for the bug hunting guys. Could you add the precise tag to help stable backports ? Fixes: 12-digit-sha1 ("patch title") Since it is a netdev patch, I would also copy netdev@ (done here) Also what about moving this (IPv4/IPv6 common code) before the af_ops->init_req(req, sk, skb); call, since it is no longer family specific ? 
Something like : diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3ebf45b38bc3..0fccfd6cc258 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -6264,6 +6264,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops, /* Note: tcp_v6_init_req() might override ir_iif for link locals */ inet_rsk(req)->ir_iif = inet_request_bound_dev_if(sk, skb); + ireq->no_srccheck = inet_sk(sk)->transparent; af_ops->init_req(req, sk, skb); if (security_inet_conn_request(sk, skb, req)) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 7158d4f8dae4..b448eb9fe1b9 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1195,7 +1195,6 @@ static void tcp_v4_init_req(struct request_sock *req, sk_rcv_saddr_set(req_to_sk(req), ip_hdr(skb)->daddr); sk_daddr_set(req_to_sk(req), ip_hdr(skb)->saddr); - ireq->no_srccheck = inet_sk(sk_listener)->transparent; ireq->opt = tcp_v4_save_options(skb); }
Re: [PATCH 5/5] net/netfilter/nf_conntrack_core: update memory barriers.
On Wed, 2016-08-31 at 15:42 +0200, Manfred Spraul wrote: > As explained in commit 51d7d5205d33 > ("powerpc: Add smp_mb() to arch_spin_is_locked()"), for some architectures > the ACQUIRE during spin_lock only applies to loading the lock, not to > storing the lock state. > > nf_conntrack_lock() does not handle this correctly: > /* 1) Acquire the lock */ > spin_lock(lock); > while (unlikely(nf_conntrack_locks_all)) { > spin_unlock(lock); > > spinlock_store_acquire() is missing between spin_lock and reading > nf_conntrack_locks_all. In addition, reading nf_conntrack_locks_all > needs ACQUIRE memory ordering. > > 2nd, minor issue: If there would be many nf_conntrack_all_lock() callers, > then nf_conntrack_lock() would loop forever. > > Therefore: Change nf_conntrack_lock and nf_conntrack_all_lock() to the > approach used by ipc/sem.c: > > - add spinlock_store_acquire() > - add smp_load_acquire() > - for nf_conntrack_lock, use spin_lock(&global_lock) instead of > spin_unlock_wait(&global_lock) and loop backward. > - use smp_store_mb() instead of a raw smp_mb() > > Signed-off-by: Manfred Spraul> Cc: Pablo Neira Ayuso > Cc: netfilter-devel@vger.kernel.org > > --- > > Question: Should I split this patch? > First a patch that uses smp_mb(), with Cc: stable. > Then replace the smp_mb() with spinlock_store_acquire, not for stable I guess it all depends on stable backports you believe are needed. You probably should add the tags : Fixes: <12-digit-sha1> "patch title" that introduced the bug(s) you fix. By doing this archaeological research you will likely have a better answer ? Thanks !
Re: [PATCH v3 nf-next 5/7] netfilter: conntrack: add gc worker to remove timed-out entries
On Thu, 2016-08-25 at 15:33 +0200, Florian Westphal wrote: > Conntrack gc worker to evict stale entries. > > GC happens once every 5 seconds, but we only scan at most 1/64th of the > table (and not more than 8k) buckets to avoid hogging cpu. > > This means that a complete scan of the table will take several minutes > of wall-clock time. > > Considering that the gc run will never have to evict any entries > during normal operation because those will happen from packet path > this should be fine. > > We only need gc to make sure userspace (conntrack event listeners) > eventually learn of the timeout, and for resource reclaim in case the > system becomes idle. > > We do not disable BH and cond_resched for every bucket so this should > not introduce noticeable latencies either. > > A followup patch will add a small change to speed up GC for the extreme > case where most entries are timed out on an otherwise idle system. > > v2: Use cond_resched_rcu_qs & add comment wrt. missing restart on > nulls value change in gc worker, suggested by Eric Dumazet. > > v3: don't call cancel_delayed_work_sync twice (again, Eric). > > Signed-off-by: Florian Westphal <f...@strlen.de> > --- > net/netfilter/nf_conntrack_core.c | 76 > +++ > 1 file changed, 76 insertions(+) Acked-by: Eric Dumazet <eduma...@google.com>
Re: [PATCH v2 nf-next 4/7] netfilter: conntrack: add gc worker to remove timed-out entries
On Wed, 2016-08-24 at 22:11 +0200, Florian Westphal wrote: > Eric Dumazet <eric.duma...@gmail.com> wrote: > > On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote: > > > Conntrack gc worker to evict stale entries. > > > > > > > static struct nf_conn * > > > __nf_conntrack_alloc(struct net *net, > > >const struct nf_conntrack_zone *zone, > > > @@ -1527,6 +1597,7 @@ static int untrack_refs(void) > > > > > > void nf_conntrack_cleanup_start(void) > > > { > > > + conntrack_gc_work.exiting = true; > > > RCU_INIT_POINTER(ip_ct_attach, NULL); > > > } > > > > > > @@ -1536,6 +1607,9 @@ void nf_conntrack_cleanup_end(void) > > > while (untrack_refs() > 0) > > > schedule(); > > > > > > + cancel_delayed_work_sync(&conntrack_gc_work.dwork); > > > + /* can be re-scheduled once */ > > > > Are you sure ? > > > > As conntrack_gc_work.exiting = true, I do not see how this can happen ? > > nf_conntrack_cleanup_start() sets exiting = true > > current cpu blocks in > > cancel_delayed_work_sync(&conntrack_gc_work.dwork); > > Iff the work queue was running on other cpu but was already past > gc_work->exiting check then when cancel_delayed_work_sync() (first one) > returns it will have re-armed itself via schedule_delayed_work(). > > So I think the 2nd cancel_delayed_work_sync is needed. > > Let me know if you'd like to see a v3 with more verbose > comment about this. If you were using cancel_delayed_work() (instead of cancel_delayed_work_sync()) I would understand the concern. But here you are using the thing that was designed to exactly avoid the issue, of both work running on another cpu and/or re-arming itself. If what you are saying was true, we would have to fix hundreds of cancel_delayed_work_sync() call sites ...
Re: [PATCH v2 nf-next 4/7] netfilter: conntrack: add gc worker to remove timed-out entries
On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote: > Conntrack gc worker to evict stale entries. > static struct nf_conn * > __nf_conntrack_alloc(struct net *net, >const struct nf_conntrack_zone *zone, > @@ -1527,6 +1597,7 @@ static int untrack_refs(void) > > void nf_conntrack_cleanup_start(void) > { > + conntrack_gc_work.exiting = true; > RCU_INIT_POINTER(ip_ct_attach, NULL); > } > > @@ -1536,6 +1607,9 @@ void nf_conntrack_cleanup_end(void) > while (untrack_refs() > 0) > schedule(); > > + cancel_delayed_work_sync(&conntrack_gc_work.dwork); > + /* can be re-scheduled once */ Are you sure ? As conntrack_gc_work.exiting = true, I do not see how this can happen ? > + cancel_delayed_work_sync(&conntrack_gc_work.dwork); > nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size); > > nf_conntrack_proto_fini(); > @@ -1810,6 +1884,10 @@ int nf_conntrack_init_start(void) > } > /* - and look it like as a confirmed connection */ > nf_ct_untracked_status_or(IPS_CONFIRMED | IPS_UNTRACKED); > + > + conntrack_gc_work_init(&conntrack_gc_work); > + schedule_delayed_work(&conntrack_gc_work.dwork, GC_INTERVAL); > + > return 0; > > err_proto:
Re: [PATCH v2 nf-next 2/7] netfilter: conntrack: get rid of conntrack timer
On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote: > With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as > Eric Dumazet pointed out during netfilter workshop 2016. Another reason was the fact that Thomas was about to change max timer range : https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=500462a9de657f86edaa102f8ab6bff7f7e43fc2 Acked-by: Eric Dumazet <eduma...@google.com> Thanks !
Re: [PATCH v2 nf-next 1/7] netfilter: don't rely on DYING bit to detect when destroy event was sent
On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote: > The reliable event delivery mode currently (ab)uses the DYING bit to > detect which entries on the dying list have to be skipped when > re-delivering events from the ecache worker in reliable event mode. > > Currently when we delete the conntrack from main table we only set this > bit if we could also deliver the netlink destroy event to userspace. > > If we fail we move it to the dying list, the ecache worker will > reattempt event delivery for all confirmed conntracks on the dying list > that do not have the DYING bit set. > > Once timer is gone, we can no longer use if (del_timer()) to detect > when we 'stole' the reference count owned by the timer/hash entry, so > we need some other way to avoid racing with other cpu. > > Pablo suggested to add a marker in the ecache extension that skips Acked-by: Eric Dumazet <eduma...@google.com>
Re: [PATCH nf-next 7/7] netfilter: restart search if moved to other chain
On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote: > In case nf_conntrack_tuple_taken did not find a conflicting entry > check that all entries in this hash slot were tested and restart > in case an entry was moved to another chain. > > Reported-by: Eric Dumazet <eduma...@google.com> > Signed-off-by: Florian Westphal <f...@strlen.de> > --- Fixes: ea781f197d6a ("netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()") Acked-by: Eric Dumazet <eduma...@google.com> Thanks !
Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer
On Fri, 2016-08-19 at 18:04 +0200, Florian Westphal wrote: > Eric Dumazet <eric.duma...@gmail.com> wrote: > > On Fri, 2016-08-19 at 17:16 +0200, Florian Westphal wrote: > > > > > Hmm, nf_conntrack_find caller needs to hold rcu_read_lock, > > > in case object is free'd SLAB_DESTROY_BY_RCU should delay actual release > > > of the page. > > > > Well, point is that SLAB_DESTROY_BY_RCU means that we have no grace > > period, and object can be immediately reused and recycled. > > > > @next pointer can definitely be overwritten. > > I see. Isn't that detected by the nulls magic (to restart > lookup if entry was moved to other chain due to overwritten next pointer)? Well, you did not add the nulls magic in your code ;) It might be fine, since it should be a rare event, and garbage collection is best effort, so you might add a comment in gc_worker() why it is probably overkill to restart the loop in this unlikely event. BTW, maybe nf_conntrack_tuple_taken() should get the nulls magic check, as it is currently missing.
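For readers unfamiliar with the "nulls magic" being discussed: with SLAB_DESTROY_BY_RCU a walker can follow a recycled entry into a different hash chain, so each chain is terminated by a tagged "nulls" value that encodes which bucket it belongs to, and the lookup restarts if it ends on the wrong one. A single-threaded userspace sketch of just that end-marker check (the encoding and structures here are simplified assumptions, not the kernel's hlist_nulls implementation):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node {
	int key;
	struct node *next;	/* real node, or a tagged nulls value */
};

/* Low pointer bit set marks a "nulls" end value carrying a bucket id. */
#define IS_NULLS(p)	((uintptr_t)(p) & 1)
#define NULLS_VAL(p)	((uintptr_t)(p) >> 1)
#define MAKE_NULLS(v)	((struct node *)((((uintptr_t)(v)) << 1) | 1))

/* Walk bucket b; if the chain ends on a nulls value naming a different
 * bucket, the entry we followed was recycled into another chain while
 * we walked, so the lookup must restart. */
static struct node *nulls_lookup(struct node **table, unsigned int b, int key)
{
	struct node *n;
restart:
	for (n = table[b]; !IS_NULLS(n); n = n->next)
		if (n->key == key)
			return n;
	if (NULLS_VAL(n) != b)	/* ended up in the wrong chain: retry */
		goto restart;
	return NULL;
}
```

The real hlist_nulls walk additionally relies on RCU-protected loads; the point here is only the final check, which is exactly what Eric notes is missing from nf_conntrack_tuple_taken() and the proposed gc worker.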
Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer
On Fri, 2016-08-19 at 17:16 +0200, Florian Westphal wrote: > Hmm, nf_conntrack_find caller needs to hold rcu_read_lock, > in case object is free'd SLAB_DESTROY_BY_RCU should delay actual release > of the page. Well, point is that SLAB_DESTROY_BY_RCU means that we have no grace period, and object can be immediately reused and recycled. @next pointer can definitely be overwritten. > > Should be same as (old) 'death_by_timeout' timer firing during > hlist_nulls_for_each_entry_rcu walk. > > Thanks for looking at this Eric!
Re: [PATCH nf-next 4/6] netfilter: conntrack: add gc worker to remove timed-out entries
On Fri, 2016-08-19 at 13:36 +0200, Florian Westphal wrote: > Conntrack gc worker to evict stale entries. ... > + > + hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[i], hnnode) { > + tmp = nf_ct_tuplehash_to_ctrack(h); > + > + if (nf_ct_is_expired(tmp)) { > + nf_ct_gc_expired(tmp); > + expired_count++; > + continue; Same remark about hlist_nulls_for_each_entry_rcu() not 'safe' here > + } > + } > + > + rcu_read_unlock(); > + cond_resched(); This could use cond_resched_rcu_qs() > + } while (++buckets < goal && > + expired_count < GC_MAX_EVICTS); > + > + if (gc_work->exiting) > + return; > +
Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer
On Fri, 2016-08-19 at 13:36 +0200, Florian Westphal wrote: > With stats enabled this eats 80 bytes on x86_64 per nf_conn entry. > > Remove it and use a 32bit jiffies value containing timestamp until > entry is valid. Great work ! ... > +/* caller must hold rcu readlock and none of the nf_conntrack_locks */ > +static void nf_ct_gc_expired(struct nf_conn *ct) > +{ > + if (!atomic_inc_not_zero(&ct->ct_general.use)) > + return; > + > + if (nf_ct_should_gc(ct)) > + nf_ct_kill(ct); > + > + nf_ct_put(ct); > +} > + > /* > * Warning : > * - Caller must take a reference on returned object > @@ -499,6 +505,17 @@ begin: > bucket = reciprocal_scale(hash, hsize); > > hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[bucket], hnnode) { > + struct nf_conn *ct; > + > + ct = nf_ct_tuplehash_to_ctrack(h); > + if (nf_ct_is_expired(ct)) { > + nf_ct_gc_expired(ct); > + continue; > + } > + > + if (nf_ct_is_dying(ct)) > + continue; > + > if (nf_ct_key_equal(h, tuple, zone, net)) { > NF_CT_STAT_INC_ATOMIC(net, found); > return h; Florian, I do not see how this part is safe against concurrent lookups and deletes ? At least the hlist_nulls_for_each_entry_rcu() looks buggy, since fetching the next pointer would trigger a use after free ? Thanks !
[PATCH net] netfilter: tproxy: properly refcount tcp listeners
From: Eric Dumazet <eduma...@google.com> inet_lookup_listener() and inet6_lookup_listener() no longer take a reference on the found listener. This minimal patch adds back the refcounting, but we might do this differently in net-next later. Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Reported-and-tested-by: Denys Fedoryshchenko <nuclear...@nuclearcat.com> Signed-off-by: Eric Dumazet <eduma...@google.com> --- Note: bug added in 4.7, stable candidate. net/netfilter/xt_TPROXY.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/net/netfilter/xt_TPROXY.c b/net/netfilter/xt_TPROXY.c index 7f4414d26a66..663c4c3c9072 100644 --- a/net/netfilter/xt_TPROXY.c +++ b/net/netfilter/xt_TPROXY.c @@ -127,6 +127,8 @@ nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb, void *hp, daddr, dport, in->ifindex); + if (sk && !atomic_inc_not_zero(&sk->sk_refcnt)) + sk = NULL; /* NOTE: we return listeners even if bound to * 0.0.0.0, those are filtered out in * xt_socket, since xt_TPROXY needs 0 bound @@ -195,6 +197,8 @@ nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff, void *hp, daddr, ntohs(dport), in->ifindex); + if (sk && !atomic_inc_not_zero(&sk->sk_refcnt)) + sk = NULL; /* NOTE: we return listeners even if bound to * 0.0.0.0, those are filtered out in * xt_socket, since xt_TPROXY needs 0 bound
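The atomic_inc_not_zero() pattern the patch relies on — take a reference only if the count has not already dropped to zero, i.e. only if the object is not already being freed — can be sketched in userspace with C11 atomics (a simplified single-word model, not the kernel's implementation):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Try to take a reference; fail if the object is already on its way
 * to being freed (refcount has reached zero). */
static bool inc_not_zero(atomic_uint *refcnt)
{
	unsigned int old = atomic_load(refcnt);

	while (old != 0) {
		/* On failure, CAS reloads the current value into 'old'
		 * and we retry; on success we hold a reference. */
		if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
			return true;
	}
	return false;
}
```

This is why the patch writes `if (sk && !atomic_inc_not_zero(...)) sk = NULL;` — a listener found in the hash but already draining its last reference must be treated as not found rather than used.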
Re: kernel panic TPROXY , vanilla 4.7.1
On Wed, 2016-08-17 at 19:44 +0300, Denys Fedoryshchenko wrote: > Yes, everything fine after patch! > Thanks a lot Perfect, thanks for testing, I am sending the official patch.
Re: kernel panic TPROXY , vanilla 4.7.1
On Wed, 2016-08-17 at 08:42 -0700, Eric Dumazet wrote: > On Wed, 2016-08-17 at 17:31 +0300, Denys Fedoryshchenko wrote: > > Hi! > > > > Tried to run squid on latest kernel, and hit a panic > > Sometimes it just shows warning in dmesg (but doesnt work properly) > > [ 75.701666] IPv4: Attempt to release TCP socket in state 10 > > 88102d430780 > > [ 83.866974] squid (2700) used greatest stack depth: 12912 bytes left > > [ 87.506644] IPv4: Attempt to release TCP socket in state 10 > > 880078a48780 > > [ 114.704295] IPv4: Attempt to release TCP socket in state 10 > > 881029f8ad00 > > > > I cannot catch yet oops/panic message, netconsole not working. > > > > After triggering warning message 3 times, i am unable to run squid > > anymore (without reboot), and in netstat it doesnt show port running. > > > > firewall is: > > *mangle > > -A PREROUTING -p tcp -m socket -j DIVERT > > -A PREROUTING -p tcp -m tcp --dport 80 -i eno1 -j TPROXY --on-port 3129 > > --on-ip 0.0.0.0 --tproxy-mark 0x1/0x1 > > -A DIVERT -j MARK --set-xmark 0x1/0x > > -A DIVERT -j ACCEPT > > > > routing > > ip rule add fwmark 1 lookup 100 > > ip route add local default dev eno1 table 100 > > > > > > squid config is default with tproxy option > > http_port 3129 tproxy > > > > Hmppff... sorry for this, I will send a fix. > > Thanks for the report ! > Could you try the following ? Thanks ! 
net/netfilter/xt_TPROXY.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/net/netfilter/xt_TPROXY.c b/net/netfilter/xt_TPROXY.c index 7f4414d26a66..663c4c3c9072 100644 --- a/net/netfilter/xt_TPROXY.c +++ b/net/netfilter/xt_TPROXY.c @@ -127,6 +127,8 @@ nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb, void *hp, daddr, dport, in->ifindex); + if (sk && !atomic_inc_not_zero(&sk->sk_refcnt)) + sk = NULL; /* NOTE: we return listeners even if bound to * 0.0.0.0, those are filtered out in * xt_socket, since xt_TPROXY needs 0 bound @@ -195,6 +197,8 @@ nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff, void *hp, daddr, ntohs(dport), in->ifindex); + if (sk && !atomic_inc_not_zero(&sk->sk_refcnt)) + sk = NULL; /* NOTE: we return listeners even if bound to * 0.0.0.0, those are filtered out in * xt_socket, since xt_TPROXY needs 0 bound
Re: PROBLEM: TPROXY and DNAT broken (bisected to 079096f103fa)
On Wed, 2016-07-27 at 18:19 +, Brandon Cazander wrote:
> [1.] One line summary of the problem:
> Using TPROXY together with a DNAT rule (working on older kernels) fails to work on newer kernels as of commit 079096f103fa.
>
> [2.] Full description of the problem/report:
> I performed a git bisect using a qemu image to test my example below, and the bisect ended at this commit:
>
> > commit 079096f103faca2dd87342cca6f23d4b34da8871
> > Author: Eric Dumazet <eduma...@google.com>
> > Date:   Fri Oct 2 11:43:32 2015 -0700
> >
> >     tcp/dccp: install syn_recv requests into ehash table
>
> [3.] Keywords: networking
>
> [4.] Kernel information
> [4.1.] Kernel version (from /proc/version):
> Everything as of commit 079096f103fa (tested up to 4.5.0)
>
> [4.2.] Kernel .config file:
> When performing the bisect, I built with make oldconfig. Let me know if you want the whole .config file.
>
> [5.] Most recent kernel version which did not have the bug:
> Any kernel that I built prior to commit 079096f103faca2dd87342cca6f23d4b34da8871 did not have this issue.
>
> [6.] no Oops
>
> [7.] A small shell script or example program which triggers the problem (if possible)
>
> I have produced what I hope is a minimal example, using the instructions for TPROXY from http://lxr.linux.no/#linux+v3.10/Documentation/networking/tproxy.txt and an example transparent TCP proxy written in C that I found at https://github.com/kristrev/tproxy-example.
>
> * I have a machine ("ROUTER") with 10.100.0.164/24 on eth0, and 192.168.30.2/24 on eth1.
> This is running the tproxy-example program, with the following rules:
>
> iptables -t mangle -N DIVERT
> iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
> iptables -t mangle -A DIVERT -j MARK --set-mark 1
> iptables -t mangle -A DIVERT -j ACCEPT
> iptables -t mangle -A PREROUTING -p tcp --dport 8080 -j TPROXY --tproxy-mark 0x1/0x1 --on-port 9876
> iptables -t nat -I PREROUTING -i eth0 -d 42.0.1.1 -j DNAT --to-dest 192.168.30.1
> ip rule add fwmark 1 lookup 100
> ip route add local 0.0.0.0/0 dev lo table 100
>
> * There is a machine ("WEBSERVER") at 192.168.30.1/24 hosting a webserver on port 8080.
>
> * My workstation is at 10.100.0.206, and I have a static route for both 192.168.30.2 and 42.0.1.1 via 10.100.0.164.
>
> * Making a curl request to 192.168.30.2:8080 hits the transparent proxy and works with both the GOOD (before the aforementioned commit) kernel and the BAD (at the commit or later) kernel.
>
> * Making a curl request to 42.0.1.1:8080 hits the transparent proxy and works with the GOOD kernel, but with the BAD kernel I get:
> "curl: (56) Recv failure: Connection reset by peer"
>
> * When it fails, no traffic hits the WEBSERVER.
> A tcpdump on the bad kernel shows:
>
> root@dons-qemu-new-kernel:~# tcpdump -niany tcp and port 8080
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
> 16:42:31.551952 IP 10.100.0.206.35562 > 42.0.1.1.8080: Flags [S], seq 3793582216, win 29200, options [mss 1460,sackOK,TS val 632068656 ecr 0,nop,wscale 7], length 0
> 16:42:31.551988 IP 42.0.1.1.8080 > 10.100.0.206.35562: Flags [S.], seq 4042636216, ack 3793582217, win 28960, options [mss 1460,sackOK,TS val 745382 ecr 632068656,nop,wscale 7], length 0
> 16:42:31.55 IP 10.100.0.206.35562 > 42.0.1.1.8080: Flags [.], ack 1, win 229, options [nop,nop,TS val 632068657 ecr 745382], length 0
> 16:42:31.552238 IP 42.0.1.1.8080 > 10.100.0.206.35562: Flags [R], seq 4042636217, win 0, length 0
> 16:42:31.552246 IP 10.100.0.206.35562 > 42.0.1.1.8080: Flags [P.], seq 1:78, ack 1, win 229, options [nop,nop,TS val 632068657 ecr 745382], length 77
> 16:42:31.552251 IP 42.0.1.1.8080 > 10.100.0.206.35562: Flags [R], seq 4042636217, win 0, length 0
> 16:42:32.551668 IP 42.0.1.1.8080 > 10.100.0.206.35562: Flags [S.], seq 4042636216, ack 3793582217, win 28960, options [mss 1460,sackOK,TS val 745632 ecr 632068656,nop,wscale 7], length 0
> 16:42:32.551925 IP 10.100.0.206.35562 > 42.0.1.1.8080: Flags [R], seq 3793582217, win 0, length 0
> 16:42:34.551668 IP 42.0.1.1.8080 > 10.100.0.206.35562: Flags [S.], seq 4042636216, ack 3793582217, win 28960, options [mss 1460,sackOK,TS val 746132 ecr 632068656,nop,wscale 7], length 0
> 16:42:34.551995 IP 10.100.0.206.35562 > 42.0.1.1.8080: Flags [R], seq 3793582217, win 0, length 0
>
> * A tcpdump on a GOOD kernel shows:
Re: [RFC 5/7] net: Add allocation flag to rtnl_unicast()
On Fri, 2016-07-08 at 12:15 +0900, Masashi Honma wrote:
> Thanks for the comment.
>
> I have selected GFP flags based on existing code.
>
> I have selected GFP_ATOMIC in inet6_netconf_get_devconf() because
> the skb was allocated with GFP_ATOMIC.

The point is: we should remove GFP_ATOMIC uses as much as we can.
Every time we see one of them, we should ask why it was added and
whether it is really needed.

inet6_netconf_get_devconf() is a perfect example of one careless
GFP_ATOMIC usage:

https://patchwork.ozlabs.org/patch/646291/