Re: [PATCH nf] netfilter: ipv6: nf_defrag: drop skb dst before queueing

2018-07-09 Thread Eric Dumazet



On 07/09/2018 04:43 AM, Florian Westphal wrote:
> Eric Dumazet reports:
>  Here is a reproducer of an annoying bug detected by syzkaller on our production kernel
>  [..]
>  ./b78305423 enable_conntrack
>  Then :
>  sleep 60
>  dmesg | tail -10
>  [  171.599093] unregister_netdevice: waiting for lo to become free. Usage count = 2
>  [  181.631024] unregister_netdevice: waiting for lo to become free. Usage count = 2
>  [  191.687076] unregister_netdevice: waiting for lo to become free. Usage count = 2
>  [  201.703037] unregister_netdevice: waiting for lo to become free. Usage count = 2
>  [  211.711072] unregister_netdevice: waiting for lo to become free. Usage count = 2
>  [  221.959070] unregister_netdevice: waiting for lo to become free. Usage count = 2
> 
> Reproducer sends ipv6 fragment that hits nfct defrag via LOCAL_OUT hook.
> skb gets queued until frag timer expiry -- 1 minute.
> 
> Normally nf_conntrack_reasm gets called during prerouting, so skb has
> no dst yet which might explain why this wasn't spotted earlier.
> 
> Reported-by: Eric Dumazet 
> Reported-by: John Sperbeck 
> Signed-off-by: Florian Westphal 

Tested-by: Eric Dumazet 
Reported-by: syzbot 

Thanks !

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] netfilter: properly initialize xt_table_info structure

2018-05-17 Thread Eric Dumazet


On 05/17/2018 02:34 AM, Greg Kroah-Hartman wrote:
> When allocating a xt_table_info structure, we should be clearing out the
> full amount of memory that was allocated, not just the "header" of the
> structure.  Otherwise odd values could be passed to userspace, which is
> not a good thing.
> 
> Cc: stable 
> Signed-off-by: Greg Kroah-Hartman 
> ---
> v2: use kvzalloc instead of kvmalloc/memset pair, as suggested by Michal 
> Kubecek
> 
>  net/netfilter/x_tables.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index cb7cb300c3bc..cd22bb9b66f3 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1183,11 +1183,10 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
>* than shoot all processes down before realizing there is nothing
>* more to reclaim.
>*/
> - info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> + info = kvzalloc(sz, GFP_KERNEL | __GFP_NORETRY);
>   if (!info)
>   return NULL;
>  
> - memset(info, 0, sizeof(*info));
>   info->size = size;
>   return info;
>  }
> 

I am curious, what particular path does not later overwrite the whole zone ?

Do not get me wrong, this is not fast path, but these blobs can be huge.



Re: WARNING in __proc_create

2018-03-09 Thread Eric Dumazet



On 03/09/2018 03:32 PM, Cong Wang wrote:

On Fri, Mar 9, 2018 at 3:21 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:



On 03/09/2018 03:05 PM, Cong Wang wrote:



BTW, the warning itself is all about empty names, so perhaps
it's better to fix them separately.



Huh ? You want more syzbot reports ? I do not.


I always prefer one patch to fix one problem, and, as I already said,
checking "." and ".." is probably not as trivial as checking empty.

This why I believe 2 (or more) patches here are better than 1 single
patch.

BTW, I don't have any patch, it is so trivial that I don't want to spend
my time on it.


The trivial patch has been sent, and Florian gave a (very good) feedback.

Then Pablo offered to write a complete patch.

They both had more pressing issues, and quite frankly I also had more 
pressing issues.


I doubt that sending the simple fix again would be accepted by the netfilter
maintainers.




Re: WARNING in __proc_create

2018-03-09 Thread Eric Dumazet



On 03/09/2018 03:05 PM, Cong Wang wrote:


BTW, the warning itself is all about empty names, so perhaps
it's better to fix them separately.


Huh ? You want more syzbot reports ? I do not.

I unblocked this report today [1], you can be sure that as soon
as syzbot gets the correct tag (Reported-by: ...), it will find the 
other problems, which are currently on hold to avoid spam.


[1] It has been sitting for a while here at Google, since January 28th.
   At some point you have to ping Pablo/Florian again ;)





Re: WARNING in __proc_create

2018-03-09 Thread Eric Dumazet



On 03/09/2018 02:56 PM, Eric Dumazet wrote:



I sent a patch a while back, but Pablo/Florian wanted more than that 
simple fix.


We also need to filter special characters like '/'

Or maybe I am mixing with something else.


Yes, Florian mentioned that we also had to reject "."  and ".."



Re: WARNING in __proc_create

2018-03-09 Thread Eric Dumazet



On 03/09/2018 02:48 PM, Cong Wang wrote:

On Fri, Mar 9, 2018 at 1:59 PM, syzbot wrote:

Hello,

syzbot hit the following crash on net-next commit
617aebe6a97efa539cc4b8a52adccd89596e6be0 (Sun Feb 4 00:25:42 2018 +)
Merge tag 'usercopy-v4.16-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

So far this crash happened 12 times on net-next, upstream.
C reproducer is attached.
syzkaller reproducer is attached.
Raw console output is attached.
compiler: gcc (GCC) 7.1.1 20170620
.config is attached.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+0502b00edac2a0680...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for
details.
If you forward the report, please keep this part and the footer.

audit: type=1400 audit(1518223777.419:7): avc:  denied  { map } for
pid=4159 comm="syzkaller598581" path="/root/syzkaller598581546" dev="sda1"
ino=16481 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1
[ cut here ]
name len 0
WARNING: CPU: 0 PID: 4159 at fs/proc/generic.c:354 __proc_create+0x696/0x880
fs/proc/generic.c:354


We need to reject empty names.



I sent a patch a while back, but Pablo/Florian wanted more than that 
simple fix.


We also need to filter special characters like '/'

Or maybe I am mixing with something else.


Re: [PATCH nf v5] netfilter: bridge: ebt_among: add more missing match size checks

2018-03-09 Thread Eric Dumazet



On 03/09/2018 05:27 AM, Florian Westphal wrote:

ebt_among is special, it has a dynamic match size and is exempt
from the central size checks.

commit c4585a2823edf ("bridge: ebt_among: add missing match size checks")
added validation for pool size, but missed fact that the macros
ebt_among_wh_src/dst can already return out-of-bound result because
they do not check value of wh_src/dst_ofs (an offset) vs. the size
of the match that userspace gave to us.

v2:
check that offset has correct alignment.
Paolo Abeni points out that we should also check that src/dst
wormhash arrays do not overlap, and src + length lines up with
start of dst (or vice versa).
v3: compact wormhash_sizes_valid() part

NB: Fixes tag is intentionally wrong, this bug exists from day
one when match was added for 2.6 kernel. Tag is there so stable
maintainers will notice this one too.

Tested with same rules from the earlier patch.

Fixes: c4585a2823edf ("bridge: ebt_among: add missing match size checks")
Reported-by: <syzbot+bdabab6f1983a03fc...@syzkaller.appspotmail.com>
Signed-off-by: Florian Westphal <f...@strlen.de>
---
  v5: this is really v3 again, but with explicit (int)sizeof()
  cast rather than use of temporary 'int minsize'.
  objdump shows identical output for v3/v4/v5.



SGTM, thanks Florian ;)

Reviewed-by: Eric Dumazet <eduma...@google.com>




Re: [PATCH nf v4] netfilter: bridge: ebt_among: add more missing match size checks

2018-03-09 Thread Eric Dumazet



On 03/09/2018 02:03 AM, Florian Westphal wrote:

ebt_among is special, it has a dynamic match size and is exempt
from the central size checks.

commit c4585a2823edf ("bridge: ebt_among: add missing match size checks")
added validation for pool size, but missed fact that the macros
ebt_among_wh_src/dst can already return out-of-bound result because
they do not check value of wh_src/dst_ofs (an offset) vs. the size
of the match that userspace gave to us.

v2:
check that offset has correct alignment.
Paolo Abeni points out that we should also check that src/dst
wormhash arrays do not overlap, and src + length lines up with
start of dst (or vice versa).
v3: compact wormhash_sizes_valid() part

NB: Fixes tag is intentionally wrong, this bug exists from day
one when match was added for 2.6 kernel. Tag is there so stable
maintainers will notice this one too.

Tested with same rules from the earlier patch.

Fixes: c4585a2823edf ("bridge: ebt_among: add missing match size checks")
Reported-by: 
Signed-off-by: Florian Westphal 
---
  v4: rewrite wormhash_offset_invalid to make it clearer that
  'off' is <= INT_MAX, objdump doesn't show change vs. v3.

  net/bridge/netfilter/ebt_among.c | 37 +
  1 file changed, 37 insertions(+)

diff --git a/net/bridge/netfilter/ebt_among.c b/net/bridge/netfilter/ebt_among.c
index c5afb4232ecb..eec7459235a2 100644
--- a/net/bridge/netfilter/ebt_among.c
+++ b/net/bridge/netfilter/ebt_among.c
@@ -177,6 +177,31 @@ static bool poolsize_invalid(const struct ebt_mac_wormhash *w)
	return w && w->poolsize >= (INT_MAX / sizeof(struct ebt_mac_wormhash_tuple));
  }
  
+static bool wormhash_offset_invalid(int s_off, unsigned int len)
+{
+   int minsize = sizeof(struct ebt_among_info);
+   unsigned int off = s_off;
+
+   if (s_off == 0) /* not present */
+   return false;
+
+   if (s_off < minsize || off % __alignof__(struct ebt_mac_wormhash))
+   return true;
+
+   /* off <= INT_MAX */


Yes, I guess my confusion came from the fact that you are using a signed 
integer to hold sizeof(struct ebt_among_info), and this is a bit 
unusual, sizeof() being fundamentally unsigned.


A very explicit check would have helped, no need for an extra temp var.

if ((int)off < (int)sizeof(struct ebt_among_info))
return true;

Thanks !



+   off += sizeof(struct ebt_mac_wormhash);
+
+   return off > len;
+}
+
+static bool wormhash_sizes_valid(const struct ebt_mac_wormhash *wh, int a, int b)
+{
+   if (a == 0)
+   a = sizeof(struct ebt_among_info);
+
+   return ebt_mac_wormhash_size(wh) + a == b;
+}
+
  static int ebt_among_mt_check(const struct xt_mtchk_param *par)
  {
const struct ebt_among_info *info = par->matchinfo;
@@ -189,6 +214,10 @@ static int ebt_among_mt_check(const struct xt_mtchk_param *par)
if (expected_length > em->match_size)
return -EINVAL;
  
+	if (wormhash_offset_invalid(info->wh_dst_ofs, em->match_size) ||
+	    wormhash_offset_invalid(info->wh_src_ofs, em->match_size))
+		return -EINVAL;
+
wh_dst = ebt_among_wh_dst(info);
if (poolsize_invalid(wh_dst))
return -EINVAL;
@@ -201,6 +230,14 @@ static int ebt_among_mt_check(const struct xt_mtchk_param *par)
if (poolsize_invalid(wh_src))
return -EINVAL;
  
+	if (info->wh_src_ofs < info->wh_dst_ofs) {
+		if (!wormhash_sizes_valid(wh_src, info->wh_src_ofs, info->wh_dst_ofs))
+			return -EINVAL;
+	} else {
+		if (!wormhash_sizes_valid(wh_dst, info->wh_dst_ofs, info->wh_src_ofs))
+			return -EINVAL;
+	}
+
expected_length += ebt_mac_wormhash_size(wh_src);
  
  	if (em->match_size != EBT_ALIGN(expected_length)) {





Re: [PATCH nf v3] netfilter: bridge: ebt_among: add more missing match size checks

2018-03-08 Thread Eric Dumazet



On 03/08/2018 04:24 PM, Florian Westphal wrote:

Eric Dumazet <eric.duma...@gmail.com> wrote:

Fixes: c4585a2823edf ("bridge: ebt_among: add missing match size checks")
Reported-by: <syzbot+bdabab6f1983a03fc...@syzkaller.appspotmail.com>
Signed-off-by: Florian Westphal <f...@strlen.de>
---
  net/bridge/netfilter/ebt_among.c | 35 +++
  1 file changed, 35 insertions(+)

diff --git a/net/bridge/netfilter/ebt_among.c b/net/bridge/netfilter/ebt_among.c
index c5afb4232ecb..600ac7f6671a 100644
--- a/net/bridge/netfilter/ebt_among.c
+++ b/net/bridge/netfilter/ebt_among.c
@@ -177,6 +177,29 @@ static bool poolsize_invalid(const struct ebt_mac_wormhash *w)
	return w && w->poolsize >= (INT_MAX / sizeof(struct ebt_mac_wormhash_tuple));
  }
+static bool wormhash_offset_invalid(int off, unsigned int len)
+{
+   int minsize = sizeof(struct ebt_among_info);
+
+   if (off == 0) /* not present */
+   return false;
+
+   if (off < minsize || off % __alignof__(struct ebt_mac_wormhash))
+   return true;
+
+   off += sizeof(struct ebt_mac_wormhash);


Can this overflow ?


Yes, off can wrap.


+   return off > len;


len is unsigned though so the unsigned promotion
would still catch this.


Not sure I understand.  Say the result is off==0

return off > len;   will return false.

I thought we were trying to return true for invalid input.



If you think this is too fragile let me know and I can submit a v4 with
a more explicit test (e.g. adding back "unsigned int alleged_off = off",
and testing vs. INT_MAX).




Re: [PATCH nf-next 1/2] netfilter: SYNPROXY: set transport header properly

2018-03-08 Thread Eric Dumazet



On 03/08/2018 07:01 AM, Serhey Popovych wrote:

Eric Dumazet wrote:



On 03/08/2018 02:08 AM, Serhey Popovych wrote:

We can't use skb_reset_transport_header() together with skb_put() to set
skb->transport_header field because skb_put() does not touch skb->data.

Do this the same way as we did for csum_data in the code: subtract skb->head
from tcph.

Signed-off-by: Serhey Popovych <serhe.popov...@gmail.com>
---
   net/ipv4/netfilter/ipt_SYNPROXY.c  | 8 
   net/ipv6/netfilter/ip6t_SYNPROXY.c | 8 
   2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c b/net/ipv4/netfilter/ipt_SYNPROXY.c
index f75fc6b..4953390 100644
--- a/net/ipv4/netfilter/ipt_SYNPROXY.c
+++ b/net/ipv4/netfilter/ipt_SYNPROXY.c
@@ -90,8 +90,8 @@
 
 	niph = synproxy_build_ip(net, nskb, iph->daddr, iph->saddr);
 
-	skb_reset_transport_header(nskb);
 	nth = skb_put(nskb, tcp_hdr_size);
+	nskb->transport_header = (unsigned char *)nth - nskb->head;


I have no idea why you believe the current code is not correct.


alloc_skb():
skb->head = skb->data = skb_tail_pointer(skb)

synproxy_build_ip():
skb_reset_network_header() /* skb_network_header(skb) == skb->data */
skb_put(skb, len); /* skb->tail += len; skb->len += len */


nth = skb_put()   : nth is equal to skb->data


Yes it is equal to skb->data, but in my opinion should be set to
skb->data + ip_hdrlen(skb) to point to TCP header.


Is it really needed to set transport header offset ? We do not intend to 
inject this packet to a local TCP stack ?


Nothing will care of transport header for non GSO packet.

Anyway, maybe the confusion would be avoided if the normal way of
pushing headers were used (using skb_push()).


No casts requested ;)


Re: [PATCH nf-next 1/2] netfilter: SYNPROXY: set transport header properly

2018-03-08 Thread Eric Dumazet



On 03/08/2018 02:08 AM, Serhey Popovych wrote:

We can't use skb_reset_transport_header() together with skb_put() to set
skb->transport_header field because skb_put() does not touch skb->data.

Do this the same way as we did for csum_data in the code: subtract skb->head
from tcph.

Signed-off-by: Serhey Popovych 
---
  net/ipv4/netfilter/ipt_SYNPROXY.c  | 8 
  net/ipv6/netfilter/ip6t_SYNPROXY.c | 8 
  2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c b/net/ipv4/netfilter/ipt_SYNPROXY.c
index f75fc6b..4953390 100644
--- a/net/ipv4/netfilter/ipt_SYNPROXY.c
+++ b/net/ipv4/netfilter/ipt_SYNPROXY.c
@@ -90,8 +90,8 @@
 
 	niph = synproxy_build_ip(net, nskb, iph->daddr, iph->saddr);
 
-	skb_reset_transport_header(nskb);
 	nth = skb_put(nskb, tcp_hdr_size);
+	nskb->transport_header = (unsigned char *)nth - nskb->head;


I have no idea why you believe the current code is not correct.

nth = skb_put()   : nth is equal to skb->data

So skb_reset_transport_header(nskb); does exactly the same thing as
your code, but in a more readable way, without an ugly cast.




Re: [Patch nf-next] netfilter: make xt_rateest hash table per net

2018-03-01 Thread Eric Dumazet
On Thu, 2018-03-01 at 18:58 -0800, Cong Wang wrote:
> As suggested by Eric, we need to make the xt_rateest
> hash table and its lock per netns to reduce lock
> contentions.
> 
> Cc: Florian Westphal <f...@strlen.de>
> Cc: Eric Dumazet <eduma...@google.com>
> Cc: Pablo Neira Ayuso <pa...@netfilter.org>
> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
> ---
>  include/net/netfilter/xt_rateest.h |  4 +-
>  net/netfilter/xt_RATEEST.c | 91 +++---
>  net/netfilter/xt_rateest.c | 10 ++---
>  3 files changed, 72 insertions(+), 33 deletions(-)

Very nice, thanks !

Reviewed-by: Eric Dumazet <eduma...@google.com>

Although the main reason was to avoid name collisions between different
netns.

Hash table is small enough that it can be allocated for each netns.




[PATCH nf] netfilter: IDLETIMER: be syzkaller friendly

2018-02-16 Thread Eric Dumazet
From: Eric Dumazet <eduma...@google.com>

We had one report from syzkaller [1]

First issue is that INIT_WORK() should be done before mod_timer()
or we risk timer being fired too soon, even with a 1 second timer.

Second issue is that we need to reject too big info->timeout
to avoid overflows in msecs_to_jiffies(info->timeout * 1000), or
risk looping, if result after overflow is 0.

[1]
WARNING: CPU: 1 PID: 5129 at kernel/workqueue.c:1444 __queue_work+0xdf4/0x1230 kernel/workqueue.c:1444
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 5129 Comm: syzkaller159866 Not tainted 4.16.0-rc1+ #230
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 panic+0x1e4/0x41c kernel/panic.c:183
 __warn+0x1dc/0x200 kernel/panic.c:547
 report_bug+0x211/0x2d0 lib/bug.c:184
 fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
 fixup_bug arch/x86/kernel/traps.c:247 [inline]
 do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
 invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:988
RIP: 0010:__queue_work+0xdf4/0x1230 kernel/workqueue.c:1444
RSP: 0018:8801db507538 EFLAGS: 00010006
RAX: 8801aeb46080 RBX: 8801db530200 RCX: 81481404
RDX: 0100 RSI: 86b42640 RDI: 0082
RBP: 8801db507758 R08: 11003b6a0de5 R09: 000c
R10: 8801db5073f0 R11: 0020 R12: 11003b6a0eb6
R13: 8801b1067ae0 R14: 01f8 R15: dc00
 queue_work_on+0x16a/0x1c0 kernel/workqueue.c:1488
 queue_work include/linux/workqueue.h:488 [inline]
 schedule_work include/linux/workqueue.h:546 [inline]
 idletimer_tg_expired+0x44/0x60 net/netfilter/xt_IDLETIMER.c:116
 call_timer_fn+0x228/0x820 kernel/time/timer.c:1326
 expire_timers kernel/time/timer.c:1363 [inline]
 __run_timers+0x7ee/0xb70 kernel/time/timer.c:1666
 run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:541 [inline]
 smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:829
 
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:777 [inline]
RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:160 [inline]
RIP: 0010:_raw_spin_unlock_irqrestore+0x5e/0xba kernel/locking/spinlock.c:184
RSP: 0018:8801c20173c8 EFLAGS: 0282 ORIG_RAX: ff12
RAX: dc00 RBX: 0282 RCX: 0006
RDX: 10d592cd RSI: 110035d68d23 RDI: 0282
RBP: 8801c20173d8 R08: 110038402e47 R09: 
R10:  R11:  R12: 8820e5c8
R13: 8801b1067ad8 R14: 8801aea7c268 R15: 8801aea7c278
 __debug_object_init+0x235/0x1040 lib/debugobjects.c:378
 debug_object_init+0x17/0x20 lib/debugobjects.c:391
 __init_work+0x2b/0x60 kernel/workqueue.c:506
 idletimer_tg_create net/netfilter/xt_IDLETIMER.c:152 [inline]
 idletimer_tg_checkentry+0x691/0xb00 net/netfilter/xt_IDLETIMER.c:213
 xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:850
 check_target net/ipv6/netfilter/ip6_tables.c:533 [inline]
 find_check_entry.isra.7+0x935/0xcf0 net/ipv6/netfilter/ip6_tables.c:575
 translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:744
 do_replace net/ipv6/netfilter/ip6_tables.c:1160 [inline]
 do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1686
 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
 nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
 ipv6_setsockopt+0x10b/0x130 net/ipv6/ipv6_sockglue.c:927
 udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2976
 SYSC_setsockopt net/socket.c:1850 [inline]
 SyS_setsockopt+0x189/0x360 net/socket.c:1829
 do_syscall_64+0x282/0x940 arch/x86/entry/common.c:287

Fixes: 0902b469bd25 ("netfilter: xtables: idletimer target implementation")
Signed-off-by: Eric Dumazet <eduma...@google.com>
Reported-by: syzkaller <syzkal...@googlegroups.com>
---
 net/netfilter/xt_IDLETIMER.c |9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/xt_IDLETIMER.c b/net/netfilter/xt_IDLETIMER.c
index 6c2482b709b1eca341926a8393f45dfead358561..1ac6600bfafd60b6b4d5aaf88a41fb57c6ec195b 100644
--- a/net/netfilter/xt_IDLETIMER.c
+++ b/net/netfilter/xt_IDLETIMER.c
@@ -146,11 +146,11 @@ static int idletimer_tg_create(struct idletimer_tg_info *info)
 	timer_setup(&info->timer->timer, idletimer_tg_expired, 0);
 	info->timer->refcnt = 1;
 
+	INIT_WORK(&info->timer->work, idletimer_tg_work);
+
 	mod_timer(&info->timer->timer,
 		  msecs_to_jiffies(info->

Re: [PATCH net v2] netfilter: nat: cope with negative port range

2018-02-14 Thread Eric Dumazet
On Wed, 2018-02-14 at 13:30 +0100, Florian Westphal wrote:
> Eric Dumazet <eric.duma...@gmail.com> wrote:
> > On Wed, 2018-02-14 at 12:13 +0100, Paolo Abeni wrote:
> > > syzbot reported a division by 0 bug in the netfilter nat code:
> > > Adding the relevant check at parse time could break existing
> > > setup, moreover we would need to read/write such values atomically
> > > to avoid possible transient negative ranges at update time.
> > 
> > I do not quite follow why it is so hard to add a check at parse time.
> > 
> > Breaking buggy setups would not be a concern I think.
> 
> It would be possible for xtables but afaics in nft_nat.c case
> (nft_nat_eval) range.{min,max}_proto.all values are loaded from nft
> registers at runtime.

I prefer this explanation much more; I suggest we update the changelog
to explain the real reason.

Thanks Florian.



Re: [Patch net v2] xt_RATEEST: acquire xt_rateest_mutex for hash insert

2018-02-05 Thread Eric Dumazet
On Mon, 2018-02-05 at 14:41 -0800, Cong Wang wrote:
> rateest_hash is supposed to be protected by xt_rateest_mutex,
> and, as suggested by Eric, lookup and insert should be atomic,
> so we should acquire the xt_rateest_mutex once for both.
> 
> So introduce a non-locking helper for internal use and keep the
> locking one for external.
> 
> Reported-by: <syzbot+5cb189720978275e4...@syzkaller.appspotmail.com>
> Fixes: 5859034d7eb8 ("[NETFILTER]: x_tables: add RATEEST target")
> Cc: Pablo Neira Ayuso <pa...@netfilter.org>
> Cc: Eric Dumazet <eric.duma...@gmail.com>
> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
> ---

Reviewed-by: Eric Dumazet <eduma...@google.com>

Thanks !



Re: [Patch net] xt_RATEEST: acquire xt_rateest_mutex for hash insert

2018-01-31 Thread Eric Dumazet
On Wed, 2018-01-31 at 16:26 -0800, Cong Wang wrote:
> rateest_hash is supposed to be protected by xt_rateest_mutex.
> 
> Reported-by: 
> Fixes: 5859034d7eb8 ("[NETFILTER]: x_tables: add RATEEST target")
> Cc: Pablo Neira Ayuso 
> Signed-off-by: Cong Wang 
> ---
>  net/netfilter/xt_RATEEST.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/netfilter/xt_RATEEST.c b/net/netfilter/xt_RATEEST.c
> index 498b54fd04d7..83ec3a282755 100644
> --- a/net/netfilter/xt_RATEEST.c
> +++ b/net/netfilter/xt_RATEEST.c
> @@ -36,7 +36,9 @@ static void xt_rateest_hash_insert(struct xt_rateest *est)
>  	unsigned int h;
>  
>  	h = xt_rateest_hash(est->name);
> +	mutex_lock(&xt_rateest_mutex);
>  	hlist_add_head(&est->list, &rateest_hash[h]);
> +	mutex_unlock(&xt_rateest_mutex);
>  }

We probably should make this module netns aware, otherwise bad things
will happen.

(It seems multiple threads could run, so taking the mutex twice
(xt_rateest_lookup() then xt_rateest_hash_insert()) is racy.)




Re: [patch 1/1] net/netfilter/x_tables.c: make allocation less aggressive

2018-01-30 Thread Eric Dumazet
On Tue, 2018-01-30 at 11:30 -0800, a...@linux-foundation.org wrote:
> From: Michal Hocko 
> Subject: net/netfilter/x_tables.c: make allocation less aggressive
> 
> syzbot has noticed that xt_alloc_table_info can allocate a lot of memory. 
> This is an admin only interface but an admin in a namespace is sufficient
> as well.  eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in
> xt_alloc_table_info()") has changed the opencoded kmalloc->vmalloc
> fallback into kvmalloc.  It has dropped __GFP_NORETRY on the way because
> vmalloc has simply never fully supported __GFP_NORETRY semantic.  This is
> still the case because e.g.  page tables backing the vmalloc area are
> hardcoded GFP_KERNEL.
> 
> Revert back to __GFP_NORETRY as a poors man defence against excessively
> large allocation request here.  We will not rule out the OOM killer
> completely but __GFP_NORETRY should at least stop the large request in
> most cases.
> 
> [a...@linux-foundation.org: coding-style fixes]
> Fixes: eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()")
> Link: http://lkml.kernel.org/r/20180130140104.ge21...@dhcp22.suse.cz
> Signed-off-by: Michal Hocko 
> Acked-by: Florian Westphal 
> Reviewed-by: Andrew Morton 
> Cc: David S. Miller 
> Signed-off-by: Andrew Morton 
> ---
> 
>  net/netfilter/x_tables.c |7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff -puN 
> net/netfilter/x_tables.c~net-netfilter-x_tablesc-make-allocation-less-aggressive
>  net/netfilter/x_tables.c
> --- 
> a/net/netfilter/x_tables.c~net-netfilter-x_tablesc-make-allocation-less-aggressive
> +++ a/net/netfilter/x_tables.c
> @@ -1008,7 +1008,12 @@ struct xt_table_info *xt_alloc_table_inf
>   if ((size >> PAGE_SHIFT) + 2 > totalram_pages)
>   return NULL;
>  
> - info = kvmalloc(sz, GFP_KERNEL);
> + /* __GFP_NORETRY is not fully supported by kvmalloc but it should
> +  * work reasonably well if sz is too large and bail out rather
> +  * than shoot all processes down before realizing there is nothing
> +  * more to reclaim.
> +  */
> + info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
>   if (!info)
>   return NULL;


How is __GFP_NORETRY working exactly ?

Surely, if some firewall tools attempt to load a new iptables rules, we
do not want to abort them if the request can be satisfied after few
pages moved on swap or written back to disk.

We want to avoid huge allocations, but leave reasonable ones succeed.

Thanks.



[PATCH net] netfilter: xt_recent: do not accept / in table name

2018-01-28 Thread Eric Dumazet
From: Eric Dumazet <eduma...@google.com>

proc_create_data() will issue a WARN() otherwise, let's avoid that.

name 'syz/\xF5'
WARNING: CPU: 1 PID: 3688 at fs/proc/generic.c:163 __xlate_proc_name+0xe6/0x110 fs/proc/generic.c:163
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 3688 Comm: syzkaller153061 Not tainted 4.15.0-rc9+ #283
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 panic+0x1e4/0x41c kernel/panic.c:183
 __warn+0x1dc/0x200 kernel/panic.c:547
 report_bug+0x211/0x2d0 lib/bug.c:184
 fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
 fixup_bug arch/x86/kernel/traps.c:247 [inline]
 do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
 invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1096
RIP: 0010:__xlate_proc_name+0xe6/0x110 fs/proc/generic.c:163
RSP: 0018:8801d913ec18 EFLAGS: 00010282
RAX: dc08 RBX: 0003 RCX: 8159ebae
RDX:  RSI: 11003b227d3e RDI: 8801d913e920
RBP: 8801d913ec48 R08: 11003b227d00 R09: 
R10: 8801d913eb10 R11:  R12: 8801d92b4550
R13:  R14: 8801d913ede0 R15: 8801d92b4550
 xlate_proc_name fs/proc/generic.c:179 [inline]
 __proc_create+0xcc/0x880 fs/proc/generic.c:349
 proc_create_data+0x76/0x180 fs/proc/generic.c:488
 recent_mt_check.isra.8+0xb1b/0xe70 net/netfilter/xt_recent.c:412
 recent_mt_check_v0+0xd7/0x150 net/netfilter/xt_recent.c:440
 xt_check_match+0x231/0x7d0 net/netfilter/x_tables.c:465
 check_match net/ipv4/netfilter/ip_tables.c:479 [inline]
 find_check_match net/ipv4/netfilter/ip_tables.c:495 [inline]
 find_check_entry.isra.8+0x3fc/0xcb0 net/ipv4/netfilter/ip_tables.c:544
 translate_table+0xed1/0x1610 net/ipv4/netfilter/ip_tables.c:730
 do_replace net/ipv4/netfilter/ip_tables.c:1148 [inline]
 do_ipt_set_ctl+0x370/0x5f0 net/ipv4/netfilter/ip_tables.c:1682
 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
 nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
 ip_setsockopt+0xa1/0xb0 net/ipv4/ip_sockglue.c:1256
 sctp_setsockopt+0x2a0/0x5de0 net/sctp/socket.c:4074
 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2968
 SYSC_setsockopt net/socket.c:1831 [inline]
 SyS_setsockopt+0x189/0x360 net/socket.c:1810
 entry_SYSCALL_64_fastpath+0x29/0xa0

Signed-off-by: Eric Dumazet <eduma...@google.com>
Reported-by: syzbot <syzkal...@googlegroups.com>
---
 net/netfilter/xt_recent.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/xt_recent.c b/net/netfilter/xt_recent.c
index 
245fa350a7a85390e6767c4a0c5862f4213000fe..724f7cf072c1c81a912d007f6f89ea542a42eb0e
 100644
--- a/net/netfilter/xt_recent.c
+++ b/net/netfilter/xt_recent.c
@@ -362,7 +362,8 @@ static int recent_mt_check(const struct xt_mtchk_param *par,
    return -EINVAL;
    }
    if (info->name[0] == '\0' ||
-   strnlen(info->name, XT_RECENT_NAME_LEN) == XT_RECENT_NAME_LEN)
+   strnlen(info->name, XT_RECENT_NAME_LEN) == XT_RECENT_NAME_LEN ||
+   strchr(info->name, '/'))
    return -EINVAL;
 
    if (ip_pkt_list_tot && info->hit_count < ip_pkt_list_tot)
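[Editor's note] A userspace model can illustrate what the strengthened check rejects. `recent_name_ok` is a hypothetical helper, not a kernel function; `XT_RECENT_NAME_LEN` is 200 in the uapi header:

```c
#include <stdbool.h>
#include <string.h>

#define XT_RECENT_NAME_LEN 200  /* value from include/uapi/linux/netfilter/xt_recent.h */

/* Hypothetical userspace model of the check in recent_mt_check():
 * reject empty names, names with no NUL inside the buffer, and names
 * containing '/', which would otherwise create nested /proc entries
 * and trigger the WARN() in __xlate_proc_name().  Short-circuit
 * evaluation matters: strchr() only runs once strnlen() has proved
 * the buffer is NUL-terminated. */
static bool recent_name_ok(const char name[XT_RECENT_NAME_LEN])
{
    return name[0] != '\0' &&
           strnlen(name, XT_RECENT_NAME_LEN) < XT_RECENT_NAME_LEN &&
           strchr(name, '/') == NULL;
}
```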



[PATCH net] netfilter: xt_hashlimit: do not allow empty names

2018-01-28 Thread Eric Dumazet
From: Eric Dumazet <eduma...@google.com>

Syzbot reported a WARN() in proc_create_data() [1]

Issue here is that xt_hashlimit does not check that user space provided
an empty table name.

[1]
name len 0
WARNING: CPU: 0 PID: 3680 at fs/proc/generic.c:354 __proc_create+0x696/0x880 
fs/proc/generic.c:354
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 3680 Comm: syzkaller464755 Not tainted 4.15.0-rc9+ #283
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 panic+0x1e4/0x41c kernel/panic.c:183
 __warn+0x1dc/0x200 kernel/panic.c:547
 report_bug+0x211/0x2d0 lib/bug.c:184
 fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
 fixup_bug arch/x86/kernel/traps.c:247 [inline]
 do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
 invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1096
RIP: 0010:__proc_create+0x696/0x880 fs/proc/generic.c:354
RSP: 0018:8801d9607410 EFLAGS: 00010286
RAX: dc08 RBX: 11003b2c0e87 RCX: 8159ebae
RDX:  RSI: 11003b284970 RDI: 0293
RBP: 8801d9607580 R08: 11003b2c0e15 R09: 
R10: 8801d96072c8 R11:  R12: 8801d981ef28
R13: 8801d9607558 R14:  R15: 8801d9607518
 proc_create_data+0x76/0x180 fs/proc/generic.c:488
 htable_create net/netfilter/xt_hashlimit.c:333 [inline]
 hashlimit_mt_check_common.isra.9+0xaee/0x1420 net/netfilter/xt_hashlimit.c:900
 hashlimit_mt_check_v1+0x48d/0x640 net/netfilter/xt_hashlimit.c:926
 xt_check_match+0x231/0x7d0 net/netfilter/x_tables.c:465
 check_match net/ipv4/netfilter/ip_tables.c:479 [inline]
 find_check_match net/ipv4/netfilter/ip_tables.c:495 [inline]
 find_check_entry.isra.8+0x3fc/0xcb0 net/ipv4/netfilter/ip_tables.c:544
 translate_table+0xed1/0x1610 net/ipv4/netfilter/ip_tables.c:730
 do_replace net/ipv4/netfilter/ip_tables.c:1148 [inline]
 do_ipt_set_ctl+0x370/0x5f0 net/ipv4/netfilter/ip_tables.c:1682
 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
 nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
 ip_setsockopt+0xa1/0xb0 net/ipv4/ip_sockglue.c:1256
 tcp_setsockopt+0x82/0xd0 net/ipv4/tcp.c:2875
 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2968
 SYSC_setsockopt net/socket.c:1831 [inline]
 SyS_setsockopt+0x189/0x360 net/socket.c:1810
 entry_SYSCALL_64_fastpath+0x29/0xa0

Signed-off-by: Eric Dumazet <eduma...@google.com>
Reported-by: syzbot <syzkal...@googlegroups.com>
---
 net/netfilter/xt_hashlimit.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 
5da8746f7b88ff4c9446f256e542e823a6a561b0..eae732e013df92a364b500645360d4606c283a75
 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -894,6 +894,8 @@ static int hashlimit_mt_check_common(const struct 
xt_mtchk_param *par,
return -ERANGE;
}
 
+   if (!name[0])
+   return -EINVAL;
	mutex_lock(&hashlimit_mutex);
*hinfo = htable_find_get(net, name, par->family);
if (*hinfo == NULL) {



[PATCH net] netfilter: x_tables: avoid out-of-bounds reads in xt_request_find_match()

2018-01-24 Thread Eric Dumazet
From: Eric Dumazet <eduma...@google.com>

It looks like syzbot found its way into netfilter territory.

Issue here is that @name comes from user space and might
not be null terminated.

Out-of-bound reads happen, KASAN is not happy.

Signed-off-by: Eric Dumazet <eduma...@google.com>
Reported-by: syzbot <syzkal...@googlegroups.com>
---
No Fixes: tag, bug seems to be a day-0 one.

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 
55802e97f906d1987ed78b4296584deb38e5f876..8516dc459b539342f44d2b2b3e21b140677c7826
 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -210,6 +210,9 @@ xt_request_find_match(uint8_t nfproto, const char *name, 
uint8_t revision)
 {
struct xt_match *match;
 
+   if (strnlen(name, XT_EXTENSION_MAXNAMELEN) == XT_EXTENSION_MAXNAMELEN)
+   return ERR_PTR(-EINVAL);
+
match = xt_find_match(nfproto, name, revision);
if (IS_ERR(match)) {
request_module("%st_%s", xt_prefix[nfproto], name);
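[Editor's note] The guard requires a NUL within `XT_EXTENSION_MAXNAMELEN` bytes before the name is used in string comparisons and `request_module()`. A hedged userspace sketch of the idea (`extension_name_terminated` is a hypothetical helper):

```c
#include <stdbool.h>
#include <string.h>

#define XT_EXTENSION_MAXNAMELEN 29  /* from include/uapi/linux/netfilter/x_tables.h */

/* Hypothetical model of the added guard: a name copied from userspace
 * is only safe to treat as a C string if a terminating NUL exists
 * within the fixed-size buffer.  strnlen() returning the buffer size
 * means no NUL was found, so later strcmp()/printf("%s") would read
 * past the end of the buffer -- the out-of-bounds read KASAN flagged. */
static bool extension_name_terminated(const char *name)
{
    return strnlen(name, XT_EXTENSION_MAXNAMELEN) < XT_EXTENSION_MAXNAMELEN;
}
```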



Re: [PATCH net-next v2] net: move decnet to staging

2017-11-13 Thread Eric Dumazet
On Mon, 2017-11-13 at 11:32 -0800, Joe Perches wrote:
> On Mon, 2017-11-13 at 09:11 -0800, Stephen Hemminger wrote:
> > Support for Decnet has been orphaned for some time.
> > In the interest of reducing the potential bug surface and pre-holiday
> > cleaning, move the decnet protocol into staging for eventual removal.
> []
> > diff --git a/drivers/staging/decnet/TODO b/drivers/staging/decnet/TODO
> []
> > @@ -0,0 +1,4 @@
> > +The DecNet code will be removed soon from the kernel tree as it is old,
> > +obsolete, and buggy.
> 
> Old and obsolete, well OK, but
> what's buggy about decnet?
> 
> https://bugzilla.kernel.org/buglist.cgi?quicksearch=decnet
> 
> Zarro Boogs found.
> 

Then that means nobody uses it.

And that syzkaller guys never bothered to add code to actually trigger
the bugs that are probably there. Probably they have bigger fish to
fry at the moment.

If we leave the code there, chances are high that some hacker is
interested into exploiting the bugs.




Re: [PATCH v3 nf-next 1/2] netfilter: x_tables: wait until old table isn't used anymore

2017-10-11 Thread Eric Dumazet
On Wed, Oct 11, 2017 at 11:18 AM, Florian Westphal <f...@strlen.de> wrote:
> Eric Dumazet <eduma...@google.com> wrote:
>> On Wed, Oct 11, 2017 at 10:48 AM, Florian Westphal <f...@strlen.de> wrote:
>> > Eric Dumazet <eduma...@google.com> wrote:
>> >> On Wed, Oct 11, 2017 at 7:26 AM, Florian Westphal <f...@strlen.de> wrote:
>> >> > xt_replace_table relies on table replacement counter retrieval (which
>> >> > uses xt_recseq to synchronize pcpu counters).
>> >> >
>> >> > This is fine, however with large rule set get_counters() can take
>> >> > a very long time -- it needs to synchronize all counters because
>> >> > it has to assume concurrent modifications can occur.
>> >> >
>> >> > Make xt_replace_table synchronize by itself by waiting until all cpus
>> >> > had an even seqcount.
>> >> >
>> >> > This allows a followup patch to copy the counters of the old ruleset
>> >> > without any synchronization after xt_replace_table has completed.
>> >> >
>> >> > Cc: Dan Williams <d...@redhat.com>
>> >> > Cc: Eric Dumazet <eduma...@google.com>
>> >> > Signed-off-by: Florian Westphal <f...@strlen.de>
>> >> > ---
>> >> >  v3: check for 'seq is uneven' OR 'has changed' since
>> >> >  last check. Its fine if seq is uneven iff its a different
>> >> >  sequence number than the initial one.
>> >> >
>> >> >  v2: fix Erics email address
>> >> >  net/netfilter/x_tables.c | 18 +++---
>> >> >  1 file changed, 15 insertions(+), 3 deletions(-)
>> >> >
>> >> >
>> >>
>> >> Reviewed-by: Eric Dumazet <eduma...@google.com>
>> >>
>> >> But it seems we need an extra smp_wmb() after
>> >>
>> >> smp_wmb();
>> >> table->private = newinfo;
>> >> +  smp_wmb();
>> >>
>> >> Otherwise we have no guarantee other cpus actually see the new ->private 
>> >> value.
>> >
>> > Seems to be unrelated to this change, so I will submit
>> > a separate patch for nf.git that adds this.
>>
>> This is related to this change, please read the comment before the
>> local_bh_enable()
>>
>> /*
>>  * Even though table entries have now been swapped, other CPU's
>>  * may still be using the old entries. This is okay, because
>>  * resynchronization happens because of the locking done
>>  * during the get_counters() routine.
>>  */
>
> Hmm, but get_counters() does not issue a wmb, and the 'new' code added
> here essentially is the same as get_counters(), except that we only
> read seqcount until we saw a change (and not for each counter in
> the rule set).
>
> What am I missing?

Your sync code does nothing useful if we are not sure the new
table->private value is visible

Without barriers, the compiler/CPU could do this:


+   /* ... so wait for even xt_recseq on all cpus */
+   for_each_possible_cpu(cpu) {
+   seqcount_t *s = &per_cpu(xt_recseq, cpu);
+   u32 seq = raw_read_seqcount(s);
+
+   if (seq & 1) {
+   do {
+   cond_resched();
+   cpu_relax();
+   } while (seq == raw_read_seqcount(s));
+   }
+   }

/* finally, write new private value */
table->private = newinfo;

Basically, your loop is now useless and you could remove it. So there
is definitely a bug.
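[Editor's note] The ordering problem Eric describes can be sketched with C11 atomics in userspace. This is an illustrative sketch of the publish pattern, not the kernel code; the kernel uses smp_wmb()/store-release on table->private instead:

```c
#include <stdatomic.h>
#include <stddef.h>

struct table_info { int rules; };

static _Atomic(struct table_info *) table_private;

/* Illustrative publish/read pair.  A release store guarantees that
 * everything before it (initializing the new table, waiting for the
 * even seqcounts) cannot be reordered past the store, so a reader
 * that observes the new pointer also observes the fully initialized
 * table.  Without the barrier, the store could drift relative to the
 * wait loop, making the loop useless -- the bug discussed above. */
static void publish_table(struct table_info *newinfo)
{
    atomic_store_explicit(&table_private, newinfo, memory_order_release);
}

static struct table_info *read_table(void)
{
    return atomic_load_explicit(&table_private, memory_order_acquire);
}
```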


Re: [PATCH nf] netfilter: x_tables: ensure readers see new ->private value

2017-10-11 Thread Eric Dumazet
On Wed, Oct 11, 2017 at 11:03 AM, Florian Westphal <f...@strlen.de> wrote:
> Eric Dumazet wrote:
>  But it seems we need an extra smp_wmb() after
>   smp_wmb();
>   table->private = newinfo;
>
> Otherwise we have no guarantee other cpus actually see the new
> ->private value.
>
> Suggested-by: Eric Dumazet <eduma...@google.com>
> Signed-off-by: Florian Westphal <f...@strlen.de>

I do not believe this change is needed in net (or nf) tree.
See my other reply.


Re: [PATCH v3 nf-next 1/2] netfilter: x_tables: wait until old table isn't used anymore

2017-10-11 Thread Eric Dumazet
On Wed, Oct 11, 2017 at 10:48 AM, Florian Westphal <f...@strlen.de> wrote:
> Eric Dumazet <eduma...@google.com> wrote:
>> On Wed, Oct 11, 2017 at 7:26 AM, Florian Westphal <f...@strlen.de> wrote:
>> > xt_replace_table relies on table replacement counter retrieval (which
>> > uses xt_recseq to synchronize pcpu counters).
>> >
>> > This is fine, however with large rule set get_counters() can take
>> > a very long time -- it needs to synchronize all counters because
>> > it has to assume concurrent modifications can occur.
>> >
>> > Make xt_replace_table synchronize by itself by waiting until all cpus
>> > had an even seqcount.
>> >
>> > This allows a followup patch to copy the counters of the old ruleset
>> > without any synchonization after xt_replace_table has completed.
>> > [sic: synchronization]
>> >
>> > Cc: Dan Williams <d...@redhat.com>
>> > Cc: Eric Dumazet <eduma...@google.com>
>> > Signed-off-by: Florian Westphal <f...@strlen.de>
>> > ---
>> >  v3: check for 'seq is uneven' OR 'has changed' since
>> >  last check. It's fine if seq is uneven iff it's a different
>> >  sequence number than the initial one.
>> >
>> >  v2: fix Erics email address
>> >  net/netfilter/x_tables.c | 18 +++---
>> >  1 file changed, 15 insertions(+), 3 deletions(-)
>> >
>> >
>>
>> Reviewed-by: Eric Dumazet <eduma...@google.com>
>>
>> But it seems we need an extra smp_wmb() after
>>
>> smp_wmb();
>> table->private = newinfo;
>> +  smp_wmb();
>>
>> Otherwise we have no guarantee other cpus actually see the new ->private 
>> value.
>
> Seems to be unrelated to this change, so I will submit
> a separate patch for nf.git that adds this.

This is related to this change, please read the comment before the
local_bh_enable()

/*
 * Even though table entries have now been swapped, other CPU's
 * may still be using the old entries. This is okay, because
 * resynchronization happens because of the locking done
 * during the get_counters() routine.
 */


Since your new code happens right after table->private  = newinfo ;
the smp_wmb() is required.


Re: [PATCH v3 nf-next 1/2] netfilter: x_tables: wait until old table isn't used anymore

2017-10-11 Thread Eric Dumazet
On Wed, Oct 11, 2017 at 7:26 AM, Florian Westphal <f...@strlen.de> wrote:
> xt_replace_table relies on table replacement counter retrieval (which
> uses xt_recseq to synchronize pcpu counters).
>
> This is fine, however with large rule set get_counters() can take
> a very long time -- it needs to synchronize all counters because
> it has to assume concurrent modifications can occur.
>
> Make xt_replace_table synchronize by itself by waiting until all cpus
> had an even seqcount.
>
> This allows a followup patch to copy the counters of the old ruleset
> without any synchronization after xt_replace_table has completed.
>
> Cc: Dan Williams <d...@redhat.com>
> Cc: Eric Dumazet <eduma...@google.com>
> Signed-off-by: Florian Westphal <f...@strlen.de>
> ---
>  v3: check for 'seq is uneven' OR 'has changed' since
>  last check. It's fine if seq is uneven iff it's a different
>  sequence number than the initial one.
>
>  v2: fix Erics email address
>  net/netfilter/x_tables.c | 18 +++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
>
>

Reviewed-by: Eric Dumazet <eduma...@google.com>

But it seems we need an extra smp_wmb() after

smp_wmb();
table->private = newinfo;
+  smp_wmb();

Otherwise we have no guarantee other cpus actually see the new ->private value.


[PATCH net] netfilter: x_tables: avoid stack-out-of-bounds read in xt_copy_counters_from_user

2017-10-05 Thread Eric Dumazet
From: Eric Dumazet <eduma...@google.com>

syzkaller reports an out of bound read in strlcpy(), triggered
by xt_copy_counters_from_user()

Fix this by using memcpy(), then forcing a zero byte at the last position
of the destination, as Florian did for the non COMPAT code.

Fixes: d7591f0c41ce ("netfilter: x_tables: introduce and use 
xt_copy_counters_from_user")
Signed-off-by: Eric Dumazet <eduma...@google.com>
Cc: Willem de Bruijn <will...@google.com>
---
 net/netfilter/x_tables.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 
c83a3b5e1c6c2a91b713b6681a794bd79ab3fa08..d8571f4142080a3c121fc90f0b52d81ee9df6712
 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -892,7 +892,7 @@ void *xt_copy_counters_from_user(const void __user *user, 
unsigned int len,
	if (copy_from_user(&compat_tmp, user, sizeof(compat_tmp)) != 0)
return ERR_PTR(-EFAULT);
 
-   strlcpy(info->name, compat_tmp.name, sizeof(info->name));
+   memcpy(info->name, compat_tmp.name, sizeof(info->name) - 1);
info->num_counters = compat_tmp.num_counters;
user += sizeof(compat_tmp);
} else
@@ -905,9 +905,9 @@ void *xt_copy_counters_from_user(const void __user *user, 
unsigned int len,
if (copy_from_user(info, user, sizeof(*info)) != 0)
return ERR_PTR(-EFAULT);
 
-   info->name[sizeof(info->name) - 1] = '\0';
user += sizeof(*info);
}
+   info->name[sizeof(info->name) - 1] = '\0';
 
size = sizeof(struct xt_counters);
size *= info->num_counters;
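[Editor's note] The reason for the change: strlcpy() calls strlen() on the source, so a userspace-supplied, possibly unterminated buffer triggers an out-of-bounds read. A userspace sketch of the safe pattern, with hypothetical names (`NAME_LEN` stands in for `sizeof(info->name)`):

```c
#include <string.h>

#define NAME_LEN 32  /* stand-in for sizeof(info->name) */

/* Hypothetical helper: copy a fixed-size, possibly-unterminated
 * source buffer into dst and force NUL termination afterwards.
 * Unlike strlcpy(), this never reads beyond NAME_LEN - 1 bytes
 * of src, so an unterminated source is harmless. */
static void copy_name_bounded(char dst[NAME_LEN], const char src[NAME_LEN])
{
    memcpy(dst, src, NAME_LEN - 1);
    dst[NAME_LEN - 1] = '\0';
}
```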




Re: [PATCH v2] netfilter: xt_socket: Restore mark from full sockets only

2017-09-21 Thread Eric Dumazet
On Thu, 2017-09-21 at 16:08 -0600, Subash Abhinov Kasiviswanathan wrote:
> An out of bounds error was detected on an ARM64 target with
> Android based kernel 4.9. This occurs while trying to
> restore mark on a skb from an inet request socket.
> 
> BUG: KASAN: slab-out-of-bounds in socket_match.isra.2+0xc8/0x1f0 
> net/netfilter/xt_socket.c:248
> Read of size 4 at addr ffc06a8d824c by task syz-fuzzer/1532
> CPU: 7 PID: 1532 Comm: syz-fuzzer Tainted: GW  O4.9.41+ #1
> Call trace:

> 
> v1->v2: Change socket_mt6_v1_v2_v3() as well as mentioned by Eric
> 
> Fixes: a9407388 ("netfilter: xt_socket: prepare for TCP_NEW_SYN_RECV 
> support")

When my commit was written (Mon Mar 16 21:06:17 2015 -0700), this
XT_SOCKET_RESTORESKMARK code was not there yet.


The bug was added in commit 01555e74bde5 ("netfilter: xt_socket: add
XT_SOCKET_RESTORESKMARK flag")  which came later (Mon Jun 15 18:40:43
2015 -0600)

Please put a correct Fixes: tag, thanks.


> Signed-off-by: Subash Abhinov Kasiviswanathan 
> ---
>  net/netfilter/xt_socket.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
> index e75ef39..575d215 100644
> --- a/net/netfilter/xt_socket.c
> +++ b/net/netfilter/xt_socket.c
> @@ -76,7 +76,7 @@
>   transparent = nf_sk_is_transparent(sk);
>  
>   if (info->flags & XT_SOCKET_RESTORESKMARK && !wildcard &&
> - transparent)
> + transparent && sk_fullsock(sk))
>   pskb->mark = sk->sk_mark;
>  
>   if (sk != skb->sk)
> @@ -133,7 +133,7 @@
>   transparent = nf_sk_is_transparent(sk);
>  
>   if (info->flags & XT_SOCKET_RESTORESKMARK && !wildcard &&
> - transparent)
> + transparent && sk_fullsock(sk))
>   pskb->mark = sk->sk_mark;
>  
>   if (sk != skb->sk)




Re: [PATCH] netfilter: xt_socket: Restore mark from full sockets only

2017-09-21 Thread Eric Dumazet
On Thu, 2017-09-21 at 15:20 -0600, Subash Abhinov Kasiviswanathan wrote:
> An out of bounds error was detected on an ARM64 target with
> Android based kernel 4.9. This occurs while trying to
> restore mark on a skb from an inet request socket.
> 
> BUG: KASAN: slab-out-of-bounds in socket_match.isra.2+0xc8/0x1f0 
> net/netfilter/xt_socket.c:248

...

> 
> Fixes: a9407388 ("netfilter: xt_socket: prepare for TCP_NEW_SYN_RECV 
> support")
> Signed-off-by: Subash Abhinov Kasiviswanathan 
> ---
>  net/netfilter/xt_socket.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
> index e75ef39..3301091 100644
> --- a/net/netfilter/xt_socket.c
> +++ b/net/netfilter/xt_socket.c
> @@ -76,7 +76,7 @@
>   transparent = nf_sk_is_transparent(sk);
>  
>   if (info->flags & XT_SOCKET_RESTORESKMARK && !wildcard &&
> - transparent)
> + transparent && sk_fullsock(sk))
>   pskb->mark = sk->sk_mark;
>  
>   if (sk != skb->sk)

What about socket_mt6_v1_v2_v3() ?

Thanks.




Re: [PATCH nf] netfilter: xtables: add scheduling opportunity in get_counters

2017-09-01 Thread Eric Dumazet
On Fri, Sep 1, 2017 at 1:41 PM, Florian Westphal <f...@strlen.de> wrote:
> There are reports about spurious softlockups during iptables-restore, a
> backtrace i saw points at get_counters -- it uses a sequence lock and also
> has unbounded restart loop.
>
> Signed-off-by: Florian Westphal <f...@strlen.de>
> ---
>  net/ipv4/netfilter/arp_tables.c | 1 +
>  net/ipv4/netfilter/ip_tables.c  | 1 +
>  net/ipv6/netfilter/ip6_tables.c | 1 +
>  3 files changed, 3 insertions(+)

SGTM, thanks Florian !

Acked-by: Eric Dumazet <eduma...@google.com>


Re: [PATCH nf-next 1/3] netfilter: convert hook list to an array

2017-08-23 Thread Eric Dumazet
On Wed, 2017-08-23 at 17:26 +0200, Florian Westphal wrote:
> From: Aaron Conole 

...

> -static struct nf_hook_entry __rcu **nf_hook_entry_head(struct net *net, 
> const struct nf_hook_ops *reg)
> +static struct nf_hook_entries *allocate_hook_entries_size(u16 num)
> +{
> + struct nf_hook_entries *e;
> + size_t alloc = sizeof(*e) +
> +sizeof(struct nf_hook_entry) * num +
> +sizeof(struct nf_hook_ops *) * num;
> +
> + if (num == 0)
> + return NULL;
> +
> + e = kvmalloc(alloc, GFP_KERNEL);
> + if (e) {
> + memset(e, 0, alloc);
> + e->num_hook_entries = num;
> + }


nit:

e = kvzalloc(alloc, GFP_KERNEL);
if (e)
e->num_hook_entries = num;
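[Editor's note] The allocation being discussed packs the header and two trailing arrays into one zeroed block. A userspace sketch of the same size computation and the kvzalloc-style pattern Eric suggests (calloc stands in for kvzalloc; struct fields are simplified):

```c
#include <stdlib.h>

struct nf_hook_entry { void *hook; void *priv; };

struct nf_hook_entries {
    unsigned short num_hook_entries;
    /* num entries, followed in the same block by num ops pointers */
    struct nf_hook_entry hooks[];
};

/* Sketch of allocate_hook_entries_size(): one zeroed allocation large
 * enough for the header plus both trailing arrays.  Using a zeroing
 * allocator directly (kvzalloc in the kernel, calloc here) replaces
 * the kvmalloc + memset pair from the original patch. */
static struct nf_hook_entries *alloc_hook_entries(unsigned short num)
{
    size_t alloc = sizeof(struct nf_hook_entries) +
                   sizeof(struct nf_hook_entry) * num +
                   sizeof(void *) * num;
    struct nf_hook_entries *e;

    if (num == 0)
        return NULL;

    e = calloc(1, alloc);
    if (e)
        e->num_hook_entries = num;
    return e;
}
```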




Re: [PATCH nf-next 4/4] netfilter: rt: add support to fetch path mss

2017-08-08 Thread Eric Dumazet
On Tue, 2017-08-08 at 15:15 +0200, Florian Westphal wrote:
> to be used in combination with tcp option set support to mimic
> iptables TCPMSS --clamp-mss-to-pmtu.
> 
> Signed-off-by: Florian Westphal 
> ---
>  include/uapi/linux/netfilter/nf_tables.h |  2 +
>  net/netfilter/nft_rt.c   | 65 
> 
>  2 files changed, 67 insertions(+)
> 
> diff --git a/include/uapi/linux/netfilter/nf_tables.h 
> b/include/uapi/linux/netfilter/nf_tables.h
> index 40fd199f7531..b49da72efa68 100644
> --- a/include/uapi/linux/netfilter/nf_tables.h
> +++ b/include/uapi/linux/netfilter/nf_tables.h
> @@ -811,11 +811,13 @@ enum nft_meta_keys {
>   * @NFT_RT_CLASSID: realm value of packet's route (skb->dst->tclassid)
>   * @NFT_RT_NEXTHOP4: routing nexthop for IPv4
>   * @NFT_RT_NEXTHOP6: routing nexthop for IPv6
> + * @NFT_RT_TCPMSS: fetch current path tcp mss
>   */
>  enum nft_rt_keys {
>   NFT_RT_CLASSID,
>   NFT_RT_NEXTHOP4,
>   NFT_RT_NEXTHOP6,
> + NFT_RT_TCPMSS,
>  };
>  
>  /**
> diff --git a/net/netfilter/nft_rt.c b/net/netfilter/nft_rt.c
> index c7383d8f88d0..69ed601d6fc6 100644
> --- a/net/netfilter/nft_rt.c
> +++ b/net/netfilter/nft_rt.c
> @@ -23,6 +23,41 @@ struct nft_rt {
>   enum nft_registers  dreg:8;
>  };
>  
> +static u16 get_tcpmss(const struct nft_pktinfo *pkt, const struct dst_entry 
> *skbdst)
> +{
> + u32 minlen = sizeof(struct ipv6hdr), mtu = dst_mtu(skbdst);
> + const struct sk_buff *skb = pkt->skb;
> + const struct nf_afinfo *ai;
> + struct dst_entry *dst;
> + struct flowi fl;
> +
> + memset(&fl, 0, sizeof(fl));
> +
> + switch (nft_pf(pkt)) {
> + case NFPROTO_IPV4:
> + fl.u.ip4.daddr = ip_hdr(skb)->saddr;
> + minlen = sizeof(struct iphdr);
> + break;
> + case NFPROTO_IPV6:
> + fl.u.ip6.daddr = ipv6_hdr(skb)->saddr;
> + break;
> + }
> +
> + ai = nf_get_afinfo(nft_pf(pkt));
> + if (ai)
> + ai->route(nft_net(pkt), &dst, &fl, false);
> +

if ai is NULL,

dst is not initialized and might contain garbage.

> + if (dst) {
> + mtu = min(mtu, dst_mtu(dst));
> + dst_release(dst);
> + }
> +
> + if (mtu <= minlen || mtu > 0xffff)
> + return TCP_MSS_DEFAULT;
> +
> + return mtu - minlen;
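[Editor's note] Eric's point: `dst` is a stack variable that is only written when the lookup actually runs, so the later `if (dst)` may test uninitialized stack memory. A minimal userspace sketch of the hazard and the fix, with hypothetical names:

```c
#include <stddef.h>

struct dst { unsigned int mtu; };

/* Stand-in for a route lookup that may fail before producing a
 * result; on failure *out is deliberately left untouched, exactly
 * like the ai == NULL case in the patch under review. */
static int route_lookup(struct dst **out, int fail)
{
    static struct dst d = { .mtu = 1400 };

    if (fail)
        return -1;
    *out = &d;
    return 0;
}

/* Fixed pattern: initialize the pointer to NULL so the later
 * "if (dst)" test is well-defined even when the lookup never
 * wrote to it. */
static unsigned int path_mtu(unsigned int mtu, int lookup_fails)
{
    struct dst *dst = NULL;  /* the initialization the patch is missing */

    route_lookup(&dst, lookup_fails);
    if (dst && dst->mtu < mtu)
        mtu = dst->mtu;
    return mtu;
}
```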






Re: [PATCH nf-next] netfilter: nft_set_rbtree: use seqcount to avoid lock in most cases

2017-07-26 Thread Eric Dumazet
On Wed, 2017-07-26 at 02:09 +0200, Florian Westphal wrote:
> switch to lockless lookup. write side now also increments sequence
> counter.  On lookup, sample counter value and only take the lock
> if we did not find a match and the counter has changed.
> 
> This avoids need to write to private area in normal (lookup) cases.
> 
> Note that we take the non-blocking variant (raw_seqcount_begin), i.e.
> read side will not wait for writer to finish.
> 
> If we did not find a result we will fall back to use of read-lock.
> 
> The readlock is also used during dumps to ensure we get a consistent
> tree walk.
> 
> Similar technique (rbtree+seqlock) was used by David Howells in rxrpc.


Please note that in commit b145425f269a17ed344d737f746b844dfac60c82
("inetpeer: remove AVL implementation in favor of RB tree")

I chose to also pass the sequence so that the lookup could abort.

I am not sure that during rb tree write operations, some nodes could be
left with some kind of loop.





Re: [PATCH] netfilter: nfnetlink: Improve input length sanitization in nfnetlink_rcv

2017-06-07 Thread Eric Dumazet
On Wed, 2017-06-07 at 14:35 +0200, Mateusz Jurczyk wrote:
> Verify that the length of the socket buffer is sufficient to cover the
> entire nlh->nlmsg_len field before accessing that field for further
> input sanitization. If the client only supplies 1-3 bytes of data in
> sk_buff, then nlh->nlmsg_len remains partially uninitialized and
> contains leftover memory from the corresponding kernel allocation.
> Operating on such data may result in indeterminate evaluation of the
> nlmsg_len < NLMSG_HDRLEN expression.
> 
> The bug was discovered by a runtime instrumentation designed to detect
> use of uninitialized memory in the kernel. The patch prevents this and
> other similar tools (e.g. KMSAN) from flagging this behavior in the future.
> 
> Signed-off-by: Mateusz Jurczyk 
> ---
>  net/netfilter/nfnetlink.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
> index 80f5ecf2c3d7..c634cfca40ec 100644
> --- a/net/netfilter/nfnetlink.c
> +++ b/net/netfilter/nfnetlink.c
> @@ -491,7 +491,8 @@ static void nfnetlink_rcv(struct sk_buff *skb)
>  {
>   struct nlmsghdr *nlh = nlmsg_hdr(skb);
>  
> - if (nlh->nlmsg_len < NLMSG_HDRLEN ||
> + if (skb->len < sizeof(nlh->nlmsg_len) ||

This assumes nlmsg_len is first field of the structure.

offsetofend() might be more descriptive, one does not have to check the
structure to make sure the code is correct.

Or simply use the more common form :

if (skb->len < NLMSG_HDRLEN ||

> + nlh->nlmsg_len < NLMSG_HDRLEN ||
>   skb->len < nlh->nlmsg_len)
>   return;
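[Editor's note] The ordering of the three checks is the whole point: the buffer must be long enough to hold `nlmsg_len` before that field is read, then the claimed header length must be sane, then the claimed length must fit in the buffer. A userspace sketch (hypothetical helper; a fixed constant stands in for NLMSG_HDRLEN):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NLMSG_HDRLEN_ 16  /* aligned sizeof(struct nlmsghdr); the trailing _ marks the stand-in */

/* Sketch of the fixed validation: each test only reads bytes that
 * the previous test proved are present, so nlmsg_len is never read
 * from a 1-3 byte buffer where it would be partly uninitialized. */
static bool nlmsg_ok(const unsigned char *buf, size_t buf_len)
{
    uint32_t nlmsg_len;

    if (buf_len < sizeof(nlmsg_len))   /* can we read the length field at all? */
        return false;
    memcpy(&nlmsg_len, buf, sizeof(nlmsg_len));
    if (nlmsg_len < NLMSG_HDRLEN_)     /* claimed length covers the header? */
        return false;
    return buf_len >= nlmsg_len;       /* claimed length fits the buffer? */
}
```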
>  




Re: [PATCH nf-next] netfilter: tcp: Use TCP_MAX_WSCALE instead of literal 14

2017-04-20 Thread Eric Dumazet
On Thu, 2017-04-20 at 08:44 +0800, Gao Feng wrote:
> > On Wed, Apr 19, 2017 at 09:57:55PM +0200, Pablo Neira Ayuso wrote:
> > > On Wed, Apr 19, 2017 at 09:22:08AM -0700, Eric Dumazet wrote:
> > > > On Wed, 2017-04-19 at 17:58 +0200, Pablo Neira Ayuso wrote:
> > > > > On Wed, Apr 19, 2017 at 09:23:42AM +0800, gfree.w...@foxmail.com
> > wrote:
> > > > > > From: Gao Feng <f...@ikuai8.com>
> > > > > >
> > > > > > The window scale may be enlarged from 14 to 15 according to the
> > > > > > IETF draft
> https://tools.ietf.org/html/draft-nishida-tcpm-maxwin-03.
> > > > > >
> > > > > > Use the macro TCP_MAX_WSCALE to support it easily with TCP stack
> > > > > > in the future.
> > > > >
> > > > > Applied, thanks.
> > > >
> > > > Note that linux kernel is not ready yet for a TCP_MAX_WSCALE being
> > > > changed to 15.
> > > >
> > > > Signed 32bit sk counters can already be abused with 1GB TCP windows,
> > > > for malicious peers sending SACK forcing linux to increase its
> > > > memory usage above 2GB and overflows are pretty bad.
> > >
> > > We have tend to use our own definitions for the TCP connection
> > > tracking so far. This one I checked it refers RFC1323 too.
> > >
> > > If this semantics may change from one way to another in a way that may
> > > break conntracking, please let me know, I can toss it here.
> > 
> > Or I can just amend the commit here to remove the "enlarged from 14 to 15"
> > comment, I was going to push out this now, but I'll wait a bit.
> 
> Thanks Eric & Pablo,
> When the wscale is really enlarged to 15 one day, these Netfilter codes may
> be modified.
> Because it would reset the wscale value to the max value which Netfilter
> support it.
>   if (state->td_scale > 14)
>   state->td_scale = 14;
> It would cause the receive see a less window size than sender announced
> actually.

Because some middleboxes enforce the limit of 14, changing the TCP
stacks on peers might simply not be possible without causing serious
interoperability issues.

This IETF draft assumes TCP peers can freely decide of what they can do,
but experience shows that they can not.

TCP FastOpen for example hit a bug in linux TCP conntracking, and some
linux middle boxes are still having this bug.

( SYN messages with data payload was not really considered in the past )







Re: [PATCH nf-next] netfilter: tcp: Use TCP_MAX_WSCALE instead of literal 14

2017-04-19 Thread Eric Dumazet
On Wed, 2017-04-19 at 17:58 +0200, Pablo Neira Ayuso wrote:
> On Wed, Apr 19, 2017 at 09:23:42AM +0800, gfree.w...@foxmail.com wrote:
> > From: Gao Feng 
> > 
> > The window scale may be enlarged from 14 to 15 according to the IETF
> > draft https://tools.ietf.org/html/draft-nishida-tcpm-maxwin-03.
> > 
> > Use the macro TCP_MAX_WSCALE to support it easily with TCP stack in
> > the future.
> 
> Applied, thanks.

Note that linux kernel is not ready yet for a TCP_MAX_WSCALE being
changed to 15.

Signed 32bit sk counters can already be abused with 1GB TCP windows, for
malicious peers sending SACK forcing linux to increase its memory usage
above 2GB and overflows are pretty bad.
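[Editor's note] The arithmetic behind the concern, for illustration only: the maximum advertised window is the 16-bit window field shifted by the scale factor, and with wscale 15 it lands just under INT32_MAX, leaving almost no headroom in signed 32-bit socket counters once SACK bookkeeping is added on top:

```c
#include <stdint.h>

/* Max TCP receive window = 16-bit window field << window scale. */
static int64_t max_window(unsigned int wscale)
{
    return (int64_t)65535 << wscale;
}
```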






[PATCH net] netfilter: xt_TCPMSS: add more sanity tests on tcph->doff

2017-04-03 Thread Eric Dumazet
From: Eric Dumazet <eduma...@google.com>

Denys provided an awesome KASAN report pointing to a use-after-free
in xt_TCPMSS

I have provided three patches to fix this issue, either in xt_TCPMSS or
in xt_tcpudp.c. It seems xt_TCPMSS patch has the smallest possible
impact.

Signed-off-by: Eric Dumazet <eduma...@google.com>
Reported-by: Denys Fedoryshchenko <nuclear...@nuclearcat.com>
---
 net/netfilter/xt_TCPMSS.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/xt_TCPMSS.c b/net/netfilter/xt_TCPMSS.c
index 27241a767f17b4b27d24095a31e5e9a2d3e29ce4..c64aca611ac5c5f81ad7c925652bbb90554763ac 100644
--- a/net/netfilter/xt_TCPMSS.c
+++ b/net/netfilter/xt_TCPMSS.c
@@ -104,7 +104,7 @@ tcpmss_mangle_packet(struct sk_buff *skb,
tcph = (struct tcphdr *)(skb_network_header(skb) + tcphoff);
tcp_hdrlen = tcph->doff * 4;
 
-   if (len < tcp_hdrlen)
+   if (len < tcp_hdrlen || tcp_hdrlen < sizeof(struct tcphdr))
return -1;
 
if (info->mss == XT_TCPMSS_CLAMP_PMTU) {
@@ -152,6 +152,10 @@ tcpmss_mangle_packet(struct sk_buff *skb,
if (len > tcp_hdrlen)
return 0;
 
+   /* tcph->doff has 4 bits, do not wrap it to 0 */
+   if (tcp_hdrlen >= 15 * 4)
+   return 0;
+
/*
 * MSS Option not found ?! add it..
 */




Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8

2017-04-03 Thread Eric Dumazet
On Mon, 2017-04-03 at 15:14 +0300, Denys Fedoryshchenko wrote:
> On 2017-04-03 15:09, Eric Dumazet wrote:
> > On Mon, 2017-04-03 at 11:10 +0300, Denys Fedoryshchenko wrote:
> > 
> >> I modified patch a little as:
> >> if (th->doff * 4 < sizeof(_tcph)) {
> >>   par->hotdrop = true;
> >>   WARN_ON_ONCE(!tcpinfo->option);
> >>   return false;
> >> }
> >> 
> >> And it did triggered WARN once at morning, and didn't hit KASAN. I 
> >> will
> >> run for a while more, to see if it is ok, and then if stable, will try
> >> to enable SFQ again.
> > 
> > Excellent news !
> > We will post an official fix today, thanks a lot for this detective 
> > work
> > Denys.
> I am not sure it is finally fixed, maybe we need to test more?
> I'm doing extensive tests today with an identical configuration (I had to
> run fifo, because the customer cannot afford any more outages). I've added sfq
> now a different way, and the identical config I will run after approx. 3 hours.

I did not say this bug fix would be the last one.

But this check is definitely needed; otherwise xt_TCPMSS can iterate over
the whole memory, and either crash or scramble two bytes in memory that do
not belong to the frame at all.

If tcp_hdrlen is 0 (which can happen if tcph->doff is 0), then:

for (i = sizeof(struct tcphdr); i <= tcp_hdrlen - TCPOLEN_MSS; i += optlen(opt, i)) {

can look at 4 GB of memory, until it finds a 0x02, 0x04 sequence.

It can also loop forever in some cases, depending on memory content.








Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8

2017-04-03 Thread Eric Dumazet
On Mon, 2017-04-03 at 11:10 +0300, Denys Fedoryshchenko wrote:

> I modified patch a little as:
> if (th->doff * 4 < sizeof(_tcph)) {
>   par->hotdrop = true;
>   WARN_ON_ONCE(!tcpinfo->option);
>   return false;
> }
> 
> And it did triggered WARN once at morning, and didn't hit KASAN. I will 
> run for a while more, to see if it is ok, and then if stable, will try 
> to enable SFQ again.

Excellent news!
We will post an official fix today. Thanks a lot for this detective work,
Denys.




Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8

2017-04-02 Thread Eric Dumazet
On Sun, 2017-04-02 at 10:14 -0700, Eric Dumazet wrote:

> Could it be that netfilter does not abort earlier if the TCP header is
> completely wrong?
> 

Yes, I wonder if this patch would be better, unless we replicate the
th->doff sanity check in all netfilter modules dissecting TCP frames.

diff --git a/net/netfilter/xt_tcpudp.c b/net/netfilter/xt_tcpudp.c
index ade024c90f4f129a7c384e9e1cbfdb8ffe73065f..8cb4eadd5ba1c20e74bc27ee52a0bc36a5b26725 100644
--- a/net/netfilter/xt_tcpudp.c
+++ b/net/netfilter/xt_tcpudp.c
@@ -103,11 +103,11 @@ static bool tcp_mt(const struct sk_buff *skb, struct xt_action_param *par)
if (!NF_INVF(tcpinfo, XT_TCP_INV_FLAGS,
 		     (((unsigned char *)th)[13] & tcpinfo->flg_mask) == tcpinfo->flg_cmp))
return false;
+   if (th->doff * 4 < sizeof(_tcph)) {
+   par->hotdrop = true;
+   return false;
+   }
if (tcpinfo->option) {
-   if (th->doff * 4 < sizeof(_tcph)) {
-   par->hotdrop = true;
-   return false;
-   }
if (!tcp_find_option(tcpinfo->option, skb, par->thoff,
 th->doff*4 - sizeof(_tcph),
 tcpinfo->invflags & XT_TCP_INV_OPTION,




Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8

2017-04-02 Thread Eric Dumazet
On Sun, 2017-04-02 at 19:52 +0300, Denys Fedoryshchenko wrote:
> On 2017-04-02 15:32, Eric Dumazet wrote:
> > On Sun, 2017-04-02 at 15:25 +0300, Denys Fedoryshchenko wrote:
> >> >   */
> >> I will add also WARN_ON_ONCE(tcp_hdrlen >= 15 * 4) before, for
> >> curiosity, if this condition are triggered. Is it fine like that?
> > 
> > Sure.
> 
> It didnt triggered WARN_ON, and with both patches here is one more 
> KASAN.
> What i noticed also after this KASAN, there is many others start to 
> trigger in TCPMSS and locking up server by flood.
> There is heavy netlink activity, it is pppoe server with lot of shapers.
> I noticed there left sfq by mistake, usually i am removing it, because 
> it may trigger kernel panic too (and hard to trace reason).
> I will try with pfifo instead, after 6 hours.
> 
> Here is full log with others: https://nuclearcat.com/kasan.txt
> 


Could it be that netfilter does not abort earlier if the TCP header is
completely wrong?

diff --git a/net/netfilter/xt_TCPMSS.c b/net/netfilter/xt_TCPMSS.c
index 27241a767f17b4b27d24095a31e5e9a2d3e29ce4..dd64bff38ba8074e6feb2e6e305eacb30ce4c2da 100644
--- a/net/netfilter/xt_TCPMSS.c
+++ b/net/netfilter/xt_TCPMSS.c
@@ -104,7 +104,7 @@ tcpmss_mangle_packet(struct sk_buff *skb,
tcph = (struct tcphdr *)(skb_network_header(skb) + tcphoff);
tcp_hdrlen = tcph->doff * 4;
 
-   if (len < tcp_hdrlen)
+   if (len < tcp_hdrlen || tcp_hdrlen < sizeof(struct tcphdr))
return -1;
 
if (info->mss == XT_TCPMSS_CLAMP_PMTU) {





Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8

2017-04-02 Thread Eric Dumazet
On Sun, 2017-04-02 at 15:25 +0300, Denys Fedoryshchenko wrote:
> >  */
> I will add also WARN_ON_ONCE(tcp_hdrlen >= 15 * 4) before, for 
> curiosity, if this condition are triggered. Is it fine like that?

Sure.




Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8

2017-04-02 Thread Eric Dumazet
On Sun, 2017-04-02 at 04:54 -0700, Eric Dumazet wrote:
> On Sun, 2017-04-02 at 13:45 +0200, Florian Westphal wrote:
> > Eric Dumazet <eric.duma...@gmail.com> wrote:
> > > - for (i = sizeof(struct tcphdr); i <= tcp_hdrlen - TCPOLEN_MSS; i += 
> > > optlen(opt, i)) {
> > > + for (i = sizeof(struct tcphdr); i < tcp_hdrlen - TCPOLEN_MSS; i += 
> > > optlen(opt, i)) {
> > >   if (opt[i] == TCPOPT_MSS && opt[i+1] == TCPOLEN_MSS) {
> > >   u_int16_t oldmss;
> > 
> > maybe I am low on caffeine but this looks fine, for tcp header with
> > only tcpmss this boils down to "20 <= 24 - 4" so we access offsets 20-23
> > which seems ok.
> 
> I am definitely low on caffeine ;)
> 
> An issue in this function is that we might add the missing MSS option,
> without checking whether the TCP option space is already full.
> 
> But this should not cause a KASAN splat, only some malformed TCP packet
> 
> (tcph->doff would wrap)

Something like that maybe.

diff --git a/net/netfilter/xt_TCPMSS.c b/net/netfilter/xt_TCPMSS.c
index 27241a767f17b4b27d24095a31e5e9a2d3e29ce4..1465aaf0e3a15d69d105d0a50b0429b11b6439d3 100644
--- a/net/netfilter/xt_TCPMSS.c
+++ b/net/netfilter/xt_TCPMSS.c
@@ -151,7 +151,9 @@ tcpmss_mangle_packet(struct sk_buff *skb,
 */
if (len > tcp_hdrlen)
return 0;
-
+   /* tcph->doff is 4 bits wide, do not wrap its value to 0 */
+   if (tcp_hdrlen >= 15 * 4)
+   return 0;
/*
 * MSS Option not found ?! add it..
 */




Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8

2017-04-02 Thread Eric Dumazet
On Sun, 2017-04-02 at 13:45 +0200, Florian Westphal wrote:
> Eric Dumazet <eric.duma...@gmail.com> wrote:
> > -   for (i = sizeof(struct tcphdr); i <= tcp_hdrlen - TCPOLEN_MSS; i += 
> > optlen(opt, i)) {
> > +   for (i = sizeof(struct tcphdr); i < tcp_hdrlen - TCPOLEN_MSS; i += 
> > optlen(opt, i)) {
> > if (opt[i] == TCPOPT_MSS && opt[i+1] == TCPOLEN_MSS) {
> > u_int16_t oldmss;
> 
> maybe I am low on caffeine but this looks fine, for tcp header with
> only tcpmss this boils down to "20 <= 24 - 4" so we access offsets 20-23
> which seems ok.

I am definitely low on caffeine ;)

An issue in this function is that we might add the missing MSS option,
without checking whether the TCP option space is already full.

But this should not cause a KASAN splat, only some malformed TCP packet

(tcph->doff would wrap)





Re: KASAN, xt_TCPMSS finally found nasty use-after-free bug? 4.10.8

2017-04-02 Thread Eric Dumazet
On Sun, 2017-04-02 at 10:43 +0300, Denys Fedoryshchenko wrote:
> Repost, due being sleepy missed few important points.
> 
> I am searching reasons of crashes for multiple conntrack enabled 
> servers, usually they point to conntrack, but i suspect use after free 
> might be somewhere else,
> so i tried to enable KASAN.
> And seems i got something after few hours, and it looks related to all 
> crashes, because on all that servers who rebooted i had MSS adjustment 
> (--clamp-mss-to-pmtu or --set-mss).
> Please let me know if any additional information needed.
> 
> [25181.855611] 
> ==
> [25181.855985] BUG: KASAN: use-after-free in tcpmss_tg4+0x682/0xe9c 
> [xt_TCPMSS] at addr 8802976000ea
> [25181.856344] Read of size 1 by task swapper/1/0
> [25181.856555] page:ea000a5d8000 count:0 mapcount:0 mapping: 
>   (null) index:0x0
> [25181.856909] flags: 0x1000()
> [25181.857123] raw: 1000   
> 
> [25181.857630] raw: ea000b0444a0 ea000a0b1f60  
> 
> [25181.857996] page dumped because: kasan: bad access detected
> [25181.858214] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
> 4.10.8-build-0133-debug #3
> [25181.858571] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 
> 04/02/2015
> [25181.858786] Call Trace:
> [25181.859000]  
> [25181.859215]  dump_stack+0x99/0xd4
> [25181.859423]  ? _atomic_dec_and_lock+0x15d/0x15d
> [25181.859644]  ? __dump_page+0x447/0x4e3
> [25181.859859]  ? tcpmss_tg4+0x682/0xe9c [xt_TCPMSS]
> [25181.860080]  kasan_report+0x577/0x69d
> [25181.860291]  ? __ip_route_output_key_hash+0x14ce/0x1503
> [25181.860512]  ? tcpmss_tg4+0x682/0xe9c [xt_TCPMSS]
> [25181.860736]  __asan_report_load1_noabort+0x19/0x1b
> [25181.860956]  tcpmss_tg4+0x682/0xe9c [xt_TCPMSS]
> [25181.861180]  ? tcpmss_tg4_check+0x287/0x287 [xt_TCPMSS]
> [25181.861407]  ? udp_mt+0x45a/0x45a [xt_tcpudp]
> [25181.861634]  ? __fib_validate_source+0x46b/0xcd1
> [25181.861860]  ipt_do_table+0x1432/0x1573 [ip_tables]
> [25181.862088]  ? igb_msix_ring+0x2d/0x35
> [25181.862318]  ? ip_tables_net_init+0x15/0x15 [ip_tables]
> [25181.862537]  ? ip_route_input_slow+0xe9f/0x17e3
> [25181.862759]  ? handle_irq_event_percpu+0x141/0x141
> [25181.862985]  ? rt_set_nexthop+0x9a7/0x9a7
> [25181.863203]  ? ip_tables_net_exit+0xe/0x15 [ip_tables]
> [25181.863419]  ? tcf_action_exec+0xce/0x18c
> [25181.863628]  ? iptable_mangle_net_exit+0x92/0x92 [iptable_mangle]
> [25181.863856]  ? iptable_filter_net_exit+0x92/0x92 [iptable_filter]
> [25181.864084]  iptable_filter_hook+0xc0/0x1c8 [iptable_filter]
> [25181.864311]  nf_hook_slow+0x7d/0x121
> [25181.864536]  ip_forward+0x1183/0x11c6
> [25181.864752]  ? ip_forward_finish+0x168/0x168
> [25181.864967]  ? ip_frag_mem+0x43/0x43
> [25181.865194]  ? iptable_nat_net_exit+0x92/0x92 [iptable_nat]
> [25181.865423]  ? nf_nat_ipv4_in+0xf0/0x209 [nf_nat_ipv4]
> [25181.865648]  ip_rcv_finish+0xf4c/0xf5b
> [25181.865861]  ip_rcv+0xb41/0xb72
> [25181.866086]  ? ip_local_deliver+0x282/0x282
> [25181.866308]  ? ip_local_deliver_finish+0x6e6/0x6e6
> [25181.866524]  ? ip_local_deliver+0x282/0x282
> [25181.866752]  __netif_receive_skb_core+0x1b27/0x21bf
> [25181.866971]  ? netdev_rx_handler_register+0x1a6/0x1a6
> [25181.867186]  ? enqueue_hrtimer+0x232/0x240
> [25181.867401]  ? hrtimer_start_range_ns+0xd1c/0xd4b
> [25181.867630]  ? __ppp_xmit_process+0x101f/0x104e [ppp_generic]
> [25181.867852]  ? hrtimer_cancel+0x20/0x20
> [25181.868081]  ? ppp_push+0x1402/0x1402 [ppp_generic]
> [25181.868301]  ? __pskb_pull_tail+0xb0f/0xb25
> [25181.868523]  ? ppp_xmit_process+0x47/0xaf [ppp_generic]
> [25181.868749]  __netif_receive_skb+0x5e/0x191
> [25181.868968]  process_backlog+0x295/0x573
> [25181.869180]  ? __netif_receive_skb+0x191/0x191
> [25181.869401]  napi_poll+0x311/0x745
> [25181.869611]  ? napi_complete_done+0x3b4/0x3b4
> [25181.869836]  ? __qdisc_run+0x4ec/0xb7f
> [25181.870061]  ? sch_direct_xmit+0x60b/0x60b
> [25181.870286]  net_rx_action+0x2e8/0x6dc
> [25181.870512]  ? napi_poll+0x745/0x745
> [25181.870732]  ? rps_trigger_softirq+0x181/0x1e4
> [25181.870956]  ? rps_may_expire_flow+0x29b/0x29b
> [25181.871184]  ? irq_work_run+0x2c/0x2e
> [25181.871411]  __do_softirq+0x22b/0x5df
> [25181.871629]  ? smp_call_function_single_async+0x17d/0x17d
> [25181.871854]  irq_exit+0x8a/0xfe
> [25181.872069]  smp_call_function_single_interrupt+0x8d/0x90
> [25181.872297]  call_function_single_interrupt+0x83/0x90
> [25181.872519] RIP: 0010:mwait_idle+0x15a/0x30d
> [25181.872733] RSP: 0018:8802d1017e78 EFLAGS: 0246 ORIG_RAX: 
> ff04
> [25181.873091] RAX:  RBX: 8802d1000c80 RCX: 
> 
> [25181.873311] RDX: 11005a200190 RSI:  RDI: 
> 
> [25181.873532] RBP: 8802d1017e98 R08: 003f R09: 
> 7f75f7fff700
> [25181.873751] R10: 8802d1017d80 R11: 

Re: [PATCH nf-next 0/2] netfilter: untracked object removal

2017-03-08 Thread Eric Dumazet
On Wed, 2017-03-08 at 13:49 +0100, Florian Westphal wrote:
> These patches remove the percpu untracked objects, they get replaced
> with a new (kernel internal) ctinfo state.
> 
> This avoids reference counter operations for untracked packets and
> removes the need to check a conntrack for the UNTRACKED status bit
> before setting connmark, labels, etc.
> 
> I checked with following rule set and things appear to work as
> expected (i.e., ssh connections don't show up in conntrack -L):
> 
> *raw
> :PREROUTING ACCEPT [455:34825]
> :OUTPUT ACCEPT [251:29555]
> [775:63699] -A PREROUTING -p tcp -m tcp --dport 22 -j NOTRACK
> [251:29555] -A OUTPUT -p tcp -m tcp --sport 22 -j NOTRACK
> COMMIT
> *filter
> :INPUT ACCEPT [0:0]
> :FORWARD ACCEPT [0:0]
> :OUTPUT ACCEPT [0:0]
> [0:0] -A INPUT -m conntrack --ctstate INVALID -j DROP
> [337:26377] -A INPUT -p tcp -m conntrack --ctstate UNTRACKED -m tcp --dport 22 -j ACCEPT
> [0:0] -A INPUT -m conntrack --ctstate UNTRACKED
> [102:13883] -A OUTPUT -p tcp -m conntrack --ctstate UNTRACKED -m tcp --sport 22 -j ACCEPT
> [0:0] -A OUTPUT -m conntrack --ctstate UNTRACKED
> COMMIT


Very nice series Florian, thanks !





Re: net: suspicious RCU usage in nf_hook

2017-02-01 Thread Eric Dumazet
On Wed, 2017-02-01 at 15:48 -0800, Eric Dumazet wrote:
> On Wed, Feb 1, 2017 at 3:29 PM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
> 
> > Not sure if it is better. The difference is caught up in 
> > net_enable_timestamp(),
> > which is called setsockopt() path and sk_clone() path, so we could be
> > in netstamp_needed state for a long time too until user-space exercises
> > these paths.
> >
> > I am feeling we probably need to get rid of netstamp_needed_deferred,
> > and simply defer the whole static_key_slow_dec(), like the attached patch
> > (compile only).
> >
> > What do you think?
> 
> I think we need to keep the atomic.
> 
> If two cpus call net_disable_timestamp() roughly at the same time, the
> work will be scheduled once.

Updated patch (but not tested yet)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7f218e095361520d11c243d650e053321ea7274f..29101c98399f40b6b8e42c31a255d8f1fb6bd7a1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1695,24 +1695,19 @@ EXPORT_SYMBOL_GPL(net_dec_egress_queue);
 
 static struct static_key netstamp_needed __read_mostly;
 #ifdef HAVE_JUMP_LABEL
-/* We are not allowed to call static_key_slow_dec() from irq context
- * If net_disable_timestamp() is called from irq context, defer the
- * static_key_slow_dec() calls.
- */
 static atomic_t netstamp_needed_deferred;
-#endif
-
-void net_enable_timestamp(void)
+static void netstamp_clear(struct work_struct *work)
 {
-#ifdef HAVE_JUMP_LABEL
	int deferred = atomic_xchg(&netstamp_needed_deferred, 0);
 
-   if (deferred) {
-   while (--deferred)
-   static_key_slow_dec(&netstamp_needed);
-   return;
-   }
+   while (deferred--)
+   static_key_slow_dec(&netstamp_needed);
+}
+static DECLARE_WORK(netstamp_work, netstamp_clear);
 #endif
+
+void net_enable_timestamp(void)
+{
	static_key_slow_inc(&netstamp_needed);
 }
 EXPORT_SYMBOL(net_enable_timestamp);
@@ -1720,12 +1715,12 @@ EXPORT_SYMBOL(net_enable_timestamp);
 void net_disable_timestamp(void)
 {
 #ifdef HAVE_JUMP_LABEL
-   if (in_interrupt()) {
-   atomic_inc(&netstamp_needed_deferred);
-   return;
-   }
-#endif
+   /* net_disable_timestamp() can be called from non process context */
+   atomic_inc(&netstamp_needed_deferred);
+   schedule_work(&netstamp_work);
+#else
	static_key_slow_dec(&netstamp_needed);
+#endif
 }
 EXPORT_SYMBOL(net_disable_timestamp);
 




Re: net: suspicious RCU usage in nf_hook

2017-02-01 Thread Eric Dumazet
On Wed, Feb 1, 2017 at 3:29 PM, Cong Wang  wrote:

> Not sure if it is better. The difference is caught up in 
> net_enable_timestamp(),
> which is called setsockopt() path and sk_clone() path, so we could be
> in netstamp_needed state for a long time too until user-space exercises
> these paths.
>
> I am feeling we probably need to get rid of netstamp_needed_deferred,
> and simply defer the whole static_key_slow_dec(), like the attached patch
> (compile only).
>
> What do you think?

I think we need to keep the atomic.

If two cpus call net_disable_timestamp() roughly at the same time, the
work will be scheduled once.


Re: net: suspicious RCU usage in nf_hook

2017-02-01 Thread Eric Dumazet
On Wed, 2017-02-01 at 13:16 -0800, Eric Dumazet wrote:

> This would permanently leave the kernel in the netstamp_needed state.
> 
> I would prefer the patch using a process context to perform the
> cleanup ? Note there is a race window, but probably not a big deal.
> 
>  net/core/dev.c |   30 ++
>  1 file changed, 10 insertions(+), 20 deletions(-)

Patch is not complete (for the HAVE_JUMP_LABEL=n case)

Would you like to author/submit it ?

Thanks.




Re: net: suspicious RCU usage in nf_hook

2017-02-01 Thread Eric Dumazet
On Wed, 2017-02-01 at 12:51 -0800, Cong Wang wrote:
> On Tue, Jan 31, 2017 at 7:44 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> > On Mon, 2017-01-30 at 22:19 -0800, Cong Wang wrote:
> >
> >>
> >> The context is process context (TX path before hitting qdisc), and
> >> BH is not disabled, so in_interrupt() doesn't catch it. Hmm, this
> >> makes me thinking maybe we really need to disable BH in this
> >> case for nf_hook()? But it is called in RX path too, and BH is
> >> already disabled there.
> >
> > ipt_do_table() and similar netfilter entry points disable BH.
> >
> > Maybe it is done too late.
> 
> I think we need a fix like the following one for minimum impact.
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 727b6fd..eee7d63 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1720,12 +1720,10 @@ EXPORT_SYMBOL(net_enable_timestamp);
>  void net_disable_timestamp(void)
>  {
>  #ifdef HAVE_JUMP_LABEL
> -   if (in_interrupt()) {
> -   atomic_inc(&netstamp_needed_deferred);
> -   return;
> -   }
> -#endif
> +   atomic_inc(&netstamp_needed_deferred);
> +#else
> 	static_key_slow_dec(&netstamp_needed);
> +#endif
>  }
>  EXPORT_SYMBOL(net_disable_timestamp);

This would permanently leave the kernel in the netstamp_needed state.

I would prefer the patch using a process context to perform the
cleanup. Note there is a race window, but probably not a big deal.

 net/core/dev.c |   30 ++
 1 file changed, 10 insertions(+), 20 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7f218e095361520d11c243d650e053321ea7274f..1cae681b6cfd1cf2c9bee7072eb8af9cf79cced8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1695,37 +1695,27 @@ EXPORT_SYMBOL_GPL(net_dec_egress_queue);
 
 static struct static_key netstamp_needed __read_mostly;
 #ifdef HAVE_JUMP_LABEL
-/* We are not allowed to call static_key_slow_dec() from irq context
- * If net_disable_timestamp() is called from irq context, defer the
- * static_key_slow_dec() calls.
- */
 static atomic_t netstamp_needed_deferred;
-#endif
-
-void net_enable_timestamp(void)
+static void netstamp_clear(struct work_struct *work)
 {
-#ifdef HAVE_JUMP_LABEL
	int deferred = atomic_xchg(&netstamp_needed_deferred, 0);
 
-   if (deferred) {
-   while (--deferred)
-   static_key_slow_dec(&netstamp_needed);
-   return;
-   }
+   while (deferred--)
+   static_key_slow_dec(&netstamp_needed);
+}
+static DECLARE_WORK(netstamp_work, netstamp_clear);
 #endif
+
+void net_enable_timestamp(void)
+{
	static_key_slow_inc(&netstamp_needed);
 }
 EXPORT_SYMBOL(net_enable_timestamp);
 
 void net_disable_timestamp(void)
 {
-#ifdef HAVE_JUMP_LABEL
-   if (in_interrupt()) {
-   atomic_inc(&netstamp_needed_deferred);
-   return;
-   }
-#endif
-   static_key_slow_dec(&netstamp_needed);
+   atomic_inc(&netstamp_needed_deferred);
+   schedule_work(&netstamp_work);
 }
 EXPORT_SYMBOL(net_disable_timestamp);
 




Re: net: suspicious RCU usage in nf_hook

2017-01-31 Thread Eric Dumazet
On Mon, 2017-01-30 at 22:19 -0800, Cong Wang wrote:

> 
> The context is process context (TX path before hitting qdisc), and
> BH is not disabled, so in_interrupt() doesn't catch it. Hmm, this
> makes me thinking maybe we really need to disable BH in this
> case for nf_hook()? But it is called in RX path too, and BH is
> already disabled there.

ipt_do_table() and similar netfilter entry points disable BH.

Maybe it is done too late.





Re: net: suspicious RCU usage in nf_hook

2017-01-27 Thread Eric Dumazet
On Fri, 2017-01-27 at 17:00 -0800, Cong Wang wrote:
> On Fri, Jan 27, 2017 at 3:35 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> > Oh well, I forgot to submit the official patch I think, Jan 9th.
> >
> > https://groups.google.com/forum/#!topic/syzkaller/BhyN5OFd7sQ
> >
> 
> Hmm, but why only fragments need skb_orphan()? It seems like
> any kfree_skb() inside a nf hook needs to have a preceding
> skb_orphan().


> 
> Also, I am not convinced it is similar to commit 8282f27449bf15548
> which is on RX path.

Well, we clearly see IPv6 reassembly being part of the equation in both
cases.

I was replying to the first part of the splat [1], which was already
diagnosed and had a non-official patch.

A use-after-free is also a bug, regardless of whether jump labels are
used or not.

I still do not really understand this nf_hook issue, I thought we were
disabling BH in netfilter.

So the in_interrupt() check in net_disable_timestamp() should trigger;
this was the intent of netstamp_needed_deferred's existence.

Not sure if we can test for rcu_read_lock() as well.

[1]
 sk_destruct+0x47/0x80 net/core/sock.c:1460
 __sk_free+0x57/0x230 net/core/sock.c:1468
 sock_wfree+0xae/0x120 net/core/sock.c:1645
 skb_release_head_state+0xfc/0x200 net/core/skbuff.c:655
 skb_release_all+0x15/0x60 net/core/skbuff.c:668
 __kfree_skb+0x15/0x20 net/core/skbuff.c:684
 kfree_skb+0x16e/0x4c0 net/core/skbuff.c:705
 inet_frag_destroy+0x121/0x290 net/ipv4/inet_fragment.c:304
 inet_frag_put include/net/inet_frag.h:133 [inline]
 nf_ct_frag6_gather+0x1106/0x3840
net/ipv6/netfilter/nf_conntrack_reasm.c:617
 ipv6_defrag+0x1be/0x2b0 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68
 nf_hook_entry_hookfn include/linux/netfilter.h:102 [inline]
 nf_hook_slow+0xc3/0x290 net/netfilter/core.c:310
 nf_hook include/linux/netfilter.h:212 [inline]
 __ip6_local_out+0x489/0x840 net/ipv6/output_core.c:160
 ip6_local_out+0x2d/0x170 net/ipv6/output_core.c:170







Re: [PATCH 02/14] tcp: fix mark propagation with fwmark_reflect enabled

2017-01-26 Thread Eric Dumazet
On Thu, 2017-01-26 at 20:19 +0100, Pablo Neira Ayuso wrote:

> Right. This is not percpu as in IPv4.
> 
> I can send a follow up patch to get this in sync with the way we do it
> in IPv4, ie. add percpu socket.
> 
> Fine with this approach? Thanks!

Not really.

percpu sockets are going to slow down network namespace creation /
deletion and increase the memory footprint.

IPv6 is cleaner because it does not really have to use different
sockets.

Ultimately we would like to have the same for IPv4.

I would rather carry the mark either in an additional parameter,
or in the flow that is already passed as a parameter.





Re: [PATCH 02/14] tcp: fix mark propagation with fwmark_reflect enabled

2017-01-26 Thread Eric Dumazet
On Thu, 2017-01-26 at 17:37 +0100, Pablo Neira Ayuso wrote:
> From: Pau Espin Pedrol 
> 
> Otherwise, RST packets generated by the TCP stack for non-existing
> sockets always have mark 0.
> The mark from the original packet is assigned to the netns_ipv4/6
> socket used to send the response so that it can get copied into the
> response skb when the socket sends it.
> 
> Fixes: e110861f8609 ("net: add a sysctl to reflect the fwmark on replies")
> Cc: Lorenzo Colitti 
> Signed-off-by: Pau Espin Pedrol 
> Signed-off-by: Pablo Neira Ayuso 
> ---
>  net/ipv4/ip_output.c | 1 +
>  net/ipv6/tcp_ipv6.c  | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index fac275c48108..b67719f45953 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1629,6 +1629,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
>   sk->sk_protocol = ip_hdr(skb)->protocol;
>   sk->sk_bound_dev_if = arg->bound_dev_if;
>   sk->sk_sndbuf = sysctl_wmem_default;
> + sk->sk_mark = fl4.flowi4_mark;
>   err = ip_append_data(sk, &fl4, ip_reply_glue_bits, arg->iov->iov_base,
>			     len, 0, &ipc, &rt, MSG_DONTWAIT);
>   if (unlikely(err)) {
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 73bc8fc68acd..2b20622a5824 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -840,6 +840,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
>   dst = ip6_dst_lookup_flow(ctl_sk, &fl6, NULL);
>   if (!IS_ERR(dst)) {
>   skb_dst_set(buff, dst);
> + ctl_sk->sk_mark = fl6.flowi6_mark;
>   ip6_xmit(ctl_sk, buff, &fl6, NULL, tclass);
>   TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
>   if (rst)


This patch is wrong.

ctl_sk is a shared socket, and tcp_v6_send_response() can be called from
many different cpus at the same time.





Re: ip_rcv_finish() NULL pointer kernel panic

2017-01-26 Thread Eric Dumazet
On Thu, 2017-01-26 at 10:00 -0800, Eric Dumazet wrote:
> On Thu, 2017-01-26 at 17:24 +0100, Florian Westphal wrote:
> 
> > I think it makes sense to set dst->incoming
> > to a stub in br_netfilter_rtable_init() to just kfree_skb()+
> > WARN_ON_ONCE(), no need to add code to ip stack or crash kernel
> > due to brnf bug.
> 
> Just kfree_skb() would hide bugs.
> 
> Dropping packets is not uncommon in networking...
> 
> I would rather use at least a WARN_ON_ONCE() before the kfree_skb() ;)


Oh well, I obviously did not parse your suggestion properly.




Re: ip_rcv_finish() NULL pointer kernel panic

2017-01-26 Thread Eric Dumazet
On Thu, 2017-01-26 at 17:24 +0100, Florian Westphal wrote:

> I think it makes sense to set dst->incoming
> to a stub in br_netfilter_rtable_init() to just kfree_skb()+
> WARN_ON_ONCE(), no need to add code to ip stack or crash kernel
> due to brnf bug.

Just kfree_skb() would hide bugs.

Dropping packets is not uncommon in networking...

I would rather use at least a WARN_ON_ONCE() before the kfree_skb() ;)





Re: ip_rcv_finish() NULL pointer kernel panic

2017-01-26 Thread Eric Dumazet
On Thu, 2017-01-26 at 09:32 -0600, Roy Keene wrote:
> This bug appears to have existed for a long time:
> 
>   https://www.spinics.net/lists/netdev/msg222459.html
> 
>   http://www.kernelhub.org/?p=2=823752
> 
> Though possibly with different things not setting the "input" function 
> pointer in the "struct dst_entry".
> 
> include/net/dst.h:
>   496 static inline int dst_input(struct sk_buff *skb) {
>   498 return skb_dst(skb)->input(skb);
>   499 }
> 
> Is there any reason not to check to see if this pointer is NULL before 
> blindly calling it ?

Sure. It can not be NULL at this point.

Just look at the code in ip_rcv_finish() :

It first makes sure to get a valid dst.

Something is broken, probably in bridge netfilter code.

I suspect the dst here points to &br->fake_rtable, and this is not
expected.

br_drop_fake_rtable() should have been called somewhere ...





Re: [PATCH net-next] netfilter: nft_counter: rework atomic dump and reset

2016-12-10 Thread Eric Dumazet
On Sat, 2016-12-10 at 15:25 +0100, Pablo Neira Ayuso wrote:
> On Sat, Dec 10, 2016 at 03:16:55PM +0100, Pablo Neira Ayuso wrote:
>  
> -	nft_counter_fetch(priv, &total, reset);
> +	nft_counter_fetch(priv, &total);
> +	if (reset)
> +		nft_counter_reset(priv, &total);
>  
> if (nla_put_be64(skb, NFTA_COUNTER_BYTES,
> cpu_to_be64(total.bytes),
>  NFTA_COUNTER_PAD) ||

Might be nitpicking, but you should reset the stats only if the
nla_put_be64() succeeded.

But regardless of this detail, patch looks good and is very close to the
one I cooked and was about to send this morning.

Thanks Pablo !





Re: [PATCH 37/50] netfilter: nf_tables: atomic dump and reset for stateful objects

2016-12-09 Thread Eric Dumazet
On Fri, 2016-12-09 at 06:24 -0800, Eric Dumazet wrote:

> It looks that you want a seqcount, even on 64bit arches,
> so that CPU 2 can restart its loop, and more importantly you need
> to not accumulate the values you read, because they might be old/invalid.

Untested patch to give general idea. I can polish it a bit later today.

 net/netfilter/nft_counter.c |   59 +-
 1 file changed, 23 insertions(+), 36 deletions(-)

diff --git a/net/netfilter/nft_counter.c b/net/netfilter/nft_counter.c
index f6a02c5071c2aeafca7635da3282a809aa04d6ab..57ed95b024473a2aa76298fe5bb5013bf709801b 100644
--- a/net/netfilter/nft_counter.c
+++ b/net/netfilter/nft_counter.c
@@ -31,18 +31,25 @@ struct nft_counter_percpu_priv {
struct nft_counter_percpu __percpu *counter;
 };
 
+static DEFINE_PER_CPU(seqcount_t, nft_counter_seq);
+
 static inline void nft_counter_do_eval(struct nft_counter_percpu_priv *priv,
   struct nft_regs *regs,
   const struct nft_pktinfo *pkt)
 {
struct nft_counter_percpu *this_cpu;
+   seqcount_t *myseq;
 
local_bh_disable();
this_cpu = this_cpu_ptr(priv->counter);
-	u64_stats_update_begin(&this_cpu->syncp);
+	myseq = this_cpu_ptr(&nft_counter_seq);
+
+   write_seqcount_begin(myseq);
+
this_cpu->counter.bytes += pkt->skb->len;
this_cpu->counter.packets++;
-	u64_stats_update_end(&this_cpu->syncp);
+
+   write_seqcount_end(myseq);
local_bh_enable();
 }
 
@@ -110,52 +117,30 @@ static void nft_counter_fetch(struct nft_counter_percpu __percpu *counter,
 
memset(total, 0, sizeof(*total));
for_each_possible_cpu(cpu) {
+		seqcount_t *seqp = per_cpu_ptr(&nft_counter_seq, cpu);
+
cpu_stats = per_cpu_ptr(counter, cpu);
do {
-			seq = u64_stats_fetch_begin_irq(&cpu_stats->syncp);
+   seq = read_seqcount_begin(seqp);
bytes   = cpu_stats->counter.bytes;
packets = cpu_stats->counter.packets;
-		} while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, seq));
+   } while (read_seqcount_retry(seqp, seq));
 
total->packets += packets;
total->bytes += bytes;
}
 }
 
-static u64 __nft_counter_reset(u64 *counter)
-{
-   u64 ret, old;
-
-   do {
-   old = *counter;
-   ret = cmpxchg64(counter, old, 0);
-   } while (ret != old);
-
-   return ret;
-}
-
 static void nft_counter_reset(struct nft_counter_percpu __percpu *counter,
  struct nft_counter *total)
 {
struct nft_counter_percpu *cpu_stats;
-   u64 bytes, packets;
-   unsigned int seq;
-   int cpu;
 
-   memset(total, 0, sizeof(*total));
-   for_each_possible_cpu(cpu) {
-   bytes = packets = 0;
-
-   cpu_stats = per_cpu_ptr(counter, cpu);
-   do {
-			seq = u64_stats_fetch_begin_irq(&cpu_stats->syncp);
-			packets += __nft_counter_reset(&cpu_stats->counter.packets);
-			bytes   += __nft_counter_reset(&cpu_stats->counter.bytes);
-		} while (u64_stats_fetch_retry_irq(&cpu_stats->syncp, seq));
-
-   total->packets += packets;
-   total->bytes += bytes;
-   }
+   local_bh_disable();
+   cpu_stats = this_cpu_ptr(counter);
+   cpu_stats->counter.packets -= total->packets;
+   cpu_stats->counter.bytes -= total->bytes;
+   local_bh_enable();
 }
 
 static int nft_counter_do_dump(struct sk_buff *skb,
@@ -164,10 +149,9 @@ static int nft_counter_do_dump(struct sk_buff *skb,
 {
struct nft_counter total;
 
+	nft_counter_fetch(priv->counter, &total);
	if (reset)
		nft_counter_reset(priv->counter, &total);
-   else
-   nft_counter_fetch(priv->counter, );
 
if (nla_put_be64(skb, NFTA_COUNTER_BYTES, cpu_to_be64(total.bytes),
 NFTA_COUNTER_PAD) ||
@@ -285,7 +269,10 @@ static struct nft_expr_type nft_counter_type __read_mostly = {
 
 static int __init nft_counter_module_init(void)
 {
-   int err;
+   int err, cpu;
+
+   for_each_possible_cpu(cpu)
+		seqcount_init(per_cpu_ptr(&nft_counter_seq, cpu));
 
	err = nft_register_obj(&nft_counter_obj);
if (err < 0)




Re: [PATCH 37/50] netfilter: nf_tables: atomic dump and reset for stateful objects

2016-12-09 Thread Eric Dumazet
On Fri, 2016-12-09 at 11:24 +0100, Pablo Neira Ayuso wrote:
> Hi Paul,

Hi Pablo

Given that bytes/packets counters are modified without cmpxchg64()  :

static inline void nft_counter_do_eval(struct nft_counter_percpu_priv *priv,
   struct nft_regs *regs,
   const struct nft_pktinfo *pkt)
{
struct nft_counter_percpu *this_cpu;

local_bh_disable();
this_cpu = this_cpu_ptr(priv->counter);
	u64_stats_update_begin(&this_cpu->syncp);
	this_cpu->counter.bytes += pkt->skb->len;
	this_cpu->counter.packets++;
	u64_stats_update_end(&this_cpu->syncp);
local_bh_enable();
}

It means that the cmpxchg64() used to clear the stats is not good enough.

It does not help to make sure stats are properly cleared.

On 64 bit, the ->syncp is not there, so the nft_counter_reset() might
not see that a bytes or packets counter was modified by another cpu.


CPU 1                           CPU 2

LOAD PTR->BYTES into REG_A      old = *counter;
REG_A += skb->len;
                                cmpxchg64(counter, old, 0);
PTR->BYTES = REG_A

It looks that you want a seqcount, even on 64bit arches,
so that CPU 2 can restart its loop, and more importantly you need
to not accumulate the values you read, because they might be old/invalid.

Another way would be to not use cmpxchg64() at all.
Way to expensive in fast path !

The percpu value would never be modified by an other cpu than the owner.

You need a per cpu seqcount, no need to add a syncp per nft percpu counter.


static DEFINE_PER_CPU(seqcount_t, nft_pcpu_seq);

static inline void nft_counter_do_eval(struct nft_counter_percpu_priv *priv,
   struct nft_regs *regs,
   const struct nft_pktinfo *pkt)
{
struct nft_counter_percpu *this_cpu;
seqcount_t *myseq;

local_bh_disable();
this_cpu = this_cpu_ptr(priv->counter);
	myseq = this_cpu_ptr(&nft_pcpu_seq);

write_seqcount_begin(myseq);

this_cpu->counter.bytes += pkt->skb->len;
this_cpu->counter.packets++;

write_seqcount_end(myseq);

local_bh_enable();
}

Thanks !




Re: [PATCH nf-next] netfilter: xt_bpf: support ebpf

2016-12-05 Thread Eric Dumazet
On Mon, 2016-12-05 at 15:28 -0500, Willem de Bruijn wrote:
> From: Willem de Bruijn <will...@google.com>
> 
> Add support for attaching an eBPF object by file descriptor.
> 
> The iptables binary can be called with a path to an elf object or a
> pinned bpf object. Also pass the mode and path to the kernel to be
> able to return it later for iptables dump and save.
> 
> Signed-off-by: Willem de Bruijn <will...@google.com>
> ---

Assuming there is no simple way to get variable matchsize in iptables,
this looks good to me, thanks.

Reviewed-by: Eric Dumazet <eduma...@google.com>






Re: [PATCH net-next] net_sched: gen_estimator: complete rewrite of rate estimators

2016-12-03 Thread Eric Dumazet
On Sat, 2016-12-03 at 23:07 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <eduma...@google.com>
> 
> 1) Old code was hard to maintain, due to complex lock chains.
>(We probably will be able to remove some kfree_rcu() in callers)
> 
> 2) Using a single timer to update all estimators does not scale.
> 
> 3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
>is not supposed to work well)


> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index c7adcb57654ea57d1ba6702c91743cb7d2c74d28..859b60bfa86712031186fffc09c65bc43aa065dd 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2081,6 +2081,14 @@ static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
>   limit <<= factor;
>  
>   if (atomic_read(&sk->sk_wmem_alloc) > limit) {
> + /* Special case where TX completion is delayed too much :
> +  * If the skb we try to send is the first skb in write queue,
> +  * then send it !
> +  * No need to wait for TX completion to call us back.
> +  */
> + if (skb == sk->sk_write_queue.next)
> + return false;
> +

Oups, this has nothing to do here. I will send a v2, sorry.




[PATCH net-next] net_sched: gen_estimator: complete rewrite of rate estimators

2016-12-03 Thread Eric Dumazet
From: Eric Dumazet <eduma...@google.com>

1) Old code was hard to maintain, due to complex lock chains.
   (We probably will be able to remove some kfree_rcu() in callers)

2) Using a single timer to update all estimators does not scale.

3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
   is not supposed to work well)

In this rewrite :

- I removed the RB tree that had to be scanned in
gen_estimator_active(). qdisc dumps should be much faster.

- Each estimator has its own timer.

- Estimations are maintained in net_rate_estimator structure,
  instead of dirtying the qdisc. Minor, but part of the simplification.

- Reading the estimator uses RCU and a seqcount to provide proper
  support for 32bit kernels.

- We reduce memory need when estimators are not used, since
  we store a pointer, instead of the bytes/packets counters.

- xt_rateest_mt() no longer has to grab a spinlock.
  (In the future, xt_rateest_tg() could be switched to per cpu counters)

Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 include/net/act_api.h  |2 
 include/net/gen_stats.h|   17 -
 include/net/netfilter/xt_rateest.h |   10 
 include/net/sch_generic.h  |2 
 net/core/gen_estimator.c   |  298 +--
 net/core/gen_stats.c   |   17 -
 net/ipv4/tcp_output.c  |8 
 net/netfilter/xt_RATEEST.c |4 
 net/netfilter/xt_rateest.c |   28 +-
 net/sched/act_api.c|9 
 net/sched/act_police.c |   21 +
 net/sched/sch_api.c|2 
 net/sched/sch_cbq.c|6 
 net/sched/sch_drr.c|6 
 net/sched/sch_generic.c|2 
 net/sched/sch_hfsc.c   |6 
 net/sched/sch_htb.c|6 
 net/sched/sch_qfq.c|8 
 18 files changed, 188 insertions(+), 264 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 9dddf77a69ccbcb003cfa66bcc0de337f78f3dae..1d716449209e4753a297c61a287077a1eb96e6d8 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -36,7 +36,7 @@ struct tc_action {
	struct tcf_t			tcfa_tm;
struct gnet_stats_basic_packed  tcfa_bstats;
struct gnet_stats_queue tcfa_qstats;
-	struct gnet_stats_rate_est64	tcfa_rate_est;
+   struct net_rate_estimator __rcu *tcfa_rate_est;
spinlock_t  tcfa_lock;
struct rcu_head tcfa_rcu;
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
diff --git a/include/net/gen_stats.h b/include/net/gen_stats.h
index 231e121cc7d9c72075e7e6dde3655d631f64a1c4..8b7aa370e7a4af61fcb71ed751dba72ebead6143 100644
--- a/include/net/gen_stats.h
+++ b/include/net/gen_stats.h
@@ -11,6 +11,8 @@ struct gnet_stats_basic_cpu {
struct u64_stats_sync syncp;
 };
 
+struct net_rate_estimator;
+
 struct gnet_dump {
spinlock_t *  lock;
struct sk_buff *  skb;
@@ -42,8 +44,7 @@ void __gnet_stats_copy_basic(const seqcount_t *running,
 struct gnet_stats_basic_cpu __percpu *cpu,
 struct gnet_stats_basic_packed *b);
 int gnet_stats_copy_rate_est(struct gnet_dump *d,
-const struct gnet_stats_basic_packed *b,
-struct gnet_stats_rate_est64 *r);
+struct net_rate_estimator __rcu **ptr);
 int gnet_stats_copy_queue(struct gnet_dump *d,
  struct gnet_stats_queue __percpu *cpu_q,
  struct gnet_stats_queue *q, __u32 qlen);
@@ -53,16 +54,16 @@ int gnet_stats_finish_copy(struct gnet_dump *d);
 
 int gen_new_estimator(struct gnet_stats_basic_packed *bstats,
  struct gnet_stats_basic_cpu __percpu *cpu_bstats,
- struct gnet_stats_rate_est64 *rate_est,
+ struct net_rate_estimator __rcu **rate_est,
  spinlock_t *stats_lock,
  seqcount_t *running, struct nlattr *opt);
-void gen_kill_estimator(struct gnet_stats_basic_packed *bstats,
-   struct gnet_stats_rate_est64 *rate_est);
+void gen_kill_estimator(struct net_rate_estimator __rcu **ptr);
 int gen_replace_estimator(struct gnet_stats_basic_packed *bstats,
  struct gnet_stats_basic_cpu __percpu *cpu_bstats,
- struct gnet_stats_rate_est64 *rate_est,
+ struct net_rate_estimator __rcu **ptr,
  spinlock_t *stats_lock,
  seqcount_t *running, struct nlattr *opt);
-bool gen_estimator_active(const struct gnet_stats_basic_packed *bstats,
- const struct gnet_stats_rate_est64 *rate_est);
+bool gen_estimator_active(struct net_rate_estimator __rcu **ptr);
+bool gen_estimator_read(struct net_rate_estimator

[PATCH v2 net-next] net_sched: gen_estimator: complete rewrite of rate estimators

2016-12-03 Thread Eric Dumazet
From: Eric Dumazet <eduma...@google.com>

1) Old code was hard to maintain, due to complex lock chains.
   (We probably will be able to remove some kfree_rcu() in callers)

2) Using a single timer to update all estimators does not scale.

3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
   is not supposed to work well)

In this rewrite :

- I removed the RB tree that had to be scanned in
gen_estimator_active(). qdisc dumps should be much faster.

- Each estimator has its own timer.

- Estimations are maintained in net_rate_estimator structure,
  instead of dirtying the qdisc. Minor, but part of the simplification.

- Reading the estimator uses RCU and a seqcount to provide proper
  support for 32bit kernels.

- We reduce memory need when estimators are not used, since
  we store a pointer, instead of the bytes/packets counters.

- xt_rateest_mt() no longer has to grab a spinlock.
  (In the future, xt_rateest_tg() could be switched to per cpu counters)

Signed-off-by: Eric Dumazet <eduma...@google.com>
---
v2: removed unwanted changes to tcp_output.c
Renamed some parameters to please htmldoc

 include/net/act_api.h  |2 
 include/net/gen_stats.h|   17 -
 include/net/netfilter/xt_rateest.h |   10 
 include/net/sch_generic.h  |2 
 net/core/gen_estimator.c   |  299 +--
 net/core/gen_stats.c   |   17 -
 net/netfilter/xt_RATEEST.c |4 
 net/netfilter/xt_rateest.c |   28 +-
 net/sched/act_api.c|9 
 net/sched/act_police.c |   21 +
 net/sched/sch_api.c|2 
 net/sched/sch_cbq.c|6 
 net/sched/sch_drr.c|6 
 net/sched/sch_generic.c|2 
 net/sched/sch_hfsc.c   |6 
 net/sched/sch_htb.c|6 
 net/sched/sch_qfq.c|8 
 17 files changed, 181 insertions(+), 264 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 9dddf77a69ccbcb003cfa66bcc0de337f78f3dae..1d716449209e4753a297c61a287077a1eb96e6d8 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -36,7 +36,7 @@ struct tc_action {
	struct tcf_t			tcfa_tm;
struct gnet_stats_basic_packed  tcfa_bstats;
struct gnet_stats_queue tcfa_qstats;
-	struct gnet_stats_rate_est64	tcfa_rate_est;
+   struct net_rate_estimator __rcu *tcfa_rate_est;
spinlock_t  tcfa_lock;
struct rcu_head tcfa_rcu;
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
diff --git a/include/net/gen_stats.h b/include/net/gen_stats.h
index 231e121cc7d9c72075e7e6dde3655d631f64a1c4..8b7aa370e7a4af61fcb71ed751dba72ebead6143 100644
--- a/include/net/gen_stats.h
+++ b/include/net/gen_stats.h
@@ -11,6 +11,8 @@ struct gnet_stats_basic_cpu {
struct u64_stats_sync syncp;
 };
 
+struct net_rate_estimator;
+
 struct gnet_dump {
spinlock_t *  lock;
struct sk_buff *  skb;
@@ -42,8 +44,7 @@ void __gnet_stats_copy_basic(const seqcount_t *running,
 struct gnet_stats_basic_cpu __percpu *cpu,
 struct gnet_stats_basic_packed *b);
 int gnet_stats_copy_rate_est(struct gnet_dump *d,
-const struct gnet_stats_basic_packed *b,
-struct gnet_stats_rate_est64 *r);
+struct net_rate_estimator __rcu **ptr);
 int gnet_stats_copy_queue(struct gnet_dump *d,
  struct gnet_stats_queue __percpu *cpu_q,
  struct gnet_stats_queue *q, __u32 qlen);
@@ -53,16 +54,16 @@ int gnet_stats_finish_copy(struct gnet_dump *d);
 
 int gen_new_estimator(struct gnet_stats_basic_packed *bstats,
  struct gnet_stats_basic_cpu __percpu *cpu_bstats,
- struct gnet_stats_rate_est64 *rate_est,
+ struct net_rate_estimator __rcu **rate_est,
  spinlock_t *stats_lock,
  seqcount_t *running, struct nlattr *opt);
-void gen_kill_estimator(struct gnet_stats_basic_packed *bstats,
-   struct gnet_stats_rate_est64 *rate_est);
+void gen_kill_estimator(struct net_rate_estimator __rcu **ptr);
 int gen_replace_estimator(struct gnet_stats_basic_packed *bstats,
  struct gnet_stats_basic_cpu __percpu *cpu_bstats,
- struct gnet_stats_rate_est64 *rate_est,
+ struct net_rate_estimator __rcu **ptr,
  spinlock_t *stats_lock,
  seqcount_t *running, struct nlattr *opt);
-bool gen_estimator_active(const struct gnet_stats_basic_packed *bstats,
- const struct gnet_stats_rate_est64 *rate_est);
+bool gen_estimator_active(struct net_rate_estimator __rcu **ptr);
+bool gen_estimator

Re: net/sctp: vmalloc allocation failure in sctp_setsockopt/xt_alloc_table_info

2016-11-28 Thread Eric Dumazet
On Mon, 2016-11-28 at 19:09 +0100, Florian Westphal wrote:

> We should prevent OOM killer from running in first place (GFP_NORETRY should 
> work).

Make sure that a vmalloc(8) will succeed, even if few pages need to
be swapped out.

Otherwise, some scripts using iptables will die while they were okay
before ?

I am not sure how GFP_NORETRY really works among different linux
kernels.

Maybe this flag has no effect for order-0 allocations anyway.




Re: [PATCH net-next 1/1] netfilter: xt_multiport: Fix wrong unmatch result with multiple ports

2016-11-24 Thread Eric Dumazet
On Fri, 2016-11-25 at 11:58 +0800, f...@ikuai8.com wrote:
> From: Gao Feng 
> 
> I lost one test case in the commit for xt_multiport.
> For example, the rule is "-m multiport --dports 22,80,443".
> When first port is unmatched and the second is matched, the curent codes
> could not return the right result.
> It would return false directly when the first port is unmatched.
> 
> Fixes: dd2602d00f80 ("netfilter: xt_multiport: Use switch case instead
> of multiple condition checks")
> 
> Signed-off-by: Gao Feng 

Note that the Fixes: tag should immediately precede your
'Signed-off-by:'   

No empty new line between the two tags.

Thanks.




Re: [PATCH v2 nf-next 3/3] netfilter: x_tables: pack percpu counter allocations

2016-11-21 Thread Eric Dumazet
On Mon, 2016-11-21 at 23:07 +0100, Florian Westphal wrote:
> instead of allocating each xt_counter individually, allocate 4k chunks
> and then use these for counter allocation requests.
> 
> This should speed up rule evaluation by increasing data locality,
> also speeds up ruleset loading because we reduce calls to the percpu
> allocator.
> 
> As Eric points out we can't use PAGE_SIZE, page_allocator would fail on
> arches with 64k page size.
> 
> Suggested-by: Eric Dumazet <eduma...@google.com>
> Signed-off-by: Florian Westphal <f...@strlen.de>
> ---

Acked-by: Eric Dumazet <eduma...@google.com>

Very cool, thanks a lot !




Re: [PATCH nf-next 3/3] netfilter: x_tables: pack percpu counter allocations

2016-11-21 Thread Eric Dumazet
On Mon, 2016-11-21 at 14:57 +0100, Florian Westphal wrote:

...

>  #define SMP_ALIGN(x) (((x) + SMP_CACHE_BYTES-1) & ~(SMP_CACHE_BYTES-1))
> +#define XT_PCPU_BLOCK_SIZE 4096
>  
>  struct compat_delta {
>   unsigned int offset; /* offset in kernel */
> @@ -1618,6 +1619,7 @@ EXPORT_SYMBOL_GPL(xt_proto_fini);
>  /**
>   * xt_percpu_counter_alloc - allocate x_tables rule counter
>   *
> + * @state: pointer to xt_percpu allocation state
>   * @counter: pointer to counter struct inside the ip(6)/arpt_entry struct
>   *
>   * On SMP, the packet counter [ ip(6)t_entry->counters.pcnt ] will then
> @@ -1626,20 +1628,33 @@ EXPORT_SYMBOL_GPL(xt_proto_fini);
>   * Rule evaluation needs to use xt_get_this_cpu_counter() helper
>   * to fetch the real percpu counter.
>   *
> + * To speed up allocation and improve data locality, an entire
> + * page is allocated.

* To speed up allocation and improve data locality, a 4KB block
* is allocated.

> + *
> + * xt_percpu_counter_alloc_state contains the base address of the
> + * allocated page and the current sub-offset.
> + *
>   * returns false on error.
>   */
> -bool xt_percpu_counter_alloc(struct xt_counters *counter)
> +bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
> +  struct xt_counters *counter)
>  {
> - void __percpu *res;
> + BUILD_BUG_ON(XT_PCPU_BLOCK_SIZE < (sizeof(*counter) * 2));
>  
> - if (nr_cpu_ids > 1) {
> - res = __alloc_percpu(sizeof(struct xt_counters),
> -  sizeof(struct xt_counters));
> + if (nr_cpu_ids <= 1)
> + return true;
>  
> - if (!res)
> + if (state->mem == NULL) {

if (!state->mem) {

> +		state->mem = __alloc_percpu(XT_PCPU_BLOCK_SIZE, XT_PCPU_BLOCK_SIZE);
> + if (!state->mem)
   (to be consistent with this test )

>   return false;
>   }
> - counter->pcnt = (__force unsigned long)res;
> + counter->pcnt = (__force unsigned long)(state->mem + state->off);
> + state->off += sizeof(*counter);
> + if (state->off > (XT_PCPU_BLOCK_SIZE - sizeof(*counter))) {
> + state->mem = NULL;
> + state->off = 0;
> + }






Re: [PATCH nf-next 2/3] netfilter: x_tables: pass xt_counters struct to counter allocator

2016-11-21 Thread Eric Dumazet
On Mon, 2016-11-21 at 14:57 +0100, Florian Westphal wrote:
> Keeps some noise away from a followup patch.
> 
> Signed-off-by: Florian Westphal <f...@strlen.de>
> ---

Acked-by: Eric Dumazet <eduma...@google.com>




Re: [PATCH nf-next 1/3] netfilter: x_tables: pass xt_counters struct instead of packet counter

2016-11-21 Thread Eric Dumazet
On Mon, 2016-11-21 at 14:57 +0100, Florian Westphal wrote:
> On SMP we overload the packet counter (unsigned long) to contain
> percpu offset.  Hide this from callers and pass xt_counters address
> instead.
> 
> Preparation patch to allocate the percpu counters in page-sized batch
> chunks.
> 
> Signed-off-by: Florian Westphal <f...@strlen.de>
> ---

Acked-by: Eric Dumazet <eduma...@google.com>





Re: netfilter question

2016-11-20 Thread Eric Dumazet
On Sun, 2016-11-20 at 09:31 -0800, Eric Dumazet wrote:

> Thanks Eric, I will test the patch myself, because I believe we need it
> asap ;)


Current net-next without Florian patch :

lpaa24:~# time for f in `seq 1 2000` ; do iptables -A FORWARD ; done

real0m12.856s
user0m0.590s
sys 0m11.131s


perf record ... ; perf report ->

47.45%  iptables  [kernel.kallsyms]  [k] pcpu_alloc_area
  
 8.49%  iptables  [kernel.kallsyms]  [k] memset_erms
  
 7.35%  iptables  [kernel.kallsyms]  [k] get_counters   
  
 2.87%  iptables  [kernel.kallsyms]  [k] __memmove  
  
 2.33%  iptables  [kernel.kallsyms]  [k] pcpu_alloc 
  
 2.07%  iptables  [kernel.kallsyms]  [k] _find_next_bit.part.0  
  
 1.62%  iptables  xtables-multi  [.] 0x0001bb9d 
  
 1.25%  iptables  [kernel.kallsyms]  [k] page_fault 
  
 1.01%  iptables  [kernel.kallsyms]  [k] memcmp 
  
 0.94%  iptables  [kernel.kallsyms]  [k] translate_table
  
 0.76%  iptables  [kernel.kallsyms]  [k] find_next_bit  
  
 0.73%  iptables  [kernel.kallsyms]  [k] filemap_map_pages  
  
 0.68%  iptables  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string 
  
 0.54%  iptables  [kernel.kallsyms]  [k] __get_user_8   
  
 0.54%  iptables  [kernel.kallsyms]  [k] clear_page_c_e

After patch :

lpaa24:~# time for f in `seq 1 2000` ; do iptables -A FORWARD ; done

real0m3.867s
user0m0.559s
sys 0m2.216s

22.15%  iptables  [kernel.kallsyms]  [k] get_counters   

 5.85%  iptables  xtables-multi  [.] 0x0001bbac 

 3.99%  iptables  [kernel.kallsyms]  [k] page_fault 

 2.37%  iptables  [kernel.kallsyms]  [k] memcmp 

 2.19%  iptables  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string 

 1.89%  iptables  [kernel.kallsyms]  [k] translate_table

 1.78%  iptables  [kernel.kallsyms]  [k] memset_erms

 1.74%  iptables  [kernel.kallsyms]  [k] clear_page_c_e 

 1.73%  iptables  [kernel.kallsyms]  [k] __get_user_8   

 1.72%  iptables  [kernel.kallsyms]  [k] perf_iterate_ctx   

 1.21%  iptables  [kernel.kallsyms]  [k] handle_mm_fault

 0.98%  iptables  [kernel.kallsyms]  [k] unmap_page_range  

So this is a huge win. And I suspect data path will also gain from all
pcpu counters being in the same area of memory (this is where I am very 
interested)




Re: netfilter question

2016-11-20 Thread Eric Dumazet
On Sun, 2016-11-20 at 12:22 -0500, Eric D wrote:
> I'm currently abroad for work and will come back home soon. I will
> test the solution and provide feedback to Florian by end of week.
> 
> Thanks for jumping on this quickly.
> 
> Eric
> 
> 
> On Nov 20, 2016 7:33 AM, "Eric Dumazet" <eric.duma...@gmail.com>
> wrote:
> On Thu, 2016-11-17 at 01:07 +0100, Florian Westphal wrote:
> 
> > + if (state->mem == NULL) {
> > + state->mem = __alloc_percpu(PAGE_SIZE,
> PAGE_SIZE);
> > + if (!state->mem)
> > + return false;
> > + }
> 
> This will fail on arches where PAGE_SIZE=65536
> 
> percpu allocator limit is PCPU_MIN_UNIT_SIZE  ( 32 KB )
> 
> So maybe use a smaller value like 4096 ?
> 
> #define XT_PCPU_BLOCK_SIZE 4096
> 
Thanks Eric, I will test the patch myself, because I believe we need it
asap ;)






Re: netfilter question

2016-11-19 Thread Eric Dumazet
On Thu, 2016-11-17 at 01:07 +0100, Florian Westphal wrote:

> + if (state->mem == NULL) {
> + state->mem = __alloc_percpu(PAGE_SIZE, PAGE_SIZE);
> + if (!state->mem)
> + return false;
> + }

This will fail on arches where PAGE_SIZE=65536

percpu allocator limit is PCPU_MIN_UNIT_SIZE  ( 32 KB )

So maybe use a smaller value like 4096 ?

#define XT_PCPU_BLOCK_SIZE 4096




Re: netfilter question

2016-11-16 Thread Eric Dumazet
On Thu, 2016-11-17 at 01:07 +0100, Florian Westphal wrote:

Seems very nice !

> +
> +void xt_percpu_counter_free(struct xt_counters *counters)
> +{
> + unsigned long pcnt = counters->pcnt;
> +
> + if (nr_cpu_ids > 1 && (pcnt & (PAGE_SIZE - 1)) == 0)
> + free_percpu((void __percpu *) (unsigned long)pcnt);
> +}


pcnt is already an "unsigned long"

This packing might also speed up "iptables -nvL" dumps ;)




Re: netfilter question

2016-11-16 Thread Eric Dumazet
On Wed, 2016-11-16 at 16:02 +0100, Florian Westphal wrote:
> Eric Dumazet <eduma...@google.com> wrote:
> > On Wed, Nov 16, 2016 at 2:22 AM, Eric Desrochers <er...@miromedia.ca> wrote:
> > > Hi Eric,
> > >
> > > My name is Eric and I'm reaching you today as I found your name in 
> > > multiple netfilter kernel commits, and was hoping we can discuss about a 
> > > potential regression.
> > >
> > > I identified (git bisect) this commit 
> > > [https://github.com/torvalds/linux/commit/71ae0dff02d756e4d2ca710b79f2ff5390029a5f]
> > >  as the one introducing a serious performance slowdown when using the 
> > > binary ip/ip6tables with a large number of policies.
> > >
> > > I also tried with the latest and greatest v4.9-rc4 mainline kernel, and 
> > > the slowdown is still present.
> > >
> > > So even commit 
> > > [https://github.com/torvalds/linux/commit/a1a56aaa0735c7821e85c37268a7c12e132415cb]
> > >  which introduce a 16 bytes alignment on xt_counter percpu allocations so 
> > > that bytes and packets sit in same cache line, doesn't have impact too.
> > >
> > >
> > > Everything I found is detailed in the following bug : 
> > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1640786
> > >
> > > Of course, I'm totally aware that "iptables-restore" should be the 
> > > favorite choice as it is way more efficient (note that using 
> > > iptables-restore doesn't exhibit the problem) but some folks still rely 
> > > on ip/ip6tables and might face this performance slowdown.
> > >
> > > I found the problem today, I will continue to investigate on my side, but 
> > > I was wondering if we could have a discussion about this subject.
> > >
> > > Thanks in advance.
> 
> [..]
> 
> > Key point is that we really care about fast path : packet processing.
> > And cited commit helps this a lot by lowering the memory foot print on
> > hosts with many cores.
> > This is a step into right direction.
> > 
> > Now we probably should batch the percpu allocations one page at a
> > time, or ask Tejun if percpu allocations could be really really fast
> > (probably much harder)
> > 
> > But really you should not use iptables one rule at a time...
> > This will never compete with iptables-restore. ;)
> > 
> > Florian, would you have time to work on a patch trying to group the
> > percpu allocations one page at a time ?
> 
> You mean something like this ? :
> xt_entry_foreach(iter, entry0, newinfo->size) {
> -   ret = find_check_entry(iter, net, repl->name, repl->size);
> -   if (ret != 0)
> +   if (pcpu_alloc == 0) {
> +			pcnt = __alloc_percpu(PAGE_SIZE, sizeof(struct xt_counters));

alignment should be a page.

> +   if (IS_ERR_VALUE(pcnt))
> +   BUG();

well. no BUG() for sure ;)

> +   }
> +
> +   iter->counters.pcnt = pcnt + pcpu_alloc;
> +   iter->counters.bcnt = !!pcpu_alloc;
> +   pcpu_alloc += sizeof(struct xt_counters);
> +
> +   if (pcpu_alloc > PAGE_SIZE - sizeof(struct xt_counters))
> +   pcpu_alloc = 0;
> +
> +   ret = find_check_entry(iter, net, repl->name, repl->size)
>  ...
> 
> This is going to be ugly since we'll have to deal with SMP vs. non-SMP (i.e. 
> no percpu allocations)
> in ip/ip6/arptables.

Time for a common helper then ...

> 
> Error unwind will also be a mess (we can abuse .bcnt to tell if pcpu offset 
> should be free'd or not).

Free if the address is aligned to a page boundary ?

Otherwise skip it, it already has been freed earlier.

> 
> But maybe I don't understand what you are suggesting :)
> Can you elaborate?

Note that this grouping will also help data locality.

I definitely have servers with a huge number of percpu allocations and I
fear we might have many TLB misses because of possible spread of
xt_counters.

Note that percpu pages must not be shared by multiple users
(ip/ip6/arptable), each table should get its own cache.



--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nf] netfilter: conntrack: refine gc worker heuristics

2016-11-01 Thread Eric Dumazet
On Tue, 2016-11-01 at 21:01 +0100, Florian Westphal wrote:
>   schedule_delayed_work(&gc_work->dwork, next_run);


> @@ -993,6 +1029,7 @@ static void gc_worker(struct work_struct *work)
>  static void conntrack_gc_work_init(struct conntrack_gc_work *gc_work)
>  {
>   INIT_DELAYED_WORK(&gc_work->dwork, gc_worker);
> + gc_work->next_gc_run = GC_INTERVAL_MAX;
>   gc_work->exiting = false;
>  }
>  
> @@ -1885,7 +1922,7 @@ int nf_conntrack_init_start(void)
>   nf_ct_untracked_status_or(IPS_CONFIRMED | IPS_UNTRACKED);
>  
>   conntrack_gc_work_init(&conntrack_gc_work);
> - schedule_delayed_work(&conntrack_gc_work.dwork, GC_INTERVAL);
> + schedule_delayed_work(&conntrack_gc_work.dwork, GC_INTERVAL_MAX);
>  
>   return 0;
>  


We might use system_long_wq instead of system_wq ?

queue_delayed_work(system_long_wq, ...)





Re: error: 'struct net_device' has no member named 'nf_hooks_ingress'

2016-10-05 Thread Eric Dumazet
On Wed, 2016-10-05 at 22:56 +0200, Michal Sojka wrote:

> this commit is now in mainline as
> e3b37f11e6e4e6b6f02cc762f182ce233d2c1c9d and it breaks my build:
> 
> net/netfilter/core.c: In function 'nf_set_hooks_head':
> net/netfilter/core.c:96:3: error: 'struct net_device' has no member named 
> 'nf_hooks_ingress'
> 
> Are the fixes (see below) on the way to mainline too?

Yes the fixes are already in nf tree and _will_ get pushed.

Pablo and David are attending netdev 1.2 in Tokyo and have obligations.

https://git.kernel.org/cgit/linux/kernel/git/pablo/nf-next.git/

Thanks.






Re: [PATCH 3/3] netfilter: xt_hashlimit: uses div_u64 for division

2016-09-30 Thread Eric Dumazet
On Fri, 2016-09-30 at 18:05 +0200, Arnd Bergmann wrote:
> The newly added support for high-resolution pps rates introduced multiple 
> 64-bit
> division operations in one function, which fails on all 32-bit architectures:
> 
> net/netfilter/xt_hashlimit.o: In function `user2credits':
> xt_hashlimit.c:(.text.user2credits+0x3c): undefined reference to 
> `__aeabi_uldivmod'
> xt_hashlimit.c:(.text.user2credits+0x68): undefined reference to 
> `__aeabi_uldivmod'
> xt_hashlimit.c:(.text.user2credits+0x88): undefined reference to 
> `__aeabi_uldivmod'
> 
> This replaces the division with an explicit call to div_u64 for version 2
> to document that this is a slow operation, and reverts back to 32-bit 
> arguments
> for the version 1 data to restore the original faster 32-bit division.
> 
> With both changes combined, we no longer get a link error.
> 
> Fixes: 11d5f15723c9 ("netfilter: xt_hashlimit: Create revision 2 to support 
> higher pps rates")
> Signed-off-by: Arnd Bergmann 
> ---
> Vishwanath Pai already sent a patch for this, and I did my version 
> independently.
> The difference is that his version also uses the more expensive division for the
> version 1 variant that doesn't need it.
> 
> See also http://patchwork.ozlabs.org/patch/676713/
> ---
>  net/netfilter/xt_hashlimit.c | 17 ++---
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
> index 44a095ecc7b7..3d5525df6eb3 100644
> --- a/net/netfilter/xt_hashlimit.c
> +++ b/net/netfilter/xt_hashlimit.c
> @@ -464,20 +464,23 @@ static u32 xt_hashlimit_len_to_chunks(u32 len)
>  static u64 user2credits(u64 user, int revision)
>  {
>   if (revision == 1) {
> + u32 user32 = user; /* use 32-bit division */
> +

This looks dangerous to me. Have you really tried all possible cases ?

Caller (even if using revision == 1) does
user2credits(cfg->avg * cfg->burst, revision);

Since this is not a fast path, I would prefer to keep the 64bit divide.

Vishwanath version looks safer.




Re: [PATCH nf-next v3 1/2] netfilter: Fix potential null pointer dereference

2016-09-28 Thread Eric Dumazet
On Wed, 2016-09-28 at 10:56 -0400, Aaron Conole wrote:
> Eric Dumazet <eric.duma...@gmail.com> writes:
> 
> > On Wed, 2016-09-28 at 09:12 -0400, Aaron Conole wrote:
> >> It's possible for nf_hook_entry_head to return NULL.  If two
> >> nf_unregister_net_hook calls happen simultaneously with a single hook
> >> entry in the list, both will enter the nf_hook_mutex critical section.
> >> The first will successfully delete the head, but the second will see
> >> this NULL pointer and attempt to dereference.
> >> 
> >> This fix ensures that no null pointer dereference could occur when such
> >> a condition happens.
> >> 
> >> Signed-off-by: Aaron Conole <acon...@bytheb.org>
> >> ---
> >>  net/netfilter/core.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >> 
> >> diff --git a/net/netfilter/core.c b/net/netfilter/core.c
> >> index 360c63d..e58e420 100644
> >> --- a/net/netfilter/core.c
> >> +++ b/net/netfilter/core.c
> >> @@ -160,7 +160,7 @@ void nf_unregister_net_hook(struct net *net, const 
> >> struct nf_hook_ops *reg)
> >>  
> >>mutex_lock(&nf_hook_mutex);
> >>hooks_entry = nf_hook_entry_head(net, reg);
> >> -  if (hooks_entry->orig_ops == reg) {
> >> +  if (hooks_entry && hooks_entry->orig_ops == reg) {
> >>nf_set_hooks_head(net, reg,
> >>  nf_entry_dereference(hooks_entry->next));
> >>goto unlock;
> >
> > When was the bug added exactly ?
> 
> Sunday, on the nf-next tree.
> 
> > For all bug fixes, you need to add a Fixes: tag.
> >
> > Like :
> >
> > Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked list")
> 
> I would but it's in nf-next tree, and I'm not sure how pulls go.  If
> they are done via patch imports, then the sha sums will be wrong and the
> commit message will be misleading.  If the sums are preserved, then I
> can resubmit with this information.
> 

I gave the (12 digits) sha-1 as present in David Miller net-next tree.

https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=e3b37f11e6e4e6b6f02cc762f182ce233d2c1c9d


This wont change, because David never rebases his tree under normal
operations.

Thanks.





Re: [PATCH nf-next v3 1/2] netfilter: Fix potential null pointer dereference

2016-09-28 Thread Eric Dumazet
On Wed, 2016-09-28 at 09:12 -0400, Aaron Conole wrote:
> It's possible for nf_hook_entry_head to return NULL.  If two
> nf_unregister_net_hook calls happen simultaneously with a single hook
> entry in the list, both will enter the nf_hook_mutex critical section.
> The first will successfully delete the head, but the second will see
> this NULL pointer and attempt to dereference.
> 
> This fix ensures that no null pointer dereference could occur when such
> a condition happens.
> 
> Signed-off-by: Aaron Conole 
> ---
>  net/netfilter/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/core.c b/net/netfilter/core.c
> index 360c63d..e58e420 100644
> --- a/net/netfilter/core.c
> +++ b/net/netfilter/core.c
> @@ -160,7 +160,7 @@ void nf_unregister_net_hook(struct net *net, const struct 
> nf_hook_ops *reg)
>  
>   mutex_lock(&nf_hook_mutex);
>   hooks_entry = nf_hook_entry_head(net, reg);
> - if (hooks_entry->orig_ops == reg) {
> + if (hooks_entry && hooks_entry->orig_ops == reg) {
>   nf_set_hooks_head(net, reg,
> nf_entry_dereference(hooks_entry->next));
>   goto unlock;

When was the bug added exactly ?

For all bug fixes, you need to add a Fixes: tag.

Like :

Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked list")

So that 100 different people in stable teams do not have to do the archeology 
themselves ...

Thanks.




Re: [PATCH] Fix link error in 32bit arch because of 64bit division

2016-09-27 Thread Eric Dumazet
On Tue, 2016-09-27 at 03:42 -0400, Vishwanath Pai wrote:
> Fix link error in 32bit arch because of 64bit division
> 
> Division of 64bit integers will cause linker error undefined reference
> to `__udivdi3'. Fix this by replacing divisions with div64_u64
> 
> Signed-off-by: Vishwanath Pai 
> 
> ---
>  net/netfilter/xt_hashlimit.c | 15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
> index 44a095e..7fc694e 100644
> --- a/net/netfilter/xt_hashlimit.c
> +++ b/net/netfilter/xt_hashlimit.c
> @@ -465,19 +465,20 @@ static u64 user2credits(u64 user, int revision)
>  {
>   if (revision == 1) {
>   /* If multiplying would overflow... */
> - if (user > 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY_v1))
> + if (user > div64_u64(0xFFFFFFFF, (HZ*CREDITS_PER_JIFFY_v1)))

How can this be needed ? 0xFFFFFFFF is 32 bits, the compiler knows how to
compute 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY_v1) itself, without using a
64 bit divide !

Please be selective.

>   /* Divide first. */
> - return (user / XT_HASHLIMIT_SCALE) *\
> + return div64_u64(user, XT_HASHLIMIT_SCALE) *\
>   HZ * CREDITS_PER_JIFFY_v1;
>  
> - return (user * HZ * CREDITS_PER_JIFFY_v1) \
> - / XT_HASHLIMIT_SCALE;
> + return div64_u64((user * HZ * CREDITS_PER_JIFFY_v1),
> +   XT_HASHLIMIT_SCALE);
>   } else {
> - if (user > 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY))
> - return (user / XT_HASHLIMIT_SCALE_v2) *\

Probably same remark here.

> + if (user > div64_u64(0xFFFFFFFF, 
> (HZ*CREDITS_PER_JIFFY)))
> + return div64_u64(user, XT_HASHLIMIT_SCALE_v2) *\
>   HZ * CREDITS_PER_JIFFY;
>  
> - return (user * HZ * CREDITS_PER_JIFFY) / XT_HASHLIMIT_SCALE_v2;
> + return div64_u64((user * HZ * CREDITS_PER_JIFFY),
> +  XT_HASHLIMIT_SCALE_v2);
>   }
>  }
>  




Re: [PATCH nf] netfilter: nf_tables: Ensure u8 attributes are loaded from u32 within the bounds

2016-09-22 Thread Eric Dumazet
On Thu, 2016-09-22 at 16:58 +0200, Pablo Neira Ayuso wrote:
> attributes")
> 
> Always use 12 bytes commit-ids. 4da449a is too short, given the number
> of changes we're getting in the kernel tree, this may become ambiguous
> at some point so it won't be unique.
> 
> You can achieve this via: git log --oneline --abbrev=12

and Documentation/SubmittingPatches has these tips :


You should also be sure to use at least the first twelve characters of the
SHA-1 ID.  The kernel repository holds a *lot* of objects, making
collisions with shorter IDs a real possibility.  Bear in mind that, even if
there is no collision with your six-character ID now, that condition may
change five years from now.

If your patch fixes a bug in a specific commit, e.g. you found an issue using
git-bisect, please use the 'Fixes:' tag with the first 12 characters of the
SHA-1 ID, and the one line summary.  For example:

Fixes: e21d2170f366 ("video: remove unnecessary platform_set_drvdata()")

The following git-config settings can be used to add a pretty format for
outputting the above style in the git log or git show commands

[core]
abbrev = 12
[pretty]
fixes = Fixes: %h (\"%s\")




Re: [PATCH nf v3] netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack

2016-09-21 Thread Eric Dumazet
On Thu, 2016-09-22 at 10:22 +0800, f...@ikuai8.com wrote:
> From: Gao Feng 
> 
> It is valid that the TCP RST packet which does not set ack flag, and bytes
> of ack number are zero. But current seqadj codes would adjust the "0" ack
> to invalid ack number. Actually seqadj need to check the ack flag before
> adjust it for these RST packets.
> 
> The following is my test case
> 
> client is 10.26.98.245, and add one iptable rule:
> iptables  -I INPUT -p tcp --sport 12345 -m connbytes --connbytes 2:
> --connbytes-dir reply --connbytes-mode packets -j REJECT --reject-with
> tcp-reset
> This iptables rule can generate a TCP RST without the ACK flag.
> 
> server:10.172.135.55
> Enable the synproxy with seqadjust by the following iptables rules
> iptables -t raw -A PREROUTING -i eth0 -p tcp -d 10.172.135.55 --dport 12345
> -m tcp --syn -j CT --notrack
> 
> iptables -A INPUT -i eth0 -p tcp -d 10.172.135.55 --dport 12345 -m conntrack
> --ctstate INVALID,UNTRACKED -j SYNPROXY --sack-perm --timestamp --wscale 7
> --mss 1460
> iptables -A OUTPUT -o eth0 -p tcp -s 10.172.135.55 --sport 12345 -m conntrack
> --ctstate INVALID,UNTRACKED -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j ACCEPT
> 
> The following is my test result.
> 
> 1. packet trace on client
> root@routers:/tmp# tcpdump -i eth0 tcp port 12345 -n
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
> IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [S], seq 3695959829,
> win 29200, options [mss 1460,sackOK,TS val 452367884 ecr 0,nop,wscale 7],
> length 0
> IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [S.], seq 546723266,
> ack 3695959830, win 0, options [mss 1460,sackOK,TS val 15643479 ecr 452367884,
> nop,wscale 7], length 0
> IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [.], ack 1, win 229,
> options [nop,nop,TS val 452367885 ecr 15643479], length 0
> IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [.], ack 1, win 226,
> options [nop,nop,TS val 15643479 ecr 452367885], length 0
> IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [R], seq 3695959830,
> win 0, length 0
> 
> 2. seqadj log on server
> [62873.867319] Adjusting sequence number from 602341895->546723267,
> ack from 3695959830->3695959830
> [62873.867644] Adjusting sequence number from 602341895->546723267,
> ack from 3695959830->3695959830
> [62873.869040] Adjusting sequence number from 3695959830->3695959830,
> ack from 0->55618628
> 
> To summarize, it is clear that the seqadj code adjusts the 0 ack when it
> receives a TCP RST packet without the ACK flag.
> 
> Signed-off-by: Gao Feng 
> ---
>  v3: Add the reproduce steps and packet trace
>  v2: Regenerate because the first patch is removed
>  v1: Initial patch
> 
>  net/netfilter/nf_conntrack_seqadj.c | 34 +++---
>  1 file changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_seqadj.c 
> b/net/netfilter/nf_conntrack_seqadj.c
> index dff0f0c..3bd9c7e 100644
> --- a/net/netfilter/nf_conntrack_seqadj.c
> +++ b/net/netfilter/nf_conntrack_seqadj.c
> @@ -179,30 +179,34 @@ int nf_ct_seq_adjust(struct sk_buff *skb,
>  
>   tcph = (void *)skb->data + protoff;
>   spin_lock_bh(&ct->lock);
> +

Please do not add style change during a bug fix.

>   if (after(ntohl(tcph->seq), this_way->correction_pos))
>   seqoff = this_way->offset_after;
>   else
>   seqoff = this_way->offset_before;
>  
> - if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
> -   other_way->correction_pos))
> - ackoff = other_way->offset_after;
> - else
> - ackoff = other_way->offset_before;
> -
>   newseq = htonl(ntohl(tcph->seq) + seqoff);
> - newack = htonl(ntohl(tcph->ack_seq) - ackoff);
> -
> - inet_proto_csum_replace4(&tcph->check, skb, tcph->seq, newseq, false);
> - inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq, newack,
> -  false);
> -
> - pr_debug("Adjusting sequence number from %u->%u, ack from %u->%u\n",
> -  ntohl(tcph->seq), ntohl(newseq), ntohl(tcph->ack_seq),
> -  ntohl(newack));
>  
> + pr_debug("Adjusting sequence number from %u->%u\n",
> +  ntohl(tcph->seq), ntohl(newseq));
>   tcph->seq = newseq;
> - tcph->ack_seq = newack;
> +
> + if (likely(tcph->ack)) {
> + if (after(ntohl(tcph->ack_seq) - other_way->offset_before,
> +   other_way->correction_pos))
> + ackoff = other_way->offset_after;
> + else
> + ackoff = other_way->offset_before;
> +
> + newack = htonl(ntohl(tcph->ack_seq) - ackoff);
> + inet_proto_csum_replace4(&tcph->check, skb, tcph->ack_seq,
> +  newack, false);
> +
> + pr_debug("Adjusting ack number from %u->%u\n",
> +  

Re: [PATCH] netfilter: xt_socket: fix transparent match for IPv6 request sockets

2016-09-20 Thread Eric Dumazet
On Tue, 2016-09-20 at 08:01 -0700, Eric Dumazet wrote:

> 
> Something like :
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 3ebf45b38bc3..0fccfd6cc258 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6264,6 +6264,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
>   /* Note: tcp_v6_init_req() might override ir_iif for link locals */
>   inet_rsk(req)->ir_iif = inet_request_bound_dev_if(sk, skb);
>  
> + ireq->no_srccheck = inet_sk(sk)->transparent;

Since there is no @ireq there, simply use :

inet_rsk(req)->no_srccheck = inet_sk(sk)->transparent;


>   af_ops->init_req(req, sk, skb);
>  



Re: [PATCH] netfilter: xt_socket: fix transparent match for IPv6 request sockets

2016-09-20 Thread Eric Dumazet
On Tue, 2016-09-20 at 15:26 +0200, KOVACS Krisztian wrote:
> The introduction of TCP_NEW_SYN_RECV state, and the addition of request
> sockets to the ehash table seems to have broken the --transparent option
> of the socket match for IPv6 (around commit a9407000).
> 
> Now that the socket lookup finds the TCP_NEW_SYN_RECV socket instead of the
> listener, the --transparent option tries to match on the no_srccheck flag
> of the request socket.
> 
> Unfortunately, that flag was only set for IPv4 sockets in tcp_v4_init_req()
> by copying the transparent flag of the listener socket. This effectively
> causes '-m socket --transparent' not match on the ACK packet sent by the
> client in a TCP handshake.
> 
> This change adds the same code initializing no_srccheck to
> tcp_v6_init_req(), making the above scenario work again.
> 
> Signed-off-by: Alex Badics 
> Signed-off-by: KOVACS Krisztian 
> ---
>  net/ipv6/tcp_ipv6.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 94f4f89..21f2e5c 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -690,6 +690,7 @@ static void tcp_v6_init_req(struct request_sock *req,
>  
>   ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
>   ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
> + ireq->no_srccheck = inet_sk(sk_listener)->transparent;
>  
>   /* So that link locals have meaning */
>   if (!sk_listener->sk_bound_dev_if &&

Thanks a lot for the bug hunting guys.

Could you add the precise tag to help stable backports ?

Fixes: 12-digit-sha1 ("patch title")

Since it is a netdev patch, I would also copy netdev@ (done here)

Also what about moving this (IPv4/IPv6 common code) before the
af_ops->init_req(req, sk, skb); call, since it is no longer family
specific ?

Something like :

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3ebf45b38bc3..0fccfd6cc258 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6264,6 +6264,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
/* Note: tcp_v6_init_req() might override ir_iif for link locals */
inet_rsk(req)->ir_iif = inet_request_bound_dev_if(sk, skb);
 
+   ireq->no_srccheck = inet_sk(sk)->transparent;
af_ops->init_req(req, sk, skb);
 
if (security_inet_conn_request(sk, skb, req))
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7158d4f8dae4..b448eb9fe1b9 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1195,7 +1195,6 @@ static void tcp_v4_init_req(struct request_sock *req,
 
sk_rcv_saddr_set(req_to_sk(req), ip_hdr(skb)->daddr);
sk_daddr_set(req_to_sk(req), ip_hdr(skb)->saddr);
-   ireq->no_srccheck = inet_sk(sk_listener)->transparent;
ireq->opt = tcp_v4_save_options(skb);
 }
 




Re: [PATCH 5/5] net/netfilter/nf_conntrack_core: update memory barriers.

2016-08-31 Thread Eric Dumazet
On Wed, 2016-08-31 at 15:42 +0200, Manfred Spraul wrote:
> As explained in commit 51d7d5205d33
> ("powerpc: Add smp_mb() to arch_spin_is_locked()"), for some architectures
> the ACQUIRE during spin_lock only applies to loading the lock, not to
> storing the lock state.
> 
> nf_conntrack_lock() does not handle this correctly:
> /* 1) Acquire the lock */
> spin_lock(lock);
> while (unlikely(nf_conntrack_locks_all)) {
> spin_unlock(lock);
> 
> spinlock_store_acquire() is missing between spin_lock and reading
> nf_conntrack_locks_all. In addition, reading nf_conntrack_locks_all
> needs ACQUIRE memory ordering.
> 
> 2nd, minor issue: if there were many nf_conntrack_all_lock() callers,
> then nf_conntrack_lock() could loop forever.
> 
> Therefore: Change nf_conntrack_lock and nf_conntract_lock_all() to the
> approach used by ipc/sem.c:
> 
> - add spinlock_store_acquire()
> - add smp_load_acquire()
> - for nf_conntrack_lock, use spin_lock(&nf_conntrack_locks_all_lock) instead of
>   spin_unlock_wait(&nf_conntrack_locks_all_lock) and loop backward.
> - use smp_store_mb() instead of a raw smp_mb()
> 
> Signed-off-by: Manfred Spraul 
> Cc: Pablo Neira Ayuso 
> Cc: netfilter-devel@vger.kernel.org
> 
> ---
> 
> Question: Should I split this patch?
> First a patch that uses smp_mb(), with Cc: stable.
> The replace the smp_mb() with spinlock_store_acquire, not for stable

I guess it all depends on stable backports you believe are needed.

You probably should add the tags :
Fixes: <12-digit-sha1> "patch title"
that introduced the bug(s) you fix.

By doing this archaeological research you will likely have a better
answer ?

Thanks !




Re: [PATCH v3 nf-next 5/7] netfilter: conntrack: add gc worker to remove timed-out entries

2016-08-25 Thread Eric Dumazet
On Thu, 2016-08-25 at 15:33 +0200, Florian Westphal wrote:
> Conntrack gc worker to evict stale entries.
> 
> GC happens once every 5 seconds, but we only scan at most 1/64th of the
> table (and not more than 8k) buckets to avoid hogging cpu.
> 
> This means that a complete scan of the table will take several minutes
> of wall-clock time.
> 
> Considering that the gc run will never have to evict any entries
> during normal operation because those will happen from packet path
> this should be fine.
> 
> We only need gc to make sure userspace (conntrack event listeners)
> eventually learn of the timeout, and for resource reclaim in case the
> system becomes idle.
> 
> We do not disable BH and cond_resched for every bucket so this should
> not introduce noticeable latencies either.
> 
> A followup patch will add a small change to speed up GC for the extreme
> case where most entries are timed out on an otherwise idle system.
> 
> v2: Use cond_resched_rcu_qs & add comment wrt. missing restart on
> nulls value change in gc worker, suggested by Eric Dumazet.
> 
> v3: don't call cancel_delayed_work_sync twice (again, Eric).
> 
> Signed-off-by: Florian Westphal <f...@strlen.de>
> ---
>  net/netfilter/nf_conntrack_core.c | 76 
> +++
>  1 file changed, 76 insertions(+)

Acked-by: Eric Dumazet <eduma...@google.com>




Re: [PATCH v2 nf-next 4/7] netfilter: conntrack: add gc worker to remove timed-out entries

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 22:11 +0200, Florian Westphal wrote:
> Eric Dumazet <eric.duma...@gmail.com> wrote:
> > On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> > > Conntrack gc worker to evict stale entries.
> > 
> > 
> > >  static struct nf_conn *
> > >  __nf_conntrack_alloc(struct net *net,
> > >const struct nf_conntrack_zone *zone,
> > > @@ -1527,6 +1597,7 @@ static int untrack_refs(void)
> > >  
> > >  void nf_conntrack_cleanup_start(void)
> > >  {
> > > + conntrack_gc_work.exiting = true;
> > >   RCU_INIT_POINTER(ip_ct_attach, NULL);
> > >  }
> > >  
> > > @@ -1536,6 +1607,9 @@ void nf_conntrack_cleanup_end(void)
> > >   while (untrack_refs() > 0)
> > >   schedule();
> > >  
> > > + cancel_delayed_work_sync(&conntrack_gc_work.dwork);
> > > + /* can be re-scheduled once */
> > 
> > Are you sure ?
> > 
> > As conntrack_gc_work.exiting = true, I do not see how this can happen ?
> 
> nf_conntrack_cleanup_start() sets exiting = true
> 
> current cpu blocks in
> 
> cancel_delayed_work_sync(&conntrack_gc_work.dwork);
> 
> Iff the work queue was running on other cpu but was already past
> gc_work->exiting check then when cancel_delayed_work_sync() (first one)
> returns it will have re-armed itself via schedule_delayed_work().
> 
> So I think the 2nd cancel_delayed_work_sync is needed.
> 
> Let me know if you'd like to see a v3 with more verbose
> comment about this.

If you were using cancel_delayed_work() (instead of
cancel_delayed_work_sync()) I would understand the concern.

But here you are using the thing that was designed to exactly avoid the
issue, of both work running on another cpu and/or re-arming itself.

If what you are saying was true, we would have to fix hundreds of
cancel_delayed_work_sync() call sites ...





Re: [PATCH v2 nf-next 4/7] netfilter: conntrack: add gc worker to remove timed-out entries

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> Conntrack gc worker to evict stale entries.


>  static struct nf_conn *
>  __nf_conntrack_alloc(struct net *net,
>const struct nf_conntrack_zone *zone,
> @@ -1527,6 +1597,7 @@ static int untrack_refs(void)
>  
>  void nf_conntrack_cleanup_start(void)
>  {
> + conntrack_gc_work.exiting = true;
>   RCU_INIT_POINTER(ip_ct_attach, NULL);
>  }
>  
> @@ -1536,6 +1607,9 @@ void nf_conntrack_cleanup_end(void)
>   while (untrack_refs() > 0)
>   schedule();
>  
> + cancel_delayed_work_sync(&conntrack_gc_work.dwork);
> + /* can be re-scheduled once */

Are you sure ?

As conntrack_gc_work.exiting = true, I do not see how this can happen ?

> + cancel_delayed_work_sync(&conntrack_gc_work.dwork);
>   nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size);
>  
>   nf_conntrack_proto_fini();
> @@ -1810,6 +1884,10 @@ int nf_conntrack_init_start(void)
>   }
>   /*  - and look it like as a confirmed connection */
>   nf_ct_untracked_status_or(IPS_CONFIRMED | IPS_UNTRACKED);
> +
> + conntrack_gc_work_init(&conntrack_gc_work);
> + schedule_delayed_work(&conntrack_gc_work.dwork, GC_INTERVAL);
> +
>   return 0;
>  
>  err_proto:




Re: [PATCH v2 nf-next 2/7] netfilter: conntrack: get rid of conntrack timer

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
> Eric Dumazet pointed out during netfilter workshop 2016.

Another reason was the fact that Thomas was about to change max timer
range : 

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=500462a9de657f86edaa102f8ab6bff7f7e43fc2

Acked-by: Eric Dumazet <eduma...@google.com>

Thanks !




Re: [PATCH v2 nf-next 1/7] netfilter: don't rely on DYING bit to detect when destroy event was sent

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> The reliable event delivery mode currently (ab)uses the DYING bit to
> detect which entries on the dying list have to be skipped when
> re-delivering events from the eache worker in reliable event mode.
> 
> Currently when we delete the conntrack from main table we only set this
> bit if we could also deliver the netlink destroy event to userspace.
> 
> If we fail we move it to the dying list, the ecache worker will
> reattempt event delivery for all confirmed conntracks on the dying list
> that do not have the DYING bit set.
> 
> Once timer is gone, we can no longer use if (del_timer()) to detect
> when we 'stole' the reference count owned by the timer/hash entry, so
> we need some other way to avoid racing with other cpu.
> 
> Pablo suggested to add a marker in the ecache extension that skips

Acked-by: Eric Dumazet <eduma...@google.com>




Re: [PATCH nf-next 7/7] netfilter: restart search if moved to other chain

2016-08-24 Thread Eric Dumazet
On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> In case nf_conntrack_tuple_taken did not find a conflicting entry
> check that all entries in this hash slot were tested and restart
> in case an entry was moved to another chain.
> 
> Reported-by: Eric Dumazet <eduma...@google.com>
> Signed-off-by: Florian Westphal <f...@strlen.de>
> ---

Fixes: ea781f197d6a ("netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get 
rid of call_rcu()")
Acked-by: Eric Dumazet <eduma...@google.com>

Thanks !




Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer

2016-08-21 Thread Eric Dumazet
On Fri, 2016-08-19 at 18:04 +0200, Florian Westphal wrote:
> Eric Dumazet <eric.duma...@gmail.com> wrote:
> > On Fri, 2016-08-19 at 17:16 +0200, Florian Westphal wrote:
> > 
> > > Hmm, nf_conntrack_find caller needs to hold rcu_read_lock,
> > > in case object is free'd SLAB_DESTROY_BY_RCU should delay actual release
> > > of the page.
> > 
> > Well, point is that SLAB_DESTROY_BY_RCU means that we have no grace
> > period, and object can be immediately reused and recycled.
> > 
> > @next pointer can definitely be overwritten.
> 
> I see.  Isn't that detected by the nulls magic (to restart
> lookup if entry was moved to other chain due to overwritten next pointer)?

Well, you did not add the nulls magic in your code ;)

It might be fine, since it should be a rare event, and garbage
collection is best effort, so you might add a comment in gc_worker() why
it is probably overkill to restart the loop in this unlikely event.

BTW, maybe nf_conntrack_tuple_taken() should get the nulls magic check,
as it is currently missing.






Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer

2016-08-19 Thread Eric Dumazet
On Fri, 2016-08-19 at 17:16 +0200, Florian Westphal wrote:

> Hmm, nf_conntrack_find caller needs to hold rcu_read_lock,
> in case object is free'd SLAB_DESTROY_BY_RCU should delay actual release
> of the page.

Well, point is that SLAB_DESTROY_BY_RCU means that we have no grace
period, and object can be immediately reused and recycled.

@next pointer can definitely be overwritten.

> 
> Should be same as (old) 'death_by_timeout' timer firing during
> hlist_nulls_for_each_entry_rcu walk.
> 
> Thanks for looking at this Eric!




Re: [PATCH nf-next 4/6] netfilter: conntrack: add gc worker to remove timed-out entries

2016-08-19 Thread Eric Dumazet
On Fri, 2016-08-19 at 13:36 +0200, Florian Westphal wrote:
> Conntrack gc worker to evict stale entries.

...

> +
> + hlist_nulls_for_each_entry_rcu(h, n, &nf_conntrack_hash[i], hnnode) {
> + tmp = nf_ct_tuplehash_to_ctrack(h);
> +
> + if (nf_ct_is_expired(tmp)) {
> + nf_ct_gc_expired(tmp);
> + expired_count++;
> + continue;

Same remark about hlist_nulls_for_each_entry_rcu() not 'safe' here

> + }
> + }
> +
> + rcu_read_unlock();
> + cond_resched();

This could use cond_resched_rcu_qs() 

> + } while (++buckets < goal &&
> +  expired_count < GC_MAX_EVICTS);
> +
> + if (gc_work->exiting)
> + return;
> +



Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer

2016-08-19 Thread Eric Dumazet
On Fri, 2016-08-19 at 13:36 +0200, Florian Westphal wrote:
> With stats enabled this eats 80 bytes on x86_64 per nf_conn entry.
> 
> Remove it and use a 32bit jiffies value containing timestamp until
> entry is valid.

Great work !

...

> +/* caller must hold rcu readlock and none of the nf_conntrack_locks */
> +static void nf_ct_gc_expired(struct nf_conn *ct)
> +{
> + if (!atomic_inc_not_zero(&ct->ct_general.use))
> + return;
> +
> + if (nf_ct_should_gc(ct))
> + nf_ct_kill(ct);
> +
> + nf_ct_put(ct);
> +}
> +
>  /*
>   * Warning :
>   * - Caller must take a reference on returned object
> @@ -499,6 +505,17 @@ begin:
>   bucket = reciprocal_scale(hash, hsize);
>  
>   hlist_nulls_for_each_entry_rcu(h, n, &nf_conntrack_hash[bucket], hnnode) {
> + struct nf_conn *ct;
> +
> + ct = nf_ct_tuplehash_to_ctrack(h);
> + if (nf_ct_is_expired(ct)) {
> + nf_ct_gc_expired(ct);
> + continue;
> + }
> +
> + if (nf_ct_is_dying(ct))
> + continue;
> +
>   if (nf_ct_key_equal(h, tuple, zone, net)) {
>   NF_CT_STAT_INC_ATOMIC(net, found);
>   return h;

Florian, I do not see how this part is safe against concurrent lookups
and deletes ?

At least the hlist_nulls_for_each_entry_rcu() looks buggy, since
fetching the next pointer would trigger a use after free ?

Thanks !





[PATCH net] netfilter: tproxy: properly refcount tcp listeners

2016-08-17 Thread Eric Dumazet
From: Eric Dumazet <eduma...@google.com>

inet_lookup_listener() and inet6_lookup_listener() no longer
take a reference on the found listener.

This minimal patch adds back the refcounting, but we might do
this differently in net-next later.

Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Reported-and-tested-by: Denys Fedoryshchenko <nuclear...@nuclearcat.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
Note: bug added in 4.7, stable candidate.

 net/netfilter/xt_TPROXY.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/netfilter/xt_TPROXY.c b/net/netfilter/xt_TPROXY.c
index 7f4414d26a66..663c4c3c9072 100644
--- a/net/netfilter/xt_TPROXY.c
+++ b/net/netfilter/xt_TPROXY.c
@@ -127,6 +127,8 @@ nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb, void *hp,
daddr, dport,
in->ifindex);
 
+   if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
+   sk = NULL;
/* NOTE: we return listeners even if bound to
 * 0.0.0.0, those are filtered out in
 * xt_socket, since xt_TPROXY needs 0 bound
@@ -195,6 +197,8 @@ nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff, void *hp,
   daddr, ntohs(dport),
   in->ifindex);
 
+   if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
+   sk = NULL;
/* NOTE: we return listeners even if bound to
 * 0.0.0.0, those are filtered out in
 * xt_socket, since xt_TPROXY needs 0 bound




Re: kernel panic TPROXY , vanilla 4.7.1

2016-08-17 Thread Eric Dumazet
On Wed, 2016-08-17 at 19:44 +0300, Denys Fedoryshchenko wrote:

> Yes, everything fine after patch!
> Thanks a lot

Perfect, thanks for testing, I am sending the official patch.




Re: kernel panic TPROXY , vanilla 4.7.1

2016-08-17 Thread Eric Dumazet
On Wed, 2016-08-17 at 08:42 -0700, Eric Dumazet wrote:
> On Wed, 2016-08-17 at 17:31 +0300, Denys Fedoryshchenko wrote:
> > Hi!
> > 
> > Tried to run squid on latest kernel, and hit a panic
> > Sometimes it just shows warning in dmesg (but doesnt work properly)
> > [   75.701666] IPv4: Attempt to release TCP socket in state 10 
> > 88102d430780
> > [   83.866974] squid (2700) used greatest stack depth: 12912 bytes left
> > [   87.506644] IPv4: Attempt to release TCP socket in state 10 
> > 880078a48780
> > [  114.704295] IPv4: Attempt to release TCP socket in state 10 
> > 881029f8ad00
> > 
> > I cannot catch yet oops/panic message, netconsole not working.
> > 
> > After triggering warning message 3 times, i am unable to run squid 
> > anymore (without reboot), and in netstat it doesnt show port running.
> > 
> > firewall is:
> > *mangle
> > -A PREROUTING -p tcp -m socket -j DIVERT
> > -A PREROUTING -p tcp -m tcp --dport 80 -i eno1 -j TPROXY --on-port 3129 
> > --on-ip 0.0.0.0 --tproxy-mark 0x1/0x1
> > -A DIVERT -j MARK --set-xmark 0x1/0x
> > -A DIVERT -j ACCEPT
> > 
> > routing
> > ip rule add fwmark 1 lookup 100
> > ip route add local default dev eno1 table 100
> > 
> > 
> > squid config is default with tproxy option
> > http_port 3129 tproxy
> > 
> 
> Hmppff... sorry for this, I will send a fix.
> 
> Thanks for the report !
> 


Could you try the following ?

Thanks !

 net/netfilter/xt_TPROXY.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/netfilter/xt_TPROXY.c b/net/netfilter/xt_TPROXY.c
index 7f4414d26a66..663c4c3c9072 100644
--- a/net/netfilter/xt_TPROXY.c
+++ b/net/netfilter/xt_TPROXY.c
@@ -127,6 +127,8 @@ nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb, void *hp,
daddr, dport,
in->ifindex);
 
+   if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
+   sk = NULL;
/* NOTE: we return listeners even if bound to
 * 0.0.0.0, those are filtered out in
 * xt_socket, since xt_TPROXY needs 0 bound
@@ -195,6 +197,8 @@ nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff, void *hp,
   daddr, ntohs(dport),
   in->ifindex);
 
+   if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
+   sk = NULL;
/* NOTE: we return listeners even if bound to
 * 0.0.0.0, those are filtered out in
 * xt_socket, since xt_TPROXY needs 0 bound




Re: PROBLEM: TPROXY and DNAT broken (bisected to 079096f103fa)

2016-07-27 Thread Eric Dumazet
On Wed, 2016-07-27 at 18:19 +, Brandon Cazander wrote:
> [1.] One line summary of the problem:
> Using TPROXY together with a DNAT rule (working on older kernels) fails to 
> work on newer kernels as of commit 079096f103fa
> 
> [2.] Full description of the problem/report:
> I performed a git bisect using a qemu image to test my example below, and the 
> bisect ended at this commit:
> 
> > commit 079096f103faca2dd87342cca6f23d4b34da8871
> > Author: Eric Dumazet <eduma...@google.com>
> > Date:   Fri Oct 2 11:43:32 2015 -0700
> > 
> > tcp/dccp: install syn_recv requests into ehash table
> 
> [3.] Keywords: networking
> 
> [4.] Kernel information
> [4.1.] Kernel version (from /proc/version):
> Everything as of commit 079096f103fa (tested up to 4.5.0)
> 
> [4.2.] Kernel .config file:
> When performing the bisect, I built with make oldconfig. Let me know if you 
> want the whole .config file.
> 
> [5.] Most recent kernel version which did not have the bug:
> Any kernel that I built prior to commit 
> 079096f103faca2dd87342cca6f23d4b34da8871 did not have this issue.
> 
> [6.] no Oops
> 
> [7.] A small shell script or example program which triggers the
>  problem (if possible)
> 
> I have produced what I hope is a minimal example, using the instructions for 
> TPROXY from 
> http://lxr.linux.no/#linux+v3.10/Documentation/networking/tproxy.txt and an 
> example transparent TCP proxy written in C that I found at 
> https://github.com/kristrev/tproxy-example.
> 
> * I have a machine ("ROUTER") with 10.100.0.164/24 on eth0, and 
> 192.168.30.2/24 on eth1. This is running the tproxy-example program, with the 
> following rules:
> iptables -t mangle -N DIVERT
> iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
> iptables -t mangle -A DIVERT -j MARK --set-mark 1
> iptables -t mangle -A DIVERT -j ACCEPT
> iptables -t mangle -A PREROUTING -p tcp --dport 8080 -j TPROXY 
> --tproxy-mark 0x1/0x1 --on-port 9876
> iptables -t nat -I PREROUTING -i eth0 -d 42.0.1.1 -j DNAT --to-dest 
> 192.168.30.1
> ip rule add fwmark 1 lookup 100
> ip route add local 0.0.0.0/0 dev lo table 100
> 
> * There is a machine ("WEBSERVER") at 192.168.30.1/24 hosting a webserver on 
> port 8080.
> 
> * My workstation is at 10.100.0.206, and I have a static route for both 
> 192.168.30.2 and 42.0.1.1 via 10.100.0.164.
> 
> * Making a curl request to 192.168.30.2:8080 hits the transparent proxy and 
> works in both GOOD (before the aforementioned commit) kernel, and BAD (at the 
> commit or later) kernel.
> 
> * Making a curl request to 42.0.1.1:8080 hits the transparent proxy and works 
> in GOOD kernel but in BAD kernel I get:
> "curl: (56) Recv failure: Connection reset by peer"
> 
> * When it fails, no traffic hits the WEBSERVER. A tcpdump on the bad kernel 
> shows:
> root@dons-qemu-new-kernel:~# tcpdump -niany tcp and port 8080
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 
> bytes
> 16:42:31.551952 IP 10.100.0.206.35562 > 42.0.1.1.8080: Flags [S], seq 
> 3793582216, win 29200, options [mss 1460,sackOK,TS val 632068656 ecr 
> 0,nop,wscale 7], length 0
> 16:42:31.551988 IP 42.0.1.1.8080 > 10.100.0.206.35562: Flags [S.], seq 
> 4042636216, ack 3793582217, win 28960, options [mss 1460,sackOK,TS val 745382 
> ecr 632068656,nop,wscale 7], length 0
> 16:42:31.55 IP 10.100.0.206.35562 > 42.0.1.1.8080: Flags [.], ack 1, 
> win 229, options [nop,nop,TS val 632068657 ecr 745382], length 0
> 16:42:31.552238 IP 42.0.1.1.8080 > 10.100.0.206.35562: Flags [R], seq 
> 4042636217, win 0, length 0
> 16:42:31.552246 IP 10.100.0.206.35562 > 42.0.1.1.8080: Flags [P.], seq 
> 1:78, ack 1, win 229, options [nop,nop,TS val 632068657 ecr 745382], length 77
> 16:42:31.552251 IP 42.0.1.1.8080 > 10.100.0.206.35562: Flags [R], seq 
> 4042636217, win 0, length 0
> 16:42:32.551668 IP 42.0.1.1.8080 > 10.100.0.206.35562: Flags [S.], seq 
> 4042636216, ack 3793582217, win 28960, options [mss 1460,sackOK,TS val 745632 
> ecr 632068656,nop,wscale 7], length 0
> 16:42:32.551925 IP 10.100.0.206.35562 > 42.0.1.1.8080: Flags [R], seq 
> 3793582217, win 0, length 0
> 16:42:34.551668 IP 42.0.1.1.8080 > 10.100.0.206.35562: Flags [S.], seq 
> 4042636216, ack 3793582217, win 28960, options [mss 1460,sackOK,TS val 746132 
> ecr 632068656,nop,wscale 7], length 0
> 16:42:34.551995 IP 10.100.0.206.35562 > 42.0.1.1.8080: Flags [R], seq 
> 3793582217, win 0, length 0
> 
> * A tcpdump on a GOOD kernel shows:

Re: [RFC 5/7] net: Add allocation flag to rtnl_unicast()

2016-07-07 Thread Eric Dumazet
On Fri, 2016-07-08 at 12:15 +0900, Masashi Honma wrote:
> Thanks for comment.
> 
> I have selected GFP flags based on existing code.
> 
> I have selected GFP_ATOMIC in inet6_netconf_get_devconf() because
> skb was allocated with GFP_ATOMIC.

Point is : we should remove GFP_ATOMIC uses as much as we can.

Everytime we see one of them, we should think why it was added
and if this is really needed.

inet6_netconf_get_devconf() is a perfect example of one careless
GFP_ATOMIC usage:

https://patchwork.ozlabs.org/patch/646291/







