Re: [patch 1/1] net/netfilter/x_tables.c: make allocation less aggressive

2018-01-30 Thread Eric Dumazet
On Tue, 2018-01-30 at 11:30 -0800, a...@linux-foundation.org wrote:
> From: Michal Hocko 
> Subject: net/netfilter/x_tables.c: make allocation less aggressive
> 
> syzbot has noticed that xt_alloc_table_info can allocate a lot of memory. 
> This is an admin only interface but an admin in a namespace is sufficient
> as well.  eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in
> xt_alloc_table_info()") has changed the opencoded kmalloc->vmalloc
> fallback into kvmalloc.  It has dropped __GFP_NORETRY on the way because
> vmalloc has simply never fully supported __GFP_NORETRY semantic.  This is
> still the case because e.g.  page tables backing the vmalloc area are
> hardcoded GFP_KERNEL.
> 
> Revert back to __GFP_NORETRY as a poors man defence against excessively
> large allocation request here.  We will not rule out the OOM killer
> completely but __GFP_NORETRY should at least stop the large request in
> most cases.
> 
> [a...@linux-foundation.org: coding-style fixes]
> Fixes: eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in 
> xt_alloc_tableLink: 
> http://lkml.kernel.org/r/20180130140104.ge21...@dhcp22.suse.cz
> Signed-off-by: Michal Hocko 
> Acked-by: Florian Westphal 
> Reviewed-by: Andrew Morton 
> Cc: David S. Miller 
> Signed-off-by: Andrew Morton 
> ---
> 
>  net/netfilter/x_tables.c |7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff -puN 
> net/netfilter/x_tables.c~net-netfilter-x_tablesc-make-allocation-less-aggressive
>  net/netfilter/x_tables.c
> --- 
> a/net/netfilter/x_tables.c~net-netfilter-x_tablesc-make-allocation-less-aggressive
> +++ a/net/netfilter/x_tables.c
> @@ -1008,7 +1008,12 @@ struct xt_table_info *xt_alloc_table_inf
>   if ((size >> PAGE_SHIFT) + 2 > totalram_pages)
>   return NULL;
>  
> - info = kvmalloc(sz, GFP_KERNEL);
> + /* __GFP_NORETRY is not fully supported by kvmalloc but it should
> +  * work reasonably well if sz is too large and bail out rather
> +  * than shoot all processes down before realizing there is nothing
> +  * more to reclaim.
> +  */
> + info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
>   if (!info)
>   return NULL;


How is __GFP_NORETRY working exactly ?

Surely, if some firewall tools attempt to load a new iptables rules, we
do not want to abort them if the request can be satisfied after few
pages moved on swap or written back to disk.

We want to avoid huge allocations, but leave reasonable ones succeed.

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 1/1] net/netfilter/x_tables.c: make allocation less aggressive

2018-01-30 Thread akpm
From: Michal Hocko 
Subject: net/netfilter/x_tables.c: make allocation less aggressive

syzbot has noticed that xt_alloc_table_info can allocate a lot of memory. 
This is an admin only interface but an admin in a namespace is sufficient
as well.  eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in
xt_alloc_table_info()") has changed the opencoded kmalloc->vmalloc
fallback into kvmalloc.  It has dropped __GFP_NORETRY on the way because
vmalloc has simply never fully supported __GFP_NORETRY semantic.  This is
still the case because e.g.  page tables backing the vmalloc area are
hardcoded GFP_KERNEL.

Revert back to __GFP_NORETRY as a poors man defence against excessively
large allocation request here.  We will not rule out the OOM killer
completely but __GFP_NORETRY should at least stop the large request in
most cases.

[a...@linux-foundation.org: coding-style fixes]
Fixes: eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in 
xt_alloc_tableLink: 
http://lkml.kernel.org/r/20180130140104.ge21...@dhcp22.suse.cz
Signed-off-by: Michal Hocko 
Acked-by: Florian Westphal 
Reviewed-by: Andrew Morton 
Cc: David S. Miller 
Signed-off-by: Andrew Morton 
---

 net/netfilter/x_tables.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff -puN 
net/netfilter/x_tables.c~net-netfilter-x_tablesc-make-allocation-less-aggressive
 net/netfilter/x_tables.c
--- 
a/net/netfilter/x_tables.c~net-netfilter-x_tablesc-make-allocation-less-aggressive
+++ a/net/netfilter/x_tables.c
@@ -1008,7 +1008,12 @@ struct xt_table_info *xt_alloc_table_inf
if ((size >> PAGE_SHIFT) + 2 > totalram_pages)
return NULL;
 
-   info = kvmalloc(sz, GFP_KERNEL);
+   /* __GFP_NORETRY is not fully supported by kvmalloc but it should
+* work reasonably well if sz is too large and bail out rather
+* than shoot all processes down before realizing there is nothing
+* more to reclaim.
+*/
+   info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
if (!info)
return NULL;
 
_
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-30 Thread Andrew Morton
On Tue, 30 Jan 2018 15:01:04 +0100 Michal Hocko  wrote:

> > Well, this is not about syzkaller, it merely pointed out a potential
> > DoS... And that has to be addressed somehow.
> 
> So how about this?
> ---

argh ;)

> >From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Tue, 30 Jan 2018 14:51:07 +0100
> Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive
> 
> syzbot has noticed that xt_alloc_table_info can allocate a lot of
> memory. This is an admin only interface but an admin in a namespace
> is sufficient as well. eacd86ca3b03 ("net/netfilter/x_tables.c: use
> kvmalloc() in xt_alloc_table_info()") has changed the opencoded
> kmalloc->vmalloc fallback into kvmalloc. It has dropped __GFP_NORETRY on
> the way because vmalloc has simply never fully supported __GFP_NORETRY
> semantic. This is still the case because e.g. page tables backing the
> vmalloc area are hardcoded GFP_KERNEL.
> 
> Revert back to __GFP_NORETRY as a poors man defence against excessively
> large allocation request here. We will not rule out the OOM killer
> completely but __GFP_NORETRY should at least stop the large request
> in most cases.
> 
> Fixes: eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in 
> xt_alloc_table_info()")
> Signed-off-by: Michal Hocko 
> ---
>  net/netfilter/x_tables.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index d8571f414208..a5f5c29bcbdc 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1003,7 +1003,13 @@ struct xt_table_info *xt_alloc_table_info(unsigned int 
> size)
>   if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
>   return NULL;

offtopic: preceding comment here is "prevent them from hitting BUG() in
vmalloc.c".  I suspect this is ancient code and vmalloc sure as heck
shouldn't go BUG with this input.  And it should be using `sz' ;)

So I suspect and hope that this code can be removed.  If not, let's fix
vmalloc!

> - info = kvmalloc(sz, GFP_KERNEL);
> + /*
> +  * __GFP_NORETRY is not fully supported by kvmalloc but it should
> +  * work reasonably well if sz is too large and bail out rather
> +  * than shoot all processes down before realizing there is nothing
> +  * more to reclaim.
> +  */
> + info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
>   if (!info)
>   return NULL;

checkpatch sayeth

networking block comments don't use an empty /* line, use /* Comment...

So I'll do that and shall scoot the patch Davewards.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net] netfilter: on sockopt() acquire sock lock only in the required scope

2018-01-30 Thread Paolo Abeni
Syzbot reported several deadlocks in the netfilter area caused by
rtnl lock and socket lock being acquired with a different order on
different code paths, leading to backtraces like the following one:

==
WARNING: possible circular locking dependency detected
4.15.0-rc9+ #212 Not tainted
--
syzkaller041579/3682 is trying to acquire lock:
  (sk_lock-AF_INET6){+.+.}, at: [<8775e4dd>] lock_sock
include/net/sock.h:1463 [inline]
  (sk_lock-AF_INET6){+.+.}, at: [<8775e4dd>]
do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167

but task is already holding lock:
  (rtnl_mutex){+.+.}, at: [<4342eaa9>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (rtnl_mutex){+.+.}:
__mutex_lock_common kernel/locking/mutex.c:756 [inline]
__mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
register_netdevice_notifier+0xad/0x860 net/core/dev.c:1607
tee_tg_check+0x1a0/0x280 net/netfilter/xt_TEE.c:106
xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:845
check_target net/ipv6/netfilter/ip6_tables.c:538 [inline]
find_check_entry.isra.7+0x935/0xcf0
net/ipv6/netfilter/ip6_tables.c:580
translate_table+0xf52/0x1690 net/ipv6/netfilter/ip6_tables.c:749
do_replace net/ipv6/netfilter/ip6_tables.c:1165 [inline]
do_ip6t_set_ctl+0x370/0x5f0 net/ipv6/netfilter/ip6_tables.c:1691
nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
ipv6_setsockopt+0x115/0x150 net/ipv6/ipv6_sockglue.c:928
udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
SYSC_setsockopt net/socket.c:1849 [inline]
SyS_setsockopt+0x189/0x360 net/socket.c:1828
entry_SYSCALL_64_fastpath+0x29/0xa0

-> #0 (sk_lock-AF_INET6){+.+.}:
lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3914
lock_sock_nested+0xc2/0x110 net/core/sock.c:2780
lock_sock include/net/sock.h:1463 [inline]
do_ipv6_setsockopt.isra.8+0x3c5/0x39d0 net/ipv6/ipv6_sockglue.c:167
ipv6_setsockopt+0xd7/0x150 net/ipv6/ipv6_sockglue.c:922
udpv6_setsockopt+0x45/0x80 net/ipv6/udp.c:1422
sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2978
SYSC_setsockopt net/socket.c:1849 [inline]
SyS_setsockopt+0x189/0x360 net/socket.c:1828
entry_SYSCALL_64_fastpath+0x29/0xa0

other info that might help us debug this:

  Possible unsafe locking scenario:

CPU0CPU1

   lock(rtnl_mutex);
lock(sk_lock-AF_INET6);
lock(rtnl_mutex);
   lock(sk_lock-AF_INET6);

  *** DEADLOCK ***

1 lock held by syzkaller041579/3682:
  #0:  (rtnl_mutex){+.+.}, at: [<4342eaa9>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74

The problem, as Florian noted, is that nf_setsockopt() is always
called with the socket held, even if the lock itself is required only
for very tight scopes and only for some operation.

This patch addresses the issues moving the lock_sock() call only
where really needed, namely in ipv*_getorigdst(), so that nf_setsockopt()
does not need anymore to acquire both locks.

Fixes: 22265a5c3c10 ("netfilter: xt_TEE: resolve oif using netdevice notifiers")
Reported-by: syzbot+a4c2dc980ac1af699...@syzkaller.appspotmail.com
Suggested-by: Florian Westphal 
Signed-off-by: Paolo Abeni 
---
 net/ipv4/ip_sockglue.c | 14 --
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |  6 +-
 net/ipv6/ipv6_sockglue.c   | 17 +
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c | 18 --
 4 files changed, 26 insertions(+), 29 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 60fb1eb7d7d8..c7df4969f80a 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -1251,11 +1251,8 @@ int ip_setsockopt(struct sock *sk, int level,
if (err == -ENOPROTOOPT && optname != IP_HDRINCL &&
optname != IP_IPSEC_POLICY &&
optname != IP_XFRM_POLICY &&
-   !ip_mroute_opt(optname)) {
-   lock_sock(sk);
+   !ip_mroute_opt(optname))
err = nf_setsockopt(sk, PF_INET, optname, optval, optlen);
-   release_sock(sk);
-   }
 #endif
return err;
 }
@@ -1280,12 +1277,9 @@ int compat_ip_setsockopt(struct sock *sk, int level, int 
optname,
if (err == -ENOPROTOOPT && optname != IP_HDRINCL &&
   

Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-30 Thread Michal Hocko
On Tue 30-01-18 15:01:11, Florian Westphal wrote:
> > From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001
> > From: Michal Hocko 
> > Date: Tue, 30 Jan 2018 14:51:07 +0100
> > Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive
> 
> Acked-by: Florian Westphal 

Thanks! How should we route this change? Andrew, David?

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-30 Thread Florian Westphal
> From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Tue, 30 Jan 2018 14:51:07 +0100
> Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive

Acked-by: Florian Westphal 
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-30 Thread Michal Hocko
On Tue 30-01-18 10:57:39, Michal Hocko wrote:
> On Tue 30-01-18 10:02:34, Dmitry Vyukov wrote:
> > On Tue, Jan 30, 2018 at 9:28 AM, Kirill A. Shutemov
> >  wrote:
> > > On Tue, Jan 30, 2018 at 09:11:27AM +0100, Florian Westphal wrote:
> > >> Michal Hocko  wrote:
> > >> > On Mon 29-01-18 23:35:22, Florian Westphal wrote:
> > >> > > Kirill A. Shutemov  wrote:
> > >> > [...]
> > >> > > > I hate what I'm saying, but I guess we need some tunable here.
> > >> > > > Not sure what exactly.
> > >> > >
> > >> > > Would memcg help?
> > >> >
> > >> > That really depends. I would have to check whether vmalloc path obeys
> > >> > __GFP_ACCOUNT (I suspect it does except for page tables allocations but
> > >> > that shouldn't be a big deal). But then the other potential problem is
> > >> > the life time of the xt_table_info (or other potentially large) data
> > >> > structures. Are they bound to any process life time.
> > >>
> > >> No.
> > >
> > > Well, IIUC they bound to net namespace life time, so killing all
> > > proccesses in the namespace would help to get memory back. :)
> > 
> > ... unless the namespace is mounted into file system.
> > 
> > Let's start with NOWARN as that's what kernel generally uses for
> > allocations with user-controllable size. ENOMEM is roughly as
> > informative as the WARNING message in this case.
> 
> You want __GFP_NORETRY but that is not _fully_ supported by kvmalloc
> right now. More specifically kvmalloc doesn't guanratee that the request
> will not trigger the OOM killer (like regular __GFP_NORETRY). This is
> because of internal vmalloc restrictions. If you are however OK to
> simply bail out in most cases then __GFP_NORETRY should work reasonably
> fine.
> 
> > I think we also need to consider setting up memory cgroup for
> > syzkaller test processes (we do RLIMIT_AS, but that's weak).
> 
> Well, this is not about syzkaller, it merely pointed out a potential
> DoS... And that has to be addressed somehow.

So how about this?
---
>From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Tue, 30 Jan 2018 14:51:07 +0100
Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive

syzbot has noticed that xt_alloc_table_info can allocate a lot of
memory. This is an admin only interface but an admin in a namespace
is sufficient as well. eacd86ca3b03 ("net/netfilter/x_tables.c: use
kvmalloc() in xt_alloc_table_info()") has changed the opencoded
kmalloc->vmalloc fallback into kvmalloc. It has dropped __GFP_NORETRY on
the way because vmalloc has simply never fully supported __GFP_NORETRY
semantic. This is still the case because e.g. page tables backing the
vmalloc area are hardcoded GFP_KERNEL.

Revert back to __GFP_NORETRY as a poors man defence against excessively
large allocation request here. We will not rule out the OOM killer
completely but __GFP_NORETRY should at least stop the large request
in most cases.

Fixes: eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in 
xt_alloc_table_info()")
Signed-off-by: Michal Hocko 
---
 net/netfilter/x_tables.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index d8571f414208..a5f5c29bcbdc 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1003,7 +1003,13 @@ struct xt_table_info *xt_alloc_table_info(unsigned int 
size)
if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
return NULL;
 
-   info = kvmalloc(sz, GFP_KERNEL);
+   /*
+* __GFP_NORETRY is not fully supported by kvmalloc but it should
+* work reasonably well if sz is too large and bail out rather
+* than shoot all processes down before realizing there is nothing
+* more to reclaim.
+*/
+   info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
if (!info)
return NULL;
 
-- 
2.15.1

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4] netfilter : add NAT support for shifted portmap ranges

2018-01-30 Thread Thierry Du Tre
This is a patch proposal to support shifted ranges in portmaps.
(i.e. tcp/udp incoming port 5000-5100 on WAN redirected to LAN 
192.168.1.5:2000-2100)

Currently DNAT only works for single port or identical port ranges.
(i.e. ports 5000-5100 on WAN interface redirected to a LAN host while original 
destination port is not altered)
When different port ranges are configured, either 'random' mode should be used, 
or else all incoming connections are mapped onto the first port in the redirect 
range. (in described example WAN:5000-5100 will all be mapped to 
192.168.1.5:2000)

This patch introduces a new mode indicated by flag NF_NAT_RANGE_PROTO_OFFSET 
which uses a base port value to calculate an offset with the destination port 
present in the incoming stream. That offset is then applied as index within the 
redirect port range (index modulo rangewidth to handle range overflow).

In described example the base port would be 5000. An incoming stream with 
destination port 5004 would result in an offset value 4 which means that the 
NAT'ed stream will be using destination port 2004.

Other possibilities include deterministic mapping of larger or multiple ranges 
to a smaller range : WAN:5000-5999 -> LAN:5000-5099 (maps WAN port 5*xx to port 
51xx)

This patch does not change any current behavior. It just adds new NAT proto 
range functionality which must be selected via the specific flag when intended 
to use.

A patch for iptables (libipt_DNAT.c + libip6t_DNAT.c) will also be proposed 
which makes this functionality immediately available.

Signed-off-by: Thierry Du Tre 

---
Changes in v4:
- renamed nf_nat_range1 to nf_nat_range_v1

Changes in v3:
- use nf_nat_range as name for updated struct, renamed existing 
nf_nat_range to nf_nat_range1
- reverted all nf_nat_range2 occurences

Changes in v2:
- added new revision for SNAT and DNAT targets to support the new base port 
variable in struct nf_nat_range2
- replaced all occurences of struct nf_nat_range with struct nf_nat_range2

 include/uapi/linux/netfilter/nf_nat.h | 12 ++-
 net/netfilter/nf_nat_core.c   |  9 +++---
 net/netfilter/nf_nat_proto_common.c   |  5 ++-
 net/netfilter/xt_nat.c| 60 +--
 4 files changed, 78 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/netfilter/nf_nat.h 
b/include/uapi/linux/netfilter/nf_nat.h
index a33000d..5d919c7 100644
--- a/include/uapi/linux/netfilter/nf_nat.h
+++ b/include/uapi/linux/netfilter/nf_nat.h
@@ -10,6 +10,7 @@
 #define NF_NAT_RANGE_PROTO_RANDOM  (1 << 2)
 #define NF_NAT_RANGE_PERSISTENT(1 << 3)
 #define NF_NAT_RANGE_PROTO_RANDOM_FULLY(1 << 4)
+#define NF_NAT_RANGE_PROTO_OFFSET  (1 << 5)
 
 #define NF_NAT_RANGE_PROTO_RANDOM_ALL  \
(NF_NAT_RANGE_PROTO_RANDOM | NF_NAT_RANGE_PROTO_RANDOM_FULLY)
@@ -17,7 +18,7 @@
 #define NF_NAT_RANGE_MASK  \
(NF_NAT_RANGE_MAP_IPS | NF_NAT_RANGE_PROTO_SPECIFIED |  \
 NF_NAT_RANGE_PROTO_RANDOM | NF_NAT_RANGE_PERSISTENT |  \
-NF_NAT_RANGE_PROTO_RANDOM_FULLY)
+NF_NAT_RANGE_PROTO_RANDOM_FULLY | NF_NAT_RANGE_PROTO_OFFSET)
 
 struct nf_nat_ipv4_range {
unsigned intflags;
@@ -32,12 +33,21 @@ struct nf_nat_ipv4_multi_range_compat {
struct nf_nat_ipv4_rangerange[1];
 };
 
+struct nf_nat_range_v1 {
+   unsigned intflags;
+   union nf_inet_addr  min_addr;
+   union nf_inet_addr  max_addr;
+   union nf_conntrack_man_protomin_proto;
+   union nf_conntrack_man_protomax_proto;
+};
+
 struct nf_nat_range {
unsigned intflags;
union nf_inet_addr  min_addr;
union nf_inet_addr  max_addr;
union nf_conntrack_man_protomin_proto;
union nf_conntrack_man_protomax_proto;
+   union nf_conntrack_man_protobase_proto;
 };
 
 #endif /* _NETFILTER_NF_NAT_H */
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index af8345f..de5c327 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -347,9 +347,10 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple,
/* Only bother mapping if it's not already in range and unique */
if (!(range->flags & NF_NAT_RANGE_PROTO_RANDOM_ALL)) {
if (range->flags & NF_NAT_RANGE_PROTO_SPECIFIED) {
-   if (l4proto->in_range(tuple, maniptype,
- >min_proto,
- >max_proto) &&
+   if (!(range->flags & NF_NAT_RANGE_PROTO_OFFSET) &&
+   l4proto->in_range(tuple, maniptype,
+ >min_proto,
+ >max_proto) &&

Re: question about UNDEFINE/REDEFINE

2018-01-30 Thread David Fabian
Hello Arturo,

Dne pátek 26. ledna 2018 19:43:18 CET, Arturo Borrero Gonzalez napsal(a):
> My suggestion is to simply create one variable per value:
> 
> define INET_IFACES_VLAN43 = { bond0.x, bond3.y}
> define INET_IFACES_VLAN3 = { bond3.x, bond3.y}
> define XXX_VLAN43 = xxx
> define XXX_VLAN3 = xxx
> 
> you could generate such a file, something like 'defines.nft' and
> include it once in your main ruleset file.

that is exactly the boilerplate that we are trying to avoid. By using 
consistent (and non-unique) variable names we are able to freely move the 
rules from one customer to another without rewriting every use of a variable 
every time. We also do not want to build a code-generating harness in bash (or 
any other language) since that would sort of defeat the purpose of scripting 
in nftables in my eyes. the redefine keyword was just my first idea to solve 
the 
problem of a single flat variable scope. There may be a better approach but I 
think that if nftables wants to have scripting capabilities, some kind of 
variable scoping (even in flat notation) and more ubiquitous variable use 
within rules is necessary.

I event went so far and made some experimental patches that allowed me to use 
string variables and string concatenation in places like interface names and 
rule targets. With that I was able to create very generic rules and I tied 
them to a customer/VLAN just by changing one or two constants in the header of 
a file (e.g. the VLAN number). Of course, I had to use redefine in the header.

-- 
S pozdravem,

David Fabian
Cluster Design, s.r.o.

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: question about UNDEFINE/REDEFINE

2018-01-30 Thread David Fabian
Hello Pablo,

Dne pátek 26. ledna 2018 14:45:49 CET, Pablo Neira Ayuso napsal(a):
> 2) Probably even cleaner is to look at 'local' scopes like in bash.
> 
>  define local IP1 = 1.1.1.1
> 
> so the symbol is bound to this file - consider the content of a file
> determines a given scope. This can be also useful to the nested
> notation.
> 
> 3) You rework your ruleset to use the notation with nesting :-). But I
> think 2) can be useful for both the flat and nested notation.
> 
> I'm not asking you to do 2), but I would like to see how a patch that
> adds explicit scoping for the flat ruleset representation looks like.

I know about scopes in the code but unfortunately as you said, the flat 
notation only has a single scope. Since we are talking about analogy to bash, 
bash allows me to redefine a variable in the same scope. Variables in nftables 
feel more like constants which is not necessarily bad as it can prevent some 
typos but is hard to work with in scripting as it's not that flexible.

From those options you listed I would strongly prefer to have an implicit 
scope for each file included in the flat notation. That way, defining a 
variable 
in one file would not collide with the same variable in a sibling include. 
Variables from outer scopes would still be available in inner scopes. For 
people that would want to have their "global" definitions in a separate 
include, I would recommend creating a new keyword like global or export that 
would tie a variable to the top-level scope and thus make it available to 
everyone. I don't think that would be that hard to implement and I may try to 
if we agree on it.

Anyway there should definitely be a way to de-register (undefine) a variable 
from a scope to prevent a misuse due to typos.

By the way, can we restructure the FW using nesting and still be able to 
retain all per-customer rules in a single file? Wouldn't that require us to 
split prerouting, postrouting, forward and other rules to separate scopes/
table definitions? That would be highly inconvenient.

-- 
S pozdravem,

David Fabian
Cluster Design, s.r.o.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-30 Thread Michal Hocko
On Tue 30-01-18 10:02:34, Dmitry Vyukov wrote:
> On Tue, Jan 30, 2018 at 9:28 AM, Kirill A. Shutemov
>  wrote:
> > On Tue, Jan 30, 2018 at 09:11:27AM +0100, Florian Westphal wrote:
> >> Michal Hocko  wrote:
> >> > On Mon 29-01-18 23:35:22, Florian Westphal wrote:
> >> > > Kirill A. Shutemov  wrote:
> >> > [...]
> >> > > > I hate what I'm saying, but I guess we need some tunable here.
> >> > > > Not sure what exactly.
> >> > >
> >> > > Would memcg help?
> >> >
> >> > That really depends. I would have to check whether vmalloc path obeys
> >> > __GFP_ACCOUNT (I suspect it does except for page tables allocations but
> >> > that shouldn't be a big deal). But then the other potential problem is
> >> > the life time of the xt_table_info (or other potentially large) data
> >> > structures. Are they bound to any process life time.
> >>
> >> No.
> >
> > Well, IIUC they bound to net namespace life time, so killing all
> > proccesses in the namespace would help to get memory back. :)
> 
> ... unless the namespace is mounted into file system.
> 
> Let's start with NOWARN as that's what kernel generally uses for
> allocations with user-controllable size. ENOMEM is roughly as
> informative as the WARNING message in this case.

You want __GFP_NORETRY but that is not _fully_ supported by kvmalloc
right now. More specifically kvmalloc doesn't guanratee that the request
will not trigger the OOM killer (like regular __GFP_NORETRY). This is
because of internal vmalloc restrictions. If you are however OK to
simply bail out in most cases then __GFP_NORETRY should work reasonably
fine.

> I think we also need to consider setting up memory cgroup for
> syzkaller test processes (we do RLIMIT_AS, but that's weak).

Well, this is not about syzkaller, it merely pointed out a potential
DoS... And that has to be addressed somehow.

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-30 Thread Michal Hocko
On Tue 30-01-18 09:11:27, Florian Westphal wrote:
> Michal Hocko  wrote:
> > On Mon 29-01-18 23:35:22, Florian Westphal wrote:
> > > Kirill A. Shutemov  wrote:
> > [...]
> > > > I hate what I'm saying, but I guess we need some tunable here.
> > > > Not sure what exactly.
> > > 
> > > Would memcg help?
> > 
> > That really depends. I would have to check whether vmalloc path obeys
> > __GFP_ACCOUNT (I suspect it does except for page tables allocations but
> > that shouldn't be a big deal). But then the other potential problem is
> > the life time of the xt_table_info (or other potentially large) data
> > structures. Are they bound to any process life time.
> 
> No.
> 
> > Because if they are
> > not then the OOM killer will not help. The OOM panic earlier in this
> > thread suggests it doesn't because the test case managed to eat all the
> > available memory and killed all the eligible tasks which didn't help.
> 
> Yes, which is why we do not want any OOM killer invocation in first
> place...

The problem is that as soon as you eat that memory and ask for more
until you fail with ENOMEM then the OOM is simply unavoidable.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: possible deadlock in xt_find_revision

2018-01-30 Thread Florian Westphal
#syz dup: possible deadlock in do_ip_getsockopt
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: possible deadlock in xt_find_table_lock

2018-01-30 Thread Florian Westphal
#syz dup: possible deadlock in do_ip_getsockopt
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-30 Thread Dmitry Vyukov
On Tue, Jan 30, 2018 at 9:28 AM, Kirill A. Shutemov
 wrote:
> On Tue, Jan 30, 2018 at 09:11:27AM +0100, Florian Westphal wrote:
>> Michal Hocko  wrote:
>> > On Mon 29-01-18 23:35:22, Florian Westphal wrote:
>> > > Kirill A. Shutemov  wrote:
>> > [...]
>> > > > I hate what I'm saying, but I guess we need some tunable here.
>> > > > Not sure what exactly.
>> > >
>> > > Would memcg help?
>> >
>> > That really depends. I would have to check whether vmalloc path obeys
>> > __GFP_ACCOUNT (I suspect it does except for page tables allocations but
>> > that shouldn't be a big deal). But then the other potential problem is
>> > the life time of the xt_table_info (or other potentially large) data
>> > structures. Are they bound to any process life time.
>>
>> No.
>
> Well, IIUC they bound to net namespace life time, so killing all
> proccesses in the namespace would help to get memory back. :)

... unless the namespace is mounted into file system.

Let's start with NOWARN as that's what kernel generally uses for
allocations with user-controllable size. ENOMEM is roughly as
informative as the WARNING message in this case.

I think we also need to consider setting up memory cgroup for
syzkaller test processes (we do RLIMIT_AS, but that's weak).
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-30 Thread Kirill A. Shutemov
On Tue, Jan 30, 2018 at 09:11:27AM +0100, Florian Westphal wrote:
> Michal Hocko  wrote:
> > On Mon 29-01-18 23:35:22, Florian Westphal wrote:
> > > Kirill A. Shutemov  wrote:
> > [...]
> > > > I hate what I'm saying, but I guess we need some tunable here.
> > > > Not sure what exactly.
> > > 
> > > Would memcg help?
> > 
> > That really depends. I would have to check whether vmalloc path obeys
> > __GFP_ACCOUNT (I suspect it does except for page tables allocations but
> > that shouldn't be a big deal). But then the other potential problem is
> > the life time of the xt_table_info (or other potentially large) data
> > structures. Are they bound to any process life time.
> 
> No.

Well, IIUC they bound to net namespace life time, so killing all
proccesses in the namespace would help to get memory back. :)

-- 
 Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-30 Thread Florian Westphal
Michal Hocko  wrote:
> On Mon 29-01-18 23:35:22, Florian Westphal wrote:
> > Kirill A. Shutemov  wrote:
> [...]
> > > I hate what I'm saying, but I guess we need some tunable here.
> > > Not sure what exactly.
> > 
> > Would memcg help?
> 
> That really depends. I would have to check whether vmalloc path obeys
> __GFP_ACCOUNT (I suspect it does except for page tables allocations but
> that shouldn't be a big deal). But then the other potential problem is
> the life time of the xt_table_info (or other potentially large) data
> structures. Are they bound to any process life time.

No.

> Because if they are
> not then the OOM killer will not help. The OOM panic earlier in this
> thread suggests it doesn't because the test case managed to eat all the
> available memory and killed all the eligible tasks which didn't help.

Yes, which is why we do not want any OOM killer invocation in first
place...

> So in some sense the memcg would help to stop the excessive allocation,
> but it wouldn't resolve it other than kill all tasks in the affected
> memcg/container. Whether this is sufficient or not, I dunno. It sounds
> quite suboptimal to me. But it is true this would be less tricky then
> adding a global knob...

Global knob doesn't really help at all, I can add multiple large
iptables rulesets (so we would have to account), and we have same issue
in virtually all of networking, so we need limits for interface count,
tunnel count, ipsec policies/SAs, nftables, tc, etc etc.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html