[PATCH net v2] netfilter: nat: cope with negative port range

2018-02-14 Thread Paolo Abeni
syzbot reported a division by 0 bug in the netfilter nat code:

divide error:  [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4168 Comm: syzkaller034710 Not tainted 4.16.0-rc1+ #309
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:nf_nat_l4proto_unique_tuple+0x291/0x530
net/netfilter/nf_nat_proto_common.c:88
RSP: 0018:8801b2466778 EFLAGS: 00010246
RAX: f153 RBX: 8801b2466dd8 RCX: 8801b2466c7c
RDX:  RSI: 8801b2466c58 RDI: 8801db5293ac
RBP: 8801b24667d8 R08: 8801b8ba6dc0 R09: 88af5900
R10: 8801b24666f0 R11:  R12: 2990f153
R13: 0001 R14:  R15: 8801b2466c7c
FS:  017e3880() GS:8801db50() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 208fdfe4 CR3: 0001b5340002 CR4: 001606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
  dccp_unique_tuple+0x40/0x50 net/netfilter/nf_nat_proto_dccp.c:30
  get_unique_tuple+0xc28/0x1c10 net/netfilter/nf_nat_core.c:362
  nf_nat_setup_info+0x1c2/0xe00 net/netfilter/nf_nat_core.c:406
  nf_nat_redirect_ipv6+0x306/0x730 net/netfilter/nf_nat_redirect.c:124
  redirect_tg6+0x7f/0xb0 net/netfilter/xt_REDIRECT.c:34
  ip6t_do_table+0xc2a/0x1a30 net/ipv6/netfilter/ip6_tables.c:365
  ip6table_nat_do_chain+0x65/0x80 net/ipv6/netfilter/ip6table_nat.c:41
  nf_nat_ipv6_fn+0x594/0xa80 net/ipv6/netfilter/nf_nat_l3proto_ipv6.c:302
  nf_nat_ipv6_local_fn+0x33/0x5d0
net/ipv6/netfilter/nf_nat_l3proto_ipv6.c:407
  ip6table_nat_local_fn+0x2c/0x40 net/ipv6/netfilter/ip6table_nat.c:69
  nf_hook_entry_hookfn include/linux/netfilter.h:120 [inline]
  nf_hook_slow+0xba/0x1a0 net/netfilter/core.c:483
  nf_hook include/linux/netfilter.h:243 [inline]
  NF_HOOK include/linux/netfilter.h:286 [inline]
  ip6_xmit+0x10ec/0x2260 net/ipv6/ip6_output.c:277
  inet6_csk_xmit+0x2fc/0x580 net/ipv6/inet6_connection_sock.c:139
  dccp_transmit_skb+0x9ac/0x10f0 net/dccp/output.c:142
  dccp_connect+0x369/0x670 net/dccp/output.c:564
  dccp_v6_connect+0xe17/0x1bf0 net/dccp/ipv6.c:946
  __inet_stream_connect+0x2d4/0xf00 net/ipv4/af_inet.c:620
  inet_stream_connect+0x58/0xa0 net/ipv4/af_inet.c:684
  SYSC_connect+0x213/0x4a0 net/socket.c:1639
  SyS_connect+0x24/0x30 net/socket.c:1620
  do_syscall_64+0x282/0x940 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x26/0x9b
RIP: 0033:0x441c69
RSP: 002b:7ffe50cc0be8 EFLAGS: 0217 ORIG_RAX: 002a
RAX: ffda RBX:  RCX: 00441c69
RDX: 001c RSI: 208fdfe4 RDI: 0003
RBP: 006cc018 R08:  R09: 
R10: 0538 R11: 0217 R12: 00403590
R13: 00403620 R14:  R15: 
Code: 48 89 f0 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 46 02 00 00 48 8b
45 c8 44 0f b7 20 e8 88 97 04 fd 31 d2 41 0f b7 c4 4c 89 f9 <41> f7 f6 48
c1 e9 03 48 b8 00 00 00 00 00 fc ff df 0f b6 0c 01
RIP: nf_nat_l4proto_unique_tuple+0x291/0x530
net/netfilter/nf_nat_proto_common.c:88 RSP: 8801b2466778

The problem is that currently we don't have any check on the
configured port range. A port range == -1 triggers the bug, while
other negative values may require a very long time to complete the
following loop.

Adding the relevant check at parse time could break existing
setup, moreover we would need to read/write such values atomically
to avoid possible transient negative ranges at update time.

This commit addresses the issue swapping the two ends on negative
ranges in nf_nat_l4proto_unique_tuple().

v1 -> v2: use the correct 'Fixes' tag

Fixes: 5b1158e909ec ("[NETFILTER]: Add NAT support for nf_conntrack")
Reported-and-tested-by: syzbot+8012e198bd037f487...@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni 
---
 net/netfilter/nf_nat_proto_common.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nf_nat_proto_common.c 
b/net/netfilter/nf_nat_proto_common.c
index fbce552a796e..a05cce545e98 100644
--- a/net/netfilter/nf_nat_proto_common.c
+++ b/net/netfilter/nf_nat_proto_common.c
@@ -41,7 +41,7 @@ void nf_nat_l4proto_unique_tuple(const struct nf_nat_l3proto 
*l3proto,
 const struct nf_conn *ct,
 u16 *rover)
 {
-   unsigned int range_size, min, i;
+   unsigned int range_size, min, max, i;
__be16 *portptr;
u_int16_t off;
 
@@ -70,8 +70,11 @@ void nf_nat_l4proto_unique_tuple(const struct nf_nat_l3proto 
*l3proto,
range_size = 65535 - 1024 + 1;
}
} else {
-   min = ntohs(range->min_proto.all);
-   range_size = ntohs(range->max_proto.all) - min + 1;
+

Re: [PATCH net v2] netfilter: nat: cope with negative port range

2018-02-14 Thread Eric Dumazet
On Wed, 2018-02-14 at 12:13 +0100, Paolo Abeni wrote:
> syzbot reported a division by 0 bug in the netfilter nat code:

...

> Adding the relevant check at parse time could break existing
> setup, moreover we would need to read/write such values atomically
> to avoid possible transient negative ranges at update time.

I do not quite follow why it is so hard to add a check at parse time.

Breaking buggy setups would not be a concern I think.

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] netfilter: nat: cope with negative port range

2018-02-14 Thread Florian Westphal
Eric Dumazet  wrote:
> On Wed, 2018-02-14 at 12:13 +0100, Paolo Abeni wrote:
> > syzbot reported a division by 0 bug in the netfilter nat code:
> 
> > Adding the relevant check at parse time could break existing
> > setup, moreover we would need to read/write such values atomically
> > to avoid possible transient negative ranges at update time.
> 
> I do not quite follow why it is so hard to add a check at parse time.
> 
> Breaking buggy setups would not be a concern I think.

It would be possible for xtables but afaics in nft_nat.c case
(nft_nat_eval) range.{min,max}_proto.all values are loaded from nft
registers at runtime.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] netfilter: nat: cope with negative port range

2018-02-14 Thread Eric Dumazet
On Wed, 2018-02-14 at 13:30 +0100, Florian Westphal wrote:
> Eric Dumazet  wrote:
> > On Wed, 2018-02-14 at 12:13 +0100, Paolo Abeni wrote:
> > > syzbot reported a division by 0 bug in the netfilter nat code:
> > > Adding the relevant check at parse time could break existing
> > > setup, moreover we would need to read/write such values atomically
> > > to avoid possible transient negative ranges at update time.
> > 
> > I do not quite follow why it is so hard to add a check at parse time.
> > 
> > Breaking buggy setups would not be a concern I think.
> 
> It would be possible for xtables but afaics in nft_nat.c case
> (nft_nat_eval) range.{min,max}_proto.all values are loaded from nft
> registers at runtime.

I prefer this explanation much more, I suggest we update the changelog
to explain the real reason.

Thanks Florian.

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [bisected] Forwarded packets occasionally has loopback output interface in Netfilter

2018-02-14 Thread Anders K. Pedersen | Cohaesio
On man, 2018-01-22 at 15:54 +0100, Anders K. Pedersen | Cohaesio wrote:
> On tor, 2018-01-11 at 10:18 -0800, Wei Wang wrote:
> > On Thu, Jan 11, 2018 at 9:25 AM, Anders K. Pedersen | Cohaesio
> >  wrote:
> > > On tir, 2017-12-26 at 12:05 +0100, Anders K. Pedersen | Cohaesio
> > > wrote:
> > > > Hello,
> > > > 
> > > > On one of our border routers, Netfilter is occasionally logging
> > > > packets
> > > > with "OUT=lo" (output interface lo) even though the packet should
> > > > be
> > > > going out via a regular interface. This behavior is present on
> > > > Linux
> > > > 4.13.0 to 4.14.9, and a bisection of the problem points to
> > > > 
> > > > [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call
> > > > dst_dev_put() properly
> > > > 
> > > > as the first bad commit. This commit adds dst_dev_put() calls
> > > > before
> > > > some dst_release() calls, and dst_dev_put() does
> > > > 
> > > > dst->dev = dev_net(dst->dev)->loopback_dev;
> > > > 
> > > > (among other things), which fits the problem we're seeing.
> > > > 
> > > > The essential part of our nftables rule set that shows this
> > > > behavior is
> > > > 
> > > > chain forward {
> > > > type filter hook forward priority 0;
> > > > 
> > > > meta oif { $internal_interfaces } accept
> > > > 
> > > > meta oif lo ip daddr != 127.0.0.0/8 \
> > > > log group 0 snaplen 80 prefix "oif-lo"
> > > > counter
> > > > 
> > > > ip saddr { $our_ip_series } \
> > > > flow table acct_out \
> > > > { meta oif . rt nexthop . ip saddr
> > > > timeout 12m counter } \
> > > > accept
> > > > 
> > > > log group 0 snaplen 80 prefix "DROP" counter drop
> > > > }
> > > > 
> > > > The router only does stateless packet filtering and no
> > > > redirection or
> > > > rewriting of the packets (connection tracking, NAT, ipvs etc. are
> > > > not
> > > > even compiled for this kernel).
> > > > 
> > > > As a result of this problem we see packets that should be going
> > > > to an
> > > > internal interface (and thus accepted by the first rule above)
> > > > being
> > > > logged and dropped by the last rule. Some examples:
> > > > 
> > > > Dec 22 11:57:02 cix4 oif-lo IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118
> > > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF
> > > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705
> > > > WINDOW=3295 ACK URGP=0 MARK=0
> > > > Dec 22 11:57:02 cix4 DROP IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118
> > > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF
> > > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705
> > > > WINDOW=3295 ACK URGP=0 MARK=0
> > > > 
> > > > Dec 22 12:47:07 cix4 oif-lo IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18
> > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF
> > > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103
> > > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > > Dec 22 12:47:07 cix4 DROP IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18
> > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF
> > > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103
> > > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > > 
> > > > Dec 22 12:53:56 cix4 oif-lo IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34
> > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF
> > > > PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612
> > > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > > Dec 22 12:53:56 cix4 DROP IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34
> > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF
> > > > PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612
> > > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > > 
> > > > It also happens for outbound traffic, where the packets are
> > > > logged and
> > > > counted in the acct_out flow table with "meta oif" = "lo", but a
> > > > correct "rt nexthop" - an example:
> > > > 
> > > > Dec 22 12:29:13 cix4 oif-lo IN=team0.20 OUT=lo
> > > > MAC=3c:fd:fe:15:db:a8:00:24:a8:ff:f0:00:08:00 SRC=212.97.129.25
> > > > DST=95.166.119.129 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=19481 DF
> > > > PROTO=TCP SPT=443 DPT=52560 SEQ=3034827396 ACK=2862814901
> > > > WINDOW=12618 ACK URGP=0 MARK=0
> > > > 
> > > > # nft list flow table filter acct_out|tr ',' '\n'|grep lo
> > > > flow table acct_out {
> > > >  "lo" . 94.101.208.217 . 212.97.129.25 expires 3m17s : counter
> > > > packets 1 bytes 40
> > > > 
> > > > I don't know if these packets are actually sent out on the
> > > > correct
> > > > outbound interface thanks to the proper nexthop (the MAC=
> > > > i

Re: [PATCH net v2] netfilter: nat: cope with negative port range

2018-02-14 Thread Pablo Neira Ayuso
On Wed, Feb 14, 2018 at 01:30:37PM +0100, Florian Westphal wrote:
> Eric Dumazet  wrote:
> > On Wed, 2018-02-14 at 12:13 +0100, Paolo Abeni wrote:
> > > syzbot reported a division by 0 bug in the netfilter nat code:
> > 
> > > Adding the relevant check at parse time could break existing
> > > setup, moreover we would need to read/write such values atomically
> > > to avoid possible transient negative ranges at update time.
> > 
> > I do not quite follow why it is so hard to add a check at parse time.
> > 
> > Breaking buggy setups would not be a concern I think.
> 
> It would be possible for xtables but afaics in nft_nat.c case
> (nft_nat_eval) range.{min,max}_proto.all values are loaded from nft
> registers at runtime.

Then, restrict this from nft_nat.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH nft 2/6] payload: add payload_dependency_release() helper function

2018-02-14 Thread Pablo Neira Ayuso
Wrap code that releases existing dependencies that we have just
annotated in the context structure.

Signed-off-by: Pablo Neira Ayuso 
---
 src/payload.c | 22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/src/payload.c b/src/payload.c
index df3c8136c88c..21c2842807cd 100644
--- a/src/payload.c
+++ b/src/payload.c
@@ -428,6 +428,17 @@ void payload_dependency_store(struct payload_dep_ctx *ctx,
ctx->pdep  = stmt;
 }
 
+static void payload_dependency_release(struct payload_dep_ctx *ctx)
+{
+   list_del(&ctx->pdep->list);
+   stmt_free(ctx->pdep);
+
+   ctx->pbase = PROTO_BASE_INVALID;
+   if (ctx->pdep == ctx->prev)
+   ctx->prev = NULL;
+   ctx->pdep  = NULL;
+}
+
 /**
  * __payload_dependency_kill - kill a redundant payload depedency
  *
@@ -442,15 +453,8 @@ void __payload_dependency_kill(struct payload_dep_ctx *ctx,
 {
if (ctx->pbase != PROTO_BASE_INVALID &&
ctx->pbase == base &&
-   ctx->pdep != NULL) {
-   list_del(&ctx->pdep->list);
-   stmt_free(ctx->pdep);
-
-   ctx->pbase = PROTO_BASE_INVALID;
-   if (ctx->pdep == ctx->prev)
-   ctx->prev = NULL;
-   ctx->pdep  = NULL;
-   }
+   ctx->pdep != NULL)
+   payload_dependency_release(ctx);
 }
 
 void payload_dependency_kill(struct payload_dep_ctx *ctx, struct expr *expr,
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH nft 0/6] rework dependency removal (v2)

2018-02-14 Thread Pablo Neira Ayuso
Hi,

This patchset aims to address what Florian reported time ago [1]. This
is skipping removal of protocol key payload expressions at network base
for the netdev, bridge and inet.

It would better to annotate all redundant expressions and add a later
stage, where we can do smarter simplifications by looking globally at
what we have, instead of just looking at current protocol key expression
and last one that we have annotated in the context structure to perform
removals. But I would prefer to have a fix now upstream then look at
this larger rework later on since it would require to review a bit of
the postprocess code.

The initial 4 patches in this batch are just cleanup/preparation patches
for patches 5/6 and 6/6.

There is still a few warning in the tests/py/ infrastructure, some of
them I think need to be fixed, and at least one can remain there to
remind us that we can do better.

Let me know if you have any concern with this.

Thanks.

[1] https://www.spinics.net/lists/netfilter-devel/msg50078.html

Pablo Neira Ayuso (6):
  src: pass family to payload_dependency_kill()
  payload: add payload_dependency_release() helper function
  src: add payload_dependency_exists()
  src: get rid of __payload_dependency_kill()
  payload: add payload_may_dependency_kill()
  netlink_delinearize: add meta_may_dependency_kill()

 include/payload.h |   7 +--
 src/netlink.c |   2 +-
 src/netlink_delinearize.c | 106 +-
 src/payload.c |  85 -
 4 files changed, 165 insertions(+), 35 deletions(-)

-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH nft 1/6] src: pass family to payload_dependency_kill()

2018-02-14 Thread Pablo Neira Ayuso
This context information is very relevant when deciding if a redundant
dependency needs to be removed or not, specifically for the inet, bridge
and netdev families. This new parameter is used by follow up patch
entitled ("payload: add payload_should_dependency_kill()").

Signed-off-by: Pablo Neira Ayuso 
---
 include/payload.h |  7 ---
 src/netlink.c |  2 +-
 src/netlink_delinearize.c | 18 +++---
 src/payload.c | 14 --
 4 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/include/payload.h b/include/payload.h
index 8e357aef461e..294ff2706e30 100644
--- a/include/payload.h
+++ b/include/payload.h
@@ -41,11 +41,12 @@ extern void payload_dependency_store(struct payload_dep_ctx 
*ctx,
 struct stmt *stmt,
 enum proto_bases base);
 extern void __payload_dependency_kill(struct payload_dep_ctx *ctx,
- enum proto_bases base);
+ enum proto_bases base,
+ unsigned int family);
 extern void payload_dependency_kill(struct payload_dep_ctx *ctx,
-   struct expr *expr);
+   struct expr *expr, unsigned int family);
 extern void exthdr_dependency_kill(struct payload_dep_ctx *ctx,
-  struct expr *expr);
+  struct expr *expr, unsigned int family);
 
 extern bool payload_can_merge(const struct expr *e1, const struct expr *e2);
 extern struct expr *payload_expr_join(const struct expr *e1,
diff --git a/src/netlink.c b/src/netlink.c
index 488ae6f3971f..233bfd2df764 100644
--- a/src/netlink.c
+++ b/src/netlink.c
@@ -2768,7 +2768,7 @@ next:
pctx->pbase == PROTO_BASE_INVALID) {
payload_dependency_store(pctx, stmt, base - stacked);
} else {
-   payload_dependency_kill(pctx, lhs);
+   payload_dependency_kill(pctx, lhs, ctx->family);
if (lhs->flags & EXPR_F_PROTOCOL)
payload_dependency_store(pctx, stmt, base - 
stacked);
}
diff --git a/src/netlink_delinearize.c b/src/netlink_delinearize.c
index 256552b5b46e..8d11969e0fb1 100644
--- a/src/netlink_delinearize.c
+++ b/src/netlink_delinearize.c
@@ -1352,7 +1352,8 @@ static void payload_match_expand(struct rule_pp_ctx *ctx,
left->flags & EXPR_F_PROTOCOL) {
payload_dependency_store(&ctx->pdctx, nstmt, base - 
stacked);
} else {
-   payload_dependency_kill(&ctx->pdctx, nexpr->left);
+   payload_dependency_kill(&ctx->pdctx, nexpr->left,
+   ctx->pctx.family);
if (expr->op == OP_EQ && left->flags & EXPR_F_PROTOCOL)
payload_dependency_store(&ctx->pdctx, nstmt, 
base - stacked);
}
@@ -1383,7 +1384,7 @@ static void payload_match_postprocess(struct rule_pp_ctx 
*ctx,
payload_expr_complete(payload, &ctx->pctx);
expr_set_type(expr->right, payload->dtype,
  payload->byteorder);
-   payload_dependency_kill(&ctx->pdctx, payload);
+   payload_dependency_kill(&ctx->pdctx, payload, ctx->pctx.family);
break;
}
 }
@@ -1406,7 +1407,8 @@ static void ct_meta_common_postprocess(struct rule_pp_ctx 
*ctx,
left->flags & EXPR_F_PROTOCOL) {
payload_dependency_store(&ctx->pdctx, ctx->stmt, base);
} else if (ctx->pdctx.pbase < PROTO_BASE_TRANSPORT_HDR) {
-   __payload_dependency_kill(&ctx->pdctx, base);
+   __payload_dependency_kill(&ctx->pdctx, base,
+ ctx->pctx.family);
if (left->flags & EXPR_F_PROTOCOL)
payload_dependency_store(&ctx->pdctx, 
ctx->stmt, base);
}
@@ -1814,7 +1816,7 @@ static void expr_postprocess(struct rule_pp_ctx *ctx, 
struct expr **exprp)
break;
case EXPR_PAYLOAD:
payload_expr_complete(expr, &ctx->pctx);
-   payload_dependency_kill(&ctx->pdctx, expr);
+   payload_dependency_kill(&ctx->pdctx, expr, ctx->pctx.family);
break;
case EXPR_VALUE:
// FIXME
@@ -1837,7 +1839,7 @@ static void expr_postprocess(struct rule_pp_ctx *ctx, 
struct expr **exprp)
expr_postprocess(ctx, &expr->key);
break;
case EXPR_EXTHDR:
-   exthdr_dependency_kill(&ctx->pdctx, expr);
+   exthdr_dependency_kill(&ctx->pdctx, expr, ctx->pctx.family);
break;
case EXPR_SET_R

[PATCH nft 5/6] payload: add payload_may_dependency_kill()

2018-02-14 Thread Pablo Neira Ayuso
Payload protocol key expressions at network base are meaningful in the
netdev, bridge and inet families, do not exercise the redundant
dependency removal in those cases since it breaks rule semantics.

Signed-off-by: Pablo Neira Ayuso 
---
 src/payload.c | 31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/src/payload.c b/src/payload.c
index 383aed039a5e..15d055b6be7e 100644
--- a/src/payload.c
+++ b/src/payload.c
@@ -456,6 +456,31 @@ void payload_dependency_release(struct payload_dep_ctx 
*ctx)
ctx->pdep  = NULL;
 }
 
+static bool payload_may_dependency_kill(struct payload_dep_ctx *ctx,
+   unsigned int family, struct expr *expr)
+{
+   struct expr *dep = ctx->pdep->expr;
+
+   /* Protocol key payload expression at network base such as 'ip6 nexthdr'
+* need to be left in place since it implicitly restricts matching to
+* IPv6 for the bridge, inet and netdev families.
+*/
+   switch (family) {
+   case NFPROTO_BRIDGE:
+   case NFPROTO_NETDEV:
+   case NFPROTO_INET:
+   if (dep->left->ops->type == EXPR_PAYLOAD &&
+   dep->left->payload.base == PROTO_BASE_NETWORK_HDR &&
+   (dep->left->payload.desc == &proto_ip ||
+dep->left->payload.desc == &proto_ip6) &&
+   expr->payload.base == PROTO_BASE_TRANSPORT_HDR)
+   return false;
+   break;
+   }
+
+   return true;
+}
+
 /**
  * payload_dependency_kill - kill a redundant payload depedency
  *
@@ -463,12 +488,14 @@ void payload_dependency_release(struct payload_dep_ctx 
*ctx)
  * @expr: higher layer payload expression
  *
  * Kill a redundant payload expression if a higher layer payload expression
- * implies its existance.
+ * implies its existance. Skip this if the dependency is a network payload and
+ * we are in bridge, netdev and inet families.
  */
 void payload_dependency_kill(struct payload_dep_ctx *ctx, struct expr *expr,
 unsigned int family)
 {
-   if (payload_dependency_exists(ctx, expr->payload.base))
+   if (payload_dependency_exists(ctx, expr->payload.base) &&
+   payload_may_dependency_kill(ctx, family, expr))
payload_dependency_release(ctx);
 }
 
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH nft 4/6] src: get rid of __payload_dependency_kill()

2018-02-14 Thread Pablo Neira Ayuso
Use payload_dependency_release() instead.

Signed-off-by: Pablo Neira Ayuso 
---
 include/payload.h |  3 +--
 src/netlink_delinearize.c |  9 +++--
 src/payload.c | 15 +--
 3 files changed, 9 insertions(+), 18 deletions(-)

diff --git a/include/payload.h b/include/payload.h
index dec4647aad9f..161c64aedf11 100644
--- a/include/payload.h
+++ b/include/payload.h
@@ -42,8 +42,7 @@ extern void payload_dependency_store(struct payload_dep_ctx 
*ctx,
 enum proto_bases base);
 extern bool payload_dependency_exists(const struct payload_dep_ctx *ctx,
  enum proto_bases base);
-extern void __payload_dependency_kill(struct payload_dep_ctx *ctx,
- unsigned int family);
+extern void payload_dependency_release(struct payload_dep_ctx *ctx);
 extern void payload_dependency_kill(struct payload_dep_ctx *ctx,
struct expr *expr, unsigned int family);
 extern void exthdr_dependency_kill(struct payload_dep_ctx *ctx,
diff --git a/src/netlink_delinearize.c b/src/netlink_delinearize.c
index f4b943964fc9..b9f195b07d63 100644
--- a/src/netlink_delinearize.c
+++ b/src/netlink_delinearize.c
@@ -1408,8 +1408,7 @@ static void ct_meta_common_postprocess(struct rule_pp_ctx 
*ctx,
payload_dependency_store(&ctx->pdctx, ctx->stmt, base);
} else if (ctx->pdctx.pbase < PROTO_BASE_TRANSPORT_HDR) {
if (payload_dependency_exists(&ctx->pdctx, base))
-   __payload_dependency_kill(&ctx->pdctx,
- ctx->pctx.family);
+   payload_dependency_release(&ctx->pdctx);
if (left->flags & EXPR_F_PROTOCOL)
payload_dependency_store(&ctx->pdctx, 
ctx->stmt, base);
}
@@ -1874,8 +1873,7 @@ static void stmt_reject_postprocess(struct rule_pp_ctx 
*rctx)
if (stmt->reject.type == NFT_REJECT_TCP_RST &&
payload_dependency_exists(&rctx->pdctx,
  PROTO_BASE_TRANSPORT_HDR))
-   __payload_dependency_kill(&rctx->pdctx,
- rctx->pctx.family);
+   payload_dependency_release(&rctx->pdctx);
break;
case NFPROTO_IPV6:
stmt->reject.family = rctx->pctx.family;
@@ -1883,8 +1881,7 @@ static void stmt_reject_postprocess(struct rule_pp_ctx 
*rctx)
if (stmt->reject.type == NFT_REJECT_TCP_RST &&
payload_dependency_exists(&rctx->pdctx,
  PROTO_BASE_TRANSPORT_HDR))
-   __payload_dependency_kill(&rctx->pdctx,
- rctx->pctx.family);
+   payload_dependency_release(&rctx->pdctx);
break;
case NFPROTO_INET:
if (stmt->reject.type == NFT_REJECT_ICMPX_UNREACH) {
diff --git a/src/payload.c b/src/payload.c
index df59ac8adcac..383aed039a5e 100644
--- a/src/payload.c
+++ b/src/payload.c
@@ -445,7 +445,7 @@ bool payload_dependency_exists(const struct payload_dep_ctx 
*ctx,
   ctx->pdep != NULL;
 }
 
-static void payload_dependency_release(struct payload_dep_ctx *ctx)
+void payload_dependency_release(struct payload_dep_ctx *ctx)
 {
list_del(&ctx->pdep->list);
stmt_free(ctx->pdep);
@@ -457,7 +457,7 @@ static void payload_dependency_release(struct 
payload_dep_ctx *ctx)
 }
 
 /**
- * __payload_dependency_kill - kill a redundant payload depedency
+ * payload_dependency_kill - kill a redundant payload depedency
  *
  * @ctx: payload dependency context
  * @expr: higher layer payload expression
@@ -465,16 +465,11 @@ static void payload_dependency_release(struct 
payload_dep_ctx *ctx)
  * Kill a redundant payload expression if a higher layer payload expression
  * implies its existance.
  */
-void __payload_dependency_kill(struct payload_dep_ctx *ctx, unsigned int 
family)
-{
-   payload_dependency_release(ctx);
-}
-
 void payload_dependency_kill(struct payload_dep_ctx *ctx, struct expr *expr,
 unsigned int family)
 {
if (payload_dependency_exists(ctx, expr->payload.base))
-   __payload_dependency_kill(ctx, family);
+   payload_dependency_release(ctx);
 }
 
 void exthdr_dependency_kill(struct payload_dep_ctx *ctx, struct expr *expr,
@@ -483,11 +478,11 @@ void exthdr_dependency_kill(struct payload_dep_ctx *ctx, 
struct expr *expr,
switch (expr->exthdr.op) {
case NFT_EXTHDR_OP_TCPOPT:
if (payload_dependency_exists(ctx, PROTO_BASE_TRANSPORT_HDR))
-   __payload_dependency_kill(ctx, family);
+   payload_dependency_release(ctx);
  

[PATCH nft 6/6] netlink_delinearize: add meta_may_dependency_kill()

2018-02-14 Thread Pablo Neira Ayuso
Do not exercise dependency removal for protocol key network payload
expressions in bridge, netdev and inet families from meta expressions,
more specifically:

* inet: nfproto and ether type.
* netdev and bridge: meta protocol and ether type.

need to be left in place.

Signed-off-by: Pablo Neira Ayuso 
---
 src/netlink_delinearize.c | 72 ++-
 1 file changed, 71 insertions(+), 1 deletion(-)

diff --git a/src/netlink_delinearize.c b/src/netlink_delinearize.c
index b9f195b07d63..82eefa61d4a4 100644
--- a/src/netlink_delinearize.c
+++ b/src/netlink_delinearize.c
@@ -1389,6 +1389,74 @@ static void payload_match_postprocess(struct rule_pp_ctx 
*ctx,
}
 }
 
+/* We have seen a protocol key expression that restricts matching at the 
network
+ * base, leave it in place since this is meaninful in bridge, inet and netdev
+ * families. Exceptions are ICMP and ICMPv6 where this code assumes that can
+ * only happen with IPv4 and IPv6.
+ */
+static bool meta_may_dependency_kill(struct payload_dep_ctx *ctx,
+unsigned int family,
+const struct expr *expr)
+{
+   struct expr *dep = ctx->pdep->expr;
+
+   if (ctx->pbase != PROTO_BASE_NETWORK_HDR)
+   return true;
+
+   switch (family) {
+   case NFPROTO_INET:
+   switch (dep->left->ops->type) {
+   case EXPR_META:
+   if (dep->left->meta.key == NFT_META_NFPROTO &&
+   (mpz_get_uint16(dep->right->value) == NFPROTO_IPV4 
||
+mpz_get_uint16(dep->right->value) == NFPROTO_IPV6) 
&&
+   expr->left->meta.key == NFT_META_L4PROTO &&
+   mpz_get_uint8(expr->right->value) != IPPROTO_ICMP &&
+   mpz_get_uint8(expr->right->value) != IPPROTO_ICMPV6)
+   return false;
+   break;
+   case EXPR_PAYLOAD:
+   if (dep->left->payload.base == PROTO_BASE_LL_HDR &&
+   (mpz_get_uint16(dep->right->value) == ETH_P_IP ||
+mpz_get_uint16(dep->right->value) == ETH_P_IPV6) &&
+   expr->left->meta.key == NFT_META_L4PROTO &&
+   mpz_get_uint8(expr->right->value) != IPPROTO_ICMP &&
+   mpz_get_uint8(expr->right->value) != IPPROTO_ICMPV6)
+   return false;
+   break;
+   default:
+   break;
+   }
+   break;
+   case NFPROTO_NETDEV:
+   case NFPROTO_BRIDGE:
+   switch (dep->left->ops->type) {
+   case EXPR_META:
+   if (dep->left->meta.key == NFT_META_PROTOCOL &&
+   (mpz_get_uint16(dep->right->value) == ETH_P_IP ||
+mpz_get_uint16(dep->right->value) == ETH_P_IPV6) &&
+   expr->left->meta.key == NFT_META_L4PROTO &&
+   mpz_get_uint8(expr->right->value) != IPPROTO_ICMP &&
+   mpz_get_uint8(expr->right->value) != IPPROTO_ICMPV6)
+   return false;
+   break;
+   case EXPR_PAYLOAD:
+   if (dep->left->payload.base == PROTO_BASE_LL_HDR &&
+   (mpz_get_uint16(dep->right->value) == ETH_P_IP ||
+mpz_get_uint16(dep->right->value) == ETH_P_IPV6) &&
+   expr->left->meta.key == NFT_META_L4PROTO &&
+   mpz_get_uint8(expr->right->value) != IPPROTO_ICMP &&
+   mpz_get_uint8(expr->right->value) != IPPROTO_ICMPV6)
+   return false;
+   break;
+   default:
+   break;
+   }
+   break;
+   }
+   return true;
+}
+
 static void ct_meta_common_postprocess(struct rule_pp_ctx *ctx,
   const struct expr *expr,
   enum proto_bases base)
@@ -1407,7 +1475,9 @@ static void ct_meta_common_postprocess(struct rule_pp_ctx 
*ctx,
left->flags & EXPR_F_PROTOCOL) {
payload_dependency_store(&ctx->pdctx, ctx->stmt, base);
} else if (ctx->pdctx.pbase < PROTO_BASE_TRANSPORT_HDR) {
-   if (payload_dependency_exists(&ctx->pdctx, base))
+   if (payload_dependency_exists(&ctx->pdctx, base) &&
+   meta_may_dependency_kill(&ctx->pdctx,
+ctx->pctx.family, expr))
payload_dependency_release(&ctx->pdctx);
if (left->flags & EXPR_F_PROTOCOL)
 

[PATCH nft 3/6] src: add payload_dependency_exists()

2018-02-14 Thread Pablo Neira Ayuso
This helper function tells us if there is already a protocol key payload
expression, ie. those with EXPR_F_PROTOCOL flag set on, that we might
want to remove since we can infer from another expression in the upper
protocol base, eg.

ip protocol tcp tcp dport 22

'ip protocol tcp' can be removed in the ip family since it is redundant,
but not in the netdev, bridge and inet families, where we cannot make
assumptions on the layer 3 protocol.

Signed-off-by: Pablo Neira Ayuso 
---
 include/payload.h |  3 ++-
 src/netlink_delinearize.c | 15 +--
 src/payload.c | 34 +-
 3 files changed, 36 insertions(+), 16 deletions(-)

diff --git a/include/payload.h b/include/payload.h
index 294ff2706e30..dec4647aad9f 100644
--- a/include/payload.h
+++ b/include/payload.h
@@ -40,8 +40,9 @@ void payload_dependency_reset(struct payload_dep_ctx *ctx);
 extern void payload_dependency_store(struct payload_dep_ctx *ctx,
 struct stmt *stmt,
 enum proto_bases base);
+extern bool payload_dependency_exists(const struct payload_dep_ctx *ctx,
+ enum proto_bases base);
 extern void __payload_dependency_kill(struct payload_dep_ctx *ctx,
- enum proto_bases base,
  unsigned int family);
 extern void payload_dependency_kill(struct payload_dep_ctx *ctx,
struct expr *expr, unsigned int family);
diff --git a/src/netlink_delinearize.c b/src/netlink_delinearize.c
index 8d11969e0fb1..f4b943964fc9 100644
--- a/src/netlink_delinearize.c
+++ b/src/netlink_delinearize.c
@@ -1407,8 +1407,9 @@ static void ct_meta_common_postprocess(struct rule_pp_ctx 
*ctx,
left->flags & EXPR_F_PROTOCOL) {
payload_dependency_store(&ctx->pdctx, ctx->stmt, base);
} else if (ctx->pdctx.pbase < PROTO_BASE_TRANSPORT_HDR) {
-   __payload_dependency_kill(&ctx->pdctx, base,
- ctx->pctx.family);
+   if (payload_dependency_exists(&ctx->pdctx, base))
+   __payload_dependency_kill(&ctx->pdctx,
+ ctx->pctx.family);
if (left->flags & EXPR_F_PROTOCOL)
payload_dependency_store(&ctx->pdctx, 
ctx->stmt, base);
}
@@ -1870,17 +1871,19 @@ static void stmt_reject_postprocess(struct rule_pp_ctx 
*rctx)
case NFPROTO_IPV4:
stmt->reject.family = rctx->pctx.family;
stmt->reject.expr->dtype = &icmp_code_type;
-   if (stmt->reject.type == NFT_REJECT_TCP_RST)
+   if (stmt->reject.type == NFT_REJECT_TCP_RST &&
+   payload_dependency_exists(&rctx->pdctx,
+ PROTO_BASE_TRANSPORT_HDR))
__payload_dependency_kill(&rctx->pdctx,
- PROTO_BASE_TRANSPORT_HDR,
  rctx->pctx.family);
break;
case NFPROTO_IPV6:
stmt->reject.family = rctx->pctx.family;
stmt->reject.expr->dtype = &icmpv6_code_type;
-   if (stmt->reject.type == NFT_REJECT_TCP_RST)
+   if (stmt->reject.type == NFT_REJECT_TCP_RST &&
+   payload_dependency_exists(&rctx->pdctx,
+ PROTO_BASE_TRANSPORT_HDR))
__payload_dependency_kill(&rctx->pdctx,
- PROTO_BASE_TRANSPORT_HDR,
  rctx->pctx.family);
break;
case NFPROTO_INET:
diff --git a/src/payload.c b/src/payload.c
index 21c2842807cd..df59ac8adcac 100644
--- a/src/payload.c
+++ b/src/payload.c
@@ -428,6 +428,23 @@ void payload_dependency_store(struct payload_dep_ctx *ctx,
ctx->pdep  = stmt;
 }
 
+/**
+ * payload_dependency_exists - there is a payload dependency in place
+ * @ctx: payload dependency context
+ * @base: payload protocol base
+ *
+ * Check if we have seen a protocol key payload expression for this base, we 
can
+ * usually remove it if we can infer it from another payload expression in the
+ * upper base.
+ */
+bool payload_dependency_exists(const struct payload_dep_ctx *ctx,
+  enum proto_bases base)
+{
+   return ctx->pbase != PROTO_BASE_INVALID &&
+  ctx->pbase == base &&
+  ctx->pdep != NULL;
+}
+
 static void payload_dependency_release(struct payload_dep_ctx *ctx)
 {
list_del(&ctx->pdep->list);
@@ -448,19 +465,16 @@ static void payload_dependency_release(struct 
payload_dep_ctx *ctx)
  * Kill a redundant payload expression i

Re: [PATCH net v2] netfilter: nat: cope with negative port range

2018-02-14 Thread Paolo Abeni
Hi,

On Wed, 2018-02-14 at 14:51 +0100, Pablo Neira Ayuso wrote:
> On Wed, Feb 14, 2018 at 01:30:37PM +0100, Florian Westphal wrote:
> > Eric Dumazet  wrote:
> > > On Wed, 2018-02-14 at 12:13 +0100, Paolo Abeni wrote:
> > > > syzbot reported a division by 0 bug in the netfilter nat code:
> > > > Adding the relevant check at parse time could break existing
> > > > setup, moreover we would need to read/write such values atomically
> > > > to avoid possible transient negative ranges at update time.
> > > 
> > > I do not quite follow why it is so hard to add a check at parse time.
> > > 
> > > Breaking buggy setups would not be a concern I think.
> > 
> > It would be possible for xtables but afaics in nft_nat.c case
> > (nft_nat_eval) range.{min,max}_proto.all values are loaded from nft
> > registers at runtime.
> 
> Then, restrict this from nft_nat.

If we move the check in the caller for nft, then need cope individually
with several control paths (nf_nat_setup_info() is used by ~10 modules
if I'm not wrong), I think keeping the check here would be better, do
you have strong opinions against that?

Thanks,

Paolo
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] netfilter: nat: cope with negative port range

2018-02-14 Thread Pablo Neira Ayuso
On Wed, Feb 14, 2018 at 04:45:31PM +0100, Paolo Abeni wrote:
> Hi,
> 
> On Wed, 2018-02-14 at 14:51 +0100, Pablo Neira Ayuso wrote:
> > On Wed, Feb 14, 2018 at 01:30:37PM +0100, Florian Westphal wrote:
> > > Eric Dumazet  wrote:
> > > > On Wed, 2018-02-14 at 12:13 +0100, Paolo Abeni wrote:
> > > > > syzbot reported a division by 0 bug in the netfilter nat code:
> > > > > Adding the relevant check at parse time could break existing
> > > > > setup, moreover we would need to read/write such values atomically
> > > > > to avoid possible transient negative ranges at update time.
> > > > 
> > > > I do not quite follow why it is so hard to add a check at parse time.
> > > > 
> > > > Breaking buggy setups would not be a concern I think.
> > > 
> > > It would be possible for xtables but afaics in nft_nat.c case
> > > (nft_nat_eval) range.{min,max}_proto.all values are loaded from nft
> > > registers at runtime.
> > 
> > Then, restrict this from nft_nat.
> 
> If we move the check in the caller for nft, then need cope individually
> with several control paths (nf_nat_setup_info() is used by ~10 modules
> if I'm not wrong), I think keeping the check here would be better, do
> you have strong opinions against that?

You're right, this is fine.

Thanks for explaining!
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net v3] netfilter: nat: cope with negative port range

2018-02-14 Thread Paolo Abeni
syzbot reported a division by 0 bug in the netfilter nat code:

divide error:  [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4168 Comm: syzkaller034710 Not tainted 4.16.0-rc1+ #309
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:nf_nat_l4proto_unique_tuple+0x291/0x530
net/netfilter/nf_nat_proto_common.c:88
RSP: 0018:8801b2466778 EFLAGS: 00010246
RAX: f153 RBX: 8801b2466dd8 RCX: 8801b2466c7c
RDX:  RSI: 8801b2466c58 RDI: 8801db5293ac
RBP: 8801b24667d8 R08: 8801b8ba6dc0 R09: 88af5900
R10: 8801b24666f0 R11:  R12: 2990f153
R13: 0001 R14:  R15: 8801b2466c7c
FS:  017e3880() GS:8801db50() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 208fdfe4 CR3: 0001b5340002 CR4: 001606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
  dccp_unique_tuple+0x40/0x50 net/netfilter/nf_nat_proto_dccp.c:30
  get_unique_tuple+0xc28/0x1c10 net/netfilter/nf_nat_core.c:362
  nf_nat_setup_info+0x1c2/0xe00 net/netfilter/nf_nat_core.c:406
  nf_nat_redirect_ipv6+0x306/0x730 net/netfilter/nf_nat_redirect.c:124
  redirect_tg6+0x7f/0xb0 net/netfilter/xt_REDIRECT.c:34
  ip6t_do_table+0xc2a/0x1a30 net/ipv6/netfilter/ip6_tables.c:365
  ip6table_nat_do_chain+0x65/0x80 net/ipv6/netfilter/ip6table_nat.c:41
  nf_nat_ipv6_fn+0x594/0xa80 net/ipv6/netfilter/nf_nat_l3proto_ipv6.c:302
  nf_nat_ipv6_local_fn+0x33/0x5d0
net/ipv6/netfilter/nf_nat_l3proto_ipv6.c:407
  ip6table_nat_local_fn+0x2c/0x40 net/ipv6/netfilter/ip6table_nat.c:69
  nf_hook_entry_hookfn include/linux/netfilter.h:120 [inline]
  nf_hook_slow+0xba/0x1a0 net/netfilter/core.c:483
  nf_hook include/linux/netfilter.h:243 [inline]
  NF_HOOK include/linux/netfilter.h:286 [inline]
  ip6_xmit+0x10ec/0x2260 net/ipv6/ip6_output.c:277
  inet6_csk_xmit+0x2fc/0x580 net/ipv6/inet6_connection_sock.c:139
  dccp_transmit_skb+0x9ac/0x10f0 net/dccp/output.c:142
  dccp_connect+0x369/0x670 net/dccp/output.c:564
  dccp_v6_connect+0xe17/0x1bf0 net/dccp/ipv6.c:946
  __inet_stream_connect+0x2d4/0xf00 net/ipv4/af_inet.c:620
  inet_stream_connect+0x58/0xa0 net/ipv4/af_inet.c:684
  SYSC_connect+0x213/0x4a0 net/socket.c:1639
  SyS_connect+0x24/0x30 net/socket.c:1620
  do_syscall_64+0x282/0x940 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x26/0x9b
RIP: 0033:0x441c69
RSP: 002b:7ffe50cc0be8 EFLAGS: 0217 ORIG_RAX: 002a
RAX: ffda RBX:  RCX: 00441c69
RDX: 001c RSI: 208fdfe4 RDI: 0003
RBP: 006cc018 R08:  R09: 
R10: 0538 R11: 0217 R12: 00403590
R13: 00403620 R14:  R15: 
Code: 48 89 f0 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 46 02 00 00 48 8b
45 c8 44 0f b7 20 e8 88 97 04 fd 31 d2 41 0f b7 c4 4c 89 f9 <41> f7 f6 48
c1 e9 03 48 b8 00 00 00 00 00 fc ff df 0f b6 0c 01
RIP: nf_nat_l4proto_unique_tuple+0x291/0x530
net/netfilter/nf_nat_proto_common.c:88 RSP: 8801b2466778

The problem is that currently we don't have any check on the
configured port range. A port range == -1 triggers the bug, while
other negative values may require a very long time to complete the
following loop.

This commit addresses the issue swapping the two ends on negative
ranges. The check is performed in nf_nat_l4proto_unique_tuple() since
the nft nat loads the port values from nft registers at runtime.

v1 -> v2: use the correct 'Fixes' tag
v2 -> v3: update commit message, drop unneeded READ_ONCE()

Fixes: 5b1158e909ec ("[NETFILTER]: Add NAT support for nf_conntrack")
Reported-by: syzbot+8012e198bd037f487...@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni 
---
 net/netfilter/nf_nat_proto_common.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_nat_proto_common.c 
b/net/netfilter/nf_nat_proto_common.c
index fbce552a796e..7d7466dbf663 100644
--- a/net/netfilter/nf_nat_proto_common.c
+++ b/net/netfilter/nf_nat_proto_common.c
@@ -41,7 +41,7 @@ void nf_nat_l4proto_unique_tuple(const struct nf_nat_l3proto 
*l3proto,
 const struct nf_conn *ct,
 u16 *rover)
 {
-   unsigned int range_size, min, i;
+   unsigned int range_size, min, max, i;
__be16 *portptr;
u_int16_t off;
 
@@ -71,7 +71,10 @@ void nf_nat_l4proto_unique_tuple(const struct nf_nat_l3proto 
*l3proto,
}
} else {
min = ntohs(range->min_proto.all);
-   range_size = ntohs(range->max_proto.all) - min + 1;
+   max = ntohs(range->max_proto.all);
+   if (unlikely(max < min))
+

[PATCH nft] tests: add test case for sets updated from packet path

2018-02-14 Thread Florian Westphal
currently kernel may pick a set implementation that doesn't provide
a ->update() function. This causes an error when user attempts to
add the nftables rule that is supposed to add entries to the set.

Signed-off-by: Florian Westphal 
---
Pablo, unless you have objections I would push this now.

diff --git a/tests/shell/testcases/sets/0028autoselect_0 
b/tests/shell/testcases/sets/0028autoselect_0
new file mode 100755
index ..2225e7aee247
--- /dev/null
+++ b/tests/shell/testcases/sets/0028autoselect_0
@@ -0,0 +1,18 @@
+#!/bin/bash
+
+# This testscase checks kernel picks a suitable set backends.
+# Ruleset attempts to update from packet path, so set backend
+# needs an ->update() implementation.
+
+set -e
+
+$NFT add table t
+$NFT add set t s1 { type inet_proto \; }
+$NFT add set t s2 { type ipv4_addr \; }
+$NFT add set t s3 { type ipv4_addr \; size 1024\; }
+$NFT add chain t c {type filter hook input priority 0 \; }
+
+# chosen set type must support updates from packet path
+$NFT add rule t c meta iifname foobar set add ip protocol @s1
+$NFT add rule t c meta iifname foobar set add ip daddr @s2
+$NFT add rule t c meta iifname foobar set add ip daddr @s3
-- 
2.13.6

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Overlapping IP networks no longer allowed?

2018-02-14 Thread Mantas Mikulėnas
Hello,

As of nftables 0.8.1, it seems I can no longer write anonymous sets
which contain overlapping networks (CIDR masks).

For example, I want to write the following ruleset:

#!/usr/bin/nft -f
define users = { 10.0.0.0/8, 193.219.181.192/26 }
define admins = { 10.123.0.0/24, 31.220.42.129 }
define allowed = { $users, $admins }
table inet filter {
chain foobar {
ip saddr $allowed accept
}
}

results in an error message:

Error: interval overlaps with previous one

I noticed a few nftables.git commits related to disabling auto-merge
for interval sets... but mine don't have the 'interval' flag, and
there doesn't seem to be any way to specify 'auto-merge' for anonymous
sets, either.

-- 
Mantas Mikulėnas
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nft] tests: add test case for sets updated from packet path

2018-02-14 Thread Pablo Neira Ayuso
On Wed, Feb 14, 2018 at 05:40:17PM +0100, Florian Westphal wrote:
> currently kernel may pick a set implementation that doesn't provide
> a ->update() function. This causes an error when user attempts to
> add the nftables rule that is supposed to add entries to the set.
> 
> Signed-off-by: Florian Westphal 
> ---
> Pablo, unless you have objections I would push this now.

Go ahead. Thanks!

> diff --git a/tests/shell/testcases/sets/0028autoselect_0 
> b/tests/shell/testcases/sets/0028autoselect_0
> new file mode 100755
> index ..2225e7aee247
> --- /dev/null
> +++ b/tests/shell/testcases/sets/0028autoselect_0
> @@ -0,0 +1,18 @@
> +#!/bin/bash
> +
> +# This testscase checks kernel picks a suitable set backends.
> +# Ruleset attempts to update from packet path, so set backend
> +# needs an ->update() implementation.
> +
> +set -e
> +
> +$NFT add table t
> +$NFT add set t s1 { type inet_proto \; }
> +$NFT add set t s2 { type ipv4_addr \; }
> +$NFT add set t s3 { type ipv4_addr \; size 1024\; }
> +$NFT add chain t c {type filter hook input priority 0 \; }
> +
> +# chosen set type must support updates from packet path
> +$NFT add rule t c meta iifname foobar set add ip protocol @s1
> +$NFT add rule t c meta iifname foobar set add ip daddr @s2
> +$NFT add rule t c meta iifname foobar set add ip daddr @s3
> -- 
> 2.13.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Overlapping IP networks no longer allowed?

2018-02-14 Thread Pablo Neira Ayuso
On Wed, Feb 14, 2018 at 07:02:32PM +0200, Mantas Mikulėnas wrote:
> Hello,
> 
> As of nftables 0.8.1, it seems I can no longer write anonymous sets
> which contain overlapping networks (CIDR masks).
> 
> For example, I want to write the following ruleset:
> 
> #!/usr/bin/nft -f
> define users = { 10.0.0.0/8, 193.219.181.192/26 }
> define admins = { 10.123.0.0/24, 31.220.42.129 }
> define allowed = { $users, $admins }
> table inet filter {
> chain foobar {
> ip saddr $allowed accept
> }
> }
> 
> results in an error message:
> 
> Error: interval overlaps with previous one
> 
> I noticed a few nftables.git commits related to disabling auto-merge
> for interval sets... but mine don't have the 'interval' flag, and
> there doesn't seem to be any way to specify 'auto-merge' for anonymous
> sets, either.

I would like not to enable this by default since typo in rulesets
could go through unnoticed.

So the two alternatives I see are:

1) add per-table configuration options, this would allow us to
   enable auto-merge explicitly for all anonymous sets. This is also
   required if we want to allow user to select "policy memory;" for
   anonymous sets. Only problem with this approach is that this needs
   a kernel patch, so it will take a while to restore the behaviour you
   want since we need a new NFTA_TABLE_USERDATA attribute to store user
   preferences on this.

2) We add a -m option that we can combine with -f for this, which
   globally enables auto-merge for every set, including anonymous and
   named sets.

Let me know.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH nft] parser_bison: restore nft {import,export} ruleset

2018-02-14 Thread Pablo Neira Ayuso
Restore original syntax for the yet experimental VM low-level json
representation.

Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1224
Signed-off-by: Pablo Neira Ayuso 
---
I asked for this change to make room for the high-level json
representation, but we can use -j options for this instead.  Given there
are more users for the json representation that I expected, I'm fixing
it myself by restoring the former behaviour.

 src/parser_bison.y | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/parser_bison.y b/src/parser_bison.y
index 578bfdc10429..4cfc54cfd7b2 100644
--- a/src/parser_bison.y
+++ b/src/parser_bison.y
@@ -1186,7 +1186,7 @@ rename_cmd:   CHAIN   
chain_spec  identifier
}
;
 
-import_cmd :   RULESET markup_format
+import_cmd :   RULESET markup_format
{
struct handle h = { .family = NFPROTO_UNSPEC };
struct markup *markup = markup_alloc($2);
@@ -1198,7 +1198,6 @@ import_cmd:   RULESET 
markup_format
struct markup *markup = markup_alloc($1);
$$ = cmd_alloc(CMD_IMPORT, CMD_OBJ_MARKUP, &h, 
&@$, markup);
}
-   |   JSON{ $$ = NULL; }
;
 
 export_cmd :   RULESET markup_format
@@ -1213,7 +1212,6 @@ export_cmd:   RULESET 
markup_format
struct markup *markup = markup_alloc($1);
$$ = cmd_alloc(CMD_EXPORT, CMD_OBJ_MARKUP, &h, 
&@$, markup);
}
-   |   JSON{ $$ = NULL; }
;
 
 monitor_cmd:   monitor_event   monitor_object  monitor_format
@@ -1241,10 +1239,10 @@ monitor_object  :   /* empty */ { $$ = 
CMD_MONITOR_OBJ_ANY; }
 
 monitor_format :   /* empty */ { $$ = NFTNL_OUTPUT_DEFAULT; }
|   markup_format
-   |   JSON{ $$ = NFTNL_OUTPUT_JSON; }
;
 
 markup_format  :   XML { $$ = NFTNL_OUTPUT_XML; }
+   |   JSON{ $$ = NFTNL_OUTPUT_JSON; }
|   VM JSON { $$ = NFTNL_OUTPUT_JSON; }
;
 
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] configure: Make missing docbook2man an error if man build requested

2018-02-14 Thread Pablo Neira Ayuso
On Wed, Feb 07, 2018 at 11:40:00AM +0200, Ville Skyttä wrote:
> Previously, if man page build was enabled but no suitable docbook2man or
> the like tool was found, build failed at a later stage with
> undescriptive error message. Fail early and explicitly at configure
> stage instead.

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nft] parser_bison: restore nft {import,export} ruleset

2018-02-14 Thread Shyam Saini
Hi Pablo,

On Thu, Feb 15, 2018 at 12:02 AM, Pablo Neira Ayuso  wrote:
> Restore original syntax for the yet experimental VM low-level json
> representation.
>
> Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1224
> Signed-off-by: Pablo Neira Ayuso 
> ---
> I asked for this change to make room for the high-level json
> representation, but we can use -j options for this instead.  Given there
> are more users for the json representation that I expected, I'm fixing
> it myself by restoring the former behaviour.

Why would one use "nft export"  without "nft import".
if someone exports rules in json then they can't use those rules
given the fact that "nft import" was not available earlier.

Am i missing something?


Thanks!!

>  src/parser_bison.y | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/src/parser_bison.y b/src/parser_bison.y
> index 578bfdc10429..4cfc54cfd7b2 100644
> --- a/src/parser_bison.y
> +++ b/src/parser_bison.y
> @@ -1186,7 +1186,7 @@ rename_cmd:   CHAIN   
> chain_spec  identifier
> }
> ;
>
> -import_cmd :   RULESET markup_format
> +import_cmd :   RULESET markup_format
> {
> struct handle h = { .family = NFPROTO_UNSPEC 
> };
> struct markup *markup = markup_alloc($2);
> @@ -1198,7 +1198,6 @@ import_cmd:   RULESET   
>   markup_format
> struct markup *markup = markup_alloc($1);
> $$ = cmd_alloc(CMD_IMPORT, CMD_OBJ_MARKUP, 
> &h, &@$, markup);
> }
> -   |   JSON{ $$ = NULL; }
> ;
>
>  export_cmd :   RULESET markup_format
> @@ -1213,7 +1212,6 @@ export_cmd:   RULESET 
> markup_format
> struct markup *markup = markup_alloc($1);
> $$ = cmd_alloc(CMD_EXPORT, CMD_OBJ_MARKUP, 
> &h, &@$, markup);
> }
> -   |   JSON{ $$ = NULL; }
> ;
>
>  monitor_cmd:   monitor_event   monitor_object  monitor_format
> @@ -1241,10 +1239,10 @@ monitor_object  :   /* empty */ { $$ 
> = CMD_MONITOR_OBJ_ANY; }
>
>  monitor_format :   /* empty */ { $$ = NFTNL_OUTPUT_DEFAULT; }
> |   markup_format
> -   |   JSON{ $$ = NFTNL_OUTPUT_JSON; }
> ;
>
>  markup_format  :   XML { $$ = NFTNL_OUTPUT_XML; }
> +   |   JSON{ $$ = NFTNL_OUTPUT_JSON; }
> |   VM JSON { $$ = NFTNL_OUTPUT_JSON; }
> ;
>
> --
> 2.11.0
>
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH libnftnl] examples: nft-set-del: fix set deletion

2018-02-14 Thread Pablo Neira Ayuso
Signed-off-by: Pablo Neira Ayuso 
---
 examples/nft-set-del.c | 33 -
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/examples/nft-set-del.c b/examples/nft-set-del.c
index 5e1fad30d341..8c216df861d7 100644
--- a/examples/nft-set-del.c
+++ b/examples/nft-set-del.c
@@ -25,6 +25,7 @@ int main(int argc, char *argv[])
struct mnl_socket *nl;
char buf[MNL_SOCKET_BUFFER_SIZE];
struct nlmsghdr *nlh;
+   struct mnl_nlmsg_batch *batch;
uint32_t portid, seq, family;
struct nftnl_set *t = NULL;
int ret;
@@ -54,13 +55,23 @@ int main(int argc, char *argv[])
exit(EXIT_FAILURE);
}
 
-   nlh = nftnl_set_nlmsg_build_hdr(buf, NFT_MSG_DELSET, family,
+   batch = mnl_nlmsg_batch_start(buf, sizeof(buf));
+
+   nftnl_batch_begin(mnl_nlmsg_batch_current(batch), seq++);
+   mnl_nlmsg_batch_next(batch);
+
+   nlh = nftnl_set_nlmsg_build_hdr(mnl_nlmsg_batch_current(batch),
+   NFT_MSG_DELSET, family,
NLM_F_ACK, seq);
nftnl_set_set(t, NFTNL_SET_TABLE, argv[2]);
nftnl_set_set(t, NFTNL_SET_NAME, argv[3]);
 
nftnl_set_nlmsg_build_payload(nlh, t);
nftnl_set_free(t);
+   mnl_nlmsg_batch_next(batch);
+
+   nftnl_batch_end(mnl_nlmsg_batch_current(batch), seq++);
+   mnl_nlmsg_batch_next(batch);
 
nl = mnl_socket_open(NETLINK_NETFILTER);
if (nl == NULL) {
@@ -74,20 +85,24 @@ int main(int argc, char *argv[])
}
portid = mnl_socket_get_portid(nl);
 
-   if (mnl_socket_sendto(nl, nlh, nlh->nlmsg_len) < 0) {
+   ret = mnl_socket_sendto(nl, mnl_nlmsg_batch_head(batch),
+   mnl_nlmsg_batch_size(batch));
+   if (ret < 0) {
perror("mnl_socket_send");
exit(EXIT_FAILURE);
}
 
+   mnl_nlmsg_batch_stop(batch);
+
ret = mnl_socket_recvfrom(nl, buf, sizeof(buf));
-   while (ret > 0) {
-   ret = mnl_cb_run(buf, ret, seq, portid, NULL, NULL);
-   if (ret <= 0)
-   break;
-   ret = mnl_socket_recvfrom(nl, buf, sizeof(buf));
+   while (ret < 0) {
+   perror("mnl_socket_recvfrom");
+   exit(EXIT_FAILURE);
}
-   if (ret == -1) {
-   perror("error");
+
+   ret = mnl_cb_run(buf, ret, 0, portid, NULL, NULL);
+   if (ret < 0) {
+   perror("mnl_cb_run");
exit(EXIT_FAILURE);
}
mnl_socket_close(nl);
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH libnftnl] set_elem: nftnl_set_elems_parse() returns 0 if set is empty

2018-02-14 Thread Pablo Neira Ayuso
Instead of -1, which results n misleading error propagate to the caller
with errno == 0 (success).

Signed-off-by: Pablo Neira Ayuso 
---
 src/set_elem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/set_elem.c b/src/set_elem.c
index e02a38791c9a..1ac53dadbb75 100644
--- a/src/set_elem.c
+++ b/src/set_elem.c
@@ -490,7 +490,7 @@ nftnl_set_elem_list_parse_attr_cb(const struct nlattr 
*attr, void *data)
 static int nftnl_set_elems_parse(struct nftnl_set *s, const struct nlattr 
*nest)
 {
struct nlattr *attr;
-   int ret = -1;
+   int ret = 0;
 
mnl_attr_for_each_nested(attr, nest) {
if (mnl_attr_get_type(attr) != NFTA_LIST_ELEM)
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nft] parser_bison: restore nft {import,export} ruleset

2018-02-14 Thread Pablo Neira Ayuso
On Thu, Feb 15, 2018 at 12:34:31AM +0530, Shyam Saini wrote:
> Hi Pablo,
> 
> On Thu, Feb 15, 2018 at 12:02 AM, Pablo Neira Ayuso  
> wrote:
> > Restore original syntax for the yet experimental VM low-level json
> > representation.
> >
> > Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1224
> > Signed-off-by: Pablo Neira Ayuso 
> > ---
> > I asked for this change to make room for the high-level json
> > representation, but we can use -j options for this instead.  Given there
> > are more users for the json representation that I expected, I'm fixing
> > it myself by restoring the former behaviour.
> 
> Why would one use "nft export"  without "nft import".
> if someone exports rules in json then they can't use those rules
> given the fact that "nft import" was not available earlier.
> 
> Am i missing something?

With this patch nft import and nft export works as expected, ie.

nft export ruleset json > file.json
nft import ruleset json < file.json

I'm just restoring 'nft export ruleset json' with this patch, it seems
there are more users of this than I expected, so let's restore this
before 0.8.3 is released, that's my proposal.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nft] parser_bison: restore nft {import,export} ruleset

2018-02-14 Thread Pablo Neira Ayuso
On Wed, Feb 14, 2018 at 08:16:52PM +0100, Pablo Neira Ayuso wrote:
> On Thu, Feb 15, 2018 at 12:34:31AM +0530, Shyam Saini wrote:
> > Hi Pablo,
> > 
> > On Thu, Feb 15, 2018 at 12:02 AM, Pablo Neira Ayuso  
> > wrote:
> > > Restore original syntax for the yet experimental VM low-level json
> > > representation.
> > >
> > > Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1224
> > > Signed-off-by: Pablo Neira Ayuso 
> > > ---
> > > I asked for this change to make room for the high-level json
> > > representation, but we can use -j options for this instead.  Given there
> > > are more users for the json representation that I expected, I'm fixing
> > > it myself by restoring the former behaviour.
> > 
> > Why would one use "nft export"  without "nft import".
> > if someone exports rules in json then they can't use those rules
> > given the fact that "nft import" was not available earlier.
> > 
> > Am i missing something?
> 
> With this patch nft import and nft export works as expected, ie.
> 
> nft export ruleset json > file.json
> nft import ruleset json < file.json
> 
> I'm just restoring 'nft export ruleset json' with this patch, it seems
> there are more users of this than I expected, so let's restore this
> before 0.8.3 is released, that's my proposal.

Oh, probably you got confused because the patch title refers to nft
import when it should only refer to nft export ruleset json?
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] netfilter: drop outermost socket lock in getsockopt()

2018-02-14 Thread Pablo Neira Ayuso
On Thu, Feb 08, 2018 at 12:19:00PM +0100, Paolo Abeni wrote:
> The Syzbot reported a possible deadlock in the netfilter area caused by
> rtnl lock, xt lock and socket lock being acquired with a different order
> on different code paths, leading to the following backtrace:
> 
> ==
> WARNING: possible circular locking dependency detected
> 4.15.0+ #301 Not tainted
> --
> syzkaller233489/4179 is trying to acquire lock:
>   (rtnl_mutex){+.+.}, at: [<48e996fd>] rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 
> but task is already holding lock:
>   (&xt[i].mutex){+.+.}, at: [<328553a2>]
> xt_find_table_lock+0x3e/0x3e0 net/netfilter/x_tables.c:1041
> 
> which lock already depends on the new lock.
> ===
> 
> Since commit 3f34cfae1230 ("netfilter: on sockopt() acquire sock lock
> only in the required scope"), we already acquire the socket lock in
> the innermost scope, where needed. In such commit I forgot to remove
> the outer-most socket lock from the getsockopt() path, this commit
> addresses the issues dropping it now.

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: netfilter: x_tables: ratelimit most printks

2018-02-14 Thread Pablo Neira Ayuso
On Wed, Feb 07, 2018 at 02:48:21PM +0100, Florian Westphal wrote:
> Aeons ago, before namespaces, there was no need to ratelimit this:
> all of these error messages got triggered in response to iptables
> commands, which need CAP_NET_ADMIN.
> 
> Nowadays we have namespaces, so its better to ratelimit these.
> This should also help fuzzing (syzkaller), as it can generate a large
> volume of error messages (which are useless there).
> 
> The patches are split as follows:
> - first get rid of printks that should never be triggered, as userland
>   doesn't generate such malformed rules anyway.
> - second, switch some printks to pr_debug.  This is mostly for messages
>   where it might make sense for developers to see what exactly went
>   wrong.
> 
> Rest of the patches swap remaining pr_foo with pr_foo_ratelimited().
> 
> Note that most patches introduce overly long lines, but splitting these
> would make it necessary to split the error messages which is worse.
> 
> 46 files changed, 254 insertions(+), 257 deletions(-)

Series applied, thanks Florian.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Patch net v2] ipt_CLUSTERIP: fix a refcount bug in clusterip_config_find_get()

2018-02-14 Thread Pablo Neira Ayuso
On Thu, Feb 08, 2018 at 01:53:52PM -0800, Cong Wang wrote:
> In clusterip_config_find_get() we hold RCU read lock so it could
> run concurrently with clusterip_config_entry_put(), as a result,
> the refcnt could go back to 1 from 0, which leads to a double
> list_del()... Just replace refcount_inc() with
> refcount_inc_not_zero(), as for c->refcount.

Applied, thanks Cong.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nf] netfilter: add back stackpointer size checks

2018-02-14 Thread Pablo Neira Ayuso
On Wed, Feb 07, 2018 at 01:46:25PM +0100, Florian Westphal wrote:
> The rationale for removing the check is only correct for rulesets
> generated by ip(6)tables.
> 
> In iptables, a jump can only occur to a user-defined chain, i.e.
> because we size the stack based on number of user-defined chains we
> cannot exceed stack size.
> 
> However, the underlying binary format has no such restriction,
> and the validation step only ensures that the jump target is a
> valid rule start point.
> 
> IOW, its possible to build a rule blob that has no user-defined
> chains but does contain a jump.
> 
> If this happens, no jump stack gets allocated and crash occurs
> because no jumpstack was allocated.

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] .gitignore: ignore ASN.1 auto generated files

2018-02-14 Thread Pablo Neira Ayuso
On Mon, Feb 12, 2018 at 09:45:42PM +0800, Zhu Lingshan wrote:
> when build kernel with default configure, files:
> 
> generatenet/ipv4/netfilter/nf_nat_snmp_basic-asn1.c
> net/ipv4/netfilter/nf_nat_snmp_basic-asn1.h
> 
> will be automatically generated by ASN.1 compiler, so
> No need to track them in git, it's better to ignore them.

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v2] netfilter: x_tables: fix missing timer initialization in xt_LED

2018-02-14 Thread Pablo Neira Ayuso
On Mon, Feb 12, 2018 at 06:49:39PM +0100, Paolo Abeni wrote:
> syzbot reported that xt_LED may try to use the ledinternal->timer
> without previously initializing it:
> 
> [ cut here ]
> kernel BUG at kernel/time/timer.c:958!
> invalid opcode:  [#1] SMP KASAN
> Dumping ftrace buffer:
> (ftrace buffer empty)
> Modules linked in:
> CPU: 1 PID: 1826 Comm: kworker/1:2 Not tainted 4.15.0+ #306
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Workqueue: ipv6_addrconf addrconf_dad_work
> RIP: 0010:__mod_timer kernel/time/timer.c:958 [inline]
> RIP: 0010:mod_timer+0x7d6/0x13c0 kernel/time/timer.c:1102
> RSP: 0018:8801d24fe9f8 EFLAGS: 00010293
> RAX: 8801d25246c0 RBX: 8801aec6cb50 RCX: 816052c6
> RDX:  RSI: fffbd14b RDI: 8801aec6cb68
> RBP: 8801d24fec98 R08:  R09: 11003a49fd6c
> R10: 8801d24feb28 R11: 0005 R12: dc00
> R13: 8801d24fec70 R14: fffbd14b R15: 8801af608f90
> FS:  () GS:8801db50() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 206d6fd0 CR3: 06a22001 CR4: 001606e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>   led_tg+0x1db/0x2e0 net/netfilter/xt_LED.c:75
>   ip6t_do_table+0xc2a/0x1a30 net/ipv6/netfilter/ip6_tables.c:365
>   ip6table_raw_hook+0x65/0x80 net/ipv6/netfilter/ip6table_raw.c:42
>   nf_hook_entry_hookfn include/linux/netfilter.h:120 [inline]
>   nf_hook_slow+0xba/0x1a0 net/netfilter/core.c:483
>   nf_hook.constprop.27+0x3f6/0x830 include/linux/netfilter.h:243
>   NF_HOOK include/linux/netfilter.h:286 [inline]
>   ndisc_send_skb+0xa51/0x1370 net/ipv6/ndisc.c:491
>   ndisc_send_ns+0x38a/0x870 net/ipv6/ndisc.c:633
>   addrconf_dad_work+0xb9e/0x1320 net/ipv6/addrconf.c:4008
>   process_one_work+0xbbf/0x1af0 kernel/workqueue.c:2113
>   worker_thread+0x223/0x1990 kernel/workqueue.c:2247
>   kthread+0x33c/0x400 kernel/kthread.c:238
>   ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:429
> Code: 85 2a 0b 00 00 4d 8b 3c 24 4d 85 ff 75 9f 4c 8b bd 60 fd ff ff e8 bb
> 57 10 00 65 ff 0d 94 9a a1 7e e9 d9 fc ff ff e8 aa 57 10 00 <0f> 0b e8 a3
> 57 10 00 e9 14 fb ff ff e8 99 57 10 00 4c 89 bd 70
> RIP: __mod_timer kernel/time/timer.c:958 [inline] RSP: 8801d24fe9f8
> RIP: mod_timer+0x7d6/0x13c0 kernel/time/timer.c:1102 RSP: 8801d24fe9f8
> ---[ end trace f661ab06f5dd8b3d ]---
> 
> The ledinternal struct can be shared between several different
> xt_LED targets, but the related timer is currently initialized only
> if the first target requires it. Fix it by unconditionally
> initializing the timer struct.

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nft] parser_bison: restore nft {import,export} ruleset

2018-02-14 Thread Shyam Saini
> On Wed, Feb 14, 2018 at 08:16:52PM +0100, Pablo Neira Ayuso wrote:
>> On Thu, Feb 15, 2018 at 12:34:31AM +0530, Shyam Saini wrote:
>> > Hi Pablo,
>> >
>> > On Thu, Feb 15, 2018 at 12:02 AM, Pablo Neira Ayuso  
>> > wrote:
>> > > Restore original syntax for the yet experimental VM low-level json
>> > > representation.
>> > >
>> > > Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1224
>> > > Signed-off-by: Pablo Neira Ayuso 
>> > > ---
>> > > I asked for this change to make room for the high-level json
>> > > representation, but we can use -j options for this instead.  Given there
>> > > are more users for the json representation that I expected, I'm fixing
>> > > it myself by restoring the former behaviour.
>> >
>> > Why would one use "nft export"  without "nft import".
>> > if someone exports rules in json then they can't use those rules
>> > given the fact that "nft import" was not available earlier.
>> >
>> > Am i missing something?
>>
>> With this patch nft import and nft export works as expected, ie.
>>
>> nft export ruleset json > file.json
>> nft import ruleset json < file.json
>>
>> I'm just restoring 'nft export ruleset json' with this patch, it seems
>> there are more users of this than I expected, so let's restore this
>> before 0.8.3 is released, that's my proposal.
>
> Oh, probably you got confused because the patch title refers to nft
> import when it should only refer to nft export ruleset json?

No, I mean in some previous mail Phil mentioned that it could break user's
script.
 I was thinking why one would use "nft export json" alone.
Earlier we didn't have nft import command.

So lets say if some user do "nft export json >file.json" then the
rules in file.json
are of no use without "nft import json" because we had no way to
import or use them again.

Sorry, I missed previous mail and couldn't follow up.


Thanks!
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net v3] netfilter: nat: cope with negative port range

2018-02-14 Thread Pablo Neira Ayuso
On Wed, Feb 14, 2018 at 05:21:19PM +0100, Paolo Abeni wrote:
> syzbot reported a division by 0 bug in the netfilter nat code:
> 
> divide error:  [#1] SMP KASAN
> Dumping ftrace buffer:
> (ftrace buffer empty)
> Modules linked in:
> CPU: 1 PID: 4168 Comm: syzkaller034710 Not tainted 4.16.0-rc1+ #309
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:nf_nat_l4proto_unique_tuple+0x291/0x530
> net/netfilter/nf_nat_proto_common.c:88
> RSP: 0018:8801b2466778 EFLAGS: 00010246
> RAX: f153 RBX: 8801b2466dd8 RCX: 8801b2466c7c
> RDX:  RSI: 8801b2466c58 RDI: 8801db5293ac
> RBP: 8801b24667d8 R08: 8801b8ba6dc0 R09: 88af5900
> R10: 8801b24666f0 R11:  R12: 2990f153
> R13: 0001 R14:  R15: 8801b2466c7c
> FS:  017e3880() GS:8801db50() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 208fdfe4 CR3: 0001b5340002 CR4: 001606e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>   dccp_unique_tuple+0x40/0x50 net/netfilter/nf_nat_proto_dccp.c:30
>   get_unique_tuple+0xc28/0x1c10 net/netfilter/nf_nat_core.c:362
>   nf_nat_setup_info+0x1c2/0xe00 net/netfilter/nf_nat_core.c:406
>   nf_nat_redirect_ipv6+0x306/0x730 net/netfilter/nf_nat_redirect.c:124
>   redirect_tg6+0x7f/0xb0 net/netfilter/xt_REDIRECT.c:34
>   ip6t_do_table+0xc2a/0x1a30 net/ipv6/netfilter/ip6_tables.c:365
>   ip6table_nat_do_chain+0x65/0x80 net/ipv6/netfilter/ip6table_nat.c:41
>   nf_nat_ipv6_fn+0x594/0xa80 net/ipv6/netfilter/nf_nat_l3proto_ipv6.c:302
>   nf_nat_ipv6_local_fn+0x33/0x5d0
> net/ipv6/netfilter/nf_nat_l3proto_ipv6.c:407
>   ip6table_nat_local_fn+0x2c/0x40 net/ipv6/netfilter/ip6table_nat.c:69
>   nf_hook_entry_hookfn include/linux/netfilter.h:120 [inline]
>   nf_hook_slow+0xba/0x1a0 net/netfilter/core.c:483
>   nf_hook include/linux/netfilter.h:243 [inline]
>   NF_HOOK include/linux/netfilter.h:286 [inline]
>   ip6_xmit+0x10ec/0x2260 net/ipv6/ip6_output.c:277
>   inet6_csk_xmit+0x2fc/0x580 net/ipv6/inet6_connection_sock.c:139
>   dccp_transmit_skb+0x9ac/0x10f0 net/dccp/output.c:142
>   dccp_connect+0x369/0x670 net/dccp/output.c:564
>   dccp_v6_connect+0xe17/0x1bf0 net/dccp/ipv6.c:946
>   __inet_stream_connect+0x2d4/0xf00 net/ipv4/af_inet.c:620
>   inet_stream_connect+0x58/0xa0 net/ipv4/af_inet.c:684
>   SYSC_connect+0x213/0x4a0 net/socket.c:1639
>   SyS_connect+0x24/0x30 net/socket.c:1620
>   do_syscall_64+0x282/0x940 arch/x86/entry/common.c:287
>   entry_SYSCALL_64_after_hwframe+0x26/0x9b
> RIP: 0033:0x441c69
> RSP: 002b:7ffe50cc0be8 EFLAGS: 0217 ORIG_RAX: 002a
> RAX: ffda RBX:  RCX: 00441c69
> RDX: 001c RSI: 208fdfe4 RDI: 0003
> RBP: 006cc018 R08:  R09: 
> R10: 0538 R11: 0217 R12: 00403590
> R13: 00403620 R14:  R15: 
> Code: 48 89 f0 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 46 02 00 00 48 8b
> 45 c8 44 0f b7 20 e8 88 97 04 fd 31 d2 41 0f b7 c4 4c 89 f9 <41> f7 f6 48
> c1 e9 03 48 b8 00 00 00 00 00 fc ff df 0f b6 0c 01
> RIP: nf_nat_l4proto_unique_tuple+0x291/0x530
> net/netfilter/nf_nat_proto_common.c:88 RSP: 8801b2466778
> 
> The problem is that currently we don't have any check on the
> configured port range. A port range == -1 triggers the bug, while
> other negative values may require a very long time to complete the
> following loop.
> 
> This commit addresses the issue swapping the two ends on negative
> ranges. The check is performed in nf_nat_l4proto_unique_tuple() since
> the nft nat loads the port values from nft registers at runtime.

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nf RFC] netfilter: x_tables: only allow jumps to user-defined chains

2018-02-14 Thread Pablo Neira Ayuso
On Wed, Feb 07, 2018 at 02:20:41PM +0100, Florian Westphal wrote:
> This rejects rulesets where a jump occurs to a non-user defined chain.
> This isn't limited in any way in the binary format (you can jump to
> any rule you want within the blob structure), but iptables tools
> do not offset such a feature.
> 
> Sending as RFC as this limits features that might be used by programs
> that don't call xtables(-restore) tools.
> 
> This change also prevents the syzkaller reported crash as
> ruleset gets rejected.

My original intention was to go for this, given our official interface
since the beginning has been iptables-restore. But given this
description makes it clear that we have chance to break third
applications relying on this binary layout, better go conservative and
keep allowing this.

My only concern so far is if this sort of flexibility, allowing us
arbitrary jumps, can cause us more problems later on.

Let me know,
Thanks!
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Overlapping IP networks no longer allowed?

2018-02-14 Thread Florian Westphal
Pablo Neira Ayuso  wrote:
> On Wed, Feb 14, 2018 at 07:02:32PM +0200, Mantas Mikulėnas wrote:
> > Hello,
> > 
> > As of nftables 0.8.1, it seems I can no longer write anonymous sets
> > which contain overlapping networks (CIDR masks).
> > 
> > For example, I want to write the following ruleset:
> > 
> > #!/usr/bin/nft -f
> > define users = { 10.0.0.0/8, 193.219.181.192/26 }
> > define admins = { 10.123.0.0/24, 31.220.42.129 }
> > define allowed = { $users, $admins }
> > table inet filter {
> > chain foobar {
> > ip saddr $allowed accept
> > }
> > }
> > 
> > results in an error message:
> > 
> > Error: interval overlaps with previous one
> > 
> > I noticed a few nftables.git commits related to disabling auto-merge
> > for interval sets... but mine don't have the 'interval' flag, and
> > there doesn't seem to be any way to specify 'auto-merge' for anonymous
> > sets, either.
> 
> I would like not to enable this by default since typo in rulesets
> could go through unnoticed.

nft add rule filter input ip protocol '{6 ,tcp }'
works just fine, eliminating duplicate set elements.
So I don't see how that is different from removing the redundant parts
of an anon set.

Especially with 'define' things I believe that automerge by default
is desireable.

> So the two alternatives I see are:
> 
> 1) add per-table configuration options, this would allow us to
>enable auto-merge explicitly for all anonymous sets. This is also
>required if we want to allow user to select "policy memory;" for
>anonymous sets. Only problem with this approach is that this needs
>a kernel patch, so it will take a while to restore the behaviour you
>want since we need a new NFTA_TABLE_USERDATA attribute to store user
>preferences on this.

Right.

> 2) We add a -m option that we can combine with -f for this, which
>globally enables auto-merge for every set, including anonymous and
>named sets.

What about doing automerge by default again for anon sets?

I know you don't like it but it restores old behaviour.
We could have a debug option that tells users which addresse(s) were
autoremoved.

The typo argument -- not sure its a valid:
Consider '10.0.0.01' (instead of .10), we don't try to be 'smart'
and thats a good thing.

For named sets, the no automerge makes sense because it seems like
we can't make any reasonable default choice when users try to delete
a no-longer existing (i.e. merged) element.

But that problem doesn't exist with constant (anon or not) sets.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH nf] netfilter: don't set F_IFACE on ipv6 fib lookups

2018-02-14 Thread Florian Westphal
"fib" starts to behave strangely when an ipv6 default route is
added - the FIB lookup returns a route using 'oif' in this case.

This behaviour was inherited from ip6tables rpfilter so change
this as well.

Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1221
Signed-off-by: Florian Westphal 
---
 net/ipv6/netfilter/ip6t_rpfilter.c |  4 
 net/ipv6/netfilter/nft_fib_ipv6.c  | 12 ++--
 2 files changed, 2 insertions(+), 14 deletions(-)

diff --git a/net/ipv6/netfilter/ip6t_rpfilter.c 
b/net/ipv6/netfilter/ip6t_rpfilter.c
index 94deb69bbbda..91ed25a24b79 100644
--- a/net/ipv6/netfilter/ip6t_rpfilter.c
+++ b/net/ipv6/netfilter/ip6t_rpfilter.c
@@ -48,10 +48,6 @@ static bool rpfilter_lookup_reverse6(struct net *net, const 
struct sk_buff *skb,
}
 
fl6.flowi6_mark = flags & XT_RPFILTER_VALID_MARK ? skb->mark : 0;
-   if ((flags & XT_RPFILTER_LOOSE) == 0) {
-   fl6.flowi6_oif = dev->ifindex;
-   lookup_flags |= RT6_LOOKUP_F_IFACE;
-   }
 
rt = (void *) ip6_route_lookup(net, &fl6, lookup_flags);
if (rt->dst.error)
diff --git a/net/ipv6/netfilter/nft_fib_ipv6.c 
b/net/ipv6/netfilter/nft_fib_ipv6.c
index cc5174c7254c..62fc84d7bdff 100644
--- a/net/ipv6/netfilter/nft_fib_ipv6.c
+++ b/net/ipv6/netfilter/nft_fib_ipv6.c
@@ -180,7 +180,6 @@ void nft_fib6_eval(const struct nft_expr *expr, struct 
nft_regs *regs,
}
 
*dest = 0;
- again:
rt = (void *)ip6_route_lookup(nft_net(pkt), &fl6, lookup_flags);
if (rt->dst.error)
goto put_rt_err;
@@ -189,15 +188,8 @@ void nft_fib6_eval(const struct nft_expr *expr, struct 
nft_regs *regs,
if (rt->rt6i_flags & (RTF_REJECT | RTF_ANYCAST | RTF_LOCAL))
goto put_rt_err;
 
-   if (oif && oif != rt->rt6i_idev->dev) {
-   /* multipath route? Try again with F_IFACE */
-   if ((lookup_flags & RT6_LOOKUP_F_IFACE) == 0) {
-   lookup_flags |= RT6_LOOKUP_F_IFACE;
-   fl6.flowi6_oif = oif->ifindex;
-   ip6_rt_put(rt);
-   goto again;
-   }
-   }
+   if (oif && oif != rt->rt6i_idev->dev)
+   goto put_rt_err;
 
switch (priv->result) {
case NFT_FIB_RESULT_OIF:
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: shift by n bits while performing '--restore-mark'

2018-02-14 Thread Jack Ma
Hi Florian,

I attached two 'draft' patches in this email :)

Thanks,
JackFrom 6d811e63c9c777ed4287bc4547134c99e939b49d Mon Sep 17 00:00:00 2001
From: Jack Ma 
Date: Mon, 12 Feb 2018 13:41:29 +1300
Subject: [PATCH] libxt_CONNMARK: Support bit-shifting for --restore,set and
 save-mark

Added bit-shifting operations for --restore & set & save-mark.

Signed-off-by: Jack Ma 
Signed-off-by: Florian Westphal 
---
 extensions/libxt_CONNMARK.c   | 176 ++
 include/linux/netfilter/xt_connmark.h |   2 +-
 2 files changed, 139 insertions(+), 39 deletions(-)

diff --git a/extensions/libxt_CONNMARK.c b/extensions/libxt_CONNMARK.c
index f60be583..dbd02351 100644
--- a/extensions/libxt_CONNMARK.c
+++ b/extensions/libxt_CONNMARK.c
@@ -28,32 +28,43 @@
 struct xt_connmark_target_info {
 	unsigned long mark;
 	unsigned long mask;
+	uint8_t shift_dir;
+	uint8_t shift_bits;
 	uint8_t mode;
 };
 
 enum {
+	D_SHIFT_LEFT = 0,
+	D_SHIFT_RIGHT,
+};
+
+enum {
 	O_SET_MARK = 0,
 	O_SAVE_MARK,
 	O_RESTORE_MARK,
 	O_AND_MARK,
 	O_OR_MARK,
 	O_XOR_MARK,
+	O_LEFT_SHIFT_MARK,
+	O_RIGHT_SHIFT_MARK,
 	O_SET_XMARK,
 	O_CTMASK,
 	O_NFMASK,
 	O_MASK,
-	F_SET_MARK = 1 << O_SET_MARK,
-	F_SAVE_MARK= 1 << O_SAVE_MARK,
-	F_RESTORE_MARK = 1 << O_RESTORE_MARK,
-	F_AND_MARK = 1 << O_AND_MARK,
-	F_OR_MARK  = 1 << O_OR_MARK,
-	F_XOR_MARK = 1 << O_XOR_MARK,
-	F_SET_XMARK= 1 << O_SET_XMARK,
-	F_CTMASK   = 1 << O_CTMASK,
-	F_NFMASK   = 1 << O_NFMASK,
-	F_MASK = 1 << O_MASK,
-	F_OP_ANY   = F_SET_MARK | F_SAVE_MARK | F_RESTORE_MARK |
-	 F_AND_MARK | F_OR_MARK | F_XOR_MARK | F_SET_XMARK,
+	F_SET_MARK = 1 << O_SET_MARK,
+	F_SAVE_MARK= 1 << O_SAVE_MARK,
+	F_RESTORE_MARK = 1 << O_RESTORE_MARK,
+	F_AND_MARK = 1 << O_AND_MARK,
+	F_OR_MARK  = 1 << O_OR_MARK,
+	F_XOR_MARK = 1 << O_XOR_MARK,
+	F_LEFT_SHIFT_MARK  = 1 << O_LEFT_SHIFT_MARK,
+	F_RIGHT_SHIFT_MARK = 1 << O_RIGHT_SHIFT_MARK,
+	F_SET_XMARK= 1 << O_SET_XMARK,
+	F_CTMASK   = 1 << O_CTMASK,
+	F_NFMASK   = 1 << O_NFMASK,
+	F_MASK = 1 << O_MASK,
+	F_OP_ANY   = F_SET_MARK | F_SAVE_MARK | F_RESTORE_MARK |
+	 F_AND_MARK | F_OR_MARK | F_XOR_MARK | F_SET_XMARK,
 };
 
 static void CONNMARK_help(void)
@@ -74,6 +85,8 @@ static const struct xt_option_entry CONNMARK_opts[] = {
 	{.name = "restore-mark", .id = O_RESTORE_MARK, .type = XTTYPE_NONE,
 	 .excl = F_OP_ANY},
 	{.name = "mask", .id = O_MASK, .type = XTTYPE_UINT32},
+	{.name = "left-shift-mark", .id = O_LEFT_SHIFT_MARK, .type = XTTYPE_UINT8},
+	{.name = "right-shift-mark", .id = O_RIGHT_SHIFT_MARK, .type = XTTYPE_UINT8},
 	XTOPT_TABLEEND,
 };
 #undef s
@@ -94,6 +107,8 @@ static const struct xt_option_entry connmark_tg_opts[] = {
 	 .excl = F_OP_ANY},
 	{.name = "restore-mark", .id = O_RESTORE_MARK, .type = XTTYPE_NONE,
 	 .excl = F_OP_ANY},
+	{.name = "left-shift-mark", .id = O_LEFT_SHIFT_MARK, .type = XTTYPE_UINT8},
+	{.name = "right-shift-mark", .id = O_RIGHT_SHIFT_MARK, .type = XTTYPE_UINT8},
 	{.name = "ctmask", .id = O_CTMASK, .type = XTTYPE_UINT32,
 	 .excl = F_MASK, .flags = XTOPT_PUT, XTOPT_POINTER(s, ctmask)},
 	{.name = "nfmask", .id = O_NFMASK, .type = XTTYPE_UINT32,
@@ -119,6 +134,8 @@ static void connmark_tg_help(void)
 "  --and-mark value  Binary AND the ctmark with bits\n"
 "  --or-mark value   Binary OR  the ctmark with bits\n"
 "  --xor-mark value  Binary XOR the ctmark with bits\n"
+"  --left-shift-mark value   Left shift the ctmark with bits\n"
+"  --right-shift-mark value  Right shift the ctmark with bits\n"
 );
 }
 
@@ -154,6 +171,16 @@ static void CONNMARK_parse(struct xt_option_call *cb)
 	case O_MASK:
 		markinfo->mask = cb->val.u32;
 		break;
+	case O_LEFT_SHIFT_MARK:
+		markinfo->mode = XT_CONNMARK_RESTORE;
+		markinfo->shift_dir = D_SHIFT_LEFT;
+		markinfo->shift_bits = cb->val.u8;
+		break;
+	case O_RIGHT_SHIFT_MARK:
+		markinfo->mode = XT_CONNMARK_RESTORE;
+		markinfo->shift_dir = D_SHIFT_RIGHT;
+		markinfo->shift_bits = cb->val.u8;
+  break;
 	}
 }
 
@@ -197,6 +224,14 @@ static void connmark_tg_parse(struct xt_option_call *cb)
 	case O_MASK:
 		info->nfmask = info->ctmask = cb->val.u32;
 		break;
+	case O_LEFT_SHIFT_MARK:
+		info->shift_dir = D_SHIFT_LEFT;
+		info->shift_bits = cb->val.u8;
+		break;
+	case O_RIGHT_SHIFT_MARK:
+		info->shift_dir = D_SHIFT_RIGHT;
+		info->shift_bits = cb->val.u8;
+  break;
 	}
 }
 
@@ -253,36 +288,101 @@ connmark_tg_print(const void *ip, const struct xt_entry_target *target,
 
 	switch (info->mode) {
 	case XT_CONNMARK_SET:
-		if (info->ctmark == 0)
-			printf(" CONNMARK and 0x%x",
-			   (unsigned int)(uint32_t)~info->ctmask);
-		else if (info->ctmark == info->ctmask)
-			printf(" CONNMARK or 0x%x", info->ctmark);
-		else if (info->ctmask == 0)
-			printf(" CONNMARK xor 0x%x", info->ctmark);
-		else if (info->ctmask == 0xU)
-			printf(" CONNMARK