[PATCH net v2 2/2] ipv6: properly check return value in inet6_dump_all()

2018-11-02 Thread Alexey Kodanev
Make sure we call fib6_dump_end() if it happens that skb->len
is zero. rtnl_dump_all() can reset cb->args on the next loop
iteration there.

Fixes: 08e814c9e8eb ("net/ipv6: Bail early if user only wants cloned entries")
Fixes: ae677bbb4441 ("net: Don't return invalid table id error when dumping all 
families")
Signed-off-by: Alexey Kodanev 
---

v2: a new patch in v2

 net/ipv6/ip6_fib.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 1b8bc00..ae37861 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -591,7 +591,7 @@ static int inet6_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
 
/* fib entries are never clones */
if (arg.filter.flags & RTM_F_CLONED)
-   return skb->len;
+   goto out;
 
w = (void *)cb->args[2];
if (!w) {
@@ -621,7 +621,7 @@ static int inet6_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
tb = fib6_get_table(net, arg.filter.table_id);
if (!tb) {
if (arg.filter.dump_all_families)
-   return skb->len;
+   goto out;
 
NL_SET_ERR_MSG_MOD(cb->extack, "FIB table does not 
exist");
return -ENOENT;
-- 
1.8.3.1



[PATCH net v2 1/2] rtnetlink: restore handling of dumpit return value in rtnl_dump_all()

2018-11-02 Thread Alexey Kodanev
For non-zero return from dumpit() we should break the loop
in rtnl_dump_all() and return the result. Otherwise, e.g.,
we could get the memory leak in inet6_dump_fib() [1]. The
pointer to the allocated struct fib6_walker there (saved
in cb->args) can be lost, reset on the next iteration.

Fix it by partially restoring the previous behavior before
commit c63586dc9b3e ("net: rtnl_dump_all needs to propagate
error from dumpit function"). The returned error from
dumpit() is still passed further.

[1]:
unreferenced object 0x88001322a200 (size 96):
  comm "sshd", pid 1484, jiffies 4296032768 (age 1432.542s)
  hex dump (first 32 bytes):
00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  
18 09 41 36 00 88 ff ff 18 09 41 36 00 88 ff ff  ..A6..A6
  backtrace:
[<95846b39>] kmem_cache_alloc_trace+0x151/0x220
[<7d12709f>] inet6_dump_fib+0x68d/0x940
[<2775a316>] rtnl_dump_all+0x1d9/0x2d0
[<d7cd302b>] netlink_dump+0x945/0x11a0
[<2f43485f>] __netlink_dump_start+0x55d/0x800
[<f76bbeec>] rtnetlink_rcv_msg+0x4fa/0xa00
[<9b5761f3>] netlink_rcv_skb+0x29c/0x420
[<87a1dae1>] rtnetlink_rcv+0x15/0x20
[<691b703b>] netlink_unicast+0x4e3/0x6c0
[<b5be0204>] netlink_sendmsg+0x7f2/0xba0
[<96d2aa60>] sock_sendmsg+0xba/0xf0
[<8c1b786f>] __sys_sendto+0x1e4/0x330
[<19587b3f>] __x64_sys_sendto+0xe1/0x1a0
[<071f4d56>] do_syscall_64+0x9f/0x300
[<2737577f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<57587684>] 0xffff

Fixes: c63586dc9b3e ("net: rtnl_dump_all needs to propagate error from dumpit 
function")
Signed-off-by: Alexey Kodanev 
---

v2: * fix it by restoring the previous behavior as suggested by David

* adjust subject and commit description

 net/core/rtnetlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index e01274b..33d9227 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3367,7 +3367,7 @@ static int rtnl_dump_all(struct sk_buff *skb, struct 
netlink_callback *cb)
cb->seq = 0;
}
ret = dumpit(skb, cb);
-   if (ret < 0)
+   if (ret)
break;
}
cb->family = idx;
-- 
1.8.3.1



Re: [PATCH net] rtnetlink: invoke 'cb->done' destructor before 'cb->args' reset

2018-11-01 Thread Alexey Kodanev
On 11/01/2018 04:11 PM, Alexey Kodanev wrote:
> On 10/31/2018 08:35 PM, David Ahern wrote:
>> On 10/31/18 10:55 AM, David Ahern wrote:
>>> I think the simplest fix for 4.20 is to break the loop if ret is non-0 -
>>> restore the previous behavior. 
>>
>> that is the only recourse. It has to bail if ret is non-0. Do you want
>> to send a patch with that fix?
>>
> 
> I see, and inet6_dump_fib() cleanups fib6_walker if ret is zero. Will send 
> the fix.

Can it happen that inet6_dump_fib() returns skb->len (0) in the below cases?

*   if (arg.filter.flags & RTM_F_CLONED)
return skb->len;

...

w = (void *)cb->args[2];
if (!w) {
...
w = kzalloc(...)
...

*   if (arg.filter.table_id) {
...
if (!tb) {
if (arg.filter.dump_all_families)
return skb->len;


Would it be safer to add "res = skb->len; goto out;" instead of "return 
skb->len;"
so that it can call fib6_dump_end() for "res <= 0"? Or use cb->data instead of
cb->args?


Re: [PATCH net] rtnetlink: invoke 'cb->done' destructor before 'cb->args' reset

2018-11-01 Thread Alexey Kodanev
On 10/31/2018 08:35 PM, David Ahern wrote:
> On 10/31/18 10:55 AM, David Ahern wrote:
>> I think the simplest fix for 4.20 is to break the loop if ret is non-0 -
>> restore the previous behavior. 
> 
> that is the only recourse. It has to bail if ret is non-0. Do you want
> to send a patch with that fix?
> 

I see, and inet6_dump_fib() cleanups fib6_walker if ret is zero. Will send the 
fix.


Re: [PATCH net] rtnetlink: invoke 'cb->done' destructor before 'cb->args' reset

2018-10-31 Thread Alexey Kodanev
On 31.10.2018 09:42, Alexey Kodanev wrote:
> cb->args[2] can store the pointer to the struct fib6_walker,
> allocated in inet6_dump_fib(). On the next loop iteration in
> rtnl_dump_all(), 'memset(, 0, sizeof(cb->args))' can reset
> that pointer, leaking the memory [1].
>

On the second thought we could as well save the state of fib6_walker
in inet6_dump_fib() with cb->data. That should fix the leak too.
Is it sounds reasonable?

Thanks,
Alexey

 
> Fix it by calling cb->done, if it is set, before filling 'cb->args'
> with zeros.
> 
> Looks like the recent changes in rtnl_dump_all() contributed to
> the appearance of this kmemleak [1], commit c63586dc9b3e ("net:
> rtnl_dump_all needs to propagate error from dumpit function")
> breaks the loop only on an error now.
> 
> [1]:
> unreferenced object 0x88001322a200 (size 96):
>   comm "sshd", pid 1484, jiffies 4296032768 (age 1432.542s)
>   hex dump (first 32 bytes):
> 00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  
> 18 09 41 36 00 88 ff ff 18 09 41 36 00 88 ff ff  ..A6..A6
>   backtrace:
> [<95846b39>] kmem_cache_alloc_trace+0x151/0x220
> [<7d12709f>] inet6_dump_fib+0x68d/0x940
> [<2775a316>] rtnl_dump_all+0x1d9/0x2d0
> [<d7cd302b>] netlink_dump+0x945/0x11a0
> [<2f43485f>] __netlink_dump_start+0x55d/0x800
> [<f76bbeec>] rtnetlink_rcv_msg+0x4fa/0xa00
> [<9b5761f3>] netlink_rcv_skb+0x29c/0x420
> [<87a1dae1>] rtnetlink_rcv+0x15/0x20
> [<691b703b>] netlink_unicast+0x4e3/0x6c0
> [<b5be0204>] netlink_sendmsg+0x7f2/0xba0
> [<96d2aa60>] sock_sendmsg+0xba/0xf0
> [<8c1b786f>] __sys_sendto+0x1e4/0x330
> [<19587b3f>] __x64_sys_sendto+0xe1/0x1a0
> [<071f4d56>] do_syscall_64+0x9f/0x300
>     [<00002737577f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [<57587684>] 0x
> 
> Fixes: 1b43af5480c3 ("[IPV6]: Increase number of possible routing tables to 
> 2^32")
> Signed-off-by: Alexey Kodanev 
> ---
>  net/core/rtnetlink.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index f679c7a..314c683 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -3362,6 +3362,8 @@ static int rtnl_dump_all(struct sk_buff *skb, struct 
> netlink_callback *cb)
>   continue;
>  
>   if (idx > s_idx) {
> + if (cb->done)
> + cb->done(cb);
>   memset(>args[0], 0, sizeof(cb->args));
>   cb->prev_seq = 0;
>   cb->seq = 0;
> 



[PATCH net] rtnetlink: invoke 'cb->done' destructor before 'cb->args' reset

2018-10-31 Thread Alexey Kodanev
cb->args[2] can store the pointer to the struct fib6_walker,
allocated in inet6_dump_fib(). On the next loop iteration in
rtnl_dump_all(), 'memset(, 0, sizeof(cb->args))' can reset
that pointer, leaking the memory [1].

Fix it by calling cb->done, if it is set, before filling 'cb->args'
with zeros.

Looks like the recent changes in rtnl_dump_all() contributed to
the appearance of this kmemleak [1], commit c63586dc9b3e ("net:
rtnl_dump_all needs to propagate error from dumpit function")
breaks the loop only on an error now.

[1]:
unreferenced object 0x88001322a200 (size 96):
  comm "sshd", pid 1484, jiffies 4296032768 (age 1432.542s)
  hex dump (first 32 bytes):
00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  
18 09 41 36 00 88 ff ff 18 09 41 36 00 88 ff ff  ..A6..A6
  backtrace:
[<95846b39>] kmem_cache_alloc_trace+0x151/0x220
[<7d12709f>] inet6_dump_fib+0x68d/0x940
[<2775a316>] rtnl_dump_all+0x1d9/0x2d0
[<d7cd302b>] netlink_dump+0x945/0x11a0
[<2f43485f>] __netlink_dump_start+0x55d/0x800
[<f76bbeec>] rtnetlink_rcv_msg+0x4fa/0xa00
[<9b5761f3>] netlink_rcv_skb+0x29c/0x420
[<87a1dae1>] rtnetlink_rcv+0x15/0x20
[<691b703b>] netlink_unicast+0x4e3/0x6c0
[<b5be0204>] netlink_sendmsg+0x7f2/0xba0
[<96d2aa60>] sock_sendmsg+0xba/0xf0
[<8c1b786f>] __sys_sendto+0x1e4/0x330
[<19587b3f>] __x64_sys_sendto+0xe1/0x1a0
[<071f4d56>] do_syscall_64+0x9f/0x300
[<2737577f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<57587684>] 0xffff

Fixes: 1b43af5480c3 ("[IPV6]: Increase number of possible routing tables to 
2^32")
Signed-off-by: Alexey Kodanev 
---
 net/core/rtnetlink.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index f679c7a..314c683 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3362,6 +3362,8 @@ static int rtnl_dump_all(struct sk_buff *skb, struct 
netlink_callback *cb)
continue;
 
if (idx > s_idx) {
+   if (cb->done)
+   cb->done(cb);
memset(>args[0], 0, sizeof(cb->args));
cb->prev_seq = 0;
cb->seq = 0;
-- 
1.8.3.1



Re: [PATCH net] ipv6: don't get lwtstate twice in ip6_rt_copy_init()

2018-08-31 Thread Alexey Kodanev
On 30.08.2018 19:10, David Ahern wrote:
> On 8/30/18 10:11 AM, Alexey Kodanev wrote:
...
>> unreferenced object 0x880b6aaa14e0 (size 64):
>>   comm "ip", pid 10577, jiffies 4295149341 (age 1273.903s)
>>   hex dump (first 32 bytes):
>> 01 00 04 00 04 00 00 00 10 00 00 00 00 00 00 00  
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>>   backtrace:
>> [<18664623>] lwtunnel_build_state+0x1bc/0x420
>> [<b73aa29a>] ip6_route_info_create+0x9f7/0x1fd0
>> [<ee2c5d1f>] ip6_route_add+0x14/0x70
>> [<8537b55c>] inet6_rtm_newroute+0xd9/0xe0
>> [<2acc50f5>] rtnetlink_rcv_msg+0x66f/0x8e0
>> [<8d9cd381>] netlink_rcv_skb+0x268/0x3b0
>> [<4c893c76>] netlink_unicast+0x417/0x5a0
>> [<f2ab1afb>] netlink_sendmsg+0x70b/0xc30
>> [<890ff0aa>] sock_sendmsg+0xb1/0xf0
>> [<a2e7b66f>] ___sys_sendmsg+0x659/0x950
>> [<1e7426c8>] __sys_sendmsg+0xde/0x170
>> [<fe411443>] do_syscall_64+0x9f/0x4a0
>> [<1be7b28b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [<6d21f353>] 0x
> 
> What test did you run to uncover this? Curious as to why my testing that
> found the need for 80f1a0f4e0cd did not hit this.

I was using IPv6 route with MPLS. Will submit MPLS tests to LTP soon,
they will include that test as well.

Meanwhile, these commands below are able to trigger it:

  ip route add $new_route encap mpls 50 via inet6 $ip_rhost
  ping6 $ip_new_route
  ip route del $new_route

Thanks,
Alexey


[PATCH net] ipv6: don't get lwtstate twice in ip6_rt_copy_init()

2018-08-30 Thread Alexey Kodanev
Commit 80f1a0f4e0cd ("net/ipv6: Put lwtstate when destroying fib6_info")
partially fixed the kmemleak [1], lwtstate can be copied from fib6_info,
with ip6_rt_copy_init(), and it should be done only once there.

rt->dst.lwtstate is set by ip6_rt_init_dst(), at the start of the function
ip6_rt_copy_init(), so there is no need to get it again at the end.

With this patch, lwtstate also isn't copied from RTF_REJECT routes.

[1]:
unreferenced object 0x880b6aaa14e0 (size 64):
  comm "ip", pid 10577, jiffies 4295149341 (age 1273.903s)
  hex dump (first 32 bytes):
01 00 04 00 04 00 00 00 10 00 00 00 00 00 00 00  
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[<18664623>] lwtunnel_build_state+0x1bc/0x420
[<b73aa29a>] ip6_route_info_create+0x9f7/0x1fd0
[<ee2c5d1f>] ip6_route_add+0x14/0x70
[<8537b55c>] inet6_rtm_newroute+0xd9/0xe0
[<2acc50f5>] rtnetlink_rcv_msg+0x66f/0x8e0
[<8d9cd381>] netlink_rcv_skb+0x268/0x3b0
[<4c893c76>] netlink_unicast+0x417/0x5a0
[<f2ab1afb>] netlink_sendmsg+0x70b/0xc30
[<890ff0aa>] sock_sendmsg+0xb1/0xf0
[<a2e7b66f>] ___sys_sendmsg+0x659/0x950
[<1e7426c8>] __sys_sendmsg+0xde/0x170
[<fe411443>] do_syscall_64+0x9f/0x4a0
[<1be7b28b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[<6d21f353>] 0xffffffff

Fixes: 6edb3c96a5f0 ("net/ipv6: Defer initialization of dst to data path")
Signed-off-by: Alexey Kodanev 
---
 net/ipv6/route.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 8e08a91..9f27ada 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -996,7 +996,6 @@ static void ip6_rt_copy_init(struct rt6_info *rt, struct 
fib6_info *ort)
rt->rt6i_src = ort->fib6_src;
 #endif
rt->rt6i_prefsrc = ort->fib6_prefsrc;
-   rt->dst.lwtstate = lwtstate_get(ort->fib6_nh.nh_lwtstate);
 }
 
 static struct fib6_node* fib6_backtrack(struct fib6_node *fn,
-- 
1.8.3.1



[PATCH net] vti6: remove !skb->ignore_df check from vti6_xmit()

2018-08-23 Thread Alexey Kodanev
Before the commit d6990976af7c ("vti6: fix PMTU caching and reporting
on xmit") '!skb->ignore_df' check was always true because the function
skb_scrub_packet() was called before it, resetting ignore_df to zero.

In the commit, skb_scrub_packet() was moved below, and now this check
can be false for the packet, e.g. when sending it in the two fragments,
this prevents successful PMTU updates in such case. The next attempts
to send the packet lead to the same tx error. Moreover, vti6 initial
MTU value relies on PMTU adjustments.

This issue can be reproduced with the following LTP test script:
udp_ipsec_vti.sh -6 -p ah -m tunnel -s 2000

Fixes: ccd740cbc6e0 ("vti6: Add pmtu handling to vti6_xmit.")
Signed-off-by: Alexey Kodanev 
---
Not sure about xfrmi_xmit2(), it has a similar check for ignore_df...

 net/ipv6/ip6_vti.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index 38dec9d..f48d196 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -481,7 +481,7 @@ static bool vti6_state_check(const struct xfrm_state *x,
}
 
mtu = dst_mtu(dst);
-   if (!skb->ignore_df && skb->len > mtu) {
+   if (skb->len > mtu) {
skb_dst_update_pmtu(skb, mtu);
 
if (skb->protocol == htons(ETH_P_IPV6)) {
-- 
1.8.3.1



[PATCH net v2] dccp: fix undefined behavior with 'cwnd' shift in ccid2_cwnd_restart()

2018-08-07 Thread Alexey Kodanev
The shift of 'cwnd' with '(now - hc->tx_lsndtime) / hc->tx_rto' value
can lead to undefined behavior [1].

In order to fix this use a gradual shift of the window with a 'while'
loop, similar to what tcp_cwnd_restart() is doing.

When comparing delta and RTO there is a minor difference between TCP
and DCCP, the last one also invokes dccp_cwnd_restart() and reduces
'cwnd' if delta equals RTO. That case is preserved in this change.

[1]:
[40850.963623] UBSAN: Undefined behaviour in net/dccp/ccids/ccid2.c:237:7
[40851.043858] shift exponent 67 is too large for 32-bit type 'unsigned int'
[40851.127163] CPU: 3 PID: 15940 Comm: netstress Tainted: GW   E 
4.18.0-rc7.x86_64 #1
...
[40851.377176] Call Trace:
[40851.408503]  dump_stack+0xf1/0x17b
[40851.451331]  ? show_regs_print_info+0x5/0x5
[40851.503555]  ubsan_epilogue+0x9/0x7c
[40851.548363]  __ubsan_handle_shift_out_of_bounds+0x25b/0x2b4
[40851.617109]  ? __ubsan_handle_load_invalid_value+0x18f/0x18f
[40851.686796]  ? xfrm4_output_finish+0x80/0x80
[40851.739827]  ? lock_downgrade+0x6d0/0x6d0
[40851.789744]  ? xfrm4_prepare_output+0x160/0x160
[40851.845912]  ? ip_queue_xmit+0x810/0x1db0
[40851.895845]  ? ccid2_hc_tx_packet_sent+0xd36/0x10a0 [dccp]
[40851.963530]  ccid2_hc_tx_packet_sent+0xd36/0x10a0 [dccp]
[40852.029063]  dccp_xmit_packet+0x1d3/0x720 [dccp]
[40852.086254]  dccp_write_xmit+0x116/0x1d0 [dccp]
[40852.142412]  dccp_sendmsg+0x428/0xb20 [dccp]
[40852.195454]  ? inet_dccp_listen+0x200/0x200 [dccp]
[40852.254833]  ? sched_clock+0x5/0x10
[40852.298508]  ? sched_clock+0x5/0x10
[40852.342194]  ? inet_create+0xdf0/0xdf0
[40852.388988]  sock_sendmsg+0xd9/0x160
...

Fixes: 113ced1f52e5 ("dccp ccid-2: Perform congestion-window validation")
Signed-off-by: Alexey Kodanev 
---

v2: instead of checking the value with min(), use gradual shifting,
similar to tcp_cwnd_restart().

 net/dccp/ccids/ccid2.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/dccp/ccids/ccid2.c b/net/dccp/ccids/ccid2.c
index 2b75df4..842a9c7 100644
--- a/net/dccp/ccids/ccid2.c
+++ b/net/dccp/ccids/ccid2.c
@@ -229,14 +229,16 @@ static void ccid2_cwnd_restart(struct sock *sk, const u32 
now)
struct ccid2_hc_tx_sock *hc = ccid2_hc_tx_sk(sk);
u32 cwnd = hc->tx_cwnd, restart_cwnd,
iwnd = rfc3390_bytes_to_packets(dccp_sk(sk)->dccps_mss_cache);
+   s32 delta = now - hc->tx_lsndtime;
 
hc->tx_ssthresh = max(hc->tx_ssthresh, (cwnd >> 1) + (cwnd >> 2));
 
/* don't reduce cwnd below the initial window (IW) */
restart_cwnd = min(cwnd, iwnd);
-   cwnd >>= (now - hc->tx_lsndtime) / hc->tx_rto;
-   hc->tx_cwnd = max(cwnd, restart_cwnd);
 
+   while ((delta -= hc->tx_rto) >= 0 && cwnd > restart_cwnd)
+   cwnd >>= 1;
+   hc->tx_cwnd = max(cwnd, restart_cwnd);
hc->tx_cwnd_stamp = now;
hc->tx_cwnd_used  = 0;
 
-- 
1.8.3.1



[PATCH net] dccp: fix undefined behavior with 'cwnd' shift in ccid2_cwnd_restart()

2018-08-02 Thread Alexey Kodanev
Make sure that the value of "(now - hc->tx_lsndtime) / hc->tx_rto" is
properly limited when shifting 'u32 cwnd' with it, otherwise we can get:

[40850.963623] UBSAN: Undefined behaviour in net/dccp/ccids/ccid2.c:237:7
[40851.043858] shift exponent 67 is too large for 32-bit type 'unsigned int'
[40851.127163] CPU: 3 PID: 15940 Comm: netstress Tainted: GW   E 
4.18.0-rc7.x86_64 #1
...
[40851.377176] Call Trace:
[40851.408503]  dump_stack+0xf1/0x17b
[40851.451331]  ? show_regs_print_info+0x5/0x5
[40851.503555]  ubsan_epilogue+0x9/0x7c
[40851.548363]  __ubsan_handle_shift_out_of_bounds+0x25b/0x2b4
[40851.617109]  ? __ubsan_handle_load_invalid_value+0x18f/0x18f
[40851.686796]  ? xfrm4_output_finish+0x80/0x80
[40851.739827]  ? lock_downgrade+0x6d0/0x6d0
[40851.789744]  ? xfrm4_prepare_output+0x160/0x160
[40851.845912]  ? ip_queue_xmit+0x810/0x1db0
[40851.895845]  ? ccid2_hc_tx_packet_sent+0xd36/0x10a0 [dccp]
[40851.963530]  ccid2_hc_tx_packet_sent+0xd36/0x10a0 [dccp]
[40852.029063]  dccp_xmit_packet+0x1d3/0x720 [dccp]
[40852.086254]  dccp_write_xmit+0x116/0x1d0 [dccp]
[40852.142412]  dccp_sendmsg+0x428/0xb20 [dccp]
[40852.195454]  ? inet_dccp_listen+0x200/0x200 [dccp]
[40852.254833]  ? sched_clock+0x5/0x10
[40852.298508]  ? sched_clock+0x5/0x10
[40852.342194]  ? inet_create+0xdf0/0xdf0
[40852.388988]  sock_sendmsg+0xd9/0x160
...

Fixes: 113ced1f52e5 ("dccp ccid-2: Perform congestion-window validation")
Signed-off-by: Alexey Kodanev 
---
 net/dccp/ccids/ccid2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/dccp/ccids/ccid2.c b/net/dccp/ccids/ccid2.c
index 2b75df4..2821180 100644
--- a/net/dccp/ccids/ccid2.c
+++ b/net/dccp/ccids/ccid2.c
@@ -234,7 +234,7 @@ static void ccid2_cwnd_restart(struct sock *sk, const u32 
now)
 
/* don't reduce cwnd below the initial window (IW) */
restart_cwnd = min(cwnd, iwnd);
-   cwnd >>= (now - hc->tx_lsndtime) / hc->tx_rto;
+   cwnd >>= min((now - hc->tx_lsndtime) / hc->tx_rto, 31U);
hc->tx_cwnd = max(cwnd, restart_cwnd);
 
hc->tx_cwnd_stamp = now;
-- 
1.8.3.1



[PATCH net] dccp: don't free ccid2_hc_tx_sock struct in dccp_disconnect()

2018-05-21 Thread Alexey Kodanev
mapcount:0 mapping:8801bebb41c0
index:0x8801bebb5240 compound_mapcount: 0
flags: 0x2fffc008100(slab|head)
raw: 02fffc008100 8801bebb41c0 8801bebb5240 00010003
raw: 8801cdba3138 ea0007634120 8801cdbaab40 
page dumped because: kasan: bad access detected
...
==

Reported-by: syzbot+5d47e9ec91a6f15db...@syzkaller.appspotmail.com
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/dccp/proto.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 84cd4e3..0d56e36 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -283,9 +283,7 @@ int dccp_disconnect(struct sock *sk, int flags)
 
dccp_clear_xmit_timers(sk);
ccid_hc_rx_delete(dp->dccps_hc_rx_ccid, sk);
-   ccid_hc_tx_delete(dp->dccps_hc_tx_ccid, sk);
dp->dccps_hc_rx_ccid = NULL;
-   dp->dccps_hc_tx_ccid = NULL;
 
__skb_queue_purge(>sk_receive_queue);
__skb_queue_purge(>sk_write_queue);
-- 
1.8.3.1



[PATCH v2 3/3] selinux: correctly handle sa_family cases in selinux_sctp_bind_connect()

2018-05-11 Thread Alexey Kodanev
Allow to pass the socket address structure with AF_UNSPEC family for
compatibility purposes. selinux_socket_bind() will further check it
for INADDR_ANY and selinux_socket_connect_helper() should return
EINVAL.

For a bad address family return EINVAL instead of AFNOSUPPORT error,
i.e. what is expected from SCTP protocol in such case.

Fixes: d452930fd3b9 ("selinux: Add SCTP support")
Suggested-by: Paul Moore <p...@paul-moore.com>
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---

v2: new patch in v2

 security/selinux/hooks.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index e7882e5a..be5817d 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -5277,6 +5277,7 @@ static int selinux_sctp_bind_connect(struct sock *sk, int 
optname,
while (walk_size < addrlen) {
addr = addr_buf;
switch (addr->sa_family) {
+   case AF_UNSPEC:
case AF_INET:
len = sizeof(struct sockaddr_in);
break;
@@ -5284,7 +5285,7 @@ static int selinux_sctp_bind_connect(struct sock *sk, int 
optname,
len = sizeof(struct sockaddr_in6);
break;
default:
-   return -EAFNOSUPPORT;
+   return -EINVAL;
}
 
err = -EINVAL;
-- 
1.8.3.1



[PATCH v2 1/3] selinux: add AF_UNSPEC and INADDR_ANY checks to selinux_socket_bind()

2018-05-11 Thread Alexey Kodanev
Commit d452930fd3b9 ("selinux: Add SCTP support") breaks compatibility
with the old programs that can pass sockaddr_in structure with AF_UNSPEC
and INADDR_ANY to bind(). As a result, bind() returns EAFNOSUPPORT error.
This was found with LTP/asapi_01 test.

Similar to commit 29c486df6a20 ("net: ipv4: relax AF_INET check in
bind()"), which relaxed AF_INET check for compatibility, add AF_UNSPEC
case to AF_INET and make sure that the address is INADDR_ANY.

Fixes: d452930fd3b9 ("selinux: Add SCTP support")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---

v2: As suggested by Paul:
* return EINVAL for SCTP socket if sa_family is AF_UNSPEC and
  address is not INADDR_ANY
* add new 'sa_family' variable so that it equals either AF_INET
  or AF_INET6. Besides, it it will be used in the next patch that
  fixes audit record.

 security/selinux/hooks.c | 29 +++--
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 4cafe6a..1ed7004 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4576,6 +4576,7 @@ static int selinux_socket_post_create(struct socket 
*sock, int family,
 static int selinux_socket_bind(struct socket *sock, struct sockaddr *address, 
int addrlen)
 {
struct sock *sk = sock->sk;
+   struct sk_security_struct *sksec = sk->sk_security;
u16 family;
int err;
 
@@ -4587,11 +4588,11 @@ static int selinux_socket_bind(struct socket *sock, 
struct sockaddr *address, in
family = sk->sk_family;
if (family == PF_INET || family == PF_INET6) {
char *addrp;
-   struct sk_security_struct *sksec = sk->sk_security;
struct common_audit_data ad;
struct lsm_network_audit net = {0,};
struct sockaddr_in *addr4 = NULL;
struct sockaddr_in6 *addr6 = NULL;
+   u16 family_sa = address->sa_family;
unsigned short snum;
u32 sid, node_perm;
 
@@ -4601,11 +4602,20 @@ static int selinux_socket_bind(struct socket *sock, 
struct sockaddr *address, in
 * need to check address->sa_family as it is possible to have
 * sk->sk_family = PF_INET6 with addr->sa_family = AF_INET.
 */
-   switch (address->sa_family) {
+   switch (family_sa) {
+   case AF_UNSPEC:
case AF_INET:
if (addrlen < sizeof(struct sockaddr_in))
return -EINVAL;
addr4 = (struct sockaddr_in *)address;
+   if (family_sa == AF_UNSPEC) {
+   /* see __inet_bind(), we only want to allow
+* AF_UNSPEC if the address is INADDR_ANY
+*/
+   if (addr4->sin_addr.s_addr != htonl(INADDR_ANY))
+   goto err_af;
+   family_sa = AF_INET;
+   }
snum = ntohs(addr4->sin_port);
addrp = (char *)>sin_addr.s_addr;
break;
@@ -4617,13 +4627,7 @@ static int selinux_socket_bind(struct socket *sock, 
struct sockaddr *address, in
addrp = (char *)>sin6_addr.s6_addr;
break;
default:
-   /* Note that SCTP services expect -EINVAL, whereas
-* others expect -EAFNOSUPPORT.
-*/
-   if (sksec->sclass == SECCLASS_SCTP_SOCKET)
-   return -EINVAL;
-   else
-   return -EAFNOSUPPORT;
+   goto err_af;
}
 
if (snum) {
@@ -4681,7 +4685,7 @@ static int selinux_socket_bind(struct socket *sock, 
struct sockaddr *address, in
ad.u.net->sport = htons(snum);
ad.u.net->family = family;
 
-   if (address->sa_family == AF_INET)
+   if (family_sa == AF_INET)
ad.u.net->v4info.saddr = addr4->sin_addr.s_addr;
else
ad.u.net->v6info.saddr = addr6->sin6_addr;
@@ -4694,6 +4698,11 @@ static int selinux_socket_bind(struct socket *sock, 
struct sockaddr *address, in
}
 out:
return err;
+err_af:
+   /* Note that SCTP services expect -EINVAL, others -EAFNOSUPPORT. */
+   if (sksec->sclass == SECCLASS_SCTP_SOCKET)
+   return -EINVAL;
+   return -EAFNOSUPPORT;
 }
 
 /* This supports connect(2) and SCTP connect services such as sctp_connectx(3)
-- 
1.8.3.1



[PATCH v2 2/3] selinux: fix address family in bind() and connect() to match address/port

2018-05-11 Thread Alexey Kodanev
Since sctp_bindx() and sctp_connectx() can have multiple addresses,
sk_family can differ from sa_family. Therefore, selinux_socket_bind()
and selinux_socket_connect_helper(), which process sockaddr structure
(address and port), should use the address family from that structure
too, and not from the socket one.

The initialization of the data for the audit record is moved above,
in selinux_socket_bind(), so that there is no duplicate changes and
code.

Fixes: d452930fd3b9 ("selinux: Add SCTP support")
Suggested-by: Paul Moore <p...@paul-moore.com>
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---

v2: new patch in v2

 security/selinux/hooks.c | 18 +++---
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 1ed7004..e7882e5a 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4630,6 +4630,11 @@ static int selinux_socket_bind(struct socket *sock, 
struct sockaddr *address, in
goto err_af;
}
 
+   ad.type = LSM_AUDIT_DATA_NET;
+   ad.u.net = 
+   ad.u.net->sport = htons(snum);
+   ad.u.net->family = family_sa;
+
if (snum) {
int low, high;
 
@@ -4641,10 +4646,6 @@ static int selinux_socket_bind(struct socket *sock, 
struct sockaddr *address, in
  snum, );
if (err)
goto out;
-   ad.type = LSM_AUDIT_DATA_NET;
-   ad.u.net = 
-   ad.u.net->sport = htons(snum);
-   ad.u.net->family = family;
err = avc_has_perm(_state,
   sksec->sid, sid,
   sksec->sclass,
@@ -4676,15 +4677,10 @@ static int selinux_socket_bind(struct socket *sock, 
struct sockaddr *address, in
break;
}
 
-   err = sel_netnode_sid(addrp, family, );
+   err = sel_netnode_sid(addrp, family_sa, );
if (err)
goto out;
 
-   ad.type = LSM_AUDIT_DATA_NET;
-   ad.u.net = 
-   ad.u.net->sport = htons(snum);
-   ad.u.net->family = family;
-
if (family_sa == AF_INET)
ad.u.net->v4info.saddr = addr4->sin_addr.s_addr;
else
@@ -4780,7 +4776,7 @@ static int selinux_socket_connect_helper(struct socket 
*sock,
ad.type = LSM_AUDIT_DATA_NET;
ad.u.net = 
ad.u.net->dport = htons(snum);
-   ad.u.net->family = sk->sk_family;
+   ad.u.net->family = address->sa_family;
err = avc_has_perm(_state,
   sksec->sid, sid, sksec->sclass, perm, );
if (err)
-- 
1.8.3.1



Re: [PATCH] selinux: add AF_UNSPEC and INADDR_ANY checks to selinux_socket_bind()

2018-05-10 Thread Alexey Kodanev
On 10.05.2018 01:02, Paul Moore wrote:
...
> I just had a better look at this and I believe that Alexey and Stephen
> are right: this is the best option.  My apologies for the noise
> earlier.  However, while looking at the code I think there are some
> additional necessary changes:
> 
> * In the case of an SCTP socket, we should return -EINVAL, just as we
> do with other address families.

Right.

> * While not strictly related to AF_UNSPEC, we really should be passing
> the address family of the sockaddr, and not the socket, to functions
> that need to interpret the bind address/port.

That looks like a correct solution. I guess we need the same fix for
sctp_connectx(), in selinux_socket_connect_helper().

> 
> I'm waiting for my kernel to compile so I haven't given this any
> sanity testing, but the patch below is what I think we need ...
> 
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 4cafe6a19167..5f30045b2053 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -4576,6 +4576,7 @@ static int selinux_socket_post_create(struct socket 
> *sock,
> int family,
> static int selinux_socket_bind(struct socket *sock, struct sockaddr *address, 
> i
> nt addrlen)
> {
>struct sock *sk = sock->sk;
> +   struct sk_security_struct *sksec = sk->sk_security;
>u16 family;
>int err;
> 
> @@ -4587,13 +4588,13 @@ static int selinux_socket_bind(struct socket *sock, 
> stru
> ct sockaddr *address, in
>family = sk->sk_family;
>if (family == PF_INET || family == PF_INET6) {
>char *addrp;
> -   struct sk_security_struct *sksec = sk->sk_security;
>struct common_audit_data ad;
>struct lsm_network_audit net = {0,};
>struct sockaddr_in *addr4 = NULL;
>struct sockaddr_in6 *addr6 = NULL;
>unsigned short snum;
>u32 sid, node_perm;
> +   u16 family_sa = address->sa_family;
> 
>/*
> * sctp_bindx(3) calls via selinux_sctp_bind_connect()
> @@ -4601,11 +4602,19 @@ static int selinux_socket_bind(struct socket *sock, 
> stru
> ct sockaddr *address, in
> * need to check address->sa_family as it is possible to have
> * sk->sk_family = PF_INET6 with addr->sa_family = AF_INET.
> */
> -   switch (address->sa_family) {
> +   switch (family_sa) {
> +   case AF_UNSPEC:
>case AF_INET:
>if (addrlen < sizeof(struct sockaddr_in))
>return -EINVAL;
>addr4 = (struct sockaddr_in *)address;
> +   if (family_sa == AF_UNSPEC) {
> +   /* see "__inet_bind()", we only want to allow
> +* AF_UNSPEC if the address is INADDR_ANY */
> +   if (addr4->sin_addr.s_addr != 
> htonl(INADDR_ANY))
> +   goto err_af;
> +   family_sa = AF_INET;
> +   }
>snum = ntohs(addr4->sin_port);
>addrp = (char *)>sin_addr.s_addr;
>break;
> @@ -4617,15 +4626,14 @@ static int selinux_socket_bind(struct socket *sock, 
> stru
> ct sockaddr *address, in
>addrp = (char *)>sin6_addr.s6_addr;
>break;
>default:
> -   /* Note that SCTP services expect -EINVAL, whereas
> -* others expect -EAFNOSUPPORT.
> -*/
> -   if (sksec->sclass == SECCLASS_SCTP_SOCKET)
> -   return -EINVAL;
> -   else
> -   return -EAFNOSUPPORT;
> +   goto err_af;
>}
> 
> +   ad.type = LSM_AUDIT_DATA_NET;
> +   ad.u.net = 
> +   ad.u.net->sport = htons(snum);
> +   ad.u.net->family = family_sa;
> +

May be we could move setting ad.u.net->v{4|6}info.saddr here as well?


Will send a v2 of this patch so that SCTP socket returns EINVAL with
AF_UNSPEC. Should I prepare a patch with correcting 'ad.u.net->family'
and sel_netnode_sid()?  

Thanks,
Alexey

>if (snum) {
>int low, high;
> 
> @@ -4637,10 +4645,6 @@ static int selinux_socket_bind(struct socket *sock, 
> struc
> t sockaddr *address, in
>  snum, );
>if (err)
>goto out;
> -   ad.type = LSM_AUDIT_DATA_NET;
> -   ad.u.net = 
> -   ad.u.net->sport = htons(snum);
> -   ad.u.net->family = family;
>  

[PATCH] selinux: add AF_UNSPEC and INADDR_ANY checks to selinux_socket_bind()

2018-05-08 Thread Alexey Kodanev
Commit d452930fd3b9 ("selinux: Add SCTP support") breaks compatibility
with the old programs that can pass sockaddr_in with AF_UNSPEC and
INADDR_ANY to bind(). As a result, bind() returns EAFNOSUPPORT error.
It was found with LTP/asapi_01 test.

Similar to commit 29c486df6a20 ("net: ipv4: relax AF_INET check in
bind()"), which relaxed AF_INET check for compatibility, add AF_UNSPEC
case to AF_INET and make sure that the address is INADDR_ANY.

Also, in the end of selinux_socket_bind(), instead of adding AF_UNSPEC
to 'address->sa_family == AF_INET', verify AF_INET6 first.

Fixes: d452930fd3b9 ("selinux: Add SCTP support")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 security/selinux/hooks.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 4cafe6a..649a3be 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4602,10 +4602,16 @@ static int selinux_socket_bind(struct socket *sock, 
struct sockaddr *address, in
 * sk->sk_family = PF_INET6 with addr->sa_family = AF_INET.
 */
switch (address->sa_family) {
+   case AF_UNSPEC:
case AF_INET:
if (addrlen < sizeof(struct sockaddr_in))
return -EINVAL;
addr4 = (struct sockaddr_in *)address;
+
+   if (address->sa_family == AF_UNSPEC &&
+   addr4->sin_addr.s_addr != htonl(INADDR_ANY))
+   return -EAFNOSUPPORT;
+
snum = ntohs(addr4->sin_port);
addrp = (char *)>sin_addr.s_addr;
break;
@@ -4681,10 +4687,10 @@ static int selinux_socket_bind(struct socket *sock, 
struct sockaddr *address, in
ad.u.net->sport = htons(snum);
ad.u.net->family = family;
 
-   if (address->sa_family == AF_INET)
-   ad.u.net->v4info.saddr = addr4->sin_addr.s_addr;
-   else
+   if (address->sa_family == AF_INET6)
ad.u.net->v6info.saddr = addr6->sin6_addr;
+   else
+   ad.u.net->v4info.saddr = addr4->sin_addr.s_addr;
 
err = avc_has_perm(_state,
   sksec->sid, sid,
-- 
1.8.3.1



[PATCH net-next 0/4] geneve: verify user specified MTU or adjust with a lower device

2018-04-19 Thread Alexey Kodanev
The first two patches don't introduce any functional changes and
contain minor cleanups for code readability.

The last one adds a new function geneve_link_config() similar to the
other tunnels. The function will be used on a new link creation or
when 'remote' parameter is changed. It adjusts a user specified MTU
or, if it finds a lower device, tunes the tunnel MTU using it.

Alexey Kodanev (4):
  geneve: remove white-space before '#if IS_ENABLED(CONFIG_IPV6)'
  geneve: cleanup hard coded value for Ethernet header length
  geneve: check MTU for a minimum in geneve_change_mtu()
  geneve: configure MTU based on a lower device

 drivers/net/geneve.c | 72 
 1 file changed, 61 insertions(+), 11 deletions(-)

-- 
1.8.3.1



[PATCH net-next 1/4] geneve: remove white-space before '#if IS_ENABLED(CONFIG_IPV6)'

2018-04-19 Thread Alexey Kodanev
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 drivers/net/geneve.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index b919e89..45acdc9 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -1261,7 +1261,7 @@ static int geneve_nl2info(struct nlattr *tb[], struct 
nlattr *data[],
}
 
if (data[IFLA_GENEVE_REMOTE6]) {
- #if IS_ENABLED(CONFIG_IPV6)
+#if IS_ENABLED(CONFIG_IPV6)
if (changelink && (ip_tunnel_info_af(info) == AF_INET)) {
attrtype = IFLA_GENEVE_REMOTE6;
goto change_notsup;
-- 
1.8.3.1



[PATCH net-next 4/4] geneve: configure MTU based on a lower device

2018-04-19 Thread Alexey Kodanev
Currently, on a new link creation or when 'remote' address parameter
is updated, an MTU is not changed and always equals 1500. When a lower
device has a larger MTU, it might not be efficient, e.g. for UDP, and
requires the manual MTU adjustments to match the MTU of the lower
device.

This patch tries to automate this process, finds a lower device using
the 'remote' address parameter, then uses its MTU to tune GENEVE's MTU:
  * on a new link creation
  * when 'remote' parameter is changed

Also with this patch, the MTU from a user, on a new link creation, is
passed to geneve_change_mtu() where it is verified, and MTU adjustments
with a lower device is skipped in that case. Prior that change, it was
possible to set the invalid MTU values on a new link creation.

Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 drivers/net/geneve.c | 56 +---
 1 file changed, 53 insertions(+), 3 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index ae649f6..750eaa5 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -1387,6 +1387,48 @@ static int geneve_nl2info(struct nlattr *tb[], struct 
nlattr *data[],
return -EOPNOTSUPP;
 }
 
+static void geneve_link_config(struct net_device *dev,
+  struct ip_tunnel_info *info, struct nlattr *tb[])
+{
+   struct geneve_dev *geneve = netdev_priv(dev);
+   int ldev_mtu = 0;
+
+   if (tb[IFLA_MTU]) {
+   geneve_change_mtu(dev, nla_get_u32(tb[IFLA_MTU]));
+   return;
+   }
+
+   switch (ip_tunnel_info_af(info)) {
+   case AF_INET: {
+   struct flowi4 fl4 = { .daddr = info->key.u.ipv4.dst };
+   struct rtable *rt = ip_route_output_key(geneve->net, );
+
+   if (!IS_ERR(rt) && rt->dst.dev) {
+   ldev_mtu = rt->dst.dev->mtu - GENEVE_IPV4_HLEN;
+   ip_rt_put(rt);
+   }
+   break;
+   }
+#if IS_ENABLED(CONFIG_IPV6)
+   case AF_INET6: {
+   struct rt6_info *rt = rt6_lookup(geneve->net,
+>key.u.ipv6.dst, NULL, 0,
+NULL, 0);
+
+   if (rt && rt->dst.dev)
+   ldev_mtu = rt->dst.dev->mtu - GENEVE_IPV6_HLEN;
+   ip6_rt_put(rt);
+   break;
+   }
+#endif
+   }
+
+   if (ldev_mtu <= 0)
+   return;
+
+   geneve_change_mtu(dev, ldev_mtu - info->options_len);
+}
+
 static int geneve_newlink(struct net *net, struct net_device *dev,
  struct nlattr *tb[], struct nlattr *data[],
  struct netlink_ext_ack *extack)
@@ -1402,8 +1444,14 @@ static int geneve_newlink(struct net *net, struct 
net_device *dev,
if (err)
return err;
 
-   return geneve_configure(net, dev, extack, , metadata,
-   use_udp6_rx_checksums);
+   err = geneve_configure(net, dev, extack, , metadata,
+  use_udp6_rx_checksums);
+   if (err)
+   return err;
+
+   geneve_link_config(dev, , tb);
+
+   return 0;
 }
 
 /* Quiesces the geneve device data path for both TX and RX.
@@ -1477,8 +1525,10 @@ static int geneve_changelink(struct net_device *dev, 
struct nlattr *tb[],
if (err)
return err;
 
-   if (!geneve_dst_addr_equal(>info, ))
+   if (!geneve_dst_addr_equal(>info, )) {
dst_cache_reset(_cache);
+   geneve_link_config(dev, , tb);
+   }
 
geneve_quiesce(geneve, , );
geneve->info = info;
-- 
1.8.3.1



[PATCH net-next 3/4] geneve: check MTU for a minimum in geneve_change_mtu()

2018-04-19 Thread Alexey Kodanev
geneve_change_mtu() will be used not only as ndo_change_mtu() callback,
but also to verify a user specified MTU on a new link creation in the
next patch.

Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 drivers/net/geneve.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index b650f84..ae649f6 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -942,11 +942,10 @@ static netdev_tx_t geneve_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
 static int geneve_change_mtu(struct net_device *dev, int new_mtu)
 {
-   /* Only possible if called internally, ndo_change_mtu path's new_mtu
-* is guaranteed to be between dev->min_mtu and dev->max_mtu.
-*/
if (new_mtu > dev->max_mtu)
new_mtu = dev->max_mtu;
+   else if (new_mtu < dev->min_mtu)
+   new_mtu = dev->min_mtu;
 
dev->mtu = new_mtu;
return 0;
-- 
1.8.3.1



[PATCH net-next 2/4] geneve: cleanup hard coded value for Ethernet header length

2018-04-19 Thread Alexey Kodanev
Use ETH_HLEN instead and introduce two new macros: GENEVE_IPV4_HLEN
and GENEVE_IPV6_HLEN that include Ethernet header length, corresponded
IP header length and GENEVE_BASE_HLEN.

Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 drivers/net/geneve.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 45acdc9..b650f84 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -36,6 +36,8 @@
 
 #define GENEVE_VER 0
 #define GENEVE_BASE_HLEN (sizeof(struct udphdr) + sizeof(struct genevehdr))
+#define GENEVE_IPV4_HLEN (ETH_HLEN + sizeof(struct iphdr) + GENEVE_BASE_HLEN)
+#define GENEVE_IPV6_HLEN (ETH_HLEN + sizeof(struct ipv6hdr) + GENEVE_BASE_HLEN)
 
 /* per-network namespace private data for this module */
 struct geneve_net {
@@ -826,8 +828,8 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct 
net_device *dev,
return PTR_ERR(rt);
 
if (skb_dst(skb)) {
-   int mtu = dst_mtu(>dst) - sizeof(struct iphdr) -
- GENEVE_BASE_HLEN - info->options_len - 14;
+   int mtu = dst_mtu(>dst) - GENEVE_IPV4_HLEN -
+ info->options_len;
 
skb_dst_update_pmtu(skb, mtu);
}
@@ -872,8 +874,7 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct 
net_device *dev,
return PTR_ERR(dst);
 
if (skb_dst(skb)) {
-   int mtu = dst_mtu(dst) - sizeof(struct ipv6hdr) -
- GENEVE_BASE_HLEN - info->options_len - 14;
+   int mtu = dst_mtu(dst) - GENEVE_IPV6_HLEN - info->options_len;
 
skb_dst_update_pmtu(skb, mtu);
}
-- 
1.8.3.1



[RFC PATCH net] tcp: allow to use TCP Fastopen with MSG_ZEROCOPY

2018-04-03 Thread Alexey Kodanev
With TCP Fastopen we can have the following cases, which could also
use MSG_ZEROCOPY flag with send() and sendto():

* sendto() + MSG_FASTOPEN flag, sk state can be in TCP_CLOSE at
  the start of tcp_sendmsg()

* set socket option TCP_FASTOPEN_CONNECT, then connect()
  and send(), sk state in TCP_SYN_SENT

Currently, both cases with tcp_sendmsg() and MSG_ZEROCOPY flag results
to EINVAL error, because of the check for TCP_ESTABLISHED sk state in
the beginning of tcp_sendmsg().

Both conditions require two more checks there: !tp->fastopen_connect
and !(flags & MSG_FASTOPEN). It looks like we could remove the original
check altogether for this unlikely event instead. That way tcp_sendmsg()
without TFO should fail with EPIPE on sk_stream_wait_connect(), as
before the introduction of MSG_ZEROCOPY there. And work smoothly for
the TFO cases.

Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---

Is there something that I've overlooked and we can't use it here, and
we should handle this type of error, while using sendto() + TFO,
in userspace?

 net/ipv4/tcp.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9225610..768f02c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1193,11 +1193,6 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr 
*msg, size_t size)
flags = msg->msg_flags;
 
if (flags & MSG_ZEROCOPY && size) {
-   if (sk->sk_state != TCP_ESTABLISHED) {
-   err = -EINVAL;
-   goto out_err;
-   }
-
skb = tcp_write_queue_tail(sk);
uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
if (!uarg) {
-- 
1.8.3.1



[PATCH net v6 2/4] ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()

2018-04-03 Thread Alexey Kodanev
Add 'connected' parameter to ip6_sk_dst_lookup_flow() and update
the cache only if ip6_sk_dst_check() returns NULL and a socket
is connected.

The function is used as before, the new behavior for UDP sockets
in udpv6_sendmsg() will be enabled in the next patch.

Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 include/net/ipv6.h|  3 ++-
 net/ipv6/ip6_output.c | 15 ---
 net/ipv6/ping.c   |  2 +-
 net/ipv6/udp.c|  2 +-
 4 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 8606c91..1d416f2 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -977,7 +977,8 @@ int ip6_dst_lookup(struct net *net, struct sock *sk, struct 
dst_entry **dst,
 struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 
*fl6,
  const struct in6_addr *final_dst);
 struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
-const struct in6_addr *final_dst);
+const struct in6_addr *final_dst,
+bool connected);
 struct dst_entry *ip6_blackhole_route(struct net *net,
  struct dst_entry *orig_dst);
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a8a9195..46ea7b6 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1105,23 +1105,32 @@ struct dst_entry *ip6_dst_lookup_flow(const struct sock 
*sk, struct flowi6 *fl6,
  * @sk: socket which provides the dst cache and route info
  * @fl6: flow to lookup
  * @final_dst: final destination address for ipsec lookup
+ * @connected: whether @sk is connected or not
  *
  * This function performs a route lookup on the given flow with the
  * possibility of using the cached route in the socket if it is valid.
  * It will take the socket dst lock when operating on the dst cache.
  * As a result, this function can only be used in process context.
  *
+ * In addition, for a connected socket, cache the dst in the socket
+ * if the current cache is not valid.
+ *
  * It returns a valid dst pointer on success, or a pointer encoded
  * error code.
  */
 struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
-const struct in6_addr *final_dst)
+const struct in6_addr *final_dst,
+bool connected)
 {
struct dst_entry *dst = sk_dst_check(sk, inet6_sk(sk)->dst_cookie);
 
dst = ip6_sk_dst_check(sk, dst, fl6);
-   if (!dst)
-   dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
+   if (dst)
+   return dst;
+
+   dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
+   if (connected && !IS_ERR(dst))
+   ip6_sk_dst_store_flow(sk, dst_clone(dst), fl6);
 
return dst;
 }
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index d12c55d..746eeae 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -121,7 +121,7 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
ipc6.tclass = np->tclass;
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-   dst = ip6_sk_dst_lookup_flow(sk, ,  daddr);
+   dst = ip6_sk_dst_lookup_flow(sk, , daddr, false);
if (IS_ERR(dst))
return PTR_ERR(dst);
rt = (struct rt6_info *) dst;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 52e3ea0..5bc102b2 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1289,7 +1289,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
 
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-   dst = ip6_sk_dst_lookup_flow(sk, , final_p);
+   dst = ip6_sk_dst_lookup_flow(sk, , final_p, false);
if (IS_ERR(dst)) {
err = PTR_ERR(dst);
dst = NULL;
-- 
1.8.3.1



[PATCH net v6 3/4] ipv6: udp: convert 'connected' to bool type in udpv6_sendmsg()

2018-04-03 Thread Alexey Kodanev
This should make it consistent with ip6_sk_dst_lookup_flow()
that is accepting the new 'connected' parameter of type bool.

Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv6/udp.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 5bc102b2..4aa50ea 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1097,10 +1097,10 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
struct dst_entry *dst;
struct ipcm6_cookie ipc6;
int addr_len = msg->msg_namelen;
+   bool connected = false;
int ulen = len;
int corkreq = up->corkflag || msg->msg_flags_MORE;
int err;
-   int connected = 0;
int is_udplite = IS_UDPLITE(sk);
int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
struct sockcm_cookie sockc;
@@ -1222,7 +1222,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
fl6.fl6_dport = inet->inet_dport;
daddr = >sk_v6_daddr;
fl6.flowlabel = np->flow_label;
-   connected = 1;
+   connected = true;
}
 
if (!fl6.flowi6_oif)
@@ -1252,7 +1252,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
}
if (!(opt->opt_nflen|opt->opt_flen))
opt = NULL;
-   connected = 0;
+   connected = false;
}
if (!opt) {
opt = txopt_get(np);
@@ -1274,11 +1274,11 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
 
final_p = fl6_update_dst(, opt, );
if (final_p)
-   connected = 0;
+   connected = false;
 
if (!fl6.flowi6_oif && ipv6_addr_is_multicast()) {
fl6.flowi6_oif = np->mcast_oif;
-   connected = 0;
+   connected = false;
} else if (!fl6.flowi6_oif)
fl6.flowi6_oif = np->ucast_oif;
 
-- 
1.8.3.1



[PATCH net v6 0/4] ipv6: udp: set dst cache for a connected sk if current not valid

2018-04-03 Thread Alexey Kodanev
A new RTF_CACHE route can be created with the socket's dst cache
update between the below calls in udpv6_sendmsg(), when datagram
sending results to ICMPV6_PKT_TOOBIG error:

   dst = ip6_sk_dst_lookup_flow(...)
   ...
release_dst:
if (dst) {
if (connected) {
ip6_dst_store(sk, dst)

Therefore, the new socket's dst cache reset to the old one on
"release_dst:".

The first three patches prepare the code to store dst cache
with ip6_sk_dst_lookup_flow():

  * the first patch adds ip6_sk_dst_store_flow() function with
commonly used source and destiantion addresses checks using
the flow information.

  * the second patch adds a new argument to ip6_sk_dst_lookup_flow()
and ability to store dst in the socket's cache. Also, the two
users of the function are updated without enabling the new
behavior: pingv6_sendmsg() and udpv6_sendmsg().

  * the third patch makes 'connected' variable in udpv6_sendmsg()
to be consistent with ip6_sk_dst_store_flow(), changes its type
from int to bool.

The last patch contains the actual fix that removes sk dst cache
update in the end of udpv6_sendmsg(), and allows to do it in
ip6_sk_dst_lookup_flow().

v6: * use bool type for a new parameter in ip_sk_dst_lookup_flow()
* add one more patch to convert 'connected' variable in
  udpv6_sendmsg() from int to bool type. If it shouldn't be
  here I will resend it when the net-next is opened.

v5: * relocate ip6_sk_dst_store_flow() to net/ipv6/route.c and
  rename ip6_dst_store_flow() to ip6_sk_dst_store_flow() as
  suggested by Martin

v4: * fix the error in the build of ip_dst_store_flow() reported by
  kbuild test robot due to missing checks for CONFIG_IPV6: add
  new function to ip6_output.c instead of ip6_route.h
* add 'const' to struct flowi6 in ip6_dst_store_flow()
* minor commit messages fixes

v3: * instead of moving ip6_dst_store() above udp_v6_send_skb(),
  update socket's dst cache inside ip6_sk_dst_lookup_flow()
  if the current one is invalid
* the issue not reproduced in 4.1, but starting from 4.2. Add
  one more 'Fixes:' commit that creates new RTF_CACHE route.
  Though, it is also mentioned in the first one

Alexey Kodanev (4):
  ipv6: add a wrapper for ip6_dst_store() with flowi6 checks
  ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()
  ipv6: udp: convert 'connected' to bool type in udpv6_sendmsg()
  ipv6: udp: set dst cache for a connected sk if current not valid

 include/net/ip6_route.h |  3 +++
 include/net/ipv6.h  |  3 ++-
 net/ipv6/datagram.c |  9 +
 net/ipv6/ip6_output.c   | 15 ---
 net/ipv6/ping.c |  2 +-
 net/ipv6/route.c| 17 +
 net/ipv6/udp.c  | 31 +++
 7 files changed, 43 insertions(+), 37 deletions(-)

-- 
1.8.3.1



[PATCH net v6 4/4] ipv6: udp: set dst cache for a connected sk if current not valid

2018-04-03 Thread Alexey Kodanev
A new RTF_CACHE route can be created between ip6_sk_dst_lookup_flow()
and ip6_dst_store() calls in udpv6_sendmsg(), when datagram sending
results to ICMPV6_PKT_TOOBIG error:

udp_v6_send_skb(), for example with vti6 tunnel:
vti6_xmit(), get ICMPV6_PKT_TOOBIG error
skb_dst_update_pmtu(), can create a RTF_CACHE clone
icmpv6_send()
...
udpv6_err()
ip6_sk_update_pmtu()
   ip6_update_pmtu(), can create a RTF_CACHE clone
   ...
   ip6_datagram_dst_update()
ip6_dst_store()

And after commit 33c162a980fe ("ipv6: datagram: Update dst cache of
a connected datagram sk during pmtu update"), the UDPv6 error handler
can update socket's dst cache, but it can happen before the update in
the end of udpv6_sendmsg(), preventing getting the new dst cache on
the next udpv6_sendmsg() calls.

In order to fix it, save dst in a connected socket only if the current
socket's dst cache is invalid.

The previous patch prepared ip6_sk_dst_lookup_flow() to do that with
the new argument, and this patch enables it in udpv6_sendmsg().

Fixes: 33c162a980fe ("ipv6: datagram: Update dst cache of a connected datagram 
sk during pmtu update")
Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering 
pmtu exception")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv6/udp.c | 21 ++---
 1 file changed, 2 insertions(+), 19 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 4aa50ea..9b74092 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1289,7 +1289,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
 
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-   dst = ip6_sk_dst_lookup_flow(sk, , final_p, false);
+   dst = ip6_sk_dst_lookup_flow(sk, , final_p, connected);
if (IS_ERR(dst)) {
err = PTR_ERR(dst);
dst = NULL;
@@ -1314,7 +1314,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
err = PTR_ERR(skb);
if (!IS_ERR_OR_NULL(skb))
err = udp_v6_send_skb(skb, );
-   goto release_dst;
+   goto out;
}
 
lock_sock(sk);
@@ -1348,23 +1348,6 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
err = np->recverr ? net_xmit_errno(err) : 0;
release_sock(sk);
 
-release_dst:
-   if (dst) {
-   if (connected) {
-   ip6_dst_store(sk, dst,
- ipv6_addr_equal(, 
>sk_v6_daddr) ?
- >sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
- ipv6_addr_equal(, >saddr) ?
- >saddr :
-#endif
- NULL);
-   } else {
-   dst_release(dst);
-   }
-   dst = NULL;
-   }
-
 out:
dst_release(dst);
fl6_sock_release(flowlabel);
-- 
1.8.3.1



[PATCH net v6 1/4] ipv6: add a wrapper for ip6_dst_store() with flowi6 checks

2018-04-03 Thread Alexey Kodanev
Move commonly used pattern of ip6_dst_store() usage to a separate
function - ip6_sk_dst_store_flow(), which will check the addresses
for equality using the flow information, before saving them.

There is no functional changes in this patch. In addition, it will
be used in the next patch, in ip6_sk_dst_lookup_flow().

Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 include/net/ip6_route.h |  3 +++
 net/ipv6/datagram.c |  9 +
 net/ipv6/route.c| 17 +
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index ac0866b..abec280 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -210,6 +210,9 @@ static inline void ip6_dst_store(struct sock *sk, struct 
dst_entry *dst,
 #endif
 }
 
+void ip6_sk_dst_store_flow(struct sock *sk, struct dst_entry *dst,
+  const struct flowi6 *fl6);
+
 static inline bool ipv6_unicast_destination(const struct sk_buff *skb)
 {
struct rt6_info *rt = (struct rt6_info *) skb_dst(skb);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index a9f7eca..8f6a391 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -106,14 +106,7 @@ int ip6_datagram_dst_update(struct sock *sk, bool 
fix_sk_saddr)
}
}
 
-   ip6_dst_store(sk, dst,
- ipv6_addr_equal(, >sk_v6_daddr) ?
- >sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
- ipv6_addr_equal(, >saddr) ?
- >saddr :
-#endif
- NULL);
+   ip6_sk_dst_store_flow(sk, dst, );
 
 out:
fl6_sock_release(flowlabel);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b0d5c64..b14008e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2153,6 +2153,23 @@ void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock 
*sk, __be32 mtu)
 }
 EXPORT_SYMBOL_GPL(ip6_sk_update_pmtu);
 
+void ip6_sk_dst_store_flow(struct sock *sk, struct dst_entry *dst,
+  const struct flowi6 *fl6)
+{
+#ifdef CONFIG_IPV6_SUBTREES
+   struct ipv6_pinfo *np = inet6_sk(sk);
+#endif
+
+   ip6_dst_store(sk, dst,
+ ipv6_addr_equal(>daddr, >sk_v6_daddr) ?
+ >sk_v6_daddr : NULL,
+#ifdef CONFIG_IPV6_SUBTREES
+ ipv6_addr_equal(>saddr, >saddr) ?
+ >saddr :
+#endif
+ NULL);
+}
+
 /* Handle redirects */
 struct ip6rd_flowi {
struct flowi6 fl6;
-- 
1.8.3.1



[PATCH net v5 2/3] ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()

2018-04-02 Thread Alexey Kodanev
Add 'connected' argument to ip6_sk_dst_lookup_flow() and update
the cache only if ip6_sk_dst_check() returns NULL and a socket
is connected.

The function is used as before, the new behavior for UDP sockets
in udpv6_sendmsg() will be enabled in the next patch.

Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 include/net/ipv6.h|  3 ++-
 net/ipv6/ip6_output.c | 15 ---
 net/ipv6/ping.c   |  2 +-
 net/ipv6/udp.c|  2 +-
 4 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 8606c91..07e94dc 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -977,7 +977,8 @@ int ip6_dst_lookup(struct net *net, struct sock *sk, struct 
dst_entry **dst,
 struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 
*fl6,
  const struct in6_addr *final_dst);
 struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
-const struct in6_addr *final_dst);
+const struct in6_addr *final_dst,
+int connected);
 struct dst_entry *ip6_blackhole_route(struct net *net,
  struct dst_entry *orig_dst);
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a8a9195..15724e0 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1105,23 +1105,32 @@ struct dst_entry *ip6_dst_lookup_flow(const struct sock 
*sk, struct flowi6 *fl6,
  * @sk: socket which provides the dst cache and route info
  * @fl6: flow to lookup
  * @final_dst: final destination address for ipsec lookup
+ * @connected: whether @sk is connected or not
  *
  * This function performs a route lookup on the given flow with the
  * possibility of using the cached route in the socket if it is valid.
  * It will take the socket dst lock when operating on the dst cache.
  * As a result, this function can only be used in process context.
  *
+ * In addition, for a connected socket, cache the dst in the socket
+ * if the current cache is not valid.
+ *
  * It returns a valid dst pointer on success, or a pointer encoded
  * error code.
  */
 struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
-const struct in6_addr *final_dst)
+const struct in6_addr *final_dst,
+int connected)
 {
struct dst_entry *dst = sk_dst_check(sk, inet6_sk(sk)->dst_cookie);
 
dst = ip6_sk_dst_check(sk, dst, fl6);
-   if (!dst)
-   dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
+   if (dst)
+   return dst;
+
+   dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
+   if (connected && !IS_ERR(dst))
+   ip6_sk_dst_store_flow(sk, dst_clone(dst), fl6);
 
return dst;
 }
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index d12c55d..546f4a6 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -121,7 +121,7 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
ipc6.tclass = np->tclass;
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-   dst = ip6_sk_dst_lookup_flow(sk, ,  daddr);
+   dst = ip6_sk_dst_lookup_flow(sk, , daddr, 0);
if (IS_ERR(dst))
return PTR_ERR(dst);
rt = (struct rt6_info *) dst;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 52e3ea0..e49dac4 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1289,7 +1289,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
 
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-   dst = ip6_sk_dst_lookup_flow(sk, , final_p);
+   dst = ip6_sk_dst_lookup_flow(sk, , final_p, 0);
if (IS_ERR(dst)) {
err = PTR_ERR(dst);
dst = NULL;
-- 
1.8.3.1



[PATCH net v5 3/3] ipv6: udp6: set dst cache for a connected sk if current not valid

2018-04-02 Thread Alexey Kodanev
A new RTF_CACHE route can be created between ip6_sk_dst_lookup_flow()
and ip6_dst_store() calls in udpv6_sendmsg(), when datagram sending
results to ICMPV6_PKT_TOOBIG error:

udp_v6_send_skb(), for example with vti6 tunnel:
vti6_xmit(), get ICMPV6_PKT_TOOBIG error
skb_dst_update_pmtu(), can create a RTF_CACHE clone
icmpv6_send()
...
udpv6_err()
ip6_sk_update_pmtu()
   ip6_update_pmtu(), can create a RTF_CACHE clone
   ...
   ip6_datagram_dst_update()
ip6_dst_store()

And after commit 33c162a980fe ("ipv6: datagram: Update dst cache of
a connected datagram sk during pmtu update"), the UDPv6 error handler
can update socket's dst cache, but it can happen before the update in
the end of udpv6_sendmsg(), preventing getting the new dst cache on
the next udpv6_sendmsg() calls.

In order to fix it, save dst in a connected socket only if the current
socket's dst cache is invalid.

The previous patch prepared ip6_sk_dst_lookup_flow() to do that with
the new argument, and this patch enables it in udpv6_sendmsg().

Fixes: 33c162a980fe ("ipv6: datagram: Update dst cache of a connected datagram 
sk during pmtu update")
Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering 
pmtu exception")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv6/udp.c | 21 ++---
 1 file changed, 2 insertions(+), 19 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e49dac4..da13c90 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1289,7 +1289,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
 
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-   dst = ip6_sk_dst_lookup_flow(sk, , final_p, 0);
+   dst = ip6_sk_dst_lookup_flow(sk, , final_p, connected);
if (IS_ERR(dst)) {
err = PTR_ERR(dst);
dst = NULL;
@@ -1314,7 +1314,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
err = PTR_ERR(skb);
if (!IS_ERR_OR_NULL(skb))
err = udp_v6_send_skb(skb, );
-   goto release_dst;
+   goto out;
}
 
lock_sock(sk);
@@ -1348,23 +1348,6 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
err = np->recverr ? net_xmit_errno(err) : 0;
release_sock(sk);
 
-release_dst:
-   if (dst) {
-   if (connected) {
-   ip6_dst_store(sk, dst,
- ipv6_addr_equal(, 
>sk_v6_daddr) ?
- >sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
- ipv6_addr_equal(, >saddr) ?
- >saddr :
-#endif
- NULL);
-   } else {
-   dst_release(dst);
-   }
-   dst = NULL;
-   }
-
 out:
dst_release(dst);
fl6_sock_release(flowlabel);
-- 
1.8.3.1



[PATCH net v5 1/3] ipv6: add a wrapper for ip6_dst_store() with flowi6 checks

2018-04-02 Thread Alexey Kodanev
Move commonly used pattern of ip6_dst_store() usage to a separate
function - ip6_sk_dst_store_flow(), which will check the addresses
for equality using the flow information, before saving them.

There is no functional changes in this patch. In addition, it will
be used in the next patch, in ip6_sk_dst_lookup_flow().

Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 include/net/ip6_route.h |  3 +++
 net/ipv6/datagram.c |  9 +
 net/ipv6/route.c| 17 +
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index ac0866b..abec280 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -210,6 +210,9 @@ static inline void ip6_dst_store(struct sock *sk, struct 
dst_entry *dst,
 #endif
 }
 
+void ip6_sk_dst_store_flow(struct sock *sk, struct dst_entry *dst,
+  const struct flowi6 *fl6);
+
 static inline bool ipv6_unicast_destination(const struct sk_buff *skb)
 {
struct rt6_info *rt = (struct rt6_info *) skb_dst(skb);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index a9f7eca..8f6a391 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -106,14 +106,7 @@ int ip6_datagram_dst_update(struct sock *sk, bool 
fix_sk_saddr)
}
}
 
-   ip6_dst_store(sk, dst,
- ipv6_addr_equal(, >sk_v6_daddr) ?
- >sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
- ipv6_addr_equal(, >saddr) ?
- >saddr :
-#endif
- NULL);
+   ip6_sk_dst_store_flow(sk, dst, );
 
 out:
fl6_sock_release(flowlabel);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b0d5c64..b14008e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2153,6 +2153,23 @@ void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock 
*sk, __be32 mtu)
 }
 EXPORT_SYMBOL_GPL(ip6_sk_update_pmtu);
 
+void ip6_sk_dst_store_flow(struct sock *sk, struct dst_entry *dst,
+  const struct flowi6 *fl6)
+{
+#ifdef CONFIG_IPV6_SUBTREES
+   struct ipv6_pinfo *np = inet6_sk(sk);
+#endif
+
+   ip6_dst_store(sk, dst,
+ ipv6_addr_equal(>daddr, >sk_v6_daddr) ?
+ >sk_v6_daddr : NULL,
+#ifdef CONFIG_IPV6_SUBTREES
+ ipv6_addr_equal(>saddr, >saddr) ?
+ >saddr :
+#endif
+ NULL);
+}
+
 /* Handle redirects */
 struct ip6rd_flowi {
struct flowi6 fl6;
-- 
1.8.3.1



[PATCH net v5 0/3] ipv6: udp6: set dst cache for a connected sk if current not valid

2018-04-02 Thread Alexey Kodanev
A new RTF_CACHE route can be created with the socket's dst cache
update between the below calls in udpv6_sendmsg(), when datagram
sending results to ICMPV6_PKT_TOOBIG error:

   dst = ip6_sk_dst_lookup_flow(...)
   ...
release_dst:
if (dst) {
if (connected) {
ip6_dst_store(sk, dst)

Therefore, the new socket's dst cache reset to the old one on
"release_dst:".

The first two patches prepare the code to store dst cache
with ip6_sk_dst_lookup_flow():

  * the first patch adds ip6_sk_dst_store_flow() function with
commonly used source and destiantion addresses checks using
the flow information.

  * the second patch adds new argument to ip6_sk_dst_lookup_flow()
and ability to store dst in the socket's cache. Also, the two
users of the function are updated without enabling the new
behavior: pingv6_sendmsg() and udpv6_sendmsg().

The last patch contains the actual fix that removes sk dst cache
update in the end of udpv6_sendmsg(), and allows to do it in
ip6_sk_dst_lookup_flow().

v5: * relocate ip6_sk_dst_store_flow() to net/ipv6/route.c and
  rename ip6_dst_store_flow() to ip6_sk_dst_store_flow() as
  suggested by Martin

v4: * fix the error in the build of ip_dst_store_flow() reported by
  kbuild test robot due to missing checks for CONFIG_IPV6: add
  new function to ip6_output.c instead of ip6_route.h
* add 'const' to struct flowi6 in ip6_dst_store_flow()
* minor commit messages fixes

v3: * instead of moving ip6_dst_store() above udp_v6_send_skb(),
  update socket's dst cache inside ip6_sk_dst_lookup_flow()
  if the current one is invalid
* the issue not reproduced in 4.1, but starting from 4.2. Add
  one more 'Fixes:' commit that creates new RTF_CACHE route.
  Though, it is also mentioned in the first one

Alexey Kodanev (3):
  ipv6: add a wrapper for ip6_dst_store() with flowi6 checks
  ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()
  ipv6: udp6: set dst cache for a connected sk if current not valid

 include/net/ip6_route.h |  3 +++
 include/net/ipv6.h  |  3 ++-
 net/ipv6/datagram.c |  9 +
 net/ipv6/ip6_output.c   | 15 ---
 net/ipv6/ping.c |  2 +-
 net/ipv6/route.c| 17 +
 net/ipv6/udp.c  | 21 ++---
 7 files changed, 38 insertions(+), 32 deletions(-)

-- 
1.8.3.1



[PATCH net v4 2/3] ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()

2018-03-30 Thread Alexey Kodanev
Add 'connected' argument to ip6_sk_dst_lookup_flow() and update
the cache only if ip6_sk_dst_check() returns NULL and a socket
is connected.

The function is used as before, the new behavior for UDP sockets
in udpv6_sendmsg() will be enabled in the next patch.

Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 include/net/ipv6.h|  3 ++-
 net/ipv6/ip6_output.c | 15 ---
 net/ipv6/ping.c   |  2 +-
 net/ipv6/udp.c|  2 +-
 4 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 1dd31ca..769550a 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -978,7 +978,8 @@ int ip6_dst_lookup(struct net *net, struct sock *sk, struct 
dst_entry **dst,
 struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 
*fl6,
  const struct in6_addr *final_dst);
 struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
-const struct in6_addr *final_dst);
+const struct in6_addr *final_dst,
+int connected);
 struct dst_entry *ip6_blackhole_route(struct net *net,
  struct dst_entry *orig_dst);
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ed87ce5..cdcdf37 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1122,23 +1122,32 @@ struct dst_entry *ip6_dst_lookup_flow(const struct sock 
*sk, struct flowi6 *fl6,
  * @sk: socket which provides the dst cache and route info
  * @fl6: flow to lookup
  * @final_dst: final destination address for ipsec lookup
+ * @connected: whether @sk is connected or not
  *
  * This function performs a route lookup on the given flow with the
  * possibility of using the cached route in the socket if it is valid.
  * It will take the socket dst lock when operating on the dst cache.
  * As a result, this function can only be used in process context.
  *
+ * In addition, for a connected socket, cache the dst in the socket
+ * if the current cache is not valid.
+ *
  * It returns a valid dst pointer on success, or a pointer encoded
  * error code.
  */
 struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
-const struct in6_addr *final_dst)
+const struct in6_addr *final_dst,
+int connected)
 {
struct dst_entry *dst = sk_dst_check(sk, inet6_sk(sk)->dst_cookie);
 
dst = ip6_sk_dst_check(sk, dst, fl6);
-   if (!dst)
-   dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
+   if (dst)
+   return dst;
+
+   dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
+   if (connected && !IS_ERR(dst))
+   ip6_dst_store_flow(sk, dst_clone(dst), fl6);
 
return dst;
 }
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index d12c55d..546f4a6 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -121,7 +121,7 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
ipc6.tclass = np->tclass;
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-   dst = ip6_sk_dst_lookup_flow(sk, ,  daddr);
+   dst = ip6_sk_dst_lookup_flow(sk, , daddr, 0);
if (IS_ERR(dst))
return PTR_ERR(dst);
rt = (struct rt6_info *) dst;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 52e3ea0..e49dac4 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1289,7 +1289,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
 
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-   dst = ip6_sk_dst_lookup_flow(sk, , final_p);
+   dst = ip6_sk_dst_lookup_flow(sk, , final_p, 0);
if (IS_ERR(dst)) {
err = PTR_ERR(dst);
dst = NULL;
-- 
1.8.3.1



[PATCH net v4 0/3] ipv6: udp6: set dst cache for a connected sk if current not valid

2018-03-30 Thread Alexey Kodanev
A new RTF_CACHE route can be created with the socket's dst cache
update between the below calls in udpv6_sendmsg(), when datagram
sending results to ICMPV6_PKT_TOOBIG error:

   dst = ip6_sk_dst_lookup_flow(...)
   ...
release_dst:
if (dst) {
if (connected) {
ip6_dst_store(sk, dst)

Therefore, the new socket's dst cache reset to the old one on
"release_dst:".

The first two patches prepare the code to store dst cache
with ip6_sk_dst_lookup_flow():

  * the first patch adds ip6_dst_store_flow() function with
commonly used source and destiantion addresses checks using
the flow information.

  * the second patch adds new argument to ip6_sk_dst_lookup_flow()
and ability to store dst in the socket's cache. Also, the two
users of the function are updated without enabling the new
behavior: pingv6_sendmsg() and udpv6_sendmsg().

The last patch contains the actual fix that removes sk dst cache
update in the end of udpv6_sendmsg(), and allows to do it in
ip6_sk_dst_lookup_flow().

v4: * fix the error in the build of ip_dst_store_flow() reported by
  kbuild test robot due to missing checks for CONFIG_IPV6: add
  new function to ip6_output.c instead of ip6_route.h
* add 'const' to struct flowi6 in ip6_dst_store_flow()
* minor commit messages fixes

v3: * instead of moving ip6_dst_store() above udp_v6_send_skb(),
  update socket's dst cache inside ip6_sk_dst_lookup_flow()
  if the current one is invalid
* the issue not reproduced in 4.1, but starting from 4.2. Add
  one more 'Fixes:' commit that creates new RTF_CACHE route.
  Though, it is also mentioned in the first one


Alexey Kodanev (3):
  ipv6: add a wrapper for ip6_dst_store() with flowi6 checks
  ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()
  ipv6: udp6: set dst cache for a connected sk if current not valid

 include/net/ipv6.h|  6 --
 net/ipv6/datagram.c   |  9 +
 net/ipv6/ip6_output.c | 32 +---
 net/ipv6/ping.c   |  2 +-
 net/ipv6/udp.c| 21 ++---
 5 files changed, 37 insertions(+), 33 deletions(-)

-- 
1.8.3.1



[PATCH net v4 3/3] ipv6: udp6: set dst cache for a connected sk if current not valid

2018-03-30 Thread Alexey Kodanev
A new RTF_CACHE route can be created between ip6_sk_dst_lookup_flow()
and ip6_dst_store() calls in udpv6_sendmsg(), when datagram sending
results to ICMPV6_PKT_TOOBIG error:

udp_v6_send_skb(), for example with vti6 tunnel:
vti6_xmit(), get ICMPV6_PKT_TOOBIG error
skb_dst_update_pmtu(), can create a RTF_CACHE clone
icmpv6_send()
...
udpv6_err()
ip6_sk_update_pmtu()
   ip6_update_pmtu(), can create a RTF_CACHE clone
   ...
   ip6_datagram_dst_update()
ip6_dst_store()

And after commit 33c162a980fe ("ipv6: datagram: Update dst cache of
a connected datagram sk during pmtu update"), the UDPv6 error handler
can update socket's dst cache, but it can happen before the update in
the end of udpv6_sendmsg(), preventing getting the new dst cache on
the next udpv6_sendmsg() calls.

In order to fix it, save dst in a connected socket only if the current
socket's dst cache is invalid.

The previous patch prepared ip6_sk_dst_lookup_flow() to do that with
the new argument, and this patch enables it in udpv6_sendmsg().

Fixes: 33c162a980fe ("ipv6: datagram: Update dst cache of a connected datagram 
sk during pmtu update")
Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering 
pmtu exception")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv6/udp.c | 21 ++---
 1 file changed, 2 insertions(+), 19 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e49dac4..da13c90 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1289,7 +1289,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
 
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-   dst = ip6_sk_dst_lookup_flow(sk, , final_p, 0);
+   dst = ip6_sk_dst_lookup_flow(sk, , final_p, connected);
if (IS_ERR(dst)) {
err = PTR_ERR(dst);
dst = NULL;
@@ -1314,7 +1314,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
err = PTR_ERR(skb);
if (!IS_ERR_OR_NULL(skb))
err = udp_v6_send_skb(skb, );
-   goto release_dst;
+   goto out;
}
 
lock_sock(sk);
@@ -1348,23 +1348,6 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
err = np->recverr ? net_xmit_errno(err) : 0;
release_sock(sk);
 
-release_dst:
-   if (dst) {
-   if (connected) {
-   ip6_dst_store(sk, dst,
- ipv6_addr_equal(, 
>sk_v6_daddr) ?
- >sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
- ipv6_addr_equal(, >saddr) ?
- >saddr :
-#endif
- NULL);
-   } else {
-   dst_release(dst);
-   }
-   dst = NULL;
-   }
-
 out:
dst_release(dst);
fl6_sock_release(flowlabel);
-- 
1.8.3.1



[PATCH net v4 1/3] ipv6: add a wrapper for ip6_dst_store() with flowi6 checks

2018-03-30 Thread Alexey Kodanev
Move commonly used pattern of ip6_dst_store() usage to a separate
function - ip6_dst_store_flow(), which will check the addresses
for equality using the flow information, before saving them.

There is no functional changes in this patch. In addition, it will
be used in the next patch, in ip6_sk_dst_lookup_flow().

Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 include/net/ipv6.h|  3 ++-
 net/ipv6/datagram.c   |  9 +
 net/ipv6/ip6_output.c | 17 +
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 8606c91..1dd31ca 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -971,7 +971,8 @@ static inline struct sk_buff *ip6_finish_skb(struct sock 
*sk)
 }
 
 unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst);
-
+void ip6_dst_store_flow(struct sock *sk, struct dst_entry *dst,
+   const struct flowi6 *fl6);
 int ip6_dst_lookup(struct net *net, struct sock *sk, struct dst_entry **dst,
   struct flowi6 *fl6);
 struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 
*fl6,
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index a9f7eca..8b4fa0c 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -106,14 +106,7 @@ int ip6_datagram_dst_update(struct sock *sk, bool 
fix_sk_saddr)
}
}
 
-   ip6_dst_store(sk, dst,
- ipv6_addr_equal(, >sk_v6_daddr) ?
- >sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
- ipv6_addr_equal(, >saddr) ?
- >saddr :
-#endif
- NULL);
+   ip6_dst_store_flow(sk, dst, );
 
 out:
fl6_sock_release(flowlabel);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a8a9195..ed87ce5 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -943,6 +943,23 @@ static struct dst_entry *ip6_sk_dst_check(struct sock *sk,
return dst;
 }
 
+void ip6_dst_store_flow(struct sock *sk, struct dst_entry *dst,
+   const struct flowi6 *fl6)
+{
+#ifdef CONFIG_IPV6_SUBTREES
+   struct ipv6_pinfo *np = inet6_sk(sk);
+#endif
+
+   ip6_dst_store(sk, dst,
+ ipv6_addr_equal(>daddr, >sk_v6_daddr) ?
+ >sk_v6_daddr : NULL,
+#ifdef CONFIG_IPV6_SUBTREES
+ ipv6_addr_equal(>saddr, >saddr) ?
+ >saddr :
+#endif
+ NULL);
+}
+
 static int ip6_dst_lookup_tail(struct net *net, const struct sock *sk,
   struct dst_entry **dst, struct flowi6 *fl6)
 {
-- 
1.8.3.1



[PATCH net] ip6_gre: remove redundant 'tunnel' setting in ip6erspan_tap_init()

2018-03-30 Thread Alexey Kodanev
'tunnel' was already set at the start of ip6erspan_tap_init().

Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv6/ip6_gre.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 1bbd093..09901be 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1757,7 +1757,6 @@ static int ip6erspan_tap_init(struct net_device *dev)
dev->mtu -= 8;
 
dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
-   tunnel = netdev_priv(dev);
ip6gre_tnl_link_config(tunnel, 1);
 
return 0;
-- 
1.8.3.1



Re: [PATCH net v3 1/3] ipv6: move ip6_dst_store() calls with flowi6 checks to a wrapper

2018-03-30 Thread Alexey Kodanev
On 03/30/2018 01:14 PM, kbuild test robot wrote:
> Hi Alexey,
> 
> I love your patch! Perhaps something to improve:
> 
> [auto build test WARNING on net/master]
> 
> url:
> https://github.com/0day-ci/linux/commits/Alexey-Kodanev/ipv6-move-ip6_dst_store-calls-with-flowi6-checks-to-a-wrapper/20180330-173050
> config: ia64-defconfig (attached as .config)
> compiler: ia64-linux-gcc (GCC) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=ia64 
> 
> All warnings (new ones prefixed by >>):
> 
>In file included from include/linux/tcp.h:23:0,
> from include/linux/ipv6.h:87,
> from include/net/ipv6.h:16,
> from include/net/inetpeer.h:16,
> from include/net/route.h:28,
> from drivers/infiniband/core/addr.c:43:
>include/net/ip6_route.h: In function 'ip6_dst_store_flow':
>include/net/sock.h:349:34: error: 'struct sock_common' has no member named 
> 'skc_v6_daddr'; did you mean 'skc_daddr'?
> #define sk_v6_daddr  __sk_common.skc_v6_daddr
>  ^
>>> include/net/ip6_route.h:221:43: note: in expansion of macro 'sk_v6_daddr'
> ipv6_addr_equal(>daddr, >sk_v6_daddr) ?
>   ^~~
>include/net/sock.h:349:34: error: 'struct sock_common' has no member named 
> 'skc_v6_daddr'; did you mean 'skc_daddr'?
> #define sk_v6_daddr  __sk_common.skc_v6_daddr
>  ^
>include/net/ip6_route.h:222:14: note: in expansion of macro 'sk_v6_daddr'
> >sk_v6_daddr : NULL
>  ^~~


This is because CONFIG_IPV6 is not enabled, no sk_v6_daddr member,
I'll fix it in the new version.

Thanks,
Alexey


[PATCH net v3 3/3] ipv6: udp6: set dst cache for a connected sk if current not valid

2018-03-29 Thread Alexey Kodanev
A new RTF_CACHE route can be created in udpv6_sendmsg() between
ip6_sk_dst_lookup_flow() and ip6_dst_store() calls when datagram
sending results to ICMPV6_PKT_TOOBIG error:

udp_v6_send_skb(), for example with vti6 tunnel:
vti6_xmit(), get ICMPV6_PKT_TOOBIG error
skb_dst_update_pmtu(), can create a RTF_CACHE clone
icmpv6_send()
...
udpv6_err()
ip6_sk_update_pmtu()
   ip6_update_pmtu(), can create a RTF_CACHE clone
   ...
   ip6_datagram_dst_update()
ip6_dst_store()

And after commit 33c162a980fe ("ipv6: datagram: Update dst cache
of a connected datagram sk during pmtu update"), the UDPv6 error
handler can update socket's dst cache, but it can happen before
the update in end of udpv6_sendmsg(), preventing getting updated
dst cache on the next udpv6_sendmsg() calls.

In order to fix it, we should save dst for a connected socket
in udpv6_sendmsg(), only if the previous socket's cache is not
valid.

Fixes: 33c162a980fe ("ipv6: datagram: Update dst cache of a connected datagram 
sk during pmtu update")
Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering 
pmtu exception")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv6/udp.c | 21 ++---
 1 file changed, 2 insertions(+), 19 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e49dac4..da13c90 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1289,7 +1289,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
 
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-   dst = ip6_sk_dst_lookup_flow(sk, , final_p, 0);
+   dst = ip6_sk_dst_lookup_flow(sk, , final_p, connected);
if (IS_ERR(dst)) {
err = PTR_ERR(dst);
dst = NULL;
@@ -1314,7 +1314,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
err = PTR_ERR(skb);
if (!IS_ERR_OR_NULL(skb))
err = udp_v6_send_skb(skb, );
-   goto release_dst;
+   goto out;
}
 
lock_sock(sk);
@@ -1348,23 +1348,6 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
err = np->recverr ? net_xmit_errno(err) : 0;
release_sock(sk);
 
-release_dst:
-   if (dst) {
-   if (connected) {
-   ip6_dst_store(sk, dst,
- ipv6_addr_equal(, 
>sk_v6_daddr) ?
- >sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
- ipv6_addr_equal(, >saddr) ?
- >saddr :
-#endif
- NULL);
-   } else {
-   dst_release(dst);
-   }
-   dst = NULL;
-   }
-
 out:
dst_release(dst);
fl6_sock_release(flowlabel);
-- 
1.8.3.1



[PATCH net v3 0/3] ipv6: udp6: set dst cache for a connected sk if current not valid

2018-03-29 Thread Alexey Kodanev
A new RTF_CACHE route can be created with the socket's dst cache
update between the below calls in udpv6_sendmsg(), when datagram
sending results to ICMPV6_PKT_TOOBIG error:

   dst = ip6_sk_dst_lookup_flow(...)
   ...
release_dst:
if (dst) {
if (connected)
ip6_dst_store(sk, dst)

Therefore, the new socket's dst cache reset to the old one on
"release_dst:".

The first two patches prepare the code to store dst cache
with ip6_sk_dst_lookup_flow() in udpv6_sendmsg():

  * the first patch adds new wrapper ip6_dst_store_flow() to
increase readability of the code and the changes.

  * the second patch adds new argument to ip6_sk_dst_lookup_flow()
and ability to store dst in the socket's cache. Also, the two
users of the function are updated without enabling the new
behavior: pingv6_sendmsg() and udpv6_sendmsg().

The last patch contains the actual fix that removes sk dst cache
update in the end of udpv6_sendmsg(), and allows to do it in
ip6_sk_dst_lookup_flow().

v3: * instead of moving ip6_dst_store() above udp_v6_send_skb(),
  update socket's dst cache inside ip6_sk_dst_lookup_flow()
  if the current one is invalid.
* the issue not reproduced in 4.1, but starting from 4.2. Add
  one more 'Fixes:' commit that creates new RTF_CACHE route.
  Though, it is also mentioned in the first one.

Alexey Kodanev (3):
  ipv6: move ip6_dst_store() calls with flowi6 checks to a wrapper
  ipv6: allow to cache dst for connected sk in ip6_sk_dst_lookup_flow()
  ipv6: udp6: set dst cache for a connected sk if current not valid

 include/net/ip6_route.h | 17 +
 include/net/ipv6.h  |  3 ++-
 net/ipv6/datagram.c |  9 +
 net/ipv6/ip6_output.c   | 15 ---
 net/ipv6/ping.c |  2 +-
 net/ipv6/udp.c  | 21 ++---
 6 files changed, 35 insertions(+), 32 deletions(-)

-- 
1.8.3.1



[PATCH net v3 1/3] ipv6: move ip6_dst_store() calls with flowi6 checks to a wrapper

2018-03-29 Thread Alexey Kodanev
Move commonly used pattern of ip6_dst_store() usage to a separate
function - ip6_dst_store_flow(), which will check the addresses
for equality before saving them.

There is no functional changes in this patch, the new wrapper
will be used in the next patch, in ip6_sk_dst_lookup_flow().

Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 include/net/ip6_route.h | 17 +
 net/ipv6/datagram.c |  9 +
 2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index ac0866b..36c3946 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -210,6 +210,23 @@ static inline void ip6_dst_store(struct sock *sk, struct 
dst_entry *dst,
 #endif
 }
 
+static inline void ip6_dst_store_flow(struct sock *sk, struct dst_entry *dst,
+ struct flowi6 *fl6)
+{
+#ifdef CONFIG_IPV6_SUBTREES
+   struct ipv6_pinfo *np = inet6_sk(sk);
+#endif
+
+   ip6_dst_store(sk, dst,
+ ipv6_addr_equal(>daddr, >sk_v6_daddr) ?
+ >sk_v6_daddr : NULL,
+#ifdef CONFIG_IPV6_SUBTREES
+ ipv6_addr_equal(>saddr, >saddr) ?
+ >saddr :
+#endif
+ NULL);
+}
+
 static inline bool ipv6_unicast_destination(const struct sk_buff *skb)
 {
struct rt6_info *rt = (struct rt6_info *) skb_dst(skb);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index a9f7eca..8b4fa0c 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -106,14 +106,7 @@ int ip6_datagram_dst_update(struct sock *sk, bool 
fix_sk_saddr)
}
}
 
-   ip6_dst_store(sk, dst,
- ipv6_addr_equal(, >sk_v6_daddr) ?
- >sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
- ipv6_addr_equal(, >saddr) ?
- >saddr :
-#endif
- NULL);
+   ip6_dst_store_flow(sk, dst, );
 
 out:
fl6_sock_release(flowlabel);
-- 
1.8.3.1



[PATCH net v3 2/3] ipv6: allow to cache dst for connected sk in ip6_sk_dst_lookup_flow()

2018-03-29 Thread Alexey Kodanev
Add 'connected' argument to ip6_sk_dst_lookup_flow() and
update the cache only if ip6_sk_dst_check() returns NULL
and a socket is connected.

The function is used as before, the new behavior for UDP
sockets in udpv6_sendmsg() will be enabled in the next
patch.

Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 include/net/ipv6.h|  3 ++-
 net/ipv6/ip6_output.c | 15 ---
 net/ipv6/ping.c   |  2 +-
 net/ipv6/udp.c|  2 +-
 4 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 8606c91..07e94dc 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -977,7 +977,8 @@ int ip6_dst_lookup(struct net *net, struct sock *sk, struct 
dst_entry **dst,
 struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 
*fl6,
  const struct in6_addr *final_dst);
 struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
-const struct in6_addr *final_dst);
+const struct in6_addr *final_dst,
+int connected);
 struct dst_entry *ip6_blackhole_route(struct net *net,
  struct dst_entry *orig_dst);
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a8a9195..c22017c 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1105,23 +1105,32 @@ struct dst_entry *ip6_dst_lookup_flow(const struct sock 
*sk, struct flowi6 *fl6,
  * @sk: socket which provides the dst cache and route info
  * @fl6: flow to lookup
  * @final_dst: final destination address for ipsec lookup
+ * @connected: whether @sk is connected or not
  *
  * This function performs a route lookup on the given flow with the
  * possibility of using the cached route in the socket if it is valid.
  * It will take the socket dst lock when operating on the dst cache.
  * As a result, this function can only be used in process context.
  *
+ * In addition, for a connected socket, cache the dst in the socket
+ * if the current cache is not valid.
+ *
  * It returns a valid dst pointer on success, or a pointer encoded
  * error code.
  */
 struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
-const struct in6_addr *final_dst)
+const struct in6_addr *final_dst,
+int connected)
 {
struct dst_entry *dst = sk_dst_check(sk, inet6_sk(sk)->dst_cookie);
 
dst = ip6_sk_dst_check(sk, dst, fl6);
-   if (!dst)
-   dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
+   if (dst)
+   return dst;
+
+   dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
+   if (connected && !IS_ERR(dst))
+   ip6_dst_store_flow(sk, dst_clone(dst), fl6);
 
return dst;
 }
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index d12c55d..546f4a6 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -121,7 +121,7 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
ipc6.tclass = np->tclass;
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-   dst = ip6_sk_dst_lookup_flow(sk, ,  daddr);
+   dst = ip6_sk_dst_lookup_flow(sk, , daddr, 0);
if (IS_ERR(dst))
return PTR_ERR(dst);
rt = (struct rt6_info *) dst;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 52e3ea0..e49dac4 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1289,7 +1289,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
 
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-   dst = ip6_sk_dst_lookup_flow(sk, , final_p);
+   dst = ip6_sk_dst_lookup_flow(sk, , final_p, 0);
if (IS_ERR(dst)) {
err = PTR_ERR(dst);
dst = NULL;
-- 
1.8.3.1



Re: [PATCH net v2] udp6: set dst cache for a connected sk before udp_v6_send_skb

2018-03-28 Thread Alexey Kodanev
On 27.03.2018 22:00, Martin KaFai Lau wrote:
> On Tue, Mar 27, 2018 at 04:27:30PM +0300, Alexey Kodanev wrote:
>> On 26.03.2018 20:02, Martin KaFai Lau wrote:
>>> On Mon, Mar 26, 2018 at 05:48:47PM +0300, Alexey Kodanev wrote:
>>>> After commit 33c162a980fe ("ipv6: datagram: Update dst cache of a
>>>> connected datagram sk during pmtu update"), when the error occurs on
>>>> sending datagram in udpv6_sendmsg() due to ICMPV6_PKT_TOOBIG type,
>>>> error handler can trigger the following path and call ip6_dst_store():
>>>>
>>>> udpv6_err()
>>>> ip6_sk_update_pmtu()
>>>> ip6_datagram_dst_update()
>>>> ip6_dst_lookup_flow(), can create a RTF_CACHE clone
>>> Instead of ip6_dst_lookup_flow(),
>>> you meant the RTF_CACHE route created in ip6_update_pmtu()
>>>
>>
>> Right, or even earlier... I was using vti tunnel and it invokes
>> skb_dst_update_pmtu() on this error, then sends ICMPv6_PKT_TOOBIG.
>>
>>>> ...
>>>> ip6_dst_store()
>>>>
>>>> It can happen before a connected UDP socket invokes ip6_dst_store()
>>>> in the end of udpv6_sendmsg(), on destination release, as a result,
>>>> the last one changes dst to the old one, preventing getting updated
>>>> dst cache on the next udpv6_sendmsg() call.
>>>>
>>>> This patch moves ip6_dst_store() in udpv6_sendmsg(), so that it is
>>>> invoked after ip6_sk_dst_lookup_flow() and before udp_v6_send_skb().
>>> After this patch, the above udpv6_err() path could not happen after
>>> ip6_sk_dst_lookup_flow() and before the ip6_dst_store() in udpv6_sendmsg()?
>>>
>>
>> May be we could minimize this if save it in ip6_sk_dst_lookup_flow()
>> for a connected UDP sockets only if we're not getting it from a cache
>> for some reason?
> I am just not clear how moving it earlier can completely stop the
> described issue.  Beside, the next ICMPV6_PKT_TOOBIG will be
> received again and eventually rectify sk->sk_dst_cache?


With a request/response scenario in my environment, it always happens
after udp_v6_send_skb() and before the 'release_dst:' in udpv6_sendmsg().
So the socket there never gets updated cache when resending datagram
again after ICMPV6_PKT_TOOBIG... dst cache is always reset to the old
one in the end of udpv6_sendmsg().


> 
>>
>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>> index a8a9195..0204f52 100644
>> --- a/net/ipv6/ip6_output.c
>> +++ b/net/ipv6/ip6_output.c
>> @@ -1115,13 +1115,30 @@ struct dst_entry *ip6_dst_lookup_flow(const struct 
>> sock *sk, struct flowi6 *fl6,
>>   * error code.
>>   */
>>  struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 
>> *fl6,
>> -const struct in6_addr *final_dst)
>> +const struct in6_addr *final_dst,
>> +bool connected)
>>  {
>> struct dst_entry *dst = sk_dst_check(sk, inet6_sk(sk)->dst_cookie);
>>  
>> dst = ip6_sk_dst_check(sk, dst, fl6);
>> -   if (!dst)
>> -   dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
>> +   if (dst)
>> +   return dst;
>> +
>> +   dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
>> +
>> +   if (connected && !IS_ERR(dst))
>> +   ip6_dst_store(sk, dst_clone(dst), ...);
> Right, I think doing dst_store() only if ip6_sk_dst_check()/
> sk_dst_check() returns NULL is a more sensible thing to
> do here.

OK, will prepare the patch.

Thanks,
Alexey


Re: [PATCH net v2] udp6: set dst cache for a connected sk before udp_v6_send_skb

2018-03-27 Thread Alexey Kodanev
On 26.03.2018 20:02, Martin KaFai Lau wrote:
> On Mon, Mar 26, 2018 at 05:48:47PM +0300, Alexey Kodanev wrote:
>> After commit 33c162a980fe ("ipv6: datagram: Update dst cache of a
>> connected datagram sk during pmtu update"), when the error occurs on
>> sending datagram in udpv6_sendmsg() due to ICMPV6_PKT_TOOBIG type,
>> error handler can trigger the following path and call ip6_dst_store():
>>
>> udpv6_err()
>> ip6_sk_update_pmtu()
>> ip6_datagram_dst_update()
>> ip6_dst_lookup_flow(), can create a RTF_CACHE clone
> Instead of ip6_dst_lookup_flow(),
> you meant the RTF_CACHE route created in ip6_update_pmtu()
> 

Right, or even earlier... I was using vti tunnel and it invokes
skb_dst_update_pmtu() on this error, then sends ICMPv6_PKT_TOOBIG.

>> ...
>> ip6_dst_store()
>>
>> It can happen before a connected UDP socket invokes ip6_dst_store()
>> in the end of udpv6_sendmsg(), on destination release, as a result,
>> the last one changes dst to the old one, preventing getting updated
>> dst cache on the next udpv6_sendmsg() call.
>>
>> This patch moves ip6_dst_store() in udpv6_sendmsg(), so that it is
>> invoked after ip6_sk_dst_lookup_flow() and before udp_v6_send_skb().
> After this patch, the above udpv6_err() path could not happen after
> ip6_sk_dst_lookup_flow() and before the ip6_dst_store() in udpv6_sendmsg()?
>

May be we could minimize this if save it in ip6_sk_dst_lookup_flow()
for a connected UDP sockets only if we're not getting it from a cache
for some reason?

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a8a9195..0204f52 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1115,13 +1115,30 @@ struct dst_entry *ip6_dst_lookup_flow(const struct sock 
*sk, struct flowi6 *fl6,
  * error code.
  */
 struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
-const struct in6_addr *final_dst)
+const struct in6_addr *final_dst,
+bool connected)
 {
struct dst_entry *dst = sk_dst_check(sk, inet6_sk(sk)->dst_cookie);
 
dst = ip6_sk_dst_check(sk, dst, fl6);
-   if (!dst)
-   dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
+   if (dst)
+   return dst;
+
+   dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
+
+   if (connected && !IS_ERR(dst))
+   ip6_dst_store(sk, dst_clone(dst), ...);

Thanks,
Alexey
 
>>
>> Also, increase refcnt for dst, when passing it to ip6_dst_store()
>> because after that the dst cache can be released by other calls
>> to ip6_dst_store() with the same socket.
>>
>> Fixes: 33c162a980fe ("ipv6: datagram: Update dst cache of a connected 
>> datagram sk during pmtu update")
>> Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
>> ---
>>
>> v2: * remove 'release_dst:' label
>>
>> * move ip6_dst_store() below MSG_CONFIRM check as
>>   suggested by Eric and add dst_clone()
>>
>> * add 'Fixes' commit.
>>
>>
>>  net/ipv6/udp.c | 29 +++--
>>  1 file changed, 11 insertions(+), 18 deletions(-)
>>
>> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
>> index 52e3ea0..4508e5a 100644
>> --- a/net/ipv6/udp.c
>> +++ b/net/ipv6/udp.c
>> @@ -1303,6 +1303,16 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr 
>> *msg, size_t len)
>>  goto do_confirm;
>>  back_from_confirm:
>>  
>> +if (connected)>> +  ip6_dst_store(sk, dst_clone(dst),
>> +  ipv6_addr_equal(, >sk_v6_daddr) ?
>> +  >sk_v6_daddr : NULL,
>> +#ifdef CONFIG_IPV6_SUBTREES
>> +  ipv6_addr_equal(, >saddr) ?
>> +  >saddr :
>> +#endif
>> +  NULL);
>> +
>>  /* Lockless fast path for the non-corking case */
>>  if (!corkreq) {
>>  struct sk_buff *skb;
>> @@ -1314,7 +1324,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
>> size_t len)
>>  err = PTR_ERR(skb);
>>  if (!IS_ERR_OR_NULL(skb))
>>  err = udp_v6_send_skb(skb, );
>> -goto release_dst;
>> +goto out;
>>  }
>>  
>>  lock_sock(sk);
>> @@ -1348,23 +1358,6 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr 
>> *msg, size_t len)
>>  err = np->recverr ? net_xmi

[PATCH net v2] udp6: set dst cache for a connected sk before udp_v6_send_skb

2018-03-26 Thread Alexey Kodanev
After commit 33c162a980fe ("ipv6: datagram: Update dst cache of a
connected datagram sk during pmtu update"), when the error occurs on
sending datagram in udpv6_sendmsg() due to ICMPV6_PKT_TOOBIG type,
error handler can trigger the following path and call ip6_dst_store():

udpv6_err()
ip6_sk_update_pmtu()
ip6_datagram_dst_update()
ip6_dst_lookup_flow(), can create a RTF_CACHE clone
...
ip6_dst_store()

It can happen before a connected UDP socket invokes ip6_dst_store()
in the end of udpv6_sendmsg(), on destination release, as a result,
the last one changes dst to the old one, preventing getting updated
dst cache on the next udpv6_sendmsg() call.

This patch moves ip6_dst_store() in udpv6_sendmsg(), so that it is
invoked after ip6_sk_dst_lookup_flow() and before udp_v6_send_skb().

Also, increase refcnt for dst, when passing it to ip6_dst_store()
because after that the dst cache can be released by other calls
to ip6_dst_store() with the same socket.

Fixes: 33c162a980fe ("ipv6: datagram: Update dst cache of a connected datagram 
sk during pmtu update")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---

v2: * remove 'release_dst:' label

* move ip6_dst_store() below MSG_CONFIRM check as
  suggested by Eric and add dst_clone()

* add 'Fixes' commit.


 net/ipv6/udp.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 52e3ea0..4508e5a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1303,6 +1303,16 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
goto do_confirm;
 back_from_confirm:
 
+   if (connected)
+   ip6_dst_store(sk, dst_clone(dst),
+ ipv6_addr_equal(, >sk_v6_daddr) ?
+ >sk_v6_daddr : NULL,
+#ifdef CONFIG_IPV6_SUBTREES
+ ipv6_addr_equal(, >saddr) ?
+ >saddr :
+#endif
+ NULL);
+
/* Lockless fast path for the non-corking case */
if (!corkreq) {
struct sk_buff *skb;
@@ -1314,7 +1324,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
err = PTR_ERR(skb);
if (!IS_ERR_OR_NULL(skb))
err = udp_v6_send_skb(skb, );
-   goto release_dst;
+   goto out;
}
 
lock_sock(sk);
@@ -1348,23 +1358,6 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
err = np->recverr ? net_xmit_errno(err) : 0;
release_sock(sk);
 
-release_dst:
-   if (dst) {
-   if (connected) {
-   ip6_dst_store(sk, dst,
- ipv6_addr_equal(, 
>sk_v6_daddr) ?
- >sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
- ipv6_addr_equal(, >saddr) ?
- >saddr :
-#endif
- NULL);
-   } else {
-   dst_release(dst);
-   }
-   dst = NULL;
-   }
-
 out:
dst_release(dst);
fl6_sock_release(flowlabel);
-- 
1.8.3.1



Re: [PATCH net] udp6: set dst cache for a connected sk before udp_v6_send_skb

2018-03-23 Thread Alexey Kodanev
On 03/23/2018 08:13 PM, Alexey Kodanev wrote:
> On 03/23/2018 06:50 PM, Eric Dumazet wrote:
...
>>> +   if (connected)
>>> +   ip6_dst_store(sk, dst,
>>> + ipv6_addr_equal(, >sk_v6_daddr) ?
>>> + >sk_v6_daddr : NULL,
>>> +#ifdef CONFIG_IPV6_SUBTREES
>>> + ipv6_addr_equal(, >saddr) ?
>>> + >saddr :
>>> +#endif
>>> + NULL);
>>> +
>>
>> What about the MSG_CONFIRM stuff ?
>>
>>> if (msg->msg_flags_CONFIRM)
>>> goto do_confirm;
>>>  back_from_confirm:
>>
>> Should not you move the above code here instead ?
> 
> Ah, you are right, it can release that dst if it go to "do_confirm".
> 
>>> Also ip6_dst_store() does not increment dst refcount.
>>
>> I fear that as soon as dst is visible to other cpus, it might be stolen.
>>
> 
> So we should pass dst_clone(dst) to ip6_dst_store() instead of dst,
> because udpv6_err() can release it if it's set the new one.
> 
> Then, I guess, we could left udpv6_sendmsg()/ip6_dst_store() where it
> is now in the patch and remove the check for "connected" before dst_relase(),
> similar to udp_sendmsg(), right?

And the section "release_dst:" looks redundant after that too.

Thanks,
Alexey


Re: [PATCH net] udp6: set dst cache for a connected sk before udp_v6_send_skb

2018-03-23 Thread Alexey Kodanev
On 03/23/2018 06:50 PM, Eric Dumazet wrote:
> 
> 
> On 03/23/2018 07:39 AM, Alexey Kodanev wrote:
>> After commit 33c162a980fe ("ipv6: datagram: Update dst cache of a
>> connected datagram sk during pmtu update"), when the error occurs on
>> sending datagram in udpv6_sendmsg() due to ICMPV6_PKT_TOOBIG type,
>> error handler can trigger the following path and call ip6_dst_store():
>>
>> udpv6_err()
>> ip6_sk_update_pmtu()
>> ip6_datagram_dst_update()
>> ip6_dst_lookup_flow(), can create a RTF_CACHE clone
>> ...
>> ip6_dst_store()
>>
>> It can happen before a connected UDP socket invokes ip6_dst_store()
>> in the end of udpv6_sendmsg(), on destination release, as a result,
>> the last one changes dst to the old one, preventing getting updated
>> dst cache on the next udpv6_sendmsg() call.
>>
>> This patch moves ip6_dst_store() in udpv6_sendmsg(), so that it is
>> invoked after ip6_sk_dst_lookup_flow() and before udp_v6_send_skb().
>>
> 
> 
> A Fixes: tag would be nice, for automatic tools (and humans as well)

Hi Eric,

I see, will add the commit 33c162a980fe.

...
>>  
>> +if (connected)
>> +ip6_dst_store(sk, dst,
>> +  ipv6_addr_equal(, >sk_v6_daddr) ?
>> +  >sk_v6_daddr : NULL,
>> +#ifdef CONFIG_IPV6_SUBTREES
>> +  ipv6_addr_equal(, >saddr) ?
>> +  >saddr :
>> +#endif
>> +  NULL);
>> +
> 
> What about the MSG_CONFIRM stuff ?
> 
>>  if (msg->msg_flags_CONFIRM)
>>  goto do_confirm;
>>  back_from_confirm:
> 
> Should not you move the above code here instead ?

Ah, you are right, it can release that dst if it go to "do_confirm".

> > Also ip6_dst_store() does not increment dst refcount.
> 
> I fear that as soon as dst is visible to other cpus, it might be stolen.
> 

So we should pass dst_clone(dst) to ip6_dst_store() instead of dst,
because udpv6_err() can release it if it's set the new one.

Then, I guess, we could left udpv6_sendmsg()/ip6_dst_store() where it
is now in the patch and remove the check for "connected" before dst_relase(),
similar to udp_sendmsg(), right?

Thanks,
Alexey


[PATCH net] udp6: set dst cache for a connected sk before udp_v6_send_skb

2018-03-23 Thread Alexey Kodanev
After commit 33c162a980fe ("ipv6: datagram: Update dst cache of a
connected datagram sk during pmtu update"), when the error occurs on
sending datagram in udpv6_sendmsg() due to ICMPV6_PKT_TOOBIG type,
error handler can trigger the following path and call ip6_dst_store():

udpv6_err()
ip6_sk_update_pmtu()
ip6_datagram_dst_update()
ip6_dst_lookup_flow(), can create a RTF_CACHE clone
...
ip6_dst_store()

It can happen before a connected UDP socket invokes ip6_dst_store()
in the end of udpv6_sendmsg(), on destination release, as a result,
the last one changes dst to the old one, preventing getting updated
dst cache on the next udpv6_sendmsg() call.

This patch moves ip6_dst_store() in udpv6_sendmsg(), so that it is
invoked after ip6_sk_dst_lookup_flow() and before udp_v6_send_skb().

Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv6/udp.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 52e3ea0..0d413c6 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1299,6 +1299,16 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
if (ipc6.hlimit < 0)
ipc6.hlimit = ip6_sk_dst_hoplimit(np, , dst);
 
+   if (connected)
+   ip6_dst_store(sk, dst,
+ ipv6_addr_equal(, >sk_v6_daddr) ?
+ >sk_v6_daddr : NULL,
+#ifdef CONFIG_IPV6_SUBTREES
+ ipv6_addr_equal(, >saddr) ?
+ >saddr :
+#endif
+ NULL);
+
if (msg->msg_flags_CONFIRM)
goto do_confirm;
 back_from_confirm:
@@ -1350,18 +1360,8 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
 
 release_dst:
if (dst) {
-   if (connected) {
-   ip6_dst_store(sk, dst,
- ipv6_addr_equal(, 
>sk_v6_daddr) ?
- >sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
- ipv6_addr_equal(, >saddr) ?
- >saddr :
-#endif
- NULL);
-   } else {
+   if (!connected)
dst_release(dst);
-   }
dst = NULL;
}
 
-- 
1.8.3.1



[PATCH net] dccp: check sk for closed state in dccp_sendmsg()

2018-03-06 Thread Alexey Kodanev
dccp_disconnect() sets 'dp->dccps_hc_tx_ccid' tx handler to NULL,
therefore if DCCP socket is disconnected and dccp_sendmsg() is
called after it, it will cause a NULL pointer dereference in
dccp_write_xmit().

This crash and the reproducer was reported by syzbot. Looks like
it is reproduced if commit 69c64866ce07 ("dccp: CVE-2017-8824:
use-after-free in DCCP code") is applied.

Reported-by: syzbot+f99ab3887ab65d70f...@syzkaller.appspotmail.com
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/dccp/proto.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 15bdc00..84cd4e3 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -794,6 +794,11 @@ int dccp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
if (skb == NULL)
goto out_release;
 
+   if (sk->sk_state == DCCP_CLOSED) {
+   rc = -ENOTCONN;
+   goto out_discard;
+   }
+
skb_reserve(skb, sk->sk_prot->max_header);
rc = memcpy_from_msg(skb_put(skb, len), msg, len);
if (rc != 0)
-- 
1.8.3.1



[PATCH net v3] sch_netem: fix skb leak in netem_enqueue()

2018-03-05 Thread Alexey Kodanev
When we exceed current packets limit and we have more than one
segment in the list returned by skb_gso_segment(), netem drops
only the first one, skipping the rest, hence kmemleak reports:

unreferenced object 0x880b5d23b600 (size 1024):
  comm "softirq", pid 0, jiffies 4384527763 (age 2770.629s)
  hex dump (first 32 bytes):
00 80 23 5d 0b 88 ff ff 00 00 00 00 00 00 00 00  ..#]
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[<d8a19b9d>] __alloc_skb+0xc9/0x520
[<1709b32f>] skb_segment+0x8c8/0x3710
[<c7b9bb88>] tcp_gso_segment+0x331/0x1830
[<c921cba1>] inet_gso_segment+0x476/0x1370
[<8b762dd4>] skb_mac_gso_segment+0x1f9/0x510
[<2182660a>] __skb_gso_segment+0x1dd/0x620
[<412651b9>] netem_enqueue+0x1536/0x2590 [sch_netem]
[<05d3b2a9>] __dev_queue_xmit+0x1167/0x2120
[<fc5f7327>] ip_finish_output2+0x998/0xf00
[<d309e9d3>] ip_output+0x1aa/0x2c0
[<7ecbd3a4>] tcp_transmit_skb+0x18db/0x3670
[<42d2a45f>] tcp_write_xmit+0x4d4/0x58c0
[<56a44199>] tcp_tasklet_func+0x3d9/0x540
[<13d06d02>] tasklet_action+0x1ca/0x250
[<fcde0b8b>] __do_softirq+0x1b4/0x5a3
[<e7ed027c>] irq_exit+0x1e2/0x210

Fix it by adding the rest of the segments, if any, to skb 'to_free'
list. Add new __qdisc_drop_all() and qdisc_drop_all() functions
because they can be useful in the future if we need to drop segmented
GSO packets in other places.

Fixes: 6071bd1aa13e ("netem: Segment GSO packets on enqueue")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---

v3: use skb->prev to find the tail of the list. skb->prev can be NULL
for not segmented skbs, so check it too.

v2: add new __qdisc_drop_all() and qdisc_drop_all(), and use
qdisc_drop_all() in sch_netem.


 include/net/sch_generic.h | 19 +++
 net/sched/sch_netem.c |  2 +-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index e2ab136..2092d33 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -824,6 +824,16 @@ static inline void __qdisc_drop(struct sk_buff *skb, 
struct sk_buff **to_free)
*to_free = skb;
 }
 
+static inline void __qdisc_drop_all(struct sk_buff *skb,
+   struct sk_buff **to_free)
+{
+   if (skb->prev)
+   skb->prev->next = *to_free;
+   else
+   skb->next = *to_free;
+   *to_free = skb;
+}
+
 static inline unsigned int __qdisc_queue_drop_head(struct Qdisc *sch,
   struct qdisc_skb_head *qh,
   struct sk_buff **to_free)
@@ -956,6 +966,15 @@ static inline int qdisc_drop(struct sk_buff *skb, struct 
Qdisc *sch,
return NET_XMIT_DROP;
 }
 
+static inline int qdisc_drop_all(struct sk_buff *skb, struct Qdisc *sch,
+struct sk_buff **to_free)
+{
+   __qdisc_drop_all(skb, to_free);
+   qdisc_qstats_drop(sch);
+
+   return NET_XMIT_DROP;
+}
+
 /* Length to Time (L2T) lookup in a qdisc_rate_table, to determine how
long it will take to send a packet given its size.
  */
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 7c179ad..7d6801f 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -509,7 +509,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc 
*sch,
}
 
if (unlikely(sch->q.qlen >= sch->limit))
-   return qdisc_drop(skb, sch, to_free);
+   return qdisc_drop_all(skb, sch, to_free);
 
qdisc_qstats_backlog_inc(sch, skb);
 
-- 
1.8.3.1



Re: [PATCH net] sch_netem: fix skb leak in netem_enqueue()

2018-03-05 Thread Alexey Kodanev
On 03/05/2018 06:13 PM, Eric Dumazet wrote:
> On Mon, 2018-03-05 at 15:57 +0300, Alexey Kodanev wrote:
>>
>> +static inline void __qdisc_drop_all(struct sk_buff *skb,
>> +   struct sk_buff **to_free)
>> +{
>> +   struct sk_buff *first = skb;
>> +
>> +   while (skb->next)
>> +   skb = skb->next;
>> +
>> +   skb->next = *to_free;
>> +   *to_free = first;
>> +}
> 
> You probably can leverage what I did for commit bec3cfdca36bf43cfa
> ("net: skb_segment() provides list head and tail")
> to avoid the iteration.

OK, thanks! Will use it and send a new version of the patch.

Thanks,
Alexey


[PATCH net v2] sch_netem: fix skb leak in netem_enqueue()

2018-03-05 Thread Alexey Kodanev
When we exceed current packets limit and we have more than one
segment in the list returned by skb_gso_segment(), netem drops
only the first one, skipping the rest, hence kmemleak reports:

unreferenced object 0x880b5d23b600 (size 1024):
  comm "softirq", pid 0, jiffies 4384527763 (age 2770.629s)
  hex dump (first 32 bytes):
00 80 23 5d 0b 88 ff ff 00 00 00 00 00 00 00 00  ..#]
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[<d8a19b9d>] __alloc_skb+0xc9/0x520
[<1709b32f>] skb_segment+0x8c8/0x3710
[<c7b9bb88>] tcp_gso_segment+0x331/0x1830
[<c921cba1>] inet_gso_segment+0x476/0x1370
[<8b762dd4>] skb_mac_gso_segment+0x1f9/0x510
[<2182660a>] __skb_gso_segment+0x1dd/0x620
[<412651b9>] netem_enqueue+0x1536/0x2590 [sch_netem]
[<05d3b2a9>] __dev_queue_xmit+0x1167/0x2120
[<fc5f7327>] ip_finish_output2+0x998/0xf00
[<d309e9d3>] ip_output+0x1aa/0x2c0
[<7ecbd3a4>] tcp_transmit_skb+0x18db/0x3670
[<42d2a45f>] tcp_write_xmit+0x4d4/0x58c0
[<56a44199>] tcp_tasklet_func+0x3d9/0x540
[<13d06d02>] tasklet_action+0x1ca/0x250
[<fcde0b8b>] __do_softirq+0x1b4/0x5a3
[<e7ed027c>] irq_exit+0x1e2/0x210

Fix it by adding the rest of the segments, if any, to skb 'to_free'
list. Add new __qdisc_drop_all() and qdisc_drop_all() functions
because they can be useful in the future if we should drop segmented
GSO packets in other places.

Fixes: 6071bd1aa13e ("netem: Segment GSO packets on enqueue")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 include/net/sch_generic.h | 21 +
 net/sched/sch_netem.c |  2 +-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index e2ab136..7c4bfd7 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -824,6 +824,18 @@ static inline void __qdisc_drop(struct sk_buff *skb, 
struct sk_buff **to_free)
*to_free = skb;
 }
 
+static inline void __qdisc_drop_all(struct sk_buff *skb,
+   struct sk_buff **to_free)
+{
+   struct sk_buff *first = skb;
+
+   while (skb->next)
+   skb = skb->next;
+
+   skb->next = *to_free;
+   *to_free = first;
+}
+
 static inline unsigned int __qdisc_queue_drop_head(struct Qdisc *sch,
   struct qdisc_skb_head *qh,
   struct sk_buff **to_free)
@@ -956,6 +968,15 @@ static inline int qdisc_drop(struct sk_buff *skb, struct 
Qdisc *sch,
return NET_XMIT_DROP;
 }
 
+static inline int qdisc_drop_all(struct sk_buff *skb, struct Qdisc *sch,
+struct sk_buff **to_free)
+{
+   __qdisc_drop_all(skb, to_free);
+   qdisc_qstats_drop(sch);
+
+   return NET_XMIT_DROP;
+}
+
 /* Length to Time (L2T) lookup in a qdisc_rate_table, to determine how
long it will take to send a packet given its size.
  */
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 7c179ad..7d6801f 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -509,7 +509,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc 
*sch,
}
 
if (unlikely(sch->q.qlen >= sch->limit))
-   return qdisc_drop(skb, sch, to_free);
+   return qdisc_drop_all(skb, sch, to_free);
 
qdisc_qstats_backlog_inc(sch, skb);
 
-- 
1.8.3.1



Re: [PATCH net] sch_netem: fix skb leak in netem_enqueue()

2018-03-05 Thread Alexey Kodanev
On 03/03/2018 03:20 PM, Neil Horman wrote:
> On Fri, Mar 02, 2018 at 09:16:48PM +0300, Alexey Kodanev wrote:
>> When we exceed current packets limit and have more than one
>> segment in the list returned by skb_gso_segment(), netem drops
>> only the first one, skipping the rest, hence kmemleak reports:
>>
...
>>
>> Fix it by adding the rest of the segments, if any, to skb
>> 'to_free' list in that case.
>>
>> Fixes: 6071bd1aa13e ("netem: Segment GSO packets on enqueue")
>> Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
>> ---
>>  net/sched/sch_netem.c | 8 +++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
>> index 7c179ad..a5023a2 100644
>> --- a/net/sched/sch_netem.c
>> +++ b/net/sched/sch_netem.c
>> @@ -508,8 +508,14 @@ static int netem_enqueue(struct sk_buff *skb, struct 
>> Qdisc *sch,
>>  1<<(prandom_u32() % 8);
>>  }
>>  
>> -if (unlikely(sch->q.qlen >= sch->limit))
>> +if (unlikely(sch->q.qlen >= sch->limit)) {
>> +while (segs) {
>> +skb2 = segs->next;
>> +__qdisc_drop(segs, to_free);
>> +segs = skb2;
>> +}
>>  return qdisc_drop(skb, sch, to_free);
>> +}
>>  
> It seems like it might be nice to wrap up this drop loop into a
> qdisc_drop_all inline function.  Then we can easily drop segments in other
> locations if we should need to


Agree, will prepare the patch. I guess we could just add 'segs' to 'to_free'
list, then add qdisc_drop_all() with stats counter and returning status,
something like this:

@@ -824,6 +824,18 @@ static inline void __qdisc_drop(struct sk_buff *skb, 
struct sk_buff **to_free)
*to_free = skb;
 }

+static inline void __qdisc_drop_all(struct sk_buff *skb,
+   struct sk_buff **to_free)
+{
+   struct sk_buff *first = skb;
+
+   while (skb->next)
+   skb = skb->next;
+
+   skb->next = *to_free;
+   *to_free = first;
+}

Thanks,
Alexey


[PATCH net] sch_netem: fix skb leak in netem_enqueue()

2018-03-02 Thread Alexey Kodanev
When we exceed current packets limit and have more than one
segment in the list returned by skb_gso_segment(), netem drops
only the first one, skipping the rest, hence kmemleak reports:

unreferenced object 0x880b5d23b600 (size 1024):
  comm "softirq", pid 0, jiffies 4384527763 (age 2770.629s)
  hex dump (first 32 bytes):
00 80 23 5d 0b 88 ff ff 00 00 00 00 00 00 00 00  ..#]
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[<d8a19b9d>] __alloc_skb+0xc9/0x520
[<1709b32f>] skb_segment+0x8c8/0x3710
[<c7b9bb88>] tcp_gso_segment+0x331/0x1830
[<c921cba1>] inet_gso_segment+0x476/0x1370
[<8b762dd4>] skb_mac_gso_segment+0x1f9/0x510
[<2182660a>] __skb_gso_segment+0x1dd/0x620
[<412651b9>] netem_enqueue+0x1536/0x2590 [sch_netem]
[<05d3b2a9>] __dev_queue_xmit+0x1167/0x2120
[<fc5f7327>] ip_finish_output2+0x998/0xf00
[<d309e9d3>] ip_output+0x1aa/0x2c0
[<7ecbd3a4>] tcp_transmit_skb+0x18db/0x3670
[<42d2a45f>] tcp_write_xmit+0x4d4/0x58c0
[<56a44199>] tcp_tasklet_func+0x3d9/0x540
[<13d06d02>] tasklet_action+0x1ca/0x250
[<fcde0b8b>] __do_softirq+0x1b4/0x5a3
[<e7ed027c>] irq_exit+0x1e2/0x210

Fix it by adding the rest of the segments, if any, to skb
'to_free' list in that case.

Fixes: 6071bd1aa13e ("netem: Segment GSO packets on enqueue")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/sched/sch_netem.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 7c179ad..a5023a2 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -508,8 +508,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc 
*sch,
1<<(prandom_u32() % 8);
}
 
-   if (unlikely(sch->q.qlen >= sch->limit))
+   if (unlikely(sch->q.qlen >= sch->limit)) {
+   while (segs) {
+   skb2 = segs->next;
+   __qdisc_drop(segs, to_free);
+   segs = skb2;
+   }
return qdisc_drop(skb, sch, to_free);
+   }
 
qdisc_qstats_backlog_inc(sch, skb);
 
-- 
1.8.3.1



[PATCH net] macvlan: fix use-after-free in macvlan_common_newlink()

2018-02-22 Thread Alexey Kodanev
The following use-after-free was reported by KASan when running
LTP macvtap01 test on 4.16-rc2:

[10642.528443] BUG: KASAN: use-after-free in
   macvlan_common_newlink+0x12ef/0x14a0 [macvlan]
[10642.626607] Read of size 8 at addr 880ba49f2100 by task ip/18450
...
[10642.963873] Call Trace:
[10642.994352]  dump_stack+0x5c/0x7c
[10643.035325]  print_address_description+0x75/0x290
[10643.092938]  kasan_report+0x28d/0x390
[10643.137971]  ? macvlan_common_newlink+0x12ef/0x14a0 [macvlan]
[10643.207963]  macvlan_common_newlink+0x12ef/0x14a0 [macvlan]
[10643.275978]  macvtap_newlink+0x171/0x260 [macvtap]
[10643.334532]  rtnl_newlink+0xd4f/0x1300
...
[10646.256176] Allocated by task 18450:
[10646.299964]  kasan_kmalloc+0xa6/0xd0
[10646.343746]  kmem_cache_alloc_trace+0xf1/0x210
[10646.397826]  macvlan_common_newlink+0x6de/0x14a0 [macvlan]
[10646.464386]  macvtap_newlink+0x171/0x260 [macvtap]
[10646.522728]  rtnl_newlink+0xd4f/0x1300
...
[10647.022028] Freed by task 18450:
[10647.061549]  __kasan_slab_free+0x138/0x180
[10647.111468]  kfree+0x9e/0x1c0
[10647.147869]  macvlan_port_destroy+0x3db/0x650 [macvlan]
[10647.211411]  rollback_registered_many+0x5b9/0xb10
[10647.268715]  rollback_registered+0xd9/0x190
[10647.319675]  register_netdevice+0x8eb/0xc70
[10647.370635]  macvlan_common_newlink+0xe58/0x14a0 [macvlan]
[10647.437195]  macvtap_newlink+0x171/0x260 [macvtap]

Commit d02fd6e7d293 ("macvlan: Fix one possible double free") handles
the case when register_netdevice() invokes ndo_uninit() on error and
as a result free the port. But 'macvlan_port_get_rtnl(dev))' check
(returns dev->rx_handler_data), which was added by this commit in order
to prevent double free, is not quite correct:

* for macvlan it always returns NULL because 'lowerdev' is the one that
  was used to register rx handler (port) in macvlan_port_create() as
  well as to unregister it in macvlan_port_destroy().
* for macvtap it always returns a valid pointer because macvtap registers
  its own rx handler before macvlan_common_newlink().

Fixes: d02fd6e7d293 ("macvlan: Fix one possible double free")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 drivers/net/macvlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index a0f2be8..8fc02d9 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -1451,7 +1451,7 @@ int macvlan_common_newlink(struct net *src_net, struct 
net_device *dev,
/* the macvlan port may be freed by macvlan_uninit when fail to 
register.
 * so we destroy the macvlan port only when it's valid.
 */
-   if (create && macvlan_port_get_rtnl(dev))
+   if (create && macvlan_port_get_rtnl(lowerdev))
macvlan_port_destroy(port->dev);
return err;
 }
-- 
1.8.3.1



[PATCH net] udplite: fix partial checksum initialization

2018-02-15 Thread Alexey Kodanev
Since UDP-Lite is always using checksum, the following path is
triggered when calculating pseudo header for it:

  udp4_csum_init() or udp6_csum_init()
skb_checksum_init_zero_check()
  __skb_checksum_validate_complete()

The problem can appear if skb->len is less than CHECKSUM_BREAK. In
this particular case __skb_checksum_validate_complete() also invokes
__skb_checksum_complete(skb). If UDP-Lite is using partial checksum
that covers only part of a packet, the function will return bad
checksum and the packet will be dropped.

It can be fixed if we skip skb_checksum_init_zero_check() and only
set the required pseudo header checksum for UDP-Lite with partial
checksum before udp4_csum_init()/udp6_csum_init() functions return.

Fixes: ed70fcfcee95 ("net: Call skb_checksum_init in IPv4")
Fixes: e4f45b7f40bd ("net: Call skb_checksum_init in IPv6")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---

Alternatively, we could modify skb_checksum_init_zero_check() because
it is used only in udp4_csum_init() and udp6_csum_init() as well as
introducing a new 'cscov' parameter in __skb_checksum_validate_complete()
and calling __skb_checksum_complete_head() if 'cscov' not equals zero.
But these changes would touch general code in skbuff.h

 include/net/udplite.h   | 1 +
 net/ipv4/udp.c  | 5 +
 net/ipv6/ip6_checksum.c | 5 +
 3 files changed, 11 insertions(+)

diff --git a/include/net/udplite.h b/include/net/udplite.h
index 81bdbf9..9185e45 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -64,6 +64,7 @@ static inline int udplite_checksum_init(struct sk_buff *skb, 
struct udphdr *uh)
UDP_SKB_CB(skb)->cscov = cscov;
if (skb->ip_summed == CHECKSUM_COMPLETE)
skb->ip_summed = CHECKSUM_NONE;
+   skb->csum_valid = 0;
 }
 
return 0;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index bfaefe5..e5ef7c3 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2024,6 +2024,11 @@ static inline int udp4_csum_init(struct sk_buff *skb, 
struct udphdr *uh,
err = udplite_checksum_init(skb, uh);
if (err)
return err;
+
+   if (UDP_SKB_CB(skb)->partial_cov) {
+   skb->csum = inet_compute_pseudo(skb, proto);
+   return 0;
+   }
}
 
/* Note, we are only interested in != 0 or == 0, thus the
diff --git a/net/ipv6/ip6_checksum.c b/net/ipv6/ip6_checksum.c
index ec43d18..547515e 100644
--- a/net/ipv6/ip6_checksum.c
+++ b/net/ipv6/ip6_checksum.c
@@ -73,6 +73,11 @@ int udp6_csum_init(struct sk_buff *skb, struct udphdr *uh, 
int proto)
err = udplite_checksum_init(skb, uh);
if (err)
return err;
+
+   if (UDP_SKB_CB(skb)->partial_cov) {
+   skb->csum = ip6_compute_pseudo(skb, proto);
+   return 0;
+   }
}
 
/* To support RFC 6936 (allow zero checksum in UDP/IPV6 for tunnels)
-- 
1.8.3.1



[PATCH net v2] sctp: verify size of a new chunk in _sctp_make_chunk()

2018-02-09 Thread Alexey Kodanev
When SCTP makes INIT or INIT_ACK packet the total chunk length
can exceed SCTP_MAX_CHUNK_LEN which leads to kernel panic when
transmitting these packets, e.g. the crash on sending INIT_ACK:

[  597.804948] skbuff: skb_over_panic: text:ffae06e4 len:120168
   put:120156 head:7aa47635 data:d991c2de
   tail:0x1d640 end:0xfec0 dev:
...
[  597.976970] [ cut here ]
[  598.033408] kernel BUG at net/core/skbuff.c:104!
[  600.314841] Call Trace:
[  600.345829]  
[  600.371639]  ? sctp_packet_transmit+0x2095/0x26d0 [sctp]
[  600.436934]  skb_put+0x16c/0x200
[  600.477295]  sctp_packet_transmit+0x2095/0x26d0 [sctp]
[  600.540630]  ? sctp_packet_config+0x890/0x890 [sctp]
[  600.601781]  ? __sctp_packet_append_chunk+0x3b4/0xd00 [sctp]
[  600.671356]  ? sctp_cmp_addr_exact+0x3f/0x90 [sctp]
[  600.731482]  sctp_outq_flush+0x663/0x30d0 [sctp]
[  600.788565]  ? sctp_make_init+0xbf0/0xbf0 [sctp]
[  600.84]  ? sctp_check_transmitted+0x18f0/0x18f0 [sctp]
[  600.912945]  ? sctp_outq_tail+0x631/0x9d0 [sctp]
[  600.969936]  sctp_cmd_interpreter.isra.22+0x3be1/0x5cb0 [sctp]
[  601.041593]  ? sctp_sf_do_5_1B_init+0x85f/0xc30 [sctp]
[  601.104837]  ? sctp_generate_t1_cookie_event+0x20/0x20 [sctp]
[  601.175436]  ? sctp_eat_data+0x1710/0x1710 [sctp]
[  601.233575]  sctp_do_sm+0x182/0x560 [sctp]
[  601.284328]  ? sctp_has_association+0x70/0x70 [sctp]
[  601.345586]  ? sctp_rcv+0xef4/0x32f0 [sctp]
[  601.397478]  ? sctp6_rcv+0xa/0x20 [sctp]
...

Here the chunk size for INIT_ACK packet becomes too big, mostly
because of the state cookie (INIT packet has large size with
many address parameters), plus additional server parameters.

Later this chunk causes the panic in skb_put_data():

  skb_packet_transmit()
  sctp_packet_pack()
  skb_put_data(nskb, chunk->skb->data, chunk->skb->len);

'nskb' (head skb) was previously allocated with packet->size
from u16 'chunk->chunk_hdr->length'.

As suggested by Marcelo we should check the chunk's length in
_sctp_make_chunk() before trying to allocate skb for it and
discard a chunk if its size bigger than SCTP_MAX_CHUNK_LEN.

Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---

v2: account for padding before checking chunklen

 net/sctp/sm_make_chunk.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 793b05e..d01475f 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -1380,9 +1380,14 @@ static struct sctp_chunk *_sctp_make_chunk(const struct 
sctp_association *asoc,
struct sctp_chunk *retval;
struct sk_buff *skb;
struct sock *sk;
+   int chunklen;
+
+   chunklen = SCTP_PAD4(sizeof(*chunk_hdr) + paylen);
+   if (chunklen > SCTP_MAX_CHUNK_LEN)
+   goto nodata;
 
/* No need to allocate LL here, as this is only a chunk. */
-   skb = alloc_skb(SCTP_PAD4(sizeof(*chunk_hdr) + paylen), gfp);
+   skb = alloc_skb(chunklen, gfp);
if (!skb)
goto nodata;
 
-- 
1.8.3.1



Re: [PATCH] sctp: verify size of a new chunk in _sctp_make_chunk()

2018-02-09 Thread Alexey Kodanev
On 09.02.2018 16:27, Marcelo Ricardo Leitner wrote:
> On Fri, Feb 09, 2018 at 04:02:31PM +0300, Alexey Kodanev wrote:
>> 
>> ---
>>  net/sctp/sm_make_chunk.c |7 ++-
>>  1 files changed, 6 insertions(+), 1 deletions(-)
>>
>> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
>> index 793b05e..95618eb 100644
>> --- a/net/sctp/sm_make_chunk.c
>> +++ b/net/sctp/sm_make_chunk.c
>> @@ -1380,9 +1380,14 @@ void sctp_init_addrs(struct sctp_chunk *chunk, union 
>> sctp_addr *src,
> 
> Weird how it identified sctp_init_addrs as the context here o.O
> Line numbers above and the rest below matches _sctp_make_chunk.

Looks like because of my old git version...

> 
>>  struct sctp_chunk *retval;
>>  struct sk_buff *skb;
>>  struct sock *sk;
>> +int chunklen;
>> +
>> +chunklen = sizeof(*chunk_hdr) + paylen;
> 
> It's better to do the padding here too, as it may grow the length by 3
> bytes.

Hmm, SCTP_MAX_CHUNK_LEN is 65532 (accounts for padding by subtracting
sizeof(__u32) from 1 << 16) and SCTP_PAD4(65531) equals 65532, i.e. it
can't grow above it or am I missing something?

Agree, may be it's better to not rely on the particular constant value.
I'll send a new version.

Thanks,
Alexey

> 
>   Marcelo
> 
>> +if (chunklen > SCTP_MAX_CHUNK_LEN)
>> +goto nodata;
>>  
>>  /* No need to allocate LL here, as this is only a chunk. */
>> -skb = alloc_skb(SCTP_PAD4(sizeof(*chunk_hdr) + paylen), gfp);
>> +skb = alloc_skb(SCTP_PAD4(chunklen), gfp);
>>  if (!skb)
>>  goto nodata;
>>  
>> -- 
>> 1.7.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>



[PATCH] sctp: verify size of a new chunk in _sctp_make_chunk()

2018-02-09 Thread Alexey Kodanev
When SCTP makes INIT or INIT_ACK packets the total chunk length
can exceed SCTP_MAX_CHUNK_LEN which leads to kernel panic when
transmitting these packets, e.g. the crash on sending INIT_ACK:

[  597.804948] skbuff: skb_over_panic: text:ffae06e4 len:120168
   put:120156 head:7aa47635 data:d991c2de
   tail:0x1d640 end:0xfec0 dev:
...
[  597.976970] [ cut here ]
[  598.033408] kernel BUG at net/core/skbuff.c:104!
[  600.314841] Call Trace:
[  600.345829]  
[  600.371639]  ? sctp_packet_transmit+0x2095/0x26d0 [sctp]
[  600.436934]  skb_put+0x16c/0x200
[  600.477295]  sctp_packet_transmit+0x2095/0x26d0 [sctp]
[  600.540630]  ? sctp_packet_config+0x890/0x890 [sctp]
[  600.601781]  ? __sctp_packet_append_chunk+0x3b4/0xd00 [sctp]
[  600.671356]  ? sctp_cmp_addr_exact+0x3f/0x90 [sctp]
[  600.731482]  sctp_outq_flush+0x663/0x30d0 [sctp]
[  600.788565]  ? sctp_make_init+0xbf0/0xbf0 [sctp]
[  600.84]  ? sctp_check_transmitted+0x18f0/0x18f0 [sctp]
[  600.912945]  ? sctp_outq_tail+0x631/0x9d0 [sctp]
[  600.969936]  sctp_cmd_interpreter.isra.22+0x3be1/0x5cb0 [sctp]
[  601.041593]  ? sctp_sf_do_5_1B_init+0x85f/0xc30 [sctp]
[  601.104837]  ? sctp_generate_t1_cookie_event+0x20/0x20 [sctp]
[  601.175436]  ? sctp_eat_data+0x1710/0x1710 [sctp]
[  601.233575]  sctp_do_sm+0x182/0x560 [sctp]
[  601.284328]  ? sctp_has_association+0x70/0x70 [sctp]
[  601.345586]  ? sctp_rcv+0xef4/0x32f0 [sctp]
[  601.397478]  ? sctp6_rcv+0xa/0x20 [sctp]
...

Here the chunk size for INIT_ACK packet becomes too big, mostly
because of the state cookie (INIT packet has large size with
many address parameters), plus additional server parameters.

Later this chunk causes the panic in skb_put_data():

  skb_packet_transmit()
  sctp_packet_pack()
  skb_put_data(nskb, chunk->skb->data, chunk->skb->len);

'nskb' (head skb) was previously allocated with packet->size
from u16 'chunk->chunk_hdr->length'.

As suggested by Marcelo we should check the chunk's length in
_sctp_make_chunk() before trying to allocate skb for it and
discard a chunk if its size bigger than SCTP_MAX_CHUNK_LEN.

Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/sctp/sm_make_chunk.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 793b05e..95618eb 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -1380,9 +1380,14 @@ void sctp_init_addrs(struct sctp_chunk *chunk, union 
sctp_addr *src,
struct sctp_chunk *retval;
struct sk_buff *skb;
struct sock *sk;
+   int chunklen;
+
+   chunklen = sizeof(*chunk_hdr) + paylen;
+   if (chunklen > SCTP_MAX_CHUNK_LEN)
+   goto nodata;
 
/* No need to allocate LL here, as this is only a chunk. */
-   skb = alloc_skb(SCTP_PAD4(sizeof(*chunk_hdr) + paylen), gfp);
+   skb = alloc_skb(SCTP_PAD4(chunklen), gfp);
if (!skb)
goto nodata;
 
-- 
1.7.1



sctp: skb_over_panic on INIT/INIT_ACK packet sending

2018-02-08 Thread Alexey Kodanev
Hi,

Got the following panic when the received INIT packet has a lot of
address parameters, so that the INIT_ACK chunksize exceeds
SCTP_MAX_CHUNK_LEN: 

[  597.804948] skbuff: skb_over_panic: text:ffae06e4 len:120168 
put:120156
   head:7aa47635 data:d991c2de tail:0x1d640 
end:0xfec0 dev:
...
[  597.976970] [ cut here ]
[  598.033408] kernel BUG at net/core/skbuff.c:104!
[  600.314841] Call Trace:
[  600.345829]  
[  600.371639]  ? sctp_packet_transmit+0x2095/0x26d0 [sctp]
[  600.436934]  skb_put+0x16c/0x200
[  600.477295]  sctp_packet_transmit+0x2095/0x26d0 [sctp]
[  600.540630]  ? sctp_packet_config+0x890/0x890 [sctp]
[  600.601781]  ? __sctp_packet_append_chunk+0x3b4/0xd00 [sctp]
[  600.671356]  ? sctp_cmp_addr_exact+0x3f/0x90 [sctp]
[  600.731482]  sctp_outq_flush+0x663/0x30d0 [sctp]
[  600.788565]  ? sctp_make_init+0xbf0/0xbf0 [sctp]
[  600.84]  ? sctp_check_transmitted+0x18f0/0x18f0 [sctp]
[  600.912945]  ? sctp_outq_tail+0x631/0x9d0 [sctp]
[  600.969936]  sctp_cmd_interpreter.isra.22+0x3be1/0x5cb0 [sctp]
[  601.041593]  ? sctp_sf_do_5_1B_init+0x85f/0xc30 [sctp]
[  601.104837]  ? sctp_generate_t1_cookie_event+0x20/0x20 [sctp]
[  601.175436]  ? sctp_eat_data+0x1710/0x1710 [sctp]
[  601.233575]  sctp_do_sm+0x182/0x560 [sctp]
[  601.284328]  ? sctp_has_association+0x70/0x70 [sctp]
[  601.345586]  ? sctp_rcv+0xef4/0x32f0 [sctp]
[  601.397478]  ? sctp6_rcv+0xa/0x20 [sctp]
...

Here the chunk size for INIT_ACK packet may become too big, mostly
because of the cookielen. Tried on local machine so both addrlen
and cookielen was big: addrlen 39960 cookielen 80168.
 
Not sure how to fix it correctly, may be we need a check here
or add it to sctp_make_control()?

diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 793b05e..c27564c 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -475,6 +475,9 @@ struct sctp_chunk *sctp_make_init_ack(const struct 
sctp_association *asoc,
if (num_ext)
chunksize += SCTP_PAD4(sizeof(ext_param) + num_ext);
 
+   if (chunksize > SCTP_MAX_CHUNK_LEN)
+   goto nomem_chunk;
+
/* Now allocate and fill out the chunk.  */
retval = sctp_make_control(asoc, SCTP_CID_INIT_ACK, 0, chunksize, gfp);
if (!retval)

And for INIT as well...

Otherwise this chunk goes to skb_packet_transmit() -> sctp_packet_pack()
where panic on

  skb_put_data(nskb, chunk->skb->data, chunk->skb->len);

nskb (head skb) was previously allocated with packet->size that looks
like getting the chunk size from u16 chunk_hdr->length.

Thanks,
Alexey


[PATCH] sctp: fix dst refcnt leak in sctp_v6_get_dst()

2018-02-05 Thread Alexey Kodanev
When going through the bind address list in sctp_v6_get_dst() and
the previously found address is better ('matchlen > bmatchlen'),
the code continues to the next iteration without releasing currently
held destination.

Fix it by releasing 'bdst' before continue to the next iteration, and
instead of introducing one more '!IS_ERR(bdst)' check for dst_release(),
move the already existed one right after ip6_dst_lookup_flow(), i.e. we
shouldn't proceed further if we get an error for the route lookup.

Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using secondary 
addresses for ipv6")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/sctp/ipv6.c |   10 +++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 5d4c15b..e35d4f7 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -326,8 +326,10 @@ static void sctp_v6_get_dst(struct sctp_transport *t, 
union sctp_addr *saddr,
final_p = fl6_update_dst(fl6, rcu_dereference(np->opt), );
bdst = ip6_dst_lookup_flow(sk, fl6, final_p);
 
-   if (!IS_ERR(bdst) &&
-   ipv6_chk_addr(dev_net(bdst->dev),
+   if (IS_ERR(bdst))
+   continue;
+
+   if (ipv6_chk_addr(dev_net(bdst->dev),
  >a.v6.sin6_addr, bdst->dev, 1)) {
if (!IS_ERR_OR_NULL(dst))
dst_release(dst);
@@ -336,8 +338,10 @@ static void sctp_v6_get_dst(struct sctp_transport *t, 
union sctp_addr *saddr,
}
 
bmatchlen = sctp_v6_addr_match_len(daddr, >a);
-   if (matchlen > bmatchlen)
+   if (matchlen > bmatchlen) {
+   dst_release(bdst);
continue;
+   }
 
if (!IS_ERR_OR_NULL(dst))
dst_release(dst);
-- 
1.7.1



Re: sctp netns "unregister_netdevice: waiting for lo to become free. Usage count = 1"

2018-02-02 Thread Alexey Kodanev
On 02.02.2018 11:27, Tommi Rantala wrote:
> 2018-02-02 1:57 GMT+02:00 Alexey Kodanev <alexey.koda...@oracle.com>:
>> For ipv6 part, shouldn't we release 'bdst' there if the previous address
>> match is better and we continue to the next iteration?
> 
> Good catch!
> 

On the second thought, I think, we should also check 'bdst' ptr for
the error earlier, i.e. right after 'bdst = ip6_dst_lookup_flow(...)'.
I'll prepare the patch.

Thanks,
Alexey


> Didn't see that one.
> 
> Tommi
> 
>> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
>> index 5d4c15b..a044096 100644
>> --- a/net/sctp/ipv6.c
>> +++ b/net/sctp/ipv6.c
>> @@ -336,8 +336,11 @@ static void sctp_v6_get_dst(struct sctp_transport *t, 
>> union sctp_addr *saddr,
>> }
>>
>> bmatchlen = sctp_v6_addr_match_len(daddr, >a);
>> -   if (matchlen > bmatchlen)
>> +   if (matchlen > bmatchlen) {
>> +   if (!IS_ERR(bdst))
>> +   dst_release(bdst);
>> continue;
>> +   }
>>
>> if (!IS_ERR_OR_NULL(dst))
>> dst_release(dst);
>>
>> Thanks,
>> Alexey



Re: sctp netns "unregister_netdevice: waiting for lo to become free. Usage count = 1"

2018-02-01 Thread Alexey Kodanev
On 01.02.2018 21:02, Tommi Rantala wrote:
> 2018-01-31 19:51 GMT+02:00 Tommi Rantala :
>> On 31.01.2018 14:31, Neil Horman wrote:
>>>
>>> On Wed, Jan 31, 2018 at 11:42:24AM +0200, Tommi Rantala wrote:

 I think there's a problem in the dst refcounting in sctp_v4_get_dst()

 There's a dst_entry struct that has >0 refcnt after running the testcase,
 which makes it impossible to delete the loopback device, as that dst is
 never freed.

 I'll try to make a patch.

>>> Are you looking at the second for loop there, which uses
>>> ip_route_output_key,
>>> but discards the result if dst is already set?  That does look a bit
>>> wonky, and
>>> the same problem may exist in the ipv6 path.  Let me know what the result
>>> is.
>>
>>
>> Yes, that was it. Did you receive the email I sent with the patch?
>> (I'm not seeing that message e.g. at spinics.net linux-sctp archive, so just
>> wondering if that email got lost somehow...)
>>
>> I'll check the ipv6 case, did not try it yet.
>>
>> Tommi
> 
> As far as I can tell, the ipv6 code does not suffer from this.
> The dst handling there looks good to me.
> 

For ipv6 part, shouldn't we release 'bdst' there if the previous address
match is better and we continue to the next iteration?

diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 5d4c15b..a044096 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -336,8 +336,11 @@ static void sctp_v6_get_dst(struct sctp_transport *t, 
union sctp_addr *saddr,
}
 
bmatchlen = sctp_v6_addr_match_len(daddr, >a);
-   if (matchlen > bmatchlen)
+   if (matchlen > bmatchlen) {
+   if (!IS_ERR(bdst))
+   dst_release(bdst);
continue;
+   }
 
if (!IS_ERR_OR_NULL(dst))
dst_release(dst);

Thanks,
Alexey


[PATCH net v2] dccp: don't restart ccid2_hc_tx_rto_expire() if sk in closed state

2018-01-26 Thread Alexey Kodanev
ccid2_hc_tx_rto_expire() timer callback always restarts the timer
again and can run indefinitely (unless it is stopped outside), and after
commit 120e9dabaf55 ("dccp: defer ccid_hc_tx_delete() at dismantle time"),
which moved ccid_hc_tx_delete() (also includes sk_stop_timer()) from
dccp_destroy_sock() to sk_destruct(), this started to happen quite often.
The timer prevents releasing the socket, as a result, sk_destruct() won't
be called.

Found with LTP/dccp_ipsec tests running on the bonding device,
which later couldn't be unloaded after the tests were completed:

  unregister_netdevice: waiting for bond0 to become free. Usage count = 148

Fixes: 2a91aa396739 ("[DCCP] CCID2: Initial CCID2 (TCP-Like) implementation")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---

v2: * corrected bug origin commit id
* clarified commit message about sk_stop_timer()

 net/dccp/ccids/ccid2.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/dccp/ccids/ccid2.c b/net/dccp/ccids/ccid2.c
index 1c75cd1..92d016e 100644
--- a/net/dccp/ccids/ccid2.c
+++ b/net/dccp/ccids/ccid2.c
@@ -140,6 +140,9 @@ static void ccid2_hc_tx_rto_expire(struct timer_list *t)
 
ccid2_pr_debug("RTO_EXPIRE\n");
 
+   if (sk->sk_state == DCCP_CLOSED)
+   goto out;
+
/* back-off timer */
hc->tx_rto <<= 1;
if (hc->tx_rto > DCCP_RTO_MAX)
-- 
1.7.1



Re: [PATCH net] dccp: don't restart ccid2_hc_tx_rto_expire() if sk in closed state

2018-01-26 Thread Alexey Kodanev
On 01/25/2018 09:03 PM, Eric Dumazet wrote:
> On Thu, 2018-01-25 at 20:43 +0300, Alexey Kodanev wrote:
>> ccid2_hc_tx_rto_expire() timer callback always restarts the timer
>> again and can run indefinitely (unless it is stopped outside), and
>> after commit 120e9dabaf55 ("dccp: defer ccid_hc_tx_delete() at
>> dismantle time"), which moved sk_stop_timer() to sk_destruct(),
>> this started to happen quite often. The timer prevents releasing
>> the socket, as a result, sk_destruct() won't be called.
>>
>> Found with LTP/dccp_ipsec tests running on the bonding device,
>> which later couldn't be unloaded after the tests were completed:
>>
>>   unregister_netdevice: waiting for bond0 to become free. Usage count = 148
>>
>> Fixes: 120e9dabaf55 ("dccp: defer ccid_hc_tx_delete() at dismantle time")
>> Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
>> ---
> 
> I understand your fix, but not why commit 120e9dabaf55 is bug origin.
> 
> Looks like this always had been buggy : Timer logic should have checked
> socket state from day 0.

Hi Eric,

Agree, I'll change to the initial commit id. I've added commit 120e9dabaf55
because ccid_hc_tx_delete() (and sk_stop_timer()) moved from dccp_destroy_sock()
to sk_destruct(), and only after this change the chances that the timer won't
stop increased significantly.

> 
> I did not move sk_stop_timer() to sk_destruct()
> 

ccid_hc_tx_delete() includes sk_stop_timer():


ccid_hc_tx_delete()
ccid2_hc_tx_exit(struct sock *sk)
sk_stop_timer(sk, >tx_rtotimer);

ccid3_hc_tx_exit(struct sock *sk)
sk_stop_timer(sk, >tx_no_feedback_timer);


Thanks,
Alexey


[PATCH net] dccp: don't restart ccid2_hc_tx_rto_expire() if sk in closed state

2018-01-25 Thread Alexey Kodanev
ccid2_hc_tx_rto_expire() timer callback always restarts the timer
again and can run indefinitely (unless it is stopped outside), and
after commit 120e9dabaf55 ("dccp: defer ccid_hc_tx_delete() at
dismantle time"), which moved sk_stop_timer() to sk_destruct(),
this started to happen quite often. The timer prevents releasing
the socket, as a result, sk_destruct() won't be called.

Found with LTP/dccp_ipsec tests running on the bonding device,
which later couldn't be unloaded after the tests were completed:

  unregister_netdevice: waiting for bond0 to become free. Usage count = 148

Fixes: 120e9dabaf55 ("dccp: defer ccid_hc_tx_delete() at dismantle time")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/dccp/ccids/ccid2.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/dccp/ccids/ccid2.c b/net/dccp/ccids/ccid2.c
index 1c75cd1..92d016e 100644
--- a/net/dccp/ccids/ccid2.c
+++ b/net/dccp/ccids/ccid2.c
@@ -140,6 +140,9 @@ static void ccid2_hc_tx_rto_expire(struct timer_list *t)
 
ccid2_pr_debug("RTO_EXPIRE\n");
 
+   if (sk->sk_state == DCCP_CLOSED)
+   goto out;
+
/* back-off timer */
hc->tx_rto <<= 1;
if (hc->tx_rto > DCCP_RTO_MAX)
-- 
1.7.1



[PATCH net v2] ip6_gre: init dev->mtu and dev->hard_header_len correctly

2018-01-18 Thread Alexey Kodanev
Commit b05229f44228 ("gre6: Cleanup GREv6 transmit path,
call common GRE functions") moved dev->mtu initialization
from ip6gre_tunnel_setup() to ip6gre_tunnel_init(), as a
result, the previously set values, before ndo_init(), are
reset in the following cases:

* rtnl_create_link() can update dev->mtu from IFLA_MTU
  parameter.

* ip6gre_tnl_link_config() is invoked before ndo_init() in
  netlink and ioctl setup, so ndo_init() can reset MTU
  adjustments with the lower device MTU as well, dev->mtu
  and dev->hard_header_len.

  Not applicable for ip6gretap because it has one more call
  to ip6gre_tnl_link_config(tunnel, 1) in ip6gre_tap_init().

Fix the first case by updating dev->mtu with 'tb[IFLA_MTU]'
parameter if a user sets it manually on a device creation,
and fix the second one by moving ip6gre_tnl_link_config()
call after register_netdevice().

Fixes: b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE 
functions")
Fixes: db2ec95d1ba4 ("ip6_gre: Fix MTU setting")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---

v2: Instead of checking whether dev->mtu equals zero or not
in ip6gre_tunnel_init_common(), update 'dev->mtu' once
more with 'IFLA_MTU' parameter after register_netdevice().

 net/ipv6/ip6_gre.c |   14 +++---
 1 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 7726959..8735492 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -337,11 +337,12 @@ static void ip6gre_tunnel_unlink(struct ip6gre_net *ign, 
struct ip6_tnl *t)
 
nt->dev = dev;
nt->net = dev_net(dev);
-   ip6gre_tnl_link_config(nt, 1);
 
if (register_netdevice(dev) < 0)
goto failed_free;
 
+   ip6gre_tnl_link_config(nt, 1);
+
/* Can use a lockless transmit, unless we generate output sequences */
if (!(nt->parms.o_flags & TUNNEL_SEQ))
dev->features |= NETIF_F_LLTX;
@@ -1303,7 +1304,6 @@ static void ip6gre_netlink_parms(struct nlattr *data[],
 
 static int ip6gre_tap_init(struct net_device *dev)
 {
-   struct ip6_tnl *tunnel;
int ret;
 
ret = ip6gre_tunnel_init_common(dev);
@@ -1312,10 +1312,6 @@ static int ip6gre_tap_init(struct net_device *dev)
 
dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
 
-   tunnel = netdev_priv(dev);
-
-   ip6gre_tnl_link_config(tunnel, 1);
-
return 0;
 }
 
@@ -1408,12 +1404,16 @@ static int ip6gre_newlink(struct net *src_net, struct 
net_device *dev,
 
nt->dev = dev;
nt->net = dev_net(dev);
-   ip6gre_tnl_link_config(nt, !tb[IFLA_MTU]);
 
err = register_netdevice(dev);
if (err)
goto out;
 
+   ip6gre_tnl_link_config(nt, !tb[IFLA_MTU]);
+
+   if (tb[IFLA_MTU])
+   ip6_tnl_change_mtu(dev, nla_get_u32(tb[IFLA_MTU]));
+
dev_hold(dev);
ip6gre_tunnel_link(ign, nt);
 
-- 
1.7.1



Re: [PATCH net] ip6_gre: init dev->mtu and dev->hard_header_len correctly

2018-01-17 Thread Alexey Kodanev
On 01/16/2018 07:32 PM, David Miller wrote:
> From: Alexey Kodanev <alexey.koda...@oracle.com>
> Date: Thu, 11 Jan 2018 16:02:54 +0300
> 
>> For ip6gretap, reset dev->mtu to zero in ip6gre_tap_setup()
>> after ether_setup(), in order for it to work with the new check
>> in ip6gre_tunnel_init_common().
> 
> This part is error prone.  Please instead add a new boolean argument
> to ip6gre_tunnel_init_common: "bool set_mtu".  Set it to true when
> it is invoked from ip6gre_tap_init() and false when it is invoked
> from ip6gre_tunnel_init().

Hi David,

This way it won't fix the first regression mentioned in the patch for
ip6gretap, i.e. if a user sets a MTU manually on the device creation.

May be it would be less error prone if a MTU is adjusted again in
ip6gre_newlink with "tb[IFLA_MTU]" parameter and ip6gre_tnl_link_config()
is moved after register_netdevice()? I haven't tested the following patch
yet but it looks like it is fixing all the mentioned cases as well:

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 7726959..d12550e 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -337,11 +337,12 @@ static void ip6gre_tunnel_unlink(struct ip6gre_net *ign, 
struct ip6_tnl *t)

nt->dev = dev;
nt->net = dev_net(dev);
-   ip6gre_tnl_link_config(nt, 1);

if (register_netdevice(dev) < 0)
goto failed_free;

+   ip6gre_tnl_link_config(nt, 1);
+
/* Can use a lockless transmit, unless we generate output sequences */
if (!(nt->parms.o_flags & TUNNEL_SEQ))
dev->features |= NETIF_F_LLTX;
@@ -1303,7 +1304,6 @@ static void ip6gre_netlink_parms(struct nlattr *data[],

 static int ip6gre_tap_init(struct net_device *dev)
 {
-   struct ip6_tnl *tunnel;
int ret;

ret = ip6gre_tunnel_init_common(dev);
@@ -1312,10 +1312,6 @@ static int ip6gre_tap_init(struct net_device *dev)

dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;

-   tunnel = netdev_priv(dev);
-
-   ip6gre_tnl_link_config(tunnel, 1);
-
return 0;
 }

@@ -1408,12 +1404,16 @@ static int ip6gre_newlink(struct net *src_net, struct 
net_device *dev,

nt->dev = dev;
nt->net = dev_net(dev);
-   ip6gre_tnl_link_config(nt, !tb[IFLA_MTU]);

err = register_netdevice(dev);
if (err)
goto out;

+   ip6gre_tnl_link_config(nt, !tb[IFLA_MTU]);
+
+   if (tb[IFLA_MTU])
+   dev->mtu = nla_get_u32(tb[IFLA_MTU]);
+
dev_hold(dev);
ip6gre_tunnel_link(ign, nt);


Thanks,
Alexey


[PATCH net] ip6_gre: init dev->mtu and dev->hard_header_len correctly

2018-01-11 Thread Alexey Kodanev
Commit b05229f44228 ("gre6: Cleanup GREv6 transmit path,
call common GRE functions") moved dev->mtu initialization
from ip6gre_tunnel_setup() to ip6gre_tunnel_init(), as a
result, the previously set values, before ndo_init(), are
reset in the following cases:

* rtnl_create_link() can update dev->mtu from IFLA_MTU
  parameter

* ip6gre_tnl_link_config() is invoked before ndo_init() in
  netlink and ioctl setup, so ndo_init() can reset MTU
  adjustments with the lower device MTU as well, dev->mtu
  and dev->hard_header_len.

  Not applicable for ip6gretap because it has one more call
  to ip6gre_tnl_link_config(tunnel, 1) in ip6gre_tap_init().

Since, initially net_device allocated with kvzalloc, make sure
that dev->mtu is zero, i.e. not changed, before setting default
MTU inside ndo_init(), and invoke ip6gre_tnl_link_config after
setting default values.

For ip6gretap, reset dev->mtu to zero in ip6gre_tap_setup()
after ether_setup(), in order for it to work with the new check
in ip6gre_tunnel_init_common().

Fixes: b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE 
functions")
Fixes: db2ec95d1ba4 ("ip6_gre: Fix MTU setting")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv6/ip6_gre.c |   22 ++
 1 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 7726959..edf65d0 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -337,7 +337,6 @@ static void ip6gre_tunnel_unlink(struct ip6gre_net *ign, 
struct ip6_tnl *t)
 
nt->dev = dev;
nt->net = dev_net(dev);
-   ip6gre_tnl_link_config(nt, 1);
 
if (register_netdevice(dev) < 0)
goto failed_free;
@@ -1047,6 +1046,7 @@ static void ip6gre_tnl_init_features(struct net_device 
*dev)
 static int ip6gre_tunnel_init_common(struct net_device *dev)
 {
struct ip6_tnl *tunnel;
+   int set_mtu = !dev->mtu;
int ret;
int t_hlen;
 
@@ -1072,13 +1072,16 @@ static int ip6gre_tunnel_init_common(struct net_device 
*dev)
t_hlen = tunnel->hlen + sizeof(struct ipv6hdr);
 
dev->hard_header_len = LL_MAX_HEADER + t_hlen;
-   dev->mtu = ETH_DATA_LEN - t_hlen;
-   if (dev->type == ARPHRD_ETHER)
-   dev->mtu -= ETH_HLEN;
-   if (!(tunnel->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT))
-   dev->mtu -= 8;
+   if (set_mtu) {
+   dev->mtu = ETH_DATA_LEN - t_hlen;
+   if (dev->type == ARPHRD_ETHER)
+   dev->mtu -= ETH_HLEN;
+   if (!(tunnel->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT))
+   dev->mtu -= 8;
+   }
 
ip6gre_tnl_init_features(dev);
+   ip6gre_tnl_link_config(tunnel, set_mtu);
 
return 0;
 }
@@ -1303,7 +1306,6 @@ static void ip6gre_netlink_parms(struct nlattr *data[],
 
 static int ip6gre_tap_init(struct net_device *dev)
 {
-   struct ip6_tnl *tunnel;
int ret;
 
ret = ip6gre_tunnel_init_common(dev);
@@ -1312,10 +1314,6 @@ static int ip6gre_tap_init(struct net_device *dev)
 
dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
 
-   tunnel = netdev_priv(dev);
-
-   ip6gre_tnl_link_config(tunnel, 1);
-
return 0;
 }
 
@@ -1335,6 +1333,7 @@ static void ip6gre_tap_setup(struct net_device *dev)
 
ether_setup(dev);
 
+   dev->mtu = 0;
dev->max_mtu = 0;
dev->netdev_ops = _tap_netdev_ops;
dev->needs_free_netdev = true;
@@ -1408,7 +1407,6 @@ static int ip6gre_newlink(struct net *src_net, struct 
net_device *dev,
 
nt->dev = dev;
nt->net = dev_net(dev);
-   ip6gre_tnl_link_config(nt, !tb[IFLA_MTU]);
 
err = register_netdevice(dev);
if (err)
-- 
1.7.1



[PATCH net] ip6_gre: fix device features for ioctl setup

2017-12-20 Thread Alexey Kodanev
When ip6gre is created using ioctl, its features, such as
scatter-gather, GSO and tx-checksumming will be turned off:

  # ip -f inet6 tunnel add gre6 mode ip6gre remote fd00::1
  # ethtool -k gre6 (truncated output)
tx-checksumming: off
scatter-gather: off
tcp-segmentation-offload: off
generic-segmentation-offload: off [requested on]

But when netlink is used, they will be enabled:
  # ip link add gre6 type ip6gre remote fd00::1
  # ethtool -k gre6 (truncated output)
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
generic-segmentation-offload: on

This results in a loss of performance when gre6 is created via ioctl.
The issue was found with LTP/gre tests.

Fix it by moving the setup of device features to a separate function
and invoke it with ndo_init callback because both netlink and ioctl
will eventually call it via register_netdevice():

   register_netdevice()
   - ndo_init() callback -> ip6gre_tunnel_init() or ip6gre_tap_init()
   - ip6gre_tunnel_init_common()
- ip6gre_tnl_init_features()

The moved code also contains two minor style fixes:
  * removed needless tab from GRE6_FEATURES on NETIF_F_HIGHDMA line.
  * fixed the issue reported by checkpatch: "Unnecessary parentheses around
'nt->encap.type == TUNNEL_ENCAP_NONE'"

Fixes: ac4eb009e477 ("ip6gre: Add support for basic offloads offloads excluding 
GSO")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv6/ip6_gre.c |   57 +--
 1 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 4cfd8e0..9a4b376 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1014,6 +1014,36 @@ static void ip6gre_tunnel_setup(struct net_device *dev)
eth_random_addr(dev->perm_addr);
 }
 
+#define GRE6_FEATURES (NETIF_F_SG |\
+  NETIF_F_FRAGLIST |   \
+  NETIF_F_HIGHDMA |\
+  NETIF_F_HW_CSUM)
+
+static void ip6gre_tnl_init_features(struct net_device *dev)
+{
+   struct ip6_tnl *nt = netdev_priv(dev);
+
+   dev->features   |= GRE6_FEATURES;
+   dev->hw_features|= GRE6_FEATURES;
+
+   if (!(nt->parms.o_flags & TUNNEL_SEQ)) {
+   /* TCP offload with GRE SEQ is not supported, nor
+* can we support 2 levels of outer headers requiring
+* an update.
+*/
+   if (!(nt->parms.o_flags & TUNNEL_CSUM) ||
+   nt->encap.type == TUNNEL_ENCAP_NONE) {
+   dev->features|= NETIF_F_GSO_SOFTWARE;
+   dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+   }
+
+   /* Can use a lockless transmit, unless we generate
+* output sequences
+*/
+   dev->features |= NETIF_F_LLTX;
+   }
+}
+
 static int ip6gre_tunnel_init_common(struct net_device *dev)
 {
struct ip6_tnl *tunnel;
@@ -1048,6 +1078,8 @@ static int ip6gre_tunnel_init_common(struct net_device 
*dev)
if (!(tunnel->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT))
dev->mtu -= 8;
 
+   ip6gre_tnl_init_features(dev);
+
return 0;
 }
 
@@ -1298,11 +1330,6 @@ static int ip6gre_tap_init(struct net_device *dev)
.ndo_get_iflink = ip6_tnl_get_iflink,
 };
 
-#define GRE6_FEATURES (NETIF_F_SG |\
-  NETIF_F_FRAGLIST |   \
-  NETIF_F_HIGHDMA |\
-  NETIF_F_HW_CSUM)
-
 static void ip6gre_tap_setup(struct net_device *dev)
 {
 
@@ -1382,26 +1409,6 @@ static int ip6gre_newlink(struct net *src_net, struct 
net_device *dev,
nt->net = dev_net(dev);
ip6gre_tnl_link_config(nt, !tb[IFLA_MTU]);
 
-   dev->features   |= GRE6_FEATURES;
-   dev->hw_features|= GRE6_FEATURES;
-
-   if (!(nt->parms.o_flags & TUNNEL_SEQ)) {
-   /* TCP offload with GRE SEQ is not supported, nor
-* can we support 2 levels of outer headers requiring
-* an update.
-*/
-   if (!(nt->parms.o_flags & TUNNEL_CSUM) ||
-   (nt->encap.type == TUNNEL_ENCAP_NONE)) {
-   dev->features|= NETIF_F_GSO_SOFTWARE;
-   dev->hw_features |= NETIF_F_GSO_SOFTWARE;
-   }
-
-   /* Can use a lockless transmit, unless we generate
-* output sequences
-*/
-   dev->features |= NETIF_F_LLTX;
-   }
-
err = register_netdevice(dev);
if (err)
goto out;
-- 
1.7.1



[PATCH net-next v4] ip6_vti: adjust vti mtu according to mtu of lower device

2017-12-19 Thread Alexey Kodanev
LTP/udp6_ipsec_vti tests fail when sending large UDP datagrams over
ip6_vti that require fragmentation and the underlying device has an
MTU smaller than 1500 plus some extra space for headers. This happens
because ip6_vti, by default, sets MTU to ETH_DATA_LEN and not updating
it depending on a destination address or link parameter. Further
attempts to send UDP packets may succeed because pmtu gets updated on
ICMPV6_PKT_TOOBIG in vti6_err().

In case the lower device has larger MTU size, e.g. 9000, ip6_vti works
but not using the possible maximum size, output packets have 1500 limit.

The above cases require manual MTU setup after ip6_vti creation. However
ip_vti already updates MTU based on lower device with ip_tunnel_bind_dev().

Here is the example when the lower device MTU is set to 9000:

  # ip a sh ltp_ns_veth2
  ltp_ns_veth2@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 ...
inet 10.0.0.2/24 scope global ltp_ns_veth2
inet6 fd00::2/64 scope global

  # ip li add vti6 type vti6 local fd00::2 remote fd00::1
  # ip li show vti6
  vti6@NONE: <POINTOPOINT,NOARP> mtu 1500 ...
link/tunnel6 fd00::2 peer fd00::1

After the patch:
  # ip li add vti6 type vti6 local fd00::2 remote fd00::1
  # ip li show vti6
  vti6@NONE: <POINTOPOINT,NOARP> mtu 8832 ...
link/tunnel6 fd00::2 peer fd00::1

Reported-by: Petr Vorel <pvo...@suse.cz>
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
v4: * remove an empty line between variable declarations
* update the commit message to reflect unexpected behavior with an MTU
  larger than 1500.

v3: * fix style issue with curly braces around single-statement if block

v2: * cleanup commit message issues (thanks to Shannon)
* handle the case when we don't have route but have device parameter
* cast new MTU to int and then check the maximum (tdev->mtu can be
  less than dev->hard_header_len)

 net/ipv6/ip6_vti.c |   20 
 1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index dbb74f3..18caa95 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -626,6 +626,7 @@ static void vti6_link_config(struct ip6_tnl *t)
 {
struct net_device *dev = t->dev;
struct __ip6_tnl_parm *p = >parms;
+   struct net_device *tdev = NULL;
 
memcpy(dev->dev_addr, >laddr, sizeof(struct in6_addr));
memcpy(dev->broadcast, >raddr, sizeof(struct in6_addr));
@@ -638,6 +639,25 @@ static void vti6_link_config(struct ip6_tnl *t)
dev->flags |= IFF_POINTOPOINT;
else
dev->flags &= ~IFF_POINTOPOINT;
+
+   if (p->flags & IP6_TNL_F_CAP_XMIT) {
+   int strict = (ipv6_addr_type(>raddr) &
+ (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL));
+   struct rt6_info *rt = rt6_lookup(t->net,
+>raddr, >laddr,
+p->link, strict);
+
+   if (rt)
+   tdev = rt->dst.dev;
+   ip6_rt_put(rt);
+   }
+
+   if (!tdev && p->link)
+   tdev = __dev_get_by_index(t->net, p->link);
+
+   if (tdev)
+   dev->mtu = max_t(int, tdev->mtu - dev->hard_header_len,
+IPV6_MIN_MTU);
 }
 
 /**
-- 
1.7.1



[PATCH v2] vxlan: restore dev->mtu setting based on lower device

2017-12-14 Thread Alexey Kodanev
Stefano Brivio says:
Commit a985343ba906 ("vxlan: refactor verification and
application of configuration") introduced a change in the
behaviour of initial MTU setting: earlier, the MTU for a link
created on top of a given lower device, without an initial MTU
specification, was set to the MTU of the lower device minus
headroom as a result of this path in vxlan_dev_configure():

if (!conf->mtu)
dev->mtu = lowerdev->mtu -
   (use_ipv6 ? VXLAN6_HEADROOM : VXLAN_HEADROOM);

which is now gone. Now, the initial MTU, in absence of a
configured value, is simply set by ether_setup() to ETH_DATA_LEN
(1500 bytes).

This breaks userspace expectations in case the MTU of
the lower device is higher than 1500 bytes minus headroom.

This patch restores the previous behaviour on newlink operation. Since
max_mtu can be negative and we update dev->mtu directly, also check it
for valid minimum.

Reported-by: Junhan Yan <ju...@redhat.com>
Fixes: a985343ba906 ("vxlan: refactor verification and application of 
configuration")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---

v2: Based on initial patch from Stefano Brivio:
"vxlan: Restore initial MTU setting based on lower device"

In this version, 'dev->mtu' is set from 'max_mtu' that already has the
the needed initial MTU value based on the lower device. Plus, it is adding
extra check for valid minimum of 'max_mtu'.

 drivers/net/vxlan.c |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 19b9cc5..1000b0e 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -3103,6 +3103,11 @@ static void vxlan_config_apply(struct net_device *dev,
 
max_mtu = lowerdev->mtu - (use_ipv6 ? VXLAN6_HEADROOM :
   VXLAN_HEADROOM);
+   if (max_mtu < ETH_MIN_MTU)
+   max_mtu = ETH_MIN_MTU;
+
+   if (!changelink && !conf->mtu)
+   dev->mtu = max_mtu;
}
 
if (dev->mtu > max_mtu)
-- 
1.7.1



Re: [PATCH net] vxlan: Restore initial MTU setting based on lower device

2017-12-14 Thread Alexey Kodanev
On 12/14/2017 03:36 PM, Stefano Brivio wrote:
> On Thu, 14 Dec 2017 14:23:36 +0300
> Alexey Kodanev <alexey.koda...@oracle.com> wrote:
> 
>> On 12/14/2017 03:31 AM, Stefano Brivio wrote:
...
>>
>> if we move it up in "if (lowerdev) { ..." branch we will be checking the 
>> presence
>> of "lowerdev" and also not calculating it again. Also I would check max_mtu 
>> for
>> minimum as it might happen to be negative, though unlikely corner case...
> 
> Indeed it might happen to be negative (only for IPv6, down to -2), good
> catch.
> 
> For the benefit of others: it took me a few minutes to see how this is
> *not* unrelated, because we are introducing a direct assignment of
> dev->mtu to set max_mtu, whereas earlier it was just used in
> comparisons, so it didn't matter whether it was negative.
> 
>> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
>> index 19b9cc5..1000b0e 100644
>> --- a/drivers/net/vxlan.c
>> +++ b/drivers/net/vxlan.c
>> @@ -3103,6 +3103,11 @@ static void vxlan_config_apply(struct net_device *dev,
>>
>> max_mtu = lowerdev->mtu - (use_ipv6 ? VXLAN6_HEADROOM :
>>VXLAN_HEADROOM);
>> +   if (max_mtu < ETH_MIN_MTU)
>> +   max_mtu = ETH_MIN_MTU;
>> +
>> +   if (!changelink && !conf->mtu)
>> +   dev->mtu = max_mtu;
> 
> I don't really have a strong preference here. On one hand, you're
> hiding this a bit from the "device creation" path. On the other hand,
> it's a bit more compact. So I'm also fine with this.
> 
> Can you perhaps submit a formal patch?
> 

OK, I'll send a patch.

Thanks,
Alexey


[PATCH net-next v3] ip6_vti: adjust vti mtu according to mtu of output device

2017-12-14 Thread Alexey Kodanev
LTP/udp6_ipsec_vti tests fail when sending large UDP datagrams that
require fragmentation and the underlying device has MTU <= 1500. This
happens because ip6_vti sets mtu to ETH_DATA_LEN and not updating it
depending on a destination address or link parameter.

Further attempts to send UDP packets may succeed because pmtu gets
updated on ICMPV6_PKT_TOOBIG in vti6_err().

Here is the example when the output device MTU is set to 9000:

  # ip a sh ltp_ns_veth2
  ltp_ns_veth2@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 ...
inet 10.0.0.2/24 scope global ltp_ns_veth2
inet6 fd00::2/64 scope global

  # ip li add vti6 type vti6 local fd00::2 remote fd00::1
  # ip li show vti6
  vti6@NONE: <POINTOPOINT,NOARP> mtu 1500 ...
link/tunnel6 fd00::2 peer fd00::1

After the patch:
  # ip li add vti6 type vti6 local fd00::2 remote fd00::1
  # ip li show vti6
  vti6@NONE: <POINTOPOINT,NOARP> mtu 8832 ...
link/tunnel6 fd00::2 peer fd00::1

Regarding ip_vti, it already tunes MTU with ip_tunnel_bind_dev().

Reported-by: Petr Vorel <pvo...@suse.cz>
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
Acked-by: Shannon Nelson <shannon.nel...@oracle.com>
---
v3: * fix style issue with curly braces around single-statement if block

v2: * cleanup commit message issues (thanks to Shannon)

* handle the case when we don't have route but have device parameter

* cast new MTU to int and then check the maximum (tdev->mtu can be
  less than dev->hard_header_len)

 net/ipv6/ip6_vti.c |   21 +
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index dbb74f3..5404443 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -626,6 +626,7 @@ static void vti6_link_config(struct ip6_tnl *t)
 {
struct net_device *dev = t->dev;
struct __ip6_tnl_parm *p = >parms;
+   struct net_device *tdev = NULL;
 
memcpy(dev->dev_addr, >laddr, sizeof(struct in6_addr));
memcpy(dev->broadcast, >raddr, sizeof(struct in6_addr));
@@ -638,6 +639,26 @@ static void vti6_link_config(struct ip6_tnl *t)
dev->flags |= IFF_POINTOPOINT;
else
dev->flags &= ~IFF_POINTOPOINT;
+
+   if (p->flags & IP6_TNL_F_CAP_XMIT) {
+   int strict = (ipv6_addr_type(>raddr) &
+ (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL));
+
+   struct rt6_info *rt = rt6_lookup(t->net,
+>raddr, >laddr,
+p->link, strict);
+
+   if (rt)
+   tdev = rt->dst.dev;
+   ip6_rt_put(rt);
+   }
+
+   if (!tdev && p->link)
+   tdev = __dev_get_by_index(t->net, p->link);
+
+   if (tdev)
+   dev->mtu = max_t(int, tdev->mtu - dev->hard_header_len,
+IPV6_MIN_MTU);
 }
 
 /**
-- 
1.7.1



Re: [PATCH net] vxlan: Restore initial MTU setting based on lower device

2017-12-14 Thread Alexey Kodanev
On 12/14/2017 03:31 AM, Stefano Brivio wrote:
> On Thu, 14 Dec 2017 01:25:40 +0100
> Matthias Schiffer  wrote:
> 
>> On 12/14/2017 01:10 AM, Stefano Brivio wrote:
>>> On Thu, 14 Dec 2017 00:57:32 +0100
>>> Matthias Schiffer  wrote:
>>>   
 As you note, there is another occurrence of this calculation in
 vxlan_config_apply():


 [...]
 if (lowerdev) {
 [...]
 max_mtu = lowerdev->mtu - (use_ipv6 ? VXLAN6_HEADROOM :
VXLAN_HEADROOM);
 }

 if (dev->mtu > max_mtu)
 dev->mtu = max_mtu;
 [...]


 Unless I'm overlooking something, this should already do the same thing and
 your patch is redundant.  
>>>
>>> The code above sets max_mtu, and only if dev->mtu exceeds that, the
>>> latter is then clamped.
>>>
>>> What my patch does is to actually set dev->mtu to that value, no matter
>>> what's the previous value set by ether_setup() (only on creation, and
>>> only if lowerdev is there), just like the previous behaviour used to be.
>>>
>>> Let's consider these two cases, on the existing code:
>>>
>>> 1. lowerdev->mtu is 1500:
>>>- ether_setup(), called by vxlan_setup(), sets dev->mtu to 1500
>>>- here max_mtu is 1450
>>>- we enter the second if clause above (dev->mtu > max_mtu)
>>>- at the end of vxlan_config_apply(), dev->mtu will be 1450
>>>
>>> which is consistent with the previous behaviour.
>>>
>>> 2. lowerdev->mtu is 9000:
>>>- ether_setup(), called by vxlan_setup(), sets dev->mtu to 1500
>>>- here max_mtu is 8950
>>>- we do not enter the second if clause above (dev->mtu < max_mtu)
>>>- at the end of vxlan_config_apply(), dev->mtu will still be 1500
>>>
>>> which is not consistent with the previous behaviour, where it used to
>>> be 8950 instead.  
>>
>> Ah, thank you for the explanation, I was missing the context that this was
>> about higher rather than lower MTUs.
>>
>> Personally, I would prefer a change like the following, as it does not
>> introduce another duplication of the MTU calculation (not tested at all):
>>
>>> --- a/drivers/net/vxlan.c
>>> +++ b/drivers/net/vxlan.c
>>> @@ -3105,7 +3105,7 @@ static void vxlan_config_apply(struct net_device *dev,
>>>VXLAN_HEADROOM);
>>> }
>>>  
>>> -   if (dev->mtu > max_mtu)
>>> +   if (dev->mtu > max_mtu || (!changelink && !conf->mtu))
>>> dev->mtu = max_mtu;
> 
> You would also need to check that lowerdev is present, though.
> 


if we move it up in "if (lowerdev) { ..." branch we will be checking the 
presence
of "lowerdev" and also not calculating it again. Also I would check max_mtu for
minimum as it might happen to be negative, though unlikely corner case...


diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 19b9cc5..1000b0e 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -3103,6 +3103,11 @@ static void vxlan_config_apply(struct net_device *dev,

max_mtu = lowerdev->mtu - (use_ipv6 ? VXLAN6_HEADROOM :
   VXLAN_HEADROOM);
+   if (max_mtu < ETH_MIN_MTU)
+   max_mtu = ETH_MIN_MTU;
+
+   if (!changelink && !conf->mtu)
+   dev->mtu = max_mtu;
}

if (dev->mtu > max_mtu)


Thanks,
Alexey


> Otherwise, you're changing the behaviour again, that is, if lowerdev is
> not present, we want to keep 1500 and not set ETH_MAX_MTU (65535).
> 
> Sure you can change the if condition to reflect that, but IMHO it
> becomes quite awkward.
> 



[PATCH net-next v2] ip6_vti: adjust vti mtu according to mtu of output device

2017-12-12 Thread Alexey Kodanev
LTP/udp6_ipsec_vti tests fail when sending large UDP datagrams that
require fragmentation and the underlying device has MTU <= 1500. This
happens because ip6_vti sets mtu to ETH_DATA_LEN and not updating it
depending on a destination address or link parameter.

Further attempts to send UDP packets may succeed because pmtu gets
updated on ICMPV6_PKT_TOOBIG in vti6_err().

Here is the example when the output device MTU is set to 9000:

  # ip a sh ltp_ns_veth2
  ltp_ns_veth2@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 ...
inet 10.0.0.2/24 scope global ltp_ns_veth2
inet6 fd00::2/64 scope global

  # ip li add vti6 type vti6 local fd00::2 remote fd00::1
  # ip li show vti6
  vti6@NONE: <POINTOPOINT,NOARP> mtu 1500 ...
link/tunnel6 fd00::2 peer fd00::1

After the patch:
  # ip li add vti6 type vti6 local fd00::2 remote fd00::1
  # ip li show vti6
  vti6@NONE: <POINTOPOINT,NOARP> mtu 8832 ...
link/tunnel6 fd00::2 peer fd00::1

Regarding ip_vti, it already tunes MTU with ip_tunnel_bind_dev().

Reported-by: Petr Vorel <pvo...@suse.cz>
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
v2: * cleanup commit message issues (thanks to Shannon)

* handle the case when we don't have route but have device parameter

* cast new MTU to int and then check the maximum (tdev->mtu can be
  less than dev->hard_header_len)

When changing the tunnel parameters, MTU can be updated as well... should
we also check that parms 'link', 'laddr' or 'raddr' were actually changed
in vti6_tnl_change() and/or IFLA_MTU wasn't set?

 net/ipv6/ip6_vti.c |   22 ++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index dbb74f3..d4624c2 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -626,6 +626,7 @@ static void vti6_link_config(struct ip6_tnl *t)
 {
struct net_device *dev = t->dev;
struct __ip6_tnl_parm *p = >parms;
+   struct net_device *tdev = NULL;
 
memcpy(dev->dev_addr, >laddr, sizeof(struct in6_addr));
memcpy(dev->broadcast, >raddr, sizeof(struct in6_addr));
@@ -638,6 +639,27 @@ static void vti6_link_config(struct ip6_tnl *t)
dev->flags |= IFF_POINTOPOINT;
else
dev->flags &= ~IFF_POINTOPOINT;
+
+   if (p->flags & IP6_TNL_F_CAP_XMIT) {
+   int strict = (ipv6_addr_type(>raddr) &
+ (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL));
+
+   struct rt6_info *rt = rt6_lookup(t->net,
+>raddr, >laddr,
+p->link, strict);
+
+   if (rt)
+   tdev = rt->dst.dev;
+   ip6_rt_put(rt);
+   }
+
+   if (!tdev && p->link)
+   tdev = __dev_get_by_index(t->net, p->link);
+
+   if (tdev) {
+   dev->mtu = max_t(int, tdev->mtu - dev->hard_header_len,
+IPV6_MIN_MTU);
+   }
 }
 
 /**
-- 
1.7.1



[PATCH net-next] ip6_vti: adjust vti mtu according to mtu of output device

2017-12-06 Thread Alexey Kodanev
LTP/udp6_ipsec_vti tests fail when sending large UDP datagrams
that require fragmentation and underlying device MTU <= 1500.
This happens because ip6_vti sets mtu to ETH_DATA_LEN and not
updating it depending on a destiantion address.

Futhure attempts to send UDP packets may succeed because pmtu
get updated on ICMPV6_PKT_TOOBIG in vti6_err().

Here is the example when output device MTU set to 9000:

  # ip a sh ltp_ns_veth2
ltp_ns_veth2@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 ...
  inet 10.0.0.2/24 scope global ltp_ns_veth2
  inet6 fd00::2/64 scope global
  ...
  # ip li add vti6 type vti6 local fd00::2 remote fd00::1
  # ip li show vti6
vti6@NONE: <POINTOPOINT,NOARP> mtu 1500 ...
  link/tunnel6 fd00::2 peer fd00::1

After the patch:

  # ip li add vti6 type vti6 local fd00::2 remote fd00::1
  # ip li show vti6
vti6@NONE: <POINTOPOINT,NOARP> mtu 8832 ...
  link/tunnel6 fd00::2 peer fd00::1

Regarding ip_vti, it already tunes mtu with ip_tunnel_bind_dev():

  # ip li add vti4 type vti local 10.0.0.2 remote 10.0.0.1
  # ip li sh vti4
vti4@NONE: <POINTOPOINT,NOARP> mtu 8832 ...
  link/ipip 10.0.0.2 peer 10.0.0.1

Reported-by: Petr Vorel <pvo...@suse.cz>
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---

ip6_vti mtu offset is the same (168) as in ip_vti because ip_vti
offset includes two sizes of struct iphdr: in dev->hard_header_len
and in t_hlen in ip_tunnel_bind_dev(). I'm not sure if it's correct.

 net/ipv6/ip6_vti.c |   18 ++
 1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index dbb74f3..47e6464 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -638,6 +638,24 @@ static void vti6_link_config(struct ip6_tnl *t)
dev->flags |= IFF_POINTOPOINT;
else
dev->flags &= ~IFF_POINTOPOINT;
+
+   if (p->flags & IP6_TNL_F_CAP_XMIT) {
+   int strict = (ipv6_addr_type(>raddr) &
+ (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL));
+
+   struct rt6_info *rt = rt6_lookup(t->net,
+>raddr, >laddr,
+p->link, strict);
+
+   if (!rt)
+   return;
+
+   if (rt->dst.dev) {
+   dev->mtu = max(rt->dst.dev->mtu - dev->hard_header_len,
+  IPV6_MIN_MTU);
+   }
+   ip6_rt_put(rt);
+   }
 }
 
 /**
-- 
1.7.1



Re: ipsec: ipcomp alg problem on vti interface

2017-11-27 Thread Alexey Kodanev
On 11/27/2017 03:07 PM, Steffen Klassert wrote:
> On Wed, Nov 22, 2017 at 07:06:13PM +0300, Alexey Kodanev wrote:
>> Hi Steffen,
>>
>> LTP has vti test-cases which fail on ipcomp alg, e.g.
>> "tcp_ipsec_vti.sh -p comp -m tunnel -s 100"
>>
>> Basically, the setupconsists of the following commands:
>>
>> ip li add ltp_vti0 type vti local 10.0.0.2 remote 10.0.0.1 key 10 dev 
>> ltp_ns_veth2
>> ip li set ltp_vti0 up
>> ip -4 xf st add src 10.0.0.1 dst 10.0.0.2 proto comp spi 0x1001 comp deflate 
>> mode tunnel
>> ip -4 xf po add dir out tmpl src 10.0.0.2 dst 10.0.0.1 proto comp mode 
>> tunnel mark 10
>> ip -4 xf po add dir in tmpl src 10.0.0.1 dst 10.0.0.2 proto comp mode tunnel 
>> mark 10
>> ip route add 10.23.1.0/30 dev ltp_vti0
>> ip a add 10.23.1.1/30 dev ltp_vti0
>>
>> ...omitted corresponded setup in netns for the other end.
>>
>> The problem appears with the small packets like SYN which are not
>> compressed and sent as is through vti tunnel and appear on ltp_ns_veth2
>> as IPIP packets. On the other end, vti doesn't handle them and theyare
>> rejected (InNoPol stats increased).
>>
>> As a workaround, setting:
>> # sysctl net.ipv4.conf.ltp_ns_veth2.disable_policy=1
>> # sysctl net.ipv4.conf.ltp_ns_veth1.disable_policy=1
>>
>> works, but compressed packets seen on vti device, the other on ltp_ns_veth2.
>>
>> Is there some flaw in setup or vti not designed to handle ipcomp alg that
>> can send packets with/without compression (or without further encryption)?
> VTI is not designed to handle packets without IPsec header, so yes
> this does not work well with ipcomp that might omit the compression
> header.

if so, is it reasonable to keep ipcomp handling in VTI?

Thanks,
Alexey

>
>> May be we should handle such packets by registering additional tunnel
>> handler onvti, like in the diff below?
>>  
>> diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
>> index 89453cf..99ad70b 100644
>> --- a/net/ipv4/ip_vti.c
>> +++ b/net/ipv4/ip_vti.c
>> @@ -44,11 +44,20 @@
>>  #include 
>>  #include 
>>
>> +static bool log_ecn_error = true;
>> +module_param(log_ecn_error, bool, 0644);
>> +MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
>> +
>>  static struct rtnl_link_ops vti_link_ops __read_mostly;
>>
>>  static unsigned int vti_net_id __read_mostly;
>>  static int vti_tunnel_init(struct net_device *dev);
>>
>> +static const struct tnl_ptk_info tpi = {
>> +   /* no tunnel info required for ipip. */
>> +   .proto = htons(ETH_P_IP),
>> +};
>> +
>>  static int vti_input(struct sk_buff *skb, int nexthdr, __be32 spi,
>>  int encap_type)
>>  {
>> @@ -65,6 +74,13 @@ static int vti_input(struct sk_buff *skb, int nexthdr, 
>> __be32 spi,
>>
>> XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip4 = tunnel;
>>
>> +   if (ip_hdr(skb)->protocol == IPPROTO_IPIP) {
>> +   if (iptunnel_pull_header(skb, 0, tpi.proto, false))
>> +   goto drop;
>> +   return ip_tunnel_rcv(tunnel, skb, , NULL,
>> +log_ecn_error);
>> +   }
>> +
> As it is, we can rely that packets received by a vti interface came
> through a secure channel. This would not be the case anymore with
> your change and I'd not like to weaken this for a corener case
> like ipcomp in tunnel mode with vti.



ipsec: ipcomp alg problem on vti interface

2017-11-22 Thread Alexey Kodanev
Hi Steffen,

LTP has vti test-cases which fail on ipcomp alg, e.g.
"tcp_ipsec_vti.sh -p comp -m tunnel -s 100"

Basically, the setupconsists of the following commands:

ip li add ltp_vti0 type vti local 10.0.0.2 remote 10.0.0.1 key 10 dev 
ltp_ns_veth2
ip li set ltp_vti0 up
ip -4 xf st add src 10.0.0.1 dst 10.0.0.2 proto comp spi 0x1001 comp deflate 
mode tunnel
ip -4 xf po add dir out tmpl src 10.0.0.2 dst 10.0.0.1 proto comp mode tunnel 
mark 10
ip -4 xf po add dir in tmpl src 10.0.0.1 dst 10.0.0.2 proto comp mode tunnel 
mark 10
ip route add 10.23.1.0/30 dev ltp_vti0
ip a add 10.23.1.1/30 dev ltp_vti0

...omitted corresponded setup in netns for the other end.

The problem appears with the small packets like SYN which are not
compressed and sent as is through vti tunnel and appear on ltp_ns_veth2
as IPIP packets. On the other end, vti doesn't handle them and theyare
rejected (InNoPol stats increased).

As a workaround, setting:
# sysctl net.ipv4.conf.ltp_ns_veth2.disable_policy=1
# sysctl net.ipv4.conf.ltp_ns_veth1.disable_policy=1

works, but compressed packets seen on vti device, the other on ltp_ns_veth2.

Is there some flaw in setup or vti not designed to handle ipcomp alg that
can send packets with/without compression (or without further encryption)?

May be we should handle such packets by registering additional tunnel
handler onvti, like in the diff below?
 
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 89453cf..99ad70b 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -44,11 +44,20 @@
 #include 
 #include 

+static bool log_ecn_error = true;
+module_param(log_ecn_error, bool, 0644);
+MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
+
 static struct rtnl_link_ops vti_link_ops __read_mostly;

 static unsigned int vti_net_id __read_mostly;
 static int vti_tunnel_init(struct net_device *dev);

+static const struct tnl_ptk_info tpi = {
+   /* no tunnel info required for ipip. */
+   .proto = htons(ETH_P_IP),
+};
+
 static int vti_input(struct sk_buff *skb, int nexthdr, __be32 spi,
 int encap_type)
 {
@@ -65,6 +74,13 @@ static int vti_input(struct sk_buff *skb, int nexthdr, 
__be32 spi,

XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip4 = tunnel;

+   if (ip_hdr(skb)->protocol == IPPROTO_IPIP) {
+   if (iptunnel_pull_header(skb, 0, tpi.proto, false))
+   goto drop;
+   return ip_tunnel_rcv(tunnel, skb, , NULL,
+log_ecn_error);
+   }
+
return xfrm_input(skb, nexthdr, spi, encap_type);
}

@@ -335,6 +351,11 @@ static int vti4_err(struct sk_buff *skb, u32 info)
return 0;
 }

+static int vti_ip_err(struct sk_buff *skb, u32 info)
+{
+   return -ENOENT;
+}
+
 static int
 vti_tunnel_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 {
@@ -440,6 +461,12 @@ static void __net_init vti_fb_tunnel_init(struct 
net_device *dev)
.priority   =   100,
 };

+static struct xfrm_tunnel vti_ip4_tunnnel __read_mostly = {
+   .handler=   vti_rcv,
+   .err_handler=   vti_ip_err,
+   .priority   =   1,
+};
+
 static int __net_init vti_init_net(struct net *net)
 {
int err;
@@ -607,6 +634,9 @@ static int __init vti_init(void)
err = xfrm4_protocol_register(_ipcomp4_protocol, IPPROTO_COMP);
if (err < 0)
goto xfrm_proto_comp_failed;
+   err = xfrm4_tunnel_register(_ip4_tunnnel, AF_INET);
+   if (err < 0)
+   goto xfrm_tunnel_failed;

msg = "netlink interface";
err = rtnl_link_register(_link_ops);
@@ -616,6 +646,8 @@ static int __init vti_init(void)
return err;

 rtnl_link_failed:
+   xfrm4_tunnel_deregister(_ip4_tunnnel, AF_INET);
+xfrm_tunnel_failed:
xfrm4_protocol_deregister(_ipcomp4_protocol, IPPROTO_COMP);
 xfrm_proto_comp_failed:
xfrm4_protocol_deregister(_ah4_protocol, IPPROTO_AH);
@@ -631,6 +663,7 @@ static int __init vti_init(void)
 static void __exit vti_fini(void)
 {
rtnl_link_unregister(_link_ops);
+   xfrm4_tunnel_deregister(_ip4_tunnnel, AF_INET);
xfrm4_protocol_deregister(_ipcomp4_protocol, IPPROTO_COMP);
xfrm4_protocol_deregister(_ah4_protocol, IPPROTO_AH);
xfrm4_protocol_deregister(_esp4_protocol, IPPROTO_ESP);

Thanks,
Alexey



Re: [PATCH 2/2] ip6_tunnel: pass tun_dst arg from ip6_tnl_rcv() to __ip6_tnl_rcv()

2017-11-20 Thread Alexey Kodanev
On 11/19/2017 06:22 AM, David Miller wrote:
> From: Alexey Kodanev <alexey.koda...@oracle.com>
> Date: Fri, 17 Nov 2017 19:16:18 +0300
>
>> Otherwise tun_dst argument is unused there. Currently, ip6_tnl_rcv()
>> invoked with tun_dst set to NULL, so there is no actual functional
>> changes introduced in this patch.
> Oh yes there is a functional change, becaue __ip6_tnl_rcv() is also
> used by ipxip6_rcv() which can pass a non-NULL tnl_dst.

The patch is not changing __ip6_tnl_rcv(), only ip6_tnl_rcv() wrapper.

Thanks,
Alexey



[PATCH 1/2] gre6: use log_ecn_error module parameter in ip6_tnl_rcv()

2017-11-17 Thread Alexey Kodanev
After commit 308edfdf1563 ("gre6: Cleanup GREv6 receive path, call
common GRE functions") it's not used anywhere in the module, but
previously was used in ip6gre_rcv().

Fixes: 308edfdf1563 ("gre6: Cleanup GREv6 receive path, call common GRE 
functions")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv6/ip6_gre.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 59c121b..5d6bee0 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -461,7 +461,7 @@ static int ip6gre_rcv(struct sk_buff *skb, const struct 
tnl_ptk_info *tpi)
  >saddr, >daddr, tpi->key,
  tpi->proto);
if (tunnel) {
-   ip6_tnl_rcv(tunnel, skb, tpi, NULL, false);
+   ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error);
 
return PACKET_RCVD;
}
-- 
1.8.3.1



[PATCH 2/2] ip6_tunnel: pass tun_dst arg from ip6_tnl_rcv() to __ip6_tnl_rcv()

2017-11-17 Thread Alexey Kodanev
Otherwise tun_dst argument is unused there. Currently, ip6_tnl_rcv()
invoked with tun_dst set to NULL, so there is no actual functional
changes introduced in this patch.

Fixes: 0d3c703a9d17 ("ipv6: Cleanup IPv6 tunnel receive path")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv6/ip6_tunnel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index a1c2444..bc050e8 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -869,7 +869,7 @@ int ip6_tnl_rcv(struct ip6_tnl *t, struct sk_buff *skb,
struct metadata_dst *tun_dst,
bool log_ecn_err)
 {
-   return __ip6_tnl_rcv(t, skb, tpi, NULL, ip6ip6_dscp_ecn_decapsulate,
+   return __ip6_tnl_rcv(t, skb, tpi, tun_dst, ip6ip6_dscp_ecn_decapsulate,
 log_ecn_err);
 }
 EXPORT_SYMBOL(ip6_tnl_rcv);
-- 
1.8.3.1



[PATCH v2] gso: fix payload length when gso_size is zero

2017-10-06 Thread Alexey Kodanev
When gso_size reset to zero for the tail segment in skb_segment(), later
in ipv6_gso_segment(), __skb_udp_tunnel_segment() and gre_gso_segment()
we will get incorrect results (payload length, pcsum) for that segment.
inet_gso_segment() already has a check for gso_size before calculating
payload.

The issue was found with LTP vxlan & gre tests over ixgbe NIC.

Fixes: 07b26c9454a2 ("gso: Support partial splitting at the frag_list pointer")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
v2: also added skb_is_gso to gre_gso_segment() and __skb_udp_tunnel_segment()

 net/ipv4/gre_offload.c | 2 +-
 net/ipv4/udp_offload.c | 2 +-
 net/ipv6/ip6_offload.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index d5cac99..8c72034 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -98,7 +98,7 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
greh = (struct gre_base_hdr *)skb_transport_header(skb);
pcsum = (__sum16 *)(greh + 1);
 
-   if (gso_partial) {
+   if (gso_partial && skb_is_gso(skb)) {
unsigned int partial_adj;
 
/* Adjust checksum to account for the fact that
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 0932c85..6401574 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -122,7 +122,7 @@ static struct sk_buff *__skb_udp_tunnel_segment(struct 
sk_buff *skb,
 * will be using a length value equal to only one MSS sized
 * segment instead of the entire frame.
 */
-   if (gso_partial) {
+   if (gso_partial && skb_is_gso(skb)) {
uh->len = htons(skb_shinfo(skb)->gso_size +
SKB_GSO_CB(skb)->data_offset +
skb->head - (unsigned char *)uh);
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index cdb3728..4a87f94 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -105,7 +105,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 
for (skb = segs; skb; skb = skb->next) {
ipv6h = (struct ipv6hdr *)(skb_mac_header(skb) + nhoff);
-   if (gso_partial)
+   if (gso_partial && skb_is_gso(skb))
payload_len = skb_shinfo(skb)->gso_size +
  SKB_GSO_CB(skb)->data_offset +
  skb->head - (unsigned char *)(ipv6h + 1);
-- 
1.8.3.1



Re: [PATCH] ipv6: gso: fix payload length when gso_size is zero

2017-10-06 Thread Alexey Kodanev
On 10/05/2017 09:58 PM, Duyck, Alexander H wrote:
> On Thu, 2017-10-05 at 20:06 +0300, Alexey Kodanev wrote:
>> When gso_size reset to zero for the tail segment in skb_segment(), later
>> in ipv6_gso_segment(), we will get incorrect payload_len for that segment.
>> inet_gso_segment() already has a check for gso_size before calculating
>> payload so fixing only IPv6 part.
>>
>> The issue was found with LTP vxlan & gre tests over ixgbe NIC.
>>
>> Fixes: 07b26c9454a2 ("gso: Support partial splitting at the frag_list 
>> pointer")
>> Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
>> ---
>>  net/ipv6/ip6_offload.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
>> index cdb3728..4a87f94 100644
>> --- a/net/ipv6/ip6_offload.c
>> +++ b/net/ipv6/ip6_offload.c
>> @@ -105,7 +105,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff 
>> *skb,
>>  
>>  for (skb = segs; skb; skb = skb->next) {
>>  ipv6h = (struct ipv6hdr *)(skb_mac_header(skb) + nhoff);
>> -if (gso_partial)
>> +if (gso_partial && skb_is_gso(skb))
>>  payload_len = skb_shinfo(skb)->gso_size +
>>SKB_GSO_CB(skb)->data_offset +
>>skb->head - (unsigned char *)(ipv6h + 1);
> So looking over this change it looks good to me. I'm just wondering if
> you have looked at the code in __skb_udp_tunnel_segment or
> gre_gso_segment? It seems like if you needed this change here you
> should need to make similar changes to those functions as well. I'm
> wondering if we just aren't seeing issues due to the segments already
> being MSS sized before being handed to us for segmentation.

Right, it can happen in __skb_udp_tunnel_segment as well. I wasn't able
to reproduce with gre, it looks like it doesn't go to that part of code,
skipping it, possibly, on gso_type & SKB_GSO_GRE_CSUM check, though the
NIC has tx-gre-csum-segmentation enabled...

I'll send new version after testing completed.

Thanks,
Alexey



[PATCH] ipv6: gso: fix payload length when gso_size is zero

2017-10-05 Thread Alexey Kodanev
When gso_size reset to zero for the tail segment in skb_segment(), later
in ipv6_gso_segment(), we will get incorrect payload_len for that segment.
inet_gso_segment() already has a check for gso_size before calculating
payload so fixing only IPv6 part.

The issue was found with LTP vxlan & gre tests over ixgbe NIC.

Fixes: 07b26c9454a2 ("gso: Support partial splitting at the frag_list pointer")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv6/ip6_offload.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index cdb3728..4a87f94 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -105,7 +105,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 
for (skb = segs; skb; skb = skb->next) {
ipv6h = (struct ipv6hdr *)(skb_mac_header(skb) + nhoff);
-   if (gso_partial)
+   if (gso_partial && skb_is_gso(skb))
payload_len = skb_shinfo(skb)->gso_size +
  SKB_GSO_CB(skb)->data_offset +
  skb->head - (unsigned char *)(ipv6h + 1);
-- 
1.8.3.1



[PATCH] vti: fix use after free in vti_tunnel_xmit/vti6_tnl_xmit

2017-09-26 Thread Alexey Kodanev
When running LTP IPsec tests, KASan might report:

BUG: KASAN: use-after-free in vti_tunnel_xmit+0xeee/0xff0 [ip_vti]
Read of size 4 at addr 880dc6ad1980 by task swapper/0/0
...
Call Trace:
  
  dump_stack+0x63/0x89
  print_address_description+0x7c/0x290
  kasan_report+0x28d/0x370
  ? vti_tunnel_xmit+0xeee/0xff0 [ip_vti]
  __asan_report_load4_noabort+0x19/0x20
  vti_tunnel_xmit+0xeee/0xff0 [ip_vti]
  ? vti_init_net+0x190/0x190 [ip_vti]
  ? save_stack_trace+0x1b/0x20
  ? save_stack+0x46/0xd0
  dev_hard_start_xmit+0x147/0x510
  ? icmp_echo.part.24+0x1f0/0x210
  __dev_queue_xmit+0x1394/0x1c60
...
Freed by task 0:
  save_stack_trace+0x1b/0x20
  save_stack+0x46/0xd0
  kasan_slab_free+0x70/0xc0
  kmem_cache_free+0x81/0x1e0
  kfree_skbmem+0xb1/0xe0
  kfree_skb+0x75/0x170
  kfree_skb_list+0x3e/0x60
  __dev_queue_xmit+0x1298/0x1c60
  dev_queue_xmit+0x10/0x20
  neigh_resolve_output+0x3a8/0x740
  ip_finish_output2+0x5c0/0xe70
  ip_finish_output+0x4ba/0x680
  ip_output+0x1c1/0x3a0
  xfrm_output_resume+0xc65/0x13d0
  xfrm_output+0x1e4/0x380
  xfrm4_output_finish+0x5c/0x70

Can be fixed if we get skb->len before dst_output().

Fixes: b9959fd3b0fa ("vti: switch to new ip tunnel code")
Fixes: 22e1b23dafa8 ("vti6: Support inter address family tunneling.")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv4/ip_vti.c  |3 ++-
 net/ipv6/ip6_vti.c |3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 5ed63d2..89453cf 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -168,6 +168,7 @@ static netdev_tx_t vti_xmit(struct sk_buff *skb, struct 
net_device *dev,
struct ip_tunnel_parm *parms = >parms;
struct dst_entry *dst = skb_dst(skb);
struct net_device *tdev;/* Device to other host */
+   int pkt_len = skb->len;
int err;
int mtu;
 
@@ -229,7 +230,7 @@ static netdev_tx_t vti_xmit(struct sk_buff *skb, struct 
net_device *dev,
 
err = dst_output(tunnel->net, skb->sk, skb);
if (net_xmit_eval(err) == 0)
-   err = skb->len;
+   err = pkt_len;
iptunnel_xmit_stats(dev, err);
return NETDEV_TX_OK;
 
diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index 79444a4..bcdc2d5 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -445,6 +445,7 @@ static bool vti6_state_check(const struct xfrm_state *x,
struct dst_entry *dst = skb_dst(skb);
struct net_device *tdev;
struct xfrm_state *x;
+   int pkt_len = skb->len;
int err = -1;
int mtu;
 
@@ -502,7 +503,7 @@ static bool vti6_state_check(const struct xfrm_state *x,
struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats);
 
u64_stats_update_begin(>syncp);
-   tstats->tx_bytes += skb->len;
+   tstats->tx_bytes += pkt_len;
tstats->tx_packets++;
u64_stats_update_end(>syncp);
} else {
-- 
1.7.1



[PATCH] vti: fix NULL dereference in xfrm_input()

2017-09-12 Thread Alexey Kodanev
Can be reproduced with LTP tests:
  # icmp-uni-vti.sh -p ah -a sha256 -m tunnel -S fffe -k 1 -s 10

IPv4:
  RIP: 0010:xfrm_input+0x7f9/0x870
  ...
  Call Trace:
  
  vti_input+0xaa/0x110 [ip_vti]
  ? skb_free_head+0x21/0x40
  vti_rcv+0x33/0x40 [ip_vti]
  xfrm4_ah_rcv+0x33/0x60
  ip_local_deliver_finish+0x94/0x1e0
  ip_local_deliver+0x6f/0xe0
  ? ip_route_input_noref+0x28/0x50
  ...

  # icmp-uni-vti.sh -6 -p ah -a sha256 -m tunnel -S fffe -k 1 -s 10
IPv6:
  RIP: 0010:xfrm_input+0x7f9/0x870
  ...
  Call Trace:
  
  xfrm6_rcv_tnl+0x3c/0x40
  vti6_rcv+0xd5/0xe0 [ip6_vti]
  xfrm6_ah_rcv+0x33/0x60
  ip6_input_finish+0xee/0x460
  ip6_input+0x3f/0xb0
  ip6_rcv_finish+0x45/0xa0
  ipv6_rcv+0x34b/0x540

xfrm_input() invokes xfrm_rcv_cb() -> vti_rcv_cb(), the last callback
might call skb_scrub_packet(), which in turn can reset secpath.

Fix it by adding a check that skb->sp is not NULL.

Fixes: 7e9e9202bccc ("xfrm: Clear RX SKB secpath xfrm_offload")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/xfrm/xfrm_input.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 2515cd2..8ac9d32 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -429,7 +429,8 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
spi, int encap_type)
nf_reset(skb);
 
if (decaps) {
-   skb->sp->olen = 0;
+   if (skb->sp)
+   skb->sp->olen = 0;
skb_dst_drop(skb);
gro_cells_receive(_cells, skb);
return 0;
@@ -440,7 +441,8 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
spi, int encap_type)
 
err = x->inner_mode->afinfo->transport_finish(skb, xfrm_gro || 
async);
if (xfrm_gro) {
-   skb->sp->olen = 0;
+   if (skb->sp)
+   skb->sp->olen = 0;
skb_dst_drop(skb);
gro_cells_receive(_cells, skb);
return err;
-- 
1.7.1



[PATCH net-next] tcp: rename *_sequence_number() to *_seq_and_tsoff()

2017-03-09 Thread Alexey Kodanev
The functions that are returning tcp sequence number also setup
TS offset value, so rename them to better describe their purpose.

No functional changes in this patch.

Suggested-by: Eric Dumazet <eduma...@google.com>
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 include/net/secure_seq.h |6 +++---
 include/net/tcp.h|2 +-
 net/core/secure_seq.c|   13 ++---
 net/ipv4/tcp_input.c |4 ++--
 net/ipv4/tcp_ipv4.c  |   22 +++---
 net/ipv6/tcp_ipv6.c  |   22 +++---
 6 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/include/net/secure_seq.h b/include/net/secure_seq.h
index 0caee63..fe236b3 100644
--- a/include/net/secure_seq.h
+++ b/include/net/secure_seq.h
@@ -6,10 +6,10 @@
 u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport);
 u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
   __be16 dport);
-u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
+u32 secure_tcp_seq_and_tsoff(__be32 saddr, __be32 daddr,
+__be16 sport, __be16 dport, u32 *tsoff);
+u32 secure_tcpv6_seq_and_tsoff(const __be32 *saddr, const __be32 *daddr,
   __be16 sport, __be16 dport, u32 *tsoff);
-u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
-__be16 sport, __be16 dport, u32 *tsoff);
 u64 secure_dccp_sequence_number(__be32 saddr, __be32 daddr,
__be16 sport, __be16 dport);
 u64 secure_dccpv6_sequence_number(__be32 *saddr, __be32 *daddr,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6ec4ea6..bede8f7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1816,7 +1816,7 @@ struct tcp_request_sock_ops {
struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
   const struct request_sock *req,
   bool *strict);
-   __u32 (*init_seq)(const struct sk_buff *skb, u32 *tsoff);
+   __u32 (*init_seq_tsoff)(const struct sk_buff *skb, u32 *tsoff);
int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
   struct flowi *fl, struct request_sock *req,
   struct tcp_fastopen_cookie *foc,
diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
index 758f140..fb87e78 100644
--- a/net/core/secure_seq.c
+++ b/net/core/secure_seq.c
@@ -45,8 +45,8 @@ static u32 seq_scale(u32 seq)
 #endif
 
 #if IS_ENABLED(CONFIG_IPV6)
-u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
-__be16 sport, __be16 dport, u32 *tsoff)
+u32 secure_tcpv6_seq_and_tsoff(const __be32 *saddr, const __be32 *daddr,
+  __be16 sport, __be16 dport, u32 *tsoff)
 {
const struct {
struct in6_addr saddr;
@@ -66,7 +66,7 @@ u32 secure_tcpv6_sequence_number(const __be32 *saddr, const 
__be32 *daddr,
*tsoff = sysctl_tcp_timestamps == 1 ? (hash >> 32) : 0;
return seq_scale(hash);
 }
-EXPORT_SYMBOL(secure_tcpv6_sequence_number);
+EXPORT_SYMBOL(secure_tcpv6_seq_and_tsoff);
 
 u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
   __be16 dport)
@@ -89,14 +89,13 @@ u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const 
__be32 *daddr,
 
 #ifdef CONFIG_INET
 
-/* secure_tcp_sequence_number(a, b, 0, d) == secure_ipv4_port_ephemeral(a, b, 
d),
+/* secure_tcp_seq_and_tsoff(a, b, 0, d) == secure_ipv4_port_ephemeral(a, b, d),
  * but fortunately, `sport' cannot be 0 in any circumstances. If this changes,
  * it would be easy enough to have the former function use siphash_4u32, 
passing
  * the arguments as separate u32.
  */
-
-u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
-  __be16 sport, __be16 dport, u32 *tsoff)
+u32 secure_tcp_seq_and_tsoff(__be32 saddr, __be32 daddr,
+__be16 sport, __be16 dport, u32 *tsoff)
 {
u64 hash;
net_secret_init();
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 39c393c..96b67a8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6324,7 +6324,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
goto drop_and_free;
 
if (isn && tmp_opt.tstamp_ok)
-   af_ops->init_seq(skb, _rsk(req)->ts_off);
+   af_ops->init_seq_tsoff(skb, _rsk(req)->ts_off);
 
if (!want_cookie && !isn) {
/* VJ's idea. We save last timestamp seen
@@ -6366,7 +6366,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
goto drop_and_release;
}
 
-   isn = af_ops->init_seq(skb, _rsk(req)->ts_off);
+   isn = af_ops->init_seq_ts

[PATCH v2] udp: avoid ufo handling on IP payload compression packets

2017-03-09 Thread Alexey Kodanev
commit c146066ab802 ("ipv4: Don't use ufo handling on later transformed
packets") and commit f89c56ce710a ("ipv6: Don't use ufo handling on
later transformed packets") added a check that 'rt->dst.header_len' isn't
zero in order to skip UFO, but it doesn't include IPcomp in transport mode
where it equals zero.

Packets, after payload compression, may not require further fragmentation,
and if original length exceeds MTU, later compressed packets will be
transmitted incorrectly. This can be reproduced with LTP udp_ipsec.sh test
on veth device with enabled UFO, MTU is 1500 and UDP payload is 2000:

* IPv4 case, offset is wrong + unnecessary fragmentation
udp_ipsec.sh -p comp -m transport -s 2000 &
tcpdump -ni ltp_ns_veth2
...
IP (tos 0x0, ttl 64, id 45203, offset 0, flags [+],
  proto Compressed IP (108), length 49)
  10.0.0.2 > 10.0.0.1: IPComp(cpi=0x1000)
IP (tos 0x0, ttl 64, id 45203, offset 1480, flags [none],
  proto UDP (17), length 21) 10.0.0.2 > 10.0.0.1: ip-proto-17

* IPv6 case, sending small fragments
udp_ipsec.sh -6 -p comp -m transport -s 2000 &
tcpdump -ni ltp_ns_veth2
...
IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108)
  payload length: 37) fd00::2 > fd00::1: IPComp(cpi=0x1000)
IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108)
  payload length: 21) fd00::2 > fd00::1: IPComp(cpi=0x1000)

Fix it by checking 'rt->dst.xfrm' pointer to 'xfrm_state' struct, skip UFO
if xfrm is set. So the new check will include both cases: IPcomp and IPsec.

Fixes: c146066ab802 ("ipv4: Don't use ufo handling on later transformed 
packets")
Fixes: f89c56ce710a ("ipv6: Don't use ufo handling on later transformed 
packets")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---

v2: use dst_xfrm() to access xfrm_state

 net/ipv4/ip_output.c  |2 +-
 net/ipv6/ip6_output.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 737ce82..7a3fd25 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -966,7 +966,7 @@ static int __ip_append_data(struct sock *sk,
cork->length += length;
if length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))) &&
(sk->sk_protocol == IPPROTO_UDP) &&
-   (rt->dst.dev->features & NETIF_F_UFO) && !rt->dst.header_len &&
+   (rt->dst.dev->features & NETIF_F_UFO) && !dst_xfrm(>dst) &&
(sk->sk_type == SOCK_DGRAM) && !sk->sk_no_check_tx) {
err = ip_ufo_append_data(sk, queue, getfrag, from, length,
 hh_len, fragheaderlen, transhdrlen,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 528b3c1..df42096 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1385,7 +1385,7 @@ static int __ip6_append_data(struct sock *sk,
if length + fragheaderlen) > mtu) ||
 (skb && skb_is_gso(skb))) &&
(sk->sk_protocol == IPPROTO_UDP) &&
-   (rt->dst.dev->features & NETIF_F_UFO) && !rt->dst.header_len &&
+   (rt->dst.dev->features & NETIF_F_UFO) && !dst_xfrm(>dst) &&
(sk->sk_type == SOCK_DGRAM) && !udp_get_no_check6_tx(sk)) {
err = ip6_ufo_append_data(sk, queue, getfrag, from, length,
  hh_len, fragheaderlen, exthdrlen,
-- 
1.7.1



[PATCH] udp: avoid ufo handling on IP payload compression packets

2017-03-03 Thread Alexey Kodanev
commit c146066ab802 ("ipv4: Don't use ufo handling on later transformed
packets") and commit f89c56ce710a ("ipv6: Don't use ufo handling on
later transformed packets") added a check that 'rt->dst.header_len' isn't
zero in order to skip UFO, but it doesn't include IPcomp in transport mode
where it equals zero.

Packets, after payload compression, may not require further fragmentation,
and if original length exceeds MTU, later compressed packets will be
transmitted incorrectly. This can be reproduced with LTP udp_ipsec.sh test
on veth device with enabled UFO, MTU is 1500 and UDP payload is 2000:

* IPv4 case, offset is wrong + unnecessary fragmentation
udp_ipsec.sh -p comp -m transport -s 2000 &
tcpdump -ni ltp_ns_veth2
...
IP (tos 0x0, ttl 64, id 45203, offset 0, flags [+],
  proto Compressed IP (108), length 49)
  10.0.0.2 > 10.0.0.1: IPComp(cpi=0x1000)
IP (tos 0x0, ttl 64, id 45203, offset 1480, flags [none],
  proto UDP (17), length 21) 10.0.0.2 > 10.0.0.1: ip-proto-17

* IPv6 case, sending small fragments
udp_ipsec.sh -6 -p comp -m transport -s 2000 &
tcpdump -ni ltp_ns_veth2
...
IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108)
  payload length: 37) fd00::2 > fd00::1: IPComp(cpi=0x1000)
IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108)
  payload length: 21) fd00::2 > fd00::1: IPComp(cpi=0x1000)

Fix it by checking 'rt->dst.xfrm' pointer to 'xfrm_state' struct, skip UFO
if xfrm is set. So the new check will include both cases: IPcomp and IPsec.

Fixes: c146066ab802 ("ipv4: Don't use ufo handling on later transformed 
packets")
Fixes: f89c56ce710a ("ipv6: Don't use ufo handling on later transformed 
packets")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv4/ip_output.c  |5 -
 net/ipv6/ip6_output.c |5 -
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index b67719f..18383ef 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -960,7 +960,10 @@ static int __ip_append_data(struct sock *sk,
cork->length += length;
if length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))) &&
(sk->sk_protocol == IPPROTO_UDP) &&
-   (rt->dst.dev->features & NETIF_F_UFO) && !rt->dst.header_len &&
+   (rt->dst.dev->features & NETIF_F_UFO) &&
+#ifdef CONFIG_XFRM
+   !rt->dst.xfrm &&
+#endif
(sk->sk_type == SOCK_DGRAM) && !sk->sk_no_check_tx) {
err = ip_ufo_append_data(sk, queue, getfrag, from, length,
 hh_len, fragheaderlen, transhdrlen,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7cebee5..2a4d5ab 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1381,7 +1381,10 @@ static int __ip6_append_data(struct sock *sk,
if length + fragheaderlen) > mtu) ||
 (skb && skb_is_gso(skb))) &&
(sk->sk_protocol == IPPROTO_UDP) &&
-   (rt->dst.dev->features & NETIF_F_UFO) && !rt->dst.header_len &&
+   (rt->dst.dev->features & NETIF_F_UFO) &&
+#ifdef CONFIG_XFRM
+   !rt->dst.xfrm &&
+#endif
(sk->sk_type == SOCK_DGRAM) && !udp_get_no_check6_tx(sk)) {
err = ip6_ufo_append_data(sk, queue, getfrag, from, length,
  hh_len, fragheaderlen, exthdrlen,
-- 
1.7.1



Re: [PATCH v2 1/2] tcp: setup timestamp offset when write_seq already set

2017-02-22 Thread Alexey Kodanev
On 02/22/2017 04:17 PM, Eric Dumazet wrote:
> On Wed, 2017-02-22 at 13:23 +0300, Alexey Kodanev wrote:
>> ...
>>
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index fe9da4f..c5169b8 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -145,6 +145,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr 
>> *uaddr, int addr_len)
>>  struct flowi4 *fl4;
>>  struct rtable *rt;
>>  int err;
>> +u32 seq;
>>  struct ip_options_rcu *inet_opt;
>>  
>>  if (addr_len < sizeof(struct sockaddr_in))
>> @@ -232,12 +233,15 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr 
>> *uaddr, int addr_len)
>>  sk->sk_gso_type = SKB_GSO_TCPV4;
>>  sk_setup_caps(sk, >dst);
>>  
>> -if (!tp->write_seq && likely(!tp->repair))
>> -tp->write_seq = secure_tcp_sequence_number(inet->inet_saddr,
>> -   inet->inet_daddr,
>> -   inet->inet_sport,
>> -   usin->sin_port,
>> -   >tsoffset);
>> +if (likely(!tp->repair)) {
>> +seq = secure_tcp_sequence_number(inet->inet_saddr,
>> + inet->inet_daddr,
>> + inet->inet_sport,
>> + usin->sin_port,
>> + >tsoffset);
>> +if (!tp->write_seq)
>> +tp->write_seq = seq;
>> +}
>>  
> Nice catch !
>
> secure_tcp_sequence_number() could be renamed, because it has two
> purposes really.

What about "secure_tcp_seq_and_tsoff(...)" ?

Also,

tcp_v4_init_sequence(...) -> tcp_v4_init_seq_and_tsoff(...)
tcp_v6_init_sequence(...) -> tcp_v6_init_seq_and_tsoff(...)


Thanks,
Alexey


[PATCH v2 2/2] tcp: account for ts offset only if tsecr not zero

2017-02-22 Thread Alexey Kodanev
We can get SYN with zero tsecr, don't apply offset in this case.

Fixes: ee684b6f2830 ("tcp: send packets with a socket timestamp")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
v2: no changes from the previous version

 net/ipv4/tcp_minisocks.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 28ce5ee..baff824 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -106,7 +106,8 @@ enum tcp_tw_status
tcp_parse_options(skb, _opt, 0, NULL);
 
if (tmp_opt.saw_tstamp) {
-   tmp_opt.rcv_tsecr   -= tcptw->tw_ts_offset;
+   if (tmp_opt.rcv_tsecr)
+   tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset;
tmp_opt.ts_recent   = tcptw->tw_ts_recent;
tmp_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
paws_reject = tcp_paws_reject(_opt, th->rst);
-- 
1.7.1



[PATCH v2 1/2] tcp: setup timestamp offset when write_seq already set

2017-02-22 Thread Alexey Kodanev
Found that when randomized tcp offsets are enabled (by default)
TCP client can still start new connections without them. Later,
if server does active close and re-uses sockets in TIME-WAIT
state, new SYN from client can be rejected on PAWS check inside
tcp_timewait_state_process(), because either tw_ts_recent or
rcv_tsval doesn't really have an offset set.

Here is how to reproduce it with LTP netstress tool:
netstress -R 1 &
netstress -H 127.0.0.1 -lr 100 -a1

[...]
< S  seq 1956977072 win 43690 TS val 295618 ecr 459956970
> .  ack 1956911535 win 342 TS val 459967184 ecr 1547117608
< R  seq 1956911535 win 0 length 0
+1. < S  seq 1956977072 win 43690 TS val 296640 ecr 459956970
> S. seq 657450664 ack 1956977073 win 43690 TS val 459968205 ecr 296640

Fixes: 95a22caee396 ("tcp: randomize tcp timestamp offsets for each connection")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
v2: * remove 'else if' clause and add new variable 'seq' to store tmp result,
* change slightly the subject and commit message.

 net/ipv4/tcp_ipv4.c |   16 ++--
 net/ipv6/tcp_ipv6.c |   16 ++--
 2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fe9da4f..c5169b8 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -145,6 +145,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, 
int addr_len)
struct flowi4 *fl4;
struct rtable *rt;
int err;
+   u32 seq;
struct ip_options_rcu *inet_opt;
 
if (addr_len < sizeof(struct sockaddr_in))
@@ -232,12 +233,15 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr 
*uaddr, int addr_len)
sk->sk_gso_type = SKB_GSO_TCPV4;
sk_setup_caps(sk, >dst);
 
-   if (!tp->write_seq && likely(!tp->repair))
-   tp->write_seq = secure_tcp_sequence_number(inet->inet_saddr,
-  inet->inet_daddr,
-  inet->inet_sport,
-  usin->sin_port,
-  >tsoffset);
+   if (likely(!tp->repair)) {
+   seq = secure_tcp_sequence_number(inet->inet_saddr,
+inet->inet_daddr,
+inet->inet_sport,
+usin->sin_port,
+>tsoffset);
+   if (!tp->write_seq)
+   tp->write_seq = seq;
+   }
 
inet->inet_id = tp->write_seq ^ jiffies;
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 4c60c6f..e49a273 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -123,6 +123,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr 
*uaddr,
struct dst_entry *dst;
int addr_type;
int err;
+   u32 seq;
 
if (addr_len < SIN6_LEN_RFC2133)
return -EINVAL;
@@ -284,12 +285,15 @@ static int tcp_v6_connect(struct sock *sk, struct 
sockaddr *uaddr,
 
sk_set_txhash(sk);
 
-   if (!tp->write_seq && likely(!tp->repair))
-   tp->write_seq = 
secure_tcpv6_sequence_number(np->saddr.s6_addr32,
-
sk->sk_v6_daddr.s6_addr32,
-inet->inet_sport,
-inet->inet_dport,
->tsoffset);
+   if (likely(!tp->repair)) {
+   seq = secure_tcpv6_sequence_number(np->saddr.s6_addr32,
+  sk->sk_v6_daddr.s6_addr32,
+  inet->inet_sport,
+  inet->inet_dport,
+  >tsoffset);
+   if (!tp->write_seq)
+   tp->write_seq = seq;
+   }
 
err = tcp_connect(sk);
if (err)
-- 
1.7.1



Re: [PATCH 1/2] tcp: setup random timestamp offset when write_seq already set

2017-02-20 Thread Alexey Kodanev
On 20.02.2017 18:18, David Miller wrote:
> This would be so much easier to understand if it were coded as:
>   if (!tp->repair) {
>   seq = secure_tcp_sequence_number(...);
>   if (!tp->write_seq)
>   tp->write_seq = seq;
>   }

Hi David,

Thought about not adding extra variable here...
Agree,it looks much better and easier to follow.

Should I send a patch with this version modified as you
suggested or we can inherit ts offsetfrom tcptw->tw_ts_offset
in tcp_twsk_unique()?

Thanks,
Alexey



Re: [PATCH 1/2] tcp: setup random timestamp offset when write_seq already set

2017-02-18 Thread Alexey Kodanev
Hi,
On 18.02.2017 3:56, Alexey Kodanev wrote:
> Found that when random offset enabled (default) TCP client can
> still start new connections with and without random offsets. Later,
> if server does active close and re-use sockets in TIME-WAIT state,
> new SYN from client can be rejected on PAWS check inside
> tcp_timewait_state_process().
>

Actually, on second thoughts, we can just copy tsoffset from tw socket:

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 89a95da..f40a61d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -126,6 +126,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock
*sktw, void *twp)
tp->write_seq = 1;
tp->rx_opt.ts_recent   = tcptw->tw_ts_recent;
tp->rx_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
+   tp->tsoffset   = tcptw->tw_ts_offset;
sock_hold(sktw);
return 1;
}

Will test this and send a new version.

Thanks,
Alexey



[PATCH 1/2] tcp: setup random timestamp offset when write_seq already set

2017-02-17 Thread Alexey Kodanev
Found that when random offset enabled (default) TCP client can
still start new connections with and without random offsets. Later,
if server does active close and re-use sockets in TIME-WAIT state,
new SYN from client can be rejected on PAWS check inside
tcp_timewait_state_process().

Here is how to reproduce it with LTP netstress tool:
netstress -R 1 &
netstress -H 127.0.0.1 -lr 100 -a1

[...]
< S  seq 1956977072 win 43690 TS val 295618 ecr 459956970
> .  ack 1956911535 win 342 TS val 459967184 ecr 1547117608
< R  seq 1956911535 win 0 length 0
+1. < S  seq 1956977072 win 43690 TS val 296640 ecr 459956970
> S. seq 657450664 ack 1956977073 win 43690 TS val 459968205 ecr 296640

Fixes: 95a22caee396 ("tcp: randomize tcp timestamp offsets for each connection")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv4/tcp_ipv4.c |7 ++-
 net/ipv6/tcp_ipv6.c |8 +++-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fe9da4f..7269e9e 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -232,12 +232,17 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr 
*uaddr, int addr_len)
sk->sk_gso_type = SKB_GSO_TCPV4;
sk_setup_caps(sk, >dst);
 
-   if (!tp->write_seq && likely(!tp->repair))
+   if (!tp->write_seq && likely(!tp->repair)) {
tp->write_seq = secure_tcp_sequence_number(inet->inet_saddr,
   inet->inet_daddr,
   inet->inet_sport,
   usin->sin_port,
   >tsoffset);
+   } else if (likely(!tp->repair)) {
+   secure_tcp_sequence_number(inet->inet_saddr, inet->inet_daddr,
+  inet->inet_sport, usin->sin_port,
+  >tsoffset);
+   }
 
inet->inet_id = tp->write_seq ^ jiffies;
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 4c60c6f..1eceeb9 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -284,12 +284,18 @@ static int tcp_v6_connect(struct sock *sk, struct 
sockaddr *uaddr,
 
sk_set_txhash(sk);
 
-   if (!tp->write_seq && likely(!tp->repair))
+   if (!tp->write_seq && likely(!tp->repair)) {
tp->write_seq = 
secure_tcpv6_sequence_number(np->saddr.s6_addr32,
 
sk->sk_v6_daddr.s6_addr32,
 inet->inet_sport,
 inet->inet_dport,
 >tsoffset);
+   } else if (likely(!tp->repair)) {
+   secure_tcpv6_sequence_number(np->saddr.s6_addr32,
+sk->sk_v6_daddr.s6_addr32,
+inet->inet_sport, inet->inet_dport,
+>tsoffset);
+   }
 
err = tcp_connect(sk);
if (err)
-- 
1.7.1



[PATCH 2/2] tcp: account for ts offset only if tsecr not zero

2017-02-17 Thread Alexey Kodanev
We can get SYN with zero tsecr, don't apply offset in this case.

Fixes: ee684b6f2830 ("tcp: send packets with a socket timestamp")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv4/tcp_minisocks.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 28ce5ee..baff824 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -106,7 +106,8 @@ enum tcp_tw_status
tcp_parse_options(skb, _opt, 0, NULL);
 
if (tmp_opt.saw_tstamp) {
-   tmp_opt.rcv_tsecr   -= tcptw->tw_ts_offset;
+   if (tmp_opt.rcv_tsecr)
+   tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset;
tmp_opt.ts_recent   = tcptw->tw_ts_recent;
tmp_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
paws_reject = tcp_paws_reject(_opt, th->rst);
-- 
1.7.1



[PATCH] tcp: initialize max window for a new fastopen socket

2017-01-19 Thread Alexey Kodanev
Found that if we run LTP netstress test with large MSS (65K),
the first attempt from server to send data comparable to this
MSS on fastopen connection will be delayed by the probe timer.

Here is an example:

 < S  seq 0:0 win 43690 options [mss 65495 wscale 7 tfo cookie] length 32
 > S. seq 0:0 ack 1 win 43690 options [mss 65495 wscale 7] length 0
 < .  ack 1 win 342 length 0

Inside tcp_sendmsg(), tcp_send_mss() returns max MSS in 'mss_now',
as well as in 'size_goal'. This results the segment not queued for
transmition until all the data copied from user buffer. Then, inside
__tcp_push_pending_frames(), it breaks on send window test and
continues with the check probe timer.

Fragmentation occurs in tcp_write_wakeup()...

+0.2 > P. seq 1:43777 ack 1 win 342 length 43776
 < .  ack 43777, win 1365 length 0
 > P. seq 43777:65001 ack 1 win 342 options [...] length 21224
 ...

This also contradicts with the fact that we should bound to the half
of the window if it is large.

Fix this flaw by correctly initializing max_window. Before that, it
could have large values that affect further calculations of 'size_goal'.

Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Alexey Kodanev <alexey.koda...@oracle.com>
---
 net/ipv4/tcp_fastopen.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index f519195..dd2560c 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -205,6 +205,7 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff 
*skb)
 * scaled. So correct it appropriately.
 */
tp->snd_wnd = ntohs(tcp_hdr(skb)->window);
+   tp->max_window = tp->snd_wnd;
 
/* Activate the retrans timer so that SYNACK can be retransmitted.
 * The request socket is not added to the ehash
-- 
1.7.1



Re: resend: tcp: performance issue with fastopen connections (mss > window)

2017-01-18 Thread Alexey Kodanev

Hi Eric,

On 01/13/2017 08:07 PM, Alexey Kodanev wrote:


Hi Eric,
On 13.01.2017 18:35, Eric Dumazet wrote:


I would suggest to clamp MSS to half the initial window, but I guess
this is impractical since window in SYN/SYNACK are not scaled.


Looks like max_window not correctly initialized for tfo sockets.
On my test machine it has set to '5592320' in tcp_fastopen_create_child().

This diff fixes the issue, the question: is this the right place to do it?

diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 4e777a3..33ed508 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -206,6 +206,8 @@ static struct sock *tcp_fastopen_create_child(struct 
sock *sk,

 */
tp->snd_wnd = ntohs(tcp_hdr(skb)->window);

+   tp->max_window = tp->snd_wnd;
+
/* Activate the retrans timer so that SYNACK can be retransmitted.
 * The request socket is not added to the ehash
 * because it's been added to the accept queue directly.


Thanks,
Alexey



Re: resend: tcp: performance issue with fastopen connections (mss > window)

2017-01-13 Thread Alexey Kodanev
Hi Eric,
On 13.01.2017 18:35, Eric Dumazet wrote:
> On Fri, 2017-01-13 at 18:01 +0300, Alexey Kodanev wrote:
>> Hi,
>>
>> Got the issue when running LTP/netstress test on localhost with mss
>> greater than the send window advertised by client (right after 3WHS).
>> Here is the testscenario that can reproduce this:
> Hi Alexey
>
> So this is a combination of Fastopen + small window + large MSS ?

Yeah, this happens only in the beginning, after first ack from client.
Later window gets
lager than mss and it doesn't happen.

>
> I would rather not force burning tons of coal or other fossil fuel,
> by making each tcp_sendmsg() done by billions of linux devices more
> expensive, only to accommodate for some LTP test doing something not
> sensible ;)
>
> Fact that you removed one condition in the BUG_ON() might hide another
> issue later in the path.
>
> I would suggest to clamp MSS to half the initial window, but I guess
> this is impractical since window in SYN/SYNACK are not scaled.
> Care to send a packetdrill test so that we have a clear picture of what
> is going on ?

Is it capable of making two connections in the single test, one after
another?

Thanks,
Alexey


  1   2   >