from:"Wei Wang"

Re: [PATCH iproute2] ss: add support for bytes_sent, bytes_retrans, dsack_dups and reord_seen

2018-11-29 Thread Wei Wang

On Thu, Nov 29, 2018 at 2:28 AM Eric Dumazet  wrote:
>
> Wei Wang added these fields in linux-4.19
>
> Tested:
>
> ss -ti ...
>
> ts sack cubic wscale:8,8 rto:7 rtt:2.678/0.267 mss:1428 pmtu:1500
> rcvmss:536 advmss:1428 cwnd:91 ssthresh:65
> (*) bytes_sent:17470606104 bytes_retrans:2856
> bytes_acked:17470483297
> segs_out:12234320 segs_in:622983
> data_segs_out:12234318 send 388.2Mbps lastrcv:986784 lastack:1
> pacing_rate 465.8Mbps delivery_rate 162.7Mbps
> delivered:12234235 delivered_ce:3669056
> busy:986784ms unacked:84 retrans:0/2
> (*) dsack_dups:2
> rcv_space:14280 rcv_ssthresh:65535 notsent:2016336 minrtt:0.183
>
> Signed-off-by: Eric Dumazet 
> Cc: Wei Wang 
> Cc: Yuchung Cheng 
> Cc: Neal Cardwell 
> Cc: Soheil Hassas Yeganeh 
> ---

Acked-by: Wei Wang 

Thanks Eric.

>
>  misc/ss.c | 16 
>  1 file changed, 16 insertions(+)
>
> diff --git a/misc/ss.c b/misc/ss.c
> index 3aa94f235085512510dca9fd597e8e37aaaf0fd3..3589ebedc5a0ab0615ba56f3df10d49198bed0d9 100644
> --- a/misc/ss.c
> +++ b/misc/ss.c
> @@ -819,6 +819,8 @@ struct tcpstat {
> unsigned int    not_sent;
> unsigned int    delivered;
> unsigned int    delivered_ce;
> +   unsigned int    dsack_dups;
> +   unsigned int    reord_seen;
> double  rcv_rtt;
> double  min_rtt;
> int rcv_space;
> @@ -826,6 +828,8 @@ struct tcpstat {
> unsigned long long  busy_time;
> unsigned long long  rwnd_limited;
> unsigned long long  sndbuf_limited;
> +   unsigned long long  bytes_sent;
> +   unsigned long long  bytes_retrans;
> bool    has_ts_opt;
> bool    has_sack_opt;
> bool    has_ecn_opt;
> @@ -2426,6 +2430,10 @@ static void tcp_stats_print(struct tcpstat *s)
> if (s->ssthresh)
> out(" ssthresh:%d", s->ssthresh);
>
> +   if (s->bytes_sent)
> +   out(" bytes_sent:%llu", s->bytes_sent);
> +   if (s->bytes_retrans)
> +   out(" bytes_retrans:%llu", s->bytes_retrans);
> if (s->bytes_acked)
> out(" bytes_acked:%llu", s->bytes_acked);
> if (s->bytes_received)
> @@ -2512,10 +2520,14 @@ static void tcp_stats_print(struct tcpstat *s)
> out(" lost:%u", s->lost);
> if (s->sacked && s->ss.state != SS_LISTEN)
> out(" sacked:%u", s->sacked);
> +   if (s->dsack_dups)
> +   out(" dsack_dups:%u", s->dsack_dups);
> if (s->fackets)
> out(" fackets:%u", s->fackets);
> if (s->reordering != 3)
> out(" reordering:%d", s->reordering);
> +   if (s->reord_seen)
> +   out(" reord_seen:%d", s->reord_seen);
> if (s->rcv_rtt)
> out(" rcv_rtt:%g", s->rcv_rtt);
> if (s->rcv_space)
> @@ -2837,6 +2849,10 @@ static void tcp_show_info(const struct nlmsghdr *nlh, struct inet_diag_msg *r,
> s.sndbuf_limited = info->tcpi_sndbuf_limited;
> s.delivered = info->tcpi_delivered;
> s.delivered_ce = info->tcpi_delivered_ce;
> +   s.dsack_dups = info->tcpi_dsack_dups;
> +   s.reord_seen = info->tcpi_reord_seen;
> +   s.bytes_sent = info->tcpi_bytes_sent;
> +   s.bytes_retrans = info->tcpi_bytes_retrans;
> tcp_stats_print(&s);
> free(s.dctcp);
> free(s.bbr_info);
> --
> 2.20.0.rc0.387.gc7a69e6b6c-goog
>
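
The printing hunks above follow the usual ss convention of emitting a
token only when its counter is nonzero. A minimal user-space sketch of
that convention, limited to the four new fields, is shown below. The
struct and function names (tcp_extras, append_nonzero, format_tcp_extras)
are hypothetical and are not part of misc/ss.c; the real code calls out()
directly and interleaves these tokens with the existing ones.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the four fields the patch adds to struct tcpstat. */
struct tcp_extras {
	unsigned long long bytes_sent;
	unsigned long long bytes_retrans;
	unsigned int dsack_dups;
	unsigned int reord_seen;
};

/* Append " key:value" to buf, but only when value is nonzero. */
static void append_nonzero(char *buf, size_t cap, const char *key,
			   unsigned long long value)
{
	size_t used = strlen(buf);

	if (value && used < cap)
		snprintf(buf + used, cap - used, " %s:%llu", key, value);
}

/* Build the ss-style tokens for the new counters into buf. */
static void format_tcp_extras(const struct tcp_extras *s, char *buf, size_t cap)
{
	assert(s != NULL && buf != NULL && cap > 0);
	buf[0] = '\0';
	append_nonzero(buf, cap, "bytes_sent", s->bytes_sent);
	append_nonzero(buf, cap, "bytes_retrans", s->bytes_retrans);
	append_nonzero(buf, cap, "dsack_dups", s->dsack_dups);
	append_nonzero(buf, cap, "reord_seen", s->reord_seen);
}
```

With the values from the "Tested:" output (bytes_sent 17470606104,
bytes_retrans 2856, dsack_dups 2, reord_seen 0), the sketch produces the
three nonzero tokens and skips reord_seen, which matches the sample where
no reord_seen token appears.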

[PATCH net] ipv6: take rcu lock in rawv6_send_hdrinc()

2018-10-04 Thread Wei Wang

From: Wei Wang 

In rawv6_send_hdrinc(), in order to avoid an extra dst_hold(), we
directly assign the dst to the skb and set the passed-in dst to NULL to
avoid a double free.
However, in the error case, we free the skb and then update stats using
the dst pointer passed in. This causes a use-after-free on the dst.
Fix it by taking the rcu read lock right before the dst could get
released, to make sure the dst does not get freed until the stats update
is done.
Note: we don't have this issue in ipv4 because dst is not used for the
stats update in v4.
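
The ordering described above can be illustrated with a single-threaded
toy model. This is only a sketch under strong assumptions: it is not
kernel RCU, and every name in it (toy_dst, toy_read_lock, toy_release,
send_error_path) is hypothetical. It models just one property, namely
that a release issued inside a read-side section is deferred until the
section ends, so the stats update after the release still touches valid
memory.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_DEFERRED 8

struct toy_dst {
	unsigned long tx_errors;	/* stands in for the stats reached via dst */
};

static int readers;			/* depth of active read-side sections */
static struct toy_dst *deferred[MAX_DEFERRED];
static int n_deferred;
static int n_freed;			/* counts real frees, for inspection */

static void toy_read_lock(void)
{
	readers++;
}

static void toy_read_unlock(void)
{
	assert(readers > 0);
	if (--readers == 0) {
		/* "Grace period" over: run the queued frees. */
		for (int i = 0; i < n_deferred; i++) {
			free(deferred[i]);
			n_freed++;
		}
		n_deferred = 0;
	}
}

/* Release an object; the actual free waits until no reader is active. */
static void toy_release(struct toy_dst *d)
{
	if (readers == 0) {
		free(d);
		n_freed++;
		return;
	}
	assert(n_deferred < MAX_DEFERRED);
	deferred[n_deferred++] = d;
}

/* Error path modelled on the commit message: the reference is dropped
 * first and the stats are updated afterwards. */
static unsigned long send_error_path(struct toy_dst *d)
{
	unsigned long seen;

	toy_read_lock();	/* taken right before the release */
	toy_release(d);		/* the reference is gone ... */
	seen = ++d->tx_errors;	/* ... but the memory is still valid here */
	toy_read_unlock();	/* only now may the object really be freed */
	return seen;
}
```

Without the toy_read_lock()/toy_read_unlock() pair, toy_release() would
free the object immediately and the increment would be a use-after-free,
which is the shape of the bug in the KASAN report.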

Syzkaller reported following crash:
BUG: KASAN: use-after-free in rawv6_send_hdrinc net/ipv6/raw.c:692 [inline]
BUG: KASAN: use-after-free in rawv6_sendmsg+0x4421/0x4630 net/ipv6/raw.c:921
Read of size 8 at addr 8801d95ba730 by task syz-executor0/32088

CPU: 1 PID: 32088 Comm: syz-executor0 Not tainted 4.19.0-rc2+ #93
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
 rawv6_send_hdrinc net/ipv6/raw.c:692 [inline]
 rawv6_sendmsg+0x4421/0x4630 net/ipv6/raw.c:921
 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:621 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:631
 ___sys_sendmsg+0x7fd/0x930 net/socket.c:2114
 __sys_sendmsg+0x11d/0x280 net/socket.c:2152
 __do_sys_sendmsg net/socket.c:2161 [inline]
 __se_sys_sendmsg net/socket.c:2159 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2159
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457099
Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 
cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:7f83756edc78 EFLAGS: 0246 ORIG_RAX: 002e
RAX: ffda RBX: 7f83756ee6d4 RCX: 00457099
RDX:  RSI: 20003840 RDI: 0004
RBP: 009300a0 R08:  R09: 
R10:  R11: 0246 R12: 
R13: 004d4b30 R14: 004c90b1 R15: 

Allocated by task 32088:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
 kmem_cache_alloc+0x12e/0x730 mm/slab.c:3554
 dst_alloc+0xbb/0x1d0 net/core/dst.c:105
 ip6_dst_alloc+0x35/0xa0 net/ipv6/route.c:353
 ip6_rt_cache_alloc+0x247/0x7b0 net/ipv6/route.c:1186
 ip6_pol_route+0x8f8/0xd90 net/ipv6/route.c:1895
 ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2093
 fib6_rule_lookup+0x277/0x860 net/ipv6/fib6_rules.c:122
 ip6_route_output_flags+0x2c5/0x350 net/ipv6/route.c:2121
 ip6_route_output include/net/ip6_route.h:88 [inline]
 ip6_dst_lookup_tail+0xe27/0x1d60 net/ipv6/ip6_output.c:951
 ip6_dst_lookup_flow+0xc8/0x270 net/ipv6/ip6_output.c:1079
 rawv6_sendmsg+0x12d9/0x4630 net/ipv6/raw.c:905
 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:621 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:631
 ___sys_sendmsg+0x7fd/0x930 net/socket.c:2114
 __sys_sendmsg+0x11d/0x280 net/socket.c:2152
 __do_sys_sendmsg net/socket.c:2161 [inline]
 __se_sys_sendmsg net/socket.c:2159 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2159
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 5356:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kmem_cache_free+0x83/0x290 mm/slab.c:3756
 dst_destroy+0x267/0x3c0 net/core/dst.c:141
 dst_destroy_rcu+0x16/0x19 net/core/dst.c:154
 __rcu_reclaim kernel/rcu/rcu.h:236 [inline]
 rcu_do_batch kernel/rcu/tree.c:2576 [inline]
 invoke_rcu_callbacks kernel/rcu/tree.c:2880 [inline]
 __rcu_process_callbacks kernel/rcu/tree.c:2847 [inline]
 rcu_process_callbacks+0xf23/0x2670 kernel/rcu/tree.c:2864
 __do_softirq+0x30b/0xad8 kernel/softirq.c:292

Fixes: 1789a640f556 ("raw: avoid two atomics in xmit")
Signed-off-by: Wei Wang 
Signed-off-by: Eric Dumazet  
---
 net/ipv6/raw.c | 29 -
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 413d98bf24f4..5e0efd3954e9 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -651,8 +651,6 @@ static int rawv6_send_hdrinc(struct sock *sk, struct msghdr 
*msg, int length,
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
skb-&g

Re: [PATCH net 2/2] ipv6: fix memory leak on dst->_metrics

2018-09-18 Thread Wei Wang

On Tue, Sep 18, 2018 at 4:25 PM David Ahern  wrote:
>
> On 9/18/18 1:45 PM, Wei Wang wrote:
> > From: Wei Wang 
> >
> > When dst->_metrics and f6i->fib6_metrics share the same memory, both
> > take reference count on the dst_metrics structure. However, when dst is
> > destroyed, ip6_dst_destroy() only invokes dst_destroy_metrics_generic()
> > which does not take care of READONLY metrics and does not release refcnt.
> > This causes memory leak.
> > Similar to ipv4 logic, the fix is to properly release refcnt and free
> > the memory space pointed by dst->_metrics if refcnt becomes 0.
> >
> > Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes")
> > Reported-by: Sabrina Dubroca 
> > Signed-off-by: Wei Wang 
> > Signed-off-by: Eric Dumazet 
> > ---
> >  net/ipv6/route.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index b5d3e6b294ab..826b14de7dbb 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -364,11 +364,14 @@ EXPORT_SYMBOL(ip6_dst_alloc);
> >
> >  static void ip6_dst_destroy(struct dst_entry *dst)
> >  {
> > + struct dst_metrics *p = (struct dst_metrics *)DST_METRICS_PTR(dst);
> >   struct rt6_info *rt = (struct rt6_info *)dst;
> >   struct fib6_info *from;
> >   struct inet6_dev *idev;
> >
> > - dst_destroy_metrics_generic(dst);
> > + if (p != &dst_default_metrics && refcount_dec_and_test(&p->refcnt))
> > + kfree(p);
> > +
> >   rt6_uncached_list_del(rt);
> >
> >   idev = rt->rt6i_idev;
> >
>
> Reviewed-by: David Ahern 
>
> With the revert in patch 1 we are back to my original code after
> 93531c67431 ("net/ipv6: separate handling of FIB entries from dst based
> routes").
>
> My intention with that series was to make IPv6 handling of metrics as
> identical to IPv4 as possible (v6 does have differences for example due
> to autoconf and changing metrics after installing a route). The change
> in this patch is what I missed back in April.
>
> Comparing IPv4 and IPv6 code for memory allocation and freeing for FIB
> entries, transferring metrics to dst_entry and cleanup of dst_entry all
> look nearly identical - to the point that net-next could have common
> helpers to manage the refcnt'ing. I can submit those after this change
> hits net-next.
Yes. Agree. Since both v4 and v6 use the same logic on handling
metrics, helpers can be useful.

>
> Thanks for your time getting to the bottom of the leak.

[PATCH net 2/2] ipv6: fix memory leak on dst->_metrics

2018-09-18 Thread Wei Wang

From: Wei Wang 

When dst->_metrics and f6i->fib6_metrics share the same memory, both
take a reference count on the dst_metrics structure. However, when the
dst is destroyed, ip6_dst_destroy() only invokes
dst_destroy_metrics_generic(), which does not take care of READONLY
metrics and does not release the refcnt.
This causes a memory leak.
Similar to the ipv4 logic, the fix is to properly release the refcnt and
free the memory pointed to by dst->_metrics if the refcnt becomes 0.
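
The release rule described above (skip the shared defaults, otherwise
drop one reference and free on the last one) can be sketched in user
space with C11 atomics. The names toy_metrics, toy_default_metrics,
toy_metrics_get and toy_metrics_put are hypothetical stand-ins; the
kernel uses refcount_t and kfree(), as the diff below shows.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_metrics {
	atomic_int refcnt;
	unsigned int values[4];
};

/* Shared read-only defaults: never refcounted, never freed. */
static struct toy_metrics toy_default_metrics;

/* Allocate a metrics block holding one reference for the caller. */
static struct toy_metrics *toy_metrics_alloc(void)
{
	struct toy_metrics *m = calloc(1, sizeof(*m));

	if (m)
		atomic_init(&m->refcnt, 1);
	return m;
}

/* Take an extra reference, e.g. when a dst starts sharing the block. */
static void toy_metrics_get(struct toy_metrics *m)
{
	assert(m != NULL);
	if (m != &toy_default_metrics)
		atomic_fetch_add(&m->refcnt, 1);
}

/* Drop one reference; returns true when this call freed the block. */
static bool toy_metrics_put(struct toy_metrics *m)
{
	assert(m != NULL);
	if (m == &toy_default_metrics)
		return false;
	/* fetch_sub returns the old value: 1 means this was the last ref. */
	if (atomic_fetch_sub(&m->refcnt, 1) == 1) {
		free(m);
		return true;
	}
	return false;
}
```

In this model the leak corresponds to one owner never calling
toy_metrics_put(): the count then stays above zero forever and the block
is never freed.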

Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes")
Reported-by: Sabrina Dubroca 
Signed-off-by: Wei Wang 
Signed-off-by: Eric Dumazet 
---
 net/ipv6/route.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b5d3e6b294ab..826b14de7dbb 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -364,11 +364,14 @@ EXPORT_SYMBOL(ip6_dst_alloc);
 
 static void ip6_dst_destroy(struct dst_entry *dst)
 {
+   struct dst_metrics *p = (struct dst_metrics *)DST_METRICS_PTR(dst);
struct rt6_info *rt = (struct rt6_info *)dst;
struct fib6_info *from;
struct inet6_dev *idev;
 
-   dst_destroy_metrics_generic(dst);
+   if (p != &dst_default_metrics && refcount_dec_and_test(&p->refcnt))
+   kfree(p);
+
rt6_uncached_list_del(rt);
 
idev = rt->rt6i_idev;
-- 
2.19.0.397.gdd90340f6a-goog

[PATCH net 1/2] Revert "ipv6: fix double refcount of fib6_metrics"

2018-09-18 Thread Wei Wang

From: Wei Wang 

This reverts commit e70a3aad44cc8b24986687ffc98c4a4f6ecf25ea.

This change causes use-after-free on dst->_metrics.
The crash trace looks like this:
[   97.763269] BUG: KASAN: use-after-free in ip6_mtu+0x116/0x140
[   97.769038] Read of size 4 at addr 881781d2cf84 by task 
svw_NetThreadEv/8801

[   97.777954] CPU: 76 PID: 8801 Comm: svw_NetThreadEv Not tainted 
4.15.0-smp-DEV #11
[   97.777956] Hardware name: Default string Default string/Indus_QC_02, BIOS 
5.46.4 03/29/2018
[   97.777957] Call Trace:
[   97.777971]  [] dump_stack+0x4d/0x72
[   97.777985]  [] print_address_description+0x6f/0x260
[   97.777997]  [] kasan_report+0x257/0x370
[   97.778001]  [] ? ip6_mtu+0x116/0x140
[   97.778004]  [] __asan_report_load4_noabort+0x19/0x20
[   97.778008]  [] ip6_mtu+0x116/0x140
[   97.778013]  [] tcp_current_mss+0x12e/0x280
[   97.778016]  [] ? tcp_mtu_to_mss+0x2d0/0x2d0
[   97.778022]  [] ? depot_save_stack+0x138/0x4a0
[   97.778037]  [] ? __mmdrop+0x145/0x1f0
[   97.778040]  [] ? save_stack+0xb1/0xd0
[   97.778046]  [] tcp_send_mss+0x22/0x220
[   97.778059]  [] tcp_sendmsg_locked+0x4f9/0x39f0
[   97.778062]  [] ? kasan_check_write+0x14/0x20
[   97.778066]  [] ? tcp_sendpage+0x60/0x60
[   97.778070]  [] ? rw_copy_check_uvector+0x69/0x280
[   97.778075]  [] ? import_iovec+0x9f/0x430
[   97.778078]  [] ? kasan_slab_free+0x87/0xc0
[   97.778082]  [] ? memzero_page+0x140/0x140
[   97.778085]  [] ? kasan_check_write+0x14/0x20
[   97.778088]  [] tcp_sendmsg+0x2c/0x50
[   97.778092]  [] ? tcp_sendmsg+0x2c/0x50
[   97.778098]  [] inet_sendmsg+0x103/0x480
[   97.778102]  [] ? inet_gso_segment+0x15b0/0x15b0
[   97.778105]  [] sock_sendmsg+0xba/0xf0
[   97.778108]  [] ___sys_sendmsg+0x6ca/0x8e0
[   97.778113]  [] ? hrtimer_try_to_cancel+0x71/0x3b0
[   97.778116]  [] ? copy_msghdr_from_user+0x3d0/0x3d0
[   97.778119]  [] ? memset+0x31/0x40
[   97.778123]  [] ? 
schedule_hrtimeout_range_clock+0x165/0x380
[   97.778127]  [] ? hrtimer_nanosleep_restart+0x250/0x250
[   97.778130]  [] ? __hrtimer_init+0x180/0x180
[   97.778133]  [] ? ktime_get_ts64+0x172/0x200
[   97.778137]  [] ? __fget_light+0x8c/0x2f0
[   97.778141]  [] __sys_sendmsg+0xe6/0x190
[   97.778144]  [] ? __sys_sendmsg+0xe6/0x190
[   97.778147]  [] ? SyS_shutdown+0x20/0x20
[   97.778152]  [] ? wake_up_q+0xe0/0xe0
[   97.778155]  [] ? __sys_sendmsg+0x190/0x190
[   97.778158]  [] SyS_sendmsg+0x13/0x20
[   97.778162]  [] do_syscall_64+0x2ac/0x430
[   97.778166]  [] ? do_page_fault+0x35/0x3d0
[   97.778171]  [] ? page_fault+0x2f/0x50
[   97.778174]  [] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[   97.778177] RIP: 0033:0x7f83fa36000d
[   97.778178] RSP: 002b:7f83ef9229e0 EFLAGS: 0293 ORIG_RAX: 
002e
[   97.778180] RAX: ffda RBX: 0001 RCX: 7f83fa36000d
[   97.778182] RDX: 4000 RSI: 7f83ef922f00 RDI: 0036
[   97.778183] RBP: 7f83ef923040 R08: 7f83ef9231f8 R09: 7f83ef923168
[   97.778184] R10:  R11: 0293 R12: 7f83f69c5b40
[   97.778185] R13: 001c R14: 0001 R15: 4000

[   97.779684] Allocated by task 5919:
[   97.783185]  save_stack+0x46/0xd0
[   97.783187]  kasan_kmalloc+0xad/0xe0
[   97.783189]  kmem_cache_alloc_trace+0xdf/0x580
[   97.783190]  ip6_convert_metrics.isra.79+0x7e/0x190
[   97.783192]  ip6_route_info_create+0x60a/0x2480
[   97.783193]  ip6_route_add+0x1d/0x80
[   97.783195]  inet6_rtm_newroute+0xdd/0xf0
[   97.783198]  rtnetlink_rcv_msg+0x641/0xb10
[   97.783200]  netlink_rcv_skb+0x27b/0x3e0
[   97.783202]  rtnetlink_rcv+0x15/0x20
[   97.783203]  netlink_unicast+0x4be/0x720
[   97.783204]  netlink_sendmsg+0x7bc/0xbf0
[   97.783205]  sock_sendmsg+0xba/0xf0
[   97.783207]  ___sys_sendmsg+0x6ca/0x8e0
[   97.783208]  __sys_sendmsg+0xe6/0x190
[   97.783209]  SyS_sendmsg+0x13/0x20
[   97.783211]  do_syscall_64+0x2ac/0x430
[   97.783213]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

[   97.784709] Freed by task 0:
[   97.785056] knetbase: Error: /proc/sys/net/core/txcs_enable does not exist
[   97.794497]  save_stack+0x46/0xd0
[   97.794499]  kasan_slab_free+0x71/0xc0
[   97.794500]  kfree+0x7c/0xf0
[   97.794501]  fib6_info_destroy_rcu+0x24f/0x310
[   97.794504]  rcu_process_callbacks+0x38b/0x1730
[   97.794506]  __do_softirq+0x1c8/0x5d0

Reported-by: John Sperbeck 
Signed-off-by: Wei Wang 
Signed-off-by: Eric Dumazet 
---
 net/ipv6/route.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 480a79f47c52..b5d3e6b294ab 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -976,6 +976,10 @@ static void rt6_set_from(struct rt6_info *rt, struct 
fib6_info *from)
rt->rt6i_flags &= ~RTF_EXPIRES;
rcu_assign_pointer(rt->from, from);
dst_init_metrics(&rt->dst, from->fib6_metrics->metrics, true);
+   if (from->fib6_metrics != &dst_default_metrics) {
+   rt->dst._metrics |= DST_METRICS_REFCOUNTED;

[PATCH net 0/2] ipv6: fix issues on accessing fib6_metrics

2018-09-18 Thread Wei Wang

From: Wei Wang 

The latest fix for the memory leak of fib6_metrics still causes a
use-after-free.
This patch series first reverts the previous fix and then proposes a new
fix that is more in line with the ipv4 logic and is tested to fix the
reported use-after-free issue.

Wei Wang (2):
  Revert "ipv6: fix double refcount of fib6_metrics"
  ipv6: fix memory leak on dst->_metrics

 net/ipv6/route.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

-- 
2.19.0.397.gdd90340f6a-goog

Re: BUG: unable to handle kernel paging request in fib6_node_lookup_1

2018-09-05 Thread Wei Wang

On Tue, Sep 4, 2018 at 11:11 PM Song Liu  wrote:
>
> We are debugging an issue with fib6_node_lookup_1().
>
> We use a 4.16-based kernel, and we have backported most upstream
> patches in ip6_fib.{c,h}. The only major differences I can spot are
>
> 8b7f2731bd68d83940714ce92381d1a72596407c
> c350637229fccbffee2475400fcd689d5738
>
> I guess the issue is not related to these two fixes.
>
> After staring at the call trace and disassembly code (attached below)
> I guess this is a use-after-free issue in (or right after) the lookup
> loop:
>
> for (;;) {
> struct fib6_node *next;
>
> dir = addr_bit_set(args->addr, fn->fn_bit);
>
> next = dir ? rcu_dereference(fn->right) :
>  rcu_dereference(fn->left);
>
> if (next) {
> fn = next;
> continue;
> }
> break;
> }
>
> I guess this probably also happens to latest upstream. I haven't
> tested this with upstream kernel (or net tree) yet, because we
> can only trigger this about once a week on 100 servers.
>
> Does this look familiar? Any comments and/or suggestions are highly
> appreciated.
>
By glancing at the commit logs, I don't think any changes were made
regarding the core logic of fib6_node handling recently.
(There were a couple of fixes regarding fib6_info but I don't think it
is the cause here... But it is still good to check if you have commit
9b0a8da8c4c6, e873e4b9cc7e, e70a3aad44cc in your build.)

I also went through the call path and did not find anything obviously wrong...
I think it's best for you to reproduce it so that we can debug further.
One question: do you have "CONFIG_IPV6_SUBTREE" enabled, and do you specify
a src IP in the routing table?

Thanks.
Wei
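
For readers following the lookup loop quoted in this message, here is a
toy user-space version of the same descent. It is only a sketch: the
node layout and the names toy_node, toy_addr_bit_set and toy_descend are
hypothetical, bits are counted from the most significant bit of a
16-byte address, and there is no rcu_dereference() because nothing is
freed concurrently here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_node {
	unsigned int bit;		/* which address bit this node tests */
	struct toy_node *left;		/* taken when the bit is 0 */
	struct toy_node *right;		/* taken when the bit is 1 */
};

/* Test bit n of a 128-bit address, counting from the most significant bit. */
static int toy_addr_bit_set(const uint8_t addr[16], unsigned int n)
{
	assert(n < 128);
	return (addr[n / 8] >> (7 - n % 8)) & 1;
}

/* Walk down the tree until there is no child in the chosen direction. */
static struct toy_node *toy_descend(struct toy_node *fn, const uint8_t addr[16])
{
	for (;;) {
		struct toy_node *next;

		next = toy_addr_bit_set(addr, fn->bit) ? fn->right : fn->left;
		if (!next)
			break;
		fn = next;
	}
	return fn;
}
```

Each step dereferences the node reached in the previous step, so in the
real kernel code a node freed while a reader is still walking would show
up as a fault inside or right after this loop, which is what the report
suspects.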

> Thanks,
> Song
>
>
> Bug stack trace:
>
> [354764.457916] BUG: unable to handle kernel
> [354764.466125] paging request
> [354764.471720]  at f60fc318
> [354764.478360] IP: fib6_node_lookup_1+0x29/0x130
> [354764.487249] PGD 80010f725067
> [354764.494062] P4D 80010f725067
> [354764.500878] PUD 0
> [354764.505087] Oops:  [#1] SMP PTI
> [354764.512245] Modules linked in:
> [354764.518536]  udp_diag
> [354764.523266]  act_gact
> [354764.527997]  cls_bpf
> [354764.532557]  tcp_diag
> [354764.537291]  inet_diag
> [354764.542200]  nfsv3
> [354764.546409]  nfs
> [354764.550273]  fscache
> [354764.554834]  ip6table_raw
> [354764.560260]  ip6table_filter
> [354764.566208]  xt_DSCP
> [354764.570765]  iptable_raw
> [354764.576020]  iptable_filter
> [354764.581790]  ip6table_mangle
> [354764.587738]  iptable_mangle
> [354764.593505]  sb_edac
> [354764.598058]  x86_pkg_temp_thermal
> [354764.604872]  intel_powerclamp
> [354764.610992]  coretemp
> [354764.615723]  kvm_intel
> [354764.620628]  kvm
> [354764.624494]  irqbypass
> [354764.629399]  iTCO_wdt
> [354764.634132]  iTCO_vendor_support
> [354764.640772]  i2c_i801
> [354764.645507]  lpc_ich
> [354764.650064]  efivars
> [354764.654619]  mfd_core
> [354764.659353]  ipmi_si
> [354764.663911]  ipmi_devintf
> [354764.669341]  ipmi_msghandler
> [354764.675281]  acpi_cpufreq
> [354764.680711]  button
> [354764.685096]  sch_fq_codel
> [354764.690520]  nfsd
> [354764.694557]  nfs_acl
> [354764.699118]  lockd
> [354764.703330]  auth_rpcgss
> [354764.708588]  oid_registry
> [354764.714006]  grace
> [354764.718213]  sunrpc
> [354764.722590]  fuse
> [354764.726626]  loop
> [354764.730661]  efivarfs
> [354764.735395]  autofs4
> [354764.739957] CPU: 5 PID: 3460038 Comm: java Not tainted 
> 4.16.0-14_fbk2_1455_g6bcb99c57db6 #14
> [354764.756996] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM03 
>   06/02/2016
> [354764.773001] RIP: 0010:fib6_node_lookup_1+0x29/0x130
> [354764.782929] RSP: 0018:c9003f0bb730 EFLAGS: 00010206
> [354764.793557] RAX: 883fc131a000 RBX: f60fc300 RCX: 
> ffe4
> [354764.807999] RDX: 0010 RSI: 0001 RDI: 
> c9003f0bb8f0
> [354764.822436] RBP: c9003f0bb750 R08: 0002 R09: 
> 0004
> [354764.836877] R10: c9003f0bb7a8 R11: 883ff7795780 R12: 
> 82305080
> [354764.851317] R13: 0002 R14:  R15: 
> 
> [354764.865765] FS:  7f8defcfc700() GS:881fff94() 
> knlGS:
> [354764.882119] CS:  0010 DS:  ES:  CR0: 80050033
> [354764.893800] CR2: f60fc318 CR3: 000f68cae006 CR4: 
> 003606e0
> [354764.908235] DR0:  DR1:  DR2: 
> 
> [354764.922671] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [354764.937109] Call Trace:
> [354764.942195]  fib6_node_lookup+0x67/0x90
> [354764.950042]  ? fib6_table_lookup+0x43/0x2f0
> [354764.958587]  fib6_table_lookup+0x43/0x2f0
> [354764.966794]  ip6_pol_route+0x43/0x360
> [354764.974294]  ? ip6_pol_route_input+0x20/0x20
> [354764.983016]

[PATCH net v2] l2tp: use sk_dst_check() to avoid race on sk->sk_dst_cache

2018-08-10 Thread Wei Wang

From: Wei Wang 

In l2tp code, if it is a L2TP_UDP_ENCAP tunnel, tunnel->sk points to a
UDP socket. A user could call sendmsg() on both this tunnel and the UDP
socket itself concurrently. As l2tp_xmit_skb() holds the socket lock and
calls __sk_dst_check() to refresh sk->sk_dst_cache, while udpv6_sendmsg()
is lockless and calls sk_dst_check() to refresh sk->sk_dst_cache, there
could be a race that causes the dst cache to be freed multiple times.
So we fix the l2tp side code to always call sk_dst_check(), to guarantee
that xchg() is called when refreshing sk->sk_dst_cache and so avoid race
conditions.
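
Why an exchange matters here can be sketched in user space with C11
atomics. The sketch assumes nothing about the kernel helpers beyond what
the text above says; toy_route, toy_sock and toy_dst_reset are
hypothetical names, and the released counter stands in for dropping the
cache's reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_route {
	int released;	/* how many times the cache reference was dropped */
};

struct toy_sock {
	_Atomic(struct toy_route *) dst_cache;
};

/* Detach the cached route and drop the cache's reference exactly once.
 * Returns 1 if this caller did the release, 0 if the cache was empty. */
static int toy_dst_reset(struct toy_sock *sk)
{
	struct toy_route *old;

	assert(sk != NULL);
	/* Only one caller can ever observe the non-NULL old pointer. */
	old = atomic_exchange(&sk->dst_cache, (struct toy_route *)NULL);
	if (!old)
		return 0;
	old->released++;	/* stands in for releasing the old dst */
	return 1;
}
```

A plain "read the pointer, then store NULL" sequence has no such
guarantee: two callers can both read the same non-NULL pointer before
either stores NULL, and both then release it, which is the double free
the commit message describes.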

Syzkaller reported stack trace:
BUG: KASAN: use-after-free in atomic_read 
include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: use-after-free in atomic_fetch_add_unless 
include/linux/atomic.h:575 [inline]
BUG: KASAN: use-after-free in atomic_add_unless include/linux/atomic.h:597 
[inline]
BUG: KASAN: use-after-free in dst_hold_safe include/net/dst.h:308 [inline]
BUG: KASAN: use-after-free in ip6_hold_safe+0xe6/0x670 net/ipv6/route.c:1029
Read of size 4 at addr 8801aea9a880 by task syz-executor129/4829

CPU: 0 PID: 4829 Comm: syz-executor129 Not tainted 4.18.0-rc7-next-20180802+ #30
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
 atomic_fetch_add_unless include/linux/atomic.h:575 [inline]
 atomic_add_unless include/linux/atomic.h:597 [inline]
 dst_hold_safe include/net/dst.h:308 [inline]
 ip6_hold_safe+0xe6/0x670 net/ipv6/route.c:1029
 rt6_get_pcpu_route net/ipv6/route.c:1249 [inline]
 ip6_pol_route+0x354/0xd20 net/ipv6/route.c:1922
 ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2098
 fib6_rule_lookup+0x283/0x890 net/ipv6/fib6_rules.c:122
 ip6_route_output_flags+0x2c5/0x350 net/ipv6/route.c:2126
 ip6_dst_lookup_tail+0x1278/0x1da0 net/ipv6/ip6_output.c:978
 ip6_dst_lookup_flow+0xc8/0x270 net/ipv6/ip6_output.c:1079
 ip6_sk_dst_lookup_flow+0x5ed/0xc50 net/ipv6/ip6_output.c:1117
 udpv6_sendmsg+0x2163/0x36b0 net/ipv6/udp.c:1354
 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:622 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:632
 ___sys_sendmsg+0x51d/0x930 net/socket.c:2115
 __sys_sendmmsg+0x240/0x6f0 net/socket.c:2210
 __do_sys_sendmmsg net/socket.c:2239 [inline]
 __se_sys_sendmmsg net/socket.c:2236 [inline]
 __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2236
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x446a29
Code: e8 ac b8 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 
eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:7f4de5532db8 EFLAGS: 0246 ORIG_RAX: 0133
RAX: ffda RBX: 006dcc38 RCX: 00446a29
RDX: 00b8 RSI: 20001b00 RDI: 0003
RBP: 006dcc30 R08: 7f4de5533700 R09: 
R10:  R11: 0246 R12: 006dcc3c
R13: 7ffe2b830fdf R14: 7f4de55339c0 R15: 0001

Fixes: 71b1391a4128 ("l2tp: ensure sk->dst is still valid")
Reported-by: syzbot+05f840f3b04f211ba...@syzkaller.appspotmail.com
Signed-off-by: Wei Wang 
Signed-off-by: Martin KaFai Lau 
Cc: Guillaume Nault 
Cc: David Ahern 
Cc: Cong Wang 
---
v1->v2: Removed dst_clone() as Guillaume Nault suggested

 net/l2tp/l2tp_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 40261cb68e83..8aaf8157da2b 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1110,7 +1110,7 @@ int l2tp_xmit_skb(struct l2tp_session *session, struct 
sk_buff *skb, int hdr_len
 
/* Get routing info from the tunnel socket */
skb_dst_drop(skb);
-   skb_dst_set(skb, dst_clone(__sk_dst_check(sk, 0)));
+   skb_dst_set(skb, sk_dst_check(sk, 0));
 
inet = inet_sk(sk);
fl = &inet->cork.fl;
-- 
2.18.0.597.ga71716f1ad-goog

[PATCH net] l2tp: use sk_dst_check() to avoid race on sk->sk_dst_cache

2018-08-09 Thread Wei Wang

From: Wei Wang 

In l2tp code, if it is a L2TP_UDP_ENCAP tunnel, tunnel->sk points to a
UDP socket. A user could call sendmsg() on both this tunnel and the UDP
socket itself concurrently. As l2tp_xmit_skb() holds the socket lock and
calls __sk_dst_check() to refresh sk->sk_dst_cache, while udpv6_sendmsg()
is lockless and calls sk_dst_check() to refresh sk->sk_dst_cache, there
could be a race that causes the dst cache to be freed multiple times.
So we fix the l2tp side code to always call sk_dst_check(), to guarantee
that xchg() is called when refreshing sk->sk_dst_cache and so avoid race
conditions.

Syzkaller reported stack trace:
BUG: KASAN: use-after-free in atomic_read 
include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: use-after-free in atomic_fetch_add_unless 
include/linux/atomic.h:575 [inline]
BUG: KASAN: use-after-free in atomic_add_unless include/linux/atomic.h:597 
[inline]
BUG: KASAN: use-after-free in dst_hold_safe include/net/dst.h:308 [inline]
BUG: KASAN: use-after-free in ip6_hold_safe+0xe6/0x670 net/ipv6/route.c:1029
Read of size 4 at addr 8801aea9a880 by task syz-executor129/4829

CPU: 0 PID: 4829 Comm: syz-executor129 Not tainted 4.18.0-rc7-next-20180802+ #30
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
 atomic_fetch_add_unless include/linux/atomic.h:575 [inline]
 atomic_add_unless include/linux/atomic.h:597 [inline]
 dst_hold_safe include/net/dst.h:308 [inline]
 ip6_hold_safe+0xe6/0x670 net/ipv6/route.c:1029
 rt6_get_pcpu_route net/ipv6/route.c:1249 [inline]
 ip6_pol_route+0x354/0xd20 net/ipv6/route.c:1922
 ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2098
 fib6_rule_lookup+0x283/0x890 net/ipv6/fib6_rules.c:122
 ip6_route_output_flags+0x2c5/0x350 net/ipv6/route.c:2126
 ip6_dst_lookup_tail+0x1278/0x1da0 net/ipv6/ip6_output.c:978
 ip6_dst_lookup_flow+0xc8/0x270 net/ipv6/ip6_output.c:1079
 ip6_sk_dst_lookup_flow+0x5ed/0xc50 net/ipv6/ip6_output.c:1117
 udpv6_sendmsg+0x2163/0x36b0 net/ipv6/udp.c:1354
 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:622 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:632
 ___sys_sendmsg+0x51d/0x930 net/socket.c:2115
 __sys_sendmmsg+0x240/0x6f0 net/socket.c:2210
 __do_sys_sendmmsg net/socket.c:2239 [inline]
 __se_sys_sendmmsg net/socket.c:2236 [inline]
 __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2236
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x446a29
Code: e8 ac b8 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 
eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:7f4de5532db8 EFLAGS: 0246 ORIG_RAX: 0133
RAX: ffda RBX: 006dcc38 RCX: 00446a29
RDX: 00b8 RSI: 20001b00 RDI: 0003
RBP: 006dcc30 R08: 7f4de5533700 R09: 
R10:  R11: 0246 R12: 006dcc3c
R13: 7ffe2b830fdf R14: 7f4de55339c0 R15: 0001

Fixes: 71b1391a4128 ("l2tp: ensure sk->dst is still valid")
Reported-by: syzbot+05f840f3b04f211ba...@syzkaller.appspotmail.com
Signed-off-by: Wei Wang 
Signed-off-by: Martin KaFai Lau 
Cc: David Ahern 
Cc: Cong Wang 
---
 net/l2tp/l2tp_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 40261cb68e83..7166b61338d4 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1110,7 +1110,7 @@ int l2tp_xmit_skb(struct l2tp_session *session, struct 
sk_buff *skb, int hdr_len
 
/* Get routing info from the tunnel socket */
skb_dst_drop(skb);
-   skb_dst_set(skb, dst_clone(__sk_dst_check(sk, 0)));
+   skb_dst_set(skb, dst_clone(sk_dst_check(sk, 0)));
 
inet = inet_sk(sk);
fl = &inet->cork.fl;
-- 
2.18.0.597.ga71716f1ad-goog

[PATCH v2 net-next 0/5] tcp: add 4 new stats

2018-07-31 Thread Wei Wang

From: Wei Wang 

This patch series adds 3 RFC4898 stats:
1. tcpEStatsPerfHCDataOctetsOut
2. tcpEStatsPerfOctetsRetrans
3. tcpEStatsStackDSACKDups
and an additional stat to record the number of data packet reordering
events seen:
4. tcp_reord_seen

Together with the existing stats, applications can use them to measure
the retransmission rate in bytes, exclude spurious retransmissions
reflected by DSACK, and keep track of the reordering events on live
connections.
Bytes-based loss stats are particularly useful on networks with
different MTUs. Google servers have been using these stats for many years
to instrument transport and network performance.

Note: The first patch is a refactor to add a helper to calculate
opt_stats size in order to make later changes cleaner.

Wei Wang (5):
  tcp: add a helper to calculate size of opt_stats
  tcp: add data bytes sent stats
  tcp: add data bytes retransmitted stats
  tcp: add dsack blocks received stats
  tcp: add stat of data packet reordering events

 include/linux/tcp.h  | 13 ++--
 include/uapi/linux/tcp.h | 10 -
 net/ipv4/tcp.c   | 46 +---
 net/ipv4/tcp_input.c |  4 +++-
 net/ipv4/tcp_output.c|  2 ++
 net/ipv4/tcp_recovery.c  |  2 +-
 6 files changed, 69 insertions(+), 8 deletions(-)

-- 
2.18.0.345.g5c9ce644c3-goog

[PATCH v2 net-next 5/5] tcp: add stat of data packet reordering events

2018-07-31 Thread Wei Wang

From: Wei Wang 

Introduce a new TCP stat to record the number of reordering events seen
and expose it in both tcp_info (TCP_INFO) and opt_stats
(SOF_TIMESTAMPING_OPT_STATS).
Applications can use this stat to track the frequency of reordering
events, in addition to the existing reordering stat, which tracks the
magnitude of the latest reordering event.

Note: this new stat tracks reordering events triggered by ACKs, which
could often be fewer than the actual number of packets being delivered
out-of-order.
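
To make the "frequency" use case concrete, here is a minimal C sketch that
turns two samples of the counter into an event rate. The function name and
the idea of sampling at a fixed interval are assumptions for illustration;
only the 32-bit width of the counter comes from the patch.

```c
#include <stdint.h>

/* Reordering events per second between two samples of the counter.
 * prev and curr are successive tcpi_reord_seen values and interval_ms is
 * the time between them. Unsigned subtraction keeps the delta correct
 * across a 32-bit wraparound. */
double reord_events_per_sec(uint32_t prev, uint32_t curr, uint32_t interval_ms)
{
	uint32_t delta;

	if (interval_ms == 0)
		return 0.0;
	delta = curr - prev;	/* arithmetic modulo 2^32 */
	return (double)delta * 1000.0 / (double)interval_ms;
}
```

Note that the patch resets the counter in tcp_disconnect(), so a sampler
built this way would need to restart its baseline when a socket is reused.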

Signed-off-by: Wei Wang 
Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
---
 include/linux/tcp.h  | 4 ++--
 include/uapi/linux/tcp.h | 2 ++
 net/ipv4/tcp.c   | 4 
 net/ipv4/tcp_input.c | 3 ++-
 net/ipv4/tcp_recovery.c  | 2 +-
 5 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index da6281c549a5..263e37271afd 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -220,8 +220,7 @@ struct tcp_sock {
 #define TCP_RACK_RECOVERY_THRESH 16
u8 reo_wnd_persist:5, /* No. of recovery since last adj */
   dsack_seen:1, /* Whether DSACK seen after last adj */
-  advanced:1,   /* mstamp advanced since last lost marking */
-  reord:1;  /* reordering detected */
+  advanced:1;   /* mstamp advanced since last lost marking */
} rack;
u16 advmss; /* Advertised MSS   */
u8  compressed_ack;
@@ -267,6 +266,7 @@ struct tcp_sock {
u8  ecn_flags;  /* ECN status bits. */
u8  keepalive_probes; /* num of allowed keep alive probes   */
u32 reordering; /* Packet reordering metric.*/
+   u32 reord_seen; /* number of data packet reordering events */
u32 snd_up; /* Urgent pointer   */
 
 /*
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 0e1c0aec0153..e02d31986ff9 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -239,6 +239,7 @@ struct tcp_info {
__u64   tcpi_bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut */
__u64   tcpi_bytes_retrans;  /* RFC4898 tcpEStatsPerfOctetsRetrans */
__u32   tcpi_dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups */
+   __u32   tcpi_reord_seen; /* reordering events seen */
 };
 
 /* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */
@@ -264,6 +265,7 @@ enum {
TCP_NLA_BYTES_SENT, /* Data bytes sent including retransmission */
TCP_NLA_BYTES_RETRANS,  /* Data bytes retransmitted */
TCP_NLA_DSACK_DUPS, /* DSACK blocks received */
+   TCP_NLA_REORD_SEEN, /* reordering events seen */
 };
 
 /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d6232b598cae..31fa1c080f28 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2597,6 +2597,7 @@ int tcp_disconnect(struct sock *sk, int flags)
tp->bytes_sent = 0;
tp->bytes_retrans = 0;
tp->dsack_dups = 0;
+   tp->reord_seen = 0;
 
/* Clean up fastopen related fields */
tcp_free_fastopen_req(tp);
@@ -3207,6 +3208,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_bytes_sent = tp->bytes_sent;
info->tcpi_bytes_retrans = tp->bytes_retrans;
info->tcpi_dsack_dups = tp->dsack_dups;
+   info->tcpi_reord_seen = tp->reord_seen;
unlock_sock_fast(sk, slow);
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
@@ -3234,6 +3236,7 @@ static size_t tcp_opt_stats_get_size(void)
nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_BYTES_SENT */
nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_BYTES_RETRANS */
nla_total_size(sizeof(u32)) + /* TCP_NLA_DSACK_DUPS */
+   nla_total_size(sizeof(u32)) + /* TCP_NLA_REORD_SEEN */
0;
 }
 
@@ -3286,6 +3289,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const 
struct sock *sk)
nla_put_u64_64bit(stats, TCP_NLA_BYTES_RETRANS, tp->bytes_retrans,
  TCP_NLA_PAD);
nla_put_u32(stats, TCP_NLA_DSACK_DUPS, tp->dsack_dups);
+   nla_put_u32(stats, TCP_NLA_REORD_SEEN, tp->reord_seen);
 
return stats;
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fbc85ff7d71d..3d6156f07a8d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -906,8 +906,8 @@ static void tcp_check_sack_reordering(struct sock *sk, 
const u32 low_seq,
   
sock_net(sk)->ipv4.sysctl_tcp_max_reordering);
}
 
-   tp->rack.reord = 1;
/* This exciting event is worth to be remembered. 8) */
+   tp->reord_seen++;
NET_INC_STATS(sock_net(sk),
  ts ? LINUX_MIB_T

[PATCH v2 net-next 2/5] tcp: add data bytes sent stats

2018-07-31 Thread Wei Wang

From: Wei Wang 

Introduce a new TCP stat to record the number of bytes sent
(RFC4898 tcpEStatsPerfHCDataOctetsOut) and expose it in both tcp_info
(TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS).
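
For readers who want to consume the new field from userspace, the sketch
below reads it through getsockopt(TCP_INFO). The helper name is
hypothetical, and the code assumes kernel headers that already contain
tcpi_bytes_sent. The length check is there because a kernel that predates
the field would return a shorter structure.

```c
#include <sys/socket.h>
#include <netinet/in.h>		/* IPPROTO_TCP */
#include <linux/tcp.h>		/* struct tcp_info, TCP_INFO */
#include <stddef.h>		/* offsetof */
#include <string.h>

/* Store the socket's tcpi_bytes_sent in *bytes_sent.
 * Returns 0 on success, -1 if getsockopt() fails or the running kernel
 * did not fill in the field. */
int read_bytes_sent(int fd, unsigned long long *bytes_sent)
{
	struct tcp_info info;
	socklen_t len = sizeof(info);

	memset(&info, 0, sizeof(info));
	if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) < 0)
		return -1;

	/* Only trust the field if the returned length covers it. */
	if (len < offsetof(struct tcp_info, tcpi_bytes_sent) +
		  sizeof(info.tcpi_bytes_sent))
		return -1;

	*bytes_sent = info.tcpi_bytes_sent;
	return 0;
}
```

The same pattern should apply to the other fields appended to struct
tcp_info later in this series.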

Signed-off-by: Wei Wang 
Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
---
 include/linux/tcp.h  | 3 +++
 include/uapi/linux/tcp.h | 4 +++-
 net/ipv4/tcp.c   | 6 ++
 net/ipv4/tcp_output.c| 1 +
 4 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 58a8d7d71354..d0798dcd2cab 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -181,6 +181,9 @@ struct tcp_sock {
u32 data_segs_out;  /* RFC4898 tcpEStatsPerfDataSegsOut
 * total number of data segments sent.
 */
+   u64 bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut
+* total number of data bytes sent.
+*/
u64 bytes_acked;/* RFC4898 tcpEStatsAppHCThruOctetsAcked
 * sum(delta(snd_una)), or how many bytes
 * were acked.
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index e3f6ed8a7064..1c70ed287c3b 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -235,6 +235,8 @@ struct tcp_info {
 
__u32   tcpi_delivered;
__u32   tcpi_delivered_ce;
+
+   __u64   tcpi_bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut */
 };
 
 /* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */
@@ -257,7 +259,7 @@ enum {
TCP_NLA_SND_SSTHRESH,   /* Slow start size threshold */
TCP_NLA_DELIVERED,  /* Data pkts delivered incl. out-of-order */
TCP_NLA_DELIVERED_CE,   /* Like above but only ones w/ CE marks */
-
+   TCP_NLA_BYTES_SENT, /* Data bytes sent including retransmission */
 };
 
 /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 27bbe6a792b7..873cb9968ff5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2594,6 +2594,7 @@ int tcp_disconnect(struct sock *sk, int flags)
sk->sk_rx_dst = NULL;
tcp_saved_syn_free(tp);
tp->compressed_ack = 0;
+   tp->bytes_sent = 0;
 
/* Clean up fastopen related fields */
tcp_free_fastopen_req(tp);
@@ -3201,6 +3202,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_delivery_rate = rate64;
info->tcpi_delivered = tp->delivered;
info->tcpi_delivered_ce = tp->delivered_ce;
+   info->tcpi_bytes_sent = tp->bytes_sent;
unlock_sock_fast(sk, slow);
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
@@ -3225,6 +3227,7 @@ static size_t tcp_opt_stats_get_size(void)
nla_total_size(sizeof(u32)) + /* TCP_NLA_SND_SSTHRESH */
nla_total_size(sizeof(u32)) + /* TCP_NLA_DELIVERED */
nla_total_size(sizeof(u32)) + /* TCP_NLA_DELIVERED_CE */
+   nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_BYTES_SENT */
0;
 }
 
@@ -3272,6 +3275,9 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const 
struct sock *sk)
nla_put_u32(stats, TCP_NLA_SNDQ_SIZE, tp->write_seq - tp->snd_una);
nla_put_u8(stats, TCP_NLA_CA_STATE, inet_csk(sk)->icsk_ca_state);
 
+   nla_put_u64_64bit(stats, TCP_NLA_BYTES_SENT, tp->bytes_sent,
+ TCP_NLA_PAD);
+
return stats;
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 490df62f26d4..861531fe0e97 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1136,6 +1136,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct 
sk_buff *skb,
if (skb->len != tcp_header_size) {
tcp_event_data_sent(tp, sk);
tp->data_segs_out += tcp_skb_pcount(skb);
+   tp->bytes_sent += skb->len - tcp_header_size;
tcp_internal_pacing(sk, skb);
}
 
-- 
2.18.0.345.g5c9ce644c3-goog

[PATCH v2 net-next 4/5] tcp: add dsack blocks received stats

2018-07-31 Thread Wei Wang

From: Wei Wang 

Introduce a new TCP stat to record the number of DSACK blocks received
(RFC4898 tcpEStatsStackDSACKDups) and expose it in both tcp_info
(TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS).

Signed-off-by: Wei Wang 
Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
---
 include/linux/tcp.h  | 3 +++
 include/uapi/linux/tcp.h | 2 ++
 net/ipv4/tcp.c   | 4 
 net/ipv4/tcp_input.c | 1 +
 4 files changed, 10 insertions(+)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index fb67f9a51b95..da6281c549a5 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -188,6 +188,9 @@ struct tcp_sock {
 * sum(delta(snd_una)), or how many bytes
 * were acked.
 */
+   u32 dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups
+* total number of DSACK blocks received
+*/
u32 snd_una;/* First byte we want an ack for*/
u32 snd_sml;/* Last byte of the most recently transmitted 
small packet */
u32 rcv_tstamp; /* timestamp of last received ACK (for 
keepalives) */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index c31f5100b744..0e1c0aec0153 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -238,6 +238,7 @@ struct tcp_info {
 
__u64   tcpi_bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut */
__u64   tcpi_bytes_retrans;  /* RFC4898 tcpEStatsPerfOctetsRetrans */
+   __u32   tcpi_dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups */
 };
 
 /* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */
@@ -262,6 +263,7 @@ enum {
TCP_NLA_DELIVERED_CE,   /* Like above but only ones w/ CE marks */
TCP_NLA_BYTES_SENT, /* Data bytes sent including retransmission */
TCP_NLA_BYTES_RETRANS,  /* Data bytes retransmitted */
+   TCP_NLA_DSACK_DUPS, /* DSACK blocks received */
 };
 
 /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5ed1be88e922..d6232b598cae 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2596,6 +2596,7 @@ int tcp_disconnect(struct sock *sk, int flags)
tp->compressed_ack = 0;
tp->bytes_sent = 0;
tp->bytes_retrans = 0;
+   tp->dsack_dups = 0;
 
/* Clean up fastopen related fields */
tcp_free_fastopen_req(tp);
@@ -3205,6 +3206,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_delivered_ce = tp->delivered_ce;
info->tcpi_bytes_sent = tp->bytes_sent;
info->tcpi_bytes_retrans = tp->bytes_retrans;
+   info->tcpi_dsack_dups = tp->dsack_dups;
unlock_sock_fast(sk, slow);
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
@@ -3231,6 +3233,7 @@ static size_t tcp_opt_stats_get_size(void)
nla_total_size(sizeof(u32)) + /* TCP_NLA_DELIVERED_CE */
nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_BYTES_SENT */
nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_BYTES_RETRANS */
+   nla_total_size(sizeof(u32)) + /* TCP_NLA_DSACK_DUPS */
0;
 }
 
@@ -3282,6 +3285,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const 
struct sock *sk)
  TCP_NLA_PAD);
nla_put_u64_64bit(stats, TCP_NLA_BYTES_RETRANS, tp->bytes_retrans,
  TCP_NLA_PAD);
+   nla_put_u32(stats, TCP_NLA_DSACK_DUPS, tp->dsack_dups);
 
return stats;
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d51fa358b2b1..fbc85ff7d71d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -874,6 +874,7 @@ static void tcp_dsack_seen(struct tcp_sock *tp)
 {
tp->rx_opt.sack_ok |= TCP_DSACK_SEEN;
tp->rack.dsack_seen = 1;
+   tp->dsack_dups++;
 }
 
 /* It's reordering when higher sequence was delivered (i.e. sacked) before
-- 
2.18.0.345.g5c9ce644c3-goog

[PATCH v2 net-next 3/5] tcp: add data bytes retransmitted stats

2018-07-31 Thread Wei Wang

From: Wei Wang 

Introduce a new TCP stat to record the number of bytes retransmitted
(RFC4898 tcpEStatsPerfOctetsRetrans) and expose it in both tcp_info
(TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS).
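
On the opt_stats side the value travels as a netlink attribute. The
following C sketch shows one way a receiver might walk such an attribute
buffer to pull out a 64-bit value. All names here are hypothetical and the
header struct is a local mirror of struct nlattr; put_attr() exists only to
build a sample buffer. Real code would take the attribute type from the
TCP_NLA_* enum in <linux/tcp.h> instead of hard-coding a number.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct nlattr in <linux/netlink.h>. */
struct attr_hdr {
	uint16_t len;	/* header + payload, excluding padding */
	uint16_t type;
};

#define ATTR_TYPE_MASK	0x3fffu	/* drops the two flag bits of the type */

/* Attributes start on 4-byte boundaries. */
static size_t attr_align(size_t n)
{
	return (n + 3) & ~(size_t)3;
}

/* Append one attribute at offset off; returns the offset of the next one.
 * The caller provides a zeroed buffer that is large enough. */
size_t put_attr(unsigned char *buf, size_t off, uint16_t type,
		const void *data, uint16_t n)
{
	struct attr_hdr h = { (uint16_t)(sizeof(h) + n), type };

	memcpy(buf + off, &h, sizeof(h));
	memcpy(buf + off + sizeof(h), data, n);
	return off + attr_align(h.len);
}

/* Find a 64-bit attribute of the given type; 0 on success, -1 otherwise. */
int find_u64_attr(const unsigned char *buf, size_t buflen, uint16_t type,
		  uint64_t *out)
{
	size_t off = 0;

	while (off <= buflen && buflen - off >= sizeof(struct attr_hdr)) {
		struct attr_hdr h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.len < sizeof(h) || h.len > buflen - off)
			return -1;	/* malformed or truncated */
		if ((h.type & ATTR_TYPE_MASK) == type &&
		    h.len == sizeof(h) + sizeof(*out)) {
			memcpy(out, buf + off + sizeof(h), sizeof(*out));
			return 0;
		}
		off += attr_align(h.len);
	}
	return -1;
}
```

Because the walker skips every attribute whose type does not match, it also
steps over any TCP_NLA_PAD attribute that may precede a 64-bit value.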

Signed-off-by: Wei Wang 
Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
---
 include/linux/tcp.h  | 3 +++
 include/uapi/linux/tcp.h | 2 ++
 net/ipv4/tcp.c   | 5 +
 net/ipv4/tcp_output.c| 1 +
 4 files changed, 11 insertions(+)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d0798dcd2cab..fb67f9a51b95 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -333,6 +333,9 @@ struct tcp_sock {
 * the first SYN. */
u32 undo_marker;/* snd_una upon a new recovery episode. */
int undo_retrans;   /* number of undoable retransmissions. */
+   u64 bytes_retrans;  /* RFC4898 tcpEStatsPerfOctetsRetrans
+* Total data bytes retransmitted
+*/
u32 total_retrans;  /* Total retransmits for entire connection */
 
u32 urg_seq;/* Seq of received urgent pointer */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 1c70ed287c3b..c31f5100b744 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -237,6 +237,7 @@ struct tcp_info {
__u32   tcpi_delivered_ce;
 
__u64   tcpi_bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut */
+   __u64   tcpi_bytes_retrans;  /* RFC4898 tcpEStatsPerfOctetsRetrans */
 };
 
 /* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */
@@ -260,6 +261,7 @@ enum {
TCP_NLA_DELIVERED,  /* Data pkts delivered incl. out-of-order */
TCP_NLA_DELIVERED_CE,   /* Like above but only ones w/ CE marks */
TCP_NLA_BYTES_SENT, /* Data bytes sent including retransmission */
+   TCP_NLA_BYTES_RETRANS,  /* Data bytes retransmitted */
 };
 
 /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 873cb9968ff5..5ed1be88e922 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2595,6 +2595,7 @@ int tcp_disconnect(struct sock *sk, int flags)
tcp_saved_syn_free(tp);
tp->compressed_ack = 0;
tp->bytes_sent = 0;
+   tp->bytes_retrans = 0;
 
/* Clean up fastopen related fields */
tcp_free_fastopen_req(tp);
@@ -3203,6 +3204,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_delivered = tp->delivered;
info->tcpi_delivered_ce = tp->delivered_ce;
info->tcpi_bytes_sent = tp->bytes_sent;
+   info->tcpi_bytes_retrans = tp->bytes_retrans;
unlock_sock_fast(sk, slow);
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
@@ -3228,6 +3230,7 @@ static size_t tcp_opt_stats_get_size(void)
nla_total_size(sizeof(u32)) + /* TCP_NLA_DELIVERED */
nla_total_size(sizeof(u32)) + /* TCP_NLA_DELIVERED_CE */
nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_BYTES_SENT */
+   nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_BYTES_RETRANS */
0;
 }
 
@@ -3277,6 +3280,8 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const 
struct sock *sk)
 
nla_put_u64_64bit(stats, TCP_NLA_BYTES_SENT, tp->bytes_sent,
  TCP_NLA_PAD);
+   nla_put_u64_64bit(stats, TCP_NLA_BYTES_RETRANS, tp->bytes_retrans,
+ TCP_NLA_PAD);
 
return stats;
 }
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 861531fe0e97..50cabf7656f3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2871,6 +2871,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb, int segs)
if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
tp->total_retrans += segs;
+   tp->bytes_retrans += skb->len;
 
/* make sure skb->data is aligned on arches that require it
 * and check if ack-trimming & collapsing extended the headroom
-- 
2.18.0.345.g5c9ce644c3-goog

[PATCH v2 net-next 1/5] tcp: add a helper to calculate size of opt_stats

2018-07-31 Thread Wei Wang

From: Wei Wang 

Refactor the calculation of the opt_stats size into a helper function to
make the code cleaner and later changes easier.
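
To show what the helper adds up, here is a small userspace C sketch of the
size arithmetic. It is a model under stated assumptions, not the kernel
code: the function names are hypothetical, the header length of 4 bytes and
the 4-byte alignment follow struct nlattr, and needs_pad stands in for
architectures where a 64-bit attribute reserves room for an extra padding
attribute.

```c
#include <stddef.h>

#define ATTR_HDRLEN 4	/* size of struct nlattr */

/* Header plus payload, rounded up to a 4-byte boundary. */
size_t attr_total_size(size_t payload)
{
	return (ATTR_HDRLEN + payload + 3) & ~(size_t)3;
}

/* 64-bit attributes may need an empty padding attribute in front. */
size_t attr_total_size_64bit(size_t payload, int needs_pad)
{
	return attr_total_size(payload) +
	       (needs_pad ? attr_total_size(0) : 0);
}

/* Buffer size for a given mix of u64, u32 and u8 attributes. */
size_t opt_stats_size(unsigned n_u64, unsigned n_u32, unsigned n_u8,
		      int needs_pad)
{
	return n_u64 * attr_total_size_64bit(8, needs_pad) +
	       n_u32 * attr_total_size(4) +
	       n_u8 * attr_total_size(1);
}
```

For the 7 u64, 7 u32 and 3 u8 attributes listed in this patch, the model
gives 7*16 + 7*8 + 3*8 = 192 bytes with padding and 7*12 + 7*8 + 3*8 = 164
bytes without. Listing one term per attribute, as the helper does, keeps
these counts from drifting when a later patch adds a stat.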

Suggested-by: Stephen Hemminger 
Signed-off-by: Wei Wang 
Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
---
 net/ipv4/tcp.c | 27 ---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f3bfb9f29520..27bbe6a792b7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3205,6 +3205,29 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
 
+static size_t tcp_opt_stats_get_size(void)
+{
+   return
+   nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_BUSY */
+   nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_RWND_LIMITED */
+   nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_SNDBUF_LIMITED */
+   nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_DATA_SEGS_OUT */
+   nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_TOTAL_RETRANS */
+   nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_PACING_RATE */
+   nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_DELIVERY_RATE */
+   nla_total_size(sizeof(u32)) + /* TCP_NLA_SND_CWND */
+   nla_total_size(sizeof(u32)) + /* TCP_NLA_REORDERING */
+   nla_total_size(sizeof(u32)) + /* TCP_NLA_MIN_RTT */
+   nla_total_size(sizeof(u8)) + /* TCP_NLA_RECUR_RETRANS */
+   nla_total_size(sizeof(u8)) + /* TCP_NLA_DELIVERY_RATE_APP_LMT */
+   nla_total_size(sizeof(u32)) + /* TCP_NLA_SNDQ_SIZE */
+   nla_total_size(sizeof(u8)) + /* TCP_NLA_CA_STATE */
+   nla_total_size(sizeof(u32)) + /* TCP_NLA_SND_SSTHRESH */
+   nla_total_size(sizeof(u32)) + /* TCP_NLA_DELIVERED */
+   nla_total_size(sizeof(u32)) + /* TCP_NLA_DELIVERED_CE */
+   0;
+}
+
 struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk)
 {
const struct tcp_sock *tp = tcp_sk(sk);
@@ -3213,9 +3236,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const 
struct sock *sk)
u64 rate64;
u32 rate;
 
-   stats = alloc_skb(7 * nla_total_size_64bit(sizeof(u64)) +
- 7 * nla_total_size(sizeof(u32)) +
- 3 * nla_total_size(sizeof(u8)), GFP_ATOMIC);
+   stats = alloc_skb(tcp_opt_stats_get_size(), GFP_ATOMIC);
if (!stats)
return NULL;
 
-- 
2.18.0.345.g5c9ce644c3-goog

Re: [PATCH net-next 2/4] tcp: add data bytes retransmitted stats

2018-07-30 Thread Wei Wang

On Mon, Jul 30, 2018 at 3:14 PM Stephen Hemminger
 wrote:
>
> On Mon, 30 Jul 2018 14:59:09 -0700
> Wei Wang  wrote:
>
> > + stats = alloc_skb(9 * nla_total_size_64bit(sizeof(u64)) +
> > 7 * nla_total_size(sizeof(u32)) +
> > 3 * nla_total_size(sizeof(u8)), GFP_ATOMIC);
>
> This is getting a bit awkward.
> Maybe use the style used in other drivers that have a get_size function
> like tun_get_size().

Thanks Stephen. OK. Will add a patch at the beginning to do this refactor.

[PATCH net-next 0/4] tcp: add 4 new stats

2018-07-30 Thread Wei Wang

From: Wei Wang 

This patch series adds 3 RFC4898 stats:
1. tcpEStatsPerfHCDataOctetsOut
2. tcpEStatsPerfOctetsRetrans
3. tcpEStatsStackDSACKDups
and an additional stat to record the number of data packet reordering
events seen:
4. tcp_reord_seen

Together with the existing stats, applications can use them to measure
the retransmission rate in bytes, exclude spurious retransmissions
reflected by DSACK, and keep track of reordering events on live
connections.
Byte-based loss stats are particularly useful when networks have different
MTUs. Google servers have been using these stats for many years to
instrument transport and network performance.

Wei Wang (4):
  tcp: add data bytes sent stats
  tcp: add data bytes retransmitted stats
  tcp: add dsack blocks received stats
  tcp: add stat of data packet reordering events

 include/linux/tcp.h  | 13 +++--
 include/uapi/linux/tcp.h | 10 +-
 net/ipv4/tcp.c   | 19 +--
 net/ipv4/tcp_input.c |  4 +++-
 net/ipv4/tcp_output.c|  2 ++
 net/ipv4/tcp_recovery.c  |  2 +-
 6 files changed, 43 insertions(+), 7 deletions(-)

-- 
2.18.0.345.g5c9ce644c3-goog

[PATCH net-next 2/4] tcp: add data bytes retransmitted stats

2018-07-30 Thread Wei Wang

From: Wei Wang 

Introduce a new TCP stat to record the number of bytes retransmitted
(RFC4898 tcpEStatsPerfOctetsRetrans) and expose it in both tcp_info
(TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS).

Signed-off-by: Wei Wang 
Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
---
 include/linux/tcp.h  | 3 +++
 include/uapi/linux/tcp.h | 2 ++
 net/ipv4/tcp.c   | 6 +-
 net/ipv4/tcp_output.c| 1 +
 4 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d0798dcd2cab..fb67f9a51b95 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -333,6 +333,9 @@ struct tcp_sock {
 * the first SYN. */
u32 undo_marker;/* snd_una upon a new recovery episode. */
int undo_retrans;   /* number of undoable retransmissions. */
+   u64 bytes_retrans;  /* RFC4898 tcpEStatsPerfOctetsRetrans
+* Total data bytes retransmitted
+*/
u32 total_retrans;  /* Total retransmits for entire connection */
 
u32 urg_seq;/* Seq of received urgent pointer */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 1c70ed287c3b..c31f5100b744 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -237,6 +237,7 @@ struct tcp_info {
__u32   tcpi_delivered_ce;
 
__u64   tcpi_bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut */
+   __u64   tcpi_bytes_retrans;  /* RFC4898 tcpEStatsPerfOctetsRetrans */
 };
 
 /* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */
@@ -260,6 +261,7 @@ enum {
TCP_NLA_DELIVERED,  /* Data pkts delivered incl. out-of-order */
TCP_NLA_DELIVERED_CE,   /* Like above but only ones w/ CE marks */
TCP_NLA_BYTES_SENT, /* Data bytes sent including retransmission */
+   TCP_NLA_BYTES_RETRANS,  /* Data bytes retransmitted */
 };
 
 /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 53d3db4a3d39..372357a035e9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2595,6 +2595,7 @@ int tcp_disconnect(struct sock *sk, int flags)
tcp_saved_syn_free(tp);
tp->compressed_ack = 0;
tp->bytes_sent = 0;
+   tp->bytes_retrans = 0;
 
/* Clean up fastopen related fields */
tcp_free_fastopen_req(tp);
@@ -3203,6 +3204,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_delivered = tp->delivered;
info->tcpi_delivered_ce = tp->delivered_ce;
info->tcpi_bytes_sent = tp->bytes_sent;
+   info->tcpi_bytes_retrans = tp->bytes_retrans;
unlock_sock_fast(sk, slow);
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
@@ -3215,7 +3217,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const 
struct sock *sk)
u64 rate64;
u32 rate;
 
-   stats = alloc_skb(8 * nla_total_size_64bit(sizeof(u64)) +
+   stats = alloc_skb(9 * nla_total_size_64bit(sizeof(u64)) +
  7 * nla_total_size(sizeof(u32)) +
  3 * nla_total_size(sizeof(u8)), GFP_ATOMIC);
if (!stats)
@@ -3255,6 +3257,8 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const 
struct sock *sk)
 
nla_put_u64_64bit(stats, TCP_NLA_BYTES_SENT, tp->bytes_sent,
  TCP_NLA_PAD);
+   nla_put_u64_64bit(stats, TCP_NLA_BYTES_RETRANS, tp->bytes_retrans,
+ TCP_NLA_PAD);
 
return stats;
 }
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 861531fe0e97..50cabf7656f3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2871,6 +2871,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb, int segs)
if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
tp->total_retrans += segs;
+   tp->bytes_retrans += skb->len;
 
/* make sure skb->data is aligned on arches that require it
 * and check if ack-trimming & collapsing extended the headroom
-- 
2.18.0.345.g5c9ce644c3-goog

[PATCH net-next 1/4] tcp: add data bytes sent stats

2018-07-30 Thread Wei Wang

From: Wei Wang 

Introduce a new TCP stat to record the number of bytes sent
(RFC4898 tcpEStatsPerfHCDataOctetsOut) and expose it in both tcp_info
(TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS).

Signed-off-by: Wei Wang 
Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
---
 include/linux/tcp.h  | 3 +++
 include/uapi/linux/tcp.h | 4 +++-
 net/ipv4/tcp.c   | 7 ++-
 net/ipv4/tcp_output.c| 1 +
 4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 58a8d7d71354..d0798dcd2cab 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -181,6 +181,9 @@ struct tcp_sock {
u32 data_segs_out;  /* RFC4898 tcpEStatsPerfDataSegsOut
 * total number of data segments sent.
 */
+   u64 bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut
+* total number of data bytes sent.
+*/
u64 bytes_acked;/* RFC4898 tcpEStatsAppHCThruOctetsAcked
 * sum(delta(snd_una)), or how many bytes
 * were acked.
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index e3f6ed8a7064..1c70ed287c3b 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -235,6 +235,8 @@ struct tcp_info {
 
__u32   tcpi_delivered;
__u32   tcpi_delivered_ce;
+
+   __u64   tcpi_bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut */
 };
 
 /* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */
@@ -257,7 +259,7 @@ enum {
TCP_NLA_SND_SSTHRESH,   /* Slow start size threshold */
TCP_NLA_DELIVERED,  /* Data pkts delivered incl. out-of-order */
TCP_NLA_DELIVERED_CE,   /* Like above but only ones w/ CE marks */
-
+   TCP_NLA_BYTES_SENT, /* Data bytes sent including retransmission */
 };
 
 /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f3bfb9f29520..53d3db4a3d39 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2594,6 +2594,7 @@ int tcp_disconnect(struct sock *sk, int flags)
sk->sk_rx_dst = NULL;
tcp_saved_syn_free(tp);
tp->compressed_ack = 0;
+   tp->bytes_sent = 0;
 
/* Clean up fastopen related fields */
tcp_free_fastopen_req(tp);
@@ -3201,6 +3202,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_delivery_rate = rate64;
info->tcpi_delivered = tp->delivered;
info->tcpi_delivered_ce = tp->delivered_ce;
+   info->tcpi_bytes_sent = tp->bytes_sent;
unlock_sock_fast(sk, slow);
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
@@ -3213,7 +3215,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const 
struct sock *sk)
u64 rate64;
u32 rate;
 
-   stats = alloc_skb(7 * nla_total_size_64bit(sizeof(u64)) +
+   stats = alloc_skb(8 * nla_total_size_64bit(sizeof(u64)) +
  7 * nla_total_size(sizeof(u32)) +
  3 * nla_total_size(sizeof(u8)), GFP_ATOMIC);
if (!stats)
@@ -3251,6 +3253,9 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const 
struct sock *sk)
nla_put_u32(stats, TCP_NLA_SNDQ_SIZE, tp->write_seq - tp->snd_una);
nla_put_u8(stats, TCP_NLA_CA_STATE, inet_csk(sk)->icsk_ca_state);
 
+   nla_put_u64_64bit(stats, TCP_NLA_BYTES_SENT, tp->bytes_sent,
+ TCP_NLA_PAD);
+
return stats;
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 490df62f26d4..861531fe0e97 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1136,6 +1136,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct 
sk_buff *skb,
if (skb->len != tcp_header_size) {
tcp_event_data_sent(tp, sk);
tp->data_segs_out += tcp_skb_pcount(skb);
+   tp->bytes_sent += skb->len - tcp_header_size;
tcp_internal_pacing(sk, skb);
}
 
-- 
2.18.0.345.g5c9ce644c3-goog

[PATCH net-next 3/4] tcp: add dsack blocks received stats

2018-07-30 Thread Wei Wang

From: Wei Wang 

Introduce a new TCP stat to record the number of DSACK blocks received
(RFC4898 tcpEStatsStackDSACKDups) and expose it in both tcp_info
(TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS).

Signed-off-by: Wei Wang 
Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
---
 include/linux/tcp.h  | 3 +++
 include/uapi/linux/tcp.h | 2 ++
 net/ipv4/tcp.c   | 5 -
 net/ipv4/tcp_input.c | 1 +
 4 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index fb67f9a51b95..da6281c549a5 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -188,6 +188,9 @@ struct tcp_sock {
 * sum(delta(snd_una)), or how many bytes
 * were acked.
 */
+   u32 dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups
+* total number of DSACK blocks received
+*/
u32 snd_una;/* First byte we want an ack for*/
u32 snd_sml;/* Last byte of the most recently transmitted 
small packet */
u32 rcv_tstamp; /* timestamp of last received ACK (for 
keepalives) */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index c31f5100b744..0e1c0aec0153 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -238,6 +238,7 @@ struct tcp_info {
 
__u64   tcpi_bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut */
__u64   tcpi_bytes_retrans;  /* RFC4898 tcpEStatsPerfOctetsRetrans */
+   __u32   tcpi_dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups */
 };
 
 /* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */
@@ -262,6 +263,7 @@ enum {
TCP_NLA_DELIVERED_CE,   /* Like above but only ones w/ CE marks */
TCP_NLA_BYTES_SENT, /* Data bytes sent including retransmission */
TCP_NLA_BYTES_RETRANS,  /* Data bytes retransmitted */
+   TCP_NLA_DSACK_DUPS, /* DSACK blocks received */
 };
 
 /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 372357a035e9..a8ec6564a7ec 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2596,6 +2596,7 @@ int tcp_disconnect(struct sock *sk, int flags)
tp->compressed_ack = 0;
tp->bytes_sent = 0;
tp->bytes_retrans = 0;
+   tp->dsack_dups = 0;
 
/* Clean up fastopen related fields */
tcp_free_fastopen_req(tp);
@@ -3205,6 +3206,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_delivered_ce = tp->delivered_ce;
info->tcpi_bytes_sent = tp->bytes_sent;
info->tcpi_bytes_retrans = tp->bytes_retrans;
+   info->tcpi_dsack_dups = tp->dsack_dups;
unlock_sock_fast(sk, slow);
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
@@ -3218,7 +3220,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const 
struct sock *sk)
u32 rate;
 
stats = alloc_skb(9 * nla_total_size_64bit(sizeof(u64)) +
- 7 * nla_total_size(sizeof(u32)) +
+ 8 * nla_total_size(sizeof(u32)) +
  3 * nla_total_size(sizeof(u8)), GFP_ATOMIC);
if (!stats)
return NULL;
@@ -3259,6 +3261,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const 
struct sock *sk)
  TCP_NLA_PAD);
nla_put_u64_64bit(stats, TCP_NLA_BYTES_RETRANS, tp->bytes_retrans,
  TCP_NLA_PAD);
+   nla_put_u32(stats, TCP_NLA_DSACK_DUPS, tp->dsack_dups);
 
return stats;
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d51fa358b2b1..fbc85ff7d71d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -874,6 +874,7 @@ static void tcp_dsack_seen(struct tcp_sock *tp)
 {
tp->rx_opt.sack_ok |= TCP_DSACK_SEEN;
tp->rack.dsack_seen = 1;
+   tp->dsack_dups++;
 }
 
 /* It's reordering when higher sequence was delivered (i.e. sacked) before
-- 
2.18.0.345.g5c9ce644c3-goog

[PATCH net-next 4/4] tcp: add stat of data packet reordering events

2018-07-30 Thread Wei Wang

From: Wei Wang 

Introduce a new TCP stat to record the number of reordering events seen
and expose it in both tcp_info (TCP_INFO) and opt_stats
(SOF_TIMESTAMPING_OPT_STATS).
Applications can use this stat to track the frequency of reordering
events, in addition to the existing reordering stat, which tracks the
magnitude of the latest reordering event.

Note: this new stat tracks reordering events triggered by ACKs, which
could often be fewer than the actual number of packets being delivered
out-of-order.

Signed-off-by: Wei Wang 
Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
---
 include/linux/tcp.h  | 4 ++--
 include/uapi/linux/tcp.h | 2 ++
 net/ipv4/tcp.c   | 5 -
 net/ipv4/tcp_input.c | 3 ++-
 net/ipv4/tcp_recovery.c  | 2 +-
 5 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index da6281c549a5..263e37271afd 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -220,8 +220,7 @@ struct tcp_sock {
 #define TCP_RACK_RECOVERY_THRESH 16
u8 reo_wnd_persist:5, /* No. of recovery since last adj */
   dsack_seen:1, /* Whether DSACK seen after last adj */
-  advanced:1,   /* mstamp advanced since last lost marking */
-  reord:1;  /* reordering detected */
+  advanced:1;   /* mstamp advanced since last lost marking */
} rack;
u16 advmss; /* Advertised MSS   */
u8  compressed_ack;
@@ -267,6 +266,7 @@ struct tcp_sock {
u8  ecn_flags;  /* ECN status bits. */
u8  keepalive_probes; /* num of allowed keep alive probes   */
u32 reordering; /* Packet reordering metric.*/
+   u32 reord_seen; /* number of data packet reordering events */
u32 snd_up; /* Urgent pointer   */
 
 /*
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 0e1c0aec0153..e02d31986ff9 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -239,6 +239,7 @@ struct tcp_info {
__u64   tcpi_bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut */
__u64   tcpi_bytes_retrans;  /* RFC4898 tcpEStatsPerfOctetsRetrans */
__u32   tcpi_dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups */
+   __u32   tcpi_reord_seen; /* reordering events seen */
 };
 
 /* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */
@@ -264,6 +265,7 @@ enum {
TCP_NLA_BYTES_SENT, /* Data bytes sent including retransmission */
TCP_NLA_BYTES_RETRANS,  /* Data bytes retransmitted */
TCP_NLA_DSACK_DUPS, /* DSACK blocks received */
+   TCP_NLA_REORD_SEEN, /* reordering events seen */
 };
 
 /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a8ec6564a7ec..250b73ccb644 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2597,6 +2597,7 @@ int tcp_disconnect(struct sock *sk, int flags)
tp->bytes_sent = 0;
tp->bytes_retrans = 0;
tp->dsack_dups = 0;
+   tp->reord_seen = 0;
 
/* Clean up fastopen related fields */
tcp_free_fastopen_req(tp);
@@ -3207,6 +3208,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_bytes_sent = tp->bytes_sent;
info->tcpi_bytes_retrans = tp->bytes_retrans;
info->tcpi_dsack_dups = tp->dsack_dups;
+   info->tcpi_reord_seen = tp->reord_seen;
unlock_sock_fast(sk, slow);
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
@@ -3220,7 +3222,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk)
u32 rate;
 
stats = alloc_skb(9 * nla_total_size_64bit(sizeof(u64)) +
- 8 * nla_total_size(sizeof(u32)) +
+ 9 * nla_total_size(sizeof(u32)) +
  3 * nla_total_size(sizeof(u8)), GFP_ATOMIC);
if (!stats)
return NULL;
@@ -3262,6 +3264,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk)
nla_put_u64_64bit(stats, TCP_NLA_BYTES_RETRANS, tp->bytes_retrans,
  TCP_NLA_PAD);
nla_put_u32(stats, TCP_NLA_DSACK_DUPS, tp->dsack_dups);
+   nla_put_u32(stats, TCP_NLA_REORD_SEEN, tp->reord_seen);
 
return stats;
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fbc85ff7d71d..3d6156f07a8d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -906,8 +906,8 @@ static void tcp_check_sack_reordering(struct sock *sk, const u32 low_seq,
sock_net(sk)->ipv4.sysctl_tcp_max_reordering);
}
 
-   tp->rack.reord = 1;
/* This exciting event is worth to be remembered. 8) */
+   tp->reord_seen++;
NET_INC_STATS(sock_net(

[PATCH net] ipv6: use fib6_info_hold_safe() when necessary

2018-07-21 Thread Wei Wang

From: Wei Wang 

In code paths where only the RCU read lock is held, e.g. the route
lookup code path, it is not safe to call fib6_info_hold() directly
because the fib6_info may already have been deleted while still existing
within the RCU grace period. Taking a reference on it could cause a
double free and crash the kernel.

This patch adds a new function, fib6_info_hold_safe(), and uses it to
replace fib6_info_hold() in all necessary places.
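
The "take a reference only if the count is still non-zero" idea can be
sketched in userspace with C11 atomics. This is only an illustration
under stated assumptions: struct obj, obj_hold() and obj_hold_safe() are
hypothetical names, and the compare-and-swap loop stands in for what
atomic_inc_not_zero() provides in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object; ref == 0 means "already released". */
struct obj {
	atomic_int ref;
};

/* Unconditional hold: only safe if the caller already owns a reference. */
static void obj_hold(struct obj *o)
{
	assert(o != NULL);
	atomic_fetch_add(&o->ref, 1);
}

/*
 * Conditional hold: succeeds only while the count is still non-zero, so
 * an object whose last reference is gone is never resurrected.
 */
static bool obj_hold_safe(struct obj *o)
{
	int old;

	assert(o != NULL);
	old = atomic_load(&o->ref);
	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->ref, &old, old + 1))
			return true;
	}
	return false;
}
```

A caller on a lockless read path would check the return value of
obj_hold_safe() and fall back (for example to a null entry) when it
returns false, rather than using the object.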

Syzbot reported 3 crash traces because of this. One of them is:
8021q: adding VLAN 0 to HW filter on device team0
IPv6: ADDRCONF(NETDEV_CHANGE): team0: link becomes ready
dst_release: dst:(ptrval) refcnt:-1
dst_release: dst:(ptrval) refcnt:-2
WARNING: CPU: 1 PID: 4845 at include/net/dst.h:239 dst_hold include/net/dst.h:239 [inline]
WARNING: CPU: 1 PID: 4845 at include/net/dst.h:239 ip6_setup_cork+0xd66/0x1830 net/ipv6/ip6_output.c:1204
dst_release: dst:(ptrval) refcnt:-1
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 4845 Comm: syz-executor493 Not tainted 4.18.0-rc3+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 panic+0x238/0x4e7 kernel/panic.c:184
dst_release: dst:(ptrval) refcnt:-2
dst_release: dst:(ptrval) refcnt:-3
 __warn.cold.8+0x163/0x1ba kernel/panic.c:536
dst_release: dst:(ptrval) refcnt:-4
 report_bug+0x252/0x2d0 lib/bug.c:186
 fixup_bug arch/x86/kernel/traps.c:178 [inline]
 do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
dst_release: dst:(ptrval) refcnt:-5
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:dst_hold include/net/dst.h:239 [inline]
RIP: 0010:ip6_setup_cork+0xd66/0x1830 net/ipv6/ip6_output.c:1204
Code: c1 ed 03 89 9d 18 ff ff ff 48 b8 00 00 00 00 00 fc ff df 41 c6 44 05 00 
f8 e9 2d 01 00 00 4c 8b a5 c8 fe ff ff e8 1a f6 e6 fa <0f> 0b e9 6a fc ff ff e8 
0e f6 e6 fa 48 8b 85 d0 fe ff ff 48 8d 78
RSP: 0018:8801a8fcf178 EFLAGS: 00010293
RAX: 8801a8eba5c0 RBX:  RCX: 869511e6
RDX:  RSI: 869515b6 RDI: 0005
RBP: 8801a8fcf2c8 R08: 8801a8eba5c0 R09: ed0035ac8338
R10: ed0035ac8338 R11: 8801ad6419c3 R12: 8801a8fcf720
R13: 8801a8fcf6a0 R14: 8801ad6419c0 R15: 8801ad641980
 ip6_make_skb+0x2c8/0x600 net/ipv6/ip6_output.c:1768
 udpv6_sendmsg+0x2c90/0x35f0 net/ipv6/udp.c:1376
 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:641 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:651
 ___sys_sendmsg+0x51d/0x930 net/socket.c:2125
 __sys_sendmmsg+0x240/0x6f0 net/socket.c:2220
 __do_sys_sendmmsg net/socket.c:2249 [inline]
 __se_sys_sendmmsg net/socket.c:2246 [inline]
 __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2246
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x446ba9
Code: e8 cc bb 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 
eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:7fb39a469da8 EFLAGS: 0246 ORIG_RAX: 0133
RAX: ffda RBX: 006dcc54 RCX: 00446ba9
RDX: 00b8 RSI: 20001b00 RDI: 0003
RBP: 006dcc50 R08: 7fb39a46a700 R09: 
R10:  R11: 0246 R12: 45c828efc7a64843
R13: e6eeb815b9d8a477 R14: 5068caf6f713c6fc R15: 0001
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..

Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes")
Reported-by: syzbot+902e2a1bcd4f7808c...@syzkaller.appspotmail.com
Reported-by: syzbot+8ae62d67f647abeec...@syzkaller.appspotmail.com
Reported-by: syzbot+3f08feb1408693067...@syzkaller.appspotmail.com
Signed-off-by: Wei Wang 
Acked-by: Eric Dumazet 
---
 include/net/ip6_fib.h |  5 +
 net/ipv6/addrconf.c   |  3 ++-
 net/ipv6/route.c  | 41 +++--
 3 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 71b9043aa0e7..3d4930528db0 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -281,6 +281,11 @@ static inline void fib6_info_hold(struct fib6_info *f6i)
atomic_inc(&f6i->fib6_ref);
 }
 
+static inline bool fib6_info_hold_safe(struct fib6_info *f6i)
+{
+   return atomic_inc_not_zero(&f6i->fib6_ref);
+}
+
 static inline void fib6_info_release(struct fib6_info *f6i)
 {
if (f6i && atomic_dec_and_test(&f6i->fib6_ref))
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 91580c62bb86..f66a1cae3366 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2374,7 +2374,8 @@ static struct fib6_info *addrconf_g

[PATCH net-next] tcp: ignore rcv_rtt sample with old ts ecr value

2018-06-19 Thread Wei Wang

From: Wei Wang 

When receiving multiple packets with the same ts ecr value, only try
to compute rcv_rtt sample with the earliest received packet.
This is because the rcv_rtt calculated by later received packets
could possibly include long idle time or other types of delay.
For example:
(1) server sends last packet of reply with TS val V1
(2) client ACKs last packet of reply with TS ecr V1
(3) long idle time passes
(4) client sends next request data packet with TS ecr V1 (again!)
At this time, the rcv_rtt computed on server with TS ecr V1 will be
inflated with the idle time and should get ignored.
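
The filtering rule described above can be sketched as a small userspace
model in C. The names struct rcv_rtt_state, rcv_rtt_sample() and
rcv_rtt_note_pure_ack() are hypothetical, timestamps are plain tick
counters, and the smoothing of samples is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical receiver-side state: last TS ecr value already consumed. */
struct rcv_rtt_state {
	uint32_t last_tsecr;
};

/* A pure ACK consumes the TS ecr value without producing a sample. */
static void rcv_rtt_note_pure_ack(struct rcv_rtt_state *st, uint32_t tsecr)
{
	assert(st != NULL);
	st->last_tsecr = tsecr;
}

/*
 * Returns true and stores a sample in *rtt only for the first packet
 * carrying a given TS ecr value, and only if it is a full-sized segment.
 */
static bool rcv_rtt_sample(struct rcv_rtt_state *st, uint32_t tsecr,
			   uint32_t now, uint32_t payload_len,
			   uint32_t rcv_mss, uint32_t *rtt)
{
	assert(st != NULL && rtt != NULL);

	/* Same ecr as before: any delta may include idle time, skip it. */
	if (tsecr == st->last_tsecr)
		return false;
	st->last_tsecr = tsecr;

	if (payload_len < rcv_mss)
		return false;

	*rtt = now - tsecr;
	return true;
}
```

In the four-step scenario, the pure ACK in step (2) records V1, so the
request in step (4) that echoes V1 again is rejected instead of
producing a sample inflated by the idle period.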

Signed-off-by: Wei Wang 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
---
 include/linux/tcp.h  |  1 +
 net/ipv4/tcp.c   |  1 +
 net/ipv4/tcp_input.c | 14 +++---
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 72705eaf4b84..3dbea6610304 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -350,6 +350,7 @@ struct tcp_sock {
 #endif
 
 /* Receiver side RTT estimation */
+   u32 rcv_rtt_last_tsecr;
struct {
u32 rtt_us;
u32 seq;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 141acd92e58a..47c45d5be9f9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2563,6 +2563,7 @@ int tcp_disconnect(struct sock *sk, int flags)
sk->sk_shutdown = 0;
sock_reset_flag(sk, SOCK_DONE);
tp->srtt_us = 0;
+   tp->rcv_rtt_last_tsecr = 0;
tp->write_seq += tp->max_window + 2;
if (tp->write_seq == 0)
tp->write_seq = 1;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 355d3dffd021..76ca88f63b70 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -582,9 +582,12 @@ static inline void tcp_rcv_rtt_measure_ts(struct sock *sk,
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (tp->rx_opt.rcv_tsecr &&
-   (TCP_SKB_CB(skb)->end_seq -
-TCP_SKB_CB(skb)->seq >= inet_csk(sk)->icsk_ack.rcv_mss)) {
+   if (tp->rx_opt.rcv_tsecr == tp->rcv_rtt_last_tsecr)
+   return;
+   tp->rcv_rtt_last_tsecr = tp->rx_opt.rcv_tsecr;
+
+   if (TCP_SKB_CB(skb)->end_seq -
+   TCP_SKB_CB(skb)->seq >= inet_csk(sk)->icsk_ack.rcv_mss) {
u32 delta = tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;
u32 delta_us;
 
@@ -5475,6 +5478,11 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
tcp_ack(sk, skb, 0);
__kfree_skb(skb);
tcp_data_snd_check(sk);
+   /* When receiving pure ack in fast path, update
+* last ts ecr directly instead of calling
+* tcp_rcv_rtt_measure_ts()
+*/
+   tp->rcv_rtt_last_tsecr = tp->rx_opt.rcv_tsecr;
return;
} else { /* Header too small */
TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
-- 
2.18.0.rc1.244.gcf134e6275-goog

[PATCH bpf-next] bpf: prevent non-IPv4 socket to be added into sock hash

2018-05-30 Thread Wei Wang

From: Wei Wang 

Sock hash only supports IPv4 socket proto right now.
If a non-IPv4 socket gets stored in the BPF map, sk->sk_prot gets
overwritten with the v4 tcp prot.

Syzkaller reported the following related issue on an IPv6 socket:
BUG: KASAN: slab-out-of-bounds in ip6_dst_idev include/net/ip6_fib.h:203 [inline]
BUG: KASAN: slab-out-of-bounds in ip6_xmit+0x2002/0x23f0 net/ipv6/ip6_output.c:264
Read of size 8 at addr 8801b300edb0 by task syz-executor888/4522

CPU: 0 PID: 4522 Comm: syz-executor888 Not tainted 4.17.0-rc4+ #17
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
 ip6_dst_idev include/net/ip6_fib.h:203 [inline]
 ip6_xmit+0x2002/0x23f0 net/ipv6/ip6_output.c:264
 inet6_csk_xmit+0x377/0x630 net/ipv6/inet6_connection_sock.c:139
 tcp_transmit_skb+0x1be0/0x3e40 net/ipv4/tcp_output.c:1159
 tcp_send_syn_data net/ipv4/tcp_output.c:3441 [inline]
 tcp_connect+0x2207/0x45a0 net/ipv4/tcp_output.c:3480
 tcp_v4_connect+0x1934/0x1d50 net/ipv4/tcp_ipv4.c:272
 __inet_stream_connect+0x943/0x1120 net/ipv4/af_inet.c:655
 tcp_sendmsg_fastopen net/ipv4/tcp.c:1162 [inline]
 tcp_sendmsg_locked+0x2859/0x3ee0 net/ipv4/tcp.c:1209
 tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1447
 inet_sendmsg+0x19f/0x690 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:639
 ___sys_sendmsg+0x805/0x940 net/socket.c:2117
 __sys_sendmsg+0x115/0x270 net/socket.c:2155
 __do_sys_sendmsg net/socket.c:2164 [inline]
 __se_sys_sendmsg net/socket.c:2162 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2162
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43ff99
RSP: 002b:7ffc00bd1cf8 EFLAGS: 0217 ORIG_RAX: 002e
RAX: ffda RBX: 004002c8 RCX: 0043ff99
RDX: 2000 RSI: 2580 RDI: 0003
RBP: 006ca018 R08: 004002c8 R09: 004002c8
R10: 004002c8 R11: 0217 R12: 004018c0
R13: 00401950 R14:  R15: 

Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Reported-by: syzbot+5c063698bdbfac19f...@syzkaller.appspotmail.com
Signed-off-by: Wei Wang 
Acked-by: Eric Dumazet 
Acked-by: Willem de Bruijn 
---
 kernel/bpf/sockmap.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 3b28955a6383..0e7b88bc3e3f 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2300,6 +2300,11 @@ static int sock_hash_update_elem(struct bpf_map *map,
return -EINVAL;
}
 
+   if (skops.sk->sk_family != AF_INET) {
+   fput(socket->file);
+   return -EAFNOSUPPORT;
+   }
+
err = sock_hash_ctx_update_elem(&skops, map, key, flags);
fput(socket->file);
return err;
-- 
2.17.1.1185.g55be947832-goog

[PATCH bpf] bpf: prevent non-ipv4 socket to be added into sock map

2018-05-30 Thread Wei Wang

From: Wei Wang 

Sock map only supports IPv4 socket proto right now.
If a non-IPv4 socket gets stored in the BPF map, sk->sk_prot gets
overwritten with the v4 tcp prot.
It could potentially cause issues when invoking functions from
sk->sk_prot later in the stack.
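
The guard added by this patch amounts to an address-family check before
the socket is stored. A minimal userspace sketch follows, assuming a
hypothetical struct model_sock and helper sock_map_family_check(); the
real code also has to drop the file reference before returning.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the part of struct sock that matters here. */
struct model_sock {
	int family;	/* AF_INET, AF_INET6, ... */
};

/*
 * Returns 0 if the socket may be stored in the map, or -EAFNOSUPPORT for
 * any family other than IPv4, before its protocol ops would be replaced.
 */
static int sock_map_family_check(const struct model_sock *sk)
{
	assert(sk != NULL);
	if (sk->family != AF_INET)
		return -EAFNOSUPPORT;
	return 0;
}
```

Rejecting the update up front keeps an IPv6 socket from ever having its
protocol ops swapped for the IPv4 ones.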

Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
Signed-off-by: Wei Wang 
Acked-by: Eric Dumazet 
Acked-by: Willem de Bruijn 
---
 kernel/bpf/sockmap.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 95a84b2f10ce..1984922f99ee 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1873,6 +1873,11 @@ static int sock_map_update_elem(struct bpf_map *map,
return -EOPNOTSUPP;
}
 
+   if (skops.sk->sk_family != AF_INET) {
+   fput(socket->file);
+   return -EAFNOSUPPORT;
+   }
+
err = sock_map_ctx_update_elem(&skops, map, key, flags);
fput(socket->file);
return err;
-- 
2.17.1.1185.g55be947832-goog

[PATCH net-next] tcp: remove mss check in tcp_select_initial_window()

2018-04-26 Thread Wei Wang

From: Wei Wang <wei...@google.com>

In tcp_select_initial_window(), we only set rcv_wnd to
tcp_default_init_rwnd() if current mss > (1 << wscale). Otherwise,
rcv_wnd is kept at the full receive space of the socket, which is
far larger than tcp_default_init_rwnd().
With a larger initial rcv_wnd value, the receive buffer autotuning logic
takes longer to kick in and increase the receive buffer.

In a TCP throughput test where receiver has rmem[2] set to 125MB
(wscale is 11), we see the connection gets recvbuf limited at the
beginning of the connection and gets less throughput overall.
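
To make the effect of the removed check concrete, here is a small
userspace sketch comparing the old and new clamping. All names are
hypothetical, and model_default_init_rwnd() is an assumption about what
tcp_default_init_rwnd() returned at the time (20 segments, scaled down
for mss above 1460); treat the numbers as illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_INIT_CWND 10u	/* assumed initial cwnd, in segments */

/* Assumed default initial receive window, in segments. */
static uint32_t model_default_init_rwnd(uint32_t mss)
{
	uint32_t init_rwnd = MODEL_INIT_CWND * 2;

	assert(mss > 0);
	if (mss > 1460) {
		init_rwnd = (1460 * init_rwnd) / mss;
		if (init_rwnd < 2)
			init_rwnd = 2;
	}
	return init_rwnd;
}

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Old behaviour: clamp only when mss > (1 << wscale). */
static uint32_t init_rcv_wnd_old(uint32_t rcv_wnd, uint32_t mss,
				 unsigned int wscale, uint32_t init_rcv_wnd)
{
	if (mss > (1u << wscale)) {
		if (!init_rcv_wnd)
			init_rcv_wnd = model_default_init_rwnd(mss);
		rcv_wnd = min_u32(rcv_wnd, init_rcv_wnd * mss);
	}
	return rcv_wnd;
}

/* New behaviour: always clamp to the initial receive window. */
static uint32_t init_rcv_wnd_new(uint32_t rcv_wnd, uint32_t mss,
				 uint32_t init_rcv_wnd)
{
	if (!init_rcv_wnd)
		init_rcv_wnd = model_default_init_rwnd(mss);
	return min_u32(rcv_wnd, init_rcv_wnd * mss);
}
```

With mss = 1460 and wscale = 11, 1460 is not greater than 2048, so the
old logic leaves a 1 MB window untouched while the new logic clamps it
to 20 * 1460 = 29200 bytes.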

Signed-off-by: Wei Wang <wei...@google.com>
Acked-by: Eric Dumazet <eduma...@google.com>
Acked-by: Soheil Hassas Yeganeh <soh...@google.com>
Acked-by: Yuchung Cheng <ych...@google.com>
---
 net/ipv4/tcp_output.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 95feffb6d53f..d07c0dcc99aa 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -229,11 +229,9 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
}
}
 
-   if (mss > (1 << *rcv_wscale)) {
-   if (!init_rcv_wnd) /* Use default unless specified otherwise */
-   init_rcv_wnd = tcp_default_init_rwnd(mss);
-   *rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);
-   }
+   if (!init_rcv_wnd) /* Use default unless specified otherwise */
+   init_rcv_wnd = tcp_default_init_rwnd(mss);
+   *rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);
 
/* Set the clamp no higher than max representable value */
(*window_clamp) = min_t(__u32, U16_MAX << (*rcv_wscale), *window_clamp);
-- 
2.17.0.484.g0c8726318c-goog

Re: [PATCH net] ipv6: fix possible deadlock in rt6_age_examine_exception()

2018-03-23 Thread Wei Wang

ev_add kernel/locking/lockdep.c:1863 [inline]
>   check_prevs_add kernel/locking/lockdep.c:1976 [inline]
>   validate_chain kernel/locking/lockdep.c:2417 [inline]
>   __lock_acquire+0x30a8/0x3e00 kernel/locking/lockdep.c:3431
>   lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
>   __raw_write_lock_bh include/linux/rwlock_api_smp.h:203 [inline]
>   _raw_write_lock_bh+0x31/0x40 kernel/locking/spinlock.c:312
>   __ipv6_dev_mc_dec+0x45/0x350 net/ipv6/mcast.c:928
>   ipv6_dev_mc_dec+0x110/0x1f0 net/ipv6/mcast.c:961
>   pndisc_destructor+0x21a/0x340 net/ipv6/ndisc.c:392
>   pneigh_ifdown net/core/neighbour.c:695 [inline]
>   neigh_ifdown+0x149/0x250 net/core/neighbour.c:294
>   rt6_disable_ip+0x537/0x700 net/ipv6/route.c:3874
>   addrconf_ifdown+0x14b/0x14f0 net/ipv6/addrconf.c:3633
>   addrconf_notify+0x5f8/0x2310 net/ipv6/addrconf.c:3557
>   notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93
>   __raw_notifier_call_chain kernel/notifier.c:394 [inline]
>   raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401
>   call_netdevice_notifiers_info+0x32/0x70 net/core/dev.c:1707
>   call_netdevice_notifiers net/core/dev.c:1725 [inline]
>   __dev_notify_flags+0x262/0x430 net/core/dev.c:6960
>   dev_change_flags+0xf5/0x140 net/core/dev.c:6994
>   devinet_ioctl+0x126a/0x1ac0 net/ipv4/devinet.c:1080
>   inet_ioctl+0x184/0x310 net/ipv4/af_inet.c:919
>   packet_ioctl+0x1ff/0x310 net/packet/af_packet.c:4066
>   sock_do_ioctl+0xef/0x390 net/socket.c:957
>   sock_ioctl+0x36b/0x610 net/socket.c:1081
>   vfs_ioctl fs/ioctl.c:46 [inline]
>   do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
>   SYSC_ioctl fs/ioctl.c:701 [inline]
>   SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>   do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>   entry_SYSCALL_64_after_hwframe+0x42/0xb7

> Fixes: c757faa8bfa2 ("ipv6: prepare fib6_age() for exception table")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Cc: Wei Wang <wei...@google.com>
> Cc: Martin KaFai Lau <ka...@fb.com>
> ---

Nice fix. Thanks Eric.

Acked-by: Wei Wang <wei...@google.com>

>   net/ipv6/route.c | 13 +++--
>   1 file changed, 7 insertions(+), 6 deletions(-)

> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index b0d5c64e19780ce94feb112285ed1d85dbe07e9e..b33d057ac5eb2a85e19be59f0bceacf547cc9e59 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1626,11 +1626,10 @@ static void rt6_age_examine_exception(struct
rt6_exception_bucket *bucket,
>  struct neighbour *neigh;
>  __u8 neigh_flags = 0;

> -   neigh = dst_neigh_lookup(&rt->dst, &rt->rt6i_gateway);
> -   if (neigh) {
> +   neigh = __ipv6_neigh_lookup_noref(rt->dst.dev, &rt->rt6i_gateway);
> +   if (neigh)
>  neigh_flags = neigh->flags;
> -   neigh_release(neigh);
> -   }
> +
>  if (!(neigh_flags & NTF_ROUTER)) {
>  RT6_TRACE("purging route %p via non-router but
gateway\n",
>rt);
> @@ -1654,7 +1653,8 @@ void rt6_age_exceptions(struct rt6_info *rt,
>  if (!rcu_access_pointer(rt->rt6i_exception_bucket))
>  return;

> -   spin_lock_bh(&rt6_exception_lock);
> +   rcu_read_lock_bh();
> +   spin_lock(&rt6_exception_lock);
>  bucket = rcu_dereference_protected(rt->rt6i_exception_bucket,
>  lockdep_is_held(&rt6_exception_lock));

> @@ -1668,7 +1668,8 @@ void rt6_age_exceptions(struct rt6_info *rt,
>  bucket++;
>  }
>  }
> -   spin_unlock_bh(&rt6_exception_lock);
> +   spin_unlock(&rt6_exception_lock);
> +   rcu_read_unlock_bh();
>   }

>   struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
> --
> 2.17.0.rc0.231.g781580f067-goog

Re: [PATCH RFC net-next 16/20] net/ipv6: Cleanup exception route handling

2018-02-26 Thread Wei Wang

On Mon, Feb 26, 2018 at 3:02 PM, David Ahern <dsah...@gmail.com> wrote:
> On 2/26/18 3:29 PM, Wei Wang wrote:
>> On Sun, Feb 25, 2018 at 11:47 AM, David Ahern <dsah...@gmail.com> wrote:
>>> IPv6 FIB will only contain FIB entries with exception routes added to
>>> the FIB entry. Remove CACHE and dst checks from fib6 add and delete since
>>> they can never happen once the data type changes.
>>>
>>> Fixup the lookup functions to use a f6i name for fib lookups and retain
>>> the current rt name for return variables.
>>>
>>> Signed-off-by: David Ahern <dsah...@gmail.com>
>>> ---
>>>  net/ipv6/ip6_fib.c |  16 +--
>>>  net/ipv6/route.c   | 122 ++---
>>>  2 files changed, 71 insertions(+), 67 deletions(-)
>>>
>>> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
>>> index 5b03f7e8d850..63a91db61749 100644
>>> --- a/net/ipv6/ip6_fib.c
>>> +++ b/net/ipv6/ip6_fib.c
>>> @@ -1046,7 +1046,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, 
>>> struct rt6_info *rt,
>>>  static void fib6_start_gc(struct net *net, struct rt6_info *rt)
>>>  {
>>> if (!timer_pending(&net->ipv6.ip6_fib_timer) &&
>>> -   (rt->rt6i_flags & (RTF_EXPIRES | RTF_CACHE)))
>>> +   (rt->rt6i_flags & RTF_EXPIRES))
>>> mod_timer(&net->ipv6.ip6_fib_timer,
>>>   jiffies + net->ipv6.sysctl.ip6_rt_gc_interval);
>>>  }
>>> @@ -1097,8 +1097,6 @@ int fib6_add(struct fib6_node *root, struct rt6_info 
>>> *rt,
>>
>> This rt here should be f6i?
>>
>>>
>>> if (WARN_ON_ONCE(!atomic_read(&rt->dst.__refcnt)))
>>> return -EINVAL;
>>> -   if (WARN_ON_ONCE(rt->rt6i_flags & RTF_CACHE))
>>> -   return -EINVAL;
>>>
>>> if (info->nlh) {
>>> if (!(info->nlh->nlmsg_flags & NLM_F_CREATE))
>>> @@ -1622,8 +1620,6 @@ static void fib6_del_route(struct fib6_table *table, 
>>> struct fib6_node *fn,
>>>
>>> RT6_TRACE("fib6_del_route\n");
>>>
>>> -   WARN_ON_ONCE(rt->rt6i_flags & RTF_CACHE);
>>> -
>>> /* Unlink it */
>>> *rtp = rt->rt6_next;
>>
>> This rt here is also f6i right?
>>
>>> rt->rt6i_node = NULL;
>>> @@ -1692,21 +1688,11 @@ int fib6_del(struct rt6_info *rt, struct nl_info 
>>> *info)
>>
>> This rt here is also f6i right?
>>
>>> struct rt6_info __rcu **rtp;
>>> struct rt6_info __rcu **rtp_next;
>>>
>>> -#if RT6_DEBUG >= 2
>>> -   if (rt->dst.obsolete > 0) {
>>> -   WARN_ON(fn);
>>> -   return -ENOENT;
>>> -   }
>>> -#endif
>>> if (!fn || rt == net->ipv6.fib6_null_entry)
>>> return -ENOENT;
>>>
>>> WARN_ON(!(fn->fn_flags & RTN_RTINFO));
>>>
>>> -   /* remove cached dst from exception table */
>>> -   if (rt->rt6i_flags & RTF_CACHE)
>>> -   return rt6_remove_exception_rt(rt);
>>
>> Could you help delete rt6_remove_exception_rt() function? I don't
>> think it is used anymore.
>
> It is still used by ip6_negative_advice, ip6_link_failure and
> ip6_del_cached_rt. It can be made static; will fix.
>
Right. Missed those.

>
> The rest of your comments for this patch are renaming rt to f6i. My
> thought is to follow up with another patch that does the rename of rt to
> f6i for all fib6_info. Given how large this change is already I did not
> want to add extra diffs for that. If there is agreement to fold that
> part in now, I can do it.
Sure. Sounds good to me.

Re: [PATCH RFC net-next 10/20] net/ipv6: move expires into rt6_info

2018-02-26 Thread Wei Wang

On Mon, Feb 26, 2018 at 2:55 PM, David Ahern <dsah...@gmail.com> wrote:
> On 2/26/18 3:28 PM, Wei Wang wrote:
>>> @@ -213,11 +234,6 @@ static inline void rt6_set_expires(struct rt6_info 
>>> *rt, unsigned long expires)
>>>
>>>  static inline void rt6_update_expires(struct rt6_info *rt0, int timeout)
>>>  {
>>> -   struct rt6_info *rt;
>>> -
>>> -   for (rt = rt0; rt && !(rt->rt6i_flags & RTF_EXPIRES); rt = 
>>> rt->from);
>>> -   if (rt && rt != rt0)
>>> -   rt0->dst.expires = rt->dst.expires;
>>
>> I was wondering if we need to retain the above logic. It makes sure
>> dst.expires gets synced to its "parent" route. But it might be hard
>> because after your change, we can no longer use rt->from to refer to
>> the "parent".
>
> As I understand it, the FIB entries are cloned into pcpu, uncached and
> exception routes. We should never have an rt6_info that ever points back
> more than 1 level -- ie., the dst rt6_info points to a from representing
> the original FIB entry.
>
Yes. Agree.

> After my change 'from' will still point to the FIB entry as a fib6_info
> which has its own expires.
>
Understood. And fib6_age() is using fib6_check_expired() and
rt6_age_exceptions() is checking rt->dst.expires which I think is
correct.

> When I looked this code I was really confused. At best, the for loop
> above sets rt0->dst.expires to some value based on the 'from' but then
> the very next line calls dst_set_expires with the passed in timeout value.
>
>
>>
>>> dst_set_expires(&rt0->dst, timeout);
>>> rt0->rt6i_flags |= RTF_EXPIRES;
>>>  }
>
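
For reference, the expiry helpers discussed in this thread can be
modelled in userspace as follows. This is only a sketch: struct
model_f6i, MODEL_RTF_EXPIRES and the model_* helpers are hypothetical
stand-ins, and "now" is passed in explicitly instead of reading jiffies.
The comparison uses the same signed-difference trick as the kernel's
time_after() so that it keeps working across counter wraparound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_RTF_EXPIRES 0x1u	/* hypothetical flag value */

/* Hypothetical FIB entry carrying its own expiry time. */
struct model_f6i {
	unsigned int flags;
	unsigned long expires;
};

/* True if tick a is later than tick b, tolerating wraparound. */
static bool model_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

static void model_set_expires(struct model_f6i *f6i, unsigned long expires)
{
	assert(f6i != NULL);
	f6i->expires = expires;
	f6i->flags |= MODEL_RTF_EXPIRES;
}

static void model_clean_expires(struct model_f6i *f6i)
{
	assert(f6i != NULL);
	f6i->flags &= ~MODEL_RTF_EXPIRES;
	f6i->expires = 0;
}

/* An entry without the EXPIRES flag never expires. */
static bool model_check_expired(const struct model_f6i *f6i,
				unsigned long now)
{
	assert(f6i != NULL);
	if (f6i->flags & MODEL_RTF_EXPIRES)
		return model_time_after(now, f6i->expires);
	return false;
}
```

Keeping the expiry on the FIB entry itself means an aging pass only has
to look at the entry's flag and timestamp, without walking back through
a chain of "from" pointers.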

Re: [PATCH RFC net-next 07/20] net/ipv6: Move nexthop data to fib6_nh

2018-02-26 Thread Wei Wang

On Mon, Feb 26, 2018 at 2:47 PM, David Ahern <dsah...@gmail.com> wrote:
> On 2/26/18 3:28 PM, Wei Wang wrote:
>> On Sun, Feb 25, 2018 at 11:47 AM, David Ahern <dsah...@gmail.com> wrote:
>>> Introduce fib6_nh structure and move nexthop related data from
>>> rt6_info and rt6_info.dst to fib6_nh. References to dev, gateway or
>>> lwtstate from a FIB lookup perspective are converted to use fib6_nh;
>>> datapath references to dst version are left as is.
>>>
>>
>> My understanding is that after your whole patch series, sibling routes
>> will still have their own fib6_info. Does it make sense to make this
>> fib6_nh as an array in fib6_info so that sibling routes will share
>> fib6_info but will have their own fib6_nh as a future improvement? It
>> matches ipv4 behavior. And I think it will make the sibling route
>> handling code easier?
>
> I was not planning to. IPv6 allowing individual nexthops to be added and
> deleted is very convenient. I do agree the existing sibling route
> linkage makes the code much more complicated than it needs to be.
>
> After this set, I plan to send patches for nexthops as separate objects
> - which will have an impact on how multipath routes are done. With
> nexthop objects there will be 1 prefix route pointing to a nexthop
> object that is multipath (meaning it points in turn to a series of
> nexthop objects). This provides the simplification (no sibling linkage)
> without losing the individual nexthop add / delete option.

Got it. Thanks for the explanation.

Re: [PATCH RFC net-next 16/20] net/ipv6: Cleanup exception route handling

2018-02-26 Thread Wei Wang

On Sun, Feb 25, 2018 at 11:47 AM, David Ahern  wrote:
> IPv6 FIB will only contain FIB entries with exception routes added to
> the FIB entry. Remove CACHE and dst checks from fib6 add and delete since
> they can never happen once the data type changes.
>
> Fixup the lookup functions to use a f6i name for fib lookups and retain
> the current rt name for return variables.
>
> Signed-off-by: David Ahern 
> ---
>  net/ipv6/ip6_fib.c |  16 +--
>  net/ipv6/route.c   | 122 ++---
>  2 files changed, 71 insertions(+), 67 deletions(-)
>
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index 5b03f7e8d850..63a91db61749 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -1046,7 +1046,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, 
> struct rt6_info *rt,
>  static void fib6_start_gc(struct net *net, struct rt6_info *rt)
>  {
> if (!timer_pending(&net->ipv6.ip6_fib_timer) &&
> -   (rt->rt6i_flags & (RTF_EXPIRES | RTF_CACHE)))
> +   (rt->rt6i_flags & RTF_EXPIRES))
> mod_timer(&net->ipv6.ip6_fib_timer,
>   jiffies + net->ipv6.sysctl.ip6_rt_gc_interval);
>  }
> @@ -1097,8 +1097,6 @@ int fib6_add(struct fib6_node *root, struct rt6_info 
> *rt,

This rt here should be f6i?

>
> if (WARN_ON_ONCE(!atomic_read(&rt->dst.__refcnt)))
> return -EINVAL;
> -   if (WARN_ON_ONCE(rt->rt6i_flags & RTF_CACHE))
> -   return -EINVAL;
>
> if (info->nlh) {
> if (!(info->nlh->nlmsg_flags & NLM_F_CREATE))
> @@ -1622,8 +1620,6 @@ static void fib6_del_route(struct fib6_table *table, 
> struct fib6_node *fn,
>
> RT6_TRACE("fib6_del_route\n");
>
> -   WARN_ON_ONCE(rt->rt6i_flags & RTF_CACHE);
> -
> /* Unlink it */
> *rtp = rt->rt6_next;

This rt here is also f6i right?

> rt->rt6i_node = NULL;
> @@ -1692,21 +1688,11 @@ int fib6_del(struct rt6_info *rt, struct nl_info 
> *info)

This rt here is also f6i right?

> struct rt6_info __rcu **rtp;
> struct rt6_info __rcu **rtp_next;
>
> -#if RT6_DEBUG >= 2
> -   if (rt->dst.obsolete > 0) {
> -   WARN_ON(fn);
> -   return -ENOENT;
> -   }
> -#endif
> if (!fn || rt == net->ipv6.fib6_null_entry)
> return -ENOENT;
>
> WARN_ON(!(fn->fn_flags & RTN_RTINFO));
>
> -   /* remove cached dst from exception table */
> -   if (rt->rt6i_flags & RTF_CACHE)
> -   return rt6_remove_exception_rt(rt);

Could you help delete rt6_remove_exception_rt() function? I don't
think it is used anymore.

> -
> /*
>  *  Walk the leaf entries looking for ourself
>  */
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 3ea60e932eb9..19b91c60ee55 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1094,35 +1094,36 @@ static struct rt6_info *ip6_pol_route_lookup(struct 
> net *net,
>  struct fib6_table *table,
>  struct flowi6 *fl6, int flags)
>  {
> -   struct rt6_info *rt, *rt_cache;
> +   struct rt6_info *f6i;
> struct fib6_node *fn;
> +   struct rt6_info *rt;
>
> rcu_read_lock();
> fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
>  restart:
> -   rt = rcu_dereference(fn->leaf);
> -   if (!rt) {
> -   rt = net->ipv6.fib6_null_entry;
> +   f6i = rcu_dereference(fn->leaf);
> +   if (!f6i) {
> +   f6i = net->ipv6.fib6_null_entry;
> } else {
> -   rt = rt6_device_match(net, rt, &fl6->saddr,
> +   f6i = rt6_device_match(net, f6i, &fl6->saddr,
>   fl6->flowi6_oif, flags);
> -   if (rt->rt6i_nsiblings && fl6->flowi6_oif == 0)
> -   rt = rt6_multipath_select(rt, fl6,
> +   if (f6i->rt6i_nsiblings && fl6->flowi6_oif == 0)
> +   f6i = rt6_multipath_select(f6i, fl6,
>   fl6->flowi6_oif, flags);
> }
> -   if (rt == net->ipv6.fib6_null_entry) {
> +   if (f6i == net->ipv6.fib6_null_entry) {
> fn = fib6_backtrack(fn, &fl6->saddr);
> if (fn)
> goto restart;
> }
> +
> /* Search through exception table */
> -   rt_cache = rt6_find_cached_rt(rt, &fl6->daddr, &fl6->saddr);
> -   if (rt_cache) {
> -   rt = rt_cache;
> +   rt = rt6_find_cached_rt(f6i, &fl6->daddr, &fl6->saddr);
> +   if (rt) {
> if (ip6_hold_safe(net, &rt, true))
> dst_use_noref(&rt->dst, jiffies);
> } else {
> -   rt = ip6_create_rt_rcu(rt);
> +   rt = ip6_create_rt_rcu(f6i);
> }
>
> rcu_read_unlock();
> @@ -1204,9 +1205,6 @@ static struct rt6_info *ip6_rt_cache_alloc(struct 
>

Re: [PATCH RFC net-next 10/20] net/ipv6: move expires into rt6_info

2018-02-26 Thread Wei Wang

On Sun, Feb 25, 2018 at 11:47 AM, David Ahern  wrote:
> Add expires to rt6_info for FIB entries, and add fib6 helpers to
> manage it. Data path use of dst.expires remains.
>
> Signed-off-by: David Ahern 
> ---
>  include/net/ip6_fib.h | 26 +-
>  net/ipv6/addrconf.c   |  6 +++---
>  net/ipv6/ip6_fib.c|  8 
>  net/ipv6/ndisc.c  |  2 +-
>  net/ipv6/route.c  | 14 +++---
>  5 files changed, 36 insertions(+), 20 deletions(-)
>
> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> index da81669b9c90..3ba0bb7c7a43 100644
> --- a/include/net/ip6_fib.h
> +++ b/include/net/ip6_fib.h
> @@ -179,6 +179,7 @@ struct rt6_info {
> should_flush:1,
> unused:6;
>
> +   unsigned long   expires;
> struct dst_metrics  *fib6_metrics;
>  #define fib6_pmtu  fib6_metrics->metrics[RTAX_MTU-1]
>  #define fib6_hoplimit  fib6_metrics->metrics[RTAX_HOPLIMIT-1]
> @@ -199,6 +200,26 @@ static inline struct inet6_dev *ip6_dst_idev(struct 
> dst_entry *dst)
> return ((struct rt6_info *)dst)->rt6i_idev;
>  }
>
> +static inline void fib6_clean_expires(struct rt6_info *f6i)
> +{
> +   f6i->rt6i_flags &= ~RTF_EXPIRES;
> +   f6i->expires = 0;
> +}
> +
> +static inline void fib6_set_expires(struct rt6_info *f6i,
> +   unsigned long expires)
> +{
> +   f6i->expires = expires;
> +   f6i->rt6i_flags |= RTF_EXPIRES;
> +}
> +
> +static inline bool fib6_check_expired(const struct rt6_info *f6i)
> +{
> +   if (f6i->rt6i_flags & RTF_EXPIRES)
> +   return time_after(jiffies, f6i->expires);
> +   return false;
> +}
> +
>  static inline void rt6_clean_expires(struct rt6_info *rt)
>  {
> rt->rt6i_flags &= ~RTF_EXPIRES;
> @@ -213,11 +234,6 @@ static inline void rt6_set_expires(struct rt6_info *rt, 
> unsigned long expires)
>
>  static inline void rt6_update_expires(struct rt6_info *rt0, int timeout)
>  {
> -   struct rt6_info *rt;
> -
> -   for (rt = rt0; rt && !(rt->rt6i_flags & RTF_EXPIRES); rt = rt->from);
> -   if (rt && rt != rt0)
> -   rt0->dst.expires = rt->dst.expires;

I was wondering if we need to retain the above logic. It makes sure
dst.expires gets synced to its "parent" route. But it might be hard
because after your change, we can no longer use rt->from to refer to
the "parent".
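
To make the concern concrete, here is a rough user-space sketch of what
the removed loop did, as I read it. The types and names (sk_route,
sk_update_expires, SK_RTF_EXPIRES) are simplified stand-ins made up for
illustration, not the kernel structures, and the second half only
approximates dst_set_expires():

```c
#include <assert.h>
#include <stddef.h>

#define SK_RTF_EXPIRES 0x1u	/* stand-in for RTF_EXPIRES */

/* Simplified, hypothetical stand-in for rt6_info plus its dst_entry. */
struct sk_route {
	unsigned int flags;
	unsigned long expires;	/* models dst.expires, 0 means unset */
	struct sk_route *from;	/* "parent" route, NULL at the top */
};

/* Wrap-safe "a is before b", in the spirit of time_before(). */
static int sk_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

static void sk_update_expires(struct sk_route *rt0, unsigned long now,
			      unsigned long timeout)
{
	struct sk_route *rt;
	unsigned long expires = now + timeout;

	assert(rt0 != NULL);

	/* Walk up the ->from chain until a route carrying an expiry is found. */
	for (rt = rt0; rt && !(rt->flags & SK_RTF_EXPIRES); rt = rt->from)
		;
	/* Inherit the ancestor's expiry before applying the new timeout. */
	if (rt && rt != rt0)
		rt0->expires = rt->expires;

	/* Approximates dst_set_expires(): set if unset, otherwise only shorten. */
	if (expires == 0)
		expires = 1;
	if (rt0->expires == 0 || sk_time_before(expires, rt0->expires))
		rt0->expires = expires;
	rt0->flags |= SK_RTF_EXPIRES;
}
```

So a clone could never end up with a later expiry than the route it was
derived from, and that is the property that seems to get lost once
rt->from no longer leads to the "parent".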

> dst_set_expires(&rt0->dst, timeout);
> rt0->rt6i_flags |= RTF_EXPIRES;
>  }
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index eeecef2b83a4..478f45bf13cf 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -1202,7 +1202,7 @@ cleanup_prefix_route(struct inet6_ifaddr *ifp, unsigned 
> long expires, bool del_r
> ip6_del_rt(dev_net(ifp->idev->dev), rt);
> else {
> if (!(rt->rt6i_flags & RTF_EXPIRES))
> -   rt6_set_expires(rt, expires);
> +   fib6_set_expires(rt, expires);
> ip6_rt_put(rt);
> }
> }
> @@ -2648,9 +2648,9 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 
> *opt, int len, bool sllao)
> rt = NULL;
> } else if (addrconf_finite_timeout(rt_expires)) {
> /* not infinity */
> -   rt6_set_expires(rt, jiffies + rt_expires);
> +   fib6_set_expires(rt, jiffies + rt_expires);
> } else {
> -   rt6_clean_expires(rt);
> +   fib6_clean_expires(rt);
> }
> } else if (valid_lft) {
> clock_t expires = 0;
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index faa2b46349df..7bc23b048189 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -886,9 +886,9 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct 
> rt6_info *rt,
> if (!(iter->rt6i_flags & RTF_EXPIRES))
> return -EEXIST;
> if (!(rt->rt6i_flags & RTF_EXPIRES))
> -   rt6_clean_expires(iter);
> +   fib6_clean_expires(iter);
> else
> -   rt6_set_expires(iter, 
> rt->dst.expires);
> +   fib6_set_expires(iter, rt->expires);
> iter->fib6_pmtu = rt->fib6_pmtu;
> return -EEXIST;
> }
> @@ -1975,8 +1975,8 @@ static int fib6_age(struct rt6_info *rt, void *arg)
>  *  Routes are expired even if they are

Re: [PATCH RFC net-next 07/20] net/ipv6: Move nexthop data to fib6_nh

2018-02-26 Thread Wei Wang

On Sun, Feb 25, 2018 at 11:47 AM, David Ahern  wrote:
> Introduce fib6_nh structure and move nexthop related data from
> rt6_info and rt6_info.dst to fib6_nh. References to dev, gateway or
> lwtstate from a FIB lookup perspective are converted to use fib6_nh;
> datapath references to dst version are left as is.
>

My understanding is that after your whole patch series, sibling routes
will still have their own fib6_info. As a future improvement, would it
make sense to turn fib6_nh into an array in fib6_info, so that sibling
routes share one fib6_info but each keep their own fib6_nh? That would
match the ipv4 behavior, and I think it would make the sibling route
handling code simpler.
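
Roughly what I have in mind, as a user-space sketch only. The names
(sk_fib6_info, sk_fib6_nh, sk_fib6_info_alloc) and fields are
hypothetical and are not taken from the series; the layout just mirrors
the ipv4 idea of one info struct followed by its nexthops:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-nexthop data; the real fib6_nh carries more. */
struct sk_fib6_nh {
	int ifindex;
	int weight;
};

/* One shared route entry followed by an array of its nexthops. */
struct sk_fib6_info {
	unsigned int metric;
	int nhs;			/* number of entries in nh[] */
	struct sk_fib6_nh nh[];		/* C99 flexible array member */
};

/* Allocate a zeroed entry with room for nhs nexthops; NULL on failure. */
static struct sk_fib6_info *sk_fib6_info_alloc(int nhs)
{
	struct sk_fib6_info *fi;

	assert(nhs > 0);
	fi = calloc(1, sizeof(*fi) + (size_t)nhs * sizeof(fi->nh[0]));
	if (fi)
		fi->nhs = nhs;
	return fi;
}
```

With something like this, a multipath route would be a single entry with
nhs > 1, and walking its nexthops is a plain loop over nh[] instead of a
walk over a sibling list.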

> Signed-off-by: David Ahern 
> ---
>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |  32 ++--
>  include/net/ip6_fib.h  |  16 +-
>  include/net/ip6_route.h|   6 +-
>  net/ipv6/addrconf.c|   2 +-
>  net/ipv6/ip6_fib.c |   6 +-
>  net/ipv6/route.c   | 164 
> -
>  6 files changed, 127 insertions(+), 99 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
> b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> index 05146970c19c..90d01df783b3 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> @@ -2700,9 +2700,9 @@ mlxsw_sp_nexthop6_group_cmp(const struct 
> mlxsw_sp_nexthop_group *nh_grp,
> struct in6_addr *gw;
> int ifindex, weight;
>
> -   ifindex = mlxsw_sp_rt6->rt->dst.dev->ifindex;
> -   weight = mlxsw_sp_rt6->rt->rt6i_nh_weight;
> -   gw = &mlxsw_sp_rt6->rt->rt6i_gateway;
> +   ifindex = mlxsw_sp_rt6->rt->fib6_nh.nh_dev->ifindex;
> +   weight = mlxsw_sp_rt6->rt->fib6_nh.nh_weight;
> +   gw = &mlxsw_sp_rt6->rt->fib6_nh.nh_gw;
> if (!mlxsw_sp_nexthop6_group_has_nexthop(nh_grp, gw, ifindex,
>  weight))
> return false;
> @@ -2768,7 +2768,7 @@ mlxsw_sp_nexthop6_group_hash(struct mlxsw_sp_fib6_entry 
> *fib6_entry, u32 seed)
> struct net_device *dev;
>
> list_for_each_entry(mlxsw_sp_rt6, &fib6_entry->rt6_list, list) {
> -   dev = mlxsw_sp_rt6->rt->dst.dev;
> +   dev = mlxsw_sp_rt6->rt->fib6_nh.nh_dev;
> val ^= dev->ifindex;
> }
>
> @@ -3766,9 +3766,9 @@ mlxsw_sp_rt6_nexthop(struct mlxsw_sp_nexthop_group 
> *nh_grp,
> struct mlxsw_sp_nexthop *nh = &nh_grp->nexthops[i];
> struct rt6_info *rt = mlxsw_sp_rt6->rt;
>
> -   if (nh->rif && nh->rif->dev == rt->dst.dev &&
> +   if (nh->rif && nh->rif->dev == rt->fib6_nh.nh_dev &&
> ipv6_addr_equal((const struct in6_addr *) &nh->gw_addr,
> -   &rt->rt6i_gateway))
> +   &rt->fib6_nh.nh_gw))
> return nh;
> continue;
> }
> @@ -3825,7 +3825,7 @@ mlxsw_sp_fib6_entry_offload_set(struct 
> mlxsw_sp_fib_entry *fib_entry)
>
> if (fib_entry->type == MLXSW_SP_FIB_ENTRY_TYPE_LOCAL) {
> list_first_entry(&fib6_entry->rt6_list, struct mlxsw_sp_rt6,
> -list)->rt->rt6i_nh_flags |= RTNH_F_OFFLOAD;
> +list)->rt->fib6_nh.nh_flags |= 
> RTNH_F_OFFLOAD;
> return;
> }
>
> @@ -3835,9 +3835,9 @@ mlxsw_sp_fib6_entry_offload_set(struct 
> mlxsw_sp_fib_entry *fib_entry)
>
> nh = mlxsw_sp_rt6_nexthop(nh_grp, mlxsw_sp_rt6);
> if (nh && nh->offloaded)
> -   mlxsw_sp_rt6->rt->rt6i_nh_flags |= RTNH_F_OFFLOAD;
> +   mlxsw_sp_rt6->rt->fib6_nh.nh_flags |= RTNH_F_OFFLOAD;
> else
> -   mlxsw_sp_rt6->rt->rt6i_nh_flags &= ~RTNH_F_OFFLOAD;
> +   mlxsw_sp_rt6->rt->fib6_nh.nh_flags &= ~RTNH_F_OFFLOAD;
> }
>  }
>
> @@ -3852,7 +3852,7 @@ mlxsw_sp_fib6_entry_offload_unset(struct 
> mlxsw_sp_fib_entry *fib_entry)
> list_for_each_entry(mlxsw_sp_rt6, &fib6_entry->rt6_list, list) {
> struct rt6_info *rt = mlxsw_sp_rt6->rt;
>
> -   rt->rt6i_nh_flags &= ~RTNH_F_OFFLOAD;
> +   rt->fib6_nh.nh_flags &= ~RTNH_F_OFFLOAD;
> }
>  }
>
> @@ -4748,8 +4748,8 @@ static bool mlxsw_sp_nexthop6_ipip_type(const struct 
> mlxsw_sp *mlxsw_sp,
> const struct rt6_info *rt,
> enum mlxsw_sp_ipip_type *ret)
>  {
> -   return rt->dst.dev &&
> -  mlxsw_sp_netdev_ipip_type(mlxsw_sp, rt->dst.dev, ret);
> +   return rt->fib6_nh.nh_dev &&
> +

Re: [vhost:vhost 20/20] ERROR: "page_poisoning_enabled" [drivers/virtio/virtio_balloon.ko] undefined!

2018-02-06 Thread Wei Wang


On 02/07/2018 09:26 AM, kbuild test robot wrote:

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost
head:   96bcd04462b99e2c80e09f6537770a0ca6b288d0
commit: 96bcd04462b99e2c80e09f6537770a0ca6b288d0 [20/20] virtio-balloon: 
VIRTIO_BALLOON_F_FREE_PAGE_HINT
config: ia64-allmodconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 7.2.0
reproduce:
 wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
 chmod +x ~/bin/make.cross
 git checkout 96bcd04462b99e2c80e09f6537770a0ca6b288d0
 # save the attached .config to linux build tree
 make.cross ARCH=ia64

All errors (new ones prefixed by >>):

WARNING: modpost: missing MODULE_LICENSE() in 
drivers/auxdisplay/img-ascii-lcd.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in drivers/gpio/gpio-ath79.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in drivers/gpio/gpio-iop.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in drivers/iio/accel/kxsd9-i2c.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in 
drivers/iio/adc/qcom-vadc-common.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in 
drivers/media/platform/mtk-vcodec/mtk-vcodec-common.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in 
drivers/media/platform/tegra-cec/tegra_cec.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in drivers/mtd/nand/denali_pci.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in 
drivers/pinctrl/pxa/pinctrl-pxa2xx.o
see include/linux/module.h for more information
WARNING: modpost: missing MODULE_LICENSE() in 
drivers/power/reset/zx-reboot.o
see include/linux/module.h for more information

ERROR: "page_poisoning_enabled" [drivers/virtio/virtio_balloon.ko] undefined!


page_poisoning_enabled needs to be exposed. I'll send a small patch to 
add EXPORT_SYMBOL_GPL(page_poisoning_enabled).



Best,
Wei

Re: suspicious RCU usage at net/ipv6/ip6_fib.c:LINE

2018-02-01 Thread Wei Wang

On Thu, Feb 1, 2018 at 2:56 PM, Eric Biggers  wrote:
> +wei...@google.com
>
> On Tue, Jan 02, 2018 at 03:58:02PM -0800, syzbot wrote:
>> Hello,
>>
>> syzkaller hit the following crash on
>> 6bb8824732f69de0f233ae6b1a8158e149627b38
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
>> compiler: gcc (GCC) 7.1.1 20170620
>> .config is attached
>> Raw console output is attached.
>> C reproducer is attached
>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>> for information about syzkaller reproducers
>>
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+bca7109dba5d86cb0...@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for
>> details.
>> If you forward the report, please keep this part and the footer.
>>
>>
>> =
>> WARNING: suspicious RCU usage
>> 4.15.0-rc5+ #171 Not tainted
>> -
>> net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!
>>
>> other info that might help us debug this:
>>
>>
>> rcu_scheduler_active = 2, debug_locks = 1
>> 4 locks held by swapper/0/0:
>>  #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: []
>> lockdep_copy_map include/linux/lockdep.h:178 [inline]
>>  #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: []
>> call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
>>  #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>]
>> spin_lock_bh include/linux/spinlock.h:315 [inline]
>>  #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>]
>> fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
>>  #2:  (rcu_read_lock){}, at: [<91db762d>]
>> __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
>>  #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] spin_lock_bh
>> include/linux/spinlock.h:315 [inline]
>>  #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>]
>> __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948
>>
>> stack backtrace:
>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Call Trace:
>>  
>>  __dump_stack lib/dump_stack.c:17 [inline]
>>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>>  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
>>  fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
>>  fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
>>  fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
>>  fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
>>  fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
>>  __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
>>  fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
>>  fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
>>  fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
>>  call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
>>  expire_timers kernel/time/timer.c:1357 [inline]
>>  __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
>>  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
>>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>>  invoke_softirq kernel/softirq.c:365 [inline]
>>  irq_exit+0x1cc/0x200 kernel/softirq.c:405
>>  exiting_irq arch/x86/include/asm/apic.h:540 [inline]
>>  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
>>  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
>>  
>> RIP: 0010:native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:54
>> RSP: 0018:86407c38 EFLAGS: 0282 ORIG_RAX: ff11
>> RAX: dc00 RBX: 10c80f8a RCX: 
>> RDX: 10c99048 RSI: 0001 RDI: 864c8240
>> RBP: 86407c38 R08:  R09: 
>> R10:  R11:  R12: 
>> R13: 86407cf0 R14: 86c329a0 R15: 
>>  arch_safe_halt arch/x86/include/asm/paravirt.h:93 [inline]
>>  default_idle+0xbf/0x460 arch/x86/kernel/process.c:355
>>  arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:346
>>  default_idle_call+0x36/0x90 kernel/sched/idle.c:98
>>  cpuidle_idle_call kernel/sched/idle.c:156 [inline]
>>  do_idle+0x24a/0x3b0 kernel/sched/idle.c:246
>>  cpu_startup_entry+0x104/0x120 kernel/sched/idle.c:351
>>  rest_init+0xed/0xf0 init/main.c:435
>>  start_kernel+0x7f1/0x819 init/main.c:713
>>  x86_64_start_reservations+0x2a/0x2c arch/x86/kernel/head64.c:378
>>  x86_64_start_kernel+0x77/0x7a arch/x86/kernel/head64.c:359
>>  secondary_startup_64+0xa5/0xb0 arch/x86/kernel/head_64.S:237
>
> Wei, was this meant to have been fixed by commit 4512c43eac7e00 ("ipv6: remove
> null_entry before adding default route")?  This crash actually still occurred 
> on

Yes. I would think so.

> net-next as recently as 43df215d99e604 (Jan 25) which included your fix.  I 
> will
> go ahead and mark this bug fixed, but syzbot will send a new report if it hits
> it again.

Re: general protection fault in fib6_add (2)

2018-01-30 Thread Wei Wang

On Tue, Jan 30, 2018 at 5:16 PM, Eric Biggers <ebigge...@gmail.com> wrote:
> On Wed, Jan 03, 2018 at 10:53:02AM -0800, 'Wei Wang' via syzkaller-bugs wrote:
>> On Wed, Jan 3, 2018 at 8:16 AM, David Ahern <dsah...@gmail.com> wrote:
>> > [ +wei...@google.com ]
>> >
>> > On 1/2/18 3:58 PM, syzbot wrote:
>> >> Hello,
>> >>
>> >> syzkaller hit the following crash on
>> >> 61233580f1f33c50e159c50e24d80ffd2ba2e06b
>> >> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>> >> compiler: gcc (GCC) 7.1.1 20170620
>> >> .config is attached
>> >> Raw console output is attached.
>> >> C reproducer is attached
>> >> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>> >> for information about syzkaller reproducers
>> >>
>> >>
>> >> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> >> Reported-by: syzbot+0693adff3f83403dc...@syzkaller.appspotmail.com
>> >> It will help syzbot understand when the bug is fixed. See footer for
>> >> details.
>> >> If you forward the report, please keep this part and the footer.
>
> This crash seemed to stop occurring around Jan 8.  Wei, next time can you 
> please
> use the suggested Reported-by line so that syzbot knows which commit is meant 
> to
> fix the bug?  I assume it was this one:
Noted. Will do.

>
> #syz fix: ipv6: fix general protection fault in fib6_add()
>
Yes. Correct.

> Thanks!
>
> - Eric
>
>> >>
>> >> audit: type=1400 audit(1514594846.496:7): avc:  denied  { map } for
>> >> pid=3201 comm="syzkaller001778" path="/root/syzkaller001778299"
>> >> dev="sda1" ino=16481
>> >> scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
>> >> tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1
>> >> IPv6: Can't replace route, no match found
>> >> kasan: CONFIG_KASAN_INLINE enabled
>> >> kasan: GPF could be caused by NULL-ptr deref or user memory access
>> >> general protection fault:  [#1] SMP KASAN
>> >> Dumping ftrace buffer:
>> >>(ftrace buffer empty)
>> >> Modules linked in:
>> >> CPU: 0 PID: 3201 Comm: syzkaller001778 Not tainted 4.15.0-rc5+ #151
>> >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> >> Google 01/01/2011
>> >> RIP: 0010:fib6_add+0x736/0x15a0 net/ipv6/ip6_fib.c:1244
>>
>> pn could be NULL if fib6_add_1() failed. Will submit a fix for this.
>>
>> >> RSP: 0018:8801c7626a70 EFLAGS: 00010202
>> >> RAX: dc00 RBX: 0020 RCX: 84794465
>> >> RDX: 0004 RSI: 8801d38935f0 RDI: 0282
>> >> RBP: 8801c7626da0 R08: 110038ec4c35 R09: 
>> >> R10: 8801c7626c68 R11:  R12: fffe
>> >> R13:  R14:  R15: 0009
>> >> FS:  () GS:8801db20(0063)
>> >> knlGS:09b70840
>> >> CS:  0010 DS: 002b ES: 002b CR0: 80050033
>> >> CR2: 20be1000 CR3: 0001d585a006 CR4: 001606f0
>> >> DR0:  DR1:  DR2: 
>> >> DR3:  DR6: fffe0ff0 DR7: 0400
>> >> Call Trace:
>> >>  __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1006
>> >>  ip6_route_multipath_add+0xd14/0x16c0 net/ipv6/route.c:3833
>> >>  inet6_rtm_newroute+0xdc/0x160 net/ipv6/route.c:3957
>> >>  rtnetlink_rcv_msg+0x733/0x1020 net/core/rtnetlink.c:4411
>> >>  netlink_rcv_skb+0x21e/0x460 net/netlink/af_netlink.c:2408
>> >>  rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4423
>> >>  netlink_unicast_kernel net/netlink/af_netlink.c:1275 [inline]
>> >>  netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1301
>> >>  netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1864
>> >>  sock_sendmsg_nosec net/socket.c:636 [inline]
>> >>  sock_sendmsg+0xca/0x110 net/socket.c:646
>> >>  sock_write_iter+0x31a/0x5d0 net/socket.c:915
>> >>  call_write_iter include/linux/fs.h:1772 [inline]
>> >>  do_iter_readv_writev+0x525/0x7f0 fs/read_write.c:653
>> >>  do_iter_write+0x154/0x540 fs/read_write.c:932
>> >>  compat_writev+0x225/0x420 fs/read_write.c:1246
>> >>  do_compat_write

Re: [PATCH net] ipv6: change route cache aging logic

2018-01-26 Thread Wei Wang

On Fri, Jan 26, 2018 at 12:05 PM, Martin KaFai Lau <ka...@fb.com> wrote:
> On Fri, Jan 26, 2018 at 11:40:17AM -0800, Wei Wang wrote:
>> From: Wei Wang <wei...@google.com>
>>
>> In current route cache aging logic, if a route has both RTF_EXPIRE and
>> RTF_GATEWAY set, the route will only be removed if the neighbor cache
>> has no RTN_ROUTE flag. Otherwise, even if the route has expired, it
> You meant NTF_ROUTER instead of RTN_ROUTE?
>
Yes. NTF_ROUTER flag. Sorry...

>> won't get deleted.
>> Fix this logic to always check if the route has expired first and then
>> do the gateway neighbor cache check if previous check decide to not
>> remove the exception entry.
>>
>> Fixes: 1859bac04fb6 ("ipv6: remove from fib tree aged out RTF_CACHE dst")
>> Signed-off-by: Wei Wang <wei...@google.com>
>> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Nice catch!
>
> Acked-by: Martin KaFai Lau <ka...@fb.com>
>
>> ---
>>  net/ipv6/route.c | 20 
>>  1 file changed, 12 insertions(+), 8 deletions(-)
>>
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index 0458b761f3c5..a560fb1d0230 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -1586,12 +1586,19 @@ static void rt6_age_examine_exception(struct 
>> rt6_exception_bucket *bucket,
>>* EXPIRES exceptions - e.g. pmtu-generated ones are pruned when
>>* expired, independently from their aging, as per RFC 8201 section 4
>>*/
>> - if (!(rt->rt6i_flags & RTF_EXPIRES) &&
>> - time_after_eq(now, rt->dst.lastuse + gc_args->timeout)) {
>> - RT6_TRACE("aging clone %p\n", rt);
>> + if (!(rt->rt6i_flags & RTF_EXPIRES)) {
>> + if (time_after_eq(now, rt->dst.lastuse + gc_args->timeout)) {
>> + RT6_TRACE("aging clone %p\n", rt);
>> + rt6_remove_exception(bucket, rt6_ex);
>> + return;
>> + }
>> + } else if (time_after(jiffies, rt->dst.expires)) {
>> + RT6_TRACE("purging expired route %p\n", rt);
>>   rt6_remove_exception(bucket, rt6_ex);
>>   return;
>> - } else if (rt->rt6i_flags & RTF_GATEWAY) {
>> + }
>> +
>> + if (rt->rt6i_flags & RTF_GATEWAY) {
>>   struct neighbour *neigh;
>>   __u8 neigh_flags = 0;
>>
>> @@ -1606,11 +1613,8 @@ static void rt6_age_examine_exception(struct 
>> rt6_exception_bucket *bucket,
>>   rt6_remove_exception(bucket, rt6_ex);
>>   return;
>>   }
>> - } else if (__rt6_check_expired(rt)) {
>> - RT6_TRACE("purging expired route %p\n", rt);
>> - rt6_remove_exception(bucket, rt6_ex);
>> - return;
>>   }
>> +
>>   gc_args->more++;
>>  }
>>
>> --
>> 2.16.0.rc1.238.g530d649a79-goog
>>

[PATCH net] ipv6: change route cache aging logic

2018-01-26 Thread Wei Wang

From: Wei Wang <wei...@google.com>

In the current route cache aging logic, if a route has both RTF_EXPIRES
and RTF_GATEWAY set, the route will only be removed if the neighbor
cache has no NTF_ROUTER flag. Otherwise, even if the route has expired,
it won't get deleted.
Fix this logic to always check if the route has expired first and then
do the gateway neighbor cache check if the previous check decides not to
remove the exception entry.

Fixes: 1859bac04fb6 ("ipv6: remove from fib tree aged out RTF_CACHE dst")
Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
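Note for reviewers: the check order after this patch can be summarised
with the user-space sketch below. Everything in it is a simplified
stand-in invented for illustration (sk_exception, sk_should_remove, the
SK_* flags, and a neigh_is_router boolean replacing the neighbour lookup
and its NTF_ROUTER test); it is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_RTF_EXPIRES 0x1u	/* stand-in for RTF_EXPIRES */
#define SK_RTF_GATEWAY 0x2u	/* stand-in for RTF_GATEWAY */

/* Simplified, hypothetical view of one cached exception route. */
struct sk_exception {
	unsigned int flags;
	unsigned long lastuse;	/* models rt->dst.lastuse */
	unsigned long expires;	/* models rt->dst.expires */
	bool neigh_is_router;	/* stands in for the NTF_ROUTER lookup */
};

/* Wrap-safe comparisons in the spirit of time_after()/time_after_eq(). */
static bool sk_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

static bool sk_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Returns true when the exception entry should be removed. */
static bool sk_should_remove(const struct sk_exception *ex,
			     unsigned long now, unsigned long gc_timeout)
{
	assert(ex != NULL);

	if (!(ex->flags & SK_RTF_EXPIRES)) {
		/* No explicit expiry: age out on idle time. */
		if (sk_after_eq(now, ex->lastuse + gc_timeout))
			return true;
	} else if (sk_after(now, ex->expires)) {
		/* Explicit expiry is checked before any gateway logic. */
		return true;
	}

	/* Only routes that survived the checks above reach the gateway test. */
	if ((ex->flags & SK_RTF_GATEWAY) && !ex->neigh_is_router)
		return true;

	return false;
}
```

The case this patch is about is an entry with both flags set whose
gateway neighbour is still a router: before the change the expiry test
was never reached for it, with the order above it is removed once
expired.
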
 net/ipv6/route.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0458b761f3c5..a560fb1d0230 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1586,12 +1586,19 @@ static void rt6_age_examine_exception(struct 
rt6_exception_bucket *bucket,
 * EXPIRES exceptions - e.g. pmtu-generated ones are pruned when
 * expired, independently from their aging, as per RFC 8201 section 4
 */
-   if (!(rt->rt6i_flags & RTF_EXPIRES) &&
-   time_after_eq(now, rt->dst.lastuse + gc_args->timeout)) {
-   RT6_TRACE("aging clone %p\n", rt);
+   if (!(rt->rt6i_flags & RTF_EXPIRES)) {
+   if (time_after_eq(now, rt->dst.lastuse + gc_args->timeout)) {
+   RT6_TRACE("aging clone %p\n", rt);
+   rt6_remove_exception(bucket, rt6_ex);
+   return;
+   }
+   } else if (time_after(jiffies, rt->dst.expires)) {
+   RT6_TRACE("purging expired route %p\n", rt);
rt6_remove_exception(bucket, rt6_ex);
return;
-   } else if (rt->rt6i_flags & RTF_GATEWAY) {
+   }
+
+   if (rt->rt6i_flags & RTF_GATEWAY) {
struct neighbour *neigh;
__u8 neigh_flags = 0;
 
@@ -1606,11 +1613,8 @@ static void rt6_age_examine_exception(struct 
rt6_exception_bucket *bucket,
rt6_remove_exception(bucket, rt6_ex);
return;
}
-   } else if (__rt6_check_expired(rt)) {
-   RT6_TRACE("purging expired route %p\n", rt);
-   rt6_remove_exception(bucket, rt6_ex);
-   return;
}
+
gc_args->more++;
 }
 
-- 
2.16.0.rc1.238.g530d649a79-goog

Re: [PATCH net] ipv6: don't let tb6_root node share routes with other node

2018-01-19 Thread Wei Wang

On Fri, Jan 19, 2018 at 1:36 PM, Wei Wang <wei...@google.com> wrote:
>
>
> On Fri, Jan 19, 2018 at 1:13 PM, Ido Schimmel <ido...@idosch.org> wrote:
>> Hi Wei, Martin,
>>
>> On Thu, Jan 18, 2018 at 03:31:29PM -0800, Wei Wang wrote:
>>> On Thu, Jan 18, 2018 at 2:47 PM, Martin KaFai Lau <ka...@fb.com> wrote:
>>> > On Thu, Jan 18, 2018 at 10:40:03AM -0800, Wei Wang wrote:
>>> >> From: Wei Wang <wei...@google.com>
>>> >>
>>> >> After commit 4512c43eac7e, if we add a route to the subtree of
>>> >> tb6_root
>>> >> which does not have any route attached to it yet, the current code
>>> >> will
>>> >> let tb6_root and the node in the subtree share the same route.
>>> >> This could cause problem cause tb6_root has RTN_INFO flag marked and
>>> >> the
>>> > You meant the RTN_RTINFO check in fib6_purge_rt()?
>>> >
>>> Yes. Exactly.
>>
>> The check in fib6_purge_rt() is indeed problematic as tb6_root will not
>> release its reference on the deleted route. I can easily reproduce that
>> on my system. However, I don't understand how come we end up with a
>> use-after-free given tb6_root takes a reference on the route?
>>

(Resending in plain text format)

Hi Ido,

I think the use-after-free does not really happen on the route that is being
falsely shared, but on the route that its rt6i_next points to.
Nothing prevents rt->rt6i_next from being released.

Thanks.
Wei

>> Thanks
>>
>>>
>>> >> tree repair and clean up code will not work properly.
>>> >> This commit makes sure tb6_root->leaf points back to null_entry
>>> >> instead
>>> >> of sharing route with other node.
>>> >>
>>> >> It fixes the following syzkaller reported issue:
>>> >> BUG: KASAN: use-after-free in ipv6_prefix_equal include/net/ipv6.h:540
>>> >> [inline]
>>> >> BUG: KASAN: use-after-free in fib6_add_1+0x165f/0x1790
>>> >> net/ipv6/ip6_fib.c:618
>>> >> Read of size 8 at addr 8801bc043498 by task syz-executor5/19819
>>> >>
>>> >> CPU: 1 PID: 19819 Comm: syz-executor5 Not tainted 4.15.0-rc7+ #186
>>> >> Hardware name: Google Google Compute Engine/Google Compute Engine,
>>> >> BIOS Google 01/01/2011
>>> >> Call Trace:
>>> >>  __dump_stack lib/dump_stack.c:17 [inline]
>>> >>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>>> >>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>>> >>  kasan_report_error mm/kasan/report.c:351 [inline]
>>> >>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>>> >>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
>>> >>  ipv6_prefix_equal include/net/ipv6.h:540 [inline]
>>> >>  fib6_add_1+0x165f/0x1790 net/ipv6/ip6_fib.c:618
>>> >>  fib6_add+0x5fa/0x1540 net/ipv6/ip6_fib.c:1214
>>> >>  __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1003
>>> >>  ip6_route_add+0x141/0x190 net/ipv6/route.c:2790
>>> >>  ipv6_route_ioctl+0x4db/0x6b0 net/ipv6/route.c:3299
>>> >>  inet6_ioctl+0xef/0x1e0 net/ipv6/af_inet6.c:520
>>> >>  sock_do_ioctl+0x65/0xb0 net/socket.c:958
>>> >>  sock_ioctl+0x2c2/0x440 net/socket.c:1055
>>> >>  vfs_ioctl fs/ioctl.c:46 [inline]
>>> >>  do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
>>> >>  SYSC_ioctl fs/ioctl.c:701 [inline]
>>> >>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>>> >>  entry_SYSCALL_64_fastpath+0x23/0x9a
>>> >> RIP: 0033:0x452ac9
>>> >> RSP: 002b:7fd42b321c58 EFLAGS: 0212 ORIG_RAX: 0010
>>> >> RAX: ffda RBX: 0071bea0 RCX: 00452ac9
>>> >> RDX: 20fd7000 RSI: 890b RDI: 0013
>>> >> RBP: 049e R08:  R09: 
>>> >> R10:  R11: 0212 R12: 006f4f70
>>> >> R13:  R14: 7fd42b3226d4 R15: 
>>> >>
>>> >> Fixes: 4512c43eac7e ("ipv6: remove null_entry before adding default
>>> >> route")
>>> >> Signed-off-by: Wei Wang <wei...@google.com>
>>> >> Acked-by: Eric Dumazet <eduma...@google.com>
>>> >> ---
>>> >>  net/ipv6/ip6_fib.c | 10 --

Re: [PATCH net] ipv6: don't let tb6_root node share routes with other node

2018-01-18 Thread Wei Wang

On Thu, Jan 18, 2018 at 2:47 PM, Martin KaFai Lau <ka...@fb.com> wrote:
> On Thu, Jan 18, 2018 at 10:40:03AM -0800, Wei Wang wrote:
>> From: Wei Wang <wei...@google.com>
>>
>> After commit 4512c43eac7e, if we add a route to the subtree of tb6_root
>> which does not have any route attached to it yet, the current code will
>> let tb6_root and the node in the subtree share the same route.
>> This could cause problem cause tb6_root has RTN_INFO flag marked and the
> You meant the RTN_RTINFO check in fib6_purge_rt()?
>
Yes. Exactly.

>> tree repair and clean up code will not work properly.
>> This commit makes sure tb6_root->leaf points back to null_entry instead
>> of sharing route with other node.
>>
>> It fixes the following syzkaller reported issue:
>> BUG: KASAN: use-after-free in ipv6_prefix_equal include/net/ipv6.h:540 
>> [inline]
>> BUG: KASAN: use-after-free in fib6_add_1+0x165f/0x1790 net/ipv6/ip6_fib.c:618
>> Read of size 8 at addr 8801bc043498 by task syz-executor5/19819
>>
>> CPU: 1 PID: 19819 Comm: syz-executor5 Not tainted 4.15.0-rc7+ #186
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
>> Google 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:17 [inline]
>>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>>  kasan_report_error mm/kasan/report.c:351 [inline]
>>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
>>  ipv6_prefix_equal include/net/ipv6.h:540 [inline]
>>  fib6_add_1+0x165f/0x1790 net/ipv6/ip6_fib.c:618
>>  fib6_add+0x5fa/0x1540 net/ipv6/ip6_fib.c:1214
>>  __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1003
>>  ip6_route_add+0x141/0x190 net/ipv6/route.c:2790
>>  ipv6_route_ioctl+0x4db/0x6b0 net/ipv6/route.c:3299
>>  inet6_ioctl+0xef/0x1e0 net/ipv6/af_inet6.c:520
>>  sock_do_ioctl+0x65/0xb0 net/socket.c:958
>>  sock_ioctl+0x2c2/0x440 net/socket.c:1055
>>  vfs_ioctl fs/ioctl.c:46 [inline]
>>  do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
>>  SYSC_ioctl fs/ioctl.c:701 [inline]
>>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>>  entry_SYSCALL_64_fastpath+0x23/0x9a
>> RIP: 0033:0x452ac9
>> RSP: 002b:7fd42b321c58 EFLAGS: 0212 ORIG_RAX: 0010
>> RAX: ffda RBX: 0071bea0 RCX: 00452ac9
>> RDX: 20fd7000 RSI: 890b RDI: 0013
>> RBP: 0000049e R08:  R09: 
>> R10:  R11: 0212 R12: 006f4f70
>> R13:  R14: 7fd42b3226d4 R15: 
>>
>> Fixes: 4512c43eac7e ("ipv6: remove null_entry before adding default route")
>> Signed-off-by: Wei Wang <wei...@google.com>
>> Acked-by: Eric Dumazet <eduma...@google.com>
>> ---
>>  net/ipv6/ip6_fib.c | 10 --
>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
>> index 9dcc3924a975..217683d40f12 100644
>> --- a/net/ipv6/ip6_fib.c
>> +++ b/net/ipv6/ip6_fib.c
>> @@ -1226,8 +1226,14 @@ int fib6_add(struct fib6_node *root, struct rt6_info 
>> *rt,
>>   }
>>
>>   if (!rcu_access_pointer(fn->leaf)) {
>> - atomic_inc(&rt->rt6i_ref);
>> - rcu_assign_pointer(fn->leaf, rt);
>> + if (fn->fn_flags & RTN_TL_ROOT) {
>> + /* put back null_entry for root node */
>> + rcu_assign_pointer(fn->leaf,
>> + info->nl_net->ipv6.ip6_null_entry);
>> + } else {
>> + atomic_inc(&rt->rt6i_ref);
>> + rcu_assign_pointer(fn->leaf, rt);
>> + }
>>   }
>>   fn = sn;
>>   }
>> --
>> 2.16.0.rc1.238.g530d649a79-goog
>>

[PATCH net] ipv6: don't let tb6_root node share routes with other node

2018-01-18 Thread Wei Wang

From: Wei Wang <wei...@google.com>

After commit 4512c43eac7e, if we add a route to the subtree of tb6_root
which does not have any route attached to it yet, the current code will
let tb6_root and the node in the subtree share the same route.
This could cause problems because tb6_root has the RTN_RTINFO flag set,
so the tree repair and cleanup code will not work properly.
This commit makes sure tb6_root->leaf points back to null_entry instead
of sharing a route with another node.

It fixes the following syzkaller reported issue:
BUG: KASAN: use-after-free in ipv6_prefix_equal include/net/ipv6.h:540 [inline]
BUG: KASAN: use-after-free in fib6_add_1+0x165f/0x1790 net/ipv6/ip6_fib.c:618
Read of size 8 at addr 8801bc043498 by task syz-executor5/19819

CPU: 1 PID: 19819 Comm: syz-executor5 Not tainted 4.15.0-rc7+ #186
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 print_address_description+0x73/0x250 mm/kasan/report.c:252
 kasan_report_error mm/kasan/report.c:351 [inline]
 kasan_report+0x25b/0x340 mm/kasan/report.c:409
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
 ipv6_prefix_equal include/net/ipv6.h:540 [inline]
 fib6_add_1+0x165f/0x1790 net/ipv6/ip6_fib.c:618
 fib6_add+0x5fa/0x1540 net/ipv6/ip6_fib.c:1214
 __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1003
 ip6_route_add+0x141/0x190 net/ipv6/route.c:2790
 ipv6_route_ioctl+0x4db/0x6b0 net/ipv6/route.c:3299
 inet6_ioctl+0xef/0x1e0 net/ipv6/af_inet6.c:520
 sock_do_ioctl+0x65/0xb0 net/socket.c:958
 sock_ioctl+0x2c2/0x440 net/socket.c:1055
 vfs_ioctl fs/ioctl.c:46 [inline]
 do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
 SYSC_ioctl fs/ioctl.c:701 [inline]
 SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
 entry_SYSCALL_64_fastpath+0x23/0x9a
RIP: 0033:0x452ac9
RSP: 002b:7fd42b321c58 EFLAGS: 0212 ORIG_RAX: 0010
RAX: ffda RBX: 0071bea0 RCX: 00452ac9
RDX: 20fd7000 RSI: 890b RDI: 0013
RBP: 049e R08:  R09: 
R10:  R11: 0212 R12: 006f4f70
R13:  R14: 7fd42b3226d4 R15: 

Fixes: 4512c43eac7e ("ipv6: remove null_entry before adding default route")
Signed-off-by: Wei Wang <wei...@google.com>
Acked-by: Eric Dumazet <eduma...@google.com>
---
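Note for reviewers: the intended behaviour for an empty leaf can be
summarised with the user-space sketch below. The types and names
(sk_node, sk_route, sk_fill_empty_leaf, SK_RTN_TL_ROOT) are simplified
stand-ins invented for illustration; RCU and atomic refcounting are
deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define SK_RTN_TL_ROOT 0x1u	/* stand-in for RTN_TL_ROOT */

/* Simplified, hypothetical route with a plain reference count. */
struct sk_route {
	int refcnt;
};

/* Simplified, hypothetical tree node. */
struct sk_node {
	unsigned int flags;
	struct sk_route *leaf;
};

/*
 * Give a leafless node something to point at: the table root gets the
 * null entry back, any other node takes a reference on rt.
 */
static void sk_fill_empty_leaf(struct sk_node *fn, struct sk_route *rt,
			       struct sk_route *null_entry)
{
	assert(fn != NULL && rt != NULL && null_entry != NULL);

	if (fn->leaf)
		return;

	if (fn->flags & SK_RTN_TL_ROOT) {
		/* The root never shares a route with a subtree node. */
		fn->leaf = null_entry;
	} else {
		rt->refcnt++;
		fn->leaf = rt;
	}
}
```

The point of the root branch is that tb6_root ends up in the same state
it had before any route was added, so later cleanup never sees a route
that also hangs off a subtree node.
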
 net/ipv6/ip6_fib.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 9dcc3924a975..217683d40f12 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1226,8 +1226,14 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
}
 
if (!rcu_access_pointer(fn->leaf)) {
-   atomic_inc(&rt->rt6i_ref);
-   rcu_assign_pointer(fn->leaf, rt);
+   if (fn->fn_flags & RTN_TL_ROOT) {
+   /* put back null_entry for root node */
+   rcu_assign_pointer(fn->leaf,
+   info->nl_net->ipv6.ip6_null_entry);
+   } else {
+   atomic_inc(&rt->rt6i_ref);
+   rcu_assign_pointer(fn->leaf, rt);
+   }
}
fn = sn;
}
-- 
2.16.0.rc1.238.g530d649a79-goog

Re: KASAN: slab-out-of-bounds Write in tcp_v6_syn_recv_sock

2018-01-17 Thread Wei Wang

On Wed, Jan 3, 2018 at 3:31 PM, Cong Wang  wrote:
>
> On Wed, Jan 3, 2018 at 12:55 PM, Ozgur  wrote:
> >
> >
> > 03.01.2018, 21:57, "Cong Wang" :
> >> On Tue, Jan 2, 2018 at 3:58 PM, syzbot
> >>  wrote:
> >>>  Hello,
> >>>
> >>>  syzkaller hit the following crash on
> >>>  61233580f1f33c50e159c50e24d80ffd2ba2e06b
> >>>  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> >>>  compiler: gcc (GCC) 7.1.1 20170620
> >>>  .config is attached
> >>>  Raw console output is attached.
> >>>  C reproducer is attached
> >>>  syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> >>>  for information about syzkaller reproducers
> >>>
> >>>  IMPORTANT: if you fix the bug, please add the following tag to the 
> >>> commit:
> >>>  Reported-by: syzbot+6dc95bddc6976b800...@syzkaller.appspotmail.com
> >>>  It will help syzbot understand when the bug is fixed. See footer for
> >>>  details.
> >>>  If you forward the report, please keep this part and the footer.
> >>>
> >>>  TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending
> >>>  cookies. Check SNMP counters.
> >>>  ==
> >>>  BUG: KASAN: slab-out-of-bounds in memcpy include/linux/string.h:344 
> >>> [inline]
> >>>  BUG: KASAN: slab-out-of-bounds in tcp_v6_syn_recv_sock+0x628/0x23a0
> >>>  net/ipv6/tcp_ipv6.c:1144
> >>>  Write of size 160 at addr 8801cbdd7460 by task syzkaller545407/3196
> >>>
> >>>  CPU: 1 PID: 3196 Comm: syzkaller545407 Not tainted 4.15.0-rc5+ #241
> >>>  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> >>>  Google 01/01/2011
> >>>  Call Trace:
> >>>   
> >>>   __dump_stack lib/dump_stack.c:17 [inline]
> >>>   dump_stack+0x194/0x257 lib/dump_stack.c:53
> >>>   print_address_description+0x73/0x250 mm/kasan/report.c:252
> >>>   kasan_report_error mm/kasan/report.c:351 [inline]
> >>>   kasan_report+0x25b/0x340 mm/kasan/report.c:409
> >>>   check_memory_region_inline mm/kasan/kasan.c:260 [inline]
> >>>   check_memory_region+0x137/0x190 mm/kasan/kasan.c:267
> >>>   memcpy+0x37/0x50 mm/kasan/kasan.c:303
> >>>   memcpy include/linux/string.h:344 [inline]
> >>>   tcp_v6_syn_recv_sock+0x628/0x23a0 net/ipv6/tcp_ipv6.c:1144
> >>
> >> tls_init() changes sk->sk_prot from IPv6 to IPv4, which leads
> >> to this bug. I guess IPv6 is not supported for TLS? If so, need
> >> a check on proto in tls_init()...
> >
> > Hello,
> >
> > I think IPv6 is supported with TLS.
> > There was a previously posted commit by Mellanox:
> >
> > https://patchwork.ozlabs.org/patch/801530/
>
> Good to know.
>
> Can you resend the fix? It could probably fix another warning
> reported by syzbot too.

What is the status on resending the patch?
(https://patchwork.ozlabs.org/patch/801530/)
Another slab-out-of-bounds report that could be fixed by it:

syzbot has found reproducer for the following crash on
6bd39bc3da0f4a301fae69c4a32db2768f5118be
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
compiler: gcc (GCC) 7.1.1 20170620

==
BUG: KASAN: slab-out-of-bounds in ip6_dst_idev
include/net/ip6_fib.h:192 [inline]
BUG: KASAN: slab-out-of-bounds in ip6_xmit+0x1ce9/0x2090
net/ipv6/ip6_output.c:264
Read of size 8 at addr 8801d99a1018 by task syzkaller716686/3664

CPU: 0 PID: 3664 Comm: syzkaller716686 Not tainted 4.15.0-rc7+ #187
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 print_address_description+0x73/0x250 mm/kasan/report.c:252
 kasan_report_error mm/kasan/report.c:351 [inline]
 kasan_report+0x25b/0x340 mm/kasan/report.c:409
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
 ip6_dst_idev include/net/ip6_fib.h:192 [inline]
 ip6_xmit+0x1ce9/0x2090 net/ipv6/ip6_output.c:264
 inet6_csk_xmit+0x2fc/0x580 net/ipv6/inet6_connection_sock.c:139
 tcp_transmit_skb+0x1b1b/0x38c0 net/ipv4/tcp_output.c:1176
 tcp_send_syn_data net/ipv4/tcp_output.c:3457 [inline]
 tcp_connect+0x1edb/0x4090 net/ipv4/tcp_output.c:3496
 tcp_v4_connect+0x15ef/0x1e70 net/ipv4/tcp_ipv4.c:257
 __inet_stream_connect+0x2d4/0xf00 net/ipv4/af_inet.c:620
 tcp_sendmsg_fastopen net/ipv4/tcp.c:1168 [inline]
 tcp_sendmsg_locked+0x264e/0x3c70 net/ipv4/tcp.c:1214
 tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1463
 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:764
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:640
 ___sys_sendmsg+0x320/0x8b0 net/socket.c:2020
 __sys_sendmmsg+0x1ee/0x620 net/socket.c:2110
 SYSC_sendmmsg net/socket.c:2141 [inline]
 SyS_sendmmsg+0x35/0x60 net/socket.c:2136
 entry_SYSCALL_64_fastpath+0x23/0x9a
RIP: 0033:0x43fdd9
RSP: 002b:7ffcaf235288 EFLAGS: 0217 ORIG_RAX: 0133
RAX:

Re: [bisected] Forwarded packets occasionally has loopback output interface in Netfilter

2018-01-11 Thread Wei Wang

On Thu, Jan 11, 2018 at 9:25 AM, Anders K. Pedersen | Cohaesio
 wrote:
> On tir, 2017-12-26 at 12:05 +0100, Anders K. Pedersen | Cohaesio wrote:
>> Hello,
>>
>> On one of our border routers, Netfilter is occasionally logging packets
>> with "OUT=lo" (output interface lo) even though the packet should be
>> going out via a regular interface. This behavior is present on Linux
>> 4.13.0 to 4.14.9, and a bisection of the problem points to
>>
>> [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly
>>
>> as the first bad commit. This commit adds dst_dev_put() calls before
>> some dst_release() calls, and dst_dev_put() does
>>
>> dst->dev = dev_net(dst->dev)->loopback_dev;
>>
>> (among other things), which fits the problem we're seeing.
>>
>> The essential part of our nftables rule set that shows this behavior is
>>
>> chain forward {
>> type filter hook forward priority 0;
>>
>> meta oif { $internal_interfaces } accept
>>
>> meta oif lo ip daddr != 127.0.0.0/8 \
>> log group 0 snaplen 80 prefix "oif-lo" counter
>>
>> ip saddr { $our_ip_series } \
>> flow table acct_out \
>> { meta oif . rt nexthop . ip saddr timeout 12m 
>> counter } \
>> accept
>>
>> log group 0 snaplen 80 prefix "DROP" counter drop
>> }
>>
>> The router only does stateless packet filtering and no redirection or
>> rewriting of the packets (connection tracking, NAT, ipvs etc. are not
>> even compiled for this kernel).
>>
>> As a result of this problem we see packets that should be going to an
>> internal interface (and thus accepted by the first rule above) being
>> logged and dropped by the last rule. Some examples:
>>
>> Dec 22 11:57:02 cix4 oif-lo IN=eth10 OUT=lo 
>> MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 
>> DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF PROTO=TCP 
>> SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 WINDOW=3295 ACK URGP=0 MARK=0
>> Dec 22 11:57:02 cix4 DROP IN=eth10 OUT=lo 
>> MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 
>> DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF PROTO=TCP 
>> SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 WINDOW=3295 ACK URGP=0 MARK=0
>>
>> Dec 22 12:47:07 cix4 oif-lo IN=eth10 OUT=lo 
>> MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 
>> DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF PROTO=TCP 
>> SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 WINDOW=65535 ACK PSH URGP=0 
>> MARK=0
>> Dec 22 12:47:07 cix4 DROP IN=eth10 OUT=lo 
>> MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 
>> DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF PROTO=TCP 
>> SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 WINDOW=65535 ACK PSH URGP=0 
>> MARK=0
>>
>> Dec 22 12:53:56 cix4 oif-lo IN=eth10 OUT=lo 
>> MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 
>> DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF PROTO=TCP 
>> SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 WINDOW=65535 ACK PSH URGP=0 
>> MARK=0
>> Dec 22 12:53:56 cix4 DROP IN=eth10 OUT=lo 
>> MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 
>> DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF PROTO=TCP 
>> SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 WINDOW=65535 ACK PSH URGP=0 
>> MARK=0
>>
>> It also happens for outbound traffic, where the packets are logged and
>> counted in the acct_out flow table with "meta oif" = "lo", but a
>> correct "rt nexthop" - an example:
>>
>> Dec 22 12:29:13 cix4 oif-lo IN=team0.20 OUT=lo 
>> MAC=3c:fd:fe:15:db:a8:00:24:a8:ff:f0:00:08:00 SRC=212.97.129.25 
>> DST=95.166.119.129 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=19481 DF PROTO=TCP 
>> SPT=443 DPT=52560 SEQ=3034827396 ACK=2862814901 WINDOW=12618 ACK URGP=0 
>> MARK=0
>>
>> # nft list flow table filter acct_out|tr ',' '\n'|grep lo
>> flow table acct_out {
>>  "lo" . 94.101.208.217 . 212.97.129.25 expires 3m17s : counter packets 1 
>> bytes 40
>>
>> I don't know if these packets are actually sent out on the correct
>> outbound interface thanks to the proper nexthop (the MAC= information
>> in the Netfilter log is from the received packet and thus not useful
>> here).
>>
>> I tried running a tcpdump on the lo interface to see if these packets
>> would show up there, but during the three days I had it running, it
>> only logged one such packet, while Netfilter logs 20+ outbound packets
>> every day, and the one packet logged by tcpdump was *not* logged by
>> Netfilter.
>
> Further testing of the individual parts of the first bad commit shows
> that the first five additions of the dst_dev_put() call don't trigger
> the problem, while the last one does (also without the first five), so
> the problematic part is:
>
> diff --git a/net/ipv4/route.c

[PATCH net v2] ipv6: remove null_entry before adding default route

2018-01-08 Thread Wei Wang

From: Wei Wang <wei...@google.com>

In the current code, when creating a new fib6 table, tb6_root.leaf gets
initialized to net->ipv6.ip6_null_entry.
If a default route is being added with rt->rt6i_metric = 0x,
fib6_add() will add this route after net->ipv6.ip6_null_entry. As
null_entry is shared, it could cause problems.

In order to fix it, set fn->leaf to NULL before calling
fib6_add_rt2node() when trying to add the first default route.
And reset fn->leaf to null_entry when adding fails or when deleting the
last default route.
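
The three steps above (detach the placeholder before the first insert, put
it back if the insert fails, put it back after the last delete) can be
sketched in a small userspace model. This is only an illustration, not the
kernel code: every toy_* name is hypothetical, toy_null_entry merely stands
in for net->ipv6.ip6_null_entry, and RCU and locking are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; all names here are hypothetical, not kernel symbols. */
struct toy_route {
    struct toy_route *next;   /* sibling routes hanging off the same node */
    int refcnt;
};

struct toy_node {
    struct toy_route *leaf;
    bool is_root;
};

/* Shared placeholder, standing in for net->ipv6.ip6_null_entry. */
static struct toy_route toy_null_entry;

/* Detach the shared placeholder from the root before inserting, so a
 * new route is never chained behind an object shared across tables. */
static void toy_prepare_root(struct toy_node *fn)
{
    if (fn->is_root && fn->leaf == &toy_null_entry)
        fn->leaf = NULL;
}

/* Insert rt at the head of fn's list; 'fail' simulates an insert error. */
static int toy_add(struct toy_node *fn, struct toy_route *rt, bool fail)
{
    toy_prepare_root(fn);
    if (!fail) {
        rt->next = fn->leaf;
        rt->refcnt++;
        fn->leaf = rt;
        return 0;
    }
    /* Failure path: an empty root gets the placeholder back. */
    if (fn->is_root && !fn->leaf)
        fn->leaf = &toy_null_entry;
    return -1;
}

/* Delete the head route; removing the last one from the root restores
 * the placeholder, matching how the table was created. */
static void toy_del_head(struct toy_node *fn)
{
    struct toy_route *rt = fn->leaf;

    assert(rt != NULL && rt != &toy_null_entry);
    fn->leaf = rt->next;
    rt->next = NULL;
    rt->refcnt--;
    if (fn->is_root && !fn->leaf)
        fn->leaf = &toy_null_entry;
}
```

In this model the placeholder's next pointer is never written, which is the
property the fix is after: the shared entry is never linked into a
per-table route list.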

syzkaller reported the following issue which is fixed by this commit:

WARNING: suspicious RCU usage
4.15.0-rc5+ #171 Not tainted
-
net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
4 locks held by swapper/0/0:
 #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: [<d43f631b>] 
lockdep_copy_map include/linux/lockdep.h:178 [inline]
 #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: [<d43f631b>] 
call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
 #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] 
spin_lock_bh include/linux/spinlock.h:315 [inline]
 #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] 
fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
 #2:  (rcu_read_lock){}, at: [<91db762d>] 
__fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
 #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] spin_lock_bh 
include/linux/spinlock.h:315 [inline]
 #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] 
__fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
 fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
 fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
 fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
 fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
 fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
 __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
 fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
 fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
 fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
 call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
 expire_timers kernel/time/timer.c:1357 [inline]
 __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
 run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:540 [inline]
 smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
 

Reported-by: syzbot <syzkal...@googlegroups.com>
Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Signed-off-by: Wei Wang <wei...@google.com>
---
 net/ipv6/ip6_fib.c | 38 +-
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index d11a5578e4f8..9dcc3924a975 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -640,6 +640,11 @@ static struct fib6_node *fib6_add_1(struct net *net,
if (!(fn->fn_flags & RTN_RTINFO)) {
RCU_INIT_POINTER(fn->leaf, NULL);
rt6_release(leaf);
+   /* remove null_entry in the root node */
+   } else if (fn->fn_flags & RTN_TL_ROOT &&
+  rcu_access_pointer(fn->leaf) ==
+  net->ipv6.ip6_null_entry) {
+   RCU_INIT_POINTER(fn->leaf, NULL);
}
 
return fn;
@@ -1270,13 +1275,17 @@ int fib6_add(struct fib6_node *root, struct rt6_info 
*rt,
return err;
 
 failure:
-   /* fn->leaf could be NULL if fn is an intermediate node and we
-* failed to add the new route to it in both subtree creation
-* failure and fib6_add_rt2node() failure case.
-* In both cases, fib6_repair_tree() should be called to fix
-* fn->leaf.
+   /* fn->leaf could be NULL and fib6_repair_tree() needs to be called if:
+* 1. fn is an intermediate node and we failed to add the new
+* route to it in both subtree creation failure and fib6_add_rt2node()
+* failure case.
+* 2. fn is the root node in the table and we fail to add the fir

Re: [PATCH net] ipv6: remove null_entry before adding default route

2018-01-07 Thread Wei Wang

On Sat, Jan 6, 2018 at 10:16 PM, Martin KaFai Lau <ka...@fb.com> wrote:
> On Sat, Jan 06, 2018 at 05:41:28PM -0800, Wei Wang wrote:
>> On Fri, Jan 5, 2018 at 11:42 PM, Martin KaFai Lau <ka...@fb.com> wrote:
>> > On Fri, Jan 05, 2018 at 05:38:35PM -0800, Wei Wang wrote:
>> >> From: Wei Wang <wei...@google.com>
>> >>
>> >> In the current code, when creating a new fib6 table, tb6_root.leaf gets
>> >> initialized to net->ipv6.ip6_null_entry.
>> >> If a default route is being added with rt->rt6i_metric = 0x,
>> >> fib6_add() will add this route after net->ipv6.ip6_null_entry. As
>> >> null_entry is shared, it could cause problems.
>> >>
>> >> In order to fix it, set fn->leaf to NULL before calling
>> >> fib6_add_rt2node() when trying to add the first default route.
>> >> And reset fn->leaf to null_entry when adding fails or when deleting the
>> >> last default route.
>> >>
>> >> syzkaller reported the following issue which is fixed by this commit:
>> >> =
>> >> WARNING: suspicious RCU usage
>> >> 4.15.0-rc5+ #171 Not tainted
>> >> -
>> >> net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!
>> >>
>> >> other info that might help us debug this:
>> >>
>> >> rcu_scheduler_active = 2, debug_locks = 1
>> >> 4 locks held by swapper/0/0:
>> >>  #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: [<d43f631b>] 
>> >> lockdep_copy_map include/linux/lockdep.h:178 [inline]
>> >>  #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: [<d43f631b>] 
>> >> call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
>> >>  #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] 
>> >> spin_lock_bh include/linux/spinlock.h:315 [inline]
>> >>  #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] 
>> >> fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
>> >>  #2:  (rcu_read_lock){}, at: [<91db762d>] 
>> >> __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
>> >>  #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] 
>> >> spin_lock_bh include/linux/spinlock.h:315 [inline]
>> >>  #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] 
>> >> __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948
>> >>
>> >> stack backtrace:
>> >> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
>> >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
>> >> Google 01/01/2011
>> >> Call Trace:
>> >>  
>> >>  __dump_stack lib/dump_stack.c:17 [inline]
>> >>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>> >>  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
>> >>  fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
>> >>  fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
>> >>  fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
>> >>  fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
>> >>  fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
>> >>  __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
>> >>  fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
>> >>  fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
>> >>  fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
>> >>  call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
>> >>  expire_timers kernel/time/timer.c:1357 [inline]
>> >>  __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
>> >>  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
>> >>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>> >>  invoke_softirq kernel/softirq.c:365 [inline]
>> >>  irq_exit+0x1cc/0x200 kernel/softirq.c:405
>> >>  exiting_irq arch/x86/include/asm/apic.h:540 [inline]
>> >>  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
>> >>  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
>> >>  
>> >>
>> >> Reported-by: syzbot <syzkal...@googlegroups.com>
>> >> Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in 
>> >> fib6_table")
>> >> Signed-off-by: Wei Wang <wei...@google.com>
>> >> ---
>>

Re: [PATCH net] ipv6: remove null_entry before adding default route

2018-01-06 Thread Wei Wang

On Fri, Jan 5, 2018 at 11:42 PM, Martin KaFai Lau <ka...@fb.com> wrote:
> On Fri, Jan 05, 2018 at 05:38:35PM -0800, Wei Wang wrote:
>> From: Wei Wang <wei...@google.com>
>>
>> In the current code, when creating a new fib6 table, tb6_root.leaf gets
>> initialized to net->ipv6.ip6_null_entry.
>> If a default route is being added with rt->rt6i_metric = 0x,
>> fib6_add() will add this route after net->ipv6.ip6_null_entry. As
>> null_entry is shared, it could cause problems.
>>
>> In order to fix it, set fn->leaf to NULL before calling
>> fib6_add_rt2node() when trying to add the first default route.
>> And reset fn->leaf to null_entry when adding fails or when deleting the
>> last default route.
>>
>> syzkaller reported the following issue which is fixed by this commit:
>> =
>> WARNING: suspicious RCU usage
>> 4.15.0-rc5+ #171 Not tainted
>> -
>> net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!
>>
>> other info that might help us debug this:
>>
>> rcu_scheduler_active = 2, debug_locks = 1
>> 4 locks held by swapper/0/0:
>>  #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: [<d43f631b>] 
>> lockdep_copy_map include/linux/lockdep.h:178 [inline]
>>  #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: [<d43f631b>] 
>> call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
>>  #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] 
>> spin_lock_bh include/linux/spinlock.h:315 [inline]
>>  #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] 
>> fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
>>  #2:  (rcu_read_lock){}, at: [<91db762d>] 
>> __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
>>  #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] spin_lock_bh 
>> include/linux/spinlock.h:315 [inline]
>>  #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] 
>> __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948
>>
>> stack backtrace:
>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
>> Google 01/01/2011
>> Call Trace:
>>  
>>  __dump_stack lib/dump_stack.c:17 [inline]
>>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>>  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
>>  fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
>>  fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
>>  fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
>>  fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
>>  fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
>>  __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
>>  fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
>>  fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
>>  fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
>>  call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
>>  expire_timers kernel/time/timer.c:1357 [inline]
>>  __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
>>  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
>>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>>  invoke_softirq kernel/softirq.c:365 [inline]
>>  irq_exit+0x1cc/0x200 kernel/softirq.c:405
>>  exiting_irq arch/x86/include/asm/apic.h:540 [inline]
>>  smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
>>  apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
>>  
>>
>> Reported-by: syzbot <syzkal...@googlegroups.com>
>> Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in 
>> fib6_table")
>> Signed-off-by: Wei Wang <wei...@google.com>
>> ---
>>  net/ipv6/ip6_fib.c | 45 +++--
>>  1 file changed, 35 insertions(+), 10 deletions(-)
>>
>> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
>> index d11a5578e4f8..37cb4ad1ea29 100644
>> --- a/net/ipv6/ip6_fib.c
>> +++ b/net/ipv6/ip6_fib.c
>> @@ -640,6 +640,11 @@ static struct fib6_node *fib6_add_1(struct net *net,
>>   if (!(fn->fn_flags & RTN_RTINFO)) {
>>   RCU_INIT_POINTER(fn->leaf, NULL);
>>   rt6_release(leaf);
>> + /* remove null_entry in the root node */
>> + } else if (fn->fn_flags & RTN_TL_ROOT &&
>> +rcu_access_pointer(fn

[PATCH net] ipv6: remove null_entry before adding default route

2018-01-05 Thread Wei Wang

From: Wei Wang <wei...@google.com>

In the current code, when creating a new fib6 table, tb6_root.leaf gets
initialized to net->ipv6.ip6_null_entry.
If a default route is being added with rt->rt6i_metric = 0x,
fib6_add() will add this route after net->ipv6.ip6_null_entry. As
null_entry is shared, it could cause problems.

In order to fix it, set fn->leaf to NULL before calling
fib6_add_rt2node() when trying to add the first default route.
And reset fn->leaf to null_entry when adding fails or when deleting the
last default route.

syzkaller reported the following issue which is fixed by this commit:
=
WARNING: suspicious RCU usage
4.15.0-rc5+ #171 Not tainted
-
net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
4 locks held by swapper/0/0:
 #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: [<d43f631b>] 
lockdep_copy_map include/linux/lockdep.h:178 [inline]
 #0:  ((>ipv6.ip6_fib_timer)){+.-.}, at: [<d43f631b>] 
call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
 #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] 
spin_lock_bh include/linux/spinlock.h:315 [inline]
 #1:  (&(>ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<2ff9d65c>] 
fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
 #2:  (rcu_read_lock){}, at: [<91db762d>] 
__fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
 #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] spin_lock_bh 
include/linux/spinlock.h:315 [inline]
 #3:  (&(>tb6_lock)->rlock){+.-.}, at: [<9e503581>] 
__fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
 fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
 fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
 fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
 fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
 fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
 __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
 fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
 fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
 fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
 call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
 expire_timers kernel/time/timer.c:1357 [inline]
 __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
 run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:540 [inline]
 smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
 

Reported-by: syzbot <syzkal...@googlegroups.com>
Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Signed-off-by: Wei Wang <wei...@google.com>
---
 net/ipv6/ip6_fib.c | 45 +++--
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index d11a5578e4f8..37cb4ad1ea29 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -640,6 +640,11 @@ static struct fib6_node *fib6_add_1(struct net *net,
if (!(fn->fn_flags & RTN_RTINFO)) {
RCU_INIT_POINTER(fn->leaf, NULL);
rt6_release(leaf);
+   /* remove null_entry in the root node */
+   } else if (fn->fn_flags & RTN_TL_ROOT &&
+  rcu_access_pointer(fn->leaf) ==
+  net->ipv6.ip6_null_entry) {
+   RCU_INIT_POINTER(fn->leaf, NULL);
}
 
return fn;
@@ -1270,14 +1275,27 @@ int fib6_add(struct fib6_node *root, struct rt6_info 
*rt,
return err;
 
 failure:
-   /* fn->leaf could be NULL if fn is an intermediate node and we
-* failed to add the new route to it in both subtree creation
-* failure and fib6_add_rt2node() failure case.
-* In both cases, fib6_repair_tree() should be called to fix
+   /* fn->leaf could be NULL if:
+* 1. fn is the root node in the table and we fail to add the default
+* route to it.
+* In this case, we put fn->leaf back to net->ipv6.ip6_null_entry as
+* the way the table was created.
+* 2. fn is an intermediate node and we

[PATCH net] ipv6: fix general protection fault in fib6_add()

2018-01-03 Thread Wei Wang

From: Wei Wang <wei...@google.com>

In fib6_add(), pn could be NULL if fib6_add_1() failed to return a fib6
node. Checking pn != fn before accessing pn->leaf makes sure pn is not
NULL.
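
A minimal userspace sketch of that guard ordering follows. It is not the
kernel code: the toy_* names are hypothetical, and it assumes that when the
node lookup fails both fn and pn are NULL, so the pn != fn test is false and
pn->leaf is never read.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the route and node types; hypothetical names. */
struct toy_route { int refcnt; };
struct toy_node  { struct toy_route *leaf; };

/* Error-path cleanup: if the parent node pn still points at the route
 * that failed to insert, clear the pointer and drop the reference.
 * Assumption of this sketch: pn is NULL only when fn is NULL as well.
 * Returns 1 if pn->leaf was cleared, 0 otherwise. */
static int toy_undo_parent_leaf(struct toy_node *pn, struct toy_node *fn,
                                struct toy_route *rt)
{
    assert(rt != NULL);

    /* The comparison comes first, so pn is only dereferenced when it
     * differs from fn. With pn == fn == NULL the block is skipped. */
    if (pn != fn) {
        if (pn->leaf == rt) {
            pn->leaf = NULL;
            rt->refcnt--;
            return 1;
        }
    }
    return 0;
}
```

Calling toy_undo_parent_leaf(NULL, NULL, &rt) returns 0 without touching
pn, whereas reading pn->leaf ahead of the comparison would dereference a
NULL pointer in the same situation.
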
This fixes the following GPF reported by syzkaller:
general protection fault:  [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 3201 Comm: syzkaller001778 Not tainted 4.15.0-rc5+ #151
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
RIP: 0010:fib6_add+0x736/0x15a0 net/ipv6/ip6_fib.c:1244
RSP: 0018:8801c7626a70 EFLAGS: 00010202
RAX: dc00 RBX: 0020 RCX: 84794465
RDX: 0004 RSI: 8801d38935f0 RDI: 0282
RBP: 8801c7626da0 R08: 110038ec4c35 R09: 
R10: 8801c7626c68 R11:  R12: fffe
R13:  R14:  R15: 0009
FS:  () GS:8801db20(0063) knlGS:09b70840
CS:  0010 DS: 002b ES: 002b CR0: 80050033
CR2: 20be1000 CR3: 0001d585a006 CR4: 001606f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1006
 ip6_route_multipath_add+0xd14/0x16c0 net/ipv6/route.c:3833
 inet6_rtm_newroute+0xdc/0x160 net/ipv6/route.c:3957
 rtnetlink_rcv_msg+0x733/0x1020 net/core/rtnetlink.c:4411
 netlink_rcv_skb+0x21e/0x460 net/netlink/af_netlink.c:2408
 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4423
 netlink_unicast_kernel net/netlink/af_netlink.c:1275 [inline]
 netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1301
 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1864
 sock_sendmsg_nosec net/socket.c:636 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:646
 sock_write_iter+0x31a/0x5d0 net/socket.c:915
 call_write_iter include/linux/fs.h:1772 [inline]
 do_iter_readv_writev+0x525/0x7f0 fs/read_write.c:653
 do_iter_write+0x154/0x540 fs/read_write.c:932
 compat_writev+0x225/0x420 fs/read_write.c:1246
 do_compat_writev+0x115/0x220 fs/read_write.c:1267
 C_SYSC_writev fs/read_write.c:1278 [inline]
 compat_SyS_writev+0x26/0x30 fs/read_write.c:1274
 do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline]
 do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389
 entry_SYSENTER_compat+0x54/0x63 arch/x86/entry/entry_64_compat.S:125

Reported-by: syzbot <syzkal...@googlegroups.com>
Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Signed-off-by: Wei Wang <wei...@google.com>
---
 net/ipv6/ip6_fib.c | 35 ---
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index f5285f4e1d08..d11a5578e4f8 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1241,23 +1241,28 @@ int fib6_add(struct fib6_node *root, struct rt6_info 
*rt,
 * If fib6_add_1 has cleared the old leaf pointer in the
 * super-tree leaf node we have to find a new one for it.
 */
-   struct rt6_info *pn_leaf = rcu_dereference_protected(pn->leaf,
-   lockdep_is_held(&table->tb6_lock));
-   if (pn != fn && pn_leaf == rt) {
-   pn_leaf = NULL;
-   RCU_INIT_POINTER(pn->leaf, NULL);
-   atomic_dec(&rt->rt6i_ref);
-   }
-   if (pn != fn && !pn_leaf && !(pn->fn_flags & RTN_RTINFO)) {
-   pn_leaf = fib6_find_prefix(info->nl_net, table, pn);
-#if RT6_DEBUG >= 2
-   if (!pn_leaf) {
-   WARN_ON(!pn_leaf);
-   pn_leaf = info->nl_net->ipv6.ip6_null_entry;
+   if (pn != fn) {
+   struct rt6_info *pn_leaf =
+   rcu_dereference_protected(pn->leaf,
+   lockdep_is_held(&table->tb6_lock));
+   if (pn_leaf == rt) {
+   pn_leaf = NULL;
+   RCU_INIT_POINTER(pn->leaf, NULL);
+   atomic_dec(&rt->rt6i_ref);
}
+   if (!pn_leaf && !(pn->fn_flags & RTN_RTINFO)) {
+   pn_leaf = fib6_find_prefix(info->nl_net, table,
+  pn);
+#if RT6_DEBUG >= 2
+   if (!pn_leaf) {
+   WARN_ON(!pn_leaf);
+   pn_leaf =
+   info->nl_net->ipv6.ip6_null_entry;
+   }
 #endif
-   atomic_inc(_leaf->rt6i

Re: general protection fault in fib6_add (2)

2018-01-03 Thread Wei Wang

On Wed, Jan 3, 2018 at 8:16 AM, David Ahern wrote:
> [ +wei...@google.com ]
>
> On 1/2/18 3:58 PM, syzbot wrote:
>> Hello,
>>
>> syzkaller hit the following crash on
>> 61233580f1f33c50e159c50e24d80ffd2ba2e06b
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>> compiler: gcc (GCC) 7.1.1 20170620
>> .config is attached
>> Raw console output is attached.
>> C reproducer is attached
>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>> for information about syzkaller reproducers
>>
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+0693adff3f83403dc...@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for
>> details.
>> If you forward the report, please keep this part and the footer.
>>
>> audit: type=1400 audit(1514594846.496:7): avc:  denied  { map } for
>> pid=3201 comm="syzkaller001778" path="/root/syzkaller001778299"
>> dev="sda1" ino=16481
>> scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
>> tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1
>> IPv6: Can't replace route, no match found
>> kasan: CONFIG_KASAN_INLINE enabled
>> kasan: GPF could be caused by NULL-ptr deref or user memory access
>> general protection fault:  [#1] SMP KASAN
>> Dumping ftrace buffer:
>>(ftrace buffer empty)
>> Modules linked in:
>> CPU: 0 PID: 3201 Comm: syzkaller001778 Not tainted 4.15.0-rc5+ #151
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> RIP: 0010:fib6_add+0x736/0x15a0 net/ipv6/ip6_fib.c:1244

pn could be NULL if fib6_add_1() failed. Will submit a fix for this.

>> RSP: 0018:8801c7626a70 EFLAGS: 00010202
>> RAX: dc00 RBX: 0020 RCX: 84794465
>> RDX: 0004 RSI: 8801d38935f0 RDI: 0282
>> RBP: 8801c7626da0 R08: 110038ec4c35 R09: 
>> R10: 8801c7626c68 R11:  R12: fffe
>> R13:  R14:  R15: 0009
>> FS:  () GS:8801db20(0063)
>> knlGS:09b70840
>> CS:  0010 DS: 002b ES: 002b CR0: 80050033
>> CR2: 20be1000 CR3: 0001d585a006 CR4: 001606f0
>> DR0:  DR1:  DR2: 
>> DR3:  DR6: fffe0ff0 DR7: 0400
>> Call Trace:
>>  __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1006
>>  ip6_route_multipath_add+0xd14/0x16c0 net/ipv6/route.c:3833
>>  inet6_rtm_newroute+0xdc/0x160 net/ipv6/route.c:3957
>>  rtnetlink_rcv_msg+0x733/0x1020 net/core/rtnetlink.c:4411
>>  netlink_rcv_skb+0x21e/0x460 net/netlink/af_netlink.c:2408
>>  rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4423
>>  netlink_unicast_kernel net/netlink/af_netlink.c:1275 [inline]
>>  netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1301
>>  netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1864
>>  sock_sendmsg_nosec net/socket.c:636 [inline]
>>  sock_sendmsg+0xca/0x110 net/socket.c:646
>>  sock_write_iter+0x31a/0x5d0 net/socket.c:915
>>  call_write_iter include/linux/fs.h:1772 [inline]
>>  do_iter_readv_writev+0x525/0x7f0 fs/read_write.c:653
>>  do_iter_write+0x154/0x540 fs/read_write.c:932
>>  compat_writev+0x225/0x420 fs/read_write.c:1246
>>  do_compat_writev+0x115/0x220 fs/read_write.c:1267
>>  C_SYSC_writev fs/read_write.c:1278 [inline]
>>  compat_SyS_writev+0x26/0x30 fs/read_write.c:1274
>>  do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline]
>>  do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389
>>  entry_SYSENTER_compat+0x54/0x63 arch/x86/entry/entry_64_compat.S:125
>> RIP: 0023:0xf7f1fc79
>> RSP: 002b:ffb61bfc EFLAGS: 0203 ORIG_RAX: 0092
>> RAX: ffda RBX: 0003 RCX: 204aaff0
>> RDX: 0001 RSI: 0167 RDI: 0010
>> RBP: 0003 R08:  R09: 
>> R10:  R11:  R12: 
>> R13:  R14:  R15: 
>> Code: f1 a9 f6 fc e8 2c f2 e2 fc 85 c0 0f 85 d5 03 00 00 49 8d 5e 20 e8
>> db a9 f6 fc 48 89 da 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c
>> 02 00 0f 85 5a 0c 00 00 4d 39 ee 4d 8b 7e 20 0f 95 c0 4c
>> RIP: fib6_add+0x736/0x15a0 net/ipv6/ip6_fib.c:1244 RSP: 8801c7626a70
>> ---[ end trace 956c65133fcfff88 ]---
>>
>>
>> ---
>> This bug is generated by a dumb bot. It may contain errors.
>> See https://goo.gl/tpsmEJ for details.
>> Direct all questions to syzkal...@googlegroups.com.
>>
>> syzbot will keep track of this bug report.
>> If you forgot to add the Reported-by tag, once the fix for this bug is
>> merged
>> into any tree, please reply to this email with:
>> #syz fix: exact-commit-title
>> If you want to test a patch for this bug, please reply with:
>> #syz test: git://repo/address.git branch
>> and provide the

Re: dst refcount is -1

2017-12-19 Thread Wei Wang

On Tue, Dec 19, 2017 at 2:56 AM, Ortwin Glück  wrote:
> Hi,
>
> On 4.14.6 I just got this (on a busy firewall):
> [Tue Dec 19 11:15:59 2017] dst_release: dst:9bb7aca0d6c0 refcnt:-1
>
> Are you sure the refcounting is now correct?
>
> Ortwin

Could you give more details about the circumstances under which it happened?
What kind of traffic is running? IPv4? IPv6? Or Both? Do you use xfrm?

[PATCH net] tcp: fix potential underestimation on rcv_rtt

2017-12-12 Thread Wei Wang

From: Wei Wang <wei...@google.com>

When ms timestamp is used, current logic uses 1us in
tcp_rcv_rtt_update() when the real rcv_rtt is within 1 - 999us.
This could cause rcv_rtt underestimation.
Fix it by always using a min value of 1ms if ms timestamp is used.

Fixes: 645f4c6f2ebd ("tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps")

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp_input.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9550cc42de2d..45f750e85714 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -508,9 +508,6 @@ static void tcp_rcv_rtt_update(struct tcp_sock *tp, u32 sample, int win_dep)
u32 new_sample = tp->rcv_rtt_est.rtt_us;
long m = sample;
 
-   if (m == 0)
-   m = 1;
-
if (new_sample != 0) {
/* If we sample in larger samples in the non-timestamp
 * case, we could grossly overestimate the RTT especially
@@ -547,6 +544,8 @@ static inline void tcp_rcv_rtt_measure(struct tcp_sock *tp)
if (before(tp->rcv_nxt, tp->rcv_rtt_est.seq))
return;
delta_us = tcp_stamp_us_delta(tp->tcp_mstamp, tp->rcv_rtt_est.time);
+   if (!delta_us)
+   delta_us = 1;
tcp_rcv_rtt_update(tp, delta_us, 1);
 
 new_measure:
@@ -563,8 +562,11 @@ static inline void tcp_rcv_rtt_measure_ts(struct sock *sk,
(TCP_SKB_CB(skb)->end_seq -
 TCP_SKB_CB(skb)->seq >= inet_csk(sk)->icsk_ack.rcv_mss)) {
u32 delta = tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;
-   u32 delta_us = delta * (USEC_PER_SEC / TCP_TS_HZ);
+   u32 delta_us;
 
+   if (!delta)
+   delta = 1;
+   delta_us = delta * (USEC_PER_SEC / TCP_TS_HZ);
tcp_rcv_rtt_update(tp, delta_us, 0);
}
 }
-- 
2.15.1.424.g9478a66081-goog
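
Editorial note: the effect of moving the zero check can be sketched in a few lines of userspace C. This is only an illustration of the arithmetic described in the commit message, not kernel code. The function names are made up for the example, and TS_HZ stands in for TCP_TS_HZ, assumed here to be 1000 (a millisecond timestamp clock).

```c
#include <stdint.h>

/* Assumed constants for illustration; TS_HZ models TCP_TS_HZ = 1000. */
#define USEC_PER_SEC 1000000u
#define TS_HZ        1000u

/* Old behaviour (hypothetical helper): a delta of 0 ticks is scaled to
 * 0us and only then bumped to 1us, so any RTT below 1ms reads as 1us.
 */
uint32_t rcv_rtt_us_old(uint32_t delta_ticks)
{
	uint32_t us = delta_ticks * (USEC_PER_SEC / TS_HZ);

	return us ? us : 1;
}

/* Patched behaviour (hypothetical helper): clamp the tick delta to 1
 * before scaling, so the smallest sample is one full tick (1000us).
 */
uint32_t rcv_rtt_us_new(uint32_t delta_ticks)
{
	if (!delta_ticks)
		delta_ticks = 1;
	return delta_ticks * (USEC_PER_SEC / TS_HZ);
}
```

With a 0-tick delta the old path reports 1us while the patched path reports 1000us; non-zero deltas are unchanged.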

[iproute2] ss: print tcpi_rcv_ssthresh

2017-12-07 Thread Wei Wang

From: Wei Wang <wei...@google.com>

tcpi_rcv_ssthresh is an important statistic when debugging receive-side
behavior.
Add it to the ss output.

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 misc/ss.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/misc/ss.c b/misc/ss.c
index b5099d1e..90da93e3 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -751,6 +751,7 @@ struct tcpstat {
double  rcv_rtt;
double  min_rtt;
int rcv_space;
+   unsigned int    rcv_ssthresh;
unsigned long long  busy_time;
unsigned long long  rwnd_limited;
unsigned long long  sndbuf_limited;
@@ -2058,6 +2059,8 @@ static void tcp_stats_print(struct tcpstat *s)
printf(" rcv_rtt:%g", s->rcv_rtt);
if (s->rcv_space)
printf(" rcv_space:%d", s->rcv_space);
+   if (s->rcv_ssthresh)
+   printf(" rcv_ssthresh:%u", s->rcv_ssthresh);
if (s->not_sent)
printf(" notsent:%u", s->not_sent);
if (s->min_rtt)
@@ -2304,6 +2307,7 @@ static void tcp_show_info(const struct nlmsghdr *nlh, struct inet_diag_msg *r,
s.fackets= info->tcpi_fackets;
s.reordering = info->tcpi_reordering;
s.rcv_space  = info->tcpi_rcv_space;
+   s.rcv_ssthresh   = info->tcpi_rcv_ssthresh;
s.cwnd   = info->tcpi_snd_cwnd;
 
if (info->tcpi_snd_ssthresh < 0x)
-- 
2.15.1.424.g9478a66081-goog

Re: [PULL] virtio: last minute bugfix

2017-11-07 Thread Wei Wang


On 11/08/2017 03:23 AM, Michael S. Tsirkin wrote:

On Tue, Nov 07, 2017 at 08:13:10PM +0200, Michael S. Tsirkin wrote:

On Tue, Nov 07, 2017 at 09:29:59AM -0800, Linus Torvalds wrote:

On Tue, Nov 7, 2017 at 9:23 AM, Linus Torvalds
 wrote:

I guess I'll take it, but please don't do things like this to me.

Oh no I won't.

The garbage you sent me doesn't even compile cleanly, and is utter shite.

Not acceptable for last-minute bugfixes, and you're now on my shit-list.

 Linus

Sorry about that.

I'll investigate what went wrong.

Will be more careful not to cut corners next time around, just follow
the standard procedure.

All right, my local tests didn't fail on new warnings, and I didn't give
the zero day infrastructure enough time to do its job.

Lesson hopefully learned - don't rush it, give tools the time to do
their job.

Wei, you'll want to respin your 4.15 patchset on top of my fixed tree.
At this point the fix will only land in 4.15, sorry about that.

Thanks everyone.



OK, I'll use the fixed tree. Thanks, Michael.


Best,
Wei

[PATCH net-next] ipv6: prevent user from adding cached routes

2017-10-27 Thread Wei Wang

From: Wei Wang <wei...@google.com>

Cached routes should only be created by the system when receiving pmtu
discovery or ip redirect msg. Users should not be allowed to create
cached routes.

Furthermore, after the patch series to move cached routes into exception
table, user added cached routes will trigger the following warning in
fib6_add():

WARNING: CPU: 0 PID: 2985 at net/ipv6/ip6_fib.c:1137
fib6_add+0x20d9/0x2c10 net/ipv6/ip6_fib.c:1137
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 2985 Comm: syzkaller320388 Not tainted 4.14.0-rc3+ #74
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:52
 panic+0x1e4/0x417 kernel/panic.c:181
 __warn+0x1c4/0x1d9 kernel/panic.c:542
 report_bug+0x211/0x2d0 lib/bug.c:183
 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
 do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
 do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
 invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
RIP: 0010:fib6_add+0x20d9/0x2c10 net/ipv6/ip6_fib.c:1137
RSP: 0018:8801cf09f6a0 EFLAGS: 00010297
RAX: 8801ce45e340 RBX: 110039e13eec RCX: 8801d749c814
RDX:  RSI: 8801d749c700 RDI: 8801d749c780
RBP: 8801cf09fa08 R08:  R09: 8801cf09f360
R10: 8801cf09f2d8 R11: 110039c8befb R12: 0001
R13: dc00 R14: 8801d749c700 R15: 860655c0
 __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1011
 ip6_route_add+0x148/0x1a0 net/ipv6/route.c:2782
 ipv6_route_ioctl+0x4d5/0x690 net/ipv6/route.c:3291
 inet6_ioctl+0xef/0x1e0 net/ipv6/af_inet6.c:521
 sock_do_ioctl+0x65/0xb0 net/socket.c:961
 sock_ioctl+0x2c2/0x440 net/socket.c:1058
 vfs_ioctl fs/ioctl.c:45 [inline]
 do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
 SYSC_ioctl fs/ioctl.c:700 [inline]
 SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
 entry_SYSCALL_64_fastpath+0x1f/0xbe

So we fix this by failing the attempt to add cached routes from userspace,
returning an EINVAL error.

Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 include/uapi/linux/ipv6_route.h | 2 +-
 net/ipv6/route.c| 6 ++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/ipv6_route.h b/include/uapi/linux/ipv6_route.h
index d496c02e14bc..c15d8054905c 100644
--- a/include/uapi/linux/ipv6_route.h
+++ b/include/uapi/linux/ipv6_route.h
@@ -28,7 +28,7 @@
 
 #define RTF_ROUTEINFO  0x00800000  /* route information - RA   */
 
-#define RTF_CACHE  0x01000000  /* cache entry  */
+#define RTF_CACHE  0x01000000  /* read-only: can not be set by user */
 #define RTF_FLOW   0x02000000  /* flow significant route   */
 #define RTF_POLICY 0x04000000  /* policy route */
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 605e5dc1c010..70d9659fc1e9 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2478,6 +2478,12 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg,
goto out;
}
 
+   /* RTF_CACHE is an internal flag; can not be set by userspace */
+   if (cfg->fc_flags & RTF_CACHE) {
+   NL_SET_ERR_MSG(extack, "Userspace can not set RTF_CACHE");
+   goto out;
+   }
+
if (cfg->fc_dst_len > 128) {
NL_SET_ERR_MSG(extack, "Invalid prefix length");
goto out;
-- 
2.15.0.rc2.357.g7e34df9404-goog
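
Editorial note: the validation added above follows a common pattern: reject a kernel-internal flag at the point where userspace configuration is parsed. A minimal userspace sketch follows. check_user_route_flags() and its errmsg parameter are hypothetical stand-ins for the fc_flags check and NL_SET_ERR_MSG(); the flag values are the ones from include/uapi/linux/ipv6_route.h.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define RTF_GATEWAY 0x0002u      /* destination is a gateway */
#define RTF_CACHE   0x01000000u  /* internal: set only by the kernel */

/* Hypothetical helper: returns 0 if the flags are acceptable from
 * userspace, or -EINVAL (with a message) if an internal flag is set.
 */
int check_user_route_flags(uint32_t fc_flags, const char **errmsg)
{
	if (fc_flags & RTF_CACHE) {
		*errmsg = "Userspace can not set RTF_CACHE";
		return -EINVAL;
	}
	*errmsg = NULL;
	return 0;
}
```

Failing early keeps the invalid route from ever reaching fib6_add(), which is where the warning in the trace above fired.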

[PATCH net-next] ipv6: add ip6_null_entry check in rt6_select()

2017-10-23 Thread Wei Wang

From: Wei Wang <wei...@google.com>

In rt6_select(), fn->leaf could be pointing to net->ipv6.ip6_null_entry.
In this case, we should directly return instead of trying to carry on
with the rest of the process.
If not, we could crash at:
  spin_lock_bh(&leaf->rt6i_table->rt6_lock);
because net->ipv6.ip6_null_entry does not have rt6i_table set.


Syzkaller recently reported the following issue on net-next:
Use struct sctp_sack_info instead
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
sctp: [Deprecated]: syz-executor4 (pid 26496) Use of struct sctp_assoc_value in 
delayed_ack socket option.
Use struct sctp_sack_info instead
CPU: 1 PID: 26523 Comm: syz-executor6 Not tainted 4.14.0-rc4+ #85
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
task: 8801d147e3c0 task.stack: 8801a4328000
RIP: 0010:debug_spin_lock_before kernel/locking/spinlock_debug.c:83 [inline]
RIP: 0010:do_raw_spin_lock+0x23/0x1e0 kernel/locking/spinlock_debug.c:112
RSP: 0018:8801a432ed70 EFLAGS: 00010207
RAX: dc00 RBX: 0018 RCX: 
RDX: 0003 RSI:  RDI: 001c
RBP: 8801a432ed90 R08: 0001 R09: 
R10:  R11: 8482b279 R12: 8801ce2ff3a0
sctp: [Deprecated]: syz-executor1 (pid 26546) Use of int in maxseg socket 
option.
Use struct sctp_assoc_value instead
R13: dc00 R14: 8801d971e000 R15: 8801ce2ff0d8
FS:  7f56e82f5700() GS:8801db30() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 001ddbc22000 CR3: 0001a4a04000 CR4: 001406e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 __raw_spin_lock_bh include/linux/spinlock_api_smp.h:136 [inline]
 _raw_spin_lock_bh+0x39/0x40 kernel/locking/spinlock.c:175
 spin_lock_bh include/linux/spinlock.h:321 [inline]
 rt6_select net/ipv6/route.c:786 [inline]
 ip6_pol_route+0x1be3/0x3bd0 net/ipv6/route.c:1650
sctp: [Deprecated]: syz-executor1 (pid 26576) Use of int in maxseg socket 
option.
Use struct sctp_assoc_value instead
TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending cookies.  
Check SNMP counters.
 ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1843
 fib6_rule_lookup+0x9e/0x2a0 net/ipv6/ip6_fib.c:309
 ip6_route_output_flags+0x1f1/0x2b0 net/ipv6/route.c:1871
 ip6_route_output include/net/ip6_route.h:80 [inline]
 ip6_dst_lookup_tail+0x4ea/0x970 net/ipv6/ip6_output.c:953
 ip6_dst_lookup_flow+0xc8/0x270 net/ipv6/ip6_output.c:1076
 sctp_v6_get_dst+0x675/0x1c30 net/sctp/ipv6.c:274
 sctp_transport_route+0xa8/0x430 net/sctp/transport.c:287
 sctp_assoc_add_peer+0x4fe/0x1100 net/sctp/associola.c:656
 __sctp_connect+0x251/0xc80 net/sctp/socket.c:1187
 sctp_connect+0xb4/0xf0 net/sctp/socket.c:4209
 inet_dgram_connect+0x16b/0x1f0 net/ipv4/af_inet.c:541
 SYSC_connect+0x20a/0x480 net/socket.c:1642
 SyS_connect+0x24/0x30 net/socket.c:1623
 entry_SYSCALL_64_fastpath+0x1f/0xbe

Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Signed-off-by: Wei Wang <wei...@google.com>
Acked-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv6/route.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 46c59a53c53f..605e5dc1c010 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -752,7 +752,7 @@ static struct rt6_info *rt6_select(struct net *net, struct fib6_node *fn,
bool do_rr = false;
int key_plen;
 
-   if (!leaf)
+   if (!leaf || leaf == net->ipv6.ip6_null_entry)
return net->ipv6.ip6_null_entry;
 
rt0 = rcu_dereference(fn->rr_ptr);
-- 
2.15.0.rc0.271.g36b669edcc-goog
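
Editorial note: the one-line fix treats the null entry as a sentinel that must be filtered out before any field of the route is dereferenced. A small userspace model of that idea is sketched below. All type and function names are invented for the illustration, and the counter increment merely stands in for taking the table lock.

```c
#include <stddef.h>

/* Toy stand-ins for fib6_table and rt6_info (assumed shapes). */
struct fib_table_model {
	int lock_count;               /* models taking rt6_lock */
};

struct route_model {
	struct fib_table_model *tbl;  /* NULL for the sentinel entry */
};

/* Sentinel route: like ip6_null_entry, it has no table attached. */
struct route_model null_entry = { NULL };

/* Hypothetical selector: bail out on NULL or on the sentinel before
 * touching leaf->tbl, which would otherwise be a NULL dereference.
 */
struct route_model *select_route(struct route_model *leaf)
{
	if (!leaf || leaf == &null_entry)
		return &null_entry;

	leaf->tbl->lock_count++;      /* safe: real routes have a table */
	return leaf;
}
```

Without the second half of the condition, passing the sentinel would reach the tbl dereference, which corresponds to the general protection fault in the trace.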

Re: [PATCH net-next v3 2/2] ipv6: remove from fib tree aged out RTF_CACHE dst

2017-10-19 Thread Wei Wang

On Thu, Oct 19, 2017 at 7:07 AM, Paolo Abeni <pab...@redhat.com> wrote:
> The commit 2b760fcf5cfb ("ipv6: hook up exception table to store
> dst cache") partially reverted the commit 1e2ea8ad37be ("ipv6: set
> dst.obsolete when a cached route has expired").
>
> As a result, RTF_CACHE dst referenced outside the fib tree will
> not be removed until the next sernum change; dst_check() does not
> fail on aged-out dst, and dst->__refcnt can't decrease: the aged
> out dst will stay valid for a potentially unlimited time after the
> timeout expiration.
>
> This change explicitly removes RTF_CACHE dst from the fib tree when
> aged out. The rt6_remove_exception() logic will then obsolete the
> dst and other entities will drop the related reference on next
> dst_check().
>
> pMTU exceptions are not aged-out, and are removed from the exception
> table only when the - usually considerably longer - ip6_rt_mtu_expires
> timeout expires.
>
> v1 -> v2:
>   - do not touch dst.obsolete in rt6_remove_exception(), not needed
> v2 -> v3:
>   - take care of pMTU exceptions, too
>
> Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache")
> Signed-off-by: Paolo Abeni <pab...@redhat.com>
> ---

Acked-by: Wei Wang <wei...@google.com>


>  net/ipv6/route.c | 12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 5c27313803d2..87a15cbd0e8b 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1575,7 +1575,13 @@ static void rt6_age_examine_exception(struct rt6_exception_bucket *bucket,
>  {
> struct rt6_info *rt = rt6_ex->rt6i;
>
> -   if (atomic_read(&rt->dst.__refcnt) == 1 &&
> +   /* we are pruning and obsoleting aged-out and non gateway exceptions
> +* even if others have still references to them, so that on next
> +* dst_check() such references can be dropped.
> +* EXPIRES exceptions - e.g. pmtu-generated ones are pruned when
> +* expired, independently from their aging, as per RFC 8201 section 4
> +*/
> +   if (!(rt->rt6i_flags & RTF_EXPIRES) &&
> time_after_eq(now, rt->dst.lastuse + gc_args->timeout)) {
> RT6_TRACE("aging clone %p\n", rt);
> rt6_remove_exception(bucket, rt6_ex);
> @@ -1595,6 +1601,10 @@ static void rt6_age_examine_exception(struct rt6_exception_bucket *bucket,
> rt6_remove_exception(bucket, rt6_ex);
> return;
> }
> +   } else if (__rt6_check_expired(rt)) {
> +   RT6_TRACE("purging expired route %p\n", rt);
> +   rt6_remove_exception(bucket, rt6_ex);
> +   return;
> }
> gc_args->more++;
>  }
> --
> 2.13.6
>
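
Editorial note: the pruning rule in the v3 hunk can be summarised as "non-EXPIRES exceptions are removed once idle for longer than the gc timeout; EXPIRES exceptions are removed only when their own expiry has passed". The sketch below models just that decision in userspace C. It deliberately leaves out the RTF_GATEWAY neighbour check that sits between the two branches in the real function, and the struct and helper names are invented for the example.

```c
#include <stdbool.h>

#define RTF_EXPIRES 0x00400000u

/* Toy model of an exception entry (assumed fields). */
struct exception_model {
	unsigned int  flags;
	unsigned long lastuse;   /* last time the dst was used */
	unsigned long expires;   /* valid only with RTF_EXPIRES */
};

/* Wrap-safe comparisons in the style of the kernel jiffies helpers. */
bool model_time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

bool model_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Hypothetical decision helper; the refcount is intentionally not
 * consulted, matching the behaviour change discussed in this thread.
 */
bool should_remove(const struct exception_model *e,
		   unsigned long now, unsigned long timeout)
{
	if (!(e->flags & RTF_EXPIRES))
		return model_time_after_eq(now, e->lastuse + timeout);
	return model_time_after(now, e->expires);
}
```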

Re: [PATCH net-next v3 1/2] ipv6: start fib6 gc on RTF_CACHE dst creation

2017-10-19 Thread Wei Wang

On Thu, Oct 19, 2017 at 7:07 AM, Paolo Abeni <pab...@redhat.com> wrote:
> After the commit 2b760fcf5cfb ("ipv6: hook up exception table
> to store dst cache"), the fib6 gc is not started after the
> creation of a RTF_CACHE via a redirect or pmtu update, since
> fib6_add() isn't invoked anymore for such dsts.
>
> We need the fib6 gc to run periodically to clean the RTF_CACHE,
> or the dst will stay there forever.
>
> Fix it by explicitly calling fib6_force_start_gc() on successful
> exception creation. gc_args->more accounting will ensure that
> the gc timer will run for whatever time needed to properly
> clean the table.
>
> v2 -> v3:
>  - clarified the commit message
>
> Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache")
> Signed-off-by: Paolo Abeni <pab...@redhat.com>
> ---

Acked-by: Wei Wang <wei...@google.com>

>  net/ipv6/route.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 01a103c23a6c..5c27313803d2 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1340,8 +1340,10 @@ static int rt6_insert_exception(struct rt6_info *nrt,
> spin_unlock_bh(&rt6_exception_lock);
>
> /* Update fn->fn_sernum to invalidate all cached dst */
> -   if (!err)
> +   if (!err) {
> fib6_update_sernum(ort);
> +   fib6_force_start_gc(net);
> +   }
>
> return err;
>  }
> --
> 2.13.6
>

Re: [PATCH net-next 3/3] ipv6: obsolete cached dst when removing them from fib tree

2017-10-18 Thread Wei Wang

On Wed, Oct 18, 2017 at 6:03 AM, Paolo Abeni <pab...@redhat.com> wrote:
> On Tue, 2017-10-17 at 13:48 -0700, Wei Wang wrote:
>> On Tue, Oct 17, 2017 at 1:02 PM, Paolo Abeni <pab...@redhat.com> wrote:
>> > Meanwhile other sockets may grab more references to (and use) the same
>> > aged-out dst.
>> >
>>
>> I don't think other sockets could grab more reference to this dst
>> because this dst should already be removed from the fib6 tree.
>
> With the current net-next code, the dst is not removed from the fib
> tree while someone else is holding it, and dst_check() does not fail
> after the cached dst is aged out. If a socket cache grabs a
> reference to the CACHE dst, it will not release it until the next
> sernum change, regardless of the dst aging.
>
>> > The commit 1e2ea8ad37be ("ipv6: set dst.obsolete when a cached route
>> > has expired") was the solution to the above issue prior to the recent
>> > refactor.
>> >
>>
>> I don't really understand how this commit is solving the above issue.
>> This commit still only ages out cached route if &rt->dst.__refcnt ==
>> 1. So if socket is holding refcnt to this dst and dst_check() is not
>> getting called,  this cached route still won't get deleted.
>
> Setting obsolete to DST_OBSOLETE_KILL forced whoever was holding the
> dst reference to drop it on the next dst_check(), so that refcnt could
> go down.
>

Yes. Understood.
Martin and I had a discussion yesterday. We both think it is not a
good idea to set obsolete to DST_OBSOLETE_KILL without removing the route
from the fib6 tree.
It is because others who do a route lookup later might potentially
find this route and try to use it. However, dst_check() will
show this route is invalid. So the user will redo the route lookup.
But as this route is not yet deleted from the tree, it will find this
route again. This seems like a bad situation.
And again, setting obsolete to DST_OBSOLETE_KILL does not prevent some
idle socket from holding on to this dst for a long time...

With the above said, I am now convinced what you have in your patch is
the correct thing to do. Just remove the cached route without checking
the refcnt when it is aged.

> Cheers,
>
> Paolo
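
Editorial note: the "bad situation" described in this reply (a route marked obsolete but still findable by lookup) can be shown with a toy model. This is a userspace sketch under simplifying assumptions, not how the kernel lookup path is structured: lookup() and check() are invented stand-ins for the route lookup and dst_check(), and the retry bound exists only so the example terminates.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy cached route: two independent bits of state (assumed model). */
struct cached_route_model {
	bool in_tree;    /* still reachable by a lookup */
	bool obsolete;   /* a validity check on it will fail */
};

struct cached_route_model *lookup(struct cached_route_model *rt)
{
	return rt->in_tree ? rt : NULL;
}

bool check(const struct cached_route_model *rt)
{
	return !rt->obsolete;
}

/* Returns how many lookups it takes to settle on either a valid cached
 * route or no cached route at all; -1 if it never settles.
 */
int lookups_until_settled(struct cached_route_model *rt, int max_tries)
{
	int tries = 0;

	while (tries < max_tries) {
		struct cached_route_model *found = lookup(rt);

		tries++;
		if (!found || check(found))
			return tries;
		/* found but invalid: caller redoes the lookup */
	}
	return -1;
}
```

An entry that is obsolete but still in the tree never settles, while removing it from the tree (as the patch under discussion does) settles on the first lookup.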

Re: [PATCH net-next 3/3] ipv6: obsolete cached dst when removing them from fib tree

2017-10-17 Thread Wei Wang

On Tue, Oct 17, 2017 at 1:02 PM, Paolo Abeni <pab...@redhat.com> wrote:
> On Tue, 2017-10-17 at 11:58 -0700, Wei Wang wrote:
>> On Tue, Oct 17, 2017 at 10:40 AM, Paolo Abeni <pab...@redhat.com> wrote:
>> > The commit 2b760fcf5cfb ("ipv6: hook up exception table to store
>> > dst cache") partially reverted 1e2ea8ad37be ("ipv6: set
>> > dst.obsolete when a cached route has expired").
>> >
>> > This change brings back the dst obsoleting and push it a step
>> > farther: cached dst are always obsoleted when removed from the
>> > fib tree, and removal by time expiration is now performed
>> > regardless of dst->__refcnt, to be consistent with what we
>> > already do for RTF_GATEWAY dst.
>> >
>> > Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache")
>> > Signed-off-by: Paolo Abeni <pab...@redhat.com>
>> > ---
>> >  net/ipv6/route.c | 13 +++--
>> >  1 file changed, 11 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> > index 8b25a31b6b03..fce740049e3e 100644
>> > --- a/net/ipv6/route.c
>> > +++ b/net/ipv6/route.c
>> > @@ -1147,6 +1147,12 @@ static void rt6_remove_exception(struct rt6_exception_bucket *bucket,
>> > if (!bucket || !rt6_ex)
>> > return;
>> >
>> > +   /* sockets, flow cache, etc. can hold a reference to this dst, be sure
>> > +* they will drop it.
>> > +*/
>> > +   if (rt6_ex->rt6i)
>> > +   rt6_ex->rt6i->dst.obsolete = DST_OBSOLETE_FORCE_CHK;
>> > +
>>
>> Hmm... I don't really think it is needed. rt6 is created with
>> rt6->dst.obsolete set to DST_OBSOLETE_FORCE_CHK. And by the time the
>> above function is called, it should still be that value.
>> Furthermore, the later call rt6_release() calls dst_dev_put() which
>> sets rt6->dst.obsolete to DST_OBSOLETE_DEAD to indicate this route has
>> been removed from the tree.
>
> You are right, this looks unnecessary if we keep the chunk below.
>
>> > net = dev_net(rt6_ex->rt6i->dst.dev);
>> > rt6_ex->rt6i->rt6i_node = NULL;
>> > hlist_del_rcu(&rt6_ex->hlist);
>> > @@ -1575,8 +1581,11 @@ static void rt6_age_examine_exception(struct rt6_exception_bucket *bucket,
>> >  {
>> > struct rt6_info *rt = rt6_ex->rt6i;
>> >
>> > -   if (atomic_read(&rt->dst.__refcnt) == 1 &&
>> > -   time_after_eq(now, rt->dst.lastuse + gc_args->timeout)) {
>> > +   /* we are pruning and obsoleting the exception route even if others
>> > +* have still reference to it, so that on next dst_check() such
>> > +* reference can be dropped
>> > +*/
>> > +   if (time_after_eq(now, rt->dst.lastuse + gc_args->timeout)) {
>>
>> Why do we want to change this behavior? Before my patch series, cached
>> routes were only deleted from the tree in fib6_age() when
>> rt->dst.__refcnt == 1, isn't it?
>
> yes, but that really looks like a relic from the ancient past more than
> something really needed. We already remove the dst from the fib tree
> regardless of the refcnt if the gateway validation fails - a few lines
> below in the same function.
>
> Waiting for __refcnt going down will let the kernel keep the exception
> entry around for much longer - potentially forever, if e.g. we have a
> reference in a socket dst cache and the application stops processing
> packets.
>

True. If the socket is idle and doesn't send/receive packets,
dst_check() won't get triggered and the socket will keep holding
refcnt on the obsolete dst.

> Meanwhile other sockets may grab more references to (and use) the same
> aged-out dst.
>
I don't think other sockets could grab more reference to this dst
because this dst should already be removed from the fib6 tree.

> The commit 1e2ea8ad37be ("ipv6: set dst.obsolete when a cached route
> has expired") was the solution to the above issue prior to the recent
> refactor.
>

I don't really understand how this commit is solving the above issue.
This commit still only ages out cached route if &rt->dst.__refcnt ==
1. So if socket is holding refcnt to this dst and dst_check() is not
getting called,  this cached route still won't get deleted.

> Cheers,
>
> Paolo

Re: [PATCH net-next 3/3] ipv6: obsolete cached dst when removing them from fib tree

2017-10-17 Thread Wei Wang

On Tue, Oct 17, 2017 at 10:40 AM, Paolo Abeni  wrote:
> The commit 2b760fcf5cfb ("ipv6: hook up exception table to store
> dst cache") partially reverted 1e2ea8ad37be ("ipv6: set
> dst.obsolete when a cached route has expired").
>
> This change brings back the dst obsoleting and push it a step
> farther: cached dst are always obsoleted when removed from the
> fib tree, and removal by time expiration is now performed
> regardless of dst->__refcnt, to be consistent with what we
> already do for RTF_GATEWAY dst.
>
> Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache")
> Signed-off-by: Paolo Abeni 
> ---
>  net/ipv6/route.c | 13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 8b25a31b6b03..fce740049e3e 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1147,6 +1147,12 @@ static void rt6_remove_exception(struct rt6_exception_bucket *bucket,
> if (!bucket || !rt6_ex)
> return;
>
> +   /* sockets, flow cache, etc. can hold a reference to this dst, be sure
> +* they will drop it.
> +*/
> +   if (rt6_ex->rt6i)
> +   rt6_ex->rt6i->dst.obsolete = DST_OBSOLETE_FORCE_CHK;
> +

Hmm... I don't really think it is needed. rt6 is created with
rt6->dst.obsolete set to DST_OBSOLETE_FORCE_CHK. And by the time the
above function is called, it should still be that value.
Furthermore, the later call rt6_release() calls dst_dev_put() which
sets rt6->dst.obsolete to DST_OBSOLETE_DEAD to indicate this route has
been removed from the tree.

> net = dev_net(rt6_ex->rt6i->dst.dev);
> rt6_ex->rt6i->rt6i_node = NULL;
> hlist_del_rcu(&rt6_ex->hlist);
> @@ -1575,8 +1581,11 @@ static void rt6_age_examine_exception(struct rt6_exception_bucket *bucket,
>  {
> struct rt6_info *rt = rt6_ex->rt6i;
>
> -   if (atomic_read(&rt->dst.__refcnt) == 1 &&
> -   time_after_eq(now, rt->dst.lastuse + gc_args->timeout)) {
> +   /* we are pruning and obsoleting the exception route even if others
> +* have still reference to it, so that on next dst_check() such
> +* reference can be dropped
> +*/
> +   if (time_after_eq(now, rt->dst.lastuse + gc_args->timeout)) {

Why do we want to change this behavior? Before my patch series, cached
routes were only deleted from the tree in fib6_age() when
rt->dst.__refcnt == 1, isn't it?

> RT6_TRACE("aging clone %p\n", rt);
> rt6_remove_exception(bucket, rt6_ex);
> return;
> --
> 2.13.6
>

Re: [PATCH net-next 2/3] ipv6: start fib6 gc on RTF_CACHE dst creation

2017-10-17 Thread Wei Wang

On Tue, Oct 17, 2017 at 10:40 AM, Paolo Abeni <pab...@redhat.com> wrote:
> After the commit 2b760fcf5cfb ("ipv6: hook up exception
> table to store dst cache"), the fib6 gc is not started after
> the creation of a RTF_CACHE via a redirect or pmtu update, since
> fib6_add() isn't invoked anymore for such dsts.
>
> We need the fib6 gc to run periodically to clean the RTF_CACHE,
> or the dst will stay there forever.
>
> Fix it by explicitly calling fib6_force_start_gc() on successful
> exception creation. gc_args->more accounting will ensure that
> the gc timer will run for whatever time needed to properly
> clean the table.
>
> Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache")
> Signed-off-by: Paolo Abeni <pab...@redhat.com>
> ---
Acked-by: Wei Wang <wei...@google.com>

Totally true. Thanks for catching this.

>  net/ipv6/route.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 5bb53dbd4fd3..8b25a31b6b03 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1340,8 +1340,10 @@ static int rt6_insert_exception(struct rt6_info *nrt,
> spin_unlock_bh(&rt6_exception_lock);
>
> /* Update fn->fn_sernum to invalidate all cached dst */
> -   if (!err)
> +   if (!err) {
> fib6_update_sernum(ort);
> +   fib6_force_start_gc(net);
> +   }
>
> return err;
>  }
> --
> 2.13.6
>

Re: [PATCH net-next 1/3] ipv6: fix route cache dump

2017-10-17 Thread Wei Wang

On Tue, Oct 17, 2017 at 10:40 AM, Paolo Abeni  wrote:
> After the commit 2b760fcf5cfb ("ipv6: hook up exception table to
> store dst cache"), entries in the routing cache are not shown by:
>
> ip route show cache
>
> because the per route exception table containing such routes is not
> traversed by rt6_dump_route().
> Fix it by explicitly dumping all routes present into the
> rt6i_exception_bucket.
>
> Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache")
> Signed-off-by: Paolo Abeni 
> ---
>  net/ipv6/route.c | 30 ++
>  1 file changed, 26 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 01a103c23a6c..5bb53dbd4fd3 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -4190,10 +4190,14 @@ static int rt6_fill_node(struct net *net,
> return -EMSGSIZE;
>  }
>
> +/* this is called under the RCU lock */
>  int rt6_dump_route(struct rt6_info *rt, void *p_arg)
>  {
> struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) p_arg;
> +   struct rt6_exception_bucket *bucket;
> +   struct rt6_exception *rt6_ex;
> struct net *net = arg->net;
> +   int err, port_id, seq, i;
>
> if (rt == net->ipv6.ip6_null_entry)
> return 0;
> @@ -4209,10 +4213,28 @@ int rt6_dump_route(struct rt6_info *rt, void *p_arg)
> }
> }
>
> -   return rt6_fill_node(net,
> -arg->skb, rt, NULL, NULL, 0, RTM_NEWROUTE,
> -NETLINK_CB(arg->cb->skb).portid, arg->cb->nlh->nlmsg_seq,
> -NLM_F_MULTI);
> +   /* dump exceptions table, if available */
> +   port_id = NETLINK_CB(arg->cb->skb).portid;
> +   seq = arg->cb->nlh->nlmsg_seq;
> +   bucket = rcu_dereference(rt->rt6i_exception_bucket);
> +   if (!bucket)
> +   goto no_exceptions;
> +
> +   for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) {
> +   hlist_for_each_entry_rcu(rt6_ex, &bucket->chain, hlist) {
> +   err = rt6_fill_node(net, arg->skb, rt6_ex->rt6i, NULL,
> +   NULL, 0, RTM_NEWROUTE, port_id, seq,
> +   NLM_F_MULTI);
> +   if (err)
> +   return err;
> +   }
> +
> +   bucket++;
> +   }
> +
> +no_exceptions:
> +   return rt6_fill_node(net, arg->skb, rt, NULL, NULL, 0, RTM_NEWROUTE,
> +port_id, seq, NLM_F_MULTI);
>  }
>

Hi Paolo,

Thanks for doing this.
But I think your patch does not take care of the case where there are
a lot of cached routes in the exception table and 1 skb is just not
enough to dump the main route + all cached routes in the exception
table.
In this case, your patch will keep dumping the same main route.

I think some logic needs to be incorporated into the fib6_walk() so
that it can also remember the last dumped cached route if necessary in
the exception table and start from there for the next dump.
I do have a patch for that and that patch tries to keep a linked list
of all cached routes from the exception table in the walker struct and
remove any routes that are already dumped.
It is a bit complicated and might not be the best solution. And as
IPv4 already does not support dumping cached routes, I did not send
that out in the previous patch series.
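
To make the "remember where the last dump stopped" idea concrete, here is a
minimal userspace sketch. It is not the patch mentioned above and it does
not use kernel APIs: struct bucket_table, struct dump_cursor and
dump_bucket_table() are hypothetical names, and a fixed-size int array
stands in for the skb. The cached entries are emitted first and the main
route last, and the cursor records the position to resume from when the
buffer fills up.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 4		/* stand-in for FIB6_EXCEPTION_BUCKET_SIZE */
#define MAX_PER_BUCKET 8

/* Hypothetical stand-in for the exception table of one main route. */
struct bucket_table {
	int entries[NBUCKETS][MAX_PER_BUCKET];
	size_t count[NBUCKETS];
};

/* Hypothetical cursor that survives across dump calls. */
struct dump_cursor {
	size_t bucket;		/* next bucket to visit */
	size_t index;		/* next entry inside that bucket */
	int main_done;		/* main route already emitted? */
};

/* Write as many items as fit into out[0..cap) and return how many were
 * written. *finished becomes 1 once every cached entry and the main
 * route have been emitted.
 */
static size_t dump_bucket_table(const struct bucket_table *t, int main_route,
				struct dump_cursor *cur,
				int *out, size_t cap, int *finished)
{
	size_t n = 0;

	assert(cap > 0);
	*finished = 0;

	for (; cur->bucket < NBUCKETS; cur->bucket++, cur->index = 0) {
		while (cur->index < t->count[cur->bucket]) {
			if (n == cap)
				return n;	/* full: resume here next call */
			out[n++] = t->entries[cur->bucket][cur->index++];
		}
	}

	if (!cur->main_done) {
		if (n == cap)
			return n;
		out[n++] = main_route;
		cur->main_done = 1;
	}
	*finished = 1;
	return n;
}
```

A caller would zero-initialize the cursor once and call dump_bucket_table()
repeatedly until *finished is set. Note that this sketch assumes the table
does not change between calls, which is exactly what cannot be assumed
under RCU and is what makes the real problem harder.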


>  static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
> --
> 2.13.6
>

[PATCH net-next] ipv6: only update __use and lastusetime once per jiffy at most

2017-10-13 Thread Wei Wang

From: Wei Wang <wei...@google.com>

In order to not dirty the cacheline too often, we try to only update
dst->__use and dst->lastuse at most once per jiffy.
As dst->lastuse is only used by the ipv6 garbage collector, one jiffy
should be good enough time resolution.
And __use is only used in ipv6_route_seq_show() to show how many times a
dst has been used. And as __use is not atomic_t right now, it does not
show the precise number of usage times anyway. So we think it should be
OK to only update it at most once per jiffy.

According to my latest syn flood test on a machine with an Intel Xeon 6th
gen processor and 2 10G mlx nics bonded together, each with 8 rx queues
on 2 NUMA nodes:
With this patch, the packet process rate increases from ~3.49Mpps to
~3.75Mpps, an increase of about 7%.

Note: dst_use() is being renamed to dst_hold_and_use() to better specify
the purpose of the function.

Signed-off-by: Wei Wang <wei...@google.com>
Acked-by: Eric Dumazet <eduma...@google.com>
---
 include/net/dst.h | 15 ---
 net/decnet/dn_route.c |  8 
 2 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 204c19e25456..5047e8053d6c 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -255,17 +255,18 @@ static inline void dst_hold(struct dst_entry *dst)
WARN_ON(atomic_inc_not_zero(&dst->__refcnt) == 0);
 }
 
-static inline void dst_use(struct dst_entry *dst, unsigned long time)
+static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
 {
-   dst_hold(dst);
-   dst->__use++;
-   dst->lastuse = time;
+   if (time != dst->lastuse) {
+   dst->__use++;
+   dst->lastuse = time;
+   }
 }
 
-static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
+static inline void dst_hold_and_use(struct dst_entry *dst, unsigned long time)
 {
-   dst->__use++;
-   dst->lastuse = time;
+   dst_hold(dst);
+   dst_use_noref(dst, time);
 }
 
 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index 0bd3afd01dd2..bff5ab88cdbb 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -338,7 +338,7 @@ static int dn_insert_route(struct dn_route *rt, unsigned int hash, struct dn_rou
   dn_rt_hash_table[hash].chain);
rcu_assign_pointer(dn_rt_hash_table[hash].chain, rth);
 
-   dst_use(&rth->dst, now);
+   dst_hold_and_use(&rth->dst, now);
spin_unlock_bh(&dn_rt_hash_table[hash].lock);
 
dst_release_immediate(&rt->dst);
@@ -351,7 +351,7 @@ static int dn_insert_route(struct dn_route *rt, unsigned int hash, struct dn_rou
rcu_assign_pointer(rt->dst.dn_next, dn_rt_hash_table[hash].chain);
rcu_assign_pointer(dn_rt_hash_table[hash].chain, rt);
 
-   dst_use(&rt->dst, now);
+   dst_hold_and_use(&rt->dst, now);
spin_unlock_bh(&dn_rt_hash_table[hash].lock);
*rp = rt;
return 0;
@@ -1258,7 +1258,7 @@ static int __dn_route_output_key(struct dst_entry **pprt, const struct flowidn *
(flp->flowidn_mark == rt->fld.flowidn_mark) &&
dn_is_output_route(rt) &&
(rt->fld.flowidn_oif == flp->flowidn_oif)) {
-   dst_use(&rt->dst, jiffies);
+   dst_hold_and_use(&rt->dst, jiffies);
rcu_read_unlock_bh();
*pprt = &rt->dst;
return 0;
@@ -1535,7 +1535,7 @@ static int dn_route_input(struct sk_buff *skb)
(rt->fld.flowidn_oif == 0) &&
(rt->fld.flowidn_mark == skb->mark) &&
(rt->fld.flowidn_iif == cb->iif)) {
-   dst_use(&rt->dst, jiffies);
+   dst_hold_and_use(&rt->dst, jiffies);
rcu_read_unlock();
skb_dst_set(skb, (struct dst_entry *)rt);
return 0;
-- 
2.15.0.rc0.271.g36b669edcc-goog
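
For readers who want to see the effect of the "at most once per jiffy"
check in isolation, here is a small userspace sketch. struct dst_stats and
stats_use_noref() are hypothetical stand-ins, not the kernel definitions;
only the comparison against the last recorded tick mirrors the patch above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two fields the patch touches. */
struct dst_stats {
	unsigned long use;	/* plays the role of dst->__use */
	unsigned long lastuse;	/* plays the role of dst->lastuse (ticks) */
};

/* Record a use, but write to the structure at most once per tick so the
 * cacheline is not dirtied on every packet.
 */
static void stats_use_noref(struct dst_stats *s, unsigned long now)
{
	assert(s != NULL);
	if (now != s->lastuse) {
		s->use++;
		s->lastuse = now;
	}
}
```

One consequence worth keeping in mind: after this change the counter no
longer counts calls, it counts the distinct ticks in which at least one
call happened.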

[PATCH net-next] ipv6: check fn before doing FIB6_SUBTREE(fn)

2017-10-13 Thread Wei Wang

From: Wei Wang <wei...@google.com>

In fib6_locate(), we need to first make sure fn is not NULL before doing
FIB6_SUBTREE(fn) to avoid crash.

This fixes the following static checker warning:
net/ipv6/ip6_fib.c:1462 fib6_locate()
 warn: variable dereferenced before check 'fn' (see line 1459)

net/ipv6/ip6_fib.c
  1458  if (src_len) {
  1459  struct fib6_node *subtree = FIB6_SUBTREE(fn);

We shifted this dereference

  1460
  1461  WARN_ON(saddr == NULL);
  1462  if (fn && subtree)
^^
before the check for NULL.

  1463  fn = fib6_locate_1(subtree, saddr, src_len,
  1464 offsetof(struct rt6_info, rt6i_src)

Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Dan Carpenter <dan.carpen...@oracle.com>
Signed-off-by: Wei Wang <wei...@google.com>
Acked-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv6/ip6_fib.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index c2ecd5ec638a..548af48212fc 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1456,13 +1456,16 @@ struct fib6_node *fib6_locate(struct fib6_node *root,
 
 #ifdef CONFIG_IPV6_SUBTREES
if (src_len) {
-   struct fib6_node *subtree = FIB6_SUBTREE(fn);
-
WARN_ON(saddr == NULL);
-   if (fn && subtree)
-   fn = fib6_locate_1(subtree, saddr, src_len,
+   if (fn) {
+   struct fib6_node *subtree = FIB6_SUBTREE(fn);
+
+   if (subtree) {
+   fn = fib6_locate_1(subtree, saddr, src_len,
   offsetof(struct rt6_info, rt6i_src),
   exact_match);
+   }
+   }
}
 #endif
 
-- 
2.15.0.rc0.271.g36b669edcc-goog

Re: [PATCH][V2] ipv6: fix incorrect bitwise operator used on rt6i_flags

2017-10-10 Thread Wei Wang

On Tue, Oct 10, 2017 at 11:10 AM, Colin King <colin.k...@canonical.com> wrote:
> From: Colin Ian King <colin.k...@canonical.com>
>
> The use of the | operator always leads to true which looks rather
> suspect to me. Fix this by using & instead to just check the
> RTF_CACHE entry bit.
>
> Detected by CoverityScan, CID#1457734, #1457747 ("Wrong operator used")
>
> Fixes: 35732d01fe31 ("ipv6: introduce a hash table to store dst cache")
> Signed-off-by: Colin Ian King <colin.k...@canonical.com>
> ---

Acked-by: Wei Wang <wei...@google.com>

>  net/ipv6/route.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 6db1541eaa7b..dd9ba1192dbc 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1425,7 +1425,7 @@ int rt6_remove_exception_rt(struct rt6_info *rt)
> int err;
>
> if (!from ||
> -   !(rt->rt6i_flags | RTF_CACHE))
> +   !(rt->rt6i_flags & RTF_CACHE))
> return -EINVAL;
>
> if (!rcu_access_pointer(from->rt6i_exception_bucket))
> @@ -1469,7 +1469,7 @@ static void rt6_update_exception_stamp_rt(struct rt6_info *rt)
> struct rt6_exception *rt6_ex;
>
> if (!from ||
> -   !(rt->rt6i_flags | RTF_CACHE))
> +   !(rt->rt6i_flags & RTF_CACHE))
> return;
>
> rcu_read_lock();
> --
> 2.14.1
>
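
A small standalone illustration of why the '|' form can never work as a
flag test. The helper names are made up for this example, and the two bit
values are only meant to be distinct bits (they are believed to match the
uapi header, but nothing here depends on that).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RTF_GATEWAY 0x0002u
#define RTF_CACHE   0x01000000u

/* Buggy form: (flags | RTF_CACHE) always has the RTF_CACHE bit set, so it
 * is never zero and this function returns false for every input.
 */
static bool not_cached_buggy(uint32_t flags)
{
	return !(flags | RTF_CACHE);
}

/* Fixed form: true exactly when the RTF_CACHE bit is clear. */
static bool not_cached_fixed(uint32_t flags)
{
	return !(flags & RTF_CACHE);
}

/* Self-check that can be called from a test main(). */
void check_rtf_cache_tests(void)
{
	assert(!not_cached_buggy(0));
	assert(!not_cached_buggy(RTF_GATEWAY));
	assert(not_cached_fixed(RTF_GATEWAY));
	assert(!not_cached_fixed(RTF_GATEWAY | RTF_CACHE));
}
```

With the buggy test, the early-return branch guarded by it is dead code as
far as the flag is concerned, which is what the static checker reported.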

Re: [PATCH][net-next] ipv6: fix incorrect bitwise operator used on rt6i_flags

2017-10-10 Thread Wei Wang

On Tue, Oct 10, 2017 at 11:10 AM, Martin KaFai Lau <ka...@fb.com> wrote:
> On Tue, Oct 10, 2017 at 05:55:27PM +, Colin King wrote:
>> From: Colin Ian King <colin.k...@canonical.com>
>>
>> The use of the | operator always leads to true on the expression
>> (rt->rt6i_flags | RTF_CACHE) which looks rather suspect to me. I
>> believe this is fixed by using & instead to just check the
>> RTF_CACHE entry bit.
> Good catch. LGTM. If rt does not have RTF_CACHE set, it should not be in the
> exception table.
>
> Acked-by: Martin KaFai Lau <ka...@fb.com>
>

Thanks a lot for catching this. Yes. It should have been '&' instead of '|'.

Acked-by: Wei Wang <wei...@google.com>

>>
>> Detected by CoverityScan, CID#1457747 ("Wrong operator used")
>>
>> Fixes: 35732d01fe31 ("ipv6: introduce a hash table to store dst cache")
>> Signed-off-by: Colin Ian King <colin.k...@canonical.com>
>> ---
>>  net/ipv6/route.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index 6db1541eaa7b..0556d1ee189c 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -1425,7 +1425,7 @@ int rt6_remove_exception_rt(struct rt6_info *rt)
>>   int err;
>>
>>   if (!from ||
>> - !(rt->rt6i_flags | RTF_CACHE))
>> + !(rt->rt6i_flags & RTF_CACHE))
>>   return -EINVAL;
>>
>>   if (!rcu_access_pointer(from->rt6i_exception_bucket))
>> --
>> 2.14.1
>>

[PATCH net-next] ipv6: use rcu_dereference_bh() in ipv6_route_seq_next()

2017-10-09 Thread Wei Wang

From: Wei Wang <wei...@google.com>

This patch replaces rcu_dereference() with rcu_dereference_bh() in
ipv6_route_seq_next() to avoid the following warning:

[   19.431685] WARNING: suspicious RCU usage
[   19.433451] 4.14.0-rc3-00914-g66f5d6c #118 Not tainted
[   19.435509] -
[   19.437267] net/ipv6/ip6_fib.c:2259 suspicious
rcu_dereference_check() usage!
[   19.440790]
[   19.440790] other info that might help us debug this:
[   19.440790]
[   19.444734]
[   19.444734] rcu_scheduler_active = 2, debug_locks = 1
[   19.447757] 2 locks held by odhcpd/3720:
[   19.449480]  #0:  (>lock){+.+.}, at: []
seq_read+0x3c/0x333
[   19.452720]  #1:  (rcu_read_lock_bh){}, at: []
ipv6_route_seq_start+0x5/0xfd
[   19.456323]
[   19.456323] stack backtrace:
[   19.458812] CPU: 0 PID: 3720 Comm: odhcpd Not tainted
4.14.0-rc3-00914-g66f5d6c #118
[   19.462042] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[   19.465414] Call Trace:
[   19.466788]  dump_stack+0x86/0xc0
[   19.468358]  lockdep_rcu_suspicious+0xea/0xf3
[   19.470183]  ipv6_route_seq_next+0x71/0x164
[   19.471963]  seq_read+0x244/0x333
[   19.473522]  proc_reg_read+0x48/0x67
[   19.475152]  ? proc_reg_write+0x67/0x67
[   19.476862]  __vfs_read+0x26/0x10b
[   19.478463]  ? __might_fault+0x37/0x84
[   19.480148]  vfs_read+0xba/0x146
[   19.481690]  SyS_read+0x51/0x8e
[   19.483197]  do_int80_syscall_32+0x66/0x15a
[   19.484969]  entry_INT80_compat+0x32/0x50
[   19.486707] RIP: 0023:0xf7f0be8e
[   19.488244] RSP: 002b:ffa75d04 EFLAGS: 0246 ORIG_RAX:
0003
[   19.491431] RAX: ffda RBX: 0009 RCX:
08056068
[   19.493886] RDX: 1000 RSI: 08056008 RDI:
1000
[   19.496331] RBP: 01ff R08:  R09:

[   19.498768] R10:  R11:  R12:

[   19.501217] R13:  R14:  R15:


Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Xiaolong Ye <xiaolong...@intel.com>
Signed-off-by: Wei Wang <wei...@google.com>
Acked-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv6/ip6_fib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 52a29ba32928..c2ecd5ec638a 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2262,7 +2262,7 @@ static void *ipv6_route_seq_next(struct seq_file *seq, void *v, loff_t *pos)
if (!v)
goto iter_table;
 
-   n = rcu_dereference(((struct rt6_info *)v)->dst.rt6_next);
+   n = rcu_dereference_bh(((struct rt6_info *)v)->dst.rt6_next);
if (n) {
++*pos;
return n;
-- 
2.14.2.920.gcf0c67979c-goog

Re: [PATCH net-next 11/16] ipv6: replace dst_hold() with dst_hold_safe() in routing code

2017-10-06 Thread Wei Wang

On Fri, Oct 6, 2017 at 4:57 PM, 吉藤英明 <hideaki.yoshif...@miraclelinux.com> wrote:
> Hi,
>
> 2017-10-07 4:06 GMT+09:00 Wei Wang <wei...@google.com>:
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index 941c062389d2..aeb349aea429 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
> :
>> @@ -1625,12 +1643,17 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
>> if (rt_cache)
>> rt = rt_cache;
>>
>> -   if (rt == net->ipv6.ip6_null_entry || (rt->rt6i_flags & RTF_CACHE)) {
>> -   dst_use(&rt->dst, jiffies);
>> +   if (rt == net->ipv6.ip6_null_entry) {
>> +   read_unlock_bh(&table->tb6_lock);
>> +   dst_hold(&rt->dst);
>> +   trace_fib6_table_lookup(net, rt, table->tb6_id, fl6);
>> +   return rt;
>> +   } else if (rt->rt6i_flags & RTF_CACHE) {
>> +   if (ip6_hold_safe(net, &rt, true)) {
>> +   dst_use_noref(&rt->dst, jiffies);
>> +   rt6_dst_from_metrics_check(rt);
>> +   }
>> read_unlock_bh(&table->tb6_lock);
>> -
>> -   rt6_dst_from_metrics_check(rt);
>> -
>> trace_fib6_table_lookup(net, rt, table->tb6_id, fl6);
>> return rt;
>> } else if (unlikely((fl6->flowi6_flags & FLOWI_FLAG_KNOWN_NH) &&
>
> Is it intended to move rt6_dst_from_metrics_check() inside the table lock?
>

I think it doesn't really matter whether rt6_dst_from_metrics_check()
is inside the table lock or not. The code looks cleaner if we put it
inside the if (ip6_hold_safe()) {} block because we don't want to do
rt6_dst_from_metrics_check() if ip6_hold_safe() returns false.

> --yoshfuji

[PATCH net-next 11/16] ipv6: replace dst_hold() with dst_hold_safe() in routing code

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

With rwlock, it is safe to call dst_hold() in the read thread because
the read thread is guaranteed to be separated from the write thread.
However, after we replace rwlock with rcu, it is no longer safe to use
dst_hold(). A dst might already have been deleted and be waiting for the
rcu grace period to pass before its memory is freed when a read thread
tries to do dst_hold(). This could potentially cause a double free.

So this commit replaces all dst_hold() with dst_hold_safe() in all read
threads to avoid this double free issue.
And in order to make the code more compact, a new function ip6_hold_safe()
is introduced. It calls dst_hold_safe() first, and if that fails, it will
either fall back to hold and return net->ipv6.ip6_null_entry or set rt to
NULL according to the caller's need.
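
The "hold only if still alive, otherwise fall back" pattern described
above can be sketched in userspace with C11 atomics. This is an
illustration under assumptions, not the kernel code: struct obj,
obj_hold_safe() and hold_or_fallback() are hypothetical names, and the
fallback object is passed in directly instead of being looked up through
struct net.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
	atomic_int refcnt;	/* 0 means "released, waiting to be freed" */
};

/* Take a reference only if the object is still live (refcnt > 0). */
static bool obj_hold_safe(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Same shape as the helper in the commit message: on failure, either
 * switch *po to an always-live fallback object (taking a reference on
 * it) or set *po to NULL when no fallback is given.
 */
static bool hold_or_fallback(struct obj **po, struct obj *fallback)
{
	assert(po != NULL && *po != NULL);
	if (obj_hold_safe(*po))
		return true;
	if (fallback)
		atomic_fetch_add(&fallback->refcnt, 1);
	*po = fallback;
	return false;
}
```

The important property is that a refcount which has already dropped to
zero is never incremented again, so the pending free cannot be followed by
a second release from the reader.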

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv6/addrconf.c |  3 ++-
 net/ipv6/route.c| 71 +++--
 2 files changed, 54 insertions(+), 20 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 873afafddfc4..f86e931d555e 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2333,7 +2333,8 @@ static struct rt6_info *addrconf_get_prefix_route(const struct in6_addr *pfx,
continue;
if ((rt->rt6i_flags & noflags) != 0)
continue;
-   dst_hold(&rt->dst);
+   if (!dst_hold_safe(&rt->dst))
+   rt = NULL;
break;
}
 out:
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 941c062389d2..aeb349aea429 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -874,6 +874,23 @@ static struct fib6_node* fib6_backtrack(struct fib6_node *fn,
}
 }
 
+static bool ip6_hold_safe(struct net *net, struct rt6_info **prt,
+ bool null_fallback)
+{
+   struct rt6_info *rt = *prt;
+
+   if (dst_hold_safe(&rt->dst))
+   return true;
+   if (null_fallback) {
+   rt = net->ipv6.ip6_null_entry;
+   dst_hold(&rt->dst);
+   } else {
+   rt = NULL;
+   }
+   *prt = rt;
+   return false;
+}
+
 static struct rt6_info *ip6_pol_route_lookup(struct net *net,
 struct fib6_table *table,
 struct flowi6 *fl6, int flags)
@@ -898,7 +915,9 @@ static struct rt6_info *ip6_pol_route_lookup(struct net *net,
if (rt_cache)
rt = rt_cache;
 
-   dst_use(&rt->dst, jiffies);
+   if (ip6_hold_safe(net, &rt, true))
+   dst_use_noref(&rt->dst, jiffies);
+
read_unlock_bh(&table->tb6_lock);
 
trace_fib6_table_lookup(net, rt, table->tb6_id, fl6);
@@ -1061,10 +1080,9 @@ static struct rt6_info *rt6_get_pcpu_route(struct rt6_info *rt)
p = this_cpu_ptr(rt->rt6i_pcpu);
pcpu_rt = *p;
 
-   if (pcpu_rt) {
-   dst_hold(&pcpu_rt->dst);
+   if (pcpu_rt && ip6_hold_safe(NULL, &pcpu_rt, false))
rt6_dst_from_metrics_check(pcpu_rt);
-   }
+
return pcpu_rt;
 }
 
@@ -1625,12 +1643,17 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
if (rt_cache)
rt = rt_cache;
 
-   if (rt == net->ipv6.ip6_null_entry || (rt->rt6i_flags & RTF_CACHE)) {
-   dst_use(&rt->dst, jiffies);
+   if (rt == net->ipv6.ip6_null_entry) {
+   read_unlock_bh(&table->tb6_lock);
+   dst_hold(&rt->dst);
+   trace_fib6_table_lookup(net, rt, table->tb6_id, fl6);
+   return rt;
+   } else if (rt->rt6i_flags & RTF_CACHE) {
+   if (ip6_hold_safe(net, &rt, true)) {
+   dst_use_noref(&rt->dst, jiffies);
+   rt6_dst_from_metrics_check(rt);
+   }
read_unlock_bh(&table->tb6_lock);
-
-   rt6_dst_from_metrics_check(rt);
-
trace_fib6_table_lookup(net, rt, table->tb6_id, fl6);
return rt;
} else if (unlikely((fl6->flowi6_flags & FLOWI_FLAG_KNOWN_NH) &&
@@ -1643,7 +1666,13 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
 
struct rt6_info *uncached_rt;
 
-   dst_use(&rt->dst, jiffies);
+   if (ip6_hold_safe(net, &rt, true)) {
+   dst_use_noref(&rt->dst, jiffies);
+   } else {
+   read_unlock_bh(&table->tb6_lock);
+   uncached_rt = rt;
+   goto uncached_rt_out;
+   }
read_unlock_bh(&table->tb6_lock);
 
uncached_rt = ip6_rt_cache_alloc(rt, &fl6->daddr, NULL);
@@ -1659,6 +1688,7 @@ struct rt

[PATCH net-next 05/16] ipv6: prepare rt6_clean_tohost() for exception table

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

If we move all cached dst into the exception table under the main route,
the current rt6_clean_tohost() will no longer be able to access them.
This commit makes fib6_clean_tohost() also go through all cached
routes in the exception table and remove cached gateway routes to the
passed-in gateway.
This is a preparation patch for moving all cached routes into the
exception table.

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv6/route.c | 48 +++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index d9805a857809..e8e901589564 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1491,6 +1491,43 @@ static void rt6_exceptions_update_pmtu(struct rt6_info *rt, int mtu)
}
 }
 
+#define RTF_CACHE_GATEWAY  (RTF_GATEWAY | RTF_CACHE)
+
+static void rt6_exceptions_clean_tohost(struct rt6_info *rt,
+   struct in6_addr *gateway)
+{
+   struct rt6_exception_bucket *bucket;
+   struct rt6_exception *rt6_ex;
+   struct hlist_node *tmp;
+   int i;
+
+   if (!rcu_access_pointer(rt->rt6i_exception_bucket))
+   return;
+
+   spin_lock_bh(&rt6_exception_lock);
+   bucket = rcu_dereference_protected(rt->rt6i_exception_bucket,
+lockdep_is_held(&rt6_exception_lock));
+
+   if (bucket) {
+   for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) {
+   hlist_for_each_entry_safe(rt6_ex, tmp,
+ &bucket->chain, hlist) {
+   struct rt6_info *entry = rt6_ex->rt6i;
+
+   if ((entry->rt6i_flags & RTF_CACHE_GATEWAY) ==
+   RTF_CACHE_GATEWAY &&
+   ipv6_addr_equal(gateway,
+   &entry->rt6i_gateway)) {
+   rt6_remove_exception(bucket, rt6_ex);
+   }
+   }
+   bucket++;
+   }
+   }
+
+   spin_unlock_bh(&rt6_exception_lock);
+}
+
 struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
   int oif, struct flowi6 *fl6, int flags)
 {
@@ -3240,18 +3277,27 @@ void rt6_remove_prefsrc(struct inet6_ifaddr *ifp)
 }
 
 #define RTF_RA_ROUTER  (RTF_ADDRCONF | RTF_DEFAULT | RTF_GATEWAY)
-#define RTF_CACHE_GATEWAY  (RTF_GATEWAY | RTF_CACHE)
 
 /* Remove routers and update dst entries when gateway turn into host. */
 static int fib6_clean_tohost(struct rt6_info *rt, void *arg)
 {
struct in6_addr *gateway = (struct in6_addr *)arg;
 
+   /* RTF_CACHE_GATEWAY case will be removed once the exception
+* table is hooked up to store all cached routes.
+*/
if ((((rt->rt6i_flags & RTF_RA_ROUTER) == RTF_RA_ROUTER) ||
 ((rt->rt6i_flags & RTF_CACHE_GATEWAY) == RTF_CACHE_GATEWAY)) &&
 ipv6_addr_equal(gateway, &rt->rt6i_gateway)) {
return -1;
}
+
+   /* Further clean up cached routes in exception table.
+* This is needed because cached route may have a different
+* gateway than its 'parent' in the case of an ip redirect.
+*/
+   rt6_exceptions_clean_tohost(rt, gateway);
+
return 0;
 }
 
-- 
2.14.2.920.gcf0c67979c-goog
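
The walk above removes entries while iterating, which is why it needs the
_safe iterator and the extra tmp cursor. A userspace sketch of the same
idea on a plain singly linked list follows. struct node, the F_* flags
and chain_clean_tohost() are hypothetical names chosen for this example;
an int stands in for the gateway address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct node {
	struct node *next;
	unsigned int flags;
	int gateway;
};

#define F_GATEWAY	0x1u
#define F_CACHE		0x2u
#define F_CACHE_GATEWAY	(F_GATEWAY | F_CACHE)

/* Remove every node that has both flag bits set and matches gateway.
 * The next pointer is saved before the node may be freed, which is the
 * job the _safe iterator does for its caller. Returns the number of
 * nodes removed.
 */
static size_t chain_clean_tohost(struct node **head, int gateway)
{
	struct node **link = head;
	struct node *cur, *tmp;
	size_t removed = 0;

	assert(head != NULL);
	for (cur = *head; cur; cur = tmp) {
		tmp = cur->next;	/* saved before a possible free */
		if ((cur->flags & F_CACHE_GATEWAY) == F_CACHE_GATEWAY &&
		    cur->gateway == gateway) {
			*link = tmp;	/* unlink cur from the chain */
			free(cur);
			removed++;
		} else {
			link = &cur->next;
		}
	}
	return removed;
}
```

Note the mask comparison: (flags & MASK) == MASK requires all bits of the
mask to be set, whereas a bare (flags & MASK) would already be true with
only one of them.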

[PATCH net-next 01/16] ipv6: introduce a new function fib6_update_sernum()

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

This function takes a route as input and tries to update the sernum in
the fib6_node this route is associated with. It will be used in a later
commit when adding a cached route into the exception table under that
route.

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 include/net/ip6_fib.h |  2 ++
 net/ipv6/ip6_fib.c| 14 ++
 2 files changed, 16 insertions(+)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index d060d711a624..152b7b14a5a5 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -358,6 +358,8 @@ void __net_exit fib6_notifier_exit(struct net *net);
 unsigned int fib6_tables_seq_read(struct net *net);
 int fib6_tables_dump(struct net *net, struct notifier_block *nb);
 
+void fib6_update_sernum(struct rt6_info *rt);
+
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 int fib6_rules_init(void);
 void fib6_rules_cleanup(void);
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index e5308d7cbd75..0ba4fbb2f855 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -110,6 +110,20 @@ enum {
FIB6_NO_SERNUM_CHANGE = 0,
 };
 
+void fib6_update_sernum(struct rt6_info *rt)
+{
+   struct fib6_table *table = rt->rt6i_table;
+   struct net *net = dev_net(rt->dst.dev);
+   struct fib6_node *fn;
+
+   write_lock_bh(&table->tb6_lock);
+   fn = rcu_dereference_protected(rt->rt6i_node,
+   lockdep_is_held(&table->tb6_lock));
+   if (fn)
+   fn->fn_sernum = fib6_new_sernum(net);
+   write_unlock_bh(&table->tb6_lock);
+}
+
 /*
  * Auxiliary address test functions for the radix tree.
  *
-- 
2.14.2.920.gcf0c67979c-goog

[PATCH net-next 06/16] ipv6: prepare fib6_age() for exception table

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

If all dst cache entries are stored in the exception table under the
main route, we have to go through them during fib6_age() when doing
garbage collection.
Introduce a new function rt6_age_exceptions() which goes through all dst
entries in the exception table and removes those entries that are expired.
This function is called in fib6_age() so that all dst caches are also
garbage collected.
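
The expiry test used for cached entries can be sketched in userspace as
follows. The names are hypothetical (struct cached_entry, tick_after_eq(),
entry_expired()); tick_after_eq() is written in the spirit of the kernel's
time_after_eq() and, like it, relies on the usual two's complement
behaviour when the unsigned difference is converted to a signed type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Wraparound-safe "a is at or after b" for a free-running tick counter. */
static bool tick_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

struct cached_entry {
	int refcnt;		/* 1 means only the table holds a reference */
	unsigned long lastuse;	/* tick of the last recorded use */
};

/* An entry may be aged out only if nobody else holds it and it has not
 * been used for at least timeout ticks.
 */
static bool entry_expired(const struct cached_entry *e,
			  unsigned long now, unsigned long timeout)
{
	assert(e != NULL);
	return e->refcnt == 1 && tick_after_eq(now, e->lastuse + timeout);
}
```

The second half of the usage below (not shown in the fence) would be the
counter wrapping around: an entry last used 10 ticks before the wrap is
still young at tick 5 and expires at tick 50 with a timeout of 60.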

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 include/net/ip6_fib.h   | 13 +++
 include/net/ip6_route.h |  2 ++
 net/ipv6/ip6_fib.c  | 26 -
 net/ipv6/route.c| 60 +
 4 files changed, 84 insertions(+), 17 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index c4864c1e8f13..11a79ef87a28 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -29,6 +29,14 @@
 #define FIB6_TABLE_HASHSZ 1
 #endif
 
+#define RT6_DEBUG 2
+
+#if RT6_DEBUG >= 3
+#define RT6_TRACE(x...) pr_debug(x)
+#else
+#define RT6_TRACE(x...) do { ; } while (0)
+#endif
+
 struct rt6_info;
 
 struct fib6_config {
@@ -75,6 +83,11 @@ struct fib6_node {
struct rcu_head rcu;
 };
 
+struct fib6_gc_args {
+   int timeout;
+   int more;
+};
+
 #ifndef CONFIG_IPV6_SUBTREES
 #define FIB6_SUBTREE(fn)   NULL
 #else
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 3315605f34c9..a0087fb9864b 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -97,6 +97,8 @@ int ip6_del_rt(struct rt6_info *);
 
 void rt6_flush_exceptions(struct rt6_info *rt);
 int rt6_remove_exception_rt(struct rt6_info *rt);
+void rt6_age_exceptions(struct rt6_info *rt, struct fib6_gc_args *gc_args,
+   unsigned long now);
 
 static inline int ip6_route_get_saddr(struct net *net, struct rt6_info *rt,
  const struct in6_addr *daddr,
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 0ba4fbb2f855..3afbe50f2779 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -38,14 +38,6 @@
 #include 
 #include 
 
-#define RT6_DEBUG 2
-
-#if RT6_DEBUG >= 3
-#define RT6_TRACE(x...) pr_debug(x)
-#else
-#define RT6_TRACE(x...) do { ; } while (0)
-#endif
-
 static struct kmem_cache *fib6_node_kmem __read_mostly;
 
 struct fib6_cleaner {
@@ -1890,12 +1882,6 @@ static void fib6_flush_trees(struct net *net)
  * Garbage collection
  */
 
-struct fib6_gc_args
-{
-   int timeout;
-   int more;
-};
-
 static int fib6_age(struct rt6_info *rt, void *arg)
 {
struct fib6_gc_args *gc_args = arg;
@@ -1904,9 +1890,6 @@ static int fib6_age(struct rt6_info *rt, void *arg)
/*
 *  check addrconf expiration here.
 *  Routes are expired even if they are in use.
-*
-*  Also age clones. Note, that clones are aged out
-*  only if they are not in use now.
 */
 
if (rt->rt6i_flags & RTF_EXPIRES && rt->dst.expires) {
@@ -1915,6 +1898,9 @@ static int fib6_age(struct rt6_info *rt, void *arg)
return -1;
}
gc_args->more++;
+   /* The following part will soon be removed when the exception
+* table is hooked up to store all cached routes.
+*/
} else if (rt->rt6i_flags & RTF_CACHE) {
if (time_after_eq(now, rt->dst.lastuse + gc_args->timeout))
rt->dst.obsolete = DST_OBSOLETE_KILL;
@@ -1940,6 +1926,12 @@ static int fib6_age(struct rt6_info *rt, void *arg)
gc_args->more++;
}
 
+   /*  Also age clones in the exception table.
+*  Note, that clones are aged out
+*  only if they are not in use now.
+*/
+   rt6_age_exceptions(rt, gc_args, now);
+
return 0;
 }
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index e8e901589564..d2dd55f58b5d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1528,6 +1528,66 @@ static void rt6_exceptions_clean_tohost(struct rt6_info *rt,
spin_unlock_bh(&rt6_exception_lock);
 }
 
+static void rt6_age_examine_exception(struct rt6_exception_bucket *bucket,
+ struct rt6_exception *rt6_ex,
+ struct fib6_gc_args *gc_args,
+ unsigned long now)
+{
+   struct rt6_info *rt = rt6_ex->rt6i;
+
+   if (atomic_read(&rt->dst.__refcnt) == 1 &&
+   time_after_eq(now, rt->dst.lastuse + gc_args->timeout)) {
+   RT6_TRACE("aging clone %p\n", rt);
+   rt6_remove_exception(bucket, rt6_ex);
+   return;
+   } else if (rt->rt

[PATCH net-next 10/16] ipv6: don't release rt->rt6i_pcpu memory during rt6_release()

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

After rwlock is replaced with rcu and spinlock, route lookup can happen
simultaneously with route deletion.
This patch removes the call to free_percpu(rt->rt6i_pcpu) from
rt6_release() to avoid the race condition between rt6_release() and
rt6_get_pcpu_route(). And as free_percpu(rt->rt6i_pcpu) is already
called in ip6_dst_destroy() after the rcu grace period, it is safe to
make this change.

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv6/ip6_fib.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 9c8e704e6af7..eee392f7b1f6 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -190,9 +190,6 @@ void rt6_free_pcpu(struct rt6_info *non_pcpu_rt)
*ppcpu_rt = NULL;
}
}
-
-   free_percpu(non_pcpu_rt->rt6i_pcpu);
-   non_pcpu_rt->rt6i_pcpu = NULL;
 }
 EXPORT_SYMBOL_GPL(rt6_free_pcpu);
 
-- 
2.14.2.920.gcf0c67979c-goog

[PATCH net-next 03/16] ipv6: prepare fib6_remove_prefsrc() for exception table

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

After we move cached dst entries into the exception table under their
parent route, the current fib6_remove_prefsrc() can no longer access them.
This commit makes fib6_remove_prefsrc() also go through all routes
in the exception table to remove the pref src.
This is a preparation patch for moving all cached dst into the
exception table.

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv6/route.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index dc5e70975966..f52ac57dcc99 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1264,6 +1264,12 @@ static int rt6_insert_exception(struct rt6_info *nrt,
if (ort->rt6i_src.plen)
src_key = &ort->rt6i_src.addr;
 #endif
+
+   /* Update rt6i_prefsrc as it could be changed
+* in rt6_remove_prefsrc()
+*/
+   nrt->rt6i_prefsrc = ort->rt6i_prefsrc;
+
rt6_ex = __rt6_find_exception_spinlock(, >rt6i_dst.addr,
   src_key);
if (rt6_ex)
@@ -1432,6 +1438,25 @@ static void rt6_update_exception_stamp_rt(struct rt6_info *rt)
rcu_read_unlock();
 }
 
+static void rt6_exceptions_remove_prefsrc(struct rt6_info *rt)
+{
+   struct rt6_exception_bucket *bucket;
+   struct rt6_exception *rt6_ex;
+   int i;
+
+   bucket = rcu_dereference_protected(rt->rt6i_exception_bucket,
+   lockdep_is_held(&rt6_exception_lock));
+
+   if (bucket) {
+   for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) {
+   hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) {
+   rt6_ex->rt6i->rt6i_prefsrc.plen = 0;
+   }
+   bucket++;
+   }
+   }
+}
+
 struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
   int oif, struct flowi6 *fl6, int flags)
 {
@@ -3159,8 +3184,12 @@ static int fib6_remove_prefsrc(struct rt6_info *rt, void *arg)
if (((void *)rt->dst.dev == dev || !dev) &&
rt != net->ipv6.ip6_null_entry &&
ipv6_addr_equal(addr, &rt->rt6i_prefsrc.addr)) {
+   spin_lock_bh(&rt6_exception_lock);
/* remove prefsrc entry */
rt->rt6i_prefsrc.plen = 0;
+   /* need to update cache as well */
+   rt6_exceptions_remove_prefsrc(rt);
+   spin_unlock_bh(&rt6_exception_lock);
}
return 0;
 }
-- 
2.14.2.920.gcf0c67979c-goog

[PATCH net-next 09/16] ipv6: grab rt->rt6i_ref before allocating pcpu rt

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

After rwlock is replaced with rcu and spinlock, ip6_pol_route() will be
called with only rcu held. That means rt6 route deletion could happen
simultaneously with rt6_make_pcpu_route(). This could potentially cause
a memory leak if rt6_release() is called right before
rt6_make_pcpu_route() on the same route.

This patch grabs rt->rt6i_ref safely before calling rt6_make_pcpu_route()
to make sure rt6_release() will not get triggered while
rt6_make_pcpu_route() is in progress. And rt6_release() is called after
rt6_make_pcpu_route() is finished.

Note: As we are incrementing rt->rt6i_ref in ip6_pol_route(), there is a
very slim chance that fib6_purge_rt() will be triggered unnecessarily
when deleting a route if ip6_pol_route() running on another thread picks
this route as well and tries to make pcpu cache for it.
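
The per-cpu slot in this patch is filled with a compare-and-exchange so
that only the first writer wins. A userspace sketch of that "install once"
step with C11 atomics is shown here. struct entry and slot_install_once()
are hypothetical names; the reference counting that surrounds the real
cmpxchg() is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct entry {
	int id;
};

/* Try to publish candidate in *slot if the slot is still empty.
 * Returns the entry that ends up in the slot: candidate if this caller
 * won the race, otherwise the entry installed earlier by someone else.
 */
static struct entry *slot_install_once(struct entry *_Atomic *slot,
				       struct entry *candidate)
{
	struct entry *expected = NULL;

	assert(slot != NULL && candidate != NULL);
	if (atomic_compare_exchange_strong(slot, &expected, candidate))
		return candidate;
	/* Lost the race: expected now holds the previous occupant. */
	return expected;
}
```

A caller that gets back something other than its own candidate still owns
that candidate and has to release it, which corresponds to the two
dst_release_immediate() calls in the "prev" branch of the patch.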

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv6/route.c | 58 
 1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 65130dde276a..941c062389d2 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1070,7 +1070,6 @@ static struct rt6_info *rt6_get_pcpu_route(struct rt6_info *rt)
 
 static struct rt6_info *rt6_make_pcpu_route(struct rt6_info *rt)
 {
-   struct fib6_table *table = rt->rt6i_table;
struct rt6_info *pcpu_rt, *prev, **p;
 
pcpu_rt = ip6_rt_pcpu_alloc(rt);
@@ -1081,28 +1080,20 @@ static struct rt6_info *rt6_make_pcpu_route(struct rt6_info *rt)
return net->ipv6.ip6_null_entry;
}
 
-   read_lock_bh(&table->tb6_lock);
-   if (rt->rt6i_pcpu) {
-   p = this_cpu_ptr(rt->rt6i_pcpu);
-   prev = cmpxchg(p, NULL, pcpu_rt);
-   if (prev) {
-   /* If someone did it before us, return prev instead */
-   dst_release_immediate(&pcpu_rt->dst);
-   pcpu_rt = prev;
-   }
-   } else {
-   /* rt has been removed from the fib6 tree
-* before we have a chance to acquire the read_lock.
-* In this case, don't brother to create a pcpu rt
-* since rt is going away anyway.  The next
-* dst_check() will trigger a re-lookup.
-*/
+   dst_hold(&pcpu_rt->dst);
+   p = this_cpu_ptr(rt->rt6i_pcpu);
+   prev = cmpxchg(p, NULL, pcpu_rt);
+   if (prev) {
+   /* If someone did it before us, return prev instead */
+   /* release refcnt taken by ip6_rt_pcpu_alloc() */
+   dst_release_immediate(&pcpu_rt->dst);
+   /* release refcnt taken by above dst_hold() */
dst_release_immediate(&pcpu_rt->dst);
-   pcpu_rt = rt;
+   dst_hold(&prev->dst);
+   pcpu_rt = prev;
}
-   dst_hold(&pcpu_rt->dst);
+
rt6_dst_from_metrics_check(pcpu_rt);
-   read_unlock_bh(&table->tb6_lock);
return pcpu_rt;
 }
 
@@ -1683,19 +1674,28 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
if (pcpu_rt) {
read_unlock_bh(&table->tb6_lock);
} else {
-   /* We have to do the read_unlock first
-* because rt6_make_pcpu_route() may trigger
-* ip6_dst_gc() which will take the write_lock.
-*/
-   dst_hold(&rt->dst);
-   read_unlock_bh(&table->tb6_lock);
-   pcpu_rt = rt6_make_pcpu_route(rt);
-   dst_release(&rt->dst);
+   /* atomic_inc_not_zero() is needed when using rcu */
+   if (atomic_inc_not_zero(&rt->rt6i_ref)) {
+   /* We have to do the read_unlock first
+* because rt6_make_pcpu_route() may trigger
+* ip6_dst_gc() which will take the write_lock.
+*
+* No dst_hold() on rt is needed because grabbing
+* rt->rt6i_ref makes sure rt can't be released.
+*/
+   read_unlock_bh(&table->tb6_lock);
+   pcpu_rt = rt6_make_pcpu_route(rt);
+   rt6_release(rt);
+   } else {
+   /* rt is already removed from tree */
+   read_unlock_bh(&table->tb6_lock);
+   pcpu_rt = net->ipv6.ip6_null_entry;
+   dst_hold(&pcpu_rt->dst);
+   }
+   }
}
 
trace_fib6_table_lookup(net, pcpu_rt, t

[PATCH net-next 15/16] ipv6: replace rwlock with rcu and spinlock in fib6_table

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

With all the preparation work before, we are now ready to replace rwlock
with rcu and spinlock in fib6_table.
That means now all fib6_node in fib6_table are protected by rcu. And
when freeing fib6_node, call_rcu() is used to wait for the rcu grace
period before releasing the memory.
When accessing fib6_node, corresponding rcu APIs need to be used.
And all critical sections previously protected by the write lock will now be
protected by the per-table spin lock.
All critical sections previously protected by the read lock will now be
protected by rcu_read_lock().

A couple of things to note here:
1. As part of the work of replacing rwlock with rcu, the linked list of
fn->leaf now has to be rcu protected as well. So both fn->leaf and
rt->dst.rt6_next are now __rcu tagged and corresponding rcu APIs are
used when manipulating them.

2. For fn->rr_ptr, first of all, it also needs to be rcu protected now
and is tagged with __rcu and rcu APIs are used in corresponding places.
Secondly, fn->rr_ptr is changed in rt6_select() which is a reader
thread. This makes the issue a bit complicated. We think a valid
solution for it is to let rt6_select() grab the tb6_lock if it decides
to change it. As it is not in the normal operation and only happens when
there is no valid neighbor cache for the route, we think the performance
impact should be low.

3. fib6_walk_continue() has to be called with tb6_lock held even in the
route dumping related functions, e.g. inet6_dump_fib(),
fib6_tables_dump() and ipv6_route_seq_ops. This is because
fib6_walk_continue() makes modifications to the walker structure, and so
do fib6_repair_tree() and fib6_del_route(). In order to synchronize
properly between them, we need to let fib6_walk_continue() hold the lock.
We may be able to further improve the way we do the tree walk
to get rid of the need for holding the spin lock. But not for now.

4. When fib6_del_route() removes a route from the tree, we no longer
set rt->dst.rt6_next to NULL, so that concurrent readers can keep
traversing the list under rcu. However, rt->dst.rt6_next is only
valid within this same rcu period. No one should access it later.

5. All atomic_inc(rt->rt6i_ref) operations are now performed
before we publish the route (either by linking it to fn->leaf
or by inserting it in the list pointed to by fn->leaf), just to be safe,
because as soon as we publish the route, a reader thread will be able to
access it.

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 include/net/dst.h |   2 +-
 include/net/ip6_fib.h |  24 ++-
 net/ipv6/addrconf.c   |  11 +-
 net/ipv6/ip6_fib.c| 405 ++
 net/ipv6/route.c  | 121 ---
 5 files changed, 333 insertions(+), 230 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 06a6765da074..204c19e25456 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -101,7 +101,7 @@ struct dst_entry {
union {
struct dst_entry*next;
struct rtable __rcu *rt_next;
-   struct rt6_info *rt6_next;
+   struct rt6_info __rcu   *rt6_next;
struct dn_route __rcu   *dn_next;
};
 };
diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 6bf929b50951..0b438b9bcb10 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -68,18 +68,18 @@ struct fib6_config {
 };
 
 struct fib6_node {
-   struct fib6_node*parent;
-   struct fib6_node*left;
-   struct fib6_node*right;
+   struct fib6_node __rcu  *parent;
+   struct fib6_node __rcu  *left;
+   struct fib6_node __rcu  *right;
 #ifdef CONFIG_IPV6_SUBTREES
-   struct fib6_node*subtree;
+   struct fib6_node __rcu  *subtree;
 #endif
-   struct rt6_info *leaf;
+   struct rt6_info __rcu   *leaf;
 
__u16   fn_bit; /* bit key */
__u16   fn_flags;
int fn_sernum;
-   struct rt6_info *rr_ptr;
+   struct rt6_info __rcu   *rr_ptr;
struct rcu_head rcu;
 };
 
@@ -91,7 +91,7 @@ struct fib6_gc_args {
 #ifndef CONFIG_IPV6_SUBTREES
 #define FIB6_SUBTREE(fn)   NULL
 #else
-#define FIB6_SUBTREE(fn)   ((fn)->subtree)
+#define FIB6_SUBTREE(fn)   (rcu_dereference_protected((fn)->subtree, 1))
 #endif
 
 struct mx6_config {
@@ -174,6 +174,14 @@ struct rt6_info {
unused:7;
 };
 
+#define for_each_fib6_node_rt_rcu(fn)  \
+   for (rt = rcu_dereference((fn)->leaf); rt;  \
+rt = rcu_dereference(rt->dst.rt6_next))
+
+#define for_each_fib6_walker_rt(w)

[PATCH net-next 16/16] ipv6: take care of rt6_stats

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

Currently, most of the rt6_stats are not hooked up correctly. As the
last part of this patch series, hook up all existing rt6_stats and add
one new stat fib_rt_uncache to indicate the number of routes in the
uncached list.
For details of the stats, please refer to the comments added in
include/net/ip6_fib.h.

Note: fib_rt_alloc and fib_rt_uncache are not guaranteed to be modified
under a lock. So atomic_t is used for them.
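The split between lock-protected counters and atomic counters can be sketched in userspace C as follows. This is only an illustration under stated assumptions: struct toy_stats and its helpers are hypothetical names, and C11 atomic_int stands in for the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical, trimmed-down analogue of struct rt6_statistics. */
struct toy_stats {
	unsigned int fib_nodes;      /* only changed with the table lock held */
	atomic_int   fib_rt_alloc;   /* may be changed from any context */
	atomic_int   fib_rt_uncache; /* may be changed from any context */
};

static void toy_stats_init(struct toy_stats *s)
{
	assert(s);
	s->fib_nodes = 0;
	atomic_init(&s->fib_rt_alloc, 0);
	atomic_init(&s->fib_rt_uncache, 0);
}

/* No lock is held here, so the update must be an atomic read-modify-write. */
static void toy_route_alloc(struct toy_stats *s)
{
	atomic_fetch_add(&s->fib_rt_alloc, 1);
}

static void toy_uncached_add(struct toy_stats *s)
{
	atomic_fetch_add(&s->fib_rt_uncache, 1);
}

static void toy_uncached_del(struct toy_stats *s)
{
	atomic_fetch_sub(&s->fib_rt_uncache, 1);
}
```

A plain increment is enough for fib_nodes because the writer side is already serialized; the two atomic counters have no such guarantee.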

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 include/net/ip6_fib.h | 15 +--
 net/ipv6/ip6_fib.c| 42 --
 net/ipv6/route.c  | 16 ++--
 3 files changed, 47 insertions(+), 26 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 0b438b9bcb10..10c913816032 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -297,12 +297,15 @@ struct fib6_walker {
 };
 
 struct rt6_statistics {
-   __u32   fib_nodes;
-   __u32   fib_route_nodes;
-   __u32   fib_rt_alloc;   /* permanent routes */
-   __u32   fib_rt_entries; /* rt entries in table  */
-   __u32   fib_rt_cache;   /* cache routes */
-   __u32   fib_discarded_routes;
+   __u32   fib_nodes;  /* all fib6 nodes */
+   __u32   fib_route_nodes;/* intermediate nodes */
+   __u32   fib_rt_entries; /* rt entries in fib table */
+   __u32   fib_rt_cache;   /* cached rt entries in 
exception table */
+   __u32   fib_discarded_routes;   /* total number of routes 
delete */
+
+   /* The following stats are not protected by any lock */
+   atomic_tfib_rt_alloc;   /* total number of routes 
alloced */
+   atomic_tfib_rt_uncache; /* rt entries in uncached list 
*/
 };
 
 #define RTN_TL_ROOT0x0001
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 3f95908b39c3..52a29ba32928 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -149,18 +149,21 @@ static __be32 addr_bit_set(const void *token, int fn_bit)
   addr[fn_bit >> 5];
 }
 
-static struct fib6_node *node_alloc(void)
+static struct fib6_node *node_alloc(struct net *net)
 {
struct fib6_node *fn;
 
fn = kmem_cache_zalloc(fib6_node_kmem, GFP_ATOMIC);
+   if (fn)
+   net->ipv6.rt6_stats->fib_nodes++;
 
return fn;
 }
 
-static void node_free_immediate(struct fib6_node *fn)
+static void node_free_immediate(struct net *net, struct fib6_node *fn)
 {
kmem_cache_free(fib6_node_kmem, fn);
+   net->ipv6.rt6_stats->fib_nodes--;
 }
 
 static void node_free_rcu(struct rcu_head *head)
@@ -170,9 +173,10 @@ static void node_free_rcu(struct rcu_head *head)
kmem_cache_free(fib6_node_kmem, fn);
 }
 
-static void node_free(struct fib6_node *fn)
+static void node_free(struct net *net, struct fib6_node *fn)
 {
call_rcu(>rcu, node_free_rcu);
+   net->ipv6.rt6_stats->fib_nodes--;
 }
 
 void rt6_free_pcpu(struct rt6_info *non_pcpu_rt)
@@ -583,7 +587,8 @@ static int inet6_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
  * node.
  */
 
-static struct fib6_node *fib6_add_1(struct fib6_table *table,
+static struct fib6_node *fib6_add_1(struct net *net,
+   struct fib6_table *table,
struct fib6_node *root,
struct in6_addr *addr, int plen,
int offset, int allow_create,
@@ -675,7 +680,7 @@ static struct fib6_node *fib6_add_1(struct fib6_table 
*table,
 *  Create new leaf node without children.
 */
 
-   ln = node_alloc();
+   ln = node_alloc(net);
 
if (!ln)
return ERR_PTR(-ENOMEM);
@@ -716,14 +721,14 @@ static struct fib6_node *fib6_add_1(struct fib6_table 
*table,
 *  (new leaf node)[ln] (old node)[fn]
 */
if (plen > bit) {
-   in = node_alloc();
-   ln = node_alloc();
+   in = node_alloc(net);
+   ln = node_alloc(net);
 
if (!in || !ln) {
if (in)
-   node_free_immediate(in);
+   node_free_immediate(net, in);
if (ln)
-   node_free_immediate(ln);
+   node_free_immediate(net, ln);
return ERR_PTR(-ENOMEM);
}
 
@@ -768,7 +773,7 @@ static struct fib6_node *fib6_add_1(struct fib6_table 
*table,
 *   (old node)[fn] NULL
 */
 
-   ln = node_alloc();
+

[PATCH net-next 13/16] ipv6: check fn->leaf before it is used

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

If rwlock is replaced with rcu and spinlock, it is possible that the
reader thread will see fn->leaf as NULL in the following scenarios:
1. fib6_add() is in progress and we have already inserted a new node but
not yet inserted the route.
2. fib6_del_route() is in progress and we have already set fn->leaf to
NULL but not yet freed the node because of rcu grace period.

This patch makes sure all the reader threads check fn->leaf first before
using it. Together with a later patch that grabs rcu_read_lock() and uses
rcu_dereference() on fn->leaf, it makes sure reader threads are safe when
accessing fn->leaf.
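The reader-side pattern described above (load fn->leaf once, test it, then use only the local copy) can be sketched in userspace C as follows. The toy_* types are hypothetical, and a C11 acquire load stands in for rcu_dereference(); this is not the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for rt6_info and fib6_node. */
struct toy_route { int plen; };
struct toy_node  { _Atomic(struct toy_route *) leaf; };

/* Returns the prefix length of the node's first route, or -1 when the
 * node currently has no route (being added or being deleted). */
static int toy_node_plen(struct toy_node *fn)
{
	struct toy_route *leaf;

	assert(fn);
	/* Load the pointer exactly once; a second load could observe NULL. */
	leaf = atomic_load_explicit(&fn->leaf, memory_order_acquire);
	if (!leaf)
		return -1;
	return leaf->plen;
}
```

Reading fn->leaf a second time after the NULL check would reopen the race, which is why the diff below caches it in a local variable.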

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv6/ip6_fib.c | 23 ++-
 net/ipv6/route.c   | 20 
 2 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index f604b311cc3e..cf6137e81408 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1279,10 +1279,13 @@ static struct fib6_node *fib6_lookup_1(struct fib6_node 
*root,
 
while (fn) {
if (FIB6_SUBTREE(fn) || fn->fn_flags & RTN_RTINFO) {
+   struct rt6_info *leaf = fn->leaf;
struct rt6key *key;
 
-   key = (struct rt6key *) ((u8 *) fn->leaf +
-args->offset);
+   if (!leaf)
+   goto backtrack;
+
+   key = (struct rt6key *) ((u8 *)leaf + args->offset);
 
if (ipv6_prefix_equal(>addr, args->addr, 
key->plen)) {
 #ifdef CONFIG_IPV6_SUBTREES
@@ -1299,9 +1302,7 @@ static struct fib6_node *fib6_lookup_1(struct fib6_node 
*root,
return fn;
}
}
-#ifdef CONFIG_IPV6_SUBTREES
 backtrack:
-#endif
if (fn->fn_flags & RTN_ROOT)
break;
 
@@ -1358,7 +1359,18 @@ static struct fib6_node *fib6_locate_1(struct fib6_node 
*root,
struct fib6_node *fn, *prev = NULL;
 
for (fn = root; fn ; ) {
-   struct rt6key *key = (struct rt6key *)((u8 *)fn->leaf + offset);
+   struct rt6_info *leaf = fn->leaf;
+   struct rt6key *key;
+
+   /* This node is being deleted */
+   if (!leaf) {
+   if (plen <= fn->fn_bit)
+   goto out;
+   else
+   goto next;
+   }
+
+   key = (struct rt6key *)((u8 *)leaf + offset);
 
/*
 *  Prefix match
@@ -1372,6 +1384,7 @@ static struct fib6_node *fib6_locate_1(struct fib6_node 
*root,
 
prev = fn;
 
+next:
/*
 *  We have more bits to go
 */
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index aeb349aea429..05dc450af441 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -712,6 +712,7 @@ static struct rt6_info *find_match(struct rt6_info *rt, int 
oif, int strict,
 }
 
 static struct rt6_info *find_rr_leaf(struct fib6_node *fn,
+struct rt6_info *leaf,
 struct rt6_info *rr_head,
 u32 metric, int oif, int strict,
 bool *do_rr)
@@ -730,7 +731,7 @@ static struct rt6_info *find_rr_leaf(struct fib6_node *fn,
match = find_match(rt, oif, strict, , match, do_rr);
}
 
-   for (rt = fn->leaf; rt && rt != rr_head; rt = rt->dst.rt6_next) {
+   for (rt = leaf; rt && rt != rr_head; rt = rt->dst.rt6_next) {
if (rt->rt6i_metric != metric) {
cont = rt;
break;
@@ -748,17 +749,21 @@ static struct rt6_info *find_rr_leaf(struct fib6_node *fn,
return match;
 }
 
-static struct rt6_info *rt6_select(struct fib6_node *fn, int oif, int strict)
+static struct rt6_info *rt6_select(struct net *net, struct fib6_node *fn,
+  int oif, int strict)
 {
+   struct rt6_info *leaf = fn->leaf;
struct rt6_info *match, *rt0;
-   struct net *net;
bool do_rr = false;
 
+   if (!leaf)
+   return net->ipv6.ip6_null_entry;
+
rt0 = fn->rr_ptr;
if (!rt0)
-   fn->rr_ptr = rt0 = fn->leaf;
+   fn->rr_ptr = rt0 = leaf;
 
-   match = find_rr_leaf(fn, rt0, rt0->rt6i_metric, oif, strict,
+   match = find_rr_leaf(fn, leaf, rt0, rt0->rt6i_metric, oif, strict,
 _rr);
 
if (do_rr) {
@@ -766,13 +771,12

[PATCH net-next 14/16] ipv6: add key length check into rt6_select()

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

After rwlock is replaced with rcu and spinlock, fib6_lookup() could
potentially return an intermediate node if another thread is doing
fib6_del() on a route which is the only route on the node, so that
fib6_repair_tree() will be called on this node and potentially assign
fn->leaf to its child's fn->leaf.

In order to detect this situation in rt6_select(), we have to check if
fn->fn_bit is consistent with the key length stored in the route. And
depending on if the fn is in the subtree or not, the key is either
rt->rt6i_dst or rt->rt6i_src.
If any inconsistency is found, that means the node no longer holds valid
routes in it. So net->ipv6.ip6_null_entry is returned.
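The consistency check can be sketched in userspace C as follows. The toy_* structures are hypothetical, trimmed-down stand-ins, and the subtrees flag models CONFIG_IPV6_SUBTREES as a runtime argument purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for rt6key, rt6_info and fib6_node. */
struct toy_key   { int plen; };
struct toy_route { struct toy_key dst; struct toy_key src; };
struct toy_node  { int fn_bit; const struct toy_route *leaf; };

/* Returns true when the node's bit position agrees with the key length
 * of the route hanging off it, i.e. the leaf really belongs to fn. */
static bool toy_node_holds_own_route(const struct toy_node *fn, bool subtrees)
{
	const struct toy_route *rt;
	int key_plen;

	assert(fn);
	rt = fn->leaf;
	if (!rt)
		return false;

	key_plen = rt->dst.plen;
	if (subtrees && rt->src.plen)
		key_plen = rt->src.plen;

	return fn->fn_bit == key_plen;
}
```

For example, an intermediate node at bit 48 whose leaf was borrowed from a /64 child fails the check, and the lookup would fall back to the null entry.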

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv6/route.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 05dc450af441..24b80f43bbfb 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -755,6 +755,7 @@ static struct rt6_info *rt6_select(struct net *net, struct 
fib6_node *fn,
struct rt6_info *leaf = fn->leaf;
struct rt6_info *match, *rt0;
bool do_rr = false;
+   int key_plen;
 
if (!leaf)
return net->ipv6.ip6_null_entry;
@@ -763,6 +764,19 @@ static struct rt6_info *rt6_select(struct net *net, struct 
fib6_node *fn,
if (!rt0)
fn->rr_ptr = rt0 = leaf;
 
+   /* Double check to make sure fn is not an intermediate node
+* and fn->leaf does not points to its child's leaf
+* (This might happen if all routes under fn are deleted from
+* the tree and fib6_repair_tree() is called on the node.)
+*/
+   key_plen = rt0->rt6i_dst.plen;
+#ifdef CONFIG_IPV6_SUBTREES
+   if (rt0->rt6i_src.plen)
+   key_plen = rt0->rt6i_src.plen;
+#endif
+   if (fn->fn_bit != key_plen)
+   return net->ipv6.ip6_null_entry;
+
match = find_rr_leaf(fn, leaf, rt0, rt0->rt6i_metric, oif, strict,
 _rr);
 
-- 
2.14.2.920.gcf0c67979c-goog

[PATCH net-next 08/16] ipv6: hook up exception table to store dst cache

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

This commit makes use of the exception hash table implementation to
store dst caches created by pmtu discovery and ip redirect into the hash
table under the rt6_info and no longer inserts these routes into the fib6
tree.
This makes the fib6 tree only contain statically configured routes, so it
can now be protected by rcu instead of a rw lock.
With this change, in the route lookup related functions, after finding
the rt6_info with the longest prefix, we also need to search the
exception table before doing backtracking.
In the route delete function, if the route being deleted is not a dst
cache, deletion of this route also needs to flush the whole hash table
under it. If it is a dst cache, then only the cached dst in the
hash table is deleted.

Note: for the fib6_walk_continue() function, w->root now always points
to a root node, considering that fib6_prune_clones() is removed from the
code. So we add a WARN_ON() msg to make sure w->root always points to a
root node and also remove the update of w->root in fib6_repair_tree().
This is a prerequisite for a later patch because we don't need to make
w->root rcu protected when replacing rwlock with RCU.
Also, we remove all prune related variables as they are no longer used.

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 include/net/ip6_fib.h |   1 -
 net/ipv6/addrconf.c   |   1 -
 net/ipv6/ip6_fib.c|  95 
 net/ipv6/route.c  | 108 +-
 4 files changed, 72 insertions(+), 133 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 4497a1eb4d41..d0b7283073e3 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -280,7 +280,6 @@ struct fib6_walker {
struct fib6_node *root, *node;
struct rt6_info *leaf;
enum fib6_walk_state state;
-   bool prune;
unsigned int skip;
unsigned int count;
int (*func)(struct fib6_walker *);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 3ccaf52824c9..873afafddfc4 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2326,7 +2326,6 @@ static struct rt6_info *addrconf_get_prefix_route(const 
struct in6_addr *pfx,
if (!fn)
goto out;
 
-   noflags |= RTF_CACHE;
for (rt = fn->leaf; rt; rt = rt->dst.rt6_next) {
if (rt->dst.dev->ifindex != dev->ifindex)
continue;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index b3e4cf0962f8..9c8e704e6af7 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -54,7 +54,6 @@ struct fib6_cleaner {
 #define FWS_INIT FWS_L
 #endif
 
-static void fib6_prune_clones(struct net *net, struct fib6_node *fn);
 static struct rt6_info *fib6_find_prefix(struct net *net, struct fib6_node 
*fn);
 static struct fib6_node *fib6_repair_tree(struct net *net, struct fib6_node 
*fn);
 static int fib6_walk(struct net *net, struct fib6_walker *w);
@@ -1101,6 +1100,8 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
 
if (WARN_ON_ONCE(!atomic_read(>dst.__refcnt)))
return -EINVAL;
+   if (WARN_ON_ONCE(rt->rt6i_flags & RTF_CACHE))
+   return -EINVAL;
 
if (info->nlh) {
if (!(info->nlh->nlmsg_flags & NLM_F_CREATE))
@@ -1192,11 +1193,8 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
 #endif
 
err = fib6_add_rt2node(fn, rt, info, mxc);
-   if (!err) {
+   if (!err)
fib6_start_gc(info->nl_net, rt);
-   if (!(rt->rt6i_flags & RTF_CACHE))
-   fib6_prune_clones(info->nl_net, pn);
-   }
 
 out:
if (err) {
@@ -1511,19 +1509,12 @@ static struct fib6_node *fib6_repair_tree(struct net 
*net,
read_lock(>ipv6.fib6_walker_lock);
FOR_WALKERS(net, w) {
if (!child) {
-   if (w->root == fn) {
-   w->root = w->node = NULL;
-   RT6_TRACE("W %p adjusted by delroot 
1\n", w);
-   } else if (w->node == fn) {
+   if (w->node == fn) {
RT6_TRACE("W %p adjusted by delnode 1, 
s=%d/%d\n", w, w->state, nstate);
w->node = pn;
w->state = nstate;
}
} else {
-   if (w->root == fn) {
-   w->root = child;
-   RT6_TRACE("W %p adjusted by delroot 
2\n&quo

[PATCH net-next 12/16] ipv6: update fn_sernum after route is inserted to tree

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

fib6_add() logic currently calls fib6_add_1() to figure out what node
should be used for the newly added route and then calls
fib6_add_rt2node() to insert the route into the node.
And during the call of fib6_add_1(), fn_sernum is updated for all nodes
that share the same prefix as the new route.
This is not an issue in the current code because a reader thread will
not be able to access the tree while a writer thread is inserting a new
route into it. However, that is no longer the case once we transition to RCU.
A reader thread could potentially see the new fn_sernum before the new
route is inserted. As a result, the reader thread's route lookup will return
a stale route with the new fn_sernum.

In order to solve this issue, we remove all the updates of fn_sernum in
fib6_add_1(), and instead, introduce a new function that updates fn_sernum
for all related nodes and call this function once the route is
successfully inserted into the tree.
Also, smp_wmb() is used after a route is successfully inserted into the
fib tree and right before the update of fn->fn_sernum. And smp_rmb() is
used right after fn->fn_sernum is accessed in rt6_get_cookie_safe(). This
is to guarantee that when the reader thread sees the new fn->fn_sernum, the
new route is already inserted in the tree in memory.
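The barrier pairing can be illustrated with a small userspace sketch. It is an analogy only: C11 release and acquire fences stand in for smp_wmb() and smp_rmb(), and struct toy_node with its helpers is hypothetical, not the kernel data structure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for rt6_info and fib6_node. */
struct toy_route { int id; };

struct toy_node {
	_Atomic(struct toy_route *) leaf; /* the inserted route */
	atomic_int sernum;                /* bumped only after insertion */
};

/* Writer: insert the route first, then advance the serial number.
 * The release fence plays the role of smp_wmb(). */
static void toy_insert(struct toy_node *n, struct toy_route *rt, int sernum)
{
	assert(rt != NULL);
	atomic_store_explicit(&n->leaf, rt, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&n->sernum, sernum, memory_order_relaxed);
}

/* Reader: read the serial number first, then the route.
 * The acquire fence plays the role of smp_rmb(). */
static struct toy_route *toy_lookup(struct toy_node *n, int *cookie)
{
	*cookie = atomic_load_explicit(&n->sernum, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&n->leaf, memory_order_relaxed);
}
```

With this ordering, a reader that observes the new serial number is guaranteed to also observe the inserted route; the reverse case (new route, old serial number) only produces a cookie that fails validation later, which is harmless.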

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 include/net/ip6_fib.h |  2 ++
 net/ipv6/ip6_fib.c| 39 +--
 2 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index d0b7283073e3..6bf929b50951 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -220,6 +220,8 @@ static inline bool rt6_get_cookie_safe(const struct 
rt6_info *rt,
 
if (fn) {
*cookie = fn->fn_sernum;
+   /* pairs with smp_wmb() in fib6_update_sernum_upto_root() */
+   smp_rmb();
status = true;
}
 
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index eee392f7b1f6..f604b311cc3e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -585,7 +585,7 @@ static int inet6_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
 static struct fib6_node *fib6_add_1(struct fib6_node *root,
 struct in6_addr *addr, int plen,
 int offset, int allow_create,
-int replace_required, int sernum,
+int replace_required,
 struct netlink_ext_ack *extack)
 {
struct fib6_node *fn, *in, *ln;
@@ -631,8 +631,6 @@ static struct fib6_node *fib6_add_1(struct fib6_node *root,
fn->leaf = NULL;
}
 
-   fn->fn_sernum = sernum;
-
return fn;
}
 
@@ -641,7 +639,6 @@ static struct fib6_node *fib6_add_1(struct fib6_node *root,
 */
 
/* Try to walk down on tree. */
-   fn->fn_sernum = sernum;
dir = addr_bit_set(addr, fn->fn_bit);
pn = fn;
fn = dir ? fn->right : fn->left;
@@ -677,7 +674,6 @@ static struct fib6_node *fib6_add_1(struct fib6_node *root,
ln->fn_bit = plen;
 
ln->parent = pn;
-   ln->fn_sernum = sernum;
 
if (dir)
pn->right = ln;
@@ -737,8 +733,6 @@ static struct fib6_node *fib6_add_1(struct fib6_node *root,
in->leaf = fn->leaf;
atomic_inc(>leaf->rt6i_ref);
 
-   in->fn_sernum = sernum;
-
/* update parent pointer */
if (dir)
pn->right = in;
@@ -750,8 +744,6 @@ static struct fib6_node *fib6_add_1(struct fib6_node *root,
ln->parent = in;
fn->parent = in;
 
-   ln->fn_sernum = sernum;
-
if (addr_bit_set(addr, bit)) {
in->right = ln;
in->left  = fn;
@@ -776,8 +768,6 @@ static struct fib6_node *fib6_add_1(struct fib6_node *root,
 
ln->parent = pn;
 
-   ln->fn_sernum = sernum;
-
if (dir)
pn->right = ln;
else
@@ -1079,6 +1069,20 @@ void fib6_force_start_gc(struct net *net)
  jiffies + net->ipv6.sysctl.ip6_rt_gc_interval);
 }
 
+static void fib6_update_sernum_upto_root(struct rt6_info *rt,
+int sernum)
+{
+   struct fib6_node *fn = rcu_dereference_protected(rt->rt6i_node,
+   lockdep_is_held(>rt6i_table->tb6_lock));
+
+   /* paired with smp_rmb() in rt6_get_cookie_s

[PATCH net-next 00/16] ipv6: replace rwlock with rcu and spinlock in fib6 table

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

Currently, fib6 table is protected by rwlock. During route lookup,
reader lock is taken and during route insertion, deletion or
modification, writer lock is taken. This is a very inefficient
implementation because the fastpath always has to do the operation
to grab the reader lock.
According to my latest syn flood test on an iota ivybridge machine
with 2 10G mlx nics bonded together, each with 8 rx queues on 2 NUMA
nodes, and with the upstream net-next kernel:
ipv4 stack can handle around 4.2Mpps
ipv6 stack can handle around 1.3Mpps

In order to close the gap of the performance number between ipv4
and ipv6 stack, this patch series tries to get rid of the usage of
the rwlock and replace it with rcu and spinlock protection. This will
greatly speed up the fastpath performance as it only needs to hold
rcu which is much less expensive than grabbing the reader lock. It
also makes ipv6 fib implementation more consistent with ipv4.

In order to be able to replace the current rwlock with rcu and
spinlock, some preparation work is needed:
Patches 1-8 introduce a per-route hash table (protected by rcu and a
different spinlock) to store all cached routes created by pmtu and ip
redirect under its main route. This makes the main fib6 tree only
contain static routes.
Patches 9-14 prepare all the reader paths to tolerate
concurrent writers.
Patch 15 finally does the rwlock to rcu and spinlock conversion.
Patch 16 takes care of rt6_stats.

After this patch series, in the same syn flood test,
ipv6 stack can now handle around 3.5Mpps compared to previous 1.3Mpps
in my test setup.

After this patch series, there are still some improvements that should
be done in ipv6 stack:
1. During route lookup, dst_use() is called every time on the selected
route to update dst->__use and dst->lastuse. This dirties the cacheline
and causes extra cacheline misses and should be avoided.
2. When no route is found in the current table, net->ipv6.ip6_null_entry
is used and refcnt is taken on it. As there is no pcpu cache for this
specific route, frequent changes to the refcnt for this route cause
quite a few cacheline misses.
And to make things worse, if CONFIG_IPV6_MULTIPLE_TABLES is defined,
output path route lookup always starts with local table first and
guarantees to hit net->ipv6.ip6_null_entry before continuing to do
lookup in the main table.
These operations on net->ipv6.ip6_null_entry could potentially be
avoided.
3. ipv6 input path route lookup grabs refcnt on dst. This is different
from ipv4. We could potentially change this behavior to let ipv6 input
path route lookup not to grab refcnt on dst. However, it does not give
us much performance boost as we currently have pcpu route cache for
input path as well in ipv6. But this work probably is still worth doing
to unify ipv6 and ipv4 route lookup behavior.

The above issues will be addressed separately after this patch series
has been accepted.

This is a joint work with Martin KaFai Lau and Eric Dumazet. And many
many thanks to them for their inspiring ideas and big big code review
efforts.

Wei Wang (16):
  ipv6: introduce a new function fib6_update_sernum()
  ipv6: introduce a hash table to store dst cache
  ipv6: prepare fib6_remove_prefsrc() for exception table
  ipv6: prepare rt6_mtu_change() for exception table
  ipv6: prepare rt6_clean_tohost() for exception table
  ipv6: prepare fib6_age() for exception table
  ipv6: prepare fib6_locate() for exception table
  ipv6: hook up exception table to store dst cache
  ipv6: grab rt->rt6i_ref before allocating pcpu rt
  ipv6: don't release rt->rt6i_pcpu memory during rt6_release()
  ipv6: replace dst_hold() with dst_hold_safe() in routing code
  ipv6: update fn_sernum after route is inserted to tree
  ipv6: check fn->leaf before it is used
  ipv6: add key length check into rt6_select()
  ipv6: replace rwlock with rcu and spinlock in fib6_table
  ipv6: take care of rt6_stats

 include/net/dst.h   |   2 +-
 include/net/ip6_fib.h   |  79 -
 include/net/ip6_route.h |   5 +
 net/ipv6/addrconf.c |  17 +-
 net/ipv6/ip6_fib.c  | 645 ++
 net/ipv6/route.c| 901 
 6 files changed, 1179 insertions(+), 470 deletions(-)

-- 
2.14.2.920.gcf0c67979c-goog

[PATCH net-next 07/16] ipv6: prepare fib6_locate() for exception table

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

fib6_locate() is used to find the fib6_node according to the passed in
prefix address key. It currently tries to find the fib6_node with the
exact match of the passed in key. However, when we move cached routes
into the exception table, fib6_locate() will fail to find the fib6_node
for it as the cached routes will be stored in the exception table under
the fib6_node with the longest prefix match of the cache's dst addr key.
This commit adds a new parameter to let the caller specify if it needs
exact match or longest prefix match.
Right now, all callers still do exact match when calling
fib6_locate(). This will be changed in a later commit where the exception
table is hooked up to store cached routes.
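The difference between the two modes can be sketched on a toy binary trie keyed by 32-bit addresses. This is an illustration under simplifying assumptions only: the toy_* names are hypothetical, keys are 32 bits instead of 128, and every node is assumed to carry a valid prefix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trie node: prefix/plen identify the node, plen doubles
 * as the bit index used to pick the next child. */
struct toy_node {
	uint32_t prefix;
	int plen;
	struct toy_node *left, *right;
};

static bool toy_prefix_equal(uint32_t a, uint32_t b, int plen)
{
	if (plen == 0)
		return true;
	return ((a ^ b) & (~UINT32_C(0) << (32 - plen))) == 0;
}

static struct toy_node *toy_locate(struct toy_node *root, uint32_t addr,
				   int plen, bool exact_match)
{
	struct toy_node *fn = root, *prev = NULL;

	assert(plen >= 0 && plen <= 32);
	while (fn) {
		/* Stop when the node is more specific than the key or differs. */
		if (plen < fn->plen ||
		    !toy_prefix_equal(fn->prefix, addr, fn->plen))
			break;
		if (plen == fn->plen)
			return fn;
		prev = fn; /* deepest matching node seen so far */
		fn = ((addr >> (31 - fn->plen)) & 1) ? fn->right : fn->left;
	}
	return exact_match ? NULL : prev;
}
```

With a root node and a single 10.0.0.0/8 child, a lookup of 10.1.0.0/16 returns NULL in exact mode and the /8 node in longest-prefix mode.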

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 include/net/ip6_fib.h |  3 ++-
 net/ipv6/addrconf.c   |  2 +-
 net/ipv6/ip6_fib.c| 30 +++---
 net/ipv6/route.c  |  5 +++--
 4 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 11a79ef87a28..4497a1eb4d41 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -357,7 +357,8 @@ struct fib6_node *fib6_lookup(struct fib6_node *root,
 
 struct fib6_node *fib6_locate(struct fib6_node *root,
  const struct in6_addr *daddr, int dst_len,
- const struct in6_addr *saddr, int src_len);
+ const struct in6_addr *saddr, int src_len,
+ bool exact_match);
 
 void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *arg),
void *arg);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 837418ff2d4b..3ccaf52824c9 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2322,7 +2322,7 @@ static struct rt6_info *addrconf_get_prefix_route(const 
struct in6_addr *pfx,
return NULL;
 
read_lock_bh(>tb6_lock);
-   fn = fib6_locate(>tb6_root, pfx, plen, NULL, 0);
+   fn = fib6_locate(>tb6_root, pfx, plen, NULL, 0, true);
if (!fn)
goto out;
 
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 3afbe50f2779..b3e4cf0962f8 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1343,14 +1343,21 @@ struct fib6_node *fib6_lookup(struct fib6_node *root, 
const struct in6_addr *dad
 /*
  * Get node with specified destination prefix (and source prefix,
  * if subtrees are used)
+ * exact_match == true means we try to find fn with exact match of
+ * the passed in prefix addr
+ * exact_match == false means we try to find fn with longest prefix
+ * match of the passed in prefix addr. This is useful for finding fn
+ * for cached route as it will be stored in the exception table under
+ * the node with longest prefix length.
  */
 
 
 static struct fib6_node *fib6_locate_1(struct fib6_node *root,
   const struct in6_addr *addr,
-  int plen, int offset)
+  int plen, int offset,
+  bool exact_match)
 {
-   struct fib6_node *fn;
+   struct fib6_node *fn, *prev = NULL;
 
for (fn = root; fn ; ) {
struct rt6key *key = (struct rt6key *)((u8 *)fn->leaf + offset);
@@ -1360,11 +1367,13 @@ static struct fib6_node *fib6_locate_1(struct fib6_node 
*root,
 */
if (plen < fn->fn_bit ||
!ipv6_prefix_equal(>addr, addr, fn->fn_bit))
-   return NULL;
+   goto out;
 
if (plen == fn->fn_bit)
return fn;
 
+   prev = fn;
+
/*
 *  We have more bits to go
 */
@@ -1373,24 +1382,31 @@ static struct fib6_node *fib6_locate_1(struct fib6_node 
*root,
else
fn = fn->left;
}
-   return NULL;
+out:
+   if (exact_match)
+   return NULL;
+   else
+   return prev;
 }
 
 struct fib6_node *fib6_locate(struct fib6_node *root,
  const struct in6_addr *daddr, int dst_len,
- const struct in6_addr *saddr, int src_len)
+ const struct in6_addr *saddr, int src_len,
+ bool exact_match)
 {
struct fib6_node *fn;
 
fn = fib6_locate_1(root, daddr, dst_len,
-  offsetof(struct rt6_info, rt6i_dst));
+  offsetof(struct rt6_info, rt6i_dst),
+  exact_match);
 
 #ifdef CONFIG_IPV6_SUBTREES
if (src_len) {
WARN_ON(saddr == NULL);

[PATCH net-next 02/16] ipv6: introduce a hash table to store dst cache

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

Add a hash table into struct rt6_info in order to store dst caches
created by pmtu discovery and ip redirect in ipv6 routing code.
APIs to add dst cache, delete dst cache, find dst cache and update
dst cache in the hash table are implemented and will be used in later
commits.
This is preparatory work for moving all cached routes into the exception
table instead of inserting them into the fib6 tree.

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 include/net/ip6_fib.h   |  19 +++
 include/net/ip6_route.h |   3 +
 net/ipv6/route.c| 341 
 3 files changed, 363 insertions(+)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 152b7b14a5a5..c4864c1e8f13 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -98,6 +98,22 @@ struct rt6key {
 
 struct fib6_table;
 
+struct rt6_exception_bucket {
+   struct hlist_head   chain;
+   int depth;
+};
+
+struct rt6_exception {
+   struct hlist_node   hlist;
+   struct rt6_info *rt6i;
+   unsigned long   stamp;
+   struct rcu_head rcu;
+};
+
+#define FIB6_EXCEPTION_BUCKET_SIZE_SHIFT 10
+#define FIB6_EXCEPTION_BUCKET_SIZE (1 << FIB6_EXCEPTION_BUCKET_SIZE_SHIFT)
+#define FIB6_MAX_DEPTH 5
+
 struct rt6_info {
struct dst_entrydst;
 
@@ -134,12 +150,15 @@ struct rt6_info {
 
struct inet6_dev*rt6i_idev;
struct rt6_info * __percpu  *rt6i_pcpu;
+   struct rt6_exception_bucket __rcu *rt6i_exception_bucket;
 
u32 rt6i_metric;
u32 rt6i_pmtu;
/* more non-fragment space at head required */
unsigned short  rt6i_nfheader_len;
u8  rt6i_protocol;
+   u8  exception_bucket_flushed:1,
+   unused:7;
 };
 
 static inline struct inet6_dev *ip6_dst_idev(struct dst_entry *dst)
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index ee96f402cb75..3315605f34c9 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -95,6 +95,9 @@ int ip6_route_add(struct fib6_config *cfg, struct 
netlink_ext_ack *extack);
 int ip6_ins_rt(struct rt6_info *);
 int ip6_del_rt(struct rt6_info *);
 
+void rt6_flush_exceptions(struct rt6_info *rt);
+int rt6_remove_exception_rt(struct rt6_info *rt);
+
 static inline int ip6_route_get_saddr(struct net *net, struct rt6_info *rt,
  const struct in6_addr *daddr,
  unsigned int prefs,
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 26cc9f483b6d..dc5e70975966 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -104,6 +105,9 @@ static int rt6_fill_node(struct net *net,
 struct in6_addr *dst, struct in6_addr *src,
 int iif, int type, u32 portid, u32 seq,
 unsigned int flags);
+static struct rt6_info *rt6_find_cached_rt(struct rt6_info *rt,
+  struct in6_addr *daddr,
+  struct in6_addr *saddr);
 
 #ifdef CONFIG_IPV6_ROUTE_INFO
 static struct rt6_info *rt6_add_route_info(struct net *net,
@@ -392,6 +396,7 @@ EXPORT_SYMBOL(ip6_dst_alloc);
 static void ip6_dst_destroy(struct dst_entry *dst)
 {
struct rt6_info *rt = (struct rt6_info *)dst;
+   struct rt6_exception_bucket *bucket;
struct dst_entry *from = dst->from;
struct inet6_dev *idev;
 
@@ -404,6 +409,11 @@ static void ip6_dst_destroy(struct dst_entry *dst)
rt->rt6i_idev = NULL;
in6_dev_put(idev);
}
+   bucket = rcu_dereference_protected(rt->rt6i_exception_bucket, 1);
+   if (bucket) {
+   rt->rt6i_exception_bucket = NULL;
+   kfree(bucket);
+   }
 
dst->from = NULL;
dst_release(from);
@@ -1091,6 +1101,337 @@ static struct rt6_info *rt6_make_pcpu_route(struct 
rt6_info *rt)
return pcpu_rt;
 }
 
+/* exception hash table implementation
+ */
+static DEFINE_SPINLOCK(rt6_exception_lock);
+
+/* Remove rt6_ex from hash table and free the memory
+ * Caller must hold rt6_exception_lock
+ */
+static void rt6_remove_exception(struct rt6_exception_bucket *bucket,
+struct rt6_exception *rt6_ex)
+{
+   if (!bucket || !rt6_ex)
+   return;
+   rt6_ex->rt6i->rt6i_node = NULL;
+   hlist_del_rcu(&rt6_ex->hlist);
+   rt6_release(rt6_ex->rt6i);
+   kfree_rcu(rt6_ex, rcu);
+   WARN_ON_ONCE(!bucket->de
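
Note: the bucket layout introduced above (a chain plus a depth counter, with
a maximum depth) can be sketched in userspace as follows. This is a minimal
sketch, not the kernel code: the toy_* names, the multiplicative hash and the
"evict the entry with the oldest stamp" policy are assumptions made for
illustration, and there is no locking or RCU here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_BUCKET_SHIFT 10
#define TOY_BUCKET_SIZE (1 << TOY_BUCKET_SHIFT)
#define TOY_MAX_DEPTH 5

struct toy_exception {
	struct toy_exception *next;
	uint32_t daddr;			/* lookup key (assumption: 32 bits) */
	unsigned long stamp;		/* last time the entry was touched */
};

struct toy_bucket {
	struct toy_exception *chain;
	int depth;
};

/* Pick a bucket index in [0, TOY_BUCKET_SIZE) for a key. */
static unsigned int toy_hash(uint32_t daddr)
{
	return (daddr * UINT32_C(2654435761)) >> (32 - TOY_BUCKET_SHIFT);
}

static struct toy_exception *toy_bucket_find(struct toy_bucket *b,
					     uint32_t daddr)
{
	struct toy_exception *ex;

	for (ex = b->chain; ex; ex = ex->next)
		if (ex->daddr == daddr)
			return ex;
	return NULL;
}

/* Unlink and free the entry with the smallest stamp. */
static void toy_bucket_remove_oldest(struct toy_bucket *b)
{
	struct toy_exception **pp, **oldest = NULL;

	for (pp = &b->chain; *pp; pp = &(*pp)->next)
		if (!oldest || (*pp)->stamp < (*oldest)->stamp)
			oldest = pp;
	if (oldest) {
		struct toy_exception *victim = *oldest;

		*oldest = victim->next;
		free(victim);
		b->depth--;
	}
}

/* Insert or refresh an entry, keeping the chain at most TOY_MAX_DEPTH long. */
static int toy_bucket_insert(struct toy_bucket *b, uint32_t daddr,
			     unsigned long now)
{
	struct toy_exception *ex = toy_bucket_find(b, daddr);

	assert(b->depth >= 0);
	if (ex) {
		ex->stamp = now;
		return 0;
	}
	ex = malloc(sizeof(*ex));
	if (!ex)
		return -1;
	ex->daddr = daddr;
	ex->stamp = now;
	ex->next = b->chain;
	b->chain = ex;
	b->depth++;
	if (b->depth > TOY_MAX_DEPTH)
		toy_bucket_remove_oldest(b);
	return 0;
}
```

A caller would index an array of TOY_BUCKET_SIZE buckets with toy_hash() and
then call toy_bucket_insert() or toy_bucket_find() on the selected bucket.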

[PATCH net-next 04/16] ipv6: prepare rt6_mtu_change() for exception table

2017-10-06 Thread Wei Wang

From: Wei Wang <wei...@google.com>

If we move all cached dst into the exception table under the main route,
current rt6_mtu_change() will no longer be able to access them.
This commit makes rt6_mtu_change_route() also go through all
cached routes in the exception table under the main route and update
their mtu accordingly.
This is a preparation in order to move all cached routes into the
exception table.

Signed-off-by: Wei Wang <wei...@google.com>
Signed-off-by: Martin KaFai Lau <ka...@fb.com>
Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv6/route.c | 40 
 1 file changed, 40 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f52ac57dcc99..d9805a857809 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1269,6 +1269,14 @@ static int rt6_insert_exception(struct rt6_info *nrt,
 * in rt6_remove_prefsrc()
 */
nrt->rt6i_prefsrc = ort->rt6i_prefsrc;
+   /* rt6_mtu_change() might lower mtu on ort.
+* Only insert this exception route if its mtu
+* is less than ort's mtu value.
+*/
+   if (nrt->rt6i_pmtu >= dst_mtu(&ort->dst)) {
+   err = -EINVAL;
+   goto out;
+   }
 
rt6_ex = __rt6_find_exception_spinlock(&bucket, &nrt->rt6i_dst.addr,
   src_key);
@@ -1457,6 +1465,32 @@ static void rt6_exceptions_remove_prefsrc(struct 
rt6_info *rt)
}
 }
 
+static void rt6_exceptions_update_pmtu(struct rt6_info *rt, int mtu)
+{
+   struct rt6_exception_bucket *bucket;
+   struct rt6_exception *rt6_ex;
+   int i;
+
+   bucket = rcu_dereference_protected(rt->rt6i_exception_bucket,
+   lockdep_is_held(&rt6_exception_lock));
+
+   if (bucket) {
+   for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) {
+   hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) {
+   struct rt6_info *entry = rt6_ex->rt6i;
+   /* For RTF_CACHE with rt6i_pmtu == 0
+* (i.e. a redirected route),
+* the metrics of its rt->dst.from has already
+* been updated.
+*/
+   if (entry->rt6i_pmtu && entry->rt6i_pmtu > mtu)
+   entry->rt6i_pmtu = mtu;
+   }
+   bucket++;
+   }
+   }
+}
+
 struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
   int oif, struct flowi6 *fl6, int flags)
 {
@@ -3296,6 +3330,10 @@ static int rt6_mtu_change_route(struct rt6_info *rt, 
void *p_arg)
if (rt->dst.dev == arg->dev &&
dst_metric_raw(&rt->dst, RTAX_MTU) &&
!dst_metric_locked(&rt->dst, RTAX_MTU)) {
+   spin_lock_bh(&rt6_exception_lock);
+   /* This case will be removed once the exception table
+* is hooked up.
+*/
if (rt->rt6i_flags & RTF_CACHE) {
/* For RTF_CACHE with rt6i_pmtu == 0
 * (i.e. a redirected route),
@@ -3309,6 +3347,8 @@ static int rt6_mtu_change_route(struct rt6_info *rt, void 
*p_arg)
dst_mtu(&rt->dst) == idev->cnf.mtu6)) {
dst_metric_set(&rt->dst, RTAX_MTU, arg->mtu);
}
+   rt6_exceptions_update_pmtu(rt, arg->mtu);
+   spin_unlock_bh(&rt6_exception_lock);
}
return 0;
 }
-- 
2.14.2.920.gcf0c67979c-goog
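
Note: the walk done by rt6_exceptions_update_pmtu() can be approximated in
userspace as below. This is a sketch under assumptions: the pmtu_entry,
pmtu_bucket and pmtu_table_clamp names are hypothetical, plain singly linked
lists replace hlists, and the locking that the patch takes around the walk
(rt6_exception_lock) is left out.

```c
#include <assert.h>
#include <stddef.h>

struct pmtu_entry {
	struct pmtu_entry *next;
	unsigned int pmtu;	/* 0 means no PMTU recorded, e.g. a redirect */
};

struct pmtu_bucket {
	struct pmtu_entry *chain;
};

/* Lower every recorded PMTU that exceeds the new device MTU.
 * Entries with pmtu == 0 are left alone.
 * Returns the number of entries that were changed.
 */
static int pmtu_table_clamp(struct pmtu_bucket *table, size_t nbuckets,
			    unsigned int mtu)
{
	int changed = 0;

	assert(table != NULL || nbuckets == 0);
	for (size_t i = 0; i < nbuckets; i++) {
		struct pmtu_entry *e;

		for (e = table[i].chain; e; e = e->next) {
			if (e->pmtu && e->pmtu > mtu) {
				e->pmtu = mtu;
				changed++;
			}
		}
	}
	return changed;
}
```

Only entries whose PMTU is both set and larger than the new MTU are touched,
which matches the condition used in the patch.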

[PATCH v2 net-next 2/2] tcp: clean up TFO server's initial tcp_rearm_rto() call

2017-10-04 Thread Wei Wang

From: Wei Wang <wei...@google.com>

This commit is a cleanup that moves the tcp_rearm_rto() call in the TFO
server case to an earlier spot in tcp_rcv_state_process() to make
the code more compact.
This is only a cosmetic change.

Suggested-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Wei Wang <wei...@google.com>
Acked-by: Neal Cardwell <ncardw...@google.com>
Acked-by: Yuchung Cheng <ych...@google.com>
Acked-by: Eric Dumazet <eduma...@google.com>
---
no change in v2

 net/ipv4/tcp_input.c | 21 +
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bd3a35f5dbf2..c5b8d61846c2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5911,6 +5911,15 @@ int tcp_rcv_state_process(struct sock *sk, struct 
sk_buff *skb)
if (req) {
inet_csk(sk)->icsk_retransmits = 0;
reqsk_fastopen_remove(sk, req, false);
+   /* Re-arm the timer because data may have been sent out.
+* This is similar to the regular data transmission case
+* when new data has just been ack'ed.
+*
+* (TFO) - we could try to be more aggressive and
+* retransmitting any data sooner based on when they
+* are sent out.
+*/
+   tcp_rearm_rto(sk);
} else {
tcp_init_transfer(sk, 
BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
tp->copied_seq = tp->rcv_nxt;
@@ -5933,18 +5942,6 @@ int tcp_rcv_state_process(struct sock *sk, struct 
sk_buff *skb)
if (tp->rx_opt.tstamp_ok)
tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
 
-   if (req) {
-   /* Re-arm the timer because data may have been sent out.
-* This is similar to the regular data transmission case
-* when new data has just been ack'ed.
-*
-* (TFO) - we could try to be more aggressive and
-* retransmitting any data sooner based on when they
-* are sent out.
-*/
-   tcp_rearm_rto(sk);
-   }
-
if (!inet_csk(sk)->icsk_ca_ops->cong_control)
tcp_update_pacing_rate(sk);
 
-- 
2.14.2.920.gcf0c67979c-goog

[PATCH v2 net-next 1/2] tcp: uniform the set up of sockets after successful connection

2017-10-04 Thread Wei Wang

From: Wei Wang <wei...@google.com>

Currently in the TCP code, the initialization sequence for cached
metrics, congestion control, BPF, etc, after successful connection
is very inconsistent. This introduces inconsistent behavior and is
prone to bugs. The current call sequence is as follows:

(1) for active case (tcp_finish_connect() case):
tcp_mtup_init(sk);
icsk->icsk_af_ops->rebuild_header(sk);
tcp_init_metrics(sk);
tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
tcp_init_congestion_control(sk);
tcp_init_buffer_space(sk);

(2) for passive case (tcp_rcv_state_process() TCP_SYN_RECV case):
icsk->icsk_af_ops->rebuild_header(sk);
tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
tcp_init_congestion_control(sk);
tcp_mtup_init(sk);
tcp_init_buffer_space(sk);
tcp_init_metrics(sk);

(3) for TFO passive case (tcp_fastopen_create_child()):
inet_csk(child)->icsk_af_ops->rebuild_header(child);
tcp_init_congestion_control(child);
tcp_mtup_init(child);
tcp_init_metrics(child);
tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
tcp_init_buffer_space(child);

This commit unifies the above call sites to use the following sequence:
tcp_mtup_init(sk);
icsk->icsk_af_ops->rebuild_header(sk);
tcp_init_metrics(sk);
tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE/PASSIVE_ESTABLISHED_CB);
tcp_init_congestion_control(sk);
tcp_init_buffer_space(sk);
This sequence is the same as the (1) active case. We pick this sequence
because this order correctly allows BPF to override the settings
including congestion control module and initial cwnd, etc from
the route, and then allows the CC module to see those settings.

Suggested-by: Neal Cardwell <ncardw...@google.com>
Tested-by: Neal Cardwell <ncardw...@google.com>
Signed-off-by: Wei Wang <wei...@google.com>
Acked-by: Neal Cardwell <ncardw...@google.com>
Acked-by: Yuchung Cheng <ych...@google.com>
Acked-by: Eric Dumazet <eduma...@google.com>
---
change in v2:
 removed EXPORT_SYMBOL(tcp_init_transfer);

 include/net/tcp.h   |  1 +
 net/ipv4/tcp.c  | 12 
 net/ipv4/tcp_fastopen.c |  7 +--
 net/ipv4/tcp_input.c| 21 +++--
 4 files changed, 17 insertions(+), 24 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 770b608c8439..f45fdc57d29d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -417,6 +417,7 @@ bool tcp_peer_is_proven(struct request_sock *req, struct 
dst_entry *dst);
 void tcp_disable_fack(struct tcp_sock *tp);
 void tcp_close(struct sock *sk, long timeout);
 void tcp_init_sock(struct sock *sk);
+void tcp_init_transfer(struct sock *sk, int bpf_op);
 unsigned int tcp_poll(struct file *file, struct socket *sock,
  struct poll_table_struct *wait);
 int tcp_getsockopt(struct sock *sk, int level, int optname,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5091402720ab..3ed21e281c39 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -456,6 +456,18 @@ void tcp_init_sock(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_init_sock);
 
+void tcp_init_transfer(struct sock *sk, int bpf_op)
+{
+   struct inet_connection_sock *icsk = inet_csk(sk);
+
+   tcp_mtup_init(sk);
+   icsk->icsk_af_ops->rebuild_header(sk);
+   tcp_init_metrics(sk);
+   tcp_call_bpf(sk, bpf_op);
+   tcp_init_congestion_control(sk);
+   tcp_init_buffer_space(sk);
+}
+
 static void tcp_tx_timestamp(struct sock *sk, u16 tsflags, struct sk_buff *skb)
 {
if (tsflags && skb) {
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index e3c33220c418..515a757f02a8 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -216,12 +216,7 @@ static struct sock *tcp_fastopen_create_child(struct sock 
*sk,
refcount_set(&req->rsk_refcnt, 2);
 
/* Now finish processing the fastopen child socket. */
-   inet_csk(child)->icsk_af_ops->rebuild_header(child);
-   tcp_init_congestion_control(child);
-   tcp_mtup_init(child);
-   tcp_init_metrics(child);
-   tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
-   tcp_init_buffer_space(child);
+   tcp_init_transfer(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
 
tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index db9bb46b5776..bd3a35f5dbf2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5513,20 +5513,13 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff 
*skb)
security_inet_conn_established(sk, skb);
}
 
-   /* Make sure socket is routed, for correct metrics.  */
-   icsk->icsk_af_ops->rebuild_header(sk);
-
-   tcp_init_metrics(sk);
-   tcp_call_bpf(sk, BPF_S
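
Note: the reason the ordering matters (metrics first, then the BPF hook, then
congestion control init) can be shown with a toy model. Everything below is
hypothetical: toy_sock, the hook type and the helper names are made up for
illustration and do not correspond to kernel structures. The sketch only
demonstrates that a hook run between the two steps is visible to the step
that runs last.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sock {
	int route_initcwnd;	/* value cached from route metrics */
	int snd_cwnd;		/* current initial cwnd setting */
	int cc_seen_cwnd;	/* what the CC module observed at init time */
};

typedef void (*toy_hook)(struct toy_sock *sk);

/* Step 1: load settings from the route. */
static void toy_init_metrics(struct toy_sock *sk)
{
	sk->snd_cwnd = sk->route_initcwnd;
}

/* Step 2 (optional): a BPF-like hook that overrides the route setting. */
static void toy_bpf_override(struct toy_sock *sk)
{
	sk->snd_cwnd = 40;
}

/* Step 3: the CC module records whatever is configured at this point. */
static void toy_cc_init(struct toy_sock *sk)
{
	sk->cc_seen_cwnd = sk->snd_cwnd;
}

/* Single helper used by every connection path, so the order is the same. */
static void toy_init_transfer(struct toy_sock *sk, toy_hook bpf_op)
{
	assert(sk != NULL);
	toy_init_metrics(sk);
	if (bpf_op)
		bpf_op(sk);
	toy_cc_init(sk);
}
```

If toy_cc_init() ran before the hook, as in the old passive paths, the CC
module would still see the route value rather than the overridden one.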

Re: [PATCH net-next 1/2] tcp: uniform the set up of sockets after successful connection

2017-10-03 Thread Wei Wang

On Tue, Oct 3, 2017 at 9:28 PM, David Miller <da...@davemloft.net> wrote:
> From: Wei Wang <wei...@google.com>
> Date: Mon,  2 Oct 2017 10:01:35 -0700
>
>> @@ -456,6 +456,19 @@ void tcp_init_sock(struct sock *sk)
>>  }
>>  EXPORT_SYMBOL(tcp_init_sock);
>>
>> +void tcp_init_transfer(struct sock *sk, int bpf_op)
>> +{
>> + struct inet_connection_sock *icsk = inet_csk(sk);
>> +
>> + tcp_mtup_init(sk);
>> + icsk->icsk_af_ops->rebuild_header(sk);
>> + tcp_init_metrics(sk);
>> + tcp_call_bpf(sk, bpf_op);
>> + tcp_init_congestion_control(sk);
>> + tcp_init_buffer_space(sk);
>> +}
>> +EXPORT_SYMBOL(tcp_init_transfer);
>
> This symbol export is unnecessary, and if it were it should
> be EXPORT_SYMBOL_GPL().

I see. This function is only called in the TCP stack. Will remove
EXPORT_SYMBOL() in v2.

[PATCH net-next 2/2] tcp: clean up TFO server's initial tcp_rearm_rto() call

2017-10-02 Thread Wei Wang

From: Wei Wang <wei...@google.com>

This commit is a cleanup that moves the tcp_rearm_rto() call in the TFO
server case to an earlier spot in tcp_rcv_state_process() to make
the code more compact.
This is only a cosmetic change.

Suggested-by: Yuchung Cheng <ych...@google.com>
Signed-off-by: Wei Wang <wei...@google.com>
Acked-by: Neal Cardwell <ncardw...@google.com>
Acked-by: Yuchung Cheng <ych...@google.com>
Acked-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv4/tcp_input.c | 21 +
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bd3a35f5dbf2..c5b8d61846c2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5911,6 +5911,15 @@ int tcp_rcv_state_process(struct sock *sk, struct 
sk_buff *skb)
if (req) {
inet_csk(sk)->icsk_retransmits = 0;
reqsk_fastopen_remove(sk, req, false);
+   /* Re-arm the timer because data may have been sent out.
+* This is similar to the regular data transmission case
+* when new data has just been ack'ed.
+*
+* (TFO) - we could try to be more aggressive and
+* retransmitting any data sooner based on when they
+* are sent out.
+*/
+   tcp_rearm_rto(sk);
} else {
tcp_init_transfer(sk, 
BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
tp->copied_seq = tp->rcv_nxt;
@@ -5933,18 +5942,6 @@ int tcp_rcv_state_process(struct sock *sk, struct 
sk_buff *skb)
if (tp->rx_opt.tstamp_ok)
tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
 
-   if (req) {
-   /* Re-arm the timer because data may have been sent out.
-* This is similar to the regular data transmission case
-* when new data has just been ack'ed.
-*
-* (TFO) - we could try to be more aggressive and
-* retransmitting any data sooner based on when they
-* are sent out.
-*/
-   tcp_rearm_rto(sk);
-   }
-
if (!inet_csk(sk)->icsk_ca_ops->cong_control)
tcp_update_pacing_rate(sk);
 
-- 
2.14.2.822.g60be5d43e6-goog

[PATCH net-next 1/2] tcp: uniform the set up of sockets after successful connection

2017-10-02 Thread Wei Wang

From: Wei Wang <wei...@google.com>

Currently in the TCP code, the initialization sequence for cached
metrics, congestion control, BPF, etc, after successful connection
is very inconsistent. This introduces inconsistent behavior and is
prone to bugs. The current call sequence is as follows:

(1) for active case (tcp_finish_connect() case):
tcp_mtup_init(sk);
icsk->icsk_af_ops->rebuild_header(sk);
tcp_init_metrics(sk);
tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
tcp_init_congestion_control(sk);
tcp_init_buffer_space(sk);

(2) for passive case (tcp_rcv_state_process() TCP_SYN_RECV case):
icsk->icsk_af_ops->rebuild_header(sk);
tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
tcp_init_congestion_control(sk);
tcp_mtup_init(sk);
tcp_init_buffer_space(sk);
tcp_init_metrics(sk);

(3) for TFO passive case (tcp_fastopen_create_child()):
inet_csk(child)->icsk_af_ops->rebuild_header(child);
tcp_init_congestion_control(child);
tcp_mtup_init(child);
tcp_init_metrics(child);
tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
tcp_init_buffer_space(child);

This commit unifies the above call sites to use the following sequence:
tcp_mtup_init(sk);
icsk->icsk_af_ops->rebuild_header(sk);
tcp_init_metrics(sk);
tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE/PASSIVE_ESTABLISHED_CB);
tcp_init_congestion_control(sk);
tcp_init_buffer_space(sk);

This sequence is the same as the (1) active case. We pick this sequence
because this order correctly allows BPF to override the settings
including congestion control module and initial cwnd, etc from
the route, and then allows the CC module to see those settings.

Suggested-by: Neal Cardwell <ncardw...@google.com>
Tested-by: Neal Cardwell <ncardw...@google.com>
Signed-off-by: Wei Wang <wei...@google.com>
Acked-by: Neal Cardwell <ncardw...@google.com>
Acked-by: Yuchung Cheng <ych...@google.com>
Acked-by: Eric Dumazet <eduma...@google.com>
---
 include/net/tcp.h   |  1 +
 net/ipv4/tcp.c  | 13 +
 net/ipv4/tcp_fastopen.c |  7 +--
 net/ipv4/tcp_input.c| 21 +++--
 4 files changed, 18 insertions(+), 24 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 770b608c8439..f45fdc57d29d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -417,6 +417,7 @@ bool tcp_peer_is_proven(struct request_sock *req, struct 
dst_entry *dst);
 void tcp_disable_fack(struct tcp_sock *tp);
 void tcp_close(struct sock *sk, long timeout);
 void tcp_init_sock(struct sock *sk);
+void tcp_init_transfer(struct sock *sk, int bpf_op);
 unsigned int tcp_poll(struct file *file, struct socket *sock,
  struct poll_table_struct *wait);
 int tcp_getsockopt(struct sock *sk, int level, int optname,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5091402720ab..a16445664644 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -456,6 +456,19 @@ void tcp_init_sock(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_init_sock);
 
+void tcp_init_transfer(struct sock *sk, int bpf_op)
+{
+   struct inet_connection_sock *icsk = inet_csk(sk);
+
+   tcp_mtup_init(sk);
+   icsk->icsk_af_ops->rebuild_header(sk);
+   tcp_init_metrics(sk);
+   tcp_call_bpf(sk, bpf_op);
+   tcp_init_congestion_control(sk);
+   tcp_init_buffer_space(sk);
+}
+EXPORT_SYMBOL(tcp_init_transfer);
+
 static void tcp_tx_timestamp(struct sock *sk, u16 tsflags, struct sk_buff *skb)
 {
if (tsflags && skb) {
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index e3c33220c418..515a757f02a8 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -216,12 +216,7 @@ static struct sock *tcp_fastopen_create_child(struct sock 
*sk,
refcount_set(&req->rsk_refcnt, 2);
 
/* Now finish processing the fastopen child socket. */
-   inet_csk(child)->icsk_af_ops->rebuild_header(child);
-   tcp_init_congestion_control(child);
-   tcp_mtup_init(child);
-   tcp_init_metrics(child);
-   tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
-   tcp_init_buffer_space(child);
+   tcp_init_transfer(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
 
tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index db9bb46b5776..bd3a35f5dbf2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5513,20 +5513,13 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff 
*skb)
security_inet_conn_established(sk, skb);
}
 
-   /* Make sure socket is routed, for correct metrics.  */
-   icsk->icsk_af_ops->rebuild_header(sk);
-
-   tcp_init_metrics(sk);
-   tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
-

Re: [PATCH net] ipv6: remove incorrect WARN_ON() in fib6_del()

2017-09-27 Thread Wei Wang

On Tue, Sep 26, 2017 at 6:20 AM, Eric Dumazet <eduma...@google.com> wrote:
> On Mon, Sep 25, 2017 at 10:52 PM, Wei Wang <wei...@google.com> wrote:
>> On Mon, Sep 25, 2017 at 7:23 PM, Eric Dumazet <eduma...@google.com> wrote:
>>> On Mon, Sep 25, 2017 at 7:07 PM, Martin KaFai Lau <ka...@fb.com> wrote:
>>>
>>>> I am probably still missing something.
>>>>
>>>> Considering the del operation should be under the writer lock,
>>>> if rt->rt6i_node should be NULL (for rt that has already been
>>>> removed from fib6), why this WARN_ON() is triggered?
>>>>
>>>> An example may help.
>>>>
>>>
>>> Look at the stack trace, you'll find the answers...
>>>
>>> ip6_link_failure() -> ip6_del_rt()
>>>
>>> Note that rt might have been deleted from the _tree_ already.
>>
>> Had a brief talk with Martin.
>> He has a valid point.
>> The current WARN_ON() code is as follows:
>> #if RT6_DEBUG >= 2
>>if (rt->dst.obsolete > 0) {
>>WARN_ON(fn);
>>return -ENOENT;
>>}
>> #endif
>>
>> The WARN_ON() only triggers when fn is not NULL. (I missed it before.)
>> In theory, fib6_del() calls fib6_del_route() which should set
>> rt->rt6i_node to NULL and rt->dst.obsolete to DST_OBSOLETE_DEAD within
>> the same write_lock session.
>> If those 2 values are inconsistent, it indicates something is wrong.
>> Will need more time to root cause the issue.
>>
>> Please ignore this patch. Sorry about the confusion.
>
> Oh well, for some reason I was seeing WARN_ON(1)  here, since this is
> a construct I often add in my tests ...

Just an update on this issue:
This WARNING issue should already be fixed by commit
7483cea79957312e9f8e9cf760a1bc5d6c507113:
Author: Ido Schimmel <ido...@mellanox.com>
Date:   Thu Aug 3 13:28:22 2017 +0200

ipv6: fib: Unlink replaced routes from their nodes

When a route is deleted its node pointer is set to NULL to indicate it's
no longer linked to its node. Do the same for routes that are replaced.

This will later allow us to test if a route is still in the FIB by
checking its node pointer instead of its reference count.

Signed-off-by: Ido Schimmel <ido...@mellanox.com>
Signed-off-by: Jiri Pirko <j...@mellanox.com>
Signed-off-by: David S. Miller <da...@davemloft.net>

So no further action is needed on this.

Thanks.
Wei

Re: [PATCH net-next] ipv6: do lazy dst->__use update when per cpu dst is available

2017-09-27 Thread Wei Wang

On Wed, Sep 27, 2017 at 10:14 AM, Paolo Abeni  wrote:
> When a host is under high ipv6 load, the updates of the ingress
> route '__use' field are a source of relevant contention: such
> field is updated for each packet and several cores can access
> concurrently the dst, even if percpu dst entries are available
> and used.
>
> The __use value is just a rough indication of the dst usage: is
> already updated concurrently from multiple CPUs without any lock,
> so we can decrease the contention leveraging the percpu dst to perform
> __use bulk updates: if a per cpu dst entry is found, we account on
> such entry and we flush the percpu counter once per jiffy.
>
> Performance gain under UDP flood is as follows:
>
> nr RX queues    before    after    delta
>                 kpps      kpps     (%)
> 2               2316      2688     16
> 3               3033      3605     18
> 4               3963      4328     9
> 5               4379      5253     19
> 6               5137      6000     16
>
> Performance gain under TCP syn flood should be measurable as well.
>
> Signed-off-by: Paolo Abeni 
> ---
>  net/ipv6/route.c | 18 --
>  1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 26cc9f483b6d..e69f304de950 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1170,12 +1170,24 @@ struct rt6_info *ip6_pol_route(struct net *net, 
> struct fib6_table *table,
>
> struct rt6_info *pcpu_rt;
>
> -   rt->dst.lastuse = jiffies;
> -   rt->dst.__use++;
> pcpu_rt = rt6_get_pcpu_route(rt);
>
> if (pcpu_rt) {
> +   unsigned long ts;
> +
> read_unlock_bh(&table->tb6_lock);
> +
> +   /* do lazy updates of rt->dst->__use, at most once
> +* per jiffy, to avoid contention on such cacheline.
> +*/
> +   ts = jiffies;
> +   pcpu_rt->dst.__use++;
> +   if (pcpu_rt->dst.lastuse != ts) {
> +   rt->dst.__use += pcpu_rt->dst.__use;
> +   rt->dst.lastuse = ts;
> +   pcpu_rt->dst.__use = 0;
> +   pcpu_rt->dst.lastuse = ts;
> +   }
> } else {
> /* We have to do the read_unlock first
>  * because rt6_make_pcpu_route() may trigger
> @@ -1185,6 +1197,8 @@ struct rt6_info *ip6_pol_route(struct net *net, struct 
> fib6_table *table,
> read_unlock_bh(&table->tb6_lock);
> pcpu_rt = rt6_make_pcpu_route(rt);
> dst_release(&rt->dst);
> +   rt->dst.lastuse = jiffies;
> +   rt->dst.__use++;
> }
>
> trace_fib6_table_lookup(net, pcpu_rt, table->tb6_id, fl6);
> --
> 2.13.5
>

Hi Paolo,

Eric and I discussed this issue recently as well :).

What about the following change:

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a352..33e1d86bcef6 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -258,14 +258,18 @@ static inline void dst_hold(struct dst_entry *dst)
 static inline void dst_use(struct dst_entry *dst, unsigned long time)
 {
dst_hold(dst);
-   dst->__use++;
-   dst->lastuse = time;
+   if (dst->lastuse != time) {
+   dst->__use++;
+   dst->lastuse = time;
+   }
 }

 static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
 {
-   dst->__use++;
-   dst->lastuse = time;
+   if (dst->lastuse != time) {
+   dst->__use++;
+   dst->lastuse = time;
+   }
 }

 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 26cc9f483b6d..e195f093add3 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1170,8 +1170,7 @@ struct rt6_info *ip6_pol_route(struct net *net,
struct fib6_table *table,

struct rt6_info *pcpu_rt;

-   rt->dst.lastuse = jiffies;
-   rt->dst.__use++;
+   dst_use_noref(&rt->dst, jiffies);
pcpu_rt = rt6_get_pcpu_route(rt);

if (pcpu_rt) {


This way we always only update dst->__use and dst->lastuse at most
once per jiffy. And we don't really need to update pcpu and then do
the copy over from pcpu_rt to rt operation.

Another thing is that I don't really see any places making use of
dst->__use. So maybe we can also get rid of this dst->__use field?

Thanks.
Wei
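
Note: the two approaches discussed in this thread behave differently, which a
small userspace sketch may make clearer. The toy_dst type and both helper
names are hypothetical, "now" stands in for jiffies, and there is no
concurrency here. The first helper follows the suggestion above (touch the
shared entry at most once per tick), the second follows the quoted patch
(count on a per-CPU entry and fold it into the shared one when the tick
changes).

```c
#include <assert.h>
#include <stddef.h>

struct toy_dst {
	unsigned long lastuse;	/* tick of the last recorded use */
	unsigned long use;	/* usage counter */
};

/* Update the shared entry only when the tick changed. The counter then
 * counts ticks in which the entry was used, not individual packets.
 */
static void toy_use_lazy(struct toy_dst *dst, unsigned long now)
{
	assert(dst != NULL);
	if (dst->lastuse != now) {
		dst->use++;
		dst->lastuse = now;
	}
}

/* Count every use on the per-CPU entry and flush the accumulated count
 * into the shared entry when the tick changed. No use is lost, but counts
 * reach the shared entry with a delay.
 */
static void toy_use_batched(struct toy_dst *shared, struct toy_dst *pcpu,
			    unsigned long now)
{
	assert(shared != NULL && pcpu != NULL);
	pcpu->use++;
	if (pcpu->lastuse != now) {
		shared->use += pcpu->use;
		shared->lastuse = now;
		pcpu->use = 0;
		pcpu->lastuse = now;
	}
}
```

For four uses at ticks 1, 1, 1 and 2, the lazy variant ends with a count of 2
while the batched variant ends with a count of 4 on the shared entry.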

Re: [PATCH net] ipv6: remove incorrect WARN_ON() in fib6_del()

2017-09-25 Thread Wei Wang

On Mon, Sep 25, 2017 at 7:23 PM, Eric Dumazet  wrote:
> On Mon, Sep 25, 2017 at 7:07 PM, Martin KaFai Lau  wrote:
>
>> I am probably still missing something.
>>
>> Considering the del operation should be under the writer lock,
>> if rt->rt6i_node should be NULL (for rt that has already been
>> removed from fib6), why this WARN_ON() is triggered?
>>
>> An example may help.
>>
>
> Look at the stack trace, you'll find the answers...
>
> ip6_link_failure() -> ip6_del_rt()
>
> Note that rt might have been deleted from the _tree_ already.

Had a brief talk with Martin.
He has a valid point.
The current WARN_ON() code is as follows:
#if RT6_DEBUG >= 2
   if (rt->dst.obsolete > 0) {
   WARN_ON(fn);
   return -ENOENT;
   }
#endif

The WARN_ON() only triggers when fn is not NULL. (I missed it before.)
In theory, fib6_del() calls fib6_del_route() which should set
rt->rt6i_node to NULL and rt->dst.obsolete to DST_OBSOLETE_DEAD within
the same write_lock session.
If those 2 values are inconsistent, it indicates something is wrong.
Will need more time to root cause the issue.

Please ignore this patch. Sorry about the confusion.
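
Note: the invariant discussed above (a route marked obsolete must no longer
point at a tree node) can be written down as a small userspace check. This is
a sketch with hypothetical names: toy_route, TOY_OBSOLETE_DEAD and the helpers
are made up for illustration, and the write lock that protects both fields in
the kernel is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_OBSOLETE_DEAD 2	/* hypothetical: any value > 0 means dead */

struct toy_route {
	void *node;		/* tree node the route is linked to, or NULL */
	int obsolete;		/* > 0 once the route has been deleted */
};

/* Unlink the route: both fields change together. */
static void toy_del_route(struct toy_route *rt)
{
	rt->node = NULL;
	rt->obsolete = TOY_OBSOLETE_DEAD;
}

/* The condition the WARN_ON(fn) guards: obsolete but still linked. */
static int toy_invariant_holds(const struct toy_route *rt)
{
	return !(rt->obsolete > 0 && rt->node != NULL);
}

/* Delete a route; a second delete finds node == NULL and returns -ENOENT. */
static int toy_del(struct toy_route *rt)
{
	assert(rt != NULL);
	if (!rt->node)
		return -ENOENT;
	toy_del_route(rt);
	return 0;
}
```

A route that was deleted twice never violates the invariant; only a path that
sets one field without the other does, which is what the warning reports.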

Re: [PATCH net] ipv6: remove incorrect WARN_ON() in fib6_del()

2017-09-25 Thread Wei Wang

On Mon, Sep 25, 2017 at 5:56 PM, Martin KaFai Lau <ka...@fb.com> wrote:
> On Mon, Sep 25, 2017 at 05:35:22PM +0000, Wei Wang wrote:
>> From: Wei Wang <wei...@google.com>
>>
>> fib6_del() generates WARN_ON() when rt->dst.obsolete > 0. This does not
>> make sense because it is possible that the route passed in is already
>> deleted by some other thread and rt->dst.obsolete is set to
>> DST_OBSOLETE_DEAD.
>> So this commit deletes this WARN_ON() and also removes the
>> "#if RT6_DEBUG >= 2" condition so that if the route is already
>> obsolete, we return right at the beginning of fib6_del().
>>
>>
>> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
>> index e5308d7cbd75..693bcd7ef6d2 100644
>> --- a/net/ipv6/ip6_fib.c
>> +++ b/net/ipv6/ip6_fib.c
>> @@ -1592,13 +1592,7 @@ int fib6_del(struct rt6_info *rt, struct nl_info 
>> *info)
>>   struct net *net = info->nl_net;
>>   struct rt6_info **rtp;
>>
>> -#if RT6_DEBUG >= 2
>> - if (rt->dst.obsolete > 0) {
>> - WARN_ON(fn);
> fn should have already been set to NULL if it is removed
> from the fib6 tree?
>

That is true. rt->rt6i_node (fn) should already be marked as NULL.
That means the check on rt->dst.obsolete is redundant.
I will remove it in v2.
Thanks Martin.


>> - return -ENOENT;
>> - }
>> -#endif
>> - if (!fn || rt == net->ipv6.ip6_null_entry)
>> + if (!fn || rt->dst.obsolete > 0 || rt == net->ipv6.ip6_null_entry)
>>   return -ENOENT;
>>
>>   WARN_ON(!(fn->fn_flags & RTN_RTINFO));
>> --
>> 2.14.1.821.g8fa685d3b7-goog
>>

[PATCH net] ipv6: remove incorrect WARN_ON() in fib6_del()

2017-09-25 Thread Wei Wang

From: Wei Wang <wei...@google.com>

fib6_del() generates WARN_ON() when rt->dst.obsolete > 0. This does not
make sense because it is possible that the route passed in is already
deleted by some other thread and rt->dst.obsolete is set to
DST_OBSOLETE_DEAD.
So this commit deletes this WARN_ON() and also removes the
"#if RT6_DEBUG >= 2" condition so that if the route is already
obsolete, we return right at the beginning of fib6_del().

Syzkaller hit this WARN_ON() in the following call trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:52
 panic+0x1e4/0x417 kernel/panic.c:180
 __warn+0x1c4/0x1d9 kernel/panic.c:541
 report_bug+0x211/0x2d0 lib/bug.c:183
 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
 do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
 do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:310
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
 invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:846
RIP: 0010:fib6_del+0x947/0xca0 net/ipv6/ip6_fib.c:1477
RSP: 0018:8801db2074d8 EFLAGS: 00010206
RAX: 8801d1500080 RBX: 8801d01638c0 RCX: 
RDX: 0100 RSI: 8801db207650 RDI: 8801d0163924
RBP: 8801db2075f0 R08: 86df5f98 R09: 0002
R10: 8801db2074b8 R11: 11003a2a026b R12: dc00
R13: 8801db207650 R14: 8801a0748180 R15: 11003b640ea5
 __ip6_del_rt+0xc7/0x120 net/ipv6/route.c:2136
 ip6_del_rt+0x132/0x1a0 net/ipv6/route.c:2149
 ip6_link_failure+0x244/0x380 net/ipv6/route.c:1359
 dst_link_failure include/net/dst.h:454 [inline]
 ndisc_error_report+0xae/0x180 net/ipv6/ndisc.c:682
 neigh_invalidate+0x225/0x530 net/core/neighbour.c:883
 neigh_timer_handler+0x883/0xca0 net/core/neighbour.c:969
 call_timer_fn+0x233/0x830 kernel/time/timer.c:1268
 expire_timers kernel/time/timer.c:1307 [inline]
 __run_timers+0x7fd/0xb90 kernel/time/timer.c:1601
 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614
 __do_softirq+0x2f5/0xba3 kernel/softirq.c:284
 invoke_softirq kernel/softirq.c:364 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:638 [inline]
 smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:1044
 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:702
RIP: 0010:arch_local_irq_enable arch/x86/include/asm/paravirt.h:824 [inline]
RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:168 [inline]
RIP: 0010:_raw_spin_unlock_irq+0x56/0x70 kernel/locking/spinlock.c:199
RSP: 0018:8801d0407040 EFLAGS: 0286 ORIG_RAX: ff10
RAX: dc00 RBX: 8801db225780 RCX: 
RDX: 10b59433 RSI: 0001 RDI: 85aca198
RBP: 8801d0407048 R08: 0001 R09: 
R10:  R11:  R12: 8801c6820400
R13: 11003a080e11 R14: 8801d1500080 R15: 8801d1500080
 
 finish_lock_switch kernel/sched/sched.h:1334 [inline]
 finish_task_switch+0x1d3/0x740 kernel/sched/core.c:2638
 context_switch kernel/sched/core.c:2774 [inline]
 __schedule+0x8f0/0x2070 kernel/sched/core.c:3332
 schedule+0x108/0x440 kernel/sched/core.c:3391
 schedule_hrtimeout_range_clock+0x23e/0x810 kernel/time/hrtimer.c:1708
 schedule_hrtimeout_range+0x2a/0x40 kernel/time/hrtimer.c:1753
 poll_schedule_timeout+0x10f/0x1f0 fs/select.c:242
 do_select+0x11ea/0x1710 fs/select.c:581
 core_sys_select+0x480/0x960 fs/select.c:655
 do_pselect fs/select.c:732 [inline]
 SYSC_pselect6 fs/select.c:773 [inline]
 SyS_pselect6+0x54a/0x650 fs/select.c:758
 entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x45f181
RSP: 002b:7f91306e1db0 EFLAGS: 0246 ORIG_RAX: 010e
RAX: ffda RBX:  RCX: 0045f181
RDX:  RSI:  RDI: 
RBP: 0086 R08: 7f91306e1db0 R09: 
R10:  R11: 0246 R12: 7ffdd9621670
R13: 7f91306e29c0 R14: 7f9130eac040 R15: 0003

Note: there is no Fixes tag because this bug was introduced long ago.

Signed-off-by: Wei Wang <wei...@google.com>
Acked-by: Eric Dumazet <eduma...@google.com>
---
 net/ipv6/ip6_fib.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index e5308d7cbd75..693bcd7ef6d2 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1592,13 +1592,7 @@ int fib6_del(struct rt6_info *rt, struct nl_info *info)
struct net *net = info->nl_net;
struct rt6_info **rtp;
 
-#if RT6_DEBUG >= 2
-   if (rt->dst.obsolete > 0) {
-   WARN_ON(fn);
-   return -ENOENT;
-   }
-#endif
-   if (!fn || rt == net->ipv6.ip6_null_entry)
+   if (!fn || rt->dst.obsolete > 0 || rt == net->ipv6.ip6_null_entry)
return -ENOENT;
 
WARN_ON(!(fn->fn_flags & RTN_RTINFO));
-- 
2.14.1.821.g8fa685d3b7-goog
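
The early-return guard described in the commit message can be sketched as a small userspace model. This is only an illustration under stated assumptions: `mock_route`, `mock_node`, `mock_null_entry`, `MOCK_OBSOLETE_DEAD` and `mock_fib6_del()` are hypothetical stand-ins, not kernel definitions, and the success path is reduced to marking the route obsolete.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; only the fields
 * that the guard reads are modelled here. */
struct mock_node {
	int unused;
};

struct mock_route {
	struct mock_node *node;	/* plays the role of the route's fib6 node */
	int obsolete;		/* plays the role of rt->dst.obsolete */
};

/* Illustrative value: the guard only cares that it is greater than 0. */
#define MOCK_OBSOLETE_DEAD 2

/* Plays the role of net->ipv6.ip6_null_entry. */
static struct mock_route mock_null_entry;

/* Same shape as the post-patch check: an already-obsolete route is
 * treated as "nothing to delete" rather than as a condition to warn on. */
static int mock_fib6_del(struct mock_route *rt)
{
	struct mock_node *fn = rt->node;

	if (!fn || rt->obsolete > 0 || rt == &mock_null_entry)
		return -ENOENT;

	/* Simplification: the real deletion path does much more work. */
	rt->obsolete = MOCK_OBSOLETE_DEAD;
	return 0;
}
```

In this model a second call on the same route returns -ENOENT quietly, which is the behaviour the patch wants when another thread has already deleted the route.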

Re: [PATCH net] net: prevent dst uses after free

2017-09-21 Thread Wei Wang

On Thu, Sep 21, 2017 at 9:15 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> In linux-4.13, Wei worked hard to convert dst to a traditional
> refcounted model, removing GC.
>
> We now want to make sure a dst refcount can not transition from 0 back
> to 1.
>
> The problem here is that input path attached a not refcounted dst to an
> skb. Then later, because packet is forwarded and hits skb_dst_force()
> before exiting RCU section, we might try to take a refcount on one dst
> that is about to be freed, if another cpu saw 1 -> 0 transition in
> dst_release() and queued the dst for freeing after one RCU grace period.
>
> Lets unify skb_dst_force() and skb_dst_force_safe(), since we should
> always perform the complete check against dst refcount, and not assume
> it is not zero.
>
> Bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=197005
>
> [  989.919496]  skb_dst_force+0x32/0x34
> [  989.919498]  __dev_queue_xmit+0x1ad/0x482
> [  989.919501]  ? eth_header+0x28/0xc6
> [  989.919502]  dev_queue_xmit+0xb/0xd
> [  989.919504]  neigh_connected_output+0x9b/0xb4
> [  989.919507]  ip_finish_output2+0x234/0x294
> [  989.919509]  ? ipt_do_table+0x369/0x388
> [  989.919510]  ip_finish_output+0x12c/0x13f
> [  989.919512]  ip_output+0x53/0x87
> [  989.919513]  ip_forward_finish+0x53/0x5a
> [  989.919515]  ip_forward+0x2cb/0x3e6
> [  989.919516]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
> [  989.919518]  ip_rcv_finish+0x2e2/0x321
> [  989.919519]  ip_rcv+0x26f/0x2eb
> [  989.919522]  ? vlan_do_receive+0x4f/0x289
> [  989.919523]  __netif_receive_skb_core+0x467/0x50b
> [  989.919526]  ? tcp_gro_receive+0x239/0x239
> [  989.919529]  ? inet_gro_receive+0x226/0x238
> [  989.919530]  __netif_receive_skb+0x4d/0x5f
> [  989.919532]  netif_receive_skb_internal+0x5c/0xaf
> [  989.919533]  napi_gro_receive+0x45/0x81
> [  989.919536]  ixgbe_poll+0xc8a/0xf09
> [  989.919539]  ? kmem_cache_free_bulk+0x1b6/0x1f7
> [  989.919540]  net_rx_action+0xf4/0x266
> [  989.919543]  __do_softirq+0xa8/0x19d
> [  989.919545]  irq_exit+0x5d/0x6b
> [  989.919546]  do_IRQ+0x9c/0xb5
> [  989.919548]  common_interrupt+0x93/0x93
> [  989.919548]  
>
>
> Similarly dst_clone() can use dst_hold() helper to have additional
> debugging, as a follow up to commit 44ebe79149ff ("net: add debug
> atomic_inc_not_zero() in dst_hold()")
>
> In net-next we will convert dst atomic_t to refcount_t for peace of
> mind.
>
> Fixes: a4c2fd7f7891 ("net: remove DST_NOCACHE flag")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Cc: Wei Wang <wei...@google.com>
> Reported-by: Paweł Staszewski <pstaszew...@itcare.pl>
> Bisected-by: Paweł Staszewski <pstaszew...@itcare.pl>
> ---

Thanks a lot for the fix, Eric. It makes sense to unify all uses of
skb_dst_force() so that it always checks that the refcnt is not 0.
And thank you, Pawel, for reporting this and testing the fix.

Acked-by: Wei Wang <wei...@google.com>
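
The "never go from 0 back to 1" rule discussed above can be sketched in userspace with C11 atomics. This is a minimal model, not the kernel code: `mock_dst`, `mock_skb`, `mock_dst_hold_safe()` and `mock_skb_dst_force()` are hypothetical names, and dropping the pointer when the hold fails is an assumption about how a unified helper could behave.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the refcount and the noref flag are modelled. */
struct mock_dst {
	atomic_int refcnt;
};

struct mock_skb {
	struct mock_dst *dst;
	bool noref;	/* true while the skb borrows dst without a reference */
};

/* Take a reference only if the count is currently non-zero. */
static bool mock_dst_hold_safe(struct mock_dst *dst)
{
	int old = atomic_load(&dst->refcnt);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&dst->refcnt, &old, old + 1))
			return true;
		/* On failure 'old' holds the freshly observed value; retry. */
	}
	return false;
}

/* Assumed behaviour: convert a borrowed dst into a counted one, or drop
 * the pointer if the dst is already on its way to being freed. */
static bool mock_skb_dst_force(struct mock_skb *skb)
{
	if (skb->noref) {
		if (skb->dst && !mock_dst_hold_safe(skb->dst))
			skb->dst = NULL;
		skb->noref = false;
	}
	return skb->dst != NULL;
}
```

A caller in this model has to check the return value, because a dst whose count already reached zero is no longer usable.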


>  include/net/dst.h   |   22 --
>  include/net/route.h |2 +-
>  include/net/sock.h  |2 +-
>  3 files changed, 6 insertions(+), 20 deletions(-)
>
> diff --git a/include/net/dst.h b/include/net/dst.h
> index 
> 93568bd0a3520bb7402f04d90cf04ac99c81cfbe..06a6765da074449e6f1fe42ee05e711e898ad372
>  100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, 
> unsigned long time)
>  static inline struct dst_entry *dst_clone(struct dst_entry *dst)
>  {
> if (dst)
> -   atomic_inc(&dst->__refcnt);
> +   dst_hold(dst);
> return dst;
>  }
>
> @@ -311,21 +311,6 @@ static inline void skb_dst_copy(struct sk_buff *nskb, 
> const struct sk_buff *oskb
> __skb_dst_copy(nskb, oskb->_skb_refdst);
>  }
>
> -/**
> - * skb_dst_force - makes sure skb dst is refcounted
> - * @skb: buffer
> - *
> - * If dst is not yet refcounted, let's do it
> - */
> -static inline void skb_dst_force(struct sk_buff *skb)
> -{
> -   if (skb_dst_is_noref(skb)) {
> -   WARN_ON(!rcu_read_lock_held());
> -   skb->_skb_refdst &= ~SKB_DST_NOREF;
> -   dst_clone(skb_dst(skb));
> -   }
> -}
> -
>  /**
>   * dst_hold_safe - Take a reference on a dst if possible
>   * @dst: pointer to dst entry
> @@ -339,16 +324,17 @@ static inline bool dst_hold_safe(struct dst_entry *dst)
>  }
>
>  /**
> - * skb_dst_force_safe - makes sure skb dst is refcounted
> + * skb_dst_force - makes sure skb dst is refcounted
>   * @skb: buffer
>   *
>   * If dst is not yet refcounted and n

Re: Latest net-next from GIT panic

2017-09-20 Thread Wei Wang

> Thanks very much Pawel for the feedback.
>
> I was looking into the code (specifically IPv4 part) and found that in
> free_fib_info_rcu(), we call free_nh_exceptions() without holding the
> fnhe_lock. I am wondering if that could cause some race condition on
> fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the
> same dst could be happening.
>
> But as we call free_fib_info_rcu() only after the grace period, and
> the lookup code which could potentially modify
> fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems
> fine...
>

Hi Pawel,

Could you try the following debug patch on top of the net-next branch,
reproduce the issue, and check whether any warning messages show up?

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a352..82aff41c6f63 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry
*dst, unsigned long time)
 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
 {
if (dst)
-   atomic_inc(&dst->__refcnt);
+   dst_hold(dst);
return dst;
 }

Thanks.
Wei
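
Why the debug patch helps can be sketched as follows: a plain increment silently revives a dst whose count already hit zero, while a checked hold refuses and reports it. This is a userspace illustration only. `dbg_dst`, `dbg_plain_clone()`, `dbg_checked_hold()` and the `dbg_warnings` counter are hypothetical names, and the message on stderr stands in for a kernel warning.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical stand-in for a dst entry; only the refcount is modelled. */
struct dbg_dst {
	atomic_int refcnt;
};

/* Counts how many times the checked hold saw a zero refcount. */
static int dbg_warnings;

/* Pre-patch shape: unconditional increment, even from zero. */
static void dbg_plain_clone(struct dbg_dst *dst)
{
	atomic_fetch_add(&dst->refcnt, 1);
}

/* Post-patch shape: increment only from a non-zero value, and report
 * the attempt otherwise instead of taking the reference. */
static void dbg_checked_hold(struct dbg_dst *dst)
{
	int old = atomic_load(&dst->refcnt);

	do {
		if (old == 0) {
			dbg_warnings++;
			fprintf(stderr, "warning: hold on dst with zero refcnt\n");
			return;
		}
	} while (!atomic_compare_exchange_weak(&dst->refcnt, &old, old + 1));
}
```

If the warning fires during a reproduction, it points at a caller that cloned a dst after its last reference was dropped.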


On Wed, Sep 20, 2017 at 3:09 PM, Wei Wang <wei...@google.com> wrote:
>>>> bisected again and same result:
>>>> b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit
>>>> commit b838d5e1c5b6e57b10ec8af2268824041e3ea911
>>>> Author: Wei Wang <wei...@google.com>
>>>> Date:   Sat Jun 17 10:42:32 2017 -0700
>>>>
>>>> ipv4: mark DST_NOGC and remove the operation of dst_free()
>>>>
>>>> With the previous preparation patches, we are ready to get rid of the
>>>> dst gc operation in ipv4 code and release dst based on refcnt only.
>>>> So this patch adds DST_NOGC flag for all IPv4 dst and remove the
>>>> calls
>>>> to dst_free().
>>>> At this point, all dst created in ipv4 code do not use the dst gc
>>>> anymore and will be destroyed at the point when refcnt drops to 0.
>>>>
>>>> Signed-off-by: Wei Wang <wei...@google.com>
>>>> Acked-by: Martin KaFai Lau <ka...@fb.com>
>>>> Signed-off-by: David S. Miller <da...@davemloft.net>
>>>>
>>>> :040000 040000 9b7e7fb641de6531fc7887473ca47ef7cb6a11da
>>>> 831a73b71d3df1755f3e24c0d3c86d7a93fd55e2 M  net
>>>>
>>>> Will add now version 2 of patch from Eric and we will see
>>>>
>>>>
>>> after adding patch
>>> perf top catch
>>>PerfTop:   77159 irqs/sec  kernel:99.7%  exact:  0.0% [4000Hz cycles],
>>> (all, 40 CPUs)
>>>
>>> ---
>>>
>>> 60.95%  [kernel][k] dev_put.part.6
>>>  4.00%  [kernel][k] ixgbe_poll
>>>  3.63%  [kernel][k] irq_entries_start
>>>  1.22%  [kernel][k] fib_table_lookup
>>>  1.15%  [kernel][k] do_raw_spin_lock
>>>  1.05%  [kernel][k] ixgbe_xmit_frame_ring
>>>  1.04%  [kernel][k] lookup
>>>  0.87%  [kernel][k] eth_type_trans
>>>
>>>
>>> no panic on console - rebooting to check logs
>>>
>>>
>> Nothing logged
>>
>
> Thanks very much Pawel for the feedback.
>
> I was looking into the code (specifically IPv4 part) and found that in
> free_fib_info_rcu(), we call free_nh_exceptions() without holding the
> fnhe_lock. I am wondering if that could cause some race condition on
> fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the
> same dst could be happening.
>
> But as we call free_fib_info_rcu() only after the grace period, and
> the lookup code which could potentially modify
> fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems
> fine...
>
>
> On Wed, Sep 20, 2017 at 2:25 PM, Paweł Staszewski <pstaszew...@itcare.pl> 
> wrote:
>>
>>
>> W dniu 2017-09-20 o 23:24, Paweł Staszewski pisze:
>>
>>>
>>>
>>> W dniu 2017-09-20 o 23:10, Paweł Staszewski pisze:
>>>>
>>>>
>>>>
>>>> W dniu 2017-09-20 o 21:23, Paweł Staszewski pisze:
>>>>>
>>>>>
>>>>>
>>>>> W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze:
>>>>>>
>>>>>>
>>>>&

Re: Latest net-next from GIT panic

2017-09-20 Thread Wei Wang

>>> bisected again and same result:
>>> b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit
>>> commit b838d5e1c5b6e57b10ec8af2268824041e3ea911
>>> Author: Wei Wang <wei...@google.com>
>>> Date:   Sat Jun 17 10:42:32 2017 -0700
>>>
>>> ipv4: mark DST_NOGC and remove the operation of dst_free()
>>>
>>> With the previous preparation patches, we are ready to get rid of the
>>> dst gc operation in ipv4 code and release dst based on refcnt only.
>>> So this patch adds DST_NOGC flag for all IPv4 dst and remove the
>>> calls
>>> to dst_free().
>>> At this point, all dst created in ipv4 code do not use the dst gc
>>> anymore and will be destroyed at the point when refcnt drops to 0.
>>>
>>> Signed-off-by: Wei Wang <wei...@google.com>
>>> Acked-by: Martin KaFai Lau <ka...@fb.com>
>>> Signed-off-by: David S. Miller <da...@davemloft.net>
>>>
>>> :040000 040000 9b7e7fb641de6531fc7887473ca47ef7cb6a11da
>>> 831a73b71d3df1755f3e24c0d3c86d7a93fd55e2 M  net
>>>
>>> Will add now version 2 of patch from Eric and we will see
>>>
>>>
>> after adding patch
>> perf top catch
>>PerfTop:   77159 irqs/sec  kernel:99.7%  exact:  0.0% [4000Hz cycles],
>> (all, 40 CPUs)
>>
>> ---
>>
>> 60.95%  [kernel][k] dev_put.part.6
>>  4.00%  [kernel][k] ixgbe_poll
>>  3.63%  [kernel][k] irq_entries_start
>>  1.22%  [kernel][k] fib_table_lookup
>>  1.15%  [kernel][k] do_raw_spin_lock
>>  1.05%  [kernel][k] ixgbe_xmit_frame_ring
>>  1.04%  [kernel][k] lookup
>>  0.87%  [kernel][k] eth_type_trans
>>
>>
>> no panic on console - rebooting to check logs
>>
>>
> Nothing logged
>

Thanks very much Pawel for the feedback.

I was looking into the code (specifically IPv4 part) and found that in
free_fib_info_rcu(), we call free_nh_exceptions() without holding the
fnhe_lock. I am wondering if that could cause some race condition on
fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the
same dst could be happening.

But as we call free_fib_info_rcu() only after the grace period, and
the lookup code which could potentially modify
fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems
fine...


On Wed, Sep 20, 2017 at 2:25 PM, Paweł Staszewski <pstaszew...@itcare.pl> wrote:
>
>
> W dniu 2017-09-20 o 23:24, Paweł Staszewski pisze:
>
>>
>>
>> W dniu 2017-09-20 o 23:10, Paweł Staszewski pisze:
>>>
>>>
>>>
>>> W dniu 2017-09-20 o 21:23, Paweł Staszewski pisze:
>>>>
>>>>
>>>>
>>>> W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze:
>>>>>
>>>>>
>>>>>
>>>>> W dniu 2017-09-20 o 20:36, Cong Wang pisze:
>>>>>>
>>>>>> On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet
>>>>>> <eric.duma...@gmail.com> wrote:
>>>>>>>
>>>>>>> On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
>>>>>>>>
>>>>>>>> but dmesg at this time shows nothing about interfaces or flaps.
>>>>>>>>
>>>>>>>> This is very odd.
>>>>>>>>
>>>>>>>> We only free netdevice in free_netdev() and it is only called when
>>>>>>>> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
>>>>>>>> to be NULL.
>>>>>>>
>>>>>>> If there is a missing dev_hold() or one dev_put() in excess,
>>>>>>> this would allow the netdev to be freed too soon.
>>>>>>>
>>>>>>> -> Use after free.
>>>>>>> memory holding netdev could be reallocated-cleared by some other
>>>>>>> kernel
>>>>>>> user.
>>>>>>>
>>>>>> Sure, but only unregister could trigger a free. If there is no
>>>>>> unregister,
>>>>>> like what Pawel claims, then there is no free, the refcnt just goes to
>>>>>> 0 but the memory is sti

Re: Latest net-next from GIT panic

2017-09-20 Thread Wei Wang

>> This is why I suggested to replace the BUG() in another mail
>>
>> So :
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index
>> f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74
>> 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
>>*/
>>   static inline void dev_put(struct net_device *dev)
>>   {
>> -   this_cpu_dec(*dev->pcpu_refcnt);
>> +   int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
>> +
>> +   if (!pref) {
>> +   pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle
>> %d\n",
>> +  dev, dev->name, dev->reg_state, dev->dismantle);
>> +   for (;;)
>> +   cpu_relax();
>> +   }
>> +   this_cpu_dec(*pref);
>>   }
>> /**
>>

Thanks a lot Eric for the debug patch.

Pawel,

I want to confirm with you about the last good commit when you did bisection.
You mentioned:

> And the last one
>
> git bisect good
> Bisecting: 1 revision left to test after this (roughly 1 step)
> [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for
> insertion into fib6 tree
>
> With this have kernel panic same as always
>
> git bisect bad
> Bisecting: 0 revisions left to test after this (roughly 0 steps)
> [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
> remove the operation of dst_free()


So it breaks right at:
[b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
remove the operation of dst_free()
Right?
If you sync the image to one commit before the above one:
[9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly
Does it crash?

And could you confirm that your config does not have any IPv6
addresses or routes configured?

Thanks.
Wei
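
The idea behind the quoted dev_put() debug patch (read the counter pointer once, report if it is missing, and only then decrement) can be sketched in userspace. The names `mock_netdev` and `mock_dev_put()` are hypothetical, a plain int replaces the per-CPU counter, and the sketch returns false where the debug patch spins forever.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a net device. */
struct mock_netdev {
	const char *name;
	int *refcnt;	/* stands in for dev->pcpu_refcnt; NULL once freed */
};

/* Returns true if the reference was dropped, false if the counter
 * pointer was already gone. */
static bool mock_dev_put(struct mock_netdev *dev)
{
	int *pref = dev->refcnt;	/* single read of the pointer */

	if (!pref) {
		fprintf(stderr, "no refcnt on dev %s\n", dev->name);
		return false;
	}
	(*pref)--;
	return true;
}
```

Reporting before the dereference is what turns an opaque crash into a message that names the device involved.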


6:03 +0200, Paweł Staszewski wrote:
>>>
>>> Not much more after adding this patch
>>>
>>> https://bugzilla.kernel.org/attachment.cgi?id=258529
>>>
>> This is why I suggested to replace the BUG() in another mail
>>
>> So :
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index
>> f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74
>> 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
>>*/
>>   static inline void dev_put(struct net_device *dev)
>>   {
>> -   this_cpu_dec(*dev->pcpu_refcnt);
>> +   int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
>> +
>> +   if (!pref) {
>> +   pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle
>> %d\n",
>> +  dev, dev->name, dev->reg_state, dev->dismantle);
>> +   for (;;)
>> +   cpu_relax();
>> +   }
>> +   this_cpu_dec(*pref);
>>   }
>> /**
>>
>>
>>
>
> Full panic
>
> https://bugzilla.kernel.org/attachment.cgi?id=258531
>
>
> I will change the patch and apply it, but later today, because right now I can't
> use the backup router as a test lab: it is Internet rush hour, and if something
> happens it will be bad when the second router has a buggy kernel :)
>
>

Re: [PATCH net] ipv4: Don't override return code from ip_route_input_noref()

2017-08-31 Thread Wei Wang

> After ip_route_input() calls ip_route_input_noref(), another
> check on skb_dst() is done, but if this fails, we shouldn't
> override the return code from ip_route_input_noref(), as it
> could have been more specific (i.e. -EHOSTUNREACH).
>
> This also saves one call to skb_dst_force_safe() and one to
> skb_dst() in case the ip_route_input_noref() check fails.
>
> Reported-by: Sabrina Dubroca <sdubr...@redhat.com>
> Fixes: ad65a2f05695 ("ipv4: call dst_hold_safe() properly")
> Signed-off-by: Stefano Brivio <sbri...@redhat.com>

Acked-by: Wei Wang <wei...@google.com>
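
The effect of moving the skb_dst() check inside the "if (!err)" branch can be sketched with two small functions. This is an illustration only: `route_input_before()` and `route_input_after()` are hypothetical names, `lookup_err` stands in for the return value of ip_route_input_noref(), and `dst_present` stands in for skb_dst(skb) being non-NULL.

```c
#include <errno.h>
#include <stdbool.h>

/* Pre-patch shape: the dst check runs unconditionally, so a specific
 * lookup error is overwritten with -EINVAL when no dst is attached. */
static int route_input_before(int lookup_err, bool dst_present)
{
	int err = lookup_err;

	if (!dst_present)
		err = -EINVAL;
	return err;
}

/* Post-patch shape: the dst check only runs after a successful lookup,
 * so the original error code is returned unchanged. */
static int route_input_after(int lookup_err, bool dst_present)
{
	int err = lookup_err;

	if (!err) {
		if (!dst_present)
			err = -EINVAL;
	}
	return err;
}
```

With a failed lookup such as -EHOSTUNREACH and no dst, the first function returns -EINVAL and the second returns -EHOSTUNREACH.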


On Thu, Aug 31, 2017 at 9:11 AM, Stefano Brivio <sbri...@redhat.com> wrote:
> After ip_route_input() calls ip_route_input_noref(), another
> check on skb_dst() is done, but if this fails, we shouldn't
> override the return code from ip_route_input_noref(), as it
> could have been more specific (i.e. -EHOSTUNREACH).
>
> This also saves one call to skb_dst_force_safe() and one to
> skb_dst() in case the ip_route_input_noref() check fails.
>
> Reported-by: Sabrina Dubroca <sdubr...@redhat.com>
> Fixes: ad65a2f05695 ("ipv4: call dst_hold_safe() properly")
> Signed-off-by: Stefano Brivio <sbri...@redhat.com>
> ---
>  include/net/route.h | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/route.h b/include/net/route.h
> index cb0a76d9dde1..1b09a9368c68 100644
> --- a/include/net/route.h
> +++ b/include/net/route.h
> @@ -189,10 +189,11 @@ static inline int ip_route_input(struct sk_buff *skb, 
> __be32 dst, __be32 src,
>
> rcu_read_lock();
> err = ip_route_input_noref(skb, dst, src, tos, devin);
> -   if (!err)
> +   if (!err) {
> skb_dst_force_safe(skb);
> -   if (!skb_dst(skb))
> -   err = -EINVAL;
> +   if (!skb_dst(skb))
> +   err = -EINVAL;
> +   }
> rcu_read_unlock();
>
> return err;
> --
> 2.9.4
>
