Re: [Patch] blk-wbt: fix a divide-by-zero error in rwb_arm_timer()

2021-04-20 Thread Cong Wang
On Sat, Apr 17, 2021 at 9:41 PM Cong Wang  wrote:
>
> From: Cong Wang 
>
> We hit a divide error in rwb_arm_timer() and crash dump shows
rqd->scale_step is 16777215 (0xffffff in hex), so the expression
"((rqd->scale_step + 1) << 8)" is 0x100000000, which is just beyond
the 32-bit integer range, hence it is truncated to 0 and int_sqrt(0)
returns 0 too, so we end up passing 0 as a divisor to div_u64().
>

Never mind. rqd->scale_step should be capped by
rq_depth_scale_down(), so should never be so large. In the old
calc_wb_limits() implementation, rwb->wb_max was set to zero
accidentally.

Thanks.


[Patch] blk-wbt: fix a divide-by-zero error in rwb_arm_timer()

2021-04-17 Thread Cong Wang
From: Cong Wang 

We hit a divide error in rwb_arm_timer() and crash dump shows
rqd->scale_step is 16777215 (0xffffff in hex), so the expression
"((rqd->scale_step + 1) << 8)" is 0x100000000, which is just beyond
the 32-bit integer range, hence it is truncated to 0 and int_sqrt(0)
returns 0 too, so we end up passing 0 as a divisor to div_u64().

Looking at the assembly code generated:

add    $0x1,%edi
shl    $0x8,%edi
movslq %edi,%rdi
mov    0x10(%rbx),%rdi
xor    %edx,%edx
mov    %eax,%ecx
shl    $0x4,%rdi
mov    %rdi,%rax
div    %rcx

we notice that the left shift is still done in the 32-bit register %edi,
because the type of rqd->scale_step is 'int'. int_sqrt() actually takes
a 'long' parameter, so the intermediate result would fit, at least on
x86_64. Fix this by explicitly casting the expression to u64 and calling
int_sqrt64() to avoid any ambiguity on 32-bit architectures.
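
As a sanity check, here is a small user-space sketch (not part of the
patch) that reproduces the truncation with the value from the crash dump:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		int scale_step = 16777215;	/* 0xffffff, from the crash dump */

		/* 32-bit arithmetic: (0xffffff + 1) << 8 == 2^32, truncated to 0 */
		uint32_t truncated = ((uint32_t)scale_step + 1) << 8;

		/* 64-bit arithmetic, as the patch does: widen before shifting */
		uint64_t widened = (uint64_t)(scale_step + 1) << 8;

		/* prints "32-bit: 0, 64-bit: 0x100000000" */
		printf("32-bit: %u, 64-bit: 0x%llx\n",
		       truncated, (unsigned long long)widened);
		return 0;
	}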

After this patch, the assembly code looks correct:

add    $0x1,%edi
movslq %edi,%rdi
shl    $0x8,%rdi
mov    0x10(%rbx),%rdi
xor    %edx,%edx
mov    %eax,%ecx
shl    $0x4,%rdi
mov    %rdi,%rax
div    %rcx

Fixes: e34cbd307477 ("blk-wbt: add general throttling mechanism")
Cc: Jens Axboe 
Cc: Fam Zheng 
Cc: Xiongchun Duan 
Signed-off-by: Cong Wang 
---
 block/blk-wbt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 42aed0160f86..5157ca86574f 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -337,7 +337,7 @@ static void rwb_arm_timer(struct rq_wb *rwb)
 * though.
 */
rwb->cur_win_nsec = div_u64(rwb->win_nsec << 4,
-   int_sqrt((rqd->scale_step + 1) << 8));
+   int_sqrt64((u64)(rqd->scale_step + 1) << 8));
} else {
/*
 * For step < 0, we don't want to increase/decrease the
-- 
2.25.1



Re: [syzbot] WARNING: refcount bug in sk_psock_get

2021-04-10 Thread Cong Wang
On Fri, Apr 9, 2021 at 12:45 PM John Fastabend  wrote:
>
> syzbot wrote:
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:9c54130c Add linux-next specific files for 20210406
> > git tree:   linux-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=17d8d7aad0
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=d125958c3995ddcd
> > dashboard link: https://syzkaller.appspot.com/bug?extid=b54a1ce86ba4a623b7f0
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1729797ed0
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1190f46ad0
> >
> > The issue was bisected to:
> >
> > commit 997acaf6b4b59c6a9c259740312a69ea549cc684
> > Author: Mark Rutland 
> > Date:   Mon Jan 11 15:37:07 2021 +
> >
> > lockdep: report broken irq restoration
> >
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=11a6cc96d0
> > final oops: https://syzkaller.appspot.com/x/report.txt?x=13a6cc96d0
> > console output: https://syzkaller.appspot.com/x/log.txt?x=15a6cc96d0
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+b54a1ce86ba4a623b...@syzkaller.appspotmail.com
> > Fixes: 997acaf6b4b5 ("lockdep: report broken irq restoration")
> >
> > [ cut here ]
> > refcount_t: saturated; leaking memory.
> > WARNING: CPU: 1 PID: 8414 at lib/refcount.c:19 
> > refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
> > Modules linked in:
> > CPU: 1 PID: 8414 Comm: syz-executor793 Not tainted 
> > 5.12.0-rc6-next-20210406-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 01/01/2011
> > RIP: 0010:refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
> > Code: 1d 69 0c e6 09 31 ff 89 de e8 c8 b4 a6 fd 84 db 75 ab e8 0f ae a6 fd 
> > 48 c7 c7 e0 52 c2 89 c6 05 49 0c e6 09 01 e8 91 0f 00 05 <0f> 0b eb 8f e8 
> > f3 ad a6 fd 0f b6 1d 33 0c e6 09 31 ff 89 de e8 93
> > RSP: 0018:c9eef388 EFLAGS: 00010282
> > RAX:  RBX:  RCX: 
> > RDX: 88801bbdd580 RSI: 815c2e05 RDI: f520001dde63
> > RBP:  R08:  R09: 
> > R10: 815bcc6e R11:  R12: 1920001dde74
> > R13: 90200301 R14: 888026e0 R15: c9eef3c0
> > FS:  01422300() GS:8880b9d0() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 2000 CR3: 12b3b000 CR4: 001506e0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> >  __refcount_add_not_zero include/linux/refcount.h:163 [inline]
> >  __refcount_inc_not_zero include/linux/refcount.h:227 [inline]
> >  refcount_inc_not_zero include/linux/refcount.h:245 [inline]
> >  sk_psock_get+0x3b0/0x400 include/linux/skmsg.h:435
> >  bpf_exec_tx_verdict+0x11e/0x11a0 net/tls/tls_sw.c:799
> >  tls_sw_sendmsg+0xa41/0x1800 net/tls/tls_sw.c:1013
> >  inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:821
>
> [...]
>
> This is likely a problem with latest round of sockmap patches I'll
> take a look.

I bet this has nothing to do with my sockmap patches, as clearly
the reproducer does not even have any BPF thing. And actually
the reproducer creates an SMC socket, which coincidentally uses
sk_user_data too, therefore triggers this bug.

I think we should just prohibit TCP_ULP on SMC socket, something
like this:

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 47340b3b514f..0d4d6d28f20c 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -2162,6 +2162,9 @@ static int smc_setsockopt(struct socket *sock, int level, int optname,
 	struct smc_sock *smc;
 	int val, rc;
 
+	if (optname == TCP_ULP)
+		return -EOPNOTSUPP;
+
 	smc = smc_sk(sk);
 
 	/* generic setsockopts reaching us here always apply to the
@@ -2186,7 +2189,6 @@ static int smc_setsockopt(struct socket *sock, int level, int optname,
 	if (rc || smc->use_fallback)
 		goto out;
 	switch (optname) {
-	case TCP_ULP:
 	case TCP_FASTOPEN:
 	case TCP_FASTOPEN_CONNECT:
 	case TCP_FASTOPEN_KEY:


Re: [External] linux-next: manual merge of the net-next tree with the bpf tree

2021-04-07 Thread Cong Wang .
On Wed, Apr 7, 2021 at 8:11 PM Stephen Rothwell  wrote:
>
> Hi all,
>
> Today's linux-next merge of the net-next tree got a conflict in:
>
>   net/core/skmsg.c
>
> between commit:
>
>   144748eb0c44 ("bpf, sockmap: Fix incorrect fwd_alloc accounting")
>
> from the bpf tree and commit:
>
>   e3526bb92a20 ("skmsg: Move sk_redir from TCP_SKB_CB to skb")
>
> from the net-next tree.
>
> I fixed it up (I think - see below) and can carry the fix as
> necessary. This is now fixed as far as linux-next is concerned, but any
> non trivial conflicts should be mentioned to your upstream maintainer
> when your tree is submitted for merging.  You may also want to consider
> cooperating with the maintainer of the conflicting tree to minimise any
> particularly complex conflicts.

Looks good from my quick glance.

Thanks!


Re: [External] linux-next: manual merge of the net-next tree with the bpf tree

2021-04-07 Thread Cong Wang .
On Wed, Apr 7, 2021 at 8:02 PM Stephen Rothwell  wrote:
>
> Hi all,
>
> Today's linux-next merge of the net-next tree got a conflict in:
>
>   include/linux/skmsg.h
>
> between commit:
>
>   1c84b33101c8 ("bpf, sockmap: Fix sk->prot unhash op reset")
>
> from the bpf tree and commit:
>
>   8a59f9d1e3d4 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
>
> from the net-next tree.
>
> I didn't know how to fixed it up so I just used the latter version or
> today - a better solution would be appreciated. This is now fixed as
> far as linux-next is concerned, but any non trivial conflicts should be
> mentioned to your upstream maintainer when your tree is submitted for
> merging.  You may also want to consider cooperating with the maintainer
> of the conflicting tree to minimise any particularly complex conflicts.

The right way to resolve this is to move the lines added in commit
1c84b33101c8 to the similar place in tcp_bpf_update_proto().

Thanks.


Re: [syzbot] KASAN: use-after-free Write in sk_psock_stop

2021-04-06 Thread Cong Wang
On Tue, Apr 6, 2021 at 6:01 AM syzbot
 wrote:
> ==
> BUG: KASAN: use-after-free in __lock_acquire+0x3e6f/0x54c0 
> kernel/locking/lockdep.c:4770
> Read of size 8 at addr 888024f66238 by task syz-executor.1/14202
>
> CPU: 0 PID: 14202 Comm: syz-executor.1 Not tainted 5.12.0-rc4-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x141/0x1d7 lib/dump_stack.c:120
>  print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:232
>  __kasan_report mm/kasan/report.c:399 [inline]
>  kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:416
>  __lock_acquire+0x3e6f/0x54c0 kernel/locking/lockdep.c:4770
>  lock_acquire kernel/locking/lockdep.c:5510 [inline]
>  lock_acquire+0x1ab/0x740 kernel/locking/lockdep.c:5475
>  __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
>  _raw_spin_lock_bh+0x2f/0x40 kernel/locking/spinlock.c:175
>  spin_lock_bh include/linux/spinlock.h:359 [inline]
>  sk_psock_stop+0x2f/0x4d0 net/core/skmsg.c:750
>  sock_map_close+0x172/0x390 net/core/sock_map.c:1534
>  inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
>  __sock_release+0xcd/0x280 net/socket.c:599
>  sock_close+0x18/0x20 net/socket.c:1258
>  __fput+0x288/0x920 fs/file_table.c:280
>  task_work_run+0xdd/0x1a0 kernel/task_work.c:140
>  tracehook_notify_resume include/linux/tracehook.h:189 [inline]
>  exit_to_user_mode_loop kernel/entry/common.c:174 [inline]
>  exit_to_user_mode_prepare+0x249/0x250 kernel/entry/common.c:208
>  __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
>  syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:301
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x466459
> Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 
> 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7f1bde3a3188 EFLAGS: 0246 ORIG_RAX: 0003
> RAX:  RBX: 0056bf60 RCX: 00466459
> RDX:  RSI:  RDI: 0005
> RBP: 004bf9fb R08:  R09: 
> R10:  R11: 0246 R12: 0056bf60
> R13: 7ffe6eb13bbf R14: 7f1bde3a3300 R15: 00022000
[...]
> Second to last potentially related work creation:
>  kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
>  kasan_record_aux_stack+0xe5/0x110 mm/kasan/generic.c:345
>  __call_rcu kernel/rcu/tree.c:3039 [inline]
>  call_rcu+0xb1/0x740 kernel/rcu/tree.c:3114
>  queue_rcu_work+0x82/0xa0 kernel/workqueue.c:1753
>  sk_psock_put include/linux/skmsg.h:446 [inline]
>  sock_map_unref+0x109/0x190 net/core/sock_map.c:182
>  sock_hash_delete_from_link net/core/sock_map.c:918 [inline]
>  sock_map_unlink net/core/sock_map.c:1480 [inline]
>  sock_map_remove_links+0x389/0x530 net/core/sock_map.c:1492
>  sock_map_close+0x12f/0x390 net/core/sock_map.c:1532

Looks like the last refcnt can be gone before sk_psock_stop().

Technically, we can call sk_psock_stop() before
sock_map_remove_links(), the only barrier is the RCU read lock
there. Let me see if we can get rid of that RCU read lock.

Thanks.


Re: [syzbot] WARNING: suspicious RCU usage in tcp_bpf_update_proto

2021-04-06 Thread Cong Wang
On Tue, Apr 6, 2021 at 2:44 AM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:514e1150 net: x25: Queue received packets in the drivers i..
> git tree:   net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=112a8831d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=7eff0f22b8563a5f
> dashboard link: https://syzkaller.appspot.com/bug?extid=320a3bc8d80f478c37e4
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1532d711d0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15f44c5ed0
>
> The issue was bisected to:
>
> commit 4dfe6bd94959222e18d512bdf15f6bf9edb9c27c
> Author: Rustam Kovhaev 
> Date:   Wed Feb 24 20:00:30 2021 +
>
> ntfs: check for valid standard information attribute

This is caused by one of my sockmap patches.

>
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=16207a81d0
> final oops: https://syzkaller.appspot.com/x/report.txt?x=15207a81d0
> console output: https://syzkaller.appspot.com/x/log.txt?x=11207a81d0
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+320a3bc8d80f478c3...@syzkaller.appspotmail.com
> Fixes: 4dfe6bd94959 ("ntfs: check for valid standard information attribute")
>
> =
> WARNING: suspicious RCU usage
> 5.12.0-rc4-syzkaller #0 Not tainted
> -
> include/linux/skmsg.h:286 suspicious rcu_dereference_check() usage!
>
> other info that might help us debug this:
>
>
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by syz-executor383/8454:
>  #0: 888013a99b48 (clock-AF_INET){++..}-{2:2}, at: 
> sk_psock_drop+0x2c/0x460 net/core/skmsg.c:788
>
> stack backtrace:
> CPU: 1 PID: 8454 Comm: syz-executor383 Not tainted 5.12.0-rc4-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x141/0x1d7 lib/dump_stack.c:120
>  sk_psock include/linux/skmsg.h:286 [inline]
>  tcp_bpf_update_proto+0x530/0x5f0 net/ipv4/tcp_bpf.c:504
>  sk_psock_restore_proto include/linux/skmsg.h:408 [inline]
>  sk_psock_drop+0xdf/0x460 net/core/skmsg.c:789
>  sk_psock_put include/linux/skmsg.h:446 [inline]
>  tcp_bpf_recvmsg+0x42d/0x480 net/ipv4/tcp_bpf.c:208
>  inet_recvmsg+0x11b/0x5d0 net/ipv4/af_inet.c:852

Oddly, I have all the relevant Kconfig options enabled but never saw this
warning when running the selftests for hours.

Anyway, let me see how this should be fixed.

Thanks!


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2021-04-05 Thread Cong Wang
On Sat, Apr 3, 2021 at 5:23 AM Jiri Kosina  wrote:
>
> I am still planning to have Yunsheng Lin's (CCing) fix [1] tested in the
> coming days. If it works, then we can consider proceeding with it,
> otherwise I am all for reverting the whole NOLOCK stuff.
>
> [1] 
> https://lore.kernel.org/linux-can/1616641991-14847-1-git-send-email-linyunsh...@huawei.com/T/#u

I personally prefer to just revert that bit, as it brings more troubles
than gains. Even with Yunsheng's patch, there are still some issues.
Essentially, I think the core qdisc scheduling code is not ready for
lockless, just look at those NOLOCK checks in sch_generic.c. :-/

Thanks.


Re: [PATCH v2 1/1] net: sched: bump refcount for new action in ACT replace mode

2021-03-30 Thread Cong Wang
On Mon, Mar 29, 2021 at 3:55 PM Kumar Kartikeya Dwivedi
 wrote:
> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> index b919826939e0..43cceb924976 100644
> --- a/net/sched/act_api.c
> +++ b/net/sched/act_api.c
> @@ -1042,6 +1042,9 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
> if (err != ACT_P_CREATED)
> module_put(a_o->owner);
>
> +   if (!bind && ovr && err == ACT_P_CREATED)
> +           refcount_set(&a->tcfa_refcnt, 2);
> +

Hmm, if we set the refcnt to 2 here, how could tcf_action_destroy()
destroy them when we rollback from a failure in the middle of the loop
in tcf_action_init()?

Thanks.


Re: [PATCH net v2] net: sched: fix packet stuck problem for lockless qdisc

2021-03-24 Thread Cong Wang
On Tue, Mar 23, 2021 at 7:24 PM Yunsheng Lin  wrote:
> @@ -176,8 +207,23 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>  {
> write_seqcount_end(&qdisc->running);
> -   if (qdisc->flags & TCQ_F_NOLOCK)
> +   if (qdisc->flags & TCQ_F_NOLOCK) {
> spin_unlock(&qdisc->seqlock);
> +
> +   /* qdisc_run_end() is protected by RCU lock, and
> +* qdisc reset will do a synchronize_net() after
> +* setting __QDISC_STATE_DEACTIVATED, so testing
> +* the below two bits separately should be fine.

Hmm, why is synchronize_net() after setting this bit fine? It could
still be flipped right after you test the RESCHEDULE bit.


> +* For qdisc_run() in net_tx_action() case, we
> +* really should provide rcu protection explicitly
> +* for document purposes or PREEMPT_RCU.
> +*/
> +   if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
> +                         &qdisc->state) &&
> +                !test_bit(__QDISC_STATE_DEACTIVATED,
> +                          &qdisc->state)))

Why do you want to test __QDISC_STATE_DEACTIVATED bit at all?
dev_deactivate_many() will wait for those scheduled but being
deactivated, so what's the problem of scheduling it even with this bit?

Thanks.


Re: [Linuxarm] Re: [RFC v2] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

2021-03-23 Thread Cong Wang
On Sun, Mar 21, 2021 at 5:55 PM Yunsheng Lin  wrote:
>
> On 2021/3/20 2:15, Cong Wang wrote:
> > On Thu, Mar 18, 2021 at 12:33 AM Yunsheng Lin  
> > wrote:
> >>
> >> On 2021/3/17 21:45, Jason A. Donenfeld wrote:
> >>> On 3/17/21, Toke Høiland-Jørgensen  wrote:
> >>>> Cong Wang  writes:
> >>>>
> >>>>> On Mon, Mar 15, 2021 at 2:07 PM Jakub Kicinski  wrote:
> >>>>>>
> >>>>>> I thought pfifo was supposed to be "lockless" and this change
> >>>>>> re-introduces a lock between producer and consumer, no?
> >>>>>
> >>>>> It has never been truly lockless, it uses two spinlocks in the ring
> >>>>> buffer
> >>>>> implementation, and it introduced a q->seqlock recently, with this patch
> >>>>> now we have priv->lock, 4 locks in total. So our "lockless" qdisc ends
> >>>>> up having more locks than others. ;) I don't think we are going to a
> >>>>> right direction...
> >>>>
> >>>> Just a thought, have you guys considered adopting the lockless MSPC ring
> >>>> buffer recently introduced into Wireguard in commit:
> >>>>
> >>>> 8b5553ace83c ("wireguard: queueing: get rid of per-peer ring buffers")
> >>>>
> >>>> Jason indicated he was willing to work on generalising it into a
> >>>> reusable library if there was a use case for it. I haven't quite though
> >>>> through the details of whether this would be such a use case, but
> >>>> figured I'd at least mention it :)
> >>>
> >>> That offer definitely still stands. Generalization sounds like a lot of 
> >>> fun.
> >>>
> >>> Keep in mind though that it's an eventually consistent queue, not an
> >>> immediately consistent one, so that might not match all use cases. It
> >>> works with wg because we always trigger the reader thread anew when it
> >>> finishes, but that doesn't apply to everyone's queueing setup.
> >>
> >> Thanks for mentioning this.
> >>
> >> "multi-producer, single-consumer" seems to match the lockless qdisc's
> >> paradigm too, for now concurrent enqueuing/dequeuing to the pfifo_fast's
> >> queues() is not allowed, it is protected by producer_lock or consumer_lock.
> >>
> >> So it would be good to has lockless concurrent enqueuing, while dequeuing
> >> can be protected by qdisc_lock() or q->seqlock, which meets the 
> >> "multi-producer,
> >> single-consumer" paradigm.
> >
> > I don't think so. Usually we have one queue for each CPU so we can expect
> > each CPU has a lockless qdisc assigned, but we can not assume this in
> > the code, so we still have to deal with multiple CPU's sharing a lockless 
> > qdisc,
> > and we usually enqueue and dequeue in process context, so it means we could
> > have multiple producers and multiple consumers.
>
> For lockless qdisc, dequeuing is always within the qdisc_run_begin() and
> qdisc_run_end(), so multiple consumers is protected with each other by
> q->seqlock .

So are you saying you will never go lockless for lockless qdisc? I thought
you really want to go lockless with Jason's proposal of MPMC ring buffer
code.

>
> For enqueuing, multiple consumers is protected by producer_lock, see
> pfifo_fast_enqueue() -> skb_array_produce() -> ptr_ring_produce().

I think you seriously misunderstand how we classify MPMC or MPSC,
it is not about how we lock them, it is about whether we truly have
a single or multiple consumers regardless of locks used, because the
goal is to go lockless.

> I am not sure if lockless MSPC can work with the process context, but
> even if not, the enqueuing is also protected by rcu_read_lock_bh(),
> which provides some kind of atomicity, so that producer_lock can be
> reomved when lockless MSPC is used.


Not sure if I can even understand what you are saying here. Jason's
code only disables preemption with a busy wait; I can't see why it
cannot be used in process context.

Thanks.


Re: net/dev: fix information leak to userspace

2021-03-21 Thread Cong Wang
On Sun, Mar 21, 2021 at 9:34 AM Pavel Machek  wrote:
>
> dev_get_mac_address() does not always initialize whole
> structure. Unfortunately, other code copies such structure to
> userspace, leaking information. Fix it.

Well, most callers already initialize it with a memset() or copy_from_user(),
for example, __tun_chr_ioctl():

	if (cmd == TUNSETIFF || cmd == TUNSETQUEUE ||
	    (_IOC_TYPE(cmd) == SOCK_IOC_TYPE && cmd != SIOCGSKNS)) {
		if (copy_from_user(&ifr, argp, ifreq_len))
			return -EFAULT;
	} else {
		memset(&ifr, 0, sizeof(ifr));
	}

Except tap_ioctl(), but we can just initialize 'sa' there instead of doing
it in dev_get_mac_address().
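
A minimal sketch of that alternative (hedged; the exact code in
tap_ioctl()'s SIOCGIFHWADDR branch may differ from this):

	struct sockaddr sa;

	memset(&sa, 0, sizeof(sa));	/* zero padding/unused bytes first */
	/* ... then fill it via dev_get_mac_address(&sa, ...) and
	 * copy_to_user() as tap_ioctl() already does ...
	 */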

Thanks.


Re: [Linuxarm] [PATCH net] net: sched: fix packet stuck problem for lockless qdisc

2021-03-19 Thread Cong Wang
On Fri, Mar 19, 2021 at 2:25 AM Yunsheng Lin  wrote:
> I had done some performance test to see if there is value to
> fix the packet stuck problem and support lockless qdisc bypass,
> here is some result using pktgen in 'queue_xmit' mode on a dummy
> device as Paolo Abeni had done in [1], and using pfifo_fast qdisc:
>
> threads   vanilla    locked-qdisc   vanilla+this_patch
>  1        2.6Mpps    2.9Mpps        2.5Mpps
>  2        3.9Mpps    4.8Mpps        3.6Mpps
>  4        5.6Mpps    3.0Mpps        4.7Mpps
>  8        2.7Mpps    1.6Mpps        2.8Mpps
> 16        2.2Mpps    1.3Mpps        2.3Mpps
>
> locked-qdisc: test by removing the "TCQ_F_NOLOCK | TCQ_F_CPUSTATS".

I read this as: this patch somehow introduces a performance
regression for -net, since the lockless bypass patch you submitted is
for -net-next.

Thanks.


Re: [PATCH net] net: sched: fix packet stuck problem for lockless qdisc

2021-03-19 Thread Cong Wang
On Wed, Mar 17, 2021 at 11:52 PM Yunsheng Lin  wrote:
>
> Lockless qdisc has below concurrent problem:
> cpu0  cpu1
>   . .
>  q->enqueue .
>   . .
>qdisc_run_begin().
>   . .
>  dequeue_skb()  .
>   . .
>sch_direct_xmit().
>   . .
>   .q->enqueue
>   . qdisc_run_begin()
>   .return and do nothing
>   . .
> qdisc_run_end() .
>
> cpu1 enqueue a skb without calling __qdisc_run() because cpu0
> has not released the lock yet and spin_trylock() return false
> for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb
> enqueued by cpu1 when calling dequeue_skb() because cpu1 may
> enqueue the skb after cpu0 calling dequeue_skb() and before
> cpu0 calling qdisc_run_end().
>
> Lockless qdisc has another concurrent problem when tx_action
> is involved:
>
> cpu0(serving tx_action)     cpu1                     cpu2
>            .                  .                        .
>            .              q->enqueue                   .
>            .              qdisc_run_begin()            .
>            .              dequeue_skb()                .
>            .                  .                    q->enqueue
>            .                  .                        .
>            .              sch_direct_xmit()            .
>            .                  .                    qdisc_run_begin()
>            .                  .                    return and do nothing
>            .                  .                        .
> clear __QDISC_STATE_SCHED     .                        .
> qdisc_run_begin()             .                        .
> return and do nothing         .                        .
>            .                  .                        .
>            .              qdisc_run_begin()            .
>
> This patch fixes the above data race by:
> 1. Set a flag after spin_trylock() return false.
> 2. Retry a spin_trylock() in case other CPU may not see the
>new flag after it releases the lock.
> 3. reschedule if the flag is set after the lock is released
>at the end of qdisc_run_end().
>
> For the tx_action case, the flag is also set when cpu1 is at the
> end of qdisc_run_begin(), so tx_action will be rescheduled
> again to dequeue the skb enqueued by cpu2.
>
> Also clear the flag before dequeuing in order to reduce the
> overhead of the above process, and avoid doing the heavy
> test_and_clear_bit() at the end of qdisc_run_end().
>
> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> Signed-off-by: Yunsheng Lin 
> ---
> For those who has not been following the qdsic scheduling
> discussion, there is packet stuck problem for lockless qdisc,
> see [1], and I has done some cleanup and added some enhanced
> features too, see [2] [3].
> While I was doing the optimization for lockless qdisc, it
> accurred to me that these optimization is useless if there is
> still basic bug in lockless qdisc, even the bug is not easily
> reproducible. So look through [1] again, I found that the data
> race for tx action mentioned by Michael, and thought deep about
> it and came up with this patch trying to fix it.
>
> So I am really appreciated some who still has the reproducer
> can try this patch and report back.
>
> 1. 
> https://lore.kernel.org/netdev/d102074f-7489-e35a-98cf-e2cad7efd...@netrounds.com/t/#ma7013a79b8c4d8e7c49015c724e481e6d5325b32
> 2. 
> https://patchwork.kernel.org/project/netdevbpf/patch/1615777818-13969-1-git-send-email-linyunsh...@huawei.com/
> 3. 
> https://patchwork.kernel.org/project/netdevbpf/patch/1615800610-34700-1-git-send-email-linyunsh...@huawei.com/
> ---
>  include/net/sch_generic.h | 23 ---
>  net/sched/sch_generic.c   |  1 +
>  2 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index f7a6e14..4220eab 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -36,6 +36,7 @@ struct qdisc_rate_table {
>  enum qdisc_state_t {
> __QDISC_STATE_SCHED,
> __QDISC_STATE_DEACTIVATED,
> +   __QDISC_STATE_NEED_RESCHEDULE,
>  };
>
>  struct qdisc_size_table {
> @@ -159,8 +160,17 @@ static inline bool qdisc_is_empty(const struct Qdisc 
> *qdisc)
>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  {
> if (qdisc->flags & TCQ_F_NOLOCK) {
> -   if (!spin_trylock(&qdisc->seqlock))
> -           return false;
> +   if (!spin_trylock(&qdisc->seqlock)) {
> +           set_bit(__QDISC_STATE_NEED_RESCHEDULE,
> +                   &qdisc->state);

Why do we need another bit? I mean why not just call __netif_schedule()?

> +
> +   /* Retry again in case other CPU may not see the
> +* new flags after it releases the lock at the
> +* end 

Re: [Linuxarm] Re: [RFC v2] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

2021-03-19 Thread Cong Wang
On Thu, Mar 18, 2021 at 12:33 AM Yunsheng Lin  wrote:
>
> On 2021/3/17 21:45, Jason A. Donenfeld wrote:
> > On 3/17/21, Toke Høiland-Jørgensen  wrote:
> >> Cong Wang  writes:
> >>
> >>> On Mon, Mar 15, 2021 at 2:07 PM Jakub Kicinski  wrote:
> >>>>
> >>>> I thought pfifo was supposed to be "lockless" and this change
> >>>> re-introduces a lock between producer and consumer, no?
> >>>
> >>> It has never been truly lockless, it uses two spinlocks in the ring
> >>> buffer
> >>> implementation, and it introduced a q->seqlock recently, with this patch
> >>> now we have priv->lock, 4 locks in total. So our "lockless" qdisc ends
> >>> up having more locks than others. ;) I don't think we are going to a
> >>> right direction...
> >>
> >> Just a thought, have you guys considered adopting the lockless MSPC ring
> >> buffer recently introduced into Wireguard in commit:
> >>
> >> 8b5553ace83c ("wireguard: queueing: get rid of per-peer ring buffers")
> >>
> >> Jason indicated he was willing to work on generalising it into a
> >> reusable library if there was a use case for it. I haven't quite though
> >> through the details of whether this would be such a use case, but
> >> figured I'd at least mention it :)
> >
> > That offer definitely still stands. Generalization sounds like a lot of fun.
> >
> > Keep in mind though that it's an eventually consistent queue, not an
> > immediately consistent one, so that might not match all use cases. It
> > works with wg because we always trigger the reader thread anew when it
> > finishes, but that doesn't apply to everyone's queueing setup.
>
> Thanks for mentioning this.
>
> "multi-producer, single-consumer" seems to match the lockless qdisc's
> paradigm too, for now concurrent enqueuing/dequeuing to the pfifo_fast's
> queues() is not allowed, it is protected by producer_lock or consumer_lock.
>
> So it would be good to has lockless concurrent enqueuing, while dequeuing
> can be protected by qdisc_lock() or q->seqlock, which meets the 
> "multi-producer,
> single-consumer" paradigm.

I don't think so. Usually we have one queue for each CPU so we can expect
each CPU has a lockless qdisc assigned, but we can not assume this in
the code, so we still have to deal with multiple CPU's sharing a lockless qdisc,
and we usually enqueue and dequeue in process context, so it means we could
have multiple producers and multiple consumers.

On the other hand, I don't think the problems we have been fixing are the ring
buffer implementation itself, they are about the high-level qdisc
state transitions.

Thanks.


Re: [RFC v2] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

2021-03-16 Thread Cong Wang
On Mon, Mar 15, 2021 at 2:07 PM Jakub Kicinski  wrote:
>
> I thought pfifo was supposed to be "lockless" and this change
> re-introduces a lock between producer and consumer, no?

It has never been truly lockless, it uses two spinlocks in the ring buffer
implementation, and it introduced a q->seqlock recently, with this patch
now we have priv->lock, 4 locks in total. So our "lockless" qdisc ends
up having more locks than others. ;) I don't think we are going to a
right direction...

Thanks.


Re: [PATCH net-next] net: sched: remove unnecessay lock protection for skb_bad_txq/gso_skb

2021-03-16 Thread Cong Wang
On Mon, Mar 15, 2021 at 4:42 PM David Miller  wrote:
>
> From: Yunsheng Lin 
> Date: Mon, 15 Mar 2021 17:30:10 +0800
>
> > Currently qdisc_lock(q) is taken before enqueuing and dequeuing
> > for lockless qdisc's skb_bad_txq/gso_skb queue, qdisc->seqlock is
> > also taken, which can provide the same protection as qdisc_lock(q).
> >
> > This patch removes the unnecessay qdisc_lock(q) protection for
> > lockless qdisc' skb_bad_txq/gso_skb queue.
> >
> > And dev_reset_queue() takes the qdisc->seqlock for lockless qdisc
> > besides taking the qdisc_lock(q) when doing the qdisc reset,
> > some_qdisc_is_busy() takes both qdisc->seqlock and qdisc_lock(q)
> > when checking qdisc status. It is unnecessary to take both lock
> > while the fast path only take one lock, so this patch also changes
> > it to only take qdisc_lock(q) for locked qdisc, and only take
> > qdisc->seqlock for lockless qdisc.
> >
> > Since qdisc->seqlock is taken for lockless qdisc when calling
> > qdisc_is_running() in some_qdisc_is_busy(), use qdisc->running
> > to decide if the lockless qdisc is running.
> >
> > Signed-off-by: Yunsheng Lin 
>
> What about other things protected by this lock, such as statistics and qlen?
>
> This change looks too risky to me.

They are per-cpu for pfifo_fast which sets TCQ_F_CPUSTATS too.

Thanks.


Re: [PATCH net-next] net: sched: remove unnecessay lock protection for skb_bad_txq/gso_skb

2021-03-16 Thread Cong Wang
On Mon, Mar 15, 2021 at 2:29 AM Yunsheng Lin  wrote:
>
> Currently qdisc_lock(q) is taken before enqueuing and dequeuing
> for lockless qdisc's skb_bad_txq/gso_skb queue, qdisc->seqlock is
> also taken, which can provide the same protection as qdisc_lock(q).
>
> This patch removes the unnecessay qdisc_lock(q) protection for
> lockless qdisc' skb_bad_txq/gso_skb queue.
>
> And dev_reset_queue() takes the qdisc->seqlock for lockless qdisc
> besides taking the qdisc_lock(q) when doing the qdisc reset,
> some_qdisc_is_busy() takes both qdisc->seqlock and qdisc_lock(q)
> when checking qdisc status. It is unnecessary to take both lock
> while the fast path only take one lock, so this patch also changes
> it to only take qdisc_lock(q) for locked qdisc, and only take
> qdisc->seqlock for lockless qdisc.
>
> Since qdisc->seqlock is taken for lockless qdisc when calling
> qdisc_is_running() in some_qdisc_is_busy(), use qdisc->running
> to decide if the lockless qdisc is running.

What's the benefit here? Since qdisc->q.lock is also per-qdisc,
there is no actual contention in taking it when we already hold
q->seqlock, right?

Also, is ->seqlock supposed to be used for protecting skb_bad_txq
etc.? From my understanding, it was introduced merely for replacing
__QDISC_STATE_RUNNING. If you want to extend it, you probably
have to rename it too.

Thanks.


Re: [PATCH] net: sched: avoid duplicates in classes dump

2021-03-04 Thread Cong Wang
On Thu, Mar 4, 2021 at 6:44 AM Maximilian Heyne  wrote:
>
> This is a follow up of commit ea3274695353 ("net: sched: avoid
> duplicates in qdisc dump") which has fixed the issue only for the qdisc
> dump.
>
> The duplicate printing also occurs when dumping the classes via
>   tc class show dev eth0
>
> Fixes: 59cc1f61f09c ("net: sched: convert qdisc linked list to hashtable")
> Signed-off-by: Maximilian Heyne 

Seems reasonable. If you can show the difference of tc class dump
before and after this patch in your changelog, it would be helpful to
understand the bug here.

Thanks.


Re: [PATCH] net: bridge: Fix jump_label config

2021-02-26 Thread Cong Wang
On Thu, Feb 25, 2021 at 5:39 PM Kefeng Wang  wrote:
>
>
> On 2021/2/26 5:22, Cong Wang wrote:
> > On Wed, Feb 24, 2021 at 8:03 AM Kefeng Wang  
> > wrote:
> >> HAVE_JUMP_LABLE is removed by commit e9666d10a567 ("jump_label: move
> >> 'asm goto' support test to Kconfig"), use CONFIG_JUMP_LABLE instead
> >> of HAVE_JUMP_LABLE.
> >>
> >> Fixes: 971502d77faa ("bridge: netfilter: unroll NF_HOOK helper in bridge 
> >> input path")
> >> Signed-off-by: Kefeng Wang 
> > Hmm, why do we have to use a macro here? static_key_false() is defined
> > in both cases, CONFIG_JUMP_LABEL=y or CONFIG_JUMP_LABEL=n.
>
> It seems that all nf_hooks_needed related are using the macro,
>
> see net/netfilter/core.c and include/linux/netfilter.h,
>
>#ifdef CONFIG_JUMP_LABEL
>struct static_key nf_hooks_needed[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
> EXPORT_SYMBOL(nf_hooks_needed);
> #endif
>
>nf_static_key_inc()/nf_static_key_dec()

Same question: why? Clearly struct static_key is defined in both cases:

#else
struct static_key {
atomic_t enabled;
};
#endif  /* CONFIG_JUMP_LABEL */

Thanks.


Re: [PATCH/v3] bpf: add bpf_skb_adjust_room flag BPF_F_ADJ_ROOM_ENCAP_L2_ETH

2021-02-26 Thread Cong Wang
On Thu, Feb 25, 2021 at 7:59 PM Xuesen Huang  wrote:
> v3:
> - Fix the code format.
>
> v2:
> Suggested-by: Willem de Bruijn 
> - Add a new flag to specify the type of the inner packet.

These need to be moved after '---', otherwise it would be merged
into the final git log.
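
For example, the usual layout is (illustrative only):

	<subject line>

	<commit message that git will keep>

	Suggested-by: ...
	Signed-off-by: ...
	---
	v3: fix the code format
	v2: add a new flag to specify the type of the inner packet

	 include/uapi/linux/bpf.h | 5 +++++
	 ...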

>
> Suggested-by: Willem de Bruijn 
> Signed-off-by: Xuesen Huang 
> Signed-off-by: Zhiyong Cheng 
> Signed-off-by: Li Wang 
> ---
>  include/uapi/linux/bpf.h   |  5 +
>  net/core/filter.c  | 11 ++-
>  tools/include/uapi/linux/bpf.h |  5 +
>  3 files changed, 20 insertions(+), 1 deletion(-)

As a good practice, please add a test case for this in
tools/testing/selftests/bpf/progs/test_tc_tunnel.c.

Thanks.


Re: [PATCH] net: bridge: Fix jump_label config

2021-02-25 Thread Cong Wang
On Wed, Feb 24, 2021 at 8:03 AM Kefeng Wang  wrote:
>
> HAVE_JUMP_LABLE is removed by commit e9666d10a567 ("jump_label: move
> 'asm goto' support test to Kconfig"), use CONFIG_JUMP_LABLE instead
> of HAVE_JUMP_LABLE.
>
> Fixes: 971502d77faa ("bridge: netfilter: unroll NF_HOOK helper in bridge 
> input path")
> Signed-off-by: Kefeng Wang 

Hmm, why do we have to use a macro here? static_key_false() is defined
in both cases, CONFIG_JUMP_LABEL=y or CONFIG_JUMP_LABEL=n.
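
For reference, the !CONFIG_JUMP_LABEL fallback is roughly (paraphrased
from include/linux/jump_label.h, from memory):

	static __always_inline bool static_key_false(struct static_key *key)
	{
		if (unlikely(static_key_count(key) > 0))
			return true;
		return false;
	}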

Thanks.


Re: general protection fault in xfrm_user_rcv_msg_compat

2021-02-23 Thread Cong Wang
On Tue, Feb 23, 2021 at 6:55 AM syzbot
 wrote:
>
> syzbot has found a reproducer for the following issue on:
>
> HEAD commit:a99163e9 Merge tag 'devicetree-for-5.12' of git://git.kern..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=11a6fccad0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=7a875029a795d230
> dashboard link: https://syzkaller.appspot.com/bug?extid=5078fc2d7cf37d71de1c
> userspace arch: i386
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=167c1832d0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10214f12d0
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+5078fc2d7cf37d71d...@syzkaller.appspotmail.com
>
> general protection fault, probably for non-canonical address 
> 0xe51af2c1f2c7bd20:  [#1] PREEMPT SMP KASAN
> KASAN: maybe wild-memory-access in range 
> [0x28d7b60f963de900-0x28d7b60f963de907]
> CPU: 1 PID: 8357 Comm: syz-executor113 Not tainted 5.11.0-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:nla_type include/net/netlink.h:1130 [inline]
> RIP: 0010:xfrm_xlate32_attr net/xfrm/xfrm_compat.c:404 [inline]
> RIP: 0010:xfrm_xlate32 net/xfrm/xfrm_compat.c:526 [inline]
> RIP: 0010:xfrm_user_rcv_msg_compat+0x5e5/0x1070 net/xfrm/xfrm_compat.c:571

Looks like we have to initialize the pointer array to NULL's.

diff --git a/net/xfrm/xfrm_compat.c b/net/xfrm/xfrm_compat.c
index d8e8a11ca845..56fb32f90799 100644
--- a/net/xfrm/xfrm_compat.c
+++ b/net/xfrm/xfrm_compat.c
@@ -537,7 +537,7 @@ static struct nlmsghdr *xfrm_user_rcv_msg_compat(const struct nlmsghdr *h32,
 {
/* netlink_rcv_skb() checks if a message has full (struct nlmsghdr) */
u16 type = h32->nlmsg_type - XFRM_MSG_BASE;
-   struct nlattr *attrs[XFRMA_MAX+1];
+   struct nlattr *attrs[XFRMA_MAX+1] = {0};
struct nlmsghdr *h64;
size_t len;
int err;


Re: [PATCH 1/8] af_unix: take address assignment/hash insertion into a new helper

2021-02-20 Thread Cong Wang
On Sat, Feb 20, 2021 at 11:32 AM Al Viro  wrote:
>
> On Sat, Feb 20, 2021 at 11:12:33AM -0800, Cong Wang wrote:
> > On Thu, Feb 18, 2021 at 8:22 PM Al Viro  wrote:
> > >
> > > Duplicated logics in all bind variants (autobind, bind-to-path,
> > > bind-to-abstract) gets taken into a common helper.
> > >
> > > Signed-off-by: Al Viro 
> > > ---
> > >  net/unix/af_unix.c | 30 +++---
> > >  1 file changed, 15 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> > > index 41c3303c3357..179b4fe837e6 100644
> > > --- a/net/unix/af_unix.c
> > > +++ b/net/unix/af_unix.c
> > > @@ -262,6 +262,16 @@ static void __unix_insert_socket(struct hlist_head *list, struct sock *sk)
> > > sk_add_node(sk, list);
> > >  }
> > >
> > > +static void __unix_set_addr(struct sock *sk, struct unix_address *addr,
> > > +   unsigned hash)
> > > +   __releases(&unix_table_lock)
> > > +{
> > > +   __unix_remove_socket(sk);
> > > +   smp_store_release(&unix_sk(sk)->addr, addr);
> > > +   __unix_insert_socket(&unix_socket_table[hash], sk);
> > > +   spin_unlock(&unix_table_lock);
> >
> > Please take the unlock out, it is clearly an anti-pattern.
>
> Why?  "Insert into locked and unlock" is fairly common...

Because it does not lock the lock, just compare:

lock();
__unix_set_addr();
unlock();

to:

lock();
__unix_set_addr();

Clearly the former is more readable and less error-prone. Even
if you really want to do unlock, pick a name which explicitly says
it, for example, __unix_set_addr_unlock().

Thanks.


Re: [PATCH 1/8] af_unix: take address assignment/hash insertion into a new helper

2021-02-20 Thread Cong Wang
On Thu, Feb 18, 2021 at 8:22 PM Al Viro  wrote:
>
> Duplicated logics in all bind variants (autobind, bind-to-path,
> bind-to-abstract) gets taken into a common helper.
>
> Signed-off-by: Al Viro 
> ---
>  net/unix/af_unix.c | 30 +++---
>  1 file changed, 15 insertions(+), 15 deletions(-)
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 41c3303c3357..179b4fe837e6 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -262,6 +262,16 @@ static void __unix_insert_socket(struct hlist_head *list, struct sock *sk)
> sk_add_node(sk, list);
>  }
>
> +static void __unix_set_addr(struct sock *sk, struct unix_address *addr,
> +   unsigned hash)
> +   __releases(&unix_table_lock)
> +{
> +   __unix_remove_socket(sk);
> +   smp_store_release(&unix_sk(sk)->addr, addr);
> +   __unix_insert_socket(&unix_socket_table[hash], sk);
> +   spin_unlock(&unix_table_lock);

Please take the unlock out, it is clearly an anti-pattern.

And please Cc netdev for networking changes.

Thanks.


Re: [PATCH net] net: sched: fix police ext initialization

2021-02-16 Thread Cong Wang
On Tue, Feb 16, 2021 at 8:22 AM Vlad Buslov  wrote:
>
> When police action is created by cls API tcf_exts_validate() first
> conditional that calls tcf_action_init_1() directly, the action idr is not
> updated according to latest changes in action API that require caller to
> commit newly created action to idr with tcf_idr_insert_many(). This results
> such action not being accessible through act API and causes crash reported
> by syzbot:

Good catch!

This certainly makes sense to me, and I feed it to syzbot too, it is happy
with this patch, so:

Reported-and-tested-by: syzbot+151e3e714d34ae4ce...@syzkaller.appspotmail.com
Reviewed-by: Cong Wang 

Thanks.


Re: KASAN: null-ptr-deref Read in tcf_idrinfo_destroy

2021-02-15 Thread Cong Wang
On Wed, Feb 10, 2021 at 9:53 PM syzbot
 wrote:
>
> syzbot has found a reproducer for the following issue on:
>
> HEAD commit:291009f6 Merge tag 'pm-5.11-rc8' of git://git.kernel.org/p..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=14470d18d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=a53fd47f16f22f8c
> dashboard link: https://syzkaller.appspot.com/bug?extid=151e3e714d34ae4ce7e8
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=15f45814d0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15f4aff8d0
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+151e3e714d34ae4ce...@syzkaller.appspotmail.com
>
> ==
> BUG: KASAN: null-ptr-deref in instrument_atomic_read 
> include/linux/instrumented.h:71 [inline]
> BUG: KASAN: null-ptr-deref in atomic_read 
> include/asm-generic/atomic-instrumented.h:27 [inline]
> BUG: KASAN: null-ptr-deref in __tcf_idr_release net/sched/act_api.c:178 
> [inline]
> BUG: KASAN: null-ptr-deref in tcf_idrinfo_destroy+0x129/0x1d0 
> net/sched/act_api.c:598
> Read of size 4 at addr 0010 by task kworker/u4:5/204
>
> CPU: 0 PID: 204 Comm: kworker/u4:5 Not tainted 5.11.0-rc7-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Workqueue: netns cleanup_net
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x107/0x163 lib/dump_stack.c:120
>  __kasan_report mm/kasan/report.c:400 [inline]
>  kasan_report.cold+0x5f/0xd5 mm/kasan/report.c:413
>  check_memory_region_inline mm/kasan/generic.c:179 [inline]
>  check_memory_region+0x13d/0x180 mm/kasan/generic.c:185
>  instrument_atomic_read include/linux/instrumented.h:71 [inline]
>  atomic_read include/asm-generic/atomic-instrumented.h:27 [inline]
>  __tcf_idr_release net/sched/act_api.c:178 [inline]
>  tcf_idrinfo_destroy+0x129/0x1d0 net/sched/act_api.c:598
>  tc_action_net_exit include/net/act_api.h:151 [inline]
>  police_exit_net+0x168/0x360 net/sched/act_police.c:390

This is really strange. It seems we still left some -EBUSY placeholders
in the idr; however, we actually call tcf_action_destroy() to clean up
everything, including the -EBUSY ones, on the error path.

What do you think, Vlad?

Thanks.


Re: BUG: Incorrect MTU on GRE device if remote is unspecified

2021-01-27 Thread Cong Wang
On Wed, Jan 27, 2021 at 4:56 PM Jakub Kicinski  wrote:
>
> On Mon, 25 Jan 2021 22:10:10 +0200 Slava Bacherikov wrote:
> > Hi, I'd like to report a regression. Currently, if you create GRE
> > interface on the latest stable or LTS kernel (5.4 branch) with
> > unspecified remote destination it's MTU will be adjusted for header size
> > twice. For example:
> >
> > $ ip link add name test type gre local 127.0.0.32
> > $ ip link show test | grep mtu
> > 27: test@NONE:  mtu 1452 qdisc noop state DOWN mode DEFAULT group
> > default qlen 1000
> >
> > or with FOU
> >
> > $ ip link add name test2   type gre local 127.0.0.32 encap fou
> > encap-sport auto encap-dport 
> > $ ip link show test2 | grep mtu
> > 28: test2@NONE:  mtu 1436 qdisc noop state DOWN mode DEFAULT
> > group default qlen 1000
> >
> > The same happens with GUE too (MTU is 1428 instead of 1464).
> > As you can see that MTU in first case is 1452 (1500 - 24 - 24) and with
> > FOU it's 1436 (1500 - 32 - 32), GUE 1428 (1500 - 36 - 36). If remote
> > address is specified MTU is correct.
> >
> > This regression caused by fdafed459998e2be0e877e6189b24cb7a0183224 commit.
>
> Cong is this one on your radar?

Yeah, I guess ipgre_link_update() somehow gets called twice,
but I will need to look into it.

Thanks.


Re: BPF: unbounded bpf_map_free_deferred problem

2021-01-22 Thread Cong Wang
On Fri, Jan 22, 2021 at 4:42 PM Tetsuo Handa
 wrote:
>
> Hello, BPF developers.
>
> Alexey Kardashevskiy is reporting that system_wq gets stuck due to flooding of
> unbounded bpf_map_free_deferred work. Use of WQ_MEM_RECLAIM | WQ_HIGHPRI | 
> WQ_UNBOUND
> workqueue did not solve this problem. Is it possible that a refcount leak 
> somewhere
> preventing bpf_map_free_deferred from completing? Please see
> https://lkml.kernel.org/r/CACT4Y+Z+kwPM=WUzJ-e359PWeLLqmF0w4Yxp1spzZ=+j0ek...@mail.gmail.com
>  .
>

Which map does the reproducer create? And where exactly do
those work items block?

Different map->ops->map_free() waits for different reasons,
for example, htab_map_free() waits for flying htab_elem_free_rcu().
I can't immediately see how they could wait for each other, if this
is what you meant above.

Thanks.


Re: KMSAN: kernel-infoleak in move_addr_to_user (4)

2021-01-11 Thread Cong Wang
On Mon, Jan 11, 2021 at 11:33 AM Jakub Kicinski  wrote:
>
> Looks like a AF_CAN socket:
>
> r0 = socket(0x1d, 0x2, 0x6)
> getsockname$packet(r0, &(0x7f000100)={0x11, 0x0, 0x0, 0x1, 0x0, 0x6, 
> @broadcast}, &(0x7f00)=0x14)
>

Right, it seems we need a memset(0) in isotp_getname().
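
A rough sketch of that memset(0), assuming isotp_getname() fills a
struct sockaddr_can from the socket's isotp state (field names from
memory, not necessarily the final upstream fix):

	struct sockaddr_can *addr = (struct sockaddr_can *)uaddr;

	memset(addr, 0, sizeof(*addr));	/* clear padding copied to userspace */
	addr->can_family  = AF_CAN;
	addr->can_ifindex = so->ifindex;
	addr->can_addr.tp.rx_id = so->rxid;
	addr->can_addr.tp.tx_id = so->txid;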

Thanks.


Re: [RFC PATCH bpf-next] ksnoop: kernel argument/return value tracing/display using BTF

2021-01-05 Thread Cong Wang
On Mon, Jan 4, 2021 at 7:29 AM Alan Maguire  wrote:
>
> BPF Type Format (BTF) provides a description of kernel data structures
> and of the types kernel functions utilize as arguments and return values.
>
> A helper was recently added - bpf_snprintf_btf() - that uses that
> description to create a string representation of the data provided,
> using the BTF id of its type.  For example to create a string
> representation of a "struct sk_buff", the pointer to the skb
> is provided along with the type id of "struct sk_buff".
>
> Here that functionality is utilized to support tracing kernel
> function entry and return using k[ret]probes.  The "struct pt_regs"
> context can be used to derive arguments and return values, and
> when the user supplies a function name we
>
> - look it up in /proc/kallsyms to find its address/module
> - look it up in the BTF kernel data to get types of arguments
>   and return value
> - store a map representation of the trace information, keyed by
>   instruction pointer
> - on function entry/return we look up the map to retrieve the BTF
>   ids of the arguments/return values and can call bpf_snprintf_btf()
>   with these argument/return values along with the type ids to store
>   a string representation in the map.
> - this is then sent via perf event to userspace where it can be
>   displayed.
>
> ksnoop can be used to show function signatures; for example:

This is definitely quite useful!

Is it possible to integrate this with bpftrace? That would save people
from learning yet another tool. ;)

Thanks.


Re: inconsistent lock state in ila_xlat_nl_cmd_add_mapping

2020-12-29 Thread Cong Wang
On Tue, Dec 29, 2020 at 5:39 PM Jakub Kicinski  wrote:
>
> On Mon, 13 Aug 2018 21:40:03 -0700 syzbot wrote:
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:78cbac647e61 Merge branch 'ip-faster-in-order-IP-fragments'
> > git tree:   net-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=14df482840
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=9100338df26ab75
> > dashboard link: https://syzkaller.appspot.com/bug?extid=eaaf6c4a6a8cb1869d86
> > compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> > syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=13069ad240
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+eaaf6c4a6a8cb1869...@syzkaller.appspotmail.com
>
> #syz invalid
>
> Hard to track down what fixed this, but the lockdep splat is mixing up
> locks from two different hashtables, so there was never a real issue
> here.

This one is probably fixed by:

commit ff93bca769925a2d8fd7f910cdf543d992e17f07
Author: Cong Wang 
Date:   Tue Aug 14 15:21:31 2018 -0700

ila: make lockdep happy again

given the time of last reproducing...

Thanks.


Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-11-04 Thread Cong Wang
On Wed, Nov 4, 2020 at 10:04 PM Cong Wang  wrote:
>
> On Mon, Nov 2, 2020 at 11:24 PM Yunsheng Lin  wrote:
> > >> From my understanding, we can do anything about the old qdisc (including
> > >> destorying the old qdisc) after some_qdisc_is_busy() return false.
> > >
> > > But the current code does the reset _before_ some_qdisc_is_busy(). ;)
> >
> > If lock is taken when doing reset, it does not matter if the reset is
> > before some_qdisc_is_busy(), right?
>
> Why not? How about the following scenario?
>
> CPU0:                         CPU1:
> dev_reset_queue()
>                               net_tx_action()
>                                -> sch_direct_xmit()
>                                  -> dev_requeue_skb()
> some_qdisc_is_busy()
> // waiting for TX action on CPU1
> // now some packets are requeued

Never mind, the skb_bad_txq is also cleared by dev_reset_queue().
TX action after resetting should get NULL.

Thanks.


Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-11-04 Thread Cong Wang
On Mon, Nov 2, 2020 at 11:24 PM Yunsheng Lin  wrote:
> >> From my understanding, we can do anything about the old qdisc (including
> >> destorying the old qdisc) after some_qdisc_is_busy() return false.
> >
> > But the current code does the reset _before_ some_qdisc_is_busy(). ;)
>
> If lock is taken when doing reset, it does not matter if the reset is
> before some_qdisc_is_busy(), right?

Why not? How about the following scenario?

CPU0:                         CPU1:
dev_reset_queue()
                              net_tx_action()
                               -> sch_direct_xmit()
                                 -> dev_requeue_skb()
some_qdisc_is_busy()
// waiting for TX action on CPU1
// now some packets are requeued


Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-11-02 Thread Cong Wang
On Fri, Oct 30, 2020 at 12:38 AM Yunsheng Lin  wrote:
>
> On 2020/10/30 3:05, Cong Wang wrote:
> >
> > I do not see how and why it should. synchronize_net() is merely an optimized
> > version of synchronize_rcu(), it should wait for RCU readers, softirqs are 
> > not
> > necessarily RCU readers, net_tx_action() does not take RCU read lock either.
>
> Ok, make sense.
>
> Taking RCU read lock in net_tx_action() does not seems to solve the problem,
> what about the time window between __netif_reschedule() and net_tx_action()?
>
> It seems we need to re-dereference the qdisc whenever RCU read lock is 
> released
> and qdisc is still in sd->output_queue or wait for the sd->output_queue to 
> drain?

I am not suggesting you take the RCU read lock. We already wait for the
TX action with a sleep loop. To me, the only thing missing is moving the
reset after that wait.


> >>>> If we do any additional reset that is not related to qdisc in 
> >>>> dev_reset_queue(), we
> >>>> can move it after some_qdisc_is_busy() checking.
> >>>
> >>> I am not suggesting to do an additional reset, I am suggesting to move
> >>> your reset after the busy waiting.
> >>
> >> There maybe a deadlock here if we reset the qdisc after the 
> >> some_qdisc_is_busy() checking,
> >> because some_qdisc_is_busy() may require the qdisc reset to clear the skb, 
> >> so that
> >
> > some_qdisc_is_busy() checks the status of qdisc, not the skb queue.
>
> Is there any reason why we do not check the skb queue in the dqisc?
> It seems there may be skb left when netdev is deactivated, maybe at least warn
> about that when there is still skb left when netdev is deactivated?
> Is that why we call qdisc_reset() to clear the leftover skb in 
> qdisc_destroy()?
>
> >
> >
> >> some_qdisc_is_busy() can return false. I am not sure this is really a 
> >> problem, but
> >> sch_direct_xmit() may requeue the skb when dev_hard_start_xmit return 
> >> TX_BUSY.
> >
> > Sounds like another reason we should move the reset as late as possible?
>
> Why?

You said "sch_direct_xmit() may requeue the skb", I agree. I assume you mean
net_tx_action() calls sch_direct_xmit() which does the requeue then races with
reset. No?


>
> There current netdev down order is mainly below:
>
> netif_tx_stop_all_queues()
>
> dev_deactivate_queue()
>
> synchronize_net()
>
> dev_reset_queue()
>
> some_qdisc_is_busy()
>
>
> You suggest to change it to below order, right?
>
> netif_tx_stop_all_queues()
>
> dev_deactivate_queue()
>
> synchronize_net()
>
> some_qdisc_is_busy()
>
> dev_reset_queue()

Yes.

>
>
> What is the semantics of some_qdisc_is_busy()?

Waiting for flying TX action.

> From my understanding, we can do anything about the old qdisc (including
> destorying the old qdisc) after some_qdisc_is_busy() return false.

But the current code does the reset _before_ some_qdisc_is_busy(). ;)

Thanks.


Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-10-29 Thread Cong Wang
On Wed, Oct 28, 2020 at 7:54 PM Yunsheng Lin  wrote:
>
> On 2020/9/18 3:26, Cong Wang wrote:
> > On Fri, Sep 11, 2020 at 1:13 AM Yunsheng Lin  wrote:
> >>
> >> On 2020/9/11 4:07, Cong Wang wrote:
> >>> On Tue, Sep 8, 2020 at 4:06 AM Yunsheng Lin  
> >>> wrote:
> >>>>
> >>>> Currently there is concurrent reset and enqueue operation for the
> >>>> same lockless qdisc when there is no lock to synchronize the
> >>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>> reset, and call hns3_nic_net_xmit() with that skb later.
> >>>>
> >>>> Reused the existing synchronize_net() in dev_deactivate_many() to
> >>>> make sure skb with larger queue_mapping enqueued to old qdisc(which
> >>>> is saved in dev_queue->qdisc_sleeping) will always be reset when
> >>>> dev_reset_queue() is called.
> >>>>
> >>>> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> >>>> Signed-off-by: Yunsheng Lin 
> >>>> ---
> >>>> ChangeLog V2:
> >>>> Reuse existing synchronize_net().
> >>>> ---
> >>>>  net/sched/sch_generic.c | 48 
> >>>> +---
> >>>>  1 file changed, 33 insertions(+), 15 deletions(-)
> >>>>
> >>>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> >>>> index 265a61d..54c4172 100644
> >>>> --- a/net/sched/sch_generic.c
> >>>> +++ b/net/sched/sch_generic.c
> >>>> @@ -1131,24 +1131,10 @@ EXPORT_SYMBOL(dev_activate);
> >>>>
> >>>>  static void qdisc_deactivate(struct Qdisc *qdisc)
> >>>>  {
> >>>> -   bool nolock = qdisc->flags & TCQ_F_NOLOCK;
> >>>> -
> >>>> if (qdisc->flags & TCQ_F_BUILTIN)
> >>>> return;
> >>>> -   if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state))
> >>>> -   return;
> >>>> -
> >>>> -   if (nolock)
> >>>> -   spin_lock_bh(&qdisc->seqlock);
> >>>> -   spin_lock_bh(qdisc_lock(qdisc));
> >>>>
> >>>> set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);
> >>>> -
> >>>> -   qdisc_reset(qdisc);
> >>>> -
> >>>> -   spin_unlock_bh(qdisc_lock(qdisc));
> >>>> -   if (nolock)
> >>>> -   spin_unlock_bh(&qdisc->seqlock);
> >>>>  }
> >>>>
> >>>>  static void dev_deactivate_queue(struct net_device *dev,
> >>>> @@ -1165,6 +1151,30 @@ static void dev_deactivate_queue(struct net_device *dev,
> >>>> }
> >>>>  }
> >>>>
> >>>> +static void dev_reset_queue(struct net_device *dev,
> >>>> +   struct netdev_queue *dev_queue,
> >>>> +   void *_unused)
> >>>> +{
> >>>> +   struct Qdisc *qdisc;
> >>>> +   bool nolock;
> >>>> +
> >>>> +   qdisc = dev_queue->qdisc_sleeping;
> >>>> +   if (!qdisc)
> >>>> +   return;
> >>>> +
> >>>> +   nolock = qdisc->flags & TCQ_F_NOLOCK;
> >>>> +
> >>>> +   if (nolock)
> >>>> +   spin_lock_bh(&qdisc->seqlock);
> >>>> +   spin_lock_bh(qdisc_lock(qdisc));
> >>>
> >>>
> >>> I think you do not need this lock for lockless one.
> >>
> >> It seems so.
> >> Maybe another patch to remove qdisc_lock(qdisc) for lockless
> >> qdisc?
> >
> > Yeah, but not sure if we still want this lockless qdisc any more,
> > it brings more troubles than gains.
> >
> >>
> >>
> >>>
> >>>> +
> >>>> +   qdisc_reset(qdisc);
> >>>> +
> >>>> +   spin_unloc

Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-10-28 Thread Cong Wang
On Wed, Oct 28, 2020 at 8:37 AM Pai, Vishwanath  wrote:
> Hi,
>
> We noticed some problems when testing the latest 5.4 LTS kernel and traced it
> back to this commit using git bisect. When running our tests the machine stops
> responding to all traffic and the only way to recover is a reboot. I do not 
> see
> a stack trace on the console.

Do you mean the machine is still running fine and just the network is down?

If so, can you dump your tc config with stats when the problem is happening?
(You can use `tc -s -d qd show ...`.)

>
> This can be reproduced using the packetdrill test below, it should be run a
> few times or in a loop. You should hit this issue within a few tries but
> sometimes might take up to 15-20 tries.
...
> I can reproduce the issue easily on v5.4.68, and after reverting this commit 
> it
> does not happen anymore.

This is odd. The patch in this thread touches the netdev reset path; if
packetdrill is the only thing you use to trigger the bug (that is, the netdev
is always active), I cannot connect the two.

Thanks.


Re: [PATCH] net: cls_api: remove unneeded local variable in tc_dump_chain()

2020-10-28 Thread Cong Wang
On Wed, Oct 28, 2020 at 6:59 AM Tom Rix  wrote:
>
>
> On 10/28/20 4:35 AM, Lukas Bulwahn wrote:
> > @@ -2971,13 +2963,11 @@ static int tc_dump_chain(struct sk_buff *skb, 
> > struct netlink_callback *cb)
> >   if (!dev)
> >   return skb->len;
> >
> > - parent = tcm->tcm_parent;
> > - if (!parent) {
> > + if (!tcm->tcm_parent)
> >   q = dev->qdisc;
> > - parent = q->handle;
>
> This looks like a an unused error handler.
>
> and the later call to
>
> if (TC_H_MIN(tcm->tcm_parent)
>
> maybe should be
>
> if (TC_H_MIN(parent))

When tcm->tcm_parent is 0, TC_H_MIN(tcm->tcm_parent) is also 0,
so we will not hit that if branch.

So, I think Lukas' patch is correct.
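
For reference, a quick sketch of why that branch cannot be taken; the macros
below are from include/uapi/linux/pkt_sched.h, quoted from memory:

	#define TC_H_MIN_MASK	(0x0000FFFFU)
	#define TC_H_MIN(h)	((h) & TC_H_MIN_MASK)

	/* TC_H_MIN(0) == 0, so with tcm->tcm_parent == 0 the later
	 * "if (TC_H_MIN(tcm->tcm_parent))" test is never true, and the
	 * removed local variable really was redundant.
	 */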

Thanks.


Re: [External] Re: [PATCH] mm: proc: add Sock to /proc/meminfo

2020-10-12 Thread Cong Wang
On Mon, Oct 12, 2020 at 2:53 AM Muchun Song  wrote:
> We are not complaining about TCP using too much memory, but how do
> we know that TCP uses a lot of memory. When I firstly face this problem,
> I do not know who uses the 25GB memory and it is not shown in the 
> /proc/meminfo.
> If we can know the amount memory of the socket buffer via /proc/meminfo, we
> may not need to spend a lot of time troubleshooting this problem. Not everyone
> knows that a lot of memory may be used here. But I believe many people
> should know /proc/meminfo to confirm memory users.

Well, I'd bet networking people know `ss -m` better than /proc/meminfo,
generally speaking.

The practice here is that if you want networking-specific counters, you add
them where networking people will look for them, that is, `ss -m` or /proc/net/...

Or maybe the problem you described is not specific to networking at all,
there must be some other places where pages are allocated but not charged.
If so, adding a general mm counter in /proc/meminfo makes sense, but
it won't be specific to networking.

Thanks.


Re: [External] Re: [PATCH] mm: proc: add Sock to /proc/meminfo

2020-10-12 Thread Cong Wang
On Sun, Oct 11, 2020 at 9:22 PM Muchun Song  wrote:
>
> On Mon, Oct 12, 2020 at 2:39 AM Cong Wang  wrote:
> >
> > On Sat, Oct 10, 2020 at 3:39 AM Muchun Song  
> > wrote:
> > >
> > > The amount of memory allocated to sockets buffer can become significant.
> > > However, we do not display the amount of memory consumed by sockets
> > > buffer. In this case, knowing where the memory is consumed by the kernel
> >
> > We do it via `ss -m`. Is it not sufficient? And if not, why not adding it 
> > there
> > rather than /proc/meminfo?
>
> If the system has little free memory, we can know where the memory is via
> /proc/meminfo. If a lot of memory is consumed by socket buffer, we cannot
> know it when the Sock is not shown in the /proc/meminfo. If the unaware user
> can't think of the socket buffer, naturally they will not `ss -m`. The
> end result

Interesting, we already have a few counters related to socket buffers; are
you saying those are not accounted in /proc/meminfo either?
If yes, why are page frags so special here? If not, they are more
important than page frags, so you probably want to deal with them
first.


> is that we still don’t know where the memory is consumed. And we add the
> Sock to the /proc/meminfo just like the memcg does('sock' item in the cgroup
> v2 memory.stat). So I think that adding to /proc/meminfo is sufficient.

It looks like the socket page frag is actually already accounted; see for
example tcp_sendmsg_locked():

copy = min_t(int, copy, pfrag->size - pfrag->offset);

if (!sk_wmem_schedule(sk, copy))
goto wait_for_memory;


>
> >
> > >  static inline void __skb_frag_unref(skb_frag_t *frag)
> > >  {
> > > -   put_page(skb_frag_page(frag));
> > > +   struct page *page = skb_frag_page(frag);
> > > +
> > > +   if (put_page_testzero(page)) {
> > > +   dec_sock_node_page_state(page);
> > > +   __put_page(page);
> > > +   }
> > >  }
> >
> > You mix socket page frag with skb frag at least, not sure this is exactly
> > what you want, because clearly skb page frags are frequently used
> > by network drivers rather than sockets.
> >
> > Also, which one matches this dec_sock_node_page_state()? Clearly
> > not skb_fill_page_desc() or __skb_frag_ref().
>
> Yeah, we call inc_sock_node_page_state() in the skb_page_frag_refill().

How is skb_page_frag_refill() possibly paired with __skb_frag_unref()?

> So if someone gets the page returned by skb_page_frag_refill(), it must
> put the page via __skb_frag_unref()/skb_frag_unref(). We use PG_private
> to indicate that we need to dec the node page state when the refcount of
> page reaches zero.

skb_page_frag_refill() is called on frags not within an skb, for instance,
sk_page_frag_refill() uses it for a per-socket or per-process page frag.
But, __skb_frag_unref() is specifically used for skb frags, which are
supposed to be filled by skb_fill_page_desc() (page is allocated by driver).

They are different things and you are mixing them up, which looks clearly
wrong, or at least misleading.
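
To make the two life cycles concrete, here is an illustrative fragment only,
with the surrounding sk/skb/page/len variables assumed from context:

	/* (1) socket page frag: refilled via sk_page_frag_refill() /
	 *     skb_page_frag_refill(), and released with a plain put_page()
	 *     once the copied data is consumed
	 */
	struct page_frag *pfrag = sk_page_frag(sk);

	if (sk_page_frag_refill(sk, pfrag))
		/* copy user data into pfrag->page at pfrag->offset */;

	/* (2) skb frag: filled by the driver via skb_fill_page_desc() or
	 *     referenced via __skb_frag_ref(); that is the reference
	 *     __skb_frag_unref() is supposed to drop
	 */
	skb_fill_page_desc(skb, 0, page, 0, len);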

Thanks.


Re: BUG: unable to handle kernel paging request in tcf_action_dump_terse

2020-10-11 Thread Cong Wang
#syz fix: net_sched: check error pointer in tcf_dump_walker()


Re: [PATCH] mm: proc: add Sock to /proc/meminfo

2020-10-11 Thread Cong Wang
On Sat, Oct 10, 2020 at 3:39 AM Muchun Song  wrote:
>
> The amount of memory allocated to sockets buffer can become significant.
> However, we do not display the amount of memory consumed by sockets
> buffer. In this case, knowing where the memory is consumed by the kernel

We do it via `ss -m`. Is it not sufficient? And if not, why not adding it there
rather than /proc/meminfo?

>  static inline void __skb_frag_unref(skb_frag_t *frag)
>  {
> -   put_page(skb_frag_page(frag));
> +   struct page *page = skb_frag_page(frag);
> +
> +   if (put_page_testzero(page)) {
> +   dec_sock_node_page_state(page);
> +   __put_page(page);
> +   }
>  }

You mix socket page frag with skb frag at least, not sure this is exactly
what you want, because clearly skb page frags are frequently used
by network drivers rather than sockets.

Also, which one matches this dec_sock_node_page_state()? Clearly
not skb_fill_page_desc() or __skb_frag_ref().

Thanks.


Re: INFO: task hung in tcf_action_init_1

2020-09-30 Thread Cong Wang
#syz test: https://github.com/congwang/linux.git net


Re: general protection fault in tcf_generic_walker

2020-09-30 Thread Cong Wang
On Wed, Sep 30, 2020 at 10:42 AM syzbot
 wrote:
>
> syzbot has found a reproducer for the following issue on:
>
> HEAD commit:2b3e981a Merge branch 'mptcp-Fix-for-32-bit-DATA_FIN'
> git tree:   net
> console output: https://syzkaller.appspot.com/x/log.txt?x=1653724790
> kernel config:  https://syzkaller.appspot.com/x/.config?x=99a7c78965c75e07
> dashboard link: https://syzkaller.appspot.com/bug?extid=b47bc4f247856fb4d9e1
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1412a5a790
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+b47bc4f247856fb4d...@syzkaller.appspotmail.com
>
> general protection fault, probably for non-canonical address 
> 0xdc04:  [#1] PREEMPT SMP KASAN
> KASAN: null-ptr-deref in range [0x0020-0x0027]
> CPU: 0 PID: 8855 Comm: syz-executor.1 Not tainted 5.9.0-rc6-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:tcf_dump_walker net/sched/act_api.c:240 [inline]
> RIP: 0010:tcf_generic_walker+0x367/0xba0 net/sched/act_api.c:343
> Code: 24 31 ff 48 89 de e8 c8 55 eb fa 48 85 db 74 3f e8 3e 59 eb fa 48 8d 7d 
> 30 48 b9 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 <80> 3c 08 00 0f 85 26 
> 07 00 00 48 8b 5d 30 31 ff 48 2b 1c 24 48 89
> RSP: 0018:c9000b6ff3a8 EFLAGS: 00010202
> RAX: 0004 RBX: c000aae4 RCX: dc00
> RDX: 8880a82aa140 RSI: 868ae502 RDI: 0020
> RBP: fff0 R08:  R09: 8880a8c41e07
> R10:  R11:  R12: 88809f226340
> R13:  R14:  R15: 
> FS:  7f156f7fa700() GS:8880ae40() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 55d25128b348 CR3: a7d3d000 CR4: 001506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  tc_dump_action+0x6d5/0xe60 net/sched/act_api.c:1609


Probably just need another IS_ERR() check on the dump path.
I will take a second look.
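
Something along these lines is what I have in mind; this is an untested
sketch, and the point is only to skip the error pointers that a pending
replace leaves in the idr:

	unsigned long id = 1, tmp;
	struct tc_action *p;

	idr_for_each_entry_ul(idr, p, tmp, id) {
		if (IS_ERR(p))	/* slot reserved by an in-flight replace */
			continue;
		/* ... dump p as before ... */
	}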

Thanks.


Re: KASAN: use-after-free Read in tcf_action_init

2020-09-28 Thread Cong Wang
#syz fix: net_sched: commit action insertions together


Re: [PATCH 0/2] iommu/iova: Solve longterm IOVA issue

2020-09-25 Thread Cong Wang
On Fri, Sep 25, 2020 at 2:56 AM John Garry  wrote:
>
> This series contains a patch to solve the longterm IOVA issue which
> leizhen originally tried to address at [0].
>
> I also included the small optimisation from Cong Wang, which never seems
> to be have been accepted [1]. There was some debate of the other patches
> in that series, but this one is quite straightforward.
>
> @Cong Wang, Please resend your series if prefer I didn't upstream your
> patch.

Thanks for letting me know. But I still don't think it is worth any effort,
given it is hard to work with Robin. Users who care about latency here
should just disable the IOMMU; it is really hard to optimize IOVA cache
performance enough to catch up with the !IOMMU case.

So, please feel free to carry it on your own, I have no problem with it.

Thanks.


Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-09-17 Thread Cong Wang
On Fri, Sep 11, 2020 at 1:13 AM Yunsheng Lin  wrote:
>
> On 2020/9/11 4:07, Cong Wang wrote:
> > On Tue, Sep 8, 2020 at 4:06 AM Yunsheng Lin  wrote:
> >>
> >> Currently there is concurrent reset and enqueue operation for the
> >> same lockless qdisc when there is no lock to synchronize the
> >> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >> skb with a larger queue_mapping after the corresponding qdisc is
> >> reset, and call hns3_nic_net_xmit() with that skb later.
> >>
> >> Reused the existing synchronize_net() in dev_deactivate_many() to
> >> make sure skb with larger queue_mapping enqueued to old qdisc(which
> >> is saved in dev_queue->qdisc_sleeping) will always be reset when
> >> dev_reset_queue() is called.
> >>
> >> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> >> Signed-off-by: Yunsheng Lin 
> >> ---
> >> ChangeLog V2:
> >> Reuse existing synchronize_net().
> >> ---
> >>  net/sched/sch_generic.c | 48 
> >> +---
> >>  1 file changed, 33 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> >> index 265a61d..54c4172 100644
> >> --- a/net/sched/sch_generic.c
> >> +++ b/net/sched/sch_generic.c
> >> @@ -1131,24 +1131,10 @@ EXPORT_SYMBOL(dev_activate);
> >>
> >>  static void qdisc_deactivate(struct Qdisc *qdisc)
> >>  {
> >> -   bool nolock = qdisc->flags & TCQ_F_NOLOCK;
> >> -
> >> if (qdisc->flags & TCQ_F_BUILTIN)
> >> return;
> >> -   if (test_bit(__QDISC_STATE_DEACTIVATED, >state))
> >> -   return;
> >> -
> >> -   if (nolock)
> >> -   spin_lock_bh(>seqlock);
> >> -   spin_lock_bh(qdisc_lock(qdisc));
> >>
> >> set_bit(__QDISC_STATE_DEACTIVATED, >state);
> >> -
> >> -   qdisc_reset(qdisc);
> >> -
> >> -   spin_unlock_bh(qdisc_lock(qdisc));
> >> -   if (nolock)
> >> -   spin_unlock_bh(>seqlock);
> >>  }
> >>
> >>  static void dev_deactivate_queue(struct net_device *dev,
> >> @@ -1165,6 +1151,30 @@ static void dev_deactivate_queue(struct net_device 
> >> *dev,
> >> }
> >>  }
> >>
> >> +static void dev_reset_queue(struct net_device *dev,
> >> +   struct netdev_queue *dev_queue,
> >> +   void *_unused)
> >> +{
> >> +   struct Qdisc *qdisc;
> >> +   bool nolock;
> >> +
> >> +   qdisc = dev_queue->qdisc_sleeping;
> >> +   if (!qdisc)
> >> +   return;
> >> +
> >> +   nolock = qdisc->flags & TCQ_F_NOLOCK;
> >> +
> >> +   if (nolock)
> >> +   spin_lock_bh(>seqlock);
> >> +   spin_lock_bh(qdisc_lock(qdisc));
> >
> >
> > I think you do not need this lock for lockless one.
>
> It seems so.
> Maybe another patch to remove qdisc_lock(qdisc) for lockless
> qdisc?

Yeah, but I am not sure we still want this lockless qdisc any more; it
brings more trouble than gain.

>
>
> >
> >> +
> >> +   qdisc_reset(qdisc);
> >> +
> >> +   spin_unlock_bh(qdisc_lock(qdisc));
> >> +   if (nolock)
> >> +   spin_unlock_bh(>seqlock);
> >> +}
> >> +
> >>  static bool some_qdisc_is_busy(struct net_device *dev)
> >>  {
> >> unsigned int i;
> >> @@ -1213,12 +1223,20 @@ void dev_deactivate_many(struct list_head *head)
> >> dev_watchdog_down(dev);
> >> }
> >>
> >> -   /* Wait for outstanding qdisc-less dev_queue_xmit calls.
> >> +   /* Wait for outstanding qdisc-less dev_queue_xmit calls or
> >> +* outstanding qdisc enqueuing calls.
> >>  * This is avoided if all devices are in dismantle phase :
> >>  * Caller will call synchronize_net() for us
> >>  */
> >> synchronize_net();
> >>
> >> +

Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-17 Thread Cong Wang
On Sun, Sep 13, 2020 at 7:10 PM Yunsheng Lin  wrote:
>
> On 2020/9/11 4:19, Cong Wang wrote:
> > On Thu, Sep 3, 2020 at 8:21 PM Kehuan Feng  wrote:
> >> I also tried Cong's patch (shown below on my tree) and it could avoid
> >> the issue (stressing for 30 minutus for three times and not jitter
> >> observed).
> >
> > Thanks for verifying it!
> >
> >>
> >> --- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
> >> +++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
> >> @@ -127,8 +127,7 @@
> >>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
> >>  {
> >>   if (qdisc->flags & TCQ_F_NOLOCK) {
> >> - if (!spin_trylock(>seqlock))
> >> - return false;
> >> + spin_lock(>seqlock);
> >>   } else if (qdisc_is_running(qdisc)) {
> >>   return false;
> >>   }
> >>
> >> I am not actually know what you are discussing above. It seems to me
> >> that Cong's patch is similar as disabling lockless feature.
> >
> >>From performance's perspective, yeah. Did you see any performance
> > downgrade with my patch applied? It would be great if you can compare
> > it with removing NOLOCK. And if the performance is as bad as no
> > NOLOCK, then we can remove the NOLOCK bit for pfifo_fast, at least
> > for now.
>
> It seems the lockless qdisc may have below concurrent problem:
>   cpu0:   cpu1:
> q->enqueue  .
> qdisc_run_begin(q)  .
> __qdisc_run(q) ->qdisc_restart() -> dequeue_skb()   .
>  -> sch_direct_xmit()   .
> .
> q->enqueue
>  
> qdisc_run_begin(q)
> qdisc_run_end(q)
>
>
> cpu1 enqueue a skb without calling __qdisc_run(), and cpu0 did not see the
> enqueued skb when calling __qdisc_run(q) because cpu1 may enqueue the skb
> after cpu0 called __qdisc_run(q) and before cpu0 called qdisc_run_end(q).

This is the same problem that my patch fixes; I do not know why you are
suggesting another patch despite quoting mine. Please read the whole thread
if you want to participate.

Thanks.


Re: [PATCH] [PATCH] net/sched : cbs : fix calculation error of idleslope credits

2020-09-17 Thread Cong Wang
On Wed, Sep 16, 2020 at 11:05 PM Xiaoyong Yan  wrote:
>
> in the function cbs_dequeue_soft, when q->credits <0, [now - q->last]
> should be accounted for sendslope,not idleslope.
>
> so the solution as follows: when q->credits is less than 0, directly
> calculate delay time, activate hrtimer and when hrtimer fires,
> calculate idleslope credits and update it to q->credits.
>
> because of the lack of self-defined qdisc_watchdog_func, so in the
> generic sch_api, add qdisc_watchdog_init_func and
> qdisc_watchdog_init_clockid_func, so sch_cbs can use it to define its
> own process.

You do not have to define them as a generic API; you can just use the
hrtimer API directly in sch_cbs, as it is the only user within net/sched/.
Hopefully this would also reduce the size of your patch.
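
For example, something like the sketch below inside sch_cbs itself, instead
of new qdisc_watchdog variants. It is not compile-tested, and the credit_timer
and sch fields are hypothetical names, not existing members of
struct cbs_sched_data:

	struct cbs_sched_data {
		/* ... existing fields ... */
		struct Qdisc *sch;		/* hypothetical back pointer */
		struct hrtimer credit_timer;	/* hypothetical field */
	};

	static enum hrtimer_restart cbs_credit_timer_fn(struct hrtimer *t)
	{
		struct cbs_sched_data *q = container_of(t, struct cbs_sched_data,
							credit_timer);

		/* refill idleslope credits here, then kick the qdisc */
		rcu_read_lock();
		__netif_schedule(qdisc_root(q->sch));
		rcu_read_unlock();
		return HRTIMER_NORESTART;
	}

	/* in cbs_init(): */
	hrtimer_init(&q->credit_timer, CLOCK_MONOTONIC,
		     HRTIMER_MODE_ABS_PINNED);
	q->credit_timer.function = cbs_credit_timer_fn;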


>
> the patch is changed based on v5.4.42,and the test result as follows:
> the NIC is 100Mb/s ,full duplex.
>
> step1:
> tc qdisc add dev ens33 root cbs idleslope 75 sendslope -25 hicredit 100 
> locredit -100 offload 0
> step2:
> root@ubuntu:/home/yxy/kernel/linux-stable# iperf -c 192.168.1.114 -i 1
> 
> Client connecting to 192.168.1.114, TCP port 5001
> TCP window size:  246 KByte (default)
> 
> [  3] local 192.168.1.120 port 42004 connected with 192.168.1.114 port 5001
> [ ID] Interval   Transfer Bandwidth
> [  3]  0.0- 1.0 sec  9.00 MBytes  75.5 Mbits/sec
> [  3]  1.0- 2.0 sec  8.50 MBytes  71.3 Mbits/sec
> [  3]  2.0- 3.0 sec  8.50 MBytes  71.3 Mbits/sec
> [  3]  3.0- 4.0 sec  8.38 MBytes  70.3 Mbits/sec
> [  3]  4.0- 5.0 sec  8.38 MBytes  70.3 Mbits/sec
> [  3]  5.0- 6.0 sec  8.50 MBytes  71.3 Mbits/sec
> [  3]  6.0- 7.0 sec  8.50 MBytes  71.3 Mbits/sec
> [  3]  7.0- 8.0 sec  8.62 MBytes  72.4 Mbits/sec
> [  3]  8.0- 9.0 sec  8.50 MBytes  71.3 Mbits/sec
> [  3]  9.0-10.0 sec  8.62 MBytes  72.4 Mbits/sec
> [  3]  0.0-10.0 sec  85.5 MBytes  71.5 Mbits/sec
>
> Signed-off-by: Xiaoyong Yan 
> ---
>  include/net/pkt_sched.h |  7 +++
>  net/sched/sch_api.c | 20 ++--
>  net/sched/sch_cbs.c | 41 +
>  3 files changed, 50 insertions(+), 18 deletions(-)
>
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index 6a70845bd9ab..5fec23b15e1c 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
> #define DEFAULT_TX_QUEUE_LEN1000
>
> @@ -72,6 +73,12 @@ struct qdisc_watchdog {
> struct Qdisc*qdisc;
>  };
>
> +typedef enum hrtimer_restart (*qdisc_watchdog_func_t)(struct hrtimer *timer);
> +void qdisc_watchdog_init_clockid_func(struct qdisc_watchdog *wd, struct 
> Qdisc *qdisc,
> +clockid_t clockid,qdisc_watchdog_func_t func);
> +void qdisc_watchdog_init_func(struct qdisc_watchdog *wd,struct Qdisc 
> *qdisc,qdisc_watchdog_func_t func);
> +enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer);
> +
>  void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc 
> *qdisc,
>  clockid_t clockid);
>  void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 50794125bf02..fea08d10c811 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -22,7 +22,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>
> @@ -591,7 +590,7 @@ void qdisc_warn_nonwc(const char *txt, struct Qdisc 
> *qdisc)
>  }
>  EXPORT_SYMBOL(qdisc_warn_nonwc);
>
> -static enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer)
> +enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer)
>  {
> struct qdisc_watchdog *wd = container_of(timer, struct qdisc_watchdog,
>  timer);
> @@ -602,7 +601,24 @@ static enum hrtimer_restart qdisc_watchdog(struct 
> hrtimer *timer)
>
> return HRTIMER_NORESTART;
>  }
> +EXPORT_SYMBOL(qdisc_watchdog);
>
> +void qdisc_watchdog_init_clockid_func(struct qdisc_watchdog *wd,struct Qdisc 
> *qdisc,clockid_t clockid,qdisc_watchdog_func_t func)
> +{
> +   hrtimer_init(>timer,clockid,HRTIMER_MODE_ABS_PINNED);
> +   if(!func)
> +   wd->timer.function = qdisc_watchdog;
> +   else
> +   wd->timer.function = func;
> +   wd->qdisc = qdisc;
> +}
> +EXPORT_SYMBOL(qdisc_watchdog_init_clockid_func);
> +
> +void qdisc_watchdog_init_func(struct qdisc_watchdog *wd,struct Qdisc 
> *qdisc,qdisc_watchdog_func_t func)
> +{
> +   qdisc_watchdog_init_clockid_func(wd,qdisc,CLOCK_MONOTONIC,func);
> +}
> +EXPORT_SYMBOL(qdisc_watchdog_init_func);
>  void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc 
> *qdisc,
>  clockid_t clockid)
>  {
> diff --git a/net/sched/sch_cbs.c b/net/sched/sch_cbs.c
> index 

Re: [PATCH net-next] genetlink: Remove unused function genl_err_attr()

2020-09-17 Thread Cong Wang
On Wed, Sep 16, 2020 at 9:33 AM YueHaibing  wrote:
>
> It is never used, so can remove it.

This is a bit confusing: it was actually used before; see commit
ab0d76f6823cc3a4e2.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-10 Thread Cong Wang
On Thu, Sep 3, 2020 at 8:21 PM Kehuan Feng  wrote:
> I also tried Cong's patch (shown below on my tree) and it could avoid
> the issue (stressing for 30 minutus for three times and not jitter
> observed).

Thanks for verifying it!

>
> --- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
> +++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
> @@ -127,8 +127,7 @@
>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  {
>   if (qdisc->flags & TCQ_F_NOLOCK) {
> - if (!spin_trylock(>seqlock))
> - return false;
> + spin_lock(>seqlock);
>   } else if (qdisc_is_running(qdisc)) {
>   return false;
>   }
>
> I am not actually know what you are discussing above. It seems to me
> that Cong's patch is similar as disabling lockless feature.

From performance's perspective, yeah. Did you see any performance
downgrade with my patch applied? It would be great if you could compare
it with removing NOLOCK. And if the performance is as bad as without
NOLOCK, then we can remove the NOLOCK bit for pfifo_fast, at least
for now.

Thanks.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-10 Thread Cong Wang
On Thu, Sep 3, 2020 at 10:08 PM John Fastabend  wrote:
> Maybe this would unlock us,
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7df6c9617321..9b09429103f1 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3749,7 +3749,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, 
> struct Qdisc *q,
>
> if (q->flags & TCQ_F_NOLOCK) {
> rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> -   qdisc_run(q);
> +   __qdisc_run(q);
>
> if (unlikely(to_free))
> kfree_skb_list(to_free);
>
>
> Per other thread we also need the state deactivated check added
> back.

I guess not, because pfifo_fast's dequeue seems to require q->seqlock,
according to the comments around qdisc_run(), so we cannot just get rid of
qdisc_run_begin()/qdisc_run_end() here.
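
For reference, qdisc_run() is roughly the following (quoted from memory, see
include/net/pkt_sched.h for the real thing):

	static inline void qdisc_run(struct Qdisc *q)
	{
		if (qdisc_run_begin(q)) {	/* takes q->seqlock for NOLOCK */
			__qdisc_run(q);
			qdisc_run_end(q);
		}
	}

so calling __qdisc_run() directly would run the dequeue path without the
seqlock held.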

Thanks.


Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-09-10 Thread Cong Wang
On Tue, Sep 8, 2020 at 4:06 AM Yunsheng Lin  wrote:
>
> Currently there is concurrent reset and enqueue operation for the
> same lockless qdisc when there is no lock to synchronize the
> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> out-of-bounds access for priv->ring[] in hns3 driver if user has
> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> skb with a larger queue_mapping after the corresponding qdisc is
> reset, and call hns3_nic_net_xmit() with that skb later.
>
> Reused the existing synchronize_net() in dev_deactivate_many() to
> make sure skb with larger queue_mapping enqueued to old qdisc(which
> is saved in dev_queue->qdisc_sleeping) will always be reset when
> dev_reset_queue() is called.
>
> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> Signed-off-by: Yunsheng Lin 
> ---
> ChangeLog V2:
> Reuse existing synchronize_net().
> ---
>  net/sched/sch_generic.c | 48 +---
>  1 file changed, 33 insertions(+), 15 deletions(-)
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 265a61d..54c4172 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1131,24 +1131,10 @@ EXPORT_SYMBOL(dev_activate);
>
>  static void qdisc_deactivate(struct Qdisc *qdisc)
>  {
> -   bool nolock = qdisc->flags & TCQ_F_NOLOCK;
> -
> if (qdisc->flags & TCQ_F_BUILTIN)
> return;
> -   if (test_bit(__QDISC_STATE_DEACTIVATED, >state))
> -   return;
> -
> -   if (nolock)
> -   spin_lock_bh(>seqlock);
> -   spin_lock_bh(qdisc_lock(qdisc));
>
> set_bit(__QDISC_STATE_DEACTIVATED, >state);
> -
> -   qdisc_reset(qdisc);
> -
> -   spin_unlock_bh(qdisc_lock(qdisc));
> -   if (nolock)
> -   spin_unlock_bh(>seqlock);
>  }
>
>  static void dev_deactivate_queue(struct net_device *dev,
> @@ -1165,6 +1151,30 @@ static void dev_deactivate_queue(struct net_device 
> *dev,
> }
>  }
>
> +static void dev_reset_queue(struct net_device *dev,
> +   struct netdev_queue *dev_queue,
> +   void *_unused)
> +{
> +   struct Qdisc *qdisc;
> +   bool nolock;
> +
> +   qdisc = dev_queue->qdisc_sleeping;
> +   if (!qdisc)
> +   return;
> +
> +   nolock = qdisc->flags & TCQ_F_NOLOCK;
> +
> +   if (nolock)
> +   spin_lock_bh(>seqlock);
> +   spin_lock_bh(qdisc_lock(qdisc));


I think you do not need this lock for the lockless one.

> +
> +   qdisc_reset(qdisc);
> +
> +   spin_unlock_bh(qdisc_lock(qdisc));
> +   if (nolock)
> +   spin_unlock_bh(>seqlock);
> +}
> +
>  static bool some_qdisc_is_busy(struct net_device *dev)
>  {
> unsigned int i;
> @@ -1213,12 +1223,20 @@ void dev_deactivate_many(struct list_head *head)
> dev_watchdog_down(dev);
> }
>
> -   /* Wait for outstanding qdisc-less dev_queue_xmit calls.
> +   /* Wait for outstanding qdisc-less dev_queue_xmit calls or
> +* outstanding qdisc enqueuing calls.
>  * This is avoided if all devices are in dismantle phase :
>  * Caller will call synchronize_net() for us
>  */
> synchronize_net();
>
> +   list_for_each_entry(dev, head, close_list) {
> +   netdev_for_each_tx_queue(dev, dev_reset_queue, NULL);
> +
> +   if (dev_ingress_queue(dev))
> +   dev_reset_queue(dev, dev_ingress_queue(dev), NULL);
> +   }
> +
> /* Wait for outstanding qdisc_run calls. */
> list_for_each_entry(dev, head, close_list) {
> while (some_qdisc_is_busy(dev)) {

Do you want to reset before waiting for the outstanding TX actions?

I think it is safer to do it after; at least prior to commit 759ae57f1b
we did it after.
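
In other words, something like the ordering below in dev_deactivate_many();
this is only a sketch to show where the reset would move, not a tested patch:

	/* after dev_deactivate_queue() has swapped in noop_qdisc: */
	synchronize_net();	/* wait for readers of txq->qdisc */

	/* wait for the outstanding qdisc_run calls first ... */
	list_for_each_entry(dev, head, close_list) {
		while (some_qdisc_is_busy(dev))
			schedule_timeout_uninterruptible(1);
	}

	/* ... and only then reset the old (sleeping) qdiscs */
	list_for_each_entry(dev, head, close_list) {
		netdev_for_each_tx_queue(dev, dev_reset_queue, NULL);
		if (dev_ingress_queue(dev))
			dev_reset_queue(dev, dev_ingress_queue(dev), NULL);
	}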

Thanks.


Re: INFO: task hung in tcf_ife_init

2020-09-03 Thread Cong Wang
On Thu, Sep 3, 2020 at 2:47 AM Eric Dumazet  wrote:
> This commit might be related :
>
> commit 4e407ff5cd67ec76a1deec227b7982dc7f66
> Author: Cong Wang 
> Date:   Sun Aug 19 12:22:12 2018 -0700
>
> act_ife: move tcfa_lock down to where necessary

It does not look like my commit's fault. From my _quick_ understanding
of this problem, we somehow have a "deadlock" situation:

Thread A allocates an IDR entry with a specific index and calls
populate_metalist() to re-acquire the RTNL lock.

Thread B gets the RTNL lock right after thread A releases it, then waits
for thread A to finally commit the IDR change.

Thanks.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-03 Thread Cong Wang
On Thu, Sep 3, 2020 at 1:40 AM Paolo Abeni  wrote:
>
> On Wed, 2020-09-02 at 22:01 -0700, Cong Wang wrote:
> > Can you test the attached one-line fix? I think we are overthinking,
> > probably all
> > we need here is a busy wait.
>
> I think that will solve, but I also think that will kill NOLOCK
> performances due to really increased contention.

Yeah, we somehow end up with more locks (seqlock, skb array lock)
for lockless qdisc. What an irony... ;)

>
> At this point I fear we could consider reverting the NOLOCK stuff.
> I personally would hate doing so, but it looks like NOLOCK benefits are
> outweighed by its issues.

I agree, NOLOCK brings more pain than gain. There are many race
conditions hidden in the generic qdisc layer; another one is enqueue vs.
reset, which is being discussed in another thread.

Thanks.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-02 Thread Cong Wang
Hello, Kehuan

Can you test the attached one-line fix? I think we are overthinking this;
probably all we need here is a busy wait.

Thanks.
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d60e7c39d60c..fc1bacdb102b 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -156,8 +156,7 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
 	if (qdisc->flags & TCQ_F_NOLOCK) {
-		if (!spin_trylock(&qdisc->seqlock))
-			return false;
+		spin_lock(&qdisc->seqlock);
 		WRITE_ONCE(qdisc->empty, false);
 	} else if (qdisc_is_running(qdisc)) {
 		return false;


Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-09-02 Thread Cong Wang
On Wed, Sep 2, 2020 at 7:22 PM Yunsheng Lin  wrote:
>
> On 2020/9/3 9:48, Cong Wang wrote:
> > On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin  wrote:
> >>
> >> On 2020/9/3 8:35, Cong Wang wrote:
> >>> On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin  
> >>> wrote:
> >>>>
> >>>> On 2020/9/2 12:41, Cong Wang wrote:
> >>>>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin  
> >>>>> wrote:
> >>>>>>
> >>>>>> On 2020/9/2 2:24, Cong Wang wrote:
> >>>>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin  
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Currently there is concurrent reset and enqueue operation for the
> >>>>>>>> same lockless qdisc when there is no lock to synchronize the
> >>>>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>>>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>>>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >>>>>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >>>>>>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>>>>>> reset, and call hns3_nic_net_xmit() with that skb later.
> >>>>>>>
> >>>>>>> Can you be more specific here? Which call path requests a smaller
> >>>>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> >>>>>>> we already have a synchronize_net() there.
> >>>>>>
> >>>>>> When the netdevice is in active state, the synchronize_net() seems to
> >>>>>> do the correct work, as below:
> >>>>>>
> >>>>>> CPU 0:   CPU1:
> >>>>>> __dev_queue_xmit()   netif_set_real_num_tx_queues()
> >>>>>> rcu_read_lock_bh();
> >>>>>> netdev_core_pick_tx(dev, skb, sb_dev);
> >>>>>> .
> >>>>>> .   dev->real_num_tx_queues = txq;
> >>>>>> .   .
> >>>>>> .   .
> >>>>>> .   synchronize_net();
> >>>>>> .   .
> >>>>>> q->enqueue().
> >>>>>> .   .
> >>>>>> rcu_read_unlock_bh().
> >>>>>> qdisc_reset_all_tx_gt
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> Right.
> >>>>>
> >>>>>
> >>>>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a 
> >>>>>> problem
> >>>>>> too.
> >>>>>>
> >>>>>> The problem we hit is as below:
> >>>>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is 
> >>>>>> called
> >>>>>> to deactive the netdevice when user requested a smaller queue num, and
> >>>>>> txq->qdisc is already changed to noop_qdisc when calling
> >>>>>> netif_set_real_num_tx_queues(), so the synchronize_net() in the 
> >>>>>> function
> >>>>>> netif_set_real_num_tx_queues() does not help here.
> >>>>>
> >>>>> How could qdisc still be running after deactivating the device?
> >>>>
> >>>> qdisc could be running during the device deactivating process.
> >>>>
> >>>> The main process of changing channel number is as below:
> >>>>
> >>>> 1. dev_deactivate()
> >>>> 2. hns3 handware related setup
> >>>> 3. netif_set_real_num_tx_queues()
> >>>> 4. netif_tx_wake_all_queues()
> >>>> 5. dev_activate()
> >>>>
> >>>> During step 1, qdisc could be running while qdisc is resetting, so
> >>>> there could be skb left in the old qdisc(which will be restored back to
&g

Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-09-02 Thread Cong Wang
On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin  wrote:
>
> On 2020/9/3 8:35, Cong Wang wrote:
> > On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin  wrote:
> >>
> >> On 2020/9/2 12:41, Cong Wang wrote:
> >>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin  
> >>> wrote:
> >>>>
> >>>> On 2020/9/2 2:24, Cong Wang wrote:
> >>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin  
> >>>>> wrote:
> >>>>>>
> >>>>>> Currently there is concurrent reset and enqueue operation for the
> >>>>>> same lockless qdisc when there is no lock to synchronize the
> >>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >>>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >>>>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>>>> reset, and call hns3_nic_net_xmit() with that skb later.
> >>>>>
> >>>>> Can you be more specific here? Which call path requests a smaller
> >>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> >>>>> we already have a synchronize_net() there.
> >>>>
> >>>> When the netdevice is in active state, the synchronize_net() seems to
> >>>> do the correct work, as below:
> >>>>
> >>>> CPU 0:   CPU1:
> >>>> __dev_queue_xmit()   netif_set_real_num_tx_queues()
> >>>> rcu_read_lock_bh();
> >>>> netdev_core_pick_tx(dev, skb, sb_dev);
> >>>> .
> >>>> .   dev->real_num_tx_queues = txq;
> >>>> .   .
> >>>> .   .
> >>>> .   synchronize_net();
> >>>> .   .
> >>>> q->enqueue().
> >>>> .   .
> >>>> rcu_read_unlock_bh().
> >>>> qdisc_reset_all_tx_gt
> >>>>
> >>>>
> >>>
> >>> Right.
> >>>
> >>>
> >>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> >>>> too.
> >>>>
> >>>> The problem we hit is as below:
> >>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> >>>> to deactive the netdevice when user requested a smaller queue num, and
> >>>> txq->qdisc is already changed to noop_qdisc when calling
> >>>> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> >>>> netif_set_real_num_tx_queues() does not help here.
> >>>
> >>> How could qdisc still be running after deactivating the device?
> >>
> >> qdisc could be running during the device deactivating process.
> >>
> >> The main process of changing channel number is as below:
> >>
> >> 1. dev_deactivate()
> >> 2. hns3 handware related setup
> >> 3. netif_set_real_num_tx_queues()
> >> 4. netif_tx_wake_all_queues()
> >> 5. dev_activate()
> >>
> >> During step 1, qdisc could be running while qdisc is resetting, so
> >> there could be skb left in the old qdisc(which will be restored back to
> >> txq->qdisc during dev_activate()), as below:
> >>
> >> CPU 0:   CPU1:
> >> __dev_queue_xmit():  dev_deactivate_many():
> >> rcu_read_lock_bh();  qdisc_deactivate(qdisc);
> >> q = rcu_dereference_bh(txq->qdisc); .
> >> netdev_core_pick_tx(dev, skb, sb_dev);  .
> >> .
> >> .   
> >> rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
> >> .   .
> >> .   .
> >> .   .
> >> .   .
> >> q->enqueue().
> >
> >
> > Well, like I said, if the deactivated bit were tested before ->enqueue(),
> > there would be no packet queued after qdisc_deactivate().
>
> Only if the deactivated bit testing is also protected by qdisc->seqlock?
> otherwise there is still window between setting and testing the deactivated 
> bit.

Can you be more specific here? Why would testing or setting a bit not be
atomic?

AFAIU, qdisc->seqlock is an optimization to replace
__QDISC_STATE_RUNNING, which has nothing to do with the deactivated bit.

Thanks.


Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-09-02 Thread Cong Wang
On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin  wrote:
>
> On 2020/9/2 12:41, Cong Wang wrote:
> > On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin  wrote:
> >>
> >> On 2020/9/2 2:24, Cong Wang wrote:
> >>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin  
> >>> wrote:
> >>>>
> >>>> Currently there is concurrent reset and enqueue operation for the
> >>>> same lockless qdisc when there is no lock to synchronize the
> >>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>> reset, and call hns3_nic_net_xmit() with that skb later.
> >>>
> >>> Can you be more specific here? Which call path requests a smaller
> >>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> >>> we already have a synchronize_net() there.
> >>
> >> When the netdevice is in active state, the synchronize_net() seems to
> >> do the correct work, as below:
> >>
> >> CPU 0:   CPU1:
> >> __dev_queue_xmit()   netif_set_real_num_tx_queues()
> >> rcu_read_lock_bh();
> >> netdev_core_pick_tx(dev, skb, sb_dev);
> >> .
> >> .   dev->real_num_tx_queues = txq;
> >> .   .
> >> .   .
> >> .   synchronize_net();
> >> .   .
> >> q->enqueue().
> >> .   .
> >> rcu_read_unlock_bh().
> >> qdisc_reset_all_tx_gt
> >>
> >>
> >
> > Right.
> >
> >
> >> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> >> too.
> >>
> >> The problem we hit is as below:
> >> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> >> to deactive the netdevice when user requested a smaller queue num, and
> >> txq->qdisc is already changed to noop_qdisc when calling
> >> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> >> netif_set_real_num_tx_queues() does not help here.
> >
> > How could qdisc still be running after deactivating the device?
>
> qdisc could be running during the device deactivating process.
>
> The main process of changing channel number is as below:
>
> 1. dev_deactivate()
> 2. hns3 handware related setup
> 3. netif_set_real_num_tx_queues()
> 4. netif_tx_wake_all_queues()
> 5. dev_activate()
>
> During step 1, qdisc could be running while qdisc is resetting, so
> there could be skb left in the old qdisc(which will be restored back to
> txq->qdisc during dev_activate()), as below:
>
> CPU 0:   CPU1:
> __dev_queue_xmit():  dev_deactivate_many():
> rcu_read_lock_bh();  qdisc_deactivate(qdisc);
> q = rcu_dereference_bh(txq->qdisc); .
> netdev_core_pick_tx(dev, skb, sb_dev);  .
> .
> .   rcu_assign_pointer(dev_queue->qdisc, 
> qdisc_default);
> .   .
> .   .
> .   .
> .   .
> q->enqueue().


Well, like I said, if the deactivated bit were tested before ->enqueue(),
there would be no packet queued after qdisc_deactivate().


> .   .
> rcu_read_unlock_bh().
>
> And During step 3, txq->qdisc is pointing to noop_qdisc, so the qdisc_reset()
> only reset the noop_qdisc, but not the actual qdisc, which is stored in
> txq->qdisc_sleeping, so the actual qdisc may still have skb.
>
> When hns3_link_status_change() call step 4 and 5, it will restore all queue's
> qdisc using txq->qdisc_sleeping and schedule all queue with net_tx_action().
> The skb enqueued in step 1 may be dequeued and run, whi

Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-09-01 Thread Cong Wang
On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin  wrote:
>
> On 2020/9/2 2:24, Cong Wang wrote:
> > On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin  wrote:
> >>
> >> Currently there is concurrent reset and enqueue operation for the
> >> same lockless qdisc when there is no lock to synchronize the
> >> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >> skb with a larger queue_mapping after the corresponding qdisc is
> >> reset, and call hns3_nic_net_xmit() with that skb later.
> >
> > Can you be more specific here? Which call path requests a smaller
> > tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> > we already have a synchronize_net() there.
>
> When the netdevice is in active state, the synchronize_net() seems to
> do the correct work, as below:
>
> CPU 0:   CPU1:
> __dev_queue_xmit()   netif_set_real_num_tx_queues()
> rcu_read_lock_bh();
> netdev_core_pick_tx(dev, skb, sb_dev);
> .
> .   dev->real_num_tx_queues = txq;
> .   .
> .   .
> .   synchronize_net();
> .   .
> q->enqueue().
> .   .
> rcu_read_unlock_bh().
> qdisc_reset_all_tx_gt
>
>

Right.


> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> too.
>
> The problem we hit is as below:
> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> to deactive the netdevice when user requested a smaller queue num, and
> txq->qdisc is already changed to noop_qdisc when calling
> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> netif_set_real_num_tx_queues() does not help here.

How could qdisc still be running after deactivating the device?


>
> >
> >>
> >> Avoid the above concurrent op by calling synchronize_rcu_tasks()
> >> after assigning new qdisc to dev_queue->qdisc and before calling
> >> qdisc_deactivate() to make sure skb with larger queue_mapping
> >> enqueued to old qdisc will always be reset when qdisc_deactivate()
> >> is called.
> >
> > Like Eric said, it is not nice to call such a blocking function when
> > we have a large number of TX queues. Possibly we just need to
> > add a synchronize_net() as in netif_set_real_num_tx_queues(),
> > if it is missing.
>
> As above, the synchronize_net() in netif_set_real_num_tx_queues() seems
> to work when netdevice is in active state, but does not work when in
> deactive.

Please explain why a deactivated device still has its qdisc running.

At least before commit 379349e9bc3b4, we always tested the deactivated
bit before enqueueing. Are you complaining about that commit?
That commit is indeed suspicious; at least it does not precisely revert
commit ba27b4cdaaa66561aaedb21 as it claims.
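
For reference, the check we used to have in __dev_xmit_skb() looked roughly
like the simplified sketch below (from memory, not an exact quote of the old
code, which also had a bypass path under qdisc_run_begin()):

	if (q->flags & TCQ_F_NOLOCK) {
		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
			__qdisc_drop(skb, &to_free);
			rc = NET_XMIT_DROP;
		} else {
			rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
			qdisc_run(q);
		}
		if (unlikely(to_free))
			kfree_skb_list(to_free);
		return rc;
	}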


>
> And we do not want skb left in the old qdisc when netdevice is deactived,
> right?

Yes, and more importantly, qdisc should not be running after deactivation.

Thanks.


Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-09-01 Thread Cong Wang
On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin  wrote:
>
> Currently there is concurrent reset and enqueue operation for the
> same lockless qdisc when there is no lock to synchronize the
> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> out-of-bounds access for priv->ring[] in hns3 driver if user has
> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> skb with a larger queue_mapping after the corresponding qdisc is
> reset, and call hns3_nic_net_xmit() with that skb later.

Can you be more specific here? Which call path requests a smaller
tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
we already have a synchronize_net() there.

>
> Avoid the above concurrent op by calling synchronize_rcu_tasks()
> after assigning new qdisc to dev_queue->qdisc and before calling
> qdisc_deactivate() to make sure skb with larger queue_mapping
> enqueued to old qdisc will always be reset when qdisc_deactivate()
> is called.

Like Eric said, it is not nice to call such a blocking function when
we have a large number of TX queues. Possibly we just need to
add a synchronize_net() as in netif_set_real_num_tx_queues(),
if it is missing.

Thanks.


Re: [PATCH v5 1/3] net: introduce helper sendpage_ok() in include/linux/net.h

2020-08-17 Thread Cong Wang
On Sun, Aug 16, 2020 at 10:45 PM Christoph Hellwig  wrote:
>
> On Sun, Aug 16, 2020 at 10:55:09AM -0700, Cong Wang wrote:
> > On Sun, Aug 16, 2020 at 1:36 AM Coly Li  wrote:
> > >
> > > The original problem was from nvme-over-tcp code, who mistakenly uses
> > > kernel_sendpage() to send pages allocated by __get_free_pages() without
> > > __GFP_COMP flag. Such pages don't have refcount (page_count is 0) on
> > > tail pages, sending them by kernel_sendpage() may trigger a kernel panic
> > > from a corrupted kernel heap, because these pages are incorrectly freed
> > > in network stack as page_count 0 pages.
> > >
> > > This patch introduces a helper sendpage_ok(), it returns true if the
> > > checking page,
> > > - is not slab page: PageSlab(page) is false.
> > > - has page refcount: page_count(page) is not zero
> > >
> > > All drivers who want to send page to remote end by kernel_sendpage()
> > > may use this helper to check whether the page is OK. If the helper does
> > > not return true, the driver should try other non sendpage method (e.g.
> > > sock_no_sendpage()) to handle the page.
> >
> > Can we leave this helper to mm subsystem?
> >
> > I know it is for sendpage, but its implementation is all about some
> > mm details and its two callers do not belong to net subsystem either.
> >
> > Think this in another way: who would fix it if it is buggy? I bet mm people
> > should. ;)
>
> No.  This is all about a really unusual imitation in sendpage, which

So netdev people will have to understand and support PageSlab() or
page_count()?

If it is unusual even for mm people, how are netdev people supposed
to understand this unusual mm bug? At least not any better.

> is pretty much unexpected.  In fact the best thing would be to make
> sock_sendpage do the right thing and call sock_no_sendpage based
> on this condition, so that driver writers don't have to worry at all.

Agreed, but kernel_sendpage() still relies on mm to provide a helper
to make the decision and ensure this helper is always up-to-date.

In short, it is all about ownership.

Thanks.


Re: [PATCH v5 1/3] net: introduce helper sendpage_ok() in include/linux/net.h

2020-08-16 Thread Cong Wang
On Sun, Aug 16, 2020 at 1:36 AM Coly Li  wrote:
>
> The original problem was from nvme-over-tcp code, who mistakenly uses
> kernel_sendpage() to send pages allocated by __get_free_pages() without
> __GFP_COMP flag. Such pages don't have refcount (page_count is 0) on
> tail pages, sending them by kernel_sendpage() may trigger a kernel panic
> from a corrupted kernel heap, because these pages are incorrectly freed
> in network stack as page_count 0 pages.
>
> This patch introduces a helper sendpage_ok(), it returns true if the
> checking page,
> - is not slab page: PageSlab(page) is false.
> - has page refcount: page_count(page) is not zero
>
> All drivers who want to send page to remote end by kernel_sendpage()
> may use this helper to check whether the page is OK. If the helper does
> not return true, the driver should try other non sendpage method (e.g.
> sock_no_sendpage()) to handle the page.

Can we leave this helper to the mm subsystem?

I know it is for sendpage, but its implementation is all about mm details,
and its two callers do not belong to the net subsystem either.

Think of it this way: who would fix it if it were buggy? I bet mm people
should. ;)
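
For the record, the helper being proposed is tiny; roughly as posted in the
series:

	static inline bool sendpage_ok(struct page *page)
	{
		return !PageSlab(page) && page_count(page) >= 1;
	}

and a caller is expected to fall back to something like sock_no_sendpage()
whenever it returns false.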

Thanks.


Re: [PATCH] net: Fix potential memory leak in proto_register()

2020-08-12 Thread Cong Wang
On Wed, Aug 12, 2020 at 2:21 AM linmiaohe  wrote:
>
> Hi all:
> David Miller  wrote:
> >From: Cong Wang 
> >Date: Tue, 11 Aug 2020 16:02:51 -0700
> >
> >>> @@ -3406,6 +3406,16 @@ static void sock_inuse_add(struct net *net,
> >>> int val)  }  #endif
> >>>
> >>> +static void tw_prot_cleanup(struct timewait_sock_ops *twsk_prot) {
> >>> +   if (!twsk_prot)
> >>> +   return;
> >>> +   kfree(twsk_prot->twsk_slab_name);
> >>> +   twsk_prot->twsk_slab_name = NULL;
> >>> +   kmem_cache_destroy(twsk_prot->twsk_slab);
> >>
> >> Hmm, are you sure you can free the kmem cache name before
> >> kmem_cache_destroy()? To me, it seems kmem_cache_destroy() frees the
> >> name via slab_kmem_cache_release() via kfree_const().
> >> With your patch, we have a double-free on the name?
> >>
> >> Or am I missing anything?
> >
> >Yep, there is a double free here.
> >
> >Please fix this.
>
> Many thanks for both of you to point this issue out. But I'am not really 
> understand, could you please explain it more?
> As far as I can see, the double free path is:
> 1. kfree(twsk_prot->twsk_slab_name)
> 2. kmem_cache_destroy
> --> shutdown_memcg_caches
> --> shutdown_cache
> --> slab_kmem_cache_release
> --> kfree_const(s->name)
> But twsk_prot->twsk_slab_name is allocated from kasprintf via 
> kmalloc_track_caller while twsk_prot->twsk_slab->name is allocated
> via kstrdup_const. So I think twsk_prot->twsk_slab_name and 
> twsk_prot->twsk_slab->name point to different memory, and there is no double 
> free.
>

You are right. Since it is duplicated, there is no need to keep
->twsk_slab_name; we can just use twsk_slab->name. I will send
a patch to get rid of it.
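
Once ->twsk_slab_name is gone, the cleanup can shrink to something like the
untested sketch below:

	static void tw_prot_cleanup(struct timewait_sock_ops *twsk_prot)
	{
		if (!twsk_prot)
			return;
		kmem_cache_destroy(twsk_prot->twsk_slab);
		twsk_prot->twsk_slab = NULL;
	}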

> Or am I missing anything?
>
> By the way, req_prot_cleanup() do the same things, i.e. free the slab_name 
> before involve kmem_cache_destroy(). If there is a double
> free, so as here?

Ditto. Can be just removed.

Thanks.


Re: [PATCH] net: Fix potential memory leak in proto_register()

2020-08-11 Thread Cong Wang
On Mon, Aug 10, 2020 at 5:19 AM Miaohe Lin  wrote:
>
> If we failed to assign proto idx, we free the twsk_slab_name but forget to
> free the twsk_slab. Add a helper function tw_prot_cleanup() to free these
> together and also use this helper function in proto_unregister().
>
> Fixes: b45ce32135d1 ("sock: fix potential memory leak in proto_register()")
> Signed-off-by: Miaohe Lin 
> ---
>  net/core/sock.c | 25 +++--
>  1 file changed, 15 insertions(+), 10 deletions(-)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 49cd5ffe673e..c9083ad44ea1 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -3406,6 +3406,16 @@ static void sock_inuse_add(struct net *net, int val)
>  }
>  #endif
>
> +static void tw_prot_cleanup(struct timewait_sock_ops *twsk_prot)
> +{
> +   if (!twsk_prot)
> +   return;
> +   kfree(twsk_prot->twsk_slab_name);
> +   twsk_prot->twsk_slab_name = NULL;
> +   kmem_cache_destroy(twsk_prot->twsk_slab);

Hmm, are you sure you can free the kmem cache name before
kmem_cache_destroy()? To me, it seems kmem_cache_destroy()
frees the name via slab_kmem_cache_release() via kfree_const().
With your patch, we have a double-free on the name?

Or am I missing anything?

Thanks.


Re: [Linux-kernel-mentees] [PATCH net] ipvs: Fix uninit-value in do_ip_vs_set_ctl()

2020-08-10 Thread Cong Wang
On Mon, Aug 10, 2020 at 3:10 PM Peilin Ye  wrote:
>
> do_ip_vs_set_ctl() is referencing uninitialized stack value when `len` is
> zero. Fix it.

Which exact 'cmd' is it here?

I _guess_ it is one of the entries left uninitialized (i.e. zero) in
set_arglen[]. But if that is the case, shouldn't it be initialized to
sizeof(struct ip_vs_service_user) instead, given ip_vs_copy_usvc_compat()
is called anyway? Or maybe we should just ban the len==0 case.

In either case, it does not look like you fixed it correctly.

Thanks.


Re: [PATCH] [net/ipv6] ip6_output: Add ipv6_pinfo null check

2020-08-08 Thread Cong Wang
On Sat, Aug 8, 2020 at 10:07 AM Gaurav Singh  wrote:
>
> This PR fixes a possible segmentation violation.
>
> In function: ip6_xmit(), we have
> const struct ipv6_pinfo *np = inet6_sk(sk); which returns NULL
> unconditionally (regardless sk being  NULL or not).
>
> In include/linux/ipv6.h:
>
> static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
> {
> return NULL;
> }
>

Tell us who will use ip6_autoflowlabel() when CONFIG_IPV6
is disabled. :)


Re: [PATCHv2 net-next] dropwatch: Support monitoring of dropped frames

2020-08-04 Thread Cong Wang
On Tue, Aug 4, 2020 at 9:14 AM  wrote:
>
> From: Izabela Bakollari 
>
> Dropwatch is a utility that monitors dropped frames by having userspace
> record them over the dropwatch protocol over a file. This augument
> allows live monitoring of dropped frames using tools like tcpdump.
>
> With this feature, dropwatch allows two additional commands (start and
> stop interface) which allows the assignment of a net_device to the
> dropwatch protocol. When assinged, dropwatch will clone dropped frames,
> and receive them on the assigned interface, allowing tools like tcpdump
> to monitor for them.
>
> With this feature, create a dummy ethernet interface (ip link add dev
> dummy0 type dummy), assign it to the dropwatch kernel subsystem, by using
> these new commands, and then monitor dropped frames in real time by
> running tcpdump -i dummy0.

drop monitor is already able to send dropped packets to user space, and
Wireshark has already caught up with this feature:

https://code.wireshark.org/review/gitweb?p=wireshark.git;a=commitdiff;h=a94a860c0644ec3b8a129fd243674a2e376ce1c8

So what you propose here seems pretty much a duplicate?

Thanks.


Re: [PATCH] drivers/net/wan/lapbether: Use needed_headroom instead of hard_header_len

2020-07-27 Thread Cong Wang
Hello,

On Mon, Jul 27, 2020 at 12:41 PM Xie He  wrote:
>
> Hi Cong Wang,
>
> I'm wishing to change a driver from using "hard_header_len" to using
> "needed_headroom" to declare its needed headroom. I submitted a patch
> and it is decided it needs to be reviewed. I see you participated in
> "hard_header_len vs needed_headroom" discussions in the past. Can you
> help me review this patch? Thanks!
>
> The patch is at:
> http://patchwork.ozlabs.org/project/netdev/patch/20200726110524.151957-1-xie.he.0...@gmail.com/
>
> In my understanding, hard_header_len should be the length of the header
> created by dev_hard_header. Any additional headroom needed should be
> declared in needed_headroom instead of hard_header_len. I came to this
> conclusion by examining the logic of net/packet/af_packet.c:packet_snd.

I am not familiar with this WAN driver, but I suggest you look at
the following commit, which provides a lot of useful information:

commit 9454f7a895b822dd8fb4588fc55fda7c96728869
Author: Brian Norris 
Date:   Wed Feb 26 16:05:11 2020 -0800

mwifiex: set needed_headroom, not hard_header_len

hard_header_len provides limitations for things like AF_PACKET, such
that we don't allow transmitting packets smaller than this.

needed_headroom provides a suggested minimum headroom for SKBs, so that
we can trivally add our headers to the front.

The latter is the correct field to use in this case, while the former
mostly just prevents sending small AF_PACKET frames.

In any case, mwifiex already does its own bounce buffering [1] if we
don't have enough headroom, so hints (not hard limits) are all that are
needed.

This is the essentially the same bug (and fix) that brcmfmac had, fixed
in commit cb39288fd6bb ("brcmfmac: use ndev->needed_headroom to reserve
additional header space").

[1] mwifiex_hard_start_xmit():
if (skb_headroom(skb) < MWIFIEX_MIN_DATA_HEADER_LEN) {
[...]
/* Insufficient skb headroom - allocate a new skb */
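
Applied to your case, the net effect would be something like the fragment
below in the device setup path; MY_EXTRA_HDR_LEN is a hypothetical constant,
not something taken from the lapbether code:

	/* extra headroom the protocol needs beyond what dev_hard_header()
	 * itself builds: a hint for skb allocation, not a hard limit
	 */
	dev->needed_headroom = MY_EXTRA_HDR_LEN;

	/* hard_header_len should only cover the header that
	 * dev_hard_header() actually creates for this device
	 */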

Hope this helps.

Thanks.


Re: [PATCH] [net/ipv6] ip6_output: Add ipv6_pinfo null check

2020-07-27 Thread Cong Wang
On Mon, Jul 27, 2020 at 7:14 PM Gaurav Singh  wrote:
>
> Add return to fix build issue. Haven't reproduced this issue at
> my end.
>
> My hypothesis is this: In function: ip6_xmit(), we have
> const struct ipv6_pinfo *np = inet6_sk(sk); which returns NULL.
>
> Further down the function, there's a check:
> if (np) hlimit = hp->htop_limit

This check predates git history; at that time 'sk' could be NULL, and
hence so could 'np', so it does not mean it is still necessary now.

I looked at all callers of ip6_xmit(), and I don't see how it is called
with a non-full socket; nor could 'sk' be NULL after
commit b30bd282cbf5c46247a279a2e8d2aae027d9f1bf
("[IPV6]: ip6_xmit: remove unnecessary NULL ptr check").

Thanks.


Re: [PATCH] [net/ipv6] ip6_output: Add ipv6_pinfo null check

2020-07-27 Thread Cong Wang
On Sun, Jul 26, 2020 at 8:39 PM Gaurav Singh  wrote:
>
> ipv6_pinfo is initialized by inet6_sk() which returns NULL.

Why? It only returns NULL for timewait or request sock, but
I don't see how ip6_autoflowlabel() could be called on these
sockets. So please explain.
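
(For reference, inet6_sk() in the kernels of this era is roughly the
following, i.e. it is non-NULL for any full socket; this is a paraphrase
of include/linux/ipv6.h, so double-check the exact definition there:)

static inline struct ipv6_pinfo *inet6_sk(const struct sock *__sk)
{
	return sk_fullsock(__sk) ? inet_sk(__sk)->pinet6 : NULL;
}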

> Hence it can cause segmentation fault. Fix this by adding a
> NULL check.

Which exact call path? Do you have a full stack trace?

Thanks.


Re: [PATCH v2] net: ipv6: fix use-after-free Read in __xfrm6_tunnel_spi_lookup

2020-07-26 Thread Cong Wang
On Sat, Jul 25, 2020 at 11:12 PM B K Karthik  wrote:
>
> On Sun, Jul 26, 2020 at 11:05 AM Cong Wang  wrote:
> >
> > On Sat, Jul 25, 2020 at 8:09 PM B K Karthik  wrote:
> > > @@ -103,10 +103,10 @@ static int __xfrm6_tunnel_spi_check(struct net *net, u32 spi)
> > >  {
> > > struct xfrm6_tunnel_net *xfrm6_tn = xfrm6_tunnel_pernet(net);
> > > struct xfrm6_tunnel_spi *x6spi;
> > > -   int index = xfrm6_tunnel_spi_hash_byspi(spi);
> > > +   int index = xfrm6_tunnel_spi_hash_byaddr((const xfrm_address_t *)spi);
> > >
> > > hlist_for_each_entry(x6spi,
> > > -   &xfrm6_tn->spi_byspi[index],
> > > +   &xfrm6_tn->spi_byaddr[index],
> > >  list_byspi) {
> > > if (x6spi->spi == spi)
> >
> > How did you convince yourself this is correct? This lookup is still
> > using spi. :)
>
> I'm sorry, but my intention behind writing this patch was not to fix
> the UAF, but to fix a slab-out-of-bound.

Odd, your $subject is clearly UAF, so is the stack trace in your changelog.
:)


> If required, I can definitely change the subject line and resend the
> patch, but I figured this was correct for
> https://syzkaller.appspot.com/bug?id=058d05f470583ab2843b1d6785fa8d0658ef66ae
> . since that particular report did not have a reproducer,
> Dmitry Vyukov  suggested that I test this patch on
> other reports for xfrm/spi .

You have to change it to avoid misleading people.

>
> Forgive me if this was the wrong way to send a patch for that
> particular report, but I guessed since the reproducer did not trigger
> the crash
> for UAF, I would leave the subject line as 'fix UAF' :)
>
> xfrm6_spi_hash_by_hash seemed more convincing because I had to prevent
> a slab-out-of-bounds because it uses ipv6_addr_hash.
> It would be of great help if you could help me understand how this was
> able to fix a UAF.

Sure, you just avoid a pointer deref, which of course can fix the UAF,
but I still don't think it is correct in any aspect.

Even if it is an OOB, you still have to explain why it happened. Once
again, I can't see how it could happen either.

>
> >
> > More importantly, can you explain how UAF happens? Apparently
> > the syzbot stack traces you quote make no sense at all. I also
> > looked at other similar reports, none of them makes sense to me.
>
> Forgive me, but I do not understand what you mean by the stack traces
> (this or other similar reports) "make no sense".

Because the stack trace in your changelog clearly shows it is allocated
in tomoyo_init_log(), which is a buffer in struct tomoyo_query, but
none of the xfrm paths uses it. Or do you see anything otherwise?

Thanks.


Re: [PATCH v2] net: ipv6: fix use-after-free Read in __xfrm6_tunnel_spi_lookup

2020-07-25 Thread Cong Wang
On Sat, Jul 25, 2020 at 8:09 PM B K Karthik  wrote:
> @@ -103,10 +103,10 @@ static int __xfrm6_tunnel_spi_check(struct net *net, u32 spi)
>  {
> struct xfrm6_tunnel_net *xfrm6_tn = xfrm6_tunnel_pernet(net);
> struct xfrm6_tunnel_spi *x6spi;
> -   int index = xfrm6_tunnel_spi_hash_byspi(spi);
> +   int index = xfrm6_tunnel_spi_hash_byaddr((const xfrm_address_t *)spi);
>
> hlist_for_each_entry(x6spi,
> -   &xfrm6_tn->spi_byspi[index],
> +   &xfrm6_tn->spi_byaddr[index],
>  list_byspi) {
> if (x6spi->spi == spi)

How did you convince yourself this is correct? This lookup is still
using spi. :)

More importantly, can you explain how UAF happens? Apparently
the syzbot stack traces you quote make no sense at all. I also
looked at other similar reports, none of them makes sense to me.

Thanks.


Re: KASAN: use-after-free Read in vlan_dev_get_iflink

2020-07-23 Thread Cong Wang
#syz test: https://github.com/congwang/linux.git net


Re: KASAN: use-after-free Read in macvlan_dev_get_iflink

2020-07-23 Thread Cong Wang
#syz test: https://github.com/congwang/linux.git net


Re: KASAN: use-after-free Write in __linkwatch_run_queue

2020-07-22 Thread Cong Wang
#syz test: https://github.com/congwang/linux.git net


Re: [PATCH v4 net] rtnetlink: Fix memory(net_device) leak when ->newlink fails

2020-07-14 Thread Cong Wang
On Tue, Jul 14, 2020 at 6:27 PM Weilong Chen  wrote:
>
> When vlan_newlink's call to register_vlan_dev fails, it might return an
> error with dev->reg_state = NETREG_UNREGISTERED. rtnl_newlink should then
> free the memory, but currently rtnl_newlink only frees the memory whose
> state is NETREG_UNINITIALIZED.
>
> BUG: memory leak
> unreferenced object 0x8881051de000 (size 4096):
>   comm "syz-executor139", pid 560, jiffies 4294745346 (age 32.445s)
>   hex dump (first 32 bytes):
> 76 6c 61 6e 32 00 00 00 00 00 00 00 00 00 00 00  vlan2...
> 00 45 28 03 81 88 ff ff 00 00 00 00 00 00 00 00  .E(.
>   backtrace:
> [<47527e31>] kmalloc_node include/linux/slab.h:578 [inline]
> [<47527e31>] kvmalloc_node+0x33/0xd0 mm/util.c:574
> [<2b59e3bc>] kvmalloc include/linux/mm.h:753 [inline]
> [<2b59e3bc>] kvzalloc include/linux/mm.h:761 [inline]
> [<2b59e3bc>] alloc_netdev_mqs+0x83/0xd90 net/core/dev.c:9929
> [<6076752a>] rtnl_create_link+0x2c0/0xa20 
> net/core/rtnetlink.c:3067
> [<572b3be5>] __rtnl_newlink+0xc9c/0x1330 net/core/rtnetlink.c:3329
> [] rtnl_newlink+0x66/0x90 net/core/rtnetlink.c:3397
> [<52c7c0a9>] rtnetlink_rcv_msg+0x540/0x990 
> net/core/rtnetlink.c:5460
> [<4b5cb379>] netlink_rcv_skb+0x12b/0x3a0 
> net/netlink/af_netlink.c:2469
> [] netlink_unicast_kernel net/netlink/af_netlink.c:1303 
> [inline]
> [] netlink_unicast+0x4c6/0x690 
> net/netlink/af_netlink.c:1329
> [] netlink_sendmsg+0x735/0xcc0 
> net/netlink/af_netlink.c:1918
> [<9221ebf7>] sock_sendmsg_nosec net/socket.c:652 [inline]
> [<9221ebf7>] sock_sendmsg+0x109/0x140 net/socket.c:672
> [<1c30ffe4>] sys_sendmsg+0x5f5/0x780 net/socket.c:2352
> [] ___sys_sendmsg+0x11d/0x1a0 net/socket.c:2406
> [<07297384>] __sys_sendmsg+0xeb/0x1b0 net/socket.c:2439
> [<0eb29b11>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
> [<6839b4d0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Fixes: e51fb152318ee6 ("rtnetlink: fix a memory leak when ->newlink fails")

This bug was apparently not introduced by my commit above.

It should be commit cb626bf566eb4433318d35681286c494f0,
right? That commit introduced NETREG_UNREGISTERED on the path.
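
For context, the kind of change being discussed is to also free the
netdev when ->newlink leaves it in NETREG_UNREGISTERED; a rough sketch of
the __rtnl_newlink() error path (not the final patch, just the idea):

	if (err < 0) {
		/* If the device never completed registration, or its
		 * registration was rolled back by ->newlink, free it here
		 * instead of leaking it. */
		if (dev->reg_state == NETREG_UNINITIALIZED ||
		    dev->reg_state == NETREG_UNREGISTERED)
			free_netdev(dev);
		goto out;
	}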


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-07-08 Thread Cong Wang
On Tue, Jul 7, 2020 at 7:18 AM Paolo Abeni  wrote:
> So the regression with 2 pktgen threads is still relevant. 'perf' shows
> relevant time spent into net_tx_action() and __netif_schedule().

So, touching the __QDISC_STATE_SCHED bit in __dev_xmit_skb() is
not a good idea.

Let me see if there is any other way to fix this.

Thanks.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-07-01 Thread Cong Wang
On Wed, Jul 1, 2020 at 9:05 AM Cong Wang  wrote:
>
> On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt  wrote:
> > Do either of you know if there's been any development on a fix for this
> > issue? If not we can propose something.
>
> If you have a reproducer, I can look into this.

Does the attached patch fix this bug completely?

Thanks.
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 9092e697059e..5a03cded3054 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -123,7 +123,7 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 
 void __qdisc_run(struct Qdisc *q);
 
-static inline void qdisc_run(struct Qdisc *q)
+static inline bool qdisc_run(struct Qdisc *q)
 {
 	if (qdisc_run_begin(q)) {
 		/* NOLOCK qdisc must check 'state' under the qdisc seqlock
@@ -133,7 +133,9 @@ static inline void qdisc_run(struct Qdisc *q)
 		likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
 			__qdisc_run(q);
 		qdisc_run_end(q);
+		return true;
 	}
+	return false;
 }
 
 static inline __be16 tc_skb_protocol(const struct sk_buff *skb)
diff --git a/net/core/dev.c b/net/core/dev.c
index 90b59fc50dc9..c7e48356132a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 
 	if (q->flags & TCQ_F_NOLOCK) {
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-		qdisc_run(q);
+		if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
+			__netif_schedule(q);
 
 		if (unlikely(to_free))
 			kfree_skb_list(to_free);


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-07-01 Thread Cong Wang
On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt  wrote:
> Do either of you know if there's been any development on a fix for this
> issue? If not we can propose something.

If you have a reproducer, I can look into this.

Thanks.


Re: How do you investigate the cause of total hang? Is there some information that I should pay attention to in order to get some hint?

2020-07-01 Thread Cong Wang
On Tue, Jun 30, 2020 at 7:49 PM 孙世龙 sunshilong  wrote:
>
> Hi, list
>
> My x86 machine(linux4.19) sometimes hangs, suddenly not responding in
> any way to the mouse or the keyboard.
>
> How can I investigate why it hung up? Is there extra information I can
> find for a clue? Is there anything less drastic than power-off to get
> some kind of action, if only some limited shell or just beeps,
> but might give a clue?
>

If the hang was a crash for which you didn't get a chance to capture the
last kernel log, you can use kdump to collect it. The kernel log tells
what kind of crash it is: a NULL pointer deref, a kernel page fault,
etc.

If the hang was a hard lockup, you have to turn on the lockup detector
and also kdump to capture what the detector reports.

Thanks.


Re: KASAN: use-after-free Read in nl802154_dump_wpan_phy (2)

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: vmalloc-out-of-bounds Read in tipc_nl_publ_dump

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: use-after-free Read in tipc_nl_node_dump_monitor_peer (2)

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: vmalloc-out-of-bounds Read in netdev_name_node_lookup_rcu

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: vmalloc-out-of-bounds Read in __nla_validate_parse

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: vmalloc-out-of-bounds Read in __cfg802154_wpan_dev_from_attrs

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: use-after-free Read in devlink_get_from_attrs

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: use-after-free Read in tipc_nl_publ_dump (2)

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: use-after-free Read in dev_get_by_name

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: use-after-free Read in netdev_name_node_lookup_rcu

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: use-after-free Read in netdev_name_node_lookup_rcu

2020-06-29 Thread Cong Wang
On Mon, Jun 29, 2020 at 6:17 PM Jason A. Donenfeld  wrote:
>
> Hey Cong,

Hi, Jason

>
> I'm wondering if the below error is related to what you've been
> looking at yesterday. AFAICT, there's a simple UaF on the attrbuf
> passed to the start method. I recall recently you were working on the
> locking in genetlink's family buffers and wound up mallocing some
> things, so it seems like this might be related. See below.

Yeah, very likely it is the same bug I have fixed. I will close
this together with the others.

Thanks.


Re: possible deadlock in dev_mc_unsync

2020-06-27 Thread Cong Wang
#syz test: https://github.com/congwang/linux.git net


Re: KASAN: use-after-free Read in tipc_nl_node_dump_monitor_peer (2)

2020-06-26 Thread Cong Wang
#syz test: https://github.com/congwang/linux.git net


Re: [PATCH] net: genetlink: Fix memleak in genl_family_rcv_msg_dumpit()

2020-06-02 Thread Cong Wang
On Mon, Jun 1, 2020 at 11:47 PM YueHaibing  wrote:
> @@ -630,6 +625,9 @@ static int genl_family_rcv_msg_dumpit(const struct genl_family *family,
> err = __netlink_dump_start(net->genl_sock, skb, nlh, &c);
> }
>
> +   genl_family_rcv_msg_attrs_free(info->family, info->attrs, true);
> +   genl_dumpit_info_free(info);
> +
> return err;
>  }

I do not think you can just move it after __netlink_dump_start(),
because cb->done() can be called, for example, in netlink_sock_destruct()
too.


Re: [PATCH] flow_dissector: work around stack frame size warning

2020-05-29 Thread Cong Wang
On Fri, May 29, 2020 at 1:14 PM Arnd Bergmann  wrote:
>
> The fl_flow_key structure is around 500 bytes, so having two of them
> on the stack in one function now exceeds the warning limit after an
> otherwise correct change:
>
> net/sched/cls_flower.c:298:12: error: stack frame size of 1056 bytes in 
> function 'fl_classify' [-Werror,-Wframe-larger-than=]
>
> I suspect the fl_classify function could be reworked to only have one
> of them on the stack and modify it in place, but I could not work out
> how to do that.
>
> As a somewhat hacky workaround, move one of them into an out-of-line
> function to reduce its scope. This does not necessarily reduce the stack
> usage of the outer function, but at least the second copy is removed
> from the stack during most of it and does not add up to whatever is
> called from there.
>
> I now see 552 bytes of stack usage for fl_classify(), plus 528 bytes
> for fl_mask_lookup().
>
> Fixes: 58cff782cc55 ("flow_dissector: Parse multiple MPLS Label Stack 
> Entries")
> Signed-off-by: Arnd Bergmann 

I think this is probably the quickest way to address this warning,
so:

Acked-by: Cong Wang 

Thanks.
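
For readers unfamiliar with the trick, the pattern being acked looks
roughly like this (generic names, not the actual cls_flower code): the
second large key only ever lives on the noinline helper's frame.

struct big_key {
	u8 bytes[512];			/* stand-in for the ~500-byte fl_flow_key */
};

static noinline bool big_mask_lookup(const struct big_key *key,
				     const struct big_key *mask)
{
	struct big_key mkey;		/* second large local, confined to this frame */
	size_t i;

	for (i = 0; i < sizeof(mkey.bytes); i++)
		mkey.bytes[i] = key->bytes[i] & mask->bytes[i];
	/* ... the hash-table lookup on the masked copy would go here ... */
	return mkey.bytes[0] != 0;
}

static bool big_classify(const struct big_key *mask)
{
	struct big_key key = {};	/* first large local */

	/* ... dissect the packet into key ... */
	return big_mask_lookup(&key, mask);
}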


Re: [PATCH -net-next] net: psample: depends on INET

2020-05-22 Thread Cong Wang
On Fri, May 22, 2020 at 12:48 PM Randy Dunlap  wrote:
>
> On 5/22/20 12:17 PM, Cong Wang wrote:
> > On Fri, May 22, 2020 at 12:03 PM Randy Dunlap  wrote:
> >>
> >> From: Randy Dunlap 
> >>
> >> Fix psample build error when CONFIG_INET is not set/enabled.
> >> PSAMPLE should depend on INET instead of NET since
> >> ip_tunnel_info_opts() is only present for CONFIG_INET.
> >>
> >> ../net/psample/psample.c: In function ‘__psample_ip_tun_to_nlattr’:
> >> ../net/psample/psample.c:216:25: error: implicit declaration of function 
> >> ‘ip_tunnel_info_opts’; did you mean ‘ip_tunnel_info_opts_set’? 
> >> [-Werror=implicit-function-declaration]
> >
> > Or just make this tunnel support optional. psample does not
> > require it to function correctly.
>
> Sure, I thought of that, but it's not clear to me which bits of it
> to make optional, so I'll leave it for its maintainer to handle.

The code that commit d8bed686ab96169ac80b497d1cbed89300d97f83
adds is optional, so it can just be put into #ifdef's.

Thanks.
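
A rough sketch of what the #ifdef approach could look like (the
psample_put_tun_info() helper name and the guard placement are made up
here for illustration; this is not a tested patch):

#ifdef CONFIG_INET
static int psample_put_tun_info(struct sk_buff *nl_skb,
				const struct ip_tunnel_info *tun_info)
{
	/* ... the existing code that needs ip_tunnel_info_opts() ... */
	return 0;
}
#else
static inline int psample_put_tun_info(struct sk_buff *nl_skb,
				       const void *tun_info)
{
	return 0;	/* tunnel metadata is simply not reported */
}
#endif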


Re: [PATCH -net-next] net: psample: depends on INET

2020-05-22 Thread Cong Wang
On Fri, May 22, 2020 at 12:03 PM Randy Dunlap  wrote:
>
> From: Randy Dunlap 
>
> Fix psample build error when CONFIG_INET is not set/enabled.
> PSAMPLE should depend on INET instead of NET since
> ip_tunnel_info_opts() is only present for CONFIG_INET.
>
> ../net/psample/psample.c: In function ‘__psample_ip_tun_to_nlattr’:
> ../net/psample/psample.c:216:25: error: implicit declaration of function 
> ‘ip_tunnel_info_opts’; did you mean ‘ip_tunnel_info_opts_set’? 
> [-Werror=implicit-function-declaration]

Or just make this tunnel support optional. psample does not
require it to function correctly.

Thanks.


Re: BUG: unable to handle kernel paging request in fl_dump_key

2020-05-17 Thread Cong Wang
On Fri, May 15, 2020 at 6:53 AM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:99addbe3 net: broadcom: Select BROADCOM_PHY for BCMGENET
> git tree:   net
> console output: https://syzkaller.appspot.com/x/log.txt?x=173e568c10
> kernel config:  https://syzkaller.appspot.com/x/.config?x=b0212dbee046bc1f
> dashboard link: https://syzkaller.appspot.com/bug?extid=9c1be56e9317b795e874
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+9c1be56e9317b795e...@syzkaller.appspotmail.com
>
> BUG: unable to handle page fault for address: fbfff4a9538a
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x) - not-present page
> PGD 21ffe5067 P4D 21ffe5067 PUD 21ffe4067 PMD 0
> Oops:  [#1] PREEMPT SMP KASAN
> CPU: 1 PID: 5831 Comm: syz-executor.3 Not tainted 5.7.0-rc4-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:fl_dump_key+0x8c/0x1980 net/sched/cls_flower.c:2514
> Code: 04 f2 04 f2 c7 40 0c 04 f3 f3 f3 65 48 8b 04 25 28 00 00 00 48 89 84 24 
> b0 00 00 00 31 c0 e8 3b 0d 20 fb 48 89 e8 48 c1 e8 03 <42> 0f b6 04 30 84 c0 
> 74 08 3c 03 0f 8e 6f 17 00 00 44 8b 75 00 31
> RSP: 0018:c900019672d8 EFLAGS: 00010a03
> RAX: 14a9538a RBX: a54a9a8f RCX: c9000f733000
> RDX: 080f RSI: 86532275 RDI: 88808f68b800
> RBP: a54a9c57 R08: 888096266200 R09: 888097a5603c
> R10: 888097a56036 R11: ed1012f4ac06 R12: 88808f68b800
> R13: 88806780a100 R14: dc00 R15: 88806780a100
> FS:  7f0a86395700() GS:8880ae70() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: fbfff4a9538a CR3: 99de6000 CR4: 001406e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  fl_tmplt_dump+0xcf/0x250 net/sched/cls_flower.c:2784
>  tc_chain_fill_node+0x48e/0x7c0 net/sched/cls_api.c:2707
>  tc_chain_notify+0x189/0x2e0 net/sched/cls_api.c:2733
>  tc_ctl_chain+0xb82/0x1080 net/sched/cls_api.c:2919

I guess the chain template name must be specified by the user:

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 0a7ecc292bd3..bcacb7db70c6 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -2782,8 +2782,10 @@ static int tc_chain_tmplt_add(struct tcf_chain *chain, struct net *net,
 	void *tmplt_priv;
 
 	/* If kind is not set, user did not specify template. */
-	if (!tca[TCA_KIND])
-		return 0;
+	if (!tca[TCA_KIND]) {
+		NL_SET_ERR_MSG(extack, "TC chain template name is not specified");
+		return -EINVAL;
+	}
 
 	if (tcf_proto_check_kind(tca[TCA_KIND], name)) {
 		NL_SET_ERR_MSG(extack, "Specified TC chain template name too long");


Re: general protection fault vs Oops

2020-05-17 Thread Cong Wang
On Sat, May 16, 2020 at 9:16 AM Subhashini Rao Beerisetty
 wrote:
> Yes, those are out-of-tree modules. Basically, my question is, in
> general what is the difference between 'general protection fault' and
> 'Oops' failure in kernel mode.

For your case, they are likely just different consequences of the same
memory error. Let's assume it is a use-after-free; the behavior of a UAF
is undefined: if the memory freed by the kernel is also unmapped from the
kernel address space, you get a page fault when using it afterwards, and
that is an Oops. Or, if the memory freed by the kernel gets reallocated
and remapped as read-only, you get a general protection fault when
writing to it afterwards.
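
As a contrived illustration (not taken from the modules in question),
all of those outcomes can come from the same line of code:

#include <linux/slab.h>

struct foo {
	int bar;
};

static void uaf_example(void)
{
	struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

	if (!p)
		return;
	kfree(p);
	/* Use-after-free: depending on what happened to the freed memory,
	 * this write may Oops (page unmapped), raise a general protection
	 * fault, or silently corrupt whatever reused the allocation. */
	p->bar = 1;
}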

