Re: netconsole deadlock with virtnet

2020-11-24 Thread Jason Wang


On 2020/11/25 12:20 AM, Jakub Kicinski wrote:

On Tue, 24 Nov 2020 11:22:03 +0800 Jason Wang wrote:

Perhaps you need the trylock in virtnet_poll_tx()?

That could work. Best if we used normal lock if !!budget, and trylock
when budget is 0. But maybe that's too hairy.

If we use trylock, we may lose (or delay) TX notifications, which could
have side effects on the stack.

That's why I said only trylock with budget == 0. Only netpoll calls with
budget == 0, AFAIK.
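
For reference, netpoll does appear to be the only zero-budget caller;
poll_one_napi() in net/core/netpoll.c looks roughly like this as of v5.10
(trace call trimmed):

    static void poll_one_napi(struct napi_struct *napi)
    {
        int work;

        /* If we set this bit but see that it has already been set,
         * that indicates that napi has been disabled and we need
         * to abort this operation
         */
        if (test_and_set_bit(NAPI_STATE_NPSVC, &napi->state))
            return;

        /* We explicitly pass the polling call a budget of 0 to
         * indicate that we are clearing the Tx path only.
         */
        work = napi->poll(napi, 0);
        WARN_ONCE(work, "%pS exceeded budget in poll\n", napi->poll);

        clear_bit(NAPI_STATE_NPSVC, &napi->state);
    }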



Oh right.

So I think maybe we can switch to using trylock when the budget is zero, and
schedule another round of TX NAPI if the trylock fails.
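
Roughly like this (a sketch only, against the v5.10 shape of
virtnet_poll_tx(); the XDP check and queue wake-up are omitted, and the
reschedule-on-contention path is illustrative rather than a tested patch):

    static int virtnet_poll_tx(struct napi_struct *napi, int budget)
    {
        struct send_queue *sq = container_of(napi, struct send_queue, napi);
        struct virtnet_info *vi = sq->vq->vdev->priv;
        struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));

        if (!budget) {
            /* budget == 0 means netpoll: the caller may hold the
             * console/target_list locks, so never spin on the
             * xmit lock here.
             */
            if (!__netif_tx_trylock(txq)) {
                /* lock is busy: let a later NAPI round clean up */
                virtqueue_napi_schedule(napi, sq->vq);
                return 0;
            }
        } else {
            __netif_tx_lock(txq, raw_smp_processor_id());
        }

        free_old_xmit_skbs(sq, true);
        __netif_tx_unlock(txq);

        virtqueue_napi_complete(napi, sq->vq, 0);
        return 0;
    }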






I'm assuming all this trickiness comes from virtqueue_get_buf() needing
locking vs the TX path? It's pretty unusual for the completion path to
need locking vs xmit path.

Two reasons for doing this (sketched below):

1) For some historical reason, we try to free transmitted TX packets in the
xmit path (see free_old_xmit_skbs() in start_xmit()); we could probably
remove this if we drop the mode that runs without TX interrupts.
2) The virtio core requires virtqueue_get_buf() to be synchronized with
virtqueue_add(); we could probably solve this, but it would require some
non-trivial refactoring in the virtio core
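
For context, the pattern in question looks roughly like this (paraphrased
from drivers/net/virtio_net.c around v5.10, bodies elided):

    /* 1) the xmit path itself reclaims completed buffers ... */
    static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
    {
        /* ... */
        free_old_xmit_skbs(sq, false);  /* virtqueue_get_buf() inside */
        err = xmit_skb(skb, sq);        /* virtqueue_add_outbuf() inside */
        /* ... */
    }

    /* 2) ... so the TX NAPI completion path takes the same per-queue xmit
     * lock, both to serialize with the reclaim above and because
     * virtqueue_get_buf() must not race with virtqueue_add().
     */
    static int virtnet_poll_tx(struct napi_struct *napi, int budget)
    {
        /* ... */
        __netif_tx_lock(txq, raw_smp_processor_id());
        free_old_xmit_skbs(sq, true);
        __netif_tx_unlock(txq);
        /* ... */
    }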

Btw, after a quick search, there are several other drivers that take the TX
lock in their TX NAPI poll.

Unless they do:

netdev->priv_flags |= IFF_DISABLE_NETPOLL;

they are all broken.



Yes.

Thanks


Re: netconsole deadlock with virtnet

2020-11-24 Thread Jason Wang


On 2020/11/24 10:31 PM, Steven Rostedt wrote:

On Tue, 24 Nov 2020 11:22:03 +0800
Jason Wang  wrote:


Btw, after a quick search, there are several other drivers that take the TX
lock in their TX NAPI poll.

tx NAPI is not the issue. The issue is that write_msg() (in netconsole.c)
calls this polling logic with the target_list_lock held.



But in the TX NAPI poll it takes the TX lock outright instead of using trylock.




Are those other drivers called by netconsole? If not, then this is unique
to virtio-net.



I think the answer is yes, since netconsole is not disabled in the code.

Thanks




-- Steve




Re: netconsole deadlock with virtnet

2020-11-24 Thread Steven Rostedt
On Tue, 24 Nov 2020 11:22:03 +0800
Jason Wang  wrote:

> Btw, after a quick search, there are several other drivers that take the
> TX lock in their TX NAPI poll.

tx NAPI is not the issue. The issue is that write_msg() (in netconsole.c)
calls this polling logic with the target_list_lock held.

Are those other drivers called by netconsole? If not, then this is unique
to virtio-net.

-- Steve


Re: netconsole deadlock with virtnet

2020-11-24 Thread Leon Romanovsky
On Tue, Nov 24, 2020 at 04:57:23PM +0800, Jason Wang wrote:
>
> On 2020/11/24 4:01 PM, Leon Romanovsky wrote:
> > On Tue, Nov 24, 2020 at 11:22:03AM +0800, Jason Wang wrote:
> > > On 2020/11/24 3:21 AM, Jakub Kicinski wrote:
> > > > On Mon, 23 Nov 2020 14:09:34 -0500 Steven Rostedt wrote:
> > > > > On Mon, 23 Nov 2020 10:52:52 -0800
> > > > > Jakub Kicinski  wrote:
> > > > >
> > > > > > On Mon, 23 Nov 2020 09:31:28 -0500 Steven Rostedt wrote:
> > > > > > > On Mon, 23 Nov 2020 13:08:55 +0200
> > > > > > > Leon Romanovsky  wrote:
> > > > > > >
> > > > > > > >[   10.028024] Chain exists of:
> > > > > > > >[   10.028025]   console_owner --> target_list_lock --> 
> > > > > > > > _xmit_ETHER#2
> > > > > > > Note, the problem is that we have a location that grabs the 
> > > > > > > xmit_lock while
> > > > > > > holding target_list_lock (and possibly console_owner).
> > > > > > Well, it try_locks the xmit_lock. Does lockdep understand try-locks?
> > > > > >
> > > > > > (not that I condone the shenanigans that are going on here)
> > > > > Does it?
> > > > >
> > > > >   virtnet_poll_tx() {
> > > > >       __netif_tx_lock() {
> > > > >           spin_lock(&txq->_xmit_lock);
> > > > Umpf. Right. I was looking at virtnet_poll_cleantx()
> > > >
> > > > > That looks like we can have:
> > > > >
> > > > >
> > > > >      CPU0                        CPU1
> > > > >      ----                        ----
> > > > >   lock(xmit_lock)
> > > > >
> > > > >                              lock(console)
> > > > >                              lock(target_list_lock)
> > > > >                              __netif_tx_lock()
> > > > >                              lock(xmit_lock);
> > > > >
> > > > >                              [BLOCKED]
> > > > >
> > > > >   <interrupt>
> > > > >   lock(console)
> > > > >
> > > > >   [BLOCKED]
> > > > >
> > > > >
> > > > >
> > > > >      DEADLOCK.
> > > > >
> > > > >
> > > > > So where is the trylock here?
> > > > >
> > > > > Perhaps you need the trylock in virtnet_poll_tx()?
> > > > That could work. Best if we used normal lock if !!budget, and trylock
> > > > when budget is 0. But maybe that's too hairy.
> > >
> > > If we use trylock, we may lose (or delay) TX notifications, which could
> > > have side effects on the stack.
> > >
> > >
> > > > I'm assuming all this trickiness comes from virtqueue_get_buf() needing
> > > > locking vs the TX path? It's pretty unusual for the completion path to
> > > > need locking vs xmit path.
> > >
> > > Two reasons for doing this:
> > >
> > > 1) For some historical reason, we try to free transmitted TX packets in
> > > the xmit path (see free_old_xmit_skbs() in start_xmit()); we could
> > > probably remove this if we drop the mode that runs without TX interrupts.
> > > 2) The virtio core requires virtqueue_get_buf() to be synchronized with
> > > virtqueue_add(); we could probably solve this, but it would require some
> > > non-trivial refactoring in the virtio core
> > So how will we solve our lockdep issues?
> >
> > Thanks
>
>
> It's not clear to me whether it's a virtio-net-specific issue. E.g. the
> above deadlock looks like a generic issue, so working around it in
> virtio-net may not help other drivers.

It is hard to say; no one else has complained except me, and I am using virtio :).

Thanks

>
> Thanks
>
>
> >
> > > Btw, after a quick search, there are several other drivers that take
> > > the TX lock in their TX NAPI poll.
> > >
> > > Thanks
> > >
>

Re: netconsole deadlock with virtnet

2020-11-24 Thread Jason Wang


On 2020/11/24 4:01 PM, Leon Romanovsky wrote:

On Tue, Nov 24, 2020 at 11:22:03AM +0800, Jason Wang wrote:

On 2020/11/24 3:21 AM, Jakub Kicinski wrote:

On Mon, 23 Nov 2020 14:09:34 -0500 Steven Rostedt wrote:

On Mon, 23 Nov 2020 10:52:52 -0800
Jakub Kicinski  wrote:


On Mon, 23 Nov 2020 09:31:28 -0500 Steven Rostedt wrote:

On Mon, 23 Nov 2020 13:08:55 +0200
Leon Romanovsky  wrote:


   [   10.028024] Chain exists of:
   [   10.028025]   console_owner --> target_list_lock --> _xmit_ETHER#2

Note, the problem is that we have a location that grabs the xmit_lock while
holding target_list_lock (and possibly console_owner).

Well, it try_locks the xmit_lock. Does lockdep understand try-locks?

(not that I condone the shenanigans that are going on here)

Does it?

virtnet_poll_tx() {
    __netif_tx_lock() {
        spin_lock(&txq->_xmit_lock);

Umpf. Right. I was looking at virtnet_poll_cleantx()


That looks like we can have:


     CPU0                        CPU1
     ----                        ----
  lock(xmit_lock)

                             lock(console)
                             lock(target_list_lock)
                             __netif_tx_lock()
                             lock(xmit_lock);

                             [BLOCKED]

  <interrupt>
  lock(console)

  [BLOCKED]



     DEADLOCK.


So where is the trylock here?

Perhaps you need the trylock in virtnet_poll_tx()?

That could work. Best if we used normal lock if !!budget, and trylock
when budget is 0. But maybe that's too hairy.


If we use trylock, we may lose (or delay) TX notifications, which could have
side effects on the stack.



I'm assuming all this trickiness comes from virtqueue_get_buf() needing
locking vs the TX path? It's pretty unusual for the completion path to
need locking vs xmit path.


Two reasons for doing this:

1) For some historical reason, we try to free transmitted TX packets in the
xmit path (see free_old_xmit_skbs() in start_xmit()); we could probably
remove this if we drop the mode that runs without TX interrupts.
2) The virtio core requires virtqueue_get_buf() to be synchronized with
virtqueue_add(); we could probably solve this, but it would require some
non-trivial refactoring in the virtio core

So how will we solve our lockdep issues?

Thanks



It's not clear to me whether it's a virtio-net-specific issue. E.g. the
above deadlock looks like a generic issue, so working around it in
virtio-net may not help other drivers.


Thanks





Btw, after a quick search, there are several other drivers that take the TX
lock in their TX NAPI poll.

Thanks




Re: netconsole deadlock with virtnet

2020-11-24 Thread Leon Romanovsky
On Tue, Nov 24, 2020 at 11:22:03AM +0800, Jason Wang wrote:
>
> On 2020/11/24 3:21 AM, Jakub Kicinski wrote:
> > On Mon, 23 Nov 2020 14:09:34 -0500 Steven Rostedt wrote:
> > > On Mon, 23 Nov 2020 10:52:52 -0800
> > > Jakub Kicinski  wrote:
> > >
> > > > On Mon, 23 Nov 2020 09:31:28 -0500 Steven Rostedt wrote:
> > > > > On Mon, 23 Nov 2020 13:08:55 +0200
> > > > > Leon Romanovsky  wrote:
> > > > >
> > > > > >   [   10.028024] Chain exists of:
> > > > > >   [   10.028025]   console_owner --> target_list_lock --> 
> > > > > > _xmit_ETHER#2
> > > > > Note, the problem is that we have a location that grabs the xmit_lock 
> > > > > while
> > > > > holding target_list_lock (and possibly console_owner).
> > > > Well, it try_locks the xmit_lock. Does lockdep understand try-locks?
> > > >
> > > > (not that I condone the shenanigans that are going on here)
> > > Does it?
> > >
> > >   virtnet_poll_tx() {
> > >       __netif_tx_lock() {
> > >           spin_lock(&txq->_xmit_lock);
> > Umpf. Right. I was looking at virtnet_poll_cleantx()
> >
> > > That looks like we can have:
> > >
> > >
> > >      CPU0                        CPU1
> > >      ----                        ----
> > >   lock(xmit_lock)
> > >
> > >                              lock(console)
> > >                              lock(target_list_lock)
> > >                              __netif_tx_lock()
> > >                              lock(xmit_lock);
> > >
> > >                              [BLOCKED]
> > >
> > >   <interrupt>
> > >   lock(console)
> > >
> > >   [BLOCKED]
> > >
> > >
> > >
> > >      DEADLOCK.
> > >
> > >
> > > So where is the trylock here?
> > >
> > > Perhaps you need the trylock in virtnet_poll_tx()?
> > That could work. Best if we used normal lock if !!budget, and trylock
> > when budget is 0. But maybe that's too hairy.
>
>
> If we use trylock, we may lose (or delay) TX notifications, which could have
> side effects on the stack.
>
>
> >
> > I'm assuming all this trickiness comes from virtqueue_get_buf() needing
> > locking vs the TX path? It's pretty unusual for the completion path to
> > need locking vs xmit path.
>
>
> Two reasons for doing this:
>
> 1) For some historical reason, we try to free transmitted TX packets in the
> xmit path (see free_old_xmit_skbs() in start_xmit()); we could probably
> remove this if we drop the mode that runs without TX interrupts.
> 2) The virtio core requires virtqueue_get_buf() to be synchronized with
> virtqueue_add(); we could probably solve this, but it would require some
> non-trivial refactoring in the virtio core

So how will we solve our lockdep issues?

Thanks

>
> Btw, after a quick search, there are several other drivers that take the TX
> lock in their TX NAPI poll.
>
> Thanks
>
> >
>

Re: netconsole deadlock with virtnet

2020-11-23 Thread Jason Wang


On 2020/11/24 3:21 AM, Jakub Kicinski wrote:

On Mon, 23 Nov 2020 14:09:34 -0500 Steven Rostedt wrote:

On Mon, 23 Nov 2020 10:52:52 -0800
Jakub Kicinski  wrote:


On Mon, 23 Nov 2020 09:31:28 -0500 Steven Rostedt wrote:

On Mon, 23 Nov 2020 13:08:55 +0200
Leon Romanovsky  wrote:

 

  [   10.028024] Chain exists of:
  [   10.028025]   console_owner --> target_list_lock --> _xmit_ETHER#2

Note, the problem is that we have a location that grabs the xmit_lock while
holding target_list_lock (and possibly console_owner).

Well, it try_locks the xmit_lock. Does lockdep understand try-locks?

(not that I condone the shenanigans that are going on here)

Does it?

virtnet_poll_tx() {
    __netif_tx_lock() {
        spin_lock(&txq->_xmit_lock);

Umpf. Right. I was looking at virtnet_poll_cleantx()


That looks like we can have:


     CPU0                        CPU1
     ----                        ----
  lock(xmit_lock)

                             lock(console)
                             lock(target_list_lock)
                             __netif_tx_lock()
                             lock(xmit_lock);

                             [BLOCKED]

  <interrupt>
  lock(console)

  [BLOCKED]



     DEADLOCK.


So where is the trylock here?

Perhaps you need the trylock in virtnet_poll_tx()?

That could work. Best if we used normal lock if !!budget, and trylock
when budget is 0. But maybe that's too hairy.



If we use trylock, we may lose (or delay) TX notifications, which could
have side effects on the stack.





I'm assuming all this trickiness comes from virtqueue_get_buf() needing
locking vs the TX path? It's pretty unusual for the completion path to
need locking vs xmit path.



Two reasons for doing this:

1) For some historical reason, we try to free transmitted TX packets in the
xmit path (see free_old_xmit_skbs() in start_xmit()); we could probably
remove this if we drop the mode that runs without TX interrupts.
2) The virtio core requires virtqueue_get_buf() to be synchronized with
virtqueue_add(); we could probably solve this, but it would require some
non-trivial refactoring in the virtio core


Btw, after a quick search, there are several other drivers that take the TX
lock in their TX NAPI poll.


Thanks






Re: netconsole deadlock with virtnet

2020-11-23 Thread Steven Rostedt
On Mon, 23 Nov 2020 10:52:52 -0800
Jakub Kicinski  wrote:

> On Mon, 23 Nov 2020 09:31:28 -0500 Steven Rostedt wrote:
> > On Mon, 23 Nov 2020 13:08:55 +0200
> > Leon Romanovsky  wrote:
> > 
> >   
> > >  [   10.028024] Chain exists of:
> > >  [   10.028025]   console_owner --> target_list_lock --> _xmit_ETHER#2
> > 
> > Note, the problem is that we have a location that grabs the xmit_lock while
> > holding target_list_lock (and possibly console_owner).  
> 
> Well, it try_locks the xmit_lock. Does lockdep understand try-locks?
> 
> (not that I condone the shenanigans that are going on here)

Does it?

virtnet_poll_tx() {
    __netif_tx_lock() {
        spin_lock(&txq->_xmit_lock);

That looks like we can have:


     CPU0                        CPU1
     ----                        ----
  lock(xmit_lock)

                             lock(console)
                             lock(target_list_lock)
                             __netif_tx_lock()
                             lock(xmit_lock);

                             [BLOCKED]

  <interrupt>
  lock(console)

  [BLOCKED]



     DEADLOCK.


So where is the trylock here?

Perhaps you need the trylock in virtnet_poll_tx()?

-- Steve


Re: netconsole deadlock with virtnet

2020-11-23 Thread Steven Rostedt
On Mon, 23 Nov 2020 13:08:55 +0200
Leon Romanovsky  wrote:


>  [   10.028024] Chain exists of:
>  [   10.028025]   console_owner --> target_list_lock --> _xmit_ETHER#2

Note, the problem is that we have a location that grabs the xmit_lock while
holding target_list_lock (and possibly console_owner).


>  [   10.028028]
>  [   10.028028]  Possible interrupt unsafe locking scenario:
>  [   10.028029]
>  [   10.028029]        CPU0                    CPU1
>  [   10.028030]        ----                    ----
>  [   10.028030]   lock(_xmit_ETHER#2);
>  [   10.028032]                                local_irq_disable();
>  [   10.028032]                                lock(console_owner);
>  [   10.028034]                                lock(target_list_lock);
>  [   10.028035]   <Interrupt>
>  [   10.028035]     lock(console_owner);
>  [   10.028036]
>  [   10.028037]  *** DEADLOCK ***
>  [   10.028037]



>  [   10.028107] the dependencies between the lock to be acquired
>  [   10.028107]  and HARDIRQ-irq-unsafe lock:
>  [   10.028108] -> (_xmit_ETHER#2){+.-.}-{2:2} ops: 217 {
>  [   10.028110]HARDIRQ-ON-W at:
>  [   10.028111]__lock_acquire+0x8bc/0x1a94
>  [   10.028111]lock_acquire.part.0+0x170/0x360
>  [   10.028112]lock_acquire+0x68/0x8c
>  [   10.028113]_raw_spin_trylock+0x80/0xd0
>  [   10.028113]virtnet_poll+0xac/0x360

xmit_lock is taken in virtnet_poll() (via virtnet_poll_cleantx()).

This is called from the softirq, and interrupts are not disabled.

>  [   10.028114]net_rx_action+0x1b0/0x4e0
>  [   10.028115]__do_softirq+0x1f4/0x638
>  [   10.028115]do_softirq+0xb8/0xcc
>  [   10.028116]__local_bh_enable_ip+0x18c/0x200
>  [   10.028116]virtnet_napi_enable+0xc0/0xd4
>  [   10.028117]virtnet_open+0x98/0x1c0
>  [   10.028118]__dev_open+0x12c/0x200
>  [   10.028118]__dev_change_flags+0x1a0/0x220
>  [   10.028119]dev_change_flags+0x2c/0x70
>  [   10.028119]do_setlink+0x214/0xe20
>  [   10.028120]__rtnl_newlink+0x514/0x820
>  [   10.028120]rtnl_newlink+0x58/0x84
>  [   10.028121]rtnetlink_rcv_msg+0x184/0x4b4
>  [   10.028122]netlink_rcv_skb+0x60/0x124
>  [   10.028122]rtnetlink_rcv+0x20/0x30
>  [   10.028123]netlink_unicast+0x1b4/0x270
>  [   10.028124]netlink_sendmsg+0x1f0/0x400
>  [   10.028124]sock_sendmsg+0x5c/0x70
>  [   10.028125]sys_sendmsg+0x24c/0x280
>  [   10.028125]___sys_sendmsg+0x88/0xd0
>  [   10.028126]__sys_sendmsg+0x70/0xd0
>  [   10.028127]__arm64_sys_sendmsg+0x2c/0x40
>  [   10.028128]el0_svc_common.constprop.0+0x84/0x200
>  [   10.028128]do_el0_svc+0x2c/0x90
>  [   10.028129]el0_svc+0x18/0x50
>  [   10.028129]el0_sync_handler+0xe0/0x350
>  [   10.028130]el0_sync+0x158/0x180

[..]

>  [   10.028171]  ... key  at: [] 
> netdev_xmit_lock_key+0x10/0x390
>  [   10.028171]  ... acquired at:
>  [   10.028172]__lock_acquire+0x134c/0x1a94
>  [   10.028172]lock_acquire.part.0+0x170/0x360
>  [   10.028173]lock_acquire+0x68/0x8c
>  [   10.028173]_raw_spin_lock+0x64/0x90
>  [   10.028174]virtnet_poll_tx+0x84/0x120
>  [   10.028174]netpoll_poll_dev+0x12c/0x350
>  [   10.028175]netpoll_send_skb+0x39c/0x400
>  [   10.028175]netpoll_send_udp+0x2b8/0x440
>  [   10.028176]write_msg+0xfc/0x120 [netconsole]
>  [   10.028176]console_unlock+0x3ec/0x6a4

The above shows the problem. We have:

console_unlock() (which holds the console_owner lock)
write_msg() (which holds the target_list_lock)

Then write_msg() calls:

  netpoll_send_udp() {
    netpoll_send_skb() {
      netpoll_poll_dev() {
        virtnet_poll_tx() (which takes the xmit_lock!)

        DEADLOCK!


In netpoll_send_skb() I see this:

/* tickle device maybe there is some cleanup */
netpoll_poll_dev(np->dev);

Which looks to me like it will call code that should only be used in
softirq context. Here it is called with locks held that are also taken in
interrupt context, so any lock taken in netpoll_poll_dev() must always be
taken with interrupts disabled. That is, if xmit_lock is taken within
netpoll_poll_dev(), then it must always be taken with interrupts disabled.
Otherwise you can have the deadlock that lockdep reported.
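
To put the rule in code form (a generic illustration with made-up names --
lock_a, safe_path(), unsafe_path(), my_handler() are all hypothetical, not
anything in the tree; this is the single-CPU version of what lockdep warns
about):

    #include <linux/spinlock.h>
    #include <linux/interrupt.h>

    static DEFINE_SPINLOCK(lock_a); /* taken in both IRQ and non-IRQ context */

    void safe_path(void)            /* process/softirq context */
    {
        unsigned long flags;

        /* IRQs off: my_handler() below cannot preempt us here */
        spin_lock_irqsave(&lock_a, flags);
        /* ... */
        spin_unlock_irqrestore(&lock_a, flags);
    }

    void unsafe_path(void)          /* process/softirq context */
    {
        spin_lock(&lock_a);         /* IRQs still enabled ... */
        /* <interrupt can arrive here> */
        spin_unlock(&lock_a);
    }

    irqreturn_t my_handler(int irq, void *data)   /* hard-IRQ context */
    {
        /* deadlocks if this IRQ preempted unsafe_path() on this CPU */
        spin_lock(&lock_a);
        /* ... */
        spin_unlock(&lock_a);
        return IRQ_HANDLED;
    }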

-- Steve





Re: netconsole deadlock with virtnet

2020-11-23 Thread Leon Romanovsky
On Wed, Nov 18, 2020 at 09:12:57AM -0500, Steven Rostedt wrote:
>
> [ Adding netdev as perhaps someone there knows ]
>
> On Wed, 18 Nov 2020 12:09:59 +0800
> Jason Wang  wrote:
>
> > > This CPU0 lock(_xmit_ETHER#2) -> hard IRQ -> lock(console_owner) is
> > > basically
> > >   soft IRQ -> lock(_xmit_ETHER#2) -> hard IRQ -> printk()
> > >
> > > Then CPU1 spins on xmit, which is owned by CPU0, CPU0 spins on
> > > console_owner, which is owned by CPU1?
>
> It still looks to me like the target_list_lock is taken in IRQ context
> (which can be the case because printk calls write_msg(), which takes that
> lock). And someplace there's a:
>
>   lock(target_list_lock)
>   lock(xmit_lock)
>
> which means you can remove the console lock from this scenario completely,
> and you still have a possible deadlock between target_list_lock and
> xmit_lock.
>
> >
> >
> > If this is true, it is not a virtio-net-specific issue but a problem
> > somewhere else.
> >
> > I think all network drivers synchronize through bh instead of hardirq.
>
I think the issue is where target_list_lock is held when we take xmit_lock.
Is there anywhere in netconsole.c that can end up taking xmit_lock while
holding the target_list_lock? If so, that's the problem. target_list_lock
can be taken in IRQ context, which means *any* other lock that is taken
while holding the target_list_lock must also be protected against
interrupts while held.

I increased the printk buffer as Petr suggested, and the splat is below.
It doesn't happen on x86, but does on ARM64 and ppc64.

 [   10.027975] =
 [   10.027976] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
 [   10.027976] 5.10.0-rc4_for_upstream_min_debug_2020_11_22_19_37 #1 Not 
tainted
 [   10.027977] -
 [   10.027978] modprobe/638 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 [   10.027979] c9f63c98 (_xmit_ETHER#2){+.-.}-{2:2}, at: 
virtnet_poll_tx+0x84/0x120
 [   10.027982]
 [   10.027982] and this task is already holding:
 [   10.027983] 89007018 (target_list_lock){}-{2:2}, at: 
write_msg+0x6c/0x120 [netconsole]
 [   10.027985] which would create a new lock dependency:
 [   10.027985]  (target_list_lock){}-{2:2} -> (_xmit_ETHER#2){+.-.}-{2:2}
 [   10.027989]
 [   10.027989] but this new dependency connects a HARDIRQ-irq-safe lock:
 [   10.027990]  (console_owner){-...}-{0:0}
 [   10.027991]
 [   10.027992] ... which became HARDIRQ-irq-safe at:
 [   10.027992]   __lock_acquire+0xa78/0x1a94
 [   10.027993]   lock_acquire.part.0+0x170/0x360
 [   10.027993]   lock_acquire+0x68/0x8c
 [   10.027994]   console_unlock+0x1e8/0x6a4
 [   10.027994]   vprintk_emit+0x1c4/0x3c4
 [   10.027995]   vprintk_default+0x40/0x4c
 [   10.027995]   vprintk_func+0x10c/0x220
 [   10.027995]   printk+0x68/0x90
 [   10.027996]   crng_fast_load+0x1bc/0x1c0
 [   10.027997]   add_interrupt_randomness+0x280/0x290
 [   10.027997]   handle_irq_event+0x80/0x120
 [   10.027997]   handle_fasteoi_irq+0xac/0x200
 [   10.027998]   __handle_domain_irq+0x84/0xf0
 [   10.027999]   gic_handle_irq+0xd4/0x320
 [   10.027999]   el1_irq+0xd0/0x180
 [   10.028000]   arch_cpu_idle+0x24/0x44
 [   10.028000]   default_idle_call+0x48/0xa0
 [   10.028001]   do_idle+0x260/0x300
 [   10.028001]   cpu_startup_entry+0x30/0x6c
 [   10.028001]   rest_init+0x1b4/0x288
 [   10.028002]   arch_call_rest_init+0x18/0x24
 [   10.028002]   start_kernel+0x5cc/0x608
 [   10.028003]
 [   10.028003] to a HARDIRQ-irq-unsafe lock:
 [   10.028004]  (_xmit_ETHER#2){+.-.}-{2:2}
 [   10.028005]
 [   10.028006] ... which became HARDIRQ-irq-unsafe at:
 [   10.028006] ...  __lock_acquire+0x8bc/0x1a94
 [   10.028007]   lock_acquire.part.0+0x170/0x360
 [   10.028007]   lock_acquire+0x68/0x8c
 [   10.028008]   _raw_spin_trylock+0x80/0xd0
 [   10.028008]   virtnet_poll+0xac/0x360
 [   10.028009]   net_rx_action+0x1b0/0x4e0
 [   10.028010]   __do_softirq+0x1f4/0x638
 [   10.028010]   do_softirq+0xb8/0xcc
 [   10.028010]   __local_bh_enable_ip+0x18c/0x200
 [   10.028011]   virtnet_napi_enable+0xc0/0xd4
 [   10.028011]   virtnet_open+0x98/0x1c0
 [   10.028012]   __dev_open+0x12c/0x200
 [   10.028013]   __dev_change_flags+0x1a0/0x220
 [   10.028013]   dev_change_flags+0x2c/0x70
 [   10.028014]   do_setlink+0x214/0xe20
 [   10.028014]   __rtnl_newlink+0x514/0x820
 [   10.028015]   rtnl_newlink+0x58/0x84
 [   10.028015]   rtnetlink_rcv_msg+0x184/0x4b4
 [   10.028016]   netlink_rcv_skb+0x60/0x124
 [   10.028016]   rtnetlink_rcv+0x20/0x30
 [   10.028017]   netlink_unicast+0x1b4/0x270
 [   10.028017]   netlink_sendmsg+0x1f0/0x400
 [   10.028018]   sock_sendmsg+0x5c/0x70
 [   10.028018]   sys_sendmsg+0x24c/0x280
 [   10.028019]   ___sys_sendmsg+0x88/0xd0
 [   10.028019]   __sys_sendmsg+0x70/0xd0
 [   10.028020]   __arm64_sys_sendmsg+0x2c/0x40
 [   10.028021]   el0_svc_common.constprop.0+0x84/0x200
 [   

Re: netconsole deadlock with virtnet

2020-11-22 Thread Leon Romanovsky
On Thu, Nov 19, 2020 at 01:55:53PM +0100, Petr Mladek wrote:
> On Tue 2020-11-17 09:33:25, Steven Rostedt wrote:
> > On Tue, 17 Nov 2020 12:23:41 +0200
> > Leon Romanovsky  wrote:
> >
> > > Hi,
> > >
> > > Approximately two weeks ago, our regression team started to experience 
> > > those
> > > netconsole splats. The tested code is Linus's master (-rc4) + netdev 
> > > net-next
> > > + netdev net-rc.
> > >
> > > Such splats are random and we can't bisect because there is no stable 
> > > reproducer.
> > >
> > > Any idea, what is the root cause?
> > >
> > > [   21.149739]   __do_sys_finit_module+0xbc/0x12c
> > > [   21.149740]   __arm64_sys_finit_module+0x28/0x34
> > > [   21.149741]   el0_svc_common.constprop.0+0x84/0x200
> > > [   21.149742]   do_el0_svc+0x2c/0x90
> > > [   21.149743]   el0_svc+0x18/0x50
> > > [   21.149744]   el0_sync_handler+0xe0/0x350
> > > [   21.149745]   el0_sync+0x158/0x180
> > > [   21.149746]  }
> > > [   21.149747]  ... key  at: [] 
> > > target_list_lock+0x18/0xf000 [netconsole]
> > > [   21.149748]  ..
> > > [   21.149750] Lost 190 message(s)!
> >
> > It really sucks that we lose 190 messages that would help to decipher this
> > more. :-p
>
> The message comes from the printk_safe code. The size can be
> increased by CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT.
>
> > Because I'm not sure where the xmit_lock is taken while holding the
> > target_list_lock. But the above does show that printk() calls write_msg()
> > while holding the console_lock, and write_msg() takes the target_list_lock.
> >
> > Thus, the fix would either require disabling interrupts every time the
> > xmit_lock is taken, or keeping it from being taken while holding the
> > target_list_lock.
>
> It seems that the missing messages might help to find the root of
> the problem.

Sorry for not being very responsive, I was in an internet-free zone :).

I'll increase CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT from 13 to 26; let's
see what the night run gives us.
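
That is, something like this in the kernel config (the printk-safe buffer
is per-CPU and sized 2^shift bytes, so this would bump it from 8 KiB to
64 MiB per CPU):

    CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=26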

Thanks

>
> Best Regards,
> Petr


Re: netconsole deadlock with virtnet

2020-11-19 Thread Petr Mladek via Virtualization
On Tue 2020-11-17 09:33:25, Steven Rostedt wrote:
> On Tue, 17 Nov 2020 12:23:41 +0200
> Leon Romanovsky  wrote:
> 
> > Hi,
> > 
> > Approximately two weeks ago, our regression team started to experience those
> > netconsole splats. The tested code is Linus's master (-rc4) + netdev 
> > net-next
> > + netdev net-rc.
> > 
> > Such splats are random and we can't bisect because there is no stable 
> > reproducer.
> > 
> > Any idea, what is the root cause?
> > 
> > [   21.149739]   __do_sys_finit_module+0xbc/0x12c
> > [   21.149740]   __arm64_sys_finit_module+0x28/0x34
> > [   21.149741]   el0_svc_common.constprop.0+0x84/0x200
> > [   21.149742]   do_el0_svc+0x2c/0x90
> > [   21.149743]   el0_svc+0x18/0x50
> > [   21.149744]   el0_sync_handler+0xe0/0x350
> > [   21.149745]   el0_sync+0x158/0x180
> > [   21.149746]  }
> > [   21.149747]  ... key  at: [] 
> > target_list_lock+0x18/0xf000 [netconsole]
> > [   21.149748]  ..
> > [   21.149750] Lost 190 message(s)!
> 
> It really sucks that we lose 190 messages that would help to decipher this
> more. :-p

The message comes from the printk_safe code. The size can be
increased by CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT.

> Because I'm not sure where the xmit_lock is taken while holding the
> target_list_lock. But the above does show that printk() calls write_msg()
> while holding the console_lock, and write_msg() takes the target_list_lock.
> 
> Thus, the fix would either require disabling interrupts every time the
> xmit_lock is taken, or keeping it from being taken while holding the
> target_list_lock.

It seems that the missing messages might help to find the root of
the problem.

Best Regards,
Petr


Re: netconsole deadlock with virtnet

2020-11-18 Thread Steven Rostedt


[ Adding netdev as perhaps someone there knows ]

On Wed, 18 Nov 2020 12:09:59 +0800
Jason Wang  wrote:

> > This CPU0 lock(_xmit_ETHER#2) -> hard IRQ -> lock(console_owner) is
> > basically
> > soft IRQ -> lock(_xmit_ETHER#2) -> hard IRQ -> printk()
> >
> > Then CPU1 spins on xmit, which is owned by CPU0, CPU0 spins on
> > console_owner, which is owned by CPU1?  

It still looks to me like the target_list_lock is taken in IRQ context
(which can be the case because printk calls write_msg(), which takes that
lock). And someplace there's a:

lock(target_list_lock)
lock(xmit_lock)

which means you can remove the console lock from this scenario completely,
and you still have a possible deadlock between target_list_lock and
xmit_lock.

> 
> 
> If this is true, it is not a virtio-net-specific issue but a problem
> somewhere else.
>
> I think all network drivers synchronize through bh instead of hardirq.

I think the issue is where target_list_lock is held when we take xmit_lock.
Is there anywhere in netconsole.c that can end up taking xmit_lock while
holding the target_list_lock? If so, that's the problem. target_list_lock
can be taken in IRQ context, which means *any* other lock that is taken
while holding the target_list_lock must also be protected against
interrupts while held.

-- Steve


Re: netconsole deadlock with virtnet

2020-11-17 Thread Jason Wang


On 2020/11/18 11:15 AM, Sergey Senozhatsky wrote:

On (20/11/18 11:46), Sergey Senozhatsky wrote:
[..]

Because I'm not sure where the xmit_lock is taken while holding the
target_list_lock.

I don't see where this happens. It seems to me that the report
is not about broken locking order, but more about:
- a soft-irq can be preempted (while holding _xmit_lock) by a hardware
  interrupt that will attempt to acquire the same _xmit_lock.

    CPU0
    <<softirq>>
    virtnet_poll_tx()
     __netif_tx_lock()
      spin_lock(&dev->_xmit_lock)
    <<hardirq>>
     add_interrupt_randomness()
      crng_fast_load()
       printk()
        call_console_drivers()
         spin_lock_irqsave(&target_list_lock)
          spin_lock(&dev->_xmit_lock);

Does this make sense?

Hmm, lockdep says something similar, but there are two printk()s
happening - on both the local and remote CPUs.

[   21.149564]        CPU0                    CPU1
[   21.149565]        ----                    ----
[   21.149566]   lock(_xmit_ETHER#2);
[   21.149569]                                local_irq_disable();
[   21.149570]                                lock(console_owner);
[   21.149572]                                lock(target_list_lock);
[   21.149575]   <Interrupt>
[   21.149576]     lock(console_owner);

This CPU0 lock(_xmit_ETHER#2) -> hard IRQ -> lock(console_owner) is
basically
soft IRQ -> lock(_xmit_ETHER#2) -> hard IRQ -> printk()

Then CPU1 spins on xmit, which is owned by CPU0, CPU0 spins on
console_owner, which is owned by CPU1?



If this is true, it is not a virtio-net-specific issue but a problem
somewhere else.


I think all network drivers synchronize through bh instead of hardirq.

Thanks




A quick-and-dirty idea (it doesn't fix the lockdep report) - can we
add some sort of max_loops variable to console_trylock_spinning(),
so that it will not spin forever in `while (READ_ONCE(console_waiter))`
waiting for a console_owner to pass the lock?

-ss




Re: netconsole deadlock with virtnet

2020-11-17 Thread Sergey Senozhatsky
On (20/11/18 11:46), Sergey Senozhatsky wrote:
[..]
> > Because I'm not sure where the xmit_lock is taken while holding the
> > target_list_lock.
> 
> I don't see where this happens. It seems to me that the report
> is not about broken locking order, but more about:
> - a soft-irq can be preempted (while holding _xmit_lock) by a hardware
>   interrupt that will attempt to acquire the same _xmit_lock.
> 
>    CPU0
>    <<softirq>>
>    virtnet_poll_tx()
>     __netif_tx_lock()
>      spin_lock(&dev->_xmit_lock)
>    <<hardirq>>
>     add_interrupt_randomness()
>      crng_fast_load()
>       printk()
>        call_console_drivers()
>         spin_lock_irqsave(&target_list_lock)
>          spin_lock(&dev->_xmit_lock);
> 
> Does this make sense?

Hmm, lockdep says something similar, but there are two printk()s
happening - on both the local and remote CPUs.

[   21.149564]        CPU0                    CPU1
[   21.149565]        ----                    ----
[   21.149566]   lock(_xmit_ETHER#2);
[   21.149569]                                local_irq_disable();
[   21.149570]                                lock(console_owner);
[   21.149572]                                lock(target_list_lock);
[   21.149575]   <Interrupt>
[   21.149576]     lock(console_owner);

This CPU0 lock(_xmit_ETHER#2) -> hard IRQ -> lock(console_owner) is
basically
soft IRQ -> lock(_xmit_ETHER#2) -> hard IRQ -> printk()

Then CPU1 spins on xmit, which is owned by CPU0, CPU0 spins on
console_owner, which is owned by CPU1?

A quick-and-dirty idea (it doesn't fix the lockdep report) - can we
add some sort of max_loops variable to console_trylock_spinning(),
so that it will not spin forever in `while (READ_ONCE(console_waiter))`
waiting for a console_owner to pass the lock?
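
Something like this inside console_trylock_spinning() in
kernel/printk/printk.c (a rough sketch: the bound is arbitrary, and a real
patch would also have to unwind the console_waiter handshake when giving
up, which is the part this leaves out):

    /* The owner will clear console_waiter on hand-over */
    unsigned long max_loops = 1000000;  /* arbitrary bound */

    while (READ_ONCE(console_waiter)) {
        if (!--max_loops)
            break;      /* give up instead of spinning forever */
        cpu_relax();
    }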

-ss


Re: netconsole deadlock with virtnet

2020-11-17 Thread Sergey Senozhatsky
On (20/11/17 09:33), Steven Rostedt wrote:
> > [   21.149601] IN-HARDIRQ-W at:
> > [   21.149602]  __lock_acquire+0xa78/0x1a94
> > [   21.149603]  lock_acquire.part.0+0x170/0x360
> > [   21.149604]  lock_acquire+0x68/0x8c
> > [   21.149605]  console_unlock+0x1e8/0x6a4
> > [   21.149606]  vprintk_emit+0x1c4/0x3c4
> > [   21.149607]  vprintk_default+0x40/0x4c
> > [   21.149608]  vprintk_func+0x10c/0x220
> > [   21.149610]  printk+0x68/0x90
> > [   21.149611]  crng_fast_load+0x1bc/0x1c0
> > [   21.149612]  add_interrupt_randomness+0x280/0x290
> > [   21.149613]  handle_irq_event+0x80/0x120
> > [   21.149614]  handle_fasteoi_irq+0xac/0x200
> > [   21.149615]  __handle_domain_irq+0x84/0xf0
> > [   21.149616]  gic_handle_irq+0xd4/0x320
> > [   21.149617]  el1_irq+0xd0/0x180
> > [   21.149618]  arch_cpu_idle+0x24/0x44
> > [   21.149619]  default_idle_call+0x48/0xa0
> > [   21.149620]  do_idle+0x260/0x300
> > [   21.149621]  cpu_startup_entry+0x30/0x6c
> > [   21.149622]  rest_init+0x1b4/0x288
> > [   21.149624]  arch_call_rest_init+0x18/0x24
> > [   21.149625]  start_kernel+0x5cc/0x608
> > [   21.149625] IN-SOFTIRQ-W at:
> > [   21.149627]  __lock_acquire+0x894/0x1a94
> > [   21.149628]  lock_acquire.part.0+0x170/0x360
> > [   21.149629]  lock_acquire+0x68/0x8c
> > [   21.149630]  console_unlock+0x1e8/0x6a4
> > [   21.149631]  vprintk_emit+0x1c4/0x3c4
> > [   21.149632]  vprintk_default+0x40/0x4c
> > [   21.149633]  vprintk_func+0x10c/0x220
> > [   21.149634]  printk+0x68/0x90
> > [   21.149635]  hrtimer_interrupt+0x290/0x294
> > [   21.149636]  arch_timer_handler_virt+0x3c/0x50
> > [   21.149637]  handle_percpu_devid_irq+0x94/0x164
> > [   21.149673]  __handle_domain_irq+0x84/0xf0
> > [   21.149674]  gic_handle_irq+0xd4/0x320
> > [   21.149675]  el1_irq+0xd0/0x180
> > [   21.149676]  __do_softirq+0x108/0x638
> > [   21.149677]  __irq_exit_rcu+0x17c/0x1b0
> > [   21.149678]  irq_exit+0x18/0x44
> > [   21.149679]  __handle_domain_irq+0x88/0xf0
> > [   21.149680]  gic_handle_irq+0xd4/0x320
> > [   21.149681]  el1_irq+0xd0/0x180
> > [   21.149682]  
> > smp_call_function_many_cond+0x3cc/0x3f0
> > [   21.149683]  kick_all_cpus_sync+0x4c/0x80
> > [   21.149684]  load_module+0x1eec/0x2734
> > [   21.149685]  __do_sys_finit_module+0xbc/0x12c
> > [   21.149686]  __arm64_sys_finit_module+0x28/0x34
> > [   21.149687]  
> > el0_svc_common.constprop.0+0x84/0x200
> > [   21.149688]  do_el0_svc+0x2c/0x90
> > [   21.149689]  el0_svc+0x18/0x50
> > [   21.149690]  el0_sync_handler+0xe0/0x350
> > [   21.149691]  el0_sync+0x158/0x180

[..]

> It really sucks that we lose 190 messages that would help to decipher this
> more. :-p

Indeed.

> Because I'm not sure where the xmit_lock is taken while holding the
> target_list_lock.

I don't see where this happens. It seems to me that the report
is not about broken locking order, but more about:
- a soft-irq can be preempted (while holding _xmit_lock) by a hardware
  interrupt that will attempt to acquire the same _xmit_lock.

    CPU0
    <<softirq>>
    virtnet_poll_tx()
     __netif_tx_lock()
      spin_lock(&dev->_xmit_lock)
    <<hardirq>>
     add_interrupt_randomness()
      crng_fast_load()
       printk()
        call_console_drivers()
         spin_lock_irqsave(&target_list_lock)
          spin_lock(&dev->_xmit_lock);

Does this make sense?

-ss


Re: netconsole deadlock with virtnet

2020-11-17 Thread Leon Romanovsky
On Tue, Nov 17, 2020 at 09:33:25AM -0500, Steven Rostedt wrote:
> On Tue, 17 Nov 2020 12:23:41 +0200
> Leon Romanovsky  wrote:
>
> > Hi,
> >
> > Approximately two weeks ago, our regression team started to experience those
> > netconsole splats. The tested code is Linus's master (-rc4) + netdev 
> > net-next
> > + netdev net-rc.
> >
> > Such splats are random and we can't bisect because there is no stable 
> > reproducer.
> >
> > Any idea, what is the root cause?
> >
> > [   21.141117] netpoll: netconsole: local port 
> > [   21.141947] netpoll: netconsole: local IPv4 address 10.236.251.7
> > [   21.143138] netpoll: netconsole: interface 'eth1'
> > [   21.143988] netpoll: netconsole: remote port 62517
> > [   21.145127] netpoll: netconsole: remote IPv4 address 10.236.251.1
> > [   21.146206] netpoll: netconsole: remote ethernet address 
> > 68:05:ca:aa:99:49
> > [   21.149464] =
> > [   21.149466] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
> > [   21.149467] 5.10.0-rc4_for_upstream_min_debug_2020_11_16_13_03 #1 Not 
> > tainted
> > [   21.149468] -
> > [   21.149469] modprobe/596 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> > [   21.149471] 000194b4e298 (_xmit_ETHER#2){+.-.}-{2:2}, at: 
> > virtnet_poll_tx+0x84/0x120
> > [   21.149478] and this task is already holding:
> > [   21.149478] 893d4018 (target_list_lock){}-{2:2}, at: 
> > write_msg+0x6c/0x120 [netconsole]
> > [   21.149483] which would create a new lock dependency:
> > [   21.149484]  (target_list_lock){}-{2:2} -> 
> > (_xmit_ETHER#2){+.-.}-{2:2}
> > [   21.149491] but this new dependency connects a HARDIRQ-irq-safe lock:
> > [   21.149492]  (console_owner){-.-.}-{0:0}
> > [   21.149496] ... which became HARDIRQ-irq-safe at:
> > [   21.149496]   __lock_acquire+0xa78/0x1a94
> > [   21.149498]   lock_acquire.part.0+0x170/0x360
> > [   21.149498]   lock_acquire+0x68/0x8c
> > [   21.149500]   console_unlock+0x1e8/0x6a4
> > [   21.149500]   vprintk_emit+0x1c4/0x3c4
> > [   21.149501]   vprintk_default+0x40/0x4c
> > [   21.149502]   vprintk_func+0x10c/0x220
> > [   21.149503]   printk+0x68/0x90
> > [   21.149504]   crng_fast_load+0x1bc/0x1c0
> > [   21.149505]   add_interrupt_randomness+0x280/0x290
> > [   21.149506]   handle_irq_event+0x80/0x120
> > [   21.149507]   handle_fasteoi_irq+0xac/0x200
> > [   21.149508]   __handle_domain_irq+0x84/0xf0
> > [   21.149509]   gic_handle_irq+0xd4/0x320
> > [   21.149510]   el1_irq+0xd0/0x180
> > [   21.149511]   arch_cpu_idle+0x24/0x44
> > [   21.149512]   default_idle_call+0x48/0xa0
> > [   21.149513]   do_idle+0x260/0x300
> > [   21.149514]   cpu_startup_entry+0x30/0x6c
> > [   21.149515]   rest_init+0x1b4/0x288
> > [   21.149515]   arch_call_rest_init+0x18/0x24
> > [   21.149516]   start_kernel+0x5cc/0x608
> > [   21.149518] to a HARDIRQ-irq-unsafe lock:
> > [   21.149519]  (_xmit_ETHER#2){+.-.}-{2:2}
> > [   21.149523] ... which became HARDIRQ-irq-unsafe at:
> > [   21.149524] ...  __lock_acquire+0x8bc/0x1a94
> > [   21.149525]   lock_acquire.part.0+0x170/0x360
> > [   21.149526]   lock_acquire+0x68/0x8c
> > [   21.149527]   _raw_spin_trylock+0x80/0xd0
> > [   21.149527]   virtnet_poll+0xac/0x360
> > [   21.149528]   net_rx_action+0x1b0/0x4e0
> > [   21.149529]   __do_softirq+0x1f4/0x638
> > [   21.149530]   do_softirq+0xb8/0xcc
> > [   21.149531]   __local_bh_enable_ip+0x18c/0x200
> > [   21.149532]   virtnet_napi_enable+0xc0/0xd4
> > [   21.149533]   virtnet_open+0x98/0x1c0
> > [   21.149534]   __dev_open+0x12c/0x200
> > [   21.149535]   __dev_change_flags+0x1a0/0x220
> > [   21.149536]   dev_change_flags+0x2c/0x70
> > [   21.149536]   do_setlink+0x214/0xe20
> > [   21.149537]   __rtnl_newlink+0x514/0x820
> > [   21.149538]   rtnl_newlink+0x58/0x84
> > [   21.149539]   rtnetlink_rcv_msg+0x184/0x4b4
> > [   21.149540]   netlink_rcv_skb+0x60/0x124
> > [   21.149541]   rtnetlink_rcv+0x20/0x30
> > [   21.149542]   netlink_unicast+0x1b4/0x270
> > [   21.149543]   netlink_sendmsg+0x1f0/0x400
> > [   21.149544]   sock_sendmsg+0x5c/0x70
> > [   21.149545]   sys_sendmsg+0x24c/0x280
> > [   21.149545]   ___sys_sendmsg+0x88/0xd0
> > [   21.149546]   __sys_sendmsg+0x70/0xd0
> > [   21.149547]   __arm64_sys_sendmsg+0x2c/0x40
> > [   21.149548]   el0_svc_common.constprop.0+0x84/0x200
> > [   21.149549]   do_el0_svc+0x2c/0x90
> > [   21.149550]   el0_svc+0x18/0x50
> > [   21.149551]   el0_sync_handler+0xe0/0x350
> > [   21.149552]   el0_sync+0x158/0x180
> > [   21.149553] other info that might help us debug this:
> > [   21.149555] Chain exists of:
> > [   21.149556]   console_owner --> target_list_lock --> _xmit_ETHER#2
>
So somewhere we have target_list_lock taken while holding the console_owner
lock (which can happen in interrupt). And what this is saying,
target_list_lock is held somewhere where _xmit_ETHER#2 is taken (which I'm
guessing from the back traces is the xmit_lock taken in virtnet_poll_tx()).

Re: netconsole deadlock with virtnet

2020-11-17 Thread Steven Rostedt
On Tue, 17 Nov 2020 12:23:41 +0200
Leon Romanovsky  wrote:

> Hi,
> 
> Approximately two weeks ago, our regression team started to experience those
> netconsole splats. The tested code is Linus's master (-rc4) + netdev net-next
> + netdev net-rc.
> 
> Such splats are random and we can't bisect because there is no stable 
> reproducer.
> 
> Any idea, what is the root cause?
> 
> [   21.141117] netpoll: netconsole: local port 
> [   21.141947] netpoll: netconsole: local IPv4 address 10.236.251.7
> [   21.143138] netpoll: netconsole: interface 'eth1'
> [   21.143988] netpoll: netconsole: remote port 62517
> [   21.145127] netpoll: netconsole: remote IPv4 address 10.236.251.1
> [   21.146206] netpoll: netconsole: remote ethernet address 68:05:ca:aa:99:49
> [   21.149464] =
> [   21.149466] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
> [   21.149467] 5.10.0-rc4_for_upstream_min_debug_2020_11_16_13_03 #1 Not 
> tainted
> [   21.149468] -
> [   21.149469] modprobe/596 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> [   21.149471] 000194b4e298 (_xmit_ETHER#2){+.-.}-{2:2}, at: 
> virtnet_poll_tx+0x84/0x120
> [   21.149478] and this task is already holding:
> [   21.149478] 893d4018 (target_list_lock){}-{2:2}, at: 
> write_msg+0x6c/0x120 [netconsole]
> [   21.149483] which would create a new lock dependency:
> [   21.149484]  (target_list_lock){}-{2:2} -> (_xmit_ETHER#2){+.-.}-{2:2}
> [   21.149491] but this new dependency connects a HARDIRQ-irq-safe lock:
> [   21.149492]  (console_owner){-.-.}-{0:0}
> [   21.149496] ... which became HARDIRQ-irq-safe at:
> [   21.149496]   __lock_acquire+0xa78/0x1a94
> [   21.149498]   lock_acquire.part.0+0x170/0x360
> [   21.149498]   lock_acquire+0x68/0x8c
> [   21.149500]   console_unlock+0x1e8/0x6a4
> [   21.149500]   vprintk_emit+0x1c4/0x3c4
> [   21.149501]   vprintk_default+0x40/0x4c
> [   21.149502]   vprintk_func+0x10c/0x220
> [   21.149503]   printk+0x68/0x90
> [   21.149504]   crng_fast_load+0x1bc/0x1c0
> [   21.149505]   add_interrupt_randomness+0x280/0x290
> [   21.149506]   handle_irq_event+0x80/0x120
> [   21.149507]   handle_fasteoi_irq+0xac/0x200
> [   21.149508]   __handle_domain_irq+0x84/0xf0
> [   21.149509]   gic_handle_irq+0xd4/0x320
> [   21.149510]   el1_irq+0xd0/0x180
> [   21.149511]   arch_cpu_idle+0x24/0x44
> [   21.149512]   default_idle_call+0x48/0xa0
> [   21.149513]   do_idle+0x260/0x300
> [   21.149514]   cpu_startup_entry+0x30/0x6c
> [   21.149515]   rest_init+0x1b4/0x288
> [   21.149515]   arch_call_rest_init+0x18/0x24
> [   21.149516]   start_kernel+0x5cc/0x608
> [   21.149518] to a HARDIRQ-irq-unsafe lock:
> [   21.149519]  (_xmit_ETHER#2){+.-.}-{2:2}
> [   21.149523] ... which became HARDIRQ-irq-unsafe at:
> [   21.149524] ...  __lock_acquire+0x8bc/0x1a94
> [   21.149525]   lock_acquire.part.0+0x170/0x360
> [   21.149526]   lock_acquire+0x68/0x8c
> [   21.149527]   _raw_spin_trylock+0x80/0xd0
> [   21.149527]   virtnet_poll+0xac/0x360
> [   21.149528]   net_rx_action+0x1b0/0x4e0
> [   21.149529]   __do_softirq+0x1f4/0x638
> [   21.149530]   do_softirq+0xb8/0xcc
> [   21.149531]   __local_bh_enable_ip+0x18c/0x200
> [   21.149532]   virtnet_napi_enable+0xc0/0xd4
> [   21.149533]   virtnet_open+0x98/0x1c0
> [   21.149534]   __dev_open+0x12c/0x200
> [   21.149535]   __dev_change_flags+0x1a0/0x220
> [   21.149536]   dev_change_flags+0x2c/0x70
> [   21.149536]   do_setlink+0x214/0xe20
> [   21.149537]   __rtnl_newlink+0x514/0x820
> [   21.149538]   rtnl_newlink+0x58/0x84
> [   21.149539]   rtnetlink_rcv_msg+0x184/0x4b4
> [   21.149540]   netlink_rcv_skb+0x60/0x124
> [   21.149541]   rtnetlink_rcv+0x20/0x30
> [   21.149542]   netlink_unicast+0x1b4/0x270
> [   21.149543]   netlink_sendmsg+0x1f0/0x400
> [   21.149544]   sock_sendmsg+0x5c/0x70
> [   21.149545]   sys_sendmsg+0x24c/0x280
> [   21.149545]   ___sys_sendmsg+0x88/0xd0
> [   21.149546]   __sys_sendmsg+0x70/0xd0
> [   21.149547]   __arm64_sys_sendmsg+0x2c/0x40
> [   21.149548]   el0_svc_common.constprop.0+0x84/0x200
> [   21.149549]   do_el0_svc+0x2c/0x90
> [   21.149550]   el0_svc+0x18/0x50
> [   21.149551]   el0_sync_handler+0xe0/0x350
> [   21.149552]   el0_sync+0x158/0x180
> [   21.149553] other info that might help us debug this:
> [   21.149555] Chain exists of:
> [   21.149556]   console_owner --> target_list_lock --> _xmit_ETHER#2

So somewhere we have target_list_lock taken while holding the console_owner
lock (which can happen in interrupt). And what this is saying,
target_list_lock is held somewhere where _xmit_ETHER#2 is taken (which I'm
guessing from the back traces is the xmit_lock taken in virtnet_poll_tx()).

Thus, you can have a deadlock with three CPUs (I need to update the lockdep
output to make it proper one of these days).

     CPU0                    CPU1                    CPU2