Re: update_balloon_size_func blocked for more than 120 seconds

2021-11-23 Thread Michael Ellerman
David Hildenbrand writes:
> On Thu, Nov 11, 2021 at 11:49 PM Luis Chamberlain wrote:
>>
>> I get the following splats with a kvm guest in idle, after a few seconds
>> it starts:
>>
>> [  242.412806] INFO: task kworker/6:2:271 blocked for more than 120 seconds.
>> [  242.415790]   Tainted: GE 5.15.0-next-2021 #68
>> [  242.417755] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [  242.418332] task:kworker/6:2 state:D stack:0 pid:  271 ppid: 2 flags:0x4000
>> [  242.418954] Workqueue: events_freezable update_balloon_size_func [virtio_balloon]
>> [  242.419518] Call Trace:
>> [  242.419709]  
>> [  242.419873]  __schedule+0x2fd/0x990
>> [  242.420142]  schedule+0x4e/0xc0
>> [  242.420382]  tell_host+0xaa/0xf0 [virtio_balloon]
>> [  242.420757]  ? do_wait_intr_irq+0xa0/0xa0
>> [  242.421065]  update_balloon_size_func+0x2c9/0x2e0 [virtio_balloon]
>> [  242.421527]  process_one_work+0x1e5/0x3c0
>> [  242.421833]  worker_thread+0x50/0x3b0
>> [  242.422204]  ? rescuer_thread+0x370/0x370
>> [  242.422507]  kthread+0x169/0x190
>> [  242.422754]  ? set_kthread_struct+0x40/0x40
>> [  242.423073]  ret_from_fork+0x1f/0x30
>> [  242.423347]  
>>
>> And this goes on endlessly. The last one says it blocked for more than 1208
>> seconds. This was not happening until the last few weeks but I see no
>> relevant recent commits for virtio_balloon, so the related change could
>> be elsewhere.
>
> We're stuck somewhere in:
>
> wq: update_balloon_size_func()->fill_balloon()->tell_host()
>
> Most probably in wait_event().
>
>
> I am no waitqueue expert, but my best guess is that we're waiting more
> than 2 minutes on a host reply in TASK_UNINTERRUPTIBLE state. At least
> that's my interpretation.
>
> If we're really stuck for more than 2 minutes, the hypervisor might no
> longer be processing our requests -- or they're not getting processed
> for some other reason (or the waitqueue is not getting woken up due to
> some other BUG).
>
> IIUC, we could sleep longer via wait_event_interruptible(); the
> TASK_UNINTERRUPTIBLE sleep seems to be what triggers the warning. But
> changing that might just hide the fact that we're waiting more than
> 2 minutes on a reply.
>
>>
>> I could bisect but first I figured I'd check to see if someone already
>> had spotted this.
>
> Bisecting would be awesome, then we might at least know if this is a
> guest or a hypervisor issue.

I see this on ppc64le also.

I bisected it to:

  # first bad commit: [939779f5152d161b34f612af29e7dc1ac4472fcf] virtio_ring: validate used buffer length

I also reported it in the thread hanging off that patch:

  https://lore.kernel.org/lkml/87zgpupcga@mpe.ellerman.id.au/


cheers
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: update_balloon_size_func blocked for more than 120 seconds

2021-11-12 Thread David Hildenbrand
On Thu, Nov 11, 2021 at 11:49 PM Luis Chamberlain wrote:
>
> I get the following splats with a kvm guest in idle, after a few seconds
> it starts:
>
> [  242.412806] INFO: task kworker/6:2:271 blocked for more than 120 seconds.
> [  242.415790]   Tainted: GE 5.15.0-next-2021 #68
> [  242.417755] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  242.418332] task:kworker/6:2 state:D stack:0 pid:  271 ppid: 2 flags:0x4000
> [  242.418954] Workqueue: events_freezable update_balloon_size_func [virtio_balloon]
> [  242.419518] Call Trace:
> [  242.419709]  
> [  242.419873]  __schedule+0x2fd/0x990
> [  242.420142]  schedule+0x4e/0xc0
> [  242.420382]  tell_host+0xaa/0xf0 [virtio_balloon]
> [  242.420757]  ? do_wait_intr_irq+0xa0/0xa0
> [  242.421065]  update_balloon_size_func+0x2c9/0x2e0 [virtio_balloon]
> [  242.421527]  process_one_work+0x1e5/0x3c0
> [  242.421833]  worker_thread+0x50/0x3b0
> [  242.422204]  ? rescuer_thread+0x370/0x370
> [  242.422507]  kthread+0x169/0x190
> [  242.422754]  ? set_kthread_struct+0x40/0x40
> [  242.423073]  ret_from_fork+0x1f/0x30
> [  242.423347]  
>
> And this goes on endlessly. The last one says it blocked for more than 1208
> seconds. This was not happening until the last few weeks but I see no
> relevant recent commits for virtio_balloon, so the related change could
> be elsewhere.

We're stuck somewhere in:

wq: update_balloon_size_func()->fill_balloon()->tell_host()

Most probably in wait_event().


I am no waitqueue expert, but my best guess is that we're waiting more
than 2 minutes on a host reply in TASK_UNINTERRUPTIBLE state. At least
that's my interpretation.

If we're really stuck for more than 2 minutes, the hypervisor might no
longer be processing our requests -- or they're not getting processed
for some other reason (or the waitqueue is not getting woken up due to
some other BUG).

IIUC, we could sleep longer via wait_event_interruptible(); the
TASK_UNINTERRUPTIBLE sleep seems to be what triggers the warning. But
changing that might just hide the fact that we're waiting more than
2 minutes on a reply.

>
> I could bisect but first I figured I'd check to see if someone already
> had spotted this.

Bisecting would be awesome, then we might at least know if this is a
guest or a hypervisor issue.

Note that the environment matters: the hypervisor seems to be requesting
the guest to inflate the balloon right when booting up. So you might not
be able to reproduce in a different environment.



update_balloon_size_func blocked for more than 120 seconds

2021-11-11 Thread Luis Chamberlain
I get the following splats with a kvm guest in idle, after a few seconds
it starts:

[  242.412806] INFO: task kworker/6:2:271 blocked for more than 120 seconds.
[  242.415790]   Tainted: GE 5.15.0-next-2021 #68
[  242.417755] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.418332] task:kworker/6:2 state:D stack:0 pid:  271 ppid: 2 flags:0x4000
[  242.418954] Workqueue: events_freezable update_balloon_size_func [virtio_balloon]
[  242.419518] Call Trace:
[  242.419709]  
[  242.419873]  __schedule+0x2fd/0x990
[  242.420142]  schedule+0x4e/0xc0
[  242.420382]  tell_host+0xaa/0xf0 [virtio_balloon]
[  242.420757]  ? do_wait_intr_irq+0xa0/0xa0
[  242.421065]  update_balloon_size_func+0x2c9/0x2e0 [virtio_balloon]
[  242.421527]  process_one_work+0x1e5/0x3c0
[  242.421833]  worker_thread+0x50/0x3b0
[  242.422204]  ? rescuer_thread+0x370/0x370
[  242.422507]  kthread+0x169/0x190
[  242.422754]  ? set_kthread_struct+0x40/0x40
[  242.423073]  ret_from_fork+0x1f/0x30
[  242.423347]  

And this goes on endlessly. The last one says it blocked for more than 1208
seconds. This was not happening until the last few weeks but I see no
relevant recent commits for virtio_balloon, so the related change could
be elsewhere.

I could bisect but first I figured I'd check to see if someone already
had spotted this.

  Luis