Re: [REGRESSION][v6.8-rc1] virtio-pci: Introduce admin virtqueue

2024-05-16 Thread Jason Wang
On Thu, May 16, 2024 at 5:46 PM Catherine Redfield
 wrote:
>
> Feng,
>
> Thank you for providing your debugging steps; I used them on a gce image 
> locally and was not able to replicate the issue.  I also attempted to 
> replicate in qemu/virsh using qemu-guest-agent to enable the S3 suspend 
> state, also without success (that is S3 suspend state worked without any 
> problems).  I have brought this back to the cloud for further debugging of 
> their config and guest agent to try and determine what the issue is.
>
> Thank you very much for all your help on this issue and time looking into it!
> Catherine

Does this fix the issue? I guess the reason is that GCE is using legacy virtio.

https://lore.kernel.org/kvm/cacgkmeth_9baewekq862ygzwuozwg96z3g6oyqhzycj2jpu...@mail.gmail.com/T/

Thanks

>
> On Thu, May 9, 2024 at 5:03 AM Feng Liu  wrote:
>>
>>
>> On 2024-05-08 a.m.7:18, Catherine Redfield wrote:
>> > On a VM with the GCP kernel (where we first identified the problem), I see:
>> >
>> > 1. The full kernel log from `journalctl --system > kernlog` attached.
>> > The specific suspend section is here:
>> >
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > systemd[1]: Reached target sleep.target - Sleep.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > systemd[1]: Starting systemd-suspend.service - System Suspend...
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > systemd-sleep[1413]: Performing sleep operation 'suspend'...
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: suspend entry (deep)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Filesystems sync: 0.008 seconds
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing user space processes
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing user space processes completed (elapsed 0.001 seconds)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: OOM killer disabled.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing remaining freezable tasks
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing remaining freezable tasks completed (elapsed 0.000 
>> > seconds)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: printk: Suspending console(s) (use no_console_suspend to debug)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: port 00:03:0.0: PM: dpm_run_callback():
>> > pm_runtime_force_suspend+0x0/0x130 returns -16
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: port 00:03:0.0: PM: failed to suspend: error -16
>>
>> Thanks to Joseph and Catherine for their help.
>>
>> Hi,
>>
>> I have already synced up with the Canonical folks offline about this issue.
>>
>> I can run "suspend/resume" successfully on my local server and VM.
>> Also, "PM: failed to suspend: error -16" does not look like it is caused
>> by my previous virtio patch (fd27ef6b44be ("virtio-pci: Introduce admin
>> virtqueue")), which only modified "virtio_device_freeze" on the "suspend"
>> path.
>>
>> So I have provided my steps and debug patch to Joseph and Catherine.
>> I will also share the information here, as follows:
>>
>> I have read the QEMU code and found a way to trigger "suspend/resume" on
>> my setup, and added some debug messages to the latest kernel.
>>
>> My steps are:
>> 1. Add the following to the QEMU command line:
>> 
>> -global PIIX4_PM.disable_s3=0 \
>> -global PIIX4_PM.disable_s4=1 \
>> 
>> -netdev type=tap,ifname=tap0,id=hostnet0,script=no,downscript=no \
>> -device
>> virtio-net-pci,netdev=hostnet0,id=net0,mac=$SSH_MAC,bus=pci.0,addr=0x3 \
>> ..
>>
>> 2. In the VM, run "systemctl suspend" to suspend the VM to memory
>> 3. In the QEMU HMP shell, run "system_wakeup" to resume the VM
>>
>> My VM configuration:
>> NIC: 1 virtio nic emulated by QEMU
>> OS:  Ubuntu 22.04.4 LTS
>> kernel:  latest kernel, 6.9-rc7: ee5b455b0ada (kernel2/net-next-virito,
>> kernel2/master, master) Merge tag 'slab-for-6.9-rc7-fixes' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab)
>>
>>
>> I added some debug messages to the latest kernel and followed the steps
>> above to trigger "suspend/resume". Everything in the VM is OK; the VM can
>> suspend and resume successfully.
>> The following is the kernel log:
>> 
>> 
>> May  6 15:59:52 feliu-vm kernel: [   43.446737] PM: suspend entry (deep)
>> May  6 16:00:04 feliu-vm kernel: [   43.467640] Filesystems sync: 0.020
>> seconds
>> May  6 16:00:04 feliu-vm kernel: [   43.467923] Freezing user space
>> processes
>> May  6 16:00:04 feliu-vm kernel: [   43.470294] Freezing user space
>> 

Re: [REGRESSION][v6.8-rc1] virtio-pci: Introduce admin virtqueue

2024-05-08 Thread Feng Liu


On 2024-05-08 a.m.7:18, Catherine Redfield wrote:

On a VM with the GCP kernel (where we first identified the problem), I see:

1. The full kernel log from `journalctl --system > kernlog` attached.  
The specific suspend section is here:


May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal systemd[1]: Reached target sleep.target - Sleep.
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal systemd[1]: Starting systemd-suspend.service - System Suspend...
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal systemd-sleep[1413]: Performing sleep operation 'suspend'...
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: PM: suspend entry (deep)
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: Filesystems sync: 0.008 seconds
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: Freezing user space processes
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: Freezing user space processes completed (elapsed 0.001 seconds)
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: OOM killer disabled.
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: Freezing remaining freezable tasks
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: printk: Suspending console(s) (use no_console_suspend to debug)
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: port 00:03:0.0: PM: dpm_run_callback(): pm_runtime_force_suspend+0x0/0x130 returns -16
May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal kernel: port 00:03:0.0: PM: failed to suspend: error -16


Thanks to Joseph and Catherine for their help.

Hi,

I have already synced up with the Canonical folks offline about this issue.

I can run "suspend/resume" successfully on my local server and VM.
Also, "PM: failed to suspend: error -16" does not look like it is caused
by my previous virtio patch (fd27ef6b44be ("virtio-pci: Introduce admin
virtqueue")), which only modified "virtio_device_freeze" on the "suspend"
path.
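
For reference, the -16 in the failing log is a negated errno value. The following is a minimal Python sketch, not part of the original debugging, that pulls the device name out of such a line and decodes the code. The regular expression and the helper name parse_pm_failure are illustrative assumptions, not a kernel-defined format.

```python
import errno
import os
import re

# Matches lines such as:
#   kernel: port 00:03:0.0: PM: failed to suspend: error -16
# The pattern is an illustrative assumption, not a kernel-defined format.
PM_FAILURE = re.compile(
    r"kernel: (?P<dev>.+?): PM: failed to suspend: error (?P<code>-?\d+)"
)


def parse_pm_failure(line):
    """Return (device, errno name, description), or None if the line does not match."""
    match = PM_FAILURE.search(line)
    if match is None:
        return None
    code = abs(int(match.group("code")))
    name = errno.errorcode.get(code, "UNKNOWN")
    return match.group("dev"), name, os.strerror(code)


if __name__ == "__main__":
    sample = "kernel: port 00:03:0.0: PM: failed to suspend: error -16"
    print(parse_pm_failure(sample))
```

On Linux, errno 16 is EBUSY. In the log above, the callback that returns it is pm_runtime_force_suspend on the device "port 00:03:0.0", not a virtio freeze callback.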


So I have provided my steps and debug patch to Joseph and Catherine.
I will also share the information here, as follows:


I have read the QEMU code and found a way to trigger "suspend/resume" on
my setup, and added some debug messages to the latest kernel.


My steps are:
1. Add the following to the QEMU command line:

-global PIIX4_PM.disable_s3=0 \
-global PIIX4_PM.disable_s4=1 \

-netdev type=tap,ifname=tap0,id=hostnet0,script=no,downscript=no \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=$SSH_MAC,bus=pci.0,addr=0x3 \

..

2. In the VM, run "systemctl suspend" to suspend the VM to memory
3. In the QEMU HMP shell, run "system_wakeup" to resume the VM
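
Step 3 can also be scripted. The sketch below is a hedged illustration that sends the QMP command "system_wakeup" instead of typing it in the HMP shell. It assumes QEMU was started with a QMP UNIX socket (for example an option along the lines of "-qmp unix:/tmp/qmp.sock,server=on,wait=off"); that socket path and the helper names are hypothetical and were not part of the setup described above.

```python
import json
import socket


def qmp_command(name):
    """Encode a QMP command without arguments as one JSON line."""
    return (json.dumps({"execute": name}) + "\r\n").encode()


def read_reply(reader):
    """Read messages until a command reply arrives, skipping the greeting and events."""
    while True:
        line = reader.readline()
        if not line:
            raise ConnectionError("QMP connection closed")
        message = json.loads(line)
        if "return" in message or "error" in message:
            return message


def wake_guest(socket_path):
    """Send system_wakeup over a QMP UNIX socket (the path is a hypothetical example)."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        reader = sock.makefile("r")
        # QMP requires capability negotiation before any other command.
        sock.sendall(qmp_command("qmp_capabilities"))
        read_reply(reader)
        sock.sendall(qmp_command("system_wakeup"))
        return read_reply(reader)
```

Calling wake_guest("/tmp/qmp.sock") after the guest has suspended should return the reply to "system_wakeup"; an "error" key in the reply means QEMU rejected the command.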

My VM configuration:
NIC: 1 virtio nic emulated by QEMU
OS:  Ubuntu 22.04.4 LTS
kernel:  latest kernel, 6.9-rc7: ee5b455b0ada (kernel2/net-next-virito, 
kernel2/master, master) Merge tag 'slab-for-6.9-rc7-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab)



I added some debug messages to the latest kernel and followed the steps
above to trigger "suspend/resume". Everything in the VM is OK; the VM can
suspend and resume successfully.

The following is the kernel log:


May  6 15:59:52 feliu-vm kernel: [   43.446737] PM: suspend entry (deep)
May  6 16:00:04 feliu-vm kernel: [   43.467640] Filesystems sync: 0.020 seconds
May  6 16:00:04 feliu-vm kernel: [   43.467923] Freezing user space processes
May  6 16:00:04 feliu-vm kernel: [   43.470294] Freezing user space processes completed (elapsed 0.002 seconds)
May  6 16:00:04 feliu-vm kernel: [   43.470299] OOM killer disabled.
May  6 16:00:04 feliu-vm kernel: [   43.470301] Freezing remaining freezable tasks
May  6 16:00:04 feliu-vm kernel: [   43.471482] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
May  6 16:00:04 feliu-vm kernel: [   43.471495] printk: Suspending console(s) (use no_console_suspend to debug)
May  6 16:00:04 feliu-vm kernel: [   43.474034] virtio_net virtio0: godeng virtio device freeze
May  6 16:00:04 feliu-vm kernel: [   43.475714] virtio_net virtio0 ens3: godfeng virtnet_freeze done
May  6 16:00:04 feliu-vm kernel: [   43.475717] virtio_net virtio0: godfeng VIRTIO_F_ADMIN_VQ not enabled
May  6 16:00:04 feliu-vm kernel: [   43.475719] virtio_net virtio0: godeng virtio device freeze done

May  6 16:00:04 feliu-vm kernel: [   43.535382] smpboot: CPU 1 is now offline
May  6 16:00:04 feliu-vm kernel: [   43.537283] IRQ fixup: irq 1 move in progress, old vector 32
May  6 16:00:04 feliu-vm kernel: [   43.538504] smpboot: CPU 2 is now offline
May  6 16:00:04 feliu-vm kernel: [   43.541392] 

Re: [REGRESSION][v6.8-rc1] virtio-pci: Introduce admin virtqueue

2024-05-07 Thread Jason Wang
On Sat, May 4, 2024 at 2:10 AM Joseph Salisbury
 wrote:
>
> Hi Feng,
>
> During testing, a kernel bug was identified with the suspend/resume
> functionality on instances running in a public cloud [0].  This bug is a
> regression introduced in v6.8-rc1.  After a kernel bisect, the following
> commit was identified as the cause of the regression:
>
> fd27ef6b44be  ("virtio-pci: Introduce admin virtqueue")

From a quick glance at the patch, it seems it should not break
freeze/restore, as it should behave as it did in the past.

But I found something interesting:

1) it assumes a single admin vq, which is not what the spec says
2) there is a special function for the admin virtqueue during
freeze/restore, but it doesn't do anything beyond del_vq()
3) it lacks real users, but I guess that e.g. destroy_avq() needs to be
synchronized with whoever is using the admin virtqueue

>
> I was hoping to get your feedback, since you are the patch author. Do
> you think gathering any additional data will help diagnose this issue?

Yes, please show us:

1) the kernel log
2) the features that the device offers, e.g. the contents of
/sys/bus/virtio/devices/virtio0/features
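
As a rough aid for reading that file: the sysfs "features" attribute holds one '0' or '1' character per feature bit, starting with bit 0. The Python sketch below decodes it and checks two bits relevant to this thread. The bit numbers are taken from include/uapi/linux/virtio_config.h; the helper names are hypothetical.

```python
# Feature bit numbers from include/uapi/linux/virtio_config.h.
VIRTIO_F_VERSION_1 = 32
VIRTIO_F_ADMIN_VQ = 41


def feature_bits(features):
    """Return the set of bit numbers set in a sysfs 'features' string.

    The file holds one '0' or '1' character per feature bit, bit 0 first.
    """
    return {bit for bit, char in enumerate(features.strip()) if char == "1"}


def summarize(features):
    """Report whether the device is modern and whether the admin vq feature is set."""
    bits = feature_bits(features)
    return {
        "VIRTIO_F_VERSION_1": VIRTIO_F_VERSION_1 in bits,
        "VIRTIO_F_ADMIN_VQ": VIRTIO_F_ADMIN_VQ in bits,
    }


def read_features(path="/sys/bus/virtio/devices/virtio0/features"):
    """Read and summarize the features file; adjust the device name as needed."""
    with open(path) as handle:
        return summarize(handle.read())
```

Running read_features() inside the guest prints nothing by itself; print its return value to see the two flags. A device without VIRTIO_F_VERSION_1 is a legacy (pre-1.0) device.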

> This commit is depended upon by other virtio commits, so a revert test
> is not really straight forward without reverting all the dependencies.
> Any ideas you have would be greatly appreciated.

Thanks

>
>
> Thanks,
>
> Joe
>
> http://pad.lv/2063315
>