Re: [Qemu-arm] [PATCH v18 0/6] Add ARMv8 RAS virtualization support in QEMU

2019-09-27 Thread Peter Maydell
On Fri, 6 Sep 2019 at 09:33, Xiang Zheng wrote:
> [...]
>
> Since Dongjiu is too busy to do this work, I will finish the remaining work on
> his behalf.


Thanks for picking up the work on this patchset, and sorry it's taken me
a while to get to reviewing it. I've now given review comments on the
arm parts of this, which are looking in generally good shape (my comments
are all pretty minor stuff I think). I'll have to leave the ACPI parts
to somebody else to review as that is definitely not my speciality.

thanks
-- PMM



RE: [PATCH v18 0/6] Add ARMv8 RAS virtualization support in QEMU

2019-09-26 Thread gengdongjiu
Ping.

Hi Peter/Igor/all,
  Could you review these patches? Thanks a lot.




--
耿东久 Geng Dongjiu
Mobile: +86-18221809728
Email: gengdong...@huawei.com
From: zhengxiang (A)
To: pbonzini; mst; imammedo; shannon.zhaosl; peter.maydell; lersek;
    james.morse; gengdongjiu; mtosatti; rth; ehabkost; Jonathan Cameron;
    xuwei (O); kvm; qemu-devel; qemu-arm; Linuxarm
Cc: Wanghaibin (D)
Date: 2019-09-17 20:40:21
Subject: Re: [PATCH v18 0/6] Add ARMv8 RAS virtualization support in QEMU

Hi all,

This patch series has been tested for both TCG and KVM scenarios.

[...]

Re: [PATCH v18 0/6] Add ARMv8 RAS virtualization support in QEMU

2019-09-19 Thread gengdongjiu
Thanks, Xiang, for continuing the upstreaming and testing.
I hope the maintainers can review it.


On 2019/9/17 20:39, Xiang Zheng wrote:
> Hi all,
> 
> This patch series has been tested for both TCG and KVM scenarios.
> 
> [...]

Re: [Qemu-devel] [PATCH v18 0/6] Add ARMv8 RAS virtualization support in QEMU

2019-09-17 Thread Xiang Zheng
Hi all,

This patch series has been tested for both TCG and KVM scenarios.

1) Test for TCG:
   - Recompile QEMU after applying the patch referred to at
     https://patchwork.kernel.org/cover/10942757/#22640271.
   - Start QEMU with the command line shown below:
       ./qemu-system-aarch64 \
           -name guest=ras \
           -machine virt,gic-version=3,ras=on \
           -cpu cortex-a57 \
           -bios /usr/share/edk2/aarch64/QEMU_EFI.fd \
           -nodefaults \
           -kernel ${GUEST_KERNEL} \
           -initrd ${GUEST_FS} \
           -append "rdinit=init console=ttyAMA0 earlycon=pl011,0x900" \
           -m 8192 \
           -smp 4 \
           -serial stdio

   - Send a signal to one of the VCPU threads:
       kill -s SIGBUS 71571

   - The test result is shown below:

[   41.194753] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[   41.197329] {1}[Hardware Error]: event severity: recoverable
[   41.199078] {1}[Hardware Error]:  Error 0, type: recoverable
[   41.200829] {1}[Hardware Error]:   section_type: memory error
[   41.202603] {1}[Hardware Error]:   physical_address: 0x400a1000
[   41.204649] {1}[Hardware Error]:   error_type: 0, unknown
[   41.206328] EDAC MC0: 1 UE Unknown on unknown label ( page:0x400a1 offset:0x0 grain:0)
[   41.208788] Internal error: synchronous external abort: 96000410 [#1] SMP
[   41.210879] Modules linked in:
[   41.211823] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.19.0+ #8
[   41.213698] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[   41.215812] pstate: 60c00085 (nZCv daIf +PAN +UAO)
[   41.217296] pc : cpu_do_idle+0x8/0xc
[   41.218400] lr : arch_cpu_idle+0x2c/0x1b8
[   41.219629] sp : 09f9bf00
[   41.220649] x29: 09f9bf00 x28: 
[   41.222310] x27:  x26: 8001fe471d80
[   41.223945] x25:  x24: 0937ba38
[   41.225581] x23: 090b3338 x22: 09379000
[   41.227220] x21: 0937b000 x20: 0004
[   41.228871] x19: 090a6000 x18: 
[   41.230517] x17:  x16: 
[   41.232165] x15:  x14: 
[   41.233810] x13: 089f4da8 x12: 000e
[   41.235448] x11: 089f4d80 x10: 0af0
[   41.237101] x9 : 09f9be80 x8 : 8001fe4728d0
[   41.238738] x7 : 0004 x6 : 8001fffbaf30
[   41.240380] x5 : 0c43b940 x4 : 8001f6f0c000
[   41.242030] x3 : 0001 x2 : 09f9bf00
[   41.243666] x1 : 8001fffb82c8 x0 : 090a6018
[   41.245306] Process swapper/2 (pid: 0, stack limit = 0x(ptrval))
[   41.247378] Call trace:
[   41.248117]  cpu_do_idle+0x8/0xc
[   41.249111]  do_idle+0x1dc/0x2a8
[   41.250111]  cpu_startup_entry+0x28/0x30
[   41.251319]  secondary_start_kernel+0x180/0x1c8
[   41.252725] Code: a8c17bfd d65f03c0 d5033f9f d503207f (d65f03c0)
[   41.254606] ---[ end trace 221bc8a614fb5a1d ]---
[   41.256030] Kernel panic - not syncing: Fatal exception
[   41.257644] SMP: stopping secondary CPUs
[   41.258912] Kernel Offset: disabled
[   41.260011] CPU features: 0x0,22a00238
[   41.261178] Memory Limit: none
[   41.262122] ---[ end Kernel panic - not syncing: Fatal exception ]---
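
As a reading aid for the log above, the value 96000410 printed with
"synchronous external abort" can be split into its ESR_ELx fields. The
following is a minimal Python sketch, not code from this series: the function
name and the lookup tables are illustrative assumptions, and only the field
positions follow the Arm ESR_ELx layout for data aborts.

```python
# Sketch: split an AArch64 ESR_ELx value into the fields used by data aborts.
# decode_data_abort_esr() and the name tables are hypothetical helpers.

EC_NAMES = {
    0x24: "data abort from a lower exception level",
    0x25: "data abort taken without a change in exception level",
}

DFSC_NAMES = {
    0x10: "synchronous external abort, not on translation table walk",
}


def decode_data_abort_esr(esr: int) -> dict:
    """Return the EC, IL and data-abort ISS fields of a 32-bit ESR value."""
    iss = esr & 0x1FFFFFF            # ISS: bits [24:0]
    return {
        "ec": (esr >> 26) & 0x3F,    # exception class: bits [31:26]
        "il": (esr >> 25) & 0x1,     # instruction length: bit 25
        "fnv": (iss >> 10) & 0x1,    # FAR not valid: bit 10
        "ea": (iss >> 9) & 0x1,      # external abort type: bit 9
        "wnr": (iss >> 6) & 0x1,     # write not read: bit 6
        "dfsc": iss & 0x3F,          # data fault status code: bits [5:0]
    }


if __name__ == "__main__":
    fields = decode_data_abort_esr(0x96000410)  # value taken from the log
    print({name: hex(value) for name, value in fields.items()})
    print(EC_NAMES.get(fields["ec"], "other"))
    print(DFSC_NAMES.get(fields["dfsc"], "other"))
```

For the logged value this gives EC 0x25, FnV set and DFSC 0x10, which is
consistent with the "synchronous external abort" message in the trace.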

2) Test for KVM:
   - Start QEMU with the command line shown below:
       ./qemu-system-aarch64 \
           -name guest=ras \
           -machine virt,accel=kvm,gic-version=3,ras=on \
           -cpu host \
           -bios /usr/share/edk2/aarch64/QEMU_EFI.fd \
           -nodefaults \
           -kernel ${GUEST_KERNEL} \
           -initrd ${GUEST_FS} \
           -append "rdinit=init console=ttyAMA0 earlycon=pl011,0x900" \
           -m 8192 \
           -smp 4 \
           -serial stdio

   - Run mca-recover and get the GPA (IPA) of the allocated page that will be
     corrupted later.
   - Convert the GPA to an HPA and corrupt this HPA via APEI/EINJ.
   - Go back to the guest and continue reading this page.

   - The test result is shown below:

root@genericarmv8:~/tools# ./mca-recover
pagesize: 0x1000
before clear cache
flags for page 0x2317b2: uptodate active mmap anon swapbacked
vtop(0x9c9e8000) = 0x2317b2000
Hit any key to access: before read

after read
Access at Tue Sep 17 01:41:14 2019

flags for page 0x2317b2: uptodate active mmap anon swapbacked
[  403.298539] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[  403.301421] {1}[Hardware Error]: event severity: recoverable
[  403.303217] {1}[Hardware Error]:  Error 0, type: recoverable
[  403.304920] {1}[Hardware Error]:   section_type: memory error
[  
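
The KVM test steps above do not spell out how the GPA is converted to an HPA.
One plausible way, sketched below in Python, is to map the GPA to a host
virtual address through the memory slot that backs it and then translate that
address through a Linux pagemap entry, which is also the kind of lookup that
the vtop() line in the mca-recover output suggests. The helper names, the slot
parameters and the example slot addresses are assumptions for illustration;
only the pagemap bit layout (PFN in bits 0-54, "present" in bit 63) is taken
from the Linux pagemap interface.

```python
# Sketch of a GPA -> HVA -> HPA translation. All helper names are hypothetical.

PAGE_SHIFT = 12                    # 4 KiB pages ("pagesize: 0x1000" above)
PAGE_MASK = (1 << PAGE_SHIFT) - 1
PFN_MASK = (1 << 55) - 1           # pagemap bits 0-54: page frame number
PRESENT_BIT = 1 << 63              # pagemap bit 63: page present in RAM


def gpa_to_hva(gpa, slot_gpa_base, slot_hva_base, slot_size):
    """Map a guest physical address into the host mapping of its memory slot."""
    offset = gpa - slot_gpa_base
    if not 0 <= offset < slot_size:
        raise ValueError("GPA is outside the given memory slot")
    return slot_hva_base + offset


def pagemap_entry_to_phys(entry, vaddr):
    """Turn one 64-bit pagemap entry into a physical address, or None."""
    if not entry & PRESENT_BIT:
        return None                # page is swapped out or not mapped
    pfn = entry & PFN_MASK
    return (pfn << PAGE_SHIFT) | (vaddr & PAGE_MASK)


if __name__ == "__main__":
    # Entry rebuilt from the guest output above: page 0x2317b2, present.
    entry = PRESENT_BIT | 0x2317B2
    print(hex(pagemap_entry_to_phys(entry, 0x9C9E8000)))
    # Slot base addresses below are made-up example values.
    print(hex(gpa_to_hva(0x40001000, 0x40000000, 0x7F0000000000, 0x1000000)))
```

Reading real PFNs from /proc/<pid>/pagemap needs sufficient privileges on the
host, so the sketch only decodes an entry that has already been read.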

[Qemu-devel] [PATCH v18 0/6] Add ARMv8 RAS virtualization support in QEMU

2019-09-06 Thread Xiang Zheng
On the ARMv8 platform, the CPU error types are Synchronous External Abort (SEA)
and SError Interrupt (SEI). If an exception happens in the guest, it is sometimes
better for the guest to perform the recovery, because the host does not know the
details of the guest. For example, if an exception happens in a user-space
application within the guest, the host does not know which application
encountered the error.

For ARMv8 SEA/SEI, KVM or the host kernel delivers SIGBUS to notify user space.
After user space gets the notification, it records the CPER into the guest GHES
buffer and injects an exception or IRQ into the guest.

In the current implementation, if the SIGBUS code is BUS_MCEERR_AR, we treat
it as a synchronous exception and notify the guest with the ARMv8 SEA
notification type after recording the CPER into the guest.
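
The flow described in this paragraph can be sketched roughly as follows. This
is a minimal Python illustration under stated assumptions, not the code in the
series: GuestStub, record_cper(), inject_sea() and handle_sigbus() are
hypothetical names, and the stub only logs the order of operations. The two
si_code values are the Linux definitions of BUS_MCEERR_AR and BUS_MCEERR_AO.

```python
# Sketch: handle a SIGBUS according to its si_code, as described above.
from dataclasses import dataclass, field

BUS_MCEERR_AR = 4  # Linux si_code: hardware memory error, action required
BUS_MCEERR_AO = 5  # Linux si_code: hardware memory error, action optional


@dataclass
class GuestStub:
    """Hypothetical stand-in for the guest; it only logs what would be done."""
    actions: list = field(default_factory=list)

    def record_cper(self, gpa):
        # Stands for writing a CPER record into the guest GHES buffer.
        self.actions.append(("record_cper", gpa))

    def inject_sea(self):
        # Stands for notifying the guest with a synchronous external abort.
        self.actions.append(("inject_sea",))


def handle_sigbus(guest, si_code, fault_gpa):
    """Return True if the signal was handled as a synchronous error."""
    if si_code != BUS_MCEERR_AR:
        return False               # other codes are out of scope for this sketch
    guest.record_cper(fault_gpa)   # record the error first ...
    guest.inject_sea()             # ... then notify the guest
    return True


if __name__ == "__main__":
    guest = GuestStub()
    print(handle_sigbus(guest, BUS_MCEERR_AR, 0x400A1000), guest.actions)
```

The ordering matters in this sketch: the record is written before the guest is
notified, so that the guest finds the error data when it takes the abort.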

This patch series is based on QEMU 4.1 and consists of two parts:
1. Generate the APEI/GHES table.
2. Handle the SIGBUS signal, record the CPER at runtime and fill it into guest
   memory, then notify the guest according to the type of SIGBUS.

The overall solution was suggested by James (james.mo...@arm.com); the solution
for the APEI section was suggested by Laszlo (ler...@redhat.com).
Some of the discussion is in [1].

This patch series has already been tested on an ARM64 platform with the RAS
feature enabled:
The verification result for the APEI part is in [2].
The verification result for BUS_MCEERR_AR SIGBUS handling is in [3].

---

Since Dongjiu is too busy to do this work, I will finish the remaining work on
his behalf.

---
Changes since v17:
1. Improve some commit messages and comments.
2. Fix some code-style problems.
3. Add a *ras* machine option.
4. Move HEST/GHES related structures and macros into "hw/acpi/acpi_ghes.*".
5. Move HWPoison page functions into "include/sysemu/kvm_int.h".
6. Fix some bugs.
7. Improve the design document.

Changes since v16:
1. Check whether the ACPI table is enabled when handling the memory error in the
   SIGBUS handler.

Changes since v15:
1. Add a doc-comment in the proper format for 'include/exec/ram_addr.h'.
2. Remove write_part_cpustate_to_list() because another bug-fix patch has been
   merged: "arm: Allow system registers for KVM guests to be changed by QEMU code".
3. Add some comments for kvm_inject_arm_sea() in 'target/arm/kvm64.c'.
4. Compare the arm_current_el() return value to 0, 1, 2, 3, not to the
   PSTATE_MODE_* constants.
5. Reflect that RAS support was not introduced before QEMU version 4.1.
6. Move the no_ras flag patch to the beginning of this series.

Changes since v14:
1. Remove the BUS_MCEERR_AO handling logic because this asynchronous signal was
   masked by the main thread.
2. Address some of Igor Mammedov's comments (ACPI part):
   1) Change the comments for the enum AcpiHestNotifyType definition and remove
      ditto in patch 1.
   2) Change some patch commit messages and split the "APEI GHES table
      generation" patch into more patches.
3. Address some of Peter's comments (arm64 Synchronous External Abort injection):
   1) Change some code comments.
   2) Use arm_current_el() for the current EL.
   3) Use the helper functions for those (syn_data_abort_*).

Changes since v13:
1. Move the patches that set the guest ESR and inject a virtual SError out of this
   series.
2. Clean up and optimize the APEI part patches.
3. Update the commit messages and add some comments for the code.

Changes since v12:
1. Address Paolo's comments to move the HWPoisonPage definition to
   accel/kvm/kvm-all.c.
2. Only call kvm_cpu_synchronize_state() when the BUS_MCEERR_AR signal is received.
3. Only add and enable two hardware error sources: GPIO-Signal and ARMv8 SEA.
4. Address Michael's comments to not sync SPDX from the Linux kernel header file.

Changes since v11:
Address James's comments (james.mo...@arm.com):
1. Check whether KVM has the capability to set the ESR instead of detecting the
   host CPU RAS capability.
2. For a BUS_MCEERR_AR SIGBUS, use the Synchronous-External-Abort (SEA)
   notification type; for a BUS_MCEERR_AO SIGBUS, use the GPIO-Signal notification.


Address Shannon's comments (for the ACPI part):
1. Unify the hest_ghes.c and hest_ghes.h license declarations.
2. Remove the unnecessary include of "qmp-commands.h" in hest_ghes.c.
3. Unconditionally add the guest APEI table based on James's comments
   (james.mo...@arm.com).
4. Add an option to the virt machine for migration compatibility. On new virt
   machines it is on by default, while it is off for old ones; we enabled it
   since 2.12.
5. Refer to the ACPI spec version that first introduced Hardware Error
   Notification.
6. Add the ACPI_HEST_NOTIFY_RESERVED notification type.

Address Igor's comments (for the ACPI part):
1. Add a doc patch first that describes how it is supposed to work between
   QEMU/firmware/guest OS, with the expected flows.
2. Move the APEI diagrams into the doc/spec patch.
3. Remove the redundant g_malloc in ghes_record_cper().
4. Use the build_append_int_noprefix() API to compose the whole error status
   block and the whole APEI table, and try to get rid of most structures in
   patch 1, as they will be left unused after that.
5. Reuse something like