On the ARMv8 platform, the main hardware error sources are ARMv8
SEA/SEI/GSIV. For the ARMv8 SEA/SEI, KVM or the host kernel signals SIGBUS
or uses another interface to notify user space, such as QEMU. After QEMU gets
the notification, it records the CPER and injects the SEA/SEI into KVM. This
series of patches generates the APEI table when the guest OS boots, and
dynamically records CPER for the guest OS for generic hardware errors.
Currently user space handles only memory section hardware errors. Before QEMU
records the CPER, it needs to check the ACK value written by the guest OS to
avoid a read-write race condition.
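
The ACK check described above can be sketched roughly as follows. This is
only an illustrative model, not the code in this series: struct error_slot,
BLOCK_SIZE, record_cper() and guest_ack() are hypothetical names, and the
convention that a non-zero ACK means "the guest has consumed the previous
record" is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed block size, chosen only for this illustration. */
#define BLOCK_SIZE 1024u

/* Hypothetical model of one error source's storage in etc/hardware_errors. */
struct error_slot {
    uint64_t ack;               /* non-zero: guest has consumed the last record (assumed) */
    uint8_t  block[BLOCK_SIZE]; /* Error Status Data Block that holds the CPER */
};

/*
 * Producer side: record a CPER only if the guest has acknowledged the
 * previous one. Returns false when the record cannot be written yet.
 */
static bool record_cper(struct error_slot *slot, const void *cper, size_t len)
{
    assert(slot != NULL && cper != NULL);

    if (len > BLOCK_SIZE) {
        return false;           /* record does not fit in the block */
    }
    if (slot->ack == 0) {
        return false;           /* guest may still be reading the block */
    }

    slot->ack = 0;              /* mark the block as "not yet consumed" */
    memcpy(slot->block, cper, len);
    return true;
}

/* Consumer side: the guest signals that it has finished reading the block. */
static void guest_ack(struct error_slot *slot)
{
    assert(slot != NULL);
    slot->ack = 1;
}
```

In this model a second record_cper() call fails until guest_ack() has run,
which is the read-write race the ACK check is meant to prevent.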
Below is the APEI/GHESv2/CPER table layout. The maximum number of error
sources is 11, classified by notification type; currently only the SEA/SEI
notification type error sources are enabled.

  etc/acpi/tables                   etc/hardware_errors
  ===============                   ===================

  +------------------------+        +--------------------+      +------------------+
  | HEST                   |        | address registers  |      | Error Status     |
  +------------------------+        +--------------------+      | Data Block N     |
  | GHESN (N = 0 .. 10)    |        |                    |      +------------------+
  |   error_status_address-+------->| status_addressN ---+----->| CPER             |
  |   read_ack_register ---+------->| ack_addressN ---+  |      | CPER             |
  |   read_ack_preserve    |        |                 |  |      | ....             |
  |   read_ack_write       |        | ackN  <---------+  |      | CPER             |
  +------------------------+        +--------------------+      +------------------+

  (The pattern above repeats for each error source: GHES0 .. GHES10,
  status_address0 .. status_address10, ack_address0 .. ack_address10,
  ack0 .. ack10, and Error Status Data Block 0 .. 10.)

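
For illustration, the way a guest might use the read_ack_preserve and
read_ack_write fields from the layout above can be sketched as below. This
follows the ACPI description of those fields as a "bits to preserve" mask
and a "bits to set" mask. It is a simplified sketch: struct ghes_v2_ack and
ghes_v2_acknowledge() are hypothetical names, and read_ack_register is
modelled as a plain pointer rather than a Generic Address Structure, so bit
offsets and access widths are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Subset of the GHESv2 read-ack fields named in the layout (simplified). */
struct ghes_v2_ack {
    uint64_t *read_ack_register; /* assumption: plain pointer, not a GAS */
    uint64_t  read_ack_preserve; /* mask of bits to keep when writing */
    uint64_t  read_ack_write;    /* mask of bits to set when writing */
};

/* Acknowledge an error: read, keep the preserved bits, set the write bits. */
static void ghes_v2_acknowledge(const struct ghes_v2_ack *g)
{
    assert(g != NULL && g->read_ack_register != NULL);

    uint64_t val = *g->read_ack_register;

    val &= g->read_ack_preserve;
    val |= g->read_ack_write;
    *g->read_ack_register = val;
}
```

With read_ack_preserve set to all ones except bit 0 and read_ack_write set
to 1, the acknowledge step sets bit 0 and leaves every other bit untouched.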
After a SEA/SEI GHES error is injected, the guest OS kernel log looks like
this:
[ 142.95] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 8
[ 142.913141] {1}[Hardware Error]: event severity: recoverable
[ 142.914498] {1}[Hardware Error]: Error 0, type: recoverable
[ 142.915851] {1}[Hardware Error]: section_type: memory error
[ 142.917163] {1}[Hardware Error]: physical_address: 0x
[ 142.918792] {1}[Hardware Error]: error_type: 3, multi-bit ECC

How to test:
1. In the guest OS, use this command to dump the APEI table:
   "iasl -p ./HEST -d /sys/firmware/acpi/tables/HEST"
2. Find the address of the generic error status block according to the
   notification type.
3. Then find the CPER record through the generic error status block.

For example (notification type is SEA):
(1) root@genericarmv8:~# iasl -p ./HEST -d /sys/firmware/acpi/tables/HEST
(2) root@genericarmv8:~# cat HEST.dsl
/*
* Intel ACPI Component Architecture
* AML/ASL+ Disassembler version 20170728 (64-bit version)
* Copyright (c) 2000 - 2017 Intel Corporation
*
* Disassembly of /sys/firmware/acpi/tables/HEST, Mon Sep 5 07:59:17 2016
*
* ACPI Data Table [HEST]
*
*