Re: [PATCH v6 0/7] Add RAS virtualization support for SEA/SEI notification type in KVM

2017-09-07 Thread James Morse
Hi gengdongjiu,

(I've re-ordered some of the hunks here:)

On 04/09/17 12:10, gengdongjiu wrote:
> On 2017/9/1 1:43, James Morse wrote:
>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>> Not call memory_failure() to handle it. Because the error address recorded
>>> by APEI is not accurate, so can not identify the address to hwpoison
>>> memory.
>>
>> This looks like a firmware bug, what address do you get in your CPER
>> records? It should be a physical address.

> No, not firmware bug. At least in the armv8.0 CPU and huawei's armv8.2 CPU,
> the architecture decided it is not accurate, this abort is asynchronous not
> synchronous.

This is going to be a problem. (I'm chasing Achin to find out when this is
allowed to happen and what we're expected to do about it!)

I hope this isn't the default behaviour, but only happens in exceptionally rare
circumstances.


>> To report a memory-error you must have an address.
> maybe we can not get the accurate error address, can you get it in your armv8
> platform?

I only have software-models, they only generate the errors you tell them to.


I think I see why you're taking this approach with the series, the scenario is:
1. Firmware takes an SError due to a bad memory location from guest EL0.
2. The CPU doesn't provide the address of the memory location.

You want to confine this error as much as possible, in particular to the context
it came from (e.g. guest EL0). CPU context isn't something the CPER records can
describe (they describe failures in system components), hence your hybrid
{kernel,firmware}-first code.

I don't think it's safe to kill guest-EL0 and hope this confines the error.

If the affected page of guest memory has never been written to by the guest, the
host will map in the global zero-page, (made read-only at stage2). If the
corruption is in this page it affects the host kernel, guests and user-space
processes. Just because the error came from guest-EL0 doesn't mean
kernel/hypervisor memory isn't affected.

This doesn't just affect that one page: KSM may have merged every copy of every
guest user-space's libc, which has subsequently become corrupt. The first
guest-EL0 to step in this triggers the fault, but it affects all the guests.
With the address all the guests can fix this error, and KSM will re-merge the
pages. Without the address every user-space process in every guest will
eventually be killed.

We aren't even guaranteed that the access that caused the fault came from your
guest EL0. The fault may be in the page tables belonging to the guest kernel or,
even worse, in the hypervisor's stage2 page tables.

(Thanks to Mark and Robin for these examples)


I think in this scenario your firmware should describe a memory-error with an
unknown address. (i.e. don't set the 'physical address valid' bit in CPER's
'Table 275 Memory Error Record'). When Linux gets one of these, it should
panic(): We know some memory is corrupt, we don't know where it is.
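
A minimal sketch of that policy, assuming a cut-down record layout: the two
validation-bit positions follow the CPER Memory Error section, but
struct mem_err_record, enum mem_err_action and classify_mem_err() are
hypothetical names used only for illustration, not existing kernel code.

```c
#include <stdint.h>

/* Validation bits of the CPER Memory Error section. */
#define MEM_ERR_VALID_ERROR_STATUS  (1ULL << 0)
#define MEM_ERR_VALID_PA            (1ULL << 1)  /* 'physical address valid' */

/* Hypothetical, cut-down view of a memory error record. */
struct mem_err_record {
    uint64_t validation_bits;
    uint64_t physical_addr;
};

enum mem_err_action {
    MEM_ERR_POISON_PAGE,  /* address known: the page can be isolated */
    MEM_ERR_FATAL,        /* address unknown: the error cannot be contained */
};

/* Decide what the OS can do with a reported memory error. */
enum mem_err_action classify_mem_err(const struct mem_err_record *rec)
{
    if (rec->validation_bits & MEM_ERR_VALID_PA)
        return MEM_ERR_POISON_PAGE;

    /* Some memory is corrupt, but we don't know where it is. */
    return MEM_ERR_FATAL;
}
```

In the kernel the first case would end up in memory_failure() and the second
in panic(); the sketch only shows the decision.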


>> User-space may be signalled by the memory_failure() helper, and user-space
>> may choose to notify the guest about the memory-failure, but this would be a
>> new error.

> For the SError, it is asynchronous abort. so it is not better to call
> memory_failure() helper, because the error address is not accurate.
> memory_failure() will offline or poison the address, but the address is not
> accurate. so it is dangerous

By 'not accurate' do you mean the CPU provides an address and it's wrong
(surely this is a CPU bug), or that no address is provided (i.e. the
ERRADDR.AI 'address incorrect' bit is set)?


>>> Because the error was taken from a lower Exception level, if the
>>> exception is SEA/SEI and HCR_EL2.TEA/HCR_EL2.AMO is 1, firmware
>>> sets ESR_EL2/FAR_EL2 to fake a exception trap to EL2, then
>>> transfers to hypervisor.
>>
>> What happens if you took an SError from EL2 and EL2 has PSTATE.A set masking
>> SError? (this is very common today: all kernel code runs like this).

> Firstly, the guest OS usually runs at EL1 or EL0, not EL2.
> If an SError happens at EL2, it will trap to EL3 firmware even though PSTATE.A
> is set, because PSTATE.A cannot mask the SError if it is trapped to EL3.

Sure, we agree that from the CPU's view when SCR_EL3.EA is set physical-SError
can't be masked when executing any EL below EL3.

My question was about the 'firmware sets ESR_EL2/FAR_EL2 to fake an exception
trap to EL2' step. While EL3 can take the physical-SError at any time the
normal-world is running, it can't always deliver a fake-SError to EL2, because
EL2 believes it has masked physical-SError.

With the SError rework this should only be masked while we are in entry.S
preparing to handle an exception, receiving an unexpected asynchronous exception
at this point would overwrite ELR/ESR, meaning we could never handle the
original exception.
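
To make the concern concrete, here is a rough sketch of the check firmware
would need before faking the exception. It assumes the interrupted context was
AArch64, so that SPSR_EL3 holds the saved PSTATE.A in bit 8 and the Exception
level in M[3:2]. fw_can_fake_serror_to_el2() is a hypothetical name, and the
sketch ignores the HCR_EL2 routing controls.

```c
#include <stdbool.h>
#include <stdint.h>

#define SPSR_A_BIT       (1ULL << 8)  /* saved PSTATE.A: SError masked */
#define SPSR_M_EL_SHIFT  2
#define SPSR_M_EL_MASK   0x3ULL       /* M[3:2]: EL the exception came from */

/*
 * Hypothetical firmware helper: may EL3 emulate an SError exception to EL2
 * for the context described by spsr_el3?
 */
bool fw_can_fake_serror_to_el2(uint64_t spsr_el3)
{
    unsigned int el = (spsr_el3 >> SPSR_M_EL_SHIFT) & SPSR_M_EL_MASK;

    /*
     * EL2 was running with SError masked: an emulated exception here would
     * overwrite ELR_EL2/ESR_EL2 while EL2 still needs them.
     */
    if (el == 2 && (spsr_el3 & SPSR_A_BIT))
        return false;

    return true;
}
```

What firmware should do when this returns false is exactly the open question.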


>> What happens if the hypervisor then executes an ESB with PSTATE.A set? It
>> expects to see any pending SError deferred and its syndrome 

Re: [PATCH v6 0/7] Add RAS virtualization support for SEA/SEI notification type in KVM

2017-09-06 Thread gengdongjiu
Hi Peter,

On 2017/9/6 19:19, Peter Maydell wrote:
> On 28 August 2017 at 11:38, Dongjiu Geng wrote:
>> In the firmware-first RAS solution, corrupt data is detected in a
>> memory location when guest OS application software executing at EL0
>> or guest OS kernel El1 software are reading from the memory. The
>> memory node records errors in an error record accessible using
>> system registers.
> 
> Hi Dongjiu -- it looks like this patch set is extending
> the API KVM provides to userspace, but it doesn't update
> the documentation in Documentation/virtual/kvm/api.txt.
> I appreciate the API is still somewhat under discussion,
> but if you can include the docs updates it's helpful to
> me for reviewing whether the API makes sense from the
> userspace consumer end of it.

sure, it should. thanks a lot for the reminder. I will update the related docs
in my next patch set version.

> 
> thanks
> -- PMM



Re: [PATCH v6 0/7] Add RAS virtualization support for SEA/SEI notification type in KVM

2017-09-06 Thread Peter Maydell
On 28 August 2017 at 11:38, Dongjiu Geng wrote:
> In the firmware-first RAS solution, corrupt data is detected in a
> memory location when guest OS application software executing at EL0
> or guest OS kernel El1 software are reading from the memory. The
> memory node records errors in an error record accessible using
> system registers.

Hi Dongjiu -- it looks like this patch set is extending
the API KVM provides to userspace, but it doesn't update
the documentation in Documentation/virtual/kvm/api.txt.
I appreciate the API is still somewhat under discussion,
but if you can include the docs updates it's helpful to
me for reviewing whether the API makes sense from the
userspace consumer end of it.

thanks
-- PMM


Re: [PATCH v6 0/7] Add RAS virtualization support for SEA/SEI notification type in KVM

2017-09-04 Thread gengdongjiu
Hi James

On 2017/9/1 1:43, James Morse wrote:
> Hi Dongjiu Geng,
> 
> On 28/08/17 11:38, Dongjiu Geng wrote:
>> In the firmware-first RAS solution, corrupt data is detected in a
>> memory location when guest OS application software executing at EL0
>> or guest OS kernel El1 software are reading from the memory. The
>> memory node records errors in an error record accessible using
>> system registers.
>>
>> Because SCR_EL3.EA is 1, then CPU will trap to El3 firmware, EL3
>> firmware records the error to APEI table through reading system
>> register.
> 
> Strictly speaking these are CPER records in a memory region pointed to by the
> HEST->GHES ACPI table.
Yes. Here I mean that EL3 firmware reads the RAS error record registers
ERXxxx_EL1, such as ERXADDR_EL1/ERXMISC0_EL1, to get the detailed error info,
then records it in the memory region pointed to by the HEST->GHES ACPI table.

> 
> 
>> Because the error was taken from a lower Exception level, if the
>> exception is SEA/SEI and HCR_EL2.TEA/HCR_EL2.AMO is 1, firmware
>> sets ESR_EL2/FAR_EL2 to fake a exception trap to EL2, then
>> transfers to hypervisor.
> 
> What happens if you took an SError from EL2 and EL2 has PSTATE.A set masking
> SError? (this is very common today: all kernel code runs like this).
Firstly, the guest OS usually runs at EL1 or EL0, not EL2.
If an SError happens at EL2, it will trap to EL3 firmware even though PSTATE.A
is set, because PSTATE.A cannot mask the SError if it is trapped to EL3.

> 
> What happens if the hypervisor then executes an ESB with PSTATE.A set? It
> expects to see any pending SError deferred and its syndrome written to
> DISR_EL1, but this register is RAZ/WI when you set SCR_EL3.EA. '4.4.2' of [0]
From my understanding, if SCR_EL3.EA is set, the abort cannot be masked: it
is always taken to EL3, so DISR_EL1 cannot record the syndrome. DISR_EL1 is
only written when the External Abort is masked, but when SCR_EL3.EA is set,
PSTATE.A cannot mask the error.


> 
> 
>> For the synchronous external abort(SEA), Hypervisor calls the
>> ghes_handle_memory_failure() to deal with this error,
>> ghes_handle_memory_failure() function reads the APEI table and
>> calls memory_failure() to decide whether it needs to deliver
>> SIGBUS signal to user space, the advantage of using SIGBUS signal
>> to notify user space is that it can be compatible with Non-Kvm users.
>>
>> For the SError Interrupt(SEI),KVM firstly classified the error.
> 
> KVM can't parse the CPER records, nor does it know where to look to find them.
> KVM should call out to the APEI code so the host kernel can handle the error.
KVM does not parse the CPER records; I mean that KVM classifies the error
according to ESR_EL2.AET.

As shown below:

AET, bits [12:10], when categorized
Asynchronous Error Type. Describes the state of the PE after taking the SError
interrupt exception. Software might use the information in the syndrome
registers to determine what recovery might be possible. See Architecturally
consumed errors. The possible values of this field are:
0b000 Uncontainable error (UC).
0b001 Unrecoverable error (UEU).
0b010 Restartable error (UEO).
0b011 Recoverable error (UER).
0b110 Corrected error (CE).

I pasted the code here:
+static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu)
+{
+    unsigned int esr = kvm_vcpu_get_hsr(vcpu);
+    bool impdef_syndrome = esr & ESR_ELx_ISV;    /* aka IDS */
+    unsigned int aet = esr & ESR_ELx_AET;
+
+    /*
+     * In below three conditions, it will directly inject the virtual SError.
+     * 1. Not support RAS extension; the Syndrome is IMPLEMENTATION DEFINED;
+     * AET is RES0 if 'the value returned in the DFSC field is not
+     * [ESR_ELx_FSC_SERROR]'
+     */
+    if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN) || impdef_syndrome ||
+        ((esr & ESR_ELx_FSC) != ESR_ELx_FSC_SERROR)) {
+        kvm_inject_vabt(vcpu);
+        return 1;
+    }
+
+    switch (aet) {
+    case ESR_ELx_AET_CE:     /* corrected error */
+    case ESR_ELx_AET_UEO:    /* restartable error, not yet consumed */
+        return 0;            /* continue processing the guest exit */
+    case ESR_ELx_AET_UEU:    /* The error has not been propagated */
+        /*
+         * Only handle the guest user mode SEI if the error has not been
+         * propagated
+         */
+        if ((!vcpu_mode_priv(vcpu)) &&
+            !handle_guest_sei(kvm_vcpu_get_hsr(vcpu)))
+            return 1;
+
+        /* If SError handling is failed, continue run */
+    default:
+        /*
+         * Until now, the CPU supports RAS and SEI is fatal, or user space
+         * does not support to handle the SError.
+         */
+        panic("This Asynchronous SError interrupt is dangerous, panic");
+    }
+}
+
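
As a standalone illustration of the AET decoding described above (not part of
the patch), the field can be extracted like this. The bit position and
encodings are taken from the table quoted earlier; esr_aet(), aet_name() and
the AET_* names are made up for this sketch. The field is only meaningful when
the syndrome is not IMPLEMENTATION DEFINED and DFSC reports an SError, which
the hunk above checks first.

```c
#include <stdint.h>

#define ESR_AET_SHIFT  10
#define ESR_AET_MASK   0x7u   /* AET is ESR bits [12:10] */

/* Encodings from the AET table quoted above. */
enum aet {
    AET_UC  = 0,  /* uncontainable */
    AET_UEU = 1,  /* unrecoverable */
    AET_UEO = 2,  /* restartable */
    AET_UER = 3,  /* recoverable */
    AET_CE  = 6,  /* corrected */
};

/* Extract the AET field as a small integer (0..7). */
unsigned int esr_aet(uint32_t esr)
{
    return (esr >> ESR_AET_SHIFT) & ESR_AET_MASK;
}

/* Human-readable name for an AET value; other encodings are reserved. */
const char *aet_name(unsigned int aet)
{
    switch (aet) {
    case AET_UC:  return "UC";
    case AET_UEU: return "UEU";
    case AET_UEO: return "UEO";
    case AET_UER: return "UER";
    case AET_CE:  return "CE";
    default:      return "reserved";
    }
}
```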

I have called the kernel code to handle the SError.

+/*
+ * Handle SError interrupt that occur in guest OS.
+ *
+ * The return value will be 

Re: [PATCH v6 0/7] Add RAS virtualization support for SEA/SEI notification type in KVM

2017-08-31 Thread James Morse
Hi Dongjiu Geng,

On 28/08/17 11:38, Dongjiu Geng wrote:
> In the firmware-first RAS solution, corrupt data is detected in a
> memory location when guest OS application software executing at EL0
> or guest OS kernel El1 software are reading from the memory. The
> memory node records errors in an error record accessible using
> system registers.
> 
> Because SCR_EL3.EA is 1, then CPU will trap to El3 firmware, EL3
> firmware records the error to APEI table through reading system
> register.

Strictly speaking these are CPER records in a memory region pointed to by the
HEST->GHES ACPI table.


> Because the error was taken from a lower Exception level, if the
> exception is SEA/SEI and HCR_EL2.TEA/HCR_EL2.AMO is 1, firmware
> sets ESR_EL2/FAR_EL2 to fake a exception trap to EL2, then
> transfers to hypervisor.

What happens if you took an SError from EL2 and EL2 has PSTATE.A set masking
SError? (this is very common today: all kernel code runs like this).

What happens if the hypervisor then executes an ESB with PSTATE.A set? It
expects to see any pending SError deferred and its syndrome written to DISR_EL1,
but this register is RAZ/WI when you set SCR_EL3.EA. '4.4.2' of [0]


> For the synchronous external abort(SEA), Hypervisor calls the
> ghes_handle_memory_failure() to deal with this error,
> ghes_handle_memory_failure() function reads the APEI table and
> calls memory_failure() to decide whether it needs to deliver
> SIGBUS signal to user space, the advantage of using SIGBUS signal
> to notify user space is that it can be compatible with Non-Kvm users.
> 
> For the SError Interrupt(SEI),KVM firstly classified the error.

KVM can't parse the CPER records, nor does it know where to look to find them.
KVM should call out to the APEI code so the host kernel can handle the error.

User-space may be signalled by the memory_failure() helper, and user-space may
choose to notify the guest about the memory-failure, but this would be a new
error.


> Not call memory_failure() to handle it. Because the error address recorded
> by APEI is not accurate, so can not identify the address to hwpoison
> memory.

This looks like a firmware bug, what address do you get in your CPER records? It
should be a physical address.

To report a memory-error you must have an address.

If the error wasn't detected as a synchronous access then delivering a
synchronous-external-abort is inappropriate (I think we both agree on this), and
SError-interrupt doesn't have a way of specifying an address ... but the CPER
records do.

For firmware-first your SError-interrupt is just a notification; it's the CPER
records the OS uses to handle the error.


> If the SError error comes from guest user mode and is not propagated,
> then signal user space to handle it, otherwise, directly injects virtual
> SError, or panic if the error is fatal.

What do you mean by propagated?

I don't think we should ever hand RAS notifications to user-space, the host
kernel should handle them, then describe the symptom (e.g. this region of your
va space is gone) to user-space.
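
For illustration, one way user-space can receive such a symptom on Linux is the
SIGBUS sent by memory_failure(), whose si_code is BUS_MCEERR_AR (action
required) or BUS_MCEERR_AO (action optional). The sketch below shows how a VMM
might classify it. enum vmm_action and classify_sigbus() are hypothetical, the
mapping to guest notifications is an assumption, and the fallback #defines only
cover headers that lack the two si_code values.

```c
#include <signal.h>

/* Fallbacks for headers that do not expose the Linux si_code values. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical actions a VMM might take for a SIGBUS. */
enum vmm_action {
    VMM_NOT_MEMORY_FAILURE,  /* ordinary SIGBUS, not a RAS symptom */
    VMM_NOTIFY_GUEST_NOW,    /* the faulting thread consumed the error */
    VMM_NOTIFY_GUEST_LATER,  /* error found, but not yet consumed */
};

/* Map the si_code of a SIGBUS to what the VMM should do about it. */
enum vmm_action classify_sigbus(int si_code)
{
    switch (si_code) {
    case BUS_MCEERR_AR:
        return VMM_NOTIFY_GUEST_NOW;
    case BUS_MCEERR_AO:
        return VMM_NOTIFY_GUEST_LATER;
    default:
        return VMM_NOT_MEMORY_FAILURE;
    }
}
```

A real handler would be installed with SA_SIGINFO and would also read si_addr
to find which part of its address space is gone.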


> when user space handles the error,
> it will specify syndrome for the injected virtual SError. This syndrome value
> is set to the VSESR_EL2. VSESR_EL2 is a new ARMv8.2 RAS extensions register
> which provides the syndrome value reported to software on taking a virtual
> SError interrupt exception.


Thanks,

James

[0]
https://static.docs.arm.com/ddi0587/a/RAS%20Extension-release%20candidate_march_29.pdf


