Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-22 Thread James Morse
Hi gengdongjiu,

On 16/12/17 03:44, gengdongjiu wrote:
> On 2017/12/16 2:52, James Morse wrote:
>>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
>>> below:
>>>
>>> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
>>> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>>>
>>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify 
>>> trigger method, which all
>>>
>>> not involve _trigger_ an SError.
>> It's a policy choice. How does your virtual CPU notify RAS errors to its 
>> virtual
>> software? You could use SError for SIGBUS_MCEERR_AR, it depends on what type 
>> of
>> CPU you are trying to emulate.
>>
>> I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where
>> the guest doesn't take the SError immediately, instead tries to re-execute 
>> the

> I agree it is better to use NOTIFY_SEA for SIGBUS_MCEERR_AR in this case.

>> code KVM has unmapped from stage2 because its corrupt. (You could detect this
>> happening in Qemu and try something else)

> For something else, using NOTIFY_SEI for SIGBUS_MCEERR_AR?

Sorry that was unclear. If you use NOTIFY_SEI, the guest may have PSTATE.A set,
in which case the the CPU will patiently wait for it to unmask, (or consume it
with an ESB-instruction), before delivering the notification. The guest may not
have unmasked SError because its hammering the same page taking the same fault
again and again. Pending the asynchronous notification and re-running the vcpu
doesn't guarantee progress will be made.

In this case user-space can spot its pended an asynchronous notification (for
the same address!) more than once in the last few seconds, and try something
else, like firing a guest:reboot watchdog on another CPU.


> At current implementation,
> It seems only have this case that "KVM has unmapped from stage2", do you 
> thing we
> still have something else?

I'm wary that this only works for errors where we know the guest PC accessed the
faulting location.

The arch code will send this signal too if user-space touches the PG_poisoned
page. (I recall you checked Qemu spots this case and acts differently).
Migration is the obvious example for Qemu read/writing guest memory.

On x86 the MachineCheck code sends these signals too, so our kernel-first
implementation may do the same. As a response to a RAS error notified by
synchronous-external-abort, this is fine. But we need to remember '_AR' implies
the error is related to the code the signal interrupted, which wouldn't be true
for an error notified by SError.


>> Synchronous/asynchronous external abort matters to the CPU, but once the 
>> error
>> has been notified to software the reasons for this distinction disappear. 
>> Once
>> the error has been handled, all trace of this distinction is gone.
>>
>> CPER records only describe component failures. You are trying to re-create 
>> some
>> state that disappeared with one of the firmware-first abstractions. Trying to
>> re-create this information isn't worth the effort as the distinction doesn't
>> matter to linux, only to the CPU.
>>
>>
>>> so there is no chance for Qemu to trigger the SError when gets the 
>>> SIGBUS_MCEERR_A{O,R}.
>> You mean there is no reason for Qemu to trigger an SError when it gets a 
>> signal
>> from the kernel.
>>
>> The reasons the CPU might have to generate an SError don't apply to linux and
>> KVM user space. User-space will never get a signal for an uncontained error, 
>> we
>> will always panic(). We can't give user-space a signal for imprecise 
>> exceptions,
>> as it can't return from the signal. The classes of error that are left are
>> covered by polled/irq and NOTIFY_SEA.
>>
>> Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really 
>> wants
>> to, (but I don't think you should, the kernel may have unmapped the page at 
>> PC
>> from stage2 due to corruption).

> yes, you also said you do not want to generate RAS SErrors for 
> SIGBUS_MCEERR_AR,
> so Qemu does not know in which condition to generate RAS SErrors.

There are two things going on here, firstly the guest may have masked PSTATE.A,
and be hammering an unmapped page. (this this 'sorry that was unclear' case
above). This would happen if the exception-entry code or stack became corrupt
when an exception was taken.
The second is what does existing non-RAS-aware software do? For SError it
panic()s, whereas for synchronous external abort there are some cases that can
be handled. (e.g. on linux: synchronous external abort from user-space).


>> I think the problem here is you're applying the CPU->software behaviour and
>> choices to software->software. By the time user-space gets the error, the
>> behaviour is different.

> In the KVM, as a policy choice to reserve this API to specify guest ESR and 
> drive to trigger SError is OK,
> At least for Qemu it does not know in which condition to trigger it.

I think you're saying "lets keep it KVM for now, Qemu doesn't have a better idea
of what to do."



Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-22 Thread James Morse
Hi gengdongjiu,

On 16/12/17 03:44, gengdongjiu wrote:
> On 2017/12/16 2:52, James Morse wrote:
>>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
>>> below:
>>>
>>> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
>>> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>>>
>>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify 
>>> trigger method, which all
>>>
>>> not involve _trigger_ an SError.
>> It's a policy choice. How does your virtual CPU notify RAS errors to its 
>> virtual
>> software? You could use SError for SIGBUS_MCEERR_AR, it depends on what type 
>> of
>> CPU you are trying to emulate.
>>
>> I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where
>> the guest doesn't take the SError immediately, instead tries to re-execute 
>> the

> I agree it is better to use NOTIFY_SEA for SIGBUS_MCEERR_AR in this case.

>> code KVM has unmapped from stage2 because its corrupt. (You could detect this
>> happening in Qemu and try something else)

> For something else, using NOTIFY_SEI for SIGBUS_MCEERR_AR?

Sorry that was unclear. If you use NOTIFY_SEI, the guest may have PSTATE.A set,
in which case the the CPU will patiently wait for it to unmask, (or consume it
with an ESB-instruction), before delivering the notification. The guest may not
have unmasked SError because its hammering the same page taking the same fault
again and again. Pending the asynchronous notification and re-running the vcpu
doesn't guarantee progress will be made.

In this case user-space can spot its pended an asynchronous notification (for
the same address!) more than once in the last few seconds, and try something
else, like firing a guest:reboot watchdog on another CPU.


> At current implementation,
> It seems only have this case that "KVM has unmapped from stage2", do you 
> thing we
> still have something else?

I'm wary that this only works for errors where we know the guest PC accessed the
faulting location.

The arch code will send this signal too if user-space touches the PG_poisoned
page. (I recall you checked Qemu spots this case and acts differently).
Migration is the obvious example for Qemu read/writing guest memory.

On x86 the MachineCheck code sends these signals too, so our kernel-first
implementation may do the same. As a response to a RAS error notified by
synchronous-external-abort, this is fine. But we need to remember '_AR' implies
the error is related to the code the signal interrupted, which wouldn't be true
for an error notified by SError.


>> Synchronous/asynchronous external abort matters to the CPU, but once the 
>> error
>> has been notified to software the reasons for this distinction disappear. 
>> Once
>> the error has been handled, all trace of this distinction is gone.
>>
>> CPER records only describe component failures. You are trying to re-create 
>> some
>> state that disappeared with one of the firmware-first abstractions. Trying to
>> re-create this information isn't worth the effort as the distinction doesn't
>> matter to linux, only to the CPU.
>>
>>
>>> so there is no chance for Qemu to trigger the SError when gets the 
>>> SIGBUS_MCEERR_A{O,R}.
>> You mean there is no reason for Qemu to trigger an SError when it gets a 
>> signal
>> from the kernel.
>>
>> The reasons the CPU might have to generate an SError don't apply to linux and
>> KVM user space. User-space will never get a signal for an uncontained error, 
>> we
>> will always panic(). We can't give user-space a signal for imprecise 
>> exceptions,
>> as it can't return from the signal. The classes of error that are left are
>> covered by polled/irq and NOTIFY_SEA.
>>
>> Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really 
>> wants
>> to, (but I don't think you should, the kernel may have unmapped the page at 
>> PC
>> from stage2 due to corruption).

> yes, you also said you do not want to generate RAS SErrors for 
> SIGBUS_MCEERR_AR,
> so Qemu does not know in which condition to generate RAS SErrors.

There are two things going on here, firstly the guest may have masked PSTATE.A,
and be hammering an unmapped page. (this this 'sorry that was unclear' case
above). This would happen if the exception-entry code or stack became corrupt
when an exception was taken.
The second is what does existing non-RAS-aware software do? For SError it
panic()s, whereas for synchronous external abort there are some cases that can
be handled. (e.g. on linux: synchronous external abort from user-space).


>> I think the problem here is you're applying the CPU->software behaviour and
>> choices to software->software. By the time user-space gets the error, the
>> behaviour is different.

> In the KVM, as a policy choice to reserve this API to specify guest ESR and 
> drive to trigger SError is OK,
> At least for Qemu it does not know in which condition to trigger it.

I think you're saying "lets keep it KVM for now, Qemu doesn't have a better idea
of what to do."



Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-22 Thread James Morse
Hi gengdongjiu,

On 21/01/18 02:45, gengdongjiu wrote:
> For the ESR_ELx_AET_UER, this exception is precise, closing the VM may
> be better[1].
> But if you think panic is better until we support kernel-first, it is
> also OK to me.

I'm not convinced SError while a guest was running means only guest memory could
be affected. Mechanisms like KSM means the error could affect multiple guests.

Both firmware-fist and kernel-first will give us the address, with which we can
know which processes are affected, isolated the memory and signal affected
processes.

Until we have one of these panic() is the only way we have to contain an error,
but its an interim fix.
Not panic()ing the host for an error that should be contained to the guest is a
fudge, we don't actually know its safe (KSM, page-table etc). I want to improve
on this with {firmware, kernel}-first support (or both!), I don't want to expose
that this is happening to user-space, as once we have one of {firmware,
kernel}-first, it shouldn't happen.


>> This is inventing something new for RAS errors not claimed by firmware-first.
>> If we have kernel-first too, this will never happen. (unless your system is
>> losing the error description).

> In fact, if we have kernel-first, I think we still need to judge the
> error type by ESR, right?

The kernel-first mechanism should consider the ESR/FAR, yes, but once the error
has been claimed and handled, KVM shouldn't care about any of these values.
(maybe we'll sanity check for uncontained errors, just in case the error escaped
to the RAS code...)

My point here was exposing 'unhandled' (ignored) RAS errors to user-space
creates an ABI: someone will complain once we start handling the error, and they
no longer get a notification via this 'unhandled' interface. Code written to use
this interface becomes useless/untested.


> If the handle_guest_sei() , may be the system does not support firmware-first,
> so we judge the ESR value,

...and panic()/ignore as appropriate.

I agree not all systems will support firmware-first, (big-endian is the obvious
example), but if we get kernel-first support this ESR guessing can disappear,
I'm against exposing it to user-space in the meantime.


Thanks,

James


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-22 Thread James Morse
Hi gengdongjiu,

On 21/01/18 02:45, gengdongjiu wrote:
> For the ESR_ELx_AET_UER, this exception is precise, closing the VM may
> be better[1].
> But if you think panic is better until we support kernel-first, it is
> also OK to me.

I'm not convinced SError while a guest was running means only guest memory could
be affected. Mechanisms like KSM means the error could affect multiple guests.

Both firmware-fist and kernel-first will give us the address, with which we can
know which processes are affected, isolated the memory and signal affected
processes.

Until we have one of these panic() is the only way we have to contain an error,
but its an interim fix.
Not panic()ing the host for an error that should be contained to the guest is a
fudge, we don't actually know its safe (KSM, page-table etc). I want to improve
on this with {firmware, kernel}-first support (or both!), I don't want to expose
that this is happening to user-space, as once we have one of {firmware,
kernel}-first, it shouldn't happen.


>> This is inventing something new for RAS errors not claimed by firmware-first.
>> If we have kernel-first too, this will never happen. (unless your system is
>> losing the error description).

> In fact, if we have kernel-first, I think we still need to judge the
> error type by ESR, right?

The kernel-first mechanism should consider the ESR/FAR, yes, but once the error
has been claimed and handled, KVM shouldn't care about any of these values.
(maybe we'll sanity check for uncontained errors, just in case the error escaped
to the RAS code...)

My point here was exposing 'unhandled' (ignored) RAS errors to user-space
creates an ABI: someone will complain once we start handling the error, and they
no longer get a notification via this 'unhandled' interface. Code written to use
this interface becomes useless/untested.


> If the handle_guest_sei() , may be the system does not support firmware-first,
> so we judge the ESR value,

...and panic()/ignore as appropriate.

I agree not all systems will support firmware-first, (big-endian is the obvious
example), but if we get kernel-first support this ESR guessing can disappear,
I'm against exposing it to user-space in the meantime.


Thanks,

James


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-20 Thread gengdongjiu
2018-01-15 16:33 GMT+08:00 Christoffer Dall :
> On Fri, Jan 12, 2018 at 06:05:23PM +, James Morse wrote:
>> On 15/12/17 03:30, gengdongjiu wrote:
>> > On 2017/12/7 14:37, gengdongjiu wrote:
>
> [...]
>
>>
>> (I recall someone saying migration is needed for any new KVM/cpu features, 
>> but I
>> can't find the thread)
>>
>
> I don't know of any hard set-in-stone rule for this, but I have
> certainly argued that since migration is a popular technique in data
> centers and often a key motivation behind using virtual machines as it
> provides both load-balancing and high availability, we should think
> about migration support for all features and state.  Further, experience
> has shown that retroactively trying to support migration can result in
> really complex interfaces for saving/restoring state (see the ITS
> ordering requirements in
> Documentation/virtual/kvm/devices/arm-vgic-its.txt as an example) so
> thinking about this problem when introducing functionality is a good
> idea.
yes, agree it.

>
> Of course, if there are really good arguments for having some state that
> simply cannot be migrated, then that's fine, and we should just make
> sure that userspace (e.g. QEMU) and higher level components in the
> stack (libvirt, openstack, etc.) can detect this state being used, and
> ideally enable/disable it, so that it can predict that a particular VM
> cannot be migrated off a particular host, or between a particular set of
> two hosts.  As an example, migration is typically prohibited when using
> VFIO direct device assignment, but userspace etc. are already aware of
> this.
Ok,  I think this problem is similar to migrating a VM that uses an irqchip in
 userspace and has set the IRQ or FIQ lines using KVM_IRQ_LINE.

>
> As a final note, if we add support for some architectural feature, which
> may be present on some particular hardware and/or implementation, if the
> KVM support for said feature is automatically enabled (and not
> selectively from userspace), I would push back quite strongly on
> something that doesn't support migration, because it would effectively
> prevent migration of VMs on ARM.
Ok, got it.

>
> Thanks,
> -Christoffer
> ___
> kvmarm mailing list
> kvm...@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-20 Thread gengdongjiu
2018-01-15 16:33 GMT+08:00 Christoffer Dall :
> On Fri, Jan 12, 2018 at 06:05:23PM +, James Morse wrote:
>> On 15/12/17 03:30, gengdongjiu wrote:
>> > On 2017/12/7 14:37, gengdongjiu wrote:
>
> [...]
>
>>
>> (I recall someone saying migration is needed for any new KVM/cpu features, 
>> but I
>> can't find the thread)
>>
>
> I don't know of any hard set-in-stone rule for this, but I have
> certainly argued that since migration is a popular technique in data
> centers and often a key motivation behind using virtual machines as it
> provides both load-balancing and high availability, we should think
> about migration support for all features and state.  Further, experience
> has shown that retroactively trying to support migration can result in
> really complex interfaces for saving/restoring state (see the ITS
> ordering requirements in
> Documentation/virtual/kvm/devices/arm-vgic-its.txt as an example) so
> thinking about this problem when introducing functionality is a good
> idea.
yes, agree it.

>
> Of course, if there are really good arguments for having some state that
> simply cannot be migrated, then that's fine, and we should just make
> sure that userspace (e.g. QEMU) and higher level components in the
> stack (libvirt, openstack, etc.) can detect this state being used, and
> ideally enable/disable it, so that it can predict that a particular VM
> cannot be migrated off a particular host, or between a particular set of
> two hosts.  As an example, migration is typically prohibited when using
> VFIO direct device assignment, but userspace etc. are already aware of
> this.
Ok,  I think this problem is similar to migrating a VM that uses an irqchip in
 userspace and has set the IRQ or FIQ lines using KVM_IRQ_LINE.

>
> As a final note, if we add support for some architectural feature, which
> may be present on some particular hardware and/or implementation, if the
> KVM support for said feature is automatically enabled (and not
> selectively from userspace), I would push back quite strongly on
> something that doesn't support migration, because it would effectively
> prevent migration of VMs on ARM.
Ok, got it.

>
> Thanks,
> -Christoffer
> ___
> kvmarm mailing list
> kvm...@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-20 Thread gengdongjiu
2018-01-13 2:05 GMT+08:00 James Morse :
> Hi gengdongjiu,
>
> On 16/12/17 04:47, gengdongjiu wrote:
>> [...]
>>>
 + case ESR_ELx_AET_UER:   /* The error has not been propagated */
 + /*
 +  * Userspace only handle the guest SError Interrupt(SEI) if 
 the
 +  * error has not been propagated
 +  */
 + run->exit_reason = KVM_EXIT_EXCEPTION;
 + run->ex.exception = ESR_ELx_EC_SERROR;
 + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
 + return 0;
>>>
>>> We should not pass RAS notifications to user space. The kernel either 
>>> handles
>>> them, or it panics(). User space shouldn't even know if the kernel supports 
>>> RAS
>>
>> For the  ESR_ELx_AET_UER(Recoverable error), let us see its definition
>> below, which get from [0]
>
> [..]
>
>> so we can see the  exception is precise and PE can recover execution
>> from the preferred return address of the exception,
>
>> so let guest handling it is
>> better, for example, if it is guest application RAS error, we can kill
>> the guest application instead of panic whole OS; if it is guest kernel
>> RAS error, guest will panic.
>
> If the kernel takes an unhandled RAS error it should panic - we don't know 
> where
> the error is.
OK, here I will panic.

>
> I understand you want to kill-off guest tasks as a result of RAS errors, but
> this needs to go through the whole APEI->memory_failure()->sigbus machinery so
> that the kernel knows the kernel can keep running.
>
> This saves us signalling user-space when we don't need to. An example:
> code-corruption. Linux can happily re-read affected user-space executables 
> from
> disk, there is absolutely nothing user-space can do about it.
> Handling errors first in the kernel allows us to do recovery for all the
> affected processes, not just the one that happens to be running right now.
>
>
>> Host does not know which application of guest has error, so host can
>> not handle it,
>
> It has to work this out, otherwise the errors we can handle never get a 
> chance.
>
> This kernel is expected to look at the error description, (which for some 
> reason
> we aren't talking about here), e.g. the CPER records, and determine what
> recovery action is necessary for this error.
> For memory errors this may be re-reading from disk, or at the worst case,
> unmapping from all user-space users (including KVM's stage2) and raining 
> signals
> on all affected processes.
>
> For a memory error the important piece of information is the physical address.
> Only the kernel can do anything with this, it determines who owns the affected
> memory and what needs doing to recover from the error.
>
> If you pass the notification to user-space, all it can do is signal the guest 
> to
> "stop doing whatever it is you're doing". The guest may have been able to
> re-read pages from disk, or otherwise handle the error.
> Has the error been handled? No: The error remains latent in the system.
>
>
>> panic OS is not a good choice for the Recoverable error.
>
> If we don't know where the error is, and we can't make progress, its the only
> sane choice.
Ok, I will panic here.

>
> This code is never expected to run! (why are we arguing about it?) We should 
> get
> RAS errors as GHES notifications from firmware via some mechanism. If those 
> are
> NOTIFY_SEI then APEI should claim the notification and kick off the 
> appropriate
> handling based on the CPER records. If/when we get kernel-first, that can 
> claim
> the SError. What we're left with is RAS notifications that no-one claimed
> because there was no error-description found.
>
>
>
> James


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-20 Thread gengdongjiu
2018-01-13 2:05 GMT+08:00 James Morse :
> Hi gengdongjiu,
>
> On 16/12/17 04:47, gengdongjiu wrote:
>> [...]
>>>
 + case ESR_ELx_AET_UER:   /* The error has not been propagated */
 + /*
 +  * Userspace only handle the guest SError Interrupt(SEI) if 
 the
 +  * error has not been propagated
 +  */
 + run->exit_reason = KVM_EXIT_EXCEPTION;
 + run->ex.exception = ESR_ELx_EC_SERROR;
 + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
 + return 0;
>>>
>>> We should not pass RAS notifications to user space. The kernel either 
>>> handles
>>> them, or it panics(). User space shouldn't even know if the kernel supports 
>>> RAS
>>
>> For the  ESR_ELx_AET_UER(Recoverable error), let us see its definition
>> below, which get from [0]
>
> [..]
>
>> so we can see the  exception is precise and PE can recover execution
>> from the preferred return address of the exception,
>
>> so let guest handling it is
>> better, for example, if it is guest application RAS error, we can kill
>> the guest application instead of panic whole OS; if it is guest kernel
>> RAS error, guest will panic.
>
> If the kernel takes an unhandled RAS error it should panic - we don't know 
> where
> the error is.
OK, here I will panic.

>
> I understand you want to kill-off guest tasks as a result of RAS errors, but
> this needs to go through the whole APEI->memory_failure()->sigbus machinery so
> that the kernel knows the kernel can keep running.
>
> This saves us signalling user-space when we don't need to. An example:
> code-corruption. Linux can happily re-read affected user-space executables 
> from
> disk, there is absolutely nothing user-space can do about it.
> Handling errors first in the kernel allows us to do recovery for all the
> affected processes, not just the one that happens to be running right now.
>
>
>> Host does not know which application of guest has error, so host can
>> not handle it,
>
> It has to work this out, otherwise the errors we can handle never get a 
> chance.
>
> This kernel is expected to look at the error description, (which for some 
> reason
> we aren't talking about here), e.g. the CPER records, and determine what
> recovery action is necessary for this error.
> For memory errors this may be re-reading from disk, or at the worst case,
> unmapping from all user-space users (including KVM's stage2) and raining 
> signals
> on all affected processes.
>
> For a memory error the important piece of information is the physical address.
> Only the kernel can do anything with this, it determines who owns the affected
> memory and what needs doing to recover from the error.
>
> If you pass the notification to user-space, all it can do is signal the guest 
> to
> "stop doing whatever it is you're doing". The guest may have been able to
> re-read pages from disk, or otherwise handle the error.
> Has the error been handled? No: The error remains latent in the system.
>
>
>> panic OS is not a good choice for the Recoverable error.
>
> If we don't know where the error is, and we can't make progress, its the only
> sane choice.
Ok, I will panic here.

>
> This code is never expected to run! (why are we arguing about it?) We should 
> get
> RAS errors as GHES notifications from firmware via some mechanism. If those 
> are
> NOTIFY_SEI then APEI should claim the notification and kick off the 
> appropriate
> handling based on the CPER records. If/when we get kernel-first, that can 
> claim
> the SError. What we're left with is RAS notifications that no-one claimed
> because there was no error-description found.
>
>
>
> James


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-20 Thread gengdongjiu
Hi James,
   Sorry for my late response due to out of office recently.

2018-01-13 2:05 GMT+08:00 James Morse :
> Hi gengdongjiu,
>
> On 15/12/17 03:30, gengdongjiu wrote:
>> On 2017/12/7 14:37, gengdongjiu wrote:
 We need to tackle (1) and (3) separately. For (3) we need some API that 
 lets
 Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't 
 have
 a way of migrating pending SError yet... which is where I got stuck last 
 time I
 was looking at this.
>>> I understand you most idea.
>>>
>>> But In the Qemu one signal type can only correspond to one behavior, can 
>>> not correspond to two behaviors,
>>> otherwise Qemu will do not know how to do.
>>>
>>> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate 
>>> the CPER
>>> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if 
>>> receives the SIGBUS_MCEERR_AO
>>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
>>> below:
>>>
>>> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
>>> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>>>
>>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify 
>>> trigger method, which all
>>>
>>> not involve _trigger_ an SError.
>>>
>>> so there is no chance for Qemu to trigger the SError when gets the 
>>> SIGBUS_MCEERR_A{O,R}.
>>
>> As I explained above:
>>
>> If Qemu received SIGBUS_MCEERR_AR, it will record CPER and trigger 
>> Synchronous External Abort;
>> If Qemu received SIGBUS_MCEERR_AO, it will record CPER and trigger GPIO IRQ;
>
>> So Qemu does not know when to _trigger_ an SError.
>
> There is no answer to this. How the CPU decides is specific to the CPU design.
> How Qemu decides is going to be specific to the machine it emulates.
>
> My understanding is there is some overlap for which RAS errors are reported as
> synchronous external abort, and which use SError. (Obviously the imprecise 
> ones
> are all SError). Which one the CPU uses depends on how the CPU is designed.
>
> When you take an SIGBUS_MCEERR_AR from KVM, its because KVM can't complete a
> stage2 fault because the page is marked with PG_poisoned. These started out 
> as a
> synchronous exception, but you could still report these with SError.

yes, I agree, it is policy choice.

>
> We don't have a way to signal user-space about imprecise exceptions, this 
> isn't
> a KVM specific problem.
>
>
>> so here I "return a error" to Qemu if ghes_notify_sei() return failure in 
>> [1], if you opposed KVM "return error",
>> do you have a better idea about it? thanks
>
> If ghes_notify_sei() fails to claim the error, we should drop through to
> kernel-first-handling. We don't have that yet, just the stub that ignores 
> errors
> where we can make progress.
>
> If neither firmware-first nor kernel-first claim a RAS error, we're in 
> trouble.
> I'd like to panic() as we got a RAS notification but no description of the
> error. We can't do this until we have kernel-first support, hence that stub.
>
>
>> About the way of migrating pending SError, I think it is a separate case, 
>> because Qemu still does not know
>> how and when to trigger the SError.
>
> I agree, but I think we should fix this first before we add another user of 
> this
> unmigratable hypervisor state.
>
> (I recall someone saying migration is needed for any new KVM/cpu features, 
> but I
> can't find the thread)
>
>
>> [1]:
>> static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
>> {
>> ...
>> +   case ESR_ELx_AET_UER:   /* The error has not been propagated */
>> +   /*
>> +* Userspace only handle the guest SError Interrupt(SEI) if 
>> the
>> +* error has not been propagated
>> +*/
>> +   run->exit_reason = KVM_EXIT_EXCEPTION;
>> +   run->ex.exception = ESR_ELx_EC_SERROR;
>
> I'm against telling user space RAS errors ever happened, only the final
> user-visible error when the kernel can't fix it.

thanks for the explanation.
For the ESR_ELx_AET_UER, this exception is precise, closing the VM may
be better[1].
But if you think panic is better until we support kernel-first, it is
also OK to me.


+static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
+{
+ unsigned int esr = kvm_vcpu_get_hsr(vcpu);
+ bool impdef_syndrome =  esr & ESR_ELx_ISV; /* aka IDS */
+ unsigned int aet = esr & ESR_ELx_AET;
+
+ /*
+ * This is not RAS SError
+ */
+ if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) {
+ kvm_inject_vabt(vcpu);
+ return 1;
+ }
+
+ /* For RAS the host kernel may handle this abort. */
+ if (!handle_guest_sei())
+ return 1;
+
+ /*
+ * In below two conditions, it will directly inject the
+ * virtual SError:
+ * 1. The Syndrome is IMPLEMENTATION DEFINED
+ * 2. It is Uncategorized SEI
+ */
+ if 

Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-20 Thread gengdongjiu
Hi James,
   Sorry for my late response due to out of office recently.

2018-01-13 2:05 GMT+08:00 James Morse :
> Hi gengdongjiu,
>
> On 15/12/17 03:30, gengdongjiu wrote:
>> On 2017/12/7 14:37, gengdongjiu wrote:
 We need to tackle (1) and (3) separately. For (3) we need some API that 
 lets
 Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't 
 have
 a way of migrating pending SError yet... which is where I got stuck last 
 time I
 was looking at this.
>>> I understand you most idea.
>>>
>>> But In the Qemu one signal type can only correspond to one behavior, can 
>>> not correspond to two behaviors,
>>> otherwise Qemu will do not know how to do.
>>>
>>> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate 
>>> the CPER
>>> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if 
>>> receives the SIGBUS_MCEERR_AO
>>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
>>> below:
>>>
>>> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
>>> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>>>
>>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify 
>>> trigger method, which all
>>>
>>> not involve _trigger_ an SError.
>>>
>>> so there is no chance for Qemu to trigger the SError when gets the 
>>> SIGBUS_MCEERR_A{O,R}.
>>
>> As I explained above:
>>
>> If Qemu received SIGBUS_MCEERR_AR, it will record CPER and trigger 
>> Synchronous External Abort;
>> If Qemu received SIGBUS_MCEERR_AO, it will record CPER and trigger GPIO IRQ;
>
>> So Qemu does not know when to _trigger_ an SError.
>
> There is no answer to this. How the CPU decides is specific to the CPU design.
> How Qemu decides is going to be specific to the machine it emulates.
>
> My understanding is there is some overlap for which RAS errors are reported as
> synchronous external abort, and which use SError. (Obviously the imprecise 
> ones
> are all SError). Which one the CPU uses depends on how the CPU is designed.
>
> When you take an SIGBUS_MCEERR_AR from KVM, its because KVM can't complete a
> stage2 fault because the page is marked with PG_poisoned. These started out 
> as a
> synchronous exception, but you could still report these with SError.

yes, I agree, it is policy choice.

>
> We don't have a way to signal user-space about imprecise exceptions, this 
> isn't
> a KVM specific problem.
>
>
>> so here I "return a error" to Qemu if ghes_notify_sei() return failure in 
>> [1], if you opposed KVM "return error",
>> do you have a better idea about it? thanks
>
> If ghes_notify_sei() fails to claim the error, we should drop through to
> kernel-first-handling. We don't have that yet, just the stub that ignores 
> errors
> where we can make progress.
>
> If neither firmware-first nor kernel-first claim a RAS error, we're in 
> trouble.
> I'd like to panic() as we got a RAS notification but no description of the
> error. We can't do this until we have kernel-first support, hence that stub.
>
>
>> About the way of migrating pending SError, I think it is a separate case, 
>> because Qemu still does not know
>> how and when to trigger the SError.
>
> I agree, but I think we should fix this first before we add another user of 
> this
> unmigratable hypervisor state.
>
> (I recall someone saying migration is needed for any new KVM/cpu features, 
> but I
> can't find the thread)
>
>
>> [1]:
>> static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
>> {
>> ...
>> +   case ESR_ELx_AET_UER:   /* The error has not been propagated */
>> +   /*
>> +* Userspace only handle the guest SError Interrupt(SEI) if 
>> the
>> +* error has not been propagated
>> +*/
>> +   run->exit_reason = KVM_EXIT_EXCEPTION;
>> +   run->ex.exception = ESR_ELx_EC_SERROR;
>
> I'm against telling user space RAS errors ever happened, only the final
> user-visible error when the kernel can't fix it.

thanks for the explanation.
For the ESR_ELx_AET_UER, this exception is precise, closing the VM may
be better[1].
But if you think panic is better until we support kernel-first, it is
also OK to me.


+static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
+{
+ unsigned int esr = kvm_vcpu_get_hsr(vcpu);
+ bool impdef_syndrome =  esr & ESR_ELx_ISV; /* aka IDS */
+ unsigned int aet = esr & ESR_ELx_AET;
+
+ /*
+ * This is not RAS SError
+ */
+ if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) {
+ kvm_inject_vabt(vcpu);
+ return 1;
+ }
+
+ /* For RAS the host kernel may handle this abort. */
+ if (!handle_guest_sei())
+ return 1;
+
+ /*
+ * In below two conditions, it will directly inject the
+ * virtual SError:
+ * 1. The Syndrome is IMPLEMENTATION DEFINED
+ * 2. It is Uncategorized SEI
+ */
+ if (impdef_syndrome ||
+  

Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-16 Thread gengdongjiu
Hi James,
  thanks very much for your mail and reply, I will check it ASAP. Due to 
recently busy with other thing, so reply may be late.

On 2018/1/13 2:05, James Morse wrote:
> Hi gengdongjiu,
> 
> On 16/12/17 04:47, gengdongjiu wrote:
>> [...]
>>>
 + case ESR_ELx_AET_UER:   /* The error has not been propagated */
 + /*
 +  * Userspace only handle the guest SError Interrupt(SEI) if 
 the
 +  * error has not been propagated
 +  */
 + run->exit_reason = KVM_EXIT_EXCEPTION;
 + run->ex.exception = ESR_ELx_EC_SERROR;
 + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
 + return 0;
>>>
>>> We should not pass RAS notifications to user space. The kernel either 
>>> handles
>>> them, or it panics(). User space shouldn't even know if the kernel supports 
>>> RAS
>>
>> For the  ESR_ELx_AET_UER(Recoverable error), let us see its definition
>> below, which get from [0]
> 
> [..]
> 
>> so we can see the  exception is precise and PE can recover execution
>> from the preferred return address of the exception, 
> 
>> so let guest handling it is
>> better, for example, if it is guest application RAS error, we can kill
>> the guest application instead of panic whole OS; if it is guest kernel
>> RAS error, guest will panic.
> 
> If the kernel takes an unhandled RAS error it should panic - we don't know 
> where
> the error is.
> 
> I understand you want to kill-off guest tasks as a result of RAS errors, but
> this needs to go through the whole APEI->memory_failure()->sigbus machinery so
> that the kernel knows the kernel can keep running.
> 
> This saves us signalling user-space when we don't need to. An example:
> code-corruption. Linux can happily re-read affected user-space executables 
> from
> disk, there is absolutely nothing user-space can do about it.
> Handling errors first in the kernel allows us to do recovery for all the
> affected processes, not just the one that happens to be running right now.
> 
> 
>> Host does not know which application of guest has error, so host can
>> not handle it,
> 
> It has to work this out, otherwise the errors we can handle never get a 
> chance.
> 
> This kernel is expected to look at the error description, (which for some 
> reason
> we aren't talking about here), e.g. the CPER records, and determine what
> recovery action is necessary for this error.
> For memory errors this may be re-reading from disk, or at the worst case,
> unmapping from all user-space users (including KVM's stage2) and raining 
> signals
> on all affected processes.
> 
> For a memory error the important piece of information is the physical address.
> Only the kernel can do anything with this, it determines who owns the affected
> memory and what needs doing to recover from the error.
> 
> If you pass the notification to user-space, all it can do is signal the guest 
> to
> "stop doing whatever it is you're doing". The guest may have been able to
> re-read pages from disk, or otherwise handle the error.
> Has the error been handled? No: The error remains latent in the system.
> 
> 
>> panic OS is not a good choice for the Recoverable error.
> 
> If we don't know where the error is, and we can't make progress, its the only
> sane choice.
> 
> This code is never expected to run! (why are we arguing about it?) We should 
> get
> RAS errors as GHES notifications from firmware via some mechanism. If those 
> are
> NOTIFY_SEI then APEI should claim the notification and kick off the 
> appropriate
> handling based on the CPER records. If/when we get kernel-first, that can 
> claim
> the SError. What we're left with is RAS notifications that no-one claimed
> because there was no error-description found.
> 
> 
> 
> James
> 
> .
> 



Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-16 Thread gengdongjiu
Hi James,
  thanks very much for your mail and reply, I will check it ASAP. Due to 
recently busy with other thing, so reply may be late.

On 2018/1/13 2:05, James Morse wrote:
> Hi gengdongjiu,
> 
> On 16/12/17 04:47, gengdongjiu wrote:
>> [...]
>>>
 + case ESR_ELx_AET_UER:   /* The error has not been propagated */
 + /*
 +  * Userspace only handle the guest SError Interrupt(SEI) if 
 the
 +  * error has not been propagated
 +  */
 + run->exit_reason = KVM_EXIT_EXCEPTION;
 + run->ex.exception = ESR_ELx_EC_SERROR;
 + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
 + return 0;
>>>
>>> We should not pass RAS notifications to user space. The kernel either 
>>> handles
>>> them, or it panics(). User space shouldn't even know if the kernel supports 
>>> RAS
>>
>> For the  ESR_ELx_AET_UER(Recoverable error), let us see its definition
>> below, which get from [0]
> 
> [..]
> 
>> so we can see the  exception is precise and PE can recover execution
>> from the preferred return address of the exception, 
> 
>> so let guest handling it is
>> better, for example, if it is guest application RAS error, we can kill
>> the guest application instead of panic whole OS; if it is guest kernel
>> RAS error, guest will panic.
> 
> If the kernel takes an unhandled RAS error it should panic - we don't know 
> where
> the error is.
> 
> I understand you want to kill-off guest tasks as a result of RAS errors, but
> this needs to go through the whole APEI->memory_failure()->sigbus machinery so
> that the kernel knows the kernel can keep running.
> 
> This saves us signalling user-space when we don't need to. An example:
> code-corruption. Linux can happily re-read affected user-space executables 
> from
> disk, there is absolutely nothing user-space can do about it.
> Handling errors first in the kernel allows us to do recovery for all the
> affected processes, not just the one that happens to be running right now.
> 
> 
>> Host does not know which application of guest has error, so host can
>> not handle it,
> 
> It has to work this out, otherwise the errors we can handle never get a 
> chance.
> 
> This kernel is expected to look at the error description, (which for some 
> reason
> we aren't talking about here), e.g. the CPER records, and determine what
> recovery action is necessary for this error.
> For memory errors this may be re-reading from disk, or at the worst case,
> unmapping from all user-space users (including KVM's stage2) and raining 
> signals
> on all affected processes.
> 
> For a memory error the important piece of information is the physical address.
> Only the kernel can do anything with this, it determines who owns the affected
> memory and what needs doing to recover from the error.
> 
> If you pass the notification to user-space, all it can do is signal the guest 
> to
> "stop doing whatever it is you're doing". The guest may have been able to
> re-read pages from disk, or otherwise handle the error.
> Has the error been handled? No: The error remains latent in the system.
> 
> 
>> panic OS is not a good choice for the Recoverable error.
> 
> If we don't know where the error is, and we can't make progress, its the only
> sane choice.
> 
> This code is never expected to run! (why are we arguing about it?) We should 
> get
> RAS errors as GHES notifications from firmware via some mechanism. If those 
> are
> NOTIFY_SEI then APEI should claim the notification and kick off the 
> appropriate
> handling based on the CPER records. If/when we get kernel-first, that can 
> claim
> the SError. What we're left with is RAS notifications that no-one claimed
> because there was no error-description found.
> 
> 
> 
> James
> 
> .
> 



Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-16 Thread gengdongjiu
Hi Christoffer

On 2018/1/15 16:33, Christoffer Dall wrote:
> On Fri, Jan 12, 2018 at 06:05:23PM +, James Morse wrote:
>> On 15/12/17 03:30, gengdongjiu wrote:
>>> On 2017/12/7 14:37, gengdongjiu wrote:
> 
> [...]
> 
>>
>> (I recall someone saying migration is needed for any new KVM/cpu features, 
>> but I
>> can't find the thread)
>>
> 
> I don't know of any hard set-in-stone rule for this, but I have
> certainly argued that since migration is a popular technique in data
> centers and often a key motivation behind using virtual machines as it
> provides both load-balancing and high availability, we should think
> about migration support for all features and state.  Further, experience
> has shown that retroactively trying to support migration can result in
> really complex interfaces for saving/restoring state (see the ITS
> ordering requirements in
> Documentation/virtual/kvm/devices/arm-vgic-its.txt as an example) so
> thinking about this problem when introducing functionality is a good
> idea.
> 
> Of course, if there are really good arguments for having some state that
> simply cannot be migrated, then that's fine, and we should just make
> sure that userspace (e.g. QEMU) and higher level components in the
> stack (libvirt, openstack, etc.) can detect this state being used, and
> ideally enable/disable it, so that it can predict that a particular VM
> cannot be migrated off a particular host, or between a particular set of
> two hosts.  As an example, migration is typically prohibited when using
> VFIO direct device assignment, but userspace etc. are already aware of
> this.
> 
> As a final note, if we add support for some architectural feature, which
> may be present on some particular hardware and/or implementation, if the
> KVM support for said feature is automatically enabled (and not
> selectively from userspace), I would push back quite strongly on
> something that doesn't support migration, because it would effectively
> prevent migration of VMs on ARM.
Thanks very much for this mail and reply, I will check it, please give me some 
time due to
recently busy with other things.

> 
> Thanks,
> -Christoffer
> 
> .
> 



Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-16 Thread gengdongjiu
Hi Christoffer

On 2018/1/15 16:33, Christoffer Dall wrote:
> On Fri, Jan 12, 2018 at 06:05:23PM +, James Morse wrote:
>> On 15/12/17 03:30, gengdongjiu wrote:
>>> On 2017/12/7 14:37, gengdongjiu wrote:
> 
> [...]
> 
>>
>> (I recall someone saying migration is needed for any new KVM/cpu features, 
>> but I
>> can't find the thread)
>>
> 
> I don't know of any hard set-in-stone rule for this, but I have
> certainly argued that since migration is a popular technique in data
> centers and often a key motivation behind using virtual machines as it
> provides both load-balancing and high availability, we should think
> about migration support for all features and state.  Further, experience
> has shown that retroactively trying to support migration can result in
> really complex interfaces for saving/restoring state (see the ITS
> ordering requirements in
> Documentation/virtual/kvm/devices/arm-vgic-its.txt as an example) so
> thinking about this problem when introducing functionality is a good
> idea.
> 
> Of course, if there are really good arguments for having some state that
> simply cannot be migrated, then that's fine, and we should just make
> sure that userspace (e.g. QEMU) and higher level components in the
> stack (libvirt, openstack, etc.) can detect this state being used, and
> ideally enable/disable it, so that it can predict that a particular VM
> cannot be migrated off a particular host, or between a particular set of
> two hosts.  As an example, migration is typically prohibited when using
> VFIO direct device assignment, but userspace etc. are already aware of
> this.
> 
> As a final note, if we add support for some architectural feature, which
> may be present on some particular hardware and/or implementation, if the
> KVM support for said feature is automatically enabled (and not
> selectively from userspace), I would push back quite strongly on
> something that doesn't support migration, because it would effectively
> prevent migration of VMs on ARM.
Thanks very much for this mail and reply, I will check it, please give me some 
time due to
recently busy with other things.

> 
> Thanks,
> -Christoffer
> 
> .
> 



Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-15 Thread Christoffer Dall
On Fri, Jan 12, 2018 at 06:05:23PM +, James Morse wrote:
> On 15/12/17 03:30, gengdongjiu wrote:
> > On 2017/12/7 14:37, gengdongjiu wrote:

[...]

> 
> (I recall someone saying migration is needed for any new KVM/cpu features, 
> but I
> can't find the thread)
> 

I don't know of any hard set-in-stone rule for this, but I have
certainly argued that since migration is a popular technique in data
centers and often a key motivation behind using virtual machines as it
provides both load-balancing and high availability, we should think
about migration support for all features and state.  Further, experience
has shown that retroactively trying to support migration can result in
really complex interfaces for saving/restoring state (see the ITS
ordering requirements in
Documentation/virtual/kvm/devices/arm-vgic-its.txt as an example) so
thinking about this problem when introducing functionality is a good
idea.

Of course, if there are really good arguments for having some state that
simply cannot be migrated, then that's fine, and we should just make
sure that userspace (e.g. QEMU) and higher level components in the
stack (libvirt, openstack, etc.) can detect this state being used, and
ideally enable/disable it, so that it can predict that a particular VM
cannot be migrated off a particular host, or between a particular set of
two hosts.  As an example, migration is typically prohibited when using
VFIO direct device assignment, but userspace etc. are already aware of
this.

As a final note, if we add support for some architectural feature, which
may be present on some particular hardware and/or implementation, if the
KVM support for said feature is automatically enabled (and not
selectively from userspace), I would push back quite strongly on
something that doesn't support migration, because it would effectively
prevent migration of VMs on ARM.

Thanks,
-Christoffer


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-15 Thread Christoffer Dall
On Fri, Jan 12, 2018 at 06:05:23PM +, James Morse wrote:
> On 15/12/17 03:30, gengdongjiu wrote:
> > On 2017/12/7 14:37, gengdongjiu wrote:

[...]

> 
> (I recall someone saying migration is needed for any new KVM/cpu features, 
> but I
> can't find the thread)
> 

I don't know of any hard set-in-stone rule for this, but I have
certainly argued that since migration is a popular technique in data
centers and often a key motivation behind using virtual machines as it
provides both load-balancing and high availability, we should think
about migration support for all features and state.  Further, experience
has shown that retroactively trying to support migration can result in
really complex interfaces for saving/restoring state (see the ITS
ordering requirements in
Documentation/virtual/kvm/devices/arm-vgic-its.txt as an example) so
thinking about this problem when introducing functionality is a good
idea.

Of course, if there are really good arguments for having some state that
simply cannot be migrated, then that's fine, and we should just make
sure that userspace (e.g. QEMU) and higher level components in the
stack (libvirt, openstack, etc.) can detect this state being used, and
ideally enable/disable it, so that it can predict that a particular VM
cannot be migrated off a particular host, or between a particular set of
two hosts.  As an example, migration is typically prohibited when using
VFIO direct device assignment, but userspace etc. are already aware of
this.

As a final note, if we add support for some architectural feature, which
may be present on some particular hardware and/or implementation, if the
KVM support for said feature is automatically enabled (and not
selectively from userspace), I would push back quite strongly on
something that doesn't support migration, because it would effectively
prevent migration of VMs on ARM.

Thanks,
-Christoffer


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-12 Thread James Morse
Hi gengdongjiu,

On 16/12/17 04:47, gengdongjiu wrote:
> [...]
>>
>>> + case ESR_ELx_AET_UER:   /* The error has not been propagated */
>>> + /*
>>> +  * Userspace only handle the guest SError Interrupt(SEI) if 
>>> the
>>> +  * error has not been propagated
>>> +  */
>>> + run->exit_reason = KVM_EXIT_EXCEPTION;
>>> + run->ex.exception = ESR_ELx_EC_SERROR;
>>> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>>> + return 0;
>>
>> We should not pass RAS notifications to user space. The kernel either handles
>> them, or it panics(). User space shouldn't even know if the kernel supports 
>> RAS
> 
> For the  ESR_ELx_AET_UER(Recoverable error), let us see its definition
> below, which get from [0]

[..]

> so we can see the  exception is precise and PE can recover execution
> from the preferred return address of the exception, 

> so let guest handling it is
> better, for example, if it is guest application RAS error, we can kill
> the guest application instead of panic whole OS; if it is guest kernel
> RAS error, guest will panic.

If the kernel takes an unhandled RAS error it should panic - we don't know where
the error is.

I understand you want to kill-off guest tasks as a result of RAS errors, but
this needs to go through the whole APEI->memory_failure()->sigbus machinery so
that the kernel knows the kernel can keep running.

This saves us signalling user-space when we don't need to. An example:
code-corruption. Linux can happily re-read affected user-space executables from
disk, there is absolutely nothing user-space can do about it.
Handling errors first in the kernel allows us to do recovery for all the
affected processes, not just the one that happens to be running right now.


> Host does not know which application of guest has error, so host can
> not handle it,

It has to work this out, otherwise the errors we can handle never get a chance.

This kernel is expected to look at the error description, (which for some reason
we aren't talking about here), e.g. the CPER records, and determine what
recovery action is necessary for this error.
For memory errors this may be re-reading from disk, or at the worst case,
unmapping from all user-space users (including KVM's stage2) and raining signals
on all affected processes.

For a memory error the important piece of information is the physical address.
Only the kernel can do anything with this, it determines who owns the affected
memory and what needs doing to recover from the error.

If you pass the notification to user-space, all it can do is signal the guest to
"stop doing whatever it is you're doing". The guest may have been able to
re-read pages from disk, or otherwise handle the error.
Has the error been handled? No: The error remains latent in the system.


> panic OS is not a good choice for the Recoverable error.

If we don't know where the error is, and we can't make progress, its the only
sane choice.

This code is never expected to run! (why are we arguing about it?) We should get
RAS errors as GHES notifications from firmware via some mechanism. If those are
NOTIFY_SEI then APEI should claim the notification and kick off the appropriate
handling based on the CPER records. If/when we get kernel-first, that can claim
the SError. What we're left with is RAS notifications that no-one claimed
because there was no error-description found.



James


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-12 Thread James Morse
Hi gengdongjiu,

On 16/12/17 04:47, gengdongjiu wrote:
> [...]
>>
>>> + case ESR_ELx_AET_UER:   /* The error has not been propagated */
>>> + /*
>>> +  * Userspace only handle the guest SError Interrupt(SEI) if 
>>> the
>>> +  * error has not been propagated
>>> +  */
>>> + run->exit_reason = KVM_EXIT_EXCEPTION;
>>> + run->ex.exception = ESR_ELx_EC_SERROR;
>>> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>>> + return 0;
>>
>> We should not pass RAS notifications to user space. The kernel either handles
>> them, or it panics(). User space shouldn't even know if the kernel supports 
>> RAS
> 
> For the  ESR_ELx_AET_UER(Recoverable error), let us see its definition
> below, which get from [0]

[..]

> so we can see the  exception is precise and PE can recover execution
> from the preferred return address of the exception, 

> so let guest handling it is
> better, for example, if it is guest application RAS error, we can kill
> the guest application instead of panic whole OS; if it is guest kernel
> RAS error, guest will panic.

If the kernel takes an unhandled RAS error it should panic - we don't know where
the error is.

I understand you want to kill-off guest tasks as a result of RAS errors, but
this needs to go through the whole APEI->memory_failure()->sigbus machinery so
that the kernel knows the kernel can keep running.

This saves us signalling user-space when we don't need to. An example:
code-corruption. Linux can happily re-read affected user-space executables from
disk, there is absolutely nothing user-space can do about it.
Handling errors first in the kernel allows us to do recovery for all the
affected processes, not just the one that happens to be running right now.


> Host does not know which application of guest has error, so host can
> not handle it,

It has to work this out, otherwise the errors we can handle never get a chance.

This kernel is expected to look at the error description, (which for some reason
we aren't talking about here), e.g. the CPER records, and determine what
recovery action is necessary for this error.
For memory errors this may be re-reading from disk, or at the worst case,
unmapping from all user-space users (including KVM's stage2) and raining signals
on all affected processes.

For a memory error the important piece of information is the physical address.
Only the kernel can do anything with this, it determines who owns the affected
memory and what needs doing to recover from the error.

If you pass the notification to user-space, all it can do is signal the guest to
"stop doing whatever it is you're doing". The guest may have been able to
re-read pages from disk, or otherwise handle the error.
Has the error been handled? No: The error remains latent in the system.


> panic OS is not a good choice for the Recoverable error.

If we don't know where the error is, and we can't make progress, its the only
sane choice.

This code is never expected to run! (why are we arguing about it?) We should get
RAS errors as GHES notifications from firmware via some mechanism. If those are
NOTIFY_SEI then APEI should claim the notification and kick off the appropriate
handling based on the CPER records. If/when we get kernel-first, that can claim
the SError. What we're left with is RAS notifications that no-one claimed
because there was no error-description found.



James


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-12 Thread James Morse
Hi gengdongjiu,

On 15/12/17 03:30, gengdongjiu wrote:
> On 2017/12/7 14:37, gengdongjiu wrote:
>>> We need to tackle (1) and (3) separately. For (3) we need some API that lets
>>> Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't 
>>> have
>>> a way of migrating pending SError yet... which is where I got stuck last 
>>> time I
>>> was looking at this.
>> I understand you most idea.
>>
>> But In the Qemu one signal type can only correspond to one behavior, can not 
>> correspond to two behaviors,
>> otherwise Qemu will do not know how to do.
>>
>> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate 
>> the CPER
>> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if 
>> receives the SIGBUS_MCEERR_AO
>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
>> below:
>>
>> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
>> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>>
>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify 
>> trigger method, which all
>>
>> not involve _trigger_ an SError.
>>
>> so there is no chance for Qemu to trigger the SError when gets the 
>> SIGBUS_MCEERR_A{O,R}.
> 
> As I explained above:
> 
> If Qemu received SIGBUS_MCEERR_AR, it will record CPER and trigger 
> Synchronous External Abort;
> If Qemu received SIGBUS_MCEERR_AO, it will record CPER and trigger GPIO IRQ;

> So Qemu does not know when to _trigger_ an SError.

There is no answer to this. How the CPU decides is specific to the CPU design.
How Qemu decides is going to be specific to the machine it emulates.

My understanding is there is some overlap for which RAS errors are reported as
synchronous external abort, and which use SError. (Obviously the imprecise ones
are all SError). Which one the CPU uses depends on how the CPU is designed.

When you take an SIGBUS_MCEERR_AR from KVM, its because KVM can't complete a
stage2 fault because the page is marked with PG_poisoned. These started out as a
synchronous exception, but you could still report these with SError.

We don't have a way to signal user-space about imprecise exceptions, this isn't
a KVM specific problem.


> so here I "return a error" to Qemu if ghes_notify_sei() return failure in 
> [1], if you opposed KVM "return error",
> do you have a better idea about it? thanks

If ghes_notify_sei() fails to claim the error, we should drop through to
kernel-first-handling. We don't have that yet, just the stub that ignores errors
where we can make progress.

If neither firmware-first nor kernel-first claim a RAS error, we're in trouble.
I'd like to panic() as we got a RAS notification but no description of the
error. We can't do this until we have kernel-first support, hence that stub.


> About the way of migrating pending SError, I think it is a separate case, 
> because Qemu still does not know
> how and when to trigger the SError.

I agree, but I think we should fix this first before we add another user of this
unmigratable hypervisor state.

(I recall someone saying migration is needed for any new KVM/cpu features, but I
can't find the thread)


> [1]:
> static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
> {
> ...
> +   case ESR_ELx_AET_UER:   /* The error has not been propagated */
> +   /*
> +* Userspace only handle the guest SError Interrupt(SEI) if 
> the
> +* error has not been propagated
> +*/
> +   run->exit_reason = KVM_EXIT_EXCEPTION;
> +   run->ex.exception = ESR_ELx_EC_SERROR;

I'm against telling user space RAS errors ever happened, only the final
user-visible error when the kernel can't fix it.

This is inventing something new for RAS errors not claimed by firmware-first.
If we have kernel-first too, this will never happen. (unless your system is
losing the error description).


Your system has firmware-first, why isn't it claiming the notification?
If its not finding CPER records written by firmware, check firmware and the UEFI
memory map agree on the attributes to be used when read/writing that area.


> +   run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
> +   return 0;


Thanks,

James



Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-12 Thread James Morse
Hi gengdongjiu,

On 15/12/17 03:30, gengdongjiu wrote:
> On 2017/12/7 14:37, gengdongjiu wrote:
>>> We need to tackle (1) and (3) separately. For (3) we need some API that lets
>>> Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't 
>>> have
>>> a way of migrating pending SError yet... which is where I got stuck last 
>>> time I
>>> was looking at this.
>> I understand you most idea.
>>
>> But In the Qemu one signal type can only correspond to one behavior, can not 
>> correspond to two behaviors,
>> otherwise Qemu will do not know how to do.
>>
>> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate 
>> the CPER
>> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if 
>> receives the SIGBUS_MCEERR_AO
>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
>> below:
>>
>> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
>> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>>
>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify 
>> trigger method, which all
>>
>> not involve _trigger_ an SError.
>>
>> so there is no chance for Qemu to trigger the SError when gets the 
>> SIGBUS_MCEERR_A{O,R}.
> 
> As I explained above:
> 
> If Qemu received SIGBUS_MCEERR_AR, it will record CPER and trigger 
> Synchronous External Abort;
> If Qemu received SIGBUS_MCEERR_AO, it will record CPER and trigger GPIO IRQ;

> So Qemu does not know when to _trigger_ an SError.

There is no answer to this. How the CPU decides is specific to the CPU design.
How Qemu decides is going to be specific to the machine it emulates.

My understanding is there is some overlap for which RAS errors are reported as
synchronous external abort, and which use SError. (Obviously the imprecise ones
are all SError). Which one the CPU uses depends on how the CPU is designed.

When you take an SIGBUS_MCEERR_AR from KVM, its because KVM can't complete a
stage2 fault because the page is marked with PG_poisoned. These started out as a
synchronous exception, but you could still report these with SError.

We don't have a way to signal user-space about imprecise exceptions, this isn't
a KVM specific problem.


> so here I "return a error" to Qemu if ghes_notify_sei() return failure in 
> [1], if you opposed KVM "return error",
> do you have a better idea about it? thanks

If ghes_notify_sei() fails to claim the error, we should drop through to
kernel-first-handling. We don't have that yet, just the stub that ignores errors
where we can make progress.

If neither firmware-first nor kernel-first claim a RAS error, we're in trouble.
I'd like to panic() as we got a RAS notification but no description of the
error. We can't do this until we have kernel-first support, hence that stub.


> About the way of migrating pending SError, I think it is a separate case, 
> because Qemu still does not know
> how and when to trigger the SError.

I agree, but I think we should fix this first before we add another user of this
unmigratable hypervisor state.

(I recall someone saying migration is needed for any new KVM/cpu features, but I
can't find the thread)


> [1]:
> static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
> {
> ...
> +   case ESR_ELx_AET_UER:   /* The error has not been propagated */
> +   /*
> +* Userspace only handle the guest SError Interrupt(SEI) if 
> the
> +* error has not been propagated
> +*/
> +   run->exit_reason = KVM_EXIT_EXCEPTION;
> +   run->ex.exception = ESR_ELx_EC_SERROR;

I'm against telling user space RAS errors ever happened, only the final
user-visible error when the kernel can't fix it.

This is inventing something new for RAS errors not claimed by firmware-first.
If we have kernel-first too, this will never happen. (unless your system is
losing the error description).


Your system has firmware-first, why isn't it claiming the notification?
If its not finding CPER records written by firmware, check firmware and the UEFI
memory map agree on the attributes to be used when read/writing that area.


> +   run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
> +   return 0;


Thanks,

James



Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-15 Thread gengdongjiu
[...]
>
>> + case ESR_ELx_AET_UER:   /* The error has not been propagated */
>> + /*
>> +  * Userspace only handle the guest SError Interrupt(SEI) if the
>> +  * error has not been propagated
>> +  */
>> + run->exit_reason = KVM_EXIT_EXCEPTION;
>> + run->ex.exception = ESR_ELx_EC_SERROR;
>> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>> + return 0;
>
> We should not pass RAS notifications to user space. The kernel either handles
> them, or it panics(). User space shouldn't even know if the kernel supports 
> RAS

For the  ESR_ELx_AET_UER(Recoverable error), let us see its definition
below, which get from [0]

The state of the PE is Recoverable if all of the following are true:
— The error has not been silently propagated.
— The error has not been architecturally consumed by the PE. (The PE
architectural state is not infected.)
— The exception is precise and PE can recover execution from the
preferred return address of the exception, if software locates and
repairs the error.
The PE cannot make correct progress without either consuming the error
or otherwise making the error unrecoverable. The error remains latent
in the system.
If software cannot locate and repair the error, either the application
or the VM, or both, must be isolated by software.

so we can see the  exception is precise and PE can recover execution
from the preferred return address of the exception, so let guest
handling it is
better, for example, if it is guest application RAS error, we can kill
the guest application instead of panic whole OS; if it is guest kernel
RAS error, guest will panic.
Host does not know which application of guest has error, so host can
not handle it, panic OS is not a good choice for the Recoverable
error.

[0]
https://static.docs.arm.com/ddi0587/a/RAS%20Extension-release%20candidate_march_29.pdf


> until it gets an MCEERR signal.

user space will detect whether kernel support RAS before handing it.

>
> You're making your firmware-first notification an EL3->EL0 signal, bypassing 
> the OS.
>
> If we get a RAS SError and there are no CPER records or values in the ERR 
> nodes,
> we should panic as it looks like the CPU/firmware is broken. (spurious RAS 
> errors)


>
>
>> + default:
>> + /*
>> +  * Until now, the CPU supports RAS and SEI is fatal, or host
>> +  * does not support to handle the SError.
>> +  */
>> + panic("This Asynchronous SError interrupt is dangerous, 
>> panic");
>> + }
>> +
>> + return 0;
>> +}
>> +
>>  /*
>>   * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on
>>   * proper exit to userspace.
>
>
>
> James
> ___
> kvmarm mailing list
> kvm...@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-15 Thread gengdongjiu
[...]
>
>> + case ESR_ELx_AET_UER:   /* The error has not been propagated */
>> + /*
>> +  * Userspace only handle the guest SError Interrupt(SEI) if the
>> +  * error has not been propagated
>> +  */
>> + run->exit_reason = KVM_EXIT_EXCEPTION;
>> + run->ex.exception = ESR_ELx_EC_SERROR;
>> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>> + return 0;
>
> We should not pass RAS notifications to user space. The kernel either handles
> them, or it panics(). User space shouldn't even know if the kernel supports 
> RAS

For the  ESR_ELx_AET_UER(Recoverable error), let us see its definition
below, which get from [0]

The state of the PE is Recoverable if all of the following are true:
— The error has not been silently propagated.
— The error has not been architecturally consumed by the PE. (The PE
architectural state is not infected.)
— The exception is precise and PE can recover execution from the
preferred return address of the exception, if software locates and
repairs the error.
The PE cannot make correct progress without either consuming the error
or otherwise making the error unrecoverable. The error remains latent
in the system.
If software cannot locate and repair the error, either the application
or the VM, or both, must be isolated by software.

so we can see the  exception is precise and PE can recover execution
from the preferred return address of the exception, so let guest
handling it is
better, for example, if it is guest application RAS error, we can kill
the guest application instead of panic whole OS; if it is guest kernel
RAS error, guest will panic.
Host does not know which application of guest has error, so host can
not handle it, panic OS is not a good choice for the Recoverable
error.

[0]
https://static.docs.arm.com/ddi0587/a/RAS%20Extension-release%20candidate_march_29.pdf


> until it gets an MCEERR signal.

user space will detect whether kernel support RAS before handing it.

>
> You're making your firmware-first notification an EL3->EL0 signal, bypassing 
> the OS.
>
> If we get a RAS SError and there are no CPER records or values in the ERR 
> nodes,
> we should panic as it looks like the CPU/firmware is broken. (spurious RAS 
> errors)


>
>
>> + default:
>> + /*
>> +  * Until now, the CPU supports RAS and SEI is fatal, or host
>> +  * does not support to handle the SError.
>> +  */
>> + panic("This Asynchronous SError interrupt is dangerous, 
>> panic");
>> + }
>> +
>> + return 0;
>> +}
>> +
>>  /*
>>   * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on
>>   * proper exit to userspace.
>
>
>
> James
> ___
> kvmarm mailing list
> kvm...@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-15 Thread gengdongjiu
Hi James,

On 2017/12/16 2:52, James Morse wrote:
>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
>> below:
>>
>> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
>> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>>
>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify 
>> trigger method, which all
>>
>> not involve _trigger_ an SError.
> It's a policy choice. How does your virtual CPU notify RAS errors to its 
> virtual
> software? You could use SError for SIGBUS_MCEERR_AR, it depends on what type 
> of
> CPU you are trying to emulate.
> 
> I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where
> the guest doesn't take the SError immediately, instead tries to re-execute the
I agree it is better to use NOTIFY_SEA for SIGBUS_MCEERR_AR in this case.

> code KVM has unmapped from stage2 because its corrupt. (You could detect this
> happening in Qemu and try something else)For something else, using NOTIFY_SEI 
> for SIGBUS_MCEERR_AR? At current implementation,
It seems only have this case that "KVM has unmapped from stage2", do you thing 
we still have something else?

> 
> 
> Synchronous/asynchronous external abort matters to the CPU, but once the error
> has been notified to software the reasons for this distinction disappear. Once
> the error has been handled, all trace of this distinction is gone.
> 
> CPER records only describe component failures. You are trying to re-create 
> some
> state that disappeared with one of the firmware-first abstractions. Trying to
> re-create this information isn't worth the effort as the distinction doesn't
> matter to linux, only to the CPU.
> 
> 
>> so there is no chance for Qemu to trigger the SError when gets the 
>> SIGBUS_MCEERR_A{O,R}.
> You mean there is no reason for Qemu to trigger an SError when it gets a 
> signal
> from the kernel.
> 
> The reasons the CPU might have to generate an SError don't apply to linux and
> KVM user space. User-space will never get a signal for an uncontained error, 
> we
> will always panic(). We can't give user-space a signal for imprecise 
> exceptions,
> as it can't return from the signal. The classes of error that are left are
> covered by polled/irq and NOTIFY_SEA.
> 
> Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really 
> wants
> to, (but I don't think you should, the kernel may have unmapped the page at PC
> from stage2 due to corruption).
yes, you also said you do not want to generate RAS SErrors for SIGBUS_MCEERR_AR,
so Qemu does not know in which condition to generate RAS SErrors.

> 
> I think the problem here is you're applying the CPU->software behaviour and
> choices to software->software. By the time user-space gets the error, the
> behaviour is different.
In the KVM, as a policy choice to reserve this API to specify guest ESR and 
drive to trigger SError is OK,
At least for Qemu it does not know in which condition to trigger it.





Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-15 Thread gengdongjiu
Hi James,

On 2017/12/16 2:52, James Morse wrote:
>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
>> below:
>>
>> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
>> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>>
>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify 
>> trigger method, which all
>>
>> not involve _trigger_ an SError.
> It's a policy choice. How does your virtual CPU notify RAS errors to its 
> virtual
> software? You could use SError for SIGBUS_MCEERR_AR, it depends on what type 
> of
> CPU you are trying to emulate.
> 
> I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where
> the guest doesn't take the SError immediately, instead tries to re-execute the
I agree it is better to use NOTIFY_SEA for SIGBUS_MCEERR_AR in this case.

> code KVM has unmapped from stage2 because its corrupt. (You could detect this
> happening in Qemu and try something else)For something else, using NOTIFY_SEI 
> for SIGBUS_MCEERR_AR? At current implementation,
It seems only have this case that "KVM has unmapped from stage2", do you thing 
we still have something else?

> 
> 
> Synchronous/asynchronous external abort matters to the CPU, but once the error
> has been notified to software the reasons for this distinction disappear. Once
> the error has been handled, all trace of this distinction is gone.
> 
> CPER records only describe component failures. You are trying to re-create 
> some
> state that disappeared with one of the firmware-first abstractions. Trying to
> re-create this information isn't worth the effort as the distinction doesn't
> matter to linux, only to the CPU.
> 
> 
>> so there is no chance for Qemu to trigger the SError when gets the 
>> SIGBUS_MCEERR_A{O,R}.
> You mean there is no reason for Qemu to trigger an SError when it gets a 
> signal
> from the kernel.
> 
> The reasons the CPU might have to generate an SError don't apply to linux and
> KVM user space. User-space will never get a signal for an uncontained error, 
> we
> will always panic(). We can't give user-space a signal for imprecise 
> exceptions,
> as it can't return from the signal. The classes of error that are left are
> covered by polled/irq and NOTIFY_SEA.
> 
> Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really 
> wants
> to, (but I don't think you should, the kernel may have unmapped the page at PC
> from stage2 due to corruption).
yes, you also said you do not want to generate RAS SErrors for SIGBUS_MCEERR_AR,
so Qemu does not know in which condition to generate RAS SErrors.

> 
> I think the problem here is you're applying the CPU->software behaviour and
> choices to software->software. By the time user-space gets the error, the
> behaviour is different.
In the KVM, as a policy choice to reserve this API to specify guest ESR and 
drive to trigger SError is OK,
At least for Qemu it does not know in which condition to trigger it.





Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-15 Thread James Morse
Hi gengdongjiu,

On 07/12/17 06:37, gengdongjiu wrote:
> I understand you most idea.
> 
> But In the Qemu one signal type can only correspond to one behavior, can not 
> correspond to two behaviors,
> otherwise Qemu will do not know how to do.
> 
> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate 
> the CPER
> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if 
> receives the SIGBUS_MCEERR_AO
> signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
> below:
> 
> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
> SIGBUS_MCEERR_AO trigger GPIO IRQ.
> 
> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify 
> trigger method, which all
> 
> not involve _trigger_ an SError.

It's a policy choice. How does your virtual CPU notify RAS errors to its virtual
software? You could use SError for SIGBUS_MCEERR_AR, it depends on what type of
CPU you are trying to emulate.

I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where
the guest doesn't take the SError immediately, instead tries to re-execute the
code KVM has unmapped from stage2 because its corrupt. (You could detect this
happening in Qemu and try something else)


Synchronous/asynchronous external abort matters to the CPU, but once the error
has been notified to software the reasons for this distinction disappear. Once
the error has been handled, all trace of this distinction is gone.

CPER records only describe component failures. You are trying to re-create some
state that disappeared with one of the firmware-first abstractions. Trying to
re-create this information isn't worth the effort as the distinction doesn't
matter to linux, only to the CPU.


> so there is no chance for Qemu to trigger the SError when gets the 
> SIGBUS_MCEERR_A{O,R}.

You mean there is no reason for Qemu to trigger an SError when it gets a signal
from the kernel.

The reasons the CPU might have to generate an SError don't apply to linux and
KVM user space. User-space will never get a signal for an uncontained error, we
will always panic(). We can't give user-space a signal for imprecise exceptions,
as it can't return from the signal. The classes of error that are left are
covered by polled/irq and NOTIFY_SEA.

Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really wants
to, (but I don't think you should, the kernel may have unmapped the page at PC
from stage2 due to corruption).


I think the problem here is you're applying the CPU->software behaviour and
choices to software->software. By the time user-space gets the error, the
behaviour is different.



Thanks,

James


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-15 Thread James Morse
Hi gengdongjiu,

On 07/12/17 06:37, gengdongjiu wrote:
> I understand you most idea.
> 
> But In the Qemu one signal type can only correspond to one behavior, can not 
> correspond to two behaviors,
> otherwise Qemu will do not know how to do.
> 
> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate 
> the CPER
> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if 
> receives the SIGBUS_MCEERR_AO
> signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
> below:
> 
> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
> SIGBUS_MCEERR_AO trigger GPIO IRQ.
> 
> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify 
> trigger method, which all
> 
> not involve _trigger_ an SError.

It's a policy choice. How does your virtual CPU notify RAS errors to its virtual
software? You could use SError for SIGBUS_MCEERR_AR, it depends on what type of
CPU you are trying to emulate.

I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where
the guest doesn't take the SError immediately, instead tries to re-execute the
code KVM has unmapped from stage2 because its corrupt. (You could detect this
happening in Qemu and try something else)


Synchronous/asynchronous external abort matters to the CPU, but once the error
has been notified to software the reasons for this distinction disappear. Once
the error has been handled, all trace of this distinction is gone.

CPER records only describe component failures. You are trying to re-create some
state that disappeared with one of the firmware-first abstractions. Trying to
re-create this information isn't worth the effort as the distinction doesn't
matter to linux, only to the CPU.


> so there is no chance for Qemu to trigger the SError when gets the 
> SIGBUS_MCEERR_A{O,R}.

You mean there is no reason for Qemu to trigger an SError when it gets a signal
from the kernel.

The reasons the CPU might have to generate an SError don't apply to linux and
KVM user space. User-space will never get a signal for an uncontained error, we
will always panic(). We can't give user-space a signal for imprecise exceptions,
as it can't return from the signal. The classes of error that are left are
covered by polled/irq and NOTIFY_SEA.

Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really wants
to, (but I don't think you should, the kernel may have unmapped the page at PC
from stage2 due to corruption).


I think the problem here is you're applying the CPU->software behaviour and
choices to software->software. By the time user-space gets the error, the
behaviour is different.



Thanks,

James


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-14 Thread gengdongjiu
Hi James,

On 2017/12/7 14:37, gengdongjiu wrote:
>> We need to tackle (1) and (3) separately. For (3) we need some API that lets
>> Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't 
>> have
>> a way of migrating pending SError yet... which is where I got stuck last 
>> time I
>> was looking at this.
> I understand you most idea.
> 
> But In the Qemu one signal type can only correspond to one behavior, can not 
> correspond to two behaviors,
> otherwise Qemu will do not know how to do.
> 
> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate 
> the CPER
> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if 
> receives the SIGBUS_MCEERR_AO
> signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
> below:
> 
> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
> SIGBUS_MCEERR_AO trigger GPIO IRQ.
> 
> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify 
> trigger method, which all
> 
> not involve _trigger_ an SError.
> 
> so there is no chance for Qemu to trigger the SError when gets the 
> SIGBUS_MCEERR_A{O,R}.

As I explained above:

If Qemu received SIGBUS_MCEERR_AR, it will record CPER and trigger Synchronous 
External Abort;
If Qemu received SIGBUS_MCEERR_AO, it will record CPER and trigger GPIO IRQ;
So Qemu does not know when to _trigger_ an SError.

so here I "return a error" to Qemu if ghes_notify_sei() return failure in [1], 
if you opposed KVM "return error",
do you have a better idea about it? thanks

About the way of migrating pending SError, I think it is a separate case, 
because Qemu still does not know
how and when to trigger the SError.

[1]:
static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
{
...
+   case ESR_ELx_AET_UER:   /* The error has not been propagated */
+   /*
+* Userspace only handle the guest SError Interrupt(SEI) if the
+* error has not been propagated
+*/
+   run->exit_reason = KVM_EXIT_EXCEPTION;
+   run->ex.exception = ESR_ELx_EC_SERROR;
+   run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
+   return 0;
...
}

> 



Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-14 Thread gengdongjiu
Hi James,

On 2017/12/7 14:37, gengdongjiu wrote:
>> We need to tackle (1) and (3) separately. For (3) we need some API that lets
>> Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't 
>> have
>> a way of migrating pending SError yet... which is where I got stuck last 
>> time I
>> was looking at this.
> I understand you most idea.
> 
> But In the Qemu one signal type can only correspond to one behavior, can not 
> correspond to two behaviors,
> otherwise Qemu will do not know how to do.
> 
> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate 
> the CPER
> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if 
> receives the SIGBUS_MCEERR_AO
> signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
> below:
> 
> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
> SIGBUS_MCEERR_AO trigger GPIO IRQ.
> 
> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify 
> trigger method, which all
> 
> not involve _trigger_ an SError.
> 
> so there is no chance for Qemu to trigger the SError when gets the 
> SIGBUS_MCEERR_A{O,R}.

As I explained above:

If Qemu received SIGBUS_MCEERR_AR, it will record CPER and trigger Synchronous 
External Abort;
If Qemu received SIGBUS_MCEERR_AO, it will record CPER and trigger GPIO IRQ;
So Qemu does not know when to _trigger_ an SError.

so here I "return a error" to Qemu if ghes_notify_sei() return failure in [1], 
if you opposed KVM "return error",
do you have a better idea about it? thanks

About the way of migrating pending SError, I think it is a separate case, 
because Qemu still does not know
how and when to trigger the SError.

[1]:
static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
{
...
+   case ESR_ELx_AET_UER:   /* The error has not been propagated */
+   /*
+* Userspace only handle the guest SError Interrupt(SEI) if the
+* error has not been propagated
+*/
+   run->exit_reason = KVM_EXIT_EXCEPTION;
+   run->ex.exception = ESR_ELx_EC_SERROR;
+   run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
+   return 0;
...
}

> 



Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-06 Thread gengdongjiu
Hi James,

On 2017/12/7 3:04, James Morse wrote:
> Hi gengdongjiu,
> 
> On 06/12/17 10:26, gengdongjiu wrote:
>> On 2017/11/15 0:00, James Morse wrote:
 +   * error has not been propagated
 +   */
 +  run->exit_reason = KVM_EXIT_EXCEPTION;
 +  run->ex.exception = ESR_ELx_EC_SERROR;
 +  run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
 +  return 0;
>>> We should not pass RAS notifications to user space. The kernel either 
>>> handles
>>> them, or it panics(). User space shouldn't even know if the kernel supports 
>>> RAS
>>> until it gets an MCEERR signal.
>>>
>>> You're making your firmware-first notification an EL3->EL0 signal, 
>>> bypassing the OS.
>>>
>>> If we get a RAS SError and there are no CPER records or values in the ERR 
>>> nodes,
>>> we should panic as it looks like the CPU/firmware is broken. (spurious RAS 
>>> errors)
> 
>> do you think whether we need to set the guest ESR by user space?  if need, I 
>> need to
>> notify user space that there is a SError happen and need to set ESR for 
>> guest in some place of
>> KVM.
> 
> I think you are still coming from a world where user-space gets raw RAS
> notifications via KVM. This should not happen because the notification method 
> is
> private to firmware and the kernel. KVM is just in the way when a guest is 
> running.
> 
> Notifications reaching KVM should be plumbed into the APEI-firmware-first-code
> or eventually, a kernel-first mechanism if APEI doesn't 'claim' them.
> 
> The kernel RAS code may signal user-space with the symptoms of the error, and
> user-space may decided to generate a new RAS notification for the guest.
> 
> This should function in exactly the same way, regardless of which notification
> method is in use between the kernel and firmware. (its the only way to make 
> this
> future-proof).
> 
> Which notification user-space chooses to use entirely depends on what (if
> anything) it advertised to the guest in the HEST. User-space has to be in
> control of triggering any SError, not just overriding the ESR when KVM has
> decided it wants to kill the guest.

thanks, I will explain more.

> 
> 
>> so here I return a error code to user space. you mean we should not pass RAS 
>> notifications
>> to user space, so could you give some suggestion how to notify user space to 
>> set guest ESR.
> 
> KVM shouldn't give the guest an SError when it takes a RAS notification, it
> should pass the notification to the kernel RAS code. It only needs to 'fall
> through' to some default cause if both APEI and kernel-first 
> deny-all-knowledge
> of this notification.
> 
> 
> The end-to-end flow is then (assuming no-VHE):
> (1)An error occurs, taking the CPU to EL3.
> EL3: triage the error, generate CPER, notify the OS
> EL2: KVM takes the notification, exits the guest, returns to host EL1.
> EL1: KVM handle_exit() calls APEI to handle the error.
> This is the end of KVMs involvement in RAS - its just plumbing.
> 
> (2)APEI processes the CPER records and signals affected processes.
> If KVM's user-space is affected, KVM will spot the pending signal when it goes
> to re-enter the guest, and exit to user-space instead.
> Qemu takes the SIGBUS_MCEERR_A{O,R}.
> 
> (3) Qemu decides it wants to hand the guest a RAS error, it populates the CPER
> records (in memory only Qemu knows about), then drives the KVM API to make the
> appropriate notification appear.
> 
> 
> (1) only happens if the guest was running when the error arrived. GHES has ~4
> flavours of IRQ which may be used to describe corruption in guest memory. 
> Steps
> (2) and (3) are exactly the same in this case.
> 
> Qemu may decide to trigger RAS errors all by itself, (probably for testing and
> debugging), in which case (1) and (2) don't happen, but (3), is exactly the 
> same.
> 
> 
> This way platform-firmware/host-kernel can use kernel-first or firmware-first
> with any of the notifications, independently from Qemu/guest-kernel making a
> different kernel-first or firmware-first with different notifications.
> 
> Passing information out of KVM breaks this, forcing Qemu to know about the
> mechanism platform-firmware is using.
> 
> 
> We need to tackle (1) and (3) separately. For (3) we need some API that lets
> Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't 
> have
> a way of migrating pending SError yet... which is where I got stuck last time 
> I
> was looking at this.

I understand you most idea.

But In the Qemu one signal type can only correspond to one behavior, can not 
correspond to two behaviors,
otherwise Qemu will do not know how to do.

For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate the 
CPER
records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if 
receives the SIGBUS_MCEERR_AO
signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
below:

SIGBUS_MCEERR_AR trigger Synchronous External Abort.
SIGBUS_MCEERR_AO 

Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-06 Thread gengdongjiu
Hi James,

On 2017/12/7 3:04, James Morse wrote:
> Hi gengdongjiu,
> 
> On 06/12/17 10:26, gengdongjiu wrote:
>> On 2017/11/15 0:00, James Morse wrote:
 +   * error has not been propagated
 +   */
 +  run->exit_reason = KVM_EXIT_EXCEPTION;
 +  run->ex.exception = ESR_ELx_EC_SERROR;
 +  run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
 +  return 0;
>>> We should not pass RAS notifications to user space. The kernel either 
>>> handles
>>> them, or it panics(). User space shouldn't even know if the kernel supports 
>>> RAS
>>> until it gets an MCEERR signal.
>>>
>>> You're making your firmware-first notification an EL3->EL0 signal, 
>>> bypassing the OS.
>>>
>>> If we get a RAS SError and there are no CPER records or values in the ERR 
>>> nodes,
>>> we should panic as it looks like the CPU/firmware is broken. (spurious RAS 
>>> errors)
> 
>> do you think whether we need to set the guest ESR by user space?  if need, I 
>> need to
>> notify user space that there is a SError happen and need to set ESR for 
>> guest in some place of
>> KVM.
> 
> I think you are still coming from a world where user-space gets raw RAS
> notifications via KVM. This should not happen because the notification method 
> is
> private to firmware and the kernel. KVM is just in the way when a guest is 
> running.
> 
> Notifications reaching KVM should be plumbed into the APEI-firmware-first-code
> or eventually, a kernel-first mechanism if APEI doesn't 'claim' them.
> 
> The kernel RAS code may signal user-space with the symptoms of the error, and
> user-space may decided to generate a new RAS notification for the guest.
> 
> This should function in exactly the same way, regardless of which notification
> method is in use between the kernel and firmware. (its the only way to make 
> this
> future-proof).
> 
> Which notification user-space chooses to use entirely depends on what (if
> anything) it advertised to the guest in the HEST. User-space has to be in
> control of triggering any SError, not just overriding the ESR when KVM has
> decided it wants to kill the guest.

thanks, I will explain more.

> 
> 
>> so here I return a error code to user space. you mean we should not pass RAS 
>> notifications
>> to user space, so could you give some suggestion how to notify user space to 
>> set guest ESR.
> 
> KVM shouldn't give the guest an SError when it takes a RAS notification, it
> should pass the notification to the kernel RAS code. It only needs to 'fall
> through' to some default cause if both APEI and kernel-first 
> deny-all-knowledge
> of this notification.
> 
> 
> The end-to-end flow is then (assuming no-VHE):
> (1)An error occurs, taking the CPU to EL3.
> EL3: triage the error, generate CPER, notify the OS
> EL2: KVM takes the notification, exits the guest, returns to host EL1.
> EL1: KVM handle_exit() calls APEI to handle the error.
> This is the end of KVMs involvement in RAS - its just plumbing.
> 
> (2)APEI processes the CPER records and signals affected processes.
> If KVM's user-space is affected, KVM will spot the pending signal when it goes
> to re-enter the guest, and exit to user-space instead.
> Qemu takes the SIGBUS_MCEERR_A{O,R}.
> 
> (3) Qemu decides it wants to hand the guest a RAS error, it populates the CPER
> records (in memory only Qemu knows about), then drives the KVM API to make the
> appropriate notification appear.
> 
> 
> (1) only happens if the guest was running when the error arrived. GHES has ~4
> flavours of IRQ which may be used to describe corruption in guest memory. 
> Steps
> (2) and (3) are exactly the same in this case.
> 
> Qemu may decide to trigger RAS errors all by itself, (probably for testing and
> debugging), in which case (1) and (2) don't happen, but (3), is exactly the 
> same.
> 
> 
> This way platform-firmware/host-kernel can use kernel-first or firmware-first
> with any of the notifications, independently from Qemu/guest-kernel making a
> different kernel-first or firmware-first with different notifications.
> 
> Passing information out of KVM breaks this, forcing Qemu to know about the
> mechanism platform-firmware is using.
> 
> 
> We need to tackle (1) and (3) separately. For (3) we need some API that lets
> Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't 
> have
> a way of migrating pending SError yet... which is where I got stuck last time 
> I
> was looking at this.

I understand you most idea.

But In the Qemu one signal type can only correspond to one behavior, can not 
correspond to two behaviors,
otherwise Qemu will do not know how to do.

For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate the 
CPER
records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if 
receives the SIGBUS_MCEERR_AO
signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
below:

SIGBUS_MCEERR_AR trigger Synchronous External Abort.
SIGBUS_MCEERR_AO 

Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-06 Thread James Morse
Hi gengdongjiu,

On 06/12/17 10:26, gengdongjiu wrote:
> On 2017/11/15 0:00, James Morse wrote:
>>> +* error has not been propagated
>>> +*/
>>> +   run->exit_reason = KVM_EXIT_EXCEPTION;
>>> +   run->ex.exception = ESR_ELx_EC_SERROR;
>>> +   run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>>> +   return 0;
>> We should not pass RAS notifications to user space. The kernel either handles
>> them, or it panics(). User space shouldn't even know if the kernel supports 
>> RAS
>> until it gets an MCEERR signal.
>>
>> You're making your firmware-first notification an EL3->EL0 signal, bypassing 
>> the OS.
>>
>> If we get a RAS SError and there are no CPER records or values in the ERR 
>> nodes,
>> we should panic as it looks like the CPU/firmware is broken. (spurious RAS 
>> errors)

> do you think whether we need to set the guest ESR by user space?  if need, I 
> need to
> notify user space that there is a SError happen and need to set ESR for guest 
> in some place of
> KVM.

I think you are still coming from a world where user-space gets raw RAS
notifications via KVM. This should not happen because the notification method is
private to firmware and the kernel. KVM is just in the way when a guest is 
running.

Notifications reaching KVM should be plumbed into the APEI-firmware-first-code
or eventually, a kernel-first mechanism if APEI doesn't 'claim' them.

The kernel RAS code may signal user-space with the symptoms of the error, and
user-space may decided to generate a new RAS notification for the guest.

This should function in exactly the same way, regardless of which notification
method is in use between the kernel and firmware. (its the only way to make this
future-proof).

Which notification user-space chooses to use entirely depends on what (if
anything) it advertised to the guest in the HEST. User-space has to be in
control of triggering any SError, not just overriding the ESR when KVM has
decided it wants to kill the guest.


> so here I return a error code to user space. you mean we should not pass RAS 
> notifications
> to user space, so could you give some suggestion how to notify user space to 
> set guest ESR.

KVM shouldn't give the guest an SError when it takes a RAS notification, it
should pass the notification to the kernel RAS code. It only needs to 'fall
through' to some default cause if both APEI and kernel-first deny-all-knowledge
of this notification.


The end-to-end flow is then (assuming no-VHE):
(1)An error occurs, taking the CPU to EL3.
EL3: triage the error, generate CPER, notify the OS
EL2: KVM takes the notification, exits the guest, returns to host EL1.
EL1: KVM handle_exit() calls APEI to handle the error.
This is the end of KVMs involvement in RAS - its just plumbing.

(2)APEI processes the CPER records and signals affected processes.
If KVM's user-space is affected, KVM will spot the pending signal when it goes
to re-enter the guest, and exit to user-space instead.
Qemu takes the SIGBUS_MCEERR_A{O,R}.

(3) Qemu decides it wants to hand the guest a RAS error, it populates the CPER
records (in memory only Qemu knows about), then drives the KVM API to make the
appropriate notification appear.


(1) only happens if the guest was running when the error arrived. GHES has ~4
flavours of IRQ which may be used to describe corruption in guest memory. Steps
(2) and (3) are exactly the same in this case.

Qemu may decide to trigger RAS errors all by itself, (probably for testing and
debugging), in which case (1) and (2) don't happen, but (3), is exactly the 
same.


This way platform-firmware/host-kernel can use kernel-first or firmware-first
with any of the notifications, independently from Qemu/guest-kernel making a
different kernel-first or firmware-first with different notifications.

Passing information out of KVM breaks this, forcing Qemu to know about the
mechanism platform-firmware is using.


We need to tackle (1) and (3) separately. For (3) we need some API that lets
Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't have
a way of migrating pending SError yet... which is where I got stuck last time I
was looking at this.



James


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-06 Thread James Morse
Hi gengdongjiu,

On 06/12/17 10:26, gengdongjiu wrote:
> On 2017/11/15 0:00, James Morse wrote:
>>> +* error has not been propagated
>>> +*/
>>> +   run->exit_reason = KVM_EXIT_EXCEPTION;
>>> +   run->ex.exception = ESR_ELx_EC_SERROR;
>>> +   run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>>> +   return 0;
>> We should not pass RAS notifications to user space. The kernel either handles
>> them, or it panics(). User space shouldn't even know if the kernel supports 
>> RAS
>> until it gets an MCEERR signal.
>>
>> You're making your firmware-first notification an EL3->EL0 signal, bypassing 
>> the OS.
>>
>> If we get a RAS SError and there are no CPER records or values in the ERR 
>> nodes,
>> we should panic as it looks like the CPU/firmware is broken. (spurious RAS 
>> errors)

> do you think whether we need to set the guest ESR by user space?  if need, I 
> need to
> notify user space that there is a SError happen and need to set ESR for guest 
> in some place of
> KVM.

I think you are still coming from a world where user-space gets raw RAS
notifications via KVM. This should not happen because the notification method is
private to firmware and the kernel. KVM is just in the way when a guest is 
running.

Notifications reaching KVM should be plumbed into the APEI-firmware-first-code
or eventually, a kernel-first mechanism if APEI doesn't 'claim' them.

The kernel RAS code may signal user-space with the symptoms of the error, and
user-space may decided to generate a new RAS notification for the guest.

This should function in exactly the same way, regardless of which notification
method is in use between the kernel and firmware. (its the only way to make this
future-proof).

Which notification user-space chooses to use entirely depends on what (if
anything) it advertised to the guest in the HEST. User-space has to be in
control of triggering any SError, not just overriding the ESR when KVM has
decided it wants to kill the guest.


> so here I return a error code to user space. you mean we should not pass RAS 
> notifications
> to user space, so could you give some suggestion how to notify user space to 
> set guest ESR.

KVM shouldn't give the guest an SError when it takes a RAS notification, it
should pass the notification to the kernel RAS code. It only needs to 'fall
through' to some default cause if both APEI and kernel-first deny-all-knowledge
of this notification.


The end-to-end flow is then (assuming no-VHE):
(1)An error occurs, taking the CPU to EL3.
EL3: triage the error, generate CPER, notify the OS
EL2: KVM takes the notification, exits the guest, returns to host EL1.
EL1: KVM handle_exit() calls APEI to handle the error.
This is the end of KVMs involvement in RAS - its just plumbing.

(2)APEI processes the CPER records and signals affected processes.
If KVM's user-space is affected, KVM will spot the pending signal when it goes
to re-enter the guest, and exit to user-space instead.
Qemu takes the SIGBUS_MCEERR_A{O,R}.

(3) Qemu decides it wants to hand the guest a RAS error, it populates the CPER
records (in memory only Qemu knows about), then drives the KVM API to make the
appropriate notification appear.


(1) only happens if the guest was running when the error arrived. GHES has ~4
flavours of IRQ which may be used to describe corruption in guest memory. Steps
(2) and (3) are exactly the same in this case.

Qemu may decide to trigger RAS errors all by itself, (probably for testing and
debugging), in which case (1) and (2) don't happen, but (3), is exactly the 
same.


This way platform-firmware/host-kernel can use kernel-first or firmware-first
with any of the notifications, independently from Qemu/guest-kernel making a
different kernel-first or firmware-first with different notifications.

Passing information out of KVM breaks this, forcing Qemu to know about the
mechanism platform-firmware is using.


We need to tackle (1) and (3) separately. For (3) we need some API that lets
Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't have
a way of migrating pending SError yet... which is where I got stuck last time I
was looking at this.



James


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-06 Thread gengdongjiu

On 2017/11/15 0:00, James Morse wrote:
>> + * error has not been propagated
>> + */
>> +run->exit_reason = KVM_EXIT_EXCEPTION;
>> +run->ex.exception = ESR_ELx_EC_SERROR;
>> +run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>> +return 0;
> We should not pass RAS notifications to user space. The kernel either handles
> them, or it panics(). User space shouldn't even know if the kernel supports 
> RAS
> until it gets an MCEERR signal.
> 
> You're making your firmware-first notification an EL3->EL0 signal, bypassing 
> the OS.
> 
> If we get a RAS SError and there are no CPER records or values in the ERR 
> nodes,
> we should panic as it looks like the CPU/firmware is broken. (spurious RAS 
> errors)

Hi james,
  sorry to disturb you!

  do you think whether we need to set the guest ESR by user space?  if need, I 
need to
notify user space that there is a SError happen and need to set ESR for guest 
in some place of
KVM. so here I return a error code to user space. you mean we should not pass 
RAS notifications
to user space, so could you give some suggestion how to notify user space to 
set guest ESR.

Thanks a lot in advance.


> 
> 



Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-06 Thread gengdongjiu

On 2017/11/15 0:00, James Morse wrote:
>> + * error has not been propagated
>> + */
>> +run->exit_reason = KVM_EXIT_EXCEPTION;
>> +run->ex.exception = ESR_ELx_EC_SERROR;
>> +run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>> +return 0;
> We should not pass RAS notifications to user space. The kernel either handles
> them, or it panics(). User space shouldn't even know if the kernel supports 
> RAS
> until it gets an MCEERR signal.
> 
> You're making your firmware-first notification an EL3->EL0 signal, bypassing 
> the OS.
> 
> If we get a RAS SError and there are no CPER records or values in the ERR 
> nodes,
> we should panic as it looks like the CPU/firmware is broken. (spurious RAS 
> errors)

Hi james,
  sorry to disturb you!

  do you think whether we need to set the guest ESR by user space?  if need, I 
need to
notify user space that there is a SError happen and need to set ESR for guest 
in some place of
KVM. so here I return a error code to user space. you mean we should not pass 
RAS notifications
to user space, so could you give some suggestion how to notify user space to 
set guest ESR.

Thanks a lot in advance.


> 
> 



Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-11-15 Thread gengdongjiu
Hi James,

   Thanks a lot for the review.

On 2017/11/15 0:00, James Morse wrote:
> Hi Dongjiu Geng,
> 
> On 10/11/17 19:54, Dongjiu Geng wrote:
>> If it is not RAS SError, directly inject virtual SError,
>> which will keep the old way. If it is RAS SError, firstly
>> let host ACPI module to handle it.
> 
>> For the ACPI handling,
>> if the error address is invalid, APEI driver will not
>> identify the address to hwpoison memory and can not notify
>> guest to do the recovery.
> 
> The guest can't do any recover either. There is no recovery you can do without
> some information about what the error is.
> 
> This is your memory corruption at an unknown address? We should reboot.
> 
> (I agree memory_failure.c's::me_kernel() is ignoring kernel errors, we should
> try and fix this. It makes some sense for polled or irq notifications, but not
> SEA/SEI).
> 
> 
>> In order to safe, KVM continues
>> categorizing errors and handle it separately.
> 
>> If the RAS error is not propagated, let host user space to
>> handle it. 
> 
> No. Host user space should not know anything about the kernel or platform RAS
> support. Doing so creates an ABI link between EL3 firmware and Qemu. This is
> totally unmaintainable.

Here I have two question:
(1) If the AET(Asynchronous Error Type) is Recoverable error (UER), do you mean 
we also reboot or panic?
(2) what is the chance to set guest ESR for Qemu?  here I return a error code 
to Qemu. when Qemu get this error return,
it will specify guest ESR and inject the abort. here if KVM does not return 
error to Qemu, Qemu will do
not know when to set the guest ESR value and inject abort.


> 
> This thing needs to be portable. The kernel should handle the error, and 
> report
> any symptoms to user-space. e.g. 'this memory is gone'.
> 
> We shouldn't special case KVM.
> 
> 
>> The reason is that sometimes we can only kill the
>> guest effected application instead of panic whose guest OS.
>> Host user space specifies a valid ESR and inject virtual
>> SError, guest can just kill the current application if the
>> non-consumed error coming from guest application.
>>
>> Signed-off-by: Dongjiu Geng 
>> Signed-off-by: Quanming Wu 
> 
> The last Signed-off-by should match the person posting the patch. It's a chain
> of custody for GPL-signoff purposes, not a 'partially-written-by'. If you want
> to credit Quanming Wu you can add CC and they can Ack/Review your patch.

Ok, got it. thanks a lot for your suggestion.


> 
> 
>> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
>> index 7debb74..1afdc87 100644
>> --- a/arch/arm64/kvm/handle_exit.c
>> +++ b/arch/arm64/kvm/handle_exit.c
>> @@ -178,6 +179,66 @@ static exit_handle_fn kvm_get_exit_handler(struct 
>> kvm_vcpu *vcpu)
>>  return arm_exit_handlers[hsr_ec];
>>  }
>>  
>> +/**
>> + * kvm_handle_guest_sei - handles SError interrupt or asynchronous aborts
>> + * @vcpu:   the VCPU pointer
>> + *
>> + * For RAS SError interrupt, firstly let host kernel handle it.
>> + * If the AET is [ESR_ELx_AET_UER], then let user space handle it,
>> + */
>> +static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
>> +{
>> +unsigned int esr = kvm_vcpu_get_hsr(vcpu);
>> +bool impdef_syndrome =  esr & ESR_ELx_ISV;  /* aka IDS */
>> +unsigned int aet = esr & ESR_ELx_AET;
>> +
>> +/*
>> + * This is not RAS SError
>> + */
>> +if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) {
>> +kvm_inject_vabt(vcpu);
>> +return 1;
>> +}
> 
>> +/* The host kernel may handle this abort. */
>> +handle_guest_sei();
> 
> This has to claim the SError as a notification. If APEI claims the error, KVM
> doesn't need to do anything more. You ignore its return code.

Thanks for the pointing out.
I will check the return code, if it return success, KVM doesn't need to do 
anything more,
otherwise, continue run.

> 
> 
>> +
>> +/*
>> + * In below two conditions, it will directly inject the
>> + * virtual SError:
>> + * 1. The Syndrome is IMPLEMENTATION DEFINED
>> + * 2. It is Uncategorized SEI
>> + */
>> +if (impdef_syndrome ||
>> +((esr & ESR_ELx_FSC) != ESR_ELx_FSC_SERROR)) {
>> +kvm_inject_vabt(vcpu);
>> +return 1;
>> +}
>> +
>> +switch (aet) {
>> +case ESR_ELx_AET_CE:/* corrected error */
>> +case ESR_ELx_AET_UEO:   /* restartable error, not yet consumed */
>> +return 1;   /* continue processing the guest exit */
> 
>> +case ESR_ELx_AET_UER:   /* The error has not been propagated */
>> +/*
>> + * Userspace only handle the guest SError Interrupt(SEI) if the
>> + * error has not been propagated
>> + */
>> +run->exit_reason = KVM_EXIT_EXCEPTION;
>> +run->ex.exception = ESR_ELx_EC_SERROR;
>> +run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>> + 

Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-11-15 Thread gengdongjiu
Hi James,

   Thanks a lot for the review.

On 2017/11/15 0:00, James Morse wrote:
> Hi Dongjiu Geng,
> 
> On 10/11/17 19:54, Dongjiu Geng wrote:
>> If it is not RAS SError, directly inject virtual SError,
>> which will keep the old way. If it is RAS SError, firstly
>> let host ACPI module to handle it.
> 
>> For the ACPI handling,
>> if the error address is invalid, APEI driver will not
>> identify the address to hwpoison memory and can not notify
>> guest to do the recovery.
> 
> The guest can't do any recover either. There is no recovery you can do without
> some information about what the error is.
> 
> This is your memory corruption at an unknown address? We should reboot.
> 
> (I agree memory_failure.c's::me_kernel() is ignoring kernel errors, we should
> try and fix this. It makes some sense for polled or irq notifications, but not
> SEA/SEI).
> 
> 
>> In order to safe, KVM continues
>> categorizing errors and handle it separately.
> 
>> If the RAS error is not propagated, let host user space to
>> handle it. 
> 
> No. Host user space should not know anything about the kernel or platform RAS
> support. Doing so creates an ABI link between EL3 firmware and Qemu. This is
> totally unmaintainable.

Here I have two question:
(1) If the AET(Asynchronous Error Type) is Recoverable error (UER), do you mean 
we also reboot or panic?
(2) what is the chance to set guest ESR for Qemu?  here I return a error code 
to Qemu. when Qemu get this error return,
it will specify guest ESR and inject the abort. here if KVM does not return 
error to Qemu, Qemu will do
not know when to set the guest ESR value and inject abort.


> 
> This thing needs to be portable. The kernel should handle the error, and 
> report
> any symptoms to user-space. e.g. 'this memory is gone'.
> 
> We shouldn't special case KVM.
> 
> 
>> The reason is that sometimes we can only kill the
>> guest effected application instead of panic whose guest OS.
>> Host user space specifies a valid ESR and inject virtual
>> SError, guest can just kill the current application if the
>> non-consumed error coming from guest application.
>>
>> Signed-off-by: Dongjiu Geng 
>> Signed-off-by: Quanming Wu 
> 
> The last Signed-off-by should match the person posting the patch. It's a chain
> of custody for GPL-signoff purposes, not a 'partially-written-by'. If you want
> to credit Quanming Wu you can add CC and they can Ack/Review your patch.

Ok, got it. thanks a lot for your suggestion.


> 
> 
>> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
>> index 7debb74..1afdc87 100644
>> --- a/arch/arm64/kvm/handle_exit.c
>> +++ b/arch/arm64/kvm/handle_exit.c
>> @@ -178,6 +179,66 @@ static exit_handle_fn kvm_get_exit_handler(struct 
>> kvm_vcpu *vcpu)
>>  return arm_exit_handlers[hsr_ec];
>>  }
>>  
>> +/**
>> + * kvm_handle_guest_sei - handles SError interrupt or asynchronous aborts
>> + * @vcpu:   the VCPU pointer
>> + *
>> + * For RAS SError interrupt, firstly let host kernel handle it.
>> + * If the AET is [ESR_ELx_AET_UER], then let user space handle it,
>> + */
>> +static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
>> +{
>> +unsigned int esr = kvm_vcpu_get_hsr(vcpu);
>> +bool impdef_syndrome =  esr & ESR_ELx_ISV;  /* aka IDS */
>> +unsigned int aet = esr & ESR_ELx_AET;
>> +
>> +/*
>> + * This is not RAS SError
>> + */
>> +if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) {
>> +kvm_inject_vabt(vcpu);
>> +return 1;
>> +}
> 
>> +/* The host kernel may handle this abort. */
>> +handle_guest_sei();
> 
> This has to claim the SError as a notification. If APEI claims the error, KVM
> doesn't need to do anything more. You ignore its return code.

Thanks for the pointing out.
I will check the return code, if it return success, KVM doesn't need to do 
anything more,
otherwise, continue run.

> 
> 
>> +
>> +/*
>> + * In below two conditions, it will directly inject the
>> + * virtual SError:
>> + * 1. The Syndrome is IMPLEMENTATION DEFINED
>> + * 2. It is Uncategorized SEI
>> + */
>> +if (impdef_syndrome ||
>> +((esr & ESR_ELx_FSC) != ESR_ELx_FSC_SERROR)) {
>> +kvm_inject_vabt(vcpu);
>> +return 1;
>> +}
>> +
>> +switch (aet) {
>> +case ESR_ELx_AET_CE:/* corrected error */
>> +case ESR_ELx_AET_UEO:   /* restartable error, not yet consumed */
>> +return 1;   /* continue processing the guest exit */
> 
>> +case ESR_ELx_AET_UER:   /* The error has not been propagated */
>> +/*
>> + * Userspace only handle the guest SError Interrupt(SEI) if the
>> + * error has not been propagated
>> + */
>> +run->exit_reason = KVM_EXIT_EXCEPTION;
>> +run->ex.exception = ESR_ELx_EC_SERROR;
>> +run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>> +return 0;
> 
> We should not pass 

Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-11-14 Thread James Morse
Hi Dongjiu Geng,

On 10/11/17 19:54, Dongjiu Geng wrote:
> If it is not RAS SError, directly inject virtual SError,
> which will keep the old way. If it is RAS SError, firstly
> let host ACPI module to handle it.

> For the ACPI handling,
> if the error address is invalid, APEI driver will not
> identify the address to hwpoison memory and can not notify
> guest to do the recovery.

The guest can't do any recover either. There is no recovery you can do without
some information about what the error is.

This is your memory corruption at an unknown address? We should reboot.

(I agree memory_failure.c's::me_kernel() is ignoring kernel errors, we should
try and fix this. It makes some sense for polled or irq notifications, but not
SEA/SEI).


> In order to safe, KVM continues
> categorizing errors and handle it separately.

> If the RAS error is not propagated, let host user space to
> handle it. 

No. Host user space should not know anything about the kernel or platform RAS
support. Doing so creates an ABI link between EL3 firmware and Qemu. This is
totally unmaintainable.

This thing needs to be portable. The kernel should handle the error, and report
any symptoms to user-space. e.g. 'this memory is gone'.

We shouldn't special case KVM.


> The reason is that sometimes we can only kill the
> guest effected application instead of panic whose guest OS.
> Host user space specifies a valid ESR and inject virtual
> SError, guest can just kill the current application if the
> non-consumed error coming from guest application.
> 
> Signed-off-by: Dongjiu Geng 
> Signed-off-by: Quanming Wu 

The last Signed-off-by should match the person posting the patch. It's a chain
of custody for GPL-signoff purposes, not a 'partially-written-by'. If you want
to credit Quanming Wu you can add CC and they can Ack/Review your patch.


> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index 7debb74..1afdc87 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -178,6 +179,66 @@ static exit_handle_fn kvm_get_exit_handler(struct 
> kvm_vcpu *vcpu)
>   return arm_exit_handlers[hsr_ec];
>  }
>  
> +/**
> + * kvm_handle_guest_sei - handles SError interrupt or asynchronous aborts
> + * @vcpu:the VCPU pointer
> + *
> + * For RAS SError interrupt, firstly let host kernel handle it.
> + * If the AET is [ESR_ELx_AET_UER], then let user space handle it,
> + */
> +static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
> +{
> + unsigned int esr = kvm_vcpu_get_hsr(vcpu);
> + bool impdef_syndrome =  esr & ESR_ELx_ISV;  /* aka IDS */
> + unsigned int aet = esr & ESR_ELx_AET;
> +
> + /*
> +  * This is not RAS SError
> +  */
> + if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) {
> + kvm_inject_vabt(vcpu);
> + return 1;
> + }

> + /* The host kernel may handle this abort. */
> + handle_guest_sei();

This has to claim the SError as a notification. If APEI claims the error, KVM
doesn't need to do anything more. You ignore its return code.


> +
> + /*
> +  * In below two conditions, it will directly inject the
> +  * virtual SError:
> +  * 1. The Syndrome is IMPLEMENTATION DEFINED
> +  * 2. It is Uncategorized SEI
> +  */
> + if (impdef_syndrome ||
> + ((esr & ESR_ELx_FSC) != ESR_ELx_FSC_SERROR)) {
> + kvm_inject_vabt(vcpu);
> + return 1;
> + }
> +
> + switch (aet) {
> + case ESR_ELx_AET_CE:/* corrected error */
> + case ESR_ELx_AET_UEO:   /* restartable error, not yet consumed */
> + return 1;   /* continue processing the guest exit */

> + case ESR_ELx_AET_UER:   /* The error has not been propagated */
> + /*
> +  * Userspace only handle the guest SError Interrupt(SEI) if the
> +  * error has not been propagated
> +  */
> + run->exit_reason = KVM_EXIT_EXCEPTION;
> + run->ex.exception = ESR_ELx_EC_SERROR;
> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
> + return 0;

We should not pass RAS notifications to user space. The kernel either handles
them, or it panics(). User space shouldn't even know if the kernel supports RAS
until it gets an MCEERR signal.

You're making your firmware-first notification an EL3->EL0 signal, bypassing 
the OS.

If we get a RAS SError and there are no CPER records or values in the ERR nodes,
we should panic as it looks like the CPU/firmware is broken. (spurious RAS 
errors)


> + default:
> + /*
> +  * Until now, the CPU supports RAS and SEI is fatal, or host
> +  * does not support to handle the SError.
> +  */
> + panic("This Asynchronous SError interrupt is dangerous, panic");
> + }
> +
> + return 0;
> +}
> +
>  /*
>   * Return > 0 to return to guest, < 0 

Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-11-14 Thread James Morse
Hi Dongjiu Geng,

On 10/11/17 19:54, Dongjiu Geng wrote:
> If it is not RAS SError, directly inject virtual SError,
> which will keep the old way. If it is RAS SError, firstly
> let host ACPI module to handle it.

> For the ACPI handling,
> if the error address is invalid, APEI driver will not
> identify the address to hwpoison memory and can not notify
> guest to do the recovery.

The guest can't do any recover either. There is no recovery you can do without
some information about what the error is.

This is your memory corruption at an unknown address? We should reboot.

(I agree memory_failure.c's::me_kernel() is ignoring kernel errors, we should
try and fix this. It makes some sense for polled or irq notifications, but not
SEA/SEI).


> In order to safe, KVM continues
> categorizing errors and handle it separately.

> If the RAS error is not propagated, let host user space to
> handle it. 

No. Host user space should not know anything about the kernel or platform RAS
support. Doing so creates an ABI link between EL3 firmware and Qemu. This is
totally unmaintainable.

This thing needs to be portable. The kernel should handle the error, and report
any symptoms to user-space. e.g. 'this memory is gone'.

We shouldn't special case KVM.


> The reason is that sometimes we can only kill the
> guest effected application instead of panic whose guest OS.
> Host user space specifies a valid ESR and inject virtual
> SError, guest can just kill the current application if the
> non-consumed error coming from guest application.
> 
> Signed-off-by: Dongjiu Geng 
> Signed-off-by: Quanming Wu 

The last Signed-off-by should match the person posting the patch. It's a chain
of custody for GPL-signoff purposes, not a 'partially-written-by'. If you want
to credit Quanming Wu you can add CC and they can Ack/Review your patch.


> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index 7debb74..1afdc87 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -178,6 +179,66 @@ static exit_handle_fn kvm_get_exit_handler(struct 
> kvm_vcpu *vcpu)
>   return arm_exit_handlers[hsr_ec];
>  }
>  
> +/**
> + * kvm_handle_guest_sei - handles SError interrupt or asynchronous aborts
> + * @vcpu:the VCPU pointer
> + *
> + * For RAS SError interrupt, firstly let host kernel handle it.
> + * If the AET is [ESR_ELx_AET_UER], then let user space handle it,
> + */
> +static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
> +{
> + unsigned int esr = kvm_vcpu_get_hsr(vcpu);
> + bool impdef_syndrome =  esr & ESR_ELx_ISV;  /* aka IDS */
> + unsigned int aet = esr & ESR_ELx_AET;
> +
> + /*
> +  * This is not RAS SError
> +  */
> + if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) {
> + kvm_inject_vabt(vcpu);
> + return 1;
> + }

> + /* The host kernel may handle this abort. */
> + handle_guest_sei();

This has to claim the SError as a notification. If APEI claims the error, KVM
doesn't need to do anything more. You ignore its return code.


> +
> + /*
> +  * In below two conditions, it will directly inject the
> +  * virtual SError:
> +  * 1. The Syndrome is IMPLEMENTATION DEFINED
> +  * 2. It is Uncategorized SEI
> +  */
> + if (impdef_syndrome ||
> + ((esr & ESR_ELx_FSC) != ESR_ELx_FSC_SERROR)) {
> + kvm_inject_vabt(vcpu);
> + return 1;
> + }
> +
> + switch (aet) {
> + case ESR_ELx_AET_CE:/* corrected error */
> + case ESR_ELx_AET_UEO:   /* restartable error, not yet consumed */
> + return 1;   /* continue processing the guest exit */

> + case ESR_ELx_AET_UER:   /* The error has not been propagated */
> + /*
> +  * Userspace only handle the guest SError Interrupt(SEI) if the
> +  * error has not been propagated
> +  */
> + run->exit_reason = KVM_EXIT_EXCEPTION;
> + run->ex.exception = ESR_ELx_EC_SERROR;
> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
> + return 0;

We should not pass RAS notifications to user space. The kernel either handles
them, or it panics(). User space shouldn't even know if the kernel supports RAS
until it gets an MCEERR signal.

You're making your firmware-first notification an EL3->EL0 signal, bypassing 
the OS.

If we get a RAS SError and there are no CPER records or values in the ERR nodes,
we should panic as it looks like the CPU/firmware is broken. (spurious RAS 
errors)


> + default:
> + /*
> +  * Until now, the CPU supports RAS and SEI is fatal, or host
> +  * does not support to handle the SError.
> +  */
> + panic("This Asynchronous SError interrupt is dangerous, panic");
> + }
> +
> + return 0;
> +}
> +
>  /*
>   * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on
>   *