Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-11-03 Thread James Morse
Hi gengdongjiu,

On 27/10/17 08:21, gengdongjiu wrote:
> On 2017/10/26 1:42, James Morse wrote:
>> On 20/10/17 16:33, gengdongjiu wrote:
>>> As we discuss below solution:
>>> When guest happen SEA/SEI, KVM calls memory_failure() to send an 
>>> asynchronous SIGBUS
>>> signal(BUS_MCEERR_AO) to QEMU, and make this address to poisoned.
>>> after QEMU receive this BUS_MCEERR_AO, it will record this address to CPER 
>>> and notify guest.
>>> When guest happen stage2 page fault, KVM send a synchronous SIGBUS
>>> BUS_MCEERR_AR to QEMU, and QEMU also record CPER and immediately inject SEA 
>>> abort.
>>>
>>> But this solution, still have some problems.
>>>
>>> 1. In some situation, For RAS, when happen SEA, hardware cannot provide an 
>>> error physical
>>> address
>>
>> Eh? For any RAS error you should get a physical address in ERRADDR.
>>
>> When you get an external abort due to RAS you can scan these nodes to find
>> which one generated the error and collect the component information.
>> Doing this in firmware is better because firmware knows the SoC topology, so
>> it can skip the nodes it knows won't be relevant to an error on this CPU.

> Thanks for your suggestion.
> After discussing this issue internally, I think this is a problem in our
> firmware, not a common issue.
> So let us ignore the case where hardware does not record the physical error
> address.

This is going to give you problems in the long run. All we can do with 'memory
corrupt at an unknown address' is reboot.


>>> to software instead it can only provide virtual address in FAR_ELx, 
>>> This is to say, firmware cannot provide physical error address, but 
>>> provided the virtual
>>> address in the FAR_ELx. so BIOS cannot record this address to APEI table. In
>>
>> Nit: APEI table, you mean recorded as CPER records in a buffer pointed to by
>> a GHES's ErrorStatusAddress. APEI tables aren't parsed post boot.
>>
>>
>>> this case, when firmware Jump to hypervisor, hypervisor cannot call
>>> memory_failure(), now only the physical address is recorded and valid, APEI
>>> driver will call the memory_failure()), in this case, host will not send 
>>> SIGBUS
>>> to QEMU. So guest cannot know there is SEA happen.
>>> At least there is such issue in Huawei's platform (cannot provide PA for 
>>> RAS firmware-first,
>>> only can provide VA in FAR_ELx)
>>
>> This isn't a KVM problem.
>>
>> It looks like both of UEFI's 'Table 275. Memory Error Record' and 'Table 276.
>> Memory Error Record 2' require a physical address. You can't describe a
>> memory error without one.
>>
>> Is this really a memory error?, or some other component, say, a virtually
>> indexed cache.

> When an SEA happens, if the {D,I}FSC is 0b0101xx (SEA on translation table
> walk or hardware update of translation table), it means the page table
> itself is faulty, not the target address.
> In this case, even if firmware can report the physical address of the bad
> page table, memory_failure() cannot recognize this address because the page
> table does not belong to any task, including Qemu, so memory_failure() will
> not deliver SIGBUS. Of course, this is still a memory address.

Both KVM's stage2 and Qemu's user-space page tables are made up of pages of
kernel memory. When memory_failure() is told one of these is corrupt, it should
panic.

E.g, arm64 allocates pmd pages like so:
> static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
> {
>   return (pmd_t *)__get_free_page(PGALLOC_GFP);
> }

To do any better the kernel would need to know this memory is page-table and
that it wasn't the kernel's page-table. It also needs to know which mm_struct it
belonged to, and where in the page-table tree the corrupted page lives. There
would need to be a per-arch helper to ensure no CPU (or other component) had
cached the corrupted page table entries. (it's contained right?)

This isn't an arm64 specific issue, and it's going to be very difficult to do.


> I once ran an experiment: if an application's page table itself generates an
> SEA, memory_failure() treats it as an unknown issue. Please see the log
> below; I think this should be a common issue.

This shouldn't be a common issue, page-tables are small compared to the memory
they map.


> so in KVM code, I plan to separately handle the page table error of SEA if
> the {D,I}FSC is 0b0101xx, and not call memory_failure(), what do you
> think about that?

I think we shouldn't special case KVM. All user-space tasks' page-tables are
kernel memory too, they shouldn't be treated differently. Once Linux can handle
user-space page-table corruption, we can wire-in KVM's stage2.

KVM shouldn't call memory_failure() directly, for RAS it should rely on a
firmware-first or kernel-first handler to diagnose the error and do this dirty
work.

I agree 'unknown error' sounds fishy:

> only the memory access SEA call memory_failure().
> 
> [   25.482904] {1}[Hardware Error]: Hardware error from APEI Ge

Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-10-27 Thread gengdongjiu
James,
  Thanks for the comment.

On 2017/10/26 1:42, James Morse wrote:
> Hi gengdongjiu,
> 
> On 20/10/17 16:33, gengdongjiu wrote:
>> As we discuss below solution:
>> When guest happen SEA/SEI, KVM calls memory_failure() to send an 
>> asynchronous SIGBUS
>> signal(BUS_MCEERR_AO) to QEMU, and make this address to poisoned.
>> after QEMU receive this BUS_MCEERR_AO, it will record this address to CPER 
>> and notify guest.
>> When guest happen stage2 page fault, KVM send a synchronous SIGBUS
>> BUS_MCEERR_AR to QEMU, and QEMU also record CPER and immediately inject SEA 
>> abort.
>>
>> But this solution, still have some problems.
>>
>> 1. In some situation, For RAS, when happen SEA, hardware cannot provide an 
>> error physical
>> address
> 
> Eh? For any RAS error you should get a physical address in ERRADDR.
> 
> When you get an external abort due to RAS you can scan these nodes to find
> which one generated the error and collect the component information.
> Doing this in firmware is better because firmware knows the SoC topology, so
> it can skip the nodes it knows won't be relevant to an error on this CPU.

Thanks for your suggestion.
After discussing this issue internally, I think this is a problem in our
firmware, not a common issue.
So let us ignore the case where hardware does not record the physical error
address.


> 
> 
>> to software instead it can only provide virtual address in FAR_ELx, 
>> This is to say, firmware cannot provide physical error address, but provided 
>> the virtual
>> address in the FAR_ELx. so BIOS cannot record this address to APEI table. In
> 
> Nit: APEI table, you mean recorded as CPER records in a buffer pointed to by a
> GHES's ErrorStatusAddress. APEI tables aren't parsed post boot.
> 
> 
>> this case, when firmware Jump to hypervisor, hypervisor cannot call
>> memory_failure(), now only the physical address is recorded and valid, APEI
>> driver will call the memory_failure()), in this case, host will not send 
>> SIGBUS
>> to QEMU. So guest cannot know there is SEA happen.
>> At least there is such issue in Huawei's platform (cannot provide PA for RAS 
>> firmware-first,
>> only can provide VA in FAR_ELx)
> 
> This isn't a KVM problem.
> 
> It looks like both of UEFI's 'Table 275. Memory Error Record' and 'Table 276.
> Memory Error Record 2' require a physical address. You can't describe a memory
> error without one.
> 
> Is this really a memory error?, or some other component, say, a virtually
> indexed cache.

When an SEA happens, if the {D,I}FSC is 0b0101xx (SEA on translation table
walk or hardware update of translation table), it means the page table itself
is faulty, not the target address.
In this case, even if firmware can report the physical address of the bad page
table, memory_failure() cannot recognize this address because the page table
does not belong to any task, including Qemu, so memory_failure() will not
deliver SIGBUS. Of course, this is still a memory address.
I once ran an experiment: if an application's page table itself generates an
SEA, memory_failure() treats it as an unknown issue. Please see the log below;
I think this should be a common issue.
So in the KVM code, I plan to handle the page table error separately if the
{D,I}FSC is 0b0101xx and not call memory_failure(). What do you think about
that? Only an SEA on a memory access would call memory_failure().
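
For illustration only, the {D,I}FSC distinction described above could be
checked roughly as follows. This is a minimal sketch, not KVM code: the helper
and macro names are hypothetical, and it only assumes the architectural layout
where the fault status code sits in ESR_ELx.ISS[5:0].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ESR_ELx.ISS[5:0] holds the DFSC/IFSC fault status code. */
#define FSC_MASK          0x3fu
#define FSC_SEA           0x10u /* 0b010000: SEA, not on a translation table walk */
#define FSC_SEA_TTW       0x14u /* 0b0101xx: SEA on a translation table walk */
#define FSC_SEA_TTW_MASK  0x3cu /* ignore the two low "level" bits */

/* Hypothetical helper: the abort was taken while walking the page table. */
static inline bool fsc_is_sea_on_ttw(uint64_t esr)
{
    return ((esr & FSC_MASK) & FSC_SEA_TTW_MASK) == FSC_SEA_TTW;
}

/* Hypothetical helper: the abort was taken on the accessed location itself. */
static inline bool fsc_is_sea_on_access(uint64_t esr)
{
    return (esr & FSC_MASK) == FSC_SEA;
}
```

Whether a caller should skip memory_failure() when fsc_is_sea_on_ttw() is true
is exactly the open question in this mail; the sketch only shows how the two
cases can be told apart.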

[   25.482904] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 7
[   25.484862] {1}[Hardware Error]: event severity: recoverable
[   25.486192] {1}[Hardware Error]:  Error 0, type: recoverable
[   25.487519] {1}[Hardware Error]:   section_type: memory error
[   25.490169] {1}[Hardware Error]:   physical_address: 0x7ce81000
[   25.491718] {1}[Hardware Error]:   error_type: 3, multi-bit ECC
[   25.501178] Memory failure: 0x7ce81: Unknown page state
[   25.501181] Memory failure: 0x7ce81: unknown page still referenced by 1 users
[   25.501183] Memory failure: 0x7ce81: recovery action for unknown page: Failed
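
The 0x7ce81 in the "Memory failure" lines is the page frame number of the
physical_address reported in the record above. A small sketch of that
relation, assuming 4 KiB pages (the page size is not stated in the log) and
using a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12 /* assumption: 4 KiB pages */

/* Hypothetical helper: physical address to page frame number. */
static inline uint64_t phys_to_pfn(uint64_t phys_addr)
{
    return phys_addr >> PAGE_SHIFT_4K;
}
```

With these assumptions, phys_to_pfn(0x7ce81000) gives 0x7ce81, matching the
log.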



> 
> 
[cut]
> 
> 
> 
>> 3. For SEI, the address is invalid, 
> 
> You mean FAR_ELx?

I mean the physical address. Because SEI is asynchronous, firmware usually
does not record this address. If the address is not recorded,
memory_failure() will not be called, so SIGBUS will not be sent and the guest
will not know that an SEI happened. For this case, maybe we should also
inject a virtual SError, to cover the case where the physical address is not
recorded.


> 
> 
>> so in some platform, firmware will not record this AP.
> 
> For any RAS error you should get a physical address in ERRADDR.
What if the address is not accurate?
For SEI, even if we can get a physical address from ERRADDR, this address is
not accurate, so firmware will mark it as invalid or not record it.


> 
> 
> Thanks,
> 
> James
> 
> [0] https://lkml.org/lkml/2017/8/7/612
> 
> .
> 

___
kvmarm mailing list
kvmarm@lists.cs.co

Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-10-25 Thread James Morse
Hi gengdongjiu,

On 20/10/17 16:33, gengdongjiu wrote:
> As we discuss below solution:
> When guest happen SEA/SEI, KVM calls memory_failure() to send an asynchronous 
> SIGBUS
> signal(BUS_MCEERR_AO) to QEMU, and make this address to poisoned.
> after QEMU receive this BUS_MCEERR_AO, it will record this address to CPER 
> and notify guest.
> When guest happen stage2 page fault, KVM send a synchronous SIGBUS
> BUS_MCEERR_AR to QEMU, and QEMU also record CPER and immediately inject SEA 
> abort.
> 
> But this solution, still have some problems.
> 
> 1. In some situation, For RAS, when happen SEA, hardware cannot provide an 
> error physical
> address

Eh? For any RAS error you should get a physical address in ERRADDR.

When you get an external abort due to RAS you can scan these nodes to find which
one generated the error and collect the component information.
Doing this in firmware is better because firmware knows the SoC topology, so it
can skip the nodes it knows won't be relevant to an error on this CPU.


> to software instead it can only provide virtual address in FAR_ELx, 
> This is to say, firmware cannot provide physical error address, but provided 
> the virtual
> address in the FAR_ELx. so BIOS cannot record this address to APEI table. In

Nit: APEI table, you mean recorded as CPER records in a buffer pointed to by a
GHES's ErrorStatusAddress. APEI tables aren't parsed post boot.


> this case, when firmware Jump to hypervisor, hypervisor cannot call
> memory_failure(), now only the physical address is recorded and valid, APEI
> driver will call the memory_failure()), in this case, host will not send 
> SIGBUS
> to QEMU. So guest cannot know there is SEA happen.
> At least there is such issue in Huawei's platform (cannot provide PA for RAS 
> firmware-first,
> only can provide VA in FAR_ELx)

This isn't a KVM problem.

It looks like both of UEFI's 'Table 275. Memory Error Record' and 'Table 276.
Memory Error Record 2' require a physical address. You can't describe a memory
error without one.

Is this really a memory error?, or some other component, say, a virtually
indexed cache.


> 2. if there is SEA/SEI, only deliver SIGBUS to notify QEMU. This information 
> is limit.
> This SIGBUS can only provide an address and si_code(BUS_MCEERR_AO/ 
> BUS_MCEERR_AR), nothing else.
> if QEMU record CPER and inject SEA/specify ESR, it may needs to know more 
> information.
> For example, if it injects SEA, it needs so setup many registers for guest, 
> such as 
> FAR_EL1. If sets it, it needs to know FAR_EL2.

Linux is given CPER records describing a memory error using NOTIFY_IRQ. It
delivers BUS_MCEERR_AO to Qemu. What value does FAR_EL2 have?

Even for 'AR' from KVM, we already know Marc is against exposing EL2 registers. 
[0]


> But QEMU cannot know this information to setup it if KVM cannot pass more 
> fault info to QEMU.
> Of cause, we can identify the guest FAR_El1 register to invalid. But some 
> time, guest needs to
> know it in the situation that host cannot provide the PA.

When is this? You can get the IPA from the si_addr and Qemu's memory layout. The
IPA goes in the CPER records allowing you to emulate firmware first.
The IPA goes in the ERRADDR register (once we have emulation support for it),
covering the kernel first case.

What's left? Neither: You can generate an external abort using DFSC=0b01 and
set FnV to tell it you don't have a virtual stage1 address.


A better argument is that user-space needs to know if BUS_MCEERR_AR triggered by
KVM's stage2 was an instruction or data abort so it can make this the correct
flavour of Synchronous External Abort.

I agree this bit needs exposing (but only for exits due to BUS_MCEERR_AR
triggered by KVM at stage2), and maybe KVM could include a little more
information to allow the full range of external-abort ESRs to be used.
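
As a rough sketch of the two points above (an external abort with FnV set, and
choosing the instruction or data flavour), user-space could build the guest
syndrome along these lines. The function name is hypothetical and this is not
Qemu or KVM code. It assumes the guest is AArch64, the abort is delivered to
the guest's EL1, the faulting instruction was 32-bit (IL set), and the fault
status code is 0b010000 (synchronous external abort, not on a translation
table walk).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ESR_EC_SHIFT     26
#define ESR_EC_IABT_LOW  0x20u      /* instruction abort from a lower EL */
#define ESR_EC_IABT_CUR  0x21u      /* instruction abort from the current EL */
#define ESR_EC_DABT_LOW  0x24u      /* data abort from a lower EL */
#define ESR_EC_DABT_CUR  0x25u      /* data abort from the current EL */
#define ESR_IL           (1u << 25) /* assumption: 32-bit instruction */
#define ESR_FNV          (1u << 10) /* FAR not valid */
#define ESR_FSC_SEA      0x10u      /* synchronous external abort */

/* Hypothetical helper: syndrome for an SEA injected into the guest's EL1. */
static inline uint32_t make_guest_sea_esr(bool is_iabt, bool from_el0,
                                          bool far_valid)
{
    uint32_t ec;

    if (is_iabt)
        ec = from_el0 ? ESR_EC_IABT_LOW : ESR_EC_IABT_CUR;
    else
        ec = from_el0 ? ESR_EC_DABT_LOW : ESR_EC_DABT_CUR;

    uint32_t esr = (ec << ESR_EC_SHIFT) | ESR_IL | ESR_FSC_SEA;

    /* Without a virtual stage1 address, tell the guest FAR_EL1 is not valid. */
    if (!far_valid)
        esr |= ESR_FNV;

    return esr;
}
```

The is_iabt argument is the piece of information user-space cannot derive from
si_code alone, which is why it would have to be exposed by KVM.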



> 3. For SEI, the address is invalid, 

You mean FAR_ELx?


> so in some platform, firmware will not record this AP.

For any RAS error you should get a physical address in ERRADDR.


> At least in HUAWEI's platform, firmware will not record it. we cannot always
> think that all platform can record PA for RAS, sometime it may use
> VA(in FAR_ELx).

What component do you see this happen with?


> For SEI, if the address is not recorded, then the
> memory_failure() will be not called. So guest will not know it happens SEI.


Thanks,

James

[0] https://lkml.org/lkml/2017/8/7/612
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-10-20 Thread gengdongjiu
CC James.

> > In the user space, we can check the si_code, if it is 
> > "BUS_MCEERR_AR", we use SEA notification type for the guest; if it is 
> > "BUS_MCEERR_AO", we use SEI notification type for the guest.
> > Because there are only two values for si_code("BUS_MCEERR_AR" and 
> > BUS_MCEERR_AO), in which case we can use the GSIV(IRQ)
> notification type?
> 
> This is for Qemu/kvmtool to decide, it depends on what sort of machine they 
> are emulating.
> 
> For example, the physical machine's memory-controller may notify the 
> CPU about memory errors by triggering SError trapped to EL3, or with a 
> dedicated FIQ, also routed to EL3. By the time this gets to the host kernel 
> the distinction doesn't matter. The host has handled the error.
> 
> For a guest, your memory-controller is effectively the host kernel. It 
> will give you an BUS_MCEERR_AO signal for any guest memory that is affected, 
> and a BUS_MCEERR_AR if the guest directly accesses a page of affected memory.
> 
> What Qemu/kvmtool do with this is up to them. If they're emulating a machine 
> with no RAS features, printing an error and exit.
> 
> Otherwise BUS_MCEERR_AR could be notified as one of the flavours of 
> IRQ, unless the affected vcpu has interrupts masked, in which case an SEA 
> notification gives you some NMI-like behaviour.
> 
> For BUS_MCEERR_AO you could use SEI, IRQ or polled notification. My 
> choice would be IRQ, as you can't know if the guest supports SEI and 
> it would be a shame to kill it with an SError if the affected memory was 
> free. SEA for synchronous errors is still a good choice even if the guest 
> doesn't support it as that memory is still gone so its still a valid 
> guest:Synchronous-external-abort.
> 

Add James.

CC some huawei's hardware engineers.

Hi James/Marc/Christoffer,

  As we discussed, the solution is as follows:
When an SEA/SEI happens in the guest, KVM calls memory_failure() to send an
asynchronous SIGBUS signal (BUS_MCEERR_AO) to QEMU and marks the address as
poisoned. After QEMU receives this BUS_MCEERR_AO, it records the address in a
CPER record and notifies the guest. When the guest takes a stage2 page fault,
KVM sends a synchronous SIGBUS (BUS_MCEERR_AR) to QEMU, and QEMU also records
a CPER and immediately injects an SEA.
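
For illustration, the si_code based choice discussed in this thread could look
roughly like the sketch below on the VMM side. The enum and function names are
hypothetical, not Qemu or kvmtool APIs. BUS_MCEERR_AR always maps to an SEA
notification, while the notification used for BUS_MCEERR_AO is left as a
policy argument, since the thread suggests SEI, IRQ or polling depending on
the machine being emulated.

```c
#include <assert.h>
#include <signal.h>

/* Linux values, used only if the C library headers did not provide them. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical notification types a VMM could offer to the guest. */
enum guest_notify {
    NOTIFY_NONE,
    NOTIFY_SEA,
    NOTIFY_SEI,
    NOTIFY_IRQ,
};

/* Hypothetical helper: map a SIGBUS si_code to a guest notification. */
static inline enum guest_notify pick_notification(int si_code,
                                                  enum guest_notify ao_policy)
{
    switch (si_code) {
    case BUS_MCEERR_AR:
        /* Action required: the vcpu touched the poisoned page. */
        return NOTIFY_SEA;
    case BUS_MCEERR_AO:
        /* Action optional: machine-dependent choice (SEI, IRQ, ...). */
        return ao_policy;
    default:
        /* Not a memory-error SIGBUS. */
        return NOTIFY_NONE;
    }
}
```

A real SIGBUS handler would call this with info->si_code and use
info->si_addr to find the affected guest address.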

But this solution still has some problems.

1. In some situations, for RAS, when an SEA happens, hardware cannot provide
the physical address of the error to software; it can only provide a virtual
address in FAR_ELx. That is to say, firmware cannot provide the physical error
address, only the virtual address in FAR_ELx,
so the BIOS cannot record this address in the APEI table. In this case, when
firmware jumps to the hypervisor, the hypervisor cannot call memory_failure()
(currently the APEI driver calls memory_failure() only when the physical
address is recorded and valid), so the host will not send SIGBUS to QEMU, and
the guest cannot know that an SEA happened.
At least Huawei's platform has this issue (it cannot provide a PA for RAS
firmware-first, only a VA in FAR_ELx).

2. If there is an SEA/SEI, only a SIGBUS is delivered to notify QEMU. This
information is limited.
 This SIGBUS can only provide an address and si_code (BUS_MCEERR_AO/
BUS_MCEERR_AR), nothing else.
 If QEMU records a CPER and injects an SEA or specifies an ESR, it may need
more information.
For example, if it injects an SEA, it needs to set up many registers for the
guest, such as FAR_EL1. To set it, it needs to know FAR_EL2.
 But QEMU cannot get this information unless KVM passes more fault info to
QEMU.
 Of course, we can mark the guest FAR_EL1 register as invalid. But sometimes
the guest needs to know it, in the situation where the host cannot provide the
PA.

3. For SEI, the address is invalid, so on some platforms firmware will not
record this PA. At least on HUAWEI's platform, firmware will not record it.
  We cannot always assume that all platforms can record a PA for RAS; sometimes
they may use a VA (in FAR_ELx).
  For SEI, if the address is not recorded, memory_failure() will not be
called, so the guest will not know that an SEI happened.



Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-10-19 Thread gengdongjiu


On 2017/10/7 1:31, James Morse wrote:
> Hi gengdongjiu,
> 
> On 27/09/17 12:07, gengdongjiu wrote:
>> On 2017/9/23 0:51, James Morse wrote:
>>> If this wasn't a firmware-first notification, then you're right KVM hands 
>>> the
>>> guest an asynchronous external abort. This could be considered a bug in 
>>> KVM. (we
>>> can discuss with Marc and Christoffer what it should do), but:
>>>
>>> I'm not sure what scenario you could see this in: surely all your
>>> CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first
>>> notifications. So they should always be claimed by APEI.
> 
>> Yes, if it is firmware-first we should not exist such issue.
> 
> [...]
> 
>>> What you may be seeing is some awkwardness with the change in the SError ESR
>>> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
>>> were all impdef so it didn't matter).
>>> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this
>>> means 'classified as a RAS error ... unknown!'.
> 
>>> I have a patch in the upcoming SError/RAS series that changes KVMs 
>>> virtual-abort
>>> code to specify an impdef ESR for this path.
> 
> https://www.spinics.net/lists/arm-kernel/msg609589.html
> 
> 
>> Before I remember Marc and you suggest me specify the an impdef ESR (set the 
>> vsesr_el2) in
>> the userspace,
> 
> If Qemu/kvmtool wants to emulate a machine that notifies the guest about
> firmware-first RAS Errors using SError/SEI, it needs to decide when to send
> these SError and what ESR to specify.
yes, it is. I agree.

> 
> 
>> now do you want to specify an impdef ESR in KVM instead of usrspace?
> 
> No, that patch is just trying to fixup the existing cases where KVM already
> injects an SError. I'm just trying to keep the behaviour the-same:
> 
> Before the RAS Extensions the guest would always get an impdef SError ESR.
> After the RAS Extensions KVM has to pick an ESR, as otherwise the guest gets 
> the
> hardware-reset value of VSESR_EL2. On the FVP this is all-zeros, which is a 
> RAS
> encoding. That patch changes it to still be an impdef SError ESR.
> 
> 
>> if setting the vsesr_el2 in the KVM, whether user-space need to specify?
> 
> I think we need a KVM CAP API to do this. With that patch it can be wired into
> pend_guest_serror(), which takes the ESR to make pending if the CPU supports 
> it.

For this CAP API, I have written a patch in the new patch series.
> 
> It's not clear to me whether its useful for user-space to make an SError 
> pending
> even if it can't set the ESR
Why can it not set the ESR?
In KVM, we can return encoded fault information to userspace; userspace can
then specify its own ESR for the guest.
I have already written a patch for it.


> 
> [...]
> 
>>> Because the SEI notification depends on v8.2 I'd like to get the SError/RAS
>>> series posted (currently re-testing), then I'll pick up enough of the 
>>> patches
>>> you've posted for a consolidated version of the series, and we can take the
>>> discussion from there.
> 
>> James, it is great, we can make a consolidated version of the series.
> 
> We need to get some of the wider issues fixed first, the big-ugly one is
> memory_failure_queue() not being NMI safe. (this isn't a problem for SEA as 
> this
> would only become re-entrant if the kernel text was corrupt). It is a problem
> for SEI and SDEI.
 For memory_failure_queue(), I think the big problem is that it runs in
process context, not in the error handling context.
There are two contexts, and the memory_failure_queue() work is scheduled later
than the error handling.


> 
> 
>>> I'd still like to know what your firmware does if the normal-world believes 
>>> its
>>> masked physical-SError and you want to hand it an SEI notification.
> 
> Aha, thanks for this:
> 
>> firstly the physical-SError that happened in the EL2/EL1/EL1 can not be 
>> masked if SCR_EL3.EA is set.
> 
> Masked for the CPU because the CPU can deliver the SError to EL3.
> 
> What about software? Code at EL1 and EL2 each have a PSTATE.A bit they may 
> have
> set. HCR_EL2.VSE respects EL1's PSTATE.A ... the question is does your 
> firmware
> respect the PSTATE.A value of the exception level that SError are routed to?

Before routing to the target EL, software sets SPSR_EL3.A to 1 and then
executes "eret", so PSTATE.A will be 1.

Note:
PSTATE.A is shared by the different ELs; in hardware it is one register, not
several.
SPSR_ELx has several registers: spsr_el1, spsr_el2, spsr_el3.


> 
> 
>> when trap to EL3, firmware will record the error to APEI CPER from reading 
>> ERR* RAS registers.
>>
>> (1) if HCR_EL2.TEA is set to 1, exception come from EL0, El1. firmware knows 
>> this
> 
> HCR_EL2.TEA covers synchronous-external-aborts. For SError you need to check
> HCR_EL2.AMO. Some crazy hypervisor may set one and not the other.
Sorry, that was a typo; it should check HCR_EL2.AMO.
> 
> 
>> SError come from guest OS, copy the elr_el3 to elr_el2, copy ESR_El3 to
>> ESR_EL2.
> 
> The EC value in the ELR des

Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-10-06 Thread James Morse
Hi gengdongjiu,

On 27/09/17 12:07, gengdongjiu wrote:
> On 2017/9/23 0:51, James Morse wrote:
>> If this wasn't a firmware-first notification, then you're right KVM hands the
>> guest an asynchronous external abort. This could be considered a bug in KVM. 
>> (we
>> can discuss with Marc and Christoffer what it should do), but:
>>
>> I'm not sure what scenario you could see this in: surely all your
>> CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first
>> notifications. So they should always be claimed by APEI.

> Yes, if it is firmware-first we should not exist such issue.

[...]

>> What you may be seeing is some awkwardness with the change in the SError ESR
>> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
>> were all impdef so it didn't matter).
>> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this
>> means 'classified as a RAS error ... unknown!'.

>> I have a patch in the upcoming SError/RAS series that changes KVMs 
>> virtual-abort
>> code to specify an impdef ESR for this path.

https://www.spinics.net/lists/arm-kernel/msg609589.html


> Before I remember Marc and you suggest me specify the an impdef ESR (set the 
> vsesr_el2) in
> the userspace,

If Qemu/kvmtool wants to emulate a machine that notifies the guest about
firmware-first RAS Errors using SError/SEI, it needs to decide when to send
these SError and what ESR to specify.


> now do you want to specify an impdef ESR in KVM instead of usrspace?

No, that patch is just trying to fixup the existing cases where KVM already
injects an SError. I'm just trying to keep the behaviour the-same:

Before the RAS Extensions the guest would always get an impdef SError ESR.
After the RAS Extensions KVM has to pick an ESR, as otherwise the guest gets the
hardware-reset value of VSESR_EL2. On the FVP this is all-zeros, which is a RAS
encoding. That patch changes it to still be an impdef SError ESR.
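
To make the all-zeros problem concrete, here is a small sketch of the syndrome
a guest would observe for a virtual SError. The helper names are hypothetical
and this is not the patch referred to above. It assumes the architectural
SError encoding: EC 0x2f, IL set, and bit 24 (IDS) marking the ISS as
implementation defined, with VSESR_EL2 supplying IDS and ISS[23:0].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ESR_EC_SHIFT    26
#define ESR_EC_SERROR   0x2fu       /* SError interrupt */
#define ESR_IL          (1u << 25)
#define ESR_SERROR_IDS  (1u << 24)  /* ISS is implementation defined */
#define VSESR_MASK      0x01ffffffu /* IDS plus ISS[23:0] */

/* Hypothetical helper: a VSESR_EL2 value requesting an impdef SError. */
static inline uint32_t vsesr_for_impdef_serror(void)
{
    return ESR_SERROR_IDS;
}

/* Hypothetical helper: ESR the guest sees for a given VSESR_EL2 value. */
static inline uint32_t guest_serror_esr(uint32_t vsesr)
{
    return (ESR_EC_SERROR << ESR_EC_SHIFT) | ESR_IL | (vsesr & VSESR_MASK);
}

/* True if the syndrome is implementation defined rather than a RAS encoding. */
static inline bool serror_esr_is_impdef(uint32_t esr)
{
    return (esr & ESR_SERROR_IDS) != 0;
}
```

With an all-zeros VSESR_EL2 the guest would see 0xbe000000, which it decodes as
an architected (RAS) syndrome; with IDS set it sees 0xbf000000, an impdef
SError as before the RAS Extensions.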


> if setting the vsesr_el2 in the KVM, whether user-space need to specify?

I think we need a KVM CAP API to do this. With that patch it can be wired into
pend_guest_serror(), which takes the ESR to make pending if the CPU supports it.

It's not clear to me whether it's useful for user-space to make an SError
pending even if it can't set the ESR.

[...]

>> Because the SEI notification depends on v8.2 I'd like to get the SError/RAS
>> series posted (currently re-testing), then I'll pick up enough of the patches
>> you've posted for a consolidated version of the series, and we can take the
>> discussion from there.

> James, it is great, we can make a consolidated version of the series.

We need to get some of the wider issues fixed first, the big-ugly one is
memory_failure_queue() not being NMI safe. (this isn't a problem for SEA as this
would only become re-entrant if the kernel text was corrupt). It is a problem
for SEI and SDEI.


>> I'd still like to know what your firmware does if the normal-world believes 
>> its
>> masked physical-SError and you want to hand it an SEI notification.

Aha, thanks for this:

> firstly the physical-SError that happened in the EL2/EL1/EL1 can not be 
> masked if SCR_EL3.EA is set.

Masked for the CPU because the CPU can deliver the SError to EL3.

What about software? Code at EL1 and EL2 each has a PSTATE.A bit it may have
set. HCR_EL2.VSE respects EL1's PSTATE.A ... the question is: does your firmware
respect the PSTATE.A value of the exception level that SErrors are routed to?


> when trap to EL3, firmware will record the error to APEI CPER from reading 
> ERR* RAS registers.
> 
> (1) if HCR_EL2.TEA is set to 1, exception come from EL0, El1. firmware knows 
> this

HCR_EL2.TEA covers synchronous-external-aborts. For SError you need to check
HCR_EL2.AMO. Some crazy hypervisor may set one and not the other.


> SError come from guest OS, copy the elr_el3 to elr_el2, copy ESR_El3 to
> ESR_EL2.

The EC value in the ESR describes current/lower exception level, you need to
re-encode this for EL2 if the exception came from EL2.
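
A minimal sketch of that re-encoding, assuming only the instruction-abort and
data-abort exception classes need translating (EC 0x20/0x21 and 0x24/0x25).
The function name and the `from_el2` flag are hypothetical; real firmware would
derive the originating exception level from SPSR_EL3.

```c
#include <assert.h>
#include <stdint.h>

#define ESR_EC_SHIFT     26
#define ESR_EC_MASK      (UINT64_C(0x3f) << ESR_EC_SHIFT)

#define ESR_EC_IABT_LOW  UINT64_C(0x20)  /* instruction abort, lower EL */
#define ESR_EC_IABT_CUR  UINT64_C(0x21)  /* instruction abort, same EL */
#define ESR_EC_DABT_LOW  UINT64_C(0x24)  /* data abort, lower EL */
#define ESR_EC_DABT_CUR  UINT64_C(0x25)  /* data abort, same EL */

/*
 * Build the ESR_EL2 value from the syndrome EL3 saw. An abort taken from
 * EL2 to EL3 is "from a lower EL" at EL3, but "from the same EL" once it
 * is handed back to EL2. Other exception classes are copied unchanged.
 */
uint64_t esr_el3_to_esr_el2(uint64_t esr_el3, int from_el2)
{
	uint64_t ec = (esr_el3 & ESR_EC_MASK) >> ESR_EC_SHIFT;

	if (from_el2) {
		if (ec == ESR_EC_IABT_LOW)
			ec = ESR_EC_IABT_CUR;
		else if (ec == ESR_EC_DABT_LOW)
			ec = ESR_EC_DABT_CUR;
	}

	return (esr_el3 & ~ESR_EC_MASK) | (ec << ESR_EC_SHIFT);
}
```

The SError exception class (0x2f) has no lower/same split, so it passes
through untouched; the translation only matters for the synchronous aborts.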


> if the SError exception come from guest EL0 or EL1, set ELR_EL3 with 
> VBAR_EL2 + 0x580(one EL2 SEI entry point),
> 
> execute "ERET", then jump to EL2 hypervisor.
>
> (2) if the SError exception come from EL2 hypervisor, copy the elr_el3 to elr_el2, 
> copy ESR_EL3 to ESR_EL2,
> set ELR_EL3 with VBAR_EL2+0x380(one EL2 SEI entry point),
> 
> execute "ERET", then jump to EL2 hypervisor.

This SError came from EL2. You _must_ check SPSR_EL3.A is clear before returning
to the EL2 SError vector.

EL2 believes it has masked SError, it does this because it can't handle one
right now. If your firmware jumps in anyway - it's game over.

We mask SError in entry.S when we take an exception and when we return from an
exception. This is so that we can read/write the ELR/SPSR without them changing
under our feet. If your firmware overwrites these values - we've lost them, and
can never return to the context we interrupted.
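
The checks discussed above can be put together in a small sketch. It assumes
the interrupted context was AArch64, uses the architectural bit positions for
SPSR_EL3.A (bit 8), SPSR_EL3.M, HCR_EL2.AMO (bit 5) and HCR_EL2.TGE (bit 27),
and returns the offset from VBAR_EL2 that firmware would use. The function
name and the -1 "do not deliver now" convention are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SPSR_M_SP_SEL     (UINT64_C(1) << 0)   /* M[0]: 1 = SP_ELx, 0 = SP_EL0 */
#define SPSR_M_EL_SHIFT   2                    /* M[3:2]: exception level */
#define SPSR_M_EL_MASK    UINT64_C(0x3)
#define SPSR_A            (UINT64_C(1) << 8)   /* SError mask bit */

#define HCR_EL2_AMO       (UINT64_C(1) << 5)
#define HCR_EL2_TGE       (UINT64_C(1) << 27)

#define VEC_SERROR_CUR_SP0    0x180
#define VEC_SERROR_CUR_SPX    0x380
#define VEC_SERROR_LOWER_A64  0x580

#define SERROR_DO_NOT_DELIVER (-1)

/*
 * Pick the EL2 SError vector for an SError that was taken to EL3, or
 * return SERROR_DO_NOT_DELIVER when EL2 must not be entered right now.
 */
int el2_serror_vector_offset(uint64_t spsr_el3, uint64_t hcr_el2)
{
	unsigned int el =
		(unsigned int)((spsr_el3 >> SPSR_M_EL_SHIFT) & SPSR_M_EL_MASK);

	if (el == 2) {
		/* EL2 had SError masked: it cannot take one at this point. */
		if (spsr_el3 & SPSR_A)
			return SERROR_DO_NOT_DELIVER;
		return (spsr_el3 & SPSR_M_SP_SEL) ? VEC_SERROR_CUR_SPX
						  : VEC_SERROR_CUR_SP0;
	}

	if (el <= 1) {
		/* SError from EL0/EL1 only belongs to EL2 if it is routed there. */
		if (!(hcr_el2 & (HCR_EL2_AMO | HCR_EL2_TGE)))
			return SERROR_DO_NOT_DELIVER;
		return VEC_SERROR_LOWER_A64;
	}

	/* Interrupted EL3 itself: nothing to hand to EL2. */
	return SERROR_DO_NOT_DELIVER;
}
```

What firmware does in the "do not deliver" case (for example, leaving the
error pending until the mask is cleared) is a separate design question that
this sketch does not try to answer.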


>

Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-09-27 Thread gengdongjiu
>> What you may be seeing is some awkwardness with the change in the SError ESR
>> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
>> were all impdef so it didn't matter).
>> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this
>> means 'classified as a RAS error ... unknown!'.
>>
>> I have a patch in the upcoming SError/RAS series that changes KVMs 
>> virtual-abort
>> code to specify an impdef ESR for this path.
> Before I remember Marc and you suggest me specify the an impdef ESR (set the 
> vsesr_el2) in
> the userspace,

Here are Marc's proposal and your suggestion to have user space set VSESR_EL2
(that is, specify the virtual SError syndrome):
https://lkml.org/lkml/2017/3/20/441
https://lkml.org/lkml/2017/3/20/516


> now do you want to specify an impdef ESR in KVM instead of usrspace?
> if setting the vsesr_el2 in the KVM, whether user-space need to specify?
> May be we can combine the patches that specify an impdef ESR(set vsesr_el2) 
> patch to one.
>
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-09-27 Thread gengdongjiu
Hi James,
   Sorry for my late response, and thank you very much for your comments.

On 2017/9/23 0:51, James Morse wrote:
[...]
>>
>> CC Achin
>>
>> I have some personal opinion, if you think it is not right, hope you can 
>> point out.
>>
>> Synchronous External Abort and SError Interrupt are hardware 
>> exception(hardware concept),
>> which is independent of software notification,
>> in armv8 without RAS, the two concepts already exist. In the APEI spec, in 
>> order to
>> better describe the two exceptions, so use SEA and SEI notification to stand
>> for them.
> 
>> SEA notification stands for Synchronous External Abort, so may be it is not 
>> only a
>> notification, it also stands for a hardware error type.
>> SEI notification stands for SError Interrupt, so may be it is not only a 
>> notification,
>> it also stands for a hardware error type.
> 
>> In the OS, it has different handling flow to the two exception(two 
>> notification):
>> when the guest OS running, if the hardware generates a Synchronous External 
>> Abort, we
>> told the guest OS this error is SError Interrupt instead of Synchronous
>> External Abort.
> 
> This should only happen when APEI doesn't claim the external-abort as a RAS
> notification. If there were CPER records to process then the error is handled 
> by
> the host, and we can return to the guest.

Having considered this again, I think you are right.
In the firmware-first solution, firmware handles all kinds of errors first and
records them in the CPER buffer.


> 
> If this wasn't a firmware-first notification, then you're right KVM hands the
> guest an asynchronous external abort. This could be considered a bug in KVM. 
> (we
> can discuss with Marc and Christoffer what it should do), but:
> 
> I'm not sure what scenario you could see this in: surely all your
> CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first
> notifications. So they should always be claimed by APEI.

Yes, with firmware-first this issue should not exist.

> 
> 
>> guest OS uses SEI notification handling flow to deal with it, I am not sure 
>> whether it
>> will have problem, because the true hardware exception is Synchronous 
>> External
>> Abort, but software treats it as SError interrupt to handle.
> 
> Once you're into a guest the original 'true hardware exception' shouldn't
> matter. In this scenario KVM has handed the guest an SError, our question is 
> 'is
> it an SEI notification?':
> 
> For firmware first the guest OS should poke around in the CPER buffers, find
> nothing to do, and return to the arch code for (future) kernel-first.
> For kernel first the guest OS should trawl through the v8.2 ERR registers, 
> find
> nothing to do, and continue to the default case:
> 
> By default, we should panic on SError, unless its classified as a non-fatal 
> RAS
> error. (I'm tempted to pr_warn_once() if we get RAS notifications but there is
> no work to do).

Understood, thanks.

> 
> 
> What you may be seeing is some awkwardness with the change in the SError ESR
> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
> were all impdef so it didn't matter).
> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this
> means 'classified as a RAS error ... unknown!'.
> 
> I have a patch in the upcoming SError/RAS series that changes KVMs 
> virtual-abort
> code to specify an impdef ESR for this path.
Previously, as I remember, Marc and you suggested that I specify an impdef ESR
(set the vsesr_el2) in userspace.
Do you now want to specify an impdef ESR in KVM instead of userspace?
If vsesr_el2 is set in KVM, does user-space still need to specify it?
Maybe we can combine the patches that specify an impdef ESR (set vsesr_el2)
into one.

> 
> 
>> In the mainline code, it does not have SEI notification support, the reason 
>> I 
>> think it is because of the error address record by firmware is not accurate
>> (SError Interrupt is asynchronous exception).
> 
> Yes, while we don't expect a FAR with an SError, but we do expect a valid
> representation of the RAS error in either the CPER records or the v8.2. ERR
> registers (or both). If we have neither of those, its not a RAS error and we
> should panic.
> 
> 
>> so if treat a hardware Synchronous External Abort as SError interrupt(SEI). 
>> The default OS behavior for SEI is PANIC, that is to say, when hardware 
>> triggers
>> a Synchronous External Abort(SEA), if guest treat it as SError 
>> interrupt(SEI),
>> the OS will be panic. in fact, it can be recoverable instead of Panic.
> 
> If its a RAS error APEI (or in the future, the kernel-first handler), should
> claim the error, so the guest never sees it. If you are hitting this behaviour
> in KVM, then it wasn't a RAS error.
> 
> 
>> I ever added a patch to support the SEI notification, but not sure whether
>> it is can be accepted by open source, until now, not receive response.
> 
> The patch you posted during the merge window made no sense on its own

Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-09-22 Thread James Morse
Hi gengdongjiu,

On 21/09/17 08:55, gengdongjiu wrote:
> On 2017/9/14 21:00, James Morse wrote:
>> user-space can choose whether to use SEA or SEI, it doesn't have to choose 
>> the
>> same notification type that firmware used, which in turn doesn't have to be 
>> the
>> same as that used by the CPU to notify firmware.
>>
>> The choice only matters because these notifications hang on an existing 
>> pieces
>> of the Arm-architecture, so the notification can only add to the 
>> architecturally
>> defined meaning. (i.e. You can only send an SEA for something that can 
>> already
>> be described as a synchronous external abort).
>>
>> Once we get to user-space, for memory_failure() notifications, (which so far 
>> is
>> all we are talking about here), the only thing that could matter is whether 
>> the
>> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
>> Synchronous-External-Abort.
>>
>> The Synchronous-External-Abort/SError-Interrupt distinction matters for the 
>> CPU
>> because it can't always make an error synchronous. For memory_failure()
>> notifications to a KVM guest we really can do this, and we already have this
>> behaviour for free. An example:
>>
>> A guest touches some hardware:poisoned memory, for whatever reason the CPU 
>> can't
>> put the world back together to make this a synchronous exception, so it 
>> reports
>> it to firmware as an SError-interrupt.
>> Linux gets an APEI notification and memory_failure() causes the affected 
>> page to
>> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.
>>
>> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. 
>> AO->
>> action optional, probably asynchronous.
>>
>> But in our example it wasn't really asynchronous, that was just a property of
>> the original CPU->firmware notification. What happens? The guest vcpu is 
>> re-run,
>> it re-runs the same instructions (this was a contained error so KVM's ELR 
>> points
>> at/before the instruction that steps in the problem). This time KVM takes a
>> stage2 fault, which the mm code will refuse to fixup because the relevant 
>> page
>> was marked as PG_hwpoison by memory_failure(). KVM signals Qemu/kvmtool with
>> SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA.
> 
> CC Achin
> 
> I have some personal opinion, if you think it is not right, hope you can 
> point out.
> 
> Synchronous External Abort and SError Interrupt are hardware 
> exception(hardware concept),
> which is independent of software notification,
> in armv8 without RAS, the two concepts already exist. In the APEI spec, in 
> order to
> better describe the two exceptions, so use SEA and SEI notification to stand
> for them.

> SEA notification stands for Synchronous External Abort, so may be it is not 
> only a
> notification, it also stands for a hardware error type.
> SEI notification stands for SError Interrupt, so may be it is not only a 
> notification,
> it also stands for a hardware error type.

> In the OS, it has different handling flow to the two exception(two 
> notification):
> when the guest OS running, if the hardware generates a Synchronous External 
> Abort, we
> told the guest OS this error is SError Interrupt instead of Synchronous
> External Abort.

This should only happen when APEI doesn't claim the external-abort as a RAS
notification. If there were CPER records to process then the error is handled by
the host, and we can return to the guest.

If this wasn't a firmware-first notification, then you're right KVM hands the
guest an asynchronous external abort. This could be considered a bug in KVM. (we
can discuss with Marc and Christoffer what it should do), but:

I'm not sure what scenario you could see this in: surely all your
CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first
notifications. So they should always be claimed by APEI.


> guest OS uses SEI notification handling flow to deal with it, I am not sure 
> whether it
> will have problem, because the true hardware exception is Synchronous External
> Abort, but software treats it as SError interrupt to handle.

Once you're into a guest the original 'true hardware exception' shouldn't
matter. In this scenario KVM has handed the guest an SError, our question is 'is
it an SEI notification?':

For firmware first the guest OS should poke around in the CPER buffers, find
nothing to do, and return to the arch code for (future) kernel-first.
For kernel first the guest OS should trawl through the v8.2 ERR registers, find
nothing to do, and continue to the default case:

By default, we should panic on SError, unless its classified as a non-fatal RAS
error. (I'm tempted to pr_warn_once() if we get RAS notifications but there is
no work to do).
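
One way to picture the default case described above: a sketch, assuming the
Armv8.2 SError ISS layout (IDS at bit 24, AET at bits [12:10], DFSC at bits
[5:0] with 0x11 meaning an asynchronous SError). Which AET classes count as
non-fatal is an assumption made for this example, not something this thread
settles, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SERROR_ISS_IDS         (UINT64_C(1) << 24)
#define SERROR_ISS_DFSC_MASK   UINT64_C(0x3f)
#define SERROR_ISS_DFSC_ASYNC  UINT64_C(0x11)   /* asynchronous SError */
#define SERROR_ISS_AET_SHIFT   10
#define SERROR_ISS_AET_MASK    UINT64_C(0x7)

enum serror_aet {
	AET_UC  = 0,   /* uncontainable */
	AET_UEU = 1,   /* unrecoverable */
	AET_UEO = 2,   /* restartable */
	AET_UER = 3,   /* recoverable */
	AET_CE  = 6,   /* corrected */
};

/*
 * Default policy once neither the CPER buffers nor the ERR registers had
 * any work for us: panic unless the syndrome is a RAS encoding that says
 * the CPU can keep making progress.
 */
bool serror_should_panic(uint64_t iss)
{
	uint64_t aet;

	/* Impdef syndromes carry no architected severity. */
	if (iss & SERROR_ISS_IDS)
		return true;

	/* Only DFSC == 0x11 gives a meaningful AET field. */
	if ((iss & SERROR_ISS_DFSC_MASK) != SERROR_ISS_DFSC_ASYNC)
		return true;

	aet = (iss >> SERROR_ISS_AET_SHIFT) & SERROR_ISS_AET_MASK;

	/* Assumption: corrected and restartable errors are survivable. */
	return !(aet == AET_CE || aet == AET_UEO);
}
```

Note that an all-zeros syndrome ends up in the panic path here, which matches
the concern that it is a poor default for an injected virtual SError.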


What you may be seeing is some awkwardness with the change in the SError ESR
with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
were all impdef so it didn't matter).
With VSESR_EL2 KVM has to spe

Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-09-22 Thread James Morse
Hi gengdongjiu,

On 18/09/17 14:36, gengdongjiu wrote:
> On 2017/9/14 21:00, James Morse wrote:
>> On 13/09/17 08:32, gengdongjiu wrote:
>>> On 2017/9/8 0:30, James Morse wrote:
>>>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>>> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by
>>>> an access or not.
>>
>> Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered 
>> via
>> some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is
>> x86's kernel-first handling, which nicely matches this 'direct access' 
>> problem.
>> BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). 
>> Powerpc
>> also triggers these directly, both from what look to be synchronous paths, 
>> so I
>> think its fair to equate BUS_MCEERR_AR to a synchronous access and 
>> BUS_MCEERR_AO
>> to something_else.
> 
> James, thanks for your explanation.
> can I understand that your meaning that "BUS_MCEERR_AR" stands for 
> synchronous access and BUS_MCEERR_AO stands for asynchronous access?

Not 'stands for', as the AR is Action-Required and AO Action-Optional. My point
was I can't find a case where Action-Required is used for an error that isn't
synchronous.

We should run this past the people who maintain the existing BUS_MCEERR_AR
users, in case it's just a severity to them.


> Then for "BUS_MCEERR_AO", how to distinguish it is asynchronous data 
> access(SError) and PCIE AER error?

How would userspace get one of these memory errors for a PCIe error?


> In the user space, we can check the si_code, if it is "BUS_MCEERR_AR", we use 
> SEA notification type for the guest;
> if it is "BUS_MCEERR_AO", we use SEI notification type for the guest.
> Because there are only two values for si_code("BUS_MCEERR_AR" and 
> BUS_MCEERR_AO), in which case we can use the GSIV(IRQ) notification type?

This is for Qemu/kvmtool to decide, it depends on what sort of machine they are
emulating.

For example, the physical machine's memory-controller may notify the CPU about
memory errors by triggering SError trapped to EL3, or with a dedicated FIQ, also
routed to EL3. By the time this gets to the host kernel the distinction doesn't
matter. The host has handled the error.

For a guest, your memory-controller is effectively the host kernel. It will give
you a BUS_MCEERR_AO signal for any guest memory that is affected, and a
BUS_MCEERR_AR if the guest directly accesses a page of affected memory.

What Qemu/kvmtool do with this is up to them. If they're emulating a machine
with no RAS features, they can print an error and exit.

Otherwise BUS_MCEERR_AR could be notified as one of the flavours of IRQ, unless
the affected vcpu has interrupts masked, in which case an SEA notification gives
you some NMI-like behaviour.

For BUS_MCEERR_AO you could use SEI, IRQ or polled notification. My choice would
be IRQ, as you can't know if the guest supports SEI and it would be a shame to
kill it with an SError if the affected memory was free. SEA for synchronous
errors is still a good choice even if the guest doesn't support it as that
memory is still gone so it's still a valid guest:Synchronous-external-abort.
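
The choices described above could be written down roughly as follows. This is
a sketch of one possible user-space policy, not Qemu or kvmtool code: the enum,
the function name and its two boolean inputs are hypothetical, and only the
`BUS_MCEERR_AR`/`BUS_MCEERR_AO` si_code values come from Linux.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Linux si_code values for SIGBUS; fallbacks in case the headers hide them. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical notification choices for an emulated machine. */
enum guest_notify {
	NOTIFY_UNHANDLED,    /* not a memory-error si_code */
	NOTIFY_REPORT_EXIT,  /* no RAS emulation: print an error and exit */
	NOTIFY_IRQ,          /* IRQ (or polled) notification */
	NOTIFY_SEA,          /* synchronous external abort notification */
};

enum guest_notify choose_notification(int si_code, bool machine_has_ras,
				      bool vcpu_irqs_masked)
{
	if (si_code != BUS_MCEERR_AR && si_code != BUS_MCEERR_AO)
		return NOTIFY_UNHANDLED;

	if (!machine_has_ras)
		return NOTIFY_REPORT_EXIT;

	/*
	 * Action required: the vcpu touched the page. An IRQ works unless the
	 * vcpu has interrupts masked, where SEA gives NMI-like behaviour.
	 */
	if (si_code == BUS_MCEERR_AR)
		return vcpu_irqs_masked ? NOTIFY_SEA : NOTIFY_IRQ;

	/*
	 * Action optional: prefer an IRQ over SEI, since the guest may not
	 * support SEI and the affected memory may not even be in use.
	 */
	return NOTIFY_IRQ;
}
```

A SIGBUS handler would pass `info->si_code` from its `siginfo_t` argument; how
the emulator learns whether the vcpu has interrupts masked is left open here.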


[...]

>>> 1. Let us firstly discuss the SEA and SEI, there are different workflow for 
>>> the two different Errors.

>> user-space can choose whether to use SEA or SEI, it doesn't have to choose 
>> the
>> same notification type that firmware used, which in turn doesn't have to be 
>> the
>> same as that used by the CPU to notify firmware.
>>
>> The choice only matters because these notifications hang on an existing 
>> pieces
>> of the Arm-architecture, so the notification can only add to the 
>> architecturally
>> defined meaning. (i.e. You can only send an SEA for something that can 
>> already
>> be described as a synchronous external abort).
>>
>> Once we get to user-space, for memory_failure() notifications, (which so far 
>> is
>> all we are talking about here), the only thing that could matter is whether 
>> the
>> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
>> Synchronous-External-Abort.
>>
>> The Synchronous-External-Abort/SError-Interrupt distinction matters for the 
>> CPU
>> because it can't always make an error synchronous. For memory_failure()
>> notifications to a KVM guest we really can do this, and we already have this
>> behaviour for free. An example:
>>
>> A guest touches some hardware:poisoned memory, for whatever reason the CPU 
>> can't
>> put the world back together to make this a synchronous exception, so it 
>> reports
>> it to firmware as an SError-interrupt.
> 
>> Linux gets an APEI notification and memory_failure() causes the affected 
>> page to
>> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.
>>
>> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. 
>> AO->
>> action optional, probably asynchronous.

> If so, in this case, Qemu/kvmtool only got a little information(receive a 
> SI

Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-09-21 Thread gengdongjiu
Hi James

On 2017/9/14 21:00, James Morse wrote:
> Hi gengdongjiu,

> user-space can choose whether to use SEA or SEI, it doesn't have to choose the
> same notification type that firmware used, which in turn doesn't have to be 
> the
> same as that used by the CPU to notify firmware.
> 
> The choice only matters because these notifications hang on an existing pieces
> of the Arm-architecture, so the notification can only add to the 
> architecturally
> defined meaning. (i.e. You can only send an SEA for something that can already
> be described as a synchronous external abort).
> 
> Once we get to user-space, for memory_failure() notifications, (which so far 
> is
> all we are talking about here), the only thing that could matter is whether 
> the
> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
> Synchronous-External-Abort.
> 
> The Synchronous-External-Abort/SError-Interrupt distinction matters for the 
> CPU
> because it can't always make an error synchronous. For memory_failure()
> notifications to a KVM guest we really can do this, and we already have this
> behaviour for free. An example:
> 
> A guest touches some hardware:poisoned memory, for whatever reason the CPU 
> can't
> put the world back together to make this a synchronous exception, so it 
> reports
> it to firmware as an SError-interrupt.
> Linux gets an APEI notification and memory_failure() causes the affected page 
> to
> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.
> 
> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO->
> action optional, probably asynchronous.
> 
> But in our example it wasn't really asynchronous, that was just a property of
> the original CPU->firmware notification. What happens? The guest vcpu is 
> re-run,
> it re-runs the same instructions (this was a contained error so KVM's ELR 
> points
> at/before the instruction that steps in the problem). This time KVM takes a
> stage2 fault, which the mm code will refuse to fixup because the relevant page
> was marked as PG_hwpoison by memory_failure(). KVM signals Qemu/kvmtool with
> SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA.

CC Achin

I have a personal opinion here; if you think it is wrong, I hope you can point
that out.

Synchronous External Abort and SError Interrupt are hardware exceptions (a
hardware concept), which are independent of software notification;
in Armv8 without RAS, the two concepts already exist. The APEI spec uses the
SEA and SEI notification types to describe these two exceptions.

The SEA notification stands for Synchronous External Abort, so maybe it is not
only a notification; it also stands for a hardware error type.
The SEI notification stands for SError Interrupt, so maybe it is likewise not
only a notification but also a hardware error type.

The OS has a different handling flow for each of the two exceptions (two
notifications):
while the guest OS is running, if the hardware generates a Synchronous External
Abort, we tell the guest OS that this error is an SError Interrupt instead of a
Synchronous External Abort.
The guest OS then uses the SEI notification handling flow to deal with it. I am
not sure whether this will cause problems, because the true hardware exception
is a Synchronous External Abort,
but software handles it as an SError interrupt.

Mainline does not have SEI notification support; I think the reason is that the
error address recorded by firmware is not accurate (SError Interrupt is an
asynchronous exception).
So consider treating a hardware Synchronous External Abort as an SError
interrupt (SEI): the default OS behaviour for SEI is to panic. That is to say,
when hardware triggers a Synchronous External Abort (SEA) and the guest
treats it as an SError interrupt (SEI), the OS will panic, although the error
could in fact be recoverable.

I previously posted a patch to support the SEI notification, but I am not sure
whether it can be accepted upstream; so far I have not received a response.






Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-09-18 Thread gengdongjiu
James,
   Thanks for your comments, hope we can make the solution better.

On 2017/9/14 21:00, James Morse wrote:
> Hi gengdongjiu,
> 
> (re-ordered hunks)
> 
> On 13/09/17 08:32, gengdongjiu wrote:
>> On 2017/9/8 0:30, James Morse wrote:
>>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by
>>> an access or not.
> 
> Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered 
> via
> some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is
> x86's kernel-first handling, which nicely matches this 'direct access' 
> problem.
> BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). 
> Powerpc
> also triggers these directly, both from what look to be synchronous paths, so 
> I
> think its fair to equate BUS_MCEERR_AR to a synchronous access and 
> BUS_MCEERR_AO
> to something_else.

James, thanks for your explanation.
Can I take your meaning to be that "BUS_MCEERR_AR" stands for synchronous
access and BUS_MCEERR_AO stands for asynchronous access?
Then for "BUS_MCEERR_AO", how do we distinguish an asynchronous data
access (SError) from a PCIe AER error?
In user space we can check the si_code: if it is "BUS_MCEERR_AR", we use the
SEA notification type for the guest;
if it is "BUS_MCEERR_AO", we use the SEI notification type for the guest.
Because there are only two values for si_code ("BUS_MCEERR_AR" and
BUS_MCEERR_AO), in which case can we use the GSIV (IRQ) notification type?


> 
> I don't think we need anything else.
> 
> 
>>> When the mm code gets -EHWPOISON when trying to resolve a
>>
>> Because of that, so I allow  userspace getting exception information
> 
> ... and there are cases where you can't get the exception information, and 
> other
> cases where it wasn't an exception at all.
> 
> [...]
> 
> 
>>> What happens if the dram-scrub hardware spots an error in guest memory, but
>>> the guest wasn't running? KVM won't have a relevant ESR value to give you.
> 
>> if the dram-scrub hardware spots an error in guest memory, it will generate
>> IRQ in DDR controller, not SEA or SEI exception. I still do not consider the
>> GSIV. For GSIV, may be we can only handle it in the host OS.
> 
> Great example: this IRQ pulls us out of a guest, we tromp through APEI and 
> then
> memory_failure(), the memory happened to belong to the same guest
> (coincidence!), we send it some signal and now its user-space's problem.
> 
> Your KVM_REG_ARM64_FAULT mechanism is going to return stale data, even though
> the notification interrupted the guest, and it was guest memory that was
> affected. KVM doesn't have a relevant ESR.
> 
> 
> I'm strongly against exposing 'which notification type' this error originally
> came from because:
> * it doesn't matter once we've got the CPER records,
> * there isn't always an answer (there are/will-be other ways of tripping
>   memory_failure())
> * it creates ABI between firmware, host userspace and guest userspace.
>   Firmware's choice of notification type shouldn't affect anything other than
>   the host kernel.
> 
> 
> On 13/09/17 08:32, gengdongjiu wrote:
>> On 2017/9/8 0:30, James Morse wrote:
>>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>>> when userspace gets SIGBUS signal, it does not know whether
>>>> this is a synchronous external abort or SError,
>>>
>>> Why would Qemu/kvmtool need to know if the original notification (if there 
>>> was
>>> one) was synchronous or asynchronous? This is between firmware and the 
>>> kernel.
> 
>> there are two reasons:
>>
>> 1. Let us firstly discuss the SEA and SEI, there are different workflow for 
>> the two different Errors.
>> 2. when record the CPER in the user space, it needs to know the error type, 
>> because SEA and SEI are different Error source,
>>so they have different offset in the APEI table, that is to say they will 
>> be recorded to different place of the APEI table.
> 
> user-space can choose whether to use SEA or SEI, it doesn't have to choose the
> same notification type that firmware used, which in turn doesn't have to be 
> the
> same as that used by the CPU to notify firmware.
> 
> The choice only matters because these notifications hang on an existing pieces
> of the Arm-architecture, so the notification can only add to the 
> architecturally
> defined meaning. (i.e. You can only send an SEA for something that can already
> be described as a synchronous external abort).
> 
> Once we get to user-space, for memory_failure() notifications, (which so far 
> is
> all we are talking about here), the only thing that could matter is whether 
> the
> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
> Synchronous-External-Abort.
> 
> The Synchronous-External-Abort/SError-Interrupt distinction matters for the 
> CPU
> because it can't always make an error synchronous. For memory_failure()
> notifications to a KVM guest we really can do this, and we already have this
> behaviour for 

Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-09-14 Thread James Morse
Hi gengdongjiu,

(re-ordered hunks)

On 13/09/17 08:32, gengdongjiu wrote:
> On 2017/9/8 0:30, James Morse wrote:
>> On 28/08/17 11:38, Dongjiu Geng wrote:
>> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by
>> an access or not.

Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered via
some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is
x86's kernel-first handling, which nicely matches this 'direct access' problem.
BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). Powerpc
also triggers these directly, both from what look to be synchronous paths, so I
think its fair to equate BUS_MCEERR_AR to a synchronous access and BUS_MCEERR_AO
to something_else.

I don't think we need anything else.


>> When the mm code gets -EHWPOISON when trying to resolve a
>
> Because of that, so I allow  userspace getting exception information

... and there are cases where you can't get the exception information, and other
cases where it wasn't an exception at all.

[...]


>> What happens if the dram-scrub hardware spots an error in guest memory, but
>> the guest wasn't running? KVM won't have a relevant ESR value to give you.

> if the dram-scrub hardware spots an error in guest memory, it will generate
> IRQ in DDR controller, not SEA or SEI exception. I still do not consider the
> GSIV. For GSIV, may be we can only handle it in the host OS.

Great example: this IRQ pulls us out of a guest, we tromp through APEI and then
memory_failure(), the memory happened to belong to the same guest
(coincidence!), we send it some signal and now its user-space's problem.

Your KVM_REG_ARM64_FAULT mechanism is going to return stale data, even though
the notification interrupted the guest, and it was guest memory that was
affected. KVM doesn't have a relevant ESR.


I'm strongly against exposing 'which notification type' this error originally
came from because:
* it doesn't matter once we've got the CPER records,
* there isn't always an answer (there are/will-be other ways of tripping
  memory_failure())
* it creates ABI between firmware, host userspace and guest userspace.
  Firmware's choice of notification type shouldn't affect anything other than
  the host kernel.


On 13/09/17 08:32, gengdongjiu wrote:
> On 2017/9/8 0:30, James Morse wrote:
>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>> when userspace gets SIGBUS signal, it does not know whether
>>> this is a synchronous external abort or SError,
>>
>> Why would Qemu/kvmtool need to know if the original notification (if there 
>> was
>> one) was synchronous or asynchronous? This is between firmware and the 
>> kernel.

> there are two reasons:
> 
> 1. Let us firstly discuss the SEA and SEI, there are different workflow for 
> the two different Errors.
> 2. when record the CPER in the user space, it needs to know the error type, 
> because SEA and SEI are different Error source,
>so they have different offset in the APEI table, that is to say they will 
> be recorded to different place of the APEI table.

user-space can choose whether to use SEA or SEI, it doesn't have to choose the
same notification type that firmware used, which in turn doesn't have to be the
same as that used by the CPU to notify firmware.

The choice only matters because these notifications hang on an existing pieces
of the Arm-architecture, so the notification can only add to the architecturally
defined meaning. (i.e. You can only send an SEA for something that can already
be described as a synchronous external abort).

Once we get to user-space, for memory_failure() notifications, (which so far is
all we are talking about here), the only thing that could matter is whether the
guest hit a PG_hwpoison page as a stage2 fault. These can be described as
Synchronous-External-Abort.

The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU
because it can't always make an error synchronous. For memory_failure()
notifications to a KVM guest we really can do this, and we already have this
behaviour for free. An example:

A guest touches some hardware:poisoned memory, for whatever reason the CPU can't
put the world back together to make this a synchronous exception, so it reports
it to firmware as an SError-interrupt.
Linux gets an APEI notification and memory_failure() causes the affected page to
be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.

Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO->
action optional, probably asynchronous.
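A minimal sketch of how a VMM's SIGBUS handler might separate the two cases
discussed in this thread: BUS_MCEERR_AO (action optional) becomes an IRQ or
polled notification, and BUS_MCEERR_AR (action required) becomes a synchronous
external abort. The enum and function names are hypothetical and are not taken
from Qemu or kvmtool; only the si_code values come from the Linux signal ABI.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Linux si_code values for SIGBUS; fallbacks in case the libc hides them. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical VMM-side actions (illustrative names only). */
enum guest_notify {
    NOTIFY_NONE,        /* not a memory-failure SIGBUS, or no address given */
    NOTIFY_ASYNC,       /* action optional: IRQ or polled notification */
    NOTIFY_SYNC_ABORT   /* action required: synchronous external abort */
};

/* Map the siginfo fields of a SIGBUS onto a guest notification type. */
enum guest_notify classify_sigbus(int si_code, const void *si_addr)
{
    if (si_addr == NULL)
        return NOTIFY_NONE;

    switch (si_code) {
    case BUS_MCEERR_AO:
        return NOTIFY_ASYNC;
    case BUS_MCEERR_AR:
        return NOTIFY_SYNC_ABORT;
    default:
        return NOTIFY_NONE;
    }
}
```

Note that the decision uses only what the signal carries; it does not need to
know how the CPU originally notified firmware.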

But in our example it wasn't really asynchronous, that was just a property of
the original CPU->firmware notification. What happens? The guest vcpu is re-run,
it re-runs the same instructions (this was a contained error so KVM's ELR points
at/before the instruction that steps in the problem). This time KVM takes a
stage2 fault, which the mm code will refuse to fixup because the relevant page
was marked as PG_hwpoisi

Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-09-13 Thread gengdongjiu
Hi James,


On 2017/9/8 0:30, James Morse wrote:
> Hi Dongjiu Geng,
> 
> On 28/08/17 11:38, Dongjiu Geng wrote:
>> when userspace gets SIGBUS signal, it does not know whether
>> this is a synchronous external abort or SError,
> 
> Why would Qemu/kvmtool need to know if the original notification (if there was
> one) was synchronous or asynchronous? This is between firmware and the kernel.
There are two reasons:

1. SEA and SEI are different errors, and each has a different workflow.
2. When recording the CPER in user space, it needs to know the error type,
   because SEA and SEI are different error sources, so they have different
   offsets in the APEI table; that is, they are recorded in different places
   in the APEI table.


[Diagram: layout of etc/acpi/tables and etc/hardware_errors]

 etc/acpi/tables (HEST)           etc/hardware_errors
 ======================           ==========================================
 GHESn (n = 0..10):               address registers:
   error_status_address  ------->   status_addressn ---> Error Status
   read_ack_register     ------->   ack_valuen           Data Block n
   read_ack_preserve                                     (CPER, CPER, ...)
   read_ack_write
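To make the "different offset per error source" point concrete, here is a
rough sketch of how offsets into etc/hardware_errors could be computed from a
source index, following the layout in the diagram above (status addresses,
then ack values, then one data block per source). The register size, the data
block size and all names are assumptions for illustration; they are not taken
from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions: 11 sources (GHES0..GHES10 as in the diagram), 8-byte
 * registers, and a made-up size for each Error Status Data Block. */
#define NUM_SOURCES     11u
#define REG_SIZE        8u
#define DATA_BLOCK_SIZE 0x1000u   /* hypothetical */

/* Offset of status_address<src> within the blob. */
uint64_t status_address_offset(unsigned src)
{
    assert(src < NUM_SOURCES);
    return (uint64_t)src * REG_SIZE;
}

/* Offset of ack_value<src>, placed after all status addresses. */
uint64_t ack_value_offset(unsigned src)
{
    assert(src < NUM_SOURCES);
    return (uint64_t)(NUM_SOURCES + src) * REG_SIZE;
}

/* Offset of Error Status Data Block <src>, placed after both register arrays. */
uint64_t data_block_offset(unsigned src)
{
    assert(src < NUM_SOURCES);
    return (uint64_t)2 * NUM_SOURCES * REG_SIZE
         + (uint64_t)src * DATA_BLOCK_SIZE;
}
```

With this layout, a CPER for one source (say SEA) and one for another (say SEI)
land in different data blocks purely because their source indices differ.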

> 
> 
> I think I can see why you need this: to choose whether to emulate SEA or SEI,
Emulating SEA or SEI is one reason; the other is that the CPER is recorded in
a different place in the APEI table.


> but what if the guest wasn't running? Or the guest was running, but it wasn't
> guest-memory that is affected.
If the guest was not running, host firmware notifies the EL1 host kernel
directly to handle the error and does not notify the hypervisor. Host firmware
can notify the hypervisor of the error only if the guest was running.

If user space is Qemu, the error comes from Qemu itself, and guest memory is
not involved, I will not handle it. Please see the code for arm64:

void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
{
    ram_addr_t ram_addr;
    hwaddr paddr;

    ARMCPU *cpu = ARM_CPU(c);
    CPUARMState *env = &cpu->env;
    assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
    if (addr) {
        ram_addr = qemu_ram_addr_from_host(addr);
        if (ram_addr != RAM_ADDR_INVALID &&
            kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
            kvm_cpu_synchronize_state(c);
            kvm_hwpoison_page_add(ram_addr);
            if (is_a

Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-09-07 Thread James Morse
Hi Dongjiu Geng,

On 28/08/17 11:38, Dongjiu Geng wrote:
> when userspace gets SIGBUS signal, it does not know whether
> this is a synchronous external abort or SError,

Why would Qemu/kvmtool need to know if the original notification (if there was
one) was synchronous or asynchronous? This is between firmware and the kernel.


I think I can see why you need this: to choose whether to emulate SEA or SEI,
but what if the guest wasn't running? Or the guest was running, but it wasn't
guest-memory that is affected.

What happens if the dram-scrub hardware spots an error in guest memory, but the
guest wasn't running? KVM won't have a relevant ESR value to give you.

What happens if we start swapping a page of guest memory to disk, and discover
the memory is corrupt. This is synchronous, but it wasn't the guest, and KVM
still can't give you an ESR.

What about CPER records discovered through the polled interface? What happens if
I write a PFN into the corrupt-pfn sysfs interface?


I think what you need is some way of knowing if the BUS_MCEERR_A* was directly
caused by a user-space (or guest) access, and if so was it a data or instruction
fetch. These can become SEA notifications.

KVM's user-space shouldn't be a special-case where the kernel behaves
differently: if we tinker with this it needs to make sense for all user space
processes and mean something on all architectures.

I think this information could be useful to other users of these signals, e.g. a
JVM could silently regenerate/reload code/data for a non-direct-access fault
instead of exit-ing (or throwing an exception) for a direct access.

For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by an
access or not. When the mm code gets -EHWPOISON when trying to resolve a
user-space fault we know it was due to a direct-access. (I don't know if/how x86
can know if it was code or data). Faulting guest accesses through KVM are just a
special version of this where KVM fixes-up stage2.

... but for any of this to work we need the address of the corrupt memory.
(-> cover letter)


Thanks,

James
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm