Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
Hi gengdongjiu,

On 27/10/17 08:21, gengdongjiu wrote:
> On 2017/10/26 1:42, James Morse wrote:
>> On 20/10/17 16:33, gengdongjiu wrote:
>>> As we discuss below solution:
>>> When guest happen SEA/SEI, KVM calls memory_failure() to send an
>>> asynchronous SIGBUS signal(BUS_MCEERR_AO) to QEMU, and make this
>>> address to poisoned.
>>> after QEMU receive this BUS_MCEERR_AO, it will record this address to
>>> CPER and notify guest.
>>> When guest happen stage2 page fault, KVM send a synchronous SIGBUS
>>> BUS_MCEERR_AR to QEMU, and QEMU also record CPER and immediately inject
>>> SEA abort.
>>>
>>> But this solution, still have some problems.
>>>
>>> 1. In some situation, For RAS, when happen SEA, hardware cannot provide
>>> an error physical address
>>
>> Eh? For any RAS error you should get a physical address in ERRADDR.
>>
>> When you get an external abort due to RAS you can scan these nodes to
>> find which one generated the error and collect the component information.
>> Doing this in firmware is better because firmware knows the SoC topology,
>> so it can skip the nodes it knows won't be relevant to an error on this
>> CPU.
>
> Thanks for you suggestion.
> After discussed this issue internally in our side, I think this should be
> our firmware issue. Not a common issue.
> so let us ignore the issue that hardware does not record physical error
> address.

This is going to give you problems in the long run. All we can do with
'memory corrupt at an unknown address' is reboot.

>>> to software instead it can only provide virtual address in FAR_ELx,
>>> This is to say, firmware cannot provide physical error address, but
>>> provided the virtual address in the FAR_ELx. so BIOS cannot record this
>>> address to APEI table. In
>>
>> Nit: APEI table, you mean recorded as CPER records in a buffer pointed to
>> by a GHES's ErrorStatusAddress. APEI tables aren't parsed post boot.
>>
>>> this case, when firmware Jump to hypervisor, hypervisor cannot call
>>> memory_failure(), now only the physical address is recorded and valid,
>>> APEI driver will call the memory_failure()), in this case, host will not
>>> send SIGBUS to QEMU. So guest cannot know there is SEA happen.
>>> At least there is such issue in Huawei's platform (cannot provide PA for
>>> RAS firmware-first, only can provide VA in FAR_ELx)
>>
>> This isn't a KVM problem.
>>
>> It looks like both of UEFI's 'Table 275. Memory Error Record' and
>> 'Table 276. Memory Error Record 2' require a physical address. You can't
>> describe a memory error without one.
>>
>> Is this really a memory error?, or some other component, say, a virtually
>> indexed cache.
>
> When happen SEA, if the {D,I}FSC is 0b0101xx which is SEA on translation
> table walk or hardware update of translation table, it means the page
> table itself happen issue, not the target address error.
> For this case, even firmware can report a error page table physical
> address, but memory_memory() can not recognize this address because the
> page table address is not belong to any task include Qemu,
> so memory_failure() will not deliver SIGBUS. Of course, this is memory
> address.

Both KVM's stage2 and Qemu's user-space page tables are made up of pages of
kernel memory. When memory_failure() is told one of these is corrupt, it
should panic.

E.g., arm64 allocates pmd pages like so:
> static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
> {
>         return (pmd_t *)__get_free_page(PGALLOC_GFP);
> }

To do any better the kernel would need to know this memory is page-table and
that it wasn't the kernel's page-table. It also needs to know which mm_struct
it belonged to, and where in the page-table tree the corrupted page lives.
There would need to be a per-arch helper to ensure no CPU (or other
component) had cached the corrupted page table entries. (it's contained,
right?)

This isn't an arm64 specific issue, and it's going to be very difficult to do.

> I ever make a experiment, if a APP's page table itself generated SEA,
> memory_failure() will consider it as unknown issue. please see below log,
> I think this should be a common issue.

This shouldn't be a common issue, page-tables are small compared to the
memory they map.

> so in KVM code, I plan to separately handle the page table error of SEA if
> the {D,I}FSC is 0b0101xx, and not call memory_failure(), what do you
> think about that?

I think we shouldn't special case KVM. All user-space tasks' page-tables are
kernel memory too, they shouldn't be treated differently. Once Linux can
handle user-space page-table corruption, we can wire-in KVM's stage2.

KVM shouldn't call memory_failure() directly, for RAS it should rely on a
firmware-first or kernel-first handler to diagnose the error and do this
dirty work.

I agree 'unknown error' sounds fishy:
> only the memory access SEA call memory_failure().
>
> [ 25.482904] {1}[Hardware Error]: Hardware error from APEI Ge
Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
James, Thanks for the comment. On 2017/10/26 1:42, James Morse wrote: > Hi gengdongjiu, > > On 20/10/17 16:33, gengdongjiu wrote: >> As we discuss below solution: >> When guest happen SEA/SEI, KVM calls memory_failure() to send an >> asynchronous SIGBUS >> signal(BUS_MCEERR_AO) to QEMU, and make this address to poisoned. >> after QEMU receive this BUS_MCEERR_AO, it will record this address to CPER >> and notify guest. >> When guest happen stage2 page fault, KVM send a synchronous SIGBUS >> BUS_MCEERR_AR to QEMU, and QEMU also record CPER and immediately inject SEA >> abort. >> >> But this solution, still have some problems. >> >> 1. In some situation, For RAS, when happen SEA, hardware cannot provide an >> error physical >> address > > Eh? For any RAS error you should get a physical address in ERRADDR. > > When you get an external abort due to RAS you can scan these nodes to find > which > one generated the error and collect the component information. > Doing this in firmware is better because firmware knows the SoC topology, so > it > can skip the nodes it knows won't be relevant to an error on this CPU. Thanks for you suggestion. After discussed this issue internally in our side, I think this should be our firmware issue. Not a common issue. so let us ignore the issue that hardware does not record physical error address. > > >> to software instead it can only provide virtual address in FAR_ELx, >> This is to say, firmware cannot provide physical error address, but provided >> the virtual >> address in the FAR_ELx. so BIOS cannot record this address to APEI table. In > > Nit: APEI table, you mean recorded as CPER records in a buffer pointed to by a > GHES's ErrorStatusAddress. APEI tables aren't parsed post boot. > > >> this case, when firmware Jump to hypervisor, hypervisor cannot call >> memory_failure(), now only the physical address is recorded and valid, APEI >> driver will call the memory_failure()), in this case, host will not send >> SIGBUS >> to QEMU. 
So guest cannot know there is SEA happen. >> At least there is such issue in Huawei's platform (cannot provide PA for RAS >> firmware-first, >> only can provide VA in FAR_ELx) > > This isn't a KVM problem. > > It looks like both of UEFI's 'Table 275. Memory Error Record' and 'Table 276. > Memory Error Record 2' require a physical address. You can't describe a memory > error without one. > > Is this really a memory error?, or some other component, say, a virtually > indexed cache. When happen SEA, if the {D,I}FSC is 0b0101xx which is SEA on translation table walk or hardware update of translation table, it means the page table itself happen issue, not the target address error. For this case, even firmware can report a error page table physical address, but memory_memory() can not recognize this address because the page table address is not belong to any task include Qemu, so memory_failure() will not deliver SIGBUS. Of course, this is memory address. I ever make a experiment, if a APP's page table itself generated SEA, memory_failure() will consider it as unknown issue. please see below log, I think this should be a common issue. so in KVM code, I plan to separately handle the page table error of SEA if the {D,I}FSC is 0b0101xx, and not call memory_failure(), what do you think about that? only the memory access SEA call memory_failure(). [ 25.482904] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 7 [ 25.484862] {1}[Hardware Error]: event severity: recoverable [ 25.486192] {1}[Hardware Error]: Error 0, type: recoverable [ 25.487519] {1}[Hardware Error]: section_type: memory error [ 25.490169] {1}[Hardware Error]: physical_address: 0x7ce81000 [ 25.491718] {1}[Hardware Error]: error_type: 3, multi-bit ECC [ 25.501178] Memory failure: 0x7ce81: Unknown page state [ 25.501181] Memory failure: 0x7ce81: unknown page still referenced by 1 users [ 25.501183] Memory failure: 0x7ce81: recovery action for unknown page: Failed > > [cut] > > > >> 3. 
For SEI, the address is invalid, > > You mean FAR_ELx? I mean the physical address. Because SEI is asynchronous, so usually firmware will not record this address, If not record this address, the memory_failure() will be not called, then SIGBUS will not be sent, then guest will not know there is SEI happen, so for this case may be we should also inject a virtual SError to avoid the issue that physical address is not record. > > >> so in some platform, firmware will not record this AP. > > For any RAS error you should get a physical address in ERRADDR. how about the address is not accurate? For SEI, even we can get a physical address from ERRADDR, but this address is not accurate. so firmware will make it as invalid or not record it. > > > Thanks, > > James > > [0] https://lkml.org/lkml/2017/8/7/612 > > . > ___ kvmarm mailing list kvmarm@lists.cs.co
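As a reference for the {D,I}FSC values discussed in this message, the fault status code can be decoded as in the minimal sketch below. It assumes the Armv8-A encoding in which ESR_ELx.{D,I}FSC occupies bits [5:0], 0b010000 means a synchronous external abort that is not on a translation table walk, and 0b0101xx means one on a walk at level xx. The macro and function names are hypothetical and are not taken from the kernel or from the patch under review.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the bit layout follows the Armv8-A encoding. */
#define FSC_MASK          0x3fu  /* ESR_ELx.{D,I}FSC, bits [5:0] */
#define FSC_SEA           0x10u  /* 0b010000: SEA, not on a table walk */
#define FSC_SEA_TTW_BASE  0x14u  /* 0b0101xx: SEA on a table walk, level xx */

/* True when the abort was an SEA taken while walking the page tables. */
static bool fsc_is_sea_on_ttw(uint32_t esr)
{
    return ((esr & FSC_MASK) & ~0x3u) == FSC_SEA_TTW_BASE;
}

/* Translation table level (0..3) at which the walk took the SEA. */
static unsigned int fsc_sea_ttw_level(uint32_t esr)
{
    assert(fsc_is_sea_on_ttw(esr));
    return esr & 0x3u;
}

/* True when the SEA hit the target address itself, not the page tables. */
static bool fsc_is_sea_on_access(uint32_t esr)
{
    return (esr & FSC_MASK) == FSC_SEA;
}
```

This only classifies the syndrome. Whether KVM should act differently on the two cases is exactly what the thread is debating, so the sketch takes no position on that.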
Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
Hi gengdongjiu,

On 20/10/17 16:33, gengdongjiu wrote:
> As we discuss below solution:
> When guest happen SEA/SEI, KVM calls memory_failure() to send an
> asynchronous SIGBUS signal(BUS_MCEERR_AO) to QEMU, and make this address to
> poisoned.
> after QEMU receive this BUS_MCEERR_AO, it will record this address to CPER
> and notify guest.
> When guest happen stage2 page fault, KVM send a synchronous SIGBUS
> BUS_MCEERR_AR to QEMU, and QEMU also record CPER and immediately inject SEA
> abort.
>
> But this solution, still have some problems.
>
> 1. In some situation, For RAS, when happen SEA, hardware cannot provide an
> error physical address

Eh? For any RAS error you should get a physical address in ERRADDR.

When you get an external abort due to RAS you can scan these nodes to find
which one generated the error and collect the component information.
Doing this in firmware is better because firmware knows the SoC topology, so
it can skip the nodes it knows won't be relevant to an error on this CPU.

> to software instead it can only provide virtual address in FAR_ELx,
> This is to say, firmware cannot provide physical error address, but provided
> the virtual address in the FAR_ELx. so BIOS cannot record this address to
> APEI table. In

Nit: APEI table, you mean recorded as CPER records in a buffer pointed to by a
GHES's ErrorStatusAddress. APEI tables aren't parsed post boot.

> this case, when firmware Jump to hypervisor, hypervisor cannot call
> memory_failure(), now only the physical address is recorded and valid, APEI
> driver will call the memory_failure()), in this case, host will not send
> SIGBUS to QEMU. So guest cannot know there is SEA happen.
> At least there is such issue in Huawei's platform (cannot provide PA for RAS
> firmware-first, only can provide VA in FAR_ELx)

This isn't a KVM problem.

It looks like both of UEFI's 'Table 275. Memory Error Record' and 'Table 276.
Memory Error Record 2' require a physical address. You can't describe a memory
error without one.

Is this really a memory error?, or some other component, say, a virtually
indexed cache.

> 2. if there is SEA/SEI, only deliver SIGBUS to notify QEMU. This information
> is limit.
> This SIGBUS can only provide an address and si_code(BUS_MCEERR_AO/
> BUS_MCEERR_AR), nothing else.
> if QEMU record CPER and inject SEA/specify ESR, it may needs to know more
> information.
> For example, if it injects SEA, it needs so setup many registers for guest,
> such as FAR_EL1. If sets it, it needs to know FAR_EL2.

Linux is given CPER records describing a memory error using NOTIFY_IRQ. It
delivers BUS_MCEERR_AO to Qemu. What value does FAR_EL2 have?

Even for 'AR' from KVM, we already know Marc is against exposing EL2
registers. [0]

> But QEMU cannot know this information to setup it if KVM cannot pass more
> fault info to QEMU.
> Of cause, we can identify the guest FAR_El1 register to invalid. But some
> time, guest needs to know it in the situation that host cannot provide the
> PA.

When is this?

You can get the IPA from the si_addr and Qemu's memory layout.
The IPA goes in the CPER records allowing you to emulate firmware first.
The IPA goes in the ERRADDR register (once we have emulation support for it),
covering the kernel first case.

What's left? Neither: You can generate an external abort using DFSC=0b01 and
set FnV to tell it you don't have a virtual stage1 address.

A better argument is that user-space needs to know if BUS_MCEERR_AR triggered
by KVM's stage2 was an instruction or data abort so it can make this the
correct flavour of Synchronous External Abort.

I agree this bit needs exposing (but only for exits due to BUS_MCEERR_AR
triggered by KVM at stage2), and maybe KVM could include a little more
information to allow the full range of external-abort ESRs to be used.

> 3. For SEI, the address is invalid,

You mean FAR_ELx?

> so in some platform, firmware will not record this AP.

For any RAS error you should get a physical address in ERRADDR.

> At least in HUAWEI's platform, firmware will not record it. we cannot always
> think that all platform can record PA for RAS, sometime it may use
> VA(in FAR_ELx).

What component do you see this happen with?

> For SEI, if the address is not recorded, then the
> memory_failure() will be not called. So guest will not know it happens SEI.

Thanks,

James

[0] https://lkml.org/lkml/2017/8/7/612
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
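To make the point about the "correct flavour of Synchronous External Abort" concrete, here is a minimal sketch of how user-space might build the ESR_EL1 value for an injected external abort once it knows whether the stage2 fault was an instruction or data abort. It assumes the Armv8-A syndrome layout (EC in bits [31:26], IL in bit 25, FnV in bit 10, and a fault status code of 0b010000 for a synchronous external abort not on a translation table walk) and a guest running AArch64, so IL is always set. The function name and parameters are hypothetical and are not an existing Qemu or KVM interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; values follow the Armv8-A ESR_ELx layout. */
#define ESR_EC_SHIFT     26
#define ESR_EC_IABT_LOW  0x20u      /* instruction abort from a lower EL */
#define ESR_EC_IABT_CUR  0x21u      /* instruction abort, same EL */
#define ESR_EC_DABT_LOW  0x24u      /* data abort from a lower EL */
#define ESR_EC_DABT_CUR  0x25u      /* data abort, same EL */
#define ESR_IL           (1u << 25) /* 32-bit instruction (assumed AArch64) */
#define ESR_FNV          (1u << 10) /* FAR not valid */
#define ESR_FSC_EXTABT   0x10u      /* SEA, not on a translation table walk */

/*
 * Build the syndrome for an SEA delivered to the guest's EL1.
 * is_iabt:   the stage2 fault was an instruction abort
 * from_el0:  the vcpu was executing at EL0 when it faulted
 * far_valid: we have a stage1 virtual address to put in FAR_EL1
 */
static uint32_t make_sea_esr(bool is_iabt, bool from_el0, bool far_valid)
{
    uint32_t ec;

    if (is_iabt)
        ec = from_el0 ? ESR_EC_IABT_LOW : ESR_EC_IABT_CUR;
    else
        ec = from_el0 ? ESR_EC_DABT_LOW : ESR_EC_DABT_CUR;

    uint32_t esr = (ec << ESR_EC_SHIFT) | ESR_IL | ESR_FSC_EXTABT;
    if (!far_valid)
        esr |= ESR_FNV; /* tell the guest not to trust FAR_EL1 */
    return esr;
}
```

With FnV set there is no need to know FAR_EL2 at all, which is the argument made above. The is_iabt input is the one piece of information user-space cannot derive from si_code and si_addr alone.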
Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
CC James.

> > In the user space, we can check the si_code, if it is
> > "BUS_MCEERR_AR", we use SEA notification type for the guest; if it is
> > "BUS_MCEERR_AO", we use SEI notification type for the guest.
> > Because there are only two values for si_code("BUS_MCEERR_AR" and
> > BUS_MCEERR_AO), in which case we can use the GSIV(IRQ)
> > notification type?
>
> This is for Qemu/kvmtool to decide, it depends on what sort of machine they
> are emulating.
>
> For example, the physical machine's memory-controller may notify the
> CPU about memory errors by triggering SError trapped to EL3, or with a
> dedicated FIQ, also routed to EL3. By the time this gets to the host kernel
> the distinction doesn't matter. The host has handled the error.
>
> For a guest, your memory-controller is effectively the host kernel. It
> will give you an BUS_MCEERR_AO signal for any guest memory that is affected,
> and a BUS_MCEERR_AR if the guest directly accesses a page of affected memory.
>
> What Qemu/kvmtool do with this is up to them. If they're emulating a machine
> with no RAS features, printing an error and exit.
>
> Otherwise BUS_MCEERR_AR could be notified as one of the flavours of
> IRQ, unless the affected vcpu has interrupts masked, in which case an SEA
> notification gives you some NMI-like behaviour.
>
> For BUS_MCEERR_AO you could use SEI, IRQ or polled notification. My
> choice would be IRQ, as you can't know if the guest supports SEI and
> it would be a shame to kill it with an SError if the affected memory was
> free. SEA for synchronous errors is still a good choice even if the guest
> doesn't support it as that memory is still gone so its still a valid
> guest:Synchronous-external-abort.
>

Add James. CC some Huawei hardware engineers.

Hi James/Marc/Christoffer,

We discussed the following solution:
When an SEA/SEI happens in the guest, KVM calls memory_failure() to send an
asynchronous SIGBUS signal (BUS_MCEERR_AO) to QEMU and marks the address as
poisoned. After QEMU receives this BUS_MCEERR_AO, it records the address in a
CPER and notifies the guest.
When the guest takes a stage2 page fault, KVM sends a synchronous SIGBUS
(BUS_MCEERR_AR) to QEMU, and QEMU also records a CPER and immediately injects
an SEA.

But this solution still has some problems.

1. In some situations, for RAS, when an SEA happens the hardware cannot
provide the physical address of the error to software; it can only provide a
virtual address in FAR_ELx. That is to say, firmware cannot provide the
physical error address, only the virtual address in FAR_ELx, so the BIOS
cannot record this address in the APEI table. In this case, when firmware
jumps to the hypervisor, the hypervisor cannot call memory_failure() (at
present the APEI driver calls memory_failure() only when the physical address
is recorded and valid), so the host will not send SIGBUS to QEMU and the
guest cannot know that an SEA happened.
At least Huawei's platform has this issue (it cannot provide a PA for RAS
firmware-first, only a VA in FAR_ELx).

2. If there is an SEA/SEI, only a SIGBUS is delivered to notify QEMU, and the
information it carries is limited: the SIGBUS can only provide an address and
an si_code (BUS_MCEERR_AO/BUS_MCEERR_AR), nothing else.
If QEMU records a CPER and injects an SEA or specifies an ESR, it may need
more information.
For example, if it injects an SEA, it needs to set up several guest
registers, such as FAR_EL1, and to set that it needs to know FAR_EL2.
But QEMU cannot know this unless KVM passes more fault information to it.
Of course, we can mark the guest's FAR_EL1 as invalid, but sometimes the
guest needs to know it, in the situation where the host cannot provide the
PA.

3. For SEI, the address is invalid, so on some platforms firmware will not
record this AP.
At least on Huawei's platform, firmware will not record it. We cannot assume
that every platform can record a PA for RAS; sometimes it may use a VA (in
FAR_ELx). For SEI, if the address is not recorded, memory_failure() will not
be called, so the guest will not know that an SEI happened.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
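The policy James describes in the quoted text (SEA for BUS_MCEERR_AR, IRQ preferred for BUS_MCEERR_AO, nothing beyond an error message if the emulated machine has no RAS features) could be expressed in user-space roughly as below. This is only a sketch of that suggestion: the enum, the function name and the machine_has_ras flag are hypothetical, and the fallback si_code values are the ones from the Linux UAPI headers, used only when the libc headers do not expose the constants.

```c
#include <signal.h>
#include <stdbool.h>

/* Fallbacks in case the libc headers hide these; values from Linux UAPI. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical guest notification choices for an emulated machine. */
enum guest_notify {
    NOTIFY_NONE, /* no RAS emulation: report the error and exit */
    NOTIFY_SEA,  /* synchronous external abort on the faulting vcpu */
    NOTIFY_IRQ,  /* GSIV/IRQ notification, CPER records to be read */
};

/* Map the SIGBUS si_code from the host onto a guest notification type. */
static enum guest_notify pick_notification(int si_code, bool machine_has_ras)
{
    if (!machine_has_ras)
        return NOTIFY_NONE;

    switch (si_code) {
    case BUS_MCEERR_AR:
        /* The guest touched poisoned memory: it must not continue. */
        return NOTIFY_SEA;
    case BUS_MCEERR_AO:
        /* Advisory only: avoid SError, the memory may be unused. */
        return NOTIFY_IRQ;
    default:
        return NOTIFY_NONE;
    }
}
```

A SIGBUS handler installed with SA_SIGINFO would pass info->si_code to this function and use info->si_addr, together with the VMM's memory layout, to find the affected IPA.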
Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
> > In the user space, we can check the si_code, if it is "BUS_MCEERR_AR", > > we use SEA notification type for the guest; if it is "BUS_MCEERR_AO", we > > use SEI notification type for the guest. > > Because there are only two values for si_code("BUS_MCEERR_AR" and > > BUS_MCEERR_AO), in which case we can use the GSIV(IRQ) > notification type? > > This is for Qemu/kvmtool to decide, it depends on what sort of machine they > are emulating. > > For example, the physical machine's memory-controller may notify the CPU > about memory errors by triggering SError trapped to EL3, or > with a dedicated FIQ, also routed to EL3. By the time this gets to the host > kernel the distinction doesn't matter. The host has handled the > error. > > For a guest, your memory-controller is effectively the host kernel. It will > give you an BUS_MCEERR_AO signal for any guest memory that is > affected, and a BUS_MCEERR_AR if the guest directly accesses a page of > affected memory. > > What Qemu/kvmtool do with this is up to them. If they're emulating a machine > with no RAS features, printing an error and exit. > > Otherwise BUS_MCEERR_AR could be notified as one of the flavours of IRQ, > unless the affected vcpu has interrupts masked, in which case > an SEA notification gives you some NMI-like behaviour. > > For BUS_MCEERR_AO you could use SEI, IRQ or polled notification. My choice > would be IRQ, as you can't know if the guest supports SEI and > it would be a shame to kill it with an SError if the affected memory was > free. SEA for synchronous errors is still a good choice even if the > guest doesn't support it as that memory is still gone so its still a valid > guest:Synchronous-external-abort. > CC some huawei's hardware engineers. Hi James/Marc/Christoffer, As we discuss below solution: When guest happen SEA/SEI, KVM calls memory_failure() to send an asynchronous SIGBUS signal(BUS_MCEERR_AO) to QEMU, and make this address to poisoned. 
after QEMU receive this BUS_MCEERR_AO, it will record this address to CPER and notify guest. When guest happen stage2 page fault, KVM send a synchronous SIGBUS BUS_MCEERR_AR to QEMU, and QEMU also record CPER and immediately inject SEA abort. But this solution, still have some problems. 1. In some situation, For RAS, when happen SEA, hardware cannot provide an error physical address to software instead it can only provide virtual address in FAR_ELx, This is to say, firmware cannot provide physical error address, but provided the virtual address in the FAR_ELx. so BIOS cannot record this address to APEI table. In this case, when firmware Jump to hypervisor, hypervisor cannot call memory_failure(), now only the physical address is recorded and valid, APEI driver will call the memory_failure()), in this case, host will not send SIGBUS to QEMU. So guest cannot know there is SEA happen. At least there is such issue in Huawei's platform (cannot provide PA for RAS firmware-first, only can provide VA in FAR_ELx) 2. if there is SEA/SEI, only deliver SIGBUS to notify QEMU. This information is limit. This SIGBUS can only provide an address and si_code(BUS_MCEERR_AO/ BUS_MCEERR_AR), nothing else. if QEMU record CPER and inject SEA/specify ESR, it may needs to know more information. For example, if it injects SEA, it needs so setup many registers for guest, such as FAR_EL1. If sets it, it needs to know FAR_EL2. But QEMU cannot know this information to setup it if KVM cannot pass more fault info to QEMU. Of cause, we can identify the guest FAR_El1 register to invalid. But some time, guest needs to know it in the situation that host cannot provide the PA. 3. For SEI, the address is invalid, so in some platform, firmware will not record this AP. At least in HUAWEI's platform, firmware will not record it. we cannot always think that all platform can record PA for RAS, sometime it may use VA(in FAR_ELx). 
For SEI, if the address is not recorded, then the memory_failure() will be not called. So guest will not know it happens SEI. ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
On 2017/10/7 1:31, James Morse wrote: > Hi gengdongjiu, > > On 27/09/17 12:07, gengdongjiu wrote: >> On 2017/9/23 0:51, James Morse wrote: >>> If this wasn't a firmware-first notification, then you're right KVM hands >>> the >>> guest an asynchronous external abort. This could be considered a bug in >>> KVM. (we >>> can discuss with Marc and Christoffer what it should do), but: >>> >>> I'm not sure what scenario you could see this in: surely all your >>> CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first >>> notifications. So they should always be claimed by APEI. > >> Yes, if it is firmware-first we should not exist such issue. > > [...] > >>> What you may be seeing is some awkwardness with the change in the SError ESR >>> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they >>> were all impdef so it didn't matter). >>> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this >>> means 'classified as a RAS error ... unknown!'. > >>> I have a patch in the upcoming SError/RAS series that changes KVMs >>> virtual-abort >>> code to specify an impdef ESR for this path. > > https://www.spinics.net/lists/arm-kernel/msg609589.html > > >> Before I remember Marc and you suggest me specify the an impdef ESR (set the >> vsesr_el2) in >> the userspace, > > If Qemu/kvmtool wants to emulate a machine that notifies the guest about > firmware-first RAS Errors using SError/SEI, it needs to decide when to send > these SError and what ESR to specify. yes, it is. I agree. > > >> now do you want to specify an impdef ESR in KVM instead of usrspace? > > No, that patch is just trying to fixup the existing cases where KVM already > injects an SError. I'm just trying to keep the behaviour the-same: > > Before the RAS Extensions the guest would always get an impdef SError ESR. > After the RAS Extensions KVM has to pick an ESR, as otherwise the guest gets > the > hardware-reset value of VSESR_EL2. 
On the FVP this is all-zeros, which is a > RAS > encoding. That patch changes it to still be an impdef SError ESR. > > >> if setting the vsesr_el2 in the KVM, whether user-space need to specify? > > I think we need a KVM CAP API to do this. With that patch it can be wired into > pend_guest_serror(), which takes the ESR to make pending if the CPU supports > it. For this CAP API, I have made a patch in the new series patches. > > It's not clear to me whether its useful for user-space to make an SError > pending > even if it can't set the ESR why it can not set the ESR? In the KVM, we can return a encoding fault info to userspace, then user space can specify its own ESR for guest. I already made a patch for it. > > [...] > >>> Because the SEI notification depends on v8.2 I'd like to get the SError/RAS >>> series posted (currently re-testing), then I'll pick up enough of the >>> patches >>> you've posted for a consolidated version of the series, and we can take the >>> discussion from there. > >> James, it is great, we can make a consolidated version of the series. > > We need to get some of the wider issues fixed first, the big-ugly one is > memory_failure_queue() not being NMI safe. (this isn't a problem for SEA as > this > would only become re-entrant if the kernel text was corrupt). It is a problem > for SEI and SDEI. For memory_failure_queue(), I think the big problem is it is in a process context, not error handling context. there are two context. and the memory_failure_queue() is scheduled later than the error handling. > > >>> I'd still like to know what your firmware does if the normal-world believes >>> its >>> masked physical-SError and you want to hand it an SEI notification. > > Aha, thanks for this: > >> firstly the physical-SError that happened in the EL2/EL1/EL1 can not be >> masked if SCR_EL3.EA is set. > > Masked for the CPU because the CPU can deliver the SError to EL3. > > What about software? 
Code at EL1 and EL2 each have a PSTATE.A bit they may > have > set. HCR_EL2.VSE respects EL1's PSTATE.A ... the question is does your > firmware > respect the PSTATE.A value of the exception level that SError are routed to? Before route to the target EL, software set the spsr_el3.A to 1, then "eret", the PSTATE.A will be to 1. Note: PSTATE.A is shared by different EL, in the hardware, it is one register, not many registers. spsr_elx has more registers, such as spsr_el1, spsr_el2, spsr_el3. > > >> when trap to EL3, firmware will record the error to APEI CPER from reading >> ERR* RAS registers. >> >> (1) if HCR_EL2.TEA is set to 1, exception come from EL0, El1. firmware knows >> this > > HCR_EL2.TEA covers synchronous-external-aborts. For SError you need to check > HCR_EL2.AMO. Some crazy hypervisor may set one and not the other. sorry, it is typo issue, should check HCR_EL2.AMO > > >> SError come from guest OS, copy the elr_el3 to elr_el2, copy ESR_El3 to >> ESR_EL2. > > The EC value in the ELR des
Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
Hi gengdongjiu,

On 27/09/17 12:07, gengdongjiu wrote:
> On 2017/9/23 0:51, James Morse wrote:
>> If this wasn't a firmware-first notification, then you're right KVM hands
>> the guest an asynchronous external abort. This could be considered a bug in
>> KVM. (we can discuss with Marc and Christoffer what it should do), but:
>>
>> I'm not sure what scenario you could see this in: surely all your
>> CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become
>> firmware-first notifications. So they should always be claimed by APEI.

> Yes, if it is firmware-first we should not exist such issue.

[...]

>> What you may be seeing is some awkwardness with the change in the SError ESR
>> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
>> were all impdef so it didn't matter).
>> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this
>> means 'classified as a RAS error ... unknown!'.

>> I have a patch in the upcoming SError/RAS series that changes KVMs
>> virtual-abort code to specify an impdef ESR for this path.

https://www.spinics.net/lists/arm-kernel/msg609589.html

> Before I remember Marc and you suggest me specify the an impdef ESR (set the
> vsesr_el2) in the userspace,

If Qemu/kvmtool wants to emulate a machine that notifies the guest about
firmware-first RAS Errors using SError/SEI, it needs to decide when to send
these SError and what ESR to specify.

> now do you want to specify an impdef ESR in KVM instead of usrspace?

No, that patch is just trying to fixup the existing cases where KVM already
injects an SError. I'm just trying to keep the behaviour the-same:

Before the RAS Extensions the guest would always get an impdef SError ESR.
After the RAS Extensions KVM has to pick an ESR, as otherwise the guest gets
the hardware-reset value of VSESR_EL2. On the FVP this is all-zeros, which is
a RAS encoding. That patch changes it to still be an impdef SError ESR.

> if setting the vsesr_el2 in the KVM, whether user-space need to specify?

I think we need a KVM CAP API to do this. With that patch it can be wired into
pend_guest_serror(), which takes the ESR to make pending if the CPU supports
it.

It's not clear to me whether it's useful for user-space to make an SError
pending even if it can't set the ESR.

[...]

>> Because the SEI notification depends on v8.2 I'd like to get the SError/RAS
>> series posted (currently re-testing), then I'll pick up enough of the
>> patches you've posted for a consolidated version of the series, and we can
>> take the discussion from there.

> James, it is great, we can make a consolidated version of the series.

We need to get some of the wider issues fixed first, the big-ugly one is
memory_failure_queue() not being NMI safe. (this isn't a problem for SEA as
this would only become re-entrant if the kernel text was corrupt). It is a
problem for SEI and SDEI.

>> I'd still like to know what your firmware does if the normal-world believes
>> its masked physical-SError and you want to hand it an SEI notification.

Aha, thanks for this:

> firstly the physical-SError that happened in the EL2/EL1/EL1 can not be
> masked if SCR_EL3.EA is set.

Masked for the CPU because the CPU can deliver the SError to EL3.

What about software? Code at EL1 and EL2 each have a PSTATE.A bit they may
have set. HCR_EL2.VSE respects EL1's PSTATE.A ... the question is does your
firmware respect the PSTATE.A value of the exception level that SError are
routed to?

> when trap to EL3, firmware will record the error to APEI CPER from reading
> ERR* RAS registers.
>
> (1) if HCR_EL2.TEA is set to 1, exception come from EL0, El1. firmware knows
> this

HCR_EL2.TEA covers synchronous-external-aborts. For SError you need to check
HCR_EL2.AMO. Some crazy hypervisor may set one and not the other.

> SError come from guest OS, copy the elr_el3 to elr_el2, copy ESR_El3 to
> ESR_EL2.

The EC value in the ESR describes current/lower exception level, you need to
re-encode this for EL2 if the exception came from EL2.

> if the SError exception come from guest EL0 or EL1, set ELR_EL3 with
> VBAR_EL2 + 0x580(one EL2 SEI entry point),
>
> execute "ERET", then jump to EL2 hypervisor.
>
> (2)if the SError exception come EL2 hypervisor, copy the elr_el3 to elr_el2,
> copy ESR_El3 to ESR_EL,
> set ELR_EL3 with VBAR_EL2+0x380(one EL2 SEI entry point),
>
> execute "ERET", then jump to EL2 hypervisor.

This SError came from EL2. You _must_ check SPSR_EL3.A is clear before
returning to the EL2 SError vector.

EL2 believes it has masked SError, it does this because it can't handle one
right now. If your firmware jumps in anyway - it's game over.

We mask SError in entry.S when we take an exception and when we return from an
exception. This is so that we can read/write the ELR/SPSR without them
changing under our feet. If your firmware overwrites these values - we've lost
them, and can never return to the context we interrupted.

>
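The check being asked of firmware here can be sketched as a small decision function. This is an illustration only, under stated assumptions: the interrupted context is AArch64, SPSR_EL3.M[3:2] gives the exception level and M[0] the stack pointer selection, SPSR_EL3.A is bit 8, HCR_EL2.AMO is bit 5, and the SError vector offsets are 0x180 (current EL, SP_EL0), 0x380 (current EL, SP_ELx) and 0x580 (lower EL, AArch64). The names are hypothetical and do not come from any firmware source, and the case where the SError would be routed to EL1 rather than EL2 is simply deferred.

```c
#include <stdint.h>

/* Hypothetical names; bit positions follow the Armv8-A register layout. */
#define SPSR_A_BIT            (1u << 8)   /* saved PSTATE.A (SError mask) */
#define HCR_EL2_AMO           (1ull << 5) /* route physical SError to EL2 */
#define VEC_CUR_SP0_SERROR    0x180u
#define VEC_CUR_SPX_SERROR    0x380u
#define VEC_LOWER_A64_SERROR  0x580u

enum serror_action {
    SERROR_DEFER,   /* do not enter EL2's vector now */
    SERROR_DELIVER, /* ERET to VBAR_EL2 + *vec_off */
};

/*
 * Decide whether EL3 firmware may hand an SError to the EL2 vector.
 * spsr_el3 is the state of the context that was interrupted.
 */
static enum serror_action route_serror(uint64_t spsr_el3, uint64_t hcr_el2,
                                       uint32_t *vec_off)
{
    unsigned int el = (spsr_el3 >> 2) & 0x3u;

    if (el == 2) {
        /* EL2 masked SError because it cannot take one right now. */
        if (spsr_el3 & SPSR_A_BIT)
            return SERROR_DEFER;
        *vec_off = (spsr_el3 & 1u) ? VEC_CUR_SPX_SERROR
                                   : VEC_CUR_SP0_SERROR;
        return SERROR_DELIVER;
    }

    if (el < 2 && (hcr_el2 & HCR_EL2_AMO)) {
        /* From EL0/EL1 with AMO set, SError is routed to EL2. */
        *vec_off = VEC_LOWER_A64_SERROR;
        return SERROR_DELIVER;
    }

    /* Routed to EL1 (or came from EL3): outside this sketch. */
    return SERROR_DEFER;
}
```

What "defer" means in practice (for example leaving the error pending until the mask is cleared) is a firmware design decision that this sketch does not attempt to answer.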
Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
>> What you may be seeing is some awkwardness with the change in the SError ESR
>> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
>> were all impdef so it didn't matter).
>> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this
>> means 'classified as a RAS error ... unknown!'.
>>
>> I have a patch in the upcoming SError/RAS series that changes KVMs
>> virtual-abort code to specify an impdef ESR for this path.

> Before I remember Marc and you suggest me specify the an impdef ESR (set the
> vsesr_el2) in the userspace,

Here are links to Marc's proposal and your suggestion to set VSESR_EL2 (that is,
to specify the virtual SError syndrome) from user space:

https://lkml.org/lkml/2017/3/20/441
https://lkml.org/lkml/2017/3/20/516

> now do you want to specify an impdef ESR in KVM instead of usrspace?
> if setting the vsesr_el2 in the KVM, whether user-space need to specify?
> May be we can combine the patches that specify an impdef ESR(set vsesr_el2)
> patch to one.
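To make the "all-zeros is a bad choice" remark concrete, here is a minimal sketch of how an SError syndrome value reads under the v8.2 RAS encoding. The macro and function names are illustrative and not taken from the kernel or from any posted patch. The layout assumed is the architectural one for an SError ISS: bit 24 (IDS) set means the remaining bits are implementation defined; with IDS clear, DFSC in bits [5:0] is either 0b010001 (asynchronous SError, with a severity in AET, bits [12:10]) or 0 (uncategorized).

```c
#include <stdint.h>

#define SERROR_ISS_IDS        (UINT64_C(1) << 24) /* impdef syndrome follows */
#define SERROR_ISS_AET_SHIFT  10
#define SERROR_ISS_AET_MASK   UINT64_C(0x7)
#define SERROR_ISS_DFSC_MASK  UINT64_C(0x3f)
#define SERROR_DFSC_ASYNC     UINT64_C(0x11)     /* asynchronous SError */

/* Hypothetical classification of a syndrome a guest would observe. */
enum serror_kind {
    SERROR_IMPDEF,         /* IDS set: pre-v8.2 style, no RAS meaning */
    SERROR_UNCATEGORIZED,  /* architected format, no severity given */
    SERROR_RAS_CLASSIFIED  /* architected format, severity in AET */
};

enum serror_kind classify_serror_iss(uint64_t iss)
{
    if (iss & SERROR_ISS_IDS)
        return SERROR_IMPDEF;
    if ((iss & SERROR_ISS_DFSC_MASK) != SERROR_DFSC_ASYNC)
        return SERROR_UNCATEGORIZED;
    return SERROR_RAS_CLASSIFIED;
}

/* Severity field; only meaningful for SERROR_RAS_CLASSIFIED. */
unsigned int serror_aet(uint64_t iss)
{
    return (unsigned int)((iss >> SERROR_ISS_AET_SHIFT) & SERROR_ISS_AET_MASK);
}

/* A value a hypervisor could program to keep the old impdef behaviour. */
uint64_t vsesr_impdef(void)
{
    return SERROR_ISS_IDS;
}
```

Under these assumptions an all-zeros value is reported in the architected format but with no usable category, whereas `vsesr_impdef()` is reported as implementation defined.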
Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
Hi James, Sorry for my late response, thank you very much for comments. On 2017/9/23 0:51, James Morse wrote: [.] >> >> CC Achin >> >> I have some personal opinion, if you think it is not right, hope you can >> point out. >> >> Synchronous External Abort and SError Interrupt are hardware >> exception(hardware concept), >> which is independent of software notification, >> in armv8 without RAS, the two concepts already exist. In the APEI spec, in >> order to >> better describe the two exceptions, so use SEA and SEI notification to stand > for them. > >> SEA notification stands for Synchronous External Abort, so may be it is not >> only a >> notification, it also stands for a hardware error type. >> SEI notification stands for SError Interrupt, so may be it is not only a >> notification, >> it also stands for a hardware error type. > >> In the OS, it has different handling flow to the two exception(two >> notification): >> when the guest OS running, if the hardware generates a Synchronous External >> Abort, we >> told the guest OS this error is SError Interrupt instead of Synchronous > External Abort. > > This should only happen when APEI doesn't claim the external-abort as a RAS > notification. If there were CPER records to process then the error is handled > by > the host, and we can return to the guest. consider again. I think you should be right. In the firmware-first solution, firmware will shield all kinds of errors and record them to the CPER buffer. > > If this wasn't a firmware-first notification, then you're right KVM hands the > guest an asynchronous external abort. This could be considered a bug in KVM. > (we > can discuss with Marc and Christoffer what it should do), but: > > I'm not sure what scenario you could see this in: surely all your > CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first > notifications. So they should always be claimed by APEI. Yes, if it is firmware-first we should not exist such issue. 
> > >> guest OS uses SEI notification handling flow to deal with it, I am not sure >> whether it >> will have problem, because the true hardware exception is Synchronous >> External >> Abort, but software treats it as SError interrupt to handle. > > Once you're into a guest the original 'true hardware exception' shouldn't > matter. In this scenario KVM has handed the guest an SError, our question is > 'is > it an SEI notification?': > > For firmware first the guest OS should poke around in the CPER buffers, find > nothing to do, and return to the arch code for (future) kernel-first. > For kernel first the guest OS should trawl through the v8.2 ERR registers, > find > nothing to do, and continue to the default case: > > By default, we should panic on SError, unless its classified as a non-fatal > RAS > error. (I'm tempted to pr_warn_once() if we get RAS notifications but there is > no work to do). understand, thanks. > > > What you may be seeing is some awkwardness with the change in the SError ESR > with v8.2. Previously the VSE mechanism injected an impdef SError, (but they > were all impdef so it didn't matter). > With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this > means 'classified as a RAS error ... unknown!'. > > I have a patch in the upcoming SError/RAS series that changes KVMs > virtual-abort > code to specify an impdef ESR for this path. Before I remember Marc and you suggest me specify the an impdef ESR (set the vsesr_el2) in the userspace, now do you want to specify an impdef ESR in KVM instead of usrspace? if setting the vsesr_el2 in the KVM, whether user-space need to specify? May be we can combine the patches that specify an impdef ESR(set vsesr_el2) patch to one. > > >> In the mainline code, it does not have SEI notification support, the reason >> I >> think it is because of the error address record by firmware is not accurate >> (SError Interrupt is asynchronous exception). 
> > Yes, while we don't expect a FAR with an SError, but we do expect a valid > representation of the RAS error in either the CPER records or the v8.2. ERR > registers (or both). If we have neither of those, its not a RAS error and we > should panic. > > >> so if treat a hardware Synchronous External Abort as SError interrupt(SEI). >> The default OS behavior for SEI is PANIC, that is to say, when hardware >> triggers >> a Synchronous External Abort(SEA), if guest treat it as SError >> interrupt(SEI), >> the OS will be panic. in fact, it can be recoverable instead of Panic. > > If its a RAS error APEI (or in the future, the kernel-first handler), should > claim the error, so the guest never sees it. If you are hitting this behaviour > in KVM, then it wasn't a RAS error. > > >> I ever added a patch to support the SEI notification, but not sure whether >> it is can be accepted by open source, until now, not receive response. > > The patch you posted during the merge window made no sense on its own
Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
Hi gengdongjiu, On 21/09/17 08:55, gengdongjiu wrote: > On 2017/9/14 21:00, James Morse wrote: >> user-space can choose whether to use SEA or SEI, it doesn't have to choose >> the >> same notification type that firmware used, which in turn doesn't have to be >> the >> same as that used by the CPU to notify firmware. >> >> The choice only matters because these notifications hang on an existing >> pieces >> of the Arm-architecture, so the notification can only add to the >> architecturally >> defined meaning. (i.e. You can only send an SEA for something that can >> already >> be described as a synchronous external abort). >> >> Once we get to user-space, for memory_failure() notifications, (which so far >> is >> all we are talking about here), the only thing that could matter is whether >> the >> guest hit a PG_hwpoison page as a stage2 fault. These can be described as >> Synchronous-External-Abort. >> >> The Synchronous-External-Abort/SError-Interrupt distinction matters for the >> CPU >> because it can't always make an error synchronous. For memory_failure() >> notifications to a KVM guest we really can do this, and we already have this >> behaviour for free. An example: >> >> A guest touches some hardware:poisoned memory, for whatever reason the CPU >> can't >> put the world back together to make this a synchronous exception, so it >> reports >> it to firmware as an SError-interrupt. >> Linux gets an APEI notification and memory_failure() causes the affected >> page to >> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space. >> >> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. >> AO-> >> action optional, probably asynchronous. >> >> But in our example it wasn't really asynchronous, that was just a property of >> the original CPU->firmware notification. What happens? 
The guest vcpu is >> re-run, >> it re-runs the same instructions (this was a contained error so KVM's ELR >> points >> at/before the instruction that steps in the problem). This time KVM takes a >> stage2 fault, which the mm code will refuse to fixup because the relevant >> page >> was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with >> SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA. > > CC Achin > > I have some personal opinion, if you think it is not right, hope you can > point out. > > Synchronous External Abort and SError Interrupt are hardware > exception(hardware concept), > which is independent of software notification, > in armv8 without RAS, the two concepts already exist. In the APEI spec, in > order to > better describe the two exceptions, so use SEA and SEI notification to stand for them. > SEA notification stands for Synchronous External Abort, so may be it is not > only a > notification, it also stands for a hardware error type. > SEI notification stands for SError Interrupt, so may be it is not only a > notification, > it also stands for a hardware error type. > In the OS, it has different handling flow to the two exception(two > notification): > when the guest OS running, if the hardware generates a Synchronous External > Abort, we > told the guest OS this error is SError Interrupt instead of Synchronous External Abort. This should only happen when APEI doesn't claim the external-abort as a RAS notification. If there were CPER records to process then the error is handled by the host, and we can return to the guest. If this wasn't a firmware-first notification, then you're right KVM hands the guest an asynchronous external abort. This could be considered a bug in KVM. (we can discuss with Marc and Christoffer what it should do), but: I'm not sure what scenario you could see this in: surely all your CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first notifications. 
So they should always be claimed by APEI. > guest OS uses SEI notification handling flow to deal with it, I am not sure > whether it > will have problem, because the true hardware exception is Synchronous External > Abort, but software treats it as SError interrupt to handle. Once you're into a guest the original 'true hardware exception' shouldn't matter. In this scenario KVM has handed the guest an SError, our question is 'is it an SEI notification?': For firmware first the guest OS should poke around in the CPER buffers, find nothing to do, and return to the arch code for (future) kernel-first. For kernel first the guest OS should trawl through the v8.2 ERR registers, find nothing to do, and continue to the default case: By default, we should panic on SError, unless its classified as a non-fatal RAS error. (I'm tempted to pr_warn_once() if we get RAS notifications but there is no work to do). What you may be seeing is some awkwardness with the change in the SError ESR with v8.2. Previously the VSE mechanism injected an impdef SError, (but they were all impdef so it didn't matter). With VSESR_EL2 KVM has to spe
Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
Hi gengdongjiu, On 18/09/17 14:36, gengdongjiu wrote: > On 2017/9/14 21:00, James Morse wrote: >> On 13/09/17 08:32, gengdongjiu wrote: >>> On 2017/9/8 0:30, James Morse wrote: On 28/08/17 11:38, Dongjiu Geng wrote: For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by an access or not. >> >> Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered >> via >> some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is >> x86's kernel-first handling, which nicely matches this 'direct access' >> problem. >> BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). >> Powerpc >> also triggers these directly, both from what look to be synchronous paths, >> so I >> think its fair to equate BUS_MCEERR_AR to a synchronous access and >> BUS_MCEERR_AO >> to something_else. > > James, thanks for your explanation. > can I understand that your meaning that "BUS_MCEERR_AR" stands for > synchronous access and BUS_MCEERR_AO stands for asynchronous access? Not 'stands for', as the AR is Action-Required and AO Action-Optional. My point was I can't find a case where Action-Required is used for an error that isn't synchronous. We should run this past the people who maintain the existing BUS_MCEERR_AR users, in case its just a severity to them. > Then for "BUS_MCEERR_AO", how to distinguish it is asynchronous data > access(SError) and PCIE AER error? How would userspace get one of these memory errors for a PCIe error? > In the user space, we can check the si_code, if it is "BUS_MCEERR_AR", we use > SEA notification type for the guest; > if it is "BUS_MCEERR_AO", we use SEI notification type for the guest. > Because there are only two values for si_code("BUS_MCEERR_AR" and > BUS_MCEERR_AO), in which case we can use the GSIV(IRQ) notification type? This is for Qemu/kvmtool to decide, it depends on what sort of machine they are emulating. 
For example, the physical machine's memory-controller may notify the CPU about memory errors by triggering SError trapped to EL3, or with a dedicated FIQ, also routed to EL3. By the time this gets to the host kernel the distinction doesn't matter. The host has handled the error. For a guest, your memory-controller is effectively the host kernel. It will give you an BUS_MCEERR_AO signal for any guest memory that is affected, and a BUS_MCEERR_AR if the guest directly accesses a page of affected memory. What Qemu/kvmtool do with this is up to them. If they're emulating a machine with no RAS features, printing an error and exit. Otherwise BUS_MCEERR_AR could be notified as one of the flavours of IRQ, unless the affected vcpu has interrupts masked, in which case an SEA notification gives you some NMI-like behaviour. For BUS_MCEERR_AO you could use SEI, IRQ or polled notification. My choice would be IRQ, as you can't know if the guest supports SEI and it would be a shame to kill it with an SError if the affected memory was free. SEA for synchronous errors is still a good choice even if the guest doesn't support it as that memory is still gone so its still a valid guest:Synchronous-external-abort. [...] >>> 1. Let us firstly discuss the SEA and SEI, there are different workflow for >>> the two different Errors. >> user-space can choose whether to use SEA or SEI, it doesn't have to choose >> the >> same notification type that firmware used, which in turn doesn't have to be >> the >> same as that used by the CPU to notify firmware. >> >> The choice only matters because these notifications hang on an existing >> pieces >> of the Arm-architecture, so the notification can only add to the >> architecturally >> defined meaning. (i.e. You can only send an SEA for something that can >> already >> be described as a synchronous external abort). 
>> >> Once we get to user-space, for memory_failure() notifications, (which so far >> is >> all we are talking about here), the only thing that could matter is whether >> the >> guest hit a PG_hwpoison page as a stage2 fault. These can be described as >> Synchronous-External-Abort. >> >> The Synchronous-External-Abort/SError-Interrupt distinction matters for the >> CPU >> because it can't always make an error synchronous. For memory_failure() >> notifications to a KVM guest we really can do this, and we already have this >> behaviour for free. An example: >> >> A guest touches some hardware:poisoned memory, for whatever reason the CPU >> can't >> put the world back together to make this a synchronous exception, so it >> reports >> it to firmware as an SError-interrupt. > >> Linux gets an APEI notification and memory_failure() causes the affected >> page to >> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space. >> >> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. >> AO-> >> action optional, probably asynchronous. > If so, in this case, Qemu/kvmtool only got a little information(receive a > SI
Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
Hi James On 2017/9/14 21:00, James Morse wrote: > Hi gengdongjiu, > user-space can choose whether to use SEA or SEI, it doesn't have to choose the > same notification type that firmware used, which in turn doesn't have to be > the > same as that used by the CPU to notify firmware. > > The choice only matters because these notifications hang on an existing pieces > of the Arm-architecture, so the notification can only add to the > architecturally > defined meaning. (i.e. You can only send an SEA for something that can already > be described as a synchronous external abort). > > Once we get to user-space, for memory_failure() notifications, (which so far > is > all we are talking about here), the only thing that could matter is whether > the > guest hit a PG_hwpoison page as a stage2 fault. These can be described as > Synchronous-External-Abort. > > The Synchronous-External-Abort/SError-Interrupt distinction matters for the > CPU > because it can't always make an error synchronous. For memory_failure() > notifications to a KVM guest we really can do this, and we already have this > behaviour for free. An example: > > A guest touches some hardware:poisoned memory, for whatever reason the CPU > can't > put the world back together to make this a synchronous exception, so it > reports > it to firmware as an SError-interrupt. > Linux gets an APEI notification and memory_failure() causes the affected page > to > be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space. > > Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO-> > action optional, probably asynchronous. > > But in our example it wasn't really asynchronous, that was just a property of > the original CPU->firmware notification. What happens? The guest vcpu is > re-run, > it re-runs the same instructions (this was a contained error so KVM's ELR > points > at/before the instruction that steps in the problem). 
This time KVM takes a > stage2 fault, which the mm code will refuse to fixup because the relevant page > was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with > SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA. CC Achin I have some personal opinion, if you think it is not right, hope you can point out. Synchronous External Abort and SError Interrupt are hardware exception(hardware concept), which is independent of software notification, in armv8 without RAS, the two concepts already exist. In the APEI spec, in order to better describe the two exceptions, so use SEA and SEI notification to stand for them. SEA notification stands for Synchronous External Abort, so may be it is not only a notification, it also stands for a hardware error type. SEI notification stands for SError Interrupt, so may be it is not only a notification, it also stands for a hardware error type. In the OS, it has different handling flow to the two exception(two notification): when the guest OS running, if the hardware generates a Synchronous External Abort, we told the guest OS this error is SError Interrupt instead of Synchronous External Abort. guest OS uses SEI notification handling flow to deal with it, I am not sure whether it will have problem, because the true hardware exception is Synchronous External Abort, but software treats it as SError interrupt to handle. In the mainline code, it does not have SEI notification support, the reason I think it is because of the error address record by firmware is not accurate(SError Interrupt is asynchronous exception). so if treat a hardware Synchronous External Abort as SError interrupt(SEI). The default OS behavior for SEI is PANIC, that is to say, when hardware triggers a Synchronous External Abort(SEA), if guest treat it as SError interrupt(SEI), the OS will be panic. in fact, it can be recoverable instead of Panic. 
I previously posted a patch to support the SEI notification, but I am not sure whether it can be accepted by the open-source community; so far I have not received a response.
Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
James, Thanks for your comments, hope we can make the solution better. On 2017/9/14 21:00, James Morse wrote: > Hi gengdongjiu, > > (re-ordered hunks) > > On 13/09/17 08:32, gengdongjiu wrote: >> On 2017/9/8 0:30, James Morse wrote: >>> On 28/08/17 11:38, Dongjiu Geng wrote: >>> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by >>> an access or not. > > Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered > via > some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is > x86's kernel-first handling, which nicely matches this 'direct access' > problem. > BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). > Powerpc > also triggers these directly, both from what look to be synchronous paths, so > I > think its fair to equate BUS_MCEERR_AR to a synchronous access and > BUS_MCEERR_AO > to something_else. James, thanks for your explanation. can I understand that your meaning that "BUS_MCEERR_AR" stands for synchronous access and BUS_MCEERR_AO stands for asynchronous access? Then for "BUS_MCEERR_AO", how to distinguish it is asynchronous data access(SError) and PCIE AER error? In the user space, we can check the si_code, if it is "BUS_MCEERR_AR", we use SEA notification type for the guest; if it is "BUS_MCEERR_AO", we use SEI notification type for the guest. Because there are only two values for si_code("BUS_MCEERR_AR" and BUS_MCEERR_AO), in which case we can use the GSIV(IRQ) notification type? > > I don't think we need anything else. > > >>> When the mm code gets -EHWPOISON when trying to resolve a >> >> Because of that, so I allow userspace getting exception information > > ... and there are cases where you can't get the exception information, and > other > cases where it wasn't an exception at all. > > [...] > > >>> What happens if the dram-scrub hardware spots an error in guest memory, but >>> the guest wasn't running? KVM won't have a relevant ESR value to give you. 
> >> if the dram-scrub hardware spots an error in guest memory, it will generate >> IRQ in DDR controller, not SEA or SEI exception. I still do not consider the >> GSIV. For GSIV, may be we can only handle it in the host OS. > > Great example: this IRQ pulls us out of a guest, we tromp through APEI and > then > memory_failure(), the memory happened to belong to the same guest > (coincidence!), we send it some signal and now its user-space's problem. > > Your KVM_REG_ARM64_FAULT mechanism is going to return stale data, even though > the notification interrupted the guest, and it was guest memory that was > affected. KVM doesn't have a relevant ESR. > > > I'm strongly against exposing 'which notification type' this error originally > came from because: > * it doesn't matter once we've got the CPER records, > * there isn't always an answer (there are/will-be other ways of tripping > memory_failure()) > * it creates ABI between firwmare, host userspace and guest userspace. > Firmware's choice of notification type shouldn't affect anything other than > the host kernel. > > > On 13/09/17 08:32, gengdongjiu wrote: >> On 2017/9/8 0:30, James Morse wrote: >>> On 28/08/17 11:38, Dongjiu Geng wrote: when userspace gets SIGBUS signal, it does not know whether this is a synchronous external abort or SError, >>> >>> Why would Qemu/kvmtool need to know if the original notification (if there >>> was >>> one) was synchronous or asynchronous? This is between firmware and the >>> kernel. > >> there are two reasons: >> >> 1. Let us firstly discuss the SEA and SEI, there are different workflow for >> the two different Errors. >> 2. when record the CPER in the user space, it needs to know the error type, >> because SEA and SEI are different Error source, >>so they have different offset in the APEI table, that is to say they will >> be recorded to different place of the APEI table. 
> > user-space can choose whether to use SEA or SEI, it doesn't have to choose the > same notification type that firmware used, which in turn doesn't have to be > the > same as that used by the CPU to notify firmware. > > The choice only matters because these notifications hang on an existing pieces > of the Arm-architecture, so the notification can only add to the > architecturally > defined meaning. (i.e. You can only send an SEA for something that can already > be described as a synchronous external abort). > > Once we get to user-space, for memory_failure() notifications, (which so far > is > all we are talking about here), the only thing that could matter is whether > the > guest hit a PG_hwpoison page as a stage2 fault. These can be described as > Synchronous-External-Abort. > > The Synchronous-External-Abort/SError-Interrupt distinction matters for the > CPU > because it can't always make an error synchronous. For memory_failure() > notifications to a KVM guest we really can do this, and we already have this > behaviour for
Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
Hi gengdongjiu, (re-ordered hunks) On 13/09/17 08:32, gengdongjiu wrote: > On 2017/9/8 0:30, James Morse wrote: >> On 28/08/17 11:38, Dongjiu Geng wrote: >> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by >> an access or not. Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered via some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is x86's kernel-first handling, which nicely matches this 'direct access' problem. BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). Powerpc also triggers these directly, both from what look to be synchronous paths, so I think its fair to equate BUS_MCEERR_AR to a synchronous access and BUS_MCEERR_AO to something_else. I don't think we need anything else. >> When the mm code gets -EHWPOISON when trying to resolve a > > Because of that, so I allow userspace getting exception information ... and there are cases where you can't get the exception information, and other cases where it wasn't an exception at all. [...] >> What happens if the dram-scrub hardware spots an error in guest memory, but >> the guest wasn't running? KVM won't have a relevant ESR value to give you. > if the dram-scrub hardware spots an error in guest memory, it will generate > IRQ in DDR controller, not SEA or SEI exception. I still do not consider the > GSIV. For GSIV, may be we can only handle it in the host OS. Great example: this IRQ pulls us out of a guest, we tromp through APEI and then memory_failure(), the memory happened to belong to the same guest (coincidence!), we send it some signal and now its user-space's problem. Your KVM_REG_ARM64_FAULT mechanism is going to return stale data, even though the notification interrupted the guest, and it was guest memory that was affected. KVM doesn't have a relevant ESR. 
I'm strongly against exposing 'which notification type' this error originally came from because: * it doesn't matter once we've got the CPER records, * there isn't always an answer (there are/will-be other ways of tripping memory_failure()) * it creates ABI between firwmare, host userspace and guest userspace. Firmware's choice of notification type shouldn't affect anything other than the host kernel. On 13/09/17 08:32, gengdongjiu wrote: > On 2017/9/8 0:30, James Morse wrote: >> On 28/08/17 11:38, Dongjiu Geng wrote: >>> when userspace gets SIGBUS signal, it does not know whether >>> this is a synchronous external abort or SError, >> >> Why would Qemu/kvmtool need to know if the original notification (if there >> was >> one) was synchronous or asynchronous? This is between firmware and the >> kernel. > there are two reasons: > > 1. Let us firstly discuss the SEA and SEI, there are different workflow for > the two different Errors. > 2. when record the CPER in the user space, it needs to know the error type, > because SEA and SEI are different Error source, >so they have different offset in the APEI table, that is to say they will > be recorded to different place of the APEI table. user-space can choose whether to use SEA or SEI, it doesn't have to choose the same notification type that firmware used, which in turn doesn't have to be the same as that used by the CPU to notify firmware. The choice only matters because these notifications hang on an existing pieces of the Arm-architecture, so the notification can only add to the architecturally defined meaning. (i.e. You can only send an SEA for something that can already be described as a synchronous external abort). Once we get to user-space, for memory_failure() notifications, (which so far is all we are talking about here), the only thing that could matter is whether the guest hit a PG_hwpoison page as a stage2 fault. These can be described as Synchronous-External-Abort. 
The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU because it can't always make an error synchronous. For memory_failure() notifications to a KVM guest we really can do this, and we already have this behaviour for free. An example: A guest touches some hardware:poisoned memory, for whatever reason the CPU can't put the world back together to make this a synchronous exception, so it reports it to firmware as an SError-interrupt. Linux gets an APEI notification and memory_failure() causes the affected page to be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space. Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO-> action optional, probably asynchronous. But in our example it wasn't really asynchronous, that was just a property of the original CPU->firmware notification. What happens? The guest vcpu is re-run, it re-runs the same instructions (this was a contained error so KVM's ELR points at/before the instruction that steps in the problem). This time KVM takes a stage2 fault, which the mm code will refuse to fixup because the relevant page was marked as PG_hwpoisi
Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
Hi James,

On 2017/9/8 0:30, James Morse wrote:
> Hi Dongjiu Geng,
>
> On 28/08/17 11:38, Dongjiu Geng wrote:
>> when userspace gets SIGBUS signal, it does not know whether
>> this is a synchronous external abort or SError,
>
> Why would Qemu/kvmtool need to know if the original notification (if there was
> one) was synchronous or asynchronous? This is between firmware and the kernel.

There are two reasons:

1. First, SEA and SEI have different handling workflows, because they are two
   different kinds of error.
2. When user space records the CPER, it needs to know the error type. SEA and
   SEI are different error sources, so they have different offsets in the APEI
   table; that is, they are recorded in different places in the APEI table.

[Diagram: layout of the two blobs "etc/acpi/tables" and "etc/hardware_errors".
 etc/acpi/tables holds the HEST, which contains the entries GHES0, GHES1, ...,
 GHES10. Each GHES entry has the fields error_status_address,
 read_ack_register, read_ack_preserve and read_ack_write.
 etc/hardware_errors starts with an "address registers" area containing
 status_address0 .. status_address10 followed by ack_value0 .. ack_value10,
 and then Error Status Data Block 0 .. Error Status Data Block 10, each
 holding CPER records.
 Arrows: each GHESn error_status_address points to status_addressN, each
 read_ack_register points to ack_valueN, and each status_addressN points to
 Error Status Data Block N.]

> I think I can see why you need this: to choose whether to emulate SEA or SEI,

Emulating SEA or SEI is one reason. The other reason is that the CPER will be
recorded in a different place in APEI.

> but what if the guest wasn't running? Or the guest was running, but it wasn't
> guest-memory that is affected.

If the guest was not running, host firmware will notify the EL1 host kernel
directly to handle the error and will not notify the hypervisor. Only when the
guest was running can host firmware notify the hypervisor of the error.

If user space is Qemu, the error comes from Qemu itself, and guest memory is
not involved, I will not handle it. Please see the code for arm64:

    void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
    {
        ram_addr_t ram_addr;
        hwaddr paddr;
        ARMCPU *cpu = ARM_CPU(c);
        CPUARMState *env = &cpu->env;

        assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
        if (addr) {
            ram_addr = qemu_ram_addr_from_host(addr);
            if (ram_addr != RAM_ADDR_INVALID &&
                kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
                kvm_cpu_synchronize_state(c);
                kvm_hwpoison_page_add(ram_addr);
                if (is_a
Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace
Hi Dongjiu Geng,

On 28/08/17 11:38, Dongjiu Geng wrote:
> when userspace gets SIGBUS signal, it does not know whether
> this is a synchronous external abort or SError,

Why would Qemu/kvmtool need to know if the original notification (if there was
one) was synchronous or asynchronous? This is between firmware and the kernel.

I think I can see why you need this: to choose whether to emulate SEA or SEI,
but what if the guest wasn't running? Or the guest was running, but it wasn't
guest-memory that is affected.

What happens if the dram-scrub hardware spots an error in guest memory, but the
guest wasn't running? KVM won't have a relevant ESR value to give you.

What happens if we start swapping a page of guest memory to disk, and discover
the memory is corrupt. This is synchronous, but it wasn't the guest, and KVM
still can't give you an ESR.

What about CPER records discovered through the polled interface?

What happens if I write a PFN into the corrupt-pfn sysfs interface?

I think what you need is some way of knowing if the BUS_MCEERR_A* was directly
caused by a user-space (or guest) access, and if so was it a data or instruction
fetch. These can become SEA notifications.

KVM's user-space shouldn't be a special-case where the kernel behaves
differently: if we tinker with this it needs to make sense for all user space
processes and mean something on all architectures.

I think this information could be useful to other users of these signals, e.g. a
JVM could silently regenerate/reload code/data for a non-direct-access fault
instead of exit-ing (or throwing an exception) for a direct access.

For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by an
access or not.

When the mm code gets -EHWPOISON when trying to resolve a user-space fault we
know it was due to a direct-access. (I don't know if/how x86 can know if it was
code or data). Faulting guest accesses through KVM are just a special version of
this where KVM fixes-up stage2.

...
but for any of this to work we need the address of the corrupt memory.
(-> cover letter)

Thanks,

James
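
For readers following the thread, here is a minimal sketch in C of what a user-space process can currently learn from a memory-failure SIGBUS: whether si_code is BUS_MCEERR_AR or BUS_MCEERR_AO, and the affected virtual address in si_addr. It uses only the standard sigaction() interface with SA_SIGINFO. The names struct mce_event, describe_sigbus(), install_sigbus_handler() and example_usage() are hypothetical and exist only for this illustration; they are not taken from Qemu, kvmtool or the kernel. Note that siginfo_t carries no field saying whether the access was a data access or an instruction fetch, which is the missing information discussed above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical summary type for illustration; not a kernel or Qemu structure. */
struct mce_event {
    bool  is_memory_failure; /* si_code is BUS_MCEERR_AR or BUS_MCEERR_AO */
    bool  action_required;   /* true for BUS_MCEERR_AR (action required) */
    void *addr;              /* user-space virtual address reported in si_addr */
};

/* Classify a SIGBUS siginfo_t. Hypothetical helper; it only reads fields. */
void describe_sigbus(const siginfo_t *info, struct mce_event *out)
{
    out->is_memory_failure = false;
    out->action_required = false;
    out->addr = NULL;

    if (info->si_signo != SIGBUS)
        return;
    if (info->si_code != BUS_MCEERR_AR && info->si_code != BUS_MCEERR_AO)
        return; /* e.g. BUS_ADRALN: not a memory-failure report */

    out->is_memory_failure = true;
    out->action_required = (info->si_code == BUS_MCEERR_AR);
    out->addr = info->si_addr;
}

/* Set by the handler: -1 = nothing seen, 0 = AO seen, 1 = AR seen. */
static volatile sig_atomic_t seen_action_required = -1;

static void on_sigbus(int sig, siginfo_t *info, void *ucontext)
{
    struct mce_event ev;

    (void)sig;
    (void)ucontext;
    describe_sigbus(info, &ev);
    if (ev.is_memory_failure)
        seen_action_required = ev.action_required ? 1 : 0;
}

/* Install the handler with SA_SIGINFO so si_code and si_addr are delivered. */
int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* Usage check on a hand-built siginfo_t; no signal is actually raised. */
void example_usage(void)
{
    siginfo_t si;
    struct mce_event ev;

    memset(&si, 0, sizeof(si));
    si.si_signo = SIGBUS;
    si.si_code = BUS_MCEERR_AO;
    si.si_addr = (void *)0x1000;

    describe_sigbus(&si, &ev);
    assert(ev.is_memory_failure && !ev.action_required);
}
```

This is only a sketch: a real handler for BUS_MCEERR_AR would probably need to do more than record the event, since simply returning is likely to re-run the faulting access.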