Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Tue, Oct 06, 2020 at 01:35:27PM -0400, Vivek Goyal wrote: > On Tue, Oct 06, 2020 at 10:17:04AM -0700, Sean Christopherson wrote: > > [..] > > > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back > > > > into the guest is not an option. > > > > > > Not even #MC? So sad :-) > > > > Heh, #MC isn't allowed either, yet... > > If #MC is not allowd, logic related to hwpoison memory will not work > as that seems to inject #MC. Yep. Piggybacking #MC is undesirable for other reasons, e.g. adds a "hardware" dependency for what is really a paravirt feature.
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Tue, Oct 06, 2020 at 10:17:04AM -0700, Sean Christopherson wrote: [..] > > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back > > > into the guest is not an option. > > > > Not even #MC? So sad :-) > > Heh, #MC isn't allowed either, yet... If #MC is not allowd, logic related to hwpoison memory will not work as that seems to inject #MC. Vivek
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Tue, Oct 06, 2020 at 06:39:56PM +0200, Vitaly Kuznetsov wrote: > Sean Christopherson writes: > > > On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote: > >> Vivek Goyal writes: > >> > So you will have to report token (along with -EFAULT) to user space. So > >> > this > >> > is basically the 3rd proposal which is extension of kvm API and will > >> > report say HVA/GFN also to user space along with -EFAULT. > >> > >> Right, I meant to say that guest kernel has full register state of the > >> userspace process which caused APF to get queued and instead of trying > >> to extract it in KVM and pass to userspace in case of a (later) failure > >> we limit KVM api change to contain token or GFN only and somehow keep > >> the rest in the guest. This should help with TDX/SEV-ES. > > > > Whatever gets reported to userspace should be identical with and without > > async page faults, i.e. it definitely shouldn't have token information. > > > > Oh, right, when the error gets reported synchronously guest's kernel is > not yet aware of the issue so it won't be possible to find anything in > its kdump if userspace decides to crash it immediately. The register > state (if available) will be actual though. > > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back > > into the guest is not an option. > > Not even #MC? So sad :-) Heh, #MC isn't allowed either, yet... > > Nor do I think that's "correct" behavior (see everyone's objections to > > using #PF for APF fixed). I.e. the event should probably be an IRQ. > > I recall Paolo objected against making APF 'page not present' into in > interrupt as it will require some very special handling to make sure it > gets injected (and handled) immediately but I'm not really sure how big > the hack is going to be, maybe in the light of TDX/SEV-ES it's worth a > try. This shouldn't have anything to do with APF. Again, the event injection is needed even in the synchronous case as the file truncation in the host can affect existing mappings in the guest. I don't know that the mechanism needs to be virtiofs specific or if there can be a more generic "these PFNs have disappeared", but it's most definitely orthogonal to APF.
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
Sean Christopherson writes: > On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote: >> Vivek Goyal writes: >> > So you will have to report token (along with -EFAULT) to user space. So >> > this >> > is basically the 3rd proposal which is extension of kvm API and will >> > report say HVA/GFN also to user space along with -EFAULT. >> >> Right, I meant to say that guest kernel has full register state of the >> userspace process which caused APF to get queued and instead of trying >> to extract it in KVM and pass to userspace in case of a (later) failure >> we limit KVM api change to contain token or GFN only and somehow keep >> the rest in the guest. This should help with TDX/SEV-ES. > > Whatever gets reported to userspace should be identical with and without > async page faults, i.e. it definitely shouldn't have token information. > Oh, right, when the error gets reported synchronously guest's kernel is not yet aware of the issue so it won't be possible to find anything in its kdump if userspace decides to crash it immediately. The register state (if available) will be actual though. > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back > into the guest is not an option. Not even #MC? So sad :-) > Nor do I think that's "correct" behavior (see everyone's objections to > using #PF for APF fixed). I.e. the event should probably be an IRQ. I recall Paolo objected against making APF 'page not present' into in interrupt as it will require some very special handling to make sure it gets injected (and handled) immediately but I'm not really sure how big the hack is going to be, maybe in the light of TDX/SEV-ES it's worth a try. -- Vitaly
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Tue, Oct 06, 2020 at 09:12:00AM -0700, Sean Christopherson wrote: > On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote: > > Vivek Goyal writes: > > > So you will have to report token (along with -EFAULT) to user space. So > > > this > > > is basically the 3rd proposal which is extension of kvm API and will > > > report say HVA/GFN also to user space along with -EFAULT. > > > > Right, I meant to say that guest kernel has full register state of the > > userspace process which caused APF to get queued and instead of trying > > to extract it in KVM and pass to userspace in case of a (later) failure > > we limit KVM api change to contain token or GFN only and somehow keep > > the rest in the guest. This should help with TDX/SEV-ES. > > Whatever gets reported to userspace should be identical with and without > async page faults, i.e. it definitely shouldn't have token information. > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back > into the guest is not an option. Nor do I think that's "correct" > behavior (see everyone's objections to using #PF for APF fixed). I.e. the > event should probably be an IRQ. I am not sure if IRQ for "Page not Present" works. Will it have some conflicts/issues with other high priority interrupts which can get injected before "Page not present". Vivek
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote: > Vivek Goyal writes: > > So you will have to report token (along with -EFAULT) to user space. So this > > is basically the 3rd proposal which is extension of kvm API and will > > report say HVA/GFN also to user space along with -EFAULT. > > Right, I meant to say that guest kernel has full register state of the > userspace process which caused APF to get queued and instead of trying > to extract it in KVM and pass to userspace in case of a (later) failure > we limit KVM api change to contain token or GFN only and somehow keep > the rest in the guest. This should help with TDX/SEV-ES. Whatever gets reported to userspace should be identical with and without async page faults, i.e. it definitely shouldn't have token information. Note, TDX doesn't allow injection exceptions, so reflecting a #PF back into the guest is not an option. Nor do I think that's "correct" behavior (see everyone's objections to using #PF for APF fixed). I.e. the event should probably be an IRQ.
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
Vivek Goyal writes: > On Tue, Oct 06, 2020 at 04:50:44PM +0200, Vitaly Kuznetsov wrote: >> Vivek Goyal writes: >> >> > On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote: >> >> Vivek Goyal writes: >> >> >> >> > A. Just exit to user space with -EFAULT (using kvm request) and don't >> >> >wait for the accessing task to run on vcpu again. >> >> >> >> What if we also save the required information (RIP, GFN, ...) in the >> >> guest along with the APF token >> > >> > Can you elaborate a bit more on this. You mean save GFN on stack before >> > it starts waiting for PAGE_READY event? >> >> When PAGE_NOT_PRESENT event is injected as #PF (for now) in the guest >> kernel gets all the registers of the userspace process (except for CR2 >> which is replaced with a token). In case it is not trivial to extract >> accessed GFN from this data we can extend the shared APF structure and >> add it there, KVM has it when it queues APF. >> >> > >> >> so in case of -EFAULT we can just 'crash' >> >> the guest and the required information can easily be obtained from >> >> kdump? This will solve the debugging problem even for TDX/SEV-ES (if >> >> kdump is possible there). >> > >> > Just saving additional info in guest will not help because there might >> > be many tasks waiting and you don't know which GFN is problematic one. >> >> But KVM knows which token caused the -EFAULT when we exit to userspace >> (and we can pass this information to it) so to debug the situation you >> take this token and then explore the kdump searching for what's >> associated with this exact token. > > So you will have to report token (along with -EFAULT) to user space. So this > is basically the 3rd proposal which is extension of kvm API and will > report say HVA/GFN also to user space along with -EFAULT. > Right, I meant to say that guest kernel has full register state of the userspace process which caused APF to get queued and instead of trying to extract it in KVM and pass to userspace in case of a (later) failure we limit KVM api change to contain token or GFN only and somehow keep the rest in the guest. This should help with TDX/SEV-ES. -- Vitaly
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Tue, Oct 06, 2020 at 04:50:44PM +0200, Vitaly Kuznetsov wrote: > Vivek Goyal writes: > > > On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote: > >> Vivek Goyal writes: > >> > >> > A. Just exit to user space with -EFAULT (using kvm request) and don't > >> >wait for the accessing task to run on vcpu again. > >> > >> What if we also save the required information (RIP, GFN, ...) in the > >> guest along with the APF token > > > > Can you elaborate a bit more on this. You mean save GFN on stack before > > it starts waiting for PAGE_READY event? > > When PAGE_NOT_PRESENT event is injected as #PF (for now) in the guest > kernel gets all the registers of the userspace process (except for CR2 > which is replaced with a token). In case it is not trivial to extract > accessed GFN from this data we can extend the shared APF structure and > add it there, KVM has it when it queues APF. > > > > >> so in case of -EFAULT we can just 'crash' > >> the guest and the required information can easily be obtained from > >> kdump? This will solve the debugging problem even for TDX/SEV-ES (if > >> kdump is possible there). > > > > Just saving additional info in guest will not help because there might > > be many tasks waiting and you don't know which GFN is problematic one. > > But KVM knows which token caused the -EFAULT when we exit to userspace > (and we can pass this information to it) so to debug the situation you > take this token and then explore the kdump searching for what's > associated with this exact token. So you will have to report token (along with -EFAULT) to user space. So this is basically the 3rd proposal which is extension of kvm API and will report say HVA/GFN also to user space along with -EFAULT. Thanks Vivek
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
Vivek Goyal writes: > On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote: >> Vivek Goyal writes: >> >> > A. Just exit to user space with -EFAULT (using kvm request) and don't >> >wait for the accessing task to run on vcpu again. >> >> What if we also save the required information (RIP, GFN, ...) in the >> guest along with the APF token > > Can you elaborate a bit more on this. You mean save GFN on stack before > it starts waiting for PAGE_READY event? When PAGE_NOT_PRESENT event is injected as #PF (for now) in the guest kernel gets all the registers of the userspace process (except for CR2 which is replaced with a token). In case it is not trivial to extract accessed GFN from this data we can extend the shared APF structure and add it there, KVM has it when it queues APF. > >> so in case of -EFAULT we can just 'crash' >> the guest and the required information can easily be obtained from >> kdump? This will solve the debugging problem even for TDX/SEV-ES (if >> kdump is possible there). > > Just saving additional info in guest will not help because there might > be many tasks waiting and you don't know which GFN is problematic one. But KVM knows which token caused the -EFAULT when we exit to userspace (and we can pass this information to it) so to debug the situation you take this token and then explore the kdump searching for what's associated with this exact token. -- Vitaly
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote: > Vivek Goyal writes: > > > A. Just exit to user space with -EFAULT (using kvm request) and don't > >wait for the accessing task to run on vcpu again. > > What if we also save the required information (RIP, GFN, ...) in the > guest along with the APF token Can you elaborate a bit more on this. You mean save GFN on stack before it starts waiting for PAGE_READY event? > so in case of -EFAULT we can just 'crash' > the guest and the required information can easily be obtained from > kdump? This will solve the debugging problem even for TDX/SEV-ES (if > kdump is possible there). Just saving additional info in guest will not help because there might be many tasks waiting and you don't know which GFN is problematic one. Thanks Vivek
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
Vivek Goyal writes: > A. Just exit to user space with -EFAULT (using kvm request) and don't >wait for the accessing task to run on vcpu again. What if we also save the required information (RIP, GFN, ...) in the guest along with the APF token so in case of -EFAULT we can just 'crash' the guest and the required information can easily be obtained from kdump? This will solve the debugging problem even for TDX/SEV-ES (if kdump is possible there). -- Vitaly
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Mon, Oct 05, 2020 at 09:16:20AM -0700, Sean Christopherson wrote: > On Mon, Oct 05, 2020 at 11:33:18AM -0400, Vivek Goyal wrote: > > On Fri, Oct 02, 2020 at 02:13:14PM -0700, Sean Christopherson wrote: > > Now I have few questions. > > > > - If we exit to user space asynchronously (using kvm request), what debug > > information is in there which tells user which address is bad. I admit > > that even above trace does not seem to be telling me directly which > > address (HVA?) is bad. > > > > But if I take a crash dump of guest, using above information I should > > be able to get to GPA which is problematic. And looking at /proc/iomem > > it should also tell which device this memory region is in. > > > > Also using this crash dump one should be able to walk through virtiofs > > data > > structures and figure out which file and what offset with-in file does > > it belong to. Now one can look at filesystem on host and see file got > > truncated and it will become obvious it can't be faulted in. And then > > one can continue to debug that how did we arrive here. > > > > But if we don't exit to user space synchronously, Only relevant > > information we seem to have is -EFAULT. Apart from that, how does one > > figure out what address is bad, or who tried to access it. Or which > > file/offset does it belong to etc. > > > > I agree that problem is not necessarily in guest code. But by exiting > > synchronously, it gives enough information that one can use crash > > dump to get to bottom of the issue. If we exit to user space > > asynchronously, all this information will be lost and it might make > > it very hard to figure out (if not impossible), what's going on. > > If we want userspace to be able to do something useful, KVM should explicitly > inform userspace about the error, userspace shouldn't simply assume that > -EFAULT means a HVA->PFN lookup failed. I guess that's fine. But for this patch, user space is not doing anything. Its just printing error -EFAULT and dumping guest state (Same as we do in case of synchronous fault). > Userspace also shouldn't have to > query guest state to handle the error, as that won't work for protected guests > guests like SEV-ES and TDX. So qemu would not be able to dump vcpu register state when kvm returns with -EFAULT for the case of SEV-ES and TDX? > > I can think of two options: > > 1. Send a signal, a la kvm_send_hwpoison_signal(). This works because -EHWPOISON is a special kind of error which is different from -EFAULT. For truncation, even kvm gets -EFAULT. if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE)) return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT; if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV)) return -EFAULT; Anyway, if -EFAULT is too generic, and we need something finer grained, that can be looked into when we actually have a method where kvm/qemu injects error into guest. > > 2. Add a userspace exit reason along with a new entry in the run struct, > e.g. that provides the bad GPA, HVA, possibly permissions, etc... This sounds more reasonable to me. That is kvm gives additional information to qemu about failing HVA and GPA with -EFAULT and that can be helpful in debugging a problem. This seems like an extension of KVM API. Even with this, if we want to figure out which file got truncated, we will need to take a dump of guest and try to figure out which file this GPA is currently mapping(By looking at virtiofs data structures). And that becomes little easier if vcpu is running the task which accessed that GPA. Anyway, if we have failing GPA, I think it should be possible to figure out inode even without accessing task being current on vcpu. So we seem to have 3 options. A. Just exit to user space with -EFAULT (using kvm request) and don't wait for the accessing task to run on vcpu again. B. Store error gfn in an hash and exit to user space when task accessing gfn runs again. C. Extend KVM API and report failing HVA/GFN access by guest. And that should allow not having to exit to user space synchronously. Thanks Vivek
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Mon, Oct 05, 2020 at 11:33:18AM -0400, Vivek Goyal wrote: > On Fri, Oct 02, 2020 at 02:13:14PM -0700, Sean Christopherson wrote: > Now I have few questions. > > - If we exit to user space asynchronously (using kvm request), what debug > information is in there which tells user which address is bad. I admit > that even above trace does not seem to be telling me directly which > address (HVA?) is bad. > > But if I take a crash dump of guest, using above information I should > be able to get to GPA which is problematic. And looking at /proc/iomem > it should also tell which device this memory region is in. > > Also using this crash dump one should be able to walk through virtiofs data > structures and figure out which file and what offset with-in file does > it belong to. Now one can look at filesystem on host and see file got > truncated and it will become obvious it can't be faulted in. And then > one can continue to debug that how did we arrive here. > > But if we don't exit to user space synchronously, Only relevant > information we seem to have is -EFAULT. Apart from that, how does one > figure out what address is bad, or who tried to access it. Or which > file/offset does it belong to etc. > > I agree that problem is not necessarily in guest code. But by exiting > synchronously, it gives enough information that one can use crash > dump to get to bottom of the issue. If we exit to user space > asynchronously, all this information will be lost and it might make > it very hard to figure out (if not impossible), what's going on. If we want userspace to be able to do something useful, KVM should explicitly inform userspace about the error, userspace shouldn't simply assume that -EFAULT means a HVA->PFN lookup failed. Userspace also shouldn't have to query guest state to handle the error, as that won't work for protected guests guests like SEV-ES and TDX. I can think of two options: 1. Send a signal, a la kvm_send_hwpoison_signal(). 2. Add a userspace exit reason along with a new entry in the run struct, e.g. that provides the bad GPA, HVA, possibly permissions, etc... > > > > > > To fully handle the situation, the guest needs to remove the bad > > > > > > page from > > > > > > its memory pool. Once the page is offlined, the guest kernel's > > > > > > error > > > > > > handling will kick in when a task accesses the bad page (or nothing > > > > > > ever > > > > > > touches the bad page again and everyone is happy). > > > > > > > > > > This is not really a case of bad page as such. It is more of a page > > > > > gone missing/trucated. And no new user can map it. We just need to > > > > > worry about existing users who already have it mapped. > > > > > > > > What do you mean by "no new user can map it"? Are you talking about > > > > guest > > > > tasks or host tasks? If guest tasks, how would the guest know the page > > > > is > > > > missing and thus prevent mapping the non-existent page? > > > > > > If a new task wants mmap(), it will send a request to virtiofsd/qemu > > > on host. If file has been truncated, then mapping beyond file size > > > will fail and process will get error. So they will not be able to > > > map a page which has been truncated. > > > > Ah. Is there anything that prevents the notification side of things from > > being handled purely within the virtiofs layer? E.g. host notifies the > > guest > > that a file got truncated, virtiofs driver in the guest invokes a kernel API > > to remove the page(s). > > virtiofsd notifications can help a bit but not in all cases. For example, > If file got truncated and guest kernel accesses it immidiately after that, > (before notification arrives), it will hang and notification will not > be able to do much. > > So while notification might be nice to have, but we still will need some > sort of error reporting from kvm. And I'm in full agreement with that. What I'm saying is that resolving the infinite loop is a _bug fix_ and is completely orthogonal to adding a new mechanism to handle the file truncation scenario.
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Fri, Oct 02, 2020 at 02:13:14PM -0700, Sean Christopherson wrote: > On Fri, Oct 02, 2020 at 04:02:14PM -0400, Vivek Goyal wrote: > > On Fri, Oct 02, 2020 at 12:45:18PM -0700, Sean Christopherson wrote: > > > On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote: > > > > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote: > > > > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote: > > > > > I don't think it's necessary to provide userspace with the register > > > > > state of > > > > > the guest task that hit the bad page. Other than debugging, I don't > > > > > see how > > > > > userspace can do anything useful which such information. > > > > > > > > I think debugging is the whole point so that user can figure out which > > > > access by guest task resulted in bad memory access. I would think this > > > > will be important piece of information. > > > > > > But isn't this failure due to a truncation in the host? Why would we care > > > about debugging the guest? It hasn't done anything wrong, has it? Or am > > > I > > > misunderstanding the original problem statement. > > > > I think you understood problem statement right. If guest has right > > context, it just gives additional information who tried to access > > the missing memory page. > > Yes, but it's not actionable, e.g. QEMU can't do anything differently given > a guest RIP. It's useful information for hands-on debug, but the information > can be easily collected through other means when doing hands-on debug. Hi Sean, I tried my patch and truncated file on host before guest did memcpy(). After truncation guest process tried memcpy() on truncated region and kvm exited to user space with -EFAULT. I see following on serial console. I am assuming qemu is printing the state of vcpu. error: kvm run failed Bad address RAX=7fff6e7a9750 RBX= RCX=7f513927e000 RDX=a RSI=7f513927e000 RDI=7fff6e7a9750 RBP=7fff6e7a97b0 RSP=7fff6e7a8 R8 = R9 =0031 R10=7fff6e7a957c R11=6 R12=00401140 R13= R14= R15=0 RIP=7f51391e0547 RFL=00010202 [---] CPL=3 II=0 A20=1 SMM=0 HLT=0 ES = 00c0 CS =0033 00a0fb00 DPL=3 CS64 [-RA] SS =002b 00c0f300 DPL=3 DS [-WA] DS = 00c0 FS = 7f5139246540 00c0 GS = 00c0 LDT= TR =0040 fe3a6000 4087 8b00 DPL=0 TSS64-busy GDT= fe3a4000 007f IDT= fe00 0fff CR0=80050033 CR2=7f513927e004 CR3=00102b5eb805 CR4=00770ee0 DR0= DR1= DR2= DR3= DR6=fffe0ff0 DR7=0400 EFER=0d01 Code=fa 6f 06 c5 fa 6f 4c 16 f0 c5 fa 7f 07 c5 fa 7f 4c 17 f0 c3 <48> 8b 4c 16 3 * I also changed my test program to print source and destination address for memcpy. dst=0x0x7fff6e7a9750 src=0x0x7f513927e000 Here dst matches RDI and src matches RSI. This trace also tells me CPL=3 so a user space access triggered this. Now I have few questions. - If we exit to user space asynchronously (using kvm request), what debug information is in there which tells user which address is bad. I admit that even above trace does not seem to be telling me directly which address (HVA?) is bad. But if I take a crash dump of guest, using above information I should be able to get to GPA which is problematic. And looking at /proc/iomem it should also tell which device this memory region is in. Also using this crash dump one should be able to walk through virtiofs data structures and figure out which file and what offset with-in file does it belong to. Now one can look at filesystem on host and see file got truncated and it will become obvious it can't be faulted in. And then one can continue to debug that how did we arrive here. But if we don't exit to user space synchronously, Only relevant information we seem to have is -EFAULT. Apart from that, how does one figure out what address is bad, or who tried to access it. Or which file/offset does it belong to etc. I agree that problem is not necessarily in guest code. But by exiting synchronously, it gives enough information that one can use crash dump to get to bottom of the issue. If we exit to user space asynchronously, all this information will be lost and it might make it very hard to figure out (if not impossible), what's going on. > > > > > > To fully handle the situation, the guest needs to remove the bad page > > > > > from > > > > > its memory pool. Once the page is offlined, the guest kernel's error > > > > > handling will kick in when
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Fri, Oct 02, 2020 at 04:02:14PM -0400, Vivek Goyal wrote: > On Fri, Oct 02, 2020 at 12:45:18PM -0700, Sean Christopherson wrote: > > On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote: > > > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote: > > > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote: > > > > I don't think it's necessary to provide userspace with the register > > > > state of > > > > the guest task that hit the bad page. Other than debugging, I don't > > > > see how > > > > userspace can do anything useful which such information. > > > > > > I think debugging is the whole point so that user can figure out which > > > access by guest task resulted in bad memory access. I would think this > > > will be important piece of information. > > > > But isn't this failure due to a truncation in the host? Why would we care > > about debugging the guest? It hasn't done anything wrong, has it? Or am I > > misunderstanding the original problem statement. > > I think you understood problem statement right. If guest has right > context, it just gives additional information who tried to access > the missing memory page. Yes, but it's not actionable, e.g. QEMU can't do anything differently given a guest RIP. It's useful information for hands-on debug, but the information can be easily collected through other means when doing hands-on debug. > > > > To fully handle the situation, the guest needs to remove the bad page > > > > from > > > > its memory pool. Once the page is offlined, the guest kernel's error > > > > handling will kick in when a task accesses the bad page (or nothing ever > > > > touches the bad page again and everyone is happy). > > > > > > This is not really a case of bad page as such. It is more of a page > > > gone missing/trucated. And no new user can map it. We just need to > > > worry about existing users who already have it mapped. > > > > What do you mean by "no new user can map it"? Are you talking about guest > > tasks or host tasks? If guest tasks, how would the guest know the page is > > missing and thus prevent mapping the non-existent page? > > If a new task wants mmap(), it will send a request to virtiofsd/qemu > on host. If file has been truncated, then mapping beyond file size > will fail and process will get error. So they will not be able to > map a page which has been truncated. Ah. Is there anything that prevents the notification side of things from being handled purely within the virtiofs layer? E.g. host notifies the guest that a file got truncated, virtiofs driver in the guest invokes a kernel API to remove the page(s). > > > > Note, I'm not necessarily suggesting that QEMU piggyback its #MC > > > > injection > > > > to handle this, but I suspect the resulting behavior will look quite > > > > similar, > > > > e.g. notify the virtiofs driver in the guest, which does some magic to > > > > take > > > > the offending region offline, and then guest tasks get SIGBUS or > > > > whatever. > > > > > > > > I also don't think it's KVM's responsibility to _directly_ handle such a > > > > scenario. As I said in an earlier version, KVM can't possibly know > > > > _why_ a > > > > page fault came back with -EFAULT, only userspace can connect the dots > > > > of > > > > GPA -> HVA -> vm_area_struct -> file -> inject event. KVM definitely > > > > should > > > > exit to userspace on the -EFAULT instead of hanging the guest, but that > > > > can > > > > be done via a new request, as suggested. > > > > > > KVM atleast should have the mechanism to report this back to guest. And > > > we don't have any. There only seems to be #MC stuff for poisoned pages. > > > I am not sure how much we can build on top of #MC stuff to take care > > > of this case. One problem with #MC I found was that it generates > > > synchronous #MC only on load and not store. So all the code is > > > written in such a way that synchronous #MC can happen only on load > > > and hence the error handling. > > > > > > Stores generate different kind of #MC that too asynchronously and > > > caller will not know about it immiditely. But in this case we need > > > to know about error in the context of caller both for loads and stores. > > > > > > Anyway, I agree that this patch does not solve the problem of race > > > free synchronous event inject into the guest. That's something yet > > > to be solved either by #VE or by #MC or by #foo. > > > > > > This patch is only doing two things. > > > > > > - Because we don't have a mechanism to report errors to guest, use > > > the second best method and exit to user space. > > > > Not that it matters at this point, but IMO exiting to userspace is the > > correct behavior, not simply "second best method". Again, KVM by design > > does not know what lies behind any given HVA, and therefore should not be > > making decisions about how to handle bad HVAs. > > > > > - Make behavior consistent between
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Fri, Oct 02, 2020 at 12:45:18PM -0700, Sean Christopherson wrote: > On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote: > > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote: > > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote: > > > > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote: > > > > > Alternatively, what about adding a new KVM request type to handle > > > > > this? > > > > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a > > > > > request. The vCPU then gets kicked and exits to userspace. Before > > > > > exiting > > > > > to userspace, the request handler resets vcpu->arch.apf.error_gfn. > > > > > Bad GFNs > > > > > simply get if error_gfn is "valid", i.e. there's a pending request. > > > > > > > > Sorry, I did not understand the above proposal. Can you please elaborate > > > > a bit more. Part of it is that I don't know much about KVM requests. > > > > Looking at the code it looks like that main loop is parsing if some > > > > kvm request is pending and executing that action. > > > > > > > > Don't we want to make sure that we exit to user space when guest retries > > > > error gfn access again. > > > > > > > In this case once we get -EFAULT, we will still inject page_ready into > > > > guest. And then either same process or a different process might run. > > > > > > > > So when exactly code raises a kvm request. If I raise it right when > > > > I get -EFAULT, then kvm will exit to user space upon next entry > > > > time. But there is no guarantee guest vcpu is running the process which > > > > actually accessed the error gfn. And that probably means that register > > > > state of cpu does not mean much and one can not easily figure out > > > > which task tried to access the bad memory and when. > > > > > > > > That's why we prepare a list of error gfn and only exit to user space > > > > when error_gfn access is retried so that guest vcpu context is correct. > > > > > > > > What am I missing? > > > > > > I don't think it's necessary to provide userspace with the register state > > > of > > > the guest task that hit the bad page. Other than debugging, I don't see > > > how > > > userspace can do anything useful which such information. > > > > I think debugging is the whole point so that user can figure out which > > access by guest task resulted in bad memory access. I would think this > > will be important piece of information. > > But isn't this failure due to a truncation in the host? Why would we care > about debugging the guest? It hasn't done anything wrong, has it? Or am I > misunderstanding the original problem statement. I think you understood problem statement right. If guest has right context, it just gives additional information who tried to access the missing memory page. > > > > To fully handle the situation, the guest needs to remove the bad page from > > > its memory pool. Once the page is offlined, the guest kernel's error > > > handling will kick in when a task accesses the bad page (or nothing ever > > > touches the bad page again and everyone is happy). > > > > This is not really a case of bad page as such. It is more of a page > > gone missing/trucated. And no new user can map it. We just need to > > worry about existing users who already have it mapped. > > What do you mean by "no new user can map it"? Are you talking about guest > tasks or host tasks? If guest tasks, how would the guest know the page is > missing and thus prevent mapping the non-existent page? If a new task wants mmap(), it will send a request to virtiofsd/qemu on host. If file has been truncated, then mapping beyond file size will fail and process will get error. So they will not be able to map a page which has been truncated. > > > > Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection > > > to handle this, but I suspect the resulting behavior will look quite > > > similar, > > > e.g. notify the virtiofs driver in the guest, which does some magic to > > > take > > > the offending region offline, and then guest tasks get SIGBUS or whatever. > > > > > > I also don't think it's KVM's responsibility to _directly_ handle such a > > > scenario. As I said in an earlier version, KVM can't possibly know _why_ > > > a > > > page fault came back with -EFAULT, only userspace can connect the dots of > > > GPA -> HVA -> vm_area_struct -> file -> inject event. KVM definitely > > > should > > > exit to userspace on the -EFAULT instead of hanging the guest, but that > > > can > > > be done via a new request, as suggested. > > > > KVM atleast should have the mechanism to report this back to guest. And > > we don't have any. There only seems to be #MC stuff for poisoned pages. > > I am not sure how much we can build on top of #MC stuff to take care > > of this case. One problem with #MC I found was that it generates > > synchronous #MC only on load and not store. So all
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote: > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote: > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote: > > > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote: > > > > Alternatively, what about adding a new KVM request type to handle this? > > > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a > > > > request. The vCPU then gets kicked and exits to userspace. Before > > > > exiting > > > > to userspace, the request handler resets vcpu->arch.apf.error_gfn. Bad > > > > GFNs > > > > simply get if error_gfn is "valid", i.e. there's a pending request. > > > > > > Sorry, I did not understand the above proposal. Can you please elaborate > > > a bit more. Part of it is that I don't know much about KVM requests. > > > Looking at the code it looks like that main loop is parsing if some > > > kvm request is pending and executing that action. > > > > > > Don't we want to make sure that we exit to user space when guest retries > > > error gfn access again. > > > > > In this case once we get -EFAULT, we will still inject page_ready into > > > guest. And then either same process or a different process might run. > > > > > > So when exactly code raises a kvm request. If I raise it right when > > > I get -EFAULT, then kvm will exit to user space upon next entry > > > time. But there is no guarantee guest vcpu is running the process which > > > actually accessed the error gfn. And that probably means that register > > > state of cpu does not mean much and one can not easily figure out > > > which task tried to access the bad memory and when. > > > > > > That's why we prepare a list of error gfn and only exit to user space > > > when error_gfn access is retried so that guest vcpu context is correct. > > > > > > What am I missing? > > > > I don't think it's necessary to provide userspace with the register state of > > the guest task that hit the bad page. Other than debugging, I don't see how > > userspace can do anything useful which such information. > > I think debugging is the whole point so that user can figure out which > access by guest task resulted in bad memory access. I would think this > will be important piece of information. But isn't this failure due to a truncation in the host? Why would we care about debugging the guest? It hasn't done anything wrong, has it? Or am I misunderstanding the original problem statement. > > To fully handle the situation, the guest needs to remove the bad page from > > its memory pool. Once the page is offlined, the guest kernel's error > > handling will kick in when a task accesses the bad page (or nothing ever > > touches the bad page again and everyone is happy). > > This is not really a case of bad page as such. It is more of a page > gone missing/trucated. And no new user can map it. We just need to > worry about existing users who already have it mapped. What do you mean by "no new user can map it"? Are you talking about guest tasks or host tasks? If guest tasks, how would the guest know the page is missing and thus prevent mapping the non-existent page? > > Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection > > to handle this, but I suspect the resulting behavior will look quite > > similar, > > e.g. notify the virtiofs driver in the guest, which does some magic to take > > the offending region offline, and then guest tasks get SIGBUS or whatever. > > > > I also don't think it's KVM's responsibility to _directly_ handle such a > > scenario. As I said in an earlier version, KVM can't possibly know _why_ a > > page fault came back with -EFAULT, only userspace can connect the dots of > > GPA -> HVA -> vm_area_struct -> file -> inject event. KVM definitely should > > exit to userspace on the -EFAULT instead of hanging the guest, but that can > > be done via a new request, as suggested. > > KVM atleast should have the mechanism to report this back to guest. And > we don't have any. There only seems to be #MC stuff for poisoned pages. > I am not sure how much we can build on top of #MC stuff to take care > of this case. One problem with #MC I found was that it generates > synchronous #MC only on load and not store. So all the code is > written in such a way that synchronous #MC can happen only on load > and hence the error handling. > > Stores generate different kind of #MC that too asynchronously and > caller will not know about it immiditely. But in this case we need > to know about error in the context of caller both for loads and stores. > > Anyway, I agree that this patch does not solve the problem of race > free synchronous event inject into the guest. That's something yet > to be solved either by #VE or by #MC or by #foo. > > This patch is only doing two things. > > - Because we don't have a mechanism to report errors to guest, use > the second best method and exit to user
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote: > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote: > > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote: > > > Alternatively, what about adding a new KVM request type to handle this? > > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a > > > request. The vCPU then gets kicked and exits to userspace. Before > > > exiting > > > to userspace, the request handler resets vcpu->arch.apf.error_gfn. Bad > > > GFNs > > > simply get if error_gfn is "valid", i.e. there's a pending request. > > > > Sorry, I did not understand the above proposal. Can you please elaborate > > a bit more. Part of it is that I don't know much about KVM requests. > > Looking at the code it looks like that main loop is parsing if some > > kvm request is pending and executing that action. > > > > Don't we want to make sure that we exit to user space when guest retries > > error gfn access again. > > > In this case once we get -EFAULT, we will still inject page_ready into > > guest. And then either same process or a different process might run. > > > > So when exactly code raises a kvm request. If I raise it right when > > I get -EFAULT, then kvm will exit to user space upon next entry > > time. But there is no guarantee guest vcpu is running the process which > > actually accessed the error gfn. And that probably means that register > > state of cpu does not mean much and one can not easily figure out > > which task tried to access the bad memory and when. > > > > That's why we prepare a list of error gfn and only exit to user space > > when error_gfn access is retried so that guest vcpu context is correct. > > > > What am I missing? > > I don't think it's necessary to provide userspace with the register state of > the guest task that hit the bad page. Other than debugging, I don't see how > userspace can do anything useful which such information. I think debugging is the whole point so that user can figure out which access by guest task resulted in bad memory access. I would think this will be important piece of information. > > Even if you want to inject an event of some form into the guest, having the > correct context for the event itself is not required. IMO it's perfectly > reasonable for such an event to be asynchronous. > > IIUC, your end goal is to be able to gracefully handle DAX file truncation. > Simply killing the guest task that hit the bad page isn't sufficient, as > nothing prevents a future task from accessing the same bad page. Next task can't even mmap that page mmap will fail. File got truncated, that page does not exist. So sending SIGBUS to task should definitely solve the problem. We also need to solve the issue for guest kernel accessing the page which got truncated on host. In that case we need to use correct memcpy helpers and use exception table magic and return error code to user space. > To fully > handle the situation, the guest needs to remove the bad page from its memory > pool. Once the page is offlined, the guest kernel's error handling will > kick in when a task accesses the bad page (or nothing ever touches the bad > page again and everyone is happy). This is not really a case of bad page as such. It is more of a page gone missing/trucated. And no new user can map it. We just need to worry about existing users who already have it mapped. > > Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection > to handle this, but I suspect the resulting behavior will look quite similar, > e.g. notify the virtiofs driver in the guest, which does some magic to take > the offending region offline, and then guest tasks get SIGBUS or whatever. > > I also don't think it's KVM's responsibility to _directly_ handle such a > scenario. As I said in an earlier version, KVM can't possibly know _why_ a > page fault came back with -EFAULT, only userspace can connect the dots of > GPA -> HVA -> vm_area_struct -> file -> inject event. KVM definitely should > exit to userspace on the -EFAULT instead of hanging the guest, but that can > be done via a new request, as suggested. KVM atleast should have the mechanism to report this back to guest. And we don't have any. There only seems to be #MC stuff for poisoned pages. I am not sure how much we can build on top of #MC stuff to take care of this case. One problem with #MC I found was that it generates synchronous #MC only on load and not store. So all the code is written in such a way that synchronous #MC can happen only on load and hence the error handling. Stores generate different kind of #MC that too asynchronously and caller will not know about it immiditely. But in this case we need to know about error in the context of caller both for loads and stores. Anyway, I agree that this patch does not solve the problem of race free synchronous event inject into the guest. That's something yet to be solved either by
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote: > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote: > > Alternatively, what about adding a new KVM request type to handle this? > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a > > request. The vCPU then gets kicked and exits to userspace. Before exiting > > to userspace, the request handler resets vcpu->arch.apf.error_gfn. Bad GFNs > > simply get if error_gfn is "valid", i.e. there's a pending request. > > Sorry, I did not understand the above proposal. Can you please elaborate > a bit more. Part of it is that I don't know much about KVM requests. > Looking at the code it looks like that main loop is parsing if some > kvm request is pending and executing that action. > > Don't we want to make sure that we exit to user space when guest retries > error gfn access again. > In this case once we get -EFAULT, we will still inject page_ready into > guest. And then either same process or a different process might run. > > So when exactly code raises a kvm request. If I raise it right when > I get -EFAULT, then kvm will exit to user space upon next entry > time. But there is no guarantee guest vcpu is running the process which > actually accessed the error gfn. And that probably means that register > state of cpu does not mean much and one can not easily figure out > which task tried to access the bad memory and when. > > That's why we prepare a list of error gfn and only exit to user space > when error_gfn access is retried so that guest vcpu context is correct. > > What am I missing? I don't think it's necessary to provide userspace with the register state of the guest task that hit the bad page. Other than debugging, I don't see how userspace can do anything useful which such information. Even if you want to inject an event of some form into the guest, having the correct context for the event itself is not required. IMO it's perfectly reasonable for such an event to be asynchronous. IIUC, your end goal is to be able to gracefully handle DAX file truncation. Simply killing the guest task that hit the bad page isn't sufficient, as nothing prevents a future task from accessing the same bad page. To fully handle the situation, the guest needs to remove the bad page from its memory pool. Once the page is offlined, the guest kernel's error handling will kick in when a task accesses the bad page (or nothing ever touches the bad page again and everyone is happy). Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection to handle this, but I suspect the resulting behavior will look quite similar, e.g. notify the virtiofs driver in the guest, which does some magic to take the offending region offline, and then guest tasks get SIGBUS or whatever. I also don't think it's KVM's responsibility to _directly_ handle such a scenario. As I said in an earlier version, KVM can't possibly know _why_ a page fault came back with -EFAULT, only userspace can connect the dots of GPA -> HVA -> vm_area_struct -> file -> inject event. KVM definitely should exit to userspace on the -EFAULT instead of hanging the guest, but that can be done via a new request, as suggested.
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote: > On Thu, Oct 01, 2020 at 05:55:08PM -0400, Vivek Goyal wrote: > > On Mon, Sep 28, 2020 at 09:37:00PM -0700, Sean Christopherson wrote: > > > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: > > > > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, > > > > unsigned long rflags) > > > > } > > > > EXPORT_SYMBOL_GPL(kvm_set_rflags); > > > > > > > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn) > > > > +{ > > > > + BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU)); > > > > + > > > > + return hash_32(gfn & 0x, > > > > order_base_2(ERROR_GFN_PER_VCPU)); > > > > +} > > > > + > > > > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > > > > +{ > > > > + u32 key = kvm_error_gfn_hash_fn(gfn); > > > > + > > > > + /* > > > > +* Overwrite the previous gfn. This is just a hint to do > > > > +* sync page fault. > > > > +*/ > > > > + vcpu->arch.apf.error_gfns[key] = gfn; > > > > +} > > > > + > > > > +/* Returns true if gfn was found in hash table, false otherwise */ > > > > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t > > > > gfn) > > > > +{ > > > > + u32 key = kvm_error_gfn_hash_fn(gfn); > > > > > > Mostly out of curiosity, do we really need a hash? E.g. could we get away > > > with an array of 4 values? 2 values? Just wondering if we can avoid 64*8 > > > bytes per CPU. > > > > We are relying on returning error when guest task retries fault. Fault > > will be retried on same address if same task is run by vcpu after > > "page ready" event. There is no guarantee that same task will be > > run. In theory, this cpu could have a large number of tasks queued > > and run these tasks before the faulting task is run again. Now say > > there are 128 tasks being run and 32 of them have page fault > > errors. Then if we keep 4 values, newer failures will simply > > overwrite older failures and we will keep spinning instead of > > exiting to user space. > > > > That's why this array of 64 gfns and add gfns based on hash. This > > does not completely elimiante the above possibility but chances > > of one hitting this are very very slim. > > But have you actually tried such a scenario? In other words, is there good > justification for burning the extra memory? Its not easy to try and reproduce. So it is all theory at this point of time. If you are worried about memory usage, we can probably reduce the size of hash table. Say from 64, reduce it to 8. I am fine with that. I think initially I had a single error_gfn. But Vitaly had concerns about above scenario, so I implemeted a hash table. I think reducing hash table size to 8 or 16 probaly is a good middle ground. > > Alternatively, what about adding a new KVM request type to handle this? > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a > request. The vCPU then gets kicked and exits to userspace. Before exiting > to userspace, the request handler resets vcpu->arch.apf.error_gfn. Bad GFNs > simply get if error_gfn is "valid", i.e. there's a pending request. Sorry, I did not understand the above proposal. Can you please elaborate a bit more. Part of it is that I don't know much about KVM requests. Looking at the code it looks like that main loop is parsing if some kvm request is pending and executing that action. Don't we want to make sure that we exit to user space when guest retries error gfn access again. In this case once we get -EFAULT, we will still inject page_ready into guest. And then either same process or a different process might run. So when exactly code raises a kvm request. If I raise it right when I get -EFAULT, then kvm will exit to user space upon next entry time. But there is no guarantee guest vcpu is running the process which actually accessed the error gfn. And that probably means that register state of cpu does not mean much and one can not easily figure out which task tried to access the bad memory and when. That's why we prepare a list of error gfn and only exit to user space when error_gfn access is retried so that guest vcpu context is correct. What am I missing? Thanks Vivek > > That would guarantee the error is propagated to userspace, and doesn't lose > any guest information as dropping error GFNs just means the guest will take > more page fault exits. > > > > One thought would be to use the index to handle the case of no error gfns > > > so > > > that the size of the array doesn't affect lookup for the common case, e.g. > > > > We are calculating hash of gfn (used as index in array). So lookup cost > > is not driven by size of array. Its O(1) and not O(N). We just lookup > > at one element in array and if it does not match, we return false. > > > > u32 key = kvm_error_gfn_hash_fn(gfn); > > > > if (vcpu->arch.apf.error_gfns[key] != gfn) > > return 0; > > > > > > >
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Thu, Oct 01, 2020 at 05:55:08PM -0400, Vivek Goyal wrote: > On Mon, Sep 28, 2020 at 09:37:00PM -0700, Sean Christopherson wrote: > > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: > > > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, > > > unsigned long rflags) > > > } > > > EXPORT_SYMBOL_GPL(kvm_set_rflags); > > > > > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn) > > > +{ > > > + BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU)); > > > + > > > + return hash_32(gfn & 0x, order_base_2(ERROR_GFN_PER_VCPU)); > > > +} > > > + > > > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > > > +{ > > > + u32 key = kvm_error_gfn_hash_fn(gfn); > > > + > > > + /* > > > + * Overwrite the previous gfn. This is just a hint to do > > > + * sync page fault. > > > + */ > > > + vcpu->arch.apf.error_gfns[key] = gfn; > > > +} > > > + > > > +/* Returns true if gfn was found in hash table, false otherwise */ > > > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t > > > gfn) > > > +{ > > > + u32 key = kvm_error_gfn_hash_fn(gfn); > > > > Mostly out of curiosity, do we really need a hash? E.g. could we get away > > with an array of 4 values? 2 values? Just wondering if we can avoid 64*8 > > bytes per CPU. > > We are relying on returning error when guest task retries fault. Fault > will be retried on same address if same task is run by vcpu after > "page ready" event. There is no guarantee that same task will be > run. In theory, this cpu could have a large number of tasks queued > and run these tasks before the faulting task is run again. Now say > there are 128 tasks being run and 32 of them have page fault > errors. Then if we keep 4 values, newer failures will simply > overwrite older failures and we will keep spinning instead of > exiting to user space. > > That's why this array of 64 gfns and add gfns based on hash. This > does not completely elimiante the above possibility but chances > of one hitting this are very very slim. But have you actually tried such a scenario? In other words, is there good justification for burning the extra memory? Alternatively, what about adding a new KVM request type to handle this? E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a request. The vCPU then gets kicked and exits to userspace. Before exiting to userspace, the request handler resets vcpu->arch.apf.error_gfn. Bad GFNs simply get if error_gfn is "valid", i.e. there's a pending request. That would guarantee the error is propagated to userspace, and doesn't lose any guest information as dropping error GFNs just means the guest will take more page fault exits. > > One thought would be to use the index to handle the case of no error gfns so > > that the size of the array doesn't affect lookup for the common case, e.g. > > We are calculating hash of gfn (used as index in array). So lookup cost > is not driven by size of array. Its O(1) and not O(N). We just lookup > at one element in array and if it does not match, we return false. > > u32 key = kvm_error_gfn_hash_fn(gfn); > > if (vcpu->arch.apf.error_gfns[key] != gfn) > return 0; > > > > > > ... > > > > u8 error_gfn_head; > > gfn_t error_gfns[ERROR_GFN_PER_VCPU]; > > } apf; > > > > > > if (vcpu->arch.apf.error_gfn_head == 0xff) > > return false; > > > > for (i = 0; i < vcpu->arch.apf.error_gfn_head; i++) { > > if (vcpu->arch.apf.error_gfns[i] == gfn) { > > vcpu->arch.apf.error_gfns[i] = INVALID_GFN; > > return true; > > } > > } > > return true; > > > > Or you could even avoid INVALID_GFN altogether by compacting the array on > > removal. The VM is probably dead anyways, burning a few cycles to clean > > things up is a non-issue. Ditto for slow insertion. > > Same for insertion. Its O(1) and not dependent on size of array. Also > insertion anyway is very infrequent event because it will not be > often that error will happen. I know, I was saying that if we move to an array then we'd technically be subject to O(whatever) search and delete, but that it's irrelevant from a performance perspective because the guest is already toast if it hits a bad GFN. > > > + > > > + if (vcpu->arch.apf.error_gfns[key] != gfn) > > > + return 0; > > > > Should be "return false". > > Will fix. > > Thanks > Vivek >
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Mon, Sep 28, 2020 at 09:37:00PM -0700, Sean Christopherson wrote: > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: > > --- > > arch/x86/include/asm/kvm_host.h | 2 ++ > > arch/x86/kvm/mmu.h | 2 +- > > arch/x86/kvm/mmu/mmu.c | 2 +- > > arch/x86/kvm/x86.c | 54 +++-- > > include/linux/kvm_types.h | 1 + > > 5 files changed, 56 insertions(+), 5 deletions(-) > > > > diff --git a/arch/x86/include/asm/kvm_host.h > > b/arch/x86/include/asm/kvm_host.h > > index be5363b21540..e6f8d3f1a377 100644 > > --- a/arch/x86/include/asm/kvm_host.h > > +++ b/arch/x86/include/asm/kvm_host.h > > @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t > > base_gfn, int level) > > #define KVM_NR_VAR_MTRR 8 > > > > #define ASYNC_PF_PER_VCPU 64 > > +#define ERROR_GFN_PER_VCPU 64 > > Aren't these two related? I.e. wouldn't it make sense to do: Hi, They are related somewhat but they don't have to be same. I think we can accumulate many more error GFNs if this vcpu does not schedule the same task again to retry. > #define ERROR_GFN_PER_VCPU ASYNC_PF_PER_VCPU > > Or maybe even size error_gfns[] to ASYNC_PF_PER_VCPU? Given these two don't have to be same, I kept it separate. And kept the hash size same for now. If one wants, hash can be bigger or smaller down the line. > > > > > enum kvm_reg { > > VCPU_REGS_RAX = __VCPU_REGS_RAX, > > @@ -778,6 +779,7 @@ struct kvm_vcpu_arch { > > unsigned long nested_apf_token; > > bool delivery_as_pf_vmexit; > > bool pageready_pending; > > + gfn_t error_gfns[ERROR_GFN_PER_VCPU]; > > } apf; > > > > /* OSVW MSRs (AMD only) */ > > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > > index 444bb9c54548..d0a2a12c7bb6 100644 > > --- a/arch/x86/kvm/mmu.h > > +++ b/arch/x86/kvm/mmu.h > > @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool > > reset_roots); > > void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 > > efer); > > void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly, > > bool accessed_dirty, gpa_t new_eptp); > > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu); > > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn); > > int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code, > > u64 fault_address, char *insn, int insn_len); > > > > ... > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 88c593f83b28..c1f5094d6e53 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct > > kvm_vcpu *vcpu) > > vcpu->arch.apf.gfns[i] = ~0; > > } > > > > +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu) > > +{ > > + int i; > > Need a newline. Will do. > > > + for (i = 0; i < ERROR_GFN_PER_VCPU; i++) > > + vcpu->arch.apf.error_gfns[i] = GFN_INVALID; > > +} > > + > > static void kvm_on_user_return(struct user_return_notifier *urn) > > { > > unsigned slot; > > @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) > > vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT; > > > > kvm_async_pf_hash_reset(vcpu); > > + kvm_error_gfn_hash_reset(vcpu); > > kvm_pmu_init(vcpu); > > > > vcpu->arch.pending_external_vector = -1; > > @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool > > init_event) > > > > kvm_clear_async_pf_completion_queue(vcpu); > > kvm_async_pf_hash_reset(vcpu); > > + kvm_error_gfn_hash_reset(vcpu); > > vcpu->arch.apf.halted = false; > > > > if (kvm_mpx_supported()) { > > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, > > unsigned long rflags) > > } > > EXPORT_SYMBOL_GPL(kvm_set_rflags); > > > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn) > > +{ > > + BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU)); > > + > > + return hash_32(gfn & 0x, order_base_2(ERROR_GFN_PER_VCPU)); > > +} > > + > > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > > +{ > > + u32 key = kvm_error_gfn_hash_fn(gfn); > > + > > + /* > > +* Overwrite the previous gfn. This is just a hint to do > > +* sync page fault. > > +*/ > > + vcpu->arch.apf.error_gfns[key] = gfn; > > +} > > + > > +/* Returns true if gfn was found in hash table, false otherwise */ > > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > > +{ > > + u32 key = kvm_error_gfn_hash_fn(gfn); > > Mostly out of curiosity, do we really need a hash? E.g. could we get away > with an array of 4 values? 2 values? Just wondering if we can avoid 64*8 > bytes per CPU. We are relying on returning error when guest task retries fault. Fault will be retried on same address if same task is run by vcpu after "page ready" event. There is no guarantee
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: > --- > arch/x86/include/asm/kvm_host.h | 2 ++ > arch/x86/kvm/mmu.h | 2 +- > arch/x86/kvm/mmu/mmu.c | 2 +- > arch/x86/kvm/x86.c | 54 +++-- > include/linux/kvm_types.h | 1 + > 5 files changed, 56 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index be5363b21540..e6f8d3f1a377 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t > base_gfn, int level) > #define KVM_NR_VAR_MTRR 8 > > #define ASYNC_PF_PER_VCPU 64 > +#define ERROR_GFN_PER_VCPU 64 Aren't these two related? I.e. wouldn't it make sense to do: #define ERROR_GFN_PER_VCPU ASYNC_PF_PER_VCPU Or maybe even size error_gfns[] to ASYNC_PF_PER_VCPU? > > enum kvm_reg { > VCPU_REGS_RAX = __VCPU_REGS_RAX, > @@ -778,6 +779,7 @@ struct kvm_vcpu_arch { > unsigned long nested_apf_token; > bool delivery_as_pf_vmexit; > bool pageready_pending; > + gfn_t error_gfns[ERROR_GFN_PER_VCPU]; > } apf; > > /* OSVW MSRs (AMD only) */ > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index 444bb9c54548..d0a2a12c7bb6 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots); > void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer); > void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly, >bool accessed_dirty, gpa_t new_eptp); > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu); > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn); > int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code, > u64 fault_address, char *insn, int insn_len); > ... > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 88c593f83b28..c1f5094d6e53 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct > kvm_vcpu *vcpu) > vcpu->arch.apf.gfns[i] = ~0; > } > > +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu) > +{ > + int i; Need a newline. > + for (i = 0; i < ERROR_GFN_PER_VCPU; i++) > + vcpu->arch.apf.error_gfns[i] = GFN_INVALID; > +} > + > static void kvm_on_user_return(struct user_return_notifier *urn) > { > unsigned slot; > @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) > vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT; > > kvm_async_pf_hash_reset(vcpu); > + kvm_error_gfn_hash_reset(vcpu); > kvm_pmu_init(vcpu); > > vcpu->arch.pending_external_vector = -1; > @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool > init_event) > > kvm_clear_async_pf_completion_queue(vcpu); > kvm_async_pf_hash_reset(vcpu); > + kvm_error_gfn_hash_reset(vcpu); > vcpu->arch.apf.halted = false; > > if (kvm_mpx_supported()) { > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned > long rflags) > } > EXPORT_SYMBOL_GPL(kvm_set_rflags); > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn) > +{ > + BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU)); > + > + return hash_32(gfn & 0x, order_base_2(ERROR_GFN_PER_VCPU)); > +} > + > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > +{ > + u32 key = kvm_error_gfn_hash_fn(gfn); > + > + /* > + * Overwrite the previous gfn. This is just a hint to do > + * sync page fault. > + */ > + vcpu->arch.apf.error_gfns[key] = gfn; > +} > + > +/* Returns true if gfn was found in hash table, false otherwise */ > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > +{ > + u32 key = kvm_error_gfn_hash_fn(gfn); Mostly out of curiosity, do we really need a hash? E.g. could we get away with an array of 4 values? 2 values? Just wondering if we can avoid 64*8 bytes per CPU. One thought would be to use the index to handle the case of no error gfns so that the size of the array doesn't affect lookup for the common case, e.g. ... u8 error_gfn_head; gfn_t error_gfns[ERROR_GFN_PER_VCPU]; } apf; if (vcpu->arch.apf.error_gfn_head == 0xff) return false; for (i = 0; i < vcpu->arch.apf.error_gfn_head; i++) { if (vcpu->arch.apf.error_gfns[i] == gfn) { vcpu->arch.apf.error_gfns[i] = INVALID_GFN; return true; } } return true; Or you could even avoid INVALID_GFN altogether by compacting the array on removal. The VM is probably dead anyways, burning a few cycles to clean
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: > Page fault error handling behavior in kvm seems little inconsistent when > page fault reports error. If we are doing fault synchronously > then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and > exit to user space and qemu reports error, "error: kvm run failed Bad > address". Hi Paolo, Ping. I am wondering what do you think about this patch. Can it be merged? Thanks Vivek > > But if we are doing async page fault, then async_pf_execute() will simply > ignore the error reported by get_user_pages_remote() or > by kvm_mmu_do_page_fault(). It is assumed that page fault was successful > and either a page ready event is injected in guest or guest is brought > out of artificial halt state and run again. In both the cases when guest > retries the instruction, it takes exit again as page fault was not > successful in previous attempt. And then this infinite loop continues > forever. > > Trying fault in a loop will make sense if error is temporary and will > be resolved on retry. But I don't see any intention in the code to > determine if error is temporary or not. Whether to do fault synchronously > or asynchronously, depends on so many variables but none of the varibales > is whether error is temporary or not. (kvm_can_do_async_pf()). > > And that makes it very inconsistent or unpredictable to figure out whether > kvm will exit to qemu with error or it will just retry and go into an > infinite loop. > > This patch tries to make this behavior consistent. That is instead of > getting into infinite loop of retrying page fault, exit to user space > and stop VM if page fault error happens. > > In future this can be improved by injecting errors into guest. As of > now we don't have any race free method to inject errors in guest. > > When page fault error happens in async path save that pfn and when next > time guest retries, do a sync fault instead of async fault. So that if error > is encountered, we exit to qemu and avoid infinite loop. > > We maintain a cache of error gfns and force sync fault if a gfn is > found in cache of error gfn. There is a small possibility that we > miss an error gfn (as it got overwritten by a new error gfn). But > its just a hint and sooner or later some error pfn will match > and we will force sync fault and exit to user space. > > Changes from v3: > - Added function kvm_find_and_remove_error_gfn() and removed > kvm_find_error_gfn() and kvm_del_error_gfn(). (Vitaly) > > - Added a macro GFN_INVALID (Vitaly). > > - Used gpa_to_gfn() to convert gpa to gfn (Vitaly) > > Change from v2: > - Fixed a warning by converting kvm_find_error_gfn() static. > > Change from v1: > - Maintain a cache of error gfns, instead of single gfn. (Vitaly) > > Signed-off-by: Vivek Goyal > --- > arch/x86/include/asm/kvm_host.h | 2 ++ > arch/x86/kvm/mmu.h | 2 +- > arch/x86/kvm/mmu/mmu.c | 2 +- > arch/x86/kvm/x86.c | 54 +++-- > include/linux/kvm_types.h | 1 + > 5 files changed, 56 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index be5363b21540..e6f8d3f1a377 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t > base_gfn, int level) > #define KVM_NR_VAR_MTRR 8 > > #define ASYNC_PF_PER_VCPU 64 > +#define ERROR_GFN_PER_VCPU 64 > > enum kvm_reg { > VCPU_REGS_RAX = __VCPU_REGS_RAX, > @@ -778,6 +779,7 @@ struct kvm_vcpu_arch { > unsigned long nested_apf_token; > bool delivery_as_pf_vmexit; > bool pageready_pending; > + gfn_t error_gfns[ERROR_GFN_PER_VCPU]; > } apf; > > /* OSVW MSRs (AMD only) */ > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index 444bb9c54548..d0a2a12c7bb6 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots); > void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer); > void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly, >bool accessed_dirty, gpa_t new_eptp); > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu); > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn); > int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code, > u64 fault_address, char *insn, int insn_len); > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 6d6a0ae7800c..b51d4aa405e0 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool > prefault, gfn_t gfn, > if (!async) > return false; /* *pfn has correct page already */ > > - if (!prefault &&
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
> Page fault error handling behavior in kvm seems little inconsistent when > page fault reports error. If we are doing fault synchronously > then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and > exit to user space and qemu reports error, "error: kvm run failed Bad > address". > > But if we are doing async page fault, then async_pf_execute() will simply > ignore the error reported by get_user_pages_remote() or > by kvm_mmu_do_page_fault(). It is assumed that page fault was successful > and either a page ready event is injected in guest or guest is brought > out of artificial halt state and run again. In both the cases when guest > retries the instruction, it takes exit again as page fault was not > successful in previous attempt. And then this infinite loop continues > forever. > > Trying fault in a loop will make sense if error is temporary and will > be resolved on retry. But I don't see any intention in the code to > determine if error is temporary or not. Whether to do fault synchronously > or asynchronously, depends on so many variables but none of the varibales s/varibales/variables > is whether error is temporary or not. (kvm_can_do_async_pf()). > > And that makes it very inconsistent or unpredictable to figure out whether > kvm will exit to qemu with error or it will just retry and go into an > infinite loop. > > This patch tries to make this behavior consistent. That is instead of > getting into infinite loop of retrying page fault, exit to user space > and stop VM if page fault error happens. > > In future this can be improved by injecting errors into guest. As of > now we don't have any race free method to inject errors in guest. > > When page fault error happens in async path save that pfn and when next > time guest retries, do a sync fault instead of async fault. So that if error > is encountered, we exit to qemu and avoid infinite loop. > > We maintain a cache of error gfns and force sync fault if a gfn is > found in cache of error gfn. There is a small possibility that we > miss an error gfn (as it got overwritten by a new error gfn). But > its just a hint and sooner or later some error pfn will match > and we will force sync fault and exit to user space. > > Changes from v3: > - Added function kvm_find_and_remove_error_gfn() and removed > kvm_find_error_gfn() and kvm_del_error_gfn(). (Vitaly) > > - Added a macro GFN_INVALID (Vitaly). > > - Used gpa_to_gfn() to convert gpa to gfn (Vitaly) > > Change from v2: > - Fixed a warning by converting kvm_find_error_gfn() static. > > Change from v1: > - Maintain a cache of error gfns, instead of single gfn. (Vitaly) > > Signed-off-by: Vivek Goyal > --- > arch/x86/include/asm/kvm_host.h | 2 ++ > arch/x86/kvm/mmu.h | 2 +- > arch/x86/kvm/mmu/mmu.c | 2 +- > arch/x86/kvm/x86.c | 54 +++-- > include/linux/kvm_types.h | 1 + > 5 files changed, 56 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index be5363b21540..e6f8d3f1a377 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t > base_gfn, int level) > #define KVM_NR_VAR_MTRR 8 > > #define ASYNC_PF_PER_VCPU 64 > +#define ERROR_GFN_PER_VCPU 64 > > enum kvm_reg { > VCPU_REGS_RAX = __VCPU_REGS_RAX, > @@ -778,6 +779,7 @@ struct kvm_vcpu_arch { > unsigned long nested_apf_token; > bool delivery_as_pf_vmexit; > bool pageready_pending; > + gfn_t error_gfns[ERROR_GFN_PER_VCPU]; > } apf; > > /* OSVW MSRs (AMD only) */ > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index 444bb9c54548..d0a2a12c7bb6 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots); > void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer); > void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly, > bool accessed_dirty, gpa_t new_eptp); > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu); > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn); > int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code, > u64 fault_address, char *insn, int insn_len); > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 6d6a0ae7800c..b51d4aa405e0 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool > prefault, gfn_t gfn, > if (!async) > return false; /* *pfn has correct page already */ > > - if (!prefault && kvm_can_do_async_pf(vcpu)) { > + if (!prefault && kvm_can_do_async_pf(vcpu, gpa_to_gfn(cr2_or_gpa))) { >
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Mon, Jul 27, 2020 at 06:09:32PM +0200, Vitaly Kuznetsov wrote: > Vivek Goyal writes: > > > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: > >> Page fault error handling behavior in kvm seems little inconsistent when > >> page fault reports error. If we are doing fault synchronously > >> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and > >> exit to user space and qemu reports error, "error: kvm run failed Bad > >> address". > > > > Hi Vitaly, > > > > A gentle reminder. How does this patch look now? > > > > Sorry, I even reviewd it but never replied. It looks good to me! > > Reviewed-by: Vitaly Kuznetsov Thanks Vitaly. Paolo, what do you think about this patch. Do you have concerns with this. Can this be merged. Thanks Vivek
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
Vivek Goyal writes: > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: >> Page fault error handling behavior in kvm seems little inconsistent when >> page fault reports error. If we are doing fault synchronously >> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and >> exit to user space and qemu reports error, "error: kvm run failed Bad >> address". > > Hi Vitaly, > > A gentle reminder. How does this patch look now? > Sorry, I even reviewd it but never replied. It looks good to me! Reviewed-by: Vitaly Kuznetsov -- Vitaly
Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error
On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: > Page fault error handling behavior in kvm seems little inconsistent when > page fault reports error. If we are doing fault synchronously > then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and > exit to user space and qemu reports error, "error: kvm run failed Bad > address". Hi Vitaly, A gentle reminder. How does this patch look now? Thanks Vivek > > But if we are doing async page fault, then async_pf_execute() will simply > ignore the error reported by get_user_pages_remote() or > by kvm_mmu_do_page_fault(). It is assumed that page fault was successful > and either a page ready event is injected in guest or guest is brought > out of artificial halt state and run again. In both the cases when guest > retries the instruction, it takes exit again as page fault was not > successful in previous attempt. And then this infinite loop continues > forever. > > Trying fault in a loop will make sense if error is temporary and will > be resolved on retry. But I don't see any intention in the code to > determine if error is temporary or not. Whether to do fault synchronously > or asynchronously, depends on so many variables but none of the varibales > is whether error is temporary or not. (kvm_can_do_async_pf()). > > And that makes it very inconsistent or unpredictable to figure out whether > kvm will exit to qemu with error or it will just retry and go into an > infinite loop. > > This patch tries to make this behavior consistent. That is instead of > getting into infinite loop of retrying page fault, exit to user space > and stop VM if page fault error happens. > > In future this can be improved by injecting errors into guest. As of > now we don't have any race free method to inject errors in guest. > > When page fault error happens in async path save that pfn and when next > time guest retries, do a sync fault instead of async fault. So that if error > is encountered, we exit to qemu and avoid infinite loop. > > We maintain a cache of error gfns and force sync fault if a gfn is > found in cache of error gfn. There is a small possibility that we > miss an error gfn (as it got overwritten by a new error gfn). But > its just a hint and sooner or later some error pfn will match > and we will force sync fault and exit to user space. > > Changes from v3: > - Added function kvm_find_and_remove_error_gfn() and removed > kvm_find_error_gfn() and kvm_del_error_gfn(). (Vitaly) > > - Added a macro GFN_INVALID (Vitaly). > > - Used gpa_to_gfn() to convert gpa to gfn (Vitaly) > > Change from v2: > - Fixed a warning by converting kvm_find_error_gfn() static. > > Change from v1: > - Maintain a cache of error gfns, instead of single gfn. (Vitaly) > > Signed-off-by: Vivek Goyal > --- > arch/x86/include/asm/kvm_host.h | 2 ++ > arch/x86/kvm/mmu.h | 2 +- > arch/x86/kvm/mmu/mmu.c | 2 +- > arch/x86/kvm/x86.c | 54 +++-- > include/linux/kvm_types.h | 1 + > 5 files changed, 56 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index be5363b21540..e6f8d3f1a377 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t > base_gfn, int level) > #define KVM_NR_VAR_MTRR 8 > > #define ASYNC_PF_PER_VCPU 64 > +#define ERROR_GFN_PER_VCPU 64 > > enum kvm_reg { > VCPU_REGS_RAX = __VCPU_REGS_RAX, > @@ -778,6 +779,7 @@ struct kvm_vcpu_arch { > unsigned long nested_apf_token; > bool delivery_as_pf_vmexit; > bool pageready_pending; > + gfn_t error_gfns[ERROR_GFN_PER_VCPU]; > } apf; > > /* OSVW MSRs (AMD only) */ > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index 444bb9c54548..d0a2a12c7bb6 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots); > void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer); > void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly, >bool accessed_dirty, gpa_t new_eptp); > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu); > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn); > int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code, > u64 fault_address, char *insn, int insn_len); > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 6d6a0ae7800c..b51d4aa405e0 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool > prefault, gfn_t gfn, > if (!async) > return false; /* *pfn has correct page already */ > > - if (!prefault && kvm_can_do_async_pf(vcpu)) { > + if