Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Sean Christopherson
On Tue, Oct 06, 2020 at 01:35:27PM -0400, Vivek Goyal wrote:
> On Tue, Oct 06, 2020 at 10:17:04AM -0700, Sean Christopherson wrote:
> 
> [..]
> > > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> > > > into the guest is not an option.  
> > > 
> > > Not even #MC? So sad :-)
> > 
> > Heh, #MC isn't allowed either, yet...
> 
> If #MC is not allowd, logic related to hwpoison memory will not work
> as that seems to inject #MC.

Yep.  Piggybacking #MC is undesirable for other reasons, e.g. adds a "hardware"
dependency for what is really a paravirt feature.


Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Vivek Goyal
On Tue, Oct 06, 2020 at 10:17:04AM -0700, Sean Christopherson wrote:

[..]
> > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> > > into the guest is not an option.  
> > 
> > Not even #MC? So sad :-)
> 
> Heh, #MC isn't allowed either, yet...

If #MC is not allowd, logic related to hwpoison memory will not work
as that seems to inject #MC.

Vivek



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Sean Christopherson
On Tue, Oct 06, 2020 at 06:39:56PM +0200, Vitaly Kuznetsov wrote:
> Sean Christopherson  writes:
> 
> > On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote:
> >> Vivek Goyal  writes:
> >> > So you will have to report token (along with -EFAULT) to user space. So 
> >> > this
> >> > is basically the 3rd proposal which is extension of kvm API and will
> >> > report say HVA/GFN also to user space along with -EFAULT.
> >> 
> >> Right, I meant to say that guest kernel has full register state of the
> >> userspace process which caused APF to get queued and instead of trying
> >> to extract it in KVM and pass to userspace in case of a (later) failure
> >> we limit KVM api change to contain token or GFN only and somehow keep
> >> the rest in the guest. This should help with TDX/SEV-ES.
> >
> > Whatever gets reported to userspace should be identical with and without
> > async page faults, i.e. it definitely shouldn't have token information.
> >
> 
> Oh, right, when the error gets reported synchronously guest's kernel is
> not yet aware of the issue so it won't be possible to find anything in
> its kdump if userspace decides to crash it immediately. The register
> state (if available) will be actual though.
> 
> > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> > into the guest is not an option.  
> 
> Not even #MC? So sad :-)

Heh, #MC isn't allowed either, yet...

> > Nor do I think that's "correct" behavior (see everyone's objections to
> > using #PF for APF fixed).  I.e. the event should probably be an IRQ.
> 
> I recall Paolo objected against making APF 'page not present' into in
> interrupt as it will require some very special handling to make sure it
> gets injected (and handled) immediately but I'm not really sure how big
> the hack is going to be, maybe in the light of TDX/SEV-ES it's worth a
> try.

This shouldn't have anything to do with APF.  Again, the event injection is
needed even in the synchronous case as the file truncation in the host can
affect existing mappings in the guest.

I don't know that the mechanism needs to be virtiofs specific or if there can
be a more generic "these PFNs have disappeared", but it's most definitely
orthogonal to APF.


Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Vitaly Kuznetsov
Sean Christopherson  writes:

> On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote:
>> Vivek Goyal  writes:
>> > So you will have to report token (along with -EFAULT) to user space. So 
>> > this
>> > is basically the 3rd proposal which is extension of kvm API and will
>> > report say HVA/GFN also to user space along with -EFAULT.
>> 
>> Right, I meant to say that guest kernel has full register state of the
>> userspace process which caused APF to get queued and instead of trying
>> to extract it in KVM and pass to userspace in case of a (later) failure
>> we limit KVM api change to contain token or GFN only and somehow keep
>> the rest in the guest. This should help with TDX/SEV-ES.
>
> Whatever gets reported to userspace should be identical with and without
> async page faults, i.e. it definitely shouldn't have token information.
>

Oh, right, when the error gets reported synchronously guest's kernel is
not yet aware of the issue so it won't be possible to find anything in
its kdump if userspace decides to crash it immediately. The register
state (if available) will be actual though.

> Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> into the guest is not an option.  

Not even #MC? So sad :-)

> Nor do I think that's "correct" behavior (see everyone's objections to
> using #PF for APF fixed).  I.e. the event should probably be an IRQ.

I recall Paolo objected against making APF 'page not present' into in
interrupt as it will require some very special handling to make sure it
gets injected (and handled) immediately but I'm not really sure how big
the hack is going to be, maybe in the light of TDX/SEV-ES it's worth a
try.

-- 
Vitaly



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Vivek Goyal
On Tue, Oct 06, 2020 at 09:12:00AM -0700, Sean Christopherson wrote:
> On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote:
> > Vivek Goyal  writes:
> > > So you will have to report token (along with -EFAULT) to user space. So 
> > > this
> > > is basically the 3rd proposal which is extension of kvm API and will
> > > report say HVA/GFN also to user space along with -EFAULT.
> > 
> > Right, I meant to say that guest kernel has full register state of the
> > userspace process which caused APF to get queued and instead of trying
> > to extract it in KVM and pass to userspace in case of a (later) failure
> > we limit KVM api change to contain token or GFN only and somehow keep
> > the rest in the guest. This should help with TDX/SEV-ES.
> 
> Whatever gets reported to userspace should be identical with and without
> async page faults, i.e. it definitely shouldn't have token information.
> 
> Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> into the guest is not an option.  Nor do I think that's "correct"
> behavior (see everyone's objections to using #PF for APF fixed).  I.e. the
> event should probably be an IRQ.

I am not sure if IRQ for "Page not Present" works. Will it have some
conflicts/issues with other high priority interrupts which can
get injected before "Page not present".

Vivek



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Sean Christopherson
On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal  writes:
> > So you will have to report token (along with -EFAULT) to user space. So this
> > is basically the 3rd proposal which is extension of kvm API and will
> > report say HVA/GFN also to user space along with -EFAULT.
> 
> Right, I meant to say that guest kernel has full register state of the
> userspace process which caused APF to get queued and instead of trying
> to extract it in KVM and pass to userspace in case of a (later) failure
> we limit KVM api change to contain token or GFN only and somehow keep
> the rest in the guest. This should help with TDX/SEV-ES.

Whatever gets reported to userspace should be identical with and without
async page faults, i.e. it definitely shouldn't have token information.

Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
into the guest is not an option.  Nor do I think that's "correct"
behavior (see everyone's objections to using #PF for APF fixed).  I.e. the
event should probably be an IRQ.


Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Vitaly Kuznetsov
Vivek Goyal  writes:

> On Tue, Oct 06, 2020 at 04:50:44PM +0200, Vitaly Kuznetsov wrote:
>> Vivek Goyal  writes:
>> 
>> > On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote:
>> >> Vivek Goyal  writes:
>> >> 
>> >> > A. Just exit to user space with -EFAULT (using kvm request) and don't
>> >> >wait for the accessing task to run on vcpu again. 
>> >> 
>> >> What if we also save the required information (RIP, GFN, ...) in the
>> >> guest along with the APF token
>> >
>> > Can you elaborate a bit more on this. You mean save GFN on stack before
>> > it starts waiting for PAGE_READY event?
>> 
>> When PAGE_NOT_PRESENT event is injected as #PF (for now) in the guest
>> kernel gets all the registers of the userspace process (except for CR2
>> which is replaced with a token). In case it is not trivial to extract
>> accessed GFN from this data we can extend the shared APF structure and
>> add it there, KVM has it when it queues APF.
>> 
>> >
>> >> so in case of -EFAULT we can just 'crash'
>> >> the guest and the required information can easily be obtained from
>> >> kdump? This will solve the debugging problem even for TDX/SEV-ES (if
>> >> kdump is possible there).
>> >
>> > Just saving additional info in guest will not help because there might
>> > be many tasks waiting and you don't know which GFN is problematic one.
>> 
>> But KVM knows which token caused the -EFAULT when we exit to userspace
>> (and we can pass this information to it) so to debug the situation you
>> take this token and then explore the kdump searching for what's
>> associated with this exact token.
>
> So you will have to report token (along with -EFAULT) to user space. So this
> is basically the 3rd proposal which is extension of kvm API and will
> report say HVA/GFN also to user space along with -EFAULT.
>

Right, I meant to say that guest kernel has full register state of the
userspace process which caused APF to get queued and instead of trying
to extract it in KVM and pass to userspace in case of a (later) failure
we limit KVM api change to contain token or GFN only and somehow keep
the rest in the guest. This should help with TDX/SEV-ES.

-- 
Vitaly



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Vivek Goyal
On Tue, Oct 06, 2020 at 04:50:44PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal  writes:
> 
> > On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote:
> >> Vivek Goyal  writes:
> >> 
> >> > A. Just exit to user space with -EFAULT (using kvm request) and don't
> >> >wait for the accessing task to run on vcpu again. 
> >> 
> >> What if we also save the required information (RIP, GFN, ...) in the
> >> guest along with the APF token
> >
> > Can you elaborate a bit more on this. You mean save GFN on stack before
> > it starts waiting for PAGE_READY event?
> 
> When PAGE_NOT_PRESENT event is injected as #PF (for now) in the guest
> kernel gets all the registers of the userspace process (except for CR2
> which is replaced with a token). In case it is not trivial to extract
> accessed GFN from this data we can extend the shared APF structure and
> add it there, KVM has it when it queues APF.
> 
> >
> >> so in case of -EFAULT we can just 'crash'
> >> the guest and the required information can easily be obtained from
> >> kdump? This will solve the debugging problem even for TDX/SEV-ES (if
> >> kdump is possible there).
> >
> > Just saving additional info in guest will not help because there might
> > be many tasks waiting and you don't know which GFN is problematic one.
> 
> But KVM knows which token caused the -EFAULT when we exit to userspace
> (and we can pass this information to it) so to debug the situation you
> take this token and then explore the kdump searching for what's
> associated with this exact token.

So you will have to report token (along with -EFAULT) to user space. So this
is basically the 3rd proposal which is extension of kvm API and will
report say HVA/GFN also to user space along with -EFAULT.

Thanks
Vivek



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Vitaly Kuznetsov
Vivek Goyal  writes:

> On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote:
>> Vivek Goyal  writes:
>> 
>> > A. Just exit to user space with -EFAULT (using kvm request) and don't
>> >wait for the accessing task to run on vcpu again. 
>> 
>> What if we also save the required information (RIP, GFN, ...) in the
>> guest along with the APF token
>
> Can you elaborate a bit more on this. You mean save GFN on stack before
> it starts waiting for PAGE_READY event?

When PAGE_NOT_PRESENT event is injected as #PF (for now) in the guest
kernel gets all the registers of the userspace process (except for CR2
which is replaced with a token). In case it is not trivial to extract
accessed GFN from this data we can extend the shared APF structure and
add it there, KVM has it when it queues APF.

>
>> so in case of -EFAULT we can just 'crash'
>> the guest and the required information can easily be obtained from
>> kdump? This will solve the debugging problem even for TDX/SEV-ES (if
>> kdump is possible there).
>
> Just saving additional info in guest will not help because there might
> be many tasks waiting and you don't know which GFN is problematic one.

But KVM knows which token caused the -EFAULT when we exit to userspace
(and we can pass this information to it) so to debug the situation you
take this token and then explore the kdump searching for what's
associated with this exact token.

-- 
Vitaly



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Vivek Goyal
On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal  writes:
> 
> > A. Just exit to user space with -EFAULT (using kvm request) and don't
> >wait for the accessing task to run on vcpu again. 
> 
> What if we also save the required information (RIP, GFN, ...) in the
> guest along with the APF token

Can you elaborate a bit more on this. You mean save GFN on stack before
it starts waiting for PAGE_READY event?

> so in case of -EFAULT we can just 'crash'
> the guest and the required information can easily be obtained from
> kdump? This will solve the debugging problem even for TDX/SEV-ES (if
> kdump is possible there).

Just saving additional info in guest will not help because there might
be many tasks waiting and you don't know which GFN is problematic one.

Thanks
Vivek



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Vitaly Kuznetsov
Vivek Goyal  writes:

> A. Just exit to user space with -EFAULT (using kvm request) and don't
>wait for the accessing task to run on vcpu again. 

What if we also save the required information (RIP, GFN, ...) in the
guest along with the APF token so in case of -EFAULT we can just 'crash'
the guest and the required information can easily be obtained from
kdump? This will solve the debugging problem even for TDX/SEV-ES (if
kdump is possible there).

-- 
Vitaly



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-06 Thread Vivek Goyal
On Mon, Oct 05, 2020 at 09:16:20AM -0700, Sean Christopherson wrote:
> On Mon, Oct 05, 2020 at 11:33:18AM -0400, Vivek Goyal wrote:
> > On Fri, Oct 02, 2020 at 02:13:14PM -0700, Sean Christopherson wrote:
> > Now I have few questions.
> > 
> > - If we exit to user space asynchronously (using kvm request), what debug
> >   information is in there which tells user which address is bad. I admit
> >   that even above trace does not seem to be telling me directly which
> >   address (HVA?) is bad.
> > 
> >   But if I take a crash dump of guest, using above information I should
> >   be able to get to GPA which is problematic. And looking at /proc/iomem
> >   it should also tell which device this memory region is in.
> > 
> >   Also using this crash dump one should be able to walk through virtiofs 
> > data
> >   structures and figure out which file and what offset with-in file does
> >   it belong to. Now one can look at filesystem on host and see file got
> >   truncated and it will become obvious it can't be faulted in. And then
> >   one can continue to debug that how did we arrive here.
> > 
> > But if we don't exit to user space synchronously, Only relevant
> > information we seem to have is -EFAULT. Apart from that, how does one
> > figure out what address is bad, or who tried to access it. Or which
> > file/offset does it belong to etc.
> >
> > I agree that problem is not necessarily in guest code. But by exiting
> > synchronously, it gives enough information that one can use crash
> > dump to get to bottom of the issue. If we exit to user space
> > asynchronously, all this information will be lost and it might make
> > it very hard to figure out (if not impossible), what's going on.
> 
> If we want userspace to be able to do something useful, KVM should explicitly
> inform userspace about the error, userspace shouldn't simply assume that
> -EFAULT means a HVA->PFN lookup failed.

I guess that's fine. But for this patch, user space is not doing anything.
Its just printing error -EFAULT and dumping guest state (Same as we do
in case of synchronous fault).

> Userspace also shouldn't have to
> query guest state to handle the error, as that won't work for protected guests
> guests like SEV-ES and TDX.

So qemu would not be able to dump vcpu register state when kvm returns
with -EFAULT for the case of SEV-ES and TDX?

> 
> I can think of two options:
> 
>   1. Send a signal, a la kvm_send_hwpoison_signal().

This works because -EHWPOISON is a special kind of error which is
different from -EFAULT. For truncation, even kvm gets -EFAULT.

if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
return -EFAULT;

Anyway, if -EFAULT is too generic, and we need something finer grained,
that can be looked into when we actually have a method where kvm/qemu
injects error into guest.

> 
>   2. Add a userspace exit reason along with a new entry in the run struct,
>  e.g. that provides the bad GPA, HVA, possibly permissions, etc...

This sounds more reasonable to me. That is kvm gives additional
information to qemu about failing HVA and GPA with -EFAULT and that
can be helpful in debugging a problem. This seems like an extension
of KVM API.

Even with this, if we want to figure out which file got truncated, we
will need to take a dump of guest and try to figure out which file
this GPA is currently mapping(By looking at virtiofs data structures).
And that becomes little easier if vcpu is running the task which 
accessed that GPA. Anyway, if we have failing GPA, I think it should
be possible to figure out inode even without accessing task being
current on vcpu.

So we seem to have 3 options.

A. Just exit to user space with -EFAULT (using kvm request) and don't
   wait for the accessing task to run on vcpu again. 

B. Store error gfn in an hash and exit to user space when task accessing
   gfn runs again.

C. Extend KVM API and report failing HVA/GFN access by guest. And that
   should allow not having to exit to user space synchronously.

Thanks
Vivek



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-05 Thread Sean Christopherson
On Mon, Oct 05, 2020 at 11:33:18AM -0400, Vivek Goyal wrote:
> On Fri, Oct 02, 2020 at 02:13:14PM -0700, Sean Christopherson wrote:
> Now I have few questions.
> 
> - If we exit to user space asynchronously (using kvm request), what debug
>   information is in there which tells user which address is bad. I admit
>   that even above trace does not seem to be telling me directly which
>   address (HVA?) is bad.
> 
>   But if I take a crash dump of guest, using above information I should
>   be able to get to GPA which is problematic. And looking at /proc/iomem
>   it should also tell which device this memory region is in.
> 
>   Also using this crash dump one should be able to walk through virtiofs data
>   structures and figure out which file and what offset with-in file does
>   it belong to. Now one can look at filesystem on host and see file got
>   truncated and it will become obvious it can't be faulted in. And then
>   one can continue to debug that how did we arrive here.
> 
> But if we don't exit to user space synchronously, Only relevant
> information we seem to have is -EFAULT. Apart from that, how does one
> figure out what address is bad, or who tried to access it. Or which
> file/offset does it belong to etc.
>
> I agree that problem is not necessarily in guest code. But by exiting
> synchronously, it gives enough information that one can use crash
> dump to get to bottom of the issue. If we exit to user space
> asynchronously, all this information will be lost and it might make
> it very hard to figure out (if not impossible), what's going on.

If we want userspace to be able to do something useful, KVM should explicitly
inform userspace about the error, userspace shouldn't simply assume that
-EFAULT means a HVA->PFN lookup failed.  Userspace also shouldn't have to
query guest state to handle the error, as that won't work for protected guests
guests like SEV-ES and TDX.

I can think of two options:

  1. Send a signal, a la kvm_send_hwpoison_signal().

  2. Add a userspace exit reason along with a new entry in the run struct,
 e.g. that provides the bad GPA, HVA, possibly permissions, etc...

> > > > > > To fully handle the situation, the guest needs to remove the bad 
> > > > > > page from
> > > > > > its memory pool.  Once the page is offlined, the guest kernel's 
> > > > > > error
> > > > > > handling will kick in when a task accesses the bad page (or nothing 
> > > > > > ever
> > > > > > touches the bad page again and everyone is happy).
> > > > > 
> > > > > This is not really a case of bad page as such. It is more of a page
> > > > > gone missing/trucated. And no new user can map it. We just need to
> > > > > worry about existing users who already have it mapped.
> > > > 
> > > > What do you mean by "no new user can map it"?  Are you talking about 
> > > > guest
> > > > tasks or host tasks?  If guest tasks, how would the guest know the page 
> > > > is
> > > > missing and thus prevent mapping the non-existent page?
> > > 
> > > If a new task wants mmap(), it will send a request to virtiofsd/qemu
> > > on host. If file has been truncated, then mapping beyond file size
> > > will fail and process will get error.  So they will not be able to
> > > map a page which has been truncated.
> > 
> > Ah.  Is there anything that prevents the notification side of things from
> > being handled purely within the virtiofs layer?  E.g. host notifies the 
> > guest
> > that a file got truncated, virtiofs driver in the guest invokes a kernel API
> > to remove the page(s).
> 
> virtiofsd notifications can help a bit but not in all cases. For example,
> If file got truncated and guest kernel accesses it immidiately after that,
> (before notification arrives), it will hang and notification will not
> be able to do much.
> 
> So while notification might be nice to have, but we still will need some
> sort of error reporting from kvm.

And I'm in full agreement with that.  What I'm saying is that resolving the
infinite loop is a _bug fix_ and is completely orthogonal to adding a new
mechanism to handle the file truncation scenario.


Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-05 Thread Vivek Goyal
On Fri, Oct 02, 2020 at 02:13:14PM -0700, Sean Christopherson wrote:
> On Fri, Oct 02, 2020 at 04:02:14PM -0400, Vivek Goyal wrote:
> > On Fri, Oct 02, 2020 at 12:45:18PM -0700, Sean Christopherson wrote:
> > > On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote:
> > > > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote:
> > > > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote:
> > > > > I don't think it's necessary to provide userspace with the register 
> > > > > state of
> > > > > the guest task that hit the bad page.  Other than debugging, I don't 
> > > > > see how
> > > > > userspace can do anything useful which such information.
> > > > 
> > > > I think debugging is the whole point so that user can figure out which
> > > > access by guest task resulted in bad memory access. I would think this
> > > > will be important piece of information.
> > > 
> > > But isn't this failure due to a truncation in the host?  Why would we care
> > > about debugging the guest?  It hasn't done anything wrong, has it?  Or am 
> > > I
> > > misunderstanding the original problem statement.
> > 
> > I think you understood problem statement right. If guest has right
> > context, it just gives additional information who tried to access
> > the missing memory page. 
> 
> Yes, but it's not actionable, e.g. QEMU can't do anything differently given
> a guest RIP.  It's useful information for hands-on debug, but the information
> can be easily collected through other means when doing hands-on debug.

Hi Sean,

I tried my patch and truncated file on host before guest did memcpy().
After truncation guest process tried memcpy() on truncated region and
kvm exited to user space with -EFAULT. I see following on serial console.

I am assuming qemu is printing the state of vcpu.


error: kvm run failed Bad address
RAX=7fff6e7a9750 RBX= RCX=7f513927e000 RDX=a
RSI=7f513927e000 RDI=7fff6e7a9750 RBP=7fff6e7a97b0 RSP=7fff6e7a8
R8 = R9 =0031 R10=7fff6e7a957c R11=6
R12=00401140 R13= R14= R15=0
RIP=7f51391e0547 RFL=00010202 [---] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =   00c0
CS =0033   00a0fb00 DPL=3 CS64 [-RA]
SS =002b   00c0f300 DPL=3 DS   [-WA]
DS =   00c0
FS = 7f5139246540  00c0
GS =   00c0
LDT=   
TR =0040 fe3a6000 4087 8b00 DPL=0 TSS64-busy
GDT= fe3a4000 007f
IDT= fe00 0fff
CR0=80050033 CR2=7f513927e004 CR3=00102b5eb805 CR4=00770ee0
DR0= DR1= DR2= DR3=
DR6=fffe0ff0 DR7=0400
EFER=0d01
Code=fa 6f 06 c5 fa 6f 4c 16 f0 c5 fa 7f 07 c5 fa 7f 4c 17 f0 c3 <48> 8b 4c 16 3
*

I also changed my test program to print source and destination address
for memcpy.

dst=0x0x7fff6e7a9750 src=0x0x7f513927e000

Here dst matches RDI and src matches RSI. This trace also tells me
CPL=3 so a user space access triggered this.

Now I have few questions.

- If we exit to user space asynchronously (using kvm request), what debug
  information is in there which tells user which address is bad. I admit
  that even above trace does not seem to be telling me directly which
  address (HVA?) is bad.

  But if I take a crash dump of guest, using above information I should
  be able to get to GPA which is problematic. And looking at /proc/iomem
  it should also tell which device this memory region is in.

  Also using this crash dump one should be able to walk through virtiofs data
  structures and figure out which file and what offset with-in file does
  it belong to. Now one can look at filesystem on host and see file got
  truncated and it will become obvious it can't be faulted in. And then
  one can continue to debug that how did we arrive here.

But if we don't exit to user space synchronously, Only relevant
information we seem to have is -EFAULT. Apart from that, how does one
figure out what address is bad, or who tried to access it. Or which
file/offset does it belong to etc.

I agree that problem is not necessarily in guest code. But by exiting
synchronously, it gives enough information that one can use crash
dump to get to bottom of the issue. If we exit to user space
asynchronously, all this information will be lost and it might make
it very hard to figure out (if not impossible), what's going on.

>  
> > > > > To fully handle the situation, the guest needs to remove the bad page 
> > > > > from
> > > > > its memory pool.  Once the page is offlined, the guest kernel's error
> > > > > handling will kick in when 

Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-02 Thread Sean Christopherson
On Fri, Oct 02, 2020 at 04:02:14PM -0400, Vivek Goyal wrote:
> On Fri, Oct 02, 2020 at 12:45:18PM -0700, Sean Christopherson wrote:
> > On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote:
> > > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote:
> > > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote:
> > > > I don't think it's necessary to provide userspace with the register 
> > > > state of
> > > > the guest task that hit the bad page.  Other than debugging, I don't 
> > > > see how
> > > > userspace can do anything useful which such information.
> > > 
> > > I think debugging is the whole point so that user can figure out which
> > > access by guest task resulted in bad memory access. I would think this
> > > will be important piece of information.
> > 
> > But isn't this failure due to a truncation in the host?  Why would we care
> > about debugging the guest?  It hasn't done anything wrong, has it?  Or am I
> > misunderstanding the original problem statement.
> 
> I think you understood problem statement right. If guest has right
> context, it just gives additional information who tried to access
> the missing memory page. 

Yes, but it's not actionable, e.g. QEMU can't do anything differently given
a guest RIP.  It's useful information for hands-on debug, but the information
can be easily collected through other means when doing hands-on debug.
 
> > > > To fully handle the situation, the guest needs to remove the bad page 
> > > > from
> > > > its memory pool.  Once the page is offlined, the guest kernel's error
> > > > handling will kick in when a task accesses the bad page (or nothing ever
> > > > touches the bad page again and everyone is happy).
> > > 
> > > This is not really a case of bad page as such. It is more of a page
> > > gone missing/trucated. And no new user can map it. We just need to
> > > worry about existing users who already have it mapped.
> > 
> > What do you mean by "no new user can map it"?  Are you talking about guest
> > tasks or host tasks?  If guest tasks, how would the guest know the page is
> > missing and thus prevent mapping the non-existent page?
> 
> If a new task wants mmap(), it will send a request to virtiofsd/qemu
> on host. If file has been truncated, then mapping beyond file size
> will fail and process will get error.  So they will not be able to
> map a page which has been truncated.

Ah.  Is there anything that prevents the notification side of things from
being handled purely within the virtiofs layer?  E.g. host notifies the guest
that a file got truncated, virtiofs driver in the guest invokes a kernel API
to remove the page(s).

> > > > Note, I'm not necessarily suggesting that QEMU piggyback its #MC 
> > > > injection
> > > > to handle this, but I suspect the resulting behavior will look quite 
> > > > similar,
> > > > e.g. notify the virtiofs driver in the guest, which does some magic to 
> > > > take
> > > > the offending region offline, and then guest tasks get SIGBUS or 
> > > > whatever.
> > > > 
> > > > I also don't think it's KVM's responsibility to _directly_ handle such a
> > > > scenario.  As I said in an earlier version, KVM can't possibly know 
> > > > _why_ a
> > > > page fault came back with -EFAULT, only userspace can connect the dots 
> > > > of
> > > > GPA -> HVA -> vm_area_struct -> file -> inject event.  KVM definitely 
> > > > should
> > > > exit to userspace on the -EFAULT instead of hanging the guest, but that 
> > > > can
> > > > be done via a new request, as suggested.
> > > 
> > > KVM atleast should have the mechanism to report this back to guest. And
> > > we don't have any. There only seems to be #MC stuff for poisoned pages.
> > > I am not sure how much we can build on top of #MC stuff to take care
> > > of this case. One problem with #MC I found was that it generates
> > > synchronous #MC only on load and not store. So all the code is
> > > written in such a way that synchronous #MC can happen only on load
> > > and hence the error handling. 
> > > 
> > > Stores generate different kind of #MC that too asynchronously and
> > > caller will not know about it immiditely. But in this case we need
> > > to know about error in the context of caller both for loads and stores.
> > > 
> > > Anyway, I agree that this patch does not solve the problem of race
> > > free synchronous event inject into the guest. That's something yet
> > > to be solved either by #VE or by #MC or by #foo.
> > > 
> > > This patch is only doing two things.
> > > 
> > > - Because we don't have a mechanism to report errors to guest, use
> > >   the second best method and exit to user space.
> > 
> > Not that it matters at this point, but IMO exiting to userspace is the
> > correct behavior, not simply "second best method".  Again, KVM by design
> > does not know what lies behind any given HVA, and therefore should not be
> > making decisions about how to handle bad HVAs.
> >  
> > > - Make behavior consistent between 

Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-02 Thread Vivek Goyal
On Fri, Oct 02, 2020 at 12:45:18PM -0700, Sean Christopherson wrote:
> On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote:
> > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote:
> > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote:
> > > > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote:
> > > > > Alternatively, what about adding a new KVM request type to handle 
> > > > > this?
> > > > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a
> > > > > request.  The vCPU then gets kicked and exits to userspace.  Before 
> > > > > exiting
> > > > > to userspace, the request handler resets vcpu->arch.apf.error_gfn.  
> > > > > Bad GFNs
> > > > > simply get if error_gfn is "valid", i.e. there's a pending request.
> > > > 
> > > > Sorry, I did not understand the above proposal. Can you please elaborate
> > > > a bit more. Part of it is that I don't know much about KVM requests.
> > > > Looking at the code it looks like that main loop is parsing if some
> > > > kvm request is pending and executing that action.
> > > > 
> > > > Don't we want to make sure that we exit to user space when guest retries
> > > > error gfn access again.
> > > 
> > > > In this case once we get -EFAULT, we will still inject page_ready into
> > > > guest. And then either same process or a different process might run. 
> > > > 
> > > > So when exactly code raises a kvm request. If I raise it right when
> > > > I get -EFAULT, then kvm will exit to user space upon next entry
> > > > time. But there is no guarantee guest vcpu is running the process which
> > > > actually accessed the error gfn. And that probably means that register
> > > > state of cpu does not mean much and one can not easily figure out
> > > > which task tried to access the bad memory and when.
> > > > 
> > > > That's why we prepare a list of error gfn and only exit to user space
> > > > when error_gfn access is retried so that guest vcpu context is correct.
> > > > 
> > > > What am I missing?
> > > 
> > > I don't think it's necessary to provide userspace with the register state 
> > > of
> > > the guest task that hit the bad page.  Other than debugging, I don't see 
> > > how
> > > userspace can do anything useful which such information.
> > 
> > I think debugging is the whole point so that user can figure out which
> > access by guest task resulted in bad memory access. I would think this
> > will be important piece of information.
> 
> But isn't this failure due to a truncation in the host?  Why would we care
> about debugging the guest?  It hasn't done anything wrong, has it?  Or am I
> misunderstanding the original problem statement.

I think you understood problem statement right. If guest has right
context, it just gives additional information who tried to access
the missing memory page. 

> 
> > > To fully handle the situation, the guest needs to remove the bad page from
> > > its memory pool.  Once the page is offlined, the guest kernel's error
> > > handling will kick in when a task accesses the bad page (or nothing ever
> > > touches the bad page again and everyone is happy).
> > 
> > This is not really a case of bad page as such. It is more of a page
> > gone missing/trucated. And no new user can map it. We just need to
> > worry about existing users who already have it mapped.
> 
> What do you mean by "no new user can map it"?  Are you talking about guest
> tasks or host tasks?  If guest tasks, how would the guest know the page is
> missing and thus prevent mapping the non-existent page?

If a new task wants mmap(), it will send a request to virtiofsd/qemu
on host. If file has been truncated, then mapping beyond file size
will fail and process will get error.  So they will not be able to
map a page which has been truncated.

> 
> > > Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection
> > > to handle this, but I suspect the resulting behavior will look quite 
> > > similar,
> > > e.g. notify the virtiofs driver in the guest, which does some magic to 
> > > take
> > > the offending region offline, and then guest tasks get SIGBUS or whatever.
> > > 
> > > I also don't think it's KVM's responsibility to _directly_ handle such a
> > > scenario.  As I said in an earlier version, KVM can't possibly know _why_ 
> > > a
> > > page fault came back with -EFAULT, only userspace can connect the dots of
> > > GPA -> HVA -> vm_area_struct -> file -> inject event.  KVM definitely 
> > > should
> > > exit to userspace on the -EFAULT instead of hanging the guest, but that 
> > > can
> > > be done via a new request, as suggested.
> > 
> > KVM atleast should have the mechanism to report this back to guest. And
> > we don't have any. There only seems to be #MC stuff for poisoned pages.
> > I am not sure how much we can build on top of #MC stuff to take care
> > of this case. One problem with #MC I found was that it generates
> > synchronous #MC only on load and not store. So all 

Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-02 Thread Sean Christopherson
On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote:
> On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote:
> > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote:
> > > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote:
> > > > Alternatively, what about adding a new KVM request type to handle this?
> > > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a
> > > > request.  The vCPU then gets kicked and exits to userspace.  Before 
> > > > exiting
> > > > to userspace, the request handler resets vcpu->arch.apf.error_gfn.  Bad 
> > > > GFNs
> > > > simply get if error_gfn is "valid", i.e. there's a pending request.
> > > 
> > > Sorry, I did not understand the above proposal. Can you please elaborate
> > > a bit more. Part of it is that I don't know much about KVM requests.
> > > Looking at the code it looks like that main loop is parsing if some
> > > kvm request is pending and executing that action.
> > > 
> > > Don't we want to make sure that we exit to user space when guest retries
> > > error gfn access again.
> > 
> > > In this case once we get -EFAULT, we will still inject page_ready into
> > > guest. And then either same process or a different process might run. 
> > > 
> > > So when exactly code raises a kvm request. If I raise it right when
> > > I get -EFAULT, then kvm will exit to user space upon next entry
> > > time. But there is no guarantee guest vcpu is running the process which
> > > actually accessed the error gfn. And that probably means that register
> > > state of cpu does not mean much and one can not easily figure out
> > > which task tried to access the bad memory and when.
> > > 
> > > That's why we prepare a list of error gfn and only exit to user space
> > > when error_gfn access is retried so that guest vcpu context is correct.
> > > 
> > > What am I missing?
> > 
> > I don't think it's necessary to provide userspace with the register state of
> > the guest task that hit the bad page.  Other than debugging, I don't see how
> > userspace can do anything useful which such information.
> 
> I think debugging is the whole point so that user can figure out which
> access by guest task resulted in bad memory access. I would think this
> will be important piece of information.

But isn't this failure due to a truncation in the host?  Why would we care
about debugging the guest?  It hasn't done anything wrong, has it?  Or am I
misunderstanding the original problem statement.

> > To fully handle the situation, the guest needs to remove the bad page from
> > its memory pool.  Once the page is offlined, the guest kernel's error
> > handling will kick in when a task accesses the bad page (or nothing ever
> > touches the bad page again and everyone is happy).
> 
> This is not really a case of bad page as such. It is more of a page
> gone missing/trucated. And no new user can map it. We just need to
> worry about existing users who already have it mapped.

What do you mean by "no new user can map it"?  Are you talking about guest
tasks or host tasks?  If guest tasks, how would the guest know the page is
missing and thus prevent mapping the non-existent page?

> > Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection
> > to handle this, but I suspect the resulting behavior will look quite 
> > similar,
> > e.g. notify the virtiofs driver in the guest, which does some magic to take
> > the offending region offline, and then guest tasks get SIGBUS or whatever.
> > 
> > I also don't think it's KVM's responsibility to _directly_ handle such a
> > scenario.  As I said in an earlier version, KVM can't possibly know _why_ a
> > page fault came back with -EFAULT, only userspace can connect the dots of
> > GPA -> HVA -> vm_area_struct -> file -> inject event.  KVM definitely should
> > exit to userspace on the -EFAULT instead of hanging the guest, but that can
> > be done via a new request, as suggested.
> 
> KVM atleast should have the mechanism to report this back to guest. And
> we don't have any. There only seems to be #MC stuff for poisoned pages.
> I am not sure how much we can build on top of #MC stuff to take care
> of this case. One problem with #MC I found was that it generates
> synchronous #MC only on load and not store. So all the code is
> written in such a way that synchronous #MC can happen only on load
> and hence the error handling. 
> 
> Stores generate different kind of #MC that too asynchronously and
> caller will not know about it immiditely. But in this case we need
> to know about error in the context of caller both for loads and stores.
> 
> Anyway, I agree that this patch does not solve the problem of race
> free synchronous event inject into the guest. That's something yet
> to be solved either by #VE or by #MC or by #foo.
> 
> This patch is only doing two things.
> 
> - Because we don't have a mechanism to report errors to guest, use
>   the second best method and exit to user 

Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-02 Thread Vivek Goyal
On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote:
> On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote:
> > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote:
> > > Alternatively, what about adding a new KVM request type to handle this?
> > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a
> > > request.  The vCPU then gets kicked and exits to userspace.  Before 
> > > exiting
> > > to userspace, the request handler resets vcpu->arch.apf.error_gfn.  Bad 
> > > GFNs
> > > simply get if error_gfn is "valid", i.e. there's a pending request.
> > 
> > Sorry, I did not understand the above proposal. Can you please elaborate
> > a bit more. Part of it is that I don't know much about KVM requests.
> > Looking at the code it looks like that main loop is parsing if some
> > kvm request is pending and executing that action.
> > 
> > Don't we want to make sure that we exit to user space when guest retries
> > error gfn access again.
> 
> > In this case once we get -EFAULT, we will still inject page_ready into
> > guest. And then either same process or a different process might run. 
> > 
> > So when exactly code raises a kvm request. If I raise it right when
> > I get -EFAULT, then kvm will exit to user space upon next entry
> > time. But there is no guarantee guest vcpu is running the process which
> > actually accessed the error gfn. And that probably means that register
> > state of cpu does not mean much and one can not easily figure out
> > which task tried to access the bad memory and when.
> > 
> > That's why we prepare a list of error gfn and only exit to user space
> > when error_gfn access is retried so that guest vcpu context is correct.
> > 
> > What am I missing?
> 
> I don't think it's necessary to provide userspace with the register state of
> the guest task that hit the bad page.  Other than debugging, I don't see how
> userspace can do anything useful which such information.

I think debugging is the whole point so that user can figure out which
access by guest task resulted in bad memory access. I would think this
will be important piece of information.

> 
> Even if you want to inject an event of some form into the guest, having the
> correct context for the event itself is not required.  IMO it's perfectly
> reasonable for such an event to be asynchronous.
> 
> IIUC, your end goal is to be able to gracefully handle DAX file truncation.
> Simply killing the guest task that hit the bad page isn't sufficient, as
> nothing prevents a future task from accessing the same bad page.

Next task can't even mmap that page mmap will fail. File got truncated,
that page does not exist. 

So sending SIGBUS to task should definitely solve the problem. We also
need to solve the issue for guest kernel accessing the page which got
truncated on host. In that case we need to use correct memcpy helpers
and use exception table magic and return error code to user space.

> To fully
> handle the situation, the guest needs to remove the bad page from its memory
> pool.  Once the page is offlined, the guest kernel's error handling will
> kick in when a task accesses the bad page (or nothing ever touches the bad
> page again and everyone is happy).

This is not really a case of bad page as such. It is more of a page
gone missing/trucated. And no new user can map it. We just need to
worry about existing users who already have it mapped.

> 
> Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection
> to handle this, but I suspect the resulting behavior will look quite similar,
> e.g. notify the virtiofs driver in the guest, which does some magic to take
> the offending region offline, and then guest tasks get SIGBUS or whatever.
> 
> I also don't think it's KVM's responsibility to _directly_ handle such a
> scenario.  As I said in an earlier version, KVM can't possibly know _why_ a
> page fault came back with -EFAULT, only userspace can connect the dots of
> GPA -> HVA -> vm_area_struct -> file -> inject event.  KVM definitely should
> exit to userspace on the -EFAULT instead of hanging the guest, but that can
> be done via a new request, as suggested.

KVM atleast should have the mechanism to report this back to guest. And
we don't have any. There only seems to be #MC stuff for poisoned pages.
I am not sure how much we can build on top of #MC stuff to take care
of this case. One problem with #MC I found was that it generates
synchronous #MC only on load and not store. So all the code is
written in such a way that synchronous #MC can happen only on load
and hence the error handling. 

Stores generate different kind of #MC that too asynchronously and
caller will not know about it immiditely. But in this case we need
to know about error in the context of caller both for loads and stores.

Anyway, I agree that this patch does not solve the problem of race
free synchronous event inject into the guest. That's something yet
to be solved either by 

Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-02 Thread Sean Christopherson
On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote:
> On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote:
> > Alternatively, what about adding a new KVM request type to handle this?
> > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a
> > request.  The vCPU then gets kicked and exits to userspace.  Before exiting
> > to userspace, the request handler resets vcpu->arch.apf.error_gfn.  Bad GFNs
> > simply get if error_gfn is "valid", i.e. there's a pending request.
> 
> Sorry, I did not understand the above proposal. Can you please elaborate
> a bit more. Part of it is that I don't know much about KVM requests.
> Looking at the code it looks like that main loop is parsing if some
> kvm request is pending and executing that action.
> 
> Don't we want to make sure that we exit to user space when guest retries
> error gfn access again.

> In this case once we get -EFAULT, we will still inject page_ready into
> guest. And then either same process or a different process might run. 
> 
> So when exactly code raises a kvm request. If I raise it right when
> I get -EFAULT, then kvm will exit to user space upon next entry
> time. But there is no guarantee guest vcpu is running the process which
> actually accessed the error gfn. And that probably means that register
> state of cpu does not mean much and one can not easily figure out
> which task tried to access the bad memory and when.
> 
> That's why we prepare a list of error gfn and only exit to user space
> when error_gfn access is retried so that guest vcpu context is correct.
> 
> What am I missing?

I don't think it's necessary to provide userspace with the register state of
the guest task that hit the bad page.  Other than debugging, I don't see how
userspace can do anything useful which such information.

Even if you want to inject an event of some form into the guest, having the
correct context for the event itself is not required.  IMO it's perfectly
reasonable for such an event to be asynchronous.

IIUC, your end goal is to be able to gracefully handle DAX file truncation.
Simply killing the guest task that hit the bad page isn't sufficient, as
nothing prevents a future task from accessing the same bad page.  To fully
handle the situation, the guest needs to remove the bad page from its memory
pool.  Once the page is offlined, the guest kernel's error handling will
kick in when a task accesses the bad page (or nothing ever touches the bad
page again and everyone is happy).

Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection
to handle this, but I suspect the resulting behavior will look quite similar,
e.g. notify the virtiofs driver in the guest, which does some magic to take
the offending region offline, and then guest tasks get SIGBUS or whatever.

I also don't think it's KVM's responsibility to _directly_ handle such a
scenario.  As I said in an earlier version, KVM can't possibly know _why_ a
page fault came back with -EFAULT, only userspace can connect the dots of
GPA -> HVA -> vm_area_struct -> file -> inject event.  KVM definitely should
exit to userspace on the -EFAULT instead of hanging the guest, but that can
be done via a new request, as suggested.


Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-02 Thread Vivek Goyal
On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote:
> On Thu, Oct 01, 2020 at 05:55:08PM -0400, Vivek Goyal wrote:
> > On Mon, Sep 28, 2020 at 09:37:00PM -0700, Sean Christopherson wrote:
> > > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> > > > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, 
> > > > unsigned long rflags)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(kvm_set_rflags);
> > > >  
> > > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
> > > > +{
> > > > +   BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
> > > > +
> > > > +   return hash_32(gfn & 0x, 
> > > > order_base_2(ERROR_GFN_PER_VCPU));
> > > > +}
> > > > +
> > > > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > > > +{
> > > > +   u32 key = kvm_error_gfn_hash_fn(gfn);
> > > > +
> > > > +   /*
> > > > +* Overwrite the previous gfn. This is just a hint to do
> > > > +* sync page fault.
> > > > +*/
> > > > +   vcpu->arch.apf.error_gfns[key] = gfn;
> > > > +}
> > > > +
> > > > +/* Returns true if gfn was found in hash table, false otherwise */
> > > > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t 
> > > > gfn)
> > > > +{
> > > > +   u32 key = kvm_error_gfn_hash_fn(gfn);
> > > 
> > > Mostly out of curiosity, do we really need a hash?  E.g. could we get away
> > > with an array of 4 values?  2 values?  Just wondering if we can avoid 64*8
> > > bytes per CPU.
> > 
> > We are relying on returning error when guest task retries fault. Fault
> > will be retried on same address if same task is run by vcpu after
> > "page ready" event. There is no guarantee that same task will be
> > run. In theory, this cpu could have a large number of tasks queued
> > and run these tasks before the faulting task is run again. Now say
> > there are 128 tasks being run and 32 of them have page fault
> > errors. Then if we keep 4 values, newer failures will simply
> > overwrite older failures and we will keep spinning instead of
> > exiting to user space.
> > 
> > That's why this array of 64 gfns and add gfns based on hash. This
> > does not completely elimiante the above possibility but chances
> > of one hitting this are very very slim.
> 
> But have you actually tried such a scenario?  In other words, is there good
> justification for burning the extra memory?

Its not easy to try and reproduce. So it is all theory  at this point of time.
If you are worried about memory usage, we can probably reduce the size
of hash table. Say from 64, reduce it to 8. I am fine with that. I think
initially I had a single error_gfn. But Vitaly had concerns about
above scenario, so I implemeted a hash table.

I think reducing hash table size to 8 or 16 probaly is a good middle
ground.

> 
> Alternatively, what about adding a new KVM request type to handle this?
> E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a
> request.  The vCPU then gets kicked and exits to userspace.  Before exiting
> to userspace, the request handler resets vcpu->arch.apf.error_gfn.  Bad GFNs
> simply get if error_gfn is "valid", i.e. there's a pending request.

Sorry, I did not understand the above proposal. Can you please elaborate
a bit more. Part of it is that I don't know much about KVM requests.
Looking at the code it looks like that main loop is parsing if some
kvm request is pending and executing that action.

Don't we want to make sure that we exit to user space when guest retries
error gfn access again. In this case once we get -EFAULT, we will still
inject page_ready into guest. And then either same process or a different
process might run. 

So when exactly code raises a kvm request. If I raise it right when
I get -EFAULT, then kvm will exit to user space upon next entry
time. But there is no guarantee guest vcpu is running the process which
actually accessed the error gfn. And that probably means that register
state of cpu does not mean much and one can not easily figure out
which task tried to access the bad memory and when.

That's why we prepare a list of error gfn and only exit to user space
when error_gfn access is retried so that guest vcpu context is correct.

What am I missing?

Thanks
Vivek

> 
> That would guarantee the error is propagated to userspace, and doesn't lose
> any guest information as dropping error GFNs just means the guest will take
> more page fault exits.
> 
> > > One thought would be to use the index to handle the case of no error gfns 
> > > so
> > > that the size of the array doesn't affect lookup for the common case, e.g.
> > 
> > We are calculating hash of gfn (used as index in array). So lookup cost
> > is not driven by size of array. Its O(1) and not O(N). We just lookup
> > at one element in array and if it does not match, we return false.
> > 
> > u32 key = kvm_error_gfn_hash_fn(gfn);
> > 
> > if (vcpu->arch.apf.error_gfns[key] != gfn)
> > return 0;
> > 
> > 
> > > 

Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-01 Thread Sean Christopherson
On Thu, Oct 01, 2020 at 05:55:08PM -0400, Vivek Goyal wrote:
> On Mon, Sep 28, 2020 at 09:37:00PM -0700, Sean Christopherson wrote:
> > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> > > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, 
> > > unsigned long rflags)
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_set_rflags);
> > >  
> > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
> > > +{
> > > + BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
> > > +
> > > + return hash_32(gfn & 0x, order_base_2(ERROR_GFN_PER_VCPU));
> > > +}
> > > +
> > > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > > +{
> > > + u32 key = kvm_error_gfn_hash_fn(gfn);
> > > +
> > > + /*
> > > +  * Overwrite the previous gfn. This is just a hint to do
> > > +  * sync page fault.
> > > +  */
> > > + vcpu->arch.apf.error_gfns[key] = gfn;
> > > +}
> > > +
> > > +/* Returns true if gfn was found in hash table, false otherwise */
> > > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t 
> > > gfn)
> > > +{
> > > + u32 key = kvm_error_gfn_hash_fn(gfn);
> > 
> > Mostly out of curiosity, do we really need a hash?  E.g. could we get away
> > with an array of 4 values?  2 values?  Just wondering if we can avoid 64*8
> > bytes per CPU.
> 
> We are relying on returning error when guest task retries fault. Fault
> will be retried on same address if same task is run by vcpu after
> "page ready" event. There is no guarantee that same task will be
> run. In theory, this cpu could have a large number of tasks queued
> and run these tasks before the faulting task is run again. Now say
> there are 128 tasks being run and 32 of them have page fault
> errors. Then if we keep 4 values, newer failures will simply
> overwrite older failures and we will keep spinning instead of
> exiting to user space.
> 
> That's why this array of 64 gfns and add gfns based on hash. This
> does not completely elimiante the above possibility but chances
> of one hitting this are very very slim.

But have you actually tried such a scenario?  In other words, is there good
justification for burning the extra memory?

Alternatively, what about adding a new KVM request type to handle this?
E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a
request.  The vCPU then gets kicked and exits to userspace.  Before exiting
to userspace, the request handler resets vcpu->arch.apf.error_gfn.  Bad GFNs
simply get if error_gfn is "valid", i.e. there's a pending request.

That would guarantee the error is propagated to userspace, and doesn't lose
any guest information as dropping error GFNs just means the guest will take
more page fault exits.

> > One thought would be to use the index to handle the case of no error gfns so
> > that the size of the array doesn't affect lookup for the common case, e.g.
> 
> We are calculating hash of gfn (used as index in array). So lookup cost
> is not driven by size of array. Its O(1) and not O(N). We just lookup
> at one element in array and if it does not match, we return false.
> 
>   u32 key = kvm_error_gfn_hash_fn(gfn);
> 
>   if (vcpu->arch.apf.error_gfns[key] != gfn)
>   return 0;
> 
> 
> > 
> > ...
> > 
> > u8 error_gfn_head;
> > gfn_t error_gfns[ERROR_GFN_PER_VCPU];
> > } apf;  
> > 
> > 
> > if (vcpu->arch.apf.error_gfn_head == 0xff)
> > return false;
> > 
> > for (i = 0; i < vcpu->arch.apf.error_gfn_head; i++) {
> > if (vcpu->arch.apf.error_gfns[i] == gfn) {
> > vcpu->arch.apf.error_gfns[i] = INVALID_GFN;
> > return true;
> > }
> > }
> > return true;
> > 
> > Or you could even avoid INVALID_GFN altogether by compacting the array on
> > removal.  The VM is probably dead anyways, burning a few cycles to clean
> > things up is a non-issue.  Ditto for slow insertion.
> 
> Same for insertion. Its O(1) and not dependent on size of array. Also
> insertion anyway is very infrequent event because it will not be
> often that error will happen.

I know, I was saying that if we move to an array then we'd technically be
subject to O(whatever) search and delete, but that it's irrelevant from a
performance perspective because the guest is already toast if it hits a bad
GFN.

> > > +
> > > + if (vcpu->arch.apf.error_gfns[key] != gfn)
> > > + return 0;
> > 
> > Should be "return false".
> 
> Will fix.
> 
> Thanks
> Vivek
> 


Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-10-01 Thread Vivek Goyal
On Mon, Sep 28, 2020 at 09:37:00PM -0700, Sean Christopherson wrote:
> On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> > ---
> >  arch/x86/include/asm/kvm_host.h |  2 ++
> >  arch/x86/kvm/mmu.h  |  2 +-
> >  arch/x86/kvm/mmu/mmu.c  |  2 +-
> >  arch/x86/kvm/x86.c  | 54 +++--
> >  include/linux/kvm_types.h   |  1 +
> >  5 files changed, 56 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index be5363b21540..e6f8d3f1a377 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t 
> > base_gfn, int level)
> >  #define KVM_NR_VAR_MTRR 8
> >  
> >  #define ASYNC_PF_PER_VCPU 64
> > +#define ERROR_GFN_PER_VCPU 64
> 
> Aren't these two related?  I.e. wouldn't it make sense to do:

Hi,

They are related somewhat but they don't have to be same. I think we
can accumulate many more error GFNs if this vcpu does not schedule
the same task again to retry.

>   #define ERROR_GFN_PER_VCPU ASYNC_PF_PER_VCPU
> 
> Or maybe even size error_gfns[] to ASYNC_PF_PER_VCPU?

Given these two don't have to be same, I kept it separate. And kept the
hash size same for now. If one wants, hash can be bigger or smaller
down the line.

> 
> >  
> >  enum kvm_reg {
> > VCPU_REGS_RAX = __VCPU_REGS_RAX,
> > @@ -778,6 +779,7 @@ struct kvm_vcpu_arch {
> > unsigned long nested_apf_token;
> > bool delivery_as_pf_vmexit;
> > bool pageready_pending;
> > +   gfn_t error_gfns[ERROR_GFN_PER_VCPU];
> > } apf;
> >  
> > /* OSVW MSRs (AMD only) */
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 444bb9c54548..d0a2a12c7bb6 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool 
> > reset_roots);
> >  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 
> > efer);
> >  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
> >  bool accessed_dirty, gpa_t new_eptp);
> > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
> > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
> >  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> > u64 fault_address, char *insn, int insn_len);
> >  
> 
> ...
> 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 88c593f83b28..c1f5094d6e53 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct 
> > kvm_vcpu *vcpu)
> > vcpu->arch.apf.gfns[i] = ~0;
> >  }
> >  
> > +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu)
> > +{
> > +   int i;
> 
> Need a newline.   

Will do.

> 
> > +   for (i = 0; i < ERROR_GFN_PER_VCPU; i++)
> > +   vcpu->arch.apf.error_gfns[i] = GFN_INVALID;
> > +}
> > +
> >  static void kvm_on_user_return(struct user_return_notifier *urn)
> >  {
> > unsigned slot;
> > @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT;
> >  
> > kvm_async_pf_hash_reset(vcpu);
> > +   kvm_error_gfn_hash_reset(vcpu);
> > kvm_pmu_init(vcpu);
> >  
> > vcpu->arch.pending_external_vector = -1;
> > @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool 
> > init_event)
> >  
> > kvm_clear_async_pf_completion_queue(vcpu);
> > kvm_async_pf_hash_reset(vcpu);
> > +   kvm_error_gfn_hash_reset(vcpu);
> > vcpu->arch.apf.halted = false;
> >  
> > if (kvm_mpx_supported()) {
> > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, 
> > unsigned long rflags)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_set_rflags);
> >  
> > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
> > +{
> > +   BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
> > +
> > +   return hash_32(gfn & 0x, order_base_2(ERROR_GFN_PER_VCPU));
> > +}
> > +
> > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > +{
> > +   u32 key = kvm_error_gfn_hash_fn(gfn);
> > +
> > +   /*
> > +* Overwrite the previous gfn. This is just a hint to do
> > +* sync page fault.
> > +*/
> > +   vcpu->arch.apf.error_gfns[key] = gfn;
> > +}
> > +
> > +/* Returns true if gfn was found in hash table, false otherwise */
> > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > +{
> > +   u32 key = kvm_error_gfn_hash_fn(gfn);
> 
> Mostly out of curiosity, do we really need a hash?  E.g. could we get away
> with an array of 4 values?  2 values?  Just wondering if we can avoid 64*8
> bytes per CPU.

We are relying on returning error when guest task retries fault. Fault
will be retried on same address if same task is run by vcpu after
"page ready" event. There is no guarantee 

Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-09-28 Thread Sean Christopherson
On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/mmu.h  |  2 +-
>  arch/x86/kvm/mmu/mmu.c  |  2 +-
>  arch/x86/kvm/x86.c  | 54 +++--
>  include/linux/kvm_types.h   |  1 +
>  5 files changed, 56 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be5363b21540..e6f8d3f1a377 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t 
> base_gfn, int level)
>  #define KVM_NR_VAR_MTRR 8
>  
>  #define ASYNC_PF_PER_VCPU 64
> +#define ERROR_GFN_PER_VCPU 64

Aren't these two related?  I.e. wouldn't it make sense to do:

  #define ERROR_GFN_PER_VCPU ASYNC_PF_PER_VCPU

Or maybe even size error_gfns[] to ASYNC_PF_PER_VCPU?

>  
>  enum kvm_reg {
>   VCPU_REGS_RAX = __VCPU_REGS_RAX,
> @@ -778,6 +779,7 @@ struct kvm_vcpu_arch {
>   unsigned long nested_apf_token;
>   bool delivery_as_pf_vmexit;
>   bool pageready_pending;
> + gfn_t error_gfns[ERROR_GFN_PER_VCPU];
>   } apf;
>  
>   /* OSVW MSRs (AMD only) */
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 444bb9c54548..d0a2a12c7bb6 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
>  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
>  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
>bool accessed_dirty, gpa_t new_eptp);
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>   u64 fault_address, char *insn, int insn_len);
>  

...

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 88c593f83b28..c1f5094d6e53 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct 
> kvm_vcpu *vcpu)
>   vcpu->arch.apf.gfns[i] = ~0;
>  }
>  
> +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu)
> +{
> + int i;

Need a newline.   

> + for (i = 0; i < ERROR_GFN_PER_VCPU; i++)
> + vcpu->arch.apf.error_gfns[i] = GFN_INVALID;
> +}
> +
>  static void kvm_on_user_return(struct user_return_notifier *urn)
>  {
>   unsigned slot;
> @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>   vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT;
>  
>   kvm_async_pf_hash_reset(vcpu);
> + kvm_error_gfn_hash_reset(vcpu);
>   kvm_pmu_init(vcpu);
>  
>   vcpu->arch.pending_external_vector = -1;
> @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool 
> init_event)
>  
>   kvm_clear_async_pf_completion_queue(vcpu);
>   kvm_async_pf_hash_reset(vcpu);
> + kvm_error_gfn_hash_reset(vcpu);
>   vcpu->arch.apf.halted = false;
>  
>   if (kvm_mpx_supported()) {
> @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned 
> long rflags)
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_rflags);
>  
> +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
> +{
> + BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
> +
> + return hash_32(gfn & 0x, order_base_2(ERROR_GFN_PER_VCPU));
> +}
> +
> +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> + u32 key = kvm_error_gfn_hash_fn(gfn);
> +
> + /*
> +  * Overwrite the previous gfn. This is just a hint to do
> +  * sync page fault.
> +  */
> + vcpu->arch.apf.error_gfns[key] = gfn;
> +}
> +
> +/* Returns true if gfn was found in hash table, false otherwise */
> +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> + u32 key = kvm_error_gfn_hash_fn(gfn);

Mostly out of curiosity, do we really need a hash?  E.g. could we get away
with an array of 4 values?  2 values?  Just wondering if we can avoid 64*8
bytes per CPU.

One thought would be to use the index to handle the case of no error gfns so
that the size of the array doesn't affect lookup for the common case, e.g.

...

u8 error_gfn_head;
gfn_t error_gfns[ERROR_GFN_PER_VCPU];
} apf;  


if (vcpu->arch.apf.error_gfn_head == 0xff)
return false;

for (i = 0; i < vcpu->arch.apf.error_gfn_head; i++) {
if (vcpu->arch.apf.error_gfns[i] == gfn) {
vcpu->arch.apf.error_gfns[i] = INVALID_GFN;
return true;
}
}
return true;

Or you could even avoid INVALID_GFN altogether by compacting the array on
removal.  The VM is probably dead anyways, burning a few cycles to clean

Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-08-07 Thread Vivek Goyal
On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> Page fault error handling behavior in kvm seems little inconsistent when
> page fault reports error. If we are doing fault synchronously
> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and
> exit to user space and qemu reports error, "error: kvm run failed Bad 
> address".

Hi Paolo,

Ping. I am wondering what do you think about this patch. Can it be
merged?

Thanks
Vivek

> 
> But if we are doing async page fault, then async_pf_execute() will simply
> ignore the error reported by get_user_pages_remote() or
> by kvm_mmu_do_page_fault(). It is assumed that page fault was successful
> and either a page ready event is injected in guest or guest is brought
> out of artificial halt state and run again. In both the cases when guest
> retries the instruction, it takes exit again as page fault was not
> successful in previous attempt. And then this infinite loop continues
> forever.
> 
> Trying fault in a loop will make sense if error is temporary and will
> be resolved on retry. But I don't see any intention in the code to
> determine if error is temporary or not.  Whether to do fault synchronously
> or asynchronously, depends on so many variables but none of the varibales
> is whether error is temporary or not. (kvm_can_do_async_pf()).
> 
> And that makes it very inconsistent or unpredictable to figure out whether
> kvm will exit to qemu with error or it will just retry and go into an
> infinite loop.
> 
> This patch tries to make this behavior consistent. That is instead of
> getting into infinite loop of retrying page fault, exit to user space
> and stop VM if page fault error happens.
> 
> In future this can be improved by injecting errors into guest. As of
> now we don't have any race free method to inject errors in guest.
> 
> When page fault error happens in async path save that pfn and when next
> time guest retries, do a sync fault instead of async fault. So that if error
> is encountered, we exit to qemu and avoid infinite loop.
> 
> We maintain a cache of error gfns and force sync fault if a gfn is
> found in cache of error gfn. There is a small possibility that we
> miss an error gfn (as it got overwritten by a new error gfn). But
> its just a hint and sooner or later some error pfn will match
> and we will force sync fault and exit to user space.
> 
> Changes from v3:
> - Added function kvm_find_and_remove_error_gfn() and removed
>   kvm_find_error_gfn() and kvm_del_error_gfn(). (Vitaly)
> 
> - Added a macro GFN_INVALID (Vitaly).
> 
> - Used gpa_to_gfn() to convert gpa to gfn (Vitaly)
> 
> Change from v2:
> - Fixed a warning by converting kvm_find_error_gfn() static.
> 
> Change from v1:
> - Maintain a cache of error gfns, instead of single gfn. (Vitaly)
> 
> Signed-off-by: Vivek Goyal 
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/mmu.h  |  2 +-
>  arch/x86/kvm/mmu/mmu.c  |  2 +-
>  arch/x86/kvm/x86.c  | 54 +++--
>  include/linux/kvm_types.h   |  1 +
>  5 files changed, 56 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be5363b21540..e6f8d3f1a377 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t 
> base_gfn, int level)
>  #define KVM_NR_VAR_MTRR 8
>  
>  #define ASYNC_PF_PER_VCPU 64
> +#define ERROR_GFN_PER_VCPU 64
>  
>  enum kvm_reg {
>   VCPU_REGS_RAX = __VCPU_REGS_RAX,
> @@ -778,6 +779,7 @@ struct kvm_vcpu_arch {
>   unsigned long nested_apf_token;
>   bool delivery_as_pf_vmexit;
>   bool pageready_pending;
> + gfn_t error_gfns[ERROR_GFN_PER_VCPU];
>   } apf;
>  
>   /* OSVW MSRs (AMD only) */
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 444bb9c54548..d0a2a12c7bb6 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
>  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
>  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
>bool accessed_dirty, gpa_t new_eptp);
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>   u64 fault_address, char *insn, int insn_len);
>  
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6d6a0ae7800c..b51d4aa405e0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool 
> prefault, gfn_t gfn,
>   if (!async)
>   return false; /* *pfn has correct page already */
>  
> - if (!prefault && 

Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-07-29 Thread Pankaj Gupta
> Page fault error handling behavior in kvm seems little inconsistent when
> page fault reports error. If we are doing fault synchronously
> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and
> exit to user space and qemu reports error, "error: kvm run failed Bad 
> address".
>
> But if we are doing async page fault, then async_pf_execute() will simply
> ignore the error reported by get_user_pages_remote() or
> by kvm_mmu_do_page_fault(). It is assumed that page fault was successful
> and either a page ready event is injected in guest or guest is brought
> out of artificial halt state and run again. In both the cases when guest
> retries the instruction, it takes exit again as page fault was not
> successful in previous attempt. And then this infinite loop continues
> forever.
>
> Trying fault in a loop will make sense if error is temporary and will
> be resolved on retry. But I don't see any intention in the code to
> determine if error is temporary or not.  Whether to do fault synchronously
> or asynchronously, depends on so many variables but none of the varibales
s/varibales/variables
> is whether error is temporary or not. (kvm_can_do_async_pf()).
>
> And that makes it very inconsistent or unpredictable to figure out whether
> kvm will exit to qemu with error or it will just retry and go into an
> infinite loop.
>
> This patch tries to make this behavior consistent. That is instead of
> getting into infinite loop of retrying page fault, exit to user space
> and stop VM if page fault error happens.
>
> In future this can be improved by injecting errors into guest. As of
> now we don't have any race free method to inject errors in guest.
>
> When page fault error happens in async path save that pfn and when next
> time guest retries, do a sync fault instead of async fault. So that if error
> is encountered, we exit to qemu and avoid infinite loop.
>
> We maintain a cache of error gfns and force sync fault if a gfn is
> found in cache of error gfn. There is a small possibility that we
> miss an error gfn (as it got overwritten by a new error gfn). But
> its just a hint and sooner or later some error pfn will match
> and we will force sync fault and exit to user space.
>
> Changes from v3:
> - Added function kvm_find_and_remove_error_gfn() and removed
>   kvm_find_error_gfn() and kvm_del_error_gfn(). (Vitaly)
>
> - Added a macro GFN_INVALID (Vitaly).
>
> - Used gpa_to_gfn() to convert gpa to gfn (Vitaly)
>
> Change from v2:
> - Fixed a warning by converting kvm_find_error_gfn() static.
>
> Change from v1:
> - Maintain a cache of error gfns, instead of single gfn. (Vitaly)
>
> Signed-off-by: Vivek Goyal 
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/mmu.h  |  2 +-
>  arch/x86/kvm/mmu/mmu.c  |  2 +-
>  arch/x86/kvm/x86.c  | 54 +++--
>  include/linux/kvm_types.h   |  1 +
>  5 files changed, 56 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be5363b21540..e6f8d3f1a377 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t 
> base_gfn, int level)
>  #define KVM_NR_VAR_MTRR 8
>
>  #define ASYNC_PF_PER_VCPU 64
> +#define ERROR_GFN_PER_VCPU 64
>
>  enum kvm_reg {
> VCPU_REGS_RAX = __VCPU_REGS_RAX,
> @@ -778,6 +779,7 @@ struct kvm_vcpu_arch {
> unsigned long nested_apf_token;
> bool delivery_as_pf_vmexit;
> bool pageready_pending;
> +   gfn_t error_gfns[ERROR_GFN_PER_VCPU];
> } apf;
>
> /* OSVW MSRs (AMD only) */
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 444bb9c54548..d0a2a12c7bb6 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
>  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
>  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
>  bool accessed_dirty, gpa_t new_eptp);
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> u64 fault_address, char *insn, int insn_len);
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6d6a0ae7800c..b51d4aa405e0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool 
> prefault, gfn_t gfn,
> if (!async)
> return false; /* *pfn has correct page already */
>
> -   if (!prefault && kvm_can_do_async_pf(vcpu)) {
> +   if (!prefault && kvm_can_do_async_pf(vcpu, gpa_to_gfn(cr2_or_gpa))) {
> 

Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-07-27 Thread Vivek Goyal
On Mon, Jul 27, 2020 at 06:09:32PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal  writes:
> 
> > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> >> Page fault error handling behavior in kvm seems little inconsistent when
> >> page fault reports error. If we are doing fault synchronously
> >> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and
> >> exit to user space and qemu reports error, "error: kvm run failed Bad 
> >> address".
> >
> > Hi Vitaly,
> >
> > A gentle reminder. How does this patch look now?
> >
> 
> Sorry, I even reviewd it but never replied. It looks good to me!
> 
> Reviewed-by: Vitaly Kuznetsov 

Thanks Vitaly.

Paolo, what do you think about this patch. Do you have concerns with
this. Can this be merged.

Thanks
Vivek



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-07-27 Thread Vitaly Kuznetsov
Vivek Goyal  writes:

> On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
>> Page fault error handling behavior in kvm seems little inconsistent when
>> page fault reports error. If we are doing fault synchronously
>> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and
>> exit to user space and qemu reports error, "error: kvm run failed Bad 
>> address".
>
> Hi Vitaly,
>
> A gentle reminder. How does this patch look now?
>

Sorry, I even reviewd it but never replied. It looks good to me!

Reviewed-by: Vitaly Kuznetsov 

-- 
Vitaly



Re: [PATCH v4] kvm,x86: Exit to user space in case page fault error

2020-07-27 Thread Vivek Goyal
On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> Page fault error handling behavior in kvm seems little inconsistent when
> page fault reports error. If we are doing fault synchronously
> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and
> exit to user space and qemu reports error, "error: kvm run failed Bad 
> address".

Hi Vitaly,

A gentle reminder. How does this patch look now?

Thanks
Vivek

> 
> But if we are doing async page fault, then async_pf_execute() will simply
> ignore the error reported by get_user_pages_remote() or
> by kvm_mmu_do_page_fault(). It is assumed that page fault was successful
> and either a page ready event is injected in guest or guest is brought
> out of artificial halt state and run again. In both the cases when guest
> retries the instruction, it takes exit again as page fault was not
> successful in previous attempt. And then this infinite loop continues
> forever.
> 
> Trying fault in a loop will make sense if error is temporary and will
> be resolved on retry. But I don't see any intention in the code to
> determine if error is temporary or not.  Whether to do fault synchronously
> or asynchronously, depends on so many variables but none of the varibales
> is whether error is temporary or not. (kvm_can_do_async_pf()).
> 
> And that makes it very inconsistent or unpredictable to figure out whether
> kvm will exit to qemu with error or it will just retry and go into an
> infinite loop.
> 
> This patch tries to make this behavior consistent. That is instead of
> getting into infinite loop of retrying page fault, exit to user space
> and stop VM if page fault error happens.
> 
> In future this can be improved by injecting errors into guest. As of
> now we don't have any race free method to inject errors in guest.
> 
> When page fault error happens in async path save that pfn and when next
> time guest retries, do a sync fault instead of async fault. So that if error
> is encountered, we exit to qemu and avoid infinite loop.
> 
> We maintain a cache of error gfns and force sync fault if a gfn is
> found in cache of error gfn. There is a small possibility that we
> miss an error gfn (as it got overwritten by a new error gfn). But
> its just a hint and sooner or later some error pfn will match
> and we will force sync fault and exit to user space.
> 
> Changes from v3:
> - Added function kvm_find_and_remove_error_gfn() and removed
>   kvm_find_error_gfn() and kvm_del_error_gfn(). (Vitaly)
> 
> - Added a macro GFN_INVALID (Vitaly).
> 
> - Used gpa_to_gfn() to convert gpa to gfn (Vitaly)
> 
> Change from v2:
> - Fixed a warning by converting kvm_find_error_gfn() static.
> 
> Change from v1:
> - Maintain a cache of error gfns, instead of single gfn. (Vitaly)
> 
> Signed-off-by: Vivek Goyal 
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/mmu.h  |  2 +-
>  arch/x86/kvm/mmu/mmu.c  |  2 +-
>  arch/x86/kvm/x86.c  | 54 +++--
>  include/linux/kvm_types.h   |  1 +
>  5 files changed, 56 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be5363b21540..e6f8d3f1a377 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t 
> base_gfn, int level)
>  #define KVM_NR_VAR_MTRR 8
>  
>  #define ASYNC_PF_PER_VCPU 64
> +#define ERROR_GFN_PER_VCPU 64
>  
>  enum kvm_reg {
>   VCPU_REGS_RAX = __VCPU_REGS_RAX,
> @@ -778,6 +779,7 @@ struct kvm_vcpu_arch {
>   unsigned long nested_apf_token;
>   bool delivery_as_pf_vmexit;
>   bool pageready_pending;
> + gfn_t error_gfns[ERROR_GFN_PER_VCPU];
>   } apf;
>  
>   /* OSVW MSRs (AMD only) */
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 444bb9c54548..d0a2a12c7bb6 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
>  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
>  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
>bool accessed_dirty, gpa_t new_eptp);
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>   u64 fault_address, char *insn, int insn_len);
>  
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6d6a0ae7800c..b51d4aa405e0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool 
> prefault, gfn_t gfn,
>   if (!async)
>   return false; /* *pfn has correct page already */
>  
> - if (!prefault && kvm_can_do_async_pf(vcpu)) {
> + if