Re: [PATCH v2 2/4] eventfd: simplify eventfd_signal()

2024-02-07 Thread Paolo Bonzini
On Wed, Nov 22, 2023 at 1:49 PM Christian Brauner  wrote:
>
> Ever since the evenfd type was introduced back in 2007 in commit
> e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal()
> function only ever passed 1 as a value for @n. There's no point in
> keeping that additional argument.
>
> Signed-off-by: Christian Brauner 
> ---
>  arch/x86/kvm/hyperv.c |  2 +-
>  arch/x86/kvm/xen.c|  2 +-
>  virt/kvm/eventfd.c|  4 ++--
>  30 files changed, 60 insertions(+), 63 deletions(-)

For KVM:

Acked-by: Paolo Bonzini 



Re: [PATCH] drm/ttm: set TTM allocated pages as reserved

2023-03-29 Thread Paolo Bonzini

On 3/29/23 18:43, Christian König wrote:



3) other uses of kmap() must switch to MMU-notifier protection.


I would rather suggest to return the page additionally to the pfn from 
hva_to_pfn() when the function was able to grab a reference to it.


When the page is then not available you can't call kmap() on that either.



If the DRM people are okay with SetPageReserved() as a temporary hack, 
we can change or remove the WARN in kvm_is_zone_device_page(), since 
that is what you are referring to in the commit message.


Adding Daniel for additional comments on this, but essentially we have 
changed quite some infrastructure to make sure that everybody uses 
VM_PFNMAP to prevent stuff like this from happening.


I would really prefer a proper solution in KVM instead.


Hmm... Now that I think about it that would be

https://lore.kernel.org/kvm/yc4h+dgfk83ba...@google.com/t/

Time to resurrect that work.

Paolo



Re: [PATCH] drm/ttm: set TTM allocated pages as reserved

2023-03-29 Thread Paolo Bonzini

On 3/29/23 17:29, Christian König wrote:
First, is it a _host_ corruption or a guest corruption/crash?  A guest 
crash would be KVM doing exactly what it's meant to do: it detects the 
non-reserved, non-refcounted page and refuses to map it into the guest.


Yes I think that this is what happens.


Ok I was worried all the time that this was host corruption/crash; which 
obviously would have been much worse.



The use case and mapping is indeed valid as far as I can see, but
the handling that KVM does here is really problematic.

What needs to happen instead is that when both hva_to_pfn_fast() and 
hva_to_pfn_slow() fails you don't try to convert the PFN into a page and 
so also don't get a reference to the page.


This somehow needs to be communicated to the callers of hva_to_pfn() so 
that kvm_release_pfn() knows what to do.


There's a bit more complication here:

1) in the guest page fault path we can avoid taking the reference altogether

2) in other MMU-notifier-protected paths, we can also avoid taking the 
reference but we also must stop using kmap() in virt/kvm/pfncache.c.


3) other uses of kmap() must switch to MMU-notifier protection.


If the DRM people are okay with SetPageReserved() as a temporary hack, 
we can change or remove the WARN in kvm_is_zone_device_page(), since 
that is what you are referring to in the commit message.


Paolo



Re: [PATCH] drm/ttm: set TTM allocated pages as reserved

2023-03-29 Thread Paolo Bonzini

On 3/29/23 16:28, Paolo Bonzini wrote:

On 3/29/23 15:54, Christian König wrote:

KVM tries to grab references to pages in VMAs marked with VM_PFNMAP.
This is illegal and can cause data corruption with TTM pages because
only some of them are actually reference counted.


After some other offlist discussions, I also would like to understand 
what you mean by corruption.


First, is it a _host_ corruption or a guest corruption/crash?  A guest 
crash would be KVM doing exactly what it's meant to do: it detects the 
non-reserved, non-refcounted page and refuses to map it into the guest.


On the other hand, if it is a host crash, my understanding is that an 
order>0 allocation leaves the tail pages with a zero reference count 
(and without a compound_head if, as in this case, __GFP_COMP is unset). 
If that's correct, more analysis is needed to understand why 
get_page_unless_zero() isn't rejecting the tail pages.


Paolo



Mark all pages allocated by TTM as reserved, this way KVM handles the
PFNs like they would point to MMIO space.

This still results in a warning, but at least no other problem.


What warning is it?

Paolo


Signed-off-by: Christian König






Re: [PATCH] drm/ttm: set TTM allocated pages as reserved

2023-03-29 Thread Paolo Bonzini

On 3/29/23 15:54, Christian König wrote:

KVM tries to grab references to pages in VMAs marked with VM_PFNMAP.
This is illegal and can cause data corruption with TTM pages because
only some of them are actually reference counted.


I think that you are referring to this:

/* Don't set the __GFP_COMP flag for higher order allocations.
 * Mapping pages directly into an userspace process and calling
 * put_page() on a TTM allocated page is illegal.
 */
if (order)
gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN |
__GFP_KSWAPD_RECLAIM;

By "directly" I guess you mean without going through remap_pfn_range().

Based on our discussion offlist, it should be possible to remove the
get_page/put_page from the path that fills in the KVM page table, but
it is difficult to remove it altogether (it requires changing everything
to use userspace virtual address).

Indeed KVM needs to detect non-reference-counted pages because unfortunately
there are cases where people want to map VM_PFNMAP pages into a guest.  If
it is not enough to have PageReserved set, and there is a better check, KVM
can be fixed too.


Mark all pages allocated by TTM as reserved, this way KVM handles the
PFNs like they would point to MMIO space.

This still results in a warning, but at least no other problem.


What warning is it?

Paolo


Signed-off-by: Christian König




Re: [PATCH v2 1/5] KVM: do not allow mapping valid but non-refcounted pages

2021-06-25 Thread Paolo Bonzini

On 25/06/21 09:58, Christian Borntraeger wrote:



On 25.06.21 09:36, David Stevens wrote:

From: Nicholas Piggin 

It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher order allocations). These
host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
of APIs, which take a reference to the page, which takes it from 0 to 1.
When the reference is dropped, this will free the page incorrectly.

Fix this by only taking a reference on the page if it was non-zero,
which indicates it is participating in normal refcounting (and can be
released with put_page).

Signed-off-by: Nicholas Piggin 


I guess this would be the small fix for stable? Do we want to add that cc?

Reviewed-by: Christian Borntraeger 


Yes, this one is going to Linus today.  The rest is for 5.15.

Paolo


---
  virt/kvm/kvm_main.c | 19 +--
  1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3dcc2abbfc60..f7445c3bcd90 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2175,6 +2175,13 @@ static bool vma_is_valid(struct vm_area_struct 
*vma, bool write_fault)

  return true;
  }

+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+    if (kvm_is_reserved_pfn(pfn))
+    return 1;
+    return get_page_unless_zero(pfn_to_page(pfn));
+}
+
  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 unsigned long addr, bool *async,
 bool write_fault, bool *writable,
@@ -2224,13 +2231,21 @@ static int hva_to_pfn_remapped(struct 
vm_area_struct *vma,

   * Whoever called remap_pfn_range is also going to call e.g.
   * unmap_mapping_range before the underlying pages are freed,
   * causing a call to our MMU notifier.
+ *
+ * Certain IO or PFNMAP mappings can be backed with valid
+ * struct pages, but be allocated without refcounting e.g.,
+ * tail pages of non-compound higher order allocations, which
+ * would then underflow the refcount when the caller does the
+ * required put_page. Don't allow those pages here.
   */
-    kvm_get_pfn(pfn);
+    if (!kvm_try_get_pfn(pfn))
+    r = -EFAULT;

  out:
  pte_unmap_unlock(ptep, ptl);
  *p_pfn = pfn;
-    return 0;
+
+    return r;
  }

  /*







Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

2021-06-24 Thread Paolo Bonzini

On 24/06/21 14:57, Nicholas Piggin wrote:

KVM: Fix page ref underflow for regions with valid but non-refcounted pages


It doesn't really fix the underflow, it disallows mapping them in the 
first place.  Since in principle things can break, I'd rather be 
explicit, so let's go with "KVM: do not allow mapping valid but 
non-reference-counted pages".



It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher order allocations). These
host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
of APIs, which take a reference to the page, which takes it from 0 to 1.
When the reference is dropped, this will free the page incorrectly.

Fix this by only taking a reference on the page if it was non-zero,


s/on the page/on valid pages/ (makes clear that invalid pages are fine 
without refcounting).


Thank you *so* much, I'm awful at Linux mm.

Paolo


which indicates it is participating in normal refcounting (and can be
released with put_page).

Signed-off-by: Nicholas Piggin




Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

2021-06-24 Thread Paolo Bonzini

On 24/06/21 13:42, Nicholas Piggin wrote:

Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:

Excerpts from David Stevens's message of June 24, 2021 1:57 pm:

KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
follow_pte in gfn_to_pfn. However, the resolved pfns may not have
assoicated struct pages, so they should not be passed to pfn_to_page.
This series removes such calls from the x86 and arm64 secondary MMU. To
do this, this series modifies gfn_to_pfn to return a struct page in
addition to a pfn, if the hva was resolved by gup. This allows the
caller to call put_page only when necessated by gup.

This series provides a helper function that unwraps the new return type
of gfn_to_pfn to provide behavior identical to the old behavior. As I
have no hardware to test powerpc/mips changes, the function is used
there for minimally invasive changes. Additionally, as gfn_to_page and
gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
easily changed over to only use pfns.

This addresses CVE-2021-22543 on x86 and arm64.


Does this fix the problem? (untested I don't have a POC setup at hand,
but at least in concept)


This one actually compiles at least. Unfortunately I don't have much
time in the near future to test, and I only just found out about this
CVE a few hours ago.


And it also works (the reproducer gets an infinite stream of userspace 
exits and especially does not crash).  We can still go for David's 
solution later since MMU notifiers are able to deal with this pages, but 
it's a very nice patch for stable kernels.


If you provide a Signed-off-by, I can integrate it.

Paolo


---


It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher order allocations). These
host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
of APIs, which take a reference to the page, which takes it from 0 to 1.
When the reference is dropped, this will free the page incorrectly.

Fix this by only taking a reference on the page if it was non-zero,
which indicates it is participating in normal refcounting (and can be
released with put_page).

---
  virt/kvm/kvm_main.c | 19 +--
  1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..46fb042837d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, 
bool write_fault)
return true;
  }
  
+static int kvm_try_get_pfn(kvm_pfn_t pfn)

+{
+   if (kvm_is_reserved_pfn(pfn))
+   return 1;
+   return get_page_unless_zero(pfn_to_page(pfn));
+}
+
  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
   unsigned long addr, bool *async,
   bool write_fault, bool *writable,
@@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct 
*vma,
 * Whoever called remap_pfn_range is also going to call e.g.
 * unmap_mapping_range before the underlying pages are freed,
 * causing a call to our MMU notifier.
+*
+* Certain IO or PFNMAP mappings can be backed with valid
+* struct pages, but be allocated without refcounting e.g.,
+* tail pages of non-compound higher order allocations, which
+* would then underflow the refcount when the caller does the
+* required put_page. Don't allow those pages here.
 */
-   kvm_get_pfn(pfn);
+   if (!kvm_try_get_pfn(pfn))
+   r = -EFAULT;
  
  out:

pte_unmap_unlock(ptep, ptl);
*p_pfn = pfn;
-   return 0;
+
+   return r;
  }
  
  /*






Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

2021-06-24 Thread Paolo Bonzini

On 24/06/21 13:42, Nicholas Piggin wrote:

+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+   if (kvm_is_reserved_pfn(pfn))
+   return 1;


So !pfn_valid would always return true.  Yeah, this should work and is 
certainly appealing!


Paolo



+   return get_page_unless_zero(pfn_to_page(pfn));
+}
+
  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
   unsigned long addr, bool *async,
   bool write_fault, bool *writable,
@@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct 
*vma,
 * Whoever called remap_pfn_range is also going to call e.g.
 * unmap_mapping_range before the underlying pages are freed,
 * causing a call to our MMU notifier.
+*
+* Certain IO or PFNMAP mappings can be backed with valid
+* struct pages, but be allocated without refcounting e.g.,
+* tail pages of non-compound higher order allocations, which
+* would then underflow the refcount when the caller does the
+* required put_page. Don't allow those pages here.
 */
-   kvm_get_pfn(pfn);
+   if (!kvm_try_get_pfn(pfn))
+   r = -EFAULT;
  
  out:

pte_unmap_unlock(ptep, ptl);
*p_pfn = pfn;
-   return 0;
+
+   return r;
  }
  
  /*






Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

2021-06-24 Thread Paolo Bonzini

On 24/06/21 12:17, Nicholas Piggin wrote:

If all callers were updated that is one thing, but from the changelog
it sounds like that would not happen and there would be some gfn_to_pfn
users left over.

But yes in the end you would either need to make gfn_to_pfn never return
a page found via follow_pte, or change all callers to the new way. If
the plan is for the latter then I guess that's fine.


Actually in that case anyway I don't see the need -- the existence of
gfn_to_pfn is enough to know it might be buggy. It can just as easily
be grepped for as kvm_pfn_page_unwrap.


Sure, but that would leave us with longer function names 
(gfn_to_pfn_page* instead of gfn_to_pfn*).  So the "safe" use is the one 
that looks worse and the unsafe use is the one that looks safe.



And are gfn_to_page cases also
vulernable to the same issue?


No, they're just broken for the VM_IO|VM_PFNMAP case.

Paolo


So I think it could be marked deprecated or something if not everything
will be converted in the one series, and don't need to touch all that
arch code with this patch.




Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU

2021-06-24 Thread Paolo Bonzini

On 24/06/21 12:06, Marc Zyngier wrote:

On Thu, 24 Jun 2021 09:58:00 +0100,
Nicholas Piggin  wrote:


Excerpts from David Stevens's message of June 24, 2021 1:57 pm:

From: David Stevens 
  out_unlock:
if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
read_unlock(>kvm->mmu_lock);
else
write_unlock(>kvm->mmu_lock);
-   kvm_release_pfn_clean(pfn);
+   if (pfnpg.page)
+   put_page(pfnpg.page);
return r;
  }


How about

   kvm_release_pfn_page_clean(pfnpg);


I'm not sure. I always found kvm_release_pfn_clean() ugly, because it
doesn't mark the page 'clean'. I find put_page() more correct.

Something like 'kvm_put_pfn_page()' would make more sense, but I'm so
bad at naming things that I could just as well call it 'bob()'.


The best way to go would be to get rid of kvm_release_pfn_clean() and 
always go through a pfn_page.  Then we could or could not introduce 
wrappers kvm_put_pfn_page{,_dirty}.


I think for now it's best to limit the churn since these patches will go 
in the stable releases too, and clean up the resulting API once we have 
a clear idea of how all architectures are using kvm_pfn_page.


Paolo



Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

2021-06-24 Thread Paolo Bonzini

On 24/06/21 11:57, Nicholas Piggin wrote:

Needing kvm_pfn_page_unwrap is a sign that something might be buggy, so
it's a good idea to move the short name to the common case and the ugly
kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one.  In fact I'm not
sure there should be any kvm_pfn_page_unwrap in the end.


If all callers were updated that is one thing, but from the changelog
it sounds like that would not happen and there would be some gfn_to_pfn
users left over.


In this patches there are, so yeah the plan is to always change the 
callers to the new way.



But yes in the end you would either need to make gfn_to_pfn never return
a page found via follow_pte, or change all callers to the new way. If
the plan is for the latter then I guess that's fine.




Re: [PATCH 1/6] KVM: x86/mmu: release audited pfns

2021-06-24 Thread Paolo Bonzini

On 24/06/21 10:43, Nicholas Piggin wrote:

Excerpts from David Stevens's message of June 24, 2021 1:57 pm:

From: David Stevens 


Changelog? This looks like a bug, should it have a Fixes: tag?


Probably has been there forever... The best way to fix the bug would be 
to nuke mmu_audit.c, which I've threatened to do many times but never 
followed up on.


Paolo


Thanks,
Nick



Signed-off-by: David Stevens 
---
  arch/x86/kvm/mmu/mmu_audit.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
index cedc17b2f60e..97ff184084b4 100644
--- a/arch/x86/kvm/mmu/mmu_audit.c
+++ b/arch/x86/kvm/mmu/mmu_audit.c
@@ -121,6 +121,8 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 
*sptep, int level)
audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
 "ent %llxn", vcpu->arch.mmu->root_level, pfn,
 hpa, *sptep);
+
+   kvm_release_pfn_clean(pfn);
  }
  
  static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)

--
2.32.0.93.g670b81a890-goog








Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

2021-06-24 Thread Paolo Bonzini

On 24/06/21 10:52, Nicholas Piggin wrote:

For now, wrap all calls to gfn_to_pfn functions in the new helper
function. Callers which don't need the page struct will be updated in
follow-up patches.

Hmm. You mean callers that do need the page will be updated? Normally
if there will be leftover users that don't need the struct page then
you would go the other way and keep the old call the same, and add a new
one (gfn_to_pfn_page) just for those that need it.


Needing kvm_pfn_page_unwrap is a sign that something might be buggy, so 
it's a good idea to move the short name to the common case and the ugly 
kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one.  In fact I'm not 
sure there should be any kvm_pfn_page_unwrap in the end.


Paolo



Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU

2021-06-24 Thread Paolo Bonzini

On 24/06/21 05:57, David Stevens wrote:

  static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
-   int map_writable, int max_level, kvm_pfn_t pfn,
+   int map_writable, int max_level,
+   const struct kvm_pfn_page *pfnpg,
bool prefault, bool is_tdp)
  {
bool nx_huge_page_workaround_enabled = is_nx_huge_pa


So the main difference with my tentative patches is that here I was 
passing the struct by value.  I'll try to clean this up for 5.15, but 
for now it's all good like this.  Thanks again!


Paolo



Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

2021-06-24 Thread Paolo Bonzini

On 24/06/21 05:57, David Stevens wrote:

KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
follow_pte in gfn_to_pfn. However, the resolved pfns may not have
assoicated struct pages, so they should not be passed to pfn_to_page.
This series removes such calls from the x86 and arm64 secondary MMU. To
do this, this series modifies gfn_to_pfn to return a struct page in
addition to a pfn, if the hva was resolved by gup. This allows the
caller to call put_page only when necessated by gup.

This series provides a helper function that unwraps the new return type
of gfn_to_pfn to provide behavior identical to the old behavior. As I
have no hardware to test powerpc/mips changes, the function is used
there for minimally invasive changes. Additionally, as gfn_to_page and
gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
easily changed over to only use pfns.

This addresses CVE-2021-22543 on x86 and arm64.


Thank you very much for this.  I agree that it makes sense to have a 
minimal change; I had similar changes almost ready, but was stuck with 
deadlocks in the gfn_to_pfn_cache case.  In retrospect I should have 
posted something similar to your patches.


I have started reviewing the patches, and they look good.  I will try to 
include them in 5.13.


Paolo



Re: [PULL] topic/iomem-mmap-vs-gup

2021-05-10 Thread Paolo Bonzini

On 10/05/21 19:57, Sean Christopherson wrote:

+Paolo

On Mon, May 10, 2021, Jason Gunthorpe wrote:

On Mon, May 10, 2021 at 04:55:39PM +0200, Daniel Vetter wrote:


yeah vfio is still broken for the case I care about. I think there's
also some questions open still about whether kvm really uses
mmu_notifier in all cases correctly,


IIRC kvm doesn't either.


Yep, KVM on x86 has a non-trivial number of flows that don't properly hook into
the mmu_notifier.  Paolo is working on fixing the problem, but I believe the
rework won't be ready until 5.14.


Yeah, I like the way it's coming, but I'm at 20-ish patches and counting.

Paolo



Re: [PATCH 3/3] mm: unexport follow_pfn

2021-04-08 Thread Paolo Bonzini

On 08/04/21 12:05, Daniel Vetter wrote:

On Mon, Mar 29, 2021 at 10:31:01AM -0300, Jason Gunthorpe wrote:

On Tue, Mar 16, 2021 at 04:33:03PM +0100, Daniel Vetter wrote:

Both kvm (in bd2fae8da794 ("KVM: do not assume PTE is writable after
follow_pfn")) and vfio (in 07956b6269d3 ("vfio/type1: Use
follow_pte()")) have lost their callsites of follow_pfn(). All the
other ones have been switched over to unsafe_follow_pfn because they
cannot be fixed without breaking userspace api.

Argueably the vfio code is still racy, but that's kinda a bigger


vfio and kvm


Hm I thought kvm is non-racy due to the mmu notifier catch races?


No, but the plan is indeed to have some struct for each page that uses 
follow_pfn and update it from the MMU notifiers.


Paolo




picture. But since it does leak the pte beyond where it drops the pt
lock, without anything else like an mmu notifier guaranteeing
coherence, the problem is at least clearly visible in the vfio code.
So good enough with me.

I've decided to keep the explanation that after dropping the pt lock
you must have an mmu notifier if you keep using the pte somehow by
adjusting it and moving it into the kerneldoc for the new follow_pte()
function.

Cc: 3...@google.com
Cc: Jann Horn 
Cc: Paolo Bonzini 
Cc: Jason Gunthorpe 
Cc: Cornelia Huck 
Cc: Peter Xu 
Cc: Alex Williamson 
Cc: linux...@kvack.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-samsung-...@vger.kernel.org
Cc: linux-me...@vger.kernel.org
Cc: k...@vger.kernel.org
Signed-off-by: Daniel Vetter 
---
  include/linux/mm.h |  2 --
  mm/memory.c| 26 +-
  mm/nommu.c | 13 +
  3 files changed, 6 insertions(+), 35 deletions(-)


Reviewed-by: Jason Gunthorpe 


Thanks for your r-b tags, I'll add them.
-Daniel



Jason




___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 3/3] mm: unexport follow_pfn

2021-03-24 Thread Paolo Bonzini

On 24/03/21 13:52, Jason Gunthorpe wrote:

I think this is the right thing to do.

Alex is working on fixing VFIO and while kvm is still racy using
follow pte, I think they are working on it too?


Yeah, or at least we have a plan.

Paolo

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v6 16/17] RFC: kvm: pass kvm argument to follow_pfn callsites

2020-11-20 Thread Paolo Bonzini

On 20/11/20 16:44, Daniel Vetter wrote:

It's a bit of a pity, it's making an API more complex (the point of
gfn_to_pfn_memslot vs gfn_to_pfn is exactly that you don't need a
"struct kvm*" and it's clear that you've already done the lookup into
that struct kvm.


Yeah I noticed that, I think pushing the lookups down should work, but
that's a fairly large-scale change. I didn't want to do that for the
RFC since it would distract from the actual change/goal.
-Daniel


Pushing the lookups down would be worse code and possibly introduce 
TOC/TOU races, so better avoid that.  Your patch is fine. :)


Paolo

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v6 16/17] RFC: kvm: pass kvm argument to follow_pfn callsites

2020-11-20 Thread Paolo Bonzini

On 19/11/20 15:41, Daniel Vetter wrote:

Both Christoph Hellwig and Jason Gunthorpe suggested that usage of
follow_pfn by modules should be locked down more. To do so callers
need to be able to pass the mmu_notifier subscription corresponding
to the mm_struct to follow_pfn().

This patch does the rote work of doing that in the kvm subsystem. In
most places this is solved by passing struct kvm * down the call
stacks as an additional parameter, since that contains the
mmu_notifier.

Compile tested on all affected arch.


It's a bit of a pity, it's making an API more complex (the point of 
gfn_to_pfn_memslot vs gfn_to_pfn is exactly that you don't need a 
"struct kvm*" and it's clear that you've already done the lookup into 
that struct kvm.


But it's not a big deal, and the rationale at least makes sense.  So,

Acked-by: Paolo Bonzini 

Paolo
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v2 6/7] debugfs: switch to simplefs inode creation API

2020-04-27 Thread Paolo Bonzini
On 21/04/20 15:57, Emanuele Giuseppe Esposito wrote:
> - inode = debugfs_get_inode(dentry->d_sb);

You're not removing debugfs_get_inode so I think you're going to get a
warning (same in tracefs)?

You can wait a few more days for reviews and/or Acked-bys (especially
for patches 6 and 7) and then post v3.

Since the touch-everything patch (#2) has already been reviewed, and
it's mechanical and not introducing any semantic change, you can
probably reduce the To/Cc list to filesystem, debugfs and tracefs
maintainers.

Thanks,

Paolo

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 6/8] simplefs: add file creation functions

2020-04-20 Thread Paolo Bonzini
On 20/04/20 16:28, Greg Kroah-Hartman wrote:
>> I assume you meant a new file. These new functions are used only by a few
>> filesystems, and I didn't want to include them in vmlinux unconditionally,
>> so I introduced simplefs.c and CONFIG_SIMPLEFS instead of extending libfs.c.
>> In this way only fs that need this code like debugfs and tracefs will load
>> it.
> Nothing "loads it", why not just make these libfs functions instead?  As
> the difference between the two is not obvious at all, please don't make
> things confusing.

I think Emanuele meant "will link it" not "will load it".

Emanuele, you can just move everything to libfs.c and get rid of
CONFIG_SIMPLEFS too.  "Do less" is not an offer you want to turn down!

Thanks,

Paolo

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type

2020-02-18 Thread Paolo Bonzini
On 14/02/20 23:03, Sean Christopherson wrote:
>> On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu  wrote:
>>> AFAICT, it is currently allowed on ARM (verified) and AMD (not
>>> verified, but svm_get_mt_mask returns 0 which supposedly means the NPT
>>> does not restrict what the guest PAT can do).  This diff would do the
>>> trick for Intel without needing any uapi change:
>> I would be concerned about Intel CPU errata such as SKX40 and SKX59.
> The part KVM cares about, #MC, is already addressed by forcing UC for MMIO.
> The data corruption issue is on the guest kernel to correctly use WC
> and/or non-temporal writes.

What about coherency across live migration?  The userspace process would
use cached accesses, and also a WBINVD could potentially corrupt guest
memory.

Paolo

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type

2020-02-14 Thread Paolo Bonzini
On 13/02/20 23:18, Chia-I Wu wrote:
> 
> The bug you mentioned was probably this one
> 
>   https://bugzilla.kernel.org/show_bug.cgi?id=104091

Yes, indeed.

> From what I can tell, the commit allowed the guests to create cached
> mappings to MMIO regions and caused MCEs.  That is different than what
> I need, which is to allow guests to create uncached mappings to system
> ram (i.e., !kvm_is_mmio_pfn) when the host userspace also has uncached
> mappings.  But it is true that this still allows the userspace & guest
> kernel to create conflicting memory types.

Right, the question is whether the MCEs were tied to MMIO regions 
specifically and if so why.

An interesting remark is in the footnote of table 11-7 in the SDM.  
There, for the MTRR (EPT for us) memory type UC you can read:

  The UC attribute comes from the MTRRs and the processors are not 
  required to snoop their caches since the data could never have
  been cached. This attribute is preferred for performance reasons.

There are two possibilities:

1) the footnote doesn't apply to UC mode coming from EPT page tables.
That would make your change safe.

2) the footnote also applies when the UC attribute comes from the EPT
page tables rather than the MTRRs.  In that case, the host should use
UC as the EPT page attribute if and only if it's consistent with the host
MTRRs; it would be more or less impossible to honor UC in the guest MTRRs.
In that case, something like the patch below would be needed.

It is not clear from the manual why the footnote would not apply to WC; that
is, the manual doesn't say explicitly that the processor does not do snooping
for accesses to WC memory.  But I guess that must be the case, which is why I
used MTRR_TYPE_WRCOMB in the patch below.

Either way, we would have an explanation of why creating cached mapping to
MMIO regions would, and why in practice we're not seeing MCEs for guest RAM
(the guest would have set WB for that memory in its MTRRs, not UC).

One thing you didn't say: how would userspace use KVM_MEM_DMA?  On which
regions would it be set?

Thanks,

Paolo

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index dc331fb06495..2be6f7effa1d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6920,8 +6920,16 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t 
gfn, bool is_mmio)
}
 
cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
-
 exit:
+   if (cache == MTRR_TYPE_UNCACHABLE && !is_mmio) {
+   /*
+* We cannot set UC in the EPT page tables as it can cause
+* machine check exceptions (??).  Hopefully the guest is
+* using PAT.
+*/
+   cache = MTRR_TYPE_WRCOMB;
+   }
+
return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
 }
 

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 1/3] KVM: vmx: rewrite the comment in vmx_get_mt_mask

2020-02-14 Thread Paolo Bonzini
On 13/02/20 22:30, Chia-I Wu wrote:
> Better reflect the structure of the code and metion why we could not
> always honor the guest.
> 
> Signed-off-by: Chia-I Wu 
> Cc: Gurchetan Singh 
> Cc: Gerd Hoffmann 
> ---
>  arch/x86/kvm/vmx/vmx.c | 27 +--
>  1 file changed, 17 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 3be25ecae145..266ef87042da 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6854,17 +6854,24 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, 
> gfn_t gfn, bool is_mmio)
>   u8 cache;
>   u64 ipat = 0;
>  
> - /* For VT-d and EPT combination
> -  * 1. MMIO: always map as UC
> -  * 2. EPT with VT-d:
> -  *   a. VT-d without snooping control feature: can't guarantee the
> -  *  result, try to trust guest.
> -  *   b. VT-d with snooping control feature: snooping control feature of
> -  *  VT-d engine can guarantee the cache correctness. Just set it
> -  *  to WB to keep consistent with host. So the same as item 3.
> -  * 3. EPT without VT-d: always map as WB and set IPAT=1 to keep
> -  *consistent with host MTRR
> + /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> +  * memory aliases with conflicting memory types and sometimes MCEs.
> +  * We have to be careful as to what are honored and when.
> +  *
> +  * For MMIO, guest CD/MTRR are ignored.  The EPT memory type is set to
> +  * UC.  The effective memory type is UC or WC depending on guest PAT.
> +  * This was historically the source of MCEs and we want to be
> +  * conservative.
> +  *
> +  * When there is no need to deal with noncoherent DMA (e.g., no VT-d
> +  * or VT-d has snoop control), guest CD/MTRR/PAT are all ignored.  The
> +  * EPT memory type is set to WB.  The effective memory type is forced
> +  * WB.
> +  *
> +  * Otherwise, we trust guest.  Guest CD/MTRR/PAT are all honored.  The
> +  * EPT memory type is used to emulate guest CD/MTRR.
>*/
> +
>   if (is_mmio) {
>   cache = MTRR_TYPE_UNCACHABLE;
>   goto exit;
> 

This is certainly an improvement, especially the part that points out
how guest PAT still allows MMIO to be handled as WC.

Thanks,

Paolo

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type

2020-02-13 Thread Paolo Bonzini
On 13/02/20 22:30, Chia-I Wu wrote:
> Hi,
> 
> Host GPU drivers like to give userspace WC mapping.  When the userspace makes
> the mapping available to a guest, it also tells the guest to create a WC
> mapping.  However, even when the guest kernel picks the correct memory type,
> it gets ignored because of VMX_EPT_IPAT_BIT on Intel.
> 
> This series adds a new flag to KVM_SET_USER_MEMORY_REGION, which tells the
> host kernel to honor the guest memory type for the memslot.  An alternative
> fix is for KVM to unconditionally honor the guest memory type (unless it is
> MMIO, to avoid MCEs on Intel).  I believe the alternative fix is how things
> are on ARM, and probably also how things are on AMD.
> 
> I am new to KVM and HW virtualization technologies.  This series is meant as
> an RFC.
> 

When we tried to do this in the past, we got machine checks everywhere
unfortunately due to the same address being mapped with different memory
types.  Unfortunately I cannot find the entry anymore in bugzilla, but
this was not fixed as far as I know.

Paolo

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: KASAN: slab-out-of-bounds Read in fbcon_get_font

2019-12-05 Thread Paolo Bonzini
On 05/12/19 12:27, Dmitry Vyukov wrote:
> Oh, you mean the final bisection crash. Indeed it contains a kvm frame
> and it turns out to be a bug in syzkaller code that indeed
> misattributed it to kvm instead of netfilter.
> Should be fixed now, you may read the commit message for details:
> https://github.com/google/syzkaller/commit/4fb74474cf0af2126be3a8989d770c3947ae9478
> 
> Overall this "making sense out of kernel output" task is the ultimate
> insanity, you may skim through this file to get a taste of amount of
> hardcoding and special corner cases that need to be handled:
> https://github.com/google/syzkaller/blob/master/pkg/report/linux.go
> And this is never done, such "exception from exception corner case"
> things pop up every week. There is always something to shuffle and
> tune. It only keeps functioning due to 500+ test cases for all
> possible insane kernel outputs:
> https://github.com/google/syzkaller/tree/master/pkg/report/testdata/linux/report
> https://github.com/google/syzkaller/tree/master/pkg/report/testdata/linux/guilty
> 
> So thanks for persisting and questioning! We are getting better with
> each new test.

Thanks to you!  I "complain" because I know you're so responsive. :)

Paolo

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: KASAN: slab-out-of-bounds Read in fbcon_get_font

2019-12-05 Thread Paolo Bonzini
On 05/12/19 11:31, Dmitry Vyukov wrote:
>> Ah, and because the machine is a KVM guest, kvm_wait appears in a lot of
>> backtrace and I get to share syzkaller's joy every time. :)
> I don't see any mention of "kvm" in the crash report.

It's there in the stack trace, not sure if this is what triggered my Cc:

 [] kvm_wait+0xca/0xe0 arch/x86/kernel/kvm.c:612

Paolo

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: KASAN: slab-out-of-bounds Read in fbcon_get_font

2019-12-05 Thread Paolo Bonzini
On 05/12/19 11:16, Dmitry Vyukov wrote:
> On Thu, Dec 5, 2019 at 11:13 AM Paolo Bonzini  wrote:
>>
>> On 04/12/19 22:41, syzbot wrote:
>>> syzbot has bisected this bug to:
>>>
>>> commit 2de50e9674fc4ca3c6174b04477f69eb26b4ee31
>>> Author: Russell Currey 
>>> Date:   Mon Feb 8 04:08:20 2016 +
>>>
>>> powerpc/powernv: Remove support for p5ioc2
>>>
>>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=127a042ae0
>>> start commit:   76bb8b05 Merge tag 'kbuild-v5.5' of
>>> git://git.kernel.org/p..
>>> git tree:   upstream
>>> final crash:https://syzkaller.appspot.com/x/report.txt?x=117a042ae0
>>> console output: https://syzkaller.appspot.com/x/log.txt?x=167a042ae0
>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=dd226651cb0f364b
>>> dashboard link:
>>> https://syzkaller.appspot.com/bug?extid=4455ca3b3291de891abc
>>> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=11181edae0
>>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=105cbb7ae0
>>>
>>> Reported-by: syzbot+4455ca3b3291de891...@syzkaller.appspotmail.com
>>> Fixes: 2de50e9674fc ("powerpc/powernv: Remove support for p5ioc2")
>>>
>>> For information about bisection process see:
>>> https://goo.gl/tpsmEJ#bisection
>>>
>>
>> Why is everybody being CC'd, even if the bug has nothing to do with the
>> person's subsystem?
> 
> The To list should be intersection of 2 groups of emails: result of
> get_maintainers.pl on the file identified as culprit in the crash
> message + emails extracted from the bisected to commit.

Ah, and because the machine is a KVM guest, kvm_wait appears in a lot of
backtrace and I get to share syzkaller's joy every time. :)

This bisect result is bogus, though Tetsuo found the bug anyway.
Perhaps you can exclude commits that only touch architectures other than
x86?

Paolo

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: KASAN: slab-out-of-bounds Read in fbcon_get_font

2019-12-05 Thread Paolo Bonzini
On 04/12/19 22:41, syzbot wrote:
> syzbot has bisected this bug to:
> 
> commit 2de50e9674fc4ca3c6174b04477f69eb26b4ee31
> Author: Russell Currey 
> Date:   Mon Feb 8 04:08:20 2016 +
> 
>     powerpc/powernv: Remove support for p5ioc2
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=127a042ae0
> start commit:   76bb8b05 Merge tag 'kbuild-v5.5' of
> git://git.kernel.org/p..
> git tree:   upstream
> final crash:    https://syzkaller.appspot.com/x/report.txt?x=117a042ae0
> console output: https://syzkaller.appspot.com/x/log.txt?x=167a042ae0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=dd226651cb0f364b
> dashboard link:
> https://syzkaller.appspot.com/bug?extid=4455ca3b3291de891abc
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=11181edae0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=105cbb7ae0
> 
> Reported-by: syzbot+4455ca3b3291de891...@syzkaller.appspotmail.com
> Fixes: 2de50e9674fc ("powerpc/powernv: Remove support for p5ioc2")
> 
> For information about bisection process see:
> https://goo.gl/tpsmEJ#bisection
> 

Why is everybody being CC'd, even if the bug has nothing to do with the
person's subsystem?

Thanks,

Paolo

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH 09/22] docs: mark orphan documents as such

2019-05-30 Thread Paolo Bonzini
On 30/05/19 01:23, Mauro Carvalho Chehab wrote:
> Sphinx doesn't like orphan documents:
> 
> Documentation/accelerators/ocxl.rst: WARNING: document isn't included in 
> any toctree
> Documentation/arm/stm32/overview.rst: WARNING: document isn't included in 
> any toctree
> Documentation/arm/stm32/stm32f429-overview.rst: WARNING: document isn't 
> included in any toctree
> Documentation/arm/stm32/stm32f746-overview.rst: WARNING: document isn't 
> included in any toctree
> Documentation/arm/stm32/stm32f769-overview.rst: WARNING: document isn't 
> included in any toctree
> Documentation/arm/stm32/stm32h743-overview.rst: WARNING: document isn't 
> included in any toctree
> Documentation/arm/stm32/stm32mp157-overview.rst: WARNING: document isn't 
> included in any toctree
> Documentation/gpu/msm-crash-dump.rst: WARNING: document isn't included in 
> any toctree
> Documentation/interconnect/interconnect.rst: WARNING: document isn't 
> included in any toctree
> Documentation/laptops/lg-laptop.rst: WARNING: document isn't included in 
> any toctree
> Documentation/powerpc/isa-versions.rst: WARNING: document isn't included 
> in any toctree
> Documentation/virtual/kvm/amd-memory-encryption.rst: WARNING: document 
> isn't included in any toctree
> Documentation/virtual/kvm/vcpu-requests.rst: WARNING: document isn't 
> included in any toctree
> 
> So, while they aren't on any toctree, add :orphan: to them, in order
> to silent this warning.
> 
> Signed-off-by: Mauro Carvalho Chehab 

Please leave out KVM, I'll fix that instead.  Thanks for the report!

Paolo

> ---
>  Documentation/accelerators/ocxl.rst | 2 ++
>  Documentation/arm/stm32/overview.rst| 2 ++
>  Documentation/arm/stm32/stm32f429-overview.rst  | 2 ++
>  Documentation/arm/stm32/stm32f746-overview.rst  | 2 ++
>  Documentation/arm/stm32/stm32f769-overview.rst  | 2 ++
>  Documentation/arm/stm32/stm32h743-overview.rst  | 2 ++
>  Documentation/arm/stm32/stm32mp157-overview.rst | 2 ++
>  Documentation/gpu/msm-crash-dump.rst| 2 ++
>  Documentation/interconnect/interconnect.rst | 2 ++
>  Documentation/laptops/lg-laptop.rst | 2 ++
>  Documentation/powerpc/isa-versions.rst  | 2 ++
>  Documentation/virtual/kvm/amd-memory-encryption.rst | 2 ++
>  Documentation/virtual/kvm/vcpu-requests.rst | 2 ++
>  13 files changed, 26 insertions(+)
> 
> diff --git a/Documentation/accelerators/ocxl.rst 
> b/Documentation/accelerators/ocxl.rst
> index 14cefc020e2d..b1cea19a90f5 100644
> --- a/Documentation/accelerators/ocxl.rst
> +++ b/Documentation/accelerators/ocxl.rst
> @@ -1,3 +1,5 @@
> +:orphan:
> +
>  
>  OpenCAPI (Open Coherent Accelerator Processor Interface)
>  
> diff --git a/Documentation/arm/stm32/overview.rst 
> b/Documentation/arm/stm32/overview.rst
> index 85cfc8410798..f7e734153860 100644
> --- a/Documentation/arm/stm32/overview.rst
> +++ b/Documentation/arm/stm32/overview.rst
> @@ -1,3 +1,5 @@
> +:orphan:
> +
>  
>  STM32 ARM Linux Overview
>  
> diff --git a/Documentation/arm/stm32/stm32f429-overview.rst 
> b/Documentation/arm/stm32/stm32f429-overview.rst
> index 18feda97f483..65bbb1c3b423 100644
> --- a/Documentation/arm/stm32/stm32f429-overview.rst
> +++ b/Documentation/arm/stm32/stm32f429-overview.rst
> @@ -1,3 +1,5 @@
> +:orphan:
> +
>  STM32F429 Overview
>  ==
>  
> diff --git a/Documentation/arm/stm32/stm32f746-overview.rst 
> b/Documentation/arm/stm32/stm32f746-overview.rst
> index b5f4b6ce7656..42d593085015 100644
> --- a/Documentation/arm/stm32/stm32f746-overview.rst
> +++ b/Documentation/arm/stm32/stm32f746-overview.rst
> @@ -1,3 +1,5 @@
> +:orphan:
> +
>  STM32F746 Overview
>  ==
>  
> diff --git a/Documentation/arm/stm32/stm32f769-overview.rst 
> b/Documentation/arm/stm32/stm32f769-overview.rst
> index 228656ced2fe..f6adac862b17 100644
> --- a/Documentation/arm/stm32/stm32f769-overview.rst
> +++ b/Documentation/arm/stm32/stm32f769-overview.rst
> @@ -1,3 +1,5 @@
> +:orphan:
> +
>  STM32F769 Overview
>  ==
>  
> diff --git a/Documentation/arm/stm32/stm32h743-overview.rst 
> b/Documentation/arm/stm32/stm32h743-overview.rst
> index 3458dc00095d..c525835e7473 100644
> --- a/Documentation/arm/stm32/stm32h743-overview.rst
> +++ b/Documentation/arm/stm32/stm32h743-overview.rst
> @@ -1,3 +1,5 @@
> +:orphan:
> +
>  STM32H743 Overview
>  ==
>  
> diff --git a/Documentation/arm/stm32/stm32mp157-overview.rst 
> b/Documentation/arm/stm32/stm32mp157-overview.rst
> index 62e176d47ca7..2c52cd020601 100644
> --- a/Documentation/arm/stm32/stm32mp157-overview.rst
> +++ b/Documentation/arm/stm32/stm32mp157-overview.rst
> @@ -1,3 +1,5 @@
> +:orphan:
> +
>  STM32MP157 Overview
>  

Re: Possible use_mm() mis-uses

2018-08-23 Thread Paolo Bonzini
On 23/08/2018 08:07, Zhenyu Wang wrote:
> On 2018.08.22 20:20:46 +0200, Paolo Bonzini wrote:
>> On 22/08/2018 18:44, Linus Torvalds wrote:
>>> An example of something that *isn't* right, is the i915 kvm interface,
>>> which does
>>>
>>> use_mm(kvm->mm);
>>>
>>> on an mm that was initialized in virt/kvm/kvm_main.c using
>>>
>>> mmgrab(current->mm);
>>> kvm->mm = current->mm;
>>>
>>> which is *not* right. Using "mmgrab()" does indeed guarantee the
>>> lifetime of the 'struct mm_struct' itself, but it does *not* guarantee
>>> the lifetime of the page tables. You need to use "mmget()" and
>>> "mmput()", which get the reference to the actual process address
>>> space!
>>>
>>> Now, it is *possible* that the kvm use is correct too, because kvm
>>> does register a mmu_notifier chain, and in theory you can avoid the
>>> proper refcounting by just making sure the mmu "release" notifier
>>> kills any existing uses, but I don't really see kvm doing that. Kvm
>>> does register a release notifier, but that just flushes the shadow
>>> page tables, it doesn't kill any use_mm() use by some i915 use case.
>>
>> Yes, KVM is correct but the i915 bits are at least fishy.  It's probably
>> as simple as adding a mmget/mmput pair respectively in kvmgt_guest_init
>> and kvmgt_guest_exit, or maybe mmget_not_zero.
>>
> 
> yeah, that's the clear way to fix this imo. We only depend on guest
> life cycle to access guest memory properly. Here's proposed fix, will
> verify and integrate it later.
> 
> Thanks!
> 
> From 5e5a8d0409aa150884adf5a4d0b956fd0b9906b3 Mon Sep 17 00:00:00 2001
> From: Zhenyu Wang 
> Date: Thu, 23 Aug 2018 14:08:06 +0800
> Subject: [PATCH] drm/i915/gvt: Fix life cycle reference on KVM mm
> 
> Handle guest mm access life cycle properly with mmget()/mmput()
> through guest init()/exit(). As noted by Linus, use_mm() depends
> on valid live page table but KVM's mmgrab() doesn't guarantee that.
> As vGPU usage depends on guest VM life cycle, need to make sure to
> use mmget()/mmput() to guarantee VM address access.
> 
> Cc: Linus Torvalds 
> Cc: Paolo Bonzini 
> Cc: Zhi Wang 
> Signed-off-by: Zhenyu Wang 

Reviewed-by: Paolo Bonzini 

> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c | 12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c 
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index 71751be329e3..4a0988747d08 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1614,9 +1615,16 @@ static int kvmgt_guest_init(struct mdev_device *mdev)
>   if (__kvmgt_vgpu_exist(vgpu, kvm))
>   return -EEXIST;
>  
> + if (!mmget_not_zero(kvm->mm)) {
> + gvt_vgpu_err("Can't get KVM mm reference\n");
> + return -EINVAL;
> + }
> +
>   info = vzalloc(sizeof(struct kvmgt_guest_info));
> - if (!info)
> + if (!info) {
> + mmput(kvm->mm);
>   return -ENOMEM;
> + }
>  
>   vgpu->handle = (unsigned long)info;
>   info->vgpu = vgpu;
> @@ -1647,6 +1655,8 @@ static bool kvmgt_guest_exit(struct kvmgt_guest_info 
> *info)
>   debugfs_remove(info->debugfs_cache_entries);
>  
>   kvm_page_track_unregister_notifier(info->kvm, >track_node);
> + if (info->kvm->mm)
> + mmput(info->kvm->mm);
>   kvm_put_kvm(info->kvm);
>   kvmgt_protect_table_destroy(info);
>   gvt_cache_destroy(info->vgpu);
> 




signature.asc
Description: OpenPGP digital signature
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Possible use_mm() mis-uses

2018-08-22 Thread Paolo Bonzini
On 22/08/2018 18:44, Linus Torvalds wrote:
> An example of something that *isn't* right, is the i915 kvm interface,
> which does
> 
> use_mm(kvm->mm);
> 
> on an mm that was initialized in virt/kvm/kvm_main.c using
> 
> mmgrab(current->mm);
> kvm->mm = current->mm;
> 
> which is *not* right. Using "mmgrab()" does indeed guarantee the
> lifetime of the 'struct mm_struct' itself, but it does *not* guarantee
> the lifetime of the page tables. You need to use "mmget()" and
> "mmput()", which get the reference to the actual process address
> space!
> 
> Now, it is *possible* that the kvm use is correct too, because kvm
> does register a mmu_notifier chain, and in theory you can avoid the
> proper refcounting by just making sure the mmu "release" notifier
> kills any existing uses, but I don't really see kvm doing that. Kvm
> does register a release notifier, but that just flushes the shadow
> page tables, it doesn't kill any use_mm() use by some i915 use case.

Yes, KVM is correct but the i915 bits are at least fishy.  It's probably
as simple as adding a mmget/mmput pair respectively in kvmgt_guest_init
and kvmgt_guest_exit, or maybe mmget_not_zero.

Paolo
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


[Qemu-devel] [PATCH] drm: update MAINTAINERS for qemu drivers (bochs, cirrus, qxl, virtio-gpu)

2016-11-22 Thread Paolo Bonzini


On 22/11/2016 19:54, Eric Blake wrote:
>>> >>  DRM DRIVER FOR BOCHS VIRTUAL GPU
>>> >>  M:  Gerd Hoffmann 
>>> >> -S:  Odd Fixes
>>> >> +L:  qemu-devel at nongnu.org
>> > 
>> > qemu-devel list already has very high traffic - not sure whether it
>> > makes much sense to route even more additional patches here. Maybe
>> > rather create a separate mailing list like qemu-graphics at nongnu.org ?
> In practice, ALL patches should already be going to qemu-devel, even if
> they are ALSO going to some other list.  For example, qemu-block and
> qemu-trivial are definitely cases where we have separate lists, but
> where posters are reminded to include qemu-devel in cc if they want a
> patch applied.

The difference is that these would be kernel patches.

Paolo


[PATCH 2/2] mm: remove get_user_pages_locked()

2016-10-31 Thread Paolo Bonzini


On 31/10/2016 14:48, Lorenzo Stoakes wrote:
> On Mon, Oct 31, 2016 at 12:45:36PM +0100, Paolo Bonzini wrote:
>>
>>
>> On 31/10/2016 11:02, Lorenzo Stoakes wrote:
>>> - *
>>> - * get_user_pages should be phased out in favor of
>>> - * get_user_pages_locked|unlocked or get_user_pages_fast. Nothing
>>> - * should use get_user_pages because it cannot pass
>>> - * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
>>
>> This comment should be preserved in some way.  In addition, removing
> 
> Could you clarify what you think should be retained?
> 
> The comment seems to me to be largely rendered redundant by this change -
> get_user_pages() now offers identical behaviour, and of course the latter part
> of the comment ('because it cannot pass FAULT_FLAG_ALLOW_RETRY') is rendered
> incorrect by this change.
> 
> It could be replaced with a recommendation to make use of VM_FAULT_RETRY logic
> if possible.

Yes, exactly.  locked == NULL should be avoided whenever mmap_sem can 
be dropped, and the comment indirectly said so.  Now most of those cases
actually are those where you can just call get_user_pages_unlocked.

>> get_user_pages_locked() makes it harder (compared to a simple "git grep
>> -w") to identify callers that lack allow-retry functionality).  So I'm
>> not sure about the benefits of these patches.
> 
> That's a fair point, and certainly this cleanup series is less obviously 
> helpful
> than previous ones.
> 
> However, there are a few points in its favour:
> 
> 1. get_user_pages_remote() was the mirror of get_user_pages() prior to adding 
> an
>int *locked parameter to the former (necessary to allow for the unexport of
>__get_user_pages_unlocked()), differing only in task/mm being specified and
>FOLL_REMOTE being set. This patch series keeps these functions 
> 'synchronised'
>in this respect.
> 
> 2. There is currently only one caller of get_user_pages_locked() in
>mm/frame_vector.c which seems to suggest this function isn't widely
>used/known.

Or not widely necessary. :)

> 3. This change results in all slow-path get_user_pages*() functions having the
>ability to use VM_FAULT_RETRY logic rather than 'defaulting' to using
>get_user_pages() that doesn't let you do this even if you wanted to.

This is only true if someone does the work though.  From a quick look 
at your series, the following files can be trivially changed to use 
get_user_pages_unlocked:

- drivers/gpu/drm/via/via_dmablit.c
- drivers/platform/goldfish/goldfish_pipe.c
- drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c
- drivers/rapidio/devices/rio_mport_cdev.c
- drivers/virt/fsl_hypervisor.c

Also, videobuf_dma_init_user could be changed to use retry by adding a 
*locked argument to videobuf_dma_init_user_locked, prototype patch 
after my signature.

Everything else is probably best kept using get_user_pages.

> 4. It's somewhat confusing/redundant having both get_user_pages_locked() and
>get_user_pages() functions which both require mmap_sem to be held (i.e. 
> both
>are 'locked' versions.)
> 
>> If all callers were changed, then sure removing the _locked suffix would
>> be a good idea.
> 
> It seems many callers cannot release mmap_sem so VM_FAULT_RETRY logic couldn't
> happen anyway in this cases and in these cases we couldn't change the caller.

But then get_user_pages_locked remains a less-common case, so perhaps 
it's a good thing to give it a longer name!

> Overall, an alternative here might be to remove get_user_pages() and update
> get_user_pages_locked() to add a 'vmas' parameter, however doing this renders
> get_user_pages_unlocked() asymmetric as it would lack an vmas parameter 
> (adding
> one there would make no sense as VM_FAULT_RETRY logic invalidates VMAs) though
> perhaps not such a big issue as it makes sense as to why this is the case.

Adding the 'vmas' parameter to get_user_pages_locked would make little 
sense.  Since VM_FAULT_RETRY invalidates it and g_u_p_locked can and 
does retry, it would generally not be useful.

So I think we have the right API now:

- do not have lock?  Use get_user_pages_unlocked, get retry for free,
no need to handle  mmap_sem and the locked argument; cannot get back vmas.

- have and cannot drop lock?  User get_user_pages, no need to pass NULL 
for the locked argument; can get back vmas.

- have but can drop lock (rare case)?  Use get_user_pages_locked, 
cannot get back vams.

Paolo

> get_user_pages_unlocked() definitely seems to be a useful helper and therefore
> makes sense to retain.

> Of course another alternative is to leave things be :)
> 

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c 
b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 1db0af6c7f94..dae

[PATCH 2/2] mm: remove get_user_pages_locked()

2016-10-31 Thread Paolo Bonzini


On 31/10/2016 11:02, Lorenzo Stoakes wrote:
> - *
> - * get_user_pages should be phased out in favor of
> - * get_user_pages_locked|unlocked or get_user_pages_fast. Nothing
> - * should use get_user_pages because it cannot pass
> - * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.

This comment should be preserved in some way.  In addition, removing
get_user_pages_locked() makes it harder (compared to a simple "git grep
-w") to identify callers that lack allow-retry functionality).  So I'm
not sure about the benefits of these patches.

If all callers were changed, then sure removing the _locked suffix would
be a good idea.

Paolo


[PATCH 02/10] mm: remove write/force parameters from __get_user_pages_unlocked()

2016-10-13 Thread Paolo Bonzini
 max_pages_per_loop = PVM_MAX_KMALLOC_PAGES
>   / sizeof(struct pages *);
> + unsigned int flags = FOLL_REMOTE;
>  
>   /* Work out address and page range required */
>   if (len == 0)
>   return 0;
>   nr_pages = (addr + len - 1) / PAGE_SIZE - addr / PAGE_SIZE + 1;
>  
> + if (vm_write)
> + flags |= FOLL_WRITE;
> +
>   while (!rc && nr_pages && iov_iter_count(iter)) {
>   int pages = min(nr_pages, max_pages_per_loop);
>   size_t bytes;
> @@ -104,8 +108,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
>* current/current->mm
>*/
>   pages = __get_user_pages_unlocked(task, mm, pa, pages,
> -   vm_write, 0, process_pages,
> -   FOLL_REMOTE);
> +   process_pages, flags);
>   if (pages <= 0)
>   return -EFAULT;
>  
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index db96688..8035cc1 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -84,7 +84,8 @@ static void async_pf_execute(struct work_struct *work)
>* mm and might be done in another context, so we must
>* use FOLL_REMOTE.
>*/
> - __get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL, FOLL_REMOTE);
> + __get_user_pages_unlocked(NULL, mm, addr, 1, NULL,
> + FOLL_WRITE | FOLL_REMOTE);
>  
>   kvm_async_page_present_sync(vcpu, apf);
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 81dfc73..28510e7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1416,10 +1416,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool 
> *async, bool write_fault,
>   down_read(>mm->mmap_sem);
>   npages = get_user_page_nowait(addr, write_fault, page);
>   up_read(>mm->mmap_sem);
> - } else
> + } else {
> +     unsigned int flags = FOLL_TOUCH | FOLL_HWPOISON;
> +
> + if (write_fault)
> + flags |= FOLL_WRITE;
> +
>   npages = __get_user_pages_unlocked(current, current->mm, addr, 
> 1,
> -write_fault, 0, page,
> -FOLL_TOUCH|FOLL_HWPOISON);
> +page, flags);
> + }
>   if (npages != 1)
>   return npages;
>  
> 

Acked-by: Paolo Bonzini 


[Intel-gfx] [RFC][PATCH] gpu:drm:i915:intel_detect_pch: back to check devfn instead of check class type

2014-07-07 Thread Paolo Bonzini
Il 07/07/2014 19:54, Daniel Vetter ha scritto:
> On Mon, Jul 07, 2014 at 04:57:45PM +0200, Paolo Bonzini wrote:
>> Il 07/07/2014 16:49, Daniel Vetter ha scritto:
>>> So the correct fix to forward intel gpus to guests is indeed to somehow
>>> fake the pch pci ids since the driver really needs them. Gross design, but
>>> that's how the hardware works.
>>
>> A way that could work for virtualization is this: if you find the card has a
>> magic subsystem vendor id, fetch the subsystem device id and use _that_ as
>> the PCH device id.
>>
>> Would that work for you?
>
> I guess for quemu it also depends upon what windows does since we can't
> change that. If we care about that part. Another consideration is
> supporting older kernels, if that's possible at all.

Yes, but right now it's more important to get something that's not too 
gross for the future, for both Linux and Windows.  Hacks for existing 
guests can be done separately, especially since they might differ 
between Linux (check ISA bridge) and Windows (check 1f.0).

Paolo



[RFC][PATCH] gpu:drm:i915:intel_detect_pch: back to check devfn instead of check class type

2014-07-07 Thread Paolo Bonzini
Il 07/07/2014 16:49, Daniel Vetter ha scritto:
> So the correct fix to forward intel gpus to guests is indeed to somehow
> fake the pch pci ids since the driver really needs them. Gross design, but
> that's how the hardware works.

A way that could work for virtualization is this: if you find the card 
has a magic subsystem vendor id, fetch the subsystem device id and use 
_that_ as the PCH device id.

Would that work for you?

Paolo


[RFC][PATCH] gpu:drm:i915:intel_detect_pch: back to check devfn instead of check class type

2014-06-30 Thread Paolo Bonzini
Il 30/06/2014 05:13, Chen, Tiejun ha scritto:
> After I discuss internal, we think even we just set the real
> vendor/device ids to this ISA bridge at 00:1f.0, guest firmware should
> still work well with these pair of real vendor/device ids.
>
> So if you think something would conflict or be broken, could you tell us
> what's exactly that? Then we will double check.

The Xen hvmloader doesn't break since it only supports one chipset.  But 
SeaBIOS checks for the exact vendor/device ids since Q35 support was added.

If you want to add this feature, try to implement it in a way that is a 
bit more forward-looking.  I'm sure that Xen sooner or later will want a 
PCIe chipset, otherwise things such as AER forwarding are impossible.

Paolo


[RFC][PATCH] gpu:drm:i915:intel_detect_pch: back to check devfn instead of check class type

2014-06-25 Thread Paolo Bonzini
Il 25/06/2014 09:34, Chen, Tiejun ha scritto:
> On 2014/6/25 14:48, Paolo Bonzini wrote:
>> Second problem.  Your IGD passthrough code currently works with QEMU's
>> PIIX4-based machine.  But what happens if you try to extend it, so that
>
> Yes, current xen machine, xenpv, is based on pii4, and also I don't
> known if we will plan to migrate to q35 or others. So its hard to
> further say more now.
>
>> it works with QEMU's ICH9-based machine?  That's a more modern machine
>> that has a PCIe chipset and hence has its ISA bridge at 00:1f.0.  Now
>
> But even in this case, could we set the real vendor/device ids for that
> ISA bridge at 00:1f.0? If not, what's broken?

The config space layout changes for different vendor/device ids, so the 
guest firmware only works if you have the right vendor/device id.

>> It is only slightly better, but the right solution is to fix the driver.
>>   There is absolutely zero reason why a graphics driver should know
>> about the vendor/device ids of the PCH.
>
> This means we have to fix this both on Linux and Windows but I'm not
> sure if this is feasible to us.

You have to do it if you want this feature in QEMU in a future-proof way.

You _can_ provide the ugly PIIX4-specific hack as a compatibility 
fallback (and this patch is okay to make the compatibility fallback less 
hacky).  However, I don't think QEMU should accept the patch for IGD 
passthrough unless Intel is willing to make drivers 
virtualization-friendly.  Once you assign the IGD, it is not that 
integrated anymore and the drivers must take that into account.

It is worthwhile pointing out that neither AMD nor nVidia need any of this.

>> The right way could be to make QEMU add a vendor-specific capability to
>> the video device. The driver can probe for that capability before
>
> Do you mean we can pick two unused offsets in the configuration space of
> the video device as a vendor-specific capability to hold the
> vendor/device ids of the PCH?

Yes, either that or add a new capability (which lets you choose the 
offsets more freely).

If the IGD driver needs config space fields of the MCH, those fields 
could also be mirrored in the new capability.  QEMU would forward them 
automatically.

It could even be a new BAR, which gives even more freedom to allocate 
the fields.

>> looking at the PCI bus.  QEMU can add the capability to the list, it is
>> easy because all accesses to the video device's configuration space trap
>> to QEMU.  Then you do not need to add fake devices to the machine.
>>
>> In fact, it would be nice if Intel added such a capability on the next
>> generation of integrated graphics, too.  On real hardware, ACPI or some
>
> Maybe, but even this would be implemented, shouldn't we need to be
> compatible with those old generations?

Yes.

- old generation / old driver: use 00:1f.0 hack, only guaranteed to work 
on PIIX4-based virtual guest

- old generation / new driver: use 00:1f.0 hack on real hardware, use 
capability on 00:02.0 on virtual guest, can work on PCIe virtual guest

- new generation / old driver: doesn't exist

- new generation / new driver: always use capability on 00:02.0, can 
work on PCIe virtual guest.

Paolo


[RFC][PATCH] gpu:drm:i915:intel_detect_pch: back to check devfn instead of check class type

2014-06-25 Thread Paolo Bonzini
Il 22/06/2014 10:25, Chen, Tiejun ha scritto:
> In qemu-upstream, as you commented we can't create this as a ISA class
> type explicitly.

Note I didn't say that QEMU doesn't like having two ISA bridges.

I commented that the firmware will see two ISA bridges and will try to 
initialize both of them.  It currently doesn't just because it only 
knows of two southbridge PCI IDs, and both of them are old or relatively 
old (PIIX3/4 and ICH9).

But what would happen if you did graphics passthrough on a machine with 
an ICH9 southbridge?  Your code will create the PIIX4 ISA bridge, and 
will create an ICH9 southbridge just to please the i915 driver.  The 
BIOS will recognize the ICH9's vendor/device ids and try to initialize 
it.  It will write to the configuration space to set up PCI PIRQ[A-H] 
routing, for example, which makes no sense since your ICH9 is just a 
dummy device.

Second problem.  Your IGD passthrough code currently works with QEMU's 
PIIX4-based machine.  But what happens if you try to extend it, so that 
it works with QEMU's ICH9-based machine?  That's a more modern machine 
that has a PCIe chipset and hence has its ISA bridge at 00:1f.0.  Now 
you have a conflict.

In other words, if you want IGD passthrough in QEMU, you need a solution 
that is future-proof.

> So we compromise by faking this ISA bridge without ISA
> class type setting (as I recall you already said this way is slightly
> better).

It is only slightly better, but the right solution is to fix the driver. 
  There is absolutely zero reason why a graphics driver should know 
about the vendor/device ids of the PCH.

The right way could be to make QEMU add a vendor-specific capability to 
the video device. The driver can probe for that capability before 
looking at the PCI bus.  QEMU can add the capability to the list, it is 
easy because all accesses to the video device's configuration space trap 
to QEMU.  Then you do not need to add fake devices to the machine.

In fact, it would be nice if Intel added such a capability on the next 
generation of integrated graphics, too.  On real hardware, ACPI or some 
other kind of firmware can probe the PCH at 00:1f.0 and initialize that 
vendor-specific capability.  QEMU would just forward the value from the 
host, so that virtualized guests never depend on the PCH at 00:1f.0.

Paolo

> Maybe we will figure better way in the future. But anyway, this
> is always registered as 00:15.0, right? So I think the i915 driver can
> go back to probe the devfn like the native environment.



[RFC][PATCH] gpu:drm:i915:intel_detect_pch: back to check devfn instead of check class type

2014-06-20 Thread Paolo Bonzini
Il 19/06/2014 11:53, Tiejun Chen ha scritto:
> so this mean that isa bridge is still represented with Dev31:Func0
> like the native OS. Furthermore, currently we're pushing VGA
> passthrough support into qemu upstream, and with some discussion,
> we wouldn't set the bridge class type and just expose this devfn.

Even this is not really optimal.  It just happens to work just because 
QEMU's machine is currently a PCI machine with the ISA bridge on 00:01.0.

As soon as you'll try doing integrated graphics passthrough on a PCIe 
machine type (such as QEMU's "-M q35") things will break down just as badly.

Paolo