Re: [PATCH v2 2/4] eventfd: simplify eventfd_signal()
On Wed, Nov 22, 2023 at 1:49 PM Christian Brauner wrote:
>
> Ever since the eventfd type was introduced back in 2007 in commit
> e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal()
> function only ever passed 1 as a value for @n. There's no point in
> keeping that additional argument.
>
> Signed-off-by: Christian Brauner
> ---
>  arch/x86/kvm/hyperv.c | 2 +-
>  arch/x86/kvm/xen.c    | 2 +-
>  virt/kvm/eventfd.c    | 4 ++--
>  30 files changed, 60 insertions(+), 63 deletions(-)

For KVM:

Acked-by: Paolo Bonzini
Re: [PATCH] drm/ttm: set TTM allocated pages as reserved
On 3/29/23 18:43, Christian König wrote:
>> 3) other uses of kmap() must switch to MMU-notifier protection.
>
> I would rather suggest to return the page additionally to the pfn from
> hva_to_pfn() when the function was able to grab a reference to it.
> When the page is then not available you can't call kmap() on that
> either.
>
>> If the DRM people are okay with SetPageReserved() as a temporary
>> hack, we can change or remove the WARN in kvm_is_zone_device_page(),
>> since that is what you are referring to in the commit message.
>
> Adding Daniel for additional comments on this, but essentially we have
> changed quite some infrastructure to make sure that everybody uses
> VM_PFNMAP to prevent stuff like this from happening. I would really
> prefer a proper solution in KVM instead.

Hmm... Now that I think about it, that would be
https://lore.kernel.org/kvm/yc4h+dgfk83ba...@google.com/t/

Time to resurrect that work.

Paolo
Re: [PATCH] drm/ttm: set TTM allocated pages as reserved
On 3/29/23 17:29, Christian König wrote:
>> First, is it a _host_ corruption or a guest corruption/crash?  A
>> guest crash would be KVM doing exactly what it's meant to do: it
>> detects the non-reserved, non-refcounted page and refuses to map it
>> into the guest.
>
> Yes I think that this is what happens.

Ok, I was worried all the time that this was a host corruption/crash, which obviously would have been much worse.

> The use case and mapping is indeed valid as far as I can see, but the
> handling that KVM does here is really problematic.
>
> What needs to happen instead is that when both hva_to_pfn_fast() and
> hva_to_pfn_slow() fail you don't try to convert the PFN into a page
> and so also don't get a reference to the page. This somehow needs to
> be communicated to the callers of hva_to_pfn() so that
> kvm_release_pfn() knows what to do.

There's a bit more complication here:

1) in the guest page fault path we can avoid taking the reference altogether

2) in other MMU-notifier-protected paths, we can also avoid taking the reference but we also must stop using kmap() in virt/kvm/pfncache.c.

3) other uses of kmap() must switch to MMU-notifier protection.

If the DRM people are okay with SetPageReserved() as a temporary hack, we can change or remove the WARN in kvm_is_zone_device_page(), since that is what you are referring to in the commit message.

Paolo
Re: [PATCH] drm/ttm: set TTM allocated pages as reserved
On 3/29/23 16:28, Paolo Bonzini wrote:
> On 3/29/23 15:54, Christian König wrote:
>> KVM tries to grab references to pages in VMAs marked with VM_PFNMAP.
>> This is illegal and can cause data corruption with TTM pages because
>> only some of them are actually reference counted.

After some other offlist discussions, I also would like to understand what you mean by corruption.

First, is it a _host_ corruption or a guest corruption/crash?  A guest crash would be KVM doing exactly what it's meant to do: it detects the non-reserved, non-refcounted page and refuses to map it into the guest.

On the other hand, if it is a host crash, my understanding is that an order>0 allocation leaves the tail pages with a zero reference count (and without a compound_head if, as in this case, __GFP_COMP is unset). If that's correct, more analysis is needed to understand why get_page_unless_zero() isn't rejecting the tail pages.

Paolo

>> Mark all pages allocated by TTM as reserved, this way KVM handles the
>> PFNs like they would point to MMIO space.
>>
>> This still results in a warning, but at least no other problem.
>
> What warning is it?
>
> Paolo
>
>> Signed-off-by: Christian König
Re: [PATCH] drm/ttm: set TTM allocated pages as reserved
On 3/29/23 15:54, Christian König wrote:
> KVM tries to grab references to pages in VMAs marked with VM_PFNMAP.
> This is illegal and can cause data corruption with TTM pages because
> only some of them are actually reference counted.

I think that you are referring to this:

        /* Don't set the __GFP_COMP flag for higher order allocations.
         * Mapping pages directly into an userspace process and calling
         * put_page() on a TTM allocated page is illegal.
         */
        if (order)
                gfp_flags |= __GFP_NOMEMALLOC | __GFP_NORETRY |
                             __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;

By "directly" I guess you mean without going through remap_pfn_range(). Based on our discussion offlist, it should be possible to remove the get_page/put_page from the path that fills in the KVM page table, but it is difficult to remove it altogether (it requires changing everything to use userspace virtual addresses).

Indeed KVM needs to detect non-reference-counted pages because unfortunately there are cases where people want to map VM_PFNMAP pages into a guest. If it is not enough to have PageReserved set, and there is a better check, KVM can be fixed too.

> Mark all pages allocated by TTM as reserved, this way KVM handles the
> PFNs like they would point to MMIO space.
>
> This still results in a warning, but at least no other problem.

What warning is it?

Paolo

> Signed-off-by: Christian König
Re: [PATCH v2 1/5] KVM: do not allow mapping valid but non-refcounted pages
On 25/06/21 09:58, Christian Borntraeger wrote:
> On 25.06.21 09:36, David Stevens wrote:
>> From: Nicholas Piggin
>>
>> It's possible to create a region which maps valid but non-refcounted
>> pages (e.g., tail pages of non-compound higher order allocations).
>> These host pages can then be returned by gfn_to_page, gfn_to_pfn,
>> etc., family of APIs, which take a reference to the page, which takes
>> it from 0 to 1. When the reference is dropped, this will free the
>> page incorrectly.
>>
>> Fix this by only taking a reference on the page if it was non-zero,
>> which indicates it is participating in normal refcounting (and can be
>> released with put_page).
>>
>> Signed-off-by: Nicholas Piggin
>
> I guess this would be the small fix for stable? Do we want to add that
> cc?
>
> Reviewed-by: Christian Borntraeger

Yes, this one is going to Linus today. The rest is for 5.15.

Paolo

>> ---
>>  virt/kvm/kvm_main.c | 19 +++++++++++++++++--
>>  1 file changed, 17 insertions(+), 2 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 3dcc2abbfc60..f7445c3bcd90 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -2175,6 +2175,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
>>  	return true;
>>  }
>>
>> +static int kvm_try_get_pfn(kvm_pfn_t pfn)
>> +{
>> +	if (kvm_is_reserved_pfn(pfn))
>> +		return 1;
>> +	return get_page_unless_zero(pfn_to_page(pfn));
>> +}
>> +
>>  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>>  			       unsigned long addr, bool *async,
>>  			       bool write_fault, bool *writable,
>> @@ -2224,13 +2231,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>>  	 * Whoever called remap_pfn_range is also going to call e.g.
>>  	 * unmap_mapping_range before the underlying pages are freed,
>>  	 * causing a call to our MMU notifier.
>> +	 *
>> +	 * Certain IO or PFNMAP mappings can be backed with valid
>> +	 * struct pages, but be allocated without refcounting e.g.,
>> +	 * tail pages of non-compound higher order allocations, which
>> +	 * would then underflow the refcount when the caller does the
>> +	 * required put_page. Don't allow those pages here.
>>  	 */
>> -	kvm_get_pfn(pfn);
>> +	if (!kvm_try_get_pfn(pfn))
>> +		r = -EFAULT;
>>
>>  out:
>>  	pte_unmap_unlock(ptep, ptl);
>>  	*p_pfn = pfn;
>> -	return 0;
>> +
>> +	return r;
>>  }
>>
>>  /*
Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU
On 24/06/21 14:57, Nicholas Piggin wrote:
> KVM: Fix page ref underflow for regions with valid but non-refcounted pages

It doesn't really fix the underflow, it disallows mapping them in the first place. Since in principle things can break, I'd rather be explicit, so let's go with "KVM: do not allow mapping valid but non-reference-counted pages".

> It's possible to create a region which maps valid but non-refcounted
> pages (e.g., tail pages of non-compound higher order allocations).
> These host pages can then be returned by gfn_to_page, gfn_to_pfn,
> etc., family of APIs, which take a reference to the page, which takes
> it from 0 to 1. When the reference is dropped, this will free the page
> incorrectly.
>
> Fix this by only taking a reference on the page if it was non-zero,

s/on the page/on valid pages/ (makes clear that invalid pages are fine without refcounting).

Thank you *so* much, I'm awful at Linux mm.

Paolo

> which indicates it is participating in normal refcounting (and can be
> released with put_page).
>
> Signed-off-by: Nicholas Piggin
Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU
On 24/06/21 13:42, Nicholas Piggin wrote:
> Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:
>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>>> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by
>>> using follow_pte in gfn_to_pfn. However, the resolved pfns may not
>>> have associated struct pages, so they should not be passed to
>>> pfn_to_page. This series removes such calls from the x86 and arm64
>>> secondary MMU. To do this, this series modifies gfn_to_pfn to return
>>> a struct page in addition to a pfn, if the hva was resolved by gup.
>>> This allows the caller to call put_page only when necessitated by
>>> gup.
>>>
>>> This series provides a helper function that unwraps the new return
>>> type of gfn_to_pfn to provide behavior identical to the old
>>> behavior. As I have no hardware to test powerpc/mips changes, the
>>> function is used there for minimally invasive changes. Additionally,
>>> as gfn_to_page and gfn_to_pfn_cache are not integrated with mmu
>>> notifier, they cannot be easily changed over to only use pfns.
>>>
>>> This addresses CVE-2021-22543 on x86 and arm64.
>>
>> Does this fix the problem? (untested I don't have a POC setup at
>> hand, but at least in concept)
>
> This one actually compiles at least. Unfortunately I don't have much
> time in the near future to test, and I only just found out about this
> CVE a few hours ago.

And it also works (the reproducer gets an infinite stream of userspace exits and especially does not crash). We can still go for David's solution later since MMU notifiers are able to deal with these pages, but it's a very nice patch for stable kernels.

If you provide a Signed-off-by, I can integrate it.

Paolo

> ---
> It's possible to create a region which maps valid but non-refcounted
> pages (e.g., tail pages of non-compound higher order allocations).
> These host pages can then be returned by gfn_to_page, gfn_to_pfn,
> etc., family of APIs, which take a reference to the page, which takes
> it from 0 to 1. When the reference is dropped, this will free the page
> incorrectly.
>
> Fix this by only taking a reference on the page if it was non-zero,
> which indicates it is participating in normal refcounting (and can be
> released with put_page).
> ---
>  virt/kvm/kvm_main.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6a6bc7af0e28..46fb042837d2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
>  	return true;
>  }
>
> +static int kvm_try_get_pfn(kvm_pfn_t pfn)
> +{
> +	if (kvm_is_reserved_pfn(pfn))
> +		return 1;
> +	return get_page_unless_zero(pfn_to_page(pfn));
> +}
> +
>  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> 			       unsigned long addr, bool *async,
> 			       bool write_fault, bool *writable,
> @@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  	 * Whoever called remap_pfn_range is also going to call e.g.
>  	 * unmap_mapping_range before the underlying pages are freed,
>  	 * causing a call to our MMU notifier.
> +	 *
> +	 * Certain IO or PFNMAP mappings can be backed with valid
> +	 * struct pages, but be allocated without refcounting e.g.,
> +	 * tail pages of non-compound higher order allocations, which
> +	 * would then underflow the refcount when the caller does the
> +	 * required put_page. Don't allow those pages here.
>  	 */
> -	kvm_get_pfn(pfn);
> +	if (!kvm_try_get_pfn(pfn))
> +		r = -EFAULT;
>
>  out:
>  	pte_unmap_unlock(ptep, ptl);
>  	*p_pfn = pfn;
> -	return 0;
> +
> +	return r;
>  }
>
>  /*
Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU
On 24/06/21 13:42, Nicholas Piggin wrote:
> +static int kvm_try_get_pfn(kvm_pfn_t pfn)
> +{
> +	if (kvm_is_reserved_pfn(pfn))
> +		return 1;

So !pfn_valid would always return true. Yeah, this should work and is certainly appealing!

Paolo

> +	return get_page_unless_zero(pfn_to_page(pfn));
> +}
> +
>  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> 			       unsigned long addr, bool *async,
> 			       bool write_fault, bool *writable,
> @@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>  	 * Whoever called remap_pfn_range is also going to call e.g.
>  	 * unmap_mapping_range before the underlying pages are freed,
>  	 * causing a call to our MMU notifier.
> +	 *
> +	 * Certain IO or PFNMAP mappings can be backed with valid
> +	 * struct pages, but be allocated without refcounting e.g.,
> +	 * tail pages of non-compound higher order allocations, which
> +	 * would then underflow the refcount when the caller does the
> +	 * required put_page. Don't allow those pages here.
>  	 */
> -	kvm_get_pfn(pfn);
> +	if (!kvm_try_get_pfn(pfn))
> +		r = -EFAULT;
>
>  out:
>  	pte_unmap_unlock(ptep, ptl);
>  	*p_pfn = pfn;
> -	return 0;
> +
> +	return r;
>  }
>
>  /*
Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn
On 24/06/21 12:17, Nicholas Piggin wrote:
> If all callers were updated that is one thing, but from the changelog
> it sounds like that would not happen and there would be some
> gfn_to_pfn users left over. But yes in the end you would either need
> to make gfn_to_pfn never return a page found via follow_pte, or change
> all callers to the new way. If the plan is for the latter then I guess
> that's fine.
>
> Actually in that case anyway I don't see the need -- the existence of
> gfn_to_pfn is enough to know it might be buggy. It can just as easily
> be grepped for as kvm_pfn_page_unwrap.

Sure, but that would leave us with longer function names (gfn_to_pfn_page* instead of gfn_to_pfn*). So the "safe" use is the one that looks worse and the unsafe use is the one that looks safe.

> And are gfn_to_page cases also vulnerable to the same issue?

No, they're just broken for the VM_IO|VM_PFNMAP case.

Paolo

> So I think it could be marked deprecated or something if not
> everything will be converted in the one series, and don't need to
> touch all that arch code with this patch.
Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU
On 24/06/21 12:06, Marc Zyngier wrote:
> On Thu, 24 Jun 2021 09:58:00 +0100, Nicholas Piggin wrote:
>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>>> From: David Stevens
>>>
>>>  out_unlock:
>>>  	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
>>>  		read_unlock(&vcpu->kvm->mmu_lock);
>>>  	else
>>>  		write_unlock(&vcpu->kvm->mmu_lock);
>>> -	kvm_release_pfn_clean(pfn);
>>> +	if (pfnpg.page)
>>> +		put_page(pfnpg.page);
>>>  	return r;
>>>  }
>>
>> How about kvm_release_pfn_page_clean(pfnpg);
>
> I'm not sure. I always found kvm_release_pfn_clean() ugly, because it
> doesn't mark the page 'clean'. I find put_page() more correct.
>
> Something like 'kvm_put_pfn_page()' would make more sense, but I'm so
> bad at naming things that I could just as well call it 'bob()'.

The best way to go would be to get rid of kvm_release_pfn_clean() and always go through a pfn_page. Then we could or could not introduce wrappers kvm_put_pfn_page{,_dirty}.

I think for now it's best to limit the churn since these patches will go in the stable releases too, and clean up the resulting API once we have a clear idea of how all architectures are using kvm_pfn_page.

Paolo
Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn
On 24/06/21 11:57, Nicholas Piggin wrote:
>> Needing kvm_pfn_page_unwrap is a sign that something might be buggy,
>> so it's a good idea to move the short name to the common case and the
>> ugly kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one. In fact
>> I'm not sure there should be any kvm_pfn_page_unwrap in the end.
>
> If all callers were updated that is one thing, but from the changelog
> it sounds like that would not happen and there would be some
> gfn_to_pfn users left over.

In these patches there are, so yeah the plan is to always change the callers to the new way.

> But yes in the end you would either need to make gfn_to_pfn never
> return a page found via follow_pte, or change all callers to the new
> way. If the plan is for the latter then I guess that's fine.
Re: [PATCH 1/6] KVM: x86/mmu: release audited pfns
On 24/06/21 10:43, Nicholas Piggin wrote:
> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>> From: David Stevens
>
> Changelog? This looks like a bug, should it have a Fixes: tag?

Probably has been there forever... The best way to fix the bug would be to nuke mmu_audit.c, which I've threatened to do many times but never followed up on.

Paolo

> Thanks,
> Nick
>
>> Signed-off-by: David Stevens
>> ---
>>  arch/x86/kvm/mmu/mmu_audit.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
>> index cedc17b2f60e..97ff184084b4 100644
>> --- a/arch/x86/kvm/mmu/mmu_audit.c
>> +++ b/arch/x86/kvm/mmu/mmu_audit.c
>> @@ -121,6 +121,8 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
>>  		audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
>>  			     "ent %llx\n", vcpu->arch.mmu->root_level,
>>  			     pfn, hpa, *sptep);
>> +
>> +	kvm_release_pfn_clean(pfn);
>>  }
>>
>>  static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
>> --
>> 2.32.0.93.g670b81a890-goog
Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn
On 24/06/21 10:52, Nicholas Piggin wrote:
>> For now, wrap all calls to gfn_to_pfn functions in the new helper
>> function. Callers which don't need the page struct will be updated in
>> follow-up patches.
>
> Hmm. You mean callers that do need the page will be updated? Normally
> if there will be leftover users that don't need the struct page then
> you would go the other way and keep the old call the same, and add a
> new one (gfn_to_pfn_page) just for those that need it.

Needing kvm_pfn_page_unwrap is a sign that something might be buggy, so it's a good idea to move the short name to the common case and the ugly kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one. In fact I'm not sure there should be any kvm_pfn_page_unwrap in the end.

Paolo
Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU
On 24/06/21 05:57, David Stevens wrote:
>  static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> -			int map_writable, int max_level, kvm_pfn_t pfn,
> +			int map_writable, int max_level,
> +			const struct kvm_pfn_page *pfnpg,
>  			bool prefault, bool is_tdp)
>  {
>  	bool nx_huge_page_workaround_enabled = is_nx_huge_pa

So the main difference with my tentative patches is that here I was passing the struct by value. I'll try to clean this up for 5.15, but for now it's all good like this.

Thanks again!

Paolo
Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU
On 24/06/21 05:57, David Stevens wrote:
> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by
> using follow_pte in gfn_to_pfn. However, the resolved pfns may not
> have associated struct pages, so they should not be passed to
> pfn_to_page. This series removes such calls from the x86 and arm64
> secondary MMU. To do this, this series modifies gfn_to_pfn to return a
> struct page in addition to a pfn, if the hva was resolved by gup. This
> allows the caller to call put_page only when necessitated by gup.
>
> This series provides a helper function that unwraps the new return
> type of gfn_to_pfn to provide behavior identical to the old behavior.
> As I have no hardware to test powerpc/mips changes, the function is
> used there for minimally invasive changes. Additionally, as
> gfn_to_page and gfn_to_pfn_cache are not integrated with mmu notifier,
> they cannot be easily changed over to only use pfns.
>
> This addresses CVE-2021-22543 on x86 and arm64.

Thank you very much for this. I agree that it makes sense to have a minimal change; I had similar changes almost ready, but was stuck with deadlocks in the gfn_to_pfn_cache case. In retrospect I should have posted something similar to your patches.

I have started reviewing the patches, and they look good. I will try to include them in 5.13.

Paolo
Re: [PULL] topic/iomem-mmap-vs-gup
On 10/05/21 19:57, Sean Christopherson wrote:
> +Paolo
>
> On Mon, May 10, 2021, Jason Gunthorpe wrote:
>> On Mon, May 10, 2021 at 04:55:39PM +0200, Daniel Vetter wrote:
>>> yeah vfio is still broken for the case I care about. I think there's
>>> also some questions open still about whether kvm really uses
>>> mmu_notifier in all cases correctly, IIRC kvm doesn't either.
>
> Yep, KVM on x86 has a non-trivial number of flows that don't properly
> hook into the mmu_notifier. Paolo is working on fixing the problem,
> but I believe the rework won't be ready until 5.14.

Yeah, I like the way it's coming, but I'm at 20-ish patches and counting.

Paolo
Re: [PATCH 3/3] mm: unexport follow_pfn
On 08/04/21 12:05, Daniel Vetter wrote:
> On Mon, Mar 29, 2021 at 10:31:01AM -0300, Jason Gunthorpe wrote:
>> On Tue, Mar 16, 2021 at 04:33:03PM +0100, Daniel Vetter wrote:
>>> Both kvm (in bd2fae8da794 ("KVM: do not assume PTE is writable after
>>> follow_pfn")) and vfio (in 07956b6269d3 ("vfio/type1: Use
>>> follow_pte()")) have lost their callsites of follow_pfn(). All the
>>> other ones have been switched over to unsafe_follow_pfn because they
>>> cannot be fixed without breaking userspace api.
>>>
>>> Arguably the vfio code is still racy, but that's kinda a bigger vfio
>>> and kvm
>
> Hm I thought kvm is non-racy due to the mmu notifier catch races?

No, but the plan is indeed to have some struct for each page that uses follow_pfn and update it from the MMU notifiers.

Paolo

>>> picture. But since it does leak the pte beyond where it drops the pt
>>> lock, without anything else like an mmu notifier guaranteeing
>>> coherence, the problem is at least clearly visible in the vfio code.
>>> So good enough with me.
>>>
>>> I've decided to keep the explanation that after dropping the pt lock
>>> you must have an mmu notifier if you keep using the pte somehow by
>>> adjusting it and moving it into the kerneldoc for the new
>>> follow_pte() function.
>>>
>>> Cc: 3...@google.com
>>> Cc: Jann Horn
>>> Cc: Paolo Bonzini
>>> Cc: Jason Gunthorpe
>>> Cc: Cornelia Huck
>>> Cc: Peter Xu
>>> Cc: Alex Williamson
>>> Cc: linux...@kvack.org
>>> Cc: linux-arm-ker...@lists.infradead.org
>>> Cc: linux-samsung-...@vger.kernel.org
>>> Cc: linux-me...@vger.kernel.org
>>> Cc: k...@vger.kernel.org
>>> Signed-off-by: Daniel Vetter
>>> ---
>>>  include/linux/mm.h |  2 --
>>>  mm/memory.c        | 26 +-
>>>  mm/nommu.c         | 13 +
>>>  3 files changed, 6 insertions(+), 35 deletions(-)
>>
>> Reviewed-by: Jason Gunthorpe
>
> Thanks for your r-b tags, I'll add them.
> -Daniel
>
> Jason

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 3/3] mm: unexport follow_pfn
On 24/03/21 13:52, Jason Gunthorpe wrote:
> I think this is the right thing to do. Alex is working on fixing VFIO
> and while kvm is still racy using follow pte, I think they are working
> on it too?

Yeah, or at least we have a plan.

Paolo
Re: [PATCH v6 16/17] RFC: kvm: pass kvm argument to follow_pfn callsites
On 20/11/20 16:44, Daniel Vetter wrote:
>> It's a bit of a pity, it's making an API more complex (the point of
>> gfn_to_pfn_memslot vs gfn_to_pfn is exactly that you don't need a
>> "struct kvm *" and it's clear that you've already done the lookup
>> into that struct kvm).
>
> Yeah I noticed that, I think pushing the lookups down should work, but
> that's a fairly large-scale change. I didn't want to do that for the
> RFC since it would distract from the actual change/goal.
> -Daniel

Pushing the lookups down would be worse code and possibly introduce TOC/TOU races, so better avoid that. Your patch is fine. :)

Paolo
Re: [PATCH v6 16/17] RFC: kvm: pass kvm argument to follow_pfn callsites
On 19/11/20 15:41, Daniel Vetter wrote:
> Both Christoph Hellwig and Jason Gunthorpe suggested that usage of
> follow_pfn by modules should be locked down more. To do so callers
> need to be able to pass the mmu_notifier subscription corresponding to
> the mm_struct to follow_pfn().
>
> This patch does the rote work of doing that in the kvm subsystem. In
> most places this is solved by passing struct kvm * down the call
> stacks as an additional parameter, since that contains the
> mmu_notifier.
>
> Compile tested on all affected arch.

It's a bit of a pity, it's making an API more complex (the point of gfn_to_pfn_memslot vs gfn_to_pfn is exactly that you don't need a "struct kvm *" and it's clear that you've already done the lookup into that struct kvm). But it's not a big deal, and the rationale at least makes sense. So,

Acked-by: Paolo Bonzini

Paolo
Re: [PATCH v2 6/7] debugfs: switch to simplefs inode creation API
On 21/04/20 15:57, Emanuele Giuseppe Esposito wrote:
> -	inode = debugfs_get_inode(dentry->d_sb);

You're not removing debugfs_get_inode, so I think you're going to get a warning (same in tracefs)?

You can wait a few more days for reviews and/or Acked-bys (especially for patches 6 and 7) and then post v3. Since the touch-everything patch (#2) has already been reviewed, and it's mechanical and not introducing any semantic change, you can probably reduce the To/Cc list to filesystem, debugfs and tracefs maintainers.

Thanks,

Paolo
Re: [PATCH 6/8] simplefs: add file creation functions
On 20/04/20 16:28, Greg Kroah-Hartman wrote:
>> I assume you meant a new file. These new functions are used only by a
>> few filesystems, and I didn't want to include them in vmlinux
>> unconditionally, so I introduced simplefs.c and CONFIG_SIMPLEFS
>> instead of extending libfs.c. In this way only fs that need this code
>> like debugfs and tracefs will load it.
>
> Nothing "loads it", why not just make these libfs functions instead?
> As the difference between the two is not obvious at all, please don't
> make things confusing.

I think Emanuele meant "will link it", not "will load it". Emanuele, you can just move everything to libfs.c and get rid of CONFIG_SIMPLEFS too. "Do less" is not an offer you want to turn down!

Thanks,

Paolo
Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
On 14/02/20 23:03, Sean Christopherson wrote:
>> On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu wrote:
>>> AFAICT, it is currently allowed on ARM (verified) and AMD (not
>>> verified, but svm_get_mt_mask returns 0 which supposedly means the
>>> NPT does not restrict what the guest PAT can do). This diff would do
>>> the trick for Intel without needing any uapi change:
>>
>> I would be concerned about Intel CPU errata such as SKX40 and SKX59.
>
> The part KVM cares about, #MC, is already addressed by forcing UC for
> MMIO. The data corruption issue is on the guest kernel to correctly
> use WC and/or non-temporal writes.

What about coherency across live migration? The userspace process would use cached accesses, and also a WBINVD could potentially corrupt guest memory.

Paolo
Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
On 13/02/20 23:18, Chia-I Wu wrote:
>> The bug you mentioned was probably this one
>> https://bugzilla.kernel.org/show_bug.cgi?id=104091

Yes, indeed.

> From what I can tell, the commit allowed the guests to create cached
> mappings to MMIO regions and caused MCEs. That is different than what
> I need, which is to allow guests to create uncached mappings to system
> ram (i.e., !kvm_is_mmio_pfn) when the host userspace also has uncached
> mappings. But it is true that this still allows the userspace & guest
> kernel to create conflicting memory types.

Right, the question is whether the MCEs were tied to MMIO regions specifically and if so why.

An interesting remark is in the footnote of table 11-7 in the SDM. There, for the MTRR (EPT for us) memory type UC you can read:

    The UC attribute comes from the MTRRs and the processors are not
    required to snoop their caches since the data could never have
    been cached. This attribute is preferred for performance reasons.

There are two possibilities:

1) the footnote doesn't apply to UC mode coming from EPT page tables. That would make your change safe.

2) the footnote also applies when the UC attribute comes from the EPT page tables rather than the MTRRs. In that case, the host should use UC as the EPT page attribute if and only if it's consistent with the host MTRRs; it would be more or less impossible to honor UC in the guest MTRRs. In that case, something like the patch below would be needed.

It is not clear from the manual why the footnote would not apply to WC; that is, the manual doesn't say explicitly that the processor does not do snooping for accesses to WC memory. But I guess that must be the case, which is why I used MTRR_TYPE_WRCOMB in the patch below.

Either way, we would have an explanation of why creating cached mappings to MMIO regions would cause MCEs, and why in practice we're not seeing MCEs for guest RAM (the guest would have set WB for that memory in its MTRRs, not UC).

One thing you didn't say: how would userspace use KVM_MEM_DMA? On which regions would it be set?

Thanks,

Paolo

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index dc331fb06495..2be6f7effa1d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6920,8 +6920,16 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	}
 
 	cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
-
 exit:
+	if (cache == MTRR_TYPE_UNCACHABLE && !is_mmio) {
+		/*
+		 * We cannot set UC in the EPT page tables as it can cause
+		 * machine check exceptions (??). Hopefully the guest is
+		 * using PAT.
+		 */
+		cache = MTRR_TYPE_WRCOMB;
+	}
+
 	return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
 }
Re: [RFC PATCH 1/3] KVM: vmx: rewrite the comment in vmx_get_mt_mask
On 13/02/20 22:30, Chia-I Wu wrote: > Better reflect the structure of the code and metion why we could not > always honor the guest. > > Signed-off-by: Chia-I Wu > Cc: Gurchetan Singh > Cc: Gerd Hoffmann > --- > arch/x86/kvm/vmx/vmx.c | 27 +-- > 1 file changed, 17 insertions(+), 10 deletions(-) > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 3be25ecae145..266ef87042da 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -6854,17 +6854,24 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, > gfn_t gfn, bool is_mmio) > u8 cache; > u64 ipat = 0; > > - /* For VT-d and EPT combination > - * 1. MMIO: always map as UC > - * 2. EPT with VT-d: > - * a. VT-d without snooping control feature: can't guarantee the > - * result, try to trust guest. > - * b. VT-d with snooping control feature: snooping control feature of > - * VT-d engine can guarantee the cache correctness. Just set it > - * to WB to keep consistent with host. So the same as item 3. > - * 3. EPT without VT-d: always map as WB and set IPAT=1 to keep > - *consistent with host MTRR > + /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in > + * memory aliases with conflicting memory types and sometimes MCEs. > + * We have to be careful as to what are honored and when. > + * > + * For MMIO, guest CD/MTRR are ignored. The EPT memory type is set to > + * UC. The effective memory type is UC or WC depending on guest PAT. > + * This was historically the source of MCEs and we want to be > + * conservative. > + * > + * When there is no need to deal with noncoherent DMA (e.g., no VT-d > + * or VT-d has snoop control), guest CD/MTRR/PAT are all ignored. The > + * EPT memory type is set to WB. The effective memory type is forced > + * WB. > + * > + * Otherwise, we trust guest. Guest CD/MTRR/PAT are all honored. The > + * EPT memory type is used to emulate guest CD/MTRR. 
>*/ > + > if (is_mmio) { > cache = MTRR_TYPE_UNCACHABLE; > goto exit; > This is certainly an improvement, especially the part that points out how guest PAT still allows MMIO to be handled as WC. Thanks, Paolo ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
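The rewritten comment describes three cases: MMIO, no noncoherent DMA, and "trust the guest". A hedged sketch of that three-way decision follows; the guest CD/CR0 handling and the kvm_mtrr lookup are collapsed into a plain guest_type parameter, so treat this as an illustration rather than the real function.

```c
#include <assert.h>
#include <stdint.h>

#define MTRR_TYPE_UNCACHABLE   0
#define MTRR_TYPE_WRBACK       6
#define VMX_EPT_MT_EPTE_SHIFT  3
#define VMX_EPT_IPAT_BIT       (1ull << 6)

static uint64_t mt_mask_sketch(int is_mmio, int noncoherent_dma,
			       int guest_type)
{
	/* MMIO: UC in EPT; guest PAT can still make it WC (no IPAT) */
	if (is_mmio)
		return (uint64_t)MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

	/* no VT-d, or VT-d with snoop control: force WB, ignore guest */
	if (!noncoherent_dma)
		return ((uint64_t)MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) |
		       VMX_EPT_IPAT_BIT;

	/* otherwise trust the guest: emulate CD/MTRR via the EPT type */
	return (uint64_t)guest_type << VMX_EPT_MT_EPTE_SHIFT;
}
```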
Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
On 13/02/20 22:30, Chia-I Wu wrote: > Hi, > > Host GPU drivers like to give userspace WC mapping. When the userspace makes > the mapping available to a guest, it also tells the guest to create a WC > mapping. However, even when the guest kernel picks the correct memory type, > it gets ignored because of VMX_EPT_IPAT_BIT on Intel. > > This series adds a new flag to KVM_SET_USER_MEMORY_REGION, which tells the > host kernel to honor the guest memory type for the memslot. An alternative > fix is for KVM to unconditionally honor the guest memory type (unless it is > MMIO, to avoid MCEs on Intel). I believe the alternative fix is how things > are on ARM, and probably also how things are on AMD. > > I am new to KVM and HW virtualization technologies. This series is meant as > an RFC. > When we tried to do this in the past, we got machine checks everywhere unfortunately due to the same address being mapped with different memory types. Unfortunately I cannot find the entry anymore in bugzilla, but this was not fixed as far as I know. Paolo
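The RFC proposes a new memslot flag on KVM_SET_USER_MEMORY_REGION. A hypothetical userspace sketch of how that might look is below: KVM_MEM_DMA and its bit position are assumptions taken from the RFC, not a merged ABI, and the struct mirrors struct kvm_userspace_memory_region from <linux/kvm.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KVM_MEM_LOG_DIRTY_PAGES (1u << 0)
#define KVM_MEM_READONLY        (1u << 1)
#define KVM_MEM_DMA             (1u << 2)	/* hypothetical RFC flag */

/* local mirror of struct kvm_userspace_memory_region */
struct memory_region_sketch {
	uint32_t slot;
	uint32_t flags;
	uint64_t guest_phys_addr;
	uint64_t memory_size;
	uint64_t userspace_addr;
};

static struct memory_region_sketch make_wc_slot(uint32_t slot, uint64_t gpa,
						uint64_t size, uint64_t hva)
{
	struct memory_region_sketch r;

	memset(&r, 0, sizeof(r));
	r.slot = slot;
	r.flags = KVM_MEM_DMA;	/* honor guest memory type on this slot */
	r.guest_phys_addr = gpa;
	r.memory_size = size;
	r.userspace_addr = hva;
	return r;
}
```

In real code the filled struct would then be handed to ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &r) on the VM file descriptor.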
Re: KASAN: slab-out-of-bounds Read in fbcon_get_font
On 05/12/19 12:27, Dmitry Vyukov wrote: > Oh, you mean the final bisection crash. Indeed it contains a kvm frame > and it turns out to be a bug in syzkaller code that indeed > misattributed it to kvm instead of netfilter. > Should be fixed now, you may read the commit message for details: > https://github.com/google/syzkaller/commit/4fb74474cf0af2126be3a8989d770c3947ae9478 > > Overall this "making sense out of kernel output" task is the ultimate > insanity, you may skim through this file to get a taste of amount of > hardcoding and special corner cases that need to be handled: > https://github.com/google/syzkaller/blob/master/pkg/report/linux.go > And this is never done, such "exception from exception corner case" > things pop up every week. There is always something to shuffle and > tune. It only keeps functioning due to 500+ test cases for all > possible insane kernel outputs: > https://github.com/google/syzkaller/tree/master/pkg/report/testdata/linux/report > https://github.com/google/syzkaller/tree/master/pkg/report/testdata/linux/guilty > > So thanks for persisting and questioning! We are getting better with > each new test. Thanks to you! I "complain" because I know you're so responsive. :) Paolo ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: KASAN: slab-out-of-bounds Read in fbcon_get_font
On 05/12/19 11:31, Dmitry Vyukov wrote: >> Ah, and because the machine is a KVM guest, kvm_wait appears in a lot of >> backtrace and I get to share syzkaller's joy every time. :) > I don't see any mention of "kvm" in the crash report. It's there in the stack trace, not sure if this is what triggered my Cc: [] kvm_wait+0xca/0xe0 arch/x86/kernel/kvm.c:612 Paolo
Re: KASAN: slab-out-of-bounds Read in fbcon_get_font
On 05/12/19 11:16, Dmitry Vyukov wrote: > On Thu, Dec 5, 2019 at 11:13 AM Paolo Bonzini wrote: >> >> On 04/12/19 22:41, syzbot wrote: >>> syzbot has bisected this bug to: >>> >>> commit 2de50e9674fc4ca3c6174b04477f69eb26b4ee31 >>> Author: Russell Currey >>> Date: Mon Feb 8 04:08:20 2016 + >>> >>> powerpc/powernv: Remove support for p5ioc2 >>> >>> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=127a042ae0 >>> start commit: 76bb8b05 Merge tag 'kbuild-v5.5' of >>> git://git.kernel.org/p.. >>> git tree: upstream >>> final crash:https://syzkaller.appspot.com/x/report.txt?x=117a042ae0 >>> console output: https://syzkaller.appspot.com/x/log.txt?x=167a042ae0 >>> kernel config: https://syzkaller.appspot.com/x/.config?x=dd226651cb0f364b >>> dashboard link: >>> https://syzkaller.appspot.com/bug?extid=4455ca3b3291de891abc >>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=11181edae0 >>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=105cbb7ae0 >>> >>> Reported-by: syzbot+4455ca3b3291de891...@syzkaller.appspotmail.com >>> Fixes: 2de50e9674fc ("powerpc/powernv: Remove support for p5ioc2") >>> >>> For information about bisection process see: >>> https://goo.gl/tpsmEJ#bisection >>> >> >> Why is everybody being CC'd, even if the bug has nothing to do with the >> person's subsystem? > > The To list should be intersection of 2 groups of emails: result of > get_maintainers.pl on the file identified as culprit in the crash > message + emails extracted from the bisected to commit. Ah, and because the machine is a KVM guest, kvm_wait appears in a lot of backtrace and I get to share syzkaller's joy every time. :) This bisect result is bogus, though Tetsuo found the bug anyway. Perhaps you can exclude commits that only touch architectures other than x86? Paolo ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: KASAN: slab-out-of-bounds Read in fbcon_get_font
On 04/12/19 22:41, syzbot wrote: > syzbot has bisected this bug to: > > commit 2de50e9674fc4ca3c6174b04477f69eb26b4ee31 > Author: Russell Currey > Date: Mon Feb 8 04:08:20 2016 + > > powerpc/powernv: Remove support for p5ioc2 > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=127a042ae0 > start commit: 76bb8b05 Merge tag 'kbuild-v5.5' of > git://git.kernel.org/p.. > git tree: upstream > final crash: https://syzkaller.appspot.com/x/report.txt?x=117a042ae0 > console output: https://syzkaller.appspot.com/x/log.txt?x=167a042ae0 > kernel config: https://syzkaller.appspot.com/x/.config?x=dd226651cb0f364b > dashboard link: > https://syzkaller.appspot.com/bug?extid=4455ca3b3291de891abc > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=11181edae0 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=105cbb7ae0 > > Reported-by: syzbot+4455ca3b3291de891...@syzkaller.appspotmail.com > Fixes: 2de50e9674fc ("powerpc/powernv: Remove support for p5ioc2") > > For information about bisection process see: > https://goo.gl/tpsmEJ#bisection > Why is everybody being CC'd, even if the bug has nothing to do with the person's subsystem? Thanks, Paolo ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 09/22] docs: mark orphan documents as such
On 30/05/19 01:23, Mauro Carvalho Chehab wrote: > Sphinx doesn't like orphan documents: > > Documentation/accelerators/ocxl.rst: WARNING: document isn't included in > any toctree > Documentation/arm/stm32/overview.rst: WARNING: document isn't included in > any toctree > Documentation/arm/stm32/stm32f429-overview.rst: WARNING: document isn't > included in any toctree > Documentation/arm/stm32/stm32f746-overview.rst: WARNING: document isn't > included in any toctree > Documentation/arm/stm32/stm32f769-overview.rst: WARNING: document isn't > included in any toctree > Documentation/arm/stm32/stm32h743-overview.rst: WARNING: document isn't > included in any toctree > Documentation/arm/stm32/stm32mp157-overview.rst: WARNING: document isn't > included in any toctree > Documentation/gpu/msm-crash-dump.rst: WARNING: document isn't included in > any toctree > Documentation/interconnect/interconnect.rst: WARNING: document isn't > included in any toctree > Documentation/laptops/lg-laptop.rst: WARNING: document isn't included in > any toctree > Documentation/powerpc/isa-versions.rst: WARNING: document isn't included > in any toctree > Documentation/virtual/kvm/amd-memory-encryption.rst: WARNING: document > isn't included in any toctree > Documentation/virtual/kvm/vcpu-requests.rst: WARNING: document isn't > included in any toctree > > So, while they aren't on any toctree, add :orphan: to them, in order > to silent this warning. > > Signed-off-by: Mauro Carvalho Chehab Please leave out KVM, I'll fix that instead. Thanks for the report! 
Paolo > --- > Documentation/accelerators/ocxl.rst | 2 ++ > Documentation/arm/stm32/overview.rst| 2 ++ > Documentation/arm/stm32/stm32f429-overview.rst | 2 ++ > Documentation/arm/stm32/stm32f746-overview.rst | 2 ++ > Documentation/arm/stm32/stm32f769-overview.rst | 2 ++ > Documentation/arm/stm32/stm32h743-overview.rst | 2 ++ > Documentation/arm/stm32/stm32mp157-overview.rst | 2 ++ > Documentation/gpu/msm-crash-dump.rst| 2 ++ > Documentation/interconnect/interconnect.rst | 2 ++ > Documentation/laptops/lg-laptop.rst | 2 ++ > Documentation/powerpc/isa-versions.rst | 2 ++ > Documentation/virtual/kvm/amd-memory-encryption.rst | 2 ++ > Documentation/virtual/kvm/vcpu-requests.rst | 2 ++ > 13 files changed, 26 insertions(+) > > diff --git a/Documentation/accelerators/ocxl.rst > b/Documentation/accelerators/ocxl.rst > index 14cefc020e2d..b1cea19a90f5 100644 > --- a/Documentation/accelerators/ocxl.rst > +++ b/Documentation/accelerators/ocxl.rst > @@ -1,3 +1,5 @@ > +:orphan: > + > > OpenCAPI (Open Coherent Accelerator Processor Interface) > > diff --git a/Documentation/arm/stm32/overview.rst > b/Documentation/arm/stm32/overview.rst > index 85cfc8410798..f7e734153860 100644 > --- a/Documentation/arm/stm32/overview.rst > +++ b/Documentation/arm/stm32/overview.rst > @@ -1,3 +1,5 @@ > +:orphan: > + > > STM32 ARM Linux Overview > > diff --git a/Documentation/arm/stm32/stm32f429-overview.rst > b/Documentation/arm/stm32/stm32f429-overview.rst > index 18feda97f483..65bbb1c3b423 100644 > --- a/Documentation/arm/stm32/stm32f429-overview.rst > +++ b/Documentation/arm/stm32/stm32f429-overview.rst > @@ -1,3 +1,5 @@ > +:orphan: > + > STM32F429 Overview > == > > diff --git a/Documentation/arm/stm32/stm32f746-overview.rst > b/Documentation/arm/stm32/stm32f746-overview.rst > index b5f4b6ce7656..42d593085015 100644 > --- a/Documentation/arm/stm32/stm32f746-overview.rst > +++ b/Documentation/arm/stm32/stm32f746-overview.rst > @@ -1,3 +1,5 @@ > +:orphan: > + > STM32F746 Overview > == > > diff 
--git a/Documentation/arm/stm32/stm32f769-overview.rst > b/Documentation/arm/stm32/stm32f769-overview.rst > index 228656ced2fe..f6adac862b17 100644 > --- a/Documentation/arm/stm32/stm32f769-overview.rst > +++ b/Documentation/arm/stm32/stm32f769-overview.rst > @@ -1,3 +1,5 @@ > +:orphan: > + > STM32F769 Overview > == > > diff --git a/Documentation/arm/stm32/stm32h743-overview.rst > b/Documentation/arm/stm32/stm32h743-overview.rst > index 3458dc00095d..c525835e7473 100644 > --- a/Documentation/arm/stm32/stm32h743-overview.rst > +++ b/Documentation/arm/stm32/stm32h743-overview.rst > @@ -1,3 +1,5 @@ > +:orphan: > + > STM32H743 Overview > == > > diff --git a/Documentation/arm/stm32/stm32mp157-overview.rst > b/Documentation/arm/stm32/stm32mp157-overview.rst > index 62e176d47ca7..2c52cd020601 100644 > --- a/Documentation/arm/stm32/stm32mp157-overview.rst > +++ b/Documentation/arm/stm32/stm32mp157-overview.rst > @@ -1,3 +1,5 @@ > +:orphan: > + > STM32MP157 Overview >
Re: Possible use_mm() mis-uses
On 23/08/2018 08:07, Zhenyu Wang wrote: > On 2018.08.22 20:20:46 +0200, Paolo Bonzini wrote: >> On 22/08/2018 18:44, Linus Torvalds wrote: >>> An example of something that *isn't* right, is the i915 kvm interface, >>> which does >>> >>> use_mm(kvm->mm); >>> >>> on an mm that was initialized in virt/kvm/kvm_main.c using >>> >>> mmgrab(current->mm); >>> kvm->mm = current->mm; >>> >>> which is *not* right. Using "mmgrab()" does indeed guarantee the >>> lifetime of the 'struct mm_struct' itself, but it does *not* guarantee >>> the lifetime of the page tables. You need to use "mmget()" and >>> "mmput()", which get the reference to the actual process address >>> space! >>> >>> Now, it is *possible* that the kvm use is correct too, because kvm >>> does register a mmu_notifier chain, and in theory you can avoid the >>> proper refcounting by just making sure the mmu "release" notifier >>> kills any existing uses, but I don't really see kvm doing that. Kvm >>> does register a release notifier, but that just flushes the shadow >>> page tables, it doesn't kill any use_mm() use by some i915 use case. >> >> Yes, KVM is correct but the i915 bits are at least fishy. It's probably >> as simple as adding a mmget/mmput pair respectively in kvmgt_guest_init >> and kvmgt_guest_exit, or maybe mmget_not_zero. >> > > yeah, that's the clear way to fix this imo. We only depend on guest > life cycle to access guest memory properly. Here's proposed fix, will > verify and integrate it later. > > Thanks! > > From 5e5a8d0409aa150884adf5a4d0b956fd0b9906b3 Mon Sep 17 00:00:00 2001 > From: Zhenyu Wang > Date: Thu, 23 Aug 2018 14:08:06 +0800 > Subject: [PATCH] drm/i915/gvt: Fix life cycle reference on KVM mm > > Handle guest mm access life cycle properly with mmget()/mmput() > through guest init()/exit(). As noted by Linus, use_mm() depends > on valid live page table but KVM's mmgrab() doesn't guarantee that. 
> As vGPU usage depends on guest VM life cycle, need to make sure to > use mmget()/mmput() to guarantee VM address access. > > Cc: Linus Torvalds > Cc: Paolo Bonzini > Cc: Zhi Wang > Signed-off-by: Zhenyu Wang Reviewed-by: Paolo Bonzini > --- > drivers/gpu/drm/i915/gvt/kvmgt.c | 12 +++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c > b/drivers/gpu/drm/i915/gvt/kvmgt.c > index 71751be329e3..4a0988747d08 100644 > --- a/drivers/gpu/drm/i915/gvt/kvmgt.c > +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c > @@ -32,6 +32,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -1614,9 +1615,16 @@ static int kvmgt_guest_init(struct mdev_device *mdev) > if (__kvmgt_vgpu_exist(vgpu, kvm)) > return -EEXIST; > > + if (!mmget_not_zero(kvm->mm)) { > + gvt_vgpu_err("Can't get KVM mm reference\n"); > + return -EINVAL; > + } > + > info = vzalloc(sizeof(struct kvmgt_guest_info)); > - if (!info) > + if (!info) { > + mmput(kvm->mm); > return -ENOMEM; > + } > > vgpu->handle = (unsigned long)info; > info->vgpu = vgpu; > @@ -1647,6 +1655,8 @@ static bool kvmgt_guest_exit(struct kvmgt_guest_info > *info) > debugfs_remove(info->debugfs_cache_entries); > > kvm_page_track_unregister_notifier(info->kvm, >track_node); > + if (info->kvm->mm) > + mmput(info->kvm->mm); > kvm_put_kvm(info->kvm); > kvmgt_protect_table_destroy(info); > gvt_cache_destroy(info->vgpu); > signature.asc Description: OpenPGP digital signature ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
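The shape of the fix — take an mm_users reference in kvmgt_guest_init(), drop it again on the error path, and drop it in kvmgt_guest_exit() — can be modeled with a toy refcount. Names and structures here are simplified stand-ins, not the real gvt code.

```c
#include <assert.h>

struct toy_kvm { int mm_users; };

/* mimics: mmget_not_zero() on entry, mmput() on the -ENOMEM path */
static int guest_init_sketch(struct toy_kvm *kvm, int alloc_fails)
{
	if (kvm->mm_users == 0)
		return -1;		/* mmget_not_zero() failed */
	kvm->mm_users++;		/* mmget_not_zero() succeeded */
	if (alloc_fails) {
		kvm->mm_users--;	/* mmput() before returning -ENOMEM */
		return -1;
	}
	return 0;
}

/* mimics: mmput() in kvmgt_guest_exit() */
static void guest_exit_sketch(struct toy_kvm *kvm)
{
	kvm->mm_users--;
}
```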
Re: Possible use_mm() mis-uses
On 22/08/2018 18:44, Linus Torvalds wrote: > An example of something that *isn't* right, is the i915 kvm interface, > which does > > use_mm(kvm->mm); > > on an mm that was initialized in virt/kvm/kvm_main.c using > > mmgrab(current->mm); > kvm->mm = current->mm; > > which is *not* right. Using "mmgrab()" does indeed guarantee the > lifetime of the 'struct mm_struct' itself, but it does *not* guarantee > the lifetime of the page tables. You need to use "mmget()" and > "mmput()", which get the reference to the actual process address > space! > > Now, it is *possible* that the kvm use is correct too, because kvm > does register a mmu_notifier chain, and in theory you can avoid the > proper refcounting by just making sure the mmu "release" notifier > kills any existing uses, but I don't really see kvm doing that. Kvm > does register a release notifier, but that just flushes the shadow > page tables, it doesn't kill any use_mm() use by some i915 use case. Yes, KVM is correct but the i915 bits are at least fishy. It's probably as simple as adding a mmget/mmput pair respectively in kvmgt_guest_init and kvmgt_guest_exit, or maybe mmget_not_zero. Paolo
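The distinction Linus draws — mm_count (mmgrab/mmdrop) pins only the struct mm_struct, while mm_users (mmget/mmput) additionally keeps the page tables alive — can be shown with a toy model. This is a simplified sketch of the semantics, not kernel code.

```c
#include <assert.h>

struct toy_mm {
	int mm_users;		/* address-space users (mmget/mmput) */
	int mm_count;		/* struct-lifetime refs (mmgrab/mmdrop) */
	int page_tables_alive;
};

static void toy_mmgrab(struct toy_mm *mm) { mm->mm_count++; }

static int toy_mmget_not_zero(struct toy_mm *mm)
{
	if (mm->mm_users == 0)
		return 0;	/* address space already torn down */
	mm->mm_users++;
	return 1;
}

static void toy_mmput(struct toy_mm *mm)
{
	if (--mm->mm_users == 0)
		mm->page_tables_alive = 0;	/* exit_mmap() runs here */
}
```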
[Qemu-devel] [PATCH] drm: update MAINTAINERS for qemu drivers (bochs, cirrus, qxl, virtio-gpu)
On 22/11/2016 19:54, Eric Blake wrote: >>> >> DRM DRIVER FOR BOCHS VIRTUAL GPU >>> >> M: Gerd Hoffmann >>> >> -S: Odd Fixes >>> >> +L: qemu-devel at nongnu.org >> > >> > qemu-devel list already has very high traffic - not sure whether it >> > makes much sense to route even more additional patches here. Maybe >> > rather create a separate mailing list like qemu-graphics at nongnu.org ? > In practice, ALL patches should already be going to qemu-devel, even if > they are ALSO going to some other list. For example, qemu-block and > qemu-trivial are definitely cases where we have separate lists, but > where posters are reminded to include qemu-devel in cc if they want a > patch applied. The difference is that these would be kernel patches. Paolo
[PATCH 2/2] mm: remove get_user_pages_locked()
On 31/10/2016 14:48, Lorenzo Stoakes wrote: > On Mon, Oct 31, 2016 at 12:45:36PM +0100, Paolo Bonzini wrote: >> >> >> On 31/10/2016 11:02, Lorenzo Stoakes wrote: >>> - * >>> - * get_user_pages should be phased out in favor of >>> - * get_user_pages_locked|unlocked or get_user_pages_fast. Nothing >>> - * should use get_user_pages because it cannot pass >>> - * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault. >> >> This comment should be preserved in some way. In addition, removing > > Could you clarify what you think should be retained? > > The comment seems to me to be largely rendered redundant by this change - > get_user_pages() now offers identical behaviour, and of course the latter part > of the comment ('because it cannot pass FAULT_FLAG_ALLOW_RETRY') is rendered > incorrect by this change. > > It could be replaced with a recommendation to make use of VM_FAULT_RETRY logic > if possible. Yes, exactly. locked == NULL should be avoided whenever mmap_sem can be dropped, and the comment indirectly said so. Now most of those cases actually are those where you can just call get_user_pages_unlocked. >> get_user_pages_locked() makes it harder (compared to a simple "git grep >> -w") to identify callers that lack allow-retry functionality). So I'm >> not sure about the benefits of these patches. > > That's a fair point, and certainly this cleanup series is less obviously > helpful > than previous ones. > > However, there are a few points in its favour: > > 1. get_user_pages_remote() was the mirror of get_user_pages() prior to adding > an >int *locked parameter to the former (necessary to allow for the unexport of >__get_user_pages_unlocked()), differing only in task/mm being specified and >FOLL_REMOTE being set. This patch series keeps these functions > 'synchronised' >in this respect. > > 2. There is currently only one caller of get_user_pages_locked() in >mm/frame_vector.c which seems to suggest this function isn't widely >used/known. Or not widely necessary. :) > 3. 
This change results in all slow-path get_user_pages*() functions having the >ability to use VM_FAULT_RETRY logic rather than 'defaulting' to using >get_user_pages() that doesn't let you do this even if you wanted to. This is only true if someone does the work though. From a quick look at your series, the following files can be trivially changed to use get_user_pages_unlocked: - drivers/gpu/drm/via/via_dmablit.c - drivers/platform/goldfish/goldfish_pipe.c - drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c - drivers/rapidio/devices/rio_mport_cdev.c - drivers/virt/fsl_hypervisor.c Also, videobuf_dma_init_user could be changed to use retry by adding a *locked argument to videobuf_dma_init_user_locked, prototype patch after my signature. Everything else is probably best kept using get_user_pages. > 4. It's somewhat confusing/redundant having both get_user_pages_locked() and >get_user_pages() functions which both require mmap_sem to be held (i.e. > both >are 'locked' versions.) > >> If all callers were changed, then sure removing the _locked suffix would >> be a good idea. > > It seems many callers cannot release mmap_sem so VM_FAULT_RETRY logic couldn't > happen anyway in this cases and in these cases we couldn't change the caller. But then get_user_pages_locked remains a less-common case, so perhaps it's a good thing to give it a longer name! > Overall, an alternative here might be to remove get_user_pages() and update > get_user_pages_locked() to add a 'vmas' parameter, however doing this renders > get_user_pages_unlocked() asymmetric as it would lack an vmas parameter > (adding > one there would make no sense as VM_FAULT_RETRY logic invalidates VMAs) though > perhaps not such a big issue as it makes sense as to why this is the case. Adding the 'vmas' parameter to get_user_pages_locked would make little sense. Since VM_FAULT_RETRY invalidates it and g_u_p_locked can and does retry, it would generally not be useful. 
So I think we have the right API now:

- do not have lock? Use get_user_pages_unlocked, get retry for free, no need to handle mmap_sem and the locked argument; cannot get back vmas.
- have and cannot drop lock? Use get_user_pages, no need to pass NULL for the locked argument; can get back vmas.
- have but can drop lock (rare case)? Use get_user_pages_locked, cannot get back vmas.

Paolo > get_user_pages_unlocked() definitely seems to be a useful helper and therefore > makes sense to retain. > Of course another alternative is to leave things be :) > diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c index 1db0af6c7f94..dae
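The three selection rules above can be encoded as a toy decision helper. This is purely illustrative — the real functions live in mm/gup.c and the "lock" in question is mmap_sem.

```c
#include <assert.h>
#include <string.h>

static const char *pick_gup_variant(int have_mmap_sem, int can_drop_lock)
{
	if (!have_mmap_sem)
		return "get_user_pages_unlocked"; /* retry for free, no vmas */
	if (can_drop_lock)
		return "get_user_pages_locked";   /* rare case, no vmas */
	return "get_user_pages";                  /* can get back vmas */
}
```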
[PATCH 2/2] mm: remove get_user_pages_locked()
On 31/10/2016 11:02, Lorenzo Stoakes wrote: > - * > - * get_user_pages should be phased out in favor of > - * get_user_pages_locked|unlocked or get_user_pages_fast. Nothing > - * should use get_user_pages because it cannot pass > - * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault. This comment should be preserved in some way. In addition, removing get_user_pages_locked() makes it harder (compared to a simple "git grep -w") to identify callers that lack allow-retry functionality. So I'm not sure about the benefits of these patches. If all callers were changed, then sure removing the _locked suffix would be a good idea. Paolo
[PATCH 02/10] mm: remove write/force parameters from __get_user_pages_unlocked()
max_pages_per_loop = PVM_MAX_KMALLOC_PAGES > / sizeof(struct pages *); > + unsigned int flags = FOLL_REMOTE; > > /* Work out address and page range required */ > if (len == 0) > return 0; > nr_pages = (addr + len - 1) / PAGE_SIZE - addr / PAGE_SIZE + 1; > > + if (vm_write) > + flags |= FOLL_WRITE; > + > while (!rc && nr_pages && iov_iter_count(iter)) { > int pages = min(nr_pages, max_pages_per_loop); > size_t bytes; > @@ -104,8 +108,7 @@ static int process_vm_rw_single_vec(unsigned long addr, >* current/current->mm >*/ > pages = __get_user_pages_unlocked(task, mm, pa, pages, > - vm_write, 0, process_pages, > - FOLL_REMOTE); > + process_pages, flags); > if (pages <= 0) > return -EFAULT; > > diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c > index db96688..8035cc1 100644 > --- a/virt/kvm/async_pf.c > +++ b/virt/kvm/async_pf.c > @@ -84,7 +84,8 @@ static void async_pf_execute(struct work_struct *work) >* mm and might be done in another context, so we must >* use FOLL_REMOTE. >*/ > - __get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL, FOLL_REMOTE); > + __get_user_pages_unlocked(NULL, mm, addr, 1, NULL, > + FOLL_WRITE | FOLL_REMOTE); > > kvm_async_page_present_sync(vcpu, apf); > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 81dfc73..28510e7 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1416,10 +1416,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool > *async, bool write_fault, > down_read(>mm->mmap_sem); > npages = get_user_page_nowait(addr, write_fault, page); > up_read(>mm->mmap_sem); > - } else > + } else { > + unsigned int flags = FOLL_TOUCH | FOLL_HWPOISON; > + > + if (write_fault) > + flags |= FOLL_WRITE; > + > npages = __get_user_pages_unlocked(current, current->mm, addr, > 1, > -write_fault, 0, page, > -FOLL_TOUCH|FOLL_HWPOISON); > +page, flags); > + } > if (npages != 1) > return npages; > > Acked-by: Paolo Bonzini
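The conversion the patch applies at every call site — replacing the write/force boolean parameters with FOLL_* bits OR'd into a flags word — follows one pattern, sketched below. The flag values are copied from kernels of that era but should be treated as illustrative.

```c
#include <assert.h>

#define FOLL_WRITE   0x01
#define FOLL_TOUCH   0x02
#define FOLL_FORCE   0x10
#define FOLL_REMOTE  0x2000

/* turn the old (write, force) booleans into gup_flags */
static unsigned int gup_flags_from_bools(int write, int force,
					 unsigned int base_flags)
{
	unsigned int flags = base_flags;

	if (write)
		flags |= FOLL_WRITE;
	if (force)
		flags |= FOLL_FORCE;
	return flags;
}
```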
[Intel-gfx] [RFC][PATCH] gpu:drm:i915:intel_detect_pch: back to check devfn instead of check class type
On 07/07/2014 19:54, Daniel Vetter wrote: > On Mon, Jul 07, 2014 at 04:57:45PM +0200, Paolo Bonzini wrote: >> On 07/07/2014 16:49, Daniel Vetter wrote: >>> So the correct fix to forward intel gpus to guests is indeed to somehow >>> fake the pch pci ids since the driver really needs them. Gross design, but >>> that's how the hardware works. >> >> A way that could work for virtualization is this: if you find the card has a >> magic subsystem vendor id, fetch the subsystem device id and use _that_ as >> the PCH device id. >> >> Would that work for you? > > I guess for quemu it also depends upon what windows does since we can't > change that. If we care about that part. Another consideration is > supporting older kernels, if that's possible at all. Yes, but right now it's more important to get something that's not too gross for the future, for both Linux and Windows. Hacks for existing guests can be done separately, especially since they might differ between Linux (check ISA bridge) and Windows (check 1f.0). Paolo
[RFC][PATCH] gpu:drm:i915:intel_detect_pch: back to check devfn instead of check class type
On 07/07/2014 16:49, Daniel Vetter wrote: > So the correct fix to forward intel gpus to guests is indeed to somehow > fake the pch pci ids since the driver really needs them. Gross design, but > that's how the hardware works. A way that could work for virtualization is this: if you find the card has a magic subsystem vendor id, fetch the subsystem device id and use _that_ as the PCH device id. Would that work for you? Paolo
[RFC][PATCH] gpu:drm:i915:intel_detect_pch: back to check devfn instead of check class type
On 30/06/2014 05:13, Chen, Tiejun wrote: > After I discuss internal, we think even we just set the real > vendor/device ids to this ISA bridge at 00:1f.0, guest firmware should > still work well with these pair of real vendor/device ids. > > So if you think something would conflict or be broken, could you tell us > what's exactly that? Then we will double check. The Xen hvmloader doesn't break since it only supports one chipset. But SeaBIOS checks for the exact vendor/device ids since Q35 support was added. If you want to add this feature, try to implement it in a way that is a bit more forward-looking. I'm sure that Xen sooner or later will want a PCIe chipset, otherwise things such as AER forwarding are impossible. Paolo
[RFC][PATCH] gpu:drm:i915:intel_detect_pch: back to check devfn instead of check class type
Il 25/06/2014 09:34, Chen, Tiejun ha scritto: > On 2014/6/25 14:48, Paolo Bonzini wrote: >> Second problem. Your IGD passthrough code currently works with QEMU's >> PIIX4-based machine. But what happens if you try to extend it, so that > > Yes, current xen machine, xenpv, is based on pii4, and also I don't > known if we will plan to migrate to q35 or others. So its hard to > further say more now. > >> it works with QEMU's ICH9-based machine? That's a more modern machine >> that has a PCIe chipset and hence has its ISA bridge at 00:1f.0. Now > > But even in this case, could we set the real vendor/device ids for that > ISA bridge at 00:1f.0? If not, what's broken? The config space layout changes for different vendor/device ids, so the guest firmware only works if you have the right vendor/device id. >> It is only slightly better, but the right solution is to fix the driver. >> There is absolutely zero reason why a graphics driver should know >> about the vendor/device ids of the PCH. > > This means we have to fix this both on Linux and Windows but I'm not > sure if this is feasible to us. You have to do it if you want this feature in QEMU in a future-proof way. You _can_ provide the ugly PIIX4-specific hack as a compatibility fallback (and this patch is okay to make the compatibility fallback less hacky). However, I don't think QEMU should accept the patch for IGD passthrough unless Intel is willing to make drivers virtualization-friendly. Once you assign the IGD, it is not that integrated anymore and the drivers must take that into account. It is worthwhile pointing out that neither AMD nor nVidia need any of this. >> The right way could be to make QEMU add a vendor-specific capability to >> the video device. The driver can probe for that capability before > > Do you mean we can pick two unused offsets in the configuration space of > the video device as a vendor-specific capability to hold the > vendor/device ids of the PCH? 
Yes, either that or add a new capability (which lets you choose the offsets more freely). If the IGD driver needs config space fields of the MCH, those fields could also be mirrored in the new capability. QEMU would forward them automatically. It could even be a new BAR, which gives even more freedom to allocate the fields. >> looking at the PCI bus. QEMU can add the capability to the list, it is >> easy because all accesses to the video device's configuration space trap >> to QEMU. Then you do not need to add fake devices to the machine. >> >> In fact, it would be nice if Intel added such a capability on the next >> generation of integrated graphics, too. On real hardware, ACPI or some > > Maybe, but even this would be implemented, shouldn't we need to be > compatible with those old generations? Yes. - old generation / old driver: use 00:1f.0 hack, only guaranteed to work on PIIX4-based virtual guest - old generation / new driver: use 00:1f.0 hack on real hardware, use capability on 00:02.0 on virtual guest, can work on PCIe virtual guest - new generation / old driver: doesn't exist - new generation / new driver: always use capability on 00:02.0, can work on PCIe virtual guest. Paolo
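The proposal above — have the driver probe the video device's own config space for a vendor-specific capability carrying the PCH ids, instead of scanning the bus for 00:1f.0 — amounts to a standard PCI capability-list walk. A sketch follows; the 256-byte array stands in for pci_read_config_byte(), and the layout of the capability body (where QEMU would place the PCH ids) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_STATUS           0x06
#define PCI_STATUS_CAP_LIST  0x10
#define PCI_CAPABILITY_LIST  0x34	/* pointer to first capability */
#define PCI_CAP_ID_VNDR      0x09	/* vendor-specific capability */

/* returns config-space offset of the vendor cap, or 0 if absent */
static int find_vendor_cap(const uint8_t cfg[256])
{
	uint16_t status = cfg[PCI_STATUS] | (cfg[PCI_STATUS + 1] << 8);
	uint8_t pos;
	int guard = 48;		/* avoid cycles in broken config space */

	if (!(status & PCI_STATUS_CAP_LIST))
		return 0;
	pos = cfg[PCI_CAPABILITY_LIST] & ~3;
	while (pos && guard--) {
		if (cfg[pos] == PCI_CAP_ID_VNDR)
			return pos;
		pos = cfg[pos + 1] & ~3;	/* next-capability pointer */
	}
	return 0;
}
```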
[RFC][PATCH] gpu:drm:i915:intel_detect_pch: back to check devfn instead of check class type
Il 22/06/2014 10:25, Chen, Tiejun ha scritto: > In qemu-upstream, as you commented we can't create this as a ISA class > type explicitly. Note I didn't say that QEMU doesn't like having two ISA bridges. I commented that the firmware will see two ISA bridges and will try to initialize both of them. It currently doesn't just because it only knows of two southbridge PCI IDs, and both of them are old or relatively old (PIIX3/4 and ICH9). But what would happen if you did graphics passthrough on a machine with an ICH9 southbridge? Your code will create the PIIX4 ISA bridge, and will create an ICH9 southbridge just to please the i915 driver. The BIOS will recognize the ICH9's vendor/device ids and try to initialize it. It will write to the configuration space to set up PCI PIRQ[A-H] routing, for example, which makes no sense since your ICH9 is just a dummy device. Second problem. Your IGD passthrough code currently works with QEMU's PIIX4-based machine. But what happens if you try to extend it, so that it works with QEMU's ICH9-based machine? That's a more modern machine that has a PCIe chipset and hence has its ISA bridge at 00:1f.0. Now you have a conflict. In other words, if you want IGD passthrough in QEMU, you need a solution that is future-proof. > So we compromise by faking this ISA bridge without ISA > class type setting (as I recall you already said this way is slightly > better). It is only slightly better, but the right solution is to fix the driver. There is absolutely zero reason why a graphics driver should know about the vendor/device ids of the PCH. The right way could be to make QEMU add a vendor-specific capability to the video device. The driver can probe for that capability before looking at the PCI bus. QEMU can add the capability to the list, it is easy because all accesses to the video device's configuration space trap to QEMU. Then you do not need to add fake devices to the machine. 
In fact, it would be nice if Intel added such a capability on the next generation of integrated graphics, too. On real hardware, ACPI or some other kind of firmware can probe the PCH at 00:1f.0 and initialize that vendor-specific capability. QEMU would just forward the value from the host, so that virtualized guests never depend on the PCH at 00:1f.0. Paolo > Maybe we will figure better way in the future. But anyway, this > is always registered as 00:15.0, right? So I think the i915 driver can > go back to probe the devfn like the native environment.
[RFC][PATCH] gpu:drm:i915:intel_detect_pch: back to check devfn instead of check class type
On 19/06/2014 11:53, Tiejun Chen wrote: > so this mean that isa bridge is still represented with Dev31:Func0 > like the native OS. Furthermore, currently we're pushing VGA > passthrough support into qemu upstream, and with some discussion, > we wouldn't set the bridge class type and just expose this devfn. Even this is not really optimal. It just happens to work because QEMU's machine is currently a PCI machine with the ISA bridge on 00:01.0. As soon as you'll try doing integrated graphics passthrough on a PCIe machine type (such as QEMU's "-M q35") things will break down just as badly. Paolo