Re: [PATCH] kfifo: add memory barrier in kfifo to prevent data loss
On 12/12/18 8:50 AM, Kees Cook wrote:
> On Mon, Dec 10, 2018 at 7:41 PM wrote:
>>
>> From: Yulei Zhang
>>
>> Early this year we spotted two possible issues in the kernel kfifo.
>>
>> The first was reported by Xiao Guangrong to the linux kernel list:
>> https://lkml.org/lkml/2018/5/11/58
>> In the current kfifo implementation there is a missing memory
>> barrier on the read side, so without a proper barrier between
>> reading kfifo->in and fetching the data there is a potential
>> ordering issue.
>>
>> Besides that, there is another potential issue in kfifo; consider
>> the following case. At the beginning:
>>
>>     ring->size = 4
>>     ring->out = 0
>>     ring->in = 4
>>
>> Consumer                            Producer
>> --------                            --------
>> index = ring->out; /* index == 0 */
>> ring->out++;  /* ring->out == 1 */
>> < Re-Order >
>>                                     out = ring->out;
>>                                     if (ring->in - out >= ring->mask)
>>                                             return -EFULL;
>>                                     /* sees the ring as not full */
>>                                     index = ring->in & ring->mask;
>>                                     /* index == 0 */
>>                                     ring->data[index] = new_data;
>>                                     ring->in++;
>>
>> data = ring->data[index];
>> /* the old data has been overwritten by new_data */
>>
>> To avoid the issue:
>> 1) the consumer should read ring->data[] out before updating
>>    ring->out;
>> 2) the producer should read ring->out before updating
>>    ring->data[].
>>
>> So this patch introduces the following four functions, which are
>> wrapped with the proper memory barriers and kept in pairs, to make
>> sure the in and out indexes are fetched and updated in order and
>> data loss is avoided:
>>
>> kfifo_read_index_in()
>> kfifo_write_index_in()
>> kfifo_read_index_out()
>> kfifo_write_index_out()
>>
>> Signed-off-by: Yulei Zhang
>> Signed-off-by: Guangrong Xiao
>
> I've added some more people to CC that might want to see this.
> Thanks for sending this!

Hi,

Ping... could anyone have a look? ;)

Thanks!
Re: [PATCH] KVM: try __get_user_pages_fast even if not in atomic context
On 07/27/2018 11:46 PM, Paolo Bonzini wrote:
> We are currently cutting hva_to_pfn_fast short if we do not want an
> immediate exit, which is represented by !async && !atomic. However,
> this is unnecessary, and __get_user_pages_fast is *much* faster
> because the regular get_user_pages takes pmd_lock/pte_lock. In fact,
> when many CPUs take a nested vmexit at the same time the contention
> on those locks is visible, and this patch removes about 25%
> (compared to 4.18) from vmexit.flat on a 16 vCPU nested guest.

Nice improvement.

After that, we will unconditionally try hva_to_pfn_fast(); does it hurt
the case where the mappings are not yet present in the host's page
tables?

Can we apply this technique to other places that use gup, or even
squash it into get_user_pages()?
Re: [PATCH] KVM/MMU: Combine flushing remote tlb in mmu_set_spte()
On 07/24/2018 10:35 PM, Paolo Bonzini wrote:
> On 24/07/2018 10:17, Tianyu Lan wrote:
>> mmu_set_spte() flushes remote TLBs for drop_parent_pte/drop_spte()
>> and set_spte() separately. This may introduce redundant flushes.
>> This patch combines these flushes and checks the flush request
>> after calling set_spte().
>>
>> Signed-off-by: Lan Tianyu
>
> Looks good, but I'd like a second opinion. Guangrong, Junaid, can
> you review this?

It looks good to me.

Reviewed-by: Xiao Guangrong

BTW, the @intel box is not accessible to me now. ;)
Is read barrier missed in kfifo?
Hi,

Currently there is no read barrier between reading the index (kfifo.in)
and fetching the real data from the fifo. I am afraid the fifo may be
observed as not empty while the data is not actually ready to be read.
Right?

Thanks!
Re: [PATCH] KVM: X86: Fix SMRAM accessing even if VM is shutdown
On 02/09/2018 08:42 PM, Paolo Bonzini wrote:
> On 09/02/2018 04:22, Xiao Guangrong wrote:
>> That is a good question... :)
>>
>> This case (with KVM_MEMSLOT_INVALID set) can easily be constructed;
>> userspace should avoid it by itself (by preventing a vCPU from
>> accessing a memslot which is being updated). If it happens, it's an
>> operational issue rather than an INTERNAL ERROR. Maybe treating it
>> as an MMIO access and returning to userspace with MMIO_EXIT is a
>> better solution...
>
> Yeah, that's what emulation would do (except if it's an instruction
> fetch, which will cause emulation to fail). I think it's a bug in
> the non-EPT #PF case that we return with -EFAULT.

Wanpeng, could you please do it? :)
Re: [PATCH] KVM: X86: Fix SMRAM accessing even if VM is shutdown
On 02/08/2018 06:31 PM, Paolo Bonzini wrote:
> On 08/02/2018 09:57, Xiao Guangrong wrote:
>>> Maybe it should return RET_PF_EMULATE, which would cause an
>>> emulation failure and then an exit with KVM_EXIT_INTERNAL_ERROR.
>>
>> So the root cause is that a running vCPU accesses memory whose
>> memslot is being updated (i.e., KVM_MEMSLOT_INVALID is set on its
>> memslot). The normal #PF handler breaks KVM_RUN and returns -EFAULT
>> to userspace; we'd better make the ept-misconfig handler follow
>> this style as well.
>
> Why return -EFAULT and not attempt emulation (which will fail)?

That is a good question... :)

This case (with KVM_MEMSLOT_INVALID set) can easily be constructed;
userspace should avoid it by itself (by preventing a vCPU from
accessing a memslot which is being updated). If it happens, it's an
operational issue rather than an INTERNAL ERROR. Maybe treating it as
an MMIO access and returning to userspace with MMIO_EXIT is a better
solution...
Re: [PATCH] KVM: X86: Fix SMRAM accessing even if VM is shutdown
On 02/07/2018 10:16 PM, Paolo Bonzini wrote:
> On 07/02/2018 07:25, Wanpeng Li wrote:
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 786cd00..445e702 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -7458,6 +7458,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>>  		goto out;
>>  	}
>>
>> +	if (unlikely(vcpu->run->exit_reason == KVM_EXIT_SHUTDOWN)) {
>> +		r = -EINVAL;
>> +		goto out;
>> +	}
>> +
>>  	if (vcpu->run->kvm_dirty_regs) {
>>  		r = sync_regs(vcpu);
>>  		if (r != 0)
>
> This most likely breaks triple faults in the usual case where they
> should result in resetting the system; the KVM API doesn't say that
> you should clear vcpu->run->exit_reason before entering.
>
> What exactly causes the EPT misconfig to reach the WARN? That is,
> how does kvm_mmu_page_fault end up returning a negative errno value?
> If I read the code correctly only tdp_page_fault can do so, so my
> guess would be kvm_handle_bad_page:
>
> 	if (pfn == KVM_PFN_ERR_RO_FAULT)
> 		return RET_PF_EMULATE;
>
> 	if (pfn == KVM_PFN_ERR_HWPOISON) {
> 		kvm_send_hwpoison_signal(kvm_vcpu_gfn_to_hva(vcpu, gfn), current);
> 		return RET_PF_RETRY;
> 	}
>
> 	/* KVM_PFN_ERR_FAULT */
> 	return -EFAULT;
>
> Maybe it should return RET_PF_EMULATE, which would cause an emulation
> failure and then an exit with KVM_EXIT_INTERNAL_ERROR.

So the root cause is that a running vCPU accesses memory whose memslot
is being updated (i.e., KVM_MEMSLOT_INVALID is set on its memslot). The
normal #PF handler breaks KVM_RUN and returns -EFAULT to userspace;
we'd better make the ept-misconfig handler follow this style as well.

Actually, the WARN_ON in the ept-misconfig handler is unnecessary, as
kvm_mmu_page_fault() will warn us if it is a real ept misconfig, so we
can simply return kvm_mmu_page_fault().
Re: [PATCH v5 0/2] KVM: MMU: fix kvm_is_mmio_pfn()
On 11/08/2017 03:56 PM, Haozhong Zhang wrote:
> Some reserved pages, such as those from NVDIMM DAX devices, are not
> for MMIO, and can be mapped with cached memory type for better
> performance. However, the above check misconceives those pages as
> MMIO. Because KVM maps MMIO pages with UC memory type, the
> performance of guest accesses to those pages would be harmed.
> Therefore, we check the host memory type in addition and only treat
> UC/UC- pages as MMIO.

Reviewed-by: Xiao Guangrong <xiaoguangr...@tencent.com>
Re: [PATCH v4 3/3] KVM: MMU: consider host cache mode in MMIO page check
On 11/03/2017 05:29 PM, Haozhong Zhang wrote:
> On 11/03/17 17:24 +0800, Xiao Guangrong wrote:
>> On 11/03/2017 05:02 PM, Haozhong Zhang wrote:
>>> On 11/03/17 16:51 +0800, Haozhong Zhang wrote:
>>>> On 11/03/17 14:54 +0800, Xiao Guangrong wrote:
>>>>> On 11/03/2017 01:53 PM, Haozhong Zhang wrote:
>>>>>> [the kvm_is_mmio_pfn() patch, quoted in full in the direct
>>>>>> reply to it later in this digest]
>>>>>
>>>>> Can it be compiled if !CONFIG_PAT?
>>>>
>>>> Yes. What I check via pat_enabled() is not only whether PAT
>>>> support is compiled in, but also whether PAT is enabled at
>>>> runtime.
>>>>
>>>>> It would be better if we move pat_enabled out of kvm as well,
>>>
>>> Surely I can combine them in one function like
>>>
>>> bool pat_pfn_is_uc(pfn_t pfn)
>>> {
>>> 	enum page_cache_mode cm;
>>>
>>> 	if (!pat_enabled())
>>> 		return false;
>>>
>>> 	cm = lookup_memtype(pfn_t_to_phys(pfn));
>>>
>>> 	return cm == _PAGE_CACHE_MODE_UC ||
>>> 	       cm == _PAGE_CACHE_MODE_UC_MINUS;
>>> }
>>>
>>> In addition, I think it's better to split this function into
>>> pat_pfn_is_uc() and pat_pfn_is_uc_minus() to avoid additional
>>> confusion.
>>
>> Why not use pat_pfn_is_uc_or_uc_minus()? :)
>
> Just in case that other places other than KVM do not need both of
> them.

I think we need not care about the future case too much; it is not ABI
anyway, i.e., it can easily be adjusted if it is really needed in the
future. But it is not a big deal...
Re: [PATCH v4 3/3] KVM: MMU: consider host cache mode in MMIO page check
On 11/03/2017 05:02 PM, Haozhong Zhang wrote:
> On 11/03/17 16:51 +0800, Haozhong Zhang wrote:
>> On 11/03/17 14:54 +0800, Xiao Guangrong wrote:
>>> On 11/03/2017 01:53 PM, Haozhong Zhang wrote:
>>>> [the kvm_is_mmio_pfn() patch, quoted in full in the direct reply
>>>> to it later in this digest]
>>>
>>> Can it be compiled if !CONFIG_PAT?
>>
>> Yes. What I check via pat_enabled() is not only whether PAT support
>> is compiled in, but also whether PAT is enabled at runtime.
>>
>>> It would be better if we move pat_enabled out of kvm as well,
>
> Surely I can combine them in one function like
>
> bool pat_pfn_is_uc(pfn_t pfn)
> {
> 	enum page_cache_mode cm;
>
> 	if (!pat_enabled())
> 		return false;
>
> 	cm = lookup_memtype(pfn_t_to_phys(pfn));
>
> 	return cm == _PAGE_CACHE_MODE_UC ||
> 	       cm == _PAGE_CACHE_MODE_UC_MINUS;
> }
>
> In addition, I think it's better to split this function into
> pat_pfn_is_uc() and pat_pfn_is_uc_minus() to avoid additional
> confusion.

Why not use pat_pfn_is_uc_or_uc_minus()? :)
Re: [PATCH v4 3/3] KVM: MMU: consider host cache mode in MMIO page check
On 11/03/2017 04:51 PM, Haozhong Zhang wrote:
> On 11/03/17 14:54 +0800, Xiao Guangrong wrote:
>> On 11/03/2017 01:53 PM, Haozhong Zhang wrote:
>>> [the kvm_is_mmio_pfn() patch, quoted in full in the direct reply
>>> to it later in this digest]
>>
>> Can it be compiled if !CONFIG_PAT?
>
> Yes. What I check via pat_enabled() is not only whether PAT support
> is compiled in, but also whether PAT is enabled at runtime.

The issue is that pat_pfn_is_uc() is implemented only if CONFIG_PAT is
enabled, but you use it here unconditionally. I am not sure gcc is
smart enough to omit the call to pat_pfn_is_uc() completely in that
case. If you have really tested that the kernel and KVM module compile
with CONFIG_PAT disabled, it is fine.
Re: [PATCH v4 3/3] KVM: MMU: consider host cache mode in MMIO page check
On 11/03/2017 01:53 PM, Haozhong Zhang wrote:
> Some reserved pages, such as those from NVDIMM DAX devices, are not
> for MMIO, and can be mapped with cached memory type for better
> performance. However, the above check misconceives those pages as
> MMIO. Because KVM maps MMIO pages with UC memory type, the
> performance of guest accesses to those pages would be harmed.
> Therefore, we check the host memory type by lookup_memtype() in
> addition and only treat UC/UC- pages as MMIO.
>
> Signed-off-by: Haozhong Zhang
> Reported-by: Cuevas Escareno, Ivan D
> Reported-by: Kumar, Karthik
> ---
>  arch/x86/kvm/mmu.c | 19 ++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 0b481cc9c725..e9ed0e666a83 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2708,7 +2708,24 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
>  static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
>  {
>  	if (pfn_valid(pfn))
> -		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
> +		return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
> +			/*
> +			 * Some reserved pages, such as those from
> +			 * NVDIMM DAX devices, are not for MMIO, and
> +			 * can be mapped with cached memory type for
> +			 * better performance. However, the above
> +			 * check misconceives those pages as MMIO.
> +			 * Because KVM maps MMIO pages with UC memory
> +			 * type, the performance of guest accesses to
> +			 * those pages would be harmed. Therefore, we
> +			 * check the host memory type in addition and
> +			 * only treat UC/UC- pages as MMIO.
> +			 *
> +			 * pat_pfn_is_uc() works only when PAT is enabled,
> +			 * so check pat_enabled() as well.
> +			 */
> +			(!pat_enabled() ||
> +			 pat_pfn_is_uc(kvm_pfn_t_to_pfn_t(pfn)));

Can it be compiled if !CONFIG_PAT?

It would be better if we moved pat_enabled out of KVM as well; please
refer to pgprot_writecombine(), which is implemented in pat.c, with
the fallback in include/asm-generic/pgtable.h:

#ifndef pgprot_writecombine
#define pgprot_writecombine pgprot_noncached
#endif
Re: [PATCH v2 2/2] KVM: MMU: consider host cache mode in MMIO page check
On 10/31/2017 07:48 PM, Haozhong Zhang wrote: Some reserved pages, such as those from NVDIMM DAX devices, are not for MMIO, and can be mapped with cached memory type for better performance. However, the above check misconceives those pages as MMIO. Because KVM maps MMIO pages with UC memory type, the performance of guest accesses to those pages would be harmed. Therefore, we check the host memory type by lookup_memtype() in addition and only treat UC/UC- pages as MMIO. Signed-off-by: Haozhong Zhang Reported-by: Cuevas Escareno, Ivan D Reported-by: Kumar, Karthik --- arch/x86/kvm/mmu.c | 30 +++--- 1 file changed, 27 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 0b481cc9c725..206828d18857 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2707,10 +2707,34 @@ static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn, static bool kvm_is_mmio_pfn(kvm_pfn_t pfn) { - if (pfn_valid(pfn)) - return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)); + bool is_mmio = true; - return true; + if (pfn_valid(pfn)) { + is_mmio = !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)); + + /* +* Some reserved pages, such as those from NVDIMM DAX +* devices, are not for MMIO, and can be mapped with +* cached memory type for better performance. However, +* the above check misconceives those pages as MMIO. +* Because KVM maps MMIO pages with UC memory type, +* the performance of guest accesses to those pages +* would be harmed. Therefore, we check the host +* memory type by lookup_memtype() in addition and +* only treat UC/UC- pages as MMIO. +* +* lookup_memtype() works only when PAT is enabled, so +* add pat_enabled() check here.
+*/ + if (is_mmio && pat_enabled()) { + enum page_cache_mode cm = lookup_memtype(PFN_PHYS(pfn)); + + is_mmio = (cm == _PAGE_CACHE_MODE_UC || + cm == _PAGE_CACHE_MODE_UC_MINUS); + } + } You can move all of this detailed stuff to pat.c and abstract it by introducing a function, maybe named pat_pfn_is_uc(). I think this is what Ingo wants.
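A standalone sketch of what such a pat_pfn_is_uc() could look like — the real helper would live in arch/x86/mm/pat.c and call the kernel's lookup_memtype(); the stubs below only model that behaviour so the logic can be exercised on its own.

```c
#include <stdbool.h>

/* Cache modes as named in the kernel discussion above. */
enum page_cache_mode {
    _PAGE_CACHE_MODE_WB,
    _PAGE_CACHE_MODE_WC,
    _PAGE_CACHE_MODE_UC_MINUS,
    _PAGE_CACHE_MODE_UC,
};

/* Mutable stubs standing in for pat_enabled() / lookup_memtype(). */
static bool pat_on = true;
static enum page_cache_mode host_memtype = _PAGE_CACHE_MODE_WB;

static bool pat_enabled(void) { return pat_on; }

static enum page_cache_mode lookup_memtype(unsigned long pfn)
{
    (void)pfn;           /* the stub ignores the pfn */
    return host_memtype;
}

/* The proposed abstraction: does the host map this pfn UC or UC-? */
static bool pat_pfn_is_uc(unsigned long pfn)
{
    enum page_cache_mode cm = lookup_memtype(pfn);

    return cm == _PAGE_CACHE_MODE_UC || cm == _PAGE_CACHE_MODE_UC_MINUS;
}

/*
 * Caller side, mirroring the patch: a reserved page is treated as MMIO
 * only when PAT gives no answer or the host really maps it UC/UC-.
 */
static bool treat_reserved_pfn_as_mmio(unsigned long pfn)
{
    return !pat_enabled() || pat_pfn_is_uc(pfn);
}
```

This keeps the memtype details in one place (pat.c) and leaves KVM with a single, readable predicate.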
Re: [PATCH 0/3] KVM: MMU: fix kvm_is_mmio_pfn()
On 10/27/2017 10:25 AM, Haozhong Zhang wrote: [I just copy the commit message from patch 3] By default, KVM treats a reserved page as for MMIO purpose, and maps it to guest with UC memory type. However, some reserved pages are not for MMIO, such as pages of DAX device (e.g., /dev/daxX.Y). Mapping them with UC memory type will harm the performance. In order to exclude those cases, we check the host cache mode in addition and only treat UC/UC- pages as MMIO. I am afraid this is not only a performance issue but also an architecture bug - it could trigger an MCE, as there is an aliased memory mapping (a page mapped as both WB and UC). It may hurt mdev devices as well, since the device memory may be mapped both in the host and in the VM.
Re: [PATCH v2 0/7] KVM: MMU: fast write protect
On 07/03/2017 11:47 PM, Paolo Bonzini wrote: On 03/07/2017 16:39, Xiao Guangrong wrote: On 06/20/2017 05:15 PM, guangrong.x...@gmail.com wrote: From: Xiao Guangrong <xiaoguangr...@tencent.com> Changelog in v2: thanks to Paolo's review, this version disables write-protect-all if PML is supported Hi Paolo, Do you have time to have a look at this new version? ;) Or should I wait until the patchset of dirty ring-buffer is merged? I will look at it soon, but I still plan to merge dirty ring buffer first. Thanks for your understanding, Sure, i fully understand; thank you for bearing with my pushing. :)
Re: [PATCH v2 0/7] KVM: MMU: fast write protect
On 06/20/2017 05:15 PM, guangrong.x...@gmail.com wrote: From: Xiao Guangrong <xiaoguangr...@tencent.com> Changelog in v2: thanks to Paolo's review, this version disables write-protect-all if PML is supported Hi Paolo, Do you have time to have a look at this new version? ;) Or should I wait until the patchset of dirty ring-buffer is merged? Thanks!
Re: [PATCH 0/7] KVM: MMU: fast write protect
On 05/30/2017 12:48 AM, Paolo Bonzini wrote: On 23/05/2017 04:23, Xiao Guangrong wrote: Ping... Sorry to disturb, just make this patchset not be missed. :) It won't. :) I'm going to look at it and the dirty page ring buffer this week. Ping.. :)
Re: [Qemu-devel] [PATCH 0/7] KVM: MMU: fast write protect
On 06/05/2017 03:36 PM, Jay Zhou wrote: /* enable ucontrol for s390 */ struct kvm_s390_ucas_mapping { diff --git a/memory.c b/memory.c index 4c95aaf..b836675 100644 --- a/memory.c +++ b/memory.c @@ -809,6 +809,13 @@ static void address_space_update_ioeventfds(AddressSpace *as) flatview_unref(view); } +static write_protect_all_fn write_func; I think there should be a declaration in memory.h, diff --git a/include/exec/memory.h b/include/exec/memory.h index 7fc3f48..31f3098 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -1152,6 +1152,9 @@ void memory_global_dirty_log_start(void); */ void memory_global_dirty_log_stop(void); +typedef void (*write_protect_all_fn)(bool write); +void memory_register_write_protect_all(write_protect_all_fn func); + void mtree_info(fprintf_function mon_printf, void *f); Thanks for your suggestion, Jay! This code just demonstrates how to enable this feature in QEMU; i will carefully consider it and merge your suggestion when the formal patch is posted.
Re: [PATCH 0/7] KVM: MMU: fast write protect
Ping... Sorry to disturb, just make this patchset not be missed. :) On 05/04/2017 03:06 PM, Paolo Bonzini wrote: On 04/05/2017 05:36, Xiao Guangrong wrote: Great. As there is no conflict between these two patchsets except dirty ring pages takes benefit from write-protect-all, i think they can be developed and iterated independently, right? I can certainly start reviewing this one. Paolo Or you prefer to merge dirty ring pages first then review the new version of this patchset later?
Re: [PATCH 2/2] KVM: nVMX: fix nEPT handling of guest page table accesses
CC Kevin, as i am not sure if Intel is aware of this issue; it breaks other hypervisors, e.g., Xen, as well. On 05/11/2017 07:23 PM, Paolo Bonzini wrote: The new ept_access_test_paddr_read_only_ad_disabled testcase caused an infinite stream of EPT violations because KVM did not find anything bad in the page tables and kept re-executing the faulting instruction. This is because the exit qualification said we were reading from the page tables, but actually the cause of the EPT violation was writing the A/D bits. This happened even with eptad=0, quite surprisingly. Thus, always treat guest page table accesses as read+write operations, even if the exit qualification says otherwise. This fixes the testcase. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx.c | 36 +++- 1 file changed, 23 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index c6f4ad44aa95..c868cbdad29a 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -6209,17 +6209,19 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu) u32 error_code; exit_qualification = vmcs_readl(EXIT_QUALIFICATION); + gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS); + trace_kvm_page_fault(gpa, exit_qualification); - if (is_guest_mode(vcpu) - && !(exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)) { - /* -* Fix up exit_qualification according to whether guest -* page table accesses are reads or writes. -*/ - u64 eptp = nested_ept_get_cr3(vcpu); - if (!(eptp & VMX_EPT_AD_ENABLE_BIT)) - exit_qualification &= ~EPT_VIOLATION_ACC_WRITE; - } + /* +* All guest page table accesses are potential writes to A/D bits, +* but EPT microcode only reports them as such when EPT A/D is +* enabled. Tracing ept_access_test_paddr_read_only_ad_disabled (from +* kvm-unit-tests) with eptad=0 and eptad=1 shows that the processor +* does not change its behavior when EPTP enables A/D bits; the only +* difference is in the exit qualification. So fix this up here.
+*/ + if (!(exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)) + exit_qualification |= EPT_VIOLATION_ACC_WRITE; /* * EPT violation happened while executing iret from NMI, @@ -6231,9 +6233,6 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu) (exit_qualification & INTR_INFO_UNBLOCK_NMI)) vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO, GUEST_INTR_STATE_NMI); - gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS); - trace_kvm_page_fault(gpa, exit_qualification); - /* Is it a read fault? */ error_code = (exit_qualification & EPT_VIOLATION_ACC_READ) ? PFERR_USER_MASK : 0; @@ -6250,6 +6249,17 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu) ? PFERR_PRESENT_MASK : 0; vcpu->arch.gpa_available = true; + + if (is_guest_mode(vcpu) + && !(exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)) { + /* +* Now fix up exit_qualification according to what the +* L1 hypervisor expects to see. +*/ + u64 eptp = nested_ept_get_cr3(vcpu); + if (!(eptp & VMX_EPT_AD_ENABLE_BIT)) + exit_qualification &= ~EPT_VIOLATION_ACC_WRITE; + } I am not sure if this is really needed; it (PFEC.W = 0 when A/D bits need to be set on paging structures) is not what we expect. Maybe always reporting the right behavior is better? Especially since Intel may fix its microcode, as this hurts the newest CPUs as well. Thanks!
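The two fix-ups being debated here can be modelled outside the kernel as follows. The bit positions match the EPT-violation exit-qualification layout used by the kernel (read = bit 0, write = bit 1, "guest linear address translated" = bit 8); the helper names are illustrative, not KVM functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* EPT-violation exit-qualification bits, per the SDM / kernel headers. */
#define EPT_VIOLATION_ACC_READ       (1ull << 0)
#define EPT_VIOLATION_ACC_WRITE      (1ull << 1)
#define EPT_VIOLATION_GVA_TRANSLATED (1ull << 8)

/*
 * First fix-up, applied on every EPT violation: if GVA_TRANSLATED is
 * clear, the access touched guest paging structures, and any such access
 * is a potential A/D-bit write - so force the write bit on, regardless
 * of what the hardware reported.
 */
static uint64_t fixup_for_kvm(uint64_t qual)
{
    if (!(qual & EPT_VIOLATION_GVA_TRANSLATED))
        qual |= EPT_VIOLATION_ACC_WRITE;
    return qual;
}

/*
 * Second fix-up, the one being questioned above: before reflecting the
 * exit to a nested (L1) hypervisor whose EPTP has A/D disabled, hide the
 * write bit again so L1 sees the exit qualification it expects.
 */
static uint64_t fixup_for_l1(uint64_t qual, bool l1_eptad_enabled)
{
    if (!(qual & EPT_VIOLATION_GVA_TRANSLATED) && !l1_eptad_enabled)
        qual &= ~EPT_VIOLATION_ACC_WRITE;
    return qual;
}
```

The reviewer's objection amounts to asking whether `fixup_for_l1()` should exist at all, i.e. whether L1 should simply be shown the corrected (write-set) qualification.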
Re: [PATCH 1/2] KVM: nVMX: fix EPT permissions as reported in exit qualification
On 05/12/2017 11:59 AM, Xiao Guangrong wrote: error: @@ -452,7 +459,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, */ if (!(errcode & PFERR_RSVD_MASK)) { vcpu->arch.exit_qualification &= 0x187; -vcpu->arch.exit_qualification |= ((pt_access & pte) & 0x7) << 3; ^ here, the original code is buggy as pt_access and pte have different bit order; fortunately, this patch fixes it too. :) It's exactly what this patch aims at; i do not know what i was thinking when i wrote that down. :(
Re: [PATCH 1/2] KVM: nVMX: fix EPT permissions as reported in exit qualification
ic int FNAME(walk_addr_generic)(struct guest_walker *walker, goto retry_walk; } - walker->pt_access = pt_access; - walker->pte_access = pte_access; pgprintk("%s: pte %llx pte_access %x pt_access %x\n", -__func__, (u64)pte, pte_access, pt_access); +__func__, (u64)pte, walker->pte_access, walker->pt_access); return 1; error: @@ -452,7 +459,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, */ if (!(errcode & PFERR_RSVD_MASK)) { vcpu->arch.exit_qualification &= 0x187; - vcpu->arch.exit_qualification |= ((pt_access & pte) & 0x7) << 3; ^ here, the original code is buggy as pt_access and pte have different bit order, fortunately, this patch fixes it too. :) Otherwise it looks good to me, thanks for your fix. Reviewed-by: Xiao Guangrong <xiaoguangr...@tencent.com>
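The bit-order mismatch called out above can be made concrete with a small sketch: KVM's internal ACC_* masks order the permissions X, W, U (with the "user" bit doubling as read permission for EPT), while EPT page-table entries and exit-qualification bits 3-5 order them R, W, X. So the plain shift `((pt_access & pte) & 0x7) << 3` lands only the write bit correctly; read and exec end up swapped. The constants below follow KVM's definitions, but this is an illustration of the mismatch, not the actual patch.

```c
#include <stdint.h>

/* KVM's internal access bits (as in arch/x86/kvm/mmu.h): order X, W, U. */
#define ACC_EXEC_MASK  (1u << 0)
#define ACC_WRITE_MASK (1u << 1)
#define ACC_USER_MASK  (1u << 2)   /* acts as "readable" for EPT gptes */

/* EPT-violation exit qualification, bits 3-5: order R, W, X. */
#define EPT_QUAL_READABLE   (1u << 3)
#define EPT_QUAL_WRITABLE   (1u << 4)
#define EPT_QUAL_EXECUTABLE (1u << 5)

/*
 * A per-bit translation is needed instead of a plain shift: the two
 * layouts agree only on the write bit's relative position.
 */
static uint32_t acc_to_ept_qual(uint32_t acc)
{
    uint32_t qual = 0;

    if (acc & ACC_USER_MASK)
        qual |= EPT_QUAL_READABLE;
    if (acc & ACC_WRITE_MASK)
        qual |= EPT_QUAL_WRITABLE;
    if (acc & ACC_EXEC_MASK)
        qual |= EPT_QUAL_EXECUTABLE;
    return qual;
}
```

This is why the buggy line "accidentally worked" for write faults but reported read-only and exec-only permissions backwards.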
Re: [PATCH 0/7] KVM: MMU: fast write protect
On 05/03/2017 10:57 PM, Paolo Bonzini wrote: On 03/05/2017 16:50, Xiao Guangrong wrote: Furthermore, userspace has no knowledge about whether PML is enabled (it can be queried from sysfs, but that is not a good way in QEMU), so it is difficult for userspace to know when to use write-protect-all. Maybe we can make KVM_CAP_X86_WRITE_PROTECT_ALL_MEM return false if PML is enabled? Yes, that's a good idea. Though it's a pity that, with PML, setting the dirty bit will still do the massive walk of the rmap. At least with reset_dirty_pages it's done a little bit at a time. Also, I wonder how the alternative write protection mechanism would affect performance of the dirty page ring buffer patches. You would do the write protection of all memory at the end of kvm_vm_ioctl_reset_dirty_pages. You wouldn't even need a separate ioctl, which is nice. On the other hand, checkpoints would be more frequent and most pages would be write-protected, so it would be more expensive to rebuild the shadow page tables... Yup, write-protect-all can improve reset_dirty_pages indeed; i will apply your idea after reset_dirty_pages is merged. However, we still prefer to have a separate ioctl for write-protect-all which cooperates with KVM_GET_DIRTY_LOG to improve live migration, which should not always depend on checkpointing. Ok, I plan to merge the dirty ring pages early in 4.13 development. Great. As there is no conflict between these two patchsets, except that dirty ring pages benefit from write-protect-all, i think they can be developed and iterated independently, right? Or do you prefer to merge dirty ring pages first and then review the new version of this patchset later? Thanks!
Re: [PATCH 0/7] KVM: MMU: fast write protect
On 05/03/2017 08:28 PM, Paolo Bonzini wrote: So if I understand correctly this relies on userspace doing: 1) KVM_GET_DIRTY_LOG without write protect 2) KVM_WRITE_PROTECT_ALL_MEM Writes may happen between 1 and 2; they are not represented in the live dirty bitmap but it's okay because they are in the snapshot and will only be used after 2. This is similar to what the dirty page ring buffer patches do; in fact, the KVM_WRITE_PROTECT_ALL_MEM ioctl is very similar to KVM_RESET_DIRTY_PAGES in those patches. You are right. After 1) and 2), a page which has been modified is either in the bitmap returned to userspace or in the bitmap of the memslot, i.e., there is no dirty page lost. On 03/05/2017 12:52, guangrong.x...@gmail.com wrote: Comparing with the ordinary algorithm which write protects last level sptes based on the rmap one by one, it just simply updates the generation number to ask all vCPUs to reload their root page table; particularly, it can be done out of mmu-lock, so that it does not hurt vMMU's parallelism. This is clever. For processors that have PML, write protecting is only done on large pages and only for splitting purposes; not for dirty page tracking process at 4k granularity. In this case, I think that you should do nothing in the new write-protect-all ioctl? Good point, thanks for pointing it out. Doing nothing in write-protect-all() is not acceptable as it breaks its semantics. :( Furthermore, userspace has no knowledge about whether PML is enabled (it can be queried from sysfs, but that is not a good way in QEMU), so it is difficult for userspace to know when to use write-protect-all. Maybe we can make KVM_CAP_X86_WRITE_PROTECT_ALL_MEM return false if PML is enabled? Also, I wonder how the alternative write protection mechanism would affect performance of the dirty page ring buffer patches. You would do the write protection of all memory at the end of kvm_vm_ioctl_reset_dirty_pages. You wouldn't even need a separate ioctl, which is nice.
On the other hand, checkpoints would be more frequent and most pages would be write-protected, so it would be more expensive to rebuild the shadow page tables... Yup, write-protect-all can improve reset_dirty_pages indeed; i will apply your idea after reset_dirty_pages is merged. However, we still prefer to have a separate ioctl for write-protect-all which cooperates with KVM_GET_DIRTY_LOG to improve live migration, which should not always depend on checkpointing. Thanks, Paolo @@ -490,6 +511,7 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml, memset(d.dirty_bitmap, 0, allocated_size); d.slot = mem->slot | (kml->as_id << 16); +d.flags = kvm_write_protect_all ? KVM_DIRTY_LOG_WITHOUT_WRITE_PROTECT : 0; if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) { DPRINTF("ioctl failed %d\n", errno); ret = -1; How would this work when kvm_physical_sync_dirty_bitmap is called from memory_region_sync_dirty_bitmap rather than memory_region_global_dirty_log_sync? You are right, we did not consider the full cases carefully, will fix it when pushing it to QEMU formally. Thank you, Paolo!
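The two-step protocol discussed in this thread — snapshot the dirty log without write protection, then write-protect everything — can be modelled with a toy simulation. Nothing below is a real KVM interface; it only demonstrates why a write landing between the two steps is not lost: it re-dirties the live bitmap and shows up in the next KVM_GET_DIRTY_LOG round.

```c
#include <stdbool.h>
#include <string.h>

#define PAGES 8

/* Toy model of KVM's per-memslot state. */
static bool live_dirty[PAGES];   /* the live dirty bitmap */
static bool write_protected;     /* set by the proposed ioctl */

/* A guest write: with protection on, the fault handler would mark the
 * page dirty and lift the protection first - the net effect on the
 * bitmap is the same either way. */
static void guest_write(int page)
{
    live_dirty[page] = true;
}

/* Step 1: KVM_GET_DIRTY_LOG without write protection - snapshot the
 * live bitmap for userspace and clear it, but protect nothing. */
static void get_dirty_log(bool snapshot[PAGES])
{
    memcpy(snapshot, live_dirty, sizeof(live_dirty));
    memset(live_dirty, 0, sizeof(live_dirty));
}

/* Step 2: the proposed KVM_WRITE_PROTECT_ALL_MEM - flip every page to
 * read-only (here just a flag; the real patchset bumps a generation
 * number so vCPUs reload their root page table, outside mmu-lock). */
static void write_protect_all(void)
{
    write_protected = true;
}
```

A write racing between step 1 and step 2 misses the snapshot but stays in the live bitmap, so the next sync still reports it — exactly the "no dirty page lost" argument made above.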
Re: [PATCH] x86, kvm: Handle PFNs outside of kernel reach when touching GPTEs
On 04/12/2017 09:16 PM, Sironi, Filippo wrote: Thanks for taking the time and sorry for the delay. On 6. Apr 2017, at 16:22, Radim Krčmář wrote: 2017-04-05 15:07+0200, Filippo Sironi: cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of pages and the respective struct pages for mapping in the kernel virtual address space. This doesn't work if get_user_pages_fast() is invoked with a userspace virtual address that's backed by PFNs outside of kernel reach (e.g., when limiting the kernel memory with mem= in the command line and using /dev/mem to map memory). If get_user_pages_fast() fails, look up the VMA that backs the userspace virtual address, compute the PFN and the physical address, and map it in the kernel virtual address space with memremap(). What is the reason for a configuration that voluntarily restricts access to memory that it needs? By using /dev/mem to provide VM memory, one can avoid the overhead of allocating struct page(s) for the whole memory, which is wasteful when using a server entirely for hosting VMs. Sounds reasonable; however, it is incomplete so far, as some code paths still do not support non-page-backed memory, e.g., emulator_cmpxchg_emulated(). I would suggest unifying the code introduced in this patch with the existing hva_to_pfn(); also, we can introduce a common API, maybe named kvm_map_hva(), to improve the caller sides. BTW, i do not know why we used kmap_atomic() rather than kmap() - the path of cmpxchg_gpte() is sleepable anyway. Thanks!
Re: [PATCH 11/16] fpga: intel: fme: add partial reconfiguration sub feature support
On 31/03/2017 4:30 AM, Alan Tull wrote: On Thu, Mar 30, 2017 at 7:08 AM, Wu Hao wrote: From: Kang Luwei

Partial Reconfiguration (PR) is the most important function for FME. It allows reconfiguration of a given Port/Accelerated Function Unit (AFU). This patch adds support for the PR sub feature. In this patch, it registers an fpga_mgr, implements fpga_manager_ops, and invokes fpga_mgr_buf_load for the PR operation once a PR request is received via ioctl. The user-space interfaces below are exposed by this sub feature.

Sysfs interface:
* /sys/class/fpga///interface_id
Read-only. Indicates the hardware interface information. Userspace applications need to check this interface to select the correct green bitstream format before PR.

Ioctl interface:
* FPGA_FME_PORT_PR
Does partial reconfiguration per information from userspace, including the target port (AFU), buffer size and address info. It returns the PR status (a PR error code if it failed) to userspace.

Signed-off-by: Tim Whisonant
Signed-off-by: Enno Luebbers
Signed-off-by: Shiva Rao
Signed-off-by: Christopher Rauer
Signed-off-by: Alan Tull

Hi Wu Hao, Thanks for submitting your patches. I think there's been a misunderstanding of the meaning of 'Signed-off-by' [1]. I have not signed off on this code or had a hand in its development. But I'm happy to get to review it now. It will take a bit of time; I expect to be replying next week.

Hi Alan, Sorry to confuse you; I think it's because you helped Chris a lot in implementing this interface, and we'd like to include your credit this way. If you dislike it, it will be dropped. :) Thanks for your review in advance.
Re: [PATCH v2] KVM: x86: cleanup the page tracking SRCU instance
On 28/03/2017 5:40 PM, Paolo Bonzini wrote:

 void kvm_page_track_init(struct kvm *kvm)
 {
 	struct kvm_page_track_notifier_head *head;

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 64697fe475c3..f2379673912a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8070,6 +8070,7 @@ void kvm_arch_sync_events(struct kvm *kvm)
 	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
 	kvm_free_all_assigned_devices(kvm);
 	kvm_free_pit(kvm);
+	kvm_page_track_cleanup(kvm);
 }

Moving it to kvm_arch_destroy_vm() is better, as the init function is called in kvm_arch_init_vm(). Otherwise it looks great to me:

Reviewed-by: Xiao Guangrong <xiaoguangrong.e...@gmail.com>

Thanks for the fix.
Re: [PATCH v2] mm, proc: Fix region lost in /proc/self/smaps
On 09/14/2016 11:38 PM, Oleg Nesterov wrote: On 09/13, Dave Hansen wrote: On 09/13/2016 07:59 AM, Oleg Nesterov wrote: I agree. I don't even understand why this was considered a bug. Obviously, m_stop() which drops mmap_sem should not be called, or all the threads should be stopped, if you want to trust the result. There was a mapping at a given address. That mapping did not change, it was not split, its attributes did not change. But, it didn't show up when reading smaps. Folks _actually_ noticed this in a test suite looking for that address range in smaps. I understand, and I won't argue with any change which makes things better. Just I do not think this is a real problem. And this patch can't fix other oddities and it seems it adds another one (at least), although I can easily misread this patch and/or the code. So we change m_cache_vma(),

-	m->version = m_next_vma(m->private, vma) ? vma->vm_start : -1UL;
+	m->version = m_next_vma(m->private, vma) ? vma->vm_end : -1UL;

OK, and another change in m_start()

-	if (vma && (vma = m_next_vma(priv, vma)))
+	if (vma)

means that it can return the same vma if it grows in between. show_map_vma() has another change

+	start = max(vma->vm_start, start);

so it will be reported as _another_ vma, and this doesn't look exactly right. We noticed it in the discussion of v1; however, it is not bad, as Dave said it is about an 'address range' rather than a VMA. And after that *ppos will be falsely incremented... but probably this doesn't matter because the "if (pos < mm->map_count)" logic in m_start() looks broken anyway. The 'broken' can happen only if it is not the first read and m->version is zero (*ppos != 0 && m->version == 0). If I understand the code correctly, only an overflow of the seq_file buffer can trigger this; for smaps, each VMA only uses ~1k of the buffer, which means this could not happen. Right?
Re: [PATCH v2] mm, proc: Fix region lost in /proc/self/smaps
On 09/13/2016 03:10 AM, Michal Hocko wrote: On Mon 12-09-16 08:01:06, Dave Hansen wrote: On 09/12/2016 05:54 AM, Michal Hocko wrote: In order to fix this bug, we make 'file->version' indicate the end address of the current VMA. Doesn't this open the door to other weird cases? Say B would be partially unmapped (the tail of the VMA would get unmapped and reused for a new VMA). In the end, this interface isn't about VMAs. It's about addresses, and we need to make sure that the _addresses_ coming out of it are sane. In the case that a VMA was partially unmapped, it doesn't make sense to show the "new" VMA because we already had some output covering the address of the "new" VMA from the old one. OK, that is a fair point and it speaks for caching the vm_end rather than vm_start+skip. I am not sure we provide any guarantee when there are more read syscalls. Hmm, even with a single read() we can get inconsistent results from different threads without any user space synchronization. Yeah, very true. But, I think we _can_ at least provide the following guarantees (among others): 1. addresses don't go backwards 2. If there is something at a given vaddr during the entirety of the life of the smaps walk, we will produce some output for it. I guess we also want 3. no overlaps with previously printed values (assuming two subsequent reads without seek). The patch tries to achieve the last part as well AFAICS, but I guess this is incomplete because at least /proc//smaps will report counters for the full vma range while the header (aka show_map_vma) will report a shorter (non-overlapping) range. I haven't checked other files which use m_{start,next}. You are right. Will fix both /proc/PID/smaps and /proc/PID/maps in the next version. Considering how this all can be tricky and how partial reads can be confusing and even misleading, I am really wondering whether we should simply document that only full reads will provide sensible results. Makes sense.
Will document the guarantee in Documentation/filesystems/proc.txt. Thank you, Dave and Michal, for figuring out the right direction. :)
Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
On 09/12/2016 11:44 AM, Rudoff, Andy wrote: Whether msync/fsync can make data persistent depends on the ADR feature of the memory controller; if it exists everything works well; otherwise, we need another interface, which is why the 'Flush hint table' in ACPI comes in. The 'Flush hint table' is particularly useful for nvdimm virtualization if we use normal memory to emulate nvdimm with data-persistence characteristics (the data will be flushed to persistent storage, e.g., disk). Does the current PMEM programming model fully support the 'Flush hint table'? Is userspace allowed to use these addresses? The Flush hint table is NOT a replacement for ADR. To support pmem on the x86 architecture, the platform is required to ensure that a pmem store flushed from the CPU caches is in the persistent domain so that the application need not take any additional steps to make it persistent. The most common way to do this is the ADR feature. If the above is not true, then your x86 platform does not support pmem. Understood. However, virtualization is a special case as we can use normal memory to emulate NVDIMM for the VM so that the VM can bypass the local file-cache, reduce memory usage and the I/O path, etc. Currently, this usage is useful for lightweight virtualization, such as clean containers. In this case, ADR is available on the physical platform but it cannot help us make data persistent for the VM. So virtualizing the 'flush hint table' is a good way to handle it, based on the ACPI spec:

| software can write to any one of these Flush Hint Addresses to
| cause any preceding writes to the NVDIMM region to be flushed
| out of the intervening platform buffers to the targeted NVDIMM
| (to achieve durability)

Flush hints are for use by the BIOS and drivers and are not intended to be used in user space.
Flush hints provide two things: First, if a driver needs to write to command registers or movable windows on a DIMM, the Flush hint (if provided in the NFIT) is required to flush the command to the DIMM or ensure stores done through the movable window are complete before moving it somewhere else. Second, for the rare case where the kernel wants to flush stores to the smallest possible failure domain (i.e. to the DIMM even though ADR will handle flushing it from a larger domain), the flush hints provide a way to do this. This might be useful for things like file system journals to help ensure the file system is consistent even in the face of ADR failure. We are assuming ADR can fail; however, do we have a way to know whether ADR works correctly? Maybe MCE can report it? This is necessary to support making data persistent without fsync/msync in userspace. Or do we need to unconditionally use the 'flush hint address' if it is available, as the current nvdimm driver does? Thanks!
Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
On 09/09/2016 11:40 PM, Dan Williams wrote: On Fri, Sep 9, 2016 at 1:55 AM, Xiao Guangrong <guangrong.x...@linux.intel.com> wrote: [..] Whether a persistent memory mapping requires an msync/fsync is a filesystem specific question. This mincore proposal is separate from that. Consider device-DAX for volatile memory or mincore() called on an anonymous memory range. In those cases persistence and filesystem metadata are not in the picture, but it would still be useful for userspace to know "is there page cache backing this mapping?" or "what is the TLB geometry of this mapping?". I got a question about msync/fsync which is beyond the topic of this thread :) Whether msync/fsync can make data persistent depends on the ADR feature of the memory controller; if it exists everything works well; otherwise, we need another interface, which is why the 'Flush hint table' in ACPI comes in. The 'Flush hint table' is particularly useful for nvdimm virtualization if we use normal memory to emulate nvdimm with data-persistence characteristics (the data will be flushed to persistent storage, e.g., disk). Does the current PMEM programming model fully support the 'Flush hint table'? Is userspace allowed to use these addresses? If you publish flush hint addresses in the virtual NFIT, the guest VM will write to them whenever a REQ_FLUSH or REQ_FUA request is sent to the virtual /dev/pmemX device. Yes, it seems straightforward to take a VM exit on those events and flush simulated pmem to persistent storage. Thank you, Dan! However, REQ_FLUSH or REQ_FUA is handled in kernel space. Okay, after following the discussion in this thread, I understood that currently filesystems do not support the case where userspace itself makes data persistent without the kernel's involvement. So that works. Hmm, does device-DAX support this case (making data persistent without msync/fsync)? I guess no, but just want to confirm it. :)
[PATCH v2] mm, proc: Fix region lost in /proc/self/smaps
hreads return the same value */
for (int i = 1; i < NTHREAD; i++) {
	if (ret[0] != ret[i]) {
		printf("Error i %d ret[0] = %d ret[i] = %d.\n", i, ret[0], ret[i]);
	}
}
printf("%d", ret[0]);
return 0;
}

# dd if=/dev/zero of=~/out bs=2M count=1
# ./testcase ~/out

It failed as some threads cannot find the memory region in "/proc/self/smaps" which is allocated in the main process. It is caused by proc fs, which uses 'file->version' to record the last VMA that has already been handled by the read() system call. When the next read() is issued, it uses the 'version' to find that VMA, and the VMA after it is the next one to handle; the related code is as follows:

	if (last_addr) {
		vma = find_vma(mm, last_addr);
		if (vma && (vma = m_next_vma(priv, vma)))
			return vma;
	}

However, a VMA will be lost if the last VMA is gone, e.g.: the process VMA list is A->B->C->D.

CPU 0					CPU 1
read() system call
  handle VMA B
  version = B
  return to userspace
					unmap VMA B
issue read() again to continue
to get the region info
  find_vma(version) will get VMA C
  m_next_vma(C) will get VMA D
  handle D

!!! VMA C is lost !!!
In order to fix this bug, we make 'file->version' indicate the end address of the current VMA.

Changelog: Thanks to Dave Hansen's comments, this version fixes the issue in v1 that the same virtual address range may be output twice, e.g.:

Take two example VMAs:
vma-A: (0x1000 -> 0x2000)
vma-B: (0x2000 -> 0x3000)

read() #1: prints vma-A, sets m->version=0x2000

Now, merge A/B to make C:
vma-C: (0x1000 -> 0x3000)

read() #2: find_vma(m->version=0x2000), returns vma-C, prints vma-C

The user will see two VMAs in their output:
A: 0x1000->0x2000
C: 0x1000->0x3000

Acked-by: Dave Hansen <dave.han...@intel.com>
Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
---
 fs/proc/task_mmu.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 187d84e..10ca648 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -147,7 +147,7 @@ m_next_vma(struct proc_maps_private *priv, struct vm_area_struct *vma)
 static void m_cache_vma(struct seq_file *m, struct vm_area_struct *vma)
 {
 	if (m->count < m->size) /* vma is copied successfully */
-		m->version = m_next_vma(m->private, vma) ? vma->vm_start : -1UL;
+		m->version = m_next_vma(m->private, vma) ? vma->vm_end : -1UL;
 }
 
 static void *m_start(struct seq_file *m, loff_t *ppos)
@@ -176,14 +176,14 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 
 	if (last_addr) {
 		vma = find_vma(mm, last_addr);
-		if (vma && (vma = m_next_vma(priv, vma)))
+		if (vma)
 			return vma;
 	}
 
 	m->version = 0;
 	if (pos < mm->map_count) {
 		for (vma = mm->mmap; pos; pos--) {
-			m->version = vma->vm_start;
+			m->version = vma->vm_end;
 			vma = vma->vm_next;
 		}
 		return vma;
@@ -293,7 +293,7 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
 	vm_flags_t flags = vma->vm_flags;
 	unsigned long ino = 0;
 	unsigned long long pgoff = 0;
-	unsigned long start, end;
+	unsigned long end, start = m->version;
 	dev_t dev = 0;
 	const char *name = NULL;
@@ -304,8 +304,13 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
 		pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
 	}
 
+	/*
+	 * the region [0, m->version) has already been handled, do not
+	 * handle it doubly.
+	 */
+	start = max(vma->vm_start, start);
+
 	/* We don't show the stack guard page in /proc/maps */
-	start = vma->vm_start;
 	if (stack_guard_page_start(vma, start))
 		start += PAGE_SIZE;
 	end = vma->vm_end;
-- 
1.8.3.1
Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)
On 09/09/2016 07:04 AM, Dan Williams wrote: On Thu, Sep 8, 2016 at 3:56 PM, Ross Zwisler <ross.zwis...@linux.intel.com> wrote: On Wed, Sep 07, 2016 at 09:32:36PM -0700, Dan Williams wrote: [ adding linux-fsdevel and linux-nvdimm ] On Wed, Sep 7, 2016 at 8:36 PM, Xiao Guangrong <guangrong.x...@linux.intel.com> wrote: [..] However, it is not easy to handle the case that the new VMA overlays with the old VMA already got by userspace. I think we have some choices: 1: One way is completely skipping the new VMA region as current kernel code does but i do not think this is good as the later VMAs will be dropped. 2: show the un-overlayed portion of new VMA. In your case, we just show the region (0x2000 -> 0x3000), however, it can not work well if the VMA is a new created region with different attributions. 3: completely show the new VMA as this patch does. Which one do you prefer? I don't have a preference, but perhaps this breakage and uncertainty is a good opportunity to propose a more reliable interface for NVML to get the information it needs? My understanding is that it is looking for the VM_MIXEDMAP flag which is already ambiguous for determining if DAX is enabled even if this dynamic listing issue is fixed. XFS has arranged for DAX to be a per-inode capability and has an XFS-specific inode flag. We can make that a common inode flag, but it seems we should have a way to interrogate the mapping itself in the case where the inode is unknown or unavailable. I'm thinking extensions to mincore to have flags for DAX and possibly whether the page is part of a pte, pmd, or pud mapping. Just floating that idea before starting to look into the implementation, comments or other ideas welcome... I think this goes back to our previous discussion about support for the PMEM programming model. Really I think what NVML needs isn't a way to tell if it is getting a DAX mapping, but whether it is getting a DAX mapping on a filesystem that fully supports the PMEM programming model. 
This of course is defined to be a filesystem where it can do all of its flushes from userspace safely and never call fsync/msync, and where allocations that happen in page faults are synchronized to media before the page fault completes. IIUC this is what NVML needs - a way to decide "do I use fsync/msync for everything or can I rely fully on flushes from userspace?" For all existing implementations, I think the answer is "you need to use fsync/msync" because we don't yet have proper support for the PMEM programming model. My best idea of how to support this was a per-inode flag similar to the one supported by XFS that says "you have a PMEM capable DAX mapping", which NVML would then interpret to mean "you can do flushes from userspace and be fully safe". I think we really want this interface to be common over XFS and ext4. If we can figure out a better way of doing this interface, say via mincore, that's fine, but I don't think we can detangle this from the PMEM API discussion. Whether a persistent memory mapping requires an msync/fsync is a filesystem-specific question. This mincore proposal is separate from that. Consider device-DAX for volatile memory or mincore() called on an anonymous memory range. In those cases persistence and filesystem metadata are not in the picture, but it would still be useful for userspace to know "is there page cache backing this mapping?" or "what is the TLB geometry of this mapping?". I have a question about msync/fsync which is beyond the topic of this thread :) Whether msync/fsync can make data persistent depends on the ADR feature of the memory controller: if it exists, everything works well; otherwise, we need another interface, which is why the 'Flush hint table' in ACPI comes in. The 'Flush hint table' is particularly useful for nvdimm virtualization if we use normal memory to emulate nvdimm with data-persistence characteristics (the data will be flushed to persistent storage, e.g., a disk). 
Does the current PMEM programming model fully support the 'Flush hint table'? Is userspace allowed to use these addresses? Thanks!
Re: [PATCH] Fix region lost in /proc/self/smaps
On 09/08/2016 10:05 PM, Dave Hansen wrote: On 09/07/2016 08:36 PM, Xiao Guangrong wrote: >> The user will see two VMAs in their output: A: 0x1000->0x2000 C: 0x1000->0x3000 Will it confuse them to see the same virtual address range twice? Or is there something preventing that happening that I'm missing? You are right. Nothing can prevent it. However, it is not easy to handle the case where the new VMA overlaps the old VMA already seen by userspace. I think we have some choices: 1: One way is completely skipping the new VMA region as the current kernel code does, but I do not think this is good, as the later VMAs will be dropped. 2: Show the un-overlapped portion of the new VMA. In your case, we would just show the region (0x2000 -> 0x3000); however, it cannot work well if the VMA is a newly created region with different attributes. 3: Completely show the new VMA, as this patch does. Which one do you prefer? I'd be willing to bet that #3 will break *somebody's* tooling. Addresses going backwards is certainly screwy. Imagine somebody using smaps to search for address holes and doing hole_size=0x1000-0x2000. #1 lies about there being no mapping where there may have _always_ been a mapping and is very similar to the bug you were originally fixing. I think that throws it out. #2 is our best bet, I think. It's unfortunately also the most code. It's also a bit of a fib because it'll show a mapping that never actually existed, but I think this is OK. I'm not sure what the downside is that you're referring to, though. Can you explain? Yes. I was talking about the case as follows: 1: read() #1: prints vma-A (0x1000 -> 0x2000) 2: unmap vma-A (0x1000 -> 0x2000) 3: create vma-B (0x80 -> 0x3000) on another file with different permissions (w, r, x) 4: read() #2: prints vma-B (0x2000 -> 0x3000) Then userspace will get just a portion of vma-B. Well, maybe it is not too bad. 
:) How about these changes: diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 187d84e..10ca648 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -147,7 +147,7 @@ m_next_vma(struct proc_maps_private *priv, struct vm_area_struct *vma) static void m_cache_vma(struct seq_file *m, struct vm_area_struct *vma) { if (m->count < m->size) /* vma is copied successfully */ - m->version = m_next_vma(m->private, vma) ? vma->vm_start : -1UL; + m->version = m_next_vma(m->private, vma) ? vma->vm_end : -1UL; } static void *m_start(struct seq_file *m, loff_t *ppos) @@ -176,14 +176,14 @@ static void *m_start(struct seq_file *m, loff_t *ppos) if (last_addr) { vma = find_vma(mm, last_addr); - if (vma && (vma = m_next_vma(priv, vma))) + if (vma) return vma; } m->version = 0; if (pos < mm->map_count) { for (vma = mm->mmap; pos; pos--) { - m->version = vma->vm_start; + m->version = vma->vm_end; vma = vma->vm_next; } return vma; @@ -293,7 +293,7 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid) vm_flags_t flags = vma->vm_flags; unsigned long ino = 0; unsigned long long pgoff = 0; - unsigned long start, end; + unsigned long end, start = m->version; dev_t dev = 0; const char *name = NULL; @@ -304,8 +304,13 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid) pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT; } + /* + * the region [0, m->version) has already been handled, do not + * handle it doubly. + */ + start = max(vma->vm_start, start); + /* We don't show the stack guard page in /proc/maps */ - start = vma->vm_start; if (stack_guard_page_start(vma, start)) start += PAGE_SIZE; end = vma->vm_end;
Re: [PATCH] Fix region lost in /proc/self/smaps
On 09/08/2016 12:34 AM, Dave Hansen wrote: On 09/06/2016 11:51 PM, Xiao Guangrong wrote: In order to fix this bug, we make 'file->version' indicate the next VMA we want to handle. This new approach makes it more likely that we'll skip a new VMA that gets inserted in between the read()s. But, I guess that's OK. We don't exactly claim to be giving super up-to-date data at the time of read(). Yes, I completely agree with you. :) With the old code, was there also a case that we could print out the same virtual address range more than once? It seems like that could happen if we had a VMA split between two reads. Yes. I think this introduces one oddity: if you have a VMA merge between two reads(), you might get the same virtual address range twice in your output. This didn't happen before because we would have just skipped over the area that got merged. Take two example VMAs: vma-A: (0x1000 -> 0x2000) vma-B: (0x2000 -> 0x3000) read() #1: prints vma-A, sets m->version=0x2000 Now, merge A/B to make C: vma-C: (0x1000 -> 0x3000) read() #2: find_vma(m->version=0x2000), returns vma-C, prints vma-C The user will see two VMAs in their output: A: 0x1000->0x2000 C: 0x1000->0x3000 Will it confuse them to see the same virtual address range twice? Or is there something preventing that happening that I'm missing? You are right. Nothing can prevent it. However, it is not easy to handle the case where the new VMA overlaps the old VMA already seen by userspace. I think we have some choices: 1: One way is completely skipping the new VMA region as the current kernel code does, but I do not think this is good, as the later VMAs will be dropped. 2: Show the un-overlapped portion of the new VMA. In your case, we would just show the region (0x2000 -> 0x3000); however, it cannot work well if the VMA is a newly created region with different attributes. 3: Completely show the new VMA, as this patch does. Which one do you prefer? Thanks!
Re: [PATCH] Fix region lost in /proc/self/smaps
Sorry, the title should be [PATCH] mm, proc: Fix region lost in /proc/self/smaps On 09/07/2016 02:51 PM, Xiao Guangrong wrote: Recently, Red Hat reported that the nvml test suite failed on QEMU/KVM; for more detailed info please refer to: https://bugzilla.redhat.com/show_bug.cgi?id=1365721 Actually, this bug is not only for NVDIMM/DAX but also for any other file system. This simple test case, abstracted from nvml, can easily reproduce the bug in a common environment: -- testcase.c - #include <stdio.h> #include <string.h> #include <fcntl.h> #include <unistd.h> #include <pthread.h> #include <sys/stat.h> #include <sys/mman.h> #define PROCMAXLEN 4096 int is_pmem_proc(const void *addr, size_t len) { const char *caddr = addr; FILE *fp; if ((fp = fopen("/proc/self/smaps", "r")) == NULL) { printf("!/proc/self/smaps"); return 0; } int retval = 0; /* assume false until proven otherwise */ char line[PROCMAXLEN]; /* for fgets() */ char *lo = NULL; /* beginning of current range in smaps file */ char *hi = NULL; /* end of current range in smaps file */ int needmm = 0; /* looking for mm flag for current range */ while (fgets(line, PROCMAXLEN, fp) != NULL) { static const char vmflags[] = "VmFlags:"; static const char mm[] = " wr"; /* check for range line */ if (sscanf(line, "%p-%p", &lo, &hi) == 2) { if (needmm) { /* last range matched, but no mm flag found */ printf("never found mm flag.\n"); break; } else if (caddr < lo) { /* never found the range for caddr */ printf("###no match for addr %p.\n", caddr); break; } else if (caddr < hi) { /* start address is in this range */ size_t rangelen = (size_t)(hi - caddr); /* remember that matching has started */ needmm = 1; /* calculate remaining range to search for */ if (len > rangelen) { len -= rangelen; caddr += rangelen; printf("matched %zu bytes in range " "%p-%p, %zu left over.\n", rangelen, lo, hi, len); } else { len = 0; printf("matched all bytes in range " "%p-%p.\n", lo, hi); } } } else if (needmm && strncmp(line, vmflags, sizeof(vmflags) - 1) == 0) { if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) { printf("mm flag found.\n"); if (len == 0) { /* entire range matched */ retval = 1; break; 
} needmm = 0; /* saw what was needed */ } else { /* mm flag not set for some or all of range */ printf("range has no mm flag.\n"); break; } } } fclose(fp); printf("returning %d.\n", retval); return retval; } #define NTHREAD 16 void *Addr; size_t Size; /* * worker -- the work each thread performs */ static void * worker(void *arg) { int *ret = (int *)arg; *ret = is_pmem_proc(Addr, Size); return NULL; } int main(int argc, char *argv[]) { if (argc < 2 || argc > 3) { printf("usage: %s file [env].\n", argv[0]); return -1; } int fd = open(argv[1], O_RDWR); struct stat stbuf; fstat(fd, &stbuf); Size = stbuf.st_size; Addr = mmap(0, stbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0); close(fd); pthread_t threads[NTHREAD]; int ret[NTHREAD]; /* kick off NTHREAD threads */ for (int i = 0; i < NTHREAD; i++) pthread_create(&threads[i], NULL, worker, &ret[i]); /* wait for all the threads to complete */
[PATCH] Fix region lost in /proc/self/smaps
hreads return the same value */ for (int i = 1; i < NTHREAD; i++) { if (ret[0] != ret[i]) { printf("Error i %d ret[0] = %d ret[i] = %d.\n", i, ret[0], ret[i]); } } printf("%d", ret[0]); return 0; } # dd if=/dev/zero of=~/out bs=2M count=1 # ./testcase ~/out It failed as some threads could not find the memory region in "/proc/self/smaps" which is allocated in the main process. It is caused by procfs, which uses 'file->version' to indicate the last VMA that has already been handled by the read() system call. When the next read() is issued, it uses the 'version' to find that VMA; the next VMA is then the one we want to handle. The related code is as follows: if (last_addr) { vma = find_vma(mm, last_addr); if (vma && (vma = m_next_vma(priv, vma))) return vma; } However, a VMA will be lost if the last handled VMA is gone, e.g.: The process VMA list is A->B->C->D CPU 0 CPU 1 read() system call handle VMA B version = B return to userspace unmap VMA B issue read() again to continue to get the region info find_vma(version) will get VMA C m_next_vma(C) will get VMA D handle D !!! VMA C is lost !!! In order to fix this bug, we make 'file->version' indicate the next VMA we want to handle. Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com> --- fs/proc/task_mmu.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 187d84e..ace4a69 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -146,8 +146,12 @@ m_next_vma(struct proc_maps_private *priv, struct vm_area_struct *vma) static void m_cache_vma(struct seq_file *m, struct vm_area_struct *vma) { - if (m->count < m->size) /* vma is copied successfully */ - m->version = m_next_vma(m->private, vma) ? vma->vm_start : -1UL; + /* vma is copied successfully */ + if (m->count < m->size) { + struct vm_area_struct *vma_next = m_next_vma(m->private, vma); + + m->version = vma_next ? 
vma_next->vm_start : -1UL; + } } static void *m_start(struct seq_file *m, loff_t *ppos) @@ -176,15 +180,15 @@ static void *m_start(struct seq_file *m, loff_t *ppos) if (last_addr) { vma = find_vma(mm, last_addr); - if (vma && (vma = m_next_vma(priv, vma))) + if (vma) return vma; } m->version = 0; if (pos < mm->map_count) { for (vma = mm->mmap; pos; pos--) { - m->version = vma->vm_start; vma = vma->vm_next; + m->version = vma->vm_start; } return vma; } -- 1.8.3.1
Re: DAX can not work on virtual nvdimm device
On 08/31/2016 01:09 AM, Dan Williams wrote: Can you post your exact reproduction steps? This test is not failing for me. Sure. 1. Make the guest kernel based on your tree; the top commit is 10d7902fa0e82b (dax: unmap/truncate on device shutdown) and the config file can be found in this thread. 2. Add the guest kernel command line: memmap=6G!10G 3. Start the guest: x86_64-softmmu/qemu-system-x86_64 -machine pc,nvdimm --enable-kvm \ -smp 16 -m 32G,maxmem=100G,slots=100 /other/VMs/centos6.img -monitor stdio 4. In the guest: mkfs.ext4 /dev/pmem0 mount -o dax /dev/pmem0 /mnt/pmem/ echo > /mnt/pmem/xxx ./mmap /mnt/pmem/xxx ./read /mnt/pmem/xxx The source code of mmap and read has been attached in this mail. Hopefully, you can detect the error triggered by the read test. Thanks! #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> int main(int argc, char *argv[]) { char *filename; if (argc < 2) { printf("arg: filename.\n"); return -1; } filename = argv[1]; printf("test on %s.\n", filename); int fd = open(filename, O_RDWR); if (fd < 0) { perror("open"); return -1; } int count = 0; while (1) { ssize_t ret; char buf; ret = read(fd, &buf, sizeof(buf)); if (ret < 0) { perror("READ"); return -1; } if (ret == 0) break; if (ret != sizeof(buf)) { printf("Count %x Ret %lx sizeof(buf) %lx.\n", count, ret, sizeof(buf)); return -1; } count++; printf("%c", buf); } printf("\n Good Read.\n"); return 0; } #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/mman.h> static void write_test(char *buf, int size) { int i; printf("Try to write %p for %x size.\n", buf, size); for (i = 0; i < size; i++) buf[i] = (char)(i & 0xFF); printf("Write Done.\n"); } static void read_test(char *buf, int size) { int i; printf("Try to read %p for %x size.\n", buf, size); for (i = 0; i < size; i++) if (buf[i] != (char)(i & 0xFF)) { printf("Read Check failed.\n"); printf("Index %x, %c expect %x.\n", i, buf[i], i & 0xFF); exit(-1); } printf("Read Done.\n"); } static void fread_test(int fd, int size) { char buf; int i; i = 
lseek(fd, 0, SEEK_END); printf("End: %x.\n", i); i = lseek(fd, 0, SEEK_SET); if (i != 0) { perror("seek"); exit(-1); } printf("Try to fread fd=%d size %x sizeof(buf) %ld.\n", fd, size, sizeof(buf)); for (i = 0; i < size; i++) { int ret; ret = read(fd, &buf, sizeof(buf)); if (ret != sizeof(buf)) { printf("Read failed, offset %x, ret %d ftell %ld.\n", i, ret, lseek(fd, 0, SEEK_CUR)); perror("read"); exit(-1); } if (buf != (char)(i & 0xFF)) { printf("Read Check failed.\n"); printf("Index %x, %c expect %x.\n", i, buf, i & 0xFF); exit(-1); } } printf("Fread Done.\n"); } int main(int argc, char *argv[]) { char *filename; char *buf; int fd; if (argc < 2) { printf("arg: filename.\n"); return -1; } filename = argv[1]; printf("mmap test on %s.\n", filename); fd = open(filename, O_RDWR); if (fd < 0) { perror("open"); return -1; } if (lseek(fd, 0x1000, SEEK_SET) != 0x1000) { perror("seek"); return -1; } if (ftruncate(fd, 0x1000) != 0) { perror("ftruncate"); return -1; } // return 0; buf = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if (buf == MAP_FAILED) { perror("mmap"); return -1; } write_test(buf, 0x1000); read_test(buf, 0x1000); msync(buf, 0x1000, MS_SYNC); fread_test(fd, 0x1000); close(fd); return 0; }
Re: DAX can not work on virtual nvdimm device
On 08/30/2016 03:30 AM, Ross Zwisler wrote: Can you please verify that you are using "usable" memory for your memmap? All the details are here: https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system

Sure. This is the BIOS E820 info in the guest:

e820: BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000bffdefff] usable
BIOS-e820: [mem 0x00000000bffdf000-0x00000000bfffffff] reserved
BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000083fffffff] usable
NX (Execute Disable) protection: active
e820: user-defined physical RAM map:
user: [mem 0x0000000000000000-0x000000000009fbff] usable
user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
user: [mem 0x0000000000100000-0x00000000bffdefff] usable
user: [mem 0x00000000bffdf000-0x00000000bfffffff] reserved
user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
user: [mem 0x0000000100000000-0x000000027fffffff] usable
user: [mem 0x0000000280000000-0x00000003ffffffff] persistent (type 12)
user: [mem 0x0000000400000000-0x000000083fffffff] usable

So the memory we used to emulate PMEM splits the 'usable' region.

My guess is that Boaz was correct, and that your memmap is off using addresses that don't actually map to memory.

I do not think so. :( I did the mmap-write and mmap-read tests; the data written by mmap-write can be correctly read back, so the backing memory really exists. Thanks!
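As a quick sanity check of the arithmetic above, the range carved out by memmap=6G!10G can be computed and compared against the usable high region of a 32G guest. This is a hypothetical helper, not part of the thread; the region bounds are the ones reported by the guest above.

```python
# Check that memmap=6G!10G falls inside a BIOS-usable region
# (values for the 32G QEMU guest discussed in this thread).
GiB = 1 << 30

pmem_start = 10 * GiB                 # memmap=<size>!<start>: start at 10G
pmem_end = pmem_start + 6 * GiB - 1   # 6G of emulated pmem

# BIOS-e820 usable region above 4G for a 32G guest: ~3G sits below 4G,
# so the high usable region runs from 4G up to 4G + 29G = 0x83fffffff.
usable_start = 0x100000000
usable_end = 0x83fffffff

assert usable_start <= pmem_start and pmem_end <= usable_end
print(hex(pmem_start), hex(pmem_end))  # 0x280000000 0x3ffffffff
```

So the pmem range [10G, 16G) does sit entirely inside "usable" memory, which is why the memmap placement itself looks correct here.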
Re: DAX can not work on virtual nvdimm device
Hi Ross, Sorry for the delay, I just returned from KVM Forum. On 08/20/2016 02:30 AM, Ross Zwisler wrote: On Fri, Aug 19, 2016 at 07:59:29AM -0700, Dan Williams wrote: On Fri, Aug 19, 2016 at 4:19 AM, Xiao Guangrong <guangrong.x...@linux.intel.com> wrote: Hi Dan, Recently, Redhat reported that the nvml test suite failed on QEMU/KVM; for more detailed info please refer to: https://bugzilla.redhat.com/show_bug.cgi?id=1365721 The reason for this bug is that the memory region created by mmap() on the dax-based file was gone, so that the region can not be found in /proc/self/smaps during the runtime. This is a simple way to trigger this issue: mount -o dax /dev/pmem0 /mnt/pmem/ vim /mnt/pmem/xxx then 'vim' crashes due to a segmentation fault. This bug can be reproduced on your tree; the top commit is 10d7902fa0e82b (dax: unmap/truncate on device shutdown), and the kernel config file is attached. Your thought or comment is highly appreciated. I'm going to be offline until Tuesday, but I will investigate when I'm back. In the meantime if Ross or Vishal had an opportunity to take a look I wouldn't say "no" :). I haven't been able to reproduce this vim segfault. I'm using QEMU v2.6.0, and the kernel commit you mentioned, and your kernel config. Here's my QEMU command line: sudo ~/qemu/bin/qemu-system-x86_64 /var/lib/libvirt/images/alara.qcow2 \ -machine pc,nvdimm -m 8G,maxmem=100G,slots=100 -object \ memory-backend-file,id=mem1,share,mem-path=/dev/pmem0,size=8G -device \ nvdimm,memdev=mem1,id=nv1 -smp 6 -machine pc,accel=kvm With this I'm able to mkfs the guest's /dev/pmem0, mount it with -o dax, and write a file with vim. Thanks for your test. That's strange... Can you reproduce your results with a pmem device created via a memmap kernel command line parameter in the guest? You'll need to update your kernel config to enable CONFIG_X86_PMEM_LEGACY and CONFIG_X86_PMEM_LEGACY_DEVICE. Okay, I tested it with memmap=6G!10G, and it failed too. So it looks like it's a filesystem or DAX issue. 
More precisely, I figured out the root cause: read() returns a wrong value when it reaches the end of the file. The following test case can trigger it:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
	char *filename;

	if (argc < 2) {
		printf("arg: filename.\n");
		return -1;
	}

	filename = argv[1];
	printf("test on %s.\n", filename);

	int fd = open(filename, O_RDWR);
	if (fd < 0) {
		perror("open");
		return -1;
	}

	int count = 0;
	while (1) {
		ssize_t ret;
		char buf;

		ret = read(fd, &buf, sizeof(buf));
		if (ret < 0) {
			perror("READ");
			return -1;
		}
		if (ret == 0)
			break;
		if (ret != sizeof(buf)) {
			printf("Count %x Ret %lx sizeof(buf) %lx.\n",
			       count, ret, sizeof(buf));
			return -1;
		}
		count++;
		printf("%c", buf);
	}

	printf("\n Good Read.\n");
	return 0;
}

It will fail at "ret != sizeof(buf)"; for example, the error output on my test env is: Count 1000 Ret 22f84200 sizeof(buf) 1.
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/06/2016 07:48 PM, Paolo Bonzini wrote: On 06/07/2016 06:02, Xiao Guangrong wrote: May I ask you what the exact issue you have with this interface for Intel to support your own GPU virtualization? Intel's vGPU can work with this framework. We really appreciate your / nvidia's contribution. Then, I don't think we should embargo Paolo's patch. This patchset is specific to the framework design, i.e., mapping memory when a fault happens rather than at mmap() time, and this design is exactly what we have been discussing for nearly two days. I disagree; this patch fixes a bug, because what Neo is doing is legal. It may not be the design that will be committed, but the bug they found in KVM is real. I just worry whether we really need fault-on-demand for device memory, i.e., whether device memory overcommit is safe enough. It lacks a graceful way to recover the workload if the resource is really overloaded. Unlike with normal memory, neither the host kernel nor the guest kernel can do anything except kill the VM in this case. So the VM gets crashed due to device emulation; that is not safe, as the device can be accessed from userspace even by an unprivileged user, making it a vulnerability in the data center.
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/06/2016 10:57 AM, Neo Jia wrote: On Wed, Jul 06, 2016 at 10:35:18AM +0800, Xiao Guangrong wrote: On 07/06/2016 10:18 AM, Neo Jia wrote: On Wed, Jul 06, 2016 at 10:00:46AM +0800, Xiao Guangrong wrote: On 07/05/2016 08:18 PM, Paolo Bonzini wrote: On 05/07/2016 07:41, Neo Jia wrote: On Thu, Jun 30, 2016 at 03:01:49PM +0200, Paolo Bonzini wrote: The vGPU folks would like to trap the first access to a BAR by setting vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault handler then can use remap_pfn_range to place some non-reserved pages in the VMA. KVM lacks support for this kind of non-linear VM_PFNMAP mapping, and these patches should fix this. Hi Paolo, I have tested your patches with the mediated passthru patchset that is being reviewed in the KVM and QEMU mailing lists. The fault handler gets called successfully and the previously mapped memory gets unmapped correctly via unmap_mapping_range. Great, then I'll include them in 4.8. Code is okay, but I still doubt whether this implementation, fetching MMIO pages in the fault handler, is needed. We'd better include these patches after the design of the vfio framework is decided. Hi Guangrong, I disagree. The design of the VFIO framework has been actively discussed in the KVM and QEMU mailing lists for a while, and the fault handler is agreed upon to provide the flexibility for different driver vendors' implementations. With that said, I am still open to discuss with you and anybody else about this framework, as the goal is to allow multiple vendors to plug into this framework to support their mediated device virtualization scheme, such as Intel, IBM and us. The discussion is still going on. And the current vfio patchset we reviewed is still problematic. My point is the fault handler part has been discussed already; with that said I am always open to any constructive suggestions to make things better and maintainable. (Appreciate your code review on the VFIO thread, I think we still owe you another response, will do that.) 
It can always be changed, especially as the vfio patchset is not in good shape. May I ask you what the exact issue you have with this interface for Intel to support your own GPU virtualization? Intel's vGPU can work with this framework. We really appreciate your / nvidia's contribution. Then, I don't think we should embargo Paolo's patch. This patchset is specific to the framework design, i.e., mapping memory when a fault happens rather than at mmap() time, and this design is exactly what we have been discussing for nearly two days.
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/06/2016 10:18 AM, Neo Jia wrote: On Wed, Jul 06, 2016 at 10:00:46AM +0800, Xiao Guangrong wrote: On 07/05/2016 08:18 PM, Paolo Bonzini wrote: On 05/07/2016 07:41, Neo Jia wrote: On Thu, Jun 30, 2016 at 03:01:49PM +0200, Paolo Bonzini wrote: The vGPU folks would like to trap the first access to a BAR by setting vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault handler then can use remap_pfn_range to place some non-reserved pages in the VMA. KVM lacks support for this kind of non-linear VM_PFNMAP mapping, and these patches should fix this. Hi Paolo, I have tested your patches with the mediated passthru patchset that is being reviewed in the KVM and QEMU mailing lists. The fault handler gets called successfully and the previously mapped memory gets unmapped correctly via unmap_mapping_range. Great, then I'll include them in 4.8. Code is okay, but I still doubt whether this implementation, fetching MMIO pages in the fault handler, is needed. We'd better include these patches after the design of the vfio framework is decided. Hi Guangrong, I disagree. The design of the VFIO framework has been actively discussed in the KVM and QEMU mailing lists for a while, and the fault handler is agreed upon to provide the flexibility for different driver vendors' implementations. With that said, I am still open to discuss with you and anybody else about this framework, as the goal is to allow multiple vendors to plug into this framework to support their mediated device virtualization scheme, such as Intel, IBM and us. The discussion is still going on. And the current vfio patchset we reviewed is still problematic. May I ask you what the exact issue you have with this interface for Intel to support your own GPU virtualization? Intel's vGPU can work with this framework. We really appreciate your / nvidia's contribution. 
I didn't mean to offend you; I just want to make sure this complexity is really needed, check whether this framework is safe enough, and think over whether we have a better implementation.
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/05/2016 11:07 PM, Neo Jia wrote: On Tue, Jul 05, 2016 at 05:02:46PM +0800, Xiao Guangrong wrote: It is physically contiguous but it is done during the runtime; physically contiguous doesn't mean static partition at boot time. And only during runtime will the proper HW resource be requested, and therefore the right portion of the MMIO region will be granted by the mediated device driver on the host. Okay. This is your implementation design rather than a hardware limitation, right? I don't think it matters here. We are talking about a framework, so it should provide the flexibility for different driver vendors. It really matters. It is the reason why we designed the framework like this, and we need to make sure whether we have a better design to fill the requirements. For example, if the instance requires 512M memory (the size can be specified by the QEMU command line), it can tell its requirement to the mediated device driver via the create() interface; then the driver can allocate the memory for this instance before it is running. BAR != your device memory. We don't set the BAR size via the QEMU command line; the BAR size is extracted by QEMU from the config space provided by the vendor driver. Anyway, the BAR size has a way to be configured, e.g., specify the size as a parameter when you create a mdev via sysfs. Theoretically, the hardware is able to do memory management in this style, but for some reason you chose allocating memory at runtime, right? If my understanding is right, could you please tell us what benefit you want to get from this runtime-allocation style? Your understanding is incorrect. Then WHY? Then the req_size and pgoff will both come from the mediated device driver based on its internal bookkeeping of the hw resource allocation, which is only available during runtime. And such bookkeeping can be built as part of a para-virtualization scheme between guest and host device driver. I am talking about the parameters you passed to validate_map_request(). 
req_size is calculated like this: + offset = virtaddr - vma->vm_start; + phyaddr = (vma->vm_pgoff << PAGE_SHIFT) + offset; + pgoff = phyaddr >> PAGE_SHIFT; All this info is from the vma, which is available in mmap(). pg_prot comes from: + pg_prot = vma->vm_page_prot; which is also available in mmap(). This is kept there in case validate_map_request() is not provided by the vendor driver; then by default a 1:1 mapping is assumed. So if validate_map_request() is not provided, the fault handler should not fail. THESE are the parameters you passed to validate_map_request(), and this info is available in mmap(); it really does not matter if you move validate_map_request() to mmap(). That's what I want to say. None of this information is available at VFIO mmap() time. For example, several VMs are sharing the same physical device to provide mediated access. All VMs will call the VFIO mmap() on their virtual BAR as part of the QEMU vfio/pci initialization process; at that moment, we definitely can't mmap the entire physical MMIO into both VMs blindly, for obvious reasons. mmap() carries @length information, so you only need to allocate the specified size (corresponding to @length) of memory for them. Again, you still look at this as a static partition at QEMU configuration time, where the guest mmio would be mapped as a whole at some offset of the physical mmio region. (You still can do that, as I said above, by not providing validate_map_request() in your vendor driver.) Then you can move validate_map_request() to there to achieve a custom allocation policy. But this is not the framework we are defining here. The framework we have here is to give the driver vendor the flexibility to decide the guest mmio and physical mmio mapping on a page basis, and such information is available during runtime. How such information gets communicated between guest and host driver is up to the driver vendor. 
The problem is the sequencing between "provide the driver vendor flexibility to decide the guest mmio and physical mmio mapping on a page basis" and mmap(). We should provide such allocation info first, then do mmap(). Your current design (do mmap(), then communicate such info, then use such info when a fault happens) is really BAD, because you can not control the time when the memory fault will happen. The guest may access this memory before the communication you mentioned above, and another reason is that the KVM MMU can prefetch memory at any time.
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/05/2016 08:18 PM, Paolo Bonzini wrote: On 05/07/2016 07:41, Neo Jia wrote: On Thu, Jun 30, 2016 at 03:01:49PM +0200, Paolo Bonzini wrote: The vGPU folks would like to trap the first access to a BAR by setting vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault handler then can use remap_pfn_range to place some non-reserved pages in the VMA. KVM lacks support for this kind of non-linear VM_PFNMAP mapping, and these patches should fix this. Hi Paolo, I have tested your patches with the mediated passthru patchset that is being reviewed in the KVM and QEMU mailing lists. The fault handler gets called successfully and the previously mapped memory gets unmapped correctly via unmap_mapping_range. Great, then I'll include them in 4.8. Code is okay, but I still doubt whether this implementation, fetching MMIO pages in the fault handler, is needed. We'd better include these patches after the design of the vfio framework is decided.
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/05/2016 03:30 PM, Neo Jia wrote: (Just for completeness, if you really want to use a device in the above example as VFIO passthru, the second step is not completely handled in userspace; it is actually the guest driver that will allocate and set up the proper hw resource, which will later be ready for you to access via some mmio pages.) Hmm... I always treat the VM as userspace. It is OK to treat the VM as userspace, but I think it is better to put out the details so we are always on the same page. Okay. I should pay more attention to it when I discuss with the driver people. :) This is how QEMU/VFIO currently works; could you please tell me the special points of your solution compared with current QEMU/VFIO, and why the current model cannot fit your requirement? So that we can better understand your scenario? The scenario I am describing here is the mediated passthru case, but what you are describing here (more or less) is the VFIO direct assigned case. It is different in several areas, but the major difference related to this topic is: 1) In the VFIO direct assigned case, the device (and its resource) is completely owned by the VM, therefore its mmio region can be mapped directly into the VM during the VFIO mmap() call, as there is no resource sharing among VMs and there is no mediated device driver on the host to manage such resource, so it is completely owned by the guest. I understand this difference. However, as you told me, the MMIO region allocated to the VM is contiguous, so I assume that portion of the physical MMIO region is completely owned by the guest. The only difference I can see is that the mediated device driver needs to allocate that region. It is physically contiguous, but it is done at runtime; physically contiguous doesn't mean static partitioning at boot time. And only during runtime will the proper HW resource be requested, and therefore the right portion of the MMIO region granted, by the mediated device driver on the host. Okay.
This is your implementation design rather than a hardware limitation, right? For example, if the instance requires 512M memory (the size can be specified on the QEMU command line), it can tell its requirement to the mediated device driver via the create() interface, and the driver can then allocate the memory for this instance before it is running. Theoretically, the hardware is able to do memory management in this style, but for some reason you choose to allocate memory at runtime, right? If my understanding is right, could you please tell us what benefit you want to get from this runtime-allocation style? Also, physically contiguous doesn't mean the guest and host mmio are always 1:1. You can have 8GB of host physical mmio while the guest will only have 256MB. Thanks for your patience; it is clearer to me, and at least I am able to try to guess the whole picture. :) 2) In the mediated passthru case, multiple VMs are sharing the same physical device, so how the HW resource gets allocated is completely decided by the guest and host device drivers of the virtualized DMA device, here the GPU, and the same goes for the MMIO pages used to access those HW resources. I cannot see what the guest's business is here; look at your code, you cooked the fault handler like this: You shouldn't, as that depends on how different devices are getting para-virtualized by their own implementations. A PV method. It is interesting. More comments below.
+ ret = parent->ops->validate_map_request(mdev, virtaddr,
+                                         &pgoff, &req_size,
+                                         &pg_prot);
Please tell me what information is taken from the guest here? All this info can be found at the time of mmap(). The virtaddr is the guest mmio address that triggers this fault, which will be used by the mediated device driver to locate the resource that it has previously allocated. The virtaddr is not the guest mmio address; it is the virtual address of QEMU. vfio is not able to figure out the guest mmio address, as the mapping is handled in userspace, as we discussed above.
And we can get the virtaddr from [vma->vm_start, vma->vm_end) when we do mmap(). Then the req_size and pgoff will both come from the mediated device driver, based on its internal bookkeeping of the hw resource allocation, which is only available during runtime. And such bookkeeping can be built as part of the para-virtualization scheme between the guest and host device drivers. I am talking about the parameters you passed to validate_map_request(). req_size is calculated like this:
+ offset = virtaddr - vma->vm_start;
+ phyaddr = (vma->vm_pgoff << PAGE_SHIFT) + offset;
+ pgoff = phyaddr >> PAGE_SHIFT;
All this info is from the vma, which is available in mmap(). pg_prot is taken from:
+ pg_prot = vma->vm_page_prot;
which is also available in mmap(). None of this information is available at VFIO mmap() time. For example, several VMs
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/05/2016 01:16 PM, Neo Jia wrote: On Tue, Jul 05, 2016 at 12:02:42PM +0800, Xiao Guangrong wrote: On 07/05/2016 09:35 AM, Neo Jia wrote: On Tue, Jul 05, 2016 at 09:19:40AM +0800, Xiao Guangrong wrote: On 07/04/2016 11:33 PM, Neo Jia wrote: Sorry, I think I misread the "allocation" as "mapping". We only delay the cpu mapping, not the allocation. So how to understand your statement: "at that moment nobody has any knowledge about how the physical mmio gets virtualized"? The resource, the physical MMIO region, has been allocated, so why do we not know the physical address mapped to the VM? From a device driver point of view, the physical mmio region never gets allocated until the corresponding resource is requested by clients and granted by the mediated device driver. Hmm... but you told me that you did not delay the allocation. :( Hi Guangrong, the allocation here is the allocation of the device resource, and the only way to access that kind of device resource is via an mmio region of some pages there. For example, if the VM needs resource A, the only way to access resource A is via some kind of device memory at mmio address X. So, we never defer the allocation request during runtime; we just set up the CPU mapping later, when it actually gets accessed. So it returns to my original question: why not allocate the physical mmio region in mmap()? Without running anything inside the VM, how do you know how the hw resource gets allocated, and therefore the use of the mmio region? The allocation and mapping can be two independent processes: - the first process is just allocation. The MMIO region is allocated from physical hardware, and this region is mapped into _QEMU's_ arbitrary virtual address space by mmap(). At this time, the VM cannot actually use this resource. - the second process is mapping. When the VM enables this region, e.g., it enables the PCI BAR, QEMU maps the virtual address returned by mmap() into the VM's physical memory. After that, the VM can access this region.
The second process is completely handled in userspace; that means the mediated device driver needn't care how the resource is mapped into the VM. In your example, you are still picturing it as VFIO direct assign, but the solution we are talking about here is mediated passthru via the VFIO framework to virtualize DMA devices without SR-IOV. Please see my comments below. (Just for completeness, if you really want to use a device in the above example as VFIO passthru, the second step is not completely handled in userspace; it is actually the guest driver that will allocate and set up the proper hw resource, which will later be ready for you to access via some mmio pages.) Hmm... I always treat the VM as userspace. This is how QEMU/VFIO currently works; could you please tell me the special points of your solution compared with current QEMU/VFIO, and why the current model cannot fit your requirement? So that we can better understand your scenario? The scenario I am describing here is the mediated passthru case, but what you are describing here (more or less) is the VFIO direct assigned case. It is different in several areas, but the major difference related to this topic is: 1) In the VFIO direct assigned case, the device (and its resource) is completely owned by the VM, therefore its mmio region can be mapped directly into the VM during the VFIO mmap() call, as there is no resource sharing among VMs and there is no mediated device driver on the host to manage such resource, so it is completely owned by the guest. I understand this difference. However, as you told me, the MMIO region allocated to the VM is contiguous, so I assume that portion of the physical MMIO region is completely owned by the guest. The only difference I can see is that the mediated device driver needs to allocate that region.
2) In the mediated passthru case, multiple VMs are sharing the same physical device, so how the HW resource gets allocated is completely decided by the guest and host device drivers of the virtualized DMA device, here the GPU, and the same goes for the MMIO pages used to access those HW resources. I cannot see what the guest's business is here; look at your code, you cooked the fault handler like this:
+ ret = parent->ops->validate_map_request(mdev, virtaddr,
+                                         &pgoff, &req_size,
+                                         &pg_prot);
Please tell me what information is taken from the guest here? All this info can be found at the time of mmap().
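To make the flow being debated concrete, here is a pseudocode-style sketch of how a callback like the quoted validate_map_request() would sit inside a vm_ops->fault handler. Only parent, ops, validate_map_request, mdev, virtaddr, pgoff, req_size, and pg_prot appear in the quoted fragment; every other name and the exact struct layout are assumptions, not the actual submission:

```c
/* Sketch only: all identifiers other than the quoted ones are assumed. */
static int mdev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct mdev_device *mdev = vma->vm_private_data;
	struct parent_device *parent = mdev->parent;
	unsigned long virtaddr = (unsigned long)vmf->virtual_address;
	unsigned long pgoff, req_size;
	pgprot_t pg_prot;

	/* Ask the vendor driver which physical pages back this access;
	 * the driver consults its runtime bookkeeping of HW resources. */
	if (parent->ops->validate_map_request(mdev, virtaddr,
					      &pgoff, &req_size, &pg_prot))
		return VM_FAULT_SIGBUS;

	/* Lazily establish the CPU mapping for the granted range. */
	if (remap_pfn_range(vma, virtaddr & PAGE_MASK, pgoff,
			    req_size, pg_prot))
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;
}
```

The disagreement in the thread is precisely about which of these outputs (pgoff, req_size, pg_prot) genuinely require runtime knowledge and which are already derivable from the vma at mmap() time.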
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/05/2016 09:35 AM, Neo Jia wrote: On Tue, Jul 05, 2016 at 09:19:40AM +0800, Xiao Guangrong wrote: On 07/04/2016 11:33 PM, Neo Jia wrote: Sorry, I think I misread the "allocation" as "mapping". We only delay the cpu mapping, not the allocation. So how to understand your statement: "at that moment nobody has any knowledge about how the physical mmio gets virtualized"? The resource, the physical MMIO region, has been allocated, so why do we not know the physical address mapped to the VM? From a device driver point of view, the physical mmio region never gets allocated until the corresponding resource is requested by clients and granted by the mediated device driver. Hmm... but you told me that you did not delay the allocation. :( Hi Guangrong, the allocation here is the allocation of the device resource, and the only way to access that kind of device resource is via an mmio region of some pages there. For example, if the VM needs resource A, the only way to access resource A is via some kind of device memory at mmio address X. So, we never defer the allocation request during runtime; we just set up the CPU mapping later, when it actually gets accessed. So it returns to my original question: why not allocate the physical mmio region in mmap()? Without running anything inside the VM, how do you know how the hw resource gets allocated, and therefore the use of the mmio region? The allocation and mapping can be two independent processes: - the first process is just allocation. The MMIO region is allocated from physical hardware, and this region is mapped into _QEMU's_ arbitrary virtual address space by mmap(). At this time, the VM cannot actually use this resource. - the second process is mapping. When the VM enables this region, e.g., it enables the PCI BAR, QEMU maps the virtual address returned by mmap() into the VM's physical memory. After that, the VM can access this region.
The second process is completely handled in userspace; that means the mediated device driver needn't care how the resource is mapped into the VM. This is how QEMU/VFIO currently works; could you please tell me the special points of your solution compared with current QEMU/VFIO, and why the current model cannot fit your requirement? So that we can better understand your scenario?
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/04/2016 11:33 PM, Neo Jia wrote: Sorry, I think I misread the "allocation" as "mapping". We only delay the cpu mapping, not the allocation. So how to understand your statement: "at that moment nobody has any knowledge about how the physical mmio gets virtualized"? The resource, the physical MMIO region, has been allocated, so why do we not know the physical address mapped to the VM? From a device driver point of view, the physical mmio region never gets allocated until the corresponding resource is requested by clients and granted by the mediated device driver. Hmm... but you told me that you did not delay the allocation. :( So it returns to my original question: why not allocate the physical mmio region in mmap()?
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/04/2016 05:16 PM, Neo Jia wrote: On Mon, Jul 04, 2016 at 04:45:05PM +0800, Xiao Guangrong wrote: On 07/04/2016 04:41 PM, Neo Jia wrote: On Mon, Jul 04, 2016 at 04:19:20PM +0800, Xiao Guangrong wrote: On 07/04/2016 03:53 PM, Neo Jia wrote: On Mon, Jul 04, 2016 at 03:37:35PM +0800, Xiao Guangrong wrote: On 07/04/2016 03:03 PM, Neo Jia wrote: On Mon, Jul 04, 2016 at 02:39:22PM +0800, Xiao Guangrong wrote: On 06/30/2016 09:01 PM, Paolo Bonzini wrote: The vGPU folks would like to trap the first access to a BAR by setting vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault handler then can use remap_pfn_range to place some non-reserved pages in the VMA. Why does it require fetching the pfn when the fault is triggered rather than when mmap() is called? Hi Guangrong, because such mapping information between virtual mmio and physical mmio is only available at runtime. Sorry, I do not know what the difference between mmap() and the time the VM actually accesses the memory is for your case. Could you please give more detail? Hi Guangrong, sure. The mmap() gets called by qemu or any VFIO API userspace consumer when setting up the virtual mmio; at that moment nobody has any knowledge about how the physical mmio gets virtualized. When the vm (or application, if we don't want to limit ourselves to the vmm term) starts, the virtual and physical mmio get mapped by the mpci kernel module with help from the vendor-supplied mediated host driver, according to the hw resource assigned to this vm / application. Thanks for your explanation. It sounds like a resource-allocation strategy: you delay the allocation until the VM really accesses it, right? Yes, that is where the fault handler inside the mpci code comes into the picture. I am not sure this strategy is good. The instance is created successfully, and it is started successfully, but then the VM crashes because the resource for that instance is not enough. That sounds unreasonable. Sorry, I think I misread the "allocation" as "mapping".
We only delay the cpu mapping, not the allocation. So how to understand your statement: "at that moment nobody has any knowledge about how the physical mmio gets virtualized"? The resource, the physical MMIO region, has been allocated, so why do we not know the physical address mapped to the VM?
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/04/2016 04:45 PM, Xiao Guangrong wrote: On 07/04/2016 04:41 PM, Neo Jia wrote: On Mon, Jul 04, 2016 at 04:19:20PM +0800, Xiao Guangrong wrote: On 07/04/2016 03:53 PM, Neo Jia wrote: On Mon, Jul 04, 2016 at 03:37:35PM +0800, Xiao Guangrong wrote: On 07/04/2016 03:03 PM, Neo Jia wrote: On Mon, Jul 04, 2016 at 02:39:22PM +0800, Xiao Guangrong wrote: On 06/30/2016 09:01 PM, Paolo Bonzini wrote: The vGPU folks would like to trap the first access to a BAR by setting vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault handler then can use remap_pfn_range to place some non-reserved pages in the VMA. Why does it require fetching the pfn when the fault is triggered rather than when mmap() is called? Hi Guangrong, because such mapping information between virtual mmio and physical mmio is only available at runtime. Sorry, I do not know what the difference between mmap() and the time the VM actually accesses the memory is for your case. Could you please give more detail? Hi Guangrong, sure. The mmap() gets called by qemu or any VFIO API userspace consumer when setting up the virtual mmio; at that moment nobody has any knowledge about how the physical mmio gets virtualized. When the vm (or application, if we don't want to limit ourselves to the vmm term) starts, the virtual and physical mmio get mapped by the mpci kernel module with help from the vendor-supplied mediated host driver, according to the hw resource assigned to this vm / application. Thanks for your explanation. It sounds like a resource-allocation strategy: you delay the allocation until the VM really accesses it, right? Yes, that is where the fault handler inside the mpci code comes into the picture. I am not sure this strategy is good. The instance is created successfully, and it is started successfully, but then the VM crashes because the resource for that instance is not enough. That sounds unreasonable. Especially, you cannot squeeze this kind of memory to balance the usage among all VMs. Does this strategy still make sense?