Re: set_pte_at_notify regression
Hi, On 10/01/2014 18:57, Andrea Arcangeli wrote: Hi! On Fri, Jan 10, 2014 at 06:38:06PM +0200, Izik Eidus wrote: It look like commit 6bdb913f0a70a4dfb7f066fb15e2d6f960701d00 break the semantic of set_pte_at_notify. The change of calling first to mmu_notifier_invalidate_range_start, then to set_pte_at_notify, and then to mmu_notifier_invalidate_range_end not only increase the amount of locks kvm have to take and release by factor of 3, but in addition mmu_notifier_invalidate_range_start is zapping the pte entry from kvm, so when set_pte_at_notify get called, it doesn`t have any spte to set and it acctuly get called for nothing, the result is increasing of vmexits for kvm from both do_wp_page and replace_page, and broken semantic of set_pte_at_notify. Agreed. I would suggest to change set_pte_at_notify to return if change_pte was missing in some mmu notifier attached to this mm, so we can do something like: ptep = page_check_address(page, mm, addr, ptl, 0); [..] notify_missing = false; if (... ) { entry = ptep_clear_flush(...); [..] notify_missing = set_pte_at_notify(mm, addr, ptep, entry); } pte_unmap_unlock(ptep, ptl); if (notify_missing) mmu_notifier_invalidate_page_if_missing_change_pte(mm, addr); and drop the range calls. This will provide sleepability and at the same time it won't screw the ability of change_pte to update sptes (by leaving those established by the time change_pte runs). I think it would be better for notifiers that do not support change_pte to keep getting both range_start and range_end notifiers. Otherwise, the invalidate_page notifier might end up marking the old page as dirty after it was already replaced in the primary page table. Perhaps we can have a flag in the mmu_notifier, similar to the notify_missing returned value here, that determines in these cases whether to call the invalidate_range_start/end pair, or just the set_pte_at_notify. Thanks, Haggai -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v3 1/2] add support for Hyper-V reference time counter
On Wed, 2014-01-08 at 23:20 +0100, Peter Lieven wrote: Am 08.01.2014 21:08, schrieb Vadim Rozenfeld: On Wed, 2014-01-08 at 15:54 +0100, Peter Lieven wrote: On 08.01.2014 13:12, Vadim Rozenfeld wrote: On Wed, 2014-01-08 at 12:48 +0100, Peter Lieven wrote: On 08.01.2014 11:44, Vadim Rozenfeld wrote: On Wed, 2014-01-08 at 11:15 +0100, Peter Lieven wrote: On 08.01.2014 10:40, Vadim Rozenfeld wrote: On Tue, 2014-01-07 at 18:52 +0100, Peter Lieven wrote: Am 07.01.2014 10:36, schrieb Vadim Rozenfeld: On Thu, 2014-01-02 at 17:52 +0100, Peter Lieven wrote: Am 11.12.2013 19:59, schrieb Marcelo Tosatti: On Wed, Dec 11, 2013 at 04:53:05PM -0200, Marcelo Tosatti wrote: On Sun, Dec 08, 2013 at 10:33:38PM +1100, Vadim Rozenfeld wrote: Signed-off: Peter Lieven p...@dlh.net Signed-off: Gleb Natapov g...@redhat.com Signed-off: Vadim Rozenfeld vroze...@redhat.com v1 - v2 1. mark TSC page dirty as suggested by Eric Northup digitale...@google.com and Gleb 2. disable local irq when calling get_kernel_ns, as it was done by Peter Lieven p...@dlhnet.de 3. move check for TSC page enable from second patch to this one. --- arch/x86/include/asm/kvm_host.h| 2 ++ arch/x86/include/uapi/asm/hyperv.h | 13 + arch/x86/kvm/x86.c | 39 +- include/uapi/linux/kvm.h | 1 + 4 files changed, 54 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ae5d783..2fd0753 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -605,6 +605,8 @@ struct kvm_arch { /* fields used by HYPER-V emulation */ u64 hv_guest_os_id; u64 hv_hypercall; + u64 hv_ref_count; + u64 hv_tsc_page; #ifdef CONFIG_KVM_MMU_AUDIT int audit_point; diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h index b8f1c01..462efe7 100644 --- a/arch/x86/include/uapi/asm/hyperv.h +++ b/arch/x86/include/uapi/asm/hyperv.h @@ -28,6 +28,9 @@ /* Partition Reference Counter (HV_X64_MSR_TIME_REF_COUNT) available*/ #define HV_X64_MSR_TIME_REF_COUNT_AVAILABLE (1 1) +/* A partition's reference time stamp counter (TSC) page */ +#define HV_X64_MSR_REFERENCE_TSC 0x4021 + /* * There is a single feature flag that signifies the presence of the MSR * that can be used to retrieve both the local APIC Timer frequency as @@ -198,6 +201,9 @@ #define HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_MASK\ (~((1ull HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT) - 1)) +#define HV_X64_MSR_TSC_REFERENCE_ENABLE 0x0001 +#define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT 12 + #define HV_PROCESSOR_POWER_STATE_C0 0 #define HV_PROCESSOR_POWER_STATE_C1 1 #define HV_PROCESSOR_POWER_STATE_C2 2 @@ -210,4 +216,11 @@ #define HV_STATUS_INVALID_ALIGNMENT 4 #define HV_STATUS_INSUFFICIENT_BUFFERS 19 +typedef struct _HV_REFERENCE_TSC_PAGE { + __u32 tsc_sequence; + __u32 res1; + __u64 tsc_scale; + __s64 tsc_offset; +} HV_REFERENCE_TSC_PAGE, *PHV_REFERENCE_TSC_PAGE; + #endif diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 21ef1ba..5e4e495a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -840,7 +840,7 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc); static u32 msrs_to_save[] = { MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW, - HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, + HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, HV_X64_MSR_TIME_REF_COUNT, HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME, MSR_KVM_PV_EOI_EN, MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, @@ -1826,6 +1826,8 @@ static bool kvm_hv_msr_partition_wide(u32 msr) switch (msr) { case HV_X64_MSR_GUEST_OS_ID: case HV_X64_MSR_HYPERCALL: + case HV_X64_MSR_REFERENCE_TSC: + case HV_X64_MSR_TIME_REF_COUNT: r = true; break; } @@ -1865,6 +1867,29 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data) if (__copy_to_user((void __user *)addr, instructions, 4)) return 1; kvm-arch.hv_hypercall = data; + local_irq_disable(); + kvm-arch.hv_ref_count = get_kernel_ns() + kvm-arch.kvmclock_offset; + local_irq_enable(); Where does the docs say that HV_X64_MSR_HYPERCALL is the where the clock starts counting? No need to store kvmclock_offset in
Re: [PATCH] PPC: KVM: fix VCPU run for HV KVM
On 10.01.2014, at 08:21, Alexey Kardashevskiy a...@ozlabs.ru wrote: When write to MMIO happens and there is an ioeventfd for that and is handled successfully, ioeventfd_write() returns 0 (success) and kvmppc_handle_store() returns EMULATE_DONE. Then kvmppc_emulate_mmio() converts EMULATE_DONE to RESUME_GUEST_NV and this broke from the loop. This adds handling of RESUME_GUEST_NV in kvmppc_vcpu_run_hv(). Cc: Michael S. Tsirkin m...@redhat.com Suggested-by: Paul Mackerras pau...@samba.org Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- This definitely needs a better commit message. Please, help. ps. it seems like ioeventfd never worked on ppc64. hm. --- arch/powerpc/kvm/book3s_hv.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 072287f..24f363f 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -1569,7 +1569,7 @@ static int kvmppc_vcpu_run_hv(struct kvm_run *run, struct kvm_vcpu *vcpu) vcpu-arch.fault_dar, vcpu-arch.fault_dsisr); srcu_read_unlock(vcpu-kvm-srcu, srcu_idx); } - } while (r == RESUME_GUEST); + } while ((r == RESUME_GUEST_NV) || (r == RESUME_GUEST)); How about while(!(r RESUME_FLAG_HOST)); That should cover all RESUME_GUEST_XXX cases just fine. Apart from that I agree that we should check for ! FLAG_HOST bit rather than the actual RESUME_GUEST value in all case where we check for it (read: please update all places). Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: x86: Fix debug typo error in lapic
On Thu, Jan 09, 2014 at 03:08:25PM -0500, Hu Yaohui wrote: Hi Marcelo, Thanks for your replying! I hope you have a good day! I am sorry that it's not that obvious to me after I checked that function. If the remote vcpu is not in the same pcpu as the sender which calls kvm_vpcu_kick. Before the remote vcpu received the Host-IPI, it could be in guest mode which could possibly access the stale tlb entry that have been modifed by the sender. For smp_call_function_wait, the wait parameter will only let the sender side wait until the function has been executed on all the target cpus, then return. For TLB Flush IPI, the sender has already invalidate one tlb entry, the received side should get the IPI ASAP to invalidate the specified tlb before it's accessed, or there should have some mechanisms to enusre that even the stale tlb entry is accessed, it's still ok. I am not sure what's the situation here after I reviewed all the source codes. The sender only considers the TLB entry (or entries) flushed when smp_call_function_wait finishes, that is when the receiver cpu acknowledges it has flushed its TLB. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: set_pte_at_notify regression
On Sun, Jan 12, 2014 at 01:56:12PM +0200, Haggai Eran wrote: Hi, On 10/01/2014 18:57, Andrea Arcangeli wrote: Hi! On Fri, Jan 10, 2014 at 06:38:06PM +0200, Izik Eidus wrote: It look like commit 6bdb913f0a70a4dfb7f066fb15e2d6f960701d00 break the semantic of set_pte_at_notify. The change of calling first to mmu_notifier_invalidate_range_start, then to set_pte_at_notify, and then to mmu_notifier_invalidate_range_end not only increase the amount of locks kvm have to take and release by factor of 3, but in addition mmu_notifier_invalidate_range_start is zapping the pte entry from kvm, so when set_pte_at_notify get called, it doesn`t have any spte to set and it acctuly get called for nothing, the result is increasing of vmexits for kvm from both do_wp_page and replace_page, and broken semantic of set_pte_at_notify. Agreed. I would suggest to change set_pte_at_notify to return if change_pte was missing in some mmu notifier attached to this mm, so we can do something like: ptep = page_check_address(page, mm, addr, ptl, 0); [..] notify_missing = false; if (... ) { entry = ptep_clear_flush(...); [..] notify_missing = set_pte_at_notify(mm, addr, ptep, entry); } pte_unmap_unlock(ptep, ptl); if (notify_missing) mmu_notifier_invalidate_page_if_missing_change_pte(mm, addr); and drop the range calls. This will provide sleepability and at the same time it won't screw the ability of change_pte to update sptes (by leaving those established by the time change_pte runs). I think it would be better for notifiers that do not support change_pte to keep getting both range_start and range_end notifiers. Otherwise, the invalidate_page notifier might end up marking the old page as dirty after it was already replaced in the primary page table. Ok but why would that be a problem? If the secondary pagetable mapping is found dirty, the old page shall be marked dirty as it means it was modified through the secondary mmu and is on-disk version may need to be updated before discarding the in-ram copy. What the difference would be to mark the page dirty in the range_start while the primary page table is still established, or after? Here the docs too: /* * Before this is invoked any secondary MMU is still ok to * read/write to the page previously pointed to by the Linux * pte because the page hasn't been freed yet and it won't be * freed until this returns. If required set_page_dirty has to * be called internally to this method. */ void (*invalidate_page)(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long address); Why the range_start/end is needed, is only to solve the mess with the freeing of the page in those cases were we hold no individual reference on the pages and we do tlb gather freeing. But in places like ksm merging and do_wp_page we hold a page reference before we start the primary pagetable updating, until after the mmu notifier invalidate. Thanks! Andrea -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: x86: Fix debug typo error in lapic
Thank you Marcelo! I really appreciate your explanation. On Sat, Jan 11, 2014 at 7:27 AM, Marcelo Tosatti mtosa...@redhat.com wrote: On Thu, Jan 09, 2014 at 03:08:25PM -0500, Hu Yaohui wrote: Hi Marcelo, Thanks for your replying! I hope you have a good day! I am sorry that it's not that obvious to me after I checked that function. If the remote vcpu is not in the same pcpu as the sender which calls kvm_vpcu_kick. Before the remote vcpu received the Host-IPI, it could be in guest mode which could possibly access the stale tlb entry that have been modifed by the sender. For smp_call_function_wait, the wait parameter will only let the sender side wait until the function has been executed on all the target cpus, then return. For TLB Flush IPI, the sender has already invalidate one tlb entry, the received side should get the IPI ASAP to invalidate the specified tlb before it's accessed, or there should have some mechanisms to enusre that even the stale tlb entry is accessed, it's still ok. I am not sure what's the situation here after I reviewed all the source codes. The sender only considers the TLB entry (or entries) flushed when smp_call_function_wait finishes, that is when the receiver cpu acknowledges it has flushed its TLB. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v3 1/2] add support for Hyper-V reference time counter
Am 12.01.2014 um 13:08 schrieb Vadim Rozenfeld vroze...@redhat.com: On Wed, 2014-01-08 at 23:20 +0100, Peter Lieven wrote: Am 08.01.2014 21:08, schrieb Vadim Rozenfeld: On Wed, 2014-01-08 at 15:54 +0100, Peter Lieven wrote: On 08.01.2014 13:12, Vadim Rozenfeld wrote: On Wed, 2014-01-08 at 12:48 +0100, Peter Lieven wrote: On 08.01.2014 11:44, Vadim Rozenfeld wrote: On Wed, 2014-01-08 at 11:15 +0100, Peter Lieven wrote: On 08.01.2014 10:40, Vadim Rozenfeld wrote: On Tue, 2014-01-07 at 18:52 +0100, Peter Lieven wrote: Am 07.01.2014 10:36, schrieb Vadim Rozenfeld: On Thu, 2014-01-02 at 17:52 +0100, Peter Lieven wrote: Am 11.12.2013 19:59, schrieb Marcelo Tosatti: On Wed, Dec 11, 2013 at 04:53:05PM -0200, Marcelo Tosatti wrote: On Sun, Dec 08, 2013 at 10:33:38PM +1100, Vadim Rozenfeld wrote: Signed-off: Peter Lieven p...@dlh.net Signed-off: Gleb Natapov g...@redhat.com Signed-off: Vadim Rozenfeld vroze...@redhat.com v1 - v2 1. mark TSC page dirty as suggested by Eric Northup digitale...@google.com and Gleb 2. disable local irq when calling get_kernel_ns, as it was done by Peter Lieven p...@dlhnet.de 3. move check for TSC page enable from second patch to this one. --- arch/x86/include/asm/kvm_host.h| 2 ++ arch/x86/include/uapi/asm/hyperv.h | 13 + arch/x86/kvm/x86.c | 39 +- include/uapi/linux/kvm.h | 1 + 4 files changed, 54 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ae5d783..2fd0753 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -605,6 +605,8 @@ struct kvm_arch { /* fields used by HYPER-V emulation */ u64 hv_guest_os_id; u64 hv_hypercall; + u64 hv_ref_count; + u64 hv_tsc_page; #ifdef CONFIG_KVM_MMU_AUDIT int audit_point; diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h index b8f1c01..462efe7 100644 --- a/arch/x86/include/uapi/asm/hyperv.h +++ b/arch/x86/include/uapi/asm/hyperv.h @@ -28,6 +28,9 @@ /* Partition Reference Counter (HV_X64_MSR_TIME_REF_COUNT) available*/ #define HV_X64_MSR_TIME_REF_COUNT_AVAILABLE (1 1) +/* A partition's reference time stamp counter (TSC) page */ +#define HV_X64_MSR_REFERENCE_TSC 0x4021 + /* * There is a single feature flag that signifies the presence of the MSR * that can be used to retrieve both the local APIC Timer frequency as @@ -198,6 +201,9 @@ #define HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_MASK \ (~((1ull HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT) - 1)) +#define HV_X64_MSR_TSC_REFERENCE_ENABLE 0x0001 +#define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT 12 + #define HV_PROCESSOR_POWER_STATE_C0 0 #define HV_PROCESSOR_POWER_STATE_C1 1 #define HV_PROCESSOR_POWER_STATE_C2 2 @@ -210,4 +216,11 @@ #define HV_STATUS_INVALID_ALIGNMENT 4 #define HV_STATUS_INSUFFICIENT_BUFFERS 19 +typedef struct _HV_REFERENCE_TSC_PAGE { + __u32 tsc_sequence; + __u32 res1; + __u64 tsc_scale; + __s64 tsc_offset; +} HV_REFERENCE_TSC_PAGE, *PHV_REFERENCE_TSC_PAGE; + #endif diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 21ef1ba..5e4e495a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -840,7 +840,7 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc); static u32 msrs_to_save[] = { MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW, - HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, + HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, HV_X64_MSR_TIME_REF_COUNT, HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME, MSR_KVM_PV_EOI_EN, MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, @@ -1826,6 +1826,8 @@ static bool kvm_hv_msr_partition_wide(u32 msr) switch (msr) { case HV_X64_MSR_GUEST_OS_ID: case HV_X64_MSR_HYPERCALL: + case HV_X64_MSR_REFERENCE_TSC: + case HV_X64_MSR_TIME_REF_COUNT: r = true; break; } @@ -1865,6 +1867,29 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data) if (__copy_to_user((void __user *)addr, instructions, 4)) return 1; kvm-arch.hv_hypercall = data; + local_irq_disable(); + kvm-arch.hv_ref_count = get_kernel_ns() + kvm-arch.kvmclock_offset; + local_irq_enable(); Where does the docs say that HV_X64_MSR_HYPERCALL is the where the clock starts counting? No need to store kvmclock_offset in hv_ref_count? (moreover the name is weird, better name would be
for test, ignore plz
RT~ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
How to add information at http://www.linux-kvm.org/page/Management_Tools
Dear sirs, My name is Ksenya Fil, I'm from the ISPsystem company. We develop VMmanager, a software solution for virtualization and virtual machines management. We would like to add information about our product into this page http://www.linux-kvm.org/page/Management_Tools Could you tell me the name of the person (his email) who I can contact for details? Thank you in advance. Kind regards, -- Ksenya Phil Sales Manager +32 280 80 760 kse...@ispsystem.com ISPsystem Maximize automation www.ispsystem.com facebook profile http://www.facebook.com/ispsystem twitter profile https://twitter.com/@ispsystem -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How to add information at http://www.linux-kvm.org/page/Management_Tools
Hi, I'm very interested in your tool - do you have an english version or information page somewhere? Thanks, Timh On mån 13 jan 2014 12:18:27, kse...@ispsystem.com wrote: Dear sirs, My name is Ksenya Fil, I'm from the ISPsystem company. We develop VMmanager, a software solution for virtualization and virtual machines management. We would like to add information about our product into this page http://www.linux-kvm.org/page/Management_Tools Could you tell me the name of the person (his email) who I can contact for details? Thank you in advance. Kind regards, -- //T -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html