Re: [PATCH v4 08/16] KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer
On 2021/4/9 15:59, Peter Zijlstra wrote: On Fri, Apr 09, 2021 at 03:07:38PM +0800, Xu, Like wrote: Hi Peter, On 2021/4/8 15:52, Peter Zijlstra wrote: This is because in the early part of this function, we have operations: if (x86_pmu.flags & PMU_FL_PEBS_ALL) arr[0].guest &= ~cpuc->pebs_enabled; else arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK); and if guest has PEBS_ENABLED, we need these bits back for PEBS counters: arr[0].guest |= arr[1].guest; I don't think that's right, who's to say they were set in the first place? The guest's GLOBAL_CTRL could have had the bits cleared at VMEXIT time. You can't unconditionally add PEBS_ENABLED into GLOBAL_CTRL, that's wrong. I can't keep up with you on this comment and would you explain more ? Well, it could be I'm terminally confused on how virt works (I usually am, it just doesn't make any sense ever). I may help you a little on this. On top of that this code doesn't have any comments to help. More comments will be added. So perf_guest_switch_msr has two msr values: guest and host. In my naive understanding guest is the msr value the guest sees and host is the value the host has. If it is not that, then the naming is just misleading at best. But thinking more about it, if these are fully emulated MSRs (which I think they are), then there might actually be 3 different values, not 2. You are right about 3 different values. We have the value the guest sees when it uses {RD,WR}MSR. We have the value the hardware has when it runs a guest. We have the value the hardware has when it doesn't run a guest. And somehow this code does something, but I can't for the life of me figure out what and how. Just focus on the last two values and the enabling bits (on the GLOBAL_CTRL and PEBS_ENABLE) of "the value the hardware has when it runs a guest" are exclusive with "the value the hardware has when it doesn't run a guest." To address your previous comments, does the code below look good to you? static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) { struct cpu_hw_events *cpuc = this_cpu_ptr(_hw_events); struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs; struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds); struct kvm_pmu *pmu = (struct kvm_pmu *)data; u64 pebs_mask = (x86_pmu.flags & PMU_FL_PEBS_ALL) ? cpuc->pebs_enabled : (cpuc->pebs_enabled & PEBS_COUNTER_MASK); int i = 0; arr[i].msr = MSR_CORE_PERF_GLOBAL_CTRL; arr[i].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask; arr[i].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask; arr[i].guest &= ~pebs_mask; if (!x86_pmu.pebs) goto out; /* * If PMU counter has PEBS enabled it is not enough to * disable counter on a guest entry since PEBS memory * write can overshoot guest entry and corrupt guest * memory. Disabling PEBS solves the problem. * * Don't do this if the CPU already enforces it. */ if (x86_pmu.pebs_no_isolation) { i++; arr[i].msr = MSR_IA32_PEBS_ENABLE; arr[i].host = cpuc->pebs_enabled; arr[i].guest = 0; goto out; } if (!pmu || !x86_pmu.pebs_vmx) goto out; i++; arr[i].msr = MSR_IA32_DS_AREA; arr[i].host = (unsigned long)ds; arr[i].guest = pmu->ds_area; if (x86_pmu.intel_cap.pebs_baseline) { i++; arr[i].msr = MSR_PEBS_DATA_CFG; arr[i].host = cpuc->pebs_data_cfg; arr[i].guest = pmu->pebs_data_cfg; } i++; arr[i].msr = MSR_IA32_PEBS_ENABLE; arr[i].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask; arr[i].guest = pebs_mask & ~cpuc->intel_ctrl_host_mask; if (arr[i].host) { /* Disable guest PEBS if host PEBS is enabled. */ arr[i].guest = 0; } else { /* Disable guest PEBS for cross-mapped PEBS counters. */ arr[i].guest &= ~pmu->host_cross_mapped_mask; arr[0].guest |= arr[i].guest; } out: *nr = ++i; return arr; } The ++ is in a weird location, if you place it after filling out an entry it makes more sense I think. Something like: arr[i].msr = MSR_CORE_PERF_GLOBAL_CTRL; arr[i].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask; arr[i].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask; arr[i].guest &= ~pebs_mask; i++; or, perhaps even like: arr[i++] = (struct perf_guest_switch_msr){ .msr = MSR_CORE_PERF_GLOBAL_CTRL, .host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask, .guest = x86_pmu.intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask), }; The later one looks good to me and I'll apply it. But it doesn't address the fundamental confusion I seem to be having, what actual msr value is what. VMX hardware
Re: [PATCH v4 08/16] KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer
On Fri, Apr 09, 2021 at 03:07:38PM +0800, Xu, Like wrote: > Hi Peter, > > On 2021/4/8 15:52, Peter Zijlstra wrote: > > > This is because in the early part of this function, we have operations: > > > > > > if (x86_pmu.flags & PMU_FL_PEBS_ALL) > > > arr[0].guest &= ~cpuc->pebs_enabled; > > > else > > > arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK); > > > > > > and if guest has PEBS_ENABLED, we need these bits back for PEBS counters: > > > > > > arr[0].guest |= arr[1].guest; > > > I don't think that's right, who's to say they were set in the first > > place? The guest's GLOBAL_CTRL could have had the bits cleared at VMEXIT > > time. You can't unconditionally add PEBS_ENABLED into GLOBAL_CTRL, > > that's wrong. > I can't keep up with you on this comment and would you explain more ? Well, it could be I'm terminally confused on how virt works (I usually am, it just doesn't make any sense ever). On top of that this code doesn't have any comments to help. So perf_guest_switch_msr has two msr values: guest and host. In my naive understanding guest is the msr value the guest sees and host is the value the host has. If it is not that, then the naming is just misleading at best. But thinking more about it, if these are fully emulated MSRs (which I think they are), then there might actually be 3 different values, not 2. We have the value the guest sees when it uses {RD,WR}MSR. We have the value the hardware has when it runs a guest. We have the value the hardware has when it doesn't run a guest. And somehow this code does something, but I can't for the life of me figure out what and how. > To address your previous comments, does the code below look good to you? > > static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) > { > struct cpu_hw_events *cpuc = this_cpu_ptr(_hw_events); > struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs; > struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds); > struct kvm_pmu *pmu = (struct kvm_pmu *)data; > u64 pebs_mask = (x86_pmu.flags & PMU_FL_PEBS_ALL) ? > cpuc->pebs_enabled : (cpuc->pebs_enabled & PEBS_COUNTER_MASK); > int i = 0; > > arr[i].msr = MSR_CORE_PERF_GLOBAL_CTRL; > arr[i].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask; > arr[i].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask; > arr[i].guest &= ~pebs_mask; > > if (!x86_pmu.pebs) > goto out; > > /* > * If PMU counter has PEBS enabled it is not enough to > * disable counter on a guest entry since PEBS memory > * write can overshoot guest entry and corrupt guest > * memory. Disabling PEBS solves the problem. > * > * Don't do this if the CPU already enforces it. > */ > if (x86_pmu.pebs_no_isolation) { > i++; > arr[i].msr = MSR_IA32_PEBS_ENABLE; > arr[i].host = cpuc->pebs_enabled; > arr[i].guest = 0; > goto out; > } > > if (!pmu || !x86_pmu.pebs_vmx) > goto out; > > i++; > arr[i].msr = MSR_IA32_DS_AREA; > arr[i].host = (unsigned long)ds; > arr[i].guest = pmu->ds_area; > > if (x86_pmu.intel_cap.pebs_baseline) { > i++; > arr[i].msr = MSR_PEBS_DATA_CFG; > arr[i].host = cpuc->pebs_data_cfg; > arr[i].guest = pmu->pebs_data_cfg; > } > > i++; > arr[i].msr = MSR_IA32_PEBS_ENABLE; > arr[i].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask; > arr[i].guest = pebs_mask & ~cpuc->intel_ctrl_host_mask; > > if (arr[i].host) { > /* Disable guest PEBS if host PEBS is enabled. */ > arr[i].guest = 0; > } else { > /* Disable guest PEBS for cross-mapped PEBS counters. */ > arr[i].guest &= ~pmu->host_cross_mapped_mask; > arr[0].guest |= arr[i].guest; > } > > out: > *nr = ++i; > return arr; > } The ++ is in a weird location, if you place it after filling out an entry it makes more sense I think. Something like: arr[i].msr = MSR_CORE_PERF_GLOBAL_CTRL; arr[i].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask; arr[i].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask; arr[i].guest &= ~pebs_mask; i++; or, perhaps even like: arr[i++] = (struct perf_guest_switch_msr){ .msr = MSR_CORE_PERF_GLOBAL_CTRL, .host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask, .guest = x86_pmu.intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask), }; But it doesn't address the fundamental confusion I seem to be having, what actual msr value is what.
Re: [PATCH v4 08/16] KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer
Hi Peter, On 2021/4/8 15:52, Peter Zijlstra wrote: This is because in the early part of this function, we have operations: if (x86_pmu.flags & PMU_FL_PEBS_ALL) arr[0].guest &= ~cpuc->pebs_enabled; else arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK); and if guest has PEBS_ENABLED, we need these bits back for PEBS counters: arr[0].guest |= arr[1].guest; I can't keep up with you on this comment and would you explain more ? I don't think that's right, who's to say they were set in the first place? The guest's GLOBAL_CTRL could have had the bits cleared at VMEXIT time. You can't unconditionally add PEBS_ENABLED into GLOBAL_CTRL, that's wrong. To address your previous comments, does the code below look good to you? static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) { struct cpu_hw_events *cpuc = this_cpu_ptr(_hw_events); struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs; struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds); struct kvm_pmu *pmu = (struct kvm_pmu *)data; u64 pebs_mask = (x86_pmu.flags & PMU_FL_PEBS_ALL) ? cpuc->pebs_enabled : (cpuc->pebs_enabled & PEBS_COUNTER_MASK); int i = 0; arr[i].msr = MSR_CORE_PERF_GLOBAL_CTRL; arr[i].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask; arr[i].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask; arr[i].guest &= ~pebs_mask; if (!x86_pmu.pebs) goto out; /* * If PMU counter has PEBS enabled it is not enough to * disable counter on a guest entry since PEBS memory * write can overshoot guest entry and corrupt guest * memory. Disabling PEBS solves the problem. * * Don't do this if the CPU already enforces it. */ if (x86_pmu.pebs_no_isolation) { i++; arr[i].msr = MSR_IA32_PEBS_ENABLE; arr[i].host = cpuc->pebs_enabled; arr[i].guest = 0; goto out; } if (!pmu || !x86_pmu.pebs_vmx) goto out; i++; arr[i].msr = MSR_IA32_DS_AREA; arr[i].host = (unsigned long)ds; arr[i].guest = pmu->ds_area; if (x86_pmu.intel_cap.pebs_baseline) { i++; arr[i].msr = MSR_PEBS_DATA_CFG; arr[i].host = cpuc->pebs_data_cfg; arr[i].guest = pmu->pebs_data_cfg; } i++; arr[i].msr = MSR_IA32_PEBS_ENABLE; arr[i].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask; arr[i].guest = pebs_mask & ~cpuc->intel_ctrl_host_mask; if (arr[i].host) { /* Disable guest PEBS if host PEBS is enabled. */ arr[i].guest = 0; } else { /* Disable guest PEBS for cross-mapped PEBS counters. */ arr[i].guest &= ~pmu->host_cross_mapped_mask; arr[0].guest |= arr[i].guest; } out: *nr = ++i; return arr; }
Re: [PATCH v4 08/16] KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer
On 2021/4/8 15:52, Peter Zijlstra wrote: On Thu, Apr 08, 2021 at 01:39:49PM +0800, Xu, Like wrote: Hi Peter, Thanks for your detailed comments. If you have more comments for other patches, please let me know. On 2021/4/7 23:39, Peter Zijlstra wrote: On Mon, Mar 29, 2021 at 01:41:29PM +0800, Like Xu wrote: @@ -3869,10 +3876,12 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) if (arr[1].guest) arr[0].guest |= arr[1].guest; - else + else { arr[1].guest = arr[1].host; + arr[2].guest = arr[2].host; + } What's all this gibberish? The way I read that it says: if guest has PEBS_ENABLED guest GLOBAL_CTRL |= PEBS_ENABLED otherwise guest PEBS_ENABLED = host PEBS_ENABLED guest DS_AREA = host DS_AREA which is just completely random garbage afaict. Why would you leak host msrs into the guest? In fact, this is not a leak at all. When we do "arr[i].guest = arr[i].host;" assignment in the intel_guest_get_msrs(), the KVM will check "if (msrs[i].host == msrs[i].guest)" and if so, it disables the atomic switch for this msr during vmx transaction in the caller atomic_switch_perf_msrs(). Another marvel of bad coding style that function is :-( Lots of missing {} and indentation fail. Sorry for that and I'll fix them. This is terrible though, why would we clear the guest MSRs when it changes PEBS_ENABLED. The values of arr[1].host and arr[1].guest depend on the arrangement of host perf: arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask; arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask; rather than the guest value of PEBS_ENABLE. When the value of this msr is different across vmx-transaction, we will load arr[1].host after vm-exit and load arr[1].guest before vm-entry. If the value of this msr is the same before and after vmx-transaction, we do nothing and keep the original value on the register. The guest had better clear them itself. I don't understand what you are referring to here. Can you explain what you think is the correct behavior here ? Removing guest DS_AREA just because we don't have any bits set in PEBS_ENABLED is wrong and could very break all sorts of drivers. Except for PEBS, other features that rely on DS_AREA are not available in the guest . Can you explain more of your concerns for DS_AREA switch ? In that case, the msr value doesn't change and any guest write will be trapped. If the next check is "msrs[i].host != msrs[i].guest", the atomic switch will be triggered again. Compared to before, this part of the logic has not changed, which helps to reduce overhead. It's unreadable garbage at best. If you don't want it changed, then don't add it to the arr[] thing in the first place. Thanks, adding GLOBAL_CTRL to arr[] in the last step is a better choice. Why would you change guest GLOBAL_CTRL implicitly; This is because in the early part of this function, we have operations: if (x86_pmu.flags & PMU_FL_PEBS_ALL) arr[0].guest &= ~cpuc->pebs_enabled; else arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK); and if guest has PEBS_ENABLED, we need these bits back for PEBS counters: arr[0].guest |= arr[1].guest; I don't think that's right, who's to say they were set in the first place? The guest's GLOBAL_CTRL could have had the bits cleared at VMEXIT time. Please note the guest GLOBAL_CTRL value is stored in the pmu->global_ctrl, while the actual loaded value for GLOBAL_CTRL msr after vm-entry is "x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask". You can't unconditionally add PEBS_ENABLED into GLOBAL_CTRL, that's wrong. The determination of the msr values before and after vmx-transaction are always in the context of host perf which means the PEBS perf_events created by the KVM are all scheduled on and used legally , and it does not depend on the guest values at all. guest had better wrmsr that himself to control when stuff is enabled. When vm_entry, the msr value of GLOBAL_CTRL on the hardware may be different from trapped value "pmu->global_ctrl" written by the guest. If the perf scheduler cross maps guest counter X to the host counter Y, we have to enable the bit Y in GLOBAL_CTRL before vm_entry rather than X. Sure, but I don't see that happening here. Just fire questions if we're not on the same page or you're out of KVM context.
Re: [PATCH v4 08/16] KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer
On Thu, Apr 08, 2021 at 01:39:49PM +0800, Xu, Like wrote: > Hi Peter, > > Thanks for your detailed comments. > > If you have more comments for other patches, please let me know. > > On 2021/4/7 23:39, Peter Zijlstra wrote: > > On Mon, Mar 29, 2021 at 01:41:29PM +0800, Like Xu wrote: > > > @@ -3869,10 +3876,12 @@ static struct perf_guest_switch_msr > > > *intel_guest_get_msrs(int *nr, void *data) > > > if (arr[1].guest) > > > arr[0].guest |= arr[1].guest; > > > - else > > > + else { > > > arr[1].guest = arr[1].host; > > > + arr[2].guest = arr[2].host; > > > + } > > What's all this gibberish? > > > > The way I read that it says: > > > > if guest has PEBS_ENABLED > > guest GLOBAL_CTRL |= PEBS_ENABLED > > otherwise > > guest PEBS_ENABLED = host PEBS_ENABLED > > guest DS_AREA = host DS_AREA > > > > which is just completely random garbage afaict. Why would you leak host > > msrs into the guest? > > In fact, this is not a leak at all. > > When we do "arr[i].guest = arr[i].host;" assignment in the > intel_guest_get_msrs(), the KVM will check "if (msrs[i].host == > msrs[i].guest)" and if so, it disables the atomic switch for this msr > during vmx transaction in the caller atomic_switch_perf_msrs(). Another marvel of bad coding style that function is :-( Lots of missing {} and indentation fail. This is terrible though, why would we clear the guest MSRs when it changes PEBS_ENABLED. The guest had better clear them itself. Removing guest DS_AREA just because we don't have any bits set in PEBS_ENABLED is wrong and could very break all sorts of drivers. > In that case, the msr value doesn't change and any guest write will be > trapped. If the next check is "msrs[i].host != msrs[i].guest", the > atomic switch will be triggered again. > > Compared to before, this part of the logic has not changed, which helps to > reduce overhead. It's unreadable garbage at best. If you don't want it changed, then don't add it to the arr[] thing in the first place. > > Why would you change guest GLOBAL_CTRL implicitly; > > This is because in the early part of this function, we have operations: > > if (x86_pmu.flags & PMU_FL_PEBS_ALL) > arr[0].guest &= ~cpuc->pebs_enabled; > else > arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK); > > and if guest has PEBS_ENABLED, we need these bits back for PEBS counters: > > arr[0].guest |= arr[1].guest; I don't think that's right, who's to say they were set in the first place? The guest's GLOBAL_CTRL could have had the bits cleared at VMEXIT time. You can't unconditionally add PEBS_ENABLED into GLOBAL_CTRL, that's wrong. > > guest had better wrmsr that himself to control when stuff is enabled. > > When vm_entry, the msr value of GLOBAL_CTRL on the hardware may be > different from trapped value "pmu->global_ctrl" written by the guest. > > If the perf scheduler cross maps guest counter X to the host counter Y, > we have to enable the bit Y in GLOBAL_CTRL before vm_entry rather than X. Sure, but I don't see that happening here.
Re: [PATCH v4 08/16] KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer
Hi Peter, Thanks for your detailed comments. If you have more comments for other patches, please let me know. On 2021/4/7 23:39, Peter Zijlstra wrote: On Mon, Mar 29, 2021 at 01:41:29PM +0800, Like Xu wrote: @@ -3869,10 +3876,12 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) if (arr[1].guest) arr[0].guest |= arr[1].guest; - else + else { arr[1].guest = arr[1].host; + arr[2].guest = arr[2].host; + } What's all this gibberish? The way I read that it says: if guest has PEBS_ENABLED guest GLOBAL_CTRL |= PEBS_ENABLED otherwise guest PEBS_ENABLED = host PEBS_ENABLED guest DS_AREA = host DS_AREA which is just completely random garbage afaict. Why would you leak host msrs into the guest? In fact, this is not a leak at all. When we do "arr[i].guest = arr[i].host;" assignment in the intel_guest_get_msrs(), the KVM will check "if (msrs[i].host == msrs[i].guest)" and if so, it disables the atomic switch for this msr during vmx transaction in the caller atomic_switch_perf_msrs(). In that case, the msr value doesn't change and any guest write will be trapped. If the next check is "msrs[i].host != msrs[i].guest", the atomic switch will be triggered again. Compared to before, this part of the logic has not changed, which helps to reduce overhead. Why would you change guest GLOBAL_CTRL implicitly; This is because in the early part of this function, we have operations: if (x86_pmu.flags & PMU_FL_PEBS_ALL) arr[0].guest &= ~cpuc->pebs_enabled; else arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK); and if guest has PEBS_ENABLED, we need these bits back for PEBS counters: arr[0].guest |= arr[1].guest; guest had better wrmsr that himself to control when stuff is enabled. When vm_entry, the msr value of GLOBAL_CTRL on the hardware may be different from trapped value "pmu->global_ctrl" written by the guest. If the perf scheduler cross maps guest counter X to the host counter Y, we have to enable the bit Y in GLOBAL_CTRL before vm_entry rather than X. This just cannot be right.
Re: [PATCH v4 08/16] KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer
On Mon, Mar 29, 2021 at 01:41:29PM +0800, Like Xu wrote: > @@ -3869,10 +3876,12 @@ static struct perf_guest_switch_msr > *intel_guest_get_msrs(int *nr, void *data) > > if (arr[1].guest) > arr[0].guest |= arr[1].guest; > - else > + else { > arr[1].guest = arr[1].host; > + arr[2].guest = arr[2].host; > + } What's all this gibberish? The way I read that it says: if guest has PEBS_ENABLED guest GLOBAL_CTRL |= PEBS_ENABLED otherwise guest PEBS_ENABLED = host PEBS_ENABLED guest DS_AREA = host DS_AREA which is just completely random garbage afaict. Why would you leak host msrs into the guest? Why would you change guest GLOBAL_CTRL implicitly; guest had better wrmsr that himself to control when stuff is enabled. This just cannot be right.
[PATCH v4 08/16] KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer
When CPUID.01H:EDX.DS[21] is set, the IA32_DS_AREA MSR exists and points to the linear address of the first byte of the DS buffer management area, which is used to manage the PEBS records. When guest PEBS is enabled and the value is different from the host, KVM will add the IA32_DS_AREA MSR to the msr-switch list. The guest's DS value can be loaded to the real HW before VM-entry, and will be removed when guest PEBS is disabled. The WRMSR to IA32_DS_AREA MSR brings a #GP(0) if the source register contains a non-canonical address. The switch of IA32_DS_AREA MSR would also, setup a quiescent period to write the host PEBS records (if any) to host DS area rather than guest DS area. When guest PEBS is enabled, the MSR_IA32_DS_AREA MSR will be added to the perf_guest_switch_msr() and switched during the VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR. Originally-by: Andi Kleen Co-developed-by: Kan Liang Signed-off-by: Kan Liang Signed-off-by: Like Xu --- arch/x86/events/intel/core.c| 15 --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/vmx/pmu_intel.c| 11 +++ arch/x86/kvm/vmx/vmx.c | 1 + 4 files changed, 25 insertions(+), 3 deletions(-) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 2ca8ed61f444..7f3821a59b84 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -21,6 +21,7 @@ #include #include #include +#include #include "../perf_event.h" @@ -3841,6 +3842,8 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) { struct cpu_hw_events *cpuc = this_cpu_ptr(_hw_events); struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs; + struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds); + struct kvm_pmu *pmu = (struct kvm_pmu *)data; arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL; arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask; @@ -3851,11 +3854,15 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK); *nr = 1; - if (x86_pmu.pebs) { + if (pmu && x86_pmu.pebs) { arr[1].msr = MSR_IA32_PEBS_ENABLE; arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask; arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask; + arr[2].msr = MSR_IA32_DS_AREA; + arr[2].host = (unsigned long)ds; + arr[2].guest = pmu->ds_area; + /* * If PMU counter has PEBS enabled it is not enough to * disable counter on a guest entry since PEBS memory @@ -3869,10 +3876,12 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data) if (arr[1].guest) arr[0].guest |= arr[1].guest; - else + else { arr[1].guest = arr[1].host; + arr[2].guest = arr[2].host; + } - *nr = 2; + *nr = 3; } return arr; diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index f620485d7836..2275cc144f58 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -460,6 +460,7 @@ struct kvm_pmu { DECLARE_BITMAP(all_valid_pmc_idx, X86_PMC_IDX_MAX); DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX); + u64 ds_area; u64 pebs_enable; u64 pebs_enable_mask; diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 0700d6d739f7..77d30106abca 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -223,6 +223,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) case MSR_IA32_PEBS_ENABLE: ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT; break; + case MSR_IA32_DS_AREA: + ret = guest_cpuid_has(vcpu, X86_FEATURE_DS); + break; default: ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) || get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) || @@ -373,6 +376,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_IA32_PEBS_ENABLE: msr_info->data = pmu->pebs_enable; return 0; + case MSR_IA32_DS_AREA: + msr_info->data = pmu->ds_area; + return 0; default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) || (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) { @@ -441,6 +447,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 0; } break; + case MSR_IA32_DS_AREA: + if