RE: [PATCH v1] KVM/x86/vPMU: Guest PMI Optimization

2018-10-14 Thread Wang, Wei W
On Sunday, October 14, 2018 8:41 PM, Wei Wang wrote:
> Here is the plan I have in mind:
> #1 Create a host perf event on the guest's first bit-setting of
> MSR_CORE_PERF_GLOBAL_CTRL; meanwhile, disable the intercept of guest
> accesses to this perf counter's related MSRs (i.e. config_base and event_base).
> #2 When the vCPU is scheduled in,
>  #2.1 make the MSRs of the perf counters (assigned to the guest in
> #1) intercepted again, so that a guest access to such a counter is captured
> and the counter is marked "used", then disable the intercept again;
>  #2.2 also check whether any counter wasn't "used" in the last vCPU
> time slice; if so, release that counter and its perf event.

Just thought of an issue with passing through config_base and event_base: the
guest's view of the perf counter may differ from the one assigned by the host,
which would result in the guest using a different config_base and event_base.
So we would still need to keep those MSRs intercepted by the host. The
remaining part (the lazy allocation and release of the host perf event) can
still work.

Best,
Wei


Re: [PATCH v1] KVM/x86/vPMU: Guest PMI Optimization

2018-10-14 Thread Wei Wang

On 10/13/2018 09:30 PM, Peter Zijlstra wrote:

On Fri, Oct 12, 2018 at 08:20:17PM +0800, Wei Wang wrote:

Guest changing MSR_CORE_PERF_GLOBAL_CTRL causes KVM to reprogram pmc
counters, which re-allocates a host perf event. This process is

Yea gawds, that's horrific. Why does it do that? We have
PERF_EVENT_IOC_PERIOD which does that much better. Still, what you're
proposing is faster still -- if it is correct.


I'm not sure about the back story. Probably it was an initial functional 
implementation.



This patch implements a fast path to handle the guest change of
MSR_CORE_PERF_GLOBAL_CTRL for the guest pmi case. Guest change of the
msr will be applied to the hardware when entering the guest, and the
old perf event will continue to be used. The guest setting of the
perf counter for the next irq period in pmi will also be written
directly to the hardware counter when entering the guest.

What you're failing to explain here is why exactly it is ok to write to
the MSR directly without updating the perf_event state. I didn't take
the time to go through all that, but it certainly needs documenting.


OK. The guest itself has the perf event (the one that is actually using the
hardware counter), and that event's state is managed by the guest perf core.
The host-side perf event isn't the one that uses the hardware counter;
essentially, it exists on the host just to occupy the counter (via the host
perf core) for the guest. The MSR write here is essentially performed on
behalf of the guest perf event.
So, for the host-side perf event, I think its state should remain active as
long as the guest is using the counter. The state will be changed to inactive
(as usual) when the vCPU is scheduled out.
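
(For context, roughly how such a host-side placeholder event is created -- a
simplified sketch of the idea, not a verbatim copy of KVM's reprogramming
code:)

/* Simplified sketch: KVM occupies a hardware counter for the guest by
 * creating a pinned, host-excluded perf event.  The host perf core then
 * schedules it onto a real counter; the event never counts host activity,
 * it just holds the counter on the guest's behalf. */
static struct perf_event *occupy_counter_for_guest(struct kvm_pmc *pmc,
						   u32 type, u64 config)
{
	struct perf_event_attr attr = {
		.type		= type,
		.size		= sizeof(attr),
		.config		= config,
		.pinned		= true,	/* must stay on its counter */
		.exclude_host	= 1,	/* only count while the guest runs */
	};

	return perf_event_create_kernel_counter(&attr, -1, current,
						kvm_perf_overflow_intr, pmc);
}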



This is something that can certainly get broken by accident.

Is there any documentation/comment that explains how this virtual PMU
crud works in general?


I haven't found any docs that could be useful so far.



+u64 intel_pmu_disable_guest_counters(void)
+{
+   struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+   u64 mask = cpuc->intel_ctrl_host_mask;
+
+   cpuc->intel_ctrl_host_mask = ULONG_MAX;
+
+   return mask;
+}
+EXPORT_SYMBOL_GPL(intel_pmu_disable_guest_counters);

OK, this then gets the MSR written when we re-enter the guest, after the
WRMSR trap, right?


Yes, the guest value will be loaded to the MSR.
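
Presumably the enable side is just the mirror image, restoring the saved mask
so those counters run again at the next VM entry. A sketch of what
intel_pmu_enable_guest_counters() is assumed to look like (it isn't quoted in
this thread):

void intel_pmu_enable_guest_counters(u64 mask)
{
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

	/* Restore the host-only mask saved by the disable side, so the
	 * counters handed to the guest count again once we re-enter it. */
	cpuc->intel_ctrl_host_mask = mask;
}
EXPORT_SYMBOL_GPL(intel_pmu_enable_guest_counters);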



+   /*
+* The guest PMI handler is asking for enabling the perf
+* counters. This happens at the end of the guest PMI handler,
+* so clear in_pmi.
+*/
+   intel_pmu_enable_guest_counters(pmu->counter_mask);
+   pmu->in_pmi = false;
+   }
+}
The v4 PMI handler does not in fact do that I think.


@@ -237,9 +267,23 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_fixed_pmc(pmu, msr))) {
-   if (!msr_info->host_initiated)
-   data = (s64)(s32)data;
-   pmc->counter += data - pmc_read_counter(pmc);
+   if (pmu->in_pmi) {
+   /*
+* Since we are not re-allocating a perf event
+* to reconfigure the sampling time when the
+* guest pmu is in PMI, just set the value to
+* the hardware perf counter. Counting will
+* continue after the guest enables the
+* counter bit in MSR_CORE_PERF_GLOBAL_CTRL.
+*/
+   struct hw_perf_event *hwc =
+   &pmc->perf_event->hw;
+   wrmsrl(hwc->event_base, data);

But all this relies on the event calling the overflow handler; how does
this not corrupt the event state such that x86_perf_event_set_period()
might decide that the generated PMI is a spurious one?



We will make the optimization more general in the next version, instead of
relying on the PMI case, so the above two questions should no longer apply then.


Best,
Wei





Re: [PATCH v1] KVM/x86/vPMU: Guest PMI Optimization

2018-10-14 Thread Wei Wang

On 10/13/2018 04:09 PM, Paolo Bonzini wrote:



It's not clear to me why you're special casing PMIs here. The optimization
should work generically, right?

Yeah, you can even just check if the counter is in the struct
cpu_hw_events guest mask, and if so always write the counter MSR directly.


Not sure if we could do that. I think the guest mask on the host 
reflects which counters are used by the host.


Here is the plan I have in mind (a rough sketch follows below):
#1 Create a host perf event on the guest's first bit-setting of
MSR_CORE_PERF_GLOBAL_CTRL; meanwhile, disable the intercept of guest
accesses to this perf counter's related MSRs (i.e. config_base and event_base).

#2 When the vCPU is scheduled in,
#2.1 make the MSRs of the perf counters (assigned to the guest in
#1) intercepted again, so that a guest access to such a counter is captured
and the counter is marked "used", then disable the intercept again;
#2.2 also check whether any counter wasn't "used" in the last vCPU
time slice; if so, release that counter and its perf event.
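
A minimal sketch of the sched-in part (#2.1/#2.2), assuming a hypothetical
per-counter "used" flag and a hypothetical helper to toggle the MSR
intercepts -- illustrative only, not existing KVM code:

static void vpmu_sched_in(struct kvm_pmu *pmu)
{
	int i;

	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
		struct kvm_pmc *pmc = &pmu->gp_counters[i];

		if (!pmc->perf_event)
			continue;

		if (!pmc->used_in_last_slice) {		/* hypothetical flag */
			/* #2.2: idle for a whole time slice -> release the
			 * counter and its host perf event. */
			perf_event_release_kernel(pmc->perf_event);
			pmc->perf_event = NULL;
			continue;
		}

		/* #2.1: intercept the first access of the new slice; the
		 * MSR handler would set used_in_last_slice and then disable
		 * the intercept again. */
		pmc->used_in_last_slice = false;
		vpmu_set_msr_intercept(pmu, pmc, true);	/* hypothetical */
	}
}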






@@ -237,9 +267,23 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_fixed_pmc(pmu, msr))) {
-   if (!msr_info->host_initiated)
-   data = (s64)(s32)data;
-   pmc->counter += data - pmc_read_counter(pmc);
+   if (pmu->in_pmi) {
+   /*
+* Since we are not re-allocating a perf event
+* to reconfigure the sampling time when the
+* guest pmu is in PMI, just set the value to
+* the hardware perf counter. Counting will
+* continue after the guest enables the
+* counter bit in MSR_CORE_PERF_GLOBAL_CTRL.
+*/
+   struct hw_perf_event *hwc =
+   &pmc->perf_event->hw;
+   wrmsrl(hwc->event_base, data);

Is that guaranteed to be always called on the right CPU that will run the vcpu?

AFAIK there's an ioctl to set MSRs in the guest from qemu, I'm pretty sure
it won't handle that.

How much of the performance improvement comes from here?  In theory
pmc_read_counter() should always hit a relatively fast path, because the
smp_call_function_single in perf_event_read doesn't need an IPI.

In any case, this should be a separate patch.


Actually this change wasn't intended as a performance improvement. It was 
adapted to the "fast path" we added to the MSR_CORE_PERF_GLOBAL_CTRL 
write handling.


The old implementation captures the guest's update of the period in
pmc->counter, and then uses pmc->counter for the perf event creation, which
gets the guest-requested period written to the underlying counter via the
host perf core. The fast path avoids the perf event creation, so accordingly
we need to write the period value directly to the hardware counter.
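
Put differently, a simplified contrast of the two paths (distilled from the
hunk quoted above; not a literal copy of the patch):

	/* Old path: fold the guest's period write into pmc->counter.  The
	 * value only reaches the hardware later, when the counter is
	 * re-enabled via MSR_CORE_PERF_GLOBAL_CTRL and KVM re-creates the
	 * perf event with the new sample period. */
	pmc->counter += data - pmc_read_counter(pmc);

	/* Fast path: the old perf event is kept, so nothing else would
	 * propagate the new period; write it straight to the hardware
	 * counter backing this pmc. */
	wrmsrl(pmc->perf_event->hw.event_base, data);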


Best,
Wei





Re: [PATCH v1] KVM/x86/vPMU: Guest PMI Optimization

2018-10-13 Thread Peter Zijlstra
On Fri, Oct 12, 2018 at 08:20:17PM +0800, Wei Wang wrote:
> Guest changing MSR_CORE_PERF_GLOBAL_CTRL causes KVM to reprogram pmc
> counters, which re-allocates a host perf event. This process is

Yea gawds, that's horrific. Why does it do that? We have
PERF_EVENT_IOC_PERIOD which does that much better. Still, what you're
proposing is faster still -- if it is correct.

> This patch implements a fast path to handle the guest change of
> MSR_CORE_PERF_GLOBAL_CTRL for the guest pmi case. Guest change of the
> msr will be applied to the hardware when entering the guest, and the
> old perf event will continue to be used. The guest setting of the
> perf counter for the next irq period in pmi will also be written
> directly to the hardware counter when entering the guest.

What you're failing to explain here is why exactly it is ok to write to
the MSR directly without updating the perf_event state. I didn't take
the time to go through all that, but it certainly needs documenting.

This is something that can certainly get broken by accident.

Is there any documentation/comment that explains how this virtual PMU
crud works in general?

> +u64 intel_pmu_disable_guest_counters(void)
> +{
> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> + u64 mask = cpuc->intel_ctrl_host_mask;
> +
> + cpuc->intel_ctrl_host_mask = ULONG_MAX;
> +
> + return mask;
> +}
> +EXPORT_SYMBOL_GPL(intel_pmu_disable_guest_counters);

OK, this then gets the MSR written when we re-enter the guest, after the
WRMSR trap, right?

> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 58ead7d..210e5df 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -80,6 +80,7 @@ static void kvm_perf_overflow_intr(struct perf_event 
> *perf_event,
> (unsigned long *)&pmu->reprogram_pmi)) {
>   __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
>   kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
> + pmu->in_pmi = true;
>  
>   /*
>* Inject PMI. If vcpu was in a guest mode during NMI PMI
> diff --git a/arch/x86/kvm/pmu_intel.c b/arch/x86/kvm/pmu_intel.c
> index 5ab4a36..5f6ac3c 100644
> --- a/arch/x86/kvm/pmu_intel.c
> +++ b/arch/x86/kvm/pmu_intel.c
> @@ -55,6 +55,27 @@ static void reprogram_fixed_counters(struct kvm_pmu *pmu, 
> u64 data)
>   pmu->fixed_ctr_ctrl = data;
>  }
>  
> +static void fast_global_ctrl_changed(struct kvm_pmu *pmu, u64 data)
> +{
> + pmu->global_ctrl = data;
> +
> + if (!data) {
> + /*
> +  * The guest PMI handler is asking for disabling all the perf
> +  * counters
> +  */
> + pmu->counter_mask = intel_pmu_disable_guest_counters();
> + } else {
> + /*
> +  * The guest PMI handler is asking for enabling the perf
> +  * counters. This happens at the end of the guest PMI handler,
> +  * so clear in_pmi.
> +  */
> + intel_pmu_enable_guest_counters(pmu->counter_mask);
> + pmu->in_pmi = false;
> + }
> +}

The v4 PMI handler does not in fact do that I think.

> @@ -237,9 +267,23 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, 
> struct msr_data *msr_info)
>   default:
>   if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
>   (pmc = get_fixed_pmc(pmu, msr))) {
> - if (!msr_info->host_initiated)
> - data = (s64)(s32)data;
> - pmc->counter += data - pmc_read_counter(pmc);
> + if (pmu->in_pmi) {
> + /*
> +  * Since we are not re-allocating a perf event
> +  * to reconfigure the sampling time when the
> +  * guest pmu is in PMI, just set the value to
> +  * the hardware perf counter. Counting will
> +  * continue after the guest enables the
> +  * counter bit in MSR_CORE_PERF_GLOBAL_CTRL.
> +  */
> + struct hw_perf_event *hwc =
> + &pmc->perf_event->hw;
> + wrmsrl(hwc->event_base, data);

But all this relies on the event calling the overflow handler; how does
this not corrupt the event state such that x86_perf_event_set_period()
might decide that the generated PMI is a spurious one?

> + } else {
> + if (!msr_info->host_initiated)
> + data = (s64)(s32)data;
> + pmc->counter += data - pmc_read_counter(pmc);
> + }
>   return 0;
>   } else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
>   if (data == pmc->eventsel)
> -- 
> 2.7.4
> 


Re: [PATCH v1] KVM/x86/vPMU: Guest PMI Optimization

2018-10-13 Thread Paolo Bonzini
On 12/10/2018 18:30, Andi Kleen wrote:
>> 4. Results
>> - Without this optimization, the guest pmi handling time is
>>   ~4500000 ns, and the max sampling rate is reduced to 250.
>> - With this optimization, the guest pmi handling time is ~9000 ns
>>   (i.e. 1 / 500 of the non-optimization case), and the max sampling
>>   rate remains at the original 100000.
> 
> Impressive performance improvement!

Agreed!

> It's not clear to me why you're special casing PMIs here. The optimization
> should work generically, right?

Yeah, you can even just check if the counter is in the struct
cpu_hw_events guest mask, and if so always write the counter MSR directly.

>> @@ -237,9 +267,23 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, 
>> struct msr_data *msr_info)
>>  default:
>>  if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
>>  (pmc = get_fixed_pmc(pmu, msr))) {
>> -if (!msr_info->host_initiated)
>> -data = (s64)(s32)data;
>> -pmc->counter += data - pmc_read_counter(pmc);
>> +if (pmu->in_pmi) {
>> +/*
>> + * Since we are not re-allocating a perf event
>> + * to reconfigure the sampling time when the
>> + * guest pmu is in PMI, just set the value to
>> + * the hardware perf counter. Counting will
>> + * continue after the guest enables the
>> + * counter bit in MSR_CORE_PERF_GLOBAL_CTRL.
>> + */
>> +struct hw_perf_event *hwc =
>> +&pmc->perf_event->hw;
>> +wrmsrl(hwc->event_base, data);
> 
> Is that guaranteed to be always called on the right CPU that will run the 
> vcpu?
> 
> AFAIK there's an ioctl to set MSRs in the guest from qemu, I'm pretty sure
> it won't handle that.

How much of the performance improvement comes from here?  In theory
pmc_read_counter() should always hit a relatively fast path, because the
smp_call_function_single in perf_event_read doesn't need an IPI.

In any case, this should be a separate patch.

Paolo

> May need to be delayed to entry time.
> 
> -Andi
> 



RE: [PATCH v1] KVM/x86/vPMU: Guest PMI Optimization

2018-10-12 Thread Wang, Wei W
On Saturday, October 13, 2018 12:31 AM, Andi Kleen wrote:
> > 4. Results
> > - Without this optimization, the guest pmi handling time is
> >   ~4500000 ns, and the max sampling rate is reduced to 250.
> > - With this optimization, the guest pmi handling time is ~9000 ns
> >   (i.e. 1 / 500 of the non-optimization case), and the max sampling
> >   rate remains at the original 100000.
> 
> Impressive performance improvement!
> 
> It's not clear to me why you're special casing PMIs here. The optimization
> should work generically, right?


Yes, that seems doable. I plan to try a lazy approach for the perf event 
allocation. 

> Is that guaranteed to be always called on the right CPU that will run the 
> vcpu?
> 
> AFAIK there's an ioctl to set MSRs in the guest from qemu, I'm pretty sure it
> won't handle that.
 

Thanks, will consider that case.

Best,
Wei


Re: [PATCH v1] KVM/x86/vPMU: Guest PMI Optimization

2018-10-12 Thread Alexey Budankov
Hi,

On 12.10.2018 19:30, Andi Kleen wrote:
>> 4. Results
>> - Without this optimization, the guest pmi handling time is
>>   ~4500000 ns, and the max sampling rate is reduced to 250.
>> - With this optimization, the guest pmi handling time is ~9000 ns
>>   (i.e. 1 / 500 of the non-optimization case), and the max sampling
>>   rate remains at the original 100000.
> 
> Impressive performance improvement!

Might want this in distributions' and secondary kernels as well.

Thanks,
Alexey

> 
> It's not clear to me why you're special casing PMIs here. The optimization
> should work generically, right?
> 
> perf will enable/disable the PMU even outside PMIs, e.g. on context
> switches, which is a very important path too.
> 
>> @@ -237,9 +267,23 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, 
>> struct msr_data *msr_info)
>>  default:
>>  if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
>>  (pmc = get_fixed_pmc(pmu, msr))) {
>> -if (!msr_info->host_initiated)
>> -data = (s64)(s32)data;
>> -pmc->counter += data - pmc_read_counter(pmc);
>> +if (pmu->in_pmi) {
>> +/*
>> + * Since we are not re-allocating a perf event
>> + * to reconfigure the sampling time when the
>> + * guest pmu is in PMI, just set the value to
>> + * the hardware perf counter. Counting will
>> + * continue after the guest enables the
>> + * counter bit in MSR_CORE_PERF_GLOBAL_CTRL.
>> + */
>> +struct hw_perf_event *hwc =
>> +&pmc->perf_event->hw;
>> +wrmsrl(hwc->event_base, data);
> 
> Is that guaranteed to be always called on the right CPU that will run the 
> vcpu?
> 
> AFAIK there's an ioctl to set MSRs in the guest from qemu, I'm pretty sure
> it won't handle that.
> 
> May need to be delayed to entry time.
> 
> -Andi
> 


Re: [PATCH v1] KVM/x86/vPMU: Guest PMI Optimization

2018-10-12 Thread Andi Kleen
> 4. Results
> - Without this optimization, the guest pmi handling time is
>   ~4500000 ns, and the max sampling rate is reduced to 250.
> - With this optimization, the guest pmi handling time is ~9000 ns
>   (i.e. 1 / 500 of the non-optimization case), and the max sampling
>   rate remains at the original 100000.

Impressive performance improvement!

It's not clear to me why you're special casing PMIs here. The optimization
should work generically, right?

perf will enable/disable the PMU even outside PMIs, e.g. on context
switches, which is a very important path too.

> @@ -237,9 +267,23 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, 
> struct msr_data *msr_info)
>   default:
>   if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
>   (pmc = get_fixed_pmc(pmu, msr))) {
> - if (!msr_info->host_initiated)
> - data = (s64)(s32)data;
> - pmc->counter += data - pmc_read_counter(pmc);
> + if (pmu->in_pmi) {
> + /*
> +  * Since we are not re-allocating a perf event
> +  * to reconfigure the sampling time when the
> +  * guest pmu is in PMI, just set the value to
> +  * the hardware perf counter. Counting will
> +  * continue after the guest enables the
> +  * counter bit in MSR_CORE_PERF_GLOBAL_CTRL.
> +  */
> + struct hw_perf_event *hwc =
> + &pmc->perf_event->hw;
> + wrmsrl(hwc->event_base, data);

Is that guaranteed to be always called on the right CPU that will run the vcpu?

AFAIK there's an ioctl to set MSRs in the guest from qemu, I'm pretty sure
it won't handle that.

May need to be delayed to entry time.

-Andi

