Re: [PATCH v5 00/21] KVM: ARM64: Add guest PMU support

2015-12-07 Thread Shannon Zhao

Hi Marc,

On 2015/12/7 22:11, Marc Zyngier wrote:
> Shannon,
>
> On 03/12/15 06:11, Shannon Zhao wrote:
>> From: Shannon Zhao
>>
>> This patchset adds guest PMU support for KVM on ARM64. It takes
>> trap-and-emulate approach. When guest wants to monitor one event, it
>> will be trapped by KVM and KVM will call perf_event API to create a perf
>> event and call relevant perf_event APIs to get the count value of event.
>>
>> Use perf to test this patchset in guest. When using "perf list", it
>> shows the list of the hardware events and hardware cache events perf
>> supports. Then use "perf stat -e EVENT" to monitor some event. For
>> example, use "perf stat -e cycles" to count cpu cycles and
>> "perf stat -e cache-misses" to count cache misses.
>>
>> Below are the outputs of "perf stat -r 5 sleep 5" when running in host
>> and guest.
>>
>> Host:
>>  Performance counter stats for 'sleep 5' (5 runs):
>>
>>     0.510276  task-clock (msec)        #   0.000 CPUs utilized    ( +-  1.57% )
>>            1  context-switches         #   0.002 M/sec
>>            0  cpu-migrations           #   0.000 K/sec
>>           49  page-faults              #   0.096 M/sec            ( +-  0.77% )
>>      1064117  cycles                   #   2.085 GHz              ( +-  1.56% )
>>               stalled-cycles-frontend
>>               stalled-cycles-backend
>>       529051  instructions             #   0.50  insns per cycle  ( +-  0.55% )
>>               branches
>>         9894  branch-misses            #  19.390 M/sec            ( +-  1.70% )
>>
>>  5.000853900 seconds time elapsed                                 ( +-  0.00% )
>>
>> Guest:
>>  Performance counter stats for 'sleep 5' (5 runs):
>>
>>     0.642456  task-clock (msec)        #   0.000 CPUs utilized    ( +-  1.81% )
>>            1  context-switches         #   0.002 M/sec
>>            0  cpu-migrations           #   0.000 K/sec
>>           49  page-faults              #   0.076 M/sec            ( +-  1.64% )
>>      1322717  cycles                   #   2.059 GHz              ( +-  1.88% )
>>               stalled-cycles-frontend
>>               stalled-cycles-backend
>>       640944  instructions             #   0.48  insns per cycle  ( +-  1.10% )
>>               branches
>>        10665  branch-misses            #  16.600 M/sec            ( +-  2.23% )
>>
>>  5.001181452 seconds time elapsed                                 ( +-  0.00% )
>>
>> Have a cycle counter read test like below in guest and host:
>>
>> static void test(void)
>> {
>>         unsigned long count = 0, count1, count2;
>>         count1 = read_cycles();
>>         count++;
>>         count2 = read_cycles();
>> }
>>
>> Host:
>> count1: 3046186213
>> count2: 3046186347
>> delta: 134
>>
>> Guest:
>> count1: 5645797121
>> count2: 5645797270
>> delta: 149
>>
>> The gap between guest and host is very small. One reason for this I
>> think is that it doesn't count the cycles in EL2 and host since we add
>> exclude_hv = 1. So the cycles spent to store/restore registers which
>> happens at EL2 are not included.
>>
>> This patchset can be fetched from [1] and the relevant QEMU version for
>> test can be fetched from [2].
>>
>> The results of 'perf test' can be found from [3][4].
>> The results of perf_event_tests test suite can be found from [5][6].
>>
>> Also, I have tested "perf top" in two VMs and host at the same time. It
>> works well.


> I've commented on more issues I've found. Hopefully you'll be able to
> respin this quickly enough, and end up with a simpler code base (state
> duplication is a bit messy).
>
Ok, will try my best :)

> Another thing I have noticed is that you have dropped the vgic changes
> that were configuring the interrupt. It feels like they should be
> included, and configure the PPI as a LEVEL interrupt.

The reason I dropped that is that in upstream code PPIs are now LEVEL
interrupts by default, a change made by the arch_timers patches. So is
it necessary to configure it again?

> Also, looking at
> your QEMU code, you seem to configure the interrupt as EDGE, which is
> not how your emulated HW behaves.

Sorry, the QEMU code is not updated, while the version I use for testing
locally configures the interrupt as LEVEL. I will push the newest one
tomorrow.

> Looking forward to reviewing the next version.
>
> Thanks,
>
> M.



--
Shannon
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 00/21] KVM: ARM64: Add guest PMU support

2015-12-07 Thread Marc Zyngier
On 07/12/15 14:47, Shannon Zhao wrote:
> Hi Marc,
> 
> On 2015/12/7 22:11, Marc Zyngier wrote:
>> Shannon,
>>
>> On 03/12/15 06:11, Shannon Zhao wrote:
>>> From: Shannon Zhao 
>>>
>>> This patchset adds guest PMU support for KVM on ARM64. It takes
>>> trap-and-emulate approach. When guest wants to monitor one event, it
>>> will be trapped by KVM and KVM will call perf_event API to create a perf
>>> event and call relevant perf_event APIs to get the count value of event.
>>>
>>> Use perf to test this patchset in guest. When using "perf list", it
>>> shows the list of the hardware events and hardware cache events perf
>>> supports. Then use "perf stat -e EVENT" to monitor some event. For
>>> example, use "perf stat -e cycles" to count cpu cycles and
>>> "perf stat -e cache-misses" to count cache misses.
>>>
>>> Below are the outputs of "perf stat -r 5 sleep 5" when running in host
>>> and guest.
>>>
>>> Host:
>>>   Performance counter stats for 'sleep 5' (5 runs):
>>>
>>>     0.510276  task-clock (msec)        #   0.000 CPUs utilized    ( +-  1.57% )
>>>            1  context-switches         #   0.002 M/sec
>>>            0  cpu-migrations           #   0.000 K/sec
>>>           49  page-faults              #   0.096 M/sec            ( +-  0.77% )
>>>      1064117  cycles                   #   2.085 GHz              ( +-  1.56% )
>>>               stalled-cycles-frontend
>>>               stalled-cycles-backend
>>>       529051  instructions             #   0.50  insns per cycle  ( +-  0.55% )
>>>               branches
>>>         9894  branch-misses            #  19.390 M/sec            ( +-  1.70% )
>>>
>>>  5.000853900 seconds time elapsed                                 ( +-  0.00% )
>>>
>>> Guest:
>>>   Performance counter stats for 'sleep 5' (5 runs):
>>>
>>>     0.642456  task-clock (msec)        #   0.000 CPUs utilized    ( +-  1.81% )
>>>            1  context-switches         #   0.002 M/sec
>>>            0  cpu-migrations           #   0.000 K/sec
>>>           49  page-faults              #   0.076 M/sec            ( +-  1.64% )
>>>      1322717  cycles                   #   2.059 GHz              ( +-  1.88% )
>>>               stalled-cycles-frontend
>>>               stalled-cycles-backend
>>>       640944  instructions             #   0.48  insns per cycle  ( +-  1.10% )
>>>               branches
>>>        10665  branch-misses            #  16.600 M/sec            ( +-  2.23% )
>>>
>>>  5.001181452 seconds time elapsed                                 ( +-  0.00% )
>>>
>>> Have a cycle counter read test like below in guest and host:
>>>
>>> static void test(void)
>>> {
>>>         unsigned long count = 0, count1, count2;
>>>         count1 = read_cycles();
>>>         count++;
>>>         count2 = read_cycles();
>>> }
>>>
>>> Host:
>>> count1: 3046186213
>>> count2: 3046186347
>>> delta: 134
>>>
>>> Guest:
>>> count1: 5645797121
>>> count2: 5645797270
>>> delta: 149
>>>
>>> The gap between guest and host is very small. One reason for this I
>>> think is that it doesn't count the cycles in EL2 and host since we add
>>> exclude_hv = 1. So the cycles spent to store/restore registers which
>>> happens at EL2 are not included.
>>>
>>> This patchset can be fetched from [1] and the relevant QEMU version for
>>> test can be fetched from [2].
>>>
>>> The results of 'perf test' can be found from [3][4].
>>> The results of perf_event_tests test suite can be found from [5][6].
>>>
>>> Also, I have tested "perf top" in two VMs and host at the same time. It
>>> works well.
>>
>> I've commented on more issues I've found. Hopefully you'll be able to
>> respin this quickly enough, and end up with a simpler code base (state
>> duplication is a bit messy).
>>
> Ok, will try my best :)
> 
>> Another thing I have noticed is that you have dropped the vgic changes
>> that were configuring the interrupt. It feels like they should be
>> included, and configure the PPI as a LEVEL interrupt.
> The reason I dropped that is that in upstream code PPIs are now LEVEL
> interrupts by default, a change made by the arch_timers patches. So is
> it necessary to configure it again?

Ah, yes. Missed that. No, that's fine.

> 
>> Also, looking at
>> your QEMU code, you seem to configure the interrupt as EDGE, which is
>> not how your emulated HW behaves.
>>
> Sorry, the QEMU code is not updated, while the version I use for testing
> locally configures the interrupt as LEVEL. I will push the newest one
> tomorrow.

That'd be good.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...

Re: [PATCH v5 00/21] KVM: ARM64: Add guest PMU support

2015-12-07 Thread Marc Zyngier
Shannon,

On 03/12/15 06:11, Shannon Zhao wrote:
> From: Shannon Zhao 
> 
> This patchset adds guest PMU support for KVM on ARM64. It takes
> trap-and-emulate approach. When guest wants to monitor one event, it
> will be trapped by KVM and KVM will call perf_event API to create a perf
> event and call relevant perf_event APIs to get the count value of event.
> 
> Use perf to test this patchset in guest. When using "perf list", it
> shows the list of the hardware events and hardware cache events perf
> supports. Then use "perf stat -e EVENT" to monitor some event. For
> example, use "perf stat -e cycles" to count cpu cycles and
> "perf stat -e cache-misses" to count cache misses.
> 
> Below are the outputs of "perf stat -r 5 sleep 5" when running in host
> and guest.
> 
> Host:
>  Performance counter stats for 'sleep 5' (5 runs):
> 
>     0.510276  task-clock (msec)        #   0.000 CPUs utilized    ( +-  1.57% )
>            1  context-switches         #   0.002 M/sec
>            0  cpu-migrations           #   0.000 K/sec
>           49  page-faults              #   0.096 M/sec            ( +-  0.77% )
>      1064117  cycles                   #   2.085 GHz              ( +-  1.56% )
>               stalled-cycles-frontend
>               stalled-cycles-backend
>       529051  instructions             #   0.50  insns per cycle  ( +-  0.55% )
>               branches
>         9894  branch-misses            #  19.390 M/sec            ( +-  1.70% )
>
>  5.000853900 seconds time elapsed                                 ( +-  0.00% )
> 
> Guest:
>  Performance counter stats for 'sleep 5' (5 runs):
> 
>     0.642456  task-clock (msec)        #   0.000 CPUs utilized    ( +-  1.81% )
>            1  context-switches         #   0.002 M/sec
>            0  cpu-migrations           #   0.000 K/sec
>           49  page-faults              #   0.076 M/sec            ( +-  1.64% )
>      1322717  cycles                   #   2.059 GHz              ( +-  1.88% )
>               stalled-cycles-frontend
>               stalled-cycles-backend
>       640944  instructions             #   0.48  insns per cycle  ( +-  1.10% )
>               branches
>        10665  branch-misses            #  16.600 M/sec            ( +-  2.23% )
>
>  5.001181452 seconds time elapsed                                 ( +-  0.00% )
> 
> Have a cycle counter read test like below in guest and host:
> 
> static void test(void)
> {
>   unsigned long count = 0, count1, count2;
>   count1 = read_cycles();
>   count++;
>   count2 = read_cycles();
> }
> 
> Host:
> count1: 3046186213
> count2: 3046186347
> delta: 134
> 
> Guest:
> count1: 5645797121
> count2: 5645797270
> delta: 149
> 
> The gap between guest and host is very small. One reason for this I
> think is that it doesn't count the cycles in EL2 and host since we add
> exclude_hv = 1. So the cycles spent to store/restore registers which
> happens at EL2 are not included.
> 
> This patchset can be fetched from [1] and the relevant QEMU version for
> test can be fetched from [2].
> 
> The results of 'perf test' can be found from [3][4].
> The results of perf_event_tests test suite can be found from [5][6].
> 
> Also, I have tested "perf top" in two VMs and host at the same time. It
> works well.

I've commented on more issues I've found. Hopefully you'll be able to
respin this quickly enough, and end up with a simpler code base (state
duplication is a bit messy).

Another thing I have noticed is that you have dropped the vgic changes
that were configuring the interrupt. It feels like they should be
included, and configure the PPI as a LEVEL interrupt. Also, looking at
your QEMU code, you seem to configure the interrupt as EDGE, which is
not how your emulated HW behaves.
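
To make the LEVEL/EDGE distinction concrete, here is a minimal C model; it
is a sketch for illustration and is not taken from the patchset or from
QEMU. With level semantics the line follows the pending overflow state and
stays asserted until the guest clears the flag; an edge model only signals
the 0 -> 1 transition. The struct and function names are hypothetical. The
register roles (PMOVSSET as overflow status, PMINTENSET as interrupt
enable) follow the ARMv8 PMU, and gating by PMCR.E is left out for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the PMU state that drives the interrupt line. */
struct vpmu_irq_state {
    uint32_t pmovsset;   /* overflow status flags, one bit per counter */
    uint32_t pmintenset; /* overflow interrupt enables, one bit per counter */
    bool     prev_level; /* last computed level, used only by the edge model */
};

/* Level model: asserted for as long as any enabled overflow flag is set. */
static bool vpmu_irq_level(const struct vpmu_irq_state *s)
{
    return (s->pmovsset & s->pmintenset) != 0;
}

/* Edge model, shown for contrast: fires once on the rising edge only. */
static bool vpmu_irq_edge(struct vpmu_irq_state *s)
{
    bool level = vpmu_irq_level(s);
    bool fire = level && !s->prev_level;

    s->prev_level = level;
    return fire;
}
```

In this model a second query while the overflow flag is still set returns
true for the level line but false for the edge line, which is the
behavioural difference at issue.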

Looking forward to reviewing the next version.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


[PATCH v5 00/21] KVM: ARM64: Add guest PMU support

2015-12-02 Thread Shannon Zhao
From: Shannon Zhao 

This patchset adds guest PMU support for KVM on ARM64. It takes a
trap-and-emulate approach: when the guest wants to monitor an event, the
access is trapped by KVM, and KVM calls the perf_event API to create a
perf event and then calls the relevant perf_event APIs to read the
event's count.

Use perf to test this patchset in the guest. "perf list" shows the
hardware events and hardware cache events that perf supports. Then use
"perf stat -e EVENT" to monitor an event. For example, use
"perf stat -e cycles" to count CPU cycles and
"perf stat -e cache-misses" to count cache misses.

Below are the outputs of "perf stat -r 5 sleep 5" when run on the host
and in the guest.

Host:
 Performance counter stats for 'sleep 5' (5 runs):

    0.510276  task-clock (msec)        #   0.000 CPUs utilized    ( +-  1.57% )
           1  context-switches         #   0.002 M/sec
           0  cpu-migrations           #   0.000 K/sec
          49  page-faults              #   0.096 M/sec            ( +-  0.77% )
     1064117  cycles                   #   2.085 GHz              ( +-  1.56% )
              stalled-cycles-frontend
              stalled-cycles-backend
      529051  instructions             #   0.50  insns per cycle  ( +-  0.55% )
              branches
        9894  branch-misses            #  19.390 M/sec            ( +-  1.70% )

 5.000853900 seconds time elapsed                                 ( +-  0.00% )

Guest:
 Performance counter stats for 'sleep 5' (5 runs):

    0.642456  task-clock (msec)        #   0.000 CPUs utilized    ( +-  1.81% )
           1  context-switches         #   0.002 M/sec
           0  cpu-migrations           #   0.000 K/sec
          49  page-faults              #   0.076 M/sec            ( +-  1.64% )
     1322717  cycles                   #   2.059 GHz              ( +-  1.88% )
              stalled-cycles-frontend
              stalled-cycles-backend
      640944  instructions             #   0.48  insns per cycle  ( +-  1.10% )
              branches
       10665  branch-misses            #  16.600 M/sec            ( +-  2.23% )

 5.001181452 seconds time elapsed                                 ( +-  0.00% )
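
The derived columns in the perf output can be recomputed from the raw
counts. The sketch below is only an illustration of that arithmetic; the
function names are made up for this example and are not part of perf.

```c
#include <stdint.h>

/* Instructions retired per cycle, as in the "insns per cycle" column. */
static double insns_per_cycle(uint64_t instructions, uint64_t cycles)
{
    return cycles ? (double)instructions / (double)cycles : 0.0;
}

/* Effective clock in GHz: cycles divided by task-clock time in nanoseconds. */
static double effective_ghz(uint64_t cycles, double task_clock_msec)
{
    if (task_clock_msec <= 0.0)
        return 0.0;
    return (double)cycles / (task_clock_msec * 1e6);
}
```

With the numbers above, insns_per_cycle(529051, 1064117) is roughly 0.50
for the host and insns_per_cycle(640944, 1322717) is roughly 0.48 for the
guest, and effective_ghz() gives about 2.085 and 2.059, matching the
values perf printed.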

A cycle counter read test like the one below was run in the guest and on
the host:

static void test(void)
{
        unsigned long count = 0, count1, count2;
        count1 = read_cycles();
        count++;
        count2 = read_cycles();
}

Host:
count1: 3046186213
count2: 3046186347
delta: 134

Guest:
count1: 5645797121
count2: 5645797270
delta: 149
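
For reference, the deltas above and the relative guest overhead can be
worked out as follows. This is a small sketch with hypothetical helper
names; 64-bit types are used because the raw counts no longer fit in 32
bits.

```c
#include <stdint.h>

/* Cycles elapsed between two reads; unsigned subtraction also covers one wrap. */
static uint64_t cycle_delta(uint64_t before, uint64_t after)
{
    return after - before;
}

/* Extra cycles seen in the guest relative to the host, in percent. */
static double guest_overhead_pct(uint64_t host_delta, uint64_t guest_delta)
{
    if (host_delta == 0)
        return 0.0;
    return 100.0 * ((double)guest_delta - (double)host_delta) / (double)host_delta;
}
```

Here cycle_delta() gives 134 for the host and 149 for the guest, so
guest_overhead_pct(134, 149) comes to about 11%.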

The gap between guest and host is very small. One reason for this, I
think, is that cycles spent in EL2 and in the host are not counted,
since we set exclude_hv = 1. So the cycles spent saving/restoring
registers, which happens at EL2, are not included.
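
As a rough userspace analogue of that setting, the sketch below opens a
cycle counter through the perf_event_open(2) syscall with exclude_hv set.
It is an assumption-laden illustration only: the patchset creates its
events from inside KVM with the in-kernel perf API, which is not shown
here, and the helper names are invented for this example.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Build an attr for a CPU-cycle counter that skips hypervisor cycles. */
static struct perf_event_attr cycles_attr(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;   /* start stopped; enabled explicitly below */
    attr.exclude_hv = 1; /* do not count cycles spent in the hypervisor */
    return attr;
}

/* Count cycles across fn(); returns 0 on success, -1 if perf is unavailable. */
static int count_cycles(void (*fn)(void), uint64_t *cycles)
{
    struct perf_event_attr attr = cycles_attr();
    int fd, ok;

    /* pid 0, cpu -1: measure the calling thread on whichever CPU it runs. */
    fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    ok = read(fd, cycles, sizeof(*cycles)) == (ssize_t)sizeof(*cycles);
    close(fd);
    return ok ? 0 : -1;
}
```

count_cycles() may return -1 on systems where perf events are restricted
or no hardware PMU is exposed, so callers should treat the result as
optional.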

This patchset can be fetched from [1], and the matching QEMU version for
testing can be fetched from [2].

The results of 'perf test' can be found at [3][4].
The results of the perf_event_tests test suite can be found at [5][6].

Also, I have tested "perf top" in two VMs and on the host at the same
time. It works well.

Thanks,
Shannon

[1] https://git.linaro.org/people/shannon.zhao/linux-mainline.git  KVM_ARM64_PMU_v5
[2] https://git.linaro.org/people/shannon.zhao/qemu.git  virtual_PMU
[3] http://people.linaro.org/~shannon.zhao/PMU/perf-test-host.txt
[4] http://people.linaro.org/~shannon.zhao/PMU/perf-test-guest.txt
[5] http://people.linaro.org/~shannon.zhao/PMU/perf_event_tests-host.txt
[6] http://people.linaro.org/~shannon.zhao/PMU/perf_event_tests-guest.txt

Changes since v4:
* Rebase on the latest Linux kernel mainline
* Drop the reset handler of CP15 registers
* Fix a compile failure on arch ARM due to lack of asm/pmu.h
* Refactor the interrupt injection flow according to Marc's suggestion
* Check the value of PMSELR register
* Calculate attr.disabled according to PMCR.E and PMCNTENSET/CLR
* Fix some coding style issues
* Document the vPMU irq range
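
The attr.disabled item above can be pictured with the following sketch. It
is a guess at the shape of the logic, not the code from the patch: the
helper name and parameters are hypothetical, while PMCR.E being bit 0 and
the per-counter enable bits come from the ARMv8 PMU definition.

```c
#include <stdbool.h>
#include <stdint.h>

#define PMCR_E (1u << 0) /* PMCR.E: global enable for the counters */

/*
 * Hypothetical helper: a counter runs only if PMCR.E is set and its bit is
 * set in the enable mask (set via PMCNTENSET, cleared via PMCNTENCLR).
 * Returns the value one would store in perf_event_attr.disabled.
 */
static bool vpmu_attr_disabled(uint32_t pmcr, uint32_t pmcnten, unsigned int idx)
{
    bool enabled;

    if (idx >= 32)
        return true; /* out-of-range index: treat as disabled */

    enabled = (pmcr & PMCR_E) && (pmcnten & (1u << idx));
    return !enabled;
}
```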

Changes since v3:
* Rebase on the latest Linux kernel mainline
* Use ARMV8_MAX_COUNTERS instead of 32
* Reset PMCR.E to zero.
* Trigger overflow for software increment.
* Optimize PMU interrupt injection logic.
* Add handler for E,C,P bits of PMCR
* Fix the overflow bug found by perf_event_tests
* Run 'perf test', 'perf top' and perf_event_tests test suite
* Add exclude_hv = 1 configuration to not count in EL2
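
The software-increment overflow item can be sketched as below. This is a
simplified model under stated assumptions, not the patch's code: the
struct, the counter count, and the function name are hypothetical, PMCR.E
gating is omitted, and only the facts that SW_INCR is event 0x00 and that
these event counters are 32 bits wide are taken from the ARMv8 PMU.

```c
#include <stdint.h>

#define NR_COUNTERS 31u   /* assumed number of event counters in this model */
#define EVT_SW_INCR 0x00u /* ARMv8 PMU event number for SW_INCR */

struct vpmu {
    uint32_t counter[NR_COUNTERS]; /* 32-bit event counters */
    uint32_t evtype[NR_COUNTERS];  /* event selected for each counter */
    uint32_t enabled;              /* per-counter enable bits */
    uint32_t overflow;             /* overflow flags, as in PMOVSSET */
};

/* Model a write of `mask` to PMSWINC: bump each selected SW_INCR counter. */
static void vpmu_software_increment(struct vpmu *p, uint32_t mask)
{
    for (unsigned int i = 0; i < NR_COUNTERS; i++) {
        if (!(mask & (1u << i)) || !(p->enabled & (1u << i)))
            continue;
        if (p->evtype[i] != EVT_SW_INCR)
            continue;

        p->counter[i]++;            /* wraps at 2^32 */
        if (p->counter[i] == 0)
            p->overflow |= 1u << i; /* record the wrap as an overflow */
    }
}
```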

Changes since v2:
* Directly use perf raw event type to create perf_event in KVM
* Add a helper vcpu_sysreg_write
* Remove an unrelated header file

Changes since v1:
* Use switch...case for the register access handler instead of adding
  a separate handler for each register
* Try to use the sys_regs to store the register value instead of adding
  new variables in struct kvm_pmc
* Fix the