Re: [Qemu-devel] [PATCH 0/2] exec: alternative fix for master abort woes

2013-11-11 Thread Michael S. Tsirkin
On Thu, Nov 07, 2013 at 06:29:40PM +0100, Paolo Bonzini wrote:
 On 07/11/2013 17:47, Michael S. Tsirkin wrote:
  That's on kvm with 52 bit address.
  But where I would be concerned is systems with e.g. 36 bit address
  space where we are doubling the cost of the lookup.
  E.g. try i386 and not x86_64.
 
 Tried now...
 
 P_L2_LEVELS       pre-patch   post-patch
    i386               3           6
    x86_64             4           6
 
 I timed the inl_from_qemu test of vmexit.flat with both KVM and TCG.  With
 TCG there's indeed a visible penalty of 20 cycles for i386 and 10 for x86_64
 (you can extrapolate to 30 cycles for TARGET_PHYS_ADDR_SPACE_BITS=32 targets).

So how did you measure this exactly?

 These can be more or less entirely ascribed to phys_page_find:
 
                                       TCG          |            KVM
                              pre-patch  post-patch | pre-patch  post-patch
 phys_page_find(i386)            13%        25%     |    0.6%       1%
 inl_from_qemu cycles(i386)      153        173     |   ~12000    ~12000
 phys_page_find(x86_64)          18%        25%     |    0.8%       1%
 inl_from_qemu cycles(x86_64)    163        173     |   ~12000    ~12000
 
 Thus this patch costs 0.4% in the worst case for KVM, 12% in the worst case
 for TCG.  The cycle breakdown is:
 
 60 phys_page_find
 28 access_with_adjusted_size
 24 address_space_translate_internal
 20 address_space_rw
 13 io_mem_read
 11 address_space_translate
  9 memory_region_read_accessor
  6 memory_region_access_valid
  4 helper_inl
  4 memory_access_size
  3 cpu_inl
 
 (This run reported 177 cycles per access; the total is 182 due to rounding).
 It is probably possible to shave at least 10 cycles from the functions below,
 or to make the depth of the tree dynamic so that you would save even more
 compared to 1.6.0.
 
 Also, compiling with -fstack-protector instead of -fstack-protector-all,
 as suggested a while ago by rth, is already giving a savings of 20 cycles.
 
 And of course, if this were a realistic test, KVM's 60x penalty would
 be a severe problem---but it isn't, because this is not a realistic setting.
 
 Paolo



Re: [Qemu-devel] [PATCH 0/2] exec: alternative fix for master abort woes

2013-11-11 Thread Paolo Bonzini
On 11/11/2013 17:43, Michael S. Tsirkin wrote:
 On Thu, Nov 07, 2013 at 06:29:40PM +0100, Paolo Bonzini wrote:
 On 07/11/2013 17:47, Michael S. Tsirkin wrote:
 That's on kvm with 52 bit address.
 But where I would be concerned is systems with e.g. 36 bit address
 space where we are doubling the cost of the lookup.
 E.g. try i386 and not x86_64.

 Tried now...

 P_L2_LEVELS       pre-patch   post-patch
    i386               3           6
    x86_64             4           6

 I timed the inl_from_qemu test of vmexit.flat with both KVM and TCG.  With
 TCG there's indeed a visible penalty of 20 cycles for i386 and 10 for x86_64
 (you can extrapolate to 30 cycles for TARGET_PHYS_ADDR_SPACE_BITS=32 targets).
 
 So how did you measure this exactly?

I mention extrapolation because x86 is TARGET_PHYS_ADDR_SPACE_BITS=36,
not 32.
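
A rough way to see where the 30-cycle extrapolation comes from (a
back-of-the-envelope reading of the numbers above, not necessarily the
exact calculation): a genuinely 32-bit physical address space needs only
2 radix-tree levels pre-patch versus 6 post-patch, and the measured
penalties work out to roughly 5-7 cycles per extra level:

\[
(6 - 2)\ \text{extra levels} \times \left(\tfrac{20}{6-3} \approx 6.7\ \text{cycles per level, from the i386 numbers}\right) \approx 27 \approx 30\ \text{cycles}.
\]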

Paolo



[Qemu-devel] [PATCH 0/2] exec: alternative fix for master abort woes

2013-11-07 Thread Paolo Bonzini
This fixes the problems with the misalignment of the master abort region.
See patch 2 for details; patch 1 is just a preparatory search-and-replace
patch.

Paolo Bonzini (2):
  split definitions for exec.c and translate-all.c radix trees
  exec: make address spaces 64-bit wide

 exec.c  | 28 
 translate-all.c | 32 ++--
 translate-all.h |  7 ---
 3 files changed, 34 insertions(+), 33 deletions(-)

-- 
1.8.4.2




Re: [Qemu-devel] [PATCH 0/2] exec: alternative fix for master abort woes

2013-11-07 Thread Michael S. Tsirkin
On Thu, Nov 07, 2013 at 05:14:35PM +0100, Paolo Bonzini wrote:
 This fixes the problems with the misalignment of the master abort region.
 See patch 2 for details; patch 1 is just a preparatory search-and-replace
 patch.
 
 Paolo Bonzini (2):
   split definitions for exec.c and translate-all.c radix trees
   exec: make address spaces 64-bit wide

Can you please share info on testing you did?

  exec.c  | 28 
  translate-all.c | 32 ++--
  translate-all.h |  7 ---
  3 files changed, 34 insertions(+), 33 deletions(-)
 
 -- 
 1.8.4.2



Re: [Qemu-devel] [PATCH 0/2] exec: alternative fix for master abort woes

2013-11-07 Thread Paolo Bonzini
On 07/11/2013 17:21, Michael S. Tsirkin wrote:
  This fixes the problems with the misalignment of the master abort region.
  See patch 2 for details; patch 1 is just a preparatory search-and-replace
  patch.
  
  Paolo Bonzini (2):
split definitions for exec.c and translate-all.c radix trees
exec: make address spaces 64-bit wide
 Can you please share info on testing you did?
 

make check, booting a RHEL guest with both KVM and TCG, Luiz's gdb
crash.  I also ran vmexit.flat from kvm-unit-tests and checked that
there was no measurable slowdown.

Paolo



Re: [Qemu-devel] [PATCH 0/2] exec: alternative fix for master abort woes

2013-11-07 Thread Michael S. Tsirkin
On Thu, Nov 07, 2013 at 05:29:15PM +0100, Paolo Bonzini wrote:
 On 07/11/2013 17:21, Michael S. Tsirkin wrote:
   This fixes the problems with the misalignment of the master abort region.
   See patch 2 for details; patch 1 is just a preparatory search-and-replace
   patch.
   
   Paolo Bonzini (2):
 split definitions for exec.c and translate-all.c radix trees
 exec: make address spaces 64-bit wide
  Can you please share info on testing you did?
  
 
 make check, booting a RHEL guest with both KVM and TCG, Luiz's gdb
 crash.  I also ran vmexit.flat from kvm-unit-tests and checked that
 there was no measurable slowdown.
 
 Paolo

That's on kvm with 52 bit address.
But where I would be concerned is systems with e.g. 36 bit address
space where we are doubling the cost of the lookup.
E.g. try i386 and not x86_64.

-- 
MST



Re: [Qemu-devel] [PATCH 0/2] exec: alternative fix for master abort woes

2013-11-07 Thread Paolo Bonzini
On 07/11/2013 17:47, Michael S. Tsirkin wrote:
 That's on kvm with 52 bit address.
 But where I would be concerned is systems with e.g. 36 bit address
 space where we are doubling the cost of the lookup.
 E.g. try i386 and not x86_64.

Tried now...

P_L2_LEVELS       pre-patch   post-patch
   i386               3           6
   x86_64             4           6
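
For reference, the level counts above follow directly from the
address-space width.  A minimal sketch of the arithmetic, assuming 4 KiB
pages and 10 bits resolved per tree level; the names below are
illustrative, not the exact exec.c macros:

#include <stdio.h>

#define TARGET_PAGE_BITS 12   /* 4 KiB pages */
#define L2_BITS          10   /* bits resolved per tree level (assumed) */

/* ceil((phys_bits - TARGET_PAGE_BITS) / L2_BITS) */
static int p_l2_levels(int phys_bits)
{
    return (phys_bits - TARGET_PAGE_BITS + L2_BITS - 1) / L2_BITS;
}

int main(void)
{
    printf("i386,   36-bit: %d levels\n", p_l2_levels(36));  /* 3 (pre-patch)  */
    printf("x86_64, 52-bit: %d levels\n", p_l2_levels(52));  /* 4 (pre-patch)  */
    printf("fixed   64-bit: %d levels\n", p_l2_levels(64));  /* 6 (post-patch) */
    return 0;
}

With the patch every target walks the full 64-bit (6-level) tree, which
is where the extra levels for i386 and x86_64 come from.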

I timed the inl_from_qemu test of vmexit.flat with both KVM and TCG.  With
TCG there's indeed a visible penalty of 20 cycles for i386 and 10 for x86_64
(you can extrapolate to 30 cycles for TARGET_PHYS_ADDR_SPACE_BITS=32 targets).
These can be more or less entirely ascribed to phys_page_find:

                                       TCG          |            KVM
                              pre-patch  post-patch | pre-patch  post-patch
 phys_page_find(i386)            13%        25%     |    0.6%       1%
 inl_from_qemu cycles(i386)      153        173     |   ~12000    ~12000
 phys_page_find(x86_64)          18%        25%     |    0.8%       1%
 inl_from_qemu cycles(x86_64)    163        173     |   ~12000    ~12000

Thus this patch costs 0.4% in the worst case for KVM, 12% in the worst case
for TCG.  The cycle breakdown is:

60 phys_page_find
28 access_with_adjusted_size
24 address_space_translate_internal
20 address_space_rw
13 io_mem_read
11 address_space_translate
 9 memory_region_read_accessor
 6 memory_region_access_valid
 4 helper_inl
 4 memory_access_size
 3 cpu_inl

(This run reported 177 cycles per access; the total is 182 due to rounding).
It is probably possible to shave at least 10 cycles from the functions below,
or to make the depth of the tree dynamic so that you would save even more
compared to 1.6.0.
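
A toy illustration of the dynamic-depth idea; this is not QEMU's actual
phys-map code, just a sketch of what passing the level count at run time,
instead of hard-coding the 64-bit depth, could look like:

#include <stdint.h>
#include <stddef.h>

#define TARGET_PAGE_BITS 12
#define L2_BITS          10
#define L2_SIZE          (1 << L2_BITS)

/* Interior nodes are plain arrays of L2_SIZE pointers; at the last level
 * the entries point at the leaf (section) descriptors themselves. */
typedef void **Node;

/* Walk only as many levels as the mapped address width actually needs,
 * instead of always starting from the 64-bit depth. */
static void *phys_page_lookup(Node root, uint64_t addr, int levels)
{
    void *p = root;
    int shift = TARGET_PAGE_BITS + levels * L2_BITS;

    for (int i = 0; i < levels && p != NULL; i++) {
        shift -= L2_BITS;
        p = ((void **)p)[(addr >> shift) & (L2_SIZE - 1)];
    }
    return p;   /* NULL for an unmapped address */
}

The point is only that a guest with a 36-bit physical address space would
walk 3 levels instead of 6, recovering most of the pre-patch cost.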

Also, compiling with -fstack-protector instead of -fstack-protector-all,
as suggested a while ago by rth, is already giving a savings of 20 cycles.

And of course, if this were a realistic test, KVM's 60x penalty would
be a severe problem---but it isn't, because this is not a realistic setting.

Paolo



Re: [Qemu-devel] [PATCH 0/2] exec: alternative fix for master abort woes

2013-11-07 Thread Michael S. Tsirkin
On Thu, Nov 07, 2013 at 06:29:40PM +0100, Paolo Bonzini wrote:
  On 07/11/2013 17:47, Michael S. Tsirkin wrote:
  That's on kvm with 52 bit address.
  But where I would be concerned is systems with e.g. 36 bit address
  space where we are doubling the cost of the lookup.
  E.g. try i386 and not x86_64.
 
 Tried now...
 
 P_L2_LEVELS       pre-patch   post-patch
    i386               3           6
    x86_64             4           6
 
 I timed the inl_from_qemu test of vmexit.flat with both KVM and TCG.  With
 TCG there's indeed a visible penalty of 20 cycles for i386 and 10 for x86_64
 (you can extrapolate to 30 cycles for TARGET_PHYS_ADDR_SPACE_BITS=32 targets).
 These can be more or less entirely ascribed to phys_page_find:
 
                                       TCG          |            KVM
                              pre-patch  post-patch | pre-patch  post-patch
 phys_page_find(i386)            13%        25%     |    0.6%       1%
 inl_from_qemu cycles(i386)      153        173     |   ~12000    ~12000

I'm a bit confused by the numbers above. The % of phys_page_find has
grown from 13% to 25% (almost double, which is kind of expected
given we have twice the # of levels). But the overhead in # of cycles
only went from 153 to 173? Maybe the test is a bit wrong for tcg -
how about unrolling the loop in the kvm unit test?


diff --git a/x86/vmexit.c b/x86/vmexit.c
index 957d0cc..405d545 100644
--- a/x86/vmexit.c
+++ b/x86/vmexit.c
@@ -40,6 +40,15 @@ static unsigned int inl(unsigned short port)
 {
     unsigned int val;
     asm volatile("inl %w1, %0" : "=a"(val) : "Nd"(port));
+    asm volatile("inl %w1, %0" : "=a"(val) : "Nd"(port));
+    asm volatile("inl %w1, %0" : "=a"(val) : "Nd"(port));
+    asm volatile("inl %w1, %0" : "=a"(val) : "Nd"(port));
+    asm volatile("inl %w1, %0" : "=a"(val) : "Nd"(port));
+    asm volatile("inl %w1, %0" : "=a"(val) : "Nd"(port));
+    asm volatile("inl %w1, %0" : "=a"(val) : "Nd"(port));
+    asm volatile("inl %w1, %0" : "=a"(val) : "Nd"(port));
+    asm volatile("inl %w1, %0" : "=a"(val) : "Nd"(port));
+    asm volatile("inl %w1, %0" : "=a"(val) : "Nd"(port));
     return val;
 }
 

Then you have to divide the reported result by 10.
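
For reference, a rough sketch of the measurement idea, reusing the inl()
from the diff above; this is not the actual vmexit.flat harness, and
rdtsc()/cycles_per_access() here are illustrative helpers:

#include <stdint.h>

/* Sketch only: time a batch of calls with RDTSC.  With the ten
 * back-to-back inl instructions above, each call now does ten port
 * reads, so divide by the unroll factor as well. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static uint64_t cycles_per_access(unsigned short port, unsigned long iters)
{
    uint64_t t0 = rdtsc();
    for (unsigned long i = 0; i < iters; i++)
        inl(port);                     /* 10 inl instructions per call */
    uint64_t t1 = rdtsc();
    return (t1 - t0) / iters / 10;     /* per call, then per port access */
}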

 phys_page_find(x86_64)          18%        25%     |    0.8%       1%
 inl_from_qemu cycles(x86_64)    163        173     |   ~12000    ~12000
 
 Thus this patch costs 0.4% in the worst case for KVM, 12% in the worst case
 for TCG.  The cycle breakdown is:
 
 60 phys_page_find
 28 access_with_adjusted_size
 24 address_space_translate_internal
 20 address_space_rw
 13 io_mem_read
 11 address_space_translate
  9 memory_region_read_accessor
  6 memory_region_access_valid
  4 helper_inl
  4 memory_access_size
  3 cpu_inl
 
 (This run reported 177 cycles per access; the total is 182 due to rounding).
 It is probably possible to shave at least 10 cycles from the functions below,
 or to make the depth of the tree dynamic so that you would save even more
 compared to 1.6.0.
 
 Also, compiling with -fstack-protector instead of -fstack-protector-all,
 as suggested a while ago by rth, is already giving a savings of 20 cycles.
 

Is it true that with TCG this affects more than just MMIO
as phys_page_find will also sometimes run on CPU accesses to memory?

 And of course, if this were a realistic test, KVM's 60x penalty would
 be a severe problem---but it isn't, because this is not a realistic setting.
 
 Paolo

Well, for this argument to carry the day we'd need to design
a realistic test which isn't easy :)

-- 
MST



Re: [Qemu-devel] [PATCH 0/2] exec: alternative fix for master abort woes

2013-11-07 Thread Paolo Bonzini
On 07/11/2013 19:54, Michael S. Tsirkin wrote:
 On Thu, Nov 07, 2013 at 06:29:40PM +0100, Paolo Bonzini wrote:
 On 07/11/2013 17:47, Michael S. Tsirkin wrote:
 That's on kvm with 52 bit address.
 But where I would be concerned is systems with e.g. 36 bit address
 space where we are doubling the cost of the lookup.
 E.g. try i386 and not x86_64.

 Tried now...

 P_L2_LEVELS       pre-patch   post-patch
    i386               3           6
    x86_64             4           6

 I timed the inl_from_qemu test of vmexit.flat with both KVM and TCG.  With
 TCG there's indeed a visible penalty of 20 cycles for i386 and 10 for x86_64
 (you can extrapolate to 30 cycles for TARGET_PHYS_ADDR_SPACE_BITS=32 targets).
 These can be more or less entirely ascribed to phys_page_find:

                                       TCG          |            KVM
                              pre-patch  post-patch | pre-patch  post-patch
 phys_page_find(i386)            13%        25%     |    0.6%       1%
 inl_from_qemu cycles(i386)      153        173     |   ~12000    ~12000
 
 I'm a bit confused by the numbers above. The % of phys_page_find has
 grown from 13% to 25% (almost double, which is kind of expected
 given we have twice the # of levels).

Yes.

 But the overhead in # of cycles only went from 153 to 173?

new cycles / old cycles = 173 / 153 = 113%

% outside phys_page_find + % in phys_page_find*2 = 87% + 13%*2 = 113%
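
Written as a formula, with f the fraction of time spent in
phys_page_find (the only part whose cost doubles):

\[
\frac{T_\text{post}}{T_\text{pre}} \approx (1 - f) + 2f = 1 + f,
\qquad f = 0.13 \;\Rightarrow\; 1 + f = 1.13 \approx \frac{173}{153}.
\]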

 Maybe the test is a bit wrong for tcg - how about unrolling the
 loop in kvm unit test?

Done that already. :)

 Also, compiling with -fstack-protector instead of -fstack-protector-all,
 as suggested a while ago by rth, is already giving a savings of 20 cycles.
 
 Is it true that with TCG this affects more than just MMIO
 as phys_page_find will also sometimes run on CPU accesses to memory?

Yes.  I tried benchmarking with perf the boot of a RHEL guest, which has

                           TCG           |            KVM
                  pre-patch  post-patch  |  pre-patch  post-patch
 phys_page_find      3%         5.8%     |    0.9%        1.7%

This is actually higher than usual for KVM because there are many VGA
access during GRUB.

 And of course, if this were a realistic test, KVM's 60x penalty would
 be a severe problem---but it isn't, because this is not a realistic setting.
 
 Well, for this argument to carry the day we'd need to design
 a realistic test which isn't easy :)

Yes, I guess the number that matters is the extra 2% penalty for TCG
(the part that doesn't come from MMIO).

Paolo