Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages

2011-12-19 Thread Avi Kivity
On 12/16/2011 04:58 PM, Takuya Yoshikawa wrote:
  
  I'm not sure this is a meaningful test to verify this change is
  worthwhile, because while the shrinker tries to free a shadow page from
  one vm, the vm's position in the kvm list is changed, so over time
  the shrinker will cycle over all VMs.

 The test was for checking whether mmu_shrink() would work as intended.  Maybe
 it was not good as a changelog, sorry.


 I admit that I could not find any strong reason except for protecting the
 host from intentionally induced shadowing.

 But for that, don't you think that freeing the same amount of shadow pages
 from good and bad VMs equally is a bad thing?

 My method will try to free many shadow pages from VMs with many shadow
 pages;  e.g. if there is a pathological increase in shadow pages for one
 VM, that one will be intensively treated.

 If you agree with this reasoning, I will update the description and resend.

Well, if one guest is twice as large as other guests, then it will want
twice as many shadow pages.  So our goal should be to zap pages from the
guest with the highest (shadow pages / memory) ratio.

  
  Can you measure whether there is a significant difference in a synthetic
  workload, and what that change is? Perhaps apply {moderate, high} memory
  pressure load with {2, 4, 8, 16} VMs or something like that.
  

 I was running 4 VMs on my machine under sufficiently high memory pressure.  The
 problem was that mmu_shrink() was not tuned to be called under usual memory
 pressure: what I did was change the seeks and batch parameters and set ept=0.

 At least, I have checked that if I make one VM do many meaningless copies,
 while the others keep silent, the shrinker frees shadow pages intensively
 from that one.


 Anyway, I don't think making the shrinker call mmu_shrink() with the default
 batch size, nr_to_scan=128, and just trying to free one shadow page is
 good behaviour.

Yes, it's very conservative.  But on the other hand the shrinker is
tuned for dcache and icache, where there are usually tons of useless
objects.  If we have to free something, I'd rather free them instead of
mmu pages which tend to get recreated soon.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages

2011-12-19 Thread Takuya Yoshikawa

(2011/12/19 17:43), Avi Kivity wrote:

Well, if one guest is twice as large as other guests, then it will want
twice as many shadow pages.  So our goal should be to zap pages from the
guest with the highest (shadow pages / memory) ratio.



Can you measure whether there is a significant difference in a synthetic
workload, and what that change is? Perhaps apply {moderate, high} memory
pressure load with {2, 4, 8, 16} VMs or something like that.



I was running 4 VMs on my machine under sufficiently high memory pressure.  The
problem was that mmu_shrink() was not tuned to be called under usual memory
pressure: what I did was change the seeks and batch parameters and set ept=0.

At least, I have checked that if I make one VM do many meaningless copies,
while the others keep silent, the shrinker frees shadow pages intensively
from that one.


Anyway, I don't think making the shrinker call mmu_shrink() with the default
batch size, nr_to_scan=128, and just trying to free one shadow page is
good behaviour.


Yes, it's very conservative.  But on the other hand the shrinker is
tuned for dcache and icache, where there are usually tons of useless
objects.  If we have to free something, I'd rather free them instead of
mmu pages which tend to get recreated soon.



OK, to satisfy the requirements, I will do:

1. find the guest with the highest (shadow pages / memory) ratio
2. just zap one page from that guest, keeping the current conservative rate

I will update the patch.
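
For illustration only, step 1 might look something like the sketch below.
This is not the series' actual code: kvm_memory_size() is a hypothetical
helper (no such KVM function exists), and the full list walk is exactly
the per-VM cache-line cost questioned in the next message.

static struct kvm *pick_highest_ratio_kvm(void)
{
	struct kvm *kvm, *worst = NULL;
	unsigned long worst_ratio = 0;

	/* caller must hold kvm_lock; this touches cache lines in every kvm */
	list_for_each_entry(kvm, &kvm_list, list) {
		/* scale by 1000 so small ratios do not truncate to zero */
		unsigned long ratio = kvm->arch.n_used_mmu_pages * 1000 /
				      (kvm_memory_size(kvm) + 1);

		if (ratio > worst_ratio) {
			worst_ratio = ratio;
			worst = kvm;
		}
	}
	return worst;
}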

Takuya


Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages

2011-12-19 Thread Avi Kivity
On 12/19/2011 11:22 AM, Takuya Yoshikawa wrote:
 Yes, it's very conservative.  But on the other hand the shrinker is
 tuned for dcache and icache, where there are usually tons of useless
 objects.  If we have to free something, I'd rather free them instead of
 mmu pages which tend to get recreated soon.



 OK, to satisfy the requirements, I will do:

 1. find the guest with the highest (shadow pages / memory) ratio

How do you propose to do that efficiently?  We may have hundreds of
guests, or even more, on one host.  Each guest access will involve
bouncing a few cache lines.

 2. just zap one page from that guest, keeping the current
 conservative rate

 I will update the patch.

I think the current rate is too conservative.  No idea what a good one
is; I don't have a feel for the relation between shrinker callbacks
and memory pressure.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages

2011-12-19 Thread Takuya Yoshikawa

(2011/12/19 18:26), Avi Kivity wrote:

On 12/19/2011 11:22 AM, Takuya Yoshikawa wrote:

Yes, it's very conservative.  But on the other hand the shrinker is
tuned for dcache and icache, where there are usually tons of useless
objects.  If we have to free something, I'd rather free them instead of
mmu pages which tend to get recreated soon.




OK, to satisfy the requirements, I will do:

 1. find the guest with the highest (shadow pages / memory) ratio


How do you propose to do that efficiently?  We may have hundreds of
guests, or even more, on one host.  Each guest access will involve
bouncing a few cache lines.


IMO, the goal should be restricted to emergencies.

So a possible solution may be:
- we set the tuning parameters as conservative as possible
- pick up a guest with relatively high ratio
  (I have to think more about how to achieve this)
- move the vm_list head for fairness

In an emergency, we should not mind the performance penalty so much.




 2. just zap one page from that guest, keeping the current
conservative rate

I will update the patch.


I think the current rate is too conservative.  No idea what a good one
is; I don't have a feel for the relation between shrinker callbacks
and memory pressure.



When I tried to see what the current code is doing, frankly speaking,
I thought mmu_shrink() was not tested enough from the beginning.

I read the shrinker code as carefully as I could and realized that the
combination (seeks=10*default, batch=128) is not reasonable; the high seeks
value means the shrinker rarely calculates a value higher than 128, so
mmu_shrink() is almost never called in normal operation.

How about setting the batch a bit lower, keeping seeks as is?
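
For reference, KVM registers its shrinker roughly as

	static struct shrinker mmu_shrinker = {
		.shrink = mmu_shrink,
		.seeks = DEFAULT_SEEKS * 10,
	};

with no .batch, so batch_size falls back to SHRINK_BATCH (128).  The sketch
below is a simplified paraphrase, from memory, of the era's shrink_slab()
accounting in mm/vmscan.c (details may differ); it shows why the large seeks
value starves mmu_shrink():

static void shrink_one_slab(struct shrinker *shrinker,
			    struct shrink_control *sc,
			    unsigned long nr_pages_scanned,
			    unsigned long lru_pages)
{
	unsigned long nr_objects = do_shrinker_shrink(shrinker, sc, 0);
	unsigned long batch_size = shrinker->batch ? : 128; /* SHRINK_BATCH */
	unsigned long long delta;

	/* a large ->seeks makes delta tiny, so ->nr grows very slowly */
	delta = (4 * nr_pages_scanned) / shrinker->seeks;
	delta *= nr_objects;
	do_div(delta, lru_pages + 1);
	shrinker->nr += delta;

	/* ->shrink() only does real work once this threshold is crossed */
	while (shrinker->nr >= batch_size) {
		do_shrinker_shrink(shrinker, sc, batch_size);
		shrinker->nr -= batch_size;
	}
}

Lowering ->batch would let mmu_shrink() run once a smaller count accumulates,
without touching the seeks-based rate.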

But there is no perfect value, because how often mmu_shrink() is called
will change if the admin changes the sysctl_vfs_cache_pressure tuning
parameter for dcache and icache, IIUC.

And TDP and shadow paging differ greatly.

Takuya


Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages

2011-12-19 Thread Avi Kivity
On 12/19/2011 11:56 AM, Takuya Yoshikawa wrote:
 (2011/12/19 18:26), Avi Kivity wrote:
 On 12/19/2011 11:22 AM, Takuya Yoshikawa wrote:
 Yes, it's very conservative.  But on the other hand the shrinker is
 tuned for dcache and icache, where there are usually tons of useless
 objects.  If we have to free something, I'd rather free them instead of
 mmu pages which tend to get recreated soon.



 OK, to satisfy the requirements, I will do:

  1. find the guest with the highest (shadow pages / memory) ratio

 How do you propose to do that efficiently?  We may have hundreds of
 guests, or even more, on one host.  Each guest access will involve
 bouncing a few cache lines.

 IMO, the goal should be restricted to emergencies.

 So a possible solution may be:
 - we set the tuning parameters as conservative as possible
 - pick up a guest with relatively high ratio
   (I have to think more about how to achieve this)
 - move the vm_list head for fairness

 In an emergency, we should not mind the performance penalty so much.

But is the shrinker really only called in emergencies?

Also, with things like cgroups, we may have an emergency in one
container, but not in others - if the shrinker is not cgroup aware, it
soon will be.



  2. just zap one page from that guest, keeping the current
 conservative rate

 I will update the patch.

 I think the current rate is too conservative.  No idea what a good one
 is; I don't have a feel for the relation between shrinker callbacks
 and memory pressure.


 When I tried to see what the current code is doing, frankly speaking,
 I thought mmu_shrink() was not tested enough from the beginning.

It wasn't, and in fact the original code was even worse; the code we
have now is after some fixes.


 I read the shrinker code as carefully as I could and realized that the
 combination (seeks=10*default, batch=128) is not reasonable; the high seeks
 value means the shrinker rarely calculates a value higher than 128, so
 mmu_shrink() is almost never called in normal operation.

 How about setting the batch a bit lower, keeping seeks as is?

Ok.


 But there is no perfect value, because how often mmu_shrink() is called
 will change if the admin changes the sysctl_vfs_cache_pressure tuning
 parameter for dcache and icache, IIUC.

 And TDP and shadow paging differ greatly.

We should aim for the following:
- normal operation causes very few shrinks (some are okay)
- high pressure mostly due to kvm results in kvm being shrunk (this is a
pathological case caused by starting a guest with a huge amount of
memory, mapping it all to /dev/zero (or ksm), and getting the guest
to create shadow mappings for all of it)
- general high pressure is shared among other caches like dcache and icache

The cost of reestablishing an mmu page can be as high as half a
millisecond of cpu time, which is the reason I want to be conservative.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages

2011-12-19 Thread Takuya Yoshikawa

(2011/12/19 19:03), Avi Kivity wrote:

IMO, the goal should be restricted to emergencies.

So a possible solution may be:
 - we set the tuning parameters as conservative as possible
 - pick up a guest with relatively high ratio
   (I have to think more about how to achieve this)
 - move the vm_list head for fairness

In an emergency, we should not mind the performance penalty so much.


But is the shrinker really only called in emergencies?


No, sadly.

That is the problem.



Also, with things like cgroups, we may have an emergency in one
container, but not in others - if the shrinker is not cgroup aware, it
soon will be.


That seems to be a common problem for everyone, not just KVM.


But there is no perfect value, because how often mmu_shrink() is called
will change if the admin changes the sysctl_vfs_cache_pressure tuning
parameter for dcache and icache, IIUC.

And TDP and shadow paging differ greatly.


We should aim for the following:
- normal operation causes very few shrinks (some are okay)
- high pressure mostly due to kvm results in kvm being shrunk (this is a
pathological case caused by starting a guest with a huge amount of
memory, mapping it all to /dev/zero (or ksm), and getting the guest
to create shadow mappings for all of it)
- general high pressure is shared among other caches like dcache and icache

The cost of reestablishing an mmu page can be as high as half a
millisecond of cpu time, which is the reason I want to be conservative.



I agree with you.

I feel that I should CC lkml next time to hear from mm specialists.
The shrinker has many heuristics added from a lot of experience; my lack of
such experience means I need help.

Takuya


Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages

2011-12-19 Thread Avi Kivity
On 12/19/2011 12:21 PM, Takuya Yoshikawa wrote:
 (2011/12/19 19:03), Avi Kivity wrote:
 IMO, the goal should be restricted to emergencies.

 So a possible solution may be:
  - we set the tuning parameters as conservative as possible
  - pick up a guest with relatively high ratio
    (I have to think more about how to achieve this)
  - move the vm_list head for fairness

 In an emergency, we should not mind the performance penalty so much.

 But is the shrinker really only called in emergencies?

 No, sadly.

 That is the problem.

It's not a problem - otherwise the icache and dcache would grow
indefinitely.

kvm's caches have another limit - perhaps we should remove it and let
the shrinker manage the caches itself (but that requires a better
selection algorithm).



 Also, with things like cgroups, we may have an emergency in one
 container, but not in others - if the shrinker is not cgroup aware, it
 soon will be.

 That seems to be a common problem for everyone, not just KVM.

It's really a requirement for dcache/icache, otherwise one container
could force out icache/dcache for another.  It doesn't concern us
directly but we have to ensure that one guest cannot force out another's
mmu pages.


 But there is no perfect value, because how often mmu_shrink() is called
 will change if the admin changes the sysctl_vfs_cache_pressure tuning
 parameter for dcache and icache, IIUC.

 And TDP and shadow paging differ greatly.

 We should aim for the following:
 - normal operation causes very few shrinks (some are okay)
 - high pressure mostly due to kvm results in kvm being shrunk (this is a
 pathological case caused by starting a guest with a huge amount of
 memory, mapping it all to /dev/zero (or ksm), and getting the guest
 to create shadow mappings for all of it)
 - general high pressure is shared among other caches like dcache and
 icache

 The cost of reestablishing an mmu page can be as high as half a
 millisecond of cpu time, which is the reason I want to be conservative.


 I agree with you.

 I feel that I should CC lkml next time to hear from mm specialists.
 The shrinker has many heuristics added from a lot of experience; my lack of
 such experience means I need help.

Copying one expert.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages

2011-12-16 Thread Marcelo Tosatti
On Mon, Dec 12, 2011 at 07:26:47AM +0900, Takuya Yoshikawa wrote:
 From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
 
 Currently, mmu_shrink() tries to free a shadow page from one kvm and
 does not use nr_to_scan correctly.
 
 This patch fixes this by making it try to free some shadow pages from
 each kvm.  The number of shadow pages each kvm frees becomes
 proportional to the number of shadow pages it is using.
 
 Note: an easy way to see how this code works is to do
    echo 3 > /proc/sys/vm/drop_caches
  while some virtual machines are running.  Shadow pages will be zapped
  as expected.

I'm not sure this is a meaningful test to verify this change is
worthwhile, because while the shrinker tries to free a shadow page from
one vm, the vm's position in the kvm list is changed, so over time
the shrinker will cycle over all VMs.

Can you measure whether there is a significant difference in a synthetic
workload, and what that change is? Perhaps apply {moderate, high} memory
pressure load with {2, 4, 8, 16} VMs or something like that.



Re: [PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages

2011-12-16 Thread Takuya Yoshikawa
On Fri, 16 Dec 2011 09:06:11 -0200
Marcelo Tosatti mtosa...@redhat.com wrote:

 On Mon, Dec 12, 2011 at 07:26:47AM +0900, Takuya Yoshikawa wrote:
  From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
  
  Currently, mmu_shrink() tries to free a shadow page from one kvm and
  does not use nr_to_scan correctly.
  
  This patch fixes this by making it try to free some shadow pages from
  each kvm.  The number of shadow pages each kvm frees becomes
  proportional to the number of shadow pages it is using.
  
  Note: an easy way to see how this code works is to do
    echo 3 > /proc/sys/vm/drop_caches
  while some virtual machines are running.  Shadow pages will be zapped
  as expected.
 
 I'm not sure this is a meaningful test to verify this change is
 worthwhile, because while the shrinker tries to free a shadow page from
 one vm, the vm's position in the kvm list is changed, so over time
 the shrinker will cycle over all VMs.

The test was for checking whether mmu_shrink() would work as intended.  Maybe
it was not good as a changelog, sorry.


I admit that I could not find any strong reason except for protecting the
host from intentionally induced shadowing.

But for that, don't you think that freeing the same amount of shadow pages
from good and bad VMs equally is a bad thing?

My method will try to free many shadow pages from VMs with many shadow
pages;  e.g. if there is a pathological increase in shadow pages for one
VM, that one will be intensively treated.

If you agree with this reasoning, I will update the description and resend.

 
 Can you measure whether there is a significant difference in a synthetic
 workload, and what that change is? Perhaps apply {moderate, high} memory
 pressure load with {2, 4, 8, 16} VMs or something like that.
 

I was running 4 VMs on my machine under sufficiently high memory pressure.  The
problem was that mmu_shrink() was not tuned to be called under usual memory
pressure: what I did was change the seeks and batch parameters and set ept=0.

At least, I have checked that if I make one VM do many meaningless copies,
while the others keep silent, the shrinker frees shadow pages intensively
from that one.


Anyway, I don't think making the shrinker call mmu_shrink() with the default
batch size, nr_to_scan=128, and just trying to free one shadow page is
good behaviour.


Takuya


[PATCH 4/4] KVM: MMU: Make mmu_shrink() scan nr_to_scan shadow pages

2011-12-11 Thread Takuya Yoshikawa
From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp

Currently, mmu_shrink() tries to free a shadow page from one kvm and
does not use nr_to_scan correctly.

This patch fixes this by making it try to free some shadow pages from
each kvm.  The number of shadow pages each kvm frees becomes
proportional to the number of shadow pages it is using.

Note: an easy way to see how this code works is to do
  echo 3 > /proc/sys/vm/drop_caches
while some virtual machines are running.  Shadow pages will be zapped
as expected.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/mmu.c |   23 ++++++++++++++---------
 1 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fcd0dd1..c6c61dd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3921,7 +3921,7 @@ restart:
 static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct kvm *kvm;
-	struct kvm *kvm_freed = NULL;
+	int nr_to_zap, nr_total;
 	int nr_to_scan = sc->nr_to_scan;
 
 	if (nr_to_scan == 0)
@@ -3929,25 +3929,30 @@ static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
 
 	raw_spin_lock(&kvm_lock);
 
+	nr_total = percpu_counter_read_positive(&kvm_total_used_mmu_pages);
+
 	list_for_each_entry(kvm, &kvm_list, list) {
 		int idx;
 		LIST_HEAD(invalid_list);
 
+		if (nr_to_scan <= 0) {
+			/* next time from this kvm */
+			list_move_tail(&kvm_list, &kvm->list);
+			break;
+		}
+
 		idx = srcu_read_lock(&kvm->srcu);
 		spin_lock(&kvm->mmu_lock);
-		if (!kvm_freed && nr_to_scan > 0 &&
-		    kvm->arch.n_used_mmu_pages > 0) {
-			pre_zap_one_sp(kvm, &invalid_list);
-			kvm_freed = kvm;
-		}
-		nr_to_scan--;
 
+		/* proportional to how many shadow pages this kvm is using */
+		nr_to_zap = sc->nr_to_scan * kvm->arch.n_used_mmu_pages;
+		nr_to_zap /= nr_total;
+		nr_to_scan -= pre_zap_some_sp(kvm, &invalid_list, nr_to_zap);
 		kvm_mmu_commit_zap_page(kvm, &invalid_list);
+
 		spin_unlock(&kvm->mmu_lock);
 		srcu_read_unlock(&kvm->srcu, idx);
 	}
-	if (kvm_freed)
-		list_move_tail(&kvm_freed->list, &kvm_list);
 
 	raw_spin_unlock(&kvm_lock);
-- 
1.7.5.4
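
Note: pre_zap_some_sp() is introduced earlier in this series (patch 3/4)
and is not shown in this excerpt.  A rough sketch of the semantics the hunk
above relies on (the loop body is an assumption, not the series' actual
code) might be:

static int pre_zap_some_sp(struct kvm *kvm, struct list_head *invalid_list,
			   int nr_to_zap)
{
	int i, zapped = 0;

	for (i = 0; i < nr_to_zap; i++) {
		struct kvm_mmu_page *sp;

		if (list_empty(&kvm->arch.active_mmu_pages))
			break;
		/* the oldest shadow pages sit at the tail of the list */
		sp = list_entry(kvm->arch.active_mmu_pages.prev,
				struct kvm_mmu_page, link);
		zapped += kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
	}
	/* caller charges the result to nr_to_scan and then commits */
	return zapped;
}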
