Re: [PATCH V2 RFC 0/3] kvm: Improving undercommit,overcommit scenarios
On 10/30/2012 05:47 PM, Andrew Theurer wrote:
> On Mon, 2012-10-29 at 19:36 +0530, Raghavendra K T wrote:
>> In some special scenarios like #vcpu <= #pcpu, the PLE handler may
>> prove very costly, because there is no need to iterate over vcpus
>> and do unsuccessful yield_to calls burning CPU.
>>
>> Similarly, when we have a large number of small guests, it is
>> possible that a spinning vcpu fails to yield_to any vcpu of the same
>> VM and goes back and spins. This is also not effective when we are
>> over-committed. Instead, we do a yield() so that we give a chance
>> to other VMs to run.
>>
>> This patch series tries to optimize the above scenarios.
>>
>> The first patch optimizes yield_to by bailing out when there is no
>> point in continuing (i.e., when there is only one task on both the
>> source and the target rq).
>>
>> The second patch uses that in the PLE handler.
>>
>> The third patch uses overall system load knowledge to decide whether
>> to continue in the yield_to handler, and also to yield in overcommit
>> cases. To be precise:
>> * loadavg is converted to a scale of 2048 per CPU
>> * a load value of less than 1024 is considered undercommit and we
>>   return from the PLE handler in those cases
>> * a load value of greater than 3586 (1.75 * 2048) is considered
>>   overcommit and we yield to other VMs in such cases
>>
>> (let threshold = 2048)
>> Rationale for using threshold/2 as the undercommit limit:
>> Having a load below (0.5 * threshold) is used to avoid (the concern
>> raised by Rik) scenarios where we still have a lock-holder-preempted
>> vcpu waiting to be scheduled. (That scenario arises when rq length is
>> > 1 even when we are undercommitted.)
>>
>> Rationale for using (1.75 * threshold) for the overcommit scenario:
>> This is a heuristic where we should probably see rq length > 1 and a
>> vcpu of a different VM waiting to be scheduled.
>>
>> Related future work (independent of this series):
>>
>> - Dynamically changing the PLE window depending on system load.
>>
>> Results on a 3.7.0-rc1 kernel show around 146% improvement for ebizzy
>> 1x on a 32-core PLE machine with a 32-vcpu guest.
>> I believe we should get very good improvements for overcommit
>> (especially > 2) on large machines with small-vcpu guests. (Could not
>> test this as I do not have access to a bigger machine.)
>>
>> base = 3.7.0-rc1
>> machine: 32 core mx3850 x5 PLE mc
>>
>>              ebizzy (rec/sec, higher is better)
>>     +-----------+-----------+-----------+-----------+-----------+
>>           base       stdev      patched       stdev    %improve
>>     +-----------+-----------+-----------+-----------+-----------+
>> 1x   2543.3750     20.2903    6279.3750     82.5226   146.89143
>> 2x   2410.8750     96.4327    2450.7500    207.8136     1.65396
>> 3x   2184.9167    205.5226    2178.3333     97.2034    -0.30131
>>     +-----------+-----------+-----------+-----------+-----------+
>>
>>              dbench (throughput in MB/sec, higher is better)
>>     +-----------+-----------+-----------+-----------+-----------+
>>           base       stdev      patched       stdev    %improve
>>     +-----------+-----------+-----------+-----------+-----------+
>> 1x   5545.4330    596.4344    7042.8510   1012.0924    27.00272
>> 2x   1993.0970     43.6548    1990.6200     75.7837    -0.12428
>> 3x   1295.3867     22.3997    1315.5208     36.0075     1.55429
>>     +-----------+-----------+-----------+-----------+-----------+
>
> Could you include a PLE-off result for 1x over-commit, so we know what
> the best possible result is?

Yes, base no PLE:
  ebizzy_1x   7651.3000 rec/sec
  ebizzy_2x     51.5000 rec/sec
For ebizzy we are closer.
  dbench_1x  12631.4210 MB/sec
  dbench_2x     45.0842 MB/sec
(Strangely, the dbench 1x result is sometimes not consistent despite 10
runs of 3 min + 30 sec warmup runs on a 3G tmpfs. But it surely tells the
trend.)

> Looks like skipping the yield_to() for rq = 1 helps, but I'd like to
> know if the performance is the same as PLE off for 1x. I am concerned
> the vcpu to task lookup is still expensive.

Yes. I still see that.

> Based on Peter's comments I would say the 3rd patch and the 2x,3x
> results are not conclusive at this time.

Avi, IMO patch 1 and 2 seem to be good to go. Please let me know.
> I think we should also discuss what we think a good target is. We
> should know what our high-water mark is, and IMO, if we cannot get
> close, then I do not feel we are heading down the right path. For
> example, if dbench aggregate throughput for 1x with PLE off is 1
> MB/sec, then the best possible 2x,3x result should be a little lower
> than that due to task switching the vcpus and sharing caches. This
> should be quite evident with the current PLE handler and smaller VMs
> (like 10 vcpus or less).

Very much agree here. If we look at the 2x and 3x results (all or any of
them), the aggregate is not near the 1x number. Maybe even 70% is a good
target.
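As an aside for readers skimming the thread, the following is a minimal
userspace model of the rq-length bail-out that patch 1 describes. All
struct and function names here are invented for illustration; the real
change lives in the kernel's yield_to() path and looks different.

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for a per-CPU runqueue; the real struct rq is far richer. */
struct toy_rq {
	unsigned int nr_running;
};

/*
 * Idea behind patch 1: if both the source (yielding) runqueue and the
 * target runqueue hold exactly one task each, yield_to() cannot possibly
 * help, so bail out before taking locks or walking vcpus.
 */
static bool worth_trying_yield_to(const struct toy_rq *src,
				  const struct toy_rq *dst)
{
	if (src->nr_running == 1 && dst->nr_running == 1)
		return false;	/* nothing else to yield to: skip the work */
	return true;
}

int main(void)
{
	struct toy_rq src = { .nr_running = 1 };
	struct toy_rq dst = { .nr_running = 1 };

	printf("attempt yield_to? %s\n",
	       worth_trying_yield_to(&src, &dst) ? "yes" : "no");
	return 0;
}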
Re: [PATCH V2 RFC 0/3] kvm: Improving undercommit,overcommit scenarios
On Mon, 2012-10-29 at 19:36 +0530, Raghavendra K T wrote:
> In some special scenarios like #vcpu <= #pcpu, the PLE handler may
> prove very costly, because there is no need to iterate over vcpus
> and do unsuccessful yield_to calls burning CPU.
>
> Similarly, when we have a large number of small guests, it is
> possible that a spinning vcpu fails to yield_to any vcpu of the same
> VM and goes back and spins. This is also not effective when we are
> over-committed. Instead, we do a yield() so that we give a chance
> to other VMs to run.
>
> This patch series tries to optimize the above scenarios.
>
> The first patch optimizes yield_to by bailing out when there is no
> point in continuing (i.e., when there is only one task on both the
> source and the target rq).
>
> The second patch uses that in the PLE handler.
>
> The third patch uses overall system load knowledge to decide whether
> to continue in the yield_to handler, and also to yield in overcommit
> cases. To be precise:
> * loadavg is converted to a scale of 2048 per CPU
> * a load value of less than 1024 is considered undercommit and we
>   return from the PLE handler in those cases
> * a load value of greater than 3586 (1.75 * 2048) is considered
>   overcommit and we yield to other VMs in such cases
>
> (let threshold = 2048)
> Rationale for using threshold/2 as the undercommit limit:
> Having a load below (0.5 * threshold) is used to avoid (the concern
> raised by Rik) scenarios where we still have a lock-holder-preempted
> vcpu waiting to be scheduled. (That scenario arises when rq length is
> > 1 even when we are undercommitted.)
>
> Rationale for using (1.75 * threshold) for the overcommit scenario:
> This is a heuristic where we should probably see rq length > 1 and a
> vcpu of a different VM waiting to be scheduled.
>
> Related future work (independent of this series):
>
> - Dynamically changing the PLE window depending on system load.
>
> Results on a 3.7.0-rc1 kernel show around 146% improvement for ebizzy
> 1x on a 32-core PLE machine with a 32-vcpu guest.
> I believe we should get very good improvements for overcommit
> (especially > 2) on large machines with small-vcpu guests. (Could not
> test this as I do not have access to a bigger machine.)
>
> base = 3.7.0-rc1
> machine: 32 core mx3850 x5 PLE mc
>
>              ebizzy (rec/sec, higher is better)
>     +-----------+-----------+-----------+-----------+-----------+
>           base       stdev      patched       stdev    %improve
>     +-----------+-----------+-----------+-----------+-----------+
> 1x   2543.3750     20.2903    6279.3750     82.5226   146.89143
> 2x   2410.8750     96.4327    2450.7500    207.8136     1.65396
> 3x   2184.9167    205.5226    2178.3333     97.2034    -0.30131
>     +-----------+-----------+-----------+-----------+-----------+
>
>              dbench (throughput in MB/sec, higher is better)
>     +-----------+-----------+-----------+-----------+-----------+
>           base       stdev      patched       stdev    %improve
>     +-----------+-----------+-----------+-----------+-----------+
> 1x   5545.4330    596.4344    7042.8510   1012.0924    27.00272
> 2x   1993.0970     43.6548    1990.6200     75.7837    -0.12428
> 3x   1295.3867     22.3997    1315.5208     36.0075     1.55429
>     +-----------+-----------+-----------+-----------+-----------+

Could you include a PLE-off result for 1x over-commit, so we know what
the best possible result is?

Looks like skipping the yield_to() for rq = 1 helps, but I'd like to know
if the performance is the same as PLE off for 1x. I am concerned the vcpu
to task lookup is still expensive.

Based on Peter's comments I would say the 3rd patch and the 2x,3x results
are not conclusive at this time.

I think we should also discuss what we think a good target is. We should
know what our high-water mark is, and IMO, if we cannot get close, then I
do not feel we are heading down the right path.
For example, if dbench aggregate throughput for 1x with PLE off is 1
MB/sec, then the best possible 2x,3x result should be a little lower than
that due to task switching the vcpus and sharing caches. This should be
quite evident with the current PLE handler and smaller VMs (like 10 vcpus
or less).

> Changes since V1:
> - Discard the idea of exporting nr_running and optimize in the core
>   scheduler (Peter)
> - Use yield() instead of schedule in overcommit scenarios (Rik)
> - Use loadavg knowledge to detect undercommit/overcommit
>
> Peter Zijlstra (1):
>   Bail out of yield_to when source and target runqueue has one task
>
> Raghavendra K T (2):
>   Handle yield_to failure return for potential undercommit case
>   Check system load and handle different commit cases accordingly
>
> Please let me know your comments and suggestions.
>
> Link for V1:
> https://lkml.org/lkml/2012/9/21/168
>
>  kernel/sched/core.c | 25 +++--
>  virt/kvm/kvm_main.c | 56 ++--
>  2 files changed, 65 insertions(+), 16 deletions(-)
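Roughly, the second patch summarised above means the PLE handler stops
scanning vcpus once the scheduler reports the bail-out case instead of
looping on. Below is a toy userspace model of that control flow; every
name in it is invented for illustration and the return-value convention is
assumed, not taken from the actual patch in virt/kvm/kvm_main.c.

#include <stdio.h>

#define NR_VCPUS 4

/*
 * Toy yield attempt: 1 = yielded successfully, 0 = this candidate was not
 * suitable, -1 = the scheduler signalled the "single task on both
 * runqueues" bail-out from patch 1.
 */
static int toy_try_yield_to(int vcpu)
{
	return (vcpu == 0) ? -1 : 0;	/* pretend the first attempt bails out */
}

static void toy_ple_handler(void)
{
	for (int i = 0; i < NR_VCPUS; i++) {
		int ret = toy_try_yield_to(i);

		if (ret > 0)
			return;		/* donated our timeslice, done */
		if (ret < 0) {
			/* Likely undercommitted: scanning more vcpus only
			 * burns CPU, so give up the whole scan. */
			printf("bailing out of PLE scan at vcpu %d\n", i);
			return;
		}
		/* ret == 0: try the next candidate vcpu */
	}
}

int main(void)
{
	toy_ple_handler();
	return 0;
}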
[PATCH V2 RFC 0/3] kvm: Improving undercommit,overcommit scenarios
In some special scenarios like #vcpu <= #pcpu, the PLE handler may prove
very costly, because there is no need to iterate over vcpus and do
unsuccessful yield_to calls burning CPU.

Similarly, when we have a large number of small guests, it is possible
that a spinning vcpu fails to yield_to any vcpu of the same VM and goes
back and spins. This is also not effective when we are over-committed.
Instead, we do a yield() so that we give a chance to other VMs to run.

This patch series tries to optimize the above scenarios.

The first patch optimizes yield_to by bailing out when there is no point
in continuing (i.e., when there is only one task on both the source and
the target rq).

The second patch uses that in the PLE handler.

The third patch uses overall system load knowledge to decide whether to
continue in the yield_to handler, and also to yield in overcommit cases.
To be precise:
* loadavg is converted to a scale of 2048 per CPU
* a load value of less than 1024 is considered undercommit and we return
  from the PLE handler in those cases
* a load value of greater than 3586 (1.75 * 2048) is considered overcommit
  and we yield to other VMs in such cases

(let threshold = 2048)
Rationale for using threshold/2 as the undercommit limit:
Having a load below (0.5 * threshold) is used to avoid (the concern raised
by Rik) scenarios where we still have a lock-holder-preempted vcpu waiting
to be scheduled. (That scenario arises when rq length is > 1 even when we
are undercommitted.)

Rationale for using (1.75 * threshold) for the overcommit scenario:
This is a heuristic where we should probably see rq length > 1 and a vcpu
of a different VM waiting to be scheduled.

Related future work (independent of this series):

- Dynamically changing the PLE window depending on system load.

Results on a 3.7.0-rc1 kernel show around 146% improvement for ebizzy 1x
on a 32-core PLE machine with a 32-vcpu guest.
I believe we should get very good improvements for overcommit (especially
> 2) on large machines with small-vcpu guests. (Could not test this as I
do not have access to a bigger machine.)

base = 3.7.0-rc1
machine: 32 core mx3850 x5 PLE mc

             ebizzy (rec/sec, higher is better)
    +-----------+-----------+-----------+-----------+-----------+
          base       stdev      patched       stdev    %improve
    +-----------+-----------+-----------+-----------+-----------+
1x   2543.3750     20.2903    6279.3750     82.5226   146.89143
2x   2410.8750     96.4327    2450.7500    207.8136     1.65396
3x   2184.9167    205.5226    2178.3333     97.2034    -0.30131
    +-----------+-----------+-----------+-----------+-----------+

             dbench (throughput in MB/sec, higher is better)
    +-----------+-----------+-----------+-----------+-----------+
          base       stdev      patched       stdev    %improve
    +-----------+-----------+-----------+-----------+-----------+
1x   5545.4330    596.4344    7042.8510   1012.0924    27.00272
2x   1993.0970     43.6548    1990.6200     75.7837    -0.12428
3x   1295.3867     22.3997    1315.5208     36.0075     1.55429
    +-----------+-----------+-----------+-----------+-----------+

Changes since V1:
- Discard the idea of exporting nr_running and optimize in the core
  scheduler (Peter)
- Use yield() instead of schedule in overcommit scenarios (Rik)
- Use loadavg knowledge to detect undercommit/overcommit

Peter Zijlstra (1):
  Bail out of yield_to when source and target runqueue has one task

Raghavendra K T (2):
  Handle yield_to failure return for potential undercommit case
  Check system load and handle different commit cases accordingly

Please let me know your comments and suggestions.

Link for V1:
https://lkml.org/lkml/2012/9/21/168

 kernel/sched/core.c | 25 +++--
 virt/kvm/kvm_main.c | 56 ++--
 2 files changed, 65 insertions(+), 16 deletions(-)
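To make the third patch's thresholds concrete, here is a small userspace
model of the classification it describes: load average scaled so that 1.0
per CPU equals 2048, return early below 1024, yield() to other VMs above
3586. Only the constants come from the description above; the function and
variable names are invented for illustration and do not appear in the
actual patch.

#include <stdio.h>

#define COMMIT_SCALE      2048UL	/* load of 1.0 per CPU maps to 2048 */
#define UNDERCOMMIT_LIMIT 1024UL	/* 0.5 * COMMIT_SCALE */
#define OVERCOMMIT_LIMIT  3586UL	/* ~1.75 * COMMIT_SCALE, per the series */

/*
 * load_x2048 is the host load average in fixed point (load * 2048, roughly
 * the form the kernel's avenrun[] uses); ncpus is the online CPU count.
 */
static const char *classify_commit(unsigned long load_x2048,
				   unsigned int ncpus)
{
	unsigned long per_cpu = load_x2048 / ncpus;

	if (per_cpu < UNDERCOMMIT_LIMIT)
		return "undercommit: return from the PLE handler";
	if (per_cpu > OVERCOMMIT_LIMIT)
		return "overcommit: yield() so other VMs can run";
	return "in between: keep the normal yield_to scan";
}

int main(void)
{
	/* 32-CPU host, load ~12.0 -> 12 * 2048 / 32 = 768 per CPU. */
	printf("%s\n", classify_commit(12 * 2048UL, 32));

	/* Same host, load ~64.0 -> 64 * 2048 / 32 = 4096 per CPU. */
	printf("%s\n", classify_commit(64 * 2048UL, 32));
	return 0;
}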