Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-10-05 Thread Raghavendra K T

On 10/04/2012 06:14 PM, Avi Kivity wrote:

On 10/04/2012 12:56 PM, Raghavendra K T wrote:

On 10/03/2012 10:55 PM, Avi Kivity wrote:

On 10/03/2012 04:29 PM, Raghavendra K T wrote:

* Avi Kivity  [2012-09-27 14:03:59]:


On 09/27/2012 01:23 PM, Raghavendra K T wrote:



[...]

2) Looking at the result (comparing A & C), I do feel we have
significant overhead in iterating over vcpus (when compared to even a
vmexit), so we would still need the undercommit fix suggested by PeterZ
(improving by 140%)?


Looking only at the current runqueue?  My worry is that it misses a lot
of cases.  Maybe try the current runqueue first and then others.



Okay. Do you mean we can have something like

+   if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+   yielded = -ESRCH;
+   goto out_irq;
+   }

in Peter's patch?

(I thought a lot about && vs ||; both seem to have their own cons.)
But that should apply only when we have a short-term imbalance, as
PeterZ said.
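
For context (and to make clear what rq and p_rq refer to), here is a
rough sketch of where such a check could sit in a yield_to()-style
function; the function name, labels and locking below are illustrative
assumptions, not the actual patch:

    /*
     * Sketch only, against kernel/sched internals: rq is the runqueue of
     * the task calling yield_to() (the spinning source vcpu), p_rq the
     * runqueue of the target task.
     */
    static int yield_to_sketch(struct task_struct *p)
    {
            struct rq *rq = this_rq();      /* source runqueue   */
            struct rq *p_rq = task_rq(p);   /* target's runqueue */
            unsigned long flags;
            int yielded = 0;

            local_irq_save(flags);
            double_rq_lock(rq, p_rq);

            /*
             * Undercommit: each runqueue has a single runnable task, so
             * there is nobody to yield to -- bail out early.
             */
            if (rq->nr_running == 1 && p_rq->nr_running == 1) {
                    yielded = -ESRCH;
                    goto out_irq;
            }

            /* ... the normal directed-yield path would follow here ... */

    out_irq:
            double_rq_unlock(rq, p_rq);
            local_irq_restore(flags);
            return yielded;
    }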


I'm missing the context.  What is p_rq?


p_rq is the run queue of the target vcpu.
What I was trying below was to address Rik's concern. Suppose the rq of
the source vcpu has one task, but the target has two tasks, with an
eligible vcpu waiting to be scheduled.



What I mean was:

if can_yield_to_process_in_current_rq
   do that
else if can_yield_to_process_in_other_rq
   do that
else
   return -ESRCH
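
(To make the ordering concrete, here is a rough sketch of that two-pass
idea; this is not the in-tree kvm_vcpu_on_spin(), and vcpu_preempted()
and vcpu_task_on_this_rq() are made-up helpers standing in for "runnable
but not running" and "cheap to yield to".)

    static int directed_yield_two_pass(struct kvm_vcpu *me)
    {
            struct kvm *kvm = me->kvm;
            struct kvm_vcpu *vcpu;
            int i, pass;

            for (pass = 0; pass < 2; pass++) {
                    kvm_for_each_vcpu(i, vcpu, kvm) {
                            if (vcpu == me || !vcpu_preempted(vcpu))
                                    continue;
                            /* pass 0: only candidates already on our runqueue */
                            if (pass == 0 && !vcpu_task_on_this_rq(vcpu))
                                    continue;
                            if (kvm_vcpu_yield_to(vcpu) > 0)
                                    return 1;
                    }
            }
            return -ESRCH;  /* nobody worth yielding to on any runqueue */
    }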


I think you are saying we have to check the run queue of the source
vcpu: if we have a vcpu belonging to the same VM there, try to yield to
that, ignoring whatever target vcpu we received for yield_to.

Or is it that kvm_vcpu_yield_to should first check the vcpus of the same
VM belonging to the same run queue, and if we don't succeed, go again
for a vcpu on a different runqueue?


Right.  Prioritize vcpus that are cheap to yield to.  But may return bad
results if all vcpus on the current runqueue are spinners, so probably
not a good idea.


Okay, I'll drop the vcpu-from-same-rq idea for now.




Does it add more overhead especially in <= 1x scenario?


The current runqueue should have just our vcpu in that case, so low
overhead.  But it's a bad idea due to the above scenario.





Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-10-04 Thread Avi Kivity
On 10/04/2012 12:56 PM, Raghavendra K T wrote:
> On 10/03/2012 10:55 PM, Avi Kivity wrote:
>> On 10/03/2012 04:29 PM, Raghavendra K T wrote:
>>> * Avi Kivity  [2012-09-27 14:03:59]:
>>>
 On 09/27/2012 01:23 PM, Raghavendra K T wrote:
>>
>>> [...]
> 2) Looking at the result (comparing A & C), I do feel we have
> significant overhead in iterating over vcpus (when compared to even a
> vmexit), so we would still need the undercommit fix suggested by PeterZ
> (improving by 140%)?

 Looking only at the current runqueue?  My worry is that it misses a lot
 of cases.  Maybe try the current runqueue first and then others.

>>>
>>> Okay. Do you mean we can have something like
>>>
>>> +   if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>>> +   yielded = -ESRCH;
>>> +   goto out_irq;
>>> +   }
>>>
>>> in Peter's patch?
>>>
>>> (I thought a lot about && vs ||; both seem to have their own cons.)
>>> But that should apply only when we have a short-term imbalance, as
>>> PeterZ said.
>>
>> I'm missing the context.  What is p_rq?
> 
> p_rq is the run queue of the target vcpu.
> What I was trying below was to address Rik's concern. Suppose the rq of
> the source vcpu has one task, but the target has two tasks, with an
> eligible vcpu waiting to be scheduled.
> 
>>
>> What I mean was:
>>
>>if can_yield_to_process_in_current_rq
>>   do that
>>else if can_yield_to_process_in_other_rq
>>   do that
>>else
>>   return -ESRCH
> 
> I think you are saying we have to check the run queue of the source
> vcpu: if we have a vcpu belonging to the same VM there, try to yield to
> that, ignoring whatever target vcpu we received for yield_to.
> 
> Or is it that kvm_vcpu_yield_to should first check the vcpus of the same
> VM belonging to the same run queue, and if we don't succeed, go again
> for a vcpu on a different runqueue?

Right.  Prioritize vcpus that are cheap to yield to.  But may return bad
results if all vcpus on the current runqueue are spinners, so probably
not a good idea.

> Does it add more overhead especially in <= 1x scenario?

The current runqueue should have just our vcpu in that case, so low
overhead.  But it's a bad idea due to the above scenario.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-10-04 Thread Raghavendra K T

On 10/03/2012 10:55 PM, Avi Kivity wrote:

On 10/03/2012 04:29 PM, Raghavendra K T wrote:

* Avi Kivity  [2012-09-27 14:03:59]:


On 09/27/2012 01:23 PM, Raghavendra K T wrote:



[...]

2) Looking at the result (comparing A & C), I do feel we have
significant overhead in iterating over vcpus (when compared to even a
vmexit), so we would still need the undercommit fix suggested by PeterZ
(improving by 140%)?


Looking only at the current runqueue?  My worry is that it misses a lot
of cases.  Maybe try the current runqueue first and then others.



Okay. Do you mean we can have something like

+   if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+   yielded = -ESRCH;
+   goto out_irq;
+   }

in Peter's patch?

(I thought a lot about && vs ||; both seem to have their own cons.)
But that should apply only when we have a short-term imbalance, as
PeterZ said.


I'm missing the context.  What is p_rq?


p_rq is the run queue of the target vcpu.
What I was trying below was to address Rik's concern. Suppose the rq of
the source vcpu has one task, but the target has two tasks, with an
eligible vcpu waiting to be scheduled.



What I mean was:

   if can_yield_to_process_in_current_rq
  do that
   else if can_yield_to_process_in_other_rq
  do that
   else
  return -ESRCH


I think you are saying we have to check the run queue of the source
vcpu: if we have a vcpu belonging to the same VM there, try to yield to
that, ignoring whatever target vcpu we received for yield_to.

Or is it that kvm_vcpu_yield_to should first check the vcpus of the same
VM belonging to the same run queue, and if we don't succeed, go again
for a vcpu on a different runqueue?
Does it add more overhead, especially in the <= 1x scenario?



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-10-03 Thread Avi Kivity
On 10/03/2012 04:29 PM, Raghavendra K T wrote:
> * Avi Kivity  [2012-09-27 14:03:59]:
> 
>> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
>> >>
> [...]
>> > 2) Looking at the result (comparing A & C), I do feel we have
>> > significant overhead in iterating over vcpus (when compared to even a
>> > vmexit), so we would still need the undercommit fix suggested by PeterZ
>> > (improving by 140%)?
>> 
>> Looking only at the current runqueue?  My worry is that it misses a lot
>> of cases.  Maybe try the current runqueue first and then others.
>> 
> 
> Okay. Do you mean we can have something like
> 
> +   if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> +   yielded = -ESRCH;
> +   goto out_irq;
> +   }
> 
> in Peter's patch?
> 
> (I thought a lot about && vs ||; both seem to have their own cons.)
> But that should apply only when we have a short-term imbalance, as
> PeterZ said.

I'm missing the context.  What is p_rq?

What I mean was:

  if can_yield_to_process_in_current_rq
 do that
  else if can_yield_to_process_in_other_rq
 do that
  else
 return -ESRCH


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-10-03 Thread Raghavendra K T
* Avi Kivity  [2012-09-27 14:03:59]:

> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
> >>
[...]
> > 2) Looking at the result (comparing A & C), I do feel we have
> > significant overhead in iterating over vcpus (when compared to even a
> > vmexit), so we would still need the undercommit fix suggested by PeterZ
> > (improving by 140%)?
> 
> Looking only at the current runqueue?  My worry is that it misses a lot
> of cases.  Maybe try the current runqueue first and then others.
> 

Okay. Do you mean we can have something like

+   if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+   yielded = -ESRCH;
+   goto out_irq;
+   }

in Peter's patch?

(I thought a lot about && vs ||; both seem to have their own cons.)
But that should apply only when we have a short-term imbalance, as
PeterZ said.

I am experimenting with all these for the V2 patch. Will come back with
analysis and a patch.

> Or were you referring to something else?
> 



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-30 Thread Avi Kivity
On 09/28/2012 01:40 PM, Andrew Theurer wrote:
>> 
>> >>
>> >> IIRC, with defer preemption :
>> >> we will have hook in spinlock/unlock path to measure depth of lock held,
>> >> and shared with host scheduler (may be via MSRs now).
>> >> Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather
>> >> give say one chance.
>> >
>> > A downside is that we have to do that even when undercommitted.
> 
> Hopefully vcpu preemption is very rare when undercommitted, so it should
> not happen much at all.

As soon as you're preempted, you're effectively overcommitted (even if
the system as a whole is undercommitted).  What I meant was that you
need to communicate your lock state to the host, and with fine-grained
locking this can happen a lot.  It may be as simple as an
increment/decrement instruction though.



-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-28 Thread Raghavendra K T

On 09/28/2012 05:10 PM, Andrew Theurer wrote:

On Fri, 2012-09-28 at 11:08 +0530, Raghavendra K T wrote:

On 09/27/2012 05:33 PM, Avi Kivity wrote:

On 09/27/2012 01:23 PM, Raghavendra K T wrote:



[...]


Also there may be a lot of false positives (deferred preemptions even
when there is no contention).


It will be interesting to see how this behaves with a very high lock
activity in a guest.  Once the scheduler defers preemption, is it for a
fixed amount of time, or does it know to cut the deferral short as soon
as the lock depth is reduced [by x]?


The design/protocol that Vatsa had in mind was something like this:

- The scheduler does not let a lock-holding vcpu run forever; it may
give it one chance worth only a few ticks. In addition to giving that
chance, the scheduler also sets some indication that the vcpu has been
given its chance.

- Once the vcpu releases (all) the lock(s), if it had been given a
chance, it should clear that indication (ACK) and relinquish the cpu.
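
(A hypothetical sketch of that handshake, just to pin the idea down; how
lock_depth/chance_given would be shared with the host -- MSR, shared
page, or otherwise -- and the host-side hook are assumptions, not a
design.)

    #include <linux/percpu.h>

    /* guest side: spinlock/unlock hooks */
    DEFINE_PER_CPU(int, lock_depth);     /* number of spinlocks held      */
    DEFINE_PER_CPU(int, chance_given);   /* set by host when it defers us */

    static inline void lock_acquired_hook(void)
    {
            this_cpu_inc(lock_depth);    /* host can read this, e.g. via an MSR */
    }

    static inline void lock_released_hook(void)
    {
            if (this_cpu_dec_return(lock_depth) == 0 &&
                this_cpu_read(chance_given)) {
                    this_cpu_write(chance_given, 0);  /* ACK the chance */
                    hypercall_yield();  /* hypothetical: give the pcpu back now */
            }
    }

    /* host side: consulted when the scheduler wants to preempt a vcpu */
    struct vcpu_lock_state { int lock_depth; int chance_given; };  /* hypothetical */

    static bool may_defer_preemption(struct vcpu_lock_state *v)
    {
            if (v->lock_depth > 0 && !v->chance_given) {
                    v->chance_given = 1;   /* grant one chance of a few ticks */
                    return true;           /* defer this preemption once      */
            }
            return false;                  /* preempt as usual */
    }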






Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-28 Thread Peter Zijlstra
On Fri, 2012-09-28 at 06:40 -0500, Andrew Theurer wrote:
> It will be interesting to see how this behaves with a very high lock
> activity in a guest.  Once the scheduler defers preemption, is it for
> a
> fixed amount of time, or does it know to cut the deferral short as
> soon
> as the lock depth is reduced [by x]? 

Since the locks live in a guest/userspace, we don't even know they're
held at all, let alone when state changes.

Also, afaik PLE simply exits the guest whenever you do a busy-wait;
there's no guarantee it's due to a lock at all, we could be waiting for a
'virtual' hardware resource or whatnot.




Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-28 Thread Andrew Theurer
On Fri, 2012-09-28 at 11:08 +0530, Raghavendra K T wrote:
> On 09/27/2012 05:33 PM, Avi Kivity wrote:
> > On 09/27/2012 01:23 PM, Raghavendra K T wrote:
> >>>
> >>> This gives us a good case for tracking preemption on a per-vm basis.  As
> >>> long as we aren't preempted, we can keep the PLE window high, and also
> >>> return immediately from the handler without looking for candidates.
> >>
> >> 1) So do you think, deferring preemption patch ( Vatsa was mentioning
> >> long back)  is also another thing worth trying, so we reduce the chance
> >> of LHP.
> >
> > Yes, we have to keep it in mind.  It will be useful for fine grained
> > locks, not so much so coarse locks or IPIs.
> >
> 
> Agree.
> 
> > I would still of course prefer a PLE solution, but if we can't get it to
> > work we can consider preemption deferral.
> >
> 
> Okay.
> 
> >>
> >> IIRC, with defer preemption :
> >> we will have hook in spinlock/unlock path to measure depth of lock held,
> >> and shared with host scheduler (may be via MSRs now).
> >> Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather
> >> give say one chance.
> >
> > A downside is that we have to do that even when undercommitted.

Hopefully vcpu preemption is very rare when undercommitted, so it should
not happen much at all.
> >
> > Also there may be a lot of false positives (deferred preemptions even
> > when there is no contention).

It will be interesting to see how this behaves with a very high lock
activity in a guest.  Once the scheduler defers preemption, is it for a
fixed amount of time, or does it know to cut the deferral short as soon
as the lock depth is reduced [by x]?
> 
> Yes. That is a worry.
> 
> >
> >>
> >> 2) Looking at the result (comparing A & C), I do feel we have
> >> significant overhead in iterating over vcpus (when compared to even a
> >> vmexit), so we would still need the undercommit fix suggested by PeterZ
> >> (improving by 140%)?
> >
> > Looking only at the current runqueue?  My worry is that it misses a lot
> > of cases.  Maybe try the current runqueue first and then others.
> >
> > Or were you referring to something else?
> 
> No. I was referring to the same thing.
> 
> However, I had also tried the following (which works well to check the
> undercommitted scenario), but I am thinking of using it only for yielding
> in the overcommit case (yield in overcommit suggested by Rik) and keeping
> the undercommit patch as suggested by PeterZ
> 
> [ patch is not in proper diff I suppose ].
> 
> Will test them.
> 
> Peter, Can I post your patch with your from/sob.. in V2?
> Please let me know..
> 
> ---
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 28f00bc..9ed3759 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1620,6 +1620,21 @@ bool kvm_vcpu_eligible_for_directed_yield(struct 
> kvm_vcpu *vcpu)
>   return eligible;
>   }
>   #endif
> +
> +bool kvm_overcommitted()
> +{
> + unsigned long load;
> +
> + load = avenrun[0] + FIXED_1/200;
> + load = load >> FSHIFT;
> + load = (load << 7) / num_online_cpus();
> +
> + if (load > 128)
> + return true;
> +
> + return false;
> +}
> +
>   void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>   {
>   struct kvm *kvm = me->kvm;
> @@ -1629,6 +1644,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>   int pass;
>   int i;
> 
> + if (!kvm_overcommitted())
> + return;
> +
>   kvm_vcpu_set_in_spin_loop(me, true);
>   /*
>* We boost the priority of a VCPU that is runnable but not




Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-28 Thread Peter Zijlstra
On Fri, 2012-09-28 at 11:08 +0530, Raghavendra K T wrote:
> 
> Peter, Can I post your patch with your from/sob.. in V2?
> Please let me know.. 

Yeah I guess ;-)


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-28 Thread Raghavendra K T

On 09/28/2012 11:15 AM, H. Peter Anvin wrote:

On 09/27/2012 10:38 PM, Raghavendra K T wrote:

+
+bool kvm_overcommitted()
+{


This better not be C...


I think you meant I should have had kvm_overcommitted(void) (and
perhaps a different function name)?

Or is it the body of the function?




Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread H. Peter Anvin

On 09/27/2012 10:38 PM, Raghavendra K T wrote:

+
+bool kvm_overcommitted()
+{


This better not be C...

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Raghavendra K T

On 09/27/2012 05:33 PM, Avi Kivity wrote:

On 09/27/2012 01:23 PM, Raghavendra K T wrote:


This gives us a good case for tracking preemption on a per-vm basis.  As
long as we aren't preempted, we can keep the PLE window high, and also
return immediately from the handler without looking for candidates.


1) So do you think, deferring preemption patch ( Vatsa was mentioning
long back)  is also another thing worth trying, so we reduce the chance
of LHP.


Yes, we have to keep it in mind.  It will be useful for fine grained
locks, not so much so coarse locks or IPIs.



Agree.


I would still of course prefer a PLE solution, but if we can't get it to
work we can consider preemption deferral.



Okay.



IIRC, with defer preemption :
we will have hook in spinlock/unlock path to measure depth of lock held,
and shared with host scheduler (may be via MSRs now).
Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather
give say one chance.


A downside is that we have to do that even when undercommitted.

Also there may be a lot of false positives (deferred preemptions even
when there is no contention).


Yes. That is a worry.





2) Looking at the result (comparing A & C), I do feel we have
significant overhead in iterating over vcpus (when compared to even a
vmexit), so we would still need the undercommit fix suggested by PeterZ
(improving by 140%)?


Looking only at the current runqueue?  My worry is that it misses a lot
of cases.  Maybe try the current runqueue first and then others.

Or were you referring to something else?


No, I was referring to the same thing.

However, I had also tried the following (which works well to check the
undercommitted scenario), but I am thinking of using it only for
yielding in the overcommit case (yield in overcommit suggested by Rik)
and keeping the undercommit patch as suggested by PeterZ.


[ patch is not in proper diff I suppose ].

Will test them.

Peter, Can I post your patch with your from/sob.. in V2?
Please let me know..

---
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 28f00bc..9ed3759 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1620,6 +1620,21 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)

return eligible;
 }
 #endif
+
+bool kvm_overcommitted()
+{
+   unsigned long load;
+
+   load = avenrun[0] + FIXED_1/200;
+   load = load >> FSHIFT;
+   load = (load << 7) / num_online_cpus();
+
+   if (load > 128)
+   return true;
+
+   return false;
+}
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
struct kvm *kvm = me->kvm;
@@ -1629,6 +1644,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
int pass;
int i;

+   if (!kvm_overcommitted())
+   return;
+
kvm_vcpu_set_in_spin_loop(me, true);
/*
 * We boost the priority of a VCPU that is runnable but not
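
(A note on what the check above computes: avenrun[] is the kernel's
fixed-point 1-minute load average, so after the shift and scaling the
three lines reduce to roughly 128 * (int)loadavg / num_online_cpus(),
and the load > 128 test fires once the load average exceeds the number
of online cpus. On the 32-cpu test box, for example, the handler would
keep returning early until the load average climbs past 32.)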



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Andrew Theurer
On Thu, 2012-09-27 at 14:03 +0200, Avi Kivity wrote:
> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
> >>
> >> This gives us a good case for tracking preemption on a per-vm basis.  As
> >> long as we aren't preempted, we can keep the PLE window high, and also
> >> return immediately from the handler without looking for candidates.
> > 
> > 1) So do you think, deferring preemption patch ( Vatsa was mentioning
> > long back)  is also another thing worth trying, so we reduce the chance
> > of LHP.
> 
> Yes, we have to keep it in mind.  It will be useful for fine grained
> locks, not so much so coarse locks or IPIs.
> 
> I would still of course prefer a PLE solution, but if we can't get it to
> work we can consider preemption deferral.
> 
> > 
> > IIRC, with defer preemption :
> > we will have hook in spinlock/unlock path to measure depth of lock held,
> > and shared with host scheduler (may be via MSRs now).
> > Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather
> > give say one chance.
> 
> A downside is that we have to do that even when undercommitted.
> 
> Also there may be a lot of false positives (deferred preemptions even
> when there is no contention).
> 
> > 
> > 2) Looking at the result (comparing A & C), I do feel we have
> > significant overhead in iterating over vcpus (when compared to even a
> > vmexit), so we would still need the undercommit fix suggested by PeterZ
> > (improving by 140%)?
> 
> Looking only at the current runqueue?  My worry is that it misses a lot
> of cases.  Maybe try the current runqueue first and then others.
> 
> Or were you referring to something else?
> 
> > 
> > So looking back at threads/ discussions so far, I am trying to
> > summarize, the discussions so far. I feel, at least here are the few
> > potential candidates to go in:
> > 
> > 1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
> > 2) Dynamically changing PLE window (Avi/Andrew/Chegu)
> > 3) preempt_notify handler to identify preempted VCPUs (Avi)
> > 4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
> > 5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
> > 6) Pv spinlock
> > 7) Jiannan's proposed improvements
> > 8) Defer preemption patches
> > 
> > Did we miss anything (or added extra?)
> > 
> > So here are my action items:
> > - I plan to repost this series with what PeterZ, Rik suggested with
> > performance analysis.
> > - I ll go back and explore on (3) and (6) ..
> > 
> > Please Let me know..
> 
> Undoubtedly we'll think of more stuff.  But this looks like a good start.

9) lazy gang-like scheduling with PLE to cover the non-gang-like
exceptions  (/me runs and hides from scheduler folks)

-Andrew Theurer



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Avi Kivity
On 09/27/2012 01:23 PM, Raghavendra K T wrote:
>>
>> This gives us a good case for tracking preemption on a per-vm basis.  As
>> long as we aren't preempted, we can keep the PLE window high, and also
>> return immediately from the handler without looking for candidates.
> 
> 1) So do you think, deferring preemption patch ( Vatsa was mentioning
> long back)  is also another thing worth trying, so we reduce the chance
> of LHP.

Yes, we have to keep it in mind.  It will be useful for fine grained
locks, not so much so coarse locks or IPIs.

I would still of course prefer a PLE solution, but if we can't get it to
work we can consider preemption deferral.

> 
> IIRC, with defer preemption :
> we will have hook in spinlock/unlock path to measure depth of lock held,
> and shared with host scheduler (may be via MSRs now).
> Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather
> give say one chance.

A downside is that we have to do that even when undercommitted.

Also there may be a lot of false positives (deferred preemptions even
when there is no contention).

> 
> 2) Looking at the result (comparing A & C), I do feel we have
> significant overhead in iterating over vcpus (when compared to even a
> vmexit), so we would still need the undercommit fix suggested by PeterZ
> (improving by 140%)?

Looking only at the current runqueue?  My worry is that it misses a lot
of cases.  Maybe try the current runqueue first and then others.

Or were you referring to something else?

> 
> So looking back at threads/ discussions so far, I am trying to
> summarize, the discussions so far. I feel, at least here are the few
> potential candidates to go in:
> 
> 1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
> 2) Dynamically changing PLE window (Avi/Andrew/Chegu)
> 3) preempt_notify handler to identify preempted VCPUs (Avi)
> 4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
> 5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
> 6) Pv spinlock
> 7) Jiannan's proposed improvements
> 8) Defer preemption patches
> 
> Did we miss anything (or added extra?)
> 
> So here are my action items:
> - I plan to repost this series with what PeterZ, Rik suggested with
> performance analysis.
> - I ll go back and explore on (3) and (6) ..
> 
> Please Let me know..

Undoubtedly we'll think of more stuff.  But this looks like a good start.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Raghavendra K T

On 09/27/2012 03:58 PM, Andrew Jones wrote:

On Thu, Sep 27, 2012 at 03:19:45PM +0530, Raghavendra K T wrote:

On 09/25/2012 08:30 PM, Dor Laor wrote:

On 09/24/2012 02:02 PM, Raghavendra K T wrote:

On 09/24/2012 02:12 PM, Dor Laor wrote:

In order to help PLE and pvticketlock converge I thought that a small
test code should be developed to test this in a predictable,
deterministic way.

The idea is to have a guest kernel module that spawn a new thread each
time you write to a /sys/ entry.

Each such a thread spins over a spin lock. The specific spin lock is
also chosen by the /sys/ interface. Let's say we have an array of spin
locks *10 times the amount of vcpus.

All the threads are running a
while (1) {

spin_lock(my_lock);
sum += execute_dummy_cpu_computation(time);
spin_unlock(my_lock);

if (sys_tells_thread_to_die()) break;
}

print_result(sum);
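
(A minimal sketch of what one such spinning kthread could look like in
the proposed guest module; NR_TEST_LOCKS, execute_dummy_cpu_computation()
and the /sys plumbing that picks the lock index are assumptions.)

    #include <linux/kthread.h>
    #include <linux/spinlock.h>
    #include <linux/printk.h>

    static spinlock_t test_locks[NR_TEST_LOCKS]; /* ~10 * nr_vcpus, each
                                                    spin_lock_init()ed at
                                                    module load */
    struct spin_worker {
            int lock_idx;           /* which lock to contend on (from /sys) */
            unsigned long time;     /* dummy work done while holding it     */
            unsigned long long sum; /* per-thread score, printed on exit    */
    };

    static int spin_worker_fn(void *data)
    {
            struct spin_worker *w = data;
            spinlock_t *lock = &test_locks[w->lock_idx];

            while (!kthread_should_stop()) {      /* sys_tells_thread_to_die() */
                    spin_lock(lock);
                    w->sum += execute_dummy_cpu_computation(w->time);
                    spin_unlock(lock);
            }
            pr_info("spin worker on lock %d: sum=%llu\n", w->lock_idx, w->sum);
            return 0;
    }

    /* each write to the /sys entry would then do something like:       */
    /*     kthread_run(spin_worker_fn, w, "spintest/%d", w->lock_idx);  */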

Instead of calling the kernel's spin_lock functions, clone them and make
the ticket lock order deterministic and known (like a linear walk of all
the threads trying to catch that lock).


By cloning, do you mean a hierarchy of the locks?


No, I meant to clone the implementation of the current spin lock code in
order to set any order you may like for the ticket selection.
(even for a non pvticket lock version)

For instance, let's say you have N threads trying to grab the lock, you
can always make the ticket go linearly from 1->2...->N.
Not sure it's a good idea, just a recommendation.


Also I believe time should be passed via sysfs / hardcoded for each
type of lock we are mimicking


Yap





This way you can easy calculate:
1. the score of a single vcpu running a single thread
2. the score of sum of all thread scores when #thread==#vcpu all
taking the same spin lock. The overall sum should be close as
possible to #1.
3. Like #2 but #threads > #vcpus and other versions of #total vcpus
(belonging to all VMs) > #pcpus.
4. Create #thread == #vcpus but let each thread have it's own spin
lock
5. Like 4 + 2

Hopefully this way will allows you to judge and evaluate the exact
overhead of scheduling VMs and threads since you have the ideal result
in hand and you know what the threads are doing.

My 2 cents, Dor



Thank you,
I think this is an excellent idea. (Though I am still trying to put
together all the pieces you mentioned.) So overall we should be able to
measure the performance of pvspinlock/PLE improvements with a
deterministic load in the guest.

The only thing I am missing is how to generate the different
combinations of the locks.

Okay, let me see if I can come up with a solid model for this.



Do you mean the various options for PLE/pvticket/other? I haven't
thought of it and assumed its static but it can also be controlled
through the temporary /sys interface.



No, I am not there yet.

So in summary, we are suffering from inconsistent benchmark results
while measuring the benefit of our improvements in PLE/pvlock etc.


Are you measuring the combined throughput of all running guests, or
just looking at the results of the benchmarks in a single test guest?

I've done some benchmarking as well and my stddevs look pretty good for
kcbench, ebizzy, dbench, and sysbench-memory. I do 5 runs for each
overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that
full sequence of tests (one with the overcommit levels in scrambled
order). The relative stddevs for each of the sets of 5 runs look pretty
good, and the data for the 2 runs match nicely as well.

To try and get consistent results I do the following
- interleave the memory of all guests across all numa nodes on the
   machine
- echo 0 > /proc/sys/kernel/randomize_va_space on both host and test
   guest


I was not doing this.


- echo 3 > /proc/sys/vm/drop_caches on both host and test guest before
   each run


I was doing that already, as you know.


- use a ramdisk for the benchmark output files on all running guests


Yes.. this is also helpful


- no periodically running services installed on the test guest
- HT is turned off as you do, although I'd like to try running again
   with it turned back on
Although, I still need to run again measuring the combined throughput
of all running vms (including the ones launched just to generate busy
vcpus). Maybe my results won't be as consistent then...


Maybe. I take the average across all the VMs.



Drew



So the good points from your suggestion are:
- Giving predictability to the workload that runs in the guest, so that
we have a like-for-like comparison of the improvement.

- We can easily tune the workload via sysfs, and we can have scripts to
automate it.

What is complicated is:
- How can we simulate a workload close to what we measure with
benchmarks?
- How can we mimic lock-holding times / lock hierarchy close to the way
they are seen with real workloads (e.g. a highly contended zone lru lock
with a similar amount of lock-holding time)?
- How close would it be once we forget about other types of spinning
(e.g. flush_tlb)?

So I feel it is not as trivial as it looks.


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Raghavendra K T

On 09/27/2012 02:06 PM, Avi Kivity wrote:

On 09/25/2012 03:40 PM, Raghavendra K T wrote:

On 09/24/2012 07:46 PM, Raghavendra K T wrote:

On 09/24/2012 07:24 PM, Peter Zijlstra wrote:

On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:

However, Rik had a genuine concern about the cases where the runqueues
are not equally distributed and the lock holder might actually be on a
different run queue but not running.


Load should eventually get distributed equally -- that's what the
load-balancer is for -- so this is a temporary situation.

We already try and favour the non running vcpu in this case, that's what
yield_to_task_fair() is about. If its still not eligible to run, tough
luck.


Yes, I agree.




Do you think that, instead of using rq->nr_running, we could get a global
sense of load using avenrun (something like avenrun/num_online_cpus())?


To what purpose? Also, global stuff is expensive, so you should try and
stay away from it as hard as you possibly can.


Yes, that concern is exactly what made me fall back to rq->nr_running.

Will come back with the result soon.


Got the results with the patches; here they are.

Tried this on a 32-core PLE box with HT disabled: 32 guest vcpus, at
1x and 2x overcommit.

Base = 3.6.0-rc5 + ple handler optimization patches
A = Base + checking rq_running in vcpu_on_spin() patch
B = Base + checking rq->nr_running in sched/core
C = Base - PLE

---+-----------+-----------+-----------+-----------+
   |      Ebizzy result (rec/sec, higher is better)|
---+-----------+-----------+-----------+-----------+
   | Base      | A         | B         | C         |
---+-----------+-----------+-----------+-----------+
1x | 2374.1250 | 7273.7500 | 5690.8750 | 7364.3750 |
2x | 2536.2500 | 2458.5000 | 2426.3750 |   48.5000 |
---+-----------+-----------+-----------+-----------+

% improvements w.r.t BASE
---+------------+------------+------------+
   | A          | B          | C          |
---+------------+------------+------------+
1x |  206.37603 |  139.70410 |  210.19323 |
2x |   -3.06555 |   -4.33218 |  -98.08773 |
---+------------+------------+------------+

With this approach (A) we get almost the benefit of the PLE-disabled
case. With patch B the gain drops a bit (because we would still iterate
over vcpus until we decide to do a directed yield).


This gives us a good case for tracking preemption on a per-vm basis.  As
long as we aren't preempted, we can keep the PLE window high, and also
return immediately from the handler without looking for candidates.


1) So do you think the defer-preemption patch (Vatsa was mentioning it
long back) is also another thing worth trying, so that we reduce the
chance of LHP?

IIRC, with defer-preemption:
we will have a hook in the spinlock/unlock path to measure the depth of
locks held, shared with the host scheduler (maybe via MSRs now). The
host scheduler 'prefers' not to preempt a lock-holding vcpu (or rather
gives it, say, one chance).

2) Looking at the result (comparing A & C), I do feel we have
significant overhead in iterating over vcpus (when compared to even a
vmexit), so we would still need the undercommit fix suggested by PeterZ
(improving by 140%)?

So looking back at the threads/discussions so far, I am trying to
summarize them. I feel these, at least, are the few potential candidates
to go in:

1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
2) Dynamically changing PLE window (Avi/Andrew/Chegu)
3) preempt_notify handler to identify preempted VCPUs (Avi)
4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
6) Pv spinlock
7) Jiannan's proposed improvements
8) Defer preemption patches

Did we miss anything (or add anything extra)?

So here are my action items:
- I plan to repost this series with what PeterZ and Rik suggested, with
performance analysis.
- I'll go back and explore (3) and (6).

Please let me know.








Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Avi Kivity
On 09/27/2012 12:28 PM, Andrew Jones wrote:

>> No, I am not there yet.
>> 
>> So In summary, we are suffering with inconsistent benchmark result,
>> while measuring the benefit of our improvement in PLE/pvlock etc..
> 
> Are you measuring the combined throughput of all running guests, or
> just looking at the results of the benchmarks in a single test guest?
> 
> I've done some benchmarking as well and my stddevs look pretty good for
> kcbench, ebizzy, dbench, and sysbench-memory. I do 5 runs for each
> overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that
> full sequence of tests (one with the overcommit levels in scrambled
> order). The relative stddevs for each of the sets of 5 runs look pretty
> good, and the data for the 2 runs match nicely as well.
> 
> To try and get consistent results I do the following 
> - interleave the memory of all guests across all numa nodes on the
>   machine
> - echo 0 > /proc/sys/kernel/randomize_va_space on both host and test
>   guest
> - echo 3 > /proc/sys/vm/drop_caches on both host and test guest before
>   each run
> - use a ramdisk for the benchmark output files on all running guests
> - no periodically running services installed on the test guest
> - HT is turned off as you do, although I'd like to try running again
>   with it turned back on
> 
> Although, I still need to run again measuring the combined throughput
> of all running vms (including the ones launched just to generate busy
> vcpus). Maybe my results won't be as consistent then...
> 


Another way to test is to execute

 perf stat -e 'kvm_exit exit_reason==40' sleep 10

to see how many PAUSEs were intercepted in a given time (except I just
invented the filter syntax).  The fewer we get, the more useful work the
system does.  This ignores kvm_vcpu_on_spin overhead though, so it's
just a rough measure.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Dor Laor

On 09/27/2012 11:49 AM, Raghavendra K T wrote:

On 09/25/2012 08:30 PM, Dor Laor wrote:

On 09/24/2012 02:02 PM, Raghavendra K T wrote:

On 09/24/2012 02:12 PM, Dor Laor wrote:

In order to help PLE and pvticketlock converge I thought that a small
test code should be developed to test this in a predictable,
deterministic way.

The idea is to have a guest kernel module that spawn a new thread each
time you write to a /sys/ entry.

Each such a thread spins over a spin lock. The specific spin lock is
also chosen by the /sys/ interface. Let's say we have an array of spin
locks *10 times the amount of vcpus.

All the threads are running a
while (1) {

spin_lock(my_lock);
sum += execute_dummy_cpu_computation(time);
spin_unlock(my_lock);

if (sys_tells_thread_to_die()) break;
}

print_result(sum);

Instead of calling the kernel's spin_lock functions, clone them and
make
the ticket lock order deterministic and known (like a linear walk of
all
the threads trying to catch that lock).
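
As a rough illustration of that worker loop in kernel-module form (the
lock array size, the sysfs plumbing and the dummy computation are
assumptions; execute_dummy_cpu_computation() is stood in for by
udelay()):

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/spinlock.h>
#include <linux/delay.h>

#define NR_TEST_LOCKS  320             /* e.g. 10 * number of vcpus */

static spinlock_t test_locks[NR_TEST_LOCKS]; /* spin_lock_init() each in module init */

struct spinner_arg {
        int lock_idx;                  /* which lock to contend, chosen via /sys */
        unsigned int hold_us;          /* simulated time spent inside the lock */
        unsigned long long sum;
};

/* One kthread is spawned per write to the /sys entry and stopped with
 * kthread_stop() when the test is torn down.
 */
static int spinner_fn(void *data)
{
        struct spinner_arg *arg = data;
        spinlock_t *lock = &test_locks[arg->lock_idx];

        while (!kthread_should_stop()) {
                spin_lock(lock);
                udelay(arg->hold_us);  /* stand-in for the dummy computation */
                arg->sum++;
                spin_unlock(lock);
        }
        pr_info("spinner(lock %d): sum=%llu\n", arg->lock_idx, arg->sum);
        return 0;
}

The deterministic ticket ordering would then come from the cloned lock
implementation rather than vanilla spin_lock(); that part is left out of
the sketch.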


By Cloning you mean hierarchy of the locks?


No, I meant to clone the implementation of the current spin lock code in
order to set any order you may like for the ticket selection.
(even for a non pvticket lock version)

For instance, let's say you have N threads trying to grab the lock, you
can always make the ticket go linearly from 1->2...->N.
Not sure it's a good idea, just a recommendation.


Also I believe time should be passed via sysfs / hardcoded for each
type of lock we are mimicking


Yap





This way you can easy calculate:
1. the score of a single vcpu running a single thread
2. the score of sum of all thread scores when #thread==#vcpu all
taking the same spin lock. The overall sum should be close as
possible to #1.
3. Like #2 but #threads > #vcpus and other versions of #total vcpus
(belonging to all VMs) > #pcpus.
4. Create #thread == #vcpus but let each thread have it's own spin
lock
5. Like 4 + 2

Hopefully this way will allows you to judge and evaluate the exact
overhead of scheduling VMs and threads since you have the ideal result
in hand and you know what the threads are doing.

My 2 cents, Dor



Thank you,
I think this is an excellent idea. ( Though I am trying to put all the
pieces together you mentioned). So overall we should be able to measure
the performance of pvspinlock/PLE improvements with a deterministic
load in guest.

Only thing I am missing is,
How to generate different combinations of the lock.

Okay, let me see if I can come with a solid model for this.



Do you mean the various options for PLE/pvticket/other? I haven't
thought of it and assumed its static but it can also be controlled
through the temporary /sys interface.



No, I am not there yet.

So In summary, we are suffering with inconsistent benchmark result,
while measuring the benefit of our improvement in PLE/pvlock etc..

So good point from your suggestion is,
- Giving predictability to workload that runs in guest, so that we have
pi-pi comparison of improvement.

- we can easily tune the workload via sysfs, and we can have script to
automate them.

What is complicated is:
- How can we simulate a workload close to what we measure with
benchmarks?
- How can we mimic lock holding time/ lock hierarchy close to the way
it is seen with real workloads (for e.g. highly contended zone lru lock
with similar amount of lockholding times).


You can spin for a similar instruction count that you're interested in.


- How close it would be to when we forget about other types of spinning
(for e.g, flush_tlb).

So I feel it is not as trivial as it looks like.



Indeed, this is mainly a tool that can serve to optimize a few synthetic
workloads.
I still believe that it is worth going through this exercise, since a 100%
predictable and controlled case can help us purely assess the state of the
PLE and pvticket code. Otherwise we're dealing w/ too many parameters
and assumptions at once.


Dor



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Andrew Jones
On Thu, Sep 27, 2012 at 03:19:45PM +0530, Raghavendra K T wrote:
> On 09/25/2012 08:30 PM, Dor Laor wrote:
> >On 09/24/2012 02:02 PM, Raghavendra K T wrote:
> >>On 09/24/2012 02:12 PM, Dor Laor wrote:
> >>>In order to help PLE and pvticketlock converge I thought that a small
> >>>test code should be developed to test this in a predictable,
> >>>deterministic way.
> >>>
> >>>The idea is to have a guest kernel module that spawn a new thread each
> >>>time you write to a /sys/ entry.
> >>>
> >>>Each such a thread spins over a spin lock. The specific spin lock is
> >>>also chosen by the /sys/ interface. Let's say we have an array of spin
> >>>locks *10 times the amount of vcpus.
> >>>
> >>>All the threads are running a
> >>>while (1) {
> >>>
> >>>spin_lock(my_lock);
> >>>sum += execute_dummy_cpu_computation(time);
> >>>spin_unlock(my_lock);
> >>>
> >>>if (sys_tells_thread_to_die()) break;
> >>>}
> >>>
> >>>print_result(sum);
> >>>
> >>>Instead of calling the kernel's spin_lock functions, clone them and make
> >>>the ticket lock order deterministic and known (like a linear walk of all
> >>>the threads trying to catch that lock).
> >>
> >>By Cloning you mean hierarchy of the locks?
> >
> >No, I meant to clone the implementation of the current spin lock code in
> >order to set any order you may like for the ticket selection.
> >(even for a non pvticket lock version)
> >
> >For instance, let's say you have N threads trying to grab the lock, you
> >can always make the ticket go linearly from 1->2...->N.
> >Not sure it's a good idea, just a recommendation.
> >
> >>Also I believe time should be passed via sysfs / hardcoded for each
> >>type of lock we are mimicking
> >
> >Yap
> >
> >>
> >>>
> >>>This way you can easy calculate:
> >>>1. the score of a single vcpu running a single thread
> >>>2. the score of sum of all thread scores when #thread==#vcpu all
> >>>taking the same spin lock. The overall sum should be close as
> >>>possible to #1.
> >>>3. Like #2 but #threads > #vcpus and other versions of #total vcpus
> >>>(belonging to all VMs) > #pcpus.
> >>>4. Create #thread == #vcpus but let each thread have it's own spin
> >>>lock
> >>>5. Like 4 + 2
> >>>
> >>>Hopefully this way will allows you to judge and evaluate the exact
> >>>overhead of scheduling VMs and threads since you have the ideal result
> >>>in hand and you know what the threads are doing.
> >>>
> >>>My 2 cents, Dor
> >>>
> >>
> >>Thank you,
> >>I think this is an excellent idea. ( Though I am trying to put all the
> >>pieces together you mentioned). So overall we should be able to measure
> >>the performance of pvspinlock/PLE improvements with a deterministic
> >>load in guest.
> >>
> >>Only thing I am missing is,
> >>How to generate different combinations of the lock.
> >>
> >>Okay, let me see if I can come with a solid model for this.
> >>
> >
> >Do you mean the various options for PLE/pvticket/other? I haven't
> >thought of it and assumed its static but it can also be controlled
> >through the temporary /sys interface.
> >
> 
> No, I am not there yet.
> 
> So In summary, we are suffering with inconsistent benchmark result,
> while measuring the benefit of our improvement in PLE/pvlock etc..

Are you measuring the combined throughput of all running guests, or
just looking at the results of the benchmarks in a single test guest?

I've done some benchmarking as well and my stddevs look pretty good for
kcbench, ebizzy, dbench, and sysbench-memory. I do 5 runs for each
overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that
full sequence of tests (one with the overcommit levels in scrambled
order). The relative stddevs for each of the sets of 5 runs look pretty
good, and the data for the 2 runs match nicely as well.

To try and get consistent results I do the following 
- interleave the memory of all guests across all numa nodes on the
  machine
- echo 0 > /proc/sys/kernel/randomize_va_space on both host and test
  guest
- echo 3 > /proc/sys/vm/drop_caches on both host and test guest before
  each run
- use a ramdisk for the benchmark output files on all running guests
- no periodically running services installed on the test guest
- HT is turned off as you do, although I'd like to try running again
  with it turned back on

Although, I still need to run again measuring the combined throughput
of all running vms (including the ones launched just to generate busy
vcpus). Maybe my results won't be as consistent then...

Drew

> 
> So good point from your suggestion is,
> - Giving predictability to workload that runs in guest, so that we have
> pi-pi comparison of improvement.
> 
> - we can easily tune the workload via sysfs, and we can have script to
> automate them.
> 
> What is complicated is:
> - How can we simulate a workload close to what we measure with
> benchmarks?
> - How can we mimic lock holding time/ lock hierarchy close to the way
> it is seen with real workloads (for e.g. highly contended zone lru lock
> with similar amount 

Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Raghavendra K T

On 09/26/2012 06:27 PM, Andrew Jones wrote:

On Mon, Sep 24, 2012 at 02:36:05PM +0200, Peter Zijlstra wrote:

On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:

On 09/24/2012 05:04 PM, Peter Zijlstra wrote:

On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:

In some special scenarios like #vcpu<= #pcpu, PLE handler may
prove very costly, because there is no need to iterate over vcpus
and do unsuccessful yield_to burning CPU.


What's the costly thing? The vm-exit, the yield (which should be a nop
if its the only task there) or something else entirely?


Both vmexit and yield_to() actually,

because unsuccessful yield_to() overall is costly in PLE handler.

This is because when we have large guests, say 32/16 vcpus, and one
vcpu is holding lock, rest of the vcpus waiting for the lock, when they
do PL-exit, each of the vcpu try to iterate over rest of vcpu list in
the VM and try to do directed yield (unsuccessful). (O(n^2) tries).

this results in a fairly high amount of cpu burning and double run-queue
lock contention.

(if they were spinning probably lock progress would have been faster).
As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which
seems little complex to achieve currently.


OK, so the vmexit stays and we need to improve yield_to.


Can't we do this check sooner as well, as it only requires per-cpu data?
If we do it way back in kvm_vcpu_on_spin, then we avoid get_pid_task()
and a bunch of read barriers from kvm_for_each_vcpu. Also, moving the test
into kvm code would allow us to do other kvm things as a result of the
check in order to avoid some vmexits. It looks like we should be able to
avoid some without much complexity by just making a per-vm ple_window
variable, and then, when we hit the nr_running == 1 condition, also doing
vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP))
Reset the window to the default value when we successfully yield (and
maybe we should limit the number of bumps).


We indeed checked early in the original undercommit patch and it gave
results closer to the PLE-disabled case. But I agree with Peter that it
is ugly to export nr_running info to the ple handler.
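
For reference, a minimal sketch of that early bail-out (illustrative
only, not the posted patch; this_rq_has_single_task() is an invented
stand-in for whatever nr_running == 1 test the scheduler would have to
export, which is exactly the ugliness being debated):

/*
 * Idea: when this pCPU's runqueue holds nothing but the current vcpu
 * task, the guest is undercommitted and a directed yield cannot help,
 * so skip the candidate scan entirely.
 */
void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
        if (this_rq_has_single_task())  /* assumed helper */
                return;                 /* nobody to yield to */

        /* ... existing candidate search and kvm_vcpu_yield_to() loop ... */
}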


Looking at the result and comparing result of A and C,

Base = 3.6.0-rc5 + ple handler optimization patches
A = Base + checking rq_running in vcpu_on_spin() patch
B = Base + checking rq->nr_running in sched/core
C = Base - PLE

   % improvements w.r.t BASE
---+------------+------------+------------+
   |     A      |     B      |     C      |
---+------------+------------+------------+
1x | 206.37603  |  139.70410 |  210.19323 |


I have a feeling that vmexit has not caused significant overhead
compared to iterating over vcpus in PLE handler.. Does it not sound so?

But

vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP))


is worth trying. I will have to see it eventually.
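
A rough sketch of what that per-VM window bump could look like
(PLE_WINDOW and vmcs_write32() are existing VMX code; the per-VM field,
the constants and the hook points below are assumptions for
illustration):

#define PLE_WINDOW_DEFAULT  4096        /* vmx ple_window default */
#define PLE_WINDOW_BUMP     1024        /* invented step */
#define PLE_WINDOW_MAX      (16 * PLE_WINDOW_DEFAULT)  /* cap the bumps */

/* Called when the nr_running == 1 condition is hit in the handler. */
static void ple_window_grow(struct kvm *kvm)
{
        if (kvm->ple_window < PLE_WINDOW_MAX)   /* kvm->ple_window is assumed */
                kvm->ple_window += PLE_WINDOW_BUMP;
        vmcs_write32(PLE_WINDOW, kvm->ple_window);
}

/* Called after a successful directed yield. */
static void ple_window_reset(struct kvm *kvm)
{
        kvm->ple_window = PLE_WINDOW_DEFAULT;
        vmcs_write32(PLE_WINDOW, kvm->ple_window);
}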







Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Raghavendra K T

On 09/26/2012 05:57 PM, Konrad Rzeszutek Wilk wrote:

On Tue, Sep 25, 2012 at 05:00:30PM +0200, Dor Laor wrote:

On 09/24/2012 02:02 PM, Raghavendra K T wrote:

On 09/24/2012 02:12 PM, Dor Laor wrote:

In order to help PLE and pvticketlock converge I thought that a small
test code should be developed to test this in a predictable,
deterministic way.

The idea is to have a guest kernel module that spawn a new thread each
time you write to a /sys/ entry.

Each such a thread spins over a spin lock. The specific spin lock is
also chosen by the /sys/ interface. Let's say we have an array of spin
locks *10 times the amount of vcpus.

All the threads are running a
while (1) {

spin_lock(my_lock);
sum += execute_dummy_cpu_computation(time);
spin_unlock(my_lock);

if (sys_tells_thread_to_die()) break;
}

print_result(sum);

Instead of calling the kernel's spin_lock functions, clone them and make
the ticket lock order deterministic and known (like a linear walk of all
the threads trying to catch that lock).


By Cloning you mean hierarchy of the locks?


No, I meant to clone the implementation of the current spin lock
code in order to set any order you may like for the ticket
selection.
(even for a non pvticket lock version)


Wouldn't that defeat the purpose of trying the test the different
implementations that try to fix the lock-holder preemption problem?
You want something that you can shoe-in for all work-loads - also
for this test system.


Hmm true. I think it is indeed difficult to shoe-in all workloads.



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Raghavendra K T

On 09/25/2012 08:30 PM, Dor Laor wrote:

On 09/24/2012 02:02 PM, Raghavendra K T wrote:

On 09/24/2012 02:12 PM, Dor Laor wrote:

In order to help PLE and pvticketlock converge I thought that a small
test code should be developed to test this in a predictable,
deterministic way.

The idea is to have a guest kernel module that spawn a new thread each
time you write to a /sys/ entry.

Each such a thread spins over a spin lock. The specific spin lock is
also chosen by the /sys/ interface. Let's say we have an array of spin
locks *10 times the amount of vcpus.

All the threads are running a
while (1) {

spin_lock(my_lock);
sum += execute_dummy_cpu_computation(time);
spin_unlock(my_lock);

if (sys_tells_thread_to_die()) break;
}

print_result(sum);

Instead of calling the kernel's spin_lock functions, clone them and make
the ticket lock order deterministic and known (like a linear walk of all
the threads trying to catch that lock).


By Cloning you mean hierarchy of the locks?


No, I meant to clone the implementation of the current spin lock code in
order to set any order you may like for the ticket selection.
(even for a non pvticket lock version)

For instance, let's say you have N threads trying to grab the lock, you
can always make the ticket go linearly from 1->2...->N.
Not sure it's a good idea, just a recommendation.


Also I believe time should be passed via sysfs / hardcoded for each
type of lock we are mimicking


Yap





This way you can easy calculate:
1. the score of a single vcpu running a single thread
2. the score of sum of all thread scores when #thread==#vcpu all
taking the same spin lock. The overall sum should be close as
possible to #1.
3. Like #2 but #threads > #vcpus and other versions of #total vcpus
(belonging to all VMs) > #pcpus.
4. Create #thread == #vcpus but let each thread have it's own spin
lock
5. Like 4 + 2

Hopefully this way will allows you to judge and evaluate the exact
overhead of scheduling VMs and threads since you have the ideal result
in hand and you know what the threads are doing.

My 2 cents, Dor



Thank you,
I think this is an excellent idea. ( Though I am trying to put all the
pieces together you mentioned). So overall we should be able to measure
the performance of pvspinlock/PLE improvements with a deterministic
load in guest.

Only thing I am missing is,
How to generate different combinations of the lock.

Okay, let me see if I can come with a solid model for this.



Do you mean the various options for PLE/pvticket/other? I haven't
thought of it and assumed its static but it can also be controlled
through the temporary /sys interface.



No, I am not there yet.

So In summary, we are suffering with inconsistent benchmark result,
while measuring the benefit of our improvement in PLE/pvlock etc..

So good point from your suggestion is,
- Giving predictability to workload that runs in guest, so that we have
pi-pi comparison of improvement.

- we can easily tune the workload via sysfs, and we can have script to
automate them.

What is complicated is:
- How can we simulate a workload close to what we measure with
benchmarks?
- How can we mimic lock holding time/ lock hierarchy close to the way
it is seen with real workloads (for e.g. highly contended zone lru lock
with similar amount of lockholding times).
- How close it would be to when we forget about other types of spinning
(for e.g, flush_tlb).

So I feel it is not as trivial as it looks like.



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Avi Kivity
On 09/25/2012 03:40 PM, Raghavendra K T wrote:
> On 09/24/2012 07:46 PM, Raghavendra K T wrote:
>> On 09/24/2012 07:24 PM, Peter Zijlstra wrote:
>>> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
 However Rik had a genuine concern in the cases where runqueue is not
 equally distributed and lockholder might actually be on a different run
 queue but not running.
>>>
>>> Load should eventually get distributed equally -- that's what the
>>> load-balancer is for -- so this is a temporary situation.
>>>
>>> We already try and favour the non running vcpu in this case, that's what
>>> yield_to_task_fair() is about. If its still not eligible to run, tough
>>> luck.
>>
>> Yes, I agree.
>>
>>>
 Do you think instead of using rq->nr_running, we could get a global
 sense of load using avenrun (something like avenrun/num_onlinecpus)
>>>
>>> To what purpose? Also, global stuff is expensive, so you should try and
>>> stay away from it as hard as you possibly can.
>>
>> Yes, that concern only had made me to fall back to rq->nr_running.
>>
>> Will come back with the result soon.
> 
> Got the result with the patches:
> So here is the result,
> 
> Tried this on a 32 core ple box with HT disabled. 32 guest vcpus with
> 1x and 2x overcommits
> 
> Base = 3.6.0-rc5 + ple handler optimization patches
> A = Base + checking rq_running in vcpu_on_spin() patch
> B = Base + checking rq->nr_running in sched/core
> C = Base - PLE
> 
> ---+-----------+-----------+-----------+-----------+
>    |      Ebizzy result (rec/sec, higher is better)|
> ---+-----------+-----------+-----------+-----------+
>    |   Base    |     A     |     B     |     C     |
> ---+-----------+-----------+-----------+-----------+
> 1x | 2374.1250 | 7273.7500 | 5690.8750 | 7364.3750 |
> 2x | 2536.2500 | 2458.5000 | 2426.3750 |   48.5000 |
> ---+-----------+-----------+-----------+-----------+
> 
>    % improvements w.r.t BASE
> ---+------------+------------+------------+
>    |     A      |     B      |     C      |
> ---+------------+------------+------------+
> 1x | 206.37603  |  139.70410 |  210.19323 |
> 2x | -3.06555   |  -4.33218  |  -98.08773 |
> ---+------------+------------+------------+
> 
> we are getting the benefit of almost PLE disabled case with this
> approach. With patch B, we have dropped a bit in gain.
> (because we still would iterate vcpus until we decide to do a directed
> yield).

This gives us a good case for tracking preemption on a per-vm basis.  As
long as we aren't preempted, we can keep the PLE window high, and also
return immediately from the handler without looking for candidates.
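
A minimal sketch of that per-VM tracking, hung off the preempt notifiers
KVM already registers per vcpu (kvm_sched_out() and
preempt_notifier_to_vcpu() exist in kvm_main.c; the preempted /
any_preempted fields and the handler check are assumptions):

static void kvm_sched_out(struct preempt_notifier *pn,
                          struct task_struct *next)
{
        struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

        vcpu->preempted = true;            /* assumed field */
        vcpu->kvm->any_preempted = true;   /* assumed field */
        kvm_arch_vcpu_put(vcpu);
}

/*
 * And at the top of kvm_vcpu_on_spin(), bail out while nothing in the
 * VM has been preempted:
 *
 *      if (!me->kvm->any_preempted)
 *              return;
 */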


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Raghavendra K T

On 09/27/2012 03:58 PM, Andrew Jones wrote:

On Thu, Sep 27, 2012 at 03:19:45PM +0530, Raghavendra K T wrote:

On 09/25/2012 08:30 PM, Dor Laor wrote:

On 09/24/2012 02:02 PM, Raghavendra K T wrote:

On 09/24/2012 02:12 PM, Dor Laor wrote:

In order to help PLE and pvticketlock converge I thought that a small
test code should be developed to test this in a predictable,
deterministic way.

The idea is to have a guest kernel module that spawn a new thread each
time you write to a /sys/ entry.

Each such a thread spins over a spin lock. The specific spin lock is
also chosen by the /sys/ interface. Let's say we have an array of spin
locks *10 times the amount of vcpus.

All the threads are running a
while (1) {

spin_lock(my_lock);
sum += execute_dummy_cpu_computation(time);
spin_unlock(my_lock);

if (sys_tells_thread_to_die()) break;
}

print_result(sum);

Instead of calling the kernel's spin_lock functions, clone them and make
the ticket lock order deterministic and known (like a linear walk of all
the threads trying to catch that lock).


By Cloning you mean hierarchy of the locks?


No, I meant to clone the implementation of the current spin lock code in
order to set any order you may like for the ticket selection.
(even for a non pvticket lock version)

For instance, let's say you have N threads trying to grab the lock, you
can always make the ticket go linearly from 1->2...->N.
Not sure it's a good idea, just a recommendation.


Also I believe time should be passed via sysfs / hardcoded for each
type of lock we are mimicking


Yap





This way you can easy calculate:
1. the score of a single vcpu running a single thread
2. the score of sum of all thread scores when #thread==#vcpu all
taking the same spin lock. The overall sum should be close as
possible to #1.
3. Like #2 but #threads > #vcpus and other versions of #total vcpus
(belonging to all VMs) > #pcpus.
4. Create #thread == #vcpus but let each thread have it's own spin
lock
5. Like 4 + 2

Hopefully this way will allows you to judge and evaluate the exact
overhead of scheduling VMs and threads since you have the ideal result
in hand and you know what the threads are doing.

My 2 cents, Dor



Thank you,
I think this is an excellent idea. ( Though I am trying to put all the
pieces together you mentioned). So overall we should be able to measure
the performance of pvspinlock/PLE improvements with a deterministic
load in guest.

Only thing I am missing is,
How to generate different combinations of the lock.

Okay, let me see if I can come with a solid model for this.



Do you mean the various options for PLE/pvticket/other? I haven't
thought of it and assumed its static but it can also be controlled
through the temporary /sys interface.



No, I am not there yet.

So In summary, we are suffering with inconsistent benchmark result,
while measuring the benefit of our improvement in PLE/pvlock etc..


Are you measuring the combined throughput of all running guests, or
just looking at the results of the benchmarks in a single test guest?

I've done some benchmarking as well and my stddevs look pretty good for
kcbench, ebizzy, dbench, and sysbench-memory. I do 5 runs for each
overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that
full sequence of tests (one with the overcommit levels in scrambled
order). The relative stddevs for each of the sets of 5 runs look pretty
good, and the data for the 2 runs match nicely as well.

To try and get consistent results I do the following
- interleave the memory of all guests across all numa nodes on the
   machine
- echo 0 > /proc/sys/kernel/randomize_va_space on both host and test
   guest


I was not doing this.


- echo 3 > /proc/sys/vm/drop_caches on both host and test guest before
   each run


was doing already as you know


- use a ramdisk for the benchmark output files on all running guests


Yes.. this is also helpful


- no periodically running services installed on the test guest
- HT is turned off as you do, although I'd like to try running again
   with it turned back on
Although, I still need to run again measuring the combined throughput
of all running vms (including the ones launched just to generate busy
vcpus). Maybe my results won't be as consistent then...


Maybe. I take the average from all the VMs.



Drew



So good point from your suggestion is,
- Giving predictability to workload that runs in guest, so that we have
pi-pi comparison of improvement.

- we can easily tune the workload via sysfs, and we can have script to
automate them.

What is complicated is:
- How can we simulate a workload close to what we measure with
benchmarks?
- How can we mimic lock holding time/ lock hierarchy close to the way
it is seen with real workloads (for e.g. highly contended zone lru lock
with similar amount of lockholding times).
- How close it would be to when we forget about other types of spinning
(for e.g, flush_tlb).

So I feel it is not as trivial as it looks like.


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Avi Kivity
On 09/27/2012 01:23 PM, Raghavendra K T wrote:

 This gives us a good case for tracking preemption on a per-vm basis.  As
 long as we aren't preempted, we can keep the PLE window high, and also
 return immediately from the handler without looking for candidates.
 
 1) So do you think, deferring preemption patch ( Vatsa was mentioning
 long back)  is also another thing worth trying, so we reduce the chance
 of LHP.

Yes, we have to keep it in mind.  It will be useful for fine grained
locks, not so much so coarse locks or IPIs.

I would still of course prefer a PLE solution, but if we can't get it to
work we can consider preemption deferral.

 
 IIRC, with defer preemption :
 we will have hook in spinlock/unlock path to measure depth of lock held,
 and shared with host scheduler (may be via MSRs now).
 Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather
 give say one chance.

A downside is that we have to do that even when undercommitted.

Also there may be a lot of false positives (deferred preemptions even
when there is no contention).

 
 2) looking at the result (comparing A & C) , I do feel we have
 significant in iterating over vcpus (when compared to even vmexit)
 so We still would need undercommit fix sugested by PeterZ (improving by
 140%). ?

Looking only at the current runqueue?  My worry is that it misses a lot
of cases.  Maybe try the current runqueue first and then others.

Or were you referring to something else?

 
 So looking back at threads/ discussions so far, I am trying to
 summarize, the discussions so far. I feel, at least here are the few
 potential candidates to go in:
 
 1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
 2) Dynamically changing PLE window (Avi/Andrew/Chegu)
 3) preempt_notify handler to identify preempted VCPUs (Avi)
 4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
 5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
 6) Pv spinlock
 7) Jiannan's proposed improvements
 8) Defer preemption patches
 
 Did we miss anything (or added extra?)
 
 So here are my action items:
 - I plan to repost this series with what PeterZ, Rik suggested with
 performance analysis.
 - I ll go back and explore on (3) and (6) ..
 
 Please Let me know..

Undoubtedly we'll think of more stuff.  But this looks like a good start.


-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Andrew Theurer
On Thu, 2012-09-27 at 14:03 +0200, Avi Kivity wrote:
 On 09/27/2012 01:23 PM, Raghavendra K T wrote:
 
  This gives us a good case for tracking preemption on a per-vm basis.  As
  long as we aren't preempted, we can keep the PLE window high, and also
  return immediately from the handler without looking for candidates.
  
  1) So do you think, deferring preemption patch ( Vatsa was mentioning
  long back)  is also another thing worth trying, so we reduce the chance
  of LHP.
 
 Yes, we have to keep it in mind.  It will be useful for fine grained
 locks, not so much so coarse locks or IPIs.
 
 I would still of course prefer a PLE solution, but if we can't get it to
 work we can consider preemption deferral.
 
  
  IIRC, with defer preemption :
  we will have hook in spinlock/unlock path to measure depth of lock held,
  and shared with host scheduler (may be via MSRs now).
  Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather
  give say one chance.
 
 A downside is that we have to do that even when undercommitted.
 
 Also there may be a lot of false positives (deferred preemptions even
 when there is no contention).
 
  
  2) looking at the result (comparing A & C), I do feel we have
  significant overhead in iterating over vcpus (when compared to even vmexit),
  so we would still need the undercommit fix suggested by PeterZ (improving by
  140%)?
 
 Looking only at the current runqueue?  My worry is that it misses a lot
 of cases.  Maybe try the current runqueue first and then others.
 
 Or were you referring to something else?
 
  
  So looking back at threads/ discussions so far, I am trying to
  summarize, the discussions so far. I feel, at least here are the few
  potential candidates to go in:
  
  1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
  2) Dynamically changing PLE window (Avi/Andrew/Chegu)
  3) preempt_notify handler to identify preempted VCPUs (Avi)
  4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
  5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
  6) Pv spinlock
  7) Jiannan's proposed improvements
  8) Defer preemption patches
  
  Did we miss anything (or added extra?)
  
  So here are my action items:
  - I plan to repost this series with what PeterZ, Rik suggested with
  performance analysis.
  - I ll go back and explore on (3) and (6) ..
  
  Please Let me know..
 
 Undoubtedly we'll think of more stuff.  But this looks like a good start.

9) lazy gang-like scheduling with PLE to cover the non-gang-like
exceptions  (/me runs and hides from scheduler folks)

-Andrew Theurer



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Raghavendra K T

On 09/27/2012 05:33 PM, Avi Kivity wrote:

On 09/27/2012 01:23 PM, Raghavendra K T wrote:


This gives us a good case for tracking preemption on a per-vm basis.  As
long as we aren't preempted, we can keep the PLE window high, and also
return immediately from the handler without looking for candidates.


1) So do you think, deferring preemption patch ( Vatsa was mentioning
long back)  is also another thing worth trying, so we reduce the chance
of LHP.


Yes, we have to keep it in mind.  It will be useful for fine grained
locks, not so much so coarse locks or IPIs.



Agree.


I would still of course prefer a PLE solution, but if we can't get it to
work we can consider preemption deferral.



Okay.



IIRC, with defer preemption :
we will have hook in spinlock/unlock path to measure depth of lock held,
and shared with host scheduler (may be via MSRs now).
Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather
give say one chance.


A downside is that we have to do that even when undercommitted.

Also there may be a lot of false positives (deferred preemptions even
when there is no contention).


Yes. That is a worry.





2) looking at the result (comparing A & C), I do feel we have
significant overhead in iterating over vcpus (when compared to even vmexit),
so we would still need the undercommit fix suggested by PeterZ (improving by
140%)?


Looking only at the current runqueue?  My worry is that it misses a lot
of cases.  Maybe try the current runqueue first and then others.

Or were you referring to something else?


No. I was referring to the same thing.

However, I had also tried the following (which works well to check the
undercommitted scenario), but I am thinking of using it only for yielding in
the overcommit case (yield in overcommit suggested by Rik) and keeping the
undercommit patch as suggested by PeterZ.


[ patch is not in proper diff I suppose ].

Will test them.

Peter, Can I post your patch with your from/sob.. in V2?
Please let me know..

---
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 28f00bc..9ed3759 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1620,6 +1620,21 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 
 	return eligible;
 }
 #endif
+
+bool kvm_overcommitted()
+{
+	unsigned long load;
+
+	load = avenrun[0] + FIXED_1/200;
+	load = load >> FSHIFT;
+	load = (load << 7) / num_online_cpus();
+
+	if (load > 128)
+		return true;
+
+	return false;
+}
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
@@ -1629,6 +1644,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	int pass;
 	int i;
 
+	if (!kvm_overcommitted())
+		return;
+
 	kvm_vcpu_set_in_spin_loop(me, true);
 	/*
 	 * We boost the priority of a VCPU that is runnable but not
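
The check above works in the kernel's fixed-point load-average units:
FSHIFT is 11, so (avenrun[0] + FIXED_1/200) >> FSHIFT is the rounded
integer part of the 1-minute load, and the (load << 7) / num_online_cpus()
scaling makes 128 correspond to one runnable task per online cpu. A small
self-contained user-space rendering of the same arithmetic, with made-up
numbers:

#include <stdio.h>

#define FSHIFT   11                /* kernel fixed-point shift */
#define FIXED_1  (1 << FSHIFT)     /* 1.0 in fixed point */

int main(void)
{
	unsigned long avenrun0 = 9 * FIXED_1;  /* pretend 1-min load is 9.0 */
	int online_cpus = 8;                   /* pretend 8 online cpus */
	unsigned long load;

	load = avenrun0 + FIXED_1 / 200;       /* round like /proc/loadavg */
	load = load >> FSHIFT;                 /* integer part of the load */
	load = (load << 7) / online_cpus;      /* scale: 128 == 1 task/cpu */

	printf("scaled load %lu -> %s\n", load,
	       load > 128 ? "overcommitted" : "not overcommitted");
	return 0;
}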



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread H. Peter Anvin

On 09/27/2012 10:38 PM, Raghavendra K T wrote:

+
+bool kvm_overcommitted()
+{


This better not be C...

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-26 Thread Peter Zijlstra
On Wed, 2012-09-26 at 15:39 +0200, Andrew Jones wrote:
> On Wed, Sep 26, 2012 at 03:26:11PM +0200, Peter Zijlstra wrote:
> > On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote:
> > > Wouldn't a clean solution be to promote a task's scheduler
> > > class to the spinner class when we PLE (or come from some special
> > > syscall
> > > for userspace spinlocks?)? 
> > 
> > Userspace spinlocks are typically employed to avoid syscalls..
> 
> I'm guessing there could be a slow path - spin N times and then give
> up and yield.

Much better they should do a blocking futex call or so, once you do the
syscall you're in kernel space anyway and have paid the transition cost.

> > 
> > > That class would be higher priority than the
> > > fair class and would schedule in FIFO order, but it would only run its
> > > tasks for short periods before switching. 
> > 
> > Since lock hold times aren't limited, esp. for things like userspace
> > 'spin' locks, you've got a very good denial of service / opportunity for
> > abuse right there.
> 
> Maybe add some throttling to avoid overuse/maliciousness?

At which point you're pretty much back to where you started.

A much better approach is using things like priority inheritance, which
can be extended to cover the fair class just fine..

Also note that user-space spinning is inherently prone to live-locks
when combined with the static priority RT scheduling classes.

In general its a very bad idea..


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-26 Thread Andrew Jones
On Wed, Sep 26, 2012 at 03:26:11PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote:
> > Wouldn't a clean solution be to promote a task's scheduler
> > class to the spinner class when we PLE (or come from some special
> > syscall
> > for userspace spinlocks?)? 
> 
> Userspace spinlocks are typically employed to avoid syscalls..

I'm guessing there could be a slow path - spin N times and then give
up and yield.

> 
> > That class would be higher priority than the
> > fair class and would schedule in FIFO order, but it would only run its
> > tasks for short periods before switching. 
> 
> Since lock hold times aren't limited, esp. for things like userspace
> 'spin' locks, you've got a very good denial of service / opportunity for
> abuse right there.

Maybe add some throttling to avoid overuse/maliciousness?
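
A rough user-space sketch of the spin-N-times-then-yield slow path
mentioned above (the spin bound and the sched_yield() fallback are
arbitrary, and this deliberately sidesteps Peter's point that a futex
wait would be the better fallback):

#include <sched.h>
#include <stdatomic.h>

#define SPIN_TRIES 100  /* arbitrary bound before giving up the cpu */

static atomic_flag lock = ATOMIC_FLAG_INIT;

static void spin_then_yield_lock(void)
{
	for (;;) {
		int i;

		for (i = 0; i < SPIN_TRIES; i++)
			if (!atomic_flag_test_and_set_explicit(&lock,
						memory_order_acquire))
				return;        /* got the lock while spinning */
		sched_yield();                 /* give up the cpu, then retry */
	}
}

static void spin_then_yield_unlock(void)
{
	atomic_flag_clear_explicit(&lock, memory_order_release);
}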



Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-26 Thread Peter Zijlstra
On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote:
> Wouldn't a clean solution be to promote a task's scheduler
> class to the spinner class when we PLE (or come from some special
> syscall
> for userspace spinlocks?)? 

Userspace spinlocks are typically employed to avoid syscalls..

> That class would be higher priority than the
> fair class and would schedule in FIFO order, but it would only run its
> tasks for short periods before switching. 

Since lock hold times aren't limited, esp. for things like userspace
'spin' locks, you've got a very good denial of service / opportunity for
abuse right there.




Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-26 Thread Andrew Jones
On Mon, Sep 24, 2012 at 06:20:12PM +0200, Avi Kivity wrote:
> On 09/24/2012 06:03 PM, Peter Zijlstra wrote:
> > On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote:
> >> On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
> >> > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
> >> >> However Rik had a genuine concern in the cases where runqueue is not
> >> >> equally distributed and lockholder might actually be on a different run 
> >> >> queue but not running.
> >> > 
> >> > Load should eventually get distributed equally -- that's what the
> >> > load-balancer is for -- so this is a temporary situation.
> >> 
> >> What's the expected latency?  This is the whole problem.  Eventually the
> >> scheduler would pick the lock holder as well, the problem is that it's
> >> in the millisecond scale while lock hold times are in the microsecond
> >> scale, leading to a 1000x slowdown.
> > 
> > Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes
> > something like accurate or fast, never both.
> > 
> >> If we want to yield, we really want to boost someone.
> > 
> > Now if only you knew which someone ;-) This non-modified guest nonsense
> > is such a snake pit.. but you know how I feel about all that.
> 
> Actually if I knew that in addition to boosting someone, I also unboost
> myself enough to be preempted, it wouldn't matter.  While boosting the
> lock holder is good, the main point is not spinning and doing useful
> work instead.  We can detect spinners and avoid boosting them.
> 
> That's the motivation for the "donate vruntime" approach I wanted earlier.

I'll probably get shot for the suggestion, but doesn't this problem merit
another scheduler class? We want FIFO order for a special class of tasks,
"spinners". Wouldn't a clean solution be to promote a task's scheduler
class to the spinner class when we PLE (or come from some special syscall
for userspace spinlocks?)? That class would be higher priority than the
fair class and would schedule in FIFO order, but it would only run its
tasks for short periods before switching. Also, after each task is run
its scheduler class would get reset down to its original class (fair).
At least at first thought this looks to me to be cleaner than the next
and skip hinting, plus it helps guarantee that the lock holder gets
scheduled before the tasks waiting on that lock.

Drew

> 
> > 
> >> > We already try and favour the non running vcpu in this case, that's what
> >> > yield_to_task_fair() is about. If its still not eligible to run, tough
> >> > luck.
> >> 
> >> Crazy idea: instead of yielding, just run that other vcpu in the thread
> >> that would otherwise spin.  I can see about a million objections to this
> >> already though.
> > 
> > Yah.. you want me to list a few? :-) It would require synchronization
> > with the other cpu to pull its task -- one really wants to avoid it also
> > running it.
> 
> Yeah, it's quite a horrible idea.
> 
> > 
> > Do this at a high enough frequency and you're dead too.
> > 
> > Anyway, you can do this inside the KVM stuff, simply flip the vcpu state
> > associated with a vcpu thread and use the preemption notifiers to sort
> > things against the scheduler or somesuch.
> 
> That's what I thought when I wrote this, but I can't, I might be
> preempted in random kvm code.  So my state includes the host stack and
> registers.  Maybe we can special-case when we interrupt guest mode.
> 
> > 
> >> >> Do you think instead of using rq->nr_running, we could get a global 
> >> >> sense of load using avenrun (something like avenrun/num_onlinecpus) 
> >> > 
> >> > To what purpose? Also, global stuff is expensive, so you should try and
> >> > stay away from it as hard as you possibly can.
> >> 
> >> Spinning is also expensive.  How about we do the global stuff every N
> >> times, to amortize the cost (and reduce contention)?
> > 
> > Nah, spinning isn't expensive, its a waste of time, similar end result
> > for someone who wants to do useful work though, but not the same cause.
> > 
> > Pick N and I'll come up with a scenario for which its wrong ;-)
> 
> Sure.  But if it's rare enough, then that's okay for us.
> 
> > Anyway, its an ugly problem and one I really want to contain inside the
> > insanity that created it (virt), lets not taint the rest of the kernel
> > more than we need to. 
> 
> Agreed.  Though given that postgres and others use userspace spinlocks,
> maybe it's not just virt.
> 
> -- 
> error compiling committee.c: too many arguments to function


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-26 Thread Andrew Jones
On Mon, Sep 24, 2012 at 02:36:05PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:
> > On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
> > > On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
> > >> In some special scenarios like #vcpu<= #pcpu, PLE handler may
> > >> prove very costly, because there is no need to iterate over vcpus
> > >> and do unsuccessful yield_to burning CPU.
> > >
> > > What's the costly thing? The vm-exit, the yield (which should be a nop
> > > if its the only task there) or something else entirely?
> > >
> > Both vmexit and yield_to() actually,
> > 
> > because unsuccessful yield_to() overall is costly in PLE handler.
> > 
> > This is because when we have large guests, say 32/16 vcpus, and one
> > vcpu is holding lock, rest of the vcpus waiting for the lock, when they
> > do PL-exit, each of the vcpu try to iterate over rest of vcpu list in
> > the VM and try to do directed yield (unsuccessful). (O(n^2) tries).
> > 
> > this results is fairly high amount of cpu burning and double run queue
> > lock contention.
> > 
> > (if they were spinning probably lock progress would have been faster).
> > As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which
> > seems little complex to achieve currently.
> 
> OK, so the vmexit stays and we need to improve yield_to.

Can't we do this check sooner as well, as it only requires per-cpu data?
If we do it way back in kvm_vcpu_on_spin, then we avoid get_pid_task()
and a bunch of read barriers from kvm_for_each_vcpu. Also, moving the test
into kvm code would allow us to do other kvm things as a result of the
check in order to avoid some vmexits. It looks like we should be able to
avoid some without much complexity by just making a per-vm ple_window
variable, and then, when we hit the nr_running == 1 condition, also doing
vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP))
Reset the window to the default value when we successfully yield (and
maybe we should limit the number of bumps).
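
Roughly, and only as a sketch (the per-VM kvm->arch.ple_window field and
the bump/max constants below are assumptions; vmcs_write32(), the
PLE_WINDOW field and the ple_window module parameter already exist in
vmx.c):

/* Sketch: called from the PLE exit handler when yielding is pointless. */
static void bump_ple_window(struct kvm_vcpu *vcpu)
{
	struct kvm *kvm = vcpu->kvm;

	if (kvm->arch.ple_window < PLE_WINDOW_MAX)
		kvm->arch.ple_window += PLE_WINDOW_BUMP;
	vmcs_write32(PLE_WINDOW, kvm->arch.ple_window);
}

/* Sketch: called after a successful directed yield. */
static void reset_ple_window(struct kvm_vcpu *vcpu)
{
	vcpu->kvm->arch.ple_window = ple_window;   /* module default */
	vmcs_write32(PLE_WINDOW, ple_window);
}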

Drew

> 
> How about something like the below, that would allow breaking out of the
> for-each-vcpu loop and simply going back into the vm, right?
> 
> ---
>  kernel/sched/core.c | 25 +++--
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b38f00e..5d5b355 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4272,7 +4272,10 @@ EXPORT_SYMBOL(yield);
>   * It's the caller's job to ensure that the target task struct
>   * can't go away on us before we can do any checks.
>   *
> - * Returns true if we indeed boosted the target task.
> + * Returns:
> + *   true (>0) if we indeed boosted the target task.
> + *   false (0) if we failed to boost the target.
> + *   -ESRCH if there's no task to yield to.
>   */
>  bool __sched yield_to(struct task_struct *p, bool preempt)
>  {
> @@ -4284,6 +4287,15 @@ bool __sched yield_to(struct task_struct *p, bool 
> preempt)
>   local_irq_save(flags);
>   rq = this_rq();
>  
> + /*
> +  * If we're the only runnable task on the rq, there's absolutely no
> +  * point in yielding.
> +  */
> + if (rq->nr_running == 1) {
> + yielded = -ESRCH;
> + goto out_irq;
> + }
> +
>  again:
>   p_rq = task_rq(p);
>   double_rq_lock(rq, p_rq);
> @@ -4293,13 +4305,13 @@ bool __sched yield_to(struct task_struct *p, bool 
> preempt)
>   }
>  
>   if (!curr->sched_class->yield_to_task)
> - goto out;
> + goto out_unlock;
>  
>   if (curr->sched_class != p->sched_class)
> - goto out;
> + goto out_unlock;
>  
>   if (task_running(p_rq, p) || p->state)
> - goto out;
> + goto out_unlock;
>  
>   yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>   if (yielded) {
> @@ -4312,11 +4324,12 @@ bool __sched yield_to(struct task_struct *p, bool 
> preempt)
>   resched_task(p_rq->curr);
>   }
>  
> -out:
> +out_unlock:
>   double_rq_unlock(rq, p_rq);
> +out_irq:
>   local_irq_restore(flags);
>  
> - if (yielded)
> + if (yielded > 0)
>   schedule();
>  
>   return yielded;
> 


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-26 Thread Konrad Rzeszutek Wilk
On Tue, Sep 25, 2012 at 05:00:30PM +0200, Dor Laor wrote:
> On 09/24/2012 02:02 PM, Raghavendra K T wrote:
> >On 09/24/2012 02:12 PM, Dor Laor wrote:
> >>In order to help PLE and pvticketlock converge I thought that a small
> >>test code should be developed to test this in a predictable,
> >>deterministic way.
> >>
> >>The idea is to have a guest kernel module that spawn a new thread each
> >>time you write to a /sys/ entry.
> >>
> >>Each such a thread spins over a spin lock. The specific spin lock is
> >>also chosen by the /sys/ interface. Let's say we have an array of spin
> >>locks *10 times the amount of vcpus.
> >>
> >>All the threads are running a
> >>while (1) {
> >>
> >>spin_lock(my_lock);
> >>sum += execute_dummy_cpu_computation(time);
> >>spin_unlock(my_lock);
> >>
> >>if (sys_tells_thread_to_die()) break;
> >>}
> >>
> >>print_result(sum);
> >>
> >>Instead of calling the kernel's spin_lock functions, clone them and make
> >>the ticket lock order deterministic and known (like a linear walk of all
> >>the threads trying to catch that lock).
> >
> >By Cloning you mean hierarchy of the locks?
> 
> No, I meant to clone the implementation of the current spin lock
> code in order to set any order you may like for the ticket
> selection.
> (even for a non pvticket lock version)

Wouldn't that defeat the purpose of trying the test the different
implementations that try to fix the lock-holder preemption problem?
You want something that you can shoe-in for all work-loads - also
for this test system.
> 
> For instance, let's say you have N threads trying to grab the lock,
> you can always make the ticket go linearly from 1->2...->N.
> Not sure it's a good idea, just a recommendation.

So round-robin. Could you make NCPUS threads, pin them to CPUs, and set
them to be SCHED_RR? Or NCPUS*2 to overcommit.

> 
> >Also I believe time should be passed via sysfs / hardcoded for each
> >type of lock we are mimicking
> 
> Yap
> 
> >
> >>
> >>This way you can easy calculate:
> >>1. the score of a single vcpu running a single thread
> >>2. the score of sum of all thread scores when #thread==#vcpu all
> >>taking the same spin lock. The overall sum should be close as
> >>possible to #1.
> >>3. Like #2 but #threads > #vcpus and other versions of #total vcpus
> >>(belonging to all VMs) > #pcpus.
> >>4. Create #thread == #vcpus but let each thread have it's own spin
> >>lock
> >>5. Like 4 + 2
> >>
> >>Hopefully this way will allows you to judge and evaluate the exact
> >>overhead of scheduling VMs and threads since you have the ideal result
> >>in hand and you know what the threads are doing.
> >>
> >>My 2 cents, Dor
> >>
> >
> >Thank you,
> >I think this is an excellent idea. ( Though I am trying to put all the
> >pieces together you mentioned). So overall we should be able to measure
> >the performance of pvspinlock/PLE improvements with a deterministic
> >load in guest.
> >
> >Only thing I am missing is,
> >How to generate different combinations of the lock.
> >
> >Okay, let me see if I can come with a solid model for this.
> >
> 
> Do you mean the various options for PLE/pvticket/other? I haven't
> thought of it and assumed its static but it can also be controlled
> through the temporary /sys interface.
> 
> Thanks for following up!
> Dor


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-25 Thread Dor Laor

On 09/24/2012 02:02 PM, Raghavendra K T wrote:

On 09/24/2012 02:12 PM, Dor Laor wrote:

In order to help PLE and pvticketlock converge I thought that a small
test code should be developed to test this in a predictable,
deterministic way.

The idea is to have a guest kernel module that spawn a new thread each
time you write to a /sys/ entry.

Each such a thread spins over a spin lock. The specific spin lock is
also chosen by the /sys/ interface. Let's say we have an array of spin
locks *10 times the amount of vcpus.

All the threads are running a
while (1) {

spin_lock(my_lock);
sum += execute_dummy_cpu_computation(time);
spin_unlock(my_lock);

if (sys_tells_thread_to_die()) break;
}

print_result(sum);

Instead of calling the kernel's spin_lock functions, clone them and make
the ticket lock order deterministic and known (like a linear walk of all
the threads trying to catch that lock).
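
A stripped-down user-space analogue of one such worker thread (the /sys
plumbing, timing control and lock selection are simplified stand-ins; the
locks must be pthread_spin_init()ed before the workers start):

#include <pthread.h>
#include <stdatomic.h>

#define NLOCKS 64  /* e.g. 10 * nr_vcpus in the proposed module */

static pthread_spinlock_t locks[NLOCKS];
static atomic_int stop;
static atomic_long sum;

static long execute_dummy_cpu_computation(void)
{
	volatile long x = 0;
	int i;

	for (i = 0; i < 10000; i++)  /* stand-in for the timed busy loop */
		x += i;
	return x;
}

static void *worker(void *arg)
{
	int idx = (int)(long)arg;    /* which spin lock this thread hammers */

	while (!atomic_load(&stop)) {
		pthread_spin_lock(&locks[idx]);
		atomic_fetch_add(&sum, execute_dummy_cpu_computation());
		pthread_spin_unlock(&locks[idx]);
	}
	return NULL;
}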


By Cloning you mean hierarchy of the locks?


No, I meant to clone the implementation of the current spin lock code in 
order to set any order you may like for the ticket selection.

(even for a non pvticket lock version)

For instance, let's say you have N threads trying to grab the lock, you 
can always make the ticket go linearly from 1->2...->N.

Not sure it's a good idea, just a recommendation.


Also I believe time should be passed via sysfs / hardcoded for each
type of lock we are mimicking


Yap





This way you can easy calculate:
1. the score of a single vcpu running a single thread
2. the score of sum of all thread scores when #thread==#vcpu all
taking the same spin lock. The overall sum should be close as
possible to #1.
3. Like #2 but #threads > #vcpus and other versions of #total vcpus
(belonging to all VMs) > #pcpus.
4. Create #thread == #vcpus but let each thread have it's own spin
lock
5. Like 4 + 2

Hopefully this way will allows you to judge and evaluate the exact
overhead of scheduling VMs and threads since you have the ideal result
in hand and you know what the threads are doing.

My 2 cents, Dor



Thank you,
I think this is an excellent idea. ( Though I am trying to put all the
pieces together you mentioned). So overall we should be able to measure
the performance of pvspinlock/PLE improvements with a deterministic
load in guest.

Only thing I am missing is,
How to generate different combinations of the lock.

Okay, let me see if I can come with a solid model for this.



Do you mean the various options for PLE/pvticket/other? I haven't 
thought of it and assumed its static but it can also be controlled 
through the temporary /sys interface.


Thanks for following up!
Dor


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-25 Thread Raghavendra K T

On 09/24/2012 07:46 PM, Raghavendra K T wrote:

On 09/24/2012 07:24 PM, Peter Zijlstra wrote:

On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:

However Rik had a genuine concern in the cases where runqueue is not
equally distributed and lockholder might actually be on a different run
queue but not running.


Load should eventually get distributed equally -- that's what the
load-balancer is for -- so this is a temporary situation.

We already try and favour the non running vcpu in this case, that's what
yield_to_task_fair() is about. If its still not eligible to run, tough
luck.


Yes, I agree.




Do you think instead of using rq->nr_running, we could get a global
sense of load using avenrun (something like avenrun/num_onlinecpus)


To what purpose? Also, global stuff is expensive, so you should try and
stay away from it as hard as you possibly can.


Yes, that concern only had made me to fall back to rq->nr_running.

Will come back with the result soon.


Got the result with the patches:
So here is the result,

Tried this on a 32 core ple box with HT disabled. 32 guest vcpus with
1x and 2x overcommits

Base = 3.6.0-rc5 + ple handler optimization patches
A = Base + checking rq_running in vcpu_on_spin() patch
B = Base + checking rq->nr_running in sched/core
C = Base - PLE

---+---+---+---+---+
   |Ebizzy result (rec/sec higher is better)   |
---+---+---+---+---+
   |Base   | A |  B| C |
---+---+---+---+---+
1x | 2374.1250 | 7273.7500 | 5690.8750 |  7364.3750|
2x | 2536.2500 | 2458.5000 | 2426.3750 |48.5000|
---+---+---+---+---+

   % improvements w.r.t BASE
---++++
   |  A |B   | C  |
---++++
1x | 206.37603  |  139.70410 |  210.19323 |
2x | -3.06555   |  -4.33218  |  -98.08773 |
---++++

we are getting almost the benefit of the PLE-disabled case with this
approach. With patch B, the gain drops a bit (because we still iterate
over vcpus until we decide to do a directed yield).







Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-24 Thread Avi Kivity
On 09/24/2012 06:03 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote:
>> On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
>> > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>> >> However Rik had a genuine concern in the cases where runqueue is not
>> >> equally distributed and lockholder might actually be on a different run 
>> >> queue but not running.
>> > 
>> > Load should eventually get distributed equally -- that's what the
>> > load-balancer is for -- so this is a temporary situation.
>> 
>> What's the expected latency?  This is the whole problem.  Eventually the
>> scheduler would pick the lock holder as well, the problem is that it's
>> in the millisecond scale while lock hold times are in the microsecond
>> scale, leading to a 1000x slowdown.
> 
> Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes
> something like accurate or fast, never both.
> 
>> If we want to yield, we really want to boost someone.
> 
> Now if only you knew which someone ;-) This non-modified guest nonsense
> is such a snake pit.. but you know how I feel about all that.

Actually if I knew that in addition to boosting someone, I also unboost
myself enough to be preempted, it wouldn't matter.  While boosting the
lock holder is good, the main point is not spinning and doing useful
work instead.  We can detect spinners and avoid boosting them.

That's the motivation for the "donate vruntime" approach I wanted earlier.

> 
>> > We already try and favour the non running vcpu in this case, that's what
>> > yield_to_task_fair() is about. If its still not eligible to run, tough
>> > luck.
>> 
>> Crazy idea: instead of yielding, just run that other vcpu in the thread
>> that would otherwise spin.  I can see about a million objections to this
>> already though.
> 
> Yah.. you want me to list a few? :-) It would require synchronization
> with the other cpu to pull its task -- one really wants to avoid it also
> running it.

Yeah, it's quite a horrible idea.

> 
> Do this at a high enough frequency and you're dead too.
> 
> Anyway, you can do this inside the KVM stuff, simply flip the vcpu state
> associated with a vcpu thread and use the preemption notifiers to sort
> things against the scheduler or somesuch.

That's what I thought when I wrote this, but I can't, I might be
preempted in random kvm code.  So my state includes the host stack and
registers.  Maybe we can special-case when we interrupt guest mode.
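
For what it's worth, KVM already registers preempt notifiers per vcpu
thread, so recording the fact of preemption is cheap; a sketch (the
vcpu->preempted flag and its use in the PLE handler are assumptions, the
notifier hook itself exists in virt/kvm/kvm_main.c):

static void kvm_sched_out(struct preempt_notifier *pn,
			  struct task_struct *next)
{
	struct kvm_vcpu *vcpu = container_of(pn, struct kvm_vcpu,
					     preempt_notifier);

	if (current->state == TASK_RUNNING)
		vcpu->preempted = true;  /* scheduled out while still runnable */
	kvm_arch_vcpu_put(vcpu);
}

/* kvm_vcpu_on_spin() could then prefer candidates with vcpu->preempted set,
 * and return immediately when no vcpu of this VM has been preempted. */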

> 
>> >> Do you think instead of using rq->nr_running, we could get a global 
>> >> sense of load using avenrun (something like avenrun/num_onlinecpus) 
>> > 
>> > To what purpose? Also, global stuff is expensive, so you should try and
>> > stay away from it as hard as you possibly can.
>> 
>> Spinning is also expensive.  How about we do the global stuff every N
>> times, to amortize the cost (and reduce contention)?
> 
> Nah, spinning isn't expensive, its a waste of time, similar end result
> for someone who wants to do useful work though, but not the same cause.
> 
> Pick N and I'll come up with a scenario for which its wrong ;-)

Sure.  But if it's rare enough, then that's okay for us.

> Anyway, its an ugly problem and one I really want to contain inside the
> insanity that created it (virt), lets not taint the rest of the kernel
> more than we need to. 

Agreed.  Though given that postgres and others use userspace spinlocks,
maybe it's not just virt.

-- 
error compiling committee.c: too many arguments to function


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-24 Thread Peter Zijlstra
On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote:
> On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
> > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
> >> However Rik had a genuine concern in the cases where runqueue is not
> >> equally distributed and lockholder might actually be on a different run 
> >> queue but not running.
> > 
> > Load should eventually get distributed equally -- that's what the
> > load-balancer is for -- so this is a temporary situation.
> 
> What's the expected latency?  This is the whole problem.  Eventually the
> scheduler would pick the lock holder as well, the problem is that it's
> in the millisecond scale while lock hold times are in the microsecond
> scale, leading to a 1000x slowdown.

Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes
something like accurate or fast, never both.

> If we want to yield, we really want to boost someone.

Now if only you knew which someone ;-) This non-modified guest nonsense
is such a snake pit.. but you know how I feel about all that.

> > We already try and favour the non running vcpu in this case, that's what
> > yield_to_task_fair() is about. If its still not eligible to run, tough
> > luck.
> 
> Crazy idea: instead of yielding, just run that other vcpu in the thread
> that would otherwise spin.  I can see about a million objections to this
> already though.

Yah.. you want me to list a few? :-) It would require synchronization
with the other cpu to pull its task -- one really wants to avoid it also
running it.

Do this at a high enough frequency and you're dead too.

Anyway, you can do this inside the KVM stuff, simply flip the vcpu state
associated with a vcpu thread and use the preemption notifiers to sort
things against the scheduler or somesuch.

> >> Do you think instead of using rq->nr_running, we could get a global 
> >> sense of load using avenrun (something like avenrun/num_onlinecpus) 
> > 
> > To what purpose? Also, global stuff is expensive, so you should try and
> > stay away from it as hard as you possibly can.
> 
> Spinning is also expensive.  How about we do the global stuff every N
> times, to amortize the cost (and reduce contention)?

Nah, spinning isn't expensive, its a waste of time, similar end result
for someone who wants to do useful work though, but not the same cause.

Pick N and I'll come up with a scenario for which its wrong ;-)

Anyway, its an ugly problem and one I really want to contain inside the
insanity that created it (virt), lets not taint the rest of the kernel
more than we need to. 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
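

For reference, the preemption notifiers mentioned above are an existing hook: KVM already registers a pair of callbacks (kvm_sched_in/kvm_sched_out via kvm_preempt_ops) in virt/kvm/kvm_main.c. A minimal sketch of what flipping per-vcpu-thread state from those callbacks could look like follows; the vcpu_* names and the state structure are purely illustrative, not existing kernel symbols.

#include <linux/kernel.h>
#include <linux/preempt.h>
#include <linux/sched.h>

/* illustrative per-vcpu-thread state, flipped by the scheduler hooks */
struct vcpu_sched_state {
        struct preempt_notifier pn;
        bool running;
};

static void vcpu_sched_in(struct preempt_notifier *pn, int cpu)
{
        struct vcpu_sched_state *vs = container_of(pn, struct vcpu_sched_state, pn);

        vs->running = true;             /* the vcpu thread is back on a cpu */
}

static void vcpu_sched_out(struct preempt_notifier *pn, struct task_struct *next)
{
        struct vcpu_sched_state *vs = container_of(pn, struct vcpu_sched_state, pn);

        vs->running = false;            /* preempted; a spinner could consult this flag */
}

static struct preempt_ops vcpu_preempt_ops = {
        .sched_in  = vcpu_sched_in,
        .sched_out = vcpu_sched_out,
};

/* called once per vcpu thread, e.g. around vcpu load */
static void vcpu_hook_preempt(struct vcpu_sched_state *vs)
{
        preempt_notifier_init(&vs->pn, &vcpu_preempt_ops);
        preempt_notifier_register(&vs->pn);
}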


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-24 Thread Avi Kivity
On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>> However Rik had a genuine concern in the cases where runqueue is not
>> equally distributed and lockholder might actually be on a different run 
>> queue but not running.
> 
> Load should eventually get distributed equally -- that's what the
> load-balancer is for -- so this is a temporary situation.

What's the expected latency?  This is the whole problem.  Eventually the
scheduler would pick the lock holder as well, the problem is that it's
in the millisecond scale while lock hold times are in the microsecond
scale, leading to a 1000x slowdown.

If we want to yield, we really want to boost someone.

> We already try and favour the non running vcpu in this case, that's what
> yield_to_task_fair() is about. If its still not eligible to run, tough
> luck.

Crazy idea: instead of yielding, just run that other vcpu in the thread
that would otherwise spin.  I can see about a million objections to this
already though.

>> Do you think instead of using rq->nr_running, we could get a global 
>> sense of load using avenrun (something like avenrun/num_onlinecpus) 
> 
> To what purpose? Also, global stuff is expensive, so you should try and
> stay away from it as hard as you possibly can.

Spinning is also expensive.  How about we do the global stuff every N
times, to amortize the cost (and reduce contention)?

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
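

The "every N times" amortization suggested above, combined with the avenrun/num_online_cpus() heuristic discussed elsewhere in this thread, could look roughly like the sketch below. N, the helper names and the overcommit test are all made up for illustration, the cached flag is deliberately racy, and get_avenrun() would likely need to be exported before modular kvm could call it.

#include <linux/sched.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>

#define PLE_GLOBAL_CHECK_INTERVAL       128     /* the "N" under discussion */

static DEFINE_PER_CPU(unsigned int, ple_exit_count);
static bool host_overcommitted = true;          /* cached, deliberately approximate */

static bool recheck_overcommit(void)
{
        unsigned long loads[3];

        get_avenrun(loads, FIXED_1 / 200, 0);
        /* treat "1-minute loadavg >= online cpus" as an overcommitted host */
        return (loads[0] >> FSHIFT) >= num_online_cpus();
}

/* called on every pause-loop exit; does the global computation only 1-in-N times */
static bool ple_worth_yielding(void)
{
        if (this_cpu_inc_return(ple_exit_count) % PLE_GLOBAL_CHECK_INTERVAL == 0)
                host_overcommitted = recheck_overcommit();

        return host_overcommitted;
}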


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-24 Thread Raghavendra K T

On 09/24/2012 07:24 PM, Peter Zijlstra wrote:

On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:

However Rik had a genuine concern in the cases where runqueue is not
equally distributed and lockholder might actually be on a different run
queue but not running.


Load should eventually get distributed equally -- that's what the
load-balancer is for -- so this is a temporary situation.

We already try and favour the non running vcpu in this case, that's what
yield_to_task_fair() is about. If its still not eligible to run, tough
luck.


Yes, I agree.




Do you think instead of using rq->nr_running, we could get a global
sense of load using avenrun (something like avenrun/num_onlinecpus)


To what purpose? Also, global stuff is expensive, so you should try and
stay away from it as hard as you possibly can.


Yes, that very concern is what made me fall back to rq->nr_running.

Will come back with the result soon.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-24 Thread Peter Zijlstra
On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
> However Rik had a genuine concern in the cases where runqueue is not
> equally distributed and lockholder might actually be on a different run 
> queue but not running.

Load should eventually get distributed equally -- that's what the
load-balancer is for -- so this is a temporary situation.

We already try and favour the non running vcpu in this case, that's what
yield_to_task_fair() is about. If its still not eligible to run, tough
luck.

> Do you think instead of using rq->nr_running, we could get a global 
> sense of load using avenrun (something like avenrun/num_onlinecpus) 

To what purpose? Also, global stuff is expensive, so you should try and
stay away from it as hard as you possibly can.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-24 Thread Raghavendra K T

On 09/24/2012 06:06 PM, Peter Zijlstra wrote:

On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:

On 09/24/2012 05:04 PM, Peter Zijlstra wrote:

On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:

In some special scenarios like #vcpu<= #pcpu, PLE handler may
prove very costly, because there is no need to iterate over vcpus
and do unsuccessful yield_to burning CPU.


What's the costly thing? The vm-exit, the yield (which should be a nop
if its the only task there) or something else entirely?


Both vmexit and yield_to() actually,

because unsuccessful yield_to() overall is costly in PLE handler.

This is because when we have large guests, say 32/16 vcpus, and one
vcpu is holding lock, rest of the vcpus waiting for the lock, when they
do PL-exit, each of the vcpu try to iterate over rest of vcpu list in
the VM and try to do directed yield (unsuccessful). (O(n^2) tries).

this results is fairly high amount of cpu burning and double run queue
lock contention.

(if they were spinning probably lock progress would have been faster).
As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which
seems little complex to achieve currently.


OK, so the vmexit stays and we need to improve yield_to.

How about something like the below, that would allow breaking out of the
for-each-vcpu loop and simply going back into the vm, right?

---
  kernel/sched/core.c | 25 +++--
  1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b38f00e..5d5b355 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4272,7 +4272,10 @@ EXPORT_SYMBOL(yield);
   * It's the caller's job to ensure that the target task struct
   * can't go away on us before we can do any checks.
   *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *   true (>0) if we indeed boosted the target task.
+ *   false (0) if we failed to boost the target.
+ *   -ESRCH if there's no task to yield to.
   */
  bool __sched yield_to(struct task_struct *p, bool preempt)
  {
@@ -4284,6 +4287,15 @@ bool __sched yield_to(struct task_struct *p, bool 
preempt)
local_irq_save(flags);
rq = this_rq();

+   /*
+* If we're the only runnable task on the rq, there's absolutely no
+* point in yielding.
+*/
+   if (rq->nr_running == 1) {
+   yielded = -ESRCH;
+   goto out_irq;
+   }
+
  again:
p_rq = task_rq(p);
double_rq_lock(rq, p_rq);
@@ -4293,13 +4305,13 @@ bool __sched yield_to(struct task_struct *p, bool 
preempt)
}

if (!curr->sched_class->yield_to_task)
-   goto out;
+   goto out_unlock;

if (curr->sched_class != p->sched_class)
-   goto out;
+   goto out_unlock;

if (task_running(p_rq, p) || p->state)
-   goto out;
+   goto out_unlock;

yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4312,11 +4324,12 @@ bool __sched yield_to(struct task_struct *p, bool 
preempt)
resched_task(p_rq->curr);
}

-out:
+out_unlock:
double_rq_unlock(rq, p_rq);
+out_irq:
local_irq_restore(flags);

-   if (yielded)
+   if (yielded>  0)
schedule();

return yielded;




Yes, I think this is a nice idea. Any future users of yield_to
would also benefit from this; we will have to iterate only until the
first attempt to yield_to.

I'll run the test with this patch.

However, Rik had a genuine concern about the cases where the runqueue is
not equally distributed and the lock holder might actually be on a
different run queue but not running.


Do you think that instead of using rq->nr_running, we could get a global
sense of load using avenrun (something like avenrun/num_online_cpus())?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-24 Thread Peter Zijlstra
On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:
> On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
> > On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
> >> In some special scenarios like #vcpu<= #pcpu, PLE handler may
> >> prove very costly, because there is no need to iterate over vcpus
> >> and do unsuccessful yield_to burning CPU.
> >
> > What's the costly thing? The vm-exit, the yield (which should be a nop
> > if its the only task there) or something else entirely?
> >
> Both vmexit and yield_to() actually,
> 
> because unsuccessful yield_to() overall is costly in PLE handler.
> 
> This is because when we have large guests, say 32/16 vcpus, and one
> vcpu is holding lock, rest of the vcpus waiting for the lock, when they
> do PL-exit, each of the vcpu try to iterate over rest of vcpu list in
> the VM and try to do directed yield (unsuccessful). (O(n^2) tries).
> 
> this results is fairly high amount of cpu burning and double run queue
> lock contention.
> 
> (if they were spinning probably lock progress would have been faster).
> As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which
> seems little complex to achieve currently.

OK, so the vmexit stays and we need to improve yield_to.

How about something like the below, that would allow breaking out of the
for-each-vcpu loop and simply going back into the vm, right?

---
 kernel/sched/core.c | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b38f00e..5d5b355 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4272,7 +4272,10 @@ EXPORT_SYMBOL(yield);
  * It's the caller's job to ensure that the target task struct
  * can't go away on us before we can do any checks.
  *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *   true (>0) if we indeed boosted the target task.
+ *   false (0) if we failed to boost the target.
+ *   -ESRCH if there's no task to yield to.
  */
 bool __sched yield_to(struct task_struct *p, bool preempt)
 {
@@ -4284,6 +4287,15 @@ bool __sched yield_to(struct task_struct *p, bool 
preempt)
local_irq_save(flags);
rq = this_rq();
 
+   /*
+* If we're the only runnable task on the rq, there's absolutely no
+* point in yielding.
+*/
+   if (rq->nr_running == 1) {
+   yielded = -ESRCH;
+   goto out_irq;
+   }
+
 again:
p_rq = task_rq(p);
double_rq_lock(rq, p_rq);
@@ -4293,13 +4305,13 @@ bool __sched yield_to(struct task_struct *p, bool 
preempt)
}
 
if (!curr->sched_class->yield_to_task)
-   goto out;
+   goto out_unlock;
 
if (curr->sched_class != p->sched_class)
-   goto out;
+   goto out_unlock;
 
if (task_running(p_rq, p) || p->state)
-   goto out;
+   goto out_unlock;
 
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4312,11 +4324,12 @@ bool __sched yield_to(struct task_struct *p, bool 
preempt)
resched_task(p_rq->curr);
}
 
-out:
+out_unlock:
double_rq_unlock(rq, p_rq);
+out_irq:
local_irq_restore(flags);
 
-   if (yielded)
+   if (yielded > 0)
schedule();
 
return yielded;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
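

For context, the consumer of that new return value would be the directed-yield loop in the PLE handler. The sketch below is not the series' actual virt/kvm/kvm_main.c change: it assumes kvm_vcpu_yield_to() is adjusted to pass yield_to()'s tri-state result back to its caller, and it leaves out the existing candidate filtering (last_boosted_vcpu round-robin, eligibility checks).

#include <linux/kvm_host.h>
#include <linux/errno.h>

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
        struct kvm *kvm = me->kvm;
        struct kvm_vcpu *vcpu;
        int yielded;
        int i;

        kvm_for_each_vcpu(i, vcpu, kvm) {
                if (vcpu == me)
                        continue;
                if (waitqueue_active(&vcpu->wq))
                        continue;       /* halted, nothing to boost */

                yielded = kvm_vcpu_yield_to(vcpu);
                if (yielded > 0)
                        break;          /* boosted somebody, done */
                if (yielded == -ESRCH)
                        break;          /* alone on our runqueue: undercommitted,
                                           stop scanning and resume the guest */
        }
}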


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-24 Thread Raghavendra K T

On 09/24/2012 02:12 PM, Dor Laor wrote:

In order to help PLE and pvticketlock converge I thought that a small
test code should be developed to test this in a predictable,
deterministic way.

The idea is to have a guest kernel module that spawn a new thread each
time you write to a /sys/ entry.

Each such a thread spins over a spin lock. The specific spin lock is
also chosen by the /sys/ interface. Let's say we have an array of spin
locks *10 times the amount of vcpus.

All the threads are running a
while (1) {

spin_lock(my_lock);
sum += execute_dummy_cpu_computation(time);
spin_unlock(my_lock);

if (sys_tells_thread_to_die()) break;
}

print_result(sum);

Instead of calling the kernel's spin_lock functions, clone them and make
the ticket lock order deterministic and known (like a linear walk of all
the threads trying to catch that lock).


By cloning, do you mean a hierarchy of the locks?
Also, I believe the 'time' parameter should be passed via sysfs / hardcoded for each
type of lock we are mimicking.



This way you can easy calculate:
1. the score of a single vcpu running a single thread
2. the score of sum of all thread scores when #thread==#vcpu all
taking the same spin lock. The overall sum should be close as
possible to #1.
3. Like #2 but #threads > #vcpus and other versions of #total vcpus
(belonging to all VMs) > #pcpus.
4. Create #thread == #vcpus but let each thread have it's own spin
lock
5. Like 4 + 2

Hopefully this way will allows you to judge and evaluate the exact
overhead of scheduling VMs and threads since you have the ideal result
in hand and you know what the threads are doing.

My 2 cents, Dor



Thank you,
I think this is an excellent idea (though I am still trying to put
together all the pieces you mentioned). So overall we should be able to
measure the performance of pvspinlock/PLE improvements with a
deterministic load in the guest.

The only thing I am missing is how to generate the different
combinations of the lock.

Okay, let me see if I can come up with a solid model for this.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-24 Thread Raghavendra K T

On 09/24/2012 05:04 PM, Peter Zijlstra wrote:

On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:

In some special scenarios like #vcpu<= #pcpu, PLE handler may
prove very costly, because there is no need to iterate over vcpus
and do unsuccessful yield_to burning CPU.


What's the costly thing? The vm-exit, the yield (which should be a nop
if its the only task there) or something else entirely?


Both the vmexit and yield_to(), actually,

because unsuccessful yield_to() is what makes the PLE handler costly overall.

This is because when we have large guests, say 32/16 vcpus, and one
vcpu is holding a lock with the rest of the vcpus waiting for it, then
when they do a PL-exit, each of the vcpus tries to iterate over the rest
of the vcpu list in the VM and do a directed yield (unsuccessfully),
i.e. O(n^2) tries.

This results in a fairly high amount of cpu burning and double run queue
lock contention.

(If they were simply spinning, lock progress would probably have been faster.)
As Avi/Chegu Vinod had felt, it is better to avoid the vmexit itself, which
seems a little complex to achieve currently.





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-24 Thread Peter Zijlstra
On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
> In some special scenarios like #vcpu <= #pcpu, PLE handler may 
> prove very costly, because there is no need to iterate over vcpus
> and do unsuccessful yield_to burning CPU. 

What's the costly thing? The vm-exit, the yield (which should be a nop
if its the only task there) or something else entirely?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-24 Thread Dor Laor
In order to help PLE and pvticketlock converge I thought that a small 
test code should be developed to test this in a predictable, 
deterministic way.


The idea is to have a guest kernel module that spawns a new thread each
time you write to a /sys/ entry.


Each such thread spins on a spin lock. The specific spin lock is
also chosen via the /sys/ interface. Let's say we have an array of spin
locks, 10 times the number of vcpus.


All the threads are running a
 while (1) {

   spin_lock(my_lock);
   sum += execute_dummy_cpu_computation(time);
   spin_unlock(my_lock);

   if (sys_tells_thread_to_die()) break;
 }

 print_result(sum);

Instead of calling the kernel's spin_lock functions, clone them and make 
the ticket lock order deterministic and known (like a linear walk of all 
the threads trying to catch that lock).


This way you can easily calculate:
 1. the score of a single vcpu running a single thread
 2. the score of sum of all thread scores when #thread==#vcpu all
taking the same spin lock. The overall sum should be close as
possible to #1.
 3. Like #2 but #threads > #vcpus and other versions of #total vcpus
(belonging to all VMs)  > #pcpus.
 4. Create #thread == #vcpus but let each thread have it's own spin
lock
 5. Like 4 + 2

Hopefully this way will allow you to judge and evaluate the exact
overhead of scheduling VMs and threads, since you have the ideal result
in hand and you know what the threads are doing.


My 2 cents, Dor

On 09/21/2012 08:36 PM, Raghavendra K T wrote:

On 09/21/2012 06:48 PM, Chegu Vinod wrote:

On 9/21/2012 4:59 AM, Raghavendra K T wrote:

In some special scenarios like #vcpu <= #pcpu, PLE handler may
prove very costly,


Yes.

because there is no need to iterate over vcpus
and do unsuccessful yield_to burning CPU.

An idea to solve this is:
1) As Avi had proposed we can modify hardware ple_window
dynamically to avoid frequent PL-exit.


Yes. We had to do this to get around some scaling issues for large
(>20way) guests (with no overcommitment)


Do you mean you already have some solution tested for this?



As part of some experimentation we even tried "switching off" PLE too :(



Honestly,
Your this experiment and Andrew Theurer's observations were the
motivation for this patch.





(IMHO, it is difficult to
decide when we have mixed type of VMs).


Agree.

Not sure if the following alternatives have also been looked at :

- Could the behavior associated with the "ple_window" be modified to be
a function of some [new] per-guest attribute (which can be conveyed to
the host as part of the guest launch sequence). The user can choose to
set this [new] attribute for a given guest. This would help avoid the
frequent exits due to PLE (as Avi had mentioned earlier) ?


Ccing Drew also. We had a good discussion on this idea last time.
(sorry that I forgot to include in patch series)

May be a good idea when we know the load in advance..



- Can the PLE feature ( in VT) be "enhanced" to be made a per guest
attribute ?


IMHO, the approach of not taking a frequent exit is better than taking
an exit and returning back from the handler etc.


I entirely agree on this point. (though have not tried above
approaches). Hope to see more expert opinions pouring in.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
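

A minimal sketch of the guest module described above, under the simplest possible interface: writing a lock index to a sysfs file spawns one more spinner thread on that lock. Everything here (the names, the fixed lock-array size, the jiffies stand-in for execute_dummy_cpu_computation()) is illustrative, and the deterministic ticket-order variant plus proper thread teardown are left out. With this loaded, "echo 3 > /sys/kernel/spin_test/spawn" would start one more thread spinning on lock 3.

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/spinlock.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>
#include <linux/jiffies.h>

#define NR_LOCKS        320             /* ~10x a typical vcpu count */

static spinlock_t locks[NR_LOCKS];
static bool stop_all;
static struct kobject *test_kobj;

static int spinner_fn(void *data)
{
        spinlock_t *lock = data;
        unsigned long sum = 0;

        while (!stop_all) {
                spin_lock(lock);
                sum += jiffies;         /* stand-in for execute_dummy_cpu_computation(time) */
                spin_unlock(lock);
        }
        pr_info("spinner exiting, sum=%lu\n", sum);
        return 0;
}

static ssize_t spawn_store(struct kobject *kobj, struct kobj_attribute *attr,
                           const char *buf, size_t count)
{
        unsigned int idx;

        if (kstrtouint(buf, 10, &idx) || idx >= NR_LOCKS)
                return -EINVAL;
        if (IS_ERR(kthread_run(spinner_fn, &locks[idx], "spinner-%u", idx)))
                return -ENOMEM;
        return count;
}

static struct kobj_attribute spawn_attr = __ATTR(spawn, 0200, NULL, spawn_store);

static int __init spin_test_init(void)
{
        int i;

        for (i = 0; i < NR_LOCKS; i++)
                spin_lock_init(&locks[i]);

        test_kobj = kobject_create_and_add("spin_test", kernel_kobj);
        if (!test_kobj)
                return -ENOMEM;
        return sysfs_create_file(test_kobj, &spawn_attr.attr);
}

static void __exit spin_test_exit(void)
{
        stop_all = true;        /* sketch only: no synchronisation with exiting threads */
        sysfs_remove_file(test_kobj, &spawn_attr.attr);
        kobject_put(test_kobj);
}

module_init(spin_test_init);
module_exit(spin_test_exit);
MODULE_LICENSE("GPL");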


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-21 Thread Raghavendra K T

On 09/21/2012 06:48 PM, Chegu Vinod wrote:

On 9/21/2012 4:59 AM, Raghavendra K T wrote:

In some special scenarios like #vcpu <= #pcpu, PLE handler may
prove very costly,


Yes.

because there is no need to iterate over vcpus
and do unsuccessful yield_to burning CPU.

An idea to solve this is:
1) As Avi had proposed we can modify hardware ple_window
dynamically to avoid frequent PL-exit.


Yes. We had to do this to get around some scaling issues for large
(>20way) guests (with no overcommitment)


Do you mean you already have some solution tested for this?



As part of some experimentation we even tried "switching off" PLE too :(



Honestly,
your experiment and Andrew Theurer's observations were the
motivation for this patch.





(IMHO, it is difficult to
decide when we have mixed type of VMs).


Agree.

Not sure if the following alternatives have also been looked at :

- Could the behavior associated with the "ple_window" be modified to be
a function of some [new] per-guest attribute (which can be conveyed to
the host as part of the guest launch sequence). The user can choose to
set this [new] attribute for a given guest. This would help avoid the
frequent exits due to PLE (as Avi had mentioned earlier) ?


Ccing Drew also. We had a good discussion on this idea last time
(sorry that I forgot to include it in the patch series).

It may be a good idea when we know the load in advance.



- Can the PLE feature ( in VT) be "enhanced" to be made a per guest
attribute ?


IMHO, the approach of not taking a frequent exit is better than taking
an exit and returning back from the handler etc.


I entirely agree on this point (though I have not tried the above
approaches). Hope to see more expert opinions pouring in.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
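

For reference on the ple_window idea: today the knob is a host-global module parameter of kvm-intel rather than a per-guest attribute, which is exactly why a per-guest version needs new plumbing. From memory, arch/x86/kvm/vmx.c carries roughly the following (treat the details as approximate):

/* arch/x86/kvm/vmx.c (approximate): host-wide knobs, settable only at
 * kvm-intel module load time, e.g. "modprobe kvm-intel ple_window=16384" */
static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
module_param(ple_gap, int, S_IRUGO);

static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
module_param(ple_window, int, S_IRUGO);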


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-21 Thread Chegu Vinod

On 9/21/2012 4:59 AM, Raghavendra K T wrote:

In some special scenarios like #vcpu <= #pcpu, PLE handler may
prove very costly,


Yes.

  because there is no need to iterate over vcpus
and do unsuccessful yield_to burning CPU.

An idea to solve this is:
1) As Avi had proposed we can modify hardware ple_window
dynamically to avoid frequent PL-exit.


Yes. We had to do this to get around some scaling issues for large 
(>20way) guests (with no overcommitment)


As part of some experimentation we even tried "switching off"  PLE too :(




(IMHO, it is difficult to
decide when we have mixed type of VMs).


Agree.

Not sure if the following alternatives have also been looked at :

- Could the  behavior  associated with the "ple_window" be modified to 
be a function of some [new] per-guest attribute (which can be conveyed 
to the host as part of the guest launch sequence). The user can choose 
to set this [new] attribute for a given guest. This would help avoid the 
frequent exits due to PLE (as Avi had mentioned earlier) ?


- Can the PLE feature ( in VT) be "enhanced" to be made a per guest 
attribute ?



IMHO, the approach of not taking a frequent exit is better than taking 
an exit and then returning from the handler, etc.


Thanks
Vinod






Another idea, proposed in the first patch, is to identify
non-overcommit case and just return from the PLE handler.

There are are many ways to identify non-overcommit scenario.
1) Using loadavg etc (get_avenrun/calc_global_load
  /this_cpu_load)

2) Explicitly check nr_running()/num_online_cpus()

3) Check source vcpu runqueue length.

Not sure how can we make use of (1) effectively/how to use it.
(2) has significant overhead since it iterates all cpus.
so this patch uses third method. (I feel it is uglier to export
runqueue length, but expecting suggestion on this).

In second patch, when we have large number of small guests, it is
possible that a spinning vcpu fails to yield_to any vcpu of same
VM and go back and spin. This is also not effective when we are
over-committed. Instead, we do a schedule() so that we give chance
to other VMs to run.

Raghavendra K T(2):
  Handle undercommitted guest case in PLE handler
  Be courteous to other VMs in overcommitted scenario in PLE handler

Results:
base = 3.6.0-rc5 + ple handler optimization patches from kvm tree.
patched = base + patch1 + patch2
machine: x240 with 16 core with HT enabled (32 cpu thread).
32 vcpu guest with 8GB RAM.

+-------------+-----------+-------------+-----------+-----------+
        ebizzy (records/sec, higher is better)
+-------------+-----------+-------------+-----------+-----------+
     base        stddev      patched       stddev     %improve
+-------------+-----------+-------------+-----------+-----------+
  11293.3750    624.4378    18209.6250    371.7061    61.24166
   3641.8750    468.9400     3725.5000    253.7823     2.29621
+-------------+-----------+-------------+-----------+-----------+

+-------------+-----------+-------------+-----------+-----------+
        kernbench (time in sec, lower is better)
+-------------+-----------+-------------+-----------+-----------+
     base        stddev      patched       stddev     %improve
+-------------+-----------+-------------+-----------+-----------+
     30.6020      1.3018       30.8287      1.1517    -0.74080
     64.0825      2.3764       63.4721      5.0191     0.95252
     95.8638      8.7030       94.5988      8.3832     1.31958
+-------------+-----------+-------------+-----------+-----------+

Note:
on mx3850x5 machine with 32 cores HT disabled I got around
ebizzy  209%
kernbench   6%
improvement for 1x scenario.

Thanks Srikar for his active partipation in discussing ideas and
reviewing the patch.

Please let me know your suggestions and comments.
---
  include/linux/sched.h |1 +
  kernel/sched/core.c   |6 ++
  virt/kvm/kvm_main.c   |7 +++
  3 files changed, 14 insertions(+), 0 deletions(-)

.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-21 Thread Raghavendra K T
In some special scenarios like #vcpu <= #pcpu, the PLE handler may
prove very costly, because there is no need to iterate over vcpus
and do unsuccessful yield_to()s, burning CPU.

An idea to solve this is:
1) As Avi had proposed, we can modify the hardware ple_window
dynamically to avoid frequent PL-exits. (IMHO, it is difficult to
decide on this when we have mixed types of VMs.)

Another idea, proposed in the first patch, is to identify the
non-overcommit case and just return from the PLE handler.

There are many ways to identify a non-overcommit scenario:
1) Using loadavg etc. (get_avenrun/calc_global_load
 /this_cpu_load)

2) Explicitly checking nr_running()/num_online_cpus()

3) Checking the source vcpu's runqueue length.

Not sure how we can make use of (1) effectively/how to use it.
(2) has significant overhead since it iterates over all cpus,
so this patch uses the third method (a rough sketch of what that could
look like follows after this mail). I feel it is uglier to export the
runqueue length, but I am expecting suggestions on this.

In the second patch: when we have a large number of small guests, it is
possible that a spinning vcpu fails to yield_to any vcpu of the same
VM and goes back to spinning. This is also not effective when we are
over-committed. Instead, we do a schedule() so that we give other VMs
a chance to run.

Raghavendra K T(2):
 Handle undercommitted guest case in PLE handler
 Be courteous to other VMs in overcommitted scenario in PLE handler 

Results:
base = 3.6.0-rc5 + ple handler optimization patches from kvm tree.
patched = base + patch1 + patch2
machine: x240 with 16 core with HT enabled (32 cpu thread).
32 vcpu guest with 8GB RAM.

+-------------+-----------+-------------+-----------+-----------+
        ebizzy (records/sec, higher is better)
+-------------+-----------+-------------+-----------+-----------+
     base        stddev      patched       stddev     %improve
+-------------+-----------+-------------+-----------+-----------+
  11293.3750    624.4378    18209.6250    371.7061    61.24166
   3641.8750    468.9400     3725.5000    253.7823     2.29621
+-------------+-----------+-------------+-----------+-----------+

+-------------+-----------+-------------+-----------+-----------+
        kernbench (time in sec, lower is better)
+-------------+-----------+-------------+-----------+-----------+
     base        stddev      patched       stddev     %improve
+-------------+-----------+-------------+-----------+-----------+
     30.6020      1.3018       30.8287      1.1517    -0.74080
     64.0825      2.3764       63.4721      5.0191     0.95252
     95.8638      8.7030       94.5988      8.3832     1.31958
+-------------+-----------+-------------+-----------+-----------+

Note:
on an mx3850x5 machine with 32 cores and HT disabled I got around
ebizzy      209%
kernbench     6%
improvement for the 1x scenario.

Thanks to Srikar for his active participation in discussing ideas and
reviewing the patch.

Please let me know your suggestions and comments.
---
 include/linux/sched.h |1 +
 kernel/sched/core.c   |6 ++
 virt/kvm/kvm_main.c   |7 +++
 3 files changed, 14 insertions(+), 0 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
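

A rough sketch of what option (3) above could look like. The helper name and its placement are guesses rather than the RFC's actual 6-line sched change (its declaration would be the one-line include/linux/sched.h addition), and the kvm side is reduced to the early-return idea of patch 1:

/* kernel/sched/core.c -- hypothetical accessor so kvm can look at the
 * length of the source vcpu's runqueue */
unsigned long this_rq_nr_running(void)
{
        return this_rq()->nr_running;
}
EXPORT_SYMBOL_GPL(this_rq_nr_running);

/* virt/kvm/kvm_main.c -- patch 1 idea: bail out of the PLE handler when
 * the spinning vcpu is the only runnable task on its runqueue */
void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
        if (this_rq_nr_running() == 1)
                return;         /* #vcpu <= #pcpu case: just resume the guest */

        /* ... existing directed-yield loop ... */
}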


Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-21 Thread Chegu Vinod

On 9/21/2012 4:59 AM, Raghavendra K T wrote:

In some special scenarios like #vcpu <= #pcpu, PLE handler may
prove very costly,


Yes.

  because there is no need to iterate over vcpus
and do unsuccessful yield_to burning CPU.

An idea to solve this is:
1) As Avi had proposed we can modify hardware ple_window
dynamically to avoid frequent PL-exit.


Yes. We had to do this to get around some scaling issues for large 
(20way) guests (with no overcommitment)


As part of some experimentation we even tried switching off  PLE too :(




(IMHO, it is difficult to
decide when we have mixed type of VMs).


Agree.

Not sure if the following alternatives have also been looked at:

- Could the behavior associated with the ple_window be modified to
be a function of some [new] per-guest attribute (which can be conveyed
to the host as part of the guest launch sequence)? The user can choose
to set this [new] attribute for a given guest. This would help avoid the
frequent exits due to PLE (as Avi had mentioned earlier).

- Can the PLE feature (in VT) be enhanced to be made a per-guest
attribute?
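
Just to illustrate the first suggestion (nothing below exists in KVM
today; the per-VM field, its name, and how it would be filled in from
the guest launch sequence are entirely hypothetical):

/* Hypothetical per-VM override of the global kvm_intel ple_window
 * module parameter, applied when each vcpu's VMCS is programmed. */
struct kvm_arch {
        /* ... existing fields ... */
        unsigned int ple_window;        /* 0 = use the module default */
};

static void vmx_setup_ple_window(struct kvm_vcpu *vcpu)
{
        unsigned int w = vcpu->kvm->arch.ple_window ?: ple_window;

        vmcs_write32(PLE_WINDOW, w);
}

How userspace would set such an attribute (a new VM ioctl, a sysfs
knob, etc.) is exactly the open question.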



IMHO, the approach of not taking a frequent exit is better than taking 
an exit and returning back from the handler etc.


Thanks
Vinod






Another idea, proposed in the first patch, is to identify the
non-overcommit case and just return from the PLE handler.

There are many ways to identify a non-overcommit scenario:
1) Using loadavg etc. (get_avenrun / calc_global_load / this_cpu_load)

2) Explicitly check nr_running()/num_online_cpus()

3) Check the source vcpu's runqueue length.

Not sure how we can make use of (1) effectively. (2) has significant
overhead since it iterates over all cpus, so this patch uses the third
method. (I feel it is ugly to export the runqueue length, but I am
expecting suggestions on this.)

In the second patch: when we have a large number of small guests, it is
possible that a spinning vcpu fails to yield_to any vcpu of the same
VM and goes back to spinning. This is also not effective when we are
over-committed. Instead, we do a schedule() so that we give other
VMs a chance to run.

Raghavendra K T(2):
  Handle undercommitted guest case in PLE handler
  Be courteous to other VMs in overcommitted scenario in PLE handler

Results:
base = 3.6.0-rc5 + ple handler optimization patches from kvm tree.
patched = base + patch1 + patch2
machine: x240 with 16 cores, HT enabled (32 cpu threads).
32 vcpu guest with 8GB RAM.

+--------------+------------+--------------+------------+------------+
              ebizzy (records/sec, higher is better)
+--------------+------------+--------------+------------+------------+
      base        stddev       patched        stdev       %improve
+--------------+------------+--------------+------------+------------+
  11293.3750     624.4378    18209.6250     371.7061     61.24166
   3641.8750     468.9400     3725.5000     253.7823      2.29621
+--------------+------------+--------------+------------+------------+

+--------------+------------+--------------+------------+------------+
             kernbench (time in sec, lower is better)
+--------------+------------+--------------+------------+------------+
      base        stddev       patched        stdev       %improve
+--------------+------------+--------------+------------+------------+
     30.6020      1.3018       30.8287       1.1517      -0.74080
     64.0825      2.3764       63.4721       5.0191       0.95252
     95.8638      8.7030       94.5988       8.3832       1.31958
+--------------+------------+--------------+------------+------------+

Note:
on an mx3850x5 machine with 32 cores (HT disabled) I got around
ebizzy      209%
kernbench     6%
improvement for the 1x scenario.

Thanks to Srikar for his active participation in discussing ideas and
reviewing the patch.

Please let me know your suggestions and comments.
---
  include/linux/sched.h |1 +
  kernel/sched/core.c   |6 ++
  virt/kvm/kvm_main.c   |7 +++
  3 files changed, 14 insertions(+), 0 deletions(-)






Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-21 Thread Raghavendra K T

On 09/21/2012 06:48 PM, Chegu Vinod wrote:

On 9/21/2012 4:59 AM, Raghavendra K T wrote:

In some special scenarios like #vcpu <= #pcpu, PLE handler may
prove very costly,


Yes.

because there is no need to iterate over vcpus
and do unsuccessful yield_to burning CPU.

An idea to solve this is:
1) As Avi had proposed we can modify hardware ple_window
dynamically to avoid frequent PL-exit.


Yes. We had to do this to get around some scaling issues for large
(20way) guests (with no overcommitment)


Do you mean you already have some solution tested for this?



As part of some experimentation we even tried switching off PLE too :(



Honestly, your experiment and Andrew Theurer's observations were the
motivation for this patch.





(IMHO, it is difficult to
decide when we have mixed type of VMs).


Agree.

Not sure if the following alternatives have also been looked at :

- Could the behavior associated with the ple_window be modified to be
a function of some [new] per-guest attribute (which can be conveyed to
the host as part of the guest launch sequence). The user can choose to
set this [new] attribute for a given guest. This would help avoid the
frequent exits due to PLE (as Avi had mentioned earlier) ?


Ccing Drew also. We had a good discussion on this idea last time
(sorry that I forgot to include him on the patch series).

Maybe a good idea when we know the load in advance.



- Can the PLE feature ( in VT) be enhanced to be made a per guest
attribute ?


IMHO, the approach of not taking a frequent exit is better than taking
an exit and returning back from the handler etc.


I entirely agree on this point (though I have not tried the above
approaches). I hope to see more expert opinions pouring in.
