Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-03-06 Thread Glauber Costa
On 03/06/2013 05:41 AM, Marcelo Tosatti wrote:
 On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote:
 Sorry for the delay in the response.  I did not see the email
 right away.

 On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote:
 On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
 2013/2/5 Michael Wolf m...@linux.vnet.ibm.com:
 In the case of where you have a system that is running in a
 capped or overcommitted environment the user may see steal time
 being reported in accounting tools such as top or vmstat.  This can
 cause confusion for the end user.

 Sorry, I'm no expert in this area. But I don't really understand what
 is confusing for the end user here.

 I suppose that what is wanted is to subtract stolen time due to 'known
 reasons' from steal time reporting. 'Known reasons' being, for example,
 hard caps. So a vcpu executing instructions with no halt, but limited to
 80% of available bandwidth, would not have 20% of stolen time reported.

 Yes, exactly, and the end user many times did not set up the guest and is
 not aware of the capping.  The end user is only aware of the performance
 level that they were told they would get with the guest.
 But yes, a description of the scenario that is being dealt with, with
 details, is important.

 I will add more detail to the description next time I submit the
 patches.  How about something like, "In a cloud environment the user of a
 kvm guest is not aware of the underlying hardware or how many other
 guests are running on it.  The end user is only aware of a level of
 performance that they should see."  Or does that just muddy the picture
 more?
 
 So what the feature aims for is to report stolen time relative to hard
 capping. That is: stolen time should be counted as time stolen from
 the guest _beyond_ hard capping. Yes?
 
 Probably don't need to report new data to the guest for that.
 
If we take into account that every second always adds up to one second
of accounted time, I believe you can just subtract the consigned time
from the steal time the host passes to the guest.

During each second, the numbers won't sum up to 100. The delta to 100 is
the consigned time, if anyone cares.

Adopting this would simplify things a lot. All you really need to do is
get your calculation right from the bandwidth given by the cpu
controller. Subtract it in the host, and voila.
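
A minimal sketch of that host-side calculation (hypothetical helper;
it assumes a cfs-style quota/period cap, and none of these names are in
the patches):

	/*
	 * With a hard cap of quota/period, the guest is expected to be
	 * "stolen from" for the remaining fraction of any interval.
	 * That expected share is the consigned time; only steal beyond
	 * it would be reported to the guest.
	 */
	static u64 consigned_ns(u64 delta_ns, u64 quota_ns, u64 period_ns)
	{
		if (quota_ns >= period_ns)	/* no cap, nothing consigned */
			return 0;
		return div64_u64(delta_ns * (period_ns - quota_ns), period_ns);
	}

With an 80% cap, a one-second window consigns 200ms: the host would
report max(0, raw_steal - 200ms) as steal, and the per-second buckets
then no longer sum to 100%.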




Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.

2013-02-07 Thread Glauber Costa
On 02/06/2013 10:07 PM, Michael Wolf wrote:
 On 02/06/2013 08:36 AM, Glauber Costa wrote:
 On 02/06/2013 01:49 AM, Michael Wolf wrote:
 Add a helper routine to scheduler/core.c to allow the kvm module
 to retrieve the cpu hardlimit settings.  The values will be used
 to set up a timer that is used to separate the consigned from the
 steal time.
 Sorry: What is the business of a timer in here?
 Whenever we read steal time, we know how much time has passed and with
 that information we can know the entitlement for the period. This breaks
 if we suspend, but we know that we suspended, so this is not a problem.
 I may be missing something, but how do we know how much time has
 passed?  That is why
 I had the timer in there.  I will go look again at the code but I
 thought the data was collected
 as ticks and passed at random times.  The ticks are also accumulating so
 we are looking at the
 difference in the count between reads.

They can be collected at random times, but you can of course record the
time at which it happened.
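
A sketch of what recording that time could look like (the per-vcpu
bookkeeping below is made up for illustration; it is not in the series):

	/* remember when steal was last folded, instead of using a timer */
	struct steal_window {
		u64 last_read_ns;	/* time of the previous steal read */
		u64 quota_ns;		/* runtime allowed per period */
		u64 period_ns;
	};

	/* expected (consigned) steal for the window since the last read */
	static u64 window_consigned_ns(struct steal_window *w, u64 now_ns)
	{
		u64 elapsed = now_ns - w->last_read_ns;
		u64 entitled = div64_u64(elapsed * w->quota_ns, w->period_ns);

		w->last_read_ns = now_ns;
		return elapsed - entitled;
	}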



Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.

2013-02-06 Thread Glauber Costa
On 02/06/2013 01:49 AM, Michael Wolf wrote:
 Add a helper routine to scheduler/core.c to allow the kvm module
 to retrieve the cpu hardlimit settings.  The values will be used
 to set up a timer that is used to separate the consigned from the
 steal time.

Sorry: What is the business of a timer in here?
Whenever we read steal time, we know how much time has passed and with
that information we can know the entitlement for the period. This breaks
if we suspend, but we know that we suspended, so this is not a problem.

Everything bigger than the entitlement is steal time.




Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-12-05 Thread Glauber Costa
I am deeply sorry.

I was busy the first time I read this, so I postponed answering and ended up
forgetting.

Sorry

 include/linux/sched.h:
 unsigned long long run_delay; /* time spent waiting on a runqueue */

 So if you are out of the runqueue, you won't get steal time accounted,
 and then I truly fail to understand what you are doing.
 So I looked at something like this in the past.  To make sure things
 haven't changed
 I set up a cgroup on my test server running a kernel built from the
 latest tip tree.
 
 [root]# cat cpu.cfs_quota_us
 5
 [root]# cat cpu.cfs_period_us
 10
 [root]# cat cpuset.cpus
 1
 [root]# cat cpuset.mems
 0
 
 Next I put the PID from the cpu thread into tasks.  When I start a
 script that will hog the cpu, I see the following in top on the guest:
 Cpu(s):  1.9%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa, 48.3%hi, 0.0%si,
 49.8%st
 
 So the steal time here is in line with the bandwidth control settings.

Ok. So I was wrong in my hunch that it would be outside the runqueue,
therefore work automatically. Still, the host kernel has all the
information in cgroups.

 So then the steal time did not show on the guest.  You have no value
 that needs to be passed around.  What I did not like about this approach
 was:
 * only works for cfs bandwidth control.  If another type of hard limit
   was added to the kernel, the code would potentially need to change.

This is true for almost everything we have in the kernel!
It is *very* unlikely for another bandwidth control mechanism to ever
appear. If one ever does, it's *their* burden to make sure it works for
steal time (provided it is merged). Code in tree gets precedence.

 * This approach doesn't help if the limits are set by overcommitting the
   cpus.  It is my understanding that this is a common approach.
 

I can't say anything about commonality, but common or not, it is a
*crazy* approach.

When you simply overcommit, you have no way to differentiate between
intended steal time and non-intended steal time. Moreover, when you
overcommit, your cpu usage will vary over time. If two guests use the
cpu to their full power, you will have 50 % each. But if one of them
slows down, the other gets more. What is your entitlement value? How do
you define this?

And then, after you define it, if you end up using more than this, what is
your cpu usage? 130%?


The only sane way to do it, is to communicate this value to the kernel
somehow. The bandwidth controller is the interface we have for that. So
everybody that wants to *intentionally* overcommit needs to communicate
this to the controller. IOW: Any sane configuration should be explicit
about your capping.

 Add an ioctl to communicate the consign limit to the host.
 This definitely should go away.

 More specifically, *whatever* way we use to cap the processor, the host
 system will have all the information at all times.
 I'm not understanding that comment.  If you are capping by simply
 controlling the amount of overcommit on the host, then wouldn't you
 still need some value to indicate the desired amount?
No, that is just crazy, and I don't like it a single bit.

So in the light of it: Whatever capping mechanism we have, we need to be
explicit about the expected entitlement. At this point, the kernel
already knows what it is, and needs no extra ioctls or anything like that.
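
For illustration, a sketch of where that information already lives on
the host (task_group() and struct cfs_bandwidth are scheduler internals,
used here only to make the point; this is not code from the series):

	static void vcpu_cap(struct task_struct *vcpu_task,
			     u64 *quota, u64 *period)
	{
		struct task_group *tg = task_group(vcpu_task);

		*quota  = tg->cfs_bandwidth.quota;	/* RUNTIME_INF if uncapped */
		*period = ktime_to_ns(tg->cfs_bandwidth.period);
	}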





Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes

2012-12-03 Thread Glauber Costa
On 11/30/2012 09:55 PM, Pavel Emelyanov wrote:
 Hello,
 
 This is an attempt to implement support for memory snapshot for the
 checkpoint-restore project (http://criu.org).
 
 To create a dump of an application(s) we save all the information about it
 to files. No surprise, the biggest part of such dump is the contents of
 tasks' memory. However, in some usage scenarios it's not required to get
 _all_ the task memory while creating a dump. For example, when doing
 periodical dumps it's only required to take a full memory dump at the
 first step and then take incremental changes of memory. Another example
 is live migration. In the simplest form it looks like -- create dump,
 copy it on the remote node, then restore tasks from dump files. While all
 this dump-copy-restore thing goes on, all the processes must be stopped.
 However, if we can monitor how tasks change their memory, we can dump and
 copy it in smaller chunks, periodically updating it and thus freezing
 tasks only at the very end, for a very short time, to pick up the recent
 changes.
 
 That said, some help from the kernel to watch how processes modify the
 contents of their memory is required. I'd like to propose one possible
 solution of this task -- with the help of page faults and trace events.
 
 Briefly, the approach is -- remap some memory regions as read-only, get
 the #pf on the task's attempt to modify the memory, and issue a trace
 event of that. Since we're only interested in parts of memory of some
 tasks, make it possible to mark the vmas we're interested in and issue
 events for them only. Also, to be aware of tasks unmapping the vma-s
 being watched, issue an event when the marked vma is removed (and for
 symmetry -- an event when a vma is marked).
 
 What do you think about this approach? Is this way of supporting mem snapshot
 OK for you, or should we invent some better one?
 

The page fault mechanism is pretty obvious - anything that deals with
dirty pages will end up having to do this. So there is nothing crazy
about this.

What concerns me, however, is that should this go in, we'll have two
dirty mem loggers in the kernel: one to support CRIU, one to support
KVM. And the worst part: they have the exact same purpose!!

So to begin with, I think one thing to consider would be to generalize
KVM's dirty memory notification so it can work on a normal process
memory region. The KVM api requires a memory slot to be passed, something
we are unlikely to have. But KVM can easily keep its API and use an
alternate mechanism; that's trivial...

Generally speaking, KVM will do polling with this ioctl. I like your
tracing mechanism better. The only difference is that KVM tends to
transfer large chunks of memory in some loads - in the high gigs range.
So the proposed tracing API should be able to optionally batch requests
within a time frame.

It would also be good to hear what the KVM guys think of it as well.
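
As a rough illustration of the batching idea, a hypothetical coalesced
record (not a proposal for the actual trace format):

	/*
	 * Instead of one trace event per faulting page, report runs of
	 * consecutively dirtied pages accumulated within a time frame.
	 */
	struct dirty_page_batch {
		u64 start_pfn;	/* first page of the run */
		u32 nr_pages;	/* length of the run */
		u32 flags;
	};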





Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes

2012-12-03 Thread Glauber Costa
On 12/04/2012 12:16 AM, Marcelo Tosatti wrote:
 On Mon, Dec 03, 2012 at 12:36:33PM +0400, Glauber Costa wrote:
 On 11/30/2012 09:55 PM, Pavel Emelyanov wrote:
 Hello,

 This is an attempt to implement support for memory snapshot for the
 checkpoint-restore project (http://criu.org).

 To create a dump of an application(s) we save all the information about it
 to files. No surprise, the biggest part of such dump is the contents of
 tasks' memory. However, in some usage scenarios it's not required to get
 _all_ the task memory while creating a dump. For example, when doing
 periodical dumps it's only required to take a full memory dump at the
 first step and then take incremental changes of memory. Another example
 is live migration. In the simplest form it looks like -- create dump,
 copy it on the remote node, then restore tasks from dump files. While all
 this dump-copy-restore thing goes on, all the processes must be stopped.
 However, if we can monitor how tasks change their memory, we can dump and
 copy it in smaller chunks, periodically updating it and thus freezing
 tasks only at the very end, for a very short time, to pick up the recent
 changes.

 That said, some help from the kernel to watch how processes modify the
 contents of their memory is required. I'd like to propose one possible
 solution of this task -- with the help of page faults and trace events.

 Briefly, the approach is -- remap some memory regions as read-only, get
 the #pf on the task's attempt to modify the memory, and issue a trace
 event of that. Since we're only interested in parts of memory of some
 tasks, make it possible to mark the vmas we're interested in and issue
 events for them only. Also, to be aware of tasks unmapping the vma-s
 being watched, issue an event when the marked vma is removed (and for
 symmetry -- an event when a vma is marked).

 What do you think about this approach? Is this way of supporting mem
 snapshot OK for you, or should we invent some better one?


 The page fault mechanism is pretty obvious - anything that deals with
 dirty pages will end up having to do this. So there is nothing crazy
 about this.

 What concerns me, however, is that should this go in, we'll have two
 dirty mem loggers in the kernel: one to support CRIU, one to support
 KVM. And the worst part: they have the exact same purpose!!

 So to begin with, I think one thing to consider would be to generalize
 KVM's dirty memory notification so it can work on a normal process
 memory region. The KVM api requires a memory slot to be passed, something
 we are unlikely to have. But KVM can easily keep its API and use an
 alternate mechanism; that's trivial...

 Generally speaking, KVM will do polling with this ioctl. I like your
 tracing mechanism better. The only difference is that KVM tends to
 transfer large chunks of memory in some loads - in the high gigs range.
 So the proposed tracing API should be able to optionally batch requests
 within a time frame.

 It would also be good to hear what the KVM guys think of it as well.
 
 There are significant differences. KVM's dirty logging works for
 guest translations (NPT/shadow) and is optimized for specific use cases.
 
 Above is about dirty logging of userspace memory areas.

This is just the envelope.

At the end, KVM is tracking pages, regardless of their format, and wants
to know when pages are dirty, and which pages are dirty.

What you are saying, for example, is that if Linux had dirty page
tracking for userspace memory, kvm could not make use of it? IIRC, for
qemu's migration algorithm, a notification of dirty user pages - plus
some eventual extra trickery - would do just fine.





Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-28 Thread Glauber Costa
On 11/27/2012 07:10 PM, Michael Wolf wrote:
 On 11/27/2012 02:48 AM, Glauber Costa wrote:
 Hi,

 On 11/27/2012 12:36 AM, Michael Wolf wrote:
 In the case of where you have a system that is running in a
 capped or overcommitted environment the user may see steal time
 being reported in accounting tools such as top or vmstat.  This can
 cause confusion for the end user.  To ease the confusion this patch set
 adds the idea of consigned (expected steal) time.  The host will
 separate
 the consigned time from the steal time.  The consignment limit passed
 to the
 host will be the amount of steal time expected within a fixed period of
 time.  Any other steal time accruing during that period will show as the
 traditional steal time.
 If you submit this again, please include a version number in your series.
 Will do.  The patchset was sent twice yesterday by mistake.  I got an
 error the first time and didn't think the patches went out.  This has
 been corrected.

 It would also be helpful to include a small changelog about what changed
 between last version and this version, so we could focus on that.
 Yes, will do that.  When I took the RFC off the patches I was looking at
 it as a new patchset, which was a mistake.  I will make sure to add a
 changelog when I submit again.

 As for the rest, I answered your previous two submissions saying I don't
 agree with the concept. If you hadn't changed anything, resending it
 won't change my mind.

 I could of course, be mistaken or misguided. But I had also not seen any
 wave of support in favor of this previously, so basically I have no new
 data to make me believe I should see it any differently.

 Let's try this again:

 * Rik asked you in your last submission how ppc handles this. You
 said, and I quote: "In the case of lpar on POWER systems they simply
 report steal time and do not alter it in any way.
 They do however report how much processor is assigned to the partition
 and that information is in /proc/ppc64/lparcfg."
 Yes, but we still get questions from users asking "what is steal time?
 why am I seeing this?"

 Now, that is a *way* more sensible thing to do. Much more. Confusing
 users is something extremely subjective. This is especially true about
 concepts that have been known for quite some time, like steal time. If you
 all of a sudden change the meaning of this, it is sure to confuse a lot
 more users than it would clarify.
 Something like this could certainly be done.  But when I was submitting
 the patch set as an RFC, qemu was passing a cpu percentage that would be
 used by the guest kernel to adjust the steal time.  This percentage was
 being stored on the guest as a sysctl value.
 Avi stated he didn't like that kind of coupling, and that the value
 could get out of sync.  Anthony stated, "The guest shouldn't need to know
 its entitlement. Or at least, it's up to a management tool to report
 that in a way that's meaningful for the guest."
 
 So perhaps I misunderstood what they were suggesting, but I took it to
 mean that they did not want the guest to know what the entitlement was;
 that the host should take care of it and just report the already
 adjusted data to the guest.  So in this version of the code the host
 would use a set period for a timer and be passed essentially a number
 of ticks of expected steal time.  The host would then use the timer to
 break out the steal time into consigned and steal buckets, which would
 be reported to the guest.
 
 Both the consigned and the steal would be reported via /proc/stat, so
 anyone needing to see total time away could add the two fields together.
 The user, however, when using tools like top or vmstat, would see the
 usage based on what the guest is entitled to.
 
 Do you have suggestions for how I can build consensus around one of the
 two approaches?
 

Before I answer this, can you please detail which mechanism are you
using to enforce the entitlement? Is it the cgroup cpu controller, or
something else?



Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-28 Thread Glauber Costa
On 11/28/2012 10:43 PM, Michael Wolf wrote:
 On 11/27/2012 05:24 PM, Marcelo Tosatti wrote:
 On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
 In the case of where you have a system that is running in a
 capped or overcommitted environment the user may see steal time
 being reported in accounting tools such as top or vmstat.
 The definition of stolen time is 'time during which the virtual CPU is
 runnable but not running'. Overcommit is the main scenario which steal
 time helps to detect.

 Can you describe the 'capped' case?
 In the capped case, the time that the guest spends waiting due to it
 having used its full allotment of time shows up as steal time.  The way
 my patchset currently stands is that you would set up the bandwidth
 control and you would have to pass it a matching value from qemu.  In
 the future, it would be possible to have something parse the bandwidth
 setting and automatically adjust the value used in the host for steal
 time reporting.

Ok, so correct me if I am wrong, but I believe you would be using
something like the bandwidth capper in the cpu cgroup to set those
entitlements, right?

Some time has passed since I last looked into it, but IIRC, after you
are out of your quota, you should be out of the runqueue. In the
lovely world of KVM, we approximate steal time as runqueue time:

arch/x86/kvm/x86.c:
delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
vcpu->arch.st.accum_steal = delta;

include/linux/sched.h:
unsigned long long run_delay; /* time spent waiting on a runqueue */

So if you are out of the runqueue, you won't get steal time accounted,
and then I truly fail to understand what you are doing.

In case I am wrong, and run_delay also includes the time you can't run
because you are out of capacity, then maybe what we should do is to
just subtract it from run_delay in kvm/x86.c before we pass it on. In
summary:


Alter the amount of steal time reported by the guest.
Maybe this should go away.

Expand the steal time msr to also contain the consigned time.
Maybe this should go away.

Add the code to send the consigned time from the host to the guest.
This definitely should be heavily modified.

Add a timer to allow the separation of consigned from steal time.
Maybe this should go away.

Add an ioctl to communicate the consign limit to the host.
This definitely should go away.

More specifically, *whatever* way we use to cap the processor, the host
system will have all the information at all times.
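
A sketch of the subtraction idea against the snippet quoted above
(consigned_ns() is a hypothetical helper standing in for whatever the
host derives from the bandwidth controller; nothing here is from the
patches):

	delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
	vcpu->arch.st.last_steal = current->sched_info.run_delay;
	/* fold the expected (consigned) share out before reporting steal */
	consigned = min(delta, consigned_ns(vcpu));
	vcpu->arch.st.accum_steal = delta - consigned;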





Re: [PATCH 0/5] Alter steal time reporting in KVM

2012-11-27 Thread Glauber Costa
Hi,

On 11/27/2012 12:36 AM, Michael Wolf wrote:
 In the case of where you have a system that is running in a
 capped or overcommitted environment the user may see steal time
 being reported in accounting tools such as top or vmstat.  This can
 cause confusion for the end user.  To ease the confusion this patch set
 adds the idea of consigned (expected steal) time.  The host will separate
 the consigned time from the steal time.  The consignment limit passed to the
 host will be the amount of steal time expected within a fixed period of
 time.  Any other steal time accruing during that period will show as the
 traditional steal time.

If you submit this again, please include a version number in your series.

It would also be helpful to include a small changelog about what changed
between last version and this version, so we could focus on that.

As for the rest, I answered your previous two submissions saying I don't
agree with the concept. If you hadn't changed anything, resending it
won't change my mind.

I could of course, be mistaken or misguided. But I had also not seen any
wave of support in favor of this previously, so basically I have no new
data to make me believe I should see it any differently.

Let's try this again:

* Rik asked you in your last submission how ppc handles this. You
said, and I quote: "In the case of lpar on POWER systems they simply
report steal time and do not alter it in any way.
They do however report how much processor is assigned to the partition
and that information is in /proc/ppc64/lparcfg."

Now, that is a *way* more sensible thing to do. Much more. Confusing
users is something extremely subjective. This is especially true about
concepts that have been known for quite some time, like steal time. If you
all of a sudden change the meaning of this, it is sure to confuse a lot
more users than it would clarify.





 
 ---
 
 Michael Wolf (5):
   Alter the amount of steal time reported by the guest.
   Expand the steal time msr to also contain the consigned time.
   Add the code to send the consigned time from the host to the guest
   Add a timer to allow the separation of consigned from steal time.
   Add an ioctl to communicate the consign limit to the host.
 
 
  arch/x86/include/asm/kvm_host.h   |   11 +++
  arch/x86/include/asm/kvm_para.h   |3 +-
  arch/x86/include/asm/paravirt.h   |4 +--
  arch/x86/include/asm/paravirt_types.h |2 +
  arch/x86/kernel/kvm.c |8 ++---
  arch/x86/kernel/paravirt.c|4 +--
  arch/x86/kvm/x86.c|   50 -
  fs/proc/stat.c|9 +-
  include/linux/kernel_stat.h   |2 +
  include/linux/kvm_host.h  |2 +
  include/uapi/linux/kvm.h  |2 +
  kernel/sched/core.c   |   10 ++-
  kernel/sched/cputime.c|   21 +-
  kernel/sched/sched.h  |2 +
  virt/kvm/kvm_main.c   |7 +
  15 files changed, 120 insertions(+), 17 deletions(-)
 



Re: [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration

2012-11-20 Thread Glauber Costa
On 11/20/2012 01:58 AM, Marcelo Tosatti wrote:
 As requested by Glauber, do not update kvmclock area on vcpu->pcpu 
 migration, in case the host has stable TSC. 
 
 This is to reduce cacheline bouncing.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 Index: vsyscall/arch/x86/kvm/x86.c
 ===
 --- vsyscall.orig/arch/x86/kvm/x86.c
 +++ vsyscall/arch/x86/kvm/x86.c
 @@ -2615,7 +2615,12 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
  	kvm_x86_ops->write_tsc_offset(vcpu, offset);
  	vcpu->arch.tsc_catchup = 1;
  	}
  - kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
  + /*
  +  * On a host with synchronized TSC, there is no need to update
  +  * kvmclock on vcpu->cpu migration
  +  */
  + if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
  + kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
  	if (vcpu->cpu != cpu)
  	kvm_migrate_timers(vcpu);
  	vcpu->cpu = cpu;

Ok. Since you are only touching the one in kvm_arch_vcpu_load() and
leaving the others untouched, it looks correct.

Acked-by: Glauber Costa glom...@parallels.com


Re: [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5)

2012-11-20 Thread Glauber Costa
On 11/20/2012 01:57 AM, Marcelo Tosatti wrote:
 This patchset, based on earlier work by Jeremy Fitzhardinge, implements
 paravirtual clock vsyscall support.
 
 It should be possible to implement Xen support relatively easily.
 
 It reduces clock_gettime from 500 cycles to 200 cycles
 on my testbox.
 

There are no more significant objections from my side.
I will still try to go through it again today just in case.



Re: [patch 02/18] x86: kvmclock: allocate pvclock shared memory area

2012-11-15 Thread Glauber Costa
On 11/15/2012 04:08 AM, Marcelo Tosatti wrote:
 We want to expose the pvclock shared memory areas, which 
 the hypervisor periodically updates, to userspace.
 
 For a linear mapping from userspace, it is necessary that
 entire page sized regions are used for array of pvclock 
 structures.
 
 There is no such guarantee with per cpu areas, therefore move
 to memblock_alloc based allocation.
Ok.

 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
For the concept:

Acked-by: Glauber Costa glom...@parallels.com

I do have one comment.

 
 Index: vsyscall/arch/x86/kernel/kvmclock.c
 ===
 --- vsyscall.orig/arch/x86/kernel/kvmclock.c
 +++ vsyscall/arch/x86/kernel/kvmclock.c
 @@ -23,6 +23,7 @@
   #include <asm/apic.h>
   #include <linux/percpu.h>
   #include <linux/hardirq.h>
  +#include <linux/memblock.h>
   
   #include <asm/x86_init.h>
   #include <asm/reboot.h>
  @@ -39,7 +40,11 @@ static int parse_no_kvmclock(char *arg)
   early_param("no-kvmclock", parse_no_kvmclock);
  
  /* The hypervisor will put information about time periodically here */
  -static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
 +struct pvclock_aligned_vcpu_time_info {
 + struct pvclock_vcpu_time_info clock;
 +} __attribute__((__aligned__(SMP_CACHE_BYTES)));
 +
 +static struct pvclock_aligned_vcpu_time_info *hv_clock;
  static struct pvclock_wall_clock wall_clock;
  
  /*
 @@ -52,15 +57,20 @@ static unsigned long kvm_get_wallclock(v
   struct pvclock_vcpu_time_info *vcpu_time;
   struct timespec ts;
   int low, high;
 + int cpu;
 +
 + preempt_disable();
 + cpu = smp_processor_id();
  
   low = (int)__pa_symbol(&wall_clock);
   high = ((u64)__pa_symbol(&wall_clock) >> 32);
  
   native_write_msr(msr_kvm_wall_clock, low, high);
  
  - vcpu_time = &get_cpu_var(hv_clock);
  + vcpu_time = &hv_clock[cpu].clock;


You are leaving preempt disabled for a lot longer than it should be. In
particular, you'll have a round trip to the hypervisor in the middle. It
doesn't matter *that* much because wallclock is mostly a one-time thing,
so I won't oppose on this basis.

But if you have the chance, either fix it, or tell us why you need
preemption disabled for longer than it was before.
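
One way to keep the window minimal, under the assumption that the
wallclock race with migration really is harmless (a sketch, not the
actual patch):

	/* hypervisor round trip happens outside the critical section */
	low = (int)__pa_symbol(&wall_clock);
	high = ((u64)__pa_symbol(&wall_clock) >> 32);
	native_write_msr(msr_kvm_wall_clock, low, high);

	preempt_disable();
	cpu = smp_processor_id();
	vcpu_time = &hv_clock[cpu].clock;
	pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
	preempt_enable();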

  void __init kvmclock_init(void)
  {
 + unsigned long mem;
 +
   if (!kvm_para_available())
   return;
  
 + mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info) * NR_CPUS,
 +  PAGE_SIZE);
 + if (!mem)
 + return;
 + hv_clock = __va(mem);
 +
   if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
   msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
   msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
 

If you don't have kvmclock enabled, which the next line in here would
test for, you will exit this function with allocated memory. How about
the following?

if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
        msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
        msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
        return;

mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info)
                     * NR_CPUS, PAGE_SIZE);
if (!mem)
        return;
hv_clock = __va(mem);

printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
       msr_kvm_system_time, msr_kvm_wall_clock);

if (kvm_register_clock("boot clock")) {
        memblock_free();
        return;
}






Re: [patch 05/18] x86: pvclock: create helper for pvclock data retrieval

2012-11-15 Thread Glauber Costa
On 11/15/2012 04:08 AM, Marcelo Tosatti wrote:
 Originally from Jeremy Fitzhardinge.
 
 So code can be reused.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
I thought I had acked this one already?

But maybe I didn't...

Acked-by: Glauber Costa glom...@parallels.com


Re: [patch 06/18] x86: pvclock: introduce helper to read flags

2012-11-15 Thread Glauber Costa
On 11/15/2012 04:08 AM, Marcelo Tosatti wrote:
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 
 Index: vsyscall/arch/x86/kernel/pvclock.c
 ===
 --- vsyscall.orig/arch/x86/kernel/pvclock.c
Acked-by: Glauber Costa glom...@parallels.com



Re: [patch 07/18] x86: pvclock: add note about rdtsc barriers

2012-11-15 Thread Glauber Costa
On 11/15/2012 04:08 AM, Marcelo Tosatti wrote:
 As noted by Gleb, not advertising SSE2 support implies
 no RDTSC barriers.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

And this gets a separate patch because?

 Index: vsyscall/arch/x86/include/asm/pvclock.h
 ===
 --- vsyscall.orig/arch/x86/include/asm/pvclock.h
 +++ vsyscall/arch/x86/include/asm/pvclock.h
 @@ -74,6 +74,9 @@ unsigned __pvclock_read_cycles(const str
   u8 ret_flags;
  
   version = src->version;
 + /* Note: emulated platforms which do not advertise SSE2 support
 +  * result in kvmclock not using the necessary RDTSC barriers.
 +  */

And the expected effects are? Will it work in that case? Any precautions
one must take? Is it safe for Xen? Is it safe for KVM?

Those are the types of things I'd expect to see in a comment for
something as subtle as this.

   rdtsc_barrier();
   offset = pvclock_get_nsec_offset(src);
   ret = src->system_time + offset;
 



Re: [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration

2012-11-15 Thread Glauber Costa
On 11/15/2012 04:08 AM, Marcelo Tosatti wrote:
 As requested by Glauber, do not update kvmclock area on vcpu->pcpu 
 migration, in case the host has stable TSC. 
 
 This is to reduce cacheline bouncing.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
This looks fine, but it can always get tricky...
Assuming you tested this change in at least one stable tsc and one
unstable tsc system:

Acked-by: Glauber Costa glom...@parallels.com



Re: [patch 02/18] x86: kvmclock: allocate pvclock shared memory area (v2)

2012-11-15 Thread Glauber Costa
On 11/16/2012 06:07 AM, Marcelo Tosatti wrote:
 
 We want to expose the pvclock shared memory areas, which 
 the hypervisor periodically updates, to userspace.
 
 For a linear mapping from userspace, it is necessary that
 entire page sized regions are used for array of pvclock 
 structures.
 
 There is no such guarantee with per cpu areas, therefore move
 to memblock_alloc based allocation.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
No further objections.

Acked-by: Glauber Costa glom...@parallels.com



Re: [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR

2012-11-05 Thread Glauber Costa
On 11/02/2012 05:00 PM, Marcelo Tosatti wrote:
 On Fri, Nov 02, 2012 at 02:23:06PM +0400, Glauber Costa wrote:
 On 11/02/2012 01:39 AM, Marcelo Tosatti wrote:
 On Thu, Nov 01, 2012 at 06:28:31PM +0400, Glauber Costa wrote:
 On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
 Allow a guest to register a second location for the VCPU time info

 structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
 This is intended to allow the guest kernel to map this information
 into a usermode accessible page, so that usermode can efficiently
 calculate system time from the TSC without having to make a syscall.

 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com


  Changelog doesn't make a lot of sense. (especially from the first line to the
 second). Please add in here the reasons why we can't (or decided not to)
 use the same page. The info in the last mail thread is good enough, just
 put it here.

 Fixed.

 Index: vsyscall/arch/x86/include/asm/kvm_para.h
 ===
 --- vsyscall.orig/arch/x86/include/asm/kvm_para.h
 +++ vsyscall/arch/x86/include/asm/kvm_para.h
 @@ -23,6 +23,7 @@
  #define KVM_FEATURE_ASYNC_PF 4
  #define KVM_FEATURE_STEAL_TIME   5
  #define KVM_FEATURE_PV_EOI   6
 +#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7
  
  /* The last 8 bits are used to indicate how to interpret the flags field
   * in pvclock structure. If no bits are set, all flags are ignored.
 @@ -39,6 +40,7 @@
  #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
  #define MSR_KVM_STEAL_TIME  0x4b564d03
  #define MSR_KVM_PV_EOI_EN  0x4b564d04
 +#define MSR_KVM_USERSPACE_TIME  0x4b564d05
  

 I accept that it is possible that we may be better off with the page
 mapped twice.

 But why do we need an extra MSR? When, and why, would you enable the
  kernel-based pvclock, but disable the userspace pvclock?

 Because there is no stable TSC available, for example (which cannot
 be used to measure passage of time).


 What you say is true, but completely unrelated. I am not talking about
 situations in which userspace pvclock is available and you end up not
 using it.

 I am talking about situations in which it is available, you are capable
  of using it, but then decide for some reason to permanently disable it -
 as in not setting it up altogether.

 It seems to me that if the host has code to deal with userspace pvclock,
 and you already coded the guest in a way that you may or may not use it
 (dependent on the value of the stable bit), you could very well only
 check for the cpuid flag, and do the guest setup if available - skipping
 this MSR dance altogether.

 Now, of course, there is the problem of communicating the address in
 which the guest expects the page to be. Skipping the MSR setup would
 require it to be more or less at a fixed location. We could in principle
 lay them down together with the already existing pvclock structure. (But
 granted, I am not sure it is worth it...)

 I think in general, this question deserves a bit more of attention. We
 are about to have just the perfect opportunity for this next week, so
 let's use it.
 
 In essence you are proposing a different interface to communicate the
 userspace vsyscall pvclock area, other than the MSR, right?
 
 If so:
 
  1. What is the problem with the MSR interface?
  2. What advantage does this new interface (which honestly I do not
  understand) provide?
 
No, I am not proposing a different interface to communicate this.
I am proposing *no* interface to communicate this.

I'll grant that if it is absolutely necessary to have it at a random
address, and you need to pass it back and forth, of course MSRs would be
the choice. But I am not yet terribly convinced that all the pain of
syncing this from userspace, migrating those values, and bookkeeping
what is on and what is not, is worth it.




Re: [patch 09/16] x86: kvm guest: pvclock vsyscall support

2012-11-02 Thread Glauber Costa
On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
 + info = pvclock_get_vsyscall_time_info(cpu);
 +
 + low = (int)__pa(info) | 1;
 + high = ((u64)__pa(per_cpu(hv_clock, cpu)) >> 32);
 + ret = native_write_msr_safe(MSR_KVM_USERSPACE_TIME, low, high);
 + printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
 +cpu, high, low, txt);
 +

Why do you put info in the lower half, and the hv_clock in the higher half?



Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR

2012-11-02 Thread Glauber Costa
On 10/31/2012 07:12 AM, Marcelo Tosatti wrote:
 On Tue, Oct 30, 2012 at 11:39:32AM +0200, Avi Kivity wrote:
 On 10/29/2012 08:40 PM, Marcelo Tosatti wrote:
 On Mon, Oct 29, 2012 at 10:44:41AM -0700, Jeremy Fitzhardinge wrote:
 On 10/29/2012 07:45 AM, Glauber Costa wrote:
 On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
 Allow a guest to register a second location for the VCPU time info

 structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
 This is intended to allow the guest kernel to map this information
 into a usermode accessible page, so that usermode can efficiently
 calculate system time from the TSC without having to make a syscall.

 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 Can you please be a bit more specific about why we need this? Why does
 the host need to provide us with two pages with the exact same data? Why
  can't we just do it with mapping tricks in the guest?

 In Xen the pvclock structure is embedded within a pile of other stuff
 that shouldn't be mapped into guest memory, so providing for a second
 location allows it to be placed whereever is convenient for the guest.
 That's a restriction of the Xen ABI, but I don't know if it affects KVM.

 J

 It is possible to share the data for KVM in theory, but:

 - It is a small amount of memory. 
 - It requires aligning to page size (the in-kernel percpu array 
 is currently cacheline aligned).
 - It is possible to modify flags separately for userspace/kernelspace,
 if desired.

 This justifies the duplication IMO (code is simple and clean).


 What would be the changes required to remove the duplication?  If it's
 just page alignment, then is seems even smaller.  In addition we avoid
 expanding the ABI again.
 
 This would require changing the kernel copy from percpu data, which
 there is no guarantee is linear (necessary for fixmap mapping), to
 dynamically allocated (which in turn can be tricky due to early boot
 clock requirement).
 
 Hum, no thanks.
 
You allocate it using bootmemory for vsyscall anyway. If they are
strictly in the same physical location, you are not allocating anything
extra.


Re: [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR

2012-11-02 Thread Glauber Costa
On 11/02/2012 01:39 AM, Marcelo Tosatti wrote:
 On Thu, Nov 01, 2012 at 06:28:31PM +0400, Glauber Costa wrote:
 On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
 Allow a guest to register a second location for the VCPU time info

 structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
 This is intended to allow the guest kernel to map this information
 into a usermode accessible page, so that usermode can efficiently
 calculate system time from the TSC without having to make a syscall.

 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com


 Changelog doesn't make a lot of sense. (especially from the first line to the
 second). Please add in here the reasons why we can't (or decided not to)
 use the same page. The info in the last mail thread is good enough, just
 put it here.
 
 Fixed.
 
 Index: vsyscall/arch/x86/include/asm/kvm_para.h
 ===
 --- vsyscall.orig/arch/x86/include/asm/kvm_para.h
 +++ vsyscall/arch/x86/include/asm/kvm_para.h
 @@ -23,6 +23,7 @@
  #define KVM_FEATURE_ASYNC_PF   4
  #define KVM_FEATURE_STEAL_TIME 5
  #define KVM_FEATURE_PV_EOI 6
 +#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7
  
  /* The last 8 bits are used to indicate how to interpret the flags field
   * in pvclock structure. If no bits are set, all flags are ignored.
 @@ -39,6 +40,7 @@
  #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
  #define MSR_KVM_STEAL_TIME  0x4b564d03
  #define MSR_KVM_PV_EOI_EN  0x4b564d04
 +#define MSR_KVM_USERSPACE_TIME  0x4b564d05
  

 I accept that it is possible that we may be better off with the page
 mapped twice.

 But why do we need an extra MSR? When, and why, would you enable the
 kernel-based pvclock, but disable the userspace pvclock?
 
 Because there is no stable TSC available, for example (which cannot
 be used to measure passage of time).
 

What you say is true, but completely unrelated. I am not talking about
situations in which userspace pvclock is available and you end up not
using it.

I am talking about situations in which it is available, you are capable
 of using it, but then decide for some reason to permanently disable it -
as in not setting it up altogether.

It seems to me that if the host has code to deal with userspace pvclock,
and you already coded the guest in a way that you may or may not use it
(dependent on the value of the stable bit), you could very well only
check for the cpuid flag, and do the guest setup if available - skipping
this MSR dance altogether.

Now, of course, there is the problem of communicating the address in
which the guest expects the page to be. Skipping the MSR setup would
require it to be more or less at a fixed location. We could in principle
lay them down together with the already existing pvclock structure. (But
granted, I am not sure it is worth it...)

I think in general, this question deserves a bit more of attention. We
are about to have just the perfect opportunity for this next week, so
let's use it.




Re: [patch 10/16] x86: vdso: pvclock gettime support

2012-11-02 Thread Glauber Costa
On 11/02/2012 04:33 AM, Marcelo Tosatti wrote:
 On Thu, Nov 01, 2012 at 07:42:43PM -0200, Marcelo Tosatti wrote:
 On Thu, Nov 01, 2012 at 06:41:46PM +0400, Glauber Costa wrote:
 On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
 +#ifdef CONFIG_PARAVIRT_CLOCK
 +
 +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
 +{
 +  const aligned_pvti_t *pvti_base;
 +  int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
 +  int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
 +
 +  BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
 +
 +  pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
 +
 +  return &pvti_base[offset].info;
 +}
 +
 Does BUG_ON() really do what you believe it does while in userspace
 context? We're not running with the kernel descriptors, so this will
 probably just kill the process without any explanation

 A coredump is generated, which can be used to trace back to the ud2a
 instruction in the vdso code.
 
 All comments have been addressed. Let me know if there is anything else
 on v3 that you'd like to see done differently.
 
Mainly:

1) stick a v3 string in the subject. You didn't do it for v2, and I
got confused at some points while looking for the correct patches

2) The changelogs are, in general, a bit poor. I've pointed to the ones
specifically that pops out, but I would appreciate if you would go over
them again, making them more informative.

3) Please make sure Peter is okay with the proposed notifier change.

4) Please consider allocating memory with __alloc_bootmem_node instead.




Re: [patch 01/16] KVM: x86: retain pvclock guest stopped bit in guest memory

2012-11-01 Thread Glauber Costa
On 11/01/2012 02:46 AM, Marcelo Tosatti wrote:
 Otherwise it's possible for an unrelated KVM_REQ_UPDATE_CLOCK (such as
 due to CPU migration) to clear the bit.
 
 Noticed by Paolo Bonzini.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 Index: vsyscall/arch/x86/kvm/x86.c
 ===

Reviewed-by: Glauber Costa glom...@parallels.com



Re: [patch 02/16] x86: pvclock: make sure rdtsc doesnt speculate out of region

2012-11-01 Thread Glauber Costa
On 11/01/2012 03:48 PM, Gleb Natapov wrote:
 On Wed, Oct 31, 2012 at 08:46:58PM -0200, Marcelo Tosatti wrote:
 Originally from Jeremy Fitzhardinge.

 pvclock_get_time_values, which contains the memory barriers
 will be removed by next patch.

 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

 Index: vsyscall/arch/x86/kernel/pvclock.c
 ===
 --- vsyscall.orig/arch/x86/kernel/pvclock.c
 +++ vsyscall/arch/x86/kernel/pvclock.c
 @@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct 
  
  do {
  version = pvclock_get_time_values(shadow, src);
 -barrier();
 +rdtsc_barrier();
  offset = pvclock_get_nsec_offset(shadow);
  ret = shadow.system_timestamp + offset;
 -barrier();
 +rdtsc_barrier();
   } while (version != src->version);
  
   if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&

 On a guest without SSE2 rdtsc_barrier() will be nop while rmb() will
 be lock; addl $0,0(%%esp). I doubt pvclock will work correctly either
 way though.
 
 --
   Gleb.
 
Actually it shouldn't matter for KVM, since the page is only updated by
the vcpu, and the guest is never running while it happens. If Jeremy is
fine with this, so should I.



Re: [patch 03/16] x86: pvclock: remove pvclock_shadow_time

2012-11-01 Thread Glauber Costa
On 11/01/2012 02:46 AM, Marcelo Tosatti wrote:
 Originally from Jeremy Fitzhardinge.
 
 We can copy the information directly from struct pvclock_vcpu_time_info, 
 remove pvclock_shadow_time.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Reviewed-by: Glauber Costa glom...@parallels.com



Re: [patch 04/16] x86: pvclock: create helper for pvclock data retrieval

2012-11-01 Thread Glauber Costa
On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
 +static __always_inline
 +unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
 +cycle_t *cycles, u8 *flags)
 +{
 + unsigned version;
 + cycle_t ret, offset;
 + u8 ret_flags;
 +
 + version = src->version;
 + rdtsc_barrier();
 + offset = pvclock_get_nsec_offset(src);
 + ret = src->system_time + offset;
 + ret_flags = src->flags;
 + rdtsc_barrier();
 +
 + *cycles = ret;
 + *flags = ret_flags;
 + return version;
 +}
 +
This interface is a bit weird.
The actual value you are interested in is cycles, so why is it
returned through the parameters? I think it would be clearer to have
this return cycles, and version as a parameter.
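
In other words, something like this (a sketch of the suggested shape,
not code from the series):

static __always_inline cycle_t
__pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
		      unsigned *version, u8 *flags)
{
	cycle_t ret;

	*version = src->version;
	rdtsc_barrier();
	ret = src->system_time + pvclock_get_nsec_offset(src);
	*flags = src->flags;
	rdtsc_barrier();

	return ret;
}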




Re: [patch 05/16] x86: pvclock: introduce helper to read flags

2012-11-01 Thread Glauber Costa
On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 
 Index: vsyscall/arch/x86/kernel/pvclock.c
If you are resending this series, I don't see a reason for this one to
be in a separate patch.

The code itself is fine.



Re: [patch 06/16] sched: add notifier for cross-cpu migrations

2012-11-01 Thread Glauber Costa
On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
 Originally from Jeremy Fitzhardinge.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 

Please collect peterz's ack for this one.

 Index: vsyscall/include/linux/sched.h
 ===
 --- vsyscall.orig/include/linux/sched.h
 +++ vsyscall/include/linux/sched.h
 @@ -107,6 +107,14 @@ extern unsigned long this_cpu_load(void)
  extern void calc_global_load(unsigned long ticks);
  extern void update_cpu_load_nohz(void);
  
 +/* Notifier for when a task gets migrated to a new CPU */
 +struct task_migration_notifier {
 + struct task_struct *task;
 + int from_cpu;
 + int to_cpu;
 +};
 +extern void register_task_migration_notifier(struct notifier_block *n);
 +
  extern unsigned long get_parent_ip(unsigned long addr);
  
  struct seq_file;
 Index: vsyscall/kernel/sched/core.c
 ===
 --- vsyscall.orig/kernel/sched/core.c
 +++ vsyscall/kernel/sched/core.c
 @@ -922,6 +922,13 @@ void check_preempt_curr(struct rq *rq, s
   rq->skip_clock_update = 1;
  }
  
 +static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
 +
 +void register_task_migration_notifier(struct notifier_block *n)
 +{
 + atomic_notifier_chain_register(&task_migration_notifier, n);
 +}
 +
  #ifdef CONFIG_SMP
  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
  {
 @@ -952,8 +959,16 @@ void set_task_cpu(struct task_struct *p,
   trace_sched_migrate_task(p, new_cpu);
  
   if (task_cpu(p) != new_cpu) {
 + struct task_migration_notifier tmn;
 +
   p->se.nr_migrations++;
   perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
 +
 + tmn.task = p;
 + tmn.from_cpu = task_cpu(p);
 + tmn.to_cpu = new_cpu;
 +
 + atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
   }
  
   __set_task_cpu(p, new_cpu);
 
 



Re: [patch 07/16] x86: pvclock: generic pvclock vsyscall initialization

2012-11-01 Thread Glauber Costa
On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
 Originally from Jeremy Fitzhardinge.
 
 Introduce generic, non hypervisor specific, pvclock initialization 
 routines.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 Index: vsyscall/arch/x86/kernel/pvclock.c
 ===
 --- vsyscall.orig/arch/x86/kernel/pvclock.c
 +++ vsyscall/arch/x86/kernel/pvclock.c
 @@ -17,6 +17,10 @@
  
   #include <linux/kernel.h>
   #include <linux/percpu.h>
  +#include <linux/notifier.h>
  +#include <linux/sched.h>
  +#include <linux/gfp.h>
  +#include <linux/bootmem.h>
   #include <asm/pvclock.h>
  
  static u8 valid_flags __read_mostly = 0;
 @@ -122,3 +126,67 @@ void pvclock_read_wallclock(struct pvclo
  
   set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
  }
 +
 +static aligned_pvti_t *pvclock_vdso_info;
Is there a non-aligned pvti?

I would also add that pvti is not exactly the most descriptive name
I've seen.

 +
 +static struct pvclock_vsyscall_time_info *
 +pvclock_get_vsyscall_user_time_info(int cpu)
 +{
 + if (pvclock_vdso_info == NULL) {
 + BUG();
 + return NULL;
 + }
 +
 + return &pvclock_vdso_info[cpu].info;
 +}
 +
if (!pvclock_vdso_info)
BUG();
return pvclock...

 +struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu)
 +{
 + return &pvclock_get_vsyscall_user_time_info(cpu)->pvti;
 +}
 +
 +int pvclock_task_migrate(struct notifier_block *nb, unsigned long l, void *v)
 +{
 + struct task_migration_notifier *mn = v;
 + struct pvclock_vsyscall_time_info *pvti;
 +
 + pvti = pvclock_get_vsyscall_user_time_info(mn->from_cpu);
 +
 + if (pvti == NULL)
 + return NOTIFY_DONE;
 +

When is it NULL? IIUC, this is when the vsyscall is disabled, right?
Would you mind adding comments for it?

Also, this is supposed to be an unlikely branch, right?

 + pvti->migrate_count++;
 +
 + return NOTIFY_DONE;
 +}
 +
 +static struct notifier_block pvclock_migrate = {
 + .notifier_call = pvclock_task_migrate,
 +};
 +
 +/*
 + * Initialize the generic pvclock vsyscall state.  This will allocate
 + * a/some page(s) for the per-vcpu pvclock information, set up a
 + * fixmap mapping for the page(s)
 + */
 +int __init pvclock_init_vsyscall(void)
 +{
 + int idx;
 + unsigned int size = PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE;
 +
 + pvclock_vdso_info = __alloc_bootmem(size, PAGE_SIZE, 0);

This will be the most critical path for reading pvclock's time data. So
why can't we:
1) Make it per-node.
2) Of course, as a consequence, make sure all info structures in the
 same page are in the same node?



 + if (!pvclock_vdso_info)
 + return -ENOMEM;
 +
 + memset(pvclock_vdso_info, 0, size);
 +
 + for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
 + __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
 +  __pa_symbol(pvclock_vdso_info) + (idx*PAGE_SIZE),
 +  PAGE_KERNEL_VVAR);
 + }
 +
 + register_task_migration_notifier(&pvclock_migrate);
 +
 + return 0;
 +}
 Index: vsyscall/arch/x86/include/asm/fixmap.h
 ===
 --- vsyscall.orig/arch/x86/include/asm/fixmap.h
 +++ vsyscall/arch/x86/include/asm/fixmap.h
 @@ -19,6 +19,7 @@
   #include <asm/acpi.h>
   #include <asm/apicdef.h>
   #include <asm/page.h>
  +#include <asm/pvclock.h>
   #ifdef CONFIG_X86_32
   #include <linux/threads.h>
   #include <asm/kmap_types.h>
 @@ -81,6 +82,10 @@ enum fixed_addresses {
   VVAR_PAGE,
   VSYSCALL_HPET,
  #endif
 +#ifdef CONFIG_PARAVIRT_CLOCK
 + PVCLOCK_FIXMAP_BEGIN,
 + PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
 +#endif
   FIX_DBGP_BASE,
   FIX_EARLYCON_MEM_BASE,
  #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
 Index: vsyscall/arch/x86/include/asm/pvclock.h
 ===
 --- vsyscall.orig/arch/x86/include/asm/pvclock.h
 +++ vsyscall/arch/x86/include/asm/pvclock.h
 @@ -13,6 +13,8 @@ void pvclock_read_wallclock(struct pvclo
   struct pvclock_vcpu_time_info *vcpu,
   struct timespec *ts);
  void pvclock_resume(void);
 +int __init pvclock_init_vsyscall(void);
 +struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
  
  /*
   * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
 @@ -85,4 +87,17 @@ unsigned __pvclock_read_cycles(const str
   return version;
  }
  
 +struct pvclock_vsyscall_time_info {
 + struct pvclock_vcpu_time_info pvti;
 + u32 migrate_count;
 +};
 +
 +typedef union {
 + struct pvclock_vsyscall_time_info info;
 + char pad[SMP_CACHE_BYTES];
  +} aligned_pvti_t ____cacheline_aligned;

Please, just align pvclock_vsyscall_time_info. It is 10 thousand times
more descriptive.
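
That is, something like this sketch (assuming the padding exists purely
for cacheline alignment):

	struct pvclock_vsyscall_time_info {
		struct pvclock_vcpu_time_info pvti;
		u32 migrate_count;
	} __attribute__((aligned(SMP_CACHE_BYTES)));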


Re: [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR

2012-11-01 Thread Glauber Costa
On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
 Allow a guest to register a second location for the VCPU time info
 
 structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
 This is intended to allow the guest kernel to map this information
 into a usermode accessible page, so that usermode can efficiently
 calculate system time from the TSC without having to make a syscall.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 

Changelog doesn't make a lot of sense (especially from the first line to
the second). Please add in here the reasons why we can't (or decided not to)
use the same page. The info in the last mail thread is good enough, just
put it here.


 Index: vsyscall/arch/x86/include/asm/kvm_para.h
 ===
 --- vsyscall.orig/arch/x86/include/asm/kvm_para.h
 +++ vsyscall/arch/x86/include/asm/kvm_para.h
 @@ -23,6 +23,7 @@
  #define KVM_FEATURE_ASYNC_PF 4
  #define KVM_FEATURE_STEAL_TIME   5
  #define KVM_FEATURE_PV_EOI   6
 +#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7
  
  /* The last 8 bits are used to indicate how to interpret the flags field
   * in pvclock structure. If no bits are set, all flags are ignored.
 @@ -39,6 +40,7 @@
  #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
  #define MSR_KVM_STEAL_TIME  0x4b564d03
  #define MSR_KVM_PV_EOI_EN  0x4b564d04
 +#define MSR_KVM_USERSPACE_TIME  0x4b564d05
  

I accept that it is possible that we may be better off with the page
mapped twice.

But why do we need an extra MSR? When, and why, would you enable the
kernel-based pvclock, but disable the userspace pvclock?

I believe only the existing MSRs should be used for this. If you write
to it, and the host is capable of exporting userspace pvclock
information, then it does. If it isn't, then it doesn't.

No reason for the extra setup that is only likely to bring more headache.


 Index: vsyscall/arch/x86/kvm/cpuid.c
 ===
 --- vsyscall.orig/arch/x86/kvm/cpuid.c
 +++ vsyscall/arch/x86/kvm/cpuid.c
 @@ -411,7 +411,9 @@ static int do_cpuid_ent(struct kvm_cpuid
(1 << KVM_FEATURE_CLOCKSOURCE2) |
(1 << KVM_FEATURE_ASYNC_PF) |
(1 << KVM_FEATURE_PV_EOI) |
 -  (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
 +  (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
 +  (1 << KVM_FEATURE_USERSPACE_CLOCKSOURCE);
 +
  

The feature bit itself is obviously fine.



Re: [patch 10/16] x86: vdso: pvclock gettime support

2012-11-01 Thread Glauber Costa
On 11/01/2012 02:47 AM, Marcelo Tosatti wrote:
 +#ifdef CONFIG_PARAVIRT_CLOCK
 +
 +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
 +{
 + const aligned_pvti_t *pvti_base;
 + int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
 + int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
 +
  + BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
 +
 + pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
 +
  + return &pvti_base[offset].info;
 +}
 +
Does BUG_ON() really do what you believe it does while in userspace
context? We're not running with the kernel descriptors, so this will
probably just kill the process without any explanation.


Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR

2012-10-30 Thread Glauber Costa
On 10/29/2012 09:44 PM, Jeremy Fitzhardinge wrote:
 On 10/29/2012 07:45 AM, Glauber Costa wrote:
 On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
 Allow a guest to register a second location for the VCPU time info

 structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
 This is intended to allow the guest kernel to map this information
 into a usermode accessible page, so that usermode can efficiently
 calculate system time from the TSC without having to make a syscall.

 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 Can you please be a bit more specific about why we need this? Why does
 the host need to provide us with two pages with the exact same data? Why
 can't we just do it with mapping tricks in the guest?
 
 In Xen the pvclock structure is embedded within a pile of other stuff
 that shouldn't be mapped into guest memory, so providing for a second
 location allows it to be placed wherever is convenient for the guest.
 That's a restriction of the Xen ABI, but I don't know if it affects KVM.
 
 J
 
In kvm the exported data seems to be exactly the same. So it makes sense
to have a single page exported to the guest. The guest may have a
facility that maps the vsyscall clock and the normal clock to a specific
location given a page - that can be either a different page or the same
page.

Any reason why it wouldn't work?



Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR

2012-10-30 Thread Glauber Costa
On 10/29/2012 10:40 PM, Marcelo Tosatti wrote:
 On Mon, Oct 29, 2012 at 10:44:41AM -0700, Jeremy Fitzhardinge wrote:
 On 10/29/2012 07:45 AM, Glauber Costa wrote:
 On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
 Allow a guest to register a second location for the VCPU time info

 structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
 This is intended to allow the guest kernel to map this information
 into a usermode accessible page, so that usermode can efficiently
 calculate system time from the TSC without having to make a syscall.

 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 Can you please be a bit more specific about why we need this? Why does
 the host need to provide us with two pages with the exact same data? Why
 can't we just do it with mapping tricks in the guest?

 In Xen the pvclock structure is embedded within a pile of other stuff
 that shouldn't be mapped into guest memory, so providing for a second
 location allows it to be placed wherever is convenient for the guest.
 That's a restriction of the Xen ABI, but I don't know if it affects KVM.

 J
 
 It is possible to share the data for KVM in theory, but:
 
 - It is a small amount of memory. 
 - It requires aligning to page size (the in-kernel percpu array 
 is currently cacheline aligned).
 - It is possible to modify flags separately for userspace/kernelspace,
 if desired.
 
 This justifies the duplication IMO (code is simple and clean).
 
Duplicating is indeed not the end of the world. But one note:

* If it is page-size aligned, it is automatically cacheline aligned.
Since we have to export the user page *anyway*, this is a non-issue.

That said, duplicating instead of integrating, which is technically
possible, is a design decision, and it needs to be documented somewhere
aside from this mail thread.





Re: [patch 11/18] x86: vsyscall: pass mode to gettime backend

2012-10-30 Thread Glauber Costa
On 10/29/2012 10:41 PM, Marcelo Tosatti wrote:
 On Mon, Oct 29, 2012 at 06:47:57PM +0400, Glauber Costa wrote:
 On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
 Required by next patch.

 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 I don't see where.
 
  +   if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
 +   *mode = VCLOCK_NONE;
 
 

I see. But I end up thinking that this is a case of oversimplification
that gets things more complicated =)

If this would be folded in the next patch, this would be obvious.



Re: [patch 12/18] x86: vdso: pvclock gettime support

2012-10-30 Thread Glauber Costa
On 10/29/2012 10:42 PM, Marcelo Tosatti wrote:
 On Mon, Oct 29, 2012 at 06:59:35PM +0400, Glauber Costa wrote:
 On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
 Improve performance of time system calls when using Linux pvclock, 
 by reading time info from fixmap visible copy of pvclock data.

 Originally from Jeremy Fitzhardinge.

 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

 Index: vsyscall/arch/x86/vdso/vclock_gettime.c
 ===
 --- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
 +++ vsyscall/arch/x86/vdso/vclock_gettime.c
 @@ -22,6 +22,7 @@
   #include <asm/hpet.h>
   #include <asm/unistd.h>
   #include <asm/io.h>
  +#include <asm/pvclock.h>
   
   #define gtod (&VVAR(vsyscall_gtod_data))
  
 @@ -62,6 +63,69 @@ static notrace cycle_t vread_hpet(void)
 return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
  }
  
 +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
 +
 +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
 +{
 +   const aligned_pvti_t *pvti_base;
 +   int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
 +   int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
 +
  + BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
 +
 +   pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
 +
  + return &pvti_base[offset].info;
 +}
 +

 Unless I am missing something, if gcc decides to not inline get_pvti,
 this will break, right? I believe you need to mark that function with
 __always_inline.
 
 Can't see why. Please enlighten me.
 

I may be wrong, I haven't dealt with this vdso code for quite a while - so
forgive me if my memory tricked me.

But wasn't it the case that vdso functions could not call functions in
the kernel address space outside the mapped page? Or does this
restriction only apply to accessing data?
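
For reference, the marking under discussion would look roughly like this
(same body as the patch, only with the inlining forced):

	static __always_inline notrace
	const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
	{
		const aligned_pvti_t *pvti_base;
		int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
		int offset = cpu % (PAGE_SIZE/PVTI_SIZE);

		pvti_base = (aligned_pvti_t *)
			__fix_to_virt(PVCLOCK_FIXMAP_BEGIN + idx);
		return &pvti_base[offset].info;
	}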



Re: [patch 13/18] KVM: x86: pass host_tsc to read_l1_tsc

2012-10-30 Thread Glauber Costa
On 10/29/2012 10:45 PM, Marcelo Tosatti wrote:
 On Mon, Oct 29, 2012 at 07:04:59PM +0400, Glauber Costa wrote:
 On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
 Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc().

 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

 Would you mind explaining why?

 it seems to me that rdtscll() here would be perfectly safe: the only
 case in which it wouldn't be is in a nested-vm environment running
 paravirt-linux with a paravirt tsc. In this case, it is quite likely
 that we'll want rdtscll *anyway*, instead of going to tsc directly.
 
 Its something different (from a future patch):
 
 KVM added a global variable to guarantee monotonicity in the guest.
 One of the reasons for that is that the time between
 
 1. ktime_get_ts(timespec);
 2. rdtscll(tsc);
 
 Is variable. That is, given a host with stable TSC, suppose that
 two VCPUs read the same time via ktime_get_ts() above.
 
 The time required to execute 2. is not the same on those two instances
 executing in different VCPUS (cache misses, interrupts...).
 
 
 Think step 1. returning the same value on both vcpus (to read the
 explanation above).
 

This still doesn't go to the core of the question. You are replacing
rdtscll with native_read_tsc(). They are equivalent most of the time for
the host (except in the case I outlined). So the logic you exposed, is
valid for both. But the real problem is another one:

Although I am not very skilled in C, I rock in the Logo programming
language (http://en.wikipedia.org/wiki/Logo_(programming_language) ),
and with that knowledge, I can understand your C code - with a bit of
effort. After reading it, it becomes very clear that what you do is
Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc,
so your changelog doesn't really add any data. It would be great if it
explained us, readers, why instead of what!




Re: [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag

2012-10-30 Thread Glauber Costa
On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
 KVM added a global variable to guarantee monotonicity in the guest. 
 One of the reasons for that is that the time between
 
   1. ktime_get_ts(timespec);
   2. rdtscll(tsc);
 
 Is variable. That is, given a host with stable TSC, suppose that
 two VCPUs read the same time via ktime_get_ts() above.
 
 The time required to execute 2. is not the same on those two instances 
 executing in different VCPUS (cache misses, interrupts...).
 
 If the TSC value that is used by the host to interpolate when 
 calculating the monotonic time is the same value used to calculate
 the tsc_timestamp value stored in the pvclock data structure, and
 a single system_timestamp, tsc_timestamp tuple is visible to all 
 vcpus simultaneously, this problem disappears. See comment on top
 of pvclock_update_vm_gtod_copy for details.
 
 Monotonicity is then guaranteed by synchronicity of the host TSCs
 and guest TSCs. 
 
 Set TSC stable pvclock flag in that case, allowing the guest to read
 clock from userspace.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
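
For reference, the guest-side interpolation this reasoning rests on is
roughly the following (a sketch of the pvclock math, not code from this
series):

	/* guest time = host timestamp + scaled TSC delta */
	delta  = native_read_tsc() - pvti->tsc_timestamp;
	now_ns = pvti->system_time +
		 pvclock_scale_delta(delta, pvti->tsc_to_system_mul,
				     pvti->tsc_shift);

With one (system_timestamp, tsc_timestamp) tuple visible to all vcpus
and synchronized TSCs, no two vcpus can compute values that move
backwards relative to each other.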

If you are using a master copy, with a stable host-side tsc, you can get
rid of the normal REQ_CLOCK updates during vcpu load.


Re: [patch 08/18] x86: pvclock: generic pvclock vsyscall initialization

2012-10-29 Thread Glauber Costa
On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
 Index: vsyscall/arch/x86/Kconfig
 ===
 --- vsyscall.orig/arch/x86/Kconfig
 +++ vsyscall/arch/x86/Kconfig
 @@ -632,6 +632,13 @@ config PARAVIRT_SPINLOCKS
  
  config PARAVIRT_CLOCK
   bool
 +config PARAVIRT_CLOCK_VSYSCALL
 + bool "Paravirt clock vsyscall support"
 + depends on PARAVIRT_CLOCK && GENERIC_TIME_VSYSCALL
 + ---help---
 +   Enable performance critical clock related system calls to
 +   be executed in userspace, provided that the hypervisor
 +   supports it.
  
  endif

Besides debugging, what is the point in having this as an
extra-selectable? Is there any case in which a virtual machine has code
for this, but may decide to run without it?

I believe all this code in vsyscall should be wrapped in PARAVIRT_CLOCK
only.


Re: [patch 08/18] x86: pvclock: generic pvclock vsyscall initialization

2012-10-29 Thread Glauber Costa
On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
 + */
 +int __init pvclock_init_vsyscall(void)
 +{
 + int idx;
 + unsigned int size = PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE;
 +
 + pvclock_vdso_info = __alloc_bootmem(size, PAGE_SIZE, 0);
 + if (!pvclock_vdso_info)
 + return -ENOMEM;
 +
 + memset(pvclock_vdso_info, 0, size);
 +
 + for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
 + __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
 +  __pa_symbol(pvclock_vdso_info) + (idx*PAGE_SIZE),
 +  PAGE_KERNEL_VVAR);


BTW, Previous line is whitespace damaged.



Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR

2012-10-29 Thread Glauber Costa
On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
 Allow a guest to register a second location for the VCPU time info
 
 structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW).
 This is intended to allow the guest kernel to map this information
 into a usermode accessible page, so that usermode can efficiently
 calculate system time from the TSC without having to make a syscall.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Can you please be a bit more specific about why we need this? Why does
the host need to provide us with two pages with the exact same data? Why
can't we just do it with mapping tricks in the guest?




Re: [patch 11/18] x86: vsyscall: pass mode to gettime backend

2012-10-29 Thread Glauber Costa
On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
 Required by next patch.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
I don't see where.



Re: [patch 12/18] x86: vdso: pvclock gettime support

2012-10-29 Thread Glauber Costa
On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
 Improve performance of time system calls when using Linux pvclock, 
 by reading time info from fixmap visible copy of pvclock data.
 
 Originally from Jeremy Fitzhardinge.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 Index: vsyscall/arch/x86/vdso/vclock_gettime.c
 ===
 --- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
 +++ vsyscall/arch/x86/vdso/vclock_gettime.c
 @@ -22,6 +22,7 @@
   #include <asm/hpet.h>
   #include <asm/unistd.h>
   #include <asm/io.h>
  +#include <asm/pvclock.h>
   
   #define gtod (&VVAR(vsyscall_gtod_data))
  
 @@ -62,6 +63,69 @@ static notrace cycle_t vread_hpet(void)
   return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
  }
  
 +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL
 +
 +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
 +{
 + const aligned_pvti_t *pvti_base;
 + int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
 + int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
 +
  + BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
 +
 + pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
 +
  + return &pvti_base[offset].info;
 +}
 +

Unless I am missing something, if gcc decides to not inline get_pvti,
this will break, right? I believe you need to mark that function with
__always_inline.

 +static notrace cycle_t vread_pvclock(int *mode)
 +{
 + const struct pvclock_vsyscall_time_info *pvti;
 + cycle_t ret;
 + u64 last;
 + u32 version;
 + u32 migrate_count;
 + u8 flags;
 + unsigned cpu, cpu1;
 +
 +
 + /*
 +  * When looping to get a consistent (time-info, tsc) pair, we
 +  * also need to deal with the possibility we can switch vcpus,
 +  * so make sure we always re-fetch time-info for the current vcpu.
 +  */
 + do {
  + cpu = __getcpu() & 0xfff;

Please wrap this 0xfff into something meaningful.
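
Something along these lines, say (the name is made up here; in the
vgetcpu encoding the low 12 bits of the result are the cpu number, the
upper bits the node):

	#define VGETCPU_CPU_MASK 0xfff

	cpu = __getcpu() & VGETCPU_CPU_MASK;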

 + pvti = get_pvti(cpu);
 +
  + migrate_count = pvti->migrate_count;
 +
  + version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
 +
 + /*
 +  * Test we're still on the cpu as well as the version.
 +  * We could have been migrated just after the first
 +  * vgetcpu but before fetching the version, so we
 +  * wouldn't notice a version change.
 +  */
  + cpu1 = __getcpu() & 0xfff;
  + } while (unlikely(cpu != cpu1 ||
  +   (pvti->pvti.version & 1) ||
  +   pvti->pvti.version != version ||
  +   pvti->migrate_count != migrate_count));
 +
  + if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
 + *mode = VCLOCK_NONE;
 +
 + last = VVAR(vsyscall_gtod_data).clock.cycle_last;
 +
  + if (likely(ret >= last))
 + return ret;
 +

Please add a comment here referring to tsc.c, where an explanation of
this test lives. This is quite non-obvious for the uninitiated.
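
Something like this, perhaps (the wording is mine; the reasoning lives
next to vread_tsc() in arch/x86/kernel/tsc.c):

	/*
	 * On some hardware a TSC read can appear to land slightly
	 * before the last clocksource update (see vread_tsc() in
	 * arch/x86/kernel/tsc.c); clamping to cycle_last keeps the
	 * returned value monotonic.
	 */
	if (likely(ret >= last))
		return ret;

	return last;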



Re: [patch 13/18] KVM: x86: pass host_tsc to read_l1_tsc

2012-10-29 Thread Glauber Costa
On 10/24/2012 05:13 PM, Marcelo Tosatti wrote:
 Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc().
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Would you mind explaining why?

it seems to me that rdtscll() here would be perfectly safe: the only
case in which it wouldn't be is in a nested-vm environment running
paravirt-linux with a paravirt tsc. In this case, it is quite likely
that we'll want rdtscll *anyway*, instead of going to tsc directly.




Re: [PATCH RFC V2 0/5] Separate consigned (expected steal) from steal time.

2012-10-17 Thread Glauber Costa
On 10/17/2012 06:23 AM, Michael Wolf wrote:
 In the case of where you have a system that is running in a
 capped or overcommitted environment the user may see steal time
 being reported in accounting tools such as top or vmstat.  This can
 cause confusion for the end user.  To ease the confusion this patch set
 adds the idea of consigned (expected steal) time.  The host will separate
 the consigned time from the steal time.  The consignment limit passed to the
 host will be the amount of steal time expected within a fixed period of
 time.  Any other steal time accruing during that period will show as the
 traditional steal time.
 
 TODO:
 * Change native_clock to take params and not return a value
 * Change update_rq_clock_task
 
 Changes from V1:
 * Removed the steal time allowed percentage from the guest
 * Moved the separation of consigned (expected steal) and steal time to the
   host.
 * No longer include a sysctl interface.
 

You are showing this in the guest somewhere, but tools like top will
still not show it. So for quite a while, it achieves nothing.

Of course this is a barrier that any new statistic has to go through. So
while annoying, this is per-se ultimately not a blocker.

What I still fail to see, is how this is useful information to be shown
in the guest. Honestly, if I'm in a guest VM or container, any time
during which I am not running is time I lost. It doesn't matter if this
was expected or not. This still seems to me as a host-side problem, to
be solved entirely by tooling.



Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

2012-08-27 Thread Glauber Costa
On 08/27/2012 08:50 AM, Michael Wolf wrote:
 On Sat, 2012-08-25 at 19:36 -0400, Glauber Costa wrote:
 On 08/24/2012 11:11 AM, Michael Wolf wrote:
 On Fri, 2012-08-24 at 08:53 +0400, Glauber Costa wrote:
 On 08/24/2012 03:14 AM, Michael Wolf wrote:
 This is an RFC regarding the reporting of steal time.  In the case of
 where you have a system that is running with partial processors such as
 KVM the user may see steal time being reported in accounting tools such
 as top or vmstat.  This can cause confusion for the end user.  To
 ease the confusion this patch set adds a sysctl interface to set the
 cpu entitlement.  This is the percentage of cpu that the guest system is
  expected to receive.  As long as the steal time is within its expected
 range it will show up as 0 in /proc/stat.  The user will then see in the
 accounting tools that they are getting a full utilization of the cpu
 resources assigned to them.


 And how is such a knob not confusing?

 Steal time is pretty well defined in meaning and is shown in top for
 ages. I really don't see the point for this.

 Currently you can see the steal time but you have no way of knowing if
 the cpu utilization you are seeing on the guest is the expected amount.
 I decided on making it a knob because a guest could be migrated to
 another system and its entitlement could change because of hardware or 
 load differences.  It could simply be a /proc file and report the
 current entitlement if needed.   As things are currently implemented I 
 don't see how someone knows if the guest is running as expected or
 whether there is a problem.


 Turning off steal time display won't get even close to displaying the
 information you want. What you probably want is a guest-visible way to
 say how many milliseconds you are expected to run each second. Right?
 
 It is not clear to me how knowing how many milliseconds you are
 expecting to run will help the user.  Currently the users will run top
 to see how well the guest is running.  If they see _any_ steal time some
 users think they are not getting the full use of their processor
 entitlement.


And your plan is just to selectively lie about it, by disabling it with
a knob?

 Maybe I'm missing what you are proposing, but even if you knew the
 milliseconds that you were expecting for your processor you would have
 to adjust the top output in your head so to speak.  You would see the
 utilization and then say 'ok that matches the number of milliseconds I
 expected to run...   If we take away the steal time (as long as it is
 equal to or less than the expected amount of steal time) then the user
 running top will see the 100% utilization.
 



Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

2012-08-27 Thread Glauber Costa
On 08/27/2012 01:19 PM, Michael Wolf wrote:
 On Mon, 2012-08-27 at 11:50 -0700, Glauber Costa wrote:
 On 08/27/2012 08:50 AM, Michael Wolf wrote:
 On Sat, 2012-08-25 at 19:36 -0400, Glauber Costa wrote:
 On 08/24/2012 11:11 AM, Michael Wolf wrote:
 On Fri, 2012-08-24 at 08:53 +0400, Glauber Costa wrote:
 On 08/24/2012 03:14 AM, Michael Wolf wrote:
 This is an RFC regarding the reporting of steal time.  In the case of
 where you have a system that is running with partial processors such as
 KVM the user may see steal time being reported in accounting tools such
 as top or vmstat.  This can cause confusion for the end user.  To
 ease the confusion this patch set adds a sysctl interface to set the
 cpu entitlement.  This is the percentage of cpu that the guest system is
  expected to receive.  As long as the steal time is within its expected
 range it will show up as 0 in /proc/stat.  The user will then see in the
 accounting tools that they are getting a full utilization of the cpu
 resources assigned to them.


 And how is such a knob not confusing?

 Steal time is pretty well defined in meaning and is shown in top for
 ages. I really don't see the point for this.

 Currently you can see the steal time but you have no way of knowing if
 the cpu utilization you are seeing on the guest is the expected amount.
 I decided on making it a knob because a guest could be migrated to
 another system and its entitlement could change because of hardware or 
 load differences.  It could simply be a /proc file and report the
 current entitlement if needed.   As things are currently implemented I 
 don't see how someone knows if the guest is running as expected or
 whether there is a problem.


 Turning off steal time display won't get even close to displaying the
 information you want. What you probably want is a guest-visible way to
 say how many milliseconds you are expected to run each second. Right?

 It is not clear to me how knowing how many milliseconds you are
 expecting to run will help the user.  Currently the users will run top
 to see how well the guest is running.  If they see _any_ steal time some
 users think they are not getting the full use of their processor
 entitlement.


 And your plan is just to selectively lie about it, by disabling it with
 a knob?
 
 It is about making it very obvious to the end user whether they are
 receiving their cpu entitlement.  If there is more steal time than
 expected that will still show up.  I have experimented, and it seems to
 work, to put the raw stealtime at the end of each cpu line
 in /proc/stat.  That way the raw data is there as well.   
 
 Do you have another suggestion to communicate to the user whether they
 are receiving their full entitlement?  At the very least shouldn't the
 entitlement reside in a /proc file somewhere so that the user could look
 up the value and do the math?
 

I personally believe Avi is right here. This is something to be done at
the host side. The user can learn this from any tool he is using to
manage his VMs.

Now if you absolutely must inform him from inside the guest, I would go
with the later, informing him in another location. (I am not saying I
agree with this, just that this is less worse)



Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

2012-08-27 Thread Glauber Costa
On 08/27/2012 02:27 PM, Michael Wolf wrote:
 On Mon, 2012-08-27 at 13:31 -0700, Avi Kivity wrote:
 On 08/27/2012 01:23 PM, Michael Wolf wrote:

 How would a guest know what its entitlement is?



 Currently the Admin/management tool setting up the guests will put it on
 the qemu commandline.  From this it is passed via an ioctl to the host.
 The guest will get the value from the host via a hypercall.

 In the future the host could try and do some of it automatically in some
 cases. 

 Seems to me it's a meaningless value for the guest.  Suppose it is
 migrated to a host that is more powerful, and as a result its relative
 entitlement is reduced.  The value needs to be adjusted.
 
 This is why I chose to manage the value from the sysctl interface rather
 than just have it stored as a value in /proc.  Whatever tool was used to
 migrate the vm could hopefully adjust the sysctl value on the guest.

 This is best taken care of from the host side.
 
 Not sure what you are getting at here.  If you are running in a cloud
 environment, you purchase a VM with the understanding that you are
 getting certain resources.  As this type of user I don't believe you
 have any access to the host to see this type of information.  So the
 user still wouldn't have a way to confirm that they are receiving what
 they should be in the way of processor resources.
 
 Would you please elaborate a little more on this?

What do you mean they have no access to the host?
They have access to all sorts of tools that display information from the
host. Speaking of a view-only resource, those are strictly equivalent.





Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

2012-08-25 Thread Glauber Costa
On 08/24/2012 11:11 AM, Michael Wolf wrote:
 On Fri, 2012-08-24 at 08:53 +0400, Glauber Costa wrote:
 On 08/24/2012 03:14 AM, Michael Wolf wrote:
 This is an RFC regarding the reporting of steal time.  In the case of
 where you have a system that is running with partial processors such as
 KVM the user may see steal time being reported in accounting tools such
 as top or vmstat.  This can cause confusion for the end user.  To
 ease the confusion this patch set adds a sysctl interface to set the
 cpu entitlement.  This is the percentage of cpu that the guest system is
  expected to receive.  As long as the steal time is within its expected
 range it will show up as 0 in /proc/stat.  The user will then see in the
 accounting tools that they are getting a full utilization of the cpu
 resources assigned to them.


 And how is such a knob not confusing?

 Steal time is pretty well defined in meaning and is shown in top for
 ages. I really don't see the point for this.
 
 Currently you can see the steal time but you have no way of knowing if
 the cpu utilization you are seeing on the guest is the expected amount.
 I decided on making it a knob because a guest could be migrated to
 another system and its entitlement could change because of hardware or 
 load differences.  It could simply be a /proc file and report the
 current entitlement if needed.   As things are currently implemented I 
 don't see how someone knows if the guest is running as expected or
 whether there is a problem.
 

Turning off steal time display won't get even close to displaying the
information you want. What you probably want is a guest-visible way to
say how many milliseconds you are expected to run each second. Right?


Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

2012-08-23 Thread Glauber Costa
On 08/24/2012 03:14 AM, Michael Wolf wrote:
 This is an RFC regarding the reporting of steal time.  In the case of
 where you have a system that is running with partial processors such as
 KVM the user may see steal time being reported in accounting tools such
 as top or vmstat.  This can cause confusion for the end user.  To
 ease the confusion this patch set adds a sysctl interface to set the
 cpu entitlement.  This is the percentage of cpu that the guest system is
  expected to receive.  As long as the steal time is within its expected
 range it will show up as 0 in /proc/stat.  The user will then see in the
 accounting tools that they are getting a full utilization of the cpu
 resources assigned to them.
 

And how is such a knob not confusing?

Steal time is pretty well defined in meaning and is shown in top for
ages. I really don't see the point for this.


Re: [PATCH v5 4/9] KVM-HV: KVM Steal time implementation

2011-07-11 Thread Glauber Costa

On 07/11/2011 10:11 AM, Avi Kivity wrote:

On 07/11/2011 04:10 PM, Avi Kivity wrote:


I think Peter acked this, yes? Please include his acks.



Sorry, I meant this for the kernel/sched.c bits.


Yes he did, after I posted the patches.
I will include his acks in the respin, since they won't change the 
scheduler bits.




Re: [PATCH v5 6/9] add jump labels for ia64 paravirt

2011-07-11 Thread Glauber Costa

On 07/11/2011 10:09 AM, Avi Kivity wrote:

On 07/04/2011 06:32 PM, Glauber Costa wrote:

Since in a later patch I intend to call jump labels inside
CONFIG_PARAVIRT, IA64 would fail to compile if they are not
provided. This patch provides those jump labels for the IA64
architecture.



Please get an ack for this from the ia64 maintainer.



The ones listed in paravirt.c are CC'd. I am CC'ing all the other
folks related to IA64 listed in MAINTAINERS now.

Just a heads up so the new folks CC'd can get up to speed:
I am proposing moving the steal time calculation to inside the core 
scheduler. This move will make it easier for us to make decisions based 
on steal time accounting on virtual machines. Ia64 KVM may or may not 
follow this route - nothing that already works should break with this 
change.


This patch is needed only to avoid breaking compilation, since it 
introduces two new variables, which are expected to be present in
paravirt.c when CONFIG_PARAVIRT is set.




Re: [PATCH v5 4/9] KVM-HV: KVM Steal time implementation

2011-07-11 Thread Glauber Costa

On 07/11/2011 09:58 AM, Avi Kivity wrote:

On 07/07/2011 08:07 PM, Glauber Costa wrote:

+static void record_steal_time(struct kvm_vcpu *vcpu)
+{
+ u64 delta;
+
 + if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+ return;
+
 + if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
 + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+ return;


The guest memory page is not pinned, sleeping via
__copy_from_user/to_user is not allowed in vcpu_load context. Either pin
it or use atomic accessors.



I do recognize the problem.
Avi, what's your take here?



The easiest solution is to set a KVM_REQ bit in atomic context, and move
the sleepy code to vcpu_enter_guest().


Or I can move it all inside vcpu_run, or close enough to it. This will 
account more hypervisor time as steal time, but it seemed to be what 
some people wanted in the first place.


Given the simplification we would win - not needing a REQ set, it might 
be worth it.
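
For reference, the REQ-based variant would look something like this (the
request name here is invented for illustration):

	/* in atomic context, just flag the work: */
	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);

	/* later, in vcpu_enter_guest(), where sleeping is fine: */
	if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
		record_steal_time(vcpu);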



+ case MSR_KVM_STEAL_TIME:
 + vcpu->arch.st.msr_val = data;
+
 + if (!(data & KVM_MSR_ENABLED)) {
+ break;
+ }


On failure below this point, msr_val should be cleared of
KVM_MSR_ENABLED?

No, msr_val has to hold whatever the guest wrote into it.
We should probably use an independent variable here to indicate that
we failed to activate it.


If we fail, we return a #GP to the guest (and don't write any value into
the msr).





[PATCH v6 2/9] KVM-HDR Add constant to represent KVM MSRs enabled bit

2011-07-11 Thread Glauber Costa
This patch is simple, put in a different commit so it can be more easily
shared between guest and hypervisor. It just defines a named constant
to indicate the enable bit for KVM-specific MSRs.

Signed-off-by: Glauber Costa glom...@redhat.com
Acked-by: Rik van Riel r...@redhat.com
Tested-by: Eric B Munson emun...@mgebm.net
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
---
 arch/x86/include/asm/kvm_para.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index a427bf7..d6cd79b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -30,6 +30,7 @@
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
 
+#define KVM_MSR_ENABLED 1
 /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
-- 
1.7.3.4



[PATCH v6 7/9] KVM-GST: KVM Steal time accounting

2011-07-11 Thread Glauber Costa
This patch accounts steal time in account_process_tick.
If one or more ticks are considered stolen in the current
accounting cycle, user/system accounting is skipped. Idle is fine,
since the hypervisor does not report steal time if the guest
is halted.

Accounting steal time from the core scheduler gives us the
advantage of direct access to the runqueue data. In a later
opportunity, it can be used to tweak cpu power and make
the scheduler aware of the time it lost.
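
A worked example of the tick conversion (HZ = 1000, so TICK_NSEC is
roughly 1,000,000 ns; the numbers are mine, not from the patch):

	u64 steal = 2500000;		/* 2.5 ms stolen since last check */
	u64 st = steal / TICK_NSEC;	/* 2 full ticks accounted */

	/* prev_steal_time += st * TICK_NSEC keeps the 0.5 ms remainder
	 * pending, so it is accounted on a later tick, not lost. */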

Signed-off-by: Glauber Costa glom...@redhat.com
Acked-by: Rik van Riel r...@redhat.com
Acked-by: Peter Zijlstra pet...@infradead.org
Tested-by: Eric B Munson emun...@mgebm.net
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
---
 kernel/sched.c |   41 +
 1 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3f2e502..aa6c030 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -75,6 +75,7 @@
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
 #include <asm/mutex.h>
+#include <asm/paravirt.h>
 
 #include "sched_cpupri.h"
 #include "workqueue_sched.h"
@@ -528,6 +529,9 @@ struct rq {
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
u64 prev_irq_time;
 #endif
+#ifdef CONFIG_PARAVIRT
+   u64 prev_steal_time;
+#endif
 
/* calc_load related fields */
unsigned long calc_load_update;
@@ -1953,6 +1957,18 @@ void account_system_vtime(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(account_system_vtime);
 
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
+#ifdef CONFIG_PARAVIRT
+static inline u64 steal_ticks(u64 steal)
+{
+   if (unlikely(steal > NSEC_PER_SEC))
+   return div_u64(steal, TICK_NSEC);
+
+   return __iter_div_u64_rem(steal, TICK_NSEC, &steal);
+}
+#endif
+
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
s64 irq_delta;
@@ -3845,6 +3861,25 @@ void account_idle_time(cputime_t cputime)
	cpustat->idle = cputime64_add(cpustat->idle, cputime64);
 }
 
+static __always_inline bool steal_account_process_tick(void)
+{
+#ifdef CONFIG_PARAVIRT
+   if (static_branch(&paravirt_steal_enabled)) {
+   u64 steal, st = 0;
+
+   steal = paravirt_steal_clock(smp_processor_id());
+   steal -= this_rq()->prev_steal_time;
+
+   st = steal_ticks(steal);
+   this_rq()->prev_steal_time += st * TICK_NSEC;
+
+   account_steal_time(st);
+   return st;
+   }
+#endif
+   return false;
+}
+
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -3876,6 +3911,9 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
	cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 
+   if (steal_account_process_tick())
+   return;
+
if (irqtime_account_hi_update()) {
		cpustat->irq = cputime64_add(cpustat->irq, tmp);
} else if (irqtime_account_si_update()) {
@@ -3929,6 +3967,9 @@ void account_process_tick(struct task_struct *p, int user_tick)
return;
}
 
+   if (steal_account_process_tick())
+   return;
+
if (user_tick)
account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
else if ((p != rq-idle) || (irq_count() != HARDIRQ_OFFSET))
-- 
1.7.3.4



[PATCH v6 4/9] KVM-HV: KVM Steal time implementation

2011-07-11 Thread Glauber Costa
To implement steal time, we need the hypervisor to pass the guest
information about how much time was spent running other processes
outside the VM, while the vcpu had meaningful work to do - halt
time does not count.

This information is acquired through the run_delay field of
delayacct/schedstats infrastructure, that counts time spent in a
runqueue but not running.

Steal time is per-cpu information, so the traditional MSR-based
infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the
memory area address containing information about steal time.

This patch contains the hypervisor part of the steal time infrastructure,
and can be backported independently of the guest portion.

Signed-off-by: Glauber Costa glom...@redhat.com
Tested-by: Eric B Munson emun...@mgebm.net
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
---
 arch/x86/include/asm/kvm_host.h |9 +
 arch/x86/include/asm/kvm_para.h |4 ++
 arch/x86/kvm/Kconfig|1 +
 arch/x86/kvm/x86.c  |   74 +-
 include/linux/kvm_host.h|1 +
 5 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da6bbee..59086a7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -389,6 +389,15 @@ struct kvm_vcpu_arch {
unsigned int hw_tsc_khz;
unsigned int time_offset;
struct page *time_page;
+
+   struct {
+   u64 msr_val;
+   u64 last_steal;
+   u64 accum_steal;
+   struct gfn_to_hva_cache stime;
+   struct kvm_steal_time steal;
+   } st;
+
u64 last_guest_tsc;
u64 last_kernel_ns;
u64 last_tsc_nsec;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 65f8bb9..c484ba8 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -45,6 +45,10 @@ struct kvm_steal_time {
__u32 pad[12];
 };
 
+#define KVM_STEAL_ALIGNMENT_BITS 5
+#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
+#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
+
 #define KVM_MAX_MMU_OP_BATCH   32
 
 #define KVM_ASYNC_PF_ENABLED   (1 << 0)
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 50f6364..99c3f05 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -31,6 +31,7 @@ config KVM
select KVM_ASYNC_PF
select USER_RETURN_NOTIFIER
select KVM_MMIO
+   select TASK_DELAY_ACCT
---help---
  Support hosting fully virtualized guest machines using hardware
  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7167717..6282f6c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -808,12 +808,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN	8
+#define KVM_SAVE_MSRS_BEGIN	9
 static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
-   HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
+   HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1491,6 +1491,35 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
}
 }
 
+static void accumulate_steal_time(struct kvm_vcpu *vcpu)
+{
+   u64 delta;
+
+   if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+   return;
+
+   delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
+   vcpu->arch.st.last_steal = current->sched_info.run_delay;
+   vcpu->arch.st.accum_steal = delta;
+}
+
+static void record_steal_time(struct kvm_vcpu *vcpu)
+{
+   if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+   return;
+
+   if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+   return;
+
+   vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal;
+   vcpu->arch.st.steal.version += 2;
+   vcpu->arch.st.accum_steal = 0;
+
+   kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
switch (msr) {
@@ -1573,6 +1602,33 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
if (kvm_pv_enable_async_pf(vcpu, data))
return 1;
break

[PATCH v6 9/9] KVM-GST: KVM Steal time registration

2011-07-11 Thread Glauber Costa
This patch implements the kvm bits of the steal time infrastructure.
The most important part of it is the steal time clock. It is a
continuous clock that shows the accumulated amount of steal time
since vcpu creation. It is supposed to survive cpu offlining/onlining.

Signed-off-by: Glauber Costa glom...@redhat.com
Acked-by: Rik van Riel r...@redhat.com
Tested-by: Eric B Munson emun...@mgebm.net
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
---
 Documentation/kernel-parameters.txt |4 ++
 arch/x86/include/asm/kvm_para.h |1 +
 arch/x86/kernel/kvm.c   |   73 +++
 arch/x86/kernel/kvmclock.c  |2 +
 4 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index fd248a31..a722574 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1737,6 +1737,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
no-kvmapf   [X86,KVM] Disable paravirtualized asynchronous page
fault handling.
 
+   no-steal-acc    [X86,KVM] Disable paravirtualized steal time accounting.
+   steal time is computed, but won't influence scheduler
+   behaviour
+
nolapic [X86-32,APIC] Do not enable or use the local APIC.
 
nolapic_timer   [X86-32,APIC] Do not use the local APIC timer.
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index c484ba8..35d732d 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -94,6 +94,7 @@ struct kvm_vcpu_pv_apf_data {
 
 extern void kvmclock_init(void);
 extern int kvm_register_clock(char *txt);
+extern void kvm_disable_steal_time(void);
 
 
 /* This instruction is vmcall.  On non-VT architectures, it will generate a
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 33c07b0..58331c2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -51,6 +51,15 @@ static int parse_no_kvmapf(char *arg)
 
 early_param("no-kvmapf", parse_no_kvmapf);
 
+static int steal_acc = 1;
+static int parse_no_stealacc(char *arg)
+{
+steal_acc = 0;
+return 0;
+}
+
+early_param("no-steal-acc", parse_no_stealacc);
+
 struct kvm_para_state {
u8 mmu_queue[MMU_QUEUE_SIZE];
int mmu_queue_len;
@@ -58,6 +67,8 @@ struct kvm_para_state {
 
 static DEFINE_PER_CPU(struct kvm_para_state, para_state);
 static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
+static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
+static int has_steal_clock = 0;
 
 static struct kvm_para_state *kvm_para_state(void)
 {
@@ -441,6 +452,21 @@ static void __init paravirt_ops_setup(void)
 #endif
 }
 
+static void kvm_register_steal_time(void)
+{
+   int cpu = smp_processor_id();
+   struct kvm_steal_time *st = &per_cpu(steal_time, cpu);
+
+   if (!has_steal_clock)
+   return;
+
+   memset(st, 0, sizeof(*st));
+
+   wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
+   printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
+   cpu, __pa(st));
+}
+
 void __cpuinit kvm_guest_cpu_init(void)
 {
if (!kvm_para_available())
@@ -457,6 +483,9 @@ void __cpuinit kvm_guest_cpu_init(void)
	printk(KERN_INFO "KVM setup async PF for cpu %d\n",
   smp_processor_id());
}
+
+   if (has_steal_clock)
+   kvm_register_steal_time();
 }
 
 static void kvm_pv_disable_apf(void *unused)
@@ -483,6 +512,31 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
+static u64 kvm_steal_clock(int cpu)
+{
+   u64 steal;
+   struct kvm_steal_time *src;
+   int version;
+
+   src = &per_cpu(steal_time, cpu);
+   do {
+   version = src->version;
+   rmb();
+   steal = src->steal;
+   rmb();
+   } while ((version & 1) || (version != src->version));
+
+   return steal;
+}
+
+void kvm_disable_steal_time(void)
+{
+   if (!has_steal_clock)
+   return;
+
+   wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -500,6 +554,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
 
 static void kvm_guest_cpu_offline(void *dummy)
 {
+   kvm_disable_steal_time();
kvm_pv_disable_apf(NULL);
apf_task_wake_all();
 }
@@ -548,6 +603,11 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
x86_init.irqs.trap_init = kvm_apf_trap_init;
 
+   if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
+   has_steal_clock = 1

[PATCH v6 5/9] KVM-GST: Add a pv_ops stub for steal time

2011-07-11 Thread Glauber Costa
This patch adds a function pointer in one of the many paravirt_ops
structs, to allow guests to register a steal time function. Besides
a steal time function, we also declare two jump_labels. They will be
used to allow the steal time code to be easily bypassed when not
in use.
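
For context, the consumer side (added later in the series) ends up
looking roughly like this:

	/* the branch is patched to a no-op until a hypervisor flips
	 * the key, so bare metal pays nothing for steal accounting */
	if (static_branch(&paravirt_steal_enabled)) {
		u64 steal = paravirt_steal_clock(smp_processor_id());
		/* ... fold it into the time accounting ... */
	}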

Signed-off-by: Glauber Costa glom...@redhat.com
Acked-by: Rik van Riel r...@redhat.com
Tested-by: Eric B Munson emun...@mgebm.net
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
---
 arch/x86/include/asm/paravirt.h   |9 +
 arch/x86/include/asm/paravirt_types.h |1 +
 arch/x86/kernel/paravirt.c|9 +
 3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index ebbc4d8..a7d2db9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -230,6 +230,15 @@ static inline unsigned long long paravirt_sched_clock(void)
return PVOP_CALL0(unsigned long long, pv_time_ops.sched_clock);
 }
 
+struct jump_label_key;
+extern struct jump_label_key paravirt_steal_enabled;
+extern struct jump_label_key paravirt_steal_rq_enabled;
+
+static inline u64 paravirt_steal_clock(int cpu)
+{
+   return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+}
+
 static inline unsigned long long paravirt_read_pmc(int counter)
 {
return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8288509..2c76521 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -89,6 +89,7 @@ struct pv_lazy_ops {
 
 struct pv_time_ops {
unsigned long long (*sched_clock)(void);
+   unsigned long long (*steal_clock)(int cpu);
unsigned long (*get_tsc_khz)(void);
 };
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 869e1ae..613a793 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -202,6 +202,14 @@ static void native_flush_tlb_single(unsigned long addr)
__native_flush_tlb_single(addr);
 }
 
+struct jump_label_key paravirt_steal_enabled;
+struct jump_label_key paravirt_steal_rq_enabled;
+
+static u64 native_steal_clock(int cpu)
+{
+   return 0;
+}
+
 /* These are in entry.S */
 extern void native_iret(void);
 extern void native_irq_enable_sysexit(void);
@@ -307,6 +315,7 @@ struct pv_init_ops pv_init_ops = {
 
 struct pv_time_ops pv_time_ops = {
.sched_clock = native_sched_clock,
+   .steal_clock = native_steal_clock,
 };
 
 struct pv_irq_ops pv_irq_ops = {
-- 
1.7.3.4



[PATCH v6 8/9] KVM-GST: adjust scheduler cpu power

2011-07-11 Thread Glauber Costa
This patch makes update_rq_clock() aware of steal time.
The mechanism of operation is not different from irq_time,
and follows the same principles. This lives in a CONFIG
option itself, and can be compiled out independently of
the rest of steal time reporting. The effect of disabling it
is that the scheduler will still report steal time (that cannot be
disabled), but won't use this information for cpu power adjustments.

Every time update_rq_clock_task() is invoked, we query information
about how much time was stolen since last call, and feed it into
sched_rt_avg_update().

Although steal time reporting in account_process_tick() keeps
track of the last time we read the steal clock, in prev_steal_time,
this patch does it independently, using another field,
prev_steal_time_rq. This is because otherwise, information about time
accounted in account_process_tick() would never reach us in update_rq_clock().

Signed-off-by: Glauber Costa glom...@redhat.com
Acked-by: Rik van Riel r...@redhat.com
Acked-by: Peter Zijlstra pet...@infradead.org
Tested-by: Eric B Munson emun...@mgebm.net
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
---
 arch/x86/Kconfig|   12 
 kernel/sched.c  |   47 +--
 kernel/sched_features.h |4 ++--
 3 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index da34972..b26f312 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -512,6 +512,18 @@ menuconfig PARAVIRT_GUEST
 
 if PARAVIRT_GUEST
 
+config PARAVIRT_TIME_ACCOUNTING
+   bool "Paravirtual steal time accounting"
+   select PARAVIRT
+   default n
+   ---help---
+ Select this option to enable fine granularity task steal time 
+ accounting. Time spent executing other tasks in parallel with
+ the current vCPU is discounted from the vCPU power. To account for
+ that, there can be a small performance impact.
+
+ If in doubt, say N here.
+
 source arch/x86/xen/Kconfig
 
 config KVM_CLOCK
diff --git a/kernel/sched.c b/kernel/sched.c
index aa6c030..8d57196 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -532,6 +532,9 @@ struct rq {
 #ifdef CONFIG_PARAVIRT
u64 prev_steal_time;
 #endif
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+   u64 prev_steal_time_rq;
+#endif
 
/* calc_load related fields */
unsigned long calc_load_update;
@@ -1971,8 +1974,14 @@ static inline u64 steal_ticks(u64 steal)
 
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
-   s64 irq_delta;
-
+/*
+ * In theory, the compiler should just see 0 here, and optimize out the call
+ * to sched_rt_avg_update. But I don't trust it...
+ */
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+   s64 steal = 0, irq_delta = 0;
+#endif
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
 
/*
@@ -1995,12 +2004,35 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 
	rq->prev_irq_time += irq_delta;
	delta -= irq_delta;
+#endif
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+   if (static_branch(&paravirt_steal_rq_enabled)) {
+   u64 st;
+
+   steal = paravirt_steal_clock(cpu_of(rq));
+   steal -= rq->prev_steal_time_rq;
+
+   if (unlikely(steal > delta))
+   steal = delta;
+
+   st = steal_ticks(steal);
+   steal = st * TICK_NSEC;
+
+   rq->prev_steal_time_rq += steal;
+
+   delta -= steal;
+   }
+#endif
+
	rq->clock_task += delta;
 
-   if (irq_delta && sched_feat(NONIRQ_POWER))
-   sched_rt_avg_update(rq, irq_delta);
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+   if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
+   sched_rt_avg_update(rq, irq_delta + steal);
+#endif
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
 static int irqtime_account_hi_update(void)
 {
	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
@@ -2035,12 +2067,7 @@ static int irqtime_account_si_update(void)
 
 #define sched_clock_irqtime(0)
 
-static void update_rq_clock_task(struct rq *rq, s64 delta)
-{
-   rq->clock_task += delta;
-}
-
-#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+#endif
 
 #include "sched_idletask.c"
 #include "sched_fair.c"
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index be40f73..ca3b025 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,9 +61,9 @@ SCHED_FEAT(LB_BIAS, 1)
 SCHED_FEAT(OWNER_SPIN, 1)
 
 /*
- * Decrement CPU power based on irq activity
+ * Decrement CPU power based on time not spent running tasks
  */
-SCHED_FEAT(NONIRQ_POWER, 1)
+SCHED_FEAT(NONTASK_POWER, 1)
 
 /*
  * Queue remote wakeups on the target CPU and process them
-- 
1.7.3.4
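
To make the clock_task bookkeeping in update_rq_clock_task() concrete, a
worked example with illustrative numbers (HZ=1000, so TICK_NSEC = 1000000):

	/*
	 * delta = 4000000ns elapsed since the last update, and the steal
	 * clock advanced by 2700000ns in that window:
	 *
	 *   steal_ticks(2700000) = 2             whole ticks only
	 *   steal = 2 * TICK_NSEC = 2000000      remainder carried forward
	 *   rq->prev_steal_time_rq += 2000000
	 *   rq->clock_task += 4000000 - 2000000  task time excludes steal
	 *   sched_rt_avg_update(rq, irq_delta + 2000000)
	 *
	 * cpu power is scaled down by the stolen fraction, and the 700000ns
	 * remainder is accounted on a later update.
	 */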

[PATCH v6 0/9] Steal time for KVM

2011-07-11 Thread Glauber Costa
Hi guys,

All ACKs being collected, I hope this can be merged.
Only patch 4/9 differs from the last submission. It merges
Marcelo's suggestions, so I didn't transpose the other ACKs to it.
All the rest is kept the same.

Enjoy

Glauber Costa (8):
  KVM-HDR Add constant to represent KVM MSRs enabled bit
  KVM-HDR: KVM Steal time implementation
  KVM-HV: KVM Steal time implementation
  KVM-GST: Add a pv_ops stub for steal time
  add jump labels for ia64 paravirt
  KVM-GST: KVM Steal time accounting
  KVM-GST: adjust scheduler cpu power
  KVM-GST: KVM Steal time registration

Gleb Natapov (1):
  introduce kvm_read_guest_cached

 Documentation/kernel-parameters.txt   |4 ++
 Documentation/virtual/kvm/msr.txt |   35 +
 arch/ia64/include/asm/paravirt.h  |4 ++
 arch/ia64/kernel/paravirt.c   |2 +
 arch/x86/Kconfig  |   12 +
 arch/x86/include/asm/kvm_host.h   |9 +++
 arch/x86/include/asm/kvm_para.h   |   15 ++
 arch/x86/include/asm/paravirt.h   |9 +++
 arch/x86/include/asm/paravirt_types.h |1 +
 arch/x86/kernel/kvm.c |   73 +++
 arch/x86/kernel/kvmclock.c|2 +
 arch/x86/kernel/paravirt.c|9 +++
 arch/x86/kvm/Kconfig  |1 +
 arch/x86/kvm/x86.c|   74 +++-
 include/linux/kvm_host.h  |3 +
 kernel/sched.c|   88 +
 kernel/sched_features.h   |4 +-
 virt/kvm/kvm_main.c   |   20 +++
 18 files changed, 351 insertions(+), 14 deletions(-)

-- 
1.7.3.4



[PATCH v6 1/9] introduce kvm_read_guest_cached

2011-07-11 Thread Glauber Costa
From: Gleb Natapov g...@redhat.com

Introduce kvm_read_guest_cached() function in addition to write one we
already have.

[ by glauber: export function signature in kvm header ]

Signed-off-by: Gleb Natapov g...@redhat.com
Signed-off-by: Glauber Costa glom...@redhat.com
Acked-by: Rik van Riel r...@redhat.com
Tested-by: Eric Munson emun...@mgebm.net
---
 include/linux/kvm_host.h |2 ++
 virt/kvm/kvm_main.c  |   20 
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 31ebb59..f7df0a3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -381,6 +381,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
  unsigned long len);
 int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
+int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+  void *data, unsigned long len);
 int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
 int offset, int len);
 int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 11d2783..d5ef9eb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1418,6 +1418,26 @@ int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
 
+int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+  void *data, unsigned long len)
+{
+   struct kvm_memslots *slots = kvm_memslots(kvm);
+   int r;
+
+   if (slots->generation != ghc->generation)
+   kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa);
+
+   if (kvm_is_error_hva(ghc->hva))
+   return -EFAULT;
+
+   r = __copy_from_user(data, (void __user *)ghc->hva, len);
+   if (r)
+   return -EFAULT;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_read_guest_cached);
+
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
 {
return kvm_write_guest_page(kvm, gfn, (const void *) empty_zero_page,
-- 
1.7.3.4
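
The intended call pattern mirrors the existing kvm_write_guest_cached():
initialize the gfn_to_hva_cache once, then go through the cached hva on
every access. A sketch of how the steal time hypervisor patch later in
this series consumes it (names taken from that patch):

	/* at MSR write time, cache the gpa -> hva translation once */
	kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.st.stime,
				  data & KVM_STEAL_VALID_BITS);

	/* on every update, read/modify/write through the cache */
	if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
				  &vcpu->arch.st.steal,
				  sizeof(struct kvm_steal_time)))
		return;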



[PATCH v6 3/9] KVM-HDR: KVM Steal time implementation

2011-07-11 Thread Glauber Costa
To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
holds the memory area address containing information about steal time

This patch contains the headers for it. I am keeping it separate to facilitate
backports to people who want to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
Acked-by: Rik van Riel r...@redhat.com
Tested-by: Eric B Munson emun...@mgebm.net
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
---
 Documentation/virtual/kvm/msr.txt |   35 +++
 arch/x86/include/asm/kvm_para.h   |9 +
 2 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index d079aed..38db3f8 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -185,3 +185,38 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
 
Currently type 2 APF will be always delivered on the same vcpu as
type 1 was, but guest should not rely on that.
+
+MSR_KVM_STEAL_TIME: 0x4b564d03
+
+   data: 64-byte aligned physical address of a memory area which must be
+   in guest RAM, plus an enable bit in bit 0. This memory is expected to
+   hold a copy of the following structure:
+
+   struct kvm_steal_time {
+   __u64 steal;
+   __u32 version;
+   __u32 flags;
+   __u32 pad[12];
+   }
+
+   whose data will be filled in by the hypervisor periodically. Only one
+   write, or registration, is needed for each VCPU. The interval between
+   updates of this structure is arbitrary and implementation-dependent.
+   The hypervisor may update this structure at any time it sees fit until
+   anything with bit0 == 0 is written to it. Guest is required to make sure
+   this structure is initialized to zero.
+
+   Fields have the following meanings:
+
+   version: a sequence counter. In other words, guest has to check
+   this field before and after grabbing time information and make 
+   sure they are both equal and even. An odd version indicates an
+   in-progress update.
+
+   flags: At this point, always zero. May be used to indicate
+   changes in this structure in the future.
+
+   steal: the amount of time in which this vCPU did not run, in
+   nanoseconds. Time during which the vcpu is idle will not be
+   reported as steal time.
+
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index d6cd79b..65f8bb9 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,7 @@
  */
 #define KVM_FEATURE_CLOCKSOURCE2   3
 #define KVM_FEATURE_ASYNC_PF   4
+#define KVM_FEATURE_STEAL_TIME 5
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -35,6 +36,14 @@
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
+#define MSR_KVM_STEAL_TIME  0x4b564d03
+
+struct kvm_steal_time {
+   __u64 steal;
+   __u32 version;
+   __u32 flags;
+   __u32 pad[12];
+};
 
 #define KVM_MAX_MMU_OP_BATCH   32
 
-- 
1.7.3.4
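
On the guest side, the version protocol described above boils down to a
seqcount-style read loop; this is exactly what kvm_steal_clock() does in
the registration patch of this series:

	do {
		version = src->version;
		rmb();
		steal = src->steal;
		rmb();
	} while ((version & 1) || (version != src->version));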



[PATCH v6 6/9] add jump labels for ia64 paravirt

2011-07-11 Thread Glauber Costa
Since in a later patch I intend to call jump labels inside
CONFIG_PARAVIRT, IA64 would fail to compile if they are not
provided. This patch provides those jump labels for the IA64
architecture.

Signed-off-by: Glauber Costa glom...@redhat.com
Acked-by: Isaku Yamahata yamah...@valinux.co.jp
Acked-by: Rik van Riel r...@redhat.com
CC: Tony Luck tony.l...@intel.com
CC: Eddie Dong eddie.d...@intel.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/ia64/include/asm/paravirt.h |4 
 arch/ia64/kernel/paravirt.c  |2 ++
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/ia64/include/asm/paravirt.h b/arch/ia64/include/asm/paravirt.h
index 2eb0a98..32551d3 100644
--- a/arch/ia64/include/asm/paravirt.h
+++ b/arch/ia64/include/asm/paravirt.h
@@ -281,6 +281,10 @@ paravirt_init_missing_ticks_accounting(int cpu)
pv_time_ops.init_missing_ticks_accounting(cpu);
 }
 
+struct jump_label_key;
+extern struct jump_label_key paravirt_steal_enabled;
+extern struct jump_label_key paravirt_steal_rq_enabled;
+
 static inline int
 paravirt_do_steal_accounting(unsigned long *new_itm)
 {
diff --git a/arch/ia64/kernel/paravirt.c b/arch/ia64/kernel/paravirt.c
index a21d7bb..1008682 100644
--- a/arch/ia64/kernel/paravirt.c
+++ b/arch/ia64/kernel/paravirt.c
@@ -634,6 +634,8 @@ struct pv_irq_ops pv_irq_ops = {
  * pv_time_ops
  * time operations
  */
+struct jump_label_key paravirt_steal_enabled;
+struct jump_label_key paravirt_steal_rq_enabled;
 
 static int
 ia64_native_do_steal_accounting(unsigned long *new_itm)
-- 
1.7.3.4
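
The keys exist so the steal accounting fast path can be patched out
completely when no hypervisor enables it; a consumer looks like the sketch
below (the same shape the accounting patch in this series uses):

	if (static_branch(&paravirt_steal_enabled)) {
		u64 steal = paravirt_steal_clock(smp_processor_id());
		/* ... convert to ticks and account it ... */
	}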



Re: [PATCH v5 4/9] KVM-HV: KVM Steal time implementation

2011-07-07 Thread Glauber Costa

On 07/07/2011 07:51 AM, Marcelo Tosatti wrote:

On Mon, Jul 04, 2011 at 11:32:23AM -0400, Glauber Costa wrote:

To implement steal time, we need the hypervisor to pass the guest
information about how much time was spent running other processes
outside the VM, while the vcpu had meaningful work to do - halt
time does not count.

This information is acquired through the run_delay field of
delayacct/schedstats infrastructure, that counts time spent in a
runqueue but not running.

Steal time is per-cpu information, so the traditional MSR-based
infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the
memory area address containing information about steal time.

This patch contains the hypervisor part of the steal time infrastructure,
and can be backported independently of the guest portion.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
  arch/x86/include/asm/kvm_host.h |8 +
  arch/x86/include/asm/kvm_para.h |4 +++
  arch/x86/kvm/Kconfig|1 +
  arch/x86/kvm/x86.c  |   56 --
  4 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da6bbee..9ba354d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -389,6 +389,14 @@ struct kvm_vcpu_arch {
unsigned int hw_tsc_khz;
unsigned int time_offset;
struct page *time_page;
+
+   struct {
+   u64 msr_val;
+   u64 last_steal;
+   struct gfn_to_hva_cache stime;
+   struct kvm_steal_time steal;
+   } st;
+
u64 last_guest_tsc;
u64 last_kernel_ns;
u64 last_tsc_nsec;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 65f8bb9..c484ba8 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -45,6 +45,10 @@ struct kvm_steal_time {
__u32 pad[12];
  };

+#define KVM_STEAL_ALIGNMENT_BITS 5
+#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
+#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
+
  #define KVM_MAX_MMU_OP_BATCH   32

   #define KVM_ASYNC_PF_ENABLED  (1 << 0)
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 50f6364..99c3f05 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -31,6 +31,7 @@ config KVM
select KVM_ASYNC_PF
select USER_RETURN_NOTIFIER
select KVM_MMIO
+   select TASK_DELAY_ACCT
---help---
  Support hosting fully virtualized guest machines using hardware
  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7167717..237bcdc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -808,12 +808,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
   * kvm-specific. Those are put in the beginning of the list.
   */

-#define KVM_SAVE_MSRS_BEGIN	8
+#define KVM_SAVE_MSRS_BEGIN	9
  static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
-   HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
+   HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
  #ifdef CONFIG_X86_64
@@ -1491,6 +1491,27 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
}
  }

+static void record_steal_time(struct kvm_vcpu *vcpu)
+{
+   u64 delta;
+
+   if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+   return;
+
+   if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+   return;


The guest memory page is not pinned, sleeping via
__copy_from_user/to_user is not allowed in vcpu_load context. Either pin
it or use atomic accessors.


I do recognize the problem.
Avi, what's your take here?


+   case MSR_KVM_STEAL_TIME:
+   vcpu->arch.st.msr_val = data;
+
+   if (!(data & KVM_MSR_ENABLED)) {
+   break;
+   }


On failure below this point, msr_val should be cleared of KVM_MSR_ENABLED?

No, msr_val has to hold whatever the guest wrote into it.
We should probably use an independent variable here to indicate that we 
failed to activate it.





+
+   if (unlikely(!sched_info_on()))
+   break;
+
+   if (data & KVM_STEAL_RESERVED_MASK)
+   return 1;
+
+   if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.st.stime

[PATCH v5 0/9] Steal time for KVM

2011-07-04 Thread Glauber Costa
This is a follow-up on the last series. No significant changes.
The only change is in patch 7/9, where the steal time accounting
function is made inline, and also called from
irqtime_account_process_tick(). At least on my box (gcc 4.4.5),
gcc insists on not inlining it, so I am resorting to __always_inline.

Glauber Costa (8):
  KVM-HDR Add constant to represent KVM MSRs enabled bit
  KVM-HDR: KVM Steal time implementation
  KVM-HV: KVM Steal time implementation
  KVM-GST: Add a pv_ops stub for steal time
  add jump labels for ia64 paravirt
  KVM-GST: KVM Steal time accounting
  KVM-GST: adjust scheduler cpu power
  KVM-GST: KVM Steal time registration

Gleb Natapov (1):
  introduce kvm_read_guest_cached

 Documentation/kernel-parameters.txt   |4 ++
 Documentation/virtual/kvm/msr.txt |   35 +
 arch/ia64/include/asm/paravirt.h  |4 ++
 arch/ia64/kernel/paravirt.c   |2 +
 arch/x86/Kconfig  |   12 +
 arch/x86/include/asm/kvm_host.h   |8 +++
 arch/x86/include/asm/kvm_para.h   |   15 ++
 arch/x86/include/asm/paravirt.h   |9 +++
 arch/x86/include/asm/paravirt_types.h |1 +
 arch/x86/kernel/kvm.c |   73 +++
 arch/x86/kernel/kvmclock.c|2 +
 arch/x86/kernel/paravirt.c|9 +++
 arch/x86/kvm/Kconfig  |1 +
 arch/x86/kvm/x86.c|   56 -
 include/linux/kvm_host.h  |2 +
 kernel/sched.c|   88 +
 kernel/sched_features.h   |4 +-
 virt/kvm/kvm_main.c   |   20 +++
 18 files changed, 330 insertions(+), 15 deletions(-)

-- 
1.7.3.4



[PATCH v5 1/9] introduce kvm_read_guest_cached

2011-07-04 Thread Glauber Costa
From: Gleb Natapov g...@redhat.com

Introduce kvm_read_guest_cached() function in addition to write one we
already have.

[ by glauber: export function signature in kvm header ]

Signed-off-by: Gleb Natapov g...@redhat.com
Signed-off-by: Glauber Costa glom...@redhat.com
---
 include/linux/kvm_host.h |2 ++
 virt/kvm/kvm_main.c  |   20 
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 31ebb59..f7df0a3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -381,6 +381,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
  unsigned long len);
 int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
+int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+  void *data, unsigned long len);
 int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
 int offset, int len);
 int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 11d2783..d5ef9eb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1418,6 +1418,26 @@ int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
 
+int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+  void *data, unsigned long len)
+{
+   struct kvm_memslots *slots = kvm_memslots(kvm);
+   int r;
+
+   if (slots->generation != ghc->generation)
+   kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa);
+
+   if (kvm_is_error_hva(ghc->hva))
+   return -EFAULT;
+
+   r = __copy_from_user(data, (void __user *)ghc->hva, len);
+   if (r)
+   return -EFAULT;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_read_guest_cached);
+
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
 {
return kvm_write_guest_page(kvm, gfn, (const void *) empty_zero_page,
-- 
1.7.3.4



[PATCH v5 3/9] KVM-HDR: KVM Steal time implementation

2011-07-04 Thread Glauber Costa
To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
holds the memory area address containing information about steal time

This patch contains the headers for it. I am keeping it separate to facilitate
backports to people who want to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 Documentation/virtual/kvm/msr.txt |   35 +++
 arch/x86/include/asm/kvm_para.h   |9 +
 2 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index d079aed..38db3f8 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -185,3 +185,38 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
 
Currently type 2 APF will be always delivered on the same vcpu as
type 1 was, but guest should not rely on that.
+
+MSR_KVM_STEAL_TIME: 0x4b564d03
+
+   data: 64-byte aligned physical address of a memory area which must be
+   in guest RAM, plus an enable bit in bit 0. This memory is expected to
+   hold a copy of the following structure:
+
+   struct kvm_steal_time {
+   __u64 steal;
+   __u32 version;
+   __u32 flags;
+   __u32 pad[12];
+   }
+
+   whose data will be filled in by the hypervisor periodically. Only one
+   write, or registration, is needed for each VCPU. The interval between
+   updates of this structure is arbitrary and implementation-dependent.
+   The hypervisor may update this structure at any time it sees fit until
+   anything with bit0 == 0 is written to it. Guest is required to make sure
+   this structure is initialized to zero.
+
+   Fields have the following meanings:
+
+   version: a sequence counter. In other words, guest has to check
+   this field before and after grabbing time information and make 
+   sure they are both equal and even. An odd version indicates an
+   in-progress update.
+
+   flags: At this point, always zero. May be used to indicate
+   changes in this structure in the future.
+
+   steal: the amount of time in which this vCPU did not run, in
+   nanoseconds. Time during which the vcpu is idle will not be
+   reported as steal time.
+
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index d6cd79b..65f8bb9 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,7 @@
  */
 #define KVM_FEATURE_CLOCKSOURCE2   3
 #define KVM_FEATURE_ASYNC_PF   4
+#define KVM_FEATURE_STEAL_TIME 5
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -35,6 +36,14 @@
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
+#define MSR_KVM_STEAL_TIME  0x4b564d03
+
+struct kvm_steal_time {
+   __u64 steal;
+   __u32 version;
+   __u32 flags;
+   __u32 pad[12];
+};
 
 #define KVM_MAX_MMU_OP_BATCH   32
 
-- 
1.7.3.4



[PATCH v5 4/9] KVM-HV: KVM Steal time implementation

2011-07-04 Thread Glauber Costa
To implement steal time, we need the hypervisor to pass the guest
information about how much time was spent running other processes
outside the VM, while the vcpu had meaningful work to do - halt
time does not count.

This information is acquired through the run_delay field of
delayacct/schedstats infrastructure, that counts time spent in a
runqueue but not running.

Steal time is per-cpu information, so the traditional MSR-based
infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the
memory area address containing information about steal time.

This patch contains the hypervisor part of the steal time infrastructure,
and can be backported independently of the guest portion.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/x86/include/asm/kvm_host.h |8 +
 arch/x86/include/asm/kvm_para.h |4 +++
 arch/x86/kvm/Kconfig|1 +
 arch/x86/kvm/x86.c  |   56 --
 4 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da6bbee..9ba354d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -389,6 +389,14 @@ struct kvm_vcpu_arch {
unsigned int hw_tsc_khz;
unsigned int time_offset;
struct page *time_page;
+
+   struct {
+   u64 msr_val;
+   u64 last_steal;
+   struct gfn_to_hva_cache stime;
+   struct kvm_steal_time steal;
+   } st;
+
u64 last_guest_tsc;
u64 last_kernel_ns;
u64 last_tsc_nsec;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 65f8bb9..c484ba8 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -45,6 +45,10 @@ struct kvm_steal_time {
__u32 pad[12];
 };
 
+#define KVM_STEAL_ALIGNMENT_BITS 5
+#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
+#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
+
 #define KVM_MAX_MMU_OP_BATCH   32
 
 #define KVM_ASYNC_PF_ENABLED   (1 << 0)
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 50f6364..99c3f05 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -31,6 +31,7 @@ config KVM
select KVM_ASYNC_PF
select USER_RETURN_NOTIFIER
select KVM_MMIO
+   select TASK_DELAY_ACCT
---help---
  Support hosting fully virtualized guest machines using hardware
  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7167717..237bcdc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -808,12 +808,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN	8
+#define KVM_SAVE_MSRS_BEGIN	9
 static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
-   HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
+   HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1491,6 +1491,27 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
}
 }
 
+static void record_steal_time(struct kvm_vcpu *vcpu)
+{
+   u64 delta;
+
+   if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+   return;
+
+   if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+   return;
+
+   delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
+   vcpu->arch.st.last_steal = current->sched_info.run_delay;
+
+   vcpu->arch.st.steal.steal += delta;
+   vcpu->arch.st.steal.version += 2;
+
+   kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
switch (msr) {
@@ -1573,6 +1594,28 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
if (kvm_pv_enable_async_pf(vcpu, data))
return 1;
break;
+   case MSR_KVM_STEAL_TIME:
+   vcpu->arch.st.msr_val = data;
+
+   if (!(data & KVM_MSR_ENABLED)) {
+   break;
+   }
+
+   if (unlikely(!sched_info_on()))
+   break;
+
+   if (data & KVM_STEAL_RESERVED_MASK

[PATCH v5 2/9] KVM-HDR Add constant to represent KVM MSRs enabled bit

2011-07-04 Thread Glauber Costa
This patch is simple, put in a different commit so it can be more easily
shared between guest and hypervisor. It just defines a named constant
to indicate the enable bit for KVM-specific MSRs.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/x86/include/asm/kvm_para.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index a427bf7..d6cd79b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -30,6 +30,7 @@
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
 
+#define KVM_MSR_ENABLED 1
 /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
-- 
1.7.3.4
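
Both sides use the constant the same way: the guest ORs it into the
physical address it registers, and the host tests it before touching the
area. A sketch, with names taken from later patches in this series:

	/* guest: register the per-cpu area, with the enable bit set */
	wrmsrl(MSR_KVM_STEAL_TIME, __pa(st) | KVM_MSR_ENABLED);

	/* host: nothing to do while the bit is clear */
	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
		return;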



[PATCH v5 8/9] KVM-GST: adjust scheduler cpu power

2011-07-04 Thread Glauber Costa
This patch makes update_rq_clock() aware of steal time.
The mechanism of operation is not different from irq_time,
and follows the same principles. This lives in a CONFIG
option itself, and can be compiled out independently of
the rest of steal time reporting. The effect of disabling it
is that the scheduler will still report steal time (that cannot be
disabled), but won't use this information for cpu power adjustments.

Every time update_rq_clock_task() is invoked, we query information
about how much time was stolen since the last call, and feed it into
sched_rt_avg_update().

Although steal time reporting in account_process_tick() keeps
track of the last time we read the steal clock, in prev_steal_time,
this patch does it independently using another field,
prev_steal_time_rq. This is because otherwise, information about time
accounted in account_process_tick() would never reach us in update_rq_clock().

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/x86/Kconfig|   12 
 kernel/sched.c  |   47 +--
 kernel/sched_features.h |4 ++--
 3 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index da34972..b26f312 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -512,6 +512,18 @@ menuconfig PARAVIRT_GUEST
 
 if PARAVIRT_GUEST
 
+config PARAVIRT_TIME_ACCOUNTING
+   bool Paravirtual steal time accounting
+   select PARAVIRT
+   default n
+   ---help---
+ Select this option to enable fine granularity task steal time 
+ accounting. Time spent executing other tasks in parallel with
+ the current vCPU is discounted from the vCPU power. To account for
+ that, there can be a small performance impact.
+
+ If in doubt, say N here.
+
 source arch/x86/xen/Kconfig
 
 config KVM_CLOCK
diff --git a/kernel/sched.c b/kernel/sched.c
index aa6c030..8d57196 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -532,6 +532,9 @@ struct rq {
 #ifdef CONFIG_PARAVIRT
u64 prev_steal_time;
 #endif
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+   u64 prev_steal_time_rq;
+#endif
 
/* calc_load related fields */
unsigned long calc_load_update;
@@ -1971,8 +1974,14 @@ static inline u64 steal_ticks(u64 steal)
 
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
-   s64 irq_delta;
-
+/*
+ * In theory, the compiler should just see 0 here, and optimize out the call
+ * to sched_rt_avg_update. But I don't trust it...
+ */
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+   s64 steal = 0, irq_delta = 0;
+#endif
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
 
/*
@@ -1995,12 +2004,35 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 
	rq->prev_irq_time += irq_delta;
	delta -= irq_delta;
+#endif
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+   if (static_branch(&paravirt_steal_rq_enabled)) {
+   u64 st;
+
+   steal = paravirt_steal_clock(cpu_of(rq));
+   steal -= rq->prev_steal_time_rq;
+
+   if (unlikely(steal > delta))
+   steal = delta;
+
+   st = steal_ticks(steal);
+   steal = st * TICK_NSEC;
+
+   rq->prev_steal_time_rq += steal;
+
+   delta -= steal;
+   }
+#endif
+
	rq->clock_task += delta;
 
-   if (irq_delta && sched_feat(NONIRQ_POWER))
-   sched_rt_avg_update(rq, irq_delta);
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+   if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
+   sched_rt_avg_update(rq, irq_delta + steal);
+#endif
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
 static int irqtime_account_hi_update(void)
 {
	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
@@ -2035,12 +2067,7 @@ static int irqtime_account_si_update(void)
 
 #define sched_clock_irqtime(0)
 
-static void update_rq_clock_task(struct rq *rq, s64 delta)
-{
-   rq->clock_task += delta;
-}
-
-#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+#endif
 
 #include "sched_idletask.c"
 #include "sched_fair.c"
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index be40f73..ca3b025 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,9 +61,9 @@ SCHED_FEAT(LB_BIAS, 1)
 SCHED_FEAT(OWNER_SPIN, 1)
 
 /*
- * Decrement CPU power based on irq activity
+ * Decrement CPU power based on time not spent running tasks
  */
-SCHED_FEAT(NONIRQ_POWER, 1)
+SCHED_FEAT(NONTASK_POWER, 1)
 
 /*
  * Queue remote wakeups on the target CPU and process them
-- 
1.7.3.4


[PATCH v5 5/9] KVM-GST: Add a pv_ops stub for steal time

2011-07-04 Thread Glauber Costa
This patch adds a function pointer in one of the many paravirt_ops
structs, to allow guests to register a steal time function. Besides
a steal time function, we also declare two jump_labels. They will be
used to allow the steal time code to be easily bypassed when not
in use.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/x86/include/asm/paravirt.h   |9 +
 arch/x86/include/asm/paravirt_types.h |1 +
 arch/x86/kernel/paravirt.c|9 +
 3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index ebbc4d8..a7d2db9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -230,6 +230,15 @@ static inline unsigned long long paravirt_sched_clock(void)
return PVOP_CALL0(unsigned long long, pv_time_ops.sched_clock);
 }
 
+struct jump_label_key;
+extern struct jump_label_key paravirt_steal_enabled;
+extern struct jump_label_key paravirt_steal_rq_enabled;
+
+static inline u64 paravirt_steal_clock(int cpu)
+{
+   return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+}
+
 static inline unsigned long long paravirt_read_pmc(int counter)
 {
return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8288509..2c76521 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -89,6 +89,7 @@ struct pv_lazy_ops {
 
 struct pv_time_ops {
unsigned long long (*sched_clock)(void);
+   unsigned long long (*steal_clock)(int cpu);
unsigned long (*get_tsc_khz)(void);
 };
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 869e1ae..613a793 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -202,6 +202,14 @@ static void native_flush_tlb_single(unsigned long addr)
__native_flush_tlb_single(addr);
 }
 
+struct jump_label_key paravirt_steal_enabled;
+struct jump_label_key paravirt_steal_rq_enabled;
+
+static u64 native_steal_clock(int cpu)
+{
+   return 0;
+}
+
 /* These are in entry.S */
 extern void native_iret(void);
 extern void native_irq_enable_sysexit(void);
@@ -307,6 +315,7 @@ struct pv_init_ops pv_init_ops = {
 
 struct pv_time_ops pv_time_ops = {
.sched_clock = native_sched_clock,
+   .steal_clock = native_steal_clock,
 };
 
 struct pv_irq_ops pv_irq_ops = {
-- 
1.7.3.4



[PATCH v5 7/9] KVM-GST: KVM Steal time accounting

2011-07-04 Thread Glauber Costa
This patch accounts steal time in account_process_tick().
If one or more tick is considered stolen in the current
accounting cycle, user/system accounting is skipped. Idle is fine,
since the hypervisor does not report steal time if the guest
is halted.

Accounting steal time from the core scheduler gives us the
advantage of direct access to the runqueue data. In a later
opportunity, it can be used to tweak cpu power and make
the scheduler aware of the time it lost.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 kernel/sched.c |   41 +
 1 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3f2e502..aa6c030 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -75,6 +75,7 @@
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
 #include <asm/mutex.h>
+#include <asm/paravirt.h>
 
 #include "sched_cpupri.h"
 #include "workqueue_sched.h"
@@ -528,6 +529,9 @@ struct rq {
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
u64 prev_irq_time;
 #endif
+#ifdef CONFIG_PARAVIRT
+   u64 prev_steal_time;
+#endif
 
/* calc_load related fields */
unsigned long calc_load_update;
@@ -1953,6 +1957,18 @@ void account_system_vtime(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(account_system_vtime);
 
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
+#ifdef CONFIG_PARAVIRT
+static inline u64 steal_ticks(u64 steal)
+{
+   if (unlikely(steal > NSEC_PER_SEC))
+   return div_u64(steal, TICK_NSEC);
+
+   return __iter_div_u64_rem(steal, TICK_NSEC, &steal);
+}
+#endif
+
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
s64 irq_delta;
@@ -3845,6 +3861,25 @@ void account_idle_time(cputime_t cputime)
	cpustat->idle = cputime64_add(cpustat->idle, cputime64);
 }
 
+static __always_inline bool steal_account_process_tick(void)
+{
+#ifdef CONFIG_PARAVIRT
+   if (static_branch(&paravirt_steal_enabled)) {
+   u64 steal, st = 0;
+
+   steal = paravirt_steal_clock(smp_processor_id());
+   steal -= this_rq()->prev_steal_time;
+
+   st = steal_ticks(steal);
+   this_rq()->prev_steal_time += st * TICK_NSEC;
+
+   account_steal_time(st);
+   return st;
+   }
+#endif
+   return false;
+}
+
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -3876,6 +3911,9 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
	cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 
+   if (steal_account_process_tick())
+   return;
+
if (irqtime_account_hi_update()) {
	cpustat->irq = cputime64_add(cpustat->irq, tmp);
} else if (irqtime_account_si_update()) {
@@ -3929,6 +3967,9 @@ void account_process_tick(struct task_struct *p, int user_tick)
return;
}
 
+   if (steal_account_process_tick())
+   return;
+
if (user_tick)
account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
else if ((p != rq-idle) || (irq_count() != HARDIRQ_OFFSET))
-- 
1.7.3.4
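
A worked example of the tick bookkeeping above (illustrative numbers,
HZ=1000 so TICK_NSEC = 1000000):

	/*
	 * steal_account_process_tick() finds 3500000ns of accumulated steal:
	 *
	 *   steal_ticks(3500000) = 3            three full ticks stolen
	 *   prev_steal_time += 3 * TICK_NSEC    500000ns remainder carried
	 *   account_steal_time(3)
	 *
	 * st != 0, so user/system accounting is skipped for this tick; the
	 * remainder is picked up once it grows to a full tick.
	 */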



[PATCH v5 6/9] add jump labels for ia64 paravirt

2011-07-04 Thread Glauber Costa
Since in a later patch I intend to call jump labels inside
CONFIG_PARAVIRT, IA64 would fail to compile if they are not
provided. This patch provides those jump labels for the IA64
architecture.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Isaku Yamahata yamah...@valinux.co.jp
CC: Eddie Dong eddie.d...@intel.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/ia64/include/asm/paravirt.h |4 
 arch/ia64/kernel/paravirt.c  |2 ++
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/ia64/include/asm/paravirt.h b/arch/ia64/include/asm/paravirt.h
index 2eb0a98..32551d3 100644
--- a/arch/ia64/include/asm/paravirt.h
+++ b/arch/ia64/include/asm/paravirt.h
@@ -281,6 +281,10 @@ paravirt_init_missing_ticks_accounting(int cpu)
pv_time_ops.init_missing_ticks_accounting(cpu);
 }
 
+struct jump_label_key;
+extern struct jump_label_key paravirt_steal_enabled;
+extern struct jump_label_key paravirt_steal_rq_enabled;
+
 static inline int
 paravirt_do_steal_accounting(unsigned long *new_itm)
 {
diff --git a/arch/ia64/kernel/paravirt.c b/arch/ia64/kernel/paravirt.c
index a21d7bb..1008682 100644
--- a/arch/ia64/kernel/paravirt.c
+++ b/arch/ia64/kernel/paravirt.c
@@ -634,6 +634,8 @@ struct pv_irq_ops pv_irq_ops = {
  * pv_time_ops
  * time operations
  */
+struct jump_label_key paravirt_steal_enabled;
+struct jump_label_key paravirt_steal_rq_enabled;
 
 static int
 ia64_native_do_steal_accounting(unsigned long *new_itm)
-- 
1.7.3.4



[PATCH v5 9/9] KVM-GST: KVM Steal time registration

2011-07-04 Thread Glauber Costa
This patch implements the kvm bits of the steal time infrastructure.
The most important part of it, is the steal time clock. It is an
continuous clock that shows the accumulated amount of steal time
since vcpu creation. It is supposed to survive cpu offlining/onlining.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 Documentation/kernel-parameters.txt |4 ++
 arch/x86/include/asm/kvm_para.h |1 +
 arch/x86/kernel/kvm.c   |   73 +++
 arch/x86/kernel/kvmclock.c  |2 +
 4 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index fd248a31..a722574 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1737,6 +1737,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
no-kvmapf   [X86,KVM] Disable paravirtualized asynchronous page
fault handling.
 
+   no-steal-acc	[X86,KVM] Disable paravirtualized steal time accounting.
+   steal time is computed, but won't influence scheduler
+   behaviour
+
nolapic [X86-32,APIC] Do not enable or use the local APIC.
 
nolapic_timer   [X86-32,APIC] Do not use the local APIC timer.
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index c484ba8..35d732d 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -94,6 +94,7 @@ struct kvm_vcpu_pv_apf_data {
 
 extern void kvmclock_init(void);
 extern int kvm_register_clock(char *txt);
+extern void kvm_disable_steal_time(void);
 
 
 /* This instruction is vmcall.  On non-VT architectures, it will generate a
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 33c07b0..58331c2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -51,6 +51,15 @@ static int parse_no_kvmapf(char *arg)
 
 early_param("no-kvmapf", parse_no_kvmapf);
 
+static int steal_acc = 1;
+static int parse_no_stealacc(char *arg)
+{
+	steal_acc = 0;
+	return 0;
+}
+
+early_param("no-steal-acc", parse_no_stealacc);
+
 struct kvm_para_state {
u8 mmu_queue[MMU_QUEUE_SIZE];
int mmu_queue_len;
@@ -58,6 +67,8 @@ struct kvm_para_state {
 
 static DEFINE_PER_CPU(struct kvm_para_state, para_state);
 static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
+static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
+static int has_steal_clock = 0;
 
 static struct kvm_para_state *kvm_para_state(void)
 {
@@ -441,6 +452,21 @@ static void __init paravirt_ops_setup(void)
 #endif
 }
 
+static void kvm_register_steal_time(void)
+{
+   int cpu = smp_processor_id();
+   struct kvm_steal_time *st = &per_cpu(steal_time, cpu);
+
+   if (!has_steal_clock)
+   return;
+
+   memset(st, 0, sizeof(*st));
+
+   wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
+   printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
+   cpu, __pa(st));
+}
+
 void __cpuinit kvm_guest_cpu_init(void)
 {
if (!kvm_para_available())
@@ -457,6 +483,9 @@ void __cpuinit kvm_guest_cpu_init(void)
	printk(KERN_INFO "KVM setup async PF for cpu %d\n",
	   smp_processor_id());
}
+
+   if (has_steal_clock)
+   kvm_register_steal_time();
 }
 
 static void kvm_pv_disable_apf(void *unused)
@@ -483,6 +512,31 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
+static u64 kvm_steal_clock(int cpu)
+{
+   u64 steal;
+   struct kvm_steal_time *src;
+   int version;
+
+   src = &per_cpu(steal_time, cpu);
+   do {
+   version = src->version;
+   rmb();
+   steal = src->steal;
+   rmb();
+   } while ((version & 1) || (version != src->version));
+
+   return steal;
+}
+
+void kvm_disable_steal_time(void)
+{
+   if (!has_steal_clock)
+   return;
+
+   wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -500,6 +554,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
 
 static void kvm_guest_cpu_offline(void *dummy)
 {
+   kvm_disable_steal_time();
kvm_pv_disable_apf(NULL);
apf_task_wake_all();
 }
@@ -548,6 +603,11 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
x86_init.irqs.trap_init = kvm_apf_trap_init;
 
+   if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
+   has_steal_clock = 1;
+   pv_time_ops.steal_clock

[PATCH v4 0/9] Steal time series again

2011-07-01 Thread Glauber Costa
Here follows the fourth version of the steal time series.
Hope it is acceptable for all involved parties now. The main differences
from v3 are:

* The Changelogs seem to have been written by an actual person now, not by a
monkey. Yet, I am the aforementioned person, so don't expect much.
* Forcing delayacct on the hypervisor side allows us to simplify the guest
code dramatically, since now we don't need to test for is_idle: if we're idle,
we won't have steal time, end of story.

Hope you enjoy.

Glauber Costa (8):
  KVM-HDR Add constant to represent KVM MSRs enabled bit
  KVM-HDR: KVM Steal time implementation
  KVM-HV: KVM Steal time implementation
  KVM-GST: Add a pv_ops stub for steal time
  add jump labels for ia64 paravirt
  KVM-GST: KVM Steal time accounting
  KVM-GST: adjust scheduler cpu power
  KVM-GST: KVM Steal time registration

Gleb Natapov (1):
  introduce kvm_read_guest_cached

 Documentation/kernel-parameters.txt   |4 ++
 Documentation/virtual/kvm/msr.txt |   35 ++
 arch/ia64/include/asm/paravirt.h  |4 ++
 arch/ia64/kernel/paravirt.c   |2 +
 arch/x86/Kconfig  |   12 +
 arch/x86/include/asm/kvm_host.h   |8 +++
 arch/x86/include/asm/kvm_para.h   |   15 ++
 arch/x86/include/asm/paravirt.h   |9 
 arch/x86/include/asm/paravirt_types.h |1 +
 arch/x86/kernel/kvm.c |   73 ++
 arch/x86/kernel/kvmclock.c|2 +
 arch/x86/kernel/paravirt.c|9 
 arch/x86/kvm/Kconfig  |1 +
 arch/x86/kvm/x86.c|   56 ++-
 include/linux/kvm_host.h  |2 +
 kernel/sched.c|   80 
 kernel/sched_features.h   |4 +-
 virt/kvm/kvm_main.c   |   20 
 18 files changed, 322 insertions(+), 15 deletions(-)

-- 
1.7.3.4



[PATCH v4 4/9] KVM-HV: KVM Steal time implementation

2011-07-01 Thread Glauber Costa
To implement steal time, we need the hypervisor to pass the guest
information about how much time was spent running other processes
outside the VM, while the vcpu had meaningful work to do - halt
time does not count.

This information is acquired through the run_delay field of
delayacct/schedstats infrastructure, that counts time spent in a
runqueue but not running.

Steal time is per-cpu information, so the traditional MSR-based
infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the
memory area address containing information about steal time.

This patch contains the hypervisor part of the steal time infrastructure,
and can be backported independently of the guest portion.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/x86/include/asm/kvm_host.h |8 +
 arch/x86/include/asm/kvm_para.h |4 +++
 arch/x86/kvm/Kconfig|1 +
 arch/x86/kvm/x86.c  |   56 --
 4 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da6bbee..9ba354d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -389,6 +389,14 @@ struct kvm_vcpu_arch {
unsigned int hw_tsc_khz;
unsigned int time_offset;
struct page *time_page;
+
+   struct {
+   u64 msr_val;
+   u64 last_steal;
+   struct gfn_to_hva_cache stime;
+   struct kvm_steal_time steal;
+   } st;
+
u64 last_guest_tsc;
u64 last_kernel_ns;
u64 last_tsc_nsec;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 65f8bb9..c484ba8 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -45,6 +45,10 @@ struct kvm_steal_time {
__u32 pad[12];
 };
 
+#define KVM_STEAL_ALIGNMENT_BITS 5
+#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
+#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
+
 #define KVM_MAX_MMU_OP_BATCH   32
 
 #define KVM_ASYNC_PF_ENABLED   (1 << 0)
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 50f6364..99c3f05 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -31,6 +31,7 @@ config KVM
select KVM_ASYNC_PF
select USER_RETURN_NOTIFIER
select KVM_MMIO
+   select TASK_DELAY_ACCT
---help---
  Support hosting fully virtualized guest machines using hardware
  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7167717..237bcdc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -808,12 +808,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN	8
+#define KVM_SAVE_MSRS_BEGIN	9
 static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
-   HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
+   HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1491,6 +1491,27 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
}
 }
 
+static void record_steal_time(struct kvm_vcpu *vcpu)
+{
+   u64 delta;
+
+   if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+   return;
+
+   if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+   return;
+
+   delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
+   vcpu->arch.st.last_steal = current->sched_info.run_delay;
+
+   vcpu->arch.st.steal.steal += delta;
+   vcpu->arch.st.steal.version += 2;
+
+   kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
switch (msr) {
@@ -1573,6 +1594,28 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
if (kvm_pv_enable_async_pf(vcpu, data))
return 1;
break;
+   case MSR_KVM_STEAL_TIME:
+   vcpu->arch.st.msr_val = data;
+
+   if (!(data & KVM_MSR_ENABLED)) {
+   break;
+   }
+
+   if (unlikely(!sched_info_on()))
+   break;
+
+   if (data & KVM_STEAL_RESERVED_MASK

[PATCH v4 2/9] KVM-HDR Add constant to represent KVM MSRs enabled bit

2011-07-01 Thread Glauber Costa
This patch is simple, put in a different commit so it can be more easily
shared between guest and hypervisor. It just defines a named constant
to indicate the enable bit for KVM-specific MSRs.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/x86/include/asm/kvm_para.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index a427bf7..d6cd79b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -30,6 +30,7 @@
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
 
+#define KVM_MSR_ENABLED 1
 /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
-- 
1.7.3.4



[PATCH v4 8/9] KVM-GST: adjust scheduler cpu power

2011-07-01 Thread Glauber Costa
This patch makes update_rq_clock() aware of steal time.
The mechanism of operation is not different from irq_time,
and follows the same principles. This lives in a CONFIG
option itself, and can be compiled out independently of
the rest of steal time reporting. The effect of disabling it
is that the scheduler will still report steal time (that cannot be
disabled), but won't use this information for cpu power adjustments.

Every time update_rq_clock_task() is invoked, we query information
about how much time was stolen since the last call, and feed it into
sched_rt_avg_update().

Although steal time reporting in account_process_tick() keeps
track of the last time we read the steal clock, in prev_steal_time,
this patch does it independently, using another field,
prev_steal_time_rq. This is because otherwise, information about time
accounted in account_process_tick() would never reach us in update_rq_clock().
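
To make the split concrete, here is a hedged, standalone sketch of the
arithmetic (invented values; not code from the patch):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t delta = 4000000;   /* 4ms of wall time since last update */
        uint64_t steal = 3000000;   /* hypervisor reports 3ms stolen */

        if (steal > delta)          /* clamped, exactly as the patch does */
            steal = delta;

        delta -= steal;             /* only unstolen time advances clock_task */
        printf("clock_task += %llu ns, %llu ns feed sched_rt_avg_update\n",
               (unsigned long long)delta, (unsigned long long)steal);
        return 0;
    }

(The real code additionally rounds steal down to whole ticks via steal_ticks()
before subtracting.)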

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/x86/Kconfig|   12 
 kernel/sched.c  |   47 +--
 kernel/sched_features.h |4 ++--
 3 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index da34972..b26f312 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -512,6 +512,18 @@ menuconfig PARAVIRT_GUEST
 
 if PARAVIRT_GUEST
 
+config PARAVIRT_TIME_ACCOUNTING
+   bool "Paravirtual steal time accounting"
+   select PARAVIRT
+   default n
+   ---help---
+ Select this option to enable fine granularity task steal time 
+ accounting. Time spent executing other tasks in parallel with
+ the current vCPU is discounted from the vCPU power. To account for
+ that, there can be a small performance impact.
+
+ If in doubt, say N here.
+
source "arch/x86/xen/Kconfig"
 
 config KVM_CLOCK
diff --git a/kernel/sched.c b/kernel/sched.c
index 247dd51..c40b118 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -532,6 +532,9 @@ struct rq {
 #ifdef CONFIG_PARAVIRT
u64 prev_steal_time;
 #endif
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+   u64 prev_steal_time_rq;
+#endif
 
/* calc_load related fields */
unsigned long calc_load_update;
@@ -1971,8 +1974,14 @@ static inline u64 steal_ticks(u64 steal)
 
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
-   s64 irq_delta;
-
+/*
+ * In theory, the compile should just see 0 here, and optimize out the call
+ * to sched_rt_avg_update. But I don't trust it...
+ */
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+   s64 steal = 0, irq_delta = 0;
+#endif
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
 
/*
@@ -1995,12 +2004,35 @@ static void update_rq_clock_task(struct rq *rq, s64 
delta)
 
rq->prev_irq_time += irq_delta;
delta -= irq_delta;
+#endif
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+   if (static_branch((&paravirt_steal_rq_enabled))) {
+   u64 st;
+
+   steal = paravirt_steal_clock(cpu_of(rq));
+   steal -= rq->prev_steal_time_rq;
+
+   if (unlikely(steal > delta))
+   steal = delta;
+
+   st = steal_ticks(steal);
+   steal = st * TICK_NSEC;
+
+   rq->prev_steal_time_rq += steal;
+
+   delta -= steal;
+   }
+#endif
+
rq->clock_task += delta;

-   if (irq_delta && sched_feat(NONIRQ_POWER))
-   sched_rt_avg_update(rq, irq_delta);
+#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
+   if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
+   sched_rt_avg_update(rq, irq_delta + steal);
+#endif
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
 static int irqtime_account_hi_update(void)
 {
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
@@ -2035,12 +2067,7 @@ static int irqtime_account_si_update(void)
 
 #define sched_clock_irqtime(0)
 
-static void update_rq_clock_task(struct rq *rq, s64 delta)
-{
-   rq-clock_task += delta;
-}
-
-#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+#endif
 
#include "sched_idletask.c"
#include "sched_fair.c"
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index be40f73..ca3b025 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,9 +61,9 @@ SCHED_FEAT(LB_BIAS, 1)
 SCHED_FEAT(OWNER_SPIN, 1)
 
 /*
- * Decrement CPU power based on irq activity
+ * Decrement CPU power based on time not spent running tasks
  */
-SCHED_FEAT(NONIRQ_POWER, 1)
+SCHED_FEAT(NONTASK_POWER, 1)
 
 /*
  * Queue remote wakeups on the target CPU and process them
-- 
1.7.3.4

--
To unsubscribe from

[PATCH v4 1/9] introduce kvm_read_guest_cached

2011-07-01 Thread Glauber Costa
From: Gleb Natapov g...@redhat.com

Introduce the kvm_read_guest_cached() function, in addition to the write one
we already have.

[ by glauber: export function signature in kvm header ]
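
A hedged usage sketch (hypothetical caller, mirroring how the hypervisor-side
patch in this series uses it):

    struct kvm_steal_time st;

    if (kvm_read_guest_cached(kvm, &vcpu->arch.st.stime, &st, sizeof(st)))
        return; /* -EFAULT: the cached gpa no longer maps */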

Signed-off-by: Gleb Natapov g...@redhat.com
Signed-off-by: Glauber Costa glom...@redhat.com
---
 include/linux/kvm_host.h |2 ++
 virt/kvm/kvm_main.c  |   20 
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 31ebb59..f7df0a3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -381,6 +381,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void 
*data, int offset,
 int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
  unsigned long len);
 int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
+int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+  void *data, unsigned long len);
 int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
 int offset, int len);
 int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 11d2783..d5ef9eb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1418,6 +1418,26 @@ int kvm_write_guest_cached(struct kvm *kvm, struct 
gfn_to_hva_cache *ghc,
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
 
+int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+  void *data, unsigned long len)
+{
+   struct kvm_memslots *slots = kvm_memslots(kvm);
+   int r;
+
+   if (slots->generation != ghc->generation)
+   kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa);
+
+   if (kvm_is_error_hva(ghc->hva))
+   return -EFAULT;
+
+   r = __copy_from_user(data, (void __user *)ghc->hva, len);
+   if (r)
+   return -EFAULT;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_read_guest_cached);
+
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
 {
return kvm_write_guest_page(kvm, gfn, (const void *) empty_zero_page,
-- 
1.7.3.4



[PATCH v4 5/9] KVM-GST: Add a pv_ops stub for steal time

2011-07-01 Thread Glauber Costa
This patch adds a function pointer in one of the many paravirt_ops
structs, to allow guests to register a steal time function. Besides
a steal time function, we also declare two jump_labels. They will be
used to allow the steal time code to be easily bypassed when not
in use.
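
For illustration, this is roughly how a guest is expected to hook in (a
sketch based on later patches in this series, not a definitive recipe):

    /* at guest init: register a steal clock, then turn the keys on */
    pv_time_ops.steal_clock = kvm_steal_clock;
    jump_label_inc(&paravirt_steal_enabled);
    jump_label_inc(&paravirt_steal_rq_enabled); /* only if rq accounting is wanted */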

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/x86/include/asm/paravirt.h   |9 +
 arch/x86/include/asm/paravirt_types.h |1 +
 arch/x86/kernel/paravirt.c|9 +
 3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index ebbc4d8..a7d2db9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -230,6 +230,15 @@ static inline unsigned long long paravirt_sched_clock(void)
return PVOP_CALL0(unsigned long long, pv_time_ops.sched_clock);
 }
 
+struct jump_label_key;
+extern struct jump_label_key paravirt_steal_enabled;
+extern struct jump_label_key paravirt_steal_rq_enabled;
+
+static inline u64 paravirt_steal_clock(int cpu)
+{
+   return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+}
+
 static inline unsigned long long paravirt_read_pmc(int counter)
 {
return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter);
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 8288509..2c76521 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -89,6 +89,7 @@ struct pv_lazy_ops {
 
 struct pv_time_ops {
unsigned long long (*sched_clock)(void);
+   unsigned long long (*steal_clock)(int cpu);
unsigned long (*get_tsc_khz)(void);
 };
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 869e1ae..613a793 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -202,6 +202,14 @@ static void native_flush_tlb_single(unsigned long addr)
__native_flush_tlb_single(addr);
 }
 
+struct jump_label_key paravirt_steal_enabled;
+struct jump_label_key paravirt_steal_rq_enabled;
+
+static u64 native_steal_clock(int cpu)
+{
+   return 0;
+}
+
 /* These are in entry.S */
 extern void native_iret(void);
 extern void native_irq_enable_sysexit(void);
@@ -307,6 +315,7 @@ struct pv_init_ops pv_init_ops = {
 
 struct pv_time_ops pv_time_ops = {
.sched_clock = native_sched_clock,
+   .steal_clock = native_steal_clock,
 };
 
 struct pv_irq_ops pv_irq_ops = {
-- 
1.7.3.4



[PATCH v4 6/9] add jump labels for ia64 paravirt

2011-07-01 Thread Glauber Costa
Since in a later patch I intend to call jump labels inside
CONFIG_PARAVIRT, IA64 would fail to compile if they are not
provided. This patch provides those jump labels for the IA64
architecture.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Isaku Yamahata yamah...@valinux.co.jp
CC: Eddie Dong eddie.d...@intel.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/ia64/include/asm/paravirt.h |4 
 arch/ia64/kernel/paravirt.c  |2 ++
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/ia64/include/asm/paravirt.h b/arch/ia64/include/asm/paravirt.h
index 2eb0a98..32551d3 100644
--- a/arch/ia64/include/asm/paravirt.h
+++ b/arch/ia64/include/asm/paravirt.h
@@ -281,6 +281,10 @@ paravirt_init_missing_ticks_accounting(int cpu)
pv_time_ops.init_missing_ticks_accounting(cpu);
 }
 
+struct jump_label_key;
+extern struct jump_label_key paravirt_steal_enabled;
+extern struct jump_label_key paravirt_steal_rq_enabled;
+
 static inline int
 paravirt_do_steal_accounting(unsigned long *new_itm)
 {
diff --git a/arch/ia64/kernel/paravirt.c b/arch/ia64/kernel/paravirt.c
index a21d7bb..1008682 100644
--- a/arch/ia64/kernel/paravirt.c
+++ b/arch/ia64/kernel/paravirt.c
@@ -634,6 +634,8 @@ struct pv_irq_ops pv_irq_ops = {
  * pv_time_ops
  * time operations
  */
+struct jump_label_key paravirt_steal_enabled;
+struct jump_label_key paravirt_steal_rq_enabled;
 
 static int
 ia64_native_do_steal_accounting(unsigned long *new_itm)
-- 
1.7.3.4



[PATCH v4 7/9] KVM-GST: KVM Steal time accounting

2011-07-01 Thread Glauber Costa
This patch accounts steal time in account_process_tick().
If one or more tick is considered stolen in the current
accounting cycle, user/system accounting is skipped. Idle is fine,
since the hypervisor does not report steal time if the guest
is halted.

Accounting steal time from the core scheduler gives us the
advantage of direct access to the runqueue data. In a later
opportunity, it can be used to tweak cpu power and make
the scheduler aware of the time it lost.
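
A hedged standalone sketch of the skip decision (invented numbers, assumes
HZ=1000; not code from the patch):

    #include <stdio.h>
    #include <stdint.h>

    #define TICK_NSEC 1000000ULL    /* 1ms tick for this example */

    int main(void)
    {
        uint64_t prev_steal = 0;
        uint64_t steal_clock = 2500000;     /* 2.5ms reported stolen */
        uint64_t steal = steal_clock - prev_steal;
        uint64_t st = steal / TICK_NSEC;    /* 2 whole stolen ticks */

        prev_steal += st * TICK_NSEC;       /* 0.5ms remainder carries over */
        if (st)
            printf("account %llu steal tick(s); skip user/system this cycle\n",
                   (unsigned long long)st);
        return 0;
    }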

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 kernel/sched.c |   33 +
 1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3f2e502..247dd51 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -75,6 +75,7 @@
#include <asm/tlb.h>
#include <asm/irq_regs.h>
#include <asm/mutex.h>
+#include <asm/paravirt.h>

#include "sched_cpupri.h"
#include "workqueue_sched.h"
@@ -528,6 +529,9 @@ struct rq {
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
u64 prev_irq_time;
 #endif
+#ifdef CONFIG_PARAVIRT
+   u64 prev_steal_time;
+#endif
 
/* calc_load related fields */
unsigned long calc_load_update;
@@ -1953,6 +1957,18 @@ void account_system_vtime(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(account_system_vtime);
 
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
+#ifdef CONFIG_PARAVIRT
+static inline u64 steal_ticks(u64 steal)
+{
+   if (unlikely(steal > NSEC_PER_SEC))
+   return div_u64(steal, TICK_NSEC);
+
+   return __iter_div_u64_rem(steal, TICK_NSEC, &steal);
+}
+#endif
+
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
s64 irq_delta;
@@ -3929,6 +3945,23 @@ void account_process_tick(struct task_struct *p, int 
user_tick)
return;
}
 
+#ifdef CONFIG_PARAVIRT
+   if (static_branch(&paravirt_steal_enabled)) {
+   u64 steal, st = 0;
+
+   steal = paravirt_steal_clock(smp_processor_id());
+   steal -= this_rq()->prev_steal_time;
+
+   st = steal_ticks(steal);
+   this_rq()->prev_steal_time += st * TICK_NSEC;
+
+   if (st) {
+   account_steal_time(st);
+   return;
+   }
+   }
+#endif
+
if (user_tick)
account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
-- 
1.7.3.4



[PATCH v4 9/9] KVM-GST: KVM Steal time registration

2011-07-01 Thread Glauber Costa
This patch implements the kvm bits of the steal time infrastructure.
The most important part of it is the steal time clock. It is a
continuous clock that shows the accumulated amount of steal time
since vcpu creation. It is supposed to survive cpu offlining/onlining.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 Documentation/kernel-parameters.txt |4 ++
 arch/x86/include/asm/kvm_para.h |1 +
 arch/x86/kernel/kvm.c   |   73 +++
 arch/x86/kernel/kvmclock.c  |2 +
 4 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index fd248a31..a722574 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1737,6 +1737,10 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
no-kvmapf   [X86,KVM] Disable paravirtualized asynchronous page
fault handling.
 
+   no-steal-acc[X86,KVM] Disable paravirtualized steal time accounting.
+   Steal time is still computed, but won't influence scheduler
+   behaviour.
+
nolapic [X86-32,APIC] Do not enable or use the local APIC.
 
nolapic_timer   [X86-32,APIC] Do not use the local APIC timer.
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index c484ba8..35d732d 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -94,6 +94,7 @@ struct kvm_vcpu_pv_apf_data {
 
 extern void kvmclock_init(void);
 extern int kvm_register_clock(char *txt);
+extern void kvm_disable_steal_time(void);
 
 
 /* This instruction is vmcall.  On non-VT architectures, it will generate a
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 33c07b0..58331c2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -51,6 +51,15 @@ static int parse_no_kvmapf(char *arg)
 
early_param("no-kvmapf", parse_no_kvmapf);
 
+static int steal_acc = 1;
+static int parse_no_stealacc(char *arg)
+{
+steal_acc = 0;
+return 0;
+}
+
+early_param("no-steal-acc", parse_no_stealacc);
+
 struct kvm_para_state {
u8 mmu_queue[MMU_QUEUE_SIZE];
int mmu_queue_len;
@@ -58,6 +67,8 @@ struct kvm_para_state {
 
 static DEFINE_PER_CPU(struct kvm_para_state, para_state);
 static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
+static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
+static int has_steal_clock = 0;
 
 static struct kvm_para_state *kvm_para_state(void)
 {
@@ -441,6 +452,21 @@ static void __init paravirt_ops_setup(void)
 #endif
 }
 
+static void kvm_register_steal_time(void)
+{
+   int cpu = smp_processor_id();
+   struct kvm_steal_time *st = &per_cpu(steal_time, cpu);
+
+   if (!has_steal_clock)
+   return;
+
+   memset(st, 0, sizeof(*st));
+
+   wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
+   printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
+   cpu, __pa(st));
+}
+
 void __cpuinit kvm_guest_cpu_init(void)
 {
if (!kvm_para_available())
@@ -457,6 +483,9 @@ void __cpuinit kvm_guest_cpu_init(void)
printk(KERN_INFO "KVM setup async PF for cpu %d\n",
   smp_processor_id());
}
+
+   if (has_steal_clock)
+   kvm_register_steal_time();
 }
 
 static void kvm_pv_disable_apf(void *unused)
@@ -483,6 +512,31 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
+static u64 kvm_steal_clock(int cpu)
+{
+   u64 steal;
+   struct kvm_steal_time *src;
+   int version;
+
+   src = &per_cpu(steal_time, cpu);
+   do {
+   version = src->version;
+   rmb();
+   steal = src->steal;
+   rmb();
+   } while ((version & 1) || (version != src->version));
+
+   return steal;
+}
+
+void kvm_disable_steal_time(void)
+{
+   if (!has_steal_clock)
+   return;
+
+   wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -500,6 +554,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
 
 static void kvm_guest_cpu_offline(void *dummy)
 {
+   kvm_disable_steal_time();
kvm_pv_disable_apf(NULL);
apf_task_wake_all();
 }
@@ -548,6 +603,11 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
x86_init.irqs.trap_init = kvm_apf_trap_init;
 
+   if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
+   has_steal_clock = 1;
+   pv_time_ops.steal_clock = kvm_steal_clock;
+   }

[PATCH v4 3/9] KVM-HDR: KVM Steal time implementation

2011-07-01 Thread Glauber Costa
To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
holds the memory area address containing information about steal time

This patch contains the headers for it. I am keeping it separate to facilitate
backports to people who want to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 Documentation/virtual/kvm/msr.txt |   35 +++
 arch/x86/include/asm/kvm_para.h   |9 +
 2 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/msr.txt 
b/Documentation/virtual/kvm/msr.txt
index d079aed..38db3f8 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -185,3 +185,38 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
 
Currently type 2 APF will be always delivered on the same vcpu as
type 1 was, but guest should not rely on that.
+
+MSR_KVM_STEAL_TIME: 0x4b564d03
+
+   data: 64-byte alignment physical address of a memory area which must be
+   in guest RAM, plus an enable bit in bit 0. This memory is expected to
+   hold a copy of the following structure:
+
+   struct kvm_steal_time {
+   __u64 steal;
+   __u32 version;
+   __u32 flags;
+   __u32 pad[12];
+   }
+
+   whose data will be filled in by the hypervisor periodically. Only one
+   write, or registration, is needed for each VCPU. The interval between
+   updates of this structure is arbitrary and implementation-dependent.
+   The hypervisor may update this structure at any time it sees fit until
+   anything with bit0 == 0 is written to it. Guest is required to make sure
+   this structure is initialized to zero.
+
+   Fields have the following meanings:
+
+   version: a sequence counter. In other words, guest has to check
+   this field before and after grabbing time information and make 
+   sure they are both equal and even. An odd version indicates an
+   in-progress update.
+
+   flags: At this point, always zero. May be used to indicate
+   changes in this structure in the future.
+
+   steal: the amount of time in which this vCPU did not run, in
+   nanoseconds. Time during which the vcpu is idle, will not be
+   reported as steal time.
+
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index d6cd79b..65f8bb9 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,7 @@
  */
#define KVM_FEATURE_CLOCKSOURCE2  3
 #define KVM_FEATURE_ASYNC_PF   4
+#define KVM_FEATURE_STEAL_TIME 5
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -35,6 +36,14 @@
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
+#define MSR_KVM_STEAL_TIME  0x4b564d03
+
+struct kvm_steal_time {
+   __u64 steal;
+   __u32 version;
+   __u32 flags;
+   __u32 pad[12];
+};
 
 #define KVM_MAX_MMU_OP_BATCH   32
 
-- 
1.7.3.4
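
The version field documented above behaves like a seqcount. A standalone
illustration of the reader side (my demo, not from the patch):

    #include <stdint.h>
    #include <stdio.h>

    struct kvm_steal_time {
        uint64_t steal;
        uint32_t version;
        uint32_t flags;
        uint32_t pad[12];
    };

    static uint64_t read_steal(volatile struct kvm_steal_time *st)
    {
        uint32_t v;
        uint64_t s;

        do {
            v = st->version;        /* snapshot the counter */
            __sync_synchronize();   /* stand-in for the guest's rmb() */
            s = st->steal;
            __sync_synchronize();
        } while ((v & 1) || (v != st->version)); /* retry if odd or changed */
        return s;
    }

    int main(void)
    {
        struct kvm_steal_time st = { .steal = 123456789, .version = 2 };
        printf("steal = %llu ns\n", (unsigned long long)read_steal(&st));
        return 0;
    }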



Re: [PATCH v3 2/9] KVM-HDR Add constant to represent KVM MSRs enabled bit

2011-06-30 Thread Glauber Costa

On 06/30/2011 09:59 AM, Eric B Munson wrote:

On Thu, 30 Jun 2011, Avi Kivity wrote:


On 06/30/2011 12:56 AM, Eric B Munson wrote:

My mail provider seems to have dropped patch 1 of the series so I can't reply
directly to it, please add my Tested-by there as well.


How did you test it then?



I built host and guest kernels with the patches and pinned a while(1) loop and
the CPU thread from qemu to CPU 2 on the host.  I then started the same while(1)
process in the guest and verified that I see ~50% steal time reported.

I then built 2.6.39 (just the code I had present) on the guest and timed it
while it was competing with the while(1) on the host for CPU time.  Next I
built the guest kernel with STEAL_TIME=N and reran the kernel compile to make
sure that there weren't any huge variations in performance.

Eric
I think what Avi means is: it won't even compile without PATCH 1/9. If
you don't have it, how could you test it?




Re: [PATCH v3 7/9] KVM-GST: KVM Steal time accounting

2011-06-30 Thread Glauber Costa

On 06/30/2011 06:54 PM, Peter Zijlstra wrote:

On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote:

+static noinline bool touch_steal_time(int is_idle)


That noinline is very unlucky there,


+{
+   u64 steal, st = 0;
+
+   if (static_branch(&paravirt_steal_enabled)) {
+
+   steal = paravirt_steal_clock(smp_processor_id());
+
+   steal -= this_rq()->prev_steal_time;
+
+   st = steal_ticks(steal);
+   this_rq()->prev_steal_time += st * TICK_NSEC;
+
+   if (is_idle || st == 0)
+   return false;
+
+   account_steal_time(st);
+   return true;
+   }
+   return false;
+}
+
  static void update_rq_clock_task(struct rq *rq, s64 delta)
  {
 s64 irq_delta;
@@ -3716,6 +3760,9 @@ void account_user_time(struct task_struct *p,
cputime_t cputime,
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 cputime64_t tmp;

+   if (touch_steal_time(0))
+   return;


Means we have an unconditional call here, even if the static_branch() is
patched out.

Ok.

I was under the impression that the proper use of jump labels required 
each label to be tied to a single location. If we make it inline, the 
same key would point to multiple locations, and we would have trouble
altering all of the locations. I might be wrong, of course. Isn't that the
case?
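
For context, a minimal sketch of the jump-label pattern under discussion
(era-appropriate API; the surrounding details are my assumptions):

    /* one key, one out-of-line test site */
    struct jump_label_key paravirt_steal_enabled;   /* false by default */

    static noinline bool touch_steal_time(int is_idle)
    {
        if (static_branch(&paravirt_steal_enabled)) { /* patched to a nop when off */
            /* ... steal accounting ... */
            return true;
        }
        return false;
    }

    /* enabled once at boot, e.g. jump_label_inc(&paravirt_steal_enabled); */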



Re: [PATCH v3 7/9] KVM-GST: KVM Steal time accounting

2011-06-30 Thread Glauber Costa

On 06/30/2011 06:54 PM, Peter Zijlstra wrote:

On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote:

This patch accounts steal time time in kernel/sched.
I kept it from last proposal, because I still see advantages
in it: Doing it here will give us easier access from scheduler
variables such as the cpu rq. The next patch shows an example of
usage for it.

Since functions like account_idle_time() can be called from
multiple places, not only account_process_tick(), steal time
grabbing is repeated in each account function separately.



That changelog is so going to frustrate someone trying to re-create your
thinking in a year's time. They really don't care about the last
proposals and the next patch.


You're right. I'll review all changelogs and rewrite them as needed.


Re: [PATCH v3 7/9] KVM-GST: KVM Steal time accounting

2011-06-30 Thread Glauber Costa

On 06/30/2011 06:54 PM, Peter Zijlstra wrote:

On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote:

+   if (static_branch(paravirt_steal_enabled)) {


How is that going to compile on !CONFIG_PARAVIRT or !x86 in general?
Only x86-PARAVIRT will provide that variable.




Good point. I'd wrap it into CONFIG_PARAVIRT.
To be clear, the reason I did not put it inside 
CONFIG_PARAVIRT_TIME_ACCOUNTING, is because I wanted to have the mere 
display of steal time separated from the rest - unless, of course, you 
object to this idea.


Using CONFIG_PARAVIRT achieves this goal well.


[PATCH v3 6/9] KVM-GST: Add a pv_ops stub for steal time

2011-06-29 Thread Glauber Costa
This patch adds a function pointer in one of the many paravirt_ops
structs, to allow guests to register a steal time function.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/x86/include/asm/paravirt.h   |9 +
 arch/x86/include/asm/paravirt_types.h |1 +
 arch/x86/kernel/paravirt.c|9 +
 3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index ebbc4d8..a7d2db9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -230,6 +230,15 @@ static inline unsigned long long paravirt_sched_clock(void)
return PVOP_CALL0(unsigned long long, pv_time_ops.sched_clock);
 }
 
+struct jump_label_key;
+extern struct jump_label_key paravirt_steal_enabled;
+extern struct jump_label_key paravirt_steal_rq_enabled;
+
+static inline u64 paravirt_steal_clock(int cpu)
+{
+   return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+}
+
 static inline unsigned long long paravirt_read_pmc(int counter)
 {
return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter);
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 8288509..2c76521 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -89,6 +89,7 @@ struct pv_lazy_ops {
 
 struct pv_time_ops {
unsigned long long (*sched_clock)(void);
+   unsigned long long (*steal_clock)(int cpu);
unsigned long (*get_tsc_khz)(void);
 };
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 869e1ae..613a793 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -202,6 +202,14 @@ static void native_flush_tlb_single(unsigned long addr)
__native_flush_tlb_single(addr);
 }
 
+struct jump_label_key paravirt_steal_enabled;
+struct jump_label_key paravirt_steal_rq_enabled;
+
+static u64 native_steal_clock(int cpu)
+{
+   return 0;
+}
+
 /* These are in entry.S */
 extern void native_iret(void);
 extern void native_irq_enable_sysexit(void);
@@ -307,6 +315,7 @@ struct pv_init_ops pv_init_ops = {
 
 struct pv_time_ops pv_time_ops = {
.sched_clock = native_sched_clock,
+   .steal_clock = native_steal_clock,
 };
 
 struct pv_irq_ops pv_irq_ops = {
-- 
1.7.3.4



[PATCH v3 2/9] KVM-HDR Add constant to represent KVM MSRs enabled bit

2011-06-29 Thread Glauber Costa
This patch is simple, put in a different commit so it can be more easily
shared between guest and hypervisor. It just defines a named constant
to indicate the enable bit for KVM-specific MSRs.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/x86/include/asm/kvm_para.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index a427bf7..d6cd79b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -30,6 +30,7 @@
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
 
+#define KVM_MSR_ENABLED 1
 /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
-- 
1.7.3.4



[PATCH v3 0/9] Steal time for KVM

2011-06-29 Thread Glauber Costa
This patchset adds on the previous one, merging the suggestions
made mainly by Gleb, Peter and Marcelo.

SCHEDSTATS is now used when available to provide information about
time spent on the runqueue for the cpu threads. Information to/from
the guest is obtained using *_cached() functions, as suggested by Gleb.

On the guest side, the two prev_steal_time variables are collapsed
into a single one. Steal time accounting is also done inside
update_rq_clock().

Glauber Costa (8):
  KVM-HDR Add constant to represent KVM MSRs enabled bit
  KVM-HDR: KVM Steal time implementation
  KVM-HV: KVM Steal time implementation
  KVM-HV: use schedstats to calculate steal time
  KVM-GST: Add a pv_ops stub for steal time
  KVM-GST: KVM Steal time accounting
  KVM-GST: adjust scheduler cpu power
  KVM-GST: KVM Steal time registration

Gleb Natapov (1):
  introduce kvm_read_guest_cached

 Documentation/kernel-parameters.txt   |4 ++
 Documentation/virtual/kvm/msr.txt |   33 
 arch/x86/Kconfig  |   12 
 arch/x86/include/asm/kvm_host.h   |8 +++
 arch/x86/include/asm/kvm_para.h   |   15 +
 arch/x86/include/asm/paravirt.h   |9 +++
 arch/x86/include/asm/paravirt_types.h |1 +
 arch/x86/kernel/kvm.c |   73 +
 arch/x86/kernel/kvmclock.c|2 +
 arch/x86/kernel/paravirt.c|9 +++
 arch/x86/kvm/x86.c|   67 ++-
 include/linux/kvm_host.h  |2 +
 kernel/sched.c|   94 +---
 kernel/sched_features.h   |4 +-
 virt/kvm/kvm_main.c   |   20 +++
 15 files changed, 339 insertions(+), 14 deletions(-)

-- 
1.7.3.4



[PATCH v3 8/9] KVM-GST: adjust scheduler cpu power

2011-06-29 Thread Glauber Costa
This is a first proposal for using steal time information
to influence the scheduler. There are a lot of optimizations
and fine grained adjustments to be done, but it is working reasonably
so far for me (mostly)

With this patch (and some host pinnings to demonstrate the situation),
two vcpus with very different steal time (say 80% vs 1%) will not get
an even distribution of processes. This is a situation that can naturally
arise, especially in overcommitted scenarios. Previously, the guest scheduler
would wrongly think that all cpus have the same ability to run processes,
lowering the overall throughput.
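
A hedged back-of-the-envelope of the effect (numbers invented; 1024 is the
scheduler's nominal cpu power scale):

    unsigned long nominal = 1024;
    unsigned long busy  = nominal * (100 - 80) / 100;  /* ~204: 80% stolen */
    unsigned long quiet = nominal * (100 - 1)  / 100;  /* ~1013: 1% stolen */

With NONTASK_POWER, the load balancer sees the first vcpu as roughly five
times weaker and shifts work toward the second, instead of splitting evenly.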

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/x86/Kconfig|   12 
 kernel/sched.c  |   44 ++--
 kernel/sched_features.h |4 ++--
 3 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index da34972..b26f312 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -512,6 +512,18 @@ menuconfig PARAVIRT_GUEST
 
 if PARAVIRT_GUEST
 
+config PARAVIRT_TIME_ACCOUNTING
+   bool "Paravirtual steal time accounting"
+   select PARAVIRT
+   default n
+   ---help---
+ Select this option to enable fine granularity task steal time 
+ accounting. Time spent executing other tasks in parallel with
+ the current vCPU is discounted from the vCPU power. To account for
+ that, there can be a small performance impact.
+
+ If in doubt, say N here.
+
source "arch/x86/xen/Kconfig"
 
 config KVM_CLOCK
diff --git a/kernel/sched.c b/kernel/sched.c
index c166863..3ef0de9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1975,7 +1975,7 @@ static inline u64 steal_ticks(u64 steal)
  * tell the underlying hypervisor that we grabbed the data, but skip steal time
  * accounting
  */
-static noinline bool touch_steal_time(int is_idle)
+static noinline bool __touch_steal_time(int is_idle, u64 max_steal, u64 *ticks)
 {
u64 steal, st = 0;
 
@@ -1985,8 +1985,13 @@ static noinline bool touch_steal_time(int is_idle)
 
steal -= this_rq()->prev_steal_time;

+   if (steal > max_steal)
+   steal = max_steal;
+
st = steal_ticks(steal);
this_rq()->prev_steal_time += st * TICK_NSEC;
+   if (ticks)
+   *ticks = st;
 
if (is_idle || st == 0)
return false;
@@ -1997,10 +2002,16 @@ static noinline bool touch_steal_time(int is_idle)
return false;
 }
 
+static inline bool touch_steal_time(int is_idle)
+{
+   return __touch_steal_time(is_idle, UINT_MAX, NULL);
+}
+
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
-   s64 irq_delta;
+   s64 irq_delta = 0, steal = 0;
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
 
/*
@@ -2023,12 +2034,30 @@ static void update_rq_clock_task(struct rq *rq, s64 
delta)
 
rq->prev_irq_time += irq_delta;
delta -= irq_delta;
+#endif
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+   if (static_branch((&paravirt_steal_rq_enabled))) {
+   int is_idle;
+   u64 st;
+
+   is_idle = ((rq->curr != rq->idle) ||
+   irq_count() != HARDIRQ_OFFSET);
+
+   __touch_steal_time(is_idle, delta, &st);
+
+   steal = st * TICK_NSEC;
+
+   delta -= steal;
+   }
+#endif
+
rq->clock_task += delta;
 
-   if (irq_delta && sched_feat(NONIRQ_POWER))
-   sched_rt_avg_update(rq, irq_delta);
+   if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
+   sched_rt_avg_update(rq, irq_delta + steal);
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
 static int irqtime_account_hi_update(void)
 {
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
@@ -2063,12 +2092,7 @@ static int irqtime_account_si_update(void)
 
 #define sched_clock_irqtime(0)
 
-static void update_rq_clock_task(struct rq *rq, s64 delta)
-{
-   rq-clock_task += delta;
-}
-
-#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+#endif
 
#include "sched_idletask.c"
#include "sched_fair.c"
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index be40f73..ca3b025 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,9 +61,9 @@ SCHED_FEAT(LB_BIAS, 1)
 SCHED_FEAT(OWNER_SPIN, 1)
 
 /*
- * Decrement CPU power based on irq activity
+ * Decrement CPU power based on time not spent running tasks
  */
-SCHED_FEAT(NONIRQ_POWER, 1)
+SCHED_FEAT(NONTASK_POWER, 1)
 
 /*
  * Queue remote wakeups on the target CPU and process them
-- 
1.7.3.4


[PATCH v3 3/9] KVM-HDR: KVM Steal time implementation

2011-06-29 Thread Glauber Costa
To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
holds the memory area address containing information about steal time

This patch contains the headers for it. I am keeping it separate to facilitate
backports to people who want to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 Documentation/virtual/kvm/msr.txt |   33 +
 arch/x86/include/asm/kvm_para.h   |9 +
 2 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/msr.txt 
b/Documentation/virtual/kvm/msr.txt
index d079aed..fab9b30 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -185,3 +185,36 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
 
Currently type 2 APF will be always delivered on the same vcpu as
type 1 was, but guest should not rely on that.
+
+MSR_KVM_STEAL_TIME: 0x4b564d03
+
+   data: 64-byte alignment physical address of a memory area which must be
+   in guest RAM, plus an enable bit in bit 0. This memory is expected to
+   hold a copy of the following structure:
+
+   struct kvm_steal_time {
+   __u64 steal;
+   __u32 version;
+   __u32 flags;
+   __u32 pad[12];
+   }
+
+   whose data will be filled in by the hypervisor periodically. Only one
+   write, or registration, is needed for each VCPU. The interval between
+   updates of this structure is arbitrary and implementation-dependent.
+   The hypervisor may update this structure at any time it sees fit until
+   anything with bit0 == 0 is written to it. Guest is required to make sure
+   this structure is initialized to zero.
+
+   Fields have the following meanings:
+
+   version: guest has to check version before and after grabbing
+   time information and check that they are both equal and even.
+   An odd version indicates an in-progress update.
+
+   flags: At this point, always zero. May be used to indicate
+   changes in this structure in the future.
+
+   steal: the amount of time in which this vCPU did not run, in
+   nanoseconds.
+
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index d6cd79b..65f8bb9 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,7 @@
  */
#define KVM_FEATURE_CLOCKSOURCE2  3
 #define KVM_FEATURE_ASYNC_PF   4
+#define KVM_FEATURE_STEAL_TIME 5
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -35,6 +36,14 @@
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
+#define MSR_KVM_STEAL_TIME  0x4b564d03
+
+struct kvm_steal_time {
+   __u64 steal;
+   __u32 version;
+   __u32 flags;
+   __u32 pad[12];
+};
 
 #define KVM_MAX_MMU_OP_BATCH   32
 
-- 
1.7.3.4



[PATCH v3 4/9] KVM-HV: KVM Steal time implementation

2011-06-29 Thread Glauber Costa
To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
holds the memory area address containing information about steal time

This patch contains the hypervisor part for it. I am keeping it separate from
the headers to facilitate backports to people who want to backport the kernel
part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/x86/include/asm/kvm_host.h |8 ++
 arch/x86/include/asm/kvm_para.h |4 +++
 arch/x86/kvm/x86.c  |   53 --
 3 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da6bbee..f85ed56 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -389,6 +389,14 @@ struct kvm_vcpu_arch {
unsigned int hw_tsc_khz;
unsigned int time_offset;
struct page *time_page;
+
+   struct {
+   u64 msr_val;
+   u64 this_time_out;
+   struct gfn_to_hva_cache stime;
+   struct kvm_steal_time steal;
+   } st;
+
u64 last_guest_tsc;
u64 last_kernel_ns;
u64 last_tsc_nsec;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 65f8bb9..c484ba8 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -45,6 +45,10 @@ struct kvm_steal_time {
__u32 pad[12];
 };
 
+#define KVM_STEAL_ALIGNMENT_BITS 5
+#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
+#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
+
 #define KVM_MAX_MMU_OP_BATCH   32
 
#define KVM_ASYNC_PF_ENABLED   (1 << 0)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7167717..df9d274 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -808,12 +808,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN 8
+#define KVM_SAVE_MSRS_BEGIN 9
 static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
-   HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
+   HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1491,6 +1491,26 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
}
 }
 
+static void record_steal_time(struct kvm_vcpu *vcpu)
+{
+   u64 delta;
+
+   if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+   return;
+
+   if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+   return;
+
+   delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
+
+   vcpu->arch.st.steal.steal += delta;
+   vcpu->arch.st.steal.version += 2;
+
+   kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
switch (msr) {
@@ -1573,6 +1593,25 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
u64 data)
if (kvm_pv_enable_async_pf(vcpu, data))
return 1;
break;
+   case MSR_KVM_STEAL_TIME:
+   vcpu->arch.st.msr_val = data;
+
+   if (!(data & KVM_MSR_ENABLED)) {
+   break;
+   }
+
+   if (data & KVM_STEAL_RESERVED_MASK)
+   return 1;
+
+   if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.st.stime,
+   data & KVM_STEAL_VALID_BITS))
+   return 1;
+
+   vcpu->arch.st.this_time_out = get_kernel_ns();
+   record_steal_time(vcpu);
+
+   break;
+
+
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1858,6 +1897,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
u64 *pdata)
case MSR_KVM_ASYNC_PF_EN:
data = vcpu->arch.apf.msr_val;
break;
+   case MSR_KVM_STEAL_TIME:
+   data = vcpu->arch.st.msr_val;
+   break;

[PATCH v3 9/9] KVM-GST: KVM Steal time registration

2011-06-29 Thread Glauber Costa
Register steal time within KVM. Every time we sample the steal time
information, we update a local variable that tells us what the last
value read was. We then account the difference.
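
In other words (a hedged sketch of the snapshot-and-diff pattern; names are
invented):

    static uint64_t last_steal;             /* per-cpu in the real code */

    static void sample(uint64_t steal_now)  /* ns stolen since vcpu creation */
    {
        uint64_t delta = steal_now - last_steal;

        last_steal = steal_now;             /* snapshot for the next sample */
        account_steal_delta(delta);         /* hypothetical consumer */
    }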

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 Documentation/kernel-parameters.txt |4 ++
 arch/x86/include/asm/kvm_para.h |1 +
 arch/x86/kernel/kvm.c   |   73 +++
 arch/x86/kernel/kvmclock.c  |2 +
 4 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index fd248a31..a722574 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1737,6 +1737,10 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
no-kvmapf   [X86,KVM] Disable paravirtualized asynchronous page
fault handling.
 
+   no-steal-acc[X86,KVM] Disable paravirtualized steal time accounting.
+   Steal time is still computed, but won't influence scheduler
+   behaviour.
+
nolapic [X86-32,APIC] Do not enable or use the local APIC.
 
nolapic_timer   [X86-32,APIC] Do not use the local APIC timer.
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index c484ba8..35d732d 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -94,6 +94,7 @@ struct kvm_vcpu_pv_apf_data {
 
 extern void kvmclock_init(void);
 extern int kvm_register_clock(char *txt);
+extern void kvm_disable_steal_time(void);
 
 
 /* This instruction is vmcall.  On non-VT architectures, it will generate a
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 33c07b0..58331c2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -51,6 +51,15 @@ static int parse_no_kvmapf(char *arg)
 
early_param("no-kvmapf", parse_no_kvmapf);
 
+static int steal_acc = 1;
+static int parse_no_stealacc(char *arg)
+{
+steal_acc = 0;
+return 0;
+}
+
+early_param("no-steal-acc", parse_no_stealacc);
+
 struct kvm_para_state {
u8 mmu_queue[MMU_QUEUE_SIZE];
int mmu_queue_len;
@@ -58,6 +67,8 @@ struct kvm_para_state {
 
 static DEFINE_PER_CPU(struct kvm_para_state, para_state);
 static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
+static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
+static int has_steal_clock = 0;
 
 static struct kvm_para_state *kvm_para_state(void)
 {
@@ -441,6 +452,21 @@ static void __init paravirt_ops_setup(void)
 #endif
 }
 
+static void kvm_register_steal_time(void)
+{
+   int cpu = smp_processor_id();
+   struct kvm_steal_time *st = &per_cpu(steal_time, cpu);
+
+   if (!has_steal_clock)
+   return;
+
+   memset(st, 0, sizeof(*st));
+
+   wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
+   printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
+   cpu, __pa(st));
+}
+
 void __cpuinit kvm_guest_cpu_init(void)
 {
if (!kvm_para_available())
@@ -457,6 +483,9 @@ void __cpuinit kvm_guest_cpu_init(void)
printk(KERN_INFO "KVM setup async PF for cpu %d\n",
   smp_processor_id());
}
+
+   if (has_steal_clock)
+   kvm_register_steal_time();
 }
 
 static void kvm_pv_disable_apf(void *unused)
@@ -483,6 +512,31 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
 };
 
+static u64 kvm_steal_clock(int cpu)
+{
+   u64 steal;
+   struct kvm_steal_time *src;
+   int version;
+
+   src = &per_cpu(steal_time, cpu);
+   do {
+   version = src->version;
+   rmb();
+   steal = src->steal;
+   rmb();
+   } while ((version & 1) || (version != src->version));
+
+   return steal;
+}
+
+void kvm_disable_steal_time(void)
+{
+   if (!has_steal_clock)
+   return;
+
+   wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -500,6 +554,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
 
 static void kvm_guest_cpu_offline(void *dummy)
 {
+   kvm_disable_steal_time();
kvm_pv_disable_apf(NULL);
apf_task_wake_all();
 }
@@ -548,6 +603,11 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
x86_init.irqs.trap_init = kvm_apf_trap_init;
 
+   if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
+   has_steal_clock = 1;
+   pv_time_ops.steal_clock = kvm_steal_clock;
+   }
+
 #ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;

[PATCH v3 1/9] introduce kvm_read_guest_cached

2011-06-29 Thread Glauber Costa
From: Gleb Natapov g...@redhat.com

Introduce the kvm_read_guest_cached() function, in addition to the write one
we already have.

[ by glauber: export function signature in kvm header ]

Signed-off-by: Gleb Natapov g...@redhat.com
Signed-off-by: Glauber Costa glom...@redhat.com
---
 include/linux/kvm_host.h |2 ++
 virt/kvm/kvm_main.c  |   20 
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 31ebb59..f7df0a3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -381,6 +381,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void 
*data, int offset,
 int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
  unsigned long len);
 int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
+int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+  void *data, unsigned long len);
 int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
 int offset, int len);
 int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 11d2783..d5ef9eb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1418,6 +1418,26 @@ int kvm_write_guest_cached(struct kvm *kvm, struct 
gfn_to_hva_cache *ghc,
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
 
+int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+  void *data, unsigned long len)
+{
+   struct kvm_memslots *slots = kvm_memslots(kvm);
+   int r;
+
+   if (slots->generation != ghc->generation)
+   kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa);
+
+   if (kvm_is_error_hva(ghc->hva))
+   return -EFAULT;
+
+   r = __copy_from_user(data, (void __user *)ghc->hva, len);
+   if (r)
+   return -EFAULT;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_read_guest_cached);
+
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
 {
return kvm_write_guest_page(kvm, gfn, (const void *) empty_zero_page,
-- 
1.7.3.4



[PATCH v3 5/9] KVM-HV: use schedstats to calculate steal time

2011-06-29 Thread Glauber Costa
SCHEDSTATS provides a precise source of information about time tasks
spent on a runqueue, but not running (among other things). It is
especially useful for the steal time implementation, because it doesn't
record halt time at all.

To avoid a hard dependency on schedstats, since it is possible one won't
want to record statistics about all processes running, the previous method
of time measurement on vcpu put/load is kept for !SCHEDSTATS.
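
The schedstats-based computation amounts to this (a sketch lifted from the
hunk below; run_delay is the vcpu thread's cumulative runqueue wait, in ns):

    u64 now = current->sched_info.run_delay;
    u64 delta = now - vcpu->arch.st.this_time_out;  /* stolen since last visit */

    vcpu->arch.st.this_time_out = now;              /* snapshot for next time */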

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
CC: Marcelo Tosatti mtosa...@redhat.com
---
 arch/x86/kvm/x86.c |   22 ++
 1 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index df9d274..7e87159 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1502,7 +1502,13 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
return;
 
-   delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+   if (likely(sched_info_on())) {
+   delta = current->sched_info.run_delay - vcpu->arch.st.this_time_out;
+   vcpu->arch.st.this_time_out = current->sched_info.run_delay;
+   } else 
+#endif
+   delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
 
vcpu->arch.st.steal.steal += delta;
vcpu->arch.st.steal.version += 2;
@@ -1607,9 +1613,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
data & KVM_STEAL_VALID_BITS))
return 1;
 
-   vcpu->arch.st.this_time_out = get_kernel_ns();
-   record_steal_time(vcpu);
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+   if (likely(sched_info_on()))
+   vcpu->arch.st.this_time_out = current->sched_info.run_delay;
+   else
+#endif
+   vcpu->arch.st.this_time_out = get_kernel_ns();
 
+   record_steal_time(vcpu);
break;
 
case MSR_IA32_MCG_CTL:
@@ -2220,7 +2231,10 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
-   vcpu->arch.st.this_time_out = get_kernel_ns();
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+   if (unlikely(!sched_info_on()))
+#endif
+   vcpu->arch.st.this_time_out = get_kernel_ns();
 }
 
 static int is_efer_nx(void)
-- 
1.7.3.4



[PATCH v3 7/9] KVM-GST: KVM Steal time accounting

2011-06-29 Thread Glauber Costa
This patch accounts steal time in kernel/sched.
I kept it from last proposal, because I still see advantages
in it: Doing it here will give us easier access from scheduler
variables such as the cpu rq. The next patch shows an example of
usage for it.

Since functions like account_idle_time() can be called from
multiple places, not only account_process_tick(), steal time
grabbing is repeated in each account function separately.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 kernel/sched.c |   52 
 1 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3f2e502..c166863 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -75,6 +75,7 @@
#include <asm/tlb.h>
#include <asm/irq_regs.h>
#include <asm/mutex.h>
+#include <asm/paravirt.h>

#include "sched_cpupri.h"
#include "workqueue_sched.h"
@@ -528,6 +529,7 @@ struct rq {
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
u64 prev_irq_time;
 #endif
+   u64 prev_steal_time;
 
/* calc_load related fields */
unsigned long calc_load_update;
@@ -1953,6 +1955,48 @@ void account_system_vtime(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(account_system_vtime);
 
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
+static inline u64 steal_ticks(u64 steal)
+{
+   if (unlikely(steal > NSEC_PER_SEC))
+   return steal / TICK_NSEC;
+
+   return __iter_div_u64_rem(steal, TICK_NSEC, steal);
+}
+
+/*
+ * We have to flush steal time information every time something else
+ * is accounted. Since the accounting functions are all visible to the rest
+ * of the kernel, it gets tricky to do them in one place. This helper function
+ * helps us.
+ *
+ * When the system is idle, the concept of steal time does not apply. We just
+ * tell the underlying hypervisor that we grabbed the data, but skip steal time
+ * accounting
+ */
+static noinline bool touch_steal_time(int is_idle)
+{
+   u64 steal, st = 0;
+
+	if (static_branch(&paravirt_steal_enabled)) {
+
+		steal = paravirt_steal_clock(smp_processor_id());
+
+		steal -= this_rq()->prev_steal_time;
+
+		st = steal_ticks(steal);
+		this_rq()->prev_steal_time += st * TICK_NSEC;
+
+   if (is_idle || st == 0)
+   return false;
+
+   account_steal_time(st);
+   return true;
+   }
+   return false;
+}
+
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
s64 irq_delta;
@@ -3716,6 +3760,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
cputime64_t tmp;
 
+   if (touch_steal_time(0))
+   return;
+
/* Add user time to process. */
	p->utime = cputime_add(p->utime, cputime);
	p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
@@ -3802,6 +3849,9 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
cputime64_t *target_cputime64;
 
+   if (touch_steal_time(0))
+   return;
+
	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
account_guest_time(p, cputime, cputime_scaled);
return;
@@ -3839,6 +3889,8 @@ void account_idle_time(cputime_t cputime)
cputime64_t cputime64 = cputime_to_cputime64(cputime);
struct rq *rq = this_rq();
 
+   touch_steal_time(1);
+
	if (atomic_read(&rq->nr_iowait) > 0)
		cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
else
-- 
1.7.3.4



Re: [PATCH v2 3/7] KVM-HV: KVM Steal time implementation

2011-06-28 Thread Glauber Costa

On 06/20/2011 05:56 PM, Marcelo Tosatti wrote:

On Sun, Jun 19, 2011 at 12:57:53PM +0300, Avi Kivity wrote:

On 06/17/2011 01:20 AM, Glauber Costa wrote:

To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
holds the memory area address containing information about steal time

This patch contains the hypervisor part for it. I am keeping it separate from
the headers to facilitate backports to people who want to backport the kernel
part but not the hypervisor, or the other way around.
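
For orientation, the guest half posted separately in this series boils
down to registering a per-cpu area through this MSR. A sketch of that
registration, with the MSR index, enable bit, and helper names taken as
assumptions from the guest-side patches rather than this one:

	#define MSR_KVM_STEAL_TIME  0x4b564d03  /* assumed index */
	#define KVM_MSR_ENABLED     1           /* bit 0: enable */

	static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);

	/* Tell the hypervisor where to deposit this vcpu's steal time. */
	static void kvm_register_steal_time(void)
	{
		struct kvm_steal_time *st = &per_cpu(steal_time, smp_processor_id());

		memset(st, 0, sizeof(*st));
		wrmsrl(MSR_KVM_STEAL_TIME, __pa(st) | KVM_MSR_ENABLED);
	}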



+#define KVM_STEAL_ALIGNMENT_BITS 5
+#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
+#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)


Clumsy, but okay.
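
(Worked out, since the shifts are easy to misread: with
KVM_STEAL_ALIGNMENT_BITS = 5, KVM_STEAL_VALID_BITS is -1ULL << 6, i.e.
bits 6-63, so the steal-time area address must be 64-byte aligned;
KVM_STEAL_RESERVED_MASK is ((1 << 5) - 1) << 1 = 0x3e, bits 1-5, leaving
bit 0 for the enable flag.)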


+static void record_steal_time(struct kvm_vcpu *vcpu)
+{
+   u64 delta;
+
+	if (vcpu->arch.st.stime && vcpu->arch.st.this_time_out) {


0 is a valid value for stime.


+
+		if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
+			&vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
+
+			vcpu->arch.st.stime = 0;
+			return;
+		}
+
+		delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
+
+		vcpu->arch.st.steal.steal += delta;
+		vcpu->arch.st.steal.version += 2;
+
+		if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
+			&vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
+
+			vcpu->arch.st.stime = 0;
+			return;
+		}
+   }
+
+}
+

@@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_migrate_timers(vcpu);
	vcpu->cpu = cpu;
}
+
+   record_steal_time(vcpu);
  }


This records time spent in userspace in the vcpu thread as steal
time.  Is this what we want?  Or just time preempted away?


It also accounts halt time (kvm_vcpu_block) as steal time. Glauber, you
could instead use the runnable-state-but-waiting-in-runqueue field of
SCHEDSTATS, I forgot the exact name.

I thought about it in the past. I set the idea aside because I didn't 
want to introduce a dependency on SCHEDSTATS. But thinking about it 
again now (and after some days of experimentation with it), I think we 
could have both.


use run_delay (the field you were thinking of) when schedstats are 
available, and fall back to an estimate method like the one we're doing 
when it is not.


Objections ?


Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

2011-06-20 Thread Glauber Costa

On 06/20/2011 05:02 AM, Avi Kivity wrote:

On 06/20/2011 10:21 AM, Gleb Natapov wrote:

On Sun, Jun 19, 2011 at 04:02:22PM +0300, Avi Kivity wrote:
 On 06/19/2011 03:59 PM, Gleb Natapov wrote:
 On Sun, Jun 19, 2011 at 03:35:58PM +0300, Avi Kivity wrote:
  On 06/15/2011 12:09 PM, Gleb Natapov wrote:
  
   Actually, I'd expect most read/writes to benefit from caching, no?
   So why don't we just rename kvm_write_guest_cached() to
   kvm_write_guest(), and the few places - if any - that need to force
   traversing of the gfn mappings, get renamed to
   kvm_write_guest_uncached ?
  
  Good idea. I do not see any places where
kvm_write_guest_uncached is
  needed from a brief look. Avi?
  
 
  kvm_write_guest_cached() needs something to supply the cache, and
  needs recurring writes to the same location. Neither of these is
  common (for example, instruction emulation doesn't have either).
 
 Correct. Missed that. So what about changing steal time to use
 kvm_write_guest_cached()?

 Makes sense, definitely. Want to post read_guest_cached() as well?

Glauber, can you write read_guest_cached() as part of your series (should
be trivial), or do you want me to do it? I do not have code to test it
with though :)


Yes.

(you can write it, and Glauber can include it in the series)


Write it, hand me the patch, I'll include it and test it.
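
For reference, the read-side helper under discussion can mirror
kvm_write_guest_cached() almost line for line. A sketch of the shape it
would take, modeled on the existing write-side helper (details subject
to the actual patch):

	int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
				  void *data, unsigned long len)
	{
		struct kvm_memslots *slots = kvm_memslots(kvm);
		int r;

		/* Revalidate the cached gfn->hva translation if the memslot
		 * layout changed since the cache was initialized. */
		if (slots->generation != ghc->generation)
			kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa);

		if (kvm_is_error_hva(ghc->hva))
			return -EFAULT;

		r = __copy_from_user(data, (void __user *)ghc->hva, len);
		if (r)
			return -EFAULT;

		return 0;
	}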



Re: [PATCH v2 5/7] KVM-GST: KVM Steal time accounting

2011-06-19 Thread Glauber Costa

On 06/19/2011 07:04 AM, Avi Kivity wrote:

On 06/17/2011 01:20 AM, Glauber Costa wrote:

This patch accounts steal time in kernel/sched.
I kept it from last proposal, because I still see advantages
in it: Doing it here will give us easier access from scheduler
variables such as the cpu rq. The next patch shows an example of
usage for it.

Since functions like account_idle_time() can be called from
multiple places, not only account_process_tick(), steal time
grabbing is repeated in each account function separately.

/*
+ * We have to flush steal time information every time something else
+ * is accounted. Since the accounting functions are all visible to the rest
+ * of the kernel, it gets tricky to do them in one place. This helper
+ * function helps us.
+ *
+ * When the system is idle, the concept of steal time does not apply. We
+ * just tell the underlying hypervisor that we grabbed the data, but skip
+ * steal time accounting
+ */
+static inline bool touch_steal_time(int is_idle)
+{
+ u64 steal, st = 0;
+
+ if (static_branch(&paravirt_steal_enabled)) {
+
+ steal = paravirt_steal_clock(smp_processor_id());
+
+ steal -= this_rq()->prev_steal_time;
+ if (is_idle) {
+ this_rq()->prev_steal_time += steal;
+ return false;
+ }
+
+ while (steal >= TICK_NSEC) {
+ /*
+ * Inline assembly required to prevent the compiler
+ * optimising this loop into a divmod call.
+ * See __iter_div_u64_rem() for another example of this.
+ */


Why not use said function?


because here we want to do work during each loop iteration. Said function
would have to be adapted for that, possibly using a macro, to run
arbitrary code during each loop iteration, in a way that I don't think
is worthwhile given the current number of callers (2 counting this new one).


+ asm("" : "+rm" (steal));
+
+ steal -= TICK_NSEC;
+ this_rq()->prev_steal_time += TICK_NSEC;
+ st++;


Suppose a live migration or SIGSTOP causes lots of steal time. How long
will we spend here?
Silly me. I actually used this same argument with Peter to cap it with 
delta in the next patch in this series. So I think you are 100% 
right. Here, however, we do want to account all that time, I believe.


How about we do a slow division if we're > 10 sec (unlikely), and 
account everything as steal time in this scenario ?



+ }
+
+ account_steal_time(st);
+ return !!st;


!! not needed, you're returning a bool.


ah, sure thing.


+ }
+ return false;
+}
+


I'll need Peter's (or another sched maintainer's) review to apply this.
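
A sketch of the slow-division fallback floated above, folded into the
loop; the 10-second threshold is the suggestion from this thread, not a
settled constant, and div_u64() is the kernel's 64-bit division helper:

	/* A long SIGSTOP or live migration can leave many seconds of steal
	 * time; bound the loop by accounting such outliers with one real
	 * division instead of one iteration per tick. */
	if (unlikely(steal > 10 * NSEC_PER_SEC)) {
		st = div_u64(steal, TICK_NSEC);
		this_rq()->prev_steal_time += st * TICK_NSEC;
	} else {
		while (steal >= TICK_NSEC) {
			asm("" : "+rm" (steal)); /* defeat divmod optimization */
			steal -= TICK_NSEC;
			this_rq()->prev_steal_time += TICK_NSEC;
			st++;
		}
	}
	account_steal_time(st);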





  1   2   3   4   5   6   7   8   9   10   >