Re: [PATCH 0/4] Alter steal-time reporting in the guest
On 03/06/2013 05:41 AM, Marcelo Tosatti wrote: On Tue, Mar 05, 2013 at 02:22:08PM -0600, Michael Wolf wrote: Sorry for the delay in the response. I did not see the email right away. On Mon, 2013-02-18 at 22:11 -0300, Marcelo Tosatti wrote: On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote: 2013/2/5 Michael Wolf m...@linux.vnet.ibm.com: In the case where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. Sorry, I'm no expert in this area. But I don't really understand what is confusing for the end user here. I suppose that what is wanted is to subtract stolen time due to 'known reasons' from steal time reporting. 'Known reasons' being, for example, hard caps. So a vcpu executing instructions with no halt, but limited to 80% of available bandwidth, would not have 20% of stolen time reported. Yes, exactly, and many times the end user did not set up the guest and is not aware of the capping. The end user is only aware of the performance level that they were told they would get with the guest. But yes, a description of the scenario that is being dealt with, with details, is important. I will add more detail to the description next time I submit the patches. How about something like: "In a cloud environment the user of a kvm guest is not aware of the underlying hardware or how many other guests are running on it. The end user is only aware of a level of performance that they should see." Or does that just muddy the picture more? So what the feature aims for is to report stolen time relative to hard capping. That is: stolen time should be counted as time stolen from the guest _beyond_ hard capping. Yes? Probably don't need to report new data to the guest for that.
If we take into account that 1 second always has one second, I believe that you can just subtract the consigned time from the steal time the host passes to the guest. During each second, the numbers won't sum up to 100. The delta to 100 is the consigned time, if anyone cares. Adopting this would simplify this a lot. All you need to do, really, is to get your calculation right from the bandwidth given by the cpu controller. Subtract it in the host, and voila. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.
On 02/06/2013 10:07 PM, Michael Wolf wrote: On 02/06/2013 08:36 AM, Glauber Costa wrote: On 02/06/2013 01:49 AM, Michael Wolf wrote: Add a helper routine to scheduler/core.c to allow the kvm module to retrieve the cpu hardlimit settings. The values will be used to set up a timer that is used to separate the consigned from the steal time. Sorry: What is the business of a timer in here? Whenever we read steal time, we know how much time has passed and with that information we can know the entitlement for the period. This breaks if we suspend, but we know that we suspended, so this is not a problem. I may be missing something, but how do we know how much time has passed? That is why I had the timer in there. I will go look again at the code but I thought the data was collected as ticks and passed at random times. The ticks are also accumulating so we are looking at the difference in the count between reads. They can be collected at random times, but you can of course record the time in which it happened.
Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.
On 02/06/2013 01:49 AM, Michael Wolf wrote: Add a helper routine to scheduler/core.c to allow the kvm module to retrieve the cpu hardlimit settings. The values will be used to set up a timer that is used to separate the consigned from the steal time. Sorry: What is the business of a timer in here? Whenever we read steal time, we know how much time has passed and with that information we can know the entitlement for the period. This breaks if we suspend, but we know that we suspended, so this is not a problem. Everything bigger than the entitlement is steal time.
Re: [PATCH 0/5] Alter steal time reporting in KVM
I am deeply sorry. I was busy first time I read this, so I postponed answering and ended up forgetting. Sorry.

include/linux/sched.h:

    unsigned long long run_delay; /* time spent waiting on a runqueue */

So if you are out of the runqueue, you won't get steal time accounted, and then I truly fail to understand what you are doing. So I looked at something like this in the past. To make sure things haven't changed I set up a cgroup on my test server running a kernel built from the latest tip tree.

    [root]# cat cpu.cfs_quota_us
    5
    [root]# cat cpu.cfs_period_us
    10
    [root]# cat cpuset.cpus
    1
    [root]# cat cpuset.mems
    0

Next I put the PID from the cpu thread into tasks. When I start a script that will hog the cpu I see the following in top on the guest:

    Cpu(s): 1.9%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 48.3%hi, 0.0%si, 49.8%st

So the steal time here is in line with the bandwidth control settings. Ok. So I was wrong in my hunch that it would be outside the runqueue, therefore work automatically. Still, the host kernel has all the information in cgroups. So then the steal time did not show on the guest. You have no value that needs to be passed around. What I did not like about this approach was:

* It only works for cfs bandwidth control. If another type of hard limit was added to the kernel the code would potentially need to change.

This is true for almost everything we have in the kernel! It is *very* unlikely for other bandwidth control mechanism to ever appear. If it ever does, it's *their* burden to make sure it works for steal time (provided it is merged). Code in tree gets precedence.

* This approach doesn't help if the limits are set by overcommitting the cpus. It is my understanding that this is a common approach.

I can't say anything about commonality, but common or not, it is a *crazy* approach. When you simply overcommit, you have no way to differentiate between intended steal time and non-intended steal time. Moreover, when you overcommit, your cpu usage will vary over time.
If two guests use the cpu to their full power, you will have 50% each. But if one of them slows down, the other gets more. What is your entitlement value? How do you define this? And then after you define it, you end up using more than this, what is your cpu usage? 130%? The only sane way to do it is to communicate this value to the kernel somehow. The bandwidth controller is the interface we have for that. So everybody that wants to *intentionally* overcommit needs to communicate this to the controller. IOW: Any sane configuration should be explicit about your capping. "Add an ioctl to communicate the consign limit to the host." This definitely should go away. More specifically, *whatever* way we use to cap the processor, the host system will have all the information at all times. I'm not understanding that comment. If you are capping by simply controlling the amount of overcommit on the host then wouldn't you still need some value to indicate the desired amount? No, that is just crazy, and I don't like it a single bit. So in the light of it: Whatever capping mechanism we have, we need to be explicit about the expected entitlement. At this point, the kernel already knows what it is, and needs no extra ioctls or anything like that.
Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes
On 11/30/2012 09:55 PM, Pavel Emelyanov wrote: Hello, This is an attempt to implement support for memory snapshot for the checkpoint-restore project (http://criu.org). To create a dump of an application(s) we save all the information about it to files. No surprise, the biggest part of such dump is the contents of tasks' memory. However, in some usage scenarios it's not required to get _all_ the task memory while creating a dump. For example, when doing periodical dumps it's only required to take a full memory dump at the first step and then take incremental changes of memory. Another example is live migration. In the simplest form it looks like -- create dump, copy it on the remote node then restore tasks from dump files. While all this dump-copy-restore thing goes on, all the processes must be stopped. However, if we can monitor how tasks change their memory, we can dump and copy it in smaller chunks, periodically updating it and thus freezing tasks only at the very end for the very short time to pick up the recent changes. That said, some help from kernel to watch how processes modify the contents of their memory is required. I'd like to propose one possible solution of this task -- with the help of page-faults and trace events. Briefly the approach is -- remap some memory regions as read-only, get the #pf on task's attempt to modify the memory and issue a trace event of that. Since we're only interested in parts of memory of some tasks, make it possible to mark the vmas we're interested in and issue events for them only. Also, to be aware of tasks unmapping the vmas being watched, issue an event when the marked vma is removed (and for symmetry -- an event when a vma is marked). What do you think about this approach? Is this way of supporting mem snapshot OK for you, or should we invent some better one? The page fault mechanism is pretty obvious - anything that deals with dirty pages will end up having to do this. So there is nothing crazy about this.
What concerns me, however, is that should this go in, we'll have two dirty mem loggers in the kernel: one to support CRIU, one to support KVM. And the worst part: They have the exact same purpose! So to begin with, I think one thing to consider would be to generalize KVM's dirty memory notification so it can work on a normal process memory region. KVM's API requires a memory slot to be passed, something we are unlikely to have. But KVM can easily keep its API and use an alternate mechanism, that's trivial... Generally speaking, KVM will do polling with this ioctl. I prefer your tracing mechanism better. The only difference is that KVM tends to transfer large chunks of memory in some loads - in the high gigs range. So the proposed tracing API should be able to optionally batch requests within a time frame. It would also be good to hear what the KVM guys think of it as well.
Re: [RFC PATCH 0/2] mm: Add ability to monitor task's memory changes
On 12/04/2012 12:16 AM, Marcelo Tosatti wrote: On Mon, Dec 03, 2012 at 12:36:33PM +0400, Glauber Costa wrote: On 11/30/2012 09:55 PM, Pavel Emelyanov wrote: Hello, This is an attempt to implement support for memory snapshot for the checkpoint-restore project (http://criu.org). To create a dump of an application(s) we save all the information about it to files. No surprise, the biggest part of such dump is the contents of tasks' memory. However, in some usage scenarios it's not required to get _all_ the task memory while creating a dump. For example, when doing periodical dumps it's only required to take a full memory dump at the first step and then take incremental changes of memory. Another example is live migration. In the simplest form it looks like -- create dump, copy it on the remote node then restore tasks from dump files. While all this dump-copy-restore thing goes on, all the processes must be stopped. However, if we can monitor how tasks change their memory, we can dump and copy it in smaller chunks, periodically updating it and thus freezing tasks only at the very end for the very short time to pick up the recent changes. That said, some help from kernel to watch how processes modify the contents of their memory is required. I'd like to propose one possible solution of this task -- with the help of page-faults and trace events. Briefly the approach is -- remap some memory regions as read-only, get the #pf on task's attempt to modify the memory and issue a trace event of that. Since we're only interested in parts of memory of some tasks, make it possible to mark the vmas we're interested in and issue events for them only. Also, to be aware of tasks unmapping the vmas being watched, issue an event when the marked vma is removed (and for symmetry -- an event when a vma is marked). What do you think about this approach? Is this way of supporting mem snapshot OK for you, or should we invent some better one?
The page fault mechanism is pretty obvious - anything that deals with dirty pages will end up having to do this. So there is nothing crazy about this. What concerns me, however, is that should this go in, we'll have two dirty mem loggers in the kernel: one to support CRIU, one to support KVM. And the worst part: They have the exact same purpose! So to begin with, I think one thing to consider would be to generalize KVM's dirty memory notification so it can work on a normal process memory region. KVM's API requires a memory slot to be passed, something we are unlikely to have. But KVM can easily keep its API and use an alternate mechanism, that's trivial... Generally speaking, KVM will do polling with this ioctl. I prefer your tracing mechanism better. The only difference is that KVM tends to transfer large chunks of memory in some loads - in the high gigs range. So the proposed tracing API should be able to optionally batch requests within a time frame. It would also be good to hear what the KVM guys think of it as well. There are significant differences. KVM's dirty logging works for guest translations (NPT/shadow) and is optimized for specific use cases. Above is about dirty logging of userspace memory areas. This is envelope. At the end, KVM is tracking pages, regardless of their format, and wants to know when pages are dirty, and which pages are dirty. What you are saying, for example, is that if Linux had dirty page tracking for userspace memory, kvm could not make use of it? IIRC qemu's migration algorithm, a notification of dirty user pages - plus an eventual extra trickery - would do just fine.
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/27/2012 07:10 PM, Michael Wolf wrote: On 11/27/2012 02:48 AM, Glauber Costa wrote: Hi, On 11/27/2012 12:36 AM, Michael Wolf wrote: In the case where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. If you submit this again, please include a version number in your series. Will do. The patchset was sent twice yesterday by mistake. Got an error the first time and didn't think the patches went out. This has been corrected. It would also be helpful to include a small changelog about what changed between last version and this version, so we could focus on that. Yes, will do that. When I took the RFC off the patches I was looking at it as a new patchset which was a mistake. I will make sure to add a changelog when I submit again. As for the rest, I answered your previous two submissions saying I don't agree with the concept. If you hadn't changed anything, resending it won't change my mind. I could, of course, be mistaken or misguided. But I had also not seen any wave of support in favor of this previously, so basically I have no new data to make me believe I should see it any differently. Let's try this again: * Rik asked you in your last submission how does ppc handle this. You said, and I quote: In the case of lpar on POWER systems they simply report steal time and do not alter it in any way. They do however report how much processor is assigned to the partition and that information is in /proc/ppc64/lparcfg.
Yes, but we still get questions from users asking "what is steal time? why am I seeing this?" Now, that is a *way* more sensible thing to do. Much more. Confusing users is something extremely subjective. This is especially true about concepts that are known for quite some time, like steal time. If you all of a sudden change the meaning of this, it is sure to confuse a lot more users than it would clarify. Something like this could certainly be done. But when I was submitting the patch set as an RFC, qemu was passing a cpu percentage that would be used by the guest kernel to adjust the steal time. This percentage was being stored on the guest as a sysctl value. Avi stated he didn't like that kind of coupling, and that the value could get out of sync. Anthony stated "The guest shouldn't need to know its entitlement. Or at least, it's up to a management tool to report that in a way that's meaningful for the guest." So perhaps I misunderstood what they were suggesting, but I took it to mean that they did not want the guest to know what the entitlement was. That the host should take care of it and just report the already adjusted data to the guest. So in this version of the code the host would use a set period for a timer and be passed essentially a number of ticks of expected steal time. The host would then use the timer to break out the steal time into consigned and steal buckets which would be reported to the guest. Both the consigned and the steal would be reported via /proc/stat. So anyone needing to see total time away could add the two fields together. The user, however, when using tools like top or vmstat would see the usage based on what the guest is entitled to. Do you have suggestions for how I can build consensus around one of the two approaches? Before I answer this, can you please detail which mechanism you are using to enforce the entitlement? Is it the cgroup cpu controller, or something else?
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/28/2012 10:43 PM, Michael Wolf wrote: On 11/27/2012 05:24 PM, Marcelo Tosatti wrote: On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote: In the case where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. The definition of stolen time is 'time during which the virtual CPU is runnable to not running'. Overcommit is the main scenario which steal time helps to detect. Can you describe the 'capped' case? In the capped case, the time that the guest spends waiting due to it having used its full allotment of time shows up as steal time. The way my patchset currently stands is that you would set up the bandwidth control and you would have to pass it a matching value from qemu. In the future, it would be possible to have something parse the bandwidth setting and automatically adjust the setting in the host used for steal time reporting. Ok, so correct me if I am wrong, but I believe you would be using something like the bandwidth capper in the cpu cgroup to set those entitlements, right? Some time has passed since I last looked into it, but IIRC, after you are out of your quota, you should be out of the runqueue. In the lovely world of KVM, we approximate steal time as runqueue time:

arch/x86/kvm/x86.c:

    delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
    vcpu->arch.st.last_steal = current->sched_info.run_delay;
    vcpu->arch.st.accum_steal = delta;

include/linux/sched.h:

    unsigned long long run_delay; /* time spent waiting on a runqueue */

So if you are out of the runqueue, you won't get steal time accounted, and then I truly fail to understand what you are doing. In case I am wrong, and run_delay also includes the time you can't run because you are out of capacity, then maybe what we should do is to just subtract it from run_delay in kvm/x86.c before we pass it on. In summary:

"Alter the amount of steal time reported by the guest." Maybe this should go away.

"Expand the steal time msr to also contain the consigned time." Maybe this should go away.

"Add the code to send the consigned time from the host to the guest." This definitely should be heavily modified.

"Add a timer to allow the separation of consigned from steal time." Maybe this should go away.

"Add an ioctl to communicate the consign limit to the host." This definitely should go away. More specifically, *whatever* way we use to cap the processor, the host system will have all the information at all times.
Re: [PATCH 0/5] Alter steal time reporting in KVM
Hi, On 11/27/2012 12:36 AM, Michael Wolf wrote: In the case where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. If you submit this again, please include a version number in your series. It would also be helpful to include a small changelog about what changed between last version and this version, so we could focus on that. As for the rest, I answered your previous two submissions saying I don't agree with the concept. If you hadn't changed anything, resending it won't change my mind. I could, of course, be mistaken or misguided. But I had also not seen any wave of support in favor of this previously, so basically I have no new data to make me believe I should see it any differently. Let's try this again: * Rik asked you in your last submission how does ppc handle this. You said, and I quote: In the case of lpar on POWER systems they simply report steal time and do not alter it in any way. They do however report how much processor is assigned to the partition and that information is in /proc/ppc64/lparcfg. Now, that is a *way* more sensible thing to do. Much more. Confusing users is something extremely subjective. This is especially true about concepts that are known for quite some time, like steal time. If you all of a sudden change the meaning of this, it is sure to confuse a lot more users than it would clarify.

---
Michael Wolf (5):
  Alter the amount of steal time reported by the guest.
  Expand the steal time msr to also contain the consigned time.
  Add the code to send the consigned time from the host to the guest
  Add a timer to allow the separation of consigned from steal time.
  Add an ioctl to communicate the consign limit to the host.

 arch/x86/include/asm/kvm_host.h       | 11 +++
 arch/x86/include/asm/kvm_para.h       |  3 +-
 arch/x86/include/asm/paravirt.h       |  4 +--
 arch/x86/include/asm/paravirt_types.h |  2 +
 arch/x86/kernel/kvm.c                 |  8 ++---
 arch/x86/kernel/paravirt.c            |  4 +--
 arch/x86/kvm/x86.c                    | 50 -
 fs/proc/stat.c                        |  9 +-
 include/linux/kernel_stat.h           |  2 +
 include/linux/kvm_host.h              |  2 +
 include/uapi/linux/kvm.h              |  2 +
 kernel/sched/core.c                   | 10 ++-
 kernel/sched/cputime.c                | 21 +-
 kernel/sched/sched.h                  |  2 +
 virt/kvm/kvm_main.c                   |  7 +
 15 files changed, 120 insertions(+), 17 deletions(-)
Re: [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration
On 11/20/2012 01:58 AM, Marcelo Tosatti wrote: As requested by Glauber, do not update kvmclock area on vcpu-pcpu migration, in case the host has stable TSC. This is to reduce cacheline bouncing. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kvm/x86.c
===================================================================
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -2615,7 +2615,12 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
 		kvm_x86_ops->write_tsc_offset(vcpu, offset);
 		vcpu->arch.tsc_catchup = 1;
 	}
-	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+	/*
+	 * On a host with synchronized TSC, there is no need to update
+	 * kvmclock on vcpu->cpu migration
+	 */
+	if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 	if (vcpu->cpu != cpu)
 		kvm_migrate_timers(vcpu);
 	vcpu->cpu = cpu;

Ok. Since you are only touching the one in kvm_arch_vcpu_load() and leaving the others untouched, it looks correct. Acked-by: Glauber Costa glom...@parallels.com
Re: [patch 00/18] pvclock vsyscall support + KVM hypervisor support (v5)
On 11/20/2012 01:57 AM, Marcelo Tosatti wrote: This patchset, based on earlier work by Jeremy Fitzhardinge, implements paravirtual clock vsyscall support. It should be possible to implement Xen support relatively easily. It reduces clock_gettime from 500 cycles to 200 cycles on my testbox. There are no more significant objections from my side. I will still try to go through it again today just in case.
Re: [patch 02/18] x86: kvmclock: allocate pvclock shared memory area
On 11/15/2012 04:08 AM, Marcelo Tosatti wrote: We want to expose the pvclock shared memory areas, which the hypervisor periodically updates, to userspace. For a linear mapping from userspace, it is necessary that entire page sized regions are used for array of pvclock structures. There is no such guarantee with per cpu areas, therefore move to memblock_alloc based allocation. Ok. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com For the concept: Acked-by: Glauber Costa glom...@parallels.com I do have one comment.

Index: vsyscall/arch/x86/kernel/kvmclock.c
===================================================================
--- vsyscall.orig/arch/x86/kernel/kvmclock.c
+++ vsyscall/arch/x86/kernel/kvmclock.c
@@ -23,6 +23,7 @@
 #include <asm/apic.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
+#include <linux/memblock.h>
 #include <asm/x86_init.h>
 #include <asm/reboot.h>
@@ -39,7 +40,11 @@ static int parse_no_kvmclock(char *arg)
 early_param("no-kvmclock", parse_no_kvmclock);

 /* The hypervisor will put information about time periodically here */
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
+struct pvclock_aligned_vcpu_time_info {
+	struct pvclock_vcpu_time_info clock;
+} __attribute__((__aligned__(SMP_CACHE_BYTES)));
+
+static struct pvclock_aligned_vcpu_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
@@ -52,15 +57,20 @@ static unsigned long kvm_get_wallclock(v
 	struct pvclock_vcpu_time_info *vcpu_time;
 	struct timespec ts;
 	int low, high;
+	int cpu;
+
+	preempt_disable();
+	cpu = smp_processor_id();

 	low = (int)__pa_symbol(&wall_clock);
 	high = ((u64)__pa_symbol(&wall_clock) >> 32);
 	native_write_msr(msr_kvm_wall_clock, low, high);

-	vcpu_time = &get_cpu_var(hv_clock);
+	vcpu_time = &hv_clock[cpu].clock;

You are leaving preempt disabled for a lot longer than it should be. In particular, you'll have a round trip to the hypervisor in the middle. It doesn't matter *that* much because wallclock is mostly a one-time thing, so I won't oppose on this basis.
But if you have the chance, either fix it, or tell us why you need preemption on for longer than it was before.

 void __init kvmclock_init(void)
 {
+	unsigned long mem;
+
 	if (!kvm_para_available())
 		return;

+	mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info) * NR_CPUS,
+			     PAGE_SIZE);
+	if (!mem)
+		return;
+	hv_clock = __va(mem);
+
 	if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
 		msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
 		msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;

If you don't have kvmclock enabled, which the next line in here would test for, you will exit this function with allocated memory. How about the following?

	if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
		msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
		msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
	} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
		return;

	mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info) * NR_CPUS,
			     PAGE_SIZE);
	if (!mem)
		return;
	hv_clock = __va(mem);

	printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
	       msr_kvm_system_time, msr_kvm_wall_clock);

	if (kvm_register_clock("boot clock")) {
		memblock_free();
		return;
	}
Re: [patch 05/18] x86: pvclock: create helper for pvclock data retrieval
On 11/15/2012 04:08 AM, Marcelo Tosatti wrote: Originally from Jeremy Fitzhardinge. So code can be reused. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com I thought I had acked this one already? But maybe I didn't... Acked-by: Glauber Costa glom...@parallels.com
Re: [patch 06/18] x86: pvclock: introduce helper to read flags
On 11/15/2012 04:08 AM, Marcelo Tosatti wrote: Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/pvclock.c === --- vsyscall.orig/arch/x86/kernel/pvclock.c Acked-by: Glauber Costa glom...@parallels.com
Re: [patch 07/18] x86: pvclock: add note about rdtsc barriers
On 11/15/2012 04:08 AM, Marcelo Tosatti wrote: As noted by Gleb, not advertising SSE2 support implies no RDTSC barriers. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com And this gets a separate patch because?

Index: vsyscall/arch/x86/include/asm/pvclock.h
===================================================================
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -74,6 +74,9 @@ unsigned __pvclock_read_cycles(const str
 	u8 ret_flags;

 	version = src->version;
+	/* Note: emulated platforms which do not advertise SSE2 support
+	 * result in kvmclock not using the necessary RDTSC barriers.
+	 */

And the expected effects are? Will it work in that case? Any precautions one must take? Is it safe for Xen? Is it safe for KVM? Those are the types of things I'd expect to see in a comment for something as subtle as this.

 	rdtsc_barrier();
 	offset = pvclock_get_nsec_offset(src);
 	ret = src->system_time + offset;
Re: [patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration
On 11/15/2012 04:08 AM, Marcelo Tosatti wrote: As requested by Glauber, do not update kvmclock area on vcpu->pcpu migration, in case the host has stable TSC. This is to reduce cacheline bouncing. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com This looks fine, but it can always get tricky... Assuming you tested this change on at least one stable tsc and one unstable tsc system: Acked-by: Glauber Costa glom...@parallels.com
Re: [patch 02/18] x86: kvmclock: allocate pvclock shared memory area (v2)
On 11/16/2012 06:07 AM, Marcelo Tosatti wrote: We want to expose the pvclock shared memory areas, which the hypervisor periodically updates, to userspace. For a linear mapping from userspace, it is necessary that entire page sized regions are used for array of pvclock structures. There is no such guarantee with per cpu areas, therefore move to memblock_alloc based allocation. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com No further objections. Acked-by: Glauber Costa glom...@parallels.com
Re: [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
On 11/02/2012 05:00 PM, Marcelo Tosatti wrote: On Fri, Nov 02, 2012 at 02:23:06PM +0400, Glauber Costa wrote: On 11/02/2012 01:39 AM, Marcelo Tosatti wrote: On Thu, Nov 01, 2012 at 06:28:31PM +0400, Glauber Costa wrote: On 11/01/2012 02:47 AM, Marcelo Tosatti wrote: Allow a guest to register a second location for the VCPU time info structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW). This is intended to allow the guest kernel to map this information into a usermode accessible page, so that usermode can efficiently calculate system time from the TSC without having to make a syscall. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Changelog doesn't make a lot of sense (especially from the first line to the second). Please add in here the reasons why we can't (or decided not to) use the same page. The info in the last mail thread is good enough, just put it here. Fixed. Index: vsyscall/arch/x86/include/asm/kvm_para.h === --- vsyscall.orig/arch/x86/include/asm/kvm_para.h +++ vsyscall/arch/x86/include/asm/kvm_para.h @@ -23,6 +23,7 @@ #define KVM_FEATURE_ASYNC_PF 4 #define KVM_FEATURE_STEAL_TIME 5 #define KVM_FEATURE_PV_EOI 6 +#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7 /* The last 8 bits are used to indicate how to interpret the flags field * in pvclock structure. If no bits are set, all flags are ignored. @@ -39,6 +40,7 @@ #define MSR_KVM_ASYNC_PF_EN 0x4b564d02 #define MSR_KVM_STEAL_TIME 0x4b564d03 #define MSR_KVM_PV_EOI_EN 0x4b564d04 +#define MSR_KVM_USERSPACE_TIME 0x4b564d05 I accept that it is possible that we may be better off with the page mapped twice. But why do we need an extra MSR? When, and why, would you enable the kernel-based pvclock, but disable the userspace pvclock? Because there is no stable TSC available, for example (which cannot be used to measure passage of time). What you say is true, but completely unrelated. I am not talking about situations in which userspace pvclock is available and you end up not using it.
I am talking about situations in which it is available, you are capable of using it, but then decide for some reason to permanently disable it - as in not setting it up altogether. It seems to me that if the host has code to deal with userspace pvclock, and you already coded the guest in a way that you may or may not use it (dependent on the value of the stable bit), you could very well only check for the cpuid flag, and do the guest setup if available - skipping this MSR dance altogether. Now, of course, there is the problem of communicating the address in which the guest expects the page to be. Skipping the MSR setup would require it to be more or less at a fixed location. We could in principle lay them down together with the already existing pvclock structure. (But granted, I am not sure it is worth it...) I think in general, this question deserves a bit more attention. We are about to have just the perfect opportunity for this next week, so let's use it. In essence you are proposing a different interface to communicate the userspace vsyscall pvclock area, other than the MSR, right? If so: 1. What is the problem with the MSR interface. 2. What advantage does this new interface (which honestly I do not understand) provide? No, I am not proposing a different interface to communicate this. I am proposing *no* interface to communicate this. I'll grant that if it is absolutely necessary to have it at a random address, and you need to pass it back and forth, of course MSRs would be the choice. But I am not yet terribly convinced that all the pain of syncing this from userspace, migrating those values, bookkeeping what is on, what is not, is worth the pain.
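Glauber's alternative boils down to feature discovery via cpuid alone. A minimal sketch of the bit test he has in mind, using the feature indices from the kvm_para.h excerpt quoted above (the helper name is mine, not a kernel API; on a real KVM guest the feature word comes from cpuid leaf 0x40000001):

```c
#include <assert.h>
#include <stdint.h>

/* Feature indices from the kvm_para.h excerpt quoted above. */
#define KVM_FEATURE_STEAL_TIME            5
#define KVM_FEATURE_PV_EOI                6
#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7

/* Hypothetical helper: test one feature bit in the EAX value that
 * the KVM paravirt cpuid leaf would return. */
int kvm_para_has_feature(uint32_t features, unsigned int bit)
{
    return (features >> bit) & 1;
}
```

Under this scheme the guest would set up the userspace pvclock area whenever the bit is present, with no MSR handshake.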
Re: [patch 09/16] x86: kvm guest: pvclock vsyscall support
On 11/01/2012 02:47 AM, Marcelo Tosatti wrote: + info = pvclock_get_vsyscall_time_info(cpu); + + low = (int)__pa(info) | 1; + high = ((u64)__pa(per_cpu(hv_clock, cpu)) >> 32); + ret = native_write_msr_safe(MSR_KVM_USERSPACE_TIME, low, high); + printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n", +cpu, high, low, txt); + Why do you put info in the lower half, and the hv_clock in the higher half ?
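Glauber's question stands because the quoted code packs `__pa(info)` into the low word and a different address into the high word. The conventional wrmsr split (EDX:EAX) of a single 64-bit value, with the enable flag kept in bit 0 of the low half as in the quoted code, would look like this sketch (names are mine, not kernel API):

```c
#include <assert.h>
#include <stdint.h>

/* Split one 64-bit physical address into the low/high words a wrmsr
 * would take; bit 0 of the low word carries the enable flag. */
void split_msr(uint64_t pa, uint32_t *low, uint32_t *high)
{
    *low  = (uint32_t)pa | 1;        /* address low bits + enable bit */
    *high = (uint32_t)(pa >> 32);    /* address high bits */
}

/* Recover the (page-aligned) address, dropping the enable bit. */
uint64_t join_msr(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | (low & ~1u);
}
```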
Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
On 10/31/2012 07:12 AM, Marcelo Tosatti wrote: On Tue, Oct 30, 2012 at 11:39:32AM +0200, Avi Kivity wrote: On 10/29/2012 08:40 PM, Marcelo Tosatti wrote: On Mon, Oct 29, 2012 at 10:44:41AM -0700, Jeremy Fitzhardinge wrote: On 10/29/2012 07:45 AM, Glauber Costa wrote: On 10/24/2012 05:13 PM, Marcelo Tosatti wrote: Allow a guest to register a second location for the VCPU time info structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW). This is intended to allow the guest kernel to map this information into a usermode accessible page, so that usermode can efficiently calculate system time from the TSC without having to make a syscall. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Can you please be a bit more specific about why we need this? Why does the host need to provide us with two pages with the exact same data? Why can't we just do it with mapping tricks in the guest? In Xen the pvclock structure is embedded within a pile of other stuff that shouldn't be mapped into guest memory, so providing for a second location allows it to be placed wherever is convenient for the guest. That's a restriction of the Xen ABI, but I don't know if it affects KVM. J It is possible to share the data for KVM in theory, but: - It is a small amount of memory. - It requires aligning to page size (the in-kernel percpu array is currently cacheline aligned). - It is possible to modify flags separately for userspace/kernelspace, if desired. This justifies the duplication IMO (code is simple and clean). What would be the changes required to remove the duplication? If it's just page alignment, then it seems even smaller. In addition we avoid expanding the ABI again. This would require changing the kernel copy from percpu data, which is not guaranteed to be linear (necessary for fixmap mapping), to dynamically allocated (which in turn can be tricky due to the early boot clock requirement). Hum, no thanks. You allocate it using bootmemory for vsyscall anyway.
If they are strictly in the same physical location, you are not allocating anything extra.
Re: [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
On 11/02/2012 01:39 AM, Marcelo Tosatti wrote: On Thu, Nov 01, 2012 at 06:28:31PM +0400, Glauber Costa wrote: On 11/01/2012 02:47 AM, Marcelo Tosatti wrote: Allow a guest to register a second location for the VCPU time info structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW). This is intended to allow the guest kernel to map this information into a usermode accessible page, so that usermode can efficiently calculate system time from the TSC without having to make a syscall. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Changelog doesn't make a lot of sense (especially from the first line to the second). Please add in here the reasons why we can't (or decided not to) use the same page. The info in the last mail thread is good enough, just put it here. Fixed. Index: vsyscall/arch/x86/include/asm/kvm_para.h === --- vsyscall.orig/arch/x86/include/asm/kvm_para.h +++ vsyscall/arch/x86/include/asm/kvm_para.h @@ -23,6 +23,7 @@ #define KVM_FEATURE_ASYNC_PF 4 #define KVM_FEATURE_STEAL_TIME 5 #define KVM_FEATURE_PV_EOI 6 +#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7 /* The last 8 bits are used to indicate how to interpret the flags field * in pvclock structure. If no bits are set, all flags are ignored. @@ -39,6 +40,7 @@ #define MSR_KVM_ASYNC_PF_EN 0x4b564d02 #define MSR_KVM_STEAL_TIME 0x4b564d03 #define MSR_KVM_PV_EOI_EN 0x4b564d04 +#define MSR_KVM_USERSPACE_TIME 0x4b564d05 I accept that it is possible that we may be better off with the page mapped twice. But why do we need an extra MSR? When, and why, would you enable the kernel-based pvclock, but disable the userspace pvclock? Because there is no stable TSC available, for example (which cannot be used to measure passage of time). What you say is true, but completely unrelated. I am not talking about situations in which userspace pvclock is available and you end up not using it.
I am talking about situations in which it is available, you are capable of using it, but then decide for some reason to permanently disable it - as in not setting it up altogether. It seems to me that if the host has code to deal with userspace pvclock, and you already coded the guest in a way that you may or may not use it (dependent on the value of the stable bit), you could very well only check for the cpuid flag, and do the guest setup if available - skipping this MSR dance altogether. Now, of course, there is the problem of communicating the address in which the guest expects the page to be. Skipping the MSR setup would require it to be more or less at a fixed location. We could in principle lay them down together with the already existing pvclock structure. (But granted, I am not sure it is worth it...) I think in general, this question deserves a bit more attention. We are about to have just the perfect opportunity for this next week, so let's use it.
Re: [patch 10/16] x86: vdso: pvclock gettime support
On 11/02/2012 04:33 AM, Marcelo Tosatti wrote: On Thu, Nov 01, 2012 at 07:42:43PM -0200, Marcelo Tosatti wrote: On Thu, Nov 01, 2012 at 06:41:46PM +0400, Glauber Costa wrote: On 11/01/2012 02:47 AM, Marcelo Tosatti wrote: +#ifdef CONFIG_PARAVIRT_CLOCK + +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu) +{ + const aligned_pvti_t *pvti_base; + int idx = cpu / (PAGE_SIZE/PVTI_SIZE); + int offset = cpu % (PAGE_SIZE/PVTI_SIZE); + + BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END); + + pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx); + + return &pvti_base[offset].info; +} + Does BUG_ON() really do what you believe it does while in userspace context? We're not running with the kernel descriptors, so this will probably just kill the process without any explanation. A coredump is generated which can be used to trace back to the ud2a instruction in the vdso code. All comments have been addressed. Let me know if there is anything else on v3 that you'd like to see done differently. Mainly: 1) stick a v3 string in the subject. You didn't do it for v2, and I got confused at some points while looking for the correct patches. 2) The changelogs are, in general, a bit poor. I've pointed out the ones that specifically pop out, but I would appreciate it if you would go over them again, making them more informative. 3) Please make sure Peter is okay with the proposed notifier change. 4) Please consider allocating memory with __alloc_bootmem_node instead.
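The index math in get_pvti() above just linearizes the per-cpu array across fixmap pages: cpu maps to (page index, offset within page). A checkable userspace sketch of the same arithmetic, with illustrative sizes (a 4096-byte page and a 64-byte cacheline-padded entry; these stand in for PAGE_SIZE and PVTI_SIZE):

```c
#include <assert.h>

/* Illustrative stand-ins for PAGE_SIZE and PVTI_SIZE. */
enum { PAGE_SZ = 4096, PVTI_SZ = 64 };

/* Same arithmetic as get_pvti(): which fixmap page holds this cpu's
 * entry, and at which slot within that page. */
void pvti_locate(int cpu, int *page_idx, int *offset)
{
    int per_page = PAGE_SZ / PVTI_SZ;   /* 64 entries per page here */
    *page_idx = cpu / per_page;
    *offset   = cpu % per_page;
}
```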
Re: [patch 01/16] KVM: x86: retain pvclock guest stopped bit in guest memory
On 11/01/2012 02:46 AM, Marcelo Tosatti wrote: Otherwise its possible for an unrelated KVM_REQ_UPDATE_CLOCK (such as due to CPU migration) to clear the bit. Noticed by Paolo Bonzini. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kvm/x86.c === Reviewed-by: Glauber Costa glom...@parallels.com
Re: [patch 02/16] x86: pvclock: make sure rdtsc doesnt speculate out of region
On 11/01/2012 03:48 PM, Gleb Natapov wrote: On Wed, Oct 31, 2012 at 08:46:58PM -0200, Marcelo Tosatti wrote: Originally from Jeremy Fitzhardinge. pvclock_get_time_values, which contains the memory barriers, will be removed by the next patch. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/pvclock.c === --- vsyscall.orig/arch/x86/kernel/pvclock.c +++ vsyscall/arch/x86/kernel/pvclock.c @@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct do { version = pvclock_get_time_values(shadow, src); -barrier(); +rdtsc_barrier(); offset = pvclock_get_nsec_offset(shadow); ret = shadow.system_timestamp + offset; -barrier(); +rdtsc_barrier(); } while (version != src->version); if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) On a guest without SSE2, rdtsc_barrier() will be a nop while rmb() will be lock; addl $0,0(%%esp). I doubt pvclock will work correctly either way though. -- Gleb. Actually it shouldn't matter for KVM, since the page is only updated by the vcpu, and the guest is never running while it happens. If Jeremy is fine with this, so should I.
Re: [patch 03/16] x86: pvclock: remove pvclock_shadow_time
On 11/01/2012 02:46 AM, Marcelo Tosatti wrote: Originally from Jeremy Fitzhardinge. We can copy the information directly from struct pvclock_vcpu_time_info, remove pvclock_shadow_time. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Reviewed-by: Glauber Costa glom...@parallels.com
Re: [patch 04/16] x86: pvclock: create helper for pvclock data retrieval
On 11/01/2012 02:47 AM, Marcelo Tosatti wrote: +static __always_inline +unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, +cycle_t *cycles, u8 *flags) +{ + unsigned version; + cycle_t ret, offset; + u8 ret_flags; + + version = src->version; + rdtsc_barrier(); + offset = pvclock_get_nsec_offset(src); + ret = src->system_time + offset; + ret_flags = src->flags; + rdtsc_barrier(); + + *cycles = ret; + *flags = ret_flags; + return version; +} + This interface is a bit weird. The actual value you are interested in is cycles, so why is it returned through the parameters? I think it would be clearer to have this return cycles, and version as a parameter.
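The helper above implements the usual pvclock seqcount pattern: snapshot the version, read the fields, and retry if the version moved (or is odd, meaning a host update was in flight). A single-threaded userspace model of that loop (struct and names simplified; the real code brackets the reads with rdtsc_barrier(), omitted here):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for pvclock_vcpu_time_info. */
struct pvti_model {
    uint32_t version;      /* odd while the host updates the record */
    uint64_t system_time;
    uint64_t nsec_offset;  /* stands in for the TSC-derived delta */
    uint8_t  flags;
};

/* Retry until a consistent (even, unchanged) version is observed. */
uint64_t read_cycles_model(const struct pvti_model *src, uint8_t *flags)
{
    uint32_t version;
    uint64_t ret;

    do {
        version = src->version;
        ret = src->system_time + src->nsec_offset;
        *flags = src->flags;
    } while ((version & 1) || version != src->version);

    return ret;
}
```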
Re: [patch 05/16] x86: pvclock: introduce helper to read flags
On 11/01/2012 02:47 AM, Marcelo Tosatti wrote: Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/pvclock.c If you are resending this series, I don't see a reason for this one to be in a separate patch. The code itself is fine.
Re: [patch 06/16] sched: add notifier for cross-cpu migrations
On 11/01/2012 02:47 AM, Marcelo Tosatti wrote: Originally from Jeremy Fitzhardinge. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Please collect peterz's ack for this one. Index: vsyscall/include/linux/sched.h === --- vsyscall.orig/include/linux/sched.h +++ vsyscall/include/linux/sched.h @@ -107,6 +107,14 @@ extern unsigned long this_cpu_load(void) extern void calc_global_load(unsigned long ticks); extern void update_cpu_load_nohz(void); +/* Notifier for when a task gets migrated to a new CPU */ +struct task_migration_notifier { + struct task_struct *task; + int from_cpu; + int to_cpu; +}; +extern void register_task_migration_notifier(struct notifier_block *n); + extern unsigned long get_parent_ip(unsigned long addr); struct seq_file; Index: vsyscall/kernel/sched/core.c === --- vsyscall.orig/kernel/sched/core.c +++ vsyscall/kernel/sched/core.c @@ -922,6 +922,13 @@ void check_preempt_curr(struct rq *rq, s rq->skip_clock_update = 1; } +static ATOMIC_NOTIFIER_HEAD(task_migration_notifier); + +void register_task_migration_notifier(struct notifier_block *n) +{ + atomic_notifier_chain_register(&task_migration_notifier, n); +} + #ifdef CONFIG_SMP void set_task_cpu(struct task_struct *p, unsigned int new_cpu) { @@ -952,8 +959,16 @@ void set_task_cpu(struct task_struct *p, trace_sched_migrate_task(p, new_cpu); if (task_cpu(p) != new_cpu) { + struct task_migration_notifier tmn; + p->se.nr_migrations++; perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0); + + tmn.task = p; + tmn.from_cpu = task_cpu(p); + tmn.to_cpu = new_cpu; + + atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn); } __set_task_cpu(p, new_cpu);
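The scheduler hook above is a plain notifier chain: consumers register a callback once, and set_task_cpu() fires the chain with from/to CPU on every migration. A single-threaded userspace model of that register-then-fire flow (no locking, unlike the kernel's atomic notifier chain; all names are mine):

```c
#include <assert.h>

struct migration_event { int from_cpu, to_cpu; };

typedef void (*migration_cb)(struct migration_event *ev);

enum { MAX_CBS = 4 };
migration_cb callbacks[MAX_CBS];
int n_cbs;

/* Models register_task_migration_notifier(). */
void register_migration_cb(migration_cb cb)
{
    if (n_cbs < MAX_CBS)
        callbacks[n_cbs++] = cb;
}

/* Models the atomic_notifier_call_chain() call in set_task_cpu(). */
void fire_migration(int from, int to)
{
    struct migration_event ev = { from, to };
    for (int i = 0; i < n_cbs; i++)
        callbacks[i](&ev);
}

/* Example consumer, standing in for pvclock_task_migrate(). */
int last_from = -1, last_to = -1;
void track(struct migration_event *ev)
{
    last_from = ev->from_cpu;
    last_to   = ev->to_cpu;
}
```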
Re: [patch 07/16] x86: pvclock: generic pvclock vsyscall initialization
On 11/01/2012 02:47 AM, Marcelo Tosatti wrote: Originally from Jeremy Fitzhardinge. Introduce generic, non-hypervisor-specific, pvclock initialization routines. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/pvclock.c === --- vsyscall.orig/arch/x86/kernel/pvclock.c +++ vsyscall/arch/x86/kernel/pvclock.c @@ -17,6 +17,10 @@ #include <linux/kernel.h> #include <linux/percpu.h> +#include <linux/notifier.h> +#include <linux/sched.h> +#include <linux/gfp.h> +#include <linux/bootmem.h> #include <asm/pvclock.h> static u8 valid_flags __read_mostly = 0; @@ -122,3 +126,67 @@ void pvclock_read_wallclock(struct pvclo set_normalized_timespec(ts, now.tv_sec, now.tv_nsec); } + +static aligned_pvti_t *pvclock_vdso_info; Is there a non-aligned pvti? I would also add that pvti is not exactly the most descriptive name I've seen. + +static struct pvclock_vsyscall_time_info * +pvclock_get_vsyscall_user_time_info(int cpu) +{ + if (pvclock_vdso_info == NULL) { + BUG(); + return NULL; + } + + return &pvclock_vdso_info[cpu].info; +} + if (!pvclock_vdso_info) BUG(); return pvclock... +struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu) +{ + return &pvclock_get_vsyscall_user_time_info(cpu)->pvti; +} + +int pvclock_task_migrate(struct notifier_block *nb, unsigned long l, void *v) +{ + struct task_migration_notifier *mn = v; + struct pvclock_vsyscall_time_info *pvti; + + pvti = pvclock_get_vsyscall_user_time_info(mn->from_cpu); + + if (pvti == NULL) + return NOTIFY_DONE; + When is it NULL? IIUC, this is when the vsyscall is disabled, right? Would you mind adding comments for it? Also, this is supposed to be an unlikely branch, right? + pvti->migrate_count++; + + return NOTIFY_DONE; +} + +static struct notifier_block pvclock_migrate = { + .notifier_call = pvclock_task_migrate, +}; + +/* + * Initialize the generic pvclock vsyscall state.
This will allocate + * a/some page(s) for the per-vcpu pvclock information, set up a + * fixmap mapping for the page(s) + */ +int __init pvclock_init_vsyscall(void) +{ + int idx; + unsigned int size = PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE; + + pvclock_vdso_info = __alloc_bootmem(size, PAGE_SIZE, 0); This will be the most critical path for reading pvclock's time data. So why can't we: 1) Make it per-node. 2) Of course, as a consequence, make sure all info structures in the same page are in the same node? + if (!pvclock_vdso_info) + return -ENOMEM; + + memset(pvclock_vdso_info, 0, size); + + for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) { + __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx, + __pa_symbol(pvclock_vdso_info) + (idx*PAGE_SIZE), + PAGE_KERNEL_VVAR); + } + + register_task_migration_notifier(&pvclock_migrate); + + return 0; +} Index: vsyscall/arch/x86/include/asm/fixmap.h === --- vsyscall.orig/arch/x86/include/asm/fixmap.h +++ vsyscall/arch/x86/include/asm/fixmap.h @@ -19,6 +19,7 @@ #include <asm/acpi.h> #include <asm/apicdef.h> #include <asm/page.h> +#include <asm/pvclock.h> #ifdef CONFIG_X86_32 #include <linux/threads.h> #include <asm/kmap_types.h> @@ -81,6 +82,10 @@ enum fixed_addresses { VVAR_PAGE, VSYSCALL_HPET, #endif +#ifdef CONFIG_PARAVIRT_CLOCK + PVCLOCK_FIXMAP_BEGIN, + PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1, +#endif FIX_DBGP_BASE, FIX_EARLYCON_MEM_BASE, #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT Index: vsyscall/arch/x86/include/asm/pvclock.h === --- vsyscall.orig/arch/x86/include/asm/pvclock.h +++ vsyscall/arch/x86/include/asm/pvclock.h @@ -13,6 +13,8 @@ void pvclock_read_wallclock(struct pvclo struct pvclock_vcpu_time_info *vcpu, struct timespec *ts); void pvclock_resume(void); +int __init pvclock_init_vsyscall(void); +struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu); /* * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction, @@ -85,4 +87,17 @@ unsigned __pvclock_read_cycles(const
str return version; } +struct pvclock_vsyscall_time_info { + struct pvclock_vcpu_time_info pvti; + u32 migrate_count; +}; + +typedef union { + struct pvclock_vsyscall_time_info info; + char pad[SMP_CACHE_BYTES]; +} aligned_pvti_t ____cacheline_aligned; Please, just align pvclock_vsyscall_time_info. It is ten thousand times more descriptive.
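The padding union exists so each per-cpu entry occupies a whole cacheline and a fixed number of entries fits per page, which makes PVCLOCK_VSYSCALL_NR_PAGES a simple round-up over the CPU count. The sizing arithmetic, with illustrative values (4096-byte pages, 64-byte padded entries; these are assumptions, not kernel constants):

```c
#include <assert.h>

/* Illustrative stand-ins for PAGE_SIZE and the SMP_CACHE_BYTES-padded
 * entry size. */
enum { PAGE_BYTES = 4096, PADDED_ENTRY = 64 };

/* How many fixmap pages are needed to cover ncpus entries. */
int pvclock_nr_pages(int ncpus)
{
    int per_page = PAGE_BYTES / PADDED_ENTRY;    /* 64 entries here */
    return (ncpus + per_page - 1) / per_page;    /* round up */
}
```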
Re: [patch 08/16] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
On 11/01/2012 02:47 AM, Marcelo Tosatti wrote: Allow a guest to register a second location for the VCPU time info structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW). This is intended to allow the guest kernel to map this information into a usermode accessible page, so that usermode can efficiently calculate system time from the TSC without having to make a syscall. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Changelog doesn't make a lot of sense (especially from the first line to the second). Please add in here the reasons why we can't (or decided not to) use the same page. The info in the last mail thread is good enough, just put it here. Index: vsyscall/arch/x86/include/asm/kvm_para.h === --- vsyscall.orig/arch/x86/include/asm/kvm_para.h +++ vsyscall/arch/x86/include/asm/kvm_para.h @@ -23,6 +23,7 @@ #define KVM_FEATURE_ASYNC_PF 4 #define KVM_FEATURE_STEAL_TIME 5 #define KVM_FEATURE_PV_EOI 6 +#define KVM_FEATURE_USERSPACE_CLOCKSOURCE 7 /* The last 8 bits are used to indicate how to interpret the flags field * in pvclock structure. If no bits are set, all flags are ignored. @@ -39,6 +40,7 @@ #define MSR_KVM_ASYNC_PF_EN 0x4b564d02 #define MSR_KVM_STEAL_TIME 0x4b564d03 #define MSR_KVM_PV_EOI_EN 0x4b564d04 +#define MSR_KVM_USERSPACE_TIME 0x4b564d05 I accept that it is possible that we may be better off with the page mapped twice. But why do we need an extra MSR? When, and why, would you enable the kernel-based pvclock, but disable the userspace pvclock? I believe only the existing MSRs should be used for this. If you write to it, and the host is capable of exporting userspace pvclock information, then it does. If it isn't, then it doesn't. No reason for the extra setup that is only likely to bring more headache.
Index: vsyscall/arch/x86/kvm/cpuid.c === --- vsyscall.orig/arch/x86/kvm/cpuid.c +++ vsyscall/arch/x86/kvm/cpuid.c @@ -411,7 +411,9 @@ static int do_cpuid_ent(struct kvm_cpuid (1 << KVM_FEATURE_CLOCKSOURCE2) | (1 << KVM_FEATURE_ASYNC_PF) | (1 << KVM_FEATURE_PV_EOI) | - (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT); + (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) | + (1 << KVM_FEATURE_USERSPACE_CLOCKSOURCE); + The feature bit itself is obviously fine.
Re: [patch 10/16] x86: vdso: pvclock gettime support
On 11/01/2012 02:47 AM, Marcelo Tosatti wrote: +#ifdef CONFIG_PARAVIRT_CLOCK + +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu) +{ + const aligned_pvti_t *pvti_base; + int idx = cpu / (PAGE_SIZE/PVTI_SIZE); + int offset = cpu % (PAGE_SIZE/PVTI_SIZE); + + BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END); + + pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx); + + return &pvti_base[offset].info; +} + Does BUG_ON() really do what you believe it does while in userspace context? We're not running with the kernel descriptors, so this will probably just kill the process without any explanation.
Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
On 10/29/2012 09:44 PM, Jeremy Fitzhardinge wrote: On 10/29/2012 07:45 AM, Glauber Costa wrote: On 10/24/2012 05:13 PM, Marcelo Tosatti wrote: Allow a guest to register a second location for the VCPU time info structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW). This is intended to allow the guest kernel to map this information into a usermode accessible page, so that usermode can efficiently calculate system time from the TSC without having to make a syscall. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Can you please be a bit more specific about why we need this? Why does the host need to provide us with two pages with the exact same data? Why can't we just do it with mapping tricks in the guest? In Xen the pvclock structure is embedded within a pile of other stuff that shouldn't be mapped into guest memory, so providing for a second location allows it to be placed wherever is convenient for the guest. That's a restriction of the Xen ABI, but I don't know if it affects KVM. J In kvm the exported data seems to be exactly the same. So it makes sense to have a single page exported to the guest. The guest may have a facility that maps the vsyscall clock and the normal clock to a specific location given a page - that can be either a different page or the same page. Any reason why it wouldn't work?
Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
On 10/29/2012 10:40 PM, Marcelo Tosatti wrote: On Mon, Oct 29, 2012 at 10:44:41AM -0700, Jeremy Fitzhardinge wrote: On 10/29/2012 07:45 AM, Glauber Costa wrote: On 10/24/2012 05:13 PM, Marcelo Tosatti wrote: Allow a guest to register a second location for the VCPU time info structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW). This is intended to allow the guest kernel to map this information into a usermode accessible page, so that usermode can efficiently calculate system time from the TSC without having to make a syscall. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Can you please be a bit more specific about why we need this? Why does the host need to provide us with two pages with the exact same data? Why can't we just do it with mapping tricks in the guest? In Xen the pvclock structure is embedded within a pile of other stuff that shouldn't be mapped into guest memory, so providing for a second location allows it to be placed wherever is convenient for the guest. That's a restriction of the Xen ABI, but I don't know if it affects KVM. J It is possible to share the data for KVM in theory, but: - It is a small amount of memory. - It requires aligning to page size (the in-kernel percpu array is currently cacheline aligned). - It is possible to modify flags separately for userspace/kernelspace, if desired. This justifies the duplication IMO (code is simple and clean). Duplicating is indeed not the end of the world. But one note: * If it is page-size aligned, it is automatically cacheline aligned. Since we have to export the user page *anyway*, this is a non-issue. That said, duplicating instead of integrating, which is technically possible, is a design decision, and it needs to be documented somewhere aside from this mail thread.
Re: [patch 11/18] x86: vsyscall: pass mode to gettime backend
On 10/29/2012 10:41 PM, Marcelo Tosatti wrote: On Mon, Oct 29, 2012 at 06:47:57PM +0400, Glauber Costa wrote: On 10/24/2012 05:13 PM, Marcelo Tosatti wrote: Required by next patch. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com I don't see where. + if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT))) + *mode = VCLOCK_NONE; I see. But I end up thinking that this is a case of oversimplification that gets things more complicated =) If this were folded into the next patch, this would be obvious.
Re: [patch 12/18] x86: vdso: pvclock gettime support
On 10/29/2012 10:42 PM, Marcelo Tosatti wrote: On Mon, Oct 29, 2012 at 06:59:35PM +0400, Glauber Costa wrote: On 10/24/2012 05:13 PM, Marcelo Tosatti wrote: Improve performance of time system calls when using Linux pvclock, by reading time info from the fixmap-visible copy of pvclock data. Originally from Jeremy Fitzhardinge. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/vdso/vclock_gettime.c === --- vsyscall.orig/arch/x86/vdso/vclock_gettime.c +++ vsyscall/arch/x86/vdso/vclock_gettime.c @@ -22,6 +22,7 @@ #include <asm/hpet.h> #include <asm/unistd.h> #include <asm/io.h> +#include <asm/pvclock.h> #define gtod (VVAR(vsyscall_gtod_data)) @@ -62,6 +63,69 @@ static notrace cycle_t vread_hpet(void) return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0); } +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL + +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu) +{ + const aligned_pvti_t *pvti_base; + int idx = cpu / (PAGE_SIZE/PVTI_SIZE); + int offset = cpu % (PAGE_SIZE/PVTI_SIZE); + + BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END); + + pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx); + + return &pvti_base[offset].info; +} + Unless I am missing something, if gcc decides not to inline get_pvti, this will break, right? I believe you need to mark that function with __always_inline. Can't see why. Please enlighten me. I can be wrong, I haven't dealt with this vdso code for quite a while - so forgive me if my memory tricked me. But wasn't it the case that vdso functions could not call functions in the kernel address space outside the mapped page? Or does this restriction only apply to accessing data?
Re: [patch 13/18] KVM: x86: pass host_tsc to read_l1_tsc
On 10/29/2012 10:45 PM, Marcelo Tosatti wrote: On Mon, Oct 29, 2012 at 07:04:59PM +0400, Glauber Costa wrote: On 10/24/2012 05:13 PM, Marcelo Tosatti wrote: Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc(). Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Would you mind explaining why? It seems to me that rdtscll() here would be perfectly safe: the only case in which it wouldn't is in a nested-vm environment running paravirt-linux with a paravirt tsc. In this case, it is quite likely that we'll want rdtscll *anyway*, instead of going to the tsc directly. It's something different (from a future patch): KVM added a global variable to guarantee monotonicity in the guest. One of the reasons for that is that the time between 1. ktime_get_ts(&timespec); 2. rdtscll(tsc); is variable. That is, given a host with stable TSC, suppose that two VCPUs read the same time via ktime_get_ts() above. The time required to execute 2. is not the same on those two instances executing in different VCPUS (cache misses, interrupts...). Think of step 1. returning the same value on both vcpus (re-read the explanation above). This still doesn't go to the core of the question. You are replacing rdtscll with native_read_tsc(). They are equivalent most of the time for the host (except in the case I outlined). So the logic you exposed is valid for both. But the real problem is another one: Although I am not very skilled in C, I rock in the Logo programming language (http://en.wikipedia.org/wiki/Logo_(programming_language) ), and with that knowledge, I can understand your C code - with a bit of effort. After reading it, it becomes very clear that what you do is Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc, so your changelog doesn't really add any data. It would be great if it explained to us readers the why instead of the what!
Re: [patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
On 10/24/2012 05:13 PM, Marcelo Tosatti wrote: KVM added a global variable to guarantee monotonicity in the guest. One of the reasons for that is that the time between 1. ktime_get_ts(&timespec); 2. rdtscll(tsc); is variable. That is, given a host with stable TSC, suppose that two VCPUs read the same time via ktime_get_ts() above. The time required to execute 2. is not the same on those two instances executing in different VCPUS (cache misses, interrupts...). If the TSC value that is used by the host to interpolate when calculating the monotonic time is the same value used to calculate the tsc_timestamp value stored in the pvclock data structure, and a single (system_timestamp, tsc_timestamp) tuple is visible to all vcpus simultaneously, this problem disappears. See the comment on top of pvclock_update_vm_gtod_copy for details. Monotonicity is then guaranteed by synchronicity of the host TSCs and guest TSCs. Set the TSC stable pvclock flag in that case, allowing the guest to read the clock from userspace. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com If you are using a master copy, with a stable host-side tsc, you can get rid of the normal REQ_CLOCK updates during vcpu load.
Re: [patch 08/18] x86: pvclock: generic pvclock vsyscall initialization
On 10/24/2012 05:13 PM, Marcelo Tosatti wrote: Index: vsyscall/arch/x86/Kconfig === --- vsyscall.orig/arch/x86/Kconfig +++ vsyscall/arch/x86/Kconfig @@ -632,6 +632,13 @@ config PARAVIRT_SPINLOCKS config PARAVIRT_CLOCK bool +config PARAVIRT_CLOCK_VSYSCALL + bool "Paravirt clock vsyscall support" + depends on PARAVIRT_CLOCK && GENERIC_TIME_VSYSCALL + ---help--- + Enable performance critical clock related system calls to + be executed in userspace, provided that the hypervisor + supports it. endif Besides debugging, what is the point in having this as an extra-selectable? Is there any case in which a virtual machine has code for this, but may decide to run without it? I believe all this code in vsyscall should be wrapped in PARAVIRT_CLOCK only.
Re: [patch 08/18] x86: pvclock: generic pvclock vsyscall initialization
On 10/24/2012 05:13 PM, Marcelo Tosatti wrote: + */ +int __init pvclock_init_vsyscall(void) +{ + int idx; + unsigned int size = PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE; + + pvclock_vdso_info = __alloc_bootmem(size, PAGE_SIZE, 0); + if (!pvclock_vdso_info) + return -ENOMEM; + + memset(pvclock_vdso_info, 0, size); + + for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) { + __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx, + __pa_symbol(pvclock_vdso_info) + (idx*PAGE_SIZE), + PAGE_KERNEL_VVAR); BTW, the previous line is whitespace damaged.
Re: [patch 09/18] KVM: x86: introduce facility to support vsyscall pvclock, via MSR
On 10/24/2012 05:13 PM, Marcelo Tosatti wrote: Allow a guest to register a second location for the VCPU time info structure for each vcpu (as described by MSR_KVM_SYSTEM_TIME_NEW). This is intended to allow the guest kernel to map this information into a usermode accessible page, so that usermode can efficiently calculate system time from the TSC without having to make a syscall. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Can you please be a bit more specific about why we need this? Why does the host need to provide us with two pages with the exact same data? Why can't we just do it with mapping tricks in the guest?
Re: [patch 11/18] x86: vsyscall: pass mode to gettime backend
On 10/24/2012 05:13 PM, Marcelo Tosatti wrote: Required by next patch. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com I don't see where.
Re: [patch 12/18] x86: vdso: pvclock gettime support
On 10/24/2012 05:13 PM, Marcelo Tosatti wrote: Improve performance of time system calls when using Linux pvclock, by reading time info from the fixmap-visible copy of pvclock data. Originally from Jeremy Fitzhardinge. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/vdso/vclock_gettime.c === --- vsyscall.orig/arch/x86/vdso/vclock_gettime.c +++ vsyscall/arch/x86/vdso/vclock_gettime.c @@ -22,6 +22,7 @@ #include <asm/hpet.h> #include <asm/unistd.h> #include <asm/io.h> +#include <asm/pvclock.h> #define gtod (&VVAR(vsyscall_gtod_data)) @@ -62,6 +63,69 @@ static notrace cycle_t vread_hpet(void) return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0); } +#ifdef CONFIG_PARAVIRT_CLOCK_VSYSCALL + +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu) +{ + const aligned_pvti_t *pvti_base; + int idx = cpu / (PAGE_SIZE/PVTI_SIZE); + int offset = cpu % (PAGE_SIZE/PVTI_SIZE); + + BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END); + + pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx); + + return &pvti_base[offset].info; +} + Unless I am missing something, if gcc decides to not inline get_pvti, this will break, right? I believe you need to mark that function with __always_inline. +static notrace cycle_t vread_pvclock(int *mode) +{ + const struct pvclock_vsyscall_time_info *pvti; + cycle_t ret; + u64 last; + u32 version; + u32 migrate_count; + u8 flags; + unsigned cpu, cpu1; + + + /* + * When looping to get a consistent (time-info, tsc) pair, we + * also need to deal with the possibility we can switch vcpus, + * so make sure we always re-fetch time-info for the current vcpu. + */ + do { + cpu = __getcpu() & 0xfff; Please wrap this 0xfff into something meaningful. + pvti = get_pvti(cpu); + + migrate_count = pvti->migrate_count; + + version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags); + + /* + * Test we're still on the cpu as well as the version. + * We could have been migrated just after the first + * vgetcpu but before fetching the version, so we + * wouldn't notice a version change. + */ + cpu1 = __getcpu() & 0xfff; + } while (unlikely(cpu != cpu1 || + (pvti->pvti.version & 1) || + pvti->pvti.version != version || + pvti->migrate_count != migrate_count)); + + if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT))) + *mode = VCLOCK_NONE; + + last = VVAR(vsyscall_gtod_data).clock.cycle_last; + + if (likely(ret >= last)) + return ret; + Please add a comment here referring to tsc.c, where an explanation of this test lives. This is quite non-obvious for the non-initiated.
Re: [patch 13/18] KVM: x86: pass host_tsc to read_l1_tsc
On 10/24/2012 05:13 PM, Marcelo Tosatti wrote: Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc(). Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Would you mind explaining why? It seems to me that rdtscll() here would be perfectly safe: the only case in which it wouldn't is in a nested-vm environment running paravirt-linux with a paravirt tsc. In this case, it is quite likely that we'll want rdtscll *anyway*, instead of going to the tsc directly.
Re: [PATCH RFC V2 0/5] Separate consigned (expected steal) from steal time.
On 10/17/2012 06:23 AM, Michael Wolf wrote: In the case where you have a system running in a capped or overcommitted environment, the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. TODO: * Change native_clock to take params and not return a value * Change update_rq_clock_task Changes from V1: * Removed the steal time allowed percentage from the guest * Moved the separation of consigned (expected steal) and steal time to the host. * No longer include a sysctl interface. You are showing this in the guest somewhere, but tools like top will still not show it. So for quite a while, it achieves nothing. Of course this is a barrier that any new statistic has to go through. So while annoying, this is per se ultimately not a blocker. What I still fail to see is how this is useful information to be shown in the guest. Honestly, if I'm in a guest VM or container, any time during which I am not running is time I lost. It doesn't matter if this was expected or not. This still seems to me a host-side problem, to be solved entirely by tooling.
Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting
On 08/27/2012 08:50 AM, Michael Wolf wrote: On Sat, 2012-08-25 at 19:36 -0400, Glauber Costa wrote: On 08/24/2012 11:11 AM, Michael Wolf wrote: On Fri, 2012-08-24 at 08:53 +0400, Glauber Costa wrote: On 08/24/2012 03:14 AM, Michael Wolf wrote: This is an RFC regarding the reporting of stealtime. In the case where you have a system running with partial processors such as KVM, the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds a sysctl interface to set the cpu entitlement. This is the percentage of cpu that the guest system is expected to receive. As long as the steal time is within its expected range it will show up as 0 in /proc/stat. The user will then see in the accounting tools that they are getting a full utilization of the cpu resources assigned to them. And how is such a knob not confusing? Steal time is pretty well defined in meaning and has been shown in top for ages. I really don't see the point for this. Currently you can see the steal time but you have no way of knowing if the cpu utilization you are seeing on the guest is the expected amount. I decided on making it a knob because a guest could be migrated to another system and its entitlement could change because of hardware or load differences. It could simply be a /proc file and report the current entitlement if needed. As things are currently implemented I don't see how someone knows if the guest is running as expected or whether there is a problem. Turning off steal time display won't get even close to displaying the information you want. What you probably want is a guest-visible way to say how many milliseconds you are expected to run each second. Right? It is not clear to me how knowing how many milliseconds you are expecting to run will help the user. Currently the users will run top to see how well the guest is running.
If they see _any_ steal time, some users think they are not getting the full use of their processor entitlement. And your plan is just to selectively lie about it, by disabling it with a knob? Maybe I'm missing what you are proposing, but even if you knew the milliseconds that you were expecting for your processor, you would have to adjust the top output in your head, so to speak. You would see the utilization and then say 'ok that matches the number of milliseconds I expected to run... If we take away the steal time (as long as it is equal to or less than the expected amount of steal time) then the user running top will see the 100% utilization.
Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting
On 08/27/2012 01:19 PM, Michael Wolf wrote: On Mon, 2012-08-27 at 11:50 -0700, Glauber Costa wrote: On 08/27/2012 08:50 AM, Michael Wolf wrote: On Sat, 2012-08-25 at 19:36 -0400, Glauber Costa wrote: On 08/24/2012 11:11 AM, Michael Wolf wrote: On Fri, 2012-08-24 at 08:53 +0400, Glauber Costa wrote: On 08/24/2012 03:14 AM, Michael Wolf wrote: This is an RFC regarding the reporting of stealtime. In the case where you have a system running with partial processors such as KVM, the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds a sysctl interface to set the cpu entitlement. This is the percentage of cpu that the guest system is expected to receive. As long as the steal time is within its expected range it will show up as 0 in /proc/stat. The user will then see in the accounting tools that they are getting a full utilization of the cpu resources assigned to them. And how is such a knob not confusing? Steal time is pretty well defined in meaning and has been shown in top for ages. I really don't see the point for this. Currently you can see the steal time but you have no way of knowing if the cpu utilization you are seeing on the guest is the expected amount. I decided on making it a knob because a guest could be migrated to another system and its entitlement could change because of hardware or load differences. It could simply be a /proc file and report the current entitlement if needed. As things are currently implemented I don't see how someone knows if the guest is running as expected or whether there is a problem. Turning off steal time display won't get even close to displaying the information you want. What you probably want is a guest-visible way to say how many milliseconds you are expected to run each second. Right? It is not clear to me how knowing how many milliseconds you are expecting to run will help the user.
Currently the users will run top to see how well the guest is running. If they see _any_ steal time, some users think they are not getting the full use of their processor entitlement. And your plan is just to selectively lie about it, by disabling it with a knob? It is about making it very obvious to the end user whether they are receiving their cpu entitlement. If there is more steal time than expected, that will still show up. I have experimented, and it seems to work, with putting the raw stealtime at the end of each cpu line in /proc/stat. That way the raw data is there as well. Do you have another suggestion to communicate to the user whether they are receiving their full entitlement? At the very least shouldn't the entitlement reside in a /proc file somewhere so that the user could look up the value and do the math? I personally believe Avi is right here. This is something to be done at the host side. The user can learn this from any tool he is using to manage his VMs. Now if you absolutely must inform him from inside the guest, I would go with the latter, informing him in another location. (I am not saying I agree with this, just that this is less worse)
Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting
On 08/27/2012 02:27 PM, Michael Wolf wrote: On Mon, 2012-08-27 at 13:31 -0700, Avi Kivity wrote: On 08/27/2012 01:23 PM, Michael Wolf wrote: How would a guest know what its entitlement is? Currently the Admin/management tool setting up the guests will put it on the qemu commandline. From this it is passed via an ioctl to the host. The guest will get the value from the host via a hypercall. In the future the host could try and do some of it automatically in some cases. Seems to me it's a meaningless value for the guest. Suppose it is migrated to a host that is more powerful, and as a result its relative entitlement is reduced. The value needs to be adjusted. This is why I chose to manage the value from the sysctl interface rather than just have it stored as a value in /proc. Whatever tool was used to migrate the vm could hopefully adjust the sysctl value on the guest. This is best taken care of from the host side. Not sure what you are getting at here. If you are running in a cloud environment, you purchase a VM with the understanding that you are getting certain resources. As this type of user I don't believe you have any access to the host to see this type of information. So the user still wouldn't have a way to confirm that they are receiving what they should be in the way of processor resources. Would you please elaborate a little more on this? What do you mean they have no access to the host? They have access to all sorts of tools that display information from the host. Speaking of a view-only resource, those are strictly equivalent.
Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting
On 08/24/2012 11:11 AM, Michael Wolf wrote: On Fri, 2012-08-24 at 08:53 +0400, Glauber Costa wrote: On 08/24/2012 03:14 AM, Michael Wolf wrote: This is an RFC regarding the reporting of stealtime. In the case where you have a system running with partial processors such as KVM, the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds a sysctl interface to set the cpu entitlement. This is the percentage of cpu that the guest system is expected to receive. As long as the steal time is within its expected range it will show up as 0 in /proc/stat. The user will then see in the accounting tools that they are getting a full utilization of the cpu resources assigned to them. And how is such a knob not confusing? Steal time is pretty well defined in meaning and has been shown in top for ages. I really don't see the point for this. Currently you can see the steal time but you have no way of knowing if the cpu utilization you are seeing on the guest is the expected amount. I decided on making it a knob because a guest could be migrated to another system and its entitlement could change because of hardware or load differences. It could simply be a /proc file and report the current entitlement if needed. As things are currently implemented I don't see how someone knows if the guest is running as expected or whether there is a problem. Turning off steal time display won't get even close to displaying the information you want. What you probably want is a guest-visible way to say how many milliseconds you are expected to run each second. Right?
Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting
On 08/24/2012 03:14 AM, Michael Wolf wrote: This is an RFC regarding the reporting of stealtime. In the case where you have a system running with partial processors such as KVM, the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds a sysctl interface to set the cpu entitlement. This is the percentage of cpu that the guest system is expected to receive. As long as the steal time is within its expected range it will show up as 0 in /proc/stat. The user will then see in the accounting tools that they are getting a full utilization of the cpu resources assigned to them. And how is such a knob not confusing? Steal time is pretty well defined in meaning and has been shown in top for ages. I really don't see the point for this.
Re: [PATCH v5 4/9] KVM-HV: KVM Steal time implementation
On 07/11/2011 10:11 AM, Avi Kivity wrote: On 07/11/2011 04:10 PM, Avi Kivity wrote: I think Peter acked this, yes? Please include his acks. Sorry, I meant this for the kernel/sched.c bits. Yes he did, after I posted the patches. I will include his acks in the respin, since they won't change the scheduler bits.
Re: [PATCH v5 6/9] add jump labels for ia64 paravirt
On 07/11/2011 10:09 AM, Avi Kivity wrote: On 07/04/2011 06:32 PM, Glauber Costa wrote: Since in a later patch I intend to call jump labels inside CONFIG_PARAVIRT, IA64 would fail to compile if they are not provided. This patch provides those jump labels for the IA64 architecture. Please get an ack for this from the ia64 maintainer. The ones listed in paravirt.c are CC'd. I am CC'ing all the other folks related to IA64 listed in MAINTAINERS now. Just a heads up so the new folks CC'd can get up to speed: I am proposing moving the steal time calculation to inside the core scheduler. This move will make it easier for us to make decisions based on steal time accounting on virtual machines. Ia64 KVM may or may not follow this route - nothing that already works should break with this change. This patch is needed only to avoid breaking compilation, since it introduces to paravirt.c two new variables that are expected to be present under CONFIG_PARAVIRT.
Re: [PATCH v5 4/9] KVM-HV: KVM Steal time implementation
On 07/11/2011 09:58 AM, Avi Kivity wrote: On 07/07/2011 08:07 PM, Glauber Costa wrote: +static void record_steal_time(struct kvm_vcpu *vcpu) +{ + u64 delta; + + if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) + return; + + if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime, + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) + return; The guest memory page is not pinned, sleeping via __copy_from_user/to_user is not allowed in vcpu_load context. Either pin it or use atomic acessors. I do recognize the problem. Avi, what's your take here? The easiest solution is to set a KVM_REQ bit in atomic context, and move the sleepy code to vcpu_enter_guest(). Or I can move it all inside vcpu_run, or close enough to it. This will account more hypervisor time as steal time, but it seemed to be what some people wanted in the first place. Given the simplification we would win - not needing a REQ set, it might be worth it. + case MSR_KVM_STEAL_TIME: + vcpu->arch.st.msr_val = data; + + if (!(data & KVM_MSR_ENABLED)) { + break; + } On failure below this point, msr_val should be cleared of KVM_MSR_ENABLED? No, msr_val has to hold whatever the guest wrote into it. We should probably use an independent variable here to indicate that we failed to activate it. If we fail, we return a #GP to the guest (and don't write any value into the msr).
[PATCH v6 2/9] KVM-HDR Add constant to represent KVM MSRs enabled bit
This patch is simple, put in a different commit so it can be more easily shared between guest and hypervisor. It just defines a named constant to indicate the enable bit for KVM-specific MSRs. Signed-off-by: Glauber Costa glom...@redhat.com Acked-by: Rik van Riel r...@redhat.com Tested-by: Eric B Munson emun...@mgebm.net CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com --- arch/x86/include/asm/kvm_para.h |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index a427bf7..d6cd79b 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -30,6 +30,7 @@ #define MSR_KVM_WALL_CLOCK 0x11 #define MSR_KVM_SYSTEM_TIME 0x12 +#define KVM_MSR_ENABLED 1 /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */ #define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01 -- 1.7.3.4
[PATCH v6 7/9] KVM-GST: KVM Steal time accounting
This patch accounts steal time in account_process_tick. If one or more ticks are considered stolen in the current accounting cycle, user/system accounting is skipped. Idle is fine, since the hypervisor does not report steal time if the guest is halted. Accounting steal time from the core scheduler gives us the advantage of direct access to the runqueue data. In a later opportunity, it can be used to tweak cpu power and make the scheduler aware of the time it lost. Signed-off-by: Glauber Costa glom...@redhat.com Acked-by: Rik van Riel r...@redhat.com Acked-by: Peter Zijlstra pet...@infradead.org Tested-by: Eric B Munson emun...@mgebm.net CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com --- kernel/sched.c | 41 + 1 files changed, 41 insertions(+), 0 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index 3f2e502..aa6c030 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -75,6 +75,7 @@ #include <asm/tlb.h> #include <asm/irq_regs.h> #include <asm/mutex.h> +#include <asm/paravirt.h> #include "sched_cpupri.h" #include "workqueue_sched.h" @@ -528,6 +529,9 @@ struct rq { #ifdef CONFIG_IRQ_TIME_ACCOUNTING u64 prev_irq_time; #endif +#ifdef CONFIG_PARAVIRT + u64 prev_steal_time; +#endif /* calc_load related fields */ unsigned long calc_load_update; @@ -1953,6 +1957,18 @@ void account_system_vtime(struct task_struct *curr) } EXPORT_SYMBOL_GPL(account_system_vtime); +#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ + +#ifdef CONFIG_PARAVIRT +static inline u64 steal_ticks(u64 steal) +{ + if (unlikely(steal > NSEC_PER_SEC)) + return div_u64(steal, TICK_NSEC); + + return __iter_div_u64_rem(steal, TICK_NSEC, &steal); +} +#endif + static void update_rq_clock_task(struct rq *rq, s64 delta) { s64 irq_delta; @@ -3845,6 +3861,25 @@ void account_idle_time(cputime_t cputime) cpustat->idle = cputime64_add(cpustat->idle, cputime64); } +static __always_inline bool steal_account_process_tick(void) +{ +#ifdef CONFIG_PARAVIRT + if (static_branch(&paravirt_steal_enabled)) { + u64 steal, st = 0; + + steal = paravirt_steal_clock(smp_processor_id()); + steal -= this_rq()->prev_steal_time; + + st = steal_ticks(steal); + this_rq()->prev_steal_time += st * TICK_NSEC; + + account_steal_time(st); + return st; + } +#endif + return false; +} + #ifndef CONFIG_VIRT_CPU_ACCOUNTING #ifdef CONFIG_IRQ_TIME_ACCOUNTING @@ -3876,6 +3911,9 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick, cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy); struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + if (steal_account_process_tick()) + return; + if (irqtime_account_hi_update()) { cpustat->irq = cputime64_add(cpustat->irq, tmp); } else if (irqtime_account_si_update()) { @@ -3929,6 +3967,9 @@ void account_process_tick(struct task_struct *p, int user_tick) return; } + if (steal_account_process_tick()) + return; + if (user_tick) account_user_time(p, cputime_one_jiffy, one_jiffy_scaled); else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET)) -- 1.7.3.4
[PATCH v6 4/9] KVM-HV: KVM Steal time implementation
To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM, while the vcpu had meaningful work to do - halt time does not count. This information is acquired through the run_delay field of delayacct/schedstats infrastructure, that counts time spent in a runqueue but not running. Steal time is a per-cpu information, so the traditional MSR-based infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the memory area address containing information about steal time This patch contains the hypervisor part of the steal time infrasructure, and can be backported independently of the guest portion. Signed-off-by: Glauber Costa glom...@redhat.com Tested-by: Eric B Munson emun...@mgebm.net CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com --- arch/x86/include/asm/kvm_host.h |9 + arch/x86/include/asm/kvm_para.h |4 ++ arch/x86/kvm/Kconfig|1 + arch/x86/kvm/x86.c | 74 +- include/linux/kvm_host.h|1 + 5 files changed, 87 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index da6bbee..59086a7 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -389,6 +389,15 @@ struct kvm_vcpu_arch { unsigned int hw_tsc_khz; unsigned int time_offset; struct page *time_page; + + struct { + u64 msr_val; + u64 last_steal; + u64 accum_steal; + struct gfn_to_hva_cache stime; + struct kvm_steal_time steal; + } st; + u64 last_guest_tsc; u64 last_kernel_ns; u64 last_tsc_nsec; diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index 65f8bb9..c484ba8 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -45,6 +45,10 @@ struct kvm_steal_time { __u32 pad[12]; }; +#define KVM_STEAL_ALIGNMENT_BITS 5 +#define 
KVM_STEAL_VALID_BITS ((-1ULL (KVM_STEAL_ALIGNMENT_BITS + 1))) +#define KVM_STEAL_RESERVED_MASK (((1 KVM_STEAL_ALIGNMENT_BITS) - 1 ) 1) + #define KVM_MAX_MMU_OP_BATCH 32 #define KVM_ASYNC_PF_ENABLED (1 0) diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 50f6364..99c3f05 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -31,6 +31,7 @@ config KVM select KVM_ASYNC_PF select USER_RETURN_NOTIFIER select KVM_MMIO + select TASK_DELAY_ACCT ---help--- Support hosting fully virtualized guest machines using hardware virtualization extensions. You will need a fairly recent diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7167717..6282f6c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -808,12 +808,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr); * kvm-specific. Those are put in the beginning of the list. */ -#define KVM_SAVE_MSRS_BEGIN8 +#define KVM_SAVE_MSRS_BEGIN9 static u32 msrs_to_save[] = { MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW, HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, - HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, + HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME, MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, MSR_STAR, #ifdef CONFIG_X86_64 @@ -1491,6 +1491,35 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu) } } +static void accumulate_steal_time(struct kvm_vcpu *vcpu) +{ + u64 delta; + + if (!(vcpu-arch.st.msr_val KVM_MSR_ENABLED)) + return; + + delta = current-sched_info.run_delay - vcpu-arch.st.last_steal; + vcpu-arch.st.last_steal = current-sched_info.run_delay; + vcpu-arch.st.accum_steal = delta; +} + +static void record_steal_time(struct kvm_vcpu *vcpu) +{ + if (!(vcpu-arch.st.msr_val KVM_MSR_ENABLED)) + return; + + if (unlikely(kvm_read_guest_cached(vcpu-kvm, vcpu-arch.st.stime, + vcpu-arch.st.steal, sizeof(struct kvm_steal_time + return; + + vcpu-arch.st.steal.steal += vcpu-arch.st.accum_steal; + 
vcpu-arch.st.steal.version += 2; + vcpu-arch.st.accum_steal = 0; + + kvm_write_guest_cached(vcpu-kvm, vcpu-arch.st.stime, + vcpu-arch.st.steal, sizeof(struct kvm_steal_time)); +} + int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) { switch (msr) { @@ -1573,6 +1602,33 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) if (kvm_pv_enable_async_pf(vcpu, data)) return 1; break
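The hypervisor-side flow in this patch — accumulate the run_delay delta on each sample, then publish it into the guest area under an even/odd version counter — can be sketched as a small user-space model. This is illustrative only (names, types, and the `+=` accumulation are simplifications, not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of the host-side steal-time bookkeeping.
 * run_delay is the cumulative time (ns) the vcpu task sat on a
 * runqueue without running; we accumulate the delta since the last
 * sample, then publish it under an even/odd version counter. */
struct steal_time_area {
    uint64_t steal;    /* total stolen ns, as seen by the guest */
    uint32_t version;  /* even = stable, odd = update in progress */
};

struct vcpu_steal_state {
    uint64_t last_steal;   /* run_delay at the previous sample */
    uint64_t accum_steal;  /* sampled but not yet written out */
};

static void accumulate_steal_time(struct vcpu_steal_state *st,
                                  uint64_t run_delay)
{
    st->accum_steal += run_delay - st->last_steal;
    st->last_steal = run_delay;
}

static void record_steal_time(struct vcpu_steal_state *st,
                              struct steal_time_area *area)
{
    area->version++;            /* odd: readers must retry */
    area->steal += st->accum_steal;
    st->accum_steal = 0;
    area->version++;            /* even again: stable */
}
```

The two-step version bump is what lets the guest read the 64-bit `steal` field without locks: a reader that observes an odd version, or a version that changed across its read, simply retries.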
[PATCH v6 9/9] KVM-GST: KVM Steal time registration
This patch implements the kvm bits of the steal time infrastructure. The most important part of it, is the steal time clock. It is an continuous clock that shows the accumulated amount of steal time since vcpu creation. It is supposed to survive cpu offlining/onlining. Signed-off-by: Glauber Costa glom...@redhat.com Acked-by: Rik van Riel r...@redhat.com Tested-by: Eric B Munson emun...@mgebm.net CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com --- Documentation/kernel-parameters.txt |4 ++ arch/x86/include/asm/kvm_para.h |1 + arch/x86/kernel/kvm.c | 73 +++ arch/x86/kernel/kvmclock.c |2 + 4 files changed, 80 insertions(+), 0 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index fd248a31..a722574 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1737,6 +1737,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted. no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page fault handling. + no-steal-acc[X86,KVM] Disable paravirtualized steal time accounting. + steal time is computed, but won't influence scheduler + behaviour + nolapic [X86-32,APIC] Do not enable or use the local APIC. nolapic_timer [X86-32,APIC] Do not use the local APIC timer. diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index c484ba8..35d732d 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -94,6 +94,7 @@ struct kvm_vcpu_pv_apf_data { extern void kvmclock_init(void); extern int kvm_register_clock(char *txt); +extern void kvm_disable_steal_time(void); /* This instruction is vmcall. 
On non-VT architectures, it will generate a diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 33c07b0..58331c2 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -51,6 +51,15 @@ static int parse_no_kvmapf(char *arg) early_param(no-kvmapf, parse_no_kvmapf); +static int steal_acc = 1; +static int parse_no_stealacc(char *arg) +{ +steal_acc = 0; +return 0; +} + +early_param(no-steal-acc, parse_no_stealacc); + struct kvm_para_state { u8 mmu_queue[MMU_QUEUE_SIZE]; int mmu_queue_len; @@ -58,6 +67,8 @@ struct kvm_para_state { static DEFINE_PER_CPU(struct kvm_para_state, para_state); static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64); +static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64); +static int has_steal_clock = 0; static struct kvm_para_state *kvm_para_state(void) { @@ -441,6 +452,21 @@ static void __init paravirt_ops_setup(void) #endif } +static void kvm_register_steal_time(void) +{ + int cpu = smp_processor_id(); + struct kvm_steal_time *st = per_cpu(steal_time, cpu); + + if (!has_steal_clock) + return; + + memset(st, 0, sizeof(*st)); + + wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED)); + printk(KERN_INFO kvm-stealtime: cpu %d, msr %lx\n, + cpu, __pa(st)); +} + void __cpuinit kvm_guest_cpu_init(void) { if (!kvm_para_available()) @@ -457,6 +483,9 @@ void __cpuinit kvm_guest_cpu_init(void) printk(KERN_INFOKVM setup async PF for cpu %d\n, smp_processor_id()); } + + if (has_steal_clock) + kvm_register_steal_time(); } static void kvm_pv_disable_apf(void *unused) @@ -483,6 +512,31 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; +static u64 kvm_steal_clock(int cpu) +{ + u64 steal; + struct kvm_steal_time *src; + int version; + + src = per_cpu(steal_time, cpu); + do { + version = src-version; + rmb(); + steal = src-steal; + rmb(); + } while ((version 1) || (version != src-version)); + + return steal; +} + +void kvm_disable_steal_time(void) +{ + 
if (!has_steal_clock) + return; + + wrmsr(MSR_KVM_STEAL_TIME, 0, 0); +} + #ifdef CONFIG_SMP static void __init kvm_smp_prepare_boot_cpu(void) { @@ -500,6 +554,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy) static void kvm_guest_cpu_offline(void *dummy) { + kvm_disable_steal_time(); kvm_pv_disable_apf(NULL); apf_task_wake_all(); } @@ -548,6 +603,11 @@ void __init kvm_guest_init(void) if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF)) x86_init.irqs.trap_init = kvm_apf_trap_init; + if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { + has_steal_clock = 1
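The guest read side (kvm_steal_clock() above) pairs with that publication protocol: retry while the version is odd or changes mid-read. A user-space sketch, with the kernel's rmb() barriers noted but elided (names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

struct kvm_steal_time_model {
    volatile uint64_t steal;
    volatile uint32_t version;
};

/* Seqcount-style read: sample the version, read the payload, and
 * retry if the version was odd (writer in progress) or changed
 * under us. The real kernel code puts rmb() between the reads. */
static uint64_t steal_clock_read(const struct kvm_steal_time_model *src)
{
    uint32_t version;
    uint64_t steal;

    do {
        version = src->version;
        /* rmb() here in the kernel */
        steal = src->steal;
        /* rmb() here in the kernel */
    } while ((version & 1) || (version != src->version));

    return steal;
}
```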
[PATCH v6 5/9] KVM-GST: Add a pv_ops stub for steal time
This patch adds a function pointer in one of the many paravirt_ops structs, to allow guests to register a steal time function. Besides a steal time function, we also declare two jump_labels. They will be used to allow the steal time code to be easily bypassed when not in use. Signed-off-by: Glauber Costa glom...@redhat.com Acked-by: Rik van Riel r...@redhat.com Tested-by: Eric B Munson emun...@mgebm.net CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com --- arch/x86/include/asm/paravirt.h |9 + arch/x86/include/asm/paravirt_types.h |1 + arch/x86/kernel/paravirt.c|9 + 3 files changed, 19 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index ebbc4d8..a7d2db9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -230,6 +230,15 @@ static inline unsigned long long paravirt_sched_clock(void) return PVOP_CALL0(unsigned long long, pv_time_ops.sched_clock); } +struct jump_label_key; +extern struct jump_label_key paravirt_steal_enabled; +extern struct jump_label_key paravirt_steal_rq_enabled; + +static inline u64 paravirt_steal_clock(int cpu) +{ + return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); +} + static inline unsigned long long paravirt_read_pmc(int counter) { return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter); diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 8288509..2c76521 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -89,6 +89,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); + unsigned long long (*steal_clock)(int cpu); unsigned long (*get_tsc_khz)(void); }; diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 869e1ae..613a793 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c 
@@ -202,6 +202,14 @@ static void native_flush_tlb_single(unsigned long addr) __native_flush_tlb_single(addr); } +struct jump_label_key paravirt_steal_enabled; +struct jump_label_key paravirt_steal_rq_enabled; + +static u64 native_steal_clock(int cpu) +{ + return 0; +} + /* These are in entry.S */ extern void native_iret(void); extern void native_irq_enable_sysexit(void); @@ -307,6 +315,7 @@ struct pv_init_ops pv_init_ops = { struct pv_time_ops pv_time_ops = { .sched_clock = native_sched_clock, + .steal_clock = native_steal_clock, }; struct pv_irq_ops pv_irq_ops = { -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
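The pv_ops pattern this patch relies on — an ops struct holding a function pointer, initialized to a do-nothing native stub and overridden at boot when the feature is detected — can be modeled in a few lines (a sketch of the indirection only, not the kernel's patching machinery):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of the pv_time_ops indirection: native code gets a
 * stub that reports zero steal; a paravirt guest installs its own
 * clock at init time. */
struct pv_time_ops_model {
    uint64_t (*steal_clock)(int cpu);
};

static uint64_t native_steal_clock(int cpu)
{
    (void)cpu;
    return 0;   /* bare metal: nothing is ever stolen */
}

static struct pv_time_ops_model pv_time_ops_model = {
    .steal_clock = native_steal_clock,
};

static uint64_t paravirt_steal_clock_model(int cpu)
{
    return pv_time_ops_model.steal_clock(cpu);
}
```

In the kernel the call goes through PVOP_CALL1 and is guarded by the two jump labels, so the fast path costs nothing when steal time is not in use.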
[PATCH v6 8/9] KVM-GST: adjust scheduler cpu power
This patch makes update_rq_clock() aware of steal time. The mechanism of operation is not different from irq_time, and follows the same principles. This lives in a CONFIG option itself, and can be compiled out independently of the rest of steal time reporting. The effect of disabling it is that the scheduler will still report steal time (that cannot be disabled), but won't use this information for cpu power adjustments. Everytime update_rq_clock_task() is invoked, we query information about how much time was stolen since last call, and feed it into sched_rt_avg_update(). Although steal time reporting in account_process_tick() keeps track of the last time we read the steal clock, in prev_steal_time, this patch do it independently using another field, prev_steal_time_rq. This is because otherwise, information about time accounted in update_process_tick() would never reach us in update_rq_clock(). Signed-off-by: Glauber Costa glom...@redhat.com Acked-by: Rik van Riel r...@redhat.com Acked-by: Peter Zijlstra pet...@infradead.org Tested-by: Eric B Munson emun...@mgebm.net CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com --- arch/x86/Kconfig| 12 kernel/sched.c | 47 +-- kernel/sched_features.h |4 ++-- 3 files changed, 51 insertions(+), 12 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index da34972..b26f312 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -512,6 +512,18 @@ menuconfig PARAVIRT_GUEST if PARAVIRT_GUEST +config PARAVIRT_TIME_ACCOUNTING + bool Paravirtual steal time accounting + select PARAVIRT + default n + ---help--- + Select this option to enable fine granularity task steal time + accounting. Time spent executing other tasks in parallel with + the current vCPU is discounted from the vCPU power. To account for + that, there can be a small performance impact. + + If in doubt, say N here. 
+ source arch/x86/xen/Kconfig config KVM_CLOCK diff --git a/kernel/sched.c b/kernel/sched.c index aa6c030..8d57196 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -532,6 +532,9 @@ struct rq { #ifdef CONFIG_PARAVIRT u64 prev_steal_time; #endif +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING + u64 prev_steal_time_rq; +#endif /* calc_load related fields */ unsigned long calc_load_update; @@ -1971,8 +1974,14 @@ static inline u64 steal_ticks(u64 steal) static void update_rq_clock_task(struct rq *rq, s64 delta) { - s64 irq_delta; - +/* + * In theory, the compile should just see 0 here, and optimize out the call + * to sched_rt_avg_update. But I don't trust it... + */ +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) + s64 steal = 0, irq_delta = 0; +#endif +#ifdef CONFIG_IRQ_TIME_ACCOUNTING irq_delta = irq_time_read(cpu_of(rq)) - rq-prev_irq_time; /* @@ -1995,12 +2004,35 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) rq-prev_irq_time += irq_delta; delta -= irq_delta; +#endif +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING + if (static_branch((paravirt_steal_rq_enabled))) { + u64 st; + + steal = paravirt_steal_clock(cpu_of(rq)); + steal -= rq-prev_steal_time_rq; + + if (unlikely(steal delta)) + steal = delta; + + st = steal_ticks(steal); + steal = st * TICK_NSEC; + + rq-prev_steal_time_rq += steal; + + delta -= steal; + } +#endif + rq-clock_task += delta; - if (irq_delta sched_feat(NONIRQ_POWER)) - sched_rt_avg_update(rq, irq_delta); +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) + if ((irq_delta + steal) sched_feat(NONTASK_POWER)) + sched_rt_avg_update(rq, irq_delta + steal); +#endif } +#ifdef CONFIG_IRQ_TIME_ACCOUNTING static int irqtime_account_hi_update(void) { struct cpu_usage_stat *cpustat = kstat_this_cpu.cpustat; @@ -2035,12 +2067,7 @@ static int irqtime_account_si_update(void) #define sched_clock_irqtime(0) -static void update_rq_clock_task(struct rq *rq, s64 delta) -{ - rq-clock_task += 
delta; -} - -#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ +#endif #include sched_idletask.c #include sched_fair.c diff --git a/kernel/sched_features.h b/kernel/sched_features.h index be40f73..ca3b025 100644 --- a/kernel/sched_features.h +++ b/kernel/sched_features.h @@ -61,9 +61,9 @@ SCHED_FEAT(LB_BIAS, 1) SCHED_FEAT(OWNER_SPIN, 1) /* - * Decrement CPU power based on irq activity + * Decrement CPU power based on time not spent running tasks */ -SCHED_FEAT(NONIRQ_POWER, 1) +SCHED_FEAT(NONTASK_POWER, 1) /* * Queue remote wakeups on the target CPU and process them -- 1.7.3.4
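The clamp-and-convert step in update_rq_clock_task() reduces to two pure helpers: never discount more than the elapsed delta, and only account whole ticks. A sketch, with TICK_NSEC assumed to be 1 ms for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define TICK_NSEC_MODEL 1000000ULL   /* 1 ms tick, for illustration */

/* Whole scheduler ticks contained in `steal` nanoseconds. */
static uint64_t steal_ticks_model(uint64_t steal)
{
    return steal / TICK_NSEC_MODEL;
}

/* Given the elapsed wall delta and the raw stolen ns since the last
 * update, return the amount to subtract from the task clock:
 * clamped to delta and rounded down to whole ticks, as the patch
 * does before feeding it to sched_rt_avg_update(). */
static uint64_t steal_to_discount(uint64_t delta, uint64_t raw_steal)
{
    uint64_t steal = raw_steal > delta ? delta : raw_steal;
    return steal_ticks_model(steal) * TICK_NSEC_MODEL;
}
```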
[PATCH v6 0/9] Steal time for KVM
Hi guys, With all ACKs collected, I hope this can be merged. Only patch 4/9 differs from the last submission: it incorporates Marcelo's suggestions, so I didn't transpose the other ACKs to it. All the rest is kept the same. Enjoy. Glauber Costa (8): KVM-HDR Add constant to represent KVM MSRs enabled bit KVM-HDR: KVM Steal time implementation KVM-HV: KVM Steal time implementation KVM-GST: Add a pv_ops stub for steal time add jump labels for ia64 paravirt KVM-GST: KVM Steal time accounting KVM-GST: adjust scheduler cpu power KVM-GST: KVM Steal time registration Gleb Natapov (1): introduce kvm_read_guest_cached Documentation/kernel-parameters.txt |4 ++ Documentation/virtual/kvm/msr.txt | 35 + arch/ia64/include/asm/paravirt.h |4 ++ arch/ia64/kernel/paravirt.c |2 + arch/x86/Kconfig | 12 + arch/x86/include/asm/kvm_host.h |9 +++ arch/x86/include/asm/kvm_para.h | 15 ++ arch/x86/include/asm/paravirt.h |9 +++ arch/x86/include/asm/paravirt_types.h |1 + arch/x86/kernel/kvm.c | 73 +++ arch/x86/kernel/kvmclock.c|2 + arch/x86/kernel/paravirt.c|9 +++ arch/x86/kvm/Kconfig |1 + arch/x86/kvm/x86.c| 74 +++- include/linux/kvm_host.h |3 + kernel/sched.c| 88 + kernel/sched_features.h |4 +- virt/kvm/kvm_main.c | 20 +++ 18 files changed, 351 insertions(+), 14 deletions(-) -- 1.7.3.4
[PATCH v6 1/9] introduce kvm_read_guest_cached
From: Gleb Natapov g...@redhat.com Introduce kvm_read_guest_cached() function in addition to write one we already have. [ by glauber: export function signature in kvm header ] Signed-off-by: Gleb Natapov g...@redhat.com Signed-off-by: Glauber Costa glom...@redhat.com Acked-by: Rik van Riel r...@redhat.com Tested-by: Eric Munson emun...@mgebm.net --- include/linux/kvm_host.h |2 ++ virt/kvm/kvm_main.c | 20 2 files changed, 22 insertions(+), 0 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 31ebb59..f7df0a3 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -381,6 +381,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset, int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len); int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len); +int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, + void *data, unsigned long len); int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data, int offset, int len); int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data, diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 11d2783..d5ef9eb 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1418,6 +1418,26 @@ int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, } EXPORT_SYMBOL_GPL(kvm_write_guest_cached); +int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, + void *data, unsigned long len) +{ + struct kvm_memslots *slots = kvm_memslots(kvm); + int r; + + if (slots-generation != ghc-generation) + kvm_gfn_to_hva_cache_init(kvm, ghc, ghc-gpa); + + if (kvm_is_error_hva(ghc-hva)) + return -EFAULT; + + r = __copy_from_user(data, (void __user *)ghc-hva, len); + if (r) + return -EFAULT; + + return 0; +} +EXPORT_SYMBOL_GPL(kvm_read_guest_cached); + int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len) { return kvm_write_guest_page(kvm, gfn, 
(const void *) empty_zero_page, -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
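kvm_read_guest_cached() mirrors the existing write-side helper: a cached gpa-to-hva translation that is revalidated whenever the memslot generation moves on. The idea in miniature, as a toy cache over a flat buffer (all names here are illustrative, not KVM's):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model of gfn_to_hva_cache: the translation is cached together
 * with the generation it was computed under, and recomputed when
 * the "memslots" generation has changed. */
struct toy_cache {
    uint64_t generation;   /* generation the mapping was valid for */
    const uint8_t *hva;    /* cached host-side pointer */
    uint64_t gpa;
};

struct toy_mem {
    uint64_t generation;
    uint8_t backing[256];
};

static void toy_cache_init(struct toy_mem *m, struct toy_cache *c,
                           uint64_t gpa)
{
    c->generation = m->generation;
    c->gpa = gpa;
    c->hva = &m->backing[gpa];
}

static int toy_read_cached(struct toy_mem *m, struct toy_cache *c,
                           void *data, size_t len)
{
    if (m->generation != c->generation)
        toy_cache_init(m, c, c->gpa);   /* revalidate, as the patch does */
    memcpy(data, c->hva, len);
    return 0;
}
```

The real function additionally checks kvm_is_error_hva() after revalidation and returns -EFAULT on a bad mapping or a failed __copy_from_user().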
[PATCH v6 3/9] KVM-HDR: KVM Steal time implementation
To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM. This is per-vcpu, and using the kvmclock structure for that is an abuse we decided not to make. In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that holds the memory area address containing information about steal time This patch contains the headers for it. I am keeping it separate to facilitate backports to people who wants to backport the kernel part but not the hypervisor, or the other way around. Signed-off-by: Glauber Costa glom...@redhat.com Acked-by: Rik van Riel r...@redhat.com Tested-by: Eric B Munson emun...@mgebm.net CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com --- Documentation/virtual/kvm/msr.txt | 35 +++ arch/x86/include/asm/kvm_para.h |9 + 2 files changed, 44 insertions(+), 0 deletions(-) diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt index d079aed..38db3f8 100644 --- a/Documentation/virtual/kvm/msr.txt +++ b/Documentation/virtual/kvm/msr.txt @@ -185,3 +185,38 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02 Currently type 2 APF will be always delivered on the same vcpu as type 1 was, but guest should not rely on that. + +MSR_KVM_STEAL_TIME: 0x4b564d03 + + data: 64-byte alignment physical address of a memory area which must be + in guest RAM, plus an enable bit in bit 0. This memory is expected to + hold a copy of the following structure: + + struct kvm_steal_time { + __u64 steal; + __u32 version; + __u32 flags; + __u32 pad[12]; + } + + whose data will be filled in by the hypervisor periodically. Only one + write, or registration, is needed for each VCPU. The interval between + updates of this structure is arbitrary and implementation-dependent. 
+ The hypervisor may update this structure at any time it sees fit until + anything with bit0 == 0 is written to it. Guest is required to make sure + this structure is initialized to zero. + + Fields have the following meanings: + + version: a sequence counter. In other words, guest has to check + this field before and after grabbing time information and make + sure they are both equal and even. An odd version indicates an + in-progress update. + + flags: At this point, always zero. May be used to indicate + changes in this structure in the future. + + steal: the amount of time in which this vCPU did not run, in + nanoseconds. Time during which the vcpu is idle, will not be + reported as steal time. + diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index d6cd79b..65f8bb9 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -21,6 +21,7 @@ */ #define KVM_FEATURE_CLOCKSOURCE23 #define KVM_FEATURE_ASYNC_PF 4 +#define KVM_FEATURE_STEAL_TIME 5 /* The last 8 bits are used to indicate how to interpret the flags field * in pvclock structure. If no bits are set, all flags are ignored. @@ -35,6 +36,14 @@ #define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02 +#define MSR_KVM_STEAL_TIME 0x4b564d03 + +struct kvm_steal_time { + __u64 steal; + __u32 version; + __u32 flags; + __u32 pad[12]; +}; #define KVM_MAX_MMU_OP_BATCH 32 -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
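Because the area must be 64-byte aligned, the low six bits of the MSR value are free: bit 0 carries the enable flag and bits 1-5 are reserved. A sketch of encoding and decoding under that layout (mask names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define STEAL_MSR_ENABLED  (1ULL << 0)
#define STEAL_ALIGN_BITS   6                              /* 64-byte alignment */
#define STEAL_ADDR_MASK    (~((1ULL << STEAL_ALIGN_BITS) - 1))

/* Pack a 64-byte-aligned guest physical address plus the enable
 * bit, as a guest would before writing MSR_KVM_STEAL_TIME. */
static uint64_t steal_msr_encode(uint64_t gpa, int enable)
{
    return (gpa & STEAL_ADDR_MASK) | (enable ? STEAL_MSR_ENABLED : 0);
}

static uint64_t steal_msr_gpa(uint64_t msr_val)
{
    return msr_val & STEAL_ADDR_MASK;
}

static int steal_msr_enabled(uint64_t msr_val)
{
    return (msr_val & STEAL_MSR_ENABLED) != 0;
}
```

The hypervisor side rejects writes with any reserved bit set, which is what the KVM_STEAL_RESERVED_MASK check in patch 4/9 implements.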
[PATCH v6 6/9] add jump labels for ia64 paravirt
Since in a later patch I intend to call jump labels inside CONFIG_PARAVIRT, IA64 would fail to compile if they are not provided. This patch provides those jump labels for the IA64 architecture. Signed-off-by: Glauber Costa glom...@redhat.com Acked-by: Isaku Yamahata yamah...@valinux.co.jp Acked-by: Rik van Riel r...@redhat.com CC: Tony Luck tony.l...@intel.com CC: Eddie Dong eddie.d...@intel.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- arch/ia64/include/asm/paravirt.h |4 arch/ia64/kernel/paravirt.c |2 ++ 2 files changed, 6 insertions(+), 0 deletions(-) diff --git a/arch/ia64/include/asm/paravirt.h b/arch/ia64/include/asm/paravirt.h index 2eb0a98..32551d3 100644 --- a/arch/ia64/include/asm/paravirt.h +++ b/arch/ia64/include/asm/paravirt.h @@ -281,6 +281,10 @@ paravirt_init_missing_ticks_accounting(int cpu) pv_time_ops.init_missing_ticks_accounting(cpu); } +struct jump_label_key; +extern struct jump_label_key paravirt_steal_enabled; +extern struct jump_label_key paravirt_steal_rq_enabled; + static inline int paravirt_do_steal_accounting(unsigned long *new_itm) { diff --git a/arch/ia64/kernel/paravirt.c b/arch/ia64/kernel/paravirt.c index a21d7bb..1008682 100644 --- a/arch/ia64/kernel/paravirt.c +++ b/arch/ia64/kernel/paravirt.c @@ -634,6 +634,8 @@ struct pv_irq_ops pv_irq_ops = { * pv_time_ops * time operations */ +struct jump_label_key paravirt_steal_enabled; +struct jump_label_key paravirt_steal_rq_enabled; static int ia64_native_do_steal_accounting(unsigned long *new_itm) -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/9] KVM-HV: KVM Steal time implementation
On 07/07/2011 07:51 AM, Marcelo Tosatti wrote: On Mon, Jul 04, 2011 at 11:32:23AM -0400, Glauber Costa wrote: To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM, while the vcpu had meaningful work to do - halt time does not count. This information is acquired through the run_delay field of delayacct/schedstats infrastructure, that counts time spent in a runqueue but not running. Steal time is a per-cpu information, so the traditional MSR-based infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the memory area address containing information about steal time This patch contains the hypervisor part of the steal time infrasructure, and can be backported independently of the guest portion. Signed-off-by: Glauber Costaglom...@redhat.com CC: Rik van Rielr...@redhat.com CC: Jeremy Fitzhardingejeremy.fitzhardi...@citrix.com CC: Peter Zijlstrapet...@infradead.org CC: Avi Kivitya...@redhat.com CC: Anthony Liguorialigu...@us.ibm.com CC: Eric B Munsonemun...@mgebm.net --- arch/x86/include/asm/kvm_host.h |8 + arch/x86/include/asm/kvm_para.h |4 +++ arch/x86/kvm/Kconfig|1 + arch/x86/kvm/x86.c | 56 -- 4 files changed, 66 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index da6bbee..9ba354d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -389,6 +389,14 @@ struct kvm_vcpu_arch { unsigned int hw_tsc_khz; unsigned int time_offset; struct page *time_page; + + struct { + u64 msr_val; + u64 last_steal; + struct gfn_to_hva_cache stime; + struct kvm_steal_time steal; + } st; + u64 last_guest_tsc; u64 last_kernel_ns; u64 last_tsc_nsec; diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index 65f8bb9..c484ba8 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -45,6 +45,10 @@ struct kvm_steal_time { __u32 pad[12]; }; +#define 
KVM_STEAL_ALIGNMENT_BITS 5 +#define KVM_STEAL_VALID_BITS ((-1ULL (KVM_STEAL_ALIGNMENT_BITS + 1))) +#define KVM_STEAL_RESERVED_MASK (((1 KVM_STEAL_ALIGNMENT_BITS) - 1 ) 1) + #define KVM_MAX_MMU_OP_BATCH 32 #define KVM_ASYNC_PF_ENABLED (1 0) diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 50f6364..99c3f05 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -31,6 +31,7 @@ config KVM select KVM_ASYNC_PF select USER_RETURN_NOTIFIER select KVM_MMIO + select TASK_DELAY_ACCT ---help--- Support hosting fully virtualized guest machines using hardware virtualization extensions. You will need a fairly recent diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7167717..237bcdc 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -808,12 +808,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr); * kvm-specific. Those are put in the beginning of the list. */ -#define KVM_SAVE_MSRS_BEGIN8 +#define KVM_SAVE_MSRS_BEGIN9 static u32 msrs_to_save[] = { MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW, HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, - HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, + HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME, MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, MSR_STAR, #ifdef CONFIG_X86_64 @@ -1491,6 +1491,27 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu) } } +static void record_steal_time(struct kvm_vcpu *vcpu) +{ + u64 delta; + + if (!(vcpu-arch.st.msr_val KVM_MSR_ENABLED)) + return; + + if (unlikely(kvm_read_guest_cached(vcpu-kvm,vcpu-arch.st.stime, + vcpu-arch.st.steal, sizeof(struct kvm_steal_time + return; The guest memory page is not pinned, sleeping via __copy_from_user/to_user is not allowed in vcpu_load context. Either pin it or use atomic acessors. I do recognize the problem. Avi, what's your take here? 
+ case MSR_KVM_STEAL_TIME: + vcpu-arch.st.msr_val = data; + + if (!(data KVM_MSR_ENABLED)) { + break; + } On failure below this point, msr_val should be cleared of KVM_MSR_ENABLED? No, msr_val has to hold whatever the guest wrote into it. We should probably use an independent variable here to indicate that we failed to activate it. + + if (unlikely(!sched_info_on())) + break; + + if (data KVM_STEAL_RESERVED_MASK) + return 1; + + if (kvm_gfn_to_hva_cache_init(vcpu-kvm,vcpu-arch.st.stime
[PATCH v5 0/9] Steal time for KVM
This is a follow-up on the last series. No significant changes. The only change is in patch 7/9, where the steal time accounting function is made inline and is also called from irqtime_account_process_tick(). At least on my box (gcc 4.4.5), gcc insists on not inlining it, so I am resorting to __always_inline. Glauber Costa (8): KVM-HDR Add constant to represent KVM MSRs enabled bit KVM-HDR: KVM Steal time implementation KVM-HV: KVM Steal time implementation KVM-GST: Add a pv_ops stub for steal time add jump labels for ia64 paravirt KVM-GST: KVM Steal time accounting KVM-GST: adjust scheduler cpu power KVM-GST: KVM Steal time registration Gleb Natapov (1): introduce kvm_read_guest_cached Documentation/kernel-parameters.txt |4 ++ Documentation/virtual/kvm/msr.txt | 35 + arch/ia64/include/asm/paravirt.h |4 ++ arch/ia64/kernel/paravirt.c |2 + arch/x86/Kconfig | 12 + arch/x86/include/asm/kvm_host.h |8 +++ arch/x86/include/asm/kvm_para.h | 15 ++ arch/x86/include/asm/paravirt.h |9 +++ arch/x86/include/asm/paravirt_types.h |1 + arch/x86/kernel/kvm.c | 73 +++ arch/x86/kernel/kvmclock.c|2 + arch/x86/kernel/paravirt.c|9 +++ arch/x86/kvm/Kconfig |1 + arch/x86/kvm/x86.c| 56 - include/linux/kvm_host.h |2 + kernel/sched.c| 88 + kernel/sched_features.h |4 +- virt/kvm/kvm_main.c | 20 +++ 18 files changed, 330 insertions(+), 15 deletions(-) -- 1.7.3.4
[PATCH v5 1/9] introduce kvm_read_guest_cached
From: Gleb Natapov g...@redhat.com Introduce kvm_read_guest_cached() function in addition to write one we already have. [ by glauber: export function signature in kvm header ] Signed-off-by: Gleb Natapov g...@redhat.com Signed-off-by: Glauber Costa glom...@redhat.com --- include/linux/kvm_host.h |2 ++ virt/kvm/kvm_main.c | 20 2 files changed, 22 insertions(+), 0 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 31ebb59..f7df0a3 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -381,6 +381,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset, int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len); int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len); +int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, + void *data, unsigned long len); int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data, int offset, int len); int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data, diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 11d2783..d5ef9eb 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1418,6 +1418,26 @@ int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, } EXPORT_SYMBOL_GPL(kvm_write_guest_cached); +int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, + void *data, unsigned long len) +{ + struct kvm_memslots *slots = kvm_memslots(kvm); + int r; + + if (slots-generation != ghc-generation) + kvm_gfn_to_hva_cache_init(kvm, ghc, ghc-gpa); + + if (kvm_is_error_hva(ghc-hva)) + return -EFAULT; + + r = __copy_from_user(data, (void __user *)ghc-hva, len); + if (r) + return -EFAULT; + + return 0; +} +EXPORT_SYMBOL_GPL(kvm_read_guest_cached); + int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len) { return kvm_write_guest_page(kvm, gfn, (const void *) empty_zero_page, -- 1.7.3.4 -- To unsubscribe from this list: 
[PATCH v5 3/9] KVM-HDR: KVM Steal time implementation
To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM. This is per-vcpu, and using the kvmclock structure for that is an abuse we decided not to make. In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that holds the memory area address containing information about steal time This patch contains the headers for it. I am keeping it separate to facilitate backports to people who wants to backport the kernel part but not the hypervisor, or the other way around. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- Documentation/virtual/kvm/msr.txt | 35 +++ arch/x86/include/asm/kvm_para.h |9 + 2 files changed, 44 insertions(+), 0 deletions(-) diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt index d079aed..38db3f8 100644 --- a/Documentation/virtual/kvm/msr.txt +++ b/Documentation/virtual/kvm/msr.txt @@ -185,3 +185,38 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02 Currently type 2 APF will be always delivered on the same vcpu as type 1 was, but guest should not rely on that. + +MSR_KVM_STEAL_TIME: 0x4b564d03 + + data: 64-byte alignment physical address of a memory area which must be + in guest RAM, plus an enable bit in bit 0. This memory is expected to + hold a copy of the following structure: + + struct kvm_steal_time { + __u64 steal; + __u32 version; + __u32 flags; + __u32 pad[12]; + } + + whose data will be filled in by the hypervisor periodically. Only one + write, or registration, is needed for each VCPU. The interval between + updates of this structure is arbitrary and implementation-dependent. 
+ The hypervisor may update this structure at any time it sees fit until + anything with bit0 == 0 is written to it. Guest is required to make sure + this structure is initialized to zero. + + Fields have the following meanings: + + version: a sequence counter. In other words, guest has to check + this field before and after grabbing time information and make + sure they are both equal and even. An odd version indicates an + in-progress update. + + flags: At this point, always zero. May be used to indicate + changes in this structure in the future. + + steal: the amount of time in which this vCPU did not run, in + nanoseconds. Time during which the vcpu is idle will not be + reported as steal time. + diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index d6cd79b..65f8bb9 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -21,6 +21,7 @@ */ #define KVM_FEATURE_CLOCKSOURCE2 3 #define KVM_FEATURE_ASYNC_PF 4 +#define KVM_FEATURE_STEAL_TIME 5 /* The last 8 bits are used to indicate how to interpret the flags field * in pvclock structure. If no bits are set, all flags are ignored. @@ -35,6 +36,14 @@ #define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02 +#define MSR_KVM_STEAL_TIME 0x4b564d03 + +struct kvm_steal_time { + __u64 steal; + __u32 version; + __u32 flags; + __u32 pad[12]; +}; #define KVM_MAX_MMU_OP_BATCH 32 -- 1.7.3.4
[PATCH v5 4/9] KVM-HV: KVM Steal time implementation
To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM, while the vcpu had meaningful work to do - halt time does not count. This information is acquired through the run_delay field of delayacct/schedstats infrastructure, that counts time spent in a runqueue but not running. Steal time is per-cpu information, so the traditional MSR-based infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the memory area address containing information about steal time. This patch contains the hypervisor part of the steal time infrastructure, and can be backported independently of the guest portion. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- arch/x86/include/asm/kvm_host.h |8 + arch/x86/include/asm/kvm_para.h |4 +++ arch/x86/kvm/Kconfig|1 + arch/x86/kvm/x86.c | 56 -- 4 files changed, 66 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index da6bbee..9ba354d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -389,6 +389,14 @@ struct kvm_vcpu_arch { unsigned int hw_tsc_khz; unsigned int time_offset; struct page *time_page; + + struct { + u64 msr_val; + u64 last_steal; + struct gfn_to_hva_cache stime; + struct kvm_steal_time steal; + } st; + u64 last_guest_tsc; u64 last_kernel_ns; u64 last_tsc_nsec; diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index 65f8bb9..c484ba8 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -45,6 +45,10 @@ struct kvm_steal_time { __u32 pad[12]; }; +#define KVM_STEAL_ALIGNMENT_BITS 5 +#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1))) +#define 
KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1) + #define KVM_MAX_MMU_OP_BATCH 32 #define KVM_ASYNC_PF_ENABLED (1 << 0) diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 50f6364..99c3f05 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -31,6 +31,7 @@ config KVM select KVM_ASYNC_PF select USER_RETURN_NOTIFIER select KVM_MMIO + select TASK_DELAY_ACCT ---help--- Support hosting fully virtualized guest machines using hardware virtualization extensions. You will need a fairly recent diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7167717..237bcdc 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -808,12 +808,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr); * kvm-specific. Those are put in the beginning of the list. */ -#define KVM_SAVE_MSRS_BEGIN 8 +#define KVM_SAVE_MSRS_BEGIN 9 static u32 msrs_to_save[] = { MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW, HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, - HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, + HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME, MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, MSR_STAR, #ifdef CONFIG_X86_64 @@ -1491,6 +1491,27 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu) } } +static void record_steal_time(struct kvm_vcpu *vcpu) +{ + u64 delta; + + if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) + return; + + if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime, + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) + return; + + delta = current->sched_info.run_delay - vcpu->arch.st.last_steal; + vcpu->arch.st.last_steal = current->sched_info.run_delay; + + vcpu->arch.st.steal.steal += delta; + vcpu->arch.st.steal.version += 2; + + kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime, + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)); +} + int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) { switch (msr) { @@ -1573,6 +1594,28 @@ int 
kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) if (kvm_pv_enable_async_pf(vcpu, data)) return 1; break; + case MSR_KVM_STEAL_TIME: + vcpu->arch.st.msr_val = data; + + if (!(data & KVM_MSR_ENABLED)) { + break; + } + + if (unlikely(!sched_info_on())) + break; + + if (data & KVM_STEAL_RESERVED_MASK
[PATCH v5 2/9] KVM-HDR Add constant to represent KVM MSRs enabled bit
This patch is simple, put in a different commit so it can be more easily shared between guest and hypervisor. It just defines a named constant to indicate the enable bit for KVM-specific MSRs. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- arch/x86/include/asm/kvm_para.h |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index a427bf7..d6cd79b 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -30,6 +30,7 @@ #define MSR_KVM_WALL_CLOCK 0x11 #define MSR_KVM_SYSTEM_TIME 0x12 +#define KVM_MSR_ENABLED 1 /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */ #define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01 -- 1.7.3.4
[PATCH v5 8/9] KVM-GST: adjust scheduler cpu power
This patch makes update_rq_clock() aware of steal time. The mechanism of operation is not different from irq_time, and follows the same principles. This lives in a CONFIG option itself, and can be compiled out independently of the rest of steal time reporting. The effect of disabling it is that the scheduler will still report steal time (that cannot be disabled), but won't use this information for cpu power adjustments. Every time update_rq_clock_task() is invoked, we query information about how much time was stolen since the last call, and feed it into sched_rt_avg_update(). Although steal time reporting in account_process_tick() keeps track of the last time we read the steal clock, in prev_steal_time, this patch does it independently using another field, prev_steal_time_rq. This is because otherwise, information about time accounted in account_process_tick() would never reach us in update_rq_clock(). Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- arch/x86/Kconfig| 12 kernel/sched.c | 47 +-- kernel/sched_features.h |4 ++-- 3 files changed, 51 insertions(+), 12 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index da34972..b26f312 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -512,6 +512,18 @@ menuconfig PARAVIRT_GUEST if PARAVIRT_GUEST +config PARAVIRT_TIME_ACCOUNTING + bool "Paravirtual steal time accounting" + select PARAVIRT + default n + ---help--- + Select this option to enable fine granularity task steal time + accounting. Time spent executing other tasks in parallel with + the current vCPU is discounted from the vCPU power. To account for + that, there can be a small performance impact. + + If in doubt, say N here. 
+ source arch/x86/xen/Kconfig config KVM_CLOCK diff --git a/kernel/sched.c b/kernel/sched.c index aa6c030..8d57196 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -532,6 +532,9 @@ struct rq { #ifdef CONFIG_PARAVIRT u64 prev_steal_time; #endif +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING + u64 prev_steal_time_rq; +#endif /* calc_load related fields */ unsigned long calc_load_update; @@ -1971,8 +1974,14 @@ static inline u64 steal_ticks(u64 steal) static void update_rq_clock_task(struct rq *rq, s64 delta) { - s64 irq_delta; - +/* + * In theory, the compile should just see 0 here, and optimize out the call + * to sched_rt_avg_update. But I don't trust it... + */ +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) + s64 steal = 0, irq_delta = 0; +#endif +#ifdef CONFIG_IRQ_TIME_ACCOUNTING irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; /* @@ -1995,12 +2004,35 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) rq->prev_irq_time += irq_delta; delta -= irq_delta; +#endif +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING + if (static_branch((&paravirt_steal_rq_enabled))) { + u64 st; + + steal = paravirt_steal_clock(cpu_of(rq)); + steal -= rq->prev_steal_time_rq; + + if (unlikely(steal > delta)) + steal = delta; + + st = steal_ticks(steal); + steal = st * TICK_NSEC; + + rq->prev_steal_time_rq += steal; + + delta -= steal; + } +#endif + rq->clock_task += delta; - if (irq_delta && sched_feat(NONIRQ_POWER)) - sched_rt_avg_update(rq, irq_delta); +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) + if ((irq_delta + steal) && sched_feat(NONTASK_POWER)) + sched_rt_avg_update(rq, irq_delta + steal); +#endif } +#ifdef CONFIG_IRQ_TIME_ACCOUNTING static int irqtime_account_hi_update(void) { struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; @@ -2035,12 +2067,7 @@ static int irqtime_account_si_update(void) #define sched_clock_irqtime (0) -static void update_rq_clock_task(struct rq *rq, s64 delta) -{ - rq->clock_task += 
delta; -} - -#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ +#endif #include "sched_idletask.c" #include "sched_fair.c" diff --git a/kernel/sched_features.h b/kernel/sched_features.h index be40f73..ca3b025 100644 --- a/kernel/sched_features.h +++ b/kernel/sched_features.h @@ -61,9 +61,9 @@ SCHED_FEAT(LB_BIAS, 1) SCHED_FEAT(OWNER_SPIN, 1) /* - * Decrement CPU power based on irq activity + * Decrement CPU power based on time not spent running tasks */ -SCHED_FEAT(NONIRQ_POWER, 1) +SCHED_FEAT(NONTASK_POWER, 1) /* * Queue remote wakeups on the target CPU and process them -- 1.7.3.4
[PATCH v5 5/9] KVM-GST: Add a pv_ops stub for steal time
This patch adds a function pointer in one of the many paravirt_ops structs, to allow guests to register a steal time function. Besides a steal time function, we also declare two jump_labels. They will be used to allow the steal time code to be easily bypassed when not in use. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- arch/x86/include/asm/paravirt.h |9 + arch/x86/include/asm/paravirt_types.h |1 + arch/x86/kernel/paravirt.c|9 + 3 files changed, 19 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index ebbc4d8..a7d2db9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -230,6 +230,15 @@ static inline unsigned long long paravirt_sched_clock(void) return PVOP_CALL0(unsigned long long, pv_time_ops.sched_clock); } +struct jump_label_key; +extern struct jump_label_key paravirt_steal_enabled; +extern struct jump_label_key paravirt_steal_rq_enabled; + +static inline u64 paravirt_steal_clock(int cpu) +{ + return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); +} + static inline unsigned long long paravirt_read_pmc(int counter) { return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter); diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 8288509..2c76521 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -89,6 +89,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); + unsigned long long (*steal_clock)(int cpu); unsigned long (*get_tsc_khz)(void); }; diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 869e1ae..613a793 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -202,6 
+202,14 @@ static void native_flush_tlb_single(unsigned long addr) __native_flush_tlb_single(addr); } +struct jump_label_key paravirt_steal_enabled; +struct jump_label_key paravirt_steal_rq_enabled; + +static u64 native_steal_clock(int cpu) +{ + return 0; +} + /* These are in entry.S */ extern void native_iret(void); extern void native_irq_enable_sysexit(void); @@ -307,6 +315,7 @@ struct pv_init_ops pv_init_ops = { struct pv_time_ops pv_time_ops = { .sched_clock = native_sched_clock, + .steal_clock = native_steal_clock, }; struct pv_irq_ops pv_irq_ops = { -- 1.7.3.4
[PATCH v5 7/9] KVM-GST: KVM Steal time accounting
This patch accounts steal time in account_process_tick. If one or more ticks are considered stolen in the current accounting cycle, user/system accounting is skipped. Idle is fine, since the hypervisor does not report steal time if the guest is halted. Accounting steal time from the core scheduler gives us the advantage of direct access to the runqueue data. In a later opportunity, it can be used to tweak cpu power and make the scheduler aware of the time it lost. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- kernel/sched.c | 41 + 1 files changed, 41 insertions(+), 0 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index 3f2e502..aa6c030 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -75,6 +75,7 @@ #include <asm/tlb.h> #include <asm/irq_regs.h> #include <asm/mutex.h> +#include <asm/paravirt.h> #include "sched_cpupri.h" #include "workqueue_sched.h" @@ -528,6 +529,9 @@ struct rq { #ifdef CONFIG_IRQ_TIME_ACCOUNTING u64 prev_irq_time; #endif +#ifdef CONFIG_PARAVIRT + u64 prev_steal_time; +#endif /* calc_load related fields */ unsigned long calc_load_update; @@ -1953,6 +1957,18 @@ void account_system_vtime(struct task_struct *curr) } EXPORT_SYMBOL_GPL(account_system_vtime); +#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ + +#ifdef CONFIG_PARAVIRT +static inline u64 steal_ticks(u64 steal) +{ + if (unlikely(steal > NSEC_PER_SEC)) + return div_u64(steal, TICK_NSEC); + + return __iter_div_u64_rem(steal, TICK_NSEC, &steal); +} +#endif + static void update_rq_clock_task(struct rq *rq, s64 delta) { s64 irq_delta; @@ -3845,6 +3861,25 @@ void account_idle_time(cputime_t cputime) cpustat->idle = cputime64_add(cpustat->idle, cputime64); } +static __always_inline bool steal_account_process_tick(void) +{ +#ifdef CONFIG_PARAVIRT + if 
(static_branch(&paravirt_steal_enabled)) { + u64 steal, st = 0; + + steal = paravirt_steal_clock(smp_processor_id()); + steal -= this_rq()->prev_steal_time; + + st = steal_ticks(steal); + this_rq()->prev_steal_time += st * TICK_NSEC; + + account_steal_time(st); + return st; + } +#endif + return false; +} + #ifndef CONFIG_VIRT_CPU_ACCOUNTING #ifdef CONFIG_IRQ_TIME_ACCOUNTING @@ -3876,6 +3911,9 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick, cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy); struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + if (steal_account_process_tick()) + return; + if (irqtime_account_hi_update()) { cpustat->irq = cputime64_add(cpustat->irq, tmp); } else if (irqtime_account_si_update()) { @@ -3929,6 +3967,9 @@ void account_process_tick(struct task_struct *p, int user_tick) return; } + if (steal_account_process_tick()) + return; + if (user_tick) account_user_time(p, cputime_one_jiffy, one_jiffy_scaled); else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET)) -- 1.7.3.4
[PATCH v5 6/9] add jump labels for ia64 paravirt
Since in a later patch I intend to call jump labels inside CONFIG_PARAVIRT, IA64 would fail to compile if they are not provided. This patch provides those jump labels for the IA64 architecture. Signed-off-by: Glauber Costa glom...@redhat.com CC: Isaku Yamahata yamah...@valinux.co.jp CC: Eddie Dong eddie.d...@intel.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- arch/ia64/include/asm/paravirt.h |4 arch/ia64/kernel/paravirt.c |2 ++ 2 files changed, 6 insertions(+), 0 deletions(-) diff --git a/arch/ia64/include/asm/paravirt.h b/arch/ia64/include/asm/paravirt.h index 2eb0a98..32551d3 100644 --- a/arch/ia64/include/asm/paravirt.h +++ b/arch/ia64/include/asm/paravirt.h @@ -281,6 +281,10 @@ paravirt_init_missing_ticks_accounting(int cpu) pv_time_ops.init_missing_ticks_accounting(cpu); } +struct jump_label_key; +extern struct jump_label_key paravirt_steal_enabled; +extern struct jump_label_key paravirt_steal_rq_enabled; + static inline int paravirt_do_steal_accounting(unsigned long *new_itm) { diff --git a/arch/ia64/kernel/paravirt.c b/arch/ia64/kernel/paravirt.c index a21d7bb..1008682 100644 --- a/arch/ia64/kernel/paravirt.c +++ b/arch/ia64/kernel/paravirt.c @@ -634,6 +634,8 @@ struct pv_irq_ops pv_irq_ops = { * pv_time_ops * time operations */ +struct jump_label_key paravirt_steal_enabled; +struct jump_label_key paravirt_steal_rq_enabled; static int ia64_native_do_steal_accounting(unsigned long *new_itm) -- 1.7.3.4
[PATCH v5 9/9] KVM-GST: KVM Steal time registration
This patch implements the kvm bits of the steal time infrastructure. The most important part of it is the steal time clock. It is a continuous clock that shows the accumulated amount of steal time since vcpu creation. It is supposed to survive cpu offlining/onlining. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- Documentation/kernel-parameters.txt |4 ++ arch/x86/include/asm/kvm_para.h |1 + arch/x86/kernel/kvm.c | 73 +++ arch/x86/kernel/kvmclock.c |2 + 4 files changed, 80 insertions(+), 0 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index fd248a31..a722574 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1737,6 +1737,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted. no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page fault handling. + no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting. + steal time is computed, but won't influence scheduler + behaviour + nolapic [X86-32,APIC] Do not enable or use the local APIC. nolapic_timer [X86-32,APIC] Do not use the local APIC timer. diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index c484ba8..35d732d 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -94,6 +94,7 @@ struct kvm_vcpu_pv_apf_data { extern void kvmclock_init(void); extern int kvm_register_clock(char *txt); +extern void kvm_disable_steal_time(void); /* This instruction is vmcall. 
On non-VT architectures, it will generate a diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 33c07b0..58331c2 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -51,6 +51,15 @@ static int parse_no_kvmapf(char *arg) early_param("no-kvmapf", parse_no_kvmapf); +static int steal_acc = 1; +static int parse_no_stealacc(char *arg) +{ + steal_acc = 0; + return 0; +} + +early_param("no-steal-acc", parse_no_stealacc); + struct kvm_para_state { u8 mmu_queue[MMU_QUEUE_SIZE]; int mmu_queue_len; @@ -58,6 +67,8 @@ struct kvm_para_state { static DEFINE_PER_CPU(struct kvm_para_state, para_state); static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64); +static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64); +static int has_steal_clock = 0; static struct kvm_para_state *kvm_para_state(void) { @@ -441,6 +452,21 @@ static void __init paravirt_ops_setup(void) #endif } +static void kvm_register_steal_time(void) +{ + int cpu = smp_processor_id(); + struct kvm_steal_time *st = &per_cpu(steal_time, cpu); + + if (!has_steal_clock) + return; + + memset(st, 0, sizeof(*st)); + + wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED)); + printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n", + cpu, __pa(st)); +} + void __cpuinit kvm_guest_cpu_init(void) { if (!kvm_para_available()) @@ -457,6 +483,9 @@ void __cpuinit kvm_guest_cpu_init(void) printk(KERN_INFO "KVM setup async PF for cpu %d\n", smp_processor_id()); } + + if (has_steal_clock) + kvm_register_steal_time(); } static void kvm_pv_disable_apf(void *unused) @@ -483,6 +512,31 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; +static u64 kvm_steal_clock(int cpu) +{ + u64 steal; + struct kvm_steal_time *src; + int version; + + src = &per_cpu(steal_time, cpu); + do { + version = src->version; + rmb(); + steal = src->steal; + rmb(); + } while ((version & 1) || (version != src->version)); + + return steal; +} + +void kvm_disable_steal_time(void) +{ + 
if (!has_steal_clock) + return; + + wrmsr(MSR_KVM_STEAL_TIME, 0, 0); +} + #ifdef CONFIG_SMP static void __init kvm_smp_prepare_boot_cpu(void) { @@ -500,6 +554,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy) static void kvm_guest_cpu_offline(void *dummy) { + kvm_disable_steal_time(); kvm_pv_disable_apf(NULL); apf_task_wake_all(); } @@ -548,6 +603,11 @@ void __init kvm_guest_init(void) if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF)) x86_init.irqs.trap_init = kvm_apf_trap_init; + if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { + has_steal_clock = 1; + pv_time_ops.steal_clock
[PATCH v4 0/9] Steal time series again
Here follows the fourth version of the steal time series. Hope it is acceptable for all involved parties now. The main differences from v3 are: * The changelogs seem to have been written by an actual person now, not by a monkey. Yet, I am the aforementioned person, so don't expect much. * Forcing delayacct on the hypervisor side allows us to simplify the guest code dramatically, since now we don't need to test for is_idle: if we're idle, we won't have steal time and end of story. Hope you enjoy. Glauber Costa (8): KVM-HDR Add constant to represent KVM MSRs enabled bit KVM-HDR: KVM Steal time implementation KVM-HV: KVM Steal time implementation KVM-GST: Add a pv_ops stub for steal time add jump labels for ia64 paravirt KVM-GST: KVM Steal time accounting KVM-GST: adjust scheduler cpu power KVM-GST: KVM Steal time registration Gleb Natapov (1): introduce kvm_read_guest_cached Documentation/kernel-parameters.txt |4 ++ Documentation/virtual/kvm/msr.txt | 35 ++ arch/ia64/include/asm/paravirt.h |4 ++ arch/ia64/kernel/paravirt.c |2 + arch/x86/Kconfig | 12 + arch/x86/include/asm/kvm_host.h |8 +++ arch/x86/include/asm/kvm_para.h | 15 ++ arch/x86/include/asm/paravirt.h |9 arch/x86/include/asm/paravirt_types.h |1 + arch/x86/kernel/kvm.c | 73 ++ arch/x86/kernel/kvmclock.c|2 + arch/x86/kernel/paravirt.c|9 arch/x86/kvm/Kconfig |1 + arch/x86/kvm/x86.c| 56 ++- include/linux/kvm_host.h |2 + kernel/sched.c| 80 kernel/sched_features.h |4 +- virt/kvm/kvm_main.c | 20 18 files changed, 322 insertions(+), 15 deletions(-) -- 1.7.3.4
[PATCH v4 4/9] KVM-HV: KVM Steal time implementation
To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM, while the vcpu had meaningful work to do - halt time does not count. This information is acquired through the run_delay field of delayacct/schedstats infrastructure, that counts time spent in a runqueue but not running. Steal time is a per-cpu information, so the traditional MSR-based infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the memory area address containing information about steal time This patch contains the hypervisor part of the steal time infrasructure, and can be backported independently of the guest portion. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- arch/x86/include/asm/kvm_host.h |8 + arch/x86/include/asm/kvm_para.h |4 +++ arch/x86/kvm/Kconfig|1 + arch/x86/kvm/x86.c | 56 -- 4 files changed, 66 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index da6bbee..9ba354d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -389,6 +389,14 @@ struct kvm_vcpu_arch { unsigned int hw_tsc_khz; unsigned int time_offset; struct page *time_page; + + struct { + u64 msr_val; + u64 last_steal; + struct gfn_to_hva_cache stime; + struct kvm_steal_time steal; + } st; + u64 last_guest_tsc; u64 last_kernel_ns; u64 last_tsc_nsec; diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index 65f8bb9..c484ba8 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -45,6 +45,10 @@ struct kvm_steal_time { __u32 pad[12]; }; +#define KVM_STEAL_ALIGNMENT_BITS 5 +#define KVM_STEAL_VALID_BITS ((-1ULL (KVM_STEAL_ALIGNMENT_BITS + 1))) +#define 
KVM_STEAL_RESERVED_MASK (((1 KVM_STEAL_ALIGNMENT_BITS) - 1 ) 1) + #define KVM_MAX_MMU_OP_BATCH 32 #define KVM_ASYNC_PF_ENABLED (1 0) diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 50f6364..99c3f05 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -31,6 +31,7 @@ config KVM select KVM_ASYNC_PF select USER_RETURN_NOTIFIER select KVM_MMIO + select TASK_DELAY_ACCT ---help--- Support hosting fully virtualized guest machines using hardware virtualization extensions. You will need a fairly recent diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7167717..237bcdc 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -808,12 +808,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr); * kvm-specific. Those are put in the beginning of the list. */ -#define KVM_SAVE_MSRS_BEGIN8 +#define KVM_SAVE_MSRS_BEGIN9 static u32 msrs_to_save[] = { MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW, HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, - HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, + HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME, MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, MSR_STAR, #ifdef CONFIG_X86_64 @@ -1491,6 +1491,27 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu) } } +static void record_steal_time(struct kvm_vcpu *vcpu) +{ + u64 delta; + + if (!(vcpu-arch.st.msr_val KVM_MSR_ENABLED)) + return; + + if (unlikely(kvm_read_guest_cached(vcpu-kvm, vcpu-arch.st.stime, + vcpu-arch.st.steal, sizeof(struct kvm_steal_time + return; + + delta = current-sched_info.run_delay - vcpu-arch.st.last_steal; + vcpu-arch.st.last_steal = current-sched_info.run_delay; + + vcpu-arch.st.steal.steal += delta; + vcpu-arch.st.steal.version += 2; + + kvm_write_guest_cached(vcpu-kvm, vcpu-arch.st.stime, + vcpu-arch.st.steal, sizeof(struct kvm_steal_time)); +} + int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) { switch (msr) { @@ -1573,6 +1594,28 @@ int 
kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) if (kvm_pv_enable_async_pf(vcpu, data)) return 1; break; + case MSR_KVM_STEAL_TIME: + vcpu-arch.st.msr_val = data; + + if (!(data KVM_MSR_ENABLED)) { + break; + } + + if (unlikely(!sched_info_on())) + break; + + if (data KVM_STEAL_RESERVED_MASK
[PATCH v4 2/9] KVM-HDR Add constant to represent KVM MSRs enabled bit
This patch is simple, put in a different commit so it can be more easily shared between guest and hypervisor. It just defines a named constant to indicate the enable bit for KVM-specific MSRs. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- arch/x86/include/asm/kvm_para.h |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index a427bf7..d6cd79b 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -30,6 +30,7 @@ #define MSR_KVM_WALL_CLOCK 0x11 #define MSR_KVM_SYSTEM_TIME 0x12 +#define KVM_MSR_ENABLED 1 /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */ #define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01 -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 8/9] KVM-GST: adjust scheduler cpu power
This patch makes update_rq_clock() aware of steal time. The mechanism of operation is not different from irq_time, and follows the same principles. This lives in a CONFIG option itself, and can be compiled out independently of the rest of steal time reporting. The effect of disabling it is that the scheduler will still report steal time (that cannot be disabled), but won't use this information for cpu power adjustments. Everytime update_rq_clock_task() is invoked, we query information about how much time was stolen since last call, and feed it into sched_rt_avg_update(). Although steal time reporting in account_process_tick() keeps track of the last time we read the steal clock, in prev_steal_time, this patch do it independently using another field, prev_steal_time_rq. This is because otherwise, information about time accounted in update_process_tick() would never reach us in update_rq_clock(). Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- arch/x86/Kconfig| 12 kernel/sched.c | 47 +-- kernel/sched_features.h |4 ++-- 3 files changed, 51 insertions(+), 12 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index da34972..b26f312 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -512,6 +512,18 @@ menuconfig PARAVIRT_GUEST if PARAVIRT_GUEST +config PARAVIRT_TIME_ACCOUNTING + bool Paravirtual steal time accounting + select PARAVIRT + default n + ---help--- + Select this option to enable fine granularity task steal time + accounting. Time spent executing other tasks in parallel with + the current vCPU is discounted from the vCPU power. To account for + that, there can be a small performance impact. + + If in doubt, say N here. 
+ source "arch/x86/xen/Kconfig" config KVM_CLOCK diff --git a/kernel/sched.c b/kernel/sched.c index 247dd51..c40b118 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -532,6 +532,9 @@ struct rq { #ifdef CONFIG_PARAVIRT u64 prev_steal_time; #endif +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING + u64 prev_steal_time_rq; +#endif /* calc_load related fields */ unsigned long calc_load_update; @@ -1971,8 +1974,14 @@ static inline u64 steal_ticks(u64 steal) static void update_rq_clock_task(struct rq *rq, s64 delta) { - s64 irq_delta; - +/* + * In theory, the compile should just see 0 here, and optimize out the call + * to sched_rt_avg_update. But I don't trust it... + */ +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) + s64 steal = 0, irq_delta = 0; +#endif +#ifdef CONFIG_IRQ_TIME_ACCOUNTING irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; /* @@ -1995,12 +2004,35 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) rq->prev_irq_time += irq_delta; delta -= irq_delta; +#endif +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING + if (static_branch((&paravirt_steal_rq_enabled))) { + u64 st; + + steal = paravirt_steal_clock(cpu_of(rq)); + steal -= rq->prev_steal_time_rq; + + if (unlikely(steal > delta)) + steal = delta; + + st = steal_ticks(steal); + steal = st * TICK_NSEC; + + rq->prev_steal_time_rq += steal; + + delta -= steal; + } +#endif + rq->clock_task += delta; - if (irq_delta && sched_feat(NONIRQ_POWER)) - sched_rt_avg_update(rq, irq_delta); +#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) + if ((irq_delta + steal) && sched_feat(NONTASK_POWER)) + sched_rt_avg_update(rq, irq_delta + steal); +#endif } +#ifdef CONFIG_IRQ_TIME_ACCOUNTING static int irqtime_account_hi_update(void) { struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; @@ -2035,12 +2067,7 @@ static int irqtime_account_si_update(void) #define sched_clock_irqtime(0) -static void update_rq_clock_task(struct rq *rq, s64 delta) -{ - rq->clock_task += 
delta; -} - -#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ +#endif #include "sched_idletask.c" #include "sched_fair.c" diff --git a/kernel/sched_features.h b/kernel/sched_features.h index be40f73..ca3b025 100644 --- a/kernel/sched_features.h +++ b/kernel/sched_features.h @@ -61,9 +61,9 @@ SCHED_FEAT(LB_BIAS, 1) SCHED_FEAT(OWNER_SPIN, 1) /* - * Decrement CPU power based on irq activity + * Decrement CPU power based on time not spent running tasks */ -SCHED_FEAT(NONIRQ_POWER, 1) +SCHED_FEAT(NONTASK_POWER, 1) /* * Queue remote wakeups on the target CPU and process them -- 1.7.3.4 -- To unsubscribe from
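[Editor's note appended to the archive:] As a rough illustration of what the update_rq_clock_task() hunk in this patch does, here is a user-space Python model: clamp the raw steal reading to the elapsed delta, charge only whole ticks, subtract that from the task clock, and feed the remainder together with irq time to the rt-average. This is a hedged sketch, not kernel code; TICK_NSEC here assumes HZ=1000, and the irq-delta clamping the real code also performs is omitted.

```python
TICK_NSEC = 1_000_000  # assumption: HZ=1000; the real value is config-dependent

def steal_ticks(steal_ns: int) -> int:
    # Whole ticks contained in a nanosecond steal reading.
    return steal_ns // TICK_NSEC

def update_rq_clock_task(delta_ns: int, irq_delta_ns: int, raw_steal_ns: int):
    """Returns (clock_task_advance, nontask_ns): how far the task clock
    advances, and what would be fed to sched_rt_avg_update()."""
    delta = delta_ns - irq_delta_ns
    steal = min(raw_steal_ns, delta)        # models unlikely(steal > delta)
    steal = steal_ticks(steal) * TICK_NSEC  # only whole ticks are charged
    delta -= steal
    return delta, irq_delta_ns + steal

# 4 ms elapsed, 0.5 ms spent in irq, 10 ms of raw steal (clamped to what's left):
adv, nontask = update_rq_clock_task(4_000_000, 500_000, 10_000_000)
```

With these numbers, only 0.5 ms of the 4 ms window counts as task time; the rest is treated as non-task power and lowers the cpu power seen by the load balancer.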
[PATCH v4 1/9] introduce kvm_read_guest_cached
From: Gleb Natapov g...@redhat.com Introduce kvm_read_guest_cached() function in addition to the write one we already have. [ by glauber: export function signature in kvm header ] Signed-off-by: Gleb Natapov g...@redhat.com Signed-off-by: Glauber Costa glom...@redhat.com --- include/linux/kvm_host.h |2 ++ virt/kvm/kvm_main.c | 20 2 files changed, 22 insertions(+), 0 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 31ebb59..f7df0a3 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -381,6 +381,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset, int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len); int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len); +int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, + void *data, unsigned long len); int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data, int offset, int len); int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data, diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 11d2783..d5ef9eb 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1418,6 +1418,26 @@ int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, } EXPORT_SYMBOL_GPL(kvm_write_guest_cached); +int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, + void *data, unsigned long len) +{ + struct kvm_memslots *slots = kvm_memslots(kvm); + int r; + + if (slots->generation != ghc->generation) + kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa); + + if (kvm_is_error_hva(ghc->hva)) + return -EFAULT; + + r = __copy_from_user(data, (void __user *)ghc->hva, len); + if (r) + return -EFAULT; + + return 0; +} +EXPORT_SYMBOL_GPL(kvm_read_guest_cached); + int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len) { return kvm_write_guest_page(kvm, gfn, (const void *) empty_zero_page, -- 1.7.3.4 -- To unsubscribe from this list: 
send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
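[Editor's note appended to the archive:] The generation check in kvm_read_guest_cached() above is a generic caching pattern: remember a resolved translation and re-resolve it only when a global generation counter says the underlying layout changed. The following is a hedged user-space Python model of that pattern — all names are invented for illustration, and `None` stands in for an error hva.

```python
class GfnToHvaCache:
    """Stand-in for struct gfn_to_hva_cache: the gpa, the cached
    translation, and the memslot generation it was resolved against."""
    def __init__(self, gpa: int):
        self.gpa = gpa
        self.generation = -1   # forces a resolve on first use
        self.hva = None

class MemSlots:
    """Stand-in for struct kvm_memslots."""
    def __init__(self):
        self.generation = 0
        self.mapping = {}      # gpa -> bytearray (stand-in for host memory)

def cache_init(slots: MemSlots, ghc: GfnToHvaCache) -> None:
    # Models kvm_gfn_to_hva_cache_init(): resolve gpa -> hva and record
    # which generation the translation belongs to.
    ghc.hva = slots.mapping.get(ghc.gpa)   # None models an error hva
    ghc.generation = slots.generation

def read_guest_cached(slots: MemSlots, ghc: GfnToHvaCache, length: int) -> bytes:
    # Re-resolve the translation only when the memslot layout changed.
    if slots.generation != ghc.generation:
        cache_init(slots, ghc)
    if ghc.hva is None:
        raise OSError("EFAULT")
    return bytes(ghc.hva[:length])

slots = MemSlots()
slots.mapping[0x1000] = bytearray(b"steal=42")
ghc = GfnToHvaCache(0x1000)
first = read_guest_cached(slots, ghc, 8)   # resolves and caches

# A memslot update bumps the generation; the next read re-resolves.
slots.mapping[0x1000] = bytearray(b"steal=99")
slots.generation += 1
second = read_guest_cached(slots, ghc, 8)
```

The point of the pattern is that the common case (no memslot change) costs one integer comparison instead of a full address-space walk.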
[PATCH v4 5/9] KVM-GST: Add a pv_ops stub for steal time
This patch adds a function pointer in one of the many paravirt_ops structs, to allow guests to register a steal time function. Besides a steal time function, we also declare two jump_labels. They will be used to allow the steal time code to be easily bypassed when not in use. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- arch/x86/include/asm/paravirt.h |9 + arch/x86/include/asm/paravirt_types.h |1 + arch/x86/kernel/paravirt.c|9 + 3 files changed, 19 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index ebbc4d8..a7d2db9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -230,6 +230,15 @@ static inline unsigned long long paravirt_sched_clock(void) return PVOP_CALL0(unsigned long long, pv_time_ops.sched_clock); } +struct jump_label_key; +extern struct jump_label_key paravirt_steal_enabled; +extern struct jump_label_key paravirt_steal_rq_enabled; + +static inline u64 paravirt_steal_clock(int cpu) +{ + return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); +} + static inline unsigned long long paravirt_read_pmc(int counter) { return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter); diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 8288509..2c76521 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -89,6 +89,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); + unsigned long long (*steal_clock)(int cpu); unsigned long (*get_tsc_khz)(void); }; diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 869e1ae..613a793 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -202,6 
+202,14 @@ static void native_flush_tlb_single(unsigned long addr) __native_flush_tlb_single(addr); } +struct jump_label_key paravirt_steal_enabled; +struct jump_label_key paravirt_steal_rq_enabled; + +static u64 native_steal_clock(int cpu) +{ + return 0; +} + /* These are in entry.S */ extern void native_iret(void); extern void native_irq_enable_sysexit(void); @@ -307,6 +315,7 @@ struct pv_init_ops pv_init_ops = { struct pv_time_ops pv_time_ops = { .sched_clock = native_sched_clock, + .steal_clock = native_steal_clock, }; struct pv_irq_ops pv_irq_ops = { -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 6/9] add jump labels for ia64 paravirt
Since in a later patch I intend to call jump labels inside CONFIG_PARAVIRT, IA64 would fail to compile if they are not provided. This patch provides those jump labels for the IA64 architecture. Signed-off-by: Glauber Costa glom...@redhat.com CC: Isaku Yamahata yamah...@valinux.co.jp CC: Eddie Dong eddie.d...@intel.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- arch/ia64/include/asm/paravirt.h |4 arch/ia64/kernel/paravirt.c |2 ++ 2 files changed, 6 insertions(+), 0 deletions(-) diff --git a/arch/ia64/include/asm/paravirt.h b/arch/ia64/include/asm/paravirt.h index 2eb0a98..32551d3 100644 --- a/arch/ia64/include/asm/paravirt.h +++ b/arch/ia64/include/asm/paravirt.h @@ -281,6 +281,10 @@ paravirt_init_missing_ticks_accounting(int cpu) pv_time_ops.init_missing_ticks_accounting(cpu); } +struct jump_label_key; +extern struct jump_label_key paravirt_steal_enabled; +extern struct jump_label_key paravirt_steal_rq_enabled; + static inline int paravirt_do_steal_accounting(unsigned long *new_itm) { diff --git a/arch/ia64/kernel/paravirt.c b/arch/ia64/kernel/paravirt.c index a21d7bb..1008682 100644 --- a/arch/ia64/kernel/paravirt.c +++ b/arch/ia64/kernel/paravirt.c @@ -634,6 +634,8 @@ struct pv_irq_ops pv_irq_ops = { * pv_time_ops * time operations */ +struct jump_label_key paravirt_steal_enabled; +struct jump_label_key paravirt_steal_rq_enabled; static int ia64_native_do_steal_accounting(unsigned long *new_itm) -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 7/9] KVM-GST: KVM Steal time accounting
This patch accounts steal time in account_process_tick. If one or more ticks are considered stolen in the current accounting cycle, user/system accounting is skipped. Idle is fine, since the hypervisor does not report steal time if the guest is halted. Accounting steal time from the core scheduler gives us the advantage of direct access to the runqueue data. In a later opportunity, it can be used to tweak cpu power and make the scheduler aware of the time it lost. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- kernel/sched.c | 33 + 1 files changed, 33 insertions(+), 0 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index 3f2e502..247dd51 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -75,6 +75,7 @@ #include <asm/tlb.h> #include <asm/irq_regs.h> #include <asm/mutex.h> +#include <asm/paravirt.h> #include "sched_cpupri.h" #include "workqueue_sched.h" @@ -528,6 +529,9 @@ struct rq { #ifdef CONFIG_IRQ_TIME_ACCOUNTING u64 prev_irq_time; #endif +#ifdef CONFIG_PARAVIRT + u64 prev_steal_time; +#endif /* calc_load related fields */ unsigned long calc_load_update; @@ -1953,6 +1957,18 @@ void account_system_vtime(struct task_struct *curr) } EXPORT_SYMBOL_GPL(account_system_vtime); +#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ + +#ifdef CONFIG_PARAVIRT +static inline u64 steal_ticks(u64 steal) +{ + if (unlikely(steal > NSEC_PER_SEC)) + return div_u64(steal, TICK_NSEC); + + return __iter_div_u64_rem(steal, TICK_NSEC, &steal); +} +#endif + static void update_rq_clock_task(struct rq *rq, s64 delta) { s64 irq_delta; @@ -3929,6 +3945,23 @@ void account_process_tick(struct task_struct *p, int user_tick) return; } +#ifdef CONFIG_PARAVIRT + if (static_branch(&paravirt_steal_enabled)) { + u64 steal, st = 0; + + steal = 
paravirt_steal_clock(smp_processor_id()); + steal -= this_rq()->prev_steal_time; + + st = steal_ticks(steal); + this_rq()->prev_steal_time += st * TICK_NSEC; + + if (st) { + account_steal_time(st); + return; + } + } +#endif + if (user_tick) account_user_time(p, cputime_one_jiffy, one_jiffy_scaled); else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET)) -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
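[Editor's note appended to the archive:] The bookkeeping in the account_process_tick() hunk above — convert the raw steal clock into whole ticks, advance prev_steal_time by only the whole-tick amount, and carry the sub-tick remainder into the next cycle — can be modelled in user-space Python. This is a hedged sketch, not the kernel code; TICK_NSEC assumes HZ=1000.

```python
TICK_NSEC = 1_000_000  # assumption: HZ=1000; the real value is config-dependent

class RQ:
    def __init__(self):
        self.prev_steal_time = 0  # models rq->prev_steal_time

def account_tick(rq: RQ, steal_clock_ns: int) -> int:
    """One call per tick: turn the monotonically growing steal clock into
    whole ticks, carrying any sub-tick remainder forward."""
    steal = steal_clock_ns - rq.prev_steal_time
    st = steal // TICK_NSEC            # models steal_ticks()
    rq.prev_steal_time += st * TICK_NSEC
    return st  # ticks accounted as steal; 0 means account user/system instead

rq = RQ()
t1 = account_tick(rq, 1_700_000)  # one whole tick; 0.7 of a tick carried over
t2 = account_tick(rq, 2_100_000)  # carried 0.7 plus 0.4 new -> one more tick
```

Because prev_steal_time only advances in whole-tick steps, no fractional steal is ever lost across accounting cycles.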
[PATCH v4 9/9] KVM-GST: KVM Steal time registration
This patch implements the kvm bits of the steal time infrastructure. The most important part of it is the steal time clock. It is a continuous clock that shows the accumulated amount of steal time since vcpu creation. It is supposed to survive cpu offlining/onlining. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- Documentation/kernel-parameters.txt |4 ++ arch/x86/include/asm/kvm_para.h |1 + arch/x86/kernel/kvm.c | 73 +++ arch/x86/kernel/kvmclock.c |2 + 4 files changed, 80 insertions(+), 0 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index fd248a31..a722574 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1737,6 +1737,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted. no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page fault handling. + no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting. + steal time is computed, but won't influence scheduler + behaviour + nolapic [X86-32,APIC] Do not enable or use the local APIC. nolapic_timer [X86-32,APIC] Do not use the local APIC timer. diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index c484ba8..35d732d 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -94,6 +94,7 @@ struct kvm_vcpu_pv_apf_data { extern void kvmclock_init(void); extern int kvm_register_clock(char *txt); +extern void kvm_disable_steal_time(void); /* This instruction is vmcall. 
On non-VT architectures, it will generate a diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 33c07b0..58331c2 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -51,6 +51,15 @@ static int parse_no_kvmapf(char *arg) early_param("no-kvmapf", parse_no_kvmapf); +static int steal_acc = 1; +static int parse_no_stealacc(char *arg) +{ +steal_acc = 0; +return 0; +} + +early_param("no-steal-acc", parse_no_stealacc); + struct kvm_para_state { u8 mmu_queue[MMU_QUEUE_SIZE]; int mmu_queue_len; @@ -58,6 +67,8 @@ struct kvm_para_state { static DEFINE_PER_CPU(struct kvm_para_state, para_state); static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64); +static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64); +static int has_steal_clock = 0; static struct kvm_para_state *kvm_para_state(void) { @@ -441,6 +452,21 @@ static void __init paravirt_ops_setup(void) #endif } +static void kvm_register_steal_time(void) +{ + int cpu = smp_processor_id(); + struct kvm_steal_time *st = &per_cpu(steal_time, cpu); + + if (!has_steal_clock) + return; + + memset(st, 0, sizeof(*st)); + + wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED)); + printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n", + cpu, __pa(st)); +} + void __cpuinit kvm_guest_cpu_init(void) { if (!kvm_para_available()) @@ -457,6 +483,9 @@ void __cpuinit kvm_guest_cpu_init(void) printk(KERN_INFO "KVM setup async PF for cpu %d\n", smp_processor_id()); } + + if (has_steal_clock) + kvm_register_steal_time(); } static void kvm_pv_disable_apf(void *unused) @@ -483,6 +512,31 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; +static u64 kvm_steal_clock(int cpu) +{ + u64 steal; + struct kvm_steal_time *src; + int version; + + src = &per_cpu(steal_time, cpu); + do { + version = src->version; + rmb(); + steal = src->steal; + rmb(); + } while ((version & 1) || (version != src->version)); + + return steal; +} + +void kvm_disable_steal_time(void) +{ + 
if (!has_steal_clock) + return; + + wrmsr(MSR_KVM_STEAL_TIME, 0, 0); +} + #ifdef CONFIG_SMP static void __init kvm_smp_prepare_boot_cpu(void) { @@ -500,6 +554,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy) static void kvm_guest_cpu_offline(void *dummy) { + kvm_disable_steal_time(); kvm_pv_disable_apf(NULL); apf_task_wake_all(); } @@ -548,6 +603,11 @@ void __init kvm_guest_init(void) if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF)) x86_init.irqs.trap_init = kvm_apf_trap_init; + if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { + has_steal_clock = 1; + pv_time_ops.steal_clock
[PATCH v4 3/9] KVM-HDR: KVM Steal time implementation
To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM. This is per-vcpu, and using the kvmclock structure for that is an abuse we decided not to make. In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that holds the memory area address containing information about steal time. This patch contains the headers for it. I am keeping it separate to facilitate backports to people who want to backport the kernel part but not the hypervisor, or the other way around. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- Documentation/virtual/kvm/msr.txt | 35 +++ arch/x86/include/asm/kvm_para.h |9 + 2 files changed, 44 insertions(+), 0 deletions(-) diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt index d079aed..38db3f8 100644 --- a/Documentation/virtual/kvm/msr.txt +++ b/Documentation/virtual/kvm/msr.txt @@ -185,3 +185,38 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02 Currently type 2 APF will be always delivered on the same vcpu as type 1 was, but guest should not rely on that. + +MSR_KVM_STEAL_TIME: 0x4b564d03 + + data: 64-byte alignment physical address of a memory area which must be + in guest RAM, plus an enable bit in bit 0. This memory is expected to + hold a copy of the following structure: + + struct kvm_steal_time { + __u64 steal; + __u32 version; + __u32 flags; + __u32 pad[12]; + } + + whose data will be filled in by the hypervisor periodically. Only one + write, or registration, is needed for each VCPU. The interval between + updates of this structure is arbitrary and implementation-dependent. 
+ The hypervisor may update this structure at any time it sees fit until + anything with bit0 == 0 is written to it. Guest is required to make sure + this structure is initialized to zero. + + Fields have the following meanings: + + version: a sequence counter. In other words, guest has to check + this field before and after grabbing time information and make + sure they are both equal and even. An odd version indicates an + in-progress update. + + flags: At this point, always zero. May be used to indicate + changes in this structure in the future. + + steal: the amount of time in which this vCPU did not run, in + nanoseconds. Time during which the vcpu is idle, will not be + reported as steal time. + diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index d6cd79b..65f8bb9 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -21,6 +21,7 @@ */ #define KVM_FEATURE_CLOCKSOURCE2 3 #define KVM_FEATURE_ASYNC_PF 4 +#define KVM_FEATURE_STEAL_TIME 5 /* The last 8 bits are used to indicate how to interpret the flags field * in pvclock structure. If no bits are set, all flags are ignored. @@ -35,6 +36,14 @@ #define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02 +#define MSR_KVM_STEAL_TIME 0x4b564d03 + +struct kvm_steal_time { + __u64 steal; + __u32 version; + __u32 flags; + __u32 pad[12]; +}; #define KVM_MAX_MMU_OP_BATCH 32 -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
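[Editor's note appended to the archive:] The version field documented above implements a seqcount-style protocol: the host makes the counter odd before touching the data and even again after, and the guest retries a read whenever the counter is odd or changed underneath it. Below is a hedged, single-threaded Python model of that protocol — the memory barriers (rmb()) the real kvm_steal_clock() needs have no analogue here, and the class names are invented for illustration.

```python
class StealTimeArea:
    """Models the version/steal fields of struct kvm_steal_time."""
    def __init__(self):
        self.version = 0  # even = stable, odd = update in progress
        self.steal = 0    # accumulated steal time in nanoseconds

def host_update(area: StealTimeArea, new_steal_ns: int) -> None:
    area.version += 1        # now odd: readers must retry
    area.steal = new_steal_ns
    area.version += 1        # even again: value is published

def guest_read(area: StealTimeArea) -> int:
    # Retry until the version is even and unchanged across the read,
    # mirroring the do/while loop in the guest's kvm_steal_clock().
    while True:
        v = area.version
        steal = area.steal
        if (v & 1) == 0 and v == area.version:
            return steal

area = StealTimeArea()
host_update(area, 5_000_000)
val = guest_read(area)
```

In a real concurrent setting the two version checks are what make a torn read impossible: a reader that raced with host_update() sees either an odd version or a version mismatch and loops.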
Re: [PATCH v3 2/9] KVM-HDR Add constant to represent KVM MSRs enabled bit
On 06/30/2011 09:59 AM, Eric B Munson wrote: On Thu, 30 Jun 2011, Avi Kivity wrote: On 06/30/2011 12:56 AM, Eric B Munson wrote: My mail provider seems to have dropped patch 1 of the series so I can't reply directly to it, please add my Tested-by there as well. How did you test it then? I built host and guest kernels with the patches and pinned a while(1) and the CPU thread from qemu to CPU 2 on the host. I then started the same while(1) process in the guest and verified that I see ~50% steal time reported. I then built 2.6.39 (just the code I had present) on the guest and timed it while it was competing with the while(1) on the host for CPU time. Next I built the guest kernel with STEAL_TIME=N and reran the kernel compile to make sure that there weren't any huge variations in performance. Eric I think what Avi means is, it won't even compile without PATCH 1/9. If you don't have it, how could you test it ? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 7/9] KVM-GST: KVM Steal time accounting
On 06/30/2011 06:54 PM, Peter Zijlstra wrote: On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote: +static noinline bool touch_steal_time(int is_idle) That noinline is very unlucky there, +{ + u64 steal, st = 0; + + if (static_branch(&paravirt_steal_enabled)) { + + steal = paravirt_steal_clock(smp_processor_id()); + + steal -= this_rq()->prev_steal_time; + + st = steal_ticks(steal); + this_rq()->prev_steal_time += st * TICK_NSEC; + + if (is_idle || st == 0) + return false; + + account_steal_time(st); + return true; + } + return false; +} + static void update_rq_clock_task(struct rq *rq, s64 delta) { s64 irq_delta; @@ -3716,6 +3760,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime, struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; cputime64_t tmp; + if (touch_steal_time(0)) + return; Means we have an unconditional call here, even if the static_branch() is patched out. Ok. I was under the impression that the proper use of jump labels required each label to be tied to a single location. If we make it inline, the same key would point to multiple locations, and we would have trouble altering all of the locations. I might be wrong, of course. Isn't it the case? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 7/9] KVM-GST: KVM Steal time accounting
On 06/30/2011 06:54 PM, Peter Zijlstra wrote: On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote: This patch accounts steal time in kernel/sched. I kept it from the last proposal, because I still see advantages in it: Doing it here will give us easier access from scheduler variables such as the cpu rq. The next patch shows an example of usage for it. Since functions like account_idle_time() can be called from multiple places, not only account_process_tick(), steal time grabbing is repeated in each account function separately. That changelog is so going to frustrate someone trying to re-create your thinking in a year's time. They really don't care about the last proposals and the next patch. You're right. I'll review all changelogs and rewrite them as needed. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 7/9] KVM-GST: KVM Steal time accounting
On 06/30/2011 06:54 PM, Peter Zijlstra wrote: On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote: + if (static_branch(&paravirt_steal_enabled)) { How is that going to compile on !CONFIG_PARAVIRT or !x86 in general? Only x86-PARAVIRT will provide that variable. Good point. I'd wrap it into CONFIG_PARAVIRT. To be clear, the reason I did not put it inside CONFIG_PARAVIRT_TIME_ACCOUNTING is because I wanted to have the mere display of steal time separated from the rest - unless, of course, you object to this idea. Using CONFIG_PARAVIRT achieves this goal well. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 6/9] KVM-GST: Add a pv_ops stub for steal time
This patch adds a function pointer in one of the many paravirt_ops structs, to allow guests to register a steal time function. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- arch/x86/include/asm/paravirt.h |9 + arch/x86/include/asm/paravirt_types.h |1 + arch/x86/kernel/paravirt.c|9 + 3 files changed, 19 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index ebbc4d8..a7d2db9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -230,6 +230,15 @@ static inline unsigned long long paravirt_sched_clock(void) return PVOP_CALL0(unsigned long long, pv_time_ops.sched_clock); } +struct jump_label_key; +extern struct jump_label_key paravirt_steal_enabled; +extern struct jump_label_key paravirt_steal_rq_enabled; + +static inline u64 paravirt_steal_clock(int cpu) +{ + return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); +} + static inline unsigned long long paravirt_read_pmc(int counter) { return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter); diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 8288509..2c76521 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -89,6 +89,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); + unsigned long long (*steal_clock)(int cpu); unsigned long (*get_tsc_khz)(void); }; diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 869e1ae..613a793 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -202,6 +202,14 @@ static void native_flush_tlb_single(unsigned long addr) __native_flush_tlb_single(addr); } +struct jump_label_key paravirt_steal_enabled; 
+struct jump_label_key paravirt_steal_rq_enabled; + +static u64 native_steal_clock(int cpu) +{ + return 0; +} + /* These are in entry.S */ extern void native_iret(void); extern void native_irq_enable_sysexit(void); @@ -307,6 +315,7 @@ struct pv_init_ops pv_init_ops = { struct pv_time_ops pv_time_ops = { .sched_clock = native_sched_clock, + .steal_clock = native_steal_clock, }; struct pv_irq_ops pv_irq_ops = { -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 2/9] KVM-HDR Add constant to represent KVM MSRs enabled bit
This patch is simple, put in a different commit so it can be more easily shared between guest and hypervisor. It just defines a named constant to indicate the enable bit for KVM-specific MSRs. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- arch/x86/include/asm/kvm_para.h |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index a427bf7..d6cd79b 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -30,6 +30,7 @@ #define MSR_KVM_WALL_CLOCK 0x11 #define MSR_KVM_SYSTEM_TIME 0x12 +#define KVM_MSR_ENABLED 1 /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */ #define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01 -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 0/9] Steal time for KVM
This patchset builds on the previous one, merging the suggestions made mainly by Gleb, Peter and Marcelo. SCHEDSTATS is now used when available to provide information about time spent on the runqueue for the cpu threads. Information to/from the guest is obtained using *_cached() functions, as suggested by Gleb. On the guest side, the two prev_steal_time variables are collapsed into a single one. Steal time accounting is also done inside update_rq_clock(). Glauber Costa (8): KVM-HDR Add constant to represent KVM MSRs enabled bit KVM-HDR: KVM Steal time implementation KVM-HV: KVM Steal time implementation KVM-HV: use schedstats to calculate steal time KVM-GST: Add a pv_ops stub for steal time KVM-GST: KVM Steal time accounting KVM-GST: adjust scheduler cpu power KVM-GST: KVM Steal time registration Gleb Natapov (1): introduce kvm_read_guest_cached Documentation/kernel-parameters.txt |4 ++ Documentation/virtual/kvm/msr.txt | 33 arch/x86/Kconfig | 12 arch/x86/include/asm/kvm_host.h |8 +++ arch/x86/include/asm/kvm_para.h | 15 + arch/x86/include/asm/paravirt.h |9 +++ arch/x86/include/asm/paravirt_types.h |1 + arch/x86/kernel/kvm.c | 73 + arch/x86/kernel/kvmclock.c|2 + arch/x86/kernel/paravirt.c|9 +++ arch/x86/kvm/x86.c| 67 ++- include/linux/kvm_host.h |2 + kernel/sched.c| 94 +--- kernel/sched_features.h |4 +- virt/kvm/kvm_main.c | 20 +++ 15 files changed, 339 insertions(+), 14 deletions(-) -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 8/9] KVM-GST: adjust scheduler cpu power
This is a first proposal for using steal time information to influence the scheduler. There are a lot of optimizations and fine grained adjustments to be done, but it is working reasonably well so far for me (mostly). With this patch (and some host pinnings to demonstrate the situation), two vcpus with very different steal time (say 80% vs 1%) will not get an even distribution of processes. This is a situation that can naturally arise, especially in overcommitted scenarios. Previously, the guest scheduler would wrongly think that all cpus have the same ability to run processes, lowering the overall throughput. Signed-off-by: Glauber Costa glom...@redhat.com CC: Rik van Riel r...@redhat.com CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com CC: Peter Zijlstra pet...@infradead.org CC: Avi Kivity a...@redhat.com CC: Anthony Liguori aligu...@us.ibm.com CC: Eric B Munson emun...@mgebm.net --- arch/x86/Kconfig| 12 kernel/sched.c | 44 ++-- kernel/sched_features.h |4 ++-- 3 files changed, 48 insertions(+), 12 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index da34972..b26f312 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -512,6 +512,18 @@ menuconfig PARAVIRT_GUEST if PARAVIRT_GUEST +config PARAVIRT_TIME_ACCOUNTING + bool "Paravirtual steal time accounting" + select PARAVIRT + default n + ---help--- + Select this option to enable fine granularity task steal time + accounting. Time spent executing other tasks in parallel with + the current vCPU is discounted from the vCPU power. To account for + that, there can be a small performance impact. + + If in doubt, say N here. 
+ source "arch/x86/xen/Kconfig" config KVM_CLOCK diff --git a/kernel/sched.c b/kernel/sched.c index c166863..3ef0de9 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1975,7 +1975,7 @@ static inline u64 steal_ticks(u64 steal) * tell the underlying hypervisor that we grabbed the data, but skip steal time * accounting */ -static noinline bool touch_steal_time(int is_idle) +static noinline bool __touch_steal_time(int is_idle, u64 max_steal, u64 *ticks) { u64 steal, st = 0; @@ -1985,8 +1985,13 @@ static noinline bool touch_steal_time(int is_idle) steal -= this_rq()->prev_steal_time; + if (steal > max_steal) + steal = max_steal; + st = steal_ticks(steal); this_rq()->prev_steal_time += st * TICK_NSEC; + if (ticks) + *ticks = st; if (is_idle || st == 0) return false; @@ -1997,10 +2002,16 @@ static noinline bool touch_steal_time(int is_idle) return false; } +static inline bool touch_steal_time(int is_idle) +{ + return __touch_steal_time(is_idle, UINT_MAX, NULL); +} + static void update_rq_clock_task(struct rq *rq, s64 delta) { - s64 irq_delta; + s64 irq_delta = 0, steal = 0; +#ifdef CONFIG_IRQ_TIME_ACCOUNTING irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; /* @@ -2023,12 +2034,30 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) rq->prev_irq_time += irq_delta; delta -= irq_delta; +#endif +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING + if (static_branch((&paravirt_steal_rq_enabled))) { + int is_idle; + u64 st; + + is_idle = ((rq->curr != rq->idle) || + irq_count() != HARDIRQ_OFFSET); + + __touch_steal_time(is_idle, delta, &st); + + steal = st * TICK_NSEC; + + delta -= steal; + } +#endif + rq->clock_task += delta; - if (irq_delta && sched_feat(NONIRQ_POWER)) - sched_rt_avg_update(rq, irq_delta); + if ((irq_delta + steal) && sched_feat(NONTASK_POWER)) + sched_rt_avg_update(rq, irq_delta + steal); } +#ifdef CONFIG_IRQ_TIME_ACCOUNTING static int irqtime_account_hi_update(void) { struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; @@ -2063,12 +2092,7 @@ static int 
irqtime_account_si_update(void)

 #define sched_clock_irqtime	(0)

-static void update_rq_clock_task(struct rq *rq, s64 delta)
-{
-	rq->clock_task += delta;
-}
-
-#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+#endif

 #include "sched_idletask.c"
 #include "sched_fair.c"

diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index be40f73..ca3b025 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,9 +61,9 @@ SCHED_FEAT(LB_BIAS, 1)
 SCHED_FEAT(OWNER_SPIN, 1)

 /*
- * Decrement CPU power based on irq activity
+ * Decrement CPU power based on time not spent running tasks
  */
-SCHED_FEAT(NONIRQ_POWER, 1)
+SCHED_FEAT(NONTASK_POWER, 1)

 /*
  * Queue remote wakeups on the target CPU and process them
--
1.7.3.4
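The arithmetic of the update_rq_clock_task() change above can be modeled in user space: out of a wall-clock delta, stolen time is capped at the delta, rounded down to whole ticks, subtracted from the task clock, and reported as "non-task" time that lowers CPU power. The sketch below is illustrative only; the constant and helper name are not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

#define TICK_NSEC 1000000ULL	/* illustrative 1 ms tick, not the kernel constant */

/*
 * Toy model of the patch above: clip steal to the window, keep only whole
 * ticks (like steal_ticks()), report them as non-task time, and return what
 * the task clock actually advances by.
 */
static uint64_t account_delta(uint64_t delta, uint64_t steal_ns,
			      uint64_t *non_task_ns)
{
	if (steal_ns > delta)			/* the max_steal cap */
		steal_ns = delta;
	steal_ns -= steal_ns % TICK_NSEC;	/* whole ticks only */
	*non_task_ns = steal_ns;		/* feeds the NONTASK_POWER path */
	return delta - steal_ns;		/* rq->clock_task increment */
}
```

Splitting the result this way is what lets a vcpu with heavy steal advertise less CPU power to the load balancer while its task clock stays honest.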
[PATCH v3 3/9] KVM-HDR: KVM Steal time implementation
To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM. This is per-vcpu, and using the kvmclock structure for that is an abuse we decided not to make. In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that holds the memory area address containing information about steal time.

This patch contains the headers for it. I am keeping it separate to facilitate backports to people who want to backport the kernel part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 Documentation/virtual/kvm/msr.txt | 33 +
 arch/x86/include/asm/kvm_para.h   |  9 +
 2 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index d079aed..fab9b30 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -185,3 +185,36 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
 	Currently type 2 APF will be always delivered on the same vcpu as
 	type 1 was, but guest should not rely on that.
+
+MSR_KVM_STEAL_TIME: 0x4b564d03
+
+	data: 64-byte alignment physical address of a memory area which must be
+	in guest RAM, plus an enable bit in bit 0. This memory is expected to
+	hold a copy of the following structure:
+
+	struct kvm_steal_time {
+		__u64 steal;
+		__u32 version;
+		__u32 flags;
+		__u32 pad[12];
+	}
+
+	whose data will be filled in by the hypervisor periodically. Only one
+	write, or registration, is needed for each VCPU. The interval between
+	updates of this structure is arbitrary and implementation-dependent.
+	The hypervisor may update this structure at any time it sees fit until
+	anything with bit0 == 0 is written to it.
	Guest is required to make sure
+	this structure is initialized to zero.
+
+	Fields have the following meanings:
+
+		version: guest has to check version before and after grabbing
+		time information and check that they are both equal and even.
+		An odd version indicates an in-progress update.
+
+		flags: At this point, always zero. May be used to indicate
+		changes in this structure in the future.
+
+		steal: the amount of time in which this vCPU did not run, in
+		nanoseconds.
+
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index d6cd79b..65f8bb9 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,7 @@
  */
 #define KVM_FEATURE_CLOCKSOURCE2	3
 #define KVM_FEATURE_ASYNC_PF		4
+#define KVM_FEATURE_STEAL_TIME		5

 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
@@ -35,6 +36,14 @@
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
+#define MSR_KVM_STEAL_TIME  0x4b564d03
+
+struct kvm_steal_time {
+	__u64 steal;
+	__u32 version;
+	__u32 flags;
+	__u32 pad[12];
+};

 #define KVM_MAX_MMU_OP_BATCH	32
--
1.7.3.4
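The version field above defines a seqcount-style protocol: a reader must see an even, unchanged version across the read of steal, retrying otherwise. A single-threaded user-space model of that protocol is sketched below (the function name is ours; the real guest-side reader, with the required read barriers, appears later in this series as kvm_steal_clock()).

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the layout documented above. */
struct kvm_steal_time {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint32_t pad[12];
};

/*
 * Seqcount-style read per the documented protocol: the value is only
 * stable if version was even and unchanged around the load of steal.
 * In the kernel, rmb() barriers are needed between the loads.
 */
static uint64_t read_steal(const volatile struct kvm_steal_time *st)
{
	uint32_t version;
	uint64_t steal;

	do {
		version = st->version;		/* must be even */
		steal = st->steal;
	} while ((version & 1) || (version != st->version));

	return steal;
}
```

This is why the hypervisor side always bumps version by 2 per update: the transient odd value marks an update in progress.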
[PATCH v3 4/9] KVM-HV: KVM Steal time implementation
To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM. This is per-vcpu, and using the kvmclock structure for that is an abuse we decided not to make. In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that holds the memory area address containing information about steal time.

This patch contains the hypervisor part for it. I am keeping it separate from the headers to facilitate backports to people who want to backport the kernel part but not the hypervisor, or the other way around.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 arch/x86/include/asm/kvm_host.h |  8 ++
 arch/x86/include/asm/kvm_para.h |  4 +++
 arch/x86/kvm/x86.c              | 53 --
 3 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da6bbee..f85ed56 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -389,6 +389,14 @@ struct kvm_vcpu_arch {
 	unsigned int hw_tsc_khz;
 	unsigned int time_offset;
 	struct page *time_page;
+
+	struct {
+		u64 msr_val;
+		u64 this_time_out;
+		struct gfn_to_hva_cache stime;
+		struct kvm_steal_time steal;
+	} st;
+
 	u64 last_guest_tsc;
 	u64 last_kernel_ns;
 	u64 last_tsc_nsec;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 65f8bb9..c484ba8 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -45,6 +45,10 @@ struct kvm_steal_time {
 	__u32 pad[12];
 };

+#define KVM_STEAL_ALIGNMENT_BITS 5
+#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
+#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1) << 1)
+
 #define KVM_MAX_MMU_OP_BATCH	32

 #define
KVM_ASYNC_PF_ENABLED (1 << 0)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7167717..df9d274 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -808,12 +808,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr)
  * kvm-specific. Those are put in the beginning of the list.
  */
-#define KVM_SAVE_MSRS_BEGIN	8
+#define KVM_SAVE_MSRS_BEGIN	9
 static u32 msrs_to_save[] = {
 	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
 	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
 	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
-	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
+	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
 	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
 	MSR_STAR,
 #ifdef CONFIG_X86_64
@@ -1491,6 +1491,26 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
 	}
 }

+static void record_steal_time(struct kvm_vcpu *vcpu)
+{
+	u64 delta;
+
+	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+		return;
+
+	if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+		return;
+
+	delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
+
+	vcpu->arch.st.steal.steal += delta;
+	vcpu->arch.st.steal.version += 2;
+
+	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+}
+
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
 	switch (msr) {
@@ -1573,6 +1593,25 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		if (kvm_pv_enable_async_pf(vcpu, data))
 			return 1;
 		break;
+	case MSR_KVM_STEAL_TIME:
+		vcpu->arch.st.msr_val = data;
+
+		if (!(data & KVM_MSR_ENABLED)) {
+			break;
+		}
+
+		if (data & KVM_STEAL_RESERVED_MASK)
+			return 1;
+
+		if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.st.stime,
+						data & KVM_STEAL_VALID_BITS))
+			return 1;
+
+		vcpu->arch.st.this_time_out = get_kernel_ns();
+		record_steal_time(vcpu);
+
+		break;
+
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
 	case MSR_IA32_MC0_CTL ...
MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1858,6 +1897,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 	case MSR_KVM_ASYNC_PF_EN:
 		data = vcpu->arch.apf.msr_val;
 		break;
+	case MSR_KVM_STEAL_TIME:
+		data
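The VALID/RESERVED masks in the patch above split the MSR value into a 64-byte-aligned address (bits 6-63), reserved bits 1-5, and the enable bit 0. The helper below is hypothetical, not KVM code; it only demonstrates the same checks kvm_set_msr_common() performs on an MSR_KVM_STEAL_TIME write.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout from the patch: bit 0 = enable, bits 1-5 reserved (forcing
 * 64-byte alignment of the structure), bits 6-63 = address bits. */
#define KVM_MSR_ENABLED          1ULL
#define KVM_STEAL_ALIGNMENT_BITS 5
#define KVM_STEAL_VALID_BITS   ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
#define KVM_STEAL_RESERVED_MASK (((1ULL << KVM_STEAL_ALIGNMENT_BITS) - 1) << 1)

/* Hypothetical helper: reject a write with reserved bits set (KVM returns 1,
 * which the guest sees as a #GP on wrmsr), otherwise extract the GPA. */
static int steal_msr_addr(uint64_t data, uint64_t *addr)
{
	if (data & KVM_STEAL_RESERVED_MASK)
		return -1;			/* invalid registration */
	*addr = data & KVM_STEAL_VALID_BITS;	/* 64-byte aligned guest address */
	return 0;
}
```

Masking with KVM_STEAL_VALID_BITS also strips the enable bit, so the same value can carry both the address and the on/off state.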
[PATCH v3 9/9] KVM-GST: KVM Steal time registration
Register steal time within KVM. Every time we sample the steal time information, we update a local variable that tells us what the last time read was. We then account the difference.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 Documentation/kernel-parameters.txt |  4 ++
 arch/x86/include/asm/kvm_para.h     |  1 +
 arch/x86/kernel/kvm.c               | 73 +++
 arch/x86/kernel/kvmclock.c          |  2 +
 4 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index fd248a31..a722574 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1737,6 +1737,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	no-kvmapf	[X86,KVM] Disable paravirtualized asynchronous page
 			fault handling.

+	no-steal-acc	[X86,KVM] Disable paravirtualized steal time accounting.
+			steal time is computed, but won't influence scheduler
+			behaviour
+
 	nolapic		[X86-32,APIC] Do not enable or use the local APIC.

 	nolapic_timer	[X86-32,APIC] Do not use the local APIC timer.

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index c484ba8..35d732d 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -94,6 +94,7 @@ struct kvm_vcpu_pv_apf_data {

 extern void kvmclock_init(void);
 extern int kvm_register_clock(char *txt);
+extern void kvm_disable_steal_time(void);

 /* This instruction is vmcall.
On non-VT architectures, it will generate a

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 33c07b0..58331c2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -51,6 +51,15 @@ static int parse_no_kvmapf(char *arg)

 early_param("no-kvmapf", parse_no_kvmapf);

+static int steal_acc = 1;
+static int parse_no_stealacc(char *arg)
+{
+	steal_acc = 0;
+	return 0;
+}
+
+early_param("no-steal-acc", parse_no_stealacc);
+
 struct kvm_para_state {
 	u8 mmu_queue[MMU_QUEUE_SIZE];
 	int mmu_queue_len;
@@ -58,6 +67,8 @@ struct kvm_para_state {

 static DEFINE_PER_CPU(struct kvm_para_state, para_state);
 static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
+static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
+static int has_steal_clock = 0;

 static struct kvm_para_state *kvm_para_state(void)
 {
@@ -441,6 +452,21 @@ static void __init paravirt_ops_setup(void)
 #endif
 }

+static void kvm_register_steal_time(void)
+{
+	int cpu = smp_processor_id();
+	struct kvm_steal_time *st = &per_cpu(steal_time, cpu);
+
+	if (!has_steal_clock)
+		return;
+
+	memset(st, 0, sizeof(*st));
+
+	wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
+	printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
+		cpu, __pa(st));
+}
+
 void __cpuinit kvm_guest_cpu_init(void)
 {
 	if (!kvm_para_available())
@@ -457,6 +483,9 @@ void __cpuinit kvm_guest_cpu_init(void)
 		printk(KERN_INFO "KVM setup async PF for cpu %d\n",
 		       smp_processor_id());
 	}
+
+	if (has_steal_clock)
+		kvm_register_steal_time();
 }

 static void kvm_pv_disable_apf(void *unused)
@@ -483,6 +512,31 @@ static struct notifier_block kvm_pv_reboot_nb = {
 	.notifier_call = kvm_pv_reboot_notify,
 };

+static u64 kvm_steal_clock(int cpu)
+{
+	u64 steal;
+	struct kvm_steal_time *src;
+	int version;
+
+	src = &per_cpu(steal_time, cpu);
+	do {
+		version = src->version;
+		rmb();
+		steal = src->steal;
+		rmb();
+	} while ((version & 1) || (version != src->version));
+
+	return steal;
+}
+
+void kvm_disable_steal_time(void)
+{
+	if (!has_steal_clock)
+		return;
+
+	wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -500,6 +554,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)

 static void kvm_guest_cpu_offline(void *dummy)
 {
+	kvm_disable_steal_time();
 	kvm_pv_disable_apf(NULL);
 	apf_task_wake_all();
 }
@@ -548,6 +603,11 @@ void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
 		x86_init.irqs.trap_init = kvm_apf_trap_init;

+	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
+		has_steal_clock = 1;
+		pv_time_ops.steal_clock = kvm_steal_clock;
+	}
+
 #ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu
[PATCH v3 1/9] introduce kvm_read_guest_cached
From: Gleb Natapov g...@redhat.com

Introduce a kvm_read_guest_cached() function in addition to the write one we already have.

[ by glauber: export function signature in kvm header ]

Signed-off-by: Gleb Natapov g...@redhat.com
Signed-off-by: Glauber Costa glom...@redhat.com
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      | 20
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 31ebb59..f7df0a3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -381,6 +381,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
 int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
			  unsigned long len);
 int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
+int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			  void *data, unsigned long len);
 int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
			 int offset, int len);
 int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 11d2783..d5ef9eb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1418,6 +1418,26 @@ int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 }
 EXPORT_SYMBOL_GPL(kvm_write_guest_cached);

+int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			  void *data, unsigned long len)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	int r;
+
+	if (slots->generation != ghc->generation)
+		kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa);
+
+	if (kvm_is_error_hva(ghc->hva))
+		return -EFAULT;
+
+	r = __copy_from_user(data, (void __user *)ghc->hva, len);
+	if (r)
+		return -EFAULT;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_read_guest_cached);
+
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
 {
	return kvm_write_guest_page(kvm, gfn, (const void *) empty_zero_page,
--
1.7.3.4
[PATCH v3 5/9] KVM-HV: use schedstats to calculate steal time
SCHEDSTATS provides a precise source of information about the time tasks spent on a runqueue, but not running (among other things). It is especially useful for the steal time implementation, because it doesn't record halt time at all.

To avoid a hard dependency on schedstats, since it is possible one won't want to record statistics about all processes running, the previous method of time measurement on put/load vcpu is kept for !SCHEDSTATS.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
CC: Marcelo Tosatti mtosa...@redhat.com
---
 arch/x86/kvm/x86.c | 22 ++
 1 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index df9d274..7e87159 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1502,7 +1502,13 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
		return;

-	delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+	if (likely(sched_info_on())) {
+		delta = current->sched_info.run_delay - vcpu->arch.st.this_time_out;
+		vcpu->arch.st.this_time_out = current->sched_info.run_delay;
+	} else
+#endif
+		delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);

	vcpu->arch.st.steal.steal += delta;
	vcpu->arch.st.steal.version += 2;
@@ -1607,9 +1613,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
				data & KVM_STEAL_VALID_BITS))
			return 1;

-		vcpu->arch.st.this_time_out = get_kernel_ns();
-		record_steal_time(vcpu);
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+		if (likely(sched_info_on()))
+			vcpu->arch.st.this_time_out = current->sched_info.run_delay;
+		else
+#endif
+			vcpu->arch.st.this_time_out = get_kernel_ns();

+		record_steal_time(vcpu);
		break;
	case MSR_IA32_MCG_CTL:
@@ -2220,7 +2231,10 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
	kvm_x86_ops->vcpu_put(vcpu);
	kvm_put_guest_fpu(vcpu);
	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
-	vcpu->arch.st.this_time_out = get_kernel_ns();
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+	if (unlikely(!sched_info_on()))
+#endif
+		vcpu->arch.st.this_time_out = get_kernel_ns();
 }

 static int is_efer_nx(void)
--
1.7.3.4
[PATCH v3 7/9] KVM-GST: KVM Steal time accounting
This patch accounts steal time in kernel/sched. I kept it from the last proposal, because I still see advantages in it: doing it here will give us easier access to scheduler variables such as the cpu rq. The next patch shows an example of usage for it.

Since functions like account_idle_time() can be called from multiple places, not only account_process_tick(), steal time grabbing is repeated in each account function separately.

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Rik van Riel r...@redhat.com
CC: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
CC: Peter Zijlstra pet...@infradead.org
CC: Avi Kivity a...@redhat.com
CC: Anthony Liguori aligu...@us.ibm.com
CC: Eric B Munson emun...@mgebm.net
---
 kernel/sched.c | 52
 1 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3f2e502..c166863 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -75,6 +75,7 @@
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
 #include <asm/mutex.h>
+#include <asm/paravirt.h>

 #include "sched_cpupri.h"
 #include "workqueue_sched.h"
@@ -528,6 +529,7 @@ struct rq {
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
	u64 prev_irq_time;
 #endif
+	u64 prev_steal_time;

	/* calc_load related fields */
	unsigned long calc_load_update;
@@ -1953,6 +1955,48 @@ void account_system_vtime(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(account_system_vtime);

+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
+static inline u64 steal_ticks(u64 steal)
+{
+	if (unlikely(steal > NSEC_PER_SEC))
+		return steal / TICK_NSEC;
+
+	return __iter_div_u64_rem(steal, TICK_NSEC, &steal);
+}
+
+/*
+ * We have to flush steal time information every time something else
+ * is accounted. Since the accounting functions are all visible to the rest
+ * of the kernel, it gets tricky to do them in one place. This helper function
+ * helps us.
+ *
+ * When the system is idle, the concept of steal time does not apply.
We just
+ * tell the underlying hypervisor that we grabbed the data, but skip steal time
+ * accounting
+ */
+static noinline bool touch_steal_time(int is_idle)
+{
+	u64 steal, st = 0;
+
+	if (static_branch(&paravirt_steal_enabled)) {
+
+		steal = paravirt_steal_clock(smp_processor_id());
+
+		steal -= this_rq()->prev_steal_time;
+
+		st = steal_ticks(steal);
+		this_rq()->prev_steal_time += st * TICK_NSEC;
+
+		if (is_idle || st == 0)
+			return false;
+
+		account_steal_time(st);
+		return true;
+	}
+	return false;
+}
+
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
	s64 irq_delta;
@@ -3716,6 +3760,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
	cputime64_t tmp;

+	if (touch_steal_time(0))
+		return;
+
	/* Add user time to process. */
	p->utime = cputime_add(p->utime, cputime);
	p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
@@ -3802,6 +3849,9 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
	cputime64_t *target_cputime64;

+	if (touch_steal_time(0))
+		return;
+
	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
		account_guest_time(p, cputime, cputime_scaled);
		return;
@@ -3839,6 +3889,8 @@ void account_idle_time(cputime_t cputime)
	cputime64_t cputime64 = cputime_to_cputime64(cputime);
	struct rq *rq = this_rq();

+	touch_steal_time(1);
+
	if (atomic_read(&rq->nr_iowait) > 0)
		cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
	else
--
1.7.3.4
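The steal_ticks() helper above converts accumulated stolen nanoseconds into whole ticks, reserving the full 64-bit division for the rare large input and otherwise iterating like __iter_div_u64_rem(), which is cheaper on 32-bit hosts. A user-space model of the same arithmetic (the TICK_NSEC value is illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define TICK_NSEC    4000000ULL	/* illustrative: a 250 HZ tick; not the kernel constant */

/* Model of steal_ticks(): divide only when more than a second was stolen
 * (the unlikely() case); otherwise subtract tick-by-tick, mirroring
 * __iter_div_u64_rem(). The leftover sub-tick remainder stays behind and
 * is picked up the next time steal is sampled. */
static uint64_t steal_ticks(uint64_t steal)
{
	uint64_t ticks = 0;

	if (steal > NSEC_PER_SEC)
		return steal / TICK_NSEC;

	while (steal >= TICK_NSEC) {
		steal -= TICK_NSEC;
		ticks++;
	}
	return ticks;
}
```

Because the caller advances prev_steal_time by st * TICK_NSEC rather than by the raw delta, no stolen nanoseconds are ever lost to rounding.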
Re: [PATCH v2 3/7] KVM-HV: KVM Steal time implementation
On 06/20/2011 05:56 PM, Marcelo Tosatti wrote:
On Sun, Jun 19, 2011 at 12:57:53PM +0300, Avi Kivity wrote:
On 06/17/2011 01:20 AM, Glauber Costa wrote:

To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM. This is per-vcpu, and using the kvmclock structure for that is an abuse we decided not to make. In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that holds the memory area address containing information about steal time. This patch contains the hypervisor part for it. I am keeping it separate from the headers to facilitate backports to people who want to backport the kernel part but not the hypervisor, or the other way around.

+#define KVM_STEAL_ALIGNMENT_BITS 5
+#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
+#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1) << 1)

Clumsy, but okay.

+static void record_steal_time(struct kvm_vcpu *vcpu)
+{
+	u64 delta;
+
+	if (vcpu->arch.st.stime && vcpu->arch.st.this_time_out) {

0 is a valid value for stime.

+
+		if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
+			&vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
+
+			vcpu->arch.st.stime = 0;
+			return;
+		}
+
+		delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
+
+		vcpu->arch.st.steal.steal += delta;
+		vcpu->arch.st.steal.version += 2;
+
+		if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
+			&vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
+
+			vcpu->arch.st.stime = 0;
+			return;
+		}
+	}
+
+}
+
@@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
		kvm_migrate_timers(vcpu);
		vcpu->cpu = cpu;
	}
+
+	record_steal_time(vcpu);
 }

This records time spent in userspace in the vcpu thread as steal time. Is this what we want? Or just time preempted away?

It also accounts halt time (kvm_vcpu_block) as steal time.
Glauber, you could instead use the runnable-state-but-waiting-in-runqueue field of SCHEDSTATS, I forgot the exact name.

I thought about it in the past. I left the idea aside because I didn't want to introduce a dependency on SCHEDSTATS. But thinking about it again now (and after some days of experimentation with it), I think we could have both: use run_delay (the field you were thinking of) when schedstats are available, and fall back to an estimate method like the one we're doing when it is not.

Objections?
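The run_delay field discussed above (time spent runnable but waiting on a runqueue, accumulated by the schedstats/delay-accounting machinery) is also exported to user space as the second field of /proc/&lt;pid&gt;/schedstat, in nanoseconds. A small sketch for watching it from the current task; the helper name is ours, not a kernel API, and the file is only present when the kernel records scheduler info:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read the second field of /proc/self/schedstat: nanoseconds spent
 * runnable-but-waiting (run_delay). The other fields are time actually
 * on-cpu and the number of timeslices run. Returns -1 if the file is
 * unavailable (kernel built without scheduler info accounting).
 */
static long long read_run_delay(void)
{
	unsigned long long on_cpu, run_delay, timeslices;
	FILE *f = fopen("/proc/self/schedstat", "r");

	if (!f)
		return -1;
	if (fscanf(f, "%llu %llu %llu", &on_cpu, &run_delay, &timeslices) != 3)
		run_delay = 0;
	fclose(f);
	return (long long)run_delay;
}
```

This is the same counter the host-side patch samples via current->sched_info.run_delay, which is what makes it attractive here: it excludes halt time by construction.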
Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation
On 06/20/2011 05:02 AM, Avi Kivity wrote:
On 06/20/2011 10:21 AM, Gleb Natapov wrote:
On Sun, Jun 19, 2011 at 04:02:22PM +0300, Avi Kivity wrote:
On 06/19/2011 03:59 PM, Gleb Natapov wrote:
On Sun, Jun 19, 2011 at 03:35:58PM +0300, Avi Kivity wrote:
On 06/15/2011 12:09 PM, Gleb Natapov wrote:

Actually, I'd expect most reads/writes to benefit from caching, no? So why don't we just rename kvm_write_guest_cached() to kvm_write_guest(), and the few places - if any - that need to force traversing of the gfn mappings get renamed to kvm_write_guest_uncached?

Good idea. I do not see any places where kvm_write_guest_uncached is needed from a brief look. Avi?

kvm_write_guest_cached() needs something to supply the cache, and needs recurring writes to the same location. Neither of these are common (for example, instruction emulation doesn't have either).

Correct. Missed that.

So what about changing steal time to use kvm_write_guest_cached()?

Makes sense, definitely. Want to post read_guest_cached() as well?

Glauber, can you write read_guest_cached() as part of your series (should be trivial), or do you want me to do it? I do not have code to test it with, though :)

Yes. (you can write it, and Glauber can include it in the series)

Write it, hand me the patch, I'll include it and test it.
Re: [PATCH v2 5/7] KVM-GST: KVM Steal time accounting
On 06/19/2011 07:04 AM, Avi Kivity wrote:
On 06/17/2011 01:20 AM, Glauber Costa wrote:

This patch accounts steal time in kernel/sched. I kept it from the last proposal, because I still see advantages in it: doing it here will give us easier access to scheduler variables such as the cpu rq. The next patch shows an example of usage for it. Since functions like account_idle_time() can be called from multiple places, not only account_process_tick(), steal time grabbing is repeated in each account function separately.

+/*
+ * We have to flush steal time information every time something else
+ * is accounted. Since the accounting functions are all visible to the rest
+ * of the kernel, it gets tricky to do them in one place. This helper function
+ * helps us.
+ *
+ * When the system is idle, the concept of steal time does not apply. We just
+ * tell the underlying hypervisor that we grabbed the data, but skip steal time
+ * accounting
+ */
+static inline bool touch_steal_time(int is_idle)
+{
+	u64 steal, st = 0;
+
+	if (static_branch(&paravirt_steal_enabled)) {
+
+		steal = paravirt_steal_clock(smp_processor_id());
+
+		steal -= this_rq()->prev_steal_time;
+		if (is_idle) {
+			this_rq()->prev_steal_time += steal;
+			return false;
+		}
+
+		while (steal >= TICK_NSEC) {
+			/*
+			 * Inline assembly required to prevent the compiler
+			 * optimising this loop into a divmod call.
+			 * See __iter_div_u64_rem() for another example of this.
+			 */

Why not use said function?

Because here we want to do work during each loop iteration. The said function would have to be adapted for that, possibly using a macro, to run arbitrary code during each loop iteration, in a way that I don't think is worthwhile given the current number of callers (2, counting this new one).

+			asm("" : "+rm" (steal));
+
+			steal -= TICK_NSEC;
+			this_rq()->prev_steal_time += TICK_NSEC;
+			st++;

Suppose a live migration or SIGSTOP causes lots of steal time. How long will we spend here?

Silly me.
I actually used this same argument with Peter to cap it with delta in the next patch in this series. So I think you are 100% right. Here, however, we do want to account all that time, I believe. How about we do a slow division if we're over 10 sec (unlikely), and account everything as steal time in this scenario?

+		}
+
+		account_steal_time(st);
+		return !!st;

!! not needed, you're returning a bool.

Ah, sure thing.

+	}
+	return false;
+}
+

I'll need Peter's (or another sched maintainer's) review to apply this.
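The compromise discussed above (tick-at-a-time subtraction normally, a single division once the backlog exceeds some large threshold such as the 10 seconds mentioned, so a live migration or SIGSTOP burst cannot keep the loop spinning) can be checked numerically. The constants and names below are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

#define TICK_NSEC        1000000ULL		/* illustrative 1 ms tick */
#define LONG_STEAL_NSEC  (10ULL * 1000000000ULL) /* the "10 sec" threshold discussed */

/* Tick-at-a-time accounting, as in the quoted loop. */
static uint64_t ticks_by_loop(uint64_t steal)
{
	uint64_t st = 0;

	while (steal >= TICK_NSEC) {
		steal -= TICK_NSEC;
		st++;
	}
	return st;
}

/* Hybrid suggested in the thread: a single (slow) division when the
 * backlog is huge, e.g. after live migration or SIGSTOP; the loop
 * otherwise, so everything is still accounted as steal time. */
static uint64_t ticks_hybrid(uint64_t steal)
{
	if (steal > LONG_STEAL_NSEC)
		return steal / TICK_NSEC;
	return ticks_by_loop(steal);
}
```

Both paths produce the same tick count; the hybrid merely bounds the iteration count for pathological inputs.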