Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 17:45, Daniel Lezcano wrote: On 16/11/2017 10:12, Quan Xu wrote: On 2017-11-16 06:03, Thomas Gleixner wrote: On Wed, 15 Nov 2017, Peter Zijlstra wrote: On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote: From: Yang Zhang <yang.zhang...@gmail.com> Implement a generic idle poll which resembles the functionality found in arch/. Provide weak arch_cpu_idle_poll function which can be overridden by the architecture code if needed. No, we want less of those magic hooks, not more. Interrupts arrive which may not cause a reschedule in idle loops. In KVM guest, this costs several VM-exit/VM-entry cycles, VM-entry for interrupts and VM-exit immediately. Also this becomes more expensive than bare metal. Add a generic idle poll before enter real idle path. When a reschedule event is pending, we can bypass the real idle path. Why not do a HV specific idle driver? If I understand the problem correctly then he wants to avoid the heavy lifting in tick_nohz_idle_enter() in the first place, but there is already an interesting quirk there which makes it exit early. See commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this commit looks similar. But lets not proliferate that. I'd rather see that go away. agreed. Even we can get more benifit than commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu") in kvm guest. I won't proliferate that.. But the irq_timings stuff is heading into the same direction, with a more complex prediction logic which should tell you pretty good how long that idle period is going to be and in case of an interrupt heavy workload this would skip the extra work of stopping and restarting the tick and provide a very good input into a polling decision. interesting. I have tested with IRQ_TIMINGS related code, which seems not working so far. I don't know how you tested it, can you elaborate what you meant by "seems not working so far" ? Daniel, I tried to enable IRQ_TIMINGS* manually. 
I used irq_timings_next_event() to return an estimate of the earliest interrupt. However, I got a constant.

There is still some work to do to be more efficient. The prediction based on the irq timings is all right if the interrupts have a simple periodicity. But as soon as there is a pattern, the current code can't handle it properly and makes bad predictions. I'm working on a self-learning pattern detection which is too heavy for the kernel, and with it we should be able to detect the patterns properly and re-adjust the period if it changes. I'm in the process of making it suitable for kernel code (both math and perf). One improvement which can be done right now and which can help you is the interrupt rate on the CPU. It is possible to compute it, and that will give accurate information for the polling decision.

As tglx said, talk to each other / work together to make it usable for all use cases.

Could you share how to enable it to get the interrupt rate on the CPU? I can try it in a cloud scenario. Of course, I'd like to work with you to improve it.

Quan
Alibaba Cloud
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
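A minimal sketch of how a per-CPU interrupt rate could feed a polling decision, written as standalone userspace C. Everything here is an assumption made for illustration: struct irq_rate, irq_rate_record() and irq_rate_should_poll() are hypothetical names, the 1/8 smoothing weight is arbitrary, and none of this is the irq_timings code or an existing kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-CPU bookkeeping; not a kernel structure. */
struct irq_rate {
	int seen;            /* nonzero once at least one interrupt was recorded */
	uint64_t last_ts_ns; /* timestamp of the previous interrupt */
	uint64_t avg_gap_ns; /* smoothed gap between interrupts, 0 = no gap yet */
};

/* Record an interrupt at now_ns and fold the new gap into a moving average. */
static void irq_rate_record(struct irq_rate *r, uint64_t now_ns)
{
	assert(!r->seen || now_ns >= r->last_ts_ns);

	if (r->seen) {
		uint64_t gap = now_ns - r->last_ts_ns;

		if (r->avg_gap_ns == 0)
			r->avg_gap_ns = gap;                  /* first gap seeds the average */
		else
			r->avg_gap_ns = (7 * r->avg_gap_ns + gap) / 8; /* assumed weight 1/8 */
	}
	r->seen = 1;
	r->last_ts_ns = now_ns;
}

/*
 * Poll only when interrupts arrive, on average, faster than one halt
 * round trip costs (halt_cost_ns is an assumed, externally measured value).
 */
static int irq_rate_should_poll(const struct irq_rate *r, uint64_t halt_cost_ns)
{
	return r->avg_gap_ns != 0 && r->avg_gap_ns < halt_cost_ns;
}
```

With interrupts at 1000 ns, 3000 ns and 13000 ns the smoothed gap becomes 3000 ns, so polling would be chosen for an assumed halt cost of 5000 ns but not for 3000 ns.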
Re: [Xen-devel] [PATCH RFC v3 0/6] x86/idle: add halt poll support
On 2017-11-16 05:31, Konrad Rzeszutek Wilk wrote: On Mon, Nov 13, 2017 at 06:05:59PM +0800, Quan Xu wrote: From: Yang Zhang <yang.zhang...@gmail.com> Some latency-intensive workload have seen obviously performance drop when running inside VM. The main reason is that the overhead is amplified when running inside VM. The most cost I have seen is inside idle path. Meaning an VMEXIT b/c it is an 'halt' operation ? And then going back in guest (VMRESUME) takes time. And hence your latency gets all whacked b/c of this? Konrad, I can't follow 'b/c' here.. sorry. So if I understand - you want to use your _full_ timeslice (of the guest) without ever (or as much as possible) to go in the hypervisor? as much as possible. Which means in effect you don't care about power-saving or CPUfreq savings, you just want to eat the full CPU for snack? actually, we care about power-saving. The poll duration is self-tuning, otherwise it is almost as the same as 'halt=poll'. Also we always sent out with CPU usage of benchmark netperf/ctxsw. We got much more performance with limited promotion of CPU usage. This patch introduces a new mechanism to poll for a while before entering idle state. If schedule is needed during poll, then we don't need to goes through the heavy overhead path. Schedule of what? The guest or the host? rescheduled of guest scheduler.. it is the guest. Quan Alibaba Cloud ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-17 19:36, Thomas Gleixner wrote: On Fri, 17 Nov 2017, Quan Xu wrote: On 2017-11-16 17:53, Thomas Gleixner wrote: That's just plain wrong. We don't want to see any of this PARAVIRT crap in anything outside the architecture/hypervisor interfacing code which really needs it. The problem can and must be solved at the generic level in the first place to gather the data which can be used to make such decisions. How that information is used might be either completely generic or requires system specific variants. But as long as we don't have any information at all we cannot discuss that. Please sit down and write up which data needs to be considered to make decisions about probabilistic polling. Then we need to compare and contrast that with the data which is necessary to make power/idle state decisions. I would be very surprised if this data would not overlap by at least 90%. 1. which data needs to considerd to make decisions about probabilistic polling I really need to write up which data needs to considerd to make decisions about probabilistic polling. At last several months, I always focused on the data _from idle to reschedule_, then to bypass the idle loops. unfortunately, this makes me touch scheduler/idle/nohz code inevitably. with tglx's suggestion, the data which is necessary to make power/idle state decisions, is the last idle state's residency time. IIUC this data is duration from idle to wakeup, which maybe by reschedule irq or other irq. That's part of the picture, but not complete. tglx, could you share more? I am very curious about it.. I also test that the reschedule irq overlap by more than 90% (trace the need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for one minute. as the overlap, I think I can input the last idle state's residency time to make decisions about probabilistic polling, as @dev->last_residency does. it is much easier to get data. That's only true for your particular use case. 2. 
do a HV specific idle driver (function) so far, power management is not exposed to guest.. idle is simple for KVM guest, calling "sti" / "hlt"(cpuidle_idle_call() --> default_idle_call()).. thanks Xen guys, who has implemented the paravirt framework. I can implement it as easy as following: --- a/arch/x86/kernel/kvm.c Your email client is using a very strange formatting. my bad, I insert space to highlight these code. This is definitely better than what you proposed so far and implementing it as a prove of concept seems to be worthwhile. But I doubt that this is the final solution. It's not generic and not necessarily suitable for all use case scenarios. yes, I am exhausted :):) could you tell me the gap to be generic and necessarily suitable for all use case scenarios? as lack of irq/idle predictors? I really want to upstream it for all of public cloud users/providers.. as kvm host has a similar one, is it possible to upstream with following conditions? : 1). add a QEMU configuration, whether enable or not, by default disable. 2). add some "TODO" comments near the code. 3). ... anyway, thanks for your help.. Quan Alibaba Cloud ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 06:03, Thomas Gleixner wrote: On Wed, 15 Nov 2017, Peter Zijlstra wrote: On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote: From: Yang Zhang <yang.zhang...@gmail.com> Implement a generic idle poll which resembles the functionality found in arch/. Provide weak arch_cpu_idle_poll function which can be overridden by the architecture code if needed. No, we want less of those magic hooks, not more. Interrupts arrive which may not cause a reschedule in idle loops. In KVM guest, this costs several VM-exit/VM-entry cycles, VM-entry for interrupts and VM-exit immediately. Also this becomes more expensive than bare metal. Add a generic idle poll before enter real idle path. When a reschedule event is pending, we can bypass the real idle path. Why not do a HV specific idle driver? If I understand the problem correctly then he wants to avoid the heavy lifting in tick_nohz_idle_enter() in the first place, but there is already an interesting quirk there which makes it exit early. See commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this commit looks similar. But lets not proliferate that. I'd rather see that go away. agreed. Even we can get more benifit than commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu") in kvm guest. I won't proliferate that.. But the irq_timings stuff is heading into the same direction, with a more complex prediction logic which should tell you pretty good how long that idle period is going to be and in case of an interrupt heavy workload this would skip the extra work of stopping and restarting the tick and provide a very good input into a polling decision. interesting. I have tested with IRQ_TIMINGS related code, which seems not working so far. Also I'd like to help as much as I can. This can be handled either in a HV specific idle driver or even in the generic core code. 
If the interrupt does not arrive within the predicted time then you can assume that the flood stopped and invoke halt or whatever. That avoids all of that 'tunable and tweakable' x86 specific hackery and utilizes common functionality which is mostly there already.

Here is some sample code. Poll for a while before entering halt in cpuidle_enter_state(). If I get a reschedule event, then don't try to enter halt. (I hope this is the right direction, as Peter mentioned in another email.)

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 		target_state = &drv->states[index];
 	}
+#ifdef CONFIG_PARAVIRT
+	paravirt_idle_poll();
+
+	if (need_resched())
+		return -EBUSY;
+#endif
+
 	/* Take note of the planned idle state. */
 	sched_idle_set_state(target_state);

thanks,
Quan
Alibaba Cloud
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 17:53, Thomas Gleixner wrote:
On Thu, 16 Nov 2017, Quan Xu wrote:
On 2017-11-16 06:03, Thomas Gleixner wrote:

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 		target_state = &drv->states[index];
 	}
+#ifdef CONFIG_PARAVIRT
+	paravirt_idle_poll();
+
+	if (need_resched())
+		return -EBUSY;
+#endif

That's just plain wrong. We don't want to see any of this PARAVIRT crap in anything outside the architecture/hypervisor interfacing code which really needs it. The problem can and must be solved at the generic level in the first place to gather the data which can be used to make such decisions. How that information is used might be either completely generic or requires system specific variants. But as long as we don't have any information at all we cannot discuss that.

Please sit down and write up which data needs to be considered to make decisions about probabilistic polling. Then we need to compare and contrast that with the data which is necessary to make power/idle state decisions. I would be very surprised if this data would not overlap by at least 90%.

Peter, tglx,

Thanks for your comments. Rethinking this patch set:

1. Which data needs to be considered to make decisions about probabilistic polling

I really need to write up which data needs to be considered to make decisions about probabilistic polling. For the last several months, I have focused on the data _from idle to reschedule_, in order to bypass the idle loops. Unfortunately, this inevitably makes me touch scheduler/idle/nohz code.

With tglx's suggestion, the data which is necessary to make power/idle state decisions is the last idle state's residency time. IIUC this data is the duration from idle to wakeup, which may be caused by a reschedule irq or another irq.

I also tested that the reschedule irq overlaps by more than 90% (tracing the need_resched status after cpuidle_idle_call) when I run ctxsw/netperf for one minute.
Given the overlap, I think I can use the last idle state's residency time as input to make decisions about probabilistic polling, as @dev->last_residency does. It is much easier to get this data.

2. Do a HV specific idle driver (function)

So far, power management is not exposed to the guest. Idle is simple for a KVM guest: it calls "sti" / "hlt" (cpuidle_idle_call() --> default_idle_call()). Thanks to the Xen guys, who have implemented the paravirt framework, I can implement it as easily as the following:

--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -465,6 +465,12 @@ static void __init kvm_apf_trap_init(void)
 	update_intr_gate(X86_TRAP_PF, async_page_fault);
 }
+static __cpuidle void kvm_safe_halt(void)
+{
+	/* 1. POLL, if need_resched() --> return */
+
+	asm volatile("sti; hlt": : :"memory"); /* 2. halt */
+
+	/* 3. get the last idle state's residency time */
+
+	/* 4. update poll duration based on last idle state's residency time */
+}
+
 void __init kvm_guest_init(void)
 {
 	int i;
@@ -490,6 +496,8 @@ void __init kvm_guest_init(void)
 	if (kvmclock_vsyscall)
 		kvm_setup_vsyscall_timeinfo();
+	pv_irq_ops.safe_halt = kvm_safe_halt;
+
 #ifdef CONFIG_SMP

Then I do not need to introduce a new pvops, and never have to modify schedule/idle/nohz code again. I can also confine all of the code to arch/x86/kernel/kvm.c.

If this is in the right direction, I will send a new patch set next week.

thanks,
Quan
Alibaba Cloud
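Step 4 of the kvm_safe_halt() outline above (update the poll duration from the last residency time) could look roughly like the following standalone C sketch. This is only one possible policy, assumed for illustration: the grow-by-doubling and shrink-by-halving rule, the POLL_START_NS and POLL_MAX_NS values, and the name update_poll_ns() are all hypothetical and are not taken from the patch set.

```c
#include <assert.h>
#include <stdint.h>

#define POLL_START_NS 10000u  /* assumed initial poll window */
#define POLL_MAX_NS   200000u /* assumed upper bound on the poll window */

/*
 * Return the next poll window given the current one and the time the
 * vCPU actually stayed halted (residency_ns).
 *
 * Short residency: the wakeup arrived soon after halting, so a longer
 * poll might have avoided the halt entirely, hence grow the window.
 * Long residency: polling would only have burned CPU, hence shrink it.
 */
static uint32_t update_poll_ns(uint32_t poll_ns, uint64_t residency_ns)
{
	assert(poll_ns <= POLL_MAX_NS);

	if (residency_ns <= POLL_MAX_NS) {
		uint32_t grown = poll_ns ? poll_ns * 2 : POLL_START_NS;

		return grown > POLL_MAX_NS ? POLL_MAX_NS : grown;
	}
	return poll_ns / 2;
}
```

Under these assumptions a window of 10000 ns doubles to 20000 ns after a 50 us halt, and a window of 20000 ns halves to 10000 ns after a 1 ms halt, so a mostly idle guest drifts back toward plain halt.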
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 16:45, Peter Zijlstra wrote: On Wed, Nov 15, 2017 at 11:03:08PM +0100, Thomas Gleixner wrote: If I understand the problem correctly then he wants to avoid the heavy lifting in tick_nohz_idle_enter() in the first place, but there is already an interesting quirk there which makes it exit early. Sure. And there are people who want to do the same for native. Adding more ugly and special cases just isn't the way to go about doing that. I'm fairly sure I've told the various groups that want to tinker with this to work together on this. I've also in fairly significant detail sketched how to rework the idle code and idle predictors. At this point I'm too tired to dig any of that up, so I'll just keep saying no to patches that don't even attempt to go in the right direction. Peter, take care. I really have considered this factor, and try my best not to interfere with scheduler/idle code. if irq_timings code is ready, I can use it directly. I think irq_timings is not an easy task, I'd like to help as much as I can. Also don't try to touch tick_nohz* code again. as tglx suggested, this can be handled either in a HV specific idle driver or even in the generic core code. I hope this is in the right direction. Quan Alibaba Cloud ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 18:27, Juergen Gross wrote: On 14/11/17 10:38, Quan Xu wrote: On 2017/11/14 15:30, Juergen Gross wrote: On 14/11/17 08:02, Quan Xu wrote: On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu <quan@gmail.com> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in idle path which will poll for a while before we enter the real idle state. In virtualization, idle path includes several heavy operations includes timer access(LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. Our solution is to poll for a while and do not enter real idle path if we can get the schedule event during polling. Poll may cause the CPU waste so we adopt a smart polling mechanism to reduce the useless poll. Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Cc: Juergen Gross <jgr...@suse.com> Cc: Alok Kataria <akata...@vmware.com> Cc: Rusty Russell <ru...@rustcorp.com.au> Cc: Thomas Gleixner <t...@linutronix.de> Cc: Ingo Molnar <mi...@redhat.com> Cc: "H. Peter Anvin" <h...@zytor.com> Cc: x...@kernel.org Cc: virtualization@lists.linux-foundation.org Cc: linux-ker...@vger.kernel.org Cc: xen-de...@lists.xenproject.org Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary? Juergen, Here is the data we get when running benchmark netperf: 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 29031.6 bit/s -- 76.1 %CPU 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): 35787.7 bit/s -- 129.4 %CPU 3. w/ kvm dynamic poll: 35735.6 bit/s -- 200.0 %CPU 4. w/patch and w/ kvm dynamic poll: 42225.3 bit/s -- 198.7 %CPU 5. idle=poll 37081.7 bit/s -- 998.1 %CPU w/ this patch, we will improve performance by 23%.. 
We could even improve performance by 45.4% if we use w/ patch and w/ kvm dynamic poll. Also, the CPU cost is much lower than in the 'idle=poll' case.

I don't question the general idea. I just think pvops isn't the best way to implement it.

Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too.

I assume this feature will be ported to other archs.. a new pvops makes

sorry, a typo.. /other archs/other hypervisors/ (it refers to hypervisors like Xen, HyperV and VMware)..

code clean and easy to maintain. Also, I tried to add it into the existing pvops, but it doesn't match.

You are aware that pvops is x86 only?

yes, I'm aware..

I really don't see the big difference in maintainability compared to the static key / function pointer variant:

void (*guest_idle_poll_func)(void);
struct static_key guest_idle_poll_key __read_mostly;

static inline void guest_idle_poll(void)
{
	if (static_key_false(&guest_idle_poll_key))
		guest_idle_poll_func();
}

Thank you for your sample code :) I agree there is no big difference. I think we are discussing two things:
1) x86 VM on different hypervisors
2) different archs VM on kvm hypervisor

What I want to do is x86 VM on different hypervisors, such as kvm / xen / hyperv ..

Why limit the solution to x86 if the more general solution isn't harder? As you didn't give any reason why the pvops approach is better other than you don't care for non-x86 platforms you won't get an "Ack" from me for this patch.

It just looks a little odder to me. I understand you care about non-x86 archs. Are you aware of 'pv_time_ops' for the arm64/arm/x86 archs, defined in
- arch/arm64/include/asm/paravirt.h
- arch/x86/include/asm/paravirt_types.h
- arch/arm/include/asm/paravirt.h

I am unfamiliar with arm code. IIUC, if you'd implement pv_idle_ops for the arm/arm64 archs, you'd define the same structure in
- arch/arm64/include/asm/paravirt.h or
- arch/arm/include/asm/paravirt.h
instead of a static key / function, then implement a real function in
- arch/arm/kernel/paravirt.c.

Also, I wonder HOW/WHERE to define a static key/function so that it benefits both x86 and non-x86 archs?

Quan
Alibaba Cloud

And KVM would just need to set guest_idle_poll_func and enable the static key. Works on non-x86 architectures, too.

.. referring to 'pv_mmu_ops', HyperV and Xen can implement their own functions for 'pv_mmu_ops'. I think it is the same for pv_idle_ops.

With the above explanation, do you still think I need to define the static key/function pointer variant?

btw, any interest in porting it to Xen HVM guest? :)

Maybe. But this should work for Xen on ARM, too.

Juergen
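For reference, a minimal userspace sketch of the static key / function pointer variant discussed in this message, so the control flow can be compiled and tried outside the kernel. A plain bool stands in for the kernel's static key (an assumption made purely for illustration, since static keys patch the branch at runtime and have no userspace equivalent), and guest_idle_poll_register() and demo_poll() are hypothetical names that do not appear in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hook that a hypervisor-specific guest driver would fill in. */
static void (*guest_idle_poll_func)(void);

/* Stand-in for the static key: false until a hook is registered. */
static bool guest_idle_poll_enabled;

/* Called from the idle path; a no-op unless a hook has been registered. */
static inline void guest_idle_poll(void)
{
	if (guest_idle_poll_enabled)
		guest_idle_poll_func();
}

/* Hypothetical registration helper: set the pointer first, then enable. */
static void guest_idle_poll_register(void (*fn)(void))
{
	assert(fn != NULL);
	guest_idle_poll_func = fn;
	guest_idle_poll_enabled = true;
}

/* Hypothetical hook used only to observe that the call happens. */
static int demo_poll_calls;

static void demo_poll(void)
{
	demo_poll_calls++;
}
```

Calling guest_idle_poll() before registration does nothing; after guest_idle_poll_register(demo_poll) each call reaches the hook. Nothing in the sketch is x86 specific, which is the portability argument made above.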
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 16:22, Wanpeng Li wrote: 2017-11-14 16:15 GMT+08:00 Quan Xu <quan@gmail.com>: On 2017/11/14 15:12, Wanpeng Li wrote: 2017-11-14 15:02 GMT+08:00 Quan Xu <quan@gmail.com>: On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu <quan@gmail.com> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in idle path which will poll for a while before we enter the real idle state. In virtualization, idle path includes several heavy operations includes timer access(LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. Our solution is to poll for a while and do not enter real idle path if we can get the schedule event during polling. Poll may cause the CPU waste so we adopt a smart polling mechanism to reduce the useless poll. Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Cc: Juergen Gross <jgr...@suse.com> Cc: Alok Kataria <akata...@vmware.com> Cc: Rusty Russell <ru...@rustcorp.com.au> Cc: Thomas Gleixner <t...@linutronix.de> Cc: Ingo Molnar <mi...@redhat.com> Cc: "H. Peter Anvin" <h...@zytor.com> Cc: x...@kernel.org Cc: virtualization@lists.linux-foundation.org Cc: linux-ker...@vger.kernel.org Cc: xen-de...@lists.xenproject.org Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary? Juergen, Here is the data we get when running benchmark netperf: 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 29031.6 bit/s -- 76.1 %CPU 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): 35787.7 bit/s -- 129.4 %CPU 3. 
w/ kvm dynamic poll: 35735.6 bit/s -- 200.0 %CPU Actually we can reduce the CPU utilization by sleeping a period of time as what has already been done in the poll logic of IO subsystem, then we can improve the algorithm in kvm instead of introduing another duplicate one in the kvm guest. We really appreciate upstream's kvm dynamic poll mechanism, which is really helpful for a lot of scenario.. However, as description said, in virtualization, idle path includes several heavy operations includes timer access (LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. for upstream's kvm dynamic poll mechanism, even you could provide a better algorism, how could you bypass timer access (LAPIC timer or TSC deadline timer), or a hardware context switch between virtual machine and hypervisor. I know these is a tradeoff. Furthermore, here is the data we get when running benchmark contextswitch to measure the latency(lower is better): 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 3402.9 ns/ctxsw -- 199.8 %CPU 2. w/ patch and disable kvm dynamic poll: 1163.5 ns/ctxsw -- 205.5 %CPU 3. w/ kvm dynamic poll: 2280.6 ns/ctxsw -- 199.5 %CPU so, these tow solution are quite similar, but not duplicate.. that's also why to add a generic idle poll before enter real idle path. When a reschedule event is pending, we can bypass the real idle path. There is a similar logic in the idle governor/driver, so how this patchset influence the decision in the idle governor/driver when running on bare-metal(power managment is not exposed to the guest so we will not enter into idle driver in the guest)? This is expected to take effect only when running as a virtual machine with proper CONFIG_* enabled. This can not work on bare mental even with proper CONFIG_* enabled. 
Quan Alibaba Cloud
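The context-switch latencies quoted in this message can be turned into relative reductions with a small helper. This is a standalone C sketch and latency_reduction_pct() is a hypothetical name. With the figures above, 1163.5 ns/ctxsw (w/ patch) is about 65.8% below the 3402.9 ns/ctxsw baseline, and 2280.6 ns/ctxsw (kvm dynamic poll) is about 33.0% below it.

```c
#include <assert.h>

/* Percentage by which `value` is lower than `base` (lower latency is better). */
static double latency_reduction_pct(double base, double value)
{
	assert(base > 0.0);
	return (base - value) / base * 100.0;
}
```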
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 15:30, Juergen Gross wrote: On 14/11/17 08:02, Quan Xu wrote: On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu <quan@gmail.com> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in idle path which will poll for a while before we enter the real idle state. In virtualization, idle path includes several heavy operations includes timer access(LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. Our solution is to poll for a while and do not enter real idle path if we can get the schedule event during polling. Poll may cause the CPU waste so we adopt a smart polling mechanism to reduce the useless poll. Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Cc: Juergen Gross <jgr...@suse.com> Cc: Alok Kataria <akata...@vmware.com> Cc: Rusty Russell <ru...@rustcorp.com.au> Cc: Thomas Gleixner <t...@linutronix.de> Cc: Ingo Molnar <mi...@redhat.com> Cc: "H. Peter Anvin" <h...@zytor.com> Cc: x...@kernel.org Cc: virtualization@lists.linux-foundation.org Cc: linux-ker...@vger.kernel.org Cc: xen-de...@lists.xenproject.org Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary? Juergen, Here is the data we get when running benchmark netperf: 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 29031.6 bit/s -- 76.1 %CPU 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): 35787.7 bit/s -- 129.4 %CPU 3. w/ kvm dynamic poll: 35735.6 bit/s -- 200.0 %CPU 4. w/patch and w/ kvm dynamic poll: 42225.3 bit/s -- 198.7 %CPU 5. idle=poll 37081.7 bit/s -- 998.1 %CPU w/ this patch, we will improve performance by 23%.. even we could improve performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. 
Also, the CPU cost is much lower than in the 'idle=poll' case.

I don't question the general idea. I just think pvops isn't the best way to implement it.

Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too.

I assume this feature will be ported to other archs.. a new pvops makes

sorry, a typo.. /other archs/other hypervisors/ (it refers to hypervisors like Xen, HyperV and VMware)..

code clean and easy to maintain. Also, I tried to add it into the existing pvops, but it doesn't match.

You are aware that pvops is x86 only?

yes, I'm aware..

I really don't see the big difference in maintainability compared to the static key / function pointer variant:

void (*guest_idle_poll_func)(void);
struct static_key guest_idle_poll_key __read_mostly;

static inline void guest_idle_poll(void)
{
	if (static_key_false(&guest_idle_poll_key))
		guest_idle_poll_func();
}

Thank you for your sample code :) I agree there is no big difference. I think we are discussing two things:
1) x86 VM on different hypervisors
2) different archs VM on kvm hypervisor

What I want to do is x86 VM on different hypervisors, such as kvm / xen / hyperv ..

And KVM would just need to set guest_idle_poll_func and enable the static key. Works on non-x86 architectures, too.

.. referring to 'pv_mmu_ops', HyperV and Xen can implement their own functions for 'pv_mmu_ops'. I think it is the same for pv_idle_ops.

With the above explanation, do you still think I need to define the static key/function pointer variant?

btw, any interest in porting it to Xen HVM guest? :)

Quan
Alibaba Cloud
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 15:12, Wanpeng Li wrote:
> 2017-11-14 15:02 GMT+08:00 Quan Xu <quan@gmail.com>:
>> On 2017/11/13 18:53, Juergen Gross wrote:
>>> On 13/11/17 11:06, Quan Xu wrote:
>>>> From: Quan Xu <quan@gmail.com>
>>>>
>>>> So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called
>>>> in the idle path, which will poll for a while before we enter the
>>>> real idle state.
>>>>
>>>> In virtualization, the idle path includes several heavy operations,
>>>> including timer access (LAPIC timer or TSC deadline timer), which
>>>> hurt performance, especially for latency-intensive workloads like
>>>> message-passing tasks. The cost is mainly from the vmexit, which is
>>>> a hardware context switch between virtual machine and hypervisor.
>>>> Our solution is to poll for a while and not enter the real idle path
>>>> if we get a schedule event during polling.
>>>>
>>>> Polling may waste CPU, so we adopt a smart polling mechanism to
>>>> reduce useless polling.
>>>>
>>>> Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
>>>> Signed-off-by: Quan Xu <quan@gmail.com>
>>>> Cc: Juergen Gross <jgr...@suse.com>
>>>> Cc: Alok Kataria <akata...@vmware.com>
>>>> Cc: Rusty Russell <ru...@rustcorp.com.au>
>>>> Cc: Thomas Gleixner <t...@linutronix.de>
>>>> Cc: Ingo Molnar <mi...@redhat.com>
>>>> Cc: "H. Peter Anvin" <h...@zytor.com>
>>>> Cc: x...@kernel.org
>>>> Cc: virtualization@lists.linux-foundation.org
>>>> Cc: linux-ker...@vger.kernel.org
>>>> Cc: xen-de...@lists.xenproject.org
>>>
>>> Hmm, is the idle entry path really so critical to performance that a
>>> new pvops function is necessary?
>>
>> Juergen, here is the data we get when running benchmark netperf:
>>  1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
>>     29031.6 bit/s -- 76.1 %CPU
>>
>>  2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
>>     35787.7 bit/s -- 129.4 %CPU
>>
>>  3. w/ kvm dynamic poll:
>>     35735.6 bit/s -- 200.0 %CPU
>
> Actually we can reduce the CPU utilization by sleeping for a period of
> time, as has already been done in the poll logic of the IO subsystem;
> then we can improve the algorithm in kvm instead of introducing another
> duplicate one in the kvm guest.

We really appreciate upstream's kvm dynamic poll mechanism, which is
really helpful in a lot of scenarios.

However, as the description said, in virtualization the idle path
includes several heavy operations, including timer access (LAPIC timer
or TSC deadline timer), which hurt performance, especially for
latency-intensive workloads like message-passing tasks. The cost is
mainly from the vmexit, which is a hardware context switch between
virtual machine and hypervisor.

With upstream's kvm dynamic poll mechanism, even if you could provide a
better algorithm, how could you bypass the timer access (LAPIC timer or
TSC deadline timer), or the hardware context switch between virtual
machine and hypervisor? I know there is a tradeoff.

Furthermore, here is the data we get when running benchmark
contextswitch to measure the latency (lower is better):

 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
    3402.9 ns/ctxsw -- 199.8 %CPU

 2. w/ patch and disable kvm dynamic poll:
    1163.5 ns/ctxsw -- 205.5 %CPU

 3. w/ kvm dynamic poll:
    2280.6 ns/ctxsw -- 199.5 %CPU

So these two solutions are quite similar, but not duplicates.

That is also why we add a generic idle poll before entering the real
idle path. When a reschedule event is pending, we can bypass the real
idle path.

Quan
Alibaba Cloud

> Regards,
> Wanpeng Li
>
>>  4. w/ patch and w/ kvm dynamic poll:
>>     42225.3 bit/s -- 198.7 %CPU
>>
>>  5. idle=poll
>>     37081.7 bit/s -- 998.1 %CPU
>>
>> With this patch, we improve performance by 23%. We could even improve
>> performance by 45.4% if we use w/ patch and w/ kvm dynamic poll. Also,
>> the CPU cost is much lower than in the 'idle=poll' case.
>>
>>> Wouldn't a function pointer, maybe guarded by a static key, be enough?
>>> A further advantage would be that this would work on other
>>> architectures, too.
>>
>> I assume this feature will be ported to other archs. A new pvops makes
>> the code clean and easy to maintain. Also, I tried to add it into the
>> existing pvops, but it doesn't match.
>>
>> Quan
>> Alibaba Cloud
>>
>>> Juergen

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/13 18:53, Juergen Gross wrote:
> On 13/11/17 11:06, Quan Xu wrote:
>> From: Quan Xu <quan@gmail.com>
>>
>> So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called
>> in the idle path, which will poll for a while before we enter the real
>> idle state.
>>
>> In virtualization, the idle path includes several heavy operations,
>> including timer access (LAPIC timer or TSC deadline timer), which hurt
>> performance, especially for latency-intensive workloads like
>> message-passing tasks. The cost is mainly from the vmexit, which is a
>> hardware context switch between virtual machine and hypervisor. Our
>> solution is to poll for a while and not enter the real idle path if we
>> get a schedule event during polling.
>>
>> Polling may waste CPU, so we adopt a smart polling mechanism to reduce
>> useless polling.
>>
>> Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
>> Signed-off-by: Quan Xu <quan@gmail.com>
>> Cc: Juergen Gross <jgr...@suse.com>
>> Cc: Alok Kataria <akata...@vmware.com>
>> Cc: Rusty Russell <ru...@rustcorp.com.au>
>> Cc: Thomas Gleixner <t...@linutronix.de>
>> Cc: Ingo Molnar <mi...@redhat.com>
>> Cc: "H. Peter Anvin" <h...@zytor.com>
>> Cc: x...@kernel.org
>> Cc: virtualization@lists.linux-foundation.org
>> Cc: linux-ker...@vger.kernel.org
>> Cc: xen-de...@lists.xenproject.org
>
> Hmm, is the idle entry path really so critical to performance that a
> new pvops function is necessary?

Juergen, here is the data we get when running benchmark netperf:

 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
    29031.6 bit/s -- 76.1 %CPU

 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
    35787.7 bit/s -- 129.4 %CPU

 3. w/ kvm dynamic poll:
    35735.6 bit/s -- 200.0 %CPU

 4. w/ patch and w/ kvm dynamic poll:
    42225.3 bit/s -- 198.7 %CPU

 5. idle=poll
    37081.7 bit/s -- 998.1 %CPU

With this patch, we improve performance by 23%. We could even improve
performance by 45.4% if we use w/ patch and w/ kvm dynamic poll. Also,
the CPU cost is much lower than in the 'idle=poll' case.

> Wouldn't a function pointer, maybe guarded by a static key, be enough?
> A further advantage would be that this would work on other
> architectures, too.

I assume this feature will be ported to other archs. A new pvops makes
the code clean and easy to maintain. Also, I tried to add it into the
existing pvops, but it doesn't match.

Quan
Alibaba Cloud

> Juergen
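The 23% and 45.4% figures quoted above follow directly from the netperf numbers in cases 1, 2 and 4. As a quick check, here is a minimal sketch of the arithmetic; the helper names are illustrative only and do not come from the patch series.

```c
#include <stdio.h>

/* Relative throughput gain of `measured` over `baseline`, in percent. */
static double gain_percent(double baseline, double measured)
{
    return (measured / baseline - 1.0) * 100.0;
}

/* Print the gains for the netperf figures quoted in the thread. */
static void print_netperf_gains(void)
{
    const double baseline = 29031.6; /* case 1: no patch, halt_poll_ns=0 */
    const double patched  = 35787.7; /* case 2: patch only */
    const double combined = 42225.3; /* case 4: patch + kvm dynamic poll */

    printf("patch only:       %.1f%%\n", gain_percent(baseline, patched));
    printf("patch + kvm poll: %.1f%%\n", gain_percent(baseline, combined));
}
```

Calling print_netperf_gains() should report a gain of a little over 23% for the patch alone and about 45.4% for the combination, matching the claims in the mail.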
Re: [PATCH RFC v3 4/6] Documentation: Add three sysctls for smart idle poll
On 2017/11/13 23:08, Ingo Molnar wrote:
> * Quan Xu <quan.x...@gmail.com> wrote:
>
>> From: Quan Xu <quan@gmail.com>
>>
>> To reduce the cost of poll, we introduce three sysctl to control the
>> poll time when running as a virtual machine with paravirt.
>>
>> Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
>> Signed-off-by: Quan Xu <quan@gmail.com>
>> ---
>>  Documentation/sysctl/kernel.txt |   35 +++
>>  arch/x86/kernel/paravirt.c      |    4
>>  include/linux/kernel.h          |    6 ++
>>  kernel/sysctl.c                 |   34 ++
>>  4 files changed, 79 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
>> index 694968c..30c25fb 100644
>> --- a/Documentation/sysctl/kernel.txt
>> +++ b/Documentation/sysctl/kernel.txt
>> @@ -714,6 +714,41 @@ kernel tries to allocate a number starting from this one.
>>
>>  ==
>>
>> +paravirt_poll_grow: (X86 only)
>> +
>> +Multiplied value to increase the poll time. This is expected to take
>> +effect only when running as a virtual machine with CONFIG_PARAVIRT
>> +enabled. This can't bring any benefit on bare metal even with
>> +CONFIG_PARAVIRT enabled.
>> +
>> +By default this value is 2. Possible values to set are in range {2..16}.
>> +
>> +==
>> +
>> +paravirt_poll_shrink: (X86 only)
>> +
>> +Divided value to reduce the poll time. This is expected to take effect
>> +only when running as a virtual machine with CONFIG_PARAVIRT enabled.
>> +This can't bring any benefit on bare metal even with CONFIG_PARAVIRT
>> +enabled.
>> +
>> +By default this value is 2. Possible values to set are in range {2..16}.
>> +
>> +==
>> +
>> +paravirt_poll_threshold_ns: (X86 only)
>> +
>> +Controls the maximum poll time before entering real idle path. This is
>> +expected to take effect only when running as a virtual machine with
>> +CONFIG_PARAVIRT enabled. This can't bring any benefit on bare metal
>> +even with CONFIG_PARAVIRT enabled.
>> +
>> +By default, this value is 0 means not to poll. Possible values to set
>> +are in range {0..50}. Change the value to non-zero if running
>> +latency-bound workloads in a virtual machine.
>
> I absolutely hate it how this hybrid idle loop polling mechanism is not
> self-tuning!

Ingo, actually it is self-tuning.

> Please make it all work fine by default, and automatically so, instead
> of adding three random parameters...

I will make it all work fine by default. However, cloud environments are
diverse; could I leave only the paravirt_poll_threshold_ns parameter (the
maximum poll time), which is similar to the "adaptive halt-polling"
Wanpeng mentioned? Then users can turn it off, or find an appropriate
threshold for some odd scenario.

Thanks for your comments!

Quan
Alibaba Cloud

> And if it cannot be done automatically then we should rather not do it
> at all. Maybe the next submitter of a similar feature can think of a
> better approach.
>
> Thanks,
>
> 	Ingo
[PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
From: Yang Zhang <yang.zhang...@gmail.com>

Implement a generic idle poll which resembles the functionality
found in arch/. Provide weak arch_cpu_idle_poll function which
can be overridden by the architecture code if needed.

Interrupts arrive which may not cause a reschedule in idle loops.
In KVM guest, this costs several VM-exit/VM-entry cycles, VM-entry
for interrupts and VM-exit immediately. Also this becomes more
expensive than bare metal. Add a generic idle poll before enter
real idle path. When a reschedule event is pending, we can bypass
the real idle path.

Signed-off-by: Quan Xu <quan@gmail.com>
Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Kyle Huey <m...@kylehuey.com>
Cc: Len Brown <len.br...@intel.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Tom Lendacky <thomas.lenda...@amd.com>
Cc: Tobias Klauser <tklau...@distanz.ch>
Cc: linux-ker...@vger.kernel.org
---
 arch/x86/kernel/process.c |    7 +++
 kernel/sched/idle.c       |    2 ++
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c676853..f7db8b5 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -333,6 +333,13 @@ void arch_cpu_idle(void)
 	x86_idle();
 }
 
+#ifdef CONFIG_PARAVIRT
+void arch_cpu_idle_poll(void)
+{
+	paravirt_idle_poll();
+}
+#endif
+
 /*
  * We use this if we don't have any better idle routine..
  */
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 257f4f0..df7c422 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 }
 
 /* Weak implementations for optional arch specific functions */
+void __weak arch_cpu_idle_poll(void) { }
 void __weak arch_cpu_idle_prepare(void) { }
 void __weak arch_cpu_idle_enter(void) { }
 void __weak arch_cpu_idle_exit(void) { }
@@ -219,6 +220,7 @@ static void do_idle(void)
 	 */
 
 	__current_set_polling();
+	arch_cpu_idle_poll();
 	quiet_vmstat();
 	tick_nohz_idle_enter();
 
-- 
1.7.1
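The weak-hook pattern used by this patch can be tried outside the kernel. The following is a minimal user-space sketch, assuming a GCC- or Clang-compatible toolchain (`__attribute__((weak))` is a compiler extension). Only arch_cpu_idle_poll is a name from the patch; resched_pending, enter_real_idle and do_idle_once are hypothetical stand-ins, and the early return is a simplification of what the kernel's do_idle() actually does.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state, so the flow is observable. */
static bool resched_pending; /* plays the role of need_resched() */
static int idle_entries;     /* counts trips through the "real" idle path */

/*
 * Weak default: does nothing, like the patch's
 * "void __weak arch_cpu_idle_poll(void) { }". A non-weak definition in
 * another object file replaces it at link time.
 */
__attribute__((weak)) void arch_cpu_idle_poll(void)
{
}

/* Stand-in for the expensive idle entry (tick stop, halt, and so on). */
static void enter_real_idle(void)
{
    idle_entries++;
}

/* One simplified pass of the idle loop: poll first, then go idle. */
static void do_idle_once(void)
{
    arch_cpu_idle_poll();

    if (resched_pending) /* work arrived: skip the heavy path */
        return;

    enter_real_idle();
}
```

The appeal of the weak symbol is that generic code needs no #ifdef; the objection raised in the thread is that each such hook is one more implicit extension point in the idle loop.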
[PATCH RFC v3 0/6] x86/idle: add halt poll support
From: Quan Xu <quan@gmail.com>

Some latency-intensive workloads have seen an obvious performance
drop when running inside a VM. The main reason is that the overhead
is amplified when running inside a VM. Most of the cost I have seen is
in the idle path.

This patch introduces a new mechanism to poll for a while before
entering idle state. If a reschedule is needed during the poll, then
we don't need to go through the heavy overhead path.

Here is the data we get when running benchmark contextswitch to measure
the latency (lower is better):

   1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
      3402.9 ns/ctxsw -- 199.8 %CPU

   2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
      halt_poll_threshold=1  -- 1151.4 ns/ctxsw -- 200.1 %CPU
      halt_poll_threshold=2  -- 1149.7 ns/ctxsw -- 199.9 %CPU
      halt_poll_threshold=3  -- 1151.0 ns/ctxsw -- 199.9 %CPU
      halt_poll_threshold=4  -- 1155.4 ns/ctxsw -- 199.3 %CPU
      halt_poll_threshold=5  -- 1161.0 ns/ctxsw -- 200.0 %CPU
      halt_poll_threshold=10 -- 1163.8 ns/ctxsw -- 200.4 %CPU
      halt_poll_threshold=30 -- 1159.4 ns/ctxsw -- 201.9 %CPU
      halt_poll_threshold=50 -- 1163.5 ns/ctxsw -- 205.5 %CPU

   3. w/ kvm dynamic poll:
      halt_poll_ns=1  -- 3470.5 ns/ctxsw -- 199.6 %CPU
      halt_poll_ns=2  -- 3273.0 ns/ctxsw -- 199.7 %CPU
      halt_poll_ns=3  -- 3628.7 ns/ctxsw -- 199.4 %CPU
      halt_poll_ns=4  -- 2280.6 ns/ctxsw -- 199.5 %CPU
      halt_poll_ns=5  -- 3200.3 ns/ctxsw -- 199.7 %CPU
      halt_poll_ns=10 -- 2186.6 ns/ctxsw -- 199.6 %CPU
      halt_poll_ns=30 -- 3178.7 ns/ctxsw -- 199.6 %CPU
      halt_poll_ns=50 -- 3505.4 ns/ctxsw -- 199.7 %CPU

   4. w/ patch and w/ kvm dynamic poll:
      halt_poll_ns=1 & halt_poll_threshold=1 -- 1155.5 ns/ctxsw -- 199.8 %CPU
      halt_poll_ns=1 & halt_poll_threshold=2 -- 1165.6 ns/ctxsw -- 199.8 %CPU
      halt_poll_ns=1 & halt_poll_threshold=3 -- 1161.1 ns/ctxsw -- 200.0 %CPU

      halt_poll_ns=2 & halt_poll_threshold=1 -- 1158.1 ns/ctxsw -- 199.8 %CPU
      halt_poll_ns=2 & halt_poll_threshold=2 -- 1161.0 ns/ctxsw -- 199.7 %CPU
      halt_poll_ns=2 & halt_poll_threshold=3 -- 1163.7 ns/ctxsw -- 199.9 %CPU

      halt_poll_ns=3 & halt_poll_threshold=1 -- 1158.7 ns/ctxsw -- 199.7 %CPU
      halt_poll_ns=3 & halt_poll_threshold=2 -- 1153.8 ns/ctxsw -- 199.8 %CPU
      halt_poll_ns=3 & halt_poll_threshold=3 -- 1155.1 ns/ctxsw -- 199.8 %CPU

   5. idle=poll
      3957.57 ns/ctxsw -- 999.4 %CPU

Here is the data we get when running benchmark netperf:

   1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
      29031.6 bit/s -- 76.1 %CPU

   2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
      halt_poll_threshold=1  -- 29021.7 bit/s -- 105.1 %CPU
      halt_poll_threshold=2  -- 33463.5 bit/s -- 128.2 %CPU
      halt_poll_threshold=3  -- 34436.4 bit/s -- 127.8 %CPU
      halt_poll_threshold=4  -- 35563.3 bit/s -- 129.6 %CPU
      halt_poll_threshold=5  -- 35787.7 bit/s -- 129.4 %CPU
      halt_poll_threshold=10 -- 35477.7 bit/s -- 130.0 %CPU
      halt_poll_threshold=30 -- 35730.0 bit/s -- 132.4 %CPU
      halt_poll_threshold=50 -- 34978.4 bit/s -- 134.2 %CPU

   3. w/ kvm dynamic poll:
      halt_poll_ns=1  -- 28849.8 bit/s -- 75.2 %CPU
      halt_poll_ns=2  -- 29004.8 bit/s -- 76.1 %CPU
      halt_poll_ns=3  -- 35662.0 bit/s -- 199.7 %CPU
      halt_poll_ns=4  -- 35874.8 bit/s -- 187.5 %CPU
      halt_poll_ns=5  -- 35603.1 bit/s -- 199.8 %CPU
      halt_poll_ns=10 -- 35588.8 bit/s -- 200.0 %CPU
      halt_poll_ns=30 -- 35912.4 bit/s -- 200.0 %CPU
      halt_poll_ns=50 -- 35735.6 bit/s -- 200.0 %CPU

   4. w/ patch and w/ kvm dynamic poll:
      halt_poll_ns=1 & halt_poll_threshold=1 -- 29427.9 bit/s -- 107.8 %CPU
      halt_poll_ns=1 & halt_poll_threshold=2 -- 33048.4 bit/s -- 128.1 %CPU
      halt_poll_ns=1 & halt_poll_threshold=3 -- 35129.8 bit/s -- 129.1 %CPU

      halt_poll_ns=2 & halt_poll_threshold=1 -- 31091.3 bit/s -- 130.3 %CPU
      halt_poll_ns=2 & halt_poll_threshold=2 -- 33587.9 bit/s -- 128.9 %CPU
      halt_poll_ns=2 & halt_poll_threshold=3 -- 35532.9 bit/s -- 129.1 %CPU

      halt_poll_ns=3 & halt_poll_threshold=1 -- 35633.1 bit/s -- 199.4 %CPU
      halt_poll_ns=3 & halt_poll_threshold=2 -- 42225.3 bit/s -- 198.7 %CPU
      halt_poll_ns=3 & halt_poll_threshold=3 -- 42210.7 bit/s -- 200.3 %CPU

   5. idle=poll
      37081.7 bit/s -- 998.1 %CPU

---
V2 -> V3:
- move poll update into arch/. in v3, poll update is based on duration of
  the last idle loop which is from tick_nohz_idle_enter to
  tick_nohz_idle_exit, and try our best not to interfere with schedule
[PATCH RFC v3 5/6] tick: get duration of the last idle loop
From: Quan Xu <quan@gmail.com>

the last idle loop is from tick_nohz_idle_enter to tick_nohz_idle_exit.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Frederic Weisbecker <fweis...@gmail.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@kernel.org>
Cc: linux-ker...@vger.kernel.org
---
 include/linux/tick.h     |    2 ++
 kernel/time/tick-sched.c |   11 +++
 kernel/time/tick-sched.h |    3 +++
 3 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index cf413b3..77ae46d 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -118,6 +118,7 @@ enum tick_dep_bits {
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
+extern ktime_t tick_nohz_get_last_idle_length(void);
 extern unsigned long tick_nohz_get_idle_calls(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
@@ -127,6 +128,7 @@ enum tick_dep_bits {
 static inline void tick_nohz_idle_enter(void) { }
 static inline void tick_nohz_idle_exit(void) { }
 
+static ktime_t tick_nohz_get_last_idle_length(void) { return -1; }
 static inline ktime_t tick_nohz_get_sleep_length(void)
 {
 	return NSEC_PER_SEC / HZ;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c7a899c..65c9cc0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -548,6 +548,7 @@ static void tick_nohz_update_jiffies(ktime_t now)
 		else
 			ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
 		ts->idle_entrytime = now;
+		ts->idle_length = delta;
 	}
 
 	if (last_update_time)
@@ -998,6 +999,16 @@ void tick_nohz_irq_exit(void)
 }
 
 /**
+ * tick_nohz_get_last_idle_length - return the length of the last idle loop
+ */
+ktime_t tick_nohz_get_last_idle_length(void)
+{
+	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+	return ts->idle_length;
+}
+
+/**
  * tick_nohz_get_sleep_length - return the length of the current sleep
  *
  * Called from power state control code with interrupts disabled
diff --git a/kernel/time/tick-sched.h b/kernel/time/tick-sched.h
index 954b43d..2630cf9 100644
--- a/kernel/time/tick-sched.h
+++ b/kernel/time/tick-sched.h
@@ -39,6 +39,8 @@ enum tick_nohz_mode {
  * @idle_sleeptime:	Sum of the time slept in idle with sched tick stopped
  * @iowait_sleeptime:	Sum of the time slept in idle with sched tick stopped, with IO outstanding
  * @sleep_length:	Duration of the current idle sleep
+ * @idle_length:	Duration of the last idle loop is from
+ *			tick_nohz_idle_enter to tick_nohz_idle_exit.
  * @do_timer_lst:	CPU was the last one doing do_timer before going idle
  */
 struct tick_sched {
@@ -59,6 +61,7 @@ struct tick_sched {
 	ktime_t				idle_sleeptime;
 	ktime_t				iowait_sleeptime;
 	ktime_t				sleep_length;
+	ktime_t				idle_length;
 	unsigned long			last_jiffies;
 	u64				next_timer;
 	ktime_t				idle_expires;
-- 
1.7.1
[PATCH RFC v3 6/6] KVM guest: introduce smart idle poll algorithm
From: Yang Zhang <yang.zhang...@gmail.com>

using smart idle poll to reduce the useless poll when system is idle.

Signed-off-by: Quan Xu <quan@gmail.com>
Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: k...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 arch/x86/kernel/kvm.c |   47 +++
 1 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 2a6e402..8bb6d55 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -365,11 +366,57 @@ static void kvm_guest_cpu_init(void)
 		kvm_register_steal_time();
 }
 
+static unsigned int grow_poll_ns(unsigned int old, unsigned int grow,
+				 unsigned int max)
+{
+	unsigned int val;
+
+	/* set base poll time to 1ns */
+	if (old == 0 && grow)
+		return 1;
+
+	val = old * grow;
+	if (val > max)
+		val = max;
+
+	return val;
+}
+
+static unsigned int shrink_poll_ns(unsigned int old, unsigned int shrink)
+{
+	if (shrink == 0)
+		return 0;
+
+	return old / shrink;
+}
+
+static void kvm_idle_update_poll_duration(ktime_t idle)
+{
+	unsigned long poll_duration = this_cpu_read(poll_duration_ns);
+
+	/* so far poll duration is based on nohz */
+	if (idle == -1ULL)
+		return;
+
+	if (poll_duration && idle > paravirt_poll_threshold_ns)
+		poll_duration = shrink_poll_ns(poll_duration,
+					       paravirt_poll_shrink);
+	else if (poll_duration < paravirt_poll_threshold_ns &&
+		 idle < paravirt_poll_threshold_ns)
+		poll_duration = grow_poll_ns(poll_duration, paravirt_poll_grow,
+					     paravirt_poll_threshold_ns);
+
+	this_cpu_write(poll_duration_ns, poll_duration);
+}
+
 static void kvm_idle_poll(void)
 {
 	unsigned long poll_duration = this_cpu_read(poll_duration_ns);
+	ktime_t idle = tick_nohz_get_last_idle_length();
 	ktime_t start, cur, stop;
 
+	kvm_idle_update_poll_duration(idle);
+
 	start = cur = ktime_get();
 	stop = ktime_add_ns(ktime_get(), poll_duration);
 
-- 
1.7.1
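To see how the grow/shrink policy in this patch behaves over successive idle periods, it can be modelled in user space. The sketch below mirrors the decision logic of kvm_idle_update_poll_duration(), but the struct and function names (poll_tuning, grow_poll, shrink_poll, update_poll) are hypothetical, and the per-CPU variable and sysctls are replaced by plain arguments.

```c
/* Hypothetical tuning knobs, standing in for the three sysctls. */
struct poll_tuning {
    unsigned int grow;          /* multiplier (patch default: 2) */
    unsigned int shrink;        /* divisor (patch default: 2) */
    unsigned long threshold_ns; /* upper bound on the poll window */
};

/* Multiply the window, starting from 1 ns and capping at the threshold. */
static unsigned long grow_poll(unsigned long old, const struct poll_tuning *t)
{
    unsigned long val;

    if (old == 0 && t->grow)
        return 1;

    val = old * t->grow;
    return val > t->threshold_ns ? t->threshold_ns : val;
}

/* Divide the window; a zero divisor collapses it to 0, as in the patch. */
static unsigned long shrink_poll(unsigned long old, const struct poll_tuning *t)
{
    return t->shrink ? old / t->shrink : 0;
}

/*
 * Pick the next poll window from the length of the last idle period:
 * a long idle period means polling was wasted, so shrink; a short one
 * means polling is likely to pay off, so grow up to the threshold.
 */
static unsigned long update_poll(unsigned long poll, unsigned long last_idle_ns,
                                 const struct poll_tuning *t)
{
    if (poll && last_idle_ns > t->threshold_ns)
        return shrink_poll(poll, t);
    if (poll < t->threshold_ns && last_idle_ns < t->threshold_ns)
        return grow_poll(poll, t);
    return poll;
}
```

With grow=2, shrink=2 and threshold_ns=50, repeated short idle periods walk the window through 1, 2, 4, 8, 16, 32 and then pin it at 50; one long idle period halves it to 25. Note that a window of 0 never grows while idle periods stay long, which is what keeps a truly idle guest from polling.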
[PATCH RFC v3 4/6] Documentation: Add three sysctls for smart idle poll
From: Quan Xu <quan@gmail.com>

To reduce the cost of poll, we introduce three sysctl to control the
poll time when running as a virtual machine with paravirt.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
---
 Documentation/sysctl/kernel.txt |   35 +++
 arch/x86/kernel/paravirt.c      |    4
 include/linux/kernel.h          |    6 ++
 kernel/sysctl.c                 |   34 ++
 4 files changed, 79 insertions(+), 0 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 694968c..30c25fb 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -714,6 +714,41 @@ kernel tries to allocate a number starting from this one.
 
 ==
 
+paravirt_poll_grow: (X86 only)
+
+Multiplied value to increase the poll time. This is expected to take
+effect only when running as a virtual machine with CONFIG_PARAVIRT
+enabled. This can't bring any benefit on bare metal even with
+CONFIG_PARAVIRT enabled.
+
+By default this value is 2. Possible values to set are in range {2..16}.
+
+==
+
+paravirt_poll_shrink: (X86 only)
+
+Divided value to reduce the poll time. This is expected to take effect
+only when running as a virtual machine with CONFIG_PARAVIRT enabled.
+This can't bring any benefit on bare metal even with CONFIG_PARAVIRT
+enabled.
+
+By default this value is 2. Possible values to set are in range {2..16}.
+
+==
+
+paravirt_poll_threshold_ns: (X86 only)
+
+Controls the maximum poll time before entering real idle path. This is
+expected to take effect only when running as a virtual machine with
+CONFIG_PARAVIRT enabled. This can't bring any benefit on bare metal
+even with CONFIG_PARAVIRT enabled.
+
+By default, this value is 0 means not to poll. Possible values to set
+are in range {0..50}. Change the value to non-zero if running
+latency-bound workloads in a virtual machine.
+
+==
+
 powersave-nap: (PPC only)
 
 If set, Linux-PPC will use the 'nap' mode of powersaving,
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 67cab22..28c74ca 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -317,6 +317,10 @@ struct pv_idle_ops pv_idle_ops = {
 	.poll = paravirt_nop,
 };
 
+unsigned long paravirt_poll_threshold_ns;
+unsigned int paravirt_poll_shrink = 2;
+unsigned int paravirt_poll_grow = 2;
+
 __visible struct pv_irq_ops pv_irq_ops = {
 	.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
 	.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 4b484ab..0f46846 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -491,6 +491,12 @@ extern __scanf(2, 0)
 
 extern bool crash_kexec_post_notifiers;
 
+#ifdef CONFIG_PARAVIRT
+extern unsigned long paravirt_poll_threshold_ns;
+extern unsigned int paravirt_poll_shrink;
+extern unsigned int paravirt_poll_grow;
+#endif
+
 /*
  * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
  * holds a CPU number which is executing panic() currently. A value of
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d9c31bc..9f194dc 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -135,6 +135,11 @@ static int six_hundred_forty_kb = 640 * 1024;
 #endif
 
+#ifdef CONFIG_PARAVIRT
+static int sixteen = 16;
+static int five_hundred_thousand = 50;
+#endif
+
 /* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
 static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
 
@@ -1226,6 +1231,35 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.extra2		= ,
 	},
 #endif
+#ifdef CONFIG_PARAVIRT
+	{
+		.procname	= "paravirt_halt_poll_threshold",
+		.data		= &paravirt_poll_threshold_ns,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= ,
+		.extra2		= &five_hundred_thousand,
+	},
+	{
+		.procname	= "paravirt_halt_poll_grow",
+		.data		= &paravirt_poll_grow,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= ,
+		.extra2		= ,
+	},
+	{
+		.procname	= "paravirt_halt_poll_shrink",
+		.data		= &paravirt_poll_shrink,
+		.maxlen		= sizeof(unsig
[PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
From: Quan Xu <quan@gmail.com>

So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called
in idle path which will poll for a while before we enter the real idle
state.

In virtualization, idle path includes several heavy operations
includes timer access(LAPIC timer or TSC deadline timer) which will
hurt performance especially for latency intensive workload like message
passing task. The cost is mainly from the vmexit which is a hardware
context switch between virtual machine and hypervisor. Our solution is
to poll for a while and do not enter real idle path if we can get the
schedule event during polling.

Poll may cause the CPU waste so we adopt a smart polling mechanism to
reduce the useless poll.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Juergen Gross <jgr...@suse.com>
Cc: Alok Kataria <akata...@vmware.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
---
 arch/x86/include/asm/paravirt.h       |    5 +
 arch/x86/include/asm/paravirt_types.h |    6 ++
 arch/x86/kernel/paravirt.c            |    6 ++
 3 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index fd81228..3c83727 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -198,6 +198,11 @@ static inline unsigned long long paravirt_read_pmc(int counter)
 
 #define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter))
 
+static inline void paravirt_idle_poll(void)
+{
+	PVOP_VCALL0(pv_idle_ops.poll);
+}
+
 static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned entries)
 {
 	PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 10cc3b9..95c0e3e 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -313,6 +313,10 @@ struct pv_lock_ops {
 	struct paravirt_callee_save vcpu_is_preempted;
 } __no_randomize_layout;
 
+struct pv_idle_ops {
+	void (*poll)(void);
+} __no_randomize_layout;
+
 /* This contains all the paravirt structures: we get a convenient
  * number for each function using the offset which we use to indicate
  * what to patch. */
@@ -323,6 +327,7 @@ struct paravirt_patch_template {
 	struct pv_irq_ops pv_irq_ops;
 	struct pv_mmu_ops pv_mmu_ops;
 	struct pv_lock_ops pv_lock_ops;
+	struct pv_idle_ops pv_idle_ops;
 } __no_randomize_layout;
 
 extern struct pv_info pv_info;
@@ -332,6 +337,7 @@ struct paravirt_patch_template {
 extern struct pv_irq_ops pv_irq_ops;
 extern struct pv_mmu_ops pv_mmu_ops;
 extern struct pv_lock_ops pv_lock_ops;
+extern struct pv_idle_ops pv_idle_ops;
 
 #define PARAVIRT_PATCH(x)					\
 	(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 19a3e8f..67cab22 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -128,6 +128,7 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 		.pv_lock_ops = pv_lock_ops,
 #endif
+		.pv_idle_ops = pv_idle_ops,
 	};
 	return *((void **)&tmpl + type);
 }
@@ -312,6 +313,10 @@ struct pv_time_ops pv_time_ops = {
 	.steal_clock = native_steal_clock,
 };
 
+struct pv_idle_ops pv_idle_ops = {
+	.poll = paravirt_nop,
+};
+
 __visible struct pv_irq_ops pv_irq_ops = {
 	.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
 	.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
@@ -463,3 +468,4 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
 EXPORT_SYMBOL(pv_mmu_ops);
 EXPORT_SYMBOL_GPL(pv_info);
 EXPORT_SYMBOL(pv_irq_ops);
+EXPORT_SYMBOL(pv_idle_ops);
-- 
1.7.1
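The shape of pv_idle_ops is an ops table whose single slot defaults to a no-op and is overwritten when a hypervisor is detected. A minimal user-space analogue is sketched below; all names here (idle_ops, nop_poll, guest_poll, guest_idle_init) are hypothetical stand-ins, and the real kernel additionally patches pvops call sites at boot, which a plain function pointer does not model.

```c
/* Hypothetical user-space analogue of struct pv_idle_ops. */
struct idle_ops {
    void (*poll)(void);
};

static int poll_hits; /* lets a caller observe which poll ran */

/* Default slot: does nothing, standing in for paravirt_nop. */
static void nop_poll(void)
{
}

/* Stand-in for a hypervisor-specific poll routine. */
static void guest_poll(void)
{
    poll_hits++;
}

/* On "bare metal" the slot stays a no-op. */
static struct idle_ops idle_ops = { .poll = nop_poll };

/* Analogue of paravirt_idle_poll(): always safe to call. */
static inline void idle_poll(void)
{
    idle_ops.poll();
}

/* Install the guest routine only when running under a hypervisor. */
static void guest_idle_init(int running_as_guest)
{
    if (!running_as_guest)
        return;

    idle_ops.poll = guest_poll;
}
```

Because the slot is never NULL, callers need no check before calling idle_poll(). This is also roughly what the "function pointer, maybe guarded by a static key" alternative raised in the review would look like, with the static key skipping the indirect call entirely when no guest routine is installed.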
[PATCH RFC v3 2/6] KVM guest: register kvm_idle_poll for pv_idle_ops
From: Quan Xu <quan@gmail.com>

Although smart idle poll has nothing to do with paravirt, it can not
bring any benefit to native. So we only enable it when Linux runs as a
KVM guest (also it can extend to other hypervisor like Xen, HyperV and
VMware).

Introduce per-CPU variable poll_duration_ns to control the max poll
time.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Paolo Bonzini <pbonz...@redhat.com>
Cc: "Radim Krčmář" <rkrc...@redhat.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: k...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 arch/x86/kernel/kvm.c |   26 ++
 1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8bb9594..2a6e402 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -75,6 +75,7 @@ static int parse_no_kvmclock_vsyscall(char *arg)
 
 early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
 
+static DEFINE_PER_CPU(unsigned long, poll_duration_ns);
 static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
@@ -364,6 +365,29 @@ static void kvm_guest_cpu_init(void)
 		kvm_register_steal_time();
 }
 
+static void kvm_idle_poll(void)
+{
+	unsigned long poll_duration = this_cpu_read(poll_duration_ns);
+	ktime_t start, cur, stop;
+
+	start = cur = ktime_get();
+	stop = ktime_add_ns(ktime_get(), poll_duration);
+
+	do {
+		if (need_resched())
+			break;
+		cur = ktime_get();
+	} while (ktime_before(cur, stop));
+}
+
+static void kvm_guest_idle_init(void)
+{
+	if (!kvm_para_available())
+		return;
+
+	pv_idle_ops.poll = kvm_idle_poll;
+}
+
 static void kvm_pv_disable_apf(void)
 {
 	if (!__this_cpu_read(apf_reason.enabled))
@@ -499,6 +523,8 @@ void __init kvm_guest_init(void)
 	kvm_guest_cpu_init();
 #endif
 
+	kvm_guest_idle_init();
+
 	/*
 	 * Hard lockup detection is enabled by default. Disable it, as guests
 	 * can get false positives too easily, for example if the host is
-- 
1.7.1
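The loop in kvm_idle_poll() is a bounded busy-wait: spin until either work shows up or a time budget runs out. A user-space sketch of the same idea follows, assuming a POSIX system for clock_gettime(); resched_flag, now_ns and poll_for_work are hypothetical names, with the atomic flag standing in for need_resched() and CLOCK_MONOTONIC for ktime_get().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime() under strict -std= modes */
#endif

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for need_resched(): set by whoever has work for this CPU. */
static atomic_bool resched_flag;

/* Monotonic time in nanoseconds, playing the role of ktime_get(). */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Spin for at most budget_ns. Returns true if work arrived during the
 * window (so the caller can skip the expensive idle entry), false if
 * the budget expired with nothing to do.
 */
static bool poll_for_work(uint64_t budget_ns)
{
    const uint64_t stop = now_ns() + budget_ns;

    do {
        if (atomic_load(&resched_flag))
            return true;
    } while (now_ns() < stop);

    return false;
}
```

As in the patch, the flag is checked at least once even with a zero budget, so a zero poll duration costs one check and one clock read rather than disabling the check outright.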
[PATCH RFC v3 0/6] x86/idle: add halt poll support
  scheduler/idle code. (This seems not to follow Peter's v2 comment,
  however we had a f2f discussion about it in Prague.)
- enhance patch description.
- enhance Documentation and sysctls.
- test with IRQ_TIMINGS related code, which seems not working so far.

V1 -> V2:
- integrate the smart halt poll into paravirt code
- use idle_stamp instead of check_poll
- since it is hard to tell whether the vcpu is the only task on the pcpu,
  we don't consider it in this series. (May improve it in future)

---
Quan Xu (4):
  x86/paravirt: Add pv_idle_ops to paravirt ops
  KVM guest: register kvm_idle_poll for pv_idle_ops
  Documentation: Add three sysctls for smart idle poll
  tick: get duration of the last idle loop

Yang Zhang (2):
  sched/idle: Add a generic poll before enter real idle path
  KVM guest: introduce smart idle poll algorithm

 Documentation/sysctl/kernel.txt       |   35
 arch/x86/include/asm/paravirt.h       |    5 ++
 arch/x86/include/asm/paravirt_types.h |    6 +++
 arch/x86/kernel/kvm.c                 |   73 +
 arch/x86/kernel/paravirt.c            |   10 +
 arch/x86/kernel/process.c             |    7 +++
 include/linux/kernel.h                |    6 +++
 include/linux/tick.h                  |    2 +
 kernel/sched/idle.c                   |    2 +
 kernel/sysctl.c                       |   34 +++
 kernel/time/tick-sched.c              |   11 +
 kernel/time/tick-sched.h              |    3 +
 12 files changed, 194 insertions(+), 0 deletions(-)