Re: [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs
On Sun, Jan 30, 2011, Avi Kivity wrote about "Re: [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs":
> > +	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
> > +	case MSR_IA32_VMX_PINBASED_CTLS:
> > +		vmx_msr_low = CORE2_PINBASED_CTLS_MUST_BE_ONE;
> > +		vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
> > +			PIN_BASED_EXT_INTR_MASK |
> > +			PIN_BASED_NMI_EXITING |
> > +			PIN_BASED_VIRTUAL_NMIS;
>
> Can we actually support PIN_BASED_VIRTUAL_NMIs on hosts which don't
> support them?  Maybe better to drop for the initial version.

Thanks, I'll look into this. You already found this problem in June, and
it's already in my bugzilla. Just wanted to let you know that I'm taking
all your previous comments seriously, and not forgetting any of them.
Since you mention this one again, I'm increasing its priority, so I'll fix
it before the next version of the patches.

> > +static int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
> > +{
> > +	if (!nested_vmx_allowed(vcpu))
> > +		return 0;
> > +
> > +	/*
> > +	 * according to the spec, VMX capability MSRs are read-only; an
> > +	 * attempt to write them (with WRMSR) produces a #GP(0).
> > +	 */
> > +	if (msr_index >= MSR_IA32_VMX_BASIC &&
> > +	    msr_index <= MSR_IA32_VMX_TRUE_ENTRY_CTLS) {
> > +		kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
> > +		return 1;
>
> Can just drop this part, #GP is the default response.

Right, thanks, I see that now. I'll remove the extra code, but leave a
comment.

-- 
Nadav Har'El                  | Monday, Jan 31 2011, 26 Shevat 5771
n...@math.technion.ac.il     | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    | If the universe is expanding, why can't I
                              | find a parking space?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs
On 01/31/2011 10:57 AM, Nadav Har'El wrote:
> On Sun, Jan 30, 2011, Avi Kivity wrote about "Re: [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs":
> > > +	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
> > > +	case MSR_IA32_VMX_PINBASED_CTLS:
> > > +		vmx_msr_low = CORE2_PINBASED_CTLS_MUST_BE_ONE;
> > > +		vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
> > > +			PIN_BASED_EXT_INTR_MASK |
> > > +			PIN_BASED_NMI_EXITING |
> > > +			PIN_BASED_VIRTUAL_NMIS;
> >
> > Can we actually support PIN_BASED_VIRTUAL_NMIs on hosts which don't
> > support them?  Maybe better to drop for the initial version.
>
> Thanks, I'll look into this. You already found this problem in June, and
> it's already in my bugzilla. Just wanted to let you know that I'm taking
> all your previous comments seriously, and not forgetting any of them.
> Since you mention this one again, I'm increasing its priority, so I'll
> fix it before the next version of the patches.

Thanks.  Since I see many other comments are still unaddressed, I'll stop
reviewing until you let me know that you have a version that is ready for
review.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 3/3] Provide control over unmapped pages (v4)
* KAMEZAWA Hiroyuki <kamezawa.hir...@jp.fujitsu.com> [2011-01-31 08:58:53]:

> On Fri, 28 Jan 2011 09:20:02 -0600 (CST)
> Christoph Lameter <c...@linux.com> wrote:
>
> > On Fri, 28 Jan 2011, KAMEZAWA Hiroyuki wrote:
> >
> > > > I see it as a tradeoff of when to check? add_to_page_cache or when
> > > > we want more free memory (due to allocation). It is OK to wakeup
> > > > kswapd while allocating memory, somehow for this purpose (global
> > > > page cache), add_to_page_cache or add_to_page_cache_locked does not
> > > > seem the right place to hook into. I'd be open to
> > > > comments/suggestions though from others as well.
> > >
> > > I don't like adding a hook here. AND I don't want to run kswapd
> > > because 'kswapd' has been a sign that there is memory shortage.
> > > (reusing code is ok.)
> > >
> > > How about adding a new daemon ? Recently, khugepaged, ksmd work for
> > > managing memory. Adding one more daemon for a special purpose is not
> > > very bad, I think. Then, you can do
> > >  - wake up without hook
> > >  - throttle its work.
> > >  - balance the whole system rather than zone.
> > >    I think per-node balance is enough...
> >
> > I think we already have enough kernel daemons floating around. They
> > are multiplying in an amazing way. What would be useful is to map all
> > the memory management background stuff into a process. Maybe call this
> > memd instead? Perhaps we can fold khugepaged into kswapd as well etc.
>
> Is making kswapd slow for this additional work, requested by the user,
> not by the system, a good thing ? I think a workqueue works well enough,
> it scales based on workloads, if using a thread is bad.

Making it slow is a generic statement; kswapd is supposed to do background
reclaim, in this case a special request for unmapped pages, specifically
and deliberately requested by the admin via a boot option.

-- 
	Three Cheers,
	Balbir
Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12
Hi,

On Sun, Jan 30, 2011, Avi Kivity wrote about "Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12":
> > +/*
> > + * Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
> > + * does not already exist. The allocation is done in L0 memory, so to avoid
> > + * denial-of-service attack by guests, we limit the number of concurrently-
> > + * allocated vmcss. A well-behaving L1 will VMCLEAR unused vmcs12s and not
> > + * trigger this limit.
>
> No, it won't.  If you run N guests on a single-cpu kvm host, you'll have
> N active VMCSs.

Of course. What I said was that *unused* vmcs12s (in the sense that they
don't describe any active guest) will normally be unloaded (VMCLEARed) by
L1 and so will not take up space. Only VMCSs actually being used to run
guests will take up space. If you have N running guests, then right,
you'll have N VMCSs.

I put the limit at 256, which on one hand allows L1 to run 256 L2s (which
I think is well above what people will normally run on one CPU), and on
the other hand limits the amount of damage that a malicious L1 can do: At
worst it can cause the host to allocate (and pin) 256 extra VMCSs, which
sum up to 1 MB.

I thought that this compromise was good enough, and you didn't, and
although I still don't understand why, I promise I will change it. I'll
make this change my top priority now.

> > +static void __nested_free_saved_vmcs(void *arg)
> > +{
> > +	struct saved_vmcs *saved_vmcs = arg;
> > +	int cpu = raw_smp_processor_id();
> > +
> > +	if (saved_vmcs->cpu == cpu) /* TODO: how can this not be the case? */
> > +		vmcs_clear(saved_vmcs->vmcs);
>
> This check will always be true.

This is what I thought too... I call this function on the saved_vmcs->cpu
cpu, so there's no reason why it would find itself being called on a
different cpu.

The only reason I added this sanity check was that __vcpu_clear has the
same one, and there too it seemed redundant, and I thought maybe I was
missing something. Do you know why __vcpu_clear needs this test?

> > +	if (per_cpu(current_vmcs, cpu) == saved_vmcs->vmcs)
> > +		per_cpu(current_vmcs, cpu) = NULL;
>
> And this will always be false, no?  Unless you free a vmcs02 while you
> use it?  Don't you always switch back to vmcs01 prior to freeing?
> ...
> Maybe this is the counterexample - we kill a vcpu while it is in nested
> mode.

Right, this is what I had in mind.

-- 
Nadav Har'El                  | Monday, Jan 31 2011, 26 Shevat 5771
n...@math.technion.ac.il     | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    | I planted some bird seed. A bird came up.
                              | Now I don't know what to feed it...
Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12
On 01/31/2011 11:26 AM, Nadav Har'El wrote:
> On Sun, Jan 30, 2011, Avi Kivity wrote about "Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12":
> > > +/*
> > > + * Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
> > > + * does not already exist. The allocation is done in L0 memory, so to avoid
> > > + * denial-of-service attack by guests, we limit the number of concurrently-
> > > + * allocated vmcss. A well-behaving L1 will VMCLEAR unused vmcs12s and not
> > > + * trigger this limit.
> >
> > No, it won't.  If you run N guests on a single-cpu kvm host, you'll
> > have N active VMCSs.
>
> Of course. What I said was that *unused* vmcs12s (in the sense that they
> don't describe any active guest) will normally be unloaded (VMCLEARed)
> by L1 and so will not take up space. Only VMCSs actually being used to
> run guests will take up space. If you have N running guests, then right,
> you'll have N VMCSs.
>
> I put the limit at 256, which on one hand allows L1 to run 256 L2s
> (which I think is well above what people will normally run on one CPU),
> and on the other hand limits the amount of damage that a malicious L1
> can do: At worst it can cause the host to allocate (and pin) 256 extra
> VMCSs, which sum up to 1 MB.
>
> I thought that this compromise was good enough, and you didn't, and
> although I still don't understand why, I promise I will change it. I'll
> make this change my top priority now.

VM crashes on legal guest behaviour are bad; and since it's easy to avoid,
why do it?

> > > +static void __nested_free_saved_vmcs(void *arg)
> > > +{
> > > +	struct saved_vmcs *saved_vmcs = arg;
> > > +	int cpu = raw_smp_processor_id();
> > > +
> > > +	if (saved_vmcs->cpu == cpu) /* TODO: how can this not be the case? */
> > > +		vmcs_clear(saved_vmcs->vmcs);
> >
> > This check will always be true.
>
> This is what I thought too... I call this function on the
> saved_vmcs->cpu cpu, so there's no reason why it would find itself being
> called on a different cpu.
>
> The only reason I added this sanity check was that __vcpu_clear has the
> same one, and there too it seemed redundant, and I thought maybe I was
> missing something. Do you know why __vcpu_clear needs this test?

__vcpu_clear() can race with itself (called from the cpu migration path
vs. the cpu offline path).

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 01/22] Prevent abortion on multiple VCPU kicks
On 01/27/2011 03:09 PM, Jan Kiszka wrote:
> If we call qemu_cpu_kick more than once before the target was able to
> process the signal, pthread_kill will fail, and qemu will abort. Prevent
> this by avoiding the redundant signal.

Doesn't fit with the manual page (or with the idea that signals are
asynchronous):

NAME
       pthread_kill - send a signal to a thread
...
ERRORS
       ESRCH  No thread with the ID thread could be found.
       EINVAL An invalid signal was specified.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 04/22] Leave inner main_loop faster on pending requests
On 01/27/2011 03:09 PM, Jan Kiszka wrote:
> If there is any pending request that requires us to leave the inner loop
> of main_loop, make sure we do this as soon as possible by enforcing
> non-blocking IO processing.
>
> While at it, move variable definitions out of the inner loop to improve
> readability.
>
> Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> ---
>  vl.c |   11 +++++++----
>  1 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/vl.c b/vl.c
> index 5fad700..2ebc55b 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -1384,18 +1384,21 @@ qemu_irq qemu_system_powerdown;
>
>  static void main_loop(void)
>  {
> +    bool nonblocking = false;
> +#ifdef CONFIG_PROFILER
> +    int64_t ti;
> +#endif
>      int r;
>
>      qemu_main_loop_start();
>
>      for (;;) {
>          do {
> -            bool nonblocking = false;
> -#ifdef CONFIG_PROFILER
> -            int64_t ti;
> -#endif
>  #ifndef CONFIG_IOTHREAD
>              nonblocking = cpu_exec_all();
> +            if (!vm_can_run()) {
> +                nonblocking = true;
> +            }

Doesn't this cause vmstop to spin?  We'll never execute
main_loop_wait(false) if I read the code correctly?

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On 01/27/2011 04:33 PM, Jan Kiszka wrote:
> Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between
> checking for exit_request on vcpu entry and timer signals arriving
> before KVM starts to catch them. Plug it by blocking both timer related
> signals also on !CONFIG_IOTHREAD and process those via signalfd.
>
> As this fix depends on real signalfd support (otherwise the timer
> signals only kick the compat helper thread, and the main thread hangs),
> we need to detect the invalid constellation and abort configure.
>
> Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> CC: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com>
> ---
>
> I don't want to invest that much into !IOTHREAD anymore, so let's see if
> the proposed catch&abort is acceptable.

I don't understand the dependency on signalfd.  The normal way of doing
things, either waiting for the signal in sigtimedwait() or in
ioctl(KVM_RUN), works with SIGALRM just fine.

-- 
error compiling committee.c: too many arguments to function
KVM call agenda for Feb 1
Please send in any agenda items you are interested in covering.

Thanks, Juan.
Re: [PATCH 0/29] nVMX: Nested VMX, v8
On Fri, Jan 28, 2011, Juerg Haefliger wrote about "Re: [PATCH 0/29] nVMX: Nested VMX, v8":
> This branch doesn't even compile:
> ...
>   CC [M]  drivers/staging/smbfs/dir.o
> drivers/staging/smbfs/dir.c:286: error: static declaration of

I tried to compile this branch with the default .config (answering the
defaults to every question in "make config"), and it compiled without any
error.

However, the default configuration is *not* to compile the code you had
problems in, drivers/staging/smbfs, because by default CONFIG_STAGING is
off. When I went to enable CONFIG_STAGING and then SMB_FS, I was given a
big warning that smbfs is OBSOLETE, please use CIFS. Is there a specific
reason why you want to use smbfs, not cifs?

Anyway, when I did enable SMB_FS, and compiled, sure enough I got the same
error that you did. This definitely has nothing to do with the nested VMX
patches. If smbfs isn't essential to you, please try again without it
(turn off SMB_FS in your .config). If it is, maybe you should report this
problem to the smbfs maintainers?

Thanks,
Nadav.

-- 
Nadav Har'El                  | Monday, Jan 31 2011, 26 Shevat 5771
n...@math.technion.ac.il     | Phone +972-523-790466, ICQ 13349191
http://nadav.harel.org.il    | This box was intentionally left blank.
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On 01/27/2011 03:10 PM, Jan Kiszka wrote:
> Align with qemu-kvm and prepare for the IO exit fix: There is no need to
> run kvm_arch_process_irqchip_events in the inner VCPU loop. Any state
> change this service processes will first cause an exit from kvm_cpu_exec
> anyway. And we will have to reenter the kernel on IO exits
> unconditionally, something that the current logic prevents.
>
> Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> ---
>  kvm-all.c |   11 ++++++-----
>  1 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/kvm-all.c b/kvm-all.c
> index 5bfa8c0..46ecc1c 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -892,6 +892,12 @@ int kvm_cpu_exec(CPUState *env)
>
>      DPRINTF("kvm_cpu_exec()\n");
>
> +    if (kvm_arch_process_irqchip_events(env)) {
> +        env->exit_request = 0;
> +        env->exception_index = EXCP_HLT;
> +        return 0;
> +    }
> +
>      do {
>  #ifndef CONFIG_IOTHREAD
>          if (env->exit_request) {
> @@ -901,11 +907,6 @@ int kvm_cpu_exec(CPUState *env)
>          }

We check for ->exit_request here

>  #endif
>
> -        if (kvm_arch_process_irqchip_events(env)) {
> -            ret = 0;
> -            break;
> -        }
> -

But this checks for ->interrupt_request.  What ensures that we exit when
->interrupt_request is set?

>          if (env->kvm_vcpu_dirty) {
>              kvm_arch_put_registers(env, KVM_PUT_RUNTIME_STATE);
>              env->kvm_vcpu_dirty = 0;

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 00/22] [uq/master] Patch queue, part II
On 01/27/2011 03:09 PM, Jan Kiszka wrote:
> This second round of patches focuses on issues in cpus.c, primarily
> signal related. The highlights are
>
>  - Add missing KVM_RUN continuation after I/O exits
>  - Fix for timer signal race in KVM entry code under !CONFIG_IOTHREAD
>    (based on Stefan's findings)
>  - MCE signal processing under !CONFIG_IOTHREAD
>  - Prevent abortion on multiple VCPU kicks
>  - Fixed synchronous (ie. VCPU-issued) reset request processing
>
> Topics remaining for a third round are cleanups and refactorings of the
> KVM VCPU entry/exit path, MCE rework, and a few assorted cleanups. But I
> will wait at least until these bits here are ready for an upstream
> merge.

Pretty nice.  Is this standalone or does it depend on unmerged changes?

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH v2 2/6] KVM-HV: KVM Steal time implementation
On Fri, 2011-01-28 at 14:52 -0500, Glauber Costa wrote:
> +		u64 to = (get_kernel_ns() - vcpu->arch.this_time_out);
> +		/*
> +		 * using nanoseconds introduces noise, which accumulates easily
> +		 * leading to big steal time values. We want, however, to keep the
> +		 * interface nanosecond-based for future-proofness.
> +		 */
> +		to /= NSEC_PER_USEC;
> +		to *= NSEC_PER_USEC;

This just doesn't make any sense at all: you use the most expensive and
accurate kernel interface to get ns resolution timestamps (ktime_get),
and then you truncate the result to usec for some totally weird and
unexplained reason.

If ktime_get is wrong, all our time-keeping is out the window.
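For reference, the truncation Peter is objecting to is plain integer
rounding: dividing and re-multiplying by NSEC_PER_USEC discards the
sub-microsecond remainder while keeping the value in nanosecond units. A
standalone sketch of just that arithmetic (not the patch itself):

```c
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Round a nanosecond delta down to whole-microsecond granularity, the
 * way the quoted hunk does: integer division drops the remainder. */
static uint64_t trunc_to_usec(uint64_t ns)
{
    ns /= NSEC_PER_USEC;  /* e.g. 1234567 ns -> 1234 us */
    ns *= NSEC_PER_USEC;  /* back to ns units: 1234000 ns */
    return ns;
}
```

So up to 999 ns of measured steal time is thrown away on every update,
which is exactly the accuracy loss being questioned.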
Re: [PATCH v2 4/6] KVM-GST: KVM Steal time registration
On Fri, 2011-01-28 at 14:52 -0500, Glauber Costa wrote:
> +	/*
> +	 * using nanoseconds introduces noise, which accumulates easily
> +	 * leading to big steal time values. We want, however, to keep the
> +	 * interface nanosecond-based for future-proofness. The hypervisor may
> +	 * adopt a similar strategy, but we can't rely on that.
> +	 */
> +	delta /= NSEC_PER_MSEC;
> +	delta *= NSEC_PER_MSEC;

I really doubt that claim..
Re: [PATCH 01/22] Prevent abortion on multiple VCPU kicks
On 2011-01-31 10:44, Avi Kivity wrote:
> On 01/27/2011 03:09 PM, Jan Kiszka wrote:
> > If we call qemu_cpu_kick more than once before the target was able to
> > process the signal, pthread_kill will fail, and qemu will abort.
> > Prevent this by avoiding the redundant signal.
>
> Doesn't fit with the manual page (or with the idea that signals are
> asynchronous):
>
> NAME
>        pthread_kill - send a signal to a thread
> ...
> ERRORS
>        ESRCH  No thread with the ID thread could be found.
>        EINVAL An invalid signal was specified.

Valid remark, but I was receiving EAGAIN for blocked RT signals. I don't
know if this is Linux-specific. A quick glance at the man pages did not
reveal if this is allowed, or at least a gray area.

However, even when selectively ignoring this, it's more efficient to catch
the redundant signaling in user space.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 04/22] Leave inner main_loop faster on pending requests
On 2011-01-31 10:52, Avi Kivity wrote:
> On 01/27/2011 03:09 PM, Jan Kiszka wrote:
> > If there is any pending request that requires us to leave the inner
> > loop of main_loop, make sure we do this as soon as possible by
> > enforcing non-blocking IO processing.
> >
> > While at it, move variable definitions out of the inner loop to
> > improve readability.
> >
> > Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> > ---
> >  vl.c |   11 +++++++----
> >  1 files changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/vl.c b/vl.c
> > index 5fad700..2ebc55b 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -1384,18 +1384,21 @@ qemu_irq qemu_system_powerdown;
> >
> >  static void main_loop(void)
> >  {
> > +    bool nonblocking = false;
> > +#ifdef CONFIG_PROFILER
> > +    int64_t ti;
> > +#endif
> >      int r;
> >
> >      qemu_main_loop_start();
> >
> >      for (;;) {
> >          do {
> > -            bool nonblocking = false;
> > -#ifdef CONFIG_PROFILER
> > -            int64_t ti;
> > -#endif
> >  #ifndef CONFIG_IOTHREAD
> >              nonblocking = cpu_exec_all();
> > +            if (!vm_can_run()) {
> > +                nonblocking = true;
> > +            }
>
> Doesn't this cause vmstop to spin?  We'll never execute
> main_loop_wait(false) if I read the code correctly?

The code path is not changed, we just poll instead of wait in
main_loop_wait.

Also, I didn't get your error scenario yet. Even if we left the loop here,
what magic would main_loop_wait do to vmstop processing? The stop request
is handled outside the loop, that's why we should leave ASAP.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power
On Fri, 2011-01-28 at 14:52 -0500, Glauber Costa wrote:
> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> +static DEFINE_PER_CPU(u64, cpu_steal_time);
> +
> +#ifndef CONFIG_64BIT
> +static DEFINE_PER_CPU(seqcount_t, steal_time_seq);
> +
> +static inline void steal_time_write_begin(void)
> +{
> +	__this_cpu_inc(steal_time_seq.sequence);
> +	smp_wmb();
> +}
> +
> +static inline void steal_time_write_end(void)
> +{
> +	smp_wmb();
> +	__this_cpu_inc(steal_time_seq.sequence);
> +}
> +
> +static inline u64 steal_time_read(int cpu)
> +{
> +	u64 steal_time;
> +	unsigned seq;
> +
> +	do {
> +		seq = read_seqcount_begin(&per_cpu(steal_time_seq, cpu));
> +		steal_time = per_cpu(cpu_steal_time, cpu);
> +	} while (read_seqcount_retry(&per_cpu(steal_time_seq, cpu), seq));
> +
> +	return steal_time;
> +}
> +#else /* CONFIG_64BIT */
> +static inline void steal_time_write_begin(void)
> +{
> +}
> +
> +static inline void steal_time_write_end(void)
> +{
> +}
> +
> +static inline u64 steal_time_read(int cpu)
> +{
> +	return per_cpu(cpu_steal_time, cpu);
> +}

> @@ -3536,6 +3592,11 @@ static int touch_steal_time(int is_idle)
>
>  	if (st) {
>  		account_steal_time(st);
> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> +		steal_time_write_begin();
> +		__this_cpu_add(cpu_steal_time, steal);
> +		steal_time_write_end();
> +#endif
>  		return 1;
>  	}
>  	return 0;

Why replicate all the logic you've already got in patch 4? That too is
reading steal time in a loop in kvm_account_steal_time(), why not extend
that interface to take a cpu argument and be done with it?
Re: [PATCH v2 5/6] KVM-GST: adjust scheduler cpu power
On Mon, 2011-01-31 at 12:25 +0100, Peter Zijlstra wrote:
> On Fri, 2011-01-28 at 14:52 -0500, Glauber Costa wrote:
> > +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> > +static DEFINE_PER_CPU(u64, cpu_steal_time);
> [...]
> > @@ -3536,6 +3592,11 @@ static int touch_steal_time(int is_idle)
> >
> >  	if (st) {
> >  		account_steal_time(st);
> > +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> > +		steal_time_write_begin();
> > +		__this_cpu_add(cpu_steal_time, steal);
> > +		steal_time_write_end();
> > +#endif
> >  		return 1;
> >  	}
> >  	return 0;
>
> Why replicate all the logic you've already got in patch 4? That too is
> reading steal time in a loop in kvm_account_steal_time(), why not extend
> that interface to take a cpu argument and be done with it?

Also, this relies on touch_steal_time() to have run, which is not at all
tied to calling update_rq_clock(). We can call update_rq_clock() many many
times between ticks, as can the stealtime increase, because the vcpu can
schedule many many times between ticks.
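The 32-bit fallback being discussed exists because a 64-bit load can tear
on a 32-bit host: a reader may see the low half of the old value and the
high half of the new one. A simplified user-space analogue of the
seqcount pattern (plain C11 atomics, not the kernel API, and glossing
over some memory-ordering subtleties of the real seqlock):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Writer bumps the sequence around each update, so it is odd while a
 * write is in progress. Readers retry if they see an odd sequence or a
 * sequence change across the read, which catches torn 64-bit values. */
static atomic_uint seq;
static uint32_t val_lo, val_hi;   /* 64-bit value stored as two halves */

static void write_val(uint64_t v)
{
    atomic_fetch_add_explicit(&seq, 1, memory_order_release); /* -> odd */
    val_lo = (uint32_t)v;
    val_hi = (uint32_t)(v >> 32);
    atomic_fetch_add_explicit(&seq, 1, memory_order_release); /* -> even */
}

static uint64_t read_val(void)
{
    unsigned s1, s2;
    uint64_t v;

    do {
        s1 = atomic_load_explicit(&seq, memory_order_acquire);
        v = ((uint64_t)val_hi << 32) | val_lo;
        s2 = atomic_load_explicit(&seq, memory_order_acquire);
    } while (s1 != s2 || (s1 & 1));   /* retry torn/in-progress reads */
    return v;
}
```

On 64-bit hosts the whole dance disappears, which is why the quoted patch
compiles the begin/end helpers to nothing under CONFIG_64BIT.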
Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On 2011-01-31 11:03, Avi Kivity wrote:
> On 01/27/2011 04:33 PM, Jan Kiszka wrote:
> > Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between
> > checking for exit_request on vcpu entry and timer signals arriving
> > before KVM starts to catch them. Plug it by blocking both timer
> > related signals also on !CONFIG_IOTHREAD and process those via
> > signalfd.
> >
> > As this fix depends on real signalfd support (otherwise the timer
> > signals only kick the compat helper thread, and the main thread
> > hangs), we need to detect the invalid constellation and abort
> > configure.
> >
> > Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> > CC: Stefan Hajnoczi <stefa...@linux.vnet.ibm.com>
> > ---
> >
> > I don't want to invest that much into !IOTHREAD anymore, so let's see
> > if the proposed catch&abort is acceptable.
>
> I don't understand the dependency on signalfd.  The normal way of doing
> things, either waiting for the signal in sigtimedwait() or in
> ioctl(KVM_RUN), works with SIGALRM just fine.

And how would you be kicked out of the select() call if it is waiting with
a timeout? We only have a single thread here.

The only alternative is Stefan's original proposal. But that required
fiddling with the signal mask twice per KVM_RUN.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On 2011-01-31 11:08, Avi Kivity wrote:
> On 01/27/2011 03:10 PM, Jan Kiszka wrote:
> > Align with qemu-kvm and prepare for the IO exit fix: There is no need
> > to run kvm_arch_process_irqchip_events in the inner VCPU loop. Any
> > state change this service processes will first cause an exit from
> > kvm_cpu_exec anyway. And we will have to reenter the kernel on IO
> > exits unconditionally, something that the current logic prevents.
> >
> > @@ -892,6 +892,12 @@ int kvm_cpu_exec(CPUState *env)
> >
> >      DPRINTF("kvm_cpu_exec()\n");
> >
> > +    if (kvm_arch_process_irqchip_events(env)) {
> > +        env->exit_request = 0;
> > +        env->exception_index = EXCP_HLT;
> > +        return 0;
> > +    }
> > +
> >      do {
> >  #ifndef CONFIG_IOTHREAD
> >          if (env->exit_request) {
>
> We check for ->exit_request here
>
> >  #endif
> >
> > -        if (kvm_arch_process_irqchip_events(env)) {
> > -            ret = 0;
> > -            break;
> > -        }
> > -
>
> But this checks for ->interrupt_request.  What ensures that we exit when
> ->interrupt_request is set?

Good question, need to check again. But if that turns out to be an issue,
qemu-kvm would be broken as well. I'm just aligning the code here.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: Google Summer of Code 2011
On Sun, 30 Jan 2011 16:06:20 +0100
Alexander Graf <ag...@suse.de> wrote:

> On 28.01.2011, at 21:10, Luiz Capitulino wrote:
> > Hi there,
> >
> > GSoC 2011 has been announced[1]. As we were pretty successful last
> > year, I think we should participate again. I've already created a wiki
> > page: http://wiki.qemu.org/Google_Summer_of_Code_2011
> >
> > We should now populate it with projects, and people willing to be
> > mentors should say so (or just add a project)[2].
> >
> > Also, I'd like to do something different this year: I'd like to invite
> > libvirt people to join. There are two ways of doing this:
> >
> >  1. They join the program as a regular mentoring organization, or
> >  2. They join with QEMU
> >
> > The second option means that libvirt can suggest and run its own
> > projects (preferably with QEMU relevance), but from a GSoC
> > perspective, the project will be part of the QEMU org.
>
> Keep in mind that every full org gets a free trip to the west coast for
> 2 people ;). So splitting up means we could almost do a mini-summit at
> the google campus on google's expenses ;).

Actually, they have a limited budget and if you live too far (say, in
Brazil), the trip might not be 100% free :)

> Please coordinate that with Carol. Apparently traction for GSOC is
> declining (according to last year's summit). So there might be plenty of
> available slots this year. So I'd say sign up separately for now and if
> you don't get accepted, just join forces with us!

Yes, that's a good plan and I fully agree that we get more benefits if we
apply separately. It's a call to libvirt's people.
Re: [RFC -v7 PATCH 3/7] sched: use a buddy to implement yield_task_fair
On Wed, 2011-01-26 at 17:21 -0500, Rik van Riel wrote:
> +static struct sched_entity *__pick_second_entity(struct cfs_rq *cfs_rq)
> +{
> +	struct rb_node *left = cfs_rq->rb_leftmost;
> +	struct rb_node *second;
> +
> +	if (!left)
> +		return NULL;
> +
> +	second = rb_next(left);
> +
> +	if (!second)
> +		second = left;
> +
> +	return rb_entry(second, struct sched_entity, run_node);
> +}

I would still prefer:

  sed -i 's/__pick_next_entity/__pick_first_entity/g' kernel/sched*.[ch]

and:

static struct sched_entity *__pick_next_entity(struct sched_entity *se)
{
	struct rb_node *next = rb_next(&se->run_node);

	if (!next)
		return NULL;

	return rb_entry(next, struct sched_entity, run_node);
}

It's simpler and more versatile.

>  static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
>  {
>  	struct sched_entity *se = __pick_next_entity(cfs_rq);
>  	struct sched_entity *left = se;
>
> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
> -		se = cfs_rq->next;
> +	/*
> +	 * Avoid running the skip buddy, if running something else can
> +	 * be done without getting too unfair.
> +	 */
> +	if (cfs_rq->skip == se) {
> +		struct sched_entity *second = __pick_second_entity(cfs_rq);

which then becomes:

		struct sched_entity *next = __pick_next_entity(se);

> +		if (wakeup_preempt_entity(second, left) < 1)

oops when !second. (!next)

> +			se = second;
> +	}

> -/*
> - * sched_yield() support is very simple - we dequeue and enqueue.
> - *
> - * If compat_yield is turned on then we requeue to the end of the tree.
> - */
> -static void yield_task_fair(struct rq *rq)
> -{
> -	struct task_struct *curr = rq->curr;
> -	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
> -	struct sched_entity *rightmost, *se = &curr->se;
> -
> -	/*
> -	 * Are we the only task in the tree?
> -	 */
> -	if (unlikely(rq->nr_running == 1))
> -		return;
> -
> -	clear_buddies(cfs_rq, se);
> -
> -	if (likely(!sysctl_sched_compat_yield) && curr->policy != SCHED_BATCH) {
> -		update_rq_clock(rq);
> -		/*
> -		 * Update run-time statistics of the 'current'.
> -		 */
> -		update_curr(cfs_rq);
> -
> -		return;
> -	}
> -	/*
> -	 * Find the rightmost entry in the rbtree:
> -	 */
> -	rightmost = __pick_last_entity(cfs_rq);
> -	/*
> -	 * Already in the rightmost position?
> -	 */
> -	if (unlikely(!rightmost || entity_before(rightmost, se)))
> -		return;
> -
> -	/*
> -	 * Minimally necessary key value to be last in the tree:
> -	 * Upon rescheduling, sched_class::put_prev_task() will place
> -	 * 'current' within the tree based on its new key value.
> -	 */
> -	se->vruntime = rightmost->vruntime + 1;
> -}

You seem to have lost the hunk removing sysctl_sched_compat_yield from
kernel/sysctl.c :-)
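For readers following along outside the kernel tree, the skip-buddy selection being debated can be modelled in a few lines of plain C. This is a toy sketch with invented names and an assumed fairness granularity, not the scheduler code itself:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a sched_entity: only the virtual runtime matters. */
struct entity { unsigned long long vruntime; };

/* Assumed fairness granularity; the real kernel derives this differently. */
#define GRAN 1000ULL

/*
 * Rough analogue of wakeup_preempt_entity(): nonzero means picking 'cand'
 * over 'first' would be too unfair. A NULL candidate mirrors the !second
 * case flagged in the review.
 */
static int too_unfair(const struct entity *cand, const struct entity *first)
{
    if (!cand)
        return 1;
    return cand->vruntime > first->vruntime + GRAN;
}

/*
 * Pick the leftmost entity unless it is the skip buddy (the task that just
 * yielded) and the runner-up can run without being too unfair.
 */
static const struct entity *pick(const struct entity *first,
                                 const struct entity *second,
                                 const struct entity *skip)
{
    const struct entity *se = first;

    if (se == skip && !too_unfair(second, first))
        se = second;
    return se;
}
```

The NULL check in too_unfair() is exactly the "oops when !second" fix requested above: without it, a tree with a single runnable task would dereference a NULL runner-up.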
Re: [RFC -v7 PATCH 4/7] Add yield_to(task, preempt) functionality.
On Wed, 2011-01-26 at 17:21 -0500, Rik van Riel wrote:
> +bool __sched yield_to(struct task_struct *p, bool preempt)
> +{
> +	struct task_struct *curr = current;
> +	struct rq *rq, *p_rq;
> +	unsigned long flags;
> +	bool yielded = 0;
> +
> +	local_irq_save(flags);
> +	rq = this_rq();
> +
> +again:
> +	p_rq = task_rq(p);
> +	double_rq_lock(rq, p_rq);
> +	while (task_rq(p) != p_rq) {
> +		double_rq_unlock(rq, p_rq);
> +		goto again;
> +	}
> +
> +	if (!curr->sched_class->yield_to_task)
> +		goto out;
> +
> +	if (curr->sched_class != p->sched_class)
> +		goto out;
> +
> +	if (task_running(p_rq, p) || p->state)
> +		goto out;
> +
> +	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> +
> +out:
> +	double_rq_unlock(rq, p_rq);
> +	local_irq_restore(flags);
> +
> +	if (yielded)
> +		yield();
> +
> +	return yielded;
> +}
> +EXPORT_SYMBOL_GPL(yield_to);

yield() will again acquire rq->lock.. why not simply have ->yield_to_task()
do everything required and make that an unconditional schedule()?
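The again:/double_rq_lock() retry pattern in the patch above is worth isolating. Below is a minimal userspace sketch of the same shape using pthread mutexes; the struct names and the address-ordered locking are illustrative assumptions, not the kernel's implementation:

```c
#include <assert.h>
#include <pthread.h>

/* Toy runqueue and task, invented for illustration. */
struct rq   { pthread_mutex_t lock; };
struct task { struct rq *rq; };        /* queue the task currently lives on */

/* Lock two runqueues in a fixed (address) order to avoid ABBA deadlock. */
static void double_rq_lock(struct rq *a, struct rq *b)
{
    if (a == b) {
        pthread_mutex_lock(&a->lock);
    } else if (a < b) {
        pthread_mutex_lock(&a->lock);
        pthread_mutex_lock(&b->lock);
    } else {
        pthread_mutex_lock(&b->lock);
        pthread_mutex_lock(&a->lock);
    }
}

static void double_rq_unlock(struct rq *a, struct rq *b)
{
    pthread_mutex_unlock(&a->lock);
    if (a != b)
        pthread_mutex_unlock(&b->lock);
}

/*
 * The task may migrate between reading p->rq and acquiring the locks, so
 * re-check after locking and retry -- the same shape as the "again:" loop
 * in yield_to() above.
 */
static struct rq *lock_task_rq(struct rq *my_rq, struct task *p)
{
    struct rq *p_rq;

again:
    p_rq = p->rq;
    double_rq_lock(my_rq, p_rq);
    if (p->rq != p_rq) {
        double_rq_unlock(my_rq, p_rq);
        goto again;
    }
    return p_rq;
}
```

The revalidation after locking is the essential part: the snapshot of p->rq taken before double_rq_lock() may already be stale by the time both locks are held.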
Re: [RFC -v7 PATCH 5/7] export pid symbols needed for kvm_vcpu_on_spin
On Wed, 2011-01-26 at 17:23 -0500, Rik van Riel wrote:
> Export the symbols required for a race-free kvm_vcpu_on_spin.
>
> Avi, you asked for an example of why I hated KVM as a module :-)
>
> Signed-off-by: Rik van Riel r...@redhat.com
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3b159c5..adc8f47 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -191,6 +191,7 @@ void __put_task_struct(struct task_struct *tsk)
>  	if (!profile_handoff_task(tsk))
>  		free_task(tsk);
>  }
> +EXPORT_SYMBOL_GPL(__put_task_struct);
>
>  /*
>   * macro override instead of weak attribute alias, to workaround
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 39b65b6..02f2212 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -435,6 +435,7 @@ struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
>  	rcu_read_unlock();
>  	return pid;
>  }
> +EXPORT_SYMBOL_GPL(get_task_pid);
>
>  struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
>  {
> @@ -446,6 +447,7 @@ struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
>  	rcu_read_unlock();
>  	return result;
>  }
> +EXPORT_SYMBOL_GPL(get_pid_task);
>
>  struct pid *find_get_pid(pid_t nr)
>  {
Re: [PATCH 00/22] [uq/master] Patch queue, part II
On 2011-01-31 11:12, Avi Kivity wrote: On 01/27/2011 03:09 PM, Jan Kiszka wrote: This second round of patches focus on issues in cpus.c, primarily signal related. The highlights are - Add missing KVM_RUN continuation after I/O exits - Fix for timer signal race in KVM entry code under !CONFIG_IOTHREAD (based on Stefan's findings) - MCE signal processing under !CONFIG_IOTHREAD - Prevent abortion on multiple VCPU kicks - Fixed synchronous (ie. VCPU-issued) reset request processing Topics remaining for a third round are cleanups and refactorings of the KVM VCPU entry/exit path, MCE rework, and a few assorted cleanups. But I will wait at least until these bits here are ready for an upstream merge. Pretty nice. Is this standalone or does it depend on unmerged changes? It only depends on my previous patches in current uq/master. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On Mon, Jan 31, 2011 at 11:27 AM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-01-31 11:03, Avi Kivity wrote: On 01/27/2011 04:33 PM, Jan Kiszka wrote: Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between checking for exit_request on vcpu entry and timer signals arriving before KVM starts to catch them. Plug it by blocking both timer related signals also on !CONFIG_IOTHREAD and process those via signalfd. As this fix depends on real signalfd support (otherwise the timer signals only kick the compat helper thread, and the main thread hangs), we need to detect the invalid constellation and abort configure. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Stefan Hajnoczistefa...@linux.vnet.ibm.com --- I don't want to invest that much into !IOTHREAD anymore, so let's see if the proposed catchabort is acceptable. I don't understand the dependency on signalfd. The normal way of doing things, either waiting for the signal in sigtimedwait() or in ioctl(KVM_RUN), works with SIGALRM just fine. And how would you be kicked out of the select() call if it is waiting with a timeout? We only have a single thread here. The only alternative is Stefan's original proposal. But that required fiddling with the signal mask twice per KVM_RUN. I think my original patch messed with the sigmask in the wrong place, as you mentioned doing it twice per KVM_RUN isn't a good idea. I wonder if we can enable SIGALRM only in blocking calls and guest code execution but without signalfd. It might be possible, I don't see an immediate problem with doing that, we might have to use pselect(2) or similar in a few places. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On 2011-01-31 13:13, Stefan Hajnoczi wrote: On Mon, Jan 31, 2011 at 11:27 AM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-01-31 11:03, Avi Kivity wrote: On 01/27/2011 04:33 PM, Jan Kiszka wrote: Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between checking for exit_request on vcpu entry and timer signals arriving before KVM starts to catch them. Plug it by blocking both timer related signals also on !CONFIG_IOTHREAD and process those via signalfd. As this fix depends on real signalfd support (otherwise the timer signals only kick the compat helper thread, and the main thread hangs), we need to detect the invalid constellation and abort configure. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Stefan Hajnoczistefa...@linux.vnet.ibm.com --- I don't want to invest that much into !IOTHREAD anymore, so let's see if the proposed catchabort is acceptable. I don't understand the dependency on signalfd. The normal way of doing things, either waiting for the signal in sigtimedwait() or in ioctl(KVM_RUN), works with SIGALRM just fine. And how would you be kicked out of the select() call if it is waiting with a timeout? We only have a single thread here. The only alternative is Stefan's original proposal. But that required fiddling with the signal mask twice per KVM_RUN. I think my original patch messed with the sigmask in the wrong place, as you mentioned doing it twice per KVM_RUN isn't a good idea. I wonder if we can enable SIGALRM only in blocking calls and guest code execution but without signalfd. It might be possible, I don't see an immediate problem with doing that, we might have to use pselect(2) or similar in a few places. My main concern about alternative approaches is that IOTHREAD is about to become the default, and hardly anyone (of the few upstream KVM users) will run without it in the foreseeable future. The next step will be the removal of any !CONFIG_IOTHREAD section. 
So, how much do we want to invest here (provided my proposal has no remaining issues)?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On 2011-01-31 12:36, Jan Kiszka wrote:
> On 2011-01-31 11:08, Avi Kivity wrote:
>> On 01/27/2011 03:10 PM, Jan Kiszka wrote:
>>> Align with qemu-kvm and prepare for IO exit fix: There is no need to run
>>> kvm_arch_process_irqchip_events in the inner VCPU loop. Any state change
>>> this service processes will first cause an exit from kvm_cpu_exec
>>> anyway. And we will have to reenter the kernel on IO exits
>>> unconditionally, something that the current logic prevents.
>>>
>>> Signed-off-by: Jan Kiszka jan.kis...@siemens.com
>>> ---
>>>  kvm-all.c | 11 ++-
>>>  1 files changed, 6 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/kvm-all.c b/kvm-all.c
>>> index 5bfa8c0..46ecc1c 100644
>>> --- a/kvm-all.c
>>> +++ b/kvm-all.c
>>> @@ -892,6 +892,12 @@ int kvm_cpu_exec(CPUState *env)
>>>      DPRINTF("kvm_cpu_exec()\n");
>>>
>>> +    if (kvm_arch_process_irqchip_events(env)) {
>>> +        env->exit_request = 0;
>>> +        env->exception_index = EXCP_HLT;
>>> +        return 0;
>>> +    }
>>> +
>>>      do {
>>>  #ifndef CONFIG_IOTHREAD
>>>          if (env->exit_request) {
>>> @@ -901,11 +907,6 @@ int kvm_cpu_exec(CPUState *env)
>>>          }
>>
>> We check for ->exit_request here
>>
>>>  #endif
>>> -        if (kvm_arch_process_irqchip_events(env)) {
>>> -            ret = 0;
>>> -            break;
>>> -        }
>>> -
>>
>> But this checks for ->interrupt_request. What ensures that we exit when
>> ->interrupt_request is set?
>
> Good question, need to check again. But if that turns out to be an issue,
> qemu-kvm would be broken as well. I'm just aligning the code here.

The only thing we miss by moving process_irqchip_events is a self-INIT of
an AP - if such a thing exists in real life. In that case, the AP would
cause a reset of itself, followed by a transition to HALT state. A
self-SIPI has no effect as A) a CPU can't send a SIPI from wait-on-SIPI
state and B) SIPIs are ignored in any other state.

Will post a version that additionally checks for pending INIT as well and
injects a self-IPI then.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel
On 01/30/2011 06:38 AM, Sheng Yang wrote:
> (Sorry, missed this mail...)
>
> On Mon, Jan 17, 2011 at 02:29:44PM +0200, Avi Kivity wrote:
>> On 01/06/2011 12:19 PM, Sheng Yang wrote:
>>> Then we can support mask bit operation of assigned devices now.
>>>
>>> +int kvm_assigned_device_update_msix_mask_bit(struct kvm *kvm,
>>> +		int assigned_dev_id, int entry, bool mask)
>>> +{
>>> +	int r = -EFAULT;
>>> +	struct kvm_assigned_dev_kernel *adev;
>>> +	int i;
>>> +
>>> +	if (!irqchip_in_kernel(kvm))
>>> +		return r;
>>> +
>>> +	mutex_lock(&kvm->lock);
>>> +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
>>> +				     assigned_dev_id);
>>> +	if (!adev)
>>> +		goto out;
>>> +
>>> +	for (i = 0; i < adev->entries_nr; i++)
>>> +		if (adev->host_msix_entries[i].entry == entry) {
>>> +			if (mask)
>>> +				disable_irq_nosync(
>>> +					adev->host_msix_entries[i].vector);
>>
>> Is it okay to call disable_irq_nosync() here? IIRC we don't check the
>> mask bit on irq delivery, so we may forward an interrupt to the guest
>> after the mask bit was set.
>>
>> What does pci say about the mask bit? when does it take effect?
>>
>> Another question is whether disable_irq_nosync() actually programs the
>> device mask bit, or not. If it does, then it's slow, and it may be
>> better to leave interrupts enabled but have an internal pending bit. If
>> it doesn't program the mask bit, it's fine.
>
> I think Michael and Jan had explained this.
>
>>> +			else
>>> +				enable_irq(adev->host_msix_entries[i].vector);
>>> +			r = 0;
>>> +			break;
>>> +		}
>>> +out:
>>> +	mutex_unlock(&kvm->lock);
>>> +	return r;
>>> +}
>>> +
>>> +static int msix_table_mmio_read(struct kvm_io_device *this, gpa_t addr,
>>> +				int len, void *val)
>>> +{
>>> +	struct kvm_msix_mmio_dev *mmio_dev =
>>> +		container_of(this, struct kvm_msix_mmio_dev, table_dev);
>>> +	struct kvm_msix_mmio *mmio;
>>> +	int idx, ret = 0, entry, offset, r;
>>> +
>>> +	mutex_lock(&mmio_dev->lock);
>>> +	idx = get_mmio_table_index(mmio_dev, addr, len);
>>> +	if (idx < 0) {
>>> +		ret = -EOPNOTSUPP;
>>> +		goto out;
>>> +	}
>>> +	if ((addr & 0x3) || (len != 4 && len != 8))
>>> +		goto out;
>>> +
>>> +	offset = addr & 0xf;
>>> +	if (offset == PCI_MSIX_ENTRY_VECTOR_CTRL && len == 8)
>>> +		goto out;
>>> +
>>> +	mmio = &mmio_dev->mmio[idx];
>>> +	entry = (addr - mmio->table_base_addr) / PCI_MSIX_ENTRY_SIZE;
>>> +	r = copy_from_user(val, (void __user *)(mmio->table_base_va +
>>> +			entry * PCI_MSIX_ENTRY_SIZE + offset), len);
>>> +	if (r)
>>> +		goto out;
>>
>> and return ret == 0?
>
> Yes. This operation should be handled by in-kernel MSI-X MMIO. So we
> return 0 in order to omit this action. We can add warning to it later.

But it failed. We need to return -EFAULT. The same as above.

>>> +
>>> +	if ((offset < PCI_MSIX_ENTRY_VECTOR_CTRL && len == 4) ||
>>> +	    (offset < PCI_MSIX_ENTRY_DATA && len == 8))
>>> +		ret = -ENOTSYNC;
>>
>> goto out?
>
> No. This judgement only checks if MSI data/address was touched. And the
> line below would check if we need to operate the mask bit. Because in
> theory the guest can use len=8 to modify MSI-X data and ctrl at the same
> time.

Ok, makes sense.

-- 
error compiling committee.c: too many arguments to function
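The entry/offset decoding discussed above follows directly from the MSI-X table layout (16-byte entries, vector control dword at offset 12). A standalone sketch, with the struct name and helper invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* MSI-X table entry layout per the PCI spec: 16 bytes per entry. */
#define PCI_MSIX_ENTRY_SIZE         16
#define PCI_MSIX_ENTRY_LOWER_ADDR    0
#define PCI_MSIX_ENTRY_UPPER_ADDR    4
#define PCI_MSIX_ENTRY_DATA          8
#define PCI_MSIX_ENTRY_VECTOR_CTRL  12

struct msix_access {
    int entry;          /* table entry the access hits */
    int offset;         /* byte offset within that entry */
    int touches_ctrl;   /* does it cover the vector-control dword? */
};

/*
 * Decode a table-relative address/len pair the way the code under review
 * does: entry index from the address, dword offset from the low bits, and
 * whether a 4- or 8-byte access reaches the vector control word (where
 * the mask bit lives).
 */
static struct msix_access decode(uint64_t addr, int len)
{
    struct msix_access a;

    a.entry  = (int)(addr / PCI_MSIX_ENTRY_SIZE);
    a.offset = (int)(addr & 0xf);
    a.touches_ctrl = (a.offset == PCI_MSIX_ENTRY_VECTOR_CTRL) ||
                     (a.offset == PCI_MSIX_ENTRY_DATA && len == 8);
    return a;
}
```

The len == 8 case in touches_ctrl is the situation Sheng describes: a single 8-byte guest write starting at the data dword modifies both the MSI data and the vector control in one access.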
Re: [PATCH 01/22] Prevent abortion on multiple VCPU kicks
On 01/31/2011 01:19 PM, Jan Kiszka wrote:
> On 2011-01-31 10:44, Avi Kivity wrote:
>> On 01/27/2011 03:09 PM, Jan Kiszka wrote:
>>> If we call qemu_cpu_kick more than once before the target was able to
>>> process the signal, pthread_kill will fail, and qemu will abort.
>>> Prevent this by avoiding the redundant signal.
>>
>> Doesn't fit with the manual page (or with the idea that signals are
>> asynchronous):
>>
>> NAME
>>        pthread_kill - send a signal to a thread
>> ...
>> ERRORS
>>        ESRCH  No thread with the ID thread could be found.
>>        EINVAL An invalid signal was specified.
>
> Valid remark, but I was receiving EAGAIN for blocked RT signals. Don't
> know if this is Linux-specific. A quick glance at the man pages did not
> reveal if this is allowed or at least gray area.

	} else if (!is_si_special(info)) {
		if (sig >= SIGRTMIN && info->si_code != SI_USER) {
			/*
			 * Queue overflow, abort.  We may abort if the
			 * signal was rt and sent by user using something
			 * other than kill().
			 */
			trace_signal_overflow_fail(sig, group, info);
			return -EAGAIN;
		}

> However, even when selectively ignoring this, it's more efficient to
> catch the redundant signaling in user space.

Yes.

-- 
error compiling committee.c: too many arguments to function
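Catching the redundant signal in user space, as agreed above, can be done with a per-vCPU pending flag so that only the 0 -> 1 transition actually sends the signal. A hypothetical sketch (names invented; real code would call pthread_kill where the counter is bumped):

```c
#include <assert.h>
#include <stdatomic.h>

/* One pending-kick flag per vCPU; names are invented for this sketch. */
struct vcpu {
    atomic_int kick_pending;
    int signals_sent;          /* stands in for pthread_kill() calls */
};

/*
 * Deliver the kick only on the 0 -> 1 transition of the flag, so a burst
 * of kicks queues at most one RT signal for the target thread and the
 * kernel's RT-signal queue can never overflow with EAGAIN.
 */
static void cpu_kick(struct vcpu *v)
{
    if (atomic_exchange(&v->kick_pending, 1) == 0)
        v->signals_sent++;     /* real code: pthread_kill(tid, SIG_IPI) */
}

/* Run by the vCPU thread once the signal has been consumed. */
static void cpu_kick_done(struct vcpu *v)
{
    atomic_store(&v->kick_pending, 0);
}
```

The atomic exchange makes the check-and-set a single step, so two concurrent kickers cannot both observe the flag as clear.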
Re: [PATCH 04/22] Leave inner main_loop faster on pending requests
On 01/31/2011 01:22 PM, Jan Kiszka wrote: On 2011-01-31 10:52, Avi Kivity wrote: On 01/27/2011 03:09 PM, Jan Kiszka wrote: If there is any pending request that requires us to leave the inner loop if main_loop, makes sure we do this as soon as possible by enforcing non-blocking IO processing. At this change, move variable definitions out of the inner loop to improve readability. Signed-off-by: Jan Kiszkajan.kis...@siemens.com --- vl.c | 11 +++ 1 files changed, 7 insertions(+), 4 deletions(-) diff --git a/vl.c b/vl.c index 5fad700..2ebc55b 100644 --- a/vl.c +++ b/vl.c @@ -1384,18 +1384,21 @@ qemu_irq qemu_system_powerdown; static void main_loop(void) { +bool nonblocking = false; +#ifdef CONFIG_PROFILER +int64_t ti; +#endif int r; qemu_main_loop_start(); for (;;) { do { -bool nonblocking = false; -#ifdef CONFIG_PROFILER -int64_t ti; -#endif #ifndef CONFIG_IOTHREAD nonblocking = cpu_exec_all(); +if (!vm_can_run()) { +nonblocking = true; +} Doesn't this cause vmstop to spin? We'll never execute main_loop_wait(false) if I read the code correctly? The code path is not changed, we just poll instead of wait in main_loop_wait. Where do we wait then? Also, I didn't get your error scenario yet. Even if we left the loop here, what magic would main_loop_wait do to vmstop processing? The stop request is handled outside the loop, that's why we should leave ASAP. I'm just missing the alternate place where we sleep. If there's no such place, we spin. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On 01/31/2011 01:27 PM, Jan Kiszka wrote: On 2011-01-31 11:03, Avi Kivity wrote: On 01/27/2011 04:33 PM, Jan Kiszka wrote: Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between checking for exit_request on vcpu entry and timer signals arriving before KVM starts to catch them. Plug it by blocking both timer related signals also on !CONFIG_IOTHREAD and process those via signalfd. As this fix depends on real signalfd support (otherwise the timer signals only kick the compat helper thread, and the main thread hangs), we need to detect the invalid constellation and abort configure. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Stefan Hajnoczistefa...@linux.vnet.ibm.com --- I don't want to invest that much into !IOTHREAD anymore, so let's see if the proposed catchabort is acceptable. I don't understand the dependency on signalfd. The normal way of doing things, either waiting for the signal in sigtimedwait() or in ioctl(KVM_RUN), works with SIGALRM just fine. And how would you be kicked out of the select() call if it is waiting with a timeout? We only have a single thread here. If we use signalfd() (either kernel provided or thread+pipe), we kick out of select by select()ing it (though I don't see how it works without an iothread, since an fd can't stop a vcpu unless you enable SIGIO on it, which is silly for signalfd) If you leave it as a naked signal, then it can break out of either pselect() or vcpu. Since the goal is to drop !CONFIG_IOTHREAD, the first path seems better, I just don't understand the problem with emulated signalfd(). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] KVM: Add documents for MSI-X MMIO API
On 01/26/2011 11:05 AM, Sheng Yang wrote:
> On Tuesday 25 January 2011 20:47:38 Avi Kivity wrote:
>> On 01/19/2011 10:21 AM, Sheng Yang wrote:
>>> We already got a guest MMIO address for that in the exit information.
>>> I've created a chain of handlers in qemu to handle it.
>>
>> But we already decoded the table and entry...
>
> But the handler is still wrapped by vcpu_mmio_write(), as a part of
> MMIO. So it's not quite handy to get the table and entry out.
>
>> The kernel handler can create a new kvm_run exit description.
>
> Also the updater in userspace can share most of the logic with the
> ordinary userspace MMIO handler, which takes an address as parameter.
> So I think we don't need to pass the decoded table_id and entry to
> userspace.
>
>> It's mixing layers, which always leads to trouble. For one, the user
>> handler shouldn't do anything with the write since the kernel already
>> wrote it into the table. For another, if two vcpus write to the same
>> entry simultaneously, you could see different ordering in the kernel
>> and userspace, and get inconsistent results.
>
> The shared logic is not about writing, but about interpreting what's
> written. The old MMIO handler would write the data, then interpret it;
> our new MMIO handler would only share the logic of interpretation. I
> think that's fair enough?

It doesn't make sense from an API point of view. You registered a table
of entries; you expect an exit on that table to point to the table and
entry that got changed.

-- 
error compiling committee.c: too many arguments to function
Re: [RFC -v7 PATCH 5/7] export pid symbols needed for kvm_vcpu_on_spin
On 01/31/2011 01:51 PM, Peter Zijlstra wrote: On Wed, 2011-01-26 at 17:23 -0500, Rik van Riel wrote: Export the symbols required for a race-free kvm_vcpu_on_spin. Avi, you asked for an example of why I hated KVM as a module :-) Why do you dislike exports so much? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On Mon, Jan 31, 2011 at 12:18 PM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-01-31 13:13, Stefan Hajnoczi wrote: On Mon, Jan 31, 2011 at 11:27 AM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-01-31 11:03, Avi Kivity wrote: On 01/27/2011 04:33 PM, Jan Kiszka wrote: Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between checking for exit_request on vcpu entry and timer signals arriving before KVM starts to catch them. Plug it by blocking both timer related signals also on !CONFIG_IOTHREAD and process those via signalfd. As this fix depends on real signalfd support (otherwise the timer signals only kick the compat helper thread, and the main thread hangs), we need to detect the invalid constellation and abort configure. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Stefan Hajnoczistefa...@linux.vnet.ibm.com --- I don't want to invest that much into !IOTHREAD anymore, so let's see if the proposed catchabort is acceptable. I don't understand the dependency on signalfd. The normal way of doing things, either waiting for the signal in sigtimedwait() or in ioctl(KVM_RUN), works with SIGALRM just fine. And how would you be kicked out of the select() call if it is waiting with a timeout? We only have a single thread here. The only alternative is Stefan's original proposal. But that required fiddling with the signal mask twice per KVM_RUN. I think my original patch messed with the sigmask in the wrong place, as you mentioned doing it twice per KVM_RUN isn't a good idea. I wonder if we can enable SIGALRM only in blocking calls and guest code execution but without signalfd. It might be possible, I don't see an immediate problem with doing that, we might have to use pselect(2) or similar in a few places. My main concern about alternative approaches is that IOTHREAD is about to become the default, and hardly anyone (of the few upstream KVM users) will run without it in the foreseeable future. 
The next step will be the removal of any !CONFIG_IOTHREAD section. So, how much do we want to invest here (provided my proposal has not remaining issues)? Yes, you're right. I'm not volunteering to dig more into this, the best case would be to switch to a non-I/O thread world that works for everybody. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC -v7 PATCH 5/7] export pid symbols needed for kvm_vcpu_on_spin
On Mon, 2011-01-31 at 15:26 +0200, Avi Kivity wrote:
> On 01/31/2011 01:51 PM, Peter Zijlstra wrote:
>> On Wed, 2011-01-26 at 17:23 -0500, Rik van Riel wrote:
>>> Export the symbols required for a race-free kvm_vcpu_on_spin.
>>>
>>> Avi, you asked for an example of why I hated KVM as a module :-)
>
> Why do you dislike exports so much?

Because I hate modules [*]. Since we sadly cannot undo those, limiting
the module interface is all we've got, and KVM is one that pushes the
module interface further and further. Many people see an EXPORT as an
acknowledgement that it's OK to use the interface, even when it's not.

[*] they enable the binary junk thing.
Re: [RFC -v7 PATCH 5/7] export pid symbols needed for kvm_vcpu_on_spin
On 01/31/2011 03:43 PM, Peter Zijlstra wrote: On Mon, 2011-01-31 at 15:26 +0200, Avi Kivity wrote: On 01/31/2011 01:51 PM, Peter Zijlstra wrote: On Wed, 2011-01-26 at 17:23 -0500, Rik van Riel wrote: Export the symbols required for a race-free kvm_vcpu_on_spin. Avi, you asked for an example of why I hated KVM as a module :-) Why do you dislike exports so much? Because I hate modules [*], Well, you could have just said you hate KVM as a module because it is a module, and saved one indirection. since we sadly cannot undo those, limiting the module interface is all we've got, and KVM is one that pushes the module interface further and further. Many people see an EXPORT as a acknowledgement that its OK to use the interface, even when its not. [*] they enable the binary junk thing. Then we need to stop binary modules, not castrate modules as a whole. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM: MMU: update sp-gfns on pte update path
On Tue, Jan 25, 2011 at 07:36:02PM +0200, Avi Kivity wrote:
> On 01/25/2011 07:12 PM, Marcelo Tosatti wrote:
>>> Should be done by a call to kvm_mmu_page_set_gfn(). But I don't
>>> understand how it could become inconsistent in the first place.
>>
>>	if (is_rmap_spte(*sptep)) {
>>		/*
>>		 * If we overwrite a PTE page pointer with a 2MB PMD, unlink
>>		 * the parent of the now unreachable PTE.
>>		 */
>>		if (level > PT_PAGE_TABLE_LEVEL && !is_large_pte(*sptep)) {
>>			struct kvm_mmu_page *child;
>>			u64 pte = *sptep;
>>
>>			child = page_header(pte & PT64_BASE_ADDR_MASK);
>>			mmu_page_remove_parent_pte(child, sptep);
>>			__set_spte(sptep, shadow_trap_nonpresent_pte);
>>			kvm_flush_remote_tlbs(vcpu->kvm);
>>		} else if (pfn != spte_to_pfn(*sptep)) {
>>			pgprintk("hfn old %llx new %llx\n",
>>				 spte_to_pfn(*sptep), pfn);
>>			drop_spte(vcpu->kvm, sptep, shadow_trap_nonpresent_pte);
>>			kvm_flush_remote_tlbs(vcpu->kvm);
>>		} else
>>			was_rmapped = 1;
>>	}
>>
>> If we set was_rmapped, that means rmap_add() was previously called for
>> this spte/gfn/pfn pair, and all that changes is permissions, no?
>
> What if pfn is the same but gfn differs?

Could be. Any way to verify if this was the case?

> Isn't it nicer to have it detected by the test above and do the
> drop_spte()/kvm_flush_remote_tlbs() thing instead?

It could not be the case. If the spte is updated to point to a new gfn,
and the rmap is not updated:

1. rmap[A] = spte
   sp->gfns[i] = A
   spte points to gfn A

2. rmap[A] = spte
   sp->gfns[i] = A
   spte points to gfn B

rmap_remove(spte) will succeed (as in not crash). In case gfn A's slot is
removed, all shadow pages will be destroyed. So what can fail from this
point on are operations on gfn B, such as rmap_write_protect(B).

Yes, it's nicer (and correct) to do it at drop_spte. Will resubmit.

However, I still have no explanation for Nicolas' BUGs... ideas?
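The two-state enumeration above can be made concrete with a toy reverse map. In this sketch (invented names, a single spte), repointing the spte without updating the rmap leaves operations on gfn B unable to find the spte, exactly the rmap_write_protect(B) failure described:

```c
#include <assert.h>

#define NGFNS   4
#define NO_SPTE (-1)

/* Toy model, invented for illustration: rmap indexed by gfn, one spte. */
static int rmap[NGFNS] = { NO_SPTE, NO_SPTE, NO_SPTE, NO_SPTE };
static int spte_gfn;           /* the gfn the spte actually points to */

/* rmap_add() analogue: file the spte under the gfn it maps (state 1). */
static void map(int gfn, int spte)
{
    rmap[gfn] = spte;
    spte_gfn = gfn;
}

/*
 * The buggy transition: repoint the spte to a new gfn WITHOUT moving the
 * rmap entry -- state 2 in the enumeration above.
 */
static void buggy_repoint(int new_gfn)
{
    spte_gfn = new_gfn;        /* rmap[] left stale under the old gfn */
}
```

Removing the stale entry still "works" (it is filed under gfn A, so rmap_remove does not crash), which matches the observation that the failure mode is silent until something operates on gfn B.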
Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On 2011-01-31 14:22, Avi Kivity wrote: On 01/31/2011 01:27 PM, Jan Kiszka wrote: On 2011-01-31 11:03, Avi Kivity wrote: On 01/27/2011 04:33 PM, Jan Kiszka wrote: Found by Stefan Hajnoczi: There is a race in kvm_cpu_exec between checking for exit_request on vcpu entry and timer signals arriving before KVM starts to catch them. Plug it by blocking both timer related signals also on !CONFIG_IOTHREAD and process those via signalfd. As this fix depends on real signalfd support (otherwise the timer signals only kick the compat helper thread, and the main thread hangs), we need to detect the invalid constellation and abort configure. Signed-off-by: Jan Kiszkajan.kis...@siemens.com CC: Stefan Hajnoczistefa...@linux.vnet.ibm.com --- I don't want to invest that much into !IOTHREAD anymore, so let's see if the proposed catchabort is acceptable. I don't understand the dependency on signalfd. The normal way of doing things, either waiting for the signal in sigtimedwait() or in ioctl(KVM_RUN), works with SIGALRM just fine. And how would you be kicked out of the select() call if it is waiting with a timeout? We only have a single thread here. If we use signalfd() (either kernel provided or thread+pipe), we kick out of select by select()ing it (though I don't see how it works without an iothread, since an fd can't stop a vcpu unless you enable SIGIO on it, which is silly for signalfd) If you leave it as a naked signal, then it can break out of either pselect() or vcpu. Since the goal is to drop !CONFIG_IOTHREAD, the first path seems better, I just don't understand the problem with emulated signalfd(). With the emulated signalfd, there won't be any signal for the VCPU while in KVM_RUN. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/22] Leave inner main_loop faster on pending requests
On 2011-01-31 14:17, Avi Kivity wrote: On 01/31/2011 01:22 PM, Jan Kiszka wrote: On 2011-01-31 10:52, Avi Kivity wrote: On 01/27/2011 03:09 PM, Jan Kiszka wrote: If there is any pending request that requires us to leave the inner loop if main_loop, makes sure we do this as soon as possible by enforcing non-blocking IO processing. At this change, move variable definitions out of the inner loop to improve readability. Signed-off-by: Jan Kiszkajan.kis...@siemens.com --- vl.c | 11 +++ 1 files changed, 7 insertions(+), 4 deletions(-) diff --git a/vl.c b/vl.c index 5fad700..2ebc55b 100644 --- a/vl.c +++ b/vl.c @@ -1384,18 +1384,21 @@ qemu_irq qemu_system_powerdown; static void main_loop(void) { +bool nonblocking = false; +#ifdef CONFIG_PROFILER +int64_t ti; +#endif int r; qemu_main_loop_start(); for (;;) { do { -bool nonblocking = false; -#ifdef CONFIG_PROFILER -int64_t ti; -#endif #ifndef CONFIG_IOTHREAD nonblocking = cpu_exec_all(); +if (!vm_can_run()) { +nonblocking = true; +} Doesn't this cause vmstop to spin? We'll never execute main_loop_wait(false) if I read the code correctly? The code path is not changed, we just poll instead of wait in main_loop_wait. Where do we wait then? Also, I didn't get your error scenario yet. Even if we left the loop here, what magic would main_loop_wait do to vmstop processing? The stop request is handled outside the loop, that's why we should leave ASAP. I'm just missing the alternate place where we sleep. If there's no such place, we spin. Probably you are misled by the name of vm_can_run. It should better be renamed to requests_pending or something like that - iow, it is only true if some request is pending and not if just the vm is in stop mode. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC -v7 PATCH 3/7] sched: use a buddy to implement yield_task_fair
On 01/31/2011 06:47 AM, Peter Zijlstra wrote: On Wed, 2011-01-26 at 17:21 -0500, Rik van Riel wrote:

+static struct sched_entity *__pick_second_entity(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *left = cfs_rq->rb_leftmost;
+	struct rb_node *second;
+
+	if (!left)
+		return NULL;
+
+	second = rb_next(left);
+
+	if (!second)
+		second = left;
+
+	return rb_entry(second, struct sched_entity, run_node);
+}

I would still prefer: sed -i 's/__pick_next_entity/__pick_first_entity/g' kernel/sched*.[ch]

Ahhh, now I understand what you want. I'll get right on it.

You seem to have lost the hunk removing sysctl_sched_compat_yield from kernel/sysctl.c :-)

Let me take care of that, too.

-- All rights reversed
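The fallback logic in the quoted hunk — take the second-leftmost entity, but fall back to the leftmost when it is the only one — can be shown without an rbtree. This is an illustrative analog, not kernel code: a plain sorted array stands in for the in-order walk that rb_next() performs over the cfs_rq:

```c
#include <stddef.h>

/* Analog of __pick_second_entity(): given the in-order sequence of
 * runnable entities (array stands in for the cfs_rq rbtree), return
 * the second leftmost, falling back to the leftmost when it is alone.
 * Hypothetical simplification for illustration only. */
static const int *pick_second(const int *ents, size_t n)
{
    if (n == 0)
        return NULL;        /* no leftmost node at all */
    if (n == 1)
        return &ents[0];    /* "second" falls back to the leftmost */
    return &ents[1];
}
```

The fallback matters for yield_task_fair: if the yielding task is the only runnable entity, picking "second" must still return something schedulable rather than NULL.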
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On 2011-01-31 14:04, Jan Kiszka wrote: On 2011-01-31 12:36, Jan Kiszka wrote: On 2011-01-31 11:08, Avi Kivity wrote: On 01/27/2011 03:10 PM, Jan Kiszka wrote:

Align with qemu-kvm and prepare for IO exit fix: There is no need to run kvm_arch_process_irqchip_events in the inner VCPU loop. Any state change this service processes will first cause an exit from kvm_cpu_exec anyway. And we will have to reenter the kernel on IO exits unconditionally, something that the current logic prevents.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 kvm-all.c | 11 ++-
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 5bfa8c0..46ecc1c 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -892,6 +892,12 @@ int kvm_cpu_exec(CPUState *env)
     DPRINTF("kvm_cpu_exec()\n");
+    if (kvm_arch_process_irqchip_events(env)) {
+        env->exit_request = 0;
+        env->exception_index = EXCP_HLT;
+        return 0;
+    }
+
     do {
 #ifndef CONFIG_IOTHREAD
         if (env->exit_request) {
@@ -901,11 +907,6 @@ int kvm_cpu_exec(CPUState *env)
         }

We check for ->exit_request here

 #endif
-        if (kvm_arch_process_irqchip_events(env)) {
-            ret = 0;
-            break;
-        }
-

But this checks for ->interrupt_request. What ensures that we exit when ->interrupt_request is set?

Good question, need to check again. But if that turns out to be an issue, qemu-kvm would be broken as well. I'm just aligning the code here. The only thing we miss by moving process_irqchip_events is a self-INIT of an AP - if such thing exists in real life. In that case, the AP would cause a reset of itself, followed by a transition to HALT state.

I checked again with the Intel spec, and a self-INIT is invalid (at least when specified via shorthand). So I'm under the impression now that we can safely ignore this case and leave the patch as is. Any different views?

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
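The control flow the patch establishes — check irqchip events once, before the inner KVM_RUN loop, and bail out halted if they demand it — can be sketched as follows. All names and the halted flag are simplified stand-ins, not the real kvm-all.c code (EXCP_HLT's value matches QEMU's definition, but the rest is illustrative):

```c
#include <stdbool.h>

enum { EXCP_NONE = 0, EXCP_HLT = 0x10001 };  /* 0x10001 as in QEMU */

/* Stand-in for kvm_arch_process_irqchip_events() reporting that the
 * vcpu should halt (e.g. INIT/SIPI processing left it unrunnable). */
static bool halted_by_irqchip_events;

/* Sketch of the reordered kvm_cpu_exec(): the irqchip-event check
 * happens once up front instead of inside every loop iteration. */
static int cpu_exec_sketch(int *exception_index)
{
    if (halted_by_irqchip_events) {
        *exception_index = EXCP_HLT;   /* leave before entering KVM_RUN */
        return 0;
    }
    *exception_index = EXCP_NONE;
    /* inner do { ioctl(KVM_RUN); handle exit; } while (...) loop here */
    return 0;
}
```

The argument for hoisting the check is that any state change the service processes will already have forced an exit from the loop, so re-checking on every iteration only prevents the unconditional reentry needed for the IO exit fix.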
Re: [PATCH v3 14/22] kvm: Fix race between timer signals and vcpu entry under !IOTHREAD
On 01/31/2011 04:31 PM, Jan Kiszka wrote:

And how would you be kicked out of the select() call if it is waiting with a timeout? We only have a single thread here.

If we use signalfd() (either kernel provided or thread+pipe), we kick out of select by select()ing it (though I don't see how it works without an iothread, since an fd can't stop a vcpu unless you enable SIGIO on it, which is silly for signalfd). If you leave it as a naked signal, then it can break out of either pselect() or vcpu. Since the goal is to drop !CONFIG_IOTHREAD, the first path seems better, I just don't understand the problem with emulated signalfd().

With the emulated signalfd, there won't be any signal for the VCPU while in KVM_RUN.

I see it now - with a real signalfd, kvm unmasks the signal, and that takes precedence over signalfd and exits the vcpu.

-- error compiling committee.c: too many arguments to function
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On Mon, Jan 31, 2011 at 04:40:34PM +0100, Jan Kiszka wrote:

I checked again with the Intel spec, and a self-INIT is invalid (at least when specified via shorthand). So I'm under the impression now that we can safely ignore this case and leave the patch as is. Any different views?

IIRC if you don't use shorthand you can send INIT to self.

-- Gleb.
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On 2011-01-31 17:38, Gleb Natapov wrote: On Mon, Jan 31, 2011 at 04:40:34PM +0100, Jan Kiszka wrote:

I checked again with the Intel spec, and a self-INIT is invalid (at least when specified via shorthand). So I'm under the impression now that we can safely ignore this case and leave the patch as is. Any different views?

IIRC if you don't use shorthand you can send INIT to self.

We didn't care so far (in qemu-kvm), do you think we should?

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On 2011-01-31 17:41, Jan Kiszka wrote: On 2011-01-31 17:38, Gleb Natapov wrote:

IIRC if you don't use shorthand you can send INIT to self.

We didn't care so far (in qemu-kvm), do you think we should?

...and the kernel model should have barked INIT on a runnable vcpu in such cases (BTW, that's user triggerable and should likely be rate limited).

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On Mon, Jan 31, 2011 at 05:41:24PM +0100, Jan Kiszka wrote: On 2011-01-31 17:38, Gleb Natapov wrote:

IIRC if you don't use shorthand you can send INIT to self.

We didn't care so far (in qemu-kvm), do you think we should?

Doesn't kernel lapic emulation support this?

-- Gleb.
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On 2011-01-31 17:50, Gleb Natapov wrote:

Doesn't kernel lapic emulation support this?

See my other mail: it supports it, but it apparently doesn't expect this to happen.

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 17/22] kvm: Move irqchip event processing out of inner loop
On Mon, Jan 31, 2011 at 05:52:13PM +0100, Jan Kiszka wrote:

See my other mail: it supports it, but it apparently doesn't expect this to happen.

I saw it, but I do not understand why we print this message. Maybe it was used for debugging in early stages of KVM development.

-- Gleb.
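For reference, the "shorthand" the thread keeps returning to is the destination shorthand field of the local APIC's Interrupt Command Register. Per the Intel SDM layout, delivery mode occupies ICR bits 8-10 (INIT = 101b) and destination shorthand bits 18-19 (self = 01b); the SDM lists INIT combined with the self shorthand as an invalid combination. A small decode sketch (field positions are from the SDM, but the helper itself is hypothetical):

```c
#include <stdint.h>

/* ICR field extraction per the Intel SDM layout. */
#define ICR_DELIVERY_MODE(icr)  (((icr) >> 8) & 0x7u)   /* 101b = INIT  */
#define ICR_SHORTHAND(icr)      (((icr) >> 18) & 0x3u)  /* 01b  = self  */

/* Return 1 for the invalid "INIT to self via shorthand" combination. */
static int icr_is_self_init(uint32_t icr)
{
    return ICR_DELIVERY_MODE(icr) == 0x5 && ICR_SHORTHAND(icr) == 0x1;
}
```

Gleb's point is that with shorthand 00b and an explicit destination equal to the sender's own APIC ID, the same INIT can still reach the sending CPU, which is the corner case the moved check no longer observes mid-loop.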
Re: KVM call agenda for Feb 1
On 2011-01-31 11:02, Juan Quintela wrote:

Please send in any agenda items you are interested in covering.

o KVM upstream merge: status, plans, coordination

Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [RFC -v7 PATCH 4/7] Add yield_to(task, preempt) functionality.
On 01/31/2011 06:49 AM, Peter Zijlstra wrote: On Wed, 2011-01-26 at 17:21 -0500, Rik van Riel wrote:

+	if (yielded)
+		yield();
+
+	return yielded;
+}
+EXPORT_SYMBOL_GPL(yield_to);

yield() will again acquire rq->lock.. why not simply have ->yield_to_task() do everything required and make that an unconditional schedule()?

I wanted to reduce code duplication, since this code path should not be taken all that frequently. But hey, you're the maintainer, so I'll implement it your way :) The patches should be on the way later today :)

-- All rights reversed
Re: [RFC PATCH 0/2] Expose available KVM free memory slot count to help avoid aborts
On Mon, Jan 24, 2011 at 10:37:56PM -0700, Alex Williamson wrote: On Mon, 2011-01-24 at 08:44 -0700, Alex Williamson wrote: I'll look at how we might be able to allocate slots on demand. Thanks,

Here's a first cut just to see if this looks agreeable. This allows the slot array to grow on demand. This works with current userspace, as well as userspace trivially modified to double KVMState.slots and hotplugging enough pci-assign devices to exceed the previous limit (w/ & w/o ept). Hopefully I got the rcu bits correct. Does this look like the right path? If so, I'll work on removing the fixed limit from userspace next. Thanks,

Alex

kvm: Allow memory slot array to grow on demand

Remove the fixed KVM_MEMORY_SLOTS limit, allowing the slot array to grow on demand. Private slots are now allocated at the front instead of the end. Only x86 seems to use private slots, so this is now zero for all other archs. The memslots pointer is already updated using rcu, so changing the size of the array when it's replaced is straightforward. x86 also keeps a bitmap of slots used by a kvm_mmu_page, which requires a shadow tlb flush whenever we increase the number of slots. This forces the pages to be rebuilt with the new bitmap size.
Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 arch/ia64/include/asm/kvm_host.h    |    4 --
 arch/ia64/kvm/kvm-ia64.c            |    2 +
 arch/powerpc/include/asm/kvm_host.h |    3 --
 arch/s390/include/asm/kvm_host.h    |    3 --
 arch/x86/include/asm/kvm_host.h     |    3 +-
 arch/x86/include/asm/vmx.h          |    6 ++-
 arch/x86/kvm/mmu.c                  |    7 +++-
 arch/x86/kvm/x86.c                  |    6 ++-
 include/linux/kvm_host.h            |    7 +++-
 virt/kvm/kvm_main.c                 |   65 ---
 10 files changed, 63 insertions(+), 43 deletions(-)

diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h
index 2689ee5..11d0ab2 100644
--- a/arch/ia64/include/asm/kvm_host.h
+++ b/arch/ia64/include/asm/kvm_host.h
@@ -23,10 +23,6 @@
 #ifndef __ASM_KVM_HOST_H
 #define __ASM_KVM_HOST_H
-#define KVM_MEMORY_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4
-
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
 /* define exit reasons from vmm to kvm*/

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 70d224d..f1adda2 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1814,7 +1814,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 	mutex_lock(&kvm->slots_lock);
 	r = -EINVAL;
-	if (log->slot >= KVM_MEMORY_SLOTS)
+	if (log->slot >= kvm->memslots->nmemslots)
 		goto out;
 	memslot = &kvm->memslots->memslots[log->slot];

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index bba3b9b..dc80057 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -29,9 +29,6 @@
 #include <asm/kvm_asm.h>
 #define KVM_MAX_VCPUS 1
-#define KVM_MEMORY_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index cef7dbf..92a964c 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -20,9 +20,6 @@
 #include <asm/cpu.h>
 #define KVM_MAX_VCPUS 64
-#define KVM_MEMORY_SLOTS 32
-/* memory slots that does not exposed to userspace */
-#define KVM_PRIVATE_MEM_SLOTS 4
 struct sca_entry {
 	atomic_t scn;

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ffd7f8d..df1382c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -27,7 +27,6 @@
 #include <asm/msr-index.h>
 #define KVM_MAX_VCPUS 64
-#define KVM_MEMORY_SLOTS 32
 /* memory slots that does not exposed to userspace */
 #define KVM_PRIVATE_MEM_SLOTS 4
@@ -207,7 +206,7 @@ struct kvm_mmu_page {
 	 * One bit set per slot which has memory
 	 * in this shadow page.
 	 */
-	DECLARE_BITMAP(slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
+	unsigned long *slot_bitmap;
 	bool multimapped;	/* More than one parent_pte? */
 	bool unsync;
 	int root_count;		/* Currently serving as active root */

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 84471b8..7fd8c89 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -370,9 +370,9 @@ enum vmcs_field {
 #define AR_RESERVD_MASK 0xfffe0f00
-#define TSS_PRIVATE_MEMSLOT (KVM_MEMORY_SLOTS + 0)
-#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT (KVM_MEMORY_SLOTS + 1)
-#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT (KVM_MEMORY_SLOTS + 2)
+#define
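The grow-on-demand scheme the patch describes can be sketched in userspace without the RCU machinery. This is a simplified stand-in (ints instead of struct kvm_memory_slot, and the RCU publish/free steps reduced to a comment), not the kernel implementation:

```c
#include <stdlib.h>
#include <string.h>

/* Slot array whose size travels with the pointer, as in the patch. */
struct memslots {
    size_t nmemslots;
    int slots[];            /* stands in for struct kvm_memory_slot[] */
};

/* Allocate a larger copy and return it.  In the kernel the new array
 * would be published with rcu_assign_pointer() and the old one freed
 * only after synchronize_srcu(); here the caller owns both copies. */
static struct memslots *grow_memslots(struct memslots *old, size_t want)
{
    size_t n = old ? old->nmemslots : 0;
    struct memslots *ns;

    if (want <= n)
        return old;         /* already large enough, keep current array */
    ns = malloc(sizeof(*ns) + want * sizeof(ns->slots[0]));
    if (!ns)
        return NULL;
    ns->nmemslots = want;
    memset(ns->slots, 0, want * sizeof(ns->slots[0]));
    if (old)
        memcpy(ns->slots, old->slots, n * sizeof(ns->slots[0]));
    return ns;
}
```

Readers traversing the old array keep a consistent (pointer, size) pair the whole time, which is why bounds checks like `log->slot >= kvm->memslots->nmemslots` can replace the fixed KVM_MEMORY_SLOTS constant.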
Re: [PATCH] V2 Handle guest access to BBL_CR_CTL3 MSR
On Fri, Jan 21, 2011 at 12:21:00AM -0500, john cooper wrote:

[Resubmit of prior version which contained a wayward patch hunk. Thanks Marcelo]

A correction to Intel cpu model CPUID data (patch queued) caused winxp to BSOD when booted with a Penryn model. This was traced to the CPUID model field correction from 6 -> 23 (as is proper for a Penryn class of cpu). Only in this case does the problem surface.

The cause for this failure is winxp accessing the BBL_CR_CTL3 MSR which is unsupported by current kvm, appears to be a legacy MSR not fully characterized yet existing in current silicon, and is apparently carried forward in MSR space to accommodate vintage code as here. It is not yet conclusive whether this MSR implements any of its legacy functionality or is just an ornamental dud for compatibility. While I found no silicon version specific documentation link to this MSR, a general description exists in Intel's developer's reference which agrees with the functional behavior of other bootloader/kernel code I've examined accessing BBL_CR_CTL3. Regrettably winxp appears to be setting bit #19 called out as reserved in the above document.

So to minimally accommodate this MSR, kvm msr get will provide the equivalent mock data and kvm msr write will simply toss the guest passed data without interpretation. While this treatment of BBL_CR_CTL3 addresses the immediate problem, the approach may be modified pending clarification from Intel.
Signed-off-by: john cooper john.coo...@redhat.com
---
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 4d0dfa0..5bfafb6 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -38,6 +38,7 @@
 #define MSR_MTRRcap			0x00fe
 #define MSR_IA32_BBL_CR_CTL		0x0119
+#define MSR_IA32_BBL_CR_CTL3		0x011e
 #define MSR_IA32_SYSENTER_CS		0x0174
 #define MSR_IA32_SYSENTER_ESP		0x0175

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bcc0efc..04d6c55 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1592,6 +1592,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		} else
 			return set_msr_hyperv(vcpu, msr, data);
 		break;
+	case MSR_IA32_BBL_CR_CTL3:
+		/* Drop writes to this legacy MSR -- see rdmsr
+		 * counterpart for further detail.
+		 */
+		pr_unimpl(vcpu, "ignored wrmsr: 0x%x data %llx\n", msr, data);
+		break;
 	default:
 		if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
 			return xen_hvm_config(vcpu, data);
@@ -1846,6 +1852,19 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 		} else
 			return get_msr_hyperv(vcpu, msr, pdata);
 		break;
+	case MSR_IA32_BBL_CR_CTL3:
+		/* This legacy MSR exists but isn't fully documented in current
+		 * silicon. It is however accessed by winxp in very narrow
+		 * scenarios where it sets bit #19, itself documented as
+		 * a reserved bit. Best effort attempt to source coherent
+		 * read data here should the balance of the register be
+		 * interpreted by the guest:
+		 *
+		 * L2 cache control register 3: 64GB range, 256KB size,
+		 * enabled, latency 0x1, configured
+		 */
+		data = 0xbe702111;
+		break;

Why bits 26-29 and 31 enabled?
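Gleb's question about the upper bits can be checked mechanically: decoding 0xbe702111 shows bits 0, 4, 8, 13, 20-22, 25-29 and 31 set, and bit #19 (the reserved bit winxp writes) clear. A trivial bit-probe helper makes this reproducible (plain bit indexing only; no claim about which named fields of the undocumented register these positions correspond to):

```c
#include <stdint.h>

/* Probe a single bit of an MSR value; returns 0 or 1. */
static unsigned msr_bit(uint64_t val, unsigned bit)
{
    return (unsigned)((val >> bit) & 1);
}
```

Whether bits 26-29 and 31 map to meaningful L2 fields or are just part of the "mock" value is exactly what the reply is questioning; the helper only establishes which positions are set.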
[PATCHv2] vhost: force vhost off for non-MSI guests
When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
Untested, I'll send a pull request later after some testing assuming we've resolved the command line comments to everyone's satisfaction.

Changes since v1: added vhostforce to enable vhost for non-MSI guests

 hw/vhost.c      | 16 +++-
 hw/vhost.h      |  3 ++-
 hw/vhost_net.c  |  8 +---
 hw/vhost_net.h  |  2 +-
 hw/virtio-net.c |  6 --
 hw/virtio-pci.c |  6 +-
 hw/virtio.h     |  4 +++-
 net/tap.c       |  6 --
 qemu-options.hx |  4 +++-
 9 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 6082da2..72df75f 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -581,7 +581,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
                        0, virtio_queue_get_desc_size(vdev, idx));
 }
-int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force)
 {
     uint64_t features;
     int r;
@@ -613,6 +613,7 @@ int vhost_dev_init(struct vhost_dev *hdev, int devfd)
     hdev->log_enabled = false;
     hdev->started = false;
     cpu_register_phys_memory_client(&hdev->client);
+    hdev->force = force;
     return 0;
 fail:
     r = -errno;
@@ -636,9 +637,12 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
         goto fail;
     }
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true,
+                                           hdev->force);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        if (r != -EVIRTIO_DISABLED) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        }
         goto fail_notifiers;
     }
@@ -686,7 +690,8 @@ fail_vq:
     }
 fail_mem:
 fail_features:
-    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                       hdev->force);
 fail_notifiers:
 fail:
     return r;
@@ -704,7 +709,8 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
     }
     vhost_client_sync_dirty_bitmap(&hdev->client, 0,
                                    (target_phys_addr_t)~0x0ull);
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                           hdev->force);
     if (r < 0) {
         fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", r);
         fflush(stderr);

diff --git a/hw/vhost.h b/hw/vhost.h
index 86dd834..d5d4d86 100644
--- a/hw/vhost.h
+++ b/hw/vhost.h
@@ -38,9 +38,10 @@ struct vhost_dev {
     bool log_enabled;
     vhost_log_chunk_t *log;
     unsigned long long log_size;
+    bool force;
 };
-int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force);
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);

diff --git a/hw/vhost_net.c b/hw/vhost_net.c
index c068be1..4f57271 100644
--- a/hw/vhost_net.c
+++ b/hw/vhost_net.c
@@ -81,7 +81,8 @@ static int vhost_net_get_fd(VLANClientState *backend)
     }
 }
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
 {
     int r;
     struct vhost_net *net = qemu_malloc(sizeof *net);
@@ -98,7 +99,7 @@ struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
         (1 << VHOST_NET_F_VIRTIO_NET_HDR);
     net->backend = r;
-    r = vhost_dev_init(&net->dev, devfd);
+    r = vhost_dev_init(&net->dev, devfd, force);
     if (r < 0) {
         goto fail;
     }
@@ -188,7 +189,8 @@ void vhost_net_cleanup(struct vhost_net *net)
     qemu_free(net);
 }
 #else
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
 {
     return NULL;
 }

diff --git a/hw/vhost_net.h b/hw/vhost_net.h
index 6c18ff7..8435094 100644
--- a/hw/vhost_net.h
+++ b/hw/vhost_net.h
@@ -6,7 +6,7 @@ struct vhost_net;
 typedef struct vhost_net VHostNetState;
-VHostNetState *vhost_net_init(VLANClientState *backend, int devfd);
+VHostNetState *vhost_net_init(VLANClientState *backend, int devfd, bool force);
Re: [PATCHv2] vhost: force vhost off for non-MSI guests
On 01/31/2011 03:19 PM, Michael S. Tsirkin wrote:

When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread.

We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations.

Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

I can't say I like the set_guest_notifiers interface getting overloaded like this, but it looks pretty reasonable.

Regards,

Anthony Liguori

---
Untested, I'll send a pull request later after some testing assuming we've resolved the command line comments to everyone's satisfaction.

Changes since v1:
    added vhostforce to enable vhost for non-MSI guests

 hw/vhost.c      |   16 +++-
 hw/vhost.h      |    3 ++-
 hw/vhost_net.c  |    8 +---
 hw/vhost_net.h  |    2 +-
 hw/virtio-net.c |    6 --
 hw/virtio-pci.c |    6 +-
 hw/virtio.h     |    4 +++-
 net/tap.c       |    6 --
 qemu-options.hx |    4 +++-
 9 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 6082da2..72df75f 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -581,7 +581,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
                               0, virtio_queue_get_desc_size(vdev, idx));
 }

-int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force)
 {
     uint64_t features;
     int r;
@@ -613,6 +613,7 @@ int vhost_dev_init(struct vhost_dev *hdev, int devfd)
     hdev->log_enabled = false;
     hdev->started = false;
     cpu_register_phys_memory_client(&hdev->client);
+    hdev->force = force;
     return 0;
 fail:
     r = -errno;
@@ -636,9 +637,12 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
         goto fail;
     }

-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true,
+                                           hdev->force);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        if (r != -EVIRTIO_DISABLED) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        }
         goto fail_notifiers;
     }

@@ -686,7 +690,8 @@ fail_vq:
     }
 fail_mem:
 fail_features:
-    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                       hdev->force);
 fail_notifiers:
 fail:
     return r;
@@ -704,7 +709,8 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
     }
     vhost_client_sync_dirty_bitmap(&hdev->client, 0,
                                    (target_phys_addr_t)~0x0ull);
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                           hdev->force);
     if (r < 0) {
         fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", r);
         fflush(stderr);
diff --git a/hw/vhost.h b/hw/vhost.h
index 86dd834..d5d4d86 100644
--- a/hw/vhost.h
+++ b/hw/vhost.h
@@ -38,9 +38,10 @@ struct vhost_dev {
     bool log_enabled;
     vhost_log_chunk_t *log;
     unsigned long long log_size;
+    bool force;
 };

-int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force);
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
diff --git a/hw/vhost_net.c b/hw/vhost_net.c
index c068be1..4f57271 100644
--- a/hw/vhost_net.c
+++ b/hw/vhost_net.c
@@ -81,7 +81,8 @@ static int vhost_net_get_fd(VLANClientState *backend)
     }
 }

-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
 {
     int r;
     struct vhost_net *net = qemu_malloc(sizeof *net);
@@ -98,7 +99,7 @@ struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
         (1 << VHOST_NET_F_VIRTIO_NET_HDR);
     net->backend = r;

-    r = vhost_dev_init(&net->dev, devfd);
+    r = vhost_dev_init(&net->dev, devfd, force);
     if (r < 0) {
         goto fail;
     }
@@ -188,7 +189,8 @@ void vhost_net_cleanup(struct vhost_net *net)
     qemu_free(net);
 }
 #else
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+                                 bool force)
 {
     return NULL;
 }
diff --git a/hw/vhost_net.h b/hw/vhost_net.h
index 6c18ff7..8435094 100644
---
Re: KVM call agenda for Feb 1
On 01/31/2011 12:10 PM, Jan Kiszka wrote:
On 2011-01-31 11:02, Juan Quintela wrote:
Please send in any agenda items you are interested in covering.

o KVM upstream merge: status, plans, coordination

o QMP support status for 0.14. Luiz and I already chatted about it today, but it would be good to discuss in the call just to see if anyone has opinions. Basically, declare it fully supported with a few minor caveats (like human-monitor-passthrough being no more supported than the actual monitor, and recommendations about how to deal with devices in the device tree).

Reminder: tomorrow is the 0.14 stable fork.

Regards,

Anthony Liguori

Jan

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] vhost: force vhost off for non-MSI guests
On Mon, 2011-01-31 at 23:19 +0200, Michael S. Tsirkin wrote:

When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread.

We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations.

Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

---
Untested, I'll send a pull request later after some testing assuming we've resolved the command line comments to everyone's satisfaction.

Changes since v1:
    added vhostforce to enable vhost for non-MSI guests

I'm still not thrilled with the errno hack. If this is really a generic problem, we should better architect the set_guest_notifiers() callback; the force option feels like a band-aid. I'm still curious whether we can't be more dynamic about how we set up the set_guest_notifiers() function pointer so it only gets set if (force || msix_enabled()), and we can limit some of the impact here to vhost.
Thanks,

Alex
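To make the trade-off in this thread concrete, here is a standalone toy sketch (hypothetical toy_* names, not QEMU code) of the two shapes being debated: the patch's approach of always installing the callback and passing a force flag down, versus Alex's suggestion of wiring the callback up only when it could succeed, which Michael later argues gets messy because MSI-X state changes at runtime.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the discussion above (hypothetical names, not QEMU code). */
struct toy_binding {
    bool msix_enabled;
    int (*set_guest_notifiers)(struct toy_binding *b, bool assign, bool force);
};

enum { TOY_EVIRTIO_DISABLED = 1 };

/* Variant 1 (what the patch does): always install the callback and pass
 * "force" down; the callee reports a distinguishable soft failure. */
static int toy_set_guest_notifiers(struct toy_binding *b, bool assign, bool force)
{
    /* Without MSI-X, guest notifiers cost more than they save unless forced. */
    if (assign && !b->msix_enabled && !force)
        return -TOY_EVIRTIO_DISABLED;
    return 0; /* pretend the notifier setup succeeded */
}

/* Variant 2 (Alex's suggestion): only wire the callback up when it can
 * succeed, i.e. if (force || msix_enabled).  The catch Michael points out
 * is that msix_enabled changes at runtime, so this pointer would need to
 * be rewired on every guest reconfiguration. */
static void toy_wire(struct toy_binding *b, bool force)
{
    b->set_guest_notifiers =
        (force || b->msix_enabled) ? toy_set_guest_notifiers : NULL;
}
```

The toy only captures the control flow, not any of the real ioctl or irqfd work, but it shows why the flag has to travel from the net backend (which knows force) to virtio-pci (which knows the MSI-X state).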
[PATCH -v8 3/7] sched: use a buddy to implement yield_task_fair
Use the buddy mechanism to implement yield_task_fair. This allows us to skip onto the next highest priority se at every level in the CFS tree, unless doing so would introduce gross unfairness in CPU time distribution.

We order the buddy selection in pick_next_entity to check yield first, then last, then next. We need next to be able to override yield, because it is possible for the next and yield task to be different processes in the same sub-tree of the CFS tree. When they are, we need to go into that sub-tree regardless of the yield hint, and pick the correct entity once we get to the right level.

Signed-off-by: Rik van Riel r...@redhat.com

diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..7ff53e2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -327,7 +327,7 @@ struct cfs_rq {
 	 * 'curr' points to currently running entity on this cfs_rq.
 	 * It is set to NULL otherwise (i.e when none are currently running).
 	 */
-	struct sched_entity *curr, *next, *last;
+	struct sched_entity *curr, *next, *last, *skip;

 	unsigned int nr_spread_over;

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ad946fd..a56d410 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -384,6 +384,22 @@ static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
 	return rb_entry(left, struct sched_entity, run_node);
 }

+static struct sched_entity *__pick_second_entity(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *left = cfs_rq->rb_leftmost;
+	struct rb_node *second;
+
+	if (!left)
+		return NULL;
+
+	second = rb_next(left);
+
+	if (!second)
+		second = left;
+
+	return rb_entry(second, struct sched_entity, run_node);
+}
+
 static struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {
 	struct rb_node *last = rb_last(&cfs_rq->tasks_timeline);
@@ -806,6 +822,17 @@ static void __clear_buddies_next(struct sched_entity *se)
 	}
 }

+static void __clear_buddies_skip(struct sched_entity *se)
+{
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		if (cfs_rq->skip == se)
+			cfs_rq->skip = NULL;
+		else
+			break;
+	}
+}
+
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (cfs_rq->last == se)
@@ -813,6 +840,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)

 	if (cfs_rq->next == se)
 		__clear_buddies_next(se);
+
+	if (cfs_rq->skip == se)
+		__clear_buddies_skip(se);
 }

 static void
@@ -926,13 +956,27 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static int
 wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);

+/*
+ * Pick the next process, keeping these things in mind, in this order:
+ * 1) keep things fair between processes/task groups
+ * 2) pick the next process, since someone really wants that to run
+ * 3) pick the last process, for cache locality
+ * 4) do not run the skip process, if something else is available
+ */
 static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se = __pick_next_entity(cfs_rq);
 	struct sched_entity *left = se;

-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
-		se = cfs_rq->next;
+	/*
+	 * Avoid running the skip buddy, if running something else can
+	 * be done without getting too unfair.
+	 */
+	if (cfs_rq->skip == se) {
+		struct sched_entity *second = __pick_second_entity(cfs_rq);
+		if (wakeup_preempt_entity(second, left) < 1)
+			se = second;
+	}

 	/*
 	 * Prefer last buddy, try to return the CPU to a preempted task.
@@ -940,6 +984,12 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
 		se = cfs_rq->last;

+	/*
+	 * Someone really wants this to run. If it's not unfair, run it.
+	 */
+	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+		se = cfs_rq->next;
+
 	clear_buddies(cfs_rq, se);

 	return se;
@@ -1096,52 +1146,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 }

-/*
- * sched_yield() support is very simple - we dequeue and enqueue.
- *
- * If compat_yield is turned on then we requeue to the end of the tree.
- */
-static void yield_task_fair(struct rq *rq)
-{
-	struct task_struct *curr = rq->curr;
-	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
-	struct sched_entity *rightmost, *se = &curr->se;
-
-	/*
-	 * Are we the only task in the tree?
-	 */
-	if (unlikely(rq->nr_running == 1))
-		return;
-
[PATCH -v8 2/7] sched: limit the scope of clear_buddies
The clear_buddies function does not seem to play well with the concept of hierarchical runqueues. In the following tree, task groups are represented by 'G', tasks by 'T', next by 'n' and last by 'l'.

        (nl)
        /  \
    G(nl)   G
    /  \     \
  T(l) T(n)   T

This situation can arise when a task is woken up T(n), and the previously running task T(l) is marked last.

When clear_buddies is called from either T(l) or T(n), the next and last buddies of the group G(nl) will be cleared. This is not the desired result, since we would like to be able to find the other type of buddy in many cases. This is especially a worry when implementing yield_task_fair through the buddy system.

The fix is simple: only clear the buddy type that the task itself is indicated to be. As an added bonus, we stop walking up the tree when the buddy has already been cleared or pointed elsewhere.

Signed-off-by: Rik van Riel r...@redhat.com
---
 kernel/sched_fair.c |   30 ++++++++++++++++++++++++-------
 1 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f4ee445..0321473 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -784,19 +784,35 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		__enqueue_entity(cfs_rq, se);
 }

-static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void __clear_buddies_last(struct sched_entity *se)
 {
-	if (!se || cfs_rq->last == se)
-		cfs_rq->last = NULL;
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		if (cfs_rq->last == se)
+			cfs_rq->last = NULL;
+		else
+			break;
+	}
+}

-	if (!se || cfs_rq->next == se)
-		cfs_rq->next = NULL;
+static void __clear_buddies_next(struct sched_entity *se)
+{
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		if (cfs_rq->next == se)
+			cfs_rq->next = NULL;
+		else
+			break;
+	}
 }

 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	for_each_sched_entity(se)
-		__clear_buddies(cfs_rq_of(se), se);
+	if (cfs_rq->last == se)
+		__clear_buddies_last(se);
+
+	if (cfs_rq->next == se)
+		__clear_buddies_next(se);
 }

 static void
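The behaviour of the fix can be modelled outside the kernel. The sketch below (toy types and names, not kernel code) mirrors __clear_buddies_last: walk up the hierarchy clearing only the one buddy type, stop at the first level where this entity is not that buddy, and leave the other buddy type untouched. __clear_buddies_next would be symmetric.

```c
#include <stddef.h>

/* Toy model of the hierarchy fix above (toy types, not kernel code).
 * Each entity is queued on a runqueue; group entities link levels. */
struct toy_rq;

struct toy_entity {
    struct toy_rq *rq;          /* runqueue this entity is queued on */
    struct toy_entity *parent;  /* group entity one level up, NULL at top */
};

struct toy_rq {
    struct toy_entity *next;    /* "next" buddy on this runqueue */
    struct toy_entity *last;    /* "last" buddy on this runqueue */
};

/* Mirror of __clear_buddies_last: clear only the "last" pointers that
 * reference this path, stopping where a level points elsewhere. */
static void toy_clear_last(struct toy_entity *se)
{
    for (; se; se = se->parent) {
        if (se->rq->last == se)
            se->rq->last = NULL;
        else
            break;
    }
}
```

Building the tree from the commit message (T(l) and T(n) inside group G) and clearing "last" for T(l) removes the last marks up through G, while the "next" marks for T(n) survive, which is exactly what the old combined __clear_buddies destroyed.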
[PATCH -v8 4/7] sched: Add yield_to(task, preempt) functionality.
From: Mike Galbraith efa...@gmx.de

Currently only implemented for fair class tasks.

Add a yield_to_task() method to the fair scheduling class, allowing the caller of yield_to() to accelerate another thread in its thread group / task group.

Implemented via a scheduler hint, using cfs_rq->next to encourage the target being selected. We can rely on pick_next_entity to keep things fair, so no one can accelerate a thread that has already used its fair share of CPU time.

This also means callers should only call yield_to when they really mean it. Calling it too often can result in the scheduler just ignoring the hint.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Mike Galbraith efa...@gmx.de

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c79e92..6c43fc4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1047,6 +1047,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task) (struct rq *rq);
+	bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);

 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

@@ -1943,6 +1944,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
 # define rt_mutex_adjust_pi(p)		do { } while (0)
 #endif

+extern bool yield_to(struct task_struct *p, bool preempt);
 extern void set_user_nice(struct task_struct *p, long nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index 7ff53e2..5e70156 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5270,6 +5270,56 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);

+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ *
+ * Returns true if we indeed boosted the target task.
+ */
+bool __sched yield_to(struct task_struct *p, bool preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	bool yielded = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (!curr->sched_class->yield_to_task)
+		goto out;
+
+	if (curr->sched_class != p->sched_class)
+		goto out;
+
+	if (task_running(p_rq, p) || p->state)
+		goto out;
+
+	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
+
+out:
+	double_rq_unlock(rq, p_rq);
+	local_irq_restore(flags);
+
+	if (yielded)
+		yield();
+
+	return yielded;
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a56d410..be729d7 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1800,6 +1800,23 @@ static void yield_task_fair(struct rq *rq)
 	set_skip_buddy(se);
 }

+static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preempt)
+{
+	struct sched_entity *se = &p->se;
+
+	if (!se->on_rq)
+		return false;
+
+	/* Tell the scheduler that we'd really like pse to run next. */
+	set_next_buddy(se);
+
+	/* Make p's CPU reschedule; pick_next_entity takes care of fairness. */
+	if (preempt)
+		resched_task(rq->curr);
+
+	return true;
+}
+
 #ifdef CONFIG_SMP
 /**
  * Fair scheduling class load-balancing methods:
@@ -3993,6 +4010,7 @@ static const struct sched_class fair_sched_class = {
 	.enqueue_task		= enqueue_task_fair,
 	.dequeue_task		= dequeue_task_fair,
 	.yield_task		= yield_task_fair,
+	.yield_to_task		= yield_to_task_fair,

 	.check_preempt_curr	= check_preempt_wakeup,
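The again:/recheck loop in yield_to() is a common kernel pattern: a value read without the protecting lock must be re-validated once the lock is held. A standalone toy (hypothetical toy_* names, simulated migration instead of real runqueue locks) illustrates why the recheck is needed and why the loop terminates once migrations stop.

```c
/* Toy model of the again:/recheck dance in yield_to() above (toy names,
 * no real locks): the target's runqueue is sampled without holding its
 * lock, so after "locking" we must recheck and retry if the task
 * migrated in the meantime. */
struct toy_task { int rq; };

static int toy_migrations_pending; /* simulated concurrent migrations */

/* Reading the task's rq may race with migration; model that by moving
 * the task on each read while migrations are pending. */
static int toy_task_rq(struct toy_task *p)
{
    if (toy_migrations_pending > 0) {
        toy_migrations_pending--;
        p->rq++;
    }
    return p->rq;
}

/* Returns how many lock attempts were needed before the snapshot held. */
static int toy_lock_task_rq(struct toy_task *p)
{
    int tries = 0;
    int locked;
again:
    tries++;
    locked = p->rq;              /* p_rq = task_rq(p), before locking */
    /* double_rq_lock(rq, p_rq) would go here; the task may migrate
     * before the lock is actually taken: */
    if (toy_task_rq(p) != locked)
        goto again;              /* locked the wrong rq: drop it, retry */
    return tries;
}
```

Each retry re-reads the runqueue under the assumption that migrations are finite in practice, which is the same informal termination argument that applies to the kernel loop.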
[PATCH -v8 0/7] directed yield for Pause Loop Exiting
When running SMP virtual machines, it is possible for one VCPU to be spinning on a spinlock, while the VCPU that holds the spinlock is not currently running, because the host scheduler preempted it to run something else.

Both Intel and AMD CPUs have a feature that detects when a virtual CPU is spinning on a lock and will trap to the host.

The current KVM code sleeps for a bit whenever that happens, which results in e.g. a 64 VCPU Windows guest taking forever and a bit to boot up. This is because the VCPU holding the lock is actually running and not sleeping, so the pause is counter-productive.

In other workloads a pause can also be counter-productive, with spinlock detection resulting in one guest giving up its CPU time to the others. Instead of spinning, it ends up simply not running much at all.

This patch series aims to fix that, by having a VCPU that spins give the remainder of its timeslice to another VCPU in the same guest before yielding the CPU - one that is runnable but got preempted, hopefully the lock holder.

v8:
- some more changes and cleanups suggested by Peter

v7:
- move the vcpu to pid mapping to inside the vcpu->mutex
- rename ->yield to ->skip
- merge patch 5 into patch 4

v6:
- implement yield_task_fair in a way that works with task groups, this allows me to actually get a performance improvement!
- fix another race Avi pointed out, the code should be good now

v5:
- fix the race condition Avi pointed out, by tracking vcpu->pid
- also allows us to yield to vcpu tasks that got preempted while in qemu userspace

v4:
- change to newer version of Mike Galbraith's yield_to implementation
- chainsaw out some code from Mike that looked like a great idea, but turned out to give weird interactions in practice

v3:
- more cleanups
- change to Mike Galbraith's yield_to implementation
- yield to spinning VCPUs, this seems to work better in some situations and has little downside potential

v2:
- make lots of cleanups and improvements suggested
- do not implement timeslice scheduling or fairness stuff yet, since it is not entirely clear how to do that right (suggestions welcome)

Benchmark results:

Two 4-CPU KVM guests are pinned to the same 4 physical CPUs. One guest runs the AMQP performance test, the other guest runs 0, 2 or 4 infinite loops, for CPU overcommit factors of 1, 1.5 and 2. The AMQP perftest is run 30 times, with message payloads of 8 and 16 bytes.

size 8    no overcommit  1.5x overcommit  2x overcommit
no PLE    223801         135137           104951
PLE       224135         141105           118744

size 16   no overcommit  1.5x overcommit  2x overcommit
no PLE    222424         126175           105299
PLE       222534         138082           132945

Note: this is with the KVM guests NOT running inside cgroups. There seems to be a CPU load balancing issue with cgroup fair group scheduling, which often results in one guest getting only 80% CPU time and the other guest 320%. That will have to be fixed to get meaningful results with cgroups.

CPU time division between the AMQP guest and the infinite loop guest was not exactly fair, but the guests got close to the same amount of CPU time in each test run. There is a substantial amount of randomness in CPU time division between guests, but the performance improvement is consistent between multiple runs.

--
All rights reversed.
[PATCH -v8 7/7] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic slowdowns of certain workloads, we instead use yield_to to get another VCPU in the same KVM guest to run sooner.

This seems to give a 10-15% speedup in certain workloads.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d56ed5..fab2250 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -187,6 +187,7 @@ struct kvm {
 #endif
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 	atomic_t online_vcpus;
+	int last_boosted_vcpu;
 	struct list_head vm_list;
 	struct mutex lock;
 	struct kvm_io_bus *buses[KVM_NR_BUSES];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 86c4905..8b761ba 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1292,18 +1292,55 @@ void kvm_resched(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_resched);

-void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
+void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
-	ktime_t expires;
-	DEFINE_WAIT(wait);
-
-	prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
-
-	/* Sleep for 100 us, and hope lock-holder got scheduled */
-	expires = ktime_add_ns(ktime_get(), 100000UL);
-	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+	struct kvm *kvm = me->kvm;
+	struct kvm_vcpu *vcpu;
+	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
+	int yielded = 0;
+	int pass;
+	int i;

-	finish_wait(&vcpu->wq, &wait);
+	/*
+	 * We boost the priority of a VCPU that is runnable but not
+	 * currently running, because it got preempted by something
+	 * else and called schedule in __vcpu_run.  Hopefully that
+	 * VCPU is holding the lock that we need and will release it.
+	 * We approximate round-robin by starting at the last boosted VCPU.
+	 */
+	for (pass = 0; pass < 2 && !yielded; pass++) {
+		kvm_for_each_vcpu(i, vcpu, kvm) {
+			struct task_struct *task = NULL;
+			struct pid *pid;
+			if (!pass && i < last_boosted_vcpu) {
+				i = last_boosted_vcpu;
+				continue;
+			} else if (pass && i > last_boosted_vcpu)
+				break;
+			if (vcpu == me)
+				continue;
+			if (waitqueue_active(&vcpu->wq))
+				continue;
+			rcu_read_lock();
+			pid = rcu_dereference(vcpu->pid);
+			if (pid)
+				task = get_pid_task(vcpu->pid, PIDTYPE_PID);
+			rcu_read_unlock();
+			if (!task)
+				continue;
+			if (task->flags & PF_VCPU) {
+				put_task_struct(task);
+				continue;
+			}
+			if (yield_to(task, 1)) {
+				put_task_struct(task);
+				kvm->last_boosted_vcpu = i;
+				yielded = 1;
+				break;
+			}
+			put_task_struct(task);
+		}
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
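The two-pass round-robin scan above can be exercised as a userspace toy. The sketch below (toy types and names, no tasks or RCU) keeps only the candidate-selection logic: start after the last boosted VCPU, wrap around on the second pass, and skip the spinning VCPU itself, halted VCPUs, and VCPUs already running in guest mode.

```c
#include <stdbool.h>

/* Userspace toy of the two-pass round-robin scan in kvm_vcpu_on_spin
 * above (toy types, no tasks/RCU).  Returns the index of the vcpu that
 * would be boosted, or -1 if no candidate is runnable. */
#define TOY_NVCPUS 4

struct toy_vcpu {
    bool running;   /* PF_VCPU equivalent: currently in guest mode */
    bool sleeping;  /* waitqueue_active equivalent: halted, skip it */
};

static int toy_pick_boost(struct toy_vcpu vcpu[TOY_NVCPUS],
                          int me, int last_boosted)
{
    int pass, i;

    for (pass = 0; pass < 2; pass++) {
        for (i = 0; i < TOY_NVCPUS; i++) {
            if (!pass && i < last_boosted) {
                i = last_boosted;  /* first pass: start after last boosted */
                continue;
            } else if (pass && i > last_boosted) {
                break;             /* second pass: wrapped around, stop */
            }
            if (i == me || vcpu[i].sleeping || vcpu[i].running)
                continue;
            return i;              /* runnable but preempted: boost it */
        }
    }
    return -1;
}
```

The real function additionally resolves each candidate's task through its tracked pid (patch 6/7) and hands the timeslice over with yield_to (patch 4/7); the toy shows only why the scan approximates round-robin fairness across boosts.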
[PATCH -v8 6/7] kvm: keep track of which task is running a KVM vcpu
Keep track of which task is running a KVM vcpu. This helps us figure out later what task to wake up if we want to boost a vcpu that got preempted.

Unfortunately there are no guarantees that the same task always keeps the same vcpu, so we can only track the task across a single run of the vcpu.

Signed-off-by: Rik van Riel r...@redhat.com

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a055742..9d56ed5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -81,6 +81,7 @@ struct kvm_vcpu {
 #endif
 	int vcpu_id;
 	struct mutex mutex;
+	struct pid *pid;
 	int cpu;
 	atomic_t guest_mode;
 	struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5225052..0fa9a48 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -117,6 +117,14 @@ void vcpu_load(struct kvm_vcpu *vcpu)
 	int cpu;

 	mutex_lock(&vcpu->mutex);
+	if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
+		/* The thread running this VCPU changed. */
+		struct pid *oldpid = vcpu->pid;
+		struct pid *newpid = get_task_pid(current, PIDTYPE_PID);
+		rcu_assign_pointer(vcpu->pid, newpid);
+		synchronize_rcu();
+		put_pid(oldpid);
+	}
 	cpu = get_cpu();
 	preempt_notifier_register(&vcpu->preempt_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
@@ -185,6 +193,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->cpu = -1;
 	vcpu->kvm = kvm;
 	vcpu->vcpu_id = id;
+	vcpu->pid = NULL;
 	init_waitqueue_head(&vcpu->wq);

 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -208,6 +217,7 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_init);

 void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
+	put_pid(vcpu->pid);
 	kvm_arch_vcpu_uninit(vcpu);
 	free_page((unsigned long)vcpu->run);
 }
[PATCH -v8 1/7] sched: check the right -nr_running in yield_task_fair
With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq. Yielding to a task from another cfs_rq may be worthwhile, since a process calling yield typically cannot use the CPU right now.

Therefore, we want to check the per-cpu nr_running, not the cgroup local one.

Signed-off-by: Rik van Riel r...@redhat.com
---
 kernel/sched_fair.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c62ebae..7b338ac 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1304,7 +1304,7 @@ static void yield_task_fair(struct rq *rq)
 	/*
 	 * Are we the only task in the tree?
 	 */
-	if (unlikely(cfs_rq->nr_running == 1))
+	if (unlikely(rq->nr_running == 1))
 		return;

 	clear_buddies(cfs_rq, se);
[PATCH -v8 5/7] export pid symbols needed for kvm_vcpu_on_spin
Export the symbols required for a race-free kvm_vcpu_on_spin.

Signed-off-by: Rik van Riel r...@redhat.com

diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..adc8f47 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -191,6 +191,7 @@ void __put_task_struct(struct task_struct *tsk)
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
+EXPORT_SYMBOL_GPL(__put_task_struct);

 /*
  * macro override instead of weak attribute alias, to workaround
diff --git a/kernel/pid.c b/kernel/pid.c
index 39b65b6..02f2212 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -435,6 +435,7 @@ struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
 	rcu_read_unlock();
 	return pid;
 }
+EXPORT_SYMBOL_GPL(get_task_pid);

 struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
 {
@@ -446,6 +447,7 @@ struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
 	rcu_read_unlock();
 	return result;
 }
+EXPORT_SYMBOL_GPL(get_pid_task);

 struct pid *find_get_pid(pid_t nr)
 {
Re: [PATCHv2] vhost: force vhost off for non-MSI guests
On Mon, Jan 31, 2011 at 02:47:34PM -0700, Alex Williamson wrote:

On Mon, 2011-01-31 at 23:19 +0200, Michael S. Tsirkin wrote:

When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread.

We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations.

Added a vhostforce flag to force vhost-net back on.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

---
Untested, I'll send a pull request later after some testing assuming we've resolved the command line comments to everyone's satisfaction.

Changes since v1:
    added vhostforce to enable vhost for non-MSI guests

I'm still not thrilled with the errno hack. If this is really a generic problem, we should better architect the set_guest_notifiers() callback; the force option feels like a band-aid. I'm still curious whether we can't be more dynamic about how we set up the set_guest_notifiers() function pointer so it only gets set if (force || msix_enabled()), and we can limit some of the impact here to vhost.

Thanks,

Alex

I'm not entirely happy with the API either, but I'm sure playing with the function pointer is not the solution we are looking for: msix is enabled/disabled dynamically, while force is static, and it is something the net backend knows about; virtio-pci doesn't. So I suspect modifying the callback on the fly will become messy quite quickly.

Well, this API will be easy to tweak after the fact; there's a single user after all.
 hw/vhost.c      | 16 +++-
 hw/vhost.h      |  3 ++-
 hw/vhost_net.c  |  8 +---
 hw/vhost_net.h  |  2 +-
 hw/virtio-net.c |  6 --
 hw/virtio-pci.c |  6 +-
 hw/virtio.h     |  4 +++-
 net/tap.c       |  6 --
 qemu-options.hx |  4 +++-
 9 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 6082da2..72df75f 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -581,7 +581,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
                               0, virtio_queue_get_desc_size(vdev, idx));
 }
 
-int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force)
 {
     uint64_t features;
     int r;
@@ -613,6 +613,7 @@ int vhost_dev_init(struct vhost_dev *hdev, int devfd)
     hdev->log_enabled = false;
     hdev->started = false;
     cpu_register_phys_memory_client(&hdev->client);
+    hdev->force = force;
     return 0;
 fail:
     r = -errno;
@@ -636,9 +637,12 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
         goto fail;
     }
 
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true,
+                                           hdev->force);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        if (r != -EVIRTIO_DISABLED) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        }
         goto fail_notifiers;
     }
@@ -686,7 +690,8 @@ fail_vq:
     }
 fail_mem:
 fail_features:
-    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                       hdev->force);
 fail_notifiers:
 fail:
     return r;
@@ -704,7 +709,8 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
     }
     vhost_client_sync_dirty_bitmap(&hdev->client, 0,
                                    (target_phys_addr_t)~0x0ull);
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false,
+                                           hdev->force);
     if (r < 0) {
         fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", r);
         fflush(stderr);
diff --git a/hw/vhost.h b/hw/vhost.h
index 86dd834..d5d4d86 100644
--- a/hw/vhost.h
+++ b/hw/vhost.h
@@ -38,9 +38,10 @@ struct vhost_dev {
     bool log_enabled;
     vhost_log_chunk_t *log;
     unsigned long long log_size;
+    bool force;
 };
 
-int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, bool force);
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
diff --git a/hw/vhost_net.c b/hw/vhost_net.c
index c068be1..4f57271 100644
--- a/hw/vhost_net.c
+++ b/hw/vhost_net.c
@@ -81,7 +81,8 @@ static int vhost_net_get_fd(VLANClientState *backend)
     }
 }
 
-struct vhost_net
Re: [PATCHv2] vhost: force vhost off for non-MSI guests
On Tue, 2011-02-01 at 00:02 +0200, Michael S. Tsirkin wrote:
On Mon, Jan 31, 2011 at 02:47:34PM -0700, Alex Williamson wrote:

[v1 patch description and the band-aid discussion snipped; quoted in full in the previous message]

I'm not entirely happy with the API either, but I'm sure playing with the function pointer is not the solution we are looking for: msix is enabled/disabled dynamically. force is static, but it is something the net backend knows about; virtio-pci doesn't. So I suspect modifying the callback on the fly will become messy quite quickly.

Isn't there a callback to the device when msix is enabled/disabled?

Well, this API will be easy to tweak after the fact, there's a single user after all.

Gosh, I wish I got to take that attitude when I submit code...
;)

Alex
Re: [PATCHv2] vhost: force vhost off for non-MSI guests
On Mon, Jan 31, 2011 at 03:07:49PM -0700, Alex Williamson wrote:
On Tue, 2011-02-01 at 00:02 +0200, Michael S. Tsirkin wrote:

[earlier discussion of the vhostforce flag snipped; quoted in full in the previous messages]

I'm not entirely happy with the API either, but I'm sure playing with the function pointer is not the solution we are looking for: msix is enabled/disabled dynamically. force is static, but it is something the net backend knows about; virtio-pci doesn't. So I suspect modifying the callback on the fly will become messy quite quickly.

Isn't there a callback to the device when msix is enabled/disabled?

Not at the moment. In qemu we have mask/unmask.

Well, this API will be easy to tweak after the fact, there's a single user after all.

Gosh, I wish I got to take that attitude when I submit code...
;)

Alex

You are probably dealing with more important issues :) If you have an internal API used in a single place, you get to change it at will. If you tweak an API used all over the codebase, you don't :)
Re: Network performance with small packets
Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again.

Maybe the following will help it stabilize? By itself it does nothing, but if you set all the parameters to a huge value we will only interrupt when we see an empty ring. Which might be too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value, or packets some small value, etc.

Warning: completely untested.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others.
  */
 #define VHOST_NET_WEIGHT 0x8
 
+int tx_bytes_coalesce = 0;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 0;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 0;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;
 
 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n",
 				 err, len);
-		vhost_add_used_and_signal(net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;

Should this instead be:

		bufs_coalesced += out;

Perusing the code I see that earlier there is a check to see if "in" is not zero, and, if so, error out of the loop. After the check, "in" is not touched until it is added to bufs_coalesced, effectively not changing bufs_coalesced, meaning bufs_coalesced will never trigger the conditions below. Or am I missing something?

+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced > tx_packets_coalesce ||
+		   bytes_coalesced > tx_bytes_coalesce ||
+		   bufs_coalesced > tx_bufs_coalesce))
+		vhost_signal(net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }

Steve D.
Re: Network performance with small packets
On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote:
Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

[Michael's coalescing patch snipped; quoted in full in the previous message]

+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;

Should this instead be:

		bufs_coalesced += out;

Perusing the code I see that earlier there is a check to see if "in" is not zero, and, if so, error out of the loop. After the check, "in" is not touched until it is added to bufs_coalesced, effectively not changing bufs_coalesced, meaning bufs_coalesced will never trigger the conditions below.

Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.

I tried a simpler version of this patch without any tunables by delaying the signaling until we come out of the for loop. It definitely reduced the number of vmexits significantly for small message guest to host stream test and the throughput went up a little.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b3ca10..5f9fae9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n",
 				 err, len);
-		vhost_add_used_and_signal(net->dev, vq, head, 0);
+		vhost_add_used(vq, head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
@@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
 
+	if (total_len > 0)
+		vhost_signal(net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }

Or am I missing something?

+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
...
+	if (likely(packets_coalesced > tx_packets_coalesce ||
+		   bytes_coalesced > tx_bytes_coalesce ||
+		   bufs_coalesced > tx_bufs_coalesce))
+		vhost_signal(net->dev, vq);

It is possible that we can miss signaling the guest even after processing a few pkts, if we don't hit any of these conditions.

Steve D.
Re: [PATCH 2/3] KVM: Emulate MSI-X table in kernel
On Mon, Jan 31, 2011 at 03:09:09PM +0200, Avi Kivity wrote:
On 01/30/2011 06:38 AM, Sheng Yang wrote:

(Sorry, missed this mail...)

On Mon, Jan 17, 2011 at 02:29:44PM +0200, Avi Kivity wrote:
On 01/06/2011 12:19 PM, Sheng Yang wrote:

Then we can support mask bit operation of assigned devices now.

+int kvm_assigned_device_update_msix_mask_bit(struct kvm *kvm,
+				int assigned_dev_id, int entry, bool mask)
+{
+	int r = -EFAULT;
+	struct kvm_assigned_dev_kernel *adev;
+	int i;
+
+	if (!irqchip_in_kernel(kvm))
+		return r;
+
+	mutex_lock(&kvm->lock);
+	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+				     assigned_dev_id);
+	if (!adev)
+		goto out;
+
+	for (i = 0; i < adev->entries_nr; i++)
+		if (adev->host_msix_entries[i].entry == entry) {
+			if (mask)
+				disable_irq_nosync(
+					adev->host_msix_entries[i].vector);

Is it okay to call disable_irq_nosync() here? IIRC we don't check the mask bit on irq delivery, so we may forward an interrupt to the guest after the mask bit was set. What does PCI say about the mask bit? When does it take effect?

Another question is whether disable_irq_nosync() actually programs the device mask bit, or not. If it does, then it's slow, and it may be better to leave interrupts enabled but have an internal pending bit. If it doesn't program the mask bit, it's fine.

I think Michael and Jan had explained this.

+			else
+				enable_irq(adev->host_msix_entries[i].vector);
+			r = 0;
+			break;
+		}
+out:
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
+static int msix_table_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
+				void *val)
+{
+	struct kvm_msix_mmio_dev *mmio_dev =
+		container_of(this, struct kvm_msix_mmio_dev, table_dev);
+	struct kvm_msix_mmio *mmio;
+	int idx, ret = 0, entry, offset, r;
+
+	mutex_lock(&mmio_dev->lock);
+	idx = get_mmio_table_index(mmio_dev, addr, len);
+	if (idx < 0) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+	if ((addr & 0x3) || (len != 4 && len != 8))
+		goto out;
+
+	offset = addr & 0xf;
+	if (offset == PCI_MSIX_ENTRY_VECTOR_CTRL && len == 8)
+		goto out;
+
+	mmio = &mmio_dev->mmio[idx];
+	entry = (addr - mmio->table_base_addr) / PCI_MSIX_ENTRY_SIZE;
+	r = copy_from_user(val, (void __user *)(mmio->table_base_va +
+			entry * PCI_MSIX_ENTRY_SIZE + offset), len);
+	if (r)
+		goto out;

and return ret == 0?

Yes. This operation should be handled by the in-kernel MSI-X MMIO, so we return 0 in order to omit this action. We can add a warning for it later.

But it failed. We need to return -EFAULT, so it would return to QEmu.

OK, let QEmu print a warning about it.

--
regards
Yang, Sheng

The same as above.

+
+	if ((offset < PCI_MSIX_ENTRY_VECTOR_CTRL && len == 4) ||
+	    (offset < PCI_MSIX_ENTRY_DATA && len == 8))
+		ret = -ENOTSYNC;

goto out?

No. This judgement only checks if the MSI data/address was touched, and the line below would check if we need to operate the mask bit, because in theory the guest can use len=8 to modify MSI-X data and ctrl at the same time.

Ok, makes sense.

--
error compiling committee.c: too many arguments to function
Re: [PATCH 3/3] KVM: Add documents for MSI-X MMIO API
On Mon, Jan 31, 2011 at 03:24:27PM +0200, Avi Kivity wrote:
On 01/26/2011 11:05 AM, Sheng Yang wrote:
On Tuesday 25 January 2011 20:47:38 Avi Kivity wrote:
On 01/19/2011 10:21 AM, Sheng Yang wrote:

We already got a guest MMIO address for that in the exit information. I've created a chain of handlers in qemu to handle it.

But we already decoded the table and entry...

But the handler is still wrapped by vcpu_mmio_write(), as a part of MMIO, so it's not quite handy to get the table and entry out.

The kernel handler can create a new kvm_run exit description.

Also, the updater in userspace can share most of the logic with the ordinary userspace MMIO handler, which takes an address as parameter. So I think we don't need to pass the decoded table_id and entry to userspace.

It's mixing layers, which always leads to trouble. For one, the user handler shouldn't do anything with the write since the kernel already wrote it into the table. For another, if two vcpus write to the same entry simultaneously, you could see different ordering in the kernel and userspace, and get inconsistent results.

The shared logic is not about writing, but about interpreting what's written. The old MMIO handler would write the data, then interpret it; our new MMIO would only share the logic of interpretation. I think that's fair enough?

It doesn't make sense from an API point of view. You registered a table of entries; you expect an exit on that table to point to the table and entry that got changed.

OK, I will update this when I come back to work (about two weeks later...).

--
regards
Yang, Sheng

--
error compiling committee.c: too many arguments to function
Re: Network performance with small packets
On Mon, Jan 31, 2011 at 06:24:34PM -0600, Steve Dobbelstein wrote:
Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

[Michael's coalescing patch snipped; quoted in full in the earlier messages]

+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;

Should this instead be:

		bufs_coalesced += out;

Correct.

Perusing the code I see that earlier there is a check to see if "in" is not zero, and, if so, error out of the loop. After the check, "in" is not touched until it is added to bufs_coalesced, effectively not changing bufs_coalesced, meaning bufs_coalesced will never trigger the conditions below. Or am I missing something?

[remaining quote of the signaling conditions snipped]

Steve D.
Re: Network performance with small packets
On Mon, Jan 31, 2011 at 05:30:38PM -0800, Sridhar Samudrala wrote:
On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote:
Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

[Michael's coalescing patch and the bufs_coalesced += in / += out discussion snipped; quoted in full in the earlier messages]

Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.

I tried a simpler version of this patch without any tunables by delaying the signaling until we come out of the for loop. It definitely reduced the number of vmexits significantly for small message guest to host stream test and the throughput went up a little.

[Sridhar's delayed-signaling patch snipped; quoted in full in the earlier message]

It is possible that we can miss signaling the guest even after processing a few pkts, if we don't hit any of these conditions.

Yes. It really should be

	if (likely(packets_coalesced && bytes_coalesced && bufs_coalesced))
		vhost_signal(net->dev, vq);

Steve D.